Show Notes
Gracia Carmona O et al., Patterns, 7 (2026) 101425. doi:10.1016/j.patter.2025.101425 - IndeLLM uses a protein language model (ESM2) to score in-frame indels, and adds a compact Siamese transfer-learning model that achieves state-of-the-art pathogenicity prediction with MCC = 0.77. Key terms: IndeLLM, protein language models, in-frame indels, Siamese network, ESM2.
Study Highlights:
Using human protein sequences and ESM2 embeddings, the authors develop IndeLLM, a zero-shot scoring function that sums overlapping-region probabilities to correct length bias in in-frame indels. They train a compact Siamese one-hidden-layer network on PLM embeddings with biologically guided embedding splitting and achieve MCC = 0.77 on the test set. Per-residue probability differences mapped onto structures (FGFR1, GLMN) identify local regions affected by indels and improve interpretability. The framework reduces insertion false negatives and is released with Colab and GitHub tools for indel annotation and disease-variant analysis.
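The overlapping-region idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `overlapping_regions` and `overlap_score`, the prefix/suffix definition of "overlap", and the sign convention (mutant minus wild type) are all assumptions made for this example. The per-residue probabilities are taken as plain lists; in practice they would come from a protein language model such as ESM2.

```python
from typing import List, Tuple


def overlapping_regions(wt: str, mut: str) -> Tuple[int, int]:
    """Return (prefix_len, suffix_len) shared by wild-type and mutant.

    Assumption: the "overlapping regions" are the common prefix and
    common suffix that flank a single in-frame indel.
    """
    max_shared = min(len(wt), len(mut))
    prefix = 0
    while prefix < max_shared and wt[prefix] == mut[prefix]:
        prefix += 1
    suffix = 0
    # Cap the suffix so it never re-uses residues already in the prefix.
    while suffix < max_shared - prefix and wt[-1 - suffix] == mut[-1 - suffix]:
        suffix += 1
    return prefix, suffix


def overlap_score(
    wt: str, mut: str, wt_probs: List[float], mut_probs: List[float]
) -> float:
    """Sum per-residue probabilities over the shared regions only.

    Both sums cover the same number of residues, so the difference is
    not driven by the length change the indel introduces.
    Hypothetical sign convention: negative means the mutant context
    makes the shared residues less likely.
    """
    if len(wt) != len(wt_probs) or len(mut) != len(mut_probs):
        raise ValueError("need one probability per residue")
    prefix, suffix = overlapping_regions(wt, mut)
    wt_sum = sum(wt_probs[:prefix]) + sum(wt_probs[len(wt) - suffix:])
    mut_sum = sum(mut_probs[:prefix]) + sum(mut_probs[len(mut) - suffix:])
    return mut_sum - wt_sum
```

For example, `overlap_score("MKTAYIA", "MKTYIA", [0.9] * 7, [0.5] * 6)` compares six shared residues (three on each side of the deleted residue) and ignores the deleted position itself.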
Conclusion:
IndeLLM's zero-shot scoring (MCC = 0.65, versus 0.58 for Brandes scoring) and a small Siamese transfer-learning model (MCC = 0.77) provide interpretable indel pathogenicity prediction.
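For listeners unfamiliar with the metric, the Matthews correlation coefficient (MCC) can be computed from binary labels as sketched below. This uses the standard MCC formula; the function name `mcc` and the toy labels in the usage note are illustrative only and are not data from the article.

```python
import math
from typing import Sequence


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = pathogenic).

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Calling `mcc([1, 1, 0, 0], [1, 0, 0, 0])` scores a toy predictor that misses one of two pathogenic variants. Unlike plain accuracy, MCC uses all four confusion-matrix cells, which makes it informative when benign and pathogenic classes are imbalanced.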
Music:
Enjoy the music based on this article at the end of the episode.
Article title:
Leveraging protein language models and a scoring function for indel characterization and transfer learning
First author:
Gracia Carmona O
Journal:
Patterns, 7 (2026) 101425
DOI:
10.1016/j.patter.2025.101425
Reference:
Gracia Carmona O, Leipart V, Amdam GV, Orengo C, Fraternali F. Leveraging protein language models and a scoring function for indel characterization and transfer learning. Patterns. 7 (2026) 101425. https://doi.org/10.1016/j.patter.2025.101425
License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website: https://basebybase.com
On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.
Episode link: https://basebybase.com/episodes/indellm-indel-siamese-model
QC:
This episode was checked against the original article PDF and publication metadata for the episode release published on 2026-02-17.
QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Audited the scientific content conveyed in the transcript: indel biology, IndeLLM zero-shot scoring, Siamese Model 4, performance metrics, interpretability via structure-mapped probability changes, structural validation with AlphaFold, and broad applicability including non-human systems and accessible tooling.
- transcript topics: Indel biology (in-frame vs frameshift indels); Protein language models and length bias; Indel scoring with IndeLLM zero-shot (overlapping regions); Probability scoring math (sum vs log-sum); Siamese network (Model 4) and transfer learning; Performance metrics (MCC 0.65 zero-shot and 0.77 Siamese); Comparison to PROVEAN
QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 8
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0
Metadata Audited:
- article_doi
- article_title
- article_journal
- license
Factual Items Audited:
- IndeLLM zero-shot scoring uses overlapping regions to correct length bias
- Switch from log probability sums to sum of probabilities to reduce noise
- Model 4 Siamese network with embedding splitting achieves MCC 0.77
- Zero-shot MCC is 0.65, which outperforms Brandes scoring (0.58) and is comparable to supervised methods
- Per-residue probability differences mapped to FGFR1 and GLMN explain pathogenicity; AlphaFold used for validation
- Honeybee example demonstrates generalizability beyond humans
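The sum versus log-sum item above can be illustrated with simple arithmetic. The sketch below is not taken from the article: the helper names and the toy probabilities are assumptions, and it shows only one plausible reason a plain sum is less noisy, namely that a single near-zero probability dominates a sum of logs while changing a plain sum by less than 1.

```python
import math
from typing import Sequence


def sum_of_probs(probs: Sequence[float]) -> float:
    """Plain sum: each residue contributes at most 1."""
    return sum(probs)


def sum_of_log_probs(probs: Sequence[float]) -> float:
    """Log-sum: a residue's contribution is unbounded as p approaches 0."""
    return sum(math.log(p) for p in probs)


# Toy per-residue probabilities (illustrative values, not real model output).
confident = [0.9] * 10
one_outlier = [0.9] * 9 + [1e-6]  # identical except for one residue

delta_sum = sum_of_probs(one_outlier) - sum_of_probs(confident)
delta_log = sum_of_log_probs(one_outlier) - sum_of_log_probs(confident)

print(f"change in plain sum: {delta_sum:.3f}")
print(f"change in log-sum:   {delta_log:.3f}")
```

In this toy case the single outlier shifts the log-sum by more than ten times as much as it shifts the plain sum, so one poorly modelled residue can swamp the signal from the rest of the sequence.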
QC result: Pass.