Show Notes
Gaudio HA et al., Nature Communications - This episode examines a systematic benchmark of six commercial large language models applied to plasma cell-free RNA across three clinical cohorts, assessing LLM-driven gene-panel nomination and autonomous classifier construction versus conventional statistical workflows. Key terms: large language models, cell-free RNA, biomarker discovery, machine learning, diagnostics.
Study Highlights:
Six state-of-the-art LLMs were tested on cfRNA datasets from Kawasaki disease vs MIS-C, tuberculosis vs symptomatic controls, and ME/CFS vs sedentary controls for gene-panel nomination and end-to-end classifier building. LLM-nominated panels recapitulated canonical immune pathways and outperformed random gene sets, matching differential expression–derived panels in the tuberculosis cohort. End-to-end automation was feasible but model- and task-dependent: OpenAI o3 matched conventional performance for KD vs MIS-C but underperformed for TB and ME/CFS. Models showed prompt-adherence issues and sometimes returned non-reference or hallucinated features, which limits reproducibility.
Conclusion:
Current LLMs can extract biologically meaningful cfRNA candidate panels and partially automate biomarker workflows, but their results vary by model and task, so traditional or hybrid statistical workflows remain necessary. Rigorous validation and constrained output schemas are required before clinical deployment.
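The point about constrained outputs can be illustrated with a minimal sketch of a panel check: compare the gene symbols an LLM returns against the reference gene list and flag anything that does not belong. This is not the authors' code. The function name `check_panel`, the `PanelCheck` fields, the exact-match (case-sensitive) comparison, and the example symbols are all assumptions made for illustration; the default panel size of 200 follows the feature-selection task described in the QC notes.

```python
from dataclasses import dataclass


@dataclass
class PanelCheck:
    valid: list[str]       # nominated genes present in the reference, in nominated order
    unknown: list[str]     # nominated genes absent from the reference (possible hallucinations)
    duplicates: list[str]  # genes nominated more than once
    size_ok: bool          # True when the number of valid genes equals the requested size


def check_panel(nominated, reference, expected_size=200):
    """Split an LLM-nominated gene list into valid, unknown, and duplicate entries.

    Hypothetical helper: symbols are matched exactly after trimming whitespace.
    """
    reference_set = set(reference)
    seen = set()
    valid, unknown, duplicates = [], [], []
    for raw in nominated:
        gene = raw.strip()
        if not gene:
            continue  # skip blank entries
        if gene in seen:
            duplicates.append(gene)
            continue
        seen.add(gene)
        if gene in reference_set:
            valid.append(gene)
        else:
            unknown.append(gene)
    return PanelCheck(valid, unknown, duplicates, len(valid) == expected_size)


if __name__ == "__main__":
    # Toy reference list and toy LLM output; symbols are placeholders, not panels from the article.
    reference = ["IL1B", "TNF", "IFI27", "CXCL10"]
    result = check_panel(["IL1B", "TNF ", "IL1B", "FAKE1"], reference, expected_size=2)
    print(result)
```

A check like this only catches names outside the reference; it says nothing about whether the valid genes are predictive, which still requires the statistical validation described above.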
Music:
Enjoy the music based on this article at the end of the episode.
Article title:
Benchmarking large language models for cell-free RNA diagnostic biomarker discovery
First author:
Gaudio HA
Journal:
Nature Communications
DOI:
10.1038/s41467-026-74077-x
Reference:
Gaudio HA, Bliss A, Loy CJ, Eweis‑LaBolle D, Gardella AE & De Vlaminck I. Benchmarking large language models for cell-free RNA diagnostic biomarker discovery. Nature Communications (2026). doi:10.1038/s41467-026-74077-x
License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website: https://basebybase.com
On PaperCast Base by Base you'll discover the latest in genomics, functional genomics, structural genomics, and proteomics.
Episode link: https://basebybase.com/episodes/benchmarking-llms-cfrna
QC:
This episode was checked against the original article PDF and publication metadata for the episode released on 2026-06-17.
QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Audited the transcript's treatment of the study design, LLM benchmarking across three cohorts, gene-panel nomination, end-to-end classifier construction, prompt effects, and the hybrid-workflow conclusions against the supporting results in the article.
- transcript topics: Study design and cohorts (KD vs MIS-C, TB vs symptomatic controls, ME/CFS vs sedentary controls); Prompt adherence and gene-panel nomination; Comparison of LLM panels with random and differential gene expression (DGE) panels; End-to-end classifier construction and cross-cohort performance; Impact of disease-informed vs disease-naïve prompts; Limitations: probability calibration, data leakage concerns
QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 5
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0
Metadata Audited:
- article_doi
- article_title
- article_journal
- license
Factual Items Audited:
- Six LLMs were evaluated across three cohorts: OpenAI o3, GPT-4o, Claude Opus 4, Claude 3.7 Sonnet, Gemini 2.5 Pro, Gemini 2.0 Flash
- Cohorts studied: Kawasaki disease (KD) vs MIS-C, active tuberculosis (TB) vs symptomatic controls, ME/CFS vs sedentary controls
- Two tasks were used: (1) feature selection to nominate 200 genes; (2) end-to-end classifier construction from cfRNA count matrices
- LLM-nominated panels outperformed random panels; TB panels matched or exceeded DGE-derived panels, while KD and ME/CFS panels trailed them in predictive performance
- End-to-end classifiers were feasible for some models; OpenAI o3 and Claude Opus 4 completed end-to-end pipelines
- OpenAI o3 achieved ~86.7% mean accuracy for KD vs MIS-C end-to-end; TB: ~76.3–77.6%; ME/CFS showed limited gains
QC result: Pass.