394: Benchmarking LLMs for cfRNA biomarker discovery

Gaudio HA et al., Nature Communications - This episode examines a systematic benchmark of six commercial large language models applied to plasma cell-free RNA across three clinical cohorts, assessing LLM-driven gene-panel nomination and autonomous classifier construction versus conventional statistical workflows. Key terms: large language models, cell-free RNA, biomarker discovery, machine learning, diagnostics.

Study Highlights:
Six state-of-the-art LLMs were tested on cfRNA datasets from Kawasaki disease vs MIS-C, tuberculosis vs symptomatic controls, and ME/CFS vs sedentary controls for gene-panel nomination and end-to-end classifier building. LLM-nominated panels recapitulated canonical immune pathways and outperformed random gene sets, matching differential expression–derived panels in the tuberculosis cohort. End-to-end automation was feasible but model- and task-dependent: OpenAI o3 matched conventional performance for KD vs MIS-C but underperformed for TB and ME/CFS. Models showed prompt-adherence issues and sometimes returned non-reference or hallucinated features, which limits reproducibility.

Conclusion:
Current LLMs can extract biologically meaningful cfRNA candidate panels and partially automate biomarker workflows, but results are variable and traditional or hybrid statistical workflows remain necessary; rigorous validation and constrained output schemas are required before clinical deployment.
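The "constrained output schemas" point can be made concrete with a small check on a returned gene panel. What follows is only a minimal sketch, not the authors' pipeline: the function name check_panel, the PanelCheck container, and the gene identifiers are hypothetical placeholders, and the default panel size of 200 simply mirrors the feature-selection task described in these notes. It separates nominated symbols into those found in a reference gene list, those absent from it (possible hallucinated or non-reference features), and duplicates.

```python
from dataclasses import dataclass, field


@dataclass
class PanelCheck:
    """Outcome of validating one LLM-nominated panel (hypothetical container)."""
    valid: list = field(default_factory=list)       # symbols present in the reference
    unknown: list = field(default_factory=list)     # symbols absent from the reference
    duplicates: list = field(default_factory=list)  # repeated nominations
    expected_size: int = 200

    @property
    def adheres(self) -> bool:
        # The panel adheres only if every symbol is known, none is
        # repeated, and the requested panel size was returned.
        return (
            not self.unknown
            and not self.duplicates
            and len(self.valid) == self.expected_size
        )


def check_panel(nominated, reference_genes, expected_size=200):
    """Split nominated symbols into valid, unknown, and duplicate entries."""
    reference = set(reference_genes)
    seen = set()
    result = PanelCheck(expected_size=expected_size)
    for raw in nominated:
        gene = raw.strip()  # tolerate stray whitespace in model output
        if gene in seen:
            result.duplicates.append(gene)
            continue
        seen.add(gene)
        if gene in reference:
            result.valid.append(gene)
        else:
            result.unknown.append(gene)
    return result


# Placeholder identifiers, not genes reported in the article.
reference = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
report = check_panel(["GENE_A", "GENE_D", "GENE_A", "GENE_X"], reference, expected_size=3)
print(report.valid, report.unknown, report.duplicates, report.adheres)
```

A check of this kind would typically run before any classifier is trained, so that a panel with unknown or duplicated symbols is rejected or re-requested rather than silently trimmed.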

Music:
Enjoy the music based on this article at the end of the episode.

Article title:
Benchmarking large language models for cell-free RNA diagnostic biomarker discovery

First author:
Gaudio HA

Journal:
Nature Communications

DOI:
10.1038/s41467-026-74077-x

Reference:
Gaudio HA, Bliss A, Loy CJ, Eweis‑LaBolle D, Gardella AE & De Vlaminck I. Benchmarking large language models for cell-free RNA diagnostic biomarker discovery. Nature Communications (2026). doi:10.1038/s41467-026-74077-x

License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00

Official website: https://basebybase.com

On PaperCast Base by Base you'll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Episode link: https://basebybase.com/episodes/benchmarking-llms-cfrna

QC:
This episode was checked against the original article PDF and publication metadata for the episode release published on 2026-06-17.

QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Substantively audited the transcript's coverage of the study design, LLM benchmarking across three cohorts, gene-panel nomination, end-to-end classifier construction, prompt effects, and the hybrid-workflow conclusions, with reference to supporting results in the article.
- transcript topics: Study design and cohorts (KD vs MIS-C, TB vs symptomatic controls, ME/CFS vs sedentary controls); Prompt adherence and gene-panel nomination; Comparison of LLM panels to random and DGE panels; End-to-end classifier construction and cross-cohort performance; Disease-informed vs disease-naïve prompts impact; Limitations: probability calibration, data leakage concerns

QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 5
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0

Metadata Audited:
- article_doi
- article_title
- article_journal
- license

Factual Items Audited:
- Six LLMs were evaluated across three cohorts: OpenAI o3, GPT-4o, Claude Opus 4, Claude 3.7 Sonnet, Gemini 2.5 Pro, Gemini 2.0 Flash
- Cohorts studied: Kawasaki disease (KD) vs MIS-C, active tuberculosis (TB) vs symptomatic controls, ME/CFS vs sedentary controls
- Two tasks were used: (1) feature selection to nominate 200 genes; (2) end-to-end classifier construction from cfRNA count matrices
- LLM-nominated panels outperformed random panels; TB panels matched/exceeded DGE; KD and ME/CFS panels trailed DGE in predictive performance
- End-to-end classifiers were feasible for some models; OpenAI o3 and Claude Opus 4 completed end-to-end pipelines
- OpenAI o3 achieved ~86.7% mean accuracy for KD vs MIS-C end-to-end; TB: ~76.3–77.6%; ME/CFS showed limited gains

QC result: Pass.
