Episode 207

207: Semantic Design of de novo Genes with Evo

November 24, 2025 | 00:15:05

Hosted by Gustavo B Barra

Base by Base

Show Notes

Episode 207: Semantic Design of de novo Genes with Evo

In this episode of PaperCast Base by Base, we explore how a genomic language model called Evo can use genomic context to design entirely new DNA sequences that encode functional genes and multi-component defence systems.

Study Highlights:
Researchers trained the Evo genomic language model on long prokaryotic and phage DNA sequences and used genomic neighbourhoods as prompts to autocomplete new genes whose functions mirror those of their neighbours. They experimentally validated Evo-designed type II toxin–antitoxin systems and type III toxin–antitoxin modules, discovering novel protein toxins, protein antitoxins and RNA antitoxins that strongly modulate bacterial survival despite low or absent sequence similarity to natural proteins. Using prompts from anti-CRISPR operons, they generated diverse anti-CRISPR proteins that block SpCas9 activity and protect cells from phage infection, including candidates that cannot be confidently assigned to any known protein family. Finally, they scaled this semantic design strategy to build SynGenome, a public resource of more than 120 billion base pairs of Evo-generated DNA, organised by Gene Ontology and domain annotations to enable function-guided exploration across many biological pathways.
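For readers who want to tinker, here is a minimal sketch of the context-as-prompt idea, assuming Evo is loaded through the standard Hugging Face transformers causal-LM interface. The checkpoint name, placeholder DNA and sampling settings are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of semantic design by gene autocompletion (assumptions noted
# above): concatenate the genomic neighbourhood of the gene you want to design
# into a DNA prompt, then let a causal genomic language model sample a
# continuation one nucleotide token at a time.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed public checkpoint name; substitute whichever Evo release you use.
MODEL = "togethercomputer/evo-1-131k-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL, trust_remote_code=True)

# Toy upstream context. A real prompt would span the actual neighbouring genes
# (thousands of bases); conditioning on downstream genes as well needs extra
# prompt engineering (e.g. strand/orientation tricks) not shown here.
upstream_context = "ATGAGCAAAGGTGAAGAACTGTTTACCGGC"

inputs = tokenizer(upstream_context, return_tensors="pt")
output = model.generate(
    inputs.input_ids,
    max_new_tokens=1500,  # roughly one bacterial gene's worth of nucleotides
    do_sample=True,       # sampling (not greedy decoding) yields sequence diversity
    temperature=0.8,      # illustrative setting
)
designed_dna = tokenizer.decode(output[0])
print(designed_dna)
```

Because decoding samples from the model rather than taking the single most likely base, repeated runs with the same neighbourhood prompt yield many different candidate sequences, which is exactly the diversity discussed in the episode.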

Conclusion:
This work shows that genomic language models can move beyond imitating nature, using semantic relationships in genomes to design de novo functional genes and systems that expand the sequence space available for protein engineering and synthetic biology.

Music:
Enjoy the music based on this article at the end of the episode.

Reference:
Merchant AT, King SH, Nguyen E, Hie BL. Semantic design of functional de novo genes from a genomic language model. Nature. 2025. https://doi.org/10.1038/s41586-025-09749-7

License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00

Official website https://basebybase.com

Castos player https://basebybase.castos.com

On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Chapters

  • (00:00:00) - A New Way to Design New Genomes
  • (00:05:45) - Artificial Intelligence's challenge to protein design
  • (00:11:08) - Uncovering the genome's hidden secrets
  • (00:11:53) - The Secret Life of Genes

Episode Transcript

[00:00:14] Speaker B: Welcome to Base by Base, the papercast that brings genomics to you wherever you are. Thanks for listening, and don't forget to follow and rate us in your podcast app. We talk a lot about engineering life, but that usually means, you know, tweaking or optimizing things that nature already built.
[00:00:31] Speaker A: Right. Working with existing blueprints.
[00:00:32] Speaker B: But what if we could design brand new biological components? I mean, entire genes that have never existed before? Not by trying to guess their structure, but just by telling an AI what context they should show up in.
[00:00:47] Speaker A: I mean, that's really the holy grail of synthetic biology, isn't it? For decades, designing a functional protein from scratch has been just incredibly hard. You're either making tiny changes to a known sequence, or you're using these massive, complex structural models, which can be super slow and often struggle.
[00:01:04] Speaker B: Exactly. And the space of what's possible for a functional gene is astronomical. Natural evolution has only explored a tiny little corner of it. So today, what we're diving into is some research that's found a totally new way to navigate that space. They've built this generative genomic model, it's called Evo, that can create functional de novo genes.
[00:01:26] Speaker A: And when we say de novo, we really mean it. These are genes that have zero significant sequence or structural similarity to anything we know of. This isn't just modifying things.
[00:01:36] Speaker B: Yeah.
[00:01:37] Speaker A: This is genuine creation. It's guided by the language of the genome itself.
[00:01:41] Speaker B: So before we get into the nuts and bolts of how they pulled this off, we really want to recognize the innovative work we're discussing today.
[00:01:48] Speaker A: Absolutely. We're celebrating the team, including A. T. Merchant, Samuel H. King, Eric Nguyen, and Brian L. Hie. They're affiliated with Stanford University and the Arc Institute.
[00:01:58] Speaker B: And their work fundamentally changes how we can think about generative genomics. It's a robust new framework.
[00:02:05] Speaker A: It really is. They basically moved the field from treating the genome like a dictionary of parts you can look up...
[00:02:10] Speaker B: ...to treating it like a language. A language you can actually use to write entirely new biological sentences.
[00:02:16] Speaker A: Exactly.
[00:02:16] Speaker B: Yeah.
[00:02:17] Speaker A: To really get the breakthrough, though, you have to start with the main problem: how do you tell an AI what function is? You might say, I want a new enzyme, but how do you specify what you want that piece of DNA to do inside a living cell? Most past attempts have relied on things like sequence similarity, which...
[00:02:35] Speaker B: Which just limits you to things that look like stuff we already know.
[00:02:38] Speaker A: Precisely. And this is where the paper makes a genius move. They borrow a concept, but not from biology, from linguistics.
[00:02:48] Speaker B: Distributional semantics. Right. I've heard this in the context of large language models for, you know, human language.
[00:02:55] Speaker A: That's the one. The idea is simple: the meaning of a word is defined by the company it keeps, the other words around it in a sentence.
[00:03:02] Speaker B: Okay, so how does that map to a genome?
[00:03:04] Speaker A: Well, it maps perfectly onto how genes are organized in prokaryotes, so bacteria and archaea. Functionally related genes often sit right next to each other in these clusters called operons.
[00:03:14] Speaker B: Oh, right. So you'll have a set of genes for, say, breaking down a sugar, and they're all lined up in a row.
[00:03:19] Speaker A: Exactly, because they all contribute to the same biological pathway. For decades, scientists have used this for guilt by association: if you find an unknown gene next to known metabolism genes, it probably has to do with metabolism. But they're flipping it.
[00:03:35] Speaker B: They're not using it to figure out what something is, they're using it to create something new.
[00:03:39] Speaker A: That is the critical shift. They trained a model on massive amounts of prokaryotic genomes, so the AI learns these multi-gene relationships, it learns the grammar, and then it can do function-guided design just by using the neighboring genes as a prompt.
[00:03:53] Speaker B: Okay, so let's get into the tech. The tool itself is called Evo 1.5. It's a huge genomic language model.
[00:03:59] Speaker A: And crucially, it was trained on entire genomes at single-nucleotide resolution, not just little snippets. This lets it see that big-picture context, those operon structures that can be thousands of bases long.
[00:04:11] Speaker B: So the method, this semantic design, it sounds almost simple when you describe it.
[00:04:16] Speaker A: In principle it is. You give the model a DNA prompt, let's say the two genes upstream and the two genes downstream from where you want a new function, then you just ask it to autocomplete the missing gene in the middle.
[00:04:26] Speaker B: You provide the functional neighborhood, and Evo fills in the blank with a gene that makes sense there.
[00:04:31] Speaker A: That's it.
[00:04:32] Speaker B: So their first step was just to prove that Evo was actually using the context and not just, you know, memorizing the most common gene for that spot.
[00:04:39] Speaker A: Right. So they did an autocomplete test on a really well-known gene, rpoS, which is vital for the bacterial stress response. And even when they gave the model only 30% of the input sequence, only 30%, it achieved an 85% amino acid recovery rate.
[00:04:55] Speaker B: Wow, that's incredibly accurate.
[00:04:58] Speaker A: It is. But here's the really clever insight, the part that goes beyond just memorization. When they looked at the DNA sequences Evo generated, they saw massive nucleotide diversity.
[00:05:09] Speaker B: Wait, okay, so the protein output, the amino acids, were the same, but the underlying DNA code was different?
[00:05:16] Speaker A: Very different. It was using all sorts of silent mutations. And that's the tell. It proves Evo isn't just spitting back something from its training data.
[00:05:24] Speaker B: Ah, I get it. It's synthesizing completely new DNA sequences that still code for the right protein. It's learned the redundancy in the genetic code.
[00:05:33] Speaker A: Yeah, it's operating in that dark matter of sequence space, finding solutions that work, but that evolution may never have stumbled upon.
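To make the silent-mutation point concrete, here is a tiny self-contained sketch. The sequences are toy examples, Biopython's Seq.translate does the codon-table work, and the gap-free identity metric is a simplified stand-in for the amino acid recovery rate mentioned above.

```python
# Two DNA sequences that differ only by silent (synonymous) mutations encode
# the identical peptide; amino-acid recovery is just per-position identity
# between the generated protein and the reference protein.
from Bio.Seq import Seq  # Biopython

reference_dna = "ATGAAAGTTTAA"  # codons ATG-AAA-GTT-TAA -> M K V stop
generated_dna = "ATGAAGGTATGA"  # different codons, same peptide

ref_protein = str(Seq(reference_dna).translate())
gen_protein = str(Seq(generated_dna).translate())

def aa_recovery(ref: str, gen: str) -> float:
    """Fraction of aligned positions where the generated protein matches the
    reference (a simplified, gap-free stand-in for the paper's metric)."""
    n = min(len(ref), len(gen))
    return sum(r == g for r, g in zip(ref[:n], gen[:n])) / n

print(ref_protein, gen_protein)  # MKV* MKV*
print(f"recovery = {aa_recovery(ref_protein, gen_protein):.0%}")  # 100%
```

Here the nucleotide sequences share only 9 of 12 bases, yet recovery at the protein level is 100%, which is the signature of a model exploiting the genetic code's redundancy rather than memorizing training data.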
[00:05:41] Speaker B: So Evo can synthesize novel sequences that fit a context. The real test, then, is actual creation. Did it work?
[00:05:48] Speaker A: It worked incredibly well. They targeted three really challenging, highly diverse functions, starting with defense systems. They began with toxin–antitoxin systems, TAs. These are little self-destruct or dormancy switches in bacteria, and they are famous for evolving super fast and having very little sequence conservation.
[00:06:05] Speaker B: A tough target.
[00:06:06] Speaker A: Very tough. For the protein–protein systems, the type II TAs, Evo generated a functional toxin and four different antitoxins. And get this: those antitoxins only had about 21 to 27% sequence identity to any known proteins.
[00:06:21] Speaker B: 21%. That's deep in what they call the twilight zone, right, where you basically can't predict function from sequence anymore.
[00:06:27] Speaker A: You can't. So the fact that these worked is solid proof of de novo design.
[00:06:32] Speaker B: That's incredible. But the function was even crazier, wasn't it?
[00:06:34] Speaker A: It was. Two of the generated antitoxins, EvoAT2 and EvoAT4, showed multi-toxin neutralizing activity.
[00:06:42] Speaker B: Meaning they didn't just work against one toxin.
[00:06:44] Speaker A: They rescued bacteria from three different natural toxins: RelE, MazF and YoeB. That kind of broad compatibility is not something you see very often in nature. It's like the AI uncovered a more fundamental, or a more modular, way of solving the problem.
[00:07:00] Speaker B: And it didn't stop with proteins. They went after RNA systems too.
[00:07:02] Speaker A: Yep, the type III TA systems. It designed a functional RNA antitoxin and a toxic protein. And again, that protein, EvoT1, had no strong sequence or even predicted structural similarity to any known toxin.
[00:07:14] Speaker B: But the real mic-drop moment for novelty seems to be the anti-CRISPRs.
[00:07:19] Speaker A: Oh, absolutely. Anti-CRISPRs, or Acrs, are what viruses use to shut down the bacterial immune system. They're hyper-diverse, constantly popping up as brand new inventions. They're the perfect test case. And the success rate? A robust 17% experimental success rate in generating functional Acrs against SpCas9, which, for a de novo design task with no special fine-tuning, is extremely high.
[00:07:43] Speaker B: Okay, so how novel were they really? How did they prove it?
[00:07:46] Speaker A: They did this really smart analysis on the two most novel ones, EvoAcr1 and EvoAcr2. They showed zero significant sequence or structural similarity to anything known.
[00:07:56] Speaker B: Right.
[00:07:57] Speaker A: Then they did what's called a residue coverage analysis. They basically asked: if we had to build these AI proteins out of little snippets of known natural proteins, how many different snippets would we need?
[00:08:07] Speaker B: Like trying to solve a puzzle.
[00:08:08] Speaker A: Exactly. And for EvoAcr1 and 2, it took fragments from 28 to 31 different natural proteins to explain their composition.
[00:08:15] Speaker B: Wow. So it's not just a remix, it's something genuinely new.
[00:08:18] Speaker A: Genuinely new. It's a level of novelty on par with proteins designed by these incredibly complex, specialized computational pipelines. But Evo did it with just a simple contextual prompt.
[00:08:29] Speaker B: That is a massive difference in effort.
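One way to picture the residue coverage analysis just described is as a set-cover problem over the designed protein's residues. The paper's exact procedure isn't reproduced here; the greedy strategy and the fragment-hit format below are assumptions for illustration only.

```python
# Hypothetical residue-coverage sketch: given fragments of natural proteins
# that match stretches of a designed protein, greedily count how many distinct
# natural proteins are needed to cover it residue by residue.
from typing import List, Tuple

Hit = Tuple[str, int, int]  # (source protein, start, end) on the query; end exclusive

def residue_coverage(query_len: int, hits: List[Hit]) -> Tuple[int, int]:
    uncovered = set(range(query_len))
    sources = set()
    while uncovered:
        # Pick the hit covering the most still-uncovered residues.
        best = max(hits, key=lambda h: len(uncovered & set(range(h[1], h[2]))), default=None)
        if best is None or not uncovered & set(range(best[1], best[2])):
            break  # no remaining hit adds coverage
        sources.add(best[0])
        uncovered -= set(range(best[1], best[2]))
    return len(sources), len(uncovered)  # proteins used, residues never covered

# Toy example: a 100-residue design patched together from three natural proteins.
hits = [("natA", 0, 40), ("natB", 35, 70), ("natC", 60, 95)]
print(residue_coverage(100, hits))  # (3, 5): 3 source proteins, 5 residues uncovered
```

In this toy run, three natural proteins cover 95 of 100 residues; for EvoAcr1 and EvoAcr2 the analogous count ran to 28 to 31 source proteins, which is what makes the novelty claim quantitative rather than impressionistic.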
[00:08:31] Speaker A: Huge. And this versatility led them to scale up. They created SynGenome, which is a public database of over 120 billion base pairs of AI-generated DNA.
[00:08:43] Speaker B: 120 billion.
[00:08:45] Speaker A: It's derived from prompts covering 9,000 different functional terms. And it's high quality: it mimics the statistical properties of real prokaryotic genomes.
[00:08:56] Speaker B: And it's already proving useful, right? They used it to confirm a link between a previously mysterious protein domain and cytochrome c.
[00:09:04] Speaker A: They did. So it's not just a database, it's a discovery engine.
[00:09:06] Speaker B: So what does this actually mean for synthetic biology? How does this change the game?
[00:09:11] Speaker A: Well, it's a completely new, orthogonal approach. You're not starting with structure anymore. You don't need a mechanistic hypothesis. You don't even need to do task-specific fine-tuning for every new function you want.
[00:09:21] Speaker B: You're getting access to parts of that functional sequence space that evolution just hasn't touched, or parts that older methods would have just thrown out because, you know, the predicted structure looked weird or low confidence.
[00:09:35] Speaker A: Exactly. Who cares what the structure looks like if the function works in context?
[00:09:39] Speaker B: And for researchers, having SynGenome is, I mean, a massive pre-generated library for gene discovery.
[00:09:47] Speaker A: It saves countless hours. You can go search it right now for a function you're interested in.
[00:09:51] Speaker B: Okay, but it can't be perfect. There have to be limitations here. The generation is autoregressive, right? Predicting one base after another.
[00:10:00] Speaker A: That's a crucial point. It can sometimes fall into repetitive sequences or produce non-functional hallucinations.
[00:10:06] Speaker B: So you still have to test everything in the lab.
[00:10:08] Speaker A: You absolutely do. This doesn't replace the bench, but what it does is dramatically improve the quality and the novelty of your starting candidates. Instead of screening a million random variants, you're screening a hundred highly plausible, totally novel ones.
[00:10:23] Speaker B: And the other big limitation is the reliance on context, isn't it? This works so well in bacteria because of those neat little operons.
[00:10:31] Speaker A: Correct. Applying this to, say, a human genome, where genes are spread out and separated by vast non-coding regions, that's a much bigger challenge. It's going to require the next generation of genomic language models.
[00:10:43] Speaker B: But those limitations just kind of point the way forward, don't they? It feels like this is just the beginning. We're learning to write biology.
[00:10:50] Speaker A: I think that's the perfect way to put it. To sum it up, this model, Evo, uses the company a gene keeps, its genomic context, to successfully design completely functional de novo proteins and RNAs. It's generating novel anti-CRISPRs, multifunctional antitoxins, all without needing any prior structural or evolutionary information. It's opening up these huge unexplored territories of functional sequence space.
[00:11:12] Speaker B: So if we can use the language of the genome to rapidly build custom biological parts that nature hasn't even thought of yet, what fundamental discoveries are now within our grasp, just waiting to be found in a database like SynGenome? That's a really exciting question. This episode was based on an open-access article under the CC BY 4.0 license. You can find a direct link to the paper and the license in our episode description. If you enjoyed this, follow or subscribe in your podcast app and leave a five-star rating. If you'd like to support our work, use the donation link in the description.
Now, stay with us for an original track created especially for this episode and inspired by the article you've just heard about. Thanks for listening, and join us next time as we explore more science, base by base.
[00:12:15] Speaker C: Fluorescent screens glow over midnight glass and steel. Evo reads the quiet genome like a city made of code, every neighborhood of bases hiding what they might reveal, as we ask it for a future nature never wrote. Can a line of unseen letters learn the meaning of a sign? Can a fragment of a sequence teach the rest how to align? We let the model dream in context, let it wander through the disappearing unknown; it starts to speak, new genes arising from the patterns underneath. In the silence of the genome something living starts to shine; it will draw us a different map of a life, one token at a time.
[00:13:55] Speaker C: Toxins find their partners, gentle antitoxins in the dark. Anti-CRISPR shields ignite when the phages start to swarm. Strange proteins with no family still can carry out their spark, and SynGenome stores their stories in an artificial stone. Semantic design, turning distance into rhyme, ghosts of function taking shape between the lines. From a lamplit Stanford evening to the far edges of time, Evo lets the hidden codes of life recombine. And real life...
