Episode 215

December 01, 2025

00:20:07

215: Protein Set Transformer for high-diversity viromics

Hosted by

Gustavo B Barra

Show Notes

Episode 215: Protein Set Transformer for high-diversity viromics

In this episode of PaperCast Base by Base, we explore the Protein Set Transformer (PST), a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets.

Study Highlights:
PST embeds proteins with ESM2, concatenates positional and strand vectors, contextualizes proteins with a multi-head attention encoder, and produces genome embeddings via a learnable weighted decoder pooling. The foundation PST-TL models were pretrained on >100k dereplicated viral genomes encoding >6M proteins using a triplet-loss objective with PointSwap augmentation and evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote relationships, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST improved annotation transfer for hypothetical proteins via embedding and structure-aware clustering and boosted viral host-species prediction when used in a graph link-prediction framework.
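
To make the architecture described above concrete, here is a minimal, hypothetical PyTorch-style sketch of the genome-embedding step: per-protein ESM2 embeddings are concatenated with small learnable positional and strand vectors, contextualized by a multi-head attention encoder, and pooled into a single genome vector by learned attention weights. The dimensions, module choices, and pooling details are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a PST-style genome embedder (NOT the authors' code).
# Assumes per-protein ESM2 embeddings have already been computed for one genome.
import torch
import torch.nn as nn

class GenomeSetEmbedder(nn.Module):
    def __init__(self, esm_dim=320, pos_dim=32, strand_dim=8,
                 n_heads=4, n_layers=2, max_proteins=512):
        super().__init__()
        d_model = esm_dim + pos_dim + strand_dim
        # Learnable positional and strand vectors are concatenated, not added,
        # onto each protein's ESM2 embedding.
        self.pos_embed = nn.Embedding(max_proteins, pos_dim)
        self.strand_embed = nn.Embedding(2, strand_dim)  # 0 = '-', 1 = '+'
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Learned attention pooling: one importance score per contextualized protein.
        self.pool_score = nn.Linear(d_model, 1)

    def forward(self, esm_embeddings, positions, strands):
        """esm_embeddings: (n_proteins, esm_dim); positions, strands: (n_proteins,)"""
        x = torch.cat([esm_embeddings,
                       self.pos_embed(positions),
                       self.strand_embed(strands)], dim=-1).unsqueeze(0)
        contextual = self.encoder(x)                          # context-aware protein embeddings
        weights = torch.softmax(self.pool_score(contextual), dim=1)
        genome_embedding = (weights * contextual).sum(dim=1)  # weighted pooling over the set
        return contextual.squeeze(0), genome_embedding.squeeze(0)

# Toy usage with random stand-ins for ESM2 vectors in a 10-protein genome.
n, esm_dim = 10, 320
model = GenomeSetEmbedder(esm_dim=esm_dim)
prot_ctx, genome_vec = model(torch.randn(n, esm_dim),
                             torch.arange(n),
                             torch.randint(0, 2, (n,)))
print(prot_ctx.shape, genome_vec.shape)  # torch.Size([10, 360]) torch.Size([360])
```

In the study, the contextualized per-protein outputs are what cluster into functional modules, while the pooled vector serves as the genome embedding used for genome–genome comparisons.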

Conclusion:
PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications.

Music:
Enjoy the music based on this article at the end of the episode.

Reference:
Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4

License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00

Official website https://basebybase.com

Castos player https://basebybase.castos.com

On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Episode link: https://basebybase.castos.com/episodes/protein-set-transformer

Chapters

  • (00:00:00) - Deep Learning in Viral Biology
  • (00:02:31) - Preliminary insights into viral biology
  • (00:08:01) - PST-TL: The Hidden Genome of Viruses
  • (00:11:29) - PST-TL: The Virality Model
  • (00:14:22) - Preston 2, Context-aware viral evolution
  • (00:16:19) - Signs and Numbers in the Code

Episode Transcript

[00:00:00] Speaker A: Foreign. [00:00:14] Speaker B: Welcome to Base by Base, the PaperCast that brings genomics to you wherever you are. Thanks for listening and don't forget to follow and rate us in your podcast app. Today we are embarking on a deep dive into the world of viruses, those microscopic masters of mutation. And I mean, just think about this for a second. Viruses are the single most abundant biological entity on the planet. [00:00:37] Speaker C: They're literally everywhere, in every ecosystem, every drop of water. They're modulating everything from, you know, our own health to global carbon cycles. [00:00:45] Speaker B: They're absolutely crucial. And yet, when we try to study them at this massive scale, this field we call viromics, we just hit a wall. A huge bottleneck, and it's one. [00:00:55] Speaker C: Of the biggest in modern biology. [00:00:56] Speaker B: It's not about our ability to sequence anymore, is it? The data? We can generate terabytes of it. [00:01:01] Speaker C: Oh, exactly. The problem is interpretation. It's making sense of it all. Viral evolution is just. It's relentless. Their genomes, their proteins, they mutate and diverge so fast that our traditional tools just can't keep up. [00:01:13] Speaker B: So the tools that rely on finding a close match in a database, they fall silent. [00:01:18] Speaker C: If a virus is too distantly related to anything we've ever seen and annotated, it becomes what we call dark matter. We have the sequence, we know it's there, but we have absolutely no idea. [00:01:29] Speaker B: What it does, no idea about its function, its structure. Yeah, nothing. [00:01:33] Speaker C: It's a black box. [00:01:34] Speaker B: So if that sequence similarity gets erased by evolution so quickly, we need a totally new way of looking at this. A kind of genetic Rosetta stone, a system that doesn't need perfect letter-for-letter matches, but can somehow understand the deeper grammar, the underlying structure of a viral genome. A shortcut to understanding these, these evolutionary chameleons. [00:01:57] Speaker C: And that is precisely our mission today. We're going to unpack a really revolutionary genome language model that's built to do exactly that, to interpret this hyper-diverse viral data and really to build a new foundation for viromics. [00:02:10] Speaker B: For this work today, we're celebrating the research of Cody Martin, Anthony Gitter, and Karthik Anantharaman, who are primarily from the University of Wisconsin-Madison and affiliated institutions. [00:02:19] Speaker C: Yeah, their work is a fantastic leap forward. It really demonstrates how deep learning can be applied to interpret this incredibly complex, diverse genomic data. It's a direct shot at that dark matter problem we were just talking about. [00:02:31] Speaker B: So let's get a bit more context. Viromics, this large-scale study of viral communities, is a huge research area, but this diversity gap is, it's profound. [00:02:41] Speaker C: It is. I mean, if you look at bacteria, their genomes usually have some universal genes you can use for comparison. But for viruses, most of the data lacks any usable functional labels or the shared markers we need to build a family tree, a phylogeny. [00:02:56] Speaker B: And that's where our language models started to look so promising. [00:02:58] Speaker C: Exactly. We have these incredible tools, protein language models, or PLMs. The most established one is probably ESM2, and it's been trained on millions of individual protein sequences.
[00:03:08] Speaker B: So it's amazing at understanding a single, isolated protein. [00:03:13] Speaker C: Phenomenal. It can learn its biochemistry, its function, its structure. But that's also where previous attempts to use these models in viromics fell a bit short. They were too focused on single tasks. [00:03:22] Speaker B: Like just annotating one protein at a time or trying to predict a host. [00:03:26] Speaker C: Right. They miss the forest for the trees. A viral genome is not just a random bag of proteins. Its parts are organized. That organization is a product of its evolution. You have functional cassettes. All the structural genes might be clustered together. [00:03:40] Speaker B: Okay, so there's a logic to the. [00:03:41] Speaker C: Layout, a deep evolutionary logic. And previous models just didn't capture that. A truly useful foundation model has to understand the context of each protein within the entire genome. [00:03:52] Speaker B: Alright, so let's unpack their solution then. The Protein Set Transformer, or PST. How does its architecture get at this idea of genomic context? [00:04:01] Speaker C: The core idea is actually pretty elegant. PST models a genome not as a string of letters, but as a set of proteins. [00:04:08] Speaker B: A set. Okay. [00:04:09] Speaker C: Think of the genome as a toolbox. PST doesn't really care about the order you lay the tools out on the workbench. But it cares very, very deeply about which tools are in the box and what each tool does in relation to all the others. [00:04:21] Speaker B: That makes sense. It moves beyond that strict linear sequence idea. So how does the model actually process this set of tools? [00:04:29] Speaker C: It uses an encoder-decoder setup. It's designed to produce context-aware information on two levels: the individual proteins and then the genome as a whole. It starts with the input stage. It takes those really high-quality vector embeddings from a pre-trained model like ESM2 for every single protein in the set. [00:04:47] Speaker B: So it's already starting with a rich description of each protein's basic features. What does PST add to that to make it context-aware? [00:04:55] Speaker C: This is the really clever part. It concatenates. It just sticks on these small learnable vectors to each of those big ESM2 embeddings. [00:05:03] Speaker B: And these small vectors carry what information? [00:05:08] Speaker C: Just two things: the protein's position in the genome and which coding strand it's on. That's how the model is forced right from the start to pay attention to the genome's organization. [00:05:16] Speaker B: Huh. That's fascinating. It's like you have a great dictionary, that's ESM2, and then you add grammar and syntax rules to every single word before you even try to read the sentence. [00:05:25] Speaker C: That's a perfect analogy. And that enriched input then goes into the encoder. It uses multi-head attention. And this is where the magic really starts. The encoder looks at every protein and figures out how it relates to every other protein in that same genome. [00:05:40] Speaker B: So it's building the context. [00:05:41] Speaker C: It's building the context, yeah. The output is a new set of embeddings, PST protein embeddings, that now encode both the protein's own features and its specific role within that genome. [00:05:52] Speaker B: And then the decoder pulls it all together. [00:05:53] Speaker C: Yep. The decoder uses a multi-head attention pooling mechanism.
It's basically a very sophisticated weighting system that the model learns. It decides the relative importance of each of those contextualized proteins to a single consolidated representation for the entire genome. [00:06:10] Speaker B: And you get both the protein-level and the genome-level view in one. [00:06:13] Speaker C: Go in a single pass. [00:06:14] Speaker B: Okay, so the architecture makes sense, but how did they train it? They didn't use the standard method, masked language modeling. They chose something called triplet loss, or PST-TL. Why that change? [00:06:25] Speaker C: Well, masked language modeling is great if you want to predict a missing word in a sentence. But for genomics, especially viromics, our main goal isn't really prediction, it's mapping evolutionary relationships. [00:06:36] Speaker B: Ah. [00:06:37] Speaker C: And for these complex, high-dimensional vectors, triplet loss is just far better because it directly trains the model to understand relatedness. [00:06:46] Speaker B: So what does that look like? What's the triplet? [00:06:49] Speaker C: Think of it like a family tree challenge for the model. You give it three things: an anchor genome, which is your starting point, then a positive example, its closest relative, and a negative example, some distant, unrelated genome. [00:07:01] Speaker B: Got it. [00:07:02] Speaker C: The whole goal of triplet loss is spatial. It just trains the model to pull the anchor and the positive closer together in this big mathematical space, and to push the negative one further away by a guaranteed margin. [00:07:14] Speaker B: So it's forcing the model to create a map that accurately reflects evolutionary distance. But how do they know which genome is the positive one to begin with, especially with all this unannotated data? [00:07:25] Speaker C: They use a clever metric called Chamfer distance. Without getting too technical, it's a way of measuring the difference between two sets of things. It finds the best way to match up the proteins from one genome to the proteins in another and measures the average distance, and makes sure the positive example really has the most similar collection of proteins overall. [00:07:45] Speaker B: And to make the model even more robust, they threw in something called PointSwap. What does that do? [00:07:50] Speaker C: It's a form of data augmentation that mimics a real biological process, homologous recombination, which happens all the time in viruses. [00:07:58] Speaker B: So it's like genes getting swapped between related viruses. [00:08:01] Speaker C: Exactly. PointSwap simulates this by swapping similar protein vectors between two related genomes in the training data. By showing the model these slightly jumbled but still biologically plausible examples, you prevent it from just memorizing the input. It has to learn the essential features. [00:08:17] Speaker B: And the scale of this was just enormous. [00:08:19] Speaker C: Oh, massive. The foundation models were pre-trained on over 100,000 high-quality viral genomes. That's more than 6 million proteins. [00:08:27] Speaker B: So let's get to the findings, because this is where all that architectural and training sophistication really shines. How did PST-TL actually do when they benchmarked it? It was a. [00:08:37] Speaker C: Decisive win across the board. PST-TL significantly outperformed every other method they tested it against. [00:08:44] Speaker B: And that includes what?
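
For listeners following along with the training objective described above, here is a minimal, hypothetical sketch (not the authors' code) of how a Chamfer-style distance over two genomes' protein vectors could pick the positive example, and how a margin-based triplet loss pulls the anchor and positive genome embeddings together while pushing the negative away. PointSwap augmentation is omitted, and the function names, cosine distances, and margin value are illustrative assumptions.

```python
# Illustrative sketch of Chamfer-distance positive selection plus a triplet loss
# over genome embeddings (assumed details; not the published implementation).
import torch
import torch.nn.functional as F

def chamfer_distance(prots_a, prots_b):
    # Cosine-distance matrix between every protein in genome A and genome B.
    dist = 1 - F.normalize(prots_a, dim=-1) @ F.normalize(prots_b, dim=-1).T
    # Average best-match distance in both directions: a symmetric set-to-set score.
    return 0.5 * (dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean())

def genome_triplet_loss(anchor, positive, negative, margin=0.3):
    # Pull the anchor toward its most similar genome and push the unrelated
    # genome away by at least the margin (cosine distances on genome embeddings).
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=0)
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=0)
    return torch.clamp(d_pos - d_neg + margin, min=0.0)

# Toy example: pick the positive genome as the candidate whose protein set has
# the smallest Chamfer distance to the anchor (random vectors stand in for
# real protein embeddings).
anchor_proteins = torch.randn(12, 360)
candidate_sets = [torch.randn(n, 360) for n in (8, 15, 10)]
positive_idx = min(range(len(candidate_sets)),
                   key=lambda i: chamfer_distance(anchor_proteins,
                                                  candidate_sets[i]).item())
print("closest candidate genome:", positive_idx)
```
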
[00:08:45] Speaker C: Everything from, you know, standard nucleotide k-mer approaches to other big genome models like GenSLM and HyenaDNA. And crucially, it blew simpler protein-based methods out of the water, like just taking the average of all the ESM2 embeddings. [00:09:01] Speaker B: That last point seems really important. The approach they called PST-ctx, just averaging. [00:09:06] Speaker C: The vectors? That failed. It performed very poorly, which is strong proof that the whole architecture, the context-building encoder and that learned weighting in the decoder, is absolutely essential. You can't just throw the proteins in a bag and average them. [00:09:18] Speaker B: The structure is vital. [00:09:19] Speaker C: It is. But the most revealing result, the real aha moment, came from its ability to detect remote evolutionary relationships. [00:09:27] Speaker B: Okay, let's define that for everyone listening. What's the difference between a close relationship and a remote one in viromics? [00:09:33] Speaker C: Right. So we use two key metrics. For close relatives, we use average amino acid identity, or AAI. You can think of it like checking if two books use a lot of the same words. If the AAI is high, they're obviously related. [00:09:45] Speaker B: And the remote ones, that's where the. [00:09:46] Speaker C: Virus has evolved so much that the AAI drops to basically zero. The sequence similarity is gone. The words are all different now. But the books might still have the exact same plot structure, the same character types. [00:09:57] Speaker B: So the underlying structure is conserved. [00:09:59] Speaker C: Exactly. And we can measure that with something called average structural identity, or ASI, which compares the predicted 3D folds of the proteins. [00:10:08] Speaker B: So when AAI is zero, traditional tools are completely blind. What did PST-TL see? [00:10:14] Speaker C: This is the amazing part. PST-TL showed a strong positive correlation with structural similarity, with ASI, even when the sequence similarity, the AAI, was totally gone. Wow. The model is inferring this deep, conserved structural relationship just from the genomic context. It's completely bypassing the need for sequence data. [00:10:33] Speaker B: That's. That's groundbreaking. It's seeing the shared blueprint when the building materials look completely different. Did this translate to function too? Could it map out the organization within a single genome? [00:10:44] Speaker C: It did. It was clear proof that the context-aware training worked. PST consistently grouped related protein functions into what the authors call functional modules. [00:10:54] Speaker B: Any specific examples? [00:10:56] Speaker C: Yeah. The standout one was how it clustered the proteins involved in what are called late genes. So things like the tail proteins, the head and packaging proteins, the lysis proteins that burst the cell open. [00:11:07] Speaker B: And why is that particular group significant? [00:11:10] Speaker C: Because that clustering reflects a known, conserved biological reality. It's a common organizational pattern in viruses like the famous lambda phage genome. And the key thing is the simpler models, like averaged ESM2, completely fail to see this pattern. It proves PST is actually learning the functional architecture of the genome. [00:11:29] Speaker B: Which brings us to the biggest challenge of all, that dark matter. The 70 to 90% of viral proteins with no known function. What did PST do there?
[00:11:39] Speaker C: This is where there's massive hope. The researchers noticed that these proteins of unknown function, these hypothetical proteins, were often given a very high weight by the model's decoder. [00:11:49] Speaker B: So the model thought they were really important for defining what that virus was. [00:11:53] Speaker C: Right. And when they looked closer, they found that PST-TL was incredible at clustering these unknown proteins together with known capsid proteins that shared structural homology. [00:12:04] Speaker B: So let me get this straight. Even if a hypothetical protein had no sequence match to anything, if it had the 3D fold of a capsid protein, PST could group it into the capsid functional module. [00:12:16] Speaker C: That's exactly it. It's transferring annotation based on inferred structure and context, not just sequence. This could dramatically expand our ability to assign function to this vast, unannotated part of the viral world. [00:12:27] Speaker B: Okay, so let's talk real-world application. They tested this on a classic viromics task, host prediction. [00:12:33] Speaker C: They did. As a proof of concept, they took the PST-TL genome embeddings and fed them into an existing graph-based prediction framework. The results were astounding. [00:12:41] Speaker B: Better than the current tools? [00:12:43] Speaker C: Significantly better. It outperformed established, specialized host prediction tools like iPHoP at finding the true host species for the test viruses. [00:12:53] Speaker B: What's so amazing about that is that PST was never specifically trained to do host prediction. [00:12:58] Speaker C: Exactly. It speaks volumes about the quality of the embeddings it produces. It was trained for general representation. The fact that it excels at this downstream task means it's capturing truly fundamental biological information about the virus. The evolutionary map it creates inherently contains the ecological map. [00:13:17] Speaker B: So what are the bigger implications here? Where does this research go next? It wasn't just built for viruses, right? [00:13:22] Speaker C: No, not at all. The architecture is totally agnostic. You can feed it any set of proteins. The authors are very clear that it could be readily applied to create a foundation model for all of microbial genomics. [00:13:32] Speaker B: Bacteria, archaea, which face a lot of the same problems with divergence and annotation gaps. [00:13:38] Speaker C: For sure. This could really revolutionize how we annotate genes across the entire microbial tree of life. [00:13:43] Speaker B: What about the limitations? What did the authors say needs to be improved? [00:13:47] Speaker C: They were very transparent about that. One clear path for improvement is that PST uses fixed embeddings from ESM2. They think performance could be boosted even more by fine-tuning that input PLM at the same time as the main PST model. [00:14:02] Speaker B: So training the whole system end to end. [00:14:04] Speaker C: Right. And another idea was to use a dual training objective, one that focuses on protein–protein relationships and another on genome–genome relationships, to get an even finer-grained map. [00:14:15] Speaker B: And before we get to our take-home message, they did include a section on biosecurity. With a powerful viral model like this, what was their assessment of the risk? [00:14:24] Speaker C: They did a full assessment and concluded the risk is low. The main reason is that PST works at the level of a set of proteins.
That makes it really difficult to use it for de novo generation of a full, functional viral genome. Plus, the training data itself contained a tiny fraction of human pathogens, less than 0.2%. They also consulted external experts who agreed that the huge scientific benefit of releasing the model and code far outweighs any low theoretical risk. [00:14:50] Speaker B: This is truly foundational work, then. The Protein Set Transformer, PST, really seems to solve this dark matter problem in viromics by moving beyond simple sequence matching. [00:15:01] Speaker C: It absolutely does. It proves that by intelligently processing the genomic context of these protein sets, where they are, how they're oriented, which ones are important, PST creates a far superior foundation model for understanding viral evolution and function. [00:15:16] Speaker B: So the central insight is that in these hyper-diverse genetic spaces, contextual awareness is really the key. [00:15:21] Speaker C: It's everything. It's the key to an accurate genomic interpretation. [00:15:25] Speaker B: So that leaves us with a final thought. Since PST proved so effective at accurately linking these incredibly diverse viruses to their hosts based purely on this conserved functional organization, how might this new context-aware approach fundamentally change how we design targeted interventions like phage therapies for diseases related to the microbiome? This episode was based on an open-access article under the CC BY 4.0 license. You can find a direct link to the paper and the license in our episode description. If you enjoyed this, follow or subscribe in your podcast app and leave a five-star rating. If you'd like to support our work, use the donation link in the description. Now stay with us for an original track created especially for this episode and inspired by the article you've just heard about. Thanks for listening, and join us next time as we explore more science, base by base. [00:16:19] Speaker A: Sat time fragments drifting in the tide of SL. Patterns whisper in the silent flow. Signals Learning where the lost ones go go Every break becomes a line Every line becomes a sign Shadows fold into the frame Naming pieces with no names Constellations made of code Lighting paths we never knowed Pulling distance into sight Turning noise to shape and light in the drift of endless space Hidden worlds fall into place. Proteins gather in a shifting set. Silent stories we haven't read yet. The chaos of the viral night Something find the way to. If the scattered stars align every meaning falls in time what was Distance starts to bend Finding structure in the air. Constellations made of cone Lighting paths we never known Pulling distance into sight Turning noise to shape and light in the drift. Hidden world fall into place.

Other Episodes