Machine learning and AI are revolutionizing protein design.

Is it time for AI in genome research? Today’s paper brings Evo 2, a sequel to the first genomic foundation model! More tokens, more nucleotides, and more power! A super exciting read.

Don’t keep this newsletter a secret: Forward it to a friend today!

Was this email forwarded to you? Subscribe here!

Evo 2: Genome-Scale AI

Researchers introduce Evo 2, a genomic foundation model that uses AI to learn patterns in DNA, predict biological features, and generate genome-scale sequences. Image credits: Nature.

DNA: the Language of Life

All life runs on DNA.

Genomic DNA directs every cell. With just 4 nucleotides, it encodes everything, from molecular machines to full organisms and beyond! It stores evolutionary history, molecular blueprints, regulatory instructions… A biological goldmine!

And it’s not static: changes in genomes drive evolution, enable organisms to adapt, and can be the difference between physiological and pathological states.

So, no surprise: genomic DNA is central to biomedical research, diagnostics, and synthetic biology! Of course, it’s extremely fascinating. Also, incredibly complex! And here the problems start.

How do we extract meaning from genomic DNA?

Studying Genomic DNA, with AI help

AI is changing everything, from writing emails to studying biology.

And AI is great at finding patterns in enormous datasets. We’ve already seen the power of machine learning in protein design. It was enough to win the 2024 Nobel Prize in Chemistry!

Of course, researchers turned these tools towards DNA.

2 years ago, scientists at the Arc Institute published Evo, a genomic foundation model (per Wikipedia, a deep learning model that can be adapted to a wide range of tasks), trained on 300 billion nucleotides.

Evo learned relationships between distant parts of the genome and could generate realistic DNA sequences, RNA, and proteins. Then came Evo 1.5. Focused on protein function, it even created new prokaryotic toxins and anti-CRISPR proteins!

Amazing! But the big limitation?

Original Evo only worked on prokaryotic genomes.

Expanding Evo for all Life

This gave scientists a clear goal!

And they delivered. Today’s paper introduces Evo 2, a biological foundation model trained on an incredible 9 trillion DNA base pairs (!) from all domains of life, from bacteriophages to eukaryotes (including humans!).

Evo 2 captures the evolutionary patterns in DNA, learning the molecular → organismal “grammar” across all domains of life. This enables:

  • Prediction: mutational effects, RNA/protein properties.

  • Design: sequence generation.

Unlike task-specific tools, Evo 2 is built to be a generalist. A jack of all trades across biological scales and domains of life!

Let’s see how it works!

Training Giant Models

The team created 2 versions:

  • Evo 2 7B: Trained on 2.4 trillion tokens.

  • Evo 2 40B: Trained on the entire dataset of 9 trillion tokens!

Here, you can roughly think of a token as a single nucleotide. This means the bigger model saw far more sequence during training, which gives better performance, but at the cost of more compute.
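To make the nucleotide-as-token idea concrete, here’s a toy sketch in Python. The vocabulary and token IDs are made up for illustration; this is not Evo 2’s actual tokenizer.

```python
# Toy single-nucleotide tokenization: one base -> one token ID.
# The IDs below are illustrative, not the model's real vocabulary.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize(dna: str) -> list[int]:
    """Map a DNA string to a list of token IDs, one per nucleotide."""
    return [VOCAB[base] for base in dna.upper()]

print(tokenize("ACGT"))  # -> [0, 1, 2, 3]
```

So a genome of 9 trillion nucleotides is, to a first approximation, 9 trillion tokens.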

But the real advance is context. “Context” here is how much DNA sequence the model can consider at once when making a prediction.

The training happened in two phases:

  1. Pretraining: Shorter contexts (8k tokens) to learn local interactions in the genome, like functional elements.

  2. Midtraining: The context window is expanded to 1 million (!) tokens to capture long-range genome structures (operons, prophage regions, etc.).
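To get a feel for what those context sizes mean, here’s a toy sketch of splitting a genome into fixed-size windows (illustrative only; this is not the paper’s actual data pipeline):

```python
def chunk(sequence: str, context: int) -> list[str]:
    """Split a sequence into non-overlapping windows of at most `context` tokens."""
    return [sequence[i:i + context] for i in range(0, len(sequence), context)]

genome = "ACGT" * 6000  # a toy 24,000-nucleotide "genome"

# Pretraining-style short contexts: the genome is split into several windows,
# so the model only ever sees local neighborhoods at once.
pretrain_windows = chunk(genome, 8_192)

# Midtraining-style 1M context: the whole toy genome fits in a single window,
# so long-range structure (operons, prophage regions, ...) is visible at once.
midtrain_windows = chunk(genome, 1_000_000)

print(len(pretrain_windows), len(midtrain_windows))  # -> 3 1
```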

But where did they get all this data?

The team actually built the dataset, creating OpenGenome2! They carefully curated genomic data from bacteria, archaea, eukarya, and bacteriophages. The result? A comprehensive dataset with nearly 9 trillion nucleotides.

And they released it as open source! So you can also go and train your own model.

But someone is missing from the party: viruses that infect eukaryotes. For biosafety reasons, the team excluded possibly harmful viruses from the dataset.

Awesome, we have our models. Let’s use them!

Predicting Genomes with Evo 2

Evo 2 detects the patterns that evolution left in DNA.

By learning the likelihood of sequences appearing in its vast training data, it can capture conserved patterns that are tied to function. If a mutation disrupts a functional site, the mutated sequence becomes less likely.

This way, the model can perform predictions without task-specific training (zero-shot prediction).
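As a toy illustration of this zero-shot scoring idea, here’s a crude k-mer frequency model standing in for Evo 2: a mutation that breaks patterns seen in the “training” sequence gets a lower likelihood. Everything here is a made-up stand-in; Evo 2’s likelihoods come from a deep neural network, not k-mer counts.

```python
import math
from collections import Counter

def train_kmer_model(reference: str, k: int = 3) -> dict:
    """Count k-mers in a reference to get a crude likelihood model.
    A toy stand-in for Evo 2's learned sequence likelihoods."""
    counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def log_likelihood(seq: str, model: dict, k: int = 3, floor: float = 1e-6) -> float:
    """Sum of log-probabilities of each k-mer; unseen k-mers get a small floor."""
    return sum(math.log(model.get(seq[i:i + k], floor))
               for i in range(len(seq) - k + 1))

def mutation_effect(ref: str, mut: str, model: dict) -> float:
    """Zero-shot score: how much less likely is the mutant than the reference?
    Negative values suggest the mutation disrupts learned patterns."""
    return log_likelihood(mut, model) - log_likelihood(ref, model)

# A repetitive "conserved" region: a point mutation breaks its expected k-mers.
reference = "ATGGCGATGGCGATGGCG" * 20
model = train_kmer_model(reference)
ref_site = reference[:30]
mut_site = ref_site[:10] + "T" + ref_site[11:]  # point mutation at position 10
print(mutation_effect(ref_site, mut_site, model) < 0)  # mutant scores lower
```

No mutation-specific training happened here: the score falls out of the likelihood model alone, which is the essence of zero-shot prediction.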

The team tested it on several tasks:

  • Evolutionary constraints

    Evo 2 captures codon usage, different genetic codes (stop codons), and gene essentiality signals across microbes. Its likelihood changes correlate with mutational scanning data across protein and RNA datasets. Evo 2 isn’t always the best on every benchmark, but it shows broad predictive power on DNA, RNA, and proteins.

  • Variant effect and clinical prediction

    Evo 2 separated pathogenic vs benign variants in the ClinVar dataset. It worked on coding variants, but it was exceptional on non-coding ones, where many other models don’t work at all!
    The team also tested it on splicing variants of BRCA1/BRCA2, genes involved in breast cancer. Evo 2 outperformed other methods on non-coding variants and was competitive on coding sites. They also combined it with a supervised model: the combination was great for predicting clinical pathogenicity!

So, Evo 2 is great for prediction! But that’s not all.

Generating New Genomes with Evo 2

Evo 2 is a generative model.

It learned the language of DNA, and it can’t wait to speak it. Typically, the researchers fed Evo 2 a DNA sequence (1-10 kb long) as a prompt, and it produced a DNA sequence matching the genomic context.
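To make the prompting idea concrete, here’s a toy autoregressive sampler that continues a DNA prompt one nucleotide at a time, using statistics learned from the prompt itself. This is a made-up stand-in: Evo 2 samples each next token from a learned neural model, not from k-mer counts.

```python
import random

def generate(prompt: str, n_tokens: int, k: int = 3, seed: int = 0) -> str:
    """Toy autoregressive DNA generation: extend the prompt one base at a
    time, sampling the next base from (k-1)-mer -> next-base statistics
    collected from the prompt. Illustrative only."""
    rng = random.Random(seed)
    # Collect next-base observations for each (k-1)-length context.
    table: dict[str, list[str]] = {}
    for i in range(len(prompt) - k + 1):
        ctx, nxt = prompt[i:i + k - 1], prompt[i + k - 1]
        table.setdefault(ctx, []).append(nxt)
    seq = prompt
    for _ in range(n_tokens):
        ctx = seq[-(k - 1):]
        choices = table.get(ctx) or list("ACGT")  # unseen context: uniform
        seq += rng.choice(choices)
    return seq[len(prompt):]  # only the newly generated continuation

prompt = "ATGGCGTTACCG" * 10
continuation = generate(prompt, 50)
print(len(continuation))  # 50 new nucleotides matching the prompt's statistics
```

The real model does the same thing at a vastly larger scale: condition on a 1-10 kb prompt, then emit new sequence token by token.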

Two applications:

  • Genome-scale generation

    So cool! 3 systems tested:

    • Human mitochondrial DNA: Using portions of the human mitochondrial sequence as prompts, Evo 2 generated plausible sequences, including genes encoding proteins that form multimeric complexes!

    • Mycoplasma genitalium: A small prokaryotic genome, around 580 kb. The researchers gave Evo 2 a 10.5 kb prompt, and it generated realistic genome-scale sequences.

    • Saccharomyces cerevisiae: They used a 10 kb prompt to generate a 330 kb sequence for yeast chromosome III! Incredible.

    These were only evaluated in silico, and the generated sequences would probably not be functional as-is. But it’s still cool!

  • Controlled design of mammalian chromatin accessibility

    The authors combined Evo 2 with a different model to design kilobase-scale sequences with targeted chromatin accessibility patterns. They synthesized and tested the designs in human cell lines, measuring accessibility with ATAC-seq. And the cells showed the designed patterns! They even encoded Morse code into these accessibility patterns.

Interpretability and Biosafety

Large language models, like Evo 2, are often seen as a black box. You give an input, something happens in them, and you get an output. This is true, but there has been a lot of work on interpretability!

The team used another model (a sparse autoencoder) to extract which of Evo 2’s internal features map to concrete biology, revealing what it learned. For example:

  • Prophage/phage spacer feature.

  • ORF/tRNA/rRNA features.

  • Protein-structure linked features (α-helix, β-sheet activations).

And more! Pretty cool.

Biosafety was also a major focus.

Because the team excluded eukaryotic viruses from the training dataset, Evo 2 cannot generate those sequences. When prompted with human virus sequences, Evo 2 spat out random nucleotides. So yeah, it really can’t work with them!
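One quick, toy way to sanity-check whether an output “looks random” is Shannon entropy: near-uniform nucleotide usage approaches the 2-bit-per-base maximum, while meaningful, biased sequence sits well below it. (A made-up check for intuition, not the paper’s actual evaluation.)

```python
import math
from collections import Counter

def entropy_per_base(seq: str) -> float:
    """Shannon entropy of the nucleotide distribution (max 2 bits for DNA)."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Near-uniform output looks "random": entropy hits the 2-bit maximum...
print(entropy_per_base("ACGT" * 100))  # -> 2.0
# ...while a biased, low-complexity sequence scores much lower.
print(round(entropy_per_base("AAAAAAAAGT"), 2))
```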

As generative biology gets more powerful, biosafety has to remain front and center.

Evo 2: A Generalist Genomic Foundation Model

An incredible advance!

Honestly, the scale of these models is mind-blowing. Ah, and the team made everything open-source: dataset, model parameters, training code, all of it! Amazing.

Evo 2 bridges prediction and design at scales that are compatible with real genomes. It covers all domains of life, it shows great versatility on its own, and it becomes even more powerful when combined with other methods!

It’s not perfect, with potential biosafety risks, ethical concerns, and few experimentally validated outputs. But it’s super cool! I can see it being transformative in:

  • Clinical variant screening: Especially where data are scarce.

  • Synthetic biology: I can’t wait to see new proteins…

  • Drug development: New RNA vaccines, engineered proteins, or who knows what!

The paper is dense and worth a read here!

If you made it this far, thank you! What do you think of AI in biology? Do you think AI has a place in biomedicine? Reply and let me know!

P.S: Know someone interested in AI and biology? Share it with them!

What did you think of today's newsletter?

Your feedback helps create the best newsletter possible!


More Room:

  • Remembering Epigenetics: My memory is not the best. Maybe biotech has a solution! Epigenetic changes have been linked to memory. In this study, the authors investigate whether epigenetic changes at a single genomic site can control memory. Using CRISPR-based epigenetic editing in memory-related neurons, they show that modifying the Arc gene promoter is both necessary and sufficient to regulate memory expression. This control is reversible and works across different memory stages, demonstrating that site-specific epigenetic changes can directly govern learned behavior.

  • AI Learns Protein Interactions: Protein-protein interactions make life work. But identifying them in humans is challenging due to weak coevolutionary signals. In this study, the authors combine massive genomic data (30 PB) with a new deep-learning model trained on predicted structures to improve detection. They screen 200 million protein pairs and predict ~17,800 interactions with high precision, including thousands not previously detected. This work expands the human interactome and provides new insights into protein function and disease mechanisms.

  • Complex Dynamic Genomes: You wanted some more genome research? There you go! Understanding the organization of complex genomes is challenging, particularly for kinetoplast DNA (kDNA), a network of interlinked DNA circles. In this study, the authors use dCas9 linked to quantum dots to label and track different DNA components within kDNA. They find that maxicircles localize to the network periphery and exhibit subdiffusive motion, which may contribute to structural properties like buckling. The method also enables measurement of network stiffness and provides a general tool to study genome organization and dynamics in complex systems.
