Genome, meet AI: Evo is here
Plus: night and day science, and some more DNA crystals.
Welcome to Plenty of Room!
Today, we have a cool AI model that learned all about genomic DNA! It’s a foundation model trained on prokaryotic genomes, and it can do a surprising amount with just that.
Plenty of Room is your guide to cutting-edge news about molecular machines.
Already subscribed? Share it with a friend who might find this interesting! It really helps.
New here? Just go ahead and subscribe!
Let’s get into it now.
Genome, meet AI: Evo is here
Evo is a foundation model capable of integrating sequence prediction and generation across DNA, RNA and protein. Credits: together.ai
I’ve talked a lot about how generative AI is changing the protein design landscape (maybe too much? Come on, it’s cool stuff!). Something I have seen a lot less is generative AI applied to the DNA level. And this time I don’t mean DNA nanotech, but good old-fashioned genomic DNA. And there are good reasons for this:
Data availability: While protein datasets are abundant, genomic datasets for AI training are less accessible.
Sequence length: Proteins are usually just a few hundred amino acids long. In contrast, genomic DNA can stretch to billions of nucleotides, and it requires single-nucleotide precision, because even a tiny mutation can drastically change how genes and proteins function.
Despite these challenges, the genome holds an unparalleled treasure trove of information: evolutionary history, RNA and protein production instructions, and so much more. It’s a goldmine for research, diagnostics, and synthetic biology. So, how to solve this problem?
Well, the authors of today’s paper have an answer. This paper introduces Evo, a foundation model (per Wikipedia, a model trained on broad data that can be applied across a wide range of use cases) trained on almost 3 million prokaryotic and phage genomes, for a whopping 300 billion nucleotides! The model integrates sequence prediction and generation across DNA, RNA and protein at single-nucleotide resolution over long genomic sequences. But let’s take a closer look and understand why this is exciting!
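To give a feel for what single-nucleotide resolution means in practice, here is a tiny Python sketch of character-level tokenization, where every A, C, G and T becomes its own token. The vocabulary below is my own toy example, not the actual Evo tokenizer:

```python
# Minimal illustration of single-nucleotide (character-level) tokenization.
# The vocabulary is a made-up example, not the tokenizer Evo actually uses.

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = unknown base
INV_VOCAB = {i: b for b, i in VOCAB.items()}

def encode(sequence: str) -> list[int]:
    """Map each nucleotide to one integer token (single-nucleotide resolution)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]

def decode(tokens: list[int]) -> str:
    """Map token ids back to a DNA string."""
    return "".join(INV_VOCAB[t] for t in tokens)

if __name__ == "__main__":
    dna = "ATGGCGTTTAACCAG"
    ids = encode(dna)
    print(ids)          # one token per base: [0, 3, 2, 2, ...]
    print(decode(ids))  # round-trips back to the original sequence
```

The point is simply that the model never sees words or k-mers, only individual bases, which is why a single point mutation is visible to it.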
The first innovation that the team introduces is the use of the StripedHyena architecture (funny name). Without getting too technical, it works by combining different mechanisms:
Attention Mechanisms: These capture long-range dependencies such as regulatory elements that control genes thousands of base pairs away.
Data-Controlled Convolutions: These excel at recognizing local patterns like codons, operons, or short motifs and are computationally efficient.
This hybrid design is important because genomic sequences are long and require attention to both local context and global structure. The StripedHyena architecture balances computational efficiency with biological relevance, handling sequences up to 131 kilobases long during training (which is huge for this type of model).
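To make the hybrid idea concrete, here is a toy PyTorch sketch that interleaves cheap gated-convolution blocks (local motifs) with occasional attention blocks (long-range dependencies). This is my own simplified illustration of the concept, not the real StripedHyena code; all sizes are placeholders and causal masking is omitted for brevity:

```python
# Toy illustration of a hybrid sequence model: mostly convolutional blocks,
# with attention blocks interleaved every few layers. Not the real StripedHyena.
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Local pattern matching: depthwise conv with a data-dependent gate."""
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                                 # x: (batch, seq_len, dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolve along the sequence
        return x + h * torch.sigmoid(self.gate(x))        # gated residual update

class AttentionBlock(nn.Module):
    """Global context: standard multi-head self-attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        return x + h

class HybridModel(nn.Module):
    """Interleave conv blocks with an attention block every `attn_every` layers."""
    def __init__(self, vocab_size=5, dim=64, depth=8, attn_every=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList(
            AttentionBlock(dim) if (i + 1) % attn_every == 0 else GatedConvBlock(dim)
            for i in range(depth)
        )
        self.head = nn.Linear(dim, vocab_size)  # predict the next nucleotide

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.head(x)

logits = HybridModel()(torch.randint(0, 5, (1, 128)))  # (1, 128, 5)
```

The design intuition: convolutions are cheap enough to run at every layer over very long sequences, while a few attention layers are enough to tie distant regions together.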
So, once they trained their model, what did they do with it? Here comes the interesting stuff. Even though the model was trained only on DNA, it made good predictions at the DNA, RNA and protein levels, showing that it had learned all three just by looking at the genome. They focused in particular on three biological outcomes:
Impact of mutations on protein and non-coding RNA functions.
Effects of promoter and ribosome-binding site sequences on gene expression.
Identification of essential genes in genomes, offering insights into genetic viability.
All of this was evaluated on datasets from previous studies and compared with specialized models: Evo generally performed as well as or better than the specialized models, and its single-nucleotide resolution enables very fine-grained analyses.
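For the mutation experiments, the usual zero-shot recipe (and I'm sketching the general idea here, not the paper's exact evaluation pipeline) is to compare the log-likelihood the model assigns to the wild-type sequence versus the mutated one; a big drop suggests a damaging mutation. Assuming any causal language model with the standard Hugging Face interface, it looks roughly like this:

```python
# General recipe for zero-shot mutation-effect scoring with a causal LM:
# score(mutant) - score(wild type) in log-likelihood. A sketch, not Evo's own pipeline.
import torch

@torch.no_grad()
def sequence_log_likelihood(model, tokenizer, sequence: str) -> float:
    """Sum of log-probabilities the model assigns to each token of `sequence`."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    logits = model(ids).logits                           # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]                                  # next-token targets
    token_ll = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

def mutation_effect(model, tokenizer, wild_type: str, mutant: str) -> float:
    """Negative values mean the model finds the mutant less 'natural' than the wild type."""
    return (sequence_log_likelihood(model, tokenizer, mutant)
            - sequence_log_likelihood(model, tokenizer, wild_type))
```

No fine-tuning, no labels: the sequence's likelihood under the model is the prediction.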
But they didn’t stop at the predictive level: they also explored the generative abilities of their new model, trying to design realistic and functional biological sequences (there’s a small sketch of what generation looks like in code right after this list). They tested their model in different ways:
CRISPR-Cas Systems: Evo designed a new CRISPR system, EvoCas9-1, which matched or outperformed natural systems in lab tests.
Transposons: Evo created realistic transposable elements, useful for studying gene mobility or engineering genome-editing tools.
Synthetic Genomes: Evo generated bacterial genomes up to 1 megabase in size, preserving key features like:
Operon structures for co-regulated genes.
Realistic coding densities.
Phylogenetic patterns that align with evolutionary principles.
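Mechanically, “generating” here just means autoregressive sampling: prompt the model with some DNA context and let it extend the sequence one nucleotide at a time. Here is a minimal sketch using the generic Hugging Face generation API; the model name and prompt are placeholders, not the paper’s actual setup:

```python
# Autoregressive generation from a genomic language model: prompt, then sample
# one token (nucleotide) at a time. Model name and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-org/some-genomic-lm"  # placeholder, swap in a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "ATGAGCAAAGGAGAAGAAC"           # placeholder DNA context to condition on
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=500,   # extend by up to 500 nucleotides
        do_sample=True,       # sample instead of greedy decoding
        temperature=0.8,      # lower = more conservative sequences
        top_k=4,              # DNA has a tiny alphabet anyway
    )

print(tokenizer.decode(out[0]))
```

The clever part in the paper is the prompting and the scale (up to megabase-length outputs), not the sampling loop itself.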
Of course, the model also presents limitations and challenges.
Computational Resources: Training and deploying models at this scale require substantial computational power, potentially limiting accessibility.
Validation: Despite promising results, further experimental validation is needed to fully confirm Evo’s generalizability across biological systems.
In addition, the authors acknowledge the ethical concerns around generative biology tools: for example, they excluded sequences from organisms that infect eukaryotes, and they point out that policymakers need to discuss these issues before they become a problem.
But all in all, this was a very cool paper! And very dense. A model like this can have applications in practically every aspect of biology (DNA is kind of fundamental), but some that come immediately to mind are:
Synthetic Biology: Evo facilitates the design of new molecular systems, enabling applications like custom CRISPR systems with enhanced functionality or synthetic genomes for industrial applications.
Drug Development: Evo’s capabilities could assist in creating new RNA vaccines or engineered proteins.
So, if this intrigued you, don’t hesitate to go and read the paper for yourself!
In other news:
More DNA crystals: If you enjoy DNA crystals, this is the review for you. It covers advances in structural DNA nanotechnology for creating precise 2D and 3D lattices from DNA nanostructures. These lattices enable applications in photonic crystals, nanoelectronics, and bioengineering by leveraging self-assembly to organize nanomaterials with molecular precision. Cool!
Day science vs night science: Sometimes, older papers resurface in the old Twitter sphere (X-sphere?). This time, it’s about balancing day and night science to drive innovation while maintaining trust in one’s work. What are those, you ask? Well, day science is the structured, hypothesis-driven research we are used to, while night science is the more unstructured, interdisciplinary exploration that fosters creativity and new ideas. Interested? Jump on the paper!
Silver + DNA = Love: Dumb pun aside, this cool paper incorporates silver ions (Ag⁺) into DNA duplexes, using cytosine-Ag⁺-cytosine (dC:Ag⁺:dC) base pairs as triggers for self-assembly. This method lays the groundwork for metalated DNA nanostructures with applications in high-precision nanotechnology and electronics. So maybe your next laptop will be based on DNA electronics! Or maybe just better sensors, I don’t know.
Not yet a subscriber to Plenty of Room? Sign up today — it’s free!
You think a friend or a colleague might enjoy reading this? Don’t hesitate to share it with them!
Have a tip or story idea you want to share? Email me — I’d love to hear from you!
You have something you would love me to cover? Just reach out here or on my social!