AI Protein Sequencing Breakthrough: Decoding Peptides Faster!
Plus: Reusable DNA Circuits and More!
Welcome to Plenty of Room!
Today, we explore proteomics, the large-scale study of proteins, through a new AI model that improves peptide sequencing! I had to learn a lot for this, but it was worth it.
Plenty of Room is your guide to AI-driven protein design, DNA nanotech and more.
Love this issue? Spread the knowledge and share it!
Let’s dive right in.

InstaNovo is a new AI tool helping researchers decipher protein sequences, with applications in diagnostics and more. Image credits: Annekatrine Kirketerp-Mølle.
The Protein Sequencing Promise
In proteomics, the large-scale study of proteins, researchers want to know the sequence of amino acids in a protein or a peptide (a shorter chain of amino acids). By analogy with its DNA cousin, this is called protein sequencing. And let’s not forget why they do it (I do forget it, sometimes):
Biomedical Research and Diagnostics: Many diseases are linked to specific protein mutations or variants. To find them, we need the full protein sequences.
Drug Discovery: Of course! Peptides are often used in therapeutics: like the antibiotics we have seen, GLP-1, or insulin. Sequencing helps design better drugs, optimizing for stability or bioavailability.
Functional Studies: Proteins carry out most cellular functions: to understand how, we need their sequence!
De Novo Sequencing: This is the cool stuff. Figuring out a peptide’s sequence without a reference. It’s essential to discover new biomolecules from venoms, microbiomes or ancient samples. But we will talk more about this later!
Mass Spectrometry and De Novo Sequencing: The Challenge
Mass spectrometry (MS) is the go-to method for sequencing proteins. First, the proteins are digested into smaller peptides using enzymes. These peptides are then ionized (in proteomics, typically by electrospray ionization), which gives them an electric charge. The charged peptides are steered through the instrument by electric fields, and a detector reads their mass-to-charge ratio (m/z). The result is a mass spectrum: a graph of signal intensity against mass-to-charge ratio for the different pieces!
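In proteomics, researchers often use tandem mass spectrometry (MS/MS): it’s exactly what it sounds like, you run two analyses one after the other. The first step measures the mass-to-charge ratio of the intact peptide. The peptide is then broken into fragments, and the second step measures those fragments, whose mass differences reveal the order of the amino acids.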
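To make the two measurements concrete, here is a minimal Python sketch of the arithmetic behind them. It assumes standard monoisotopic residue masses, singly charged b- and y-type fragment ions (the prefix and suffix pieces a peptide commonly gives when it breaks along its backbone), and no modifications. The function names are mine, and PEPTIDE is just a toy sequence.

```python
# Monoisotopic residue masses in daltons (standard values, no modifications).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # H2O added to the residues of an intact peptide
PROTON = 1.007276   # mass of each charge-carrying proton


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def precursor_mz(seq, charge):
    """First MS step: m/z of the intact peptide carrying `charge` protons."""
    return (peptide_mass(seq) + charge * PROTON) / charge


def fragment_ladders(seq):
    """Second MS step: singly charged b (prefix) and y (suffix) ion m/z values."""
    b_ions, y_ions = [], []
    for cut in range(1, len(seq)):
        prefix = sum(RESIDUE_MASS[aa] for aa in seq[:cut])
        suffix = sum(RESIDUE_MASS[aa] for aa in seq[cut:])
        b_ions.append(prefix + PROTON)
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    print(round(precursor_mz("PEPTIDE", 2), 3))
    b, y = fragment_ladders("PEPTIDE")
    print([round(m, 3) for m in b])
    print([round(m, 3) for m in y])
```

Neighboring peaks in either ladder differ by exactly one residue mass, which is why the second measurement carries the order of the amino acids.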
While that might sound complicated, it’s actually the easy part. The real challenge? Figuring out what these mass-to-charge signals mean: which peptide do they belong to? Often scientists compare the measured spectra to known protein sequences in a database. But what if:
You’re studying an unknown organism with no reference genome?
You’re exploring mutated or modified proteins (like cancer antigens)?
Or working with synthetic or engineered proteins?
In these cases, databases can’t help you. And that’s where de novo sequencing comes in: it reconstructs the peptide sequence directly from the spectral data, without using databases! Very powerful: also, very hard. Why?
Spectra can be noisy or incomplete
Post-translational modifications make things harder
Long peptides are difficult to sequence
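A toy sketch shows why incomplete spectra hurt. Assume an idealized, noise-free ladder of singly charged prefix (b-type) ions: the gap between neighboring peaks is then one residue mass, and the sequence can be read off directly. The function names and the 0.02 Da tolerance are illustrative choices of mine; real de novo tools, InstaNovo included, are far more sophisticated than this.

```python
# Monoisotopic residue masses in daltons (standard values, no modifications).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276


def ideal_ladder(seq):
    """Singly charged prefix-ion m/z for every prefix of seq (idealized)."""
    total, ladder = PROTON, []
    for aa in seq:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def read_ladder(peaks, tol=0.02):
    """Turn each gap between neighboring peaks into the residue(s) it could be."""
    positions = [PROTON] + sorted(peaks)
    calls = []
    for low, high in zip(positions, positions[1:]):
        gap = high - low
        hits = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - gap) <= tol]
        calls.append("/".join(hits) if hits else "?")
    return calls


if __name__ == "__main__":
    ladder = ideal_ladder("GAS")
    print(read_ladder(ladder))               # complete ladder
    print(read_ladder(ladder[1:]))           # first peak missing
    print(read_ladder(ideal_ladder("VIP")))  # leucine/isoleucine ambiguity
```

With the complete ladder the toy reads G, A, S. Drop the first peak and the combined G + A gap (128.05857 Da) is indistinguishable from glutamine (128.05858 Da), so it reads Q, S instead. Leucine and isoleucine have identical masses, so even a perfect ladder cannot tell them apart.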
To summarize, the traditional workflow goes something like this:
Digest proteins into peptides (using enzymes like trypsin).
Analyze the peptides via tandem mass spectrometry (MS/MS).
Reconstruct the protein-level information by:
Matching to known proteins via databases or
Inferring novel protein sequences, in de novo approaches.
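The first step of this workflow is easy to sketch in code. The snippet below applies the usual textbook rule for trypsin (cut after lysine, K, or arginine, R, unless the next residue is proline, P) to a made-up sequence. It assumes perfect digestion with no missed cleavages, which real samples rarely achieve.

```python
def trypsin_digest(protein):
    """Split a protein after every K or R that is not followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Whatever follows the last cleavage site is the final peptide.
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence, chosen only to exercise the rule.
    print(trypsin_digest("MKTAYIAKQRPLGR"))
    # ['MK', 'TAYIAK', 'QRPLGR']
```

Note that the R in QRPLGR is not cut, because it is followed by a proline.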
InstaNovo: AI for Peptide Sequencing
Enter today’s paper. To tackle this challenge, the team developed InstaNovo, a new framework for de novo peptide sequencing that leverages deep learning and diffusion models (we have seen them applied to protein design before).
To tell the whole story, they didn’t just create one tool, but two:
InstaNovo (IN): A transformer-based model that predicts peptide sequences directly from the mass spectra.
InstaNovo+ (IN+): A diffusion model that refines predicted sequences through iterative denoising, inspired by generative AI models (like DALL·E).
Okay, this sounds cool, but what does it actually mean? Because I’m no expert, and this sounds like a word-soup.
Model Architecture
InstaNovo
Based on transformers, the same architecture behind ChatGPT. Transformers are great at handling sequential data, like text or protein sequences. They look at all parts of a sequence at once and figure out which parts matter for understanding the context.
The model takes MS/MS spectra, which are treated as a set of mass-to-charge ratio and intensity pairs, and outputs the most likely peptide sequence for it. Very powerful!
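What “a set of mass-to-charge ratio and intensity pairs” might look like in practice can be sketched in a few lines. This is a generic preprocessing step of the kind often applied before a spectrum goes into a model: keep the strongest peaks and rescale their intensities. The function name, the peak limit, and the numbers are my own assumptions, not InstaNovo’s actual pipeline.

```python
def prepare_peaks(mz_values, intensities, max_peaks=3):
    """Keep the most intense peaks and scale intensities to the strongest one."""
    peaks = sorted(zip(mz_values, intensities), key=lambda p: p[1], reverse=True)
    peaks = peaks[:max_peaks]
    if not peaks:
        return []
    strongest = peaks[0][1]
    # Return (m/z, relative intensity) pairs ordered by m/z.
    return sorted((mz, inten / strongest) for mz, inten in peaks)


if __name__ == "__main__":
    # Made-up spectrum with four peaks; the weakest one is dropped.
    print(prepare_peaks([100.0, 200.0, 300.0, 400.0], [10.0, 50.0, 5.0, 25.0]))
    # [(100.0, 0.2), (200.0, 1.0), (400.0, 0.5)]
```

Each pair is then something a transformer can treat as one token, much like a word in a sentence.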
InstaNovo+
This is a diffusion model: these models learn to create data (images, or peptides in this case) by reversing a process that gradually adds noise to it. Imagine taking a clear image and slowly adding random noise over many steps until it’s totally unrecognizable. The model learns how to reverse this process: starting with pure noise and gradually transforming it into a realistic output.
This model takes an initial peptide guess and refines it step by step, reversing the “noise” to produce a more accurate prediction.
This two-stage approach significantly improved accuracy and robustness, especially on difficult spectra.
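To make the noising idea concrete, here is a toy sketch of only the forward (corrupting) half of a diffusion process on a peptide string. Each residue is swapped for a random amino acid with a probability that grows with the step number. The schedule and the function name are my own simplifications, not InstaNovo+’s actual formulation; the real model is a neural network trained to run this process in reverse.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def corrupt(seq, step, total_steps, rng):
    """Replace each residue with a random one with probability step / total_steps."""
    noise_level = step / total_steps
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < noise_level else aa
        for aa in seq
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs print the same thing
    for step in (0, 5, 10):
        print(step, corrupt("PEPTIDE", step, 10, rng))
```

At step 0 the peptide is untouched; at the final step every position has been resampled and the original is essentially gone. Refinement runs the other way: start from a noisy guess and clean it up step by step.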
Benchmarking and Results
The authors tested their new models on 8 different challenging applications, outperforming the current best model, Casanovo, in all of them (but Casanovo wins in the name department):
Peptide Prediction Accuracy:
Measured on HeLa cell proteome data, IN+ improved correct matches by 42%, especially for longer peptides.
Immunopeptidomics:
A (crazy) 175% boost in detecting HLA-bound peptides, key in cancer and vaccine research.
Antibody and Nanobody Sequencing:
InstaNovo+ accurately reconstructed nanobodies and antibodies, showing potential for therapeutic design!
Microbiome Identification:
The new models identified peptides from uncharacterized organisms from human wound fluid samples: gross but necessary!
The Dark Proteome:
InstaNovo and InstaNovo+ detected new peptides from snake venom and undiscovered pathogens! Exciting for drug discovery!
In Conclusion: Limitations and Future Work
Such a cool paper! I had to study a lot to understand it, but I guess that’s what I am (not) paid to do. I am liking this trend of applying AI to improve workflows in biology! InstaNovo and InstaNovo+ are a significant step forward in de novo peptide sequencing.
Of course, even these models still have limitations:
Diffusion models are computationally heavy and quite slow.
Evaluations are mostly in silico. I have the feeling it’s very hard (and expensive) to do extensive testing, but probably experimental validation will be needed for some applications, like drug development.
Of course, performance may vary on highly modified peptides or low quality spectra.
But all in all, great work! And de novo protein sequencing has massive potential, especially with models like InstaNovo that work across organisms and applications. They could advance microbiome research in diseases like Crohn’s, or help with the discovery of novel proteins for different uses!
Don’t hesitate, go read the whole paper here! And as always, thank you for reading! What are your thoughts on this paper? Are you excited about de novo sequencing? Reply and let me know!
More Room:
Reusable DNA Circuits: If you are worried about the sustainability of your DNA circuits (or their cost), we have a fix for you. This study introduces a method to make enzyme-driven DNA logic circuits reusable, addressing a key limitation in DNA computing. By using exonuclease III to selectively digest double-stranded DNA while preserving gate strands, the system resets without producing waste. The approach enables multiple reuses, up to four times in cascaded circuits and three times in a square root circuit, boosting efficiency, error correction, and cost-effectiveness in molecular computation.
Nanoscale 3D Printing: 3D printing is a pretty cool way of thinking about manufacturing: what if we could bring it to the nanoscale? DNA nanotech is one way to do it, but there are alternatives. This review explores 3D nanoprinting as a bottom-up approach to fabricating complex, functional nanostructures with diverse materials, offering an alternative to traditional nanolithography. It highlights recent advances, key challenges, and opportunities in material selection, device integration, and scalability, while discussing the potential impact on research and industry.
Ions and Counterions for DNA Nanotech: If you have ever done any work with DNA, you know that magnesium is the ion of choice. But what if it doesn’t have to be? This study shows that DNA nanostructures can be assembled at constant temperatures (4–50 °C) using various counterions, not just magnesium. Ion type and temperature affect the assembly of DNA motifs and 3D crystals. Notably, nickel ions enabled low-temperature assembly where standard methods fail. The assembled structures are biocompatible, suggesting new possibilities for DNA nanotech in environments with limited magnesium use.