Rocklin Lab Releases Megascale Open Protein Stability Dataset to Advance Biomolecular AI

Supported by the OpenFold Consortium, the MGnify Stability Dataset provides folding stability measurements for 1.8 million diverse protein domains, including crucial negative data needed to train better open foundation models.

BERKELEY, Calif.--(BUSINESS WIRE)--The Rocklin Lab at Northwestern University today announced the release of the MGnify Stability Dataset, a large-scale experimental resource containing folding stability measurements for 1.8 million diverse protein domains. The release builds on earlier Rocklin Lab megascale stability work by expanding both the scale and diversity of experimentally measured protein domains. The dataset, generated using cDNA display proteolysis, is now available to the research community to accelerate the development of improved models for protein stability prediction. The work was supported in part by the OpenFold Consortium, which sponsors the Rocklin Lab as part of its broader mission to advance open biomolecular AI.

Protein folding stability is a foundational property in biology and protein engineering, influencing whether a protein folds correctly, remains functional, avoids aggregation, or can be used successfully in therapeutic and biotechnology applications. Despite its importance, accurately predicting absolute stability – a direct measure of how energetically favorable it is for a protein sequence to adopt and maintain its folded state – has remained a long-standing challenge, in part due to the complexity of folding and the limited quantity of experimental measurements. Importantly, the dataset includes both stable and unstable proteins, providing the kind of negative data that is often missing from public biological datasets. For machine learning, those failure examples are not noise; they are essential training signal for learning the boundary between sequences that fold and sequences that do not.

The study was led by Gabriel Rocklin, an OpenFold Principal Investigator and Assistant Professor in the Department of Pharmacology and Center for Synthetic Biology at Northwestern University Feinberg School of Medicine, and Sergey Ovchinnikov, Assistant Professor of Biology at MIT. Rocklin's lab develops high-throughput experimental and computational methods to understand protein folding, stability, and design, with an emphasis on generating the large-scale biophysical datasets needed to train more accurate machine learning models.

Working with Prof. Rocklin, co-lead researcher Kotaro Tsuboyama (now a Lecturer at the Institute for Industrial Science at the University of Tokyo) created the MGnify Stability Dataset by experimentally analyzing 1.8 million diverse protein domains. These domains were primarily drawn from the MGnify metagenomic database and span more than 200,000 sequence families, a vast increase in the diversity of folding stability data. Co-lead researcher Yehlin Cho applied these data to develop the predictive models SaProtΔG and ESM3ΔG. Unlike most stability models, which are limited to predicting the effects of mutations, these models accurately predict stability for most small protein domains, demonstrating how large and diverse folding stability data can substantially improve predictions of protein folding stability, a long-standing challenge.

“It’s incredibly exciting to combine advances in computational protein modeling with this new massive biophysical dataset to accurately predict stability,” said Gabriel Rocklin, co-corresponding author of the study. “We couldn’t have done this without the huge scale and diversity of the new experimental data. At the same time, the recent advances in deep learning models for proteins enabled us to best leverage the data for accurate predictions.”

To test whether the dataset could support useful predictive models, the researchers benchmarked SaProtΔG and ESM3ΔG across several real-world applications. The models predicted the effects of substitutions, insertions, and deletions; recovered stability trends associated with thermophilic organisms; improved discrimination between stable and unstable computationally designed proteins; and correlated with nanobody aggregation temperature despite not being trained on nanobody data.

The authors note that there are opportunities to further improve the dataset and the models. The MGnify Stability Dataset is currently restricted to domains 60–80 amino acids in length, and the experimental stabilities were resolved up to approximately 5 kcal/mol. Additional experimental data and new methods will be needed to improve predictive performance for larger proteins and for highly stable proteins.

Datasets of this kind are central to OpenFold’s roadmap because open foundation models require open, high-quality experimental data. Structure prediction alone is not enough for the next generation of biomolecular AI. Models must also learn biophysical properties such as stability, folding, aggregation risk, and designability. By supporting experimental groups like the Rocklin Lab, OpenFold aims to help build the open data layer needed for more predictive, reproducible, and broadly accessible AI for biology and drug discovery.

“This is exactly the kind of dataset the field needs to build better biomolecular AI,” said Woody Sherman, Chief Innovation Officer at PsiThera and Chairperson of the OpenFold Executive Committee. “Large, carefully generated experimental datasets, including negative data, are essential for moving beyond models that infer structure toward models that understand the biophysical properties that make proteins fold, function, and become useful for biology and drug discovery. OpenFold was glad to support the Rocklin Lab and help make this resource available to the community.”

About OpenFold

OpenFold is a nonprofit AI research consortium of academic and industry partners whose goal is to develop free and open-source software tools for biology and drug discovery. It is hosted as a project of the Open Molecular Software Foundation (OMSF). Membership is open to organizations across biotech, pharma, synthetic biology, software/technology, academia, and nonprofit research.

For more information, please visit OpenFold’s website.

Accessing the Dataset

To access the dataset and manuscript, please visit MGnify Stability Dataset.

Contacts

Media Contact: Mallory R. Tollefson, Ph.D.
mallory.tollefson@omsf.io
