The tremors that shook the life sciences community last December were mere prelude. Back then, at a biennial protein-prediction competition, an artificial intelligence (AI) tool called AlphaFold delivered an outstanding performance, predicting structures about as well as X-ray crystallography or cryo-electron microscopy (cryo-EM), the gold-standard experimental techniques. Now, AlphaFold has produced an eruption: a release of more than 350,000 protein structure predictions. Besides covering nearly the entire human proteome, the release also extends to biologically significant organisms such as Escherichia coli, fruit fly, mouse, zebrafish, malaria parasite, and tuberculosis bacteria.
The AlphaFold release was arranged by AlphaFold’s developer, DeepMind, and by DeepMind’s scientific partner, the European Molecular Biology Laboratory (EMBL). Both DeepMind and the EMBL contributed to two articles that were posted to the Nature website. The first paper—“Highly accurate protein structure prediction with AlphaFold”—details AlphaFold’s computational methods. (According to the paper’s authors, the new and improved AlphaFold can “regularly predict protein structures with atomic accuracy even where no similar structure is known.”) The second paper—“Highly accurate protein structure prediction for the human proteome”—describes how AlphaFold’s machine learning method has been applied at scale to the human proteome. (The resulting dataset, the paper’s authors reported, “covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence.”)
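To make the coverage figures concrete, here is a minimal sketch of how such percentages could be computed from per-residue confidence scores. The function name, the 0-100 score scale, the cutoffs of 70 ("confident") and 90 ("very high"), and the toy scores are all assumptions for illustration; the article does not state the cutoffs, and this is not the authors' code.

```python
def coverage_by_confidence(scores, confident=70.0, very_high=90.0):
    """Return (fraction >= confident, fraction >= very_high).

    `scores` holds one confidence value per residue. The default cutoffs
    are assumptions for this sketch, not values taken from the article.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    n_confident = sum(1 for s in scores if s >= confident)
    n_very_high = sum(1 for s in scores if s >= very_high)
    return n_confident / n, n_very_high / n


# Made-up scores for a 10-residue toy protein (not real AlphaFold output).
toy_scores = [95.2, 91.0, 88.4, 76.1, 71.3, 64.9, 52.0, 45.5, 38.2, 93.7]
confident_frac, very_high_frac = coverage_by_confidence(toy_scores)
print(f"confident: {confident_frac:.0%}, very high: {very_high_frac:.0%}")
```

Because every residue that clears the higher cutoff also clears the lower one, the "very high" group is always a subset of the "confident" group, which mirrors how the paper reports 36% of all residues within the 58%.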
DeepMind and the EMBL emphasize that they are making their predictions freely available to the community via a public database (the AlphaFold Protein Structure Database) hosted by the EMBL’s European Bioinformatics Institute. The partners anticipate that “routine large-scale and high-accuracy structure prediction will become an important tool, allowing new questions to be addressed from a structural perspective.”
In December 2020, the organizers of the Critical Assessment of Structure Prediction (CASP) recognized AlphaFold as a solution to the 50-year-old grand challenge of protein structure prediction. This achievement, together with the introduction of the AlphaFold Protein Structure Database, may come to be seen as roughly analogous to the early triumphs in human genome sequencing. Like the human genome sequence, comprehensive protein structure datasets promise to accelerate research across a variety of fields.
“Our goal at DeepMind has always been to build AI and then use it as a tool to help accelerate the pace of scientific discovery itself, thereby advancing our understanding of the world around us,” said DeepMind founder and CEO Demis Hassabis, PhD. “We used AlphaFold to generate the most complete and accurate picture of the human proteome. We believe this represents the most significant contribution AI has made to advancing scientific knowledge to date, and is a great illustration of the sorts of benefits AI can bring to society.”
At the same time the Nature articles appeared, a high-level overview of the AlphaFold network was posted to the DeepMind website. This overview explained that the AlphaFold network generates structure predictions in two stages: “Stage 1 takes as input the amino acid sequence and a multiple sequence alignment (MSA). Its goal is to learn a rich ‘pairwise representation’ that is informative about which residue pairs are close in 3D space. Stage 2 uses this representation to directly produce atomic coordinates by treating each residue as a separate object, predicting the rotation and translation necessary to place each residue, and ultimately assembling a structured chain.”
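The Stage 2 idea of placing each residue with its own rotation and translation can be illustrated with a small geometric sketch. This is not DeepMind's implementation: the local backbone coordinates are approximate illustrative values, the two frames are hypothetical stand-ins for network outputs, and all function names are invented for this example.

```python
import math

# Approximate backbone atom positions in a residue's own local frame
# (angstroms). Illustrative values assumed for this sketch; the alpha
# carbon (CA) sits at the origin.
LOCAL_BACKBONE = {
    "N": (-0.525, 1.363, 0.0),
    "CA": (0.0, 0.0, 0.0),
    "C": (1.526, 0.0, 0.0),
}


def rotation_about_z(angle_rad):
    """Return a 3x3 rotation matrix (nested lists) about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_frame(rotation, translation, point):
    """Map a local point to global coordinates: R @ point + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def place_residue(rotation, translation):
    """Place one residue's backbone atoms using its rigid-body frame."""
    return {
        name: apply_frame(rotation, translation, local)
        for name, local in LOCAL_BACKBONE.items()
    }


def assemble_chain(frames):
    """Place every residue independently, then collect them in order."""
    return [place_residue(rot, trans) for rot, trans in frames]


# Hypothetical per-residue frames standing in for predicted outputs.
frames = [
    (rotation_about_z(0.0), (0.0, 0.0, 0.0)),
    (rotation_about_z(math.pi / 2), (3.8, 0.0, 0.0)),
]
chain = assemble_chain(frames)
for index, residue in enumerate(chain):
    print(index, {name: tuple(round(x, 3) for x in xyz)
                  for name, xyz in residue.items()})
```

Because each frame is a rigid transform, the bond geometry inside a residue is preserved no matter where the residue is placed; only the relative arrangement of residues changes.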
DeepMind noted that AlphaFold can produce a 3D structure based on the representation at intermediate layers of the network. “AlphaFold’s belief about the correct structure develops during inference, layer by layer,” DeepMind continued. “Typically, a hypothesis emerges after the first few layers followed by a lengthy process of refinement, although some targets require the full depth of the network to arrive at a good prediction.”
DeepMind and the EMBL suggest that AlphaFold shows how AI can build on the discoveries of generations of scientists, from the early pioneers of protein imaging and crystallography to the thousands of prediction specialists and structural biologists who have spent years experimenting with proteins since.
“The AlphaFold database is a perfect example of the virtuous circle of open science,” said Edith Heard, director general of the EMBL. “AlphaFold was trained using data from public resources built by the scientific community, so it makes sense for its predictions to be public. Sharing AlphaFold predictions openly and freely will empower researchers everywhere to gain new insights and drive discovery.”
The ability to predict a protein’s shape computationally from its amino acid sequence—rather than determining it experimentally through years of painstaking, laborious, and often costly techniques—is already helping scientists to achieve in months what previously took years. For example, AlphaFold is being used by partners such as the Drugs for Neglected Diseases Initiative (DNDi), which has advanced research into life-saving cures for diseases that disproportionately affect the poorer parts of the world. Also, the Centre for Enzyme Innovation (CEI) is using AlphaFold to help engineer faster enzymes for recycling some of our most polluting single-use plastics.
For those scientists who rely on experimental protein structure determination, AlphaFold’s predictions have helped accelerate their research. For example, a team at the University of Colorado, Boulder, is finding promise in using AlphaFold predictions to study antibiotic resistance, while a group at the University of California, San Francisco, has used them to increase their understanding of SARS-CoV-2 biology.
“This will be one of the most important datasets since the mapping of the Human Genome,” said Ewan Birney, deputy director general of EMBL and director of EMBL’s European Bioinformatics Institute. “Making AlphaFold predictions accessible to the international scientific community opens up so many new research avenues, from neglected diseases to new enzymes for biotechnology and everything in between. This is a great new scientific tool, which complements existing technologies, and will allow us to push the boundaries of our understanding of the world.”
DeepMind and the EMBL indicate that the AlphaFold database and system will be periodically updated as they continue to invest in future improvements to AlphaFold. “Over the coming months,” the partners indicated in a press statement, “we plan to vastly expand the coverage to almost every sequenced protein known to science—over 100 million structures covering most of the UniProt reference database.”
More specific challenges were described in one of the recent Nature papers: “The parts of the human proteome still without a confident prediction represent directions for future research. Some proportion of these will be genuine failures, where a fixed structure exists but the current version of AlphaFold does not predict it. In many other cases, where the sequence is unstructured in isolation, the problem arguably falls outside the scope of single-chain structure prediction. It will be crucial to develop new methods that can address the biology of these regions, for example, by predicting the structure in complex or by predicting a distribution over possible states in the cellular milieu.”