Two years ago, as the COVID-19 pandemic swept across the world, researchers at DeepMind, the artificial intelligence (AI) research laboratory and subsidiary of Alphabet Inc., demonstrated how they could use machine learning to achieve a breakthrough in predicting how proteins, the workhorses of the living cell, fold into the intricate shapes they take on. The work gave hope to biologists that they could use this kind of tool to tackle pathogens such as the SARS-CoV-2 coronavirus much more quickly in the future.
Researchers were able to assess the abilities of DeepMind's AlphaFold2 thanks to its inclusion in the 14th Critical Assessment of Structure Prediction (CASP14), a benchmarking competition that ran through 2020 and which added a parallel program to uncover the structures of key proteins from the SARS-CoV-2 virus to try to accelerate vaccine and drug development. The organizers of CASP14 declared the tool represented "an almost complete solution to the problem of computing three-dimensional structure from amino-acid sequences," though some caveats lie behind that statement.
Figure. An AlphaFold protein prediction with a very high (greater than 90 out of 100) per-residue confidence score.
In principle, quantum mechanical simulations can predict which collection of folds leads to the lowest combined energy of all the chemical bonds in the shape and the water and other molecules around it. However, this remains beyond the capacity of even today's computers and may not even be practical in most cases.
John Jumper, senior staff research scientist at DeepMind, points out that to perform a full molecular-dynamic simulation is not just computationally complex; it requires a complete specification of the environment around the protein in question. "Proteins are exquisitely sensitive machines and extremely finely balanced. We can't write down really good energy functions for them. Even small changes, like getting the salt concentration wrong or not specifying some condition, can cause them not to fold at all. And you have no hope of writing down all the correct conditions of every protein in the human cell," he says.
When biologists produce structures for proteins experimentally, they find ways to fix the molecule in what they hope is a representative conformation. One method is to isolate and crystallize the protein and then use X-ray diffraction to estimate the positions of atoms in the complex structure. Another increasingly common method is cryogenic electron microscopy (cryo-EM): freezing the isolated protein and then using the scattering of electron beams by the atoms to work out how the protein chain bends and folds. Years of effort have populated publicly accessible databases such as the Protein Data Bank (PDB) set up by a group of U.K. and U.S. laboratories in 1971. Though painstaking to create, this data has proven crucial to the growing efficacy of AI-based models.
Some machine-learning methods applied early in the 2010s focused on how protein sequences evolve and their relationship to the PDB shapes. Many proteins possess amino-acid residues in key positions that are important in determining structure across the many variants that have built up in the evolutionary record. These residues are often quite far apart in the chain, but come close together in the folded version and hold the protein in that shape through bonds formed dynamically by interactions between their electron shells. These similarities show up in the amino-acid sequences that are available for many proteins: they are much cheaper to obtain than information on structures. Training a network on these similarities, which are identified by a process of multiple sequence alignment (MSA), helped reduce the number of different possible shapes the physics engines needed to consider. However, accuracy using these models remained relatively poor.
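The idea that paired positions in an alignment change together can be illustrated with a small calculation. The sketch below scores two alignment columns by their mutual information, a simple covariation measure chosen here for illustration; it is not the method used by any particular prediction tool, and the toy alignment and the function name are invented for the example.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment, invented for illustration. Columns 1 and 4 change together
# (K pairs with E, R pairs with D), while column 2 never changes.
toy_msa = [
    "AKLGE",
    "AKLGE",
    "ARLGD",
    "ARLSD",
]

coupled = column_mutual_information(toy_msa, 1, 4)    # co-varying columns
conserved = column_mutual_information(toy_msa, 2, 4)  # conserved column

print(f"columns 1 and 4: {coupled:.2f} bits")
print(f"columns 2 and 4: {conserved:.2f} bits")
```

A high score between two columns that sit far apart in the chain is the kind of hint that the corresponding residues may touch in the folded protein. Real alignments need corrections for sampling bias and shared ancestry that this sketch leaves out.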
That began to change when machine-learning researchers saw a connection between the data stored in PDB and the convolutional neural networks (CNNs) used to analyze two-dimensional (2D) images. One way to represent the folds in proteins is to use a contact map. This places each of the amino-acid residues in the sequence along the axes of a matrix, with a score at each pairing used to represent how close they are in space in the PDB model. Ranked highly at CASP12 held in 2016, RaptorX-Contact, developed at the Toyota Technological Institute in Chicago, used a CNN based on the widely used ResNet architecture, in effect treating the contact map as an image input.
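A contact map of the kind described above can be built from 3D coordinates in a few lines. The following is a minimal sketch: it uses a binary score with an 8-angstrom cutoff, which is an assumption made for the example (the article only says each pairing gets a closeness score), and the coordinates are invented to mimic a short hairpin.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return an N x N matrix with 1 where residues i and j are within threshold."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical coordinates (in angstroms) for an 8-residue chain that runs
# out along the x axis, turns, and runs back alongside itself.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
    (11.4, 3.8, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]

cmap = contact_map(toy_coords)
for row in cmap:
    print(" ".join(str(v) for v in row))
```

In this toy chain the first and last residues are seven positions apart in sequence yet end up in contact, while residues 0 and 3 are closer in sequence but not in space. The resulting square grid of values is what lets a CNN treat the map like an image.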
AlphaFold2 moved away from CNNs, switching the architecture to an attention-based neural network trained on PDB structures in a system that also incorporated the MSA technique used in older systems. The result was a 21-million-parameter model effective enough to push AlphaFold2 ahead of competitors at CASP14.
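To give a feel for what "attention-based" means, the sketch below implements plain scaled dot-product self-attention over a handful of per-residue feature vectors. It is a generic illustration, not AlphaFold2's architecture: the learned query, key, and value projections are replaced by the identity, and the feature vectors are invented.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity projections (a simplification)."""
    d = len(x[0])
    # weights[i][j]: how strongly position i attends to position j
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x])
        for q in x
    ]
    # Each output vector is the attention-weighted mix of all input vectors
    output = [
        [sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return weights, output


# Hypothetical 2-dimensional features for four residues; residues 0 and 3
# have similar features even though they sit at opposite ends of the chain.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
weights, output = self_attention(features)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Unlike a convolution, which only mixes information within a local window, every position here can weight every other position in a single step, which suits residues that are distant in sequence but close in the fold.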
Before DeepMind published the code for the AlphaFold2 model, researchers in the Baker Lab at the University of Washington developed their own attention-based neural network, called RoseTTAFold. In doing so, they added a third component to the model that considers the three-dimensional (3D) coordinates of the residues directly to improve overall accuracy. However, AlphaFold2 delivered better results on average at CASP15, held at the end of last year.
After source code for the AlphaFold2 model was made available, many teams chose to base their own work on it and presented this at CASP15, although DeepMind did not participate directly. Last year, a consortium of laboratories and technology companies released OpenFold, a streamlined version of AlphaFold2 developed by Mohammed AlQuraishi, assistant professor in the Department of Systems Biology at Columbia University and a member of Columbia's Program for Mathematical Genomics. In contrast to the code made available by DeepMind, OpenFold also includes the tools for training customized versions of the model.
Experimentalists have also embraced the results from AI, aided in part by a database hosted by the pan-European life-sciences laboratory EMBL. It holds structures predicted by AlphaFold2 for all known human proteins, many of which do not yet have entries in PDB. Tom Terwilliger, senior scientist at the New Mexico Consortium, says the advantage of using structures predicted by AlphaFold2 and similar tools in experimental workflows is that they can deliver hypothetical structures often close to the physical shapes indicated by cryo-EM or X-ray data, if not completely accurate. In practice, estimates of likely accuracy provided by AlphaFold2 itself tend to match up well with errors in the generated structure.
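For readers who want to inspect those accuracy estimates, AlphaFold writes its per-residue confidence score (pLDDT, on a 0 to 100 scale) into the B-factor column of the PDB-format files it produces. The sketch below reads that column for C-alpha atoms and lists residues falling below a cutoff. The cutoff of 70 and the helper names are assumptions made for this example, and the input lines are generated by a hypothetical helper rather than taken from a real prediction.

```python
def plddt_per_residue(pdb_lines):
    """Map residue number -> pLDDT, read from the B-factor column of C-alpha atoms."""
    scores = {}
    for line in pdb_lines:
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26,
        # B-factor 61-66 (1-based), hence the slices below.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_residues(scores, cutoff=70.0):
    """Residue numbers whose score falls below the chosen cutoff (an assumption)."""
    return sorted(r for r, s in scores.items() if s < cutoff)


def make_ca_line(res_num, plddt):
    """Hypothetical helper: one fixed-width ATOM record for a dummy C-alpha atom."""
    return (f"ATOM  {res_num:>5}  CA  ALA A{res_num:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{plddt:>6.2f}")


# Invented scores for a four-residue toy chain
toy_lines = [make_ca_line(i + 1, s) for i, s in enumerate([95.2, 91.0, 68.4, 42.7])]

scores = plddt_per_residue(toy_lines)
flagged = low_confidence_residues(scores)
print(scores)
print("below cutoff:", flagged)
```

With a real prediction, the same function can be given the lines of the downloaded file, for example via `open(path)`. The flagged regions are the ones an experimentalist would treat most cautiously when fitting the model to cryo-EM or X-ray data.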
A typical use case is to take the predicted structure and "dock" it into an electron density map created by cryo-EM measurements to work out where the two differ and then adapt the structure to one that better fits the density map. This has been expanded into a workflow developed by Terwilliger and colleagues in which refinements are then provided to AlphaFold2 so it can generate a new prediction that is often much closer to the experimental data.
Some critics argue the lack of physics in the core model represents a problem that will need to be solved by future systems. If a small change is made to the core sequence given to AlphaFold2, perhaps one that disrupts the coupling between two of the residues, the model may fail to predict the change in physical structure.
Even so, several groups have found they can make AlphaFold2 better at handling mutations without changing its core structure. One technique is to use much shorter chains as inputs to the MSA process. For proteins with good representation in the existing data, the resulting structures often resemble the different conformations a protein can take on as the environment around it changes or when it binds and releases other molecules and ions.
"So far, it seems that just letting the model learn everything, without making any prior physical assumptions, simply works better," AlQuraishi says. "I suspect this is in part because we have enough data to train a protein structure-prediction model, at least if the architecture is well designed, such as AlphaFold2's. It's conceivable that for other molecular problems where data is more scarce, including, for example, predicting the effects of structurally disruptive mutations, incorporating physics will pay dividends."
Large language models like Google's BERT may improve the ability of AI-based systems to predict the effects of mutations as well as the likely shapes of novel or unusual proteins where related sequences and structures are thin on the ground. So far, the results from several of the protein language-model (pLM) entrants have not reached the accuracy of AlphaFold2.
As with natural language processing, some believe the performance of pLMs will depend on the quantity of data used for training and the number of parameters in the model, which have yet to reach the billions that might be needed. Konstantin Weissenow, a Ph.D. student in the Rost Lab of Germany's Technical University of Munich, says that in that team's work, pLMs do not perform as well as expected on proteins that do not have related sequences, and larger models do not necessarily perform better.
"The reasons are still being investigated. But they could be caused by the fact that the language model training itself is also biased by the available information in large sequence databases. I'd expect that the clever design of machine learning architectures will prove to be the more critical part in further improvement of structure prediction systems, pLM-based or not," Weissenow says.
In an increasing number of cases, the different AI models are being used in combination. A Baker Lab project attempted to find which yeast proteins are most likely to interact using a stripped-down form of RoseTTAFold to filter 5,500 candidate pairings out of a possible 4.3 million. The slower but more accurate AlphaFold2 reduced the shorter list down to 1,500 likely biologically active pairs. Early this year, startup Evozyne said it had used a pLM developed together with computing company Nvidia to generate variant proteins for synthesis quickly, with AlphaFold2 employed to help confirm the physical structures visually.
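The yeast screen follows a common cascade pattern: a cheap model prunes the candidates (here to roughly one pairing in 780) so that the expensive model only has to score the survivors. The sketch below shows that pattern in general form. The scoring functions, cutoffs, and protein names are placeholders invented for the example and do not reflect how RoseTTAFold or AlphaFold2 are actually invoked.

```python
def two_stage_screen(candidates, fast_score, slow_score, fast_cutoff, slow_cutoff):
    """Filter with a cheap scorer first, then run the costly scorer on survivors only."""
    shortlist = [c for c in candidates if fast_score(c) >= fast_cutoff]
    confirmed = [c for c in shortlist if slow_score(c) >= slow_cutoff]
    return shortlist, confirmed


# Toy data: hypothetical protein pairs with made-up interaction scores
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
fast_scores = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.7, ("C", "D"): 0.1}
slow_scores = {("A", "B"): 0.95, ("B", "C"): 0.3}

slow_calls = []  # records how often the expensive scorer is actually used


def slow_score(pair):
    slow_calls.append(pair)
    return slow_scores[pair]


shortlist, confirmed = two_stage_screen(
    pairs, fast_scores.get, slow_score, fast_cutoff=0.5, slow_cutoff=0.8
)
print("shortlist:", shortlist)
print("confirmed:", confirmed)
print("expensive calls:", len(slow_calls), "of", len(pairs))
```

The trade-off is that anything the fast filter wrongly discards is never seen by the more accurate model, so the first cutoff is usually set permissively.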
Weissenow notes, "I personally believe that pLM-based approaches will co-exist with traditional MSA-based methods for at least a while. If rich MSAs are available for a protein, the predictions from AF2 and similar systems will be the go-to for downstream users interested in the highest-quality structures."
AlQuraishi, M.
Machine learning in protein structure prediction
Current Opinion in Chemical Biology 2021, 65, 1–8
Jumper, J. et al.
Highly accurate protein structure prediction with AlphaFold
Nature 596, 583–589 (2021)
Terwilliger, T.C. et al.
Improved AlphaFold modeling using implicit information from experimental density maps
Nature Methods 19, 1376–1382 (2022)
Weissenow, K., Heinzinger, M., and Rost, B.
Protein language model embeddings for fast, accurate, alignment-free protein structure prediction
Structure, 30, 8, 1169–1177 (2022)
©2023 ACM 0001-0782/23/05