From Chess to AI to Neural Network to Chemistry Nobel Prize 2024

Prof Homendra Naorem
Contd from previous issue
Of about 300 known amino acids, only 20 are found in the proteins of the human body. Since proteins are formed by polymerization of these 20 amino acids, a huge number of proteins can be formed by different combinations and permutations of them: for 2 amino acids there are 20² or 400 possible ways of forming a dipeptide, 20³ or 8000 for a tripeptide, and so on, and the number becomes frustratingly large even for 10 amino acids, let alone a full protein! Despite these colossal possibilities, only a comparatively small number of proteins with particular amino acid sequences and structures are found in the body. How does nature select and design the sequence of amino acids in proteins and their structure? That is a baffling question! But knowing the sequence of amino acids alone is not enough to understand a protein's role in life processes unless its structure is also known.
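
The counting above follows a simple rule: if each position in a chain can hold any one of the 20 amino acids, a chain of n positions can be written in 20 to the power n ways. A minimal Python sketch of that arithmetic is given here; the function name peptide_count is only an illustrative choice, and the count ignores any chemical or biological restriction on which sequences can actually exist.

```python
# Count the possible linear sequences of a given length built from the
# 20 amino acids: every position has 20 independent choices, so the
# total is 20 ** length.

AMINO_ACIDS = 20  # number of amino acids found in human proteins


def peptide_count(length: int) -> int:
    """Return the number of distinct sequences of `length` amino acids."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return AMINO_ACIDS ** length


if __name__ == "__main__":
    # Dipeptide, tripeptide, a 10-residue peptide and a 100-residue protein.
    for n in (2, 3, 10, 100):
        print(f"{n:>3} amino acids: {peptide_count(n):.3e} possible sequences")
```

Python integers do not overflow, so the same function can be used for chains of any length.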
Determination of the structure of a protein is indeed intellectually very challenging, since the combination of 2 or more amino acids through peptide linkages does not yield a straight polymer chain but a nonplanar one, with the α-carbon conserving its tetrahedral structure. This leads to twisting or folding of the polymer chain. Therefore, the structure of a protein is approached through four different stages, namely the primary, secondary, tertiary and quaternary structures. In brief, the primary structure gives the number and sequence of the amino acids in the protein, the secondary structure gives the local spatial conformation or folding of the chain, the tertiary structure gives the overall three-dimensional structure of the protein, while the quaternary structure describes the arrangement of the different units when more than one polypeptide chain, known as subunits, aggregate to form a protein. Because of the complexities involved, the structures of only some proteins have been fully established till now, while the structures of many more are yet to be established. How does a protein fold in a particular pattern? Is it a random process? Christian Anfinsen showed in 1961 that a protein’s three-dimensional structure is entirely governed by the sequence of amino acids in the protein, for which he received the Chemistry Nobel Prize in 1972. However, a protein with a given sequence of amino acids could in principle fold in an enormous number of ways: with 100 amino acids, it can have as many as 10⁴⁷ different 3-D structures! But in a cell it will have only one particular folding! Moreover, if the protein is somehow made to unfold, it will return to the same folding, as if all the information about how the protein is to be folded were present in the amino acid sequence! How does the chain of amino acids fold? Can the folding of a protein be predicted from its sequence? These have remained enigmatic questions in the life sciences.
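
To get a feel for why folding cannot be a random trial-and-error search, the figure of 10 to the power 47 possible structures quoted above can be turned into a rough time estimate. The short Python sketch below does this under a loudly stated assumption: the rate of 10 to the power 13 trial structures per second is a purely illustrative number chosen here, not a measured one, and the age of the universe is rounded to about 13.8 billion years.

```python
# Rough estimate of how long a chain would need to try every possible
# structure one after another. Only the conformation count comes from
# the article; the sampling rate is an illustrative ASSUMPTION.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_try_all(conformations: float, trials_per_second: float) -> float:
    """Years needed to visit every conformation once at a fixed rate."""
    if trials_per_second <= 0:
        raise ValueError("trials_per_second must be positive")
    return conformations / trials_per_second / SECONDS_PER_YEAR


CONFORMATIONS = 1e47             # figure quoted in the article for 100 amino acids
TRIALS_PER_SECOND = 1e13         # ASSUMPTION: illustrative sampling rate
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate value

years = years_to_try_all(CONFORMATIONS, TRIALS_PER_SECOND)

if __name__ == "__main__":
    print(f"Exhaustive search would take about {years:.1e} years,")
    print(f"roughly {years / AGE_OF_UNIVERSE_YEARS:.1e} times the age of the universe.")
```

Whatever rate is assumed, the answer stays absurdly large, while real proteins fold far faster than that; this is why the fold must be directed by the sequence rather than found by chance.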
Considering the complexities involved in the experimental determination of protein structures, the Critical Assessment of Protein Structure Prediction (CASP) was started in 1994 as a kind of competition to predict the structures of proteins from their known amino acid sequences, using proteins whose structures have been established but not yet revealed to the competitors. The prediction accuracy could not be pushed beyond about 40% until Demis Hassabis entered the CASP competition with his AI model AlphaFold, when the accuracy improved to about 60%, better but still not acceptable. Meanwhile, John Jumper, with a background in simulating protein dynamics, joined Hassabis at Google DeepMind to develop AlphaFold2 using innovative breakthroughs in AI and neural network models. With the new AI architecture, it started delivering good results in the 14th CASP competition (2020). The predictions had improved so greatly that CASP’s organisers realized in 2020 that the age-old challenge of protein structure prediction was over, because AlphaFold2 can predict a structure almost as good as the one X-ray crystallography can give, which left CASP’s founders with the question ‘what now?’ Hassabis and Jumper used it to calculate the structures of all human proteins, each prediction taking only minutes, and then went on to predict the structures of virtually all the 200 million proteins discovered so far!
David Baker also took part in the CASP competition in 1998 using Rosetta, the computer software he developed to predict protein structures from a given sequence of amino acids. His program did really well but not well enough, which made him change his approach: instead of feeding amino acid sequences into Rosetta to get structures, he started with a desired protein structure to obtain possible amino acid sequences, a kind of reverse engineering! This reverse approach enabled his group to create entirely new proteins with designed sequences of amino acids, which made Baker a de novo protein designer and constructor. The software was so successful that when a newly designed protein (Top7) was produced in bacteria from its designed amino acid sequence, the structure determined by X-ray diffraction matched the designed one almost exactly. Since then, many spectacular proteins have been created in Baker’s laboratory. In the best scientific traditions, Google DeepMind has made the code for AlphaFold2 publicly available and Baker has also released the code for Rosetta, so that the global research community can not only use them but also continue developing better software for wider use and finding new areas of application.
Development of computer programs or software to predict the structure of a protein or its amino acid sequence requires a deep understanding of the underlying principles of chemical bonding and of the major driving forces behind protein folding, such as electrostatic interactions, hydrogen bonds, hydrophobic effects, disulfide bonds, van der Waals forces, etc. Though they came to the problem from computing, physics and biochemistry rather than from conventional bench chemistry, all three Nobel laureates mastered these principles and used them effectively while developing their program algorithms, with skilful use of machine learning and AI, to predict protein structures or construct new proteins. In the process, they used the available data on almost all known proteins, obtained through XRD or other experimental methods, to extract patterns before developing the AI-driven programs. Large sets of experimental data are necessary for the development of AI-driven predictive programs.
The awarding of the Nobel Prize in Chemistry for using AI to predict protein structure and the Nobel Prize in Physics for foundational work in machine learning with artificial neural networks marks the entry of AI into the sciences in a big way. Soon, AI will be reshaping Chemistry and the chemical industry with algorithms that accelerate molecular design and change how chemists solve complex structural problems. Apparently, the majority of chemists are currently in an exuberant race to synthesize and ascertain the structures of hosts of new compounds. The Chemistry Nobel 2024 is prompting them to start looking for any pattern or design in the formation of the compounds and their structures with natural, if not artificial, intelligence, before some outsider trained in AI and machine learning uses their data and develops a program that can predict the library of new compounds they have made, including the ones they can make! The message of this year’s Nobel Prizes in Physics and Chemistry, therefore, is to make AI and machine learning an integral part of undergraduate courses in Physics as well as Chemistry. Are we ready in India or, nearer home, in Manipur?

The writer is from the Dept of Chemistry, Manipur University