Antibodies, small proteins produced by the immune system, can bind to specific parts of a virus to neutralize it. To combat Covid-19, for example, labs have developed vaccines but have also turned to synthetic antibodies that bind to the virus’s spike proteins and so prevent it from entering a human cell. Researchers at MIT have created Equidock, a machine learning model that directly predicts the complex that forms when two proteins bind. The research will be presented at the International Conference on Learning Representations.
To develop a successful synthetic antibody, researchers must understand exactly how it will bind to its target protein. Proteins have lumpy 3D structures with many folds and can stick together in millions of combinations, so finding the right protein complex among almost countless candidates is extremely time-consuming.
Octavian-Eugen Ganea, a postdoc at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and Xinyuan Huang, a graduate student at ETH Zurich, are the study’s co-lead authors. Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health at CSAIL, and Tommi Jaakkola, the Thomas Siebel Professor of Electrical Engineering at CSAIL and a member of the Institute for Data, Systems, and Society, also collaborated.
Equidock, a deep learning model
To streamline the process, the MIT researchers created a machine learning model that can directly predict the complex that will form when two proteins bind. Their technique is between 80 and 500 times faster than state-of-the-art software methods and often predicts protein structures that are closer to the actual structures observed experimentally.
Octavian-Eugen Ganea said:
“Deep learning is very effective in capturing interactions between different proteins that are otherwise difficult for chemists or biologists to write down experimentally. Some of these interactions are very complicated and people have not found good ways to express them. This deep learning model can learn these types of interactions from the data.”
Equidock focuses on rigid body docking, in which two proteins attach by rotating or translating in 3D space without their shapes compressing or bending.
The model takes the 3D structures of two proteins and converts them into 3D graphs that the neural network can process. Proteins are formed from chains of amino acids, and each amino acid is represented by a node in the graph.
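As a rough illustration of this representation, the sketch below builds a small graph with one node per amino acid. The article does not say how edges are chosen, so linking each residue to its k nearest neighbors is an assumption made here, and the function name, residue names, and coordinates are all hypothetical.

```python
import math

def build_residue_graph(coords, k=2):
    """Map each residue index to the sorted indices of its k nearest residues.

    Assumption: edges connect spatial nearest neighbors. The article only
    states that each amino acid becomes a node.
    """
    graph = {}
    for i, ci in enumerate(coords):
        # Distance from residue i to every other residue, nearest first.
        others = sorted((math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i)
        graph[i] = sorted(j for _, j in others[:k])
    return graph

# Hypothetical coordinates (in angstroms) for a five-residue chain.
residues = ["MET", "ALA", "GLY", "SER", "LYS"]
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (7.6, 7.6, 0.0)]

graph = build_residue_graph(coords, k=2)
for i, neighbors in graph.items():
    print(residues[i], "->", [residues[j] for j in neighbors])
```

A real pipeline would read coordinates from an experimental structure file and attach features to nodes and edges; this sketch only shows the node-per-residue idea.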
The researchers built geometric knowledge into the model, so that it understands how objects change when they are rotated or moved in 3D space. The model also has mathematical knowledge built in to ensure that proteins always bind in the same way regardless of where they sit in 3D space, which is how proteins dock in the human body.
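One simple way to see what such geometric knowledge means: quantities like the distances between points do not change when a structure is rotated or shifted. The sketch below checks this numerically on made-up points. It illustrates the general idea only and is not a description of how Equidock is built internally; the helper names are hypothetical.

```python
import math

def rigid_transform(points, angle, shift):
    """Rotate points about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]

def pairwise_distances(points):
    """Distances between every unordered pair of points, in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Made-up coordinates standing in for a few residues.
points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.0, 1.0), (2.0, 4.0, -1.5)]
moved = rigid_transform(points, angle=1.1, shift=(10.0, -4.0, 2.5))

before = pairwise_distances(points)
after = pairwise_distances(moved)
max_change = max(abs(a - b) for a, b in zip(before, after))
print("largest change in any pairwise distance:", max_change)
```

The coordinates change but the distances agree up to floating-point rounding, which is the kind of consistency a model needs if its prediction should not depend on where the input proteins happen to be placed.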
Using the graph representation and these geometric constraints, Equidock identifies the atoms in the two proteins that are most likely to interact and react chemically; these are called binding pocket points. It then uses these points to place the two proteins together in a complex.
Octavian-Eugen Ganea explains:
“If we can understand from the proteins which individual parts are likely to be these binding pocket points, then that will capture all the information we need to place the two proteins together. Assuming we can find those two sets of points, we can simply figure out how to rotate and translate the proteins so that one set matches the other set.”
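Finding the rotation and translation that map one set of points onto a matched second set is a classic problem. The sketch below uses the Kabsch algorithm with NumPy as one standard way to solve it. Whether Equidock uses exactly this procedure is not stated in the article, and the point sets here are synthetic.

```python
import numpy as np

def kabsch(P, Q):
    """Return rotation R and translation t minimizing sum ||R @ p_i + t - q_i||^2.

    P and Q are (n, 3) arrays of matched points (row i of P pairs with row i of Q).
    """
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Synthetic "pocket points": Q is P after a known rotation and translation.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([4.0, -2.0, 1.5])
Q = P @ R_true.T + t_true

R, t = kabsch(P, Q)
aligned = P @ R.T + t
print("max alignment error:", np.abs(aligned - Q).max())
```

Because the synthetic points match exactly, the recovered transform reproduces the known one; with noisy predicted points the same procedure returns the best least-squares fit instead.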
One of the biggest challenges in building this model was the lack of training data.
Octavian-Eugen Ganea adds:
“Because there is so little 3D experimental data for proteins, it was especially important to incorporate geometric knowledge into Equidock.” Without these geometric constraints, the model could pick up spurious correlations in the data set.
Once the model was trained, the researchers compared it to four software methods. Equidock was able to predict the final protein complex after only one to five seconds. All of the baselines took much longer, from 10 minutes to an hour or more.
In quality measures, which calculate how closely the predicted protein complex matches the actual protein complex, Equidock was often comparable to the baselines, but sometimes underperformed them.
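The article does not name the quality measures. A common choice for comparing a predicted structure with a reference is the root-mean-square deviation (RMSD) over matched atoms, so the sketch below assumes that metric; the coordinates are invented for illustration.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between matched 3D points (same order, same length)."""
    if len(predicted) != len(reference):
        raise ValueError("point sets must have the same length")
    total = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(total / len(predicted))

# Invented example: every predicted atom is displaced by 1 angstrom along z.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

score = rmsd(predicted, reference)
print("RMSD in angstroms:", score)
```

Lower values mean the predicted complex sits closer to the experimentally observed one, and a value of zero means the two coincide.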
Octavian-Eugen Ganea states:
“We are still behind one of the baselines. Our method can still be improved, and it can still be useful. It could be used in a very large virtual screen where we want to understand how thousands of proteins can interact and form complexes. Our method could be used to generate an initial set of candidates very quickly, and then these could be refined with some of the more accurate but slower traditional methods.”
In addition to using this method with traditional models, the team wants to incorporate specific atomic interactions into Equidock so that it can make more accurate predictions. For example, sometimes protein atoms bind through hydrophobic interactions, which involve water molecules.
This technique could help scientists better understand certain biological processes that involve protein interactions, such as DNA replication and repair; it could also speed up the development of new drugs.
“Our technique could also be applied to the development of small drug-like molecules. These molecules bind to protein surfaces in a specific way, so quickly determining how this binding occurs could shorten the drug development timeline.”
In the future, the researchers plan to improve Equidock so that it can make predictions for flexible protein docking. The biggest hurdle is the lack of training data, so the team aims to generate synthetic data to improve the model.