Protein variant effect prediction using machine learning

Research output: Book/Report › Ph.D. thesis › Research

Predicting how amino acid changes in a protein can affect different protein properties is an ongoing area of research with applications in studies of the molecular mechanisms behind evolution, human disease, protein engineering and more. In recent years, machine learning-based methods have emerged as powerful tools for modeling such variant effects. In this thesis, we show specific examples of how existing methods for computational variant effect prediction can be improved using modern machine learning techniques. This thesis is centered around one publication and two manuscripts.

In the publication, we show that a combination of self-supervised and supervised machine learning can be used to develop a fast predictor of protein stability changes, RaSP, that is suitable for large-scale variant effect analysis. We validate and test the model using experimental and clinical data. We exemplify the large-scale application by generating stability change predictions for almost all single amino acid changes in the human proteome, corresponding to ∼230 million predictions.
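To make the scale of such a scan concrete, the following is a minimal sketch of how the single amino acid changes of one protein can be enumerated: each position can be replaced by any of the 19 other standard amino acids. The function name `single_substitutions` and the toy sequence are hypothetical illustrations and are not taken from RaSP.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence):
    """Yield every single amino acid substitution, written like 'M1A'.

    Hypothetical helper for illustration; positions are 1-indexed.
    """
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{position}{mutant}"


toy_sequence = "MKTAYIAK"  # hypothetical 8-residue sequence
variants = list(single_substitutions(toy_sequence))
print(len(variants))  # 8 positions x 19 substitutions = 152
```

A protein of length N therefore yields 19 × N candidate variants, which is why a fast predictor is needed before a proteome-wide scan becomes practical.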

In the first manuscript, we modify our RaSP model into a new model, mRaSP, that is specifically designed for membrane proteins. Membrane proteins are difficult to characterize both experimentally and computationally. However, we show that a relatively simple but specialized model is able to make variant effect predictions at a level comparable, and sometimes superior, to an existing method based on Rosetta.

In the second manuscript, we explore how combined information from protein sequence and structure inputs can be used to generate robust variant effect predictions. We introduce a novel self-supervised method, SSEmb, which combines two existing unimodal models into a single unified model that can be trained end-to-end. We show that the multimodal model generates predictions that correlate well with experimental data even when information from one of the inputs, in this case the multiple sequence alignment (MSA), is scarce. Furthermore, we show that the SSEmb embeddings contain rich information that can be used to train task-specific downstream models, which we exemplify by developing a downstream model that predicts protein-protein binding sites with high accuracy.

Finally, we summarize our findings from each research project in the conclusion and point to ways in which our methods and results could be useful in future research.
Original language: English
Publisher: Department of Biology, Faculty of Science, University of Copenhagen
Number of pages: 164
Publication status: Published - 2024
