Yuqi Chang:
An Exploration of Artificial Intelligence-Driven De Novo Peptide Sequencing: Towards Future Proteomics and Medicine

Date: 31-03-2023    Supervisors: Karsten Kristiansen & Siqi Liu




Identifying the peptide sequence behind a given mass spectrum is an essential problem in both proteomics and medicine, and mass spectrometry is a widely used technique for this purpose. Currently, the tandem mass spectra (MS2) generated by mass spectrometers are analyzed mainly by searching-based approaches and by de novo peptide sequencing. Compared with searching-based approaches, de novo peptide sequencing can determine peptide sequences without the aid of sequence databases or libraries, and can therefore be used in a variety of fields where searching-based approaches fail. In addition, rapidly evolving artificial intelligence (AI) technology offers a variety of tools that support de novo peptide sequencing, making it an increasingly powerful approach to peptide identification.

The aim of this thesis was to develop a novel AI-based de novo peptide sequencing approach and to explore its application in future proteomics and medicine. The specific goals were to:

1) Implement an approach based on deep learning and tree search

2) Compare the searching-based and de novo sequencing strategies

3) Apply the approach to the analysis of HLA/MHC

In Paper I, we implemented PepGo, a de novo peptide sequencing approach based on a Transformer and Monte Carlo Tree Search (MCTS). PepGo can predict a peptide sequence from its mass spectrum without a database or library, and even without training, which gives it maximum flexibility and adaptability. Our experiments show that PepGo can achieve good performance under certain circumstances and can be used as a supplement to other state-of-the-art approaches.

In Paper II, we compared the searching-based strategy (represented by MaxQuant and FragPipe) with the de novo sequencing strategy (represented by PointNovo and PepGo) and evaluated their performance on spectra of synthetic peptides. The results showed that the de novo sequencing strategy can predict many novel peptides, such as HLA class I and class II peptides, that cannot be identified by the searching-based strategy, demonstrating the great potential of de novo sequencing in proteomics and future medicine.

Papers III, IV and V are our earlier works based on genome sequencing. In Paper III, we sequenced and de novo assembled the genomes of 150 individuals (50 trios) and constructed the haplotypes of the whole MHC region. Following this, in Paper IV, we used the de novo assembly results of the 50 trios to reconstruct MHC haplotypes without relying on a reference genome and reported 100 full MHC haplotypes.

Lastly, in Paper V, we tested two widely used HLA typing methods using high-quality whole-genome sequencing data. These works expanded our knowledge of HLA/MHC from the genomic perspective and explored methods other than mass spectrometry for studying the HLA/MHC region.

In conclusion, AI-based de novo peptide sequencing is a useful approach for proteomics. In particular, it can predict novel peptides that cannot be identified by searching-based approaches, which is especially important for immunotherapy. In combination with other technologies such as DNA sequencing, AI-based de novo peptide sequencing will play an increasingly important role in future medicine.