Development and application of new bioinformatic tools for large-scale population genomic study

Research output: Book/Report › Ph.D. thesis › Research

The Human Genome Project revolutionized biology and medicine, paving the way for subsequent large-scale projects in human population genomics, such as the 1000 Genomes Project and the UK Biobank 100K Genomes Project. These initiatives have produced a profound understanding of genetic diversity across populations and of the fundamental principles underlying gene-disease association. However, as the number of sequenced samples grows, bioinformatics tools struggle to handle the computation on such vast amounts of data, particularly joint variant calling of tens of thousands of samples, which requires tens of millions of CPU hours. Meanwhile, low-pass sequencing has become widely adopted as large population reference panels have grown. Nevertheless, existing imputation methods for low-pass sequencing data have not yet fully utilized the linkage information carried by reads or the power of graph genomes. Limited genetic diversity among the sampled populations is another major challenge for current population genomic studies. For instance, the Chinese population remains underrepresented relative to its share of the global human population. This has emerged as a significant limitation for the study of global genetic diversity. This thesis addresses these challenges in three chapters.

In the first chapter of this thesis, I present a highly scalable joint variant calling tool called DPGT, which is based on the Apache Spark parallel framework. DPGT is designed as a scalable framework that extends joint variant calling to millions of samples. DPGT applies an expectation-maximization (EM) algorithm to greatly reduce the computational hotspot in the estimation of allele counts or frequencies. These optimizations significantly reduce run time, CPU hours and storage. Compared to existing tools such as GATK and GLnexus, DPGT ran 2 to 10 times faster across different core counts and used 9 times less computing storage on the 2,504 whole-genome samples from the 1000 Genomes Project. The second chapter proposes a new graph-based genome imputation method named GRAMP for low-pass whole-genome sequencing samples. GRAMP exploits the advantages of the graph genome by modifying a graph genome data format called GBWT, enabling fast retrieval of haplotypes from large reference panels. Together with Gibbs sampling and a modified Li-Stephens imputation model that considers information from whole reads rather than genotype information alone, GRAMP achieves better imputation accuracy in general, especially when the coverage is lower than 0.5X, compared with other low-pass imputation tools such as GLIMPSE and QUILT.
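To illustrate the kind of EM estimation of allele frequency mentioned for DPGT above, the following is a minimal textbook sketch, not DPGT's actual implementation. It assumes a biallelic site, diploid samples, Hardy-Weinberg genotype priors, and per-sample genotype likelihoods given as (hom-ref, het, hom-alt) triples; the function name `em_allele_frequency` and its parameters are hypothetical.

```python
from typing import Sequence, Tuple

Likelihoods = Tuple[float, float, float]  # P(data | 0/0), P(data | 0/1), P(data | 1/1)


def em_allele_frequency(
    genotype_likelihoods: Sequence[Likelihoods],
    init: float = 0.5,
    tol: float = 1e-8,
    max_iter: int = 100,
) -> float:
    """Estimate the alternate-allele frequency at one biallelic site by EM.

    Illustrative assumptions (not taken from the thesis): diploid samples
    and Hardy-Weinberg genotype priors.
    """
    if not genotype_likelihoods:
        raise ValueError("at least one sample is required")

    p = init
    for _ in range(max_iter):
        expected_alt = 0.0  # expected number of alternate alleles
        informative = 0     # samples with a non-zero total weight
        for l0, l1, l2 in genotype_likelihoods:
            # E-step: unnormalized genotype posteriors under HWE priors.
            w0 = l0 * (1.0 - p) ** 2
            w1 = l1 * 2.0 * p * (1.0 - p)
            w2 = l2 * p ** 2
            total = w0 + w1 + w2
            if total == 0.0:
                continue  # sample is incompatible with the current p; skip it
            expected_alt += (w1 + 2.0 * w2) / total
            informative += 1
        if informative == 0:
            return p
        # M-step: frequency = expected alt alleles / total alleles.
        new_p = expected_alt / (2.0 * informative)
        if abs(new_p - p) < tol:
            return new_p
        p = new_p
    return p
```

With unambiguous genotypes the estimate reduces to simple allele counting: for four samples with genotypes 0/0, 0/1, 1/1 and 0/0 there are three alternate alleles out of eight, so the function returns 0.375.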

To advance genetic studies of the Chinese population, the third chapter of this thesis presents an investigation of 1,320 genes in a cohort of 10,539 healthy controls and 9,434 patients with psoriasis. Through joint variant calling, rigorous quality control and annotation procedures, we identified 8,720 protein-truncating variants (PTVs), of which 77% are novel. Our analysis further reveals the characteristics and functional impact of these PTVs across different metrics. These findings provide valuable insights into the patterns of PTVs within the Chinese population.
Original language: English
Publisher: Department of Biology, Faculty of Science, University of Copenhagen
Number of pages: 103
Publication status: Published - 2023

ID: 383008030