PhD Defence: Yong Zhang
Title: Development and application of new bioinformatic tools for large-scale population genomic study
Supervisor: Guojie Zhang
Chair of committee: Anders Albrechtsen (Department of Biology)
Committee members: Shuhua Xu (Fudan University), Hua Chen (Chinese Academy of Sciences)
Address: Online via Zoom: https://ucph-ku.zoom.us/j/65830511389?pwd=N3ZOdjdlWGJKeW1YTk8yM1ZwTXhiZz09
Abstract: The Human Genome Project has revolutionized the fields of biology and medicine, paving the way for subsequent large-scale human population genomics projects such as the 1000 Genomes Project and the UK Biobank 100K Genomes Project. These initiatives have produced a profound understanding of genetic diversity across populations and of the fundamental principles underlying gene-disease associations. However, as the number of sequenced samples grows, bioinformatics tools are struggling to handle such vast amounts of data, particularly for large-scale joint variant calling, which requires tens of millions of CPU hours. Low-pass sequencing has become widely adopted thanks to the use of population reference panels. However, existing imputation methods for low-pass sequencing data have yet to fully utilize the power of graph-based pan-genome references. The lack of genetic diversity in sequenced cohorts is another major challenge for current population genomic studies. For instance, the Chinese population remains underrepresented relative to its share of the global human population. This has emerged as a significant limitation for the study of global genetic diversity. I address these challenges in the three chapters of this thesis.
In the first chapter of this thesis, I present a highly scalable joint variant calling tool called DPGT, which is based on the Apache Spark parallel framework. DPGT provides a scalable framework that directly processes the original files generated by single-sample variant calling, without the need for region-based file splitting, and implements a new expectation-maximization (EM) algorithm for allele count or frequency estimation. These optimizations significantly reduce run time, CPU hours and storage use. Compared with existing tools such as GATK and GLnexus, DPGT ran 2-10 times faster in tests on 16, 32, 64, 128, and 256 cores, and used 9 times less computing storage, on the 2504 whole-genome samples from the 1000 Genomes Project.
The second chapter proposes a new graph-based genome imputation method named GRAMP for low-pass whole-genome sequencing samples. GRAMP exploits the advantages of the graph genome by modifying a graph genome data format called GBWT, enabling fast retrieval of haplotypes from large reference panels. Combined with a Gibbs sampling algorithm and a modified Li-Stephens imputation model that considers whole-read information rather than genotype information alone, GRAMP achieved better imputation accuracy overall, especially at coverage below 0.5X, compared with other low-pass imputation tools such as GLIMPSE and QUILT.
To advance genetic studies of the Chinese population, the third chapter of this thesis presents an investigation of 1320 genes sequenced in a cohort of 10,539 healthy controls and 9434 patients with psoriasis. Through joint variant calling and rigorous quality control and annotation procedures, we identified 8720 protein-truncating variants (PTVs), of which 77% are novel. Our analysis further reveals that approximately 88% of all PTVs are deleterious and subject to purifying selection. These findings provide valuable insights into the patterns of PTVs within the Chinese population.