Efficient statistical and computational methods for large scale sequencing data

Research output: Book/Report › Doctoral thesis › Research

This thesis presents statistical and computational methods for analyzing large-scale sequencing data in genomics and population genetics. Since the beginning of modern genomics, data sizes have been growing exponentially. Because the relevant methods in the field are computationally expensive on large data sets, this thesis focuses in particular on the computational efficiency and scalability of the proposed algorithms and implementations. The thesis consists of four manuscripts covering principal component analysis (PCA), genotype imputation and phasing, population structure, and software development for genetic data.

The first manuscript is a published paper presenting a fast and accurate out-of-core PCA framework for large-scale data sets. It introduces a novel window-based randomized singular value decomposition algorithm, which can rapidly analyze very large data sets in few passes over the data while achieving high accuracy. The method is also implemented efficiently in C++, with the aim of making it practical, scalable and flexible for large data sets.

The second manuscript, which is ready for submission, presents a rapid genotype imputation method for very large reference panels. It describes a memory-efficient algorithm and strategy for fast genotype imputation whose computational complexity is constant in the size of the reference panel. Furthermore, it extensively investigates the benefits of different reference panels, different sequencing data and different imputation methods in human genetics.

The third manuscript, which is still in preparation, describes a new model for inferring fine population structure from low-coverage sequencing data. The main attraction of the model is that it requires neither phased genotype data nor a reference panel. In addition, it can jointly perform global and local admixture inference and genotype imputation in one go, taking all available information, such as haplotype structure, into account.

The fourth manuscript is a submitted paper under review, introducing a C++ API for scripting rapid variant analysis on the complex VCF/BCF file formats. The design of the API follows three principles: (a) be simple and safe to use; (b) be portable to other languages, particularly dynamic languages; (c) deliver high performance.

Original language: English
Publisher: Department of Biology, Faculty of Science, University of Copenhagen
Number of pages: 149
Publication status: Published - 2024

ID: 384252651