This thesis covers work in population genetics, statistical genetics and machine learning. It consists of several new statistical methods as well as a novel machine learning framework for use in population genetics.
The first paper presents two new methods for low-depth next-generation sequencing data: the first infers population structure using principal component analysis, and the second estimates admixture proportions. The first method accommodates the uncertainty in the genotypes by working directly on genotype likelihoods in an iterative approach for estimating individual allele frequencies. The method is shown to be more accurate than existing methods for inferring population structure. The individual allele frequencies can in turn be used to estimate admixture proportions in a matrix factorization approach that is much faster than existing methods for estimating admixture proportions.
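As an illustration of what working directly on genotype likelihoods can look like, the following minimal sketch computes the posterior expected genotype (dosage) for each individual and site from genotype likelihoods and individual allele frequencies. It assumes a Hardy-Weinberg prior on the genotype given the individual allele frequency; the function name `posterior_dosage` and the array layout are hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np

def posterior_dosage(gl, pi):
    """Posterior expected genotype per individual and site.

    gl: array of shape (n, m, 3) with genotype likelihoods P(data | G = g)
        for g = 0, 1, 2 copies of the allele (assumed layout).
    pi: array of shape (n, m) with individual allele frequencies in (0, 1).
    """
    gl = np.asarray(gl, dtype=float)
    pi = np.asarray(pi, dtype=float)
    # Assumed prior: Hardy-Weinberg proportions given the individual allele frequency.
    prior = np.stack([(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2], axis=-1)
    # Bayes' rule: posterior is proportional to likelihood times prior.
    post = gl * prior
    post /= post.sum(axis=-1, keepdims=True)
    # Expected number of allele copies under the posterior.
    return post[..., 1] + 2 * post[..., 2]
```

With uninformative likelihoods the dosage falls back to twice the individual allele frequency, while a confidently called genotype is returned essentially unchanged. In an iterative scheme, such dosages could be used to re-estimate the individual allele frequencies until convergence.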
The second paper presents a new method for inferring population structure using principal component analysis in the presence of rampant non-random missingness. The method directly models the missingness in an expectation-maximization algorithm to impute missing data. We demonstrate that the method is more accurate than competing methods for inferring population structure, most of which fail because they do not account for missingness. The method is further shown to scale to very large genetic datasets in terms of computational runtime.
The third paper describes a new method that tests for and quantifies deviations from Hardy-Weinberg Equilibrium in structured populations using genotype or low-depth next-generation sequencing data. It naturally accounts for population structure by incorporating individual allele frequencies in a likelihood framework that works directly on genotype likelihoods. The method is shown to be more accurate at detecting and quantifying deviations from Hardy-Weinberg Equilibrium in structured populations than existing methods.
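A likelihood framework of this kind can be sketched for a single site as follows. The sketch assumes that the deviation from Hardy-Weinberg Equilibrium is parameterized by a per-site inbreeding coefficient F, that F is restricted to [0, 1) so that all genotype probabilities remain valid, and that F is estimated by a simple grid search; these choices, and all function names, are assumptions made for illustration and not necessarily those of the paper.

```python
import numpy as np

def genotype_prior(pi, f):
    """Genotype probabilities given individual allele frequencies pi and coefficient f."""
    shift = f * pi * (1 - pi)  # probability mass moved from heterozygotes to homozygotes
    return np.stack([(1 - pi) ** 2 + shift,
                     2 * pi * (1 - pi) * (1 - f),
                     pi ** 2 + shift], axis=-1)

def site_loglik(gl, pi, f):
    """Log-likelihood of one site, summing over the unobserved genotypes."""
    return float(np.sum(np.log(np.sum(gl * genotype_prior(pi, f), axis=-1))))

def hwe_test(gl, pi, grid=None):
    """Grid-search estimate of f and likelihood ratio statistic against f = 0.

    gl: array of shape (n, 3) with genotype likelihoods for n individuals.
    pi: array of shape (n,) with individual allele frequencies in (0, 1).
    """
    gl = np.asarray(gl, dtype=float)
    pi = np.asarray(pi, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 0.99, 100)  # assumed search range; includes f = 0
    logliks = np.array([site_loglik(gl, pi, f) for f in grid])
    best = int(np.argmax(logliks))
    lrt = 2.0 * (logliks[best] - site_loglik(gl, pi, 0.0))
    return float(grid[best]), lrt
```

Because the individual allele frequencies already carry the population structure, a positive estimate of F in this sketch reflects an excess of homozygotes beyond what structure alone would produce. Under the usual asymptotic argument, the statistic could be compared against a chi-square distribution with one degree of freedom.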
The fourth paper introduces a new method for inferring local haplotype structure by estimating latent encodings and clusterings of haplotypes in phased haplotype data using neural networks. It is based on a variational autoencoder model that can be used to infer population structure as well as to estimate admixture proportions in a novel likelihood framework. We demonstrate that this method is able to capture global fine-scale population structure by utilizing haplotype information, which standard approaches do not exploit.