Malthe Sebro Rasmussen:
Probabilistic modelling in genomics

Date: 15-09-2023    Supervisor: Anders Albrechtsen & Carsten Wiuf

This thesis concerns methods for whole-genome next-generation sequencing (NGS) data in population genetics. In various ways, the papers comprising the thesis aim to help us make inferences from such data. An overarching theme is the development of models that make better use of the information in sequencing data in order to improve and scale population genetic inference. Alongside this, there is a focus throughout on the practical applicability of these methods to real data, in the form of efficient, usable software implementations of the proposed models and methods. The thesis consists of five manuscripts covering various topics under this umbrella.

Paper 1 presents a stochastic version of the EM algorithm for inference of the site frequency spectrum (SFS) from low-depth data. It enables improved estimation of the SFS while simultaneously lowering computational requirements by orders of magnitude relative to existing methods. This makes it possible to scale such estimation to large numbers of low-depth whole genomes, which was previously hard or impossible, and it benefits downstream inference based on the SFS.

In Paper 2, we use the SFS to accurately infer a broad range of statistics at low depth, together with estimates of their standard errors. We show how this makes it possible to run methods like TreeMix and qpGraph while accounting for genotype uncertainty, which leads to better estimates of drift and gene flow.

Paper 3 is a short software paper describing a portable, ergonomic, and efficient command-line tool, implemented in a low-level language, that covers a core SFS workflow. Based on experience from the previous papers, it aims to make it easier to work with frequency spectra in a robust and reproducible manner.

Paper 4 describes a method for adjusting for the effects of population structure when measuring linkage disequilibrium (LD). In particular, when combined with LD pruning, substructure may cause significant bias in downstream analysis. Alongside a theoretical analysis, we show how to use the adjusted measure for LD pruning to greatly reduce these issues in practice.

Finally, Paper 5 presents an approach to admixture analysis that incorporates haplotype structure without requiring assignment of genotype phase or allelic state. By using haplotype information, we aim to improve our ability to accurately recover ancestry proportions and to enable such inference at much lower depth than is currently possible.