Lasse Maretty:
Probabilistic methods for processing high-throughput sequencing signals

Date: 07-07-2016    Supervisor: Anders Krogh




High-throughput sequencing has the potential to answer many of the big questions in biology and medicine. It can be used to determine the ancestry of species, to chart complex ecosystems and to understand and diagnose disease. However, going from raw sequencing data to biological or medical insights is far from trivial.

A key challenge is that sequencing technologies cannot read the input sequences in their entirety. Due to technological constraints, they instead provide the sequences of a very large number of short fragments of the input molecules. Furthermore, not all nucleotides in these fragments are measured correctly, and the final output of a typical experiment thus consists of hundreds of millions of error-containing sequence fragments.

This thesis concerns the development of methods for transforming such a raw sequencing signal into a simpler representation from which biological inferences can then be made. Importantly, the fact that the fragments are short and contain errors implies that there may be significant uncertainty associated with the signal. By using probabilistic models, we are able to quantify this uncertainty and propagate it to downstream analyses.

The first chapter describes a new method for reconstructing transcript sequences from RNA sequencing data. The method is based on a novel sparse prior distribution over transcript abundances and is markedly more accurate than existing approaches. The second chapter describes a new method for calling genotypes from a fixed set of candidate variants. The method queries the reads using a graph representation of the variants and thereby mitigates the reference bias that characterises standard genotyping methods. In the last chapter, we apply this method to call the genotypes of 50 deeply sequenced parent-offspring trios from the GenomeDenmark project. By estimating the genotypes on a set of candidate variants obtained from both a standard mapping-based approach and de novo assemblies, we are able to find considerably more structural variation than previous studies.