PhD defense: Jonas Sibbesen
Probabilistic transcriptome assembly and variant graph genotyping
Supervisor: Anders Krogh, Professor, University of Copenhagen
Exam Committee
Associate Professor Gerton Lunter, Wellcome Trust Centre for Human Genetics, Oxford University
Professor Jakob Skou Pedersen, Aarhus University
Associate Professor Thomas Hamelryck, University of Copenhagen
Abstract
The introduction of second-generation sequencing has in recent years allowed the biological community to determine the genomes and transcriptomes of organisms and individuals at an unprecedented rate. However, almost every step in the sequencing protocol introduces uncertainty about how the resulting sequencing data should be interpreted. Over the years, this has spurred the development of many probabilistic methods capable of modelling different aspects of the sequencing process. Here, I present two such methods, each developed to tackle a different problem in bioinformatics, together with an application of the latter method to a large Danish sequencing project.
The first is a probabilistic method for transcriptome assembly that is based on a novel generative model of the RNA sequencing process and provides confidence estimates for the assembled transcripts. We show that this approach outperforms existing state-of-the-art methods, as measured by sensitivity and precision, on both simulated and real data.
The second is a novel probabilistic method that uses exact alignment of k-mers to a set of variant graphs to provide unbiased estimates of genotypes in a population of individuals. Using simulations, we show that this method markedly increases sensitivity without sacrificing precision when compared to mapping-based approaches, especially in variant-dense regions. We further demonstrate, using high-coverage real genome sequencing data from parent-offspring trios, that our method is accurate even for larger structural variants, as measured by trio concordance.
Finally, we applied the second method to genotype variants, predicted using both a mapping-based approach and de novo assemblies, in a population of 50 Danish parent-offspring trios in the GenomeDenmark project. Using this hybrid approach, we not only created a variant set that was more complete in terms of structural variants than those of previous similar studies, but also significantly reduced the bias towards deletions normally observed in such studies.