Jonas Andreas Sibbesen:
Probabilistic transcriptome assembly and variant graph genotyping

Date: 08-06-2016    Supervisor: Anders Krogh

The introduction of second-generation sequencing has in recent years allowed the biological community to determine the genomes and transcriptomes of organisms and individuals at an unprecedented rate. However, almost every step in the sequencing protocol introduces uncertainty in how the resulting sequencing data should be interpreted. Over the years, this has spurred the development of many probabilistic methods capable of modelling different aspects of the sequencing process. Here, I present two such methods, each developed to tackle a different problem in bioinformatics, together with an application of the latter method to a large Danish sequencing project.

The first is a probabilistic method for transcriptome assembly that is based on a novel generative model of the RNA sequencing process and provides confidence estimates for the assembled transcripts. We show that this approach outperforms existing state-of-the-art methods, as measured by sensitivity and precision, on both simulated and real data.

The second is a novel probabilistic method that uses exact alignment of k-mers to a set of variant graphs to provide unbiased estimates of genotypes in a population of individuals. Using simulations, we show that this method markedly increases sensitivity without sacrificing precision when compared to mapping-based approaches, especially in variant-dense regions. We further demonstrate, using high-coverage real genome sequencing data from parent-offspring trios, that our method is accurate even for larger structural variants, as measured by trio concordance.

Finally, we applied the second method to genotype variants, predicted using both a mapping-based approach and de novo assemblies, in a population of 50 Danish parent-offspring trios in the GenomeDenmark project. Using this hybrid approach, we not only created a variant set that was more complete in terms of structural variants than those of previous similar studies, but also significantly reduced the bias towards deletions normally observed in such studies.