Line Skotte:
Statistical approaches accommodating uncertainty in modern genomic data

Date: 15-02-2015    Supervisor: Anders Albrechtsen

Due to recent technological advances, the research fields of human genetics are poised as never before to provide valuable insights into the molecular basis of disease. These advances have made it possible to genotype hundreds of thousands of known genetic variants, re-sequence entire genomes to discover new variants, and perform detailed profiling of an individual's entire transcriptome. However, uncertainty pervades every level of modern genome-wide data analysis, and the timely development of statistical methods that accommodate this uncertainty is therefore necessary to fully exploit the potential of the technological advances.

The first of the four papers included in this thesis describes a new method for association mapping that accommodates uncertain genotypes from low-coverage re-sequencing data. The method handles uncertain genotypes by using a score statistic based on the joint likelihood of the observed phenotypes and the observed sequencing data. This joint likelihood accounts for the genotype uncertainties via the posterior probabilities of each genotype given the observed sequencing data. The phenotype distributions are modelled within a generalised linear model framework, which makes the contributed method applicable to case-control studies as well as to the mapping of quantitative traits. The contributed method provides a needed association test for quantitative traits in the presence of uncertain genotypes. It further allows correction for population structure in association tests for disease states and quantitative traits in the presence of uncertain genotypes. Neither was possible prior to the development of the method. Our simulations show that the contributed method has higher statistical power than methods based on genotypes inferred from the sequencing data.
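
As an illustration only (this is not the implementation from the paper), the following minimal sketch shows how posterior genotype probabilities can be obtained from genotype likelihoods. It assumes that the genotype likelihoods for 0, 1 and 2 copies of the tested allele are already available, that the allele frequency is known, and that genotypes follow Hardy-Weinberg proportions. The function names `genotype_posteriors` and `expected_dosage` are hypothetical.

```python
def genotype_posteriors(genotype_likelihoods, allele_freq):
    """Return P(G = g | reads) for g = 0, 1, 2 copies of the tested allele.

    genotype_likelihoods: [P(reads | G=0), P(reads | G=1), P(reads | G=2)]
    allele_freq: assumed known frequency of the tested allele.
    The Hardy-Weinberg prior is an assumption of this sketch.
    """
    f = allele_freq
    prior = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
    weighted = [gl * p for gl, p in zip(genotype_likelihoods, prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


def expected_dosage(posteriors):
    """Posterior mean number of copies of the tested allele."""
    return posteriors[1] + 2 * posteriors[2]


# Example: few reads, so the likelihoods only weakly favour a heterozygote.
post = genotype_posteriors([0.2, 0.6, 0.2], allele_freq=0.5)
dosage = expected_dosage(post)
```

In this example the posterior probabilities are approximately 0.125, 0.75 and 0.125. A likelihood that sums over the three genotypes weighted in this way keeps the uncertainty in the analysis, whereas calling the single most probable genotype discards it.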

The second paper presents a new method for estimating an individual's ancestry in terms of admixture proportions from panels with low-coverage next-generation sequencing data. Unlike previous methods for inferring ancestry, this method does not require exact knowledge of individual genotypes. Instead, it estimates admixture proportions from a likelihood model based on so-called genotype likelihoods using an accelerated EM-algorithm. Using simulations as well as publicly available sequencing data, we demonstrate that the contributed method is highly accurate even for very low-depth sequencing data and that the application of previous methods to genotypes called from low-coverage sequencing data can introduce severe biases.
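
To make the idea of estimating admixture proportions directly from genotype likelihoods concrete, the sketch below runs a plain (non-accelerated) EM-algorithm for a single individual. It is a simplified illustration under stated assumptions, not the method of the paper: the ancestral allele frequencies are assumed known and strictly between 0 and 1, sites are treated as independent, and all function names are hypothetical.

```python
import math


def genotype_prior(h):
    """P(G = g | h) for g = 0, 1, 2, with h the individual's allele frequency."""
    return [(1 - h) ** 2, 2 * h * (1 - h), h ** 2]


def log_likelihood(q, gls, freqs):
    """Log-likelihood of admixture proportions q given genotype likelihoods."""
    total = 0.0
    for gl, f in zip(gls, freqs):
        # Individual-specific allele frequency: mixture over ancestral populations.
        h = sum(qk * fk for qk, fk in zip(q, f))
        total += math.log(sum(l * p for l, p in zip(gl, genotype_prior(h))))
    return total


def em_step(q, gls, freqs):
    """One EM update of the admixture proportions q."""
    new_q = [0.0] * len(q)
    for gl, f in zip(gls, freqs):
        h = sum(qk * fk for qk, fk in zip(q, f))
        weighted = [l * p for l, p in zip(gl, genotype_prior(h))]
        # Posterior expected number of copies of the tested allele at this site.
        dosage = (weighted[1] + 2 * weighted[2]) / sum(weighted)
        for k, (qk, fk) in enumerate(zip(q, f)):
            # Expected number of the two allele copies that derive from population k.
            new_q[k] += dosage * qk * fk / h + (2 - dosage) * qk * (1 - fk) / (1 - h)
    return [v / (2 * len(gls)) for v in new_q]


def estimate_admixture(gls, freqs, n_iter=200):
    """Run EM from equal proportions; gls[j] and freqs[j] describe site j."""
    k = len(freqs[0])
    q = [1.0 / k] * k
    for _ in range(n_iter):
        q = em_step(q, gls, freqs)
    return q
```

Here `gls[j]` holds the three genotype likelihoods at site `j` and `freqs[j]` holds the allele frequency of each ancestral population at that site. Each update keeps the proportions summing to one and does not decrease the likelihood. No genotype is ever called, which is the point of working with genotype likelihoods at low depth.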

In the third paper we introduce a new method for association testing based on SNP chip genotype data from recently admixed populations that allows the effect sizes to depend on the ancestry of the tested allele. The method does not rely on accurate inference of the local ancestry, which is not directly observable. Instead, the unobserved local ancestry is accounted for via the posterior probabilities of local ancestry conditional on the observed genotype data, and ancestry-specific effects are estimated from the full likelihood model using an EM-algorithm. Our simulations show that the contributed method gives a dramatic increase in statistical power to detect association in some scenarios. In addition, the method contributes a test of the hypothesis that the effect sizes do not depend on the ancestry of the allele. If this test is significant, it suggests that an identified lead SNP is not causal. The usefulness of the contributed method is demonstrated on data from the recently admixed Greenlandic population.

The last manuscript, based on work in progress, describes a new method for inferring imbalanced allelic transcription from RNA sequencing data. The method differs from previous methods in that it accounts for the well-known inherent over-dispersion in re-sequencing data and in that it combines information across individuals to form a population-based measure of allelic imbalance. Our simulations show that the contributed method leads to better discrimination between genes subject to allelic imbalance and those with balanced expression, and that it provides control of the number of false positive inferences of allelic imbalance in individuals. We further demonstrate that combining information across individuals to form a population-based measure of allele-specific expression allows powerful detection of genes experiencing modest degrees of allelic imbalance.