Analysing human genomes at different scales

Research output: Book/Report › Ph.D. thesis › Research

  • Siyang Liu
The rise of next-generation sequencing (NGS) technologies over the past decade has
revolutionized the field of human genetics. A wave of large-scale whole-genome
sequencing studies of humans is now under way around the world. These studies vary
greatly in cohort composition, sequencing strategy and sample size. One of the main
considerations when designing such a project is the trade-off between the number of
sequenced individuals and the per-sample sequencing depth. Several statistical models
and theories have been established, and the models and methods are ultimately
validated by the analysis of real data.
This thesis covers studies from two human genome sequencing projects that differ
markedly in studied population, sample size and sequencing depth.
In the first project, we sequenced 150 Danish individuals from 50 trio families to 78x
coverage. The sophisticated experimental design enables high-quality de novo
assembly of the genomes and provides a good opportunity to map structural
variation in the human population. We developed the AsmVar approach to discover,
genotype and characterize structural variants from the assemblies. Our assembly-based
method increases the power to accurately discover large and complex structural
variants. We identified and validated extensive structural variation in the human
population, including many novel insertions. The structural variants have an
almost symmetric size spectrum and are generally in high linkage disequilibrium with
known SNPs. They derive from various mechanisms that can be inferred from
the sequence breakpoints. In addition, we identified five novel structural variation
association signals in the key FTO gene region when using the Danish reference panel
to impute genotypes in the Danish Genetics of Overweight Young Adults (GOYA)
obesity cohort, demonstrating the clinical utility of the Danish reference panel in
genome-wide association studies.
In the second project, we collected ultra-low-depth sequencing data from more than
140,000 Chinese pregnant women. We developed and applied novel methods to
analyse these data, which are accumulating rapidly and now reach a scale of millions
of samples. We show that we are able to discover mutations with allele frequencies down to
around 0.2% and to explore fine-scale population structure and ancestry across the 31
administrative divisions and the 45 ethnic groups in the country. Most importantly, we
achieved a median imputation accuracy of 0.92 for 737K polymorphic loci. Association
studies of two common traits, height and body mass index, on the imputed loci replicated
many previously known association loci and revealed several new genome-wide
significant signals. While the large number of samples and the low per-sample
sequencing depth posed enormous methodological and computational challenges,
we demonstrated the utility of this design for population genetics and medical genomics.
Original language: English
Publisher: Department of Biology, Faculty of Science, University of Copenhagen
Publication status: Published - 2017

ID: 187009478