The recent advent of high-throughput sequencing of nucleic acids (RNA and DNA) has vastly expanded research into the functional and structural biology of the genomes of all living organisms (and even a few dead ones). With this exponential growth in biological data generation come equally large demands in data handling, analysis and interpretation, perhaps defining the modern challenge for the computational biologist of the post-genomic era.
The first part of this thesis is a general introduction to the history, common terms and challenges of next-generation sequencing, focusing on frequently encountered problems in data processing, such as quality assurance, mapping, normalization, visualization, and interpretation.
The second part presents studies that address problems in two flavors of next-generation sequencing.
For the first flavor, RNA-sequencing, article I presents a study of how knockout of the nonsense-mediated RNA decay system in Mus affects alternative RNA splicing, using digital gene expression and a custom-built exon-exon junction mapping pipeline. Evolved from this work, article II presents spliceR, a Bioconductor package for classifying alternative splicing events and the coding potential of isoforms produced by full isoform deconvolution software such as Cufflinks. Finally, article III presents a study using 5’-end RNA-seq to detect alternative promoter usage between healthy individuals and patients with acute promyelocytic leukemia.
For the second flavor, DNA-seq, article IV presents genome-wide profiling of the transcription factor CEBP/A in liver cells undergoing regeneration after partial hepatectomy.