Metagenomic data mining: Benchmarking and developing technologies to identify biomarkers from large metagenomics data sets

Research output: Book/Report › Ph.D. thesis › Research

Metagenomics is an interdisciplinary field that emerged two decades ago. It bypassed the traditional culture-based framework, extended the boundaries of microbiology research, and opened a view of microbial genetics and phylogenetics at the molecular level. Advanced by developments in bioinformatics and successive generations of DNA sequencing technologies, mining information from large sequencing data sets has become a pioneering task for revealing connections and relationships between microorganisms and their “theatre of activity”: their metabolites, hosts and environmental conditions.

Shotgun metagenomics approaches are advantageous in providing both taxonomic and functional information. Novel sequencing platforms have emerged in the past few years. In 2017 I conducted a benchmark study of the BGISEQ-500 sequencer, which offered an alternative with lower cost and higher output than the dominant products at that time. I designed an experiment in which 20 stool samples were sequenced for metagenomics on the BGISEQ-500, HiSeq 2000 and HiSeq 4000 for comparison. The results provided the first set of performance metrics for human gut metagenomic sequencing data from the BGISEQ-500, and the accuracy and technical reproducibility confirmed the applicability of the new platform for metagenomic studies.

Building on this work, I led a case study that applied a multi-omics approach to comprehensively analyze the influence of the gut microbiota, tumor mutational burden (TMB) and host genetic mutations on the outcome of immune checkpoint therapy (ICT) in non-small cell lung cancer patients. In collaboration with experts in oncology and human genetics, a list of metagenomic biomarkers was identified that differed in abundance in relation to responses to ICT, enabling the development of models that robustly predict the probability that a given patient would benefit from ICT. In brief, the results emphasize the potential of gut microbiota analyses for predicting the outcome before ICT is initiated.

Recognizing the limits of taxonomic classification with current metagenomics technology, I turned to the typical targeted metagenomics approach. At present, rRNA gene amplicon sequencing is limited either by hypervariable regions that are too short for classification or by insufficient throughput to cover a microbial community. I developed a novel bioinformatics approach that assembles long fragments of rRNA gene sequences at the single-molecule level using a read-linkage barcoding strategy. Mock and soil samples were used to benchmark its performance in recovering bacterial rRNA genes (16S and 23S rRNA) and fungal ITS regions (covering partial 18S, ITS1, 5.8S, ITS2, and partial 28S rRNA sequences) and in classifying them to the species level.
Original language: English
Publisher: Department of Biology, Faculty of Science, University of Copenhagen
Number of pages: 96
Publication status: Published - 2021

ID: 273017542