Semi-automated assembly of high-quality diploid human reference genomes

Research output: Contribution to journal › Journal article › Research › peer-review

Erich D. Jarvis
Giulio Formenti
Arang Rhie
Andrea Guarracino
Chentao Yang
Jonathan Wood
Alan Tracey
Francoise Thibaud-Nissen
Mitchell R. Vollger
David Porubsky
Haoyu Cheng
Mobin Asri
Glennis A. Logsdon
Paolo Carnevali
Mark J.P. Chaisson
Chen Shan Chin
Sarah Cody
Joanna Collins
Peter Ebert
Merly Escalona
Olivier Fedrigo
Robert S. Fulton
Lucinda L. Fulton
Shilpa Garg
Jennifer L. Gerton
Jay Ghurye
Anastasiya Granat
Richard E. Green
William Harvey
Patrick Hasenfeld
Alex Hastie
Marina Haukness
Erich B. Jaeger
Miten Jain
Melanie Kirsche
Mikhail Kolmogorov
Jan O. Korbel
Sergey Koren
Jonas Korlach
Joyce Lee
Daofeng Li
Tina Lindsay
Julian Lucas
Feng Luo
Tobias Marschall
Matthew W. Mitchell
Jennifer McDaniel
Fan Nie
Guojie Zhang
Heng Li
Human Pangenome Reference Consortium

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society^1,2. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals^3,4. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome^5. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity^6. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.

Original language	English
Journal	Nature
Volume	611
Issue number	7936
Number of pages	34
ISSN	0028-0836
DOIs	https://doi.org/10.1038/s41586-022-05325-5
Publication status	Published - 2022

ID: 330467025

Department of Biology