The European Reference Genome Atlas: piloting a decentralised approach to equitable biodiversity genomics

Research output: Working paper > Preprint > Research

Documents

  • Preprint

    Submitted manuscript, 2.16 MB, PDF document

  • Ann M Mc Cartney
  • Giulio Formenti
  • Alice Mouton
  • Diego De Panis
  • Luisa S Marins
  • Henrique G Leitão
  • Genevieve Diedericks
  • Joseph Kirangwa
  • Marco Morselli
  • Judit Salces-Ortiz
  • Nuria Escudero
  • Alessio Iannucci
  • Chiara Natali
  • Hannes Svardal
  • Rosa Fernández
  • Tim De Pooter
  • Geert Joris
  • Mojca Strazisar
  • Jo Wood
  • Katie E Herron
  • Ole Seehausen
  • Phillip C Watts
  • Felix Shaw
  • Robert P Davey
  • Alice Minotto
  • José M Fernández
  • Astrid Böhne
  • Carla Alegria
  • Tyler Alioto
  • Paulo C Alves
  • Isabel R Amorim
  • Jean-Marc Aury
  • Niclas Backstrom
  • Petr Baldrian
  • Laima Baltrunaite
  • Endre Barta
  • Bertrand Bed’Hom
  • Caroline Belser
  • Johannes Bergsten
  • Laurie Bertrand
  • Helena Bilandžija
  • Mahesh Binzer-Panchal
  • Iliana Bista
  • Mark Blaxter
  • Paulo AV Borges
  • Guilherme Borges Dias
  • Mirte Bosse
  • Shilpa Garg
  • Ole Madsen
A global genome database of all of Earth’s species diversity could be a treasure trove of scientific discoveries. However, despite major advances in genome sequencing technologies, only a tiny fraction of species have genomic information available. To contribute to a more complete planetary genomic database, scientists and institutions across the world have united under the Earth BioGenome Project (EBP), which plans to sequence and assemble high-quality reference genomes for all 1.5 million recognized eukaryotic species through a stepwise phased approach. As the initiative transitions into Phase II, where 150,000 species are to be sequenced in just four years, worldwide participation in the project will be fundamental to success. As the European node of the EBP, the European Reference Genome Atlas (ERGA) seeks to implement a new decentralised, accessible, equitable and inclusive model for producing high-quality reference genomes, which will inform EBP as it scales. To embark on this mission, ERGA launched a Pilot Project to establish a network across Europe to develop and test the first infrastructure of its kind for coordinated and distributed reference genome production on 98 European eukaryotic species from sample providers across 33 European countries. Here we outline the process and challenges faced during the development of a pilot infrastructure for the production of reference genome resources, and explore the effectiveness of this approach in terms of high-quality reference genome production, considering also equity and inclusion.
The outcomes and lessons learned during this pilot provide a solid foundation for ERGA while offering key learnings to other transnational and national genomic resource projects.

Competing Interest Statement: The authors have declared no competing interest.

Biodiversity genomics: The application of genomic methods to research biodiversity.
BUSCO: A bioinformatic method (Benchmarking Universal Single-Copy Orthologues) used to estimate the completeness of the coding fraction of an organism’s genome based on the proportion of (lineage-specific) single-copy orthologous genes that are found in a genome assembly [51].
INSDC: The International Nucleotide Sequence Database Collaboration (https://www.insdc.org/) is an initiative between the DDBJ, EMBL-EBI and NCBI that together act as a global repository of sequence data and associated metadata, and provide tools and services that allow access to genomic resources.
Reference genome: An accepted standard representation of an organism’s DNA sequence. High-quality reference genomes typically have high completeness (chromosome-level with few gaps in sequence), few errors, and are annotated and accessible. A reference genome serves as a tool for alignment-based analyses, such as variant calling or RNAseq, and has many other applications, for example, phylogenetics and evolutionary relationships, identification of genes and variants, functional analysis and comparative genomics. Reference genomes referred to as “drafts” are those that are under active construction and refinement, and not yet finalised through manual curation.
Genomic resource: A genomic resource, for the purpose of this manuscript, refers to a reference genome, genome annotation, voucher specimen, cryopreserved sample and comprehensive metadata.
FAIR Principles: A set of principles to guide appropriate management and curation of scientific data (https://www.go-fair.org/fair-principles/) that emphasise data accessibility and use by ensuring that data are Findable, Accessible, Interoperable, and Reusable. Due to the increasing amount of scientific data being reposited, FAIR guidelines promote a data format that is amenable to automated computational access of data by stakeholders [63].
CARE Principles: The CARE principles for Indigenous data governance (https://www.gida-global.org/care) provide a governance framework that supports the recognition of Indigenous Peoples’ rights and interests in their physical and digital data as well as their Indigenous Knowledges [64].
Metadata: A collection of data that provides contextual information about multiple characteristics of other, corresponding original data.
Voucher: A voucher specimen is a permanently preserved object (either whole or in part, and/or physical or digital) of an identified organism (verified by a recognised expert) and which is deposited in an accessible facility or database. A voucher provides physical evidence about any specimen’s taxonomic identity [14]. Voucher deposition is a best practice for conducting biodiversity genomics research.
(Genome) annotation: The process of identifying the functions of different pieces of a genome. This includes genes that code for proteins and non-coding features (e.g. intron-exon structure of protein-coding genes, promoters, transposable elements). Typically performed using computational methods, followed by manual curation.
(Genome) completeness: An estimate of how well a reference genome represents the complete sequence of the target organism. A complete genome should equal the haploid genome size of the target, but may be defined when ‘all chromosomes are gapless and have no runs of 10 or more ambiguous bases, there are no unplaced or unlocalized scaffolds, and all expected chromosomes are present.’ (https://www.ncbi.nlm.nih.gov/assembly/). Completeness can be estimated in different ways, for example with BUSCO or by analysing K-mers.
Library: DNA, cDNA, or RNA that has been prepared for NGS within (usually) a specific size range and containing adapters, which are designed to be appropriate for (a) specific sequencing platform(s).
(Genome) assembly: A genome assembly is a representation of an organism’s genome that is made using computer programs to turn (assemble) raw sequence data into longer, continuous sequences.
PUID: A permanent unique identifier is a unique label for an object that does not change, such as the Digital Object Identifier (DOI) attached to a scientific publication.
ENA: The European Nucleotide Archive (https://www.ebi.ac.uk/ena) is a global repository for sequence data and provides resources that support management and access to sequence data.
Equity Deserving: According to the Canadian Council (https://canadacouncil.ca/glossary/equity-seeking-groups), equity deserving groups are those individual researchers, communities, Peoples, regions or countries that have identified barriers to equal access, opportunities, and resources due to disadvantage and/or discrimination and that are actively seeking, and deserving of, social justice and reparation. The discrimination experienced could be caused by attitudinal, historic, social, and environmental barriers that could be based on a plethora of characteristics including (but not limited to) sex, age, ethnicity, disability, economic status, gender, gender expression, nationality, race, sexual orientation, and creed.
COPO: The Collaborative OPen Omics (COPO) platform is for researchers to publish their research assets, providing metadata annotation and deposition capability. It allows researchers to describe their datasets according to community standards and broker the submission of such data to appropriate repositories whilst tracking the resulting accessions/identifiers [28].
Open data: Open data are freely accessible and unrestricted data that can be accessed, used, reused and shared with third parties for any purpose.
HSM: Hierarchical Storage Management is both a data management and data storage technique which transparently manages the movement of data between the different layers of a tiered storage based on file size thresholds, usage and I/O pressure. Usually, a tiered storage is composed of one or more layers of disk arrays, ordered by capacity, latency, redundancy and storage cost. A slow but economically effective archival layer is at the bottom, composed of magnetic tape libraries and automated tape robots, with the highest capacity and latency. The movement between layers is automatically triggered.
ONT: Oxford Nanopore Technologies (ONT; https://nanoporetech.com/) is a next generation sequencing technology whereby sequence data are generated from the changes in current that occur as single-stranded DNA or RNA molecules pass through nanoscale protein pores (nanopores). ONT provides long read data (up to several megabases) that facilitate genome assembly [65,66].
PacBio: Pacific Biosciences (PacBio; https://www.pacb.com/) is a single-molecule, real time (SMRT) next generation sequencing technology in which sequence data are generated by fluorescent light emission that occurs when a DNA polymerase adds nucleotides. PacBio produces long read data (tens of kilobases) that facilitate genome assembly.
HiFi reads: HiFi (High Fidelity) PacBio reads are produced by taking multiple sequences of the same molecule to provide a consensus sequence that is usually 12-20 kbp long and has a low error rate (accuracy >99.9%) [67].
Hi-C: Sequencing-based method used to study three-dimensional interactions among chromatin regions by measuring the frequency of contact between pairs of loci. Since contact frequency is related to the distance between a pair of loci, Hi-C linking information is used to help with scaffolding stages during a genome assembly process.
Hi-C map / graph production: The occurrence and frequency of Hi-C contacts are analysed and used in assembly scaffolding. They are typically visualised in Hi-C 2D heatmaps with the full genome sequence on the X and Y axis and a markup for each observed contact.
Omni-C: Modified version of Hi-C that uses a sequence-independent endonuclease during its protocol to produce more even sequence coverage, increasing overall resolution.
RNA-Seq: RNA-Seq is a technique that determines the complete or partial RNA sequence using NGS. The RNA expression profiles vary in different tissues of the same organism and can be influenced by physiopathological circumstances. RNA-Seq data facilitate genome assembly by providing empirical evidence for annotation of transcribed regions [68].
IsoSeq: This is a sequencing protocol developed by PacBio that aims to sequence full-length transcripts using the accurate, long read capabilities of PacBio HiFi technology. IsoSeq data facilitate analysis of transcriptomes and genome annotation by identifying full-length isoforms of transcripts.
Haplotype: A haplotype refers to the collection of genetic material within an organism that is inherited together. Haplotype may be used to describe a few loci or any number of chromosomes (a chromosome-scale haplotype).
K-mer: A K-mer is a DNA sequence of length k; for example, the sequence AGCT contains the 3-mers (K-mers of length 3) AGC and GCT.
Transcriptome: A transcriptome is a set of aligned RNAseq reads representing RNA collected from a sample or collection of samples. This includes both protein-coding and non-coding transcripts. For the ERGA Pilot Project, poly-A+ transcripts were profiled.
Interested Parties: This term, for the purposes of this manuscript, refers to the range of external stakeholders (e.g., commercial companies, policymakers etc.) and rights holders (e.g., Indigenous Peoples) that have an interest in biodiversity genomics research.
EBP Genome assembly quality standard 6.C.Q40: Minimum reference standard of 6.C.Q40, i.e. megabase N50 contig continuity and chromosomal scale N50 scaffolding, with less than 1/10,000 error rate. For species with chromosome N50 smaller than a megabase this will be C.C.Q40. Additional recommendations include K-mer completeness >90%, BUSCO complete single-copy >90%, BUSCO complete duplicated <5% and Gaps/Gbp <1000.
Widening Country: Widening countries are countries with low participation rates in FP7 and H2020 projects (low level of investment into research and innovation (R&I)). According to the Horizon Europe regulation the Widening countries are: Bulgaria, Croatia, Cyprus, Czech Republic, Estonia, Greece, Hungary, Latvia, Lithuania, Malta, Poland, Portugal, Romania, Slovakia, Slovenia and all associated countries with equivalent characteristics in terms of R&I performance and the Outermost Regions.
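
The K-mer entry and the EBP quality standard entry above can be illustrated with a few lines of Python. This is only a minimal sketch, not tooling from ERGA or the EBP: the function names `kmers`, `phred_to_error_rate` and `n50` are hypothetical, and the N50 definition used here (the length at which sequences of that length or longer cover at least half of the total assembly length) is the commonly used one, assumed rather than taken from the manuscript.

```python
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return every overlapping substring of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality value (e.g. the 40 in Q40) to an error probability."""
    return 10 ** (-q / 10)


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total.

    Assumed, commonly used definition; returns 0 for an empty list.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

Under these assumptions, `kmers("AGCT", 3)` returns the two 3-mers named in the glossary, AGC and GCT, and `phred_to_error_rate(40)` corresponds to the 1/10,000 error rate quoted for Q40. The `n50` helper could be applied to contig or scaffold lengths to check the megabase contig N50 part of the standard.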
Original language: English
Number of pages: 51
DOIs
Publication status: Published - 2023
Series: bioRxiv

ID: 381217784