Fast and accurate out-of-core PCA framework for large scale biobank data

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Fast and accurate out-of-core PCA framework for large scale biobank data. / Li, Zilong; Meisner, Jonas; Albrechtsen, Anders.

In: Genome Research, Vol. 33, No. 9, 2023, p. 1599-1608.

Harvard

Li, Z, Meisner, J & Albrechtsen, A 2023, 'Fast and accurate out-of-core PCA framework for large scale biobank data', Genome Research, vol. 33, no. 9, pp. 1599-1608. https://doi.org/10.1101/gr.277525.122

APA

Li, Z., Meisner, J., & Albrechtsen, A. (2023). Fast and accurate out-of-core PCA framework for large scale biobank data. Genome Research, 33(9), 1599-1608. https://doi.org/10.1101/gr.277525.122

Vancouver

Li Z, Meisner J, Albrechtsen A. Fast and accurate out-of-core PCA framework for large scale biobank data. Genome Research. 2023;33(9):1599-1608. https://doi.org/10.1101/gr.277525.122

Author

Li, Zilong; Meisner, Jonas; Albrechtsen, Anders. / Fast and accurate out-of-core PCA framework for large scale biobank data. In: Genome Research. 2023; Vol. 33, No. 9. pp. 1599-1608.

Bibtex

@article{fa8cd3766cc04a8b95e91eea24c5c23e,
title = "Fast and accurate out-of-core PCA framework for large scale biobank data",
abstract = "Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that enables accelerated convergence while improving the accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations for the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.",
author = "Zilong Li and Jonas Meisner and Anders Albrechtsen",
note = "Publisher Copyright: {\textcopyright} 2023 Li et al.",
year = "2023",
doi = "10.1101/gr.277525.122",
language = "English",
volume = "33",
pages = "1599--1608",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "9",
}

RIS

TY - JOUR

T1 - Fast and accurate out-of-core PCA framework for large scale biobank data

AU - Li, Zilong

AU - Meisner, Jonas

AU - Albrechtsen, Anders

N1 - Publisher Copyright: © 2023 Li et al.

PY - 2023

Y1 - 2023

N2 - Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that enables accelerated convergence while improving the accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations for the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.

AB - Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that enables accelerated convergence while improving the accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations for the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.

U2 - 10.1101/gr.277525.122

DO - 10.1101/gr.277525.122

M3 - Journal article

C2 - 37620119

AN - SCOPUS:85174530084

VL - 33

SP - 1599

EP - 1608

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 9

ER -

ID: 371695499