Pseudoreplication in genomics-scale data sets

Department of Biology

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Pseudoreplication in genomics-scale data sets. / Waples, Robin S.; Waples, Ryan K.; Ward, Eric J.

In: Molecular Ecology Resources, Vol. 22, No. 2, 2022, p. 503-518.

Harvard

Waples, RS, Waples, RK & Ward, EJ 2022, 'Pseudoreplication in genomics-scale data sets', Molecular Ecology Resources, vol. 22, no. 2, pp. 503-518. https://doi.org/10.1111/1755-0998.13482

APA

Waples, R. S., Waples, R. K., & Ward, E. J. (2022). Pseudoreplication in genomics-scale data sets. Molecular Ecology Resources, 22(2), 503-518. https://doi.org/10.1111/1755-0998.13482

Vancouver

Waples RS, Waples RK, Ward EJ. Pseudoreplication in genomics-scale data sets. Molecular Ecology Resources. 2022;22(2):503-518. https://doi.org/10.1111/1755-0998.13482

Author

Waples, Robin S. ; Waples, Ryan K. ; Ward, Eric J. / Pseudoreplication in genomics-scale data sets. In: Molecular Ecology Resources. 2022 ; Vol. 22, No. 2. pp. 503-518.

Bibtex

@article{aad34a296d2c4e8b9b118786959fc87d,

title = "Pseudoreplication in genomics-scale data sets",

abstract = "In genomics-scale datasets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df{\textquoteright}) compared to the nominal degrees of freedom, df. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here we measured pseudoreplication (quantified by the ratio df{\textquoteright}/df) for a common metric of genetic differentiation ($F_{ST}$) and a common measure of linkage disequilibrium between pairs of loci ($r^2$). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df{\textquoteright} and df{\textquoteright}/df by measuring the rate of decline in the variance of mean $F_{ST}$ and mean $r^2$ as more loci were used. For both indices, df{\textquoteright} increases with $N_e$ and genome size, as expected. However, even for large $N_e$ and large genomes, df{\textquoteright} for mean $r^2$ plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for $F_{ST}$, but df{\textquoteright}/df $\leq$ 0.01 can occur in datasets using tens of thousands of loci. Commonly-used block-jackknife methods consistently overestimated var($F_{ST}$), producing very conservative confidence intervals. Predicting df{\textquoteright} based on our modeling results as a function of $N_e$, L, S, and genome size provides a robust way to quantify precision associated with genomics-scale datasets.",

keywords = "degrees of freedom, FST, genome size, jackknife variance, linkage disequilibrium, Ne, simulations",

author = "Waples, {Robin S.} and Waples, {Ryan K.} and Ward, {Eric J.}",

year = "2022",

doi = "10.1111/1755-0998.13482",

language = "English",

volume = "22",

pages = "503--518",

journal = "Molecular Ecology Resources",

issn = "1755-098X",

publisher = "Wiley-Blackwell",

number = "2",

}

RIS

TY - JOUR

T1 - Pseudoreplication in genomics-scale data sets

AU - Waples, Robin S.

AU - Waples, Ryan K.

AU - Ward, Eric J.

PY - 2022

Y1 - 2022

N2 - In genomics-scale datasets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df’) compared to the nominal degrees of freedom, df. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here we measured pseudoreplication (quantified by the ratio df’/df) for a common metric of genetic differentiation (FST) and a common measure of linkage disequilibrium between pairs of loci (r2). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df’ and df’/df by measuring the rate of decline in the variance of mean FST and mean r2 as more loci were used. For both indices, df’ increases with Ne and genome size, as expected. However, even for large Ne and large genomes, df’ for mean r2 plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for FST, but df’/df ≤0.01 can occur in datasets using tens of thousands of loci. Commonly-used block-jackknife methods consistently overestimated var(FST), producing very conservative confidence intervals. Predicting df’ based on our modeling results as a function of Ne, L, S, and genome size provides a robust way to quantify precision associated with genomics-scale datasets.

KW - degrees of freedom

KW - FST

KW - genome size

KW - jackknife variance

KW - linkage disequilibrium

KW - Ne

KW - simulations

U2 - 10.1111/1755-0998.13482

DO - 10.1111/1755-0998.13482

M3 - Journal article

C2 - 34351073

VL - 22

SP - 503

EP - 518

JO - Molecular Ecology Resources

JF - Molecular Ecology Resources

SN - 1755-098X

IS - 2

ER -