Accounting for long-range correlations in genome-wide simulations of large cohorts

Open Access

5 May 2020

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Genetics

Vol. 16 (5), e1008619
https://doi.org/10.1371/journal.pgen.1008619

Abstract

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.

Coalescent theory has provided deep theoretical insight into patterns of human diversity. Implementations of coalescent models in simulation software such as ms have further provided tools to interpret thousands of genomic studies. Recent technical progress has allowed for a dramatic increase in the scale at which genomes can be both measured and simulated, opening up opportunities for a finer understanding of evolutionary biology. However, we show that coalescent simulations of long regions of the genome exhibit large biases in sample relatedness, distorting haplotype sharing and ancestry patterns in simulated cohorts. We trace these biases to basic assumptions of the coalescent model, and show how the assumptions can be relaxed to provide a better description of the observed patterns of genetic polymorphism at a fraction of the computational cost.
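The small-sample assumption and the hybrid scheme mentioned in the abstract can be illustrated with a toy sketch. The code below is not the msprime implementation and is not taken from the paper: it assumes a haploid population of constant size N with no recombination, and all function names are hypothetical. It tracks only the number of ancestral lineages. Under the discrete Wright-Fisher model, several lineages may merge in a single generation, whereas the coalescent approximation allows only one pairwise merger at a time, which is reasonable only when the sample is small relative to N.

```python
import random


def wf_step(k, N, rng):
    """One generation back in a haploid Wright-Fisher model (toy assumption).

    Each of the k lineages picks a parent uniformly among N individuals.
    Returns the number of distinct parents, so any number of mergers
    (including multiple and simultaneous ones) can occur in one generation.
    """
    return len({rng.randrange(N) for _ in range(k)})


def expected_distinct_parents(k, N):
    """Exact expected number of distinct parents of k lineages in one generation."""
    return N * (1.0 - (1.0 - 1.0 / N) ** k)


def hybrid_tmrca(n, N, wf_generations, rng):
    """Time to the most recent common ancestor, in generations.

    Recent past: discrete Wright-Fisher steps for up to `wf_generations`.
    Distant past: coalescent approximation, with exponential waiting times
    at rate k(k-1)/(2N) per generation and exactly one merger per event.
    """
    k, t = n, 0
    while k > 1 and t < wf_generations:
        k = wf_step(k, N, rng)
        t += 1
    while k > 1:
        rate = k * (k - 1) / (2.0 * N)  # haploid pairwise rate is 1/N
        t += rng.expovariate(rate)
        k -= 1
    return t


if __name__ == "__main__":
    rng = random.Random(1)
    N = 10_000
    for n in (10, 1_000):
        lost = n - expected_distinct_parents(n, N)
        print(f"n={n}: expected lineages lost in one generation = {lost:.3f}")
    print("hybrid TMRCA:", hybrid_tmrca(1_000, N, wf_generations=100, rng=rng))
```

When the expected number of lineages lost per generation is well below one, the two models nearly agree; when it is much larger than one, many mergers happen at once and the one-merger-at-a-time approximation no longer holds for the recent past. Setting `wf_generations=0` gives a purely coalescent run of this toy model, and a very large value gives a purely Wright-Fisher run.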

Funding Information

Canadian Institutes of Health Research (MOP-136855)
Wellcome Trust (100956/Z/13/Z)

