An Accurate Sequentially Markov Conditional Sampling Distribution for the Coalescent With Recombination
Open Access
- 1 April 2011
- journal article
- Published by Oxford University Press (OUP) in Genetics
- Vol. 187 (4), 1115-1128
- https://doi.org/10.1534/genetics.110.125534
Abstract
The sequentially Markov coalescent is a simplified genealogical process that aims to capture the essential features of the full coalescent model with recombination, while being scalable in the number of loci. In this article, the sequentially Markov framework is applied to the conditional sampling distribution (CSD), which is at the core of many statistical tools for population genetic analyses. Briefly, the CSD describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. A hidden Markov model (HMM) formulation of the sequentially Markov CSD is developed here, yielding an algorithm with time complexity linear in both the number of loci and the number of haplotypes. This work provides a highly accurate, practical approximation to a recently introduced CSD derived from the diffusion process associated with the coalescent with recombination. It is empirically demonstrated that the improvement in accuracy of the new CSD over previously proposed HMM-based CSDs increases substantially with the number of loci. The framework presented here can be adopted in a wide range of applications in population genetics, including imputing missing sequence data, estimating recombination rates, and inferring human colonization history.
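To make the idea of an HMM-based CSD concrete, the following is a minimal sketch of a generic copying-model forward pass, in the spirit of the simpler, previously proposed HMM-based CSDs the abstract refers to. It is not the sequentially Markov CSD developed in the article, whose hidden state is not described in the abstract. The function name `copying_csd` and the parameters `switch_prob` and `mismatch_prob` are hypothetical, and alleles are assumed to be biallelic (0 or 1). The sketch only illustrates how such a computation can run in time linear in both the number of loci and the number of haplotypes.

```python
from itertools import product


def copying_csd(new_hap, panel, switch_prob, mismatch_prob):
    """Probability of new_hap given panel under a simple copying HMM.

    Hidden state: which panel haplotype is copied at each locus.
    Assumption (illustrative only): uniform initial state, a constant
    per-locus switch probability, and a constant mismatch probability
    for biallelic (0/1) alleles.
    """
    n = len(panel)
    num_loci = len(new_hap)

    def emission(k, locus):
        # The copied allele is observed unchanged with prob 1 - mismatch_prob.
        if panel[k][locus] == new_hap[locus]:
            return 1.0 - mismatch_prob
        return mismatch_prob

    # Forward variable at the first locus: uniform choice of haplotype.
    forward = [emission(k, 0) / n for k in range(n)]

    for locus in range(1, num_loci):
        # A switch lands on a uniformly chosen haplotype, so the switch term
        # is shared by all states. Computing it once keeps each locus O(n)
        # instead of O(n^2).
        jump = switch_prob * sum(forward) / n
        forward = [
            emission(k, locus) * ((1.0 - switch_prob) * forward[k] + jump)
            for k in range(n)
        ]

    return sum(forward)


# Hypothetical toy panel: three haplotypes typed at three biallelic loci.
panel = [(0, 1, 0), (0, 0, 1), (1, 1, 0)]
total = sum(
    copying_csd(h, panel, switch_prob=0.1, mismatch_prob=0.01)
    for h in product((0, 1), repeat=3)
)
```

Because the transition and emission probabilities are each normalized, the values returned for all possible additional haplotypes sum to 1 (up to floating-point error), so `total` serves as a quick sanity check that the function defines a proper conditional distribution.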