Inference of Population Structure using Dense Haplotype Data

Open Access

26 January 2012

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Genetics

Vol. 8 (1), e1002453
https://doi.org/10.1371/journal.pgen.1002453

Abstract

The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. 
The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/.

Author Summary

The first step in almost every genetic analysis is to establish how sample members are related to each other. High relatedness between individuals can arise if they share a small number of recent ancestors, e.g. if they are distant cousins, or a larger number of more distant ones, e.g. if their ancestors come from the same region. The most popular methods for investigating these relationships analyse successive markers independently, simply adding the information they provide. This works well for studies involving hundreds of markers scattered around the genome but is less appropriate now that entire genomes can be sequenced. We describe a “chromosome painting” approach to characterising shared ancestry that takes into account the fact that DNA is transmitted from generation to generation as a linear molecule in chromosomes. We show that the approach increases resolution relative to previous techniques, allowing differences in ancestry profiles among individuals to be resolved at the finest scales yet. We provide mathematical, statistical, and graphical machinery to exploit this new information and to characterize relationships at continental, regional, local, and family scales.
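As a rough illustration of the recipient/donor idea behind the coancestry matrix, the following Python sketch builds a toy matrix for the case where markers are treated as independent. It is not the ChromoPainter algorithm: the function name `toy_coancestry`, the 0/1 allele encoding, the rule that a recipient copies each marker with equal probability from any donor carrying the same allele, and the fallback used when no donor matches are all assumptions made for this example.

```python
def toy_coancestry(haplotypes):
    """Expected chunk counts under a toy unlinked-marker copying model.

    haplotypes: list of equal-length lists of 0/1 alleles, one per haplotype.
    Returns an n x n list of lists; entry [i][j] is the expected number of
    single-marker "chunks" that recipient i copies from donor j.
    """
    n = len(haplotypes)
    n_markers = len(haplotypes[0])
    matrix = [[0.0] * n for _ in range(n)]
    for i, recipient in enumerate(haplotypes):
        # Every other haplotype is a potential donor; nobody copies from itself.
        donors = [j for j in range(n) if j != i]
        for m in range(n_markers):
            matching = [j for j in donors if haplotypes[j][m] == recipient[m]]
            # Assumption for this sketch: if no donor carries the allele,
            # spread the chunk evenly over all donors.
            sources = matching or donors
            share = 1.0 / len(sources)
            for j in sources:
                matrix[i][j] += share
    return matrix


# Four hypothetical haplotypes forming two loosely similar pairs.
haps = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
coancestry = toy_coancestry(haps)
for row in coancestry:
    print(row)
```

Each row sums to the number of markers, because every marker of a recipient is attributed to some donor. A model that accounts for linkage disequilibrium would instead let runs of adjacent markers be copied from the same donor as a single chunk, which this sketch does not attempt.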

Cited by 1019 articles