Estimating Haplotype Frequency and Coverage of Databases

Open Access

22 December 2008

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 3 (12), e3988
https://doi.org/10.1371/journal.pone.0003988

Abstract

A variety of forensic, population, and disease studies are based on haploid DNA (e.g., mitochondrial DNA or Y-chromosome data). For any set of genetic markers, databases of conventional size will normally contain only a fraction of all haplotypes. For several applications, reliable estimates of haplotype frequencies, of the total number of haplotypes, and of the coverage of the database (the probability that the next random haplotype is contained in the database) will be useful. We propose different approaches to the problem based on classical methods as well as new applications of Principal Component Analysis (PCA). We also discuss previous proposals based on saturation curves. Several conclusions can be inferred from simulated and real data. First, classical estimates of the fraction of unseen haplotypes can be seriously biased. Second, there is no obvious way to decide on the required sample size based on traditional approaches. Methods based on hypothesis testing or on the length of confidence intervals may appear artificial, since no single test or parameter stands out as particularly relevant. Rather, the coverage may be more relevant, since it indicates the percentage of different haplotypes that are contained in a database. If the coverage is low, there is a considerable chance that the next haplotype to be observed does not appear in the database, which indicates that the database needs to be expanded. Finally, freeware and example data sets accompany the methods discussed in this paper: http://folk.uio.no/thoree/nhap/.
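
To make the notion of coverage concrete, the following minimal sketch computes two classical quantities from a list of observed haplotypes. It is an illustration under stated assumptions, not the method of the paper or of the accompanying freeware: the abstract does not say which estimators are used, so the sketch assumes the classical Good-Turing coverage estimate (one minus the fraction of the sample made up of haplotypes seen exactly once) and the bias-corrected Chao1 lower bound for the total number of haplotypes. The function names and the toy database are hypothetical.

```python
from collections import Counter


def good_turing_coverage(haplotypes):
    """Good-Turing coverage estimate: 1 - n1 / n.

    n  = number of sampled individuals,
    n1 = number of haplotypes observed exactly once (singletons).
    Assumption: this classical estimator stands in for whichever
    estimator the paper actually uses.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(haplotypes)
    n1 = sum(1 for c in counts.values() if c == 1)
    return 1.0 - n1 / n


def chao1_total(haplotypes):
    """Bias-corrected Chao1 lower bound on the number of distinct haplotypes.

    S_obs + n1 * (n1 - 1) / (2 * (n2 + 1)), where n2 is the number of
    haplotypes observed exactly twice. Also an assumed, illustrative choice.
    """
    counts = Counter(haplotypes)
    s_obs = len(counts)
    n1 = sum(1 for c in counts.values() if c == 1)
    n2 = sum(1 for c in counts.values() if c == 2)
    return s_obs + n1 * (n1 - 1) / (2 * (n2 + 1))


# Hypothetical toy database: 8 individuals, 5 distinct haplotypes,
# 3 of which are singletons.
database = ["H1", "H1", "H1", "H2", "H2", "H3", "H4", "H5"]
coverage = good_turing_coverage(database)
total = chao1_total(database)
print(f"estimated coverage: {coverage:.3f}")
print(f"estimated number of haplotypes (lower bound): {total:.1f}")
```

A low estimated coverage in such a calculation corresponds to the situation described in the abstract: many singletons relative to the sample size, and hence a considerable chance that the next haplotype is not yet in the database. As the abstract notes, classical estimates of this kind can be seriously biased, so the numbers are best read as rough indications.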
