Using Genetic Distance to Infer the Accuracy of Genomic Prediction
Open Access
- 2 September 2016
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Genetics
- Vol. 12 (9), e1006288
- https://doi.org/10.1371/journal.pgen.1006288
Abstract
The prediction of phenotypic traits using high-density genomic data has many applications, such as the selection of plants and animals of commercial interest, and it is expected to play an increasing role in medical diagnostics. Statistical models used for this task are usually tested using cross-validation, which implicitly assumes that new individuals (whose phenotypes we would like to predict) originate from the same population the genomic prediction model is trained on. In this paper we propose an approach based on clustering and resampling to investigate the effect of increasing genetic distance between training and target populations when predicting quantitative traits. This is important for plant and animal genetics, where genomic selection programs rely on the precision of predictions in future rounds of breeding. Estimating how quickly predictive accuracy decays is therefore important in deciding which training population to use and how often the model has to be recalibrated. We find that the correlation between true and predicted values decays approximately linearly with respect to either FST or mean kinship between the training and the target populations. We illustrate this relationship using simulations and a collection of data sets from mice, wheat and human genetics.
Author Summary
The availability of increasing amounts of genomic data is making the use of statistical models to predict traits of interest a mainstay of many applications in life sciences. Applications range from medical diagnostics for common and rare diseases to breeding characteristics such as disease resistance in plants and animals of commercial interest. We explored an implicit assumption of how such prediction models are often assessed: that the individuals whose traits we would like to predict originate from the same population as those that are used to train the models. This is often not the case, especially for plants and animals that are part of selection programs.
To study this problem we proposed a model-agnostic approach to infer the accuracy of prediction models as a function of two common measures of genetic distance. Using data from plant, animal and human genetics, we find that accuracy decays approximately linearly in either of those measures. Quantifying this decay has fundamental applications in all branches of genetics, as it measures how studies generalise to different populations.
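The relationship described above, predictive correlation falling off roughly linearly with genetic distance, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`allele_frequencies`, `fst_between`, `fit_linear_decay`), the 0/1/2 genotype coding, and the choice of a simple Nei-style FST estimator are all assumptions made for illustration, and the clustering and resampling steps of the paper are not reproduced here.

```python
import numpy as np


def allele_frequencies(genotypes):
    """Alternate-allele frequency per SNP from a 0/1/2 genotype matrix
    (rows are individuals, columns are SNPs). The coding is an assumption."""
    return np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0


def fst_between(geno_train, geno_target):
    """Nei-style FST between two groups, computed as a ratio of sums
    over SNPs. The estimator is an assumption of this sketch, not
    necessarily the one used in the paper."""
    p1 = allele_frequencies(geno_train)
    p2 = allele_frequencies(geno_target)
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if both groups were pooled with equal weight.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Average expected heterozygosity within the two groups.
    h_within = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    denom = h_total.sum()
    if denom == 0.0:
        # No polymorphic SNPs at all: treat the groups as undifferentiated.
        return 0.0
    return float((h_total - h_within).sum() / denom)


def fit_linear_decay(distances, accuracies):
    """Least-squares line: accuracy ~ intercept + slope * distance.
    `accuracies` would be correlations between true and predicted
    phenotypes, one per training/target split."""
    slope, intercept = np.polyfit(distances, accuracies, deg=1)
    return float(slope), float(intercept)


if __name__ == "__main__":
    # Toy genotypes, invented purely to exercise the functions.
    train = np.array([[0, 1, 2, 0], [1, 1, 2, 0], [0, 2, 2, 1]])
    target = np.array([[2, 1, 0, 0], [2, 0, 1, 1], [1, 1, 0, 2]])
    print("FST:", fst_between(train, target))
```

In use, one would compute a distance for each training/target split, record the predictive correlation obtained on that split, and pass both sequences to `fit_linear_decay`; the fitted slope is then a rough estimate of how quickly accuracy is lost as the target population drifts away from the training population.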