RAFFI: Accurate and fast familial relationship inference in large scale biobank studies using RaPID
Open Access
- 21 January 2021
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Genetics
- Vol. 17 (1), e1009315
- https://doi.org/10.1371/journal.pgen.1009315
Abstract
Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficient (ϕ) and the genome-wide probability of zero identity-by-descent (IBD) sharing (π0) for all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI first leverages the efficient RaPID method to call IBD segments, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admixture events, and varying marker densities, and achieves higher accuracy than KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
Author summary
Inferring familial relationships has a wide range of applications. Both family-based and population-based genome-wide association studies (GWAS) require genetic relationships. Relationship inference is essential when the familial structure is unknown, and it can be used to correct pedigree information that is wrong because of false paternity, sample switches, or unregistered adoption. Current approaches for inferring relationships do not scale to large cohorts comprising millions of individuals. Here, we present a fast and flexible method, called RAFFI, that uses identical-by-descent (IBD) segments. IBD segments are uninterrupted DNA segments inherited from a common ancestor.
Relationships are usually inferred by computing the kinship coefficients and the genome-wide probability of zero IBD sharing among all pairs of individuals. In the first step, we search for IBD segments using RaPID, which avoids a pairwise comparison of all individuals in a haplotype panel. In the second step, we compute the kinship coefficients to infer the relationships. To make our method robust against genotyping and phasing errors, the kinship-coefficient thresholds for the different degrees of relatedness are adjusted. As a result, the lower detection power for IBD segments caused by phasing errors or misspecification of the genotyping error rate will not compromise the inference of relationships.
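The second step described in the summary, turning detected IBD segments into a kinship coefficient and a degree of relatedness, can be illustrated with a minimal sketch. It is not RAFFI's implementation. It assumes that the IBD segments for one pair of individuals are already available as `(start_cM, end_cM)` intervals on a concatenated genetic map, that positions covered by two overlapping segments are IBD2, and that ϕ follows the standard identity ϕ = k1/4 + k2/2, where k1 and k2 are the genome fractions shared IBD1 and IBD2 and π0 = 1 - k1 - k2. The function names, the conventional power-of-two thresholds (as used by KING), and the π0 cut-off of 0.1 are assumptions for illustration; RAFFI's data-driven, adjusted thresholds are not given in this text.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float]  # (start_cM, end_cM) on a concatenated genetic map

# Conventional kinship cut-offs (powers of two, as used by KING).
# Placeholder values: RAFFI adjusts its thresholds from the data.
DEGREE_THRESHOLDS: List[Tuple[float, str]] = [
    (2 ** -1.5, "duplicate/MZ twin"),
    (2 ** -2.5, "1st degree"),
    (2 ** -3.5, "2nd degree"),
    (2 ** -4.5, "3rd degree"),
]


def ibd_fractions(segments: Iterable[Segment], genome_length_cm: float):
    """Return (k0, k1, k2): genome fractions sharing 0, 1, or 2 haplotypes IBD."""
    events = []
    for start, end in segments:
        events.append((start, 1))   # a segment opens
        events.append((end, -1))    # a segment closes
    # At equal positions, closes (-1) sort before opens (+1),
    # so abutting segments are not counted as overlapping.
    events.sort()
    depth, prev, len1, len2 = 0, 0.0, 0.0, 0.0
    for pos, delta in events:
        span = pos - prev
        if depth == 1:
            len1 += span            # covered by exactly one segment: IBD1
        elif depth >= 2:
            len2 += span            # covered by two or more segments: IBD2
        depth += delta
        prev = pos
    k1 = len1 / genome_length_cm
    k2 = len2 / genome_length_cm
    return 1.0 - k1 - k2, k1, k2


def infer_relationship(segments: Iterable[Segment], genome_length_cm: float):
    """Return (phi, pi0, label) for one pair of individuals."""
    k0, k1, k2 = ibd_fractions(segments, genome_length_cm)
    phi = k1 / 4 + k2 / 2
    label = "unrelated or beyond 3rd degree"
    for threshold, name in DEGREE_THRESHOLDS:
        if phi > threshold:
            label = name
            break
    if label == "1st degree":
        # Illustrative cut-off (an assumption): parent-child pairs share
        # at least one haplotype everywhere, so pi0 is close to zero.
        label = "parent-child" if k0 < 0.1 else "full siblings"
    return phi, k0, label


# Toy 100 cM genome: one segment spanning everything (parent-child-like),
# and a sibling-like pattern with 50 cM IBD1 and 25 cM IBD2.
print(infer_relationship([(0.0, 100.0)], 100.0))
print(infer_relationship([(0.0, 75.0), (50.0, 75.0)], 100.0))
```

Because the sweep only touches the segments of a pair that actually shares IBD, the cost of this step depends on the number of detected segments rather than on all pairs in the cohort, which is the property the summary attributes to calling segments with RaPID first.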