RAFFI: Accurate and fast familial relationship inference in large scale biobank studies using RaPID

Open Access

21 January 2021

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Genetics

Vol. 17 (1), e1009315
https://doi.org/10.1371/journal.pgen.1009315

Abstract

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies (GWAS). Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π₀) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π₀ from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admixture events, and varying marker densities, and achieves higher accuracy than KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.

Author summary

Inferring familial relationships has a wide range of applications. Family-based and population-based GWAS both require genetic relationships. Inferring relationships is essential when familial structures are unknown, and it can be used to correct pedigree information that is wrong because of false paternity, sample switches, or unregistered adoption. Current approaches for inferring relationships are not scalable to large cohorts comprising millions of individuals. Here, we present a fast and flexible method, called RAFFI, that uses identical-by-descent (IBD) segments. IBD segments are uninterrupted DNA segments inherited from a common ancestor. Relationships are usually inferred by computing the kinship coefficients and the genome-wide probability of zero IBD sharing among all pairs of individuals. In the first step, we search for IBD segments using RaPID, which avoids a pairwise comparison of all individuals in a haplotype panel. In the second step, we compute the kinship coefficients to infer the relationships. To make our method robust against genotyping and phasing errors, the thresholds of kinship coefficients for different degrees of relatedness are adjusted. As a result, the lower detection power for IBD segments caused by phasing errors or misspecification of the genotyping error rate will not compromise the inference of relationships.
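
The second step can be illustrated with a minimal Python sketch. It is not RAFFI's implementation. It uses the textbook identities ϕ = k₁/4 + k₂/2 and π₀ = 1 − k₁ − k₂, where k₁ and k₂ are the genome fractions shared IBD on one and on both haplotypes. It also uses the conventional power-of-two kinship cut-offs popularized by KING. The genome length `GENOME_CM`, the names `PairSharing`, `kinship_and_pi0`, and `degree`, and the `scale` parameter (a crude stand-in for the data-driven threshold adjustment described above) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed total autosomal genetic length in centimorgans (illustrative round figure).
GENOME_CM = 3400.0


@dataclass
class PairSharing:
    """Summed IBD segment lengths for one pair of individuals (hypothetical container)."""
    ibd1_cm: float  # length where exactly one haplotype is shared IBD
    ibd2_cm: float  # length where both haplotypes are shared IBD


def kinship_and_pi0(s: PairSharing, genome_cm: float = GENOME_CM) -> Tuple[float, float]:
    """Return (phi, pi0) using phi = k1/4 + k2/2 and pi0 = 1 - k1 - k2."""
    k1 = s.ibd1_cm / genome_cm
    k2 = s.ibd2_cm / genome_cm
    return k1 / 4 + k2 / 2, 1 - k1 - k2


def degree(phi: float, scale: float = 1.0) -> str:
    """Classify a pair with power-of-two kinship cut-offs.

    `scale` < 1 lowers every cut-off, mimicking (very roughly) an adjustment
    for IBD segments that go undetected because of phasing or genotyping
    errors. How RAFFI actually derives its thresholds is not shown here.
    """
    cutoffs = [
        (2 ** -1.5, "duplicate/MZ twin"),
        (2 ** -2.5, "1st degree"),
        (2 ** -3.5, "2nd degree"),
        (2 ** -4.5, "3rd degree"),
    ]
    for cut, label in cutoffs:
        if phi > cut * scale:
            return label
    return "unrelated or more distant"


# Expected sharing for half-siblings: half the genome IBD1, none IBD2.
phi, pi0 = kinship_and_pi0(PairSharing(ibd1_cm=1700.0, ibd2_cm=0.0))
print(phi, pi0, degree(phi))
```

With unadjusted cut-offs, a pair whose detected sharing gives ϕ = 0.08 falls in the third-degree band. Lowering the cut-offs with `scale=0.8` moves the same pair into the second-degree band, which is the kind of correction that matters when detection power is reduced.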

