Calibrating the Performance of SNP Arrays for Whole-Genome Association Studies

Open Access

27 June 2008

Published by Public Library of Science (PLoS) in PLoS Genetics

Vol. 4 (6), e1000109
https://doi.org/10.1371/journal.pgen.1000109

Abstract

To facilitate whole-genome association studies (WGAS), several high-density SNP genotyping arrays have been developed. Genetic coverage and statistical power are the primary benchmark metrics in evaluating the performance of SNP arrays. Ideally, such evaluations would be done on a SNP set and a cohort of individuals that are both independently sampled from the original SNPs and individuals used in developing the arrays. Without an independent test set, previous estimates of genetic coverage and statistical power may be subject to an overfitting bias. Additionally, the SNP arrays' statistical power in WGAS has not been systematically assessed on real traits. One robust setting for doing so is to evaluate statistical power on thousands of traits measured from a single set of individuals. In this study, 359 newly sampled Americans of European descent were genotyped using both Affymetrix 500K (Affx500K) and Illumina 650Y (Ilmn650K) SNP arrays. From these data, we were able to obtain estimates of genetic coverage, which are robust to overfitting, by constructing an independent test set from among these genotypes and individuals. Furthermore, we collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Our genetic coverage estimates are lower than previous reports, providing evidence that previous estimates may be inflated due to overfitting. The Ilmn650K platform showed reasonable power (50% or greater) to detect SNPs associated with quantitative traits when the signal-to-noise ratio (SNR) is greater than or equal to 0.5 and the causal SNP's minor allele frequency (MAF) is greater than or equal to 20% (N = 359). In testing each of the more than 40,000 gene expression traits for association to each of the SNPs on the Ilmn650K and Affx500K arrays, we found that the Ilmn650K yielded 15% more discoveries than the Affx500K at the same false discovery rate (FDR) level.

Author Summary

Advances in SNP genotyping array technologies have made whole-genome association studies (WGAS) a readily available approach. Genetic coverage and statistical power are two key properties on which to evaluate these arrays. In this study, 359 newly sampled individuals were genotyped using Affymetrix 500K and Illumina 650Y SNP arrays. From these data, we obtained new estimates of genetic coverage by constructing a test set from among these genotypes and individuals that is independent from the SNPs and individuals used to construct the arrays. These estimates are notably smaller than previous ones, which we argue is due to an overfitting bias in previous studies. We also collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Through this dataset and simulations, we find that the SNP arrays provide adequate power to detect quantitative trait loci when the causal SNP's minor allele frequency is greater than 20%, but low power when it is less than 10%. Importantly, we provide evidence that sample size has a greater impact on the power of WGAS than SNP density or genetic coverage.

