Coverage and Characteristics of the Affymetrix GeneChip Human Mapping 100K SNP Set

Abstract
Improvements in technology have made it possible to conduct genome-wide association mapping at costs within reach of academic investigators, and experiments are currently being conducted with a variety of high-throughput platforms. To provide an appropriate context for interpreting results of such studies, we summarize here results of an investigation of one of the first of these technologies to be publicly available, the Affymetrix GeneChip Human Mapping 100K set of single nucleotide polymorphisms (SNPs). In a systematic analysis of the pattern and distribution of SNPs in the Mapping 100K set, we find that SNPs in this set are undersampled from coding regions (both nonsynonymous and synonymous) and oversampled from regions outside genes, relative to SNPs in the overall HapMap database. In addition, we use a novel multilocus linkage disequilibrium (LD) coefficient based on information content (analogous to the information content scores commonly used for linkage mapping) that is equivalent to the familiar measure r² in the special case of two loci. Using this approach, we can summarize, for any subset of markers such as the Affymetrix Mapping 100K set, the information available for association mapping in that subset relative to the information available in the full set of markers included in the HapMap. We also highlight circumstances in which this multilocus measure of LD provides substantial additional insight into the haplotype structure of a region beyond that given by pairwise measures of LD.

The ability to survey hundreds of thousands of single nucleotide polymorphisms (SNPs) with cost-effective technologies is enabling investigators to conduct genome-wide association studies designed to find genetic variation affecting disease risk. To facilitate both interpretation of these studies and the design of follow-up studies, Nicolae and colleagues have made a comprehensive survey of the distribution and coverage of the first of these high-throughput platforms for genome-wide association mapping to be made publicly available, the Affymetrix GeneChip Human Mapping 100K set of SNPs (100K set). They found that SNPs within coding sequence are underrepresented in this mapping set relative to the set of SNPs included in the International HapMap Project, and this has consequences for the success of association studies. Measuring the information content confirms that the 100K set provides substantial coverage of the variation in the HapMap database. The 100K set is quite redundant, because its SNPs were selected without information on the correlation (linkage disequilibrium) among them. The relatively high information content of the 100K set for the HapMap SNPs therefore bodes well for the general ability to survey genomic variation with a subset of variants.
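
For readers less familiar with the pairwise measure mentioned above, the following is a minimal sketch of the standard two-locus r² computed from phased haplotypes. It is not the multilocus coefficient introduced in this paper, only the familiar special case to which that coefficient is said to reduce. The function name `pairwise_r2` and the 0/1 allele coding are illustrative assumptions, not part of the original work.

```python
def pairwise_r2(haplotypes):
    """Standard two-locus r^2 from phased haplotypes.

    `haplotypes` is a list of (a, b) pairs, one per chromosome, where a and b
    are the alleles at the two SNPs coded as 0 or 1 (an assumed coding).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")

    # Frequencies of allele 1 at each locus and of the 1-1 haplotype.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    # D is the deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    # Two SNPs whose alleles always travel together on a chromosome.
    print(pairwise_r2([(0, 0), (1, 1), (0, 0), (1, 1)]))
    # Two SNPs whose alleles are statistically independent.
    print(pairwise_r2([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

In this toy input the first pair of SNPs is fully redundant (one predicts the other exactly), which is the kind of correlation that makes a marker set such as the 100K set informative about SNPs it does not type directly.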