Genetics in geographically structured populations: defining, estimating and interpreting FST

Abstract
Wright's F-statistics, and especially FST, provide important insights into the evolutionary processes that influence the structure of genetic variation within and among populations, and they are among the most widely used descriptive statistics in population and evolutionary genetics. FST is a property of the distribution of allele frequencies among populations. It reflects the joint effects of drift, migration, mutation and selection on the distribution of genetic variation among populations. FST has a central role in population and evolutionary genetics and has wide applications in fields from disease association mapping to forensic science. FST can be used to describe the distribution of genetic variation among any set of samples, but it is most usefully applied when the samples represent discrete units rather than arbitrary divisions along a continuous distribution. Statistics related to FST can be useful for haplotype or microsatellite data if an appropriate measure of evolutionary distance among alleles is available. Comparison of an estimate of FST from marker data with an estimate of QST from continuously varying trait data can be used to detect selection, but the estimate of FST may depend on the choice of marker and the estimate of QST may differ from neutral expectations if there is a non-additive component of genetic variance. Although the simple relationship between FST and migration rates in Wright's island model makes it tempting to infer migration rates from FST, caution is needed if such an approach is to be used. If estimates of FST from many loci are available, it may be possible to identify certain loci as 'outliers' that may have been subject to different patterns of selection or to different demographic processes. Case–control studies for association mapping must account for the possibility that population substructure accounts for an observed association between a marker and a disease.
The genomic control method uses background estimates of FST to control for such substructure. In forensic applications, the probabilities of obtaining a match are sometimes calculated for subpopulations that lack specific allele frequency data. A θ correction, in which θ is FST, is used to calculate the probability of a match using allele frequency information from the broader population of which the subpopulation is a part. The massive amount of data that is being generated by population genomics projects can be understood fundamentally as allelic variation at individual loci. We therefore expect F-statistics to be at least as useful in understanding these data sets as they have been in population and evolutionary genetics for most of the last century.
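
As a rough illustration of three quantities mentioned above, the following Python sketch computes FST at a single biallelic locus, the migration rate implied by Wright's island model, and a θ-corrected genotype match probability. It is a sketch under stated assumptions, not a recommended analysis pipeline. All function names are hypothetical. The subpopulation allele frequencies are treated as known values with equal weights, whereas real data call for a sample-based estimator. The island-model step assumes the equilibrium approximation FST ≈ 1/(4Nm + 1). The θ correction is assumed to take the Balding-Nichols form, which the text above does not specify.

```python
import math


def fst_from_frequencies(freqs):
    """FST at one biallelic locus as Var(p) / (p_bar * (1 - p_bar)).

    Assumption: `freqs` are known subpopulation allele frequencies,
    equally weighted (no correction for finite sample size).
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("locus is monomorphic; FST is undefined")
    # Population (not sample) variance of allele frequency among subpopulations.
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n
    return var_p / (p_bar * (1.0 - p_bar))


def island_model_nm(fst):
    """Migrants per generation (Nm) implied by FST = 1 / (4Nm + 1).

    Assumption: Wright's island model at equilibrium with negligible
    mutation; as the text cautions, this is rarely safe for real data.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("fst must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


def theta_corrected_match(p, theta, q=None):
    """Genotype match probability with a theta correction.

    Assumption: Balding-Nichols form. With q=None the genotype is a
    homozygote for an allele of frequency p; otherwise a heterozygote
    for alleles of frequencies p and q. theta = 0 recovers p**2 or 2pq.
    """
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    if q is None:
        num = (2.0 * theta + (1.0 - theta) * p) * (3.0 * theta + (1.0 - theta) * p)
    else:
        num = 2.0 * (theta + (1.0 - theta) * p) * (theta + (1.0 - theta) * q)
    return num / denom


if __name__ == "__main__":
    fst = fst_from_frequencies([0.2, 0.8])  # two strongly diverged demes
    print("FST:", fst)
    print("implied Nm:", island_model_nm(fst))
    print("homozygote match, theta = 0.03:", theta_corrected_match(0.1, 0.03))
```

Setting theta to zero in the last function returns the ordinary Hardy-Weinberg proportions, so the size of the correction can be read off by comparing the two results for the same allele frequencies.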