Population-based biobank for analyzing the frequencies of clinically relevant DNA markers in the Russian population: bioinformatic aspects

Abstract
One of the tasks of population-based biobanks is to determine the frequencies of clinically relevant genetic polymorphisms in the population. The population of Russia is highly heterogeneous, both ethnically and genetically. Frequencies of genetic markers are therefore needed not for a single sample, but for a series of samples reflecting the heterogeneity of the gene pool across different peoples and regions.
Aim. To divide the population of Russia and neighboring countries into population groups that meet certain conditions and are representatively sampled in existing data and biobanks.
Material and methods. We developed a method for combining populations into larger groups while maintaining intragroup homogeneity. It is based on principal component analysis with K-means clustering, followed by refinement of the clustering using FST distances to achieve higher homogeneity and a more even distribution of group sizes. The approach was refined using the Biobank of Northern Eurasia as an example. Accordingly, the material comprised genome-wide data from this biobank on 4.5 million genetic markers for 1,883 samples representing 247 populations of Russia and neighboring countries. The developed approach, the resulting set of populations and the associated map can be applied to other collections of biomaterials from Russian populations.
Results. This approach made it possible to divide the entire population of Russia and neighboring countries into 29 ethnogeographic groups characterized by relative genetic homogeneity. This set of populations is recommended as a baseline for population screenings aimed at determining the frequency of any genetic marker in the population of Russia. A map showing the division of the population into 29 ethnogeographic areas has been constructed.
Conclusion. The gene pool of the Russian population was zoned on the basis of reliable genome-wide data. We identified ethnogeographic groups whose allele frequencies contrast between groups but are relatively homogeneous within each group. The resulting map and register of groups can be used in population genetic, medical genetic and pharmacogenetic studies.
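The FST-based homogeneity check mentioned in the methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which FST estimator was used, so the Hudson estimator (ratio of averages) is assumed here, and the function names, toy allele frequencies and sample sizes are all hypothetical.

```python
from itertools import combinations


def hudson_fst(freqs_a, n_a, freqs_b, n_b):
    """Hudson FST estimate (ratio of averages) between two populations.

    freqs_a, freqs_b: per-marker allele frequencies in the same marker order.
    n_a, n_b: number of sampled chromosomes per population (each >= 2).
    Assumption: the source does not name its FST estimator.
    """
    num = 0.0
    den = 0.0
    for p, q in zip(freqs_a, freqs_b):
        # Squared frequency difference, corrected for finite sample size.
        num += (p - q) ** 2 - p * (1 - p) / (n_a - 1) - q * (1 - q) / (n_b - 1)
        den += p * (1 - q) + q * (1 - p)
    return num / den if den > 0 else 0.0


def max_intragroup_fst(group, freqs, sizes):
    """Largest pairwise FST among the populations of one candidate group."""
    return max(
        (hudson_fst(freqs[a], sizes[a], freqs[b], sizes[b])
         for a, b in combinations(group, 2)),
        default=0.0,  # a single-population group is trivially homogeneous
    )


# Hypothetical toy data: three populations, three markers, 20 chromosomes each.
toy_freqs = {
    "pop_A": [0.1, 0.2, 0.3],
    "pop_B": [0.1, 0.2, 0.3],
    "pop_C": [0.9, 0.8, 0.7],
}
toy_sizes = {"pop_A": 20, "pop_B": 20, "pop_C": 20}

homogeneous = max_intragroup_fst(["pop_A", "pop_B"], toy_freqs, toy_sizes)
mixed = max_intragroup_fst(["pop_A", "pop_B", "pop_C"], toy_freqs, toy_sizes)
```

In this toy case the group containing pop_C has a much larger maximum pairwise FST than the group without it, which is the kind of signal a refinement step could use to split or reassign populations. Any cutoff for "homogeneous enough" would be an analyst's choice; the abstract does not give one.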