
Genome-wide sequence analyses of ethnic populations across Russia

2019, Genomics

University of Groningen research database record. Document version: Publisher's PDF, also known as Version of record. Publication date: 2020.

Citation for published version (APA): Zhernakova, D. V., Brukhin, V., Malov, S., Oleksyk, T. K., Koepfli, K. P., Zhuk, A., Dobrynin, P., Kliver, S., Cherkasov, N., Tamazian, G., Rotkevich, M., Krasheninnikova, K., Evsyukov, I., Sidorov, S., Gorbunova, A., Chernyaeva, E., Shevchenko, A., Kolchanova, S., Komissarov, A., ... O'Brien, S. J. (2020). Genome-wide sequence analyses of ethnic populations across Russia. GENOMICS, 112(1), 442-458. https://doi.org/10.1016/j.ygeno.2019.03.007
Genomics 112 (2020) 442–458
Contents lists available at ScienceDirect
Genomics
journal homepage: www.elsevier.com/locate/ygeno

Original Article

Genome-wide sequence analyses of ethnic populations across Russia

Daria V. Zhernakova (a,b,⁎), Vladimir Brukhin (a), Sergey Malov (a,c), Taras K. Oleksyk (a,d,r), Klaus Peter Koepfli (a,e), Anna Zhuk (a,f), Pavel Dobrynin (a,e), Sergei Kliver (a), Nikolay Cherkasov (a), Gaik Tamazian (a), Mikhail Rotkevich (a), Ksenia Krasheninnikova (a), Igor Evsyukov (a), Sviatoslav Sidorov (a), Anna Gorbunova (a,g), Ekaterina Chernyaeva (a), Andrey Shevchenko (a), Sofia Kolchanova (a,d), Alexei Komissarov (a), Serguei Simonov (a), Alexey Antonik (a), Anton Logachev (a), Dmitrii E. Polev (h), Olga A. Pavlova (h), Andrey S. Glotov (u), Vladimir Ulantsev (i), Ekaterina Noskova (i,j), Tatyana K. Davydova (s), Tatyana M. Sivtseva (k), Svetlana Limborska (l), Oleg Balanovsky (m,n,o), Vladimir Osakovsky (k), Alexey Novozhilov (p), Valery Puzyrev (q), Stephen J. O'Brien (a,t,⁎)

a Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russian Federation
b Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
c Department of Mathematics, St. Petersburg Electrotechnical University, St. Petersburg, Russian Federation
d Biology Department, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico
e National Zoological Park, Smithsonian Conservation Biology Institute, Washington, DC, USA
f Vavilov Institute of General Genetics, Russian Academy of Sciences, St. Petersburg Branch, St. Petersburg, Russian Federation
g I.I. Mechnikov North-Western State Medical University, St. Petersburg, Russian Federation
h Centre Biobank, Research Park, St. Petersburg State University, St. Petersburg, Russian Federation
i Computer Technologies Laboratory, ITMO University, St. Petersburg, Russian Federation
j JetBrains Research, St. Petersburg, Russian Federation
k Institute of Health, North-Eastern Federal University, Yakutsk, Russian Federation
l Department of Molecular Bases of Human Genetics, Institute of Molecular Genetics, Russian Academy of Sciences, Moscow, Russian Federation
m Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russian Federation
n Research Centre for Medical Genetics, Moscow, Russian Federation
o Biobank of North Eurasia, Moscow, Russian Federation
p Department of Ethnography and Anthropology, St. Petersburg State University, St. Petersburg, Russian Federation
q Research Institute of Medical Genetics, Tomsk National Research Medical Center, Russian Academy of Science, Tomsk, Russian Federation
r Department of Biological Sciences, Oakland University, Rochester, MI 48309, USA
s Federal State Budgetary Scientific Institution, "Yakut science center of complex medical problems", Yakutsk, Russian Federation
t Guy Harvey Oceanographic Center, Halmos College of Natural Sciences and Oceanography, Nova Southeastern University, 8000 North Ocean Drive, Ft Lauderdale, Florida 33004, USA
u Laboratory of biobanking and genomic medicine of Institute of translation biomedicine, St. Petersburg State University, St. Petersburg, Russian Federation

ABSTRACT

The Russian Federation is the largest and one of the most ethnically diverse countries in the world; however, no centralized reference database of genetic variation exists to date. Such data are crucial for medical genetics and essential for studying population history. The Genome Russia Project aims at filling this gap by performing whole genome sequencing and analysis of peoples of the Russian Federation. Here we report the characterization of genome-wide variation of 264 healthy adults, including 60 newly sequenced samples.
People of Russia carry known and novel genetic variants of adaptive, clinical and functional consequence that in many cases show allele frequency divergence from neighboring populations. Population genetics analyses revealed six phylogeographic partitions among indigenous ethnicities corresponding to their geographic locales. This study presents a characterization of population-specific genomic variation in Russia with results important for medical genetics and for understanding the dynamic population history of the world's largest country.

⁎ Corresponding authors at: Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russian Federation.
E-mail addresses: [email protected] (D.V. Zhernakova), [email protected] (S.J. O'Brien).
https://doi.org/10.1016/j.ygeno.2019.03.007
Received 25 December 2018; Accepted 15 March 2019; Available online 19 March 2019
0888-7543/ © 2019 Elsevier Inc. All rights reserved.

1. Introduction

The Russian Federation (Russia) has one of the most ethnically diverse indigenous human populations within a single country. According to the 2010 census, 195 ethnic groups are represented on Russian territory. The migrations of the last millennia have created the complex patchwork of human diversity that represents today's Russia. The (pre)historic milestones that founded modern Russian populations include the settlement of northern areas of Eurasia by anatomically modern humans, the eastward expansion of the Indo-European speakers, the westward expansion of the Uralic and Altai language families, and centuries of admixture between them [1–6]. The migration routes for peopling Northern and Central Eurasia and the Americas inevitably passed through this territory, followed by the waves of great human migrations together with the exchange of knowledge and technology (and likely genes) along the Silk Road [7,8]. Studies of population ancestry and structure in Russia would further provide genomic links to the lost Neanderthal and Denisovan cultures discovered in Russia's fossil beds [9,10].

There remain ongoing discussions about the origins of the ethnic Russian population. The ancestors of ethnic Russians were among the Slavic tribes that separated from the early Indo-European Group, which included ancestors of modern Slavic, Germanic and Baltic speakers, who appeared in the northeastern part of Europe ca. 1500 years ago. Slavs were found in the central part of Eastern Europe, where they came in direct contact with (and likely assimilated) the populations speaking Uralic (Volga-Finnish and Baltic-Finnish) and Baltic languages [11–13]. In the following centuries, Slavs interacted with the Iranian-Persian, Turkic and Scandinavian peoples, all of which in succession may have contributed to the current pattern of genome diversity across the different parts of Russia. At the end of the Middle Ages and in the early modern period, the East Slavic unity divided into Russians, Ukrainians and Belarusians. It was the Russians who drove the colonization movement to the East, although other Slavic, Turkic and Finnish peoples took part in this movement, as the eastward migrations brought them to the Ural Mountains and further into Siberia, the Far East, and Alaska. During that interval, the Russians encountered the Finnic, Ugric, and Samoyed speakers in the Urals, but also the Turkic, Mongolian and Tungus speakers of Siberia. Finally, in the great expanse between the Altai Mountains on the border with Mongolia and the Bering Strait, they encountered paleo-Asiatic groups that may be genetically closest to the ancestors of the Native Americans [14]. Today's complex patchwork of human diversity in Russia has continued to be augmented by modern migrations from the Caucasus and from Central Asia, as modern economic migrations take shape [15].

There have been several studies of the genetic history of Russia using microarrays, microsatellites, Y-chromosome and mitochondrial genome sequences [16–35] and, more recently, whole genome sequencing [36–38]. Most studies have focused on profiling specific ethnicities, but a centralized reference dataset of genomic variation of most Russian populations is currently lacking. Furthermore, a number of medically relevant candidate genes with variants specific to groups within Russia have been reported [39–42]. To further expand on these reports, we initiated the Genome Russia Project [43,44], with the goal of sequencing the whole genomes of approximately 3500 individuals, including family trios, to assess the genetic diversity across the Russian Federation and to reveal functional genomic variation of medical significance.

In the current study, we annotated whole genome sequences of individuals currently living on the territory of Russia and identifying themselves as ethnic Russian or as members of a named ethnic minority (Fig. 1). We analyzed genetic variation in three modern populations of Russia (ethnic Russians from Pskov and Novgorod regions and ethnic Yakut from the Sakha Republic), and compared them to the recently released genome sequences collected from 52 indigenous Russian populations [36,37]. The incidence of function-altering mutations was explored by identifying known variants and novel variants and their allele frequencies relative to variation in adjacent European, East Asian and South Asian populations. Genomic variation was further used to estimate genetic distance and relationships, historic gene flow and barriers to gene flow, the extent of population admixture, historic population contractions, and linkage disequilibrium patterns. Lastly, we present demographic models estimating historic founder events within Russia, and a preliminary HapMap of ethnic Russians from the European part of Russia and Yakuts from eastern Siberia.

2. Results

Our study presents analyses of the whole genome sequences (WGS) at 30× coverage of 60 newly sequenced individuals from three populations: Pskov region (western Russia), Novgorod region (western Russia), and Yakutia (eastern Siberia), and compares these to 204 individuals from 52 populations including both Russians and other ethnic groups (Table 1, Supplemental Table S1; Fig. 1). Samples of Pskov, Novgorod, and Yakut populations were collected as family trios (two biological parents and their adult child) upon obtaining informed consent and IRB approval, and with a stated three (or more) generation homogenous ancestry from the same ethnic group and the same region [45].

The genomes of all study participants were explored for known disease-associated mutations, as well as for medically important 'loss-of-function' coding variants including SNPs, short indels (< 20 bp), longer indels (20–100 bp), copy-number variants (CNVs) and segmental duplications (SDs). Variant calling and genotyping of SNPs and short indels in Pskov, Novgorod, and Yakut genomes revealed 8 million SNPs and 2 million indels per population (Supplemental Table S2; Supplemental material). Between 3 and 4% of these SNPs were classified as novel as compared to dbSNP (Supplemental Fig. S1a-b). Overall, over 10.5 million SNPs and 2.8 million short indels were found in all 60 samples from the three Genome Russia populations combined (Supplemental Fig. S1c). As might be predicted, the number of overlapping SNPs and indels was higher when comparing Pskov and Novgorod than when comparing Yakut with the western Russia populations, in line with the geographic separation of these populations (Supplemental Fig. S1c). The same trend was observed for long indels (see Supplemental material).

In addition, we resolved CNVs and aggregated them into segmental duplication (SD) profiles for the 60 Genome Russia samples. This resulted in regions of SDs spanning ~214 Mb in each dataset (Supplemental Table S3).
The highest number of SDs (~3 Mb) was observed for Yakuts. We further compared SD profiles between populations using V statistics (Vst, see Methods and Supplemental material), and observed relatively strong differences between Yakut and the two western populations of Pskov and Novgorod, consistent with expectations based on geography (Supplemental Fig. S2).

The collection of identified SNPs was used to inspect quantitative distinctions among 264 individuals from across Eurasia (Fig. 1) using Principal Component Analysis (PCA) (Fig. 2). The first and second eigenvectors of the PCA plot are associated with longitude and latitude, respectively, of the sample locations and accurately separate Eurasian populations according to geographic origin. East European samples cluster near Pskov and Novgorod samples, which fall between northern Russians, Finno-Ugric peoples (Karelian, Finns, Veps etc.), and other Northeastern European peoples (Swedes, Central Russians, Estonian, Latvians, Lithuanians, and Ukrainians) (Fig. 2b). Yakut individuals map into the Siberian sample cluster as expected (Fig. 2a).

To obtain an extended view of population relationships, we performed a maximum likelihood-based estimation of ancestry and population structure using ADMIXTURE [46] (Fig. 2c). The Novgorod and Pskov populations show profiles similar to their Northeastern European ancestors, while the Yakut ethnic group showed mixed ancestry similar to the Buryat and Mongolian groups.

We further assessed ancestral divergence between the populations in western Russia (Pskov and Novgorod) and Siberia (Yakuts) by choosing 'Ancestry Informative Markers' (AIMs) from the major Eurasian population groupings represented in the 1000 Genomes Project [47]. As expected, European-specific AIMs were concentrated in the western Russia (Pskov and Novgorod) populations compared to the Yakut samples, while the converse was observed for East Asia-specific AIMs (Supplemental Fig. S3).

Possible admixture sources of the Genome Russia populations were addressed more formally by calculating F3 statistics, an allele frequency-based measure that tests whether a target population can be modeled as a mixture of two source populations [48]. Results showed that Yakut individuals are best modeled as an admixture of Evens or Evenks with various European populations (Supplemental Table S4). Pskov and Novgorod showed admixture of European with Siberian or Finno-Ugric populations, with Lithuanian and Latvian populations being the dominant European sources for Pskov samples (Supplemental Table S4).

Fig. 1. Map of the Russian Federation with locales of indigenous ethnic groups. Sample collection locales are indicated as colored circles with two-letter codes (see Supplemental Table S1). White dotted lines are arbitrary boundaries separating major population partitions suggested by the phylogenetic analyses (see text). Population code colors correspond to geographic areas: A) Pink: North-Eastern Siberia; B) Green: Eastern Siberia; C) Brown: Western Siberia; D) Orange: Volga-Ural; E) Red: Western Russia; F) Blue: Caucasus. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Sample relatedness based on genotype data. (a,b) Principal Component plot of 574 modern Russian genomes. Colors reflect geographical regions of collection; shapes reflect the sample source. Red ovals show the location of Genome Russia samples. (a) Eurasia; (b) Western Russia and neighboring countries. (c) Population structure across samples in 178 populations from five major geographic regions (k = 5). Samples are pooled across three different studies that covered the territory of the Russian Federation (Mallick et al. 2016 [36], Pagani et al. 2016 [37], this study). The optimal k-value was selected by the value of the cross validation error. Russian samples from all studies (highlighted in bold dark blue) show a slight gradient from Eastern European (Ukrainian, Belorussian, Polish) to North European (Estonian, Karelian, Finnish) structures, reflecting a population history of northward expansion. Yakut samples from different studies (highlighted in bold red) also show a slight gradient from Mongolian to Siberian people (Evens), as expected from their original admixture and northward expansions. The samples originating from this study are highlighted and plotted in separate boxes below. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 1. Number of samples used in the study.

Sample group | Region | Number populations | Number samples | Number unrelated samples | Number families
GR Pskov | Western Russia | 1 | 22 | 14 | 7
GR Novgorod | Western Russia | 1 | 20 | 15 | 5
GR Yakuts | East Siberia | 1 | 18 | 14 | 4
Mallick et al. | Many | 18 | 31 | 32 | 0
Pagani et al. | Many | 45 | 173 | 174 | 0
Total | | 55 | 264 | 249 | 16

GR means Genome Russia.

2.1. Medically relevant gene variants

A total of 894 medically relevant gene variants, annotated in the Human Gene Mutation Database (HGMD) [49] as disease-causing mutations, were detected within the Pskov, Novgorod and Yakut populations, resulting in 1776 known disease-associated variants occurring within all samples (Supplemental Table S5). We profiled the distribution of disease-associated mutations from HGMD (disease-causing mutations, HGMD-DM) in the 60 Genome Russia individuals, which showed an average of 75 HGMD-DM variants per individual (Supplemental Fig. S4). In addition, 31 unique variants in 29 genes were classified as pathogenic after manual curation of HGMD-DM variants (Supplemental Table S6). Forty-three (43) of the 60 individuals carried at least one pathogenic variant: 41 as heterozygous, one compound heterozygote, and one homozygous case (both in the ABCA4 gene, associated with age-related macular degeneration [50]). Notably, three of eighteen Yakut participants were heterozygous for a pathogenic variant in the SBF1 gene (MAF = 0.17 compared to MAF < 0.003 in gnomAD database [51]; Table 2, Supplemental Table S6), related to Charcot-Marie-Tooth disease type 4B3 [52]. All variants conformed to Mendelian expectations in the trios.

We validated each of the four disease-causing mutations showing the most significant allele frequency difference, together with other variants listed in Table 2, by Sanger sequencing, which confirmed genotypes and MAF for each. The Genome Russia subjects carrying functional mutations were all healthy, which could be explained by heterozygosity for recessive alleles, age-dependent penetrance (e.g. for ABCA4), and/or other gene–environmental interactions.

2.2. Loss-of-function SNPs

Coding gene variants were annotated for their effect on proteins using the Ensembl Variant Effect Predictor (VEP) tool [53]. We investigated their occurrence in large SNP databases (1000 Genomes, 1000G [47]; Exome Aggregation Consortium, ExAC [51]; Genome Aggregation Database, gnomAD [51]) and reported associations with diseases and complex traits, focusing on loss-of-function (LoF) SNPs and indels. Of 82,574 coding SNPs identified in the combined cohort, 2145 SNPs were identified as high-confidence loss-of-function variants (stop codon, frameshifts, splice alterations; see Methods). For the subsequent analyses, we selected only the 758 LoF SNPs that had an allele count of two or more and did not fail Mendelian inheritance expectations in any of the Genome Russia trios. One hundred and one (101) of these LoF SNPs were not reported in 1000G, ExAC or gnomAD (Supplemental Table S7).

We detected 34 LoF SNPs showing elevated allele frequencies in Genome Russia populations compared to those in the European (EUR), East Asian (EAS), or South Asian (SAS) populations of 1000G (18 SNPs with MAF > 10 fold, and 17 SNPs with MAF = 5–10× greater than in human genetic population databases). Implicated genes, minor allele frequency (MAF) and allele counts for each observed LoF SNP, their allele frequencies in public databases, and specific disease phenotypes associated with the genes in the GWAS catalog, including five genes that are scored as LoF-intolerant [51], are listed in Supplemental Table S7. For example, a LoF variant rs117753184 of WDR33, an RNA editing gene previously associated with coronary artery calcification, carries a stop codon allele in Yakut at a MAF of 18%, but occurs at 3% frequency in East Asia and is even less frequent in European and South Asian populations (Table 2, Supplemental Table S7). This gene is considered "LoF intolerant" according to ExAC [51] and may have clinical consequences that are not yet confirmed. Other LoF SNPs in Supplemental Table S7 are also potential candidates for both population differentiation and clinical influence.

Table 2. Disease-associated and function-altering variants in Genome Russia samples.

Category of discovery | Phenotype | Location | Variant id | ref/alt | Gene | MAF Pskov | MAF Novgorod | MAF Yakuts | MAF 1000G EUR | MAF 1000G EAS | Details | Min p-value
Medically relevant gene variants | Albinism oculocutaneous II | 15q13.1 | rs74653330 | C/T | OCA2 | 0.04 | NA | 0.214 | 0.01 | 0.027 | ST5, ST6 | 1.49E-04
Medically relevant gene variants | Charcot-Marie-Tooth disease 4b3 | 22q13.33 | rs200488568 | T/C | SBF1 | 0 | 0 | 0.107 | 0 | 0.001 | ST5, ST6 | 6.96E-05
Medically relevant gene variants | Age-related macular degeneration | 1p22.1 | rs28938473 | G/A | ABCA4 | 0 | 0.07 | 0 | 0.006 | 0 | ST5, ST6 | 0.0204
Medically relevant gene variants | Tyrosinemia type I | 15q25.1 | rs11555096 | C/T | FAH | 0.14 | NA | 0 | 0.019 | 0 | ST5, ST6 | 0.00268
LoF SNPs | Coronary artery calcification | 2q14.3 | rs117753184 | A/T | WDR33 | 0 | 0 | 0.179 | 0 | 0.026 | ST7 | 0.00104
LoF SNPs | Diabetic kidney disease; Urinary uromodulin levels | 8q24.13 | rs10101626 | G/T | TBC1D31 | 0.18 | 0.1 | 0.714 | 0.195 | 0.183 | ST7 | 2.41E-09
LoF SNPs | Astigmatism; cerebrospinal fluid clustering measurement; coronary artery bypass, vein graft stenosis | 7p12.3 | rs141576983 | G/T | ABCA13 | 0 | 0 | 0.464 | 0.002 | 0.023 | ST7 | 2.67E-13
Long indels | Complement C2 deficiency | 6p21.33 | rs572361305 | | C2 | 0 | 0.1 | 0 | 0.007 | 0 | ST10 | 0.002296
Population-specific phenotypes | Lactose intolerance | 2q21.3 | rs4988235 | A/G | MCM6 | 0.36 | 0.47 | 0.04 | 0.508 | 0 | Fig. 3 | 0.027
Population-specific phenotypes | Warfarin dosage sensitivity | 16p11.2 | rs9923231 | C/T | VKORC1 | 0.25 | 0.2 | 0.86 | 0.388 | 0.885 | Fig. 3 | 0.0121
Population-specific phenotypes | Skin pigmentation | 5p13.2 | rs16891982 | C/G | SLC45A2 | 1 | 1 | 0.07 | 0.938 | 0.006 | Fig. 3 | 0.0178
Population-specific phenotypes | Retinitis pigmentosa | 1p36.11 | rs3816539 | G/A | DHDDS | 0.11 | 0.07 | 0.96 | 0.235 | 0.709 | Fig. 3 | 0.00599
Population-specific phenotypes | Short stature syndrome | 2p24.3 | rs369698072 | C/T | NBAS | 0 | 0 | 0.071 | NA (ExAC: 0) | NA (ExAC: 1.3e-4) | NA | 8.03E-06
Infectious diseases | Hepatitis B infection | 6p21.32 | rs9277535 | A/G | HLA-DPB1 | 0.11 | 0.17 | 0.39 | 0.27 | 0.61 | ST11 | 0.0292
Infectious diseases | Kaposi's sarcoma | 11p15.4 | rs11030122 | C/G | STIM1 | 0.54 | 0.47 | 0.11 | 0.33 | 0.35 | ST11 | 0.00761
Pharmacogenomics | Tamoxifen outcomes in breast cancer | 10q22.3 | rs11593840 | A/G | LRMDA | 0.57 | 0.37 | 0.43 | 0.41 | 0.18 | ST12 | 0.00212
Pharmacogenomics | Irinotecan in colorectal cancer | 2q37.1 | rs6742078 | G/T | UGT1A1 | 0.29 | 0.27 | 0.46 | 0.3 | 0.13 | ST12 | 2.50E-05
Pharmacogenomics | Irinotecan in colorectal cancer | 2q37.1 | rs887829 | C/T | UGT1A1 | 0.29 | 0.27 | 0.46 | 0.298 | 0.13 | ST12 | 2.50E-05
Pharmacogenomics | Trastuzumab Lapatinib in breast cancer treatment | 10q26.13 | rs3135718 | C/T | FGFR2 | 0.46 | 0.37 | 0.07 | 0.43 | 0.4 | ST12 | 2.47E-04

Variants described in multiple sections of the paper are listed in the table (column one corresponds to the section), showing variant and overlapping gene ids and the phenotype associated with the variant or the gene. Allele frequency (AF) for Genome Russia is given for the alternative allele. The Details column gives the table/Fig. with more information on these variants. The last column gives the minimum p-value for the Fisher exact test of allele count difference between either Novgorod and Pskov compared with 1000G EUR or Yakut compared with 1000G EAS. The population AFs showing the minimum p-value are underlined.

2.3. Insertions and deletions

As many as 757 short insertion-deletion mutations (< 20 bp) were annotated as LoF among the Genome Russia populations. Indel calling is known to be error prone; therefore, we performed additional filtering by applying alignment-free k-mer-based genotyping (see Methods). In addition, novel insertions and deletions (indels) that failed Mendelian inheritance compliance were filtered out, leaving a total of 308 indels for which at least two alleles were present among the 43 unrelated Genome Russia individuals (Supplemental Table S8).

We identified longer indels (20–100 bp) in the Pskov, Novgorod and Yakut populations and annotated the indels with Ensembl VEP [53] (Supplemental Tables S9, S10). Each population had 1600–1900 long indels, of which < 1% overlapped exons and about 80% were previously recorded in dbSNP (Supplemental Table S9). Exon-overlapping long indels were detected in 26 genes, with six genes having long indels located within exons in two or more populations (AGBL5, CHIT1, DNAH9, ENOSF1, PLCH2, and ZNF683). The majority of samples in the three populations were heterozygotes for the long indels overlapping with exons (Table 2, Supplemental Table S10).

2.4. Population-specific biomedical phenotypes

Certain diseases and heritable traits have different occurrence in different populations due to genetic drift, adaptation or migration [54–57]. Variant frequencies with population-specific patterns can lead to differences in traits or disease prevalence between populations, which can inform clinical treatment tailored to particular populations. To date, complete Russian genomes have not been interrogated for the presence and incidence of medically significant variants. Here, we offer a first step toward a personalized approach to genomic medicine for this part of the world.

To illustrate how differences in population history can affect frequencies of important physiological traits, we examined four familiar loci in depth: MCM6, VKORC1, SLC45A2, and DHDDS (Fig. 3). LCT, a gene that regulates adults' tolerance to lactose and milk products, is a well-known example of selection-based differentiation [58,59]. However, the first mutation associated with the lactose tolerance phenotype in Europeans, −13910 C > T (rs4988235), is not located in the LCT gene, but rather 14 kb upstream, within intron 13 of the mini-chromosome maintenance complex component 6 (MCM6) gene. The lactose tolerance A allele has been strongly selected in European populations within the last 10,000 years since the dawn of agriculture and modern civilization in the Fertile Crescent of the Middle East [60]. This LCT-MCM6 variant has been suggested to be one of the strongest signals of natural selection in the human genome [58,59]. As expected, the non-European chromosomes in EAS and Yakut were all nearly fixed for the ancestral G (lactose intolerant) allele, while CEU and FIN had a higher frequency for the A (lactose tolerant) allele. Pskov and Novgorod populations show intermediate A allele frequencies (41%), lower than CEU (74%) or FIN (59%), possibly indicating the effect of admixture (Fig. 3a, Table 2).

Another example of a population-specific allele frequency difference important for drug dosage prediction is response to warfarin (also called coumadin). Warfarin is a widely used anticoagulant that has severe side-effects, such as bleeding, if used in an inappropriate dosage. Response to warfarin depends on several factors, including genetic variants in the CYP2C9 and VKORC1 genes that are commonly used to predict the correct dose [61,62]. Carriers of the VKORC1-T allele, which is predominant in Asia, require a substantially lower dose of warfarin than Europeans, where the VKORC1-C allele predominates [61,62]. As expected, EAS and Yakut populations showed a higher frequency of the VKORC1-T allele (86% and 88% respectively), compared to the CEU (43%) and FIN (31%). The two western Genome Russia populations (Pskov and Novgorod) showed a T allele frequency (24%) similar to the Finnish population (Fig. 3b, Table 2). This likely means that warfarin dosage for Pskov and Novgorod individuals needs to be similar to that for Finns, while a lower dosage in Yakuts is expected to be effective, similar to populations of East Asia.

Dramatic population stratification is also apparent for SLC45A2, a gene related to lighter skin pigmentation. Within European populations SLC45A2 has been shown to be under strong selection, as evidenced by multiple genome-wide scans of selection (reviewed in [63]). Analyses of ancient Eurasian genomes found that the allele associated with light skin pigmentation has likely reached fixation in modern Europeans from very low frequency during the Neolithic period, due to strong selection pressure over the past ~4000 years [57]. Not surprisingly, the light skin allele (G) is nearly fixed in both populations from western Russia (100%), while in the Yakut population the low frequency (7%) reflects some level of an ancestral genetic component shared with Europeans, since this variant is completely absent from the East Asian populations (EAS) (Fig. 3c, Table 2).

A single-nucleotide mutation in the gene that encodes cis-prenyltransferase (DHDDS) has been identified as the cause of non-syndromic recessive retinitis pigmentosa (RP) [64,65]. The recessive 757G > A missense variant in the DHDDS gene (rs3816539) associated with retinitis pigmentosa pathology is reduced in frequency in western Russian populations (9% vs 20% in CEU and 28% in FIN) compared to other European populations, and increased in the Yakut population (96% vs 71% in EAS) compared to the other Asian populations (Fig. 3d). It is not clear if this reversal derives from genetic drift or natural selection.

The Yakut population is known to have higher allele frequencies for some variants, which sometimes leads to hereditary pathologies [39–42]. For example, rs369698072 in the NBAS gene is associated with short stature syndrome in Yakuts [40]. While this variant is extremely rare in European and Asian populations, it has a MAF of 7% in our Yakut samples, which is significantly higher than in other populations (p = 8.03 × 10^−6, see Table 2).

Fig. 3. Differences in Genome Russia allele frequencies of SNPs in notable genes with important phenotypes differentiate among Eurasian ethnic groups. Allele frequencies for populations of Pskov and Novgorod (combined) and Yakut are shown together with allele frequencies of 1000G populations: Europeans (CEU), Finnish (FIN), East Asians (EAS) and South Asians (SAS) for four SNPs: (a) rs4988235, located in the MCM6 gene. This SNP is associated with adult type lactose intolerance. The G allele tags the lactose intolerant haplotype [58,59]; (b) rs9923231, located in the VKORC1 gene. This SNP is associated with warfarin response. T allele carriers need a reduced dose of warfarin; (c) rs16891982, located in the SLC45A2 gene. The G allele is related to lighter skin pigmentation; (d) rs3816539, located in the DHDDS gene. The A allele is associated with retinitis pigmentosa.

2.5. Russian gene variants that are associated with infectious diseases, pharmacogenomics, and natural selection across the globe

Table 2 also summarizes Russian gene variants that convey notable infectious disease, natural selection and pharmacogenomic phenotypes distinctive in world populations [54,55,66,67] (see also Supplemental Tables S11–S13). Many of these show divergent MAF in Russian populations compared to parental EUR and EAS database frequencies, which we confirmed by Sanger sequencing (Supplemental material). These gene frequency distinctions may reflect historic genetic drift, occasional founder events, or possible assortative mating effects [56] that would, upon replication, be relevant to the health impact in Russian communities.

We compared the variation in MAF for variants previously associated with human infectious disease, natural selection and pharmacogenomic phenotypes [54,55,67] from Supplemental Tables S11–S13 in Russian versus 1000G populations to search for patterns of overall allele frequency change during the founding of Russian populations (Supplemental Fig. S5). While allele frequencies in western Russians (Pskov and Novgorod) resembled the European reference, we observed a different pattern for allele frequencies of Yakut versus the EAS allele frequencies.
For example, we observed a rather tight cluster (indicating near invariance) among all alleles in Novgorod and Pskov versus their EUR neighbors for all three gene categories, while variance of the same alleles from EAS and SAS is considerably larger (Supplemental Fig. S5, left and center plots). The Yakut population shows a substantially larger deviation from all database populations (EUR, EAS and SAS) for infectious disease and pharmacogenomics associated genes, but tighter clustering of the selected alleles with EAS. For the Novgorod and Pskov populations, the pattern may be interpreted as indicating that all the studied alleles were adapted and set before the recent founding of these populations in Russia from EUR predecessors, with little drift, perturbation effects or MAF changes since. This explanation also seems to hold for the tight clustering of Yakut and EAS for the ‘selection’ alleles. However, if affirmed, the absence of clustering for Yakut-EAS for the alleles mediating infectious disease and pharmacogenomics phenotypes would suggest these important gene variants were altered by selection, drift or other demographic factors in more recent times after the original founder events.
2.6. Phylogeography of Russian peoples
To further explore the relationships of individuals within and between different regions of Russia (Fig. 1), we constructed neighbor-joining trees based on pairwise nucleotide differences of ~3.8 M homologous SNPs (after filtering, see Methods) from 231 unrelated individuals representing 55 ethnicities. The resulting topology (Fig. 4a) showed a stepwise arrangement of individuals into six phylogeographic clusters ordered from eastern Asia to western Europe, corresponding to the six regions separated by white dashed lines in Fig. 1. Individuals from each of the six geographic locations were clustered together as monophyletic clades, indicating recent isolation and restricted gene flow between them since.
The family trio design of our project allowed us to accurately phase SNP data and identify the haplotype structure of our samples; haplotypes have been suggested to perform as well as or better than unlinked SNPs in reconstructing historical relationships of populations [68]. We created a haplotype-based phylogenetic tree with fineSTRUCTURE [68] using the same Russian genomes plus 308 additional neighboring Eurasian genomes [36,37] (Fig. 4b). The analysis largely re-affirmed the geographic clusters obtained in the neighbor-joining tree (Fig. 4b).
Fig. 4. Phylogenetic analyses of samples from the territory of Russia. (a) Neighbor-joining tree showing relationships among 231 Russian-ancestry individuals based on pairwise nucleotide divergence for 3,779,316 homologous SNPs. The tree was rooted using two individuals from Vietnam. Colors are used to differentiate among individuals originating from six major geographic regions across the Russian Federation (see Fig. 1): Eastern Siberia, North-Eastern Siberia, Western Siberia, Volga-Ural region, the Caucasus, and Western Russia. The separation between the three eastern regions (Eastern, North-Eastern and Western Siberia) and the western regions (Volga-Ural, Caucasus and Western Russia) is centered along the Ural Mountains. (b) Haplotype-based tree of samples from the territory of Russia and neighboring countries. (c,d) The heatmaps of gene flow barriers show, for each point on the geographical map, the interpolated differences between the estimated allele frequency (AF) at that point and the AFs in its vicinity. (c) The maximum difference in AFs over all directions is plotted. (d) The direction of the maximal difference in allele frequencies is coded by colors and arrows.
Individuals from Pskov clustered together with Estonians, Latvians, and Lithuanians, highlighting their close contact.
Novgorod samples also showed similarities to Finno-Ugric groups such as Estonians and Ingrians. Yakuts clustered together with Evens and Evenks (Fig. 4b), but the haplotype sharing also shows a close relation of Yakuts to Mongols, Buryats, Altaians, and Tuvinians, consistent with their postulated Turkic founders being from the Lake Baikal region [69]. The two phylogenetic approaches, when supplemented with Fig. 2, add confidence to a definitive population structure that indicates appreciable population isolation in recent times. The genetic distinctions seen in the phylogenetic analyses suggested that strong geographic isolation has restricted gene flow between certain groups. To assess this possibility, we imputed allele frequencies (AF) at each point of a geographical grid from the available AF and estimated the differences in AF between the predicted population at each point on the grid and other predicted populations in the geographic vicinity of that point in eight directions, using a method derived from the one used in Pagani et al. [37]. Three putative gradients indicative of restricted gene flow “barriers” were identified within Russia (Fig. 4c), taking into account the direction of the most rapid allele frequency change and the maximal directional allele frequency changes (Fig. 4d). The most intense gene flow restriction occurs in western Siberia (corresponding to the Urals and the Ob’ river) and separates the area populated predominantly by ethnic Russians (or Russian-speaking descendants of other eastern European groups) from the native people of Siberia and the Arctic. A second detected gene flow restriction lies in North Eastern Siberia (along the Lena River and Verkhoyansk Mountain Range). A third gene flow restriction barrier was identified at the Russian border at the South Far East (Fig. 4c).
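The comparison step of this procedure, in which each grid point is compared with its neighbours in eight directions, can be sketched as follows. This is a minimal Python illustration and not the authors' implementation: it assumes a single, already interpolated allele-frequency value per grid cell, and the names `DIRECTIONS` and `max_af_difference` are hypothetical. The imputation of allele frequencies onto the grid and the aggregation across the genome are not shown.

```python
# Offsets (row, column) for the eight compass directions; row 0 is taken as north.
# Both the grid layout and the direction labels are assumptions for illustration.
DIRECTIONS = {
    "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
    "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
}


def max_af_difference(grid, row, col):
    """Return (largest absolute AF difference, direction) for one grid point.

    `grid` is a list of rows, each a list of allele frequencies in [0, 1].
    Neighbours that fall outside the grid are skipped.
    """
    best_diff, best_dir = 0.0, None
    for name, (d_row, d_col) in DIRECTIONS.items():
        r, c = row + d_row, col + d_col
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
            diff = abs(grid[r][c] - grid[row][col])
            if diff > best_diff:
                best_diff, best_dir = diff, name
    return best_diff, best_dir


# Toy 3 x 3 grid with a sharp change to the east of the central cell.
toy_grid = [
    [0.1, 0.1, 0.1],
    [0.1, 0.2, 0.6],
    [0.1, 0.1, 0.1],
]
difference, direction = max_af_difference(toy_grid, 1, 1)
```

Under these assumptions, the magnitude returned for every cell corresponds to the kind of quantity mapped in Fig. 4c, and the direction to the kind of quantity mapped in Fig. 4d.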
For the third barrier, the direction of maximal local divergence in allele frequencies changes from North to South as one crosses the restriction gradient region from South to North (Fig. 4d). For the other two restricted regions, the direction of maximal local divergence in allele frequencies changes from East (North East) to West as one crosses the barrier from West to East (but not vice versa), and the maximal whole-genome allele frequency changes are large. It is notable that the three areas of restricted gene flow correlate with geographic and climatic features, which may provide physical barriers for human migrations.
Fig. 5. Demographic history of Yakut and Pskov-Novgorod populations. The GADMA approximation of populations' demographic history for three Genome Russia populations based upon comparison of expected allele frequency and the allele frequency spectrum. The best composite likelihood scenario suggests a founder event 6900 years BP and split between western Russian populations and Yakut ancestors approximately 1200 years BP, coincident with the establishment of human settlements in Russian regions.
allele frequency spectrum built without inferring ancestral alleles (folded AFS) for three populations: Yakut, Novgorod and Pskov (n = 14 unrelated for each). The GADMA approximation software tool [70] was used to compare the expected allele frequency and the observed allele frequency spectrum (AFS) over the parameter value space by computing a composite-likelihood score for the best plausible evolutionary scenarios (Fig. 5). The scenarios were simulated with the AFS data and the results were used to calculate the likelihoods of best fit for each model. Pskov and Novgorod show nearly identical histories.
The best model and reconstruction combined the Western Russia population and indicated patterns implying a common “out of Africa” coalescence date at 70,000 years BP followed by a split and asymmetric migration from western Russia (Pskov –Novgorod) toward Yakutia 6900 years ago, followed by slower population growth and very limited migration events between the relatively isolated populations. A more recent split between Pskov and Novgorod occurred around 1200 years ago and was followed by population growth in both populations. All three populations have subsequently increased effective population size, most probably following postulated founder events and expansion in Russian regions around that period. 2.7. Demographic history The demographic history of a population's founder events or population bottlenecks can influence the genetic diversity, the length of haplotype blocks generated by linkage disequilibrium, and the genomewide patterning of endemic variation. We noted moderately high SNP variation in the Novgorod and Pskov samples compared to Yakut (Supplemental Fig. S1), raising the prospect of a population bottleneck or an historic founder event in the Yakut population's past. Another possible reason for the difference in variant numbers is reference bias, as the human reference genome reflects more European genetic variation. The distinctiveness of the study populations prompted a closer look at the patterns of SNP variation across the genomes. First, we computed the average length of extended regions of SNP homozygosity and noted that the Yakut population displayed relatively longer homozygous stretches (median length = 127 kbp) than the western Russian samples (median length = 119 kbp; Supplemental Fig. S6). 
When SNP density was plotted across the entire genome of Yakut and compared to Novgorod and Pskov, there were multiple chromosomal regions of the Yakut genomes with diminished SNP variation that would corroborate the evidence of a recent founder event or population contraction (Supplemental Figs. S6, 7). To assess demographic history within a population, coalescence rates were calculated and scaled by mutation rate and generation time (Fig. 5). Patterns of whole-genome sequence variation were used to model population history using the diffusion approximation to the
2.8. Haplotype map
An important goal of the Genome Russia Project is to construct a haplotype map (HapMap) of ethnic Russians and several smaller ethnic minorities within the Russian Federation for further use in gene association as well as population studies. We analyzed western Russians (Novgorod and Pskov) and Yakuts separately and created two haplotype maps. We also assessed the variation in SNP density within the western Russians as compared with Yakut (Supplemental Fig. S7) and haplotype length (Supplemental Fig. S8). A Haploview LD structure of a homologous region on chromosome 17 is presented in Supplemental Fig. S9. Although these haplotypes are based upon a limited number of trios, they illustrate the comparative differences between the large ethnic Russian populations (Novgorod and Pskov) and the more isolated ethnic group of Yakuts. With more extensive sampling, we expect that the precision, accuracy and utility of the LD patterns will increase substantially. Variation and LD structure of Genome Russia samples can be visualized using a genome track at http://garfield.dobzhanskycenter.org/genomerussia/.
modern Russia, and the presence of Estonians, Latvians and Lithuanians in the same branch on the phylogenetic tree probably indicates the same gene flow in the area, as the three modern Baltic countries had historic contact. When we examined the Yakut ethnic population in Siberia, our analysis supports the prior historic evidence on the origin of the Yakut people who live in the Sakha Republic (Yakutia) in eastern Siberia (North East Asia) and that practiced animal husbandry and semi-nomadic lifestyle. The ancestors of Yakuts were Turkic people with some Mongolian admixture that migrated from the Yenisey river to the Lake Baikal region, and expanded to the north, as far as Kolyma river [72]. Our data clearly shows this to be the case, indicating among other things a historic admixture between the Yakut and the native North Siberian people such as Evens and Evenks (Fig. 4b). On the other hand, some individuals are closer to the Altaic people and the Mongolians, supporting the earlier theories of Yakut origins [72]. Lastly, we present preliminary haplotype maps of the three groups: ethnic Russians represented by trios collected in Novgorod and Pskov regions, as well as for the Yakut population. The population-specific HapMap will assist in the identification of the causal or operative variants resolved by genome-wide association studies. 3. Discussion We present here an analysis of genomic patterns and inference of 264 people representing 55 ethnic groups living across the Russian Federation today in an initial step toward developing a comprehensive database describing genetic diversity in Russia (Fig. 1). Using whole genome sequences, we quantified the variation, and catalogued that for medical and population-based studies. Our first goal was to inspect variation of importance in medical diagnostics by screening known disease-associated variants and “loss-of-function” mutations within all genes. 
In a sampling of healthy Russian study volunteers, we present known variants mined from the HGMD database [49] as well as predicted loss-of-function SNPs, short and long insertion/deletion variants, segmental duplications and copy number variation regions (Table 2; Supplemental Tables S5–10). We demonstrate the occurrence and frequencies in trait- and disease-associated variants as an illustration of the medical genomic information relevant to diagnostics and prognostics of the Russian populations (Fig. 3). In 1950, Hermann Muller introduced the terms “Our Load of Mutations”, also termed “Genetic Load”, to assess and describe the accumulation of damaging or fitness-lowering variants that influence population survival, individual health and biomedical cost [71]. From the genome data offered here (Table 2; Supplemental Tables S5–13) we can begin to impute the genetic load qualitatively and quantitatively as a preview of the more comprehensive analyses anticipated in the Genome Russia Project, in parallel with the human genome sequencing initiatives begun in many world communities. We and others have applied the tools of molecular evolutionary genetics to address the many anthropological conundrums that involve Russian peoples and their recent ancestry [16–35]. We estimated relative ancestry of populations directly and compared the relatedness and phylogenetic distinctions among different ethnicities using multiple phylogenetic algorithms (Figs. 2, 4). The results demonstrate a separation of ethnic groups along geographic regions, which is further indicated by imputation of gene flow barriers across the Russian landscape. Coalescence calculations conform to archaeological estimates as affirming a recent isolation and separation of certain populations (Fig. 5). The Yakut genomes display moderate genetic homogeneity, most of which may be explained by founder events and genetic drift also mentioned in previous publications [23,27]. 
Our data lend support to historical records that suggest that the ethnic Russian people had early contact with groups speaking Uralic and Baltic languages and their subsequent expansion from the Central Eastern Europe to North, South and Eastern frontiers, followed by encounters with Uralic and also Baltic speaking populations [11,12]. This historical contact inevitably contributed to admixture and to the patterns seen in the current local genome diversity in western Russia. While ethnic Russian populations cluster with the West Europeans in the PCA plot (Fig. 2a, b), the groupings are not tight; rather they are spread along an axis indicating divergence and admixture (Fig. 2b). The neighbor-joining and fineSTRUCTURE trees show that Novgorod and Pskov define distinct clusters that group with their immediate neighbors: Novgorod with the Uralic (Komi, Ingrian, Estonian) and Pskov with the Baltic people (Estonian, Latvian and Lithuanian) (Fig. 4a and b). At the same time, the Uralic populations very likely received the genetic contributions from the Russians and other Slavs, with whom they share branches of the phylogenetic trees (Fig. 2b; Fig. 4b). In addition, other peoples that came in contact in the area carry the evidence of historic admixture (e.g. Scandinavian and Finns: Fig. 2b). The occurrence of Uralic admixture in Novgorod corroborates the historic evidence. In the middle of the 9th century, Novgorod was an important trade post on the route from the Baltic Sea to Constantinople in the Byzantine Empire. At the time, various Finnish, Baltic, and Slavic tribes populated the area [13]. The presence of Uralic admixture in Novgorod is justified by the historic contacts and gene flow that occurred for at least a millennium. Pskov is the westernmost region in 4. Materials and methods 4.1. 
Sample description
We sampled family trios (two biological parents and their full aged children) of ethnic Russians from Pskov and Novgorod and Yakuts from Yakutia in Siberia (Table 1). The two ethnic Russian populations originated from the western part of the Russian Federation, namely the Pechora district of Pskov region and Starorussky district of Novgorod region. The Yakut population represents East Siberia and was collected in various locations in the Yakutia (Sakha) Republic. The research protocol and informed consent documents were approved by the Institutional Review Board (IRB) of the Saint-Petersburg State University (#65/2015). DNA was extracted from blood samples using the MagCore HF16 Automated Nucleic Acid Extractor (RBC Bioscience).
4.2. Data processing
4.2.1. Sequencing
One μg of each DNA sample was used as starting material for whole genome library preparation. DNA was sheared using an M220 Focused-ultrasonicator™ and microTUBE-50 tubes (Covaris, Inc.). The targeted library insert size was 350 bp. Genomic DNA libraries were constructed using TruSeq DNA PCR-Free Library Preparation Kits (Illumina, Inc., USA). All laboratory procedures were conducted in accordance with the protocol “TruSeq DNA PCR-Free Library Prep Reference Guide” (Illumina Part # 15036187 Rev. D, 2015). The final libraries were quantified using the KAPA library quantification kit for Illumina sequencing platforms (KAPA Biosystems, Inc., USA) and sequenced on the Illumina HiSeq 4000 platform (PE 2 × 150 bp; Illumina, Inc., USA) at the Resource Center Biobank of the Research Park of Saint-Petersburg State University, Russia, in accordance with the protocol “Illumina HiSeq 4000 System Guide” (Illumina Part # 15066496, Rev. 02 RUS, 2016).
4.2.2. Data analysis infrastructure
Data analysis was performed at the Theodosius Dobzhansky Center for Genome Bioinformatics of Saint-Petersburg State University.
For our project, we developed a closed protected network to securely perform data analyses. The protected network does not have any connections to the Internet or to other segments of the computer network. It is divided into two subnetworks located in two separate buildings, one of which contains the main storage system and the second one contains a backup storage system. Access to the network is granted from eight dedicated desktops for researchers of the Dobzhansky Center. Data analyses were performed on a server cluster containing 192 CPU cores and 1.5 Tb of memory. For data storage, we use storage systems with a total capacity of 150 Tb.
To check the quality of genotype data and correctness of gender and family relation data of the DNA samples, we assessed the percentage of missing genotypes per sample, looked for potential outliers using PCA on genotype data and compared the identity-by-descent and gender predicted from genotype and stated in the phenotype data using PLINK 1.9 [82]. We also assessed the percentage of known and novel SNPs, indels and singleton variants in the three datasets. Genotype statistics were collected using BCFtools 1.3 [77] and PLINK 1.9 [82]. For some of the additional analyses, we needed a dataset in which genotypes were merged. Merging was performed in PLINK 1.9. Indels were aligned using LeftAlignAndTrimVariants in the GATK toolkit [83] prior to merging.
To reduce the number of false positives in the list of LoF indels, we validated them by an alternative alignment-free genotyping method. For indel verification, we computed all 23-mers based on raw reads using Jellyfish software [74] and constructed de Bruijn graphs based on these k-mers. For each indel we searched for unique flanking regions in the reference genome and filtered out all indels located in repeated regions or located in regions with several closely located SNVs. Next, each indel was confirmed by the presence of a bubble structure between two unique paths in the constructed graph, while for missing indels we expected only one unique path between two flanks.
4.2.3. Raw read quality control and filtration
The initial quality control of the raw sequence reads was performed using FastQC [73]. The distribution of 23-mer coverage was calculated and visualized by KrATER [https://pypi.python.org/pypi/KrATER/0.1], based on the Jellyfish [74] k-mer counter. Adapter occurrence was estimated using Cookiecutter [75]. As adapter occurrence was low and had little impact on the genome alignment, we skipped the adapter removal stage. Finally, only reads with a mean quality score equal to or higher than 20 (Q20) were retained. Overall, six parameters were measured to assess the quality of the sequencing data:
1. Mode of coverage;
2. Estimated mean coverage (calculated only for the non-repetitive region of genome using 23-mer distribution);
3. Variance coefficient of coverage (estimation of uniformity of coverage);
4. Fraction of read pairs with both reads retained after filtration (estimation of sequencing quality);
5. Fraction of 23-mers with errors (estimation of sequencing error rate);
6. Fraction of read pairs without adapters or “N”s (estimation of library preparation and sequencing quality).
4.2.6. Copy number variation and segmental duplication discovery
Copy number variants (CNVs) and segmental duplications were identified for the 60 individuals from Pskov, Novgorod and Yakuts. The human genome reference assembly GRCh38 was hard-masked from the repetitive regions using RepeatMasker [81] and Tandem Repeats Finder [84] software. Potential repeats also were identified with a k-mer approach by alignment of 36-mers using mrFast [85] and masking out the overrepresented fragments from the assembly. The copy number values (CNVs) were evaluated using mrCaNaVaR [85] in non-overlapping windows of 1 kbp of unmasked sequence.
From each read of length 100 bp we selected two non-overlapping k-mers. The flanking regions of length 9 bp of potentially lower quality were excluded from the analysis. Population genetic analysis on CNVs was performed using the Vst statistic, which estimates the proportion of variance attributable to variation between populations [86]. Analysis was based on average CNV values in windows of 100 kbp and involved 15 unrelated samples from Novgorod, 16 from Pskov and 14 from Yakutia. Segmental duplications (SD) were defined as regions that span at least 10 kbp in genomic coordinates of increased average copy number value in comparison to the mean copy number value in control (non-repetitive) regions of the corresponding individual with correction for dispersion.
Several samples contained low quality tiles (according to FastQC) in some sequencing lanes. For these samples additional filtration was applied. All reads from low quality tiles were removed before the filtration steps described above.
4.2.4. Read alignment
We mapped raw reads that passed our quality control measures to the GRCh38 human reference genome using Bowtie2 2.2.8 [76] with the “--very-sensitive” and “-X 800” options and obtained one BAM file per sample. We obtained alignment statistics from BAM files using a combination of samtools-1.3 [77], BEDTools2-2.25.0 [78] and custom scripts written in Python 2.7.
4.2.5. Variant calling and genotyping
We sorted and indexed the individual BAM files using Sambamba 0.6.1 [79]. We used the SAMtools 1.3 mpileup utility with options -q 37 -Q 30 -t AD,INFO/AD,ADF,INFO/ADF,ADR,INFO/ADR,DP,SP and the BCFtools 1.3 call utility [77] with options -v -m -f GQ,GP for joint genotyping of samples in each population. To get a set of high-confidence variants, we selected only the variants that passed all of the following filters: (1) QUAL > 40, (2) FORMAT/GQ > 20, (3) FORMAT/DP > 10 and (4) FORMAT/SP < 20 by using the BCFtools view utility.
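The four thresholds above can be expressed as a simple predicate. The following Python sketch is only an illustration and not the BCFtools invocation used in the pipeline; the `Call` record and the function `passes_filters` are hypothetical names, and the sketch assumes that the QUAL value and the per-sample FORMAT values have already been parsed from the VCF for a single genotype call.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One genotype call; field names are illustrative, not a library API."""
    qual: float  # site-level QUAL
    gq: int      # FORMAT/GQ, genotype quality
    dp: int      # FORMAT/DP, read depth
    sp: int      # FORMAT/SP, Phred-scaled strand bias P-value


def passes_filters(call: Call) -> bool:
    """Apply the four thresholds listed in the text (all must hold)."""
    return call.qual > 40 and call.gq > 20 and call.dp > 10 and call.sp < 20


calls = [
    Call(qual=55.0, gq=30, dp=25, sp=3),   # passes all four filters
    Call(qual=40.0, gq=30, dp=25, sp=3),   # fails: QUAL is not strictly above 40
    Call(qual=55.0, gq=30, dp=25, sp=20),  # fails: SP is not strictly below 20
]
kept = [c for c in calls if passes_filters(c)]
```

All comparisons are strict, matching the inequalities as written; how the per-sample criteria were combined across jointly genotyped samples is not specified in the text and is left out of the sketch.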
We also filtered out variants by the universal mask (using an approach similar to [22]), which contained low-mappability and low-complexity genomic regions and covered 24% of the human genome. The regions of low mappability were identified in the following way: for each position in the genome, all 151-mers covering it were mapped back to the reference human genome using the Bowtie2 aligner with the same options as used for the read alignment, and the ratio of the uniquely mapped 151-mers was calculated. If the ratio was < 0.5, then the position was considered to belong to a low-mappability region. The low-complexity genomic regions were obtained by merging three sets of regions: homopolymers of 7 bp or longer, low-complexity regions identified using DustMasker [80], and RepeatMasker-annotated low-complexity and microsatellite regions, and adding 10 bp to their flanks [81]. Statistics on the universal mask and its components are given in Supplemental Table S2.
4.2.7. Long indel calling
We called genomic variants in the Pskov, Novgorod and Yakut populations using Platypus [87] with default options except for --assemble=1, which enables local read assembly functionality. We filtered the obtained variants in the following series of steps: (1) indels called by Platypus (with “PASS” tag in “FILTER” field); (2) indels successfully normalized; (3) long indels (20 to 100 bp); (4) indels with quality scores (QUAL) > 40; (5) indels with minimal genotype quality (GQ) > 20; (6) indels outside of low complexity and low mappability regions. For steps (1), (2), (4) and (5) we used BCFtools utilities [77]. In step (2), we normalized indels using the BCFtools norm utility with the following options: --check-ref x -m-. In step (3), we selected long indels (20 to 100 bp) using a custom script.
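The length criterion of step (3) is that the difference between the lengths of the reference and alternative alleles lies between 20 and 100 bp, inclusive. The authors' custom script is not available here, so the following Python sketch is an assumption about how such a check might look; the function names `indel_length` and `is_long_indel` are hypothetical, and the sketch assumes normalized, biallelic records with the alleles given as plain strings.

```python
def indel_length(ref: str, alt: str) -> int:
    """Absolute difference between reference and alternative allele lengths."""
    return abs(len(ref) - len(alt))


def is_long_indel(ref: str, alt: str, min_len: int = 20, max_len: int = 100) -> bool:
    """True if the allele length difference is within [min_len, max_len] bp."""
    return min_len <= indel_length(ref, alt) <= max_len


# A 25 bp insertion after an anchor base, and a 3 bp deletion for contrast.
insertion = ("A", "A" + "T" * 25)
deletion = ("ACGT", "A")
long_flags = [is_long_indel(*insertion), is_long_indel(*deletion)]
```

Taking the absolute difference treats insertions and deletions symmetrically, which is one reading of "the difference between the lengths" in the text.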
An indel was considered to have length from 20 to 100 bp if the difference between the lengths of the reference allele and the alternative allele was greater than or equal to 20 bp and less than or equal to 100 bp. In step (6), we filtered out indels located in low-complexity and low-mappability genomic regions using the universal mask described above and the BEDTools [78] intersect utility with the options -v -header. To determine the conformance of long indels to Mendelian laws of inheritance, we used BCFtools [77] with the “mendelian” plugin. We considered a long indel from our datasets to be present in a database if its coordinates as well as reference and alternative alleles exactly matched those of some long indel in the database. As the 1000 Genomes Phase 3 [47], ExAC [51] and gnomAD [51] databases use GRCh37 genomic coordinates, we employed the UCSC Genome Browser utility liftOver to convert genomic coordinates of long indels in our datasets from GRCh38 to GRCh37. We intersected sets of long indels using the BCFtools [77] isec utility with options -n=2 -w1. For long indel annotation and filtration, we used the Ensembl Variant Effect Predictor (VEP) version 84 [53]. In Supplemental Table S9, we considered a long indel to be novel or existing if it lacked or had, respectively, an rs identifier in its VEP annotation.
databases and with associated diseases obtained from the GWAS catalog [90], ClinVar [91] and HGMD [49]. We compared MAFs of variants identified in our data with those from 1000G phase 3 populations [47] using a chi-square sum test and testing codominant, dominant, recessive and allelic models. To gain more statistical power, we combined the individuals from Pskov and Novgorod. MAF estimation was performed after removing children from trios. To identify disease-causing mutations in our samples, we used the Human Gene Mutation Database (HGMD) Professional version 2016.2 [49].
We considered an HGMD variant as potentially pathogenic if it was annotated as a “Disease-causing Mutation” (DM). Pathogenic status of variants was accepted after manual curation according to American College of Medical Genetics and Genomics (ACMG) recommendations [92]. Literature search was performed in PubMed using the dbSNP reference SNP ID number, gene name or disorder name as a search term. Pathogenic HGMD-DM variants were screened for variants recommended by the ACMG to be returned to patients in genome and exome sequencing studies [93]. 4.2.8. Creating a combined dataset of SNPs For SNP-based analyses, including both mining of putatively medically-relevant variants and population genetics, we combined our 60 Genome Russia individuals (20 each from Novgorod, Pskov and Yakutsk) with the two published whole genome sequencing datasets of Pagani et al. [37] and Mallick et al. [36]. In all subsequent analyses (except for fineSTRUCTURE, PCA and ADMIXTURE, see below) we used only the samples collected on the territory of the Russian Federation: 31 samples from Mallick et al. [36] and 173 samples from Pagani et al. [37]. To merge the genotype data, we performed the following steps: (1) all genotype data from the two published datasets were converted to PLINK format and merged using the PLINK v.1.9 merge utility [88] (when possible we swapped the alleles and removed the SNPs when alleles were discordant); (2) we lifted the merged genotype data SNP coordinates to GRCh38 using the UCSC liftOver tool; (3) the resulting genotype data was merged with the Genome Russia genotype data of 60 samples using PLINK in the same way as described in (1). (4) Samples from the territory of Russia were extracted for further analyses; and (5) we applied the universal mask to remove low mappability and low complexity regions (as described above). By running a preliminary PCA, we identified a batch effect associated with the genotypes distinguishing the individuals from the Pagani et al. 
[37] study versus the Mallick et al. [36] study. To correct for this batch effect, we ran a chi-square test and removed SNPs with p-values higher than 0.05 after Bonferroni correction. An additional PCA showed that this resolved the batch effect.

4.3. Population data analysis

We performed population genetic analysis on the data obtained from the Genome Russia Project (n = 60), Pagani et al. [37] (n = 173) and Mallick et al. [36] (n = 31), which together provide a wide geographical sampling of Eurasian peoples. We merged the whole genome data as described above. We further reduced the number of SNPs by removing those with a call rate < 95% and MAF < 5%. We performed LD pruning using PLINK v1.9 [88] (--indep-pairwise 1000 50 0.2) to select independent SNPs for non-phased data analysis. Finally, we reduced the number of individuals representing the Genome Russia Project by excluding progeny (Table 1).

4.3.1. Principal component analysis (PCA)

We performed exploratory PCA on all Eurasian samples using the SNPRelate [94] R package on the pruned set of SNPs.

4.3.2. Admixture analysis

We used the unsupervised ADMIXTURE [46] algorithm to estimate genetic structure in the Genome Russia multilocus SNP dataset relative to the published data [36,37]. A total of 557 individuals were included in our final dataset. Analyses were done for K values ranging from 2 to 10, each with 200 bootstrap replications. The best-fitting K was selected according to the cross-validation error (Supplemental Table S14).

4.2.9. Genotype phasing

We performed phasing of genotype sets using SHAPEIT v2 [89] without reference panels. We phased the following datasets: (1) Pskov and Novgorod individuals together and (2) Yakut individuals (both (1) and (2) were used for population-specific haplotype map creation); (3) all Genome Russia individuals combined with published Eurasian samples from [36,37] (for fineSTRUCTURE tree construction).
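The call rate and MAF thresholds used in Section 4.3 (and again in the phasing and F3 steps) can be sketched per SNP as follows. This is a minimal pure-Python illustration, not the PLINK implementation the authors ran; the genotype coding (0, 1, 2 copies of one allele, `None` for a missing call) and the function names are assumptions made for the example.

```python
def snp_call_rate_and_maf(genotypes):
    """Return (call_rate, maf) for one SNP.

    genotypes: list with one entry per individual, holding 0, 1 or 2
    copies of a chosen allele, or None when the genotype is missing.
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes) if genotypes else 0.0
    if not called:
        return call_rate, 0.0
    # Allele frequency is half the mean genotype value.
    freq = sum(called) / (2 * len(called))
    maf = min(freq, 1.0 - freq)
    return call_rate, maf


def keep_snp(genotypes, min_call_rate=0.95, min_maf=0.05):
    """Keep a SNP unless its call rate is < 95% or its MAF is < 5%."""
    call_rate, maf = snp_call_rate_and_maf(genotypes)
    return call_rate >= min_call_rate and maf >= min_maf


# Hypothetical SNPs across 20 individuals.
common = [0, 1, 2, 0, 1, 0, 0, 0, 0, 0] * 2   # frequency 0.2, fully called
monomorphic = [0] * 20                         # MAF 0
poorly_called = [1] * 18 + [None] * 2          # call rate 0.9
print(keep_snp(common), keep_snp(monomorphic), keep_snp(poorly_called))
```

In practice these filters are applied to the whole dataset at once by PLINK; the sketch only makes the two criteria explicit.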
Each dataset was phased in the following way: (1) genotypes were filtered using PLINK v1.9 [88] to remove samples with call rates < 95% and families with a Mendel error rate > 5%, and to remove SNPs with MAF < 5% and a Mendel error rate > 5%; (2) SNP positions were mapped to hg19 using the UCSC liftOver tool to make the coordinates consistent with the SHAPEIT v2 recombination maps; (3) only autosomes were included; (4) SHAPEIT v2 was run with default options and genetic maps based on data from the 1000 Genomes Project (1000G). Chromosome X was phased for haplotype map construction using the SHAPEIT --chrX parameter. One family from Pskov included two children, and only one of them was kept for phasing.

4.3.3. F3 statistics

We calculated the F3 statistics using the Pskov, Novgorod and Yakut populations as targets and all possible pairs of Eurasian populations as sources using qp3Pop from AdmixTools [95] with default settings. This analysis was performed on the genotype file filtered as follows: call rate > 0.95, MAF > 0.05 and HWE p > 10⁻⁴; LD pruning was performed in PLINK with --indep-pairwise 1000 50 0.5. Only the F3 results with a Z-score < −3 were reported.

4.3.4. Identification of ancestry informative markers

Ancestry informative markers (AIMs) were identified based on 1000G phase 3 [47] EUR, EAS and SAS data by identifying SNPs with an allele count difference higher than 0.5 in each possible population pair. These SNPs were considered AIMs for the population with the higher allele counts.

4.2.10. Variant annotation

The set of variants obtained after filtration was annotated using the Ensembl VEP [53] and the Loss-Of-Function Transcript Effect Estimator (LOFTEE) plugin to obtain potential LoF variants. We annotated the variants with MAFs from the 1000G phase 3 [47], ExAC [51] and gnomAD [51]

4.3.5.
Genetic distance and the identity by state (IBS) analysis

We used Nei's $D_A^{1/2}$ distance [96] to evaluate genetic differences between each pair of individuals and to perform a hierarchical cluster analysis with complete linkage. The squared $D_A$ genetic distance is the mean of the squared Hellinger distance between allele frequencies over the whole genome. We also performed identity by state (IBS) analysis. The results obtained from the IBS analysis were very similar to the results obtained with the squared $D_A$ distance. The allele frequencies of each population were obtained as half the mean of the available genotypes, coded as 0, 1 and 2 for the common homozygote, heterozygote and minor homozygote, respectively. Undefined variants were excluded from the analysis. Kernel smoothing is an efficient way to evaluate allele frequencies at any point on the Earth. We determined that the Nadaraya–Watson estimator used in [22] to evaluate the expected allele frequencies $f_a(x)$ at any point $x$ of the grid from the observed allele frequencies $f_{ka}$ of the $k$-th population located at some point $x_k$ does not work well if the population locations $x_k$ are not uniformly distributed on the Earth. We evaluated the expected values $f_a(x)$ from the observed allele frequencies $f_{ka}$ using a generalized linear smoother [97]:

$$f_a(x) = \sum_{k=1}^{m} w_k(x)\, f_{ka}, \qquad (1)$$

where $w_k(x)$ is the weight of the $k$-th population at the point $x$.

difference in allele frequencies (gradient) $\Delta f(x, y_{\max})$) displays the magnitude of the barrier, and a higher value of the ratio $\Delta f(x, y_{\max})/\Delta f(x, y_{\min})$ in the neighborhood of the boundary (the ratio should be equal to 1 on the boundary) displays the sharpness of the barrier. Larger values of the ratio near the border should mean more gene flow along the border.

4.3.7.
Neighbor-joining tree

We estimated the phylogenetic relationships among 231 individuals of Russian ancestry, including 60 individuals from the Genome Russia Project (Novgorod, Pskov and Yakut), using the neighbor-joining algorithm [98]. The majority of the individuals included were derived from the studies by Mallick et al. [36] and Pagani et al. [37]. Individuals showing a high proportion of admixture between two or more populations (based on the ADMIXTURE results reported in the Mallick et al. and Pagani et al. studies) were excluded from the analysis. A data matrix of 3,779,316 homologous SNPs was assembled after filtering the full SNP data set according to call rate (> 95%), minor allele frequency (> 5%) and Hardy–Weinberg equilibrium (p < 1e-4). Neighbor-joining trees based on pairwise nucleotide divergence were constructed using PAUP* v4.0a159 [99]. Ties were broken randomly and no topological constraints were defined during the tree search. Tree files generated by PAUP* v4.0a159 (.tre) were saved and then visualized in FigTree v1.4.3 [100]. Two Vietnamese individuals were used to root the distance tree.

In formula (1), the weights are

$$w_k(x) = \frac{d_k^{-1}\,\varphi_{\sigma}(x - x_k)}{\sum_{j=1}^{m} d_j^{-1}\,\varphi_{\sigma}(x - x_j)}, \qquad d_k = \frac{1}{m}\sum_{j=1}^{m} \varphi_{\sigma_0}(x_k - x_j),$$

where $d_k$ is the kernel density estimator with the Gaussian kernel $\varphi$. In contrast to the Nadaraya–Watson estimator, this approach is much more robust to splitting populations (up to single individuals). We heuristically selected the kernel width for the density, $\sigma_0$ = 500 km, and the main kernel width, $\sigma$ = 1000 km. We mapped the genetic IBS distances based on allele frequencies between the Pskov, Novgorod and Yakutia populations and the evaluated allele frequencies at any point of the geographic grid.

4.3.8. Haplotype-based tree

We used fineSTRUCTURE v2 [68] to create a haplotype-based tree. This was done on the set of 60 individuals from Pskov, Novgorod and Yakutia combined with all samples from Eurasia from [36,37].
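The generalized linear smoother with Gaussian kernels and the two kernel widths given above (500 km for the density, 1000 km for the main kernel) can be sketched in Python as follows. This is an illustrative reading of the method, not the authors' code: the use of great-circle (haversine) distance as the kernel argument, the Earth radius value, the normalization of the density estimate, and all function names and coordinates are assumptions. Because the weights are normalized to sum to one, any constant factor in the density estimate cancels.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def gaussian_kernel(distance_km, width_km):
    """Unnormalized Gaussian kernel evaluated at a distance."""
    return math.exp(-0.5 * (distance_km / width_km) ** 2)


def smoothed_frequency(x, locations, freqs, sigma=1000.0, sigma0=500.0):
    """Estimate the allele frequency at point x from population frequencies.

    locations: (lat, lon) of each sampled population.
    freqs: observed allele frequency of each population for one allele.
    """
    m = len(locations)
    # Kernel density of sampling locations, evaluated at each population.
    density = [
        sum(gaussian_kernel(haversine_km(xk, xj), sigma0) for xj in locations) / m
        for xk in locations
    ]
    # Kernel weights, down-weighted where populations are densely sampled.
    raw = [
        gaussian_kernel(haversine_km(x, xk), sigma) / dk
        for xk, dk in zip(locations, density)
    ]
    total = sum(raw)
    return sum(w * f for w, f in zip(raw, freqs)) / total


# Approximate, illustrative coordinates and made-up frequencies.
locations = [(57.8, 28.3), (58.5, 31.3), (62.0, 129.7)]
freqs = [0.1, 0.1, 0.6]
print(round(smoothed_frequency((57.8, 28.3), locations, freqs), 4))
```

Dividing each kernel weight by the local sampling density is what keeps two nearby populations (or one population split into individuals) from counting twice as much as a single isolated one.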
We phased the data as described above and removed children, which resulted in 573 samples used in the analysis. We ran fineSTRUCTURE with default settings using the 1000G phase 3 recombination maps. ChromoPainter was run automatically within the fineSTRUCTURE command with default settings. To visualize the resulting tree, we used R scripts provided by the authors of fineSTRUCTURE.

4.3.6. Gene flow barriers analysis

We improved and extended the framework for studying genetic differences between widely distributed populations of any size, originally developed in Pagani et al. [37], to investigate gene flow barriers on a grid. For any node $x_{ij}$ in the geographical grid, we drew a small circle $S_r(x_{ij})$ of radius $r$, placed $d$ = 8 equally spaced points on this circle in the 8 directions S, SE, E, NE, N, NW, W and SW from the node, and calculated directional and mean increments for each node of the grid. The nodes were equally spaced in geographical coordinates, with approximately 25 km between a node at the equator and its four neighbors. The distance (in kilometers) between the nodes depends on the latitude and becomes smaller at high latitudes. The distance between each node of the grid and the eight surrounding points on the circle was fixed at $r$ = 500 km. The allele frequencies were obtained sequentially for each node and the 8 points around it on the circle (9 points in total) using formula (1) above, which evaluates allele frequencies at any point given its geographic coordinates. Finally, taking the mean values of the increments over all loci, we obtained the directions of the smallest and largest divergences and the mean divergence in the area using the following formula:

$$\Delta^2 f(x) = \frac{1}{L}\sum_{a}\frac{1}{R\,r^{2}\,d}\sum_{y \in S(x)} \bigl(f_a(x) - f_a(y)\bigr)^2 .$$

4.4. Haplotype estimation

We inferred haplotypes from multilocus SNP genotypes using the SHAPEIT2 tool [89] as described in the previous sections. This was done separately for the Pskov + Novgorod and Yakutia individuals.
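The eight points placed around each grid node in the gene flow barrier analysis (Section 4.3.6) can be generated with the standard great-circle destination formula. The sketch below is a hedged illustration rather than the authors' implementation: it assumes a spherical Earth with a mean radius of 6371 km, and the names `destination` and `circle_points` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

# Compass bearings in degrees clockwise from north for the 8 directions.
BEARINGS = {"S": 180, "SE": 135, "E": 90, "NE": 45,
            "N": 0, "NW": 315, "W": 270, "SW": 225}


def destination(lat_deg, lon_deg, bearing_deg, distance_km):
    """Point reached from (lat, lon) after travelling distance_km along a
    great circle with the given initial bearing. Returns (lat, lon) in degrees."""
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM  # angular distance
    lat2 = math.asin(math.sin(lat1) * math.cos(delta)
                     + math.cos(lat1) * math.sin(delta) * math.cos(theta))
    lon2 = lon1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    # Normalize longitude to the range [-180, 180).
    lon2_deg = (math.degrees(lon2) + 540.0) % 360.0 - 180.0
    return math.degrees(lat2), lon2_deg


def circle_points(lat_deg, lon_deg, radius_km=500.0):
    """The 8 points at radius_km from a grid node, keyed by direction."""
    return {name: destination(lat_deg, lon_deg, bearing, radius_km)
            for name, bearing in BEARINGS.items()}


# Hypothetical grid node at 60 N, 30 E.
for name, (lat, lon) in circle_points(60.0, 30.0).items():
    print(f"{name:>2}: {lat:7.3f}, {lon:8.3f}")
```

Evaluating the smoothed allele frequencies at the node and at these eight points gives the directional increments from which the smallest, largest and mean divergences are taken.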
Haplotype structure analysis was performed in the Haploview software [101]. Haplotype blocks were estimated using the Solid Spine LD algorithm [101] between each pair of SNPs within a distance of 100 kb.

4.5. Identification of runs of homozygosity

Runs of homozygosity (ROHs) for the three populations (Novgorod, Pskov and Yakutia) were identified using the PLINK2 software [88]. Only biallelic SNPs were considered in the analysis. PLINK2 was launched with the following options: --geno 0.05 --homozyg-density 1000 --homozyg-window-het 1 --homozyg-kb 10 --homozyg-window-snp 20. These options correspond to filtering out variants with a missing call rate > 5% and requiring runs of homozygosity to contain at least one SNP per 1 Mb on average and to be at least 10 kb long. Sliding windows of 20 SNPs containing at most one heterozygous SNP were used to scan every individual for runs of homozygosity.

In the formula for $\Delta^2 f(x)$, the squared differences $(f_a(x) - f_a(y))^2$ are summed over the points $y \in S(x)$, where $S(x)$ is the set of $d$ points on the small circle around $x$, $R$ is the radius of the Earth and $L$ is the number of loci. We call a “barrier” a line across which the genetic difference is maximal. To obtain more detailed results, we looked for the directions of maximal change in allele frequencies at each point on the grid. A true barrier should be accompanied by a rapid change in the evaluated allele frequencies, with a corresponding change in the direction of the maximal difference in allele frequencies in its neighborhood. First, we looked for places where the gradient direction reverses (or nearly reverses); these form the boundary, with the gradients directed outward with respect to it. The local difference in allele frequencies $\Delta f(x)$ (or, more precisely, the maximal directional

5. Data access

The datasets supporting the results of this article are publicly available at http://genomerussia.spbu.ru/dataaccess.html.

Acknowledgements

[6] M.
Raghavan, P. Skoglund, K.E. Graf, M. Metspalu, A. Albrechtsen, I. Moltke, S. Rasmussen, T.W. Stafford Jr., L. Orlando, E. Metspalu, M. Karmin, K. Tambets, S. Rootsi, R. Mägi, P.F. Campos, E. Balanovska, O. Balanovsky, E. Khusnutdinova, S. Litvinov, L.P. Osipova, S.A. Fedorova, M.I. Voevoda, M. DeGiorgio, T. SicheritzPonten, S. Brunak, S. Demeshchenko, T. Kivisild, R. Villems, R. Nielsen, M. Jakobsson, E. Willerslev, Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans, Nature 505 (2014) 87–91, https://doi.org/10. 1038/nature12736. [7] M. Mezzavilla, D. Vozzi, N. Pirastu, G. Girotto, P. d'Adamo, P. Gasparini, V. Colonna, Genetic landscape of populations along the silk road: admixture and migration patterns, BMC Genet. 15 (2014) 131, https://doi.org/10.1186/s12863014-0131-6. [8] D. Xu, S. Wen, The Silk Road: Language and Population Admixture and Replacement, in: Lang, Genes Northwest. China Adjac. Reg., Springer Singapore, Singapore, 2017, pp. 55–78, https://doi.org/10.1007/978-981-10-4169-3_4. [9] K. Prüfer, F. Racimo, N. Patterson, F. Jay, S. Sankararaman, S. Sawyer, A. Heinze, G. Renaud, P.H. Sudmant, C. de Filippo, H. Li, S. Mallick, M. Dannemann, Q. Fu, M. Kircher, M. Kuhlwilm, M. Lachmann, M. Meyer, M. Ongyerth, M. Siebauer, C. Theunert, A. Tandon, P. Moorjani, J. Pickrell, J.C. Mullikin, S.H. Vohr, R.E. Green, I. Hellmann, P.L.F. Johnson, H. Blanche, H. Cann, J.O. Kitzman, J. Shendure, E.E. Eichler, E.S. Lein, T.E. Bakken, L.V. Golovanova, V.B. Doronichev, M.V. Shunkov, A.P. Derevianko, B. Viola, M. Slatkin, D. Reich, J. Kelso, S. Pääbo, The complete genome sequence of a Neanderthal from the Altai Mountains, Nature 505 (2014) 43–49, https://doi.org/10.1038/nature12886. [10] D. Reich, R.E. Green, M. Kircher, J. Krause, N. Patterson, E.Y. Durand, B. Viola, A.W. Briggs, U. Stenzel, P.L.F. Johnson, T. Maricic, J.M. Good, T. Marques-Bonet, C. 
Alkan, Q. Fu, S. Mallick, H. Li, M. Meyer, E.E. Eichler, M. Stoneking, M. Richards, S. Talamo, M.V. Shunkov, A.P. Derevianko, J.-J. Hublin, J. Kelso, M. Slatkin, S. Pääbo, Genetic history of an archaic hominin group from Denisova Cave in Siberia, Nature 468 (2010) 1053–1060, https://doi.org/10.1038/ nature09710. [11] Paul M. Barford, The Early Slavs : Culture and Society in Early Medieval Eastern Europe, Cornell University Press, 2001. [12] J.P. Mallory, In Search of the Indo-Europeans : Language, Archaeology and Myth, Thames and Hudson, 1991. [13] D.A. Machinskiy, Migration of the Slavs in the 1st millennium AD e. (from written sources with the use of archeological data), in: V.D. Korolyuk, L.V. Zaborovsky (Eds.), Form. Early Feudal Slav. Natl. Nauka, Moscow, 1981, pp. 39–51. [14] M.C. Dulik, S.I. Zhadanov, L.P. Osipova, A. Askapuli, L. Gau, O. Gokcumen, S. Rubinstein, T.G. Schurr, Mitochondrial DNA and Y chromosome variation provides evidence for a recent common ancestry between Native Americans and Indigenous Altaians, Am. J. Hum. Genet. 90 (2012) 229–246, https://doi.org/10. 1016/j.ajhg.2011.12.014. [15] I.I. Korel’, L.V. Korel’, Modern contrasts in Russia's interregional migration, Reg. Res. Russ. 5 (2015) 147–153, https://doi.org/10.1134/S2079970515020057. [16] P. Flegontov, P. Changmai, A. Zidkova, M.D. Logacheva, N.E. Altınışık, O. Flegontova, M.S. Gelfand, E.S. Gerasimov, E.E. Khrameeva, O.P. Konovalova, T. Neretina, Y.V. Nikolsky, G. Starostin, V.V. Stepanova, I.V. Travinsky, M. Tříska, P. Tříska, T.V. Tatarinova, Genomic study of the Ket: a Paleo-Eskimo-related ethnic group with significant ancient north Eurasian ancestry, Sci. Rep. 6 (2016) 20768, https://doi.org/10.1038/srep20768. [17] V. Orekhov, A. Poltoraus, L.A. Zhivotovsky, V. Spitsyn, P. Ivanov, N. Yankovsky, Mitochondrial DNA sequence diversity in Russians, FEBS Lett. 
445 (1999) 197–201, https://doi.org/10.1016/S0014-5793(99)00115-5. [18] I. Morozova, A. Evsyukov, A. Kon'kov, A. Grosheva, O. Zhukova, S. Rychkov, Russian ethnic history inferred from mitochondrial DNA diversity, Am. J. Phys. Anthropol. 147 (2012) 341–351, https://doi.org/10.1002/ajpa.21649. [19] M.V. Golubenko, V.P. Puzyrev, V.B. Salyukov, A.N. Kucher, N.O. Sanchat, Distribution of deletion-insertion polymorphism of mitochondrial DNA intragenic region V among indigenous population of the Tuva republic, Russ. J. Genet. 36 (2000) 293–297. [20] E.L. Loogväli, U. Roostalu, B.A. Malyarchuk, M.V. Derenko, T. Kivisild, E. Metspalu, K. Tambets, M. Reidla, H.V. Tolk, J. Parik, E. Pennarun, S. Laos, A. Lunkina, M. Golubenko, L. Barać, M. Peričić, O.P. Balanovsky, V. Gusar, E.K. Khusnutdinova, V. Stepanov, V. Puzyrev, P. Rudan, E.V. Balanovska, E. Grechanina, C. Richard, J.P. Moisan, A. Chaventré, N.P. Anagnou, K.I. Pappa, E.N. Michalodimitrakis, M. Claustres, M. Gölge, I. Mikerezi, E. Usanga, R. Villems, Disuniting uniformity: A pied cladistic canvas of mtDNA haplogroup H in Eurasia, Mol. Biol. Evol. 21 (2004) 2012–2021, https://doi.org/10.1093/molbev/msh209. [21] B. Malyarchuk, A. Litvinov, M. Derenko, K. Skonieczna, T. Grzybowski, A. Grosheva, Y. Shneider, S. Rychkov, O. Zhukova, Mitogenomic diversity in Russians and poles, Forensic Sci. Int. Genet. 30 (2017) 51–56, https://doi.org/10. 1016/j.fsigen.2017.06.003. [22] H. Sahakyan, B.H. Kashani, R. Tamang, A. Kushniarevich, A. Francis, M.D. Costa, A.K. Pathak, Z. Khachatryan, I. Sharma, M. Van Oven, J. Parik, H. Hovhannisyan, E. Metspalu, E. Pennarun, M. Karmin, E. Tamm, K. Tambets, A. Bahmanimehr, T. Reisberg, M. Reidla, A. Achilli, A. Olivieri, F. Gandini, U.A. Perego, N. AlZahery, M. Houshmand, M.H. Sanati, P. Soares, E. Rai, J. Šarac, T. Šarić, V. Sharma, L. Pereira, V. Fernandes, V. Černý, S. Farjadian, D.P. Singh, H. 
Azakli, D. Üstek, N.E. Trofimova, I. Kutuev, S. Litvinov, M. Bermisheva, E.K. Khusnutdinova, N. Rai, M. Singh, V.K. Singh, A.G. Reddy, H.V. Tolk, S. Cvjetan, L.B. Lauc, P. Rudan, E.N. Michalodimitrakis, N.P. Anagnou, K.I. Pappa, M.V. Golubenko, V. Orekhov, S.A. Borinskaya, K. Kaldma, M.A. Schauer, M. Simionescu, V. Gusar, E. Grechanina, P. Govindaraj, M. Voevoda, L. Damba, S. Sharma, L. Singh, O. Semino, D.M. Behar, L. Yepiskoposyan, M.B. Richards, M. Metspalu, T. Kivisild, K. Thangaraj, P. Endicott, G. Chaubey, A. Torroni,

The scientists at the Dobzhansky Center were supported, in part, by the Russian Science Foundation grant (project no. 17-14-01138) and by St. Petersburg State University (Genome Russia Grant no. 1.52.1647.2016). WGS was performed at Research Resource Centre “Centre Biobank”, and data analysis was done at Computing Center, Research park, St. Petersburg State University. OB was supported by the Russian Science Foundation (RSF) grant 17-14-01345, Russian Foundation for Basic Research (RFBR) grant 16-04-00890 and by the State assignments of Russian Ministry of Science for the VIGG (01122019-0001) and for the RCMC. EN and VU were financially supported by the Government of Russian Federation (Grant 08-08). VO and TMS were supported in part by the Ministry of Education and Science of the Russian Federation (Project No. 17.6344.2017/8.9).

Disclosure declaration

The authors declare no competing interests.

Author contributions

Conceptualization: SJOB, VB, TKO, DVZ, SL. Sample collection: VB, NC, AG, IE, AS, SK, MR, AL, AN, TKD, TMS, VO, SL. DNA sample preparation and sequencing: IE, AL, DEP, AG. Data Analyses: DVZ, SM, TKO, KPK, AZ, PD, SK, NC, GT, MR, KK, IE, SSid, AG, EC, AK, SSim, AA, VU, EN, SJOB. Writing and editing: DVZ, VB, SM, TKO, KPK, PD, SK, GT, MR, KK, IE, SSid, AG, OB, AN, SJOB. Project administration: SJOB, VB, VP.

Appendix A.
Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ygeno.2019.03.007.

References

[1] M.E. Allentoft, M. Sikora, K.-G. Sjögren, S. Rasmussen, M. Rasmussen, J. Stenderup, P.B. Damgaard, H. Schroeder, T. Ahlström, L. Vinner, A.S. Malaspinas, A. Margaryan, T. Higham, D. Chivall, N. Lynnerup, L. Harvig, J. Baron, P. Della Casa, P. Dąbrowski, P.R. Duffy, A.V. Ebel, A. Epimakhov, K. Frei, M. Furmanek, T. Gralak, A. Gromov, S. Gronkiewicz, G. Grupe, T. Hajdu, R. Jarysz, V. Khartanovich, A. Khokhlov, V. Kiss, J. Kolář, A. Kriiska, I. Lasak, C. Longhi, G. McGlynn, A. Merkevicius, I. Merkyte, M. Metspalu, R. Mkrtchyan, V. Moiseyev, L. Paja, G. Pálfi, D. Pokutta, Ł. Pospieszny, T.D. Price, L. Saag, M. Sablin, N. Shishlina, V. Smrčka, V.I. Soenov, V. Szeverényi, G. Tóth, S.V. Trifanova, L. Varul, M. Vicze, L. Yepiskoposyan, V. Zhitenev, L. Orlando, T. Sicheritz-Pontén, S. Brunak, R. Nielsen, K. Kristiansen, E. Willerslev, Population genomics of Bronze age Eurasia, Nature 522 (2015) 167–172, https://doi.org/10.1038/nature14507. [2] W. Haak, I. Lazaridis, N. Patterson, N. Rohland, S. Mallick, B. Llamas, G. Brandt, S. Nordenfelt, E. Harney, K. Stewardson, Q. Fu, A. Mittnik, E. Bánffy, C. Economou, M. Francken, S. Friederich, R.G. Pena, F. Hallgren, V. Khartanovich, A. Khokhlov, M. Kunst, P. Kuznetsov, H. Meller, O. Mochalov, V. Moiseyev, N. Nicklisch, S.L. Pichler, R. Risch, M.A. Rojo Guerra, C. Roth, A. Szécsényi-Nagy, J. Wahl, M. Meyer, J. Krause, D. Brown, D. Anthony, A. Cooper, K.W. Alt, D. Reich, Massive migration from the steppe was a source for Indo-European languages in Europe, Nature 522 (2015) 207–211, https://doi.org/10.1038/nature14317. [3] C. Gamba, E.R. Jones, M.D. Teasdale, R.L. McLaughlin, G. Gonzalez-Fortes, V. Mattiangeli, L. Domboróczki, I. Kővári, I. Pap, A. Anders, A. Whittle, J. Dani, P. Raczky, T.F.G. Higham, M. Hofreiter, D.G. Bradley, R.
Pinhasi, Genome flux and stasis in a five millennium transect of European prehistory, Nat. Commun. 5 (2014) 5257, https://doi.org/10.1038/ncomms6257. [4] B. Yunusbayev, M. Metspalu, E. Metspalu, A. Valeev, S. Litvinov, R. Valiev, V. Akhmetova, E. Balanovska, O. Balanovsky, S. Turdikulova, D. Dalimova, P. Nymadawa, A. Bahmanimehr, H. Sahakyan, K. Tambets, S. Fedorova, N. Barashkov, I. Khidiyatova, E. Mihailov, R. Khusainova, L. Damba, M. Derenko, B. Malyarchuk, L. Osipova, M. Voevoda, L. Yepiskoposyan, T. Kivisild, E. Khusnutdinova, R. Villems, The genetic legacy of the expansion of Turkic-speaking nomads across Eurasia, PLoS Genet. 11 (2015) e1005068, https://doi.org/10.1371/journal.pgen.1005068. [5] P. Skoglund, H. Malmström, M. Raghavan, J. Storå, P. Hall, E. Willerslev, M.T.P. Gilbert, A. Götherström, M. Jakobsson, Origins and genetic legacy of Neolithic farmers and hunter-gatherers in Europe, Science 336 (2012) 466–469, https://doi.org/10.1126/science.1216304.

R. Villems, Origin and spread of human mitochondrial DNA haplogroup U7, Sci. Rep. 7 (2017) 1–9, https://doi.org/10.1038/srep46044. [23] V.P. Puzyrev, V.A. Stepanov, M.V. Golubenko, K.V. Puzyrev, N.R. Maximova, V.N. Kharkov, M.G. Spiridonova, A.N. Nogovitsina, MtDNA and Y-chromosome lineages in the Yakut population, Russ. J. Genet. 39 (2003) 816–822, https://doi.org/10.1023/A:1024761305958. [24] V. Pankratov, S. Litvinov, A. Kassian, D. Shulhin, L. Tchebotarev, B. Yunusbayev, M. Möls, H. Sahakyan, L. Yepiskoposyan, East Eurasian ancestry in the middle of Europe: genetic footprints of steppe nomads in the genomes of Belarusian Lipka Tatars, Nat. Publ. Gr. (2016) 1–11, https://doi.org/10.1038/srep30197. [25] P. Triska, N. Chekanov, V. Stepanov, E.K.
Khusnutdinova, G.P.A. Kumar, V. Akhmetova, K. Babalyan, E. Boulygina, V. Kharkov, M. Gubina, I. Khidiyatova, I. Khitrinskaya, E.E. Khrameeva, R. Khusainova, N. Konovalova, S. Litvinov, A. Marusin, A.M. Mazur, V. Puzyrev, D. Ivanoshchuk, M. Spiridonova, A. Teslyuk, S. Tsygankova, M. Triska, N. Trofimova, E. Vajda, O. Balanovsky, A. Baranova, K. Skryabin, T.V. Tatarinova, E. Prokhortchouk, Between Lake Baikal and the Baltic Sea: genomic history of the gateway to Europe, BMC Genet. 18 (2017) 110, https://doi.org/10.1186/s12863-017-0578-3. [26] K. Tambets, B. Yunusbayev, G. Hudjashov, A.-M. Ilumäe, S. Rootsi, T. Honkola, O. Vesakoski, Q. Atkinson, P. Skoglund, A. Kushniarevich, S. Litvinov, M. Reidla, E. Metspalu, L. Saag, T. Rantanen, M. Karmin, J. Parik, S.I. Zhadanov, M. Gubina, L.D. Damba, M. Bermisheva, T. Reisberg, K. Dibirova, I. Evseeva, M. Nelis, J. Klovins, A. Metspalu, T. Esko, O. Balanovsky, E. Balanovska, E.K. Khusnutdinova, L.P. Osipova, M. Voevoda, R. Villems, T. Kivisild, M. Metspalu, Genes reveal traces of common recent demographic history for most of the Uralic-speaking populations, Genome Biol. 19 (2018) 139, https://doi.org/10.1186/s13059-018-1522-1. [27] S.A. Fedorova, M. Reidla, E. Metspalu, M. Metspalu, S. Rootsi, K. Tambets, N. Trofimova, S.I. Zhadanov, B. Kashani, A. Olivieri, M.I. Voevoda, L.P. Osipova, F.A. Platonov, M.I. Tomsky, E.K. Khusnutdinova, A. Torroni, R. Villems, Autosomal and uniparental portraits of the native populations of Sakha (Yakutia): implications for the peopling of Northeast Eurasia, BMC Evol. Biol. 13 (2013) 127, https://doi.org/10.1186/1471-2148-13-127. [28] V.N. Pimenoff, D. Comas, J.U. Palo, G. Vershubsky, A. Kozlov, A. Sajantila, Northwest Siberian Khanty and Mansi in the junction of west and east Eurasian gene pools as revealed by uniparental markers, Eur. J. Hum. Genet. 16 (2008) 1254–1264, https://doi.org/10.1038/ejhg.2008.101.
[29] C. Der Sarkissian, O. Balanovsky, G. Brandt, V. Khartanovich, A. Buzhilova, S. Koshel, V. Zaporozhchenko, D. Gronenborn, V. Moiseyev, E. Kolpakov, V. Shumkin, K.W. Alt, E. Balanovska, A. Cooper, W. Haak, G. Consortium, Ancient DNA reveals prehistoric gene-flow from Siberia in the complex human population history of North East Europe, PLoS Genet. 9 (2013) e1003296, https://doi.org/10.1371/journal.pgen.1003296. [30] A. Kushniarevich, O. Utevska, M. Chuhryaeva, A. Agdzhoyan, K. Dibirova, I. Uktveryte, M. Möls, L. Mulahasanovic, A. Pshenichnov, S. Frolova, A. Shanko, E. Metspalu, M. Reidla, K. Tambets, E. Tamm, S. Koshel, V. Zaporozhchenko, L. Atramentova, V. Kučinskas, O. Davydenko, O. Goncharova, I. Evseeva, M. Churnosov, E. Pocheshchova, B. Yunusbayev, E. Khusnutdinova, D. Marjanović, P. Rudan, S. Rootsi, N. Yankovsky, P. Endicott, A. Kassian, A. Dybo, C. Tyler-Smith, E. Balanovska, M. Metspalu, T. Kivisild, R. Villems, O. Balanovsky, O. Balanovsky, et al., PLoS One 10 (2015) e0135820, https://doi.org/10.1371/journal.pone.0135820. [31] A.V. Khrunin, D.V. Khokhrin, I.N. Filippova, T. Esko, M. Nelis, N.A. Bebyakova, N.L. Bolotova, J. Klovins, L. Nikitina-Zake, K. Rehnström, S. Ripatti, S. Schreiber, A. Franke, M. Macek, V. Krulišová, J. Lubinski, A. Metspalu, S.A. Limborska, A genome-wide analysis of populations from European Russia reveals a new pole of genetic diversity in northern Europe, PLoS One 8 (2013) e58552, https://doi.org/10.1371/journal.pone.0058552. [32] L. Roewer, S. Willuweit, C. Krüger, M. Nagy, S. Rychkov, I. Morozowa, O. Naumova, Y. Schneider, O. Zhukova, M. Stoneking, I. Nasidze, Analysis of Y chromosome STR haplotypes in the European part of Russia reveals high diversities but non-significant genetic distances between populations, Int. J. Legal Med. 122 (2008) 219–223, https://doi.org/10.1007/s00414-007-0222-2. [33] A. Fechner, D. Quinque, S. Rychkov, I.
Morozowa, O. Naumova, Y. Schneider, S. Willuweit, O. Zhukova, L. Roewer, M. Stoneking, I. Nasidze, Boundaries and clines in the west Eurasian Y-chromosome landscape: insights from the European part of Russia, Am. J. Phys. Anthropol. 137 (2008) 41–47, https://doi.org/10.1002/ajpa.20838. [34] O. Balanovsky, S. Rootsi, A. Pshenichnov, T. Kivisild, M. Churnosov, I. Evseeva, E. Pocheshkhova, M. Boldyreva, N. Yankovsky, E. Balanovska, R. Villems, Two sources of the Russian patrilineal heritage in their Eurasian context, Am. J. Hum. Genet. 82 (2008) 236–250, https://doi.org/10.1016/j.ajhg.2007.09.019. [35] B. Malyarchuk, T. Grzybowski, M. Derenko, M. Perkova, T. Vanecek, J. Lazur, P. Gomolcak, I. Tsybovsky, Mitochondrial DNA phylogeny in eastern and Western Slavs, Mol. Biol. Evol. 25 (2008) 1651–1658, https://doi.org/10.1093/molbev/msn114. [36] S. Mallick, H. Li, M. Lipson, I. Mathieson, M. Gymrek, F. Racimo, M. Zhao, N. Chennagiri, S. Nordenfelt, A. Tandon, P. Skoglund, I. Lazaridis, S. Sankararaman, Q. Fu, N. Rohland, G. Renaud, Y. Erlich, T. Willems, C. Gallo, J.P. Spence, Y.S. Song, G. Poletti, F. Balloux, G. van Driem, P. de Knijff, I.G. Romero, A.R. Jha, D.M. Behar, C.M. Bravi, C. Capelli, T. Hervig, A. Moreno-Estrada, O.L. Posukh, E. Balanovska, O. Balanovsky, S. Karachanak-Yankova, H. Sahakyan, D. Toncheva, L. Yepiskoposyan, C. Tyler-Smith, Y. Xue, M.S. Abdullah, A. Ruiz-Linares, C.M. Beall, A. Di Rienzo, C. Jeong, E.B. Starikovskaya, E. Metspalu, J. Parik, R. Villems, B.M. Henn, U. Hodoglugil, R. Mahley, A. Sajantila, G. Stamatoyannopoulos, J.T.S. Wee, R. Khusainova, E. Khusnutdinova, S. Litvinov, G. Ayodo, D. Comas, M.F. Hammer, T. Kivisild, W. Klitz, C.A. Winkler, D. Labuda, M. Bamshad, L.B. Jorde, S.A. Tishkoff, W.S. Watkins, M. Metspalu, S. Dryomov, R. Sukernik, L. Singh, K. Thangaraj, S. Pääbo, J. Kelso, N. Patterson, D.
Reich, The Simons Genome Diversity Project: 300 genomes from 142 diverse populations, Nature 538 (2016) 201–206, https://doi.org/10.1038/nature18964. [37] L. Pagani, D.J. Lawson, E. Jagoda, A. Mörseburg, A. Eriksson, M. Mitt, F. Clemente, G. Hudjashov, M. DeGiorgio, L. Saag, J.D. Wall, A. Cardona, R. Mägi, M.A.W. Sayres, S. Kaewert, C. Inchley, C.L. Scheib, M. Järve, M. Karmin, G.S. Jacobs, T. Antao, F.M. Iliescu, A. Kushniarevich, Q. Ayub, C. Tyler-Smith, Y. Xue, B. Yunusbayev, K. Tambets, C.B. Mallick, L. Saag, E. Pocheshkhova, G. Andriadze, C. Muller, M.C. Westaway, D.M. Lambert, G. Zoraqi, S. Turdikulova, D. Dalimova, Z. Sabitov, G.N.N. Sultana, J. Lachance, S. Tishkoff, K. Momynaliev, J. Isakova, L.D. Damba, M. Gubina, P. Nymadawa, I. Evseeva, L. Atramentova, O. Utevska, F.-X. Ricaut, N. Brucato, H. Sudoyo, T. Letellier, M.P. Cox, N.A. Barashkov, V. Škaro, L. Mulahasanovic, D. Primorac, H. Sahakyan, M. Mormina, C.A. Eichstaedt, D.V. Lichman, S. Abdullah, G. Chaubey, J.T.S. Wee, E. Mihailov, A. Karunas, S. Litvinov, R. Khusainova, N. Ekomasova, V. Akhmetova, I. Khidiyatova, D. Marjanović, L. Yepiskoposyan, D.M. Behar, E. Balanovska, A. Metspalu, M. Derenko, B. Malyarchuk, M. Voevoda, S.A. Fedorova, L.P. Osipova, M.M. Lahr, P. Gerbault, M. Leavesley, A.B. Migliano, M. Petraglia, O. Balanovsky, E.K. Khusnutdinova, E. Metspalu, M.G. Thomas, A. Manica, R. Nielsen, R. Villems, E. Willerslev, T. Kivisild, M. Metspalu, Genomic analyses inform on migration events during the peopling of Eurasia, Nature 538 (2016) 238–242, https://doi.org/10.1038/nature19792. [38] E.H.M. Wong, A. Khrunin, L. Nichols, D. Pushkarev, D. Khokhrin, D. Verbenko, O. Evgrafov, J. Knowles, J. Novembre, S. Limborska, A. Valouev, Reconstructing genetic history of Siberian and Northeastern European populations, Genome Res. 27 (2017) 1–14, https://doi.org/10.1101/gr.202945.115. [39] D. Kumar, D.
Kumar, Genomics and Health in the Developing World, Oxford University Press, 2012, https://www.oupjapan.co.jp/en/node/1808, Accessed date: 1 February 2018. [40] N. Maksimova, K. Hara, I. Nikolaeva, T. Chun-Feng, T. Usui, M. Takagi, Y. Nishihira, A. Miyashita, H. Fujiwara, T. Oyama, A. Nogovicina, A. Sukhomyasova, S. Potapova, R. Kuwano, H. Takahashi, M. Nishizawa, O. Onodera, Neuroblastoma amplified sequence gene is associated with a novel short stature syndrome characterised by optic nerve atrophy and Pelger-Huët anomaly, J. Med. Genet. 47 (2010) 538–548, https://doi.org/10.1136/jmg.2009.074815. [41] H. Kondo, N. Maksimova, T. Otomo, H. Kato, A. Imai, Y. Asano, K. Kobayashi, S. Nojima, A. Nakaya, Y. Hamada, K. Irahara, E. Gurinova, A. Sukhomyasova, A. Nogovicina, M. Savvina, T. Yoshimori, K. Ozono, N. Sakai, Mutation in VPS33A affects metabolism of glycosaminoglycans: a new type of mucopolysaccharidosis with severe systemic symptoms, Hum. Mol. Genet. (2016) ddw377, https://doi.org/10.1093/hmg/ddw377. [42] V.P. Puzyrev, N.R. Maksimova, Hereditary diseases among Yakuts, Genetika 44 (2008) 1317–1324, http://www.ncbi.nlm.nih.gov/pubmed/19062529, Accessed date: 5 March 2018. [43] T.K. Oleksyk, V. Brukhin, S.J. O'Brien, The genome Russia project: closing the largest remaining omission on the world genome map, Gigascience 4 (2015) 53, https://doi.org/10.1186/s13742-015-0095-0. [44] T.K. Oleksyk, V. Brukhin, S.J. O'Brien, Putting Russia on the genome map, Science 350 (2015) 747, https://doi.org/10.1126/science.350.6262.747-a. [45] S. Leslie, B. Winney, G. Hellenthal, D. Davison, A. Boumertit, T. Day, K. Hutnik, E.C. Royrvik, B. Cunliffe, D.J. Lawson, D. Falush, C. Freeman, M. Pirinen, S. Myers, M. Robinson, P. Donnelly, W. Bodmer, P. Donnelly, W.
Bodmer, The fine-scale genetic structure of the British population, Nature 519 (2015) 309–314, https:// doi.org/10.1038/nature14230. D.H. Alexander, J. Novembre, K. Lange, Fast model-based estimation of ancestry in unrelated individuals, Genome Res. 19 (2009) 1655–1664, https://doi.org/10. 1101/gr.094052.109. A. Auton, G.R. Abecasis, D.M. Altshuler, R.M. Durbin, D.R. Bentley, A. Chakravarti, A.G. Clark, P. Donnelly, E.E. Eichler, P. Flicek, S.B. Gabriel, R.A. Gibbs, E.D. Green, M.E. Hurles, B.M. Knoppers, J.O. Korbel, E.S. Lander, C. Lee, H. Lehrach, E.R. Mardis, G.T. Marth, G.A. McVean, D.A. Nickerson, J.P. Schmidt, S.T. Sherry, J. Wang, R.K. Wilson, E. Boerwinkle, H. Doddapaneni, Y. Han, V. Korchina, C. Kovar, S. Lee, D. Muzny, J.G. Reid, Y. Zhu, Y. Chang, Q. Feng, X. Fang, X. Guo, M. Jian, H. Jiang, X. Jin, T. Lan, G. Li, J. Li, Y. Li, S. Liu, X. Liu, Y. Lu, X. Ma, M. Tang, B. Wang, G. Wang, H. Wu, R. Wu, X. Xu, Y. Yin, D. Zhang, W. Zhang, J. Zhao, M. Zhao, X. Zheng, N. Gupta, N. Gharani, L.H. Toji, N.P. Gerry, A.M. Resch, J. Barker, L. Clarke, L. Gil, S.E. Hunt, G. Kelman, E. Kulesha, R. Leinonen, W.M. McLaren, R. Radhakrishnan, A. Roa, D. Smirnov, R.E. Smith, I. Streeter, A. Thormann, I. Toneva, B. Vaughan, X. Zheng-Bradley, R. Grocock, S. Humphray, T. James, Z. Kingsbury, R. Sudbrak, M.W. Albrecht, V.S. Amstislavskiy, T.A. Borodina, M. Lienhard, F. Mertes, M. Sultan, B. Timmermann, M.-L. Yaspo, L. Fulton, R. Fulton, V. Ananiev, Z. Belaia, D. Beloslyudtsev, N. Bouk, C. Chen, D. Church, R. Cohen, C. Cook, J. Garner, T. Hefferon, M. Kimelman, C. Liu, J. Lopez, P. Meric, C. O'Sullivan, Y. Ostapchuk, L. Phan, S. Ponomarov, V. Schneider, E. Shekhtman, K. Sirotkin, D. Slotta, H. Zhang, S. Balasubramaniam, J. Burton, P. Danecek, T.M. Keane, A. KolbKokocinski, S. McCarthy, J. Stalker, M. Quail, C.J. Davies, J. Gollub, T. Webster, B. Wong, Y. Zhan, C.L. Campbell, Y. Kong, A. Marcketta, F. Yu, L. Antunes, M. Bainbridge, A. Sabo, Z. Huang, L.J.M. 
Coin, L. Fang, Q. Li, Z. Li, H. Lin, B. Liu, R. Luo, H. Shao, Y. Xie, C. Ye, C. Yu, F. Zhang, H. Zheng, H. Zhu, C. Alkan, E. Dal, F. Kahveci, E.P. Garrison, D. Kural, W.-P. Lee, W. Fung Leong, M. Stromberg, A.N. Ward, J. Wu, M. Zhang, M.J. Daly, M.A. DePristo, R.E. Handsaker, E. Banks, Genomics 112 (2020) 442–458 D.V. Zhernakova, et al. [48] [49] [50] [51] [52] [53] [54] G. Bhatia, G. del Angel, G. Genovese, H. Li, S. Kashin, S.A. McCarroll, J.C. Nemesh, R.E. Poplin, S.C. Yoon, J. Lihm, V. Makarov, S. Gottipati, A. Keinan, J.L. Rodriguez-Flores, T. Rausch, M.H. Fritz, A.M. Stütz, K. Beal, A. Datta, J. Herrero, G.R.S. Ritchie, D. Zerbino, P.C. Sabeti, I. Shlyakhter, S.F. Schaffner, J. Vitti, D.N. Cooper, E.V. Ball, P.D. Stenson, B. Barnes, M. Bauer, R. Keira Cheetham, A. Cox, M. Eberle, S. Kahn, L. Murray, J. Peden, R. Shaw, E.E. Kenny, M.A. Batzer, M.K. Konkel, J.A. Walker, D.G. MacArthur, M. Lek, R. Herwig, L. Ding, D.C. Koboldt, D. Larson, K. Ye, S. Gravel, A. Swaroop, E. Chew, T. Lappalainen, Y. Erlich, M. Gymrek, T. Frederick Willems, J.T. Simpson, M.D. Shriver, J.A. Rosenfeld, C.D. Bustamante, S.B. Montgomery, F.M. De La Vega, J.K. Byrnes, A.W. Carroll, M.K. DeGorter, P. Lacroute, B.K. Maples, A.R. Martin, A. Moreno-Estrada, S.S. Shringarpure, F. Zakharia, E. Halperin, Y. Baran, E. Cerveira, J. Hwang, A. Malhotra, D. Plewczynski, K. Radew, M. Romanovitch, C. Zhang, F.C.L. Hyland, D.W. Craig, A. Christoforides, N. Homer, T. Izatt, A.A. Kurdoglu, S.A. Sinari, K. Squire, C. Xiao, J. Sebat, D. Antaki, M. Gujral, A. Noor, K. Ye, E.G. Burchard, R.D. Hernandez, C.R. Gignoux, D. Haussler, S.J. Katzman, W. James Kent, B. Howie, A. Ruiz-Linares, E.T. Dermitzakis, S.E. Devine, H. Min Kang, J.M. Kidd, T. Blackwell, S. Caron, W. Chen, S. Emery, L. Fritsche, C. Fuchsberger, G. Jun, B. Li, R. Lyons, C. Scheller, C. Sidore, S. Song, E. Sliwerska, D. Taliun, A. Tan, R. Welch, M. Kate Wing, X. Zhan, P. Awadalla, A. Hodgkinson, Y. Li, X. Shi, A. Quitadamo, G. Lunter, J.L. 
Marchini, S. Myers, C. Churchhouse, O. Delaneau, A. Gupta-Hinch, W. Kretzschmar, Z. Iqbal, I. Mathieson, A. Menelaou, A. Rimmer, D.K. Xifara, T.K. Oleksyk, Y. Fu, X. Liu, M. Xiong, L. Jorde, D. Witherspoon, J. Xing, B.L. Browning, S.R. Browning, F. Hormozdiari, P.H. Sudmant, E. Khurana, C. Tyler-Smith, C.A. Albers, Q. Ayub, Y. Chen, V. Colonna, L. Jostins, K. Walter, Y. Xue, M.B. Gerstein, A. Abyzov, S. Balasubramanian, J. Chen, D. Clarke, Y. Fu, A.O. Harmanci, M. Jin, D. Lee, J. Liu, X. Jasmine Mu, J. Zhang, Y. Zhang, C. Hartl, K. Shakir, J. Degenhardt, S. Meiers, B. Raeder, F. Paolo Casale, O. Stegle, E.-W. Lameijer, I. Hall, V. Bafna, J. Michaelson, E.J. Gardner, R.E. Mills, G. Dayama, K. Chen, X. Fan, Z. Chong, T. Chen, M.J. Chaisson, J. Huddleston, M. Malig, B.J. Nelson, N.F. Parrish, B. Blackburne, S.J. Lindsay, Z. Ning, Y. Zhang, H. Lam, C. Sisu, D. Challis, U.S. Evani, J. Lu, U. Nagaswamy, J. Yu, W. Li, L. Habegger, H. Yu, F. Cunningham, I. Dunham, K. Lage, J. Berg Jespersen, H. Horn, D. Kim, R. Desalle, A. Narechania, M.A. Wilson Sayres, F.L. Mendez, G. David Poznik, P.A. Underhill, L. Coin, D. Mittelman, R. Banerjee, M. Cerezo, T.W. Fitzgerald, S. Louzada, A. Massaia, G.R. Ritchie, F. Yang, D. Kalra, W. Hale, X. Dan, K.C. Barnes, C. Beiswanger, H. Cai, H. Cao, B. Henn, D. Jones, J.S. Kaye, A. Kent, A. Kerasidou, R. Mathias, P.N. Ossorio, M. Parker, C.N. Rotimi, C.D. Royal, K. Sandoval, Y. Su, Z. Tian, S. Tishkoff, M. Via, Y. Wang, H. Yang, L. Yang, J. Zhu, W. Bodmer, G. Bedoya, Z. Cai, Y. Gao, J. Chu, L. Peltonen, A. Garcia-Montero, A. Orfao, J. Dutil, J.C. Martinez-Cruzado, R.A. Mathias, A. Hennis, H. Watson, C. McKenzie, F. Qadri, R. LaRocque, X. Deng, D. Asogun, O. Folarin, C. Happi, O. Omoniwa, M. Stremlau, R. Tariyal, M. Jallow, F. Sisay Joof, T. Corrah, K. Rockett, D. Kwiatkowski, J. Kooner, T. Tịnh Hiê'n, S.J. Dunstan, N. Thuy Hang, R. Fonnie, R. Garry, L. Kanneh, L. Moses, J. Schieffelin, D.S. Grant, C. Gallo, G. Poletti, D. Saleheen, A. 
Rasheed, L.D. Brooks, A.L. Felsenfeld, J.E. McEwen, Y. Vaydylevich, A. Duncanson, M. Dunn, J.A. Schloss, A global reference for human genetic variation, Nature 526 (2015) 68–74, https://doi.org/10.1038/nature15393. D. Reich, K. Thangaraj, N. Patterson, A.L. Price, L. Singh, Reconstructing Indian population history, Nature 461 (2009) 489–494, https://doi.org/10.1038/ nature08365. P.D. Stenson, E.V. Ball, M. Mort, A.D. Phillips, J.A. Shiel, N.S.T. Thomas, S. Abeysinghe, M. Krawczak, D.N. Cooper, Human gene mutation database (HGMD ®): 2003 update, Hum. Mutat. 21 (2003) 577–581, https://doi.org/10. 1002/humu.10212. R. Allikmets, N.F. Shroyer, N. Singh, J.M. Seddon, R.A. Lewis, P.S. Bernstein, A. Peiffer, N.A. Zabriskie, Y. Li, A. Hutchinson, M. Dean, J.R. Lupski, M. Leppert, Mutation of the Stargardt disease gene (ABCR) in age-related macular degeneration, Science 277 (1997) 1805–1807 http://www.ncbi.nlm.nih.gov/pubmed/ 9295268 , Accessed date: 10 January 2018. M. Lek, K.J. Karczewski, E.V. Minikel, K.E. Samocha, E. Banks, T. Fennell, A.H. O'Donnell-Luria, J.S. Ware, A.J. Hill, B.B. Cummings, T. Tukiainen, D.P. Birnbaum, J.A. Kosmicki, L.E. Duncan, K. Estrada, F. Zhao, J. Zou, E. PierceHoffman, J. Berghout, D.N. Cooper, N. Deflaux, M. DePristo, R. Do, J. Flannick, M. Fromer, L. Gauthier, J. Goldstein, N. Gupta, D. Howrigan, A. Kiezun, M.I. Kurki, A.L. Moonshine, P. Natarajan, L. Orozco, G.M. Peloso, R. Poplin, M.A. Rivas, V. Ruano-Rubio, S.A. Rose, D.M. Ruderfer, K. Shakir, P.D. Stenson, C. Stevens, B.P. Thomas, G. Tiao, M.T. Tusie-Luna, B. Weisburd, H.-H. Won, D. Yu, D.M. Altshuler, D. Ardissino, M. Boehnke, J. Danesh, S. Donnelly, R. Elosua, J.C. Florez, S.B. Gabriel, G. Getz, S.J. Glatt, C.M. Hultman, S. Kathiresan, M. Laakso, S. McCarroll, M.I. McCarthy, D. McGovern, R. McPherson, B.M. Neale, A. Palotie, S.M. Purcell, D. Saleheen, J.M. Scharf, P. Sklar, P.F. 
Sullivan, J. Tuomilehto, M.T. Tsuang, H.C. Watkins, J.G. Wilson, M.J. Daly, D.G. MacArthur, E.A. Consortium, Analysis of protein-coding genetic variation in 60,706 humans, Nature 536 (2016) 285–291, https://doi.org/10.1038/ nature19057. A.M. Alazami, F. Alzahrani, S. Bohlega, F.S. Alkuraya, SET binding factor 1 (SBF1) mutation causes Charcot-Marie-tooth disease type 4B3, Neurology 82 (2014) 1665–1666, https://doi.org/10.1212/WNL.0000000000000331. W. Mclaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek, F. Cunningham, The ensembl variant effect predictor, Genome Biol. (2016) 1–14, https://doi.org/10.1186/s13059-016-0974-4. U.A. Meyer, U.M. Zanger, M. Schwab, Omics and drug response, Annu. Rev. Pharmacol. Toxicol. 53 (2013) 475–502, https://doi.org/10.1146/annurevpharmtox-010510-100502. [55] S. Fan, M.E.B. Hansen, Y. Lo, S.A. Tishkoff, Going global by adapting local: a review of recent human adaptation, Science 354 (2016) 54–59, https://doi.org/ 10.1126/science.aaf5098. [56] Matthew R. Robinson, Aaron Kleinman, Mariaelisa Graff, Anna A.E. Vinkhuyzen, David Couper, Michael B. Miller, Wouter J. Peyrot, Abdel Abdellaoui, Brendan P. Zietsch, Ilja M. Nolte, Jana V. van Vliet-Ostaptchouk, Harold Snieder, The LifeLines Cohort Study, Genetic Investigation of Anthropometric Traits (GIANT) consortium, Sarah E. Medland, Nicholas G. Martin, Patrik K.E. Magnusson, William G. Iacono, Matt McGue, Kari E. North, Jian Yang, Peter M. Visscher, Genetic evidence of assortative mating in humans, Nat. Hum. Behav. 1 (2017), https://doi.org/10.1038/s41562-016-0016 0016. [57] I. Mathieson, I. Lazaridis, N. Rohland, S. Mallick, N. Patterson, S.A. Roodenberg, E. Harney, K. Stewardson, D. Fernandes, M. Novak, K. Sirak, C. Gamba, E.R. Jones, B. Llamas, S. Dryomov, J. Pickrell, J.L. Arsuaga, J.M.B. de Castro, E. 
Carbonell, F. Gerritsen, A. Khokhlov, P. Kuznetsov, M. Lozano, H. Meller, O. Mochalov, V. Moiseyev, M.A.R. Guerra, J. Roodenberg, J.M. Vergès, J. Krause, A. Cooper, K.W. Alt, D. Brown, D. Anthony, C. Lalueza-Fox, W. Haak, R. Pinhasi, D. Reich, Genome-wide patterns of selection in 230 ancient Eurasians, Nature 528 (2015) 499–503, https://doi.org/10.1038/nature16152. [58] J.T. Troelsen, Adult-type hypolactasia and regulation of lactase expression, Biochim. Biophys. Acta Gen. Subj. 1723 (2005) 19–32, https://doi.org/10.1016/j. bbagen.2005.02.003. [59] N.S. Enattah, T. Sahi, E. Savilahti, J.D. Terwilliger, L. Peltonen, I. Järvelä, Identification of a variant associated with adult-type hypolactasia, Nat. Genet. 30 (2002) 233–237, https://doi.org/10.1038/ng826. [60] P.S. Bellwood, First Farmers : The Origins of Agricultural Societies, Blackwell Pub, 2005, https://www.wiley.com/en-us/First+Farmers%3A+The+Origins+of +Agricultural+Societies-p-9780631205661 , Accessed date: 12 July 2018. [61] N.A. Limdi, M. Wadelius, L. Cavallari, N. Eriksson, D.C. Crawford, M.-T.M. Lee, C.H. Chen, A. Motsinger-Reif, H. Sagreiya, N. Liu, A.H.B. Wu, B.F. Gage, A. Jorgensen, M. Pirmohamed, J.-G. Shin, G. Suarez-Kurtz, S.E. Kimmel, J.A. Johnson, T.E. Klein, M.J. Wagner, International Warfarin Pharmacogenetics Consortium, Warfarin pharmacogenetics: a single VKORC1 polymorphism is predictive of dose across 3 racial groups, Blood 115 (2010) 3827–3834, https://doi. org/10.1182/blood-2009-12-255992. [62] H. Takahashi, G.R. Wilkinson, E.A. Nutescu, T. Morita, M.D. Ritchie, M.G. Scordo, V. Pengo, M. Barban, R. Padrini, I. Ieiri, K. Otsubo, T. Kashima, S. Kimura, S. Kijima, H. 
Echizen, Different contributions of polymorphisms in VKORC1 and CYP2C9 to intra- and inter-population differences in maintenance dose of warfarin in Japanese, Caucasians and African-Americans, Pharmacogenet. Genomics 16 (2006) 101–110, https://doi.org/10.1097/01.fpc.0000184955.08453.a8. [63] R.A. Sturm, D.L. Duffy, Human pigmentation genes under environmental selection, Genome Biol. 13 (2012) 248, https://doi.org/10.1186/gb-2012-13-9-248. [64] B.L. Lam, S.L. Züchner, J. Dallman, R. Wen, E.C. Alfonso, J.M. Vance, M.A. Peričak-Vance, Mutation K42E in dehydrodolichol diphosphate synthase (DHDDS) causes recessive retinitis pigmentosa, Adv. Exp. Med. Biol. (2014) 165–170, https://doi.org/10.1007/978-1-4614-3209-8_21. [65] S. Züchner, J. Dallman, R. Wen, G. Beecham, A. Naj, A. Farooq, M.A. Kohli, P.L. Whitehead, W. Hulme, I. Konidari, Y.J.K. Edwards, G. Cai, I. Peter, D. Seo, J.D. Buxbaum, J.L. Haines, S. Blanton, J. Young, E. Alfonso, J.M. Vance, B.L. Lam, M.A. Peričak-Vance, Whole-exome sequencing links a variant in DHDDS to retinitis pigmentosa, Am. J. Hum. Genet. 88 (2011) 201–206, https://doi.org/10. 1016/j.ajhg.2011.01.001. [66] Y. Li, L. Si, Y. Zhai, Y. Hu, Z. Hu, J.-X. Bei, B. Xie, Q. Ren, P. Cao, F. Yang, Q. Song, Z. Bao, H. Zhang, Y. Han, Z. Wang, X. Chen, X. Xia, H. Yan, R. Wang, Y. Zhang, C. Gao, J. Meng, X. Tu, X. Liang, Y. Cui, Y. Liu, X. Wu, Z. Li, H. Wang, Z. Li, B. Hu, M. He, Z. Gao, X. Xu, H. Ji, C. Yu, Y. Sun, B. Xing, X. Yang, H. Zhang, A. Tan, C. Wu, W. Jia, S. Li, Y.-X. Zeng, H. Shen, F. He, Z. Mo, H. Zhang, G. Zhou, Genomewide association study identifies 8p21.3 associated with persistent hepatitis B virus infection among Chinese, Nat. Commun. 7 (2016) 11664, https://doi.org/ 10.1038/ncomms11664. [67] S.J. Chapman, A.V.S. Hill, Human genetic susceptibility to infectious disease, Nat. Rev. Genet. 
13 (2012) 175–188, https://doi.org/10.1038/nrg3114. [68] D.J. Lawson, G. Hellenthal, S. Myers, D. Falush, Inference of population structure using dense haplotype data, PLoS Genet. 8 (2012) e1002453,, , https://doi.org/10. 1371/journal.pgen.1002453. [69] A.I. Gogolev, Basic stages of the formation of the Yakut people, Anthropol. Archeol. Eurasia. 31 (1992) 63–69, https://doi.org/10.2753/AAE10611959310263. [70] E. Noskova, V. Ulyantsev, K.P. Koepfli, S.J. O'Brien, P. Dobrynin, GADMA: Genetic algorithm for inferring demographic history of multiple populations from allele frequency spectrum data, BioRxiv (2019) 407734, https://doi.org/10.1101/ 407734. [71] H.J. Muller, Our load of mutations, Am. J. Hum. Genet. 2 (1950) 111–176 http:// www.ncbi.nlm.nih.gov/pubmed/14771033 , Accessed date: 10 January 2018. [72] V.N. Khar'kov, V.A. Stepanov, O.F. Medvedev, M.G. Spiridonova, N.R. Maksimova, A.N. Nogovitsyna, V.P. Puzyrev, The origin of Yakuts: analysis of Y-chromosome haplotypes, Mol. Biol. (Mosk) 42 (2008) 226–237 http://www.ncbi.nlm.nih.gov/ pubmed/18610830 , Accessed date: 1 February 2018. [73] S. Andrews, FastQC, (2010). [74] G. Marçais, C. Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics 27 (2011) 764–770, https://doi.org/10. 1093/bioinformatics/btr011. [75] E. Starostina, G. Tamazian, P. Dobrynin, S. O'Brien, A. Komissarov, Cookiecutter: A Tool for Kmer-Based Read Filtering and Extraction, BioRxiv, (2015). [76] B. Langmead, S.L. Salzberg, Fast gapped-read alignment with bowtie 2, Nat. 457 Genomics 112 (2020) 442–458 D.V. Zhernakova, et al. Methods 9 (2012) 357–359, https://doi.org/10.1038/nmeth.1923. [77] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. 
Durbin, 1000 genome project data processing subgroup, the sequence alignment/map format and SAMtools, Bioinformatics 25 (2009) 2078–2079, https://doi.org/10.1093/bioinformatics/btp352. [78] A.R. Quinlan, I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics 26 (2010) 841–842, https://doi.org/10.1093/ bioinformatics/btq033. [79] A. Tarasov, A.J. Vilella, E. Cuppen, I.J. Nijman, P. Prins, Sambamba: fast processing of NGS alignment formats, Bioinformatics 31 (2015) 2032–2034, https:// doi.org/10.1093/bioinformatics/btv098. [80] A. Morgulis, E.M. Gertz, A.A. Schäffer, R. Agarwala, A fast and symmetric DUST implementation to mask low-complexity DNA sequences, J. Comput. Biol. 13 (2006) 1028–1040, https://doi.org/10.1089/cmb.2006.13.1028. [81] A. Smit, R. Hubley, P. Green, RepeatMasker Open-4.0, n.d. http://www. repeatmasker.org. [82] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Maller, P. Sklar, P.I.W. de Bakker, M.J. Daly, P.C. Sham, PLINK: a tool set for whole-genome association and population-based linkage analyses, Am. J. Hum. Genet. 81 (2007) 559–575, https://doi.org/10.1086/519795. [83] G.A. Van der Auwera, M.O. Carneiro, C. Hartl, R. Poplin, G. Del Angel, A. LevyMoonshine, T. Jordan, K. Shakir, D. Roazen, J. Thibault, E. Banks, K.V. Garimella, D. Altshuler, S. Gabriel, M.A. DePristo, From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline, Curr. Protoc. Bioinforma. 43 (2013), https://doi.org/10.1002/0471250953.bi1110s43 11.10. 1–33. [84] G. Benson, Tandem repeats finder: a program to analyze DNA sequences, Nucleic Acids Res. 27 (1999) 573–580 http://www.ncbi.nlm.nih.gov/pubmed/9862982 , Accessed date: 5 March 2018. [85] C. Alkan, J.M. Kidd, T. Marques-Bonet, G. 
Aksay, F. Antonacci, F. Hormozdiari, J.O. Kitzman, C. Baker, M. Malig, O. Mutlu, S.C. Sahinalp, R.A. Gibbs, E.E. Eichler, Personalized copy number and segmental duplication maps using next-generation sequencing, Nat. Genet. 41 (2009) 1061–1067, https://doi.org/10.1038/ng.437. [86] R. Redon, S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. González, M. Gratacòs, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, M.E. Hurles, Global variation in copy number in the human genome, Nature 444 (2006) 444–454, https://doi.org/10.1038/nature05329. [87] A. Rimmer, H. Phan, I. Mathieson, Z. Iqbal, S.R.F. Twigg, W. WGS500 Consortium, A.O.M. Wilkie, G. McVean, G. Lunter, Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications, Nat. Genet. 46 (2014) 912–918, https://doi.org/10.1038/ng.3036. [88] C.C. Chang, C.C. Chow, L.C. Tellier, S. Vattikuti, S.M. Purcell, J.J. Lee, Secondgeneration PLINK: rising to the challenge of larger and richer datasets, Gigascience 4 (2015) 7, https://doi.org/10.1186/s13742-015-0047-8. [89] J. O'Connell, D. Gurdasani, O. Delaneau, N. Pirastu, S. Ulivi, M. Cocca, M. Traglia, J. Huang, J.E. Huffman, I. Rudan, R. McQuillan, R.M. Fraser, H. Campbell, O. Polasek, G. Asiki, K. Ekoru, C. Hayward, A.F. Wright, V. Vitart, P. Navarro, J.- [90] [91] [92] [93] [94] [95] [96] [97] [98] [99] [100] [101] 458 F. Zagury, J.F. Wilson, D. Toniolo, P. Gasparini, N. Soranzo, M.S. Sandhu, J. 
Marchini, A general approach for haplotype phasing across the full Spectrum of relatedness, PLoS Genet. 10 (2014) e1004234, , https://doi.org/10.1371/journal. pgen.1004234. J. Macarthur, E. Bowler, M. Cerezo, L. Gil, P. Hall, E. Hastings, H. Junkins, A. Mcmahon, A. Milano, J. Morales, Z.M. Pendlington, D. Welter, T. Burdett, L. Hindorff, P. Flicek, F. Cunningham, H. Parkinson, The new NHGRI-EBI catalog of published genome-wide association studies (GWAS Catalog), 45 (2017) 896–901, https://doi.org/10.1093/nar/gkw1133. M.J. Landrum, J.M. Lee, M. Benson, G. Brown, C. Chao, S. Chitipiralla, B. Gu, J. Hart, D. Hoffman, J. Hoover, W. Jang, K. Katz, M. Ovetsky, G. Riley, A. Sethi, R. Tully, R. Villamarin-salomon, W. Rubinstein, D.R. Maglott, ClinVar : public archive of interpretations of clinically relevant variants, Nucleic Acids Res. 44 (2016) 862–868, https://doi.org/10.1093/nar/gkv1222. S. Richards, N. Aziz, S. Bale, D. Bick, S. Das, J. Gastier-Foster, W.W. Grody, M. Hegde, E. Lyon, E. Spector, K. Voelkerding, H.L. Rehm, ACMG Laboratory Quality Assurance Committee, Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology, Genet. Med. 17 (2015) 405–424, https://doi.org/10.1038/gim.2015.30. S.S. Kalia, K. Adelman, S.J. Bale, W.K. Chung, C. Eng, J.P. Evans, G.E. Herman, S.B. Hufnagel, T.E. Klein, B.R. Korf, K.D. McKelvey, K.E. Ormond, C.S. Richards, C.N. Vlangos, M. Watson, C.L. Martin, D.T. Miller, Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 Update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics, Genet. Med. 19 (2017) 249–255, https://doi.org/10.1038/gim. 2016.190. X. Zheng, D. Levine, J. Shen, S.M. 
Gogarten, C. Laurie, B.S. Weir, A high-performance computing toolset for relatedness and principal component analysis of SNP data, Bioinformatics 28 (2012) 3326–3328, https://doi.org/10.1093/ bioinformatics/bts606. N. Patterson, P. Moorjani, Y. Luo, S. Mallick, N. Rohland, Y. Zhan, T. Genschoreck, T. Webster, D. Reich, Ancient admixture in human history, Genetics 192 (2012) 1065–1093, https://doi.org/10.1534/genetics.112.145037. M. Nei, F. Tajima, Y. Tateno, Accuracy of estimated phylogenetic trees from molecular data. II. Gene frequency data, J. Mol. Evol. 19 (1983) 153–170 http:// www.ncbi.nlm.nih.gov/pubmed/6571220 , Accessed date: 10 January 2018. A.A. Georgiev, Consistent nonparametric multiple regression: the fixed design case, J. Multivar. Anal. 25 (1988), https://ac.els-cdn.com/0047259X88901558/1s2.0-0047259X88901558-main.pdf?_tid=ecc80f32-f615-11e7-807a00000aacb360&acdnat=1515596148_a0cba068a606f3fb100ae704efcf4027 , Accessed date: 10 January 2018. N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol. 4 (1987) 406–425, https://doi.org/10.1093/ oxfordjournals.molbev.a040454. D.L. Swofford, PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), (2003). Rambaut, FigTree v. 1.4.0, http://Tree.Bio.Ed.Ac.Uk/Software/Figtree/, (2012). J.C. Barrett, B. Fry, J. Maller, M.J. Daly, Haploview: analysis and visualization of LD and haplotype maps, Bioinformatics 21 (2005) 263–265, https://doi.org/10. 1093/bioinformatics/bth457.