Population genomics

Population genomics is the large-scale comparison of DNA sequences among populations. The term is a neologism associated with population genetics. The field studies genome-wide effects to improve our understanding of microevolution, so that the phylogenetic history and demography of a population can be inferred. [1]

History

The questions addressed by population genomics have interested scientists since Darwin. Some of the first methods used to study genetic variability at multiple loci were gel electrophoresis and restriction enzyme mapping. [2] Previously, genomic studies were restricted to a small number of loci. However, recent advances in sequencing and in computer storage and processing power have allowed the study of hundreds of thousands of loci from populations. [3] Analysis of these data requires identifying non-neutral, or outlier, loci that indicate selection in that region of the genome. The researcher can then either remove these loci to study genome-wide effects or focus on them if they are of interest.
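The separation of outlier loci from putatively neutral loci can be illustrated with a minimal sketch. It assumes that a per-locus differentiation score (for example a per-locus FST) has already been computed, and it uses a plain empirical percentile cutoff. The function name, the locus names and the 99th-percentile threshold are illustrative assumptions, not a method taken from the cited studies; dedicated outlier tests model the expected neutral distribution rather than relying on a fixed cutoff.

```python
from statistics import quantiles


def split_outlier_loci(scores, upper_percentile=99):
    """Split loci into outlier and putatively neutral sets.

    scores: dict mapping a locus name to a differentiation score
            (hypothetical input, e.g. a per-locus FST).
    upper_percentile: empirical percentile above which a locus is
            flagged as an outlier (an arbitrary illustrative choice).
    """
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(scores.values(), n=100, method="inclusive")
    threshold = cuts[upper_percentile - 1]
    outliers = {locus for locus, score in scores.items() if score > threshold}
    neutral = set(scores) - outliers
    return outliers, neutral, threshold
```

The two returned sets correspond to the two uses described above: the neutral set can be carried forward for genome-wide analyses, while the outlier set can be examined on its own as candidate targets of selection.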

Research applications

In Schizosaccharomyces pombe (commonly known as fission yeast), a popular model organism, population genomics has been used to understand the causes of phenotypic variation within the species. Genetic variation within this species was previously poorly understood because of technological restrictions, and population genomics has made it possible to characterize the species' genetic differences. [4] In humans, population genomics has been used to study genetic change since humans began to migrate away from Africa approximately 50,000-100,000 years ago. It has been shown that genes related to fertility and reproduction were highly selected for, and also that the further humans moved away from Africa, the greater the presence of lactase. [5]

A 2007 study by Begun et al. compared the whole-genome sequences of multiple lines of Drosophila simulans to the assemblies of D. melanogaster and D. yakuba. This was done by aligning DNA from whole-genome shotgun sequences of D. simulans to a standard reference sequence before carrying out a whole-genome analysis of polymorphism and divergence. The analysis revealed a large number of proteins that had experienced directional selection. The authors discovered previously unknown, large-scale fluctuations in both polymorphism and divergence along chromosome arms. They found that the X chromosome had faster divergence and significantly less polymorphism than previously expected. They also found regions of the genome (e.g. UTRs) that showed signals of adaptive evolution. [6]

In 2014, Jacquot et al. studied the diversification and epidemiology of endemic bacterial pathogens, using the Borrelia burgdorferi species complex (the bacteria responsible for Lyme disease) as a model. They also wished to compare the genetic structure of B. burgdorferi with that of the closely related species B. garinii and B. afzelii. They began by sequencing samples from a culture and then mapping the raw reads onto reference sequences. SNP-based and phylogenetic analyses were used at both the intraspecific and interspecific levels. When looking at the degree of genetic isolation, they found that the intraspecific recombination rate was ~50 times higher than the interspecific rate. They also found that, when most of the genome was used, conspecific strains did not cluster in clades, raising questions about previous strategies used when investigating pathogen epidemiology. [7]

In 2014, Moore et al. conducted a genomic assessment of a group of Atlantic salmon populations that had previously been analyzed with traditional population genetic analyses (microsatellites, SNP-array genotyping, and BayeScan, which uses the Dirichlet-multinomial distribution) to place them into defined conservation units. The genomic assessment mostly agreed with previous results, but it identified more differences between regionally and genetically discrete groups, suggesting that there may be an even greater number of conservation units of salmon in those regions. These results verified the usefulness of genome-wide analysis for improving the accuracy of future designation of conservation units. [8]

In highly migratory marine species, traditional population genetic analyses often fail to identify population structure. In tunas, traditional markers such as short-range PCR products, microsatellites and SNP-arrays have struggled to distinguish fish stocks from separate ocean basins. However, population genomic research using RAD sequencing in yellowfin tuna [9] [10] and albacore [11] [12] has been able to distinguish populations from different ocean basins and reveal fine-scale population structure. These studies identify putatively adaptive loci that reveal strong population structure, even though these sites represent a relatively small proportion of the overall DNA sequence data. In contrast, the majority of sequenced loci, which are presumed to be selectively neutral, do not reveal patterns of population differentiation, matching results for traditional DNA markers. [9] [10] [11] [12] The same pattern, in which putatively adaptive loci from RAD sequencing reveal population structure while traditional DNA markers provide limited insight, is also observed in other marine fishes, including striped marlin [13] and lingcod. [14]

Mathematical models

Understanding and analyzing the vast amount of data produced by population genomics studies requires various mathematical models. One method of analysis is QTL mapping, which has been used to help find the genes responsible for adaptive phenotypes. [15] To quantify genetic differentiation among populations, a value known as the fixation index, or FST, is used. In combination with Tajima's D, FST has been used to show how selection acts upon a population. [16] The McDonald-Kreitman test (or MK test) is also favored when looking for selection because it is less sensitive to changes in a species' demography that would throw off other selection tests. [17]
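The three statistics named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the cited studies: FST is computed in Wright's form for a single biallelic locus and two equally weighted populations; Tajima's D uses the standard 1989 constants on a list of aligned, equal-length haplotype strings; and the MK calculation returns only the point estimate alpha = 1 - (Ds * Pn) / (Dn * Ps), without the contingency-table significance test normally applied to the four counts. All function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def fst_biallelic(p1, p2):
    """Wright's FST at one biallelic locus, two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity when pooled
    if h_total == 0:
        return 0.0  # locus monomorphic overall; reported as 0 by convention here
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def tajimas_d(haplotypes):
    """Tajima's D from aligned haplotype strings of equal length."""
    n = len(haplotypes)
    seg_sites = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if n < 4 or seg_sites == 0:
        return float("nan")  # variance term is zero or D is undefined
    pairs = list(combinations(haplotypes, 2))
    # pi: mean number of pairwise differences
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


def mk_alpha(dn, ds, pn, ps):
    """Point estimate of the proportion of adaptive substitutions.

    dn, ds: fixed nonsynonymous and synonymous differences between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within a species.
    Raises ZeroDivisionError if dn or ps is zero.
    """
    return 1 - (ds * pn) / (dn * ps)
```

In this sketch, an FST near 0 indicates similar allele frequencies in the two populations and a value of 1 indicates fixed differences. A negative Tajima's D reflects an excess of rare variants, and a positive value an excess of intermediate-frequency variants. An alpha above 0 indicates more fixed nonsynonymous differences than the polymorphism data would predict under neutrality.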

Future developments

Most developments within population genomics stem from advances in sequencing technology. For example, restriction-site associated DNA sequencing, or RADSeq, is a relatively new technology that sequences a reduced-complexity subset of the genome and delivers higher resolution at a reasonable cost. [18] High-throughput sequencing technologies are also developing rapidly and allow more information to be gathered on genomic divergence during speciation. [19] High-throughput sequencing is also very useful for SNP detection, which plays a key role in personalized medicine. [20] Another relatively new approach is reduced-representation library (RRL) sequencing, which discovers and genotypes SNPs and does not require a reference genome. [21]
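How restriction-site associated sequencing reduces complexity can be illustrated with a simplified in-silico digest: only the sequence flanking each occurrence of a restriction site is retained, so the number of sites determines how much of the genome is sampled. The sketch below assumes the SbfI recognition sequence (CCTGCAGG) and a hypothetical tag length of 20 bases. It ignores the enzyme's exact cut position within the site, the reverse strand and library preparation details, so it should be read as a toy model rather than a description of any published protocol.

```python
import re


def rad_tags(sequence, site="CCTGCAGG", tag_length=20):
    """Return (position, upstream, downstream) for each restriction site.

    sequence: DNA sequence as a string (forward strand only).
    site: recognition sequence; the SbfI site is assumed by default.
    tag_length: number of flanking bases kept on each side (illustrative).
    """
    sequence = sequence.upper()
    tags = []
    # A lookahead pattern also reports sites that overlap one another.
    for match in re.finditer(f"(?={re.escape(site)})", sequence):
        start = match.start()
        end = start + len(site)
        upstream = sequence[max(0, start - tag_length):start]
        downstream = sequence[end:end + tag_length]
        tags.append((start, upstream, downstream))
    return tags
```

Counting the tags returned for a genome assembly gives a rough estimate of how many loci a given enzyme would sample, which is one consideration when choosing an enzyme for a study.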

Notes

  1. Luikart, G.; England, P. R.; Tallmon, D.; Jordan, S.; Taberlet, P. (2003). "The Power and Promise of Population Genomics: From Genotyping to Genome Typing". Nature Reviews Genetics. 4: 981–994.
  2. Charlesworth, B. (2011). "Molecular population genomics: A short history". Genetics Research. 92 (5–6): 397–411. doi:10.1017/S0016672310000522. PMID 21429271.
  3. Schilling, M. P.; Wolf, P. G.; Duffy, A. M.; Rai, H. S.; Rowe, C. A.; Richardson, B. A.; Mock, K. E. (2014). "Genotyping-by-Sequencing for Populus Population Genomics: An Assessment of Genome Sampling Patterns and Filtering Approaches". PLOS ONE. 9 (4): e95292. Bibcode:2014PLoSO...995292S. doi:10.1371/journal.pone.0095292. PMC 3991623. PMID 24748384.
  4. Fawcett, J. A.; Iida, T.; Takuno, S.; Sugino, R. P.; Kado, T.; Kugou, K.; Mura, S.; Kobayashi, T.; Ohta, K.; Nakayama, J. I.; Innan, H. (2014). "Population Genomics of the Fission Yeast Schizosaccharomyces pombe". PLOS ONE. 9 (8): e104241. Bibcode:2014PLoSO...9j4241F. doi:10.1371/journal.pone.0104241. PMC 4128662. PMID 25111393.
  5. Lachance, J.; Tishkoff, S. A. (2013). "Population Genomics of Human Adaptation". Annual Review of Ecology, Evolution, and Systematics. 44: 123–143. doi:10.1146/annurev-ecolsys-110512-135833. PMC 4221232. PMID 25383060.
  6. Begun, D. J.; Holloway, A. K.; Stevens, K.; Hillier, L. W.; Poh, Y. P.; Hahn, M. W.; Nista, P. M.; Jones, C. D.; Kern, A. D.; Dewey, C. N.; Pachter, L.; Myers, E.; Langley, C. H. (2007). "Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans". PLOS Biology. 5 (11): e310. doi:10.1371/journal.pbio.0050310. PMC 2062478. PMID 17988176.
  7. Jacquot, M.; Gonnet, M.; Ferquel, E.; Abrial, D.; Claude, A.; Gasqui, P.; Choumet, V. R.; Charras-Garrido, M.; Garnier, M.; Faure, B.; Sertour, N.; Dorr, N.; De Goër, J.; Vourc'h, G. L.; Bailly, X. (2014). "Comparative Population Genomics of the Borrelia burgdorferi Species Complex Reveals High Degree of Genetic Isolation among Species and Underscores Benefits and Constraints to Studying Intra-Specific Epidemiological Processes". PLOS ONE. 9 (4): e94384. Bibcode:2014PLoSO...994384J. doi:10.1371/journal.pone.0094384. PMC 3993988. PMID 24721934.
  8. Moore, Jean-Sébastien; Bourret, Vincent; Dionne, Mélanie; Bradbury, Ian; O'Reilly, Patrick; Kent, Matthew; Chaput, Gérald; Bernatchez, Louis (December 2014). "Conservation genomics of anadromous Atlantic salmon across its North American range: outlier loci identify the same patterns of population structure as neutral loci". Molecular Ecology. 23 (23): 5680–5697. doi:10.1111/mec.12972. PMID 25327895. S2CID 12251497.
  9. Grewe, P.M.; Feutry, P.; Hill, P.L.; Gunasekera, R.M.; Schaefer, K.M.; Itano, D.G.; Fuller, D.W.; Foster, S.D.; Davies, C.R. (2015). "Evidence of discrete yellowfin tuna (Thunnus albacares) populations demands rethink of management for this globally important resource". Scientific Reports. 5: 16916. Bibcode:2015NatSR...516916G. doi:10.1038/srep16916. PMC 4655351. PMID 26593698.
  10. Pecoraro, Carlo; Babbucci, Massimiliano; Franch, Rafaella; Rico, Ciro; Papetti, Chiara; Chassot, Emmanuel; Bodin, Nathalie; Cariani, Alessia; Bargelloni, Luca; Tinti, Fausto (2018). "The population genomics of yellowfin tuna (Thunnus albacares) at global geographic scale challenges current stock delineation". Scientific Reports. 8 (1): 13890. Bibcode:2018NatSR...813890P. doi:10.1038/s41598-018-32331-3. PMC 6141456. PMID 30224658.
  11. Anderson, Giulia; Hampton, John; Smith, Neville; Rico, Ciro (2019). "Indications of strong adaptive population genetic structure in albacore tuna (Thunnus alalunga) in the southwest and central Pacific Ocean". Ecology and Evolution. 9 (18): 10354–10364. doi:10.1002/ece3.5554. PMC 6787800. PMID 31624554.
  12. Vaux, Felix; Bohn, Sandra; Hyde, John R.; O'Malley, Kathleen G. (2021). "Adaptive markers distinguish North and South Pacific Albacore amid low population differentiation". Evolutionary Applications. 14 (5): 1343–1364. doi:10.1111/eva.13202. ISSN 1752-4571. PMC 8127716. PMID 34025772.
  13. Mamoozadeh, Nadya R.; Graves, John E.; McDowell, Jan R. (2020). "Genome‐wide SNPs resolve spatiotemporal patterns of connectivity within striped marlin (Kajikia audax), a broadly distributed and highly migratory pelagic species". Evolutionary Applications. 13 (4): 677–698. doi:10.1111/eva.12892. PMC 7086058. PMID 32211060.
  14. Longo, Gary C.; Lam, Laurel; Basnett, Bonnie; Samhouri, Jameal; Hamilton, Scott; Andrews, Kelly; Williams, Greg; Goetz, Giles; McClure, Michelle; Nichols, Krista M. (2020). "Strong population differentiation in lingcod (Ophiodon elongatus) is driven by a small portion of the genome". Evolutionary Applications. 13 (10): 2536–2554. doi:10.1111/eva.13037. PMC 7691466. PMID 33294007.
  15. Stinchcombe, J. R.; Hoekstra, H. E. (2007). "Combining population genomics and quantitative genetics: Finding the genes underlying ecologically important traits". Heredity. 100 (2): 158–170. doi:10.1038/sj.hdy.6800937. PMID 17314923.
  16. Hohenlohe, P. A.; Bassham, S.; Etter, P. D.; Stiffler, N.; Johnson, E. A.; Cresko, W. A. (2010). "Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags". PLOS Genetics. 6 (2): e1000862. doi:10.1371/journal.pgen.1000862. PMC 2829049. PMID 20195501.
  17. Harpur, B. A.; Kent, C. F.; Molodtsova, D.; Lebon, J. M. D.; Alqarni, A. S.; Owayss, A. A.; Zayed, A. (2014). "Population genomics of the honey bee reveals strong signatures of positive selection on worker traits". Proceedings of the National Academy of Sciences. 111 (7): 2614–2619. Bibcode:2014PNAS..111.2614H. doi:10.1073/pnas.1315506111. PMC 3932857. PMID 24488971.
  18. Davey, J. W.; Blaxter, M. L. (2011). "RADSeq: Next-generation population genetics". Briefings in Functional Genomics. 9 (5–6): 416–423. doi:10.1093/bfgp/elq031. PMC 3080771. PMID 21266344.
  19. Ellegren, H. (2014). "Genome sequencing and population genomics in non-model organisms". Trends in Ecology & Evolution. 29 (1): 51–63. doi:10.1016/j.tree.2013.09.008. PMID 24139972.
  20. You, N.; Murillo, G.; Su, X.; Zeng, X.; Xu, J.; Ning, K.; Zhang, S.; Zhu, J.; Cui, X. (2012). "SNP calling using genotype model selection on high-throughput sequencing data". Bioinformatics. 28 (5): 643–650. doi:10.1093/bioinformatics/bts001. PMC 3338331. PMID 22253293.
  21. Greminger, M. P.; Stölting, K. N.; Nater, A.; Goossens, B.; Arora, N.; Bruggmann, R. M.; Patrignani, A.; Nussberger, B.; Sharma, R.; Kraus, R. H. S.; Ambu, L. N.; Singleton, I.; Chikhi, L.; Van Schaik, C. P.; Krützen, M. (2014). "Generation of SNP datasets for orangutan population genomics using improved reduced-representation sequencing and direct comparisons of SNP calling algorithms". BMC Genomics. 15: 16. doi:10.1186/1471-2164-15-16. PMC 3897891. PMID 24405840.
