
Efficient algorithms for genome-wide association study

Published: 04 December 2009

Abstract

Studying the association between quantitative phenotypes (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand the mechanisms underlying complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. The ANOVA (analysis of variance) test is routinely used in association studies. Important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can reach the millions, so evaluating joint effects of SNPs is a challenging task even for SNP-pairs. Moreover, because large numbers of SNPs are correlated, a permutation procedure is preferred over simple Bonferroni correction for properly controlling the family-wise error rate while retaining mapping power, and this dramatically increases the computational cost of an association study.
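As a rough illustration of why the permutation procedure is costly, the sketch below estimates a family-wise critical value by taking the maximum F-statistic over all tests for each shuffled phenotype. It is a minimal sketch and not the authors' implementation: it assumes binary genotypes stored in a NumPy array `X` (samples by SNPs), uses single-SNP tests for brevity, and the names `f_statistic` and `max_stat_threshold` are hypothetical.

```python
import numpy as np

def f_statistic(y, groups):
    """One-way ANOVA F-statistic of phenotype y across the labels in groups."""
    labels = np.unique(groups)
    k, n = labels.size, y.size
    if k < 2 or n <= k:
        return 0.0  # test undefined here; treated as no evidence (an assumption)
    grand_mean = y.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        y_g = y[groups == label]
        ss_between += y_g.size * (y_g.mean() - grand_mean) ** 2
        ss_within += ((y_g - y_g.mean()) ** 2).sum()
    if ss_within == 0.0:
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def max_stat_threshold(X, y, n_perm=200, alpha=0.05, seed=0):
    """Permutation-based critical F value for family-wise error rate alpha."""
    rng = np.random.default_rng(seed)
    max_stats = np.empty(n_perm)
    for p in range(n_perm):
        y_perm = rng.permutation(y)  # shuffling breaks any genotype-phenotype link
        # Every permutation repeats the full scan, hence the high cost.
        max_stats[p] = max(f_statistic(y_perm, X[:, j]) for j in range(X.shape[1]))
    return float(np.quantile(max_stats, 1.0 - alpha))
```

Because the whole scan is repeated once per permutation, the cost grows as the number of tests times the number of permutations, which is the burden the abstract refers to.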
In this article, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in batch mode, which also supports large permutation tests. We derive an upper bound on the SNP-pair ANOVA test value, which can be expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the SNPs and is independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs, without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than the brute-force implementation of ANOVA tests on all SNP-pairs. The principles used in FastANOVA can be applied to categorical phenotypes and to other statistics such as the Chi-square test.
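For orientation, the brute-force baseline mentioned above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: genotypes are assumed binary and stored in a NumPy array `X` (samples by SNPs), and the function names `pair_f_statistic` and `brute_force_scan` are hypothetical.

```python
from itertools import combinations

import numpy as np

def pair_f_statistic(x_i, x_j, y):
    """Two-locus ANOVA F-statistic for binary SNPs x_i, x_j and phenotype y."""
    groups = 2 * x_i + x_j  # genotype combinations 00, 01, 10, 11 -> 0..3
    labels = np.unique(groups)
    k, n = labels.size, y.size
    if k < 2 or n <= k:
        return 0.0  # test undefined here; treated as no evidence (an assumption)
    grand_mean = y.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        y_g = y[groups == label]
        ss_between += y_g.size * (y_g.mean() - grand_mean) ** 2
        ss_within += ((y_g - y_g.mean()) ** 2).sum()
    if ss_within == 0.0:
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def brute_force_scan(X, y):
    """F-statistic of every SNP-pair: m * (m - 1) / 2 tests for m SNPs."""
    return {
        (i, j): pair_f_statistic(X[:, i], X[:, j], y)
        for i, j in combinations(range(X.shape[1]), 2)
    }
```

With m SNPs the scan performs m(m - 1)/2 tests, so a million SNPs would require roughly 5 x 10^11 tests per phenotype permutation, which is why pruning matters.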



Reviews

Amos O Olagunju

The convergence, divergence, parallel, and interaction motions by thousands of genes in humans make it difficult to predict the functionality of gene structures. In order to understand the fundamental techniques of multifaceted phenotypes, researchers must reflect on the cooperative heritable effects of diverse single nucleotide polymorphisms (SNPs): How should an efficient algorithm be devised for exploring significant associations among millions of SNPs? How should false-positive errors be minimized in multiple statistical tests of association between one phenotype and several SNPs? Zhang, Zou, and Wang present a resourceful analysis of variance (ANOVA) method for testing relationships on pairs of SNPs. The resourceful ANOVA method consists of an algorithm for randomly permuting a phenotype (hundreds to thousands of times) to generate the null distribution used to test for lack of association between a genotype and the phenotype, and a repetitive procedure for locating the maximum test value for examining the statistical significance of each reshuffled phenotype. The authors define "two-locus association mapping" as the study of the relationship between a phenotype and SNP-pairs with distinctive positions on a genome. The two-locus ANOVA test with a permutation algorithm initially computes the F-statistic values for the binary SNP-pair variables and a quantitative phenotype. The F-statistic is the mean sum of squared deviations between groups divided by the mean sum of squared deviations within groups. Next, the phenotype is sampled without replacement to obtain a reference distribution and the critical value of the F-statistic. An upper bound (UB) on the sum of squared deviations between groups is derived for the two-locus ANOVA test; it relates one SNP and the phenotype, and the prearranged phenotype values and pairwise SNP genotypes.
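The verbal definition of the F-statistic above corresponds to the standard one-way ANOVA formula, written out here for reference. The symbols are notation introduced for this note, not taken from the paper: n samples, k genotype groups (at most 4 for a pair of binary SNPs), n_g samples in group g, group mean \bar{y}_g, and overall mean \bar{y}.

```latex
F = \frac{SS_B / (k - 1)}{SS_W / (n - k)},
\qquad
SS_B = \sum_{g=1}^{k} n_g \left( \bar{y}_g - \bar{y} \right)^2,
\qquad
SS_W = \sum_{g=1}^{k} \sum_{i \in g} \left( y_i - \bar{y}_g \right)^2 .
```

Since SS_B + SS_W equals the total sum of squares, which is unchanged by permuting the phenotype, F increases with SS_B for a fixed number of groups; this is why bounding SS_B from above is enough to bound the test value.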
The UB is used to prune the bulk of the SNP-pairs, so that candidate SNP-pairs can be retrieved rapidly without performing exhaustive ANOVA tests. The performance of the resourceful ANOVA method, relative to the traditional brute force technique (BFT), is evaluated with large datasets of neurosensory, metabolism, and cardiovascular SNPs in genome-wide association trials. The method exhibits superior runtimes over the BFT in the genome-wide association tests. The time complexity of the method is far lower than that of the BFT, and its space complexity is linear in the size of the dataset. Akin to the resourceful ANOVA method, the authors craft an effective chi-square test for investigating the association between a phenotype and SNP-pairs. An investigation of the interactions among several SNPs requires fast algorithms and processing machines. In the future, perhaps the proposed two-locus ANOVA test and chi-square test algorithms could be adapted to execute speedily on a parallel machine. This paper is for readers who are interested in the authors' algorithm and its impact on biology.
Online Computing Reviews Service
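The pruning idea described in this review can be illustrated with a small sketch. The bound used here is a simplified one of the same flavor and is not the bound derived in the paper: for binary SNPs, the between-group sum of squares of a pair equals the single-SNP value for SNP i plus whatever SNP j adds inside each genotype group of SNP i, and that addition cannot exceed the value obtained by assigning the smallest or largest phenotype values to one side of the split. Unlike the paper's second term, this simplified term is recomputed per phenotype. All function names are hypothetical.

```python
import numpy as np

def ss_between(y, groups):
    """Between-group sum of squared deviations of y for the given labels."""
    grand_mean = y.mean()
    return sum(
        y[groups == g].size * (y[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )

def max_split_ss(y_sorted, a):
    """Largest between-group SS from splitting sorted values into sizes a and n - a."""
    n = y_sorted.size
    if a == 0 or a == n:
        return 0.0
    mean = y_sorted.mean()
    # The most extreme subset mean comes from the a smallest or a largest values.
    dev = max(mean - y_sorted[:a].mean(), y_sorted[n - a:].mean() - mean)
    return a * n / (n - a) * dev ** 2

def scan_with_pruning(X, y, threshold):
    """Return SNP-pairs whose between-group SS reaches threshold, and the test count."""
    hits, evaluated = [], 0
    for i in range(X.shape[1]):
        ss_i = ss_between(y, X[:, i])  # single-SNP term, shared by all pairs (i, j)
        masks = [X[:, i] == v for v in (0, 1)]
        parts = [np.sort(y[m]) for m in masks]
        for j in range(i + 1, X.shape[1]):
            # Most that SNP j could add inside each genotype group of SNP i.
            extra = sum(
                max_split_ss(p, int(X[m, j].sum())) for p, m in zip(parts, masks)
            )
            if ss_i + extra + 1e-9 < threshold:
                continue  # the bound rules this pair out; skip the exact test
            evaluated += 1
            if ss_between(y, 2 * X[:, i] + X[:, j]) >= threshold:
                hits.append((i, j))
    return hits, evaluated
```

Because the bound never underestimates the exact value (up to the small floating-point tolerance), the pruned scan returns the same pairs as the exhaustive one while running the exact test on fewer candidates whenever the bound is tight enough.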

Jens Lichtenberg

Zhang, Zou, and Wang propose an efficient algorithm to find single-nucleotide polymorphism (SNP) pairs associated with quantitative phenotypes, a process described as an association study. A binary representation of all SNPs indicating whether an SNP is significantly linked to a specific phenotype is defined as the genotype of the SNP. Due to the vast number of SNPs in an organism, the problem of finding associations between genotypes and quantitative phenotypes is computationally intractable. By reducing the analysis to a two-locus association mapping, it is possible to apply the analysis of variance (ANOVA) test to discover significant associations. The paper focuses on enhancing the computational aspect, in order to allow a genome-wide analysis. It presents FastANOVA, an algorithm that can determine optimal solutions based on a small number of SNP candidates by establishing upper bounds, which allows for efficient candidate retrieval. The paper offers a good theoretical foundation and presents the algorithm well. The efficiency of the algorithm is documented through runtime comparisons and experimental validations. In order to show the applicability of the approach to actual biological data, the authors analyze not only synthetic data, but also real phenotypes (cardiovascular, metabolism, and neurosensory). It would have been interesting to see additional experimental applications, possibly supported by several datasets from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. But even without these, Zhang et al. motivate and illustrate their algorithm in a very effective manner, making this a very significant paper for the study of phenotype to SNP associations.
Online Computing Reviews Service


Information

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 3, Issue 4
November 2009
196 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/1631162
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 December 2009
Accepted: 01 July 2009
Revised: 01 May 2009
Received: 01 January 2009
Published in TKDD Volume 3, Issue 4


Author Tags

  1. ANOVA test
  2. Association study
  3. permutation test

Qualifiers

  • Research-article
  • Research
  • Refereed


