research-article

Efficient algorithms for genome-wide association study

Authors: Xiang Zhang, Fei Zou, Wei Wang

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 3, Issue 4

Article No.: 19, Pages 1 - 28

https://doi.org/10.1145/1631162.1631167

Published: 04 December 2009

Abstract

Studying the association between quantitative phenotypes (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand the underlying mechanisms of complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. The ANOVA (analysis of variance) test is routinely used in association studies, and important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can run into the millions, so evaluating joint effects is computationally challenging even for SNP-pairs. Moreover, because large numbers of SNPs are correlated, a permutation procedure is preferred over a simple Bonferroni correction for properly controlling the family-wise error rate while retaining mapping power, and this dramatically increases the computational cost of an association study.
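To give a sense of scale, here is a back-of-the-envelope calculation. The figures are illustrative assumptions, not numbers from the article: m = 10^6 SNPs, a family-wise level of α = 0.05, and K phenotype permutations.

```latex
\[
  \binom{m}{2} \;=\; \frac{m(m-1)}{2}
  \;=\; \frac{10^{6}\,(10^{6}-1)}{2}
  \;=\; 499{,}999{,}500{,}000 \;\approx\; 5 \times 10^{11}
  \quad \text{SNP-pairs,}
\]
\[
  \text{Bonferroni per-test level: } \frac{\alpha}{\binom{m}{2}}
  \;\approx\; \frac{0.05}{5 \times 10^{11}} \;=\; 10^{-13},
  \qquad
  \text{brute-force permutation cost: } (K+1)\binom{m}{2} \text{ tests.}
\]
```

Under these assumptions, even K = 100 permutations would require on the order of 5 × 10^13 pairwise tests if every pair were evaluated exhaustively.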

In this article, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in batch mode, which also supports large permutation tests. We derive an upper bound on the SNP-pair ANOVA test value that can be expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the SNPs alone and is independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA needs to perform the ANOVA test on only a small number of candidate SNP-pairs, without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than a brute-force implementation of ANOVA tests on all SNP-pairs. The principles used in FastANOVA can also be applied to categorical phenotypes and to other statistics, such as the chi-square test.
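For orientation, the brute-force baseline that the abstract compares against can be sketched as follows. This is not FastANOVA itself: the abstract does not give the upper bound's formula, so no pruning is attempted here. The sketch assumes binary SNPs stored as rows of a NumPy array and treats each SNP-pair's joint genotype as a single factor with up to four levels. The function names (`pair_f_statistic`, `max_pair_statistic`, `permutation_threshold`) and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def pair_f_statistic(x1, x2, y):
    """One-way ANOVA F-statistic of y grouped by the joint genotype (x1, x2)."""
    groups = 2 * x1 + x2                      # joint genotype coded as 0..3
    labels = np.unique(groups)
    g, n = len(labels), len(y)
    if g < 2 or n <= g:
        return 0.0                            # test undefined; treat as no signal
    grand_mean = y.mean()
    ss_between = 0.0
    ss_within = 0.0
    for lab in labels:
        yg = y[groups == lab]
        ss_between += len(yg) * (yg.mean() - grand_mean) ** 2
        ss_within += ((yg - yg.mean()) ** 2).sum()
    if ss_within == 0.0:
        return np.inf                         # groups explain y perfectly
    return (ss_between / (g - 1)) / (ss_within / (n - g))


def max_pair_statistic(snps, y):
    """Largest F-statistic over all SNP-pairs (exhaustive, O(m^2) tests)."""
    m = snps.shape[0]
    return max(pair_f_statistic(snps[i], snps[j], y)
               for i in range(m) for j in range(i + 1, m))


def permutation_threshold(snps, y, n_perm=100, alpha=0.05, seed=0):
    """Family-wise critical value: the (1 - alpha) quantile of the maximum
    statistic over all pairs, taken across random phenotype permutations."""
    rng = np.random.default_rng(seed)
    maxima = np.array([max_pair_statistic(snps, rng.permutation(y))
                       for _ in range(n_perm)])
    return np.quantile(maxima, 1.0 - alpha)


# Simulated data (illustrative only): 6 binary SNPs, 200 individuals,
# with the phenotype driven by an interaction between SNP 0 and SNP 1.
rng = np.random.default_rng(1)
n, m = 200, 6
snps = rng.integers(0, 2, size=(m, n))
y = 2.0 * (snps[0] ^ snps[1]) + rng.normal(size=n)

threshold = permutation_threshold(snps, y, n_perm=50)
f_causal = pair_f_statistic(snps[0], snps[1], y)
print(f"critical value {threshold:.2f}, F for pair (0, 1) = {f_causal:.2f}")
```

Every permutation repeats the full pairwise scan, which is the cost the abstract's upper-bound pruning is designed to avoid.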

Index Terms

Efficient algorithms for genome-wide association study
1. Applied computing
  1. Life and medical sciences
2. Information systems
  1. Information systems applications
    1. Data mining

Reviews

Reviewer: Amos O Olagunju

The convergence, divergence, parallel, and interaction motions by thousands of genes in humans make it difficult to predict the functionality of gene structures. In order to understand the fundamental techniques of multifaceted phenotypes, researchers must reflect on the cooperative heritable effects of diverse single nucleotide polymorphisms (SNPs): How should an efficient algorithm be devised for exploring significant associations among millions of SNPs? How should false-positive errors be minimized in multiple statistical tests of association between one phenotype and several SNPs? Zhang, Zou, and Wang present a resourceful analysis of variance (ANOVA) method for testing relationships on pairs of different sequences of structural units of nucleic acids. The resourceful ANOVA method consists of an algorithm for randomly permuting a phenotype (hundreds to thousands of times) to generate the pinpoint distribution for investigating lack of association between a genotype and the phenotype, and a repetitive procedure for locating the utmost test value for examining the statistical significance of each reshuffled phenotype. The authors define "two-locus association mapping" as the study of the relationship between a phenotype and SNP-pairs with distinctive positions on a genome. The two-locus ANOVA test with a permutation algorithm initially computes the F-statistic values for the binary SNP-pair variables and a quantitative phenotype. The F-statistic value is the mean sum of squared deviations between groups divided by the mean sum of squared deviations within groups. Next, the phenotype is sampled without replacement to obtain a reference distribution and the critical value of the F-statistic. The upper bound (UB) of the sum of squared deviations between groups is generated for the two-locus ANOVA test, to relate one SNP and phenotype, and the prearranged phenotype values and pair-wise SNP genotypes.
The UB is used to trim the bulk of the SNP-pairs for rapid retrieval of an SNP candidate, as well as to circumvent performing exhaustive ANOVA tests. The performance of the resourceful ANOVA method, relative to the traditional brute-force technique (BFT), is evaluated with large datasets of neurosensory, metabolism, and cardiovascular SNPs in genome relationship trials. The method exhibits superior runtimes over the BFT in the genome-wide association tests. The time complexity of the method is far lower than that of the BFT, and its space complexity is linear in the size of the dataset. Akin to the resourceful ANOVA method, the authors craft an effective chi-square test for investigating the association between a phenotype and SNP-pairs. An investigation of the interactions among several SNPs requires fast algorithms and processing machines. In the future, perhaps the proposed two-locus ANOVA test and chi-square test algorithm could be adapted to execute speedily on a parallel machine. This paper is for readers who are interested in the authors' algorithm and its impact on biology.

Online Computing Reviews Service
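The F-statistic described in this review can be written out in standard one-way ANOVA notation. The symbols below are assumptions introduced for illustration and are not taken from the article: n individuals, g joint-genotype groups for a SNP-pair, group sizes n_k, group means \bar{y}_k, and overall mean \bar{y}.

```latex
\[
  SS_B = \sum_{k=1}^{g} n_k \,(\bar{y}_k - \bar{y})^2,
  \qquad
  SS_W = \sum_{k=1}^{g} \sum_{i \in \text{group } k} (y_i - \bar{y}_k)^2,
  \qquad
  SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = SS_B + SS_W,
\]
\[
  F \;=\; \frac{SS_B / (g-1)}{SS_W / (n-g)}
    \;=\; \frac{n-g}{g-1} \cdot \frac{SS_B}{SS_T - SS_B}.
\]
```

Because SS_T depends only on the phenotype values, F is increasing in SS_B for a fixed number of groups g. This is presumably why an upper bound on the between-group sum of squares suffices to rule out SNP-pairs whose F-statistic cannot reach the critical value.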

Reviewer: Jens Lichtenberg

Zhang, Zou, and Wang propose an efficient algorithm to find single-nucleotide polymorphism (SNP) pairs associated with quantitative phenotypes, a process described as an association study. A binary representation of all SNPs indicating whether an SNP is significantly linked to a specific phenotype is defined as the genotype of the SNP. Due to the vast number of SNPs in an organism, the problem of finding associations between genotypes and quantitative phenotypes is computationally intractable. By reducing the analysis to a two-locus association mapping, it is possible to apply the analysis of variance (ANOVA) test to discover significant associations. The paper focuses on enhancing the computational aspect in order to allow a genome-wide analysis. It presents FastANOVA, an algorithm that can determine optimal solutions based on a small number of SNP candidates by establishing upper bounds, which allows for efficient candidate retrieval. The paper offers a good theoretical foundation and presents the algorithm well. The efficiency of the algorithm is documented through runtime comparisons and experimental validations. In order to show the applicability of the approach to actual biological data, the authors analyze not only synthetic data but also real phenotypes (cardiovascular, metabolism, and neurosensory). It would have been interesting to see additional experimental applications, possibly supported by several datasets from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. But even without these, Zhang et al. motivate and illustrate their algorithm in a very effective manner, making this a very significant paper for the study of phenotype-to-SNP associations.

Online Computing Reviews Service

Published In

ACM Transactions on Knowledge Discovery from Data Volume 3, Issue 4

November 2009

196 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/1631162

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 December 2009

Accepted: 01 July 2009

Revised: 01 May 2009

Received: 01 January 2009

Published in TKDD Volume 3, Issue 4

Qualifiers

Research-article
Research
Refereed

Cited By

DİKBAŞ F (2022). Compositional correlation analysis of gene expression time series. Academic Platform Journal of Engineering and Smart Systems 10:1 (30-41). Online publication date: 1-Jan-2022. https://doi.org/10.21541/apjess.1060765

D’Silva S, Chakraborty S, Kahali B (2022). Concurrent outcomes from multiple approaches of epistasis analysis for human body mass index associated loci provide insights into obesity biology. Scientific Reports 12:1. Online publication date: 4-May-2022. https://doi.org/10.1038/s41598-022-11270-0

Peng L, Zhu X, Liao H, Zhang P (2019). Application of Multi-label Learning Model for Chronic Kidney Disease Syndrome Classification. 2019 IEEE 5th International Conference on Computer and Communications (ICCC) (1729-1733). Online publication date: Dec-2019. https://doi.org/10.1109/ICCC47050.2019.9064390

Ferreira P, Oti M, Barann M, Wieland T, Ezquina S, Friedländer M, Rivas M, Esteve-Codina A, Estivill X, Guigó R, Dermitzakis E, Antonarakis S, Meitinger T, Strom T, Palotie A, François Deleuze J, Sudbrak R, Lerach H, Gut I, Syvänen A, Gyllensten U, Schreiber S, Rosenstiel P, Brunner H, Veltman J, Hoen P, Jan van Ommen G, Carracedo A, Brazma A, Flicek P, Cambon-Thomsen A, Mangion J, Bentley D, Hamosh A, Rosenstiel P, Strom T, Lappalainen T, Guigó R, Sammeth M (2016). Sequence variation between 462 human individuals fine-tunes functional sites of RNA processing. Scientific Reports 6:1. Online publication date: 12-Sep-2016. https://doi.org/10.1038/srep32406

Zhang X, Huang S, Sun W, Wang W (2012). Rapid and Robust Resampling-Based Multiple-Testing Correction with Application in a Genome-Wide Expression Quantitative Trait Loci Study. Genetics 190:4 (1511-1520). Online publication date: 31-Jan-2012. https://doi.org/10.1534/genetics.111.137737

Liao B, Li X, Zhu W, Cao Z (2012). A Novel Method to Select Informative SNPs and Their Application in Genetic Association Studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 9:5 (1529-1534). Online publication date: 1-Sep-2012. https://dl.acm.org/doi/10.1109/TCBB.2012.70
