DOI: 10.1145/2320765.2320791
research-article

Challenges and approaches for distributed workflow-driven analysis of large-scale biological data: vision paper

Published: 30 March 2012

Abstract

Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called "bioKepler", that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges related to next-generation sequencing data, explains the approaches taken in bioKepler to help with analysis of such data, and presents preliminary results demonstrating these approaches.

    Published In

    EDBT-ICDT '12: Proceedings of the 2012 Joint EDBT/ICDT Workshops
    March 2012
    265 pages
    ISBN:9781450311434
    DOI:10.1145/2320765
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. application
    2. bioinformatics
    3. data-parallel patterns
    4. next generation sequence analysis
    5. scientific workflows

    Qualifiers

    • Research-article

    Conference

    ICDT '12

    Acceptance Rates

    Overall Acceptance Rate 7 of 10 submissions, 70%
