Article
DOI: 10.1007/978-3-030-57675-2_26

Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments

Published: 24 August 2020

Abstract

Current scientific workflows are large and complex. They typically run thousands of simulations whose results, combined with search and data analytics algorithms to infer new knowledge, generate very large amounts of data. Such workflows therefore comprise many tasks, some of which may fail. Most work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (for example, by retrying the failed computation or resubmitting it to other resources). However, some failures are caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault-tolerance mechanisms are not sufficient to complete the workflow successfully. In these cases, developers have to add code to their applications to prevent and manage possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
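As a rough illustration of what a per-task failure policy can look like, the following is a minimal pure-Python sketch. It is not the interface proposed in the paper: the `on_failure` decorator, the policy names `RETRY`, `IGNORE` and `FAIL`, and the `simulate` and `run_workflow` functions are all hypothetical names introduced here, and the tasks run sequentially rather than on a distributed runtime.

```python
import functools

# Hypothetical policy names, chosen for this sketch only.
RETRY, IGNORE, FAIL = "RETRY", "IGNORE", "FAIL"


def on_failure(policy=FAIL, retries=2, default=None):
    """Attach a failure policy to a task function (illustrative only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Only the RETRY policy re-executes the task after an error.
            attempts = retries + 1 if policy == RETRY else 1
            last_error = None
            for _ in range(attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as error:
                    last_error = error
            if policy == IGNORE:
                # Application-level failure: continue with a placeholder.
                return default
            # FAIL, or RETRY with all attempts exhausted: propagate the error.
            raise last_error
        return wrapper
    return decorator


@on_failure(policy=IGNORE, default=None)
def simulate(sample):
    # Stand-in for a simulation that does not converge for some inputs.
    if sample < 0:
        raise ValueError("simulation did not converge")
    return sample * 2


def run_workflow(samples):
    results = [simulate(s) for s in samples]
    # Drop the results of ignored failures before the analysis step.
    return [r for r in results if r is not None]
```

In this sketch the failure handling lives in the decorator, so the body of `run_workflow` needs no try/except blocks; a non-converging sample is simply skipped while the remaining tasks proceed.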

Published In

Euro-Par 2020: Parallel Processing: 26th International Conference on Parallel and Distributed Computing, Warsaw, Poland, August 24–28, 2020, Proceedings
Aug 2020
618 pages
ISBN: 978-3-030-57674-5
DOI: 10.1007/978-3-030-57675-2

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Failure management
  2. Scientific workflows
  3. Parallel programming
  4. Distributed computing
