Article

Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments

Authors:

Javier Álvarez Cid-Fuentes,

Javier Conejero,

Rosa M. Badia

Euro-Par 2020: Parallel Processing: 26th International Conference on Parallel and Distributed Computing, Warsaw, Poland, August 24–28, 2020, Proceedings

Pages 411 - 425

https://doi.org/10.1007/978-3-030-57675-2_26

Published: 24 August 2020

Abstract

Current scientific workflows are large and complex. They typically perform thousands of simulations whose results, combined with search and data analytics algorithms used to infer new knowledge, generate a very large amount of data. Such workflows therefore comprise many tasks, some of which may fail. Most existing work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (for example, by retrying the failed computation or resubmitting it to other resources). However, some failures are caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault tolerance mechanisms are not sufficient to complete the workflow successfully. In these cases, developers have to add code to their applications to prevent and manage possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
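The distinction the abstract draws between resource failures and application failures can be illustrated with a minimal sketch. This is not the interface proposed in the paper: the policy names (`RETRY`, `IGNORE`, `FAIL`), the `run_task` and `run_workflow` helpers, and the toy `simulate` task are all hypothetical, and only the Python standard library is assumed. The sketch shows why retrying alone does not help when the input itself is bad, whereas a per-task policy lets the rest of the workflow complete.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical policy names, chosen for this sketch only.
RETRY, IGNORE, FAIL = "retry", "ignore", "fail"


def run_task(fn, arg, on_failure=FAIL, retries=2):
    """Run fn(arg) and apply the chosen policy if it raises."""
    attempts = retries + 1 if on_failure == RETRY else 1
    for attempt in range(attempts):
        try:
            return fn(arg)
        except Exception:
            if attempt + 1 < attempts:
                continue  # resource-style recovery: simply try again
            if on_failure == IGNORE:
                return None  # application failure: drop this task's result
            raise  # FAIL policy, or retries exhausted


def simulate(x):
    """Toy task that always fails on negative input (stands in for corrupted data)."""
    if x < 0:
        raise ValueError(f"bad input: {x}")
    return x * x


def run_workflow(inputs, on_failure):
    """Run one task per input in parallel and collect the surviving results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_task, simulate, x, on_failure) for x in inputs]
        results = [f.result() for f in futures]
    return [r for r in results if r is not None]
```

With these assumptions, `run_workflow([1, -2, 3], IGNORE)` completes with the results of the two healthy tasks, while `run_workflow([1, -2, 3], RETRY)` still raises `ValueError`, because the failure is deterministic and repeating the task cannot fix it.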



Published In

Euro-Par 2020: Parallel Processing: 26th International Conference on Parallel and Distributed Computing, Warsaw, Poland, August 24–28, 2020, Proceedings

Aug 2020

618 pages

ISBN: 978-3-030-57674-5

DOI: 10.1007/978-3-030-57675-2

Editors:
Maciej Malawski, AGH University of Science and Technology, Krakow, Poland
Krzysztof Rzadca, University of Warsaw, Warsaw, Poland

© Springer Nature Switzerland AG 2020.

Publisher

Springer-Verlag

Berlin, Heidelberg
