Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments.

2020 
Current scientific workflows are large and complex. They normally perform thousands of simulations whose results combined with searching and data analytics algorithms, in order to infer new knowledge, generate a very large amount of data. To this end, workflows comprise many tasks and some of them may fail. Most of the work done about failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation in other resources, etc.) However, some of these failures can be caused by the application itself (corrupted data, algorithms which are not converging for certain conditions, etc.), and these fault tolerance mechanisms are not sufficient to perform a successful workflow execution. In these cases, developers have to add some code in their applications to prevent and manage the possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use-cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    13
    References
    0
    Citations
    NaN
    KQI
    []