Addressing data resiliency for staging based scientific workflows

2019 
As applications move towards extreme scales, data-related challenges are becoming significant concerns, and in-situ workflows based on data staging and in-situ/in-transit data processing have been proposed to address these challenges. Increasing scale is also expected to result in an increase in the rate of silent data corruption errors, which will impact both the correctness and performance of applications. Furthermore, this impact is amplified in the case of in-situ workflows due to the dataflow between the component applications of the workflow. While existing research has explored silent error detection at the application level, silent error detection for workflows remains an open challenge. This paper addresses silent error detection for extreme scale in-situ workflows. The presented approach leverages idle computation resource in data staging to enable timely detection and recovery from silent data corruption, effectively reducing the propagation of corrupted data and end-to-end workflow execution time in the presence of silent errors. As an illustration of this approach, we use a spatial outlier detection approach in staging to detect errors introduced in data transfer and storage. We also provide a CPU-GPU hybrid staging framework for error detection in order to achieve faster error identification. We have implemented our approach within the DataSpaces staging service, and evaluated it using both synthetic and real workflows on a Cray XK7 system (Titan) at different scales. We demonstrate that, in the presence of silent errors, enabling error detection on staged data alongside a checkpoint/restart scheme improves the total in-situ workflow execution time by up to 22% in comparison with using checkpoint/restart alone.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    27
    References
    4
    Citations
    NaN
    KQI
    []