Database assisted distribution to improve fault tolerance for multiphysics applications

2015 
Multiscale physics applications pose an interesting problem from a computer science standpoint: task granularity can vary drastically, which places a heavy burden on the task scheduler and load balancer. Additionally, because some of these computations run for a long time, fault tolerance becomes a necessity, as failing to recover from a fault during a single long-running task results in the recomputation of all data used to generate the inputs. Traditionally, fault tolerance is provided through checkpointing. However, checkpoints must be taken sparingly due to their high cost. In this paper, we describe our use of a NoSQL database and asynchronous task-based runtimes to work directly from the checkpoints themselves, with minimal code modifications by domain scientists. To evaluate the performance impact of this approach, we studied CoHMM, a co-design proxy application designed to test modern runtimes by simulating the propagation of a shock wave through a material using the heterogeneous multiscale method. We distilled this proxy application into a library, which we used to implement CoHMM in a range of runtimes with and without our database-assisted approach. We measured the overhead of each implementation with respect to the CoHMM application and the cost of serializing and migrating data in the runtimes themselves.