Design and analysis of fault tolerance mechanism for Sparrow

2014 
Big data processing frameworks are moving towards higher degrees of parallelism and shorter task durations in order to achieve lower response times. Scheduling highly parallel tasks that complete in roughly 100 milliseconds poses a major challenge for task schedulers. To meet this challenge, researchers have turned to decentralized frameworks that relieve the pressure on task schedulers, among which Sparrow is a good choice. However, little effort has been devoted to the fault tolerance of Sparrow, which does not handle worker failures and can therefore leave tasks incomplete. We present a fault tolerance mechanism named Heartbeat on Sparrow to handle failures of worker machines. Through simulation, we compare it with a simple mechanism. The results show that Heartbeat on Sparrow detects worker failures faster and reschedules all failed tasks more efficiently, recovering tasks and state in sub-second time. We hope this mechanism will contribute to the fault tolerance of Sparrow and other decentralized designs.