A Distributed Fault Analysis (DFA) Method for Fault Tolerance in High-Performance Computing Systems

2020 
Failures and recovery operations waste an increasing proportion of computing capacity in high-performance computing (HPC) systems and degrade service. To meet future large-scale computation needs, HPC systems are expected to reduce the average failure time to an almost negligible level and to overcome the growing challenge of fault tolerance. This paper proposes a novel distributed fault analysis (DFA) method that detects the presence of faults so that failures can be handled effectively. It aims to detect changes in node state within a specified time period and to ensure that the system functions properly. By regularly monitoring the nodes of a fully connected system, the DFA algorithm reduces the holding time in the event of a crash and significantly reduces recovery time. Experiments on a distributed system evaluate a variety of faults to assess how accurately the faulty node is predicted, and measure sensitivity and specificity against the traditional mechanism to show the effectiveness of the proposal.
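
For readers unfamiliar with the two evaluation metrics named above, the following minimal sketch shows how sensitivity and specificity are conventionally computed for a fault detector. It is not the paper's code: the function name `sensitivity_specificity` and the five-node sample data are hypothetical and serve only to illustrate the standard definitions.

```python
import math


def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for a fault detector.

    actual and predicted are equal-length sequences of booleans,
    where True means "node is faulty". Illustrative helper only.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # faulty, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # faulty, missed
    tn = sum(1 for a, p in pairs if not a and not p)  # healthy, not flagged
    fp = sum(1 for a, p in pairs if not a and p)      # healthy, flagged

    # Sensitivity: share of truly faulty nodes that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    # Specificity: share of healthy nodes correctly left unflagged.
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Hypothetical ground truth and detector output for five nodes.
actual = [True, True, False, False, False]
predicted = [True, False, False, False, True]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this made-up sample, one of the two faulty nodes is detected and two of the three healthy nodes are correctly left unflagged, so a detector that misses faults lowers sensitivity while one that raises false alarms lowers specificity.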