A multi-factor monitoring fault tolerance model based on a GPU cluster for big data processing

2018 
Abstract High-performance computing clusters are widely used in large-scale data mining applications, which are computationally intensive and therefore place high demands on persistence, stability, and real-time operation. To support large-scale data processing, we design a multi-factor real-time monitoring fault tolerance (MRMFT) model based on a GPU cluster. However, the higher clock frequency of GPU chips results in excessively high energy consumption in computing systems. Moreover, the ability to sustain long-lasting high-temperature operation varies greatly between GPUs owing to individual differences between the chips. In this paper, we design a GPU cluster energy consumption monitoring system based on wireless sensor networks (WSNs) and propose an energy consumption aware checkpointing (ECAC) scheme to address the high energy consumption problem. ECAC has two advantages: it sets checkpoints according to the actual energy consumption and the device temperature, which improves the utilization of checkpoints and reduces time cost; and it exploits the parallel computing features of the CPU and GPU to hide the CPU detection overhead within GPU parallel computation, which further reduces the time and energy overhead of the fault tolerance phase. Using ECAC as the constraint and aiming for persistent and reliable operation, we design a dynamic task migration mechanism that greatly improves the reliability of the cluster. Theoretical analysis and experimental results show that the model improves the persistence and stability of the computing system while reducing checkpoint overhead.
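The idea of setting checkpoints according to measured energy consumption and device temperature can be illustrated with a minimal sketch. This is not the authors' ECAC algorithm: the names `GpuReading`, `should_checkpoint`, and `checkpoint_steps`, the threshold values, and the simple "either limit exceeded" rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class GpuReading:
    """One monitoring sample for a GPU (hypothetical structure)."""
    power_watts: float
    temperature_c: float


def should_checkpoint(reading, power_limit_watts=250.0, temp_limit_c=85.0):
    """Return True when either monitored factor reaches its limit.

    The limits are illustrative assumptions, not values from the paper.
    """
    return (reading.power_watts >= power_limit_watts
            or reading.temperature_c >= temp_limit_c)


def checkpoint_steps(readings, **limits):
    """Return the indices of the samples at which a checkpoint would be taken."""
    return [i for i, r in enumerate(readings) if should_checkpoint(r, **limits)]


if __name__ == "__main__":
    samples = [
        GpuReading(power_watts=180.0, temperature_c=70.0),  # within both limits
        GpuReading(power_watts=260.0, temperature_c=72.0),  # power limit reached
        GpuReading(power_watts=200.0, temperature_c=90.0),  # temperature limit reached
    ]
    print(checkpoint_steps(samples))
```

In this sketch a checkpoint is triggered by the monitored state rather than by a fixed timer, which is the property the abstract credits with improving checkpoint utilization.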