A real-time and reliable dynamic migration model for concurrent taskflow in a GPU cluster

2018 
High-performance GPU clusters are widely used to process massive amounts of concurrent dataflow, and they face demanding requirements for real-time response, reliability and flexibility. However, high computational intensity and resource utilization lead to excessive system temperature and power consumption, and can even cause instantaneous failures. In this paper, we present a real-time and efficient dynamic taskflow migration approach (DTMA) for a computing cluster. First, we propose our basic theoretical models. Among them, the cluster communication model describes all the communication paths and calculates the communication overhead of the different migration modes. Second, based on the theoretical models and an analysis of multiple instances, we summarize our taskflow migration rules, which help to balance cluster resource utilization and improve the overall performance of the GPUs. Third, the DTMA adjusts the cluster task allocation using a performance- and power-consumption-aware migration approach, which reduces single-node power consumption and enhances system reliability by shifting the current GPU load to one or more other available GPUs. Moreover, the DTMA uses a circular queue to store resource information about the available GPUs for better task scheduling. We evaluate the effect of DTMA by analyzing power consumption, temperature, fan speed and migration cost in different experiments. The experimental results demonstrate that DTMA improves the performance and reliability of our cluster computing system and reduces instantaneous failures.
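
The circular queue of available-GPU resource information mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the record fields (`power_w`, `temp_c`, `utilization`), the function name `pick_target`, and the threshold-based selection rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class GpuInfo:
    """Hypothetical resource record for one available GPU (fields are assumed)."""
    gpu_id: int
    power_w: float      # current power draw in watts
    temp_c: float       # current temperature in degrees Celsius
    utilization: float  # fraction of compute in use, 0.0 to 1.0


def pick_target(ring, max_power_w, max_temp_c):
    """Walk the circular queue at most once and return the first GPU
    below both thresholds, or None if every GPU is too hot or too loaded.

    The head is rotated to the tail after each inspection, so the next
    search resumes at the following GPU instead of restarting at the front.
    """
    for _ in range(len(ring)):
        candidate = ring[0]
        ring.rotate(-1)  # move the inspected GPU to the tail
        if candidate.power_w < max_power_w and candidate.temp_c < max_temp_c:
            return candidate
    return None


# Example: GPU 0 is overheated, so load would be shifted to GPU 1, then GPU 2.
ring = deque([
    GpuInfo(0, power_w=240.0, temp_c=85.0, utilization=0.95),
    GpuInfo(1, power_w=120.0, temp_c=60.0, utilization=0.30),
    GpuInfo(2, power_w=100.0, temp_c=55.0, utilization=0.20),
])
first = pick_target(ring, max_power_w=200.0, max_temp_c=80.0)
second = pick_target(ring, max_power_w=200.0, max_temp_c=80.0)
```

Rotating the queue on every inspection spreads successive migrations across the eligible GPUs rather than always loading the first one, which is one plausible reason to prefer a circular queue over a plain list here.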