Accelerating Training for Distributed Deep Neural Networks in MapReduce

2018 
Parallel training is widely used to reduce the training time of Deep Neural Networks (DNNs). In parallel training, the training data sets and the layered training processes of a DNN are assigned to multiple Graphics Processing Units (GPUs). However, there are obstacles to deploying parallel training in GPU cloud services. A DNN has a tightly dependent layered structure in which each layer consumes the output of the preceding layer, so large intermediate outputs must be transmitted between the separated layer-training processes. Because cloud computing offers storage services and computing services separately, this data transmission over the network degrades training time, making parallel training inefficient in a GPU cloud environment. In this paper, we construct a distributed DNN training architecture that implements parallel DNN training in MapReduce and exposes GPU cloud resources as a web service. We also address the data transmission problem by proposing a distributed DNN scheduler that accelerates training. The scheduler uses a minimum-cost flow algorithm to assign GPU resources, taking data locality and synchronization into account to minimize training time. Compared with the original schedulers, experimental results show that the distributed DNN scheduler reduces training time by 50% while minimizing data transmission and keeping parallel training synchronized.
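To illustrate the kind of formulation the abstract describes, the following is a minimal sketch (not the authors' implementation) of assigning layer-training tasks to GPUs via a minimum-cost flow, where edge costs penalize placements that require moving a task's input data off the GPU that already holds it. The function and variable names (schedule_tasks, transfer_cost) and the per-round capacity of one task per GPU are assumptions made for illustration; the sketch uses networkx's min_cost_flow.

```python
# Sketch: data-locality-aware GPU assignment as a minimum-cost flow problem.
# Not the paper's scheduler; an illustrative formulation only.
import networkx as nx

def schedule_tasks(tasks, gpus, transfer_cost):
    """tasks: list of task ids; gpus: list of GPU ids;
    transfer_cost[(task, gpu)]: estimated cost of moving the task's
    input data to that GPU (0 when the data is already local)."""
    G = nx.DiGraph()
    # Source supplies one unit of flow per task; sink absorbs it.
    G.add_node("src", demand=-len(tasks))
    G.add_node("snk", demand=len(tasks))
    for t in tasks:
        G.add_edge("src", ("task", t), capacity=1, weight=0)
        for g in gpus:
            # Edge cost models the data-transfer penalty of a non-local placement.
            G.add_edge(("task", t), ("gpu", g), capacity=1,
                       weight=transfer_cost[(t, g)])
    for g in gpus:
        # Assumed constraint: each GPU takes one task per scheduling round.
        G.add_edge(("gpu", g), "snk", capacity=1, weight=0)

    flow = nx.min_cost_flow(G)
    # Read the assignment off the saturated task->GPU edges.
    return {t: g for t in tasks for g in gpus
            if flow[("task", t)].get(("gpu", g), 0) == 1}

# Example: each task's data resides on one GPU; moving it costs 5 units.
costs = {("layer0", "gpu0"): 0, ("layer0", "gpu1"): 5,
         ("layer1", "gpu0"): 5, ("layer1", "gpu1"): 0}
print(schedule_tasks(["layer0", "layer1"], ["gpu0", "gpu1"], costs))
# -> {'layer0': 'gpu0', 'layer1': 'gpu1'}: each task stays with its local data.
```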