DELAYED WEIGHT UPDATE FOR FASTER CONVERGENCE IN DATA-PARALLEL DEEP LEARNING

2018 
This paper proposes a data-parallel stochastic gradient descent (SGD) method that uses a delayed weight update. Large-scale neural networks can solve advanced problems, but their processing time increases with network scale. In conventional data parallelism, workers must wait for data communication to and from a server during the weight update. With the proposed data-parallel method, the network weights are updated with a delay and are therefore stale. Nevertheless, the method converges faster because it hides the latency of weight communication with the server: the server carries out the weight communication and the weight update concurrently while the workers compute their gradients. Experimental results demonstrate that the final accuracy of the proposed data-parallel method converges within 1.5% degradation of the conventional method on both VGG and ResNet. At maximum, the convergence speedup factor theoretically reaches double that of conventional data parallelism.
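As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below simulates a one-step-delayed data-parallel SGD loop in plain Python/NumPy. The sequential loop stands in for the overlap between worker gradient computation and the server's update/broadcast; all names (worker_gradient, delayed_data_parallel_sgd, lr, etc.) and the toy quadratic loss are illustrative assumptions.

```python
# Minimal sketch of one-step-delayed data-parallel SGD (illustrative, not the paper's code).
import numpy as np

def worker_gradient(weights, batch):
    # Placeholder worker-side gradient on a toy least-squares loss.
    x, y = batch
    pred = x @ weights
    return x.T @ (pred - y) / len(y)

def delayed_data_parallel_sgd(batches_per_worker, dim, lr=0.1, steps=100):
    weights = np.zeros(dim)          # server's up-to-date weights
    stale_weights = weights.copy()   # copy currently held by the workers (one step old)
    pending_grad = None              # averaged gradient from the previous step

    for t in range(steps):
        # Workers compute gradients on the stale weights while, conceptually,
        # the server applies the previous step's gradient and broadcasts new weights.
        grads = [worker_gradient(stale_weights, b[t % len(b)])
                 for b in batches_per_worker]

        if pending_grad is not None:
            weights -= lr * pending_grad   # server update overlapped with worker compute

        # Broadcast: workers will use these (now one step behind) weights next iteration.
        stale_weights = weights.copy()
        pending_grad = np.mean(grads, axis=0)

    return weights
```

Unrolling the loop gives the delayed recurrence w_{t+1} = w_t - lr * grad(w_{t-1}), which is the sense in which the weights used by the workers are stale while communication latency is hidden.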