Optimal Memory-aware Backpropagation of Deep Join Networks

2019 
In the context of Deep Learning training, the memory needed to store activations can prevent the user from considering large models and large batch sizes. A possible solution is to rely on model parallelism to distribute the weights of the model and the activations over distributed memory nodes. In this paper, we consider another, purely sequential approach to save memory: checkpointing techniques. Checkpointing techniques were introduced in the context of Automatic Differentiation. They consist in storing some, but not all, activations during the forward phase of network training, and then recomputing the missing values during the backward phase. Using this approach, it is possible, at the price of recomputations, to use a minimal amount of memory. The case of a single homogeneous chain, i.e. a network in which all stages are identical and form a chain, is well understood, and optimal solutions based on dynamic programming have been proved in the Automatic Differentiation literature. The networks encountered in practice in Deep Learning are much more diverse, both in terms of shape and heterogeneity. The present paper can be seen as an attempt to extend the class of graphs that can be solved optimally. Indeed, we provide an optimal algorithm, based on dynamic programming, for the case of several chains that join when computing the loss function. This model typically corresponds to Siamese or Cross-Modal Networks.
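To make the single-chain baseline concrete, the sketch below implements the classic dynamic-programming recurrence for homogeneous-chain checkpointing from the Automatic Differentiation literature (a REVOLVE-style recurrence). This is an illustrative sketch, not the algorithm of this paper: the function name `extra_forwards` and the convention that the slot holding the chain input counts toward the memory budget `s` are assumptions made here for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def extra_forwards(l: int, s: int) -> int:
    """Minimal number of re-executed forward steps needed to backpropagate a
    homogeneous chain of l stages when only s activation checkpoints
    (including the one holding the chain input) fit in memory.
    Illustrative REVOLVE-style recurrence for the single-chain case."""
    if l <= 1:
        return 0                    # an empty or one-stage chain needs no recomputation
    if s == 1:
        return l * (l - 1) // 2     # restart from the stored input for every backward step
    # Advance j steps from the stored input (cost j) and checkpoint the result,
    # then reverse the right sub-chain with one slot fewer and the left
    # sub-chain with all slots (the right checkpoint has been freed).
    return min(j + extra_forwards(l - j, s - 1) + extra_forwards(j, s)
               for j in range(1, l))
```

For instance, `extra_forwards(10, 3)` returns the minimal number of re-executed forward steps for a 10-stage chain constrained to 3 stored activations; the paper's contribution is a dynamic program of the same flavour for several chains that join at the loss, as in Siamese or Cross-Modal Networks.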