Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah M. Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu

2018

Increasing the size of a neural network typically improves accuracy but also increases the memory and compute requirements for training the model. We introduce a methodology for training deep neural networks using half-precision floating-point numbers, without losing model accuracy or having to modify hyper-parameters. This nearly halves memory requirements and, on recent GPUs, speeds up arithmetic. Weights, activations, and gradients are stored in IEEE half-precision format. Since this format has a narrower range than single-precision, we propose three techniques for preventing the loss of critical information. Firstly, we recommend maintaining a single-precision copy of weights that accumulates the gradients after each optimizer step (this copy is rounded to half-precision for the forward- and back-propagation). Secondly, we propose loss-scaling to preserve gradient values with small magnitudes. Thirdly, we use half-precision arithmetic that accumulates into single-precision outputs, which are converted to half-precision before storing to memory. We demonstrate that the proposed methodology works across a wide variety of tasks and modern large-scale (exceeding 100 million parameters) model architectures, trained on large datasets.
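The three techniques named in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it emulates half- and single-precision rounding with Python's standard `struct` module, and the helper names (`to_fp16`, `to_fp32`), the update size `UPDATE`, and the scale factor `LOSS_SCALE = 1024` are all hypothetical values chosen for illustration. In real training the scale factor multiplies the loss before back-propagation; here a single gradient value is multiplied directly to simulate that effect.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE single-precision value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 1) Single-precision master copy of a weight.
# UPDATE stands in for learning_rate * gradient (an assumed value). It is
# smaller than half the spacing between half-precision numbers just below
# 1.0, so an update applied directly in half precision is rounded away.
UPDATE = 1e-4
w_fp16_only = to_fp16(1.0)
w_master = to_fp32(1.0)
for _ in range(100):
    w_fp16_only = to_fp16(w_fp16_only - to_fp16(UPDATE))  # never moves
    w_master = to_fp32(w_master - UPDATE)                  # accumulates
w_working = to_fp16(w_master)  # half-precision copy used for forward/backward

# 2) Loss scaling.
# A gradient of 1e-8 lies below the smallest half-precision subnormal and
# flushes to zero. Scaling first keeps it representable; the unscaling
# happens in single precision.
LOSS_SCALE = 1024.0  # assumed scale factor, for illustration only
tiny_grad = 1e-8
unscaled_fp16 = to_fp16(tiny_grad)
scaled_fp16 = to_fp16(tiny_grad * LOSS_SCALE)
recovered = to_fp32(scaled_fp16) / LOSS_SCALE

# 3) Half-precision products accumulated in single precision.
# A half-precision accumulator stalls once the running sum reaches 2048,
# because 2048 + 1 rounds back to 2048. A single-precision accumulator
# reaches 4096, which is then converted to half precision for storage.
acc16 = 0.0
acc32 = 0.0
for _ in range(4096):
    product = to_fp16(1.0) * to_fp16(1.0)
    acc16 = to_fp16(acc16 + product)
    acc32 = to_fp32(acc32 + product)
stored = to_fp16(acc32)

print("weight, fp16 only:", w_fp16_only, "| fp32 master:", w_master)
print("gradient in fp16:", unscaled_fp16, "| with loss scaling:", recovered)
print("sum, fp16 accumulator:", acc16, "| fp32 accumulator:", acc32)
```

Each part isolates one failure mode of pure half-precision training (lost weight updates, gradient underflow, and accumulator stall) next to the corresponding remedy. The specific magnitudes were picked so the effect shows up in a few lines; they are not taken from the paper.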
