Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters
2021
Deep learning based models have dominated the current landscape of production
recommender systems. Moreover, recent years have witnessed an exponential
growth in model scale, from Google's 2016 model with 1 billion parameters to
Facebook's latest model with 12 trillion parameters. A significant quality
boost has come with each jump in model capacity, which leads us to believe the
era of 100 trillion parameters is around the corner. However, training such
models is challenging even within industrial-scale data centers. The
difficulty stems from the staggering heterogeneity of the training
computation: the model's embedding layer can account for more than 99.99% of
the total model size and is extremely memory-intensive, while the rest of the
neural network is increasingly computation-intensive. To support the training
of such huge models, an efficient distributed training system is urgently
needed. In this paper, we resolve this challenge by carefully co-designing
both the optimization algorithm and the distributed system architecture.
Specifically, to ensure both training efficiency and training accuracy, we
design a novel hybrid training algorithm in which the embedding layer and the
dense neural network are handled by different synchronization mechanisms; we
then build a system called Persia (short for parallel recommendation training
system with hybrid acceleration) to support this hybrid training algorithm.
Both a theoretical demonstration and an empirical study at up to 100 trillion
parameters have been conducted to justify the system design and implementation
of Persia. We make Persia publicly available at
https://github.com/PersiaML/Persia so that anyone can easily train a
recommender model at the scale of 100 trillion parameters.
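
The hybrid algorithm sketched in the abstract assigns different
synchronization mechanisms to the two heterogeneous parts of the model: in the
paper's design, the memory-intensive embedding layer is updated asynchronously
against parameter servers (tolerating slightly stale reads), while the
computation-intensive dense network is kept synchronized across workers. The
following is a minimal, self-contained Python sketch of that idea only; the
names (EmbeddingServer, dense_forward, etc.) are illustrative assumptions, not
Persia's actual API, and the in-process server and gradient averaging stand in
for real RPC and all-reduce operations.

```python
# Hybrid-synchronization sketch: async sparse (embedding) updates,
# sync dense updates. Illustrative only; not Persia's actual API.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, LR = 4, 0.1

class EmbeddingServer:
    """Simulated parameter server holding the huge, sparse embedding table."""
    def __init__(self):
        self.table = {}  # feature id -> embedding vector

    def lookup(self, ids):
        # Asynchronous pull: values may be staler than the latest update.
        return np.stack([
            self.table.setdefault(i, rng.normal(size=EMB_DIM) * 0.01)
            for i in ids
        ])

    def push(self, ids, grads):
        # Asynchronous push: applied without coordinating across workers.
        for i, g in zip(ids, grads):
            self.table[i] -= LR * g

def dense_forward(w, x):
    # Stand-in for the computation-intensive dense network.
    return x @ w

server = EmbeddingServer()
n_workers = 2
dense_w = rng.normal(size=(EMB_DIM, 1))  # dense weights, replicated per worker

for step in range(3):
    dense_grads = []
    for _ in range(n_workers):
        ids = rng.integers(0, 10, size=5)
        emb = server.lookup(ids)              # async pull (may be stale)
        y = dense_forward(dense_w, emb)
        target = np.ones_like(y)
        dy = 2 * (y - target) / len(ids)      # d(MSE)/dy
        server.push(ids, dy @ dense_w.T)      # async embedding update
        dense_grads.append(emb.T @ dy)        # dense grad, held for sync step
    # Synchronous step: average dense gradients (stand-in for all-reduce).
    dense_w -= LR * np.mean(dense_grads, axis=0)
    print(f"step {step}: dense_w norm = {np.linalg.norm(dense_w):.4f}")
```

The design intuition this sketch captures: tolerating bounded staleness on the
sparse embedding table removes the synchronization bottleneck on the part that
dominates model size, while synchronous averaging on the small dense part
preserves statistical accuracy. The real system uses dedicated embedding
workers, parameter servers, and GPU all-reduce rather than this single-process
simulation.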