CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs

2020 
Wide models such as generalized linear models and factorization-based models have been extensively used in various predictive applications, e.g., recommendation systems. Because these models are memory-bound, their performance improvement on CPUs is reaching its limit. GPUs, with their many computation units and high memory bandwidth, are a promising platform for training machine learning models. However, GPU training of wide models remains far from optimal due to their sparsity and irregularity; existing GPU-based wide-model implementations are even slower than their CPU counterparts. The classical training schema of wide models is not optimized for the GPU architecture and suffers from a large number of random memory accesses and redundant reads/writes of intermediate values. In this paper, we propose an efficient GPU training framework for large-scale wide models, named CuWide. To fully benefit from the GPU memory hierarchy, CuWide applies a new flow-based training schema that leverages the spatial and temporal locality of wide models to drastically reduce communication with GPU global memory. We show that CuWide can be more than 20× faster than state-of-the-art GPU and multi-core CPU solutions.
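The abstract only describes the flow-based idea at a high level. Below is a minimal, hypothetical CUDA sketch of the general pattern it alludes to: a sparse wide-model (logistic regression) update that keeps per-example intermediate values, the dot product and the loss gradient, in on-chip shared memory, so that only the weights actually touched by an example are read from and written back to global memory. The names (`sparse_lr_step`, `CsrBatch`, the plain SGD update) are illustrative assumptions and do not represent CuWide's actual flow-based schema.

```cuda
// Illustrative sketch only: simplified SGD step for a sparse logistic-regression
// wide model, with intermediate values kept in shared memory instead of global
// memory. Launch with 256 threads per block, one block per example.
#include <cuda_runtime.h>

struct CsrBatch {               // mini-batch of sparse examples in CSR layout
    const int   *row_ptr;       // size = num_rows + 1
    const int   *col_idx;       // feature ids of the non-zeros
    const float *val;           // non-zero feature values
    const float *label;         // one label per example
    int          num_rows;
};

__global__ void sparse_lr_step(CsrBatch batch, float *weights, float lr) {
    int row = blockIdx.x;
    if (row >= batch.num_rows) return;

    int beg = batch.row_ptr[row];
    int end = batch.row_ptr[row + 1];

    // 1. Compute the dot product w·x with a block-wide reduction held entirely
    //    in shared memory (requires blockDim.x == 256, a power of two).
    __shared__ float partial[256];
    float acc = 0.f;
    for (int i = beg + threadIdx.x; i < end; i += blockDim.x)
        acc += weights[batch.col_idx[i]] * batch.val[i];
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // 2. Logistic-loss gradient for this example, shared by all threads of the
    //    block without ever being written to global memory.
    __shared__ float g;
    if (threadIdx.x == 0) {
        float p = 1.f / (1.f + expf(-partial[0]));
        g = (p - batch.label[row]) * lr;
    }
    __syncthreads();

    // 3. Scatter the update back to the touched weights only; atomics are needed
    //    because different examples in the batch may share features.
    for (int i = beg + threadIdx.x; i < end; i += blockDim.x)
        atomicAdd(&weights[batch.col_idx[i]], -g * batch.val[i]);
}
```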