MAT: Processing In-Memory Acceleration for Long-Sequence Attention

2021 
Attention-based machine learning models are used to capture long-term dependencies in sequential data. Processing these models on long sequences can be prohibitively costly because of their large memory consumption. In this work, we propose MAT, a processing in-memory (PIM) framework, to accelerate long-sequence attention models. MAT adopts a memory-efficient processing flow for attention models that processes sub-sequences in a pipeline with a much smaller memory footprint. MAT utilizes a reuse-driven data layout and an optimal sample scheduling to optimize the performance of PIM attention. We evaluate the efficiency of MAT on two emerging long-sequence tasks: natural language processing and medical image processing. Our experiments show that MAT is $2.7 \times$ faster and $3.4 \times$ more energy efficient than the state-of-the-art PIM accelerator. Compared to a TPU and a GPU, MAT is $5.1 \times$ and $16.4 \times$ faster while consuming $27.5 \times$ and $41.0 \times$ less energy, respectively.
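To make the sub-sequence pipeline idea concrete, the sketch below shows blockwise attention in plain NumPy: the key/value sequence is processed one block at a time with an online softmax, so the full $N \times N$ score matrix is never materialized. This is only an illustrative assumption of the kind of memory-efficient flow the abstract describes; the block size, function name, and NumPy implementation are not part of MAT, which maps such a flow onto PIM hardware rather than a CPU.

```python
# Illustrative sketch (not the authors' MAT implementation): blockwise attention
# with an online softmax, so peak extra memory is O(n * block) instead of O(n^2).
import numpy as np

def blockwise_attention(Q, K, V, block=256):
    """Numerically stable softmax(Q K^T / sqrt(d)) V, one key/value block at a time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T * scale   # only one (n, block) tile of scores in memory
        new_max = np.maximum(row_max, scores.max(axis=1))
        # rescale previously accumulated numerator and denominator to the new max
        correction = np.exp(row_max - new_max)
        probs = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + probs.sum(axis=1)
        out = out * correction[:, None] + probs @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Usage: a long sequence with 64-dimensional heads fits comfortably because the
# score matrix is computed and consumed tile by tile.
rng = np.random.default_rng(0)
Q = rng.standard_normal((16384, 64))
K = rng.standard_normal((16384, 64))
V = rng.standard_normal((16384, 64))
O = blockwise_attention(Q, K, V)
```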