MAT: Processing In-Memory Acceleration for Long-Sequence Attention

2021 
Attention-based machine learning models are used to capture long-term dependencies in sequential data. Processing these models on long sequences can be prohibitively costly because of their large memory consumption. In this work, we propose MAT, a processing in-memory (PIM) framework, to accelerate long-sequence attention models. MAT adopts a memory-efficient processing flow for attention models that processes sub-sequences in a pipeline with a much smaller memory footprint. MAT utilizes a reuse-driven data layout and an optimal sample scheduling to optimize the performance of PIM attention. We evaluate the efficiency of MAT on two emerging long-sequence tasks: natural language processing and medical image processing. Our experiments show that MAT is $2.7 \times$ faster and $3.4 \times$ more energy efficient than the state-of-the-art PIM accelerator. Compared to a TPU and a GPU, MAT is $5.1 \times$ and $16.4 \times$ faster while consuming $27.5 \times$ and $41.0 \times$ less energy, respectively.
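To make the sub-sequence pipeline idea concrete, the sketch below shows blockwise attention in plain NumPy: the key/value sequence is processed one block at a time with an online softmax, so the full $N \times N$ score matrix is never materialized. This is only an illustrative assumption of the kind of memory-efficient flow the abstract describes; the block size, function name, and NumPy implementation are not part of MAT, which maps such a flow onto PIM hardware rather than a CPU.

```python
# Illustrative sketch (not the authors' MAT implementation): blockwise attention
# with an online softmax, so peak extra memory is O(n * block) instead of O(n^2).
import numpy as np

def blockwise_attention(Q, K, V, block=256):
    """Numerically stable softmax(Q K^T / sqrt(d)) V, one key/value block at a time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T * scale   # only one (n, block) tile of scores in memory
        new_max = np.maximum(row_max, scores.max(axis=1))
        # rescale previously accumulated numerator and denominator to the new max
        correction = np.exp(row_max - new_max)
        probs = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + probs.sum(axis=1)
        out = out * correction[:, None] + probs @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Usage: a long sequence with 64-dimensional heads fits comfortably because the
# score matrix is computed and consumed tile by tile.
rng = np.random.default_rng(0)
Q = rng.standard_normal((16384, 64))
K = rng.standard_normal((16384, 64))
V = rng.standard_normal((16384, 64))
O = blockwise_attention(Q, K, V)
```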