E.T.: re-thinking self-attention for transformer models on GPUs

Shiyang Chen,Shaoyi Huang,Santosh Pandey,Bingbing Li,Guang R. Gao,Long Zheng,Caiwen Ding,Hang Liu

E.T.: re-thinking self-attention for transformer models on GPUs

2021

Transformer-based deep learning models have become a ubiquitous vehicle to drive a variety of Natural Language Processing (NLP) related tasks beyond their accuracy ceiling. However, these models also suffer from two pronounced challenges, that is, gigantic model size and prolonged turnaround time. To this end, we introduce ET. that rE-thinks self-attention computation for Transformer models on GPUs with the following contributions: First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence length-aware optimizations, and operation reordering optimizations. Second, we present an attention-aware pruning design which judiciously uses various pruning algorithms to reduce more computations hence achieves significantly shorter turnaround time. For the pruning algorithms, we not only revamp the existing pruning algorithms, but also tailor new ones for transformer models. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERTBASE and DistilBERT, where E.T. presents superior performance over the mainstream projects, including the popular Nvidia Enterprise solutions, i.e., TensorRT and FasterTransformer.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations