Triplet Attention: Rethinking the Similarity in Transformers

2021 
The Transformer model has benefited various real-world applications, where the dot-product self-attention mechanism shows a superior ability to align tokens and build long-range dependencies. However, self-attention that attends only to pairs of tokens limits further performance improvement on challenging tasks. To the best of our knowledge, this is the first work to define Triplet Attention (A3) for the Transformer, which introduces triplet connections as a complementary form of dependency. Specifically, we define triplet attention based on the scalar triple product, so it can be used interchangeably with the canonical attention inside multi-head attention. This allows the self-attention mechanism to attend to diverse triplets and capture complex dependencies. We then use a permuted formulation and kernel tricks to establish a linear approximation to A3. The proposed architecture can be smoothly integrated into pre-training by modifying the head configuration. Extensive experiments show that our method achieves significant performance improvements on various tasks and two benchmarks.
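The abstract only sketches the idea, but the core ingredient, scoring a query against a pair of keys with a scalar triple product, can be illustrated concretely. Below is a minimal PyTorch sketch of a naive O(n^3) triplet-scored attention. The chunking of the head dimension into 3-D slices (so the cross product is defined), the function names, and the choice to marginalize pair weights onto single value positions are all illustrative assumptions on my part, not the paper's exact A3 formulation.

```python
import torch

def triplet_scores(q, k1, k2):
    """Triplet scores via a generalized scalar triple product (assumed form).

    q, k1, k2: (n, d) with d divisible by 3. The head dimension is split into
    3-D chunks; q . (k1 x k2) is computed per chunk and summed, mirroring how
    the dot product sums elementwise products. Returns (n, n, n) scores.
    """
    n, d = q.shape
    assert d % 3 == 0, "head dim must split into 3-D chunks for the cross product"
    q3 = q.view(n, d // 3, 3)
    k13 = k1.view(n, d // 3, 3)
    k23 = k2.view(n, d // 3, 3)
    # cross product for every key pair (j, l): shape (n, n, d//3, 3)
    cross = torch.linalg.cross(k13[:, None], k23[None, :], dim=-1)
    # triple product q_i . (k_j x k_l), summed over chunks: (n, n, n)
    return torch.einsum('ics,jlcs->ijl', q3, cross)

def triplet_attention(q, k, v):
    """Softmax over all key pairs, then mix values (value mixing is an assumption)."""
    n, d = q.shape
    scores = triplet_scores(q, k, k) / d ** 0.5
    attn = torch.softmax(scores.view(n, -1), dim=-1).view(n, n, n)
    # marginalize the pair weight onto the first key index j, then mix values
    return torch.einsum('ijl,jd->id', attn, v)

q = torch.randn(8, 12); k = torch.randn(8, 12); v = torch.randn(8, 12)
out = triplet_attention(q, k, v)  # (8, 12)
```

The abstract also mentions a linear approximation via kernel tricks. The paper's permuted formulation is not reproducible from the abstract alone, but the generic kernel trick it alludes to is well known from linearized attention (Katharopoulos et al.): replace the softmax similarity with a feature map and reassociate the matrix products. A sketch of that standard identity, with an elu-based feature map as a stand-in:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Generic kernelized linear attention: softmax(q k^T) v is approximated
    by phi(q) (phi(k)^T v), computed right-to-left so the cost is linear in
    sequence length. Illustrates the kind of kernel trick the abstract cites;
    A3's own permuted formulation differs in detail.
    """
    qf = F.elu(q) + 1.0               # positive feature map phi
    kf = F.elu(k) + 1.0
    kv = kf.transpose(0, 1) @ v       # (d, d_v), aggregated once over keys
    z = qf @ kf.sum(dim=0)            # (n,) per-query normalizer
    return (qf @ kv) / z.unsqueeze(-1)
```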