Enhancing Transformer with Horizontal and Vertical Guiding Mechanisms for Neural Language Modeling

2021 
Language modeling is an important problem in Natural Language Processing (NLP), and the multi-layer Transformer network is currently the most advanced and effective model for this task. However, its multi-head self-attention structure has two inherent defects: (1) attention information loss: the lower-level attention weights cannot be explicitly passed to upper layers, which may cause the network to lose pivotal attention information captured by lower-level layers; (2) multi-head bottleneck: the dimension of each head in the vanilla Transformer is relatively small and each head is processed independently, which introduces an expressive bottleneck and makes subspace learning fundamentally inadequate. To overcome these two weaknesses, this paper proposes a novel neural architecture named Guide-Transformer. Guide-Transformer uses horizontal and vertical attention information to guide the original multi-head self-attention sublayer without introducing excessive complexity. Experimental results on three authoritative language modeling benchmarks demonstrate the effectiveness of Guide-Transformer: on the standard perplexity (ppl) and bits-per-character (bpc) evaluation metrics, it achieves moderate improvements over a strong baseline model.
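The abstract does not give the exact formulation of the guiding mechanisms, but the two ideas can be illustrated with a minimal sketch: vertical guidance blends the previous layer's attention weights into the current layer's weights (so lower-level attention is not lost), and horizontal guidance lets heads exchange information through a learned mixing of per-head attention maps (so heads are not fully independent). The module below is a hypothetical PyTorch illustration under those assumptions; the names GuidedMultiHeadSelfAttention, head_mix, and vertical_alpha are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GuidedMultiHeadSelfAttention(nn.Module):
    """Hypothetical sketch of a guided multi-head self-attention sublayer.

    Vertical guidance (assumption): attention weights from the previous layer
    are interpolated into the current layer's weights.
    Horizontal guidance (assumption): per-head attention maps are combined
    through a learned mixing matrix, relaxing head independence.
    """

    def __init__(self, d_model: int, n_heads: int, vertical_alpha: float = 0.5):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Horizontal guidance: learned head-mixing matrix (illustrative).
        self.head_mix = nn.Parameter(torch.eye(n_heads))
        # Vertical guidance: interpolation weight for lower-layer attention (illustrative).
        self.vertical_alpha = vertical_alpha

    def forward(self, x, prev_attn=None, mask=None):
        # x: (batch, seq, d_model); prev_attn: (batch, heads, seq, seq) or None
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)  # (batch, heads, seq, seq)

        # Horizontal guidance: convex combination of attention maps across heads.
        attn = torch.einsum("hg,bgij->bhij", self.head_mix.softmax(dim=-1), attn)

        # Vertical guidance: blend in the previous layer's attention weights.
        if prev_attn is not None:
            attn = (1 - self.vertical_alpha) * attn + self.vertical_alpha * prev_attn

        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        # Return the attention maps so the next layer can reuse them as prev_attn.
        return self.out(out), attn
```

In a stacked model, each layer would receive the attention maps returned by the layer below as prev_attn, so lower-level attention information propagates upward; both mixing operations keep the attention rows normalized because they are convex combinations of probability distributions.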