An Efficient Axial-Attention Network for Video-Based Person Re-Identification

2022 
The non-local self-attention mechanism can significantly improve feature representation by modeling long-range dependencies, but at the cost of high computational complexity. To address this issue, the self-attention-based autoregressive axial transformer applies attention along a single axis of the feature maps rather than over the whole maps with large receptive fields. For image data, it performs axial-attention twice in the spatial dimension, once along the height axis and once along the width axis. However, there is still room for improvement: the 2D spatial feature map can be converted into a 1D feature sequence so that axial-attention is performed only once, saving further computing resources. Motivated by this insight, we propose an Efficient Axial-Attention Network (EAAN) for video-based person re-identification (Re-ID) that reduces computation and improves accuracy by serializing feature maps with multiple granularities and reducing the number of axial-attention passes. We also introduce a deserialization approach that restores the shape of the feature maps, and we extend the CTN (Channel Transformer Network) to a wider range of uses. Furthermore, we verify that the serialized feature sequence enhances expressiveness in our EAAN at lower complexity. Experiments on the MARS and DukeMTMC-VideoReID (DukeV) datasets show outstanding computational efficiency and accuracy: EAAN outperforms the state-of-the-art method on MARS by 0.3% in both Rank-1 and mAP, surpasses it on DukeV by 0.1% in Rank-1 with equal mAP, and reduces parameters and GFLOPs (giga floating-point operations) by 16.9% and 6.6% respectively compared to another axial-attention-based method.
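The following is a minimal sketch of the serialize-attend-deserialize idea described above: the 2D spatial map is flattened into a 1D sequence, self-attention runs once along that single axis, and the result is reshaped back. The class and function names (AxialAttention1D, the multi-head attention choice, and the residual connection) are illustrative assumptions, not the authors' actual EAAN implementation, which additionally uses multi-granularity serialization and the CTN.

```python
# Hypothetical sketch only; names and layer choices are assumptions, not the paper's code.
import torch
import torch.nn as nn

class AxialAttention1D(nn.Module):
    """Self-attention applied once along a serialized (flattened) spatial axis."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a backbone
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)             # serialize: (B, H*W, C), one 1D axis
        out, _ = self.attn(seq, seq, seq)              # attention runs only once over the sequence
        out = out.transpose(1, 2).reshape(b, c, h, w)  # deserialize: restore (B, C, H, W)
        return out + x                                 # residual connection

# Usage example with a typical Re-ID feature-map shape (assumed for illustration)
feat = torch.randn(2, 256, 16, 8)
layer = AxialAttention1D(channels=256)
print(layer(feat).shape)  # torch.Size([2, 256, 16, 8])
```

Compared with running axial-attention separately along the height and width axes, this single pass over the length-H*W sequence avoids one attention invocation, which is the source of the computation savings the abstract claims.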