Hierarchical Attention Image-Text Alignment Network For Person Re-Identification

2021 
Description-based Person Re-identification (Re-ID) is a crucial cross-modality task that aims to retrieve a specific person given a textual description. Existing description-based Re-ID methods focus on learning robust representations to effectively measure the similarity between the global features of the two modalities. However, such global mapping disregards semantic consistency between local visual and linguistic features. Two further challenges remain: alignment uncertainty, which arises from poor correspondence between text-image pairs, and text complexity, which arises from irrelevant words. To address these, we propose an end-to-end Hierarchical Attention Image-Text Alignment Network, named HAITA-Net. Our model comprises: i) a hierarchical attention alignment network that determines the potential relationships between image content and textual information at three levels, namely word-patch, phrase-patch, and sentence-image, to address alignment uncertainty; and ii) a new Term Frequency-Inverse Document Frequency (TF-IDF) thresholding strategy that extracts salient tokens to alleviate text complexity. The network is optimized end-to-end via a joint weighted hierarchical attention loss and cross-modal loss. Extensive experiments demonstrate the effectiveness of our method.
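
To make the word-patch alignment level concrete, the following is a minimal sketch, not the authors' implementation: the encoder outputs, feature dimension, scaled dot-product attention, and cosine-similarity scoring are all assumptions standing in for details not given in the abstract.

```python
import torch
import torch.nn.functional as F

def word_patch_alignment(words, patches):
    """Score one text-image pair at the word-patch level.

    words:   (B, W, D) word features from a text encoder (assumed)
    patches: (B, P, D) patch features from an image encoder (assumed)
    returns: (B,) alignment score per pair
    """
    d = words.size(-1)
    # Each word attends over all image patches (scaled dot-product).
    attn = torch.softmax(words @ patches.transpose(1, 2) / d ** 0.5, dim=-1)  # (B, W, P)
    # Attended visual context vector for each word.
    context = attn @ patches                                                  # (B, W, D)
    # Word-to-context similarity, averaged over words for a pair-level score.
    sim = F.cosine_similarity(words, context, dim=-1)                         # (B, W)
    return sim.mean(dim=-1)

# Random features standing in for encoder outputs.
w = torch.randn(2, 12, 256)   # 12 words per caption
p = torch.randn(2, 48, 256)   # 48 patches per image
print(word_patch_alignment(w, p))  # tensor of shape (2,)
```

The phrase-patch and sentence-image levels would follow the same pattern at coarser granularities, with the three scores combined by the weighted hierarchical attention loss.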
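
The TF-IDF thresholding strategy can likewise be sketched. Assuming tokens are kept when their corpus-level TF-IDF weight exceeds a fixed threshold, this snippet uses scikit-learn for scoring; the threshold value and function names are illustrative choices, not the paper's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def salient_tokens(descriptions, query, threshold=0.15):
    """Keep tokens of `query` whose TF-IDF weight passes `threshold`."""
    vec = TfidfVectorizer()
    vec.fit(descriptions)                          # corpus of person descriptions
    weights = vec.transform([query]).toarray()[0]  # TF-IDF weight per vocab token
    vocab = vec.vocabulary_                        # token -> column index
    tokens = vec.build_analyzer()(query)           # same tokenization as the vectorizer
    return [t for t in tokens if t in vocab and weights[vocab[t]] > threshold]

corpus = ["a man wearing a black jacket and blue jeans",
          "a woman in a red dress carrying a handbag"]
print(salient_tokens(corpus, "the man is wearing a black jacket", threshold=0.1))
```

Frequent function words receive low weights and fall below the threshold, so only content-bearing tokens reach the alignment network.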