An Interpretable Visual Attention Plug-in for Convolutions

2020 
Raw images, which may contain many noisy background pixels, are typically used in convolutional neural network (CNN) training. This paper proposes a novel variance loss function based on a ground-truth mask of the target object to enhance the visual attention of a CNN. The loss function regularizes the training process so that the feature maps in the later convolutional layers focus more on target object areas and less on the background. The attention loss is computed directly from the feature maps, so no new parameters are added to the backbone network and no extra computational cost is incurred in the testing phase. The proposed attention model can be plugged into any pre-trained network architecture and can be used in conjunction with other attention models. Experimental results demonstrate that the proposed variance loss function improves classification accuracy by 2.22% over the baseline on the Stanford Dogs dataset, which is significantly higher than the improvements achieved by SENet (0.3%) and CBAM (1.14%). Our method also improves object detection accuracy by 2.5 mAP on the Pascal-VOC2007 dataset and store sign detection by 2.66 mAP over the respective baseline models. Furthermore, the proposed loss function enhances the visualization and interpretability of a CNN.
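
The abstract does not spell out the exact variance formulation, but the key idea (a mask-guided loss computed directly from feature maps, with no added parameters) can be illustrated with a minimal PyTorch sketch. The function name `mask_attention_loss` and the simplified background-suppression term below are illustrative assumptions, not the paper's exact loss.

```python
# Minimal sketch (NOT the paper's exact formulation) of a mask-guided attention
# loss computed directly from convolutional feature maps. Assumes a PyTorch
# backbone; `feature_maps` and `object_mask` are hypothetical names for the
# later-layer activations and the ground-truth object mask.
import torch
import torch.nn.functional as F


def mask_attention_loss(feature_maps: torch.Tensor,
                        object_mask: torch.Tensor) -> torch.Tensor:
    """Penalize feature-map energy that falls outside the object mask.

    feature_maps: (N, C, H, W) activations from a later convolutional layer.
    object_mask:  (N, 1, Hm, Wm) binary ground-truth mask of the target object.
    """
    n, c, h, w = feature_maps.shape
    # Resize the mask to the spatial resolution of the feature maps.
    mask = F.interpolate(object_mask.float(), size=(h, w), mode="nearest")

    # Collapse channels into a single spatial attention map and normalize it
    # so it sums to 1 over the spatial dimensions.
    attention = feature_maps.abs().mean(dim=1, keepdim=True)            # (N, 1, H, W)
    attention = attention / (attention.sum(dim=(2, 3), keepdim=True) + 1e-8)

    # Fraction of attention mass lying on background pixels; minimizing this
    # pushes activations toward the object region. No learnable parameters are
    # introduced, so inference cost is unchanged.
    background_mass = (attention * (1.0 - mask)).sum(dim=(2, 3))
    return background_mass.mean()


# Usage (hypothetical): total = cls_loss + lambda_att * mask_attention_loss(fmap, mask)
```

Because the loss only adds a regularization term during training, the trained backbone can be deployed unmodified, which is consistent with the abstract's claim of zero extra test-time cost.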