FiLMing Multimodal Sarcasm Detection with Attention

2021 
Sarcasm detection identifies natural language expressions whose intended meaning differs from their surface meaning. Today, social media produces an abundance of multimodal data in which users express opinions through both text and images. This paper leverages such multimodal data to improve the performance of existing sarcasm detection systems. We propose a novel architecture that uses the RoBERTa model with a co-attention layer to capture the context incongruity between the input text and image attributes. Further, we integrate feature-wise affine transformations (FiLM), conditioning the input image on the textual features through FiLMed ResNet blocks to capture multimodal information. The outputs of both components, together with the CLS token from RoBERTa, are concatenated for the final prediction. Our results demonstrate that the proposed model outperforms existing state-of-the-art methods by a 6.14% F1 score on the public Twitter multimodal sarcasm detection dataset (our code and data are available at https://tinyurl.com/kp2ruj7c).
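The core of the FiLM conditioning described above is a per-channel affine transform of the image feature maps, with the scale and shift predicted from the text representation. A minimal numpy sketch of that transform follows; the function name, projection matrices, and shapes are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def film(image_feats, text_feats, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise Linear Modulation (FiLM) sketch.

    image_feats: (C, H, W) feature maps from a ResNet block.
    text_feats:  (D,) pooled textual representation (e.g. from RoBERTa).
    W_gamma, W_beta: (C, D) hypothetical linear projections that map the
    text features to per-channel scale (gamma) and shift (beta).
    """
    gamma = W_gamma @ text_feats + b_gamma  # (C,) per-channel scale
    beta = W_beta @ text_feats + b_beta     # (C,) per-channel shift
    # Broadcast the affine parameters over the spatial dimensions.
    return gamma[:, None, None] * image_feats + beta[:, None, None]
```

With gamma fixed at 1 and beta at 0 the transform reduces to the identity, which makes the conditioning easy to sanity-check; in the full model the projections are learned jointly with the rest of the network.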