Simple and effective visual question answering in a single modality

2016 
Visual question answering (VQA) has emerged from rapid progress in computer vision and natural language processing; it requires a deep understanding of both images and questions and an effective integration of the two. Existing VQA methods simply concatenate visual and textual features or compare them via a dot product, which fails to bridge the semantic gap between the two modalities. We propose to transfer the VQA problem into a single modality and present a simple and effective baseline method that exploits the gating properties of Long Short-Term Memory (LSTM) networks to filter, from generic textual descriptions of the image, the particular information specified by the question. We provide thorough analysis and extensive experiments on a VQA benchmark dataset to compare the performance of different methods and demonstrate the effectiveness of the proposed approach.
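For concreteness, below is a minimal sketch of one way the single-modality idea could be realized: the image's generic descriptions (captions) and the question are both text, so a single LSTM can read them jointly and its gating can retain caption content relevant to the question. All module names, dimensions, and the joint caption-plus-question encoding here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SingleModalityVQA(nn.Module):
    """Hypothetical sketch: answer questions from image captions alone."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, caption_tokens, question_tokens):
        # Concatenate caption and question token ids into one text sequence;
        # the LSTM's gates can filter caption information specified by the question.
        tokens = torch.cat([caption_tokens, question_tokens], dim=1)
        embedded = self.embed(tokens)
        _, (h_n, _) = self.lstm(embedded)
        # Final hidden state summarizes the question-conditioned caption content.
        return self.classifier(h_n[-1])  # logits over candidate answers

# Usage with dummy data (batch of 2, caption length 20, question length 10):
model = SingleModalityVQA(vocab_size=10000)
captions = torch.randint(0, 10000, (2, 20))
questions = torch.randint(0, 10000, (2, 10))
logits = model(captions, questions)
print(logits.shape)  # torch.Size([2, 1000])
```

Treating answer selection as classification over a fixed answer vocabulary is a common baseline choice in VQA; the key point illustrated is that once the image is replaced by its descriptions, no cross-modal fusion (concatenation or dot product) is needed.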