Simple and effective visual question answering in a single modality

2016 
Visual question answering (VQA) has emerged from rapid progress in computer vision and natural language processing; it requires a deep understanding of both images and questions and an effective integration of the two. Existing VQA methods simply concatenate visual and textual features or compare them via a dot product, which fails to bridge the semantic gap between the two modalities. We propose to transfer the VQA problem into a single modality and present a simple and effective baseline method that exploits the gating properties of Long Short-Term Memory (LSTM) networks to filter, from generic textual descriptions of the image, the particular information specified by the question. We provide thorough analysis and extensive experiments on a VQA benchmark dataset to compare the performance of different methods and demonstrate the effectiveness of the proposed approach.
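For concreteness, below is a minimal sketch of one way the single-modality idea could be realized: the image's generic descriptions (captions) and the question are both text, so a single LSTM can read them jointly and its gating can retain caption content relevant to the question. All module names, dimensions, and the joint caption-plus-question encoding here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SingleModalityVQA(nn.Module):
    """Hypothetical sketch: answer questions from image captions alone."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, caption_tokens, question_tokens):
        # Concatenate caption and question token ids into one text sequence;
        # the LSTM's gates can filter caption information specified by the question.
        tokens = torch.cat([caption_tokens, question_tokens], dim=1)
        embedded = self.embed(tokens)
        _, (h_n, _) = self.lstm(embedded)
        # Final hidden state summarizes the question-conditioned caption content.
        return self.classifier(h_n[-1])  # logits over candidate answers

# Usage with dummy data (batch of 2, caption length 20, question length 10):
model = SingleModalityVQA(vocab_size=10000)
captions = torch.randint(0, 10000, (2, 20))
questions = torch.randint(0, 10000, (2, 10))
logits = model(captions, questions)
print(logits.shape)  # torch.Size([2, 1000])
```

Treating answer selection as classification over a fixed answer vocabulary is a common baseline choice in VQA; the key point illustrated is that once the image is replaced by its descriptions, no cross-modal fusion (concatenation or dot product) is needed.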