A Hierarchical Multimodal Attention-based Neural Network for Image Captioning

2017 
A novel hierarchical multimodal attention-based model is developed in this paper to generate more accurate and descriptive captions for images. Our model is an end-to-end neural network containing three related sub-networks: a deep convolutional neural network that encodes the image content, a recurrent neural network that identifies the objects in the image sequentially, and a multimodal attention-based recurrent neural network that generates the image caption. The main contribution of our work is that the hierarchical structure and the multimodal attention mechanism are applied together, so that each caption word is generated with multimodal attention over both the intermediate semantic objects and the global visual content. Our experiments on two benchmark datasets have obtained very positive results.
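
The decoding step described above can be illustrated with a minimal PyTorch-style sketch. This is an assumption-laden reconstruction, not the authors' implementation: the class name `MultimodalAttentionDecoder`, the additive attention scorers, and all dimensions (`embed_dim`, `hidden_dim`, `feat_dim`) are hypothetical choices made only to show how each caption word could attend to both the object embeddings and the global visual features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentionDecoder(nn.Module):
    """Hypothetical caption decoder: at each step it attends over two
    modalities, the global visual features from the CNN encoder and the
    intermediate semantic object embeddings from the object RNN."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Input = previous word embedding + one context vector per modality.
        self.rnn = nn.GRUCell(embed_dim + 2 * feat_dim, hidden_dim)
        # Separate additive-attention scorers for each modality.
        self.vis_score = nn.Linear(hidden_dim + feat_dim, 1)
        self.obj_score = nn.Linear(hidden_dim + feat_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, h, feats, scorer):
        # h: (batch, hidden_dim); feats: (batch, n, feat_dim)
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = scorer(torch.cat([h_exp, feats], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)            # attention weights
        return (alpha.unsqueeze(-1) * feats).sum(1)  # weighted context vector

    def forward(self, word, h, vis_feats, obj_feats):
        # One decoding step: previous word -> distribution over next word.
        vis_ctx = self.attend(h, vis_feats, self.vis_score)
        obj_ctx = self.attend(h, obj_feats, self.obj_score)
        x = torch.cat([self.embed(word), vis_ctx, obj_ctx], dim=-1)
        h = self.rnn(x, h)
        return self.out(h), h
```

Under this sketch, a full caption would be produced by looping the forward step from a start token until an end token is emitted, re-attending to both modalities at every word, which mirrors the hierarchical design the abstract describes.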