Research on Image Caption Based on Multiple Word Embedding Representations

2021 
Word embedding representation has been a research hot spot in computer vision and natural language processing, and the semantic quality of the representation strongly affects the performance of machine translation and image captioning. The key factors determining embedding quality are the ability to capture contextual semantics and to resolve polysemy. In this paper, the principles of the current mainstream word embedding methods are examined: sparse representation (One-Hot), static representation (Word2Vec, GloVe), dynamic representation (ELMo, BERT), and position-based representation (NN.Embedding with position embedding). The impact of these embeddings on the image captioning task is then investigated: the methods are compared experimentally on the MSCOCO dataset using the same image captioning model, and their pros and cons are judged by the accuracy and richness of the generated caption text. Experimental results show that the BERT representation, with its dynamic bidirectional semantics, performs best on image captioning but requires the longest training time, while the NN.Embedding representation achieves the second-best caption quality and trains the fastest.
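The sparse-versus-dense distinction at the core of the comparison above can be illustrated with a minimal pure-Python sketch; the toy vocabulary, embedding dimension, and random table values are hypothetical, standing in for what Word2Vec, GloVe, or an untrained NN.Embedding layer would supply:

```python
import random

# Toy vocabulary (hypothetical, for illustration only).
vocab = ["a", "cat", "sits", "on", "the", "mat"]
idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: |V|-dimensional, exactly one non-zero entry."""
    v = [0.0] * len(vocab)
    v[idx[word]] = 1.0
    return v

# Static dense representation: a fixed lookup table of low-dimensional
# real-valued vectors (here random; a trained model would learn them).
random.seed(0)
dim = 4
table = {w: [random.uniform(-1.0, 1.0) for _ in range(dim)] for w in vocab}

print(one_hot("cat"))  # length |V| = 6, a single 1.0 among zeros
print(table["cat"])    # length dim = 4, dense real-valued vector
```

Dynamic representations such as ELMo and BERT differ from this sketch in that the vector for "cat" would also depend on the surrounding sentence, which is what lets them resolve polysemy.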