End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

2021 
Orally describing what we see is a simple task we perform in daily life. In natural language processing, however, this task is usually bridged by a textual modality that helps the system generalize across the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that requires no textual information during training. We use a vector-quantized variational autoencoder (VQ-VAE) to learn a discrete representation of the speech caption in an unsupervised manner, and these discrete labels are then predicted by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while still maintaining the quality of the generated speech captions. Our experimental results on a multi-speaker natural speech dataset demonstrate that the proposed text-free Image2Speech system performs close to a system trained with textual information. Furthermore, our approach also outperforms the most recent phoneme-based and grounding-based Image2Speech frameworks.
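To make the core mechanism concrete, the sketch below shows the nearest-neighbour vector quantization step of a VQ-VAE, which turns continuous speech-encoder frames into discrete unit indices that an image-captioning model can predict in place of text. This is a minimal illustration, not the authors' implementation; the codebook size, code dimension, and commitment weight are assumed hyperparameters.

```python
# Minimal VQ-VAE quantizer sketch (illustrative only, not the paper's code).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        # Learned codebook of discrete "speech units" (sizes are assumptions).
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e: torch.Tensor):
        # z_e: encoder output for a speech caption, shape (batch, frames, code_dim)
        flat = z_e.reshape(-1, z_e.size(-1))
        # Squared distance from each frame to every codebook vector.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)               # discrete unit IDs
        z_q = self.codebook(indices).view_as(z_e)  # quantized frames
        # Standard VQ-VAE codebook and commitment losses.
        loss = ((z_q - z_e.detach()).pow(2).mean()
                + self.beta * (z_e - z_q.detach()).pow(2).mean())
        # Straight-through estimator so gradients flow back to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss

if __name__ == "__main__":
    vq = VectorQuantizer()
    frames = torch.randn(2, 100, 64)        # e.g. 100 encoder frames per caption
    z_q, unit_ids, vq_loss = vq(frames)
    print(unit_ids.shape, vq_loss.item())   # (2, 100) discrete labels for the captioner
```

In such a pipeline, the image-captioning model is trained to emit these unit IDs from an image, and a separate unit-to-speech synthesizer reconstructs the waveform, so no transcription is ever required.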