End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages
2021
Describing aloud what we see is a simple task we perform in daily life. In natural language processing, however, this task is typically bridged by a textual modality that helps the system generalize over the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that requires no textual information during training. We use a vector-quantized variational autoencoder (VQ-VAE) to learn a discrete representation of the speech caption in an unsupervised manner, and these discrete labels serve as the targets of an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while still maintaining the quality of the speech captions. Our experimental results on a multi-speaker natural speech dataset demonstrate that the proposed text-free Image2Speech system performs close to one trained with textual information. Furthermore, our approach also outperforms the most recent existing phoneme-based and grounding-based Image2Speech frameworks.
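To illustrate the core idea of replacing text with self-supervised discrete units, below is a minimal sketch of a VQ-VAE quantization bottleneck in PyTorch. It is not the authors' implementation; the class name, codebook size, and dimensions are illustrative assumptions. It shows how continuous speech-encoder features are mapped to discrete codebook indices, which can then stand in for transcripts as captioning targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup used as a VQ-VAE bottleneck (illustrative sketch)."""
    def __init__(self, num_codes=256, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encoder outputs for a speech caption
        flat = z_e.reshape(-1, z_e.size(-1))
        # Squared L2 distance from every frame to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)                # discrete labels for each speech frame
        z_q = self.codebook(indices).view_as(z_e)   # quantized representation
        # Codebook + commitment losses, with a straight-through gradient for the decoder path
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss

# Usage sketch: the discrete indices replace text/phoneme transcripts as captioning targets.
quantizer = VectorQuantizer()
z_e = torch.randn(2, 100, 64)        # dummy encoder output for two speech captions
z_q, labels, vq_loss = quantizer(z_e)
print(labels.shape)                   # torch.Size([2, 100])
```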