When Hearing the Voice, Who Will Come to Your Mind

Hong Zhenhou, Wang Jianzong, Wei Wenqi, Jie Liu, Qu Xiaoyang, Bo Chen, Zihang Wei, Xiao Jing

2021

Speech carries rich biological information about the speaker, including identity attributes such as age, gender, and race. In this paper, we explore a self-supervised method that extracts speaker identity information from high-dimensional speech representations in order to generate a face image. Because the biological information carried by a piece of speech can also be expressed in other forms (such as images), we design a cross-modal knowledge distillation method that transfers feature information from the visual domain to the speech domain. The feature vectors obtained through self-supervised learning and knowledge distillation are fed into a GAN-based generative model to produce facial images containing speaker information. Subjective experiments show that our model performs well on the speaker identification task. Experiments also show that the proposed method effectively establishes a connection between modalities and generates faces rich in biological information.
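The abstract does not state which distillation objective is used. As a minimal sketch of the general idea only, assuming a feature-matching loss in which a face (visual) embedding acts as the teacher and a speech embedding as the student, the hypothetical helper `distillation_loss` below computes the mean squared error between the two L2-normalized vectors. The function names, the choice of loss, and the embedding sizes are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def distillation_loss(speech_feat, face_feat):
    """Mean squared error between normalized student and teacher features.

    Assumption: speech_feat is the student (speech encoder) embedding and
    face_feat is the teacher (visual encoder) embedding of the same speaker.
    """
    if len(speech_feat) != len(face_feat):
        raise ValueError("student and teacher features must have the same length")
    student = l2_normalize(speech_feat)
    teacher = l2_normalize(face_feat)
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Embeddings pointing in the same direction give zero loss.
aligned = distillation_loss([1.0, 0.0], [2.0, 0.0])
# Orthogonal embeddings give a larger loss.
orthogonal = distillation_loss([1.0, 0.0], [0.0, 1.0])
```

Minimizing a loss of this kind pulls the speech embedding toward the visual embedding of the same speaker, which is one common way to transfer information from one modality to another.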
