Talking Face Generation Based on Information Bottleneck and Complementary Representations

2021 
Audio-driven talking face generation is an active research direction in virtual reality. Its main challenge is keeping the generated lip shapes of the speaker in sync with the input audio. To address this challenge, we propose a novel method that synthesizes lip-synchronized, high-quality, realistic video from input audio. We first decompose the target person's video frames into 3D face model parameters and insert an information bottleneck into the audio-to-expression network so that it learns the mapping from audio features to expression parameters. We then replace the expression parameters of each target video frame with those predicted from the audio and re-render the face. Finally, we feed a high-level audio embedding extracted from the raw audio and a lip-landmark embedding into the neural rendering network. The 3D face shapes, 2D landmarks, and audio embedding provide complementary information to the neural rendering network, which ensures the generation of lip-synchronized, high-quality video portraits from the synthesized rendered faces. Experimental results show that, compared with other talking face generation methods, our method achieves the best lip synchronization while maintaining high video definition.
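To make the audio-to-expression stage concrete, the sketch below shows one common way to realize an information bottleneck in such a network: a variational bottleneck that compresses per-frame audio features into a stochastic latent code before decoding expression parameters. This is a minimal illustration assuming PyTorch; the layer sizes, the 80-dimensional mel-style audio features, the 64-dimensional bottleneck, the 64 expression parameters, and the KL weight `beta` are all illustrative assumptions, not values or code from the paper.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Hypothetical audio-to-expression network with a variational
    information bottleneck (a sketch, not the authors' implementation)."""

    def __init__(self, audio_dim=80, bottleneck_dim=64, expr_dim=64):
        super().__init__()
        # Audio encoder: per-frame audio features -> hidden representation.
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Information bottleneck: predict mean and log-variance of a
        # stochastic latent, so speech-irrelevant detail can be discarded.
        self.to_mu = nn.Linear(256, bottleneck_dim)
        self.to_logvar = nn.Linear(256, bottleneck_dim)
        # Decoder: bottleneck code -> 3D face model expression parameters.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, expr_dim),
        )

    def forward(self, audio_feat):
        h = self.encoder(audio_feat)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick keeps the sampling step differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        expr = self.decoder(z)
        # KL divergence to a standard normal prior: the bottleneck pressure.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return expr, kl

# Training-step sketch: regress expression parameters extracted from the
# target video (expr_gt here is random placeholder data), with the KL term
# weighted by an assumed beta = 0.01.
model = AudioToExpression()
audio_feat = torch.randn(8, 80)  # batch of per-frame audio features
expr_gt = torch.randn(8, 64)     # expression parameters from 3D face fitting
expr_pred, kl = model(audio_feat)
loss = nn.functional.mse_loss(expr_pred, expr_gt) + 0.01 * kl
loss.backward()
```

The bottleneck's KL penalty trades reconstruction accuracy against latent capacity, which is the usual mechanism by which such a layer forces the latent code to keep only the audio information predictive of expression.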