Discrete acoustic space for an efficient sampling in neural text-to-speech

Marek Střelec, Jonas Rohnke, Antonio Bonafonte, Mateusz Lajszczak, Trevor Wood

2021

We present SVQ-VAE, an architecture for neural text-to-speech (NTTS) that uses a split vector quantizer, as an enhancement to the well-known VAE and VQ-VAE architectures. Compared with these earlier architectures, the proposed model retains the benefits of an utterance-level bottleneck while reducing the associated loss of representation power. We train the model on recordings in the highly expressive domain of task-oriented dialogues and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE acoustic space is predictable from text, reducing the gap between standard constant-vector synthesis and vocoded recordings by 32%.
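
The idea of a split vector quantizer can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the number of splits, the codebook sizes, and the random codebooks are all hypothetical, and training details (how the codebooks and encoder are learned) are omitted. The sketch only shows the quantization step: an utterance-level vector is cut into equal chunks, and each chunk is replaced by its nearest entry in that chunk's own codebook.

```python
import random


def nearest_code(sub_vector, codebook):
    """Return the index of the codebook entry closest to sub_vector (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(sub_vector, codebook[i]))


def split_quantize(vector, codebooks):
    """Split `vector` into len(codebooks) equal chunks and quantize each one
    independently against its own codebook."""
    num_splits = len(codebooks)
    if len(vector) % num_splits != 0:
        raise ValueError("vector length must be divisible by the number of splits")
    chunk = len(vector) // num_splits
    indices, quantized = [], []
    for s, codebook in enumerate(codebooks):
        sub = vector[s * chunk:(s + 1) * chunk]
        idx = nearest_code(sub, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Hypothetical sizes, chosen only for illustration.
NUM_SPLITS, CODES_PER_SPLIT, CHUNK_DIM = 4, 8, 2

rng = random.Random(0)
# One small random codebook per split (a trained model would learn these).
codebooks = [
    [[rng.uniform(-1, 1) for _ in range(CHUNK_DIM)] for _ in range(CODES_PER_SPLIT)]
    for _ in range(NUM_SPLITS)
]
# Stand-in for an utterance-level encoder output.
utterance_embedding = [rng.uniform(-1, 1) for _ in range(NUM_SPLITS * CHUNK_DIM)]

indices, quantized = split_quantize(utterance_embedding, codebooks)

# Number of distinct quantized vectors the split quantizer can represent,
# versus the number of code vectors it has to store.
num_combinations = CODES_PER_SPLIT ** NUM_SPLITS
num_stored_codes = CODES_PER_SPLIT * NUM_SPLITS
```

With these assumed sizes, the quantizer stores 32 code vectors but can represent 8^4 = 4096 distinct utterance-level vectors, whereas a single codebook of 8 entries over the full vector could represent only 8. This is one way to read the abstract's claim that the discrete bottleneck is kept while the loss of representation power is reduced.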

Keywords:

Domain (software engineering)
Bottleneck
Sampling (signal processing)
Power (physics)
Computer science
Acoustic space
Speech recognition
Speech synthesis
Representation (mathematics)
Naturalness
