Persian speech synthesis using enhanced Tacotron based on multi-resolution convolution layers and a convex optimization method

2021 
An end-to-end text-to-speech (TTS) system generates acoustic features directly from input text and synthesizes speech from them. Two challenges arise when applying such models to Persian: the lack of suitable data, and the need to detect exceptions and Ezafe between words implicitly (without a grapheme-to-phoneme module). In this paper, we propose to use an end-to-end TTS system, Tacotron2, and suggest solutions for these problems. To address the lack of data, we collect a dataset suited to end-to-end text-to-speech, comprising 21 hours of Persian speech and the corresponding text. To address exception and Ezafe detection, we use multi-resolution convolution and part-of-speech embedding layers in the encoder of Tacotron2. In addition, the Mel-spectrogram generation process of Tacotron2 is unstable because of the high dropout rate at inference time. To handle this problem, we propose to use a convex optimization method named Net-Trim. Experimental results show that the proposed method raises the mean opinion score of Tacotron2 from 3.01 to 3.97. Furthermore, the proposed method decreases Mel cepstral distortion in comparison with Tacotron2.
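The abstract reports Mel cepstral distortion (MCD) without stating which variant was used. As a rough illustration only, the sketch below computes the commonly used definition, averaged over frames, in plain Python. It assumes the two mel-cepstral sequences are already time-aligned (for example by dynamic time warping) and that the 0th (energy) coefficient has been dropped by the caller. The function name `mel_cepstral_distortion` is hypothetical and is not taken from the paper.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    ref, syn: lists of frames, each frame a list of cepstral coefficients.
    Assumption: frames are aligned and the 0th coefficient is excluded.
    """
    if not ref or len(ref) != len(syn):
        raise ValueError("sequences must be non-empty and have equal length")

    total = 0.0
    for ref_frame, syn_frame in zip(ref, syn):
        if len(ref_frame) != len(syn_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance between the two coefficient vectors.
        sq_dist = sum((a - b) ** 2 for a, b in zip(ref_frame, syn_frame))
        total += math.sqrt(2.0 * sq_dist)

    # The constant 10 / ln(10) converts the distance to decibels.
    return (10.0 / math.log(10.0)) * total / len(ref)
```

Lower values indicate that the synthesized cepstra are closer to the reference, which is the sense in which the abstract says the proposed method decreases MCD relative to Tacotron2.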