A Phoneme Sequence Driven Lightweight End-To-End Speech Synthesis Approach
Abstract:
This paper develops an end-to-end neural network model for a text-to-speech (TTS) system driven by phoneme sequences. Inspired by Tacotron-2, the proposed model adopts an encoder-decoder architecture with an attention mechanism and uses the mel-spectrogram as the intermediate acoustic feature. Phoneme sequences replace the character sequences of Tacotron-2 in order to overcome the limitations of character-level features. Unlike conventional concatenative TTS systems, the model generates waveforms directly from the phoneme sequence. In addition, by analogy with text analysis, a new analysis methodology is proposed for phoneme analysis. Experimental results on the LJ Speech dataset show that, compared with a character-based model, the proposed model achieves comparable or better performance.
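The phoneme front-end the abstract describes amounts to a lexicon lookup followed by integer encoding. A minimal sketch, assuming a toy lexicon (the two entries follow CMUdict-style ARPAbet, but the lexicon, `<unk>` marker, and id scheme here are illustrative, not the paper's implementation):

```python
# Hypothetical, tiny pronunciation lexicon (CMUdict-style ARPAbet entries).
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "AH0", "S"],
}

def text_to_phonemes(text, unk="<unk>"):
    """Map each word to its phonemes, marking out-of-vocabulary words."""
    seq = []
    for word in text.lower().split():
        seq.extend(LEXICON.get(word, [unk]))
    return seq

def phonemes_to_ids(seq, table=None):
    """Assign integer ids in first-seen order, as a model input layer would."""
    table = {} if table is None else table
    return [table.setdefault(p, len(table)) for p in seq], table
```

A real system would back this lookup with a full pronunciation dictionary plus a grapheme-to-phoneme model for out-of-vocabulary words.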
Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism in a single neural network and require only parallel audio-text pairs. The language model component of the end-to-end model is therefore trained only on transcribed audio-text pairs, which degrades performance, especially on rare words. While a variety of work has looked at incorporating an external LM trained on text-only data into the end-to-end framework, none of it has taken into account the characteristic error distribution made by the model. In this paper, we propose a novel approach to utilizing text-only data: training a spelling correction (SC) model to explicitly correct those errors. On the LibriSpeech dataset, we demonstrate that the proposed model yields an 18.6% relative improvement in WER over the baseline when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list with an external LM.
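The word error rate (WER) figures quoted above are word-level edit distance (substitutions, insertions, deletions) normalized by reference length; a minimal sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # row for the empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (rw != hw)) # substitution (or match)
        prev = cur
    return prev[len(h)] / len(r)
```

An 18.6% *relative* improvement means the new WER is 18.6% lower than the baseline WER, not 18.6 points lower.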
Subword units are commonly used for end-to-end automatic speech recognition (ASR), yet a fully acoustic-oriented subword modeling approach has been missing. We propose an acoustic data-driven subword modeling (ADSM) approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustically structured subword units and an acoustically matched target sequence for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches, including CTC, RNN-Transducer, and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation and more balanced sequence lengths, and is thus suitable for both time-synchronous and label-synchronous models. We also briefly describe how to apply acoustic-based subword regularization and unseen-text segmentation using ADSM.
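For contrast with ADSM's acoustic-driven labels, the text-based BPE baseline it is compared against learns subwords purely from symbol co-occurrence counts: repeatedly merge the most frequent adjacent symbol pair. A minimal sketch of that merge-learning loop (the `</w>` end-of-word marker is the usual convention; the data is illustrative):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a word list, starting at character level."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for sym, freq in vocab.items():   # apply the merge everywhere
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

ADSM's point is precisely that these merge decisions see only text statistics, never the acoustics.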
It has been widely recognized that the FFT-based spectrogram does not provide good simultaneous resolution in both the time and frequency domains. A new method of spectral analysis has been developed based on the Gabor expansion and the Wigner–Ville distribution, and the resolution of the resulting Gabor spectrogram is twice that of an FFT-based spectrogram. In this report, FFT-based and Gabor spectrograms are compared for 5 English vowels, 6 stop consonants, 4 fricatives, and vowel formant transitions in a CVC context for 6 normal subjects. The results demonstrate that the Gabor spectrogram is a promising alternative to the FFT-based spectrogram for speech analysis because of its higher temporal and frequency resolution.
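The fixed-window FFT spectrogram being compared against can be sketched in a few lines; the window length `n_fft` sets the time-frequency trade-off the report criticizes (a longer window sharpens frequency resolution but blurs time, and vice versa; the parameter values below are illustrative):

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=64):
    """Magnitude-squared STFT with a Hann window; rows are frames, columns bins."""
    win = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * win
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
```

The Gabor spectrogram sidesteps this single-window compromise by combining the Gabor expansion with the Wigner–Ville distribution, which this sketch does not attempt.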
Speech synthesis is widely used in many practical applications, and the technology has developed rapidly in recent years. One reason synthetic speech still sounds unnatural is that its spectra are often over-smoothed. To improve the naturalness of synthetic speech, we first extract the mel-spectrogram of the speech and convert it into a real-valued image, then feed the over-smoothed mel-spectrogram image into an image-to-image translation Generative Adversarial Network (GAN) framework to generate a more realistic mel-spectrogram. The results show that this method greatly reduces the over-smoothness of the synthesized speech and brings it closer to the mel-spectrogram of real speech.
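The over-smoothness this abstract targets can be quantified with a simple total-variation statistic: over-smoothed spectrograms have smaller frame-to-frame and bin-to-bin differences than real ones. Both functions below are illustrative stand-ins (a box filter mimicking over-smoothing, not the paper's GAN):

```python
import numpy as np

def total_variation(S):
    """Mean absolute difference along frequency (rows) and time (columns)."""
    return np.abs(np.diff(S, axis=0)).mean() + np.abs(np.diff(S, axis=1)).mean()

def box_smooth(S, k=3):
    """Moving average over time: a crude stand-in for over-smoothing."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, S)
```

A GAN trained for image-to-image translation would be pushed, by its adversarial loss, toward restoring the fine detail that the smoothing removed.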
Anomalous Sound Detection (ASD) aims to identify whether the sound emitted from a machine is anomalous. Most advanced methods use 2-D CNNs to extract features of normal sounds from log-mel spectrograms, but such methods cannot fully exploit the temporal information of the log-mel spectrogram, resulting in poor performance on some machine types. In this paper, we propose a new framework for ASD, Spectrogram-Wavegram WaveNet (SW-WaveNet), which segments the 2-D log-mel spectrogram into 1-D waveform signals of different frequency bands and combines the representation vectors that WaveNet extracts from the segmented log-mel spectrograms and from Wavegrams, respectively. The framework exploits WaveNet's capability for modeling waveform signals to effectively extract temporal information from log-mel spectrograms and Wavegrams. Experiments on the DCASE 2020 Challenge Task 2 dataset show that our framework achieves higher average AUC (93.25%) and pAUC (87.41%) scores than previous work.
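The band-segmentation step in SW-WaveNet (turning the 2-D log-mel spectrogram into per-band 1-D signals that a WaveNet-style model can consume) reduces to plain array slicing; the shapes and equal-band split below are illustrative assumptions:

```python
import numpy as np

def band_split(logmel, n_bands):
    """Split a (mel_bins, frames) log-mel spectrogram into n_bands equal
    frequency bands, flattening each band into a 1-D signal."""
    mel_bins, frames = logmel.shape
    assert mel_bins % n_bands == 0, "illustrative: require an even split"
    per = mel_bins // n_bands
    return [logmel[i * per:(i + 1) * per].reshape(-1) for i in range(n_bands)]
```

Each resulting 1-D signal preserves the ordering over frames within its band, which is what lets a waveform model recover temporal structure a 2-D CNN may miss.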
While established methods for imaging the time-frequency content of speech, such as the spectrogram, have frequently been christened "voiceprinting," it is well known that neither it nor other currently popular imaging techniques can identify an individual's voice to more than a suggestive extent. The reassigned spectrogram (also known by other names) is a relatively little-known method [S. A. Fulop and K. Fitz, "Algorithms for computing the time-corrected instantaneous frequency (reassigned) spectrogram, with applications," J. Acoust. Soc. Am. 119, 360–377 (2006)] for imaging the time-frequency spectral information in a signal. It can show the instantaneous frequencies of signal components, as well as the occurrence of impulses, with dramatically increased precision compared to the spectrogram (the magnitude of the short-time Fourier transform) or any other energy-density time-frequency representation. It is shown here that a reassigned spectrogram image of a person's voice can be sufficiently individuating and consistent to serve as a true voiceprint, identifying that person and excluding all other persons to a high degree of confidence. This is achieved by focusing on just a few phonatory pulsations, thereby revealing the vocal-fold vibrational signature unique to each person.
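The core idea behind frequency reassignment, replacing each bin's nominal frequency with the instantaneous frequency measured from the STFT phase, can be illustrated with a two-frame phase-difference estimate. This is a sketch of the principle only, not the Fulop-Fitz algorithm (full reassignment also relocates energy in time); the signal and analysis parameters are arbitrary:

```python
import numpy as np

sr, n_fft, hop = 8000, 256, 40
f0 = 1005.0                       # a tone sitting 5 Hz off an FFT bin centre
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * f0 * t)
win = np.hanning(n_fft)

def frame(start):
    """Hann-windowed FFT of one analysis frame."""
    return np.fft.rfft(x[start:start + n_fft] * win)

X0, X1 = frame(1000), frame(1000 + hop)
k = int(round(f0 * n_fft / sr))            # nearest bin (32, i.e. 1000 Hz)
expected = 2 * np.pi * k * hop / n_fft     # phase advance of the bin centre
d = np.angle(X1[k]) - np.angle(X0[k]) - expected
d = (d + np.pi) % (2 * np.pi) - np.pi      # wrap to the principal value
f_inst = k * sr / n_fft + d * sr / (2 * np.pi * hop)
```

The phase deviation recovers the tone's frequency far more precisely than the 31.25 Hz bin grid alone, which is the extra precision the reassigned spectrogram exploits along the frequency axis.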