Developing speaker independent ASR system using limited data through prosody modification based on fuzzy classification of spectral bins

2019 
Abstract The primary motive of this study is to develop an automatic speech recognition (ASR) system using a limited amount of speech data such that it is minimally affected by speaker-dependent acoustic variations. This work focuses on two factors that contribute to inter-speaker variability: pitch and speaking-rate variations. To simulate such a limited-data scenario, an ASR system is trained on adults' speech and tested using speech data from adult as well as child speakers. Compared with the adults' speech test case, recognition rates degrade severely when the test speech is from child speakers. The observed degradation is due to large differences in pitch and speaking-rate between adults' and children's speech, along with other factors leading to inter-speaker acoustic variations. To overcome the mismatch in pitch and speaking-rate, two different approaches are proposed in this paper. In the first approach, the pitch and speaking-rate of the children's speech test set are explicitly modified using a recently proposed prosody modification technique that exploits fuzzy classification of spectral bins. In the second approach, the pitch and speaking-rate of the training data are modified to create new versions of the data. To capture greater acoustic variability, the original and the modified versions are then pooled together. The ASR system trained on the augmented data is noted to be more robust to pitch and speaking-rate variations. Consequently, relative improvements of 17% and 31% over the baseline are obtained on decoding the adults' and children's speech test sets, respectively.
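The pooling step of the second approach can be illustrated with a minimal sketch. It does not implement the prosody modification technique based on fuzzy classification of spectral bins; as a stand-in it uses plain linear-interpolation resampling, which changes pitch and speaking-rate together rather than independently. The function names `resample_linear` and `augment`, the modification factors, and the toy sine-wave "utterance" are all assumptions made for illustration.

```python
import math


def resample_linear(samples, factor):
    """Resample by linear interpolation; factor > 1 shortens the signal.

    Played back at the original sampling rate, the result has its pitch
    and speaking-rate scaled together by `factor`. This is a crude
    stand-in, NOT the prosody modification method used in the paper.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional read position in the input
        lo = min(int(pos), last)      # clamp to the final sample
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def augment(utterances, factors):
    """Pool the original utterances with one modified copy per factor."""
    pooled = list(utterances)
    for factor in factors:
        for utt_id, samples in utterances:
            pooled.append((f"{utt_id}_x{factor}", resample_linear(samples, factor)))
    return pooled


# Toy "training set": one second of a 100 Hz tone at 8 kHz (hypothetical data).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(sample_rate)]
train = [("utt1", tone)]

# Hypothetical modification factors; the original data is kept in the pool.
pooled = augment(train, factors=[0.9, 1.1])
```

In a real pipeline, the pooled list would feed feature extraction and acoustic-model training in place of the original training set, and the resampling step would be replaced by a method that controls pitch and speaking-rate separately.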