Enhancing robustness of zero resource children's speech recognition system through bispectrum based front-end acoustic features

2021 
Abstract
Automatic recognition of children's speech for low-resource languages is extremely challenging. This is because speech data from child speakers are unavailable for system training in such languages, i.e., these languages are zero-resource in terms of children's speech. Consequently, children's speech must be decoded with systems trained on a limited amount of adults' data. However, the acoustic mismatch between adults' and children's speech, such as differences in pitch and speaking rate, leads to highly degraded recognition rates. At the same time, since ASR systems embedded in smart devices are used in all kinds of environments, surrounding noise degrades the recognition rates further. Motivated by this, a noise-robust front-end acoustic feature extraction approach exploiting bispectrum analysis is proposed in this paper. The proposed features are robust to noise owing to the inherent immunity of the bispectrum to additive noise. An added advantage of bispectrum analysis is its reduced sensitivity to pitch, which in turn helps reduce the aforementioned pitch-induced acoustic mismatch. Further, to deal with the unavailability of speech data from child speakers, we have explored voice conversion through a cycle-consistent generative adversarial network (CGAN) to modify the acoustic attributes of adults' data. As noted in a listening test, voice conversion via CGAN renders adults' speech perceptually similar to children's speech. Augmenting the training data with the voice-converted adults' speech reduces the ill-effects of the acoustic mismatch between adults' and children's speech to a large extent, and the recognition performance improves significantly as a result. The experimental evaluations presented in this paper demonstrate that both the proposed features and CGAN-based voice conversion are well suited to zero-resource children's speech recognition under noisy conditions.
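The noise immunity claimed above rests on a standard property of higher-order spectra. The bispectrum B(f1, f2) = E[X(f1) X(f2) X*(f1 + f2)] is built from third-order cumulants, which vanish for zero-mean Gaussian processes and add across independent processes. Additive Gaussian noise that is independent of the signal therefore contributes nothing to the bispectrum in expectation. The sketch below is not the authors' front-end, which the abstract does not describe. It is a generic direct (FFT-based) bispectrum estimator, and the function name `bispectrum_direct`, the segment length of 64, the Hann window and the synthetic signals are all illustrative assumptions. It compares a skewed (non-Gaussian) signal, Gaussian noise, and their sum.

```python
import numpy as np


def bispectrum_direct(x, nfft=64):
    """Direct (FFT-based) bispectrum estimate averaged over non-overlapping segments.

    Hypothetical helper for illustration. Returns a complex (nfft, nfft) array whose
    entry [k1, k2] estimates E[X(k1) X(k2) conj(X(k1 + k2))], with k1 + k2 wrapped
    modulo nfft.
    """
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // nfft
    if n_seg == 0:
        raise ValueError("signal is shorter than one segment")
    k = np.arange(nfft)
    # Index of the sum frequency k1 + k2 for every (k1, k2) pair.
    k_sum = (k[:, None] + k[None, :]) % nfft
    window = np.hanning(nfft)  # assumed taper; the abstract does not specify one
    B = np.zeros((nfft, nfft), dtype=complex)
    for s in range(n_seg):
        seg = x[s * nfft:(s + 1) * nfft]
        seg = seg - seg.mean()  # third-order cumulants are defined on zero-mean data
        X = np.fft.fft(seg * window)
        # Triple product X(k1) X(k2) X*(k1 + k2) over the whole frequency grid.
        B += X[:, None] * X[None, :] * np.conj(X[k_sum])
    return B / n_seg


# Synthetic demo (assumed data, not from the paper).
rng = np.random.default_rng(0)
n = 32000
clean = rng.exponential(1.0, n) - 1.0  # skewed, zero-mean, unit-variance "signal"
noise = rng.standard_normal(n)         # Gaussian, unit-variance noise (about 0 dB SNR)

B_clean = bispectrum_direct(clean)
B_noise = bispectrum_direct(noise)
B_noisy = bispectrum_direct(clean + noise)

print("mean |B| clean:", np.mean(np.abs(B_clean)))
print("mean |B| noise:", np.mean(np.abs(B_noise)))
print("mean |B| noisy:", np.mean(np.abs(B_noisy)))
```

The averaged estimate for the noisy signal stays close to that of the clean signal, while the noise-only estimate shrinks toward zero as more segments are averaged. Real speech is not an i.i.d. skewed process and real-world noise is not always Gaussian, so this only illustrates the principle rather than the paper's feature pipeline.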