The purpose, history, current state, and some evolving trends in feature extraction for speech recognition

1999 
Summary form only given, as follows. Firstly, the basic principles of automatic speech recognition (ASR) are reviewed. The acoustic analysis module is examined in greater detail, and the distinctions between its two main blocks, pattern classification and feature extraction, are discussed. The early history of speech feature extraction is outlined, including early attempts by Newton and Helmholtz to characterize the information-bearing components of vowels, and Scripture's analysis of phonographic voice recordings. The concepts of short-term analysis and spectrograms are introduced together with the linear model of speech production. Reasons for spectral envelope estimation in ASR, as well as basic techniques for its estimation such as homomorphic analysis and linear predictive analysis, are introduced. The cepstrum as an approximation to the Karhunen-Loeve transformation, and cepstral lifters as a means of modifying the properties of simple Euclidean cepstral distances, are also introduced. Inconsistencies of simple envelope estimation techniques with human speech perception are mentioned. Reasons for auditory-like feature extraction and some currently dominant auditory-like techniques, such as Mel cepstral analysis and perceptual linear prediction (PLP), are described. The concept and basic properties of the modulation spectrum of speech are explained, and its historical use in predicting the intelligibility of speech in auditoria is mentioned. Dynamic features (delta, double-delta) are discussed, with a special focus on their interpretation as FIR filters applied to the modulation spectrum of speech. RASTA filtering is introduced as an extension of the FIR filtering done in dynamic feature estimation, and reasons for its robustness to changes in communication environments are explained. Interesting consistencies of RASTA processing with temporal properties of human hearing, such as forward masking, are also mentioned.
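The interpretation of dynamic features as FIR filters acting on feature trajectories can be illustrated with a minimal sketch. It is not taken from the summarized talk: it assumes the commonly used regression-based delta estimator, a half-width of 2 frames, and the hypothetical helper names `delta_kernel` and `delta`.

```python
import numpy as np


def delta_kernel(half_width=2):
    """FIR taps of a regression-based delta estimator (assumed form).

    The taps are n / sum(n^2) for n = -half_width..half_width, so they are
    antisymmetric and sum to zero (no gain at zero modulation frequency).
    """
    n = np.arange(-half_width, half_width + 1, dtype=float)
    return n / np.sum(n ** 2)


def delta(features, half_width=2):
    """Apply the delta FIR filter along time to each feature trajectory.

    features: array of shape (num_frames, num_coeffs), e.g. cepstra per frame.
    Returns an array of the same shape; edge frames use repeated padding.
    """
    taps = delta_kernel(half_width)
    padded = np.pad(features, ((half_width, half_width), (0, 0)), mode="edge")
    columns = [
        np.correlate(padded[:, k], taps, mode="valid")
        for k in range(features.shape[1])
    ]
    return np.stack(columns, axis=1)


if __name__ == "__main__":
    # Toy trajectories: two coefficients rising by 3 units per frame.
    frames = 3.0 * np.arange(20, dtype=float)[:, None] * np.ones((1, 2))
    d1 = delta(frames)   # delta features
    d2 = delta(d1)       # double-delta: the same filter applied again
    # Magnitude response of the taps over modulation frequency (FFT bins).
    response = np.abs(np.fft.rfft(delta_kernel(2), 256))
    print(d1[5], d2[5], response[0])
```

Because the taps sum to zero, the filter rejects the constant (zero modulation frequency) component of each trajectory, which is one way to read the delta computation as a filter on the modulation spectrum.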
The need for data-driven feature extraction is discussed, and techniques for the design of discriminant spectral bases and of discriminant RASTA filters are described, with recent results of their application in automatic speech recognition and in speaker recognition. The concept of multi-band recognition of speech is introduced, and its inherent robustness in the presence of colored noise is discussed. The concept is further generalized to sub-stream based recognition, and some techniques for merging information sub-streams are described. Finally, recently introduced speech recognition from temporal patterns of spectral energies is described, and its inherent advantages for recognition of speech in adverse environments are discussed.