    Phonetic anchor based state mapping for text-independent voice conversion
    Abstract:
This paper describes a novel method for text-independent voice conversion using improved state mapping. An HMM is used to represent the phonetic structure of the training speech. Centroids of the phonemes common to the source and target speech are used as phonetic anchors when establishing a mapping between the acoustic spaces of the source and target speakers. These phonetic anchors and a weighted linear transform are used to create a continuous parametric mapping from source to target speech parameters. The proposed technique is applicable to both intra-lingual and cross-lingual voice conversion. Experimental results show that state mapping is improved by the proposed technique.
    Keywords:
    Centroid
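The abstract's continuous mapping from source to target parameters (phonetic anchors plus a weighted linear transform) can be sketched roughly as follows; the Gaussian distance weighting and the kernel width `sigma` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def anchor_map(x, src_anchors, tgt_anchors, sigma=1.0):
    """Map a source feature vector toward the target space using
    phonetic anchors (per-phoneme centroid pairs) and distance-based
    soft weights -- a simplified sketch of a weighted linear mapping."""
    d2 = np.sum((src_anchors - x) ** 2, axis=1)   # squared distance to each source anchor
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum()                                  # normalized soft weights
    # each anchor contributes its target centroid plus the local offset of x
    return (w[:, None] * (tgt_anchors + (x - src_anchors))).sum(axis=0)
```

Because the weights vary smoothly with the input, the mapping is continuous rather than a hard per-state assignment.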
    Viseme
We present an algorithm for isolated-word, text-dependent speaker identification under normal speech and four stressful speaking styles. The styles, designed to simulate speech produced under real stressful conditions, are shout, slow, loud, and soft. The algorithm is based on the hidden Markov model (HMM) with a cepstral stress compensation technique. Comparing the HMM without cepstral stress compensation to the HMM combined with cepstral stress compensation, the recognition rate improves with only a small increase in computation: from 90% to 93% in the normal style, from 19% to 73% in the shout style, from 62% to 84% in the slow style, from 38% to 75% in the loud style, and from 30% to 81% in the soft style. The cepstral coefficients and transitional coefficients are combined to form the observation vector of the hidden Markov model. The algorithm was tested on a limited number of speakers due to our limited database.
    Mel-frequency cepstrum
    Cepstrum
    Identification
    Speaker identification
    Citations (11)
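A minimal sketch of the cepstral stress compensation idea above, assuming the simplest additive-bias form; estimating the bias as the difference of style means is an illustrative choice, not necessarily the paper's exact estimator:

```python
import numpy as np

def stress_compensate(frames, stressed_mean, neutral_mean):
    """Remove an additive cepstral bias attributed to speaking style.
    `frames` is a (T, D) array of cepstral features; the bias is taken
    to be the difference of per-style training means (an illustrative,
    simple form of cepstral compensation)."""
    bias = stressed_mean - neutral_mean   # style-dependent cepstral offset
    return frames - bias                  # compensated features for HMM scoring
```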
Visual speech recognition can supplement the information in the speech sound to improve the accuracy of speech recognition. A viseme, which describes the facial and oral movements that occur alongside the voicing of a particular phoneme, is a supposed basic unit of speech in the visual domain. As with phonemes, there are variations in the same viseme expressed by different persons, or even by the same person, and a classifier must be robust to this kind of variation. In this chapter, the authors describe the Adaptively Boosted (AdaBoost) hidden Markov model (HMM) technique (Foo, 2004; Foo, 2003; Dong, 2002). By applying the AdaBoost technique to HMM modeling, a multi-HMM classifier that improves the robustness of the HMM is obtained. The method is applied to identify context-independent and context-dependent visual speech units. Experimental results indicate that higher recognition accuracy can be attained using the AdaBoost HMM than using a conventional HMM.
    Viseme
    AdaBoost
    Robustness
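The AdaBoost HMM decision rule can be illustrated as a weighted vote over the boosted classifiers' scores; the score-matrix layout and the `alphas` naming are assumptions made for this sketch, not details from the chapter:

```python
import numpy as np

def boosted_predict(scores, alphas):
    """Combine per-round HMM classifier scores with AdaBoost weights.
    `scores` is (rounds, classes): each boosted HMM's score for every
    class; `alphas` are the boosting weights learned per round. The
    class with the highest weighted total wins (illustrative rule)."""
    total = np.einsum('r,rc->c', np.asarray(alphas, float),
                      np.asarray(scores, float))
    return int(np.argmax(total))
```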
The ability of a reader to recognize written words correctly, virtually effortlessly, is defined as word recognition or isolated word recognition: each word is recognized from its shape. Speech recognition is the process of converting spoken words into written text, also called speech-to-text (STT). The usual methods used in speech recognition (SR) are neural networks, hidden Markov models (HMM), and dynamic time warping (DTW); the most widely used technique is the HMM. A hidden Markov model treats each unit of a word as a state and assumes that successive acoustic features of a spoken word are independent given the state: the occurrence of one feature is independent of the occurrence of the others. Based on the state probabilities, it generates the most likely word sequence for the spoken input. Instead of listening to the speech, the generated text can simply be read, so people with hearing impairments can make use of this kind of speech recognition.
    Hearing impaired
    Citations (14)
In this paper, a new system for speaker recognition using the hidden Markov model (HMM) algorithm is proposed. Much research has been published on this subject, especially using HMMs. Arabic is a difficult language and little work has been done on it; moreover, previous work addressed text-dependent systems, where the HMM is very effective and the algorithm is trained at the word level. One of the problems in such systems is noise, so we take it into consideration by adding additive white Gaussian noise (AWGN) to the speech signals to observe its effect. Here we use an HMM with a new one-state algorithm in which two of the model components, π and A, are removed. This greatly accelerates the training and testing stages of recognition with the lowest memory usage, as shown in this work. The results are excellent: a 100% recognition rate for the tested data, and about 91.6% with AWGN.
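With π and the transition matrix A removed, a one-state HMM's score reduces to a sum of per-frame emission log-probabilities. This sketch assumes a single diagonal-Gaussian emission density, which the abstract does not specify:

```python
import numpy as np

def one_state_loglik(frames, mean, var):
    """With pi and A removed, a one-state HMM reduces to its emission
    density: the total log-likelihood is just the sum of per-frame
    diagonal-Gaussian log-probabilities (illustrative sketch)."""
    frames, mean, var = map(np.asarray, (frames, mean, var))
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return ll.sum()   # classify by argmax over per-word models
```

Training likewise collapses to estimating one mean/variance pair per word model, which is why both stages are so fast and memory-light.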
Most state-of-the-art automatic speech recognition (ASR) systems are typically based on continuous hidden Markov models (HMMs) as the acoustic modeling technique. It has been shown that the performance of HMM speech recognizers may be affected by a poor choice of acoustic feature parameters in the acoustic front-end module. For these reasons, we propose in this paper a dedicated isolated word recognition system based on HMMs that was carefully optimized at the acoustic analysis and HMM acoustic modeling levels. This design was tested and evaluated on the Hidden Markov Model Toolkit (HTK) platform. System performance was evaluated using the TIMIT database. A comparative study was carried out using two types of speech analysis: the cepstral method referred to as Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) coding were used in different tests to evaluate and reinforce our design. The effect of the frame shift duration in the acoustic analysis, as well as the addition of the dynamic coefficients of the acoustic parameters (MFCC and PLP), was carefully tested in order to achieve high accuracy for our optimized isolated word recognition (IWR) system. Finally, various experiments related to the HMM topology were carried out in order to obtain better recognition accuracy. In fact, the effect of some HMM modeling parameters on the recognition accuracy of the IWR system, such as the number of states and the number of Gaussian mixtures, was analyzed in order to find the optimal HMM topology.
Key words: isolated word recognition, perceptual linear predictive (PLP) coding, Mel-frequency cepstral coefficients (MFCC), HMM, Hidden Markov Model Toolkit (HTK).
    Mel-frequency cepstrum
    Cepstrum
    TIMIT
    Citations (2)
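The dynamic coefficients mentioned above can be sketched with the standard HTK-style delta regression over neighbouring frames; the window size `N=2` and edge padding are conventional choices, not taken from the paper:

```python
import numpy as np

def deltas(feat, N=2):
    """HTK-style dynamic (delta) coefficients: a regression over +/-N
    neighbouring frames, typically appended to static features such as
    MFCC or PLP. `feat` is (T, D); edges are padded by repetition."""
    T = feat.shape[0]
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(T)
    ])
```

Applying the same function to the delta matrix yields acceleration (delta-delta) coefficients.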
An automatic speech recognition system converts a recorded audio speech signal into text output. Speech recognition has a variety of applications in various domains. The hidden Markov model (HMM) is a widely used statistical approach in speech recognition systems. The proposed work presents a speaker-independent continuous speech recognition system for Indian English speakers using the Hidden Markov Model Toolkit (HTK). Mel-frequency cepstral coefficients (MFCC) are used as the feature vector. Results for the automatic speech recognition system using HTK are presented for three different experiments: cross-validation mode, without adaptation of the HMMs, and with adaptation of the HMMs. The accuracy of the recognized speech in each case is also compared.
    Mel-frequency cepstrum
    Feature (linguistics)
    Cepstrum
This paper focuses on hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in generating high-quality speech and has started to dominate speech synthesis research. The attractive point of this approach is that the synthesized speech can easily be modified by transforming the HMM parameters with a small amount of speech data. It is therefore very useful for constructing speech synthesizers with various voice characteristics, speaking styles, and emotions.
    Citations (4)
Robust speech recognition systems must address variations due to perceptually induced stress in order to maintain acceptable levels of performance in adverse conditions. This study proposes a new approach that combines stress classification and speech recognition in one algorithm. This is accomplished by generalizing the one-dimensional hidden Markov model to a multi-dimensional hidden Markov model (N-D HMM) in which each stressed speech style is allocated a dimension. It is shown that this formulation better integrates perceptually induced stress effects for stress-independent recognition, owing to the sub-phoneme (state-level) stress classification that the algorithm implicitly performs. The proposed N-D HMM method is compared to neutral and multi-style stress-trained 1-D HMM recognizers. Average recognition rates improve by 15.72% over the 1-D stress-dependent recognizer and by 26.67% over the 1-D neutral-trained recognizer.
    Citations (2)
In continuous speech recognition featuring hidden Markov models (HMM), word N-grams and time-synchronous beam search, a local modeling mismatch in the HMM will often cause the recognition performance to degrade. To cope with this problem, this paper proposes a method of restructuring Gaussian mixture pdfs in a pre-trained speaker-independent HMM based on speech data. In this method, mixture components are copied and shared among multiple mixture pdfs with the tendency of local errors taken into account. The tendency is given by comparing the pre-trained HMM and the speech data used in the pre-training. Experimental results prove that the proposed method can effectively restore local modeling mismatches and improve recognition performance.
1. INTRODUCTION
In continuous speech recognition featuring hidden Markov models (HMM), word N-grams and time-synchronous beam search, a local acoustic modeling mismatch will often cause a likelihood score to fall locally. This may get a correct word sequence pruned away from the recognition hypotheses or ranked low among all of them. Such a local acoustic modeling mismatch is frequent, especially in the recognition of speech by unrestricted speakers, where a wide variety of speakers' individualities must be dealt with, and in the recognition of spontaneous speech, where spectral features are heavily deformed. It is crucial to overcome such modeling mismatches to achieve accurate recognition of speaker-independent and spontaneous speech. So far, a few methods based on operating on the likelihood score during the search process to avoid wrong pruning have given tentative solutions to this problem [1],[2]. These methods, however, are unable to solve the root problem, namely that some acoustic phenomena are not properly modeled. The acoustic models themselves should be improved based on the acoustic phenomena to solve the root problem.
This paper proposes a method of restructuring Gaussian mixture probability density functions (pdfs) in a pre-trained speaker-independent HMM set. In this method, which aims at modeling several acoustic phenomena more properly, the number of components in each mixture pdf is inflated by copying new components from other mixture pdfs with the tendency of local errors taken into account. The tendency is given by comparing the pre-trained HMM set and the speech data used in the pre-training. As each of the copied components is shared between the source and destination mixture pdfs, the total number of Gaussian pdfs in the HMM set does not increase. In section 2, the basic ideas of the proposed method are described. Section 3 gives experimental results of the proposed method and of speech recognition using a newly yielded HMM set, showing the improvement in recognition performance. In section 4, the effect of restoring the local acoustic modeling mismatch is verified using examples of speech recognition results.
    Pruning
    Citations (3)
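The component-sharing idea above, copying a Gaussian into another mixture by reference so that the total number of distinct Gaussians stays fixed, might be sketched like this; the mixture data layout and the new weight `w_new` are illustrative assumptions, not the paper's actual reweighting scheme:

```python
def share_component(dst, src, k, w_new=0.1):
    """Copy (share) Gaussian component k of mixture `src` into mixture
    `dst` without duplicating its parameters: both mixtures reference
    the same component dict, so the number of distinct Gaussians in the
    model set is unchanged. Mixtures are lists of {'w': weight,
    'g': gaussian-params} entries (illustrative data layout)."""
    scale = 1.0 - w_new
    for c in dst:
        c['w'] *= scale                         # make room for the new weight
    dst.append({'w': w_new, 'g': src[k]['g']})  # shared reference, not a copy
    return dst
```

Sharing by reference is what keeps memory and Gaussian-evaluation cost flat even though each mixture pdf gains a component.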