    A Myanmar large vocabulary continuous speech recognition system
    31 Citations · 20 References · 10 Related Papers
    Abstract:
    This paper presents a large vocabulary automatic speech recognition (ASR) system for Myanmar. To the best of our knowledge, this is the first such system for the Myanmar language. We report the main processes of developing the system, including data collection, pronunciation lexicon construction, selection of effective acoustic features, acoustic and language modeling, and evaluation criteria. Since Myanmar is a tonal language, tonal features were incorporated into acoustic modeling and their effectiveness was verified. Differences between the word-based language model (LM) and the syllable-based LM were investigated; the word-based LM was found superior to the syllable-based model. To sidestep the ambiguity of Myanmar word definitions and achieve high reliability in the recognition results, we explored the characteristics of the Myanmar language and proposed the Syllable Error Rate (SER) as a suitable evaluation criterion for a Myanmar ASR system. Three kinds of acoustic models, one Gaussian Mixture Model (GMM) and two Deep Neural Networks (DNNs), were explored using only the developed phonemically balanced corpus of 4K sentences and 40 hours of speech. An open evaluation set containing 100 utterances spoken by 25 speakers was used for testing. With the sequence-discriminatively trained DNN, the results reached 15.63% word error rate (WER) and 10.87% SER.
    Keywords:
    Pronunciation
    Word error rate
    Discriminative model
    Spoken Language
    Treebank
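    Both WER and SER reduce to the same Levenshtein alignment computed over different units (words vs. syllables); a minimal sketch, with made-up toy strings rather than data from the paper:

```python
# Minimal sketch of WER/SER computation via Levenshtein alignment.
# SER is the same edit-distance rate computed over syllables instead of
# words, which sidesteps the ambiguity of Myanmar word segmentation.

def edit_distance(ref, hyp):
    """Minimum substitutions + insertions + deletions turning ref into hyp."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def error_rate_percent(ref_units, hyp_units):
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)

# Toy example: 1 error against a 4-unit reference -> 25%.
ref = ["this", "is", "a", "test"]
hyp = ["this", "is", "the", "test"]
print(error_rate_percent(ref, hyp))  # 25.0
```

    The same `error_rate_percent` gives WER when the unit lists are words and SER when they are syllables.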
    Many studies have explored the use of existing multilingual speech corpora to build an acoustic model for a target language. These works on multilingual acoustic modeling often use multilingual acoustic models to create an initial model, which is usually suboptimal at decoding speech of the target language; some target-language speech is then used to adapt and improve it. In this paper, however, we investigate multilingual acoustic modeling as a way to enhance an existing acoustic model of the target language for an automatic speech recognition system. The proposed approach merges a context-dependent acoustic model of a source language into the acoustic model of a target language, where the source- and target-language speech are spoken by speakers from the same country. Our experiments on Malay and English automatic speech recognition show relative improvements in WER from 2% to about 10% when the multilingual acoustic model is employed.
    Malay
    Citations (2)
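    The abstract does not spell out the merging operation. One simple scheme, sketched here under the assumption that tied states of the two models can be matched by name, is a weighted interpolation of matched Gaussian means; the state names, dimensions, and weight below are illustrative, not from the paper:

```python
# Illustrative sketch only: interpolate matched Gaussian mean vectors of a
# target-language model with those of a source-language model. Real
# context-dependent model merging must also reconcile state tying and
# covariances; this shows just the parameter-combination step.

def merge_means(target, source, tau=0.8):
    """tau weights the target model; 1 - tau weights the source model."""
    merged = {}
    for state, t_mean in target.items():
        if state in source:
            s_mean = source[state]
            merged[state] = [tau * t + (1 - tau) * s
                             for t, s in zip(t_mean, s_mean)]
        else:
            merged[state] = list(t_mean)   # no source counterpart: keep as-is
    return merged

target = {"a-b+c.s2": [1.0, 2.0], "x-y+z.s2": [0.0, 0.0]}   # made-up states
source = {"a-b+c.s2": [3.0, 0.0]}
print(merge_means(target, source))
```

    States with no source-language counterpart simply keep their target-language parameters, so the merge never discards target coverage.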
    The use of data augmentation on training data can significantly improve the robustness of deep neural network-based automatic speech recognition (ASR) systems. We propose an approach to building an ASR system for the low-resource Arabic language using an end-to-end model, and demonstrate the impact of data augmentation and the suggested language model on the results. We aimed to develop a system that transcribes Arabic audio containing human speech into text, since few existing systems are tailored for Arabic compared to other languages. Our model is trained on the DeepSpeech2 framework, which uses end-to-end deep learning. When data augmentation is utilized during training, the word error rate (WER) improves by around 13%, and when the language model is used, the word error rate is reduced by around 48%. Our best model achieves a competitive word error rate of 2.8 when evaluated on the Common Voice 8.0 dataset.
    Word error rate
    Robustness
    Training set
    Error Analysis
    Cache language model
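    The abstract does not name its augmentation recipe. Speed perturbation is one standard choice for ASR training data and can be sketched with plain resampling; the 0.9/1.0/1.1 factors below are the common convention, assumed here rather than taken from the paper:

```python
import numpy as np

# Sketch: speed perturbation by linear-interpolation resampling.
# factor > 1.0 yields faster (shorter) speech, factor < 1.0 slower.

def speed_perturb(waveform, factor):
    n_out = int(round(len(waveform) / factor))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)                 # 1 s of fake 16 kHz audio
augmented = [speed_perturb(audio, f) for f in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])                 # [17778, 16000, 14545]
```

    Each perturbed copy is added to the training set alongside the original, effectively tripling the data.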
    Training a speech recognition system needs audio data and the corresponding exact transcriptions. However, manual transcription is expensive, labor-intensive, and error-prone. Some sources, such as TV broadcasts, have subtitles. Subtitles are close to the exact transcription but not identical: some sentences may be paraphrased, deleted, changed in word order, and so on. Building automatic speech recognition from inexact subtitles may result in poor models and a low-performance system, so selecting data is crucial to obtain high-performance models. In this work, we explore the lightly supervised approach, a process for selecting good acoustic data to train Deep Neural Network acoustic models. We study data selection methods based on phone matched error rate and average word duration, and we propose a new data selection method combining three recognizers. Recognizing the development set produces the word error rate, which is the metric used to measure how good a model is. The data selection methods are evaluated on a real TV broadcast dataset.
    Word error rate
    Transcription
    Audio mining
    Citations (1)
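    Of the selection criteria above, average word duration (AWD) is the simplest to sketch: a segment whose subtitle length is wildly inconsistent with its audio duration is probably mismatched. The thresholds and data below are illustrative, not from the paper:

```python
# Sketch of lightly supervised data selection by average word duration.
# Each segment is (duration_seconds, possibly-inexact transcript); the
# AWD bounds are made-up plausibility limits, not the paper's values.

def select_segments(segments, min_awd=0.15, max_awd=0.60):
    kept = []
    for duration, transcript in segments:
        words = transcript.split()
        if not words:
            continue
        awd = duration / len(words)              # average word duration
        if min_awd <= awd <= max_awd:            # implausible AWD suggests a
            kept.append((duration, transcript))  # transcript/audio mismatch
    return kept

data = [(2.0, "the weather report for today"),          # AWD 0.40 s -> keep
        (9.0, "short caption"),                         # AWD 4.50 s -> drop
        (0.4, "a very long paraphrased subtitle line")] # AWD 0.07 s -> drop
print(select_segments(data))
```

    Only segments that survive this filter would feed the DNN acoustic-model training.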
    Automatic Speech Recognition (ASR) is a technology capable of converting speech into text. Research in this field is growing very rapidly and is applied in multiple languages. Recognition of isolated words and connected words for the Indonesian language has been attempted with various approaches to achieve better recognition accuracy; however, there is still little research on continuous speech recognition for Indonesian. This paper describes the efforts made to build an Indonesian automatic speech recognition system that recognizes continuous speech using Sphinx4 (a toolkit from CMUSphinx). Three steps were taken to build the Indonesian ASR: preparing the corpus, forming the acoustic model, and testing. The test of the resulting acoustic model showed a word error rate of 23% and a sentence error rate of 32.8%; the lower these two values, the better the recognition of the given speech input.
    Word error rate
    Audio mining
    Speech analytics
    One of the current research areas is speech recognition, aiding the recognition of speech signals through computer applications. In this research paper, the Acoustic Nudging (AN) model is used to reformulate persistent automatic speech recognition (ASR) errors that involve the user's acoustic irrational behavior, which alters speech recognition accuracy. A GMM helped address the low-resourced attributes of the Yorùbá language to achieve better accuracy and system performance. The simulated results show that the proposed Acoustic Nudging-based Gaussian Mixture Model (ANGM) improves accuracy and system performance, evaluated by Word Recognition Rate (WRR) and Word Error Rate (WER) across validation, testing, and training accuracy. The mean WRR achieved by the ANGM model is 95.277% and the mean WER is 4.723%. This approach thereby reduces the error rate by 1.1%, 0.5%, 0.8%, 0.3%, and 1.4% when compared with other models. This work thus lays a foundation for advancing current understanding of under-resourced languages while developing an accurate and precise model for speech recognition.
    Word error rate
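    The GMM at the core of the ANGM approach scores acoustic feature vectors by a weighted sum of Gaussian densities; a minimal diagonal-covariance sketch, with all parameter values made up for illustration:

```python
import numpy as np

# Log-likelihood of a feature vector under a diagonal-covariance GMM:
# log p(x) = logsumexp_k [ log w_k + log N(x; mu_k, diag(v_k)) ]

def gmm_loglik(x, weights, means, variances):
    x = np.asarray(x, dtype=float)
    per_component = []
    for w, mu, v in zip(weights, means, variances):
        ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - mu) ** 2 / v)
        per_component.append(np.log(w) + ll)
    return np.logaddexp.reduce(per_component)   # stable log-sum-exp

weights = [0.6, 0.4]                            # mixture weights (sum to 1)
means = [np.zeros(2), np.full(2, 2.0)]          # component means
variances = [np.ones(2), np.ones(2)]            # diagonal covariances
print(gmm_loglik([0.1, -0.2], weights, means, variances))
```

    In an ASR acoustic model, one such mixture is trained per tied HMM state, and the per-frame log-likelihoods feed the decoder.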
    This paper addresses the problem of automatic speech recognition (ASR) error detection and its use for improving spoken language understanding (SLU) systems. In this study, the SLU task consists of automatically extracting semantic concepts and concept/value pairs from ASR transcriptions, e.g. in a touristic information system. An approach is proposed for enriching the set of semantic labels with error-specific labels and for using a recently proposed neural approach based on word embeddings to compute well-calibrated ASR confidence measures. Experimental results show that it is possible to significantly decrease the Concept/Value Error Rate with a state-of-the-art system, outperforming previously published results on the same experimental data. It is also shown that by combining an SLU approach based on conditional random fields with a neural encoder/decoder attention-based architecture, it is possible to effectively identify confidence islands and uncertain semantic output segments, which are useful for deciding appropriate error-handling actions in the dialogue manager strategy.
    Word error rate
    Spoken Language
    Citations (2)
    An authentic spoken corpus is of great practical significance for the dynamic study of language and for the teaching and acquisition of spoken French. Recently, building spoken French corpora and conducting empirical research on them has become a direction of language research. We therefore introduce in detail the establishment and application of AI speech recognition technology in a small spoken French corpus, used as a platform for an empirical study of code-switching in French classroom teachers' discourse. In this research, we find that the corpus is objective and effective in revealing the types, motivations, and functions of code-switching in French classroom teachers' discourse. The corpus also provides a reference for future corpus-based research on the teaching of spoken French.
    Spoken Language
    Corpus Linguistics
    Empirical Research
    Code (set theory)
    Text corpus
    Phenomenon
    Here the development of acoustic and language models is presented. A low word error rate is an early sign of good language and acoustic models. Although there are parameters other than word error rate, our work focused on building a Bahasa Indonesia model with approximately 2,000 common words and achieved the minimum threshold of 25% word error rate. Several experiments were run with different cases, training data, and testing data, with word error rate and testing ratio as the main points of comparison. The language and acoustic models were built using Sphinx4 from Carnegie Mellon University, with a Hidden Markov Model for the acoustic model and an ARPA model for the language model. The model configurations, Beam Width and Force Alignment, directly correlate with word error rate; they were set to 1e-80 for Beam Width and 1e-60 for Force Alignment to prevent underfitting or overfitting of the acoustic model. The goals of this research are to build continuous speech recognition for Bahasa Indonesia with a low word error rate and to determine the optimal amounts of training and testing data that minimize the word error rate.
    Overfitting
    Word error rate
    Citations (4)
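    The ARPA format mentioned above stores n-gram log10 probabilities; a maximum-likelihood bigram sketch without smoothing or backoff (the toy Indonesian sentences are invented for illustration, not from the paper's 2,000-word corpus):

```python
import math
from collections import Counter

# Estimate bigram probabilities of the kind an ARPA-format LM lists in its
# \2-grams: section (unsmoothed maximum likelihood, for brevity).

sentences = [["saya", "makan", "nasi"],
             ["saya", "minum", "teh"],
             ["saya", "makan", "roti"]]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    tokens = ["<s>"] + s + ["</s>"]          # sentence start/end markers
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def logprob10(w1, w2):
    """log10 P(w2 | w1) by maximum likelihood."""
    return math.log10(bigrams[(w1, w2)] / unigrams[w1])

print(round(logprob10("saya", "makan"), 4))   # log10(2/3) = -0.1761
```

    A real toolkit would add discounting and backoff weights so unseen bigrams still receive probability mass.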