    On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition
Citations (17) · References (30) · Related Papers (10)
    Abstract:
The DARPA Resource Management task is used as the domain to investigate the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX; its error rate on the RM2 test set is 4.3%. They extended SPHINX to speaker-dependent speech recognition, reducing the error rate to 1.4-2.6% with 600-2400 training sentences per speaker, which demonstrates a substantial difference between speaker-dependent and speaker-independent systems. Based on the speaker-independent models, speaker-adaptive speech recognition was also studied: with 40 adaptation sentences per speaker, the error rate can be reduced from 4.3% to 3.1%.
    Keywords:
    Sphinx
    Word error rate
    Speaker diarisation
This paper presents a speaker identification system with a speaker verification method. Compared with usual speaker identification systems, it requires much less speech data while still maintaining high performance. In implementing this speaker identification system, we introduce several approaches to improve the effect of speech segmentation and use a speaker verification method based on a common codebook, HMMs, and a speaker-relative threshold (SRTHD). A novel method of integrating speaker identification with face recognition to verify a person's identity is also presented. This method has been shown to be effective in reducing the equal error rate of recognition.
    Speaker diarisation
    Speaker identification
    Speaker Verification
    Identification
    Word error rate
    Mel-frequency cepstrum
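The speaker-relative threshold idea above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's method: scalar scores stand in for codebook/HMM log-likelihoods, and the per-speaker threshold is simply set a fixed margin below that speaker's mean enrollment score.

```python
# Hypothetical sketch of verification with a speaker-relative threshold:
# instead of one global decision threshold, each enrolled speaker gets a
# threshold derived from their own training scores. Scores here are toy
# stand-ins for codebook/HMM log-likelihoods.

def speaker_relative_threshold(train_scores, margin=0.5):
    """Place the threshold a fixed margin below the speaker's mean training score."""
    mean = sum(train_scores) / len(train_scores)
    return mean - margin

def verify(claimed_speaker, test_score, thresholds):
    """Accept the claim only if the score clears the claimant's own threshold."""
    return test_score >= thresholds[claimed_speaker]

# Enrollment: per-speaker thresholds from (invented) training scores.
thresholds = {
    "alice": speaker_relative_threshold([2.1, 2.4, 2.2]),
    "bob":   speaker_relative_threshold([0.9, 1.1, 1.0]),
}

print(verify("alice", 2.0, thresholds))  # clears alice's threshold -> True
print(verify("bob", 0.3, thresholds))    # below bob's threshold -> False
```

A per-speaker threshold of this kind compensates for speakers whose models systematically score higher or lower than others, which a single global threshold cannot do.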
In this paper, we describe a new high-performance on-line speaker diarization system which works faster than real time and has very low latency. It consists of several modules, including voice activity detection, novel-speaker detection, and speaker gender and speaker identity classification. All modules share a set of Gaussian mixture models (GMMs) representing pause, male and female speakers, and each individual speaker. Initially, there are only three GMMs, for pause and the two speaker genders, trained in advance from some data. During the speaker diarization process, each speech segment is classified as coming either from a new speaker or from an already known speaker. In the case of a new speaker, his/her gender is identified, and then a new GMM is spawned from the corresponding gender GMM by copying its parameters. This GMM is learned on-line using the speech segment data, and from this point on it is used to represent the new speaker. All individual speaker models are produced in this way. In the case of a known speaker, s/he is identified and the corresponding GMM is again learned on-line. To prevent unlimited growth of the number of speaker models, models that have not been selected as winners for a long period of time are deleted from the system. This allows the system to perform its task indefinitely, in addition to being capable of self-organization, i.e. unsupervised adaptive learning, and preservation of the learned knowledge, i.e. speakers. Such functionalities are attributed to so-called Never-Ending Learning systems. For evaluation, we used part of the TC-STAR database, consisting of European Parliament plenary speeches. The results show that this system achieves a speaker diarization error rate of 4.6% with a latency of at most 3 seconds.
    Speaker diarisation
    Copying
    Citations (28)
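The spawn-update-prune loop described in that abstract can be sketched under strong simplifications: here each "model" is a single 1-D Gaussian rather than a GMM, a segment is a short list of scalar features, and the new-speaker margin and learning rate are invented for illustration.

```python
# Toy sketch of the on-line diarization loop: pretrained gender models,
# speaker models spawned by copying the winning gender model, and on-line
# updates from each assigned segment. All parameters are illustrative.
import math

class Gaussian:
    def __init__(self, mean, var):
        self.mean, self.var = mean, var
    def loglik(self, xs):
        return sum(-0.5 * (math.log(2 * math.pi * self.var)
                           + (x - self.mean) ** 2 / self.var) for x in xs)
    def update(self, xs, lr=0.3):
        # On-line (exponential) update toward the new segment's mean.
        m = sum(xs) / len(xs)
        self.mean += lr * (m - self.mean)

# Stand-ins for the pretrained male/female gender GMMs.
genders = {"male": Gaussian(-1.0, 1.0), "female": Gaussian(1.0, 1.0)}
speakers = {}             # speaker id -> model, spawned on the fly
NEW_SPEAKER_MARGIN = 5.0  # likelihood margin for "novel speaker" decision

def diarize_segment(xs):
    # Score all known speakers; decide new vs. known by a likelihood margin.
    best_id, best_ll = None, -float("inf")
    for sid, model in speakers.items():
        ll = model.loglik(xs)
        if ll > best_ll:
            best_id, best_ll = sid, ll
    gender = max(genders, key=lambda g: genders[g].loglik(xs))
    if best_id is None or best_ll < genders[gender].loglik(xs) - NEW_SPEAKER_MARGIN:
        # Spawn a new speaker model by copying the winning gender model.
        g = genders[gender]
        best_id = f"spk{len(speakers)}"
        speakers[best_id] = Gaussian(g.mean, g.var)
    speakers[best_id].update(xs)   # on-line learning on the segment data
    return best_id

print(diarize_segment([1.1, 0.9, 1.2]))   # spawns the first speaker
print(diarize_segment([1.0, 1.1, 0.95]))  # same region -> same speaker
```

The pruning of long-unselected models from the paper would be one extra step here: track the last-winning time per speaker id and delete entries older than some horizon.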
This chapter presents a continuously growing field that promises a wealth of applications far beyond speech processing: the automatic identification of persons from their uttered speech. Research currently focuses mainly on two tasks. The task of speaker detection is to verify the identity of a new speaker against a set of pretrained speaker models. The task of speaker diarization is to find speech segments of the same speaker without any a priori knowledge. The chapter introduces the general ideas in the two fields, then explains the task of speaker diarization by providing an overview of current work before giving a more detailed description of a concrete diarization system. Variants and current research topics are then discussed. Speaker recognition is presented in a similar way. Finally, the chapter concludes by pointing to open problems.
Controlled Vocabulary Terms: speaker recognition
    Speaker diarisation
    Identification
    Citations (3)
Speaker recognition and verification have been used in a variety of commercial, forensic, and military applications. The classical problem is that of supervised recognition, in which there is sufficient a priori information on the speakers to be identified; in such cases, the recognition system has speaker models estimated during training sessions. This paper deals with the problem of unsupervised speaker classification, where no a priori speaker information is available. The algorithm accepts multi-speaker dialogue speech data, estimates the number of speakers, and assigns each speech segment to its speaker. Preliminary results are described.
    Speaker diarisation
    Speaker Verification
    Citations (15)
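The unsupervised setting above, where the number of speakers must itself be estimated, is commonly handled by bottom-up clustering. The following is a hedged sketch, not the paper's algorithm: segments (reduced to scalar features purely for illustration) are merged until no two clusters are closer than a distance threshold, and the surviving clusters estimate the speaker count.

```python
# Toy bottom-up clustering of speech segments with no prior speaker models.
# The distance threshold is invented; real systems would use BIC or similar
# criteria on full acoustic features rather than scalar segment means.

def cluster_segments(segment_means, threshold=1.0):
    clusters = [[m] for m in segment_means]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])
                cj = sum(clusters[j]) / len(clusters[j])
                d = abs(ci - cj)
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters              # no pair close enough -> done
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair

segments = [0.1, 0.2, 3.0, 3.2, 0.15]    # toy per-segment features
clusters = cluster_segments(segments)
print(len(clusters))                     # estimated number of speakers -> 2
```

Stopping at a distance threshold, rather than at a preset cluster count, is what lets the speaker count fall out of the data instead of being supplied a priori.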
This paper presents a stream-based approach for unsupervised multi-speaker conversational speech segmentation. The main idea of this work is to exploit prior knowledge about the speaker space to find a low-dimensional vector of speaker factors that summarizes the salient speaker characteristics. This new approach produces segmentation error rates that are better than the state-of-the-art ones reported in our previous work on the segmentation task of the NIST 2000 Speaker Recognition Evaluation (SRE). We also show how the performance of a speaker recognition system on the core test of the 2006 NIST SRE is affected, comparing the results obtained using single-speaker and automatically segmented test data.
    Speaker diarisation
    NIST
    Speech segmentation
    Citations (92)
In this paper, we propose a new speaker-class modeling and adaptation method for an LVCSR system and evaluate it on the Corpus of Spontaneous Japanese (CSJ). In this method, for each evaluation speaker, acoustically close speakers are selected from the training speakers and acoustic models are trained on their utterances. One of the major issues in speaker-class modeling is determining the selection range of speakers. To solve this problem, several models covering a variety of speaker ranges are prepared for each evaluation speaker in advance, and the most appropriate model is selected on a likelihood basis in the recognition step. In addition, we improve recognition performance using unsupervised speaker adaptation with the speaker-class models. In the recognition experiments, a significant improvement was obtained by using the proposed speaker adaptation based on speaker-class models compared with the conventional adaptation method.
    Speaker diarisation
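The likelihood-based selection step in that abstract, picking the best of several pre-built speaker-class models at recognition time, can be sketched as follows. The model names and the scoring function are invented stand-ins; a real system would score the test utterance against full acoustic models.

```python
# Toy sketch of likelihood-based model selection: several speaker-class
# models of varying selection range exist per evaluation speaker, and the
# one scoring the test utterance highest is chosen.

def select_model(models, score_fn, utterance):
    """Return the speaker-class model that scores the utterance best."""
    return max(models, key=lambda m: score_fn(m, utterance))

# Invented stand-ins: each "model" is (name, mean); the pseudo-likelihood
# falls off with squared distance from the model mean.
models = [("narrow", 1.0), ("medium", 0.5), ("wide", 0.0)]

def score_fn(model, utterance):
    _, mean = model
    return -sum((x - mean) ** 2 for x in utterance)

best = select_model(models, score_fn, [0.9, 1.1, 1.0])
print(best[0])   # -> "narrow"
```

Because selection happens at recognition time, no oracle knowledge of the right speaker range is needed: the data itself picks between the narrow and wide speaker-class models.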
    We investigate using state-of-the-art speaker diarization output for speech recognition purposes. While it seems obvious that speech recognition could benefit from the output of speaker diarization ("Who spoke when") for effective feature normalization and model adaptation, such benefits have remained elusive in the very challenging domain of meeting recognition from distant microphones. In this study, we show that recognition gains are possible by careful post-processing of the diarization output. Still, recognition accuracy may suffer when the underlying diarization system performs worse than expected, even compared to far less sophisticated speaker-clustering techniques. We obtain a more accurate and robust overall system by combining recognition output with multiple speaker segmentations and clusterings. We evaluate our methods on data from the 2009 NIST Rich Transcription meeting recognition evaluation.
    Speaker diarisation
    NIST
    Normalization
    Transcription
    Feature (linguistics)
    Citations (19)
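The combination of multiple speaker segmentations mentioned above can be illustrated with a per-frame majority vote. This is a toy stand-in for the paper's system-combination step, and it assumes the speaker labels from the different systems have already been mapped onto a common label set, which in practice requires an alignment step.

```python
# Illustrative combination of several diarization outputs: each system
# assigns a speaker label per frame, and a frame-wise majority vote yields
# a more robust combined segmentation.
from collections import Counter

def combine_segmentations(label_streams):
    """Majority vote across systems, frame by frame."""
    combined = []
    for frame_labels in zip(*label_streams):
        combined.append(Counter(frame_labels).most_common(1)[0][0])
    return combined

# Invented outputs from three systems over five frames (labels pre-aligned).
sys_a = ["s1", "s1", "s2", "s2", "s2"]
sys_b = ["s1", "s1", "s1", "s2", "s2"]
sys_c = ["s1", "s2", "s2", "s2", "s2"]
print(combine_segmentations([sys_a, sys_b, sys_c]))
# -> ['s1', 's1', 's2', 's2', 's2']
```

Each system's isolated errors (sys_b's late boundary, sys_c's early one) are outvoted, which mirrors why combining segmentations can be more robust than trusting any single diarization output.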
In the past year at Carnegie Mellon, steady progress has been made in acoustic and language modeling, resulting in a dramatic reduction of speech recognition errors in the SPHINX-II system. In this paper, we review SPHINX-II and summarize our recent efforts on improved speech recognition. SPHINX-II recently achieved the lowest error rate in the November 1992 DARPA evaluations: for 5000-word, speaker-independent, continuous speech recognition, the error rate was reduced to 5%.
    Sphinx
    Word error rate
    Citations (57)