We investigate an alternative formulation of phonetic feature representations for SVM-based speaker verification. The new features are based on conditional likelihood representations rather than the joint-likelihood or bag-of-n-gram calculations traditionally used. Conditional likelihoods are shown to be a more natural method of modelling phonetic information, improving upon conventional joint likelihoods in a number of cases. The problem of feature normalisation is also examined, with a previously proposed non-parametric, rank-based method shown to be particularly useful. Combinations of feature representations are examined, and the potential for complementary information between joint and conditional likelihoods is considered. Additionally, feature compensation is applied to conditional likelihoods, yielding considerable improvements in performance.
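The distinction between joint and conditional bigram likelihoods can be illustrated with a toy sketch (the function name and normalisations below are illustrative assumptions, not the paper's exact feature formulation):

```python
from collections import Counter

def bigram_features(phone_seq, vocab):
    """Toy contrast of the two representations: the joint feature for
    a bigram (a, b) is c(a, b) / N over all bigrams, while the
    conditional feature is c(a, b) / c(a), i.e. P(b | a)."""
    bigrams = list(zip(phone_seq, phone_seq[1:]))
    big_counts = Counter(bigrams)
    hist_counts = Counter(phone_seq[:-1])  # counts of bigram histories
    n = len(bigrams)
    joint, cond = {}, {}
    for a in vocab:
        for b in vocab:
            c_ab = big_counts.get((a, b), 0)
            joint[(a, b)] = c_ab / n if n else 0.0
            cond[(a, b)] = c_ab / hist_counts[a] if hist_counts[a] else 0.0
    return joint, cond
```

Conditioning on the history normalises away how often each phone occurs, which is one intuition for why conditional likelihoods model phonetic context more naturally.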
This paper proposes a novel method for speech endpoint detection. The developed method utilises gradient-based edge-detection algorithms from image processing to detect the boundaries of continuous speech in noisy conditions. It is simple and has low computational complexity. The accuracy of the proposed method was evaluated and compared against the ITU-T G.729 Annex B voice activity detection (VAD) algorithm. To do this, the two algorithms were tested on a synthetically produced noisy-speech database consisting of noisy-speech signals of various lengths and SNRs. The results indicate that the developed method outperforms the G.729B VAD algorithm across a range of signal-to-noise ratios.
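As a rough sketch of the idea (not the paper's exact algorithm; frame size, smoothing width, and threshold are illustrative assumptions), a 1-D edge detector can be applied to the gradient of a smoothed log-energy contour:

```python
import numpy as np

def energy_edges(signal, frame_len=160, smooth=5, thresh=0.5):
    """Treat the smoothed log-energy contour as a 1-D 'image' and mark
    candidate speech boundaries where the gradient magnitude exceeds
    a threshold. Returns frame indices of candidate edges."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_e = np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    kernel = np.ones(smooth) / smooth            # moving-average smoother
    smoothed = np.convolve(log_e, kernel, mode="same")
    grad = np.gradient(smoothed)                 # 1-D edge detector
    return np.where(np.abs(grad) > thresh)[0]
```

A real system would add hysteresis and merge nearby edges; the point here is only that boundary detection reduces to thresholding a gradient, as in image edge detection.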
In this paper, the Bayes factor is considered as a replacement for the likelihood-ratio test as the verification criterion in GMM-based speaker verification. An advantage of this Bayesian method is that it allows prior information and the uncertainty of parameter estimates to be incorporated into the scoring process, complementing the Bayesian adaptation used in training. A development of Bayes factors for GMMs based on incremental adaptation is presented that is well suited to inclusion in existing GMM-UBM systems. This method is extended to weight test frames to account for their statistical dependencies. Experiments on the 1999 NIST Speaker Recognition Evaluation corpus demonstrate improved performance over expected log-likelihood ratio scoring. These findings are supported by results from a modified version of the 2003 NIST Extended Data corpus.
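The expected log-likelihood ratio baseline that the Bayes factor is compared against can be sketched for one-dimensional GMMs as the per-frame average of target and UBM log-likelihood differences (helper names are illustrative; the Bayes factor computation itself is not reproduced here):

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log-density of a one-dimensional GMM evaluated at each point in x."""
    w, m, v = (np.asarray(a, float) for a in (weights, means, variances))
    x = np.asarray(x, float)[:, None]
    comp = -0.5 * np.log(2 * np.pi * v) - (x - m) ** 2 / (2 * v)
    return np.logaddexp.reduce(np.log(w) + comp, axis=1)

def ellr_score(frames, target, ubm):
    """Expected log-likelihood ratio: the per-frame average of
    log p(x|target) - log p(x|UBM). `target` and `ubm` are
    (weights, means, variances) triples."""
    return float(np.mean(gmm_logpdf(frames, *target)
                         - gmm_logpdf(frames, *ubm)))
```

Note that this point-estimate score ignores parameter uncertainty entirely, which is the gap the Bayes factor formulation addresses.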
This paper presents a novel technique for segmenting an audio stream into homogeneous regions according to speaker identities, background noise, music, environmental and channel conditions. Audio segmentation is useful in audio diarization systems, which aim to annotate an input audio stream with information that attributes temporal regions of the audio to their specific sources. The segmentation method introduced in this paper uses the Generalized Likelihood Ratio (GLR), computed between two adjacent windows that slide over preprocessed speech. This approach is inspired by the pioneering segmentation method of Chen and Gopalakrishnan, which uses the Bayesian Information Criterion (BIC) with an expanding search window; this paper identifies and addresses the shortcomings of that approach. The proposed segmentation strategy is evaluated on the 2002 Rich Transcription (RT-02) Evaluation dataset, achieving a miss rate of 19.47% and a false alarm rate of 16.94% at the optimal threshold.
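A minimal sketch of the GLR between two adjacent windows, under single full-covariance Gaussian models with ML estimates (constant terms that cancel in the ratio are dropped; window size and features are up to the caller, and the paper's preprocessing is omitted):

```python
import numpy as np

def glr_distance(win1, win2):
    """GLR change score between two adjacent windows of feature
    vectors (rows are frames): larger values suggest a speaker or
    condition change at the window boundary."""
    def neg_loglik(x):
        # The ML Gaussian log-likelihood of x under its own fit reduces
        # to (N/2) log|Sigma_hat| plus constants that cancel in the ratio.
        cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
        return 0.5 * len(x) * np.linalg.slogdet(cov)[1]
    both = np.vstack([win1, win2])
    return neg_loglik(both) - neg_loglik(win1) - neg_loglik(win2)
```

Sliding the window pair across the stream and peak-picking this score over time yields the candidate change points.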
This paper compares two of the leading techniques for session variability compensation in the context of support vector machine (SVM) speaker verification using Gaussian mixture model (GMM) mean supervectors: joint factor analysis (JFA) modeling and nuisance attribute projection (NAP). Motivation for this comparison comes from the distinctly different domains in which these techniques are employed: the probabilistic GMM domain versus the discriminative SVM kernel. A theoretical analysis is given comparing the JFA and NAP approaches to variability compensation. The role of speaker factors in the factor analysis model is also contrasted against the scatter difference NAP objective of retaining speaker information in the SVM kernel space. These methods for retaining speaker variation are found to provide improved verification performance over the removal of channel effects alone. Overall, experimental results on the NIST 2006 and 2008 SRE corpora demonstrate the effectiveness of both JFA and NAP techniques for reducing the effects of variability. However, the overheads associated with the implementation of JFA may make NAP a more attractive technique due to its simple yet effective approach to variability compensation.
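The NAP side of the comparison is simple to state: nuisance (session) directions spanned by an orthonormal basis U are projected out of each supervector with P = I - UU^T. A minimal sketch, assuming U has orthonormal columns and that supervectors are stored as rows:

```python
import numpy as np

def nap_project(supervectors, U):
    """Apply the nuisance attribute projection P = I - U @ U.T to each
    row of `supervectors`, removing the session subspace spanned by
    the orthonormal columns of U before SVM kernel evaluation."""
    return supervectors - (supervectors @ U) @ U.T
```

The simplicity of this single matrix operation, relative to the iterative estimation machinery of JFA, is the implementation overhead argument the abstract makes for NAP.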
This study assesses the recently proposed data-driven background dataset refinement technique for speaker verification using SVM feature sets other than the GMM supervector features for which it was originally designed. The performance improvements brought about in each trialled SVM configuration demonstrate the versatility of background dataset refinement. This work also extends the originally proposed technique by exploiting support vector coefficients as an impostor-suitability metric in the data-driven selection process. Using support vector coefficients improved the performance of the refined datasets in the evaluation of unseen data. Further, attempts are made to exploit the differences in impostor-suitability measures across the different feature spaces to provide added robustness.
Proposed is an approach to estimating confidence measures on the verification score produced by a Gaussian mixture model (GMM)-based automatic speaker verification system, with application to drastically reducing the typical data requirements for producing a confident verification decision. The confidence measures are based on estimating the distribution of the observed frame scores. The confidence estimation procedure is also extended to produce robust results with very limited and highly correlated frame scores, as well as in the presence of score normalization. The proposed Early Verification Decision method utilizes the developed confidence measures in a sequential hypothesis testing framework, demonstrating that as little as 2–10 s of speech on average produced verification results approaching those obtained using an average of over 100 s of speech on the 2005 NIST SRE protocol.
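A simplified version of the sequential idea can be sketched as a running confidence interval on the mean frame score (the constants and names are assumptions, and the frame-correlation and score-normalization corrections described in the abstract are omitted):

```python
import numpy as np

def early_decision(frame_scores, threshold, z=1.96, min_frames=50):
    """After each new frame score, form a z-based confidence interval
    on the mean frame score; return ('accept'|'reject', frames_used)
    as soon as the interval clears the threshold, else (None, total)."""
    s = np.asarray(frame_scores, float)
    for t in range(min_frames, len(s) + 1):
        mean = s[:t].mean()
        half = z * s[:t].std(ddof=1) / np.sqrt(t)
        if mean - half > threshold:
            return "accept", t
        if mean + half < threshold:
            return "reject", t
    return None, len(s)
```

Easy trials terminate after very few frames while borderline trials keep accumulating evidence, which is the mechanism behind the 2–10 s average decision times.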
This paper examines combining the relevance MAP and subspace speaker adaptation processes to train GMM speaker models for use in speaker verification systems, with a particular focus on short utterance lengths. The subspace speaker adaptation method develops a speaker GMM mean supervector as the sum of a speaker-independent prior distribution and a speaker-dependent offset constrained to lie within a low-rank subspace, and has been shown to improve accuracy over ordinary relevance MAP when the amount of training data is limited. Testing on NIST SRE data shows that combining the two processes provides speaker models that yield modest improvements in verification accuracy in limited-data situations, in addition to improving the performance of the speaker verification system when a larger amount of training data is available.
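The two ingredients can be sketched in their simplest forms (illustrative helpers; the paper's joint estimation procedure is not reproduced): a supervector built from a prior plus a low-rank offset, and the classical relevance-MAP mean update that it is combined with:

```python
import numpy as np

def speaker_supervector(m0, V, y):
    """Subspace model: speaker supervector = speaker-independent prior
    m0 plus an offset V @ y constrained to the low-rank subspace
    spanned by the columns of V."""
    return m0 + V @ y

def relevance_map_mean(prior_mean, data_mean, n_frames, r=16.0):
    """Classical relevance-MAP update of a Gaussian mean: interpolate
    between prior and data mean with weight alpha = n / (n + r),
    where r is the relevance factor."""
    alpha = n_frames / (n_frames + r)
    return alpha * data_mean + (1 - alpha) * prior_mean
```

With little data, alpha stays small and the subspace offset carries most of the speaker information; with ample data, the MAP term dominates, which matches the reported behaviour of the combination.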
This paper presents a method of voice activity detection (VAD) for high-noise scenarios, using a noise-robust voiced speech detection feature. The developed method is based on the fusion of two systems. The first system utilises the maximum peak of the normalised time-domain autocorrelation function (MaxPeak). The second system uses a novel combination of the cross-correlation and zero-crossing rate of the normalised autocorrelation to approximate a measure of signal pitch and periodicity (CrossCorr) that is hypothesised to be noise robust. The scores output by the two systems are then merged using weighted-sum fusion to create the proposed autocorrelation zero-crossing rate (AZR) VAD. The accuracy of AZR was compared to state-of-the-art and standardised VAD methods, and it was shown to outperform the best of these with an average relative improvement of 24.8% in half-total error rate (HTER) on the QUT-NOISE-TIMIT database, created using real recordings from high-noise environments.
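The MaxPeak feature of the first system can be sketched as follows (the lag range and normalisation below are illustrative assumptions for 8 kHz speech, not the paper's exact settings):

```python
import numpy as np

def max_peak(frame, min_lag=20, max_lag=160):
    """MaxPeak-style voicing feature: the maximum of the normalised
    autocorrelation over a plausible pitch-lag range. Voiced frames
    score near 1; noise-like frames score near 0."""
    x = frame - np.mean(frame)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                  # normalise by zero-lag energy
    return float(np.max(ac[min_lag:max_lag]))
```

Because periodicity survives additive noise better than raw energy does, thresholding or fusing this score is one plausible reason for the robustness reported on QUT-NOISE-TIMIT.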
The recently proposed data-driven background dataset refinement technique provides a means of selecting an informative background for support vector machine (SVM)-based speaker verification systems. This paper investigates the characteristics of the impostor examples in such highly informative background datasets. Data-driven dataset refinement individually evaluates the suitability of candidate impostor examples for the SVM background prior to selecting the highest-ranking examples as a refined background dataset. The characteristics of the refined dataset were then analyzed to investigate the desired traits of an informative SVM background. The most informative examples in the refined dataset were found to contain large amounts of active speech and distinctive language characteristics. The data-driven refinement technique was shown to filter the set of candidate impostor examples to produce a more dispersed representation of the impostor population in the SVM kernel space, thereby reducing the number of redundant and less informative examples in the background dataset. Furthermore, data-driven refinement was shown to provide performance gains when applied to the difficult task of refining a small candidate dataset mismatched to the evaluation conditions.
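The selection step of refinement reduces, in outline, to ranking candidates by a precomputed suitability score and keeping the top fraction (the suitability metric itself, and the fraction kept, are left abstract here):

```python
import numpy as np

def refine_background(suitability, candidates, keep_frac=0.3):
    """Data-driven refinement in outline: rank candidate impostor
    examples by a precomputed suitability score (higher is better)
    and keep only the top fraction as the refined background."""
    order = np.argsort(suitability)[::-1]            # best first
    n_keep = max(1, int(len(candidates) * keep_frac))
    return [candidates[i] for i in order[:n_keep]]
```

The analysis in the paper is about what the surviving examples have in common (long active speech, distinctive language, dispersion in kernel space), not about this mechanical selection step itself.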