For efficient organization of speech recordings - meetings, interviews, voice mails, lectures - the ability to search for spoken keywords is an essential capability. Today, most spoken-document retrieval systems use large-vocabulary recognition. In the above scenarios, such systems suffer from both an unpredictable vocabulary/domain and generally high word-error rates (WER). We present a vocabulary-independent system for indexing and rapidly searching spontaneous speech. A speech recognizer generates lattices of phonetic word fragments, against which keywords are matched phonetically. We first show, on a word-based baseline, the need to use recognition alternatives (lattices) in a high-WER context. Then we introduce our new method of phonetic word-fragment lattice generation, which uses longer-span language knowledge than a phoneme recognizer. Finally, we introduce heuristics to compact the lattices to feasible sizes that can be searched efficiently. On the LDC voice-mail corpus, we show that vocabulary/domain-independent phonetic search is as accurate as a vocabulary/domain-dependent word-lattice baseline system for in-vocabulary keywords (FOMs of 74-75%), and it nearly maintains this accuracy for out-of-vocabulary keywords.
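A minimal sketch of phonetic keyword matching against a phoneme-fragment lattice, in the spirit of the approach above. The lattice encoding, the g2p() lookup, and all scores are illustrative assumptions, not the authors' implementation.

```python
def g2p(keyword):
    """Stand-in grapheme-to-phoneme conversion (assumed lookup table)."""
    lexicon = {"voicemail": ["v", "oy", "s", "m", "ey", "l"]}
    return lexicon[keyword]

def scan_lattice(arcs, query_phones):
    """Linear scan: find chains of connected arcs whose phonemes match the
    query, scoring a hit by the product of arc posteriors."""
    hits = []
    # arcs: list of (start_node, end_node, phoneme, posterior), assumed sorted
    for i, (s, e, ph, p) in enumerate(arcs):
        if ph != query_phones[0]:
            continue
        score, node, k = p, e, 1
        # greedy: follows the first matching continuation only (enough for a sketch)
        for (s2, e2, ph2, p2) in arcs[i + 1:]:
            if k == len(query_phones):
                break
            if s2 == node and ph2 == query_phones[k]:
                score *= p2
                node, k = e2, k + 1
        if k == len(query_phones):
            hits.append((i, score))
    return hits

# Toy lattice containing one matching path (purely illustrative).
arcs = [(0, 1, "v", 0.9), (1, 2, "oy", 0.8), (2, 3, "s", 0.85),
        (3, 4, "m", 0.9), (4, 5, "ey", 0.7), (5, 6, "l", 0.95)]
print(scan_lattice(arcs, g2p("voicemail")))
```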
We examine the task of spoken term detection in Chinese spontaneous speech with a lattice-based approach. We compare lattices generated with different units: word, character, tonal syllable and toneless syllable, and we also look into methods of converting lattices from one unit to another. We find that the best system uses toneless-syllable lattices converted from word lattices. Further improvement is achieved by lattice post-processing and system combination. Our best system has an accuracy of 80.2% on a keyword spotting task.
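One way to picture the unit conversion is to expand each word arc through a pronunciation lexicon into a chain of syllable arcs. The lexicon entries, lattice encoding, and score handling below are assumptions for illustration only.

```python
def word_arc_to_syllable_arcs(arc, lexicon, next_node):
    """Expand one word arc (start, end, word, score) into a chain of syllable
    arcs; the word-level score is carried on the final syllable arc."""
    start, end, word, score = arc
    syllables = lexicon[word]          # e.g. Pinyin without tone marks
    arcs, node = [], start
    for i, syl in enumerate(syllables):
        last = (i == len(syllables) - 1)
        nxt = end if last else next_node()
        arcs.append((node, nxt, syl, score if last else 0.0))
        node = nxt
    return arcs

def make_node_allocator(first_free):
    counter = [first_free]
    def alloc():
        counter[0] += 1
        return counter[0]
    return alloc

lexicon = {"北京": ["bei", "jing"], "大学": ["da", "xue"]}
alloc = make_node_allocator(100)
word_arcs = [(0, 1, "北京", -3.2), (1, 2, "大学", -2.7)]
syllable_lattice = [a for w in word_arcs
                    for a in word_arc_to_syllable_arcs(w, lexicon, alloc)]
print(syllable_lattice)
```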
Automatic generation of punctuation is an essential feature for many speech-to-text transcription tasks. This paper describes a maximum a-posteriori (MAP) approach for inserting punctuation marks into raw word sequences obtained from automatic speech recognition (ASR). The system consists of an "acoustic model" (AM) for prosodic features (specifically pause duration) and a "language model" (LM) for text-only features. The LM combines three components: an MLP-based trigger-word model and forward and backward trigram punctuation predictors. The separation into acoustic and language models allows them to be trained on different corpora, in particular allowing the LM to be trained on large amounts of text for which no acoustic information is available. We find that the trigger-word LM is very useful, and further improvement can be achieved by combining both prosodic and lexical information. We achieve F-measures of 81.0% and 56.5% for voicemails and podcasts, respectively, on reference transcripts, and 69.6% for voicemails on ASR transcripts.
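A hedged sketch of the MAP decision rule consistent with the AM/LM separation described above; the notation and the exact way the three LM components are combined are assumptions here.

$$\hat{P} \;=\; \arg\max_{P}\; p(P \mid W, F) \;=\; \arg\max_{P}\; p(F \mid W, P)\; P(P \mid W)$$

where W is the raw ASR word sequence, P a candidate punctuation sequence, and F the pause-duration features; p(F | W, P) plays the role of the "acoustic model" and P(P | W) that of the "language model", the latter obtained by combining (for example, log-linearly interpolating) the trigger-word MLP with the forward and backward trigram predictors.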
This paper presents an approach to learning a better music similarity measure and an application to music playlist generation. Unlike previous work, our approach represents each song by automatically detected music attributes. A set of kernels is employed in the similarity measure, with each kernel operating on a subset of the music attributes and carrying its own importance weight. For automatic playlist generation, a ranking method is presented that considers multiple seed songs and possible outlier seeds. Experiments show the effectiveness of the proposed approach, and the quality of playlists generated from automatic annotations is comparable to that of playlists generated from manual annotations.
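A minimal sketch of a weighted multi-kernel similarity and seed-based ranking of the kind described above; the attribute groupings, the RBF kernel choice, the weights, and the aggregation over seeds are illustrative assumptions.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def similarity(song_a, song_b, weights):
    """song_* : dict mapping attribute-group name -> feature vector."""
    return sum(w * rbf(song_a[g], song_b[g]) for g, w in weights.items())

def rank_playlist(candidates, seeds, weights):
    """Rank candidates by mean similarity to the seed songs; the paper's
    handling of multiple seeds and outlier seeds may use a more robust
    aggregate than the plain mean used here."""
    scores = [(name, np.mean([similarity(c, s, weights) for s in seeds]))
              for name, c in candidates.items()]
    return sorted(scores, key=lambda t: -t[1])

weights = {"genre": 0.5, "mood": 0.3, "tempo": 0.2}
seeds = [{"genre": np.array([1.0, 0.0]), "mood": np.array([0.7]),
          "tempo": np.array([120.0])}]
candidates = {"song_x": {"genre": np.array([0.9, 0.1]), "mood": np.array([0.6]),
                         "tempo": np.array([118.0])}}
print(rank_playlist(candidates, seeds, weights))
```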
In this paper, we evaluate the effectiveness of adaptation methods for context-dependent deep-neural-network hidden Markov models (CD-DNN-HMMs) for automatic speech recognition. We investigate the affine transformation and several of its variants for adapting the top hidden layer. We compare the affine transformations against direct adaptation of the softmax layer weights. The feature-space discriminative linear regression (fDLR) method, which applies the affine transformations to the input layer, is also evaluated. On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performance. With a batch update implementation, the softmax layer adaptation technique reduces WERs by 10%. We observe that a bias shift alone performs as well as scaling plus a bias shift.
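A minimal numpy sketch of top-hidden-layer adaptation with an affine transform, assuming a pre-trained DNN whose weights stay frozen; only the affine parameters are updated on adaptation data. Shapes, the single-frame setup, and the learning rate are toy assumptions, and the update minimizes cross-entropy (equivalently, ascends the log-likelihood).

```python
import numpy as np

rng = np.random.default_rng(0)
H, S = 512, 100                      # hidden size, number of senones (toy)
h = rng.standard_normal(H)           # top hidden activation for one frame
W_soft = rng.standard_normal((S, H)) # frozen softmax-layer weights
target = 3                           # senone label for this frame

A = np.eye(H)                        # speaker-specific affine transform,
b = np.zeros(H)                      # initialized to identity / zero

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.01
for _ in range(10):                  # a few stochastic gradient steps
    h_adapt = A @ h + b              # adapted top hidden layer
    p = softmax(W_soft @ h_adapt)    # frozen softmax layer
    g_logits = p.copy()              # cross-entropy gradient w.r.t. logits
    g_logits[target] -= 1.0
    g_h = W_soft.T @ g_logits        # back-propagate to adapted activations
    A -= lr * np.outer(g_h, h)       # update only the affine parameters
    b -= lr * g_h
```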
Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, the internal features learned by the DNN are relatively stable with respect to speaker differences, bandwidth differences, and environment distortion. This enables DNN-based recognizers to perform as well as or better than state-of-the-art systems based on GMMs or shallow networks, without the need for explicit model adaptation or feature normalization.
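A toy sketch of how the layer-wise sensitivity to small input perturbations could be measured. The network below is random and untrained, so it only illustrates the measurement itself, not the trained-DNN behavior reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [40, 256, 256, 256, 256, 256]            # input dim + 5 hidden layers
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(sizes[:-1], sizes[1:])]

def hidden_activations(x):
    acts, h = [], x
    for W in weights:
        h = 1.0 / (1.0 + np.exp(-(W @ h)))       # sigmoid hidden layers
        acts.append(h)
    return acts

x = rng.standard_normal(sizes[0])
eps = 1e-3 * rng.standard_normal(sizes[0])       # small input perturbation

for l, (h, h_p) in enumerate(zip(hidden_activations(x),
                                 hidden_activations(x + eps)), start=1):
    sensitivity = np.linalg.norm(h_p - h) / np.linalg.norm(eps)
    print(f"layer {l}: |dh| / |dx| = {sensitivity:.3f}")
```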
We propose the joint speech translation and recognition (JSTAR) model, which leverages a fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both the ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including, for the first time, training a transducer-based streaming machine translation (MT) model and using it to initialize JSTAR's parameters. We demonstrate superior performance of JSTAR compared to a strong cascaded ST model in both BLEU score and latency.
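One plausible form of the multi-objective training criterion is a weighted sum of transducer losses over the ASR and ST label sequences; the weighting scheme and notation below are assumptions, not the paper's exact objective.

$$\mathcal{L}_{\text{JSTAR}} \;=\; \lambda_{\text{ASR}}\,\mathcal{L}_{\text{T}}\bigl(y^{\text{ASR}} \mid x\bigr) \;+\; \lambda_{\text{ST}}\,\mathcal{L}_{\text{T}}\bigl(y^{\text{ST}} \mid x\bigr)$$

where x is the input speech, y^ASR and y^ST are the transcription and translation targets, L_T denotes the transducer loss, and the lambda weights balance the two objectives.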
For efficient organization of speech recordings - meetings, interviews, voice mails, lectures - the ability to search for spoken keywords is an essential capability. In Seide et al. (2004) and Yu et al. (2004), we presented our work on vocabulary-independent search in spontaneous speech. That method involved linear scanning of phonetic lattices and thus did not scale to large collections. In this paper, we present a two-stage approach to fast search: first we retrieve, from an index-like structure, segments that are likely to contain the keyword; then we locate individual keyword occurrences by a detailed linear lattice scan. However, designing an efficient vocabulary-independent indexing structure is non-trivial. We use a "soft" index, similar to Allauzen et al., that provides expected term frequencies (ETF) of query terms. We propose to approximate the ETF by M-gram phoneme language models estimated on the lattices (one per segment). Our index stores these language models in an inverted structure. Word spotting experiments on voicemails show that with this two-stage method, we lose less than 4% relative in figure of merit (FOM) at a 25-fold speed-up compared with a full linear search.
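A hedged sketch of the first stage: ranking segments by an expected-term-frequency proxy computed from a per-segment phoneme M-gram language model (M = 2 here for brevity), with the detailed lattice scan then run only on the top-ranked segments. The LM contents, back-off scheme, and cutoff are assumptions for illustration.

```python
def query_probability(phones, bigram_lm, unigram_lm):
    """P(query phone string) under a bigram phoneme LM with a naive back-off."""
    p = unigram_lm.get(phones[0], 1e-4)
    for prev, cur in zip(phones, phones[1:]):
        p *= bigram_lm.get((prev, cur), 1e-4 * unigram_lm.get(cur, 1e-4))
    return p

def stage_one(segments, phones, top_n=10):
    """segments: dict name -> (num_phone_slots, bigram_lm, unigram_lm).
    Returns the segments most likely to contain the query, by ETF proxy."""
    scored = [(name, n * query_probability(phones, bg, ug))
              for name, (n, bg, ug) in segments.items()]
    return sorted(scored, key=lambda t: -t[1])[:top_n]

segments = {
    "msg_001": (250, {("k", "ae"): 0.2, ("ae", "t"): 0.3},
                {"k": 0.05, "ae": 0.1, "t": 0.08}),
    "msg_002": (300, {("k", "ae"): 0.01},
                {"k": 0.02, "ae": 0.03, "t": 0.04}),
}
print(stage_one(segments, ["k", "ae", "t"]))
```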
An automatic system for detection of pronunciation errors by adult learners of English is embedded in a language-learning package. Its four main features are: (1) a recognizer robust to non-native speech; (2) localization of phone- and word-level errors; (3) diagnosis of what sorts of phone-level errors took place; and (4) a lexical-stress detector. Together, these tools allow robust, consistent, and specific feedback on pronunciation errors, unlike many previous systems that provide feedback only at a more general level. The diagnosis technique searches for errors expected based on the student's mother tongue and uses a separate bias for each error in order to maintain a particular desired global false-alarm rate. Results are presented for non-native recognition on tasks of differing complexity and for diagnosis, based on a data set of artificial errors, showing that this method can detect many contrasts with a high hit rate and a low false-alarm rate.
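A minimal sketch of the per-error bias idea: each expected L1-dependent error detector compares an "error" score against the canonical-pronunciation score, shifted by a bias tuned so that every detector operates at the same target false-alarm rate. The scores, the example substitution, and the bias-tuning procedure below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def tune_bias(correct_score_diffs, target_fa_rate):
    """Choose the bias so that only target_fa_rate of correct pronunciations
    would be flagged: the (1 - target_fa_rate) quantile of score differences
    observed on correctly pronounced data."""
    return float(np.quantile(correct_score_diffs, 1.0 - target_fa_rate))

def diagnose(score_error, score_canonical, bias):
    """Flag the expected error if its score beats the canonical score by more
    than the tuned bias."""
    return (score_error - score_canonical) > bias

# Example: expected substitution /th/ -> /s/ for a particular mother tongue.
correct_diffs = np.random.default_rng(1).normal(-2.0, 1.0, 1000)
bias = tune_bias(correct_diffs, target_fa_rate=0.05)
print(diagnose(score_error=-3.0, score_canonical=-5.5, bias=bias))
```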
This paper describes a design and feasibility study for a large-scale automatic directory-information system with a scalable architecture. The current demonstrator, called PADIS-XL, operates in real time and handles a database of a medium-sized German city with 130,000 listings. The system uses a new technique of making a combined decision on the joint probability over multiple dialogue turns, and a dialogue strategy that strives to restrict the search space further with every dialogue turn. During the course of the dialogue, the last name of the desired subscriber must be spelled out. The spelling recognizer permits continuous spelling and uses a context-free grammar to parse common spelling expressions. This paper describes the system architecture, our maximum a-posteriori (MAP) decision rule, the spelling grammar, and the dialogue strategy. We give results on the SPEECHDAT and SIETILL databases for recognition of first names by spelling and for jointly deciding on the spelled and the spoken name. In a 35,000-name setup, the joint decision reduced name-recognition errors by 31%.
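A hedged sketch of a joint MAP decision over the spoken and spelled renderings of a name, assuming the acoustics of the two dialogue turns are conditionally independent given the name; the notation and independence assumption are illustrative and the paper's actual decision rule may condition differently.

$$\hat{n} \;=\; \arg\max_{n \in \mathcal{N}} \; P(n)\; p\bigl(X_{\text{spoken}} \mid n\bigr)\; p\bigl(X_{\text{spelled}} \mid n\bigr)$$

where N is the set of listed names, P(n) a prior over names, and X_spoken and X_spelled the acoustic observations from the spoken-name and spelled-name turns.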