For efficient organization of speech recordings - meetings, interviews, voice mails, lectures - the ability to search for spoken keywords is an essential capability. Today, most spoken-document retrieval systems use large-vocabulary recognition. In the above scenarios, such systems suffer from both an unpredictable vocabulary/domain and generally high word-error rates (WER). We present a vocabulary-independent system for indexing and rapidly searching spontaneous speech. A speech recognizer generates lattices of phonetic word fragments, against which keywords are matched phonetically. We first show, on a word-based baseline, the need to use recognition alternatives (lattices) in a high-WER context. Then we introduce our new method of phonetic word-fragment lattice generation, which uses longer-span language knowledge than a phoneme recognizer. Finally, we introduce heuristics to compact the lattices to feasible sizes that can be searched efficiently. On the LDC voice-mail corpus, we show that vocabulary/domain-independent phonetic search is as accurate as a vocabulary/domain-dependent word-lattice baseline system for in-vocabulary keywords (FOMs of 74-75%), and it nearly maintains this accuracy for out-of-vocabulary keywords.
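A minimal sketch of phonetic keyword matching against a phoneme-fragment lattice, in the spirit of the approach above. The lattice encoding, the g2p() lookup, and all scores are illustrative assumptions, not the authors' implementation.

```python
def g2p(keyword):
    """Stand-in grapheme-to-phoneme conversion (assumed lookup table)."""
    lexicon = {"voicemail": ["v", "oy", "s", "m", "ey", "l"]}
    return lexicon[keyword]

def scan_lattice(arcs, query_phones):
    """Linear scan: find chains of connected arcs whose phonemes match the
    query, scoring a hit by the product of arc posteriors."""
    hits = []
    # arcs: list of (start_node, end_node, phoneme, posterior), assumed sorted
    for i, (s, e, ph, p) in enumerate(arcs):
        if ph != query_phones[0]:
            continue
        score, node, k = p, e, 1
        # greedy: follows the first matching continuation only (enough for a sketch)
        for (s2, e2, ph2, p2) in arcs[i + 1:]:
            if k == len(query_phones):
                break
            if s2 == node and ph2 == query_phones[k]:
                score *= p2
                node, k = e2, k + 1
        if k == len(query_phones):
            hits.append((i, score))
    return hits

# Toy lattice containing one matching path (purely illustrative).
arcs = [(0, 1, "v", 0.9), (1, 2, "oy", 0.8), (2, 3, "s", 0.85),
        (3, 4, "m", 0.9), (4, 5, "ey", 0.7), (5, 6, "l", 0.95)]
print(scan_lattice(arcs, g2p("voicemail")))
```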
We examine the task of spoken term detection in Chinese spontaneous speech with a lattice-based approach. We compare lattices generated with different units: word, character, tonal syllable and toneless syllable, and we also look into methods of converting lattices from one unit to another. We find that the best system uses toneless-syllable lattices converted from word lattices. Further improvement is achieved by lattice post-processing and system combination. Our best system has an accuracy of 80.2% on a keyword spotting task.
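One way to picture the unit conversion is to expand each word arc through a pronunciation lexicon into a chain of syllable arcs. The lexicon entries, lattice encoding, and score handling below are assumptions for illustration only.

```python
def word_arc_to_syllable_arcs(arc, lexicon, next_node):
    """Expand one word arc (start, end, word, score) into a chain of syllable
    arcs; the word-level score is carried on the final syllable arc."""
    start, end, word, score = arc
    syllables = lexicon[word]          # e.g. Pinyin without tone marks
    arcs, node = [], start
    for i, syl in enumerate(syllables):
        last = (i == len(syllables) - 1)
        nxt = end if last else next_node()
        arcs.append((node, nxt, syl, score if last else 0.0))
        node = nxt
    return arcs

def make_node_allocator(first_free):
    counter = [first_free]
    def alloc():
        counter[0] += 1
        return counter[0]
    return alloc

lexicon = {"北京": ["bei", "jing"], "大学": ["da", "xue"]}
alloc = make_node_allocator(100)
word_arcs = [(0, 1, "北京", -3.2), (1, 2, "大学", -2.7)]
syllable_lattice = [a for w in word_arcs
                    for a in word_arc_to_syllable_arcs(w, lexicon, alloc)]
print(syllable_lattice)
```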
Automatic generation of punctuation is an essential feature for many speech-to-text transcription tasks. This paper describes a maximum a-posteriori (MAP) approach for inserting punctuation marks into raw word sequences obtained from automatic speech recognition (ASR). The system consists of an "acoustic model" (AM) for prosodic features (specifically pause duration) and a "language model" (LM) for text-only features. The LM combines three components: an MLP-based trigger-word model and forward and backward trigram punctuation predictors. The separation into acoustic and language models allows them to be trained on different corpora, in particular allowing the LM to be trained on large amounts of text for which no acoustic information is available. We find that the trigger-word LM is very useful, and further improvement can be achieved by combining both prosodic and lexical information. We achieve F-measures of 81.0% and 56.5% for voicemails and podcasts, respectively, on reference transcripts, and 69.6% for voicemails on ASR transcripts.
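A hedged sketch of the MAP decision rule consistent with the AM/LM separation described above; the notation and the exact way the three LM components are combined are assumptions here.

$$\hat{P} \;=\; \arg\max_{P}\; p(P \mid W, F) \;=\; \arg\max_{P}\; p(F \mid W, P)\; P(P \mid W)$$

where W is the raw ASR word sequence, P a candidate punctuation sequence, and F the pause-duration features; p(F | W, P) plays the role of the "acoustic model" and P(P | W) that of the "language model", the latter obtained by combining (for example, log-linearly interpolating) the trigger-word MLP with the forward and backward trigram predictors.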
This paper presents an approach to learning a better music similarity measure and an application to music playlist generation. Unlike previous work, our approach represents each song by automatically detected music attributes. A set of kernels is employed in the similarity measure, with each kernel operating on a subset of the music attributes and carrying its own importance weight. For automatic playlist generation, a ranking method is presented that considers multiple seed songs and possible outlier seeds. Experiments show the effectiveness of the proposed approach, and the quality of playlists generated from automatic annotations is comparable to that of playlists generated from manual annotations.
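A minimal sketch of a weighted multi-kernel similarity and seed-based ranking of the kind described above; the attribute groupings, the RBF kernel choice, the weights, and the aggregation over seeds are illustrative assumptions.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def similarity(song_a, song_b, weights):
    """song_* : dict mapping attribute-group name -> feature vector."""
    return sum(w * rbf(song_a[g], song_b[g]) for g, w in weights.items())

def rank_playlist(candidates, seeds, weights):
    """Rank candidates by mean similarity to the seed songs; the paper's
    handling of multiple seeds and outlier seeds may use a more robust
    aggregate than the plain mean used here."""
    scores = [(name, np.mean([similarity(c, s, weights) for s in seeds]))
              for name, c in candidates.items()]
    return sorted(scores, key=lambda t: -t[1])

weights = {"genre": 0.5, "mood": 0.3, "tempo": 0.2}
seeds = [{"genre": np.array([1.0, 0.0]), "mood": np.array([0.7]),
          "tempo": np.array([120.0])}]
candidates = {"song_x": {"genre": np.array([0.9, 0.1]), "mood": np.array([0.6]),
                         "tempo": np.array([118.0])}}
print(rank_playlist(candidates, seeds, weights))
```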
In this paper, we evaluate the effectiveness of adaptation methods for context-dependent deep-neural-network hidden Markov models (CD-DNN-HMMs) for automatic speech recognition. We investigate the affine transformation and several of its variants for adapting the top hidden layer. We compare the affine transformations against direct adaptation of the softmax layer weights. The feature-space discriminative linear regression (fDLR) method, which applies the affine transformations to the input layer, is also evaluated. On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performance. With a batch update implementation, the softmax layer adaptation technique reduces WERs by 10%. We observe that a bias shift alone performs as well as scaling plus a bias shift.
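A minimal numpy sketch of top-hidden-layer adaptation with an affine transform, assuming a pre-trained DNN whose weights stay frozen; only the affine parameters are updated on adaptation data. Shapes, the single-frame setup, and the learning rate are toy assumptions, and the update minimizes cross-entropy (equivalently, ascends the log-likelihood).

```python
import numpy as np

rng = np.random.default_rng(0)
H, S = 512, 100                      # hidden size, number of senones (toy)
h = rng.standard_normal(H)           # top hidden activation for one frame
W_soft = rng.standard_normal((S, H)) # frozen softmax-layer weights
target = 3                           # senone label for this frame

A = np.eye(H)                        # speaker-specific affine transform,
b = np.zeros(H)                      # initialized to identity / zero

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.01
for _ in range(10):                  # a few stochastic gradient steps
    h_adapt = A @ h + b              # adapted top hidden layer
    p = softmax(W_soft @ h_adapt)    # frozen softmax layer
    g_logits = p.copy()              # cross-entropy gradient w.r.t. logits
    g_logits[target] -= 1.0
    g_h = W_soft.T @ g_logits        # back-propagate to adapted activations
    A -= lr * np.outer(g_h, h)       # update only the affine parameters
    b -= lr * g_h
```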
Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, the internal features learned by the DNN are relatively stable with respect to speaker differences, bandwidth differences, and environment distortion. This enables DNN-based recognizers to perform as well as or better than state-of-the-art systems based on GMMs or shallow networks, without the need for explicit model adaptation or feature normalization.
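A toy sketch of how the layer-wise sensitivity to small input perturbations could be measured. The network below is random and untrained, so it only illustrates the measurement itself, not the trained-DNN behavior reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [40, 256, 256, 256, 256, 256]            # input dim + 5 hidden layers
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(sizes[:-1], sizes[1:])]

def hidden_activations(x):
    acts, h = [], x
    for W in weights:
        h = 1.0 / (1.0 + np.exp(-(W @ h)))       # sigmoid hidden layers
        acts.append(h)
    return acts

x = rng.standard_normal(sizes[0])
eps = 1e-3 * rng.standard_normal(sizes[0])       # small input perturbation

for l, (h, h_p) in enumerate(zip(hidden_activations(x),
                                 hidden_activations(x + eps)), start=1):
    sensitivity = np.linalg.norm(h_p - h) / np.linalg.norm(eps)
    print(f"layer {l}: |dh| / |dx| = {sensitivity:.3f}")
```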
We propose the joint speech translation and recognition (JSTAR) model, which leverages a fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both the ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including, for the first time, training a transducer-based streaming machine translation (MT) model and using it to initialize JSTAR's parameters. We demonstrate superior performance of JSTAR compared to a strong cascaded ST model in both BLEU score and latency.
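One plausible form of the multi-objective training criterion is a weighted sum of transducer losses over the ASR and ST label sequences; the weighting scheme and notation below are assumptions, not the paper's exact objective.

$$\mathcal{L}_{\text{JSTAR}} \;=\; \lambda_{\text{ASR}}\,\mathcal{L}_{\text{T}}\bigl(y^{\text{ASR}} \mid x\bigr) \;+\; \lambda_{\text{ST}}\,\mathcal{L}_{\text{T}}\bigl(y^{\text{ST}} \mid x\bigr)$$

where x is the input speech, y^ASR and y^ST are the transcription and translation targets, L_T denotes the transducer loss, and the lambda weights balance the two objectives.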
For efficient organization of speech recordings - meetings, interviews, voice mails, lectures - the ability to search for spoken keywords is an essential capability. In Seide et al. (2004) and Yu et al. (2004), we presented our work on vocabulary-independent search in spontaneous speech. That method involved linear scanning of phonetic lattices and thus did not scale to large collections. In this paper, we present a two-stage approach to fast search: first we retrieve, from an index-like structure, segments that are likely to contain the keyword; then we locate individual keyword occurrences by a detailed linear lattice scan. However, designing an efficient vocabulary-independent indexing structure is non-trivial. We use a "soft" index, similar to Allauzen et al., that provides expected term frequencies (ETF) of query terms. We propose to approximate the ETF by M-gram phoneme language models estimated on the lattices (one per segment). Our index stores these language models in an inverted structure. Word spotting experiments on voicemails show that with this two-stage method, we lose less than 4% relative in figure of merit (FOM) at a 25-fold speed-up compared with a full linear search.
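A hedged sketch of the first stage: ranking segments by an expected-term-frequency proxy computed from a per-segment phoneme M-gram language model (M = 2 here for brevity), with the detailed lattice scan then run only on the top-ranked segments. The LM contents, back-off scheme, and cutoff are assumptions for illustration.

```python
def query_probability(phones, bigram_lm, unigram_lm):
    """P(query phone string) under a bigram phoneme LM with a naive back-off."""
    p = unigram_lm.get(phones[0], 1e-4)
    for prev, cur in zip(phones, phones[1:]):
        p *= bigram_lm.get((prev, cur), 1e-4 * unigram_lm.get(cur, 1e-4))
    return p

def stage_one(segments, phones, top_n=10):
    """segments: dict name -> (num_phone_slots, bigram_lm, unigram_lm).
    Returns the segments most likely to contain the query, by ETF proxy."""
    scored = [(name, n * query_probability(phones, bg, ug))
              for name, (n, bg, ug) in segments.items()]
    return sorted(scored, key=lambda t: -t[1])[:top_n]

segments = {
    "msg_001": (250, {("k", "ae"): 0.2, ("ae", "t"): 0.3},
                {"k": 0.05, "ae": 0.1, "t": 0.08}),
    "msg_002": (300, {("k", "ae"): 0.01},
                {"k": 0.02, "ae": 0.03, "t": 0.04}),
}
print(stage_one(segments, ["k", "ae", "t"]))
```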
An automatic system for detection of pronunciation errors by adult learners of English is embedded in a language-learning package. Its four main features are: (1) a recognizer robust to non-native speech; (2) localization of phone- and word-level errors; (3) diagnosis of what sorts of phone-level errors took place; and (4) a lexical-stress detector. Together, these tools allow robust, consistent, and specific feedback on pronunciation errors, unlike many previous systems that provide feedback only at a more general level. The diagnosis technique searches for errors expected based on the student's mother tongue and uses a separate bias for each error in order to maintain a particular desired global false-alarm rate. Results are presented for non-native recognition on tasks of differing complexity and for diagnosis, based on a data set of artificial errors, showing that this method can detect many contrasts with a high hit rate and a low false-alarm rate.
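A minimal sketch of the per-error bias idea: each expected L1-dependent error detector compares an "error" score against the canonical-pronunciation score, shifted by a bias tuned so that every detector operates at the same target false-alarm rate. The scores, the example substitution, and the bias-tuning procedure below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def tune_bias(correct_score_diffs, target_fa_rate):
    """Choose the bias so that only target_fa_rate of correct pronunciations
    would be flagged: the (1 - target_fa_rate) quantile of score differences
    observed on correctly pronounced data."""
    return float(np.quantile(correct_score_diffs, 1.0 - target_fa_rate))

def diagnose(score_error, score_canonical, bias):
    """Flag the expected error if its score beats the canonical score by more
    than the tuned bias."""
    return (score_error - score_canonical) > bias

# Example: expected substitution /th/ -> /s/ for a particular mother tongue.
correct_diffs = np.random.default_rng(1).normal(-2.0, 1.0, 1000)
bias = tune_bias(correct_diffs, target_fa_rate=0.05)
print(diagnose(score_error=-3.0, score_canonical=-5.5, bias=bias))
```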
This paper describes a design and feasibility study for a large-scale automatic directory-information system with a scalable architecture. The current demonstrator, called PADIS-XL, operates in real time and handles a database of a medium-sized German city with 130,000 listings. The system uses a new technique of making a combined decision on the joint probability over multiple dialogue turns, and a dialogue strategy that strives to restrict the search space further with every dialogue turn. During the course of the dialogue, the last name of the desired subscriber must be spelled out. The spelling recognizer permits continuous spelling and uses a context-free grammar to parse common spelling expressions. This paper describes the system architecture, our maximum a-posteriori (MAP) decision rule, the spelling grammar, and the dialogue strategy. We give results on the SPEECHDAT and SIETILL databases for recognition of first names by spelling and for jointly deciding on the spelled and the spoken name. In a 35,000-name setup, the joint decision reduced name-recognition errors by 31%.
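A hedged sketch of a joint MAP decision over the spoken and spelled renderings of a name, assuming the acoustics of the two dialogue turns are conditionally independent given the name; the notation and independence assumption are illustrative and the paper's actual decision rule may condition differently.

$$\hat{n} \;=\; \arg\max_{n \in \mathcal{N}} \; P(n)\; p\bigl(X_{\text{spoken}} \mid n\bigr)\; p\bigl(X_{\text{spelled}} \mid n\bigr)$$

where N is the set of listed names, P(n) a prior over names, and X_spoken and X_spelled the acoustic observations from the spoken-name and spelled-name turns.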