The Carnegie Mellon Communicator Corpus

Christina L. Bennett, Alexander I. Rudnicky

2002

As part of the DARPA Communicator program, Carnegie Mellon has, over the past three years, collected a large corpus of speech produced by callers to its Travel Planning system. To date, a total of 180,605 utterances (90.9 hours) have been collected. The data were used for a number of purposes, including acoustic and language modeling and the development of a spoken dialog system. The collection, transcription and annotation of these data prompted us to develop a number of procedures for managing the transcription process and for ensuring accuracy. We describe these, as well as some results based on these data. A portion of this corpus, covering the years 1999-2001, is being published for research purposes.

Keywords:

Speech recognition
Natural language processing
Annotation
Language model
Computer science
Artificial intelligence
Dialog system
Spoken dialog
