Transcription Of New Speaking Styles - Voicemail

1998 
M. Padmanabhan, B. Ramabhadran, E. Eide, G. Ramaswamy, L. R. Bahl, P. S. Gopalakrishnan, S. Roukos
IBM T. J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598

1 INTRODUCTION

In this paper we describe a new testbed for developing speech recognition algorithms - a VoiceMail transcription task, analogous to other tasks such as the Switchboard, CallHome [1] and the Hub 4 tasks [2] which are currently used by speech recognition researchers.

Spontaneous speech occurring in day-to-day life can broadly be classified into two categories: (i) where the speaker does not receive any external feedback to direct his/her speech, and (ii) where the speaker receives external feedback from another person/machine/audience. Examples of the former category are radio broadcast news, voicemail, etc., and examples of the latter category are telephone conversations, natural language transaction systems (e.g. ATIS), seminars, etc. In general, to obtain the best performance in transcribing a certain style of speech, it is necessary to train the speech recognition system on a similar style of training data. Some of the speech categories mentioned above are quite well represented in currently existing databases. However, voicemail data is not well represented in any database, even though it represents a very large volume of real-world speech data. Consequently there is a need for a Voicemail database in order to improve transcription performance on a voicemail transcription task, and also to establish a new testbed for speech recognition algorithms.

Similar to the Switchboard/CallHome databases, the Voicemail database comprises telephone bandwidth spontaneous speech. However, the difference with respect to the Switchboard and CallHome tasks is that the interaction is not between two humans, but rather between a human and a machine.
Consequently, the speech is expected to be a little more formal in its nature, without the problems of cross-talk, barge-in, etc. This eliminates some of the variables and provides more controlled conditions, enabling one to concentrate on the aspects of spontaneous speech and effects of the telephone channel. In this paper, we will describe the modality of collection of the speech data, and some algorithmic techniques that were devised based on this data. We will also describe the initial results of transcription performance on this task.

2 DATA COLLECTION

For details of the data collection scheme see [3]. Briefly, some of the characteristics of the voicemail data are as follows:

• The data represents extremely spontaneous speech.
• The data contains both long-distance and local calls.
• Each voicemail message typically has a click at the beginning and/or end of the message arising from the caller hanging up.
• The data is subject to the compression of the phonemail system, which leads to a small degradation in accuracy.
• The average length of a voicemail message is 31 seconds; however, the peak of the histogram of voicemail durations occurs at 18 seconds.
• The average rate of the speech is approximately 190 words per minute.
• The topics covered in the collected data ranged from personal messages to messages with technical or business-related content.
• The database was not quite gender balanced, with the percentage of male speakers being 38%.

3 SYSTEM OVERVIEW

We will first briefly describe the IBM large-vocabulary speech recognition system. Essential aspects of the