The 2003 ISL rich transcription system for conversational telephony speech

This paper describes the ISL large vocabulary conversational telephony speech recognition system, which was tested in NIST's RT-03S ("Switchboard") evaluation. We present our experiments on improving preprocessing, acoustic modelling, and language modelling. The system features phone-dependent semi-tied full covariances, semi-tied clustering of septa-phones, clustering across phones, feature adaptive training, robust estimation of VTLN and MLLR, as well as context-dependent interpolation of language models. We present detailed results for each stage of our multi-pass transcription scheme. System development started from a 1997 SWB system, which yielded a word error rate of 35.1% on our internal 1-hour development set. The final system reached 21.8% on the same set, a 38% relative improvement. The error rate on the RT-03 CTS evaluation set is 23.4%.
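
The 38% figure is a relative reduction in word error rate, computed against the baseline rather than as a difference in percentage points. A minimal sketch of that arithmetic follows; the function name `relative_wer_reduction` is our own illustration and not part of the ISL system.

```python
def relative_wer_reduction(baseline_wer: float, final_wer: float) -> float:
    """Return the WER reduction as a percentage of the baseline WER.

    Hypothetical helper for illustration only; both arguments are
    word error rates in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - final_wer) / baseline_wer


# Development-set figures quoted in the abstract: 35.1% -> 21.8%.
reduction = relative_wer_reduction(35.1, 21.8)
print(f"{reduction:.1f}% relative")  # rounds to the 38% stated above
```

The absolute gain on the development set is 13.3 percentage points; dividing by the 35.1% baseline gives the relative figure.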