Replacing Human Audio with Synthetic Audio for on-Device Unspoken Punctuation Prediction

Daria Soboleva, Ondrej Skopek, Márius Šajgalík, Victor Carbune, Felix Weissenberger, Julia Proskurnia, Bogdan Prisacari, Daniel Valcarce, Justin Lu, Rohit Prabhavalkar, Balint Miklos

2021

We present a novel multi-modal unspoken punctuation prediction system for English that combines acoustic and text features. We demonstrate, for the first time, that a model trained exclusively on synthetic data generated by a prosody-aware text-to-speech system can outperform a model trained on expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use: it feeds hash-based embeddings of automatic speech recognition text output, together with acoustic features, into a quasi-recurrent neural network, which keeps the model size small and latency low.
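
The pipeline described above (hashed token embeddings joined with acoustic features, then a quasi-recurrent layer) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bucket count, the dimensions, the two per-token acoustic values, the filter width of 2, and all function names are hypothetical, and the weights are random rather than trained. The gating follows the standard QRNN fo-pooling rule, c_t = f_t * c_{t-1} + (1 - f_t) * z_t and h_t = o_t * c_t.

```python
import hashlib
import math
import random

# All sizes below are illustrative assumptions, not values from the paper.
NUM_BUCKETS = 1000   # size of the hashed embedding table
EMBED_DIM = 4        # embedding width per token
ACOUSTIC_DIM = 2     # hypothetical per-token acoustic features
HIDDEN_DIM = 3       # QRNN hidden width


def hash_bucket(token, num_buckets=NUM_BUCKETS):
    """Map a token to a table row with a stable hash, so no vocabulary is stored."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def qrnn_layer(inputs, w_z, w_f, w_o):
    """One QRNN layer with a causal filter of width 2 and fo-pooling.

    A real implementation computes the convolutions for all timesteps in
    parallel; they are computed inside the loop here only for readability.
    """
    prev = [0.0] * len(inputs[0])   # zero padding for the first timestep
    c = [0.0] * len(w_z)            # pooling state
    outputs = []
    for x in inputs:
        window = prev + x           # concatenate [x_{t-1}; x_t]
        z = [math.tanh(a) for a in matvec(w_z, window)]
        f = [sigmoid(a) for a in matvec(w_f, window)]
        o = [sigmoid(a) for a in matvec(w_o, window)]
        c = [fi * ci + (1.0 - fi) * zi for fi, ci, zi in zip(f, c, z)]
        outputs.append([oi * ci for oi, ci in zip(o, c)])
        prev = x
    return outputs


def encode(tokens, acoustic, table, weights):
    """Concatenate each token's hashed embedding with its acoustic features."""
    inputs = [table[hash_bucket(t, len(table))] + a
              for t, a in zip(tokens, acoustic)]
    return qrnn_layer(inputs, *weights)


rng = random.Random(0)
table = make_matrix(NUM_BUCKETS, EMBED_DIM, rng)
in_dim = 2 * (EMBED_DIM + ACOUSTIC_DIM)
weights = tuple(make_matrix(HIDDEN_DIM, in_dim, rng) for _ in range(3))

tokens = ["how", "are", "you"]                         # ASR text output
acoustic = [[0.05, 0.2], [0.02, 0.1], [0.40, -0.3]]    # made-up feature values
hidden = encode(tokens, acoustic, table, weights)
```

In a full model, each vector in `hidden` would feed a per-token classifier over punctuation labels. The hashed table is what keeps the footprint small in this sketch: its size is fixed by `NUM_BUCKETS` regardless of vocabulary size, at the cost of occasional collisions between tokens.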
