Automatic Preparation of Standard Arabic Phonetically Rich Written Corpora with Different Linguistic Units

Fadi Sindran, Firas Mualla, Tino Haderlein, Khaled Daqrouq, Elmar Nöth

2017

Phonetically rich and balanced speech corpora are essential components of state-of-the-art automatic speech recognition (ASR) and text-to-speech (TTS) systems. The written form of a speech corpus must be prepared carefully so that it represents the richness and balance of the linguistic content. Such spoken and written corpora are scarce for Standard Arabic (SA); the only one available was prepared manually by expert linguists and phoneticians. In this work, we address the task of automatically preparing written corpora that are rich in linguistic units. Our work builds on a comprehensive statistical linguistic study of SA, based on the automatic phonetic transcription of texts containing more than 5 million words. We prepared two written corpora. The first contains all allophones in SA, with at least 3 occurrences of each allophone and 17 occurrences of each phoneme. The second contains, in addition to all allophones, 90.72% of the diphones in SA.
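
The abstract does not describe the selection procedure itself. As a rough illustration of what preparing a corpus with a minimum number of occurrences per unit can involve, the following is a minimal sketch of a greedy coverage heuristic. It is a generic technique swapped in for illustration, not the authors' method. The function names `units` and `greedy_select`, the parameters `n` and `min_count`, and the toy phone symbols are all hypothetical, and the input is assumed to be sentences that have already been phonetically transcribed into lists of symbols.

```python
from collections import Counter


def units(seq, n):
    """Return the n-grams of a symbol sequence (n=1: single units, n=2: diphone-like pairs)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def greedy_select(sentences, n=1, min_count=1):
    """Greedily pick sentence indices until each unit found in the pool
    occurs at least min_count times (or as often as the pool allows).

    Hypothetical helper for illustration only; not the paper's algorithm.
    """
    per_sentence = [Counter(units(s, n)) for s in sentences]

    # Total supply of each unit across the whole pool.
    total = Counter()
    for counts in per_sentence:
        total.update(counts)

    # Outstanding need per unit, capped by what the pool can supply.
    need = {u: min(min_count, k) for u, k in total.items()}

    def gain(i):
        # How many still-needed occurrences sentence i would contribute.
        return sum(min(cnt, need[u]) for u, cnt in per_sentence[i].items())

    chosen = []
    remaining = set(range(len(sentences)))
    while remaining and any(v > 0 for v in need.values()):
        # Ties are broken in favour of the lowest sentence index.
        best = max(sorted(remaining), key=gain)
        if gain(best) == 0:
            break
        chosen.append(best)
        remaining.discard(best)
        for u, cnt in per_sentence[best].items():
            need[u] = max(0, need[u] - cnt)
    return chosen


if __name__ == "__main__":
    # Toy transcriptions; the symbols are placeholders, not real SA allophones.
    pool = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["a"]]
    print(greedy_select(pool, n=1, min_count=2))
```

With `n=1` the sketch targets single units such as allophones or phonemes, and with `n=2` it targets adjacent pairs such as diphones. A greedy heuristic of this kind does not guarantee the smallest possible corpus; it only tends to keep the selected text short.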
