Acoustic Feature Analysis and Discriminative Modeling for Language Identification of Closely Related South-Asian Languages

2018 
With the advancement in technology, communication between people around the world from different linguistic backgrounds is increasing gradually, resulting in the requirement of language identification services. Language identification techniques extract distinguishable information as features of a language from the speech corpora to differentiate one language from other. Without publicly available speech corpora, comparison between different techniques will not be much reliable. This paper investigates state-of-the-art features and techniques for language identification of under-resource and closely related languages, namely Pashto, Punjabi, Sindhi, and Urdu. For language identification, speech corpus is designed and collected for mentioned languages. The dataset is a read speech data collected over telephone network (mobile and landline) from different regions of Pakistan. The speech corpus is annotated at the sentence level using X-SAMPA, its orthographic transcription is also provided, and verified data are divided into training and evaluation sets. Mel-frequency cepstral coefficients and their shifted delta cepstral features are used to develop language identification system of target languages. Gaussian mixture model with universal background model (GMM-UBM)-based and I-vector-based language identification approaches are investigated. The results show that GMM-UBM is more effective than the I-vector for language identification of short duration test utterances.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    46
    References
    8
    Citations
    NaN
    KQI
    []