Statistical learning of sub-words in Vietnamese language

2021 
Sub-words have recently attracted much attention and employed to improve many natural language processing applications. In this paper, we suggest a procedure to extract sub-word units from a text collection. The sub- word units are evaluated on two Vietnamese databases to analyze and discuss their statistics and characteristics for Vietnamese language, including sub-word types, sub-word frequency, top sub-word distribution and unknown sub-words in different text types. The experimental results also point out several problems in training and testing data splitting in a current Vietnamese language processing example of Optical Character Recognition (OCR) error correction.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    7
    References
    0
    Citations
    NaN
    KQI
    []