Statistical learning of sub-words in Vietnamese language

D.-Q. Nguyen, T. H. Le

2021

Sub-words have recently attracted much attention and have been employed to improve many natural language processing applications. In this paper, we suggest a procedure to extract sub-word units from a text collection. The sub-word units are evaluated on two Vietnamese databases to analyze and discuss their statistics and characteristics for the Vietnamese language, including sub-word types, sub-word frequency, the distribution of the most frequent sub-words, and unknown sub-words in different text types. The experimental results also point out several problems with how training and testing data are split in a current Vietnamese language processing task, Optical Character Recognition (OCR) error correction.
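
The abstract does not describe the extraction procedure itself, so the following is only a minimal sketch of how the statistics it lists (sub-word types, sub-word frequency, and the share of unknown sub-words in test data) could be computed. The character n-gram scheme, the function names `extract_subwords`, `subword_counts` and `unknown_rate`, and the toy sentences are all assumptions made for illustration; they are not the authors' method or data.

```python
import unicodedata
from collections import Counter


def extract_subwords(text, n=2):
    """Split text on whitespace, then emit character n-grams of each token.

    Tokens of length <= n are kept whole. This n-gram scheme is an
    illustrative assumption, not the procedure proposed in the paper.
    """
    # NFC normalization keeps each Vietnamese accented letter as one code point.
    text = unicodedata.normalize("NFC", text).lower()
    units = []
    for token in text.split():
        if len(token) <= n:
            units.append(token)
        else:
            units.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return units


def subword_counts(lines, n=2):
    """Count sub-word frequencies over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(extract_subwords(line, n))
    return counts


def unknown_rate(train_counts, test_counts):
    """Fraction of test sub-word occurrences whose type is absent from training."""
    total = sum(test_counts.values())
    if total == 0:
        return 0.0
    unknown = sum(c for unit, c in test_counts.items() if unit not in train_counts)
    return unknown / total


# Toy data (hypothetical, for illustration only).
train = ["xin chào", "chào bạn"]
test = ["xin lỗi bạn"]

train_counts = subword_counts(train)
test_counts = subword_counts(test)

print("sub-word types in training:", len(train_counts))
print("most frequent sub-words:", train_counts.most_common(3))
print("unknown sub-word rate on test:", unknown_rate(train_counts, test_counts))
```

A high unknown rate on a test split produced this way would be one symptom of the train/test splitting problems the abstract mentions, although the paper's own analysis may rely on different measures.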

Keywords:

Word (computer architecture)
Natural language processing
Statistical learning
Test data
Error detection and correction
Artificial intelligence
Point (typography)
Text types
Optical character recognition
Vietnamese
Computer science
