Statistical learning of sub-words in Vietnamese language
2021
Sub-words have recently attracted much attention and have been employed to improve many natural language processing applications. In this paper, we suggest a procedure to extract sub-word units from a text collection. The sub-word units are evaluated on two Vietnamese databases to analyze and discuss their statistics and characteristics for the Vietnamese language, including sub-word types, sub-word frequency, the distribution of the top sub-words, and unknown sub-words in different text types. The experimental results also reveal several problems with the splitting of training and testing data in a current Vietnamese language processing example, Optical Character Recognition (OCR) error correction.
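The statistics named in the abstract (sub-word types, sub-word frequency, top sub-words, and unknown sub-words between data splits) can be illustrated with a minimal sketch. The abstract does not specify the extraction procedure, so this sketch substitutes plain character n-grams of whitespace-separated tokens as a stand-in for sub-word units; it is not the paper's method. The function names, the n-gram range, and the toy sentences are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def char_ngrams(token, n_min=1, n_max=3):
    """Yield every character n-gram of `token` with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            yield token[i:i + n]


def subword_counts(lines, n_min=1, n_max=3):
    """Count stand-in sub-word units (character n-grams) over a text collection."""
    counts = Counter()
    for line in lines:
        # Normalize to NFC so composed and decomposed diacritics count as one form.
        line = unicodedata.normalize("NFC", line).lower()
        for token in line.split():
            counts.update(char_ngrams(token, n_min, n_max))
    return counts


def unknown_rate(train_counts, test_counts):
    """Fraction of test sub-word occurrences whose type never appears in training."""
    total = sum(test_counts.values())
    if total == 0:
        return 0.0
    unknown = sum(c for sw, c in test_counts.items() if sw not in train_counts)
    return unknown / total


# Hypothetical toy data, only to show how the functions fit together.
train_counts = subword_counts(["học máy", "máy học"])
test_counts = subword_counts(["học hỏi"])

print("sub-word types in train:", len(train_counts))
print("top sub-words:", train_counts.most_common(5))
print("unknown rate on test:", unknown_rate(train_counts, test_counts))
```

A high unknown rate on the test split relative to the training split is one way a problematic train/test division could show up in statistics of this kind.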