Extracting correctly aligned segments from unclean parallel data using character n-gram matching

2020 
Training of Neural Machine Translation systems is a time- and resource-demanding task, especially when large amounts of parallel texts are used. In addition, it is sensitive to unclean parallel data. In this work, we explore a data cleaning method based on character n-gram matching. The method is particularly convenient for closely related languages, since the n-gram matching scores can be calculated directly between the source and the target parts of the training corpus. For more distant languages, a translation step is needed, and the MT output is then compared with the corresponding original part. We show that the proposed method not only reduces the size of the training corpus but can also increase the system's performance.
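To make the filtering idea concrete, the following is a minimal sketch of character n-gram matching between the two sides of a sentence pair. It computes a chrF-style average n-gram F-score and keeps only pairs above a threshold. The function names, the maximum n-gram order, the beta weight, and the threshold value are illustrative assumptions, not the paper's exact settings.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return a Counter of character n-grams of order n (whitespace ignored)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_fscore(src, tgt, max_n=6, beta=2.0):
    """Average character n-gram F-score between two strings (chrF-style).

    max_n and beta follow common chrF defaults; treat them as assumptions.
    """
    scores = []
    for n in range(1, max_n + 1):
        src_grams = char_ngrams(src, n)
        tgt_grams = char_ngrams(tgt, n)
        if not src_grams or not tgt_grams:
            continue
        overlap = sum((src_grams & tgt_grams).values())  # clipped match counts
        prec = overlap / sum(src_grams.values())
        rec = overlap / sum(tgt_grams.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0


def filter_parallel_corpus(pairs, threshold=0.3):
    """Keep sentence pairs whose matching score reaches the (assumed) threshold."""
    return [(src, tgt) for src, tgt in pairs
            if char_ngram_fscore(src, tgt) >= threshold]


if __name__ == "__main__":
    corpus = [
        # closely related language pair: high character overlap
        ("Dobar dan, kako ste?", "Dober dan, kako ste?"),
        # misaligned pair: low character overlap, should be filtered out
        ("Dobar dan, kako ste?", "Completely unrelated sentence."),
    ]
    for src, tgt in corpus:
        print(f"{char_ngram_fscore(src, tgt):.3f}  {src!r} || {tgt!r}")
    print("kept pairs:", len(filter_parallel_corpus(corpus)))
```

For closely related languages this score can be applied directly to the source and target segments, as above; for distant languages one side would first be machine-translated and the score computed between the MT output and the corresponding original.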