Unsupervised Text Segmentation Based on Native Language Characteristics

Shervin Malmasi, Mark Dras, Mark Johnson, Lan Du, Magdalena Wolska

2017

Most work on segmenting text does so on the basis of topic changes, but it can be of interest to segment by other, stylistically expressed characteristics such as change of authorship or native language. We propose a Bayesian unsupervised text segmentation approach to the latter. While baseline models achieve essentially random segmentation on our task, indicating its difficulty, a Bayesian model that incorporates appropriately compact language models and alternating asymmetric priors can achieve scores on the standard metrics around halfway to perfect segmentation.
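
The abstract does not name the "standard metrics" it refers to. As an illustration only, the sketch below assumes one of them is Pk, a widely used segmentation error measure that is 0 for a perfect segmentation and grows as the hypothesised boundaries diverge from the reference. The function names and the segment-length representation are hypothetical choices made for this example, not taken from the paper.

```python
def segment_ids(lengths):
    """Expand segment lengths, e.g. [2, 3], into per-unit ids [0, 0, 1, 1, 1]."""
    ids = []
    for seg, n in enumerate(lengths):
        ids.extend([seg] * n)
    return ids


def pk(reference, hypothesis, k=None):
    """Pk error between two segmentations given as lists of segment lengths.

    A window of width k slides over the text. At each position we check
    whether the two ends of the window fall in the same segment, once for the
    reference and once for the hypothesis. Pk is the fraction of positions
    where the two segmentations disagree.
    """
    ref = segment_ids(reference)
    hyp = segment_ids(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of units")
    if k is None:
        # Conventional choice: about half the mean reference segment length.
        k = max(1, round(len(ref) / (2 * len(reference))))
    n_windows = len(ref) - k
    if n_windows <= 0:
        raise ValueError("text is too short for window size k")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(n_windows)
    )
    return errors / n_windows


# Identical segmentations give zero error.
print(pk([4, 4], [4, 4]))  # 0.0
# A hypothesis with no internal boundary misses the single true boundary.
print(pk([4, 4], [8]))
```

Under this assumed metric, lower is better, so a model's score can be read against both the perfect score of 0 and the score of a trivial baseline on the same data.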

Keywords:

Natural language processing
Artificial intelligence
Speech recognition
Computer science
First language
Text segmentation
