A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

C. M. Downey, Fei Xia, Gina-Anne Levow, Shane Steinert-Threlkeld

2022

We introduce a Masked Segmental Language Model (MSLM) for joint language modeling and unsupervised segmentation. While near-perfect supervised methods have been developed for segmenting human-like linguistic units in resource-rich languages such as Chinese, many of the world's languages are both morphologically complex and lack large datasets of gold segmentations for supervised training. Segmental Language Models (SLMs) offer a unique approach by conducting unsupervised segmentation as the byproduct of a neural language modeling objective. However, current SLMs are limited in their scalability due to their recurrent architecture. We propose a new type of SLM for use in both unsupervised and lightly supervised segmentation tasks. The MSLM is built on a span-masking transformer architecture, harnessing a masked bidirectional modeling context and attention, as well as adding the potential for model scalability. In a series of experiments, our model surpasses recurrent SLMs in segmentation quality on Chinese, and performs similarly to the recurrent model on English.
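
The abstract does not describe how a segmentation is read off the language model. As a minimal sketch of the general idea, assuming a model can assign a log-probability to any candidate span, the best-scoring segmentation can be recovered with a Viterbi-style dynamic program over span boundaries. Everything named here (`viterbi_segment`, `toy_span_logprob`, `TOY_LEXICON`, the `max_len` cap) is hypothetical and stands in for a trained model; it is not the authors' implementation.

```python
import math

# Hypothetical stand-in for a trained model's span scores (assumption).
TOY_LEXICON = {
    "the": math.log(0.4),
    "cat": math.log(0.3),
    "sat": math.log(0.3),
}


def toy_span_logprob(span):
    """Return a log-probability for a span; unknown spans are penalized per character."""
    if span in TOY_LEXICON:
        return TOY_LEXICON[span]
    return -10.0 * len(span)


def viterbi_segment(text, span_logprob, max_len=4):
    """Find the highest-scoring segmentation of `text` into spans of at most `max_len` characters."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last span in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + span_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow back-pointers from the end of the string to recover the spans.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(text[start:end])
        end = start
    return segments[::-1], best[n]


segments, score = viterbi_segment("thecatsat", toy_span_logprob)
print(segments)
```

In a real segmental language model the span scores would come from the network rather than a lookup table, and training would typically sum over all segmentations instead of taking the maximum.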
