Sphere Embedding: An Application to Part-of-Speech Induction

Yariv Maron, Elie Bienenstock, Michael James

2010

Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks.
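The two geometric ingredients named in the abstract, constraining vectors to a unit sphere and clustering them with k-means, can be illustrated with a minimal sketch. This is not the authors' optimization of the CODE model: random vectors stand in for learned embeddings, and the helper names (`project_to_sphere`, `kmeans`), the dimensionality (25), and the number of clusters (5) are hypothetical choices made only for illustration.

```python
import numpy as np


def project_to_sphere(vectors):
    """Rescale each row to unit Euclidean norm (a point on the unit sphere)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return vectors / np.maximum(norms, 1e-12)


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points (fancy indexing copies).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Squared Euclidean distance from every point to every center: (n, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


# Hypothetical stand-in data: 200 random vectors in 25 dimensions.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 25))

embedded = project_to_sphere(raw)        # every row now has norm 1
labels, centers = kmeans(embedded, k=5)  # cluster the points on the sphere
```

In a part-of-speech induction setting, each row would correspond to a word type and each cluster label to an induced tag. The cluster centers are ordinary means here, so they generally lie inside the sphere rather than on it.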

Keywords:

Artificial intelligence
Machine learning
Part of speech
Cluster analysis
Embedding
Categorical variable
Conjecture
Curse of dimensionality
Euclidean geometry
Mathematics
Unit sphere
Euclidean embedding
Computer science
