Optimizing Semantic Coherence in Topic Models

David M. Mimno, Hanna M. Wallach, Edmund M. Talley, Miriam Leenders, Andrew McCallum

2011

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. For people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; and (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
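
To make contribution (2) concrete, the following is a minimal sketch of a coherence score computed only from co-document frequencies in the training data. The abstract does not state the formula, so the form used here, a sum over pairs of a topic's top words of log((D(v_m, v_l) + 1) / D(v_l)), where D counts the documents containing the given words, should be read as an assumption about the metric rather than a quotation of it. The function name `topic_coherence`, the skip rule for unseen words, and the toy documents are hypothetical.

```python
import math


def topic_coherence(top_words, documents):
    """Score a topic from co-document frequencies in the training corpus.

    top_words: the topic's most probable words, most probable first.
    documents: iterable of tokenized documents (lists of words).
    Assumed form: sum over m > l of log((D(v_m, v_l) + 1) / D(v_l)).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                # Assumption: skip words that never occur, to avoid dividing by zero.
                continue
            co = doc_freq(top_words[m], top_words[l])
            # The +1 keeps the logarithm finite when the pair never co-occurs.
            score += math.log((co + 1) / d_l)
    return score


# Hypothetical toy corpus for illustration only.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "protein"],
    ["gene", "market"],
    ["market", "stock"],
]
coherent = topic_coherence(["gene", "protein"], docs)
incoherent = topic_coherence(["gene", "stock"], docs)
```

Under this sketch, a word pair that co-occurs often scores closer to zero, while a pair that never shares a document is penalized, so `coherent` is larger than `incoherent` on the toy corpus.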

Keywords:

Training set
Machine learning
Artificial intelligence
Dimensionality reduction
Natural language processing
Latent Dirichlet allocation
Computer science
Latent variable
Data mining
Topic model
Coherence (physics)
Linear subspace
Information retrieval
