Short text topic modeling by exploring original documents

2018 
Topic modeling for short texts is challenging because of the sparsity problem. An effective solution is to aggregate short texts into long pseudo-documents before training a standard topic model. The main concern with this solution is how the short texts are aggregated. A recently developed self-aggregation-based topic model (SATM) can adaptively aggregate short texts without using heuristic information. However, the model definition of SATM is somewhat rigid and, more importantly, it tends to overfit and is time-consuming on large-scale corpora. To improve on SATM, we propose a generalized topic model for short texts, namely the latent topic model (LTM). In LTM, we assume that the observable short texts are snippets of normal long texts (namely original documents) generated by a given standard topic model, but that their original document memberships are unknown. With Gibbs sampling, LTM drives an adaptive aggregation process over the short texts and simultaneously estimates the other latent variables of interest. Additionally, we propose a mini-batch scheme for fast inference. Experimental results indicate that LTM is competitive with state-of-the-art baseline models on short text topic modeling.
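The aggregation idea in the abstract can be illustrated with a minimal sketch. It only shows how short texts are concatenated into pseudo-documents once each text has been given an original-document membership. The function name `aggregate`, the toy corpus, and the fixed `memberships` vector are all assumptions made for illustration; in LTM the memberships are latent and would be resampled by Gibbs sampling rather than fixed by hand, and this sketch does not implement that sampler.

```python
from collections import Counter


def aggregate(short_texts, memberships, num_pseudo_docs):
    """Concatenate tokenized short texts into pseudo-documents.

    short_texts: list of token lists, one per short text.
    memberships: list of pseudo-document indices, one per short text
                 (hypothetical here; latent in the actual model).
    """
    pseudo_docs = [[] for _ in range(num_pseudo_docs)]
    for tokens, doc_id in zip(short_texts, memberships):
        pseudo_docs[doc_id].extend(tokens)
    return pseudo_docs


# Toy corpus of tokenized short texts (illustrative only).
short_texts = [
    ["topic", "model"],
    ["gibbs", "sampling"],
    ["short", "text", "topic"],
    ["sampling", "inference"],
]

# Assumed memberships: texts 0 and 2 share one original document,
# texts 1 and 3 share another.
memberships = [0, 1, 0, 1]

pseudo_docs = aggregate(short_texts, memberships, num_pseudo_docs=2)

# Word counts per pseudo-document are denser than per short text,
# which is what a standard topic model would then be trained on.
word_counts = [Counter(doc) for doc in pseudo_docs]
```

Each pseudo-document pools the word co-occurrences of its member snippets, which is the property that lets a standard topic model cope with sparsity.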