Duplicate short text detection based on Word2vec

2017 
In modern life, people build social relationships, read news, make e-commerce transactions, and find entertainment online. Light blogs and short comments are becoming increasingly popular. Traditional duplicate detection algorithms designed for long text are hard to apply in this setting, so a more effective duplicate detection algorithm for short text is needed. Based on the Word2vec word embedding model, this paper proposes a duplicate detection algorithm for short text that incorporates semantic information. Words are embedded into vectors, which serve as the input elements of the Simhash algorithm to produce a 64-bit sequence; two sequences are then compared by Hamming distance, and the result is filtered by a preset threshold. An improved version that adds weighting is then proposed. Its results are compared with the unweighted Word2vec method and the traditional TF-IDF method. Experiments on the SICK corpus show that the weighted Word2vec method achieves higher accuracy and recall.
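The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: `toy_embedding` is a hypothetical stand-in for a trained Word2vec lookup (it carries no semantic information), the sign-per-dimension rule for turning a summed vector into 64 bits is an assumption, and the default threshold of 3 and the optional `weights` dictionary are illustrative choices, since the abstract does not say how the weights are computed.

```python
import hashlib
import random

BITS = 64  # fingerprint length stated in the abstract


def toy_embedding(word, dim=BITS):
    """Hypothetical stand-in for a Word2vec lookup: a fixed pseudo-random vector per word."""
    seed = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def fingerprint(text, weights=None, embed=toy_embedding):
    """Sum the (optionally weighted) word vectors, then keep one bit per dimension."""
    totals = [0.0] * BITS
    for word in text.lower().split():
        # Assumed weighting scheme: a per-word weight, defaulting to 1.0 (unweighted).
        w = 1.0 if weights is None else weights.get(word, 1.0)
        for i, value in enumerate(embed(word)):
            totals[i] += w * value
    bits = 0
    for i, total in enumerate(totals):
        if total > 0:  # assumed rule: bit is 1 when the summed component is positive
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(text_a, text_b, threshold=3, weights=None):
    """Flag a pair as duplicate when the Hamming distance is within the preset threshold."""
    distance = hamming(fingerprint(text_a, weights), fingerprint(text_b, weights))
    return distance <= threshold
```

With a real model, `embed` would be replaced by a lookup into trained 64-dimensional word vectors, and `weights` could hold, for example, per-word TF-IDF scores; both substitutions are assumptions about how such a system might be wired together.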