Semi-Supervised Event-related Tweet Identification with Dynamic Keyword Generation

2017 
Twitter provides a convenient channel for accessing up-to-the-minute information about major events. However, acquiring a clean and complete set of event-related data is challenging because tweets are short and noisy. In this paper, we propose a semi-supervised method for obtaining event-related tweets from the Twitter stream that are of high quality in terms of both precision and recall. Specifically, candidate event-related tweets are selected based on a set of keywords, and we propose to generate and update these keywords dynamically as the event develops. To be included in the keyword set, words are evaluated on single-word properties, properties based on co-occurring words, and changes in word importance over time. Our solution is capable of capturing keywords for emerging aspects, or for aspects of increasing importance, as the event evolves. By leveraging keyword importance information and a few labeled tweets, we propose a semi-supervised expectation maximization process to identify event-related tweets. This process significantly reduces the human effort needed to acquire high-quality tweets. Experiments on three real-world datasets show that our solution outperforms state-of-the-art approaches by up to 10% in F1 measure.
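To make the semi-supervised expectation maximization idea concrete, the following is a minimal sketch and not the method from the paper. It assumes a generic two-class multinomial naive Bayes model trained with EM over a few labeled tweets plus unlabeled candidates, and it omits the keyword importance information the abstract mentions. All function names, the smoothing parameter `alpha`, and the toy tweets are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab, alpha=1.0):
    """Fit a two-class multinomial naive Bayes model from soft labels.

    docs: list of token lists; weights: P(event-related) for each doc.
    alpha: add-alpha smoothing constant (an assumption of this sketch).
    """
    prior = [alpha, alpha]
    counts = [Counter(), Counter()]
    for tokens, p in zip(docs, weights):
        # Each document contributes fractionally to both classes.
        for c, w in ((1, p), (0, 1.0 - p)):
            prior[c] += w
            for t in tokens:
                counts[c][t] += w
    total = prior[0] + prior[1]
    log_prior = [math.log(x / total) for x in prior]
    log_lik = []
    for c in (0, 1):
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_lik.append({t: math.log((counts[c][t] + alpha) / denom) for t in vocab})
    return log_prior, log_lik


def posterior(tokens, log_prior, log_lik):
    """Return P(event-related | tokens); out-of-vocabulary tokens are ignored."""
    scores = []
    for c in (0, 1):
        scores.append(log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c]))
    m = max(scores)  # subtract the max for numerical stability
    e = [math.exp(s - m) for s in scores]
    return e[1] / (e[0] + e[1])


def semi_supervised_em(labeled, unlabeled, n_iter=10):
    """Run EM: labeled tweets keep fixed labels, unlabeled ones get soft labels."""
    docs_l = [tokens for tokens, _ in labeled]
    y_l = [float(y) for _, y in labeled]
    vocab = {t for d in docs_l + unlabeled for t in d}
    model = train_nb(docs_l, y_l, vocab)  # initialize from labeled data only
    for _ in range(n_iter):
        soft = [posterior(d, *model) for d in unlabeled]          # E-step
        model = train_nb(docs_l + unlabeled, y_l + soft, vocab)   # M-step
    return model


# Toy data (hypothetical): 1 = event-related, 0 = unrelated.
labeled = [
    ("earthquake hits city rescue".split(), 1),
    ("new phone release today".split(), 0),
]
unlabeled = [
    "earthquake rescue teams".split(),
    "phone sale today".split(),
    "rescue teams aftershock".split(),
]
model = semi_supervised_em(labeled, unlabeled)
p_event = posterior(["aftershock", "teams"], *model)
p_other = posterior(["phone", "sale"], *model)
```

In this toy run, words such as "aftershock" and "teams" never appear in a labeled tweet; they gain weight for the event class only because they co-occur with labeled event words in the unlabeled tweets, which is the effect that lets a handful of labels go further.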