ISSumSet: a tweet summarization dataset hidden in a TREC track
2021
A key issue for Twitter users is summarizing the continuous and overwhelming stream of information. Many approaches to tweet summarization have been proposed in the literature. It is, however, difficult to compare them given the lack of a standard and accessible test collection. This absence may be due to the effort needed to construct such a (large) dataset. In this paper, we propose to capitalize on the dataset built for the TREC Incident Streams track, which was not intended for evaluating automatic summarization. We show why and how this dataset can be used for this purpose, focusing on extractive summarization. Indeed, when additional annotations are produced on a subset of the TREC Incident Streams (IS) dataset with particular initial assessors' annotations, the subset appears to satisfy the criteria identified in the literature for automatic summarization. To this end, we studied the original TREC IS dataset and then proposed a subset summarizing each event, based on the initial assessors' annotations. This subset is evaluated against the criteria mentioned above. Finally, several widely used state-of-the-art models for automatic text summarization, some specific to tweets and some adapted to tweet summarization, were tested on the proposed dataset. For easy reproducibility, the code used to build the dataset, our additional annotations, and the experiments run on the dataset are provided on our GitHub.