An Effective Scheme for Generating An Overview Report over A Very Large Corpus of Documents

2019 
Efficiently generating an accurate, well-structured overview report (ORPT) over thousands of documents is challenging. A well-structured ORPT is divided into sections at multiple levels (e.g., a two-level structure consists of sections and subsections). No existing multi-document summarization (MDS) algorithm is suitable for this task. To overcome this obstacle, we devise NDORGS (Numerous Documents' Overview Report Generation Scheme), which integrates text filtering, keyword scoring, single-document summarization (SDS), topic modeling, MDS, and title generation to produce a coherent, well-structured ORPT. We then present a multi-criteria evaluation method that applies text mining and multi-attribute decision making to a combination of human judgments, running time, information coverage, and topic diversity. We evaluate ORPTs generated by NDORGS on two large corpora of documents, one classified and the other unclassified. Using Saaty's 9-point pairwise comparison scale and TOPSIS, we show that the ORPTs generated from single-document summaries whose length is 20% of the original documents are the best overall on both datasets.
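The multi-criteria evaluation named in the abstract can be illustrated with a minimal sketch. It assumes criterion weights are derived from a Saaty-style pairwise comparison matrix using the row geometric-mean approximation, and that alternatives are then ranked with standard TOPSIS (vector normalization, Euclidean distance). The function names `ahp_weights` and `topsis` and all numbers in the usage lines are hypothetical placeholders; they are not the paper's implementation or data.

```python
import math


def ahp_weights(pairwise):
    """Approximate criterion weights from a pairwise comparison matrix.

    pairwise[i][j] says how much more important criterion i is than j on
    Saaty's 1-9 scale, with pairwise[j][i] == 1 / pairwise[i][j].
    Uses the row geometric-mean approximation (an assumption of this sketch).
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def topsis(matrix, weights, benefit):
    """Score alternatives with TOPSIS; higher closeness is better.

    matrix:  rows are alternatives, columns are criteria (raw scores).
    weights: one non-negative weight per criterion.
    benefit: True where larger is better, False where smaller is better.
    """
    n_crit = len(weights)
    total = sum(weights)
    w = [x / total for x in weights]
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal best and ideal worst value per criterion.
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        denom = d_best + d_worst
        # If all alternatives are identical, report a neutral score.
        scores.append(d_worst / denom if denom else 0.5)
    return scores


# Hypothetical usage: four criteria in the order human judgment,
# running time, information coverage, topic diversity.
# The comparison matrix and raw scores are made-up placeholders.
pairwise = [
    [1, 3, 2, 2],
    [1 / 3, 1, 1 / 2, 1 / 2],
    [1 / 2, 2, 1, 1],
    [1 / 2, 2, 1, 1],
]
weights = ahp_weights(pairwise)
raw_scores = [
    [7.0, 120.0, 0.60, 0.50],  # alternative A
    [8.0, 150.0, 0.70, 0.55],  # alternative B
    [8.5, 300.0, 0.72, 0.56],  # alternative C
]
closeness = topsis(raw_scores, weights, benefit=[True, False, True, True])
ranking = sorted(range(len(closeness)), key=lambda i: closeness[i], reverse=True)
```

Running time is treated as a cost criterion (smaller is better), while the other three are treated as benefit criteria; that split is an assumption of the sketch rather than something the abstract states.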