A graph-based method of newspaper article reconstruction

2012 
Articles are the primary information units of a newspaper. Article reconstruction, which comprises article aggregation and reading order recovery, is a challenging task because of the complexity of multi-article page layouts. In this paper, we propose a novel approach to article reconstruction based on a bipartite graph framework, which models the complex relationships between text blocks as one-to-one correspondences and accomplishes the task by finding the optimal matching on this graph. During optimization, several information sources, including geometric layout and linguistic and semantic content, are exploited in the bipartite graph model to handle the wide range of complex newspaper layouts. Moreover, unlike existing methods, we perform the two sub-tasks of article reconstruction in reverse order: we first detect the reading order of the text blocks and then use it to aggregate blocks belonging to the same article. Experimental results on 3312 newspaper pages with 23184 articles demonstrate that our method outperforms state-of-the-art methods for newspaper article reconstruction. In addition, the method has been adopted in several large-scale newspaper digitization projects.
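The order-first, aggregate-second idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy blocks, the `follow_cost` features and their weights, the `END_COST` constant, and the brute-force search, which stands in for an efficient bipartite matching algorithm such as the Hungarian method. Each block on the left side of the graph is matched one-to-one either to a successor block or to a dummy "end of article" node on the right side; the resulting successor links are then chained into articles.

```python
from itertools import permutations

# Hypothetical toy page: name -> (column index, row index within column, bag of words).
BLOCKS = {
    "A": (0, 0, {"election", "vote", "mayor"}),
    "B": (0, 1, {"election", "vote", "council"}),
    "C": (1, 0, {"football", "goal", "league"}),
    "D": (1, 1, {"football", "goal", "coach"}),
}
END_COST = 0.6  # assumed cost of declaring "this block ends its article"


def follow_cost(a, b):
    """Assumed cost of block b directly following block a in reading order."""
    col_a, row_a, words_a = BLOCKS[a]
    col_b, row_b, words_b = BLOCKS[b]
    if col_a == col_b and row_b == row_a + 1:
        geo = 0.0  # b sits directly below a
    elif col_b == col_a + 1 and row_b == 0:
        geo = 0.5  # b is at the top of the next column
    else:
        geo = 1.0
    sem = 1.0 - len(words_a & words_b) / len(words_a | words_b)  # 1 - Jaccard
    return 0.5 * geo + 0.5 * sem


def match_successors(names):
    """Minimum-cost one-to-one matching of each block to a successor or an END dummy.

    Right-side nodes 0..n-1 are the real blocks, n..2n-1 are END dummies.
    Brute force is only viable for tiny inputs.
    """
    n = len(names)
    best_cost, best = float("inf"), None
    for assign in permutations(range(2 * n), n):
        if any(r == i for i, r in enumerate(assign)):
            continue  # a block cannot follow itself
        cost = sum(
            END_COST if r >= n else follow_cost(names[i], names[r])
            for i, r in enumerate(assign)
        )
        if cost < best_cost:
            best_cost, best = cost, assign
    return {names[i]: (names[r] if r < n else None) for i, r in enumerate(best)}


def chain_articles(successor):
    """Aggregate blocks into articles by following successor links from each start.

    Blocks caught in a cycle have no start and are not returned; this sketch
    does not try to prevent or repair cycles.
    """
    followers = {s for s in successor.values() if s is not None}
    articles = []
    for start in successor:
        if start in followers:
            continue
        chain, cur = [], start
        while cur is not None:
            chain.append(cur)
            cur = successor[cur]
        articles.append(chain)
    return articles


successor = match_successors(list(BLOCKS))
articles = chain_articles(successor)
print(articles)
```

On this toy page the matching links A to B and C to D, and both B and D are matched to end-of-article dummies, so chaining yields two articles. With real layouts the cost function would need far richer geometric, linguistic and semantic cues than the two terms assumed here.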