Weakly Supervised Bounding Box Extraction for Unlabeled Data in Table Detection.

2020 
The organization and presentation of data in tabular format became an essential strategy of scientific communication and remains fundamental to the transmission of knowledge today. Automated detection of typographical elements such as tables and diagrams in digitized historical print offers a promising approach for future research. Most table detection work relies on existing off-the-shelf detection methods. However, the datasets used for evaluation are not challenging enough because they lack quantity and diversity. To allow a better comparison between proposed methods, we introduce in this paper the NAS dataset of digitized historical images. Tables in historical scientific documents vary widely in their characteristics. They also appear alongside visually similar items, such as maps, diagrams, and illustrations. We address these challenges with a multi-phase procedure, outlined in this article and evaluated on two datasets, ECCO (https://www.gale.com/primary-sources/eighteenth-century-collections-online) and NAS (https://beta.synchromedia.ca/vok-visibility-of-knowledge). In our approach, we use the Gabor filter [1] to prepare our dataset for algorithmic detection with Faster-RCNN [2]. This method detects tables against all other categories of visual information. Because labeled data is limited, particularly for object detection, we developed a new method, weakly supervised bounding box extraction, which extracts bounding boxes automatically for our training set. A pseudo-labeling technique is then used to create a more general model, via a three-step process of bounding box extraction and labeling.
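As an illustration of the Gabor filtering step mentioned above, the following is a minimal sketch of a real-valued Gabor kernel and its application to a grayscale image, using only the Python standard library. The function names (`gabor_kernel`, `filter_valid`) and all parameter values are assumptions made for this example; the abstract does not state the kernel size, orientation, or wavelength actually used, so this should not be read as the authors' preprocessing code.

```python
import math


def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Build a size x size real Gabor kernel (list of rows).

    All parameters are illustrative assumptions, not values from the paper:
    sigma is the Gaussian envelope width, theta the orientation in radians,
    wavelength the period of the cosine carrier, gamma the aspect ratio,
    and psi the phase offset.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along direction theta.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * x_rot / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_valid(image, kernel):
    """Cross-correlate a 2D image (list of rows) with the kernel.

    Only positions where the kernel fits entirely inside the image are
    computed, so the output is smaller than the input.
    """
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    acc += image[i + u][j + v] * kernel[u][v]
            out_row.append(acc)
        out.append(out_row)
    return out
```

A kernel with `theta=0` responds most strongly to vertical line structure, and one with `theta=math.pi / 2` to horizontal line structure, which is one plausible reason a Gabor filter could help emphasize table rulings before detection. In practice a bank of several orientations and an optimized image library would normally replace the plain loops shown here.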