HBA 1.0: A Pixel-based Annotated Dataset for Historical Book Analysis

2017 
This paper introduces HBA 1.0, a representative pixel-based annotated dataset released at the ICDAR2017 Competition on Historical Book Analysis (HBA2017). The HBA 1.0 dataset is composed of 4,436 real scanned, ground-truthed historical document images from 11 books (6 manuscripts and 5 printed books) in different languages and scripts, published between the 13th and 19th centuries. It contains 2,435 printed pages and 2,001 manuscript pages. The ground truth of the HBA 1.0 dataset contains more than 7.58 billion annotated pixels. The dataset addresses a thriving topic of major interest to researchers in several fields, including (historical) document image analysis, image processing, pattern recognition and classification. The HBA 1.0 dataset and its ground truth can be used to evaluate how well image analysis methods discriminate textual content from graphical content on the one hand, and separate textual content according to different text fonts (e.g. lowercase, uppercase, italic) on the other hand. Evaluation results of a state-of-the-art pixel-labeling method on the HBA 1.0 dataset are reported and discussed in this paper, both to provide a benchmark/baseline for future evaluation studies and to showcase the intended use of the HBA 1.0 dataset.
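To make the pixel-level evaluation described above concrete, here is a minimal sketch of how a predicted label map could be scored against a pixel-based ground truth. It is not the competition's evaluation protocol. The function name `pixel_scores`, the class labels ("text", "graphic", "background") and the choice of per-class precision, recall and F-measure are assumptions made for illustration only.

```python
from collections import Counter


def pixel_scores(ground_truth, prediction):
    """Per-class (precision, recall, F-measure) for two label maps.

    Both arguments are lists of rows, each row a list of class labels,
    one label per pixel. The label names are hypothetical, not taken
    from the HBA 1.0 ground truth format.
    """
    if len(ground_truth) != len(prediction):
        raise ValueError("label maps must have the same number of rows")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gt_row, pred_row in zip(ground_truth, prediction):
        if len(gt_row) != len(pred_row):
            raise ValueError("label maps must have the same row lengths")
        for gt_label, pred_label in zip(gt_row, pred_row):
            if gt_label == pred_label:
                tp[gt_label] += 1
            else:
                fn[gt_label] += 1    # a pixel of the true class was missed
                fp[pred_label] += 1  # the predicted class was wrongly assigned

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


# Toy 2x3 "page" with hypothetical class labels.
gt = [["text", "text", "graphic"],
      ["graphic", "background", "background"]]
pred = [["text", "graphic", "graphic"],
        ["graphic", "background", "text"]]

for label, (p, r, f) in pixel_scores(gt, pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Scoring each class separately, rather than reporting a single accuracy figure, keeps a small class (for example, an italic font) from being hidden by a dominant one; whether this matches the measures used in the paper is not stated in the abstract.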