Unleashing Tabular Content to Open Data: A Survey on PDF Table Extraction Methods and Tools

2017 
Portable Document Format (PDF) has been a popular way to exchange data in documents since Adobe introduced the format in 1993. Its report-like characteristic which preserves and prioritizes graphical visualization was part of the main publishing concerns among several segments including government agencies. In this way, tabular data started to be enclosed within PDF documents and disclosed in government portals. This situation, apart being surprisingly contradictory to data openness, is still found even in the major open data initiatives. It is estimated that roughly 13% of published files in some main open data portals around the world have their data made available in PDF. Thus, there is a need for effective tools capable of extracting tabular content (a main placeholder for data) from PDF to allow its data to be published in more open formats such as the well-known CSV which complies with accessible and machine processable open data principles. This paper aims at providing a structured and comprehensive overview of the research in tabular content extraction specifically from PDF documents as well as to provide an overview of most recent practical results in the literature. The contribution of this work goes beyond theoretical discussions by helping data practitioners to understand to what extent methods and tools regarding tabular content extraction from PDF can benefit the open data initiatives in practical and effective ways.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    31
    References
    16
    Citations
    NaN
    KQI
    []