Extraction of Tabular Data from PDF to CSV Files

2021 
Companies generate their reports as PDF files. For further data analysis, the statistics and other quantitative data in these reports have to be converted to CSV (.csv) or Excel (.xlsx) files. Companies currently do this conversion manually, which consumes time and effort that could be put to better use. Forecomp is a web application that automatically converts the tables in a PDF to CSV files. The tables may be present as text or as images. The application is built with flexibility in mind: the user selects the process used to convert the PDF into CSV files based on the kind of tables the PDF contains. The technologies used in the application include a YOLO model for machine learning, Tesseract OCR, Tabula, and a built-in snipping tool. This paper introduces the concepts behind Forecomp, focusing on the methodology employed and the results obtained.
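The idea of a user-selectable conversion process can be illustrated with a minimal sketch. This is not Forecomp's implementation: the function names, the mode names ("text", "image"), and the whitespace-based row splitting are all assumptions made for illustration, and the YOLO model and snipping tool are left out. The sketch assumes the tabula-py, pdf2image, and pytesseract packages for the two extractors; the dispatch and CSV-writing parts use only the standard library.

```python
import csv
import io
from typing import Callable, Dict, List

Table = List[List[str]]  # one table = a list of rows, each row a list of cells


def extract_text_tables(pdf_path: str) -> List[Table]:
    """Text-based PDFs: read tables with tabula-py (requires Java)."""
    import tabula  # imported lazily so the rest of the sketch runs without it

    frames = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    # Each result is a pandas DataFrame; keep the header as the first row.
    return [
        [list(map(str, df.columns))] + df.astype(str).values.tolist()
        for df in frames
    ]


def extract_image_tables(pdf_path: str) -> List[Table]:
    """Scanned PDFs: render each page and OCR it with Tesseract.

    Assumption: each page holds one table and cells contain no spaces,
    so splitting OCR lines on whitespace is good enough for a sketch.
    """
    from pdf2image import convert_from_path
    import pytesseract

    tables = []
    for page in convert_from_path(pdf_path):
        text = pytesseract.image_to_string(page)
        rows = [line.split() for line in text.splitlines() if line.strip()]
        if rows:
            tables.append(rows)
    return tables


def table_to_csv(table: Table) -> str:
    """Serialize one table to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


# Hypothetical mode names; the user picks one based on their PDF.
EXTRACTORS: Dict[str, Callable[[str], List[Table]]] = {
    "text": extract_text_tables,
    "image": extract_image_tables,
}


def convert(pdf_path: str, mode: str,
            extractors: Dict[str, Callable[[str], List[Table]]] = EXTRACTORS
            ) -> List[str]:
    """Run the extractor chosen by the user; return one CSV string per table."""
    if mode not in extractors:
        raise ValueError(f"unknown mode: {mode!r}")
    return [table_to_csv(table) for table in extractors[mode](pdf_path)]
```

Calling `convert("report.pdf", "text")` would return one CSV string per detected table, which the caller can then write to separate files. Keeping the extractors in a dictionary means another process can be registered without changing `convert`.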