Automatic Document Classification using Deep Feature Selection and Knowledge Transfer

2020 
Documents in an ERP system come from different sources (customers, suppliers, etc.) and can have different layouts, sizes and subjects (invoices, delivery forms, checks, etc.). These documents are usually classified manually before being saved in the ERP system or processed by an Optical Character Recognition (OCR) engine. In this paper, we investigate the use of different deep convolutional neural networks (CNNs) to extract deep features from images of scanned documents. The extracted features are then passed to various machine learning classifiers: Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Gaussian Naive Bayes (GNB). Several metrics (accuracy, precision, etc.) were used to compare the performance of all models, and cross-validation with different numbers of folds (4, 6, 8 and 10) was used to assess their generalization ability. The effect of dimensionality reduction techniques on overall performance was also explored. The best classification rate was 96.1%, achieved by combining LR with the VGG19 model. Because this result was obtained on a small dataset (200 images), the approach could be used in an ERP system as a preprocessing step for document handling by ERP users.
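
The evaluation protocol described above (feature vectors fed to a classifier and scored by cross-validation with 4, 6, 8 and 10 folds) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the feature vectors are synthetic Gaussian clusters standing in for CNN features, the choice of four classes and 16 dimensions is arbitrary, and the classifier is a plain 1-nearest-neighbor rule written with the standard library rather than the LR, SVM or GNB models the paper evaluates. The helper names (`make_synthetic_features`, `k_fold_indices`, `predict_1nn`, `cross_val_accuracy`) are hypothetical.

```python
import math
import random


def make_synthetic_features(n_per_class=50, n_classes=4, dim=16, seed=0):
    """Stand-in for CNN features (assumption): one Gaussian cluster per class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        center = [rng.uniform(-5, 5) for _ in range(dim)]
        for _ in range(n_per_class):
            X.append([c + rng.gauss(0, 1) for c in center])
            y.append(label)
    return X, y


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def predict_1nn(train_X, train_y, x):
    """Return the label of the training vector closest to x (Euclidean)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def cross_val_accuracy(X, y, n_folds):
    """Mean held-out accuracy over n_folds folds."""
    scores = []
    for held_out in k_fold_indices(len(X), n_folds):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(predict_1nn(train_X, train_y, X[i]) == y[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# 4 classes x 50 samples = 200 vectors, matching the dataset size in the text.
X, y = make_synthetic_features()
results = {k: cross_val_accuracy(X, y, k) for k in (4, 6, 8, 10)}
for k, acc in results.items():
    print(f"{k}-fold mean accuracy: {acc:.3f}")
```

In a real pipeline the synthetic vectors would be replaced by features extracted from the scanned images with a pretrained CNN, and the accuracies printed here say nothing about the results reported in the paper.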