Multi-lingual date field extraction for automatic document retrieval by machine

2015 
Robotic intelligence has recently received significant attention in the research community. Application of such artificial intelligence can be used to perform automatic document retrieval and interpretation by a robot through query. So, it is necessary to extract the key information from the document based on the query to produce the desired feedback. For this purpose, in this paper we propose a system for automatic date field extraction from multi-lingual (English, Devnagari and Bangla scripts) handwritten documents. The date is a key piece of information, which can be used in various robotic applications such as date-wise document indexing/retrieval. In order to design the system, first the script of the document is identified, and based on the identified script, word components of each text line are classified into month and non-month classes using word-level feature extraction and classification. Next, non-month words are segmented into individual components and labelled into one of text, digit, punctuation or contraction categories. Subsequently, the date patterns are searched using the labelled components. Both numeric and semi-numeric regular expressions have been used for date part extraction. Dynamic Time Warping (DTW) and profile feature-based approaches are used for classification of month/non-month words. Other date components such as numerals and punctuation marks are recognised using a gradient-based feature and Support Vector Machine (SVM) classifier. The experiments are performed on English, Devnagari and Bangla document datasets and the encouraging results obtained from the system indicate the effectiveness of the proposed system.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    34
    References
    12
    Citations
    NaN
    KQI
    []