A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition: SleukRith Set

2017 
Analysis of ancient Khmer documents can be quite challenging because of the elaborate shapes of Khmer handwritten characters, combined with the complex way words are formed from those characters. Palm leaf manuscripts, among the best-known types of old Khmer documents, are being digitized and centralized; document analysis functions such as text search are therefore needed, but remain unavailable for this type of document. To contribute to the progress of relevant research, we introduce in this paper a new dataset called SleukRith Set, comprising 657 pages of Khmer palm leaf manuscripts randomly selected from various collections that vary in quality and digitization method. The dataset contains three types of data: isolated characters, words, and lines. Each type is annotated with ground truth information, which can be used both to evaluate and to train methods for common document analysis tasks such as character/text recognition, word/line segmentation, and word spotting. To serve as a baseline, we also present the results of an evaluation study of Khmer isolated character recognition that we conducted on SleukRith Set using a convolutional neural network.