A Robust Method for Text, Line, and Word Segmentation for Historical Arabic Manuscripts

2021 
Segmenting old documents is a crucial step toward reading and understanding their content automatically, and extracting words and phrases from a document requires each line and word to be segmented first. Two problems commonly arise in such documents, especially in Arabic manuscripts: the direction of text lines varies within the same document, and characters overlap between two or more text lines. This chapter therefore proposes an approach for text, line, and word segmentation of historical Arabic manuscripts. First, text segmentation is performed with an encoder-decoder deep model that separates the main text from the side text in the image. The model was trained on two Arabic manuscript datasets, Bukhari and RASM2018. Next, lines are segmented using a smoothing approach followed by thresholding, with the threshold determined automatically according to the size of the handwriting. Finally, words are segmented using a smoothed Chamfer distance that takes the handwriting characteristics into account. The proposed approach is evaluated on the QUWI Arabic database, where it achieves very promising results.
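The smoothing-then-thresholding idea for line segmentation can be illustrated with a minimal sketch. The abstract does not say what is smoothed or how the threshold is derived, so everything concrete here is an assumption made for illustration: a horizontal projection profile as the signal, a moving-average filter as the smoother, a `window` parameter standing in for the handwriting size, and a threshold set to a fraction (`frac`) of the mean smoothed profile. The function name `segment_lines` is hypothetical and this is not the chapter's implementation.

```python
import numpy as np

def segment_lines(binary, window, frac=0.5):
    """Return (start, end) row ranges of text lines in a binary image.

    binary: 2-D array with 1 for ink and 0 for background.
    window: smoothing width in rows (assumed to be tied to handwriting size).
    frac:   assumed fraction of the mean smoothed profile used as threshold.
    """
    # Horizontal projection profile: amount of ink in each row.
    profile = binary.sum(axis=1).astype(float)

    # Moving-average smoothing suppresses small gaps and isolated strokes.
    kernel = np.ones(window) / window
    smooth = np.convolve(profile, kernel, mode="same")

    # Threshold derived from the data itself rather than fixed by hand.
    is_text = smooth > frac * smooth.mean()

    # Locate the boundaries of each run of consecutive text rows.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]
```

Each returned pair is a half-open row range `[start, end)`. A plain projection profile like this one assumes roughly horizontal lines, so it does not by itself handle the varying line directions or overlapping characters that the chapter targets.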