Extraction of Arabic text from multilingual documents

2002 
This paper describes the processing of multilingual (Arabic/Latin) documents extracted from Arabic scientific articles. Their pages contain Arabic lines that sometimes include one or more Latin words, because these words have no exact equivalent in Arabic. To process such blocks, we need to extract the Arabic text from the multilingual blocks. We propose an original method to locate Latin words in heterogeneous blocks. The method is based on Arabic character recognition performed by template matching, which tests have shown to be efficient for discriminating between Arabic and Latin script. Segment prototypes are extracted from the main font styles used in the treated magazines. Word discrimination results approach 100% on 30 blocks containing a total of 478 words.
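The discrimination step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity measure, the segmentation procedure, or any thresholds. The sketch assumes that each word has already been cut into equal-sized binary segment bitmaps, that similarity is the fraction of agreeing pixels, and that the names `similarity`, `best_match`, `is_arabic_word`, `match_threshold`, and `min_fraction` are hypothetical.

```python
from typing import Sequence

# A binary bitmap: rows of 0/1 values, where 1 marks an ink pixel.
Bitmap = Sequence[Sequence[int]]


def similarity(segment: Bitmap, template: Bitmap) -> float:
    """Return the fraction of pixels on which two same-shaped bitmaps agree."""
    same_shape = len(segment) == len(template) and all(
        len(row_s) == len(row_t) for row_s, row_t in zip(segment, template)
    )
    if not same_shape:
        raise ValueError("segment and template must have the same shape")
    total = sum(len(row) for row in segment)
    if total == 0:
        raise ValueError("bitmaps must not be empty")
    agree = sum(
        1
        for row_s, row_t in zip(segment, template)
        for a, b in zip(row_s, row_t)
        if a == b
    )
    return agree / total


def best_match(segment: Bitmap, prototypes: Sequence[Bitmap]) -> float:
    """Return the highest similarity between a segment and any prototype."""
    return max(similarity(segment, proto) for proto in prototypes)


def is_arabic_word(
    segments: Sequence[Bitmap],
    arabic_prototypes: Sequence[Bitmap],
    match_threshold: float = 0.9,  # assumed value, not from the paper
    min_fraction: float = 0.5,     # assumed value, not from the paper
) -> bool:
    """Label a word as Arabic if enough of its segments match Arabic prototypes."""
    if not segments:
        return False
    matched = sum(
        1 for seg in segments
        if best_match(seg, arabic_prototypes) >= match_threshold
    )
    return matched / len(segments) >= min_fraction


# Toy 3x3 bitmaps, invented purely for illustration.
arabic_proto = [[1, 1, 1], [0, 0, 1], [1, 1, 1]]
unknown_segment = [[1, 0, 1], [1, 0, 1], [1, 0, 1]]

print(is_arabic_word([arabic_proto], [arabic_proto]))
print(is_arabic_word([unknown_segment], [arabic_proto]))
```

Under these assumptions, a word whose segments fail to match any Arabic prototype is treated as Latin, which mirrors the idea of locating Latin words by recognizing the Arabic ones. Real segments would need size normalization before comparison.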