Extraction of Arabic text from multilingual documents

Ikram Moalla, A. Elbaati, A.A. Alimi, A. Benhamadou

2002

This paper describes the processing of multilingual (Arabic/Latin) documents drawn from Arabic scientific articles. Their pages contain Arabic lines that sometimes include one or more Latin words, used because they have no exact equivalent in Arabic. To process these blocks, the Arabic text must first be extracted from the multilingual blocks. We propose an original method to locate Latin words in heterogeneous blocks. The method is based on Arabic character recognition performed by template matching, which tests have shown to be efficient for discriminating between Arabic and Latin script. Segment prototypes are extracted from the main font styles used in the magazines treated. Word discrimination results approach 100% on 30 blocks containing a total of 478 words.
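
The idea of discriminating scripts by template matching can be illustrated with a minimal sketch. It is not the authors' implementation: the bitmap representation, the pixel-agreement score, the thresholds, and all names (`match_score`, `classify_word`, `PROTO_A`, `PROTO_B`) are assumptions chosen for illustration, and the 3x3 prototypes are toy shapes rather than real Arabic segments. A word is labelled Arabic when enough of its segments match some Arabic prototype closely; otherwise it is treated as Latin.

```python
from typing import Sequence

Bitmap = Sequence[Sequence[int]]  # binary image: 1 = ink, 0 = background


def match_score(segment: Bitmap, template: Bitmap) -> float:
    """Return the fraction of pixels on which two same-sized bitmaps agree."""
    if len(segment) != len(template) or any(
        len(a) != len(b) for a, b in zip(segment, template)
    ):
        raise ValueError("segment and template must have the same shape")
    total = sum(len(row) for row in segment)
    if total == 0:
        raise ValueError("bitmaps must not be empty")
    agree = sum(
        1
        for seg_row, tpl_row in zip(segment, template)
        for s, t in zip(seg_row, tpl_row)
        if s == t
    )
    return agree / total


def classify_word(
    segments: Sequence[Bitmap],
    arabic_prototypes: Sequence[Bitmap],
    threshold: float = 0.85,     # assumed minimum score for a segment match
    min_fraction: float = 0.5,   # assumed share of segments that must match
) -> str:
    """Label a word 'arabic' if enough segments match an Arabic prototype."""
    if not segments:
        raise ValueError("a word needs at least one segment")
    recognized = sum(
        1
        for seg in segments
        if max(match_score(seg, p) for p in arabic_prototypes) >= threshold
    )
    return "arabic" if recognized / len(segments) >= min_fraction else "latin"


# Toy prototypes and segments (hypothetical shapes, for illustration only).
PROTO_A = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
PROTO_B = [[0, 1, 0], [0, 1, 0], [1, 1, 1]]
NOISY_B = [[0, 1, 0], [0, 1, 0], [1, 1, 0]]   # PROTO_B with one pixel flipped
LATIN_U = [[1, 0, 1], [1, 0, 1], [1, 1, 1]]   # matches neither prototype well

arabic_word = [PROTO_A, NOISY_B]
latin_word = [LATIN_U, LATIN_U]
print(classify_word(arabic_word, [PROTO_A, PROTO_B]))
print(classify_word(latin_word, [PROTO_A, PROTO_B]))
```

In a real system the segments would come from a segmentation step and would be size-normalized before comparison, and the prototypes would be built from the font styles actually present in the documents, as the abstract indicates.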

Keywords:

Document processing
Optical character recognition
Natural language processing
Template matching
Arabic
Speech recognition
Latin script
Font
Intelligent character recognition
Computer science
Text mining
Artificial intelligence
Character recognition
