Text Content Based Layout Analysis

José Ramón Prieto, Vicente Bosch, Enrique Vidal, Dominique Stutzmann, Sébastien Hamel

2020

State-of-the-art Document Layout Analysis methods rely on graphical appearance features in order to detect and classify the different layout regions present in a scanned text image. In many cases, however, performing this task using only graphical information is problematic or impossible. Only by actually reading some text at the boundaries of the problematic regions does it become possible to reliably detect and separate these regions. In these situations, textual, content-based features would be required, but since transcription is usually performed after layout analysis, a vicious circle arises. In this work, we circumvent this deadlock by making use of the recently introduced concept of Probabilistic Index Map. We use the word relevance probabilities provided by this map to calculate relevant text content based features at the pixel level. We assess the impact of these new features on a complex paragraph classification task on historical documents. The experiments are performed using both a classical Hidden Markov Model approach and Deep Neural Networks. The obtained results are encouraging and showcase the positive impact text content based features can have on the Document Layout Analysis research field.
