Text Content Based Layout Analysis

2020 
State-of-the-art Document Layout Analysis methods rely on graphical appearance features in order to detect and classify the different layout regions present in a scanned text image. In many cases, however, performing this task using only graphical information is problematic or impossible. Only by actually reading some text in the boundaries of the problematic regions it becomes possible to reliably detect and separate these regions. In these situations, textual, content-based features would be required, but since transcription is usually performed after layout analysis, a vicious circle arises. In this work, we circumvent this deadlock by making use of the recently introduced concept of Probabilistic Index Map. We use the word relevance probabilities provided by this map to calculate relevant text content based features at the pixel level. We assess the impact of these new features on a historical document complex paragraph classification task. The experiments are performed using both a classical Hidden Markov Model approach and Deep Neural Networks. The obtained results are encouraging and showcase the positive impact text content based features will have on the Document Layout Analysis research field.
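
As a rough illustration of what "text content based features at the pixel level" could look like, the sketch below turns a list of word spots into a per-pixel feature map. It is not the method of the paper: the spot format (word, bounding box, relevance probability), the function name pixel_text_features, the fixed vocabulary, and the choice of taking the maximum probability over overlapping boxes are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical index entry: (word, x0, y0, x1, y1, relevance_probability).
# The real Probabilistic Index Map format may differ; this is an assumption.
Spot = Tuple[str, int, int, int, int, float]


def pixel_text_features(
    spots: List[Spot],
    vocabulary: List[str],
    width: int,
    height: int,
) -> List[List[List[float]]]:
    """Return a height x width x len(vocabulary) feature map.

    Each channel holds, per pixel, the highest relevance probability of the
    corresponding (lowercase) vocabulary word among the spots covering it.
    """
    channel: Dict[str, int] = {w: i for i, w in enumerate(vocabulary)}
    features = [
        [[0.0] * len(vocabulary) for _ in range(width)] for _ in range(height)
    ]
    for word, x0, y0, x1, y1, prob in spots:
        k = channel.get(word.lower())
        if k is None:
            continue  # word is not in the chosen vocabulary
        # Clip the box to the image; (x1, y1) is treated as exclusive.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                if prob > features[y][x][k]:
                    features[y][x][k] = prob
    return features


# Toy example: two overlapping hypotheses for "item" and one ignored word.
spots = [
    ("item", 0, 0, 3, 2, 0.9),
    ("item", 2, 0, 5, 2, 0.4),
    ("the", 0, 2, 2, 3, 0.7),
]
fmap = pixel_text_features(spots, ["item", "anno"], width=5, height=3)
```

A map of this kind could be stacked with graphical features as extra input channels for a pixel-level classifier; whether the paper combines them in this way is not stated in the abstract.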