Intelligent classification of web pages using contextual and visual features
2011
In this paper we address the classification of Web content, and in particular its application to the detection of pornographic Web pages. Filtering of undesirable Web content is mainly achieved either by blocking a specific Web address found in a reference blacklist of URLs, or by plain contextual analysis of the page, searching for particular keywords in the text. The main problems with current filtering methods are the need to update the URL list continually and the high rate of over-blocking of ordinary pages. We propose an intelligent approach based on textual, profile, and visual features in a hierarchically structured classifier. Textual features contain information about keywords, black-words, etc., and profile features contain structural information such as the number of links, meta-tags, pictures, etc. For the visual features we employ a set of global and local indicative features, including topological and shape-based characteristics, which are extracted from the skin region. The algorithm was trained on a dataset of 1295 Web pages, comprising 700 porn pages (containing text, images, or both) in English and Persian and 595 non-porn pages on topics such as medicine, health, and sports. On a test dataset of 290 Web pages, an accuracy of 95% was obtained.
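The textual and profile features described in the abstract could be gathered along the following lines. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `ProfileFeatureParser` and `extract_features`, the particular feature set, and the `black_words` argument are all hypothetical, and the abstract does not say how the features are actually computed or weighted.

```python
import re
from html.parser import HTMLParser


class ProfileFeatureParser(HTMLParser):
    """Counts structural elements and collects visible text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = {"links": 0, "meta_tags": 0, "images": 0}
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.counts["links"] += 1
        elif tag == "meta":
            self.counts["meta_tags"] += 1
        elif tag == "img":
            self.counts["images"] += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside <script> or <style>.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def extract_features(html, black_words):
    """Return profile and textual features for one page as a dict.

    `black_words` is an assumed, caller-supplied set of lowercase terms.
    """
    parser = ProfileFeatureParser()
    parser.feed(html)
    parser.close()

    words = re.findall(r"\w+", " ".join(parser.text_parts).lower())
    black_hits = sum(1 for w in words if w in black_words)

    features = dict(parser.counts)  # profile features
    features["word_count"] = len(words)  # textual features
    features["black_word_count"] = black_hits
    features["black_word_ratio"] = black_hits / len(words) if words else 0.0
    return features
```

A feature dictionary of this kind would be one input to the first stage of a hierarchical classifier; the visual, skin-region features mentioned in the abstract would need a separate image-processing step that is not sketched here.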