Chinese Web Content Extraction Based on Naïve Bayes Model

Wang Jinbo, Wang Lian-zhi, Gao Wanlin, Yu Jian, Cui Yuntao

2013

As web content extraction becomes increasingly difficult, this paper proposes a method that uses a Naive Bayes model to train on the attribute eigenvalues of web page blocks. The method first denoises the web page, represents it as a DOM tree, and divides it into blocks; it then uses the Naive Bayes model to obtain probability values for the statistical features of those blocks. Finally, it extracts the theme blocks and assembles them into the content of the web page. Tests show that the algorithm extracts web page content accurately, with an average accuracy of 96.2%. The method has been adopted to extract content for the off-portal search of the Hunan Farmer Training Website, where it runs with good efficiency.
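The pipeline described in the abstract (denoise, split into blocks, score block features with Naive Bayes, keep the theme blocks) can be illustrated with a minimal Python sketch. The abstract does not list the paper's block attributes, so everything specific here is an assumption made for illustration: the three features (log text length, link density, punctuation density), the Gaussian form of Naive Bayes, the variance smoothing constant, the flat block splitter that stands in for a real DOM tree, and the toy training data. Names such as `BlockSplitter`, `features`, and `extract_content` are hypothetical.

```python
import math
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p", "td", "li"}      # assumed block boundaries
NOISE_TAGS = {"script", "style"}           # removed in the denoising step
PUNCT = set(",.;:!?，。；：！？、")          # ASCII and Chinese punctuation

class BlockSplitter(HTMLParser):
    """Drop script/style noise and split the page into flat text blocks."""
    def __init__(self):
        super().__init__()
        self.blocks = []                   # list of (text, link_chars)
        self._text, self._link_chars = [], 0
        self._noise = self._in_link = 0

    def flush(self):
        text = "".join(self._text).strip()
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise = max(0, self._noise - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._noise:
            return                         # skip script/style bodies
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()                         # emit any trailing text
    return parser.blocks

def features(block):
    """Assumed statistical features: log length, link density, punctuation density."""
    text, link_chars = block
    n = len(text)
    return [math.log1p(n), min(1.0, link_chars / n), sum(c in PUNCT for c in text) / n]

def train(samples):
    """Fit a Gaussian Naive Bayes model from (feature_vector, label) pairs."""
    model = {}
    for label in {y for _, y in samples}:
        rows = [x for x, y in samples if y == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        # 1e-3 is an assumed smoothing term that keeps variances non-zero
        varis = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-3
                 for col, m in zip(zip(*rows), means)]
        model[label] = (len(rows) / len(samples), means, varis)
    return model

def log_posterior(model, x, label):
    prior, means, varis = model[label]
    return math.log(prior) + sum(
        -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        for v, m, s2 in zip(x, means, varis))

def classify(model, x):
    return max(model, key=lambda label: log_posterior(model, x, label))

def extract_content(html, model):
    """Keep only the blocks classified as theme ("content") blocks."""
    return "\n".join(b[0] for b in split_blocks(html)
                     if classify(model, features(b)) == "content")

# Toy hand-labelled blocks, invented for this sketch.
TRAIN = [
    (("The training center opened a new course this spring, and more than two hundred farmers signed up.", 0), "content"),
    (("Rice yields rose last year; the report credits better seed, timely rain, and new tools.", 0), "content"),
    (("Experts visited three villages, answered questions, and handed out printed guides.", 0), "content"),
    (("Home", 4), "noise"),
    (("About us", 8), "noise"),
    (("Login Register", 13), "noise"),
]
PAGE = """<html><head><style>p{color:red}</style><script>var x = 1;</script></head><body>
<div><a href="/">Home</a></div>
<div><a href="/news">News</a></div>
<p>The county held a two-day workshop on pest control, and local growers shared their own field notes.</p>
<div><a href="/contact">Contact</a></div>
</body></html>"""

model = train([(features(b), y) for b, y in TRAIN])
print(extract_content(PAGE, model))
```

In this sketch, link-heavy navigation blocks score poorly under the "content" class because their link density is far from that of the labelled content blocks. A real system would train on many labelled pages and would likely use a richer, DOM-aware feature set.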

Keywords:

Web content
Naive Bayes classifier
Web page
Document Object Model
Computer science
Artificial intelligence
Pattern recognition
Data mining
Content extraction
