Content Information Extraction of Theme Web Pages Based on Tag Information

2014 
To extract the content information of theme web pages more accurately, this paper proposes a self-learning method based on tag information, which calculates the information quantity of various tag indicators. The method predefines several tag information indexes and coefficient indexes, calculates the information quantity of each tag in the web page in turn, and takes the tag with the largest information quantity as the location of the page's candidate content. To improve the versatility of the method, adaptive and adjustable coefficient weights are added to the formulas for tag information quantity. As more data is processed, the tag collections, index values, and information quantity results are added to a learning database, which is used to adjust the weights of the coefficient factors. Experimental results show that, with adaptive and adjustable coefficient weights, the extraction method can reach a recall rate of more than 99 percent. The method also does not depend on the specific structure or style of a web page and therefore has good versatility.
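The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the tag indexes or give the formulas, so the three indicators used here (text length, punctuation count, link text length), the weight values in `WEIGHTS`, the set of container tags, and the function names are all assumptions chosen for illustration. The sketch computes a weighted sum per container tag and returns the tag with the largest value; the learning step that adjusts the weights is not covered.

```python
from html.parser import HTMLParser

# Hypothetical indicators and weights; the abstract does not specify them.
WEIGHTS = {"text_len": 1.0, "punct": 5.0, "link_text_len": -1.0}
CONTAINER_TAGS = {"div", "td", "article", "section"}  # assumed candidate tags
PUNCTUATION = set(",.;:!?")


class TagIndicatorParser(HTMLParser):
    """Collects the assumed indicators for each container tag."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []    # stack of indicator dicts for open containers
        self.closed_blocks = []  # indicator dicts of containers already closed
        self.link_depth = 0      # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.open_blocks.append({
                "tag": tag,
                "id": dict(attrs).get("id"),
                "text_len": 0,
                "punct": 0,
                "link_text_len": 0,
            })
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif tag in CONTAINER_TAGS and self.open_blocks:
            # Assumes well-formed nesting of container tags.
            self.closed_blocks.append(self.open_blocks.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.open_blocks:
            return
        block = self.open_blocks[-1]  # credit only the innermost container
        block["text_len"] += len(text)
        block["punct"] += sum(1 for ch in text if ch in PUNCTUATION)
        if self.link_depth > 0:
            block["link_text_len"] += len(text)


def information_quantity(block, weights=WEIGHTS):
    """Weighted sum of the indicators (an assumed form of the formula)."""
    return sum(weight * block[name] for name, weight in weights.items())


def extract_candidate(html, weights=WEIGHTS):
    """Return the container with the largest information quantity, or None."""
    parser = TagIndicatorParser()
    parser.feed(html)
    parser.close()
    if not parser.closed_blocks:
        return None
    return max(parser.closed_blocks,
               key=lambda block: information_quantity(block, weights))


if __name__ == "__main__":
    page = (
        '<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>'
        '<div id="main"><p>The council approved the budget on Monday, '
        'after a long debate.</p></div>'
    )
    print(extract_candidate(page)["id"])
```

Penalizing link text is one common way to keep navigation blocks from outscoring body text; a self-learning variant along the lines the abstract describes would update the values in `WEIGHTS` from stored results rather than fixing them by hand.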