Online Web news extraction via tag path feature weighted by text block density

2017 
Web news extraction underpins many “big data” and “big knowledge” applications and remains an open research problem. At present, tag paths and text block density are two effective features for solving it. The tag path feature distinguishes content from noise well at the level of the whole webpage, but it has difficulty recognizing noise inside a content block or content inside a noise block. The text block density feature recognizes high-density content blocks well, but it is not robust enough. To address these problems, we propose a Web information extraction model, referred to as CEDP, that effectively combines the tag path feature and the text block density feature. We design a tag path feature weighted by text block density in order to exploit the merits of both features. In addition, we design a Web news extraction method based on the weighted tag path feature, CEDP-NLTD. CEDP-NLTD is a fast, universal, training-free, online Web news extraction algorithm suited to extracting heterogeneous Web news from the big data environment of the Web across various resources, styles, and languages. Experiments on public datasets such as CleanEval show that CEDP-NLTD outperforms the state-of-the-art CETR, CETD, CEPR, and CEPF methods, and that it also outperforms CEDP-TD, CEDP-CTD, and CEDP-DSum, each of which is generated from CEDP by using one of the three block density features of CETD.
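The general idea of weighting a tag path feature by a text density feature can be illustrated with a minimal sketch. It is not the CEDP or CEDP-NLTD formulation, which the abstract does not specify. Everything here is an assumption made for illustration: the names `TagPathCollector` and `extract_content`, the "path share" feature (the fraction of the page's text found under a tag path), the density proxy (mean characters per text node on that path), and the mean-score threshold.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Collects a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # list of (tag_path, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not SKIP_TAGS.intersection(self.stack):
            self.nodes.append(("/".join(self.stack), text))


def extract_content(html, ratio=1.0):
    """Return text nodes whose density-weighted path score reaches the threshold.

    Illustrative scoring only (an assumption, not the paper's formulas):
      path_share = chars under the tag path / chars on the page
      density    = mean chars per text node under the tag path
      score      = path_share * density
    """
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    nodes = parser.nodes
    if not nodes:
        return []

    chars_by_path = defaultdict(int)
    count_by_path = defaultdict(int)
    for path, text in nodes:
        chars_by_path[path] += len(text)
        count_by_path[path] += 1
    total_chars = sum(chars_by_path.values())

    scores = []
    for path, _ in nodes:
        path_share = chars_by_path[path] / total_chars
        density = chars_by_path[path] / count_by_path[path]
        scores.append(path_share * density)

    # Keep nodes scoring at least `ratio` times the mean score (assumed rule).
    threshold = ratio * sum(scores) / len(scores)
    return [text for (_, text), s in zip(nodes, scores) if s >= threshold]
```

Calling `extract_content(page_html)` on a page with a link-heavy navigation bar and a few long paragraphs should tend to keep the paragraphs, because their tag path carries most of the page's text and each node on it is long. Like the method described above, the sketch needs no training data and processes one page at a time.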