Title-Based Extraction of News Contents for Text Mining

2018 
Web news is a vital source of valuable information and intelligence, and it floods every corner of the Internet at all times. Traditionally, templates or hand-designed features have been used to extract content from web pages, but these approaches have high time cost and low extensibility. More recently, many researchers have turned to DOM-tree-based or text-density-based models, which offer better extensibility and lower time cost; however, most of them struggle to extract content accurately and completely and tend to introduce noise. In this paper, we propose TWCEM, a title-based web content extraction model that leverages title information to extract the content of each web page. Compared with other extraction models, TWCEM filters noise effectively and locates content positions more accurately. In our experiments, we evaluate the proposed model on real-life websites; TWCEM achieves state-of-the-art results and outperforms its competitors in both extraction performance and time cost.