Extracting Various Types of Informative Web Content via Fuzzy Sequential Pattern Mining

2017 
In this paper, we present a web content extraction method that extracts different types of informative web content from news web pages. A fuzzy sequential pattern mining method, FSP, is developed to gradually discover fuzzy sequential patterns for various types of informative web content. Because the usage of HTML tags may change as web technology develops, fuzzy sequential patterns are mined using a stable feature, namely the number of tokens in each line of source code. We have conducted extensive experiments and observed good clustering properties for the discovered sequential patterns. Experimental results demonstrate that the FSP method is effective compared with state-of-the-art content extraction methods. Besides the main articles of web pages, it can also effectively find other types of interesting web content, such as article recommendations and article titles.
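
As an illustration of the feature named in the abstract, the following minimal sketch computes the number of tokens in each line of a page's source code. The abstract does not specify how tokens are defined, so whitespace splitting is an assumption here, and the function name `token_counts` and the sample page are hypothetical. The sketch covers only the feature computation, not the FSP mining step.

```python
def token_counts(html_source: str) -> list[int]:
    """Return the number of tokens on each line of the source code.

    Assumption: a token is a maximal run of non-whitespace characters.
    The paper may define tokens differently.
    """
    return [len(line.split()) for line in html_source.splitlines()]


# Hypothetical four-line page: a tag-only line, a text-heavy line,
# an empty line, and a short link line.
sample = (
    "<html>\n"
    "<p>Breaking news: markets rally today</p>\n"
    "\n"
    "<a href='/x'>More</a>\n"
)

counts = token_counts(sample)
print(counts)
```

The resulting sequence of per-line counts depends only on how much text each line carries, not on which HTML tags appear, which is the kind of tag-independent signal the abstract describes.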