A Novel Text Extraction Technique Based on Pattern Matching and Automatic Backtracking

2021 
Most Web documents are represented as a DOM tree, so extracting key information from a Web page usually requires traversing that tree. Whether an entity should be extracted depends not only on its own content but also on its surrounding context in the tree. In this paper, we introduce a method for extracting key Web information with a non-deterministic algorithm: it combines pattern matching with automatic backtracking to describe the key text content to be extracted concisely, which significantly simplifies the extraction process.
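To make the idea concrete, the following is a minimal sketch of context-dependent extraction with backtracking. It is not the paper's algorithm. The `Node` class, the `descendants` and `labelled_values` helpers, and the sample document are all hypothetical, chosen only for illustration. Python generators stand in for non-deterministic choice: each `for` loop is a choice point, and when a candidate fails the pattern, control simply falls back to the next alternative.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """A toy DOM node (hypothetical; a real DOM carries attributes too)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Non-deterministically choose any node in the subtree (pre-order)."""
    yield node
    for child in node.children:
        yield from descendants(child)


def labelled_values(root: Node, label: str) -> Iterator[str]:
    """Yield the text of each element that directly follows a sibling
    whose text equals `label`. The match depends on the element's
    context (its preceding sibling), not only on its own content."""
    for parent in descendants(root):        # choice point: which parent
        kids = parent.children
        for i in range(len(kids) - 1):      # choice point: which sibling pair
            value = kids[i + 1].text.strip()
            if kids[i].text.strip() == label and value:
                yield value
            # On failure nothing is yielded; the loops move on to the
            # next alternative, which is the backtracking step.


doc = Node("html", children=[
    Node("div", children=[Node("span", "Title:"), Node("span", "Widget")]),
    # The label is present but the value is empty, so this branch fails.
    Node("div", children=[Node("span", "Price:"), Node("span", "")]),
    Node("table", children=[
        Node("tr", children=[Node("td", "Price:"), Node("td", "9.99")]),
    ]),
])

first_price = next(labelled_values(doc, "Price:"), None)
print(first_price)
```

In this sketch the second `div` matches the label but has an empty value, so the search abandons that branch and continues into the table, where the pattern succeeds. The caller only states the pattern; the traversal order and the recovery from failed candidates are handled by the generator machinery.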