A simple approach to the design of site-level extractors using domain-centric principles

Chong Long, Xiubo Geng, Chang Xu, S. Sathiya Keerthi

2012

We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to this problem are based on page-level inference. We propose a different approach that directly chooses attribute extractors for a site, using a scoring mechanism that is designed at the domain level via simple classification methods and trained on a small number of sites. To keep the number of candidate extractors in each site manageably small, we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is very effective in spite of its simplicity.
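The two observations can be illustrated with a minimal sketch. It is not the authors' implementation: the toy pages, the XPath-like strings, the regex annotator, and the function names are all hypothetical, and the learned domain-level classifier is replaced here by a plain annotator hit rate. The sketch scores each candidate extractor over the whole site, picks the best one as the anchor, and then lists nearby candidates for non-anchor attributes.

```python
import re
from collections import defaultdict

# Hypothetical toy site: each page maps a candidate extractor (an XPath-like
# string) to the text that extractor pulls from the page.
PAGES = [
    {"/div[1]/h1": "Jazz Night", "/div[2]/span[1]": "March 3, 2012", "/div[2]/span[2]": "Blue Note"},
    {"/div[1]/h1": "Book Fair", "/div[2]/span[1]": "April 14, 2012", "/div[2]/span[2]": "City Hall"},
    {"/div[1]/h1": "May 1 Parade", "/div[2]/span[1]": "May 1, 2012", "/div[2]/span[2]": "Main Street"},
]

# Deliberately loose pattern, so the annotator is imprecise (it also fires on titles).
DATE_RE = re.compile(r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}\b")


def annotate_date(text):
    """Imprecise annotator (assumed): fires on anything that looks like a date."""
    return bool(DATE_RE.search(text))


def score_candidates(pages, annotator):
    """Site-level score: fraction of pages where the annotator fires on the extractor's output.

    Stand-in for the learned scoring function described in the abstract.
    """
    hits = defaultdict(int)
    for page in pages:
        for path, text in page.items():
            if annotator(text):
                hits[path] += 1
    return {path: count / len(pages) for path, count in hits.items()}


def choose_extractor(pages, annotator):
    """Pick the candidate extractor with the highest site-level score."""
    scores = score_candidates(pages, annotator)
    return max(scores, key=scores.get)


def nearby_candidates(anchor, pages):
    """Candidates for non-anchor attributes: paths sharing the anchor's parent node."""
    parent = anchor.rsplit("/", 1)[0]
    paths = {path for page in pages for path in page}
    return sorted(p for p in paths if p != anchor and p.rsplit("/", 1)[0] == parent)


best = choose_extractor(PAGES, annotate_date)
print(best, nearby_candidates(best, PAGES))
```

Scoring at the site level rather than per page is what lets a noisy annotator still work: a false hit on one page's title is outvoted by the extractor that matches consistently across pages.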
