Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment

2019 
Data enrichment is the act of extending a local database with new attributes from external data sources. In this paper, we study a novel problem-how to progressively crawl the deep web (i.e., a hidden database) through a keyword-search API to enrich a local database in an e ective way. This is chal- lenging because these interfaces often limit the data access by enforcing the top-k constraint or limiting the number of queries that can be issued within a time window. In response, we propose SmartCrawl, a new framework to collect re- sults e ectively. Given a query budget b, SmartCrawl rst constructs a query pool based on the local database, and then iteratively issues a set of most bene cial queries to the hidden database such that the union of the query results can cover the maximum number of local records. The key technical challenge is how to estimate query bene t, i.e., the number of local records that can be covered by a given query. A simple approach is to estimate it as the query frequency in the local database. We nd that this is ine ective due to i) the impact of |ΔD|, where |ΔD| represents the number of local records that cannot be found in the hidden database, and ii) the top-k constraint enforced by the hidden database. We study how to mitigate the negative impacts of the two factors and propose e ective optimization techniques to improve performance. The experimental results show that on both simulated and real-world hidden databases, SmartCrawl signi cantly increases coverage over the local database as compared to the baselines.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    45
    References
    3
    Citations
    NaN
    KQI
    []