Integrating Real-Time Entity Resolution with Top-N Join Query Processing

2021 
Real-time entity resolution (ER) is a challenging problem for large datasets. Traditional techniques for top-N join query processing assume clean data and perform no ER. On dirty datasets, where duplicate tuples refer to the same real-world entity, these techniques may return several duplicates among the top-N tuples for a query, so some useful tuples fail to be retrieved and effectiveness suffers. Based on “sorted and/or random accesses” and “no wild guesses”, in this paper we discuss models that integrate real-time entity resolution with top-N join queries over dirty datasets of real vectors. For finite-dimensional \(\ell_{p}\) spaces with p-norm distances as nonmonotone ranking functions, we use the norm equivalence theorem from functional analysis as a foundation and design buffers that join tuples with an outer-join mechanism and cluster candidates for ER. On this basis we propose two database-friendly algorithms that answer top-N join queries under two data access settings: restricted sorted access and no random access. Extensive experiments measure the effectiveness and efficiency of our approaches over various dirty datasets.
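The problem described above can be illustrated with a small sketch. It is not the paper's algorithm: it uses a full sort rather than sorted or random accesses, involves no join, and stands in for ER with a hypothetical distance threshold `eps`. All function names are assumptions made for illustration. It shows how duplicates crowd useful tuples out of a naive top-N result, and it checks the standard norm equivalence bound \(\|x\|_{\infty} \le \|x\|_{p} \le n^{1/p}\,\|x\|_{\infty}\) for vectors of dimension \(n\).

```python
import math

def p_dist(x, q, p):
    """p-norm distance between two real vectors of equal length."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(x, q))
    return sum(abs(a - b) ** p for a, b in zip(x, q)) ** (1.0 / p)

def inf_bounds(x, q, p):
    """Lower and upper bounds on the p-norm distance derived from the
    infinity-norm distance (norm equivalence in finite dimension)."""
    d_inf = p_dist(x, q, math.inf)
    return d_inf, len(x) ** (1.0 / p) * d_inf

def top_n_naive(tuples, query, n, p=2):
    """Top-N by distance to the query, with no duplicate handling."""
    return sorted(tuples, key=lambda t: p_dist(t, query, p))[:n]

def top_n_with_er(tuples, query, n, p=2, eps=0.05):
    """Top-N where a tuple within eps of an already selected tuple is
    treated as a duplicate of the same entity and skipped.
    The eps rule is a hypothetical stand-in for a real ER method."""
    result = []
    for t in sorted(tuples, key=lambda t: p_dist(t, query, p)):
        if all(p_dist(t, r, p) > eps for r in result):
            result.append(t)
        if len(result) == n:
            break
    return result

# Three near-identical tuples represent one entity in this toy dataset.
data = [(1.0, 1.0), (1.01, 1.0), (1.0, 1.02), (2.0, 2.0), (3.0, 1.0)]
query = (1.0, 1.0)

naive = top_n_naive(data, query, 3)       # all three slots go to one entity
resolved = top_n_with_er(data, query, 3)  # one slot per distinct entity
print(naive)
print(resolved)
```

In the naive result all three slots are taken by duplicates of one entity, whereas the resolved result also returns `(2.0, 2.0)` and `(3.0, 1.0)`.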