Spark-based Streamlined Metablocking

2017 
Blocking techniques are widely applied in Entity Resolution (ER) approaches as a preprocessing step to avoid the quadratic cost of the ER task. In this context, heterogeneous data and Big Data emerge as the major challenges faced by blocking techniques. To address them, we propose a novel approach, Spark-based Streamlined Metablocking (SS-Metablocking). We also propose a cardinality-based load balancing technique, applied in SS-Metablocking to improve its efficiency, and the GWNP pruning algorithm, which improves its effectiveness. The experimental results show that the proposed approach achieves better efficiency and effectiveness than the state-of-the-art approach.