Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution

2020 
Entity resolution (ER) is a process to identify records that refer to the same real-world entity. In recent years, facing the ever-increasing data volume, both blocking techniques and parallel computation have been proposed for ER to reduce its running time and improve efficiency. It is popular and convenient to apply the MapReduce programming model for parallel computation. With the default load balancing strategy, if the block sizes are skewed, an imbalanced reducer load will occur and significantly increase the runtime. One possible solution is block-splitting: breaking the overpopulated blocks into smaller sub-blocks, to improve efficiency. In this paper we analyze the advantages and disadvantages of state-of-the-art block splitting methods (BlockSplit and BlockSlicer), and we propose two approaches: TLS and BOS to overcome the identified drawbacks. We comprehensively evaluate and compare our proposed solutions, with Spark implementations, using real-world and synthetic datasets with different properties. The results show that all of them can balance the reducer load with the help of the greedy partition assignment strategy. When memory of used cluster is not abundant given a dataset, a high number of reducers is required to reduce the GC time to improve efficiency. Partitcularly, our TLS and BOS have overwelmingly lower overhead due to the ability of block-wise composite key assignment.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    27
    References
    0
    Citations
    NaN
    KQI
    []