Research on the Optimization of Spark Big Table Equal Join

2019 
The equal join of large tables is one of the key operations Spark performs when processing large-scale data. However, when Spark performs an equal join on large tables, the network transmission overhead and the I/O cost are high. This paper therefore proposes an optimized large table join method for Spark. First, the method introduces a Split Compressed Bloom Filter algorithm, which is suitable for filtering data sets of unknown size. Then, a Maxdiff histogram is used to analyze the data distribution of the tables being joined and to identify the skewed data in the data set. According to the statistical results, the RDD is split, each part is joined with a suitable join algorithm, and the sub-results are combined to obtain the final result. Experiments show that the proposed method has clear advantages over Spark's original method in shuffle write, shuffle read, and task running time.
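
The pipeline described above (filter, detect skew, split, join, combine) can be illustrated with a minimal single-process sketch in plain Python. It is not the paper's implementation and makes several assumptions: a plain Bloom filter stands in for the Split Compressed Bloom Filter, a simple key-frequency threshold stands in for the Maxdiff histogram, and lists of (key, value) pairs stand in for RDDs. The names BloomFilter, hash_join, optimized_join, and skew_threshold are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


class BloomFilter:
    """Plain Bloom filter (assumption: NOT the paper's Split Compressed variant)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def hash_join(left, right):
    """Inner equal join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def optimized_join(left, right, skew_threshold=3):
    # Step 1: drop left rows whose key cannot appear in the right table.
    bloom = BloomFilter()
    for key, _ in right:
        bloom.add(key)
    left = [(k, v) for k, v in left if k in bloom]

    # Step 2: find skewed keys (assumption: a frequency threshold replaces
    # the Maxdiff histogram analysis).
    counts = Counter(k for k, _ in left)
    skewed = {k for k, c in counts.items() if c >= skew_threshold}

    # Step 3: split both tables into skewed and non-skewed parts.
    left_skew = [(k, v) for k, v in left if k in skewed]
    left_rest = [(k, v) for k, v in left if k not in skewed]
    right_skew = [(k, v) for k, v in right if k in skewed]
    right_rest = [(k, v) for k, v in right if k not in skewed]

    # Step 4: join each part separately and combine the sub-results.
    return hash_join(left_skew, right_skew) + hash_join(left_rest, right_rest)
```

In this sketch both parts are joined with the same in-memory hash join, so the split only marks where a distributed system could pick a different strategy for each part. Bloom filter false positives do not affect correctness, because unmatched rows that pass the filter are still discarded by the join itself.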