Big Data Analytics: Partitioned B+-Tree-Based Indexing in MapReduce

2021 
Big data analytics platforms are designed to improve performance by avoiding the extract, transform, load (ETL) approach. Several techniques that have worked very well for performance optimization in relational databases are still in the process of being integrated into big data analytics. Indexing, together with its underlying data structures, is one such technique. Despite its popularity in query optimization for efficient data mining, indexing was not integrated into the MapReduce platform, which by design performs a full scan of the input data. In recent years there have been attempts to incorporate indexing into MapReduce to improve performance, but these attempts have not exhausted the potential of indexing in MapReduce query processing. Consequently, this chapter presents an indexing approach that uses the partitioned B+-Tree as its data structure to index the InputSplit component of the Hadoop distributed file system, with the aim of achieving efficient data mining query processing when used with Hadoop MapReduce. The results of this study showed that, compared with normal MapReduce processing and another clustered index approach, the proposed index method significantly reduced the index size as well as the execution runtime of all search queries by at least 50% for every data size tested. Thus, the proposed index approach has the potential to halve the time taken to mine data within a dataset.
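The contrast between a full scan and a per-split index lookup can be illustrated with a minimal Python sketch. This is not the chapter's implementation: each split gets a sorted key list searched with `bisect`, which stands in for the leaf level of one B+-Tree partition, and the names `build_split_index`, `indexed_lookup`, `full_scan`, and the parameter `split_size` are hypothetical. Records are assumed to be in-memory `(key, value)` pairs rather than HDFS blocks.

```python
from bisect import bisect_left, bisect_right


def build_split_index(records, split_size):
    """Build one sorted (keys, offsets) partition per split of the input.

    Assumption: a sorted list stands in for the leaf level of a B+-Tree
    partition; a real partitioned B+-Tree would add internal nodes.
    """
    partitions = []
    for start in range(0, len(records), split_size):
        chunk = records[start:start + split_size]
        entries = sorted((key, start + i) for i, (key, _value) in enumerate(chunk))
        keys = [k for k, _ in entries]
        offsets = [o for _, o in entries]
        partitions.append((keys, offsets))
    return partitions


def indexed_lookup(partitions, records, key):
    """Return matching records, touching only splits whose key range fits."""
    hits = []
    for keys, offsets in partitions:
        # Skip a split entirely when the key lies outside its key range.
        if not keys or key < keys[0] or key > keys[-1]:
            continue
        lo = bisect_left(keys, key)
        hi = bisect_right(keys, key)
        hits.extend(records[o] for o in offsets[lo:hi])
    return hits


def full_scan(records, key):
    """Baseline: examine every record, as an unindexed job would."""
    return [r for r in records if r[0] == key]


if __name__ == "__main__":
    data = [(k % 7, f"row{k}") for k in range(20)]
    index = build_split_index(data, split_size=5)
    print(indexed_lookup(index, data, 3))
    print(full_scan(data, 3))
```

Both functions return the same records; the indexed version does a binary search inside each candidate split instead of comparing every record.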