GOM-Hadoop

Jiangtao Yin,Yong Liao,Mario Baldi,Lixin Gao,Antonio Nucci

GOM-Hadoop

2015

One of the most common datasets exploited by many corporations to conduct business intelligence analysis is event log files. Oftentimes, the records in event log files are temporally ordered, and need to be grouped by certain key with the temporal ordering preserved to facilitate further analysis. One such example is to group temporally ordered events by user ID in order to analyze user behavior. This kind of analytical workload, here referred to as RElative Order-pReserving based Grouping (Re-Org), is quite common in big data analytics, where the MapReduce programming paradigm (and its open-source implementation, Hadoop) is widely adopted for massive parallel processing. However, using MapReduce/Hadoop for executing Re-Org tasks on ordered datasets is not efficient due to its internal sort-merge mechanism when shuffling data from mappers to reducers. In this paper, we propose a distributed framework that adopts an efficient group-order-merge mechanism to speed up the execution of Re-Org tasks. We demonstrate the advantage of our framework by formally modeling its execution process and by comparing its performance with Hadoop through extensive experiments on real-world datasets. The evaluation results show that our framework can achieve up to 6.3x speedup over Hadoop in executing Re-Org tasks. We generalize a class of big data analytics workload (Re-Org) on ordered datasets.We propose a novel distributed mechanism for efficiently executing Re-Org tasks.The proposed mechanism is implemented in a distributed framework by extending Hadoop.A model is presented to formally study the proposed framework.Experiments show that our framework is 6.3x faster than vanilla Hadoop.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations