Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction
2016
In this paper we compare different technologies that support distributed computing as a means to address
computationally complex tasks. We address the task of n-gram text extraction, which is computationally
demanding given a large amount of textual data to process. In order to deal with such complexity we have to
adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even
languages that can be used for parallelization. We implemented this task on three platforms: (1) MPJ Express,
(2) Apache Hadoop, and (3) Apache Spark. The experiments were run on two kinds of datasets composed of:
(A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets
and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express,
Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the
platform that is best suited for each kind of dataset with regard to its overall size and the granularity of the
input data.
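To make the underlying task concrete, here is a minimal single-machine sketch of word-level n-gram extraction with frequency counting; this is an illustrative assumption, not code from the paper, and the names (`extract_ngrams`, `n`) are hypothetical. It is this per-document counting step that the paper distributes across MPJ Express, Hadoop, and Spark.

```python
# Illustrative sketch (not from the paper): word-level n-gram extraction
# with frequency counting, the core per-document computation that the
# compared platforms parallelize over many files.
from collections import Counter


def extract_ngrams(text: str, n: int) -> Counter:
    """Return a frequency count of word-level n-grams in `text`."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


counts = extract_ngrams("to be or not to be", 2)
print(counts["to be"])  # the bigram "to be" occurs twice
```

In a distributed setting, each worker would run this extraction over its share of the input files and the partial counters would be merged in a reduce step.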