Parallelization of Algorithms for Mining Data from Distributed Sources.

Ivan Kholod, Andrey Shorov, Maria Efimova, Sergei Gorlatch

2019

We propose an approach to optimizing data mining in modern applications that work on distributed data. We formally transform a high-level functional representation of a data-mining algorithm into a parallel implementation that performs as many computations as possible locally at the data sources, rather than accumulating all data for processing at a central location as in the traditional MapReduce approach. Our approach avoids the main disadvantages of state-of-the-art MapReduce frameworks in the context of distributed data: increased run time, high network traffic, and unauthorized access to data. We use a popular data-mining algorithm, Naive Bayes, to illustrate our approach and evaluate it experimentally. Our experiments confirm that the implementation of Naive Bayes developed using our approach significantly outperforms the traditional MapReduce-based implementation in both run time and network traffic.
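The idea of computing locally at each data source can be illustrated with a minimal Python sketch. It is not the authors' implementation or their formal transformation. It assumes categorical features and a Naive Bayes model built from counts, so each source only needs to send small count tables rather than its raw records. The function names (`local_counts`, `merge_counts`, `predict`) and the toy data are hypothetical.

```python
from collections import Counter
import math


def local_counts(rows):
    """Runs at one data source. rows: list of (feature_tuple, label)."""
    class_counts = Counter()
    feature_counts = Counter()  # key: (label, feature_index, value)
    for features, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(label, i, value)] += 1
    return class_counts, feature_counts


def merge_counts(partials):
    """Runs at the coordinator. Only count tables cross the network."""
    total_class, total_feature = Counter(), Counter()
    for class_counts, feature_counts in partials:
        total_class.update(class_counts)      # Counter.update adds counts
        total_feature.update(feature_counts)
    return total_class, total_feature


def predict(model, features, alpha=1.0):
    """Classify one record using Laplace-smoothed Naive Bayes (assumed smoothing)."""
    class_counts, feature_counts = model
    n = sum(class_counts.values())
    # Distinct values seen per feature index, needed for smoothing.
    values = {}
    for (_, i, v) in feature_counts:
        values.setdefault(i, set()).add(v)
    best_label, best_score = None, -math.inf
    for label, c in class_counts.items():
        score = math.log(c / n)
        for i, v in enumerate(features):
            k = len(values.get(i, ())) or 1
            score += math.log((feature_counts[(label, i, v)] + alpha) / (c + alpha * k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data held at two separate sources (hypothetical).
source_a = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]
source_b = [(("sunny", "mild"), "yes"), (("rainy", "hot"), "no"),
            (("rainy", "mild"), "yes")]

model = merge_counts([local_counts(source_a), local_counts(source_b)])
label = predict(model, ("rainy", "mild"))
```

Because the counts are additive, merging the per-source tables gives the same model as counting over the pooled data, which is what makes the local-first split possible for this algorithm.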
