A large-scale filter method for feature selection based on spark

Reine Marie Ndéla Marone,Fodé Camara,Samba Ndiaye

A large-scale filter method for feature selection based on spark

2017

Recently, enormous volumes of data are generated in information systems. That's why data mining area is facing new challenges of transforming this “big data” into useful knowledge. In fact, “big data” relies low density of information (low data quality) and data redundancy, which negatively affect the data mining process. Therefore, when the number of variables describing the data is high, features selection methods are crucial for selecting relevant data. Features selection is the process of identifying the most relevant variables and removing those are redundant and irrelevant. In this paper, we propose a parallel, scalable feature selection algorithm based on mRMR (Max-Relevance and Min-Redundancy) in Spark, an in-memory parallel computing framework specialized in computation for large distributed datasets. Our experiments using real-world data of high dimensionality demonstrated that our proposition scale well and efficiently with large datasets.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations