A large-scale filter method for feature selection based on Spark

2017 
Enormous volumes of data are now generated in information systems, so the data mining field faces new challenges in transforming this “big data” into useful knowledge. In fact, “big data” suffers from low information density (low data quality) and data redundancy, both of which negatively affect the data mining process. Therefore, when the number of variables describing the data is high, feature selection methods are crucial for selecting relevant data. Feature selection is the process of identifying the most relevant variables and removing those that are redundant or irrelevant. In this paper, we propose a parallel, scalable feature selection algorithm based on mRMR (Max-Relevance and Min-Redundancy) in Spark, an in-memory parallel computing framework specialized in computation for large distributed datasets. Our experiments using real-world data of high dimensionality demonstrated that our proposal scales well and efficiently with large datasets.
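To make the mRMR criterion concrete, the following is a minimal single-machine sketch of greedy mRMR selection over discrete features. It is not the Spark implementation proposed in the paper. It assumes the common "mutual information difference" form of the criterion (relevance to the class minus mean redundancy with already-selected features), and the names `mutual_information` and `mrmr` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # I(X;Y) = sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(columns, labels, k):
    """Greedily pick up to k feature names from `columns` (name -> values).

    Assumed score: I(feature; labels) minus the mean mutual information
    between the feature and the features selected so far.
    Ties are broken by feature name so the result is deterministic.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in columns.items()}
    redundancy_sum = {name: 0.0 for name in columns}
    candidates = set(columns)
    selected = []

    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            return relevance[name] - redundancy_sum[name] / len(selected)

        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
        # Update running redundancy totals against the newly selected feature.
        for name in candidates:
            redundancy_sum[name] += mutual_information(columns[name],
                                                       columns[best])
    return selected


# Toy data: "b" duplicates "a", while "c" carries complementary information.
a = [0, 0, 1, 1, 0, 0, 1, 1]
c = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [2 * x + y for x, y in zip(a, c)]
chosen = mrmr({"a": a, "b": list(a), "c": c}, labels, k=2)
```

In this toy example all three features are equally relevant to the labels, but after `a` is selected, the duplicate `b` is penalized for redundancy and `c` is chosen instead. Each feature's relevance and redundancy term is computed independently of the other candidates, which is the kind of per-feature work a distributed framework could plausibly spread across workers; how the paper actually partitions this computation is not stated in the abstract.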