FCFilter: Feature selection based on clustering and genetic algorithms

2016 
The search for patterns in large amounts of textual data, or text mining, can be both rewarding and challenging. The patterns can reveal tendencies, similarities and predictions, but the information is usually implicit and difficult to validate. Classification is one of the most relevant research areas in text mining. It usually consists of predicting the class of a textual document, such as its author or topic, based on a set of documents previously organized into classes. Choosing the words that compose the feature set is crucial to proper classification. A well-selected feature set can improve the performance of the classification method and make the classification model fitted to the data easier to interpret. This paper introduces the Feature Cluster Filter (FCFilter) method for feature selection. The method selects features that are good predictors for text classification by clustering the features and keeping only the suitable clusters. FCFilter eliminates the need to input or optimize the number of clusters by grouping the words into a sufficiently large number of clusters. Genetic algorithms are then applied to optimize the combination of clusters that provides the final feature set. Experiments evaluating FCFilter on the Reuters-21578, SCY-Genes and SCY-Clusters datasets showed a significant reduction in the dimensionality of the feature-value table, with slight improvements in classification accuracy compared to the baselines. The results are promising and indicate potential improvements in research on feature selection for text mining.
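
The cluster-then-select idea can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the clustering algorithm, the classifier, or the genetic operators, so everything below is an assumption made for illustration: the word clusters are taken as given, fitness is the leave-one-out accuracy of a nearest-centroid classifier minus a small per-feature penalty, and the genetic algorithm uses elitism, one-point crossover and bit-flip mutation. All function names, parameters and the toy data are hypothetical.

```python
import random

def expand(mask, clusters):
    """Return the sorted feature indices belonging to the selected clusters."""
    return sorted(f for bit, members in zip(mask, clusters) if bit for f in members)

def accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier (an assumed stand-in)."""
    if not features:
        return 0.0
    correct = 0
    for i, (row, label) in enumerate(zip(X, y)):
        sums, counts = {}, {}
        for j, (other, lab) in enumerate(zip(X, y)):
            if j == i:
                continue  # leave the current document out of the centroids
            acc = sums.setdefault(lab, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += other[f]
            counts[lab] = counts.get(lab, 0) + 1

        def dist(lab):
            return sum((row[f] - sums[lab][k] / counts[lab]) ** 2
                       for k, f in enumerate(features))

        if min(sorted(sums), key=dist) == label:
            correct += 1
    return correct / len(X)

def fitness(mask, X, y, clusters, penalty=0.01):
    """Accuracy minus an assumed penalty per selected feature (favours small sets)."""
    feats = expand(mask, clusters)
    return accuracy(X, y, feats) - penalty * len(feats)

def select_clusters(X, y, clusters, pop_size=20, generations=30,
                    mutation_rate=0.1, seed=0):
    """Search for a cluster bitmask with a simple genetic algorithm.

    Assumes at least two clusters and pop_size >= 4.
    """
    rng = random.Random(seed)
    n = len(clusters)
    # Start from the "keep every cluster" baseline plus random masks.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, X, y, clusters), reverse=True)
        survivors = pop[: pop_size // 2]          # elitism: keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randint(1, n - 1)           # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]            # bit-flip mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda m: fitness(m, X, y, clusters))

# Toy document-term counts: features 0-1 separate the classes, 2-5 are noise.
X = [[3, 2, 1, 0, 1, 1], [2, 3, 0, 1, 1, 1], [3, 3, 1, 1, 1, 1],
     [0, 0, 1, 0, 1, 1], [0, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 1]]
y = ["a", "a", "a", "b", "b", "b"]
clusters = [[0, 1], [2, 3], [4, 5]]  # hypothetical word clusters, given as input

best = select_clusters(X, y, clusters)
selected_features = expand(best, clusters)
```

Because the all-clusters mask is placed in the initial population and the better half always survives, the returned mask is never worse, under this fitness, than keeping every feature. In practice the clusters would come from a clustering step over the word features and the fitness would be computed with the classifier of interest.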