Selecting Features for Breast Cancer Analysis and Prediction

2020 
Breast cancer (BC) is the second most common cancer in women after skin cancer and has become a major health issue. It is therefore very important to diagnose BC correctly and to categorize tumors as malignant or benign. Machine Learning (ML) techniques offer unique advantages, which is why they are widely used to analyze complex BC datasets and predict the disease. Researchers in this field have used the Wisconsin Diagnostic Breast Cancer (WDBC) dataset to develop predictive models for BC. The dataset has 573 instances and 32 features. In this paper, we propose a method for analyzing and predicting BC on the same dataset using Apache Spark, a big data framework that is a powerful tool for working with large volumes of data, such as healthcare data [4]. Principal Component Analysis (PCA) is applied to the dataset to select the most important features, and we run experiments with the top 6 and top 10 features. The experiments are executed on a Hadoop cluster, a cloud platform provided by the Electrical Engineering and Computer Science (EECS) department of the University of Cincinnati. We also compare the performance of two machine learning techniques: the Decision Tree and Random Forest classifiers. The performance of the Decision Tree with the top 10 features serves as the benchmark in our work. The Random Forest classifier performs better than the Decision Tree with the top 6 as well as the top 10 features, and achieves 97.52% accuracy using the top 10 features. Our results show that selecting the right features significantly improves accuracy in predicting BC.
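The abstract does not state how PCA output is turned into a feature ranking, so the following is only a minimal sketch under one common assumption: features are ranked by the absolute size of their loading on the first principal component. It is written in plain Python rather than Apache Spark, it does not use the WDBC data, and the function names, the toy data, and the feature labels ("radius", "perimeter", "noise") are all hypothetical. The classification step with Decision Tree and Random Forest is not shown.

```python
import math
import random


def standardize(rows):
    """Scale every column to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / n for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def top_k_features(rows, names, k):
    """Assumed ranking rule: largest absolute loading on the first component."""
    loadings = first_component(covariance(standardize(rows)))
    ranked = sorted(range(len(names)), key=lambda j: abs(loadings[j]), reverse=True)
    return [names[j] for j in ranked[:k]]


# Hypothetical toy data: two strongly correlated columns and one noise column.
random.seed(0)
rows = []
for _ in range(200):
    size = random.gauss(0, 1)
    rows.append([
        size + random.gauss(0, 0.1),      # "radius"
        2 * size + random.gauss(0, 0.1),  # "perimeter"
        random.gauss(0, 1),               # "noise"
    ])
names = ["radius", "perimeter", "noise"]
selected = top_k_features(rows, names, 2)
print(selected)
```

On this toy data the two correlated columns share most of the variance, so they carry the largest first-component loadings and are the ones kept. A fuller treatment would weight loadings across several components by explained variance; which variant the authors used is not stated here.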