EFFICIENT DATA MINING TECHNIQUES FOR MEDICAL DATA.

J. S. Saleema

2018

Making sound health decisions is a challenge in an era of abundant information. Data mining, machine learning and computational statistics are the leading fields of study that help individuals make well-informed decisions and optimize outcomes in any working domain. Demand for data handling is high in healthcare, as the number of patients rises with population growth and lifestyle changes. Techniques for early diagnosis and prognosis prediction of diseases are urgently needed to provide better treatment. Data mining techniques are well suited to building accurate and efficient models for health prediction applications. As cancer cases have increased sharply in recent years, data sets from cancer registries serve as the medical data in this research. The main aim of the thesis is to build a constructive and efficient classifier model for cancer prognosis prediction. Most existing systems develop diagnosis prediction models from screening or survey data, because such data sets are widely available and easy to collect, owing to the non-sensitive nature of the factors involved. Prognosis prediction, in contrast, requires sensitive details of patients who are under treatment for a diagnosed disease. Hospitals and community registries maintained by the government are the main sources for data collection. Well-maintained electronic hospital records with histopathology information are not publicly available to researchers in India. Hence cancer data from a US-based open-access data center has been used for all experimentation in this research. This research work is a progressive model that gradually improves prediction accuracy by selecting appropriate data mining techniques in each phase.
Prognosis generally refers to the survival prospects of a cancer patient, but it also covers the severity of the disease over the patient's future timeline. The twofold objective of this research is to identify the prominent response variables that support a prediction system for measuring prognosis, and to improve the prediction models. In the first phase, intensive pre-processing of the raw data was planned and executed. In the second phase, three base classifiers from data mining were used to identify and rank the prominent class labels. Next, a ratio-based balanced stratified sampling technique was proposed and evaluated with the top four prominent labels (stage, age, multiple primaries and survival) identified in the preceding phase. In the fourth phase, the combined effect of the prominent labels was tested with a multi-label classifier approach. Finally, the ensemble classifier model HEEP was proposed and empirically tested for better model accuracy. Overall, the proposed and experimented models show an average improvement of 2% to 6%. The detailed outcomes of all five phases are presented in the respective chapters of this thesis. Libraries, objects and operators from MATLAB, RapidMiner and Java, together with MS Excel files, were used to implement the research in its different phases, depending on the possibilities and constraints. The outcome of this research is an opportunity to support non-commercial researchers in the health sector for overall health welfare. The NNPCA and KPCA models, which were tried for dimensionality reduction on UCI public data, have not been tested on the SEER cancer data sets because of the value constraints of the SEER parameters. The multi-label and HEEP models were applied only to the limited data sets described in the respective chapters, because of system constraints on large data set experiments.
Finding alternative solutions for these limitations and automating the entire process flow of all five phases will be the main focus of future research.
