Prediction of diabetes using logistic regression and ensemble techniques

2021 
Abstract

Background: Logistic regression is a classification model in machine learning that is used extensively in clinical analysis. It uses probabilistic estimates, which help in understanding the relationship between the dependent variable and one or more independent variables. Diabetes is one of the most common diseases around the world; detecting it early may prevent its progression and avoid other complications. In this work, we design a prediction model that predicts whether a patient has diabetes, based on certain diagnostic measurements included in the dataset, and explore various techniques to boost its performance and accuracy.

Methods: Logistic regression is the main algorithm used in this paper, and the analysis is carried out in a Python IDE. The experiment mainly uses two datasets: the PIMA Indians Diabetes dataset, which is originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and a dataset from Vanderbilt, which is based on a study of rural African Americans in Virginia. Feature selection is carried out using two different methods. Ensemble methods are then applied; these improve performance by producing better predictions than a single model.

Results: The accuracy and runtimes are captured for the original datasets and for the datasets obtained after applying feature selection and ensemble techniques, and a comparison is shown in each case. The highest accuracy obtained was around 78% for Dataset 1, after employing the ensemble technique Max Voting, and around 93% for Dataset 2, after using the ensemble techniques Max Voting and Stacking.

Conclusion: Logistic regression has been shown to be an efficient algorithm for building prediction models. This study also shows that, apart from the choice of algorithm, other factors can improve the accuracy and runtime of the model, such as data preprocessing, removal of redundant and null values, normalization, cross-validation, feature selection, and the use of ensemble techniques.
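
The two ideas the abstract combines, a logistic regression prediction and Max Voting over several models, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the coefficients, and the two-feature patient record are all hypothetical and chosen only for illustration, and no fitting, feature selection, or stacking is shown.

```python
import math
from collections import Counter


def predict_proba(weights, bias, features):
    """Logistic regression: P(y = 1 | x) = 1 / (1 + exp(-(w . x + b)))."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    """Turn the estimated probability into a 0/1 class label."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


def max_vote(labels):
    """Max Voting: return the class label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical (weights, bias) pairs standing in for three fitted models.
models = [
    ([0.8, 0.3], -1.0),
    ([0.5, 0.6], -0.9),
    ([-0.2, 0.1], 0.0),
]

# Hypothetical patient record with two already-normalized measurements.
patient = [1.5, 0.5]

votes = [predict_label(w, b, patient) for w, b in models]
ensemble_prediction = max_vote(votes)
print(votes, ensemble_prediction)
```

In practice the base models would be fitted on training data and would typically differ in algorithm or feature subset; the sketch only shows how individual class predictions are combined by majority vote.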