Simple Classification Model for Diabetes Prediction Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

For more information on this case study project, please consult Dr. Brownlee’s blog post at https://machinelearningmastery.com/standard-machine-learning-datasets/.

Dataset Used: Pima Indians Diabetes Database

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/uciml/pima-indians-diabetes-database

For more information on performance benchmarks, please consult: https://www.kaggle.com/uciml/pima-indians-diabetes-database

INTRODUCTION: The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details. It is a binary (2-class) classification problem. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values.
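Because zeros may stand in for missing values, one common first step is to recode them as NA before modeling. The following is a minimal sketch in base R, not the project's actual preprocessing. The column names follow the Kaggle CSV header, the three rows are made-up illustration values, and the choice of which columns treat zero as missing is an assumption (a zero in Pregnancies, for example, is a legitimate value).

```r
# Toy rows using the Kaggle column names; the values are invented for illustration.
pima <- data.frame(
  Pregnancies   = c(6, 1, 0),
  Glucose       = c(148, 0, 137),
  BloodPressure = c(72, 66, 0),
  SkinThickness = c(35, 29, 0),
  Insulin       = c(0, 0, 168),
  BMI           = c(33.6, 0, 43.1),
  Outcome       = c(1, 0, 1)
)

# Assumption: a zero in these columns is physiologically implausible,
# so it is treated as a missing measurement rather than a real reading.
zero_as_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

pima_clean <- pima
for (col in zero_as_missing) {
  pima_clean[[col]][pima_clean[[col]] == 0] <- NA
}

# Number of missing values per column after recoding.
missing_counts <- colSums(is.na(pima_clean))
print(missing_counts)
```

Whether to impute, drop, or keep the recoded values afterwards depends on the algorithm; the sketch only makes the missingness explicit so that choice can be made deliberately.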

CONCLUSION: The baseline performance of predicting the class variable achieved an average accuracy of 75.85%. The top accuracy result, achieved with Logistic Regression after a series of tuning trials, was 77.73%. In this case, the ensemble algorithms did not improve on the non-ensemble algorithms by enough to justify the additional processing required.
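For readers unfamiliar with how an average accuracy for a logistic regression model is typically estimated, here is a minimal base-R sketch using k-fold cross-validation. It is an illustration only: the data are synthetic, and the fold count, seed, 0.5 decision threshold, and variable names (toy, x1, x2, y) are all assumptions rather than details of this project, so the number it prints is unrelated to the figures reported above.

```r
set.seed(7)  # arbitrary seed so the toy example is reproducible

# Synthetic stand-in data: two numeric predictors and a binary outcome.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1 - 1.0 * toy$x2))

# Assign each row to one of k folds at random.
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, predict the held-out fold,
# and record the share of correct class predictions.
fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- toy[folds != i, ]
  test  <- toy[folds == i, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")
  pred  <- as.integer(prob > 0.5)
  mean(pred == test$y)
})

# The cross-validated estimate is the mean of the per-fold accuracies.
mean_accuracy <- mean(fold_accuracy)
print(round(mean_accuracy, 4))
```

Running the same loop over several algorithms and comparing the means is one way to judge whether a more expensive model earns its extra processing time.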

The HTML-formatted report is available on GitHub.