Binary Classification Model for Diabetes Readmission Prediction Using Python Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Diabetes Readmission Prediction dataset is originally a multi-class classification situation where we are trying to predict one of several possible outcomes; in this iteration, the target is reduced to two classes.

INTRODUCTION: Management of hyperglycemia in hospitalized patients has a significant bearing on the outcome, in terms of both morbidity and mortality. However, there are few national assessments of diabetes care during hospitalization which could serve as a baseline for change. This analysis of a large clinical database was undertaken to provide such an assessment and to find future directions which might lead to improvements in patient safety. The statistical model suggests that the relationship between the probability of readmission and the HbA1c measurement depends on the primary diagnosis. The data suggest further that the greater attention to diabetes reflected in HbA1c determination may improve patient outcomes and lower cost of inpatient care.

In iteration Take1, we established the baseline prediction accuracy for further takes of modeling. To limit the processing time and memory requirements, we also excluded the attributes that do not appear in the final model of the research paper.

In iteration Take2, we further tested the machine learning models by rearranging some of the features to be more consistent with the research paper (Table 4). We had hoped to improve the overall accuracy and applicability of the model by using features with fewer categories.

In this iteration, we will test the machine learning models by reconfiguring the target variable to have only two categories, thus making this a binary classification exercise. We hope to improve the overall accuracy and applicability of the model by predicting with just the “yes” and “no” outcomes.
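The target reconfiguration described above can be sketched as a simple label mapping. This is a minimal illustration, not the project's actual code: it assumes the target column holds the three labels "NO", "<30", and ">30", and that both readmission windows are merged into the "yes" class. The names `READMIT_MAP` and `to_binary` are hypothetical.

```python
# Assumed label set for the original three-class target (not confirmed by the write-up).
# Both readmission windows are folded into the positive class.
READMIT_MAP = {"NO": 0, "<30": 1, ">30": 1}


def to_binary(label):
    """Map an original readmission label to 0 (no) or 1 (yes)."""
    key = label.strip().upper()
    if key not in READMIT_MAP:
        # Fail loudly on unexpected labels instead of silently misclassifying them
        raise ValueError(f"unexpected readmission label: {label!r}")
    return READMIT_MAP[key]


original = ["NO", "<30", ">30", "no "]
binary = [to_binary(value) for value in original]
```

With pandas, the same function could be applied to a column through `Series.map(to_binary)`. If only early readmissions should count as "yes", the mapping for ">30" would change to 0.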

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 58.93%. Two algorithms (Logistic Regression and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an accuracy metric of 63.14%. Using the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 62.85%, which was just slightly below the prediction accuracy from the training data.
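The workflow summarized above (a baseline comparison, tuning of the leading algorithm, then a single pass over the testing dataset) might look roughly like the following scikit-learn sketch. It runs on synthetic data as a stand-in for the prepared diabetes dataset. The parameter grid, fold count, split ratio, and random seed are illustrative assumptions rather than the settings used in the project, so the numbers it produces will not match the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the prepared data (assumption, not the real dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y
)

# Round 1: baseline accuracy of each candidate under the same cross-validation
models = {
    "LR": LogisticRegression(max_iter=1000),
    "GBM": GradientBoostingClassifier(random_state=7),
}
baseline = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Round 2: tune Gradient Boosting over a small, illustrative grid
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=7),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# Final check: score the tuned model once on the held-out test split
test_accuracy = grid.score(X_test, y_test)
print(baseline)
print(grid.best_params_, grid.best_score_, test_accuracy)
```

Comparing `grid.best_score_` with `test_accuracy` gives the same kind of training-versus-testing check described above; a small gap suggests the tuned model is not badly overfitted.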

CONCLUSION: Restructuring the target variable to binary outcomes yielded an accuracy improvement and a reduction in processing time. For this iteration, the Gradient Boosting algorithm achieved the best overall results on both the training and testing datasets. For this dataset, Gradient Boosting should be considered for further modeling or production use.

Dataset Used: Diabetes 130-US hospitals for years 1999-2008 Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference:

One source of potential performance benchmarks:

The HTML formatted report can be found here on GitHub.