Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Diabetes Readmission Prediction is a multi-class classification problem where we are trying to predict one of several possible outcomes.
INTRODUCTION: Management of hyperglycemia in hospitalized patients has a significant bearing on the outcome, in terms of both morbidity and mortality. However, there are few national assessments of diabetes care during hospitalization which could serve as a baseline for change. This analysis of a large clinical database was undertaken to provide such an assessment and to find future directions which might lead to improvements in patient safety. The statistical model suggests that the relationship between the probability of readmission and the HbA1c measurement depends on the primary diagnosis. The data suggest further that the greater attention to diabetes reflected in HbA1c determination may improve patient outcomes and lower cost of inpatient care.
In iteration Take1, we established the baseline prediction accuracy for further takes of modeling. To limit the processing time and memory requirements, we also excluded attributes that do not appear in the final model of the research paper.
In this iteration, we further test the machine learning models by restructuring some of the categorical features to be more consistent with the research paper (Table 4). We hope to improve the overall accuracy and applicability of the model by using features with fewer categories.
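As a rough illustration of what collapsing a high-cardinality feature can look like, the sketch below maps raw ICD-9 diagnosis codes to a handful of coarse groups. The helper name `group_diagnosis`, the group labels, the use of standard ICD-9 chapter ranges, and the treatment of "?" as a missing value are all assumptions made for this example; the project's actual regrouping follows the research paper and is not reproduced here.

```python
def group_diagnosis(code):
    """Map a raw ICD-9 code string to a coarse group (hypothetical helper)."""
    # Assumption: missing values are encoded as None, "" or "?"
    if code is None or code.strip() in ("", "?"):
        return "Missing"
    code = code.strip()
    # Supplementary V and E codes are not numeric; fold them into "Other"
    if code[0] in ("V", "E"):
        return "Other"
    try:
        chapter = int(float(code))  # e.g. "250.83" -> 250
    except ValueError:
        return "Other"
    # Illustrative groups only, based on ICD-9 chapter ranges
    if chapter == 250:
        return "Diabetes"
    if 390 <= chapter <= 459:
        return "Circulatory"
    if 460 <= chapter <= 519:
        return "Respiratory"
    if 520 <= chapter <= 579:
        return "Digestive"
    return "Other"


raw_codes = ["250.83", "428", "486", "V57", "?"]
print([group_diagnosis(c) for c in raw_codes])
# ['Diabetes', 'Circulatory', 'Respiratory', 'Other', 'Missing']
```

Applied to a diagnosis column, a mapping of this kind replaces hundreds of distinct codes with a few categories, which keeps the one-hot encoded feature matrix small.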
ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 56.32%. Two algorithms (Linear Discriminant Analysis and Gradient Boosting) achieved the highest accuracy after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result with an accuracy of 58.07%. With the optimized parameters, Gradient Boosting achieved an accuracy of 57.04% on the testing dataset, slightly below the 58.07% obtained on the training data.
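The three stages described above (baseline comparison, tuning, and a final check on held-out data) can be sketched with scikit-learn as follows. This is a minimal sketch, not the project's actual script: the synthetic three-class data stands in for the prepared readmission dataset, and the seed, fold count, split ratio, and parameter grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

SEED = 7  # assumed seed, for reproducibility only

# Synthetic three-class data standing in for the prepared dataset (assumption)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Stage 1: baseline accuracy of each candidate algorithm via cross-validation
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "GBM": GradientBoostingClassifier(random_state=SEED),
}
baseline = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Stage 2: tune Gradient Boosting over a small, illustrative parameter grid
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=SEED),
    param_grid={"n_estimators": [25, 50], "max_depth": [2, 3]},
    cv=cv,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# Stage 3: score the tuned model once on the held-out testing data
test_accuracy = accuracy_score(y_test, grid.predict(X_test))

print("Baseline CV accuracy:", baseline)
print("Best parameters:", grid.best_params_)
print("Tuned CV accuracy:", grid.best_score_)
print("Test accuracy:", test_accuracy)
```

A test accuracy slightly below the tuned cross-validation score, as reported in the analysis, is the usual pattern, because the parameters were selected to maximize the cross-validation score.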
CONCLUSION: Restructuring the categorical variables did not improve accuracy or processing time. For this iteration, the Gradient Boosting algorithm achieved the top-tier training and validation results. For this dataset, Gradient Boosting should be considered for further modeling or production use.
Dataset Used: Diabetes 130-US hospitals for years 1999-2008 Data Set
Dataset ML Model: Multi-Class classification with numerical and categorical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
One source of potential performance benchmarks: http://www.hindawi.com/journals/bmri/2014/781670/
The HTML-formatted report is available on GitHub.