Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Heart Disease dataset is a binary classification problem where we are trying to predict one of two possible outcomes.
INTRODUCTION: The original database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by machine learning researchers to date. The “num” field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
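As a minimal sketch of the presence/absence mapping described above, the snippet below collapses the 0-4 “num” value into a binary target. The function name `binarize_num` and the sample values are hypothetical and used only for illustration; they are not taken from the project code.

```python
def binarize_num(num):
    """Map the original 0-4 'num' value to 0 (absence) or 1 (presence)."""
    if num not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected num value: {num!r}")
    return 0 if num == 0 else 1


# Illustrative values only, not actual records from the dataset.
raw_targets = [0, 2, 1, 0, 4, 3]
binary_targets = [binarize_num(n) for n in raw_targets]
print(binary_targets)  # [0, 1, 1, 0, 1, 1]
```

Rejecting values outside 0-4 is a design choice for this sketch: it surfaces unexpected codes early rather than silently treating them as "presence".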
In iteration Take1, we examined the Cleveland dataset and created a Logistic Regression model to fit the data.
In iteration Take2, we examined the Hungarian dataset and created a Logistic Regression model to fit the data.
In iteration Take3, we examined the Switzerland dataset and created an Extra Trees model to fit the data.
In iteration Take4, we examined the Long Beach VA dataset and created an Extra Trees model to fit the data.
In this iteration, we will combine all four datasets and create a machine learning model to fit the data.
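One way the four datasets could be combined is sketched below. It assumes each source is a comma-separated file with the 14 commonly used attributes in the order listed in `COLUMNS`, and that missing values are marked with `?`; the helper `parse_rows`, the added `source` field, and the sample rows are hypothetical, and in-memory strings stand in for the real files so the example runs on its own.

```python
import csv
import io

# Assumed column order for the 14-attribute subset.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]


def parse_rows(text, source):
    """Parse comma-separated rows, turning '?' into None and tagging the source."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        row = {name: (None if value.strip() == "?" else float(value))
               for name, value in zip(COLUMNS, fields)}
        row["source"] = source
        rows.append(row)
    return rows


# Made-up rows for illustration; these are not actual patient records.
samples = {
    "cleveland": "50,1,2,120,200,0,0,160,0,1.0,1,0,3,0\n",
    "hungarian": "40,0,2,130,180,0,0,170,0,0,?,?,?,0\n",
    "switzerland": "55,1,4,110,0,?,0,120,1,0.5,2,?,?,2\n",
    "long_beach_va": "60,1,4,140,250,0,1,110,1,2.0,2,?,?,3\n",
}

combined = [row
            for name, text in samples.items()
            for row in parse_rows(text, name)]
print(len(combined))  # 4
```

Keeping a `source` field on every row makes it possible to check later whether one site dominates the errors or the missing values.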
ANALYSIS: In the baseline round, the machine learning algorithms achieved an average accuracy of 80.56%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result with an accuracy of 82.84% on the training data. Using the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 77.82%, about five percentage points below the accuracy obtained from the training data, possibly due to overfitting.
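The workflow described above (compare baseline algorithms, then check the chosen model on held-out data) can be sketched roughly as follows using scikit-learn. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, and the 75/25 split, 5-fold cross-validation, seed, and model settings are all assumed, so the printed accuracies will not match the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the combined dataset (13 predictors, binary target).
X, y = make_classification(n_samples=400, n_features=13, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y)

# Candidate algorithms for the baseline round (an assumed, partial list).
models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=7),
    "GBM": GradientBoostingClassifier(random_state=7),
}

# Mean cross-validated accuracy on the training portion only.
cv_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name}: {cv_accuracy[name]:.4f}")

# Refit Gradient Boosting on all training data and score the held-out set.
final_model = GradientBoostingClassifier(random_state=7).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
gap = cv_accuracy["GBM"] - test_accuracy
print(f"GBM test accuracy: {test_accuracy:.4f} (gap vs. CV: {gap:+.4f})")
```

A positive gap between the cross-validated and held-out accuracy is the kind of signal that suggests overfitting, although with a small test set some of the difference may be sampling noise.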
CONCLUSION: For the combined dataset, the Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, Gradient Boosting could be considered for further modeling or production use.
Dataset Used: Heart Disease Data Set
Dataset ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
One potential source of performance benchmark: https://www.kaggle.com/ronitf/heart-disease-uci
The HTML formatted report can be found here on GitHub.