Binary Classification Model for Coronary Artery Disease Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Z-Alizadeh Sani CAD dataset presents a binary classification problem in which we try to predict one of two possible outcomes.

INTRODUCTION: The researchers collected the data for coronary artery disease (CAD) diagnosis. Each patient falls into one of two categories: CAD or Normal. A patient is categorized as CAD if the diameter narrowing is greater than or equal to 50%, and as Normal otherwise. The Z-Alizadeh Sani dataset contains the records of 303 patients, each with 59 features. The features belong to one of four groups: demographic, symptom and examination, ECG, and laboratory and echo features. In this extension, the researchers added three features for the LAD, LCX, and RCA arteries. CAD is true when at least one of these three arteries is stenotic. To use this dataset properly for CAD classification, only one of LAD, LCX, RCA, or Cath (the result of angiography) can be present in the dataset as the target. The dataset can be used not only for CAD detection but also for diagnosing stenosis in each of the LAD, LCX, and RCA arteries.
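
The labeling rule described above can be sketched in base R. This is a minimal illustration, not the project's actual preprocessing: the toy data frame, the `Age` column, and the value labels "Stenotic", "Normal", and "CAD" are assumptions made for the example, and the real file may encode these fields differently.

```r
# Toy stand-in for the extension dataset (values and labels are assumed).
toy <- data.frame(
  Age = c(53, 67, 54, 66),
  LAD = c("Stenotic", "Normal", "Normal", "Normal"),
  LCX = c("Normal", "Normal", "Normal", "Stenotic"),
  RCA = c("Normal", "Normal", "Normal", "Normal"),
  stringsAsFactors = FALSE
)

artery_cols <- c("LAD", "LCX", "RCA")

# CAD is true when at least one of the three arteries is stenotic.
any_stenotic <- rowSums(toy[, artery_cols] == "Stenotic") > 0
toy$Cath <- factor(ifelse(any_stenotic, "CAD", "Normal"),
                   levels = c("CAD", "Normal"))

# Keep exactly one target column: drop the per-artery columns so they
# cannot leak the answer into a CAD classifier.
model_data <- toy[, setdiff(names(toy), artery_cols)]
print(model_data)
```

Dropping the artery columns matters because Cath is fully determined by them; leaving them in would let any model reproduce the label without learning from the clinical features.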

In this iteration, we plan to establish the baseline prediction accuracy for future takes of modeling.

ANALYSIS: In the baseline round, the machine learning algorithms achieved an average accuracy of 83.07%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result with an accuracy of 89.19%. Using the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 77.78%, about 11 percentage points below the accuracy obtained on the training data, possibly because of over-fitting.
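
As a small worked example of the comparison above, the sketch below computes accuracy from predicted and actual labels and then the gap between the two reported figures. The `accuracy` helper and the toy label vectors are hypothetical illustrations; only the 89.19% and 77.78% values come from this write-up.

```r
# Hypothetical helper: share of predictions that match the actual labels.
accuracy <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(actual == predicted)
}

# Toy labels, invented purely to show the calculation.
actual    <- c("CAD", "CAD", "Normal", "CAD", "Normal")
predicted <- c("CAD", "Normal", "Normal", "CAD", "Normal")
toy_acc <- accuracy(actual, predicted)

# Gap between the tuned training accuracy and the test accuracy
# reported above, in percentage points.
train_acc_pct <- 89.19
test_acc_pct  <- 77.78
gap_pct <- train_acc_pct - test_acc_pct

cat(sprintf("Toy accuracy: %.2f\n", toy_acc))
cat(sprintf("Train/test gap: %.2f percentage points\n", gap_pct))
```

A gap of this size on a dataset of only 303 records is consistent with over-fitting, but it could also reflect the variance of a small test split, so it is worth re-checking with repeated resampling in later takes.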

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall training and validation results. For this dataset, the Gradient Boosting algorithm could be considered for further modeling.

Dataset Used: Z-Alizadeh Sani Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference:

The HTML formatted report can be found here on GitHub.