Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Egyptian HCV Patients dataset presents a multi-class classification problem in which we try to predict one of several (more than two) possible outcomes.
INTRODUCTION: The dataset covers Egyptian patients who received treatment for HCV for about 18 months. The goal is to predict the stage of each patient’s condition based on the available measurements.
In iteration Take1, we constructed machine learning models using some simple and straightforward data preparation steps. The final model from that iteration serves as the baseline for future modeling iterations.
In this iteration, we will perform feature engineering by applying “binning” or “discretization” of the attributes as described in the research paper. We will examine how the discretization technique affects modeling.
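As a rough illustration of what binning means in practice, the sketch below maps each numeric value to the index of the interval it falls into. The cut points in `EXAMPLE_EDGES` and the helper names `discretize` and `discretize_column` are hypothetical and chosen only for illustration; they are not the thresholds or the code from the research paper.

```python
from bisect import bisect_right

# Hypothetical cut points for illustration only; these are NOT the
# thresholds described in the research paper.
EXAMPLE_EDGES = [10.0, 20.0, 30.0]


def discretize(value, edges):
    """Map a numeric value to a 0-based bin index.

    `edges` must be sorted in ascending order. Values below edges[0]
    fall into bin 0, and a value equal to an edge falls into the bin
    that starts at that edge.
    """
    return bisect_right(edges, value)


def discretize_column(values, edges):
    """Apply `discretize` to every value in a column of measurements."""
    return [discretize(v, edges) for v in values]


# Four sample measurements, one landing in each of the four bins.
binned = discretize_column([5.0, 10.0, 25.0, 40.0], EXAMPLE_EDGES)
print(binned)
```

With three cut points there are four bins, so each continuous attribute is replaced by a small integer code that the algorithms then treat as the feature value.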
ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 25.85%. Two algorithms (k-Nearest Neighbors and Random Forest) achieved the highest accuracy after the first round of modeling. After a series of tuning trials, k-Nearest Neighbors turned in the top overall result with an accuracy of 29.48%. Using the optimized parameters, the k-Nearest Neighbors algorithm processed the testing dataset with an accuracy of 24.78%, which was noticeably lower than the accuracy on the training data.
In this iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 25.95%. Two algorithms (Linear Discriminant Analysis and Extra Trees) achieved the highest accuracy after the first round of modeling. After a series of tuning trials, Extra Trees showed the lower variance of the two algorithms and achieved an accuracy of 26.78%. Using the optimized parameters, the Extra Trees algorithm processed the testing dataset with an accuracy of 25.93%, which was slightly lower than the accuracy on the training data.
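To put accuracy figures like these in context, it can help to compare them against a trivial baseline that always predicts the most frequent class. The sketch below is a minimal illustration of that comparison. The labels are made up, and the helper names `accuracy` and `majority_baseline_accuracy` are hypothetical; nothing here comes from the actual dataset or the project's code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy on y_test when always predicting y_train's most common label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))


# Made-up labels for illustration only (four classes, roughly balanced).
y_train = [1, 2, 3, 4, 1, 2, 3, 4, 1]
y_test = [1, 2, 3, 4, 1, 2, 3, 4]
y_pred = [1, 2, 4, 3, 2, 1, 3, 3]  # hypothetical model predictions

model_acc = accuracy(y_test, y_pred)
baseline_acc = majority_baseline_accuracy(y_train, y_test)
print(f"model: {model_acc:.2%}, majority baseline: {baseline_acc:.2%}")
```

A model whose test accuracy sits close to such a baseline is learning little from the features, which is the kind of check that supports a judgment that the algorithms performed poorly.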
CONCLUSION: For this iteration, the Extra Trees algorithm achieved the best overall results using the training and test datasets, but all algorithms still performed poorly. For this dataset, we should consider collecting more data and perhaps more features before performing further modeling.
Dataset Used: Hepatitis C Virus (HCV) for Egyptian patients Data Set
Dataset ML Model: Multi-Class classification with numerical attributes
Dataset Reference: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/datasets/Hepatitis+C+Virus+%28HCV%29+for+Egyptian+patients
The HTML formatted report can be found here on GitHub.