Binary Classification Model for Parkinson’s Disease Using R, Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Parkinson’s Disease dataset is a binary classification situation where we are trying to predict one of two possible outcomes.

INTRODUCTION: The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 at the Department of Neurology in Cerrahpasa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82. During the data collection process, the microphone was set to 44.1 kHz, and, following the physician’s examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.

In the first iteration, the script focused on evaluating various machine learning algorithms and identifying the model that produced the best overall metrics. The first iteration established the performance baseline for accuracy and processing time.
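As a rough illustration of that spot-check step, the minimal sketch below compares a handful of caret models with repeated cross-validation. The data frame name df_train, the label column Class, the seed, and the specific list of algorithms are assumptions for illustration, not necessarily the exact setup of the original script.

```r
# Minimal sketch of the baseline algorithm spot-check (assumed names: df_train, Class)
library(caret)

set.seed(888)                                    # assumed seed
ctrl   <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "Accuracy"

# Train a few candidate algorithms with the same resampling scheme
fit_glm  <- train(Class ~ ., data = df_train, method = "glm",   metric = metric, trControl = ctrl)
fit_cart <- train(Class ~ ., data = df_train, method = "rpart", metric = metric, trControl = ctrl)
fit_knn  <- train(Class ~ ., data = df_train, method = "knn",   metric = metric, trControl = ctrl)
fit_rf   <- train(Class ~ ., data = df_train, method = "rf",    metric = metric, trControl = ctrl)
fit_gbm  <- train(Class ~ ., data = df_train, method = "gbm",   metric = metric, trControl = ctrl, verbose = FALSE)

# Compare the resampled accuracy distributions across the candidates
results <- resamples(list(GLM = fit_glm, CART = fit_cart, KNN = fit_knn, RF = fit_rf, GBM = fit_gbm))
summary(results)
dotplot(results)
```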

In iteration Take2, we examined the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting only the most important attributes, we decreased the processing time and maintained a similar level of prediction accuracy compared to the first iteration.
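A minimal sketch of that attribute-ranking step appears below, using caret’s varImp() on a fitted Gradient Boosting model. The object names x_train and y_train, the seed, and the zero-importance cutoff are assumptions for illustration; the actual Take2 script may rank and trim the attributes differently.

```r
# Minimal sketch of attribute importance ranking with Gradient Boosting
library(caret)
library(gbm)

set.seed(888)                                   # assumed seed
fit_ctrl  <- trainControl(method = "cv", number = 10)
gbm_model <- train(x = x_train, y = y_train,    # assumed predictor/label objects
                   method = "gbm", metric = "Accuracy",
                   trControl = fit_ctrl, verbose = FALSE)

# Rank the attributes by their relative influence in the boosted trees
ranking <- varImp(gbm_model)
plot(ranking, top = 25)

# Keep only attributes with a non-zero importance score (assumed cutoff)
scores    <- ranking$importance
keep_cols <- rownames(scores)[scores$Overall > 0]
x_reduced <- x_train[, keep_cols]
```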

In iteration Take3, we examined the feature selection technique of Recursive Feature Elimination (RFE) using the Random Forest algorithm. By selecting only the most relevant attributes, we hoped to decrease the processing time and maintain a similar level of prediction accuracy compared to the first iteration.
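The sketch below shows the general shape of that RFE step, using caret’s rfe() with the built-in random-forest helper set rfFuncs. The object names, candidate subset sizes, and seed are assumptions for illustration and not necessarily those used in the actual Take3 script.

```r
# Minimal sketch of Recursive Feature Elimination with Random Forest
library(caret)
library(randomForest)

set.seed(888)                                    # assumed seed
rfe_ctrl <- rfeControl(functions = rfFuncs,      # random-forest-based RFE helpers
                       method = "cv", number = 10)

# Evaluate a few candidate subset sizes (assumed values) out of the 753 attributes
rfe_result <- rfe(x = x_train, y = y_train,      # assumed predictor/label objects
                  sizes = c(50, 100, 150, 200, 250),
                  rfeControl = rfe_ctrl)

print(rfe_result)                    # cross-validated accuracy per subset size
selected   <- predictors(rfe_result) # names of the retained attributes
x_selected <- x_train[, selected]
```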

ANALYSIS: In the first iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 77.84%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 88.24%. By using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 83.63%, somewhat below the prediction accuracy obtained on the training data.

In iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 82.14%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an accuracy metric of 88.92%. By using the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 88.05%, which was just slightly below the prediction accuracy using the training data.

From the model-building perspective, the number of attributes decreased by 541, from 753 down to 212. The processing time went from 2 hours 16 minutes in the first iteration down to 27 minutes in Take2, which was a decrease of 80.1%.

In iteration Take3, the baseline performance of the machine learning algorithms achieved an average accuracy of 81.69%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 89.11%. By using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 84.96%, somewhat below the prediction accuracy obtained on the training data.

From the model-building perspective, the number of attributes decreased by 571, from 753 down to 182. The processing time went from 2 hours 16 minutes in the first iteration down to 1 hour 43 minutes in Take3, which was a decrease of 24.2%.
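For reference, a hedged sketch of what the Take3 tuning trials mentioned above might look like: a grid search over the Random Forest mtry parameter on the RFE-selected attributes, followed by evaluation on the hold-out set. The grid values and the object names x_selected, y_train, x_test, and y_test are assumptions for illustration, not the exact settings behind the figures above.

```r
# Minimal sketch of Random Forest tuning on the RFE-selected attributes
library(caret)

set.seed(888)                                    # assumed seed
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
grid <- expand.grid(mtry = c(7, 13, 20, 30, 45)) # assumed candidate values

rf_tuned <- train(x = x_selected, y = y_train,   # assumed predictor/label objects
                  method = "rf", metric = "Accuracy",
                  trControl = ctrl, tuneGrid = grid)
print(rf_tuned)                                  # accuracy for each mtry candidate

# Score the hold-out testing data with the tuned model (assumed x_test/y_test)
predictions <- predict(rf_tuned, newdata = x_test)
confusionMatrix(predictions, y_test)
```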

CONCLUSION: For this iteration, the RFE technique combined with the Random Forest algorithm achieved the best overall modeling results. Compared with the first iteration, the feature selection technique reduced the processing time while achieving a better overall prediction accuracy. For this dataset, either Gradient Boosting or Random Forest, combined with a feature selection technique, should be considered for further modeling or production use.

Dataset Used: Parkinson’s Disease Classification Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification

Sakar, C.O., Serbes, G., Gunduz, A., Tunc, H.C., Nizam, H., Sakar, B.E., Tutuncu, M., Aydin, T., Isenkul, M.E. and Apaydin, H., 2018. A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing, DOI: https://doi.org/10.1016/j.asoc.2018.10.022

The HTML-formatted report can be found here on GitHub.