Regression Model for Red vs. White Wine Quality Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Wine Quality dataset poses a regression problem in which we try to predict the value of a continuous variable.

INTRODUCTION: The dataset is related to the red and white variants of the Portuguese “Vinho Verde” wine. The problem is to predict wine quality solely from the chemical characteristics of the wine. Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.).
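To make the input/output split concrete, here is a minimal sketch of separating the physicochemical inputs from the sensory output. It assumes the data files are semicolon-delimited text with a header row and a target column named `quality`. The function name `load_wine_rows` and the three-row `SAMPLE` (with only a few of the input columns) are hypothetical and used purely for illustration.

```python
import csv
import io


def load_wine_rows(text, target="quality"):
    """Parse semicolon-delimited wine data into (features, targets).

    Assumption: the first row is a header and one column is named `target`.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    features, targets = [], []
    for row in reader:
        # Remove the sensory score from the row; what remains are the inputs.
        targets.append(float(row.pop(target)))
        features.append({name: float(value) for name, value in row.items()})
    return features, targets


# Hypothetical sample with a reduced set of columns, for illustration only.
SAMPLE = '''"fixed acidity";"volatile acidity";"alcohol";"quality"
7.4;0.7;9.4;5
7.8;0.88;9.8;5
11.2;0.28;9.8;6
'''
```

Calling `load_wine_rows(SAMPLE)` returns one dictionary of input values per wine plus a parallel list of quality scores, which is the shape most regression libraries expect after conversion to arrays.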

For the red wine…

ANALYSIS: In the baseline round, the machine learning algorithms achieved an average RMSE of 0.704. Two algorithms (Extra Trees and Random Forest) achieved the lowest RMSE after the first round of modeling. After a series of tuning trials, Extra Trees turned in the best overall result with an RMSE of 0.574. Using the optimized parameters, the Extra Trees algorithm processed the test dataset with an RMSE of 0.563, which was slightly better than the result on the training data.
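For readers less familiar with the metric, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual values, so lower is better. The following is a minimal pure-Python sketch of the calculation. The function name `rmse` is chosen here for illustration, and the sketch is not necessarily how the project computed its scores.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    # Average the squared errors, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is expressed in the same units as the quality score.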

CONCLUSION: For this iteration, the Extra Trees algorithm achieved the best overall results using the training and testing datasets. For this dataset, Extra Trees should be considered for further modeling.

For the white wine…

ANALYSIS: In the baseline round, the machine learning algorithms achieved an average RMSE of 0.772. Two algorithms (Extra Trees and Random Forest) achieved the lowest RMSE after the first round of modeling. After a series of tuning trials, Extra Trees turned in the best overall result with an RMSE of 0.609. Using the optimized parameters, the Extra Trees algorithm processed the test dataset with an RMSE of 0.586, which was slightly better than the result on the training data.

CONCLUSION: For this iteration, the Extra Trees algorithm achieved the best overall results using the training and testing datasets. For this dataset, Extra Trees should be considered for further modeling.
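The baseline comparison described in the analysis sections can be sketched with scikit-learn as shown below. This is an illustrative sketch, not the project's actual script. The function names, the fold count, the number of trees, the random seed, and the synthetic stand-in data are all assumptions, so the numbers it produces will not match the RMSE values reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def compare_baselines(X, y, n_splits=5, seed=7):
    """Return the mean cross-validated RMSE for two tree ensembles."""
    # Hypothetical settings; the original tuning values are not given.
    models = {
        "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=seed),
        "RandomForest": RandomForestRegressor(n_estimators=50, random_state=seed),
    }
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        # scikit-learn reports negated MSE so that higher is always better.
        scores = cross_val_score(
            model, X, y, cv=folds, scoring="neg_mean_squared_error"
        )
        # Convert each fold's MSE to RMSE, then average across folds.
        results[name] = float(np.sqrt(-scores).mean())
    return results


def make_demo_data(n_samples=200, seed=7):
    """Synthetic stand-in with 11 numeric inputs (not the wine data)."""
    return make_regression(
        n_samples=n_samples, n_features=11, noise=0.5, random_state=seed
    )
```

Running `compare_baselines(*make_demo_data())` returns a dictionary mapping each model name to its average RMSE. Substituting the real wine features and quality scores for the synthetic data would reproduce the kind of side-by-side comparison summarized above.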

Dataset Used: Wine Quality Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/wine+quality

The HTML formatted report can be found here on GitHub.