Regression Model for NCAA Women’s Volleyball Win-Loss Percentages Using Python and Scikit-learn

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The NCAA Women’s Volleyball dataset presents a regression problem, where we are trying to predict the value of a continuous variable (a team's win-loss percentage).

INTRODUCTION: The NCAA maintains and publishes numerous datasets on its sporting events and statistics. The goal of this exercise is to experiment with non-neural-network machine learning (ML) algorithms and observe whether these classic ML techniques can model the sport of volleyball.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 0.0762. Two algorithms (Ridge Regression and Extra Trees) achieved the best (lowest) RMSE values after the first round of modeling. After a series of tuning trials, Ridge Regression turned in a better overall result than Extra Trees, with an RMSE of 0.0661 on the training data. Using the optimized parameters, the Ridge Regression algorithm processed the test dataset with an RMSE of 0.0643, which was slightly better than the result from the training data.
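The workflow described above (baseline comparison, tuning, then a final check on held-out data) can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the data is a synthetic stand-in generated with make_regression, and the fold count, random seed, estimator settings, and alpha grid are all assumptions chosen for brevity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)

# Synthetic stand-in data (assumption): the real NCAA attributes are not used here.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7)

# Same folds for every model so the RMSE values are comparable.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scoring = "neg_root_mean_squared_error"  # scikit-learn reports negated RMSE

# Step 1: baseline comparison of candidate algorithms with default settings.
candidates = {
    "Ridge": Ridge(),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=7),
}
baseline_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    baseline_rmse[name] = -scores.mean()  # flip the sign back to a plain RMSE
    print(f"{name}: baseline CV RMSE = {baseline_rmse[name]:.4f}")

# Step 2: tune Ridge Regression over a small, illustrative alpha grid.
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=kfold, scoring=scoring)
grid.fit(X_train, y_train)
tuned_cv_rmse = -grid.best_score_
print(f"Best alpha: {grid.best_params_['alpha']}, CV RMSE = {tuned_cv_rmse:.4f}")

# Step 3: score the tuned model (refit on all training data) on the test split.
test_rmse = float(np.sqrt(mean_squared_error(y_test, grid.predict(X_test))))
print(f"Test RMSE = {test_rmse:.4f}")
```

Reusing one KFold object across the baseline and tuning steps keeps the training-side RMSE figures directly comparable. The test split is touched only once, at the end.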

CONCLUSION: For this iteration, the Ridge Regression algorithm achieved the best overall results on both the training and testing datasets. For this dataset, we should consider using Ridge Regression for further modeling and testing.

Dataset Used: NCAA Women’s Volleyball Archived Statistics

Dataset ML Model: Regression with numerical attributes

Dataset Reference:

The HTML formatted report can be found here on GitHub.