Regression Model for Song Year Prediction Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Song Year Prediction dataset is a classic regression problem where we are trying to predict the value of a continuous variable.

INTRODUCTION: This data is a subset of the Million Song Dataset (http://labrosa.ee.columbia.edu/millionsong/), a collaboration between LabROSA (Columbia University) and The Echo Nest. The purpose of this exercise is to predict the release year of a song from its audio features. The songs are mostly Western commercial tracks released between 1922 and 2011, with a peak in the 2000s. The data preparer recommended a train/test split that uses the first 463,715 examples for training and the last 51,630 examples for testing. This split avoids the ‘producer effect’ by making sure that no song from a given artist ends up in both the training and testing sets.
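The recommended split is positional, not random, so it can be sketched with plain slicing. The snippet below is a minimal illustration and not the code from the original project: the helper name split_by_position and the toy data are assumptions made for this example, and only the two row counts come from the text above.

```python
# Row counts recommended by the data preparer (taken from the text above).
N_TRAIN = 463_715
N_TEST = 51_630


def split_by_position(rows, n_train=N_TRAIN, n_test=N_TEST):
    """Return (train, test) as the first n_train rows and the remaining n_test rows.

    Hypothetical helper for illustration. The rows are deliberately not
    shuffled: the file order is what keeps each artist's songs on one
    side of the split.
    """
    if len(rows) != n_train + n_test:
        raise ValueError(f"expected {n_train + n_test} rows, got {len(rows)}")
    return rows[:n_train], rows[n_train:]


# Tiny stand-in for the real data: 10 rows, 7 for training and 3 for testing.
toy_rows = list(range(10))
toy_train, toy_test = split_by_position(toy_rows, n_train=7, n_test=3)
```

The same slicing works on any sequence that supports len() and slice indexing. Shuffling the rows before splitting would defeat the purpose, because songs by one artist could then land on both sides.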

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 10.16. Two algorithms (Stochastic Gradient Boosting and eXtreme Gradient Boosting) achieved the lowest RMSE values after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the best overall result with an RMSE of 9.04. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an RMSE of 9.06, which was only slightly higher than the RMSE on the training data.
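For readers unfamiliar with the metric, RMSE (root mean squared error) is the square root of the average squared difference between the predicted and actual values, so here it is expressed in years. The following is a small pure-Python sketch of the calculation; the function name and the sample years are made up for illustration and are not results from this project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: the errors are -6, 8, 0 and 0 years.
years_true = [1990, 2000, 2005, 2010]
years_pred = [1996, 1992, 2005, 2010]
example_rmse = rmse(years_true, years_pred)  # sqrt((36 + 64 + 0 + 0) / 4) = 5.0
```

Because the errors are squared before averaging, a few predictions that miss by decades raise the RMSE more than many predictions that miss by a year or two.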

CONCLUSION: For this iteration, the eXtreme Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, eXtreme Gradient Boosting should be considered for further modeling or production use.

Dataset Used: YearPredictionMSD Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

One potential source of performance benchmarks: https://www.kaggle.com/uciml/msd-audio-features/home

The HTML-formatted report can be found on GitHub.