Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Song Year Prediction dataset presents a multi-class classification problem in which we try to predict one of ten possible outcomes.
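The summary does not say how the ten outcomes are defined. One plausible reading, offered here purely as an assumption, is that release years are grouped by decade: the 1922 to 2011 range noted in the introduction spans exactly ten decades. The helper name `year_to_decade` is hypothetical and is not taken from the original project.

```python
# Sketch only: ASSUMES the ten classes are decade bins, which the source does not confirm.

def year_to_decade(year: int) -> int:
    """Map a release year to the first year of its decade, e.g. 1987 -> 1980."""
    return (year // 10) * 10

# Collect the distinct decade labels over the year range stated in the introduction.
decades = sorted({year_to_decade(y) for y in range(1922, 2012)})
print(len(decades), decades[0], decades[-1])
```

If the project used a different binning, only `year_to_decade` would need to change.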
INTRODUCTION: This data is a subset of the Million Song Dataset, http://labrosa.ee.columbia.edu/millionsong/, a collaboration between LabROSA (Columbia University) and The Echo Nest. The purpose of this exercise is to predict the release year of a song from its audio features. The songs are mostly western, commercial tracks released between 1922 and 2011, with a peak in the 2000s. The data preparer recommended a train/test split that uses the first 463,715 examples for training and the last 51,630 examples for testing. This approach avoids the ‘producer effect’ by making sure that no song from a given artist ends up in both the training and the test set.
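The recommended split is positional rather than random, so it can be sketched in a few lines. The following is a minimal illustration, assuming the examples are held in file order in a NumPy array; the function name `split_by_position` and the small demo array are hypothetical and stand in for the real data.

```python
import numpy as np

# Split sizes stated by the data preparer.
N_TRAIN = 463_715
N_TEST = 51_630

def split_by_position(data, n_train=N_TRAIN, n_test=N_TEST):
    """Return (train, test): the first n_train rows and the remaining n_test rows.

    The rows are not shuffled, because the original ordering is what keeps
    each artist's songs on one side of the split.
    """
    if len(data) != n_train + n_test:
        raise ValueError("row count does not match the requested split sizes")
    return data[:n_train], data[n_train:]

# Tiny stand-in array so the sketch runs without downloading the dataset.
demo = np.arange(10)
demo_train, demo_test = split_by_position(demo, n_train=7, n_test=3)
print(len(demo_train), len(demo_test))
```

Shuffling the rows before splitting would defeat the purpose, since songs by the same artist could then land in both sets.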
ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 57.24%. Two algorithms (Stochastic Gradient Boosting and eXtreme Gradient Boosting) achieved the highest accuracy after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result with an accuracy of 63.85%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 63.65%, only slightly below its accuracy on the training data.
CONCLUSION: For this iteration, the eXtreme Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, eXtreme Gradient Boosting should be considered for further modeling or production use.
Dataset Used: YearPredictionMSD Data Set
Dataset ML Model: Classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
One potential source of performance benchmarks: https://www.kaggle.com/uciml/msd-audio-features/home
The HTML formatted report can be found here on GitHub.