Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Metro Interstate Traffic Volume dataset presents a regression problem in which we try to predict the value of a continuous variable.
INTRODUCTION: This dataset captured the hourly measurement of Interstate 94 Westbound traffic volume for MN DoT ATR station 301. The station is roughly midway between Minneapolis and St Paul, MN. The dataset also included the hourly weather and holiday attributes for assessing their impacts on traffic volume.
In iteration Take1, we established the baseline root mean squared error (RMSE) without much feature engineering. This round of modeling also excluded the date-time and weather description attributes.
In iteration Take2, we included the time stamp feature and observed its effect on improving the prediction accuracy.
In iteration Take3, we re-engineered (scaled and/or discretized) the weather-related features and observed their effect on the prediction accuracy.
In this iteration, we will re-engineer (scale and/or binarize) the holiday and weather-related features and observe their effect on the prediction accuracy.
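As a rough illustration of what binarizing the holiday and weather-related features could look like, here is a minimal sketch using pandas. The column names (holiday, rain_1h, snow_1h) are assumed from the UCI dataset description, the sample rows are made up, and the output column names are hypothetical; the actual transformations used in this iteration may differ.

```python
import pandas as pd

# Made-up sample rows; column names are assumed from the UCI dataset description.
df = pd.DataFrame({
    "holiday": ["None", "Columbus Day", "None"],
    "rain_1h": [0.0, 0.25, 0.0],
    "snow_1h": [0.0, 0.0, 0.0],
})


def binarize_features(frame):
    """Return a copy where holiday, rain, and snow become 0/1 indicator columns."""
    out = frame.copy()
    # 1 when the row falls on any named holiday, 0 otherwise.
    out["is_holiday"] = (out["holiday"] != "None").astype(int)
    # 1 when any precipitation was recorded in the hour, 0 otherwise.
    out["has_rain"] = (out["rain_1h"] > 0).astype(int)
    out["has_snow"] = (out["snow_1h"] > 0).astype(int)
    return out.drop(columns=["holiday", "rain_1h", "snow_1h"])


binary = binarize_features(df)
print(binary)
```

Collapsing sparse values into indicators like these trades detail for simplicity, which may help when most rows are "no holiday" and "no precipitation".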
ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 2646. Two algorithms (K-Nearest Neighbors and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an RMSE metric of 1887. By using the optimized parameters, the Gradient Boosting algorithm processed the test dataset with an RMSE of 1878, which was even better than the prediction from the training data.
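For reference, the RMSE metric quoted throughout this analysis is the square root of the mean of the squared prediction errors, so it is expressed in the same units as the traffic volume. A minimal sketch in plain Python follows; the function name and sample numbers are illustrative only and are not taken from the project.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: two hourly volumes and two predictions.
example = rmse([100, 200], [110, 180])
print(example)
```

A lower RMSE means the predictions sit closer to the observed hourly volumes.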
From iteration Take2, the performance of the machine learning algorithms achieved an average RMSE of 1559. Two algorithms (Random Forest and Extra Trees) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an RMSE metric of 465. By using the optimized parameters, the Random Forest algorithm processed the test dataset with an RMSE of 461, which was slightly better than the prediction from the training data.
By including the date_time information and related attributes, the machine learning models predicted significantly better: the average RMSE fell from 2646 to 1559, and the best tuned RMSE fell from 1887 (Gradient Boosting) to 465 (Random Forest).
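One common way to expose a time stamp to a model is to split it into calendar parts. The sketch below assumes the date_time column name from the dataset and uses made-up sample rows; the derived column names (hour, weekday, month) are hypothetical and may not match the attributes actually engineered in Take2.

```python
import pandas as pd

# Made-up sample rows; the traffic_volume values are illustrative only.
df = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-06 17:00:00"],
    "traffic_volume": [5000, 4000],
})


def add_time_features(frame):
    """Return a copy with hour, weekday, and month columns derived from date_time."""
    out = frame.copy()
    stamp = pd.to_datetime(out["date_time"])
    out["hour"] = stamp.dt.hour
    out["weekday"] = stamp.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = stamp.dt.month
    # Drop the raw string column, which most regressors cannot consume directly.
    return out.drop(columns=["date_time"])


features = add_time_features(df)
print(features)
```

Hour of day and day of week are plausible drivers of commuter traffic, which would be consistent with the large drop in RMSE reported above.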
From iteration Take3, the performance of the machine learning algorithms achieved an average RMSE of 977. Two algorithms (Random Forest and Extra Trees) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an RMSE metric of 465. By using the optimized parameters, the Random Forest algorithm processed the test dataset with an RMSE of 462, which was slightly better than the prediction from the training data.
Discretizing the weather-related features improved the average performance across all models (average RMSE of 977, down from 1559 in Take2). However, the changes appeared to have no impact on the performance of the ensemble algorithms, including Random Forest, whose tuned RMSE remained at 465.
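Discretizing a continuous weather feature means replacing each value with the label of the bin it falls into. The sketch below uses pandas.cut on a temperature series, assuming temperatures are recorded in kelvin as described for the UCI dataset. The bin edges and labels are hypothetical choices for illustration, not the ones used in Take3.

```python
import pandas as pd

# Made-up temperatures in kelvin (assumed unit of the dataset's temp column).
temps = pd.Series([255.0, 280.0, 300.0], name="temp")

# Hypothetical bin edges: at or below freezing, up to 20 C, and above.
bin_edges = [0.0, 273.15, 293.15, float("inf")]
bin_labels = ["freezing", "mild", "warm"]

# Each value is mapped to the right-closed interval that contains it.
temp_band = pd.cut(temps, bins=bin_edges, labels=bin_labels)
print(temp_band.astype(str).tolist())
```

Tree-based ensembles already split continuous features at learned thresholds, which may be one reason binning made little difference to Random Forest here.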
In the current iteration, the performance of the machine learning algorithms achieved an average RMSE of 966. Two algorithms (Random Forest and Extra Trees) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an RMSE metric of 394. By using the optimized parameters, the Random Forest algorithm processed the test dataset with an RMSE of 386, which was even better than the prediction from the training data.
Binarizing the holiday and other weather-related features improved the average performance across all models, with an average RMSE of 966 compared with 977 in Take3 and 2646 at the Take1 baseline. The changes also appeared to have a positive impact on the performance of the ensemble algorithms, especially Random Forest, whose tuned RMSE fell from 465 to 394.
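The tune-then-test workflow described in this analysis can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the project's actual script: the generated features, the parameter grid, the random seed, and the split ratio are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: two features, target loosely shaped like a daily cycle.
rng = np.random.default_rng(7)
X = rng.uniform(0, 24, size=(300, 2))
y = 3000 + 1500 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 100, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Hypothetical grid; scikit-learn maximizes scores, so RMSE is negated.
grid = GridSearchCV(
    RandomForestRegressor(random_state=7),
    param_grid={"n_estimators": [50, 100], "max_features": [1, 2]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)

# Cross-validated RMSE on the training data for the best parameter set.
train_cv_rmse = -grid.best_score_
# RMSE of the refitted best model on the held-out test data.
test_rmse = float(np.sqrt(mean_squared_error(y_test, grid.predict(X_test))))

print(grid.best_params_, train_cv_rmse, test_rmse)
```

Comparing the cross-validated training RMSE with the held-out test RMSE, as done in each iteration above, is a simple check that the tuned model is not overfitting.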
CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.
Dataset Used: Metro Interstate Traffic Volume Data Set
Dataset ML Model: Regression with numerical and categorical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume
One potential source of performance benchmarks: https://www.kaggle.com/ramyahr/metro-interstate-traffic-volume
The HTML formatted report can be found here on GitHub.