Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Rain in Australia dataset poses a binary classification problem in which we try to predict one of two possible outcomes.
INTRODUCTION: This dataset contains daily weather observations from numerous Australian weather stations. The target variable, RainTomorrow, indicates whether it rained the next day. We should also exclude the variable RISK_MM when training a binary classification model. By not eliminating the RISK_MM feature, we run the risk of leaking the answer into our model and reducing its effectiveness.
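As a minimal sketch of this step (assuming pandas and the dataset's weatherAUS.csv file as named on the Kaggle page; the variable names here are illustrative, not from the report), the leakage-prone column can be dropped before any modeling:

```python
import pandas as pd

# Load the Kaggle "Rain in Australia" observations
# (weatherAUS.csv is the file name used on the dataset page)
df = pd.read_csv('weatherAUS.csv')

# RISK_MM records the amount of rain that fell the next day (in mm),
# which is effectively the answer to RainTomorrow, so we drop it
df = df.drop(columns=['RISK_MM'])

# Separate the features from the binary target
X = df.drop(columns=['RainTomorrow'])
y = df['RainTomorrow']
```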
In iteration Take1, we constructed several traditional machine learning models using linear, non-linear, and ensemble techniques. We also observed the best accuracy score that we could obtain with each of these models.
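A minimal sketch of such a spot-check, assuming X_train and y_train hold the prepared training data (categorical attributes encoded, missing values handled) and using one representative algorithm per family, might look like this:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# One linear, one non-linear, and two ensemble candidates
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=7),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=7),
    'Extra Trees': ExtraTreesClassifier(n_estimators=100, random_state=7),
}

# Score every candidate with the same stratified 10-fold cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
    print(f'{name}: {scores.mean():.4f} ({scores.std():.4f})')
```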
In this Take2 iteration, we will construct and tune an XGBoost machine learning model for this dataset. We will observe the best accuracy score that we can obtain with the XGBoost model.
ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 83.83%. Two algorithms, Extra Trees and Random Forest, achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result than Extra Trees, with lower variance, achieving an accuracy of 85.44%. When configured with the optimized parameters, the Random Forest algorithm processed the test dataset with an accuracy of 85.52%, consistent with its accuracy score from the training phase.
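The tuning step could be reproduced along these lines (a sketch only; the parameter grid below is hypothetical, since the report does not list the exact values searched):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space, for illustration only
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_features': ['sqrt', 'log2'],
}

# Exhaustive grid search scored by 10-fold cross-validated accuracy
grid = GridSearchCV(
    RandomForestClassifier(random_state=7),
    param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_score_, grid.best_params_)
```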
In this Take2 iteration, the XGBoost algorithm achieved a baseline accuracy of 84.69% with n_estimators at its default value of 100. After a series of tuning trials, XGBoost turned in an overall accuracy of 86.21% with n_estimators set to 1000. When we applied the tuned XGBoost model to the test dataset, we obtained an accuracy score of 86.27%, consistent with the model's performance from the training phase.
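A minimal sketch of this comparison, assuming the xgboost package and 0/1-encoded train/test splits (X_train, X_test, y_train, y_test), with the n_estimators values taken from the trials above:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# Baseline model with the default number of boosting rounds
baseline = XGBClassifier(n_estimators=100)
scores = cross_val_score(baseline, X_train, y_train, cv=10, scoring='accuracy')
print(f'Baseline CV accuracy: {scores.mean():.4f}')

# Tuned model, then a final check on the held-out test set
tuned = XGBClassifier(n_estimators=1000)
tuned.fit(X_train, y_train)
print(f'Test accuracy: {accuracy_score(y_test, tuned.predict(X_test)):.4f}')
```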
CONCLUSION: For this iteration, the XGBoost algorithm achieved the best overall result on both the training and test datasets. For this dataset, XGBoost should be considered for further modeling.
Dataset Used: Rain in Australia Data Set
Dataset ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package
One potential source of performance benchmarks: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/kernels
The HTML-formatted report can be found on GitHub.