Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset is a multi-class classification problem in which we try to predict one of several (more than two) possible outcomes.
INTRODUCTION: This experiment attempts to predict forest cover type from cartographic variables only. The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are more a result of ecological processes than of forest management practices.
The actual forest cover type for a given observation (30 x 30-meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and the USFS. The data is in raw form (not scaled) and contains binary (0 or 1) columns for the qualitative independent variables (wilderness areas and soil types).
In iteration Take1, we established the baseline accuracy for comparison with future rounds of modeling.
In iteration Take2, we examined the feature selection technique of attribute importance ranking using the Gradient Boosting algorithm. By selecting only the essential attributes, we decreased the modeling time while maintaining a level of accuracy similar to the baseline model. A sketch of this technique appears below.
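The following is a minimal sketch of the Take2 idea: rank the attributes by Gradient Boosting importance and keep the smallest set whose cumulative importance reaches 99%. It uses scikit-learn's built-in Covertype loader as a stand-in for the project's Kaggle files; the 10,000-row subsample and the estimator settings are illustrative assumptions, not the project's actual configuration.

```python
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import GradientBoostingClassifier

# Load the Covertype data (54 attributes, 7 cover types); subsample for speed.
covtype = fetch_covtype(as_frame=True)
X, y = covtype.data.iloc[:10000], covtype.target.iloc[:10000]

# Fit a Gradient Boosting model purely to obtain attribute importance scores.
ranker = GradientBoostingClassifier(n_estimators=100, random_state=7)
ranker.fit(X, y)

# Sort attributes by importance and keep the smallest set crossing 99% cumulative importance.
importances = ranker.feature_importances_
order = importances.argsort()[::-1]
cumulative = importances[order].cumsum()
n_keep = int((cumulative < 0.99).sum()) + 1
selected = X.columns[order[:n_keep]]

X_reduced = X[selected]
print(f"Kept {len(selected)} of {X.shape[1]} attributes")
```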
In iteration Take3, we examined the feature selection technique of recursive feature elimination (RFE) using the Extra Trees algorithm. By selecting no more than 40 attributes, we maintained a level of accuracy similar to the baseline model. A sketch of this technique appears below.
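The following is a minimal sketch of the Take3 idea: recursive feature elimination driven by an Extra Trees estimator, capped at 40 attributes. The data loading mirrors the sketch above and is an illustrative stand-in for the project's Kaggle files.

```python
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

# Load the Covertype data; subsample for speed.
covtype = fetch_covtype(as_frame=True)
X, y = covtype.data.iloc[:10000], covtype.target.iloc[:10000]

# RFE repeatedly fits the estimator and drops the least important attribute
# until only the requested number of attributes remains.
selector = RFE(
    estimator=ExtraTreesClassifier(n_estimators=100, random_state=7),
    n_features_to_select=40,   # keep no more than 40 of the 54 attributes
    step=1,                    # drop one attribute per elimination round
)
selector.fit(X, y)

selected = X.columns[selector.support_]
X_reduced = X[selected]
print(f"Kept {len(selected)} of {X.shape[1]} attributes")
```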
In this Take4 iteration, we will construct and tune an XGBoost machine learning model for this dataset. We will observe the best accuracy result that we can obtain using the XGBoost model with the training dataset from Kaggle. Furthermore, we will apply the XGBoost model to Kaggle’s test dataset and submit a list of predictions to Kaggle for evaluation.
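The following is a minimal sketch of the Take4 plan: cross-validate and tune an XGBoost classifier on Kaggle's training file, then score Kaggle's test file and write a submission. The file names, ID/target column names, and grid values are assumptions for illustration, not the tuning grid actually used in this project.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Kaggle training file (assumed name and column layout).
train = pd.read_csv("train.csv")
X_train = train.drop(columns=["Id", "Cover_Type"])
y_train = train["Cover_Type"] - 1          # shift labels 1..7 down to 0..6 for XGBoost

# Small illustrative grid; the project's actual tuning trials may differ.
param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [6, 10],
    "learning_rate": [0.1, 0.3],
}
search = GridSearchCV(
    estimator=XGBClassifier(random_state=7),
    param_grid=param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=7),
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best CV accuracy:", search.best_score_, search.best_params_)

# Predict Kaggle's test file and write a submission, shifting labels back to 1..7.
test = pd.read_csv("test.csv")
predictions = search.best_estimator_.predict(test.drop(columns=["Id"])) + 1
pd.DataFrame({"Id": test["Id"], "Cover_Type": predictions}).to_csv(
    "submission.csv", index=False
)
```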
ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 78.04%. Two algorithms (Bagged Decision Trees and Extra Trees) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Extra Trees turned in the top overall result and achieved an accuracy metric of 85.80%. By using the optimized parameters, the Extra Trees algorithm processed the testing dataset with an accuracy of 86.50%, which was even better than the predictions from the training data.
From iteration Take2, the performance of the machine learning algorithms achieved an average accuracy of 77.84%. Extra Trees produced an accuracy metric of 85.80% with the training dataset. The model processed the testing dataset with an accuracy of 86.19%, which was slightly better than the predictions from training. At the importance level of 99%, the attribute importance technique eliminated 23 of the 54 total attributes. The remaining 31 attributes produced a model with accuracy comparable to the baseline model. The modeling time went from 2 minutes 2 seconds down to 1 minute 41 seconds, a reduction of 17.2%.
From iteration Take3, the performance of the machine learning algorithms achieved an average accuracy of 77.99%. Extra Trees produced an accuracy metric of 85.97% with the training dataset. The model processed the testing dataset with an accuracy of 86.64%, which was slightly better than the predictions from training. The RFE technique eliminated 16 of the 54 total attributes. The remaining 38 attributes produced a model with accuracy comparable to the baseline model. The modeling time went from 2 minutes 2 seconds down to 1 minute 58 seconds, a reduction of 3.2%.
In this Take4 iteration, the XGBoost algorithm achieved a baseline accuracy performance of 75.29%. After a series of tuning trials, XGBoost turned in an accuracy result of 85.58%. When we applied the tuned XGBoost model to the test dataset, we obtained an accuracy score of 87.72%, which was even better than the predictions from the training data.
However, when we applied the tuned XGBoost model to the test dataset from Kaggle, we obtained an accuracy score of only 75.45%. Keep in mind that Kaggle provides only 2.6% of the original dataset for training and scores predictions on the remaining 97.4% as test data.
CONCLUSION: For this iteration of the project, the XGBoost algorithm achieved the best accuracy result on the training and testing datasets when compared with the other machine learning algorithms evaluated. For this dataset, XGBoost should be considered for further modeling.
Dataset Used: Forest Cover Type Data Set
Dataset ML Model: Multi-Class classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype
One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview
The HTML-formatted report can be found on GitHub.