Multi-Class Deep Learning Model for Forest Cover Type Using TensorFlow Take 7

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: This experiment tries to predict forest cover type from cartographic variables only. The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are more a result of ecological processes than of forest management practices.

The actual forest cover type for a given observation (30 x 30-meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data initially obtained from the US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
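As a point of reference, a minimal sketch of loading the Kaggle training file might look like the following; the file name and column names are assumptions drawn from the competition's data description rather than the project's actual script.

```python
# A sketch of loading the Kaggle training data; file and column names assumed.
import pandas as pd

train_df = pd.read_csv("train.csv")

# Qualitative attributes arrive as one-hot (0/1) indicator columns, e.g.
# Wilderness_Area1-4 and Soil_Type1-40, alongside raw quantitative columns
# such as Elevation, Aspect, and Slope. The target is Cover_Type (1-7).
print(train_df.shape)
print(train_df["Cover_Type"].value_counts())
```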

In iteration Take1, we established the baseline accuracy for comparison with future rounds of modeling.

In iteration Take2, we examined the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting the essential attributes, we decreased the modeling time and still maintained a similar level of accuracy when compared to the baseline model.
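A minimal sketch of this attribute-importance ranking step might look like the following; the file name, hyperparameters, and the 40-feature cutoff are illustrative assumptions rather than the exact Take2 settings.

```python
# Rank attribute importance with Gradient Boosting and keep the top features.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

train_df = pd.read_csv("train.csv")  # assumed Kaggle training file
X_train = train_df.drop(columns=["Id", "Cover_Type"])
y_train = train_df["Cover_Type"]

gbm = GradientBoostingClassifier(n_estimators=100, random_state=888)
gbm.fit(X_train, y_train)

# Rank the attributes and keep only the strongest contributors.
importance = pd.Series(gbm.feature_importances_, index=X_train.columns)
top_features = importance.sort_values(ascending=False).head(40).index
X_train_reduced = X_train[top_features]
```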

In iteration Take3, we examined the feature selection technique of recursive feature elimination (RFE) with the use of the Extra Trees algorithm. By selecting no more than 40 attributes, we maintained a similar level of accuracy when compared to the baseline model.
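A minimal sketch of the RFE step with an Extra Trees estimator might look like the following; the 40-feature target mirrors the "no more than 40 attributes" goal, and the remaining settings are assumptions.

```python
# Recursive feature elimination (RFE) driven by an Extra Trees estimator.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

train_df = pd.read_csv("train.csv")  # assumed Kaggle training file
X_train = train_df.drop(columns=["Id", "Cover_Type"])
y_train = train_df["Cover_Type"]

rfe = RFE(estimator=ExtraTreesClassifier(n_estimators=100, random_state=888),
          n_features_to_select=40)
rfe.fit(X_train, y_train)

# Keep only the columns RFE retained.
X_train_rfe = X_train.loc[:, rfe.support_]
```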

In iteration Take4, we constructed and tuned an XGBoost machine learning model for this dataset. We also observed the best accuracy result that we could obtain using the XGBoost model with the training dataset from Kaggle. Furthermore, we applied the XGBoost model to Kaggle’s test dataset and submitted a list of predictions to Kaggle for evaluation.
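A sketch of how such an XGBoost tuning and submission workflow might look follows; the parameter grid, file names, and cross-validation settings are illustrative assumptions, not the exact Take4 configuration.

```python
# Tune an XGBoost classifier with a small grid search, then score Kaggle's
# test file and write a submission in the expected format.
import pandas as pd
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

train_df = pd.read_csv("train.csv")  # assumed Kaggle training file
X_train = train_df.drop(columns=["Id", "Cover_Type"])
y_train = train_df["Cover_Type"] - 1  # recent XGBoost versions expect 0-based labels

param_grid = {"n_estimators": [500, 1000],
              "max_depth": [6, 8, 10],
              "learning_rate": [0.1, 0.3]}
grid = GridSearchCV(XGBClassifier(), param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

test_df = pd.read_csv("test.csv")
predictions = grid.best_estimator_.predict(test_df.drop(columns=["Id"])) + 1
pd.DataFrame({"Id": test_df["Id"], "Cover_Type": predictions}).to_csv(
    "submission.csv", index=False)
```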

In iteration Take5, we constructed several Multilayer Perceptron (MLP) models with one hidden layer. These simple MLP models will serve as the baseline models as we build more complex MLP models in future iterations. Furthermore, we applied the MLP model to Kaggle’s test dataset and submitted a list of predictions to Kaggle for evaluation.

In iteration Take6, we constructed several Multilayer Perceptron (MLP) models with two hidden layers. These MLP models will serve as a benchmark as we build more complex MLP models in future iterations. Furthermore, we applied the MLP model to Kaggle’s test dataset and submitted a list of predictions to Kaggle for evaluation.

In this Take7 iteration, we will construct several Multilayer Perceptron (MLP) models with three hidden layers. These MLP models will serve as a benchmark as we build more complex MLP models in future iterations. Furthermore, we will apply the MLP model to Kaggle’s test dataset and submit a list of predictions to Kaggle for evaluation.
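A minimal sketch of one such three-hidden-layer MLP (the 36/28/8 configuration), trained for the 75 epochs used throughout this series and then scored against Kaggle's test file, might look like the following; the data loading, batch size, and validation split are assumptions rather than the exact Take7 script.

```python
# A three-hidden-layer Keras MLP (36/28/8 nodes) for the seven cover types.
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

train_df = pd.read_csv("train.csv")  # assumed Kaggle training file
X_train = train_df.drop(columns=["Id", "Cover_Type"])
y_train = train_df["Cover_Type"] - 1  # shift the 1-7 labels to 0-6 for Keras

model = Sequential([
    Dense(36, activation="relu", input_shape=(X_train.shape[1],)),
    Dense(28, activation="relu"),
    Dense(8, activation="relu"),
    Dense(7, activation="softmax"),  # seven forest cover types
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, validation_split=0.2, epochs=75, batch_size=32)

# Score Kaggle's test file and write a submission in the expected format.
test_df = pd.read_csv("test.csv")
predictions = model.predict(test_df.drop(columns=["Id"])).argmax(axis=1) + 1
pd.DataFrame({"Id": test_df["Id"], "Cover_Type": predictions}).to_csv(
    "submission.csv", index=False)
```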

ANALYSIS: Note: Performance measurements for iterations Take1, Take2, and Take3 are available from the Take4 blog posts.

In iteration Take4, the XGBoost algorithm achieved a baseline accuracy performance of 75.29%. After a series of tuning trials, XGBoost turned in an accuracy result of 85.58%. When we applied the tuned XGBoost algorithm to the test dataset, we obtained an accuracy score of 87.72%, which was even better than the accuracy achieved with the training data.

However, when we applied the tuned XGBoost algorithm to the test dataset from Kaggle, we obtained an accuracy score of only 75.45%. Keep in mind that Kaggle's training set covers only about 2.6% of the original dataset, and the model must predict the remaining 97.4% as test data.

In iteration Take5, all single-layer models achieved an accuracy performance of between 70.6% and 77.8% after 75 epochs using the test dataset. The 36-node model appears to have the highest accuracy with low variance. However, when we applied the single-layer 36-node neural network model to the test dataset from Kaggle, we obtained an accuracy score of only 60.96%.

In iteration Take6, all dual-layer models achieved an accuracy performance of between 75.5% and 80.4% after 75 epochs using the test dataset. The 36/28-node model appears to have the highest accuracy with low variance. However, when we applied the dual-layer 36/28-node neural network model to the test dataset from Kaggle, we obtained an accuracy score of only 65.872%.

In this Take7 iteration, all three-layer models achieved an accuracy performance of between 78.1% and 80.1% after 75 epochs using the test dataset. The 36/28/24-node model appears to have the highest accuracy with low variance.

However, when we applied the three-layer models to the test dataset from Kaggle, the best of them, the 36/28/8-node model, obtained an accuracy score of only 65.744%. We captured additional performance measurements using different model configurations, listed below; a sketch of how such a configuration sweep might be coded follows the list.

  • Three-Layer 36/28/08-Node MLP Model – Accuracy: 65.744%
  • Three-Layer 36/28/12-Node MLP Model – Accuracy: 63.707%
  • Three-Layer 36/28/16-Node MLP Model – Accuracy: 65.485%
  • Three-Layer 36/28/20-Node MLP Model – Accuracy: 63.663%
  • Three-Layer 36/28/24-Node MLP Model – Accuracy: 63.455%
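The following is a minimal sketch of sweeping the third hidden layer's width while holding the first two layers at 36 and 28 nodes; the build_model helper, the hold-out split, and the batch size are illustrative assumptions rather than the exact Take7 script.

```python
# Sweep the third hidden layer's width and report hold-out accuracy.
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

train_df = pd.read_csv("train.csv")  # assumed Kaggle training file
X = train_df.drop(columns=["Id", "Cover_Type"])
y = train_df["Cover_Type"] - 1  # shift the 1-7 labels to 0-6 for Keras
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=888)

def build_model(third_layer_nodes, n_inputs, n_classes=7):
    """Three-hidden-layer MLP with a configurable third layer (hypothetical helper)."""
    model = Sequential([
        Dense(36, activation="relu", input_shape=(n_inputs,)),
        Dense(28, activation="relu"),
        Dense(third_layer_nodes, activation="relu"),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Train each configuration for 75 epochs and report hold-out accuracy.
for nodes in [8, 12, 16, 20, 24]:
    model = build_model(nodes, X_train.shape[1])
    model.fit(X_train, y_train, epochs=75, batch_size=32, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"36/28/{nodes:02d} model - hold-out accuracy: {acc:.4f}")
```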

CONCLUSION: For this iteration, the three-layer model with 36/28/8 nodes appeared to yield the best result on the Kaggle test dataset. For this dataset, we should consider experimenting with more and different MLP configurations.

Dataset Used: Forest Cover Type Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype

One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview

The HTML-formatted report can be found here on GitHub.