Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
INTRODUCTION: Mining activity has always been accompanied by dangers commonly known as mining hazards. A special case of such a threat is the seismic hazard that frequently occurs in many underground mines. Seismic hazards, which are comparable to earthquakes, are the hardest natural hazards to detect and predict. The complexity of seismic processes and the large disproportion between the number of low-energy seismic events and the number of high-energy phenomena make statistical techniques insufficient for predicting seismic hazards. It is therefore essential to search for better approaches to hazard prediction, including machine learning methods.
In the Take1 and Take2 iterations, we had three algorithms with high accuracy and ROC results, but the models carried strong biases due to the imbalance of our dataset. For this iteration, we will examine the feasibility of using SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset.
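The write-up itself does not include code, but a minimal sketch of how SMOTE could be applied with the imbalanced-learn library is shown below. The synthetic dataset stands in for the seismic data; its dimensions and class weights are illustrative assumptions only, not the project's actual setup.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative stand-in for the seismic data: a synthetic set with
# roughly the same class imbalance (~93% non-hazardous, ~7% hazardous).
X, y = make_classification(
    n_samples=2584, n_features=18, weights=[0.934, 0.066], random_state=42
)
print("Class counts before SMOTE:", Counter(y))

# SMOTE creates new minority-class samples by interpolating between a
# minority sample and its nearest minority-class neighbors, yielding a
# balanced set instead of merely duplicating the rare rows.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("Class counts after SMOTE: ", Counter(y_res))
```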
CONCLUSION: From the previous Take1 iteration, the baseline performance of the eight algorithms achieved an average accuracy of 93.11%. Three algorithms (Random Forest, Support Vector Machine, and AdaBoost) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, all three algorithms turned in an identical accuracy result of 93.42% with an identical Kappa score of 0.0. With the imbalanced dataset we have on hand, we needed to look for another metric or another approach to evaluate the models.
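The 93.42% accuracy with zero Kappa is exactly what a degenerate majority-class predictor produces on data with roughly a 93/7 split. The hedged sketch below, using scikit-learn's DummyClassifier on an illustrative synthetic dataset (not the project's data), reproduces that behavior.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Synthetic stand-in with roughly the project's class imbalance.
X, y = make_classification(
    n_samples=2584, n_features=18, weights=[0.934, 0.066], random_state=42
)

# A degenerate model that always predicts the majority (non-hazardous) class.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority.predict(X)

# Accuracy lands near the majority-class proportion (~93%), while Kappa
# is 0.0 because the predictions carry no information beyond chance.
print("Accuracy:", accuracy_score(y, y_pred))
print("Kappa:   ", cohen_kappa_score(y, y_pred))
```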
From the previous Take2 iteration, the baseline performance of the eight algorithms achieved an average ROC score of 71.99%. Three algorithms (Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top three ROC scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the best ROC result of 78.59%, but with a dismal sensitivity score of 0.88%.
From the current iteration, the baseline performance of the eight algorithms achieved an average ROC score of 87.33%. Three algorithms (Random Forest, AdaBoost, and Stochastic Gradient Boosting) again achieved the top three ROC scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the best ROC result of 92.68%, this time with a much better sensitivity score of 70.87%. Using the optimized tuning parameters, the Random Forest model processed the validation dataset with a ROC score of 84.57%, slightly below the ROC score from the training data.
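The report does not show the tuning code, but a minimal sketch of how such a run could be reproduced with scikit-learn and imbalanced-learn appears below. The parameter grid, split sizes, and synthetic data are assumptions for illustration, not the project's actual settings. Placing SMOTE inside an imbalanced-learn Pipeline means each cross-validation fold is oversampled on its training portion only, so the held-out folds and the final validation split stay free of synthetic samples.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the seismic data (dimensions are illustrative).
X, y = make_classification(
    n_samples=2584, n_features=18, weights=[0.934, 0.066], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTE is applied only when the pipeline is fitted, never when it
# scores, so cross-validation stays honest about the imbalance.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])
grid = GridSearchCV(
    pipeline,
    param_grid={"rf__n_estimators": [100, 300], "rf__max_features": [2, 4, 6]},
    scoring="roc_auc",
    cv=10,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print("Best CV ROC AUC:", grid.best_score_)

# Evaluate the tuned model on the untouched validation split; a modest
# drop versus the cross-validated training score is expected.
y_prob = grid.predict_proba(X_val)[:, 1]
print("Validation ROC AUC:", roc_auc_score(y_val, y_prob))
print("Sensitivity:", recall_score(y_val, grid.predict(X_val)))
```

Sensitivity here is computed as recall of the minority (hazardous) class, matching the way the iterations above report it alongside ROC.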
The ROC metric has given us a more viable way to evaluate the models than accuracy alone. Moreover, the SMOTE technique helped make the model evaluation more realistic given our imbalanced dataset. For this project, Random Forest appeared to be the most suitable algorithm for the dataset.
The HTML-formatted report can be found here on GitHub.