Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The APS Failure at Scania Trucks dataset is a classic binary classification problem, where we are trying to predict one of two possible outcomes.
INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air that is used in various functions in a truck, such as braking and gear changes. The dataset's positive class consists of failures of a specific APS component. The negative class consists of trucks with failures of components not related to the APS. The data is a subset of all available data, selected by experts.
This dataset has many cells with missing values, so it is not practical to simply delete the rows with missing cells. This iteration of the project (Take 2) produces a set of results by imputing the blank cells with the mean value. We will compare these results with those from Take 1, where we imputed the blank cells with the value zero.
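To make the difference between the two imputation strategies concrete, here is a minimal sketch on a tiny made-up table. It is not the project's actual code: the function name `impute`, the toy data, and the choice to compute each mean per column are assumptions made for illustration. In a real run, the fill values would normally be computed from the training split only and then reused on the validation split.

```python
import math
from statistics import fmean


def impute(rows, strategy="mean"):
    """Return a copy of `rows` with NaN cells filled column by column.

    `impute` is a hypothetical helper for illustration only.
    strategy="zero" mirrors Take 1; strategy="mean" mirrors Take 2.
    """
    if strategy not in ("mean", "zero"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        # Only the observed (non-missing) cells contribute to the mean.
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        if strategy == "mean" and observed:
            fills.append(fmean(observed))
        else:
            # Zero strategy, or a column with no observed values at all.
            fills.append(0.0)

    return [
        [fills[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]


nan = float("nan")
# Toy data (made up): three rows, three numeric attributes, some cells missing.
toy = [
    [1.0, nan, 10.0],
    [3.0, 4.0, nan],
    [nan, 8.0, 30.0],
]

take1 = impute(toy, "zero")  # blanks become 0.0
take2 = impute(toy, "mean")  # blanks become the column mean
print(take1)
print(take2)
```

Zero imputation can place filled cells far from the observed range of an attribute, whereas mean imputation keeps them at the center of that range. Which one works better depends on the algorithm, and that is what the two iterations compare.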
CONCLUSION: In the Take 1 iteration, the baseline performance of the ten algorithms achieved an average accuracy of 98.8001%. The ensemble algorithms (Bagged CART, Random Forest, Extra Trees, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result on the training data, with an average accuracy of 99.3983%. Using the best tuning parameters found, the Random Forest algorithm processed the validation dataset with an accuracy of 99.2187%, which was slightly below its accuracy on the training data.
In the current iteration (Take 2), the baseline performance of the ten algorithms achieved an average accuracy of 98.8348%. The ensemble algorithms (Bagged CART, Random Forest, Extra Trees, AdaBoost, and Stochastic Gradient Boosting) again achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result on the training data, with an average accuracy of 99.3967%. Using the best tuning parameters found, the Random Forest algorithm processed the validation dataset with an accuracy of 99.2187%, which was slightly below its accuracy on the training data.
For this iteration, imputing the missing cells with the mean value instead of zero raised the average baseline accuracy of the ten algorithms slightly (from 98.8001% to 98.8348%), but it made no meaningful difference for Random Forest, whose tuned training accuracy and validation accuracy were essentially unchanged. For this project, the Random Forest ensemble algorithm consistently yielded the best training and validation results, which justifies the additional processing time the algorithm requires.
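As a quick check on this comparison, the differences between the two iterations can be computed directly from the figures reported in the conclusion. The sketch below uses only those reported numbers. The dictionary labels are our own naming and are not part of the original report.

```python
# Accuracy figures (in percent) as reported in the conclusion: (Take 1, Take 2).
results = {
    "baseline average (ten algorithms)": (98.8001, 98.8348),
    "Random Forest, tuned, training": (99.3983, 99.3967),
    "Random Forest, validation": (99.2187, 99.2187),
}

# Difference in percentage points, Take 2 minus Take 1, rounded to the
# four decimal places used in the report.
deltas = {name: round(t2 - t1, 4) for name, (t1, t2) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f} percentage points")
```

All three differences are well under a tenth of a percentage point, so the choice between zero and mean imputation had very little effect on accuracy for this dataset.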
Dataset Used: APS Failure at Scania Trucks Data Set
Dataset ML Model: Binary classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks
The HTML-formatted report can be found on GitHub.