Binary Classification Model for Truck APS Failure Using Scikit-Learn Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air that supports functions such as braking and gear changes. The dataset’s positive class consists of failures of a specific APS component. The negative class consists of trucks with failures of components not related to the APS. The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Truck APS Failure dataset is a binary classification problem, where we try to predict one of two possible outcomes.

The challenge is to minimize the total cost of a prediction model, defined as “Cost_1” multiplied by the number of instances with a Type I failure plus “Cost_2” multiplied by the number of instances with a Type II failure. “Cost_1” is the cost of an unnecessary check by a mechanic at the workshop (a false positive). “Cost_2” is the cost of missing a faulty truck (a false negative). The cost of a Type I error (Cost_1) is 10, while the cost of a Type II error (Cost_2) is 500.
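The cost metric described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `total_cost` and the toy label lists are made up for this example, and the labels are assumed to be encoded as 1 for the positive class (APS failure) and 0 for the negative class.

```python
# Costs taken from the problem statement above.
COST_1 = 10    # Type I error: unnecessary workshop check (false positive)
COST_2 = 500   # Type II error: missed faulty truck (false negative)

def total_cost(y_true, y_pred, cost_fp=COST_1, cost_fn=COST_2):
    """Return cost_fp * (#false positives) + cost_fn * (#false negatives).

    Assumes labels are encoded as 1 = positive (APS failure), 0 = negative.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return cost_fp * fp + cost_fn * fn

# Toy labels (hypothetical, not taken from the dataset).
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
print(total_cost(y_true, y_pred))  # one false positive + one false negative -> 510
```

Because a missed failure costs 50 times as much as a redundant check, a model that trades some extra false positives for fewer false negatives will usually score better on this metric.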

In iteration Take1, we constructed and tuned machine learning models for this dataset using the Scikit-Learn library. We also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.
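For reference, sensitivity/recall is the share of actual positives that the model flags as positive, TP / (TP + FN). The snippet below is a minimal sketch with hypothetical toy labels, again assuming 1 marks the positive class; scikit-learn's `recall_score` in `sklearn.metrics` computes the same quantity.

```python
def recall(y_true, y_pred):
    """Recall (sensitivity) = TP / (TP + FN), assuming 1 = positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        return 0.0  # no actual positives; recall is undefined, report 0.0 here
    return tp / (tp + fn)

# Toy labels (hypothetical): 4 actual positives, 3 of them caught.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(recall(y_true, y_pred))  # 0.75
```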

In this Take2 iteration, we will attempt to bring more balance to this imbalanced dataset by using the Synthetic Minority Over-sampling Technique, or SMOTE for short. We will up-sample the minority class from approximately 1.7% (1,000 of 60,000) to approximately 33% of the training instances. Furthermore, we will observe the best sensitivity/recall score that we can obtain using the tuned models with the training and test datasets.
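To make the up-sampling target concrete, the sketch below works out how many synthetic positive rows would be needed for the minority class to reach one third of the training instances. It assumes SMOTE is applied to the full training set of 59,000 negative and 1,000 positive examples; the helper name `smote_plan` is hypothetical. The resulting minority-to-majority ratio is the kind of value that the `sampling_strategy` parameter of `SMOTE` in the imbalanced-learn library accepts as a float, although the text does not state which SMOTE implementation or setting was actually used.

```python
from fractions import Fraction

def smote_plan(n_minority, n_majority, target_fraction):
    """Return (minority:majority ratio, number of synthetic rows needed).

    Solves (n_minority + s) / (n_minority + s + n_majority) = target_fraction
    for s, the number of synthetic minority rows to generate.
    """
    f = Fraction(target_fraction)
    ratio = f / (1 - f)                      # desired minority / majority ratio
    target_minority = round(ratio * n_majority)
    n_synthetic = max(0, target_minority - n_minority)
    return float(ratio), n_synthetic

# Class counts from the introduction; a one-third target is assumed.
ratio, n_synthetic = smote_plan(1000, 59000, Fraction(1, 3))
print(ratio, n_synthetic)  # 0.5 28500
```

Under these assumptions the resampled training set would hold 29,500 positive and 59,000 negative rows. Only the training data should be resampled; the validation and test sets keep their original class distribution.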

ANALYSIS: From iteration Take1, the performance of the machine learning algorithms achieved an average recall metric of 59.26%. Two algorithms (Extra Trees and Random Forest) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 68.53%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 66.40%. Furthermore, the optimized model processed the test dataset with a score of 71.73% with a high Type II error rate.

From this Take2 iteration, the performance of the machine learning algorithms achieved an average recall metric of 95.79%. Two algorithms (Extra Trees and k-Nearest Neighbors) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 99.67%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 80.40%. Furthermore, the optimized model processed the test dataset with a score of 82.40% with a high Type II error rate.

CONCLUSION: For this iteration, the Extra Trees model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the Extra Trees algorithm for further modeling and testing activities.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

One potential source of performance benchmark: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML-formatted report is available on GitHub.