Binary Classification Model for Santander Customer Satisfaction Using Scikit-Learn Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Customer Satisfaction dataset presents a binary classification problem, where we try to predict one of two possible outcomes.

INTRODUCTION: Santander Bank sponsored a Kaggle competition to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late. In this competition, Santander has provided hundreds of anonymized features to predict whether a customer is satisfied or dissatisfied with their banking experience. The competition evaluates submissions on the area under the ROC curve (AUC) between the predicted probability and the observed target.
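As a rough illustration of the evaluation metric, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive instance receives the higher predicted probability, with ties counted as half. The function name `roc_auc` and the toy labels and scores are made up for this example, and the code assumes that label 1 marks the positive class. In practice, `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the Santander dataset.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Because AUC depends only on how the predictions are ranked, submitting predicted probabilities rather than hard 0/1 labels generally gives the metric more information to work with. The pairwise loop is quadratic and is meant only to make the definition concrete.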

In iteration Take1, we constructed and tuned several machine learning models using the Scikit-learn library. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and submitted a list of predictions for evaluation.

In iteration Take2, we built a more balanced training dataset using the “Synthetic Minority Oversampling TEchnique,” or SMOTE for short. Through oversampling, we increased the minority class from approximately 3.9% to 20% of the training instances. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and observed whether training on the balanced dataset had any positive impact on the prediction results.
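The arithmetic behind the 3.9% to 20% figure can be sketched as follows. The helper names and the row counts (39 minority rows out of 1,000) are hypothetical and chosen only to match the 3.9% share; they are not the actual dataset sizes. The second helper converts the target share into a minority-to-majority ratio, which is the form that imbalanced-learn's `SMOTE` accepts as a float `sampling_strategy`, assuming that library was used (the write-up does not say).

```python
import math
from fractions import Fraction


def synthetic_rows_needed(n_minority, n_majority, target_fraction):
    """Synthetic minority rows to add so the minority share reaches target_fraction."""
    total = n_minority + n_majority
    # Solve (n_minority + m) / (total + m) = target_fraction for m.
    needed = (target_fraction * total - n_minority) / (1 - target_fraction)
    return max(0, math.ceil(needed))


def minority_to_majority_ratio(target_fraction):
    """Convert a minority share of all rows into a minority/majority ratio."""
    return target_fraction / (1 - target_fraction)


# Hypothetical counts: 39 of 1,000 rows (3.9%) belong to the minority class.
target = Fraction(20, 100)  # exact arithmetic avoids float rounding surprises
extra = synthetic_rows_needed(39, 961, target)
print(extra)                                      # 202
print(float(minority_to_majority_ratio(target)))  # 0.25
```

Note that a 20% share of all rows corresponds to a 0.25 minority-to-majority ratio, not 0.20, because the synthetic rows also enlarge the total.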

In this Take3 iteration, we will construct and tune an XGBoost model. Furthermore, we will apply the XGBoost model to Kaggle’s test dataset and submit a list of predictions for evaluation.

ANALYSIS: In iteration Take1, the machine learning algorithms achieved an average baseline AUC of 67.94%. Two algorithms (Random Forest and Gradient Boosting) achieved the top AUC metrics after the first round of modeling. After a series of tuning trials, the Gradient Boosting model turned in a better overall result than Random Forest, with a higher AUC. The tuned Gradient Boosting model achieved a training AUC of 83.60% and scored an AUC of 83.57% on the test dataset, which was consistent with the training result. Lastly, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 82.15%.

In iteration Take2, the machine learning algorithms achieved an average baseline AUC of 87.90%. Two algorithms (Random Forest and Gradient Boosting) achieved the top AUC metrics after the first round of modeling. After a series of tuning trials, the Gradient Boosting model turned in a better overall result than Random Forest, with a higher AUC. The tuned Gradient Boosting model achieved a training AUC of 96.20% but scored an AUC of only 81.93% on the test dataset, a drop that indicated a high-variance (overfitting) problem. Lastly, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 81.17%.

In this Take3 iteration, the XGBoost model achieved a baseline AUC of 83.97%. After a series of tuning trials, the XGBoost model scored an AUC of 83.98% on the test dataset, which was consistent with the training result. Lastly, when we applied the XGBoost model to the test dataset from Kaggle, we obtained a ROC-AUC score of 82.42%.
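To make the comparison across the three iterations easier to see, the sketch below tabulates the AUC figures quoted above and computes the gap between the training-side and test-side results. The 1-point threshold for flagging high variance is an arbitrary assumption for illustration, not a rule from this project. For Take3, the training-side figure used is the 83.97% baseline AUC, since no separate tuned training AUC is reported.

```python
# AUC figures (in percent) as reported in the analysis:
# (training-side AUC, test-dataset AUC, Kaggle ROC-AUC)
results = {
    "Take1 Gradient Boosting": (83.60, 83.57, 82.15),
    "Take2 Gradient Boosting + SMOTE": (96.20, 81.93, 81.17),
    "Take3 XGBoost": (83.97, 83.98, 82.42),  # training side = baseline AUC
}

# Hypothetical threshold: flag a model when training exceeds test by > 1 point.
GAP_THRESHOLD = 1.0


def train_test_gap(train_auc, test_auc):
    """Difference in percentage points; a large positive gap suggests overfitting."""
    return round(train_auc - test_auc, 2)


gaps = {name: train_test_gap(tr, te) for name, (tr, te, _) in results.items()}
flagged = [name for name, gap in gaps.items() if gap > GAP_THRESHOLD]
best_on_kaggle = max(results, key=lambda name: results[name][2])

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.2f} points")
print("High variance suspected:", flagged)
print("Best Kaggle score:", best_on_kaggle)
```

Only the SMOTE-based Take2 model shows a large gap (about 14 points), which matches the high-variance reading above, while the Take1 and Take3 gaps are within a few hundredths of a point.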

CONCLUSION: In this iteration, the XGBoost model achieved the best overall result of the three iterations: its Kaggle ROC-AUC score of 82.42% exceeded the 82.15% from Take1 and the 81.17% from Take2. For this dataset, we should consider XGBoost and other machine learning algorithms for further modeling and testing.

Dataset Used: Santander Customer Satisfaction Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-satisfaction/overview

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard

The HTML formatted report can be found here on GitHub.