Binary Classification Model for Customer Transaction Prediction Using Python (XGBoost Tuning Batch #2)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification problem in which we try to predict one of two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank continually challenges its machine learning algorithms to find more accurate ways of answering its most common questions, such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the eXtreme Gradient Boosting (XGBoost) algorithm combined with the Synthetic Minority Over-sampling Technique (SMOTE) to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
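To make the two ideas in the paragraph above concrete, here is a minimal pure-Python sketch of SMOTE-style interpolation and of the area under the ROC curve. It is an illustration only, not the code used in this project: the function names smote_oversample and roc_auc, the neighbour count k=5, and the toy data are all assumptions made for the example. In practice, libraries such as imbalanced-learn and scikit-learn provide tested implementations of both.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Assumes `minority` holds at least two rows of equal length.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base row (excluding itself)
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def roc_auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        # tied scores share the average of their 1-based ranks
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data (hypothetical): four minority rows with two numerical attributes.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = smote_oversample(toy_minority, n_new=10, k=2)
print(len(toy_synthetic))  # 10

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

One design point worth noting: over-sampling is normally applied to the training portion only, so that the validation score is computed on rows that were actually observed.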

ANALYSIS: We tried different values for the max_depth, min_child_weight, subsample, and colsample_bytree parameters while holding n_estimators fixed at either 1000 or 100. The max_depth values were 10, 15, 20, and 25. The min_child_weight values ranged from 3 to 5 and were tested with different learning rates. The subsample and colsample_bytree values ranged from 0.6 to 1.0. The following output files are available for comparison.

  • py-classification-santander-kaggle-XGB-take11
  • py-classification-santander-kaggle-XGB-take12
  • py-classification-santander-kaggle-XGB-take13
  • py-classification-santander-kaggle-XGB-take14
  • py-classification-santander-kaggle-XGB-take15
  • py-classification-santander-kaggle-XGB-take16
  • py-classification-santander-kaggle-XGB-take17
  • py-classification-santander-kaggle-XGB-take18
  • py-classification-santander-kaggle-XGB-take19
  • py-classification-santander-kaggle-XGB-take21
  • py-classification-santander-kaggle-XGB-take22
  • py-classification-santander-kaggle-XGB-take23
  • py-classification-santander-kaggle-XGB-take24
  • py-classification-santander-kaggle-XGB-take25
  • py-classification-santander-kaggle-XGB-take26
  • py-classification-santander-kaggle-XGB-take27
  • py-classification-santander-kaggle-XGB-take28
  • py-classification-santander-kaggle-XGB-take29
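
The parameter ranges described in the ANALYSIS section can be sketched as a set of candidate configurations. This is a hedged illustration rather than the actual search that produced the files above: the ANALYSIS text gives only the end points for min_child_weight, subsample, and colsample_bytree, so the intermediate values below are assumptions, and the text does not say that every combination was run. The helper name candidate_params is also hypothetical.

```python
from itertools import product

# Values stated in the ANALYSIS text.
MAX_DEPTH = [10, 15, 20, 25]

# Assumed spacing: the text gives only the end points of these ranges.
MIN_CHILD_WEIGHT = [3, 4, 5]
SUBSAMPLE = [0.6, 0.7, 0.8, 0.9, 1.0]
COLSAMPLE_BYTREE = [0.6, 0.7, 0.8, 0.9, 1.0]


def candidate_params(n_estimators=1000):
    """Yield one keyword dictionary per parameter combination.

    n_estimators is held fixed for a whole batch (1000 or 100 in the text).
    """
    for depth, weight, sub, col in product(
        MAX_DEPTH, MIN_CHILD_WEIGHT, SUBSAMPLE, COLSAMPLE_BYTREE
    ):
        yield {
            "n_estimators": n_estimators,
            "max_depth": depth,
            "min_child_weight": weight,
            "subsample": sub,
            "colsample_bytree": col,
        }


grid = list(candidate_params(n_estimators=1000))
print(len(grid))  # 4 * 3 * 5 * 5 = 300
print(grid[0])
```

Each dictionary uses XGBoost's scikit-learn-style parameter names, so it could be unpacked into a classifier constructor. A full Cartesian product of 300 fits is expensive at these tree depths, which is one reason to tune a few parameters at a time instead.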

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference:

One potential source of performance benchmark:

The HTML-formatted report can be found here on GitHub.