Binary Classification Model for BNP Paribas Cardif Claims Management Using Scikit-Learn Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The BNP Paribas Cardif Claims Management dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: As a global specialist in personal insurance, BNP Paribas Cardif sponsored a Kaggle competition to help them identify the categories of claims. In a world shaped by the emergence of new practices and behaviors generated by the digital economy, BNP Paribas Cardif would like to streamline its claims management practice. In this Kaggle challenge, the company challenged the participants to predict the category of a claim based on features available early in the process. Better predictions can help BNP Paribas Cardif accelerate its claims process and therefore provide a better service to its customers.

In iteration Take1, we constructed and tuned several machine learning models using the Scikit-learn library. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and submitted a list of predictions for evaluation.
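The Take1 spot-check compared candidate algorithms with cross-validated log loss. A minimal sketch of that kind of comparison appears below, assuming the prepared features and target are already available as `X_train` and `y_train`; the specific algorithms and hyperparameters shown are illustrative assumptions, not the exact configurations used.

```python
# Spot-check several scikit-learn classifiers with cross-validated log loss.
# X_train and y_train are assumed to be the prepared features and target.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
}

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    # scikit-learn returns negated log loss, so flip the sign for display
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="neg_log_loss")
    print(f"{name}: log loss = {-scores.mean():.4f} (std {scores.std():.4f})")
```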

In this Take2 iteration, we will construct and tune an XGBoost model. Furthermore, we will apply the XGBoost model to Kaggle’s test dataset and submit a list of predictions for evaluation.
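As a sketch of that workflow, the snippet below tunes an XGBoost classifier with a scikit-learn grid search scored on log loss; the parameter grid values are illustrative assumptions rather than the exact settings used in this iteration.

```python
# Tune an XGBoost classifier with grid search, scored on log loss.
from sklearn.model_selection import GridSearchCV, KFold
from xgboost import XGBClassifier

model = XGBClassifier(eval_metric="logloss", random_state=42)
param_grid = {
    "n_estimators": [100, 300, 500],  # assumed search ranges
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
}

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
grid = GridSearchCV(model, param_grid, scoring="neg_log_loss", cv=kfold, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print(f"Best log loss: {-grid_result.best_score_:.4f} using {grid_result.best_params_}")
```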

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average log loss of 0.6422. Two algorithms (Logistic Regression and Random Forest) produced the top log loss metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the better overall result, achieving a log loss metric of 0.4722. When configured with the optimized parameters, the Random Forest model processed the validation dataset with a log loss of 0.4706, which was consistent with the model training phase. When we applied the Random Forest model to Kaggle’s test dataset, we obtained a log loss score of 0.4635.

From this Take2 iteration, the baseline performance of the XGBoost model achieved a log loss of 0.4706. After a series of tuning trials, the XGBoost model reached a log loss metric of 0.4650. When configured with the optimized parameters, the XGBoost model processed the validation dataset with a log loss of 0.4674, which was consistent with the model training phase. When we applied the XGBoost model to Kaggle’s test dataset, we obtained a log loss score of 0.4634.
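The validation check and Kaggle submission steps described above could look like the sketch below, assuming the tuned estimator from the grid search, a held-out split in `X_val`/`y_val`, Kaggle's test features in `X_test`, and their IDs in `test_ids`; the submission columns follow the competition's sample file format.

```python
# Validate the tuned model, then score Kaggle's test set and write a submission.
import pandas as pd
from sklearn.metrics import log_loss

best_model = grid_result.best_estimator_

# Confirm the tuned model generalizes to the held-out validation set.
val_probs = best_model.predict_proba(X_val)[:, 1]
print(f"Validation log loss: {log_loss(y_val, val_probs):.4f}")

# Predict class-1 probabilities for the test set and save in Kaggle's format.
test_probs = best_model.predict_proba(X_test)[:, 1]
submission = pd.DataFrame({"ID": test_ids, "PredictedProb": test_probs})
submission.to_csv("submission.csv", index=False)
```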

CONCLUSION: For this iteration, the XGBoost model achieved the best overall results using the training and test datasets. For this dataset, we should consider further modeling with the XGBoost algorithm.

Dataset Used: BNP Paribas Cardif Claims Management Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes
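Because the attributes mix numerical and categorical types, and the dataset contains many missing values, some preprocessing is required before modeling. The sketch below shows one plausible approach using median/mode imputation and one-hot encoding; the column names follow the competition's train.csv layout, and the imputation choices are assumptions.

```python
# Impute and encode the mixed numerical/categorical attributes.
# df is assumed to be the raw training frame with "ID" and "target" columns.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = df.drop(columns=["ID", "target"])
y = df["target"]

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

X_train = preprocess.fit_transform(X)
```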

Dataset Reference: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/overview

One potential source of performance benchmarks: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard

The HTML-formatted report can be found on GitHub.