Binary Classification Model for Homesite Quote Conversion Using Scikit-Learn, Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Homesite Quote Conversion dataset presents a binary classification problem in which we try to predict which of two possible outcomes will occur.

INTRODUCTION: Homesite, a leading provider of homeowners’ insurance, is looking for a dynamic conversion-rate model that can indicate whether a quoted price will lead to a purchase. Using an anonymized database of customer and sales-activity information, the goal of this exercise is to predict which customers will purchase a given quote. Accurate prediction of conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
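For reference, the sketch below shows how this metric is computed with scikit-learn's roc_auc_score. It is a minimal, self-contained illustration: the synthetic data and the logistic regression model are placeholders, not part of the project's actual code.

```python
# Minimal sketch of the competition metric: ROC-AUC between the predicted
# probability of conversion and the observed binary target.
# The synthetic data below is only a stand-in for the Homesite dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC-AUC is computed on the predicted probability of the positive class,
# not on the hard 0/1 class labels.
proba = model.predict_proba(X_valid)[:, 1]
print(f"Validation ROC-AUC: {roc_auc_score(y_valid, proba):.4f}")
```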

In this Take 1 iteration, we will construct and tune several machine learning models using the Scikit-learn library. We will then apply the best-performing model to Kaggle’s test dataset and submit its predictions for evaluation.
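The sketch below illustrates the kind of spot-checking step implied here: several scikit-learn classifiers compared with stratified cross-validation scored by ROC-AUC. The algorithm list, fold count, and synthetic data are assumptions for illustration, not the exact configuration used in this project.

```python
# Illustrative spot-check of several scikit-learn classifiers using
# stratified k-fold cross-validation scored by ROC-AUC.
# Synthetic data stands in for the prepared Homesite training set.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc", n_jobs=-1)
    print(f"{name}: mean ROC-AUC {scores.mean():.4f} (std {scores.std():.4f})")
```

The best-performing algorithm from a comparison like this is the one carried forward into tuning and the Kaggle submission.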

ANALYSIS: In this Take 1 iteration, the machine learning algorithms achieved an average ROC-AUC of 92.02%. Two algorithms (Random Forest and Gradient Boosting) produced the top ROC-AUC scores after the first round of modeling. After a series of tuning trials, Gradient Boosting delivered a better overall result than Random Forest, achieving a ROC-AUC of 96.43%. When configured with the optimized parameters, the Gradient Boosting model processed our test dataset with a ROC-AUC of 96.42%. When we applied the same model to Kaggle’s test dataset, we obtained a ROC-AUC score of 96.52%.
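As an illustration of the tuning-and-submission workflow described above, the sketch below tunes a Gradient Boosting model with a GridSearchCV-style search and writes a Kaggle-style submission file. The parameter grid, synthetic data, and placeholder quote IDs are assumptions, and the submission columns (QuoteNumber and QuoteConversion_Flag) are assumed to match the competition's sample submission; none of this is the project's actual tuning configuration.

```python
# Illustrative hyperparameter tuning of Gradient Boosting with GridSearchCV,
# followed by writing a Kaggle-style submission file.
# Synthetic data, the parameter grid, and the quote IDs are placeholders.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best CV ROC-AUC:", round(search.best_score_, 4), "with", search.best_params_)

# Kaggle expects a CSV with the quote identifier and the predicted probability
# of conversion (assumed columns: QuoteNumber, QuoteConversion_Flag).
test_proba = search.best_estimator_.predict_proba(X_test)[:, 1]
submission = pd.DataFrame({
    "QuoteNumber": np.arange(1, len(test_proba) + 1),  # placeholder IDs
    "QuoteConversion_Flag": test_proba,
})
submission.to_csv("submission.csv", index=False)
```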

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall results on the training and test datasets. For this dataset, we should continue to experiment with Gradient Boosting and other algorithms in future modeling efforts.

Dataset Used: Homesite Quote Conversion Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes
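Because the dataset mixes numerical and categorical attributes, the features need to be encoded before most scikit-learn estimators can use them. The sketch below shows one minimal way to do this with a ColumnTransformer; the toy DataFrame and its column names are hypothetical stand-ins for the actual Homesite fields, and one-hot encoding is only one of several reasonable encoding choices.

```python
# Illustrative preprocessing for a table that mixes numerical and categorical
# attributes: impute and scale the numerics, one-hot encode the categoricals.
# The toy DataFrame and its column names are hypothetical, not Homesite fields.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "coverage_code": ["B", "F", "B", "K"],   # categorical (hypothetical)
    "sales_channel": ["T", "Y", "T", "Y"],   # categorical (hypothetical)
    "premium_ratio": [0.9, 1.3, 0.7, 1.1],   # numerical (hypothetical)
    "policy_count": [2, 5, 3, 4],            # numerical (hypothetical)
})

categorical_cols = df.select_dtypes(include="object").columns
numerical_cols = df.select_dtypes(exclude="object").columns

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numerical_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)  # rows x (scaled numeric + one-hot encoded categorical) columns
```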

Dataset Reference: https://www.kaggle.com/c/homesite-quote-conversion/data

One potential source of performance benchmarks: https://www.kaggle.com/c/homesite-quote-conversion/leaderboard

The HTML-formatted report can be found here on GitHub.