Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Homesite Quote Conversion dataset is a binary classification problem in which we try to predict one of two possible outcomes.
INTRODUCTION: Homesite, a leading provider of homeowners’ insurance, is looking for a dynamic conversion-rate model that can indicate whether a quoted price will lead to a purchase. Using an anonymized database of customer and sales activity, the goal of this exercise is to predict which customers will purchase a given quote. Accurate prediction of conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. Submissions are evaluated on the area under the ROC curve (ROC-AUC) between the predicted probability and the observed target.
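As a quick illustration of the evaluation metric, ROC-AUC scores how well the predicted probabilities rank positive cases above negative ones. The labels and probabilities below are hypothetical placeholders, not competition data.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical 0/1 targets and predicted purchase probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.2, 0.4, 0.7, 0.65, 0.3, 0.9]

# 0.5 is chance-level ranking; 1.0 means every positive outranks every negative.
print(roc_auc_score(y_true, y_prob))  # prints 1.0 for this toy example
```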
In iteration Take1, we constructed and tuned several machine learning models using the Scikit-learn library. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and submitted a list of predictions for evaluation.
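A minimal sketch of a Take1-style spot check appears below: cross-validated ROC-AUC for a handful of Scikit-learn classifiers. The synthetic data, model list, and parameters are stand-ins for illustration, not the exact Take1 configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for the encoded Homesite training features and QuoteConversion_Flag target.
X, y = make_classification(n_samples=2000, n_features=20, random_state=7)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=7),
    "Gradient Boosting": GradientBoostingClassifier(random_state=7),
}

# Stratified folds keep the class balance consistent across splits.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean ROC-AUC {scores.mean():.4f} (std {scores.std():.4f})")
```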
In iteration Take2, we constructed and tuned an XGBoost model for this dataset and observed the best result we could obtain with the training dataset. Furthermore, we applied the XGBoost model to Kaggle’s test dataset and submitted a list of predictions for evaluation.
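The sketch below shows one common way to tune an XGBoost classifier with a grid search scored by cross-validated ROC-AUC. The parameter grid and synthetic data are illustrative assumptions; the actual Take2 tuning ranges are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Stand-in data for the encoded Homesite features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=7)

# An illustrative grid over a few influential hyperparameters.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    estimator=XGBClassifier(eval_metric="logloss", random_state=7),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=7),
    n_jobs=-1,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated ROC-AUC:", round(search.best_score_, 4))
```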
In this Take3 iteration, we will construct several Multilayer Perceptron (MLP) models with one hidden layer. We will also observe the best result we can obtain from these one-hidden-layer models. Furthermore, we will apply the best MLP model to Kaggle’s test dataset and submit a list of predictions for evaluation.
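Below is a minimal sketch of such a one-hidden-layer MLP, assuming a Keras/TensorFlow implementation (the framework is an assumption; the document does not name it). The hidden-layer width is the knob varied across the configurations reported later (32 to 512 nodes, trained for 50 epochs).

```python
import tensorflow as tf

def build_mlp(n_features: int, n_nodes: int) -> tf.keras.Model:
    """One hidden layer of n_nodes ReLU units, sigmoid output for a binary target."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(n_nodes, activation="relu"),  # the single hidden layer
        tf.keras.layers.Dense(1, activation="sigmoid"),     # probability of conversion
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="roc_auc")],
    )
    return model

# Example usage with the 512-node configuration and 50 epochs as in this iteration.
# X_train, y_train, X_test, y_test are assumed to be the prepared Homesite splits.
# model = build_mlp(n_features=X_train.shape[1], n_nodes=512)
# model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))
```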
ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average ROC-AUC of 92.02%. Two algorithms (Random Forest and Gradient Boosting) produced the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in a better overall result than Random Forest, achieving a ROC-AUC of 96.43%. When configured with the optimized parameters, Gradient Boosting processed the test dataset with a ROC-AUC of 96.42%. Moreover, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 96.52%.
In iteration Take2, the XGBoost algorithm achieved a baseline average ROC-AUC of 95.78%. After a series of tuning trials, XGBoost achieved a ROC-AUC of 96.48%. When configured with the optimized parameters, XGBoost processed the test dataset with a ROC-AUC of 96.46%. Moreover, when we applied the XGBoost model to the test dataset from Kaggle, we obtained a ROC-AUC score of 96.60%.
In this Take3 iteration, all one-hidden-layer models achieved ROC-AUC scores between 94.1% and 95.4% on the test dataset after 50 epochs. The 512-node model appeared to have the highest ROC-AUC with low variance. Moreover, when we applied this single-layer neural network model to the test dataset from Kaggle, we obtained a ROC-AUC score of 95.546%. We also captured performance measurements for the other model configurations, listed below (a sketch of this width sweep follows the list).
- Single-Layer 32-Node MLP Model – ROC-AUC: 94.424%
- Single-Layer 48-Node MLP Model – ROC-AUC: 95.309%
- Single-Layer 64-Node MLP Model – ROC-AUC: 95.296%
- Single-Layer 96-Node MLP Model – ROC-AUC: 95.537%
- Single-Layer 128-Node MLP Model – ROC-AUC: 95.339%
- Single-Layer 192-Node MLP Model – ROC-AUC: 95.470%
- Single-Layer 256-Node MLP Model – ROC-AUC: 95.399%
- Single-Layer 384-Node MLP Model – ROC-AUC: 95.480%
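The following is a hedged sketch of the hidden-layer width sweep behind the figures above, run here on synthetic stand-in data. The real runs trained on the prepared Homesite features for 50 epochs and were also scored against Kaggle's test dataset.

```python
import tensorflow as tf
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Homesite features and binary target.
X, y = make_classification(n_samples=2000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

# Sweep the hidden-layer widths reported above (plus the 512-node configuration).
for n_nodes in [32, 48, 64, 96, 128, 192, 256, 384, 512]:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(n_nodes, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
    y_prob = model.predict(X_test, verbose=0).ravel()
    print(f"{n_nodes:>3} nodes: test ROC-AUC {roc_auc_score(y_test, y_prob):.4f}")
```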
CONCLUSION: For this iteration, the baseline model with a single hidden layer of 512 nodes appeared to yield the best result. For this dataset, we should consider experimenting with more and different MLP architectures.
Dataset Used: Homesite Quote Conversion Data Set
Dataset ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: https://www.kaggle.com/c/homesite-quote-conversion/data
One potential source of performance benchmarks: https://www.kaggle.com/c/homesite-quote-conversion/leaderboard
The HTML-formatted report can be found on GitHub.