Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery (http://machinelearningmastery.com/)
Dataset Used: Bank Marketing Data Set
Data Set ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: http://archive.ics.uci.edu/ml/datasets/bank+marketing
One source of potential performance benchmarks: https://www.kaggle.com/rouseguy/bankbalanced
INTRODUCTION: The Bank Marketing dataset involves predicting the whether the bank clients will subscribe (yes/no) a term deposit (target variable). It is a binary (2-class) classification problem. There are over 45,000 observations with 16 input variables and 1 output variable. There are no missing values within the dataset.
CONCLUSION: The take No.2 version of this banking dataset aims to test the removal of one attribute from the dataset and the effect. You can see the results from the take No.1 here on the website.
The data removed was the “duration” attribute. According to the dataset documentation, this attribute highly affects the output target (e.g., if duration=0 then y=“no”). However, the duration is not known before a call is performed. Also, after the end of the call, the target variable is naturally identified. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
The baseline performance of the seven algorithms achieved an average accuracy of 89.22% (vs. 89.99% from the take No.1). Three algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy and Kappa scores during the initial modeling round. After a series of tuning trials with these three algorithms, Stochastic Gradient Boosting achieved the top accuracy/Kappa result using the training data. It produced an average accuracy of 89.46% (vs. 90.63% from the take No.1) using the training data.
Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 89.18%, which was sufficiently close to the training result. For this project, the Stochastic Gradient Boosting ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm. The elimination of the “duration” attribute did not seem to have a substantial adverse effect on the overall accuracy of the prediction models.
The HTML formatted report can be found here on GitHub.