Binary Classification Model for Census Income Using Python Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset poses a classic binary classification problem, where we try to predict one of two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.
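As a rough illustration of the extraction conditions quoted above, the filter can be sketched in plain Python. The field names AAGE, AGI, AFNLWGT, and HRSWK come from the quoted expression; the sample records and the helper name `is_clean_record` are hypothetical and are not part of the original extraction code.

```python
# Sketch of the record filter ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
# The records below are made-up examples, not rows from the Census data.

def is_clean_record(record):
    """Return True when a record satisfies all four extraction conditions."""
    return (
        record["AAGE"] > 16
        and record["AGI"] > 100
        and record["AFNLWGT"] > 1
        and record["HRSWK"] > 0
    )

records = [
    {"AAGE": 39, "AGI": 5000, "AFNLWGT": 77516, "HRSWK": 40},  # passes all four
    {"AAGE": 16, "AGI": 5000, "AFNLWGT": 83311, "HRSWK": 20},  # fails AAGE > 16
    {"AAGE": 45, "AGI": 9000, "AFNLWGT": 21564, "HRSWK": 0},   # fails HRSWK > 0
]

clean_records = [r for r in records if is_clean_record(r)]
print(len(clean_records), "of", len(records), "records kept")
```

Note that every comparison is strict, so a person aged exactly 16 is excluded.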

This dataset has many cells with missing values, so in this iteration we impute the missing cells with a default value before modeling. The results will then be compared against the baseline models from Take 1.
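The imputation step described above might look something like the following pandas sketch. It assumes missing cells are marked with a "?" string and uses a placeholder label of "Unknown" as the default value; both are assumptions for illustration, since the text does not say which marker or default the project actually used. The tiny DataFrame is made up.

```python
import pandas as pd

MISSING_MARKER = "?"       # assumption: how missing cells appear in the raw data
DEFAULT_LABEL = "Unknown"  # hypothetical default value chosen for this sketch

# Made-up sample rows standing in for the real dataset.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "?"],
    "occupation": ["Adm-clerical", "?", "Handlers-cleaners", "Sales"],
})

def impute_default(frame, marker=MISSING_MARKER, default=DEFAULT_LABEL):
    """Return a copy of the frame with every missing marker replaced by the default."""
    return frame.replace(marker, default)

missing_before = int((df == MISSING_MARKER).sum().sum())
imputed = impute_default(df)
missing_after = int((imputed == MISSING_MARKER).sum().sum())

print("missing cells before:", missing_before)
print("missing cells after:", missing_after)
```

A constant default keeps every row in the training set; other strategies, such as filling with the most frequent value, are possible but are not what this sketch shows.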

CONCLUSION: From the previous iteration (Take 1), the baseline performance of the ten algorithms achieved an average accuracy of 81.37%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, with an average accuracy of 86.99%. With the optimized tuning parameters, Stochastic Gradient Boosting then scored an accuracy of 87.23% on the validation dataset, slightly better than its accuracy on the training data.

From this iteration (Take 2), the baseline performance of the ten algorithms achieved an average accuracy of 81.93%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) again achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, with an average accuracy of 87.31%. With the optimized tuning parameters, Stochastic Gradient Boosting then scored an accuracy of 87.57% on the validation dataset, slightly better than its accuracy on the training data.

For this project, imputing the missing values appears to have contributed to a slight improvement in accuracy: compared with Take 1, the baseline average rose from 81.37% to 81.93%, the tuned Stochastic Gradient Boosting training accuracy rose from 86.99% to 87.31%, and its validation accuracy rose from 87.23% to 87.57%. The Stochastic Gradient Boosting ensemble algorithm continued to yield the top training and validation results, which justifies the additional processing the algorithm requires.
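The kind of comparison summarized above, scoring several ensemble algorithms by cross-validated accuracy, can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Census Income data, covers only the four ensemble algorithms named in the text, and the hyperparameters, fold count, and seed are arbitrary choices, not the project's settings. In scikit-learn, setting `subsample` below 1.0 on `GradientBoostingClassifier` is what makes the boosting stochastic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 7  # arbitrary seed so the sketch is repeatable

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=SEED)

# Hyperparameters here are illustrative, not the tuned values from the project.
models = {
    # BaggingClassifier uses a decision tree as its default base estimator.
    "Bagged CART": BaggingClassifier(n_estimators=25, random_state=SEED),
    "Random Forest": RandomForestClassifier(n_estimators=25, random_state=SEED),
    "AdaBoost": AdaBoostClassifier(n_estimators=25, random_state=SEED),
    "Stochastic Gradient Boosting": GradientBoostingClassifier(
        n_estimators=25, subsample=0.8, random_state=SEED
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Mean cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, accuracy in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {accuracy:.4f}")
```

Because the data is synthetic, the printed accuracies say nothing about the Census Income results; the sketch only shows the shape of the comparison.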

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Census+Income

One potential source of performance benchmarks: https://www.kaggle.com/uciml/adult-census-income

The HTML formatted report can be found here on GitHub.