Binary Classification Model for Breast Cancer Wisconsin (Original) Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Breast Cancer Wisconsin dataset presents a binary classification problem, where we try to predict one of two possible outcomes.

INTRODUCTION: The dataset contains measurements of breast tissue samples collected for cancer diagnosis, such as the thickness of the clump, the uniformity of cell size and shape, the marginal adhesion, and so on. Dr. William H. Wolberg of the University of Wisconsin Hospitals in Madison is the original provider of this dataset.
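To make the shape of the data concrete, the following is a minimal sketch of parsing rows in the layout used by the UCI distribution of this dataset: a sample ID, nine integer attributes, and a class code, with "?" marking a missing value and the class coded as 2 (benign) or 4 (malignant). The snake_case column names and the `parse_rows` helper are illustrative labels rather than names from the project, and the three sample rows are invented for the example, not taken from the real file.

```python
import csv
import io

# Illustrative column labels (assumed, not from the project itself).
COLUMNS = [
    "sample_id", "clump_thickness", "cell_size_uniformity",
    "cell_shape_uniformity", "marginal_adhesion",
    "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "target",
]

# Made-up rows in the same comma-separated layout; "?" marks a missing value.
SAMPLE = """\
1,5,1,1,1,2,1,3,1,1,2
2,8,10,10,8,7,10,9,7,1,4
3,6,8,8,1,3,?,3,7,1,2
"""


def parse_rows(text):
    """Parse raw rows into dicts with integer features and a 0/1 target."""
    records = []
    for raw in csv.reader(io.StringIO(text)):
        if not raw:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, raw))
        record = {"sample_id": row["sample_id"]}
        for name in COLUMNS[1:-1]:
            # Keep missing values as None so they can be imputed or dropped later.
            record[name] = None if row[name] == "?" else int(row[name])
        # Recode the class: 4 (malignant) becomes 1, 2 (benign) becomes 0.
        record["target"] = 1 if row["target"] == "4" else 0
        records.append(record)
    return records


records = parse_rows(SAMPLE)
```

Recoding the class to 0/1 and keeping missing entries as `None` is one possible choice; the project may have handled both differently.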

ANALYSIS: In the first round of modeling, the machine learning algorithms achieved an average baseline accuracy of 96.92%, with two algorithms (Logistic Regression and Gradient Boosting) producing the top accuracy metrics. After a series of tuning trials, Gradient Boosting achieved an accuracy of 97.51%. Using the optimized tuning parameters, the Gradient Boosting algorithm processed the test dataset with an accuracy of 94.28%, about 3.2 percentage points below the prediction accuracy from the training data.
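The workflow described above (cross-validated baselines, a tuning pass, then a final check on held-out data) might look roughly like the following scikit-learn sketch. It is not the project's actual script: the seed, split ratio, fold count, and parameter grid are all assumptions, and because scikit-learn does not bundle the Original dataset, the example uses `load_breast_cancer` (the Diagnostic variant) purely as a stand-in, so its numbers will not match those reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 888  # arbitrary seed, assumed for reproducibility

# Stand-in data: the Diagnostic variant, NOT the Original dataset used in the project.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Round 1: baseline accuracy of each candidate with default settings.
candidates = {
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "GradientBoosting": GradientBoostingClassifier(random_state=SEED),
}
baseline = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Round 2: tune Gradient Boosting over a small, assumed parameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=SEED),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Final check: score the refitted best model on the held-out test split.
test_accuracy = search.score(X_test, y_test)

for name, score in baseline.items():
    print(f"{name}: baseline CV accuracy {score:.4f}")
print(f"Best parameters: {search.best_params_}")
print(f"Tuned CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Keeping the test split out of both the baseline and tuning rounds is what makes the final figure comparable to the 94.28% reported above: it measures performance on data the model never saw during selection.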

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall results using the training and test datasets. For this dataset, Gradient Boosting should be considered for further modeling.

Dataset Used: Breast Cancer Wisconsin (Original) Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference:

The HTML-formatted report can be found here on GitHub.