Multi-Class Classification Model for Letter Recognition Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Letter Recognition Data Set poses a multi-class classification problem, where we are trying to predict one of several possible outcomes.

INTRODUCTION: The objective is to identify each of many black-and-white rectangular-pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15.
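The scaling step described above can be illustrated with a small sketch. The text does not say exactly how the original attributes were scaled, so this assumes plain linear (min-max) scaling followed by rounding; the function name `scale_to_int_range` and the sample values are made up for illustration.

```python
def scale_to_int_range(values, low=0, high=15):
    """Linearly rescale numbers to integers in [low, high] (min-max scaling).

    Assumption: the dataset's real scaling procedure is not documented here,
    so this is only one plausible way to land on integers from 0 through 15.
    """
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column has no spread; map every value to the bottom of the range.
        return [low for _ in values]
    span = vmax - vmin
    return [low + round((v - vmin) / span * (high - low)) for v in values]


# Hypothetical raw measurements for one attribute (for example, an edge count).
raw_attribute = [0.0, 2.5, 5.0, 7.5, 10.0]
scaled = scale_to_int_range(raw_attribute)
print(scaled)
```

Applied to each of the 16 attributes in turn, a step like this would give every stimulus a row of 16 small integers, which is the shape of data the models in this project work with.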

CONCLUSION: In the baseline round, the eight algorithms achieved an average accuracy of 80.98%. Three algorithms (k-Nearest Neighbors, Support Vector Machine, and Extra Trees) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data, with an average accuracy of 97.37%. Using the optimized tuning parameters, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 97.46%, which was even slightly better than its accuracy on the training data.

For this project, the Support Vector Machine algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.
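The comparison between an average training accuracy and a validation accuracy can be sketched as follows. This is not the project's code: it uses synthetic data, a tiny 1-nearest-neighbor classifier written with the standard library only, and it assumes a 70/30 split with 5-fold cross-validation, none of which is stated in the text. In practice a library such as scikit-learn would normally handle these steps.

```python
import random


def make_synthetic_data(n_per_class=40, n_features=16, classes="ABC", seed=7):
    """Toy stand-in for the letter data: noisy copies of one integer prototype per class."""
    rng = random.Random(seed)
    rows = []
    for label in classes:
        prototype = [rng.randint(0, 15) for _ in range(n_features)]
        for _ in range(n_per_class):
            # Add small noise and clamp back into the 0..15 range.
            features = [min(15, max(0, p + rng.randint(-2, 2))) for p in prototype]
            rows.append((features, label))
    rng.shuffle(rows)
    return rows


def predict_1nn(train_rows, features):
    """Return the label of the closest training sample (squared Euclidean distance)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train_rows, key=dist)[1]


def accuracy(train_rows, test_rows):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(predict_1nn(train_rows, f) == y for f, y in test_rows)
    return hits / len(test_rows)


def cross_val_accuracy(rows, k=5):
    """Average accuracy over k folds, each fold held out exactly once."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(training, folds[i]))
    return sum(scores) / k


data = make_synthetic_data()
split = int(len(data) * 0.7)  # assumed 70/30 split
train, validation = data[:split], data[split:]

cv_score = cross_val_accuracy(train)       # "average accuracy" on the training data
val_score = accuracy(train, validation)    # accuracy on data never used for tuning
print(f"cross-validated training accuracy: {cv_score:.2%}")
print(f"validation accuracy: {val_score:.2%}")
```

Because the validation rows are never seen during cross-validation, a validation score close to the cross-validated one (as with the 97.37% and 97.46% reported above) suggests the tuned model is not overfitting the training data.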

Dataset Used: Letter Recognition

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference:

One potential source of performance benchmarks:

The HTML-formatted report can be found here on GitHub.