Multi-Class Classification Model for Letter Recognition Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Letter Recognition DataSet is a multi-class classification situation where we are trying to predict one of the several possible outcomes.

INTRODUCTION: The objective is to identify each of many black-and-white rectangular-pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15.

CONCLUSION: The baseline performance of the eight algorithms achieved an average accuracy of 79.30%. Three algorithms (Bagged CART, Random Forest, and k-Nearest Neighbors) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved an average accuracy of 96.32%. Using the optimized tuning parameter available, the Random Forest algorithm processed the validation dataset with an accuracy of 96.45%, which was even slightly better the accuracy of the training data.

For this project, the Random Forest ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Letter Recognition

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference:

One potential source of performance benchmarks:

The HTML formatted report can be found here on GitHub.