Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities and Postural Transitions dataset is a classic multi-class classification problem in which we are trying to predict one of 12 possible outcomes.
INTRODUCTION: The research team carried out experiments with a group of 30 volunteers who performed a protocol composed of six basic activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs, and walking upstairs). The experiment also included the postural transitions that occurred between the static postures: stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand. All participants wore a smartphone on the waist during the experiment. The research team also video-recorded the activities to label the data manually. The team randomly partitioned the obtained data into two sets: 70% for training and 30% for testing.
In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the model that produced the best overall metrics. Because the dataset contained many attributes that were collinear with one another, we eliminated the attributes that had a collinearity measurement of 99% or higher. Iteration Take1 established the performance baseline for accuracy and processing time.
In the current iteration Take2, we will examine the feature selection technique of eliminating collinear features. We will perform iterative modeling at collinearity levels of 75%, 80%, 85%, 90%, and 95%. By eliminating the collinear features, we hope to decrease the processing time while maintaining a level of model accuracy comparable to iteration Take1, as sketched below.
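To make the elimination step concrete, here is a minimal sketch in Python, assuming the attributes live in a pandas DataFrame. The helper `drop_collinear_features` and the synthetic data are hypothetical stand-ins for the project's actual 561-attribute training set; the logic drops one attribute from each pair whose absolute pairwise correlation meets or exceeds the chosen threshold.

```python
import numpy as np
import pandas as pd

def drop_collinear_features(df, threshold):
    """Drop one attribute from each pair whose absolute Pearson
    correlation meets or exceeds `threshold` (keeps the first of the pair)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)

# Synthetic stand-in for the 561-attribute training set.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"f{i}" for i in range(5)])
X["f5"] = X["f0"] * 0.99 + rng.normal(scale=0.01, size=100)  # near-duplicate of f0

# Iterate over the collinearity levels used in this project.
for level in [0.75, 0.80, 0.85, 0.90, 0.95]:
    kept = drop_collinear_features(X, level)
    print(f"threshold {level:.2f}: {kept.shape[1]} attributes kept")
```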
ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 88.52%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top overall result and achieved an accuracy metric of 94.19%. By using the optimized parameters, the Linear Discriminant Analysis algorithm processed the testing dataset with an accuracy of 94.71%, which was even better than the accuracy from the training phase.
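The write-up does not name the modeling library, so the following sketch is an illustration only: it uses scikit-learn on synthetic 12-class data to mirror the workflow described above, i.e. a cross-validated baseline, a tuning pass, and a final check against the 30% hold-out set with Linear Discriminant Analysis. The dataset shape, parameter grid, and random seeds are assumptions, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the 12-class activity dataset.
X, y = make_classification(n_samples=600, n_features=50, n_informative=20,
                           n_classes=12, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)

# Baseline: 10-fold cross-validated accuracy with default parameters.
baseline = cross_val_score(LinearDiscriminantAnalysis(), X_train, y_train,
                           cv=10, scoring="accuracy")
print(f"baseline CV accuracy: {baseline.mean():.4f}")

# Tuning trial: grid-search over solver settings.
grid = GridSearchCV(LinearDiscriminantAnalysis(),
                    param_grid={"solver": ["svd", "lsqr"]},
                    cv=10, scoring="accuracy")
grid.fit(X_train, y_train)
print(f"tuned CV accuracy:    {grid.best_score_:.4f}")

# Final evaluation on the 30% hold-out set.
print(f"test accuracy:        {grid.score(X_test, y_test):.4f}")
```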
From the model-building perspective, the number of attributes decreased by 108, from 561 down to 453.
COL_75%: At the 75% collinearity level, the baseline performance of the machine learning algorithms achieved an average accuracy of 88.53%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top overall result and achieved an accuracy metric of 91.63%. By using the optimized parameters, the Stochastic Gradient Boosting algorithm processed the testing dataset with an accuracy of 92.53%, which was even better than the accuracy from the training phase.
From the model-building perspective, the number of attributes decreased by 408, from 561 down to 153. The processing time went from 7 hours 3 minutes in iteration Take1 down to 2 hours 48 minutes in Take2, which was a reduction of 60.2%.
COL_80%: At the 80% collinearity level, the baseline performance of the machine learning algorithms achieved an average accuracy of 85.96%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top overall result and achieved an accuracy metric of 91.86%. By using the optimized parameters, the Stochastic Gradient Boosting algorithm processed the testing dataset with an accuracy of 93.13%, which was even better than the accuracy from the training phase.
From the model-building perspective, the number of attributes decreased by 385, from 561 down to 176. The processing time went from 7 hours 3 minutes in iteration Take1 down to 3 hours 12 minutes in Take2, which was a reduction of 54.6%.
COL_85%: At the 85% collinearity level, the baseline performance of the machine learning algorithms achieved an average accuracy of 87.32%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top overall result and achieved an accuracy metric of 91.16%. By using the optimized parameters, the Stochastic Gradient Boosting algorithm processed the testing dataset with an accuracy of 92.47%, which was even better than the accuracy from the training phase.
From the model-building perspective, the number of attributes decreased by 362, from 561 down to 199. The processing time went from 7 hours 3 minutes in iteration Take1 down to 3 hours 37 minutes in Take2, which was a reduction of 48.6%.
COL_90%: At the 90% collinearity level, the baseline performance of the machine learning algorithms achieved an average accuracy of 87.22%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top overall result and achieved an accuracy metric of 91.52%. By using the optimized parameters, the Linear Discriminant Analysis algorithm processed the testing dataset with an accuracy of 92.82%, which was even better than the accuracy from the training phase.
From the model-building perspective, the number of attributes decreased by 338, from 561 down to 223. The processing time went from 7 hours 3 minutes in iteration Take1 down to 3 hours 53 minutes in Take2, which was a reduction of 44.9%.
COL_95%: At the 95% collinearity level, the baseline performance of the machine learning algorithms achieved an average accuracy of 88.04%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top overall result and achieved an accuracy metric of 92.32%. By using the optimized parameters, the Linear Discriminant Analysis algorithm processed the testing dataset with an accuracy of 93.89%, which was even better than the accuracy from the training phase.
From the model-building perspective, the number of attributes decreased by 278, from 561 down to 283. The processing time went from 7 hours 3 minutes in iteration Take1 down to 4 hours 31 minutes in Take2, which was a reduction of 35.9%.
CONCLUSION: For this iteration, eliminating collinear features at the 95% level and using the Linear Discriminant Analysis algorithm achieved the best overall results: a test accuracy of 93.89% with a 35.9% reduction in processing time, at a modest cost in accuracy compared with Take1's 94.71%. For this dataset, we should consider using the Linear Discriminant Analysis algorithm for further modeling or production use.
Dataset Used: Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set
Dataset ML Model: Multi-class classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions
The HTML-formatted report can be found on GitHub.