Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities and Postural Transitions dataset presents a classic multi-class classification problem where we are trying to predict one of 12 possible outcomes.
INTRODUCTION: The research team carried out experiments with a group of 30 volunteers who performed a protocol composed of six basic activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs, and walking upstairs). The experiment also included the postural transitions that occurred between the static postures: stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand. All participants wore a smartphone on the waist during the experiment, and the research team video-recorded the activities to label the data manually. The team randomly partitioned the obtained data into two sets, 70% for training and 30% for testing.
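Because the UCI download ships the data already split into training and testing folders, a minimal Python sketch for loading it with pandas might look like the following. The file paths are assumptions based on the archive layout (the original project scripts may differ), so adjust them to match a local copy.

```python
# Load the pre-split HAPT feature matrices and label vectors.
# Paths below assume the directory layout of the UCI archive.
import pandas as pd

X_train = pd.read_csv("Train/X_train.txt", sep=r"\s+", header=None)
y_train = pd.read_csv("Train/y_train.txt", header=None).squeeze("columns")
X_test = pd.read_csv("Test/X_test.txt", sep=r"\s+", header=None)
y_test = pd.read_csv("Test/y_test.txt", header=None).squeeze("columns")

print(X_train.shape)             # expect 561 feature columns
print(sorted(y_train.unique()))  # expect the 12 activity labels
```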
In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the model that produced the best overall metrics. Because the dataset had many attributes that were collinear with other attributes, we eliminated the attributes with a correlation of 99% or higher to another attribute. Iteration Take1 established the performance baseline for accuracy and processing time.
In iteration Take2, we examined the feature selection technique of eliminating collinear features. We performed iterative modeling at collinearity levels of 75%, 80%, 85%, 90%, and 95%. By eliminating the collinear features, we decreased the processing time while maintaining a comparable level of model accuracy compared to iteration Take1.
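To make the procedure concrete, here is a minimal sketch of correlation-based elimination, assuming the features sit in a pandas DataFrame named X_train as loaded above (the report's actual implementation may differ). The threshold parameter covers both the 99% cutoff from Take1 and the 75% to 95% sweep from Take2.

```python
# Drop one feature from every pair whose absolute correlation
# meets or exceeds the chosen threshold.
import numpy as np

def drop_collinear(X, threshold=0.99):
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return X.drop(columns=to_drop)

# Reproduce the Take1 cutoff and the Take2 sweep with the same routine.
for level in (0.75, 0.80, 0.85, 0.90, 0.95, 0.99):
    print(level, drop_collinear(X_train, threshold=level).shape[1], "attributes kept")
```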
In the current iteration Take3, we examine the feature selection technique of attribute importance ranking using the Random Forest algorithm. By selecting only the most important attributes, we hope to decrease the processing time while maintaining a level of accuracy similar to iteration Take1.
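As a sketch of how such a ranking can drive the selection, the snippet below fits a Random Forest and keeps the attributes whose importance clears a cutoff. It assumes the X_train/y_train objects from the earlier sketch; the "median" cutoff and the random seed are illustrative settings, not the values used in the report.

```python
# Rank attributes by Random Forest importance and keep the strongest ones.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

ranker = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
ranker.fit(X_train, y_train)

# "median" keeps the upper half by importance; purely illustrative here.
selector = SelectFromModel(ranker, threshold="median", prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train_sel.shape[1], "attributes selected out of", X_train.shape[1])
```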
ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 89.61%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top overall result and achieved an accuracy metric of 97.70%. Using the optimized parameters, the Stochastic Gradient Boosting algorithm processed the testing dataset with an accuracy of 92.85%, which was below the training accuracy and possibly due to overfitting.
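The tuning trials were of the kind sketched below, here using scikit-learn's GradientBoostingClassifier (stochastic when subsample is below 1.0) with a cross-validated grid search. The grid values, fold count, and seed are placeholders; the report does not list the exact parameters that were searched.

```python
# Illustrative tuning trial for Stochastic Gradient Boosting.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(subsample=0.8, random_state=42),
    param_grid,
    scoring="accuracy",
    cv=10,      # an assumption; the report does not state the fold count
    n_jobs=-1,
)
search.fit(X_train, y_train)  # use whichever reduced feature set the iteration produced
print(search.best_params_, search.best_score_)
```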
From the model-building perspective, the number of attributes decreased by 108, from 561 down to 453.
In iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 89.48%. Two algorithms (Random Forest and eXtreme Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result and achieved an accuracy metric of 98.31%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 94.97%, which was below the training accuracy and possibly due to overfitting.
From the model-building perspective, the number of attributes decreased by 278, from 561 down to 283. The processing time went from 10 hours 17 minutes in iteration Take1 down to 7 hours 49 minutes in Take2, which was a reduction of 23.9%.
In the current iteration Take3, the baseline performance of the machine learning algorithms achieved an average accuracy of 90.04%. Two algorithms (Random Forest and eXtreme Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result and achieved an accuracy metric of 98.35%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 93.86%, which was below the training accuracy and possibly due to overfitting.
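A hedged sketch of the final step follows: fitting a tuned eXtreme Gradient Boosting model with the xgboost package and scoring it on the held-out test set. The hyperparameter values are placeholders rather than the tuned values from the report, and X_train_sel/X_test_sel refer to the importance-selected feature matrices from the earlier sketch.

```python
# Fit the final model and evaluate on the held-out test set.
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

model = XGBClassifier(
    n_estimators=300,    # placeholder, not the tuned value from the report
    learning_rate=0.1,   # placeholder
    max_depth=5,         # placeholder
    objective="multi:softprob",
    eval_metric="mlogloss",
)
# xgboost expects class labels 0..11, while the HAPT labels run 1..12.
model.fit(X_train_sel, y_train - 1)
pred = model.predict(X_test_sel)
print("test accuracy:", accuracy_score(y_test - 1, pred))
```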
From the model-building perspective, the number of attributes decreased by 62, from 561 down to 499. The processing time went from 10 hours 17 minutes in iteration Take1 up to 16 hours 59 minutes in Take3, which was an increase of 65.1%.
CONCLUSION: For this iteration, the combination of the attribute importance ranking technique and the eXtreme Gradient Boosting algorithm achieved the best overall result. For this dataset, we should consider using the eXtreme Gradient Boosting algorithm for further modeling or production use.
Dataset Used: Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set
Dataset ML Model: Multi-class classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions
The HTML formatted report can be found here on GitHub.