Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities and Postural Transitions dataset is a classic multi-class classification problem in which we try to predict one of 12 possible outcomes.

INTRODUCTION: The research team carried out experiments with a group of 30 volunteers who performed a protocol of activities composed of six basic activities. There are three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs and walking upstairs). The experiment also included postural transitions that occurred between the static postures. These are stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand. All the participants were wearing a smartphone on the waist during the experiment execution. The research team also video-recorded the activities to label the data manually. The research team randomly partitioned the obtained data into two sets, 70% for the training data and 30% for the testing.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the model that produces the best overall metrics. Because the dataset has many attributes that were collinear with other attributes, we eliminated the attributes that have a collinearity measurement of 99% or higher. Iteration Take1 established the performance baseline for accuracy and processing time.

In the current iteration Take2, we will examine the feature selection technique of eliminating collinear features. We will perform iterative modeling at collinearity thresholds of 75%, 80%, 85%, 90%, and 95%. By eliminating the collinear features, we hope to decrease the processing time while maintaining a level of model accuracy comparable to iteration Take1.
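The thresholding idea can be sketched in a few lines. This is a minimal illustration only, not the script used in this project: the helper names `pearson` and `drop_collinear`, the greedy keep-first strategy, and the toy columns are all assumptions made for the example. A feature is kept only if its absolute Pearson correlation with every feature already kept stays below the chosen threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treat it as 0
    return cov / math.sqrt(var_x * var_y)

def drop_collinear(columns, threshold):
    """Greedy filter (hypothetical helper): keep a feature only if |r| with
    every previously kept feature is below `threshold`.

    `columns` maps a feature name to its list of values.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data, invented for illustration (not taken from the dataset).
toy = {
    "f1": [1, 2, 3, 4, 5, 6, 7, 8],
    "f2": [3, 5, 7, 9, 11, 13, 15, 17],   # exact linear function of f1 (r = 1)
    "f3": [2, 1, 4, 3, 6, 5, 8, 7],       # f1 plus alternating noise (r = 38/42)
    "f4": [1, -1, -1, 1, 1, -1, -1, 1],   # uncorrelated with f1 and f3
}

for threshold in (0.75, 0.90, 0.95):
    print(threshold, drop_collinear(toy, threshold))
```

In this toy example, `f3` correlates with `f1` at about 0.905, so it is dropped at the 75% and 90% thresholds but survives at 95%. That mirrors the pattern in this iteration, where lower thresholds remove more attributes. Other selection rules exist (for example, removing the member of each correlated pair with the larger mean correlation), and the result of a greedy filter depends on column order.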

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 89.61%. Two algorithms (Linear Discriminant Analysis and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top overall result and achieved an accuracy metric of 97.70%. Using the optimized parameters, the Stochastic Gradient Boosting algorithm processed the testing dataset with an accuracy of 92.85%, which was below the accuracy on the training data, possibly due to over-fitting.

From the model-building perspective, the number of attributes decreased by 108, from 561 down to 453.

COL_75%: In the current iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 89.75%. Two algorithms (Random Forest and eXtreme Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result and achieved an accuracy metric of 97.84%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 94.75%, which was below the accuracy on the training data, possibly due to over-fitting.

From the model-building perspective, the number of attributes decreased by 408, from 561 down to 153. The processing time went from 10 hours 17 minutes in iteration Take1 down to 4 hours 26 minutes in Take2, which was a reduction of about 56.9%.

COL_80%: In the current iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 90.27%. Two algorithms (Random Forest and eXtreme Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result and achieved an accuracy metric of 98.00%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 94.91%, which was below the accuracy on the training data, possibly due to over-fitting.

From the model-building perspective, the number of attributes decreased by 385, from 561 down to 176. The processing time went from 10 hours 17 minutes in iteration Take1 down to 4 hours 56 minutes in Take2, which was a reduction of about 52.0%.

COL_85%: In the current iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 89.31%. Two algorithms (Random Forest and eXtreme Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result and achieved an accuracy metric of 98.14%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 94.15%, which was below the accuracy on the training data, possibly due to over-fitting.

From the model-building perspective, the number of attributes decreased by 362, from 561 down to 199. The processing time went from 10 hours 17 minutes in iteration Take1 down to 5 hours 27 minutes in Take2, which was a reduction of 47.0%.

COL_90%: In the current iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 89.38%. Two algorithms (Random Forest and eXtreme Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result and achieved an accuracy metric of 98.17%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 94.12%, which was below the accuracy on the training data, possibly due to over-fitting.

From the model-building perspective, the number of attributes decreased by 338, from 561 down to 223. The processing time went from 10 hours 17 minutes in iteration Take1 down to 6 hours 2 minutes in Take2, which was a reduction of 41.3%.

COL_95%: In the current iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 89.48%. Two algorithms (Random Forest and eXtreme Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result and achieved an accuracy metric of 98.31%. Using the optimized parameters, the eXtreme Gradient Boosting algorithm processed the testing dataset with an accuracy of 94.97%, which was below the accuracy on the training data, possibly due to over-fitting.

From the model-building perspective, the number of attributes decreased by 278, from 561 down to 283. The processing time went from 10 hours 17 minutes in iteration Take1 down to 7 hours 49 minutes in Take2, which was a reduction of 23.9%.
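The attribute counts and time reductions quoted for each threshold follow from simple arithmetic, which the sketch below reproduces from the figures stated above. The helper names `to_minutes` and `reduction_pct` are made up for this example. Because the durations are only given to the minute, a recomputed percentage can differ in the last digit from a figure derived from more exact timings.

```python
TOTAL_ATTRIBUTES = 561  # attribute count before any elimination

def to_minutes(hours, minutes):
    """Convert an hours/minutes duration to whole minutes."""
    return hours * 60 + minutes

def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100

baseline = to_minutes(10, 17)  # Take1 processing time

# threshold label -> (attributes kept, processing time in minutes), as reported above
runs = {
    "COL_75%": (153, to_minutes(4, 26)),
    "COL_80%": (176, to_minutes(4, 56)),
    "COL_85%": (199, to_minutes(5, 27)),
    "COL_90%": (223, to_minutes(6, 2)),
    "COL_95%": (283, to_minutes(7, 49)),
}

for level, (kept, minutes) in runs.items():
    removed = TOTAL_ATTRIBUTES - kept
    saved = reduction_pct(baseline, minutes)
    print(f"{level}: {removed} attributes removed, {saved:.1f}% less processing time")
```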

CONCLUSION: For this iteration, eliminating collinear features at the 95% level and using the eXtreme Gradient Boosting algorithm achieved the best accuracy (98.31% in training and 94.97% on the testing dataset), although this level also produced the smallest reduction in processing time. For this dataset, we should consider using the eXtreme Gradient Boosting algorithm for further modeling or production use.
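As a quick cross-check of this conclusion, the reported accuracies can be tabulated and compared. The snippet is only a sketch over the numbers stated in the analysis; the dictionary layout and variable names are assumptions made for illustration.

```python
# label -> (training accuracy %, testing accuracy %), as reported in the analysis
results = {
    "Take1 (99%)": (97.70, 92.85),
    "COL_75%": (97.84, 94.75),
    "COL_80%": (98.00, 94.91),
    "COL_85%": (98.14, 94.15),
    "COL_90%": (98.17, 94.12),
    "COL_95%": (98.31, 94.97),
}

# Level with the highest accuracy on the testing dataset
best_level = max(results, key=lambda level: results[level][1])

# Gap between training and testing accuracy, a rough indicator of over-fitting
gaps = {level: round(train - test, 2) for level, (train, test) in results.items()}

print("Best testing accuracy:", best_level, results[best_level][1])
for level, gap in gaps.items():
    print(f"{level}: training/testing gap of {gap:.2f} percentage points")
```

The gap between training and testing accuracy is smallest at the 75% and 80% levels, so those thresholds may be worth a second look when processing time matters more than the last fraction of a percent in accuracy.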

Dataset Used: Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions

The HTML formatted report can be found here on GitHub.