SUMMARY: The purpose of this project is to construct an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model uses the 20-day and 50-day moving averages to generate trading signals. We apply the analysis to the MSFT stock for the three-year period of 2017-01-01 through 2019-12-31.
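As an illustration of the crossover rule described above, the following sketch computes 20-day and 50-day simple moving averages with pandas and emits +1 while the fast average sits above the slow one, -1 otherwise. The function name, column handling, and toy data are assumptions for this example, not the project's actual code.

```python
# Hedged sketch (not the project's code): SMA crossover signals.
import pandas as pd

def crossover_signals(close: pd.Series, fast: int = 20, slow: int = 50) -> pd.Series:
    """Return +1 while the fast SMA is above the slow SMA, -1 otherwise,
    and 0 until both rolling windows are fully populated."""
    fast_sma = close.rolling(fast).mean()
    slow_sma = close.rolling(slow).mean()
    signal = (fast_sma > slow_sma).astype(int) * 2 - 1
    signal[slow_sma.isna()] = 0   # no signal before 50 observations exist
    return signal
```

On a steadily rising price series, the 20-day average overtakes the 50-day average as soon as both windows fill, so the signal flips to +1 from day 50 onward.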

In this Take1 iteration, we will construct the code modules for downloading the daily price information for a stock symbol. We will use the acquired stock data to fit a trading model in future iterations of the project.

ANALYSIS: Not available yet. To be developed further.

CONCLUSION: Not available yet. To be developed further.

Dataset ML Model: Time series forecast with numerical attributes

Dataset Used: Various sources as illustrated below.

Dataset Reference: Various sources as documented below.

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Nonresidential Construction Spending dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly total nonresidential construction spending in the United States. The dataset describes a time-series of spending (in millions of dollars) over 18 years (2002-2020), and there are 217 observations. We used the first 80% of the observations for training various models while holding back the remaining observations for validating the final model.
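The chronological 80/20 split described above can be sketched in a few lines. The observation count (217) matches the text; the list of values is a placeholder stand-in for the actual FRED series, since a time-series split must preserve order rather than shuffle.

```python
# Minimal sketch of a chronological train/validation split.
# The values are placeholders for 217 monthly observations.
values = list(range(217))
split = int(len(values) * 0.80)          # index of the first validation point
train, validation = values[:split], values[split:]
```

Unlike a random split, this keeps every validation observation strictly later in time than every training observation, which is what makes the validation RMSE an honest estimate for forecasting.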

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 3,442. After performing a grid search for the optimal ARIMA parameters, the final ARIMA non-seasonal order was (3, 1, 4) with the seasonal order being (2, 1, 2, 12). Furthermore, the chosen model processed the validation data with an RMSE of 1,029, which was better than the baseline model as expected.
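The persistence baseline mentioned above simply predicts each value as the previous observation and scores the result with RMSE. The helper below is a hedged illustration of that idea on a toy series; the RMSE figures quoted in the text come from the real data.

```python
# Sketch of the persistence (naive) baseline: y_hat[t] = y[t-1].
import math

def persistence_rmse(series):
    preds = series[:-1]                  # shift the series by one step
    actuals = series[1:]
    sq_err = [(a - p) ** 2 for a, p in zip(actuals, preds)]
    return math.sqrt(sum(sq_err) / len(sq_err))
```

Any candidate model, such as the tuned ARIMA, has to beat this number to justify its added complexity.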

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result and should be considered for further modeling.

Dataset Used: Total Construction Spending: Nonresidential

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: https://fred.stlouisfed.org/series/TLNRESCON

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Rain in Australia dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset contains daily weather observations from numerous Australian weather stations. The target variable RainTomorrow represents whether it rained the next day. We should also exclude the variable Risk-MM when training a binary classification model. By not eliminating the Risk-MM feature, we run the risk of leaking the answers into our model and reducing its effectiveness.
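Dropping the leaky feature before modeling can be sketched as below. The two-row frame is made-up illustrative data; in the Kaggle file the column is spelled RISK_MM (tomorrow's rainfall amount), so adjust the name to match your copy.

```python
# Sketch: remove the leaky column and the target before training.
import pandas as pd

df = pd.DataFrame({
    "MinTemp": [13.4, 7.4],
    "RISK_MM": [0.0, 3.6],        # tomorrow's rainfall -- encodes the answer
    "RainTomorrow": ["No", "Yes"],
})
X = df.drop(columns=["RISK_MM", "RainTomorrow"])  # leakage-free features
y = df["RainTomorrow"]
```

Because RISK_MM is simply the numeric form of the RainTomorrow label, leaving it in would let the model "predict" the target trivially during training and then fail on real forecasts.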

In iteration Take1, we constructed several traditional machine learning models using the linear, non-linear, and ensemble techniques. We also observed the best accuracy score that we could obtain with each of these models.

In iteration Take2, we constructed and tuned an XGBoost machine learning model for this dataset. We also observed the best accuracy score that we could obtain with the XGBoost model.

In iteration Take3, we constructed several Multilayer Perceptron (MLP) models with one, two, and three hidden layers. The one-layer MLP model serves as the baseline model as we build more complex MLP models in future iterations.

In this Take4 iteration, we will tune the single-layer MLP model and see whether we can improve our accuracy score.

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 83.83%. Two algorithms (Extra Trees and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result than Extra Trees with a lower variance. Random Forest achieved an accuracy metric of 85.44%. When configured with the optimized parameters, the Random Forest algorithm processed the test dataset with an accuracy of 85.52%, which was consistent with the accuracy score from the training phase.

In iteration Take2, the XGBoost algorithm achieved a baseline accuracy of 84.69% by setting n_estimators to the default value of 100. After a series of tuning trials, XGBoost turned in an overall accuracy result of 86.21% with the n_estimators value set to 1000. When we applied the tuned XGBoost model to the test dataset, we obtained an accuracy score of 86.27%, which was consistent with the model performance from the training phase.

In iteration Take3, all one-layer models achieved an accuracy performance of around 86%. The eight-node model appears to overfit the least when compared with the 12-, 16-, and 20-node models. The single-layer eight-node model also seems to work better than the two- and three-layer models by processing the test dataset with an accuracy score of 86.10% after 20 epochs.

In this Take4 iteration, all models achieved an accuracy performance of around 86%. The model with the RMSprop optimizer produced the best accuracy score on the test dataset, reaching 86.23% after 20 epochs.

CONCLUSION: For this iteration, the single-layer eight-node MLP model produced an accuracy score comparable to the XGBoost model. For this dataset, we should consider doing more tuning with the XGBoost and MLP models.

Dataset Used: Rain in Australia Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

One potential source of performance benchmark: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/kernels

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Rain in Australia dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset contains daily weather observations from numerous Australian weather stations. The target variable RainTomorrow represents whether it rained the next day. We should also exclude the variable Risk-MM when training a binary classification model. By not eliminating the Risk-MM feature, we run the risk of leaking the answers into our model and reducing its effectiveness.

In iteration Take1, we constructed several traditional machine learning models using the linear, non-linear, and ensemble techniques. We also observed the best accuracy score that we could obtain with each of these models.

In iteration Take2, we constructed and tuned an XGBoost machine learning model for this dataset. We also observed the best accuracy score that we could obtain with the XGBoost model.

In this Take3 iteration, we will construct several Multilayer Perceptron (MLP) models with one, two, and three hidden layers. These simple MLP models will serve as the baseline models as we build more complex MLP models in future iterations.

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 83.83%. Two algorithms (Extra Trees and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result than Extra Trees with a lower variance. Random Forest achieved an accuracy metric of 85.44%. When configured with the optimized parameters, the Random Forest algorithm processed the test dataset with an accuracy of 85.52%, which was consistent with the accuracy score from the training phase.

In iteration Take2, the XGBoost algorithm achieved a baseline accuracy of 84.69% by setting n_estimators to the default value of 100. After a series of tuning trials, XGBoost turned in an overall accuracy result of 86.21% with the n_estimators value set to 1000. When we applied the tuned XGBoost model to the test dataset, we obtained an accuracy score of 86.27%, which was consistent with the model performance from the training phase.

In this Take3 iteration, all one-layer models achieved an accuracy performance of around 86%. The eight-node model appears to overfit the least when compared with the 12-, 16-, and 20-node models. The single-layer eight-node model also seems to work better than the two- and three-layer models by processing the test dataset with an accuracy score of 86.10% after 20 epochs.

CONCLUSION: For this iteration, the single-layer eight-node MLP model produced an accuracy score comparable to the XGBoost model. For this dataset, we should consider doing more tuning with the XGBoost and MLP models.

Dataset Used: Rain in Australia Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

One potential source of performance benchmark: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/kernels

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Rain in Australia dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset contains daily weather observations from numerous Australian weather stations. The target variable RainTomorrow represents whether it rained the next day. We should also exclude the variable Risk-MM when training a binary classification model. By not eliminating the Risk-MM feature, we run the risk of leaking the answers into our model and reducing its effectiveness.

In iteration Take1, we constructed several traditional machine learning models using the linear, non-linear, and ensemble techniques. We also observed the best accuracy score that we could obtain with each of these models.

In this Take2 iteration, we will construct and tune an XGBoost machine learning model for this dataset. We will observe the best accuracy score that we can obtain with the XGBoost model.

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 83.83%. Two algorithms (Extra Trees and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result than Extra Trees with a lower variance. Random Forest achieved an accuracy metric of 85.44%. When configured with the optimized parameters, the Random Forest algorithm processed the test dataset with an accuracy of 85.52%, which was consistent with the accuracy score from the training phase.

In this Take2 iteration, the XGBoost algorithm achieved a baseline accuracy of 84.69% by setting n_estimators to the default value of 100. After a series of tuning trials, XGBoost turned in an overall accuracy result of 86.21% with the n_estimators value set to 1000. When we applied the tuned XGBoost model to the test dataset, we obtained an accuracy score of 86.27%, which was consistent with the model performance from the training phase.
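The n_estimators sweep described above follows a standard cross-validated tuning pattern. The sketch below illustrates it with scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (swap in xgboost.XGBClassifier if it is installed); the synthetic dataset, candidate values, and seeds are all assumptions for the example.

```python
# Hedged illustration of tuning n_estimators via cross-validation,
# using a scikit-learn gradient booster as a stand-in for XGBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=7)
best_score, best_n = max(
    (cross_val_score(GradientBoostingClassifier(n_estimators=n, random_state=7),
                     X, y, cv=3, scoring="accuracy").mean(), n)
    for n in (50, 100, 200)
)
```

The winning n_estimators value is then used to refit the model on the full training set before scoring the held-out test data, mirroring the train-then-test sequence reported in the text.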

CONCLUSION: For this iteration, the XGBoost algorithm achieved the best overall result using the training and test datasets. For this dataset, XGBoost should be considered for further modeling.

Dataset Used: Rain in Australia Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

One potential source of performance benchmark: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/kernels

The HTML formatted report can be found here on GitHub.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 78.75%. Two algorithms (Extra Trees and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result than Extra Trees with a lower variance. Random Forest achieved an accuracy metric of 85.44%. When configured with the optimized parameters, the Random Forest algorithm processed the test dataset with an accuracy of 85.52%, which was consistent with the accuracy score from the training phase.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and test datasets. For this dataset, Random Forest should be considered for further modeling.

Dataset Used: Rain in Australia Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Vehicle Miles Traveled dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly number of vehicle miles traveled in the United States. The dataset describes a time-series of miles (in millions) over 20 years (2000-2019), and there are 239 observations. We used the first 80% of the observations for training various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 18,139. After performing a grid search for the optimal ARIMA parameters, the final ARIMA non-seasonal order was (0, 1, 0) with the seasonal order being (0, 1, 0, 12). Furthermore, the chosen model processed the validation data with an RMSE of 1,856, which was significantly better than the baseline model as expected.

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result and should be considered for further modeling.

Dataset Used: Vehicle Miles Traveled

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: https://fred.stlouisfed.org/series/VMT

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Springleaf Marketing Response dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Springleaf leverages the direct mail method for connecting with customers who may need a loan. To improve their targeting efforts, Springleaf must be sure they are focusing on the customers who are likely to respond and be good candidates for their services. Using a dataset with a broad set of anonymized features, Springleaf is looking to predict which customers will respond to a direct mail offer.

In iteration Take1, we constructed several traditional machine learning models using the linear, non-linear, and ensemble techniques. We also observed the best ROC-AUC result that we could obtain with each of these models.

In iteration Take2, we constructed and tuned an XGBoost machine learning model for this dataset. We also observed the best ROC-AUC result that we could obtain with the XGBoost model.

In iteration Take3, we constructed several Multilayer Perceptron (MLP) models with one hidden layer of 64, 128, 256, 512, 1024, and 2048 nodes. These single-layer MLP models serve as the baseline models as we build more complex MLP models in future iterations.

In iteration Take4, we constructed several Multilayer Perceptron (MLP) models with two hidden layers. We also observed whether these two-layer MLP models could improve the AUC-ROC performance of the single-layer models.

In iteration Take5, we constructed several Multilayer Perceptron (MLP) models with three hidden layers. We also observed whether these three-layer MLP models could improve the AUC-ROC performance of the single-layer models.

In this Take6 iteration, we will construct several Multilayer Perceptron (MLP) models with four hidden layers. We will observe whether these four-layer MLP models can improve the AUC-ROC performance of the single-layer models.

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average ROC-AUC of 70.42%. The Random Forest and Gradient Boosting Machine algorithms achieved the top ROC-AUC metrics after the first round of modeling. After a series of tuning trials, GBM turned in an overall ROC-AUC result of 77.96%. When we applied the tuned GBM algorithm to the test dataset, we obtained a ROC-AUC score of only 62.58%, which was much lower than the score from model training.

In iteration Take2, the XGBoost algorithm achieved a baseline ROC-AUC performance of 77.08%. After a series of tuning trials, XGBoost turned in an overall best ROC-AUC result of 78.23%. When we applied the tuned XGBoost algorithm to the test dataset, we obtained a ROC-AUC score of only 62.86%, which was much lower than the score from model training.

In iteration Take3, all one-layer models achieved a ROC-AUC performance of around 50%.

In iteration Take4, all two-layer models again achieved a ROC-AUC performance of around 50%.

In iteration Take5, all three-layer models once again achieved a ROC-AUC performance of around 50%.

In this Take6 iteration, all four-layer models once again achieved a ROC-AUC performance of around 50%.
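A ROC-AUC near 50%, as seen across the MLP iterations above, means the model's scores carry no information about the labels: random guessing scores the same. The quick synthetic check below illustrates that chance level; the sample size and seed are arbitrary choices for the demonstration.

```python
# Illustration: scores independent of the labels yield ROC-AUC near 0.5.
import random
from sklearn.metrics import roc_auc_score

rng = random.Random(0)
y_true = [rng.randint(0, 1) for _ in range(10_000)]
y_score = [rng.random() for _ in range(10_000)]   # no relation to y_true
auc = roc_auc_score(y_true, y_score)              # hovers around 0.5
```

This is why a ~50% result signals a failed model rather than a merely weak one, and why the conclusion points back to the tree-based ensembles.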

CONCLUSION: For this iteration, all four-layer models scored poorly on the ROC-AUC metric. For this dataset, we should consider modeling with XGBoost or another ensemble algorithm.

Dataset Used: Springleaf Marketing Response Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/springleaf-marketing-response/data

One potential source of performance benchmark: https://www.kaggle.com/c/springleaf-marketing-response/leaderboard

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Springleaf Marketing Response dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Springleaf leverages the direct mail method for connecting with customers who may need a loan. To improve their targeting efforts, Springleaf must be sure they are focusing on the customers who are likely to respond and be good candidates for their services. Using a dataset with a broad set of anonymized features, Springleaf is looking to predict which customers will respond to a direct mail offer.

In iteration Take1, we constructed several traditional machine learning models using the linear, non-linear, and ensemble techniques. We also observed the best ROC-AUC result that we could obtain with each of these models.

In iteration Take2, we constructed and tuned an XGBoost machine learning model for this dataset. We also observed the best ROC-AUC result that we could obtain with the XGBoost model.

In iteration Take3, we constructed several Multilayer Perceptron (MLP) models with one hidden layer of 64, 128, 256, 512, 1024, and 2048 nodes. These single-layer MLP models serve as the baseline models as we build more complex MLP models in future iterations.

In iteration Take4, we constructed several Multilayer Perceptron (MLP) models with two hidden layers. We also observed whether these two-layer MLP models could improve the AUC-ROC performance of the single-layer models.

In this Take5 iteration, we will construct several Multilayer Perceptron (MLP) models with three hidden layers. We will observe whether these three-layer MLP models can improve the AUC-ROC performance of the single-layer models.

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average ROC-AUC of 70.42%. The Random Forest and Gradient Boosting Machine algorithms achieved the top ROC-AUC metrics after the first round of modeling. After a series of tuning trials, GBM turned in an overall ROC-AUC result of 77.96%. When we applied the tuned GBM algorithm to the test dataset, we obtained a ROC-AUC score of only 62.58%, which was much lower than the score from model training.

In iteration Take2, the XGBoost algorithm achieved a baseline ROC-AUC performance of 77.08%. After a series of tuning trials, XGBoost turned in an overall best ROC-AUC result of 78.23%. When we applied the tuned XGBoost algorithm to the test dataset, we obtained a ROC-AUC score of only 62.86%, which was much lower than the score from model training.

In iteration Take3, all one-layer models achieved a ROC-AUC performance of around 50%.

In iteration Take4, all two-layer models again achieved a ROC-AUC performance of around 50%.

In this Take5 iteration, all three-layer models once again achieved a ROC-AUC performance of around 50%.

CONCLUSION: For this iteration, all three-layer models scored poorly on the ROC-AUC metric. For this dataset, we should consider doing MLP modeling with more neural network layers or a more complex architecture.

Dataset Used: Springleaf Marketing Response Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/springleaf-marketing-response/data

One potential source of performance benchmark: https://www.kaggle.com/c/springleaf-marketing-response/leaderboard

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Springleaf Marketing Response dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Springleaf leverages the direct mail method for connecting with customers who may need a loan. To improve their targeting efforts, Springleaf must be sure they are focusing on the customers who are likely to respond and be good candidates for their services. Using a dataset with a broad set of anonymized features, Springleaf is looking to predict which customers will respond to a direct mail offer.

In iteration Take1, we constructed several traditional machine learning models using the linear, non-linear, and ensemble techniques. We also observed the best ROC-AUC result that we could obtain with each of these models.

In iteration Take2, we constructed and tuned an XGBoost machine learning model for this dataset. We also observed the best ROC-AUC result that we could obtain with the XGBoost model.

In iteration Take3, we constructed several Multilayer Perceptron (MLP) models with one hidden layer of 64, 128, 256, 512, 1024, and 2048 nodes. These single-layer MLP models serve as the baseline models as we build more complex MLP models in future iterations.

In this Take4 iteration, we will construct several Multilayer Perceptron (MLP) models with two hidden layers. We will observe whether these two-layer MLP models can improve the AUC-ROC performance of the single-layer models.

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average ROC-AUC of 70.42%. The Random Forest and Gradient Boosting Machine algorithms achieved the top ROC-AUC metrics after the first round of modeling. After a series of tuning trials, GBM turned in an overall ROC-AUC result of 77.96%. When we applied the tuned GBM algorithm to the test dataset, we obtained a ROC-AUC score of only 62.58%, which was much lower than the score from model training.

In iteration Take2, the XGBoost algorithm achieved a baseline ROC-AUC performance of 77.08%. After a series of tuning trials, XGBoost turned in an overall best ROC-AUC result of 78.23%. When we applied the tuned XGBoost algorithm to the test dataset, we obtained a ROC-AUC score of only 62.86%, which was much lower than the score from model training.

In iteration Take3, all one-layer models achieved a ROC-AUC performance of around 50%.

In this Take4 iteration, all two-layer models again achieved a ROC-AUC performance of around 50%.

CONCLUSION: For this iteration, all two-layer models scored poorly on the ROC-AUC performance. For this dataset, we should consider doing more modeling and tuning with more complex MLP models.

Dataset Used: Springleaf Marketing Response Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/springleaf-marketing-response/data

One potential source of performance benchmark: https://www.kaggle.com/c/springleaf-marketing-response/leaderboard

The HTML formatted report can be found here on GitHub.
