INTRODUCTION: Metro is the transportation planner, coordinator, designer, builder, and operator for Los Angeles County, one of the country’s largest and most populous counties. More than 9.6 million people, nearly one-third of California’s residents, live, work and play within its 1,433-square-mile service area. The purpose of this exercise is to practice web scraping by gathering the bus ridership statistics from the agency’s web pages. This iteration of the script automatically traverses the monthly web pages (from January 2009 to June 2020) to capture all bus ridership entries and store the information in a CSV output file.

Starting URL: http://isotp.metro.net/MetroRidership/Index.aspx
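The monthly traversal described above can be sketched as follows. This is a minimal sketch, not the actual script; the request style and field names in the comments are assumptions, since the real page is an ASP.NET form:

```python
from datetime import date

def report_months(start=date(2009, 1, 1), end=date(2020, 6, 1)):
    """Enumerate every (year, month) pair between January 2009 and June 2020."""
    months, current = [], start
    while current <= end:
        months.append((current.year, current.month))
        # advance one calendar month
        current = date(current.year + current.month // 12,
                       current.month % 12 + 1, 1)
    return months

# Each month's page would then be requested and parsed, e.g. with
# requests + BeautifulSoup (the form-field name below is hypothetical):
# resp = requests.post("http://isotp.metro.net/MetroRidership/Index.aspx",
#                      data={"period": f"{year}-{month:02d}"})
# rows = BeautifulSoup(resp.text, "html.parser").select("table tr")
```

The 138 pairs returned by `report_months()` would drive the scraping loop, with each month's parsed rows appended to the CSV output.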

The source code and HTML output can be found here on GitHub.

SUMMARY: The purpose of this project is to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a series of exponential and simple moving average (MA) crossover models via a grid search methodology. This iteration of the modeling will focus on applying a mean-reversion approach. When the fast moving-average curve crosses below the slow moving-average curve, the strategy goes long (buys) on the stock. When the opposite occurs, we will exit the position.
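The crossover rule above can be expressed in a few lines of pandas. This is a minimal sketch, not the actual grid search script; the next-bar execution shift is an assumption:

```python
import pandas as pd

def crossover_signal(prices, fast=10, slow=20, kind="sma"):
    """Return 1 while the fast MA sits below the slow MA (long), else 0 (flat)."""
    if kind == "sma":
        fast_ma = prices.rolling(fast).mean()
        slow_ma = prices.rolling(slow).mean()
    else:  # exponential moving averages
        fast_ma = prices.ewm(span=fast, adjust=False).mean()
        slow_ma = prices.ewm(span=slow, adjust=False).mean()
    # shift by one bar so a signal observed today is traded tomorrow
    return (fast_ma < slow_ma).astype(int).shift(1).fillna(0)

# Per-share strategy profit is then the position-weighted price change:
# profit = (crossover_signal(prices) * prices.diff()).sum()
```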

From iteration Take1, the grid search script searched through all combinations between the two sets of MA curves, simple and exponential. The faster MA curve ranged from 5 days to 30 days, while the slower MA ranged from 10 days to 60 days. Both curves used a 5-day increment.

From iteration Take2, the grid search script searched through all combinations between the four sets of MA curves. The four models were simple only, exponential only, fast simple/slow exponential, and fast exponential/slow simple. The fast MA curve ranged from 5 days to 30 days, while the slow MA ranged from 10 days to 60 days. All four sets of curves used a 5-day increment.

From iteration Take3, the grid search script searched through all combinations between the two sets of MA curves, simple and exponential. The faster MA curve ranged from 5 days to 30 days, while the slower MA ranged from 10 days to 60 days. Both curves used a 5-day increment.

For this Take4 iteration, the grid search script will search through all combinations between the four sets of MA curves. The four models are simple only, exponential only, fast simple/slow exponential, and fast exponential/slow simple. The fast MA curve can range from 5 days to 30 days, while the slow MA can range from 10 days to 60 days. All four sets of curves use a 5-day increment.
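The Take4 search space can be enumerated as below. Requiring the fast window to be strictly shorter than the slow window is an assumption about how the script prunes the grid:

```python
from itertools import product

FAST = range(5, 31, 5)    # 5, 10, ..., 30 days
SLOW = range(10, 61, 5)   # 10, 15, ..., 60 days
KINDS = ("sma", "ema")    # moving-average type for each curve

def grid():
    """All four type pairings crossed with every valid (fast, slow) window pair."""
    return [(fast_kind, f, slow_kind, s)
            for fast_kind, slow_kind in product(KINDS, KINDS)
            for f, s in product(FAST, SLOW)
            if f < s]
```

Each tuple such as ('ema', 5, 'sma', 30) names one candidate model; the script would backtest every candidate and keep the most profitable.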

ANALYSIS: From iteration Take1, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best simple MA model, with 10-day and 20-day windows, produced a profit of 284.11 per share. The best exponential MA model, with 5-day and 20-day windows, produced a gain of 280.73 per share. The long-only approach yielded a gain of 280.86 per share.

From iteration Take2, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best MA model with 5-day EMA and 30-day SMA produced a profit of 289.28 per share. The long-only approach yielded a profit of 280.86 per share.

From iteration Take3, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best simple MA model, with 30-day and 60-day windows, produced a profit of 97.39 per share. The long-only approach yielded a gain of 280.86 per share.

For this Take4 iteration, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best MA model with 30-day SMA and 35-day EMA produced a profit of 112.00 per share. The long-only approach yielded a profit of 280.86 per share.

CONCLUSION: For AAPL and during the modeling time frame, the mean-reversion approach produced a suboptimal return when compared to the momentum-oriented and long-only approaches. In this case, the buy-and-hold strategy is much more profitable without too much fuss.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Exports of Goods for California dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly exports of manufactured and non-manufactured commodities for the state of California. The dataset describes a time-series of exports of goods (in millions of dollars) over 25 years (1995-2020), and there are 298 observations. We used the first 80% of the observations for training various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 1101. After performing a grid search for the most optimal ARIMA parameters, the final ARIMA non-seasonal order was (0, 1, 3) with the seasonal order being (1, 0, 1, 12). Furthermore, the chosen model processed the validation data with an RMSE of 724, which was better than the baseline model as expected.
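The persistence baseline simply carries the previous observation forward. Below is a minimal sketch of that baseline, with the chosen SARIMA fit shown in comments (assuming the statsmodels API):

```python
import numpy as np

def persistence_rmse(series):
    """RMSE of the naive forecast that predicts each value with its predecessor."""
    y_true = np.asarray(series[1:], dtype=float)
    y_pred = np.asarray(series[:-1], dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# The tuned model itself would be fitted along these lines:
# from statsmodels.tsa.statespace.sarimax import SARIMAX
# fit = SARIMAX(train, order=(0, 1, 3),
#               seasonal_order=(1, 0, 1, 12)).fit(disp=False)
```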

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result, and we should consider using the algorithm for further modeling.

Dataset Used: Monthly Exports of Goods for California

Dataset ML Model: Time series forecast with numerical attribute

Dataset Reference: U.S. Census Bureau, Exports of Goods for California [EXPTOTCA], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/EXPTOTCA, August 2, 2020.

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a series of exponential and simple moving average (MA) crossover models via a grid search methodology. This iteration of the modeling will focus on applying a mean-reversion approach. When the fast moving-average curve crosses below the slow moving-average curve, the strategy goes long (buys) on the stock. When the opposite occurs, we will exit the position.

From iteration Take1, the grid search script searched through all combinations between the two sets of MA curves, simple and exponential. The faster MA curve ranged from 5 days to 30 days, while the slower MA ranged from 10 days to 60 days. Both curves used a 5-day increment.

From iteration Take2, the grid search script searched through all combinations between the four sets of MA curves. The four models were simple only, exponential only, fast simple/slow exponential, and fast exponential/slow simple. The fast MA curve ranged from 5 days to 30 days, while the slow MA ranged from 10 days to 60 days. All four sets of curves used a 5-day increment.

For this Take3 iteration, the grid search script will search through all combinations between the two sets of MA curves, simple and exponential. The faster MA curve can range from 5 days to 30 days, while the slower MA can range from 10 days to 60 days. Both curves use a 5-day increment.

ANALYSIS: From iteration Take1, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best simple MA model, with 10-day and 20-day windows, produced a profit of 284.11 per share. The best exponential MA model, with 5-day and 20-day windows, produced a gain of 280.73 per share. The long-only approach yielded a gain of 280.86 per share.

From iteration Take2, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best MA model with 5-day EMA and 30-day SMA produced a profit of 289.28 per share. The long-only approach yielded a gain of 280.86 per share.

For this Take3 iteration, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best simple MA model, with 30-day and 60-day windows, produced a profit of 97.39 per share. The long-only approach yielded a gain of 280.86 per share.

CONCLUSION: For AAPL and during the modeling time frame, the mean-reversion approach produced a suboptimal return when compared to the momentum-oriented and long-only approaches. In this case, the buy-and-hold strategy is much more profitable without too much fuss.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The BNP Paribas Cardif Claims Management dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: As a global specialist in personal insurance, BNP Paribas Cardif sponsored a Kaggle competition to help them identify the categories of claims. In a world shaped by the emergence of new practices and behaviors generated by the digital economy, BNP Paribas Cardif would like to streamline its claims management practice. In this Kaggle challenge, the company challenged the participants to predict the category of a claim based on features available early in the process. Better predictions can help BNP Paribas Cardif accelerate its claims process and therefore provide a better service to its customers.

In iteration Take1, we constructed and tuned several machine learning models using the Scikit-learn library. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and submitted a list of predictions for evaluation.

In this Take2 iteration, we will construct and tune an XGBoost model. Furthermore, we will apply the XGBoost model to Kaggle’s test dataset and submit a list of predictions for evaluation.

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average log loss of 0.6422. Two algorithms (Logistic Regression and Random Forest) achieved the top log loss metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result, achieving a log loss metric of 0.4722. When configured with the optimized parameters, the Random Forest model processed the validation dataset with a log loss of 0.4706, which was consistent with the model training phase. When we applied the Random Forest model to Kaggle’s test dataset, we obtained a log loss score of 0.4635.

From this Take2 iteration, the baseline performance of the XGBoost model achieved a log loss of 0.4706. After a series of tuning trials, the XGBoost model reached a log loss metric of 0.4650. When configured with the optimized parameters, the XGBoost model processed the validation dataset with a log loss of 0.4674, which was consistent with the model training phase. When we applied the XGBoost model to Kaggle’s test dataset, we obtained a log loss score of 0.4634.

CONCLUSION: For this iteration, the XGBoost model achieved the best overall results using the training and test datasets. For this dataset, we should consider further modeling with the XGBoost algorithm.

Dataset Used: BNP Paribas Cardif Claims Management Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/overview

One potential source of performance benchmark: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The BNP Paribas Cardif Claims Management dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: As a global specialist in personal insurance, BNP Paribas Cardif sponsored a Kaggle competition to help them identify the categories of claims. In a world shaped by the emergence of new practices and behaviors generated by the digital economy, BNP Paribas Cardif would like to streamline its claims management practice. In this Kaggle challenge, the company challenged the participants to predict the category of a claim based on features available early in the process. Better predictions can help BNP Paribas Cardif accelerate its claims process and therefore provide a better service to its customers.

In this Take1 iteration, we will construct and tune several machine learning models using the Scikit-learn library. Furthermore, we will apply the best-performing machine learning model to Kaggle’s test dataset and submit a list of predictions for evaluation.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average log loss of 0.6422. Two algorithms (Logistic Regression and Random Forest) achieved the top log loss metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result, achieving a log loss metric of 0.4722. When configured with the optimized parameters, the Random Forest model processed the validation dataset with a log loss of 0.4706, which was consistent with the model training phase. When we applied the Random Forest model to Kaggle’s test dataset, we obtained a log loss score of 0.4635.

CONCLUSION: For this iteration, the Random Forest model achieved the best overall results using the training and test datasets. For this dataset, we should consider further modeling with the Random Forest algorithm.

Dataset Used: BNP Paribas Cardif Claims Management Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/overview

One potential source of performance benchmark: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a series of exponential and simple moving average (MA) crossover models via a grid search methodology. This iteration of the modeling will focus on applying a trend-following or a momentum-oriented approach. When the fast moving-average curve crosses above the slow moving-average curve, the strategy goes long (buys) on the stock. When the opposite occurs, we will exit the position.

From iteration Take1, the grid search script searched through all combinations between the two sets of MA curves, simple and exponential. The faster MA curve ranged from 5 days to 30 days, while the slower MA ranged from 10 days to 60 days. Both curves used a 5-day increment.

For this Take2 iteration, the grid search script will search through all combinations between the four sets of MA curves. The four models are simple only, exponential only, fast simple/slow exponential, and fast exponential/slow simple. The fast MA curve can range from 5 days to 30 days, while the slow MA can range from 10 days to 60 days. All four sets of curves use a 5-day increment.

ANALYSIS: From iteration Take1, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best simple MA model, with 10-day and 20-day windows, produced a profit of 284.11 per share. The best exponential MA model, with 5-day and 20-day windows, produced a gain of 280.73 per share. The long-only approach yielded a profit of 280.86 per share.

For this Take2 iteration, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best MA model with 5-day EMA and 30-day SMA produced a profit of 289.28 per share. The long-only approach yielded a gain of 280.86 per share.

CONCLUSION: For this dataset, the combination of an exponential moving average curve of 5 days with a simple moving average curve of 30 days seems to produce the best profit level.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Imports of Goods for California dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly imports of manufactured and non-manufactured commodities for the state of California. The dataset describes a time-series of imports of goods (in millions of dollars) over 12 years (2008-2020), and there are 149 observations. We used the first 80% of the observations for training various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 2391. After performing a grid search for the most optimal ARIMA parameters, the final ARIMA non-seasonal order was (3, 1, 4) with the seasonal order being (1, 0, 2, 12). Furthermore, the chosen model processed the validation data with an RMSE of 1496, which was better than the baseline model as expected.

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result, and we should consider using the algorithm for further modeling.

Dataset Used: Monthly Imports of Goods for California

Dataset ML Model: Time series forecast with numerical attribute

Dataset Reference: U.S. Census Bureau, Imports of Goods for California [IMPTOTCA], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/IMPTOTCA, August 1, 2020.

The HTML formatted report can be found here on GitHub.

INTRODUCTION: This algorithmic trading model examines a series of exponential and simple moving average (MA) crossover models via a grid search methodology. This iteration of the modeling will focus on applying a trend-following or a momentum-oriented approach. When the fast moving-average curve crosses above the slow moving-average curve, the strategy goes long (buys) on the stock. When the opposite occurs, we will exit the position.

For this Take1 iteration, the grid search script will search through all combinations between the two sets of MA curves, simple and exponential. The faster MA curve can range from 5 days to 20 days, while the slower MA can range from 10 days to 50 days. Both curves use a 5-day increment.

ANALYSIS: For this Take1 iteration, we analyzed the stock prices for Apple Inc. (AAPL) between January 1, 2019 and August 3, 2020. The best simple MA model, with 10-day and 20-day windows, produced a profit of 284.11 per share. The best exponential MA model, with 5-day and 20-day windows, produced a gain of 280.73 per share. The long-only approach yielded a profit of 280.86 per share.

CONCLUSION: For this dataset and period, the simple moving average curves of 10 days and 20 days seem to produce the best profit level. However, the buy-and-hold approach is almost as profitable without too much fuss.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Truck APS Failure dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurized air that supports functions such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

The challenge is to minimize the total cost of a prediction model, defined as the sum of “Cost_1” multiplied by the number of instances with Type 1 failure and “Cost_2” multiplied by the number of instances with Type 2 failure. The “Cost_1” variable refers to the cost resulting from a redundant check by a mechanic at the workshop. Meanwhile, the “Cost_2” variable refers to the cost of not catching a faulty truck. The cost of a Type I error (Cost_1) is 10, while the cost of a Type II error (Cost_2) is 500.
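The cost function described above is simple to write down; a minimal sketch:

```python
def total_cost(n_type1, n_type2, cost_1=10, cost_2=500):
    """Competition metric: redundant workshop checks (Type 1) are cheap,
    while missed APS failures (Type 2) are fifty times costlier."""
    return cost_1 * n_type1 + cost_2 * n_type2
```

The 50x asymmetry is why the later iterations optimize recall rather than accuracy: one missed failure costs as much as fifty needless inspections.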

In the previous Scikit-learn iterations, we constructed and tuned machine learning models for this dataset using the Scikit-learn and XGBoost libraries. We also observed the best accuracy result that we could obtain using the tuned models with the training, validation, and test datasets.

In iteration Take1, we constructed and tuned machine learning models for this dataset using TensorFlow with three layers. We also observed the best result that we could obtain using the tuned models with the validation and test datasets.

In iteration Take2, we provided more balance to this imbalanced dataset by applying the Synthetic Minority Over-sampling Technique (SMOTE). We increased the population of the minority class from approximately 0.1% to approximately 33% of the training instances, and we decreased the population of the majority class to equal the minority class. Furthermore, we also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.
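The core of SMOTE is interpolation between a minority sample and a nearby minority neighbour. The toy sketch below illustrates that idea only; it is not the library implementation used in the project:

```python
import numpy as np

def smote_like(X_minority, n_new, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and its nearest minority neighbour."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                 # exclude the point itself
        j = int(np.argmin(dist))         # nearest other minority point
        synthetic.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.array(synthetic)
```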

In iteration Take3, we constructed and tuned machine learning models for this dataset using TensorFlow with four layers. At the same time, we leveraged the SMOTE technique to augment the dataset for training purposes. Furthermore, we also observed the best result that we could obtain using the tuned models with the validation and test datasets.

In this Take4 iteration, we will construct and tune machine learning models for this dataset using TensorFlow with five layers. At the same time, we will leverage the SMOTE technique to augment the dataset for training purposes. Furthermore, we will observe the best result that we can obtain using the tuned models with the validation and test datasets.

ANALYSIS: From the previous Scikit-learn iterations, the optimized XGBoost model processed the test dataset with a recall metric of 98.66% and a low Type II error rate.

From this Take1 iteration, the performance of the three-layer TensorFlow model achieved a recall score of 77.20% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with a recall score of 75.20%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with a recall score of 55.46% with a high Type II error rate.

From iteration Take2, the performance of the three-layer TensorFlow model achieved a recall score of 87.60% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with a recall score of 96.40%, which was much better than the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with a recall score of 85.06% with a lower Type II error rate than the previous iteration.

From iteration Take3, the performance of the four-layer TensorFlow model achieved a recall score of 83.60% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with a recall score of 93.20%, which was much better than the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with a recall score of 95.73% with a lower Type II error rate than the previous iteration.

From this Take4 iteration, the performance of the five-layer TensorFlow model achieved a recall score of 86.80% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with a recall score of 96.00%, which was much better than the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with a recall score of 98.66% with a lower Type II error rate but abysmal accuracy.
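The recall figures above use the standard definition, in which a false negative is the costly Type II error; a minimal sketch:

```python
def recall_score(y_true, y_pred):
    """Recall (sensitivity): the share of actual positives the model catches."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)
```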

CONCLUSION: For this dataset, the model built using TensorFlow with four layers seems to yield the best result with a little help from applying SMOTE to balance the dataset. We should consider using TensorFlow and SMOTE to model this dataset further.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

One potential source of performance benchmark: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML formatted report can be found here on GitHub.
