INTRODUCTION: Dr. Jason Brownlee’s Machine Learning Mastery hosts its tutorial lessons at https://machinelearningmastery.com/blog. The purpose of this exercise is to practice web scraping by gathering the blog entries from Machine Learning Mastery’s web pages. This iteration of the script automatically traverses the web pages, captures every blog entry, and stores the collected information in a JSON output file.

Starting URLs: https://machinelearningmastery.com/blog

The source code and HTML output can be found here on GitHub.
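As a rough illustration of the approach, the sketch below parses blog entries out of one page’s HTML with BeautifulSoup and serializes them to JSON. The `article h2 a` selector and the sample markup are assumptions for illustration, not the selectors used by the actual script; the real crawl would fetch each page (e.g. with requests) and follow the “next page” link until none remains.

```python
import json

from bs4 import BeautifulSoup


def parse_entries(html):
    """Extract title/URL pairs from one blog index page (selector assumed)."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"title": a.get_text(strip=True), "url": a.get("href")}
            for a in soup.select("article h2 a")]


# Stand-in for one fetched page; the real script would retrieve the HTML
# over the network and loop until no "next page" link remains.
sample = (
    '<article><h2><a href="https://machinelearningmastery.com/post-one/">'
    "Post One</a></h2></article>"
    '<article><h2><a href="https://machinelearningmastery.com/post-two/">'
    "Post Two</a></h2></article>"
)
entries = parse_entries(sample)
print(json.dumps(entries, indent=2))  # the JSON written to the output file
```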

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The “Cats and Dogs” dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Web services are often protected with a challenge that’s supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). ASIRRA (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately.

The current literature suggests that machine classifiers can score above 80% accuracy on this task. Therefore, ASIRRA is no longer considered safe from attack. Kaggle created a contest to benchmark the latest computer vision and deep learning approaches to this problem. The training archive contains 25,000 images of dogs and cats. We will need to train our algorithm on these files and predict the correct labels for the test dataset.

In iteration Take1, we constructed a simple VGG convolutional model with 1 VGG block to classify the images. This model serves as the baseline for the future iterations of modeling.

In iteration Take2, we constructed a simple VGG convolutional model with 2 VGG blocks to classify the images. The additional modeling enabled us to improve our baseline model.

In this iteration, we will construct a simple VGG convolutional model with 3 VGG blocks to classify the images. The additional modeling will enable us to improve our baseline model.
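A 3-block model of this kind can be sketched in Keras as below, assuming 200×200 RGB inputs and one 3×3 convolution plus 2×2 max-pooling layer per block; the exact input size, filter counts, and training settings used in the project may differ.

```python
from tensorflow.keras import layers, models


def build_vgg3(input_shape=(200, 200, 3)):
    """Three simple VGG-style blocks followed by a dense classifier head."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):  # one "VGG block" per filter count
        x = layers.Conv2D(filters, (3, 3), activation="relu",
                          padding="same", kernel_initializer="he_uniform")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # binary: cat vs. dog
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_vgg3()
```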

ANALYSIS: In iteration Take1, the performance of the Take1 model achieved an accuracy score of 95.55% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 72.99% after 20 epochs. Reviewing the plot, we can see that the model was starting to overfit the training dataset after only ten epochs. We will need to explore other modeling approaches to reduce the over-fitting.

In iteration Take2, the performance of the Take2 model achieved an accuracy score of 97.94% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 75.67% after 20 epochs. Reviewing the plot, we can see that the model was starting to overfit the training dataset after only seven epochs. We will need to explore other modeling approaches to reduce the over-fitting.

In this iteration, the performance of the Take3 model achieved an accuracy score of 97.14% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 80.19% after 20 epochs. Reviewing the plot, we can see that the model was starting to overfit the training dataset after only six epochs. We will need to explore other modeling approaches to reduce the over-fitting.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow did not achieve a result comparable with the Kaggle competition submissions. We should explore additional and different modeling approaches.

Dataset Used: Cats and Dogs Dataset

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.microsoft.com/en-us/download/details.aspx?id=54765

One potential source of performance benchmarks: https://www.kaggle.com/c/dogs-vs-cats/overview

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Wisconsin Food Industry Employment dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly employment figures for the food and kindred products industry in the state of Wisconsin. The dataset describes a time series of employment (in thousands) over roughly 14 years (January 1961 to October 1975), and there are 178 observations. We used the first 80% of the observations for training and testing various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 4.687. After performing a grid search for the optimal ARIMA parameters, the final ARIMA non-seasonal order was (0, 1, 3) with the seasonal order being (0, 1, 1, 12). Furthermore, the chosen model processed the validation data with an RMSE of 1.133, which was better than the baseline model as expected.

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result and should be considered for further modeling.

Dataset Used: Wisconsin employment for food and kindred products from January 1961 to October 1975

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang.com/tsdl/.

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The “Cats and Dogs” dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Web services are often protected with a challenge that’s supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). ASIRRA (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately.

The current literature suggests that machine classifiers can score above 80% accuracy on this task. Therefore, ASIRRA is no longer considered safe from attack. Kaggle created a contest to benchmark the latest computer vision and deep learning approaches to this problem. The training archive contains 25,000 images of dogs and cats. We will need to train our algorithm on these files and predict the correct labels for the test dataset.

In iteration Take1, we constructed a simple VGG convolutional model with 1 VGG block to classify the images. This model serves as the baseline for the future iterations of modeling.

In this iteration, we will construct a simple VGG convolutional model with 2 VGG blocks to classify the images. The additional modeling will enable us to improve our baseline model.

ANALYSIS: In iteration Take1, the performance of the Take1 model achieved an accuracy score of 95.55% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 72.99% after 20 epochs. Reviewing the plot, we can see that the model was starting to overfit the training dataset after only ten epochs. We will need to explore other modeling approaches to reduce the over-fitting.

In this iteration, the performance of the Take2 model achieved an accuracy score of 97.94% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 75.67% after 20 epochs. Reviewing the plot, we can see that the model was starting to overfit the training dataset after only seven epochs. We will need to explore other modeling approaches to reduce the over-fitting.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow did not achieve a result comparable with the Kaggle competition submissions. We should explore additional and different modeling approaches.

Dataset Used: Cats and Dogs Dataset

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.microsoft.com/en-us/download/details.aspx?id=54765

One potential source of performance benchmarks: https://www.kaggle.com/c/dogs-vs-cats/overview

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The “Cats and Dogs” dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Web services are often protected with a challenge that’s supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). ASIRRA (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately.

The current literature suggests that machine classifiers can score above 80% accuracy on this task. Therefore, ASIRRA is no longer considered safe from attack. Kaggle created a contest to benchmark the latest computer vision and deep learning approaches to this problem. The training archive contains 25,000 images of dogs and cats. We will need to train our algorithm on these files and predict the correct labels for the test dataset.

In this iteration, we will construct a simple VGG convolutional model with 1 VGG block to classify the images. This model will serve as the baseline for the future iterations of modeling.

ANALYSIS: In this iteration, the performance of the Take1 model achieved an accuracy score of 95.55% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 72.99% after 20 epochs. Reviewing the plot, we can see that the model was starting to overfit the training dataset after only ten epochs. We will need to explore other modeling approaches to reduce the over-fitting.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow did not achieve a result comparable with the Kaggle competition submissions. We should explore additional and different modeling approaches.

Dataset Used: Cats and Dogs Dataset

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.microsoft.com/en-us/download/details.aspx?id=54765

One potential source of performance benchmarks: https://www.kaggle.com/c/dogs-vs-cats/overview

The HTML formatted report can be found here on GitHub.

INTRODUCTION: The Conference on Neural Information Processing Systems (NeurIPS) covers a wide range of topics in neural information processing systems, spanning biological, technological, mathematical, and theoretical research. Neural information processing is a field that benefits from a combined view of biological, physical, mathematical, and computational sciences. This web scraping script will automatically traverse the entire web page and collect all links to the PDF and PPTX documents. The script will also download the documents as part of the scraping process.

Starting URLs: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019

The source code and HTML output can be found here on GitHub.
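A minimal sketch of the link-collection step is shown below, assuming the documents appear as plain `<a href>` links on the index page; the markup here is a stand-in, and the download step would then fetch each collected URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://papers.nips.cc"


def collect_documents(html, base=BASE):
    """Return absolute URLs for every PDF/PPTX link found on the page."""
    soup = BeautifulSoup(html, "html.parser")
    links = [urljoin(base, a["href"]) for a in soup.find_all("a", href=True)]
    return [u for u in links if u.lower().endswith((".pdf", ".pptx"))]


# Stand-in markup; the real script would fetch the proceedings index page.
sample = (
    '<a href="/paper/1234-example.pdf">paper</a> '
    '<a href="/slides/example.pptx">slides</a> '
    '<a href="/paper/1234">abstract page</a>'
)
docs = collect_documents(sample)

# Downloading each document would then be roughly:
#   import requests
#   for url in docs:
#       with open(url.rsplit("/", 1)[-1], "wb") as f:
#           f.write(requests.get(url).content)
```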

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The CIFAR-10 dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: CIFAR-10 is a labeled subset of the 80 million tiny images dataset. The images were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class.
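When loading through Keras, the batches arrive pre-merged as arrays, so a typical preparation step is just pixel scaling and one-hot encoding, as in this sketch (stand-in arrays with CIFAR-10’s shapes keep it runnable without downloading the dataset):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# The real data comes pre-merged from Keras:
# from tensorflow.keras.datasets import cifar10
# (X_train, y_train), (X_test, y_test) = cifar10.load_data()

# Small stand-in arrays with CIFAR-10's per-image shape.
X_train = np.random.randint(0, 256, (1000, 32, 32, 3), dtype=np.uint8)
y_train = np.random.randint(0, 10, (1000, 1))

X_train = X_train.astype("float32") / 255.0  # scale pixels to [0, 1]
y_train = to_categorical(y_train, num_classes=10)  # one-hot class labels
```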

In iteration Take1, we constructed a simple VGG convolutional model with 1 VGG block to classify the images. This model serves as the baseline for the future iterations of modeling.

In iteration Take2, we constructed a few more VGG convolutional models with 2 and 3 VGG blocks to classify the images. The additional models enabled us to choose a final baseline model before applying other performance-enhancing techniques.

In iteration Take3, we tuned the VGG-3 model with various hyperparameters and selected the best model.

In iteration Take4, we added some dropout layers as a regularization technique to reduce over-fitting.

In iteration Take5, we applied data augmentation to the dataset as a regularization technique to reduce over-fitting.

In iteration Take6, we applied both dropout layers and data augmentation to the dataset for reducing over-fitting.

In iteration Take7, we applied dropout layers, data augmentation, and batch normalization to the dataset for reducing over-fitting.

In this iteration, we will tune the Take7 model further by experimenting with different optimizers and different learning rates.
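The combined regularization stack can be sketched as below with a single block shown; this is an illustrative miniature, not the project’s exact architecture, and the full model would repeat the convolution/batch-normalization block with larger filter counts.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.BatchNormalization(),        # stabilizes and speeds up training
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.2),                # regularization against over-fitting
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
# RMSprop with an explicit learning rate, one of the optimizer/learning-rate
# combinations explored in this iteration.
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Data augmentation: random shifts and horizontal flips of training images.
datagen = ImageDataGenerator(width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)
# Training would then use the augmented stream, e.g.:
# model.fit(datagen.flow(X_train, y_train, batch_size=64), epochs=400, ...)
```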

ANALYSIS: In iteration Take1, the performance of the Take1 model with the default parameters achieved an accuracy score of 66.39% on the validation dataset after training for 50 epochs. After tuning the hyperparameters, the Take1 model with the best hyperparameters processed the training dataset with an accuracy of 100.00%. The same model, however, processed the test dataset with an accuracy of only 67.01%. We will need to explore other modeling approaches to make a better model that reduces over-fitting.

In iteration Take2, the performance of the VGG-1 model with the default parameters achieved an accuracy score of 66.69% on the validation dataset after training for 50 epochs. The VGG-2 model achieved an accuracy score of 71.35% on the validation dataset after training for 50 epochs. The VGG-3 model achieved an accuracy score of 73.81% on the validation dataset after training for 50 epochs. The additional VGG blocks helped the model, but we still need to explore other modeling approaches to make a better model that reduces over-fitting.

In iteration Take3, the performance of the VGG-3 Take3 model with the default parameters achieved a maximum accuracy score of 73.43% on the validation dataset after training for 50 epochs. After tuning the hyperparameters, the Take3 model with the best hyperparameters processed the training dataset with an accuracy of 98.09%. The same model, however, processed the test dataset with an accuracy of only 73.44%. Even with VGG-3 and hyperparameter tuning, we still have an over-fitting problem with the model.

In iteration Take4, the performance of the Take4 model with the default parameters achieved a maximum accuracy score of 76.96% on the validation dataset after training for 50 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take4 model processed the training dataset with an accuracy of 82.22% after 100 epochs. The same model processed the test dataset with an accuracy of 82.35%. This iteration indicated to us that having dropout layers can be a good tactic to improve the model’s predictive performance.

In iteration Take5, the performance of the Take5 model with the default parameters achieved a maximum accuracy score of 81.41% on the validation dataset after training for 50 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take5 model processed the training dataset with an accuracy of 91.60% after 100 epochs. The same model processed the test dataset with an accuracy of 84.43%. This iteration indicated to us that applying data augmentation can be a good tactic to improve the model’s predictive performance.

In iteration Take6, the performance of the Take6 model with the default parameters achieved a maximum accuracy score of 84.12% on the validation dataset after training for 100 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take6 model processed the training dataset with an accuracy of 87.21% after 200 epochs. The same model processed the test dataset with an accuracy of 85.79%. This iteration indicated to us that applying dropout layers and data augmentation together can be a good tactic to improve the model’s predictive performance.

In iteration Take7, the performance of the Take7 model with the default parameters achieved a maximum accuracy score of 87.15% on the validation dataset after training for 200 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take7 model processed the training dataset with an accuracy of 90.15% after 400 epochs. The same model processed the test dataset with an accuracy of 89.02%. This iteration indicated to us that applying dropout layers, data augmentation, and batch normalization together can be a good tactic to improve the model’s predictive performance.

In this iteration, the performance of the Take8 model with the default parameters achieved a maximum accuracy score of 88.35% on the validation dataset after training for 400 epochs. After trying out different optimizers and settings, the best Take8 model processed the training dataset with an accuracy of 92.92%. The same model processed the test dataset with an accuracy of 90.36%. This iteration indicated to us that using an RMSprop optimizer can be a good option to improve the model’s predictive performance for this dataset.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow achieved a satisfactory result and should be considered for future modeling activities.

Dataset Used: The CIFAR-10 Dataset

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://www.cs.toronto.edu/~kriz/cifar.html

One potential source of performance benchmarks: https://machinelearningmastery.com/how-to-develop-a-cnn-from-scratch-for-cifar-10-photo-classification/

The HTML formatted report can be found here on GitHub.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Fisher River Flow dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the daily river flow volume for the Fisher River near Dallas, Manitoba. The dataset describes a time series of daily flow (in cubic metres per second) over four years (1988-1991), and there are 1,461 observations in the dataset. We used the first 80% of the observations for training and testing various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 0.693. After performing a grid search for the optimal ARIMA parameters, the final ARIMA non-seasonal order was (1, 0, 2) with the seasonal order being (0, 1, 0, 365). Furthermore, the chosen model processed the validation data with an RMSE of 0.987, which was no better than the baseline model.

CONCLUSION: For this iteration, the chosen ARIMA model did not achieve a satisfactory result. We should explore different sets of ARIMA parameters and conduct further modeling activities.

Dataset Used: Mean daily flow, Fisher River near Dallas, January 1, 1988 to December 31, 1991.

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang.com/tsdl/.

The HTML formatted report can be found here on GitHub.

Thanks to Dr. Jason Brownlee’s suggestions on creating a machine learning template, I have pulled together a project template that can be used to support modeling classification and regression problems using Python and the Keras framework.

Version 5 of the deep learning templates contains minor adjustments and corrections to the previous version of the templates. The new templates also add sample code to support:

- Baseline model creation via either scikit-learn’s KFold cross-validation or Keras’s built-in validation_data parameter.
- Hyperparameter tuning via scikit-learn’s GridSearchCV function or plain Python “for” loops
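The “for” loop style of tuning can be sketched as below on stand-in data: build one small Keras model per candidate setting, track the best validation score, and keep the winner. The data, layer sizes, and candidate values are illustrative assumptions, not the template’s actual contents.

```python
import numpy as np
from tensorflow.keras import layers, models

# Stand-in binary-classification data for the demonstration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8)).astype("float32")
y = rng.integers(0, 2, 200)


def build_model(units):
    """Tiny model parameterized by the hyperparameter being tuned."""
    m = models.Sequential([
        layers.Input(shape=(8,)),
        layers.Dense(units, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
    return m


results = {}
for units in (4, 8, 16):  # candidate hyperparameter values
    model = build_model(units)
    hist = model.fit(X, y, epochs=3, batch_size=32,
                     validation_split=0.2, verbose=0)
    results[units] = max(hist.history["val_accuracy"])

best = max(results, key=results.get)  # keep the best-scoring setting
```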

You will find the Python deep learning templates on the Machine Learning Project Templates page.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The CIFAR-10 dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: CIFAR-10 is a labeled subset of the 80 million tiny images dataset. The images were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class.

In iteration Take2, we constructed a few more VGG convolutional models with 2 and 3 VGG blocks to classify the images. The additional models enabled us to choose a final baseline model before applying other performance-enhancing techniques.

In iteration Take3, we tuned the VGG-3 model with various hyperparameters and selected the best model.

In iteration Take4, we added some dropout layers as a regularization technique to reduce over-fitting.

In iteration Take5, we applied data augmentation to the dataset as a regularization technique to reduce over-fitting.

In iteration Take6, we applied both dropout layers and data augmentation to the dataset for reducing over-fitting.

In this iteration, we will apply dropout layers, data augmentation, and batch normalization to the dataset for reducing over-fitting.

ANALYSIS: In iteration Take1, the performance of the Take1 model with the default parameters achieved an accuracy score of 66.39% on the validation dataset after training for 50 epochs. After tuning the hyperparameters, the Take1 model with the best hyperparameters processed the training dataset with an accuracy of 100.00%. The same model, however, processed the test dataset with an accuracy of only 67.01%. We will need to explore other modeling approaches to make a better model that reduces over-fitting.

In iteration Take2, the performance of the VGG-1 model with the default parameters achieved an accuracy score of 66.69% on the validation dataset after training for 50 epochs. The VGG-2 model achieved an accuracy score of 71.35% on the validation dataset after training for 50 epochs. The VGG-3 model achieved an accuracy score of 73.81% on the validation dataset after training for 50 epochs. The additional VGG blocks helped the model, but we still need to explore other modeling approaches to make a better model that reduces over-fitting.

In iteration Take3, the performance of the VGG-3 Take3 model with the default parameters achieved a maximum accuracy score of 73.43% on the validation dataset after training for 50 epochs. After tuning the hyperparameters, the Take3 model with the best hyperparameters processed the training dataset with an accuracy of 98.09%. The same model, however, processed the test dataset with an accuracy of only 73.44%. Even with VGG-3 and hyperparameter tuning, we still have an over-fitting problem with the model.

In iteration Take4, the performance of the Take4 model with the default parameters achieved a maximum accuracy score of 76.96% on the validation dataset after training for 50 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take4 model processed the training dataset with an accuracy of 82.22% after 100 epochs. The same model processed the test dataset with an accuracy of 82.35%. This iteration indicated to us that having dropout layers can be a good tactic to improve the model’s predictive performance.

In iteration Take5, the performance of the Take5 model with the default parameters achieved a maximum accuracy score of 81.41% on the validation dataset after training for 50 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take5 model processed the training dataset with an accuracy of 91.60% after 100 epochs. The same model processed the test dataset with an accuracy of 84.43%. This iteration indicated to us that applying data augmentation can be a good tactic to improve the model’s predictive performance.

In iteration Take6, the performance of the Take6 model with the default parameters achieved a maximum accuracy score of 84.12% on the validation dataset after training for 100 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take6 model processed the training dataset with an accuracy of 87.21% after 200 epochs. The same model processed the test dataset with an accuracy of 85.79%. This iteration indicated to us that applying dropout layers and data augmentation together can be a good tactic to improve the model’s predictive performance.

In this iteration, the performance of the Take7 model with the default parameters achieved a maximum accuracy score of 87.15% on the validation dataset after training for 200 epochs. We can see from the graph that the accuracy and loss curves for the training and validation sets were moving in the same direction and converged well. After increasing the number of epochs, the Take7 model processed the training dataset with an accuracy of 90.15% after 400 epochs. The same model processed the test dataset with an accuracy of 89.02%. This iteration indicated to us that applying dropout layers, data augmentation, and batch normalization together can be a good tactic to improve the model’s predictive performance.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow achieved a satisfactory result and should be considered for future modeling activities.

Dataset Used: The CIFAR-10 Dataset

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://www.cs.toronto.edu/~kriz/cifar.html

One potential source of performance benchmarks: https://machinelearningmastery.com/how-to-develop-a-cnn-from-scratch-for-cifar-10-photo-classification/

The HTML formatted report can be found here on GitHub.
