SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Avila Bible Identification dataset presents a multi-class classification problem in which we attempt to predict one of several (more than two) possible outcomes.

INTRODUCTION: The Avila dataset includes 800 images extracted from the “Avila Bible,” a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The paleographic analysis of the manuscript has identified the presence of 12 transcribers; however, the transcribers did not each copy the same number of pages, so the classes are unevenly represented. The prediction task is to associate each pattern with one of the 12 transcribers, labeled A, B, C, D, E, F, G, H, I, W, X, and Y. The research team normalized the data using the Z-normalization method and divided the dataset into two portions, training and test. The training set contains 10,430 samples, while the test set contains 10,437 samples.
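As a rough illustration of the Z-normalization step mentioned above, the sketch below rescales a single feature column to zero mean and unit standard deviation. The function name `z_normalize`, the sample values, and the use of the population standard deviation are assumptions made for illustration; the dataset authors' exact procedure is not described here.

```python
from statistics import mean, pstdev


def z_normalize(column):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column has no spread; map it to all zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical raw values for one feature (not taken from the Avila data).
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = z_normalize(raw)
print([round(v, 2) for v in scaled])
```

After this transformation, every feature is expressed in units of its own standard deviation, which keeps features with large raw ranges from dominating the others.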

ANALYSIS: The preliminary TensorFlow models achieved an average benchmark accuracy of 94.27%. When we processed the test dataset with the final model, it achieved an accuracy score of 96.25%.
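For reference, the accuracy score quoted in these summaries is the fraction of samples whose predicted label matches the true label. The sketch below shows that calculation for a multi-class problem; the function name `accuracy` and the label lists are hypothetical and are not drawn from the project's results.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical transcriber labels, for illustration only.
y_true = ["A", "A", "F", "E", "X", "A"]
y_pred = ["A", "F", "F", "E", "X", "A"]
score = accuracy(y_true, y_pred)
print(f"accuracy: {score:.2%}")
```

Because the transcribers are unevenly represented, a single accuracy figure can hide weak performance on the rarer classes, so a per-class breakdown is a useful companion metric.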

CONCLUSION: In this iteration, the TensorFlow model appeared to be suitable for modeling this dataset.

Dataset Used: Avila Bible Dataset

Dataset ML Model: Multi-Class classification with numerical features

Dataset Reference: https://archive-beta.ics.uci.edu/ml/datasets/avila

One source of potential performance benchmarks: https://www.sciencedirect.com/science/article/abs/pii/S0952197618300721

The HTML formatted report can be found here on GitHub.

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Avila Bible Identification dataset presents a multi-class classification problem in which we attempt to predict one of several (more than two) possible outcomes.

INTRODUCTION: The Avila dataset includes 800 images extracted from the “Avila Bible,” a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The paleographic analysis of the manuscript has identified the presence of 12 transcribers; however, the transcribers did not each copy the same number of pages, so the classes are unevenly represented. The prediction task is to associate each pattern with one of the 12 transcribers, labeled A, B, C, D, E, F, G, H, I, W, X, and Y. The research team normalized the data using the Z-normalization method and divided the dataset into two portions, training and test. The training set contains 10,430 samples, while the test set contains 10,437 samples.

ANALYSIS: The preliminary XGBoost model achieved a benchmark accuracy of 86.67%. After a series of tuning trials, the final model processed the training dataset with an accuracy score of 99.79%. When we processed the test dataset with the final model, it achieved an accuracy score of 99.81%.

CONCLUSION: In this iteration, XGBoost appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Avila Bible Dataset

Dataset ML Model: Multi-Class classification with numerical features

Dataset Reference: https://archive-beta.ics.uci.edu/ml/datasets/avila

One source of potential performance benchmarks: https://www.sciencedirect.com/science/article/abs/pii/S0952197618300721

The HTML formatted report can be found here on GitHub.

SUMMARY: The project aims to construct a time series prediction model and to document the end-to-end steps using a template. The Water Utility Consumers dataset presents a univariate time series problem in which we attempt to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly number of water utility consumers in London, United Kingdom. The dataset describes a time series of utility accounts over 11 years (1983-1994), and there are 216 observations. We used the first 80% of the observations for training and testing various models while holding back the remaining observations for validating the final model.
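The 80/20 holdout described above has to respect time order, so the split is a cut point and not a random shuffle. A minimal sketch follows, assuming a placeholder series of 216 integers in place of the real observations; the function name `chronological_split` is hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series at a cut point without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Placeholder values standing in for the 216 monthly observations.
series = list(range(216))
modeling, validation = chronological_split(series)
print(len(modeling), len(validation))
```

Keeping the most recent observations for validation means the final model is always scored on data that comes after everything it was fitted on.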

ANALYSIS: The baseline persistence model yielded an RMSE of 2828. The CNN model processed the same test data with an RMSE of 2395, which was better than the baseline model as expected. In an earlier ARIMA modeling experiment, the best ARIMA model with non-seasonal order of (0, 1, 1) and seasonal order of (0, 0, 1, 12) processed the validation data with an RMSE of 2260.
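A persistence baseline simply forecasts each step as the most recently observed value, and RMSE summarizes the resulting errors in the units of the series. The sketch below shows one way to compute both with walk-forward evaluation; the helper names and the numbers are hypothetical and are not the water consumer data.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def persistence_forecast(history, test):
    """Walk forward through test, predicting each value as the previous one."""
    predictions = []
    last = history[-1]
    for observed in test:
        predictions.append(last)
        last = observed  # the true value becomes known before the next step
    return predictions


# Hypothetical monthly counts, for illustration only.
history = [100, 110, 120]
test = [130, 125, 140]
predictions = persistence_forecast(history, test)
baseline_rmse = rmse(test, predictions)
print(predictions, round(baseline_rmse, 3))
```

Any candidate model is worth keeping only if its RMSE on the same data is lower than this baseline.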

CONCLUSION: For this dataset, the TensorFlow CNN model achieved an acceptable result, and we should consider using TensorFlow for further modeling.

Dataset Used: Number of water consumers in London, United Kingdom, Jan 1983 through April 1994.

Dataset ML Model: Time series forecast with numerical attribute.

Dataset Reference: Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang./tsdl/.

The HTML formatted report can be found here on GitHub.

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Avila Bible Identification dataset presents a multi-class classification problem in which we attempt to predict one of several (more than two) possible outcomes.

INTRODUCTION: The Avila dataset includes 800 images extracted from the “Avila Bible,” a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The paleographic analysis of the manuscript has identified the presence of 12 transcribers; however, the transcribers did not each copy the same number of pages, so the classes are unevenly represented. The prediction task is to associate each pattern with one of the 12 transcribers, labeled A, B, C, D, E, F, G, H, I, W, X, and Y. The research team normalized the data using the Z-normalization method and divided the dataset into two portions, training and test. The training set contains 10,430 samples, while the test set contains 10,437 samples.

ANALYSIS: The preliminary Gradient Boosted Trees model achieved a benchmark accuracy of 99.99% on the training dataset. When we applied the finalized model to the test dataset, it achieved an accuracy score of 99.87%.

CONCLUSION: In this iteration, the TensorFlow Decision Forests model appeared to be suitable for modeling this dataset.

Dataset Used: Avila Bible Dataset

Dataset ML Model: Multi-Class classification with numerical features

Dataset Reference: https://archive-beta.ics.uci.edu/ml/datasets/avila

One source of potential performance benchmarks: https://www.sciencedirect.com/science/article/abs/pii/S0952197618300721

The HTML formatted report can be found here on GitHub.

ANALYSIS: The machine learning algorithms achieved an average benchmark accuracy of 85.51% on the training dataset. We then selected Bagging Classifier as the final model because it processed the training dataset with a final accuracy score of 98.53%. When we processed the test dataset with the final model, it achieved an accuracy score of 99.20%.
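The benchmark-then-select step described above can be sketched as follows: average the accuracy of every candidate algorithm to get a benchmark, then pick the candidate with the best mean score. The model names and fold scores below are placeholders invented for illustration; they are not the project's algorithms or results.

```python
from statistics import mean

# Hypothetical cross-validation accuracies per candidate model.
cv_scores = {
    "model_a": [0.80, 0.82, 0.84],
    "model_b": [0.90, 0.92, 0.94],
    "model_c": [0.70, 0.72, 0.74],
}

# Mean accuracy of each candidate across its folds.
mean_scores = {name: mean(folds) for name, folds in cv_scores.items()}

# Benchmark: the average of the candidates' mean accuracies.
benchmark = mean(mean_scores.values())

# Final model: the candidate with the highest mean accuracy.
best_model = max(mean_scores, key=mean_scores.get)

print(f"benchmark accuracy: {benchmark:.2%}")
print(f"selected model: {best_model} ({mean_scores[best_model]:.2%})")
```

The selected model is then tuned and scored once on the held-out test set, which is the final figure reported in the analysis.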

CONCLUSION: In this iteration, the Bagging Classifier appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Avila Bible Dataset

Dataset ML Model: Multi-Class classification with numerical features

Dataset Reference: https://archive-beta.ics.uci.edu/ml/datasets/avila

The HTML formatted report can be found here on GitHub.

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Raisin Grains Identification dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: In this study, the research team developed a computer vision system to classify two different varieties of raisin grown in Turkey. The dataset contains the measurements for 900 raisin grain images. The team extracted seven major morphological features from the image of each raisin grain.

ANALYSIS: The preliminary TensorFlow model achieved a benchmark accuracy of 86.05%. When we processed the test dataset with the final model, it achieved an accuracy score of 91.11%.

CONCLUSION: In this iteration, the TensorFlow model appeared to be suitable for modeling this dataset.

Dataset Used: Raisin Dataset

Dataset ML Model: Binary classification with numerical features

Dataset Reference: https://www.muratkoklu.com/datasets/

One source of potential performance benchmarks: https://doi.org/10.30855/gmbd.2020.03.03

The HTML formatted report can be found here on GitHub.

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Raisin Grains Identification dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: In this study, the research team developed a computer vision system to classify two different varieties of raisin grown in Turkey. The dataset contains the measurements for 900 raisin grain images. The team extracted seven major morphological features from the image of each raisin grain.

ANALYSIS: The preliminary XGBoost model achieved a benchmark accuracy of 85.92%. After a series of tuning trials, the final model processed the training dataset with an accuracy score of 86.17%. When we processed the test dataset with the final model, it achieved an accuracy score of 86.66%.

CONCLUSION: In this iteration, the XGBoost model appeared to be suitable for modeling this dataset.

Dataset Used: Raisin Dataset

Dataset ML Model: Binary classification with numerical features

Dataset Reference: https://www.muratkoklu.com/datasets/

One source of potential performance benchmarks: https://doi.org/10.30855/gmbd.2020.03.03

The HTML formatted report can be found here on GitHub.

Thanks to Dr. Jason Brownlee’s suggestions on creating a machine learning template, I have pulled together a project template that can be used to support time series analysis using the TensorFlow framework and Python.

Version 1 of the TensorFlow time series template replicates many code segments within Dr. Brownlee’s blog post “Deep Learning Models for Univariate Time Series Forecasting”. The plan is to build a script for modeling future projects by adapting the example workflow presented in the blog.

The TensorFlow time series template is on the Analytics Project Templates page.
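Before a neural network can be trained on a univariate series, the series is usually reframed as supervised samples: a window of past values as input and the next value as the target. The sketch below shows that reframing in plain Python. The helper name `make_windows` and the sample values are hypothetical and are not taken from the template or from Dr. Brownlee's post.

```python
def make_windows(series, n_input):
    """Turn a univariate series into (input window, next value) pairs."""
    if n_input < 1 or n_input >= len(series):
        raise ValueError("n_input must be at least 1 and shorter than the series")
    inputs, targets = [], []
    for start in range(len(series) - n_input):
        end = start + n_input
        inputs.append(series[start:end])  # the n_input most recent values
        targets.append(series[end])       # the value to forecast
    return inputs, targets


# Hypothetical series, for illustration only.
series = [10, 20, 30, 40, 50, 60]
X, y = make_windows(series, n_input=3)
for window, target in zip(X, y):
    print(window, "->", target)
```

The resulting lists can be converted to arrays and reshaped to whatever input layout the chosen model type expects.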

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Raisin Grains Identification dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: In this study, the research team developed a computer vision system to classify two different varieties of raisin grown in Turkey. The dataset contains the measurements for 900 raisin grain images. The team extracted seven major morphological features from the image of each raisin grain.

ANALYSIS: The preliminary Random Forest model achieved a benchmark accuracy of 96.05% on the training dataset. When we applied the finalized model to the test dataset, it achieved an accuracy score of 86.67%.

CONCLUSION: In this iteration, the TensorFlow Decision Forests model appeared to be suitable for modeling this dataset.

Dataset Used: Raisin Dataset

Dataset ML Model: Binary classification with numerical features

Dataset Reference: https://www.muratkoklu.com/datasets/

One source of potential performance benchmarks: https://doi.org/10.30855/gmbd.2020.03.03

The HTML formatted report can be found here on GitHub.

ANALYSIS: The machine learning algorithms achieved an average benchmark accuracy of 84.11% on the training dataset. We then selected Logistic Regression as the final model because it processed the training dataset with a final accuracy score of 86.29%. When we processed the test dataset with the final model, it achieved an accuracy score of 91.11%.

CONCLUSION: In this iteration, the Logistic Regression model appeared to be suitable for modeling this dataset.

Dataset Used: Raisin Dataset

Dataset ML Model: Binary classification with numerical features

Dataset Reference: https://www.muratkoklu.com/datasets/

One source of potential performance benchmarks: https://doi.org/10.30855/gmbd.2020.03.03

The HTML formatted report can be found here on GitHub.
