Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Disaster Tweets Classification dataset presents a binary classification problem: we attempt to predict whether or not each Tweet is about a real disaster.
INTRODUCTION: Twitter has become an important communication channel in times of emergency. The ubiquitous nature of smartphones enables people to announce an emergency they are observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter. In this practice Kaggle competition, we want to build a machine learning model that predicts which Tweets are about real disasters and which ones are not. This dataset was created by Figure-Eight and shared initially on their ‘Data for Everyone’ website.
In iteration Take1, we deployed a bag-of-words model to classify the Tweets. We also made predictions on Kaggle’s test dataset and submitted the results for evaluation.
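As a rough illustration of the bag-of-words idea, the sketch below turns each text into a vector of token counts over a shared vocabulary. It is not the Take1 implementation: the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`), the simple regex tokenizer, and the two sample sentences are all assumptions made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(texts):
    """Map every distinct token in the corpus to a column index (sorted for determinism)."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector for the text; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Made-up example texts, not taken from the competition data.
sample_texts = ["Forest fire near the town", "I love the fire emoji"]
vocab = build_vocabulary(sample_texts)
vectors = [bag_of_words(text, vocab) for text in sample_texts]
```

Each vector has one position per vocabulary word, so word order is discarded. Vectors of this kind could then be fed to a dense neural network as fixed-length inputs.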
In this Take2 iteration, we will deploy a word-embedding model to classify the Tweets. We will also submit the test predictions to Kaggle and obtain the performance score for the model.
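A word-embedding model usually expects each text as a fixed-length sequence of integer token ids rather than a count vector. The sketch below shows only that preparation step, under our own assumptions: index 0 is reserved for padding, index 1 for unknown tokens, and the helper names and sample sentence are hypothetical. The embedding layer that would consume these ids is not shown, and this is not necessarily how the Take2 notebook prepares its inputs.

```python
import re

PAD, UNK = 0, 1  # assumed reserved ids: padding and out-of-vocabulary tokens


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(texts):
    """Assign each distinct token an integer id, starting after the reserved ids."""
    index = {}
    for text in texts:
        for tok in tokenize(text):
            if tok not in index:
                index[tok] = len(index) + 2
    return index


def encode(text, index, max_len):
    """Convert text to ids, truncate to max_len, and right-pad with PAD."""
    ids = [index.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Made-up example text, not taken from the competition data.
word_index = build_index(["Forest fire near the town"])
encoded = encode("Fire near the river", word_index, max_len=6)
```

Unlike the bag-of-words vector, this representation keeps word order, and an embedding layer would map each id to a learned dense vector.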
ANALYSIS: In iteration Take1, the bag-of-words model achieved an average accuracy of 75.49% after 20 epochs with five iterations of cross-validation. Furthermore, the final model scored an accuracy of 75.02% on the test dataset.
In this Take2 iteration, the word-embedding model achieved an average accuracy of 72.45% after 20 epochs with five iterations of cross-validation. Furthermore, the final model scored an accuracy of 74.65% on the test dataset.
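The average accuracy figures above come from cross-validation. A minimal sketch of how such an average could be computed is shown below, assuming that "five iterations" means five folds. The function names, the majority-class stub standing in for model training, and the toy labels are all hypothetical and unrelated to the reported scores.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Run fit_and_score on every fold and return (average score, per-fold scores)."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return mean(scores), scores


# Toy labels and a stub "model" (majority-class baseline), for illustration only.
toy_labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]


def majority_baseline(train_idx, val_idx):
    """Predict the most common training label and return validation accuracy."""
    train = [toy_labels[i] for i in train_idx]
    majority = max(set(train), key=train.count)
    return sum(toy_labels[i] == majority for i in val_idx) / len(val_idx)


average_accuracy, fold_scores = cross_validate(len(toy_labels), 5, majority_baseline)
```

In practice the stub would be replaced by a function that trains the network for the chosen number of epochs on the training fold and evaluates it on the validation fold. This sketch also splits the samples in their original order, whereas a real run would typically shuffle them first.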
CONCLUSION: In this modeling iteration, the word-embedding TensorFlow model did not do as well as the bag-of-words model, trailing it in both cross-validation accuracy (72.45% versus 75.49%) and test accuracy (74.65% versus 75.02%). However, we should continue to experiment with both natural language processing techniques for further modeling.
Dataset Used: Disaster Tweets Classification
Dataset ML Model: Binary class text classification with text-oriented features
Dataset Reference: https://www.kaggle.com/c/nlp-getting-started/
The HTML-formatted report can be found here on GitHub.