Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Movie Review Sentiment Analysis dataset poses a binary classification problem: we attempt to predict whether a review is positive or negative.
Additional Notes: This script is a replication, with some small modifications, of Dr. Jason Brownlee’s blog post, How to Prepare Movie Review Data for Sentiment Analysis. I plan to leverage Dr. Brownlee’s tutorial and build a TensorFlow-based text classification notebook template for future modeling of similar datasets.
In this Take1 iteration, we will construct the necessary code modules to handle the tasks of loading text, cleaning text, and vocabulary development.
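As a rough illustration of the cleaning and vocabulary steps named above, here is a minimal sketch using only the Python standard library. The function names `clean_doc` and `build_vocab`, the tiny `STOP_WORDS` set, and the `min_count` threshold are assumptions made for this example and are not taken from the notebook itself; a fuller stop-word list (for instance, NLTK's) could be substituted.

```python
import string
from collections import Counter

# Small illustrative stop-word list (an assumption; not the list used in the notebook).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "this", "that", "was"}


def clean_doc(text):
    """Turn raw review text into a list of cleaned tokens."""
    # Lowercase the text and split on whitespace.
    tokens = text.lower().split()
    # Remove punctuation characters from each token.
    table = str.maketrans("", "", string.punctuation)
    tokens = [tok.translate(table) for tok in tokens]
    # Keep alphabetic tokens longer than one character that are not stop words.
    return [
        tok for tok in tokens
        if tok.isalpha() and len(tok) > 1 and tok not in STOP_WORDS
    ]


def build_vocab(docs, min_count=2):
    """Count tokens across all documents and keep those seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        counts.update(clean_doc(doc))
    return {word for word, n in counts.items() if n >= min_count}


sample_docs = [
    "This film was a great, great surprise!",
    "The plot of this film is thin.",
]
vocab = build_vocab(sample_docs, min_count=2)
print(sorted(vocab))  # ['film', 'great']
```

Filtering by a minimum count keeps the vocabulary small by dropping words that appear only once or twice across the corpus; the right threshold depends on the dataset and is a tuning choice.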
INTRODUCTION: The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing. The dataset comprises 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the ‘polarity dataset.’
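The loading step could look something like the sketch below. It assumes the extracted archive contains one sub-directory of `.txt` files per class, named `pos` and `neg`; the helper name `load_reviews` and the file name `cv000.txt` in the demonstration are hypothetical. The demonstration writes two tiny files to a temporary directory so the block runs without the real dataset.

```python
import tempfile
from pathlib import Path


def load_reviews(root):
    """Return (texts, labels), with label 0 for negative and 1 for positive reviews."""
    texts, labels = [], []
    for subdir, label in (("neg", 0), ("pos", 1)):
        # Sort the file paths so the loading order is reproducible.
        for path in sorted(Path(root, subdir).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(label)
    return texts, labels


# Demonstration on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for subdir, body in (("neg", "a dull film"), ("pos", "a fine film")):
        folder = Path(tmp, subdir)
        folder.mkdir()
        Path(folder, "cv000.txt").write_text(body, encoding="utf-8")
    texts, labels = load_reviews(tmp)

print(labels)  # [0, 1]
```

With the real data, the call would point at wherever the archive was extracted; that directory should then yield 1,000 texts per class.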
ANALYSIS: Deep learning modeling results will be reported in future iterations.
CONCLUSION: In this Take1 iteration, we were able to construct the necessary code modules to handle the tasks of loading text, cleaning text, and vocabulary development.
Dataset Used: Movie Review Sentiment Analysis Dataset
Dataset ML Model: Binary class text classification with text-oriented features
Dataset Reference: https://www.cs.cornell.edu/home/llee/papers/cutsent.pdf and http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
One potential source of performance benchmarks: https://machinelearningmastery.com/prepare-movie-review-data-sentiment-analysis/
The HTML-formatted report is available on GitHub.