Regression Model for Online News Popularity Using Python Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset poses a regression problem: we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict each article's popularity level in social networks. The dataset does not contain the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Data source credit goes to K. Fernandes, P. Vinagre, and P. Cortez for making the dataset and benchmarking information available. Reference: K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 12645. Two algorithms (ElasticNet and eXtreme Gradient Boosting) achieved the best RMSE metrics after the first round of modeling. After a series of tuning trials, eXtreme Gradient Boosting turned in the top overall result with an RMSE of 11008. Using the optimized parameters, the eXtreme Gradient Boosting model scored an RMSE of 12947 on the test dataset. A test error noticeably higher than the tuned training error suggests a variance (overfitting) problem. We need to gather more data or apply regularization techniques in training to narrow that gap before deploying the model in production.
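To make the variance concern concrete, here is a minimal sketch of how RMSE is computed and how the gap between the two scores quoted in the ANALYSIS paragraph can be measured. The helper names `rmse` and `generalization_gap` are hypothetical and are not taken from the project's code. The sketch uses only the Python standard library.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def generalization_gap(train_rmse, test_rmse):
    """Return the absolute and relative increase in RMSE from training to test."""
    absolute = test_rmse - train_rmse
    return absolute, absolute / train_rmse


# The two figures quoted in the ANALYSIS paragraph.
TUNED_TRAIN_RMSE = 11008  # best RMSE after the tuning trials
TEST_RMSE = 12947         # RMSE on the test dataset with the tuned parameters

gap_abs, gap_rel = generalization_gap(TUNED_TRAIN_RMSE, TEST_RMSE)
print(f"RMSE gap: {gap_abs} ({gap_rel:.1%} above the tuned training score)")
```

With the reported figures, the test RMSE is 1939 higher than the tuned score, an increase of roughly 17.6%. A gap of that size is the kind of signal that motivates trying more data or stronger regularization.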

CONCLUSION: For this iteration, the eXtreme Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, eXtreme Gradient Boosting should be considered for further modeling.

Dataset Used: Online News Popularity Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference:

The HTML formatted report can be found here on GitHub.