Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Ames, Iowa housing prices dataset presents a regression problem where we are trying to predict the value of a continuous variable.
INTRODUCTION: Many factors can influence a home’s purchase price. The Ames Housing dataset contains 79 explanatory variables describing nearly every aspect of residential homes in Ames, Iowa. The goal is to predict the final sale price of each home.
In iteration Take1, we established the baseline root mean squared error (RMSE) for the subsequent iterations of modeling.
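As an illustration, here is a minimal sketch of how such a baseline RMSE can be computed with scikit-learn; the file path, fold count, random seed, and the simplified one-hot/zero-fill preparation are assumptions, not details taken from the original report.

```python
# Minimal sketch of establishing a baseline RMSE with 10-fold cross-validation.
# The preprocessing (one-hot encoding, zero-fill) is a simplification assumed
# here; it is not necessarily the preparation used in the original report.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("train.csv")                     # Kaggle training file (path assumed)
y_train = df["SalePrice"]
X_train = pd.get_dummies(df.drop(columns=["Id", "SalePrice"])).fillna(0)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(GradientBoostingRegressor(random_state=7),
                         X_train, y_train,
                         scoring="neg_mean_squared_error", cv=kfold)
print("baseline RMSE: %.0f" % np.sqrt(-scores).mean())
```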
In iteration Take2, we converted some of the categorical variables from nominal to ordinal and observed the effect of that change on model performance.
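The sketch below shows the kind of nominal-to-ordinal conversion described above, using quality columns from the Ames data dictionary; which columns were actually converted in Take2 is an assumption here.

```python
# Sketch of converting nominal quality columns to an ordinal integer scale.
# The category order follows the Ames data dictionary (Po < Fa < TA < Gd < Ex);
# the specific columns converted in Take2 are assumed, not confirmed.
import pandas as pd

quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

df = pd.read_csv("train.csv")
for col in ["ExterQual", "KitchenQual", "HeatingQC"]:   # example columns
    df[col] = df[col].map(quality_scale)                # nominal -> ordinal
```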
In iteration Take3, we examined the feature selection technique of attribute importance ranking using the Gradient Boosting algorithm. By selecting only the most important attributes, we maintained an RMSE comparable to the baseline.
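A Take3-style importance ranking could look like the following sketch, which keeps features up to the 99% cumulative importance cutoff mentioned in the analysis below; the preprocessing is the same simplified assumption as in the baseline sketch.

```python
# Sketch of ranking attributes by Gradient Boosting feature importance and
# keeping those that account for ~99% of the total importance (Take3-style).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("train.csv")
y_train = df["SalePrice"]
X_train = pd.get_dummies(df.drop(columns=["Id", "SalePrice"])).fillna(0)

model = GradientBoostingRegressor(random_state=7).fit(X_train, y_train)
order = np.argsort(model.feature_importances_)[::-1]      # descending importance
cutoff = np.searchsorted(np.cumsum(model.feature_importances_[order]), 0.99) + 1
X_train_reduced = X_train.iloc[:, order[:cutoff]]         # top attributes only
```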
In this Take4 iteration, we will examine the feature selection technique of recursive feature elimination (RFE) using the Random Forest algorithm. By selecting no more than 35 attributes, we hope to maintain an RMSE comparable to the baseline.
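Below is a minimal sketch of this iteration's RFE step with scikit-learn, capped at the 35 attributes stated above; the estimator settings and the simplified preprocessing are assumptions.

```python
# Sketch of recursive feature elimination (RFE) driven by a Random Forest,
# selecting at most 35 attributes as described above. The estimator settings
# and the simplified preprocessing are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

df = pd.read_csv("train.csv")
y_train = df["SalePrice"]
X_train = pd.get_dummies(df.drop(columns=["Id", "SalePrice"])).fillna(0)

selector = RFE(RandomForestRegressor(n_estimators=100, random_state=7),
               n_features_to_select=35)
selector.fit(X_train, y_train)
selected = X_train.columns[selector.support_]   # the 35 surviving attributes
X_train_selected = X_train[selected]
```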
ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 32,826. Two algorithms (ElasticNet and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the best overall result with an RMSE of 23,246. Using the optimized parameters, the Gradient Boosting algorithm processed the test dataset with an RMSE of 23,859, slightly higher than the RMSE from the training data.
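The kind of tuning trial described above might be sketched with a grid search as follows; the parameter grid shown is purely illustrative, not the grid used in the original report.

```python
# Hypothetical sketch of a Gradient Boosting tuning trial via grid search;
# the parameter grid is illustrative, not the one used in the report.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("train.csv")
y_train = df["SalePrice"]
X_train = pd.get_dummies(df.drop(columns=["Id", "SalePrice"])).fillna(0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=7),
    param_grid={"n_estimators": [100, 300, 500],
                "max_depth": [3, 5, 7],
                "learning_rate": [0.05, 0.1]},
    scoring="neg_mean_squared_error", cv=10)
search.fit(X_train, y_train)
print(search.best_params_, (-search.best_score_) ** 0.5)  # tuned RMSE
```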
From iteration Take2, Gradient Boosting achieved an RMSE of 23,466 with the training dataset and processed the test dataset with an RMSE of 23,118. Converting the nominal variables into ordinal ones did not have a material impact on prediction accuracy in either direction.
From iteration Take3, Gradient Boosting achieved an RMSE of 24,132 with the training dataset and processed the test dataset with an RMSE of 23,918. At a cumulative importance threshold of 99%, the attribute importance technique eliminated 20 of the 64 total attributes. The remaining 44 attributes produced a model with an RMSE comparable to the baseline model.
From this iteration, Gradient Boosting achieved an RMSE of 24,035 with the training dataset and processed the test dataset with an RMSE of 23,958. The RFE technique eliminated 36 of the 64 total attributes. The remaining 28 attributes produced a model with an RMSE comparable to the baseline model.
CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall results on both the training and test datasets. For this dataset, Gradient Boosting should be considered for further modeling.
Dataset Used: Kaggle Competition – House Prices: Advanced Regression Techniques
Dataset ML Model: Regression with numerical and categorical attributes
Dataset Reference: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
One potential source of performance benchmarks: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The HTML-formatted report can be found on GitHub.