Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Superconductor Critical Temperature dataset presents a regression problem in which we are trying to predict the value of a continuous variable.
INTRODUCTION: The research team wishes to create a statistical model for predicting the superconducting critical temperature based on features extracted from the superconductor's chemical formula. The analysis also seeks to identify the features that contribute most to the model's predictive accuracy.
From previous iterations, we constructed and tuned several classic machine learning models using the Scikit-Learn library and recorded the best results each model could achieve.
In this Take1 iteration, we will construct and tune an XGBoost model. We will then apply the tuned XGBoost model to a test dataset and record the best result the model can achieve.
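The construct-and-tune step described above can be sketched as a grid search over boosting hyperparameters. This is a minimal, hedged illustration on synthetic data, not the project's actual pipeline: scikit-learn's GradientBoostingRegressor stands in for XGBRegressor (the GridSearchCV tuning pattern is identical for both), and the feature matrix, sample sizes, and parameter grid are all assumptions for demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the extracted superconductor features;
# the real project would load the UCI Superconductivity dataset here.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Illustrative parameter grid; the project's actual grid is not documented here.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=7),
    param_grid,
    scoring="neg_root_mean_squared_error",  # RMSE is the report's metric
    cv=3,
)
search.fit(X_train, y_train)
cv_rmse = -search.best_score_  # negate: scikit-learn maximizes the scorer
print("Best params:", search.best_params_)
print(f"Cross-validated RMSE: {cv_rmse:.2f}")
```

Swapping in `xgboost.XGBRegressor` requires only changing the estimator passed to GridSearchCV; the scoring and fitting code is unchanged.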
ANALYSIS: From previous iterations, the Extra Trees model turned in the best overall result and achieved an RMSE metric of 9.56. By using the optimized parameters, the Extra Trees algorithm processed the test dataset with an RMSE of 9.32.
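For reference, the Extra Trees comparison above amounts to cross-validating an ExtraTreesRegressor under the RMSE metric. The sketch below shows that pattern on synthetic data; the estimator settings are assumptions, not the project's tuned parameters, so the resulting RMSE will not match the figures reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real run used the superconductor features.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=7)

scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=100, random_state=7),
    X, y,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
et_rmse = -scores.mean()  # average RMSE across the 5 folds
print(f"Extra Trees cross-validated RMSE: {et_rmse:.2f}")
```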
In this Take1 iteration, the baseline performance of the XGBoost algorithm achieved an RMSE benchmark of 12.88. After a series of tuning trials, the XGBoost model processed the validation dataset with an RMSE score of 9.88. When we applied the XGBoost model to the previously unseen test dataset, we obtained an RMSE score of 9.06.
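Scoring the tuned model on the previously unseen test set, as described above, reduces to fitting on the training split and computing RMSE on the held-out split. A minimal sketch, again with GradientBoostingRegressor standing in for the tuned XGBoost model and synthetic data standing in for the dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and assumed hyperparameters for illustration.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=7)
model.fit(X_train, y_train)

# RMSE on the held-out test split, the metric used throughout the report.
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {test_rmse:.2f}")
```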
CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset, delivering the best test-set RMSE so far. We should consider using the algorithm for further modeling.
Dataset Used: Superconductivity Data Set
Dataset ML Model: Regression with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data
One potential source of performance benchmarks: https://doi.org/10.1016/j.commatsci.2018.07.052
The HTML formatted report can be found here on GitHub.