Feature Selection for Kaggle Tabular Playground Series 2021 Jan Using Python and Scikit-learn

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: Feature selection involves picking the subset of features that are most relevant to the target variable. This can reduce the complexity of the model and the resources required for training and inference. The Kaggle Tabular Playground Series Jan 2021 dataset poses a regression problem: we are trying to predict the value of a continuous target variable.

INTRODUCTION: In this notebook, we will walk through several techniques for performing feature selection on the dataset. We will use the Scikit-learn library, which provides a range of machine learning algorithms along with built-in implementations of several feature selection methods. We will then compare the methods to see which one works best for this particular dataset.
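As a rough illustration of what such a comparison can look like, the sketch below scores three Scikit-learn selectors by cross-validated RMSE. It is not the notebook's actual code: the synthetic data from make_regression stands in for the competition file, and the choice of selectors, the LinearRegression estimator, the 14-feature shape, and k=8 are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the competition data (assumption, not the real dataset).
X, y = make_regression(
    n_samples=500, n_features=14, n_informative=6, noise=10.0, random_state=42
)

# Candidate selectors; each keeps 8 features (an arbitrary choice for illustration).
selectors = {
    "SelectKBest": SelectKBest(score_func=f_regression, k=8),
    "SelectFromModel": SelectFromModel(
        LinearRegression(), max_features=8, threshold=-np.inf
    ),
    "RFE": RFE(estimator=LinearRegression(), n_features_to_select=8),
}

results = {}
for name, selector in selectors.items():
    # Putting the selector inside the pipeline refits it on each CV training fold,
    # so the held-out fold never influences which features are chosen.
    pipe = Pipeline([("select", selector), ("model", LinearRegression())])
    scores = cross_val_score(
        pipe, X, y, scoring="neg_root_mean_squared_error", cv=5
    )
    results[name] = -scores.mean()  # flip the sign back to a positive RMSE

# Print the methods from lowest to highest RMSE.
for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {rmse:.4f}")
```

Wrapping each selector in a Pipeline keeps the comparison fair, because every method is evaluated with the same downstream model and the same folds.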

ANALYSIS: The feature selection technique that yielded the best (lowest) RMSE was Recursive Feature Elimination (RFE). Its RMSE on the training dataset was 0.7082.
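For readers unfamiliar with RFE, here is a minimal sketch of how it can be run and how a training RMSE can be computed afterwards. The data is synthetic and the estimator and number of features to keep are assumptions, so the printed value is unrelated to the 0.7082 reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the training data (assumption, not the real dataset).
X, y = make_regression(
    n_samples=500, n_features=14, n_informative=6, noise=10.0, random_state=42
)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=6)
rfe.fit(X, y)

selected = np.flatnonzero(rfe.support_)  # column indices of the kept features
X_sel = rfe.transform(X)                 # training matrix reduced to those columns

# Refit a model on the selected columns and compute RMSE on the training data.
model = LinearRegression().fit(X_sel, y)
train_rmse = np.sqrt(mean_squared_error(y, model.predict(X_sel)))

print("Selected feature indices:", selected.tolist())
print(f"Training RMSE: {train_rmse:.4f}")
```

Note that an RMSE measured on the training data tends to be optimistic; a hold-out set or cross-validation gives a better estimate of how the selected features generalize.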

CONCLUSION: In this iteration, the RFE technique appeared to be a suitable feature selection method for this dataset. We should follow up on the feature selection exercise by modeling the whole dataset using the selected attributes.

Dataset Used: Kaggle Tabular Playground Series 2021 Jan Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-jan-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-feb-2021/leaderboard

The HTML-formatted report can be found here on GitHub.