If you are new to Python machine learning, as I am, you might find the current Kaggle competition, “Santander Customer Transaction Prediction”, interesting.
The competition is essentially a binary classification problem with a decently large dataset (200 attributes and 200,000 rows of training data). I have not participated in a Kaggle competition before, so I will use this one to get some learning under my belt.
I ran the training data through a list of machine learning algorithms (see below), iterating through four stages. This blog post serves as the meta post summarizing the progress.
The current plan, with milestones, is as follows:
Stage 1: Gather the Baseline Performance.
- LogisticRegression: completed and posted on Monday 25 February 2019
- DecisionTreeClassifier: completed and posted on Wednesday 27 February 2019
- KNeighborsClassifier: completed and posted on Friday 1 March 2019
- BaggingClassifier: completed and posted on Sunday 3 March 2019
- RandomForestClassifier: completed and posted on Monday 4 March 2019
- ExtraTreesClassifier: completed and posted on Wednesday 6 March 2019
- GradientBoostingClassifier: completed and posted on Friday 8 March 2019
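Each baseline run above follows the same basic pattern: fit one classifier and score it with cross-validated AUC (the competition's metric). Here is a minimal sketch of that pattern, using synthetic data shaped like the Santander set in place of the real CSVs; the solver and fold count are illustrative choices, not necessarily what my scripts use:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Santander data: 200 attributes, imbalanced target
X, y = make_classification(n_samples=2000, n_features=200,
                           weights=[0.9, 0.1], random_state=42)

# One baseline: 5-fold cross-validated AUC for LogisticRegression
model = LogisticRegression(solver="liblinear")
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("baseline AUC: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

Swapping `LogisticRegression` for any of the other classifiers in the list gives the corresponding baseline.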
Stage 2: Feature Selection using the Attribute Importance Ranking technique
- LogisticRegression: completed and posted on Monday 11 March 2019
- BaggingClassifier: completed and posted on Wednesday 13 March 2019
- RandomForestClassifier: completed and posted on Friday 15 March 2019
- ExtraTreesClassifier: completed and posted on Sunday 17 March 2019
- GradientBoostingClassifier: completed and posted on Monday 18 March 2019
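The attribute-importance ranking in Stage 2 can be sketched with a tree ensemble's `feature_importances_` plus `SelectFromModel`: rank the attributes, keep the top ones, and refit on the reduced set. The estimator and the median threshold below are illustrative assumptions, not the exact settings from the posts:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 200 attributes, of which only some are informative
X, y = make_classification(n_samples=2000, n_features=200, n_informative=20,
                           random_state=42)

# Rank attributes by impurity-based importance, keep those above the median
ranker = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)
selector = SelectFromModel(ranker, threshold="median", prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

The reduced matrix then feeds back into the same cross-validation loop as Stage 1, so the two scores are directly comparable.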
Stage 3: Over-Sampling (SMOTE) and Balancing Ensembles techniques
- LogisticRegression: completed and posted on Wednesday 20 March 2019
- ExtraTreesClassifier: completed and posted on Friday 22 March 2019
- RandomForestClassifier: completed and posted on Monday 25 March 2019
- GradientBoostingClassifier: completed and posted on Wednesday 27 March 2019
- Balanced Bagging: completed and posted on Friday 29 March 2019
- Balanced Boosting: completed and posted on Sunday 31 March 2019
- Balanced Random Forest: completed and posted on Monday 1 April 2019
- XGBoost with Full Feature: completed and posted on Wednesday 3 April 2019
- XGBoost with SMOTE: completed and posted on Friday 5 April 2019
Stage 4: eXtreme Gradient Boosting Tuning Batches
- Batch #1: planned for Monday 8 April 2019
- Batch #2: planned for Wednesday 10 April 2019
I have posted all Python scripts here on GitHub. The final submission deadline is 10 April 2019.
Feel free to take a look at the scripts and experiment. Who knows, you might have something you can turn in by the time April comes around. Happy learning and good luck!