Monitoring machine learning model performance the right way to avoid overfitting the competition leaderboard
Introduction
There are several explanations of what overfitting and underfitting are; whichever works for you, add the following idea to it:
- An overfitting model/algorithm is too complex for the task or dataset at hand while
- An underfitting model is too simple for the task
If overfitting has ever cost you a drop on a machine learning competition leaderboard, you will probably agree that the experience is humbling, especially if you fell from the top 3 while holding better, non-overfitting submissions that you simply didn't select to be scored. This is what happens when you rely too much on public leaderboard placement instead of trusting your local cross-validation, or local CV for short. Evaluating algorithms and tracking their performance can be tiresome, but what is the point of building models in the first place if you are not going to make sure they do the job they are intended to do? In this article, we will define a simple method to monitor your machine learning model's performance and avoid such situations.
Sample problem
We’re going to use a dataset from a competition on Zindi. The competition entails predicting air quality to help with air quality monitor calibration. Let's make the necessary imports and have a look at a preview of the dataset:
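A minimal sketch of the setup step (the file name Train.csv is an assumption about how the competition data is stored locally):

```python
import numpy as np
import pandas as pd

# Assumed file name for the competition training data
train = pd.read_csv("Train.csv")
train.head()
```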
|   | ID | created_at | site | pm2_5 | pm10 | s2_pm2_5 | s2_pm10 | humidity | temp | lat | long | altitude | greenness | landform_90m | landform_270m | population | dist_major_road | ref_pm2_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_0038MG0B | 2020-04-23 17:00:00+03:00 | USEmbassy | 6.819048 | 7.313810 | 6.794048 | 7.838333 | 0.807417 | 22.383333 | 0.299255 | 32.592686 | 1199 | 4374 | 21 | 14 | 6834 | 130 | 25.0 |
| 1 | ID_008ASVDD | 2020-02-23 19:00:00+03:00 | USEmbassy | 57.456047 | 67.883488 | 55.643488 | 70.646977 | 0.712417 | 25.350000 | 0.299255 | 32.592686 | 1199 | 4374 | 21 | 14 | 6834 | 130 | 68.0 |
| 2 | ID_009ACJQ9 | 2021-01-23 04:00:00+03:00 | Nakawa | 170.009773 | 191.153636 | 165.308636 | 191.471591 | 0.907833 | 20.616667 | 0.331740 | 32.609510 | 1191 | 5865 | 31 | -11 | 4780 | 500 | 149.7 |
| 3 | ID_00IGMAQ2 | 2019-12-04 09:00:00+03:00 | USEmbassy | 49.732821 | 61.512564 | 0.000000 | 0.000000 | 0.949667 | 21.216667 | 0.299255 | 32.592686 | 1199 | 4374 | 21 | 14 | 6834 | 130 | 54.0 |
| 4 | ID_00P76VAQ | 2019-10-01 01:00:00+03:00 | USEmbassy | 41.630455 | 51.044545 | 41.725000 | 51.141364 | 0.913833 | 18.908333 | 0.299255 | 32.592686 | 1199 | 4374 | 21 | 14 | 6834 | 130 | 39.0 |
Feature Engineering
Normally, to get good results, you must consider feature engineering. I was lucky to participate in this hackathon, and after rigorous data exploration I came up with some good features. Let's define a feature engineering function to handle feature generation:
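As a rough illustration of the idea, here is a sketch of such a function. The features below are assumed examples based on the columns in the preview, not the exact 33 features engineered for the competition:

```python
def feature_engineering(df):
    """Add illustrative date, cross-sensor and ratio features (assumed examples)."""
    df = df.copy()
    ts = pd.to_datetime(df["created_at"])

    # Simple datetime decompositions
    df["hour"] = ts.dt.hour
    df["day"] = ts.dt.day
    df["month"] = ts.dt.month
    df["dayofweek"] = ts.dt.dayofweek

    # The two co-located sensors should roughly agree, so aggregate and compare them
    df["pm2_5_mean"] = df[["pm2_5", "s2_pm2_5"]].mean(axis=1)
    df["pm10_mean"] = df[["pm10", "s2_pm10"]].mean(axis=1)
    df["pm2_5_diff"] = (df["pm2_5"] - df["s2_pm2_5"]).abs()
    df["pm10_diff"] = (df["pm10"] - df["s2_pm10"]).abs()

    # Ratio between particle sizes
    df["pm_ratio"] = df["pm2_5"] / (df["pm10"] + 1e-6)

    return df
```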
Let us see how many more features we add to our initial dataset:
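A minimal sketch of this step. The original cell iterated over rows with tqdm, which is what produces the progress bar below; a vectorised version is shown here for brevity:

```python
# Count columns before and after feature generation
n_before = train.shape[1]
train = feature_engineering(train)
print(f"We created {train.shape[1] - n_before} more features!")
```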
13665it [00:04, 2915.07it/s]
We created 33 more features!
33 more features! No one should beat our place on the leaderboard now, as long as we combine our feature engineering with proper performance monitoring and trust our local CV.
Modeling: Monitoring model performance
This is where we start defining the steps for monitoring model performance. Let's declare our objects first. We'll be using the CatBoost algorithm wrapped in 10-fold KFold cross-validation for our modeling.
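A sketch of what those declarations might look like, assuming ref_pm2_5 is the prediction target and the identifier columns are dropped:

```python
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

TARGET = "ref_pm2_5"  # assumed target: the reference monitor reading
features = [c for c in train.columns if c not in ("ID", "created_at", "site", TARGET)]

X = train[features]
y = train[TARGET]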
First, define two variables:
- out-of-fold predictions: the predictions made on the holdout set in each round of the resampling procedure
- errors: the error recorded in every fold, from which we'll calculate the mean error

The error metric for this competition is root mean squared error (RMSE).
The second variable is not strictly necessary, but I like to include it to strengthen my confidence in the out-of-fold score: if the average fold error does not deviate much from it, the model's performance across folds is reliable. We'll compare the two to gauge our model's performance, and naturally they should not be far from each other.
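The two variables can be as simple as:

```python
# Out-of-fold predictions (one slot per training row) and per-fold errors
oof_predictions = np.zeros(len(train))
errors = []
```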
Second, define your cross-validation strategy; in this case, KFold with 10 folds. This is where we populate the variables defined above with values from model training: inside the loop, both get updated as the model trains and predicts on each fold.
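A sketch of that loop, with illustrative (untuned) CatBoost settings:

```python
kf = KFold(n_splits=10, shuffle=True, random_state=42)

for train_idx, valid_idx in kf.split(X):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    model = CatBoostRegressor(iterations=1000, learning_rate=0.05, verbose=0)
    model.fit(X_train, y_train, eval_set=(X_valid, y_valid), early_stopping_rounds=100)

    # Predictions on the holdout fold fill the corresponding slots of the OOF array
    fold_preds = model.predict(X_valid)
    oof_predictions[valid_idx] = fold_preds

    # Record the per-fold error as well
    fold_rmse = np.sqrt(mean_squared_error(y_valid, fold_preds))
    errors.append(fold_rmse)
    print(f"RMSE: {fold_rmse}")
```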
RMSE: 11.147193104786583
RMSE: 11.542680896491985
RMSE: 14.893402844224648
RMSE: 14.109007552380556
RMSE: 9.74267734452024
RMSE: 11.78783134709677
RMSE: 8.893336931729067
RMSE: 8.760814543657926
RMSE: 10.588465788639557
RMSE: 15.59393007990298
Now it's time to compare the two variables: we compute the RMSE of the out-of-fold predictions and the average of the errors across folds, then compare them. REMEMBER: the difference between these values should be minimal; otherwise, you could be overfitting.
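The comparison itself can look like this:

```python
# Overall RMSE on the out-of-fold predictions vs. the mean of the per-fold RMSEs
oof_rmse = np.sqrt(mean_squared_error(y, oof_predictions))
mean_error = np.mean(errors)

print(f"RMSE: {oof_rmse}")
print(f"Mean error: {mean_error}")
```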
RMSE: 11.930416509510236
Mean error: 11.895585944533348
The two values are close to each other. This is promising, so we should trust our local CV score regardless of our public leaderboard rank. If anything, we should retain our place on the private leaderboard or move up, not down, because we have done our due diligence in tracking and monitoring our model's performance.