5 Simple Tips To Improve Your Kaggle Models

How To Get High Performing Models In Competitions

Louise Ferbach
Towards Data Science


If you recently got started on Kaggle, or if you are an old regular of the platform, you probably wonder how to easily improve the performance of your models. Here are some practical tips I’ve accumulated over my Kaggle journey. So, either build your own model or start from a baseline public kernel, and try implementing these suggestions!

1. Always review past competitions

Although Kaggle’s policy is never to feature the same competition twice, very similar problems often come back. For example, some hosts run a recurring challenge on the same theme every year (NFL’s Big Data Bowl, for instance) with only small variations, and in some fields (such as medical imaging) there are many competitions with different targets but a very similar spirit.

Reviewing winners’ solutions (always made public after a competition ends, thanks to the incredible Kaggle community) can therefore be a great plus, as it gives you ideas to get started and, sometimes, a winning strategy. If you have time to review a lot of them, you will also soon notice that, even across very different competitions, a few popular baseline models seem to do the job well enough:

  • Convolutional Neural Networks, or the more complex ResNet or EfficientNet architectures, in computer vision challenges (a short pretrained-ResNet sketch follows this list),
  • WaveNet in audio processing challenges (which can also be handled very well by image recognition models if you first convert the audio to a Mel spectrogram),
  • BERT and its derivatives (RoBERTa, etc.) in natural language processing challenges,
  • LightGBM (or other gradient boosting or tree-based methods) on tabular data.
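To make the first bullet concrete, here is a minimal sketch, assuming torchvision is available and a hypothetical 5-class target, of how you could start from a pretrained ResNet and only swap its classification head:

```python
# Minimal sketch: a pretrained ResNet-50 baseline for a vision competition.
# Assumes torchvision is installed; NUM_CLASSES is a hypothetical target count.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # replace with the number of classes in your competition

model = models.resnet50(pretrained=True)                  # ImageNet weights as a starting point
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # swap the head for your own target

# From here, fine-tune as usual with your own DataLoader, loss, and optimizer.
```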

You can either look for similar competitions on the Kaggle platform directly, or take a look at this great summary by Sudalai Rajkumar.

Reviewing past competitions can also give you hints on all the other steps described below: tips and tricks on preprocessing for similar problems, how people chose their hyperparameters, what additional tools they built into their models to win, and whether they focused on bagging only similar versions of their best models or rather ensembled a melting pot of all the available public kernels.

2. You never spend enough time on data preparation

This is far from being the most thrilling part of the job. However, the importance of this step cannot be overemphasized.

  • Clean the data: never assume the hosts have provided you with the cleanest possible data. Most of the time, that assumption is wrong. Fill NaNs, remove outliers, split the data into categories of homogeneous observations…
  • Do some easy exploratory data analysis to get an overview of what you’re working on (this will give you insights and ideas). This is the most important step at this stage. Without a proper understanding of how your data is structured, what information you have, and how features behave, individually or collectively, with respect to the target, you will be flying blind and have no intuition about how to build your model. Draw plots, histograms, and correlation matrices.
  • Augment your data: this is probably one of the best ways to improve performance. However, be careful not to make the dataset so huge that your model can no longer process it. You can find additional datasets on the Internet (be very careful about usage rights, or you could suffer the same fate as the winners of the $1M Deepfake Detection Challenge), or on the Kaggle platform (in similar past competitions!), or simply work with the data you are given: flip and crop images, overlay audio recordings, back-translate or replace synonyms in texts… (a minimal augmentation example follows this list).
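As a simple illustration of the flip-and-crop augmentations mentioned in the last bullet, here is a minimal sketch using torchvision.transforms (the values are illustrative, not tuned):

```python
# Minimal image-augmentation pipeline sketch using torchvision.transforms.
# The crop size and jitter strengths below are illustrative, not tuned values.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(p=0.5),                # flip half of the images
    transforms.ColorJitter(brightness=0.1, contrast=0.1),  # mild color perturbation
    transforms.ToTensor(),
])

# Pass `train_transforms` to your Dataset so each epoch sees slightly different images.
```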

Preprocessing is also the step where you have to think carefully about which cross-validation method you will rely on. Kaggle’s motto could basically be: Trust Your CV. Working on your data will tell you how to split it: should you stratify on target values or on sample categories? Is your data unbalanced? If you have a clever CV strategy and rely solely on it rather than on the public leaderboard score (however tempting that may be), you are very likely to get good surprises on the private final scores.
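For example, if your target is unbalanced, a stratified K-fold split is a reasonable starting point; here is a minimal scikit-learn sketch (the arrays are placeholders for your own features and target):

```python
# Minimal stratified K-fold CV sketch with scikit-learn: every fold keeps roughly
# the same target distribution, which matters when the data is unbalanced.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(1000, 20)             # placeholder features
y = np.random.randint(0, 2, size=1000)   # placeholder (possibly unbalanced) binary target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
    # ...train on the train split, score on the validation split, save the fold score...
    print(f"fold {fold}: {len(train_idx)} train rows, {len(valid_idx)} validation rows")
```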

3. Try hyperparameter searching

Hyperparameter searching helps you find the optimal hyperparameters (learning rate, softmax temperature, …) your model needs to reach the best possible performance, without having to run a thousand tedious experiments by hand.

The most common hyperparameter searching strategies include:

  • Grid Search (please never do that): in my view the worst-performing method, since you can completely miss a pattern or a very local peak in performance for some values. It consists of testing hyperparameter values equally spaced over an interval of possible values you have defined.
  • Random Search (and its Monte-Carlo derivatives): you try random values of your parameters. Its main issue is that it is a parallel method and quickly becomes very costly as the number of parameters to test grows. However, it lets you include prior knowledge in your search: if you want to find the best learning rate between 1e-4 and 1e-1, but you suspect it should be around 1e-3, you can draw samples from a log-normal distribution centered on 1e-3.
  • Bayesian Search: basically random search, but improved in that it is iterative and therefore much less costly. It iteratively evaluates a promising hyperparameter configuration based on the current surrogate model, then updates that model with the result. It is the best performing of the three (a short Optuna sketch follows this list).
  • Other methods, including gradient-based search and evolutionary optimization, are riskier and do not generally apply; they can be recommended in some special cases.
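As an example of the Bayesian-style option, here is a minimal sketch with Optuna and its default TPE sampler; `train_and_evaluate` is a hypothetical helper standing in for your own cross-validated training and scoring:

```python
# Minimal Bayesian-style hyperparameter search sketch with Optuna (default TPE sampler).
# `train_and_evaluate` is a hypothetical helper: plug in your own CV training/scoring.
import optuna


def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)     # log-scaled learning-rate space
    num_leaves = trial.suggest_int("num_leaves", 16, 256)    # example tree-model parameter
    return train_and_evaluate(lr=lr, num_leaves=num_leaves)  # return the CV score to maximize


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```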

There are many AutoML tools that can do this job very well for you; just take a look at the excellent Medium and Towards Data Science resources on this topic.

However, you have to be careful and keep a solid intuition of what hyperparameter values mean. If you don’t have a solid validation set and homogeneous data, hyperparameter optimization pushed too far can lead you straight into the lion’s den of overfitting. Always prefer a rationally explainable parameter choice to a millidecimal accuracy gain on the training data.

4. Simple practices can change the game

I have found that there are some model wrappers you can use to get better results. They work on different levels:

  • In the optimization process, never forget to add a learning rate scheduler, which helps you get more precise training (for example, starting small, progressively increasing while your model is learning well, then reducing the step on a plateau); a short PyTorch sketch follows this list.
  • Still in the optimization process, you can wrap Lookahead around your optimizer: the Lookahead algorithm takes k optimization steps forward, then pulls the weights back part of the way along the line between the starting point and where those k steps ended, and restarts training from there. In theory you get better performance, though I never found this to be true; but it does stabilize training, which is welcome when your data is very noisy.
  • Find a good initialization for your weights before starting training: if you’re using a popular architecture, start from pretrained baseline weights (such as ImageNet weights in image recognition); if not, try Layer-Sequential Unit-Variance initialization (LSUV, in theory the best possible init). It consists of initializing the weights to be orthogonal, then rescaling them so that each trainable layer produces outputs of unit variance.
  • Finally, I have often found that training a LightGBM model on the features from a neural network’s last hidden layer, instead of adding a softmax output layer, can work surprisingly well.
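Here is a minimal PyTorch sketch of the first bullet, a ReduceLROnPlateau scheduler wrapped around the optimizer; `model`, `num_epochs`, `train_one_epoch` and `evaluate` are assumed to exist in your own training code:

```python
# Minimal sketch: a ReduceLROnPlateau learning-rate scheduler around a PyTorch optimizer.
# `model`, `num_epochs`, `train_one_epoch` and `evaluate` are assumed/hypothetical helpers.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2  # halve the LR after 2 stagnant epochs
)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    val_loss = evaluate(model)          # hypothetical validation helper
    scheduler.step(val_loss)            # reduce the LR when the validation loss plateaus
```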

5. Bag, Bag, Bag, Bag!

Apart from data augmentation, there is probably no technique more efficient than blending (often loosely called bagging) to improve your performance.

My personal tip is to save each and every model prediction I have run, both from my folds and from my final models, and to average all of them (just plain averaging; I never found any evidence that “clever” ensembling, such as weighting models by their solo performance, added anything to the final score). Don’t forget to blend public kernels as well; a minimal averaging sketch follows below.
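Here is that plain-averaging blend as a short sketch; the file names, the `target` column and sample_submission.csv are illustrative, so adapt them to your competition’s submission format:

```python
# Minimal blending sketch: average every saved set of test predictions.
# The predictions/*.csv files, `target` column and sample_submission.csv are illustrative.
import glob

import numpy as np
import pandas as pd

prediction_files = sorted(glob.glob("predictions/*.csv"))  # one file per saved model
all_preds = np.column_stack(
    [pd.read_csv(path)["target"].to_numpy() for path in prediction_files]
)
blended = all_preds.mean(axis=1)  # plain average, no clever weighting

submission = pd.read_csv("sample_submission.csv")
submission["target"] = blended
submission.to_csv("blend_submission.csv", index=False)
```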

The more models you have in your ensemble, the more likely you are to survive a private leaderboard shakeup, because diversifying your models makes the final result more robust. It is the same idea as portfolio diversification in finance: instead of one asset with a given return and a given variance, hold many different assets with the same return and the same variance, since it is much less likely that they will all draw down simultaneously, and losses on one will be compensated by gains on others. In the same spirit, instead of relying on a single model, have many different models vote: the target predicted by the majority of them (in classification), or the mean of the targets they predict (in regression), will very likely be closer to the true answer.

I hope you enjoyed this article. Thanks to Théo Viel for the review.
