
A Summary of My Experience with Kaggle Competitions Over the Last Year

A systematic approach to achieving good results in Kaggle competitions

Photo by Greg Rosenke on Unsplash

As a data science enthusiast, I have tried many different things to boost my knowledge and experience: reading machine learning papers weekly for a year, taking Coursera courses, working through hands-on machine learning books, and participating in Kaggle competitions. In my opinion, the best way to gain experience in data science (apart from working in the industry) is to do Kaggle competitions. I faced tons of frustration when starting out, which is why I am writing this story. After doing two or three Kaggle competitions, I started noticing a lot of useful patterns that I wish I had noticed much earlier. If you are starting your first competition, I highly recommend paying attention to these points, since they will save you a lot of time.

1. Explore the data and take your time (practice Exploratory Data Analysis (EDA))

I used to be very excited to pump the data into a machine learning model and get started as soon as I could. However, I have learned after a few trials that it is well worth taking your time to explore the data first. There are tons of EDA tutorials out there, and competition participants will usually post plenty of EDA notebooks early on. Understanding the data will save you a lot of time down the road and will help you make informed decisions about which models to try and which to avoid. Generally speaking, if the competition lasts around three months, don’t worry about spending the first two or three weeks on EDA. It is much better to explore in the early stages than to exploit!

EDA includes a lot of different tools, but I have generally noticed these to be the most useful:

  1. Check the distributions of features. See whether they are roughly normal or skewed; this will likely affect how you preprocess the data.
  2. Check the data types of the features.
  3. Check the mean, median, and standard deviation of the features.
  4. Check the number of features and the size of the dataset.
  5. Check for null or empty fields.
  6. Check the correlations between features; highly correlated features can sometimes be dropped.

There are many more things you can do, but if you are a beginner, maybe just start with these.
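The checklist above can be sketched in a few lines of pandas. This is a minimal example on a toy DataFrame with hypothetical columns, not a real competition dataset:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for a competition training set (hypothetical columns)
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [30_000, 45_000, 52_000, 61_000, 1_000_000],  # note the skew/outlier
    "city": ["A", "B", "A", "C", "B"],
})

# 1 & 3: distributions at a glance (count, mean, std, quartiles)
print(df.describe())

# 2: data types of each feature
print(df.dtypes)

# 4: dataset size (rows, columns)
print(df.shape)

# 5: null counts per column
print(df.isnull().sum())

# 6: pairwise correlation of the numeric features
print(df.select_dtypes("number").corr())
```

For a larger dataset you would typically pair this with histograms (`df.hist()`) and a correlation heatmap, but these few calls already cover most of the checklist.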

2. Start reading papers about the FIELD of the competition

This might not be obvious, and some people will disagree, but I highly recommend reading a few machine learning papers in the same field as the competition. For example, if you are doing a competition about detecting skin cancer in images, try to find the most recent papers on the most similar task (computer vision applied to medical imaging, for example).

Be pragmatic and careful about which papers you read. It is easy to fall into an endless cycle of reading and forget about the competition altogether. Skim through a collection of around 10–20 papers and choose the best 5. I would also highly recommend sharing this collection with the competition community. Get people’s feedback, see what they think, and pick up a few Kaggle points along the way!

3. Start experimenting with models

Photo by Anders Jildén on Unsplash

A very common mistake here is not tracking your experiments and getting lost in a chaos of notebooks. Be systematic about your experiments: note down the model you are going to use and the range of hyperparameters you will test. It is very important to establish a solid baseline that you can build on.
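Experiment tracking does not need to be fancy; dedicated tools exist, but even appending each run to a CSV beats a pile of unnamed notebooks. A minimal sketch (the file name, helper name, and logged models here are all made up for illustration):

```python
import csv
import time
from pathlib import Path

LOG = Path("experiments.csv")  # hypothetical log file

def log_experiment(model_name, params, cv_score, notes=""):
    """Append one experiment row; write a header row on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "model", "params", "cv_score", "notes"])
        writer.writerow([
            time.strftime("%Y-%m-%d %H:%M"),
            model_name,
            repr(params),          # keep the exact hyperparameters with the score
            cv_score,
            notes,
        ])

# Example runs (hypothetical models and scores)
log_experiment("logistic_baseline", {"C": 1.0}, 0.741, "solid baseline")
log_experiment("gbm", {"num_leaves": 31, "lr": 0.05}, 0.768)
```

The point is that every score is stored next to the exact hyperparameters that produced it, so you can always reconstruct (and beat) your best run.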

Also:

Don’t start with a neural network

I see this mistake all the time, and I am guilty of making it several times myself. It is very tempting to start with a neural network (for instance, a transformer in an NLP competition). Don’t start with the most complicated model; you will end up quite frustrated. Start with the most basic model and build your way up to more complex ones. Many competitions have been won with gradient boosting machines rather than deep neural networks. You should learn how to use gradient boosting machines, because they are absolute beasts when it comes to performance in Kaggle competitions. They are also very easy to use, and libraries such as LightGBM give you meaningful extras such as feature importances.
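As a sketch of such a baseline, here is a gradient boosting model with cross-validation and feature importances. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in (LightGBM's interface is similar), and the data is synthetic; only the first two features actually carry signal:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for competition data: 4 features, only 2 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=100, random_state=0)

# Baseline score with 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy:", scores.mean().round(3))

# Feature importances reveal which columns the model actually relied on
model.fit(X, y)
print("importances:", model.feature_importances_.round(2))
```

On this toy data the importances for the two informative features dominate the two noise features, which is exactly the kind of feedback that helps you decide what to engineer or drop next.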

4. Work with the community

Remember that you are competing within a public Kaggle community. You will always have questions; post them on the forum. And if you can, try to work in a team. Learning from people is insanely quicker than learning by yourself. I guarantee that many of the mistakes you are making could be avoided by working in a team.

Also, keep up to date with the public notebooks and try to learn as much as you can from them.

However, don’t simply copy-paste them!

Copying and pasting will get you nowhere: you won’t actually learn anything, and most of the time you will find it tremendously hard to improve someone else’s code. Instead, try to understand their code and perhaps adapt a block or two that can help you.

5. Ensemble, ensemble, ensemble

Ensembling is honestly the secret to getting a good score in Kaggle competitions. I used to forget about it. After you run your experiments, don’t simply submit the predictions of your single best model; submit an ensemble of your best three, or even five, models. And always train your models with 3- or 5-fold cross-validation.
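The simplest combination of these two ideas is to train one model per fold and average the per-fold predictions on the test set. A minimal sketch on synthetic data (logistic regression is just a placeholder model; 5 folds is the convention mentioned above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for competition data, split into "train" and "test"
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, y_tr, X_te, y_te = X[:250], y[:250], X[250:], y[250:]

# Train one model per fold; keep each model's test-set probabilities
fold_preds = []
for tr_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X_tr):
    m = LogisticRegression(max_iter=1000).fit(X_tr[tr_idx], y_tr[tr_idx])
    fold_preds.append(m.predict_proba(X_te)[:, 1])

# The "submission" is the mean of the per-fold probabilities
ensemble = np.mean(fold_preds, axis=0)
accuracy = ((ensemble > 0.5) == y_te).mean()
print("ensemble accuracy:", round(accuracy, 3))
```

The same averaging works across different model families (say, a gradient boosting model plus a linear model): average their probabilities, or their ranks, before thresholding.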

Conclusion

I honestly haven’t found Kaggle competitions to be an easy road. I usually found them quite challenging, and I still do. I have never won one, but I always come away having learned tons of new concepts. I have also noticed that typical Kaggle Grandmasters only get there after participating in loads (100+) of competitions, which is absolutely insane. I’m not sure I will get that far, since at some point you start gaining Kaggle-specific experience rather than transferable data science experience. But I don’t think you reach that point until you have done perhaps more than 50 competitions!



