I experienced something amazing. I climbed 8000 ranks on the Kaggle Titanic competition leaderboard in just a few attempts and in 10 minutes.
And here is more amazing news. I did it with minimum effort. I did not do feature engineering. I did not fill in the missing values. I only used a few of the columns. I did not use any complicated Machine Learning algorithm. Just a plain simple decision tree.
In this story, I will tell you about the magical technique which made this happen.
At the time of my last submission, I was ranked 4057 out of 14404 teams (in the top 30%).
The objective of this story is not to develop the best possible model, but to show how to climb the leaderboard with minimal effort.
Before applying the magical technique – The rank is 12616

After applying the magical technique – The rank is 4057! Woah!

Let me coin a term for the technique I have used:
Micro-outlier removal
Voila, the term sounds good. This term does not exist yet. If you are reading this story, then you might be seeing this term for the first time.
The motivation behind the micro-outlier removal technique
Even though we have many techniques to improve machine learning models, sometimes you have the feeling that something is missing. You may say that we have everything: hyper-parameter optimization, grid search, and even AutoML. So what on earth could be missing?
Well, for me, the missing thing is an intuition-based visual approach. Augmenting all the machine learning optimization techniques with an intuition-based visual approach can really give you the edge to go beyond the usual.
Photo by Ali Hajian on Unsplash
So let us see what a micro-outlier looks like.
Locating Micro-outliers
First, here is some background information about how I trained the model on the Titanic data. To keep things simple,
- I have taken only the following fields as they are: Pclass, Sex, SibSp, Parch, Fare, Embarked.
- The Age field is not used, as it contains many missing values.
- There is no feature engineering.
- The machine learning algorithm is a basic 5-level decision tree with a 70–30 train-test split (a minimal sketch of this setup is shown below).
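For reference, here is a minimal sketch of this baseline setup. The column names are the standard Kaggle Titanic fields; the one-hot encoding and the random seeds are my own assumptions, not the author's exact code.

```python
# Minimal sketch of the baseline described above (encoding and seeds are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")  # standard Kaggle Titanic training file

# Only these columns are used as-is; Age is dropped because of its missing values.
features = ["Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]
X = pd.get_dummies(df[features], columns=["Sex", "Embarked"])  # encode the categoricals
y = df["Survived"]

# 70-30 train-test split and a depth-5 decision tree, with no feature engineering.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```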
Shown below is the decision boundary produced by the decision tree on the training dataset. The legend indicates the meaning of the colors in the figure.

We can make the following observations:
- The decision surface which predicts survival (green area) is mostly located in the middle, while the decision surface which predicts non-survival (red area) is mostly located on the sides.
- Generally, the passengers who did not survive (blue points) are grouped together. Similarly, the passengers who survived (green points) are grouped together.
The micro-outliers can be visually located as follows:
- survivors within a cluster of non-survivors
- non-survivors within a cluster of survivors
The figure below marks the micro-outliers with white arrows.

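The article locates these points purely by eye. As a rough programmatic counterpart, and purely my own approximation rather than the author's method, one could flag training points whose label disagrees with the majority label of their nearest neighbours and then inspect those candidates visually:

```python
# Approximate the visual search: flag training points whose label disagrees
# with the majority label of their 10 nearest neighbours (my own heuristic,
# not the author's method). Builds on X_train / y_train from the sketch above.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X_train)        # put features on one scale
nn = NearestNeighbors(n_neighbors=11).fit(X_scaled)
_, idx = nn.kneighbors(X_scaled)                           # idx[:, 0] is the point itself
neighbour_labels = y_train.to_numpy()[idx[:, 1:]]          # labels of the 10 neighbours
neighbour_majority = neighbour_labels.mean(axis=1) > 0.5   # True if most neighbours survived
candidates = X_train.index[neighbour_majority != y_train.to_numpy().astype(bool)]
print("Candidate micro-outliers to inspect:", list(candidates))
```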
Now let us analyze the micro-outliers.
Analyzing the micro-outliers
To better understand the micro-outliers, let us analyze the one situated at the top left. The visual way of doing this is shown in the animated image below: as we hover over each point, it shows a radar plot of that point's column values.

You will observe that all the points in this region belong to passengers who are male, have a high Pclass value (which means 3rd class), and embarked from port S. None of these passengers survived, except the micro-outlier point.
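The article uses an interactive hover plot for this. A static matplotlib stand-in for such a radar plot, reusing X and candidates from the sketches above, might look like the following (the min-max scaling is my own assumption):

```python
# Draw one passenger's min-max scaled feature values on a polar axis
# (an illustrative static stand-in for the interactive hover radar plot).
import matplotlib.pyplot as plt
import numpy as np

def radar_plot(row, columns):
    # scale each feature to [0, 1] so they share one radial axis
    scaled = (row[columns] - X[columns].min()) / (X[columns].max() - X[columns].min())
    values = scaled.tolist()
    angles = np.linspace(0, 2 * np.pi, len(columns), endpoint=False).tolist()
    values += values[:1]   # close the polygon
    angles += angles[:1]
    ax = plt.subplot(polar=True)
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(columns)
    plt.show()

# Inspect one of the flagged points from the earlier sketch.
radar_plot(X.loc[candidates[0]], ["Pclass", "SibSp", "Parch", "Fare"])
```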
The micro-outlier here is the passenger Eugene Patrick Daly. You can read about how he survived in the link here.
He was a 3rd-class passenger located on the lower decks, and he jumped into the cold water. By all odds, he had no chance of surviving. However, he attributed his survival to the thickness of his overcoat, a garment he held on to for many years and which he named his "lucky coat."

Though we are happy for him, he is not good for machine learning! People who luckily survived for vague reasons such as the thickness of an overcoat are outliers that mess up the machine learning model. Nor do we have features recording who jumped overboard or the thickness of each passenger's overcoat. So the best thing to do is to remove such points from the training data.
The micro-outlier visual technique identifies such 'lucky' passengers in the Titanic data! You will not be able to do this with classic outlier detection algorithms.
I removed 6 micro-outliers, trained the model, and made my submission. There was a big climb on the leaderboard compared to the submission without the micro-outlier technique.
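A sketch of this final step might look like the following. The row indices below are placeholders only; the article does not list which six passengers were removed.

```python
# Drop the flagged rows (placeholder indices, NOT the six passengers from the
# article), retrain the same depth-5 tree, and write a Kaggle submission file.
micro_outliers = [0, 1, 2, 3, 4, 5]  # hypothetical row indices of the removed points

X_clean = X_train.drop(index=micro_outliers, errors="ignore")
y_clean = y_train.drop(index=micro_outliers, errors="ignore")

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_clean, y_clean)

# Prepare test.csv the same way as the training data and align the dummy columns.
test_df = pd.read_csv("test.csv")
X_submit = pd.get_dummies(test_df[features], columns=["Sex", "Embarked"])
X_submit = X_submit.reindex(columns=X.columns, fill_value=0)
X_submit["Fare"] = X_submit["Fare"].fillna(X["Fare"].median())  # one fare is missing in test.csv

pd.DataFrame(
    {"PassengerId": test_df["PassengerId"], "Survived": tree.predict(X_submit)}
).to_csv("submission.csv", index=False)
```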
Conclusion
Micro-outlier removal is a good intuition-based visual approach to improving your machine learning model's accuracy without a lot of complex coding. In effect, we are removing data points that unnecessarily complicate the model, and thus gaining overall model accuracy.
Please subscribe in order to stay informed whenever I release a new story.
You can also join Medium with my referral link
Additional Resources
Website
You can visit my website to do analytics with zero coding. https://experiencedatascience.com
Youtube channel
Here is a link to my YouTube channel https://www.youtube.com/c/DataScienceDemonstrated