I experienced something amazing. I climbed 8000 ranks on the Kaggle Titanic competition leaderboard in just a few attempts and in 10 minutes.
And here is more amazing news. I did it with minimum effort. I did not do feature engineering. I did not fill in the missing values. I only used a few of the columns. I did not use any complicated Machine Learning algorithm. Just a plain simple decision tree.
In this story, I will tell you about the magical technique which made this happen.
At the time of my last submission, I was ranked 4057 out of 14404 teams (in the top 30%).
The objective of this story is not to develop the best possible model, but to show how to climb the leaderboard with minimal effort.
Before applying the magical technique – The rank is 12616

After applying the magical technique – The rank is 4057! Woah!

Let me coin a term for the technique I have used:
Micro-outlier removal
Voila, the term sounds good. This term does not exist yet. If you are reading this story, then you might be seeing this term for the first time.
The motivation behind the micro-outlier removal technique
Even though we have many techniques to improve machine learning models, sometimes you have the feeling that something is missing. You may say that we have everything: hyper-parameter optimization, grid search, and even AutoML. So what on earth could be missing?
Well, for me, the missing thing is an intuition-based visual approach. Augmenting all the machine learning optimization techniques with an intuition-based visual approach can really give you the edge to go beyond the usual.
Photo by Ali Hajian on Unsplash
So let us see what a micro-outlier looks like.
Locating Micro-outliers
First, here is some background information about how I trained the model on the Titanic data. To keep things simple,
- I have taken only the following fields as they are: Pclass, Sex, SibSp, Parch, Fare, Embarked.
- The Age field is not used, as it contains many missing values.
- There is no feature engineering.
- The machine learning algorithm is a basic 5-level decision tree with a 70–30 train-test split (a minimal sketch of this setup is shown below).
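For reference, here is a minimal sketch of this baseline setup. The column names are the standard Kaggle Titanic fields; the one-hot encoding and the random seeds are my own assumptions, not the author's exact code.

```python
# Minimal sketch of the baseline described above (encoding and seeds are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")  # standard Kaggle Titanic training file

# Only these columns are used as-is; Age is dropped because of its missing values.
features = ["Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]
X = pd.get_dummies(df[features], columns=["Sex", "Embarked"])  # encode the categoricals
y = df["Survived"]

# 70-30 train-test split and a depth-5 decision tree, with no feature engineering.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```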
Shown below is the decision boundary produced by the decision tree on the training dataset. The legend indicates the meaning of the colors in the figure.

We can make the following observations:
- The decision surface which predicts survival (green area) is mostly located in the middle, while the decision surface which predicts non-survival (red area) is mostly located on the sides.
- Generally, the passengers who did not survive (blue points) are grouped together. Similarly, the passengers who survived (green points) are grouped together.
The micro-outliers can be visually located as follows:
- survivors within a cluster of non-survivors
- non-survivors within a cluster of survivors
The figure below marks the micro-outliers with white arrows.

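The article locates these points purely by eye. As a rough programmatic counterpart, and purely my own approximation rather than the author's method, one could flag training points whose label disagrees with the majority label of their nearest neighbours and then inspect those candidates visually:

```python
# Approximate the visual search: flag training points whose label disagrees
# with the majority label of their 10 nearest neighbours (my own heuristic,
# not the author's method). Builds on X_train / y_train from the sketch above.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X_train)        # put features on one scale
nn = NearestNeighbors(n_neighbors=11).fit(X_scaled)
_, idx = nn.kneighbors(X_scaled)                           # idx[:, 0] is the point itself
neighbour_labels = y_train.to_numpy()[idx[:, 1:]]          # labels of the 10 neighbours
neighbour_majority = neighbour_labels.mean(axis=1) > 0.5   # True if most neighbours survived
candidates = X_train.index[neighbour_majority != y_train.to_numpy().astype(bool)]
print("Candidate micro-outliers to inspect:", list(candidates))
```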
Now let us analyze the micro-outliers.
Analyzing the micro-outliers
To better understand the micro-outliers, let us analyze the one situated at the top left. The visual way of doing this is shown in the animated image below: as we hover over each point, it shows a radar plot of that point's column values.

You will observe that all the points in this region belong to passengers who are male, have a high Pclass value (which means 3rd class), and embarked from port S. None of these passengers survived, except the micro-outlier point.
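The article uses an interactive hover plot for this. A static matplotlib stand-in for such a radar plot, reusing X and candidates from the sketches above, might look like the following (the min-max scaling is my own assumption):

```python
# Draw one passenger's min-max scaled feature values on a polar axis
# (an illustrative static stand-in for the interactive hover radar plot).
import matplotlib.pyplot as plt
import numpy as np

def radar_plot(row, columns):
    # scale each feature to [0, 1] so they share one radial axis
    scaled = (row[columns] - X[columns].min()) / (X[columns].max() - X[columns].min())
    values = scaled.tolist()
    angles = np.linspace(0, 2 * np.pi, len(columns), endpoint=False).tolist()
    values += values[:1]   # close the polygon
    angles += angles[:1]
    ax = plt.subplot(polar=True)
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(columns)
    plt.show()

# Inspect one of the flagged points from the earlier sketch.
radar_plot(X.loc[candidates[0]], ["Pclass", "SibSp", "Parch", "Fare"])
```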
The micro-outlier here is the passenger Eugene Patrick Daly. You can read about how he survived in the link here.
He was a 3rd-class passenger located on the lower decks, and he jumped into the cold water. By all odds, he had no chance of surviving. However, he attributed his survival to the thickness of his overcoat, a garment he held on to for many years and which he named his "lucky coat."

Though we are happy for him, he is not good for machine learning! People who luckily survived for vague reasons such as the thickness of an overcoat are outliers that mess up the machine learning model. Nor do we have features recording who jumped overboard or the thickness of each passenger's overcoat. So the best thing to do is to remove such points from the training data.
The micro-outlier visual technique identifies such 'lucky' passengers in the Titanic data! You will not be able to do this with classic outlier detection algorithms.
I removed 6 micro-outliers, trained the model, and made my submission. There was a big climb on the leaderboard compared to the submission without the micro-outlier technique.
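A sketch of this final step might look like the following. The row indices below are placeholders only; the article does not list which six passengers were removed.

```python
# Drop the flagged rows (placeholder indices, NOT the six passengers from the
# article), retrain the same depth-5 tree, and write a Kaggle submission file.
micro_outliers = [0, 1, 2, 3, 4, 5]  # hypothetical row indices of the removed points

X_clean = X_train.drop(index=micro_outliers, errors="ignore")
y_clean = y_train.drop(index=micro_outliers, errors="ignore")

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_clean, y_clean)

# Prepare test.csv the same way as the training data and align the dummy columns.
test_df = pd.read_csv("test.csv")
X_submit = pd.get_dummies(test_df[features], columns=["Sex", "Embarked"])
X_submit = X_submit.reindex(columns=X.columns, fill_value=0)
X_submit["Fare"] = X_submit["Fare"].fillna(X["Fare"].median())  # one fare is missing in test.csv

pd.DataFrame(
    {"PassengerId": test_df["PassengerId"], "Survived": tree.predict(X_submit)}
).to_csv("submission.csv", index=False)
```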
Conclusion
Micro-outlier removal is a good intuition-based visual approach to improving your machine learning model's accuracy without a lot of complex coding. In effect, we are removing data points that unnecessarily complicate the model, and thus gaining overall model accuracy.
Please subscribe in order to stay informed whenever I release a new story.
You can also join Medium with my referral link
Additional Resources
Website
You can visit my website to do analytics with zero coding. https://experiencedatascience.com
Youtube channel
Here is a link to my YouTube channel https://www.youtube.com/c/DataScienceDemonstrated