The world’s leading publication for data science, AI, and ML professionals.

Introducing Anomaly/Outlier Detection in Python with PyOD đŸ”„

Think Those Pesky Outliers Can Hide From You?

Photo by John Schnobrich on Unsplash

Overview of Your Journey

  1. Setting the Stage
  2. What is Anomaly/Outlier Detection?
  3. Introducing PyOD
  4. Getting Familiar with the Data
  5. Anomaly Detection for Data Cleaning
  6. Anomaly Detection for Prediction
  7. Wrapping Up

1 – Setting the Stage

In recent years, anomaly detection has become more popular in the machine learning community. Despite this, there are noticeably fewer resources on anomaly detection than on classical machine learning algorithms, so learning about it can feel trickier than it should. From a conceptual standpoint, anomaly detection is actually very simple!

The goal of this blog post is to give you a quick introduction to anomaly/outlier detection. Specifically, I will show you how to implement anomaly detection in Python with the package PyOD – Python Outlier Detection. In this way, you will not only get an understanding of what anomaly/outlier detection is but also how to implement anomaly detection in Python.

The good news is that PyOD is easy to apply – especially if you already have experience with Scikit-Learn. In fact, PyOD is designed to closely mirror the Scikit-Learn API.

Prerequisites: You should have some basic familiarity with Python and Pandas. However, no knowledge of anomaly detection is necessary 😃


2 – What is Anomaly/Outlier Detection?

Anomaly detection goes by many names: outlier detection, outlier analysis, anomaly analysis, and novelty detection. A concise description from Wikipedia describes anomaly detection as follows:

Anomaly detection is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

Let’s try to unpack the above statements. Say you have a dataset consisting of many observations. The goal of anomaly detection is to identify the observations that differ significantly from the rest. Why would you want to do this? There are two major reasons:

Use Case 1 – Data Cleaning

When cleaning the data, it is sometimes better to remove anomalies as they misrepresent the data. Let’s illustrate this with a concrete example: Say that you have made a survey that asks questions regarding the respondents’ favourite cat breeds đŸ˜ș

Photo by Mikhail Vasilyev on Unsplash

You first give the survey to 100 people, each of whom completes it. Now you want to estimate the average time it takes to complete the survey. Why? You want 10,000 more people to take the survey, and it would be professional to tell the new respondents roughly how long it takes. Even though cats are awesome, people are busy!

Let’s say that you got the following results from the first 100 people:

  • 3 Minutes – 57 respondents
  • 4 Minutes – 33 respondents
  • 5 Minutes – 6 respondents
  • 6 Minutes – 3 respondents
  • 480 Minutes – 1 respondent

What is going on with the last one? 480 minutes is 8 hours! Upon further inspection, you find that the respondent started the survey at 23:58 in the evening, left it untouched from 00:00 until 07:56, and then finished it between 07:56 and 07:58. Can you see what happened?

Clearly, a person started the survey, then went to bed, and then finished the survey when he/she got up in the morning. If you keep this result, then the average time to complete the survey will be

average = (3 * 57 + 4 * 33 + 5 * 6 + 6 * 3 + 1 * 480)/100 = 8.31

However, saying that the survey takes roughly 8 minutes is not accurate. The only reason it took that long was because of a sleepy respondent đŸ˜Ș

It would be more accurate to remove that person from the tally and get

average = (3 * 57 + 4 * 33 + 5 * 6 + 6 * 3)/99 ≈ 3.55

For simplicity, the survey could write the sentence: The average completion time for the survey is between 3 and 4 minutes.
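The arithmetic above is easy to check in a few lines of Python (the 60-minute cutoff used to filter out the sleepy respondent is just an illustrative choice):

```python
# Completion times (in minutes) for the first 100 respondents
times = [3] * 57 + [4] * 33 + [5] * 6 + [6] * 3 + [480]

# Average including the sleepy respondent
raw_average = sum(times) / len(times)
print(raw_average)  # 8.31

# Average after removing the outlier (anything over, say, an hour)
cleaned = [t for t in times if t < 60]
cleaned_average = sum(cleaned) / len(cleaned)
print(round(cleaned_average, 2))  # 3.55
```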

Here you have manually removed an outlier to clean the data so that it better represents reality. Anomaly detection is about implementing algorithms that detect such outliers automatically.

Caveat: In the above example you have removed an outlier to better match the survey length with reality. Anomaly detection should never be used to artificially make a product seem better than it really is. Careful consideration should be made whether it is ethically appropriate to use anomaly detection for data cleaning.

Use Case 2 – Prediction

In other applications, the anomalies themselves are the point of interest. Examples are network intrusion, bank fraud, and certain structural defects. In these applications, the anomalies represent something that is worthy of further study.

  • Network intrusion – anomalies in network data can indicate that a network attack of some sort has taken place.
  • Bank fraud – anomalies in transaction data can indicate fraud or suspicious behaviour.
  • Structural defects – anomalies can indicate that something is wrong with your hardware. While more traditional monitoring software is typically available in this setting, anomaly detection can discover more weird defects.

Caveat: In some settings like bank fraud, it is not always an individual transaction that raises suspicions. It is the frequency and magnitude of multiple transactions seen in context that should be considered. To deal with this, the data should be aggregated appropriately. This is outside the scope of this blog, but something that you should be aware of.


3 – Introducing PyOD

Let’s describe the Python package PyOD that helps you to do anomaly detection. In the words of the PyOD documentation:

PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data.

Briefly put, PyOD supplies you with a bunch of models that perform anomaly detection. Some cool highlights that are worth mentioning are:

  • PyOD includes more than 30 different algorithms. Some of them are classics (like LOF), while others are the new kids on the block (like COPOD). See a full list of supported models.
  • PyOD has optimized its code by using the jit-decorator from Numba. I’ve written a blog post on Numba if you are interested in this.
  • PyOD has a uniform API. Hence if you become familiar with a few models in PyOD, then you can learn the rest with ease. I recommend taking a look at the PyOD API CheatSheet after you finish this blog.

If you are using PIP, then you can install PyOD with the command:

pip install pyod

If you have previously installed PyOD, then make sure it is up to date with the pip command:

pip install --upgrade pyod

If you are instead using the Conda package manager, then you can run the command:

conda install -c conda-forge pyod

In this blog post, I will demonstrate two algorithms for doing anomaly detection: KNN and LOF. You’ve maybe heard of KNN (K-Nearest Neighbors) previously, while LOF (Local Outlier Factor) is probably unfamiliar to you. Let’s first take a look at the data you will be using âšĄïž


4 – Getting Familiar with the Data

We will be using the classical Titanic dataset. To get the dataset loaded into Pandas, simply run the code below:

import pandas as pd
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)

To check out the first rows of the dataset, use the head() method:

titanic.head()
The first 5 rows of the Titanic dataframe.

As you can see, there are columns representing the sex, age, fare price, passenger class, ticket, etc. For simplicity you will only work with the following four columns:

# Selecting only the columns Survived, Pclass, Fare, and Sex
# (.copy() avoids a SettingWithCopyWarning when modifying columns later)
partial_titanic = titanic[["Survived", "Pclass", "Fare", "Sex"]].copy()

There are no missing values in partial_titanic. However, the column Sex consists of the string values male or female. To be able to do anomaly detection, you need numeric values. You can convert this binary categorical variable to the values 0 and 1 with the code:

# Change the categorical value Sex to numeric values
partial_titanic["Sex"] = partial_titanic["Sex"].map({
    "male": 0,
    "female": 1
})

Now you are ready to do anomaly detection 😃


5 – Anomaly Detection for Data Cleaning

Let’s now use anomaly detection to clean the dataset partial_titanic you made in the previous section. You will use the KNN model to do this. The KNN model examines the data and looks for data points (rows) that are far from the other data points. To get started, you import the KNN model as follows:

# Import the KNN
from pyod.models.knn import KNN

We begin by initiating a KNN model:

# Initiate a KNN model
KNN_model = KNN()

When using anomaly detection for data cleaning, you can fit the model on the whole dataset:

# Fit the model to the whole dataset
KNN_model.fit(partial_titanic)
Output:
KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest', metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

Running the code above prints out the model’s default parameter values (e.g. contamination=0.1). These can be tweaked if needed. After fitting a model you can access two types of output:

  • Labels: Running KNN_model.labels_ gives you a binary label for each observation. The value 0 indicates a normal observation, while 1 indicates an outlier.
  • Decision Scores: Running KNN_model.decision_scores_ gives you the raw outlier scores. These range from 0 upwards; a higher score indicates that a data point is more of an outlier.

Let’s check out the labels of the trained model:

# Find the labels
outlier_labels = KNN_model.labels_
# Find the number of outliers
number_of_outliers = len(outlier_labels[outlier_labels == 1])
print(number_of_outliers)
Output:
88

For a dataset with 891 passengers, having 88 outliers is quite high. To reduce this, you can lower the contamination parameter of the KNN model. The contamination indicates the expected proportion of data points that are outliers. Let’s say that the contamination is only 1%:

# Initiate a KNN model
KNN_model = KNN(contamination=0.01)
# Fit the model to the whole dataset
KNN_model.fit(partial_titanic)
# Find the labels
outlier_labels = KNN_model.labels_
# Find the number of outliers
number_of_outliers = len(outlier_labels[outlier_labels == 1])
print(number_of_outliers)
Output: 
9

Now there are only 9 outliers! You can check them out:

# Finding the outlier passengers
outliers = partial_titanic.iloc[outlier_labels == 1]

If you check out the outliers variable, you get the following table:

Outlier passengers

Looking at these passengers, the KNN model has picked up on their incredibly high fare prices. The average fare price for all the passengers is easily found in Pandas:

# Average fare price
round(partial_titanic["Fare"].mean(), 3)
Output:
32.204

The KNN algorithm has successfully found 9 passengers that are outliers with respect to the fare price. There are many optional parameters you can play around with in the KNN model to make it suit your specific needs đŸ”„

The outliers can now be removed from the data if you feel like they don’t represent the general feel of the data. As mentioned previously, you should consider carefully whether anomaly detection for data cleaning is appropriate for your problem.


6 – Anomaly Detection for Prediction

In the previous section, you looked at anomaly detection for data cleaning. In this section, you will take a peek at anomaly detection for prediction. You will train a model on existing data, and then use the model to predict whether new data points are outliers.

Say a rumor spread that a Mrs. Watson had also sailed on the Titanic, but her death was never recorded. According to the rumors, Mrs. Watson was a wealthy lady who paid $1,000 to travel on the Titanic in a very exclusive suite.

Photo by Luke Braswell on Unsplash

Anomaly detection cannot say with certainty whether the rumor is true or false. However, it can say whether Mrs. Watson is an anomaly based on the information about the other passengers. If she is an anomaly, the rumor should be taken with a grain of salt.

Let’s test Mrs. Watson’s existence with another model in the PyOD library: Local Outlier Factor (LOF). A LOF model tests whether a data point is an outlier by comparing its local density with the local densities of its neighbors. For more information on this method, you can check out its Wikipedia page.

Let’s get coding! You start by establishing a Local Outlier Factor model:

# Import the LOF
from pyod.models.lof import LOF
# Initiate a LOF model
LOF_model = LOF()
# Train the model on the Titanic data
LOF_model.fit(partial_titanic)

Pay attention to how similar working with a LOF model is to working with a KNN model.

Now you can represent Mrs. Watson as a data point:

# Represent Mrs. Watson as a data point
mrs_watson = [[0, 1, 1000, 1]]

The values in mrs_watson represent her survival (0 for not survived), passenger class (1 for first class), fare price ($1,000), and sex (1 for female). The LOF model requires a 2D array, which is the reason for the extra bracket pair [] in mrs_watson.

We now use the predict() method to predict whether Mrs. Watson is an outlier or not:

outlier = LOF_model.predict(mrs_watson)
print(outlier)
Output:
1

A value of 1 indicates that Mrs. Watson is an outlier. This should make you suspicious that the rumor regarding Mrs. Watson is false 😼


7 – Wrapping Up

I have shown you how to implement anomaly detection with the two algorithms KNN and LOF. As you probably suspect, there are many more algorithms that you can play around with in PyOD.

Anomaly detection is important for both cleaning the data and also for predicting outliers. The application at hand should determine whether or not it is of interest to apply anomaly detection. If you are planning on applying anomaly detection in Python, then PyOD is a solid choice.

Like my writing? Check out some of my other posts for more Python content:

If you are interested in data science, programming, or anything in between, then feel free to add me on LinkedIn and say hi ✋

