
Introduction to Support Vector Machines - Motivation and Basics

Learn the basic concepts that make the Support Vector Machine a powerful linear classifier

Introduction

In this post, you will learn about the basics of Support Vector Machines (SVM), which is a well-regarded supervised machine learning algorithm.

This technique belongs in everyone’s toolkit, especially if you aspire to be a data scientist one day.

Since there’s a lot to learn about, I’ll introduce SVM to you across two posts so that you can have a coffee break in between 🙂

Motivation

First, let us try to understand the motivation behind SVM in the context of a binary classification problem. In a binary classification problem, our data belong to two classes, and we try to find a decision boundary that splits the data into those two classes while making as few mistakes as possible. Consider the diagram below, which represents our (hypothetical) data on a 2-D plane. As we can see, the data is divided into two classes: Pluses and Stars.

Note: For the sake of simplicity, we’ll only consider linearly separable data for now and learn about the non-linearly-separable case later on.

Figure 1: Data representation where we have data points from two distinct classes

The goal of SVM, like that of any classification algorithm, is to find a decision boundary that splits the data into two classes. However, there could be many possible decision boundaries that achieve this purpose, as shown below. Which one should we choose?

Figure 2: What is the ideal decision boundary?

The yellow and the black decision boundaries do not seem to be good choices. Why, you ask? Simply because they might not generalize well to new data points, as each of them passes awfully close to one of the classes. In this sense, the blue line seems to be a good candidate, as it is far away from both classes. Hence, by extending this chain of thought, we can say that:

an ideal decision boundary would be a line that is at a maximum distance from any data point.

So, if we think of the decision boundary as a road, we want that road to be as wide as possible. This is exactly what SVM aims to do.

How it Works (Mathematically)

Now that we understand what SVM aims to do, our next step is to understand how it finds this decision boundary. So, let’s start from scratch with the help of the following diagram.

Figure 3: Mathematical formulation to identify the ideal decision boundary

First, we will derive the equation of the decision boundary in terms of the data points. To that end, let us suppose we already have a decision boundary (the blue line in the above diagram) and two unknown points that we have to classify. We represent these points as vectors u and v in the 2-D space. We also introduce a vector w that is perpendicular to the decision boundary. Now, we project u and v onto the direction of w and compare the size of this projection to some threshold c, which tells us whether the point lies on the left or the right side of the decision boundary.

Mathematically, we say that a data point x is on the right side of the decision boundary (that is, in the Star class) if w.x ≥ c; otherwise, it is in the Plus class. This means that the equation of the hyperplane (a line in the 2-D case) that separates the two classes, in terms of an arbitrary data point x, is the following:
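w.x + b = 0, where b = -c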

Now we have the equation of our decision boundary, but it is not yet clear how it helps us maximize its distance from the data points of both classes. To that end, we employ a trick that goes as follows. Usually, in a binary classification problem, the labels of the data samples are +1 or -1. Thus, it would be more convenient for us if our decision rule (i.e. w.x + b) outputs a quantity greater than or equal to +1 for all the data points belonging to the Star class and a quantity less than or equal to -1 for all the data points belonging to the Plus class.

Mathematically, x should belong to the Star class if w.x + b ≥ 1 and to the Plus class if w.x + b ≤ -1. Equivalently, we can write
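y_i (w.x_i + b) ≥ 1        (A)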

for each point x_i, where we consider y_i equal to -1 for the Plus class and equal to +1 for the Star class.

These two rules correspond to the dotted lines in the following diagram, and the decision boundary is parallel to and equidistant from both. As we can see, the points closest to the decision boundary (on either side) get to dictate its position. Now, since the decision boundary has to be at a maximum distance from the data points, we have to maximize the distance d between the dotted lines. By the way, the data points that lie on these dotted lines are called support vectors, which is where the technique gets its name.

Figure 4: Margin of a decision boundary

Now, let us denote the closest plus to the decision boundary as x_- and the closest star as x_+. Then, d is the length of the vector x_+ − x_- when projected along the direction of w (which is perpendicular to the decision boundary).

Figure 5: Identifying the width of the margin

Mathematically, d could be written as:
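d = (x_+ − x_-) . w / |w|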

Since x_+ and x_- are closest to the decision boundary and touch the dotted lines as mentioned earlier, they satisfy the following equations:
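w.x_+ + b = +1, which gives x_+ . w = 1 − b
w.x_- + b = -1, which gives x_- . w = -1 − b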

Substituting x_+ . w and x_- . w in the equation of d, we get:
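d = ((1 − b) − (-1 − b)) / |w| = 2 / |w|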

Thus, if we have to maximize d, we can equivalently minimize |w|, or minimize 1/2 |w|^2 (this transformation is done for mathematical convenience). However, this optimization must be subject to the constraint of correctly classifying all the data points. Hence, we’ll make use of Lagrange multipliers here to enforce the constraint from equation (A).

Now, it is time to do some mathematics. Formally, our objective is to minimize the following objective function:
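L = 1/2 |w|^2 − Σ_i λ_i [ y_i (w.x_i + b) − 1 ]        (D)

where the λ_i ≥ 0 are the Lagrange multipliers, one for each constraint.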

Differentiating L with respect to w and setting the derivative to zero, we obtain the optimal w as
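w = Σ_i λ_i y_i x_i        (E)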

The interesting thing to note here is that the decision vector w is a linear combination of the input vectors (the data points) x_i. The next step is to differentiate L with respect to b and set that derivative to zero as well, which gives us the following equality:
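Σ_i λ_i y_i = 0        (F)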

Now, we will substitute (E) into (D) and use (F) to rearrange the objective function into the following:
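L = Σ_i λ_i − 1/2 Σ_i Σ_j λ_i λ_j y_i y_j (x_i . x_j)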

If you look closely, you will notice that the optimization function now depends on the input vectors (that is, the data points) only through their dot products. This is a nice property to have, for reasons we will discuss later on. Also, this optimization problem is convex, so we will not get stuck in a local optimum.

Now that we have everything, we can apply standard optimization routines (for example, gradient-based methods or an off-the-shelf quadratic-programming solver) to find the values of the λs. I would encourage you to implement this and observe the obtained values of the λs. You will notice that λ is zero for all the points except the ones closest to the decision boundary on either side. This means that the points far away from the decision boundary don’t get a say in deciding where it should be. All the importance (through non-zero λs) is assigned to the points closest to the boundary, which matches our intuition all along.
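If you would rather not code the optimization from scratch, here is a minimal sketch of the same experiment using scikit-learn’s SVC on some made-up, linearly separable points (the data and the choice of a very large C are purely illustrative).

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: the Plus class is labelled -1, the Star class +1 (made-up points)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # Plus class
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])  # Star class
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation derived above
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The points with non-zero lambdas, i.e. the support vectors
print("Support vectors:\n", clf.support_vectors_)
# dual_coef_ holds lambda_i * y_i for each support vector
print("lambda_i * y_i:", clf.dual_coef_)
# For a linear kernel we can also read off w and b directly
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
```

Running it, you should see that only the points nearest the boundary appear in support_vectors_; every other point gets λ = 0 and has no influence on w.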

Does it Work on More General Cases?

So, now you know all about what SVM aims at and how it goes about it. But, what about the following case where the data points are not linearly separable:

Figure 6: Motivating example for linearly inseparable cases

In this case, no line can satisfy all of our constraints, so the optimization above would fail to settle on a sensible decision boundary and we’d end up with a poor result. Does this mean we can’t apply this technique anymore? Fortunately, the answer is no. For scenarios like these, we have two options:

1 – We can allow our algorithm to make a certain number of mistakes so that other points can still be classified correctly. In this case, we’ll modify our objective function to do just that. This is called the soft margin formulation of SVM.

2 – We can transform our data into a higher-dimensional space (say from 2-D to 3-D, though even higher dimensions are possible) in the hope that the points become linearly separable there. We’ll use the "kernel trick" in this case, which stays computationally inexpensive precisely because the objective function depends on the input vectors only through their dot products. (A quick preview of both options follows below.)
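As a brief preview before the next post: in scikit-learn, these two remedies show up as the C parameter (soft margin) and the kernel parameter (kernel trick) of SVC. The parameter values below are purely illustrative.

```python
from sklearn.svm import SVC

# Soft margin: a smaller C tolerates more misclassified points (illustrative value)
soft_margin_svm = SVC(kernel="linear", C=1.0)

# Kernel trick: swap the plain dot product for a kernel, e.g. the RBF kernel (illustrative choice)
kernel_svm = SVC(kernel="rbf", C=1.0, gamma="scale")
```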

Concluding Remarks

We’ll learn about these two methods in the next post which is linked below. If you have any questions or suggestions, please let me know in the comments. Thanks for reading. Cheers! 🥂

Support Vector Machines – Soft Margin Formulation and Kernel Trick


Feedback is a Gift

If you found the content informative and would like to learn more about Machine Learning and AI, please follow the profile to get notified about my future posts. Also, let me know in the comments if you have any questions or topic requests. All the images in the post are created by me.

Feel free to visit my website to get in touch: http://rishabhmisra.github.io/ – I frequently consult on real-world AI applications.

