ML101

Learn about Support Vector Machines

This post will cover SVMs as part of our ML 101 series.

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression, although in practice they are mostly applied to classification problems. SVMs were first introduced in the 1960s and were refined in the 1990s. Their approach differs from that of most other machine learning algorithms, and they remain very popular because of their ability to handle multiple continuous and categorical variables.

When do we need to use SVM?

You might be wondering when and how to use SVM. Let's find out.
We use SVM when we have a dataset whose data points belong to different classes and we need to tell those classes apart. In short, SVM classifies the data points in a dataset.

Now the question arises: how does SVM classify the data?
SVM finds an ideal line that separates the data points of one class from those of the other, creating 2 classes. The line that separates the classes is called a hyperplane. We'll see a graphical representation of the hyperplane shortly.

Working of SVM

An SVM model is basically a representation of the different classes, separated by a hyperplane, in multidimensional space. SVM generates the hyperplane iteratively so that the classification error is minimized. The goal of SVM is to divide the dataset into classes by finding the maximum marginal hyperplane (MMH).
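
To make this concrete, here is a minimal sketch of fitting a linear SVM with scikit-learn. The six data points and the two query points are made up for illustration and are not from any real dataset; the sketch also assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel asks for a straight-line (hyperplane) boundary.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The learned hyperplane is the set of points where w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)

# New points are labelled according to the side of the hyperplane they fall on.
predictions = clf.predict([[1.2, 1.4], [5.8, 5.2]])
print(predictions)
```

The attributes coef_ and intercept_ are only available for the linear kernel, which is why the sketch uses kernel="linear".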

**Figure 1(Real-Life Applications of SVM (Support Vector Machines)), Data Flair, 2020**

Let's get an idea of how SVM works by classifying a dataset.

We need to classify this dataset of blue ellipses and orange triangles with the help of the SVM (Support Vector Machine) algorithm. We are going to find an ideal line, i.e. a hyperplane, that separates this dataset into blue and orange categories.

Let's draw 2 random lines and check which line best separates both classes.

There are 2 lines here i.e. green and yellow.

BRAINSTORMING TIME!!!

Which line do you think best classifies the data set into 2 separate classes?

Enough thinking. The right answer is green. The green line sits right in the middle between the two classes, so it separates them better. Intuitively, green looks like the right line.

Wondering why we didn't choose the yellow line? It lies closer to the ellipses than to the triangles, and SVM looks for the line that keeps an equal distance from the nearest points of both classes. So the green line makes the ideal line/hyperplane.

Hyperplane

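A hyperplane is the decision boundary that classifies the dataset into 2 different classes. In two dimensions it is a line; with more features it becomes a plane or a higher-dimensional surface.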

Support Vectors

In the SVM algorithm, we find the points from both classes that are closest to the line. These points are called support vectors.

Support vectors are the data points closest to the hyperplane. They define the separating line, because the margins are calculated from them. Now that we've used the word margin, let's find out what a margin really is in Support Vector Machines.
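
As a rough illustration, a fitted scikit-learn SVC exposes the support vectors it found. The toy points below are hypothetical, and the large C value is an assumption chosen to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 near the origin, class 1 further out.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C leaves little room for margin violations.
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

support_vectors = clf.support_vectors_   # coordinates of the support vectors
support_indices = clf.support_           # their row indices in X
n_per_class = clf.n_support_             # how many each class contributes

print(support_vectors)
print(support_indices, n_per_class)
```

Only the points nearest the boundary show up in support_vectors_; points deep inside their own class do not influence the fitted hyperplane.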

Margins

A margin is the gap between the two lines drawn through the closest points of each class. We can calculate it by taking the perpendicular distance from the hyperplane to the support vectors.

There are 2 types of margin: good margin and bad margin. You might be wondering what the difference between them is.

Good Margin & Bad Margin

It's a good margin if the gap between the classes is large. If the gap between the classes is small, it's a bad one.
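
The difference between a good and a bad margin can be sketched numerically. The four points, the helper min_distance_to_line and the off-center line x = 0.5 are all hypothetical and only serve the illustration. For a linear SVM with hyperplane w . x + b = 0, the full margin width is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 sits at x = 0, class 1 at x = 3.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])


def min_distance_to_line(w, b, points):
    """Smallest perpendicular distance from any point to the line w . x + b = 0."""
    w = np.asarray(w, dtype=float)
    return float(np.min(np.abs(points @ w + b)) / np.linalg.norm(w))


# A large C approximates a hard-margin SVM on this separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

margin_width = 2.0 / np.linalg.norm(w)             # full gap between the classes
svm_gap = min_distance_to_line(w, b, X)            # distance to the nearest point
off_center_gap = min_distance_to_line([1.0, 0.0], -0.5, X)  # the line x = 0.5

print(margin_width, svm_gap, off_center_gap)
```

The off-center line also separates the two classes, but its nearest point is much closer, which is what makes its margin a bad one.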

Let's look at a graphical representation showing the margin, hyperplane and support vectors.

Goal of SVM

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is called a Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane.

Types of SVM

Now that we've understood the basic working of Support Vector Machines, let's discuss the types of SVM.

SVM is generally divided into 2 categories: Linear SVM and Non-Linear SVM.

Linear SVM

We use Linear SVM for linearly separable data. More specifically, if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data. The classifier we use for such a dataset is called a Linear SVM classifier.

An example of Linear SVM follows:

Non-Linear SVM

Non-Linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-Linear SVM classifier.

In other words, Non-Linear SVM is used for data that is not linearly separable, that is, data that a straight line cannot split into classes.

We need a way to deal with such inseparable data, and for that we have the kernel trick. SVM uses the kernel trick to classify data that is not linearly separable. The kernel trick transforms the input space to a higher dimensional space.

Let's understand this by drawing some figures. Below is a figure that clearly shows an inseparable dataset, i.e. we cannot draw a single straight line to classify it.

We have to use a higher dimension for this data to be separated. In the figure above, we've got 2 axes, i.e. x and y. We're going to insert a new axis here, say a z-axis.

Let z be:

z = x^2 + y^2

By this equation, it is clear that the z coordinate is the square of the distance of a point from the origin.

What should be the next step? Plot the data against the z-axis. Let's give it a try.

In the new space the two classes can be separated by a straight line. Mapped back to the original x-y plane, that boundary becomes a circle around the inner class.

Hence, by using SVM's kernel trick, we can easily make inseparable data separable by adding an extra dimension. Do you see? It all becomes more interesting once you get it.
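
The extra-dimension idea can be sketched by adding the z coordinate by hand. This assumes scikit-learn is installed and uses its synthetic make_circles dataset as a stand-in for the figure; the dataset and its parameters are illustrative choices, not the data from the figures.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic stand-in data: a small circle of one class inside a ring of the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# New axis: z is the squared distance of each point from the origin.
z = X[:, 0] ** 2 + X[:, 1] ** 2
X_lifted = np.column_stack([X, z])

# A linear SVM on the original x, y coordinates versus on x, y, z.
flat_acc = SVC(kernel="linear").fit(X, y).score(X, y)
lifted_acc = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

print("accuracy with x, y only:", flat_acc)
print("accuracy with x, y, z:  ", lifted_acc)
```

In practice a kernel does this lifting implicitly, so you rarely need to build the extra column yourself.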

Dealing with non-linear and inseparable planes

Let me explain Kernel Trick with another example.

Some problems can't be solved using a linear hyperplane, as shown in the figure below (left-hand side). In such situations, SVM uses a kernel trick to transform the input space to a higher dimensional space, as shown in the picture below on the right. The data points are plotted on the x-axis and z-axis (z is the sum of the squares of x and y: z = x^2 + y^2). Now you can easily segregate these points using linear separation.

**Figure 9 (Non-linear & Inseparable planes illustration)**

Kernel Trick

The kernel trick is also known as the generalized dot product. A kernel function computes the dot product of two vectors as if they had been mapped into a higher dimensional space, without ever computing that mapping explicitly; the result measures how similar the two vectors are. According to Cover's theorem, the chances of linearly non-separable datasets becoming linearly separable increase in higher dimensions. Kernel functions supply the dot products needed to solve the SVM constrained optimization problem.
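
A small worked example may help here. For 2-D vectors, the degree-2 polynomial kernel K(a, b) = (a . b)^2 gives the same number as an ordinary dot product taken after mapping each vector with phi(x1, x2) = (x1^2, sqrt(2) * x1 * x2, x2^2). The function names phi and poly2_kernel and the two sample vectors are chosen purely for illustration.

```python
import numpy as np


def phi(v):
    """Explicit feature map into 3-D space for the degree-2 polynomial kernel."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


def poly2_kernel(a, b):
    """Same quantity computed directly in the original 2-D space."""
    return float(np.dot(a, b)) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = float(np.dot(phi(a), phi(b)))  # map first, then take the dot product
implicit = poly2_kernel(a, b)             # kernel trick: no mapping needed

print(explicit, implicit)  # both are 121 (up to floating-point rounding)
```

The kernel version never builds the 3-D vectors, which is the whole point when the mapped space is very large or even infinite-dimensional.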

SVM Kernel Functions
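
scikit-learn's SVC accepts four built-in kernel names: linear, poly, rbf and sigmoid. The sketch below fits each of them on a synthetic make_circles dataset, which is an illustrative choice and not data from this post, and records the training accuracy so the kernels can be compared.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic stand-in data that no straight line can separate.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Kernel formulas as given in the scikit-learn documentation:
#   linear : <x, x'>
#   poly   : (gamma * <x, x'> + coef0) ** degree
#   rbf    : exp(-gamma * ||x - x'||^2)
#   sigmoid: tanh(gamma * <x, x'> + coef0)
scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale")
    scores[kernel] = clf.fit(X, y).score(X, y)

for kernel, accuracy in scores.items():
    print(f"{kernel:8s} training accuracy: {accuracy:.2f}")
```

Training accuracy is used only to keep the sketch short; a real comparison would use held-out data or cross-validation.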

Pros & Cons of Support Vector Machine

Now that we understand SVM and its types, let's focus on its pros and cons.

Pros

SVM produces accurate results when there is a clear margin of separation between the classes.
SVM can classify both linearly separable and non-linearly separable datasets.
SVM is effective in high-dimensional spaces.
SVM is memory efficient, as it uses only the support vectors in the decision function.
SVM provides better results in contrast to the traditional searching techniques.
SVM supports common built-in kernels as well as custom kernels.
SVM works well on smaller datasets.

Cons

SVM is not suitable for complex and larger datasets.
Training time is long when complex or larger datasets are involved.
SVM doesn't perform well when the classes overlap.
Picking the right kernel can be computationally intensive.

Unbalanced problems

In problems where it is desired to give more importance to certain classes or certain individual samples, the parameters class_weight and sample_weight can be used.

SVC (but not NuSVC) implements the parameter class_weight in the fit method. It’s a dictionary of the form {class_label : value}, where value is a floating point number > 0 that sets the parameter C of class class_label to C * value. The figure below illustrates the decision boundary of an unbalanced problem, with and without weight correction.

SVC, NuSVC, SVR, NuSVR, LinearSVC, LinearSVR and OneClassSVM also implement weights for individual samples in the fit method through the sample_weight parameter. Similar to class_weight, this sets the parameter C for the i-th example to C * sample_weight[i], which will encourage the classifier to get these samples right. The figure below illustrates the effect of sample weighting on the decision boundary. The size of the circles is proportional to the sample weights:
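
Both weighting options can be sketched on a synthetic unbalanced dataset. The data (200 majority points, 20 minority points) and the weight of 10 are assumptions chosen for illustration. In scikit-learn, class_weight is passed when the SVC is constructed, while sample_weight is passed to fit.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical unbalanced data: 200 points of class 0, 20 points of class 1.
rng = np.random.RandomState(0)
X_major = rng.randn(200, 2)
X_minor = rng.randn(20, 2) + [2.0, 2.0]
X = np.vstack([X_major, X_minor])
y = np.array([0] * 200 + [1] * 20)

# No weighting versus a 10x larger C for the minority class.
plain = SVC(kernel="linear", C=1.0).fit(X, y)
weighted = SVC(kernel="linear", C=1.0, class_weight={1: 10}).fit(X, y)

minority_recall_plain = float((plain.predict(X_minor) == 1).mean())
minority_recall_weighted = float((weighted.predict(X_minor) == 1).mean())
print(minority_recall_plain, minority_recall_weighted)

# The same emphasis expressed per sample instead of per class.
sample_weight = np.ones(len(y))
sample_weight[y == 1] = 10.0
per_sample = SVC(kernel="linear", C=1.0).fit(X, y, sample_weight=sample_weight)
agreement = float(np.mean(weighted.predict(X) == per_sample.predict(X)))
print(agreement)
```

Raising the weight of the minority class pushes the boundary toward the majority class, trading some majority errors for better minority recall.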

Applications of SVM

SVM has many applications in daily life. It is used in sentiment analysis.

It can also do face detection. SVM classifies parts of the image as face or non-face, and creates a square boundary around the face.

SVM is also helpful in image recognition. It provides enhanced precision for image classification.

SVM (Support Vector Machine) is also applied to text and hypertext categorization.

SVM allows categorization for both inductive and transductive models. Support Vector Machines classify documents into different categories using training data. Each document is classified on the basis of the score produced, which is compared with a threshold value.

Support Vector Machines (SVMs) have vast applications in the field of bioinformatics. SVM algorithms are used in protein fold & remote homology detection.

Handwritten characters are recognized with the help of SVM. SVM can control dynamics using helpful parameters in Generalized Predictive Control.

Spam Detection is also one of the applications of Support Vector Machines.

Tips on Practical Use

Avoiding data copy: For SVC, SVR, NuSVC and NuSVR, if the data passed to certain methods is not C-ordered contiguous and double precision, it will be copied before calling the underlying C implementation. You can check whether a given numpy array is C-contiguous by inspecting its flags attribute.
For LinearSVC (and LogisticRegression) any input passed as a numpy array will be copied and converted to the liblinear internal sparse data representation (double precision floats and int32 indices of non-zero components). If you want to fit a large-scale linear classifier without copying a dense numpy C-contiguous double precision array as input, we suggest using the SGDClassifier class instead. The objective function can be configured to be almost the same as the LinearSVC model.
Kernel cache size: For SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a higher value than the default of 200 (MB), such as 500 (MB) or 1000 (MB).
Setting C: C is 1 by default and it’s a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
LinearSVC and LinearSVR are less sensitive to C when it becomes large, and prediction results stop improving after a certain threshold. Meanwhile, larger C values will take more time to train, sometimes up to 10 times longer, as shown in [11].
Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be done easily by using a Pipeline.
Regarding the shrinking parameter, quoting [12]: "We found that if the number of iterations is large, then shrinking can shorten the training time. However, if we loosely solve the optimization problem (e.g., by using a large stopping tolerance), the code without using shrinking may be much faster."
Parameter nu in NuSVC/OneClassSVM/NuSVR approximates the fraction of training errors and support vectors.
In SVC, if the data is unbalanced (e.g. many positive and few negative), set class_weight='balanced' and/or try different penalty parameters C.
Randomness of the underlying implementations: The underlying implementations of SVC and NuSVC use a random number generator only to shuffle the data for probability estimation (when probability is set to True). This randomness can be controlled with the random_state parameter. If probability is set to False these estimators are not random and random_state has no effect on the results. The underlying OneClassSVM implementation is similar to the ones of SVC and NuSVC. As no probability estimation is provided for OneClassSVM, it is not random.
The underlying LinearSVC implementation uses a random number generator to select features when fitting the model with a dual coordinate descent (i.e. when dual is set to True). It is thus not uncommon to have slightly different results for the same input data. If that happens, try a smaller tol parameter. This randomness can also be controlled with the random_state parameter. When dual is set to False the underlying implementation of LinearSVC is not random and random_state has no effect on the results.
Using L1 penalization as provided by LinearSVC(penalty='l1', dual=False) yields a sparse solution, i.e. only a subset of feature weights is different from zero and contributes to the decision function. Increasing C yields a more complex model (more features are selected). The C value that yields a “null” model (all weights equal to zero) can be calculated using l1_min_c.
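
Three of the tips above (checking C-contiguity, scaling inside a Pipeline and raising the kernel cache size) can be sketched together. The synthetic data from make_classification, the exaggerated scale of the first feature and the cache_size of 500 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one feature is blown up to mimic badly scaled inputs.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000.0

# Avoiding data copy: make sure the array is C-ordered, double precision.
X = np.ascontiguousarray(X, dtype=np.float64)
is_c_contiguous = bool(X.flags["C_CONTIGUOUS"])
print("C-contiguous:", is_c_contiguous)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline learns the scaling on the training data only and then
# applies the same scaling to anything passed to predict or score.
model = make_pipeline(StandardScaler(), SVC(C=1.0, cache_size=500))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Wrapping the scaler and the classifier in one pipeline removes the risk of forgetting to scale the test data, or of scaling it with different statistics.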

End Notes

In the article above, we discussed the main aspects of Support Vector Machines: how they work, their types, pros & cons, unbalanced problems and applications.

I hope you find this article helpful and enlightening. I'd be happy to hear any comments or suggestions you might have. Don't hesitate to comment in the feedback section below.