1. Introduction
This blog post is about Support Vector Machines (SVMs), but not only about SVMs. SVMs belong to the class of classification algorithms and are used to separate two or more groups. In its pure form an SVM is a linear separator, meaning that SVMs can only separate groups using a straight line. However, any linear classifier can be transformed into a nonlinear classifier, and SVMs are an excellent example for explaining how this can be done. For a deeper introduction to the topic I recommend Tibshirani (2009), where one can find a more detailed description, including a derivation of the complete Lagrangian.
The general idea of the SVM is to separate two (or more) groups using a straight line (see Figure 1). In general, however, there exist infinitely many lines that fulfill this task. So which one is the “correct” one? The SVM answers this question by choosing the line (or hyperplane, if we suppose more than two features) that is farthest away (this distance is denoted by $M$) from the nearest points within each group.
2. Mathematical Formulation
2.1 Separating Hyperplanes
Suppose multivariate data given by a pair $(x, y)$, where the explanatory variable $x \in \mathbb{R}^p$ and the group coding $y \in \{-1, 1\}$. In the following we assume an i.i.d. sample of size $n$ given by $(x_1, y_1), \dots, (x_n, y_n)$.
Any separating hyperplane (which is a line if $p = 2$) can therefore be described by requiring that there exist some $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$ such that
$$y_i \left( x_i^\top \beta + \beta_0 \right) \ge M, \quad i = 1, \dots, n, \qquad \|\beta\| = 1. \tag{1}$$
Note that maximizing $M$ under (1) is equivalent to setting $M = 1/\|\beta\|$ with the (non-restrictive) condition that $y_i(x_i^\top \beta + \beta_0) \ge 1$ for all $i$, which lets us drop the constraint $\|\beta\| = 1$. Our task to separate the groups is then covered by the minimization problem

$$\min_{\beta, \beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(x_i^\top \beta + \beta_0) \ge 1, \quad i = 1, \dots, n. \tag{2}$$
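To make (2) concrete, here is a minimal sketch that feeds the hard-margin problem to a general-purpose constrained solver on a small, separable toy data set. The data, the helper names and the use of SciPy's SLSQP solver are my own illustration, not part of the derivation above.

```python
import numpy as np
from scipy.optimize import minimize

# toy data: two well-separated groups, labels coded as -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

def objective(params):
    beta = params[1:]
    return 0.5 * beta @ beta                    # (1/2) ||beta||^2 as in (2)

def margin_constraints(params):
    beta0, beta = params[0], params[1:]
    return y * (X @ beta + beta0) - 1.0         # must be >= 0, cf. (2)

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])

beta0_hat, beta_hat = res.x[0], res.x[1:]
print("beta:", beta_hat, " beta_0:", beta0_hat)
print("margin M = 1/||beta|| =", 1.0 / np.linalg.norm(beta_hat))
```

The recovered $1/\|\hat{\beta}\|$ is exactly the margin $M$ discussed above.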
2.2 Support Vector Machines
In most cases the assumption that there exists a hyperplane that perfectly separates the data points is unrealistic; usually some points will lie on the wrong side of the hyperplane. In that case (2) has no solution. The idea of the SVM is now to introduce slack variables $\xi_i \ge 0$ to fix this issue. In particular, we modify (1) such that we require
$$y_i \left( x_i^\top \beta + \beta_0 \right) \ge M (1 - \xi_i), \qquad \xi_i \ge 0, \quad i = 1, \dots, n. \tag{3}$$
The corresponding minimization problem is then given by

$$\min_{\beta, \beta_0} \; \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad \xi_i \ge 0, \;\; y_i(x_i^\top \beta + \beta_0) \ge 1 - \xi_i, \quad i = 1, \dots, n, \tag{4}$$

where the cost parameter $C$ controls how strongly violations of the margin are penalized.
Both (2) and (4) are convex optimization problems and can be solved, for example, using the Lagrange (dual) formulation. A solution of (4) always has the form

$$\hat{\beta} = \sum_{i=1}^{n} \hat{\alpha}_i y_i x_i, \tag{5}$$
where $\hat{\alpha}_i \neq 0$ iff the constraint in (3) is met with “=” and $\hat{\alpha}_i = 0$ else; the observations $x_i$ with $\hat{\alpha}_i \neq 0$ are then called support vectors.
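In practice one rarely solves (4) by hand. The following sketch uses scikit-learn's SVC (an assumption on my part; any SVM solver would do) to show how the cost parameter $C$ from (4) influences the number of support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# toy data with some overlap, labels coded as -1 / +1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# small C tolerates many margin violations (many support vectors),
# large C penalizes them heavily (fewer support vectors)
for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>4}: {clf.n_support_.sum()} support vectors, "
          f"beta = {clf.coef_.ravel()}, beta_0 = {clf.intercept_[0]:.3f}")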
2.3 Nonlinear SVMs
The method we have described so far can only handle data where the groups can be separated by some hyperplane (line). In many cases, however, the data are not suited to be separated by a linear method. See Figure 2 for example: there the groups are arranged in two circles with different radii, so any attempt to separate them using a line must fail. However, any linear method can be transformed into a nonlinear method by projecting the data into a higher-dimensional space, where the transformed data may then be separable by a linear method. In the case of Figure 2 a suitable projection is, for example, $h(x) = (x_1, x_2, \sqrt{x_1^2 + x_2^2})$, which maps the data onto a cone, where they can be separated by a hyperplane, as can be seen in Figure 3. There are, however, some drawbacks to this method. First of all, $h$ will in general be unknown. In our simple example we were lucky to find a suitable $h$, but for a more complicated data structure finding a suitable $h$ turns out to be very hard. Secondly, depending on the data set, the dimension of $h$ needed to guarantee the existence of a separating hyperplane can become quite large, even infinite. Before we deal with these issues using kernels, we will first have a look at the modified minimization problem (4) using $h(x)$ instead of $x$.
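The projection idea is easy to try out. The sketch below is my own illustration (it assumes scikit-learn is available): it generates two concentric circles as in Figure 2, applies an explicit map $h$ like the one above, and fits a linear SVM in the lifted three-dimensional space.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# two concentric circles as in Figure 2 (labels are coded 0/1 here)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# explicit feature map h(x) = (x1, x2, sqrt(x1^2 + x2^2)):
# the third coordinate lifts the points onto a cone
H = np.c_[X, np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)]

linear_2d = LinearSVC(C=1.0).fit(X, y)   # no separating line exists
linear_3d = LinearSVC(C=1.0).fit(H, y)   # separable in the lifted space

print("accuracy in the original space:", linear_2d.score(X, y))
print("accuracy after the projection h:", linear_3d.score(H, y))
```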
Let the nonlinear solution function be given by $f(x) = h(x)^\top \beta + \beta_0$. To get rid of the explicit constraints and the cost constant $C$ we introduce a penalty parameter $\lambda > 0$ and rewrite (4) as
$$\min_{\beta, \beta_0} \; \sum_{i=1}^{n} \bigl[ 1 - y_i f(x_i) \bigr]_+ + \frac{\lambda}{2} \|\beta\|^2. \tag{6}$$
The notation $[\,\cdot\,]_+$ means that only the positive part of $1 - y_i f(x_i)$ is taken into account, i.e. $[u]_+ = \max(0, u)$. At this point I would like to establish the connection to other linear methods. Since we consider SVMs, we set the loss function to $L(y, f(x)) = [1 - y f(x)]_+$, the so-called hinge loss. Different loss functions lead to different methods; for example, using $L(y, f(x)) = \log\bigl(1 + e^{-y f(x)}\bigr)$ corresponds to logistic regression. Therefore the following strategy can be used to extend linear methods so that they deliver nonlinear solutions.
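The two loss functions are easy to compare numerically. The small sketch below (the function names are my own) evaluates the hinge loss from (6) and the logistic loss on a grid of margin values $y f(x)$:

```python
import numpy as np

def hinge_loss(margin):
    """SVM loss [1 - y f(x)]_+ as a function of the margin y f(x)."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Loss log(1 + exp(-y f(x))) that corresponds to logistic regression."""
    return np.log1p(np.exp(-margin))

for m in np.linspace(-2.0, 3.0, 6):
    print(f"y*f(x) = {m:+.1f}   hinge = {hinge_loss(m):.3f}   "
          f"logistic = {logistic_loss(m):.3f}")
```

Both losses decrease with the margin, but the hinge loss is exactly zero once a point lies on the correct side with margin at least one; this is what makes the SVM solution depend only on the support vectors.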
From (5) we already know that a nonlinear solution function will have the form
$$f(x) = \sum_{i=1}^{n} \hat{\alpha}_i y_i \langle h(x), h(x_i) \rangle + \beta_0. \tag{7}$$
We can see that knowledge about $h$ itself is in fact not required to formulate the solution; we only need to know the inner products $\langle h(x), h(x') \rangle$. While $h$ could be a high-dimensional, complicated function, the inner product is just a number. So if we had some magical procedure that gives us this number, solving the nonlinear SVM would be straightforward. This leads us to think about kernels.
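This can be checked numerically. For the degree-2 polynomial kernel $K(x, x') = (1 + \langle x, x' \rangle)^2$ an explicit feature map is known, and the sketch below (my own illustration) verifies that the kernel value equals the inner product $\langle h(x), h(x') \rangle$ without ever needing $h$ inside the SVM:

```python
import numpy as np

def h(x):
    """Explicit feature map of K(x, x') = (1 + <x, x'>)^2 for x in R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def K(x, z):
    """The same quantity computed directly from the kernel."""
    return (1.0 + x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

print("<h(x), h(z)> =", h(x) @ h(z))   # 0.25
print("K(x, z)      =", K(x, z))       # 0.25 as well
```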
2.3.1 Kernels
We have already come across kernels when estimating densities or regression functions. In this application, kernels are a way to reduce the infinite-dimensional problem to a finite-dimensional optimization problem, because the complexity of the optimization problem then depends only on the dimensionality of the input space and not on that of the feature space. To see this, instead of looking at some particular function $h$, we consider the whole space of functions generated by the linear span of $\{ K(\cdot, x') : x' \in \mathbb{R}^p \}$, of which $f$ is just one element. To model this space in practice, popular kernels are the following (a small numerical illustration follows the list):
- linear kernel: $K(x, x') = \langle x, x' \rangle$
- polynomial kernel: $K(x, x') = \bigl(1 + \langle x, x' \rangle\bigr)^d$
- radial kernel: $K(x, x') = \exp\bigl(-\gamma \|x - x'\|^2\bigr)$
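The sketch below evaluates these three kernels on a tiny data matrix; it assumes scikit-learn is available, whose sklearn.metrics.pairwise module provides ready-made implementations:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

X = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.0]])

# each call returns the 3 x 3 Gram matrix with entries K(x_i, x_j)
print(linear_kernel(X))                                  # <x_i, x_j>
print(polynomial_kernel(X, degree=2, gamma=1, coef0=1))  # (1 + <x_i, x_j>)^2
print(rbf_kernel(X, gamma=0.5))                          # exp(-0.5 ||x_i - x_j||^2)
```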
Every such kernel admits an eigen-expansion

$$K(x, x') = \sum_{m=1}^{\infty} \gamma_m \, \phi_m(x) \, \phi_m(x'), \tag{8}$$

where the $\phi_m$ are the orthonormal eigenfunctions of $K$ and the $\gamma_m \ge 0$ its eigenvalues, which ensure that the generated space is a Hilbert space. Then every element of this space can be written as $f(x) = \sum_{m=1}^{\infty} \theta_m \phi_m(x)$ for suitable coefficients $\theta_m$, and in particular, for some $\alpha_1, \dots, \alpha_n$, the solution can be represented as

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i). \tag{9}$$
Thus we can write (6) in its general form as
$$\min_{f} \; \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) + \frac{\lambda}{2} \|f\|^2, \tag{10}$$

where $\|f\|$ denotes the norm of $f$ in the function space generated by $K$.
Using (9), we can write (10) in matrix notation as
$$\min_{\alpha} \; \sum_{i=1}^{n} L\bigl(y_i, (\mathbf{K}\alpha)_i\bigr) + \frac{\lambda}{2} \alpha^\top \mathbf{K} \alpha, \tag{11}$$

where $\mathbf{K}$ is the $n \times n$ Gram matrix with entries $\mathbf{K}_{ij} = K(x_i, x_j)$ and $\alpha = (\alpha_1, \dots, \alpha_n)^\top$.
This finite-dimensional problem can now be solved using standard methods.
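As an illustration of this last step, the sketch below (my own, assuming scikit-learn is available) precomputes the Gram matrix $\mathbf{K}$ for the radial kernel and hands it to an off-the-shelf SVM solver, so the whole fit depends on the data only through $\mathbf{K}$:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# two concentric circles again (labels coded 0/1)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

K = rbf_kernel(X, X, gamma=2.0)                  # n x n Gram matrix K_ij = K(x_i, x_j)
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

# new points enter only through their kernel values against the training data
X_new = np.array([[0.0, 0.0], [1.0, 0.0]])
K_new = rbf_kernel(X_new, X)                     # shape (2, n)
print("training accuracy:", clf.score(K, y))
print("predictions for the new points:", clf.predict(K_new))
```

Note that SVC solves the SVM dual rather than (11) literally, but the point is the same: only kernel evaluations are needed, never the feature map $h$ itself.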