Computing the Support Vector Classifier
- The previous problem is quadratic with linear inequality constraints
- A convex optimization problem
- Quadratic programming solution using Lagrange multipliers
- We re-express the problem as:
- \(\min_{\beta,\beta_0} \frac{1}{2}\|\beta\|^2+C\sum\limits_{i=1}^{N}\xi_i\)
- subject to \(\xi_i \geq 0\), \(y_i(x_i^T\beta+\beta_0) \geq 1 - \xi_i\) \(\forall\) \(i\)
- The "constant" of the previous formulation is replaced by the "cost" parameter \(C\)
- The separable case corresponds to \(C = \infty\)
- The Lagrange function is:
\(L_P=\frac{1}{2}\|\beta\|^2+C\sum\limits_{i=1}^{N}\xi_i\) \(-\sum\limits_{i=1}^{N} \alpha_i[y_i(x_i^T\beta+\beta_0)-(1-\xi_i)]\) \(-\sum\limits_{i=1}^{N}\mu_i\xi_i\)
- We minimize \(L_P\) w.r.t. \(\beta\), \(\beta_0\), and \(\xi_i\); setting the derivatives to zero gives
\(\beta=\sum\limits_{i=1}^{N}\alpha_i y_i x_i\), \(0=\sum\limits_{i=1}^{N}\alpha_i y_i\), \(\alpha_i=C-\mu_i\) \(\forall\) \(i\)
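The constrained primal above can also be attacked directly: since at the optimum \(\xi_i=\max(0,\,1-y_i(x_i^T\beta+\beta_0))\), the slacks can be eliminated and the objective minimized numerically. A minimal sketch, assuming toy two-blob data (the data, the choice of `scipy.optimize.minimize` with Nelder–Mead, and \(C=1\) are illustrative assumptions, not part of the text):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two Gaussian blobs, labels y_i in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.6, (20, 2)),
               rng.normal(+1.0, 0.6, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

C = 1.0  # the "cost" parameter from the primal objective

def primal_objective(w):
    # w packs (beta_0, beta); the optimal slack is
    # xi_i = max(0, 1 - y_i (x_i^T beta + beta_0)),
    # so the objective becomes 0.5*||beta||^2 + C * sum_i xi_i
    beta0, beta = w[0], w[1:]
    margins = y * (X @ beta + beta0)
    xi = np.maximum(0.0, 1.0 - margins)
    return 0.5 * beta @ beta + C * xi.sum()

# Nelder-Mead avoids gradients, which the hinge term lacks at kinks
res = minimize(primal_objective, np.zeros(3), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
beta0, beta = res.x[0], res.x[1:]

pred = np.sign(X @ beta + beta0)
accuracy = (pred == y).mean()
```

This is a sketch of the optimization problem itself, not the quadratic-programming route via Lagrange multipliers described above; a QP solver working on the dual is the standard computational approach.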