Note 4: Support Vector Machine SVM - Tech It Yourself

## Saturday, 2 June 2018

1. Introduction
This algorithm is a different type of intuition of logistic regression. Assume that we have data set, in which green color represent positive training examples and brown color represent negative training examples. A line (or hyperplane) is trying to separate these training examples.
Figure: Separate training examples using a line (or hyperplane)
There are many solutions (black or blue line) for this problem. SVM supports to find the optimal solution.
Figure: The optimal line is red line
The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (use Lagrange multipliers).
2. Notation
Let 's define a binary classification problem with labels y and features x. We will use $y\in {-1,1}$, use w instead of $[\theta_{0}...\theta_{n}]^{T}$ and b instead of $\theta_{0}$. So:
$y=h(x)=g(w^{T}x + b)\\ \text{y=1 if }w^{T}x + b \geq 1\\ \text{y=-1 if }w^{T}x + b \leq -1\\ \text{These can be combined into}\\y^{(i)}(w^{T}x^{(i)}+b)\\ \text{if }y^{(i)}(w^{T}x^{(i)+b}) \ geq 1 \Rightarrow \text{ our prediction on this example is correct}\\ \text{Note: }w^{T}x + b=0 \text{ is the equation of separating hyperplane.}$
We define support vectors are data points that satisfy:
$w^{T}x+b=1 \text{ or } w^{T}x+b=-1$
They are data points that just touch the boundary of the margin circled below, there are 3 of them ($v_{1},v_{2},v_{3}$). d is 1/2 of the 'width' and is the distance between margin and separating hyperplane. (w,b) only depends on these support vectors.
Figure: 3 support vectors
3. Problem
$\text{We define the hyperplane such that for training examples (i=1,...,m) we have:}\\ w^{T}x^{(i)}+b\geq 1\text{ if }y^{(i)}= 1\\ w^{T}x^{(i)}+b\leq 1\text{ if }y^{(i)}= -1\\ \text{and }H_{1} \text{ and }H_{2} \text{ are hyperplane that}\\ H_{1}:w^{T}x^{(i)}+b= 1\\ H_{2}:w^{T}x^{(i)}+b= -1\\ \text{Of course, support vectors lie on these hyperplanes}\\ \text{and}\\ \text{d+: is the shortest distance to the positive example}\\ \text{d-: is the shortest distance to the negative example}\\ \text{The margin of separating hyperplane is (d+) + (d-) }\\ \text{Recall that the distance from point } (x^{(i)}) \text{ to hyperplane } w^{T}x^{(i)}+b =0\\ \text{is calculated by formula:}\\ \frac{|w^{T}x^{(i)}+b|}{\left \| w \right \|}=\frac{1}{\left \| w \right \|}$

The total distance between $H_{1}, H_{2}$ is $\frac{2}{\left \| w \right \|}$.
Our problem is maximize $\frac{2}{\left \| w \right \|\\}$ with constraint that no data points between $H_{1}, H_{2}$. It means:
$w^{T}x^{(i)}+b\geq 1\text{ if }y^{(i)}= 1\\ w^{T}x^{(i)}+b\leq -1\text{ if }y^{(i)}= -1$
Or we can convert it to minimizing problem:
Minimize: $\left \| w \right \|\\$ or $\frac{1}{2}\left \| w \right \|^{2}$
subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1\\ \text{ and }\lambda _{i}\geq 0$
We can solve this problem using KKT. Let 's make an example with very simple data set below and use KKT to solve it.
Figure: 2 data examples (2,2) and (1,1)
Our problem:
Minimize: $\frac{1}{2}\left \| w \right \|^{2}$
subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1\\$
We have:
$w=\begin{bmatrix} w_{1}\\ w_{2} \end{bmatrix}, x=\begin{bmatrix} x_{1}\\ x_{2} \end{bmatrix}$
Substitute data set into constraints:
$+1(2w_{1}+2w_{2}+b)\geq 1\\ -1(w_{1}+w_{2}+b)\geq -1\\ \text{or}\\ 1-(2w_{1}+2w_{2}+b)\leq 0\\ 1+(w_{1}+w_{2}+b)\leq 0$
The Lagrangian:
$L=\frac{1}{2}w_{1}^{2}+\frac{1}{2}w_{2}^{2}+\lambda _{1}(1-(2w_{1}+2w_{2}+b))+\lambda _{2}(1+(w_{1}+w_{2}+b))$
Applying KKT we have:
$w_{1}-2\lambda _{1}+\lambda _{2}=0\\ w_{2}-2\lambda _{1}+\lambda _{2}=0\\ -\lambda _{1}+\lambda _{2}=0\\ \lambda _{1}(1-(2w_{1}+2w_{2}+b))=0\\ \lambda _{2}(1+(w_{1}+w_{2}+b))=0\\ \lambda _{1}\geq 0,\lambda _{2}\geq 0$
There are 4 cases to check. I just check the case $\lambda _{1}> 0,\lambda _{2}> 0$ and solution is: $w_{1}=w_{2}=1, b=-3$
The separating hyperplane is: $x_{1}+x_{2}-3=0$
Figure: Separating hyperplane
We can check that the data points are support vectors.
$(2,2)\Rightarrow y=2+2-3=1\\ (1,1)\Rightarrow y=1+1-3=-1$
If you are familiar with Python cxvopt package that is used for solving convex optimization problem, you can model our problem like below:
Note: The standard form of our problem in CVXOPT is:
$min_{x}\frac{1}{2}x^{T}Px+q^{T}x$
subject to: $Gx\leq h\\Ax=b$
We can re-write our problem under new form:
$\begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}^{T} \begin{bmatrix} 1.0 & 0.0 & 0.0\\ 0.0 & 1.0 & 0.0\\ 0.0 & 0.0 & 0.0 \end{bmatrix} \begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}+\begin{bmatrix} 0.0\\ 0.0\\ 0.0 \end{bmatrix} \begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}$
subject to:
$\begin{bmatrix} -2.0 & -2.0 & -1.0\\ 1.0 & 1.0 & 1.0 \end{bmatrix} \begin{bmatrix} w_{1}\\ w_{2}\\ b \end{bmatrix}\leq \begin{bmatrix} -1.0\\ -1.0 \end{bmatrix}$
  1 2 3 4 5 6 7 8 9 10 11 from cvxopt import matrix from cvxopt import solvers import numpy as np P = matrix(np.array([[1.0,0.0,0.0],[0.0,1.0,0.0],[0.0,0.0,0.0]]), tc='d') q = matrix(np.array([0.0,0.0,0.0]), tc='d') G = matrix(np.array([[-2.0,-2.0,-1.0],[1.0,1.0,1.0]]), tc='d') h = matrix(np.array([-1.0,-1.0]), tc='d') sol = solvers.qp(P,q,G,h) print(sol['x']) 
Figure: Using cvxopt to solve the problem
4. Kernels
We knew how to apply SVM for linear problem. So how to apply SVM for non-linearly separable classification problems. Let 's review our problem and change perspective.
We have the Lagrangian of problem:
$L = \frac{1}{2}\left \| w \right \|^{2} + \sum_{i=1}^{m}\lambda_{i}(1-y^{(i)}(w^{T}x^{(i)}+b))$
We have:
$\bigtriangledown _{w}L=w-\sum_{i=1}^{m}\lambda _{i}x^{(i)}y^{(i)}=0$
and
$\bigtriangledown _{b}L=\sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
We plug these back to Lagrangian:
$L=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}(x^{(i)})^{T}x^{(j)}$
we obtain the dual problem:
$max_f=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$
subject to: $\lambda _{i}\geq 0,i=1,...,m\\ \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
By changing perspective, we can gain insight the problem. And we are able to write the algorithm by only using inner products between input feature vectors.
+ If two feature vectors $x^{i},x^{j}$ are completely dissimilar their dot product is 0, and they don’t contribute to f.
+ If two feature vectors $x^{i},x^{j}$ are completely alike, their inner product is 1. We have 2 cases:
++ If both predict the same output value (+1 or -1) so the $y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$ always positive. This decreases f, so the algorithm downgrades the similar feature vectors that make the same prediction.
++ If both predict the different output value (+1 and -1) so the $y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$ always negative. This increases f, so these points contribute to the solution.
Note:
- At optimal point, the solution of primal and dual is equivalent
- The dual form is useful for an algorithm that supports to solve our problem on computer. It is SMO (sequential minimal optimization) algorithm.
Assume that we have a data set like below:
Figure: non-linear data set
We can not separate this data set using linear problem. In order to solve this problem, we will re-use the idea in linear regression. That is mapping the features to a higher dimensional space (e.g: $x^{2} or x^{3}$) via $\varphi$ a feature mapping function. So our problem becomes linear problem.
Figure: Feature mapping
Now we replace x by $\varphi (x)$. We have:
$max_f=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$. But computing $<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$ is expensive and time consuming (in case $\varphi$ is a quartic polynomial). If there is a function k such that $K(x^{(i)}, x^{(j)})=<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$ we do not need to compute $\varphi$ anymore. We call K is kernel function. It defines similarity in the transformed space. Our problem becomes:
$max_f=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}K(x^{(i)}, x^{(j)})$. let 's see some examples:
The polynomial kernel:
$K(x,z)=(x^{T}z+c)^{2}\\ =\sum_{i,j=1}^{n}(x_{i}x_{j})(z_{i}z_{j})+\sum_{i=1}^{n}(\sqrt{2cx_{i}})(\sqrt{2cz_{i}})+c^{2}$
The Gaussian kernel:
$K(x,z)=exp(-\frac{\left \| x-z \right \|^{2}}{2\sigma ^{2}})$
You can see the computation is inexpensive comparing to using $\varphi$.
4.2 Validate Kernel
Given Kernel function K(x,z) how to check whether it is valid or not (there exists feature mapping function $\varphi$ such that $K(x^{(i)}, x^{(j)})=<\varphi(x^{(i)})^{T},\varphi(x^{(j)})>$).
Assume that we have m training sets ${x^{1},...,x^{m}}$ and square m-by-m matrix K with $K_{ij}=K(x^{(i)},x^{(j)})$. This matrix is called Kernel matrix. Moreover, we have:
$K_{ij}=\varphi (x^{(i)})^{T}\varphi (x^{(j)})=\varphi (x^{(j)})^{T}\varphi (x^{(i)})=K_{ji}$
$\Rightarrow$ K is symmetric.
Mercer theorem: For K to be a valid (Mercel) kernel, it is necessary and sufficient that the corresponding kernel matrix is symmetric positive semi-definite.
5. Regularization
Let 's look the figure below:
Figure: Left-relaxed margin and Right-much smaller margin
The left picture shows a 'relaxed' margin classifier. In the left picture, when a outlier is added in the upper-left region, it causes the resulting classifier has a smaller margin.
In order to cover for the case in the right picture, we reformulate the problem using $l_{1}$ regularization:
$min_{w,b,\gamma }=\frac{1}{2}\left \| w \right \|^{2}+C\sum_{i=1}^{m}\gamma _{i}$
subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1-\gamma _{i},i=1,...m\\ \gamma _{i}\geq 0,i=1,...m$
For training example that $\gamma _{i} > 0$, its margin is less than 1 and the cost function increases $C\gamma _{i}$.
The new Lagrangian:
$L(w,b,\lambda ,\gamma ,r )=\frac{1}{2}\left \| w \right \|^{2}+C\sum_{i=1}^{m}\gamma _{i}-\sum_{i=1}^{m}\lambda _{i}[y^{(i)}(w^{T}x+b)-1+\gamma _{i}]-\sum_{i=1}^{m}r_{i}\gamma _{i}$
$\lambda _{i}, r_{i}$ is Lagrange multipliers.
The new dual problem:
$max_{\lambda }W(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$
subject to: $0\leq \lambda _{i}\leq C, i=1,...,m\\ \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
The new KKT conditions:
$\lambda _{i}=0\Rightarrow y^{(i)}(w^{T}x+b)\geq 1\\ \lambda _{i}=C\Rightarrow y^{(i)}(w^{T}x+b)\leq 1\\ 0<\lambda _{i}<C\Rightarrow y^{(i)}(w^{T}x+b)= 1$
6. The SMO algorithm
The SMO algorithm was proposed by John C. Platt in 1998.
Features of the SMO:
- It is the fastest quadratic programming optimization algorithm, especially for linear SVM.
- It does not to store the kernel matrix in memory, since no matrix operations are involved
The dual problem:
$max_{\lambda }W(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$
subject to: $0\leq \lambda _{i}\leq C, i=1,...,m\\ \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
Notice the requirement:
$\sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$ (*)
At each step SMO chooses two elements $\lambda _{i}, \lambda _{j}$  to jointly optimize, find the optimal values for those two parameters given that all the others are fixed. The choice of the two points is determined by a heuristic.
Assume that two elements, $\lambda _{1}, \lambda _{2}$  that have been chosen for updating to improve the objective value and all the others are fixed. In order to compute the new values for these two parameters, we use (*);
$y^{(1)}\lambda _{1}^{(old)}+y^{(2)}\lambda _{2}^{(old)}=constant=y^{(1)}\lambda _{1}^{(new)}+y^{(2)}\lambda _{2}^{(new)}$ (***)
Combine with box constraint:
$0\leq \lambda _{i}\leq C, i=1,...,m$ (**)
We have in $(\lambda _{1}, \lambda _{2})$ space:
Figure: box constraint $\mathbf{\alpha =\lambda}$
The SMO compute $\lambda _{2}^{(new)}$ then using (***) to compute $\lambda _{1}^{(new)}$.
(**) and (***) give more restrictive constraint on $\lambda _{2}^{(new)}$
$U\leq \lambda _{2}^{(new)}\leq V$
if $y^{(1)}\neq y^{(2)}$:
$U=max\left \{ 0,\lambda _{2}^{(old)}-\lambda _{1}^{(old)} \right \}\\ V=min\left \{ C,C-\lambda _{1}^{(old)}+\lambda _{2}^{(old)} \right \}$
if $y^{(1)}= y^{(2)}$:
$U=max\left \{ 0,\lambda _{2}^{(old)}+\lambda _{1}^{(old)}-C\right \}\\ V=min\left \{ C,\lambda _{1}^{(old)}+\lambda _{2}^{(old)} \right \}$
The SMO gives the formula to compute $\lambda _{2}^{(new)}$
$\lambda _{2}^{(new,unclipped)}=\lambda _{2}^{(old)}+\frac{y^{(2)}\left \{ E_{2}^{(old)}- E_{1}^{(old)}\right \}}{\eta }$
where:
$E(i)=f(x^{(i)}-y^{(i)})=\left ( \sum_{i=1}^{m}\lambda _{j}y^{(j)}K_{ji}+b \right )-y^{(i)},i=1,...,m$
Apply constraint $U\leq \lambda _{2}^{(new)}\leq V$ we have:
$\lambda _{2}^{(new, clipped)}=\begin{cases} V & \text{ if } \lambda _{2}^{(new,unclipped)}>V \\ \lambda _{2}^{(new,unclipped)} & \text{ if } U\leq \lambda _{2}^{(new, unclipped)}\leq V \\ U & \text{ if } \lambda _{2}^{(new,unclipped)}<U \end{cases}$
and $\eta =-K_{11}-K_{22}+2K_{12}$
We can compute $\lambda _{1}^{(new)}$ by:
$\lambda _{1}^{(new)}=\lambda _{1}^{(old)}+y^{(1)}y^{(2)}\left ( \lambda _{2}^{(old)}-\lambda _{2}^{(new)} \right )$
to be continue ...