**1. Introduction**

This algorithm offers a different type of intuition from logistic regression. Assume that we have a data set in which green points represent positive training examples and brown points represent negative training examples. A line (**or hyperplane**) tries to separate these training examples.

**Figure: Separate training examples using a line (or hyperplane)**

**Figure: The optimal line is red line**

**2. Notation**

Let's define a binary classification problem with labels y and features x. We will use $y\in \{-1,1\}$, use $w$ instead of $[\theta_{1}...\theta_{n}]^{T}$, and $b$ instead of $\theta_{0}$. So:

$
y=h(x)=g(w^{T}x + b)\\
y=1 \text{ if } w^{T}x + b \geq 1\\
y=-1 \text{ if } w^{T}x + b \leq -1\\
\text{These can be combined into } y^{(i)}(w^{T}x^{(i)}+b)\\
\text{if } y^{(i)}(w^{T}x^{(i)}+b) \geq 1 \Rightarrow \text{ our prediction on this example is correct}\\
\text{Note: } w^{T}x + b=0 \text{ is the equation of the separating hyperplane.}
$
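As a quick illustration, the decision rule can be written in a few lines of Python. The values of $w$ and $b$ below are placeholders for illustration (they happen to be the ones derived for the toy example in Section 3):

```python
import numpy as np

def predict(w, b, x):
    # classify by the sign of the decision function w^T x + b
    return 1 if np.dot(w, x) + b >= 0 else -1

# placeholder parameters, for illustration only (derived later in Section 3)
w, b = np.array([1.0, 1.0]), -3.0
print(predict(w, b, np.array([2.0, 2.0])))   # 1
print(predict(w, b, np.array([1.0, 1.0])))   # -1
```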

We define support vectors as the data points that satisfy:

$w^{T}x+b=1 \text{ or } w^{T}x+b=-1$

They are the data points that just touch the boundary of the margin, circled below; there are 3 of them ($v_{1},v_{2},v_{3}$). $d$ is half of the 'width', i.e. the distance between the margin and the separating hyperplane. $(w,b)$ depends only on these support vectors.

**Figure: 3 support vectors**

**3. Problem**

$
\text{We define the hyperplane such that for training examples } (i=1,...,m) \text{ we have:}\\
w^{T}x^{(i)}+b\geq 1\text{ if }y^{(i)}= 1\\
w^{T}x^{(i)}+b\leq -1\text{ if }y^{(i)}= -1\\
\text{and } H_{1} \text{ and } H_{2} \text{ are the hyperplanes}\\
H_{1}:w^{T}x^{(i)}+b= 1\\
H_{2}:w^{T}x^{(i)}+b= -1\\
\text{Of course, the support vectors lie on these hyperplanes,}\\
\text{and}\\
d+\text{: the shortest distance to a positive example}\\
d-\text{: the shortest distance to a negative example}\\
\text{The margin of the separating hyperplane is } (d+) + (d-)\\
\text{Recall that the distance from a point } x^{(i)} \text{ to the hyperplane } w^{T}x+b =0\\
\text{is given by:}\\
\frac{|w^{T}x^{(i)}+b|}{\left \| w \right \|}\text{, which equals }\frac{1}{\left \| w \right \|}\text{ for a support vector.}
$

The total distance between $H_{1}, H_{2}$ is $\frac{2}{\left \| w \right \|}$.

Our problem is to maximize $\frac{2}{\left \| w \right \|}$ subject to the constraint that no data points lie between $H_{1}$ and $H_{2}$. That means:

$
w^{T}x^{(i)}+b\geq 1\text{ if }y^{(i)}= 1\\
w^{T}x^{(i)}+b\leq -1\text{ if }y^{(i)}= -1
$

Or we can convert it to a minimization problem:

Minimize: $\left \| w \right \|$ or, equivalently, $\frac{1}{2}\left \| w \right \|^{2}$

subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1$ (the Lagrange multipliers will satisfy $\lambda _{i}\geq 0$)

We can solve this problem using the KKT conditions. Let's work through an example with the very simple data set below.

**Figure: 2 data examples (2,2) and (1,1)**

Minimize: $\frac{1}{2}\left \| w \right \|^{2}$

subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1$

We have:

$
w=\begin{bmatrix}
w_{1}\\
w_{2}
\end{bmatrix},
x=\begin{bmatrix}
x_{1}\\
x_{2}
\end{bmatrix}
$

Substituting the data set into the constraints:

$
+1(2w_{1}+2w_{2}+b)\geq 1\\
-1(w_{1}+w_{2}+b)\geq 1\\
\text{or}\\
1-(2w_{1}+2w_{2}+b)\leq 0\\
1+(w_{1}+w_{2}+b)\leq 0
$

The Lagrangian:

$L=\frac{1}{2}w_{1}^{2}+\frac{1}{2}w_{2}^{2}+\lambda _{1}(1-(2w_{1}+2w_{2}+b))+\lambda _{2}(1+(w_{1}+w_{2}+b))$
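Rather than differentiating by hand, we can let a computer algebra system produce the stationarity conditions; a minimal sketch, assuming sympy is available:

```python
import sympy as sp

w1, w2, b, l1, l2 = sp.symbols('w1 w2 b lambda1 lambda2')
L = (sp.Rational(1, 2)*w1**2 + sp.Rational(1, 2)*w2**2
     + l1*(1 - (2*w1 + 2*w2 + b)) + l2*(1 + (w1 + w2 + b)))

# stationarity: dL/dw1 = dL/dw2 = dL/db = 0
for v in (w1, w2, b):
    print(sp.Eq(sp.diff(L, v), 0))
```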

Applying KKT we have:

$
w_{1}-2\lambda _{1}+\lambda _{2}=0\\
w_{2}-2\lambda _{1}+\lambda _{2}=0\\
-\lambda _{1}+\lambda _{2}=0\\
\lambda _{1}(1-(2w_{1}+2w_{2}+b))=0\\
\lambda _{2}(1+(w_{1}+w_{2}+b))=0\\
\lambda _{1}\geq 0,\lambda _{2}\geq 0
$

There are 4 cases to check ($\lambda_{i}=0$ or $\lambda_{i}>0$ for each multiplier). I will check only the case $\lambda _{1}> 0,\lambda _{2}> 0$; the solution is $w_{1}=w_{2}=1, b=-3$.

The separating hyperplane is: $x_{1}+x_{2}-3=0$

**Figure: Separating hyperplane**

$
(2,2)\Rightarrow y=2+2-3=1\\
(1,1)\Rightarrow y=1+1-3=-1
$
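We can verify this by solving the KKT system numerically; a small sketch using numpy, encoding exactly the active-constraint case above:

```python
import numpy as np

# Active case (lambda1 > 0, lambda2 > 0): both constraints hold with equality.
# Unknowns ordered as [w1, w2, b, lambda1, lambda2].
A = np.array([
    [1, 0, 0, -2, 1],   # dL/dw1 = 0
    [0, 1, 0, -2, 1],   # dL/dw2 = 0
    [0, 0, 0, -1, 1],   # dL/db  = 0
    [2, 2, 1,  0, 0],   # 2*w1 + 2*w2 + b = 1
    [1, 1, 1,  0, 0],   # w1 + w2 + b = -1
], dtype=float)
rhs = np.array([0, 0, 0, 1, -1], dtype=float)

w1, w2, b, l1, l2 = np.linalg.solve(A, rhs)
print(w1, w2, b, l1, l2)   # 1.0 1.0 -3.0 1.0 1.0
```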

If you are familiar with the **Python cvxopt package**, which is used for solving convex optimization problems, you can model our problem as below:

**Note**: The standard form of our problem in CVXOPT is:

$\min_{x}\frac{1}{2}x^{T}Px+q^{T}x$

subject to: $Gx\leq h$ and $Ax=b$

We can re-write our problem under new form:

$
\frac{1}{2}\begin{bmatrix}
w_{1}\\
w_{2}\\
b
\end{bmatrix}^{T}
\begin{bmatrix}
1.0 & 0.0 & 0.0\\
0.0 & 1.0 & 0.0\\
0.0 & 0.0 & 0.0
\end{bmatrix}
\begin{bmatrix}
w_{1}\\
w_{2}\\
b
\end{bmatrix}+\begin{bmatrix}
0.0\\
0.0\\
0.0
\end{bmatrix}^{T}
\begin{bmatrix}
w_{1}\\
w_{2}\\
b
\end{bmatrix}
$

subject to:

$
\begin{bmatrix}
-2.0 & -2.0 & -1.0\\
1.0 & 1.0 & 1.0
\end{bmatrix}
\begin{bmatrix}
w_{1}\\
w_{2}\\
b
\end{bmatrix}\leq \begin{bmatrix}
-1.0\\
-1.0
\end{bmatrix}
$

```python
from cvxopt import matrix, solvers
import numpy as np

# Quadratic term: only w1 and w2 are penalized; b has no quadratic cost
P = matrix(np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0]]), tc='d')
# No linear term
q = matrix(np.array([0.0, 0.0, 0.0]), tc='d')
# Inequality constraints G [w1, w2, b]^T <= h, as derived above
G = matrix(np.array([[-2.0, -2.0, -1.0],
                     [1.0, 1.0, 1.0]]), tc='d')
h = matrix(np.array([-1.0, -1.0]), tc='d')

sol = solvers.qp(P, q, G, h)
print(sol['x'])   # approximately [1, 1, -3]
```

**Figure: Using cvxopt to solve the problem**
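As a cross-check (not part of the original derivation), if scikit-learn is available, a linear-kernel SVC with a very large C approximates the hard-margin SVM and recovers the same hyperplane on this toy data set:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.0, 1.0]])
y = np.array([1, -1])

# a very large C approximates the hard-margin SVM
clf = SVC(kernel='linear', C=1e6).fit(X, y)
print(clf.coef_, clf.intercept_)   # approximately [[1. 1.]] and [-3.]
```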

**4. Kernels**

**4.1 About kernel**

We now know how to apply an SVM to a linearly separable problem. So how do we apply an SVM to non-linearly separable classification problems? Let's review our problem and change perspective.

We have the Lagrangian of the problem:

$L = \frac{1}{2}\left \| w \right \|^{2} + \sum_{i=1}^{m}\lambda_{i}(1-y^{(i)}(w^{T}x^{(i)}+b))$

We have:

$\bigtriangledown _{w}L=w-\sum_{i=1}^{m}\lambda _{i}x^{(i)}y^{(i)}=0$

and

$\bigtriangledown _{b}L=\sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$

Plugging these back into the Lagrangian:

$L=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}(x^{(i)})^{T}x^{(j)}$

we obtain the dual problem:

$\max_{\lambda}f(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$

subject to: $\lambda _{i}\geq 0,\ i=1,...,m \text{ and } \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$
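To make the primal-dual link concrete, we can recover $(w, b)$ from the dual multipliers of the toy example in Section 3, where $\lambda_{1} = \lambda_{2} = 1$; a minimal sketch using numpy:

```python
import numpy as np

X = np.array([[2.0, 2.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
lam = np.array([1.0, 1.0])     # dual multipliers found in Section 3

w = (lam * y) @ X              # from grad_w L = 0: w = sum_i lambda_i y_i x_i
b = y[0] - w @ X[0]            # any support vector gives b = y_s - w^T x_s
print(w, b)                    # [1. 1.] -3.0
```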

By changing perspective, we gain insight into the problem, and we are able to write the algorithm using only inner products between input feature vectors.

+ If two feature vectors $x^{(i)},x^{(j)}$ are completely dissimilar, their inner product is 0 and they do not contribute to $f$.

+ If two feature vectors $x^{(i)},x^{(j)}$ are completely alike (say, normalized so that their inner product is 1), we have 2 cases:

++ If both have the same label (+1 or -1), then $y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$ is always positive. This decreases $f$, so the algorithm downgrades similar feature vectors that make the same prediction.

++ If they have different labels (+1 and -1), then $y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$ is always negative. This increases $f$, so these points contribute to the solution.

**Note:**

- At the optimal point, the primal and dual solutions coincide (strong duality holds)

- The dual form is the one solved in practice by the SMO (sequential minimal optimization) algorithm, described later

Assume that we have a data set like the one below:

**Figure: non-linear data set**

We cannot separate this data set with a linear classifier. In order to solve this problem, we will reuse the idea from **linear regression**: mapping the features to a higher-dimensional space (e.g. $x^{2}$ or $x^{3}$) via a **feature mapping** function $\varphi$, so that our problem becomes a linear one.

**Figure: Feature mapping**
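A tiny sketch of this idea with assumed 1-D data: no single threshold separates the points below, but the mapping $\varphi(x) = (x, x^{2})$ makes them linearly separable:

```python
import numpy as np

# assumed toy data: outer points positive, inner points negative
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, -1, -1, 1])

phi = np.stack([x, x**2], axis=1)      # map to 2-D feature space (x, x^2)
# in the mapped space, the hyperplane x2 = 2.5 separates the classes:
print(np.sign(phi[:, 1] - 2.5) == y)   # [ True  True  True  True ]
```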

Now we replace x by $ \varphi (x) $. We have:




$\max_{\lambda}f(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<\varphi(x^{(i)}),\varphi(x^{(j)})>$. But computing $<\varphi(x^{(i)}),\varphi(x^{(j)})>$ is expensive and time-consuming (for example when $\varphi$ is a quartic polynomial). If there is a function $K$ such that $K(x^{(i)}, x^{(j)})=<\varphi(x^{(i)}),\varphi(x^{(j)})>$, we do not need to compute $\varphi$ anymore. We call $K$ a **kernel function**. It defines similarity in the transformed space. Our problem becomes:

$\max_{\lambda}f(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}K(x^{(i)}, x^{(j)})$

Let's see some examples:

**The polynomial kernel:**

$
K(x,z)=(x^{T}z+c)^{2}
=\sum_{i,j=1}^{n}(x_{i}x_{j})(z_{i}z_{j})+\sum_{i=1}^{n}(\sqrt{2c}\,x_{i})(\sqrt{2c}\,z_{i})+c^{2}
$

**The Gaussian kernel:** $K(x,z)=\exp(-\frac{\left \| x-z \right \|^{2}}{2\sigma ^{2}})$

You can see the computation is inexpensive compared to computing $\varphi$ explicitly.
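We can check the identity numerically for $n=2$: the explicit feature map below (an assumed construction for illustration) gives the same value as the kernel, while requiring more computation:

```python
import numpy as np

def phi(x, c):
    # explicit feature map for n = 2 (an assumed construction for illustration)
    x1, x2 = x
    return np.array([x1*x1, x1*x2, x2*x1, x2*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

x, z, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
kernel_value = (x @ z + c) ** 2        # O(n) work
mapped_value = phi(x, c) @ phi(z, c)   # O(n^2) work
print(kernel_value, mapped_value)      # both 4.0
```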

**4.2 Validate Kernel**

Given a kernel function $K(x,z)$, how do we check whether it is valid, i.e. whether there exists a feature mapping function $\varphi$ such that $K(x^{(i)}, x^{(j)})=<\varphi(x^{(i)}),\varphi(x^{(j)})>$?

Assume that we have m training examples $\{x^{(1)},...,x^{(m)}\}$ and a square m-by-m matrix $K$ with $K_{ij}=K(x^{(i)},x^{(j)})$. This matrix is called the kernel matrix. Moreover, we have:

$K_{ij}=\varphi (x^{(i)})^{T}\varphi (x^{(j)})=\varphi (x^{(j)})^{T}\varphi (x^{(i)})=K_{ji}$

$\Rightarrow $ K is symmetric.

**Mercer theorem**: For $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that the corresponding kernel matrix is symmetric positive semi-definite.
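A quick numerical sanity check of Mercer's condition for the Gaussian kernel, using assumed random data:

```python
import numpy as np

# Build the Gaussian kernel matrix for some (assumed) random points and
# check Mercer's condition: K symmetric and positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.allclose(K, K.T))                    # True: symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: eigenvalues >= 0
```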

**5. Regularization**

Let's look at the figure below:

**Figure: Left: relaxed margin; Right: much smaller margin**

The left picture shows a 'relaxed' margin classifier. When an outlier is added in the upper-left region (the right picture), the resulting classifier has a much smaller margin. In order to cover the case in the right picture, we reformulate the problem using $l_{1}$ regularization:

$\min_{w,b,\gamma }\frac{1}{2}\left \| w \right \|^{2}+C\sum_{i=1}^{m}\gamma _{i}$

subject to: $y^{(i)}(w^{T}x^{(i)}+b)\geq 1-\gamma _{i} \text{ and } \gamma _{i}\geq 0,\ i=1,...,m$

For a training example with $\gamma _{i} > 0$, the margin is less than 1 and the cost function increases by $C\gamma _{i}$.
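To make the role of the slack variables concrete, here is a small sketch with assumed data (the two points from Section 3 plus one 'outlier'), computing $\gamma_{i} = \max(0, 1-y^{(i)}(w^{T}x^{(i)}+b))$ and the resulting cost:

```python
import numpy as np

# Assumed data: the two points from Section 3 plus one outlier
w, b, C = np.array([1.0, 1.0]), -3.0, 10.0
X = np.array([[2.0, 2.0], [1.0, 1.0], [2.5, 0.2]])
y = np.array([1.0, -1.0, -1.0])

margins = y * (X @ w + b)                 # y_i (w^T x_i + b)
gamma = np.maximum(0.0, 1.0 - margins)    # slack: 0 unless margin < 1
cost = 0.5 * w @ w + C * gamma.sum()
print(gamma, cost)                        # gamma = [0, 0, 0.7], cost = 8.0
```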

The new Lagrangian:

$L(w,b,\lambda ,\gamma ,r )=\frac{1}{2}\left \| w \right \|^{2}+C\sum_{i=1}^{m}\gamma _{i}-\sum_{i=1}^{m}\lambda _{i}[y^{(i)}(w^{T}x^{(i)}+b)-1+\gamma _{i}]-\sum_{i=1}^{m}r_{i}\gamma _{i}$

where $\lambda _{i}, r_{i}$ are the Lagrange multipliers.

The new dual problem:

$\max_{\lambda }W(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$

subject to: $0\leq \lambda _{i}\leq C,\ i=1,...,m \text{ and } \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$

The new KKT conditions:

$
\lambda _{i}=0\Rightarrow y^{(i)}(w^{T}x^{(i)}+b)\geq 1\\
\lambda _{i}=C\Rightarrow y^{(i)}(w^{T}x^{(i)}+b)\leq 1\\
0<\lambda _{i}<C\Rightarrow y^{(i)}(w^{T}x^{(i)}+b)= 1
$

**6. The SMO algorithm**

The SMO algorithm was proposed by John C. Platt in 1998.

Features of the SMO:

- It is one of the fastest quadratic programming optimization algorithms, especially for linear SVMs

- It does not need to store the kernel matrix in memory, since no large matrix operations are involved

The dual problem:

$\max_{\lambda }W(\lambda)=\sum_{i=1}^{m}\lambda _{i}-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\lambda _{i}\lambda _{j}<x^{(i)},x^{(j)}>$

subject to: $0\leq \lambda _{i}\leq C,\ i=1,...,m \text{ and } \sum_{i=1}^{m}\lambda _{i}y^{(i)}=0$


Notice the requirement:

$\sum_{i=1}^{m}\lambda _{i}y^{(i)}=0 $ (*)

At each step, SMO chooses two elements $\lambda _{i}, \lambda _{j}$ to jointly optimize and finds the optimal values for those two parameters while **all the others are fixed**.

Assume that two elements $\lambda _{1}, \lambda _{2}$ have been chosen for updating to improve the objective value, with all the others fixed. The choice of the two points is determined by a heuristic. In order to compute the new values for these two parameters, we use (*):

$y^{(1)}\lambda _{1}^{(old)}+y^{(2)}\lambda _{2}^{(old)}=constant=y^{(1)}\lambda _{1}^{(new)}+y^{(2)}\lambda _{2}^{(new)}$ (***)

Combining this with the box constraint:

$0\leq \lambda _{i}\leq C, i=1,...,m$ (**)

In $(\lambda _{1}, \lambda _{2})$ space we have:

**Figure: box constraint $\mathbf{\alpha =\lambda} $**

SMO computes $\lambda _{2}^{(new)}$ first, then uses (***) to compute $\lambda _{1}^{(new)}$.

(**) and (***) give a more restrictive constraint on $\lambda _{2}^{(new)}$:

$U\leq \lambda _{2}^{(new)}\leq V$

**if** $y^{(1)}\neq y^{(2)}$:

$U=max\left \{ 0,\lambda _{2}^{(old)}-\lambda _{1}^{(old)} \right \}\\
V=min\left \{ C,C-\lambda _{1}^{(old)}+\lambda _{2}^{(old)} \right \}$


**if** $y^{(1)}= y^{(2)}$:

$U=max\left \{ 0,\lambda _{2}^{(old)}+\lambda _{1}^{(old)}-C\right \}\\
V=min\left \{ C,\lambda _{1}^{(old)}+\lambda _{2}^{(old)} \right \}$


SMO gives the following formula to compute $\lambda _{2}^{(new)}$:

$\lambda _{2}^{(new,unclipped)}=\lambda _{2}^{(old)}+\frac{y^{(2)}\left \{ E_{2}^{(old)}- E_{1}^{(old)}\right \}}{\eta }$

where:

$E_{i}=f(x^{(i)})-y^{(i)}=\left ( \sum_{j=1}^{m}\lambda _{j}y^{(j)}K_{ji}+b \right )-y^{(i)},\ i=1,...,m$

Applying the constraint $U\leq \lambda _{2}^{(new)}\leq V$, we have:

$\lambda _{2}^{(new, clipped)}=\begin{cases}
V & \text{ if } \lambda _{2}^{(new,unclipped)}>V \\
\lambda _{2}^{(new,unclipped)} & \text{ if } U\leq \lambda _{2}^{(new, unclipped)}\leq V \\
U & \text{ if } \lambda _{2}^{(new,unclipped)}<U
\end{cases}$


and $\eta =-K_{11}-K_{22}+2K_{12}$

We can compute $\lambda _{1}^{(new)}$ by:

$\lambda _{1}^{(new)}=\lambda _{1}^{(old)}+y^{(1)}y^{(2)}\left ( \lambda _{2}^{(old)}-\lambda _{2}^{(new)} \right )$
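Putting the update rules together, here is a minimal simplified-SMO sketch for a linear kernel. The pair selection is random rather than Platt's heuristic, and the threshold update follows the standard simplified-SMO recipe (which this post has not yet covered), so this is illustrative rather than a faithful implementation of Platt's algorithm:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10, seed=0):
    """Minimal SMO sketch using the U/V bounds and lambda updates above."""
    m = X.shape[0]
    K = X @ X.T                              # linear kernel matrix
    lam, b = np.zeros(m), 0.0
    rng = np.random.default_rng(seed)
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = (lam * y) @ K[:, i] + b - y[i]
            if (y[i]*E_i < -tol and lam[i] < C) or (y[i]*E_i > tol and lam[i] > 0):
                j = rng.choice([k for k in range(m) if k != i])   # random pair
                E_j = (lam * y) @ K[:, j] + b - y[j]
                li_old, lj_old = lam[i], lam[j]
                # bounds U, V on lam[j]; i plays the role of "1", j of "2"
                if y[i] != y[j]:
                    U, V = max(0.0, lj_old - li_old), min(C, C - li_old + lj_old)
                else:
                    U, V = max(0.0, li_old + lj_old - C), min(C, li_old + lj_old)
                eta = -K[i, i] - K[j, j] + 2*K[i, j]   # as defined above
                if U == V or eta >= 0:
                    continue
                # unclipped update, then clip to [U, V]
                lam[j] = np.clip(lj_old + y[j]*(E_j - E_i)/eta, U, V)
                if abs(lam[j] - lj_old) < 1e-5:
                    continue
                lam[i] = li_old + y[i]*y[j]*(lj_old - lam[j])
                # threshold update (standard simplified-SMO recipe)
                b1 = b - E_i - y[i]*(lam[i]-li_old)*K[i, i] - y[j]*(lam[j]-lj_old)*K[i, j]
                b2 = b - E_j - y[i]*(lam[i]-li_old)*K[i, j] - y[j]*(lam[j]-lj_old)*K[j, j]
                b = b1 if 0 < lam[i] < C else b2 if 0 < lam[j] < C else (b1 + b2)/2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (lam * y) @ X                        # recover w for the linear kernel
    return w, b, lam

# toy data from Section 3; expect w close to [1, 1] and b close to -3
X = np.array([[2.0, 2.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
print(simplified_smo(X, y, C=100.0))
```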

to be continued ...
