SVM
Support Vector Machine (SVM)
Proposed by Vapnik (USSR).
Suitable for prediction with small sample sizes.
0. No Free Lunch Theorem
If there is no prior assumption about the feature space, all algorithms have the same average performance.
1. SVM introduction
(Image from Baidu Baike)
There are infinitely many lines that separate a linearly separable sample space. The line that maximizes the margin $d$ (with $d_1=d_2=\frac{d}{2}$) is the best one. The samples on the dotted lines in the figure are called support vectors, because they are what actually decides the position of the solid separating line.
2. How to get the line?
It is hard to compute the slope and intercept of the line directly, so instead we maximize $d$. To maximize $d$, we need to solve the following problem. (I will explain why below; it is actually a quadratic programming problem in convex optimization theory.)
$$
\left\{\begin{array}{l}
\min\frac{1}{2}\|\boldsymbol{\omega}\|^2\\
s.t.\ (subject \ to) \ \ \ y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]\geq1
\end{array}\right.
$$
Facts:
$\boldsymbol\omega^T\boldsymbol{x}+b=0$ is a hyperplane in a high-dimensional space, and can be denoted by $(\boldsymbol\omega, b)$.
$\boldsymbol\omega^T\boldsymbol{x}+b=0$ and $a\boldsymbol\omega^T\boldsymbol{x}+ab=0\ (a\in R^+)$ are the same hyperplane, i.e. $(\boldsymbol\omega, b)=(a\boldsymbol\omega, ab)$.
The distance between a point $\boldsymbol{x_0}$ and the hyperplane $(\boldsymbol\omega, b)$ is $d=\frac{|\boldsymbol\omega^T\boldsymbol{x_0}+b|}{\|\boldsymbol\omega\|}$. (Just like the distance between a point and a line in the xy-plane.)
Now let's start the analysis:
- For any $(\boldsymbol\omega_0, b_0)$, there exists an $a$ such that $|a\boldsymbol\omega_0^T\boldsymbol{x_0}+ab_0|=1$ for the support vectors $\boldsymbol{x_0}$. After this rescaling, the distance from a support vector to the hyperplane is $\frac{1}{\|\boldsymbol\omega\|}$, so maximizing the margin is the same as minimizing $\frac{1}{2}\|\boldsymbol{\omega}\|^2$. We multiply $\|\boldsymbol{\omega}\|^2$ by $\frac{1}{2}$ only for convenience when taking derivatives.
- For a non-support vector, $\frac{|\boldsymbol\omega^T\boldsymbol{x_i}+b|}{\|\boldsymbol\omega\|}>\frac{1}{\|\boldsymbol\omega\|}$, so $|\boldsymbol\omega^T\boldsymbol{x_i}+b|>1$. Now we need to drop the absolute value sign. Notice that $y_i$ has the same sign as $\boldsymbol\omega^T\boldsymbol{x_i}+b$, so $y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]>1$.
- For a support vector, $y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]=1$.
- Finally, we get the constraint $y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]\geq1$.
P.S. In fact, you could require $y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]$ to be no less than any positive constant; 1 is simply the most convenient choice.
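To make the optimization problem concrete, here is a minimal sketch (not part of the derivation) that fits a linear SVM with a very large $C$ so it behaves almost like the hard-margin problem above; the toy data and all parameter values are made up for illustration.

```python
# Minimal sketch: approximate the hard-margin SVM with a very large C
# and check that support vectors satisfy y_i (w^T x_i + b) ≈ 1.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],          # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-0.5, -2.0]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # huge C ≈ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("support vectors:\n", clf.support_vectors_)
print("margin d = 2/||w|| =", 2 / np.linalg.norm(w))
print("y_i (w^T x_i + b) on support vectors:",
      y[clf.support_] * (clf.support_vectors_ @ w + b))
```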
3. How do we use SVM to deal with non-separable data?
$$
\left\{\begin{array}{l}
\min\frac{1}{2}\|\boldsymbol{\omega}\|^2+C\sum\limits_{i=1}^N\xi_i\\
s.t.\ (subject \ to)
\left\{\begin{array}{l}
y_i[\boldsymbol\omega^T\boldsymbol{x_i}+b]\geq1-\xi_i\\
\xi_i\geq0
\end{array}\right.
\end{array}\right.
$$
There are two ideas for dealing with non-separable data: allow a few violations with slack variables $\xi_i$, and map the data into a higher-dimensional space where a separating hyperplane is easier to find.
We introduce $\xi_i$ to tolerate such violations. At the same time, we have to make sure $\xi_i$ does not get too big, so we add a regularization term to the objective (regularization terms are very common in machine learning). $C$ is a parameter we need to set; in practice, we usually choose an upper bound, a lower bound, and a step size, and search for the best $C$, as in the sketch below.
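As a hedged sketch of that search (the bounds, step, and dataset below are arbitrary choices, not part of the original text), one can scan $C$ on a log grid with cross-validation:

```python
# Sketch of picking C by scanning a grid and cross-validating.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": np.logspace(-2, 3, 6)}   # lower bound 1e-2, upper bound 1e3
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```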
One of the most important differences between SVM and other algorithms is how they deal with non-linearly separable data. Other algorithms try to use circles or rectangles to find the boundary; SVM tries to find a linear boundary in a higher-dimensional space.
$$
\boldsymbol{X}\stackrel{\phi}{\longrightarrow}\phi(\boldsymbol{X})
$$
$\phi()$ is the mapping function.
example:
$$
\boldsymbol{X} = \begin{bmatrix} a \\ b \end{bmatrix}\to\phi(\boldsymbol{X}) = \begin{bmatrix} a^2 \\ b^2 \\ a \\ b \\ ab \end{bmatrix}
$$
The chance of being linearly separable grows as the dimension grows, so in an infinite-dimensional space any finite data set (with distinct points) becomes linearly separable.
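The map in the example above can be written out explicitly; this little sketch (the function name `phi` is just for illustration) computes $\phi(\boldsymbol{X})$ and the inner product $\phi(x_1)^T\phi(x_2)$ directly, which is exactly the computation the kernel trick below lets us avoid.

```python
# Explicit feature map from the example: [a, b] -> [a^2, b^2, a, b, ab].
import numpy as np

def phi(x):
    a, b = x
    return np.array([a * a, b * b, a, b, a * b])

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print("phi(x1) =", phi(x1))
print("phi(x1)^T phi(x2) =", phi(x1) @ phi(x2))  # the kernel trick computes this without forming phi
```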
Kernel Function
We don’t have to know the explicit expression of $\phi()$. Instead, we introduce the kernel function $K(x_1, x_2) = \phi(x_1)^T\phi(x_2)$ to express the mapping implicitly.
Here are some common kernel functions:
- Gaussian kernel function: $K(x_1, x_2)=\exp(-\frac{\|x_1-x_2\|^2}{2\sigma^2})$
- Polynomial kernel function: $K(x_1, x_2)=(\gamma x_1^Tx_2+c)^n$
- Linear kernel function (no kernel function): $K(x_1, x_2)=x_1^Tx_2$
- Sigmoid kernel function: $K(x_1, x_2)=\tanh(\gamma x_1^Tx_2+c)$
- Laplace kernel function: $K(x_1, x_2) = \exp(-\frac{\|x_1-x_2\|}{\sigma})$
A kernel function must satisfy the following rules:
① Symmetry: $K(x_1, x_2) = K(x_2, x_1)$
② Positive semi-definiteness:
$$
\forall C_i\ (coefficients),\ \boldsymbol{x_i}\ (vectors): \ \sum\limits_{i=1}^N\sum\limits_{j=1}^NC_iC_jK( \boldsymbol{x_i}, \boldsymbol{x_j})\geq0
$$
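A quick numerical sanity check of these two rules for the Gaussian kernel (the data, size, and $\sigma$ below are arbitrary): the Gram matrix should be symmetric and its eigenvalues non-negative up to rounding error.

```python
# Check symmetry and positive semi-definiteness of a Gaussian kernel Gram matrix.
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # should be >= 0 up to rounding
```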
4. Use the primal problem and the dual problem to avoid $\phi(x)$
What are the primal problem and the dual problem? Check this blog.
Primal problem:
$$
\left\{\begin{array}{l}
minimize\ f(\boldsymbol\omega)\\
s.t.\left\{\begin{array}{l}
g_i(\boldsymbol\omega)\leq0 \ (i=1\sim K)\\
h_i(\boldsymbol\omega) = 0 \ (i=1\sim N)
\end{array}\right.
\end{array}\right.
$$
Dual problem:
$$
\left\{\begin{array}{l}
\max\ \Theta(\boldsymbol\alpha, \boldsymbol\beta) = \min\limits_{all \ \boldsymbol\omega}{L(\boldsymbol\omega, \boldsymbol\alpha, \boldsymbol\beta)}\\
s.t. \ \alpha_i \geq0
\end{array}\right.
$$
where $L(\boldsymbol\omega, \boldsymbol\alpha, \boldsymbol\beta) = f(\boldsymbol\omega) + \sum\limits_{i=1}^K\alpha_ig_i(\boldsymbol\omega) + \sum\limits_{i=1}^N\beta_ih_i(\boldsymbol\omega)$ is the Lagrangian.
We need to find:
$$
\left\{\begin{array}{l}
\min\frac{1}{2}\|\boldsymbol{\omega}\|^2+C\sum\limits_{i=1}^N\xi_i\\
s.t.\left\{\begin{array}{l}
y_i[\boldsymbol\omega^T\phi(\boldsymbol{x_i})+b]\geq1-\xi_i\\
\xi_i\geq0
\end{array}\right.
\end{array}\right.
$$
Treating it as the primal problem, we can get the dual problem:
$$
\left\{\begin{array}{l}
\Theta(\boldsymbol\alpha, \boldsymbol\beta) = \min\limits_{all \ \boldsymbol\omega, \ \xi_i, \ b} \{ \frac{1}{2}\|\boldsymbol{\omega}\|^2 + C \sum\limits_{i=1}^N\xi_i - \sum\limits_{i=1}^N\beta_i\xi_i + \sum\limits_{i=1}^N\alpha_i[1 - \xi_i - y_i\boldsymbol\omega^T \phi(\boldsymbol{x_i}) - y_ib] \}\\
s.t. \ \alpha_i \geq0,\ \beta_i \geq 0
\end{array}\right.
$$
In this correspondence, the generic $\boldsymbol\omega$ becomes $(\boldsymbol\omega,\xi_i,b)$, the generic $\boldsymbol\alpha$ becomes $(\alpha_i,\beta_i)$, and the generic $\boldsymbol\beta$ is empty because there are no equality constraints.
(In the generic problem, $\alpha$ controls the inequality constraints and $\beta$ controls the equality constraints.)
Let’s start!
$$
L=\frac{1}{2}|\boldsymbol{\omega}|^2 - C \sum\limits_{i=1}^N\xi_i + \sum\limits_{i=1}^N\beta_i\xi_i + \sum\limits_{i=1}^N\alpha_i[1 + \xi_i - y_i\boldsymbol\omega^T \phi(x_i) - y_ib] \
\left{\begin{array}{l}
\frac{\partial L}{\partial\omega}=0\
\frac{\partial L}{\partial\xi_i}=0\
\frac{\partial L}{\partial b}=0
\end{array}\right.
$$
Solve these three equations and we get:
$$
\left\{\begin{array}{l}
\boldsymbol\omega=\sum\limits_{i=1}^N\alpha_iy_i\phi(\boldsymbol{x_i})\\
\alpha_i + \beta_i = C\\
\sum\limits_{i=1}^N\alpha_iy_i=0
\end{array}\right.
$$
Using these three equations we can compute $\Theta(\boldsymbol\alpha)$. Remember what we want? We want to use $K(x_i,x_j)$ to replace $\phi(x_i)$.
$$
L_{min}=\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega} + \sum\limits_{i=1}^N\alpha_i -\sum\limits_{i=1}^N\alpha_iy_i\boldsymbol\omega^T \phi(\boldsymbol{x_i})\\
L_{min}= \sum\limits_{i=1}^N\alpha_i + \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_j\phi(\boldsymbol{x_i})^T\phi(\boldsymbol{x_j}) - \sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_j\phi(\boldsymbol{x_j})^T\phi(\boldsymbol{x_i})\\
K(x_i,x_j) = \phi(\boldsymbol{x_i})^T\phi(\boldsymbol{x_j})\\
L_{min} = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j)
$$
Finally, we get our optimization target:
$$
\left\{\begin{array}{l}
\max\Theta(\boldsymbol\alpha) = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j)\\
s.t. \ 0\leq\alpha_i \leq C, \ \sum\limits_{i=1}^N\alpha_i y_i = 0
\end{array}\right.
$$
This is also a convex optimization problem, but now we can solve it (we know what $K$ is!). There are different algorithms for solving convex optimization problems; we won't discuss them here. SMO is one of the most common choices.
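To make the dual concrete, here is a rough sketch that solves it with a generic constrained solver (scipy's SLSQP) instead of SMO; this is only for illustration, the data and kernel parameters are arbitrary, and a real implementation would use SMO or a dedicated QP solver.

```python
# Solve the dual: max sum(alpha) - 1/2 alpha^T (y y^T * K) alpha,
# s.t. 0 <= alpha_i <= C and sum(alpha_i * y_i) = 0.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1.0, (15, 2)), rng.normal(-1.5, 1.0, (15, 2))])
y = np.array([1.0] * 15 + [-1.0] * 15)
C, sigma = 1.0, 1.0

sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dist / (2 * sigma ** 2))          # Gaussian kernel Gram matrix
Q = K * np.outer(y, y)

res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),   # minimize -Theta(alpha)
               np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
print("support vectors (alpha_i > 1e-6):", int((alpha > 1e-6).sum()))
```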
Determine $\boldsymbol\omega$ and $b$
$\boldsymbol\omega$:
Remember how we predict $y$?
if $\boldsymbol\omega^T\phi(\boldsymbol{x})+b\geq0$, then $y = 1$
if $\boldsymbol\omega^T\phi(\boldsymbol{x})+b<0$, then $y = -1$
Actually, we don't need the exact value of $\boldsymbol\omega$ as long as we can compute $\boldsymbol\omega^T\phi(\boldsymbol{x})$:
$\boldsymbol\omega^T\phi(\boldsymbol{x}) = \sum\limits_{i=1}^N\alpha_iy_iK(\boldsymbol{x_i},\boldsymbol{x})$
$b$:
We need to use the KKT condition (complementary slackness): $\forall i=1\sim K, \ \alpha_i=0 \ or \ g_i(\boldsymbol\omega) = 0$. For our problem, this gives for each $i$:
$$
\alpha_i = 0 \ \ or \ \ 1 - \xi_i - y_i\boldsymbol\omega^T \phi(\boldsymbol{x_i}) - y_ib = 0\\
\beta_i = 0 \ \ or \ \ \xi_i = 0
$$
Choose any $i$ with $0<\alpha_i<C$ (so $\beta_i = C-\alpha_i>0$); then:
$$
\left\{\begin{array}{l}
\xi_i = 0\\
1 - y_i\boldsymbol\omega^T \phi(\boldsymbol{x_i}) - y_ib = 0
\end{array}\right.\\
b = \frac{1 - y_i\sum\limits_{j=1}^N\alpha_jy_jK(x_j,x_i)}{y_i}
$$
In practice, we often choose several such $\alpha_i$ and take the mean of the resulting $b$ values as the final parameter.
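Continuing the dual-solver sketch above (this snippet reuses `alpha`, `K`, `y`, and `C` from that sketch, so it is not standalone), $b$ can be recovered by averaging over the indices with $0<\alpha_i<C$:

```python
# Average b over the alphas strictly between 0 and C (these have xi_i = 0).
import numpy as np

margin_idx = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
b_values = [(1 - y[i] * ((alpha * y) @ K[:, i])) / y[i] for i in margin_idx]
b = float(np.mean(b_values))
print("b =", b)
```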
5. Summary
① Train the model
- input $\{(x_i,y_i)\}_{i=1\sim N}$
- solve the optimization problem (SMO, etc.)
$$
\left\{\begin{array}{l}
\max\Theta(\boldsymbol\alpha) = \sum\limits_{i=1}^N\alpha_i - \frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jK(x_i,x_j)\\
s.t. \ 0\leq\alpha_i \leq C, \ \sum\limits_{i=1}^N\alpha_i y_i = 0
\end{array}\right.
$$
- evaluate $b$
② Test the model
- input $x$
$$
\left\{\begin{array}{l}
if \ \sum\limits_{i=1}^N\alpha_iy_iK(x_i,x)+b\geq0, \ then \ y = 1\\
if \ \sum\limits_{i=1}^N\alpha_iy_iK(x_i,x)+b<0, \ then \ y = -1
\end{array}\right.
$$
- output $y$
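As a hedged, self-contained check of this decision rule using scikit-learn (the dataset and $\gamma$ below are arbitrary): for a fitted `SVC`, `dual_coef_` stores $\alpha_iy_i$ for the support vectors, so $\sum_i\alpha_iy_iK(x_i,x)+b$ reproduces `decision_function`.

```python
# Verify that sum_i alpha_i y_i K(x_i, x) + b matches sklearn's decision_function.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1}
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_test = X[:5]
K_test = rbf_kernel(clf.support_vectors_, x_test, gamma=gamma)
manual = (clf.dual_coef_ @ K_test + clf.intercept_).ravel()   # sum alpha_i y_i K + b
print("matches decision_function:", np.allclose(manual, clf.decision_function(x_test)))
print("predicted y:", np.where(manual >= 0, 1, -1))
```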
6. Code
(from the Internet)
```python
from sklearn import svm
```
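Below is a minimal, hypothetical usage sketch of `sklearn.svm` consistent with the train/test steps in the summary above; the dataset and hyperparameters are arbitrary choices, not the original snippet.

```python
# Train an SVM with a Gaussian (RBF) kernel and test it, following the summary.
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = svm.SVC(kernel="rbf", C=1.0, gamma="scale")   # step 1: train the model
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # step 2: test the model
```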