
Chapter 6

Support Vector Machines

6.1. Overview

The concept underlying the Support Vector Machine (SVM) is to find a "plane" that can separate samples belonging to different categories. The most important concepts in SVM are the margin, the dual problem, and the kernel trick (kernel methods). The concept of the margin is where the support vectors originate. The original SVM is defined with a hard margin, while a soft margin enables SVMs to handle problems with imperfect data: data with noise and mislabeled points. The dual problem is an optimization technique that facilitates the search for solutions to SVM problems. Kernel methods generalize the basic SVM by introducing a kernel function so that a hyperplane can be used to separate categories of data points that cannot be separated by linear boundaries (planes). The kernel trick allows the SVM to handle nonlinear problems by mapping the data into a higher-dimensional space, in which the problem becomes linear.
This chapter starts the tour of SVM with an introduction to the basic SVM. The idea of SVM will be explained using mathematical functions and graphical illustrations. We will see how to derive the optimization formulation of a machine learning problem using SVM. To facilitate the solution, the conversion of this original formulation of the optimization problem into a dual problem will be explained. Based on that, we will show how to generalize the basic SVM for linear problems via kernel functions to obtain nonlinear SVMs. After that, the soft margin, as a common technique for addressing overfitting issues in SVM applications, will be introduced. Like the basic SVM (with hard margin), the introduction to the SVM with soft margin will also include formulations of the original optimization problem and the dual problem. Finally, additional topics, including the SMO algorithm and the use of SVM for multi-class classification and regression, will be briefly discussed.

6.2. Basics of SVM: Hard Margin SVM

6.2.1. Basic Formulation

Let us first take a look at typical data in a binary classification problem to see how the SVM [55] can be constructed. In a space of samples, as shown in Fig. 6.1, the plane separating different categories of data points can be described using the following linear function:
\begin{equation*}
\vec{w}^{T} \cdot \vec{x}+b=0 \tag{6.1}
\end{equation*}
where $\vec{w}$ is the normal vector of the plane, which specifies the direction of the plane, and $b$ is the intercept, which determines the distance between the plane and the origin.
In this space, the distance from any point $\vec{x}$ to the plane $(\vec{w}, b)$ is
\begin{equation*}
r=\frac{\left|\vec{w}^{T} \cdot \vec{x}+b\right|}{\|\vec{w}\|} \tag{6.2}
\end{equation*}
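As a quick numerical check of Eq. 6.2, the minimal sketch below computes the distance from a point to a plane with numpy; the values of $\vec{w}$, $b$, and $\vec{x}$ are arbitrary and chosen only for illustration.

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector of the plane (illustrative values)
b = -12.0                  # intercept (illustrative value)
x = np.array([4.0, 5.0])   # an arbitrary point

# Eq. 6.2: r = |w^T x + b| / ||w||
r = abs(w @ x + b) / np.linalg.norm(w)
print(r)  # |12 + 20 - 12| / 5 = 4.0
```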
The magnitudes of $\vec{w}$ and $b$ can be adjusted by multiplying them with the same constant so that the two margin lines (also called sides or boundaries) in Fig. 6.1 have the functions of
\begin{equation*}
\left\{\begin{array}{l}
\vec{w}^{T} \cdot \vec{x}+b=+1 \\
\vec{w}^{T} \cdot \vec{x}+b=-1
\end{array}\right. \tag{6.3}
\end{equation*}
It is noted that, in SVM, "margin" usually refers to the distance between the separating hyperplane and any of the boundaries of the separation zone or the distance between the two boundaries. However, it is also seen that "margin" is used to refer to the separation zone. To avoid confusion, we reserve the term "margin" for the distance and use "margin area" for the separation zone. The two boundaries of the separation zone, which are also called positive and negative hyperplanes in some literature, are called margin lines or margin boundaries.
Figure 6.1: Conceptual illustration of SVM
These two margin boundaries are the two (dashed) lines that are parallel to the separating hyperplane and pass through the points closest to the hyperplane. The points on these boundaries are the support vectors. Therefore, the data points that lie outside the margin area should satisfy the following relations:
\begin{equation*}
\left\{\begin{array}{l}
\vec{w}^{T} \cdot \vec{x}+b \geqslant+1, \quad y=+1 \\
\vec{w}^{T} \cdot \vec{x}+b \leqslant-1, \quad y=-1
\end{array}\right. \tag{6.4}
\end{equation*}
where $y=+1$ indicates that the point is labeled $+1$, and $y=-1$ that it is labeled $-1$.
The separating hyperplane, which is a straight line in this 2D case, is in the middle of the margin area. Accordingly, the margin, or the width of the margin area, is $\frac{2}{\|\vec{w}\|}$. The two regions on the two sides of the margin area mark the two categories. In a binary classification task using the SVM, the two categories are generally labeled as $+1$ and $-1$ (for $y_{i}$). Based on the definitions of the margin and label, we can see that valid points, which are on the two sides of the margin area (including points on the margin boundaries), must satisfy the following condition:
\begin{equation*}
y_{i} \cdot\left(\vec{w}^{T} \cdot \vec{x}_{i}+b\right) \geqslant 1, \quad i=1,2, \cdots, I \tag{6.5}
\end{equation*}
The above equation can be written in a more general array format as
\begin{equation*}
\vec{y} \odot\left(\bar{X} \cdot \vec{w}+b\right) \geqslant \overrightarrow{1} \tag{6.6}
\end{equation*}
where $\odot$ is the element-wise product and $\overrightarrow{1}$ is an array that has the same size as $\vec{y}$ with all elements equal to 1.
If data points belonging to two classes are distributed as shown in Fig. 6.1, then the two categories of points can be separated by an infinite number of planes. In the SVM, the plane associated with the best classification performance is believed to be the plane that can achieve the greatest (widest) margin. Accordingly, the mathematical description of the basic SVM for a binary classification task is
\begin{equation*}
\left\{\begin{array}{l}
\max _{\vec{w}, b} \frac{2}{\|\vec{w}\|} \quad\left(\text { or } \max _{\vec{w}, b} \frac{2}{\|\vec{w}\|^{2}}\right) \\
\text { s.t. } y_{i} \cdot\left(\vec{w}^{T} \cdot \vec{x}_{i}+b\right) \geqslant 1, \quad i=1,2, \cdots, I
\end{array}\right. \tag{6.7}
\end{equation*}
The above maximization problem is commonly reformulated as the following equivalent minimization problem, which is the standard form used in optimization:
\begin{equation*}
\left\{\begin{array}{l}
\min _{\vec{w}, b} \frac{1}{2}\|\vec{w}\|^{2} \\
\text { s.t. } y_{i} \cdot\left(\vec{w}^{T} \cdot \vec{x}_{i}+b\right) \geqslant 1, \quad i=1,2, \cdots, I
\end{array}\right. \tag{6.8}
\end{equation*}
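To make Eq. 6.8 concrete, the following sketch solves the primal hard-margin problem on a small, hypothetical linearly separable data set. It assumes the numpy and cvxpy packages are available; the data values are made up purely for illustration.

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable toy data (labels y_i in {+1, -1})
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.0, 1.5],
              [0.0, 0.5], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

w = cp.Variable(2)
b = cp.Variable()

# Eq. 6.8: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, ", b* =", b.value)
print("margin width 2/||w*|| =", 2 / np.linalg.norm(w.value))
```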

6.2.2. Dual Formulation

The above optimization problem is a convex quadratic programming problem and can be solved with standard optimization packages. However, a more widely adopted approach is to convert it into a dual problem by introducing a Lagrange multiplier ($\alpha_{i} \geqslant 0$) for each constraint. This brings two benefits: 1) making the optimization problem easier to solve, and 2) facilitating the introduction of kernel functions for tackling nonlinear problems.
The Lagrange function of the above problem can be written as
\begin{equation*}
\ell(\vec{w}, b, \vec{\alpha})=\frac{1}{2}\|\vec{w}\|^{2}+\sum_{i=1}^{I} \alpha_{i}\left[1-y_{i} \cdot\left(\vec{w}^{T} \cdot \vec{x}_{i}+b\right)\right] \tag{6.9}
\end{equation*}
where $\vec{\alpha}=\left[\alpha_{1}, \alpha_{2}, \cdots, \alpha_{I}\right]^{T}$ collects the Lagrange multipliers. The minimum of the Lagrange function with respect to $\vec{w}$ and $b$ is attained where the corresponding partial derivatives, $\frac{\partial \ell}{\partial \vec{w}}$ and $\frac{\partial \ell}{\partial b}$, are zero, that is
\begin{equation*}
\left\{\begin{array}{l}
\vec{w}=\sum_{i=1}^{I} \alpha_{i} y_{i} \vec{x}_{i}=\bar{X}^{T} \cdot(\vec{\alpha} \odot \vec{y}) \\
0=\sum_{i=1}^{I} \alpha_{i} y_{i}=\vec{\alpha}^{T} \cdot \vec{y}
\end{array}\right. \tag{6.10}
\end{equation*}
Substituting the above equations into the Lagrange function yields the dual problem of the SVM:
\begin{equation*}
\left\{\begin{array}{l}
\min _{\vec{\alpha}}\left[\frac{1}{2} \sum_{i=1}^{I} \sum_{j=1}^{I}\left(\alpha_{i} \alpha_{j} y_{i} y_{j} \vec{x}_{i}^{T} \cdot \vec{x}_{j}\right)-\sum_{i=1}^{I} \alpha_{i}\right]=\min _{\vec{\alpha}}\left\{\frac{1}{2}\left[\bar{X}^{T} \cdot(\vec{\alpha} \odot \vec{y})\right]^{T} \cdot\left[\bar{X}^{T} \cdot(\vec{\alpha} \odot \vec{y})\right]-\overrightarrow{1}^{T} \cdot \vec{\alpha}\right\} \\
\text { s.t. } \sum_{i=1}^{I} \alpha_{i} y_{i}=\vec{\alpha}^{T} \cdot \vec{y}=0, \quad 0 \leqslant \alpha_{i}, \ i=1,2, \cdots, I
\end{array}\right. \tag{6.11}
\end{equation*}
where $\overrightarrow{1}$ is a column array that has the same size as $\vec{\alpha}$ with all elements equal to 1.
Solving the above dual problem yields $\vec{\alpha}$. The optimal $\vec{w}$ and $b$ correspond to the plane that best separates the two categories in the SVM. For this convex problem, strong duality holds, so the original and dual problems attain the same optimal value, and the optimal solution satisfies the Karush-Kuhn-Tucker (KKT) conditions. From the dual solution $\vec{\alpha}$, the optimal primal variables can be recovered as
\begin{equation*}
w_{j}^{*}=\sum_{i=1}^{I} \alpha_{i} y_{i} x_{i j} \tag{6.12}
\end{equation*}
Or equivalently,
\begin{equation*}
\vec{w}^{*}=\sum_{i=1}^{I} \alpha_{i} y_{i} \vec{x}_{i} \tag{6.13}
\end{equation*}
The calculation of $b^{*}$ requires the use of support vectors. Let us recall the following equation (for the margin boundary) that a support vector needs to satisfy:
\begin{equation*}
y^{*} \cdot\left(\vec{w}^{* T} \cdot \vec{x}^{*}+b^{*}\right)=1 \tag{6.14}
\end{equation*}
where $\vec{x}^{*}$ and $y^{*}$ are a support vector (the vector/array for a point on a margin boundary) and its label. The symbol "$*$" is used for such a sample and its label because the point corresponds to the optimal margin. For the same reason, $\vec{w}^{*}$ and $b^{*}$ also carry the "$*$" sign.
The support vectors are associated with positive Lagrange multipliers. To see why, write each inequality constraint as $g_{i} \geqslant 0$ with $g_{i}=y_{i} \cdot\left(\vec{w}^{T} \cdot \vec{x}_{i}+b\right)-1$. The KKT conditions include the complementary slackness condition, which states that $g_{i} \cdot \alpha_{i}=0$. For points that are not support vectors we have $g_{i}>0$, hence we must have $\alpha_{i}=0$; only the support vectors, for which $g_{i}=0$, can have $\alpha_{i}>0$.
Multiplying both sides of Eq. 6.14 by $y^{*}$, we get
\begin{equation*}
y^{* 2} \cdot\left(\vec{w}^{* T} \cdot \vec{x}^{*}+b^{*}\right)=y^{*} \tag{6.15}
\end{equation*}
We know $y^{* 2}=1$ because $y^{*}$ can only be $\pm 1$, so we have
\begin{equation*}
b^{*}=y^{*}-\vec{w}^{* T} \cdot \vec{x}^{*}=y^{*}-\left(\sum_{i=1}^{I} \alpha_{i} y_{i} \vec{x}_{i}\right)^{T} \cdot \vec{x}^{*} \tag{6.16}
\end{equation*}
After $\vec{w}^{*}$ and $b^{*}$ are obtained, the function of the optimal hyperplane separating the two classes is
\begin{equation*}
\vec{w}^{* T} \cdot \vec{x}+b^{*}=0 \tag{6.17}
\end{equation*}
Accordingly, we can use the following equation for predictions in a binary classification task with the SVM:
\begin{equation*}
f(\vec{x})=\operatorname{sign}\left(\vec{w}^{* T} \cdot \vec{x}+b^{*}\right) \tag{6.18}
\end{equation*}
where "sign" is the sign function, which outputs 1 if the input is greater than 0 , output 0 if the input is 0 , and output -1 if the input is lower than 0 .

6.3. Generalization of SVM: Kernel Methods

In reality, many problems are not linearly separable. This means that, in the sample space, the points in the two categories cannot be separated by a plane or a hyperplane. In order to classify such samples, we can map the samples from the original space, i.e., $\vec{x}$, into a space with higher dimensions, i.e., $\phi(\vec{x})$. It is possible that a separating hyperplane can then be found for the data points in the high-dimensional space. The function of the hyperplane in the new space can be formulated as
\begin{equation*}
f(\vec{x})=\vec{w}^{T} \cdot \phi(\vec{x})+b \tag{6.19}
\end{equation*}
The optimization problem becomes
\begin{equation*}
\left\{\begin{array}{l}
\min _{\vec{w}, b} \frac{1}{2}\|\vec{w}\|^{2} \\
\text { s.t. } y_{i} \cdot\left(\vec{w}^{T} \cdot \phi\left(\vec{x}_{i}\right)+b\right) \geqslant 1, \quad i=1,2, \cdots, I
\end{array}\right. \tag{6.20}
\end{equation*}
The corresponding dual problem is
\begin{equation*}
\left\{\begin{array}{l}
\min _{\vec{\alpha}}\left\{\frac{1}{2} \sum_{i=1}^{I} \sum_{j=1}^{I}\left[\alpha_{i} \alpha_{j} y_{i} y_{j} \phi\left(\vec{x}_{i}\right)^{T} \cdot \phi\left(\vec{x}_{j}\right)\right]-\sum_{i=1}^{I} \alpha_{i}\right\} \\
\text { s.t. } \sum_{i=1}^{I} \alpha_{i} y_{i}=0, \quad 0 \leqslant \alpha_{i}, \ i=1,2, \cdots, I
\end{array}\right. \tag{6.21}
\end{equation*}
The solution to the above problem involves the calculation of $\phi\left(\vec{x}_{i}\right)^{T} \cdot \phi\left(\vec{x}_{j}\right)$, which is the inner product of the mappings of $\vec{x}_{i}$ and $\vec{x}_{j}$. Because the mapping leads to a space with a higher dimension, computing this inner product directly is usually challenging. Therefore, we usually introduce the following kernel function [56]:
\begin{equation*}
k\left(\vec{x}_{i}, \vec{x}_{j}\right)=\left\langle\phi\left(\vec{x}_{i}\right), \phi\left(\vec{x}_{j}\right)\right\rangle=\phi\left(\vec{x}_{i}\right)^{T} \cdot \phi\left(\vec{x}_{j}\right) \tag{6.22}
\end{equation*}
The function of the hyperplane with a kernel function, which is called the support vector expansion, is written as
\begin{equation*}
f(\vec{x})=\vec{w}^{T} \cdot \phi(\vec{x})+b=\sum_{i=1}^{I} \alpha_{i} y_{i} \phi\left(\vec{x}_{i}\right)^{T} \cdot \phi(\vec{x})+b=\sum_{i=1}^{I} \alpha_{i} y_{i} k\left(\vec{x}_{i}, \vec{x}\right)+b \tag{6.23}
\end{equation*}
The following are the common kernel functions.
  1. Linear Kernel: $k\left(\vec{x}_{i}, \vec{x}_{j}\right)=\vec{x}_{i}^{T} \cdot \vec{x}_{j}$
  2. Polynomial Kernel: $k\left(\vec{x}_{i}, \vec{x}_{j}\right)=\left(\vec{x}_{i}^{T} \cdot \vec{x}_{j}\right)^{d}$
  3. Gaussian Kernel: $k\left(\vec{x}_{i}, \vec{x}_{j}\right)=\exp \left(-\frac{\left\|\vec{x}_{i}-\vec{x}_{j}\right\|^{2}}{2 \sigma^{2}}\right)$
  4. Laplace Kernel: $k\left(\vec{x}_{i}, \vec{x}_{j}\right)=\exp \left(-\frac{\left\|\vec{x}_{i}-\vec{x}_{j}\right\|}{\sigma}\right)$
  5. Sigmoid Kernel: $k\left(\vec{x}_{i}, \vec{x}_{j}\right)=\tanh \left(\beta \vec{x}_{i}^{T} \cdot \vec{x}_{j}+\theta\right)$
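The kernels listed above can be written directly in code. The sketch below implements them with numpy; the default values chosen for $d$, $\sigma$, $\beta$, and $\theta$ are illustrative only, and in practice these hyperparameters must be tuned for the problem at hand.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                    # k = xi^T xj

def polynomial_kernel(xi, xj, d=3):
    return (xi @ xj) ** d                             # k = (xi^T xj)^d

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def laplace_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) / sigma)

def sigmoid_kernel(xi, xj, beta=1.0, theta=0.0):
    return np.tanh(beta * (xi @ xj) + theta)

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(gaussian_kernel(xi, xj, sigma=1.0))
```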
It is worthwhile to point out that the SVM is relatively sensitive to the choice of kernel, as well as to missing data. Like the applications of kernels in other types of machine learning, the selection of a kernel in real problems still relies heavily on experience. There is a lack of theory for guiding and explaining the selection of a suitable kernel for a specific problem with SVM.
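As an end-to-end illustration of the kernel trick, the sketch below trains SVMs with different kernels on a data set that is not linearly separable. It assumes scikit-learn is available; the specific data set (two concentric circles) and the hyperparameter values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by any straight line in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, C=1e3, gamma="scale", degree=3)
    clf.fit(X, y)
    acc = clf.score(X, y)
    print(f"{kernel:>6} kernel: training accuracy = {acc:.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
```

The linear kernel performs poorly on this data, while the Gaussian (RBF) kernel separates the two circles, which is exactly the behavior the mapping $\phi(\vec{x})$ in Eq. 6.19 is meant to achieve.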
