Machine Learning Algorithms
Support Vector Machine – SVM
Overview
• SVM for a linearly separable binary set
• Main goal: design a hyperplane that classifies all training vectors into two classes
• The best model is the one that leaves the maximum margin from both classes
• The two class labels are +1 (positive examples) and -1 (negative examples)
(Figure: two classes of training points in the x1-x2 plane, separated by a maximum-margin hyperplane.)
Overview
• This is a constrained optimization problem
• We want to split the data in the best possible way
(Figure: the chosen hyperplane best splits the data because it is as far as possible from the support vectors, which is another way of saying we maximized the margin.)
Intuition behind SVM
• Points (instances) are like vectors $p = (x_1, x_2, \dots, x_n)$
• SVM finds the closest two points from the two classes (see figure), that support (define) the best separating line/plane
• Then SVM draws a line connecting them (the orange line in the figure)
• After that, SVM decides that the best separating line is the line that bisects, and is perpendicular to, the connecting line
Margin in terms of w
For support vectors $x_1$ and $x_2$ lying on the two margin boundaries:

$$w \cdot x_2 + b = +1$$
$$w \cdot x_1 + b = -1$$

Subtracting the second equation from the first gives $w \cdot (x_2 - x_1) = 2$, so the width of the margin, measured along the unit normal $\frac{w}{\|w\|}$, is

$$\frac{w}{\|w\|} \cdot (x_2 - x_1) = \text{width} = \frac{2}{\|w\|}$$
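As a quick numeric sanity check of this derivation, here is a minimal NumPy sketch; the values of w, b, x1 and x2 below are made up, chosen only so that the two margin equations hold:

```python
import numpy as np

# Made-up values (not from the slides): w, b and two points chosen so that
# x2 lies on the +1 margin boundary and x1 on the -1 margin boundary.
w = np.array([3.0, 4.0])          # ||w|| = 5
b = -2.0
x2 = np.array([1.0, 0.0])         # w.x2 + b = +1
x1 = np.array([1.0 / 3.0, 0.0])   # w.x1 + b = -1

# Width of the margin = projection of (x2 - x1) onto the unit normal w/||w||.
width = (x2 - x1) @ (w / np.linalg.norm(w))
print(width, 2 / np.linalg.norm(w))   # both print 0.4
```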
Support Vector Machine: linearly separable data

$$\text{Margin} = \frac{2}{\|w\|}$$

(Figure: the separating hyperplane $w^T x + b = 0$ with the two margin boundaries $w^T x + b = 1$ and $w^T x + b = -1$; the points lying on the boundaries are the support vectors.)
SVM as a minimization problem
• Maximizing $\frac{2}{\|w\|}$ is the same as minimizing $\frac{\|w\|^2}{2}$
• Hence SVM becomes a minimization problem:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \qquad \text{(quadratic objective)}$$
$$\text{s.t.}\ \ y_i(x_i \cdot w + b) \ge 1 \ \ \forall i \qquad \text{(linear constraints)}$$

• We are now optimizing a quadratic function subject to linear constraints
• Quadratic optimization problems are a standard, well-known class of mathematical optimization problems, and many algorithms exist for solving them
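Leaving the quadratic-programming machinery to a library, here is a hedged sketch of this optimization in practice; it assumes scikit-learn, uses made-up toy data, and approximates the hard-margin problem by giving SVC a very large C:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs (made-up toy data).
X = np.array([[1.0, 1.0], [1.5, 0.5], [1.0, 0.0],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem min (1/2)||w||^2
# subject to y_i (x_i . w + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```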
In order to cater for the constraints in this minimization, we need to allocate them Lagrange multipliers $\alpha_i$, where $\alpha_i \ge 0 \ \forall i$:

$$L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i \big[ y_i(x_i \cdot w + b) - 1 \big]
      = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i\, y_i(x_i \cdot w + b) + \sum_{i=1}^{L} \alpha_i$$

where L is the number of training points. We wish to find the w and b which minimize, and the α which maximizes, $L_P$ (whilst keeping $\alpha_i \ge 0 \ \forall i$). We can do this by differentiating $L_P$ with respect to w and b and setting the derivatives to zero:

$$\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{L} \alpha_i y_i x_i$$

$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{L} \alpha_i y_i = 0$$
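To connect these two conditions to a solver's output, here is a sketch using the same made-up toy data and scikit-learn model as the earlier snippet; SVC stores the products α_i·y_i of the support vectors in dual_coef_, so both conditions can be checked numerically:

```python
import numpy as np
from sklearn.svm import SVC

# Same made-up toy data as in the earlier sketch.
X = np.array([[1.0, 1.0], [1.5, 0.5], [1.0, 0.0],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for each support vector.
alpha_y = clf.dual_coef_[0]

# dL_P/dw = 0  ->  w = sum_i alpha_i y_i x_i
w_from_dual = alpha_y @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))   # True

# dL_P/db = 0  ->  sum_i alpha_i y_i = 0
print(np.isclose(alpha_y.sum(), 0.0))           # True
```

Only the support vectors appear in these sums; every other training point has αi = 0, which is the geometrical interpretation shown next.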
A Geometrical Interpretation
(Figure: Class 1 and Class 2 separated by the maximum-margin hyperplane; only the support vectors on the margin receive non-zero multipliers, e.g. α1 = 0.8, α6 = 1.4 and α8 = 0.6, while all the other points have αi = 0.)
Example
• Here we select 3 support vectors to start with.
• They are S1, S2 and S3.
(Figure: the three support vectors plotted in the x1-x2 plane: S1 = (2, 1) and S2 = (2, -1) from the negative class, S3 = (4, 0) from the positive class.)
Example
• Here we will use vectors augmented with a 1 as a bias input, and for clarity we will differentiate these with an over-tilde. That is:

$$S_1 = \begin{pmatrix}2\\1\end{pmatrix} \Rightarrow \tilde{S}_1 = \begin{pmatrix}2\\1\\1\end{pmatrix}, \qquad
S_2 = \begin{pmatrix}2\\-1\end{pmatrix} \Rightarrow \tilde{S}_2 = \begin{pmatrix}2\\-1\\1\end{pmatrix}, \qquad
S_3 = \begin{pmatrix}4\\0\end{pmatrix} \Rightarrow \tilde{S}_3 = \begin{pmatrix}4\\0\\1\end{pmatrix}$$
Example

$$\alpha_1\, \tilde{S}_1 \cdot \tilde{S}_1 + \alpha_2\, \tilde{S}_2 \cdot \tilde{S}_1 + \alpha_3\, \tilde{S}_3 \cdot \tilde{S}_1 = -1 \quad (\text{-ve class})$$
$$\alpha_1\, \tilde{S}_1 \cdot \tilde{S}_2 + \alpha_2\, \tilde{S}_2 \cdot \tilde{S}_2 + \alpha_3\, \tilde{S}_3 \cdot \tilde{S}_2 = -1 \quad (\text{-ve class})$$
$$\alpha_1\, \tilde{S}_1 \cdot \tilde{S}_3 + \alpha_2\, \tilde{S}_2 \cdot \tilde{S}_3 + \alpha_3\, \tilde{S}_3 \cdot \tilde{S}_3 = +1 \quad (\text{+ve class})$$

• Let's substitute the values $\tilde{S}_1 = (2,1,1)^T$, $\tilde{S}_2 = (2,-1,1)^T$ and $\tilde{S}_3 = (4,0,1)^T$ in the above equations:

$$\alpha_1 (2,1,1)\cdot(2,1,1) + \alpha_2 (2,-1,1)\cdot(2,1,1) + \alpha_3 (4,0,1)\cdot(2,1,1) = -1$$
$$\alpha_1 (2,1,1)\cdot(2,-1,1) + \alpha_2 (2,-1,1)\cdot(2,-1,1) + \alpha_3 (4,0,1)\cdot(2,-1,1) = -1$$
$$\alpha_1 (2,1,1)\cdot(4,0,1) + \alpha_2 (2,-1,1)\cdot(4,0,1) + \alpha_3 (4,0,1)\cdot(4,0,1) = +1$$
• After simplification we get:

$$6\alpha_1 + 4\alpha_2 + 9\alpha_3 = -1$$
$$4\alpha_1 + 6\alpha_2 + 9\alpha_3 = -1$$
$$9\alpha_1 + 9\alpha_2 + 17\alpha_3 = +1$$

• Solving these 3 simultaneous equations we get: $\alpha_1 = \alpha_2 = -3.25$ and $\alpha_3 = 3.5$.
• The hyperplane that discriminates the positive class from the negative class is given by:

$$\tilde{w} = \sum_i \alpha_i \tilde{S}_i$$

• Substituting the values we get:

$$\tilde{w} = (-3.25)\begin{pmatrix}2\\1\\1\end{pmatrix} + (-3.25)\begin{pmatrix}2\\-1\\1\end{pmatrix} + (3.5)\begin{pmatrix}4\\0\\1\end{pmatrix} = \begin{pmatrix}1\\0\\-3\end{pmatrix}$$

• Our vectors are augmented with a bias.
• Hence we can equate the last entry of $\tilde{w}$ to the hyperplane offset b.
• Therefore the separating hyperplane equation is $y = w \cdot x + b$ with $w = \begin{pmatrix}1\\0\end{pmatrix}$ and offset $b = -3$.
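The hand calculation above can be verified with a few lines of NumPy; this sketch only mirrors the example's own numbers (the Gram matrix of the augmented support vectors and the targets -1, -1, +1):

```python
import numpy as np

# Augmented support vectors from the example: (x1, x2, 1).
S = np.array([[2.0,  1.0, 1.0],    # S1, negative class
              [2.0, -1.0, 1.0],    # S2, negative class
              [4.0,  0.0, 1.0]])   # S3, positive class
t = np.array([-1.0, -1.0, 1.0])    # right-hand sides of the three equations

# Gram matrix G[i, j] = S_i . S_j gives the three simultaneous equations G a = t.
G = S @ S.T
alpha = np.linalg.solve(G, t)
print(alpha)                       # [-3.25 -3.25  3.5 ]

# w~ = sum_i alpha_i S~_i  ->  w = (1, 0), b = -3
w_tilde = alpha @ S
print(w_tilde)                     # [ 1.  0. -3.]
```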
Support Vector Machines
(Figure: the resulting separating hyperplane $y = w \cdot x + b$ with $w = (1, 0)^T$ and $b = -3$, i.e. the vertical line $x_1 = 3$, plotted together with the training points.)
Kernel trick
SVM Algorithm
1- Define an optimal hyperplane: maximize the margin
2- Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications (see the soft-margin sketch below)
3- Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space
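Step 2's penalty term shows up in most SVM libraries as the hyperparameter C. A small hedged sketch (scikit-learn, randomly generated overlapping blobs, arbitrary C values): a smaller C tolerates more margin violations, which typically keeps more support vectors and a wider margin:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs (made-up data, fixed seed for repeatability).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    n_sv = int(clf.n_support_.sum())
    print(f"C={C:<6} support vectors={n_sv:<4} margin={margin:.3f}")
```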
Suppose we're in 1-dimension
Not a big surprise.
(Figure: 1-D training points on the x axis with x = 0 marked; the maximum-margin classifier is just a threshold, with the positive "plane" on one side and the negative "plane" on the other.)

Harder 1-dimensional dataset
That's wiped the smirk off SVM's face. What can be done about this?
(Figure: a 1-D dataset around x = 0 in which one class sits in the middle of the other, so no single threshold separates the classes.)
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

$$z_k = (x_k,\ x_k^2)$$

(Figure: the same 1-D points mapped into this 2-D feature space, where the two classes become linearly separable.)
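A tiny sketch of the same idea (the 1-D points below are made up, with the positive class on both sides of a negative cluster): mapping each point through z_k = (x_k, x_k^2) lifts the data into 2-D, where a linear SVM separates it perfectly:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data: negatives near 0, positives further out on both sides.
x = np.array([-4.0, -3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0, 4.0])
y = np.array([ 1,    1,   -1,   -1,  -1,  -1,  -1,   1,   1])

# No single threshold on x separates the classes, but the basis expansion
# z_k = (x_k, x_k^2) lifts them into a 2-D space where a line does.
Z = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print(clf.score(Z, y))   # prints 1.0: perfectly separated in the lifted space
```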
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

(Figure: a dataset that is not linearly separable in the input space becomes linearly separable in the feature space after applying φ. Input Space → Feature Space.)
SVM for non-linear separability
• The simplest way to separate two groups of data is with a straight line, a flat plane, or an N-dimensional hyperplane
• However, there are situations where a nonlinear region can separate the groups more efficiently
• SVM handles this by using a kernel function (nonlinear) to map the data into a different space where a hyperplane (linear) can be used to do the separation
• It means a non-linear function is learned by a linear learning machine in a high-dimensional feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space
• This is called the kernel trick, which means the kernel function transforms the data into a higher-dimensional feature space to make it possible to perform the linear separation
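As a hedged illustration of the kernel trick (using scikit-learn and its make_circles toy dataset, not data from the slides): a linear SVM cannot separate two concentric rings, while an RBF-kernel SVM separates them without ever computing the high-dimensional mapping explicitly:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class inside the other: not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # well below 1.0
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0
```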
Kernels
• Why use kernels?
  • Make non-separable problems separable
  • Map data into a better representational space
• Common kernels
  • Linear
  • Polynomial: $K(x, z) = (1 + x^T z)^d$
    • Gives feature conjunctions (see the sketch below)
  • Radial basis function (infinite-dimensional space)
    • Hasn't been very useful in text classification
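For the polynomial kernel, the "feature conjunctions" can be made explicit. A sketch for d = 2 with two input dimensions, where (1 + xᵀz)² equals the dot product of the explicit feature maps φ(x) = (1, √2·x1, √2·x2, x1², √2·x1x2, x2²); the two example points are made up:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D input."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])   # made-up example points
z = np.array([3.0, 0.5])

kernel_value = (1 + x @ z) ** 2      # K(x, z) = (1 + x.z)^d with d = 2
explicit_value = phi(x) @ phi(z)     # same number via the explicit mapping
print(kernel_value, explicit_value)  # both print 25.0
```

Computing the kernel directly costs one dot product in the input space, whereas the explicit feature map grows quickly with d and the input dimension, which is exactly why the kernel trick is useful.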
Thanks