Notes On Support Vector Machines
Fernando Mira Da Silva
INESC
November 1998
Abstract
This report describes an empirical study of Support Vector Machines (SVM) which aims to
review the foundations and properties of SVM based on the graphical analysis of simple binary
classification tasks.
SVM theory is briefly reviewed by discussing possible learning rules for linearly separable
problems and the role of the classification margin in generalization. This leads to the formulation
of the SVM learning procedure for linearly separable problems. It is then shown how the SVM
approach can be extended to non-separable problems by relaxing the strict separability constraint
of the original formulation.
The application of SVM to nonlinear problems by feature expansion techniques is then briefly
reviewed. The effect of the dimension of the feature space on the SVM solution is considered
using a simple bi-dimensional problem. Finally, it is discussed how explicit feature expansion
can be avoided by the use of kernels.
In the last part of this document, practical problems regarding SVM optimization are briefly
discussed and some pointers and references regarding this issue are supplied.
Chapter 1
Introduction
The interest in supervised, automatic learning systems for pattern classification was greatly en-
hanced by the advances in connectionist and other non-symbolic learning paradigms. In super-
vised learning, the system is presented with examples of input patterns and the corresponding
desired class. During the learning phase, the system is expected to develop a classification model
which is able to correctly classify arbitrary input data patterns.
Discussions of generalization and learning often involve the concepts of computational feasibility
and computational capacity. For a system to perform a given classification task, its model
structure must be able to implement it. This means that the classification process must be feasible
for the computational model. This requirement may imply the use of a fairly complex computational
model. But, on the other hand, one must consider the computational capacity of the
model. The computational capacity of a given system may be formally quantified by the Vapnik-
Chervonenkis (VC) dimension (Vapnik, 1995). Broadly speaking, the VC dimension of a binary
classification system is the maximum number of training samples that can be learned by the system
without error for all possible binary labellings of those samples (Haykin, 1994).
The introduction of the VC dimension makes it possible to derive an upper bound for the misclassification
rate in the application domain (risk) given the misclassification rate observed on the training
set (empirical risk). While the theoretical bounds which result from the VC dimension are often
considered too conservative (see for instance (Baum and Haussler, 1989; Baum, 1990; Cohn and
Tesauro, 1991; Ji and Psaltis, 1992)), the VC dimension remains an important tool for the statistical
analysis of learning and generalization. An excessively large VC dimension may lead to an overfitted
system, which tends to memorize the training samples but develops a poor generalization
behavior.
In practice, good generalization often requires a trade-off between the computational feasibil-
ity of the model and its computational capacity. This is often achieved by including regularization
techniques in the learning procedure. Several neural network learning models, such as weight sharing
(Le Cun et al., 1990) or weight decay (Krogh and Hertz, 1992; Guyon et al., 1992), attempt to
reduce the VC dimension without affecting the computational feasibility of the system.
Recently, support vector machines (SVM) received a great deal of attention from the machine
learning community. Possibly the most attractive feature of SVM stems from the attempt to derive
a feasible classifier with a minimum VC-dimension for a given classification task. This report
describes an empirical study of Support Vector Machines which aims to review the foundations
and properties of SVM through the graphical analysis of simple binary classification tasks.
This report is structured as follows. In chapter 1 the basic principles of linear classification
systems are introduced. Chapter 2 introduces the optimality principle of support vector machines
for linear classification tasks and its application to separable and non-separable problems.
Chapter 3 discusses the extension of SVM to non-linear classification problems by feature
expansion. Chapter 4 considers the use of kernels in non-linear SVMs. Chapter 5 addresses some
details regarding the implementation of SVM. Finally, in chapter 6, conclusions are
presented.
Consider a linearly separable binary classification problem where the training set is defined by the
pairs
$(x_i, y_i), \quad i = 1, \ldots, P$   (1.1)
Each $x_i \in \mathbb{R}^n$ is an input pattern and $y_i \in \{-1, +1\}$ identifies the target class for input pattern $x_i$.
By definition, stating that the problem is linearly separable is equivalent to stating that there is
an hyper-plane $\Pi$ in $\mathbb{R}^n$ splitting the input domain such that patterns of the same class lie in the
same region of the input space. The solution of the classification problem reduces to finding any
hyper-plane which has this property. Without further restrictions, there is in general an infinity of
solutions to this problem. Examples of several possible solutions for a simple bi-dimensional case
are shown in figure 1.1.
If the domain of the classification problem is the training set alone, then any hyper-plane in
figure 1.1 provides a correct solution to the classification problem. However, further to the correct
classification of all patterns in the training set 1 , a good classification system is expected to provide
correct classes for previously unseen data patterns. Therefore, to achieve optimal generalization,
the classification problem must be restated as follows: among all possible solutions, choose the
one which hopefully provides the best generalization ability. Clearly, this is an ill-posed problem,
since it is impossible to guess the generalization behavior without making strong assumptions on
the underlying data distribution.
A linear classification system halves the input space with an hyper-plane $\Pi$ defined by the
equation
$w \cdot x + b = 0$   (1.2)
where $w \in \mathbb{R}^n$ is a weight vector and $b$ a bias term, and assigns to each input pattern $x$ the class
$\mathrm{class}(x) = \mathrm{sign}(w \cdot x + b)$   (1.3)
1
In the more general case, the optimal system that minimizes the classification error in the application domain may
even misclassify one or more patterns in the training set. However, we assume here that the problem is strictly linearly
separable and that there is no noise in the training set samples.
The original perceptron (Rosenblatt, 1958), the adaline (Widrow and Hoff, 1960) or the neu-
ron unit with sigmoidal activation function (Rumelhart et al., 1986) are among the simplest binary
classification systems with learning ability. The output of a neuron unit with an activation function
$f(\cdot)$ for an input pattern $x$ is given by
$y(x) = f\Big( \sum_{k=1}^{n} w_k x_k + b \Big) = f(w \cdot x + b)$   (1.4)
where $x_k$ is the $k$-th component of $x$ and $w$ is a vector of components $w_k$. The function $f(\cdot)$ is
usually a monotonic sigmoidal type function. According to (1.3), the class of input pattern $x$
is obtained from $y(x)$ by
$\mathrm{class}(x) = \left\{ \begin{array}{ll} +1 & \text{if } y(x) \ge \theta \\ -1 & \text{if } y(x) < \theta \end{array} \right.$   (1.5)
where $\theta$ is a pre-specified threshold. In neural network terminology, the parameters $w_k$ are the
weights and $b$ is the bias input.
The learning phase of the sigmoidal unit reduces to finding the parameters ($w$, $b$) that minimize
the total output error on the training set, given by
$E = \sum_{i=1}^{P} d\big(y(x_i), y_i\big)$   (1.6)
where $d(\cdot, \cdot)$ is an arbitrary distortion measure. If $d$ refers to the quadratic error and $f$ is
an odd function with values in $[-1, +1]$, then the distortion measure can be simply defined as
$d(y(x_i), y_i) = (y(x_i) - y_i)^2$. Since $E$ depends on all training samples, the parameters ($w$, $b$) found
by the minimization process – and hence $\Pi$ – also depend on all training samples. This is a
reasonable result, since $\Pi$ must correctly classify all training patterns.
However, it is also reasonable to argue that, provided that all training patterns are correctly
classified, the exact location of $\Pi$ should only depend on the patterns lying near the decision boundary.
In fact, these patterns are the most critically relevant for the definition of the exact boundary
of the decision region, and it is natural to expect that generalization is, above all, dependent on this
boundary. Accordingly, one may expect that correctly classified patterns far away from $\Pi$ should
be almost irrelevant regarding the generalization behavior of the classification system.
The hypothesis that the optimal location of $\Pi$ must only depend on the data patterns near the
decision boundary is a key principle of support vector machines. Under this assumption, the
learning process must be able to identify this set of patterns – the support vectors – and it must be
able to express the optimal parameters ($w$, $b$) of the classifier only as a function of these samples.
Chapter 2
Optimal linear classification
2.1 The separable case
2.1.1 Formulation
In its simplest form, the learning criteria of support vector machines requires that all patterns
in the training set are correctly classified by the hyper-plane $\Pi$, that is,
$y_i (w \cdot x_i + b) > 0, \quad i = 1, \ldots, P$   (2.1)
Let us define the classification margin $m$ as the minimum distance between $\Pi$ and the closest
pattern to $\Pi$. This means
$m = \min_i d(x_i, \Pi)$   (2.2)
where $d(x_i, \Pi)$ is the Euclidean distance from pattern $x_i$ to $\Pi$. This distance is given by
$d(x_i, \Pi) = \dfrac{|w \cdot x_i + b|}{\|w\|}$   (2.3)
The classification margin can be seen as a measure of how well the hyper-plane $\Pi$ performs
in the task of separating the two binary classes. In fact, a small value of $m$ means that $\Pi$ is close
to one or more samples of one or both classes and, therefore, there is a reasonably high probability
that samples not included in the training set may fall on the wrong side of the classification boundary.
On the other hand, if $m$ is large, such probability is significantly reduced.
Figure 2.1: Optimal classification hyper-plane. Note that the exact location of
the hyper-plane only depends on two support vectors.
The learning criteria of support vector machines reduces to finding the set of parameters ($w$, $b$)
that maximizes the classification margin under the constraints defined by (2.1). The optimal classification
hyper-plane according to this criteria for a simple bi-dimensional example is depicted in
figure 2.1. Note that, as discussed before, the exact location of $\Pi$ is only dependent on the support
vectors closer to the hyper-plane (in this example, two support vectors). One of the relevant
features of this classification criteria is that, under certain hypotheses, it can be related to the
principle of structural risk minimization (Vapnik, 1992; Guyon et al., 1992), therefore providing
theoretical grounds for the justification of the good generalization ability of these systems (Vapnik,
1995; Gunn, 1997; Burges, 1998).
Note that there is one degree of freedom in the definition of $\Pi$. In fact, all sets of parameters
($k w$, $k b$) with $k > 0$ define the same hyper-plane and result in the same classification. This degree
of freedom can be removed by the condition
$\min_i |w \cdot x_i + b| = 1$   (2.4)
Given (2.2) and (2.3), this additional constraint is equivalent to stating that
$\min_i d(x_i, \Pi) = \dfrac{1}{\|w\|}$   (2.5)
With this normalization, the correct classification of all training patterns can be written as
$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, P$   (2.6)
Note, once more, that this reduces to stating that once condition (2.1) is satisfied for an arbitrary set
of parameters ($w$, $b$), then it is possible to choose a scaling factor $k$ such that the parameters
($k w$, $k b$) also verify (2.6).
Let us consider again the classification margin. From (2.5) it results that
$m = \dfrac{1}{\|w\|}$   (2.7)
With this condition, the optimization problem can be stated as follows: maximize
$m = \dfrac{1}{\|w\|}$   (2.8)
under the constraints
$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, P$   (2.9)
If there are $P$ patterns in the training set, the optimization will have $P$ constraints. Note that $m$,
while absent from (2.8), is implicitly defined by the constraints (2.9).
In order to avoid the square root associated with $\|w\|$ and to simplify the overall formulation,
instead of the maximization of $1/\|w\|$, it is usually considered the problem of the minimization of
$J(w) = \dfrac{1}{2} \|w\|^2$   (2.10)
with respect to $w$ and $b$, subject to (2.9).
Under this formulation, the constrained optimization problem is reduced to the minimization of
the Lagrangian
$L(w, b, \alpha) = \dfrac{1}{2} \|w\|^2 - \sum_{i=1}^{P} \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big], \quad \alpha_i \ge 0$   (2.12)
with respect to $w$ and $b$ and to its maximization with respect to the Lagrange multipliers $\alpha_i$,
which is equivalent to stating that the solution is found at a saddle point of the Lagrangian in respect
to the Lagrange multipliers. This is a convex quadratic programming problem, since the objective
function is convex and each constraint defines itself a convex set.
While the above optimization problem can be directly solved using any standard optimization
package, it requires the explicit computation of $w$. While this can be considered a quite natural
requirement at this stage, avoiding the explicit computation of $w$ is a key requirement for the
application of kernel functions to SVM (this subject will be discussed in further detail in chapter
4). In this problem, the explicit computation of $w$ can be avoided using the dual formulation
of Lagrangian theory (Fletcher, 1987). As a side result, this dual formulation will also lead to a
simpler optimization problem.
Classical Lagrangian duality enables a given optimization problem to be replaced by its dual
formulation. In this case, Lagrangian duality states that the minimum of $L$ that is found solving
the primal problem
$\min_{w, b} \; \max_{\alpha \ge 0} \; L(w, b, \alpha)$   (2.13)
coincides with the maximum of $L$ that is found solving the dual problem
$\max_{\alpha \ge 0} \; \min_{w, b} \; L(w, b, \alpha)$   (2.14)
This particular dual formulation is called the Wolfe dual (Fletcher, 1987) 1 .
1 In order to avoid confusion between the primal and dual problems, in the optimization literature the Lagrangian is
sometimes denoted by $L_P$ or $L_D$ according to the problem being solved. This notation is not used in this document.
Solving (2.14) implies finding the points where the derivatives of $L$ with respect to $w$ and $b$
vanish. Let us consider the derivative with respect to $w$ first. From (2.12) it results
$\dfrac{\partial L}{\partial w} = w - \sum_{i=1}^{P} \alpha_i y_i x_i$   (2.15)
Hence, at the solution,
$w = \sum_{i=1}^{P} \alpha_i y_i x_i$   (2.16)
Similarly, the derivative with respect to $b$ is
$\dfrac{\partial L}{\partial b} = - \sum_{i=1}^{P} \alpha_i y_i$   (2.17)
and, at the solution,
$\dfrac{\partial L}{\partial b} = 0$   (2.18)
Therefore,
$\sum_{i=1}^{P} \alpha_i y_i = 0$   (2.19)
Replacing (2.16) and (2.19) in (2.12), the Lagrangian can be written as a function of the multipliers alone:
$L(\alpha) = -\dfrac{1}{2} \sum_{i=1}^{P} \sum_{j=1}^{P} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{P} \alpha_i$   (2.20)
Note that the first term in (2.20) is a quadratic form in the coefficients $\alpha_i$. Defining the vector
$\alpha = [\alpha_1, \ldots, \alpha_P]^T$   (2.21)
this quadratic form can be written as
$-\dfrac{1}{2} \alpha^T H \alpha$   (2.22)
where $H$ is a $P \times P$ matrix whose elements are given by
$H_{ij} = y_i y_j (x_i \cdot x_j)$   (2.24)
The dual formulation enables the optimization problem to be restated in the following simplified form:
minimize
$W(\alpha) = \dfrac{1}{2} \alpha^T H \alpha - \sum_{i=1}^{P} \alpha_i$   (2.25)
subject to
$\sum_{i=1}^{P} \alpha_i y_i = 0$   (2.26)
$\alpha_i \ge 0, \quad i = 1, \ldots, P$   (2.27)
Note that $b$ does not appear explicitly in the dual formulation and, moreover, the input patterns
only appear in the optimization functional in dot products of the form $x_i \cdot x_j$. As it will be
seen in chapter 4, this property is essential for the use of kernels in support vector machines.
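To make the dual problem concrete, the following sketch builds the matrix $H$ for a small linearly separable training set and solves (2.25)-(2.27) with a general-purpose optimizer. The use of scipy's SLSQP routine and the zero starting point are choices made for the example; any quadratic programming package could be used instead, and this brute-force approach is only practical for small training sets.

import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y):
    # X: (P, n) training patterns; y: (P,) classes in {-1, +1}.
    P = len(y)
    H = (y[:, None] * y[None, :]) * (X @ X.T)    # H[i, j] = y_i y_j (x_i . x_j), eq. (2.24)
    fun = lambda a: 0.5 * a @ H @ a - a.sum()    # objective (2.25)
    jac = lambda a: H @ a - 1.0
    cons = [{'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}]   # constraint (2.26)
    res = minimize(fun, np.zeros(P), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * P,     # constraint (2.27)
                   constraints=cons)
    return res.x                                 # the Lagrange multipliers alpha_i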
2.1.2 Analysis
For the analysis of the solutions found by the optimization procedure it is convenient to make use
of the Kuhn-Tucker conditions, which are satisfied at the solution of any convex optimization
problem.
Instead of recalling the whole set of Kuhn-Tucker conditions, we will only consider one which
is particularly relevant for the analysis of SVM. This condition states that, for constraints of the form
$g_i \ge 0$ with associated multipliers $\alpha_i$, one must have
$\alpha_i g_i = 0$   (2.28)
at the solution. Note that this is a quite natural condition. In fact, it just means that either $g_i > 0$
at the solution, and in this case the constraint is irrelevant for the specific solution found, and hence
$\alpha_i = 0$, or, alternatively, $g_i = 0$ and, in this case, the constraint imposes a bound on the solution
that implies the intervention of a non-null multiplier $\alpha_i > 0$.
In the SVM case, the Kuhn-Tucker condition means that, at the solution, one must have
$\alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big] = 0, \quad i = 1, \ldots, P$   (2.29)
After the optimization, $\alpha_i = 0$ means that the corresponding constraint can be eliminated from the
Lagrangian without affecting the final solution. This means that if $\alpha_i = 0$ then pattern $x_i$ is irrelevant
for the definition of $\Pi$. Therefore, it is clear that only those patterns for which $\alpha_i > 0$ contribute
to the definition of $\Pi$. The patterns in the set
$S = \{ x_i : \alpha_i > 0 \}$   (2.30)
are the support vectors of the classifier. For these patterns, (2.29) implies that 2 $y_i (w \cdot x_i + b) = 1$
and hence
$d(x_i, \Pi) = \dfrac{1}{\|w\|}$   (2.31)
and, therefore, all support vectors correspond to the minimum classification margin and lie on
the two hyper-planes symmetrical to the decision boundary that define the optimal classification
margin.
2 Recall that since $y_i \in \{-1, +1\}$, $|y_i| = 1$.
In the dual formulation, only the Lagrange multipliers $\alpha_i$ of the primal problem are computed.
However, the classifier parameters are $w$ and $b$. But these can be derived from (2.16), resulting in
$w = \sum_{i=1}^{P} \alpha_i y_i x_i$   (2.32)
On the other hand, $b$ can be computed from any support vector by applying equation (2.29). In
practice, it is usually better to compute $b$ from the average of the results obtained from all support
vectors. If $N_S$ is the number of support vectors, and denoting by $x_s$ the $s$-th support
vector and by $y_s$ the corresponding class, one may compute $b$ from
$b = \dfrac{1}{N_S} \sum_{s=1}^{N_S} \big( y_s - w \cdot x_s \big)$   (2.33)
Since the multipliers vanish for all patterns other than the support vectors, (2.32) can also be written as
$w = \sum_{s=1}^{N_S} \alpha_s y_s x_s$   (2.34)
where $\alpha_s$ denotes the multiplier for the $s$-th support vector.
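Continuing the small sketch given after (2.27), the parameters ($w$, $b$) can be recovered from the multipliers as follows. The tolerance used to decide which multipliers are non-null is an arbitrary choice made for the example.

import numpy as np

def recover_hyperplane(alphas, X, y, tol=1e-8):
    # Implements (2.32)-(2.34): w from the multipliers, b averaged over the support vectors.
    sv = alphas > tol                       # indices of the support vectors
    w = (alphas[sv] * y[sv]) @ X[sv]        # eq. (2.34)
    b = np.mean(y[sv] - X[sv] @ w)          # average of b obtained from eq. (2.29)
    return w, b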
The classification region found by a SVM in a simple bi-dimensional distribution is depicted in
figure 2.2. Note that only 2 support vectors are used in this case. The classification region found by a
sigmoidal unit for the same input distribution is depicted in figure 2.3.

Figure 2.2: Binary classification region found by a SVM in a simple bi-dimensional distribution.

Figure 2.3: Binary classification region found by a simple sigmoidal unit in the distribution of fig. 2.2.

Figure 2.4: Binary classification region found by a SVM.

Figure 2.5: Binary classification region found by a sigmoidal unit in the same distribution of fig. 2.4.
A second example, where the structure of the distribution leads to three support vectors, is
depicted in figure 2.4. Note that, in the linearly separable case, if there is some degree of randomness
in the input distribution the SVM is almost always defined by two or three support vectors. The
solution found by a sigmoidal unit for the same input distribution is depicted in figure 2.5.
2.2 The non-separable case

The above analysis assumes that all training samples can be correctly classified by a single decision
hyper-plane. However, this is seldom the case. Even assuming that the structure and nature of
the problem is purely linear, it is necessary to take into account that training samples often result
from observations of (noisy) real world data, yielding training data often populated by misclassified,
ambiguous or outlier samples. In these cases, the constrained optimization problem outlined
before does not converge, since it is not possible to satisfy the classification constraints (equation
(2.9)). In such cases, the optimization process will result in an increasing and unbounded sequence
of Lagrange multipliers for those samples that cannot be correctly classified.
A simple solution to this problem was proposed in (Cortes and Vapnik, 1995). The idea
is to soften the constraints defined by (2.9) such that some of the patterns are allowed to lie inside
the classification margin or even on the wrong side of the classification boundary.
This is done by including in the classification constraints slack variables which create room for a
misclassification margin. More specifically, for each pattern $x_i$ in the training set a slack variable
$\xi_i$ is introduced and the constraints (2.9) are re-written as
$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, P$   (2.35)
$\xi_i \ge 0, \quad i = 1, \ldots, P$   (2.36)
Of course, the optimization problem must now provide some way of computing also the new
parameters $\xi_i$. Desirably, one would like all $\xi_i$ to vanish or, if this is not possible, that their values
be as small as possible. One way to accomplish this is to include in the objective function
a penalty term to weight and minimize the contribution of the $\xi_i$ parameters. In these conditions,
the objective function defined by (2.10) is rewritten as
$J(w, \xi) = \dfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{P} \xi_i$   (2.37)
where $C$ is a positive constant that weights the penalty term. Now the optimization problem must
be re-stated as follows: minimize (2.37) subject to the constraints (2.35) and (2.36).
As before, the constrained minimization of (2.37) can be performed through the minimization of the
Lagrangian
$L(w, b, \xi, \alpha, \mu) = \dfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{P} \xi_i - \sum_{i=1}^{P} \alpha_i \big[ y_i (w \cdot x_i + b) - 1 + \xi_i \big] - \sum_{i=1}^{P} \mu_i \xi_i$   (2.38)
where the $\alpha_i$ and the $\mu_i$ are non-negative Lagrange multipliers.
At this point, we may restate the formulation of the primal and dual problems. The primal
problem corresponds to the solution of
$\min_{w, b, \xi} \; \max_{\alpha \ge 0, \, \mu \ge 0} \; L(w, b, \xi, \alpha, \mu)$   (2.39)
The partial derivatives of $L$ with respect to $w$, $b$ and $\xi_i$ can be easily computed. From (2.38),
it becomes clear that the partial derivatives with respect to $w$ and $b$ are the same as in the separable
case, therefore also providing conditions (2.16) and (2.19). It remains the derivative of $L$ with
respect to $\xi_i$. This derivative can be computed as
$\dfrac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i$   (2.41)
and, therefore, at the solution,
$\alpha_i = C - \mu_i$   (2.42)
Replacement of this result, (2.16) and (2.19) in (2.38) makes all terms in $\xi_i$ vanish and results
again in exactly the same objective function defined by (2.25). But one additional constraint must
be taken into account. The multipliers $\mu_i$ do not appear in the simplified objective function since
they are implicitly defined by (2.42). However, since one must have $\mu_i \ge 0$, this implies that
$0 \le \alpha_i \le C$   (2.43)
In synthesis, the optimization problem in the non-separable case can now be restated as follows:
minimize
$W(\alpha) = \dfrac{1}{2} \alpha^T H \alpha - \sum_{i=1}^{P} \alpha_i$   (2.44)
subject to
$\sum_{i=1}^{P} \alpha_i y_i = 0$   (2.45)
$0 \le \alpha_i \le C, \quad i = 1, \ldots, P$   (2.46)
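In terms of the sketch given after (2.27), the only change needed for the non-separable case is the upper bound $C$ on each multiplier, as in (2.46). A possible variant, with the value of $C$ left as an argument since it is problem dependent:

import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C):
    # Identical to the hard-margin sketch except for the box constraint (2.46).
    P = len(y)
    H = (y[:, None] * y[None, :]) * (X @ X.T)
    fun = lambda a: 0.5 * a @ H @ a - a.sum()            # objective (2.44)
    jac = lambda a: H @ a - 1.0
    cons = [{'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}]   # (2.45)
    res = minimize(fun, np.zeros(P), jac=jac, method='SLSQP',
                   bounds=[(0.0, C)] * P,                # 0 <= alpha_i <= C, eq. (2.46)
                   constraints=cons)
    return res.x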
2.2.3 Analysis
The analysis of the Kuhn-Tucker conditions in the non-separable case provides some further insight
concerning the role and effect of the slack variables $\xi_i$ and of the constant $C$ in the optimization
procedure. Further to the conditions for the $\alpha_i$ multipliers, there are now the conditions that relate
the constraints $\xi_i \ge 0$ and the multipliers $\mu_i$. At the solution, one must have $\mu_i \xi_i = 0$. This means
that one of two cases may occur:
If $\xi_i = 0$, the slack variable was not needed for pattern $x_i$, which is correctly classified and
outside or on the classification margin. If, furthermore, these samples are far from the border, one
will have $\alpha_i = 0$.
If $\xi_i > 0$, then $\mu_i = 0$. This means that the optimization process needed to make use of
the slack margin $\xi_i$. In this case, the sample is either inside the classification margin
(if $0 < \xi_i \le 1$) or misclassified (if $\xi_i > 1$). In all these cases, the Kuhn-Tucker condition
imposes that $\mu_i = 0$ and, therefore, by (2.42), that $\alpha_i = C$. This means that a sample for which
$\alpha_i$ reaches the upper bound $C$ in the optimization process corresponds to a misclassified
sample or, at least, an out-of-margin sample (in the sense that $\xi_i > 0$). Note that
all samples in these conditions will correspond to support vectors of the classifier.
Let us consider again the problem of computing the classifier parameters $w$ and $b$.
As before, $w$ can be computed from the support vectors by (2.34). However, for the computation
of $b$ the Kuhn-Tucker conditions must be revisited again. Note that, with the introduction of
the slack variables, the condition (2.28) at the solution is now expressed as
$\alpha_i \big[ y_i (w \cdot x_i + b) - 1 + \xi_i \big] = 0$. Since the slack variables are not explicitly computed,
$b$ must now be computed from any support vector such that $\alpha_s < C$, for which $\xi_s = 0$. As before,
it is usually safer to use the average that results from all support vectors that verify such conditions.
Therefore, in the non-separable case, the average defined by (2.33) must exclude all support
vectors for which $\alpha_s = C$.
It remains the problem of choosing a good value of $C$ for a given classification problem. Unhappily,
the optimal value of $C$ is strongly dependent on the problem under analysis and, specifically,
on the noise, ambiguities and/or outliers that occur in the training set. In practice, $C$ is
usually fixed by a trial and error procedure.
Figures 2.6 and 2.7 contribute to a better understanding of the effect of different values of $C$ in a
non-separable problem. In principle, one would like to choose $C$ as large as possible, in order to try
to reduce the number of misclassified samples and the use of the slack variables. However, letting
$C \to \infty$ corresponds to not imposing any upper bound on the $\alpha_i$ and, as it was already discussed, such a
solution only exists if the problem is linearly separable. Therefore, in practice, it is necessary to
find a trade-off between a large $C$ and a numerically feasible solution of the optimization problem.
Figure 2.6: Non-linearly separable problem. Classification regions found by a SVM.

Figure 2.7: Same distribution of figure 2.6. Classification regions found by a SVM.
Chapter 3
Non linear classification
As described before, support vector machines can only be applied to linear classification problems.
However, the vast majority of real world classification problems are in fact non-linear by nature.
Fortunately, the SVM framework was elegantly extended to non-linear problems.
One conventional solution to solve non-linear classification problems with linear classifiers is
to generate intermediate features from the input data such that the input patterns become linearly
separable in the feature space. In most cases, this result can be achieved by performing simple but
adequate non-linear transformations of the input data. This is the basic strategy adopted, for
example, by multilayer neural networks, or multilayer perceptrons (MLP). In any MLP, the output
layer acts as a linear classifier. However, the pre-processing of the input by one or more non-linear
hidden layers may perform a transformation of the input data such that the output layer works on a
linearly separable feature space. This basic idea behind MLP networks can be traced back to the
work of Minsky and Papert (Minsky and Papert, 1969).
Figure 3.1: Extension of SVM to non linear problems through feature expansion.

The same strategy can be used to extend SVM to non-linear problems: each input pattern $x$ is first
mapped into a feature vector through a non-linear transformation
$\Phi : \mathbb{R}^n \to F, \quad x \mapsto \Phi(x)$   (3.1)
and a linear SVM is then built in the feature space $F$, as illustrated in figure 3.1.
In order to test this feature expansion approach on a non-linear classification problem we considered
the bi-dimensional classification problem represented in figure 3.2. The classes layout
corresponds to a chess board pattern, thus resulting in a reasonably hard classification problem.

Figure 3.2: Chess board data distribution.

In order to solve this problem with a SVM, several classification tests based on feature spaces
of different dimensions were performed. In this group of tests, only polynomial transformations of
the input data were performed. More specifically, for each polynomial order $p$, the feature space
was built from all features resulting from products of components of total order lower than or equal
to $p$. For example, denoting by $\Phi_p$ the expansion of order $p$ and by $x_k$ the $k$-th component of $x$,
the following transformations were used for $p$ of $1$, $2$ and $3$:
$\Phi_1(x) = (x_1, x_2)$   (3.2)
$\Phi_2(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$   (3.3)
$\Phi_3(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)$   (3.4)
Note that, for bi-dimensional inputs, this feature expansion yields, for order $p$, an output space of
dimension $(p^2 + 3p)/2$ and, therefore, the dimension of $F$ increases considerably with $p$. At the
other extreme, for $p = 1$ the SVM reduces to the linear classifier considered before.
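A compact way to generate this polynomial expansion for arbitrary input dimension and order is sketched below; the inclusion of all monomials of total order between 1 and $p$ mirrors the expansions (3.2)-(3.4), and the use of itertools is simply a convenience of the example.

import numpy as np
from itertools import combinations_with_replacement

def poly_expand(X, p):
    # X: (P, n) input patterns. Returns all monomials of total order 1..p,
    # e.g. for n = 2 and p = 2: x1, x2, x1^2, x1*x2, x2^2 (cf. eq. 3.3).
    P, n = X.shape
    feats = []
    for order in range(1, p + 1):
        for idx in combinations_with_replacement(range(n), order):
            feats.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(feats)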
Note that the components of $\Phi_p(x)$ are all built from products of the components of $x$ and,
therefore, they are usually highly correlated. Moreover, the individual components of $\Phi_p(x)$
have quite different variances depending on the order of each output feature. Due to these
facts, it is usually difficult to perform a successful optimization procedure directly on the feature
space $F$. In order to avoid these problems, an equalization of the feature space was performed by
making
$z = A \big[ \Phi_p(x) - E\{\Phi_p(x)\} \big]$   (3.6)
where $E\{\Phi_p(x)\}$ denotes the ensemble mean of $\Phi_p(x)$ and $A$ is a $\mathrm{Dim}(F) \times \mathrm{Dim}(F)$
transformation matrix defined as
$A = \Sigma^{-1/2}$   (3.7)
where $\Sigma$ is the covariance matrix of $\Phi_p(x)$:
$\Sigma = E\Big\{ \big[\Phi_p(x) - E\{\Phi_p(x)\}\big] \big[\Phi_p(x) - E\{\Phi_p(x)\}\big]^T \Big\}$   (3.8)
Note that, with this transformation, the covariance matrix of the final features $z$ is
$E\{ z z^T \} = I$   (3.9)
and, therefore, the feature space becomes orthonormalized. It was observed that this equalization
of the feature space yielded a much easier SVM optimization procedure.
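A minimal sketch of this equalization step is given below, estimating the ensemble statistics from the training set and computing the inverse square root of the covariance through its eigendecomposition; the small eps term is added only for numerical safety and is not part of the original formulation.

import numpy as np

def equalize_features(Phi, eps=1e-10):
    # Phi: (P, Dim(F)) expanded features. Returns features with zero mean
    # and (approximately) identity covariance, as in (3.6)-(3.9).
    mu = Phi.mean(axis=0)
    Z = Phi - mu
    Sigma = np.cov(Z, rowvar=False)                  # covariance matrix (3.8)
    evals, evecs = np.linalg.eigh(Sigma)             # Sigma is symmetric
    A = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T   # A = Sigma^(-1/2), eq. (3.7)
    return Z @ A, mu, A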
In order to assess the properties of polynomial features of different orders, a first set of classification
tests was performed with increasing values of the polynomial order $p$. All tests were performed
with the same value of $C$. The results of this first set of tests are depicted in figures 3.3 to 3.7.
As would be expected, as $p$ (and hence $\mathrm{Dim}(F)$) increases, a more elaborate boundary line is
built by the SVM operating on $F$. As a result, the number of misclassified samples decreases with
the increase of $p$ and a correct classification of all training patterns is reached for the highest orders
of this first set.
Two additional tests were then performed with still larger values of $p$. Figures 3.8 and 3.9
show the results of these tests. As it can be observed, increasing $p$ beyond this point yields an
increasingly complex boundary line. In fact, a too high dimension of $F$ only contributes to the
uniformization of the distances of each feature vector to $\Pi$ (note that this is an expected consequence
of the topological properties of high dimensional spaces).
For the highest orders tested, the SVM optimization procedure converges to solutions in which
all samples lie on the boundary of the classification margin and, therefore, all
data patterns become support vectors. These solutions correspond to increasingly fragmented
classification regions and the regularization effect of SVM is, in these cases, significantly lower.
Note that, for these orders, the dimensionality of the feature space becomes higher than the number of
training samples. This can be a possible justification for the type of solution
found by the optimization procedure.
Figure 3.3: SVM classification of the chess board distribution of figure 3.2. Linear case.

Figures 3.4 to 3.9: SVM classification of the chess board distribution of figure 3.2. Polynomial feature expansions of increasing order.

Figure 3.10: MLP classification of the chess board distribution of figure 3.2.

Figure 3.11: RBF classification of the chess board distribution of figure 3.2 (network with 8 input gaussians).
For comparison purposes, the same classification problem was also addressed with a MLP network
and with an RBF network with 8 input gaussians. The classification regions found in each case are
depicted in figures 3.10 and 3.11. Note that both networks share the same number of free parameters,
but the particular choice of 8 hidden units was rather arbitrary at this stage. As would be expected,
the solution found by the MLP network is based on a segmentation of the input space with hyper-planes
such that all training samples are correctly classified. Apparently, the particular segmentation found
by the MLP is quite arbitrary, and the solution depicted in figure 3.10 will hardly provide a reasonable
generalization behavior (but it must be emphasized that input distributions of this type,
punctuated by local data clusters, are among the most difficult problems for MLP classifiers). On
the other hand, it becomes clear that the RBF network provides here a quite simple and natural
result, based on centering one (or more) gaussians on each input cluster of one of the two classes
1 . Apparently, this result will provide a better generalization behavior than the one obtained by
the MLP network. But, in any case, the decision boundary built by the SVM seems a
much better match to the classification regions that one would heuristically expect to be associated
with the structure of this classification problem.
At this stage, one remark must be made regarding the tests performed throughout this document.
Since our aim is to perform a qualitative comparison of typical solutions found by different
classifiers under similar conditions, no test set was used to provide a criterion for early stopping
of the learning procedure in RBF and MLP networks. It is well known that early stopping
may remarkably improve the generalization ability of MLP networks, namely when the number
of free parameters is close to the number of training patterns (as happens in the majority of the
tests described herein). This means that the solutions found by MLP networks presented in this
document are certainly affected by over-learning. But it must be considered that our goal is
also to assess the regularization effect of the learning criteria itself. In this sense, all tests were
performed under similar and reasonably fair conditions.

1 Note that the desired value of the white square patterns was set to 0 for the RBF networks, while the classification
threshold was set to 0.5. This strategy simplifies considerably the RBF task. On the other hand, it must be considered
that if the desired values of the target classes are interchanged, the solution found is based on gaussians centered on the
complementary class, therefore yielding a rather different decision boundary.
Chapter 4
Non linear SVM and kernels
4.1 Definition
As it was discussed in chapter 3, the feature expansion approach requires the explicit translation
of the training patterns from a (usually) low dimensional space $\mathbb{R}^n$ to a (usually) much higher
dimensional space $F$ where the problem is linearly separable. In this section, we discuss how the
theory of kernels may avoid the explicit computation of the patterns in the high dimensional space
$F$.
Note that, in the dual formulation, the optimization problem itself only requires the computation
of the dot products $\Phi(x_i) \cdot \Phi(x_j)$. Now, assume that we define this dot product as a direct
function of $x_i$ and $x_j$. Or, by other words, that we define a function $K(x_i, x_j)$ (called kernel
function) such that
$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$   (4.2)
If $K(x_i, x_j)$ can be directly computed from $x_i$ and $x_j$, we avoid the explicit computation of the
features $\Phi(x)$. In this case, the elements of the matrix $H$ can be simply defined as
$H_{ij} = y_i y_j K(x_i, x_j)$   (4.3)
In the more general case, one may even define directly the kernel $K(x_i, x_j)$ without the
explicit definition of $\Phi$. Of course, some care must be taken when choosing kernel functions.
A valid kernel requires that some transformation of $\mathbb{R}^n$ into $F$ exists such that the kernel represents
in fact a dot product in $F$, even if such mapping does not need to be explicitly defined. Note that,
for some kernel functions, $F$ may even live on an infinite dimensional Hilbert space.
The conditions under which a kernel function corresponds to a dot product in some space $F$ are
given by Mercer's condition. Broadly speaking, the kernel admits an expansion of the form
$K(x, y) = \sum_k a_k \, \Phi_k(x) \, \Phi_k(y), \quad a_k \ge 0$   (4.4)
($\Phi_k$ denoting the $k$-th component of $\Phi$) if, for any function $g$ in the input space
such that
$\int g(x)^2 \, dx < \infty$   (4.5)
then
$\iint K(x, y) \, g(x) \, g(y) \, dx \, dy \ge 0$   (4.6)
Note that the validity of a kernel function requires that (4.6) is satisfied for any finite energy
function (finite norm). This may not be easy to prove, even for quite simple kernel types.
On the other hand, it must be emphasized that some kernels often used do not satisfy Mercer's
condition but, nevertheless, they are able to provide reasonable solutions to the optimization problem.
Different types of SVM classifiers may be derived for the same problem by using different kernel
functions. Some of the most used kernel functions include:
Polynomial:
$K(x, y) = (x \cdot y)^p$   (4.7)
The previous kernel function may result in an Hessian with zero determinant. In order to avoid
this problem, the following alternative polynomial kernel is often used:
$K(x, y) = (x \cdot y + 1)^p$   (4.8)
Radial basis function:
$K(x, y) = \exp\!\Big( -\dfrac{\|x - y\|^2}{2 \sigma^2} \Big)$   (4.9)
Sigmoidal:
$K(x, y) = \tanh\big( \kappa \, (x \cdot y) + \theta \big)$   (4.10)
which corresponds to a valid dot product only for some values of the parameters $\kappa$ and $\theta$.
The last two kernels (radial and sigmoidal) correspond to infinite dimensional feature spaces. It
must be emphasized that the proper choice of the kernel function is one of the least clarified issues
in the use of SVM.
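For reference, the polynomial and radial basis function kernels above can be written as in the sketch below, together with the corresponding matrix $H$ of (4.3). The vectorized distance computation is just an implementation convenience.

import numpy as np

def polynomial_kernel(X1, X2, p=3, c=1.0):
    # (x . y + c)^p ; the constant c (eq. 4.8) helps avoid a singular Hessian.
    return (X1 @ X2.T + c) ** p

def rbf_kernel(X1, X2, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2)), eq. (4.9).
    d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def dual_matrix(X, y, kernel, **params):
    # H[i, j] = y_i y_j K(x_i, x_j), eq. (4.3); drop-in replacement for x_i . x_j.
    return (y[:, None] * y[None, :]) * kernel(X, X, **params)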
It was previously shown how the kernel function can avoid the explicit computation of the features
$\Phi(x)$ in the optimization process. However, it remains to discuss how the SVM can be used for
classification. In fact, the separating hyper-plane lives in $F$, and the evaluation of the output
class for an arbitrary input pattern $x$ requires the computation of
$y(x) = w \cdot \Phi(x) + b$   (4.14)
Replacing $w$ by its expansion in terms of the support vectors (2.34), now expressed in the feature
space, and using (4.2), it results that
$y(x) = \sum_{s=1}^{N_S} \alpha_s y_s \, \Phi(x_s) \cdot \Phi(x) + b$
and therefore
$y(x) = \sum_{s=1}^{N_S} \alpha_s y_s \, K(x_s, x) + b$   (4.17)
Equation (4.17) provides a simple and effective way of computing the output avoiding the
explicit knowledge of $\Phi$.
The computation of the parameter $b$ was not addressed yet, but it can be performed by a similar
approach. From (4.17), it is clear that, choosing any support vector $x_s$ such that $\alpha_s < C$, one
may evaluate
$b = y_s - \sum_{r=1}^{N_S} \alpha_r y_r \, K(x_r, x_s)$   (4.18)
Once again, it is usually numerically safer to compute $b$ from the average of all support vectors for
which $\alpha_s < C$.
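Putting (4.17) and (4.18) together, a sketch of the resulting classifier might look as follows; it assumes the multipliers come from a solver such as the earlier sketches and that the kernel argument is one of the functions defined above.

import numpy as np

def svm_b(alphas, X, y, kernel, C, tol=1e-8, **params):
    # Average of eq. (4.18) over the support vectors with alpha_s < C.
    K = kernel(X, X, **params)
    margin_sv = (alphas > tol) & (alphas < C - tol)
    return np.mean(y[margin_sv] - (alphas * y) @ K[:, margin_sv])

def svm_classify(x_new, alphas, X, y, b, kernel, tol=1e-8, **params):
    # Eq. (4.17): sign of the kernel expansion over the support vectors.
    sv = alphas > tol
    k = kernel(X[sv], np.atleast_2d(x_new), **params)[:, 0]
    return np.sign(np.sum(alphas[sv] * y[sv] * k) + b)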
For a radial basis function kernel, (4.17) assumes the form
$y(x) = \sum_{s=1}^{N_S} \alpha_s y_s \exp\!\Big( -\dfrac{\|x - x_s\|^2}{2\sigma^2} \Big) + b$   (4.19)
and therefore the final classifier is equivalent to an RBF network with gaussians of identical
variance, each one centered on a given support vector and with an output weight given by $\alpha_s y_s$.
For the sigmoidal kernel, (4.17) becomes a sum of hyperbolic tangents of linear functions of the
input and, in this case, the classification performed by the SVM is equivalent to the output of a
MLP network with a linear output unit with bias $b$. In this equivalent network, the hidden units
have an hyperbolic tangent activation function, with the weights from the inputs to hidden unit $s$
determined by the support vector $x_s$ and a weight from hidden unit $s$ to the output given by $\alpha_s y_s$.
In both cases, it is however remarkable that the SVM learning criteria avoids the explicit
specification of the number of gaussians or hidden units, which results spontaneously from the
optimization process.
At this point it is interesting to consider again the chess board problem and compare the classification
region obtained by explicit feature expansion methods with the one resulting from a SVM
with a kernel. The result obtained with an RBF kernel is depicted in figure 4.1.
Note that this result is similar to the best result obtained with polynomial feature expansion.
However, in this case, a quite good solution was found without the need to specify the
dimension of $F$ or any learning parameter 1 . This is certainly one of the most interesting features
of the use of kernels in SVM.

Figure 4.1: SVM classification of the chess board distribution of figure 3.2. RBF kernel.

1 The weight coefficient $C$ is the only training parameter which must be set in the SVM optimization procedure. However,
in this particular problem, it was not required to impose an upper bound on the multipliers.
Separable case

The second problem addressed with SVM and a RBF kernel function was the quite simple
distribution depicted in figure 4.2. This distribution has 31 and 65 patterns in each one of the two
classes, in a total of 96 patterns.
The results of this problem using an SVM classifier with a RBF kernel and a MLP
network are depicted in figures 4.3 and 4.4. Both classifiers were able to correctly classify all
samples of the training set but, once again, the MLP solution is far less reasonable than the one
found by the SVM.
The same classification problem was also solved with two RBF networks. The first one, with 8
gaussians (figure 4.5), has the same number of parameters as the MLP network, while the second,
with 14 gaussians (figure 4.6), enables a solution with the same number of gaussians as the one
found by the SVM. It is clear that both networks find much more reasonable solutions than the
one found by the MLP network, and probably both networks would generalize well. However,
the solution found by the SVM seems again much more balanced and plausible than those found
by both RBF networks, which are apparently much more marked by the circular shape of the
gaussians.
Figure 4.3: Classification region for the cross problem with a SVM with a RBF kernel.

Figure 4.4: Classification region for the cross problem with a MLP network.

Figure 4.5: Classification region for the cross problem with an eight gaussian RBF network.

Figure 4.6: Classification region for the cross problem with a 14 gaussian RBF network.

Figure 4.7: Data distribution in the cross problem with 4 outliers (P=100).
Noisy case
In order to assess the performance of the several classifiers in the presence of noise or outliers, the
cross problem was again tested but, this time, four outliers were added to the data distribution (see
figure 4.7). The results of this problem using a SVM classifier with a RBF kernel and a
MLP network are depicted in figures 4.8 and 4.9.
In this case, neither of the two classifiers was able to correctly classify all training samples.
But, again, it must be emphasized how the MLP quickly reacted trying to correctly classify all
outlier samples, therefore yielding a quite implausible solution with only one misclassified
sample. On the other hand, the SVM reacted in a much more moderate way, basically making use
of the slack variables to build a classification boundary much closer to the one derived in the
absence of outliers, and leaving the four outliers misclassified.
The same classification problem was again repeated with pure RBF networks. This time, the
first test was performed again with eight gaussians (figure 4.10) and a second one with 23 gaussians
(figure 4.11) in order to offer the same degrees of freedom as the solution found by the SVM. In both
cases the RBF performed reasonably well, it being remarkable that in the first case the classifier
basically ignored the outliers and built a solution also quite similar to the one found with the clean
data distribution. However, with 23 gaussians, the solution found by the RBF network, with only
one misclassified sample, is already clearly subject to over-learning, yielding a less reasonable
decision boundary.
Figure 4.8: Classification region for the cross problem with 4 outliers and a SVM with a RBF kernel.

Figure 4.9: Classification region for the cross problem with 4 outliers and a MLP network.

Figure 4.10: Classification region for the cross problem with four outliers and a RBF network (8 gaussians).

Figure 4.11: Classification region for the cross problem with four outliers and a RBF network (23 gaussians).

Figure 4.12: Source data distribution for the sinusoidal problem (P=320).
Separable case
The sinusoidal problem represents one case where the data distribution is certainly far away from
those usually found in real world problems. But any classifier must build a quite complex boundary
line to correctly classify this problem and, therefore, it was considered an interesting case study.
The source data distribution was built in a rather deterministic way and is represented in
figure 4.12.
The solutions found in this classification problem by a SVM with an RBF kernel and a
MLP network are depicted in figures 4.13 and 4.14. Again, both classifiers were able to correctly
classify all data samples. However, while the SVM built an elegant and balanced boundary line
between both classes, the MLP solution was again marked by the straight lines which correspond to
the decision hyper-planes of the sigmoidal units. While this solution builds a reasonable decision boundary, it
is clear that, for some samples, the classification margin is rather small.
Once again, we tried two RBF networks on this problem. The first one, based on the fixed
reference of 8 gaussians, is depicted in figure 4.15. It is clear that, in this particular case, the
number of gaussians in the RBF network was far below the minimum required to achieve a
correct classification of all training patterns, therefore providing an odd – and useless – solution.
In the second test, the RBF network was tested with the same number of gaussians (35) used by
the SVM. In this case, the RBF network was able to find a reasonable solution that correctly
classifies all data patterns. However, once more, the effect of the circular shaped gaussians
is still noticeable on the boundary line.

Figure 4.13: Sinusoidal problem. Solution found by a SVM with a RBF kernel.
Figure 4.14: Sinusoidal problem. Solution found by a MLP network.

Figure 4.15: Sinusoidal problem. Solution found by a RBF network (eight gaussians).

Figure 4.16: Sinusoidal problem. Solution found by a RBF network (35 gaussians).
Chapter 5
Implementation issues
As discussed in the previous chapters, the SVM learning procedure reduces to a quadratic programming problem
subject to fairly simple linear constraints. This problem can be easily solved using standard optimization
tools.
When the size of the training set is small, the Matlab standard quadratic programming
function qp can be directly used to build the SVM. The qp Matlab function solves quadratic
programs of the general form: minimize
$\dfrac{1}{2} x^T H x + f^T x$   (5.2)
subject to linear inequality constraints
$A x \le b$   (5.4)
and, optionally, to lower and upper bounds on the variables,
$x_l \le x \le x_u$   (5.5)
The qp Matlab function is supplied in source code, therefore providing a good starting point
for the analysis and implementation of a simple quadratic programming package. Its main drawback
remains the need to fully build the $H$ matrix before the optimization procedure. Therefore, this
approach requires an amount of memory in the order of $P^2$ ($P$ being the training set size).
Another approach to the implementation of SVM is to write dedicated code based on any
standard non-linear programming package. The DONLP2 package, available from the Netlib
repository, is a powerful nonlinear programming package developed by Prof. Dr. Peter Spellucci
(e-mail [email protected]). It uses an SQP algorithm and dense-matrix linear
algebra. Source for DONLP2 is available by ftp from netlib ftp servers, e.g.,
ftp://netlib.bell-labs.com/netlib/opt/donlp2.tar.
DONLP2 was originally written in Fortran, but it can be translated to C by the widely available
f2c translator. The main DONLP2 module is an optimization routine which is parametrized via
global variables included in standard Fortran common blocks. All the examples included in this
document were computed by a dedicated C program which makes use of the DONLP2 package
via a simplified set of C interface routines.
As stated before, the SVM optimization procedure implies solving a fairly large quadratic
programming problem. DONLP2, for instance, requires an amount of memory that grows with the
square of the number of training samples and easily exceeds 2 GBytes of memory with less than
10000 training samples. On the other hand, each iteration of the optimization procedure is a complex
process which may require a large amount of time for large values of $P$.
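To give a feel for this scaling, the dense $P \times P$ Hessian alone already accounts for a large share of that memory; the short computation below assumes double precision (8 bytes per entry) and ignores the additional working arrays that a solver such as DONLP2 allocates on top of it.

def hessian_memory_gbytes(P, bytes_per_entry=8):
    # Memory needed to store a dense P x P Hessian in double precision.
    return P * P * bytes_per_entry / 2**30

print(hessian_memory_gbytes(10000))   # roughly 0.75 GBytes for the Hessian alone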
Another procedure for building SVMs from large databases was proposed in (Osuna et al.,
1997). The proposed method splits the optimization procedure into several smaller subtasks. Within
each subtask, a given number of parameters is kept fixed and the optimization is performed over
the remaining multipliers. This procedure is then iterated over the parameter subsets
until a stopping criterion is met. The authors report that this procedure made it possible to build a SVM with
about 100,000 support vectors from a data base with 110,000 samples.
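A crude sketch of this decomposition idea is given below; it is not the algorithm of Osuna et al., only an illustration of the working-set principle under strong simplifying assumptions (the full kernel matrix is kept in memory, the working sets are swept in a fixed order, and a fixed number of sweeps replaces a proper stopping criterion).

import numpy as np
from scipy.optimize import minimize

def decomposition_train(K, y, C, q=50, sweeps=10):
    # K: (P, P) kernel matrix; optimizes q multipliers at a time, keeping the rest fixed.
    P = len(y)
    H = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(P)
    for _ in range(sweeps):
        for start in range(0, P, q):
            B = np.arange(start, min(start + q, P))          # working set
            N = np.setdiff1d(np.arange(P), B)                # fixed multipliers
            lin = H[np.ix_(B, N)] @ alpha[N] - 1.0           # linear term induced by the fixed part
            fun = lambda a: 0.5 * a @ H[np.ix_(B, B)] @ a + lin @ a
            jac = lambda a: H[np.ix_(B, B)] @ a + lin
            cons = [{'type': 'eq', 'fun': lambda a: y[B] @ a + y[N] @ alpha[N]}]
            res = minimize(fun, alpha[B], jac=jac, method='SLSQP',
                           bounds=[(0.0, C)] * len(B), constraints=cons)
            alpha[B] = res.x
    return alpha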
Chapter 6
Conclusions
Support Vector Machines rely on the concept of optimal margin classification. While this is a
reasonably understood concept in linear classification problems, its scope and conceptual basis are
less trivial in nonlinear applications.
SVM address nonlinear problems by projecting the input space into a high dimensional feature
space where the problem is assumed to be linearly separable. However, the application of the optimal
margin concept to the feature space requires some further discussion. In fact, the classification
boundary that is derived by the SVM theory in the input data space is strongly dependent on the
transformation that is performed from the input space to the feature space. This is affected by the
choice of the feature expansion or the kernel function. But since different feature expansions lead
to different solutions, the classifier found by the SVM theory for nonlinear problems is not unique
and does not comply with a universal optimality criterion.
It was also discussed how SVM classifiers depend on the upper bound defined for the Lagrange
multipliers. While it is clear that the proper choice of this parameter depends on the
amount of noise in the data set, there is not yet a well established procedure to set this parameter.
Nevertheless, the examples presented in this short report show that support vector machines ex-
hibit a remarkable regularization behavior, often providing better classification boundaries than
those found by other standard classifiers, namely MLP and RBF networks. When used with RBF
kernels (Girosi, 1997), SVM provide an interesting solution to the choice of the centers of Radial
Basis Functions, playing an important role in the choice of solutions which are sparse and, at
the same time, can be related to the principle of Structural Risk Minimization. This point pro-
vides a strong motivation for the analysis of SVM as classifiers with near optimal generalization
properties.
References
Baum, E. (1990). When are k-nearest neighbour and back propagation accurate for feasible sized
sets of examples? In Almeida, L. B. and Wellekens, C. J., editors, Neural Networks, Lec-
ture Notes in Computer Science, pages 69–80. EURASIP Workshop on Neural Networks,
Sesimbra, Portugal, Springer-Verlag.
Baum, E. and Haussler, D. (1989). What size net gives valid generalization? In Touretzky, D.,
editor, Advances in Neural Information Processing Systems 1, pages 81–90, San Mateo, CA.
Morgan Kaufmann.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2):121–167.
Cohn, D. and Tesauro, G. (1991). Can neural networks do better than the Vapnik-Chervonenkis
bounds? In Lippman, R., Moody, J., and Touretzky, D., editors, Advances in Neural Infor-
mation Processing Systems 3, pages 911–917, San Mateo, CA. Morgan Kaufmann.
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20:273–297.
Fletcher, R. (1987). Practical Methods of Optimization. John Wiley and Sons, Inc.
Girosi, F. (1997). An equivalence between sparse approximation and support vector machines.
Technical Report A.I. Memo No. 1606, Massachusetts Institute of Technology, Artificial In-
telligence Laboratory.
Gunn, S. (1997). Support vector machines for classification and regression. Technical Report
http://www.isis.ecs.soton.ac.uk/research/svm/svm.pdf, University of Southampton.
Guyon, I., Vapnik, V., Bottou, L., and Solla, S. A. (1992). Structural risk minimization for char-
acter recognition. In Hanson, S., Cowan, J., and Giles, C., editors, Advances in Neural
Information Processing Systems 4, pages 471–479, San Mateo, CA. Morgan Kaufman.
Ji, C. and Psaltis, D. (1992). The VC-dimension versus the statistical capacity of multilayer
networks. In Hanson, S., Cowan, J., and Giles, C., editors, Advances in Neural Information
Processing Systems 4, pages 928–935, San Mateo, CA. Morgan Kaufman.
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Hanson,
S., Cowan, J., and Giles, C., editors, Advances in Neural Information Processing Systems 4,
pages 950–957, San Mateo, CA. Morgan Kaufman.
Le Cun, Y., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. (1990).
Handwritten digit recognition with a backpropagation network. In Touretzky, D., editor, Ad-
vances in Neural Information Processing Systems 2, pages 396–404, San Mateo, CA. Morgan
Kaufmann.
Osuna, E., Freund, R., and Girosi, F. (1997). An improved training algorithm for support vector
machines. In Proc. of IEEE NNSP 97, Amelia Island.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organi-
zation in the brain. Psychological Review, 65:386–408.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by
error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of
Cognition. MIT Press, Cambridge, MA.
Vapnik, V. N. (1992). Principles of risk minimization for learning theory. In Hanson, S., Cowan, J.,
and Giles, C., editors, Advances in Neural Information Processing Systems 4, pages 831–838,
San Mateo, CA. Morgan Kaufman.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Widrow, B. and Hoff, M. (1960). Adaptive switching circuits. In IRE WESCON Convention
Record, Part 4, pages 96–104.