NPTEL
NPTEL ONLINE CERTIFICATION COURSE
Introduction to Machine Learning
Lecture 28
Prof. Balaraman Ravindran
Computer Science and Engineering
Indian Institute of Technology Madras
Support Vector Machines II
Interpretation & Analysis
So this is the optimization problem, and it is actually a simple optimization problem: a quadratic objective and a set of linear constraints. We have already seen how to solve this; you had a convex optimization tutorial, and one of the things we expect from that tutorial is that you know how to solve this kind of problem. So what we do after this is write the Lagrangian.
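For reference, here is my reconstruction of the problem on the slide, assuming the standard hard-margin setup from the previous lecture (separating hyperplane xᵀβ + β0 = 0, labels yi ∈ {−1, +1}):

$$
\min_{\beta,\,\beta_0}\ \tfrac{1}{2}\|\beta\|^{2}
\qquad \text{subject to}\quad y_i\,(x_i^{T}\beta + \beta_0) \;\ge\; 1,\quad i = 1,\dots,N,
$$

with the Lagrangian (primal) function, using multipliers αi ≥ 0,

$$
L_P \;=\; \tfrac{1}{2}\|\beta\|^{2} \;-\; \sum_{i=1}^{N} \alpha_i \big[\, y_i\,(x_i^{T}\beta + \beta_0) - 1 \,\big].
$$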
(Refer Slide Time: 01:46)
I have to apply this for every data point, so i runs from 1 to N. And I put a P there, so this is the primal. We will have to form the dual of this; the dual looks a lot easier to solve — it actually is a lot easier to solve — so we will go ahead and work with the dual. First I take the derivatives: the derivative with respect to β, which you can set to zero and solve, and the derivative with respect to β0. Now is where I am going to do some hand waving, but you can go through the computation yourself: take those, substitute them back into this, and do a lot of simplification. Remember we have this β squared term here, so I am going to get αiαjyiyj kinds of terms.
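Written out (again my reconstruction of the hand-waved steps), setting the derivatives of L_P to zero gives

$$
\frac{\partial L_P}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^{N}\alpha_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0,
$$

and substituting these back into L_P yields the dual objective

$$
L_D \;=\; \sum_{i=1}^{N}\alpha_i \;-\; \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y_i\, y_j\, x_i^{T}x_j ,
$$

which is to be maximized over the αi.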
(Refer to slide time 03.26)
(Refer to slide time: 04.38)
So the dual is going to be of a slightly simpler form. Why is it slightly simpler? Because the constraints I have to consider have become a lot simpler: it is just that the αi's should be non-negative — that is all my constraints are. It turns out that there are efficient ways of solving optimization problems of this form, so you do not have to worry about it; there are lots of packages that solve SVMs for you.
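As a rough illustration of what such a package is doing, here is one way to hand the dual above to a generic convex solver — a minimal sketch using cvxpy; the function name and setup are mine, and production SVM libraries (for example LIBSVM) use specialised solvers such as SMO rather than a general-purpose QP solver:

```python
import numpy as np
import cvxpy as cp

def fit_hard_margin_dual(X, y):
    """Solve the hard-margin SVM dual for X of shape (n, d), y in {-1, +1}^n."""
    n = X.shape[0]
    alpha = cp.Variable(n)
    G = y[:, None] * X                        # row i is y_i * x_i
    # L_D = sum_i alpha_i - 1/2 * || sum_i alpha_i y_i x_i ||^2
    #     = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha))
    constraints = [alpha >= 0,                              # dual feasibility
                   cp.sum(cp.multiply(alpha, y)) == 0]      # from dL_P/dbeta_0 = 0
    cp.Problem(objective, constraints).solve()
    return alpha.value
```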
But you still need to know what kind of optimization problem you are solving — I do not want you to use it as a black box — and essentially what you are going to be solving is this dual. Now, when you have a solution to a problem that has both a primal and a dual (and we can actually show that the duality gap is zero in this case, so I will not go into that), the point is that the solution has to satisfy certain conditions.
We have already looked at these: the KKT conditions. If you do not remember them, please go back and revise. There are a whole bunch of requirements. You need the solution to be primal feasible, and you need the solution to be dual feasible. Dual feasible means your αi's have to be non-negative; primal feasible means the original constraints have to hold, because it is a solution to the primal. And then you have complementary slackness, which in this case becomes the following.
(Refer to slide time 06.37)
In the notes I think you saw this condition written as λifi = 0; here the multiplier is αi and fi is the constraint yi(xiᵀβ + β0) − 1, so it is αifi = 0. That is the fourth condition, and together these are the KKT conditions that need to be satisfied.
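Collected in one place (my summary, in the notation above), the conditions are

$$
\begin{aligned}
&\beta = \textstyle\sum_i \alpha_i y_i x_i, \qquad \textstyle\sum_i \alpha_i y_i = 0 && \text{(stationarity)}\\
&y_i\,(x_i^{T}\beta + \beta_0) - 1 \;\ge\; 0 && \text{(primal feasibility)}\\
&\alpha_i \;\ge\; 0 && \text{(dual feasibility)}\\
&\alpha_i\,\big[\, y_i\,(x_i^{T}\beta + \beta_0) - 1 \,\big] \;=\; 0 && \text{(complementary slackness)}
\end{aligned}
$$

for every i = 1, …, N.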
So what does this tell us? It tells us a couple of things. One: we know what the form of β should be. What is the form of β? It has to be Σi αi yi xi. So essentially your β is obtained by taking certain data points from your training data and adding them up, suitably multiplied by the desired output: if xi's output was positive, the multiplier yi is +1; if xi's output was negative, yi is −1.
So it is going to take a few of those points and add them up. This should remind you of perceptrons. If you remember, what we did with perceptrons is we took whatever was misclassified and kept adding it to the weight vector. In some sense we are doing something very similar here, but instead of a somewhat heuristic approach to the optimization — we did do gradient descent, but we just said we would arbitrarily pick the set of misclassified points and do gradient descent on those, and so on — here we started off by saying we will maximize the distance to the closest point, and from that we derived something that looks very suspiciously like the perceptron update rule. In fact, nowadays when people say they are going to train a perceptron, they are more often doing this than using the perceptron learning rule right away. Now, something else that you can observe:
the complementary slackness condition has to be satisfied for every point. Let us look at it; there are two factors in the product, so when will αi [ yi (xiᵀβ + β0) − 1 ] be 0? When either αi is 0 or yi (xiᵀβ + β0) − 1 is 0. Yes, you are right, but can you give me a geometric answer? αi has to be zero whenever the other factor is not 0. So when will the other factor not be 0? When xi is not the closest point. If xi is the closest point, it will be bang on the margin, and for such a point that factor will be 0. For a point over here, yi (xiᵀβ + β0) will be greater than 1, and for a point over there it will likewise be greater than 1.
You see that: since yi (xiᵀβ + β0) is greater than 1, the factor in the square brackets is non-zero, so those αi's have to be 0. So what does this mean? It means that points that are farther away from the hyperplane do not contribute to finding β; because their α's are zero, points far away from the hyperplane play no role in finding β. In fact, the points that contribute to β are exactly the points that lie on the margin.
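In code, continuing the earlier sketch (alpha comes from the hypothetical fit_hard_margin_dual above, and the 1e-6 tolerance is my own choice to absorb solver noise):

```python
alpha = fit_hard_margin_dual(X, y)
support = alpha > 1e-6                              # only margin points have alpha_i > 0
beta = (alpha[support] * y[support]) @ X[support]   # beta = sum_i alpha_i y_i x_i
```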
In fact, for this data set that we drew here, there are only two important points — that one and this one — because only two points lie on the margin. Such points, which lie on the margin, are known as support points or support vectors, and your β is going to depend only on the support points. What about β0? We can plug any support point into the margin condition here and solve for β0.
Take one of these support points, plug it in, and solve for β0. Which support point do you pick? Ideally all of them should give you the same answer, but that usually does not happen, for numerical reasons. So what people typically do is plug in all the support points, solve for β0 from each one in turn, and take the average: every support point gives a slightly different β0, and you just take the average.
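Continuing the same sketch, the margin condition yi (xiᵀβ + β0) = 1 on a support point gives β0 = yi − xiᵀβ (since yi ∈ {−1, +1}), and averaging over the support points:

```python
beta0 = np.mean(y[support] - X[support] @ beta)     # average of y_i - x_i^T beta over support points
```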
So that is how you compute the hyperplane at the end of it. Now, a question came up: when would α be 0 even though the data point is on the margin? That is one case where it can happen: essentially when you have two points at the same location — not merely collinear, but repeated, two data points at exactly the same point. By definition the support vectors all lie on the same line (the margin) anyway, so collinearity alone cannot be the issue. In such cases it could happen, but these are generally degenerate cases. Sure, call them support vectors if you want.
One more thing to note is my f̂: how is this going to look, now that I have this form for β? Substituting, it is essentially going to look like Σi αi yi xiᵀx — I can flip those inner products around either way — plus β0.
(Refer to slide time 15.24)
So if you think about it — and I will come back to this point later — if you look at the dual, the data enter only through inner products xiᵀxj, and if you look at the final classifier I am going to use, again I only have inner products xiᵀx. So if I have a very efficient way of computing these inner products, I can do some tricks with this whole thing; we will come back to that. I just want you to remember this. Any questions on this?
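Just to fix that point before we move on, here is how the fitted classifier looks in the running sketch (names carried over from my earlier code): the data enter only through inner products with the support vectors.

```python
def f_hat(x_new):
    """f_hat(x) = sum over support vectors of alpha_i y_i x_i^T x, plus beta0."""
    return (alpha[support] * y[support]) @ (X[support] @ x_new) + beta0

# Predicted class for a query point x_new of shape (d,):  np.sign(f_hat(x_new))
```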
Before we move on, I just wanted to point out something. Think about how LDA works. LDA essentially does density estimation: you make some assumptions about the form of the probability distribution. What assumption do you make? That it is Gaussian, with equal covariance across all the classes.
That essentially means that every data point in your training set contributes towards the parameters you are estimating. So the β you estimate there will depend on all the data points given to you, whether they are close to the hyperplane or very far away from it. All of the data points determine your class boundary, and that means it becomes a little susceptible to noise.
If I have one or two data points that were generated by noise, even those will contribute to determining the separating hyperplane. On the other hand, with this kind of optimal hyperplane we only worry about points that are close to the boundary. I can do whatever I want out here — move a few points around and things like that — and it does not really matter.
What matters is whether any noise enters close to the boundary. So in some sense, if my noise is uniform, LDA will be affected more, because even if the noise inserts some points far away, the LDA classifier will change, whereas my optimal-hyperplane classifier will be affected only by the fraction of the noise that changes the actual decision surface. Having said that, I should point out that if your data is truly Gaussian with equal covariance, LDA is actually optimal — provably optimal — while this one will depend on the actual data that you get. But in general I would say this is preferable because it is more stable. People remember what stability is, right? Small changes in the data should not cause the classifier to change significantly.
Here, small changes in the data will not cause the classifier to change significantly, in an expected sense. If I take a support vector and move it somewhere else, the class boundary will change. But I have a whole bunch of other points which I can move around, and nothing will happen to the class boundary, unless I move one of them closer to the hyperplane than the existing support vectors. If I take a point from here and move it over there, of course the class boundary will change; but as long as I do not change which points are the support vectors, I get back the same classification surface again and again. So in that sense the SVM — we will come to SVMs in a little bit — this kind of optimal hyperplane is very stable.
IIT Madras Production
Funded by
Department of Higher Education
Ministry of Higher Education
Government of India
www.nptel.ac.in
Copyrights Reserved