CSE 575: Statistical Machine Learning
Jingrui He
CIDSE, ASU

Instance-based Learning

1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): Unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor.
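
A minimal sketch of these four choices in Python (illustrative names, NumPy assumed; Euclidean distance, one neighbor, no weighting, copy the neighbor's output):

```python
# 1-Nearest Neighbor: predict the output of the single closest training point.
import numpy as np

def one_nn_predict(x_query, X_train, y_train):
    """Return the output of the Euclidean-nearest training example."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # 1. distance metric
    return y_train[np.argmin(dists)]                   # 2-4. copy the nearest output

# Tiny example: two 2-d training points with outputs 0 and 1.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 1])
print(one_nn_predict(np.array([0.9, 0.8]), X_train, y_train))  # -> 1
```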

Consistency of 1-NN
- Consider an estimator f_n trained on n examples (e.g., 1-NN, regression, ...)
- The estimator is consistent if the true error goes to zero as the amount of data increases; e.g., for noise-free data, consistent if error_true(f_n) -> 0 as n -> infinity
- Regression is not consistent! (Representation bias)
- 1-NN is consistent (under some mild fine print)
- What about variance???

1-NN overfits?

k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): Unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors.
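
The same sketch with k neighbors, averaging their outputs as in item 4 (illustrative names, NumPy assumed):

```python
# k-Nearest Neighbor: average the outputs of the k Euclidean-nearest points.
import numpy as np

def knn_predict(x_query, X_train, y_train, k=3):
    """Average the outputs of the k nearest training examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance metric
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()                     # unweighted average (weighting unused)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_predict(np.array([1.5]), X_train, y_train, k=2))  # -> 2.5
```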

k-Nearest Neighbor (here k=9)
K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.
What can we do about all the discontinuities that k-NN gives us?

Curse of dimensionality for instance-based learning
- Must store and retrieve all data!
  - Most real work done during testing
  - For every test sample, must search through the whole dataset - very slow!
  - There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ... (see the sketch below)
- Instance-based learning often poor with noisy or irrelevant features
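
A minimal sketch of that speedup (assuming scikit-learn is available; synthetic data, illustrative sizes): both searches return the same neighbors, but the tree-based index avoids scanning every stored point for each query.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))    # stored training data
queries = rng.normal(size=(5, 3))   # test samples

# Brute force: every query scans all stored points.
brute = NearestNeighbors(n_neighbors=1, algorithm="brute").fit(X)
# Tree-based index: queries descend a k-d tree instead.
tree = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(X)

_, idx_brute = brute.kneighbors(queries)
_, idx_tree = tree.kneighbors(queries)
assert np.array_equal(idx_brute, idx_tree)  # same answers, faster lookup at scale
```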

Support Vector Machines

Linear classifiers - Which line is better?
Data: example i: (x_i, y_i)
w.x = sum_j w^(j) x^(j)
[Figure: candidate separating lines between the two classes; decision boundary w.x + b = 0]

Pick the one with the largest margin!
w.x = sum_j w^(j) x^(j)
[Figure: maximum-margin separating line, decision boundary w.x + b = 0]

Maximize the margin
[Figure: margin around the decision boundary w.x + b = 0]

But there are many planes...
[Figure: decision boundary w.x + b = 0]

Review: Normal to a plane

Normalized margin - Canonical hyperplanes
[Figure: points x+ and x- on the canonical hyperplanes w.x + b = +1 and w.x + b = -1, on either side of the decision boundary w.x + b = 0; the margin is the distance between x+ and x-.]

Margin maximization using canonical hyperplanes
[Figure: canonical hyperplanes w.x + b = +1 and w.x + b = -1 around the decision boundary w.x + b = 0; the margin between them is 2/||w||.]

Support vector machines (SVMs)
- Solve efficiently by quadratic programming (QP)
  - Well-studied solution algorithms
- Hyperplane defined by support vectors
[Figure: maximum-margin hyperplane w.x + b = 0 between the canonical hyperplanes w.x + b = +1 and w.x + b = -1; margin 2/||w||, with the support vectors lying on the canonical hyperplanes.]
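
For reference, a sketch of the standard hard-margin primal QP over canonical hyperplanes (the usual textbook form; notation may differ slightly from the slides):

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \quad \forall i
```

Maximizing the margin 2/||w|| is equivalent to minimizing w.w, which is what makes the problem a QP.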

What if the data is not linearly separable?
Use features of features of features of features...

What if the data is still not linearly separable?
- Minimize w.w and number of training mistakes
  - Tradeoff two criteria?
- Tradeoff #(mistakes) and w.w
  - 0/1 loss
  - Slack penalty C
  - Not QP anymore
  - Also doesn't distinguish near misses and really bad mistakes

Slack variables - Hinge loss
- If margin >= 1, don't care
- If margin < 1, pay linear penalty
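
A sketch of the usual soft-margin primal with slack variables xi_i and penalty C (standard form; the slide's exact notation is assumed):

```latex
\min_{\mathbf{w},\,b,\,\xi}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} \;+\; C\sum_i \xi_i
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 \quad \forall i
```

Each xi_i measures how far example i falls inside the margin (zero when the margin is at least 1), which is exactly the linear penalty above.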

Side note: What's the difference between SVMs and logistic regression?
- SVM: hinge loss
- Logistic regression: log loss
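
The two losses as commonly written, with f(x) = w.x + b and y in {-1, +1} (a sketch of the standard forms, not necessarily the exact expressions on the slide):

```latex
\ell_{\text{hinge}}\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\ 1 - y\,f(\mathbf{x})\bigr),
\qquad
\ell_{\text{log}}\bigl(y, f(\mathbf{x})\bigr) = \ln\bigl(1 + e^{-y\,f(\mathbf{x})}\bigr)
```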

Constrained optimization

Lagrange multipliers - Dual variables
- Moving the constraint to the objective function
- Lagrangian:
- Solve:

Lagrange multipliers - Dual variables
- Solving:
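
As a reminder of the general recipe (a generic sketch, not the specific SVM problem): for min_x f(x) subject to g(x) <= 0, introduce a dual variable alpha >= 0 and move the constraint into the objective:

```latex
L(x, \alpha) = f(x) + \alpha\, g(x), \qquad \alpha \ge 0
```

The primal is min_x max_{alpha >= 0} L(x, alpha); the dual swaps the order, max_{alpha >= 0} min_x L(x, alpha).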

Dual SVM derivation (1) - the linearly separable case

Dual SVM derivation (2) - the linearly separable case

Dual SVM interpretation
[Figure: decision boundary w.x + b = 0 determined by the support vectors]

Dual SVM formulation - the linearly separable case
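
A sketch of the standard dual for the linearly separable case (usual textbook form, with dual variables alpha_i; the slide's notation is assumed):

```latex
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{subject to} \quad
\alpha_i \ge 0\ \ \forall i, \qquad \sum_i \alpha_i y_i = 0
```

The weight vector is recovered as w = sum_i alpha_i y_i x_i, and only the support vectors have alpha_i > 0.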

Dual SVM derivation - the non-separable case

Dual SVM formulation - the non-separable case
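
In the standard non-separable (soft-margin) dual, the objective is unchanged and the slack penalty C only caps the dual variables (again a sketch of the usual form):

```latex
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C\ \ \forall i, \qquad \sum_i \alpha_i y_i = 0
```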

Why did we learn about the dual SVM?
- There are some quadratic programming algorithms that can solve the dual faster than the primal
- But, more importantly, the kernel trick!!!
  - Another little detour

Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features...
Feature space can get really large really quickly!

Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for m input features and polynomial degree d = 2, 3, 4]
Grows fast! d = 6, m = 100: about 1.6 billion terms
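
A quick numeric check of that growth (a sketch: counting monomials of degree exactly d over m input features, i.e. C(m + d - 1, d), which reproduces the roughly 1.6 billion figure):

```python
# Number of degree-d monomials over m variables: C(m + d - 1, d).
from math import comb

def num_monomials(m: int, d: int) -> int:
    """Count the monomials of degree exactly d in m input features."""
    return comb(m + d - 1, d)

for d in (2, 3, 4):
    print(f"d={d}, m=100: {num_monomials(100, d):,} terms")

print(f"d=6, m=100: {num_monomials(100, 6):,} terms")  # ~1.6 billion
```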

Dual formulation only depends on dot-products, not on w!

Dot-product of polynomials
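
A small numeric sketch of the idea for degree d = 2 and two input features (hand-written feature map, NumPy assumed): the dot product of the explicit quadratic feature vectors equals (x . z)^2, so the high-dimensional dot product has a cheap closed form.

```python
import numpy as np

def phi2(v):
    """Explicit degree-2 monomial features of a 2-d vector (with sqrt(2) scaling)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

explicit = phi2(x) @ phi2(z)   # dot product in feature space
kernel = (x @ z) ** 2          # closed form: (x . z)^2

print(explicit, kernel)        # both 1.0
assert np.isclose(explicit, kernel)
```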

Finally: the kernel trick!
- Never represent features explicitly
  - Compute dot products in closed form
- Constant-time high-dimensional dot-products for many classes of features
- Very interesting theory - Reproducing Kernel Hilbert Spaces

Polynomial kernels
- All monomials of degree d in O(d) operations:
- How about all monomials of degree up to d?
  - Solution 0:
  - Better solution:
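
The formulas usually meant here (a sketch; reading "Solution 0" as a sum of the lower-degree kernels is an assumption, while the better trick adds a constant inside the power):

```latex
\text{Degree exactly } d:\quad K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^d
\qquad
\text{Solution 0:}\quad K(\mathbf{x},\mathbf{z}) = \sum_{k=0}^{d} (\mathbf{x}\cdot\mathbf{z})^k
\qquad
\text{Better solution:}\quad K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z} + 1)^d
```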

Common kernels
- Polynomials of degree d
- Polynomials of degree up to d
- Gaussian kernels
- Sigmoid
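
Their standard closed forms (a sketch of the usual parameterizations; the slide's exact constants and symbols are assumed):

```latex
\begin{aligned}
\text{Polynomial of degree } d:\quad & K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^d \\
\text{Polynomial of degree up to } d:\quad & K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z} + 1)^d \\
\text{Gaussian (RBF):}\quad & K(\mathbf{x},\mathbf{z}) = \exp\!\left(-\frac{\lVert \mathbf{x}-\mathbf{z}\rVert^2}{2\sigma^2}\right) \\
\text{Sigmoid:}\quad & K(\mathbf{x},\mathbf{z}) = \tanh(\eta\,\mathbf{x}\cdot\mathbf{z} + \nu)
\end{aligned}
```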

Overfitting?
- Huge feature space with kernels - what about overfitting???
- Maximizing margin leads to a sparse set of support vectors
- Some interesting theory says that SVMs search for a simple hypothesis with large margin
- Often robust to overfitting

What about at classification time?
- For a new input x, if we need to represent φ(x), we are in trouble!
- Recall the classifier: sign(w.φ(x) + b)
- Using kernels we are cool!

SVMs with kernels
- Choose a set of features and a kernel function
- Solve the dual problem to obtain the support vector weights α_i
- At classification time, compute the kernel expansion over the support vectors
- Classify as its sign (see the sketch below)
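
A minimal prediction sketch (assuming the dual weights alpha_i, labels y_i, support vectors, bias b, and a kernel K were already obtained from training; the names here are illustrative):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def svm_classify(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """sign( sum_i alpha_i * y_i * K(x_i, x) + b ) over the support vectors."""
    score = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(score + bias)
```

Only the support vectors (alpha_i > 0) appear in the sum, so the features φ(x) are never represented explicitly.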

What's the difference between SVMs and Logistic Regression?

                                          SVMs         Logistic Regression
  Loss function                           Hinge loss   Log-loss
  High-dimensional features with kernels  Yes!         No

Kernels in logistic regression
- Define the weights in terms of the support vectors:
- Derive a simple gradient descent rule on α_i
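
A sketch of the substitution usually intended here (some versions also fold in the labels y_i; the exact form on the slide is assumed): write the weights as a combination of the training points, w = sum_i α_i φ(x_i), so that

```latex
P(y = 1 \mid \mathbf{x}) \;=\; \frac{1}{1 + \exp\!\bigl(-\,\mathbf{w}\cdot\phi(\mathbf{x}) - b\bigr)}
\;=\; \frac{1}{1 + \exp\!\bigl(-\sum_i \alpha_i\, K(\mathbf{x}_i, \mathbf{x}) - b\bigr)}
```

and gradient descent can then be carried out on the α_i directly, using only kernel evaluations.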

What's the difference between SVMs and Logistic Regression? (Revisited)

                                          SVMs          Logistic Regression
  Loss function                           Hinge loss    Log-loss
  High-dimensional features with kernels  Yes!          Yes!
  Solution sparse                         Often yes!    Almost always no!
  Semantics of output                     Margin        Real probabilities