Machine Learning Crash Course: Part I
Ariel Kleiner
August 21, 2012
Machine learning exists at the intersection of computer science and statistics.
Examples
• Spam filters
• Search ranking
• Click (and clickthrough rate) prediction
• Recommendations (e.g., Netflix, Facebook friends)
• Speech recognition
• Machine translation
• Fraud detection
• Sentiment analysis
• Face detection, image classification
• Many more
A Variety of Capabilities
• Classification
• Collaborative filtering
• Regression
• Ranking
• Clustering
• Dimensionality reduction
• Feature selection
• Structured prediction
• Structured probabilistic modeling
• Active learning and experimental design
• Reinforcement learning
• Time series analysis
• Hypothesis testing
For Today
Classification
Clustering
(with emphasis on implementability and scalability)
Typical Data Analysis Workflow
Obtain and load raw data → Data exploration → Preprocessing and featurization → Learning → Diagnostics and evaluation
Classification
• Goal: Learn a mapping from entities to discrete labels.
  – Refer to entities as x and labels as y.
• Example: spam classification
  – Entities are emails.
  – Labels are {spam, not-spam}.
  – Given past labeled emails, we want to predict whether a new email is spam or not-spam.
Classification
• Examples
  – Spam filters
  – Click (and clickthrough rate) prediction
  – Sentiment analysis
  – Fraud detection
  – Face detection, image classification
Classification
Given a labeled dataset (x1, y1), ..., (xN, yN):
1. Randomly split the full dataset into two disjoint parts:
   – A larger training set (e.g., 75%)
   – A smaller test set (e.g., 25%)
2. Preprocess and featurize the data.
3. Use the training set to learn a classifier.
4. Evaluate the classifier on the test set.
5. Use the classifier to predict in the wild.
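The random split in step 1 can be sketched in a few lines of Python (a minimal illustration; function name and fixed seed are my own choices, not from the slides):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Randomly split labeled (x, y) pairs into disjoint training/test sets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

# 8 labeled examples, split 75% / 25%.
dataset = [([float(i)], i % 2) for i in range(8)]
training_set, test_set = train_test_split(dataset)
```

The two parts are disjoint and together cover the full dataset, which is exactly what step 1 requires.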
Classification
[Diagram: full dataset → training set → classifier → prediction for a new entity; test set → accuracy]
Example: Spam Classification
From: [email protected]
"Eliminate your debt by giving us your money..." → spam

From: [email protected]
"Hi, it's been a while! How are you? ..." → not-spam
Featurization
• Most classifiers require numeric descriptions of entities as input.
• Featurization: Transform each entity into a vector of real numbers.
  – Straightforward if the data are already numeric (e.g., patient height, blood pressure, etc.)
  – Otherwise, some effort is required. But this provides an opportunity to incorporate domain knowledge.
Featurization: Text
• Often use "bag of words" features for text.
  – Entities are documents (i.e., strings).
  – Build vocabulary: determine the set of unique words in the training set. Let V be the vocabulary size.
  – Featurization of a document:
    • Generate a V-dimensional feature vector.
    • Cell i in the feature vector has value 1 if the document contains word i, and 0 otherwise.
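In Python, this binary bag-of-words featurization might look as follows (a minimal sketch; function names and whitespace tokenization are simplifying choices of mine, not from the slides):

```python
def build_vocabulary(documents):
    """Collect the sorted set of unique words across the training documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def featurize(document, vocab):
    """Binary bag of words: cell i is 1 iff the document contains word i."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocab]

docs = ["eliminate your debt", "hi how are you"]
vocab = build_vocabulary(docs)
x = featurize("eliminate debt now", vocab)
```

Note that words outside the training vocabulary (like "now" above) simply contribute nothing to the feature vector.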
Example: Spam Classification
From: [email protected]
"Eliminate your debt by giving us your money..."

From: [email protected]
"Hi, it's been a while! How are you? ..."

Vocabulary: been, debt, eliminate, giving, how, it's, money, while
Example: Spam Classification
From: [email protected]
"Eliminate your debt by giving us your money..."

Feature vector:
  been       0
  debt       1
  eliminate  1
  giving     1
  how        0
  it's       0
  money      1
  while      0
Example: Spam Classification
• How might we construct a classifier?
• Using the training data, build a model that will tell us the likelihood of observing any given (x, y) pair.
  – x is an email's feature vector
  – y is a label, one of {spam, not-spam}
• Given such a model, to predict the label for an email:
  – Compute the likelihoods of (x, spam) and (x, not-spam).
  – Predict the label which gives the highest likelihood.
Example: Spam Classification
• What is a reasonable probabilistic model for (x, y) pairs?
• A baseline:
  – Before we observe an email's content, can we say anything about its likelihood of being spam?
  – Yes: p(spam) can be estimated as the fraction of training emails which are spam.
  – p(not-spam) = 1 - p(spam)
  – Call this the "class prior." Written as p(y).
Example: Spam Classification
• How do we incorporate an email's content?
• Suppose that the email were spam. Then, what would be the probability of observing its content?
Example: Spam Classification
• Example: "Eliminate your debt by giving us your money" with feature vector (0, 1, 1, 1, 0, 0, 1, 0)
• Ignoring word sequence, the probability of the email is
  p(seeing "debt" AND seeing "eliminate" AND seeing "giving" AND seeing "money" AND not seeing any other vocabulary words | the email is spam)
• In feature vector notation:
  p(x1=0, x2=1, x3=1, x4=1, x5=0, x6=0, x7=1, x8=0 | the email is spam)
Example: Spam Classification
• Now, to simplify, model each word in the vocabulary independently:
  – Assume that (given knowledge of the class label) the probability of seeing word i (e.g., eliminate) is independent of the probability of seeing word j (e.g., money).
  – As a result, the probability of the email content becomes
    p(x1=0 | spam) p(x2=1 | spam) ... p(x8=0 | spam)
    rather than
    p(x1=0, x2=1, x3=1, x4=1, x5=0, x6=0, x7=1, x8=0 | spam)
Example: Spam Classification
• Now, we only need to model the probability of seeing (or not seeing) a particular word i, assuming that we knew the email's class y (spam or not-spam).
  – But this is easy!
  – To estimate p(xi = 1 | y), simply compute the fraction of emails in the set {emails in the training set with label y} which contain word i.
Example: Spam Classification
• Putting it all together:
  – Based on the training data, estimate the class prior p(y).
    • i.e., estimate p(spam) and p(not-spam).
  – Also estimate the (conditional) probability of seeing any individual word i, given knowledge of the class label y.
    • i.e., estimate p(xi = 1 | y) for each i and y.
  – The (conditional) probability p(x | y) of seeing an entire email, given knowledge of the class label y, is then simply the product of the conditional word probabilities.
    • e.g., p(x=(0, 1, 1, 1, 0, 0, 1, 0) | y) = p(x1=0 | y) p(x2=1 | y) ... p(x8=0 | y)
Example: Spam Classification
• Recall: we want a model that will tell us the likelihood p(x, y) of observing any given (x, y) pair.
• The probability of observing (x, y) is the probability of observing y, and then observing x given that value of y:
  p(x, y) = p(y) p(x | y)
• Example: p("Eliminate your debt...", spam) = p(spam) p("Eliminate your debt..." | spam)
Example: Spam Classification
• To predict the label for a new email:
  – Compute log[p(x, spam)] and log[p(x, not-spam)].
  – Choose the label which gives the higher value.
  – We use logs above to avoid the underflow which otherwise arises in computing the p(x | y), which are products of individual p(xi | y) < 1:
    log[p(x, y)] = log[p(y) p(x | y)]
                 = log[p(y) p(x1 | y) p(x2 | y) ...]
                 = log[p(y)] + log[p(x1 | y)] + log[p(x2 | y)] + ...
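The whole procedure above (estimate the prior, estimate the word conditionals, predict by comparing log joint probabilities) fits in a short Python sketch. Function names are mine, and I add one refinement the slides omit: add-one smoothing, which keeps estimated probabilities strictly between 0 and 1 so the logs are always defined.

```python
import math

def train_naive_bayes(X, y):
    """Estimate the class prior p(y) and the conditionals p(x_i = 1 | y)."""
    labels = sorted(set(y))
    prior = {c: y.count(c) / len(y) for c in labels}
    cond = {}
    for c in labels:
        rows = [x for x, label in zip(X, y) if label == c]
        # Fraction of class-c training emails containing word i
        # (add-one smoothing avoids estimates of exactly 0 or 1).
        cond[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(len(X[0]))]
    return prior, cond

def predict(x, prior, cond):
    """Pick the label maximizing log p(y) + sum_i log p(x_i | y)."""
    def log_joint(c):
        total = math.log(prior[c])
        for xi, p in zip(x, cond[c]):
            total += math.log(p if xi == 1 else 1.0 - p)
        return total
    return max(prior, key=log_joint)

# Toy training set over a 2-word vocabulary.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["spam", "spam", "not-spam", "not-spam"]
prior, cond = train_naive_bayes(X, y)
```

Summing logs instead of multiplying probabilities is exactly the underflow fix described above.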
Classification: Beyond Text
• You have just seen an instance of the Naive Bayes classifier.
• It applies as shown to any classification problem with binary feature vectors.
• What if the features are real-valued?
  – Still model each element of the feature vector independently.
  – But change the form of the model for p(xi | y).
Classification: Beyond Text
• If xi is a real number, often model p(xi | y) as Gaussian with mean μ_iy and variance σ²_iy:

  p(xi | y) = (1 / (σ_iy √(2π))) exp(-(1/2) ((xi - μ_iy) / σ_iy)²)

• Estimate the mean and variance for a given i, y as the mean and variance of the xi in the training set which have corresponding class label y.
• Other, non-Gaussian distributions can be used if we know more about the xi.
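The per-class parameter estimates and the Gaussian density above translate directly into code (a minimal sketch with my own function names; this uses the population variance, matching "the variance of the xi in the training set"):

```python
import math

def gaussian_params(xs):
    """Mean and variance of the x_i values for one feature and one class."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Density p(x_i | y) under the per-class Gaussian model."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Example: heights (one real-valued feature) of training patients in class y.
heights = [150.0, 160.0, 170.0]
mu, sigma2 = gaussian_params(heights)
```

In the Naive Bayes prediction step, `gaussian_pdf` simply replaces the binary estimate of p(xi | y); everything else is unchanged.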
Naive Bayes: Benefits
• Can easily handle more than two classes and different data types
• Simple and easy to implement
• Scalable
Naive Bayes: Shortcomings
• Generally not as accurate as more sophisticated methods (but still generally reasonable).
• Independence assumption on the feature vector elements
  – Can instead directly model p(x | y) without this independence assumption.
• Requires us to specify a full model for p(x, y)
  – In fact, this is not necessary!
  – To do classification, we actually only require p(y | x), the probability that the label is y, given that we have observed entity features x.
Logistic Regression
• Recall: Naive Bayes models the full (joint) probability p(x, y).
• But Naive Bayes actually only uses the conditional probability p(y | x) to predict.
• Instead, why not just directly model p(y | x)?
  – Logistic regression does exactly that.
  – No need to first model p(y) and then separately p(x | y).
Logistic Regression
• Assume that the class labels are {0, 1}.
• Given an entity's feature vector x, the probability that the label is 1 is taken to be

  p(y = 1 | x) = 1 / (1 + e^(-b^T x))

  where b is a parameter vector and b^T x denotes a dot product.
• The probability that the label is 1, given features x, is determined by a weighted sum of the features.
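The formula is a one-liner in Python (function name mine; given a parameter vector b, this is the entire prediction rule):

```python
import math

def logistic_probability(b, x):
    """p(y = 1 | x) = 1 / (1 + exp(-b^T x))."""
    bt_x = sum(bi * xi for bi, xi in zip(b, x))  # the weighted sum b^T x
    return 1.0 / (1.0 + math.exp(-bt_x))

# With b = 0 the weighted sum is 0, so the model is indifferent: p = 0.5.
p = logistic_probability([0.0, 0.0], [3.1, -2.0])
```

A positive weighted sum pushes the probability above 0.5 (predict label 1), a negative one below 0.5 (predict label 0).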
Logistic Regression
• This is liberating:
  – Simply featurize the data and go.
  – No need to find a distribution for p(xi | y) which is particularly well suited to your setting.
  – Can just as easily use binary-valued (e.g., bag of words) or real-valued features without any changes to the classification method.
  – Can often improve performance simply by adding new features (which might be derived from old features).
Logistic Regression
• Can be trained efficiently at large scale, but not as easy to implement as Naive Bayes.
  – Trained via maximum likelihood.
  – Requires the use of iterative numerical optimization (e.g., gradient descent, most basically).
  – However, implementing this effectively, robustly, and at large scale is non-trivial and would require more time than we have today.
• Can be generalized to the multiclass setting.
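To make "iterative numerical optimization" concrete, here is a toy batch gradient ascent on the log-likelihood (a sketch under strong simplifying assumptions of mine: fixed step size and iteration count, no regularization, no convergence check — none of which would survive in a robust large-scale implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, steps=1000, lr=0.1):
    """Maximum likelihood for logistic regression; labels must be in {0, 1}."""
    d = len(X[0])
    b = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, yi in zip(X, y):
            # Gradient of the log-likelihood: (y - p(y=1|x)) * x, summed over data.
            err = yi - sigmoid(sum(bj * xj for bj, xj in zip(b, x)))
            for j in range(d):
                grad[j] += err * x[j]
        b = [bj + lr * gj for bj, gj in zip(b, grad)]
    return b

# Two points (first feature is a constant-1 intercept term).
b = train_logistic([[1.0, 0.0], [1.0, 4.0]], [0, 1])
```

Each iteration nudges b in the direction that makes the training labels more probable under the model.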
Other Classification Techniques
• Support Vector Machines (SVMs)
• Kernelized logistic regression and SVMs
• Boosted decision trees
• Random Forests
• Nearest neighbors
• Neural networks
• Ensembles

See The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman for more information.
Featurization: Final Comments
• Featurization affords the opportunity to
  – Incorporate domain knowledge
  – Overcome some classifier limitations
  – Improve performance
• Incorporating domain knowledge:
  – Example: in spam classification, we might suspect that the sender is important, in addition to the email body.
  – So, try adding features based on the sender's email address.
Featurization: Final Comments
• Overcoming classifier limitations:
  – Naive Bayes and logistic regression do not model multiplicative interactions between features.
  – For example, the presence of the pair of words [eliminate, debt] might indicate spam, while the presence of either one individually might not.
  – Can overcome this by adding features which explicitly encode such interactions.
  – For example, can add features which are products of all pairs of bag of words features.
  – Can also include nonlinear effects in this manner.
  – This is actually what kernel methods do.
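Adding all pairwise product features can be sketched as follows (function name mine; note that the augmented vector has d + d(d-1)/2 entries, so this gets expensive for large vocabularies):

```python
def add_pairwise_products(x):
    """Append the products of all pairs of features to the vector x."""
    d = len(x)
    return x + [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]

# For binary bag-of-words features, x[i] * x[j] = 1 exactly when
# both word i and word j appear in the document.
x = [1, 1, 0]  # e.g., presence of [eliminate, debt, money]
augmented = add_pairwise_products(x)
```

The classifier itself is unchanged; it just sees a longer feature vector in which word co-occurrence is now explicit.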
Classification: Recap
Given a labeled dataset (x1, y1), ..., (xN, yN): randomly split it into disjoint training (e.g., 75%) and test (e.g., 25%) sets; preprocess and featurize the data; learn a classifier on the training set; evaluate it on the test set; then use the classifier to predict in the wild.
Classifier Evaluation
• How do we determine the quality of a trained classifier?
• There are various metrics for quality; the most common is accuracy.
• How do we determine the probability that a trained classifier will correctly classify a new entity?
Classifier Evaluation
• Cannot simply evaluate a classifier on the same dataset used to train it.
  – This will be overly optimistic!
• This is why we set aside a disjoint test set before training.
Classifier Evaluation
• To evaluate accuracy:
  – Train on the training set without exposing the test set to the classifier.
  – Ignoring the (known) labels of the data points in the test set, use the trained classifier to generate label predictions for the test points.
  – Compute the fraction of predicted labels which are identical to the test set's known labels.
• Other, more sophisticated evaluation methods are available which make more efficient use of data (e.g., cross-validation).
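The accuracy computation is a one-liner given any classifier-as-a-function (a sketch; the toy classifier below is purely illustrative):

```python
def accuracy(classifier, test_set):
    """Fraction of test points whose predicted label matches the known label."""
    correct = sum(1 for x, y in test_set if classifier(x) == y)
    return correct / len(test_set)

# Toy classifier: predict "spam" whenever any feature is present.
toy = lambda x: "spam" if sum(x) > 0 else "not-spam"
test_set = [([1, 1], "spam"), ([0, 0], "not-spam"), ([1, 0], "not-spam")]
acc = accuracy(toy, test_set)  # 2 of 3 predictions are correct
```

Note that `classifier` is only ever called on the features x; the known labels y are used solely for scoring, exactly as described above.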