CSCE833 Machine Learning
Lecture 9
Linear Discriminant Analysis
Dr. Jianjun Hu
mleg.cse.sc.edu/edu/csce833
University of South Carolina
Department of Computer Science and Engineering
Likelihood- vs. Discriminant-based Classification
Likelihood-based: Assume a model for $p(\mathbf{x} \mid C_i)$ and use Bayes' rule to calculate $P(C_i \mid \mathbf{x})$; $g_i(\mathbf{x}) = \log P(C_i \mid \mathbf{x})$
Discriminant-based: Assume a model for $g_i(\mathbf{x} \mid \Phi_i)$; no density estimation
Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries
Lecture Notes for E. Alpaydın, Introduction to Machine Learning © The MIT Press (V1.1)
Example: Boundary or Class Model
Linear Discriminant
Linear discriminant:
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$
Advantages:
Simple: O(d) space/computation
Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
Optimal when $p(\mathbf{x} \mid C_i)$ are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
Generalized Linear Model
Quadratic discriminant:
$$g_i(\mathbf{x} \mid \mathbf{W}_i, \mathbf{w}_i, w_{i0}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
Higher-order (product) terms:
$$z_1 = x_1, \quad z_2 = x_2, \quad z_3 = x_1^2, \quad z_4 = x_2^2, \quad z_5 = x_1 x_2$$
Map from $\mathbf{x}$ to $\mathbf{z}$ using nonlinear basis functions and use a linear discriminant in $\mathbf{z}$-space
$$g_i(\mathbf{x}) = \sum_{j=1}^{k} w_{ij} \phi_j(\mathbf{x})$$
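As a small illustration (not from the slides), here is a minimal NumPy sketch of the idea above: map $\mathbf{x}$ into a $\mathbf{z}$-space of quadratic basis functions and apply an ordinary linear discriminant there. The weight values are made up for the example.

```python
import numpy as np

def quadratic_basis(x):
    """Map a 2-d input x = (x1, x2) to z = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

def linear_discriminant(z, w, w0):
    """g(z) = w^T z + w0, a linear discriminant in z-space."""
    return w @ z + w0

# Hypothetical weights; in practice they would be learned from data.
w = np.array([1.0, -0.5, 0.3, 0.3, -1.0])
w0 = -0.2

x = np.array([0.8, -1.2])
g = linear_discriminant(quadratic_basis(x), w, w0)
print("g(x) =", g, "-> choose C1" if g > 0 else "-> choose C2")
```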
Two Classes
$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = \left(\mathbf{w}_1^T \mathbf{x} + w_{10}\right) - \left(\mathbf{w}_2^T \mathbf{x} + w_{20}\right) = \left(\mathbf{w}_1 - \mathbf{w}_2\right)^T \mathbf{x} + \left(w_{10} - w_{20}\right) = \mathbf{w}^T \mathbf{x} + w_0$$
Choose $C_1$ if $g(\mathbf{x}) > 0$, and $C_2$ otherwise.
Geometry
Multiple Classes
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
Choose $C_i$ if
$$g_i(\mathbf{x}) = \max_{j=1}^{K} g_j(\mathbf{x})$$
Classes are linearly separable.
Pairwise Separation
$$g_{ij}(\mathbf{x} \mid \mathbf{w}_{ij}, w_{ij0}) = \mathbf{w}_{ij}^T \mathbf{x} + w_{ij0}$$
$$g_{ij}(\mathbf{x}) = \begin{cases} > 0 & \text{if } \mathbf{x} \in C_i \\ \le 0 & \text{if } \mathbf{x} \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$
Choose $C_i$ if $\forall j \ne i, \ g_{ij}(\mathbf{x}) > 0$
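A minimal sketch (my own, not from the slides) of the pairwise decision rule: with K classes, keep one linear discriminant $g_{ij}$ per ordered pair and predict class $i$ only if $g_{ij}(\mathbf{x}) > 0$ for every $j \ne i$. The weight containers are placeholders assumed to be filled elsewhere.

```python
import numpy as np

def pairwise_predict(x, W, W0):
    """W[i][j] (array) and W0[i][j] (float) parameterize
    g_ij(x) = W[i][j]^T x + W0[i][j].
    Return the class i with g_ij(x) > 0 for all j != i, or None if no
    class wins all of its pairwise tests (a 'reject' region)."""
    K = len(W)
    for i in range(K):
        if all(W[i][j] @ x + W0[i][j] > 0 for j in range(K) if j != i):
            return i
    return None
```

In practice the discriminants would satisfy $g_{ji} = -g_{ij}$, so only K(K−1)/2 of them actually need to be stored.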
From Discriminants to Posteriors
When $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$:
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
$$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \log P(C_i)$$
With two classes, let $y \equiv P(C_1 \mid \mathbf{x})$, so $P(C_2 \mid \mathbf{x}) = 1 - y$. Choose $C_1$ if $y > 0.5$, equivalently if $y/(1-y) > 1$, equivalently if $\log\left[y/(1-y)\right] > 0$; choose $C_2$ otherwise.
$$\text{logit}\,P(C_1 \mid \mathbf{x}) = \log \frac{P(C_1 \mid \mathbf{x})}{1 - P(C_1 \mid \mathbf{x})} = \log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})} = \log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} + \log \frac{P(C_1)}{P(C_2)}$$
$$= \log \frac{(2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left[-\frac{1}{2}\left(\mathbf{x} - \boldsymbol{\mu}_1\right)^T \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}_1\right)\right]}{(2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left[-\frac{1}{2}\left(\mathbf{x} - \boldsymbol{\mu}_2\right)^T \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}_2\right)\right]} + \log \frac{P(C_1)}{P(C_2)}$$
$$= \mathbf{w}^T \mathbf{x} + w_0$$
where
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} \left(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\right), \qquad w_0 = -\frac{1}{2} \left(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2\right)^T \boldsymbol{\Sigma}^{-1} \left(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\right) + \log \frac{P(C_1)}{P(C_2)}$$
The inverse of the logit:
$$\log \frac{P(C_1 \mid \mathbf{x})}{1 - P(C_1 \mid \mathbf{x})} = \mathbf{w}^T \mathbf{x} + w_0 \ \Rightarrow \ P(C_1 \mid \mathbf{x}) = \text{sigmoid}\!\left(\mathbf{w}^T \mathbf{x} + w_0\right) = \frac{1}{1 + \exp\!\left[-\left(\mathbf{w}^T \mathbf{x} + w_0\right)\right]}$$
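To make the result above concrete, here is a small NumPy sketch (mine, with made-up class parameters) that computes $\mathbf{w}$ and $w_0$ from two Gaussian class densities with a shared covariance and then recovers $P(C_1 \mid \mathbf{x})$ through the sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical class parameters (shared covariance) and priors.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.5]])
P1, P2 = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2) + np.log(P1 / P2)

x = np.array([0.5, 0.3])
p_c1 = sigmoid(w @ x + w0)          # P(C1 | x)
print("P(C1|x) =", p_c1, "-> choose C1" if p_c1 > 0.5 else "-> choose C2")
```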
Sigmoid (Logistic) Function
1. Calculate $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$ and choose $C_1$ if $g(\mathbf{x}) > 0$, or
2. Calculate $y = \text{sigmoid}\!\left(\mathbf{w}^T \mathbf{x} + w_0\right)$ and choose $C_1$ if $y > 0.5$
Gradient-Descent
$E(\mathbf{w} \mid \mathcal{X})$ is the error with parameters $\mathbf{w}$ on sample $\mathcal{X}$
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w} \mid \mathcal{X})$$
Gradient:
$$\nabla_{\mathbf{w}} E = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_d}\right]^T$$
Gradient-descent: starts from a random $\mathbf{w}$ and updates $\mathbf{w}$ iteratively in the negative direction of the gradient.
Gradient-Descent
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \ \forall i$$
$$w_i \leftarrow w_i + \Delta w_i$$
[Figure: one gradient-descent step, moving from $w^t$ to $w^{t+1}$ with step size $\eta$, decreasing the error from $E(w^t)$ to $E(w^{t+1})$]
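A generic gradient-descent loop in NumPy, as a sketch of the update rule above; the quadratic error function and learning rate are arbitrary choices for illustration, not part of the slides.

```python
import numpy as np

def gradient_descent(grad_E, w_init, eta=0.1, n_iters=100):
    """Iteratively apply w <- w - eta * dE/dw starting from w_init."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_iters):
        w -= eta * grad_E(w)
    return w

# Example: minimize E(w) = (w1 - 3)^2 + (w2 + 1)^2, whose gradient is known.
grad_E = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
w_star = gradient_descent(grad_E, w_init=[0.0, 0.0])
print(w_star)   # converges toward [3, -1]
```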
Logistic Discrimination
Two classes: assume the log likelihood ratio is linear:
$$\log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} = \mathbf{w}^T \mathbf{x} + w_0^o$$
$$\text{logit}\,P(C_1 \mid \mathbf{x}) = \log \frac{P(C_1 \mid \mathbf{x})}{1 - P(C_1 \mid \mathbf{x})} = \log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} + \log \frac{P(C_1)}{P(C_2)} = \mathbf{w}^T \mathbf{x} + w_0$$
where
$$w_0 = w_0^o + \log \frac{P(C_1)}{P(C_2)}$$
$$y = \hat{P}(C_1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\mathbf{w}^T \mathbf{x} + w_0\right)\right]}$$
Training: Two Classes
$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_t, \qquad r^t \mid \mathbf{x}^t \sim \text{Bernoulli}(y^t)$$
$$y = P(C_1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\mathbf{w}^T \mathbf{x} + w_0\right)\right]}$$
$$l(\mathbf{w}, w_0 \mid \mathcal{X}) = \prod_t \left(y^t\right)^{r^t} \left(1 - y^t\right)^{1 - r^t}$$
$$E = -\log l$$
$$E(\mathbf{w}, w_0 \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right)$$
Training: Gradient-Descent
$$E(\mathbf{w}, w_0 \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right)$$
If $y = \text{sigmoid}(a)$, then $\dfrac{dy}{da} = y(1 - y)$, so
$$\Delta w_j = -\eta \frac{\partial E}{\partial w_j} = \eta \sum_t \left(\frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t}\right) y^t \left(1 - y^t\right) x_j^t = \eta \sum_t \left(r^t - y^t\right) x_j^t, \quad j = 1, \ldots, d$$
$$\Delta w_0 = -\eta \frac{\partial E}{\partial w_0} = \eta \sum_t \left(r^t - y^t\right)$$
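Putting the two previous slides together, here is a short NumPy sketch (my own) of training a two-class logistic discriminant with the batch updates $\Delta w_j = \eta \sum_t (r^t - y^t) x_j^t$ and $\Delta w_0 = \eta \sum_t (r^t - y^t)$. The toy data and learning rate are made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, r, eta=0.1, n_epochs=1000):
    """X: (N, d) inputs, r: (N,) labels in {0, 1}. Returns (w, w0)."""
    N, d = X.shape
    w = np.random.uniform(-0.01, 0.01, size=d)
    w0 = 0.0
    for _ in range(n_epochs):
        y = sigmoid(X @ w + w0)       # y^t = P(C1 | x^t)
        w += eta * X.T @ (r - y)      # Δw_j = η Σ_t (r^t − y^t) x_j^t
        w0 += eta * np.sum(r - y)     # Δw_0 = η Σ_t (r^t − y^t)
    return w, w0

# Toy, nearly linearly separable data for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
r = np.hstack([np.zeros(50), np.ones(50)])
w, w0 = train_logistic(X, r)
print("training accuracy:", np.mean((sigmoid(X @ w + w0) > 0.5) == r))
```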
[Figure: results after 10, 100, and 1000 training iterations]
K>2 Classes
$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_t, \qquad \mathbf{r}^t \mid \mathbf{x}^t \sim \text{Mult}_K(1, \mathbf{y}^t)$$
Assume the log likelihood ratios are linear:
$$\log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_K)} = \mathbf{w}_i^T \mathbf{x} + w_{i0}^o$$
The posteriors are given by the softmax:
$$y_i = \hat{P}(C_i \mid \mathbf{x}) = \frac{\exp\!\left[\mathbf{w}_i^T \mathbf{x} + w_{i0}\right]}{\sum_{j=1}^{K} \exp\!\left[\mathbf{w}_j^T \mathbf{x} + w_{j0}\right]}, \quad i = 1, \ldots, K$$
$$l\!\left(\{\mathbf{w}_i, w_{i0}\}_i \mid \mathcal{X}\right) = \prod_t \prod_i \left(y_i^t\right)^{r_i^t}$$
$$E\!\left(\{\mathbf{w}_i, w_{i0}\}_i \mid \mathcal{X}\right) = -\sum_t \sum_i r_i^t \log y_i^t$$
Gradient-descent updates:
$$\Delta \mathbf{w}_j = \eta \sum_t \left(r_j^t - y_j^t\right) \mathbf{x}^t, \qquad \Delta w_{j0} = \eta \sum_t \left(r_j^t - y_j^t\right)$$
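A corresponding sketch for K > 2 classes (again my own illustration): compute the softmax posteriors and apply the batch updates $\Delta \mathbf{w}_j = \eta \sum_t (r_j^t - y_j^t)\,\mathbf{x}^t$ and $\Delta w_{j0} = \eta \sum_t (r_j^t - y_j^t)$ for every class $j$.

```python
import numpy as np

def softmax(A):
    """Row-wise softmax of an (N, K) matrix of scores."""
    A = A - A.max(axis=1, keepdims=True)   # shift for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def train_softmax(X, R, eta=0.1, n_epochs=1000):
    """X: (N, d) inputs, R: (N, K) one-hot labels. Returns (W, w0)."""
    N, d = X.shape
    K = R.shape[1]
    W = np.zeros((d, K))
    w0 = np.zeros(K)
    for _ in range(n_epochs):
        Y = softmax(X @ W + w0)        # y_i^t = P(C_i | x^t)
        W += eta * X.T @ (R - Y)       # Δw_j = η Σ_t (r_j^t − y_j^t) x^t
        w0 += eta * (R - Y).sum(axis=0)
    return W, w0
```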
Example
Generalizing the Linear Model
Quadratic:
$$\log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_K)} = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
Sum of basis functions:
$$\log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_K)} = \mathbf{w}_i^T \boldsymbol{\phi}(\mathbf{x}) + w_{i0}$$
where $\boldsymbol{\phi}(\mathbf{x})$ are basis functions, e.g.:
Kernels in SVM
Hidden units in neural networks
Discrimination by Regression
Classes are NOT mutually exclusive and exhaustive:
$$r^t = y^t + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
$$y^t = \text{sigmoid}\!\left(\mathbf{w}^T \mathbf{x}^t + w_0\right) = \frac{1}{1 + \exp\!\left[-\left(\mathbf{w}^T \mathbf{x}^t + w_0\right)\right]}$$
$$l(\mathbf{w}, w_0 \mid \mathcal{X}) = \prod_t \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{\left(r^t - y^t\right)^2}{2\sigma^2}\right]$$
$$E(\mathbf{w}, w_0 \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2$$
$$\Delta \mathbf{w} = \eta \sum_t \left(r^t - y^t\right) y^t \left(1 - y^t\right) \mathbf{x}^t$$
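A brief NumPy sketch (mine) of the regression view above: fit the sigmoid output to 0/1 targets by minimizing squared error, using the update $\Delta \mathbf{w} = \eta \sum_t (r^t - y^t)\, y^t (1 - y^t)\, \mathbf{x}^t$. Learning rate and epoch count are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_sigmoid_regression(X, r, eta=0.5, n_epochs=2000):
    """Least-squares fit of y = sigmoid(w^T x + w0) to targets r in {0, 1}."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_epochs):
        y = sigmoid(X @ w + w0)
        delta = (r - y) * y * (1 - y)      # (r^t − y^t) y^t (1 − y^t)
        w += eta * X.T @ delta
        w0 += eta * delta.sum()
    return w, w0
```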
Optimal Separating Hyperplane
$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_t \quad \text{where} \quad r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$
Find $\mathbf{w}$ and $w_0$ such that
$$\mathbf{w}^T \mathbf{x}^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T \mathbf{x}^t + w_0 \le -1 \quad \text{for } r^t = -1$$
which can be rewritten as
$$r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge +1$$
(Cortes and Vapnik, 1995; Vapnik, 1995)
Margin
Distance from the discriminant to the closest instances on either side.
The distance of $\mathbf{x}^t$ to the hyperplane is
$$\frac{\left|\mathbf{w}^T \mathbf{x}^t + w_0\right|}{\|\mathbf{w}\|}$$
We require
$$\frac{r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right)}{\|\mathbf{w}\|} \ge \rho, \quad \forall t$$
For a unique solution, fix $\rho \|\mathbf{w}\| = 1$; then, to maximize the margin:
$$\min \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge +1, \ \forall t$$
$$\min \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge +1, \ \forall t$$
$$L_p = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) - 1\right]$$
$$= \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) + \sum_{t=1}^{N} \alpha^t$$
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow \ \sum_{t=1}^{N} \alpha^t r^t = 0$$
$$L_d = \frac{1}{2} \mathbf{w}^T \mathbf{w} - \mathbf{w}^T \sum_t \alpha^t r^t \mathbf{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$$
$$= -\frac{1}{2} \mathbf{w}^T \mathbf{w} + \sum_t \alpha^t$$
$$= -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \left(\mathbf{x}^t\right)^T \mathbf{x}^s + \sum_t \alpha^t$$
subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \ \forall t$
t
Most αt are 0 and only a small number have αt
>0; they are the support vectors
Soft Margin Hyperplane
Not linearly separable:
$$r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge 1 - \xi^t$$
Soft error:
$$\sum_t \xi^t$$
The new primal is
$$L_p = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) - 1 + \xi^t\right] - \sum_t \mu^t \xi^t$$
Kernel Machines
Preprocess input $\mathbf{x}$ by basis functions:
$$\mathbf{z} = \boldsymbol{\phi}(\mathbf{x}), \qquad g(\mathbf{z}) = \mathbf{w}^T \mathbf{z}, \qquad g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$
The SVM solution:
$$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}\!\left(\mathbf{x}^t\right)$$
$$g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}\!\left(\mathbf{x}^t\right)^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t K\!\left(\mathbf{x}^t, \mathbf{x}\right)$$
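Given trained dual coefficients, the discriminant above needs only kernel evaluations against the support vectors. A minimal sketch (my own; the alphas, labels, and support vectors are assumed to come from some SVM solver):

```python
import numpy as np

def svm_discriminant(x, support_X, support_r, support_alpha, kernel, w0=0.0):
    """g(x) = sum_t alpha^t r^t K(x^t, x) + w0, summed over support vectors."""
    return sum(a * r * kernel(xt, x)
               for a, r, xt in zip(support_alpha, support_r, support_X)) + w0

# Example with a linear kernel K(xt, x) = xt^T x:
linear_kernel = lambda xt, x: np.dot(xt, x)
```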
Kernel Functions
Polynomials of degree $q$:
$$K\!\left(\mathbf{x}^t, \mathbf{x}\right) = \left(\mathbf{x}^T \mathbf{x}^t + 1\right)^q$$
For example, with $q = 2$ and two-dimensional inputs:
$$K(\mathbf{x}, \mathbf{y}) = \left(\mathbf{x}^T \mathbf{y} + 1\right)^2 = \left(x_1 y_1 + x_2 y_2 + 1\right)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
which corresponds to the basis functions
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1, \ \sqrt{2}\, x_1, \ \sqrt{2}\, x_2, \ \sqrt{2}\, x_1 x_2, \ x_1^2, \ x_2^2\right]^T$$
Radial-basis functions:
$$K\!\left(\mathbf{x}^t, \mathbf{x}\right) = \exp\!\left[-\frac{\left\|\mathbf{x}^t - \mathbf{x}\right\|^2}{2 s^2}\right]$$
Sigmoidal functions:
$$K\!\left(\mathbf{x}^t, \mathbf{x}\right) = \tanh\!\left(2\, \mathbf{x}^T \mathbf{x}^t + 1\right)$$
(Cherkassky and Mulier, 1998)
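The three kernels above are straightforward to write down; here is a NumPy sketch (the hyperparameter defaults are arbitrary, not prescribed by the slides). Any of these can be passed as the kernel argument of the svm_discriminant sketch above.

```python
import numpy as np

def polynomial_kernel(xt, x, q=2):
    """K(x^t, x) = (x^T x^t + 1)^q"""
    return (np.dot(x, xt) + 1.0) ** q

def rbf_kernel(xt, x, s=1.0):
    """K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))"""
    return np.exp(-np.sum((xt - x) ** 2) / (2.0 * s ** 2))

def sigmoidal_kernel(xt, x):
    """K(x^t, x) = tanh(2 x^T x^t + 1)"""
    return np.tanh(2.0 * np.dot(x, xt) + 1.0)
```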
SVM for Regression
Use a linear model (possibly kernelized):
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$
Use the $\epsilon$-sensitive error function:
$$e_\epsilon\!\left(r^t, f\!\left(\mathbf{x}^t\right)\right) = \begin{cases} 0 & \text{if } \left|r^t - f\!\left(\mathbf{x}^t\right)\right| < \epsilon \\ \left|r^t - f\!\left(\mathbf{x}^t\right)\right| - \epsilon & \text{otherwise} \end{cases}$$
$$\min \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_t \left(\xi_+^t + \xi_-^t\right)$$
subject to
$$r^t - \left(\mathbf{w}^T \mathbf{x} + w_0\right) \le \epsilon + \xi_+^t$$
$$\left(\mathbf{w}^T \mathbf{x} + w_0\right) - r^t \le \epsilon + \xi_-^t$$
$$\xi_+^t, \ \xi_-^t \ge 0$$
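To close, a tiny sketch (mine) of the $\epsilon$-sensitive error function used above; the default $\epsilon$ is arbitrary.

```python
import numpy as np

def epsilon_insensitive_loss(r, f_x, eps=0.1):
    """e(r^t, f(x^t)) = 0 if |r^t - f(x^t)| < eps, else |r^t - f(x^t)| - eps."""
    err = np.abs(r - f_x)
    return np.where(err < eps, 0.0, err - eps)
```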