Revisiting Logistic Regression & Naïve Bayes
Aarti Singh
Machine Learning 10-701/15-781 Jan 27, 2010
Generative and Discriminative Classifiers
Training classifiers involves learning a mapping f: X -> Y, or P(Y|X)
Generative classifiers (e.g. Naïve Bayes)
- Assume some functional form for P(X,Y) (or P(X|Y) and P(Y))
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X)
Discriminative classifiers (e.g. Logistic Regression)
- Assume some functional form for P(Y|X)
- Estimate parameters of P(Y|X) directly from training data
Logistic Regression
Assumes the following functional form for P(Y|X):
P(Y=1|X) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}
Alternatively, \ln \frac{P(Y=1|X)}{P(Y=0|X)} = w_0 + \sum_i w_i X_i, so the decision boundary w_0 + \sum_i w_i X_i = 0 is linear (Linear Decision Boundary)
DOES NOT require any conditional independence assumptions
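As a concrete illustration, here is a minimal sketch of this functional form (not from the slides; the weights w, w0 and the input x are hypothetical):

```python
import numpy as np

def p_y1_given_x(x, w, w0):
    """P(Y=1|X=x) under the logistic regression model."""
    return 1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x))))

# Hypothetical 2-feature example: the decision boundary w0 + w.x = 0 is a line.
w, w0 = np.array([2.0, -1.0]), 0.5
x = np.array([0.3, 1.2])
print(p_y1_given_x(x, w, w0))             # predicted probability of class 1
print(int(p_y1_given_x(x, w, w0) > 0.5))  # predicted label
```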
Connection to Gaussian Naïve Bayes
There are several distributions that can lead to a linear decision boundary. As another example, consider a generative model:
Class-conditional densities P(X|Y=y) in the exponential family; observe that the Gaussian is a special case.
Connection to Gaussian Naïve Bayes
Expanding the log-odds \ln \frac{P(Y=1|X)}{P(Y=0|X)} gives a constant term plus a first-order term in X, i.e. a linear decision boundary.
Special case: P(X|Y=y) ~ Gaussian(\mu_y, \Sigma_y) where \Sigma_0 = \Sigma_1 (so the covariance entries satisfy c_{ij,0} = c_{ij,1}); with conditionally independent features, c_{ij,y} = 0 for i \neq j (Gaussian Naïve Bayes).
Generative vs Discriminative
Given infinite data (asymptotically),
- If the conditional independence assumption holds, discriminative and generative classifiers perform similarly.
- If the conditional independence assumption does NOT hold, the discriminative classifier outperforms the generative one.
Generative vs Discriminative
Given finite data (n data points, p features), the Ng-Jordan (2001) paper shows:
Naïve Bayes (generative) requires n = O(log p) samples to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(p).
Why? Independent class-conditional densities:
- a smaller (more restricted) class of models is easier to learn
- parameter estimates are not coupled: each parameter is learnt independently, not jointly, from the training data
Naïve Bayes vs Logistic Regression
Verdict: Both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has a higher asymptotic error, BUT it converges faster to its (less accurate) asymptotic error.
Experimental Comparison (Ng-Jordan01)
UCI Machine Learning Repository: 15 datasets (8 with continuous features, 7 with discrete features); more results in the paper.
[Figure: test error vs. training set size on these datasets, comparing Naïve Bayes and Logistic Regression.]
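For intuition only, here is a minimal sketch of the same kind of comparison (a synthetic illustration, not the Ng-Jordan experiments) using scikit-learn's GaussianNB and LogisticRegression; the data-generating process below is my own assumption:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, p=20):
    """Synthetic data: class-conditional Gaussians with independent features."""
    y = rng.permutation(np.arange(n) % 2)           # balanced binary labels
    X = rng.normal(loc=y[:, None] * 0.3, size=(n, p))
    return X, y

X_test, y_test = sample(5000)
for n in [10, 50, 200, 1000]:
    X, y = sample(n)
    nb_acc = GaussianNB().fit(X, y).score(X_test, y_test)
    lr_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X_test, y_test)
    print(f"n={n:5d}  NaiveBayes={nb_acc:.3f}  LogisticRegression={lr_acc:.3f}")
```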
Classification so far (Recap)
Classification Tasks
Features, X -> Labels, Y. Examples:
- Diagnosing sickle cell anemia: X = cell image, Y = {Anemic cell, Healthy cell}
- Tax fraud detection
- Web classification: X = document, Y = {Sports, Science, News}
- Predicting Squirrel Hill residency: X = {drive to CMU, Rachel's fan, shop at SH Giant Eagle}, Y = {Resident, Not resident}
Classification
Goal: learn a prediction rule f: Features X -> Labels Y (e.g. X = document, Y = {Sports, Science, News}) with a small Probability of Error, P(f(X) \neq Y).
Classification
Optimal predictor (Bayes classifier): f^*(x) = \arg\max_y P(Y = y | X = x)
It depends on the unknown distribution P(X, Y).
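A short justification of why this rule is optimal: for any classifier f,
P(f(X) \neq Y \mid X = x) = 1 - P(Y = f(x) \mid X = x) \ge 1 - \max_y P(Y = y \mid X = x),
with equality when f(x) = \arg\max_y P(Y = y \mid X = x), so the Bayes classifier minimizes the probability of error.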
Classification algorithms
However, we can learn a good prediction rule from training data (X_1, Y_1), ..., (X_n, Y_n), assumed independent and identically distributed.
[Diagram: training data -> learning algorithm -> prediction rule]
So far:
- Decision Trees
- K-Nearest Neighbor
- Naïve Bayes
- Logistic Regression
Linear Regression
Aarti Singh
Machine Learning 10-701/15-781 Jan 27, 2010
Discrete to Continuous Labels
Classification:
- X = Document, Y = Topic (Sports / Science / News)
- X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell)
Regression:
- Stock Market Prediction: X = Feb 01 (date), Y = ? (a continuous value, e.g. the share price)
Regression Tasks
- Weather Prediction: X = 7 pm (time), Y = Temperature
- Estimating Contamination: X = new location, Y = sensor reading
Supervised Learning
Goal: learn a prediction rule f: X -> Y from training data.
- Classification (e.g. Y in {Sports, Science, News}): minimize the Probability of Error, P(f(X) \neq Y)
- Regression (e.g. X = Feb 01, Y = ?): minimize the Mean Squared Error, E[(f(X) - Y)^2]
Regression
Optimal predictor (Conditional Mean): f^*(x) = E[Y | X = x]
(Dropping subscripts for notational convenience.)
Intuition: Signal plus (zero-mean) noise model, Y = f^*(X) + \epsilon with E[\epsilon | X] = 0, so the best guess of Y given X is f^*(X).
It depends on the unknown distribution P(X, Y).
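A short derivation of why the conditional mean is optimal under mean squared error: for any predictor f,
E[(f(X) - Y)^2] = E[(f(X) - E[Y|X])^2] + E[(E[Y|X] - Y)^2],
because the cross term vanishes by the tower property; the first term is minimized (made zero) by choosing f(X) = E[Y|X].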
Regression algorithms
[Diagram: training data -> learning algorithm -> prediction rule]
- Linear Regression
- Lasso, Ridge Regression (Regularized Linear Regression)
- Nonlinear Regression, Kernel Regression
- Regression Trees, Splines, Wavelet estimators, ...
Empirical Risk Minimizer: \hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - Y_i)^2   (the empirical mean of the squared error over the training data)
Linear Regression
Least Squares Estimator: \hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - Y_i)^2, where \mathcal{F} is the class of linear functions.
Uni-variate case: f_\beta(X) = \beta_1 + \beta_2 X, where \beta_1 is the intercept and \beta_2 is the slope.
Multi-variate case: f_\beta(X) = X\beta, where X = [X^{(1)}, ..., X^{(p)}] is the row vector of features and \beta = (\beta_1, ..., \beta_p)^T is the vector of coefficients.
Least Squares Estimator
In matrix form, \hat{\beta} = \arg\min_\beta \frac{1}{n} \sum_{i=1}^{n} (X_i\beta - Y_i)^2 = \arg\min_\beta \frac{1}{n} \|A\beta - Y\|_2^2,
where A is the n x p matrix whose rows are the training inputs X_i and Y is the n x 1 vector of training labels.
Setting the gradient to zero, \nabla_\beta \|A\beta - Y\|_2^2 = 2A^T(A\beta - Y) = 0, yields the normal equations.
Normal Equations
(A^T A) \beta = A^T Y
(p x p)(p x 1) = (p x 1)
If A^T A is invertible, \hat{\beta} = (A^T A)^{-1} A^T Y.
- When is A^T A invertible? (Homework 2) Recall: full-rank matrices are invertible. What is the rank of A^T A?
- What if A^T A is not invertible? (Homework 2) Regularization (later).
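A minimal numpy sketch of this solution (the data A, Y below are synthetic; the variable names match the notation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
A = rng.normal(size=(n, p))                   # n x p design matrix (rows are training inputs)
beta_true = np.array([1.0, -2.0, 0.5])
Y = A @ beta_true + 0.1 * rng.normal(size=n)  # noisy labels

# Solve (A^T A) beta = A^T Y  (assumes A^T A is invertible)
beta_hat = np.linalg.solve(A.T @ A, A.T @ Y)

# Equivalent, and numerically preferable in practice:
beta_lstsq, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(beta_hat, beta_lstsq)
```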
Geometric Interpretation
Difference in prediction on the training set: Y - A\hat{\beta} = (I - A(A^T A)^{-1}A^T) Y
A\hat{\beta} is the orthogonal projection of Y onto the linear subspace spanned by the columns of A.
Revisiting Gradient Descent
Even when A^T A is invertible, computing (A^T A)^{-1} might be computationally expensive if A is huge.
Gradient Descent:
- Initialize: \beta^{(0)}
- Update: \beta^{(t+1)} = \beta^{(t)} - \eta \nabla_\beta J(\beta^{(t)}), where J(\beta) = \|A\beta - Y\|_2^2 and \nabla_\beta J(\beta) = 2A^T(A\beta - Y) (the gradient is 0 exactly when \beta solves the normal equations)
- Stop: when some criterion is met, e.g. a fixed number of iterations, or \|\nabla_\beta J(\beta^{(t)})\| < \epsilon.
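A minimal sketch of this update for J(\beta) = \|A\beta - Y\|_2^2, with a conservative constant step size chosen from the largest eigenvalue of A^T A (my own choice; the slides leave the step-size schedule open):

```python
import numpy as np

def gd_least_squares(A, Y, eta=None, iters=500, tol=1e-8):
    """Gradient descent on J(beta) = ||A beta - Y||^2."""
    n, p = A.shape
    if eta is None:
        # A safe constant step size: 1 / (largest eigenvalue of the Hessian 2 A^T A)
        eta = 1.0 / (2 * np.linalg.eigvalsh(A.T @ A).max())
    beta = np.zeros(p)
    for _ in range(iters):
        grad = 2 * A.T @ (A @ beta - Y)
        beta = beta - eta * grad
        if np.linalg.norm(grad) < tol:   # stopping criterion from the slide
            break
    return beta

# Example on the same kind of synthetic A, Y as in the previous sketch:
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
Y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(gd_least_squares(A, Y))
```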
Effect of step-size
- Large step size => fast convergence but larger residual error; oscillations are also possible
- Small step size => slow convergence but small residual error
When does Gradient Descent succeed?
The algorithm's view is myopic: it only uses local gradient information.
(Figure sources: http://www.ce.berkeley.edu/~bayen/, http://demonstrations.wolfram.com)
Guaranteed to converge to a local minimum (here the global minimum, since J is convex) if the step size is small enough.
Convergence in the jth eigen-direction of A^T A is geometric at a rate governed by \eta\lambda_j, so the overall convergence depends on the eigenvalue spread.
Least Squares and MLE
Intuition: Signal plus (zero-mean) noise model, Y = A\beta^* + \epsilon with \epsilon ~ N(0, \sigma^2 I).
Log likelihood: \log P(Y | A, \beta) = -\frac{1}{2\sigma^2} \|Y - A\beta\|_2^2 + \text{const}
The Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian noise model!
Regularized Least Squares and MAP
What if A^T A is not invertible?
MAP estimate: \hat{\beta}_{MAP} = \arg\max_\beta [\log P(Y | A, \beta) + \log P(\beta)]   (log likelihood + log prior)
I) Gaussian Prior: \beta ~ N(0, \tau^2 I)
\hat{\beta}_{MAP} = \arg\min_\beta \|Y - A\beta\|_2^2 + \lambda \|\beta\|_2^2   (Ridge Regression)
Closed form: HW. The prior belief that \beta is Gaussian with zero mean biases the solution toward small \|\beta\|.
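A minimal sketch of the ridge solution, assuming the closed form (A^T A + \lambda I)^{-1} A^T Y worked out in the homework; the data and \lambda below are hypothetical:

```python
import numpy as np

def ridge(A, Y, lam):
    """Ridge regression: argmin ||Y - A beta||^2 + lam * ||beta||^2."""
    p = A.shape[1]
    # (A^T A + lam I) is invertible for lam > 0 even when A^T A is not
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 50))   # p > n, so A^T A is singular
Y = rng.normal(size=20)
print(ridge(A, Y, lam=1.0)[:5])
```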
Regularized Least Squares and MAP
What if A^T A is not invertible?
MAP estimate: \hat{\beta}_{MAP} = \arg\max_\beta [\log P(Y | A, \beta) + \log P(\beta)]   (log likelihood + log prior)
II) Laplace Prior: P(\beta_i) \propto \exp(-|\beta_i| / b)
\hat{\beta}_{MAP} = \arg\min_\beta \|Y - A\beta\|_2^2 + \lambda \|\beta\|_1   (Lasso)
Closed form: HW. The prior belief that \beta is Laplace with zero mean biases the solution toward small \|\beta\|.
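For a general (non-orthogonal) design the lasso is usually solved numerically; here is a minimal sketch of one standard approach, proximal gradient descent with soft-thresholding (my choice of solver, not necessarily the homework's), on hypothetical sparse data:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, Y, lam, iters=2000):
    """argmin ||Y - A beta||^2 + lam * ||beta||_1 via proximal gradient (ISTA)."""
    L = 2 * np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of the smooth part
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = 2 * A.T @ (A @ beta - Y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))                    # high-dimensional: p > n
beta_true = np.zeros(100); beta_true[:5] = 3.0    # sparse ground truth
Y = A @ beta_true + 0.1 * rng.normal(size=50)
beta_hat = lasso_ista(A, Y, lam=1.0)
print(np.sum(np.abs(beta_hat) > 1e-6), "nonzero coordinates")  # sparse solution
```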
Ridge Regression vs Lasso
Ridge Regression: \hat{\beta} = \arg\min_\beta \|Y - A\beta\|_2^2 + \lambda \|\beta\|_2^2
Lasso: \hat{\beta} = \arg\min_\beta \|Y - A\beta\|_2^2 + \lambda \|\beta\|_1   (a HOT topic!)
Ideally one would use an l0 penalty (the number of nonzero coefficients), but the optimization then becomes non-convex.
[Figure: \beta's with constant J(\beta) (level sets of J) intersecting \beta's with constant l2 norm, constant l1 norm, and constant l0 norm.]
Lasso (l1 penalty) results in sparse solutions, i.e. a coefficient vector with more zero coordinates. Good for high-dimensional problems: you don't have to store all the coordinates!
Beyond Linear Regression
- Polynomial regression
- Regression with nonlinear features/basis functions
- Kernel regression: local/weighted regression
- Regression trees: spatially adaptive regression
Polynomial Regression
Univariate case: f_\beta(X) = \beta_1 + \beta_2 X + \beta_3 X^2 + ... + \beta_m X^{m-1}, where the \beta_j are the weights of each feature and the nonlinear features are the powers 1, X, X^2, ..., X^{m-1}.
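A minimal sketch: polynomial regression is ordinary least squares on the nonlinear features 1, X, X^2, ...; the degree and data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, size=50))
Y = np.sin(3 * X) + 0.1 * rng.normal(size=50)     # a nonlinear target

degree = 5
Phi = np.vander(X, degree + 1, increasing=True)   # columns: 1, X, X^2, ..., X^degree
beta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)    # ordinary least squares on the features
Y_hat = Phi @ beta
print("training MSE:", np.mean((Y_hat - Y) ** 2))
```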
Nonlinear Regression
f_\beta(X) = \sum_j \beta_j \phi_j(X), with basis coefficients \beta_j and nonlinear features/basis functions \phi_j(X).
- Fourier basis: a good representation for oscillatory functions
- Wavelet basis: a good representation for functions localized at multiple scales
Local Regression
f_\beta(X) = \sum_j \beta_j \phi_j(X), with basis coefficients \beta_j and nonlinear features/basis functions \phi_j(X).
Globally supported basis functions (polynomial, Fourier) will not yield a good representation for functions with local structure.
Kernel Regression (Local)
Weighted Least Squares: weigh each training point based on its distance to the test point X,
w_i(X) = K\!\left(\frac{\|X_i - X\|}{h}\right),
where K is the kernel and h is the bandwidth of the kernel.
Nadaraya-Watson Kernel Regression
Fit a local constant:
\hat{f}_n(X) = \frac{\sum_{i=1}^{n} K\!\left(\frac{\|X_i - X\|}{h}\right) Y_i}{\sum_{i=1}^{n} K\!\left(\frac{\|X_i - X\|}{h}\right)}
With the box-car kernel, the denominator is the number of points in the h-ball around X and the numerator is the sum of the Y_i's in the h-ball around X, i.e. a local average.
Recall the NN classifier: averaging <-> majority vote.
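A minimal sketch of the Nadaraya-Watson estimator with a box-car kernel (a Gaussian kernel is included as an alternative; the data and bandwidth are hypothetical):

```python
import numpy as np

def boxcar(u):
    return (np.abs(u) <= 1).astype(float)

def gaussian(u):
    return np.exp(-0.5 * u ** 2)

def nadaraya_watson(x, X_train, Y_train, h, K=boxcar):
    """Local average: sum_i K(|X_i - x| / h) Y_i / sum_i K(|X_i - x| / h)."""
    w = K(np.abs(X_train - x) / h)
    if w.sum() == 0:                  # no training points fall in the window
        return np.nan
    return np.sum(w * Y_train) / np.sum(w)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=200)
Y_train = np.sin(2 * np.pi * X_train) + 0.2 * rng.normal(size=200)
print(nadaraya_watson(0.3, X_train, Y_train, h=0.05))             # box-car kernel
print(nadaraya_watson(0.3, X_train, Y_train, h=0.05, K=gaussian)) # Gaussian kernel
```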
Choice of Bandwidth
The bandwidth h should depend on n, the number of training data (it determines the variance), and on the smoothness of the function (it determines the bias).
- Large bandwidth: averages more data points, reduces noise (lower variance)
- Small bandwidth: less smoothing, more accurate local fit (lower bias)
Bias-variance tradeoff: more to come in later lectures.
Spatially adaptive regression
If the function's smoothness varies spatially, we want to allow the bandwidth h to depend on X.
Examples: local polynomials, splines, wavelets, regression trees.
Regression trees
Binary decision tree: split on the features (e.g. "Num Children >= 2?" vs. "< 2"); average (fit a constant) on the leaves.
Regression trees
Quad decision tree: recursively split the input space; fit a polynomial on each leaf.
Splitting rule: compare the residual error with and without the split; if the split reduces the error enough, then split, else stop.
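A minimal sketch of a one-dimensional regression tree that fits a constant on each leaf and splits only when the residual error drops enough (the gain threshold and minimum leaf size are my own choices):

```python
import numpy as np

def build_tree(X, Y, min_gain=0.01, min_leaf=5):
    """Recursively split; fit a constant (the mean of Y) on each leaf."""
    node = {"value": Y.mean()}
    err = np.sum((Y - Y.mean()) ** 2)           # residual error without a split
    best = None
    for s in np.unique(X)[:-1]:                 # candidate split points
        left, right = Y[X <= s], Y[X > s]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        split_err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or split_err < best[1]:
            best = (s, split_err)
    # Split only if the error reduction is large enough; else stop (leaf node)
    if best is not None and err - best[1] > min_gain * err:
        s = best[0]
        node["split"] = s
        node["left"] = build_tree(X[X <= s], Y[X <= s], min_gain, min_leaf)
        node["right"] = build_tree(X[X > s], Y[X > s], min_gain, min_leaf)
    return node

def predict(tree, x):
    while "split" in tree:
        tree = tree["left"] if x <= tree["split"] else tree["right"]
    return tree["value"]

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
Y = np.where(X < 0.5, 0.0, 1.0) + 0.1 * rng.normal(size=200)   # piecewise-constant target
tree = build_tree(X, Y)
print(predict(tree, 0.2), predict(tree, 0.8))
```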
Summary
Discriminative vs Generative Classifiers
- Naïve Bayes vs Logistic Regression
Regression
- Linear Regression: Least Squares Estimator, Normal Equations, Gradient Descent, Geometric Interpretation, Probabilistic Interpretation (connection to MLE)
- Regularized Linear Regression (connection to MAP): Ridge Regression, Lasso
- Polynomial Regression, Basis (Fourier, Wavelet) Estimators
- Kernel Regression (Localized)
- Regression Trees