Reference Sheet for the Course
EC714PE: Pattern Recognition and Machine Learning

Dr. M. Fayazur Rahaman,
Dept. of ECE, MGIT, Hyderabad
(Last Updated: September 25, 2024)

References:
• Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
• Course Lecture Slides
Unit-01

1 Introduction to Pattern Recognition

1.1 Types of Learning
i. Supervised Learning
ii. Unsupervised Learning
iii. Reinforcement Learning

1.2 Example: Polynomial Curve Fitting
• Simple regression problem: predicting a real-valued target variable t based on a real-valued input variable x.
• A simple linear model using a polynomial function of order M:

  y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

• Error function (sum of squared errors) to be minimized:

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

1.2.1 Overfitting
• As M increases, the magnitude of the polynomial coefficients grows larger, resulting in large variance in the predictions.
• Overfitting decreases as the data set size increases.

1.2.2 Generalization
• The goal is to generalize well to new data (the test set).

1.2.3 Regularization
• Regularization controls overfitting by adding a penalty term to the error function, discouraging large coefficient values.
• Ridge regression uses a quadratic regularizer to penalize large coefficients:

  \tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \| w \|^2
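A minimal NumPy sketch of the two fits above (not part of the original sheet): the polynomial model fitted by plain least squares and by the ridge-regularized criterion. The data and the value of λ are made-up example choices.

```python
import numpy as np

# Illustrative sketch: fit a polynomial of order M by minimizing the
# sum-of-squares error, optionally with the ridge penalty (lambda/2)*||w||^2.
def fit_polynomial(x, t, M, lam=0.0):
    # Design matrix with columns x^0, x^1, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    # Regularized normal equations: (lam*I + Phi^T Phi) w = Phi^T t
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
w_ls = fit_polynomial(x, t, M=9)               # prone to overfitting for M = 9
w_ridge = fit_polynomial(x, t, M=9, lam=1e-3)  # penalized coefficients stay small
```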
1.3 Probability Theory

1.3.1 The Rules of Probability
• Sum Rule:

  p(X) = \sum_{Y} p(X, Y)

• Product Rule:

  p(X, Y) = p(Y|X) \, p(X)

• Bayes' Theorem:

  p(Y|X) = \frac{p(X|Y) \, p(Y)}{p(X)}

  Bayes' theorem expresses the posterior probability in terms of the likelihood, the prior probability, and the evidence.

1.3.2 Probability Densities
• Probability density function:

  p(x \in (a, b)) = \int_{a}^{b} p(x) \, dx, \qquad p(x) \geq 0, \qquad \int_{-\infty}^{\infty} p(x) \, dx = 1

• Sum rule and product rule for densities:

  p(x) = \int p(x, y) \, dy, \qquad p(x, y) = p(y|x) \, p(x)

1.3.3 Expectations and Covariances
• Expectation of a random variable x:
  ◦ Discrete distribution: E[x] = \sum_{x} p(x) \, x
  ◦ Continuous distribution: E[x] = \int p(x) \, x \, dx
  ◦ For N points drawn from the distribution: E[x] \approx \frac{1}{N} \sum_{n=1}^{N} x_n
• Variance of a random variable x:

  \mathrm{var}[x] = E\left[ (x - E[x])^2 \right] = E[x^2] - E[x]^2

• Covariance between two random variables x and y:

  \mathrm{cov}[x, y] = E_{x,y}\left[ (x - E[x])(y - E[y]) \right] = E_{x,y}[xy] - E[x] \, E[y]

  If x and y are independent, their covariance is 0.
1.3.4 Bayesian Probabilities
• Bayes' theorem: posterior ∝ likelihood × prior.
• Prior and posterior uncertainty in the model parameters w, incorporating evidence from the observed data D:

  p(w|D) = \frac{p(D|w) \, p(w)}{p(D)}

  where:
  ◦ p(w|D) is the posterior probability
  ◦ p(D|w) is the likelihood
  ◦ p(w) is the prior probability
  ◦ p(D) is the evidence

1.3.5 The Gaussian Distribution
• Gaussian distribution for a single real-valued variable x:

  \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
• Multivariate Gaussian distribution for a D-dimensional vector x:

  \mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

  ◦ \mu is the mean vector.
  ◦ \Sigma is the covariance matrix.
• Likelihood function of a data set x = (x_1, \ldots, x_N)^T drawn from the given Gaussian distribution:

  p(x | \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n | \mu, \sigma^2)

• Log-likelihood function:

  \ln p(x | \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)

• Maximum likelihood solution for the mean \mu:

  \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad E[\mu_{ML}] = \mu

• Maximum likelihood solution for the variance \sigma^2:

  \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2, \qquad E[\sigma^2_{ML}] = \frac{N-1}{N} \sigma^2
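A small illustrative sketch (not part of the sheet) of the maximum likelihood estimates above, including the bias of the ML variance; the sample is made up.

```python
import numpy as np

# ML estimates for a univariate Gaussian: mu_ML, the biased sigma^2_ML, and
# the usual N/(N-1) bias correction.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=50)
N = x.size

mu_ml = x.mean()                              # mu_ML = (1/N) sum_n x_n
sigma2_ml = ((x - mu_ml) ** 2).mean()         # biased: E[sigma2_ml] = (N-1)/N * sigma^2
sigma2_unbiased = sigma2_ml * N / (N - 1)     # bias-corrected estimate
```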
1.3.6 Curve Fitting revisited
• Curve fitting problem: predict the target variable t given a new input value x, based on a set of training data input values x = (x_1, \ldots, x_N)^T and their corresponding target values t = (t_1, \ldots, t_N)^T.
• Model assumption: given the input x, the corresponding target value t is assumed to follow a Gaussian distribution with mean equal to the value of the polynomial curve y(x, w):

  p(t | x, w, \beta) = \mathcal{N}\left( t \,|\, y(x, w), \beta^{-1} \right)

  where \beta is the precision parameter.
• Predictive distribution with the parameters determined by MLE:

  p(t | x, w_{ML}, \beta_{ML}) = \mathcal{N}\left( t \,|\, y(x, w_{ML}), \beta_{ML}^{-1} \right)

• Following the Bayesian approach:
  ◦ Assuming a prior distribution over w of the form

    p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left( -\frac{\alpha}{2} w^T w \right)

  ◦ the posterior distribution is

    p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)

  ◦ The maximum a posteriori (MAP) estimate of w leads to minimizing the regularized sum-of-squares error function:

    -\log p(w | t) = \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\alpha}{2} w^T w

1.3.7 Bayesian Curve fitting
• The predictive distribution is given by

  p(t | x, x, t) = \int p(t | x, w) \, p(w | x, t) \, dw = \mathcal{N}\left( t \,|\, m(x), s^2(x) \right)

  where:
  ◦ Mean:

    m(x) = \beta \, \phi(x)^T S \sum_{n=1}^{N} \phi(x_n) t_n

  ◦ Variance:

    s^2(x) = \beta^{-1} + \phi(x)^T S \, \phi(x)

  ◦ The covariance matrix S is given by

    S^{-1} = \alpha I + \beta \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T

    where \phi(x) is the feature vector with elements \phi_i(x) = x^i for i = 0, \ldots, M.
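An illustrative sketch of the Bayesian curve-fitting predictive distribution above, with polynomial features \phi_i(x) = x^i. The values of M, α and β and the data are assumed example choices, not from the sheet.

```python
import numpy as np

# Bayesian curve fitting: predictive mean m(x) and variance s^2(x), with
# S^-1 = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T.
def bayes_curve_fit(x_train, t_train, M=9, alpha=5e-3, beta=11.1):
    Phi = np.vander(x_train, M + 1, increasing=True)      # rows phi(x_n)^T
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)

    def predict(x_new):
        phi = np.vander(np.atleast_1d(x_new), M + 1, increasing=True)
        mean = beta * phi @ S @ Phi.T @ t_train            # m(x)
        var = 1.0 / beta + np.sum((phi @ S) * phi, axis=1) # s^2(x)
        return mean, var

    return predict

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
mean, var = bayes_curve_fit(x, t)(0.5)
```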
2π 2 = p(x, C2 ) dx + p(x, C1 ) dx.
R1 R2
◦ Polynomial basis functions: φj ( x ) = x j
◦ Posterior distribution is D D D D D D
∑ wi xi + ∑ ∑ wij xi x j + ∑ ∑ ∑ wijk xi x j xk .
!
y(x, w) = w0 + ( x − µ j )2
p(w|x, t, α, β) ∝ p(t|x, w, β) p(w|α) ◦ Gaussian basis functions: φj ( x ) = exp −
i =1 i =1 j =1 i =1 j =1 k =1 2s2
◦ Maximum A posteriori (MAP) estimate of w leads to minimizing 
x −µ j

the regularized sum-of-squares error function ◦ The number of independent coefficients w grows proportionally to ◦ Sigmoidal basis functions: φj ( x) = σ s with σ( a) =
D3 1
β N α ◦ For a polynomial of order M, the number of coefficients grows like • Optimal Decision Rule: if p(x, C1 ) > p(x, C2 ) for a given value of x,
2 n∑
− log p(w|t) = { y ( x n , w ) − t n }2 + w T w 1+exp(− a)
=1
2 DM then x should be assigned to class C1 . ◦ Fourier basis functions
Unit-02

1 Linear Models for Regression
• With a training data set \{x_n\}_{n=1}^{N} and corresponding target values \{t_n\}, the goal is to predict t for new values of x.
• Predictions can be made by constructing an appropriate function y(x) or by modeling the predictive distribution p(t | x).

1.1 Linear Basis Function Models
• The simplest linear model:

  y(x, w) = w_0 + w_1 x_1 + \cdots + w_D x_D

  where x = (x_1, \ldots, x_D)^T.
• Extend the model by using basis functions \phi_j(x):

  y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)

• Using the basis function \phi_0(x) = 1, the model can be compactly written as

  y(x, w) = w^T \phi(x)

  where w = (w_0, \ldots, w_{M-1})^T and \phi = (\phi_0, \ldots, \phi_{M-1})^T.
• Examples of basis functions:
  ◦ Polynomial basis functions: \phi_j(x) = x^j
  ◦ Gaussian basis functions: \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
  ◦ Sigmoidal basis functions: \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right) with \sigma(a) = \frac{1}{1 + \exp(-a)}
  ◦ Fourier basis functions
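An illustrative sketch (not from the sheet) of building a design matrix with the Gaussian basis functions listed above; the centres \mu_j and width s are assumed example values.

```python
import numpy as np

# Design matrix Phi for Gaussian basis functions, with phi_0(x) = 1 as the
# bias basis function.
def gaussian_design_matrix(x, centres, s):
    Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), Phi])   # prepend phi_0 = 1

x = np.linspace(0, 1, 25)
centres = np.linspace(0, 1, 9)
Phi = gaussian_design_matrix(x, centres, s=0.1)       # shape (N, M) with M = 10
```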

1.1.1 Maximum likelihood and least squares
• The target variable t is modeled as

  t = y(x, w) + \epsilon

  where \epsilon is Gaussian noise with zero mean and precision \beta.
• The conditional distribution of t given x and w is

  p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})

• For a data set X = \{(x_n, t_n)\}_{n=1}^{N}, the likelihood function is

  p(t | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})

• Logarithm of the likelihood function:

  \ln p(t | w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

  where the sum-of-squares error function is

  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2

• The gradient of the log likelihood function is

  \nabla \ln p(t | w, \beta) = \beta \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right) \phi(x_n)^T

• Setting this gradient to zero and solving for w:

  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t = \Phi^{\dagger} t

  where \Phi is the N \times M design matrix and \Phi^{\dagger} is the Moore-Penrose pseudo-inverse of \Phi.
• The design matrix \Phi is

  \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}

• Maximizing the log likelihood with respect to the noise precision \beta gives

  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - w_{ML}^T \phi(x_n) \right)^2
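A minimal sketch of the maximum likelihood solution above; Φ and t are assumed inputs (for example, the Gaussian-basis design matrix sketched earlier).

```python
import numpy as np

# w_ML via the Moore-Penrose pseudo-inverse, and beta_ML from the residuals.
def max_likelihood_fit(Phi, t):
    w_ml = np.linalg.pinv(Phi) @ t              # w_ML = (Phi^T Phi)^-1 Phi^T t
    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals ** 2)     # 1/beta_ML = (1/N) sum residual^2
    return w_ml, beta_ml
```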
1.1.2 Geometry of least squares
• The least-squares solution for w corresponds to the vector y in the subspace S that is closest to t.
• This solution corresponds to the orthogonal projection of t onto the subspace S.

1.1.3 Sequential learning
• The stochastic gradient descent algorithm updates the parameter vector w as follows:

  w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

  where \tau denotes the iteration number and \eta is a learning rate parameter.
• For the sum-of-squares error function, this update rule becomes

  w^{(\tau+1)} = w^{(\tau)} + \eta \left( t_n - w^{(\tau)T} \phi_n \right) \phi_n
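A minimal sketch of the sequential (LMS) update above: one stochastic gradient step per presented pattern. Φ and t are assumed inputs; the learning rate and epoch count are made-up choices.

```python
import numpy as np

# Sequential least-mean-squares updates: w <- w + eta (t_n - w^T phi_n) phi_n
def lms(Phi, t, eta=0.05, n_epochs=100):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for phi_n, t_n in zip(Phi, t):
            w = w + eta * (t_n - w @ phi_n) * phi_n
    return w
```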
1.1.4 Regularized least squares
• The total error function to be minimized is given by

  E(w) = E_D(w) + \lambda E_W(w)

  where \lambda is the regularization coefficient, E_D(w) is the data-dependent error, and E_W(w) is the regularization term.
• A simple regularizer:

  E_W(w) = \frac{1}{2} w^T w

• With the sum-of-squares error function, the total error function becomes

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} w^T w

• The exact minimizer can be found in closed form:

  w = \left( \lambda I + \Phi^T \Phi \right)^{-1} \Phi^T t

• A more general regularizer takes the form

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

• The case q = 1 is known as the lasso; it leads to a sparse model in which some coefficients w_j are driven to zero.

1.1.5 Multiple outputs
• Goal: predict multiple target variables collected in a K-dimensional target vector t.
• A common model:

  y(x, w) = W^T \phi(x)

  where:
  ◦ y is a K-dimensional column vector,
  ◦ W is an M \times K matrix of parameters, and
  ◦ \phi(x) is an M-dimensional column vector of basis functions.
• The conditional distribution of the target vector is assumed to be an isotropic Gaussian:

  p(t | x, W, \beta) = \mathcal{N}\left( t \,|\, W^T \phi(x), \beta^{-1} I \right)

• For a set of observations t_1, \ldots, t_N, the log-likelihood function is

  \ln p(T | X, W, \beta) = \frac{NK}{2} \ln \frac{\beta}{2\pi} - \frac{\beta}{2} \sum_{n=1}^{N} \| t_n - W^T \phi(x_n) \|^2

• Maximizing this log-likelihood function with respect to W yields

  W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T

1.2 The Bias-variance Decomposition
• The expected squared loss can be decomposed into three terms:

  E[L] = \text{bias}^2 + \text{variance} + \text{noise}

• Flexible models tend to have low bias and high variance, while rigid models have high bias and low variance.

1.3 Bayesian Linear Regression
• A Bayesian approach to linear regression helps avoid the over-fitting associated with maximum likelihood and provides an automatic method for determining model complexity using only the training data.

1.3.1 Parameter distribution
• The prior probability distribution over the model parameters w is Gaussian, given by

  p(w) = \mathcal{N}(w | m_0, S_0)

  where m_0 is the mean and S_0 is the covariance.
• The posterior distribution is proportional to the product of the likelihood function and the prior, and is

  p(w | t) = \mathcal{N}(w | m_N, S_N)

  where

  m_N = S_N \left( S_0^{-1} m_0 + \beta \Phi^T t \right), \qquad S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi

• The maximum posterior weight vector is

  w_{MAP} = m_N

• Consider a zero-mean isotropic Gaussian prior governed by a single precision parameter \alpha,

  p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)

• The corresponding posterior distribution has

  m_N = \beta S_N \Phi^T t, \qquad S_N^{-1} = \alpha I + \beta \Phi^T \Phi

• The log of the posterior distribution is

  \ln p(w | t) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 - \frac{\alpha}{2} w^T w + \text{const.}

• Sequential updates of the posterior distribution can be illustrated with a simple example involving straight-line fitting.
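A minimal sketch of the parameter posterior above for the zero-mean isotropic prior; the values of α and β are assumed to be known and are made-up example choices.

```python
import numpy as np

# Posterior over w: S_N^-1 = alpha*I + beta*Phi^T Phi, m_N = beta*S_N*Phi^T t.
def posterior(Phi, t, alpha=2.0, beta=25.0):
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t        # also the MAP weight vector w_MAP
    return m_N, S_N
```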
1.3.2 Predictive distribution
• The goal is to predict the target variable t for new input values x.
• The predictive distribution is given by

  p(t | t, \alpha, \beta) = \int p(t | w, \beta) \, p(w | t, \alpha, \beta) \, dw

  where t represents the vector of target values from the training set.
• This results in

  p(t | x, t, \alpha, \beta) = \mathcal{N}\left( t \,|\, m_N^T \phi(x), \sigma_N^2(x) \right)

• The variance \sigma_N^2(x) is

  \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)

• The level of predictive uncertainty depends on x and decreases near the data points.

1.3.3 Equivalent Kernel
• The predictive mean can be expressed as

  y(x, m_N) = m_N^T \phi(x) = \beta \phi(x)^T S_N \Phi^T t = \sum_{n=1}^{N} \beta \phi(x)^T S_N \phi(x_n) t_n = \sum_{n=1}^{N} k(x, x_n) t_n

  where the equivalent kernel k(x, x_n) is given by

  k(x, x') = \beta \phi(x)^T S_N \phi(x')

• The equivalent kernel is localized around x, meaning that predictions at x are weighted more by data points close to x than by distant points.
• The covariance between predictions at x and x' is given by

  \mathrm{cov}[y(x), y(x')] = \beta^{-1} k(x, x')

  indicating that nearby predictions are more strongly correlated.
• The equivalent kernel ensures that the weights for combining the training set target values sum to one:

  \sum_{n=1}^{N} k(x, x_n) = 1
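An illustrative sketch of the predictive distribution and the equivalent kernel above, reusing the m_N and S_N from the preceding sketch; the value of β is again an assumed example choice.

```python
import numpy as np

# Predictive mean/variance and the equivalent kernel k(x, x_n).
def predictive(phi_new, m_N, S_N, beta=25.0):
    mean = phi_new @ m_N                                           # m_N^T phi(x)
    var = 1.0 / beta + np.sum((phi_new @ S_N) * phi_new, axis=1)   # sigma_N^2(x)
    return mean, var

def equivalent_kernel(phi_new, Phi_train, S_N, beta=25.0):
    # k(x, x_n) = beta * phi(x)^T S_N phi(x_n); each row sums to ~1 over the
    # training points.
    return beta * phi_new @ S_N @ Phi_train.T
```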
1.4 Bayesian Model Comparison
• Suppose we have a set of L models \{\mathcal{M}_i\}, i = 1, \ldots, L. The posterior distribution over models is given by

  p(\mathcal{M}_i | \mathcal{D}) \propto p(\mathcal{M}_i) \, p(\mathcal{D} | \mathcal{M}_i)

• For a model with a single parameter w, if the posterior is sharply peaked around w_{MAP}, the model evidence can be approximated by

  p(\mathcal{D}) = \int p(\mathcal{D} | w) \, p(w) \, dw \approx p(\mathcal{D} | w_{MAP}) \, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}

  where \Delta w_{\text{posterior}} is the width of the posterior peak and \Delta w_{\text{prior}} is the width of the prior.
• Taking the logarithm:

  \ln p(\mathcal{D}) \approx \ln p(\mathcal{D} | w_{MAP}) + \ln \left( \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}} \right)

• Bayesian model comparison generally favors models of intermediate complexity, avoiding both over-fitting and under-fitting.
2 Linear Models for Classification
• The goal in classification is to assign an input vector x to one of K discrete classes C_k, where k = 1, \ldots, K.
• For two-class problems, a binary target t \in \{0, 1\} is often used, where t = 1 represents class C_1 and t = 0 represents class C_2.
• For K > 2 classes, a 1-of-K coding scheme is used; for example, for K = 5 a pattern from class 2 would be

  t = (0, 1, 0, 0, 0)^T

• Three approaches to classification are discussed:
  I. Directly constructing a discriminant function that assigns each x to a specific class.
  II. Modeling the conditional probability distribution p(C_k | x).
  III. A generative approach in which the class-conditional densities p(x | C_k) and prior probabilities p(C_k) are modeled, and then Bayes' theorem is applied:

  p(C_k | x) = \frac{p(x | C_k) \, p(C_k)}{p(x)}

• For classification, we predict discrete class labels or posterior probabilities, which lie in (0, 1). This is done by transforming the linear function of w using a nonlinear function f(\cdot):

  y(x) = f(w^T x + w_0)

• Here, f(\cdot) is called an activation function in machine learning.

2.1 Probabilistic Generative Models
• Two-Class Case:
  ◦ The posterior probability for class C_1 is expressed as

    p(C_1 | x) = \frac{p(x | C_1) \, p(C_1)}{p(x | C_1) \, p(C_1) + p(x | C_2) \, p(C_2)} = \sigma(a)

  ◦ This can be rewritten using the logistic sigmoid function \sigma(a), where

    a = \ln \frac{p(x | C_1) \, p(C_1)}{p(x | C_2) \, p(C_2)}

  ◦ The logistic sigmoid function is defined as

    \sigma(a) = \frac{1}{1 + \exp(-a)}

  ◦ The logistic sigmoid maps the entire real axis to the finite interval (0, 1). It satisfies the symmetry property

    \sigma(-a) = 1 - \sigma(a)

  ◦ The inverse of the logistic sigmoid is the logit function:

    a = \ln \left( \frac{\sigma}{1 - \sigma} \right)

• Multiclass Case:
  ◦ For K > 2 classes, the posterior probability p(C_k | x) is given by the normalized exponential, or softmax, function:

    p(C_k | x) = \frac{p(x | C_k) \, p(C_k)}{\sum_{j=1}^{K} p(x | C_j) \, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}

  ◦ Here, the quantities a_k are defined as

    a_k = \ln p(x | C_k) \, p(C_k)
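An illustrative sketch (not from the sheet) of turning class-conditional densities and priors into posteriors via the softmax of a_k = \ln p(x|C_k) p(C_k); for K = 2 this reduces to the logistic sigmoid of the log-ratio a. The inputs here are made-up log-density values.

```python
import numpy as np

# Posterior class probabilities from log class-conditional densities and priors.
def class_posteriors(log_densities, priors):
    a = log_densities + np.log(priors)     # a_k = ln p(x|C_k) + ln p(C_k)
    a = a - a.max()                        # stabilize the exponentials
    e = np.exp(a)
    return e / e.sum()                     # softmax (normalized exponential)

posteriors = class_posteriors(np.array([-1.2, -2.3, -0.7]),
                              np.array([0.3, 0.3, 0.4]))
```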
2.1.1 Continuous inputs
• The density for class C_k is given by

  p(x | C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

• Two-Class Case:
  ◦ The posterior probability for class C_1 is given by

    p(C_1 | x) = \sigma(w^T x + w_0)

  ◦ Here, w and w_0 are defined as

    w = \Sigma^{-1} (\mu_1 - \mu_2)

    w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}

• General Case of K Classes:
  ◦ For K classes, the quantity a_k(x) is given by

    a_k(x) = w_k^T x + w_{k0}

  ◦ Here, w_k and w_{k0} are defined as

    w_k = \Sigma^{-1} \mu_k, \qquad w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)

• Relaxing the Shared Covariance Matrix Assumption:
  ◦ With class-specific covariance matrices the decision boundaries become quadratic rather than linear; linear and quadratic decision boundaries are illustrated in Figure 4.11.

2.1.2 Maximum likelihood solution
Two-Class Case:
• Assume Gaussian class-conditional densities with a shared covariance matrix. The data set is denoted as \{x_n, t_n\}.
• The prior class probabilities are denoted as p(C_1) = \pi and p(C_2) = 1 - \pi.
  ◦ For a data point x_n from class C_1 we have t_n = 1 and hence

    p(x_n, C_1) = p(C_1) \, p(x_n | C_1) = \pi \, \mathcal{N}(x_n | \mu_1, \Sigma)

  ◦ For a data point x_n from class C_2 we have t_n = 0 and hence

    p(x_n, C_2) = p(C_2) \, p(x_n | C_2) = (1 - \pi) \, \mathcal{N}(x_n | \mu_2, \Sigma)

• The likelihood function is given by

  p(t | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left[ \pi \, \mathcal{N}(x_n | \mu_1, \Sigma) \right]^{t_n} \left[ (1 - \pi) \, \mathcal{N}(x_n | \mu_2, \Sigma) \right]^{1 - t_n}

• Maximization with respect to \pi:
  ◦ The log likelihood terms dependent on \pi are

    \sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\}

  ◦ Setting the derivative with respect to \pi equal to zero, we obtain

    \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N}

• Maximization with respect to \mu_1:
  ◦ The log likelihood terms dependent on \mu_1 are

    -\frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \text{const.}

  ◦ Setting the derivative with respect to \mu_1 equal to zero, we obtain

    \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n

  ◦ Similarly, we obtain

    \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n

• Maximization with respect to the shared covariance matrix \Sigma:
  ◦ The log likelihood terms dependent on \Sigma are

    -\frac{N}{2} \ln |\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu)

    (with \mu taken as the mean of the class to which x_n belongs).
  ◦ The sample covariance matrices S_1 and S_2 for classes C_1 and C_2 are defined as

    S_1 = \frac{1}{N_1} \sum_{n \in C_1} (x_n - \mu_1)(x_n - \mu_1)^T, \qquad S_2 = \frac{1}{N_2} \sum_{n \in C_2} (x_n - \mu_2)(x_n - \mu_2)^T

  ◦ The maximum likelihood estimate for \Sigma is

    \Sigma = \frac{N_1 S_1 + N_2 S_2}{N}
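A minimal sketch of the maximum likelihood solution above for two Gaussian classes with a shared covariance, and the resulting parameters (w, w_0) of the posterior p(C_1|x) = σ(w^T x + w_0). X is assumed to be an (N, D) array with D ≥ 2 and t a 0/1 integer label vector (t = 1 for class C_1).

```python
import numpy as np

# Fit pi, mu1, mu2 and the shared Sigma by maximum likelihood, then form w, w0.
def fit_gaussian_generative(X, t):
    N1, N2 = t.sum(), (1 - t).sum()
    pi = N1 / len(t)
    mu1 = X[t == 1].mean(axis=0)
    mu2 = X[t == 0].mean(axis=0)
    S1 = np.cov(X[t == 1].T, bias=True)         # (1/N1) sum (x - mu1)(x - mu1)^T
    S2 = np.cov(X[t == 0].T, bias=True)
    Sigma = (N1 * S1 + N2 * S2) / len(t)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return pi, mu1, mu2, Sigma, w, w0
```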
2.1.3 Discrete features
• Binary Feature Values:
  ◦ Consider discrete feature values x_i \in \{0, 1\} for simplicity.
  ◦ Assuming the features to be independent (the naive Bayes assumption), the class-conditional distribution takes the form

    p(x | C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}

• Linear Functions of Input Values:
  ◦ Substituting the naive Bayes model into the expression for a_k(x), we obtain

    a_k(x) = \sum_{i=1}^{D} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(C_k)

• Two-Class Case (K = 2):
  ◦ For the case of K = 2 classes, we can use the logistic sigmoid formulation.

2.1.4 Exponential family
• Exponential Family of Distributions:
  ◦ Consider the general form of the exponential family distribution:

    p(x | \lambda_k, s) = \frac{1}{s} h\!\left( \frac{1}{s} x \right) g(\lambda_k) \exp\left( \frac{1}{s} \lambda_k^T x \right)

  ◦ Here, x is the feature vector, \lambda_k is the parameter vector for class C_k, and s is a scaling parameter.
• Two-Class Case (K = 2):
  ◦ For K = 2 classes, substituting into the expression for the posterior probability gives

    p(C_1 | x) = \sigma(a(x))

    where a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2).
• K-Class Case:
  ◦ For K > 2 classes, the posterior probability for class C_k is governed by

    a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)

  ◦ Again, the function a_k(x) is linear in x.

2.2 Probabilistic Discriminative Models

2.2.1 Fixed basis functions
• Classification models can apply a fixed nonlinear transformation to the input vector x using basis functions \phi(x).
• Classes that are linearly separable in the feature space \phi(x) may not be linearly separable in the original input space x.

2.2.2 Logistic regression
• The posterior probability of class C_1 is written as a logistic sigmoid function of a linear combination of features:

  p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi)

Maximum Likelihood for Logistic Regression
• The parameters are determined using maximum likelihood estimation (MLE), making use of the derivative of the logistic sigmoid,

  \frac{d\sigma}{da} = \sigma (1 - \sigma)

• Likelihood function for a data set \{\phi_n, t_n\}, where t_n \in \{0, 1\} and \phi_n = \phi(x_n):

  p(t | w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}

  where t = (t_1, \ldots, t_N)^T and y_n = p(C_1 | \phi_n).
• The corresponding cross-entropy error function is

  E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \left( t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right)

Gradient of the Error Function
• Gradient of the error function:

  \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \, \phi_n
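A minimal sketch of maximum likelihood for logistic regression by gradient descent on the cross-entropy error, using the gradient above. Φ and t are assumed inputs; the step size and iteration count are made-up choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Gradient descent on E(w) with gradient sum_n (y_n - t_n) phi_n.
def logistic_regression_gd(Phi, t, eta=0.1, n_iters=1000):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)          # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)        # gradient of the cross-entropy error
        w -= eta * grad / len(t)      # step scaled by N for stability
    return w
```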
2.2.3 Iterative reweighted least squares
• The maximum likelihood solution can be obtained using the Newton-Raphson iterative optimization technique.

Newton-Raphson Method
• The Newton-Raphson update for minimizing a function E(w) is

  w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)

• H is the Hessian matrix (the second derivatives of E(w)).

Newton-Raphson for Linear Regression
• In linear regression, the gradient and Hessian are

  \nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n) \phi_n = \Phi^T \Phi w - \Phi^T t, \qquad H = \nabla \nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi

• The Newton-Raphson update gives the exact least-squares solution:

  w^{(\text{new})} = w^{(\text{old})} - (\Phi^T \Phi)^{-1} \left\{ \Phi^T \Phi w^{(\text{old})} - \Phi^T t \right\} = (\Phi^T \Phi)^{-1} \Phi^T t

Newton-Raphson for Logistic Regression
• In logistic regression, the gradient and Hessian are

  \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \phi_n = \Phi^T (y - t), \qquad H = \nabla \nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n) \phi_n \phi_n^T = \Phi^T R \Phi

• R is a diagonal matrix with elements R_{nn} = y_n (1 - y_n).
• The Newton-Raphson update for logistic regression becomes

  w^{(\text{new})} = w^{(\text{old})} - (\Phi^T R \Phi)^{-1} \Phi^T (y - t) = (\Phi^T R \Phi)^{-1} \Phi^T R z

  where z is an N-dimensional vector with elements

  z = \Phi w^{(\text{old})} - R^{-1} (y - t)

• The weighting matrix R is not constant but depends on the parameter vector w, hence the name iterative reweighted least squares.

Iterative Reweighted Least Squares (IRLS)
• The quantity z_n, the nth element of z, is

  z_n = \phi_n^T w^{(\text{old})} - \frac{y_n - t_n}{y_n (1 - y_n)}

  (see the sketch at the end of this subsection).

2.2.4 Multiclass logistic regression
• In multiclass classification, the posterior probabilities are modeled using a softmax transformation of linear functions:

  p(C_k | \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}

• The activations a_k are defined as

  a_k = w_k^T \phi

Maximum Likelihood for Multiclass Logistic Regression
• We can use maximum likelihood to estimate the parameters \{w_k\} directly.
• The derivatives of y_k with respect to a_j are

  \frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)

• Here, I_{kj} are the elements of the identity matrix.

Likelihood Function and Cross-Entropy Error
• Using the 1-of-K coding scheme, the target vector t_n for feature vector \phi_n is binary, with only the element corresponding to the correct class equal to 1.
• The likelihood function is given by

  p(T | w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}

• Taking the negative logarithm of the likelihood gives the cross-entropy error function:

  E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

Gradient of the Error Function
• The gradient of the error function with respect to w_j is

  \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \, \phi_n

Batch Newton-Raphson Update for Multiclass Problem
• The Hessian matrix H for this update consists of blocks of size M \times M; the elements for blocks j, k are given by

  H_{jk} = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \, \phi_n \phi_n^T

• This leads to the iterative reweighted least squares (IRLS) algorithm for multiclass classification.
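A minimal sketch of the two-class IRLS (Newton-Raphson) update from Section 2.2.3, following w ← (Φ^T R Φ)^{-1} Φ^T R z. Φ and t are assumed inputs; the clipping of y is only for numerical safety and is not part of the algorithm.

```python
import numpy as np

# Iterative reweighted least squares for two-class logistic regression.
def irls(Phi, t, n_iters=10):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        y = np.clip(y, 1e-10, 1 - 1e-10)
        R = np.diag(y * (1 - y))                  # R_nn = y_n (1 - y_n)
        z = Phi @ w - (y - t) / (y * (1 - y))     # working targets z
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w
```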
2.2.5 Probit regression
• Probit regression is an alternative discriminative probabilistic model for binary classification.
• The posterior probability in a two-class problem can be written as

  p(t = 1 | a) = f(a)

  where a = w^T \phi and f(\cdot) is the activation function.

Noisy Threshold Model
• In a noisy threshold model, we evaluate a_n = w^T \phi_n for each input \phi_n.
• If a_n \geq \theta, we set t_n = 1; otherwise, t_n = 0.
• If the threshold \theta is drawn from a density p(\theta), the activation function becomes the cumulative distribution function (CDF):

  f(a) = \int_{-\infty}^{a} p(\theta) \, d\theta

The Probit Function
• If p(\theta) is a standard Gaussian distribution (zero mean, unit variance), the activation function is the probit function:

  \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta | 0, 1) \, d\theta

• A closely related function is the error function (erf), defined by

  \mathrm{erf}(a) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} \exp(-\theta^2) \, d\theta

• The probit function can be written using the error function as

  \Phi(a) = \frac{1}{2} \left\{ 1 + \mathrm{erf}\!\left( \frac{a}{\sqrt{2}} \right) \right\}
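A small illustrative sketch (not from the sheet) of the probit activation written via the error function, compared against the logistic sigmoid.

```python
import numpy as np
from math import erf, sqrt

# Probit activation: Phi(a) = 0.5 * (1 + erf(a / sqrt(2))).
def probit(a):
    return 0.5 * (1.0 + erf(a / sqrt(2.0)))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

print(probit(0.0), sigmoid(0.0))   # both equal 0.5 at a = 0
print(probit(2.0), sigmoid(2.0))   # probit saturates faster (Gaussian tails)
```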
2.2.6 Canonical link functions
Generalized Linear Models (GLMs):
• Define the relationship between the expected value of the target variable t and the input features \phi.

Exponential Family of Distributions:
• Many common distributions (e.g., Gaussian, Bernoulli, Poisson) belong to this family.
• Conditional distribution of t:

  p(t | \eta, s) = \frac{1}{s} h\!\left( \frac{t}{s} \right) g(\eta) \exp\left( \frac{\eta t}{s} \right)

• \eta: natural parameter, s: scale parameter.

Relationship Between \eta and y
• Conditional mean:

  y = E[t | \eta] = -s \frac{d}{d\eta} \ln g(\eta)

  ◦ This defines the relationship between \eta (the natural parameter) and y (the conditional mean).
• Express \eta as a function of y: \eta = \psi(y).
• In GLMs:

  y = f(w^T \phi)

  ◦ f(\cdot): activation function,
  ◦ f^{-1}(\cdot): link function (in statistics).

Canonical Link Function:
• The link function that maps the conditional mean y back to the linear predictor \eta.

Unit-03

1 Kernel Methods
• Some techniques use a subset of the training data even during the prediction phase.
• Examples include:
  ◦ Nearest neighbours – assigning labels based on the closest example from the training set.
• Such methods are called memory-based methods, and they require a similarity metric.
• Many linear parametric models can be re-cast into a dual representation based on kernel functions.

Kernel Function Definition
• For models based on a fixed nonlinear feature space mapping \phi(x), the kernel function is

  k(x, x') = \phi(x)^T \phi(x')

• The kernel is a symmetric function: k(x, x') = k(x', x).

Kernel Function Examples
• The simplest example is the linear kernel, where \phi(x) = x and k(x, x') = x^T x'.
• The kernel trick allows extensions of algorithms to be built by replacing scalar products with kernel functions.
• Example applications of the kernel trick:
  ◦ Nonlinear PCA (Principal Component Analysis).
  ◦ Nearest-neighbour classifiers.
  ◦ Kernel Fisher discriminant.

Common Kernel Functions
• Stationary kernels: depend only on the difference between the arguments, k(x, x') = k(x - x').
• Radial basis functions (RBF): depend only on the magnitude of the distance between the arguments, k(x, x') = k(\| x - x' \|).

1.1 Constructing Kernels
• One approach to constructing valid kernel functions is to choose a feature space mapping \phi(x) and use it to find the corresponding kernel:

  k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x) \phi_i(x')

  where the \phi_i(x) are the basis functions.
• An alternative is to construct kernel functions directly, ensuring that they are valid, i.e. that they correspond to a scalar product in some feature space.
• Example of a valid kernel for a two-dimensional input space x = (x_1, x_2):

  k(x, x') = (x^T x')^2 = (x_1 x_1' + x_2 x_2')^2 = x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)(x_1'^2, \sqrt{2} x_1' x_2', x_2'^2)^T = \phi(x)^T \phi(x')

  so the corresponding feature mapping is \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^T.

Testing Valid Kernels
• A kernel function k(x, x') is valid if the Gram matrix K, with elements k(x_n, x_m), is positive semidefinite.
• Positive semidefinite means v^T K v \geq 0 for all N \times 1 vectors v (see the sketch at the end of this subsection).

Techniques for Constructing New Kernels
• Given valid kernels k_1(x, x') and k_2(x, x'), the following are also valid:
  ◦ k(x, x') = c \, k_1(x, x'), where c > 0 is a constant.
  ◦ k(x, x') = f(x) \, k_1(x, x') \, f(x'), where f(\cdot) is any function.
  ◦ k(x, x') = k_1(x, x') + k_2(x, x').
  ◦ k(x, x') = k_1(x, x') \, k_2(x, x').
• These rules allow the construction of more complex kernels suited to specific applications.

Examples of Common Kernel Functions
• Polynomial Kernel: k(x, x') = (x^T x' + c)^M.
• Gaussian Kernel (Radial Basis Function): k(x, x') = \exp\left( -\frac{\| x - x' \|^2}{2\sigma^2} \right).
• Subset Kernel for non-vectorial spaces (e.g., sets): k(A_1, A_2) = 2^{|A_1 \cap A_2|}.

Generative Models as Kernels
• Generative models can be used to construct kernels for discriminative tasks.
• Example:

  k(x, x') = p(x) \, p(x')

  where p(x) is the probability of x under a generative model.
• Fisher Kernel:

  k(x, x') = g(\theta, x)^T F^{-1} g(\theta, x')

  where g(\theta, x) = \nabla_\theta \log p(x | \theta) and F is the Fisher information matrix, given by

  F = E_x\left[ g(\theta, x) \, g(\theta, x)^T \right]

Other Kernel Functions
• Sigmoidal Kernel: k(x, x') = \tanh(a x^T x' + b).
  ◦ Used in practice despite not always being positive semidefinite.
  ◦ Closely related to neural networks.
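An illustrative sketch (not from the sheet) of building the Gram matrix for the Gaussian (RBF) kernel and checking numerically that it is positive semidefinite, as required of a valid kernel. The data and kernel width are made-up example values.

```python
import numpy as np

# Gram matrix K_nm = k(x_n, x_m) for the Gaussian kernel.
def gaussian_gram(X, sigma=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(20, 3))
K = gaussian_gram(X)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)     # all eigenvalues >= 0 up to round-off
```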
1.2 Radial Basis Function Networks
• One form of basis function that is widely used in linear regression is the radial basis function.
• Radial Basis Functions (RBFs) have the property that they depend only on the distance from a centre \mu_j:

  \phi_j(x) = h(\| x - \mu_j \|)

• Used in pattern recognition with noisy data to avoid overfitting.
• Applicable to both interpolation problems and regression models.

Exact Function Interpolation
• Given N input vectors \{x_1, \ldots, x_N\} and target values \{t_1, \ldots, t_N\}.
• Goal: find a function f(x) such that f(x_n) = t_n for n = 1, \ldots, N.
• Model: express f(x) as a linear combination of radial basis functions, one centred on every data point:

  f(x) = \sum_{n=1}^{N} w_n \, h(\| x - x_n \|)

• The coefficients w_n are determined by least squares (see the sketch after this subsection).
• Exact interpolation is undesirable with noisy data because it overfits.

Regularization Theory
• Radial basis functions emerge naturally from regularization theory.
• The presence of a regularizer ensures that the solution does not exactly interpolate the training data.

Efficient Implementations of RBF Networks
• One basis function per data point can be computationally expensive.
• Alternative methods reduce the number of basis functions M such that M < N.
• Basis functions can be chosen using methods such as:
  ◦ A random subset of the data points.
  ◦ Orthogonal least squares.
  ◦ Clustering algorithms (e.g., K-means).
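A minimal sketch of exact RBF interpolation: one Gaussian basis function centred on every data point, with the coefficients obtained by solving the resulting N × N linear system. The basis width s and the data are made-up example choices; with noisy targets this fit overfits, as noted above.

```python
import numpy as np

# Exact interpolation with Gaussian radial basis functions h(r) = exp(-r^2/(2 s^2)).
def rbf_interpolate(x_train, t_train, s=0.1):
    H = np.exp(-(x_train[:, None] - x_train[None, :]) ** 2 / (2 * s ** 2))
    w = np.linalg.solve(H, t_train)           # enforce f(x_n) = t_n exactly
    def f(x_new):
        Phi = np.exp(-(np.atleast_1d(x_new)[:, None] - x_train[None, :]) ** 2
                     / (2 * s ** 2))
        return Phi @ w
    return f

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
f = rbf_interpolate(x, t)
print(np.allclose(f(x), t))                   # True: passes through every target
```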

Source: Lecture Series by Prof. Yaser, CalTech

You might also like