Reference Sheet for the Course
EC714PE: Pattern Recognition and Machine Learning

Dr. M. Fayazur Rahaman,
Dept. of ECE, MGIT, Hyderabad
(Last Updated: September 25, 2024)

References:
• Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
• Course Lecture Slides
Unit-01

1 Introduction to Pattern Recognition

1.1 Types of Learning
i. Supervised Learning
ii. Unsupervised Learning
iii. Reinforcement Learning

1.2 Example: Polynomial Curve Fitting
• Simple regression problem: predicting a real-valued target variable t based on a real-valued input variable x.
• A simple linear model using a polynomial function of order M:

  y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

• Error function (sum of squared errors) to be minimized:

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

1.2.1 Overfitting
• As M increases, the magnitude of the polynomial coefficients grows larger, resulting in large variance in the predictions.
• Overfitting decreases as the data set size increases.

1.2.2 Generalization
• The goal is to generalize well to new data (the test set).

1.2.3 Regularization
• Regularization controls overfitting by adding a penalty term to the error function, discouraging large coefficient values.
• Ridge regression uses a quadratic regularizer to penalize large coefficients:

  \tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \| w \|^2
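A minimal NumPy sketch of the two fits above (not part of the original sheet): the polynomial model fitted by plain least squares and by the ridge-regularized criterion. The data and the value of λ are made-up example choices.

```python
import numpy as np

# Illustrative sketch: fit a polynomial of order M by minimizing the
# sum-of-squares error, optionally with the ridge penalty (lambda/2)*||w||^2.
def fit_polynomial(x, t, M, lam=0.0):
    # Design matrix with columns x^0, x^1, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    # Regularized normal equations: (lam*I + Phi^T Phi) w = Phi^T t
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
w_ls = fit_polynomial(x, t, M=9)               # prone to overfitting for M = 9
w_ridge = fit_polynomial(x, t, M=9, lam=1e-3)  # penalized coefficients stay small
```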
1.3 Probability Theory

1.3.1 The Rules of Probability
• Sum Rule:

  p(X) = \sum_{Y} p(X, Y)

• Product Rule:

  p(X, Y) = p(Y|X) \, p(X)

• Bayes' Theorem:

  p(Y|X) = \frac{p(X|Y) \, p(Y)}{p(X)}

  Bayes' theorem expresses the posterior probability in terms of the likelihood, the prior probability, and the evidence.

1.3.2 Probability Densities
• Probability density function:

  p(x \in (a, b)) = \int_{a}^{b} p(x) \, dx, \qquad p(x) \geq 0, \qquad \int_{-\infty}^{\infty} p(x) \, dx = 1

• Sum rule and product rule for densities:

  p(x) = \int p(x, y) \, dy, \qquad p(x, y) = p(y|x) \, p(x)

1.3.3 Expectations and Covariances
• Expectation of a random variable x:
  ◦ Discrete distribution: E[x] = \sum_{x} p(x) \, x
  ◦ Continuous distribution: E[x] = \int p(x) \, x \, dx
  ◦ For N points drawn from the distribution: E[x] \approx \frac{1}{N} \sum_{n=1}^{N} x_n
• Variance of a random variable x:

  \mathrm{var}[x] = E\left[ (x - E[x])^2 \right] = E[x^2] - E[x]^2

• Covariance between two random variables x and y:

  \mathrm{cov}[x, y] = E_{x,y}\left[ (x - E[x])(y - E[y]) \right] = E_{x,y}[xy] - E[x] \, E[y]

  If x and y are independent, their covariance is 0.
1.3.4 Bayesian Probabilities
• Bayes' theorem: posterior ∝ likelihood × prior.
• Prior and posterior uncertainty in the model parameters w, incorporating evidence from the observed data D:

  p(w|D) = \frac{p(D|w) \, p(w)}{p(D)}

  where:
  ◦ p(w|D) is the posterior probability
  ◦ p(D|w) is the likelihood
  ◦ p(w) is the prior probability
  ◦ p(D) is the evidence

1.3.5 The Gaussian Distribution
• Gaussian distribution for a single real-valued variable x:

  \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
• Multivariate Gaussian distribution for a D-dimensional vector x:

  \mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

  ◦ \mu is the mean vector.
  ◦ \Sigma is the covariance matrix.
• Likelihood function of a data set x = (x_1, \ldots, x_N)^T drawn from the given Gaussian distribution:

  p(x | \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n | \mu, \sigma^2)

• Log-likelihood function:

  \ln p(x | \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)

• Maximum likelihood solution for the mean \mu:

  \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad E[\mu_{ML}] = \mu

• Maximum likelihood solution for the variance \sigma^2:

  \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2, \qquad E[\sigma^2_{ML}] = \frac{N-1}{N} \sigma^2
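A small illustrative sketch (not part of the sheet) of the maximum likelihood estimates above, including the bias of the ML variance; the sample is made up.

```python
import numpy as np

# ML estimates for a univariate Gaussian: mu_ML, the biased sigma^2_ML, and
# the usual N/(N-1) bias correction.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=50)
N = x.size

mu_ml = x.mean()                              # mu_ML = (1/N) sum_n x_n
sigma2_ml = ((x - mu_ml) ** 2).mean()         # biased: E[sigma2_ml] = (N-1)/N * sigma^2
sigma2_unbiased = sigma2_ml * N / (N - 1)     # bias-corrected estimate
```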
1.3.6 Curve Fitting revisited
• Curve fitting problem: predict the target variable t given a new input value x, based on a set of training data input values x = (x_1, \ldots, x_N)^T and their corresponding target values t = (t_1, \ldots, t_N)^T.
• Model assumption: given the input x, the corresponding target value t is assumed to follow a Gaussian distribution with mean equal to the value of the polynomial curve y(x, w):

  p(t | x, w, \beta) = \mathcal{N}\left( t \,|\, y(x, w), \beta^{-1} \right)

  where \beta is the precision parameter.
• Predictive distribution with the parameters determined by MLE:

  p(t | x, w_{ML}, \beta_{ML}) = \mathcal{N}\left( t \,|\, y(x, w_{ML}), \beta_{ML}^{-1} \right)

• Following the Bayesian approach:
  ◦ Assuming a prior distribution over w of the form

    p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left( -\frac{\alpha}{2} w^T w \right)

  ◦ the posterior distribution is

    p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)

  ◦ The maximum a posteriori (MAP) estimate of w leads to minimizing the regularized sum-of-squares error function:

    -\log p(w | t) = \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\alpha}{2} w^T w

1.3.7 Bayesian Curve fitting
• The predictive distribution is given by

  p(t | x, x, t) = \int p(t | x, w) \, p(w | x, t) \, dw = \mathcal{N}\left( t \,|\, m(x), s^2(x) \right)

  where:
  ◦ Mean:

    m(x) = \beta \, \phi(x)^T S \sum_{n=1}^{N} \phi(x_n) t_n

  ◦ Variance:

    s^2(x) = \beta^{-1} + \phi(x)^T S \, \phi(x)

  ◦ The covariance matrix S is given by

    S^{-1} = \alpha I + \beta \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T

    where \phi(x) is the feature vector with elements \phi_i(x) = x^i for i = 0, \ldots, M.
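An illustrative sketch of the Bayesian curve-fitting predictive distribution above, with polynomial features \phi_i(x) = x^i. The values of M, α and β and the data are assumed example choices, not from the sheet.

```python
import numpy as np

# Bayesian curve fitting: predictive mean m(x) and variance s^2(x), with
# S^-1 = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T.
def bayes_curve_fit(x_train, t_train, M=9, alpha=5e-3, beta=11.1):
    Phi = np.vander(x_train, M + 1, increasing=True)      # rows phi(x_n)^T
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)

    def predict(x_new):
        phi = np.vander(np.atleast_1d(x_new), M + 1, increasing=True)
        mean = beta * phi @ S @ Phi.T @ t_train            # m(x)
        var = 1.0 / beta + np.sum((phi @ S) * phi, axis=1) # s^2(x)
        return mean, var

    return predict

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
mean, var = bayes_curve_fit(x, t)(0.5)
```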
2π 2 = p(x, C2 ) dx + p(x, C1 ) dx.
R1 R2
◦ Polynomial basis functions: φj ( x ) = x j
◦ Posterior distribution is D D D D D D
∑ wi xi + ∑ ∑ wij xi x j + ∑ ∑ ∑ wijk xi x j xk .
!
y(x, w) = w0 + ( x − µ j )2
p(w|x, t, α, β) ∝ p(t|x, w, β) p(w|α) ◦ Gaussian basis functions: φj ( x ) = exp −
i =1 i =1 j =1 i =1 j =1 k =1 2s2
◦ Maximum A posteriori (MAP) estimate of w leads to minimizing 
x −µ j

the regularized sum-of-squares error function ◦ The number of independent coefficients w grows proportionally to ◦ Sigmoidal basis functions: φj ( x) = σ s with σ( a) =
D3 1
β N α ◦ For a polynomial of order M, the number of coefficients grows like • Optimal Decision Rule: if p(x, C1 ) > p(x, C2 ) for a given value of x,
2 n∑
− log p(w|t) = { y ( x n , w ) − t n }2 + w T w 1+exp(− a)
=1
2 DM then x should be assigned to class C1 . ◦ Fourier basis functions
Unit-02

1 Linear Models for Regression
• With a training data set \{x_n\}_{n=1}^{N} and corresponding target values \{t_n\}, the goal is to predict t for new values of x.
• Predictions can be made by constructing an appropriate function y(x) or by modeling the predictive distribution p(t | x).

1.1 Linear Basis Function Models
• The simplest linear model:

  y(x, w) = w_0 + w_1 x_1 + \cdots + w_D x_D

  where x = (x_1, \ldots, x_D)^T.
• Extend the model by using basis functions \phi_j(x):

  y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)

• Using the basis function \phi_0(x) = 1, the model can be compactly written as

  y(x, w) = w^T \phi(x)

  where w = (w_0, \ldots, w_{M-1})^T and \phi = (\phi_0, \ldots, \phi_{M-1})^T.
• Examples of basis functions:
  ◦ Polynomial basis functions: \phi_j(x) = x^j
  ◦ Gaussian basis functions: \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
  ◦ Sigmoidal basis functions: \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right) with \sigma(a) = \frac{1}{1 + \exp(-a)}
  ◦ Fourier basis functions
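An illustrative sketch (not from the sheet) of building a design matrix with the Gaussian basis functions listed above; the centres \mu_j and width s are assumed example values.

```python
import numpy as np

# Design matrix Phi for Gaussian basis functions, with phi_0(x) = 1 as the
# bias basis function.
def gaussian_design_matrix(x, centres, s):
    Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), Phi])   # prepend phi_0 = 1

x = np.linspace(0, 1, 25)
centres = np.linspace(0, 1, 9)
Phi = gaussian_design_matrix(x, centres, s=0.1)       # shape (N, M) with M = 10
```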

1.1.1 Maximum likelihood and least squares
• The target variable t is modeled as

  t = y(x, w) + \epsilon

  where \epsilon is Gaussian noise with zero mean and precision \beta.
• The conditional distribution of t given x and w is

  p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})

• For a data set X = \{(x_n, t_n)\}_{n=1}^{N}, the likelihood function is

  p(t | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})

• Logarithm of the likelihood function:

  \ln p(t | w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

  where the sum-of-squares error function is

  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2

• The gradient of the log likelihood function is

  \nabla \ln p(t | w, \beta) = \beta \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right) \phi(x_n)^T

• Setting this gradient to zero and solving for w:

  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t = \Phi^{\dagger} t

  where \Phi is the N \times M design matrix and \Phi^{\dagger} is the Moore-Penrose pseudo-inverse of \Phi.
• The design matrix \Phi is

  \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}

• Maximizing the log likelihood with respect to the noise precision \beta gives

  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - w_{ML}^T \phi(x_n) \right)^2
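A minimal sketch of the maximum likelihood solution above; Φ and t are assumed inputs (for example, the Gaussian-basis design matrix sketched earlier).

```python
import numpy as np

# w_ML via the Moore-Penrose pseudo-inverse, and beta_ML from the residuals.
def max_likelihood_fit(Phi, t):
    w_ml = np.linalg.pinv(Phi) @ t              # w_ML = (Phi^T Phi)^-1 Phi^T t
    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals ** 2)     # 1/beta_ML = (1/N) sum residual^2
    return w_ml, beta_ml
```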
1.1.2 Geometry of least squares
• The least-squares solution for w corresponds to the vector y in the subspace S that is closest to t.
• This solution corresponds to the orthogonal projection of t onto the subspace S.

1.1.3 Sequential learning
• The stochastic gradient descent algorithm updates the parameter vector w as follows:

  w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

  where \tau denotes the iteration number and \eta is a learning rate parameter.
• For the sum-of-squares error function, this update rule becomes

  w^{(\tau+1)} = w^{(\tau)} + \eta \left( t_n - w^{(\tau)T} \phi_n \right) \phi_n
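A minimal sketch of the sequential (LMS) update above: one stochastic gradient step per presented pattern. Φ and t are assumed inputs; the learning rate and epoch count are made-up choices.

```python
import numpy as np

# Sequential least-mean-squares updates: w <- w + eta (t_n - w^T phi_n) phi_n
def lms(Phi, t, eta=0.05, n_epochs=100):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for phi_n, t_n in zip(Phi, t):
            w = w + eta * (t_n - w @ phi_n) * phi_n
    return w
```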
1.1.4 Regularized least squares
• The total error function to be minimized is given by

  E(w) = E_D(w) + \lambda E_W(w)

  where \lambda is the regularization coefficient, E_D(w) is the data-dependent error, and E_W(w) is the regularization term.
• A simple regularizer:

  E_W(w) = \frac{1}{2} w^T w

• With the sum-of-squares error function, the total error function becomes

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} w^T w

• The exact minimizer can be found in closed form:

  w = \left( \lambda I + \Phi^T \Phi \right)^{-1} \Phi^T t

• A more general regularizer takes the form

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

• The case q = 1 is known as the lasso; it leads to a sparse model in which some coefficients w_j are driven to zero.

1.1.5 Multiple outputs
• Goal: predict multiple target variables collected in a K-dimensional target vector t.
• A common model:

  y(x, w) = W^T \phi(x)

  where:
  ◦ y is a K-dimensional column vector,
  ◦ W is an M \times K matrix of parameters, and
  ◦ \phi(x) is an M-dimensional column vector of basis functions.
• The conditional distribution of the target vector is assumed to be an isotropic Gaussian:

  p(t | x, W, \beta) = \mathcal{N}\left( t \,|\, W^T \phi(x), \beta^{-1} I \right)

• For a set of observations t_1, \ldots, t_N, the log-likelihood function is

  \ln p(T | X, W, \beta) = \frac{NK}{2} \ln \frac{\beta}{2\pi} - \frac{\beta}{2} \sum_{n=1}^{N} \| t_n - W^T \phi(x_n) \|^2

• Maximizing this log-likelihood function with respect to W yields

  W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T

1.2 The Bias-variance Decomposition
• The expected squared loss can be decomposed into three terms:

  E[L] = \text{bias}^2 + \text{variance} + \text{noise}

• Flexible models tend to have low bias and high variance, while rigid models have high bias and low variance.

1.3 Bayesian Linear Regression
• A Bayesian approach to linear regression helps avoid the over-fitting associated with maximum likelihood and provides an automatic method for determining model complexity using only the training data.

1.3.1 Parameter distribution
• The prior probability distribution over the model parameters w is Gaussian, given by

  p(w) = \mathcal{N}(w | m_0, S_0)

  where m_0 is the mean and S_0 is the covariance.
• The posterior distribution is proportional to the product of the likelihood function and the prior, and is

  p(w | t) = \mathcal{N}(w | m_N, S_N)

  where

  m_N = S_N \left( S_0^{-1} m_0 + \beta \Phi^T t \right), \qquad S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi

• The maximum posterior weight vector is

  w_{MAP} = m_N

• Consider a zero-mean isotropic Gaussian prior governed by a single precision parameter \alpha,

  p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)

• The corresponding posterior distribution has

  m_N = \beta S_N \Phi^T t, \qquad S_N^{-1} = \alpha I + \beta \Phi^T \Phi

• The log of the posterior distribution is

  \ln p(w | t) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - w^T \phi(x_n) \right)^2 - \frac{\alpha}{2} w^T w + \text{const.}

• Sequential updates of the posterior distribution can be illustrated with a simple example involving straight-line fitting.
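A minimal sketch of the parameter posterior above for the zero-mean isotropic prior; the values of α and β are assumed to be known and are made-up example choices.

```python
import numpy as np

# Posterior over w: S_N^-1 = alpha*I + beta*Phi^T Phi, m_N = beta*S_N*Phi^T t.
def posterior(Phi, t, alpha=2.0, beta=25.0):
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t        # also the MAP weight vector w_MAP
    return m_N, S_N
```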
1.3.2 Predictive distribution
• The goal is to predict the target variable t for new input values x.
• The predictive distribution is given by

  p(t | t, \alpha, \beta) = \int p(t | w, \beta) \, p(w | t, \alpha, \beta) \, dw

  where t represents the vector of target values from the training set.
• This results in

  p(t | x, t, \alpha, \beta) = \mathcal{N}\left( t \,|\, m_N^T \phi(x), \sigma_N^2(x) \right)

• The variance \sigma_N^2(x) is

  \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)

• The level of predictive uncertainty depends on x and decreases near the data points.

1.3.3 Equivalent Kernel
• The predictive mean can be expressed as

  y(x, m_N) = m_N^T \phi(x) = \beta \phi(x)^T S_N \Phi^T t = \sum_{n=1}^{N} \beta \phi(x)^T S_N \phi(x_n) t_n = \sum_{n=1}^{N} k(x, x_n) t_n

  where the equivalent kernel k(x, x_n) is given by

  k(x, x') = \beta \phi(x)^T S_N \phi(x')

• The equivalent kernel is localized around x, meaning that predictions at x are weighted more by data points close to x than by distant points.
• The covariance between predictions at x and x' is given by

  \mathrm{cov}[y(x), y(x')] = \beta^{-1} k(x, x')

  indicating that nearby predictions are more strongly correlated.
• The equivalent kernel ensures that the weights for combining the training set target values sum to one:

  \sum_{n=1}^{N} k(x, x_n) = 1
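An illustrative sketch of the predictive distribution and the equivalent kernel above, reusing the m_N and S_N from the preceding sketch; the value of β is again an assumed example choice.

```python
import numpy as np

# Predictive mean/variance and the equivalent kernel k(x, x_n).
def predictive(phi_new, m_N, S_N, beta=25.0):
    mean = phi_new @ m_N                                           # m_N^T phi(x)
    var = 1.0 / beta + np.sum((phi_new @ S_N) * phi_new, axis=1)   # sigma_N^2(x)
    return mean, var

def equivalent_kernel(phi_new, Phi_train, S_N, beta=25.0):
    # k(x, x_n) = beta * phi(x)^T S_N phi(x_n); each row sums to ~1 over the
    # training points.
    return beta * phi_new @ S_N @ Phi_train.T
```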
1.4 Bayesian Model Comparison
• Suppose we have a set of L models \{\mathcal{M}_i\}, i = 1, \ldots, L. The posterior distribution over models is given by

  p(\mathcal{M}_i | \mathcal{D}) \propto p(\mathcal{M}_i) \, p(\mathcal{D} | \mathcal{M}_i)

• For a model with a single parameter w, if the posterior is sharply peaked around w_{MAP}, the model evidence can be approximated by

  p(\mathcal{D}) = \int p(\mathcal{D} | w) \, p(w) \, dw \approx p(\mathcal{D} | w_{MAP}) \, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}

  where \Delta w_{\text{posterior}} is the width of the posterior peak and \Delta w_{\text{prior}} is the width of the prior.
• Taking the logarithm:

  \ln p(\mathcal{D}) \approx \ln p(\mathcal{D} | w_{MAP}) + \ln \left( \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}} \right)

• Bayesian model comparison generally favors models of intermediate complexity, avoiding both over-fitting and under-fitting.
2 Linear Models for Classification
• The goal in classification is to assign an input vector x to one of K discrete classes C_k, where k = 1, \ldots, K.
• For two-class problems, a binary target t \in \{0, 1\} is often used, where t = 1 represents class C_1 and t = 0 represents class C_2.
• For K > 2 classes, a 1-of-K coding scheme is used; for example, for K = 5 a pattern from class 2 would be

  t = (0, 1, 0, 0, 0)^T

• Three approaches to classification are discussed:
  I. Directly constructing a discriminant function that assigns each x to a specific class.
  II. Modeling the conditional probability distribution p(C_k | x).
  III. A generative approach in which the class-conditional densities p(x | C_k) and prior probabilities p(C_k) are modeled, and then Bayes' theorem is applied:

  p(C_k | x) = \frac{p(x | C_k) \, p(C_k)}{p(x)}

• For classification, we predict discrete class labels or posterior probabilities, which lie in (0, 1). This is done by transforming the linear function of w using a nonlinear function f(\cdot):

  y(x) = f(w^T x + w_0)

• Here, f(\cdot) is called an activation function in machine learning.

2.1 Probabilistic Generative Models
• Two-Class Case:
  ◦ The posterior probability for class C_1 is expressed as

    p(C_1 | x) = \frac{p(x | C_1) \, p(C_1)}{p(x | C_1) \, p(C_1) + p(x | C_2) \, p(C_2)} = \sigma(a)

  ◦ This can be rewritten using the logistic sigmoid function \sigma(a), where

    a = \ln \frac{p(x | C_1) \, p(C_1)}{p(x | C_2) \, p(C_2)}

  ◦ The logistic sigmoid function is defined as

    \sigma(a) = \frac{1}{1 + \exp(-a)}

  ◦ The logistic sigmoid maps the entire real axis to the finite interval (0, 1). It satisfies the symmetry property

    \sigma(-a) = 1 - \sigma(a)

  ◦ The inverse of the logistic sigmoid is the logit function:

    a = \ln \left( \frac{\sigma}{1 - \sigma} \right)

• Multiclass Case:
  ◦ For K > 2 classes, the posterior probability p(C_k | x) is given by the normalized exponential, or softmax, function:

    p(C_k | x) = \frac{p(x | C_k) \, p(C_k)}{\sum_{j=1}^{K} p(x | C_j) \, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}

  ◦ Here, the quantities a_k are defined as

    a_k = \ln p(x | C_k) \, p(C_k)
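An illustrative sketch (not from the sheet) of turning class-conditional densities and priors into posteriors via the softmax of a_k = \ln p(x|C_k) p(C_k); for K = 2 this reduces to the logistic sigmoid of the log-ratio a. The inputs here are made-up log-density values.

```python
import numpy as np

# Posterior class probabilities from log class-conditional densities and priors.
def class_posteriors(log_densities, priors):
    a = log_densities + np.log(priors)     # a_k = ln p(x|C_k) + ln p(C_k)
    a = a - a.max()                        # stabilize the exponentials
    e = np.exp(a)
    return e / e.sum()                     # softmax (normalized exponential)

posteriors = class_posteriors(np.array([-1.2, -2.3, -0.7]),
                              np.array([0.3, 0.3, 0.4]))
```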
2.1.1 Continuous inputs
• The density for class C_k is given by

  p(x | C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

• Two-Class Case:
  ◦ The posterior probability for class C_1 is given by

    p(C_1 | x) = \sigma(w^T x + w_0)

  ◦ Here, w and w_0 are defined as

    w = \Sigma^{-1} (\mu_1 - \mu_2)

    w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}

• General Case of K Classes:
  ◦ For K classes, the quantity a_k(x) is given by

    a_k(x) = w_k^T x + w_{k0}

  ◦ Here, w_k and w_{k0} are defined as

    w_k = \Sigma^{-1} \mu_k, \qquad w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)

• Relaxing the Shared Covariance Matrix Assumption:
  ◦ With class-specific covariance matrices the decision boundaries become quadratic rather than linear; linear and quadratic decision boundaries are illustrated in Figure 4.11.

2.1.2 Maximum likelihood solution
Two-Class Case:
• Assume Gaussian class-conditional densities with a shared covariance matrix. The data set is denoted as \{x_n, t_n\}.
• The prior class probabilities are denoted as p(C_1) = \pi and p(C_2) = 1 - \pi.
  ◦ For a data point x_n from class C_1 we have t_n = 1 and hence

    p(x_n, C_1) = p(C_1) \, p(x_n | C_1) = \pi \, \mathcal{N}(x_n | \mu_1, \Sigma)

  ◦ For a data point x_n from class C_2 we have t_n = 0 and hence

    p(x_n, C_2) = p(C_2) \, p(x_n | C_2) = (1 - \pi) \, \mathcal{N}(x_n | \mu_2, \Sigma)

• The likelihood function is given by

  p(t | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left[ \pi \, \mathcal{N}(x_n | \mu_1, \Sigma) \right]^{t_n} \left[ (1 - \pi) \, \mathcal{N}(x_n | \mu_2, \Sigma) \right]^{1 - t_n}

• Maximization with respect to \pi:
  ◦ The log likelihood terms dependent on \pi are

    \sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\}

  ◦ Setting the derivative with respect to \pi equal to zero, we obtain

    \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N}

• Maximization with respect to \mu_1:
  ◦ The log likelihood terms dependent on \mu_1 are

    -\frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \text{const.}

  ◦ Setting the derivative with respect to \mu_1 equal to zero, we obtain

    \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n

  ◦ Similarly, we obtain

    \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n

• Maximization with respect to the shared covariance matrix \Sigma:
  ◦ The log likelihood terms dependent on \Sigma are

    -\frac{N}{2} \ln |\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu)

    (with \mu taken as the mean of the class to which x_n belongs).
  ◦ The sample covariance matrices S_1 and S_2 for classes C_1 and C_2 are defined as

    S_1 = \frac{1}{N_1} \sum_{n \in C_1} (x_n - \mu_1)(x_n - \mu_1)^T, \qquad S_2 = \frac{1}{N_2} \sum_{n \in C_2} (x_n - \mu_2)(x_n - \mu_2)^T

  ◦ The maximum likelihood estimate for \Sigma is

    \Sigma = \frac{N_1 S_1 + N_2 S_2}{N}
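A minimal sketch of the maximum likelihood solution above for two Gaussian classes with a shared covariance, and the resulting parameters (w, w_0) of the posterior p(C_1|x) = σ(w^T x + w_0). X is assumed to be an (N, D) array with D ≥ 2 and t a 0/1 integer label vector (t = 1 for class C_1).

```python
import numpy as np

# Fit pi, mu1, mu2 and the shared Sigma by maximum likelihood, then form w, w0.
def fit_gaussian_generative(X, t):
    N1, N2 = t.sum(), (1 - t).sum()
    pi = N1 / len(t)
    mu1 = X[t == 1].mean(axis=0)
    mu2 = X[t == 0].mean(axis=0)
    S1 = np.cov(X[t == 1].T, bias=True)         # (1/N1) sum (x - mu1)(x - mu1)^T
    S2 = np.cov(X[t == 0].T, bias=True)
    Sigma = (N1 * S1 + N2 * S2) / len(t)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return pi, mu1, mu2, Sigma, w, w0
```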
2.1.3 Discrete features
• Binary Feature Values:
  ◦ Consider discrete feature values x_i \in \{0, 1\} for simplicity.
  ◦ Assuming the features to be independent (the naive Bayes assumption), the class-conditional distribution takes the form

    p(x | C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}

• Linear Functions of Input Values:
  ◦ Substituting the naive Bayes model into the expression for a_k(x), we obtain

    a_k(x) = \sum_{i=1}^{D} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(C_k)

• Two-Class Case (K = 2):
  ◦ For the case of K = 2 classes, we can use the logistic sigmoid formulation.

2.1.4 Exponential family
• Exponential Family of Distributions:
  ◦ Consider the general form of the exponential family distribution:

    p(x | \lambda_k, s) = \frac{1}{s} h\!\left( \frac{1}{s} x \right) g(\lambda_k) \exp\left( \frac{1}{s} \lambda_k^T x \right)

  ◦ Here, x is the feature vector, \lambda_k is the parameter vector for class C_k, and s is a scaling parameter.
• Two-Class Case (K = 2):
  ◦ For K = 2 classes, substituting into the expression for the posterior probability gives

    p(C_1 | x) = \sigma(a(x))

    where a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2).
• K-Class Case:
  ◦ For K > 2 classes, the posterior probability for class C_k is governed by

    a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)

  ◦ Again, the function a_k(x) is linear in x.

2.2 Probabilistic Discriminative Models

2.2.1 Fixed basis functions
• Classification models can apply a fixed nonlinear transformation to the input vector x using basis functions \phi(x).
• Classes that are linearly separable in the feature space \phi(x) may not be linearly separable in the original input space x.

2.2.2 Logistic regression
• The posterior probability of class C_1 is written as a logistic sigmoid function of a linear combination of features:

  p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi)

Maximum Likelihood for Logistic Regression
• The parameters are determined using maximum likelihood estimation (MLE), making use of the derivative of the logistic sigmoid,

  \frac{d\sigma}{da} = \sigma (1 - \sigma)

• Likelihood function for a data set \{\phi_n, t_n\}, where t_n \in \{0, 1\} and \phi_n = \phi(x_n):

  p(t | w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}

  where t = (t_1, \ldots, t_N)^T and y_n = p(C_1 | \phi_n).
• The corresponding cross-entropy error function is

  E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \left( t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right)

Gradient of the Error Function
• Gradient of the error function:

  \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \, \phi_n
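A minimal sketch of maximum likelihood for logistic regression by gradient descent on the cross-entropy error, using the gradient above. Φ and t are assumed inputs; the step size and iteration count are made-up choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Gradient descent on E(w) with gradient sum_n (y_n - t_n) phi_n.
def logistic_regression_gd(Phi, t, eta=0.1, n_iters=1000):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)          # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)        # gradient of the cross-entropy error
        w -= eta * grad / len(t)      # step scaled by N for stability
    return w
```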
2.2.3 Iterative reweighted least squares
• The maximum likelihood solution can be obtained using the Newton-Raphson iterative optimization technique.

Newton-Raphson Method
• The Newton-Raphson update for minimizing a function E(w) is

  w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)

• H is the Hessian matrix (the second derivatives of E(w)).

Newton-Raphson for Linear Regression
• In linear regression, the gradient and Hessian are

  \nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n) \phi_n = \Phi^T \Phi w - \Phi^T t, \qquad H = \nabla \nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi

• The Newton-Raphson update gives the exact least-squares solution:

  w^{(\text{new})} = w^{(\text{old})} - (\Phi^T \Phi)^{-1} \left\{ \Phi^T \Phi w^{(\text{old})} - \Phi^T t \right\} = (\Phi^T \Phi)^{-1} \Phi^T t

Newton-Raphson for Logistic Regression
• In logistic regression, the gradient and Hessian are

  \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \phi_n = \Phi^T (y - t), \qquad H = \nabla \nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n) \phi_n \phi_n^T = \Phi^T R \Phi

• R is a diagonal matrix with elements R_{nn} = y_n (1 - y_n).
• The Newton-Raphson update for logistic regression becomes

  w^{(\text{new})} = w^{(\text{old})} - (\Phi^T R \Phi)^{-1} \Phi^T (y - t) = (\Phi^T R \Phi)^{-1} \Phi^T R z

  where z is an N-dimensional vector with elements

  z = \Phi w^{(\text{old})} - R^{-1} (y - t)

• The weighting matrix R is not constant but depends on the parameter vector w, hence the name iterative reweighted least squares.

Iterative Reweighted Least Squares (IRLS)
• The quantity z_n, the nth element of z, is

  z_n = \phi_n^T w^{(\text{old})} - \frac{y_n - t_n}{y_n (1 - y_n)}

  (see the sketch at the end of this subsection).

2.2.4 Multiclass logistic regression
• In multiclass classification, the posterior probabilities are modeled using a softmax transformation of linear functions:

  p(C_k | \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}

• The activations a_k are defined as

  a_k = w_k^T \phi

Maximum Likelihood for Multiclass Logistic Regression
• We can use maximum likelihood to estimate the parameters \{w_k\} directly.
• The derivatives of y_k with respect to a_j are

  \frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)

• Here, I_{kj} are the elements of the identity matrix.

Likelihood Function and Cross-Entropy Error
• Using the 1-of-K coding scheme, the target vector t_n for feature vector \phi_n is binary, with only the element corresponding to the correct class equal to 1.
• The likelihood function is given by

  p(T | w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}

• Taking the negative logarithm of the likelihood gives the cross-entropy error function:

  E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

Gradient of the Error Function
• The gradient of the error function with respect to w_j is

  \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \, \phi_n

Batch Newton-Raphson Update for Multiclass Problem
• The Hessian matrix H for this update consists of blocks of size M \times M; the elements for blocks j, k are given by

  H_{jk} = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \, \phi_n \phi_n^T

• This leads to the iterative reweighted least squares (IRLS) algorithm for multiclass classification.
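A minimal sketch of the two-class IRLS (Newton-Raphson) update from Section 2.2.3, following w ← (Φ^T R Φ)^{-1} Φ^T R z. Φ and t are assumed inputs; the clipping of y is only for numerical safety and is not part of the algorithm.

```python
import numpy as np

# Iterative reweighted least squares for two-class logistic regression.
def irls(Phi, t, n_iters=10):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        y = np.clip(y, 1e-10, 1 - 1e-10)
        R = np.diag(y * (1 - y))                  # R_nn = y_n (1 - y_n)
        z = Phi @ w - (y - t) / (y * (1 - y))     # working targets z
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w
```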
2.2.5 Probit regression
• Probit regression is an alternative discriminative probabilistic model for binary classification.
• The posterior probability in a two-class problem can be written as

  p(t = 1 | a) = f(a)

  where a = w^T \phi and f(\cdot) is the activation function.

Noisy Threshold Model
• In a noisy threshold model, we evaluate a_n = w^T \phi_n for each input \phi_n.
• If a_n \geq \theta, we set t_n = 1; otherwise, t_n = 0.
• If the threshold \theta is drawn from a density p(\theta), the activation function becomes the cumulative distribution function (CDF):

  f(a) = \int_{-\infty}^{a} p(\theta) \, d\theta

The Probit Function
• If p(\theta) is a standard Gaussian distribution (zero mean, unit variance), the activation function is the probit function:

  \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta | 0, 1) \, d\theta

• A closely related function is the error function (erf), defined by

  \mathrm{erf}(a) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} \exp(-\theta^2) \, d\theta

• The probit function can be written using the error function as

  \Phi(a) = \frac{1}{2} \left\{ 1 + \mathrm{erf}\!\left( \frac{a}{\sqrt{2}} \right) \right\}
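A small illustrative sketch (not from the sheet) of the probit activation written via the error function, compared against the logistic sigmoid.

```python
import numpy as np
from math import erf, sqrt

# Probit activation: Phi(a) = 0.5 * (1 + erf(a / sqrt(2))).
def probit(a):
    return 0.5 * (1.0 + erf(a / sqrt(2.0)))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

print(probit(0.0), sigmoid(0.0))   # both equal 0.5 at a = 0
print(probit(2.0), sigmoid(2.0))   # probit saturates faster (Gaussian tails)
```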
2.2.6 Canonical link functions
Generalized Linear Models (GLMs):
• Define the relationship between the expected value of the target variable t and the input features \phi.

Exponential Family of Distributions:
• Many common distributions (e.g., Gaussian, Bernoulli, Poisson) belong to this family.
• Conditional distribution of t:

  p(t | \eta, s) = \frac{1}{s} h\!\left( \frac{t}{s} \right) g(\eta) \exp\left( \frac{\eta t}{s} \right)

• \eta: natural parameter, s: scale parameter.

Relationship Between \eta and y
• Conditional mean:

  y = E[t | \eta] = -s \frac{d}{d\eta} \ln g(\eta)

  ◦ This defines the relationship between \eta (the natural parameter) and y (the conditional mean).
• Express \eta as a function of y: \eta = \psi(y).
• In GLMs:

  y = f(w^T \phi)

  ◦ f(\cdot): activation function,
  ◦ f^{-1}(\cdot): link function (in statistics).

Canonical Link Function:
• The link function that maps the conditional mean y back to the linear predictor \eta.

Unit-03

1 Kernel Methods
• Some techniques use a subset of the training data even during the prediction phase.
• Examples include:
  ◦ Nearest neighbours – assigning labels based on the closest example from the training set.
• Such methods are called memory-based methods, and they require a similarity metric.
• Many linear parametric models can be re-cast into a dual representation based on kernel functions.

Kernel Function Definition
• For models based on a fixed nonlinear feature space mapping \phi(x), the kernel function is

  k(x, x') = \phi(x)^T \phi(x')

• The kernel is a symmetric function: k(x, x') = k(x', x).

Kernel Function Examples
• The simplest example is the linear kernel, where \phi(x) = x and k(x, x') = x^T x'.
• The kernel trick allows extensions of algorithms to be built by replacing scalar products with kernel functions.
• Example applications of the kernel trick:
  ◦ Nonlinear PCA (Principal Component Analysis).
  ◦ Nearest-neighbour classifiers.
  ◦ Kernel Fisher discriminant.

Common Kernel Functions
• Stationary kernels: depend only on the difference between the arguments, k(x, x') = k(x - x').
• Radial basis functions (RBF): depend only on the magnitude of the distance between the arguments, k(x, x') = k(\| x - x' \|).

1.1 Constructing Kernels
• One approach to constructing valid kernel functions is to choose a feature space mapping \phi(x) and use it to find the corresponding kernel:

  k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x) \phi_i(x')

  where the \phi_i(x) are the basis functions.
• An alternative is to construct kernel functions directly, ensuring that they are valid, i.e. that they correspond to a scalar product in some feature space.
• Example of a valid kernel for a two-dimensional input space x = (x_1, x_2):

  k(x, x') = (x^T x')^2 = (x_1 x_1' + x_2 x_2')^2 = x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)(x_1'^2, \sqrt{2} x_1' x_2', x_2'^2)^T = \phi(x)^T \phi(x')

  so the corresponding feature mapping is \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^T.

Testing Valid Kernels
• A kernel function k(x, x') is valid if the Gram matrix K, with elements k(x_n, x_m), is positive semidefinite.
• Positive semidefinite means v^T K v \geq 0 for all N \times 1 vectors v (see the sketch at the end of this subsection).

Techniques for Constructing New Kernels
• Given valid kernels k_1(x, x') and k_2(x, x'), the following are also valid:
  ◦ k(x, x') = c \, k_1(x, x'), where c > 0 is a constant.
  ◦ k(x, x') = f(x) \, k_1(x, x') \, f(x'), where f(\cdot) is any function.
  ◦ k(x, x') = k_1(x, x') + k_2(x, x').
  ◦ k(x, x') = k_1(x, x') \, k_2(x, x').
• These rules allow the construction of more complex kernels suited to specific applications.

Examples of Common Kernel Functions
• Polynomial Kernel: k(x, x') = (x^T x' + c)^M.
• Gaussian Kernel (Radial Basis Function): k(x, x') = \exp\left( -\frac{\| x - x' \|^2}{2\sigma^2} \right).
• Subset Kernel for non-vectorial spaces (e.g., sets): k(A_1, A_2) = 2^{|A_1 \cap A_2|}.

Generative Models as Kernels
• Generative models can be used to construct kernels for discriminative tasks.
• Example:

  k(x, x') = p(x) \, p(x')

  where p(x) is the probability of x under a generative model.
• Fisher Kernel:

  k(x, x') = g(\theta, x)^T F^{-1} g(\theta, x')

  where g(\theta, x) = \nabla_\theta \log p(x | \theta) and F is the Fisher information matrix, given by

  F = E_x\left[ g(\theta, x) \, g(\theta, x)^T \right]

Other Kernel Functions
• Sigmoidal Kernel: k(x, x') = \tanh(a x^T x' + b).
  ◦ Used in practice despite not always being positive semidefinite.
  ◦ Closely related to neural networks.
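An illustrative sketch (not from the sheet) of building the Gram matrix for the Gaussian (RBF) kernel and checking numerically that it is positive semidefinite, as required of a valid kernel. The data and kernel width are made-up example values.

```python
import numpy as np

# Gram matrix K_nm = k(x_n, x_m) for the Gaussian kernel.
def gaussian_gram(X, sigma=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(20, 3))
K = gaussian_gram(X)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)     # all eigenvalues >= 0 up to round-off
```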
1.2 Radial Basis Function Networks
• One form of basis function that is widely used in linear regression is the radial basis function.
• Radial Basis Functions (RBFs) have the property that they depend only on the distance from a centre \mu_j:

  \phi_j(x) = h(\| x - \mu_j \|)

• Used in pattern recognition with noisy data to avoid overfitting.
• Applicable to both interpolation problems and regression models.

Exact Function Interpolation
• Given N input vectors \{x_1, \ldots, x_N\} and target values \{t_1, \ldots, t_N\}.
• Goal: find a function f(x) such that f(x_n) = t_n for n = 1, \ldots, N.
• Model: express f(x) as a linear combination of radial basis functions, one centred on every data point:

  f(x) = \sum_{n=1}^{N} w_n \, h(\| x - x_n \|)

• The coefficients w_n are determined by least squares (see the sketch after this subsection).
• Exact interpolation is undesirable with noisy data because it overfits.

Regularization Theory
• Radial basis functions emerge naturally from regularization theory.
• The presence of a regularizer ensures that the solution does not exactly interpolate the training data.

Efficient Implementations of RBF Networks
• One basis function per data point can be computationally expensive.
• Alternative methods reduce the number of basis functions M such that M < N.
• Basis functions can be chosen using methods such as:
  ◦ A random subset of the data points.
  ◦ Orthogonal least squares.
  ◦ Clustering algorithms (e.g., K-means).
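A minimal sketch of exact RBF interpolation: one Gaussian basis function centred on every data point, with the coefficients obtained by solving the resulting N × N linear system. The basis width s and the data are made-up example choices; with noisy targets this fit overfits, as noted above.

```python
import numpy as np

# Exact interpolation with Gaussian radial basis functions h(r) = exp(-r^2/(2 s^2)).
def rbf_interpolate(x_train, t_train, s=0.1):
    H = np.exp(-(x_train[:, None] - x_train[None, :]) ** 2 / (2 * s ** 2))
    w = np.linalg.solve(H, t_train)           # enforce f(x_n) = t_n exactly
    def f(x_new):
        Phi = np.exp(-(np.atleast_1d(x_new)[:, None] - x_train[None, :]) ** 2
                     / (2 * s ** 2))
        return Phi @ w
    return f

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
f = rbf_interpolate(x, t)
print(np.allclose(f(x), t))                   # True: passes through every target
```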

Source: Lecture Series by Prof. Yaser, CalTech

You might also like