PRML RefSheet
• Simple Regression problem: Predicting a real-valued target variable t based on a real-valued input variable x
• A simple linear model using a polynomial function of order M:

  y(x, w) = w0 + w1 x + w2 x² + · · · + wM x^M = ∑_{j=0}^{M} wj x^j

• Overfitting decreases as the data set size increases
• Error function (sum of squared errors) to be minimized:

  E(w) = (1/2) ∑_{n=1}^{N} { y(xn, w) − tn }²

• Regularized error function:

  Ẽ(w) = (1/2) ∑_{n=1}^{N} { y(xn, w) − tn }² + (λ/2) ||w||²
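Illustrative sketch (not from the sheet; NumPy, the sin(2πx) toy data, M = 9 and λ = exp(−18) are my assumptions): fitting a polynomial by minimizing the regularized sum-of-squares error Ẽ(w) above.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix with columns 1, x, x^2, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M, lam=0.0):
    """Minimize E~(w) = 1/2 sum_n {y(x_n, w) - t_n}^2 + lam/2 ||w||^2.
    Setting the gradient to zero gives (Phi^T Phi + lam I) w = Phi^T t."""
    Phi = design_matrix(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Synthetic curve-fitting data: t = sin(2*pi*x) plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

w_unreg = fit_polynomial(x, t, M=9)                    # overfits for small N
w_reg = fit_polynomial(x, t, M=9, lam=np.exp(-18))     # regularized fit
```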
1.3.1 The Rules of Probability
• Sum Rule:

  p(X) = ∑_Y p(X, Y)

• Product Rule:

  p(X, Y) = p(Y|X) · p(X)

• Bayes' Theorem: posterior ∝ likelihood × prior
• Prior and posterior uncertainty in the model parameters w is updated by incorporating evidence from the observed data D:

  p(w|D) = p(D|w) · p(w) / p(D)

  where:
  ◦ p(w|D) is the posterior probability
  ◦ p(D|w) is the likelihood
  ◦ p(w) is the prior probability
  ◦ p(D) is the evidence

  Bayes' theorem expresses the posterior probability in terms of the likelihood, prior probability, and the evidence.

2 Probability Densities

  var[x] = E[x²] − E[x]²

1.3.5 The Gaussian Distribution
• Gaussian distribution for a single real-valued variable x:

  N(x|µ, σ²) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) )

• Multivariate Gaussian distribution for a D-dimensional vector x:

  N(x|µ, Σ) = (1/((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) )

  ◦ µ is the mean vector.
  ◦ Σ is the covariance matrix.
• Likelihood function of the data set x drawn from the given Gaussian distribution:

  p(x|µ, σ²) = ∏_{n=1}^{N} N(xn|µ, σ²)
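Illustrative sketch (mine, not the sheet's; NumPy and the synthetic sample are assumptions): evaluating the single-variable density and the log of the likelihood product above; maximizing it gives the sample mean and (biased) sample variance.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) for a single real-valued variable x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def log_likelihood(data, mu, sigma2):
    """log p(x | mu, sigma^2) = sum_n log N(x_n | mu, sigma^2)."""
    return np.sum(np.log(gaussian_pdf(data, mu, sigma2)))

# Synthetic data set drawn from N(x | 1.0, 0.5^2)
rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=0.5, size=100)

# Maximum-likelihood estimates: sample mean and (biased) sample variance
mu_ml = data.mean()
sigma2_ml = data.var()
print(log_likelihood(data, mu_ml, sigma2_ml))
```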
1.3.7 Bayesian Curve Fitting

1.6 Decision Theory
• Using Bayes' theorem, the posterior probabilities p(Ck|x) can be expressed as:

  p(Ck|x) = p(x|Ck) p(Ck) / p(x)
Unit-02

1 Linear Models for Regression
• With a training data set {xn}_{n=1}^{N} and corresponding target values {tn}, the goal is to predict t for new x values.
• Predictions can be made by constructing an appropriate function y(x) or modeling the predictive distribution p(t|x).
• To maximize the log likelihood with respect to the noise precision β, we get:

  1/β_ML = (1/N) ∑_{n=1}^{N} { tn − w_MLᵀ φ(xn) }²
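A short illustrative sketch (my own; the Gaussian basis functions and synthetic data are assumptions): solving for w_ML by least squares on a design matrix Φ of fixed basis functions, then estimating the noise precision β_ML from the residuals as above.

```python
import numpy as np

def gaussian_basis(x, centres, s=0.1):
    """Design matrix Phi with phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) plus a bias column."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), phi])

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = gaussian_basis(x, centres=np.linspace(0, 1, 9))
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # maximum-likelihood weights

# 1/beta_ML = (1/N) * sum_n (t_n - w_ML^T phi(x_n))^2
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
```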
• A more general regularizer takes the form:

  E(w) = (1/2) ∑_{n=1}^{N} { tn − wᵀ φ(xn) }² + (λ/2) ∑_{j=1}^{M} |wj|^q

• The case q = 1, known as the lasso, leads to a sparse model in which some coefficients wj are driven to zero.
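Illustrative sketch (NumPy assumed; the function name and toy values are mine) of the general regularized error above, so the effect of q can be compared: q = 2 gives the quadratic (ridge) penalty, q = 1 gives the lasso penalty that favours sparse w.

```python
import numpy as np

def regularized_error(w, Phi, t, lam, q):
    """E(w) = 1/2 sum_n (t_n - w^T phi(x_n))^2 + lam/2 sum_j |w_j|^q."""
    residuals = t - Phi @ w
    return 0.5 * residuals @ residuals + 0.5 * lam * np.sum(np.abs(w) ** q)

# Compare the penalty for the same weight vector under q = 2 and q = 1
w = np.array([0.0, 2.5, -0.1, 0.0, 1.0])
Phi = np.eye(5)          # toy design matrix
t = np.zeros(5)
print(regularized_error(w, Phi, t, lam=1.0, q=2))    # ridge
print(regularized_error(w, Phi, t, lam=1.0, q=1))    # lasso
```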
Examples
◦ The logistic sigmoid maps the entire real axis to a finite interval (0, 1). It satisfies the symmetry property:

  σ(−a) = 1 − σ(a)

• Shared (pooled) covariance matrix for the two-class Gaussian generative model:

  Σ = (N1 S1 + N2 S2) / N

• Classes that are linearly separable in the feature space φ(x) may not be linearly separable in the original input space x.
• H is the Hessian matrix (the second derivatives of E(w)) used in the Newton-Raphson update. (For linear regression, a single Newton-Raphson step recovers the standard least-squares solution.)
• Applied to logistic regression, the Newton-Raphson update leads to the iterative reweighted least squares (IRLS) algorithm, which also extends to multiclass classification.
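A compact sketch of IRLS for two-class logistic regression (my own illustration, not the sheet's): the Newton-Raphson update w ← w − H⁻¹∇E with ∇E = Φᵀ(y − t) and H = ΦᵀRΦ, where R is diagonal with elements yn(1 − yn). The synthetic data and iteration count are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iters=10):
    """Iterative reweighted least squares for two-class logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)                            # predicted probabilities
        R = np.diag(y * (1 - y))                        # weighting matrix
        grad = Phi.T @ (y - t)                          # gradient of the cross-entropy error
        H = Phi.T @ R @ Phi + 1e-6 * np.eye(Phi.shape[1])   # Hessian (+ small jitter)
        w = w - np.linalg.solve(H, grad)                # Newton-Raphson step
    return w

# Toy two-class data with a bias feature
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
t = np.r_[np.zeros(20), np.ones(20)]
Phi = np.hstack([np.ones((40, 1)), X])
w = irls_logistic(Phi, t)
```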
2.2.5 Probit regression
• Probit regression is an alternative discriminative probabilistic model for binary classification.
• The posterior probability in a two-class problem can be written as:

  p(t = 1|a) = f(a)

  where a = wᵀφ, and f(·) is the activation function.

Noisy Threshold Model
• In a noisy threshold model, we evaluate an = wᵀφn for each input φn.
• If an ≥ θ, we set tn = 1; otherwise, tn = 0.
• The activation function becomes the cumulative distribution function (CDF):

  f(a) = ∫_{−∞}^{a} p(θ) dθ

The Probit Function
• If p(θ) is a standard Gaussian distribution (zero mean, unit variance), the activation function is the probit function:

  Φ(a) = ∫_{−∞}^{a} N(θ|0, 1) dθ

• Another related function is the error function (erf), defined by:

  erf(a) = (2/√π) ∫_{0}^{a} exp(−θ²/2) dθ

• The probit function can also be written using the error function:

  Φ(a) = (1/2) { 1 + (1/√2) erf(a) }
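A small sketch (mine, not the sheet's) of the probit activation. Note that Python's math.erf uses the conventional definition erf(z) = (2/√π) ∫₀^z exp(−u²) du, so in that convention the probit is Φ(a) = ½{1 + erf(a/√2)}; this is equivalent to the expression above under the sheet's scaling of erf.

```python
import math

def probit(a):
    """Probit activation: CDF of the standard Gaussian, via the conventional erf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

# Both activations map the real axis to (0, 1); the probit's tails decay
# like exp(-a^2/2), faster than those of the logistic sigmoid.
for a in (-3.0, 0.0, 3.0):
    print(a, probit(a), logistic(a))
```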
2.2.6 Canonical link functions
Generalized Linear Models (GLMs):
• Define the relationship between the expected value of the target variable t and the input features φ.

Exponential Family of Distributions:
• Many common distributions (e.g., Gaussian, Bernoulli, Poisson) belong to this family.
• Conditional distribution of t:

  p(t|η, s) = (1/s) h(t/s) g(η) exp(ηt/s)

• η: natural parameter, s: scale parameter.

Relationship Between η and y
• Conditional Mean:

  y = E[t|η] = −s (d/dη) ln g(η)

  ◦ Defines the relationship between η (natural parameter) and y (conditional mean).
• Express η as a function of y:

  η = ψ(y)

• In GLMs:

  y = f(wᵀφ)

  ◦ f(·): activation function,
  ◦ f⁻¹(·): link function (in statistics).

Canonical Link Function:
• The link function that maps the conditional mean y back to the linear predictor η.
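For concreteness, a small sketch (my own, under the usual GLM conventions, not taken from the sheet) of the canonical activation f and link f⁻¹ for three exponential-family choices: identity for the Gaussian, logistic sigmoid for the Bernoulli, and exponential for the Poisson (canonical links: identity, logit, and log respectively).

```python
import numpy as np

# Canonical activation f and link f^{-1} for common exponential-family models.
canonical = {
    "gaussian":  {"f": lambda a: a,                    "link": lambda y: y},
    "bernoulli": {"f": lambda a: 1 / (1 + np.exp(-a)), "link": lambda y: np.log(y / (1 - y))},
    "poisson":   {"f": lambda a: np.exp(a),            "link": lambda y: np.log(y)},
}

w = np.array([0.5, -1.0])
phi = np.array([1.0, 2.0])          # feature vector phi(x), including a bias
a = w @ phi                         # linear predictor eta = w^T phi

for name, fns in canonical.items():
    y = fns["f"](a)                                   # conditional mean y = f(w^T phi)
    print(name, y, np.allclose(fns["link"](y), a))    # the link maps y back to eta
```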
Unit-03

1 Kernel Methods
• Some techniques use a subset of the training data even during the prediction phase.
• Examples include:
  ◦ Nearest neighbours – assigning labels based on the closest example from the training set.
• Such methods are called memory-based methods, which require a similarity metric.
• Many linear parametric models can be re-cast into a dual representation, based on kernel functions.

Kernel Function Definition
• For models based on a fixed nonlinear feature space mapping φ(x), the kernel function is:

  k(x, x′) = φ(x)ᵀ φ(x′)

• The kernel is a symmetric function: k(x, x′) = k(x′, x).

Kernel Function Examples
• The simplest example: the linear kernel, where φ(x) = x and k(x, x′) = xᵀx′.
• The kernel trick allows building extensions of algorithms, by replacing scalar products with kernel functions.
• Example applications of the kernel trick:
  ◦ Nonlinear PCA (Principal Component Analysis).
  ◦ Nearest-neighbour classifiers.
  ◦ Kernel Fisher discriminant.

Common Kernel Functions
• Stationary kernels: depend only on the difference between the arguments, k(x, x′) = k(x − x′).
• Radial basis functions (RBF): depend only on the magnitude of the distance between the arguments, k(x, x′) = k(||x − x′||).

1.1 Constructing Kernels
• One approach to construct valid kernel functions is to choose a feature space mapping φ(x) and use it to find the corresponding kernel:

  k(x, x′) = φ(x)ᵀ φ(x′) = ∑_{i=1}^{M} φi(x) φi(x′)

  where the φi(x) are the basis functions.
• An alternative is to construct kernel functions directly, ensuring they are valid and correspond to a scalar product in some feature space.
• Example of a valid kernel considering a two-dimensional input space x = (x1, x2):

  k(x, x′) = (xᵀx′)² = (x1 x1′ + x2 x2′)²
           = x1² x1′² + 2 x1 x1′ x2 x2′ + x2² x2′²
           = (x1², √2 x1 x2, x2²)(x1′², √2 x1′ x2′, x2′²)ᵀ
           = φ(x)ᵀ φ(x′)

• For a 2D input space, the corresponding feature mapping is:

  φ(x) = (x1², √2 x1 x2, x2²)ᵀ
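A quick numerical check (illustrative only; NumPy assumed) that the kernel (xᵀx′)² equals the inner product of the explicit feature maps φ(x) = (x1², √2 x1 x2, x2²)ᵀ:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2-D example: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, xp):
    """Kernel computed directly in input space: (x^T x')^2."""
    return (x @ xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(k(x, xp), phi(x) @ phi(xp))      # identical up to rounding
```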
Testing Valid Kernels
• A kernel function k(x, x′) is valid if the Gram matrix K, with elements k(xn, xm), is positive semidefinite.
• Positive semidefinite means:

  vᵀ K v ≥ 0 for all N-dimensional vectors v

Techniques for Constructing New Kernels
• Given valid kernels k1(x, x′) and k2(x, x′), the following are also valid:
  ◦ k(x, x′) = c k1(x, x′), where c > 0 is a constant.
  ◦ k(x, x′) = f(x) k1(x, x′) f(x′), where f(·) is any function.
  ◦ k(x, x′) = k1(x, x′) + k2(x, x′).
  ◦ k(x, x′) = k1(x, x′) k2(x, x′).
• These methods allow for the construction of more complex kernels suited to specific applications.

Examples of Common Kernel Functions
• Polynomial Kernel:

  k(x, x′) = (xᵀx′ + c)^M

• Gaussian Kernel (Radial Basis Function):

  k(x, x′) = exp( −||x − x′||² / (2σ²) )

• Subset Kernel for non-vectorial spaces (e.g., sets):

  k(A1, A2) = 2^{|A1 ∩ A2|}
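Tying the last two subsections together, a sketch (assumptions: NumPy, random toy inputs) that builds the Gram matrix for the Gaussian kernel and checks positive semidefiniteness via its eigenvalues:

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))            # 8 toy input vectors in R^3

# Gram matrix K with elements k(x_n, x_m)
K = np.array([[gaussian_kernel(xn, xm) for xm in X] for xn in X])

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix, so use eigvalsh
print(np.all(eigvals >= -1e-10))       # positive semidefinite up to numerical tolerance
```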
Generative Models as Kernels
• Generative models can be used to construct kernels in discriminative tasks.
• Example:

  k(x, x′) = p(x) p(x′)

  where p(x) is the probability of x under a generative model.
• Fisher Kernel (a small worked instance is sketched below):

  k(x, x′) = g(θ, x)ᵀ F⁻¹ g(θ, x′)

  where g(θ, x) = ∇θ log p(x|θ) and F is the Fisher information matrix, given by

  F = Ex[ g(θ, x) g(θ, x)ᵀ ]

Other Kernel Functions
• Sigmoidal Kernel:

  k(x, x′) = tanh( a xᵀx′ + b )

• Used in practice despite not always being positive semidefinite.
• Closely related to neural networks.
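A tiny worked instance of the Fisher kernel (my own choice of generative model, not the sheet's): for a univariate Gaussian with parameter θ = µ and fixed variance σ², the score is g(µ, x) = (x − µ)/σ² and the Fisher information is F = 1/σ², so k(x, x′) = (x − µ)(x′ − µ)/σ².

```python
import numpy as np

def fisher_kernel_gauss(x, xp, mu=0.0, sigma2=1.0):
    """Fisher kernel for a univariate Gaussian generative model with theta = mu.
    Score: g(mu, x) = (x - mu) / sigma^2;  Fisher information: F = 1 / sigma^2."""
    g = lambda z: (z - mu) / sigma2
    F = 1.0 / sigma2
    return g(x) * (1.0 / F) * g(xp)

print(fisher_kernel_gauss(1.5, -0.5, mu=0.0, sigma2=2.0))   # (1.5 * -0.5) / 2 = -0.375
```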
1.2 Radial Basis Function Networks
• One form of basis function that is widely used in linear regression is the radial basis function.
• Radial Basis Functions (RBFs) have the property that they depend only on the distance from a centre µj:

  φj(x) = h(||x − µj||)

• Used in pattern recognition with noisy data to avoid overfitting.
• Applicable in both interpolation problems and regression models.

Exact Function Interpolation
• Given N input vectors {x1, . . . , xN} and target values {t1, . . . , tN}.
• Goal: Find a function f(x) such that f(xn) = tn for n = 1, . . . , N.
• Model: Express f(x) as a linear combination of radial basis functions, one centred on every data point:

  f(x) = ∑_{n=1}^{N} wn h(||x − xn||)

• The coefficients wn are determined by least squares.
• Exact interpolation is undesirable with noisy data due to overfitting.
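A small sketch of exact interpolation (illustrative; the Gaussian choice of h and the 1-D toy data are my assumptions): with one basis function per data point, the N×N matrix H with Hnm = h(||xn − xm||) is square, so the least-squares solution reduces to solving H w = t exactly.

```python
import numpy as np

def h(r, s=0.1):
    """Radial basis function h(r); here a Gaussian of width s (an arbitrary choice)."""
    return np.exp(-r ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 10)                        # 1-D inputs x_1 .. x_N
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

H = h(np.abs(x[:, None] - x[None, :]))           # H[n, m] = h(|x_n - x_m|)
w = np.linalg.solve(H, t)                        # enforces f(x_n) = t_n exactly

f = lambda xq: h(np.abs(xq[:, None] - x[None, :])) @ w
print(np.allclose(f(x), t))                      # True: exact interpolation of the training data
```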
Regularization Theory
• Radial basis functions emerge naturally from regularization theory.
• The presence of a regularizer ensures the solution does not exactly interpolate the training data.

Efficient Implementations of RBF Networks
• One basis function per data point can be computationally expensive.
• Alternative methods reduce the number of basis functions M such that M < N.
• Basis functions can be chosen using methods like:
  ◦ Random subset of data points.
  ◦ Orthogonal least squares.
  ◦ Clustering algorithms (e.g., K-means).
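To make the last point concrete, a sketch (my own; NumPy only, with a very small hand-rolled K-means) of an RBF network that uses M ≪ N centres chosen by clustering, then fits the weights by least squares:

```python
import numpy as np

def kmeans(x, M, n_iters=20, seed=0):
    """Very small K-means for 1-D data: returns M cluster centres."""
    rng = np.random.default_rng(seed)
    centres = rng.choice(x, size=M, replace=False)
    for _ in range(n_iters):
        assign = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        for j in range(M):
            if np.any(assign == j):
                centres[j] = x[assign == j].mean()
    return centres

def rbf_design(x, centres, s=0.1):
    """Design matrix with one Gaussian basis function per centre."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 100)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

centres = kmeans(x, M=8)                        # M = 8 basis functions for N = 100 points
Phi = rbf_design(x, centres)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # weights by least squares
```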