21CSE451T
PATTERN RECOGNITION TECHNIQUES
UNIT-2
INTRODUCTION TO PATTERN RECOGNITION
SYSTEMS
Parameter Estimation Methods
o Maximum-Likelihood Estimation
o Bayesian Estimation
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Theory
Problems of Dimensionality
Component Analysis and Discriminants
Expectation-Maximization
Hidden Markov Model
PARAMETER ESTIMATION METHODS
In typical supervised pattern classification problems, the estimation of the prior
probabilities presents no serious difficulties. However, estimation of the class-
conditional densities is quite another matter.
The number of available samples always seems too small, and serious problems arise
when the dimensionality of the feature vector x is large.
If we know the number of parameters in advance and our general knowledge about the
problem permits us to parameterize the conditional densities, then the severity of these
problems can be reduced significantly.
Suppose we assume that p(x|ωi) is a normal density with mean μi and covariance matrix
∑i, although we do not know the exact values of these quantities.
This knowledge simplifies the problem from one of estimating an unknown function
p(x|ωi) to one of estimating the parameters μi and ∑i.
There are two common approaches to parameter estimation: maximum-likelihood estimation
and Bayesian estimation. Although the results obtained with the two approaches are
frequently nearly identical, the underlying viewpoints differ.
o Maximum-likelihood methods view the parameters as quantities whose values are fixed
but unknown. The best estimate of their value is defined to be the one that
maximizes the probability of obtaining the samples actually observed.
o Bayesian methods view the parameters as random variables having some
known prior distribution. Observation of the samples converts this to a posterior
density, thereby revising our opinion about the true values of the parameters. In
the Bayesian case, we shall see that a typical effect of observing additional
samples is to sharpen the a posteriori density function, causing it to peak near
the true values of the parameters. This phenomenon is known as Bayesian
learning.
MAXIMUM-LIKELIHOOD ESTIMATION
Maximum-likelihood estimation methods have a number of attractive attributes. First, they
nearly always have good convergence properties as the number of training samples increases.
Furthermore, maximum-likelihood estimation often can be simpler than alternative methods,
such as Bayesian techniques.
The General Principle
Suppose that we separate a collection of samples according to class, so that we have c
data sets, D1, …, Dc, with the samples in Dj having been drawn independently according
to the probability law p(x|ωj). We say such samples are i.i.d.—independent and
identically distributed random variables. We assume that p(x|ωj) has a known
parametric form, and is therefore determined uniquely by the value of a parameter
vector ϴj.
For example, we might have p(x|ωj) ~ N(μj, ∑j), where ϴj consists of the components of
μj and ∑j. To show the dependence of p(x|ωj) on ϴj explicitly, we write it as
p(x|ωj, ϴj). Our problem is to use the information provided by the training samples to obtain
good estimates for the unknown parameter vectors ϴ1, …, ϴc associated with each
category.
To simplify treatment of this problem, we shall assume that samples in Di give no
information about ϴj if i ≠ j; that is, we shall assume that the parameters for the
different classes are functionally independent. With this assumption we thus have c
separate problems of the following form:
Use a set D of training samples drawn independently from the probability density p(x |
ϴ) to estimate the unknown parameter vector ϴ.
Suppose that D contains n samples, x1, …, xn. Then, because the samples were drawn
independently, we have:

p(D|ϴ) = ∏k=1..n p(xk|ϴ)

p(D|ϴ) is called the likelihood of ϴ with respect to the set of samples. The maximum-
likelihood estimate of ϴ is, by definition, the value ϴ^ that maximizes p(D|ϴ);
intuitively, it is the value of ϴ that best agrees with the actually observed training samples.
Figure: The top graph shows several training points in one dimension, known or assumed to be drawn
from a Gaussian of a particular variance, but unknown mean. Four of the infinite number of candidate
source distributions are shown in dashed lines. The middle figure shows the likelihood p(D|ϴ) as a
function of the mean. If we had a very large number of training points, this likelihood would be very
narrow. The value that maximizes the likelihood is marked ϴ^; it also maximizes the logarithm of the
likelihood—that is, the log-likelihood l(ϴ), shown at the bottom. Note that even though they look
similar, the likelihood p(D|ϴ) is shown as a function of ϴ whereas the conditional density p(x|ϴ) is
shown as a function of x. Furthermore, as a function of ϴ, the likelihood p(D|ϴ) is not a probability
density function and its area has no significance.
If the number of parameters to be estimated is p, then we let ϴ denote the p-component vector
ϴ = (ϴ1, …, ϴp)t and let ∇ϴ denote the gradient operator:

∇ϴ = (∂/∂ϴ1, …, ∂/∂ϴp)t

It is usually easier to work with the log-likelihood l(ϴ) = ln p(D|ϴ) = Σk=1..n ln p(xk|ϴ);
the maximum-likelihood estimate ϴ^ then satisfies the necessary condition ∇ϴ l = 0.
Maximum a posteriori or MAP estimators—find the value of ϴ that maximizes l(ϴ) + ln p(ϴ),
where p(ϴ) describes the prior probability of different parameter values.
1. The Gaussian Case: Unknown μ
Suppose that the samples are drawn from a multivariate normal population with mean μ and covariance matrix
∑, where only the mean is unknown. Find the maximum-likelihood estimate of μ. Setting the
gradient of the log-likelihood to zero yields

μ^ = (1/n) Σk=1..n xk

that is, the maximum-likelihood estimate for the unknown population mean is simply the
arithmetic average of the training samples.
2. The Gaussian Case: Unknown μ and ∑
In the more general (and more typical) multivariate normal case, neither the mean μ nor the
covariance matrix ∑ is known. Thus, these unknown parameters constitute the components of
the parameter vector ϴ. Consider first the univariate case with ϴ1 = μ and ϴ2 = σ². Here the
log-likelihood of a single point is:

ln p(xk|ϴ) = −(1/2) ln(2π ϴ2) − (1/(2 ϴ2)) (xk − ϴ1)²

Setting the gradient of the full log-likelihood to zero and solving gives the maximum-likelihood
estimates μ^ = (1/n) Σk xk and σ^² = (1/n) Σk (xk − μ^)².
3. Bias
The maximum-likelihood estimate for the variance σ² is biased; that is, the expected value over
all data sets of size n of the sample variance is not equal to the true variance:

E[(1/n) Σi=1..n (xi − x̄)²] = ((n − 1)/n) σ² ≠ σ²

The corresponding unbiased estimator is

C = (1/(n − 1)) Σk=1..n (xk − μ^)(xk − μ^)t

where C is the so-called sample covariance matrix.
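As a quick numerical illustration (a minimal numpy sketch; the sample size and distribution parameters are illustrative assumptions), the following computes the ML mean, the biased ML variance, and the unbiased sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)    # n samples from N(5, 2^2)

mu_hat = x.mean()                               # ML estimate of the mean
var_ml = ((x - mu_hat) ** 2).mean()             # biased ML variance: divides by n
var_unbiased = ((x - mu_hat) ** 2).sum() / (len(x) - 1)  # sample variance C: divides by n-1

# Over many data sets, E[var_ml] = (n-1)/n * sigma^2, which underestimates sigma^2.
print(mu_hat, var_ml, var_unbiased)
```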
BAYESIAN ESTIMATION
Maximum-likelihood methods we view the true parameter vector we seek, ϴ, to be fixed, in
Bayesian learning we consider ϴ to be a random variable, and training data allow us to convert
a distribution on this variable into a posterior probability density.
1. The Class-Conditional Densities: If we again let D denote the set of samples, then we
can emphasize the role of the samples by saying that our goal is to compute the posterior
probabilities P(ωi |x, D). From these probabilities we can obtain the Bayes classifier.
Given the sample D, Bayes formula then becomes:

P(ωi|x, D) = p(x|ωi, D) P(ωi|D) / Σj=1..c p(x|ωj, D) P(ωj|D)

Assume that the true values of the prior probabilities are known or obtainable from a trivial
calculation; thus we substitute P(ωi|D) = P(ωi).
2. The Parameter Distribution
Our basic goal is to compute p(x|D), which is as close as we can come to obtaining the unknown
p(x). We do this by integrating the joint density p(x, ϴ|D) over ϴ. That is,

p(x|D) = ∫ p(x, ϴ|D) dϴ

Because the selection of x is independent of the training samples given ϴ, the above equation
can be written as:

p(x|D) = ∫ p(x|ϴ) p(ϴ|D) dϴ
BAYESIAN PARAMETER ESTIMATION: GAUSSIAN
CASE
We use Bayesian estimation techniques to calculate the a posteriori density p(ϴ|D) and the desired
probability density p(x|D) for the case where p(x|μ) ~ N(μ, ∑).
1. The Univariate Case: p(μ|D)
Consider the case where μ is the only unknown parameter; the variance σ² is assumed known.
For simplicity we treat first the univariate case, with a known Gaussian prior p(μ) ~ N(μ0, σ0²).
Gaussian Form: the posterior is again Gaussian, p(μ|D) ~ N(μn, σn²), where

μn = (n σ0² / (n σ0² + σ²)) μ^n + (σ² / (n σ0² + σ²)) μ0, with sample mean μ^n = (1/n) Σk=1..n xk
σn² = σ0² σ² / (n σ0² + σ²)

As n increases, σn² shrinks toward zero and μn approaches the sample mean: observing
additional samples sharpens p(μ|D) around the true value (Bayesian learning).
2. The Univariate Case: p(x|D)
Having obtained the a posteriori density for the mean, p(μ|D), all that remains is to obtain the
"class-conditional" density p(x|D). Carrying out the integration p(x|D) = ∫ p(x|μ) p(μ|D) dμ
yields p(x|D) ~ N(μn, σ² + σn²): the known variance plus the residual uncertainty about the mean.
3. The Multivariate Case
The treatment of the multivariate case in which ∑ is known but μ is not is a direct generalization
of the univariate case.
BAYESIAN PARAMETER ESTIMATION: GENERAL
THEORY
The Bayesian approach can be used to obtain the desired density p(x|D) in a special case—the
multivariate Gaussian. This approach can be generalized to apply to any situation in which the
unknown density can be parameterized.
The basic assumptions are summarized as follows:
• The form of the density p(x|ϴ) is assumed to be known, but the value of the parameter vector
ϴ is not known exactly.
• Our initial knowledge about ϴ is assumed to be contained in a known prior density p(ϴ).
• The rest of our knowledge about ϴ is contained in a set D of n samples x1, …, xn drawn
independently according to the unknown probability density p(x).
The basic problem is to compute the posterior density p(ϴ|D):

p(ϴ|D) = p(D|ϴ) p(ϴ) / ∫ p(D|ϴ) p(ϴ) dϴ, with p(D|ϴ) = ∏k=1..n p(xk|ϴ)

To indicate explicitly the number of samples in a set for a single category, we shall write Dn =
{x1, x2, …, xn}. If n > 1,

p(Dn|ϴ) = p(xn|ϴ) p(Dn−1|ϴ)

Using Bayes formula, we see that the posterior density satisfies the recursion relation:

p(ϴ|Dn) = p(xn|ϴ) p(ϴ|Dn−1) / ∫ p(xn|ϴ) p(ϴ|Dn−1) dϴ
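The recursion can be visualized on a discrete grid of candidate parameter values. The sketch below (all numbers illustrative) updates a posterior over an unknown Gaussian mean one sample at a time, and the posterior peaks near the true value, the Bayesian-learning effect described earlier:

```python
import numpy as np

# Grid-based illustration of p(theta|D^n) ∝ p(x_n|theta) p(theta|D^{n-1})
theta = np.linspace(-5, 5, 1001)        # candidate values for the unknown mean
post = np.ones_like(theta)              # flat prior p(theta|D^0)
post /= post.sum()

rng = np.random.default_rng(1)
for x_n in rng.normal(loc=2.0, scale=1.0, size=50):   # true mean = 2, sigma = 1
    like = np.exp(-0.5 * (x_n - theta) ** 2)          # p(x_n|theta)
    post = like * post                                 # recursion step
    post /= post.sum()                                 # normalize

print(theta[post.argmax()])  # posterior peaks near 2 and sharpens as n grows
```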
When Do Maximum-Likelihood and Bayes Methods Differ?
There are several criteria that will influence our choice.
One is computational complexity, and here maximum-likelihood methods are often to
be preferred because they require merely differential calculus techniques or gradient
search for ϴ^, rather than a possibly complex multidimensional integration needed in
Bayesian estimation.
This leads to another consideration: interpretability. In many cases the maximum-
likelihood solution will be easier to interpret and understand because it returns the
single best model from the set the designer provided. In contrast, Bayesian methods
give a weighted average of models (parameters), often leading to solutions more
complicated and harder to understand than those provided by the designer. The
Bayesian approach reflects the remaining uncertainty in the possible models.
Another consideration is our confidence in the prior information, such as in the form of
the underlying distribution p(x|ϴ). A maximum-likelihood solution p(x|ϴ^) must of
course be of the assumed parametric form; not so for the Bayesian solution.
When p(ϴ|D) is broad or asymmetric around ϴ^, the two methods are quite likely to yield
final densities p(x|D) that differ from one another. Bayes methods would exploit such
asymmetry information, whereas maximum-likelihood methods would not.
When designing a classifier by either of these methods, we determine the posterior
densities for each category and also classify a test point by the maximum posterior. (If
there are costs, summarized in a cost matrix, these can be incorporated as well.)
There are three sources of classification error in our final system:
Bayes or Indistinguishability Error: The error due to overlapping densities p(x|ωi)
for different values of i. This error is an inherent property of the problem and can never
be eliminated.
Model Error: The error due to having an incorrect model. This error can only be
eliminated if the designer specifies a model that includes the true model which
generated the data. Designers generally choose the model based on knowledge of the
problem domain rather than on the subsequent estimation method, and thus the model
error in maximum-likelihood and Bayes methods rarely differs.
Estimation Error: The error arising from the fact that the parameters are estimated
from a finite sample. This error can best be reduced by increasing the amount of training data.
Noninformative Priors and Invariance
In a Bayesian framework we can have a "noninformative" prior over a parameter for a single
category's distribution. Suppose for instance that we are using Bayesian methods to infer from
data some position and scale parameters, which we denote μ and σ, which might be the mean
and standard deviation of a Gaussian, or the position and width of a triangle distribution.
Gibbs Algorithm
An alternative approximation is to pick a single parameter vector ϴ according to p(ϴ|D) and use
it as if it were the true value. Given weak assumptions, this so-called Gibbs algorithm gives a
misclassification error that is at most twice the expected error of the Bayes optimal classifier.
PROBLEMS OF DIMENSIONALITY
In practical multicategory applications, it is not at all unusual to encounter problems involving
fifty or a hundred features, particularly if the features are binary valued.
1. Accuracy, Dimension, and Training Sample Size
If the features are statistically independent, there are some theoretical results that suggest the
possibility of excellent performance. For example, consider the two-class multivariate normal
case with the same covariance, i.e., where p(x|ωj) ~ N(μj, ∑), j = 1, 2. If the prior probabilities
are equal, then it is not hard to show that the Bayes error rate is given by:

P(e) = (1/√(2π)) ∫ from r/2 to ∞ e^(−u²/2) du

where r² is the squared Mahalanobis distance:

r² = (μ1 − μ2)t ∑⁻¹ (μ1 − μ2)

Thus, the probability of error decreases as r increases, approaching zero as r approaches
infinity. In the conditionally independent case, ∑ = diag(σ1², …, σd²), and the squared
distance becomes

r² = Σi=1..d ((μi1 − μi2) / σi)²
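A short numerical sketch of these formulas (the means and covariance are illustrative assumptions; it uses the identity that the tail integral above equals 0.5·erfc(r/(2√2))):

```python
import numpy as np
from math import erfc, sqrt

def bayes_error(mu1, mu2, cov):
    """Equal-prior two-class Gaussian Bayes error from the Mahalanobis distance r."""
    diff = np.asarray(mu1) - np.asarray(mu2)
    r = sqrt(diff @ np.linalg.inv(cov) @ diff)   # r^2 = (mu1-mu2)^t Sigma^-1 (mu1-mu2)
    return 0.5 * erfc(r / (2 * sqrt(2)))         # = (1/sqrt(2*pi)) * integral_{r/2}^inf e^{-u^2/2} du

cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])
print(bayes_error([0.0, 0.0], [2.0, 1.0], cov))  # error shrinks toward 0 as r grows
```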
2. Computational Complexity
First, we will need to understand the notion of the order of a function f(x): we say that f(x)
is "of the order of h(x)"—written f(x) = O(h(x)) and generally read "big oh of h(x)"—if there
exist constants c and x0 such that |f(x)| ≤ c|h(x)| for all x > x0. This means simply that for
sufficiently large x, an upper bound on the function grows no worse than h(x). For instance,
suppose f(x) = a0 + a1x + a2x²; in that case we have f(x) = O(x²) because for sufficiently
large x the constant, linear, and quadratic terms can be "overcome" by proper choice of c and
x0. The generalization to functions of two or more variables is straightforward. It should be
clear that by the definition above, the big oh order of a function is not unique. For instance, we
can describe our particular f(x) as being O(x²), O(x³), O(x⁴), or O(x² ln x).
3. Overfitting
One possibility is to reduce the dimensionality, either by redesigning the feature extractor, by
selecting an appropriate subset of the existing features, or by combining the existing features
in some way.
Figure: The "training data" (black dots) were selected from a quadratic function plus Gaussian
noise, i.e., f(x) = ax² + bx + c + ε where p(ε) ~ N(0, σ²). The 10th-degree polynomial shown
fits the data perfectly, but we desire instead the second-order function f(x), because it would
lead to better predictions for new samples.
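A minimal numpy sketch of this effect (the coefficients and noise level are illustrative): fit the same noisy quadratic data with degree-2 and degree-10 polynomials and compare predictions at a point that was not in the training set:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 11)
y = 2 * x**2 - x + 1 + rng.normal(scale=0.1, size=x.shape)  # quadratic plus Gaussian noise

p2 = np.polyfit(x, y, deg=2)     # matches the true model order; generalizes well
p10 = np.polyfit(x, y, deg=10)   # fits all 11 training points exactly, noise included

x_new = 0.95                     # a point near, but not in, the training set
print(np.polyval(p2, x_new))     # close to the true 2*x^2 - x + 1
print(np.polyval(p10, x_new))    # typically erratic: the overfit curve wiggles between samples
```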
COMPONENT ANALYSIS AND DISCRIMINANTS
One approach to coping with the problem of excessive dimensionality is to reduce the
dimensionality by combining features. Linear combinations are particularly attractive because
they are simple to compute and analytically tractable. In effect, linear methods project the high-
dimensional data onto a lower dimensional space.
There are three classical approaches to finding effective linear transformations.
1. Principal Component Analysis or PCA—seeks a projection that best represents the
data in a least-squares sense.
2. Fisher Linear Discriminant or FLD—finds the linear combination of features that
best separates two or more classes of data.
3. Multiple Discriminant Analysis or MDA—seeks a projection that best separates the
data in a least-squares sense.
1. Principal Component Analysis (PCA) is a dimensionality reduction technique used to
simplify complex datasets while preserving as much variability (information) as possible.
Purpose of PCA
Reduce the number of features (variables) in a dataset while retaining the most
important patterns.
Remove redundancy caused by correlated variables.
Help with visualization when data has many dimensions (e.g., reducing 50 features to
2 or 3).
Speed up machine learning algorithms by working with fewer variables.
PCA finds new axes (called principal components) that:
Are linear combinations of the original features.
Capture the maximum variance in the data.
Are uncorrelated with each other.
Steps of PCA
1. Standardize the Data
o PCA is affected by scale, so data is often standardized to have mean 0 and
variance 1.
2. Compute the Covariance Matrix
o Measures how variables vary together.
o Example: If height and weight are correlated, their covariance will be high.
3. Find Eigenvalues and Eigenvectors
o Eigenvectors → directions of new feature axes (principal components).
o Eigenvalues → how much variance each principal component captures.
4. Sort Components by Variance
o Keep the top k principal components that explain most of the variance.
5. Transform the Data
o Project the original data onto the selected principal components.
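The five steps above can be sketched directly with numpy's eigendecomposition (a minimal illustration; the simulated height/weight data and the function name are assumptions for the example):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # 1. standardize
    cov = np.cov(Xs, rowvar=False)                   # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]                # 4. sort by variance explained
    W = eigvecs[:, order[:k]]
    return Xs @ W                                    # 5. project the data

rng = np.random.default_rng(3)
height = rng.normal(170, 10, 100)
weight = 0.9 * height + rng.normal(0, 5, 100)        # two correlated features
Z = pca(np.column_stack([height, weight]), k=1)      # keep PC1 ("body size")
print(Z.shape)                                       # (100, 1)
```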
Example
Imagine you have two features:
Height
Weight
Both are correlated. PCA will:
Create PC1 (largest variance direction – maybe “body size”).
Create PC2 (smaller variance direction – maybe “slenderness”).
You can drop PC2 if it explains very little variance.
When to Use PCA
Large datasets with many features.
High correlation among variables.
You need faster computation or better visualization.
Preprocessing before machine learning to avoid overfitting.
2. Fisher Linear Discriminant or FLD
FLD is also known as Linear Discriminant Analysis (LDA) in the two-class case, but its goal
is different from that of PCA.
Purpose of FLD
PCA: Finds directions that capture maximum variance (unsupervised — ignores class
labels).
FLD: Finds a direction that best separates two or more classes (supervised — uses
class labels).
Main use: Dimensionality reduction while maximizing class separability.
FLD projects data onto a line so that:
The distance between class means is as large as possible.
The variance within each class is as small as possible.
This is measured by the Fisher criterion:
J(w) = (m1 − m2)² / (s1² + s2²)
where:
m1, m2 are the projected means of the two classes.
s1², s2² are the projected variances.
Steps for Two-Class FLD
1. Compute class means μ1,μ2 in the original space.
2. Compute scatter within each class (within-class scatter matrices S1,S2).
3. Compute the total within-class scatter matrix:
SW = S1 + S2
4. Find the projection vector:
w = SW⁻¹ (μ1 − μ2)
5. Project the data:
y = wt x
6. Classify:
o Choose a threshold (e.g., midpoint of projected means).
o Points on one side → Class 1, other side → Class 2.
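A minimal sketch of these steps for two simulated classes (the data and the function name are illustrative):

```python
import numpy as np

def fisher_ld(X1, X2):
    """Two-class FLD: w = S_W^{-1}(mu1 - mu2), plus a midpoint threshold."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)              # within-class scatter, class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)              # within-class scatter, class 2
    w = np.linalg.solve(S1 + S2, mu1 - mu2)     # solves S_W w = (mu1 - mu2)
    threshold = 0.5 * (w @ mu1 + w @ mu2)       # midpoint of the projected means
    return w, threshold

rng = np.random.default_rng(4)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))
X2 = rng.normal([3, 2], 1.0, size=(50, 2))
w, t = fisher_ld(X1, X2)
pred_class1 = (X1 @ w) > t                      # project y = w^t x, then threshold
print(pred_class1.mean())                       # fraction of class 1 on its own side
```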
Example Intuition
Imagine two classes in 2D: red points and blue points.
PCA might pick the longest variance direction — which could cut through the classes
and not separate them well.
FLD finds the line that best separates red and blue when projected into 1D.
3. Multiple Discriminant Analysis or MDA
Multiple Discriminant Analysis (MDA) is essentially the multi-class generalization of
Fisher’s Linear Discriminant (FLD).
It’s also commonly known as Linear Discriminant Analysis (LDA) in statistics and machine
learning when there are more than two classes.
FLD works for two classes — one projection vector w separates them.
MDA works for k classes — it finds up to k−1 new axes (discriminant functions) that
maximize overall class separability.
It is supervised dimensionality reduction — reduces data from d-dimensions to at
most k−1 dimensions while preserving class discrimination.
Core Idea
Instead of just maximizing the separation between two means, MDA:
Maximizes the separation between all class means.
Minimizes the variance within each class.
Uses the between-class scatter SB and within-class scatter SW.
EXPECTATION-MAXIMIZATION
The Expectation-Maximization (EM) algorithm is an iterative method used to find the
maximum likelihood or maximum a posteriori estimates of parameters in statistical models,
particularly when dealing with hidden or incomplete data. It's frequently applied in machine
learning, especially in situations where some data points are missing or unobserved (latent
variables).
The EM algorithm is used to find the best parameters of a statistical model when:
Some data is missing, hidden, or latent.
The likelihood function is hard to optimize directly.
It’s widely used for:
Clustering (e.g., Gaussian Mixture Models).
Missing data imputation.
Hidden Markov Models.
EM alternates between:
1. E-step (Expectation) → Estimate missing/hidden variables given the current
parameter guesses.
2. M-step (Maximization) → Re-estimate parameters by maximizing the expected log-
likelihood from the E-step.
This process repeats until convergence.
Why Not Optimize Directly?
If we knew the missing/hidden values, parameter estimation would be easy (just like
normal maximum likelihood estimation).
If we knew the parameters, inferring hidden values would be easy.
But when both are unknown, EM solves it iteratively.
Step-by-Step (Generic EM)
Given:
Observed data X
Hidden variables Z
Model parameters θ
Step 1 — Initialize Parameters: Choose initial guess for parameters θ(0).
Step 2 — E-Step (Expectation): Compute:

Q(θ|θ(t)) = E[ln p(X, Z|θ) | X, θ(t)]

This means: form the expected complete-data log-likelihood, averaging over the hidden
variables Z using the current parameters θ(t).
Step 3 — M-Step (Maximization):
Update parameters by:

θ(t+1) = argmax over θ of Q(θ|θ(t))

This means: choose the parameters that maximize the expected log-likelihood from the E-step.
Step 4 — Repeat:
Alternate E-step and M-step until parameters stop changing significantly (convergence).
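As a concrete illustration, here is a minimal EM sketch for a two-component one-dimensional Gaussian mixture with equal weights and known unit variances, so only the two means are estimated (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])  # hidden: which component

mu = np.array([-1.0, 1.0])                       # initial guesses for the two means
for _ in range(50):
    # E-step: responsibility of each component for each point (equal weights cancel)
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate each mean as a responsibility-weighted average
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # converges near the true means (-2, 3)
```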
HIDDEN MARKOV MODEL (HMM)
It is a statistical model that is used to describe the probabilistic relationship between a sequence
of observations and a sequence of hidden states.
An HMM consists of two types of variables: hidden states and observations.
o The hidden states are the underlying variables that generate the observed data,
but they are not directly observable.
o The observations are the variables that are measured and observed.
The hidden Markov model is widely used in speech recognition, natural language
modelling, on-line handwriting recognition, and the analysis of biological sequences
such as proteins and DNA.
The Hidden Markov Model (HMM) describes the relationship between the hidden states and
the observations using two sets of probabilities: the transition probabilities and the
emission probabilities.
o The transition probabilities describe the probability of transitioning from one
hidden state to another.
o The emission probabilities describe the probability of observing an output
given a hidden state.
Hidden Markov Model Algorithm
• Set of hidden states: {s1, s2, …, sN}
• The process moves from one state to another, generating a sequence of states:
si1, si2, …, sik, …
• Set of observations: {o1, o2, …, ok}
• Markov chain property: the probability of each subsequent state depends only on
the previous state: P(sik | si1, si2, …, sik−1) = P(sik | sik−1)
• To define a hidden Markov model, the following probabilities have to be specified:
• transition probabilities aij = P(si | sj)
• initial probabilities πi = P(si)
• emission probabilities P(oi | si)
Example of Hidden Markov Model:
• Two states: ‘Low’ and ‘High’ atmospheric pressure.
• Two observations: ‘Rain’ and ‘Dry’.
• Transition probabilities:
P(‘Low’|‘Low’)=0.3
P(‘High’|‘Low’)=0.7
P(‘Low’|‘High’)=0.2
P(‘High’|‘High’)=0.8
• Emission/Observation probabilities:
P(‘Rain’|‘Low’)=0.6
P(‘Dry’|‘Low’)=0.4
P(‘Rain’|‘High’)=0.4
P(‘Dry’|‘High’)=0.6
• Initial probabilities: say
P(‘Low’)=0.4
P(‘High’)=0.6
Calculation of observation sequence probability:
• Suppose we want to calculate a probability of a sequence of observations in our
example, {‘Dry’,’Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,’Rain’} ) = P({‘Dry’,’Rain’} , {‘Low’,’Low’}) + P({‘Dry’,’Rain’} , {‘Low’,’High’}) +
P({‘Dry’,’Rain’} , {‘High’,’Low’}) + P({‘Dry’,’Rain’} , {‘High’,’High’})
where the first term is:
P({‘Dry’,’Rain’} , {‘Low’,’Low’}) = P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’})
= P(‘Dry’|’Low’) P(‘Rain’|’Low’) P(‘Low’) P(‘Low’|’Low’)
= 0.4 × 0.6 × 0.4 × 0.3 = 0.0288
Similarly, calculate the remaining three joint terms and sum all four to obtain
P({‘Dry’,’Rain’}). (If instead we want the most likely hidden state sequence, we select
the maximum of these joint probabilities rather than summing them.)
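This enumeration over hidden state sequences can be verified with a short brute-force sketch (the dictionary layout is an illustrative choice):

```python
from itertools import product

# Brute-force evaluation of P({'Dry','Rain'}) for the Low/High example above.
init = {'Low': 0.4, 'High': 0.6}
trans = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,     # keys are (from, to)
         ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
emit = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
        ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}

obs = ['Dry', 'Rain']
total, best = 0.0, (None, 0.0)
for states in product(['Low', 'High'], repeat=len(obs)):   # all 4 hidden sequences
    p = init[states[0]] * emit[(states[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(states[t - 1], states[t])] * emit[(states[t], obs[t])]
    total += p                                              # evaluation: sum over paths
    if p > best[1]:
        best = (states, p)                                  # decoding: max over paths

print(total)  # P({'Dry','Rain'}); the Low,Low term is 0.4*0.4*0.3*0.6 = 0.0288
print(best)   # most likely hidden state sequence and its joint probability
```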
If we had the hidden states, we could calculate the probability of the sequence directly. But we
do not have the states; all we have is a sequence of observations.
Problem Associated with HMM model:
Problem 1 (Evaluation): Given an HMM and a sequence of observations, what is the
probability that the observations were generated by the model? The Forward-Backward
algorithm solves this problem.
Problem 2 (Decoding): Given an HMM and a sequence of observations, what is the
most likely state sequence in the model that produced the observations? The Viterbi
algorithm solves this problem.
Problem 3 (Learning): Given an observation sequence, how should the model
parameters be adjusted to maximize the probability of the observation sequence? The
Baum-Welch algorithm solves this problem.