21CSE451T
PATTERN RECOGNITION TECHNIQUES
UNIT-2
INTRODUCTION TO PATTERN RECOGNITION
SYSTEMS
Parameter Estimation Methods
o Maximum-Likelihood Estimation
o Bayesian Estimation
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Theory
Problems of Dimensionality
Component Analysis and Discriminants
Expectation-Maximization
Hidden Markov Model
PARAMETER ESTIMATION METHODS
In typical supervised pattern classification problems, the estimation of the prior
probabilities presents no serious difficulties. However, estimation of the class-
conditional densities is quite another matter.
The number of available samples always seems too small, and serious problems arise
when the dimensionality of the feature vector x is large.
If we know the number of parameters in advance and our general knowledge about the
problem permits us to parameterize the conditional densities, then the severity of these
problems can be reduced significantly.
Suppose we assume that p(x|ωi) is a normal density with mean μi and covariance matrix
∑i, although we do not know the exact values of these quantities.
This knowledge simplifies the problem from one of estimating an unknown function
p(x|ωi) to one of estimating the parameters μi and ∑i.
There are two common approaches to parameter estimation: maximum-likelihood estimation
and Bayesian estimation. Although the results obtained with the two approaches are
frequently nearly identical, the underlying viewpoints differ.
o Maximum-likelihood methods view the parameters as quantities whose values are fixed
but unknown. The best estimate of their value is defined to be the one that
maximizes the probability of obtaining the samples actually observed.
o Bayesian methods view the parameters as random variables having some
known prior distribution. Observation of the samples converts this to a posterior
density, thereby revising our opinion about the true values of the parameters. In
the Bayesian case, we shall see that a typical effect of observing additional
samples is to sharpen the a posteriori density function, causing it to peak near
the true values of the parameters. This phenomenon is known as Bayesian
learning.
MAXIMUM-LIKELIHOOD ESTIMATION
Maximum-likelihood estimation methods have a number of attractive attributes. First, they
nearly always have good convergence properties as the number of training samples increases.
Furthermore, maximum-likelihood estimation often can be simpler than alternative methods,
such as Bayesian techniques.
The General Principle
Suppose that we separate a collection of samples according to class, so that we have c
data sets, D1, …, Dc, with the samples in Dj having been drawn independently according
to the probability law p(x|ωj). We say such samples are i.i.d.—independent and
identically distributed random variables. We assume that p(x|ωj) has a known
parametric form, and is therefore determined uniquely by the value of a parameter
vector ϴj.
For example, we might have p(x|ωj) ~ N(μj, ∑j), where ϴj consists of the components of
μj and ∑j. To show the dependence of p(x|ωj) on ϴj explicitly, we write it as
p(x|ωj, ϴj). Our problem is to use the information provided by the training samples to obtain
good estimates for the unknown parameter vectors ϴ1, …, ϴc associated with each
category.
To simplify treatment of this problem, we shall assume that samples in Di give no
information about ϴj if i ≠ j; that is, we shall assume that the parameters for the
different classes are functionally independent. With this assumption we thus have c
separate problems of the following form:
Use a set D of training samples drawn independently from the probability density p(x |
ϴ) to estimate the unknown parameter vector ϴ.
Suppose that D contains n samples, x1, …, xn. Then, because the samples were drawn
independently, we have:

p(D|ϴ) = ∏k=1..n p(xk|ϴ)

p(D|ϴ) is called the likelihood of ϴ with respect to the set of samples. The maximum-
likelihood estimate of ϴ is, by definition, the value ϴ^ that maximizes p(D|ϴ);
intuitively, it is the value of ϴ that best agrees with the actually observed training samples.
Figure: The top graph shows several training points in one dimension, known or assumed to be drawn
from a Gaussian of a particular variance, but unknown mean. Four of the infinite number of candidate
source distributions are shown in dashed lines. The middle figure shows the likelihood p(D|ϴ) as a
function of the mean. If we had a very large number of training points, this likelihood would be very
narrow. The value that maximizes the likelihood is marked ϴ^; it also maximizes the logarithm of the
likelihood—that is, the log-likelihood l(ϴ), shown at the bottom. Note that even though they look
similar, the likelihood p(D|ϴ) is shown as a function of ϴ whereas the conditional density p(x|ϴ) is
shown as a function of x. Furthermore, as a function of ϴ, the likelihood p(D|ϴ) is not a probability
density function and its area has no significance.
If the number of parameters to be estimated is p, then we let ϴ denote the p-component vector
ϴ = (ϴ1, …, ϴp)t and let ∇ϴ denote the gradient operator:

∇ϴ = (∂/∂ϴ1, …, ∂/∂ϴp)t

It is usually easier to work with the log-likelihood l(ϴ) = ln p(D|ϴ) = Σk=1..n ln p(xk|ϴ);
the maximum-likelihood estimate ϴ^ then satisfies the necessary condition ∇ϴ l = 0.
Maximum a posteriori or MAP estimators—find the value of ϴ that maximizes l(ϴ) + ln p(ϴ),
where p(ϴ) describes the prior probability of different parameter values.
1. The Gaussian Case: Unknown μ
Suppose that the samples are drawn from a multivariate normal population with mean μ and covariance matrix
∑, where only the mean is unknown. Find the maximum-likelihood estimate of μ. Setting the
gradient of the log-likelihood to zero yields

μ^ = (1/n) Σk=1..n xk

that is, the maximum-likelihood estimate for the unknown population mean is simply the
arithmetic average of the training samples.
2. The Gaussian Case: Unknown μ and ∑
In the more general (and more typical) multivariate normal case, neither the mean μ nor the
covariance matrix ∑ is known. Thus, these unknown parameters constitute the components of
the parameter vector ϴ. Consider first the univariate case with ϴ1 = μ and ϴ2 = σ². Here the
log-likelihood of a single point is:

ln p(xk|ϴ) = −(1/2) ln(2π ϴ2) − (1/(2 ϴ2)) (xk − ϴ1)²

Setting the gradient of the full log-likelihood to zero and solving gives the maximum-likelihood
estimates μ^ = (1/n) Σk xk and σ^² = (1/n) Σk (xk − μ^)².
3. Bias
The maximum-likelihood estimate for the variance σ² is biased; that is, the expected value over
all data sets of size n of the sample variance is not equal to the true variance:

E[(1/n) Σi=1..n (xi − x̄)²] = ((n − 1)/n) σ² ≠ σ²

The corresponding unbiased estimator is

C = (1/(n − 1)) Σk=1..n (xk − μ^)(xk − μ^)t

where C is the so-called sample covariance matrix.
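As a quick numerical illustration (a minimal numpy sketch; the sample size and distribution parameters are illustrative assumptions), the following computes the ML mean, the biased ML variance, and the unbiased sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)    # n samples from N(5, 2^2)

mu_hat = x.mean()                               # ML estimate of the mean
var_ml = ((x - mu_hat) ** 2).mean()             # biased ML variance: divides by n
var_unbiased = ((x - mu_hat) ** 2).sum() / (len(x) - 1)  # sample variance C: divides by n-1

# Over many data sets, E[var_ml] = (n-1)/n * sigma^2, which underestimates sigma^2.
print(mu_hat, var_ml, var_unbiased)
```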
BAYESIAN ESTIMATION
Maximum-likelihood methods we view the true parameter vector we seek, ϴ, to be fixed, in
Bayesian learning we consider ϴ to be a random variable, and training data allow us to convert
a distribution on this variable into a posterior probability density.
1. The Class-Conditional Densities: If we again let D denote the set of samples, then we
can emphasize the role of the samples by saying that our goal is to compute the posterior
probabilities P(ωi |x, D). From these probabilities we can obtain the Bayes classifier.
Given the sample D, Bayes formula then becomes:

P(ωi|x, D) = p(x|ωi, D) P(ωi|D) / Σj=1..c p(x|ωj, D) P(ωj|D)

Assume that the true values of the prior probabilities are known or obtainable from a trivial
calculation; thus we substitute P(ωi|D) = P(ωi).
2. The Parameter Distribution
Our basic goal is to compute p(x|D), which is as close as we can come to obtaining the unknown
p(x). We do this by integrating the joint density p(x, ϴ|D) over ϴ. That is,

p(x|D) = ∫ p(x, ϴ|D) dϴ

Because the selection of x is independent of the training samples given ϴ, the above equation
can be written as:

p(x|D) = ∫ p(x|ϴ) p(ϴ|D) dϴ
BAYESIAN PARAMETER ESTIMATION: GAUSSIAN
CASE
We use Bayesian estimation techniques to calculate the a posteriori density p(ϴ|D) and the desired
probability density p(x|D) for the case where p(x|μ) ~ N(μ, ∑).
1. The Univariate Case: p(μ|D)
Consider the case where μ is the only unknown parameter; the variance σ² is assumed known.
For simplicity we treat first the univariate case, with a known Gaussian prior p(μ) ~ N(μ0, σ0²).
Gaussian Form: the posterior is again Gaussian, p(μ|D) ~ N(μn, σn²), where

μn = (n σ0² / (n σ0² + σ²)) μ^n + (σ² / (n σ0² + σ²)) μ0, with sample mean μ^n = (1/n) Σk=1..n xk
σn² = σ0² σ² / (n σ0² + σ²)

As n increases, σn² shrinks toward zero and μn approaches the sample mean: observing
additional samples sharpens p(μ|D) around the true value (Bayesian learning).
2. The Univariate Case: p(x|D)
Having obtained the a posteriori density for the mean, p(μ|D), all that remains is to obtain the
"class-conditional" density p(x|D). Carrying out the integration p(x|D) = ∫ p(x|μ) p(μ|D) dμ
yields p(x|D) ~ N(μn, σ² + σn²): the known variance plus the residual uncertainty about the mean.
3. The Multivariate Case
The treatment of the multivariate case in which ∑ is known but μ is not is a direct generalization
of the univariate case.
BAYESIAN PARAMETER ESTIMATION: GENERAL
THEORY
The Bayesian approach can be used to obtain the desired density p(x|D) in a special case—the
multivariate Gaussian. This approach can be generalized to apply to any situation in which the
unknown density can be parameterized.
The basic assumptions are summarized as follows:
• The form of the density p(x|ϴ) is assumed to be known, but the value of the parameter vector
ϴ is not known exactly.
• Our initial knowledge about ϴ is assumed to be contained in a known prior density p(ϴ).
• The rest of our knowledge about ϴ is contained in a set D of n samples x1, …, xn drawn
independently according to the unknown probability density p(x).
The basic problem is to compute the posterior density p(ϴ|D):

p(ϴ|D) = p(D|ϴ) p(ϴ) / ∫ p(D|ϴ) p(ϴ) dϴ, with p(D|ϴ) = ∏k=1..n p(xk|ϴ)

To indicate explicitly the number of samples in a set for a single category, we shall write Dn =
{x1, x2, …, xn}. If n > 1,

p(Dn|ϴ) = p(xn|ϴ) p(Dn−1|ϴ)

Using Bayes formula, we see that the posterior density satisfies the recursion relation:

p(ϴ|Dn) = p(xn|ϴ) p(ϴ|Dn−1) / ∫ p(xn|ϴ) p(ϴ|Dn−1) dϴ
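The recursion can be visualized on a discrete grid of candidate parameter values. The sketch below (all numbers illustrative) updates a posterior over an unknown Gaussian mean one sample at a time, and the posterior peaks near the true value, the Bayesian-learning effect described earlier:

```python
import numpy as np

# Grid-based illustration of p(theta|D^n) ∝ p(x_n|theta) p(theta|D^{n-1})
theta = np.linspace(-5, 5, 1001)        # candidate values for the unknown mean
post = np.ones_like(theta)              # flat prior p(theta|D^0)
post /= post.sum()

rng = np.random.default_rng(1)
for x_n in rng.normal(loc=2.0, scale=1.0, size=50):   # true mean = 2, sigma = 1
    like = np.exp(-0.5 * (x_n - theta) ** 2)          # p(x_n|theta)
    post = like * post                                 # recursion step
    post /= post.sum()                                 # normalize

print(theta[post.argmax()])  # posterior peaks near 2 and sharpens as n grows
```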
When Do Maximum-Likelihood and Bayes Methods Differ?
There are several criteria that will influence our choice.
One is computational complexity, and here maximum-likelihood methods are often to
be preferred because they require merely differential calculus techniques or gradient
search for ϴ^, rather than a possibly complex multidimensional integration needed in
Bayesian estimation.
This leads to another consideration: interpretability. In many cases the maximum-
likelihood solution will be easier to interpret and understand because it returns the
single best model from the set the designer provided. In contrast, Bayesian methods
give a weighted average of models (parameters), often leading to solutions more
complicated and harder to understand than those provided by the designer. The
Bayesian approach reflects the remaining uncertainty in the possible models.
Another consideration is our confidence in the prior information, such as in the form of
the underlying distribution p(x|ϴ). A maximum-likelihood solution p(x|ϴ^) must of
course be of the assumed parametric form; not so for the Bayesian solution.
When p(ϴ|D) is broad or asymmetric around ϴ^, the two methods are quite likely to yield
final densities p(x|D) that differ from one another. Bayes methods would exploit such
asymmetry information, whereas maximum-likelihood methods would not.
When designing a classifier by either of these methods, we determine the posterior
densities for each category and also classify a test point by the maximum posterior. (If
there are costs, summarized in a cost matrix, these can be incorporated as well.)
There are three sources of classification error in our final system:
Bayes or Indistinguishability Error: The error due to overlapping densities p(x|ωi)
for different values of i. This error is an inherent property of the problem and can never
be eliminated.
Model Error: The error due to having an incorrect model. This error can only be
eliminated if the designer specifies a model that includes the true model which
generated the data. Designers generally choose the model based on knowledge of the
problem domain rather than on the subsequent estimation method, and thus the model
error in maximum-likelihood and Bayes methods rarely differs.
Estimation Error: The error arising from the fact that the parameters are estimated
from a finite sample. This error can best be reduced by increasing the amount of training data.
Noninformative Priors and Invariance
In a Bayesian framework we can have a "noninformative" prior over a parameter for a single
category's distribution. Suppose for instance that we are using Bayesian methods to infer from
data some position and scale parameters, which we denote μ and σ, which might be the mean
and standard deviation of a Gaussian, or the position and width of a triangle distribution.
Gibbs Algorithm
An alternative approximation is to pick a single parameter vector ϴ according to p(ϴ|D) and use
it as if it were the true value. Given weak assumptions, this so-called Gibbs algorithm gives a
misclassification error that is at most twice the expected error of the Bayes optimal classifier.
PROBLEMS OF DIMENSIONALITY
In practical multicategory applications, it is not at all unusual to encounter problems involving
fifty or a hundred features, particularly if the features are binary valued.
1. Accuracy, Dimension, and Training Sample Size
If the features are statistically independent, there are some theoretical results that suggest the
possibility of excellent performance. For example, consider the two-class multivariate normal
case with the same covariance, i.e., where p(x|ωj) ~ N(μj, ∑), j = 1, 2. If the prior probabilities
are equal, then it is not hard to show that the Bayes error rate is given by:

P(e) = (1/√(2π)) ∫ from r/2 to ∞ e^(−u²/2) du

where r² is the squared Mahalanobis distance:

r² = (μ1 − μ2)t ∑⁻¹ (μ1 − μ2)

Thus, the probability of error decreases as r increases, approaching zero as r approaches
infinity. In the conditionally independent case, ∑ = diag(σ1², …, σd²), and the squared
distance becomes

r² = Σi=1..d ((μi1 − μi2) / σi)²
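A short numerical sketch of these formulas (the means and covariance are illustrative assumptions; it uses the identity that the tail integral above equals 0.5·erfc(r/(2√2))):

```python
import numpy as np
from math import erfc, sqrt

def bayes_error(mu1, mu2, cov):
    """Equal-prior two-class Gaussian Bayes error from the Mahalanobis distance r."""
    diff = np.asarray(mu1) - np.asarray(mu2)
    r = sqrt(diff @ np.linalg.inv(cov) @ diff)   # r^2 = (mu1-mu2)^t Sigma^-1 (mu1-mu2)
    return 0.5 * erfc(r / (2 * sqrt(2)))         # = (1/sqrt(2*pi)) * integral_{r/2}^inf e^{-u^2/2} du

cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])
print(bayes_error([0.0, 0.0], [2.0, 1.0], cov))  # error shrinks toward 0 as r grows
```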
2. Computational Complexity
First, we will need to understand the notion of the order of a function f(x): we say that f(x)
is "of the order of h(x)"—written f(x) = O(h(x)) and generally read "big oh of h(x)"—if there
exist constants c and x0 such that |f(x)| ≤ c|h(x)| for all x > x0. This means simply that for
sufficiently large x, an upper bound on the function grows no worse than h(x). For instance,
suppose f(x) = a0 + a1x + a2x²; in that case we have f(x) = O(x²) because for sufficiently
large x the constant, linear, and quadratic terms can be "overcome" by proper choice of c and
x0. The generalization to functions of two or more variables is straightforward. It should be
clear that by the definition above, the big oh order of a function is not unique. For instance, we
can describe our particular f(x) as being O(x²), O(x³), O(x⁴), or O(x² ln x).
3. Overfitting
One possibility is to reduce the dimensionality, either by redesigning the feature extractor, by
selecting an appropriate subset of the existing features, or by combining the existing features
in some way.
Figure: The "training data" (black dots) were selected from a quadratic function plus Gaussian
noise, i.e., f(x) = ax² + bx + c + ε where p(ε) ~ N(0, σ²). The 10th-degree polynomial shown
fits the data perfectly, but we desire instead the second-order function f(x), because it would
lead to better predictions for new samples.
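A minimal numpy sketch of this effect (the coefficients and noise level are illustrative): fit the same noisy quadratic data with degree-2 and degree-10 polynomials and compare predictions at a point that was not in the training set:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 11)
y = 2 * x**2 - x + 1 + rng.normal(scale=0.1, size=x.shape)  # quadratic plus Gaussian noise

p2 = np.polyfit(x, y, deg=2)     # matches the true model order; generalizes well
p10 = np.polyfit(x, y, deg=10)   # fits all 11 training points exactly, noise included

x_new = 0.95                     # a point near, but not in, the training set
print(np.polyval(p2, x_new))     # close to the true 2*x^2 - x + 1
print(np.polyval(p10, x_new))    # typically erratic: the overfit curve wiggles between samples
```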
COMPONENT ANALYSIS AND DISCRIMINANTS
One approach to coping with the problem of excessive dimensionality is to reduce the
dimensionality by combining features. Linear combinations are particularly attractive because
they are simple to compute and analytically tractable. In effect, linear methods project the high-
dimensional data onto a lower dimensional space.
There are three classical approaches to finding effective linear transformations.
1. Principal Component Analysis or PCA—seeks a projection that best represents the
data in a least-squares sense.
2. Fisher Linear Discriminant or FLD—finds the linear combination of features that
best separates two or more classes of data.
3. Multiple Discriminant Analysis or MDA—seeks a projection that best separates the
data in a least-squares sense.
1. Principal Component Analysis (PCA) is a dimensionality reduction technique used to
simplify complex datasets while preserving as much variability (information) as possible.
Purpose of PCA
Reduce the number of features (variables) in a dataset while retaining the most
important patterns.
Remove redundancy caused by correlated variables.
Help with visualization when data has many dimensions (e.g., reducing 50 features to
2 or 3).
Speed up machine learning algorithms by working with fewer variables.
PCA finds new axes (called principal components) that:
Are linear combinations of the original features.
Capture the maximum variance in the data.
Are uncorrelated with each other.
Steps of PCA
1. Standardize the Data
o PCA is affected by scale, so data is often standardized to have mean 0 and
variance 1.
2. Compute the Covariance Matrix
o Measures how variables vary together.
o Example: If height and weight are correlated, their covariance will be high.
3. Find Eigenvalues and Eigenvectors
o Eigenvectors → directions of new feature axes (principal components).
o Eigenvalues → how much variance each principal component captures.
4. Sort Components by Variance
o Keep the top k principal components that explain most of the variance.
5. Transform the Data
o Project the original data onto the selected principal components.
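The five steps above can be sketched directly with numpy's eigendecomposition (a minimal illustration; the simulated height/weight data and the function name are assumptions for the example):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # 1. standardize
    cov = np.cov(Xs, rowvar=False)                   # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]                # 4. sort by variance explained
    W = eigvecs[:, order[:k]]
    return Xs @ W                                    # 5. project the data

rng = np.random.default_rng(3)
height = rng.normal(170, 10, 100)
weight = 0.9 * height + rng.normal(0, 5, 100)        # two correlated features
Z = pca(np.column_stack([height, weight]), k=1)      # keep PC1 ("body size")
print(Z.shape)                                       # (100, 1)
```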
Example
Imagine you have two features:
Height
Weight
Both are correlated. PCA will:
Create PC1 (largest variance direction – maybe “body size”).
Create PC2 (smaller variance direction – maybe “slenderness”).
You can drop PC2 if it explains very little variance.
When to Use PCA
Large datasets with many features.
High correlation among variables.
You need faster computation or better visualization.
Preprocessing before machine learning to avoid overfitting.
2. Fisher Linear Discriminant or FLD
FLD is also known as Linear Discriminant Analysis (LDA) in the two-class case, but its goal
is different from that of PCA.
Purpose of FLD
PCA: Finds directions that capture maximum variance (unsupervised — ignores class
labels).
FLD: Finds a direction that best separates two or more classes (supervised — uses
class labels).
Main use: Dimensionality reduction while maximizing class separability.
FLD projects data onto a line so that:
The distance between class means is as large as possible.
The variance within each class is as small as possible.
This is measured by the Fisher criterion:
J(w) = (m1 − m2)² / (s1² + s2²)
where:
m1, m2 are the projected means of the two classes.
s1², s2² are the projected variances.
Steps for Two-Class FLD
1. Compute class means μ1,μ2 in the original space.
2. Compute scatter within each class (within-class scatter matrices S1,S2).
3. Compute the total within-class scatter matrix:
SW = S1 + S2
4. Find the projection vector:
w = SW⁻¹ (μ1 − μ2)
5. Project the data:
y = wt x
6. Classify:
o Choose a threshold (e.g., midpoint of projected means).
o Points on one side → Class 1, other side → Class 2.
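A minimal sketch of these steps for two simulated classes (the data and the function name are illustrative):

```python
import numpy as np

def fisher_ld(X1, X2):
    """Two-class FLD: w = S_W^{-1}(mu1 - mu2), plus a midpoint threshold."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)              # within-class scatter, class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)              # within-class scatter, class 2
    w = np.linalg.solve(S1 + S2, mu1 - mu2)     # solves S_W w = (mu1 - mu2)
    threshold = 0.5 * (w @ mu1 + w @ mu2)       # midpoint of the projected means
    return w, threshold

rng = np.random.default_rng(4)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))
X2 = rng.normal([3, 2], 1.0, size=(50, 2))
w, t = fisher_ld(X1, X2)
pred_class1 = (X1 @ w) > t                      # project y = w^t x, then threshold
print(pred_class1.mean())                       # fraction of class 1 on its own side
```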
Example Intuition
Imagine two classes in 2D: red points and blue points.
PCA might pick the longest variance direction — which could cut through the classes
and not separate them well.
FLD finds the line that best separates red and blue when projected into 1D.
3. Multiple Discriminant Analysis or MDA
Multiple Discriminant Analysis (MDA) is essentially the multi-class generalization of
Fisher’s Linear Discriminant (FLD).
It’s also commonly known as Linear Discriminant Analysis (LDA) in statistics and machine
learning when there are more than two classes.
FLD works for two classes — one projection vector w separates them.
MDA works for k classes — it finds up to k−1 new axes (discriminant functions) that
maximize overall class separability.
It is supervised dimensionality reduction — reduces data from d-dimensions to at
most k−1 dimensions while preserving class discrimination.
Core Idea
Instead of just maximizing the separation between two means, MDA:
Maximizes the separation between all class means.
Minimizes the variance within each class.
Uses the between-class scatter SB and within-class scatter SW.
EXPECTATION-MAXIMIZATION
The Expectation-Maximization (EM) algorithm is an iterative method used to find the
maximum likelihood or maximum a posteriori estimates of parameters in statistical models,
particularly when dealing with hidden or incomplete data. It's frequently applied in machine
learning, especially in situations where some data points are missing or unobserved (latent
variables).
The EM algorithm is used to find the best parameters of a statistical model when:
Some data is missing, hidden, or latent.
The likelihood function is hard to optimize directly.
It’s widely used for:
Clustering (e.g., Gaussian Mixture Models).
Missing data imputation.
Hidden Markov Models.
EM alternates between:
1. E-step (Expectation) → Estimate missing/hidden variables given the current
parameter guesses.
2. M-step (Maximization) → Re-estimate parameters by maximizing the expected log-
likelihood from the E-step.
This process repeats until convergence.
Why Not Optimize Directly?
If we knew the missing/hidden values, parameter estimation would be easy (just like
normal maximum likelihood estimation).
If we knew the parameters, inferring hidden values would be easy.
But when both are unknown, EM solves it iteratively.
Step-by-Step (Generic EM)
Given:
Observed data X
Hidden variables Z
Model parameters θ
Step 1 — Initialize Parameters: Choose initial guess for parameters θ(0).
Step 2 — E-Step (Expectation): Compute:

Q(θ|θ(t)) = E[ln p(X, Z|θ) | X, θ(t)]

This means: form the expected complete-data log-likelihood, averaging over the hidden
variables Z using the current parameters θ(t).
Step 3 — M-Step (Maximization):
Update parameters by:

θ(t+1) = argmax over θ of Q(θ|θ(t))

This means: choose the parameters that maximize the expected log-likelihood from the E-step.
Step 4 — Repeat:
Alternate E-step and M-step until parameters stop changing significantly (convergence).
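As a concrete illustration, here is a minimal EM sketch for a two-component one-dimensional Gaussian mixture with equal weights and known unit variances, so only the two means are estimated (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])  # hidden: which component

mu = np.array([-1.0, 1.0])                       # initial guesses for the two means
for _ in range(50):
    # E-step: responsibility of each component for each point (equal weights cancel)
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate each mean as a responsibility-weighted average
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # converges near the true means (-2, 3)
```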
HIDDEN MARKOV MODEL (HMM)
It is a statistical model that is used to describe the probabilistic relationship between a sequence
of observations and a sequence of hidden states.
An HMM consists of two types of variables: hidden states and observations.
o The hidden states are the underlying variables that generate the observed data,
but they are not directly observable.
o The observations are the variables that are measured and observed.
The hidden Markov model is widely used in speech recognition, natural language
modelling, on-line handwriting recognition, and the analysis of biological sequences
such as proteins and DNA.
The Hidden Markov Model (HMM) describes the relationship between the hidden states and
the observations using two sets of probabilities: the transition probabilities and the
emission probabilities.
o The transition probabilities describe the probability of transitioning from one
hidden state to another.
o The emission probabilities describe the probability of observing an output
given a hidden state.
Hidden Markov Model Algorithm
• Set of hidden states: {s1, s2, …, sN}
• The process moves from one state to another, generating a sequence of states:
si1, si2, …, sik, …
• Set of observations: {o1, o2, …, ok}
• Markov chain property: the probability of each subsequent state depends only on
the previous state: P(sik | si1, si2, …, sik−1) = P(sik | sik−1)
• To define a hidden Markov model, the following probabilities have to be specified:
• transition probabilities aij = P(si | sj)
• initial probabilities πi = P(si)
• emission probabilities P(oi | si)
Example of Hidden Markov Model:
• Two states: ‘Low’ and ‘High’ atmospheric pressure.
• Two observations: ‘Rain’ and ‘Dry’.
• Transition probabilities:
P(‘Low’|‘Low’)=0.3
P(‘High’|‘Low’)=0.7
P(‘Low’|‘High’)=0.2
P(‘High’|‘High’)=0.8
• Emission/Observation probabilities:
P(‘Rain’|‘Low’)=0.6
P(‘Dry’|‘Low’)=0.4
P(‘Rain’|‘High’)=0.4
P(‘Dry’|‘High’)=0.6
• Initial probabilities: say
P(‘Low’)=0.4
P(‘High’)=0.6
Calculation of observation sequence probability:
• Suppose we want to calculate a probability of a sequence of observations in our
example, {‘Dry’,’Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,’Rain’} ) = P({‘Dry’,’Rain’} , {‘Low’,’Low’}) + P({‘Dry’,’Rain’} , {‘Low’,’High’}) +
P({‘Dry’,’Rain’} , {‘High’,’Low’}) + P({‘Dry’,’Rain’} , {‘High’,’High’})
where the first term is:
P({‘Dry’,’Rain’} , {‘Low’,’Low’}) = P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’})
= P(‘Dry’|’Low’) P(‘Rain’|’Low’) P(‘Low’) P(‘Low’|’Low’)
= 0.4 × 0.6 × 0.4 × 0.3 = 0.0288
Similarly, calculate the remaining three joint terms and sum all four to obtain
P({‘Dry’,’Rain’}). (If instead we want the most likely hidden state sequence, we select
the maximum of these joint probabilities rather than summing them.)
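This enumeration over hidden state sequences can be verified with a short brute-force sketch (the dictionary layout is an illustrative choice):

```python
from itertools import product

# Brute-force evaluation of P({'Dry','Rain'}) for the Low/High example above.
init = {'Low': 0.4, 'High': 0.6}
trans = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,     # keys are (from, to)
         ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
emit = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
        ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}

obs = ['Dry', 'Rain']
total, best = 0.0, (None, 0.0)
for states in product(['Low', 'High'], repeat=len(obs)):   # all 4 hidden sequences
    p = init[states[0]] * emit[(states[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(states[t - 1], states[t])] * emit[(states[t], obs[t])]
    total += p                                              # evaluation: sum over paths
    if p > best[1]:
        best = (states, p)                                  # decoding: max over paths

print(total)  # P({'Dry','Rain'}); the Low,Low term is 0.4*0.4*0.3*0.6 = 0.0288
print(best)   # most likely hidden state sequence and its joint probability
```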
If we had the hidden states, we could calculate the probability of the sequence directly. But we
do not have the states; all we have is a sequence of observations.
Problem Associated with HMM model:
Problem 1 (Evaluation): Given an HMM and a sequence of observations, what is the
probability that the observations were generated by the model? The Forward-Backward
algorithm solves this problem.
Problem 2 (Decoding): Given an HMM and a sequence of observations, what is the
most likely state sequence in the model that produced the observations? The Viterbi
algorithm solves this problem.
Problem 3 (Learning): Given an observation sequence, how should the model
parameters be adjusted to maximize the probability of the observation sequence? The
Baum-Welch algorithm solves this problem.