Labman 2
Lab session 2 :
Introduction to Hidden Markov Models
Think !
If you get lost with some of the questions or some of the explanations, DO ASK the assistants or
the teacher for help : they are here to help you understand the course. There is no such thing as a stupid
question, and the only obstacle to knowledge is laziness.
Contents
1 Preamble
2 Generating samples from Hidden Markov Models
3 Pattern recognition with HMMs
4 Optimal state sequence
5 Training of HMMs
1 Preamble
Useful formulas and definitions :
- a Markov chain or process is a sequence of events, usually called states, the probability of each of
which is dependent only on the event immediately preceding it.
- a Hidden Markov Model (HMM) represents stochastic sequences as Markov chains where the states
are not directly observed, but are associated with a probability density function (pdf). The generation
of a random sequence is then the result of a random walk in the chain (i.e. the traversal of a
random sequence of states Q = {q1 , · · · qK }) and of a draw (called an emission) at each visit of a
state.
The sequence of states, which is the quantity of interest in speech recognition and in most other
pattern recognition problems, can be observed only through the stochastic processes defined
within each state (i.e. you must know the parameters of the pdfs of each state before being able to
associate a sequence of states Q = {q1 , · · · qK } with a sequence of observations X = {x1 , · · · xK }).
The true sequence of states is therefore hidden by a first layer of stochastic processes.
HMMs are dynamic models, in the sense that they are specifically designed to account for some
macroscopic structure of the random sequences. In the previous lab, concerned with Gaussian
Statistics and Statistical Pattern Recognition, random sequences of observations were considered
as the result of a series of independent draws in one or several Gaussian densities. To this simple
statistical modeling scheme, HMMs add the specification of some statistical dependence between
the (Gaussian) densities from which the observations are drawn.
- HMM terminology :
– the emission probabilities are the pdfs that characterize each state qi , i.e. p(x|qi ). To simplify
the notations, they will be denoted bi (x). For practical reasons, they are usually Gaussian or
combinations of Gaussians, but the states could be parameterized in terms of any other kind
of pdf (including discrete probabilities and artificial neural networks).
– the transition probabilities give the probability of going from a state qi to a state qj , i.e. P (qj |qi ).
They are stored in matrices where each term aij denotes the probability P (qj |qi ).
– non-emitting initial and final states : if a random sequence X = {x1 , · · · xK } has a finite
length K, the fact that the sequence begins or ends has to be modeled as two additional
discrete events. In HMMs, this corresponds to the addition of two non-emitting states, the
initial state and the final state. Since their role is just to model the “start” and “end” events,
they are not associated with any emission probability.
The transitions starting from the initial state correspond to the modeling of an initial state dis-
tribution P (qj |I), which indicates the probability to start the state sequence with the emitting
state qj .
The final state usually has only one non-null transition, which loops onto itself with a probability
of 1 (it is an absorbing state), so that the state sequence gets “trapped” in it when it is
reached.
– ergodic versus left-right HMMs : a HMM allowing for transitions from any emitting state to
any other emitting state is called an ergodic HMM. Conversely, an HMM where the transitions
only go from one state to itself or to a unique follower is called a left-right HMM.
The following 2-dimensional Gaussian densities will be used to model simulated vowel observations, where
the considered features are the first two formants :

Density N/a/ :  µ/a/ = [730, 1090]ᵀ,   Σ/a/ = [1625 5300; 5300 53300]
Density N/e/ :  µ/e/ = [530, 1840]ᵀ,   Σ/e/ = [15025 7750; 7750 36725]
Density N/i/ :  µ/i/ = [270, 2290]ᵀ,   Σ/i/ = [2525 1200; 1200 36125]
Density N/o/ :  µ/o/ = [570, 840]ᵀ,    Σ/o/ = [2000 3600; 3600 20000]
Density N/y/ :  µ/y/ = [440, 1020]ᵀ,   Σ/y/ = [8000 8400; 8400 18500]
(Those densities have been used in the previous lab session.) They will be combined into Markov Models
that will be used to model some observation sequences. The resulting HMMs are described in table 1.
The parameters of the densities and of the Markov models are stored in the file data.mat. A Markov
model named, e.g., hmm1 is stored as an object with fields hmm1.means, hmm1.vars and hmm1.trans, and
corresponds to the model HMM1 of table 1. The means field contains a list of mean vectors; the vars
field contains a list of variance matrices; the trans field contains the transition matrix. E.g., to access
the mean of the 3rd state of hmm1, use :
>> hmm1.means{3}
The initial and final states are characterized by an empty mean and variance value.
HMM3 : a left-right model I → N/a/ → N/i/ → N/y/ → F
• state 1 : initial state (I)
• state 2 : Gaussian N/a/
• state 3 : Gaussian N/i/
• state 4 : Gaussian N/y/
• state 5 : final state (F)
Transition matrix (self-loop and forward probabilities of 0.5 for the emitting states) :
0.0 1.0 0.0 0.0 0.0
0.0 0.5 0.5 0.0 0.0
0.0 0.0 0.5 0.5 0.0
0.0 0.0 0.0 0.5 0.5
0.0 0.0 0.0 0.0 1.0
2 Generating samples from Hidden Markov Models
Generate a sample X coming from the Hidden Markov Models HMM1, HMM2 and HMM4. Use the
function genhmm (>> help genhmm) to do several draws with each of these models. View the resulting
samples and state sequences with the help of the functions plotseq and plotseq2.
Example :
Do a draw :
>> [X,stateSeq] = genhmm(hmm1);
Use the functions plotseq and plotseq2 to picture the obtained 2-dimensional data. In the resulting
views, the obtained sequences are represented by a yellow line where each point is overlaid with a colored
dot. The different colors indicate the state from which any particular point has been drawn.
>> figure; plotseq(X,stateSeq); % View of both dimensions as separate sequences
This view highlights the notion of sequence of states associated with a sequence of sample points.
>> figure; plotseq2(X,stateSeq,hmm1); % 2D view of the resulting sequence,
% with the location of the Gaussian states
This view highlights the spatial distribution of the sample points.
Draw several new samples with the same parameters and visualize them :
>> clf; [X,stateSeq] = genhmm(hmm1); plotseq(X,stateSeq);
(To be repeated several times.)
Repeat with another model :
>> [X,stateSeq] = genhmm(hmm2);plotseq(X,stateSeq);
and re-iterate the experiment. Also re-iterate with model HMM4.
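Under the hood, a generator such as genhmm amounts to a random walk in the chain followed by one Gaussian draw per visited emitting state. The following Python sketch illustrates this assumed behavior (it is not the lab's genhmm, which is a Matlab function; the array layout with the initial state at index 0 and the final state at index N-1 is an assumption of the sketch) :

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_hmm(means, covs, trans):
    """Random walk in the chain + one emission per visited emitting state.

    means, covs : per-state Gaussian parameters (None for the non-emitting
                  initial state 0 and final state N-1).
    trans       : (N, N) transition matrix whose rows sum to 1.
    """
    n = trans.shape[0]
    states, samples = [], []
    q = 0                                   # start in the initial state
    while True:
        q = rng.choice(n, p=trans[q])       # one step of the random walk
        if q == n - 1:                      # absorbing final state: stop
            break
        states.append(q)
        samples.append(rng.multivariate_normal(means[q], covs[q]))
    return np.array(samples), np.array(states)
```

With a left-right transition matrix such as the one of HMM3 and the densities N/a/, N/i/, N/y/ given above, the returned state sequence always starts in the N/a/ state, ends in the N/y/ state, and has a random length governed by the 0.5 self-loop probabilities.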
Questions :
2. What is the effect of the different transition matrices on the sequences obtained during the current
experiment ? Hence, what is the role of the transition probabilities in the Markovian modeling
framework ?
3. What would happen if the model had no final state ?
4. In the case of HMMs with plain Gaussian emission probabilities, what quantities should be present
in the complete parameter set Θ that specifies a particular model ?
If the model is ergodic with N states (including the initial and final), and represents data of
dimension D, what is the total number of parameters in Θ ?
5. Which type of HMM (ergodic or left-right) would you use to model words ?
Answers :
2. The transition matrix of HMM1 indicates that the probability of staying in a particular state is close
to the probability of transiting to another state. Hence, it allows for frequent jumps from one state
to any other state. The observation variable therefore frequently jumps from one “phoneme” to any
other, forming sharply changing sequences like /a,i,a,y,y,i,a,y,y,· · · /.
In contrast, the transition matrix of HMM2 specifies high probabilities of staying in a particular
state. Hence, it allows for more “stable” sequences, like /a,a,a,y,y,y,i,i,i,i,i,y,y,· · · /.
Finally, the transition matrix of HMM4 also governs the order in which the states are visited :
the given probabilities force the observation variable to go through /a/, then to go through /i/, and
finally to stay in /y/, e.g. /a,a,a,a,i,i,i,y,y,y,y,· · · /.
Hence, the role of the transition probabilities is to introduce a temporal (or spatial) structure in
the modeling of random sequences.
Furthermore, the obtained sequences have variable lengths : the transition probabilities implicitly
model a variability in the duration of the sequences. As a matter of fact, different speakers or
different speaking conditions introduce a variability in the phoneme or word durations. In this
respect, HMMs are particularly well adapted to speech modeling.
3. If we didn’t have a final state, the observation variable would wander from state to state indefinitely,
and the model would necessarily correspond to sequences of infinite length.
4. In the case of HMMs with Gaussian emission probabilities, the parameter set Θ comprises :
• the transition probabilities aij ;
• the parameters of the Gaussian densities characterizing each state, i.e. the means µ i and the
variances Σi .
The initial state distribution is sometimes modeled as an additional parameter instead of being
represented in the transition matrix.
In the case of an ergodic HMM with N states (N − 2 of which are emitting) and Gaussian emission
probabilities, we have :
• (N − 2) × (N − 2) transitions, plus (N − 2) initial state probabilities and (N − 2) probabilities
to go to the final state;
• (N − 2) emitting states where each pdf is characterized by a D dimensional mean and a D × D
covariance matrix.
Hence, in this case, the total number of parameters is (N − 2) × (N + D × (D + 1)). Note that this
number grows quadratically with the number of states and with the dimension of the data.
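As a sanity check, the count above can be written as a small function (Python, for illustration only; the formula, not the function, is what the lab uses) :

```python
def n_hmm_params(n_states, dim):
    """Parameter count of an ergodic HMM with plain Gaussian emissions.

    n_states includes the two non-emitting states; dim is the data
    dimension D. The terms mirror the enumeration above:
    (N-2)^2 transitions plus 2(N-2) initial/final probabilities,
    plus (N-2) means of size D and (N-2) covariances of size D*D.
    """
    m = n_states - 2                       # number of emitting states
    transitions = m * m + 2 * m            # = (N-2) * N
    gaussians = m * (dim + dim * dim)      # = (N-2) * D * (D+1)
    return transitions + gaussians         # = (N-2) * (N + D*(D+1))
```

For a 5-state, 2-dimensional model like those of this lab, this gives 3 × (5 + 2 × 3) = 33 parameters.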
5. Words are made of ordered sequences of phonemes : /h/ is followed by /e/ and then by /l/ in the
word “hello”. Each phoneme can in turn be considered as a particular random process (possibly
Gaussian). This structure can be adequately modeled by a left-right HMM.
In “real world” speech recognition, the phonemes themselves are often modeled as left-right HMMs
rather than plain Gaussian densities (e.g. to model separately the attack, then the stable part of
the phoneme and finally the “end” of it). Words are then represented by large HMMs made of
concatenations of smaller phonetic HMMs.
3 Pattern recognition with HMMs
3.1 Likelihood of a sequence given a HMM
- likelihood of an observation sequence X along a path Q :
p(X|Q, Θ) = ∏(i=1..T) p(xi |qi , Θ) = b1 (x1 ) · b2 (x2 ) · · · bT (xT )
i.e. it is the product of the emission probabilities computed along the considered path.
In the previous lab, we learned how to compute the likelihood of a single observation with
respect to a Gaussian model. This method can be applied here, for each term xi , if the states
contain Gaussian pdfs.
- Joint likelihood of an observation sequence X and a path Q : it consists in the probability that
X and Q occur simultaneously, p(X, Q|Θ), and decomposes into a product of the two quantities
defined previously :
p(X, Q|Θ) = p(X|Q, Θ) P (Q|Θ)    (Bayes)
- likelihood of an observation sequence X given the model Θ : it is obtained by summing over all the
possible paths,
p(X|Θ) = Σ(over all paths Q) p(X, Q|Θ)
i.e. it is the sum of the joint likelihoods of the sequence over all the possible state sequences allowed by
the model.
- the forward recursion : in practice, the enumeration of every possible state sequence is infeasible.
Nevertheless, p(X|Θ) can be computed in a recursive way by the forward recursion. This algorithm
defines a forward variable αt (i) corresponding to :
αt (i) = p(x1 , x2 , · · · , xt , qit |Θ)
i.e. αt (i) is the probability of having observed the partial sequence {x1 , x2 , · · · , xt } and being in
the state i at time t (event denoted qit in the course), given the parameters Θ. For an HMM with N
states (where states 1 and N are the non-emitting initial and final states, and states 2 · · · N − 1 are
emitting), αt (i) can be computed recursively as follows :
1. Initialization
α1 (i) = a1i · bi (x1 ),    2 ≤ i ≤ N − 1
where a1i are the transitions from the initial non-emitting state to the emitting states with pdfs
bi (x), i = 2 · · · N − 1. Note that b1 (x) and bN (x) do not exist since they correspond to the non-emitting
initial and final states.
2. Recursion
αt+1 (j) = [ Σ(i=2..N −1) αt (i) · aij ] · bj (xt+1 ),    1 ≤ t ≤ T − 1,  2 ≤ j ≤ N − 1
3. Termination
p(X|Θ) = Σ(i=2..N −1) αT (i) · aiN
i.e. at the end of the observation sequence, sum the probabilities of the paths converging to
the final state (state number N ).
(For more detail about the forward procedure, refer to [RJ93], chap.6.4.1).
This procedure raises a very important implementation issue. As a matter of fact, the computation
of the αt vector consists of products of a large number of values that are less than 1 (in general,
significantly less than 1). Hence, after a few observations (t ≈ 10), the values of αt tend exponentially
to 0, and the floating point arithmetic precision is exceeded (even in the case of double precision
arithmetic). Two solutions exist for this problem. One consists in scaling the αt values and undoing
the scaling at the end of the procedure : see [RJ93] for more explanations. The other solution consists
in using log-likelihoods and log-probabilities, and computing log p(X|Θ) instead of p(X|Θ).
Questions :
1. The following formula can be used to compute the log of a sum given the logs of the sum’s arguments :
log(a + b) = f (log a, log b) = log a + log(1 + e^(log b−log a) )
Demonstrate it. When is this version numerically preferable to the symmetric one obtained by
swapping a and b ?
2. Express the log version of the forward recursion. (Don’t fully develop the log of the sum in the
recursion step, just call it “logsum” : log Σ(i=1..N ) xi ↦ logsum(i=1..N ) (log xi ).) In addition to
the arithmetic precision issues, what are the other computational advantages of the log version ?
3.2 Bayesian classification

Question :
The forward recursion allows us to compute the likelihood of an observation sequence with respect to
a HMM. Hence, given a sequence of features, we are able to find the most likely generative model in a
Maximum Likelihood sense. What additional quantities and assumptions do we need to perform a true
Bayesian classification rather than a Maximum Likelihood classification of the sequences ?
Which additional condition makes the result of Bayesian classification equivalent to the result of ML
classification ?

Answer :
To perform a Bayesian classification, we need the prior probabilities P (Θi |Θ) of each model. In addition,
we can assume that all the observation sequences are equi-probable :
P (Θi |X, Θ) = p(X|Θi , Θ) P (Θi |Θ) / p(X|Θ) ∝ p(X|Θi ) P (Θi )
P (Θi ) can be determined by counting the frequency of occurrence of each model (word or phoneme) in a
database covering the vocabulary to recognize (see lab session 4).
If every model has the same prior probability, then Bayesian classification becomes equivalent to ML
classification.
Answers :
1. Demonstration : a = e^(log a) ; b = e^(log b) , hence
a + b = e^(log a) + e^(log b) = e^(log a) · (1 + e^(log b−log a) )
log(a + b) = log a + log(1 + e^(log b−log a) )    QED.
The computation of the exponential overflows double precision arithmetic for large input values (≈ 700)
earlier than for small ones. Similarly, implementations of the exponential are generally more precise
for small input values than for large ones (since an error on the input term is exponentially amplified).
Hence, if log a > log b, the first version (log(a + b) = log a + log(1 + e^(log b−log a) )) is more precise,
since in this case (log b − log a) is small. If log a < log b, it is better to swap the terms (i.e. to use
the symmetric version log(a + b) = log b + log(1 + e^(log a−log b) )).
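For illustration, the resulting “logsum” helper looks as follows in Python (the lab code itself is in Matlab; the generalization to N terms simply factors out the largest one, which is exactly the swapping rule above) :

```python
import numpy as np

def logsum(log_terms):
    """log(sum_i exp(log_terms[i])) computed without leaving the log domain.

    Factoring out the largest term makes every exponential argument
    non-positive, so the sum can neither overflow nor lose the dominant
    term's precision.
    """
    log_terms = np.asarray(log_terms, dtype=float)
    m = log_terms.max()                     # log of the dominant term
    return m + np.log(np.exp(log_terms - m).sum())
```

For instance, logsum([-1000, -1000]) returns −1000 + log 2, whereas naively exponentiating would underflow to 0 and give −inf.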
2. (a) Initialization
α1(log) (i) = log a1i + log bi (x1 ),    2 ≤ i ≤ N − 1
(b) Recursion
αt+1(log) (j) = logsum(i=2..N −1) [ αt(log) (i) + log aij ] + log bj (xt+1 ),    1 ≤ t ≤ T − 1,  2 ≤ j ≤ N − 1
(c) Termination
log p(X|Θ) = logsum(i=2..N −1) [ αT(log) (i) + log aiN ]
In addition to the precision issues, this version transforms the products into sums, which is
computationally more efficient. Furthermore, if the emission probabilities are Gaussians, computing
the log-likelihoods log(bj (xt )) eliminates the computation of the Gaussians’ exponential (see lab
session 4).
These two points show that once the theoretical barrier is crossed in the study of a particular
statistical model, the importance of implementation issues must not be neglected.
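Putting the three steps together, the log version of the forward recursion can be sketched in Python (illustration only : this is not the lab's logfwd, and the array layout with precomputed log bj (xt ) values is an assumption of the sketch) :

```python
import numpy as np

def log_forward(log_b, log_A):
    """Log-domain forward recursion (sketch; not the lab's logfwd).

    log_b : (T, M) array of log b_j(x_t) for the M = N-2 emitting states,
            assumed precomputed from the Gaussian parameters.
    log_A : (N, N) array, element-wise log of the transition matrix,
            state 0 being the initial state and state N-1 the final one.
    Returns log p(X | Theta).
    """
    def logsum(v, axis=0):
        # stable log-sum-exp along one axis
        m = np.max(v, axis=axis)
        return m + np.log(np.sum(np.exp(v - np.expand_dims(m, axis)), axis=axis))

    T = log_b.shape[0]
    # (a) Initialization: alpha_1(i) = log a_1i + log b_i(x_1)
    log_alpha = log_A[0, 1:-1] + log_b[0]
    # (b) Recursion: alpha_{t+1}(j) = logsum_i[alpha_t(i) + log a_ij] + log b_j(x_{t+1})
    for t in range(1, T):
        log_alpha = logsum(log_alpha[:, None] + log_A[1:-1, 1:-1]) + log_b[t]
    # (c) Termination: paths converging to the final state N
    return logsum(log_alpha + log_A[1:-1, -1])
```

On a toy 3-state model with a single emitting state, the result can be checked by hand against the unique possible path.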
3.3 Maximum Likelihood classification
Experiment :
Classify the sequences X1 , X2, · · · X6 , given in the file data.mat, in a maximum likelihood sense with
respect to the six Markov models given in table 1. Use the function logfwd to compute the log-forward
recursion expressed in the previous section. Store the results in a matrix (they will be used in the next
section) and note them in the table below.
Example :
>> plot(X1(:,1),X1(:,2));
>> logProb(1,1) = logfwd(X1,hmm1)
>> logProb(1,2) = logfwd(X1,hmm2)
etc.
>> logProb(3,2) = logfwd(X3,hmm2)
etc.
Filling the logProb matrix can be done automatically with the help of loops, for instance :
>> seqs = {X1,X2,X3,X4,X5,X6}; models = {hmm1,hmm2,hmm3,hmm4,hmm5,hmm6};
>> for i=1:6, for j=1:6, logProb(i,j) = logfwd(seqs{i},models{j}); end, end
Sequence   log p(X|Θ1 )   log p(X|Θ2 )   log p(X|Θ3 )   log p(X|Θ4 )   log p(X|Θ5 )   log p(X|Θ6 )   Most likely model
X1
X2
X3
X4
X5
X6
Answer :
X1 → HMM1, X2 → HMM3, X3 → HMM5, X4 → HMM4, X5 → HMM6, X6 → HMM2.
4 Optimal state sequence
The Viterbi algorithm finds the best alignment between a model and an observation sequence by
introducing two variables :
• the highest likelihood δt (i) along a single path among all the paths ending in state i at time t :
δt (i) = max(q1 ,··· ,qt−1 ) p(q1 , · · · , qt−1 , qit , x1 , · · · , xt |Θ)
• a variable ψt (i) which makes it possible to keep track of the “best path” ending in state i at time t
(i.e. the index of the best predecessor of state i).
Note that these variables are vectors of (N − 2) elements, (N − 2) being the number of emitting states.
With the help of these variables, the algorithm takes the following steps :
1. Initialization
δ1 (i) = a1i · bi (x1 ),    ψ1 (i) = 0,    2 ≤ i ≤ N − 1
where, again, a1i are the transitions from the initial non-emitting state to the emitting states with
pdfs bi (x), i = 2 · · · N − 1, and where b1 (x) and bN (x) do not exist since they correspond to the
non-emitting initial and final states.
2. Recursion
δt+1 (j) = max(2≤i≤N −1) [δt (i) · aij ] · bj (xt+1 ),    1 ≤ t ≤ T − 1,  2 ≤ j ≤ N − 1
ψt+1 (j) = arg max(2≤i≤N −1) [δt (i) · aij ],    1 ≤ t ≤ T − 1,  2 ≤ j ≤ N − 1
“Optimal policy is composed of optimal sub-policies” : find the path that leads to a maximum likeli-
hood considering the best likelihood at the previous step and the transitions from it; then multiply
by the current likelihood given the current state. Hence, the best path is found by induction.
3. Termination
p∗ (X|Θ) = max(2≤i≤N −1) [δT (i) · aiN ]
Find the best likelihood when the end of the observation sequence is reached, given that the final
state is the non-emitting state N .
4. Backtracking
qT∗ = arg max(2≤i≤N −1) [δT (i) · aiN ] ,    then    qt∗ = ψt+1 (qt+1∗ ) for t = T − 1, · · · , 1
Hence, the Viterbi algorithm delivers two useful results, given an observation sequence X = {x 1 , · · · , xT }
and a model Θ :
• the selection, among all the possible paths in the considered model, of the best path Q ∗ = {q1∗ , · · · , qT∗ },
which corresponds to the state sequence giving a maximum of likelihood to the observation se-
quence X;
• the likelihood along the best path, p(X, Q∗ |Θ) = p∗ (X|Θ). As opposed to the forward procedure,
where all the possible paths are considered, the Viterbi algorithm computes a likelihood along the
best path only.
(For more detail about the Viterbi algorithm, refer to [RJ93], chap.6.4.1).
Questions :
1. From an algorithmic point of view, what is the main difference between the computation of the δ
variable in the Viterbi algorithm and that of the α variable in the forward procedure ?
2. Express the log version of the Viterbi algorithm.
Answers :
1. The sums that were appearing in the computation of α become max operations in the computation
of δ. Hence, the Viterbi procedure takes less computational power than the forward recursion.
2. (a) Initialization
δ1(log) (i) = log a1i + log bi (x1 ),    ψ1 (i) = 0,    2 ≤ i ≤ N − 1
(b) Recursion
δt+1(log) (j) = max(2≤i≤N −1) [ δt(log) (i) + log aij ] + log bj (xt+1 ),    1 ≤ t ≤ T − 1,  2 ≤ j ≤ N − 1
ψt+1 (j) = arg max(2≤i≤N −1) [ δt(log) (i) + log aij ],    1 ≤ t ≤ T − 1,  2 ≤ j ≤ N − 1
(c) Termination
log p∗ (X|Θ) = max(2≤i≤N −1) [ δT(log) (i) + log aiN ]
(d) Backtracking
qT∗ = arg max(2≤i≤N −1) [ δT(log) (i) + log aiN ]
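For illustration, the log Viterbi steps above translate to Python as follows (not the lab's logvit; same assumed array layout as in the forward recursion sketch) :

```python
import numpy as np

def log_viterbi(log_b, log_A):
    """Log-domain Viterbi algorithm (sketch; not the lab's logvit).

    log_b : (T, M) log emission probabilities of the M emitting states.
    log_A : (N, N) log transition matrix, with non-emitting initial (0)
            and final (N-1) states.
    Returns (best_path, log_likelihood) where best_path holds emitting-state
    indices 0..M-1 along the best alignment.
    """
    T, M = log_b.shape
    # (a) Initialization
    log_delta = log_A[0, 1:-1] + log_b[0]
    psi = np.zeros((T, M), dtype=int)
    # (b) Recursion: the logsum of the forward pass becomes a max / argmax
    for t in range(1, T):
        scores = log_delta[:, None] + log_A[1:-1, 1:-1]   # scores[i, j]
        psi[t] = np.argmax(scores, axis=0)                # best predecessor of j
        log_delta = np.max(scores, axis=0) + log_b[t]
    # (c) Termination: best transition to the final state
    final = log_delta + log_A[1:-1, -1]
    best_logp = float(np.max(final))
    # (d) Backtracking
    path = np.empty(T, dtype=int)
    path[T - 1] = int(np.argmax(final))
    for t in range(T - 1, 0, -1):
        path[t - 1] = psi[t, path[t]]
    return path, best_logp
```

On a model with a single possible path, the best-path likelihood coincides with the full forward likelihood, which gives a simple consistency check.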
Experiments :
1. Use the function logvit to find the best path of the sequences X1 , · · · X6 with respect to the
most likely model found in section 3.3 (i.e. X1 : HMM1, X2 : HMM3, X3 : HMM5, X4 : HMM4,
X5 : HMM6 and X6 : HMM2). Compare with the state sequences ST1 , · · · ST6 originally used to
generate X1 , · · · X6 (use the function compseq, which provides a view of the first dimension of the
observations as a time series, and allows to compare the original alignment to the Viterbi solution).
2. Use the function logvit to compute the probabilities of the sequences X1 , · · · X6 along the best
paths with respect to each model Θ1 , · · · Θ6 . Note your results below. Compare with the log-
likelihoods obtained in the section 3.3 with the forward procedure.
Examples :
1. Best paths and comparison with the original paths :
>> figure;
>> [STbest,bestProb] = logvit(X1,hmm1); compseq(X1,ST1,STbest);
>> [STbest,bestProb] = logvit(X2,hmm3); compseq(X2,ST2,STbest);
Repeat for the remaining sequences.
2. Probabilities along the best paths for all the models :
>> [STbest,bestProb(1,1)] = logvit(X1,hmm1);
>> [STbest,bestProb(1,2)] = logvit(X1,hmm2);
etc.
>> [STbest,bestProb(3,2)] = logvit(X3,hmm2);
etc. (You can also use loops here.)
To compare with the complete log-likelihood, issued by the forward recursion :
>> diffProb = logProb - bestProb
Sequence   log p∗ (X|Θ1 )   log p∗ (X|Θ2 )   log p∗ (X|Θ3 )   log p∗ (X|Θ4 )   log p∗ (X|Θ5 )   log p∗ (X|Θ6 )   Most likely model
X1
X2
X3
X4
X5
X6
Question :
Is the likelihood along the best path a good approximation of the real likelihood of a sequence given a
model ?
Answer :
The values found for both likelihoods differ within an acceptable error margin. Furthermore, using the
best path likelihood does not, in most practical cases, modify the classification results. Finally, it alleviates
further the computational load since it replaces the sum or the logsum by a max in the recursive part of
the procedure. Hence, the likelihood along the best path can be considered as a good approximation of the
true likelihood.
5 Training of HMMs

Decoding or aligning acoustic feature sequences requires the prior specification of the parameters of some
HMMs. As explained in section 3.3, these models play the role of stochastic templates to which we
compare the observations. But how can we determine templates that represent the phonemes or the
words efficiently ? The solution is to estimate the parameters of the HMMs from a database containing
observation sequences, in a supervised or an unsupervised way.

Questions :
In the previous lab session, we learned how to estimate the parameters of Gaussian pdfs given a set
of training data. Suppose that you have a database containing several utterances of the imaginary word
/aiy/, and that you want to train a HMM for this word. Suppose also that this database comes with a
labeling of the data, i.e. some data structures that tell you where the phoneme boundaries are for each
instance of the word.
1. Which model architecture (ergodic or left-right) would you choose ? With how many states ?
Justify your choice.
2. How would you compute the parameters of the proposed HMM ?
3. Suppose you didn’t have the phonetic labeling (unsupervised training). Propose a recursive proce-
dure to train the model, making use of one of the algorithms studied during the present session.

Answers :
1. It can be assumed that the observation sequences associated with each distinct phoneme obey specific
probability densities. As in the previous lab, this means that the phonetic classes are assumed
to be separable by Gaussian classifiers. Hence, the word /aiy/ can be assimilated to the result of
drawing samples from the pdf N/a/ , then transiting to N/i/ and drawing samples again, and finally
transiting to N/y/ and drawing samples. It therefore sounds reasonable to model the word /aiy/ by
a left-right HMM with three emitting states.
2. If we know the phonetic boundaries for each instance, we know to which state each training
observation belongs, and we can give a label (/a/, /i/ or /y/) to each feature vector. Hence, we can use the
mean and variance estimators studied in the previous lab to compute the parameters of the Gaussian
density associated with each state (or each label).
By knowing the labels, we can also count the transitions from one state to the next (itself or
another state). By dividing the number of transitions that start from a state by the total number of
transitions from this state, we can determine the transition matrix.
3. The Viterbi procedure makes it possible to distribute labels over a sequence of features. Hence, it is
possible to perform unsupervised training in the following way :
(a) Start with some arbitrary state sequences, which constitute an initial labeling. (The initial
sequences are usually made of even distributions of phonetic labels along the length of each
utterance.)
(b) Update the model, relying on the current labeling.
(c) Use the Viterbi algorithm to re-distribute the labels on the training examples.
(d) If the new distribution of labels differs from the previous one, re-iterate (go to (b)). One can
also stop when the likelihood of the training data stops improving significantly.
The principle of this algorithm is similar to the Viterbi-EM used to train the Gaussians during
the previous lab. Similarly, there exists a “soft” version, called the Baum-Welch algorithm, where
each state participates in the labeling of the feature frames (this version uses the forward recursion
instead of the Viterbi). The Baum-Welch algorithm is an EM algorithm specifically adapted to the
training of HMMs (see [RJ93] for details), and is one of the most widely used training algorithms
in “real world” speech recognition.
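The counting estimate of the transition matrix described in answer 2 can be sketched as follows (illustrative Python; the integer encoding of the /a/, /i/, /y/ labels is an assumption of the sketch) :

```python
import numpy as np

def estimate_transitions(state_sequences, n_states):
    """Estimate a transition matrix by counting labeled transitions.

    state_sequences : list of label sequences (integers 0..n_states-1),
                      one per training utterance.
    Each row of the result is the count of transitions leaving that state,
    normalized by the total number of transitions leaving it.
    """
    counts = np.zeros((n_states, n_states))
    for seq in state_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1              # count the transition i -> j
    totals = counts.sum(axis=1, keepdims=True)
    # leave all-zero rows for states that never appear in the labeling
    return np.divide(counts, totals, out=np.zeros_like(counts),
                     where=totals > 0)
```

For a left-right labeling, only the diagonal and the first upper diagonal of the estimated matrix are non-zero, as expected.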
References
[RJ93] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall,
1993.