Linear Dynamic Models For Automatic Speech Recognition
Joe Frankel
Acknowledgements
Firstly, utmost thanks to Simon King for his generosity. This thesis would never have
reached fruition without his insight and guidance, and I count myself very lucky to have
had someone with such an inspirational teaching style assume the role of my supervisor.
Thanks also to all those who make the CSTR such a stimulating and supportive environ-
ment to work in. Those deserving of a special mention include Korin Richmond for his
attention to detail and willingness to discuss/share knowledge of matters ranging from
computing to linguistics, through machine learning, onto pottery, politics and home-brew.
Also Rob Clark for always making time to sort out my linux and system-related hiccups
despite having plenty of his own work to do. Thanks to those who read all or parts of this
thesis while it was in preparation – your comments and questions have been invaluable.
Thanks to those who have given me useful feedback over the last years, or taken the
trouble to answer queries which arose at odd times. I’ve benefited from the input of Paul
Taylor, Steve Isard, Amos Storkey, Chris Williams, Sam Roweis, Zoubin Gharamani and
Gavin Smith.
Thanks to both my English and Norwegian families, whose on-going support of every
kind has made it possible to see this thesis through to completion. Also to those who
make Edinburgh and Woodcote life such a pleasure – especially the inhabitants of [the
flat known as] 2f2 who have so graciously tolerated my living in the country, camping in
town, and turning up after late nights at the office for a glass of IPA in their kitchen.
So finally to Synnøve, who has been an amazing companion over the last years...shared
with me so many good times, survived the fallout from the emotional roller-coaster that is
a PhD, and produced the lovely Eva, who I hope will not remember how busy her father
was during her first six months.
Declaration
I have composed this thesis. Unless otherwise stated, the work reported is my own.
Abstract
The majority of automatic speech recognition (ASR) systems rely on hidden Markov
models (HMM), in which the output distribution associated with each state is modelled
by a mixture of diagonal covariance Gaussians. Dynamic information is typically included
by appending time-derivatives to feature vectors. This approach, whilst successful, makes
the false assumption of framewise independence of the augmented feature vectors and
ignores the spatial correlations in the parametrised speech signal. This dissertation seeks
to address these shortcomings by exploring acoustic modelling for ASR with an application
of a form of state-space model, the linear dynamic model (LDM).
Rather than modelling individual frames of data, LDMs characterise entire segments of
speech. An auto-regressive state evolution through a continuous space gives a Markovian
model of the underlying dynamics, and spatial correlations between feature dimensions
are absorbed into the structure of the observation process. LDMs have been applied
to speech recognition before; however, a smoothed Gauss-Markov form was used which
ignored the potential for subspace modelling. The continuous dynamical state means
that information is passed along the length of each segment. Furthermore, if the state is
allowed to be continuous across segment boundaries, long range dependencies are built
into the system and the assumption of independence of successive segments is loosened.
The state provides an explicit model of temporal correlation which sets this approach
apart from frame-based and some segment-based models where the ordering of the data
is unimportant. The benefits of such a model are examined both within and between
segments.
LDMs are well suited to modelling smoothly varying, continuous, yet noisy trajec-
tories such as those found in measured articulatory data. Using speaker-dependent data from
the MOCHA corpus, the performance of systems which model acoustic, articulatory, and
combined acoustic-articulatory features are compared. As well as measured articulatory
parameters, experiments use the output of neural networks trained to perform an articu-
latory inversion mapping. The speaker-independent TIMIT corpus provides the basis for
larger scale acoustic-only experiments. Classification tasks provide an ideal means to com-
pare modelling choices without the confounding influence of recognition search errors, and
are used to explore issues such as choice of state dimension, front-end acoustic parametri-
sation and parameter initialisation. Recognition for segment models is typically more
computationally expensive than for frame-based models. Unlike in the frame-based case, it is
not always possible to share likelihood calculations for observation sequences which occur
within hypothesised segments that have different start and end times. Furthermore, the
Viterbi criterion is not necessarily applicable at the frame level. This work introduces a
novel approach to decoding for segment models in the form of a stack decoder with A∗
search. Such a scheme allows flexibility in the choice of acoustic and language models
since the Viterbi criterion is not integral to the search, and hypothesis generation is inde-
pendent of the particular language model. Furthermore, the time-asynchronous ordering
of the search means that only likely paths are extended, and so a minimum number of
models are evaluated.
The decoder is used to give full recognition results for feature-sets derived from the
MOCHA and TIMIT corpora. Conventional train/test divisions and choice of language
model are used so that results can be directly compared to those in other studies. The
decoder is also used to implement Viterbi training, in which model parameters are alter-
nately updated and then used to re-align the training data.
Contents
1 Introduction 1
1.1 Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Speech recognition today . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Formulation of the problem . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Acoustic modelling for ASR . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Hybrid ANN/HMM . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Segment models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Distinct sources of variation . . . . . . . . . . . . . . . . . . 8
1.4 Articulation as an information source for ASR . . . . . . . . . . . . . . . 9
Explicit modelling of phonological variation . . . . . . . . . 9
1.5 Motivation for this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Literature Review 13
2.1 Linear Gaussian models and their relatives . . . . . . . . . . . . . . . . . . 13
2.1.1 State-space models . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . 16
Expectation maximisation algorithm . . . . . . . . . . . . . 16
2.1.2 Linear Gaussian models . . . . . . . . . . . . . . . . . . . . . . . . 18
Static models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Sensible principal components analysis . . . . . . . . . . . . 19
Factor analyser . . . . . . . . . . . . . . . . . . . . . . . . . 19
Dynamic models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Linear dynamic model . . . . . . . . . . . . . . . . . . . . . 20
Degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Non-linear and/or non-Gaussian extensions . . . . . . . . . . . . . 21
Variations on the state process . . . . . . . . . . . . . . . . . . . . 21
Varying the observation process . . . . . . . . . . . . . . . . . . . . 21
Mixture distributions . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Articulatory-inspired acoustic modelling for ASR . . . . . . . . . . . . . . 23
Discrete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Continuous . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Articulatory parameters as features . . . . . . . . . . . . . . . . . 23
Real articulatory features . . . . . . . . . . . . . . . . . . . . . . . 24
HMM systems . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Dynamic Bayesian network system . . . . . . . . . . . . . . 25
Pseudo-articulatory features . . . . . . . . . . . . . . . . . . . . . . 26
Kirchhoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
King . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Using articulatory parameters to derive HMM topology . . . . . . 28
HAMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Deng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Recognition by articulatory synthesis . . . . . . . . . . . . . . . . . 30
2.3 Segment modelling for speech . . . . . . . . . . . . . . . . . . . . . . . . . 33
Variable-length . . . . . . . . . . . . . . . . . . . . . . . . . 33
Fixed-length . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.1 Segmental HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Segmental feature HMMs . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.3 Modelling temporal dependencies with HMMs . . . . . . . . . . . . 36
2.3.4 Modelling speech-signal trajectories with standard HMMs . . . . . 37
2.3.5 ANNs in segment modelling . . . . . . . . . . . . . . . . . . . . . . 38
ANN segment models . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Preliminaries 47
3.1 Data collection and processing . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.1 Articulatory Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Direct measurement of human articulation . . . . . . . . . . . . . . 47
X-ray microbeam . . . . . . . . . . . . . . . . . . . . . . . . 48
EMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Laryngograph . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Electropalatograph . . . . . . . . . . . . . . . . . . . . . . . 49
Automatically recovering articulatory parameters . . . . . . . . . . 50
Critical articulators . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.2 Acoustic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Calculating MFCCs . . . . . . . . . . . . . . . . . . . . . . 54
Calculating PLPs . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Corpora used in this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.1 MOCHA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Feature sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Acoustic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Articulatory . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Automatically recovered EMA . . . . . . . . . . . . . . . . . 61
Combined acoustic-articulatory . . . . . . . . . . . . . . . . 62
Linear discriminant analysis . . . . . . . . . . . . . . . . . . . . . . 62
Summary of the MOCHA feature sets . . . . . . . . . . . . . . . . 63
3.2.2 TIMIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Figures
5.1 Does the small MOCHA test-set reflect the full corpus? . . . . . . . . . . 119
5.2 MOCHA classification for frame shifts of 2 − 14ms . . . . . . . . . . . . . 120
5.3 MOCHA EMA LDM classification results . . . . . . . . . . . . . . . . . . 123
5.4 MOCHA EMA + EPG + LAR LDM classification results . . . . . . . . . 125
5.5 MOCHA EMA + EPG + LAR LDM classification confusions. . . . . . . 127
5.6 MOCHA net EMA LDM classification results . . . . . . . . . . . . . . . . 129
5.7 Comparison by phone category of EMA and net EMA LDM classification. 130
5.8 MOCHA PLP LDM classification results . . . . . . . . . . . . . . . . . . . 133
5.9 MOCHA MFCC LDM classification results . . . . . . . . . . . . . . . . . 134
5.10 Comparison by phone category of PLP and MFCC classification. . . . . . 135
5.11 Comparison by phone category of EMA + EPG + LAR and MFCC LDM
classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.12 MOCHA PLP + EMA LDM classification results . . . . . . . . . . . . . . 138
5.13 MOCHA MFCC + EMA LDM classification results . . . . . . . . . . . . 139
5.14 Comparison by phone category of PLP and PLP + EMA LDM classification. . . . 141
5.15 MOCHA PLP + EMA LDM classification confusions. . . . . . . . . . . . 142
5.16 MOCHA PLP + net EMA LDM classification results . . . . . . . . . . . . 144
5.17 MOCHA MFCC + net EMA LDM classification results . . . . . . . . . . 145
5.18 TIMIT PLP LDM classification results . . . . . . . . . . . . . . . . . . . . 151
5.19 TIMIT MFCC LDM classification results . . . . . . . . . . . . . . . . . . 152
5.20 TIMIT MFCC LDM classification confusions. . . . . . . . . . . . . . . . . 154
5.21 Comparison by phone category of PLP and MFCC LDM classification. . . 155
5.22 Schematic of multiple regime modelling . . . . . . . . . . . . . . . . . . . 165
5.23 Comparison by phone category of static and dynamic MR model classifi-
cation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.24 TIMIT mixed static/dynamic model set validation results . . . . . . . . . 170
5.25 TIMIT likelihood-combined static/dynamic model validation results . . . 172
5.26 Spectrogram of actual MFCCs . . . . . . . . . . . . . . . . . . . . . . . . 177
5.27 Spectrogram of LDM-predicted MFCCs . . . . . . . . . . . . . . . . . . . 177
List of Tables
5.1 The 5 cross-validation sets swap role until each has been used for testing. 117
5.2 MOCHA feature sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3 MOCHA classification for frame shifts of 2-14ms . . . . . . . . . . . . . . 121
5.4 MOCHA EMA cross-validation classification results . . . . . . . . . . . . 123
5.5 MOCHA EMA + EPG + LAR cross-validation classification results . . . 125
5.6 MOCHA net EMA classification results . . . . . . . . . . . . . . . . . . . 129
5.7 Ranked average RMSE for network predictions of EMA data. . . . . . . . 131
5.8 MOCHA PLP cross-validation classification results . . . . . . . . . . . . . 133
5.9 MOCHA MFCC cross-validation classification results . . . . . . . . . . . . 134
5.10 MOCHA PLP + EMA cross-validation classification results . . . . . . . . 138
5.11 MOCHA MFCC + EMA cross-validation classification results . . . . . . . 139
B.1 Finding MOCHA EMA LDM initial conditions: varying C and D. . . . . 247
B.2 Finding MOCHA EMA LDM initial conditions: varying Λ. . . . . . . . . 247
B.3 Finding MOCHA EMA LDM initial conditions: using phone-specific v. . 248
B.4 Finding TIMIT PLP and MFCC LDM initial conditions: varying C and D. 248
B.5 Finding TIMIT PLP and MFCC LDM initial conditions: varying Λ. . . . 249
B.6 Finding TIMIT PLP and MFCC LDM initial conditions: using phone-
specific v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
B.7 Finding MOCHA EMA LDM initial conditions: factor analyser initialisation. . . . 251
B.8 Finding TIMIT PLP and MFCC LDM initial conditions: factor analyser
initialisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
E.5 MOCHA EMA + LAR + EPG diagonal state covariance LDM classifica-
tion results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
E.6 MOCHA EMA + LAR + EPG identity state covariance LDM classification
results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
E.7 MOCHA EMA + LAR + EPG LDM classification results . . . . . . . . . 264
E.8 MOCHA articulatory LDM classification results . . . . . . . . . . . . . . . 265
E.9 MOCHA net EMA factor analyser classification results . . . . . . . . . . . 266
E.10 MOCHA net EMA LDM classification results . . . . . . . . . . . . . . . . 267
E.11 MOCHA PLP factor analyser classification results . . . . . . . . . . . . . 268
E.12 MOCHA PLP LDM classification results . . . . . . . . . . . . . . . . . . . 269
E.13 MOCHA MFCC factor analyser classification results . . . . . . . . . . . . 270
E.14 MOCHA MFCC diagonal covariance LDM classification results . . . . . . 271
E.15 MOCHA MFCC diagonal state covariance LDM classification results . . . 272
E.16 MOCHA MFCC identity state covariance LDM classification results . . . 273
E.17 MOCHA MFCC LDM classification results . . . . . . . . . . . . . . . . . 274
E.18 MOCHA PLP + EMA factor analyser classification results . . . . . . . . 275
E.19 MOCHA PLP + EMA LDM classification results . . . . . . . . . . . . . . 276
E.20 MOCHA MFCC + EMA factor analyser classification results . . . . . . . 277
E.21 MOCHA MFCC + EMA LDM classification results . . . . . . . . . . . . 278
E.22 MOCHA PLP + net EMA factor analyser classification results . . . . . . 279
E.23 MOCHA PLP + net EMA LDM classification results . . . . . . . . . . . . 280
E.24 MOCHA MFCC + net EMA factor analyser classification results . . . . . 281
E.25 MOCHA MFCC + net EMA LDM classification results . . . . . . . . . . 282
E.26 TIMIT PLP 61 phone factor analyser classification results . . . . . . . . . 283
E.27 TIMIT PLP 39 phone factor analyser classification results . . . . . . . . . 284
E.28 TIMIT PLP 61 phone LDM classification results . . . . . . . . . . . . . . 285
E.29 TIMIT PLP 39 phone LDM classification results . . . . . . . . . . . . . . 286
E.30 TIMIT MFCC 61 phone factor analyser classification results . . . . . . . . 287
E.31 TIMIT MFCC 39 phone factor analyser classification results . . . . . . . . 288
E.32 TIMIT MFCC diagonal covariance 61 phone LDM classification results . . 289
E.33 TIMIT MFCC diagonal covariance 39 phone LDM classification results . . 290
E.34 TIMIT MFCC diagonal state covariance 61 phone LDM classification results . . 291
E.35 TIMIT MFCC diagonal state covariance 39 phone LDM classification results . . 292
E.36 TIMIT MFCC identity state covariance 61 phone LDM classification results . . . 293
E.37 TIMIT MFCC identity state covariance 39 phone LDM classification results . . . 294
E.38 TIMIT MFCC 61 phone LDM classification results . . . . . . . . . . . . . 295
E.39 TIMIT MFCC 39 phone LDM classification results . . . . . . . . . . . . . 296
Chapter 1
Introduction
1.1 Preamble
In Bourlard, Hermansky & Morgan (1996), a trio of prominent speech scientists con-
fronted the speech recognition research community. With the belief that current tech-
niques would not ultimately provide the performance sought from speech recognizers, they
urged researchers to look beyond incremental modification of standard approaches. They
acknowledged that this would lead to increases in word error rate in the short-term, but
that this should not act as discouragement if three criteria are met. These are:
• sound methodology
With these in mind, it is time to start building the case for the departure from the
mainstream of automatic speech recognition with which this thesis is concerned.
1.2 Speech recognition today

Automatic speech recognition (ASR) has moved from science-fiction fantasy to daily re-
ality for citizens of technological societies. Some people seek it out, preferring dictating
to typing, or benefiting from voice control of aids such as wheel-chairs. Others find it
embedded in their hi-tec gadgetry – in mobile phones and car navigation systems, or crop-
ping up in what would have until recently been human roles such as telephone booking
of cinema tickets. Wherever you may meet it, computer speech recognition is here, and
it’s here to stay.
Statistical methods have come to dominate ASR. Speech data provides a huge amount
of information to build a model upon, data into which variability is introduced by diverse
factors such as the speaker’s dialect, vocabulary, mood, rate of speech, breathiness, type
of microphone and presence of background noise. Used properly though, all of these are
information sources which can be combined, along with their associated uncertainties, by
statistical models.
• decoding, in which a search is made for the most likely sequence of words $\hat{w}_1^j$ given
the observations $y_1^N$, using the acoustic, lexical and language model probabilities:

$$\hat{w}_1^j = \operatorname*{argmax}_{w_1^j} p(w_1^j \mid y_1^N) = \operatorname*{argmax}_{w_1^j} p(y_1^N \mid w_1^j)\, p(w_1^j) \qquad (1.1)$$
The main concern of this thesis is acoustic modelling, though of course meaningful testing
of new acoustic models also requires the other components. The signal processing, lan-
guage modelling and decoding needed for the classification and recognition experiments
in this thesis are described in Sections 3.3, 3.1.2 and 6.2 respectively.
Though the current generation of ASR systems offers little more than a description of
speech, performance does continue to improve. This is, however, almost entirely due to larger training
corpora and faster processors, which can support ever larger unit inventories and numbers
of model parameters. Many of those working with speech technology believe that improved speech
modelling will be the route to real gains in ASR performance. Roweis (1999)
makes the case for speech-production motivated ASR by drawing comparisons with other
engineering problems. Typically in cases where inference is sought from some noisy ob-
servations, the first step is to devise an appropriate model. Such a model will usually
have internal states which reflect the nature of the process. The model is trained on the
available data, and then used to make inferences about new and unseen observations. If
this kind of approach is to be applied to ASR, the properties of such a model must first
be considered. The ideal acoustic model for ASR would:
Hidden Markov model (HMM) and hybrid artificial neural network (ANN) /HMM
systems provide the basis for the majority of ASR applications, as it is these approaches
which have provided the necessary combination of accuracy and efficiency. The last 15
years have also seen steady and continued interest in statistical segment modelling from
a variety of groups and researchers. However, despite the theoretical advantages of this
class of models, none have made it into mainstream ASR. The following sections discuss
these three different approaches with reference to the acoustic modelling ideals above.
1.3.1 Hidden Markov Models

In an HMM, observations are viewed as being generated by a hidden state process, which is governed by a Markov model. Since distributions rather than symbols
are associated with each state, there is no unique mapping between state and observation,
and it is the goal of recognition to determine the most likely underlying state sequence².
Training HMMs uses an application of the expectation maximisation (EM) algorithm
which is described in Section 2.1.1 on page 16. For each state, an output distribution and
set of transition probabilities must be estimated.
The HMM is a frame-based model. When conditioned on a discrete state qj ∈ Q,
successive frames of data are assumed to be independent. This has the effect of ignoring
the temporal ordering of the observations within each state. To add the capacity to
model inter-frame dependencies, dynamic information is often included in the form of δ
and δδ coefficients. These can be simple differences, though are more often computed
using some form of numerical differentiation. For instance, HTK (Young 1993), a widely
used HMM recognition toolkit, estimates derivative information at a given time using
a linear regression over the previous and following 2 frames (Young 1995). Therefore,
contained in the δδ coefficients is information from the surrounding 8 frames. Including
time-derivatives takes account of the ordering of the original features and yields significant
improvements in recognition performance, though makes the assumption of framewise
independence even less appropriate.
State transitions give an exponential model of phone duration, though in practice
their contribution to the calculation of the joint probability of state and observation
sequences is overpowered by that of the output distribution. In fact, Merhav & Ephraim
(1991) show that an HMM with a continuous output distribution, such as typically used
for ASR, converges asymptotically to an independent process as the dimension of the
feature vector increases. Another independence assumption which is normally made by
HMMs is spatial. To reduce parameterization and increase efficiency of computation,
Gaussian mixture components have diagonal covariance matrices. Dependencies persist
between feature dimensions even for front-end parameterizations which are designed to
decorrelate the frequency components of the speech signal, such as Mel-frequency cepstral
coefficients (MFCCs), which are described in Section 3.1.2. Diagonal covariance matrices therefore discard any correlations which remain between feature dimensions.
² The approximation of finding the single most likely state sequence rather than summing over all possible sequences is described in more detail in Section 6.1.1 on page 187.
1.3.2 Hybrid ANN/HMM

In standard HMMs, Gaussian mixture models provide the means of computing the probability of an observation $y_t$ given a state $q_j \in Q$, $p(y_t \mid q_j)$. This quantity is turned around in a hybrid ANN/HMM approach, in which an artificial neural network is trained to give a direct estimate of the posterior probability of state $q_j$ given an observation, $p(q_j \mid y_t)$. Using Bayes' rule, the two can be related, giving

$$p(q_j \mid y_t) = \frac{p(y_t \mid q_j)\, p(q_j)}{p(y_t)} \qquad (1.2)$$
Models are compared with a scaled likelihood (Morgan & Bourlard 1995, Robinson, Cook, Ellis, Fosler-Lussier, Renals & Williams 2002) where $p(y_t)$ is ignored since it is independent of the state:

$$p(y_t \mid q_j) \propto \frac{p(q_j \mid y_t)}{p(q_j)} \qquad (1.3)$$
In practice, the ANN takes a number of adjacent acoustic frames as input, so that
$y_t$ is replaced with $y_{t-\tau}^{t+\tau}$. Frames surrounding the time of interest are included in any
likelihood calculation, thereby loosening the standard HMM assumption of framewise
independence. This brings an implicit model of context dependency into the acoustic
modelling as observations from neighbouring phones will appear in the input near model
boundaries. The HMMs for which the ANN provides emission probabilities typically have
simple topologies, and are used to impose minimum durations and give phone transition
probabilities. For example, the HMMs used in Robinson et al. (2002) are single state
monophone models.
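The division in equation (1.3) is trivial to carry out once the network posteriors and state priors are available; the following is a minimal sketch (the array shapes, the flooring constant and the use of training-set relative frequencies as priors are assumptions, not details taken from the systems cited).

```python
import numpy as np

def scaled_log_likelihoods(posteriors: np.ndarray, priors: np.ndarray,
                           floor: float = 1e-10) -> np.ndarray:
    """Turn ANN state posteriors p(q_j | y_t), shape (frames x states), into
    scaled likelihoods p(y_t | q_j) / p(y_t) by dividing out the state priors
    p(q_j), as in equation (1.3). Working in the log domain keeps the values
    well behaved when they are later combined with transition probabilities."""
    posteriors = np.maximum(posteriors, floor)  # guard against log(0)
    priors = np.maximum(priors, floor)
    return np.log(posteriors) - np.log(priors)

# The priors would typically be the relative state frequencies in the
# training alignment, e.g. priors = state_counts / state_counts.sum()
```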
Artificial neural networks provide general non-linear mappings which make minimal
assumptions regarding functional form, and have the capacity to model any dependencies
present in the data. However, ANNs tend to be treated as ‘black box’ classifiers as
meaningful analysis of their inner workings is difficult. Both spatial and inter-frame
correlations occurring within the input window can be accounted for, and recurrent neural
networks have also been used to introduce explicit models of time dependency. An ANN
cannot really be seen as providing a model of speech production, though the ability to
model a multitude of dependencies and use of discriminative training makes them powerful
classifiers.
Training an ANN once needed specialist hardware, but now with the availability of
cheap computing, the computational demands of parameter estimation can be met on
standard systems. Once trained, mapping from input to target domain with a neural
network is extremely rapid. This ease of evaluation, combined with the need for relatively
few models makes the hybrid ANN/HMM an extremely efficient manner in which to
implement ASR, both in terms of memory and CPU usage. Robinson (1994) reports the
current lowest phone error rate for TIMIT phone recognition using a hybrid ANN/HMM
system.
(Young, Evermann, Kershaw, Moore, Odell, Ollason, Povey, Valtchev & Woodland 2002).
Segment modelling can, and has, taken a multitude of different forms, and a review
of those which have been applied to ASR is given in Section 2.3 on page 33. The work
reported in this thesis revolves around the application of a segment model, the linear
dynamic model (LDM), which will be introduced in Section 2.1.2 on page 20 and described
fully in Chapter 4.
One of the acoustic modelling ideals listed above was to reflect the nature of the
production mechanism. This could be achieved by building a model in which the internal
states are derived from the properties of the articulators. The following section considers
the ways in which knowledge of the underlying articulation might aid speech recognition.
1.4 Articulation as an information source for ASR

Human speech recognition is a highly developed process. Most people manage a high
level of accuracy, despite constantly having to switch between (potentially unfamiliar)
speakers and a variety of noise conditions. Those who subscribe to the motor theory of
speech perception (Liberman & Mattingly 1985) would argue that ‘we can, because we
know how’. Using articulatory information in ASR offers the chance to explicitly model
some of the effects which arise from simple variation in the production mechanism, yet
cause significant changes to the acoustic parameters.
An example is the nasalisation of a vowel before a nasal consonant: this occurs as, during the [i], the velum lowers to be ready for the following [n]. The
effect of co-articulation varies considerably dependent on the phone type and the context
in which it occurs, though a slight and predictable alteration of an articulatory gesture
can have a significant effect on the acoustic realisation of a given phone.
The articulatory parameters are considered a design tool which give a real-world basis
on which to work and a source of inspiration for the properties which an acoustic model
for ASR should possess. A model of measured human articulation will not ultimately
be useful in a speech recognition system, though understanding the characteristics of
such data will be. Separate recovery of articulation and subsequent modelling of these
parameters, if shown to be advantageous, should be combined. Training both parts of this
process together would render the acoustic-articulatory mapping into an internal state of
the acoustic model.
1.5 Motivation for this thesis
This thesis is motivated by the belief that a model which reflects the characteristics of
speech production will ultimately lead to improvements in automatic speech recognition
performance. The articulators move slowly and continuously along highly constrained
trajectories, each one capable of a limited set of gestures which are organised in an
overlapping, asynchronous fashion. Feature extraction on the resulting acoustic signal
produces a piecewise smooth, spatially correlated set of parameters in which neighbouring
feature vectors are highly correlated, and dependencies can spread over many frames.
These properties should be reflected in an acoustic model.
This work investigates speech modelling using a form of linear state-space model, and
the related implementational issues. The intention is to show that a hidden dynamic
representation can be used to enhance speech recognition. Furthermore, the possibility
of acoustic modelling in which an underlying representation is continuous both within
and between phones would allow modelling of longer range dependencies, and loosen the
standard assumption of inter-segmental independence.
The final chapter summarises the findings of this thesis and suggests directions for future work.
1.6 Publications
A number of publications have resulted from work during the period of study for this
thesis. These are: Frankel, Richmond, King & Taylor (2000), King, Taylor, Frankel &
Richmond (2000), Frankel & King (2001a) and Frankel & King (2001b).
Chapter 2
Literature Review
A survey of the literature relevant to this thesis falls neatly into three parts: linear
Gaussian models, speech recognition which incorporates articulatory information, and
segmental approaches to acoustic modelling. Linear Gaussian models will be dealt with
first, as they will prove useful in describing some of the segment models of Section 2.3.
2.1 Linear Gaussian models and their relatives

Linear Gaussian models are a class of models which have found many applications in the
last few decades, in domains such as control, machine learning and financial analysis.
There are two excellent papers offering reviews of such models, Roweis & Ghahramani (1999) and Rosti & Gales (2001), on which this summary draws.
2.1.1 State-space models

In the most general form of state-space model, an observation $y_t$ is generated from a hidden state $x_t$ through a function $h$, while the state itself evolves according to a function $f$ of its history:

$$y_t = h(x_t, \epsilon_t) \qquad (2.1)$$
$$x_t = f(x_{t-1}, \ldots, x_1, \eta_t) \qquad (2.2)$$
The observation noise $\epsilon_t$ characterises the variation due to a range of external sources,
for example measurement error or noise. Furthermore it offers a degree of smoothing which
is useful when there is a mismatch between training and testing data. Uncertainty in the
modelling of the state process is described by the state noise ηt . An important feature of
these models is that the observation at time t is conditional only on the state at that time.
However, the state can take a variety of forms, such as static distributions, long-span auto
regressive processes or sets of discrete modes. Figure 2.1 represents such a model, where
motions in the state space give rise to the observed data.
Figure 2.1: In state-space models, the observations are seen as realisations of some unseen,
usually lower-dimensional, process. This provides a means of distinguishing the underlying system
from the observations which represent it. The state and observation spaces are linked by the
transformation h.
State-space models are useful in many real-life situations where systems contain a
different number of degrees of freedom, usually fewer, than the data used to represent
them. In these cases, a distinction can be made between the production mechanism at
work and the parameterization chosen to represent it. The hidden state variable can have
just as many degrees of freedom as are required to model any underlying processes, and
then a state-observation mapping shows how these are realised in observation space. This
offers a means of making a compact representation of the data. In fact, dimensionality reduction is one of the most common applications of such models.
There are two problems which must usually be solved for practical application of
any given state-space model. Firstly, it should be possible to infer information about
the internal states of the model for a given set of parameters and sequence of observa-
tions. Secondly, the parameters which identify the model must be estimable given suitable
training data.
Inference
For a fixed set of model parameters $\Theta$, and an N-frame sequence of observations $y_1^N = \{y_1, \ldots, y_N\}$, there are a number of quantities which may be calculated. The total likelihood of a set of observations can be computed as the integral of the joint probability of state and observations $p(y_1^N, x_1^N \mid \Theta)$ over all possible state sequences. With $x_1^N = \{x_1, \ldots, x_N\}$ representing one such candidate state sequence, the integral to be evaluated is

$$p(y_1^N \mid \Theta) = \int_{\text{all } x_1^N} p(y_1^N, x_1^N \mid \Theta) \, dx_1^N \qquad (2.3)$$
An application of Bayes’ rule then provides the conditional probability of a given state
sequence underlying the observations:
$$p(x_1^N \mid y_1^N, \Theta) = \frac{p(y_1^N, x_1^N \mid \Theta)}{p(y_1^N \mid \Theta)} \qquad (2.4)$$
Filtering and smoothing are operations which lie at the heart of inference calculations
for all but the simplest of state-space models. Filtering produces an estimate of the
state distribution at time t given all the observations up to and including that time, $p(x_t \mid y_1^t, \Theta)$, and smoothing gives a corresponding estimate of the state conditioned on the entire N-length observation sequence, $p(x_t \mid y_1^N, \Theta)$. The Kalman filter and Rauch-Tung-
Striebel (RTS) smoother provide the optimal solutions for linear Gaussian models and
are detailed in Section 4.2.1. Equivalent techniques for filtering/smoothing of non-linear
or non-Gaussian models exist, though optimal solutions are not common. See Haykin
(2001) for details of the extended Kalman filter, particle filtering and so on.
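For the linear Gaussian case, a single filtering step has a simple closed form. The sketch below follows the LDM notation introduced later in this chapter and is illustrative only; the full recursions, including the RTS smoother, are those given in Section 4.2.1.

```python
import numpy as np

def kalman_step(x_mean, x_cov, y, F, H, D, C, w, v):
    """One predict/update step of the Kalman filter for the model
        x_t = F x_{t-1} + eta_t,   eta_t ~ N(w, D)
        y_t = H x_t     + eps_t,   eps_t ~ N(v, C).
    Returns the filtered state mean and covariance p(x_t | y_1..y_t) along
    with the log-likelihood contribution of y_t."""
    # Predict: propagate the previous state estimate through the dynamics.
    x_pred = F @ x_mean + w
    P_pred = F @ x_cov @ F.T + D
    # Innovation: compare the predicted observation with the actual one.
    y_pred = H @ x_pred + v
    S = H @ P_pred @ H.T + C              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    innov = y - y_pred
    # Update: correct the prediction using the observation.
    x_filt = x_pred + K @ innov
    P_filt = P_pred - K @ H @ P_pred
    # Per-frame log-likelihood under the predictive distribution N(y_pred, S).
    p = len(y)
    loglik = -0.5 * (p * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
                     + innov @ np.linalg.solve(S, innov))
    return x_filt, P_filt, loglik
```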
Estimation
Maximum likelihood If the state were observable, it would be possible to find stan-
dard maximum likelihood (ML) estimates of the parameters. The state and observation
data are regarded as fixed, and the parameter set $\Theta$ is treated as a variable. Writing $\mathcal{Y} = y_1^N$ and $\mathcal{X} = x_1^N$, the likelihood function for the model is

$$\mathcal{L}(\Theta \mid \mathcal{Y}, \mathcal{X}) = p(\mathcal{Y}, \mathcal{X} \mid \Theta) \qquad (2.5)$$
The ML solution corresponds to finding a value of $\Theta$ which maximises (2.5). For many models, this maximum can be found analytically, often by maximising the log-likelihood $\ell(\Theta \mid \mathcal{Y}, \mathcal{X}) = \log \mathcal{L}(\Theta \mid \mathcal{Y}, \mathcal{X})$ rather than $\mathcal{L}(\Theta \mid \mathcal{Y}, \mathcal{X})$ directly. However, alternatives to standard ML techniques must be employed when the state is unobserved.
Expectation maximisation algorithm When the state is unobserved, the quantity to be maximised is the incomplete-data log-likelihood

$$\ell(\Theta \mid \mathcal{Y}) = \log \int_{\mathcal{X}} p(\mathcal{Y}, \mathcal{X} \mid \Theta) \, d\mathcal{X}, \qquad (2.6)$$

and for any distribution $p_0(\mathcal{X})$ over the hidden state, Jensen's inequality gives

$$\ell(\Theta \mid \mathcal{Y}) \;\ge\; \int_{\mathcal{X}} p_0(\mathcal{X}) \log \frac{p(\mathcal{Y}, \mathcal{X} \mid \Theta)}{p_0(\mathcal{X})} \, d\mathcal{X}. \qquad (2.7)$$
The rearrangement of 2.6 to the inequality 2.7 serves two purposes. Firstly, the original
likelihood function contained the log of an integral. In 2.7, the log has been moved
within the integral which greatly simplifies analytical optimisation. Secondly, choosing
$p_0(\mathcal{X}) = p(\mathcal{X} \mid \mathcal{Y}, \Theta^{(i)})$, the posterior of the state given the observations $\mathcal{Y}$ and some previous model parameters, $\Theta^{(i)}$, gives equality in (2.7). In cases where $p(\mathcal{X} \mid \mathcal{Y}, \Theta^{(i)})$ is unavailable or intractable, alternative distributions can be used for $p_0(\mathcal{X})$, and maximising the right hand side of (2.7) will still increase, or at worst not affect, the incomplete-data log-likelihood $\ell(\Theta \mid \mathcal{Y})$.
For many classes of model, including all those dealt with in this section, the state posterior distribution is available. Since the denominator on the right hand side of (2.7) is not dependent on $\Theta$, setting $p_0(\mathcal{X}) = p(\mathcal{X} \mid \mathcal{Y}, \Theta^{(i)})$ means that maximising $\ell(\Theta \mid \mathcal{Y})$ is equivalent to maximising what is known as the auxiliary function:

$$Q(\Theta, \Theta^{(i)}) = \int_{\mathcal{X}} p(\mathcal{X} \mid \mathcal{Y}, \Theta^{(i)}) \log p(\mathcal{Y}, \mathcal{X} \mid \Theta) \, d\mathcal{X} \qquad (2.8)$$
$$= E\big[\, \log p(\mathcal{Y}, \mathcal{X} \mid \Theta) \,\big|\, \mathcal{Y}, \Theta^{(i)} \big] \qquad (2.9)$$
This maximisation is carried out in two stages. Given some starting parameter estimates,
the goal of the E-step is to:
find $Q(\Theta, \Theta^{(i)})$, the expectation of the complete-data log-likelihood with respect to the unknown data $\mathcal{X}$, given the observed data $\mathcal{Y}$, using the most recent parameter estimates $\Theta^{(i)}$, i.e. evaluate the expectation in (2.9).

The goal of the M-step is then to find the parameters which maximise this expectation, $\Theta^{(i+1)} = \operatorname*{argmax}_{\Theta} Q(\Theta, \Theta^{(i)})$.
Less formally, in the E-step, estimates are made of the state values using the most
recently estimated parameter set and the observation sequence. These are then used in
place of actual values in the standard ML solutions in the M-step. The EM algorithm
combines these operations and steps iteratively toward the ML solution. The parameter estimates are guaranteed not to decrease the likelihood at each iteration, and so converge toward a local maximum.
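Schematically, the iteration looks as follows; the E- and M-step routines are placeholders for the model-specific computations described above, so this is a skeleton rather than a complete estimator.

```python
import numpy as np

def em_fit(observations, init_params, e_step, m_step,
           max_iters: int = 50, tol: float = 1e-4):
    """Generic EM loop. `e_step` returns the sufficient statistics of the
    state posterior p(X | Y, theta_i) along with the data log-likelihood
    under theta_i; `m_step` returns the parameters which maximise the
    auxiliary function Q(theta, theta_i) given those statistics. Both are
    model-specific and assumed to be supplied by the caller."""
    params = init_params
    prev_loglik = -np.inf
    for _ in range(max_iters):
        stats, loglik = e_step(observations, params)   # E-step
        params = m_step(observations, stats)           # M-step
        if loglik - prev_loglik < tol:                 # likelihood is non-decreasing
            break
        prev_loglik = loglik
    return params
```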
2.1.2 Linear Gaussian models

In this class of models, the transformations H and F are restricted to being linear mappings, and the error terms $\epsilon_t$ and $\eta_t$ are additive Gaussian noise. Furthermore, if the state is dynamic, the initial state must also be specified and is also Gaussian. This set of constraints ensures that the distributions over both state and output processes are always Gaussian. Modifying (2.1) and (2.2) accordingly, a linear Gaussian model is described by

$$y_t = H_t x_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(v, C)$$
$$x_t = \sum_{j=1}^{k} F_t^{(j)} x_{t-j} + \eta_t, \qquad \eta_t \sim \mathcal{N}(w, D)$$
and an initial state distribution $x_1 \sim \mathcal{N}(\pi, \Lambda)$. For our purposes, H and F are time-invariant, and the state process is at most first order, so that k = 1. A linear Gaussian model can be categorised as static or dynamic, depending on the properties of its state process.
Static models
In a static model, the temporal ordering of the data is ignored. Setting F = 0 removes the
dynamic portion of the model and the state is simply modelled as a Gaussian. Equations
2.12 and 2.13 describe such a model.
$$y_t = Hx_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(v, C) \qquad (2.12)$$
$$x_t = \eta_t, \qquad \eta_t \sim \mathcal{N}(w, D) \qquad (2.13)$$
Given that the output distribution is simply represented by a single static Gaussian
distribution, the advantage of such a model over a standard Gaussian is not immediately
apparent. However, the aim is to absorb the correlation structure of the data into H,
and constrain the observation noise covariance C to be diagonal or some multiple of the
identity I. In this way, high dimensional data can be described with a minimum number
of parameters.
The degeneracy present in the model (see LDMs on page 20 below) means that the state noise distribution can be restricted to $\eta_t \sim \mathcal{N}(0, I)$ with no loss of generality. The observation noise mean v takes the mean of the data, and the form of the covariance C will be discussed in the sections below. Noting that D has been set to be an identity matrix, the output distribution can be found by analytical integration:

$$y_t \sim \mathcal{N}(v, HH^T + C) \qquad (2.14)$$
The posterior distribution over the state given an observation is also Gaussian, with mean $\beta(y_t - v)$ and covariance $I - \beta H$, where $\beta = H^T(HH^T + C)^{-1}$. Likelihood computation under such models simply involves evaluation of the Gaussian in (2.14), and EM parameter estimation is straightforward and detailed in both of the review papers mentioned above.
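As an illustration of these two computations, a sketch of the likelihood in (2.14) and the corresponding state posterior is given below; the function names and the use of scipy are choices made here, not anything prescribed by the reviews cited.

```python
import numpy as np
from scipy.stats import multivariate_normal

def static_lgm_loglik(y, H, C, v):
    """Log-likelihood of an observation under the static linear Gaussian
    model: y ~ N(v, H H^T + C), as in equation (2.14)."""
    return multivariate_normal.logpdf(y, mean=v, cov=H @ H.T + C)

def static_lgm_posterior(y, H, C, v):
    """Posterior mean and covariance of the state given an observation,
    using beta = H^T (H H^T + C)^{-1} and a standard Gaussian state prior."""
    beta = H.T @ np.linalg.inv(H @ H.T + C)
    q = H.shape[1]
    post_mean = beta @ (y - v)
    post_cov = np.eye(q) - beta @ H
    return post_mean, post_cov
```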
Factor analyser In a factor analysis (FA) model, the observation noise covariance C is restricted to being diagonal. This gives a Gaussian output distribution over the observations, though uses only $2p + pq$ parameters compared to the $p + p(p+1)/2$ parameters of a standard p-dimensional Gaussian. If C were not constrained in this way, parameter estimation could
simply place zeros in H and set v and C to be the sample mean and covariance. This
would amount to a valid maximum likelihood solution, though not a very informative
one. The power of a factor analysis model lies in the possibility of providing a compact representation of high-dimensional data. Indeed, if there is enough training data to give
full rank estimates of the sample covariance, there is nothing to be gained using factor
analysis over a straightforward Gaussian distribution.
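The saving can be made concrete with a simple parameter count; the example dimensions in the comment are arbitrary.

```python
def full_gaussian_params(p: int) -> int:
    """Full-covariance p-dimensional Gaussian: mean plus symmetric covariance."""
    return p + p * (p + 1) // 2

def factor_analyser_params(p: int, q: int) -> int:
    """Factor analyser with q factors: mean and diagonal noise covariance (2p)
    plus the p x q loading matrix H."""
    return 2 * p + p * q

# e.g. p = 39, q = 13 (arbitrary illustrative dimensions):
# full_gaussian_params(39) == 819, factor_analyser_params(39, 13) == 585
```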
Dynamic models
Linear dynamic model The linear dynamic model (LDM) is described in some detail
in Chapter 4, though for continuity a quick introduction is given here. The static models
above aim to characterise the spatial correlations of the data, and the LDM builds on this
by also modelling the temporal characteristics of the data. An LDM is described by the
following pair of equations:
$$y_t = Hx_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(v, C) \qquad (2.16)$$
$$x_t = Fx_{t-1} + \eta_t, \qquad \eta_t \sim \mathcal{N}(w, D) \qquad (2.17)$$

and a distribution over the initial state, $x_1 \sim \mathcal{N}(\pi, \Lambda)$. Equation (2.17) describes how a Gaussian density flows through a continuous state-space, transformed according to the rotations and stretches of F and the addition of noise. The state distribution is directly related to the observations by the linear mapping given in (2.16).
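One way to visualise how the density flows through the state space is simply to sample from (2.16) and (2.17); the following sketch does so for arbitrary parameter values and makes no claim about sensible settings for speech.

```python
import numpy as np

def sample_ldm(F, w, D, H, v, C, pi, Lam, n_frames, rng=None):
    """Draw a state trajectory and observation sequence from an LDM:
        x_1 ~ N(pi, Lam)
        x_t = F x_{t-1} + eta_t,   eta_t ~ N(w, D)
        y_t = H x_t     + eps_t,   eps_t ~ N(v, C)."""
    rng = np.random.default_rng() if rng is None else rng
    q, p = F.shape[0], H.shape[0]
    states = np.zeros((n_frames, q))
    observations = np.zeros((n_frames, p))
    x = rng.multivariate_normal(pi, Lam)
    for t in range(n_frames):
        if t > 0:
            x = F @ x + rng.multivariate_normal(w, D)
        states[t] = x
        observations[t] = H @ x + rng.multivariate_normal(v, C)
    return states, observations
```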
Degeneracy The LDM is an over-specified model. For a model with observation dimension p, state dimension q, full covariance matrices and non-zero noise terms, there are

$$\Big[p + \tfrac{p(p+1)}{2}\Big] + pq + q^2 + \Big[q + \tfrac{q(q+1)}{2}\Big] + \Big[q + \tfrac{q(q+1)}{2}\Big]$$

parameters. These correspond to the observation noise mean and covariance, observation matrix, state evolution matrix, state noise mean and covariance, and finally initial state mean and covariance. However, as demonstrated in Roweis & Ghahramani (1999), the structure of the state noise covariance, D, can be incorporated into F and H. D can therefore be restricted to being an identity matrix with no loss of generality, and so the effective number of parameters is reduced.
2.1.3 Non-linear and/or non-Gaussian extensions
There are a number of commonly used state-space models which will not feature in this
thesis, though a quick summary will serve to show the linear models in context. Either or
both state and observation processes can be varied, and below are some of the possibilities.
One variation on the state process replaces the linear dynamics with a discrete switching process: the state is passed through a 'winner-take-all' non-linearity,

$$y_t = Hx_t + \epsilon_t \qquad (2.20)$$
$$x_t = WTA[Fx_{t-1} + \eta_t] \qquad (2.21)$$

where $WTA[x]$ produces a new vector with all elements set to zero apart from one which is set to unity. This state process dictates which of the columns of H is used to generate the mean of the output distribution. This is a particular form of hidden Markov model where the q mixture components share a common covariance. Setting F = 0 in (2.21) gives a mixture of q Gaussians model with a pooled covariance, and further setting C = 0 in (2.20) produces a vector quantization (VQ) model.
Independent component analysis (ICA) for blind source separation is also discussed in
Roweis & Ghahramani (1999), where the W T A non-linearity is replaced with a general
non-linearity, g(x).
The straightforward linear mapping used in all the models so far can be replaced with
what Rosti & Gales (2001) term a linear discriminant analysis (LDA) observation process.
It is so named since with a static state, the model becomes standard LDA, which aims to
give maximal linear separation of the data by class, by minimising within-class variance
at the same time as maximising between-class variance. See the description of LDA on page 62 for more details.
Mixture distributions
In addition to the pooled covariance Gaussian mixture model described above, there are
other means of modifying linear Gaussian models to give multimodal distributions over
the observations.
Error terms can be replaced with mixture distributions and process parameters chosen
from a number of candidates according to an indicator function. Extending the static
models in this way provides an efficient method of representing multimodal, non-Gaussian,
spatially correlated observations. Parameter estimation and inference for such models
can be found in Rosti & Gales (2001). However, equivalent extensions of dynamic-state
models are subject to problems of computational intractability: state evolution would
require conditioning on a mixture distribution, causing exponential growth in the number
of components describing the state. Filtering and smoothing are therefore intractable
for such models and approximate methods must be employed. Section 7.3.2 on page 235
outlines some of these, and further information can be found in Ghahramani & Hinton
(1996b), Murphy (1998), and Rosti & Gales (2001).
2.2 Articulatory-inspired acoustic modelling for ASR
Efforts to incorporate articulatory information into acoustic modelling for speech recogni-
tion have been made by a number of researchers in a variety of ways. These include using
articulatory parameters directly as features for recognition, recognition by synthesis from
articulatory parameters, and articulatory representations to constrain or dictate model
structure. Articulatory information used in ASR is parameterized in one of the following
two ways:
Continuous In a few systems, such as Zlokarnik (1995a) and Wrench (2001), the artic-
ulatory input takes the form of a number of continuous streams of data which together
give a smoothly-varying description of the motion of the articulators. These may consist
of measured human articulation, or parameters which have been automatically recovered
from an acoustic input.
At recognition time, only acoustic data is available, and so some means of handling the missing articulation is required. One approach is to treat the articulation as missing data during recognition,
and another is to infer articulatory parameters from the acoustic data which are then used
in place of the real ones. Recovering articulation from the acoustics is known in speech
processing as articulatory inversion, and is described in Section 3.1.1. Data generated
in this way will be used in experiments later in the thesis, and referred to as ‘recovered
articulation’.
The systems outlined in this section all use features derived from measured human ar-
ticulation. In all cases, these features are found to aid recognition. However, recovered
articulatory features have yet to prove as useful.
It may be that the similarity of the acoustic and articulatory target domains allowed the MLP to recover some of the cues that aid word discrimination which are present in the articulatory parameters, but less apparent in the acoustics. Alternatively, the increase in recognition accuracy might be due to contextual information contained in the recovered articulatory parameters, as the MLP used an input layer spanning 51 frames, or 0.51 seconds.
Similar experiments were conducted on a larger scale by the collector of the MOCHA
(Wrench 2001) database. The experiments were speaker-dependent, and used 460 TIMIT
type sentences recorded by a southern English female speaker. It was found that aug-
menting an acoustic feature set to include real articulatory parameters gave a 9% rela-
tive increase in phone accuracy for continuous recognition with a triphone HMM system
(Wrench & Hardcastle 2000, Wrench 2001). The baseline acoustic recognition accuracy
was 65% using 14 MFCCs along with their δ and δδ coefficients. An articulatory feature
set was generated by stacking the EMA, laryngograph and electropalatograph (EPG)
data with their corresponding δ and δδ coefficients and performing linear discriminant
analysis (LDA) dimensionality reduction. On the same recognition task as above, the ac-
curacy was 63%, similar yet lower than acoustic-only performance. Combining all acoustic
and articulatory data and again using LDA to produce a 45-dimensional feature vector,
recognition accuracy for the same task was 71%, higher than for either acoustic or artic-
ulatory features used on their own. However, when real articulation was replaced with
articulatory parameters automatically generated from the acoustics using a multi-layer
perceptron (MLP), there was no improvement over the baseline acoustic result (Wrench
& Richmond 2000).
Dynamic Bayesian network system Dynamic Bayesian networks (DBN) are exten-
sions of Bayesian networks for modelling dynamic processes. These models have been
applied to standard acoustic speech recognition (Zweig & Russell 1998), though are ideal
for incorporating articulatory information as handling missing data is straightforward.
Articulation can be used in all or some of the training set, and can be hidden at recogni-
tion time.
With $q_t \in Q$ representing the state at time t, standard HMMs model the probability of each observation $y_t$ given the current state, $p(y_t \mid q_t)$, along with the probability of transitioning between states, $p(q_t \mid q_{t-1})$. The articulatory DBN adds a variable $a_t$ representing the articulatory configuration at time t, replacing the output distribution with

$$p(y_t \mid q_t, a_t) \qquad (2.24)$$

while the articulatory configuration evolves according to

$$p(a_t \mid a_{t-1}, q_t). \qquad (2.25)$$

Under this model, the observations are conditioned on both state and articulator position, shown by (2.24). Furthermore, in (2.25), the new articulatory configurations are conditioned not only on the previous one, but also on the current state, providing an element of contextual modelling.
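Written out, the factorisation implied by (2.24) and (2.25) gives the joint probability of an utterance as a product of conditional probability tables. The sketch below assumes discrete (vector-quantised) observations and particular table layouts; both are illustrative assumptions rather than details of the system described.

```python
import numpy as np

def dbn_log_prob(q_seq, a_seq, y_seq, q_init, a_init, trans, art_cpt, obs_cpt):
    """Joint log-probability of a state sequence q, articulatory configuration
    sequence a and discrete observation sequence y under the factorisation
        p(q_1) p(a_1 | q_1) p(y_1 | q_1, a_1)
          * prod_t p(q_t | q_{t-1}) p(a_t | a_{t-1}, q_t) p(y_t | q_t, a_t).
    Assumed table layouts: q_init[i] = p(q_1 = i); a_init[j, i] = p(a_1 = j | q_1 = i);
    trans[i, k] = p(q_t = k | q_{t-1} = i);
    art_cpt[j, k, m] = p(a_t = m | a_{t-1} = j, q_t = k);
    obs_cpt[k, m, n] = p(y_t = n | q_t = k, a_t = m)."""
    logp = (np.log(q_init[q_seq[0]])
            + np.log(a_init[a_seq[0], q_seq[0]])
            + np.log(obs_cpt[q_seq[0], a_seq[0], y_seq[0]]))
    for t in range(1, len(q_seq)):
        logp += (np.log(trans[q_seq[t - 1], q_seq[t]])
                 + np.log(art_cpt[a_seq[t - 1], q_seq[t], a_seq[t]])
                 + np.log(obs_cpt[q_seq[t], a_seq[t], y_seq[t]]))
    return logp
```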
The Wisconsin x-ray microbeam database (Westbury 1994) provided parallel acoustic-
articulatory data for an isolated word recognition task. The acoustic features were 12
MFCCs and energy along with their δ coefficients, and the articulatory features consisted
of x and y coordinates for 8 articulator positions (upper lip, lower lip, four tongue po-
sitions, lower front tooth, lower back tooth). DBNs take discrete observations, and so
codebooks were generated for the acoustic and articulatory data-sets using K-means.
The acoustic-only word error rate of 8.6% was reduced to 7.6% when the articulatory
data was used during recognition. With the articulation hidden, the system gave a recog-
nition word error rate of 7.8%, which is a 9% relative error decrease over the acoustic
baseline.
Pseudo-articulatory features
Kirchhoff The fullest investigation into recognition using articulatory features has been
reported by Kirchhoff (1998, 2002). One of the main motivations for this work was to build
a system which would be robust to noise. Using a phonetic-level transcript, the telephone
speech corpus OGI numbers95 (Cole, Noel, Lander & Durham 1995) was marked up with
the pseudo-articulatory features shown in Table 2.1 according to a set of canonical phone-feature conversion rules. The conversion mappings can be found in Kirchhoff (1998).

[Table 2.1: Pseudo-articulatory features and levels used to mark up the Numbers95 corpus for Kirchhoff's articulatory feature-based recognition system.]

A separate MLP was trained to predict the values for each feature based on the acoustic input. Using a further MLP, the outputs from these 5 networks were mapped to phone class
posteriors which could be used in a standard hybrid HMM/ANN recognition formulation.
On clean speech, the word error rates for the acoustic and articulatory models were
comparable, 8.4% and 8.9% respectively, though in the presence of a high degree of
additive noise, the articulatory model produced significantly better results. At a noise
level of 0 dB¹, the word error rate for the acoustic model was 50.2%, higher than the
43.6% produced under the articulatory system. When the outputs of the acoustic and
articulatory recognizers were combined, the error rates were lower than for either of the
two individually under a variety of noise levels and on reverberant speech. The framewise
errors for the different articulatory feature groups show that classification performance
on the voicing, rounding and front-back features does not deteriorate as quickly as for
manner and place in the presence of noise. This appears to support the author’s claim
that a system of ‘experts’ where each MLP is only responsible for distinguishing between a
¹ The relative intensity (sound energy) of two signals is measured in decibels (dB): $L = 10 \log_{10} |I_1/I_2|$ (Gold & Morgan 1999). Thus 0 dB signifies a signal and noise with equal intensity.
small number of classes would be more robust to adverse conditions than one ‘monolithic’
classifier. By incorporating confidence scores when the outputs of individual classifiers
are combined, the system could be tailored to particular operating conditions.
Similar experiments were performed on a larger spontaneous dialogue corpus. Verbmo-
bil (Kohler, Lex, Patzold, Scheffers, Simpson & Thon 1994) contains 31 hours of training
data and 41 minutes of test data from a total of 731 speakers. Once again, improvements
were shown when acoustic and articulatory features were combined. The word error rate
in this case was 27.4%, a 6% relative error reduction on the acoustic baseline of 29.0%.
King King, Stephenson, Isard, Taylor & Strachan (1998) and King & Taylor (2000)
also report recognition experiments based on the combination of the output of a number
of independent neural network classifiers. The work was primarily aimed at comparing
phonological feature sets on which to base the classifiers, though the feature predictions
were also combined to give TIMIT phone recognition results. Unlike Kirchhoff who used
a neural network to combine the independent feature classifiers, the predicted feature
values were used as observations in an HMM system. The resulting recognition accuracy
of 63.5% was higher than the result of 63.3% found using standard acoustic HMMs, though
the increase was not statistically significant. The need for an asynchronous articulatory
model was demonstrated using classifications of a set of binary features derived from
Chomsky & Halle (1968). In cases where features changed value at phone boundaries,
allowing transitions within two frames of the reference time to be counted as correct, the
percentage of frames where all features were correct rose from 52% to 63%. Furthermore,
the accuracy with which features were mapped onto the nearest phone rose from 59% to
70%. This demonstrates the limiting nature of forcing hard decisions at phone boundaries
onto asynchronous data. In both King and Kirchhoff’s systems, the individual feature
classifiers were independent. Using a neural network to map features to phone posterior
probabilities in the latter gave an implicit model of asynchrony.
2.2.2 Using articulatory parameters to derive HMM topology

Phones are almost always used as the sub-word units for speech recognition. They provide
a useful means of describing the structure of language, and the availability of phonetic
lexica means that they are a convenient choice for recognizers. However, the realisation
of phones varies considerably according to the context in which they occur. Section 1.4
described the increased parameterization which accompanies modelling contextual varia-
tion in standard HMM systems. Rather than building vast numbers of models and then
reducing parameter numbers by tying states, efforts have been made to use articulatory
knowledge to build compact sets of states which still include contextual information where
necessary.
HAMM Richardson et al. (2000a, 2000b) drew on the work by Erler & Freeman (1996)
in devising the hidden articulator Markov model (HAMM), which is an HMM where each
articulatory configuration is modelled by a separate state. The state transitions aim to
reflect human articulation: static constraints disallow configurations which would not
occur in American English, and dynamic constraints ensure that only physically possible
movements are allowed. Furthermore, asynchronous articulator movement is allowed as
each feature can change value independently of the others. In addition to the static
constraints which reduced the number of states from 25, 600 to 6, 676, the number of
parameters was further reduced by removing states with low occupancy during training.
The recognition task was PHONEBOOK, an isolated word, telephone speech corpus.
With a 600 word lexicon, the HAMM gave a significantly higher word error rate than a
standard 4-state HMM. These were 7.56% and 5.76% respectively. However, a combina-
tion of the models gave a word error rate of 4.56%, a relative reduction of 21% on the
HMM system.
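To make the construction of such a state inventory concrete, the sketch below enumerates articulatory configurations and prunes them with static and dynamic constraints. It is purely illustrative: the feature set, value ranges and constraint rules are invented here and are not those used by Richardson et al.

```python
from itertools import product

# Hypothetical multi-valued articulatory features (not the HAMM's actual inventory).
feature_values = {
    "lip_opening":   range(4),   # 0 = closed ... 3 = wide
    "tongue_height": range(5),
    "tongue_front":  range(4),
    "velum":         range(2),   # 0 = raised (oral), 1 = lowered (nasal)
    "voicing":       range(2),
}

def statically_valid(cfg):
    """Toy static constraint: rule out one configuration class assumed impossible."""
    return not (cfg["velum"] == 1 and cfg["lip_opening"] == 0 and cfg["tongue_height"] == 0)

def dynamically_valid(cfg_a, cfg_b, max_step=1):
    """Toy dynamic constraint: each feature may change by at most one level per step,
    so articulators move asynchronously but only through adjacent values."""
    return all(abs(cfg_a[f] - cfg_b[f]) <= max_step for f in cfg_a)

# Every configuration surviving the static constraints becomes one HMM state.
all_configs = [dict(zip(feature_values, combo)) for combo in product(*feature_values.values())]
states = [c for c in all_configs if statically_valid(c)]

reachable = [s for s in states if dynamically_valid(states[0], s)]
print(len(all_configs), "raw configurations,", len(states), "states,",
      len(reachable), "reachable from the first state in one step")
```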
Deng Deng and his group (1994a, 1994b) have also worked on building HMM systems
where each state represents an articulatory configuration. Following Chomsky’s theory
of distinctive features and a system of phonology composed of multi-valued articulatory
structures (Browman & Goldstein 1992), they have developed a detailed system for de-
riving HMM state transition networks based on a set of ‘atomic’ units. These units
represent all combinations of a set of overlapping articulatory features possible under a
set of hand-written rules.
Five multi-levelled articulatory features are used: lips, tongue blade, tongue dorsum,
velum and larynx. Within each, a ‘0’ level is included and used to indicate when the
feature is irrelevant to the specification of a phone. Each phone is mapped to a static
articulatory configuration, apart from affricates which are decomposed into stop and
fricative portions, and diphthongs which are made up by concatenating appropriate pairs
of vowels. Features can spread according to the relative timing of the phone, by 25%,
50%, 75%, or 100%. When the spread is by 100%, a feature can extend as far as the
boundary of the next phone in either direction. Then, depending on the configuration of
the following phone, the feature might spread further. In this way, long span dependencies
can be modelled. When articulatory feature bundles overlap asynchronously, new states
are created for the intermediate portions which either describe the transitions between
phones or allophonic variation.
On a TIMIT classification task, HMMs constructed from these units achieved an accu-
racy of 73% compared with context-independent HMMs of phones which gave an accuracy
of 62%. The feature-based HMMs also required fewer mixture components (typically 5)
than standard phone-based HMMs. This suggests that a principled approach to state
selection will require fewer parameters and therefore less training data, as each state is
modelling a more consistent region of the acoustic signal.
This work was developed to include higher level linguistic information by Sun, Jing &
Deng (2000). This included utterance, word, morpheme and syllable boundaries, syllable
onset, nucleus and coda, along with word stress and sentence accents. This time, results
were reported on TIMIT phone recognition, rather than classification. A recognition
accuracy of 73.0% was found using the feature-based HMM, which compares favourably
to a baseline triphone HMM which gave an accuracy of 70.9%. This represents a 7.2%
relative decrease in error.
Blackburn (1995, 1996, 1996, 2000) saw the need to model contextual effects in the speech
signal in an efficient manner as crucial to the development of speech recognition. His
thesis centred around the investigation of an articulatory speech production model (SPM)
which enabled modelling of co-articulation in the time-domain. Experiments using real
articulatory data were carried out on the University of Wisconsin (UW) x-ray microbeam
data (Westbury 1994), and further work used the resource management (RM) (Price,
Fisher, Bernstein & Pallett 1988) corpus. The system took output from HTK (Young
1993), an HMM recognizer, and performed N-best rescoring by re-synthesising articulatory
traces from each time-aligned phone sequence and mapping these into log-spectra using
MLPs. Errors between these and the original speech data were calculated and used to
re-order the N-best list.
In performing the re-synthesis, the assumption is made that the articulatory gestures
which make up each phone can be divided into two parts. In the first, the articulators
move away from the configuration corresponding to the previous phone, and in the second,
they move toward the configuration needed for the following one. Furthermore, there is
some region in which the articulatory configuration gives rise to the sound corresponding
to the current phone. This may or may not be static, and is assumed to fall roughly in
the middle of the phone.
The SPM models the position of each articulator at the centre of each phone model
using a Gaussian distribution, P ∼ N(µ_P, σ_P²). A notion of the articulatory effort
required to produce each phone by each of the articulators is introduced by computing
the curvatures (2nd-order derivatives) of linearly-interpolated mean articulator positions
over a set of time-aligned phone sequences. These are also modelled using Gaussian
distributions, C ∼ N(µ_C, σ_C²), and combined with the original articulator-phone positional
distribution to arrive at a distribution of position conditioned on curvature for each phone
and articulator, P | C.
Computing the most likely articulator positions for a sequence of time-aligned phones
therefore consists of linearly interpolating the unconditional articulator position means,
computing the curvatures for each one, and arriving at new means using the conditional
distribution. The end result is a complete set of articulatory traces which incorporate
the strength of contextual effects. There was a significant reduction in mean square error
between the real and recovered articulation on the UW test set when the co-articulation
modelling was included, compared to a baseline where the initial phone-articulator pre-
dictions were not adjusted.
The mapping from the re-synthesised articulation to line-spectral acoustic features
was handled with a separate MLP for each phone. On the UW corpus, recognition
performance was enhanced for all but one of the speakers in the test set using N -best
lists with 2 ≤ N ≤ 5, however for higher N , the increases on some speakers were offset
by decreases for others. The SPM made most contribution to the recognition accuracy
for speakers for which there was poor initial performance.
Experiments were also conducted on the RM corpus. Here, N -best rescoring for
small N offered modest gains, though performance deteriorated with N = 100. However,
combining the HTK and SPM output probabilities in the log domain gave relative error
reductions of 6.9% and 6.0% for two of the speakers, suggesting some value in the addition
of an articulatory re-synthesis stage.
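The combination of recognizer and re-synthesis scores amounts to a weighted sum of log-domain probabilities followed by a re-ordering of the N-best list. The following sketch shows the general mechanism; the interpolation weight and the scores are invented and are not Blackburn's reported settings.

```python
def rescore_nbest(hypotheses, weight=0.5):
    """Re-rank an N-best list by interpolating two log-domain scores.

    hypotheses: list of (word_sequence, recognizer_logprob, resynthesis_logprob);
    the 50/50 interpolation weight is illustrative only."""
    combined = [(words, weight * rec + (1.0 - weight) * resyn)
                for words, rec, resyn in hypotheses]
    return sorted(combined, key=lambda item: item[1], reverse=True)

# Toy example with made-up scores.
nbest = [("the cat sat", -120.3, -88.1),
         ("the cat sang", -119.8, -95.4),
         ("a cat sat", -121.0, -86.7)]
for words, score in rescore_nbest(nbest):
    print(f"{score:8.2f}  {words}")
```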
2.3 Segment modelling for speech
Some of the early attempts at speech recognition used what were essentially segment
models, though these were rule-based rather than probabilistic models. The last 15 years
have seen steady and continued interest in statistical segment modelling from a variety
of groups and researchers. However, despite the theoretical advantage of relaxing HMM
independence assumptions, segment models have yet to make it into mainstream ASR.
There is a key design choice that is made in any segment model implementation, which
is whether the models are of fixed-length or variable-length sequences of observations.
Variable-length The segment model can take observation sequences with a range of
durations. Decoding is straightforward as all complete hypotheses account for an equal
number of frames.
The methods below use models which aim to capture the temporal correlations within
segmental units. In some cases, such as the ANN methods of Section 2.3.5 or the
dynamical system model of Section 2.3.6, the models also have the capacity to model spatial
correlations across feature dimensions.
By the early 1990s, HMMs were the dominant acoustic model for speech recognition.
Recognizing both the efficiency of the HMM framework and the limitations of a frame-
based model, Russell (1993) introduced the segmental HMM. Under this model, there
is a target tj associated with each state qj ∈ Q which may be a static Gaussian distri-
bution (Russell 1993), or a linear trajectory (Holmes & Russell 1995). Observations are
assumed to be independent given each state qj , though the output distribution is condi-
tioned on an extra-segmental distribution p(tj ) which remains fixed for the entire state
occupancy. Since tj represents the target associated with state qj , the joint probability
of an observation sequence y_1^N = {y_1, ..., y_N} and a state q_j can be written as:

\[ p(y_1^N, q_j) = p(t_j)\, p(y_1^N \mid t_j) = p(t_j) \prod_{i=1}^{N} p(y_i \mid t_j) \qquad (2.26) \]
Equation 2.26 shows how the segmental HMM factors the distribution over the observa-
tions into intra-segmental and extra-segmental variation, as described in Section 1.3.3 on
page 8. The extra-segmental distribution p(tj ) models properties of the speech parame-
ters which remain static over entire segments, such as speaker identity, while the intra-
segmental variation p(y1N | tj ) characterises the distribution of the observations around a
given target once external factors have been accounted for. This is distinct from standard
HMMs where both variation sources are modelled together by the mixture components
of the output distribution regardless of the time-scale over which they occur.
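As a small numerical illustration of the factorisation in Equation 2.26, the fragment below evaluates the joint density of a one-dimensional segment and a particular target value under a Gaussian extra-segmental prior and a Gaussian intra-segmental distribution. All values are invented and the scalar case is chosen purely for clarity.

```python
import numpy as np
from scipy.stats import norm

# Extra-segmental prior over the target t_j, and intra-segmental spread about it.
mu_t, sigma_t = 0.0, 1.0   # parameters of p(t_j)
sigma_y = 0.5              # spread of p(y_i | t_j)

y = np.array([0.30, 0.40, 0.25, 0.50])   # one segment of observations
t = 0.35                                  # a candidate target value

# Following Equation 2.26: p(t_j) * prod_i p(y_i | t_j), evaluated in the log domain.
log_p = norm.logpdf(t, mu_t, sigma_t) + norm.logpdf(y, t, sigma_y).sum()
print("log p(y, t | q_j) =", round(float(log_p), 3))
```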
Rather than directly modelling sequences of frames, another approach has been to model
the features derived from a series of frames. Gish & Ng (1993) fitted a polynomial of the
form

\[ C = ZB + E \qquad (2.27) \]

where C is the matrix whose rows are the observations of a segment, Z is a design matrix of time terms, B holds the polynomial coefficients and E the residual errors, and so

\[ c_{n,i} = b_{1,i} + b_{2,i}\,\frac{n-1}{N-1} + b_{3,i}\left(\frac{n-1}{N-1}\right)^{2} + e_{n,i} \qquad (2.28) \]
where n = 1, . . . , N and i = 1, . . . , D. For any given segment, polynomial model param-
eters are found using a least squares approach. Segment duration N , along with B and
E are then used in place of the original features in an HMM system. With phones as
segments, significant improvement over an HMM baseline system was demonstrated on
a 20 keyword spotting task. However, the main drawback of such an approach is that
phonetic alignments must be available or derived by some other means.
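A least-squares fit of the quadratic model in Equations 2.27–2.28 to a single segment can be written in a few lines; the sketch below uses invented data and makes no attempt to reproduce the keyword-spotting system itself.

```python
import numpy as np

def fit_segment_polynomial(Y):
    """Fit the quadratic trajectory model of Equations 2.27-2.28 to one segment.

    Y: (N, D) array of feature vectors for the segment.
    Returns the (3, D) coefficient matrix B and the (N, D) residual matrix E."""
    N = Y.shape[0]
    s = np.arange(N) / (N - 1.0)                    # normalised time (n-1)/(N-1)
    Z = np.column_stack([np.ones(N), s, s ** 2])    # design matrix
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    E = Y - Z @ B
    return B, E

# Toy segment: 8 frames of a 3-dimensional feature.
rng = np.random.default_rng(0)
Y = np.cumsum(rng.normal(size=(8, 3)), axis=0)
B, E = fit_segment_polynomial(Y)
print("coefficients:\n", B.round(2))
print("residual RMS:", round(float(np.sqrt((E ** 2).mean())), 3))
```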
Yun & Oh (2002) developed this idea using the polynomial features as input to a
segmental HMM, making full recognition practical rather than just classification of pre-
aligned data. However, in their system, the parametric features are computed over fixed-
length sliding windows of frames rather than variable-length time-normalised segments.
In this case, with an analysis window of N = 2M+1 frames, C becomes C_t = Y_{t-M}^{t+M}, where
the rows are vectors of features centred on y_t. The design matrix is modified accordingly,
so that the rows reflect relative positions from the current time. Again taking the case
with R = 3 as an example, Z would be set to:

\[ Z = \begin{bmatrix}
1 & -\frac{M}{2M} & \left(-\frac{M}{2M}\right)^{2} \\
1 & -\frac{M-1}{2M} & \left(-\frac{M-1}{2M}\right)^{2} \\
\vdots & \vdots & \vdots \\
1 & 0 & 0 \\
\vdots & \vdots & \vdots \\
1 & \frac{M-1}{2M} & \left(\frac{M-1}{2M}\right)^{2} \\
1 & \frac{M}{2M} & \left(\frac{M}{2M}\right)^{2}
\end{bmatrix} \]
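The relative-time design matrix above is simple to construct; the fragment below builds it for an arbitrary half-window M (chosen here purely for illustration) and polynomial order R.

```python
import numpy as np

def window_design_matrix(M, R=3):
    """Design matrix whose rows correspond to offsets -M..M from the current frame,
    scaled by 2M, with columns 1, s, s^2, ..., s^(R-1)."""
    s = np.arange(-M, M + 1) / (2.0 * M)
    return np.column_stack([s ** r for r in range(R)])

print(window_design_matrix(M=2))   # 5 rows for offsets -2..2, columns [1, s, s^2]
```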
Context independent recognition on TIMIT was used to compare the new segmental
feature HMM (SFHMM) with standard frame-based HMMs. Experimental results showed
that the SFHMM gave a modest improvement on a standard HMM baseline using an
MFCC parameterization of the acoustics with no δ features. However, an SFHMM system
using MFCCs and their δs outperformed a standard HMM system which included both
δ and δδ features. For systems with 2 Gaussian mixture components, the baseline HMM
gave a recognition accuracy of 57.0%, and an SFHMM with fixed segment length of 5 and
polynomial order 4 gave an accuracy of 60.1%. In this case, the segmental polynomial
features were demonstrated to offer more than simply replacing the 2nd order derivative
coefficients.
Appending δ and δδ parameters to feature vectors has become the standard approach
by which a model of speech signal dynamics is included in state-of-the-art HMM speech
recognition systems such as HTK (Young 1993). The properties of frame-based models of
dynamic features are discussed in section 4.3 on page 102. Another approach which aims
to account for the time dependencies present in the speech signal was taken by Woodland
(1992). In this work, the mean vector of the output distribution associated with each
state is modelled as being dependent on other observations. With the system occupying
state q_j ∈ Q, and y_t denoting the observation at time t, the predicted observation ŷ_t^{q_j} is
given by:

\[ \hat{y}_t^{q_j} = \mu_0^{q_j} + \sum_{p} A_{k_p}^{q_j} \left( y_{t+k_p} - \mu_{k_p}^{q_j} \right) \qquad (2.29) \]

In this equation, k_p is the amount by which the explanatory variable is offset from the
current time, and µ_{k_p}^{q_j} and A_{k_p}^{q_j} are the mean vector and prediction matrix associated with
offset k_p respectively. This model reduces to a standard HMM when A_{k_p}^{q_j} is set to zero.
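A sketch of evaluating the state-conditioned prediction of Equation 2.29 is given below; the dimensions, offsets and parameter values are invented, and in a real system the means and prediction matrices would be estimated during training.

```python
import numpy as np

def predict_observation(mu0, offsets, mus, As, y, t):
    """Equation 2.29: predicted observation for one state at time t.

    offsets: list of frame offsets k_p
    mus, As: dicts mapping each offset to its mean vector and prediction matrix
    y:       (T, D) array of observations"""
    y_hat = mu0.copy()
    for k in offsets:
        y_hat += As[k] @ (y[t + k] - mus[k])
    return y_hat

# Toy 2-dimensional example with a single predictor at an offset of -3 frames.
D = 2
rng = np.random.default_rng(1)
y = rng.normal(size=(20, D))
mu0 = np.zeros(D)
offsets = [-3]
mus = {-3: np.zeros(D)}
As = {-3: 0.5 * np.eye(D)}
print(predict_observation(mu0, offsets, mus, As, y, t=10))
```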
The model was evaluated on a multi-speaker isolated-word British English E-set
database (an E-set database is composed of the words ‘B’, ‘C’, ‘D’, ‘E’, ‘G’, ‘P’, ‘T’, and
‘V’), for which 12 MFCCs and their corresponding δs were derived giving a 24-
dimensional feature vector. Using a single predictor at offsets of +3 or −3 frames gave sim-
ilar results, with test set errors of 3.9% and 3.8% respectively. These compare favourably
with the baseline system result of 5.6% errors, in which full covariance Gaussian distri-
butions were used to model the observations. However, when predictors at both +3 and
−3 frames were used, the error rate increased to 6.9%. The suggested cause was that in
this case there were more parameters than could be reliably estimated from the available
data. A discriminative approach to training was then employed, and using a predictor
with an offset of −3, the error rate was further reduced to 2.8%.
Iyer et al. (1998, 1999) looked at a means of modelling the time dependencies in the
parameterized speech signal without sacrificing any of the computational efficiency of a
standard HMM system. They suggested that in fact, HMMs do model the trajectories
in acoustic features. However, this does not rely on any knowledge of the underlying
structure and instead is achieved through switching mixture components as necessary.
They proposed modelling each phonetic unit with a set of M parallel HMMs, in which
transitions are made left to right along individual HMMs, but not across the parallel
paths. In an analogous fashion to segmental HMMs, intra-trajectory and extra-trajectory
distributions are then accounted for independently.
Good initial parameter estimates were important to ensure that the inter-trajectory
variation was captured. Otherwise, the model could simply degenerate into multiple
HMMs for each phonetic unit. Initialisation first used a standard HMM to give a state-
level alignment. The observations corresponding to each state were then assigned to a
particular path according to the clustering of a trajectory model. Two such models were
experimented with, the first being the parametric feature model of Section 2.3.2, and the
second was a non-parametric model described in Siu, Iyer, Gish & Quillen (1998). Once
initialised, the parallel path HMMs were trained using the same techniques as standard
HMMs.
Recognition results are reported on the Switchboard and Callhome corpora (Godfrey,
Holliman & McDaniel 1992) with 2-path models. A slight decrease in word error rate
was found when the models were used directly to produce output probabilities. However,
when the parallel path HMM probabilities were combined with existing HMM scores in
an output lattice, a 1% absolute, or 2.9% relative, error reduction was found.
Hybrid ANN/HMM ASR occupies the territory between frame-based and segment models
as the inputs are neither frames nor segments, but context windows over the observations.
Artificial neural networks have been used to provide posterior probabilities for standard
HMMs, segment models, as well as the observation process in non-linear state-space mod-
els.
Zavaliagkos, Zhao, Schwartz & Makhoul (1994) describe a hybrid segmental ANN/HMM
model which was applied to word recognition on the RM corpus within an N -best paradigm.
Experiments used single layer or elliptical basis networks to rescore word lattices gener-
ated using standard HMMs. Artificial neural networks require a fixed-sized input layer,
and two methods of time normalisation were compared. The first used a quasi-linear
sampling of the feature vectors, either repeating or ignoring frames of features, but never
actually interpolating. The second took a discrete cosine transformation (DCT) of the
sequence of feature values across each dimension of the observations. These were trun-
cated to give a fixed number of coefficients. The DCT approach was found to give the
best final performance. For 5 RM test sets, lattice rescoring using a combination of both
types of network and HMMs gave error reductions of between 9% and 29% relative to an
HMM baseline.
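The DCT-based time normalisation can be sketched as follows: each feature dimension of a variable-length segment is transformed along the time axis and truncated to a fixed number of coefficients, giving a constant-size network input. The number of retained coefficients below is an arbitrary choice, not the value used by Zavaliagkos et al.

```python
import numpy as np
from scipy.fft import dct

def fixed_length_input(Y, n_coeffs=5):
    """Time-normalise a variable-length segment by taking a type-II DCT along the
    time axis of each feature dimension and keeping the first n_coeffs coefficients.

    Y: (N, D) array with N >= n_coeffs; the result is (n_coeffs, D) regardless of N."""
    C = dct(Y, axis=0, norm="ortho")
    return C[:n_coeffs]

# Two segments of different lengths map to network inputs of identical size.
rng = np.random.default_rng(2)
short, long_ = rng.normal(size=(6, 12)), rng.normal(size=(19, 12))
print(fixed_length_input(short).shape, fixed_length_input(long_).shape)
```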
Verhasselt, Cremelie & Marten (1998) worked with a similar idea, using ANNs to
give posterior probabilities for entire segments. However, the single layer networks were
replaced with multi-layer perceptrons (MLP) and a pre-segmentation algorithm was in-
cluded which limited the set of allowable segmentations, making full recognition feasible.
Another addition was that the probabilistic framework made explicit use of segmentation
probabilities. Phone recognition on the TIMIT core test set combined the outputs of
diphone and context independent phone segment scores, giving a recognition accuracy of
70.2%, comparable to some of the highest reported performances for this task.
Richards & Bridle (1999) introduced the hidden dynamic model (HDM) which sought to
give an explicit model of co-articulation within a segmental framework. A static target or
series of targets in a hidden state space was associated with each phone in the inventory.
Given a time-aligned sequence of phones, a Kalman smoother was run over the targets
to produce a continuous trajectory through the state space. These trajectories were
connected to the surface acoustics by means of a single MLP mapping.
All targets were set to zero and network weights randomised at the start of training,
which was by a form of gradient descent. Preliminary results are reported in Picone, Pike,
Regan, Kamm, Bridle, Deng, Ma, Richards & Schuster (1999) for an N-best rescoring task
on the switchboard corpus. With a baseline word error rate of 48.2% from a standard
HMM system, 5-best rescoring with the reference transcription included using the HDM
gave a reduced error rate of 34.7%. An identical rescoring experiment using an HMM
trained on the data used to build the HDM gave a word error rate of 44.8%. This suggests
that the HDM was able to capture information that the HMM could not.
Iso (1993) investigated a similar model in which a set of observations y_1, ..., y_t, ..., y_N
were connected to a sequence of control commands x_1, ..., x_k, ..., x_K by means of a
non-linear predictor.
As with the HDM, the control commands were static targets, though diagonal covariance
Gaussian distributions over the state-space were used in place of state vectors. Parameters
were estimated using a form of gradient ascent, and two alternative objective functions
were suggested, one maximum likelihood and the other discriminative. Some preliminary
experiments confirmed the benefit of diagonal over identity covariances for the control
commands, and also improved results using a discriminative approach to training. How-
ever, no baseline using a conventional system was given so real benefits are hard to
quantify.
In the late 1980s, Mari Ostendorf and members of her group in Boston began an investi-
gation into segment modelling. They introduced the stochastic segment model (SSM)3 , a
framework within which to research the distributional forms and modelling assumptions
necessary to account for inter-frame correlations and temporal dependencies.
The likelihood of a segment factors into acoustic and duration models, which provide
p(y_1^l | l, m) and p(l | m) respectively. Rather than describing y_1^l directly, the original
formulation of the SSM modelled fixed-length (see page 33) segments denoted
z_1^N = {z_1, ..., z_N}. An observed segment y_1^l is considered to be a linear down-sampling
of z_1^N using the relation y_1^l = z_1^N T_{m,l}, and the likelihood of the observed segment is found
by integrating over the fixed-length sequences consistent with this mapping:

\[ p(Y \mid l, m) = \int_{Z \,:\, Y = Z T_{m,l}} p(Z \mid m)\, dZ \qquad (2.32) \]
³ The reader who is familiar with the original SSM literature should be aware that some of the notation has been altered to fit in with the conventions used in this thesis.
The full covariance model describes the entire fixed-length sequence z_1^N with a single
Np-dimensional Gaussian. This takes full account of the intra-segmental correlation structure,
though requires a large number of parameters. Modelling a p-dimensional observation
with N frames would require an Np × Np covariance matrix. Robust estimation of such
a model would require a great deal of data, and was in fact found to be impractical for
systems with more than a few feature dimensions (Digalakis, Ostendorf & Rohlicek 1989).
The independent-frame, or block-diagonal model reduces the parameterization by
assuming that successive frames are independent given l, the length of the segment. The
probability of the fixed-length vector sequence z_1^N under model m is then:

\[ p(z_1^N \mid m) = p_1(z_1 \mid m)\, p_2(z_2 \mid m) \cdots p_N(z_N \mid m) \qquad (2.33) \]
where pi is the probability density which models the ith frame. Even though the inter-
frame correlation modelling, the potential advantage of SSMs, is ignored under such a
model, it has been shown to match HMM performance. Digalakis et al. (1989) report
that with a similar number of parameters and context-independent models, an HMM and
an independent-frame SSM produce the same phone classification accuracies.
A target-state model assumes that every segment model is associated with a static
target in some state-space, X. The target might vary according to factors such as context
or speaker characteristics, and is modelled by a zero-mean Gaussian distribution:

\[ z_t = H_t x_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(v, C) \qquad (2.34) \]
\[ x_t = \eta_t, \qquad \eta_t \sim \mathcal{N}(0, D) \qquad (2.35) \]
A Gauss-Markov model instead conditions each frame on its predecessor,
p(z_1^N | m) = p_1(z_1 | m) ∏_{i=2}^{N} p_i(z_i | z_{i−1}, m), where the p_i(z_i | z_{i−1}) are conditional
Gaussian densities. As shown by Digalakis (1992), and in Section 3.4, most of the variation
in a given frame of acoustic speech data can be explained using the previous frame as a
predictor, which supports this choice of model structure.
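To illustrate the difference between these two model structures, the fragment below generates one segment from the target-state model of Equations 2.34–2.35 (drawing a single target for the whole segment) and one from a first-order Gauss-Markov process. The dimensions, matrices and noise levels are invented; this is not Digalakis' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, q = 10, 4, 2                      # frames, observation and state dimensions (illustrative)
H = rng.normal(size=(p, q))             # observation matrix H_t, held fixed here
v, C = np.zeros(p), 0.1 * np.eye(p)     # observation noise mean and covariance
D = np.eye(q)                           # target covariance

# Target-state model: one target per segment, observed with additive noise (Eqs 2.34-2.35).
x = rng.multivariate_normal(np.zeros(q), D)
Z_target = np.array([H @ x + rng.multivariate_normal(v, C) for _ in range(N)])

# Gauss-Markov model: each frame depends linearly on its predecessor.
F = 0.9 * np.eye(p)
frames = [rng.multivariate_normal(np.zeros(p), np.eye(p))]
for _ in range(N - 1):
    frames.append(F @ frames[-1] + rng.multivariate_normal(np.zeros(p), 0.1 * np.eye(p)))
Z_gauss_markov = np.array(frames)

print(Z_target.shape, Z_gauss_markov.shape)
```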
For both the target-state and Gauss-Markov models, TIMIT context independent clas-
sification accuracies were higher, 52.0% and 53.9% respectively, than for the independent-
frame model, which gave an accuracy of 50.6% using a feature set of 18 MFCCs (Ostendorf
& Digalakis 1991, Digalakis 1992). When δ coefficients were included in the feature set,
performance using the target-state and Gauss-Markov models deteriorated, giving lower
classification accuracies than an independent frame model. This result was attributed to
non-linearities near segment boundaries. Furthermore, the lack of an observation noise
term and the associated smoothing in the Gauss-Markov model was thought to contribute
to a mismatch between training and test data (Digalakis 1992).
A generalisation of the Gauss-Markov model is the dynamic system (DS) model,
referred to in this thesis as the linear dynamic model (LDM). These were introduced in
Section 2.1.2 on page 20 and are described fully in Chapter 4.
These models were the focus of Vasilios Digalakis’ thesis (1992), where approaches to
estimation, classification, and recognition with the models were presented. Digalakis
termed the assumptions made in modelling fixed-length and variable-length segments with
LDMs trajectory invariance and correlation invariance respectively. Trajectory invariance
assumed that there was a single, fixed trajectory underlying all instances of a phone. In
the standard SSM fashion, a linear time-warping was used to normalise segment durations
before the model was applied. Correlation invariance assumed that within each segment,
there were a number of regions in which inter-frame correlations were static. In this case,
no time-warping was used, but a deterministic mapping dictated the durations of each of
the sub-segmental models.
On a TIMIT classification task, with phones as segments, similar results were found
under each of the two assumptions for longer phones. However, poor performance arose
using the trajectory-invariance formulation when an observation sequence was consider-
ably shorter than its hypothesised length. The problem was thought to be that under the
trajectory invariance assumption, the correlations originally present in short instances of
Segmental mixtures
Kimball (1994) developed some of the ideas of the stochastic segment model, considering
the addition of mixture modelling. Gaussian mixtures rather than single Gaussians were
used for frame-level modelling in the independent-frame SSM. This gives a model which
can be cast as an HMM with a deterministic state sequence dependent on the segment
length. The model was further augmented to have segment-level mixtures. In this case
the model can be derived as a segmental mixture HMM as described by Gales & Young
(1993), though with a fixed state transition sequence dependent on the segment length.
Recognition experiments on the RM corpus showed similar results for these two systems,
both of which gave a lower word error rate than an independent-frame SSM baseline.
Goldenthal (1994) also explored methods for explicitly modelling the dynamics underlying
speech sounds. His approach involved modelling each speech segment using a track con-
sisting of a sequence of averaged acoustic parameters. Variable-length and fixed-length
segments were experimented with, and the fixed, or trajectory invariant tracks adopted.
A form of fractional linear interpolation was used for the time normalisation in which
each acoustic observation contributed its data proportionally to adjacent frames of the
track according to how closely the time-scales corresponded. The tracks were typically
set to a length of 10 frames.
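A simple way to map a variable-length segment onto a fixed-length track is linear interpolation along a normalised time axis, a simplification of the fractional apportioning described above; the sketch below fixes the track length at 10 frames, with all data invented.

```python
import numpy as np

def to_fixed_track(Y, track_len=10):
    """Map a variable-length segment (N, D) onto a fixed-length track (track_len, D)
    by linear interpolation along a normalised time axis."""
    N, D = Y.shape
    src = np.linspace(0.0, 1.0, N)
    dst = np.linspace(0.0, 1.0, track_len)
    return np.column_stack([np.interp(dst, src, Y[:, d]) for d in range(D)])

Y = np.random.default_rng(4).normal(size=(7, 14))   # a 7-frame segment of 14-dimensional features
print(to_fixed_track(Y).shape)                      # (10, 14)
```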
Just as with the full covariance SSM of Section 2.3.6, data limitations meant that
estimation of a full covariance matrix for each track was not practical. One approach to
reducing the number of parameters would be to split each model into regions in which
the full correlation structure is modelled. However, in this work, residuals were first
calculated by differencing corresponding frames of track and observations. These errors
were then partitioned into Q regions in which the full correlation structure was mod-
elled. If a model has accounted for the dynamics and correlation structure of a segment,
then the residual energy will be small and show weak inter-frame dependencies in com-
parison to the original data. It was proposed that by partitioning at this level, little
information is sacrificed. Sub-segmenting the residual tracks into Q = 3 regions gave a
context-independent TIMIT classification accuracy of 74.2%, and then 75.2% with the
log-duration of the token modelled as a Gaussian and built into the error distribution.
Two approaches to dealing with contextual variation were also investigated. The first
involved creating clusters of biphone tracks covering each model in each left and right
context independently, which could then be merged to create any given triphone. The
second made explicit models of the dynamics at segment boundaries by generating tracks
for every transition in the training data. These were clustered to provide 200 transition
models which were incorporated in the acoustic likelihood calculation during recognition.
Context-independent phone recognition on the TIMIT core test set gave an accuracy
of 61.9%, where the usual confusions as given on page 150 were allowed in reporting
results. This was increased to 63.9% when the transition models were included. Simi-
larly, the transition models improved context-dependent recognition performance, with
the accuracy of 66.5% being raised to 69.3%. These results compare favourably with
the state-of-the-art in TIMIT phone recognition which will be summarised below, though
it should be noted that trigram language models were used in producing the context-
dependent results, rather than bigrams which are normally used when reporting TIMIT
recognition results.
2.4 Conclusions
In this chapter we have outlined non-standard approaches to acoustic modelling for speech
recognition. The models of Section 2.2 incorporated articulatory information, and in
Section 2.3, the models sought to account for the temporal dependencies present in speech
data. Many of these techniques were shown to be useful compared to acoustic-only or
frame-based baseline models; however, it is interesting to compare these results with
others in the literature.
Many of the experiments described above give results for TIMIT phone recognition,
a well known task which is commonly used for development of ASR systems. Currently,
the highest reported accuracy is by Robinson (1994) using a hybrid ANN/HMM system.
Chen & Jamieson (1996) report a result which may be higher, though a percent correct
is quoted rather than recognition accuracy. This is an unreliable measure since it does
not penalise insertion errors.
Table 2.2 shows the recognition accuracies for the systems described in this chapter,
along with the best reported ANN/HMM and HMM results for this task. In all cases
apart from Goldenthal’s context-dependent models, bigram language models were used.
All results are on the NIST core test set which contains the 8 si and sx sentences from
2 male and 1 female speakers from each of the 8 dialect regions. Results on the core test
set tend to be slightly lower than those on the full test data, which may be due to the
balance of dialect and gender not following that of the training data. Note that Table 2.2
does not include Yun & Oh (2002) as results were based on the full test set and also
included the sa sentences, which are the same for every speaker in both training and test
sets. The work of Digalakis (1992) is also excluded as experiments used an early version
of TIMIT which follows a different allocation of training and test speakers.

Table 2.2: Summary of TIMIT phone recognition accuracies for the systems described in this
chapter, along with the highest reported ANN/HMM and HMM accuracies for this task. Context
dependent and context independent models are denoted by CD and CI respectively.
However, it is apparent that many of these methods give accuracies which are close
to the state-of-the-art. Indeed, Deng’s articulatory-motivated state selection gives the
highest HMM accuracy, and both Verhasselt and Goldenthal produce systems which out-
perform a standard triphone HMM recognizer.
Chapter 3
Preliminaries
This chapter will give a review of the data and some of the techniques which will be
used in later experimentation. Section 3.4 then gives some preliminary analysis which
examines the suitability or otherwise of linear models of speech data.
Both acoustic and articulatory data is used in this thesis. The corpora which will be
used for experimentation are described below, however this will be preceded by an outline
of the issues involved in collecting and processing data from these distinct information
sources.
Incorporating articulatory features into speech recognition has already been stated as
one of the concerns of this work. This of course relies on having access to measured or
automatically generated articulation.
47
X-ray microbeam An x-ray microbeam system involves attaching 2-3mm gold pellets
to the chosen set of articulators. A narrow high-energy x-ray beam then tracks the pellets
during speech production. The output is a series of samples of the x and y coordinates
of each articulator in the mid-sagittal plane. The system described by Westbury (1994)
uses sampling rates of between 40Hz and 160Hz dependent on the particular articulator
which is being tracked. Rather than subjecting the entire head to radiation, a focused
beam only scans the areas where the pellets are expected to be, predicting at each sample
where to scan at the next. Figure 3.1 shows a schematic diagram of such a system.
One drawback of x-ray microbeam measurement of articulation is that the machinery
required can produce appreciable levels of background noise. Not only does this result in
a noisy signal, but it can also interfere with speech production. This is due to the Lombard
effect, which describes the reflex by which a speaker modifies their vocal effort whilst
speaking in noisy surroundings (Junqua 1993).
Figure 3.1: Schematic diagram of an x-ray microbeam system for measuring human articu-
lation. Diagram reproduced with permission from Dr. Michiko Hashi’s homepage, found at
http://www.hoku-iryo-u.ac.jp/ mhashi/index.html
Figure 3.2: Placement of coils which was used in recording the MOCHA EMA database. Coils
marked in magenta are attached to articulators, and regular samples of their position in the mid-
sagittal plane taken. These correspond to tongue tip, body and dorsum, lower incisor, velum and
upper and lower lip. The coils marked in cyan are included to provide reference points which
allow head movements relative to transmitter coils to be corrected for. Diagram reproduced from
Richmond (2001a) with permission of the author.
Electropalatography (EPG) provides a further articulatory measure: the speaker wears an
artificial palate fitted with electrodes, and a series of recordings of the points at which the
tongue is in contact with the palate is made.
Corpora large enough for training and evaluating ASR systems are rare due to the ex-
pense and labour intensity of data collection. However, two such data-sets exist: MOCHA
(Wrench 2001) which will be described on page 58, and the Wisconsin x-ray microbeam
database (Westbury 1994). The latter was introduced on page 25 of the literature review
and consists of parallel articulatory and acoustic features for 60+ subjects each of whom
provide in the region of 20 minutes of speech data. The data for each subject is divided
into a number of tasks, which include reading prose passages, counting and digit se-
quences, oral motor tasks, citation words, near-words, sounds and sound sequences along
with read sentences.
Building a mapping from acoustic features to the articulatory gestures which produced
them, known to speech researchers as the inversion mapping, is by no means trivial.
Firstly, the mapping is bound to be highly non-linear, since in some instances, small shifts
in the articulators produce significant modification of the acoustic signal. For example,
consider the effect on the acoustic signal of opening the lips during a plosive. Here, a
small movement results in a significant change in pressure within the vocal tract. The
result is a sudden change in air-flow, and therefore also the acoustic signal. Secondly, the
mapping is an example of an ill-posed problem, as there is no one-to-one mapping from
the acoustic to articulatory domains.
This is demonstrated empirically by Roweis (1999), who investigated the geometric
spaces articulatory and acoustic parameters occupy, and how they relate. It was shown
that articulatory configurations can be mapped to points in acoustic feature space using
simple linear transformations, though the reverse was not so. An experiment using the
parallel acoustic-articulatory data collected at the Wisconsin x-ray microbeam facility
(Westbury 1994) was performed to demonstrate this many-to-one relationship. With the
acoustic signal parameterized using line spectral pairs (LSP), a key acoustic frame was
chosen. The 1000 frames nearest the key frame in acoustic space were plotted in articu-
latory space. The entire database was used, which meant that there was no distinction
made between effects due to intra-speaker and inter-speaker variation. The spread of
points in articulatory space found for acoustically similar features was large compared to
reference ellipses showing the 2 standard deviation contours for the 1000 frames which
were closest in articulatory space. Many of the plots also show multimodality in the
spread of the points in articulatory space. These findings demonstrate that a range of
articulatory configurations can be used to produce acoustically similar sounds.
The phenomenon of articulatory compensation, which occurs when someone is speak-
ing under unusual constraints, can be used to demonstrate the one-to-many nature of
the acoustic-articulatory mapping on a speaker-dependent basis. Such a constraint can
be artificially introduced in the form of a bite-block, which holds the speaker’s jaw in
an unnatural position. Lindblom, Lubker & Gay (1979) recorded subjects saying four
Swedish vowels both with and without bite-blocks in place. The finding was that all sub-
jects were able to produce formant patterns within their usual range of variation despite
the constraint. In fact, anyone who has seen a ventriloquist perform knows that speech
can be produced with little or no visible movement of the articulators.
The difficulties inherent in accurate inversion mapping have not served to deter the
large number of researchers who have worked on this problem. In recent years, the
development of articulography technologies such as x-ray microbeam and EMA, along
with increases in computing power, have opened articulatory inversion to machine learning
techniques. Approaches found in the literature include artificial neural networks, Kalman
smoothers, self organising HMMs and use of codebooks. A review of such work can be
found in Richmond (2001).
The source-filter model of the vocal tract provides the assumptions underlying many
common approaches to speech processing. These include filter banks, cepstral analysis
and linear predictive coding (LPC). Analysis generally assumes that the speech signal is
statistically stationary over short time intervals, or frames. Within one such frame, the
source-filter model (with ω denoting frequency) decomposes the spectrum of the speech
signal into an excitation E(ω) and a vocal tract transfer function V(ω). This relationship
is written as:

\[ S(\omega) = E(\omega)\, V(\omega) \]

where S(ω) denotes the spectrum of the speech signal within the frame.
During spoken English, two main sources of acoustic energy are used to excite the filter.
The first is due to vibrations of the glottis which produce a periodic airflow waveform, such
as occurs during voiced speech. The second is when air expelled from the lungs is passed
through a narrow constriction creating the turbulence which characterises frication.
The acoustic signal is an encoding of the combination of an energy source and the
spectral modifications imparted by the vocal tract transfer function. The information
which it conveys is not immediately apparent, and so features must be extracted. It is
the vocal tract shape, here modelled by the filter, which creates resonances at certain fre-
quencies. Since these resonances cause the acoustic signal to carry the frequency patterns
which distinguish speech sounds, automatic speech recognition includes a step in which
filter information is extracted from the acoustic signal prior to any subsequent modelling.
Figure 3.3: Spectral estimates are made within a series of regularly spaced overlapping windows.
Hamming windows each spanning 25ms placed in the region of 10ms apart are commonly used.
Spectral estimates are made within a series of regularly spaced overlapping windows,
with a common choice being a set of Hamming windows each spanning 25ms placed in the
region of 10ms apart. Mel-frequency cepstral coefficients (MFCCs) and perceptual linear
prediction (PLP) cepstra are currently the most widely used features for ASR. Variants
exist, such as RASTA-PLP (Hermansky, Morgan, Bayya & Kohn 1991) which is tailored
to ASR in noisy conditions. Producing MFCC and PLP parameterizations of the acoustic
signal combines steps which are motivated by both speech production and perception, and
others which tailor the parameters to the models which will be used to characterise them.
Investigating new front-end acoustic parameterizations does not form part of the work
in this thesis. However, MFCCs and PLP cepstra offer distinct representations of the
speech signal, one of which may be a more appropriate choice for the class of models
under investigation. Therefore, experimental work will compare results found using both
MFCC and PLP-based features.
Figure 3.4: Figure showing the steps taken to compute Mel frequency and perceptual linear
prediction cepstra. Adapted from figures in Young (1995) and Gold & Morgan (1999).
Calculating PLPs Figure 3.4 shows the steps involved in extracting PLP cepstra from
the speech signal alongside those for deriving MFCCs. Producing these two parameteri-
zations can be seen to involve many of the same steps. As for MFCCs, spectral estimates
are made within a set of overlapping tapered windows onto the acoustic signal. These
estimates are then smoothed by integrating within a set of overlapping triangular filters
positioned along a Mel-warped frequency scale.1
An equal-loudness curve is used to weight the filter bank outputs to follow the variation
in the sensitivity of human hearing across the frequency ranges, and amplitude compres-
sion uses a cube-root to approximate the intensity-loudness power law. An inverse discrete
cosine transformation is applied to generate the real cepstrum, though unlike computing
MFCCs where smoothing is achieved through truncation, the parameters of an all-pole
filter such as used in standard linear predictive coding (LPC) are estimated. Finally, a
set of cepstra can be computed from the LPC coefficients to give a parameterization in
which the spatial correlations are reduced.
Computing MFCCs and PLP cepstra gives sets of acoustic features with similar prop-
erties. The essential difference is that PLPs employ an extra layer of LPC-based smooth-
ing. A further stage which is frequently taken in either case is liftering. The cepstra are
reweighted to give emphasis to the higher order coefficients, which tend to have small mag-
nitudes in comparison to those of lower order. For example, scaling the nth coefficient by
n has the effect of roughly equalising the variances of the cepstra (Gold & Morgan 1999).
¹ Some authors, such as Gold & Morgan (1999), suggest that spectral smoothing be implemented within
a set of trapezoidal filters spaced along a Bark-warped frequency scale in calculating PLPs. However,
HTK (Young et al. 2002), which uses triangular filters positioned along a Mel-warped scale, has been used
for feature extraction in this work.
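The liftering step described above amounts to scaling the nth cepstral coefficient by n. The toy fragment below applies this weighting to synthetic cepstra whose variances shrink with order, showing the rough equalisation; the data are invented.

```python
import numpy as np

def lifter(cepstra):
    """Scale the nth cepstral coefficient by n (coefficients indexed from 1 here)."""
    n = np.arange(1, cepstra.shape[1] + 1)
    return cepstra * n

# Synthetic cepstra whose magnitude falls off with coefficient order.
frames = np.random.default_rng(5).normal(size=(100, 12)) / np.arange(1, 13)
print(frames.std(axis=0).round(2))           # standard deviations shrink with order
print(lifter(frames).std(axis=0).round(2))   # roughly equalised after liftering
```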
3.2 Corpora used in this thesis
The data which will be used in experimentation in this thesis comes from two corpora,
MOCHA (Wrench 2001) and TIMIT (Lamel, Kassel & Seneff 1986). These, along with
the sets of features which will be derived from them are described in the following sections.
3.2.1 MOCHA
The MOCHA database was recorded at Queen Margaret University, Edinburgh, and con-
sists of parallel acoustic-articulatory information for a number of speakers, each of whom
read up to 460 sentences. The sentences comprise the 450 American TIMIT sentences
which were designed to provide good phone pair coverage, along with an extra 10 sen-
tences which include phonetic pairs and contexts found in the received pronunciation
(RP) accent of British English.
The MOCHA corpus includes automatically generated labels for the 46-phone set
detailed in Appendix A.1. Once phone sequences had been generated for each utterance
using a keyword dictionary (Fitt & Isard 1999), HTK (Young 1993) was used to force-
align flat-start monophone HMMs to the acoustic data to give start and end times for
each phone label. All experimental work uses the 20 minutes of speech data from the
southern English female subject fsew0 for which Wrench & Richmond (2000) estimate
that 5-10% of phones are incorrectly labelled.
Feature sets
The MOCHA corpus offers a number of different parallel streams of acoustic and articu-
latory data from which a variety of feature sets can be derived. These comprise:
Acoustic The speech waveform was recorded directly onto disk in a sound-damped stu-
dio, sampled at 16kHz and stored with 16 bit precision. For this work, both MFCCs and
PLP cepstra were generated from the acoustic signal at 10ms intervals within overlapping
25ms Hamming windows using the HTK version 3.1 tool HCopy. In each case, the result-
ing 12 cepstral coefficients were augmented to include the log signal energy along with δs
and δδs corresponding to each of the parameters, giving a (12 + 1) × 3 = 39-dimensional
feature vector.
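The δ coefficients appended here are computed with the regression formula used by HTK-style front ends, and applying the same operation to the δs yields the δδs. The fragment below is a minimal sketch with an assumed window half-width of 2 frames and invented data.

```python
import numpy as np

def deltas(C, theta=2):
    """Regression-based delta coefficients:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), with edge frames replicated.
    C is a (T, D) array of static features."""
    T = C.shape[0]
    padded = np.vstack([C[:1]] * theta + [C] + [C[-1:]] * theta)
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    D_ = np.zeros_like(C)
    for k in range(1, theta + 1):
        D_ += k * (padded[theta + k:theta + k + T] - padded[theta - k:theta - k + T])
    return D_ / denom

C = np.random.default_rng(6).normal(size=(50, 13))     # e.g. 12 cepstra plus log energy
features = np.hstack([C, deltas(C), deltas(deltas(C))])
print(features.shape)                                   # (50, 39): (12 + 1) x 3 dimensions
```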
(Post-processing of the articulatory data was carried out by Korin Richmond.)
Figure 3.5: Figure showing the 14 articulatory dimensions contained in the EMA data, along
with the acoustic signal for the first sentence in the MOCHA fsew0 data – ‘This was easy for us.’
Figure 3.6: This figure shows a feed-forward neural network of the type used to infer articulatory
traces from acoustic parameters. A set of 20 filter bank coefficients was computed within 20ms
windows on the acoustic signal at 10ms intervals. A context window of 20 such frames provided
the input layer of 400 units. The inputs are fully connected to the hidden layer of 50 units, which
in turn feeds forward to the 14-dimensional articulatory output. Reproduced from Richmond
(2001a) with permission of the author.
Articulatory inversion has not been part of the work of this thesis, and at the time
of writing, a set of network-recovered EMA derived using a cross-validation scheme3
over the complete fsew0 data is unavailable. Therefore, experiments which incorporate
automatically recovered EMA follow the train/test division with which it was created.
This is described in Section 5.1.1 on page 116 where the basic classification procedure is
introduced.
Numerical differentiation routines from the Edinburgh Speech Tools (Taylor, Caley, Black
& King 1997-2003) library were used to calculate the δ and δδ coefficients correspond-
ing to the data for each feature set. Linear discriminant analysis (LDA), which will be
described below, was used give reduced dimensionality versions of the larger feature sets.
Letting pj , µj and Σj be the a priori probability, mean and covariance of the data
which corresponds to class j, the within-class and between-class variances are given by:

\[ S_w = \sum_{j} p_j \Sigma_j \qquad (3.2) \]
\[ S_b = \mathrm{cov}(\mu_1, \mu_2, \ldots, \mu_j) \qquad (3.3) \]

Maximising the variance between classes S_b, whilst minimising that within classes S_w,
can be achieved by transforming the data by the matrix Ω = S_w^{-1} S_b. Noting that the
construction of Ω entails that it will be of rank j − 1, dimensionality reduction can be
introduced by instead transforming the data by a matrix consisting of the eigenvectors of
Ω which correspond to a subset (the largest), or all, of the j − 1 non-zero eigenvalues.
In this case the classes are the 46 MOCHA phones, and so post-processing with LDA
can be applied to give at most a 45-dimensional feature set in the cases where the original
dimension is 46 or higher.
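A sketch of the projection described by Equations 3.2–3.3 is given below: the within-class and between-class scatter are accumulated, and the data are projected onto the leading eigenvectors of S_w^{-1} S_b. The toy data and the prior-weighted form of the between-class scatter are illustrative choices, not the exact implementation used here.

```python
import numpy as np

def lda_projection(X, labels, n_dims):
    """Projection matrix from the leading eigenvectors of Sw^{-1} Sb (Equations 3.2-3.3)."""
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        p_c = len(Xc) / len(X)                      # class prior p_j
        Sw += p_c * np.cov(Xc, rowvar=False)        # within-class variance
        diff = (Xc.mean(axis=0) - mu)[:, None]
        Sb += p_c * diff @ diff.T                   # between-class variance
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_dims]]

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=m, size=(30, 6)) for m in (0.0, 1.0, 2.0)])
labels = np.repeat([0, 1, 2], 30)
W = lda_projection(X, labels, n_dims=2)     # at most (number of classes - 1) dimensions
print((X @ W).shape)
```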
Table 3.1 shows a summary of the feature sets which are derived from the MOCHA corpus
and will be used for experimentation.
3.2.2 TIMIT
The TIMIT corpus is well known to those working in the field of speech recognition. It was
designed and collected specifically for use in development and evaluation of ASR systems
and comprises speech data from 630 speakers of eight major dialects of American English.
Each subject speaks 10 phonetically rich sentences, of which 2 are the same for all speakers
and are included to highlight dialectical variation. These are ignored for all experimental
work as inclusion would skew the phone coverage. As for the MOCHA corpus, MFCC and
PLP cepstral coefficients, energy and corresponding δ and δδ parameters were generated
from the waveform using HTK. Table 3.1 gives a summary of the feature sets which are
derived from the TIMIT corpus for use in later experimentation.
Table 3.1: Each tick denotes a feature set which will be used for MOCHA speaker-dependent
classification. All experiments use the data from speaker fsew0, and results are presented in
Section 5.1. Log signal energy (and corresponding derivatives where required) are appended to
both of the acoustic features.
Table 3.2: Each tick denotes a feature set which will be used for TIMIT speaker-independent
classification and recognition. Log signal energy (and corresponding derivatives where required)
are appended to both features.
3.3 Language modelling

The summary of the component parts of a speech recognizer on page 2 stated that a
language model was incorporated to give an estimate of the prior probability of each
candidate word sequence, w_1^j = {w_1, ..., w_j}. Assuming that the probability of any given
word depends only on the identities of the words which precede it, Bayes' rule can be
used to decompose P(w_1^j) as:

\[ P(w_1^j) = \prod_{i=1}^{j} P(w_i \mid w_1, \ldots, w_{i-1}) \qquad (3.4) \]

An n-gram language model assumes that the probability of any given word is only depen-
dent on the preceding n − 1 words, so the relation 3.4 is reduced to:

\[ P(w_1^j) = \prod_{i=1}^{j} P(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) \qquad (3.5) \]
When an n-gram, for example a trigram, has not been observed in the training data, its
probability can be estimated by backing off to a weighted lower-order estimate,
B(w_{i−1}, w_{i−2}) P(w_i | w_{i−1}), where B(w_{i−1}, w_{i−2}) weights P(w_i | w_{i−1}) so as to
normalise the probability mass of the trigrams finishing with w_i (Young 1995).
The language models used in the phone classification experiments of Chapter 5 are
simple bigrams with no backing off. Code from the Edinburgh Speech Tools (Taylor
et al. 1997-2003) was used to estimate probabilities by counting occurrences of phone pairs
in the training data. A minimum probability is set to avoid zeros for phone pairs which do
not occur in the training set. The phone recognition experiments which follow in Chapter
6 use backed off phone bigrams, with probabilities again estimated on the training data.
These language models were produced using the CMU-Cambridge Statistical Language
Modelling toolkit (Clarkson & Rosenfeld 1997).
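A minimal version of the bigram estimation used for classification, counting adjacent phone pairs and flooring the probability of unseen pairs, is sketched below. It is not the Edinburgh Speech Tools or CMU-Cambridge toolkit code, the floor value is arbitrary, and renormalisation after flooring is omitted for brevity.

```python
from collections import Counter, defaultdict

def train_bigram(phone_sequences, phone_set, floor=1e-5):
    """Estimate P(b | a) from counts of adjacent phone pairs, with a minimum
    probability for pairs unseen in the training data."""
    pair_counts = Counter()
    history_counts = Counter()
    for seq in phone_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            pair_counts[(a, b)] += 1
            history_counts[a] += 1
    probs = defaultdict(dict)
    for a in phone_set:
        for b in phone_set:
            p = pair_counts[(a, b)] / history_counts[a] if history_counts[a] else 0.0
            probs[a][b] = max(p, floor)
    return probs

bigram = train_bigram([["sil", "dh", "ih", "s", "sil"], ["sil", "ih", "s", "sil"]],
                      phone_set=["sil", "dh", "ih", "s"])
print(bigram["ih"]["s"], bigram["s"]["dh"])
```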
3.4 Linear predictability
This chapter has so far introduced the data and some of the basic techniques which will
be required for experiments presented in later chapters. The remainder of the chapter
involves some preliminary data analysis. One of the attributes of the ideal acoustic model
for ASR as given on page 3 was to account for the temporal dependencies present in
speech data. The following sections examine these dependencies and provide empirical
motivation for applying the model at the core of this work to speech data.
Linear models are in general simpler to deal with than their non-linear counterparts.
There are comparatively few functional forms to choose between and the properties of
linear models are straightforward and well known. A commonly accepted approach to
model selection is to choose the simplest model which can describe the process in question.
With this in mind, the experiments below examine the suitability of linear models for
accounting for the dependencies present in the speech data used in this thesis, acoustic
and articulatory. Modelling of the correlations occurring within and between phonetic
segments is considered separately.
Digalakis (1992) carried out such an experiment, examining whether linear models
can account for the relationships present in an MFCC parameterization of speech. The
conclusion was that linear models are appropriate intra-phone, but not inter-phone. Results of
similar experiments are presented below in greater detail, and extended to assess linear
models of articulatory traces and of a PLP cepstra parameterization of the acoustic signal.
Each of the core data/feature sets used in this thesis is examined to determine whether or
not assumptions of linearity are appropriate. Results are further broken down by phonetic
class to enable comparison with classification and recognition accuracies presented later
in the thesis. Appendix A.1 gives these classes along with the IPA symbols corresponding
to the MOCHA and TIMIT phone sets.
3.4.1 Method
The approach will be to compare the fit of linear and non-linear regressions on subsets of
speech data. If a linear regression can account for as much of the systematic variation as
a non-linear counterpart, there is no justification for the extra complexity of employing a
non-linear model.
Letting y, x^(1), ..., x^(p) be random variables, with y the dependent variable and
x^(1), ..., x^(p) as predictors, the linear regressions use a standard multiple regression model,
such that

\[ y = \sum_{i=1}^{p} \alpha_i x^{(i)} + \beta. \qquad (3.7) \]

The non-linear regressions use a generalized additive model (Hastie & Tibshirani 1986):

\[ \Theta(y) = \sum_{i=1}^{p} \Phi_i(x^{(i)}) \qquad (3.8) \]

The goodness of fit of each regression is measured by the proportion of variance explained:

\[ R^2 = 1 - \frac{SS_{\mathrm{residual}}}{SS_{\mathrm{data}}} \qquad (3.10) \]
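The linear half of this comparison reduces to an ordinary least-squares fit followed by the R² of Equation 3.10, as sketched below with invented data. The non-linear counterpart, a generalized additive model, would be fitted with a suitable statistical package and scored in exactly the same way; it is not reproduced here.

```python
import numpy as np

def linear_r_squared(y, X):
    """Fit y = sum_i alpha_i x_i + beta by least squares and return the R^2 of Equation 3.10."""
    Xd = np.column_stack([X, np.ones(len(y))])        # append the intercept term beta
    coeffs, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    residual = y - Xd @ coeffs
    ss_residual = np.sum(residual ** 2)
    ss_data = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_residual / ss_data

# Toy example: predict one feature dimension of a frame from the previous frame's features.
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
print(round(linear_r_squared(y, X), 3))
```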
Table 3.3: Commonly occurring phone pairs which were used to examine inter-phone dependen-
cies present in MOCHA and TIMIT. Appendix A.1 gives the IPA symbols for each of the phone
classes used to label the MOCHA and TIMIT corpora.
The 6 commonly occurring phone pairs given in Table 3.3 were chosen for use in
experiments to compare linear and non-linear models between, or inter segments. For
each of the phone pairs, the instances in which the second phone in the pair is 5 frames
or longer are used in experimentation, and intra-phone regressions of the 5th on the 2nd
frame of the second phone are performed. Then, the 2nd frame of the second phone is
taken as the dependent variable, with the penultimate frame of the first phone in the
pair used as the explanatory variable for an inter-phone regression. These form pairs of
inter-phone and intra-phone regressions in which the interval between the dependent and
explanatory variables is kept constant. In this way, any confounding factors due to the
spacing between regression variables are controlled for, allowing a direct examination of
the effect of crossing phone boundaries. Results are compiled in a similar fashion to the
within-phone comparisons, replacing relative phone frequencies with relative phone pair
frequencies.
3.4.2 Results
Results take the form of R2 values for each feature dimension of each phone (or phone-
pair), for both linear and non-linear regressions. This correspondence makes a paired
t-test as described in Section 5 of Chapter 6 a natural choice to check the hypothesis of
equality of fit under each of the two models. For every experiment in this section, this
was refuted in no uncertain terms. In other words, the non-linear regression was always a
better fit to the data than the linear one. However, whilst linear models do not perform
exactly as well as their non-linear counterparts, in many cases they are extremely close.
MOCHA results
The within-segment experiments show that under both models, R2 values are considerably
higher for regression on the previous frame than on the initial frame, much as expected.
On the articulatory data, the overall linear and non-linear R2 values are 0.981 and 0.987
respectively for the regressions on the previous frame. Not only are these both extremely
high, they are very close. When the regression uses the segment-initial frame as the
explanatory variable, the linear model is outperformed by a larger margin, though still
gives 91.1% of the non-linear R2 . Acoustic features do not provide such consistently
well fitting models, with R2 values ranging from 0.815 for non-linear regression on the
preceding frame for MFCCs, down to 0.301 for linear regression on the segment-initial
frame for PLPs. MFCC features provide a better fit, and a higher relative linear versus
non-linear performance, than the PLP features do. In both cases the linear models manage
to capture a high proportion of the dependencies accounted for by non-linear regressions
when the previous frame is used as the explanatory variable, 90.7% and 86.9% for MFCCs
and PLPs respectively. However, relative performance is substantially worse when the
initial frames are used as predictors.

Table 3.4: Results of regressions to compare the performance of linear and non-linear regressions
in predicting dependencies within phones. R2 values averaged over 46 phone classes for linear and
non-linear regressions which predict segment-central frames based either on previous or segment-
initial frames. Also shown are percentages of non-linear R2 gained by linear regressions. Data is
from the speaker fsew0 in the MOCHA corpus.
These results are shown pictorially in Figure 3.7, broken down by phonetic class. The
approximately parallel graphs of R2 values for the two acoustic features show that the
pattern of which phone types are modelled well is similar for both. Overall, though,
regression models produce better fits to MFCCs than to PLPs. For all feature
types, regressions on the previous frame within diphthongs produce some of the largest
R2 values and highest relative performances of linear models against their non-linear
counterparts. However, these become the lowest relative and some of the lowest absolute
scores when the initial frame becomes the explanatory variable. This is to be expected as
diphthongs are by their nature transitional and so subject to a high degree of variation. In
general, it is for vowels, nasal stops, and liquids that linear models give values of R2 most
similar to their non-linear counterparts, and affricates where the differences are largest.
[Figure 3.7 appears here: two panels plotting R-squared against phone category for EMA, MFCC and PLP features, each under linear and ACE (non-linear) regression.]
Figure 3.7: Pictorial comparison of the proportion of variance accounted for by linear and
non-linear regression, where results for individual phones have been pooled into phonetic
categories as given in Appendix A.1. Data comprises EMA articulatory traces and two
parameterizations of the acoustic signal, MFCCs and PLPs, for a single speaker from the
MOCHA corpus. The phone categories in the lower graph match those in the upper.
Table 3.5: Results of regressions to examine the effect of crossing phone boundaries on linear and
non-linear regressions. R2 values averaged over 6 frequently occurring phone pairs are given for
linear and non-linear regressions on intra-phone and inter-phone bases. Also shown are percentages
of non-linear R2 gained by linear regressions. Data is from the speaker fsew0 in the MOCHA
corpus, and the phone pairs used are given in Table 3.3.
Table 3.5 shows the results comparing the pairs of inter-phone and intra-phone re-
gressions. In all cases, linear models give performance closer to that of the non-linear
models when regressions are on an intra-phone rather than inter-phone basis. This effect
is smallest for articulatory data for which the linear models give 92.6% and 91.2% of
the R2 value found under non-linear models for intra-phone and inter-phone regressions
respectively. On acoustic data, the R2 values found using linear models are considerably
lower than those found with their non-linear counterparts, with linear models giving in
the region of 50% of the performance of non-linear models. The relative R2 values given
by linear and non-linear models are 53.9% and 47.9% on intra-phone and inter-phone
regressions respectively with MFCCs as features. This decrease is largely due to the linear
regression R2 value reducing from 0.519 to 0.462, whereas its non-linear equivalent drops
by only 0.005, from 0.963 to 0.958. These, and similar results for PLPs, suggest that the
effects of crossing phone boundaries are more detrimental to the performance of linear
than non-linear predictors.
These results also support those given in Table 3.4 in demonstrating the poor relative
performance of linear compared to non-linear models when the regression variables are
spaced a number of frames apart.
TIMIT results
Table 3.6: Results of regressions to compare the performance of linear and non-linear regressions
in predicting dependencies within phones. R2 values averaged over 61 phone classes for linear and
non-linear regressions which predict segment-central frames based either on preceding or segment-
initial frames. Also shown are percentages of non-linear R2 gained by linear regressions. Data
comprises the training set from the speaker-independent TIMIT corpus
The TIMIT data results show many of the same trends as seen in the MOCHA data,
though estimations should be more reliable as there is considerably more data available.
Model fits using PLP and MFCC features are again close for the within-segment experiments,
though in a role reversal, the PLP data provides better regression models than
the MFCCs. Linear regression of central on preceding frame gives R2 values of 0.706
and 0.690 for the PLP and MFCC features respectively, compared to non-linear values of
0.734 and 0.717. These both represent over 96% of the non-linear fit.
As before, R2 values are significantly lower when the segment-initial rather than pre-
ceding frame is used as predictor. Non-linear regressions give R2 values of 0.456 and
0.442 for PLPs and MFCCs respectively, and linear models manage about 90% of these.
The graph in Figure 3.8 shows these results broken down by phonetic category. PLP and
MFCC features exhibit similar trends, with the best fits and smallest differences between
linear and non-linear model fit for vowels, liquids and nasal stops with the preceding frame
as the predictor. Just as with the MOCHA data, diphthongs give high relative perfor-
mance with the previous frame as predictor, and considerably lower relative performance
using the segment-initial frame.
[Figure 3.8 appears here: two panels plotting R-squared against phone category for TIMIT MFCC and PLP features under linear and non-linear regression.]
Figure 3.8: Pictorial comparison of the proportion of variance accounted for by linear and non-
linear regression, where results for individual phones have been pooled into phonetic categories
as given in Appendix A.1. Data comprises two parameterizations of the acoustic signals from the
training set of the multi-speaker TIMIT corpus, MFCCs and PLP cepstra.
Table 3.7: Results of regressions to examine the effect of crossing phone boundaries on linear and
non-linear regressions. R2 values averaged over 6 frequently occurring phone pairs are given for
linear and non-linear regressions on intra-phone and inter-phone bases. Also shown are percentages
of non-linear R2 gained by linear regressions. Data is from the training set of the speaker-independent
TIMIT corpus, and the phone pairs used are given in Table 3.3.
Table 3.7 shows the results of the regressions used to examine the effect of crossing
phone boundaries on linear and non-linear regression models. For both MFCCs and PLPs,
absolute R2 values are higher for intra-phone than for inter-phone regressions regardless
of the regression model. Also, the proportion of the non-linear R2 given by the linear
model is higher when regressions are on an intra-phone basis. For example, with MFCC
features, the relative intra-phone performance of linear regression models is 84.2% which
drops to 72.1% when the regressions are inter-phone. As with the MOCHA data, these
results suggest that crossing phone boundaries is more detrimental to the performance of
a linear predictor than a non-linear counterpart.
Conclusions
All the speech data extracted from the MOCHA corpus came from a single speaker. It
was expected that the consistency across such a data-set would lead to high R2 values
compared with those on TIMIT in which there are a multitude of different vocal character-
istics and speaking styles. This was true for the non-linear regressions on acoustic features;
however, with the exception of using the preceding frame to predict the segment-central
frame for MFCC features, linear models consistently produced lower absolute intra-phone
R2 values on MOCHA data than on TIMIT. Furthermore, the relative performance of
linear compared to non-linear regressions was lower for MOCHA acoustic data than for
TIMIT equivalents.
Comparing the pairs of inter-phone and intra-phone results in Tables 3.5 and 3.7
shows that non-linear models account for more of the variation in the dependent variable
than linear models, when the regression variables are spaced at an interval of 3 frames.
The results in these tables further demonstrate that the relative performance of linear
compared to non-linear models is reduced when the regressions cross phone boundaries.
This suggests that linear predictors are not suited to modelling inter-phone dependencies.
On MOCHA speaker-dependent data, linear and non-linear models give comparable
fits predicting central using preceding frame for all features, and a comparable fit pre-
dicting central using initial frame with articulatory data. Given the extremely good fit
of the linear model to the articulatory data, a linear model seems entirely suitable in this
case. Furthermore, given that 98% of the variation in the phone-central frame could be
explained using the preceding one, a first order model such as the regression model in
Equation 3.7 seems ideal. This finding supports that of Roweis (1999) who shows that
linear models with only a few degrees of freedom can account for much of the structure
present in articulatory data. Linear models also seem adequate for the acoustic data when
the preceding frame is used as predictor. However, the more general conclusions on model
choice for acoustic data based on these results will be based on the speaker-independent
TIMIT experiments.
Intra-phone regressions on TIMIT acoustic data show that with the preceding frame as
the explanatory variable, application of a linear model gives 96% of the fit of a non-linear
equivalent. Furthermore, a linear regressor accounts for over 70% of the variation in the
data. The success of such a simple regression model, and the closeness to the fit of a non-
linear model, justifies the exploration of a first-order linear model of the parameterized
speech signal on an intra-phone basis.
Chapter 4

Linear Dynamic Models
This chapter describes the linear dynamic model in detail: the function of each component
of the model, parameter estimation, evaluation, the assumptions made in applying LDMs
to speech data, and the variations which will be compared experimentally in Chapter 5.
The class of state-space models, to which the LDM belongs, was introduced in Section
2.1.1 on page 13. However, it is worth re-stating the purpose of such a model, which is
to make a distinction between the underlying process and the observations with which it
is represented. With yt and xt representing p and q dimensioned observation and state
vectors respectively, an LDM is specified by the following pair of equations:

yt = Hxt + εt    (4.1)
xt = F xt−1 + ηt    (4.2)

and a distribution over the initial state, x1 ∼ N (π, Λ). The LDM assumes that the
dynamics underlying the data can be accounted for by the autoregressive state process
4.2. This describes how the Gaussian-shaped cloud of probability density representing
the state evolves from one time frame to the next. A linear transformation via the matrix
F and the addition of some Gaussian noise, ηt , provide this, the dynamic portion of the
model. The complexity of the motion that Equation 4.2 can model is determined by
the dimensionality of the state variable, and will be considered below. The observation
process 4.1 shows how a linear transformation with the matrix H and the addition of
measurement noise εt relate the state and output distributions.
Practical use of an LDM involves filtering or smoothing to provide estimates of the state
vectors dependent on a set of observed values. However, to build a clear picture of the
model’s capabilities, we first consider the state process in isolation.
The transform F consists of a combination of rotations and stretches about and along
the dimensions of the state-space, and the noise element is additive, given by the Gaussian
ηt ∼ N (w, D). The mean w can be non-zero, allowing for a steady drift of the state vector.
With xt and Σt representing the mean and covariance of the state distribution at time
t, applying the update equation can be seen as consisting of two elements. The first is a
linear transformation, xt = F xt−1 , in which the Gaussian distribution of xt is preserved
but rescaled giving:
xt ∼ N (F xt−1 , F Σt−1 F T ) (4.3)
and the second is convolution with the state error ηt . The result of convolving a pair of
Gaussian random variables z1 ∼ N (µ1 , θ1 ) and z2 ∼ N (µ2 , θ2 ) is also Gaussian:
z1 + z2 ∼ N (µ1 + µ2 , θ1 + θ2 ) (4.4)
The state dimension determines the nature of the dynamics which the system can
model. This comes about as the potential for interaction between dimensions increases
with the size of the state vector. With a state dimension of 1, the model can describe
exponential growth or decay with some general trend. Figure 4.1 shows plots of the state
means for two such models, produced by generating values according to the state Equation
4.2, with different parameter settings used in the first and second plots respectively. To
show the variance changing over time, single
standard deviations from the mean are also included in the figures. In both cases, the
variances were set with D = 0.005 and Λ = 0.05.
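The following is a minimal sketch of generating such a 1-dimensional state trajectory according to Equation 4.2; the parameter values in the example call are illustrative only, not those used to produce Figure 4.1.

```python
import numpy as np

def generate_state(F, w, D, pi, Lambda, n_frames=50, seed=0):
    """Sample a state trajectory x_t = F x_{t-1} + eta_t, eta_t ~ N(w, D)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(pi, np.sqrt(Lambda))        # draw x_1 ~ N(pi, Lambda)
    trajectory = [x]
    for _ in range(n_frames - 1):
        x = F * x + rng.normal(w, np.sqrt(D))  # scalar (1-dimensional) state
        trajectory.append(x)
    return np.array(trajectory)

# e.g. exponential decay towards a target, with D = 0.005 and Lambda = 0.05
states = generate_state(F=0.9, w=0.05, D=0.005, pi=0.9, Lambda=0.05)
```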
Figure 4.1: A 1-dimension state can model exponential growth or decay, with a steady
drift provided by the mean of the state error. This plot shows state mean against time
for two examples of 1-dimensional models. The dotted lines are placed a single standard
deviation from the mean and show how the variance evolves as the model runs.
Figure 4.2: A 2-dimensional state can produce damped oscillations, again with an overall
drift which is provided by the state error mean. This figure shows each axis of the state
vector plotted against time for an example of a 2-dimensional model. The second axis
plot resembles a phase-shifted rescaling of the first.
Figure 4.3: The state dimensions shown in Figure 4.2 are plotted here against one another.
The ellipses show a single standard deviation around the mean at t = 0, t = 15, t = 30,
and t = 45. This figure illustrates how the principal axes of the distribution rotate so
that the density can preserve its shape around the direction of flow.
With larger state dimensions, the interactions between axes provide the capacity to
model more complex oscillations. Figure 4.4 shows each of the dimensions in an example
of a 4-dimensional model over time, generated using
F = \begin{bmatrix} 0.90 & -0.04 & -0.01 & -0.06 \\ 0.02 & 0.92 & 0.03 & 0.13 \\ -0.03 & -0.02 & 0.83 & -0.01 \\ 0.09 & -0.21 & -0.01 & 0.92 \end{bmatrix}, \quad
w = \begin{bmatrix} 0.01 \\ 0.1 \\ 0.21 \\ -0.1 \end{bmatrix}, \quad
\pi = \begin{bmatrix} 0.2 \\ 0.0 \\ 1.1 \\ 0.2 \end{bmatrix}
Figure 4.4: A higher dimension state-space can contain more complexity within each
period of oscillation, and hence model more complicated trajectories. Here the axes of
the state mean for an example of a 4-dimensional model are plotted over time.
In this work, the evolution matrix F will be constrained to be a decaying mapping (see
Section 4.2.4 on page 99), and so the state trajectories are destined to converge. Given
a decaying F, the mean of the state update xt = F xt−1 + w has a fixed point, and the
state mean tends toward the target

x_{target} = (I − F )^{−1} w    (4.10)

This gives an interesting insight into the workings of the dynamic portion of the LDM.
Since the constraint is made that |F | < 1, the state’s evolution is governed by a set of
accelerations and velocities with which to attain some steady-state location in state-space.
Figure 4.5 shows the state trajectories of Figure 4.4, with a dashed line along the
target mean, as found in 4.10, for each. For each of the four state dimensions, the state
is shown tending toward its predicted target. Another visualisation of the same model is
given in Figure 4.6, where pairs of the state dimensions are plotted against each other.
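Using the example parameters given above, the target of Equation 4.10 can be computed directly. The following sketch assumes numpy and simply solves (I − F) x = w; the values of F and w are those of the 4-dimensional example.

```python
import numpy as np

# State evolution matrix and state noise mean of the 4-dimensional example.
F = np.array([[0.90, -0.04, -0.01, -0.06],
              [0.02,  0.92,  0.03,  0.13],
              [-0.03, -0.02, 0.83, -0.01],
              [0.09, -0.21, -0.01,  0.92]])
w = np.array([0.01, 0.1, 0.21, -0.1])

# Fixed point of x_t = F x_{t-1} + w: the state mean target of Equation 4.10.
target = np.linalg.solve(np.eye(4) - F, w)
```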
Figure 4.5: These four plots show the model of Figure 4.4 with the state mean target
included as a dashed line. When the model is simply used to generate, the state means
tend toward their targets.
Figure 4.6: The trajectories of Figures 4.5 and 4.4 are shown here as 2-dimensional slices
of a 4-dimensional space. Axes 1 and 2 are plotted against each other in the left-hand
graph, with 3 and 4 in the right. The state mean targets are shown as red dots.
A further illustration of the state tending toward a target rather than consisting of a
fixed set of trajectories is given in Figure 4.7. In each of the four plots, the dash-dotted
line corresponds to the 2nd axis of the 4-dimensional state plot in Figure 4.4. The solid
line gives the trajectory generated for this axis using the same set of parameters, varying
only the state initial mean π, and once again, the target is shown with a dashed line.
Not only do all four trajectories converge rapidly toward the mean target, they also
assume a common shape within a few frames of the start of generation.
[Figure 4.7 appears here: four panels showing the second state axis generated from different initial means, including π = 0.25 and π = 0.75, alongside the original trajectory and the common target.]
Figure 4.7: The dot-dashed red line in each of these plots corresponds to the second axis
of the mean plot in Figure 4.4, and the dashed line its target. The solid line gives state
means when the model is used to generate with a variety of initial state values.
State noise
The normally distributed state error, ηt ∼ N (w, D), has two functions. Firstly, a non-
zero mean allows the LDM to model a steady drift in the data. The standard formulation
often sets w to zero, and Roweis & Ghahramani (1999) observe that this does not lead to
a loss of generality. By adding a q + 1st dimension to the state vector which is always set
to unity, an extra column in F to hold w, and filling row q +1 with zeros apart from a 1 in
the last entry, an exactly equivalent model can be created. Derivations and manipulations
of the model can therefore be streamlined by assuming a zero mean on the state error
distribution. Whichever way it is incorporated, the state error mean allows the LDM to
describe a constant velocity in a given direction. However, this constant displacement will
be offset by the decreasing transformation F as the model runs towards its mean target,
the location of which is of course affected by w as shown in Equation 4.10. The second
function of the state error ηt is that the covariance D corresponds to the intra-segmental
variation, as discussed in the introduction to segment models on page 8. This term gives
the variance about a given trajectory, and hence the confidence with which the model
makes each new prediction.
The properties of the state process have been discussed in this section without refer-
ence to the observations. This is a valid exercise since the LDM is a generative model,
however in practice, paths through the state space are created by conditioning on ob-
served values. The relevant filtering and smoothing operations will be described after a
description of the observation process.
The state and observation spaces are linked by a linear transformation and the addition
of observation noise. Each dimension of the observation vector is therefore seen as a
noisy weighted sum of the state dimensions. As with the state evolution, applying the
observation equation can be seen as being composed of two elements. The first is a linear
transformation, yt = Hxt , which stretches and rotates the state density into a Gaussian
distribution over a (usually) higher dimensional space giving:

yt ∼ N (Hxt , HΣt H T )    (4.11)

The second is convolution with the measurement noise εt , which gives a smoothed and
displaced version of 4.11 over the observation space:

yt ∼ N (Hxt + v, HΣt H T + C)    (4.12)
As mentioned in Section 2.1.2, the state can be forced to have orthogonal components, in
which case modelling of the correlation structure of the data is contained in H.
Observation noise
The observation noise is additive Gaussian noise given by εt ∼ N (v, C). The mean v is
typically initialised at the start of parameter estimation to be the mean of the observations
in the training set. This has the effect of centering the state around the origin and using
v to model the average displacement of the observations. Some statements of the LDM,
such as found in Roweis & Ghahramani (1999), assume the data to be zero-meaned and
set v = 0p , where 0p is a p-dimensional vector of zeros.
Convolving the observation noise with the original predictions to produce Equation
4.12 was described as a smoothing operation. This step widens the spread of the cloud
of density over the observations. By shifting probability mass away from regions of high
likelihood, the sensitivity to mismatch between training and test data is reduced. Mod-
elling the distribution of the errors between model predictions and the observations in
this way corresponds to the extra-segmental variation which was defined in Section 1.3.3
on page 8. The covariance C also gives a measure of the confidence on each prediction
the LDM makes, a property which will be examined with reference to articulatory data
in Section 4.4.2 on page 108.
4.2 Training and evaluation
The Kalman filter model, as the LDM is also known, has been well used and researched
by the engineering and control theory communities since 1960 when Rudolph Kalman
introduced his ‘new approach to linear filtering and prediction problems’ (Kalman 1960).
Shortly afterwards, in 1963, Rauch provided the optimal smoother to accompany Kalman’s
filter (Rauch 1963). With inference possible, researchers turned their attention to param-
eter estimation, with solutions for all parameters except the observation matrix H given
by Shumway & Stoffer (1982), and for all parameters by Digalakis, Rohlicek & Ostendorf
(1993). In this section, these techniques
are described, along with other issues and considerations for practical implementation of
LDMs.
4.2.1 Inference
The Kalman filter and Rauch-Tung-Striebel (RTS) smoother are used to infer state in-
formation given an N -length observation sequence y1N = {y1 , . . . , yN } and a set of model
parameters Θ. As a reminder of the terminology introduced in Section 2.1.1 on page
15, filtering is the means of estimating the state distribution at time t given all the ob-
servations up to and including that time, p(xt |y1t , Θ). Smoothing gives a corresponding
estimate of the state conditioned on the entire observation sequence, p(xt |y1N , Θ). The
notation used for the filtered and smoothed state means will be x̂t|t and x̂t|N respectively.
The corresponding covariances are written as Σt|t and Σt|N .
Kalman filtering takes the initial state distribution1 , x1 ∼ N (π, Λ) and makes a
forward sweep through the observation sequence y1N to produce estimates of xt|t for
1 ≤ t ≤ N . Each recursion consists of two stages. In the first, the model makes prior
predictions of the state mean and covariance, x̂t|t−1 and Σt|t−1 , then in the second, these
predictions are projected into the observation space giving ŷt , compared with yt , and
adjusted to give posteriors, x̂t|t and Σt|t . This process provides a means of updating the
state distribution as new observations are made. The adjustment factor Kt is called the
Kalman gain and chosen to minimise the filtered state covariance Σt|t . The forward filter
1
The initial state x1 is one of the parameters which is required to fully specify an LDM. During
inference it is used as the prior on the first state vector by setting x1|0 ∼ N (π, Λ).
recursions comprise:

x̂t|t−1 = F x̂t−1|t−1 + w
Σt|t−1 = F Σt−1|t−1 F T + D
et = yt − ŷt = yt − H x̂t|t−1 − v
Σet = HΣt|t−1 H T + C
Kt = Σt|t−1 H T (Σet )−1
x̂t|t = x̂t|t−1 + Kt et
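A single recursion can be written compactly as follows. This is a sketch rather than the implementation used in this work: the variable names are assumptions, and the posterior covariance update, which is standard but not restated above, is included for completeness.

```python
import numpy as np

def kalman_filter_step(x_prev, P_prev, y, F, w, D, H, v, C):
    """One predict/correct recursion of the Kalman filter (Section 4.2.1).

    x_prev, P_prev: posterior state mean and covariance at time t-1.
    Returns the posterior mean and covariance at time t, plus the innovation
    e_t and its covariance, which are reused for likelihood calculation.
    """
    # predict
    x_pred = F @ x_prev + w                      # x_{t|t-1}
    P_pred = F @ P_prev @ F.T + D                # Sigma_{t|t-1}
    # project into the observation space and correct
    e = y - (H @ x_pred + v)                     # innovation e_t
    S = H @ P_pred @ H.T + C                     # Sigma_{e_t}
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain K_t
    x_post = x_pred + K @ e                      # x_{t|t}
    P_post = P_pred - K @ H @ P_pred             # Sigma_{t|t}
    return x_post, P_post, e, S
```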
Figure 4.8: This figure shows a single recursion of a Kalman filter. A prediction of x̂t|t−1
is made by the model based on the posterior state for the previous time, x̂t−1|t−1 , and
projected into the observation space. The error et is then computed with respect to some
newly observed data yt , and the state statistics adjusted to give a posterior x̂t|t
The RTS smoother adds a backward pass in which the state statistics are adjusted
once all data has been observed, giving x̂t|N and Σt|N . The RTS smoother can be seen as
providing the optimal linear combination of two filters – one which starts at the beginning
of the observation sequence and recurses forward, and the other which commences at the
final observation and works backward. The weighting of the contribution of each filter is
provided by At which is chosen to minimise Σt|N . The smoother recursions consist of:

At = Σt−1|t−1 F T (Σt|t−1 )−1
x̂t−1|N = x̂t−1|t−1 + At (x̂t|N − x̂t|t−1 )
Σt−1|N = Σt−1|t−1 + At (Σt|N − Σt|t−1 ) ATt
The recursions above which estimate the cross-covariance terms Σt, t−1|t and Σt, t−1|N
are not part of the standard filter/smoother equations. However, they are required in
parameter estimation for LDMs and are derived in Digalakis et al. (1993), with a more
efficient form given by Rosti & Gales (2001).
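A corresponding sketch of one backward RTS recursion, using the filtered and predicted statistics produced by the forward pass (variable names are assumptions, and the cross-covariance recursions are omitted):

```python
import numpy as np

def rts_smoother_step(x_filt, P_filt, x_pred_next, P_pred_next,
                      x_smooth_next, P_smooth_next, F):
    """One backward recursion of the RTS smoother.

    x_filt, P_filt:               filtered statistics at t-1 (x_{t-1|t-1}, Sigma_{t-1|t-1})
    x_pred_next, P_pred_next:     one-step predictions at t  (x_{t|t-1},   Sigma_{t|t-1})
    x_smooth_next, P_smooth_next: smoothed statistics at t   (x_{t|N},     Sigma_{t|N})
    """
    A = P_filt @ F.T @ np.linalg.inv(P_pred_next)                 # A_t
    x_smooth = x_filt + A @ (x_smooth_next - x_pred_next)         # x_{t-1|N}
    P_smooth = P_filt + A @ (P_smooth_next - P_pred_next) @ A.T   # Sigma_{t-1|N}
    return x_smooth, P_smooth
```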
The plots in Figure 4.9 show a pair of 2-dimensional slices of a 4-dimensional state
space during filtering. An LDM was trained on the [ey] tokens from a subset of the TIMIT
corpus, and then a state sequence generated corresponding to an example of [ey] which
was not included in the training data. The predict and correct steps are shown, as are
the state target and initial value. The target is not attained in either slice, however it
can be seen that predictions are typically towards the target, whilst adjustments can be
in any direction.
Figure 4.9: An LDM with a 4-dimensional state was trained on the [ey] tokens from a
subset of the TIMIT corpus. This figure shows state inference using a Kalman filter given
a new unseen [ey] token. Also marked is the state target and initial position. The target
is not attained in either slice of the state-space, though trajectories tend toward it.
Training an LDM is an unsupervised learning problem. There are no inputs to the system,
so the model will describe the unconditional density of the observations. As part of the
original application of LDMs for speech modelling, Digalakis et al. (1993) present both a
classical maximum likelihood approach, and a derivation of the EM algorithm. The latter
was adopted for its simplicity and good convergent properties, and is also used in this
work. The derivation below largely follows that in Digalakis (1992).
The state variable and its noise can be combined to give a single Gaussian distributed
random variable. Letting Θ denote the model parameter set, from Equation 4.2 we find,
p(x_t \mid x_{t-1}, \Theta) = \frac{1}{\sqrt{(2\pi)^q |D|}} \exp\left\{ -\frac{1}{2} (x_t - F x_{t-1} - w)^T D^{-1} (x_t - F x_{t-1} - w) \right\} \qquad (4.13)

Similarly from 4.1,

p(y_t \mid x_t, \Theta) = \frac{1}{\sqrt{(2\pi)^p |C|}} \exp\left\{ -\frac{1}{2} (y_t - H x_t - v)^T C^{-1} (y_t - H x_t - v) \right\} \qquad (4.14)
With \mathcal{Y} and \mathcal{X} denoting the sequences of observations and states respectively, the Markovian
structure of the model means that the joint likelihood of state and observations can be written as:

L(\Theta \mid \mathcal{Y}, \mathcal{X}) = p(\mathcal{Y}, \mathcal{X} \mid \Theta) = P(x_1 \mid \Theta) \prod_{t=2}^{N} P(x_t \mid x_{t-1}, \Theta) \prod_{t=1}^{N} P(y_t \mid x_t, \Theta) \qquad (4.15)

where the prior on the initial state is

p(x_1 \mid \Theta) = \frac{1}{\sqrt{(2\pi)^q |\Lambda|}} \exp\left\{ -\frac{1}{2} (x_1 - \pi)^T \Lambda^{-1} (x_1 - \pi) \right\} \qquad (4.16)
Now substituting 4.13, 4.14, and 4.16 into 4.15, and writing l(Θ|Y, X ) = log L(Θ|Y, X ),
the joint log-likelihood for the LDM is a sum of quadratic terms:
l(\Theta \mid \mathcal{Y}, \mathcal{X}) = -\frac{1}{2} \sum_{t=1}^{N} \left[ \log |C| + (y_t - H x_t - v)^T C^{-1} (y_t - H x_t - v) \right]
 -\frac{1}{2} \sum_{t=2}^{N} \left[ \log |D| + (x_t - F x_{t-1} - w)^T D^{-1} (x_t - F x_{t-1} - w) \right]
 -\frac{1}{2} \log |\Lambda| - \frac{1}{2} (x_1 - \pi)^T \Lambda^{-1} (x_1 - \pi) - \frac{N(p+q)}{2} \log(2\pi) \qquad (4.17)
where the last term is due to the normalising constants in 4.13, and 4.14 and 4.16.
Setting the derivative of l with respect to H to zero gives:

\frac{\partial l}{\partial H} = \sum_{t=1}^{N} \left[ C^{-1} y_t x_t^T - C^{-1} H x_t x_t^T - C^{-1} v x_t^T \right] = 0

\Rightarrow \hat{H} = \left( \sum_{t=1}^{N} y_t x_t^T - \hat{v} \sum_{t=1}^{N} x_t^T \right) \left( \sum_{t=1}^{N} x_t x_t^T \right)^{-1} \qquad (4.18)
and for v:

\frac{\partial l}{\partial v} = \sum_{t=1}^{N} \left[ C^{-1} y_t - C^{-1} H x_t - C^{-1} v \right] = 0

\Rightarrow \hat{v} = \frac{1}{N} \sum_{t=1}^{N} y_t - \frac{1}{N} \hat{H} \sum_{t=1}^{N} x_t \qquad (4.19)
Multiplying the terms in 4.18 through by \frac{1}{N}, 4.18 and 4.19 can be combined to give:

\begin{bmatrix} \hat{H} & \hat{v} \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum_{t=1}^{N} y_t x_t^T & \frac{1}{N}\sum_{t=1}^{N} y_t \end{bmatrix} \begin{bmatrix} \frac{1}{N}\sum_{t=1}^{N} x_t x_t^T & \frac{1}{N}\sum_{t=1}^{N} x_t \\ \frac{1}{N}\sum_{t=1}^{N} x_t^T & 1 \end{bmatrix}^{-1}
Maximising in terms of D follows the same line of argument, and solutions for π and Λ
are also found using the techniques above.
If the state were observable, the maximum likelihood estimates for the parameters of an
LDM would be found as:

\begin{bmatrix} \hat{H} & \hat{v} \end{bmatrix} = \begin{bmatrix} \sum_{t=1}^{N} y_t x_t^T & \sum_{t=1}^{N} y_t \end{bmatrix} \begin{bmatrix} \sum_{t=1}^{N} x_t x_t^T & \sum_{t=1}^{N} x_t \\ \sum_{t=1}^{N} x_t^T & N \end{bmatrix}^{-1} \qquad (4.23)

\hat{C} = \frac{1}{N} \sum_{t=1}^{N} y_t y_t^T - \frac{1}{N} \sum_{t=1}^{N} y_t x_t^T \hat{H}^T - \frac{1}{N} \sum_{t=1}^{N} y_t \hat{v}^T \qquad (4.24)

\begin{bmatrix} \hat{F} & \hat{w} \end{bmatrix} = \begin{bmatrix} \sum_{t=2}^{N} x_t x_{t-1}^T & \sum_{t=2}^{N} x_t \end{bmatrix} \begin{bmatrix} \sum_{t=2}^{N} x_{t-1} x_{t-1}^T & \sum_{t=2}^{N} x_{t-1} \\ \sum_{t=2}^{N} x_{t-1}^T & N-1 \end{bmatrix}^{-1} \qquad (4.25)

\hat{D} = \frac{1}{N-1} \sum_{t=2}^{N} x_t x_t^T - \frac{1}{N-1} \sum_{t=2}^{N} x_t x_{t-1}^T \hat{F}^T - \frac{1}{N-1} \sum_{t=2}^{N} x_t \hat{w}^T \qquad (4.26)

\hat{\pi} = x_1 \qquad (4.27)

\hat{\Lambda} = x_1 x_1^T - x_1 \hat{\pi}^T \qquad (4.28)
Section 2.1.1 on page 16 described the EM algorithm, which provides a means of iterating
toward the ML solution in situations where there is missing or incomplete data. In this
case the incomplete data is the state, and EM takes a model with parameters Θ(i) at
the ith iteration and makes an update to give Θ(i+1) , in such a way as to guarantee an
increase in the likelihood on the training data.
Dempster et al. (1977) demonstrated that for distributions from the exponential family
(of which the LDM with its Gaussian output distribution is a member), the E-step of the
EM algorithm consists of computing the conditional expectations of the complete-data
sufficient statistics for the standard ML parameter estimates. These sufficient statistics
are the quantities in Equations 4.23 – 4.28, which are all computed as sums of the terms

xt , xt xTt , xt xTt−1 , yt , yt xTt and yt ytT    (4.29)

Therefore, the E-step involves computing the expectations of the values in 4.29 conditioned
on Y and Θ(i) . Since Y is observed, the terms involving only yt are unchanged by taking
expectations, and it remains to compute the expectations of the terms involving the state.
Given that the initial state x1 is a normal random variable, and that both state and
observation processes are linear with additive Gaussian noise, when conditioned on a
sequence of observations Y, the state at time t will also be Gaussian, so
xt |Y ∼ N (x̂t|N , Σt|N )
In this case, with cov[A, B] denoting the covariance of the random variables A and B and
using the relation E[AB T ] = cov[A, B] + E[A] E[B]T , the required expectations are:

E[xt |Y] = x̂t|N    (4.31)
E[xt xTt |Y] = Σt|N + x̂t|N x̂Tt|N    (4.32)
E[xt xTt−1 |Y] = Σt,t−1|N + x̂t|N x̂Tt−1|N    (4.33)
An RTS smoother as described in Section 4.2.1 on page 89 can be used to compute the
complete-data estimates of the state statistics x̂t|N , Σt|N , and Σt,t−1|N . EM for LDMs
then consists of evaluating the ML parameter estimates 4.23 – 4.28 replacing xt , xt xTt ,
and xt xTt−1 with their expectations 4.31 – 4.33.
These solutions easily extend to multiple examples of each time series. In the E-step,
the combination of filter and smoother is run for each observation sequence, and sums
accumulated over all observations and expected state values. The M-step then proceeds
as before by evaluating the expressions 4.23 – 4.28, replacing any divisions by N with a
division by the total number of observation frames which contributed to each sum.
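As an illustration of the M-step, the following sketch re-estimates the observation parameters in the manner of Equation 4.23 from accumulated (expected) sufficient statistics. The function and argument names are assumptions; the remaining parameters would be updated analogously.

```python
import numpy as np

def m_step_observation(sum_yx, sum_y, sum_xx, sum_x, n_frames):
    """M-step update of the observation parameters [H v] (cf. Equation 4.23).

    sum_yx = sum_t E[y_t x_t^T], sum_y = sum_t y_t,
    sum_xx = sum_t E[x_t x_t^T], sum_x = sum_t E[x_t],
    accumulated over all training sequences; n_frames is the total frame count.
    """
    q = sum_x.shape[0]
    lhs = np.hstack([sum_yx, sum_y[:, None]])                       # p x (q+1)
    gram = np.block([[sum_xx, sum_x[:, None]],
                     [sum_x[None, :], np.array([[float(n_frames)]])]])  # (q+1) x (q+1)
    Hv = lhs @ np.linalg.inv(gram)
    H_hat, v_hat = Hv[:, :q], Hv[:, q]
    return H_hat, v_hat
```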
EM as presented here is a batch algorithm, meaning that all the training data is
processed before the model parameters are updated. A version of EM for on-line learning
is given in Neal & Hinton (1998), though the data used to train an ASR system will
normally be recorded and annotated at the word or phone level before training commences
making a batch approach appropriate.
Classification and recognition require calculation of the likelihood of a given model gen-
erating a section of speech data. The Kalman filter as stated in Section 4.2.1 on page 89
is in a form termed the innovations representation by Ljung (1999). The prediction error
at time t is given by
et = yt − ŷt
= yt − H x̂t|t−1 − v (4.34)
Substituting the observation equation

yt = Hxt + εt    (4.35)

into 4.34 gives the covariance of the prediction error:

Σet = E[et eTt ]
    = E[ (H(xt − x̂t|t−1 ) + (εt − v)) (H(xt − x̂t|t−1 ) + (εt − v))T ]    (4.38)
    = HΣt|t−1 H T + C    (4.40)
Since errors are assumed uncorrelated and Gaussian, the log-likelihood of an observed
sequence y1N given an LDM with parameter set Θ can be calculated as:
\log p(y_1^N \mid \Theta) = -\frac{1}{2} \sum_{t=1}^{N} \left[ \log|\Sigma_{e_t}| + e_t^T \Sigma_{e_t}^{-1} e_t \right] - \frac{Np}{2} \log(2\pi) \qquad (4.41)
where et and Σet are computed as part of the standard Kalman filter recursions. The
normalisation term outside the summation in 4.41 can be omitted when comparing mul-
tiple models on a single given section of data as occurs during classification of a single
segment or recognition of a single utterance.
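A sketch of the likelihood calculation of Equation 4.41, given the innovations and their covariances collected during filtering (names are assumptions; dropping the final constant term gives the comparison-only form described above):

```python
import numpy as np

def innovation_log_likelihood(errors, error_covs):
    """Log-likelihood of Equation 4.41 from the Kalman filter innovations.

    errors:     list of innovation vectors e_t (each of dimension p)
    error_covs: list of innovation covariances Sigma_{e_t}
    """
    p = errors[0].shape[0]
    ll = 0.0
    for e, S in zip(errors, error_covs):
        _, logdet = np.linalg.slogdet(S)
        ll -= 0.5 * (logdet + e @ np.linalg.solve(S, e))
    ll -= 0.5 * len(errors) * p * np.log(2.0 * np.pi)
    return ll
```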
It was found by experiment that the state’s contribution to the error covariance Σet
was detrimental to classification performance. The state covariance is normally reset to
a value learned during training at the start of each segment, and converges during the
first few Kalman filter recursions. The resulting fluctuations in the likelihoods computed
during the segment-initial frames have most effect on the overall likelihood of shorter
phone segments. Replacing Σet = C + HΣt|t−1 H T with Σ′et = C improves classification
accuracy on shorter segments. This is demonstrated in the results of phone classification
on 480 TIMIT validation sentences where an LDM has been trained on the data corre-
sponding to each phone class. Figure 4.10 shows, for each segment length in frames, the
number of correctly classified tokens. It is apparent that for segments over 11 frames, the
correct form of likelihood calculation gives a marginally improved accuracy. However, for
Figure 4.10: Results of phone classification on 480 TIMIT validation sentences, broken down
by segment length. The dashed red line corresponds to classification using the correct form of
likelihood calculation, and the solid blue line to classification where likelihoods are computed
replacing Σet = C + HΣt|t−1 H T with Σ′et = C.
shorter segments, a modified Σet gives markedly higher classification performance. The
results shown are for the 61-phone TIMIT set, prior to the addition of a language model.
On these 480 sentences, using the correct and modified likelihood calculations results in
classification accuracies of 40.1% and 46.7% respectively. Likelihood calculations for the
experiments in this thesis will omit the contribution of the state covariance unless oth-
erwise stated. Further discussion of the properties of the state covariance is found in
Section 7.1.4 on page 224.
Efficient computation The initial distribution of the state, x1 ∼ N (π, Λ), is part of
the specification of an LDM. These values are estimated during training, and then used
to initialise the state priors for the Kalman filter, so that x1|0 = π and Σ1|0 = Λ. An
examination of the filter and smoother recursions given on pages 90 and 91 reveals that
none of the computations for the 2nd order statistics at time t involve the newly observed
value yt . The forward statistics Σt|t−1 , Σt|t , Σt,t−1|t , Kt , and Σet will then be identical
for any pair of observation sequences {y1 , . . . , yN1 } and {y′1 , . . . , y′N2 } for t ≤ N1 , N2 .
The situation is slightly different for the backward smoothing pass, though the above
also applies to At , which is calculated using the filtered parameters Σt−1|t−1 and Σt|t−1 .
However, the smoothed state covariances are dependent on N , and so Σt−1|N and Σt,t−1|N
are identical for any pair of observation sequences {y1 , . . . , yN1 } and {y′1 , . . . , y′N2 } for
which N1 = N2 . These observations lead to implementational strategies in which state
covariances and the correction factors Kt and At can be calculated, cached, and reused.
The matrix operations which are used to compute these quantities form the bulk of
the computation of implementing LDMs and so considerable speed-ups can be found by
employing such a strategy.
For example, training a full set of LDMs for a single iteration of EM on 12 MFCCs and
energy derived from the TIMIT training data on a 2.4GHz Pentium P4 processor took 10
minutes, 39 seconds. This was reduced to 1 minute 54 seconds through caching. Similarly,
using those same models for classification of the full TIMIT test set took 108 minutes 32
seconds, reduced to 7 minutes 39 seconds by pre-computing the relevant quantities. This
represents a 14-fold speed increase.
Constraints Taking an LDM and multiplying one dimension of the state space by some
factor whilst dividing the corresponding column of H by the same gives distributions over
the observations identical to those of the original. Despite the lack of unique parameter
estimates, and the inherent degeneracy which was discussed in Section 2.1.2 on page 20,
EM training for LDMs is stable in practice and converges quickly. As with any application
of EM, parameters must be initialised before training begins. There is no single established
or ‘correct’ technique for initialising LDMs, though choice of initial parameters is key to
good performance. Appendix B describes the approach used in this work.
One constraint is always placed on the LDMs during training, which is that F is a
decaying mapping. If |F | > 1 were allowed, the state evolution could give a model of
exponential growth. Such behaviour may not be apparent over small numbers of frames,
whilst still introducing an element of numerical instability into the system. This becomes
especially important in the situation where the state is not reset between models. To
constrain |F | < 1, the singular value decomposition (SVD) is used at the re-estimation
step. The SVD provides a pair of orthonormal bases U and V , and a diagonal matrix of
singular values S such that
F = U SV T (4.42)
By replacing any elements of S greater than 1 − ε with 1 − ε for some small ε (ε = 0.005 was
used in this work), and then re-computing F = U Snew V T , the bases of F are preserved
whilst forcing the transform along them to be decaying (Roweis 2001).
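A minimal sketch of this constraint (assuming numpy; the default value of ε matches the figure quoted above):

```python
import numpy as np

def constrain_decaying(F, eps=0.005):
    """Force the state evolution matrix to be a decaying mapping (|F| < 1).

    Any singular value greater than 1 - eps is clipped back to 1 - eps and F
    is reassembled from the same orthonormal bases (cf. Equation 4.42).
    """
    U, S, Vt = np.linalg.svd(F)
    S_clipped = np.minimum(S, 1.0 - eps)
    return U @ np.diag(S_clipped) @ Vt
```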
Other constraints may be considered also. These include forcing the rows or columns
of H to sum to unity in order to fix the scaling of the state process. Alternatively, one or
both of C and D can be set to be diagonal, thereby forcing modelling of the correlation
structure of the data into H, and using the error distributions to describe the variances
unique to each dimension of the data. The latter can be enforced during the re-estimation
step, simply by setting all off-diagonal elements of Ĉ or D̂ to zero. This would also increase
implementational efficiency, as inverting diagonal matrices requires far less computation
than their full counterparts.
Consider an autoregressive process of order r on vectors zt :

z_t = \sum_{i=1}^{r} A_i z_{t-i} + \eta_t \qquad (4.43)
where the Ai s are p×p matrices and ηt is additive Gaussian noise given by ηt ∼ N (w, D).
Williams (2003) shows that the likelihood of observations Z under such a model can be
made equivalent to a likelihood expressed in terms of independent difference observations
derived from Z. This result demonstrates that an autoregressive model can be expressed
as a static model with an appropriate set of δ coefficients. The question follows as to
whether there is then an equivalence between an LDM and a static model with δ coeffi-
cients.
Firstly, note that the model of Equation 4.43 can be written in the following form:
Z_{t-r+1}^{t} = \begin{bmatrix} z_t \\ z_{t-1} \\ \vdots \\ z_{t-r+1} \end{bmatrix}
 = \begin{bmatrix} A_1 & A_2 & \cdots & A_{r-1} & A_r \\ I_p & 0_p & \cdots & 0_p & 0_p \\ 0_p & I_p & \cdots & 0_p & 0_p \\ \vdots & & \ddots & & \vdots \\ 0_p & 0_p & \cdots & I_p & 0_p \end{bmatrix}
   \begin{bmatrix} z_{t-1} \\ z_{t-2} \\ \vdots \\ z_{t-r} \end{bmatrix}
 + \begin{bmatrix} \eta_t \\ 0_p \\ \vdots \\ 0_p \end{bmatrix} \qquad (4.44)

Treating each observation yt as a noise-corrupted copy of zt , we can then write the pair:

y_t = \begin{bmatrix} I_p & 0_p & \cdots & 0_p \end{bmatrix} \begin{bmatrix} z_t \\ z_{t-1} \\ \vdots \\ z_{t-r+1} \end{bmatrix} + \epsilon_t \qquad (4.45)

\begin{bmatrix} z_t \\ z_{t-1} \\ \vdots \\ z_{t-r+1} \end{bmatrix}
 = \begin{bmatrix} A_1 & A_2 & \cdots & A_{r-1} & A_r \\ I_p & 0_p & \cdots & 0_p & 0_p \\ 0_p & I_p & \cdots & 0_p & 0_p \\ \vdots & & \ddots & & \vdots \\ 0_p & 0_p & \cdots & I_p & 0_p \end{bmatrix}
   \begin{bmatrix} z_{t-1} \\ z_{t-2} \\ \vdots \\ z_{t-r} \end{bmatrix}
 + \begin{bmatrix} \eta_t \\ 0_p \\ \vdots \\ 0_p \end{bmatrix} \qquad (4.46)
So far, Z has been of the same dimension as the observations Y, which means that
the hidden state vector has been of dimension rp. State-space models generally employ
a state of different (frequently lower) dimension than the observations. Incorporating
dimensionality reduction via the observation process means that the autoregressive model
can have just as many degrees of freedom as required to model any dynamics which might
underlie the observations. Now letting zt be a d-dimensional vector, and with the Bi s
denoting p × d matrices, the observation process generalises to

y_t = \begin{bmatrix} B_1 & B_2 & \cdots & B_r \end{bmatrix} \begin{bmatrix} z_t \\ z_{t-1} \\ \vdots \\ z_{t-r+1} \end{bmatrix} + \epsilon_t

with the state evolution of the d-dimensional autoregressive process taking the same companion form as 4.46.
Note that the model of equations 4.45 and 4.46 can be found by setting p = d, B1 = Id
and Bi = 0d for i = {2, . . . , r}.
The matrices Bi can be used to incorporate linear dimensionality reduction into the
model. Specifying B1 but setting the remaining Bi s to be zero matrices ensures that yt has
a dependence only on zt . In this case, the observations are modelled as a corrupted-by-
noise version of a lower-dimensional autoregressive process of order r. Further specifying
Bi for i = {2, . . . , r} gives yt a dependence also on zt−i .
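For illustration, the block-companion structure of Equations 4.44–4.46 can be assembled as follows. This is a sketch with assumed function names; it builds the state evolution matrix and the selector-style observation matrix of 4.45, not the generalised Bi form.

```python
import numpy as np

def companion_form(A_list):
    """Block companion matrix of Equations 4.44/4.46 for an AR(r) process.

    A_list: list of r matrices A_1, ..., A_r, each p x p.
    Returns the rp x rp state evolution matrix acting on [z_t; ...; z_{t-r+1}].
    """
    r = len(A_list)
    p = A_list[0].shape[0]
    top = np.hstack(A_list)                                          # [A_1 A_2 ... A_r]
    shift = np.hstack([np.eye(p * (r - 1)), np.zeros((p * (r - 1), p))])
    return np.vstack([top, shift])                                   # identities on the sub-diagonal

def selector(p, r):
    """The observation matrix [I_p 0_p ... 0_p] of Equation 4.45."""
    return np.hstack([np.eye(p), np.zeros((p, p * (r - 1)))])
```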
In practice, estimation for LDMs is largely unconstrained. The state vector is not
explicitly divided into separate components, as the rd-dimensional Ztt−r+1 is replaced by
the q-dimensional xt . Neither the state evolution matrix F nor the observation matrix H
is forced to contain the zero blocks shown in Equations 4.45 and 4.46. Writing the model
in this fashion simply serves to show the sorts of structure which, subject to appropriate
estimation techniques, the LDM might discover in the data.
This interpretation of the modelling of the LDM aims to highlight the differences
between LDMs and autoregressive models. The addition of observation noise sets the two
apart by making the autoregressive component into a hidden process. When combined
with dimensionality reduction via the observation process, the effect is to ambiguate the
order of the modelling in the state. The equivalence between an autoregressive model and
a static model with δ features is due to the explicit linear relationship an autoregressive
process describes between observed feature vectors. This section has shown how modelling
is altered and hence how this explicit relationship is removed when the autoregressive
process instead describes an internal state of the model.
4.4 The LDM as a model for speech recognition
The internal variables in the hidden state reflect some of the known properties of
speech production, where articulators move relatively slowly along constrained trajecto-
ries. The continuous nature of the state means that temporal dependencies are modelled
for as long as the state is not reset, with the position of the state at time t affecting
its position at time t + τ . This could be the beginning and end of a phone or sentence
depending on how the model is implemented.
The model has built into it the notion of speech being modelled in a domain other than
the observations, which are seen as noisy transforms of an underlying process. A linear
mapping between state and observation processes dictates that points which are close in
state space are also close in observation space. Therefore, trajectories which are continu-
ous in state space are also continuous in observation space. If the hidden state is seen as
having articulator-like characteristics, such a constraint is not universally appropriate as
sometimes small changes in articulatory configuration can lead to radical changes in the
acoustics (examples of this were given in Section 3.1.1 on page 50). However, Section 3.4
on page 66 showed that whilst linear models do not give good descriptions of the depen-
dencies between phone segments, behaviour within phones can be adequately accounted
for by a linear predictor. This is reflected in the LDM-of-phone formulation: within phone
models, the output distribution evolves in a linear, continuous fashion. Discontinuities
and non-linearities can be incorporated at phone boundaries where resetting the state
and switching the observation process parameters H, v and D results in a sudden shift
in acoustic space. Alternatively, by passing state statistics across model boundaries, the
state process can remain continuous through such shifts.
Figures 4.11 and 4.12 give a visual depiction of how well an LDM can characterise
acoustic data. The top spectrogram was generated from the actual cepstral coefficients,
and is plotted as time against log frequency for the TIMIT sentence, ‘Do atypical farmers
grow oats?’ Red corresponds to regions of high energy, and blue to low. The second
spectrogram represents predictions of the cepstra made by a series of LDMs. A set of
phone models trained on the TIMIT training data was used to generate predictions the
length of the utterance. The time-aligned phonetic labels dictated which model was run
in each phone region, and the state predictions made during a forward pass of a Kalman
filter were transformed by the observation process to give a vector of mean predicted
cepstra for each frame. The state statistics were reset at the beginning of each new phone
to model-specific values, learnt during training. The LDM follows many of the spectral
characteristics present in the original spectrogram, though the effect of resetting the state
statistics at phone boundaries is apparent.
[Figure 4.11 appears here: a spectrogram plotted as Mel-warped frequency against time, with the time-aligned TIMIT phone labels marked.]
Figure 4.11: A spectrogram generated from the actual Mel-cepstra for the TIMIT sentence
‘Do atypical farmers grow oats?’
[Figure 4.12 appears here: the corresponding spectrogram generated from the LDM-predicted Mel-cepstra for the same sentence, with the same phone labels marked.]
Measured articulatory parameters have many of the same properties as the trajectories
which LDMs can generate, being smooth, slowly varying and continuous, yet noisy. Fur-
thermore, the model can absorb the asynchrony which exists between the motions of
different articulators. Making a model of measured articulatory data, such as that found
in the MOCHA corpus, is a situation in which fewer degrees of freedom will most likely be
required for modelling purposes than are originally present in the data. For example, the
EMA data includes x and y coordinates for three points on the tongue. These six data
streams are likely to be highly correlated and there will be redundancy of information.
Subspace modelling should be able to provide a compact representation of such a system.
The observation noise C has an interesting interpretation when dealing with articula-
tory data, as it can be seen as capturing the critical, or otherwise, nature of an articulator
during a given phone (see page 51 for a definition of critical articulators). It would be ex-
pected that the variances found on the data stream corresponding to a critical articulator
will be low compared to that of an articulator which is not critical for a given phone.
Work reported in Richmond et al. (2003) was concerned with recovering articulatory
traces from acoustic parameters using data from the MOCHA corpus. The variances
associated with recovery of articulator feature dimensions tongue tip y, upper lip x and
velum x were compared across the consonantal phones. The nature of the EMA data is
such that y-coordinates correspond to height and x-coordinates to frontness2 . In each
case, the phones for which the variance was low were found to correspond to those in
which the articulator was expected to be critical. Likewise, those with a high variance
on the network output corresponded to the articulators thought to be non-critical for
production of the given phone.
A similar experiment was conducted as part of this thesis’ exploration of LDMs. Mod-
els were trained on the EMA data for all 460 sentences from speaker fsew0 of the MOCHA
corpus, and the diagonal elements of the observation noise covariance C corresponding
to the three articulatory feature dimensions above were extracted. Table 4.1 gives the
variances for the 23 consonants in the MOCHA phone set ranked in order of magnitude.
2
This is approximate as x and y are relative to the bite-plane, which is measured during EMA recording.
The results tend to follow what might be expected, with tongue tip being critical for pro-
duction of [sh,s,z], and upper lip being non-critical for [y,t,d]. Furthermore, the variance
associated with nasalised phones for which the velum is lowered and open [m,ng,n] is high
compared to when it is closed, such as [zh,ch,jh]. One notable exception is for [t], which
gives one of the largest variances for the tongue tip, an articulator which would be expected
to be critical.
Table 4.1: LDMs were trained on the MOCHA EMA data, and shown here for the 23 consonantal
phones are the ranked variances associated with articulatory dimensions tongue tip y, upper lip x
and velum x. y-coordinates correspond to height and x-coordinates to frontness. In general, low
variances correspond to phones in which the articulator would be expected to be critical.
Since the use of LDMs for speech recognition is largely uncharted, many aspects of their
application are open for investigation. The task of classification is the ideal domain in
which to compare experimentally a number of modelling alternatives, which are described
below.
State dimension The discussion of the state process above illustrated the effect the
dimension of the hidden state has on the complexity of the motions which can be modelled.
If a continuous, dynamic state-space has something to offer, how many degrees of freedom
in the state are needed for speech data?
Form of H The original application of LDMs to speech as part of the SSM described
in Section 2.3.6 set H to the identity matrix. The model was thus cast as a smoothed
Gauss-Markov model, and subspace modelling was ignored. If there are fewer degrees
of freedom present in the data than in the observation, the compact parameterization
offered by a sub-space should improve modelling.
Form of the error terms The effect of forcing the error covariances to be diagonal
was discussed above. Forcing D to be diagonal or an identity matrix gives no theoretical
loss of generality, and can be implemented with one of two methods. An equivalent
model can be created by subsuming the correlation structure in to F and H leaving
D = I, or alternatively off-diagonal components can be set to zero during the M-step of
re-estimation. The latter may affect performance as correlation information accumulated
during the E-step is simply ignored. It may be useful to stop training the observation
noise εt after a few iterations, to focus any further learning on the other parameters.
Chapter 5

LDMs for Classification of Speech
Using LDMs for full speech recognition is of course the ultimate goal of this work. How-
ever, a classification task is an extremely useful staging post. Whereas recognition involves
jointly finding the most likely model sequence and segmentation, in a classification task
the segment start and end times are given and it is only the model sequence which must
be determined. This provides a framework in which to make comparisons between models
where the number of confounding factors is kept to a minimum. The experimenter is able
to refine the process of parameter initialisation, training and testing, safe in the knowledge
that no errors are introduced from such sources as decoding or duration modelling.
Sections 5.1 and 5.2 present the results of speaker-dependent MOCHA and speaker-
independent TIMIT phone classification tasks respectively. The experimental set-up is
straightforward, and the intention is to compare classification performance under a variety
of models and parameterizations of speech. Section 5.3 extends these basic results by
looking at a number of ways in which to develop the acoustic modelling.
In the classification experiments which follow, the LDMs are fully specified. The use-
fulness or otherwise of the dynamic component of the model is what is being assessed,
and will be decided on the relative performance of otherwise equivalent static models,
multivariate Gaussians and factor analysers (FA), compared with LDMs. As discussed in
Section 2.1.2 on page 19, factor analysers produce spatially correlated Gaussian output
distributions but with substantially fewer parameters than present in a full covariance
Gaussian. This may prove advantageous if there is insufficient data to give robust parameter estimates.
A paired t-test will be used to aid comparison of experimental set-ups. Such a test provides
a method for assessing if system A has given a consistently higher accuracy than system
B across the test-set. The test sentences are split into n groups (where n = 10, 24 for
MOCHA, TIMIT data respectively), and the classification accuracies under both systems
computed for each group. The hypothesis that the mean accuracy difference d̄ between
the systems is 0, H0 : d̄ = 0, is tested against the one-sided alternative H1 : d̄ > 0 by
computing

t = \frac{\bar{d}}{\sqrt{s_d^2 / n}} \qquad (5.1)
where s2d is the sample variance of the differences and n the number of pairs. This is
compared to a t-distribution with n − 1 degrees of freedom to give the probability of
finding such a value of t by chance. Low probabilities (p < 0.05 or p < 0.01) justify
rejecting H0 in favour of H1 and concluding that there is evidence supporting system
A’s superior performance. In this work, p < 0.01 will be assumed as a threshold unless
otherwise stated.
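A sketch of this test, using scipy.stats for the t-distribution (the function name and the per-group accuracy inputs are assumptions):

```python
import numpy as np
from scipy import stats

def paired_t(acc_a, acc_b):
    """One-sided paired t-test of Equation 5.1: is system A better than system B?

    acc_a, acc_b: per-group classification accuracies for the two systems,
    e.g. over the n = 10 (MOCHA) or n = 24 (TIMIT) groups of test sentences.
    """
    d = np.asarray(acc_a) - np.asarray(acc_b)
    n = len(d)
    t = d.mean() / np.sqrt(d.var(ddof=1) / n)   # Equation 5.1
    p = stats.t.sf(t, df=n - 1)                 # one-sided p-value
    return t, p
```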
5.1.1 Methodology
In all experiments, the data is split into three subsets, each of which has a distinct
role. Some is used for parameter estimation (training set), some for making intermediate
decisions (validation set), and the remainder is used for evaluation (test set). Models are
trained using an application of the EM algorithm, which, as with any iterative estimation
procedure can lead to overfitting of the data. In this case, the models learn the specific
characteristics of the training set but do not generalize well and perform poorly on unseen
data. Using a validation set provides a means of setting parameters such as the number
of training iterations and the language model scaling factor before the final models are
evaluated on the test set.
Training The time-aligned labels provided with the MOCHA corpus are used to ex-
tract all tokens corresponding to each of the 46 phone classes, so that a single context-
independent model can be trained for each. A number of EM iterations are carried out
with the parameters being stored at each. The bigram language model probabilities are
also estimated using the phone pairs present in the training set.
To ensure enough data for robust estimation, 4/5 of the 460 utterances are set aside for
training, and the remaining 1/5 divided equally between validation and test sets. Such a
division leaves quite a small test set and so, where possible, main classification results are
found using a K-fold cross-validation procedure which is described below. As long as a
test set consisting of only 46 sentences reflects the properties of the entire data, it can be
used to find preliminary results and the computation involved will only be a fraction of
full cross-validation. Results on the small test-set will be compared to those found with
a full cross-validation below on page 119.
For the full cross-validation, the corpus is divided into 5 equal subsets, labelled A to E.
The train/validation/test process is repeated 5 times, with A, B, C, D and E switching
roles until each has been used as a test-set. Table 5.1 shows the permutations used to enable a final
classification accuracy to be calculated for all the utterances in the corpus. From now
on, classification will refer to the basic classification procedure with the 46 utterance test
set, and when cross-validation has been used, it will be stated explicitly. Note that where
results are compared, evaluation will always be given for identical test sets.
training    validation    test
A, B, C     D             E
B, C, D     E             A
C, D, E     A             B
D, E, A     B             C
E, A, B     C             D
Table 5.1: The five cross-validation sets swap roles until each has been used for testing.
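The rotation in Table 5.1 can be generated as follows (a sketch; the assignment of the middle column to validation and the final column to testing is assumed):

```python
# Rotate the five utterance subsets A-E through training, validation and test roles.
subsets = ["A", "B", "C", "D", "E"]
folds = []
for i in range(5):
    train = [subsets[(i + j) % 5] for j in range(3)]   # three subsets for training
    validation = subsets[(i + 3) % 5]                  # assumed validation subset
    test = subsets[(i + 4) % 5]                        # assumed test subset
    folds.append((train, validation, test))

for train, validation, test in folds:
    print("train:", "+".join(train), " validation:", validation, " test:", test)
```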
The data-sets which are derived from the MOCHA corpus were described in Chapter 3.
Table 3.1 is repeated here as Table 5.2 to give a reminder of the various features which will
be used for experimentation. Where feature dimensions are 46 or greater, additional sets
are included where linear discriminant analysis (LDA) has been applied for dimensionality
reduction. No experiments are carried out without LDA post-processing for the combined
data sets with δ and δδ parameters as the resulting 81-dimensional features are considered
excessively large.
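As an illustration of the LDA post-processing step, the following sketch uses scikit-learn on placeholder data; the 81-dimensional input and 46 classes match the description above, but the target dimensionality of 45 (the maximum for 46 classes) is only an assumption, not the setting used in this work:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data standing in for stacked feature frames: 81-dimensional vectors
# (e.g. a combined set with δ and δδ parameters) labelled with one of 46 phone classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 81))
y = rng.integers(0, 46, size=5000)

# With 46 classes, LDA yields at most 45 discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=45).fit(X, y)
X_lda = lda.transform(X)
print(X_lda.shape)   # (5000, 45)
```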
As stated in the introduction to this chapter on page 113, the main thrust of this set
of experiments is to examine the contribution a dynamic hidden state makes to phone
classification accuracy, compared with otherwise equivalent static models. Results are
presented for each of the feature sets above, starting with MOCHA articulatory features
in section 5.1.3 on page 122 and acoustic-derived features in Section 5.1.4 on page 132.
Sections 5.1.5 and 5.1.6 on pages 137 and 143 then give classification results where acoustic
features are combined with real and network-recovered EMA respectively. Section 5.1.7
on page 148 presents a summary of the findings thus far.
To begin with, Section 5.1.2 describes preliminary experiments which compare results
on the small MOCHA test set and those found using a full K-fold cross-validation, and
also look at the effect that the data frame-rate has on classification accuracy.
Table 5.2: Each tick denotes a feature set which will be used for speaker-dependent classification.
All data is from speaker fsew0 of the MOCHA corpus. Where dimensions exceed 46, equivalent
sets are included where linear discriminant analysis (LDA) has been applied for dimensionality
reduction.
Figure 5.1 shows a plot of classification results using LDMs with EMA data as features
for state dimensions 0 to 20. Accuracies are shown both on the small test set from the
original train/test division, and where a 5-fold cross-validation has been employed. It
is apparent that accuracies on the 46 utterance test set are much more subject to random
variation, and are also slightly higher than those where a 5-fold cross-validation has been
used. However, the trend for both is that accuracy improves as the state size increases up
to a dimension of around 7 where the graphs remain largely static. Figure 5.1 suggests
that a more reliable result can be obtained through cross-validation, though a subset of
the data is adequate for preliminary experimentation.
Figure 5.1: Classification accuracies shown for EMA data with LDMs as the acoustic
model. The solid red line shows cross-validation classification accuracies for state dimen-
sions of 0 to 20, and the dashed blue line shows classification accuracies on a reduced
test-set.
When preprocessing data for ASR, the experimenter must choose a frame rate (and win-
dow size in the case of acoustic data) appropriate to the model being used. A high
data-rate means more computation, so a trade-off between speed and accuracy can arise.
Figure 5.2: Classification accuracies shown for EMA, MFCC and PLP data from the
MOCHA corpus for frame shifts between 2 and 14ms. A 10ms frame shift was adopted for all further work.
EMA data The EMA data is sampled every 2ms. Down-sampling was performed by
first low-pass filtering the data, and then choosing every 2nd, 3rd, 4th, 5th, 6th or 7th
frame to give spacings of between 4ms and 14ms. A set of LDMs with a 9-dimensional
state vector was used for classification with no cross-validation and the results are shown
in Table 5.3 and Figure 5.2. Larger frame spacings than the original give improved
performance. To test if this was due to the smoothing nature of the low-pass filtering,
Table 5.3: Classification accuracies for systems using LDMs with 9-dimensional states to perform
classification on real EMA, MFCC and PLP data from the MOCHA corpus. Frame shifts of 10ms
and 12ms give the highest accuracies.
the EMA data was filtered whilst maintaining the original frame spacing. This gives a
classification accuracy of 51.3%, lower than with 2ms shift data used raw. Frame-rates of
10ms and 12ms produced the highest classification accuracies.
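A sketch of the down-sampling procedure (low-pass filter, then keep every q-th frame) using scipy; the anti-aliasing filter here is scipy's default FIR design, which is an assumption rather than the filter actually used:

```python
from scipy.signal import decimate

def downsample_ema(ema, q):
    """ema: (T, d) array of EMA frames at a 2 ms shift; q: integer decimation factor (2-7).

    decimate() low-pass filters before keeping every q-th frame, giving frame shifts
    of 4-14 ms, similar in spirit to the procedure described above."""
    return decimate(ema, q, ftype="fir", zero_phase=True, axis=0)
```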
Acoustic data Section 3.1.2 described the process of producing PLP and MFCC fea-
tures. Analysis is performed on a series of overlapping windowed regions of the speech
signal. A window size of 20ms or 25ms is a common choice for ASR as it provides a rea-
sonable level of smoothing whilst still capturing many of the short-time events. Different
systems use differing frame-shifts, for example 16ms (with a 32ms window) in the hybrid
ANN/HMM system described in Robinson et al. (2002) or 10ms (with a 25ms window)
in a typical HMM system (Young et al. 2002). For this experiment, PLP and MFCC
coefficients were generated using 20ms windows on the acoustic signal, and the frame
shift varied between 2ms and 14ms.
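For illustration, MFCCs with a fixed 20ms window and a variable frame shift could be computed as below; this uses librosa rather than the tools used for this work, and the sample rate, FFT length and cepstral order are assumptions (the file name is hypothetical):

```python
import librosa

def mfcc_with_shift(wav_path, frame_shift_ms, window_ms=20.0, n_mfcc=13, sr=16000):
    """Sketch of MFCC extraction with a 20 ms analysis window and a chosen frame shift."""
    y, _ = librosa.load(wav_path, sr=sr)
    hop = int(round(sr * frame_shift_ms / 1000.0))   # frame shift in samples
    win = int(round(sr * window_ms / 1000.0))        # window length in samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, win_length=win, hop_length=hop)
    return mfcc.T   # one row per frame

# e.g. frames = mfcc_with_shift("fsew0_001.wav", frame_shift_ms=10.0)  # hypothetical file
```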
Classification accuracies for a set of LDMs with a 9-dimensional state are given in
Table 5.3 and Figure 5.2 alongside those for the EMA data. For both PLP and MFCC
features, a 10ms frame-shift gives the highest classification performance. For all features
and all further experiments, a 10ms frame-shift will be used.
For all experiments which follow, the results of classification based on the single train/test
division described in Section 5.1.1 are used to determine the state dimension of LDM and
factor analyser models to be used in a 5-way cross-validation. Results using LDMs are
shown graphically for each feature set. None of these preliminary results are reported in the text, though all are given in full in Appendix E. Where the classification accuracy using
LDMs is statistically significantly higher than for both static models, results are shown
in bold face.
EMA data Figure 5.3 shows phone classification results using LDMs with EMA, EMA
with δs, and EMA with δs and δδs as the features. The state dimension ranges from 0 to
22, where a 0-dimensional state corresponds to a full covariance Gaussian classifier. There
is fluctuation in the classification accuracy as the size of the state is varied, though patterns
are still apparent. LDM performance for EMA features without δs or δδs improves as
the state dimension increases, though remains fairly consistent for models with state
sizes of between 5 and 18. The EMA data is a 14-dimensional vector, and whilst a much lower-dimensional state offers close to the highest LDM classification performance, it is not until the state dimension exceeds that of the data, at around 18, that classification
accuracy deteriorates. It is likely that under these conditions, there are more model
parameters than can be robustly estimated from the available data. There is not one
obvious optimal model dimension, though 12 produces the highest accuracy and so is
used as the state size for LDM cross-validation classification. Adding δ and δδ features
gives a clear increase in performance. There is still a relationship between state dimension
and accuracy, as the accuracy increases with state size up to dimensions of around 16.
The cross-validation classification results in Table 5.4 show that with and without
δ features, the modelling of dynamics in an LDM gives an improvement in accuracy
compared to static models. The relative error reductions for including dynamic modelling
over the best static model are 3.8%, 5.8% and 6.1% for EMA data alone, adding δs and
further adding δδs respectively. Surprisingly it is the last of these for which there is the
largest relative improvement. The δs and δδs are included for the purposes of adding
dynamic information, and so it was expected that in this case there would be the least
Figure 5.3: Speaker-dependent classification accuracies for systems with LDMs used as the acous-
tic model. The features are EMA data, EMA data with δ coefficients, and EMA data with δ and
δδ coefficients. Accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                 EMA       EMA + δ    EMA + δ + δδ
FA    state dim  12        11         16
      accuracy   57.5%     63.0%      63.4%
LDM   state dim  13        19         17
      accuracy   59.1%     65.7%      66.1%
Table 5.4: Cross-validation classification accuracies for systems with LDMs and FA models as
the acoustic model. The features are EMA data, EMA data with δ coefficients, and EMA data
with δ and δδ coefficients. LDM accuracies in bold face are statistically significantly higher than
for either of the static models.
benefit in using a dynamic model. However, LDM models using features which include
δs give a higher accuracy than either of the static models which also use δδs.
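For reference, the relative error reductions quoted in this chapter appear to be computed on classification error rates rather than on the accuracies themselves; the 3.8% figure for EMA alone, for instance, is consistent with

\[
\frac{(100 - 57.5) - (100 - 59.1)}{100 - 57.5} \;=\; \frac{59.1 - 57.5}{100 - 57.5} \;=\; \frac{1.6}{42.5} \;\approx\; 3.8\%.
\]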
Extended articulatory features Figure 5.4 shows classification accuracies for LDMs
with the state dimension varying from 0 to 24 for the full articulatory features alone, with
δs, with δs and δδs, and where LDA has been used for dimensionality reduction on the last
of these. For all feature sets, the classification accuracies increase with the dimension of
the hidden state, reaching a plateau at around 13. These features consist of the EMA data
as used in the previous experiments with the addition of EPG and laryngograph data.
These combine to give a 19-dimensional feature vector for which the highest accuracies
occur with larger state dimensions than for the EMA data used alone. As in the previous
set of experiments, adding δs gives a clear increase in classification accuracy. The effect
of further adding δδs or then using LDA for dimensionality reduction is not so apparent
from the graph. The trend of improving results as the state dimension increases persists
in each case.
For all feature sets, there is a statistically significant increase in the classification
accuracy where dynamic models have been used. This shows that improvements are
consistent over the test set. In a reversal from the EMA data, the smallest relative error
reductions on including dynamic modelling are given where δ parameters are incorporated
in the feature set. These are 6.5%, 6.1% and 4.1% for the features used alone, with δs,
and with δs and δδs. The smallest gain of 2.4% is found when LDA has been used for
post-processing. Given that LDA projects the data so as to maximise linear class separability, it
is not surprising that a static Gaussian is able to give close to the discriminatory power
of a dynamic linear Gaussian model.
Figure 5.5 shows a confusion matrix of the classifications made by LDMs with a 19-
dimensional state and δs included. This corresponds to the accuracy of 72.1% given in the
2nd column of Table 5.5. The vertical and horizontal axes show true and classified phone
identity respectively. The strong diagonal shows that many phones are correctly classified,
and the shaded areas off the diagonal display where the errors fall. The strong vertical
line in the upper portion of the table shows that vowels and diphthongs are commonly
misclassified as schwa, denoted by [@]. This is expected, as schwa is one of the most
common and highly variable sounds in the English language and can be thought of as the
‘neutral position’ of the articulators. Another common mistake is to misclassify fricatives
and affricates as [t]. The parallel diagonal lines about [b, d, g] and [p, t, k] show that
whilst the models can distinguish stops according to place and manner of articulation, making a voiced/voiceless decision is prone to inaccuracy.
Figure 5.4: Speaker-dependent classification accuracies for systems with LDMs used as the acous-
tic model. The features are the extended articulatory set from the MOCHA corpus comprising
EMA, laryngograph and EPG data. Accuracies (y-axis) are shown for a variety of state dimensions
(x-axis) and with the data used raw, or post-processed using LDA.
                 EMA+EPG+LAR   + δ       + δ + δδ   + δ + δδ (LDA)
FA    state dim  7             21        15         17
      accuracy   65.8%         70.0%     70.4%      71.0%
LDM   state dim  21            19        15         22
      accuracy   68.3%         72.1%     72.1%      72.0%
Table 5.5: Cross-validation classification accuracies for systems with LDMs and FA models as the
acoustic model. The features are the full MOCHA articulatory set comprising EMA, laryngograph
and EPG data. Results are shown with the data used raw, or post-processed using LDA. LDM
accuracies in bold face are statistically significantly higher than for either of the static models.
Figure 5.5: This confusion matrix corresponds to the classifications made by LDMs with an
articulatory feature set comprising EMA, laryngograph and EPG data with δ parameters and a
19-dimensional state. The overall classification accuracy is 72.1% and is given in Table 5.5. The
strong diagonal shows that many phones are correctly classified, though the vertical line in the
upper left of the table shows that vowels and diphthongs are commonly misclassified as schwa,
denoted by [@]. Another common mistake is to misclassify fricatives and affricates as [t]. The
parallel diagonal lines about [b, d, g] and [p, t, k] show that whilst the models can distinguish
stops according to place and manner of articulation, making a voiced/voiceless decision is prone
to inaccuracy.
Figure 5.6: Speaker-dependent classification accuracies for systems with LDMs used as the acous-
tic model. The features are network-recovered EMA data used alone, with δ coefficients, and with
δ and δδ coefficients. Accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                 net EMA    net EMA + δ   net EMA + δ + δδ
FA    state dim  16         20            19
      accuracy   49.4%      49.5%         50.5%
LDM   state dim  9          13            20
      accuracy   57.1%      59.3%         59.6%
Table 5.6: Classification accuracies for systems with LDMs and FA models as the acoustic model.
The features are network-recovered EMA data used alone, with δ coefficients, and with δ and δδ
coefficients.
Figure 5.7: Comparison by phone category of the classification accuracies of the highest scoring
systems using network-recovered and measured EMA data as features.
Table 5.7 gives the ranked average root mean squared error (RMSE) of the network predictions of the EMA data on the test set for each phone category. The networks give the best recovery of
data corresponding to affricates and voiceless fricatives, categories which give some of the
closest performances under the real and recovered feature sets. Overall, there appears to
be some correspondence between accurate recovery of articulation and emulation of the
classification performance using real features, though the evidence here is by no means
conclusive.
phone category          average RMSE
affricates              0.288
fricatives-unvoiced     0.313
diphthongs              0.315
vowels-front            0.317
vowels-mid              0.327
vowels-back             0.334
oral stops-voiced       0.335
fricatives-voiced       0.336
glides                  0.341
oral stops-unvoiced     0.348
liquids                 0.354
nasal stops             0.363
non-speech              1.022
Table 5.7: Ranked average root mean squared error (RMSE) for the network predictions of the
EMA data on the test data for each phone category.
PLP features Figure 5.8 shows classification accuracies using LDMs with a PLP pa-
rameterization of the acoustics and a state dimension ranging from 0 to 20. Two main
observations can be made of this graph. Firstly, there is little performance improvement
on adding δ and δδ coefficients. Secondly, it is only PLPs used with no δ or δδ features
which show a visible trend of classification accuracy increasing with state dimension.
Table 5.8 shows the cross-validation classification results for LDMs, factor analysers
and full covariance Gaussian classifiers. For each of the feature sets, the LDMs provide
the highest accuracy, giving relative error decreases of 1.3%, 3.1% and 1.3% over the best
static models for PLPs used alone, with δs, and with δs and δδs respectively. However, of
these, it is only PLPs with δs which yield a statistically significant performance increase,
though with p < 0.025 rather than the p < 0.01 which is assumed elsewhere. This result
of 71.8% is close to the highest classification accuracy found using articulatory features,
which was 72.1% for the combined EMA, laryngograph and EPG data with respective δs.
MFCC features The graph in Figure 5.9 shows LDM classification accuracies for
MFCC features with and without δ and δδ parameters for state dimensions between
0 and 20. The addition of δs gives a clear increase in performance, though further adding
δδs appears to contribute little. When MFCCs are used alone as features, there is a trend
of accuracy improving as the state size increases, right up to a dimension of around 10.
The increases are more consistent and much less noisy than has been seen for previous
feature sets.
The corresponding cross-validation classification results are given in Table 5.9, along
with results for full covariance Gaussian classifiers and factor analyser models. With
MFCCs used alone and with δ coefficients, the inclusion of dynamic modelling proves
useful, giving 3.9% and 4.9% relative error reductions over the highest scoring static
model in each case. The highest overall classification accuracy with MFCCs of 75.0%
was found when δs are included in the feature set for LDMs with a state dimension of
9. Adding δδ parameters gives a slight decrease in performance, and no evidence to
suggest that modelling of dynamics is beneficial in this case. It may be that the resulting 39-dimensional feature vector proves too large for robust estimation of the LDM’s parameters.
Figure 5.8: Speaker-dependent classification accuracies for systems with LDMs used as the acous-
tic model. The features are PLPs, PLPs with δ coefficients, and PLPs with δ and δδ coefficients.
Accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                 PLP       PLP + δ    PLP + δ + δδ
FA    state dim  17        13         19
      accuracy   68.5%     70.2%      69.5%
LDM   state dim  9         13         15
      accuracy   69.7%     71.8%∗     70.4%
Table 5.8: Cross-validation classification accuracies for systems with LDMs and FA models as
the acoustic model. The features are PLPs, PLPs with δ coefficients, and PLPs with δ and δδ
coefficients. Result ∗ is significant with p < 0.025 rather than with p < 0.01 level as assumed
elsewhere.
Figure 5.9: Speaker-dependent classification accuracies for systems with LDMs used as the acous-
tic model. The features are MFCCs, MFCCs with δ coefficients, and MFCCs with δ and δδ
coefficients. Accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                 MFCC      MFCC + δ   MFCC + δ + δδ
FA    state dim  8         14         14
      accuracy   69.2%     73.6%      73.8%
LDM   state dim  10        9          1
      accuracy   70.5%     75.0%      74.3%
Table 5.9: Cross-validation classification accuracies for systems with LDMs and FA models as
the acoustic model. The features are MFCCs, MFCCs with δ coefficients, and MFCCs with δ and
δδ coefficients.
Both of the static models give slight accuracy increases when δδs are added, though still
produce lower accuracies than the LDM result which only includes δs.
Figure 5.10: Comparison by phone category of the classification accuracies of the highest scoring
systems using PLP and MFCC features. The overall accuracies are 71.8% and 75.0% for PLP and
MFCC features respectively.
Figure 5.10 gives a pictorial comparison of LDM classification from the best PLP
and MFCC systems. In both cases these are where δ parameters are included with the
features. The accuracy using MFCCs is higher overall than for PLPs, 75.0% compared
to 71.8%, though non-speech and diphthong segments give slightly better classification
performance with PLP features. The categories for which there are the largest differences
are front vowels and liquids for which the classification accuracies with MFCC features
are 8.3% and 5.3% greater than those with PLPs.
Figure 5.11 compares the MFCC + δ system of the previous graph with the most ac-
curate articulatory feature system, in which an LDM with a 19-dimensional state models
measured EMA, laryngograph and EPG data. The latter results were originally given in
Table 5.5 on page 125. Overall, the acoustic features give a higher classification accuracy than the articulatory features,
Figure 5.11: Comparison by phone category of the classification accuracies of the highest scoring
systems using extended articulatory and MFCC features. The overall accuracies are 72.1% and
75.0% on articulatory and MFCC data respectively.
though the differences are not evenly spread across the phone categories, with performance based on the articulatory features being marginally superior
for nasal stops, liquids and affricates. The category for which the acoustic features pro-
duce the largest improvement over articulatory is voiced oral stops, though the confusion
matrix of Figure 5.5 on page 127 suggested that these errors can be attributed to poor
voiced/voiceless decisions. Back vowels and non-speech phones are also more accurately
classified using acoustic features. The latter category includes silence, which a model using acoustic features would be expected to detect with some certainty.
PLPs and measured EMA data Figure 5.12 shows LDM classification accuracies for
feature sets which combine PLP and measured EMA data for state dimensions ranging
from 0 to 28. As before, a 0-dimensional state corresponds to a full covariance Gaussian
classifier. With an input of plain PLP and EMA features, there is a clear yet noisy increase
in classification accuracy as the size of the state vector increases until reaching a dimension
of around 15. As before, there is no one optimal dimension, though a state of size 17
produces the best system performance. Adding δ parameters gives a visible improvement
in the accuracy of the system, though there is a less marked effect on performance by
varying the state dimension. These results appear similar to those found using the sets
of features where LDA has been used for dimensionality reduction.
Table 5.10 shows cross-validation classification accuracies for these combinations of
PLP and EMA features for LDMs, factor analyser models and full covariance Gaussian
classifiers. In all cases, the LDM gives the highest accuracy, though the increases of
performance over the best static models are not significant when the data has been post-
processed using LDA. The relative error reductions from including a dynamic state for
the combined data used alone and with δs are 9.5% and 5.5% respectively. The latter, where features are combined with their δs, gives the highest overall accuracy of 79.2%.
MFCCs and measured EMA data Figure 5.13 shows LDM classification accuracies
for feature sets which use MFCCs in combination with measured EMA data and state
dimension ranging from 0 to 24. The clearest trend is shown where the features do not
include δs. The classification accuracy increases with the state size until a dimension of 15
is reached. Performance then tails off when the state attains a dimension of 20. It is likely
that in this case, there is insufficient data to produce robust estimates of the parameters
of models with larger dimensioned states. Classification accuracy is improved when δs
corresponding to both the MFCC and EMA parameters are included in the features, and
similar results are found with or without post-processing using LDA.
Table 5.11 shows the cross-validation classification accuracies for LDMs, along with full
covariance Gaussian classifiers and factor analyser models. In all cases, LDMs produce the
most accurate phone classification, though the increases over the static models are not as
Figure 5.12: Speaker-dependent classification accuracies with LDMs used as the acoustic model.
The features are combinations of PLPs and real EMA data, used raw or post-processed using
LDA. Classification accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                           raw                          LDA
                 PLP + EMA   PLP + EMA + δ   PLP + EMA + δ   PLP + EMA + δ + δδ
FA    state dim  8           23              12              25
      accuracy   72.7%       77.6%           77.5%           78.2%
LDM   state dim  17          24              11              8
      accuracy   76.3%       79.2%           78.4%           78.6%
Table 5.10: Cross-validation classification accuracies for systems with LDMs and FA models as
the acoustic model. The features are combinations of PLPs and real EMA data, used raw or
post-processed using LDA.
Figure 5.13: Speaker-dependent classification accuracies with LDMs used as the acoustic model.
Features are combinations of MFCCs and real EMA, used raw or post-processed using LDA.
Classification accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                            raw                            LDA
                 MFCC + EMA   MFCC + EMA + δ   MFCC + EMA + δ   MFCC + EMA + δ + δδ
FA    state dim  13           20               9                12
      accuracy   74.0%        77.6%            77.3%            76.4%
LDM   state dim  17           22               16               14
      accuracy   75.8%        78.4%∗           78.8%∗           77.6%
Table 5.11: Cross-validation classification accuracies for systems with LDMs and FA models as
the acoustic model. The features are combinations of MFCCs and real EMA data, used raw or
post-processed using LDA. Results ∗ are significant with p < 0.025 rather than p < 0.01.
great as above where PLPs are combined with real EMA data. The gain from including
dynamic modelling is largest where no δs are included, with the accuracy increasing
from 74.0% to 75.8%, a relative error reduction of 6.9%. Including deltas, with and
without LDA dimensionality reduction results in relative error decreases of 3.1% and
3.2% respectively, though these are significant with p < 0.025 rather than p < 0.01 which
is used elsewhere for comparisons. The highest accuracy of 78.8% was obtained with
MFCC and EMA features along with all corresponding δs post-processed using LDA as
described in Section 9 on page 62. The models were LDMs with a 16-dimensional state.
In this case, applying LDA gives a marginal improvement, as the equivalent system with the same features used raw gives a result of 78.4%.
Table 5.12: This table gives a summary of the classification accuracies found using LDMs with
combinations of acoustic and measured articulatory features. The corresponding acoustic-only
results are also shown for comparison. ⇓ denotes that LDA has been used for dimensionality
reduction.
Table 5.12 summarises classification accuracies presented in the last two sections:
those using just acoustic parameters and those where acoustic and measured articula-
tory data are combined. These figures show that adding articulatory features improves
acoustic-only classification, with the largest error reductions given where the acoustic
features are PLPs and LDA is used for post-processing. The relative error reductions
in this case are 27.0% and 27.7% where δs and both δs and δδs are included prior to
dimensionality reduction. The highest acoustic-only accuracy with PLPs of 71.8% was
increased to 79.2%, corresponding to a relative error reduction of 26.2%.
Acoustic-only classification with MFCCs gives higher classification accuracies than
Figure 5.14: Classification accuracies compared by phone category for PLP features with δs and
combined PLP and real EMA data again with δs. The latter gives the highest accuracy of the
combined feature sets for LDMs with a 24-dimensional state.
with PLPs, and the increases on adding articulatory data are correspondingly lower. The
largest improvement was found for MFCCs when no δ parameters are included, with a
relative error reduction of 18.0%. The highest overall acoustic-only result of 75.0% for
MFCCs with δs was increased to 78.8% on addition of real EMA data and δs, and post-
processing with linear discriminant analysis. However, the best result for combinations of
acoustic and measured articulation of 79.2% is found on PLP and EMA parameters with
their respective δs. A breakdown of this result by phone category is shown in Figure 5.14,
along with the accuracies obtained using PLPs with δs. Classification performance for
diphthongs is slightly higher using only PLPs, and similar for voiced oral stops, affricates
and non-speech segments. However, classification of liquids, voiceless stops, voiceless
fricatives, nasal stops along with front and mid vowels all give in the region of 10% higher
accuracy using the combined features.
Figure 5.15 shows a confusion matrix corresponding to the combined PLP, EMA and δs
result. As with the confusion for the extended articulatory feature set on page 127, a
common error is to misclassify vowels as schwa. The voiced/voiceless errors are still in
evidence, as is misclassification of fricatives and affricates as [t] and [d].
Figure 5.15: This confusion matrix corresponds to the classifications made by LDMs with a
combined PLP and real EMA feature set with δ parameters as presented originally in Table 5.10.
Common errors include misclassification of vowels as schwa, denoted by [@], and fricatives and
affricates as [t] and [d]. Also, voiced/voiceless errors are apparent with voiced oral stops [b, d,
g] classified as their voiceless equivalents [p, t, k].
As with the experiments using network-recovered EMA on page 128, the results in this
section will be based on the single train/test division, rather than a 5-fold cross-validation
of the data.
PLPs and network-recovered EMA data Figure 5.16 shows classification accuracies
for LDMs with combined PLP and recovered EMA feature sets for state dimensions
0 through to 24. The plots show the accuracies for features used alone and with δs.
Furthermore, results are given where δ and δδ parameters are included and LDA used
for post-processing. The strongest trend for phone discrimination to improve as the state
dimension increases is shown where no δs are included in the features. The performance
reaches a plateau when the state dimension is in the region of around 12 and then declines
again, which may be due to over-parameterization. Including δs and further adding
δδs gives a slight improvement to the classification accuracies, though not as much as has been seen for other feature sets. Post-processing with LDA does not appear to improve
classification accuracy.
Table 5.13 shows the results for which accuracies on the validation data are highest
using LDMs and factor analysers, along with results using full covariance Gaussian clas-
sifiers. In all cases, the LDMs provide statistically significant improvements in accuracy
over the best static model, with the largest relative error reduction of 9.5% being where
δs are included and LDA has not been used. It is this combination which also provides
the best classification performance of 74.3%.
MFCCs and network-recovered EMA data Figure 5.17 shows an equivalent set of
results where MFCCs are combined with network output. Experiments use LDMs with
state dimensions ranging from 0 to 24, and features with and without δs, either used
raw or subject to LDA post-processing. There is a noticeable trend for the classification
accuracies for all features to increase as the state dimension does, up to a size of around
15. Adding δs improves phone discrimination, though applying LDA to the data makes
no further contribution.
Table 5.14 shows the best of these results along with corresponding accuracies using
Figure 5.16: Speaker-dependent classification accuracies with LDMs as the acoustic model. The
features are combinations of PLPs and network-recovered EMA, used raw or post-processed using
LDA. Classification accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                             raw                                  LDA
                 PLP + net EMA   PLP + net EMA + δ   PLP + net EMA + δ   PLP + net EMA + δ + δδ
FA    state dim  20              20                  14                  18
      accuracy   69.5%           68.2%               70.8%               71.1%
LDM   state dim  20              12                  14                  18
      accuracy   71.7%∗          74.3%∗∗             73.4%∗              71.9%
Table 5.13: Classification accuracies for systems using LDMs and FA models. The features are
combinations of PLPs and network-recovered EMA data, used raw or post-processed using LDA.
Results ∗ and ∗∗ are significant with p < 0.025 and p < 0.05 respectively.
Figure 5.17: Speaker-dependent classification accuracies with LDMs as the acoustic model. The
features are combinations of MFCCs and network-recovered EMA, used raw or post-processed
using LDA. Classification accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                              raw                                    LDA
                 MFCC + net EMA   MFCC + net EMA + δ   MFCC + net EMA + δ   MFCC + net EMA + δ + δδ
FA    state dim  20               29                   19                   24
      accuracy   68.7%            71.5%                72.8%                70.7%
LDM   state dim  21               19                   16                   27
      accuracy   72.0%∗           74.7%                73.1%                73.2%∗∗
Table 5.14: Classification accuracies for systems using LDMs and FA models. The features are
combinations of MFCCs and network-recovered EMA data, used raw or post-processed using LDA.
Results ∗ and ∗∗ are significant with p < 0.025 and p < 0.05 respectively.
factor analysers and full covariance Gaussian classifiers. The inclusion of a model of the
underlying dynamics provides accuracy increases for all feature sets, the largest of these
being where LDA has not been used. The relative error reductions for the features used
alone and with δs are 10.8% and 10.1% respectively. The latter also gives rise to the most
accurate system, where LDMs with a 16-dimensional state correctly classified 75.2% of
phones in the test data.
Table 5.15: This table gives a summary of the classification accuracies found using LDMs with
combinations of acoustic and network-recovered articulatory features. The corresponding acoustic-
only results are also shown for comparison. ⇓ denotes that LDA has been used for dimensionality
reduction. Results marked ∗ are significant with p < 0.05 rather than p < 0.01.
Table 5.15 shows a summary of LDM results using acoustic-only and combined acous-
tic and automatically recovered articulatory features. Note that all results quoted here
correspond to the train/test division of the basic classification procedure as described
on page 116 rather than a 5-fold cross-validation. Following this division, which was used
in generating the network-recovered articulatory features, ensures that classification test
sentences did not occur in the network training set. Adding network-recovered articula-
tion to each of the PLP feature sets yields increases in classification accuracy. However, it
is only the set which includes δs where the increase is statistically significant, and this is
with p < 0.05 rather than p < 0.01. In this case, the highest PLP classification accuracy
of 72.5% is increased giving 74.3%, representing a relative error reduction of 6.5%.
Combining MFCC and network-recovered articulation gives an improvement on the
acoustic-only baseline where no δ and δδ parameters are included with the features. The
classification accuracy is increased from 70.9% to 72.0%, which represents a relative error
reduction of 3.8%. A paired t-test shows that the difference between these results is
significant with p < 0.05. However, the addition of network-recovered articulation gives
reductions in classification accuracy where δs or δs and δδs are included with the original
MFCC results. For MFCCs with δs and δδs, the reduction is from 75.6% to 73.2%,
representing a relative error increase of 3.2%.
These results show that in some cases, adding network-recovered articulatory param-
eters to acoustic features gives increased classification accuracy. However, none of these
give improved performance over the highest acoustic-only accuracy of 75.6%.
A multitude of classification results were presented in the previous section. The central
question which these experiments were designed to address is whether the addition of an
explicit model of inter-frame dependency would improve classification performance over
that of a static model which assumes framewise independence.
Table 5.16: Comparison of the overall best results using static and dynamic models on data
derived from MOCHA acoustic and acoustic-articulatory data. Result ∗ is significant with p <
0.025 rather than with p < 0.01 level as assumed elsewhere.
Table 5.16 shows the overall best speaker-dependent classification accuracies using
static and dynamic models on acoustic-only and combined acoustic-articulatory data.
Note that the systems which include recovered articulation come into the category of
using acoustic-only data. In each case, the static and dynamic models give their highest
accuracies on different feature vectors. Such a comparison is valid as the original data from
which the parameters are generated is identical in each case. Equal care has been taken
in optimising the performance of both static and dynamic models, and it is unsurprising
that each model type favours certain features. In both cases, the dynamic models give
marginally superior performance, significant with p < 0.025. Static models give their
highest classification accuracies where both δs and δδs are included in the features, where
dynamic models use only δs. In this case, the LDM state can be seen to provide, and indeed exceed, the information encapsulated by the δδ parameters.
5.2 Speaker-independent TIMIT classification
Data from the MOCHA corpus offers the unusual opportunity of examining the effect
on classification and recognition accuracy of combining acoustic and articulatory data.
However, experiments were speaker-dependent and models trained on a little over 15
minutes of speech data. The TIMIT corpus provides over 4 hours of speech data from
630 speakers which has been hand-labelled at the phone level, providing an ideal basis
for speaker-independent phone classification and recognition experiments. Well-known
benchmark results exist for TIMIT experiments (see Table 2.2 on page 46), allowing
meaningful comparisons of results with those of other systems. The training set consists
of 124412 phone segments from 462 speakers, compared to the MOCHA fsew0 training
data which comprises 12651 phone segments from a single speaker.
5.2.1 Methodology
The TIMIT corpus has designated training and test sets. However, as in the experiments
using MOCHA data, a validation set is required. Of the 462 speakers making up the
training data, data from 60 of them was set aside for validation. These are listed in
Appendix C, and were chosen with the proportion of speakers from each of the 8 dialect
regions following the distribution in the test set. For classification, a number of EM
iterations are performed to estimate parameters using the reduced training data and the
models stored. Classification accuracy on the validation data is used to determine how
many iterations the models should be trained for, and choose a language model scaling
factor. Models are then retrained using the combined training and validation data, and
data from the test set used to produce a final classification accuracy. Results will be shown
pictorially for models with a range of state dimensions; however, the final result quoted for each feature set corresponds to the configuration which gives the highest classification
accuracy on the validation data. All results are given in full in Appendix E.
Lee & Hon (1989) introduced a set of allowable confusions which is commonly used
when reporting results on the TIMIT corpus. These are listed in Table 5.17, and provide
a means of collapsing the original 61 phone set down to 39. Test set results will be quoted
on the 39 phone set, though any validation accuracies will relate to the original 61 phones.
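For illustration, a few of the better-known foldings might be applied as below; this is only a partial, assumed subset of the mapping in Table 5.17, not a restatement of it:

```python
# Illustrative subset of the Lee & Hon (1989) folding from 61 to 39 phone classes.
# Closure and pause symbols are typically merged with silence, and the glottal
# stop [q] is commonly discarded (assumptions here; see Table 5.17 for the full set).
FOLD = {
    "ao": "aa", "ax": "ah", "axr": "er", "hv": "hh", "ix": "ih", "ux": "uw",
    "el": "l", "em": "m", "en": "n", "zh": "sh",
    "bcl": "sil", "dcl": "sil", "gcl": "sil", "pcl": "sil", "tcl": "sil",
    "kcl": "sil", "pau": "sil", "epi": "sil", "h#": "sil",
}

def fold_phone(phone):
    """Map a TIMIT label onto the reduced set (identity for labels not folded)."""
    return FOLD.get(phone, phone)
```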
Table 5.17: This set of allowable confusions was introduced by Lee & Hon (1989) and is commonly
used when quoting results on the TIMIT corpus.
PLP features Figure 5.18 shows classification accuracies for LDM systems with PLP
features used plain, with δs, and with δs and δδs. The state dimension ranges from 0 to 20,
where as before, 0 corresponds to a full covariance Gaussian classifier. It is apparent that
there is no one optimal size of state vector, though there is a performance improvement as
the dimension increases. This is most marked in the case of the PLP features used alone
as the size of state vector increases from 1 to 5. With no δs or δδs included, the highest
classification accuracy is 67.8%, given by a set of LDMs with a state dimension of 8.
Figure 5.18 also shows that adding δ coefficients to the PLP parameters improves phone
discrimination, as does further including δδ features. The highest classification accuracies
for these feature sets are 71.0% and 72.2%, given by LDMs with 9 and 13-dimensional
states respectively.
Table 5.18 gives a summary of the results found with LDMs, factor analysers, and also
full covariance Gaussian classifiers. As before, bold face signifies that LDM results are
statistically significantly higher than for both of the other models. For all feature sets,
the modelling of dynamics offers a modest but consistent improvement in classification
accuracy. The largest relative error decrease of LDM against the best static model is
Figure 5.18: Speaker-independent classification accuracies for systems with LDMs used as the
acoustic model. The features are PLPs, PLPs with δ coefficients, and PLPs with δ and δδ
coefficients. Accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                 PLP       PLP + δ    PLP + δ + δδ
FA    state dim  13        20         20
      accuracy   66.5%     69.3%      70.9%
LDM   state dim  10        9          13
      accuracy   67.8%     71.0%      72.2%
Table 5.18: Classification accuracies for systems with LDMs and FA models as the acoustic
model. The features are PLPs, PLPs with δ coefficients, and PLPs with δ and δδ coefficients.
Figure 5.19: Speaker-independent classification accuracies for systems with LDMs used as the
acoustic model. The features are MFCCs, MFCCs with δ coefficients, and MFCCs with δ and δδ
coefficients. Accuracies (y-axis) are shown for a variety of state dimensions (x-axis).
                 MFCC      MFCC + δ   MFCC + δ + δδ
FA    state dim  10        20         19
      accuracy   66.3%     70.0%      70.7%
LDM   state dim  12        12         9
      accuracy   67.4%     71.3%      72.3%
Table 5.19: Classification accuracies for systems with LDMs and FA models as the acoustic model.
The features are MFCCs, MFCCs with δ coefficients, and MFCCs with δ and δδ coefficients. LDM
results in bold face denote that the accuracy is statistically significantly higher than for either of
the static models.
3.9%, and occurs where no δs are included in the features. This was expected to be so,
as the dynamic model should be able to capture some of the same information which the
δ parameters are intended to provide. Furthermore, LDMs with δs in the features give
similar performance to the static models when δs and δδs are included. The relative error
reductions through inclusion of a dynamic state-space are 3.0% and 3.1% when δ and δδ
parameters are used in the feature sets.
The graph in Figure 5.18 is noticeably smoother than the equivalent LDM classification
results on the MOCHA data, shown in Figure 5.8 on page 133. The TIMIT corpus is
considerably larger and the results appear less prone to random fluctuations.
MFCC features Figure 5.19 shows classification of the TIMIT test set using LDMs to model an MFCC parameterization of the acoustics. It is apparent that adding
δs improves phone discrimination, as does further including δδs. For each of the three
feature sets there is a slight increase in accuracy as the state dimension increases up to
around 10. A state vector of size 12 gives the best performance on the validation set for
the features used alone and with δs, though 9 provides the highest accuracy where δδs are
also included. Table 5.19 compares these results with those for factor analysis models and
full covariance Gaussian classifiers. The accuracy increases using LDMs are statistically
significant over the best static models on each of the feature sets, giving relative error
reductions of 3.0%, 3.7%, and 3.5% with MFCCs used alone, with δs and with both δs
and δδs respectively.
The overall highest accuracy of 72.3% for acoustic features was given using a set of
LDMs with a 9-dimensional state to characterise MFCCs with δs and δδs. A confusion
matrix is given for this result in Figure 5.20 on page 154. Some of the most common errors
appear to be misclassifying vowels as [ix] and discriminating between [er] and [axr], which
are in fact very similar acoustically. Errors also arise making voiced/voiceless distinctions
such as between the fricatives [zh] and [sh] and also [z] and [s]. The confusion table in
Figure 5.20 also shows that most nasals are classified as [m] or [n], and the gap in the
diagonal shows classification of [eng] segments is very poor. However, there are only 4
examples of [eng] in the test set.
Figure 5.20: This confusion matrix corresponds to the classifications made by LDMs with a 12-
dimensional state and MFCC features with both δ and δδ parameters. The accuracy of 72.3% is
reported in Table 5.19 on page 152. Some of the most common errors appear to be misclassifying
vowels as [ix] and discriminating between [er] and [axr], which are in fact very similar acoustically.
Errors also arise making voiced/voiceless distinctions such as between the fricatives [zh] and [sh]
and also [z] and [s]. Most nasals are classified as [m] or [n].
Figure 5.21 gives a comparison of the classification accuracy found using PLP and
MFCC features, with results broken down by the phone categories used in the compari-
son of linear and non-linear models in Section 3.4. In both cases δ and δδ parameters are
included with the features. The overall accuracies are very similar, being 72.2% and 72.3%
for PLPs and MFCCs respectively, though Figure 5.21 shows that there is variation
in the distribution of the errors. Using MFCCs, classification of unvoiced fricatives and
diphthongs are slightly over 2% more accurate than the equivalent result based on PLP
features. The situation is reversed for glides and voiced fricatives, for which classification
is almost 2% higher for PLPs than for MFCCs.
Figure 5.21: The classification accuracies are compared by phone category for PLP and MFCC
features. In both cases δs and δδs are included. Across the entire test set the accuracies are 72.2%
and 72.3% for PLPs and MFCCs respectively.
Just as in the MOCHA phone classification of Section 5.1, the primary goal of these ex-
periments is to assess the benefit of modelling the correlation between successive frames
of speech. Table 5.20 shows the overall highest classification accuracies using static and
dynamic models on TIMIT acoustic data. With either MFCC or PLP parameters and
corresponding δs and δδs, the highest static model accuracy is 71.3%. LDMs with MFCCs,
δs and δδs give a slightly higher, and statistically significant result of 72.3%. This repre-
sents a relative error reduction of 3.5%. Unlike the equivalent result on MOCHA data,
classification accuracy for LDMs increases on adding δδs to MFCC features with δs. It
appears that in this case there is sufficient data to train the extra model parameters.
Table 5.20: Comparison of the overall best results using static and dynamic models on data
derived from TIMIT acoustic data.
The experiments above have been concerned with examining the impact of adding a
dynamic hidden state to static models. The linear dynamic models have been applied
in their maximally parameterized form: full observation covariance matrix, sub-space
modelling, and no constraint on the form of the state noise. The experiments in this
section will compare the various modelling alternatives which were outlined in Section
4.4.3 on page 110. These variants are self-explanatory other than stop 1 and 2 which refer
to using full LDMs but ceasing to update the observation noise parameters v and C after
the 1st or 2nd training iterations respectively.
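The stop 1 and stop 2 variants amount to a small change in the training loop, sketched below; e_step and m_step stand in for the LDM EM routines of Chapter 4 and are not defined here, and the dictionary layout of the parameters is assumed:

```python
def train_stop_k(data, params, e_step, m_step, n_iters, stop_after):
    """Control-flow sketch of the 'stop 1' / 'stop 2' LDM variants: run EM as usual,
    but keep the observation noise mean v and covariance C fixed after `stop_after`
    training iterations."""
    for it in range(1, n_iters + 1):
        stats = e_step(data, params)        # posterior statistics under current params
        new_params = m_step(stats)          # re-estimate all parameters
        if it > stop_after:
            # freeze the observation noise parameters at their earlier estimates
            new_params["C"] = params["C"]
            new_params["v"] = params["v"]
        params = new_params
    return params
```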
Table 5.21: Cross-validation classification accuracies for systems with a variety of forms of LDMs
as the acoustic model. The features are MFCCs, MFCCs with δ coefficients, and MFCCs with
δ and δδ coefficients derived from the MOCHA corpus. Where the full LDM gives the highest
accuracy, results are given in bold face. Otherwise, any variant which gives a better performance
than the full LDM is given in bold face.
Table 5.22: Cross-validation classification accuracies for systems with a variety of forms of LDMs
as the acoustic model. Features are the full MOCHA articulatory set comprising EMA, laryngo-
graph and EPG data. Results are shown with the data used alone, with δs, and with δ and δδs.
The latter feature set is either modelled raw, or post-processed using LDA. Where the full LDM
gives the highest accuracy, results are given in bold face. Otherwise, any variant which gives a
better performance than the full LDM is given in bold face.
Tables 5.21 and 5.22 give cross-validation classification results for a number of varia-
tions on LDMs with the MOCHA MFCC and extended articulatory feature sets. Where
applicable, state dimensions have been chosen on the 46 utterance test set, and then used
to obtain full cross-validation results. Similarly, Table 5.23 gives results for the same
variations on LDM formulation using TIMIT MFCC features. Where necessary, state
dimensions are chosen on the validation data. Full results are given in Appendix E. Ta-
ble 5.21 shows that on the MOCHA MFCC data, fully specified LDMs give the highest
accuracies for MFCCs used alone and with δs. However, when δδs are also included, the
Table 5.23: Classification accuracies for systems with a variety of forms of LDMs as the acoustic
model. The features are MFCCs, MFCCs with δ coefficients, and MFCCs with δ and δδ coefficients
derived from the TIMIT corpus. Where the full LDM gives the highest accuracy, results are given
in bold face. Otherwise, any variant which gives a better performance than the full LDM is given
in bold face.
models which use diagonal and identity matrices for the state covariance both give accura-
cies of 74.7%, higher than the 74.3% found using LDMs with full covariances. Given that
the state dimension in the latter case is 1, it may be that MOCHA provides insufficient
data to estimate the extra off-diagonal parameters. The equivalent result on the TIMIT
corpus, given in Table 5.23, shows that in this case, where there is considerably more
data, LDMs with fully specified state covariances give the overall highest accuracy.
Many of the results here are somewhat inconclusive: there is no strong evidence to
suggest deviation from a fully parameterized LDM where there is sufficient training data.
However, setting diagonal or identity state covariances gives similar accuracies and offers
a small degree of computational saving. One result which is worth mentioning is that the
inclusion of subspace modelling improves classification accuracy. Using TIMIT MFCC
data with δ and δδs, setting H = I, gives an LDM classification accuracy of 71.7%, which
is statistically significantly lower than that using fully-specified LDMs.
Section 4.3 on page 102 noted the equivalence between an autoregressive (AR) process
and a static model with δ coefficients (Williams 2003), and went on to demonstrate how
the modelling provided by an LDM is distinct from these models. The state process in
an LDM is first order, though it was shown that the addition of observation noise means
that the LDM is not simply a first order model.
The experiment reported below compares classification accuracy on the TIMIT corpus
for LDMs with acoustic parameters to that found for static models where simple differ-
ences are appended to the feature vectors. The latter provides a model which is analogous
to a first-order autoregressive process. Factor analysers are used as the static models.
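Appending simple differences to the observations is straightforward; a minimal sketch follows (setting the first frame's difference to zero is an assumption, as the boundary treatment is not stated):

```python
import numpy as np

def append_simple_differences(obs):
    """obs: (T, d) array of feature frames; returns (T, 2d) with first-order
    differences y_t - y_{t-1} appended (first frame's difference set to zero)."""
    diffs = np.diff(obs, axis=0, prepend=obs[:1])
    return np.hstack([obs, diffs])
```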
Table 5.24: TIMIT classification accuracy for LDMs and static models for which simple differences
are included in the observations.
The classification results given in Table 5.24 show that for both PLPs and MFCCs,
classification accuracy is higher under a static model where simple differences are ap-
pended to the features, than for an LDM with only the original features as input. Paired
t-tests show that these differences are statistically significant.
It is apparent that in this case, the model of speech signal dynamics given by the
LDM does not provide the discriminatory power found by inclusion of differences in the
features for a static model. However, in a number of the experiments described above,
such as classification of TIMIT MFCCs, LDMs with δs are found to give equal or higher
accuracies than the best static model which includes δs and δδs in the features. In these
cases, the addition of a dynamic state appears to provide information equivalent to, or in
excess of, that contained in the δδ coefficients.
5.3 Variations on the implementation
The experiments above make a straightforward application of LDMs to speech data, with
a single model representing the variable-length segments associated with each phone class.
Such an approach takes no account of the systematic variation due to segmental duration,
which might be dealt with either by constructing models for segments of different lengths,
or using some form of duration normalisation. The latter corresponds to the trajectory
invariance formulation of LDMs used by Digalakis (1992), described on page
42. Experiments in which duration-normalised fixed-length segments are modelled will
be described below in Section 5.3.1.
The manner in which LDMs have been applied thus far in this chapter also assumes
that the inter-frame correlations are fixed for the duration of complete segments. There
may be an advantage in splitting segments into multiple regimes, each of which is modelled
by a separate LDM. This corresponds to the correlation invariance assumption which is
also described on page 42. Experiments which follow this route will be described in
Section 5.3.2 on page 164. Other experiments in this section include combining static
and dynamic models in Section 5.3.3 on page 168 and adding an explicit model of phone
duration, which is explored in Section 5.3.4 on page 173.
The number of frames used in the fixed-length representation of each segment may be
either model-dependent, or fixed for all phone models. Table 5.25 shows TIMIT classification
accuracies when an equal number of frames, between 2 and 15, are used in the fixed-length
representation of all phone types. Mapping the observations corresponding to each phone
segment onto 5 frames gives the highest classification accuracy of 66.7%, though this does not
reach the baseline.
number of frames    classification accuracy
2                   63.4%
3                   65.8%
4                   66.4%
5                   66.7%
6                   66.5%
7                   66.2%
8                   65.7%
9                   65.5%
10                  65.5%
11                  64.8%
12                  65.5%
13                  65.4%
14                  65.3%
15                  65.5%
baseline            67.8%
Table 5.25: TIMIT classification accuracies in which equal numbers of frames are used in the
fixed-length representations of each phone class, and features are PLPs. Results are given for fixed
segment lengths of between 2 and 15. Shown in bold face is the highest accuracy gained using
fixed-length segments, and the baseline result.
Segment duration varies considerably according to the phone class. For example, the
stop release [b] is on average just under 2 frames long whilst segments corresponding to
the diphthong [oy] are on average a little under 17 frames. To minimise the impact of
the duration normalisation, it may be advantageous to map each phone segment type
to a fixed number of frames which is related to its duration distribution. In the exper-
iment reported below, the mean duration along with the 25th, 33rd, 50th, 66th and 75th
duration percentiles were used to determine the number of frames which each segment
type is mapped to in duration normalisation. Results are given in Table 5.26, and are
considerably worse than using a single fixed segment length across all phones, with the
highest classification accuracy being 54.3%.
baseline 67.8%
Table 5.26: TIMIT Classification results using phone-dependent duration normalisation and PLP
features. The fixed number of frames which each segment type was mapped to was determined
by the mean duration or one of the percentiles of the phone’s duration distribution. The best
classification accuracy using duration normalised segments is shown in bold face, as is the baseline.
None of the results which use fixed-length segments reach the variable-length baseline
accuracy. It may be that investigating other approaches to duration normalisation will
provide improved results. However, any form of duration normalisation is liable to alter
inter-frame correlations, which is undesirable for a model which is intended to characterise
these dependencies.
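To make the duration normalisation concrete, the sketch below maps a variable-length sequence of feature frames onto a fixed number of frames by linear interpolation along the time axis. It is only an illustration of the general approach: the function name and the choice of linear interpolation are assumptions, and the exact resampling scheme used in these experiments may differ.

    import numpy as np

    def normalise_duration(frames, n_target):
        """Map a (T x d) array of feature frames onto n_target frames by
        linear interpolation along the time axis, so that segments of any
        length share a common fixed-length representation."""
        frames = np.asarray(frames, dtype=float)
        T, d = frames.shape
        if T == 1:
            return np.repeat(frames, n_target, axis=0)
        src = np.linspace(0.0, T - 1, num=T)          # original frame positions
        dst = np.linspace(0.0, T - 1, num=n_target)   # target frame positions
        return np.stack([np.interp(dst, src, frames[:, i]) for i in range(d)],
                        axis=1)

    # Example: a 9-frame segment of 12-dimensional features mapped onto 5 frames.
    segment = np.random.randn(9, 12)
    print(normalise_duration(segment, 5).shape)       # (5, 12)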
As mentioned above, Digalakis (1992) used the term correlation invariance to describe the
assumption of applying distinct LDMs to a number of sub-phone regions. A deterministic
mapping, dependent on segment duration, dictated the sequence of sub-models which
were used to generate each phone. This formulation will be referred to as multiple regime
(MR) as the abbreviation of correlation invariance, CI, can be confused with the term
context-independent which is frequently used in speech recognition. Figure 5.22 shows the
LDM-of-phone and multiple regime LDMs used to generate a single segment of speech
data.
Figure 5.22: In the multiple regime (MR) formulation, segments are split into multiple regimes,
each of which is modelled by a separate LDM. The LDM-of-phone approach assumes that inter-
frame correlations are fixed across entire segments, whereas in an MR formulation, correlations are
static within sub-phone regions. A deterministic mapping dependent on segment length dictates
how many frames are spent in each regime.
A multiple regime approach was not taken initially in this work for three reasons:
• ideally, the switching between sub-phone regimes would be described by some discrete,
hidden random variable, and the transition network learnt in a probabilistic manner.
• subdividing segments runs the risk of losing the ‘segmental’ nature of the modelling.
The intention is to model longer sections of speech in which linguistic events occur.
Partitioning phone-length segments will produce regions which often consist of only
a few frames, and modelling may tend toward the HMM, where models describe
short, stationary regions of the parameterized speech signal within which there is
little to be gained from an explicit model of dynamics.
• following Occam’s Razor which states that ‘entities should not be multiplied un-
necessarily,’ it was considered that the simpler LDM-of-phone model should be
investigated first to give a basis for comparison.
phone category            regimes    division of frames
affricates                2          apportioned equally
fricatives                1          single region
nasal stops               2          apportioned equally
semivowels and glides     2          apportioned equally
silence                   1          single region
oral stop closures        1          single region
vowels                    3          ratio 3:2:3
oral stop releases        3          1st and 2nd regions capped at 10 and 8 frames
Table 5.27: Divisions by phone category used in multiple regime LDM experiments.
The mappings used to determine the number of frames which correspond to each
sub-model are given in Table 5.27. Fricatives, silence and oral stop closures are modelled
with a single region as the speech signal is considered to be approximately
statistically stationary during these sounds. Two regimes corresponding to ‘coming in’
and ‘going out’ are used for nasal stops along with semivowels and glides, and for affricates
which consist of the combination of a stop and a fricative. Vowels, which are subject to
strong contextual variation, are split into 3 regimes modelling ‘onset’, ‘steady state’ and
‘offset’. Stevens (1999) describes oral stop releases as consisting of 3 distinct regions: a
transient, frication at the point of articulation and finally aspiration. Oral stop releases
are accordingly split into 3 regimes. All segments are split equally into their chosen
number of regions except vowels which are apportioned in the ratio 3:2:3 and the release
portions of oral stops in which the 1st and 2nd regions have a maximum duration of 10
and 8 frames respectively. Therefore, in longer oral stop release segments, the largest
portion is spent in the final region.
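The deterministic mapping from segment length to regime durations can be sketched as below. The phone-class names, the rounding behaviour and the treatment of very short segments are assumptions made for illustration; only the regime counts, the 3:2:3 vowel ratio and the 10 and 8 frame caps on the first two stop release regions are taken from the description above.

    def frames_per_regime(n_frames, phone_class):
        """A sketch of the deterministic mapping from segment length to the
        number of frames spent in each sub-phone regime (cf. Table 5.27)."""
        if phone_class in ('fricative', 'silence', 'stop_closure'):
            return [n_frames]                               # single region
        if phone_class in ('affricate', 'nasal', 'semivowel_glide'):
            half = n_frames // 2                            # two roughly equal regions
            return [half, n_frames - half]
        if phone_class == 'vowel':
            r1 = round(n_frames * 3 / 8)                    # onset : steady state : offset
            r2 = round(n_frames * 2 / 8)                    # in the ratio 3:2:3
            return [r1, r2, n_frames - r1 - r2]
        if phone_class == 'stop_release':
            third = n_frames // 3                           # three regions, with the
            r1 = min(third, 10)                             # first two capped at 10
            r2 = min(third, 8)                              # and 8 frames; the surplus
            return [r1, r2, n_frames - r1 - r2]             # goes to the final region
        raise ValueError('unknown phone class: %s' % phone_class)

    print(frames_per_regime(16, 'vowel'))          # [6, 4, 6]
    print(frames_per_regime(33, 'stop_release'))   # [10, 8, 15]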
PLP features
model                base     +δ       +δ +δδ
LDM-of-phone         67.8%    71.0%    72.2%
full Gaussian MR     68.6%    73.2%    74.2%
state reset MR       68.9%    73.5%    74.4%
state passed MR      70.2%    73.6%    74.5%

MFCC features
model                base     +δ       +δ +δδ
LDM-of-phone         67.4%    71.3%    72.3%
full Gaussian MR     68.6%    73.3%    74.3%
state reset MR       67.9%    73.3%    74.3%
state passed MR      69.5%    73.7%    74.5%
Table 5.28: Results of using multiple regime LDMs for classification of TIMIT PLP and MFCC
features. These results compare standard LDMs with classification in which segments are modelled
with multiple regimes. The MR mappings are used for static models, LDMs in which the state is
reset between regions, and LDMs in which the state information is passed throughout the length of
the segment. Results for which the state-passed LDMs give statistically significant improvements
over the other models are shown in bold face.
Table 5.28 shows the results of experiments in which multiple regimes are used within
each phone segment for PLPs and MFCCs as features. The baseline classification accu-
racy found using LDM-of-phone models is shown, along with a result for the case when
segments are mapped onto multiple regimes, each of which is modelled by a full covariance
Gaussian distribution. This is included as a means of examining whether the dynamic
portion of the model still contributes under such a formulation. Multiple region LDM
classification results are also presented, both with the state reset between regions (state-
reset), and passed throughout the segment (state-passed).
In all cases, static models which use multiple regions outperform standard LDMs of
phones. Such a model corresponds to a particular form of HMM in which the state
transitions are deterministic given segmental duration, and the output distribution is a
unimodal full covariance Gaussian. Accuracy is also improved over the baseline LDM
system when multiple regime LDMs are used with the state reset between regions. For all
feature sets, the best performance is found with MR LDMs in which state information is
passed between sub-phone regions. However, the accuracy increase over all other models
is only statistically significant in the absence of δ features.
With MFCC and PLP features respectively and no δs, 2.9% and 4.2% relative error
reductions are given using the state-passed MR LDM compared to the next best model.
Compared with standard LDMs, the accuracies increase from 67.4% to 69.5% in the case
of MFCC features, and 67.8% to 70.2% for PLPs. Including both δs and δδs gives the
overall best classification performances, with accuracies of 74.5% for state-passed MR
LDMs using each of the acoustic parameterizations. These results are higher than for
other models using the same features, though only by a few tenths of a percent. It seems
that adding dynamic information in the form of δ and δδ parameters reduces the benefit
of including a continuous state process.
The final classification accuracies in the last experiment for the multiple region LDM and
Gaussian models are similar; however, the distribution of errors may differ between static
and dynamic models. The following experiments take the results in Table 5.28 for MFCCs
with δ and δδ coefficients to examine if combining static and dynamic models will give
improved results over using either singly. Figure 5.23 shows the classification accuracies
of state-passed multiple region LDMs and multiple region Gaussian models broken down
by phonetic category.

Figure 5.23: The classification accuracies due to state-passed MR LDMs and MR Gaussians of
Table 5.28 are compared by phone category. Features are MFCCs with δs and δδs.

The dynamic models give higher classification accuracy for voiced
fricatives, liquids and stops, though their static counterparts provide better performance
on diphthongs, unvoiced fricatives, and non-speech segments.
One approach to combining the static and dynamic models is to use a model set composed
of a mixture of dynamic and static models. There are a total of 18429 tokens in the
validation set, of which 5661 were incorrectly classified by the set of LDM models. Of
these errors, 4963 also occurred under the Gaussian models, leaving just over 12% of LDM
errors which were distributed differently under the static model. There were 220 tokens
which were classified correctly under the static model and incorrectly by the dynamic
model. [h#] (silence) represented 20% of these, [s] 12% and between 5% and 10% by each
of [ae, er, ey, iy]. Experiments were carried out in which some of the multiple regime
LDMs are replaced by static models of the same sub-phone regions. In experiment
A, dynamic [h#, s] are replaced, and in experiment B, [h#, s, ae, er, ey, iy] are all
replaced with their static model counterparts.
Given that a mismatch between the ranges of the likelihoods produced under different
model types is likely, validation accuracies are found for a number of values of β, a scaling
of the static model likelihood. These are shown for each experiment in Table 5.29
and Figure 5.24. The highest validation result is 68.2% for a β of 1.0 and dynamic [h#,
s] replaced with static models. This translates into a 39-phone classification accuracy of
73.9% on the test set. These are significantly lower than the equivalent multiple regime
results shown in Table 5.28 on page 167 of 74.3% and 74.5% for static and dynamic models
respectively.
Figure 5.24: Graphs showing the validation accuracies in Table 5.29. Dynamic models
are replaced with their static counterparts: [h#, s] in A, and [h#, s, ae, er, ey, iy]
in B. The static model likelihoods are scaled by β.
Table 5.29: 61 phone classification accuracies on the validation set when some dynamic models
are replaced with their static counterparts given for a range of values of β, a scaling of the static
model likelihood. Dynamic [h#, s] are replaced in A and [h#, s, ae, er, ey, iy] in B.
Likelihood combination
A number of speech recognition systems have sought to use information from more than
one source. For example, Kirchhoff (1998) experimented with a variety of methods of
combining the posterior likelihoods from acoustic and articulatory-feature systems for use
in recognition. Also, Robinson et al. (2002) used separate neural networks with distinct
parameterizations of the acoustic signal as input, the outputs of which were combined to
give a scaled acoustic likelihood2 for use in a hybrid ANN/HMM system. Both of these
systems use weighted averages of the log-likelihoods from each source as the final acoustic
likelihood. Given that summing log probabilities is equivalent to multiplication of straight
probabilities, such an approach makes the simplifying assumption that the classifiers are
statistically independent. In the present case, given the similarity between the models
which will be combined, this is unlikely to be so. However, there will be no attempt to
model interactions for the purposes of this experiment.
Combined likelihoods are computed as α l_s + (1 − α) l_d where l_s and l_d are the likelihoods
under static and dynamic models respectively. Results of classification on the TIMIT
validation data are shown in Table 5.30 and Figure 5.25 for α = 0.0, 0.1, . . . , 1.0. A value
of α = 0.3 gives the highest validation result, and produces a classification accuracy of
74.9%. This result is slightly higher than either the MR Gaussian accuracy of 74.3%
and the MR state-passed LDM accuracy of 74.5%, and a paired t-test reveals that in
both cases, the increase is statistically significant. Goldenthal (1994), described in the
literature review on page 44, reports a context independent TIMIT classification accuracy
of 74.2% on the 39 phone set, which is almost identical to the results given here.
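A minimal sketch of this combination scheme is given below, assuming the per-token, per-class log-likelihoods from the static and dynamic classifiers are already available as arrays; the function names and the grid of α values are illustrative, and the bigram language model and Viterbi search used in the full classification procedure are omitted.

    import numpy as np

    def combine_and_classify(static_ll, dynamic_ll, alpha):
        """Combine per-class log-likelihoods as alpha*l_s + (1-alpha)*l_d and
        return the winning class index for each token.

        static_ll, dynamic_ll: arrays of shape (n_tokens, n_classes)."""
        combined = alpha * static_ll + (1.0 - alpha) * dynamic_ll
        return np.argmax(combined, axis=1)

    def tune_alpha(static_ll, dynamic_ll, labels, alphas=np.arange(0.0, 1.01, 0.1)):
        """Choose the combination weight giving the highest validation accuracy."""
        return max(alphas,
                   key=lambda a: np.mean(
                       combine_and_classify(static_ll, dynamic_ll, a) == labels))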
2. See Section 1.3.2 on page 7 for a definition of scaled likelihood as used in this context.
Figure 5.25: This plot shows classification validation accuracies from combining token
likelihoods produced under dynamic and static models. New likelihoods are computed
as α l_s + (1 − α) l_d where l_s and l_d are the likelihoods under static and dynamic models
respectively. The results shown in this figure correspond to those reported in Table 5.30.
α 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
accuracy 69.3% 69.5% 69.7% 69.8% 69.7% 69.7% 69.5% 69.3% 69.3% 69.0% 68.7%
Table 5.30: This table shows classification validation accuracies from combining token likelihoods
produced under dynamic and static models. New likelihoods are computed as α l_s + (1 − α) l_d,
where l_s and l_d are the likelihoods under static and dynamic models respectively. A value of 0.3
for α gives the highest accuracy. Results are given on the original 61 phone set.
Experiments thus far have not included an explicit model of phone duration. Such an
addition may aid classification accuracy, and two possible models are considered in the
experiments which follow. The first uses a Gamma distribution to model the segment
durations corresponding to each phone class. The second models the log-durations with
a Gaussian distribution.
The parameters describing each candidate duration distribution were estimated from
all the segment durations in the TIMIT training set, and experiments to examine the
benefits or otherwise of including a model of phone duration follow the classification
procedure outlined on page 149. The duration model likelihoods are scaled by κ and
added to the acoustic model likelihoods prior to the combination with the language model
and Viterbi search.
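The following sketch illustrates one way the two candidate duration models might be estimated and applied, with Gamma parameters set by moment matching and the log-Gaussian fitted to the log durations; the estimation method, function names and the evaluation of the density over the log duration are assumptions made for illustration.

    import math
    import numpy as np

    def fit_duration_models(durations):
        """Fit Gamma (moment matching) and log-Gaussian models to the segment
        durations (in frames) observed for one phone class."""
        d = np.asarray(durations, dtype=float)
        mean, var = d.mean(), d.var()
        gamma = {'shape': mean ** 2 / var, 'scale': var / mean}
        log_d = np.log(d)
        log_gauss = {'mu': log_d.mean(), 'sigma': log_d.std()}
        return gamma, log_gauss

    def gamma_loglik(d, shape, scale):
        return ((shape - 1.0) * math.log(d) - d / scale
                - shape * math.log(scale) - math.lgamma(shape))

    def log_gauss_loglik(d, mu, sigma):
        z = (math.log(d) - mu) / sigma
        return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

    def duration_score(d, model, params, kappa):
        """Scaled duration log-likelihood, added to the acoustic model score
        before combination with the language model and Viterbi search."""
        ll = gamma_loglik(d, **params) if model == 'gamma' else log_gauss_loglik(d, **params)
        return kappa * ll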
Table 5.31: Classification validation accuracies on adding Gamma and log-Gaussian duration
models to both LDM-of-phone and combined static and dynamic multiple region models. The
original results which do not include a duration model were given in Table 5.19 on page 152 and
Table 5.28 on page 167. Duration model likelihoods are scaled by κ. Note that in both cases these
figures correspond to classification accuracies found on the validation set and therefore use the 61
phone set.
Table 5.31 shows 61-phone classification accuracies on the TIMIT validation set on
adding log-Gaussian and Gamma distribution duration models to LDM-of-phone and
combined static and dynamic multiple region models for a range of values of κ. The
original results for each where no duration model is used were given in Table 5.19 on page
152 and Table 5.28 on page 167. The features are MFCCs with δ and δδ parameters. For
each model type, the log-Gaussian distribution provides the largest increase in accuracy,
though this is marginal in the case of the combined model set.
Table 5.32 shows the results of taking the log-Gaussian duration model with the like-
lihood scaling found on the validation data and producing classification accuracies on the
TIMIT test set. The accuracy for LDM-of-phone models with MFCC, δ and δδ features
increases from 72.3% to 73.3%, and the combined static and dynamic models with multiple
regions per phone from 74.9% to 75.2%. In both cases, the increases are statistically signif-
icant, meaning that the improvement is consistent across the test set. Goldenthal (1994),
described in Section 2.3.7 on page 44, compared Gaussian, log-Gaussian and Gamma dis-
tributions to model phone segment durations and found that a log-Gaussian model gave
the highest likelihood on the TIMIT training set. Context-independent TIMIT classifica-
tion accuracy was reported to increase from 74.2% to 75.2% on inclusion of a log-Gaussian
model of phone duration.
The log-Gaussian duration model will be used in recognition experiments in the fol-
lowing chapter.
model                     without duration model    with duration model
LDM-of-phone              72.3%                     73.3%
combined MR models        74.9%                     75.2%
Table 5.32: Effect on classification performance of including a duration model. In both cases, log-
Gaussian models of segmental duration are used. Results in bold face are statistically significantly
higher than their equivalents which do not include a duration model.
The main results of the previous section are summarised in Table 5.33, and correspond
to features consisting of MFCCs with δs and δδs. Firstly, classification experiments
show that including a dynamic state process gives increased accuracy over an equivalent
static model. The 71.3% accuracy obtained using full covariance Gaussian classifiers is
statistically significantly lower than the 72.3% found with a set of LDMs. Secondly,
subspace modelling, in which a compact representation is made of the data’s correlation
structure, is shown to aid classification. Setting H = I gives an LDM classification
accuracy of 71.7%, which again is significantly lower than that using fully-specified LDMs.
With the exception of modelling fixed-length segments, the extensions to the orig-
inal LDM-of-phone formulation yield increases in classification performance. Adding a
log-Gaussian model of phone duration is found to increase classification performance,
taking the standard LDM accuracy from 72.3% to 73.3%, a relative error reduction of
3.6%. Using a hand-picked deterministic mapping to govern the transitions between sub-
segmental models in the multiple regime formulation is clearly suboptimal. Despite the
ad hoc nature of this implementation, classification accuracies are higher than those for
standard LDMs of phones. However, it is only where no δ or δδ features are included that
dynamic models show statistically significant increases over multiple regime static model
performance.
LDM-of-phone                                 72.3%
LDM-of-phone + duration                      73.3%
MR LDM                                       74.5%
combined MR LDM and Gaussian                 74.9%
combined MR LDM and Gaussian + duration      75.2%
Table 5.33: Summary of the classification results for standard LDMs of phones, multiple regime
LDMs and combined LDM and Gaussian multiple regime models. Features are TIMIT MFCCs
with δs and δδs.
5.4 Continuous state classification
A state process which is continuous both within and between phone segments would
represent a step toward the goal of reflecting the properties of speech production in the
chosen acoustic model. Passing state information across model boundaries offers a degree
of contextual modelling, and furthermore gives the possibility of modelling longer range
dependencies than contained within phone segments.
The spectrograms in Figures 5.26, 5.27 and 5.28 give visual evidence of the potential
benefits of a state which is continuous across model boundaries. The first two figures will
be familiar from Figures 4.11 and 4.12 on page 107, where spectrograms of the original and
LDM-predicted MFCCs corresponding to the sentence ‘Do atypical farmers grow oats?’
were given. The third spectrogram is derived from LDM predictions where the state has
been allowed to be continuous across phone boundaries, and as before, the correct model
according to the phone labels is used to generate each segment. In all cases, a Mel-warped
frequency scale is used, and red corresponds to regions of high energy, whilst blue to low.
A comparison of Figures 5.26 and 5.27 shows that many of the spectral characteristics
of the acoustic signal are reproduced by the set of LDMs for which the state is reset at
the start of each segment. However, spectral transitions are subject to strong boundary
effects as each new model takes a few frames to find an appropriate location in state-space.
The spectrogram in Figure 5.28 demonstrates how a fully continuous state reduces these
effects. For example, the discontinuities in the transition of the first formant through the
phones [ux q ey] early in the utterance are present where the state is reset, but absent
when the state is allowed to be continuous across segment boundaries.
5.4.1 Implementation
As with the multiple regime models above, state-passed and state-reset will refer to imple-
mentations where state statistics are passed across or reset at model boundaries respec-
tively. Training and testing with a fully continuous state require simple modifications of
the standard state-reset case as described below. In practice, approximations are made
during testing to prevent an exponential increase in the required computation. Alter-
natively, the state covariances can be reset when boundaries are crossed, but the mean
allowed to be continuous. Some information will still be carried from one phone to the
next, but efficient computation can be maintained by pre-computing or caching the 2nd
order filter statistics as discussed in Section 4.2.4 on page 99.

Figure 5.26: A spectrogram derived from the original MFCCs for the utterance 'Do atypical
farmers grow oats?'.

Figure 5.27: A spectrogram generated from predictions made by LDMs during a forward
pass through the data.

Figure 5.28: A spectrogram generated from the predictions made by LDMs where the
state is continuous through the entire utterance.
Training
In the state-passed case, the Kalman filter for each new segment is initialised with the
state statistics carried over from the end of the previous segment, rather than a fixed
initial distribution x_{t|t−1} ∼ N(π, Λ). However, since the initial state remains Gaussian,
the linear-Gaussian properties of the LDM are preserved so that estimation and likelihood
computations otherwise proceed as normal.
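The difference between the two training schemes amounts to how each segment's filter is initialised. The sketch below shows per-segment forward filtering with either a reset to the model prior N(π, Λ) or the state statistics carried over from the previous segment; the model is assumed to be held as a dictionary of parameter matrices, the observation mean offset is omitted, and this covers only the likelihood computation, not the parameter updates.

    import numpy as np

    def kalman_segment(y, F, w_cov, H, v_cov, x0, P0):
        """Forward Kalman filter over one segment: returns the log-likelihood
        and the predicted state mean and covariance at the segment end."""
        x, P, ll = x0, P0, 0.0
        for t in range(len(y)):
            e = y[t] - H @ x                               # prediction error
            S = H @ P @ H.T + v_cov                        # and its covariance
            ll += -0.5 * (e @ np.linalg.solve(S, e) + np.linalg.slogdet(S)[1]
                          + len(e) * np.log(2.0 * np.pi))
            K = P @ H.T @ np.linalg.inv(S)                 # measurement update
            x, P = x + K @ e, P - K @ H @ P
            x, P = F @ x, F @ P @ F.T + w_cov              # predict the next frame
        return ll, x, P

    def utterance_loglik(segments, models, state_passed):
        """segments: list of (observations, phone label); models[label] holds
        the parameter matrices {'F', 'w_cov', 'H', 'v_cov', 'pi', 'Lambda'}."""
        x, P, total = None, None, 0.0
        for obs, label in segments:
            m = models[label]
            if not state_passed or x is None:
                x, P = m['pi'].copy(), m['Lambda'].copy()  # state-reset initialisation
            ll, x, P = kalman_segment(obs, m['F'], m['w_cov'], m['H'], m['v_cov'], x, P)
            total += ll
        return total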
Testing
Passing the state across segment boundaries substantially increases the computation in-
volved in a classification task. With |M| denoting the number of models between which
a classification will be made, and j the number of phone segments in a given utterance,
an exhaustive evaluation would require computing the likelihood of |M|^j possible phone
sequences, rather than the j × |M| which is required in standard state-reset classification.
Therefore, to keep computational demands within reason, pruning will be needed at some
level to discard unlikely partial phone sequences.
The time-asynchronous decoding strategy which will be described in full in Section
6.2 of the following chapter has been adapted to perform state-passed classification. By
only allowing paths to finish at the phone-end times as given in the labels, decoding
becomes a classification task. Further storing the state mean and covariance as part of
any given hypothesis and using these to initialise filter recursions gives continuous state
classification. Pruning levels in the decoder can be set to give a minimum of search errors
whilst still producing classifications at an acceptable speed3 . However, using the decoder
for this task means that the Viterbi criterion (see Section 6.1.1 on page 187) is applied at
phone boundaries. This is not strictly admissible, though is believed to be a reasonable
approximation and substantially improves efficiency. This issue is discussed in some detail
in Section 7.1.4 on page 227.
The first set of experiments are speaker-dependent, use the real EMA data from the
MOCHA corpus and correspond to the one-way train/test division as detailed on page
116. LDMs with a state dimension of 14 gave the highest accuracy on the validation set.
These results are shown in Table E.2 on page 259 of Appendix E. The corresponding
accuracy on the test data was 59.5%, and is used as a baseline for the experiments below.
A set of LDMs is trained whilst allowing the state mean to be passed across phone
boundaries, after being initialised identically to the models in the baseline experiment.
Performing classification with these state-passed models with the standard procedure in
which states are reset at the start of each new segment gives an accuracy of 56.2%. A
similar experiment, in which both state mean and covariance are passed throughout each
utterance in training, but not in testing, gives a classification accuracy of 57.1%. Both of
these results are lower than the baseline of 59.5%, which shows that this modified training
scheme does not yield performance improvements when using the standard approach to
testing.
3. The TIMIT MFCC continuous state classification results given in Section 5.4.3 were found with the
decoder running at around 75 times slower than real-time on a 2.4GHz Pentium 4 processor.
Table 5.34: Classification accuracies for model sets trained with the state mean or both mean and
covariances continuous across phone segment boundaries. The standard classification procedure
is used where the state is reset at the start of each new segment.
Rather than training state-passed models from scratch, models which have been sub-
ject to a few iterations of EM can be used to initialise the state-passed models. Results of
such experiments, along with those quoted above are given in Table 5.34. Further training
on models after 4 standard EM iterations gives the highest accuracies. These are 59.8%
and 60.0% where mean and both mean and covariance are continuous across segment
boundaries respectively. These results both provide marginal, though not statistically
significant, improvements on the baseline of 59.5%.
The experiments above are subject to a mismatch between training and testing: the
state statistics were passed across segment boundaries during parameter estimation, but
not during testing. Table 5.35 gives the results of classification using the modified decoder
where the state behaviour during testing matches that during training, where either state
mean or both mean and covariance are passed between phones. Results in brackets show
the corresponding classification accuracies from Table 5.34 where the standard testing
procedure is used. The models trained from scratch where both state mean and covariance
are carried give equivalent accuracies under both conditions, though in the system where
just the state mean is carried, performance deteriorates slightly when training and testing
match. Where models had been trained using standard EM, and further trained with a
fully continuous state, there is a large reduction in accuracy.
Table 5.35: Classification accuracies for model sets trained with the state mean or both mean
and covariance continuous across phone boundaries. State resetting during testing matches that
which was used during training. Accuracies in brackets are equivalent results using the standard
classification procedure in which state statistics are reset at the beginning of each new segment.
Continuous state classification experiments were also performed on acoustic data from
the TIMIT corpus. For MFCCs with no δs or δδs, Table 5.19 on page 152 shows that the
highest LDM-of-phone classification accuracy of 67.4% was given using a set of models
with a state dimension of 12. The classification results presented in this section are for the
TIMIT core test set as described on page 211, and the language models are the backed-
off bigrams which will be used in the recognition experiments of Chapter 6. Otherwise,
the classification procedure remains as described in Section 5.2.1 on page 149. Standard
state-reset classification with these different language models and test set also gave an
accuracy of 67.4%, and will provide the baseline for this set of experiments.
The LDMs were initialised to have starting values which were identical to those used
when training the models which provide the baseline result. Models were trained from
scratch with both state means and covariances passed over segment boundaries, rather
than including any iterations of EM where states were reset at the start of segments. The
experiments of the previous section showed that this mixed training approach resulted in
marginal though not statistically significant increases in accuracy using standard state-
reset testing, though when the state was also passed between segments during testing,
the lowest classification accuracies were found.
The results are presented in Table 5.36 and show that the highest accuracy of 67.4%
is given by the baseline result where the state is reset at the start of each new segment
in both training and testing. State-passed training followed by state-reset testing results
in an accuracy of 66.0%, though using these same models and passing the state between
segments during testing gives the improved result of 67.0%. The latter is close to the
baseline, and shows that in this case, a mismatch between training and testing caused a
reduction in performance.

implementation (training / testing)                                    classification accuracy
state-reset / state-reset                                              67.4%
state-passed / state-reset                                             66.0%
state-passed / state-passed                                            67.0%
state-passed / state-passed, state covariance included in likelihood   66.7%
Table 5.36: Classification accuracies for the continuous state experiments on the TIMIT core test
set. Features are TIMIT MFCCs (no δ or δδ coefficients). Both state mean and covariance are
passed across segment boundaries in the state-passed configurations.
Section 4.2.3 on page 97 showed that a modified likelihood calculation gave higher
classification accuracies for shorter phones. With the state continuous across entire ut-
terances, there may be an advantage by re-including the contribution of state covariance
in normalising the prediction errors. The last result of Table 5.36 shows that in fact, this
causes a slight reduction in classification accuracy, giving 66.7% rather than the 67.0%
found using the modified likelihood calculation used in this work.
The results of the last two sections show that in the current implementation, classifica-
tion accuracy is reduced by allowing the state statistics to be continuous across phone
boundaries. On MOCHA EMA data, allowing the state to be fully continuous resulted in
an accuracy of 57.0%, where models in which the state was reset between segments gave
59.5%. Training models with a single state-passed iteration after initially training with
a number of state-reset iterations gave a classification accuracy of 60.0%. This is higher
than the baseline of 59.5%, though not statistically significantly so.
On TIMIT acoustic data, the accuracies were similar under state-passed and state-
reset implementations of the models, 66.7% and 67.0% respectively. For this data, a
mismatch in the state behaviour during training and testing was found to cause a reduction
in performance. Section 7.1.4 on page 224 discusses these results further, and suggests
ways in which a fully continuous state might prove useful in the future.
Chapter 6

LDMs for recognition of speech
Automatic speech recognition ultimately centres around a search problem: given the
probabilities from an acoustic model, pronunciation model, language model and perhaps
duration model, the most likely sequence of words must be found. In light of the number
of possible word sequences, each of which is subject to a large number of potential time-
alignments, an exhaustive search is impractical even for modest sized tasks. Decoding, as
this search is known, therefore becomes an exercise in judicious investigation of the search
space in which time and memory requirements must be minimised without introducing
too many new errors.
There are two main approaches to implementing such a search in ASR. The first is
time-synchronous forward dynamic programming, or Viterbi decoding, where all hypothe-
ses are evaluated at a given time before the search proceeds to the next time. The second
is time-asynchronous A∗ search, where regardless of time, the best candidate hypothesis
is extended.
With W = w_1^j = {w_1, . . . , w_j} denoting a word sequence and Y = y_1^N = {y_1, . . . , y_N}
an observation sequence, Renals & Hochberg (1999) inter alia define the task of decoding
as finding the maximum a posteriori (MAP) probability of words W given observations
Y:

W* = argmax_W P(W | Y)    (6.1)
However, P (W|Y) is not a quantity which can generally be computed directly, and so
Bayes’ rule is used to decompose 6.1 into a product of the acoustic model likelihood and
language model probability:
P(W | Y) = p(Y | W) P(W) / p(Y)    (6.2)
          ∝ p(Y | W) P(W)          (6.3)
For the purposes of decoding, p(Y) can be omitted as it is independent of the words
W. Letting M = m_1^k = {m_1, . . . , m_k} denote the concatenation of a series of sub-word
models which together account for the full observation sequence Y and produce the word
sequence W, 6.3 can further be decomposed as:
P(W | Y) ∝ Σ_{all M} p(Y | W, M) P(M | W) P(W)    (6.4)
Given that the sequence of models was chosen to represent the word sequence, W is
implicit in M, and therefore p(Y|W, M) = p(Y|M). This allows the MAP criterion of
6.1 to be re-written as:
W* = argmax_W { P(W) Σ_{all M} p(Y | M) P(M | W) }    (6.5)
In certain circumstances, which will be discussed below, rather than summing over all
possible model sequences, only the most likely is considered. This approximation can be
applied to 6.5, giving:
W* ≈ argmax_W { P(W) max_M p(Y | M) P(M | W) }    (6.6)
Thus the quantities which must be computed for decoding are P (W), the language model
probability as defined in Section 3.3 on page 64, the acoustic model likelihood p(Y|M),
and the model sequence probability P (M|W ). Unless the lexicon allows multiple pronun-
ciations, there will be a unique sequence of models which can be concatenated to form
each word, and hence P (M|W ) is simply unity.
The following sections will outline time-synchronous and time-asynchronous search
schemes, though it is the latter which will be used to implement decoding of LDMs. Any
search algorithm in its basic form should be admissible, meaning that it is guaranteed to
find the most likely word sequence. This is not necessarily the correct word sequence,
but the one which gives the highest likelihood under the acoustic and language models.
An admissible search adds no errors which would not occur under an exhaustive search.
However, pruning, which is the process of early removal of unlikely hypotheses in order to
reduce computation, can in practice introduce some search errors.
Forward dynamic programming, or Viterbi decoding as it has become known when applied
to automatic speech recognition (Young et al. 2002), is a breadth-first scheme for finding
the most likely path through a probabilistically weighted lattice where the axes are time
and model. Breadth-first refers to a search in which all candidate hypotheses at any given
time are extended before search proceeds to the next time, a characteristic which proves
useful for ‘on-line’ speech recognition. With St representing a set of word hypotheses at
time t, forward dynamic programming in its simplest form recurses in time, extending
all s ∈ St by all models (states in the case of HMMs) which the language model allows.
Therefore, the n(St ) candidate hypotheses at time t become n(St )|M| at time t + 1, where
|M| denotes the number of allowable models.
The Markovian dynamics underlying HMMs provide the ‘memoryless’ property that
all state information up to a given time is encompassed by the state at that time. An
extremely useful product of this property is the Viterbi criterion (Viterbi 1967), which
can be stated as:
where two paths occupy the same state at a given time, that with the locally
lower likelihood will never supersede the other.
Therefore, if the goal is to find the single most likely path through the lattice (as is
frequently the case in ASR), when multiple paths share a model state at a given time,
only that with the highest probability need be kept. This allows the single most likely path
to be computed at far lower cost than summing over all possible paths and corresponds
to making the approximation in Equation 6.6.
The search as described above is exhaustive, and often still computationally expensive
despite application of the Viterbi criterion. Therefore, pruning strategies are called for, a
common approach being the beam search. Under this scheme, only the best hypotheses
Figure 6.1: An efficient strategy for implementing Viterbi decoding for HMMs involves first
decomposing all word sequences allowable under the language model into sub-word lexical items
p1 , . . . , pl . These are further decomposed into a transition network consisting of set of HMM
states. This figure is taken from Young et al. (2002).
are extended. A beam width ∆ is chosen and any paths which have likelihoods less than
max_{s∈S_t} log p(s) − ∆ are removed from the search at each time step.
An efficient strategy for implementing Viterbi decoding uses the concept of token
passing (Young, Russell & Thornton 1989). Figure 6.1 shows how an HMM state transition
network is compiled before recognition in which all word sequences allowable under the
language model are decomposed into sub-word lexical items p1 , . . . , pl . These are further
decomposed into a network where the nodes are HMM states. Any given node contains
up to N tokens (for single-best decoding, N = 1), each of which stores a partial path
and associated likelihood. The search moves forward in time by passing each token to
each node which the transition network allows. State transition and acoustic likelihoods
are added to the path likelihood contained in each token, and the N tokens with highest
likelihood at each node are retained. For word decoding, transitions out of each node
are recorded, and then a traceback of these records for the token with the highest path
probability is used to give the most likely word sequence (Young et al. 2002).
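A minimal single-best (N = 1) token passing sketch is given below. It stores the full state sequence in each token for simplicity, whereas a practical implementation records only word-boundary transitions and recovers the word sequence by traceback; the data structures are assumptions rather than those of any particular toolkit.

    import math

    def token_passing(obs_loglik, trans, init, final):
        """Single-best Viterbi by token passing over a state transition network.

        obs_loglik[t][j] - acoustic log-likelihood of frame t under state j
        trans[i]         - list of (j, log transition probability) out of state i
        init, final      - permitted initial and final states
        Returns the best final (log probability, state sequence)."""
        n_states = len(trans)
        tokens = [(-math.inf, None)] * n_states
        for i in init:
            tokens[i] = (obs_loglik[0][i], [i])
        for t in range(1, len(obs_loglik)):
            new = [(-math.inf, None)] * n_states
            for i, (score, path) in enumerate(tokens):
                if path is None:
                    continue
                for j, log_a in trans[i]:
                    cand = score + log_a + obs_loglik[t][j]
                    if cand > new[j][0]:          # Viterbi: keep the best token per node
                        new[j] = (cand, path + [j])
            tokens = new
        return max((tokens[j] for j in final), key=lambda s: s[0])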
6.1.2 A∗ search
In an A∗ (or stack) search, partial hypotheses of differing lengths are compared using an
evaluation function h∗_t which combines two components.
Figure 6.2: The evaluation function h∗ is a combination of likelihood computed under the acous-
tic, language and lexical models, and an estimate of the cost of explaining the remaining obser-
vations. The lookahead function must be ‘optimistic’ and give an upper bound on the remaining
acoustic likelihood for A∗ to be an admissible search.
These are the detailed match f_t, which constitutes the likelihood of the observations y_1^t
under the hypothesised sequences of words w_1^{j_t} = {w_1, . . . , w_{j_t}} and sub-word models
m_1^{k_t} = {m_1, . . . , m_{k_t}}, and the lookahead g∗_t, which gives an
estimate of the acoustic likelihood of the remainder of the observations, p(y_{t+1}^N | M). Com-
puting the evaluation function h∗ from a combination of detailed match and lookahead
is shown pictorially in Figure 6.2.
Using an evaluation function composed of detailed match and lookahead function is
key to time-asynchronous search, as it allows the comparison of hypotheses of differing
lengths. Nilsson (1971) shows that such a search is admissible as long as g∗_t gives an
upper bound on the acoustic likelihood. Intuitively, if the lookahead underestimated the
likelihood remaining, longer hypotheses would be favoured and the decoder would simply
keep extending hypotheses until one was completed, without returning to explore shorter
paths. Conversely, in the limiting case where ∀t, gt∗ = 0, decoding becomes approximately
time-synchronous.
The name stack decoder is in fact somewhat misleading. A stack normally corresponds
to a last-in-first-out data structure, though in this case it comprises a list of hypotheses
sorted by the evaluation function h∗. Letting S represent the collection of stack items,
each si ∈ S contains:
– detailed match ft
Decoding consists of alternately popping the best hypothesis off the stack, generat-
ing multiple new hypotheses by adding all allowable words, and then pushing the new
hypotheses back onto the stack. One complete cycle then proceeds as follows:
• the best hypothesis s_1 is popped from the stack.
• check if s_1 explains the entire observation sequence: if so this is the winning hy-
pothesis and the search is finished.
• otherwise, s_1 is extended by each of the |W| allowable words to give new hypotheses
s′_1(1), . . . , s′_1(|W|).
• for each of s′_1(1), . . . , s′_1(|W|), the likelihoods f_t, g∗_t and hence h∗_t are calculated for
a range of candidate end times.
• hypotheses s′_1(1), . . . , s′_1(|W|) are pushed onto the stack
The first complete hypothesis to be popped from the stack is considered the winner. This
can be simply extended to produce N -best lists by taking the first N complete hypotheses
which are popped.
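The cycle above can be sketched with a priority queue ordered by the evaluation function h∗ = f + g∗. The helper functions extend_word() and lookahead() are placeholders for the detailed match and lookahead computations described in the following sections, and the bookkeeping of candidate end times and pruning is simplified.

    import heapq

    def stack_decode(n_frames, words, extend_word, lookahead):
        """Sketch of the A* stack decoder cycle.

        extend_word(hyp, end_time, f, word) is assumed to return a list of
        (new_end_time, new_f) pairs, one per candidate end time of the added
        word; lookahead(t) returns an optimistic estimate g*(t) of the acoustic
        log-likelihood of the observations which remain after time t."""
        stack = [(-lookahead(0), 0, 0.0, [])]         # items: (-h*, end time, f, words)
        while stack:
            neg_h, t, f, hyp = heapq.heappop(stack)   # best hypothesis first
            if t == n_frames:
                return hyp, f                         # complete: the winning hypothesis
            for w in words:
                for t_new, f_new in extend_word(hyp, t, f, w):
                    h = f_new + lookahead(t_new)
                    heapq.heappush(stack, (-h, t_new, f_new, hyp + [w]))
        return None, -float('inf')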
The efficiency or otherwise of an A∗ search is largely determined by the lookahead
function g ∗ . Whilst the estimate of the remaining likelihood must be optimistic, over-
estimates can lead to a vastly increased search space. Soong & Huang (1990) take a
multipass approach where the likelihoods computed during standard forward Viterbi de-
coding are used to provide exact estimates of g ∗ for a backward A∗ search which follows in
order to find an N -best list of hypotheses. However, exact computation of the likelihood
remaining is usually considered impractical and approximations are made using heuristic
approaches. Kenny, Hollan, Gupta, Lennig, Mermelstein & O’Shaughnessy (1993) also
use a multipass approach, though in this case the initial Viterbi pass comprises a less
detailed search with a simplified language model to rapidly produce estimates of g ∗ . An
alternative approach to estimating g ∗ is described in Paul (1992.), where the difference
between the path likelihood of a given hypothesis at time t and the least upper bound of
the likelihood of any hypothesis at that time is used.
Combined depth and breadth-first decoding schemes have also been investigated, such
as the start-synchronous search of Renals & Hochberg (1995). Here, a stack ordered by
an evaluation function h∗ is maintained for each hypothesis end-time. Starting with
the stack corresponding to the earliest reference time, a depth-first search is made in
which hypotheses are either extended or pruned until the stack is emptied. Search then
proceeds to the stack with the next lowest reference time. Hypotheses for which a word
is successfully added are inserted in the stack corresponding to the new hypothesis end
time. Such a scheme in conjunction with a hybrid connectionist/HMM acoustic model
(such as described in Section 1.3.2) was shown to give near real-time decoding on a large-
vocabulary speech recognition task.
Work for this thesis has involved implementing acoustic matching for linear dynamic mod-
els within the structure of a general purpose time-asynchronous stack decoder, originally
written by Simon King of the Centre for Speech Technology Research (CSTR).1 Though
the following sections deal with the case of decoding of LDMs, many of the issues involved
apply to search with any segment-based model of speech. A time-asynchronous A∗ search
strategy was chosen for a number of reasons:
• The Viterbi criterion is not integral to the search. Such an approximation (as given
in Equation 6.6) can be applied where models share the same state at some given
time t. However, since LDMs have a continuous state, the Viterbi criterion can only
be applied when the state is reset, which depending on the implementation may be
at the ends of phones, words, or not at all. The Viterbi criterion is never admissible
on a frame-by-frame basis.
• Section 4.2.3 on page 97 described how p(y_t^{t+τ} | Θ_m), the likelihood of a given model
generating a sequence of observations y_t^{t+τ} = {y_t, . . . , y_{t+τ}}, is calculated. For
notational simplicity, the likelihoods below are assumed to be conditioned on Θ_m.
1. Appendix D gives a breakdown of the authorship of the various pieces of code which were required
for implementation of the work in this thesis.
Once p(y_t^{t+τ}) has been calculated, extending acoustic matching by a single frame is
straightforward. Since

p(y_t^{t+τ+1}) = p(y_t^{t+τ}) p(y_{t+τ+1} | y_t^{t+τ})

all that is required is a further forward Kalman recursion to compute p(y_{t+τ+1} | y_t^{t+τ}).
However, p(y_{t−1}^{t+τ}) cannot be calculated in such an efficient manner. The state's ini-
tial value affects the subsequent forward filtered state statistics, and hence any like-
lihood computation. Therefore, a separate Kalman filter must be run to compute
the model likelihoods for each candidate start time (a sketch of this bookkeeping
follows this list). In the light of this computational burden, the chosen search
strategy must minimise exploration of unlikely paths.
• The language model contributes part of the likelihood which is used to score each
hypothesis on the stack. However, unlike a Viterbi search, the language model is
not used to generate each new hypothesis. Decoupling the language model and
hypothesis generation in this way means that the decoder can be designed in a
modular fashion. The only restriction on the language model is that it must be able
to assign probabilities to initial portions of sentences consisting of whole words, for
example, ‘the cat sat on the’. With no requirement that the Viterbi criterion be
applied on a frame level, the decoder is also flexible to the choice of acoustic model.
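The bookkeeping implied by the second point can be sketched as follows: one running filter is kept per candidate start time, so that extending the acoustic match by a frame costs a single forward recursion for each, while a new start time requires a new filter. The model is again assumed to be a dictionary of parameter matrices, and the observation mean offset is omitted.

    import numpy as np

    class RunningFilter:
        """Forward Kalman filter for one candidate start time; each call to
        extend() advances the match by one frame and returns the cumulative
        log-likelihood of the frames seen so far under the model."""
        def __init__(self, model):
            self.m = model
            self.x, self.P = model['pi'].copy(), model['Lambda'].copy()
            self.loglik = 0.0

        def extend(self, y):
            m = self.m
            e = y - m['H'] @ self.x
            S = m['H'] @ self.P @ m['H'].T + m['v_cov']
            self.loglik += -0.5 * (e @ np.linalg.solve(S, e)
                                   + np.linalg.slogdet(S)[1]
                                   + len(e) * np.log(2.0 * np.pi))
            K = self.P @ m['H'].T @ np.linalg.inv(S)
            self.x = m['F'] @ (self.x + K @ e)
            self.P = m['F'] @ (self.P - K @ m['H'] @ self.P) @ m['F'].T + m['w_cov']
            return self.loglik

    # One filter per candidate start time, all advanced as each frame arrives:
    #   filters = {t0: RunningFilter(model) for t0 in candidate_start_times}
    # and the returned likelihoods would also be cached for later re-use.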
Section 6.1.1 above described how Viterbi decoding can be implemented for HMMs.
Pre-compiling a transition network according to the language, lexical and acoustic models
is a natural approach for decoding with HMMs since the models are discrete and finite-
state right down to state level. However, the LDM does not fit so neatly into such a
structure. LDMs give models of variable-length segments rather than frames, and with
the continuous state meaning that the Viterbi criterion is inadmissible on a frame-wise
basis, a time-asynchronous strategy provides a more natural and straightforward approach
to implementing decoding for LDMs.
Figure 6.3: Portion of a tree-shaped lexicon. During the depth-first walk, acoustic matching for
words sharing common prefixes can be shared. This figure shows the extra computation required
and that which can be re-used whilst adding the words ‘RABBLE’, ‘RABBIT’, ‘RAT’ and ‘RATS’.
An optional silence model [h#] precedes each word.
For each hypothesis which is popped, decoding involves a depth-first walk over a tree-
shaped lexicon as described in Renals & Hochberg (1995). Building the search around such
a lexicon allows computation to be shared by paths which have common prefixes. Figure
6.3 shows a fragment of such a lexicon during the depth-first walk. Adding each new word
requires some extra acoustic matching, though the figure shows how the likelihood of [r
ae b] need only be computed once for the words ‘rabble’ and ‘rabbit’. The experiments
presented in this chapter concern phone recognition, so although a lexicon of this type
will prove useful for word recognition2 , it will not form part of the decoder used in this
thesis.
Acoustic matching takes place in a grid structure with time increasing down the y-
axis and a column for each phone model to be added. Hypotheses are extended by whole
words, one phone at a time. The detailed match for adding a new phone involves taking
the likelihoods in the column corresponding to the most recently added phone, running
a separate Kalman filter for each phone start-time, and entering the newly computed
likelihoods in the appropriate rows of the following column. If the state is being reset
between phone models, as it is for the recognition experiments reported in this thesis,
the Viterbi criterion can be applied when two paths meet. This occurs where there are
multiple path likelihoods to be inserted in a single grid space, and only the highest need
be kept. Given that separate likelihoods must be calculated for each candidate start time,
this step prevents a massive explosion in the number of hypotheses under consideration.
Section 7.1.4 of chapter 7 will discuss the issues involved in continuous state decoding.
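The detailed match for one column of the grid can be sketched as below, with score_phone() standing in for the per-start-time Kalman filter of the phone model being added; the grid beam pruning of Section 6.2.3 is omitted, and the durational limits are illustrative parameters.

    def add_phone_column(prev_col, score_phone, min_dur, max_dur):
        """Detailed match for adding one phone model in the grid.

        prev_col maps candidate start times to the best path log-likelihood so
        far; score_phone(t0, dur) is assumed to return the phone log-likelihood
        for a segment of dur frames beginning at t0. Returns the next column,
        mapping end times to path log-likelihoods, with the Viterbi criterion
        applied where several paths land in the same grid cell."""
        next_col = {}
        for t0, path_ll in prev_col.items():
            for dur in range(min_dur, max_dur + 1):
                t_end = t0 + dur
                ll = path_ll + score_phone(t0, dur)
                if ll > next_col.get(t_end, float('-inf')):
                    next_col[t_end] = ll              # keep only the best path per cell
        return next_col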
An optional silence is added at the start of each new word. The likelihoods corre-
sponding to each of the candidate start times from the recently-popped hypothesis are
entered into the first two columns of the grid. The likelihoods in the first column are
picked up, and detailed match performed to compute the path likelihoods on adding a
number of frames of the silence model h#. As usual, the Viterbi criterion is applied where
paths meet, so that word-initial silence is included if any of the resulting likelihoods are
higher than those of the incoming paths which already occupy the second column. This
2. Word recognition using LDMs is currently underway by others at CSTR.
Figure 6.4: Individual words are added in a grid structure which is organised with time increasing
down the y-axis and a column for each phone model to be added. To allow an optional initial
silence, the likelihoods corresponding to each candidate start time are entered into the first two
columns of the grid. This is shown in the upper diagram. The lower diagram shows how the
likelihoods in the first column are picked up, and detailed match performed to compute the path
likelihoods on adding a few frames of the silence model h#. A separate Kalman filter must be
run for each candidate start time. The Viterbi criterion is applied where paths meet, so that the
silence model is used if any of the resulting likelihoods are higher than those of the incoming paths
which already occupy the second column.
Figure 6.5: This figure shows two of the many possible paths through the grid on adding the
word ‘rat’. Path A achieves this by taking a hypothesis which ends at time t + 1 and then adding 2
frames of the r model, 11 frames of ae, 2 frames of tcl and then 3 frames of t giving a hypothesis
ending at time t + 19. Path B takes a hypothesis ending at time t + 2 and adds 3 frames of
word-initial silence before accounting for ‘rat’ using 2 frames of r, 2 frames of ae, 4 frames of tcl
and 4 frames of t. Path B gives a hypothesis ending at time t + 17 which may be picked up at
subsequent cycle of the decoder.
The decoding experiments which are presented below consist of phone recognition on
isolated sentences using the 61 and 46 phone model sets for TIMIT and MOCHA data
respectively. For every utterance to be decoded, a Kalman filter is run across the full
observation sequence for each model m ∈ M. The frame-wise likelihoods under each
model are ranked, then an average taken across the top n. These averages are then
summed so as to produce a reverse accumulation of framewise likelihood. All experiments
reported below use n = 1 which provides a practical upper bound on the remaining
likelihood, though ignoring language model and durational constraints means that the
lookahead is over-estimated.
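A sketch of this lookahead computation is given below, assuming the per-frame log-likelihood of each model has already been computed for the whole utterance; the array shapes and the handling of the final frame are assumptions.

    import numpy as np

    def lookahead_estimates(framewise_ll, n=1):
        """Estimate g*(t), the acoustic log-likelihood of the observations
        remaining after each time t.

        framewise_ll: array of shape (T, n_models) of per-frame log-likelihoods.
        The top-n model log-likelihoods at each frame are averaged, and a
        reverse cumulative sum of the averages gives the estimate."""
        top_n = np.sort(framewise_ll, axis=1)[:, -n:]      # best n models per frame
        per_frame = top_n.mean(axis=1)
        tail = np.cumsum(per_frame[::-1])[::-1]            # sum over frames t..T-1
        return np.concatenate([tail[1:], [0.0]])           # entry t covers frames t+1 onward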
6.2.3 Pruning
Figure 6.6: The beam width ∆(stack) is updated at every iteration in order to keep to a target
stack size. This figure shows the factor by which ∆(stack) is scaled at every cycle where the decoder
aims to maintain 300 partial hypotheses on the stack where α = 0.1.
Pruning is implemented both in the grid and on the stack, and is dependent on
calculated likelihood f_t rather than lookahead. As each word is added in turn to the
most recently popped hypothesis, an upper bound Ψ_t^{(grid)} is kept on the likelihoods at
each time t in the grid, so that any paths for which f_t < Ψ_t^{(grid)} − ∆^{(grid)} are discarded.
Similarly on the stack, an upper bound Ψ_t^{(stack)} is kept for each time t. Pruning removes
any hypotheses which contain no paths for which f_t > Ψ_t^{(stack)} − ∆^{(stack)}.
Figure 6.7: The adaptive pruning adjusts the stack beam width ∆(stack) at each iteration to
maintain a roughly constant number of stack items. This figure shows the first 500 cycles of the
decoder for large and small original ∆(stack) and a target stack size of 300.
In practice, finding suitable values of ∆(stack) was a problem: tight thresholds could
result in pruning away all hypotheses, whilst larger values of ∆(stack) resulted in a stack
which grew to a size which significantly increased decoding time. An adaptive pruning
scheme was developed in which a target stack size was chosen and at each iteration, the
stack beam width was updated dependent on the current stack size. Relation 6.10 gives
the factor by which the stack beam width ∆(stack) is adjusted:
    ∆(stack)′ = ( 1 − α log( stack size / target stack size ) ) ∆(stack)                    (6.10)
The tuning parameter α dictates how rapidly the beam width can change. Figure 6.6
shows the factor by which ∆(stack) is scaled at every cycle where the decoder aims to
maintain 300 partial hypotheses on the stack. Here, α = 0.1, the value used for exper-
imentation. An illustration of the adaptive pruning maintaining a stack of 300 partial
hypotheses whilst decoding is shown in Figure 6.7. During 1000 decoder cycles, the stack
size increases initially, but is soon capped and then remains fairly constant.
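A minimal sketch of the update of relation 6.10 is given below; the default values simply mirror the settings quoted above (a target of 300 hypotheses and α = 0.1) and are otherwise illustrative.

    import math

    def update_stack_beam(delta_stack, stack_size, target_stack_size=300, alpha=0.1):
        # relation (6.10): narrow the beam when the stack is above target,
        # widen it when the stack is below target; alpha controls the rate of change
        scale = 1.0 - alpha * math.log(stack_size / target_stack_size)
        return scale * delta_stack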
Figure 6.8: The combination of target stack size and local beam width ∆(grid) affects the time
taken to decode each utterance. This scatter plot shows recognition accuracy against decoding
speed for target stack size ranging from 10 to 150 and ∆(grid) set between 1 and 150. Each cluster
corresponds to a different target stack size. With suitable parameters, decoding runs at around
20× slower than real time with close to best accuracy.
To examine how pruning affects both accuracy and speed of decoding, the decoder
was run over the 92 validation sentences used in the MOCHA recognition experiments
with target stack sizes ranging from 10 to 150 and a grid beam width ∆(grid) of between 1
and 150. The features were the 45-dimensional articulatory features used in experiments
in Section 6.3.1 below. The results shown in Figure 6.8 are recognition accuracies against
mean time to decode each sentence as a factor of times slower than real time, plotted on
a logarithmic scale. Note that results are slightly lower than those given in Section 6.3.1
as no duration model was included.
Each cluster corresponds to a different target stack size, and the variation within each
to the range of local beam widths ∆(grid). It is apparent that the number of partial hypotheses kept on the stack has a significant effect on the speed at which the decoder runs. The local beam width ∆(grid), however, affects accuracy but has little effect on the time taken to decode each utterance. Close to the highest accuracies are given with the decoder running at around
20× real time. The results corresponding to the smallest and largest values of ∆(grid) are
shown in red and blue respectively. It is apparent that for smaller stack sizes, pruning
in the grid is advantageous to recognition performance, as the highest accuracies do not
correspond to the largest grid beam widths. Such pruning has the effect of removing
unlikely hypotheses at the first possible opportunity.
Here, pruning in the grid is shown to make little difference to decoding speed. However,
it should be noted that these results are for phone recognition: it may be that the local
beam width ∆(grid) has a more significant effect on the decoder speed for word recognition
during which multiple phone models are evaluated in the grid.
Pre-computation of state statistics was discussed in Section 4.2.4 on page 99. This can
also be used during recognition with correspondingly significant savings. Since the state is
reset between phones, computation can be further reduced by caching acoustic likelihoods.
If model m with parameter set Θm is evaluated for the observation sequence y_t^{t+τ} = {yt, . . . , yt+τ}, the likelihoods p(y_t^t | Θm), . . . , p(y_t^{t+τ} | Θm) are stored. Later pops of the
stack will frequently require likelihoods for models starting at the same time t though
with a different previous context. The cached likelihoods can be inserted into the grid
without being re-computed.
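A sketch of such a cache is given below. The run_filter callable stands in for the Kalman filter computation of cumulative segment likelihoods and is a placeholder, as is the layout of the cache itself.

    def cached_segment_logliks(cache, run_filter, model_name, start, duration):
        # Cumulative acoustic log-likelihoods for a model hypothesised to start at
        # frame `start`, keyed by (model, start time).  Because the state is reset
        # between phones, these do not depend on the previous context, so later
        # pops of the stack can reuse them without re-running the filter.
        key = (model_name, start)
        if key not in cache or len(cache[key]) < duration:
            cache[key] = run_filter(model_name, start, duration)
        return cache[key][:duration]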
6.3 Experiments
The recognition experiments use the LDM-of-phone formulation which provided the core
results of chapter 5. In each case, the model set which gave the highest classification
accuracy on the relevant validation set, and hence provided the final classification result,
is used for recognition. The word decoder is adapted for phone recognition by treating
phones as words. The lexicon simply maps each phone onto a ‘word’ consisting of the
phone in question.
Decoding involves the combination of likelihoods from the acoustic, language and duration models. To balance the contribution of each of these, and to tune decoder accuracy, a number of parameters must be set; a sketch of how they are combined is given after the list. They are:
• language model scaling factor. Classification experiments used simple bigram lan-
guage models, though for recognition, backed-off bigrams as described in Section
3.3 on page 64 are employed.
• duration model scaling factor. The log-Gaussian duration models of Section 5.3.4 are
used to aid phone recognition. Furthermore, the segment lengths at the tails of the
duration distribution are disallowed by setting a minimum allowable duration model
likelihood. In this way minimum and maximum segment durations are imposed.
• phone insertion penalty. Balancing the number of insertions and deletions is fre-
quently required to maximise recognition accuracy. A phone insertion penalty is
added to the acoustic log-likelihood in an effort to promote/discourage transitions
from one model to the next.
• likelihood to end scaling factor. Section 6.2.2 above noted that the approach used to
compute the lookahead function gt∗ would result in over-estimates of the remaining
likelihood. Scaling the lookahead (by a factor < 1) might result in a faster search,
though this was not used in practice. Comparison of the original estimate of the
complete likelihood g0∗ and the actual computed likelihood for TIMIT recognition
using δ and δδ parameters showed that on average, the estimate was a factor of
1.062 higher than the detailed match. This was considered to be a sufficiently low
upper bound.
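A sketch of how these quantities might be combined for a partial hypothesis is given below; the function and parameter names are illustrative, and the additive combination of log scores is the only assumption beyond what is described above.

    def hypothesis_score(acoustic_ll, lm_logprob, dur_logprob, n_phones,
                         lm_scale, dur_scale, insertion_penalty, min_dur_logprob):
        # Durations in the tails of the log-Gaussian duration model are disallowed
        # by flooring its log-likelihood, which imposes minimum and maximum durations.
        if dur_logprob < min_dur_logprob:
            return float('-inf')
        # The phone insertion penalty is applied once per phone to balance
        # insertions against deletions.
        return (acoustic_ll
                + lm_scale * lm_logprob
                + dur_scale * dur_logprob
                + insertion_penalty * n_phones)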
Unless otherwise stated, language and durational model parameters are estimated only
on the relevant training sets. Scaling factors and phone insertion penalty are optimised
on validation data.
Recognised output is subject to insertion and deletion errors, as well as misclassi-
fication of individual phones. The HResults tool distributed as part of HTK (Young
et al. 2002) uses dynamic programming to align decoder transcriptions with their manual
labels, before counting the number of correctly identified phones along with any inser-
tion and deletion errors. Two statistics are frequently used to report recognition results.
These are %correct and %accuracy, and are calculated as follows:

    %correct  = 100 × n(phones correct) / n(total phone labels)                              (6.11)

    %accuracy = 100 × ( n(phones correct) − n(insertions) ) / n(total phone labels)          (6.12)
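For concreteness, the two statistics can be computed from the HResults-style counts as follows (an illustrative sketch).

    def score_recognition(n_correct, n_insertions, n_labels):
        # equations (6.11) and (6.12)
        pct_correct = 100.0 * n_correct / n_labels
        pct_accuracy = 100.0 * (n_correct - n_insertions) / n_labels
        return pct_correct, pct_accuracy

For example, 3000 correctly identified phones, 200 insertions and 4000 reference labels would give 75.0 %correct and 70.0 %accuracy.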
Early recognition experiments revealed that the decoder was prone to deletion errors.
With the various likelihood scaling factors and phone insertion penalty optimised to give
the highest recognition accuracy on a validation set, there remained many more deletions
than insertions. In Section 4.2.3 on page 97, it was observed that the state covariance
had an adverse effect on the classification of shorter phone segments. It was suggested
that this was due to low likelihoods in the first few frames of segments prior to the state
finding an appropriate location in state-space. Low likelihoods during segment-initial
frames have a more significant impact on overall segment likelihood in shorter segments.
Similarly, it was speculated that the large number of deletion errors during recognition
were due to low likelihoods directly after state initialisation.
One method which might be used to counter this effect is to commence the Kalman filter recursions one frame prior to the hypothesised model start time, though without accumulating likelihood over the first recursion. For a model starting at time t, such a step amounts to
setting xt−1|t−2 ∼ N (π, Λ), rather than xt|t−1 ∼ N (π, Λ) as is standard. Table 6.1 gives
the results of an experiment to examine if modifying the state initialisation in this way
improves recognition accuracy. A subset of 120 utterances was taken from the TIMIT
validation set and recognition performed with both modes of state initialisation using
PLP and MFCC features, each with respective δ and δδ parameters. The models are
the LDMs which gave the final classification accuracies of 72.2% and 72.3%, as shown in
Tables 5.18 and 5.19 on pages 151 and 152 of the previous chapter, for PLPs and MFCCs
respectively.
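The difference between the two initialisations can be sketched as below. The model object's init_state and filter_step methods are placeholders for the standard Kalman prediction and correction recursions; only the handling of the extra warm-up frame follows the description above.

    def segment_loglik(model, y, start, end, extra_recursion=False):
        state = model.init_state()                # x ~ N(pi, Lambda)
        loglik = 0.0
        first = start - 1 if extra_recursion else start
        for t in range(max(first, 0), end):
            state, frame_ll = model.filter_step(state, y[t])
            if t >= start:                        # discard the warm-up recursion
                loglik += frame_ll
        return loglik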
Table 6.1: Results of experiments comparing the mode of initialising state statistics for the Kalman recursions which are used to compute model likelihoods, for TIMIT PLP + δ + δδ and TIMIT MFCC + δ + δδ features. The abbreviations ins, del and subs refer to insertion, deletion and substitution errors respectively. Initialising the Kalman recursions for a model starting at time t with xt|t−1 ∼ N (π, Λ) is standard. Alternatively, the recursions can be begun a frame prior to the hypothesised model start time, with the likelihood not accumulated over the first filter recursion; this corresponds to setting xt−1|t−2 ∼ N (π, Λ).
In both cases, starting Kalman recursions a frame earlier than the hypothesised seg-
ment start does have the effect of reducing the proportion of errors which are deletions.
For PLPs, the drop is from 34.3% to 30.1%, and for MFCCs from 33.6% to 27.0%. How-
ever, in the case of PLPs this is at the cost of a reduction in recognition accuracy, which
falls from 63.5% to 62.9%. For MFCCs, the %correct is increased, though extra insertion
errors mean that the accuracies are almost identical: 63.5% under the standard state
initialisation and 63.6% when an extra Kalman recursion is included. Given that there
is no evidence of benefit with MFCC features, and a reduction in accuracy using PLPs,
this modified state initialisation was not adopted for recognition experiments.
Digalakis (1992), whose work is described in Section 6 on page 42, also found recog-
nition with LDMs to be prone to deletion errors. By setting the initial state mean to
be equal to the segment-initial observation, these errors were reduced, though this was
only possible as the formulation of LDMs ignored subspace modelling by setting H, the
state-observation mapping, to be the identity I.
articulatory                63%
acoustic                    65%
articulatory + acoustic     71%

Table 6.2: Phone recognition accuracies for the triphone HMM system used for comparison, following Wrench (2001).
LDM results To allow comparison with results above, recognition experiments using
LDMs on MOCHA data follow the experimental procedure used in Wrench (2001), where
language model probabilities were estimated on the entire corpus, and a 5-fold cross-
validation performed to accumulate recognition results for each of the 460 sentences.
The scaling of the language model probabilities was set on the first cross-validation set.
Recognition with LDMs uses both a language model and a duration model, parameters
for which are accordingly estimated on the full training data, and scalings set on the first
cross-validation set. Note that in all other experiments, the parameters of language and
duration models are only estimated on training data.
For the acoustic and the combined acoustic and real articulatory features, the set of
LDMs for which the highest classification accuracies were found are used for recognition
experiments. For acoustic data, these are LDMs with a 9-dimensional state using MFCC
and δ features. The highest accuracy with a combined feature set used LDMs with a
24-dimensional state with PLP, EMA and corresponding δ features. These classification
results were originally given in Table 5.9 and Table 5.10 of chapter 5.
The actual data-set used in Wrench (2001) as described above was made available
for articulatory feature recognition experiments. Classification experiments using this
new feature set following the methodology of those in the previous chapter (described in
Section 5.1.1 on page 115) were performed, and the highest accuracy found using an LDM
with a 9-dimensional state. The full results of these exploratory experiments are given
in Table E.8 on page 265 of Appendix E. A 5-fold cross-validation gave a classification
accuracy of 72.4%. This represents similar performance to the experiments of chapter 5
where full articulatory features gave a classification accuracy of 72.1%.
Table 6.3 shows recognition results using LDMs on each of the acoustic, articulatory
and combined acoustic-articulatory features. Acoustic-only recognition gives an accuracy
feature set              features                   %correct   %accuracy   ins      del      sub      errors
acoustic                 MFCC + δ                   61.1%      54.4%       14.6%    36.6%    48.8%    6367
articulatory             EMA + LAR + δ + δδ LDA     68.6%      60.1%       21.3%    27.7%    51.0%    5564
acoustic-articulatory    PLP + EMA + δ              70.1%      64.4%       16.0%    33.2%    50.8%    4974

Table 6.3: LDM cross-validation recognition results for a 46 phone model on the fsew0 data-set using acoustic, articulatory derived, and mixed acoustic and articulatory derived features. The ins, del and sub columns give the percentage of errors which are insertions, deletions and substitutions.
of 54.4%, lower than that using the articulatory features, for which the accuracy is 60.1%.
In moving from classification to recognition tasks, a greater deterioration in accuracy is
found with acoustic than with articulatory features. These correspond to reductions from
75.0% to 54.4% and 72.4% to 60.1% with acoustic and articulatory features respectively.
Since recognition involves jointly finding the most likely alignment and phone sequence,
these results suggest that segmentations found using articulatory features lead to less
confusion in phone identity. Figure 6.9 shows a confusion matrix corresponding to the
recognition output using articulatory features. The vertical stripes down the mid-right of the figure show that common errors are to misrecognize phones as the voiced fricative [v] and the voiced oral stop [g].
The highest overall recognition accuracy of 64.4% is found using the combined acoustic-
articulatory feature set. The breakdown of errors into insertions, deletions and substi-
tutions included in Table 6.3 shows similar patterns for each of the feature sets. Sub-
stitutions represent around half of all errors, with deletions occurring around twice as
Figure 6.9: Confusion table for the recognition output with the articulatory parameters as features. The vertical stripes down the mid-right of the figure show that common errors are to misrecognize phones as the voiced fricative [v] and the voiced oral stop [g].
often as insertions. For these experiments, the decoder runs at upwards of 20× real-
time, depending on the size of the feature vectors and the amount of pruning. The result of 60.1% accuracy for the 45-dimensional articulatory feature set was found with the decoder running at 27× slower than real time.
These results are all lower than the HMM equivalents given in Table 6.2, though
using the articulatory parameters gives the closest results. For these features, context
independent LDMs give a recognition accuracy of 60.1%, which is close to the 63% found
using a triphone HMM system.
Recognition experiments were also performed using the combined acoustic and network-
recovered features. Rather than a 5-fold cross-validation, these use the single train/test division given in Section 5.1.1 on page 116 which was used to generate the network articulation.
feature set                        features          %correct   %accuracy   ins      del      sub      errors
acoustic                           MFCC + δ + δδ     63.2%      57.5%       13.5%    38.1%    48.4%    620
acoustic-network articulatory      MFCC + NET + δ    58.8%      55.7%       6.8%     47.1%    46.0%    645

Table 6.4: LDM recognition results for a 46 phone model on the fsew0 data-set using acoustic and combined acoustic and network-recovered articulatory features. The ins, del and sub columns give the percentage of errors which are insertions, deletions and substitutions.
Table 5.15 on page 146 gave a summary of the classification results for combinations
of acoustic and network-recovered articulation, along with acoustic-only results on this
same train/test division. The highest classification accuracies were 75.6% and 74.7%
for acoustic-only and combined features, given using MFCCs with both δs and δδs, and
network-recovered articulation added to MFCCs, both with corresponding δs. The results
of recognition with these same models and features are given in Table 6.4, and as with
classification, the highest accuracy is found with the acoustic features used alone. Adding
network articulation results in a decrease in accuracy from 57.5% to 55.7%. A discussion
of the possible reasons as to why the set of network-recovered articulatory features do not
improve acoustic-only phone classification was given on page 147 and is equally relevant
to recognition.
Recognition experiments were carried out for all of the TIMIT acoustic feature sets used
in Section 5.2.1 of chapter 5. The model sets are those for which final classification
accuracies are reported in Tables 5.18 and 5.19 on pages 151 and 152 for PLP and MFCC
features respectively. As with classification, recognition is performed using a set of 61
models, though in reporting results the phone set is collapsed down to 39. The set of
allowable confusions is given in Table 5.17 on page 150, and all results are given on the NIST
core test set, which is described in Section 2.4 on page 45.
Figure 6.8 showed how recognition accuracy is affected by the level of pruning within
the grid for any given target stack size. Since the lowest pruning thresholds did not
necessarily yield the highest accuracies, recognition on the test set uses the same level of
pruning which is applied during validation. This is necessarily fairly tight, given the num-
ber of scaling factors and word insertion penalties which must be chosen. The recognition
accuracies presented in this section were produced with the decoder running between 10
and 30 times slower than real-time on a 2.4GHz Pentium P4 processor, depending on the
dimension of the feature vector.
Recognition results are given in full in Table 6.5. For PLP cepstra, adding δs increases
recognition accuracy from 55.2% to 58.5%, and has a balancing effect on the occurrence of
deletion and insertion errors with the ratio shifting from over 5:1 to around 3:1. However,
further adding δδs gives no extra increase in accuracy, and in fact gives a reduction in
% correct. The recognition experiments which use MFCC features show that accuracy is
improved on adding δs and further including δδs. Recognition accuracy increases from
51.1% to 57.2% on adding δs. The number of errors is reduced from 3588 to 3140, and
the percentage of these which are deletions falls from 45.5% to 41.5%. Adding δδs as well gives the highest overall accuracy of 60.3%, and further balances the occurrence of
insertions and deletions.
features         %correct   %accuracy   ins      del      sub      errors
PLP              58.7%      55.2%       7.7%     39.2%    53.1%    3283
PLP + δ          62.9%      58.5%       10.6%    32.7%    56.7%    3045
PLP + δ + δδ     62.0%      58.5%       8.3%     37.0%    54.7%    3042
MFCC             54.0%      51.1%       5.9%     45.5%    48.6%    3588
MFCC + δ         60.0%      57.2%       6.6%     41.5%    51.8%    3140
MFCC + δ + δδ    63.9%      60.3%       9.3%     35.0%    55.8%    2914

Table 6.5: Speaker-independent recognition results using acoustic data from the TIMIT core test set. Results correspond to the 39 phone set. The ins, del and sub columns give the percentage of errors which are insertions, deletions and substitutions.
Figure 6.10 shows a confusion table corresponding to the highest TIMIT recognition
result of 60.3% using MFCCs with δ and δδ parameters. The majority of the confusions
appear to be between vowels, with phones commonly misclassified as [ix], [ax] or [ao].
Also, errors appear in making voicing decisions, with [b, d] being frequently recognised
as their voiceless equivalents [p, t].
Context-independent TIMIT phone recognition results are given by Goldenthal (1994),
whose work is summarised on page 44 of the literature review. A 39-phone recognition
accuracy of 61.9% was found on the core test set, and increased to 63.9% by incorporat-
ing explicit models of phone transitions. These results used gender-specific models, and
assume that the gender of a given speaker is known3 .
To allow comparison with this result, gender-specific experiments were also performed
with the LDM-of-phone formulation using MFCCs with δs and δδs. Classification followed
the methodology of Section 5.2.1 on page 149, except that separate model sets were
trained, validated and tested for male and female speakers. An overall classification
accuracy of 73.6% was found, which represents a statistically significant increase over the
original highest LDM-of-phone classification accuracy of 72.3%, given in Table 5.19 on
page 152. Gender-dependent recognition on the TIMIT core test set gave an accuracy
of 61.5%, shown in Table 6.6. This constitutes a statistically significant increase on the
original gender-independent model accuracy of 60.3%, and is close to Goldenthal’s 61.9%
for which explicit transition models were not used.
gender-independent 60.3%
gender-dependent 61.5%
Table 6.6: Gender-dependent recognition on the TIMIT core test set gave an accuracy of 61.5%,
a statistically significant increase on the original gender-independent model accuracy of 60.3% and
close to Goldenthal’s 61.9%.
3 Lamel & Gauvain (1993b) showed that gender decisions can be reliably made from speech parameters, findings backed up by Goldenthal (1994) who reports that of 250 utterances from the TIMIT test set, gender classification gave an accuracy of 100% using the trajectory models as described on page 44 of this thesis.
Figure 6.10: Confusion table for LDM recognition output with MFCCs + δs + δδs as features. The majority of the confusions appear to be between vowels, with phones commonly misclassified as [ix], [ax] or [ao]. Errors also appear in making voicing decisions, with [b, d] being frequently recognised as their voiceless equivalents [p, t].
Chapter 7

Conclusions and Future Work
On page 11 of the introduction, it was stated that the work in this thesis is motivated by
the belief that a model which reflects the characteristics of speech production will ulti-
mately lead to improvements in automatic speech recognition performance. The following
sections describe how this has been interpreted in practice, and analyses the results of
the investigations reported in previous chapters. Section 7.3 on page 234 then goes on to
describe how this work will be extended in the future.
Computing the likelihood of a sequence of feature vectors within a single HMM state
qj ∈ Q is invariant under random reorderings of the data.1 Speech is in fact a highly
ordered signal (where the ordering itself carries information), due to the constrained
nature of the production mechanism. Furthermore, given that features are extracted
within overlapping windows on the acoustic signal, successive frames of data are destined
to be highly correlated. Incorporating the ordering present in the parameterized speech
signal with an explicit description of the inter-frame dependencies should lead to improved
modelling.
1 This statement assumes that, if included, any derivative information in the form of δs and δδs has already been appended to the feature vectors. Clearly reordering the data prior to calculation of δs and δδs will yield a distinct set of features.
7.1 Analysis of results
Contribution of the dynamic state process The bulk of the classification experi-
ments of Chapter 5 were intended to assess the contribution made by the addition of a
model of inter-frame dependencies. Linear dynamic models were compared with otherwise
equivalent static models, and the addition of modelling of temporal correlations was found
to give modest yet statistically significant increases in classification accuracy. Table 7.1
shows the highest accuracies using static and dynamic models on the TIMIT corpus, with
LDMs applied in their fully parameterized form and one model per phone class. Classi-
fication using LDMs gave an accuracy of 72.3%, a 3.5% relative error reduction over the
best result found using a static model of 71.3%. A paired t-test showed this result to be
statistically significant, which demonstrates that the increase is consistent over the test
data.
static model         71.3%
LDM-of-phone         72.3%

Table 7.1: Comparison of the highest accuracies using static models and LDM-of-phone models. Features are TIMIT MFCCs with δs and δδs.
Recognition using LDMs on the TIMIT core test set gave an accuracy of 60.3%, which
is higher than the recognition accuracy of 58.1% found with an otherwise equivalent
static model2 . This represents a relative error reduction of 5.3%, and again is shown to
be significant using a paired t-test. Therefore, for models of complete phone segments,
accounting for inter-frame dependencies is shown to give acoustic modelling which results
in more accurate phone identity decisions.
Experiments later in the chapter used multiple regime (MR) LDMs in which phone
segments are split into a number of regions, each one modelled by an LDM. Table 7.2
shows the highest results found using multiple regime static and dynamic models on
TIMIT MFCCs used alone and with both δs and δδs. These results given here are for
the MR LDMs in which the state is passed between sub-models. Also shown are results
using the standard LDM-of-phone models and an accuracy gained by combining static
and dynamic models.
With MFCCs used alone, the multiple regime models give a statistically significant performance increase over both the standard LDM-of-phone and the MR static models. When
both δ and δδ parameters are included in the features, the MR LDM still gives a higher
accuracy than the MR static model and standard LDM, though the increase is not statis-
2 Many thanks to Simon King for preparing this baseline result. HTK (Young et al. 2002) was used to implement one-state monophone HMMs with single full covariance Gaussian output distributions. Models were trained, validated and tested on identical data to that used in all LDM recognition experiments. The language model was also the same. The validation set was used to select the language model scale factor, word insertion penalty, and a beam pruning width such that there were minimal search errors. The best models were initialised with uniform segmentation and Viterbi training (HInit), then Baum-Welch to convergence with fixed segment label times (HRest) followed by a single iteration of full embedded training (HERest). It should be noted that the HMM implementation of the static models means that the duration model will be exponential rather than log-Gaussian as used in LDM recognition.
Table 7.2: Comparison of accuracies of static and dynamic multiple regime models. Standard
LDM-of-phone results are also given for reference. The features are TIMIT MFCCs either used
alone or with δ and δδ parameters. Bold face denotes an accuracy which is statistically significantly
higher than the others using the same features. Results were originally given in Table 5.28 on
page 167.
tically significant in this case. When the likelihoods under the two models are combined,
a classification accuracy of 74.9% is found, which represents a statistically significant in-
crease over either of the models used individually. There are two main conclusions to be
drawn from these results.
Firstly, the distributions of errors under static and dynamic models are sufficiently
different that accuracy can be improved using combinations of the two. However, LDM
estimation should allow modelling of static distributions, as a zero or near-zero state evo-
lution matrix F would remove or significantly reduce the dependencies between successive
frames. Assuming parameters can be reliably estimated, overall static model performance
should never exceed that found using LDMs, though this is complicated by discrimination
amongst models. Figure 5.23 on page 169 gave a comparison of the classification accura-
cies under the MR LDM and MR static models broken down by phonetic category, and
shows that static models gave higher accuracy for diphthongs. This class of phones is
characterised by spectral transitions, and it was expected that a dynamic model would
give an advantage in this case. Closer inspection of the results shows that whilst static
models correctly classified a greater number of diphthong segments, there were also al-
most 50% more phones misclassified as diphthongs by the static models than by dynamic
models.
The second conclusion which can be drawn is that partitioning phone segments into a number of separately modelled regions can itself improve classification performance.
Feature extraction for ASR, as described in Section 3.1.2, includes steps which are designed
to decorrelate the final parameters. However, dependencies persist between feature dimen-
sions. A model which can account for these spatial correlations should have an advantage
over one in which they are ignored. Section 5.2.4 on page 157 gave the results of classifi-
cation on MOCHA and TIMIT data for LDMs on which a variety of constraints had been
placed. The results using TIMIT MFCCs with δ and δδ parameters are summarised in
Table 7.3 for variations on fully parameterized LDMs which alter the modelling of spatial
dependencies.
The function of the matrix H is to allow the observation and state distributions
to occupy distinct vector spaces. Frequently, fewer degrees of freedom are needed to
describe a process than present in the parameters which describe it, and H offers the
chance to give a compact representation of the underlying dynamics. Setting H to be the
identity I casts the LDM as a smoothed Gauss-Markov model and removes the capacity
for subspace modelling. Table 7.3 shows that the classification accuracy using such a
model is 71.7%, which represents a statistically significant reduction on the 72.3% given
by a fully parameterized LDM. The linear dimensionality reduction which H provides therefore appears to be of benefit.
Table 7.3: Comparison of accuracies using LDMs with a variety of constraints which affect
modelling of spatial correlations for TIMIT MFCCs with δs and δδs. Note that the initial state
covariance was not restricted during these experiments. The total number of parameters for each
LDM variant is given also.
Setting both observation noise C and state noise D to be diagonal is the only one of
these variations which represents a theoretical loss of generality. It was shown in Section
4.1.2 on page 87 that with xt and Σt representing the state mean and covariance at time
t, the output distribution is:

    yt ∼ N( H xt + v , H Σt H^T + C )                                                        (7.1)
With C and D diagonal, 7.1 gives a model of the spatial dependencies where the corre-
lation structure of the data is absorbed into H, so that the observation noise covariance
is approximated by a projection of the lower-dimensioned state error distribution. The
classification accuracy using LDMs with diagonal covariances for both state and obser-
vation is 69.9%, which is statistically significantly lower than all of the other results in
Table 7.3 where output noise distributions are described fully. These results demonstrate
that detailed modelling of spatial dependencies is advantageous in making phone-class
decisions.
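A one-line sketch of the covariance in 7.1 makes the role of the different constraints clear; the function is illustrative only.

    import numpy as np

    def output_covariance(H, Sigma_t, C):
        # Covariance of the output distribution in (7.1).  If C is constrained to
        # be diagonal, any spatial correlation between feature dimensions must be
        # absorbed into the projected state covariance H Sigma_t H^T.
        return H @ Sigma_t @ H.T + C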
The classification accuracy of 72.2% where just the state covariance has been set to be
diagonal is almost identical to that found with a fully parameterized LDM. The figures in
Table 7.3 demonstrate that the transformation H can provide the rotations required to
map the state space onto a basis in which the dimensions are independent with minimal
loss of accuracy and a slight reduction in parameterization.
Along with comparing the performance of static and dynamic models, the classification
experiments of Chapter 5 examined the effect of adding articulatory features to acoustic
parameters. The difficulties inherent in measuring human articulation mean that real
articulatory data can only be used as a design tool. However, with building a model
which reflects the properties of speech production cited as a goal, an examination of the
properties of real articulatory parameters provides an ideal starting point.
The regressions of Section 3.4, which compare linear and non-linear models of the
dependencies between and within phone segments, find both the overall highest absolute
R2 values and closest linear/non-linear performances on MOCHA EMA data. With the
preceding frame as the predictor, a linear regression gives an R2 of 0.981, which is 99.4%
of that of the non-linear model. These results demonstrate the strong correlation which
exists between articulatory feature vectors spaced 10ms apart. This is expected given the
relatively slowly varying nature of the speech production mechanism.
The regression models were also able to give good predictions of the data when the
dependent and explanatory variables were spaced a number of frames apart. Much of
the variation in the segment-central frame – almost 80% – could be explained using a
linear model with the segment-initial frame as predictor. Crossing phone boundaries was
found to have minimal impact on the performance of the non-linear regressor, though
led to a slight reduction in that of the linear model. These results serve to demonstrate
the long-range predictability of articulatory features, even though the inter-segmental
dependencies may be better modelled with a non-linear predictor.
Table 7.4 gives a summary of the best classification and recognition results for acoustic,
articulatory and combined acoustic-articulatory features for the MOCHA corpus. Acous-
tic and articulatory features give different distributions of errors, as shown pictorially in
Figure 5.11 on page 136. This is also apparent in moving from classification to recognition
tasks: whereas classification with acoustic data gives a higher accuracy than that with
articulatory parameters, 75.0% compared to 72.4%, recognition with articulatory data
gives greater accuracy than that with acoustic features, 60.1% compared to 54.4%. Given
that the difference between classification and recognition tasks is in finding the alignment
of segments, these results suggest that articulatory features allow the set of LDMs to find
phone boundaries more accurately.
Table 7.4: Comparison of the highest classification and recognition accuracies on MOCHA data
for acoustic, articulatory and combined acoustic-articulatory feature sets. These experiments use
measured articulation from the MOCHA corpus.
Table 7.5: Comparison of the highest classification and recognition accuracies on MOCHA data
for acoustic and combined acoustic-articulatory feature sets. These experiments use the network-
recovered articulation produced to accompany MOCHA data.
articulators which in measured data show a great deal of variation. Not all articulators
are critical for the production of each phone (see page 51 for a description of critical/non-
critical articulators), and Section 4.4.2 on page 108 showed that the variance captured in
the observation noise C for different feature dimensions reflected notions of which artic-
ulator would be critical in producing the consonantal phones. These variances are used
to weight the contribution of each feature dimension in computing the likelihood of a
model given a sequence of observations. It seems likely that for network-recovered artic-
ulatory features, there is an overemphasis of the contribution of non-critical articulators
in computing likelihoods.
Section 5.1.6 on page 147 suggested that an alternative type of neural network may
provide an articulatory inversion mapping which is more suitable for the purposes of ASR.
An approach taken in Zacks & Thomas (1994) replaces the sum of squares error function
with one which incorporates a measure of correlation. The intention is to push learning
toward recovering the shape of articulatory trajectories, rather than simply estimating the
conditional mean. Alternatively, work reported in Richmond (2001) uses mixture density
networks which allow a multimodal output distribution. Such a network, with a single
mixture component, would estimate the variance of each articulatory dimension along
with its conditional mean. Incorporating this measure into the LDM’s observation noise
covariance would provide a measure of the criticality or otherwise of a given articulator,
by weighting its contribution in any likelihood calculation. Real articulation is shown
to provide cues which are not as apparent in acoustic data. Ultimately, incorporating
artificially recovered articulatory parameters will only provide performance improvements
if the inversion mapping can reliably derive these cues from the acoustic signal.
features       state      classification accuracy
MOCHA EMA      reset      59.5%
MOCHA EMA      passed     57.0%
TIMIT MFCC     reset      67.4%
TIMIT MFCC     passed     67.0%

Table 7.6: Comparison of classification accuracies in which the state is either reset or passed between phone segments. Results are given for MOCHA EMA and TIMIT MFCC features.
For both MOCHA EMA data and TIMIT MFCCs, the classification accuracy de-
creases when the state statistics xt|t and Σt| t are passed across phone boundaries. Rosti
& Gales (2003) report preliminary results for a similar experiment using LDMs in which
it was also found that a fully continuous state gives a slight reduction in performance. A
factor-analysed hidden Markov model (FAHMM) (Rosti & Gales 2002) was used to gen-
erate 50-best lists for utterances from the resource management (RM) (Price et al. 1988)
test set. Rescoring using LDMs gave a word error rate of 11.00% where the state was
reset between phone segments, and 11.82% where the state was continuous.
The initial results both in this thesis and from Rosti & Gales (2003) are not conclusive
for this implementation of LDMs, as the success of such an approach might depend on
occasional resetting of the state. There is a great deal of variation in the nature of
the transitions between segments. In some cases, these will be highly non-linear, such
as found between the closure and release portions of plosives where the change point is
defined by an abrupt shift in the spectral energy. At other times, the segmental boundaries
are less well-defined, such as in the transition between a vowel and a nasal stop, where
anticipatory nasalisation colours the vowel sound and the spectral transitions are smooth.
It may be that resetting the state for the first of these examples would act as a regularizer
for the state covariances, but allowing passing of the state in the second would enhance
modelling.
Figure 7.1 shows some of the statistics computed during a forward filter recursion
across a complete utterance from the TIMIT corpus, both with the state passed across
model boundaries, and also where it has been reset. The framewise likelihoods are plotted,
along with the log determinants of the error covariance Σet and state covariance Σt|t . The
same set of models was used to generate both state-passed and state-reset plots so that
the effect of resetting can be seen in isolation. The true models according to the manual
labels were used for each segment, having been trained on the full TIMIT training data
in the normal way with the state reset at the start of each segment.
When the state is reset to have its initial distribution N (π, Λ) between segments,
phone boundaries are evident as there are sudden reductions in the magnitude of the
state covariance Σt| t (shown in the 3rd and 4th plots on different y-scales) at these points.
For longer segments, the covariance appears to converge to a set magnitude regardless of
whether the state has been reset or not.
With the state continuous across segment boundaries, estimates of the state covari-
ance Σt| t at the start of phones tend to have larger magnitudes than when steady-state
values are reached. Conversely, if reset at the start of segments, the magnitude of the
state covariance is dramatically under-estimated. Training of LDMs was described in
Section 4.2.2 and estimation of the initial state covariance follows that found in Digalakis
et al. (1993), Ostendorf, Digalakis & Kimball (1996) and Roweis & Ghahramani (1999).
However, Ghahramani & Hinton (1996a) give a form which adds the covariance of the
initial smoothed state vectors about the initial state mean π to the estimate of Λ. With
(k)
x̂1|Nk representing the initial smoothed state vector for the k th of K sequences which
(i) (i)
has Nk frames, the extra term is K T
P
i=1 (x1|Ni − π)(x1|Ni − π) . This modification would
certainly result in larger estimates of the initial state covariance, and might provide a
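A sketch of the extra term is given below; the stacking of the initial smoothed state vectors into a single array is an assumption of the sketch, and how the term is normalised and combined with the existing estimate of Λ is left to the form given by Ghahramani & Hinton (1996a).

    import numpy as np

    def initial_covariance_extra_term(x1_smoothed, pi):
        # x1_smoothed: (K, d) array whose rows are the initial smoothed state
        # vectors for the K training sequences; pi: the initial state mean.
        diffs = x1_smoothed - pi
        # sum over sequences of the outer products (x - pi)(x - pi)^T
        return diffs.T @ diffs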
Figure 7.1: Pictorial comparison of the framewise likelihood, error covariance log determinant
log |Σet | and state covariance log determinant log |Σt|t | with the state reset and passed over bound-
aries. The 3rd and 4th plots both show log |Σt|t |, the latter with a y-scale which provides greater
detail. The plots were produced using an MFCC parameterization of the TIMIT sentence ‘even
then if she took one step forward he could catch her’.
Similar patterns are apparent in the plot of the error covariance determinant |Σet |,
though since Σet is a combination of the observation noise covariance C and a projection
of the predicted state covariance HΣt|t−1 H T , a floor will be provided on its minimum
magnitude by C. The spikes at segment boundaries are still evident in the state-reset
plot though are less marked. The top plot shows that framewise likelihoods where the
state has been passed across phone boundaries are generally equal to or higher than those
produced with the state reset. To calculate the framewise likelihood, the prediction errors
are normalised by the error covariance and so, as expected, boundary effects are evident
in the state-reset likelihood.
It is apparent from these observations that allowing the state to run across model
boundaries gives a subtle modification of the properties of the LDM. Such an implemen-
tation frequently leads to a slight over-estimation of the state covariance at the start of
new segments. In some cases there are sudden increases in its magnitude, such as shown
by the spike at frame 100 in the 4th plot of Figure 7.1. Further investigation is required
to establish the instances when it is advantageous to pass the state and when resetting
is useful. Building an understanding of the manner in which this choice interacts with
the ability to make phone class decisions would be non-trivial, though desirable given the
intuitive appeal of such a model for ASR.
Should continuous state decoding of LDMs be required, there is a practical issue which
must be overcome: as observed in Section 6.2 on page 192, the Viterbi criterion can only be
applied when the state is reset. Since the state at any time affects future evolution, there is
no guarantee that for two paths with identical language model state but differing location
in state space X , the path with lower likelihood will not supersede that of the other
at some future time. The inadmissibility of the Viterbi criterion would create a vastly
increased search space, though it may be that an approximation can be made in which
paths with lower likelihood are removed if the states are ‘close’. In fact, the predict-
correct nature of the state process means that the influence which the initial conditions
have on future state distributions diminishes with each forward Kalman recursion.
To demonstrate this pictorially, a set of reference state mean vectors and covariance
matrices was first found by running a forward filter for the [eh] and [sh] segments from the
TIMIT sentence ‘Even then if she took one step forward he could catch her.’ For the same
segments and models, filters were also run whilst varying the state initial distribution by
setting it to that of each of the other models in the set. The plots in Figure 7.2 show
the root mean square errors (RMSE) between the reference state statistics and those
generated with each distinct initial value. The rapid reduction in the RMSE suggests
that an approximation to Viterbi could be applied where two or more paths finish at the
Figure 7.2: A set of reference state mean vectors and covariance matrices was found by running a
forward filter for the [eh] segment in ‘then’ and the [sh] segment in ‘she’ from the TIMIT sentence
‘Even then if she took one step forward he could catch her.’ For the same segments and models,
filters were also run whilst varying the state initial distribution by setting it to that of each of
the other models in the set. The plots show the root mean square errors (RMSE) between the
reference state statistics and those generated with each distinct initial value.
same time with identical language model states, and have occupied the same model for a
number of frames.
Pre-computation of the 2nd order state statistics and caching of likelihoods is used in
state-passed decoding and leads to significant reductions in computation. This is only
possible if the initial state distribution is known in advance – which of course relies
on resetting at segment boundaries. However, when running a Kalman filter with no
inputs, the error covariance Σet , Kalman gain Kt , predicted state covariance Σt+1| t , and
corrected state covariance Σt|t as described in Section 4.2.1 on page 89 converge after a
few iterations. An ad hoc measure of convergence of the state statistics can be made by
computing the root mean square error (RMSE) between the entries in successive filtered
matrices. Figure 7.3 shows such errors for Σet , Kt , Σt+1| t , and Σt|t for models of each
of the 46 MOCHA phones trained on EMA data. Low values indicate small differences
between successive matrices. It is apparent that these quantities converge rapidly, with only slight adjustments remaining 4 frames from the start of a segment. Computational savings are
thus offered by ceasing to update the error covariance Σet , Kalman gain Kt , predicted
state covariance Σt+1| t , and corrected state covariance Σt|t after the fourth or fifth frame
of a new model.
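The convergence test itself is straightforward; a sketch is given below, with the tolerance value an illustrative choice.

    import numpy as np

    def matrices_converged(prev, curr, tol=1e-4):
        # Root mean square difference between the entries of successive filtered
        # matrices (error covariance, Kalman gain, predicted and corrected state
        # covariances).  Once below tol, typically after 4 or 5 frames, the
        # matrices can be frozen and reused for the rest of the segment.
        return float(np.sqrt(np.mean((curr - prev) ** 2))) < tol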
Figure 7.3: A pictorial representation of the convergence of the error covariance Σet ,
Kalman gain Kt , predicted state covariance Σt+1| t , and corrected state covariance Σt|t
during filter updates. Each line shows the diminishing root mean square error (RMSE)
between successive values of Σet , Kt , Σt+1|t , and Σt|t for models of all 46 phones trained
on MOCHA EMA data.
7.2 Limitations with the current approach
The approach taken in this thesis has been to work with simple models and tasks which are
small enough to allow meaningful analysis of results. In this manner, a deep understanding
of the properties of the models in question can be built up: a more complex task, such
as conversational speech, whilst more representative of the requirements made of an ASR system, would also make error analysis significantly more complicated. The following
sections outline two of the main limitations with the current system, and how these
might be addressed in order to scale up to larger and more demanding tasks.
The experiments reported in previous chapters all rely on time-aligned phonetic labels to
train the model sets. Very few corpora include manual phone labels, and so systems are
typically trained from word transcriptions aligned with the speech signal at the sentence
or paragraph level. Full embedded training of LDMs requires integrating over all possible
segmentations, which will be computationally expensive (though tractable) as a separate
Kalman smoother must be run for each one.
An approximation to full EM training is to alternate between segmenting (a.k.a. forced
aligning) each utterance using the most recently estimated model parameters and then
continuing training. This method uses the single most likely model alignment rather than
summing over all possible alignments and is therefore known as Viterbi training. The
decoder as described in Section 6.2 can simply be adapted for use as an aligner. At each
pop, all hypotheses are removed from the stack apart from the correct one. By storing
the candidate phone end-times for each partial hypothesis, a traceback can be made to
find the segmentation which produced the final highest likelihood.
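The overall procedure can be sketched as follows; align and reestimate stand in for the aligner described above and for the per-phone LDM parameter updates, and are not part of the current implementation.

    def viterbi_train(models, utterances, transcriptions, align, reestimate, iterations=4):
        # Alternate between finding the single most likely segmentation of each
        # utterance under the current models (forced alignment) and re-estimating
        # the model parameters from that segmentation, rather than integrating
        # over all possible segmentations as full EM training would require.
        for _ in range(iterations):
            alignments = [align(models, y, words)
                          for y, words in zip(utterances, transcriptions)]
            models = reestimate(models, utterances, alignments)
        return models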
A pilot experiment was performed to assess the LDM’s ability to provide a segmenta-
tion of the data. The highest classification and recognition accuracies on the TIMIT data
were found using LDMs with a 9-dimensional state and MFCCs with δ and δδ parameters.
The decoder was used with these models to give alignments for all of the utterances from
the TIMIT corpus. New labels were prevented from shifting more than 50ms (5 frames)
either side of the manual label start and end times. This constraint affected just over
Table 7.7: Results of TIMIT classification and recognition experiments where phone models
have been trained according to a set of LDM alignments rather than manual labels. The LDM
alignments were also used in testing during classification.
Results are given in Table 7.7, and show a significant increase in the classification
accuracy, from 72.3% to 75.9%. However, using this new model set for recognition gave
an accuracy of 59.6%, slightly lower than the equivalent result using models trained
on the manual labels which was 60.3%. Whilst Viterbi training did not in this case
produce improved recognition performance, the classification result shows that LDMs
can successfully be used to align data. This will allow future work to extend to larger
corpora for which manual phonetic labels are unavailable.
The LDM gives a unimodal time-varying Gaussian distribution over the observations.
However, the parameterized speech signal consists of data which is only approximately
Gaussian (Young 1995). Furthermore, factors such as variability between speakers mean
that features can be multimodal. Gaussian mixture models are a frequently-used approach
for approximating general probability density functions. Under such a model,

    Y ∼ ∑_{j=1}^{r} λj pj(Y)                                                                 (7.2)

where the pj are Gaussian distributions Y|j ∼ N(θj, φj) and λj = P(j) is the prior on mixture component j of r such that ∑_{j=1}^{r} λj = 1 (r is used here rather than the more usual m since models have previously been denoted m). State-of-the-art HMM speech
recognition systems such as HTK (Young et al. 2002) use mixtures of Gaussians to model
the observations generated by each state. With the addition of mixtures giving substantial
performance improvements for HMMs, multimodal modelling for LDMs could be expected
to provide similar advantages.
Replacing the Gaussian observation noise distribution εt with a mixture distribution
would be a simple way in which to produce a multimodal LDM. However, this also leads
to a model which is computationally intractable. Inferring the state distribution xt would
involve conditioning on a mixture density over the observations. Each forward filter
recursion would then cause an exponential growth in the number of mixture components
with which the state was described. Replacing other model parameters with mixtures
results in the same computational intractability. Section 7.3.1 below describes how future
work will seek to give a multimodal representation of the observations whilst minimising
such effects.
Full covariance observation noise The results of Section 5.2.4 which compared a few
variations on the fully parameterized LDM, summarised in Table 7.3 on page 220, showed
that a full covariance matrix for the observation noise improved classification accuracy.
This requires models with a large number of parameters, and also increases computational
expense.
The projection of the state distribution by H determines the approximation which can
be made to the error distribution. In this case, the linear mapping does not match the
modelling given by estimating a full covariance matrix. A different form of H, such as a
general non-linear mapping, may improve the ability to capture the correlation structure
of the data. This type of approach was taken in Richards & Bridle (1999) and Iso (1993),
both of which are outlined in Section 2.3.5 of the literature review. Alternatively, moving
toward a multimodal representation of the parameterized speech signal will allow more
general output densities and reduce the mismatch between the true and approximated
error distributions.
7.3 Future work
Work at CSTR will continue the investigation of linear dynamic models for speech recog-
nition. Future directions are summarised below.
The multiple regime (MR) experiments of Section 5.3.2 on page 164 altered the standard
implementation by modelling each segment with a series of LDMs. The state process
could be continuous for the length of each phone, but the set of LDM parameters used to
generate observations was controlled by a deterministic mapping dependent on segment
type and duration. The more general case, with the regime changes modelled by a discrete
(usually Markovian) process is termed a switching state-space model by Ghahramani &
Hinton (1996b), and will be referred to as a switching linear dynamic model in this work.
Figure 7.4: Pictorial representation of a switching LDM. Arrows represent dependencies with
xt , yt and qt denoting the state and observation vectors, and discrete switch state at time t
respectively.
Figure 7.4 shows such a model represented as a Bayesian network (Smyth 1998).
Arrows denote dependencies, with square and oval nodes corresponding to discrete and
continuous random variables. As usual, xt and yt denote the state and observation
vectors, and qt is the discrete switch state at time t.
Depending on the switching topology, this class of models can be used to approximate
non-linear dynamics, along with non-Gaussian and multimodal output distributions.
Figure 7.5: Example of a switching topology which allows modelling of non-Gaussian and/or
multimodal output distributions.
Figure 7.5 shows a switching topology which would allow modelling of multimodal or
non-Gaussian output distributions. If the models associated with switch states 1 and 2
differ in one or more of the observation process parameters H, v and C, then the output
distribution will be a Gaussian mixture model with 2 components. Alternatively, temporal
switching is shown in Figure 7.6 where the models corresponding to switch states 1 and
2 differ in one or more of their state process parameters F , w and D. Regime switching
of this type gives a piecewise linear state evolution which can be used to approximate
non-linear dynamics. The continuous state classification experiments in Section 5.4 on
page 176 provide an example of temporal switching where all parameters are switched at
known points – in this case phone boundaries.
Figure 7.6: Example of a switching topology which can be used to approximate non-linear
dynamics.
7.3.2 Tractability
Section 7.2.2 on page 232 above pointed out the downside to including mixtures in LDMs:
exponential growth of the number of components in the state distribution leads to com-
putational intractability for all but the simplest of models and shortest of observation
sequences. The machine learning literature contains a number of approximate methods
which can be used to work around the problems involved in inference and estimation.
Possible approaches include:
• N-best Viterbi, where the N most likely components of the state distribution are kept at any time.
The switching model described above introduces mixing of parameters. In fact this is
not a desirable property in an acoustic model for ASR: frame-based HMMs were criticised
during the discussion of segmental HMMs on page 33 for being able to generate whilst
randomly switching between mixture components at every frame. By treating switching
as a true switching process rather than a mixing process, the benefits of multimodal
modelling can be introduced whilst retaining the structure which segment models allow.
Since LDMs are used to generate segments and not individual frames, the switching process
should only be allowed to change state at certain times, rather than at every frame.
Performing N -best Viterbi, or an approximation to Viterbi as described in Section 7.1.4,
at the points where paths in the switching process's finite-state network meet will keep
the number of components describing the state small. Computation, which must be considered for a practical ASR
system, is significantly reduced by switching rather than mixing.
Once the switching process arrives in a particular state, an associated LDM gener-
ates a number of frames using a fixed set of parameters. The switching process then
transitions to another state, where another LDM generates another sequence of frames
using a new parameter set. Such an approach gives an appealing replication of speech
production: at any given time the articulators follow a unimodal path, though context
or speaker characteristics dictate the set of trajectories or targets required to produce a
given segment.
The applications of LDMs to speech data found in this work, along with those of Digalakis
(1992), Digalakis & Ostendorf (1992), Digalakis et al. (1993), and Rosti & Gales (2003),
force parameter switches at phone boundaries. Furthermore, the multiple regime LDMs
of Section 5.3.2 follow the correlation invariant models of Digalakis (1992) and use deter-
ministic mappings to control the parameter switching within phone models. In all but
the pilot studies presented in Section 5.4 and by Rosti & Gales (2003), the states are
also reset at the start of each phone, so these are not strictly switching models as defined
above. However, the observation that manually forced switch points will be suboptimal
is equally applicable.
ASR requires mapping from continuous features to words which are symbolic and
discrete. At some level then, the parameterized speech signal must be divided into a
series of regimes. However, these regimes are not necessarily neatly abutted like “beads
on a string” (Ostendorf 1999), but in fact influence each other strongly. Furthermore,
phones are unlikely to be the optimal units in all cases, as it may frequently be useful to model
shorter regimes such as stop closures or longer ones such as syllables. A project is currently
underway at CSTR to automatically derive a unit inventory for speech recognition with
LDMs. Building a switching model and learning the topology of the underlying finite-state
switching process will accompany this work.
There are no closed-form solutions for the problem of inferring a finite-state model
topology from data. The two approaches which appear in the literature involve either state
splitting or state merging (Mohri 1997). The first of these initialises with a simple topology
and repeatedly splits existing states to add new ones (Ostendorf & Singer 1997, Freitag &
McCallum 2000). Alternatively, a complex initial topology can be constructed and then
similar states merged or tied (Stolcke & Omohundro 1992, Stolcke & Omohundro 1994,
Lee, Kim & Kim 2001). Whichever approach is taken, states are added or removed until
convergence has been reached according to some metric.
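As a schematic of the merge-until-convergence idea only (not the procedure of any of the works
cited above), the sketch below starts from one state per seed unit, each summarised by a mean
vector, and greedily merges the closest pair until no pair is sufficiently similar; a real system
would score candidate merges with a likelihood-based or Bayesian criterion rather than Euclidean
distance.

    import numpy as np

    def merge_states(state_means, threshold):
        """Topology induction by state merging, schematically: repeatedly merge
        the closest pair of states (each summarised by a mean vector) until no
        pair is closer than `threshold`.  Averaging the two means is a crude
        stand-in for re-estimating the merged state's parameters."""
        states = [np.asarray(m, dtype=float) for m in state_means]
        while len(states) > 1:
            # find the closest pair of states
            pairs = [(np.linalg.norm(states[i] - states[j]), i, j)
                     for i in range(len(states)) for j in range(i + 1, len(states))]
            dist, i, j = min(pairs)
            if dist > threshold:          # convergence: no sufficiently similar pair left
                break
            merged = 0.5 * (states[i] + states[j])
            states = [s for k, s in enumerate(states) if k not in (i, j)] + [merged]
        return states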
It may be that either of these methods can be improved using a prior on the topology.
The preamble gave the criteria, laid down in the oft-cited Bourlard et al. (1996), which
must be met to justify risky departures from mainstream approaches to ASR. These are:
• sound methodology
Appendix A

Phone sets
The IPA symbols closest to each TIMITBET phone are taken from Keating (1998).
Appendix B

Model initialisation
Given a model and some data, the EM algorithm updates the model parameters in such
a way as to increase the model’s likelihood over the data. The iterative nature of EM
means that initial estimates must first be found for model parameters. Moreover, whilst
EM is guaranteed not to decrease the model likelihood over the data at any iteration, there is
the possibility of converging to a local rather than global maximum. As will be demon-
strated, initial conditions can have significant impact on final model performance, though
unfortunately there is no one established technique for initialising LDMs.
This appendix presents two possible approaches to parameter initialisation. The first
is an ad hoc method in which assumptions are made about the function of each parameter,
and hence the sorts of values they should take. Starting values for each of the parameters
are either then chosen, or found experimentally. The second approach takes the param-
eters of a factor analyser model to initialise the observation process. These are in turn
used to infer state parameters.
It is important that in developing an ASR system, the test set is used as infrequently
as possible, to avoid a gradual tuning toward the test data. Therefore, the validation
components of the MOCHA and TIMIT classification procedures outlined in Sections 5.1.1
and 5.2.1 on pages 115 and 149 are used to compare combinations of initial conditions.
The accuracies quoted below thus refer to the highest classification accuracies found on
the validation set for a range of language model scalings and training iterations. As a
reminder, the LDM is given by:
y_t = H x_t + ε_t        ε_t ∼ N(v, C)
x_t = F x_{t−1} + η_t        η_t ∼ N(w, D)

and a distribution over the initial state, x_1 ∼ N(π, Λ). Observation vectors y_t and state
vectors x_t have dimensions p and q respectively.
In this approach, assumptions are made about the purpose and interaction of the param-
eters of the LDM. The initial values are then chosen with the intention of steering model
parameters toward these ideals.
The function of a state-space model is to make a distinction between some underlying
process and the observations with which it is represented. Ideally, there will be consistent
dynamics in the data which can be closely modelled by the state’s evolution. In such a
situation, predictions would be confident and a low state error covariance would follow.
Furthermore, the smoothing required to allow for any mismatch between predictions and
observations should be accounted for by the observation noise. It would be desirable for
the correlation structure of the data to be captured in the observation matrix H, as the
components of the state could then be independent. This would allow for an efficient implementation of
the model as the initial state covariance Λ, state error covariance D and state evolution
matrix F would all be diagonal.
B.1.1 Method
F is set to be an identity matrix, meaning that the evolution of the state space is
simply a random walk. Note that after one iteration of EM, the constraint of F being
a decaying mapping as described in Section 4.2.4 is applied, though remedial action is
rarely necessary. H is either set to be uniform across the dimensions of the state so
that every entry is fixed as 1/q, or randomised with each element drawn independently
from a N (0, 0.5) distribution.
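A minimal sketch of this recipe is given below. The settings for C, D and Λ anticipate the
values chosen experimentally in the following results section; setting w and π to zero, and
reading N(0, 0.5) as a standard deviation of 0.5, are assumptions made for this sketch rather
than statements from the text.

    import numpy as np

    def init_ldm_ad_hoc(train_obs, q, gamma=5.0, xi=0.01, random_H=True, seed=0):
        """Ad hoc LDM initialisation.  train_obs is an (n_frames, p) array of
        training observations and q the state dimension.  gamma=5 and xi=0.01
        were the best-performing settings on the MOCHA EMA validation set in
        the results below.  Zero w and pi are assumptions of this sketch."""
        rng = np.random.default_rng(seed)
        p = train_obs.shape[1]
        # H: either uniform 1/q across state dimensions, or entries drawn from
        # N(0, 0.5) (taken here as a standard deviation of 0.5)
        H = rng.normal(0.0, 0.5, (p, q)) if random_H else np.full((p, q), 1.0 / q)
        return dict(
            F=np.eye(q),                  # identity: state evolution starts as a random walk
            w=np.zeros(q),                # assumed zero state noise mean
            D=(gamma / 5.0) * np.eye(q),  # state noise covariance
            H=H,
            v=train_obs.mean(axis=0),     # universal data mean as observation noise mean
            C=gamma * np.eye(p),          # observation noise covariance
            pi=np.zeros(q),               # assumed zero initial state mean
            Lam=xi * np.eye(q),           # initial state covariance
        )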
B.1.2 Results
For both H randomised and given fixed equal values across state dimensions, F , w, v and
x1 were set as above. The state initial covariance was initialised as Iq , a q-dimensional
identity matrix. Table B.1 shows the classification validation accuracies for the observation
noise C set to γIp and the state noise D set to (γ/5)Iq, where γ varies from 0.0005 up
to 5000, each time by an order of magnitude. The best accuracy of 59.7% is given with
H randomised and γ = 5, meaning that C = 5Ip and D = Iq.

Table B.1: Classification validation accuracies for systems trained on MOCHA EMA data for
a variety of magnitudes of the noise covariances. C is initialised as γIp and D as (γ/5)Iq. H is
either given equal weightings for each of the dimensions of the state vector or randomised.

These values are kept and accuracies obtained for the initial state covariance, Λ, set to ξIq
with ξ ranging from 0.0001 to 1000, again by an order of magnitude each time. The results
are given in Table B.2. The best validation accuracy of 60.7% is given for Λ = 0.01Iq.

Table B.2: Classification validation accuracies for systems trained on MOCHA EMA data for a
variety of magnitudes of state initial covariance, Λ, set to ξIq.
Now, having found some initial parameter estimates in which all models are initialised
identically, phone-specific initial observation noise means are considered. These are cal-
culated by averaging the observation vectors corresponding to each phone type over the
training data. The results in Table B.3 show that a universal data mean gives higher
classification accuracies on the validation set.

Table B.3: Taking the initial parameter estimates chosen above, initialising with phone-specific
observation noise means is compared to using one universal data mean.

Phone-specific means not only give a higher
log-likelihood on the training data, 11.6 compared to 7.6, but also on the validation data,
8.7 compared to 5.9. It seems that model specific means improve the fit of the models to
each class, but not the ability to discriminate between them.
The same process was carried out for both PLP and MFCC features for speaker-independent
classification on the TIMIT corpus. Table B.4 shows that, with the state noise covariance
set to be the identity matrix I, the highest classification accuracies were for γ of 0.5
and 500 for PLP and MFCC features respectively. In both cases the highest accuracies
were given for H randomised.

Table B.4: Classification validation accuracies for systems trained on TIMIT PLP and TIMIT
MFCC data for a variety of magnitudes of the noise covariances. C is initialised as γIp, D as
0.2γIq, and H is either initialised to have equal weightings for each of the dimensions of the state
vector, or randomised.

Table B.5 shows the effect on the validation accuracy of varying the initial state covariance
estimate whilst using the values for C and D chosen above. The highest accuracy was given
for Λ of 0.01Iq and Iq for PLP and MFCC features respectively.

Table B.5: Classification validation accuracies for systems trained on TIMIT PLP and TIMIT
MFCC data for a variety of magnitudes of the state initial covariance, Λ, which was set to ξIq.

Table B.6 shows that, as in the case of the MOCHA EMA data, using a single data mean to
initialise the observation noise mean, v, gives a better accuracy than using phone-specific
initial means.

Table B.6: Taking the initial parameter estimates chosen above, phone-specific observation noise
means are compared to one universal data mean for TIMIT MFCC and PLP data.
B.2 Factor analysis model for initialisation

Another, possibly more principled, approach to parameter initialisation for LDMs uses
a factor analysis model to provide the LDM observation process parameters. A factor
analyser (see Section 2.1.2 on page 19) can be cast as an LDM observation process con-
ditioned on a static standard Gaussian state target, and so consists only of H, v and C.
EM can be used to estimate these parameters, and setting v to be the data mean, only
initial values for H and C need to be chosen.
Once a factor analyser model has been trained for each phone, H, v and C are used
directly as the LDM observation process, and are then used to provide estimates of the
remaining parameters following some of the ideas in Section 4.2.2 on page 93.
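A sketch of this approach is given below. It trains a standard factor analyser by EM with v
fixed to the data mean and C assumed diagonal; the precise form used in this work is that of
Section 2.1.2, so the details here should be read as illustrative.

    import numpy as np

    def train_factor_analyser(Y, q, n_iter=1, seed=0):
        """EM training of a factor analyser y = H x + eps with x ~ N(0, I) and
        eps ~ N(v, C), used to initialise the LDM observation process.  v is
        fixed to the data mean and C is assumed diagonal in this sketch.  A
        single EM iteration gave the best downstream accuracy below."""
        rng = np.random.default_rng(seed)
        N, p = Y.shape
        v = Y.mean(axis=0)
        Yc = Y - v                                           # centred observations
        H = rng.normal(0.0, 0.5, (p, q))
        C = np.diag(Yc.var(axis=0))
        for _ in range(n_iter):
            # E-step: posterior moments of the state given each observation
            beta = H.T @ np.linalg.inv(H @ H.T + C)          # (q, p)
            Ex = Yc @ beta.T                                 # E[x_t | y_t], (N, q)
            cov_x = np.eye(q) - beta @ H                     # posterior covariance
            Exx = N * cov_x + Ex.T @ Ex                      # sum_t E[x_t x_t^T]
            # M-step: update H and the diagonal noise covariance C
            H = (Yc.T @ Ex) @ np.linalg.inv(Exx)
            C = np.diag(np.diag(Yc.T @ Yc - H @ Ex.T @ Yc) / N)
        return H, v, C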
B.3 Method
With the observation process parameters fixed, the LDM joint log likelihood of state and
observations given in Equation 4.17 on page 93 becomes:
l(Θ|Y, X) ∝ − Σ_{t=2}^{N} [ log|D| + (x_t − F x_{t−1} − w)^T D^{−1} (x_t − F x_{t−1} − w) ]
            − [ log|Λ| + (x_1 − π)^T Λ^{−1} (x_1 − π) ]      (B.1)

Maximising this with respect to the state parameters gives update equations such as

Λ̂ = x_1 x_1^T − x_1 π^T
Since x_t, x_{t−1} x_{t−1}^T and x_t x_{t−1}^T are unknown, they are replaced using their posterior esti-
mates under the original factor analysis model which was used to generate H, v and C.
These expectations are computed as:
Note that Equation B.4 differs from Equation 4.33 which computes a similar quantity
under an LDM. The static nature of the factor analysis model means that the state cross-
covariance is zero. The state posterior distribution under a factor analysis model which
is required to compute these estimates is given in Equation 2.15 on page 19.
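The sketch below shows how the remaining state parameters might then be derived from the
factor-analyser posteriors for a single training segment. The update formulas are the standard
linear-Gaussian ones (cf. Ghahramani & Hinton 1996a) rather than the equations omitted above,
and the zero state cross-covariance is substituted for the cross term.

    import numpy as np

    def init_state_params_from_fa(Ex, cov_x):
        """Estimate the LDM state parameters (F, w, D, pi, Lambda) from
        factor-analyser posteriors over one segment.  Ex is an (N, q) array of
        posterior means E[x_t | y_t]; cov_x is the (q, q) posterior covariance,
        constant over t for a factor analyser.  The cross term E[x_t x_{t-1}^T]
        reduces to E[x_t] E[x_{t-1}]^T because the FA state is static."""
        N, q = Ex.shape
        Exx = cov_x[None, :, :] + Ex[:, :, None] * Ex[:, None, :]   # E[x_t x_t^T]
        # Accumulate sufficient statistics over t = 2..N with z_{t-1} = [x_{t-1}; 1]
        S_xz = np.zeros((q, q + 1))
        S_zz = np.zeros((q + 1, q + 1))
        for t in range(1, N):
            S_xz[:, :q] += np.outer(Ex[t], Ex[t - 1])   # zero cross-covariance used here
            S_xz[:, q] += Ex[t]
            S_zz[:q, :q] += Exx[t - 1]
            S_zz[:q, q] += Ex[t - 1]
            S_zz[q, :q] += Ex[t - 1]
            S_zz[q, q] += 1.0
        Fw = S_xz @ np.linalg.inv(S_zz)                 # joint solve for [F  w]
        F, w = Fw[:, :q], Fw[:, q]
        D = (Exx[1:].sum(axis=0) - Fw @ S_xz.T) / (N - 1)
        pi = Ex[0]
        Lam = Exx[0] - np.outer(Ex[0], pi)              # Lambda-hat, as in the update above
        return F, w, D, pi, Lam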
B.3.1 Results
EMA data
Classification validation results on MOCHA EMA data using LDMs which were initialised
using factor analyser models are shown in Table B.7. The number of EM iterations used
in training the factor analyser ranged from 1 to 5, and final classification performance on
the validation set deteriorated with each extra iteration. The highest accuracy of 57.8%
was obtained using a factor analyser trained for a single iteration, and was lower than the
60.7% given by the ad hoc initialisation above.

Table B.7: Classification accuracies on the validation set after LDMs were initialised using factor
analyser models. Results are shown for 1 through to 5 iterations of EM on the factor analyser
before LDM training began. Classification was speaker-dependent and used the MOCHA EMA
data.
Similar results were obtained with TIMIT acoustic features, shown in Table B.8. Here,
the highest accuracies of 53.1% and 59.9% for PLP and MFCC features respectively are
both lower than the accuracies of 62.0% and 61.4% gained initialising with the ad hoc
method above. The largest deterioration in performance from using this approach to
model initialisation occurs for the PLP features. This is despite the log-likelihood of the
models on the validation set increasing from 11.8 to 14.4. As with phone-specific initial
means, the fit of the models, but not their discriminatory power, has been improved.

Table B.8: Classification accuracies on the validation set after LDMs were initialised using factor
analyser models. Results are shown for 1 through to 5 iterations of EM on the factor analyser
before LDM training began. Classification was speaker-independent and used the TIMIT MFCC
and PLP data.
B.4 Conclusions
The degeneracy present in the LDM (discussed in Section 2.1.2 on page 20) coupled with
the iterative, non-optimal nature of EM training means that well chosen initial conditions
are important for successful application of LDMs. This is demonstrated in the results
above. For example, the highest and lowest accuracies using TIMIT PLP data were
62.0% and 43.7% – significant variation given seemingly sensible initial conditions. In an
extreme case, estimation could lead to an H filled with zeros, with all modelling carried out
through the observation noise. This would, however, be preferable to a poorly fitting state
process with associated high error covariances, which can be an effect of badly chosen initial
conditions.
In this appendix, two methods of finding initial estimates for LDM parameters were
compared. The first used a combination of hand-picked and experimentally chosen values.
In the second, a factor analysis model was used to find estimates of the observation process
parameters, and these in turn were used to derive the remaining parameters. For each of
MOCHA EMA, TIMIT MFCC and TIMIT PLP features, the ad hoc method gave higher
classification accuracies on the validation sets, and was adopted for experimentation in
this thesis.
Appendix C

TIMIT validation speakers

Table C.1: Validation speakers used when training models on the TIMIT corpus. The distribution
of the dialect regions and genders approximates that in the test set.
Appendix D

Tools used in experimental work
The core experimental work reported in this thesis has required writing a library of
functions dealing with the implementation of linear Gaussian models. A variety of tools
produced by others have also been used in tasks such as feature extraction, language
modelling and Viterbi search. The main elements are listed below, with italic used where
the code has been written as part of the work for this thesis. An LDM toolkit built on
the Edinburgh Speech Tools library is planned for future release.
D.1 General
• Language modelling - Edinburgh Speech Tools (Taylor et al. 1997-2003) and CMU-
Cambridge Statistical Language Modelling toolkit (Clarkson & Rosenfeld 1997)
The following are all implemented in C++, and use the base classes provided by the Edin-
burgh Speech Tools library (Taylor et al. 1997-2003).
• Factor analyser
  – Parameter estimation
  – Classification routine
  – Input/output
• Linear dynamic model
  – Parameter estimation
  – Classification routine
  – Input/output
D.3 Decoding
• LDM acoustic matching and adaptation of grid routines for segment models
• Adaptive pruning
Appendix E

Full classification results

The following tables give the classification results of Chapter 5 in full. The highest accuracy
in each table is given in bold face. In the case of MOCHA data, these correspond to the state
dimensions which were then used to produce the cross-validation classification results.
Where network-recovered data has been used or compared with other results, K-fold cross-
validation was not possible and † marks the result which gave the highest performance on
a separate validation set and is taken as best for that particular feature. Where TIMIT
data has been used, the highest accuracy is marked in bold face, and † marks the result
which corresponds to the best performance on validation data and was therefore taken to
be the final result for the features and models in question.
Table E.1: Speaker-dependent classification accuracies for systems with factor analysers used as
the acoustic model. The features were EMA data, EMA data with δ coefficients, and EMA data
with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 20.

Table E.2: Speaker-dependent classification accuracies for systems with LDMs used as the acoustic
model. The features were EMA data, EMA data with δ coefficients, and EMA data with δ and
δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 22. These results
are shown graphically in Figure 5.3 on page 123.
Table E.3: Speaker-dependent classification accuracies for systems with factor analysers used
as the acoustic model. The features were the full articulatory set from the MOCHA corpus
comprising EMA, laryngograph and EPG data. Accuracies are shown for state dimensions of 0
to 24 and with the data used raw, or post-processed using LDA.

Table E.4: Speaker-dependent classification accuracies for systems with diagonal covariance
LDMs used as the acoustic model. The features were the full articulatory set from the MOCHA
corpus comprising EMA, laryngograph and EPG data. Accuracies are shown for state dimensions
of 0 to 24 and with the data used raw, or post-processed using LDA.

Table E.5: Speaker-dependent classification accuracies for systems with diagonal state covariance
LDMs used as the acoustic model. The features were the full articulatory set from the MOCHA
corpus comprising EMA, laryngograph and EPG data. Accuracies are shown for state dimensions
of 0 to 24 and with the data used raw, or post-processed using LDA.

Table E.6: Speaker-dependent classification accuracies for systems with identity state covariance
LDMs used as the acoustic model. The features were the full articulatory set from the MOCHA
corpus comprising EMA, laryngograph and EPG data. Accuracies are shown for state dimensions
of 0 to 24 and with the data used raw, or post-processed using LDA.
Table E.7: Speaker-dependent classification accuracies for systems with LDMs used as the acoustic
model. The features were the full articulatory set from the MOCHA corpus comprising EMA,
laryngograph and EPG data. Accuracies are shown for state dimensions of 0 to 24 and with the
data used raw, or post-processed using LDA. These results are shown graphically in Figure 5.4 on
page 125.

Table E.8: Speaker-dependent classification accuracies for systems with LDMs used as the acoustic
model. The features were an articulatory set derived from the MOCHA corpus and used in
Wrench (2001). Accuracies are shown for state dimensions of 0 to 16. These features are described
in Section 6.3.1.

Table E.9: Speaker-dependent classification accuracies for systems with factor analysers used as
the acoustic model. The features were net EMA data, net EMA data with δ coefficients, and net
EMA data with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0
to 22.

Table E.10: Speaker-dependent classification accuracies for systems with LDMs used as the
acoustic model. The features were net EMA data, net EMA data with δ coefficients, and net
EMA data with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0
to 22. These results are shown graphically in Figure 5.6 on page 129.
Table E.11: Speaker-dependent classification accuracies for systems with factor analysers used
as the acoustic model. The features were PLPs, PLPs with δ coefficients, and PLPs with δ and
δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 22.

Table E.12: Speaker-dependent classification accuracies for systems with LDMs used as the
acoustic model. The features were PLPs, PLPs with δ coefficients, and PLPs with δ and δδ
coefficients. Accuracies are given for state dimensions ranging from 0 to 20. These results are
shown graphically in Figure 5.8 on page 133.

Table E.13: Speaker-dependent classification accuracies for systems with factor analysers used
as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs with
δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 22.

Table E.14: Speaker-dependent classification accuracies for systems with diagonal covariance
LDMs used as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and
MFCCs with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to
20.
Table E.15: Speaker-dependent classification accuracies for systems with diagonal state covariance
LDMs used as the acoustic model. The features were MFCCs, MFCCs with δ coefficients,
and MFCCs with δ and δδ coefficients. Accuracies are given for state dimensions ranging from
0 to 20.

Table E.16: Speaker-dependent classification accuracies for systems with identity state covariance
LDMs used as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and
MFCCs with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to
20.

Table E.17: Speaker-dependent classification accuracies for systems with LDMs used as the
acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs with δ and
δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 20. These results
are shown graphically in Figure 5.9 on page 134.

Table E.18: Speaker-dependent classification accuracies for systems with factor analysers used
as the acoustic model. The features were combinations of PLPs and real EMA data, used raw or
post-processed using LDA. Accuracies are given for state dimensions ranging from 0 to 28.
Table E.19: Speaker-dependent classification accuracies for systems with LDMs as the acoustic
model. The features were combinations of PLPs and real EMA data, used raw or post-processed
using LDA. Accuracies are given for state dimensions ranging from 0 to 28. These results are
shown pictorially in Figure 5.12 on page 138.

Table E.20: Speaker-dependent classification accuracies for systems with factor analysers used
as the acoustic model. The features were combinations of MFCCs and real EMA data, used raw
or post-processed using LDA. Accuracies are given for state dimensions ranging from 0 to 22.

Table E.21: Speaker-dependent classification accuracies for systems with LDMs used as the
acoustic model. The features were combinations of MFCCs and real EMA data, used raw or post-
processed using LDA. Accuracies are given for state dimensions ranging from 0 to 24. These
results are shown pictorially in Figure 5.13 on page 139.

Table E.22: Speaker-dependent classification accuracies for systems with factor analysers used
as the acoustic model. The features were combinations of PLPs and recovered EMA data, used
raw or post-processed using LDA. Accuracies are given for state dimensions ranging from 0 to
28.
Table E.23: Speaker-dependent classification accuracies for systems with LDMs as the acoustic
model. The features were combinations of PLPs and network-recovered EMA data, used raw
or post-processed using LDA. Accuracies are given for state dimensions ranging from 0 to 24.
These results are shown graphically in Figure 5.16 on page 144.

Table E.24: Speaker-dependent classification accuracies for systems with factor analysers used
as the acoustic model. The features were combinations of MFCCs and network-recovered EMA
data, used raw or post-processed using LDA. Accuracies are given for state dimensions ranging
from 0 to 32.

Table E.25: Speaker-dependent classification accuracies for systems with LDMs used as the
acoustic model. The features were combinations of MFCCs and recovered EMA data, used raw
or post-processed using LDA. Accuracies are given for state dimensions ranging from 0 to 30.
These results are shown graphically in Figure 5.17 on page 145.

Table E.26: Speaker-independent classification accuracies for systems with factor analysers used
as the acoustic model. The features were PLPs, PLPs with δ coefficients, and PLPs with δ and
δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 22.
Table E.27: Speaker-independent classification accuracies for systems with factor analysers used
as the acoustic model. The features were PLPs, PLPs with δ coefficients, and PLPs with δ and
δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 22.

Table E.28: Speaker-independent classification accuracies for systems with LDMs used as the
acoustic model. The features were PLPs, PLPs with δ coefficients, and PLPs with δ and δδ
coefficients. Accuracies are given for state dimensions ranging from 0 to 22.

Table E.29: Speaker-independent classification accuracies for systems with LDMs used as the
acoustic model. The features were PLPs, PLPs with δ coefficients, and PLPs with δ and δδ
coefficients. Accuracies are given for state dimensions ranging from 0 to 22. These results are
shown graphically in Figure 5.18 on page 151.

Table E.30: Speaker-independent classification accuracies for systems with factor analysers used
as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs with
δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 26.
Table E.31: Speaker-independent classification accuracies for systems with factor analysers used
as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs with
δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 26.

Table E.32: Speaker-independent classification accuracies for systems with diagonal covariance
LDMs as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs
with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 20.

Table E.33: Speaker-independent classification accuracies for systems with diagonal covariance
LDMs as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs
with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 20.

Table E.34: Speaker-independent classification accuracies for systems with state diagonal covariance
LDMs as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and
MFCCs with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to
20.

Table E.35: Speaker-independent classification accuracies for systems with state diagonal covariance
LDMs as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and
MFCCs with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to
20.
Table E.36: Speaker-independent classification accuracies for systems with state identity covariance
LDMs as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and
MFCCs with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to
24.

Table E.37: Speaker-independent classification accuracies for systems with state identity covariance
LDMs as the acoustic model. The features were MFCCs, MFCCs with δ coefficients, and
MFCCs with δ and δδ coefficients. Accuracies are given for state dimensions ranging from 0 to
24.

Table E.38: Speaker-independent classification accuracies for systems with LDMs used as the
acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs with δ and
δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 20. These results
are shown graphically in Figure 5.19 on page 152.

Table E.39: Speaker-independent classification accuracies for systems with LDMs used as the
acoustic model. The features were MFCCs, MFCCs with δ coefficients, and MFCCs with δ and
δδ coefficients. Accuracies are given for state dimensions ranging from 0 to 20.
Bibliography
Blackburn, C. (1996), Articulatory Methods for Speech Production and Recognition, PhD
thesis, University of Cambridge.
Blackburn, C. & Young, S. (1995), Towards improving speech recognition using a speech
production model, in ‘Proc. Eurospeech’, Vol. 2, pp. 1623–1626.
Bourlard, H., Hermansky, H. & Morgan, N. (1996), ‘Towards increasing speech recognition
error rates’, Speech Communication 18, 205–231.
Chomsky, N. & Halle, M. (1968), The Sound Pattern of English, Harper & Row, New
York, NY.
Clarkson, P. & Rosenfeld, R. (1997), Statistical language modelling using the CMU-
Cambridge toolkit, in ‘Proc. Eurospeech’.
Cole, R., Noel, M., Lander, T. & Durham, T. (1995), New telephone speech corpora
at CSLU, in ‘Proc. Fourth European Conference on Speech Communication and
Technology’, Vol. 1, pp. 821–824.
Dempster, A., Laird, N. & Rubin, D. (1977), ‘Maximum likelihood from incomplete data
via the EM algorithm (with discussion).’, Journal of the Royal Statistical Society
B(39), 1–38.
Deng, L. & Sun, D. (1994a), Phonetic classification and recognition using HMM represen-
tation of overlapping articulatory features for all classes of English sounds, in ‘Proc.
ICASSP’, Vol. I, pp. 45–48.
Deng, L. & Sun, D. X. (1994b), ‘A statistical framework for automatic speech recognition
using the atomic units constructed from overlapping articulatory features’, Journal
of the Acoustical Society of America 95(5), 2702–2719.
Digalakis, V. & Ostendorf, M. (1992), ‘Fast algorithms for phone classification and recog-
nition using segment-based models’, IEEE Trans. on Speech and Audio Processing
40(12), 2885–2896.
Digalakis, V., Ostendorf, M. & Rohlicek, J. (1989), Improvements in the stochastic seg-
ment model for phoneme recognition, in ‘Proc. of the DARPA Speech and Natural
Language Workshop’, Cape Cod, Massachusetts, pp. 332–338.
Digalakis, V., Rohlicek, J. & Ostendorf, M. (1993), ‘ML estimation of a stochastic linear
system with the EM algorithm and its application to speech recognition’, IEEE Trans.
Speech and Audio Processing 1(4), 431–442.
Duda, R. & Hart, P. (1973), Pattern Recognition and Scene Analysis, John Wiley, New
York, chapter 4.11.
Eide, E., Rohlicek, J., Gish, H. & Mitter, S. (1993), A linguistic feature representation of
the speech waveform, in ‘Proc. ICASSP-93’, pp. 483–486.
Erler, K. & Freeman, G. (1996), ‘An HMM-based speech recogniser using overlapping
articulatory features’, Journal of the Acoustical Society of America 100, 2500–13.
Fitt, S. & Isard, S. (1999), Synthesis of regional English using a keyword lexicon., in
‘Proc. Eurospeech’, Vol. 2, pp. 823–6.
Frankel, J. & King, S. (2001a), ASR - articulatory speech recognition, in ‘Proc. Eu-
rospeech’, Aalborg, Denmark, pp. 599–602.
Frankel, J. & King, S. (2001b), Speech recognition in the articulatory domain: investigat-
ing an alternative to acoustic HMMs, in ‘Proc. Workshop on Innovations in Speech
Processing’.
Frankel, J., Richmond, K., King, S. & Taylor, P. (2000), An automatic speech recogni-
tion system using neural networks and linear dynamic models to recover and model
articulatory traces, in ‘Proc. ICLSP’.
Freitag, D. & McCallum, A. (2000), Information extraction with HMM structures learned
by stochastic optimization, in ‘AAAI/IAAI’, pp. 584–589.
Gales, M. J. F. & Young, S. J. (1993), The theory of segmental hidden Markov models,
Technical report, Cambridge University Engineering Department.
Ghahramani, Z. & Hinton, G. (1996a), Parameter estimation for linear dynamical systems,
Technical Report CRG-TR-96-2, Dept of Computer Science, University of Toronto.
Gish, H. & Ng, K. (1993), A segmental speech model with application to word spotting.,
in ‘Proc. ICASSP’, pp. 447–450.
Godfrey, J., Holliman, E. & McDaniel, J. (1992), Telephone speech corpus for research
and development, in ‘Proc. ICASSP’, San Francisco.
Gold, B. & Morgan, N. (1999), Speech and Audio Signal Processing, Wiley Press.
Goldenthal, W. (1994), Statistical trajectory models for phonetic recognition., PhD thesis,
M.I.T.
Haykin, S., ed. (2001), Kalman Filtering and Neural Networks, Wiley Publishing.
Hermansky, H., Morgan, N., Bayya, A. & Kohn, P. (1991), RASTA-PLP speech analysis,
Technical Report TR-91-069, ICSI, Berkeley, California.
Holmes, W. (1996), Modelling variability between and within speech segments for auto-
matic speech recognition, in ‘Speech Hearing and Language: work in progress 1996’,
Vol. 9, Department of Phonetics and Linguistics, UCL.
Holmes, W. J. & Russell, M. J. (1995), Speech recognition using a linear dynamic seg-
mental HMM, in ‘Proc. Eurospeech’, pp. 1611–1614.
Iso, K. (1993), Speech recognition using dynamical model of speech production, in ‘Proc.
ICASSP’, Minneapolis, MN, pp. II:283–286.
Iyer, R., Gish, H., Siu, M., Zavaliagkos, G. & Matsoukas, S. (1998), Hidden Markov
models for trajectory modelling, in ‘Proc. ICSLP’.
Iyer, R., Kimball, O. & Gish, H. (1999), Modelling trajectories in the HMM framework,
in ‘Proc. Eurospeech’.
Junqua, J. (1993), ‘The Lombard reflex and its role on human listeners and automatic
speech recognisers’, Journal of the Acoustical Society of America 93, 510–524.
Kalman, R. (1960), ‘A new approach to linear filtering and prediction problems’, Journal
of Basic Engineering 82, 35–44.
Keating, P. (1998), ‘Word-level phonetic variation in large speech corpora’, ZAS Papers
in Linguistics 11, 35–50.
Kenny, P., Hollan, R., Gupta, V., Lennig, M., Mermelstein, P. & O’Shaughnessy, D.
(1993), ‘A*-admissible heuristics for rapid lexical access’, IEEE Transactions on
Speech and Audio Processing 1(1), 49–58.
King, S., Stephenson, T., Isard, S., Taylor, P. & Strachan, A. (1998), Speech recognition
via phonetically featured syllables, in ‘Proc. ICSLP’, pp. 1031–1034.
King, S., Taylor, P., Frankel, J. & Richmond, K. (2000), Speech recognition via
phonetically-featured syllables., in ‘PHONUS’, Vol. 5, Institute of Phonetics, Uni-
versity of the Saarland, pp. 15–34.
Kirchhoff, K., Fink, G. & Sagerer, G. (2002), ‘Combining acoustic and articulatory feature
information for robust speech recognition’, Speech Communication pp. 303–319.
Kohler, K., Lex, G., Patzold, M., Scheffers, M., Simpson, A. & Thon, W. (1994), Hand-
buch zur Datenaufnahmen und Transliteration in TP14 von VERBMOBIL - 3.0.,
Technical Report 11, IPDS Kiel.
Lamel, L., Kassel, R. & Seneff, S. (1986), Speech database development: design and
analysis of the acoustic-phonetic corpus., in ‘Proc. Speech Recognition Workshop’,
Palo Alto, CA., pp. 100–109.
Lee, J. J., Kim, J. & Kim, J. (2001), ‘Data-driven design of HMM topology for on-line
handwriting recognition’, International Journal of Pattern Recognition and Artificial
Intelligence 15(1), 107–121.
Lee, K. & Hon, H. (1989), ‘Speaker-independent phone recognition using hidden
Markov models.’, IEEE Transactions on Acoustics, Speech, and Signal Processing
37(11), 1641–1648.
Liberman, A. & Mattingly, I. (1985), ‘The motor theory of speech perception revisited’,
Cognition 21, 1–36.
Lindblom, B., Lubker, J. & Gay, T. (1979), ‘Formant frequencies of some fixed-mandible
vowels and a model of speech motor programming by predictive simulation’, Journal
of Phonetics 7, 147–161.
Ljung, L. (1999), System Identification - Theory For the User, 2nd edn, PTR Prentice
Hall, Upper Saddle River, N.J.
Macho, D., Nadeu, C., Jancovic, P., Rozinaj, G. & Hernando, J. (1999), Comparison of
time and frequency filtering and cepstral-time matrix approaches in ASR, in ‘Proc.
Eurospeech’, Budapest, Hungary, pp. 77–80.
Merhav, N. & Ephraim, Y. (1991), Hidden Markov modelling using the most likely state
sequence, in ‘IEEE Int. Conf. Acoust., Speech, Signal Processing’, IEEE, pp. 469–
472.
Mohri, M. (1997), ‘Finite-state transducers in language and speech processing’, Compu-
tational Linguistics 23(2), 269–311.
Morgan, N. & Bourlard, H. (1995), ‘Neural networks for statistical recognition of contin-
uous speech’, Proceedings of the IEEE 83(5), 741–770.
Murphy, K. P. (1998), Switching Kalman filters, Technical report, University of California
Berkeley.
Neal, R. & Hinton, G. (1998), Learning in Graphical Models, Kluwer Academic Publishers,
chapter A view of the EM algorithm that justifies incremental, sparse, and other
variants, pp. 355–368.
Nilsson, N. J. (1971), Problem-Solving Methods in Artificial Intelligence, McGraw-Hill
(New York NY).
Ostendorf, M. (1999), Moving beyond the ‘beads-on-a-string’ model of speech, in ‘Proc.
IEEE ASRU Workshop’.
Ostendorf, M. & Digalakis, V. (1991), The stochastic segment model for continuous speech
recognition, in ‘Proc. of the 25th Asilomar Conference on Signals, Systems and Com-
puters’, pp. 964–968.
Ostendorf, M., Digalakis, V. & Kimball, O. (1996), ‘From HMMs to segment models: A
unified view of stochastic modelling for speech recognition’, IEEE Trans. on Speech
and Audio Processing.
Ostendorf, M. & Singer, H. (1997), ‘HMM topology design using maximum likelihood
successive state splitting’, Computer Speech and Language 11(1), 17–41.
Papcun, G., Hochberg, J., Thomas, T. R., Laroche, F., Zachs, J. & Levy, S. (1992),
‘Inferring articulation and recognising gestures from acoustics with a neural network
trained on x-ray microbeam data’, Journal of the Acoustical Society of America
92(2), 688–700.
Paul, D. (1992), An efficient A* stack decoder algorithm for continuous speech recognition
with a stochastic language model., in ‘Proc. ICASSP’, Vol. 1, San Francisco, pp. 25–
28.
Picone, J., Pike, S., Regan, R., Kamm, T., Bridle, J., Deng, L., Ma, Z., Richards, H. &
Schuster, M. (1999), Initial evaluation of hidden dynamic models on conversational
speech, in ‘Proc. ICASSP’, Phoenix, Arizona.
Price, P., Fisher, W. M., Bernstein, J. & Pallett, D. S. (1988), The DARPA 1000-word re-
source management database for continuous speech recognition., in ‘Proc. ICASSP’,
Institute of Electrical and Electronic Engineers., pp. 651–654.
Rauch, H. E. (1963), ‘Solutions to the linear smoothing problem.’, IEEE Transactions on
Automatic Control 8, 371–372.
Renals, S. & Hochberg, M. (1995), Decoder technology for connectionist large vocabu-
lary speech recognition, Technical Report CS-95-17, Dept. of Computer Science,
University of Sheffield.
Renals, S. & Hochberg, M. (1999), ‘Start-synchronous search for large vocabulary con-
tinuous speech recognition.’, IEEE Transactions on Speech and Audio Processing
7, 542–553.
Richards, H. & Bridle, J. S. (1999), The HDM: A segmental hidden dynamic model of
coarticulation, in ‘Proc. ICASSP’, Phoenix, Arizona, USA.
Richardson, M., Bilmes, J. & Diorio, C. (2000a), Hidden-articulator Markov models for
speech recognition, in ‘Proc. ASR2000’.
Richardson, M., Bilmes, J. & Diorio, C. (2000b), Hidden-articulator Markov models:
Performance improvements and robustness to noise, in ‘Proc. ICSLP’, Beijing, China.
Richmond, K. (2001), Estimating Articulatory Parameters from the Acoustic Speech Sig-
nal, PhD thesis, Centre for Speech Technology Research, Edinburgh University.
Richmond, K., King, S. & Taylor, P. (2003), ‘Modelling the uncertainty in recovering
articulation from acoustics’, Computer Speech and Language 17(2), 153–172.
Robinson, A. (1994), ‘An application of recurrent nets to phone probability estimation’,
IEEE Transactions on Neural Networks 5(2), 298–305.
Robinson, A., Cook, G., Ellis, D., Fosler-Lussier, E., Renals, S. & Williams, D. (2002),
‘Connectionist speech recognition of broadcast news’, Speech Communication 37, 27–
45.
Rosti, A.-V. & Gales, M. (2003), Switching linear dynamical systems for speech recogni-
tion, in ‘UK Speech Meeting, London, April 2003’, University College, London.
Rosti, A.-V. I. & Gales, M. J. F. (2001), Generalised linear Gaussian models, Technical
Report CUED/F-INFENG/TR.420, Cambridge University Engineering.
Rosti, A.-V. I. & Gales, M. J. F. (2002), Factor analysed HMMs, in ‘Proc. ICASSP’.
Roweis, S. (1999), Data Driven Production Models for Speech Processing, PhD thesis,
California Institute of Technology, Pasadena, California.
Russell, M. (1993), A segmental HMM for speech pattern modelling, in ‘Proc. ICASSP’,
pp. 499–502.
Shumway, R. & Stoffer, D. (1982), ‘An approach to time series smoothing and forecasting
using the EM algorithm.’, Journal of Time Series Analysis 3(4), 253–64.
Siu, M., Iyer, R., Gish, H. & Quillen, C. (1998), Parametric trajectory mixtures for
LVCSR, in ‘Proc. International Conference on Spoken Language Processing’.
Smyth, P. (1998), ‘Belief networks, hidden Markov models, and Markov random fields: a
unifying view’, Pattern Recognition Letters.
Soong, F. & Huang, E. (1990), A tree-trellis based fast search for finding the N best
sentence hypotheses in continuous speech recognition, in ‘Proc. Workshop on speech
and natural language’, Morgan Kaufmann Publishers Inc., pp. 12–19.
Stephenson, T., Bourlard, H., Bengio, S. & Morris, A. (2000), Automatic speech recogni-
tion using dynamic Bayesian networks with both acoustic and articulatory variables,
in ‘Proc. ICSLP’, Vol. II, pp. 951–954.
Stolcke, A. & Omohundro, S. (1992), Hidden Markov model induction by Bayesian model
merging, in S. J. Hanson, J. D. Cowan & C. L. Giles, eds, ‘Advances in Neural
Information Processing Systems’, Vol. 5, Morgan Kaufmann.
Stolcke, A. & Omohundro, S. (1994), Best-first model merging for hidden Markov model
induction, Technical Report TR-94-003, ICSI, Berkeley.
Sun, J., Jing, X. & Deng, L. (2000), Data-driven model construction for continuous speech
recognition using overlapping articulatory features, in ‘Proc. ICSLP’.
Taylor, P., Caley, R., Black, A. & King, S. (1997-2003), ‘Edinburgh Speech Tools’,
http://www.cstr.ed.ac.uk/projects/speech_tools.
Verhasselt, J., Cremelie, N. & Marten, J. (1998), A hybrid segment-based system for
phone and word recognition, in ‘COST’.
Viterbi, A. J. (1967), ‘Error bounds for convolutional codes and an asymptotically optimal
decoding algorithm’, IEEE Transactions on Information Theory 13, 260–269.
Westbury, J. (1994), X-Ray Microbeam Speech Production Database User’s Handbook, Uni-
versity of Wisconsin, Madison, WI.
Woodland, P. (1992), Hidden Markov models using vector linear prediction and discrimi-
native output distributions., in ‘Proc. ICASSP’, pp. 509–512.
Young, S. (1993), The HTK hidden Markov model toolkit: Design and philosophy, Tech-
nical Report TR.153, Department of Engineering, Cambridge University.
Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D.,
Valtchev, V. & Woodland, P. (2002), The HTK Book (for HTK Version 3.2), Cambridge
University Engineering Department.
Young, S., Russell, N. & Thornton, J. (1989), Token passing: A simple concep-
tual model for connected speech recognition systems, Technical Report CUED/F-
INFENG/TR38, Cambridge University Engineering Dept.
Yun, Y.-S. & Oh, Y.-H. (2002), ‘A segmental-feature HMM for continuous speech recogni-
tion based on a parametric trajectory model’, Speech Communication 38(1-2), 115–
130.
Zacks, J. & Thomas, T. (1994), ‘A new neural network for articulatory speech recognition
and its application to vowel identification’, Computer Speech and Language 8, 189–
20.
Zavaliagkos, G., Zhao, Y., Schwartz, R. & Makhoul, J. (1994), ‘A hybrid segmental neural
net/hidden Markov model system for continuous speech recognition’, IEEE Trans.
on Speech and Audio Processing 2(1), II:151–160.
Zweig, G. & Russell, S. (1998), Probabilistic modelling with Bayesian networks for auto-
matic speech recognition., in ‘Proc. ICSLP’.