Discrete-Time Processing of Speech Signals
Adaptive noise canceling (ANC)
Adaptive predictive coding (APC)
Adaptive pulse code modulation (APCM)
Adaptive transform coding (ATC)
Additive white Gaussian noise (AWGN)
Advanced Research Projects Agency [of the United States] (ARPA)
Agency resources management database (DRMD)
American Telephone and Telegraph (AT&T)
Articulation index (AI)
Artificial neural network (ANN)
Augmented transition network (ATN)
Autoregressive [model] (AR)
Autoregressive-moving average [model] (ARMA)
Average magnitude difference function (AMDF)
Back-propagation [algorithm] (BP)
Bahl-Jelinek-Mercer [algorithm] (BJM)
Bellman optimality principle (BOP)
Bits per normalized second (or bits per sample) (bpn)
Bits per second (bps)
Bolt, Beranek, and Newman, Incorporated (BBN)
Bounded input-bounded output [stability] (BIBO)
Carnegie Mellon University (CMU)
Centro Studi e Laboratori Telecomunicazioni (CSELT)
Closed phase (CP)
Cocke-Younger-Kasami [algorithm] (CYK)
Code-excited linear prediction (CELP)
Complex cepstrum (CC)
Connectionist Viterbi training (CVT)
Continuous speech recognition (CSR)
Delta modulation (DM)
Diagnostic acceptability measure (DAM)
Diagnostic rhyme test (DRT)
Differential pulse code modulation (DPCM)
Digital signal processing (DSP)
Discrete cosine transform (DCT)
Discrete Fourier series (DFS)
Discrete Fourier transform (DFT)
Discrete time (DT)
Discrete-time Fourier transform (DTFT)
Discrimination information (DI)
Dynamic programming (DP)
Dynamic time warping (DTW)
Estimate-maximize [algorithm] (EM)
Fast Fourier transform (FFT)
Feature map classifier (FMC)
Finite impulse response (FIR)
Finite state automaton (FSA)
Floating point operation (flop)
Forward-backward [algorithm] (F-B)
Frequency weighted segmental signal-to-noise ratio (SNRfw-seg)
Grammar-driven connected word recognizer (GDCWR)
Harmonic product spectrum (HPS)
Hear-what-I-mean [speech recognition system] (HWIM)
Hertz (Hz)
Hidden Markov model (HMM)
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Impulse response (IR)
Isometric absolute judgment [test] (IAJ)
Learning vector quantizer (LVQ)
Least mean square [algorithm] (LMS)
Left-right [parsing] (LR)
Level building [algorithm] (LB)
Levinson-Durbin [recursion] (L-D)
Linde-Buzo-Gray [algorithm] (LBG)
Line spectrum pair [parameters] (LSP)
Linear constant coefficient differential equation (LCCDE)
Linear prediction [or linear predictive] (LP)
Linear predictive coding (LPC)
Linear, time invariant (LTI)
Linguistic decoder (LD)
Log area ratio [parameters] (LAR)
Massachusetts Institute of Technology (MIT)
Maximum a posteriori (MAP)
Maximum average mutual information (MMI)
Maximum entropy (ME)
Maximum entropy method (MEM)
Maximum likelihood (ML)
Mean opinion score (MOS)
Mean square error (MSE)
Minimum discrimination information (MDI)
Modified rhyme test (MRT)
Multilayer perceptron (MLP)
National Bureau of Standards [of the United States] (NBS)
National Institute of Standards and Technology [of the United States] (NIST)
Normalized Hertz (norm-Hz)
Normalized radians per second (norm-rps)
Normalized seconds (norm-sec)
Region of convergence (ROC)
Residual-excited linear prediction (RELP)
Segmental signal-to-noise ratio (SNRseg)
Self-organizing feature finder (SOFF)
Short-term complex cepstrum (stCC)
Short-term discrete Fourier transform (stDFT)
Short-term discrete-time Fourier transform (stDTFT)
Short-term inverse discrete Fourier transform (stIDFT)
Short-term memory [model] (STM)
Short-term power density spectrum (stPDS)
Short-term real cepstrum (stRC)
Signal-to-noise ratio (SNR)
Simple inverse filter tracking [algorithm] (SIFT)
Stanford Research Institute (SRI)
Strict sense (or strong sense) stationary (SSS)
Subband coding (SBC)
Systems Development Corporation (SDC)
Texas Instruments Corporation (TI)
Texas Instruments/National Bureau of Standards [database] (TI/NBS)
Time-delay neural network (TDNN)
Time domain harmonic scaling (TDHS)
Traveling salesman problem (TSP)
Vector quantization (VQ)
Vector sum excited linear prediction (VSELP)
Weighted recursive least squares [algorithm] (WRLS)
Weighted-spectral slope measure (WSSM)
Wide sense (or weak sense) stationary (WSS)
Discrete-Time Processing of Speech Signals

John R. Deller, Jr., Michigan State University
John G. Proakis, Northeastern University
John H. L. Hansen, Duke University
Editor: John Griffin
Production Supervisor: Elaine W. Wetterau
Production Manager: Roger Vergnes
Text Designer: Natasha Sylvester
Cover Designer: Cathleen Norz
Illustrations: Academy Art Works, Inc.

This book was set in Times Roman by Graphic Sciences Corporation, and printed and bound by Book Press.

Copyright © 1993 by Macmillan Publishing Company, a division of Macmillan, Inc.

Printed in the United States of America

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher.

Macmillan Publishing Company
866 Third Avenue, New York, New York 10022

Macmillan Publishing Company is part of the Maxwell Communication Group of Companies.

Maxwell Macmillan Canada, Inc.
1200 Eglinton Avenue East
Suite 200
Don Mills, Ontario M3C 3N1

Includes index.

Dedications

To Joan ... For her boundless love, patience, and support throughout this long project. I would sometimes find her alone in a quiet place reading Charles Dickens or Jane Austen so as not to disturb me while I wrote about speech processing. The juxtaposition was striking. I would imagine some patient spouse 150 years from now reading Deller, Proakis, and Hansen while his or her partner wrote a future great book. The ridiculous fantasy made me smile. When she asked why, I would simply tell her that I was so happy to have such an understanding wife. And so I am. J.D.

To Felia, George, and Elena. J.G.P.

To my wife Holly, and in memory of George, a devoted ...
Preface
Purposes and Scope. The purposes of this book are severalfold. Principally, of course, it is intended to provide the reader with solid fundamental tools and sufficient exposure to the applied technologies to support advanced research and development in the array of speech processing endeavors. As an academic instrument, however, it may also provide the serious student of signal processing an opportunity to strengthen and deepen his or her understanding of the field through the study of one of the most important and interesting contemporary applications of signal processing concepts. Finally, by collecting a large number of contemporary topics with an extensive reference list into a single volume, the book will serve as a convenient resource for those already working in the field.

The book is written by three professors of electrical engineering. This has two implications. First, we view the book as a pedagogical tool. This means we have attempted to keep the student in mind with each sentence and with each chapter. Notation, approach, and general level have been made as uniform as possible across developments and across chapters. Second, the text is written with a clear bias toward the topics and approaches of modern electrical engineering curricula-especially signal processing, systems, and communications. Speech processing is inherently multidisciplinary, and we occasionally indicate to the reader where topics are necessarily treated superficially or not at all, and where the reader can find more information. This occurs principally in areas that would probably be labeled "Speech Science" or "Computer Science."
... available in Chapter 1. Sections 1.1 and 1.2 should be comfortably considered review material for anyone who is to succeed with the text. Sections 1.3 and 1.4 need not be review, but the typical EE senior will have at least some exposure to these topics, even if specific courses in pattern recognition and information theory have not been taken. Nevertheless, Sections 1.3 and 1.4 do provide sufficient background in their respective topics for the remaining chapters, whereas Sections 1.1 and 1.2 are not intended as substitutes for relevant coursework. Section 1.5 is simply a review of some concepts that should be quite familiar to any engineering student, and it is included principally as a means for establishing notation.

Course Planning. The general topical content of the speech processing field as reflected in the book is described in Section 1.6. The instructor might wish to review that section in planning a course around the book (and to have the students read this section as an introduction to the course). Clearly, it will be impossible to cover the entire book in a single quarter or semester. We have found that in a typical semester course (15 weeks), the following can be covered: Chapter 1 (Background material-Brief review of Sections 1.1 and 1.2); Chapter 2 (Speech science topics-Rely heavily on student reading of qualitative material and focus on issues necessary to engineering modeling); Chapter 3 (Modeling-The main goal is the digital model. Highlight the mathematics of the acoustic tube theory and stress the physical significance without extensive in-class formal development of the results); Chapter 4 (Short-term processing-This is often the students' first real exposure to short-term processing. Cover the basics carefully and have the student use the computer.); Chapter 5 (Linear prediction-Cover this central topic thoroughly except for some of the details of the solution methods in Section 5.3.3 that the instructor may choose to omit.); Chapter 6 (Cepstral analysis-The instructor may choose to omit Section 6.3 as explained in the reading notes at the beginning); Chapters 10, 11, and 12 (Recognition basics-Many details will need to be omitted, for example, in Sections 12.2.4-12.2.8); Chapters 13 and 14 (Language modeling and neural network approaches-These topics can be covered only superficially as time permits). Alternatively, the instructor may choose to include Chapters 7, 8, and 9 of Part IV on coding and enhancement, rather than recognition, at the end of the course. These three chapters could be covered in some detail. If time and resources permit, an ideal approach is to thoroughly cover material in Parts I, II, and III in an introductory course, and Parts IV and V in an advanced applications course.

Obtaining Speech Data for End-of-Chapter Problems. It will be noted that many of the problems in this book require real speech files. Although most universities have the facilities to create such files, we offer the following options to instructors who wish to avoid the details of collecting speech files on-site. Several standard databases have been compiled by the U.S. National Institute of Standards and Technology and are available on compact disk from the U.S. Department of Commerce, National Technical Information Service (NTIS). These databases are described in Section 13.8 of the book. For ordering information, call the NTIS at (703) 487-4650, or fax (703) 321-8547. Alternatively, the authors of this book have made available some speech samples that can be downloaded over the electronic mail network using the instructions below.

Speech files can be downloaded to a personal computer (PC) through the internet computer network using the PC/TCP (usually called simply "ftp" for "file transfer protocol") software.¹ The database can be obtained through a Michigan State University file server to which the public may connect without a password. In order that we may provide the most current information about the data and data handling procedures, the downloading is carried out in two steps. In the first step, the user will download a description of the currently available files and the means for obtaining them. These will be in an ASCII file called README.DPH (where the "DPH" extension keys the documentation to the Deller, Proakis, and Hansen textbook). In the second step, the desired files can be downloaded following directions in README.DPH. To obtain the instruction file, execute the following steps:

1. Type ftp jojo.ee.msu.edu. No password is needed. (This assumes that the user is connected to the internet computer network and that the ftp software is loaded on the PC.)
2. Type cd \DPHTEXT to change to a directory named DPHTEXT.
3. Type get README.DPH to get the file containing instructions and description of available data.
4. Type quit to exit the file transfer program.
5. Read the material in file README.DPH.

¹A product of FTP Software, Inc., of Cambridge, Mass.

Acknowledgments. We appreciate the support of many people who have contributed to the writing and production of this textbook. We are indebted to a number of colleagues and graduate students who provided critical reviews and suggestions that enhanced and improved the presentation. Among those are Professor Mark Clements, Georgia Tech; Professor Jerry Gibson, University of Texas at Austin; Professor Paul Milenkovic, University of Wisconsin; Professor Larry Paarmann, Wichita State University; Dr. Joseph Picone, Texas Instruments Corporation; Dr. Dale Veeneman, GTE Laboratories, Waltham; and Professor Greg Wakefield, The University of Michigan. We also thank the members of the EE 801 class of Spring Quarter 1991 at Michigan State University, who struggled through the early versions of the problem sets and found (the hard way) the stumbling blocks. We also appreciate the diligent efforts of Mr. Sudhir Kandula in assisting with graphics and problem solutions. We also thank the students of the Duke Speech Processing Laboratory (T. W. Pai, S. Nandkumar, S. Bou-Ghazale, and L. Arslan) for their graphics and proofreading assistance. We also greatly appreciate the encouragement and support of Editor John Griffin of the Macmillan Publishing Company whose patience and good sense of humor were extraordinary. We also wish to thank Senior Production Supervisor Elaine Wetterau of Macmillan for many painstaking hours with the manuscript that contributed greatly to the quality of the finished work.

J.D. would also like to acknowledge research support during the writing of this book from the Whitaker Foundation, the National Science Foundation, and the Office of Naval Research. The research supported by these sponsors enriched the author's understanding of many important topics.

J.R.D.
J.G.P.
J.H.L.H.

Contents

I Signal Processing Background
1 Propaedeutic 3
1.0 Preamble 3
1.0.1 The Purpose of Chapter 1 3
1.0.2 Please Read This Note on Notation 4
1.0.3 For People Who Never Read Chapter 1 (and Those Who Do) 5
1.1 Review of DSP Concepts and Notation 6
1.1.1 "Normalized Time and Frequency" 6
1.1.2 Singularity Signals 9
1.1.3 Energy and Power Signals 9
1.1.4 Transforms and a Few Related Concepts 10
1.1.5 Windows and Frames 16
1.1.6 Discrete-Time Systems 20
1.1.7 Minimum, Maximum, and Mixed-Phase Signals and Systems 24
1.2 Review of Probability and Stochastic Processes 29
1.2.1 Probability Spaces 30
1.2.2 Random Variables 33
1.2.3 Random Processes 42
1.2.4 Vector-Valued Random Processes 52
1.3 Topics in Statistical Pattern Recognition 55
1.3.1 Distance Measures 56
1.3.2 The Euclidean Metric and "Prewhitening" of Features 58
1.3.3 Maximum Likelihood Classification 63
1.3.4 Feature Selection and Probabilistic Separability Measures 66
1.3.5 Clustering Algorithms 70
1.4 Information and Entropy 73
1.4.1 Definitions 73
1.4.2 Random Sources 77
1.4.3 Entropy Concepts in Pattern Recognition 78
1.5 Phasors and Steady-State Solutions 79
1.6 Onward to Speech Processing 81
1.7 Problems 85
Appendices: Supplemental Bibliography 90
1.A Example Textbooks on Digital Signal Processing 90
1.B Example Textbooks on Stochastic Processes 90
1.C Example Textbooks on Statistical Pattern Recognition 91
1.D Example Textbooks on Information Theory 91
1.E Other Resources on Speech Processing 92
1.E.1 Textbooks 92
1.E.2 Edited Paper Collections 92
1.E.3 Journals 92
1.E.4 Conference Proceedings 93
1.F Example Textbooks on Speech and Hearing Sciences 93
1.G Other Resources on Artificial Neural Networks 94
1.G.1 Textbooks and Monographs 94
1.G.2 Journals 94
1.G.3 Conference Proceedings 95

II Speech Production and Modeling

2 Fundamentals of Speech Science 99
2.0 Preamble 99
2.1 Speech Communication 100
2.2 Anatomy and Physiology of the Speech Production System 101
2.2.1 Anatomy 101
2.2.2 The Role of the Vocal Tract and Some Elementary Acoustical Analysis 104
2.2.3 Excitation of the Speech System and the Physiology of Voicing 110
2.3 Phonemics and Phonetics 115
2.3.1 Phonemes Versus Phones 115
2.3.2 Phonemic and Phonetic Transcription 116
2.3.3 Phonemic and Phonetic Classification 117
2.3.4 Prosodic Features and Coarticulation 137
2.4 Conclusions 146
2.5 Problems 146

3 Modeling Speech Production 151
3.0 Preamble 151
3.1 Acoustic Theory of Speech Production 151
3.1.1 History 151
3.1.2 Sound Propagation 156
3.1.3 Source Excitation Model 159
3.1.4 Vocal-Tract Modeling 166
3.1.5 Models for Nasals and Fricatives 186
3.2 Discrete-Time Modeling 187
3.2.1 General Discrete-Time Speech Model 187
3.2.2 A Discrete-Time Filter Model for Speech Production 192
3.2.3 Other Speech Models 197
3.3 Conclusions 200
3.4 Problems 201
3.A Single Lossless Tube Analysis 203
3.A.1 Open and Closed Terminations 203
3.A.2 Impedance Analysis, T-Network, and Two-Port Network 206
3.B Two-Tube Lossless Model of the Vocal Tract 211
3.C Fast Discrete-Time Transfer Function Calculation 217

III Analysis Techniques

4 Short-Term Processing of Speech 225
4.1 Introduction 225
... Concepts 226
4.2.5 On the Role of "1/N" and Related Issues 234
... Applications 236
... Identification 267
... Coefficients 331
... Deconvolution 336

6 Cepstral Analysis
6.2 "Real" Cepstrum 355
... Analysis 394
... Conclusions 397

7.3.2 Time Domain Waveform Coding 435
7.4.3 The Cepstral (Homomorphic) Vocoder 462
... Techniques 488
... Methods 504
... Methods 516

8.4.1 Introduction 517
8.4.3 Speech Enhancement and All-Pole Enhancement 527
... Filtering 528
... Tracking 541
... Performance 554

9.1 Introduction 568
9.3.2 Signal-to-Noise Ratio 584
9.3.4 Other Measures Based on LP Analysis 588
9.4 Objective Versus Subjective Measures 593
9.5 Problems 595

10.1 Introduction 601
10.1.1 The Dream and the Reality 601
10.2.3 Isolated-Word Versus Continuous-Speech Recognition 608
10.2.4 Linguistic Constraints 614
10.2.5 Acoustic Ambiguity and Confusability 619
10.2.6 Environmental Noise 620
10.3 Related Problems and Approaches 620
10.3.1 Knowledge Engineering 620
10.3.2 Speaker Recognition and Verification 621
10.4 Conclusions 621
10.5 Problems 621

11 Dynamic Time Warping 623
11.1 Introduction 623
11.2 Dynamic Programming 624
11.3 Dynamic Time Warping Applied to IWR 634
11.3.1 DTW Problem and Its Solution Using DP 634
11.3.2 DTW Search Constraints 638
11.3.3 Typical DTW Algorithm: Memory and Computational Requirements 649
11.4 DTW Applied to CSR 651
11.4.1 Introduction 651
11.4.2 Level Building 652
11.4.3 The One-Stage Algorithm 660
11.4.4 A Grammar-Driven Connected-Word Recognition System 669
11.4.5 Pruning and Beam Search 670
11.4.6 Summary of Resource Requirements for DTW Algorithms 671
11.5 Training Issues in DTW Algorithms 672
11.6 Conclusions 674
11.7 Problems 674

12 The Hidden Markov Model 677
12.1 Introduction 677
12.2 Theoretical Developments 679
12.2.1 Generalities 679
12.2.2 The Discrete Observation HMM 684
12.2.3 The Continuous Observation HMM 705
12.2.4 Inclusion of State Duration Probabilities in the Discrete Observation HMM 709
12.2.5 Scaling the Forward-Backward Algorithm 715
12.2.6 Training with Multiple Observation Sequences 718
12.2.7 Alternative Optimization Criteria in the Training of HMMs 720
12.2.8 A Distance Measure for HMMs 722
12.3 Practical Issues 723
12.3.1 Acoustic Observations 723
12.3.2 Model Structure and Size 724
12.3.3 Training with Insufficient Data 728
12.3.4 Acoustic Units Modeled by HMMs 730
12.4 First View of Recognition Systems Based on HMMs 734
12.4.1 Introduction 734
12.4.2 IWR Without Syntax 735
12.4.3 CSR by the Connected-Word Strategy Without Syntax 738
12.4.4 Preliminary Comments on Language Modeling Using HMMs 740
12.5 Problems 740

13 Language Modeling 745
13.1 Introduction 745
13.2 Formal Tools for Linguistic Processing 746
13.2.1 Formal Languages 746
13.2.2 Perplexity of a Language 749
13.2.3 Bottom-Up Versus Top-Down Parsing 751
13.3 HMMs, Finite State Automata, and Regular Grammars 754
13.4 A "Bottom-Up" Parsing Example 759
13.5 Principles of "Top-Down" Recognizers 764
13.5.1 Focus on the Linguistic Decoder 764
13.5.2 Focus on the Acoustic Decoder 770
13.5.3 Adding Levels to the Linguistic Decoder 772
13.5.4 Training the Continuous-Speech Recognizer 775
13.6 Other Language Models 779
13.6.1 N-Gram Statistical Models 779
13.6.2 Other Formal Grammars 785
13.7 IWR As "CSR" 789
13.8 Standard Databases for Speech-Recognition Research 790
13.9 A Survey of Language-Model-Based Systems 791
13.10 Conclusions 801
13.11 Problems 801

14 The Artificial Neural Network 805
14.1 Introduction 805
14.2 The Artificial Neuron 808
14.3 Network Principles and Paradigms 813
14.3.1 Introduction 813
14.3.2 Layered Networks: Formalities and Definitions 815
14.3.3 The Multilayer Perceptron 819
14.3.4 Learning Vector Quantizer 834
14.4 Applications of ANNs in Speech Recognition 837
14.4.1 Presegmented Speech Material 837
14.4.2 Recognizing Dynamic Speech 839
14.4.3 ANNs and Conventional Approaches 841
14.4.4 Language Modeling Using ANNs 845
14.4.5 Integration of ANNs into the Survey Systems of Section 13.9 845
14.5 Conclusions 846
14.6 Problems 847

Index 899

PART I
SIGNAL PROCESSING BACKGROUND
CHAPTER 1

Propaedeutic

Read.Me: If you are someone who never reads Chapter 1, please at least read Sections 1.0.2 and 1.0.3 before proceeding!
1.0 Preamble

1.0.1 The Purpose of Chapter 1

If the reader learns nothing more from this book, it is a safe bet that he or she will learn a new word. A propaedeutic is a "preliminary body of knowledge and rules necessary for the study of some art or science" (Barnhart, 1964). This chapter is just that-a propaedeutic for the study of speech processing focusing primarily on two broad areas, digital signal processing (DSP) and stochastic processes, and also on some necessary topics from the fields of statistical pattern recognition and information theory.

The reader of this book is assumed to have a sound background in the first two of these areas, typical of an entry level graduate course in each field. It is not our purpose to comprehensively teach DSP and random processes, and the brief presentation here is not intended to provide an adequate background. There are many fine textbooks to which the reader might refer to review and reinforce prerequisite topics for these subjects. We list a considerable number of widely used books in Appendices 1.A and 1.B.

What, then, is the point of our propaedeutic? The remainder of this chapter is divided into four main sections plus one small section, and the tutorial goals are somewhat different in each. Let us first consider the two main sections on DSP and stochastic processes. In the authors' experience, the speech processing student is somewhat more comfortable with "deterministic" DSP topics than with random processes. What we will do in Section 1.1, which focuses on DSP, therefore, is highlight some of the key concepts which will play central roles in our speech processing work. Where the material seems unfamiliar, the reader is urged to seek help in
one or more of the DSP textbooks cited in Appendix 1.A. Our main objective is to briefly outline the essential DSP topics with a particular interest in defining notation that will be used consistently throughout the book. A second objective is to cover a few subtler concepts that will be important in this book, and that might have been missed in the reader's first exposure to DSP.

The goals of Section 1.2 on random processes are somewhat different. We will introduce some fundamental concepts with a bit more formality, uniformity, and detail than the DSP material. This treatment might at first seem unnecessarily detailed for a textbook on speech processing. We do so, however, for several reasons. First, a clear understanding of stochastic process concepts, which are so essential in speech processing, depends strongly on an understanding of the basic probability formalisms. Second, many engineering courses rely heavily on stochastic processes and not so much on the underlying probability concepts, so that the probability concepts become "rusty." Emerging technologies in speech processing depend on the basic probability theory and some review of these ideas could prove useful. Third, it is true that the mastery of any subject requires several "passes" through the material, but engineers often find this especially true of the field of probability and random processes.

The third and fourth major divisions of this chapter, Sections 1.3 and 1.4, treat a few topics which are used in the vast fields of statistical pattern recognition and information theory. In fact, we have included some topics in Section 1.3 which are perhaps more general than "pattern recognition" methods, but the rubric will suffice. These sections are concerned with basic mathematical tools which will be used frequently, and in diverse ways in our study, beginning in Part IV of the book. There is no assumption that the reader has formal coursework in these topics beyond the normal acquaintance with them that would ordinarily be derived from an engineering education. Therefore, the goal of these sections is to give an adequate description of a few important topics which will be critical to our speech work.

Finally, Section 1.5 briefly reviews the essence and notation of phasors and steady-state analysis of systems described by differential equations. A firm grasp of this material will be necessary in our early work on analog acoustic modeling of the speech production system in Chapter 3.

As indicated above, the need for the subjects in Sections 1.3-1.5 is not immediate, so the reader might wish to scan over these sections, then return to them as needed. More guidance on reading strategy follows.

1.0.2 Please Read This Note on Notation

The principal tool of engineering is applied mathematics. The language of mathematics is abstract symbolism. This book is written with a conviction that careful and consistent notation is a sign of clear understanding, and clear understanding is derived by forcing oneself to comprehend and use such notation. Painstaking care has been taken in this book to use information-laden and consistent notation in keeping with this philosophy. When we err with notation, we err on the side of excessive notation which is not always conventional, and not always necessary once the topic has been mastered. Therefore, the reader is invited (with your instructor's permission if you are taking a course!) to shorten or simplify the notation as the need for the "tutorial" notation subsides.

Let us give some examples. We will later use an argument m to keep track of the point in time at which certain features are extracted from a speech signal. This argument is key to understanding the "short-term" nature of the processing of speech. The ith "linear prediction" coefficient computed on a "frame" of speech ending at time m will be denoted a(i; m). In the development of an algorithm for computing the coefficients, for example, the index m will not be very germane to the development and the reader might wish to omit it once its significance is clear. Another example comes from the random process theory. Numerous examples of sloppy notation abound in probability theory, a likely reason why many engineers find this subject intractable. For example, something like "f(x)" is frequently used to denote the probability density function (pdf) for the random variable x. There are numerous ways in which this notation can cause misunderstandings and even subtle mathematical traps which can lead to incorrect results. We will be careful in this text to delineate random processes, random variables, and values that may be assumed by a random variable. We will denote a random variable, for example, by underscoring the variable name, for example, x. The pdf for x will be denoted f_x(x), for example. The reader who has a clear understanding of the underlying concepts might choose to resort to some sloppier form of notation, but the reader who does not will benefit greatly by working to understand the details of the notation.

1.0.3 For People Who Never Read Chapter 1 (and Those Who Do)

To be entitled to use the word "propaedeutic" at your next social engagement, you must read at least some of this chapter.¹ If for no other reason than to become familiar with the notation, we urge you to at least generally review the topics here before proceeding. However, there is a large amount of material in this chapter, and some people will naturally prefer to review these topics on an "as needed" basis. For that reason, we provide the following guide to the use of Chapter 1.

With a few exceptions, most of the topics in Sections 1.1 and 1.2 will be widely used throughout the book and we recommend their review before proceeding. The one exception is the subsection on "State Space Realizations" in Section 1.1.6, which will be used in a limited way in Chapters 5 and 12.

¹If you have skipped the first part of this chapter, you will be using the word without even knowing what it means.
1 .1 I Review of DSP C oncepts a nd Nota ti o n 7
6 Ch. 1 / Propasdsutic
alizati on s" in Sect ion 1.1. 6, whic h will be used in a limited way in Speech w avefo rm
Chapte rs 5 and 12 . T he top ics in Sections 1.3 and 1.4 , however, ar e
mostl y specialized subjects wh ich will be used in particular aspects of our
st udy, beginning in Part IV of th e book . Likewise th e topi c in Secti on 1.5
is used in o ne isolat ed , but important, bod y of mat er ial in Ch apter 3.
•• } """I"
These latter topics and th e "state spa ce" topi c in the ea rlier section
will be " flagged " in Reading No tes at t he be ginn in g of relevant cha pters,
and in oth er appropri ate places in th e boo k. -o
"0
~
C.
C:
1.1 Review of DSP Concepts and Notation <
s(n) ≜ s_a(nT) = s_a(t)|_{t=nT},   n = ..., −1, 0, 1, 2, ...,   (1.1)

The integer n indexes the sample number, but we have lost the absolute time orientation in the argument. To recover the times at which the samples are taken, we simply need to know T.

To understand the "physical" significance of this mathematical convention, it is sometimes convenient to imagine that we have scaled the real-world time axis by a factor of T prior to taking the samples, as illustrated in Fig. 1.1. "Normalized time," say t', is related to real time as

t' = t/T   (1.2)

and the samples of speech are taken at intervals which are exactly "normalized seconds (norm-sec)." In most cases it is perfectly sufficient to refer to the interval between samples as the "sample period," where the conversion to the real-world interval is obvious. However, on a few occasions we will have more than one sampling process occurring in the same problem (i.e., a resampling of the speech sequence), and in these instances the concept of a "normalized second" is useful to refer to the basic sampling interval on the data.

FIGURE 1.1. Segment of a speech waveform used to illustrate the concept of "normalized time." Suppose that samples are to be taken at a rate F_s = 10 kHz, so that the sample period is T = 0.1 msec. The lower time axis represents real time measured in milliseconds, while the upper represents a normalization of the time axis such that the sample times fall at integers. Normalized time, t', is related to real time, t, as t' = t/T. We will on a few occasions refer to the sample period in the scaled case as a "normalized second (norm-sec)."

On the normalized time scale the sample frequency is always unity [dimensionless, or "normalized Hertz (norm-Hz)"], and the sample radian frequency is always 2π [dimensionless, or "normalized radians per second (norm-rps)"]. Accordingly, the Nyquist frequency is always 0.5 norm-Hz, or π norm-rps. In general, the conversions between "real" frequencies, say F (Hz) and Ω (rps), and their normalized counterparts, say f and ω, are given by

f = FT   (1.3)

ω = ΩT.   (1.4)

We can easily verify this by examining a single sinusoid at real frequency Ω. Of course, the normalization of time renders certain frequency quantities dimensionless as well.
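As a quick numerical illustration of the conversions in (1.3) and (1.4) — a sketch, with the sampling rate taken from Fig. 1.1 and the tone frequency chosen arbitrarily:

```python
# Convert a real analog frequency to its normalized counterparts,
# assuming the 10-kHz sampling rate of Fig. 1.1.
import math

Fs = 10_000.0            # sample rate in Hz
T = 1.0 / Fs             # sample period in seconds
F = 2_500.0              # analog frequency of a sinusoid, in Hz (arbitrary)

f = F * T                # normalized frequency (norm-Hz), Eq. (1.3)
w = 2 * math.pi * F * T  # normalized radian frequency (norm-rps), Eq. (1.4) with Omega = 2*pi*F

print(f)   # 0.25 -> one quarter of the sample rate
print(w)   # pi/2, below the Nyquist value of pi norm-rps
```

Note that f = 0.25 norm-Hz sits comfortably below the Nyquist frequency of 0.5 norm-Hz, regardless of the underlying real sampling rate.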
P_x ≜ lim_{N→∞} (1/(2N+1)) Σ_{n=−N}^{N} |x(n)|².   (1.11)

A power signal has finite but nonzero power,

0 < P_x < ∞.   (1.12)

• Transients, those which decay (usually exponentially) with time. Examples are

x₁(n) = aⁿ u(n),   |a| < 1   (1.13)

x₂(n) = a^|n| cos(nω₀ + φ),   |a| < 1.   (1.14)

• Finite sequences, those which are zero outside a finite time duration. An example is

x₃(n) = e^{βn} [u(n + 3) − u(n − 246)],   |β| < ∞.   (1.15)

Whereas the energy signals either decay out sufficiently fast or "stop" completely, the power signals neither decay nor increase in their envelopes. The power signals can be associated with three broad classes of signals. These are

• Constant signals. An example is

x₄(n) = a,   −∞ < a < ∞.   (1.16)

• Periodic signals, those for which x(n) = x(n + N) for some finite N. Examples are

x₅(n) = a sin(nω₀ + φ),   −∞ < a < ∞   (1.17)

x₆(n) = [x₃(n)] modulo 512 = Σ_{i=−∞}^{∞} x₃(n + i512).   (1.18)

• Realizations of stationary, ergodic stochastic processes (see Section 1.2.3).

The signals which fall into neither category are the trivial zero signal and those which "blow up" with time.

1.1.4 Transforms and a Few Related Concepts

At the heart of much of engineering analysis are various frequency domain transforms. Three transforms on discrete-time data will be used extensively throughout this book, and it will be assumed that the reader is familiar with their properties and usage.

The first is the discrete-time Fourier transform (DTFT), which, for the sequence x(n), is defined by

X(ω) ≜ Σ_{n=−∞}^{∞} x(n) e^{−jωn}.   (1.19)

The DTFT bears a useful relationship to the continuous-time Fourier transform in the case in which x(n) represents samples of the analog signal¹ x_a(t'). In this case X(ω) will be a periodic (with period 2π), potentially aliased version of X_a(ω),

X(ω) = Σ_{l=−∞}^{∞} X_a(ω − 2πl).   (1.21)

The existence of the DTFT is not a trivial subject, and we will review only a few important details. A sufficient condition for the DTFT of a sequence x(n) to exist is that the sequence be absolutely summable,

Σ_{n=−∞}^{∞} |x(n)| < ∞.   (1.22)

This follows immediately from (1.19). Moreover, absolute summability of x(n) is tantamount to absolute convergence of the series Σ_{n=−∞}^{∞} x(n)e^{−jωn}, implying that this series converges uniformly to a continuous function of ω (Churchill, 1960, Secs. 59 and 60). A sequence that is absolutely summable is also an energy signal, since

E_x = Σ_{n=−∞}^{∞} |x(n)|² ≤ [Σ_{n=−∞}^{∞} |x(n)|]².   (1.23)

There are, however, energy signals that are not absolutely summable (see Problem 1.2). These energy signals will still have DTFTs, but ones whose series converge in a weaker (mean square) sense. This can be seen by viewing (1.19) as a conventional Fourier series for the periodic function X(ω) whose coefficients are x(n). One of the properties of Fourier series is that if the energy in a single period of the function is finite, then the

¹Note the use of "normalized time" here. If "real" time is used, (1.21) becomes X(Ω) = (1/T) Σ_{l=−∞}^{∞} X_a(Ω − 2πl/T).
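The definition (1.19) is easy to check numerically for an absolutely summable sequence. The following sketch evaluates the DTFT sum for x(n) = aⁿu(n) and compares it with the well-known closed form 1/(1 − a e^{−jω}); the values of a and ω are chosen arbitrarily:

```python
# Numerically check the DTFT of x(n) = a^n u(n), |a| < 1, against its
# closed form X(w) = 1 / (1 - a e^{-jw}).
import cmath

a = 0.8
w = 1.3   # any frequency, in norm-rps

# Truncated version of the DTFT sum (1.19); since |a| < 1 the series
# converges absolutely, and the tail beyond n = 200 is negligible.
X_sum = sum((a ** n) * cmath.exp(-1j * w * n) for n in range(200))

X_closed = 1.0 / (1.0 - a * cmath.exp(-1j * w))

print(abs(X_sum - X_closed))  # essentially zero
```

Absolute summability (1.22) is what guarantees that the truncation error here can be made as small as desired, uniformly in ω.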
series will converge in mean square (Churchill, 1963). In the present case (using the Parseval relation),

∫_{−π}^{π} |X(ω)|² dω = 2π Σ_{n=−∞}^{∞} x²(n) = 2π E_x < ∞,   (1.24)

so the DTFT will converge in the mean square sense. Practically, this means that the DTFT sum will converge to X(ω) at all points of continuity, and at points of discontinuity it will converge to the "average" value ("halfway" between the values on either side of the discontinuity).

Properties of the DTFT are detailed in the textbooks cited in Appendix 1.A, and some are reviewed in Problem 1.3. The reader should also recall the numerous symmetry properties of the transform relation.

The DFT represents exact samples of the DTFT of the finite sequence x(n) at N equally spaced frequencies, ω_k = (2π/N)k, for k ∈ [0, N − 1]. The discrete Fourier series (DFS) is closely related to the DFT computationally, but is quite different philosophically. The DFS is used to represent a periodic sequence (hence a power signal) with period, say N, using the set of basis functions e^{j(2π/N)kn} for k = 0, ..., N − 1, where the coefficients are computed as

C(k) = (1/N) Σ_{n=0}^{N−1} y(n) e^{−j(2π/N)kn}.   (1.29)

[In principle, the C(k)'s may be computed over any period of y(n).]

It is occasionally convenient to use an "engineering DTFT" for a periodic signal that technically has no DTFT. The contrived DTFT composed of analog impulse functions at the harmonic frequencies weighted by the DFS coefficients is

Y(ω) = 2π Σ_{k=−∞}^{∞} C(k) δ_a(ω − k(2π/N)).   (1.30)

The power of y(n) in a specified frequency range is

P = 2 Σ_{k=k₁}^{k₂} |C(k)|²,

where k₁ and k₂ represent the integer indices of the lowest and highest harmonic of y(n) in the specified range. It is not difficult to show that a suitable definition is

Γ_y(ω) ≜ 2π Σ_{k=−∞}^{∞} |C(k)|² δ_a(ω − k(2π/N)).   (1.33)
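A short sketch of (1.29), using an arbitrary period-8 sequence, also verifies the bracketed remark above that the C(k)'s may be computed over any period of y(n):

```python
# Compute DFS coefficients C(k) from one period of a periodic sequence,
# per Eq. (1.29).
import cmath

N = 8
y = [1, 1, 1, 1, 0, 0, 0, 0]   # one period of y(n), chosen arbitrarily

def dfs_coeff(y, k, start=0):
    # Sum over any period beginning at 'start'; y(n) is indexed modulo N.
    N = len(y)
    return sum(
        y[n % N] * cmath.exp(-2j * cmath.pi * k * n / N)
        for n in range(start, start + N)
    ) / N

C = [dfs_coeff(y, k) for k in range(N)]

print(C[0].real)   # 0.5, the average value over one period
# The summand is N-periodic in n, so any period gives the same C(k):
print(abs(dfs_coeff(y, 3, start=5) - C[3]) < 1e-12)  # True
```

Because the summand in (1.29) is itself periodic with period N, the starting index of the sum is immaterial, which is exactly why the C(k)'s are a property of the periodic signal rather than of any particular period.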
[The reader can confirm that this is consistent with (1.32).] By comparing with (1.30), it is clear why some authors choose to write |Y(ω)|² as a notation for the PDS. The advantage of doing so is that it gives the contrived DTFT of (1.30) yet another "DTFT-like" property in the following sense: |X(ω)|² is properly called the energy density spectrum for an energy signal x(n) and can be integrated over a specified frequency range to find the total energy in that range. |Y(ω)|² is thus an analogous notation for an analogous function for a power sequence. The disadvantage is that it introduces more notation which can be easily confused with a more "valid" spectral quantity. We will therefore use only Γ_y(ω) to indicate the PDS of a periodic signal y(n).

The similarity of the DFT to the DFS is apparent, and this similarity is consistent with our understanding that the IDFT, if used outside the range n ∈ [0, N − 1], will produce a periodic replication of the finite sequence x(n). Related to this periodic nature of the DFT are the properties of "circular shift" and "circular convolution" of which the reader must beware in any application of this transform. A few of these notions are reviewed in Problem 1.4.

For interpretive purposes, it will be useful for us to note the following. Although the DTFT does not exist for a periodic signal, we might consider taking the limit

Ȳ(ω) = lim_{N→∞} (1/(2N + 1)) Σ_{n=−N}^{N} y(n) e^{−jωn}   (1.34)

in the hope of making the transform converge. A moment's thought will reveal that this computation is equivalent to the same sum taken over a single period.

The textbooks cited above provide a general overview of most of the fundamental treatments of the FFT. Some advanced topics are found in (Burrus, 1988).

The final transform that will be used extensively in this book is the (two-sided) z-transform (ZT), defined by

X(z) = Σ_{n=−∞}^{∞} x(n) z^{−n},   (1.36)

where z is any complex number for which the sum exists, that is, for which

Σ_{n=−∞}^{∞} |x(n) z^{−n}| < ∞.   (1.37)

The values of z for which the series converges comprise the region of convergence (ROC) for the ZT. When the series converges, it converges absolutely (Churchill, 1960, Sec. 59), implying that the ZT converges uniformly as a function of z everywhere in the ROC. Depending on the time sequence, the ROC may be the interior of a circle, the exterior of a circle, or an annulus of the form r_in < |z| < r_out, where r_in may be zero and r_out may be infinite. The ROC is often critical in uniquely associating a time sequence with a ZT. For details see the textbooks in Appendix 1.A.

The ZT is formally inverted by contour integration,

x(n) = (1/2πj) ∮_C X(z) z^{n−1} dz,   (1.38)
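The convergence condition (1.37) is easy to probe numerically. For x(n) = aⁿu(n) the series (1.36) converges for |z| > |a| to 1/(1 − a z⁻¹); the following sketch evaluates the partial sums at an arbitrary point inside that ROC:

```python
# Check the z-transform of x(n) = a^n u(n) against 1 / (1 - a z^{-1})
# at a point inside the ROC |z| > |a|.
import cmath

a = 0.6
z = 0.9 * cmath.exp(1j * 0.7)   # |z| = 0.9 > |a| = 0.6, inside the ROC

# Truncated version of Eq. (1.36); the ratio |a/z| < 1, so the tail of
# the geometric series vanishes.
X_sum = sum((a ** n) * z ** (-n) for n in range(300))
X_closed = 1.0 / (1.0 - a / z)

print(abs(X_sum - X_closed))  # essentially zero
```

Choosing |z| < |a| instead would make the partial sums diverge, which is the numerical face of the ROC statement above.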
Since we use the same "uppercase," for example, X, notation to indicate more than one transform, the intended transform will be clear from the argument, as in (1.40).

1.1.5 Windows and Frames

In all practical signal processing applications, it is necessary to work with short terms or frames of the signal, unless the signal is of short duration.¹¹ This is especially true if we are to use conventional analysis techniques on signals (such as speech) with nonstationary dynamics. In this case it is necessary to select a portion of the signal that can reasonably be assumed to be stationary.

Recall that a (time domain) window, say w(n), is a real, finite-length sequence used to select a desired frame of the original signal, say x(n), by a simple multiplication process. Some of the commonly used window sequences are shown in Fig. 1.3.

FIGURE 1.3. Definitions and example time plots for the rectangular, Kaiser, Hamming, Hanning, and Blackman windows. All plots are for window lengths N = 101, and for the Kaiser window, β = 4.

For consistency, we will assume windows to be causal sequences beginning at time n = 0. The duration will usually be denoted N. Most commonly used windows are symmetric about the time (N − 1)/2, where this time may be halfway between two sample points if N is even. Recall that this means that the windows are linear phase sequences [e.g., see (Proakis and Manolakis, 1992)] and therefore have DTFTs that can be written

W(ω) = |W(ω)| e^{−jω(N−1)/2},   (1.41)

where the phase term is a simple linear characteristic corresponding to the delay of the window that makes it causal.¹²

It will be our convention in this book to use windows in a certain manner to create a frame of the signal. We first reverse the window in time¹³ [w(−n)], then shift it so that its leading edge is at a desired time, m [w(m − n)]. A frame of the signal x(n) of length N (same as the duration of the window) ending at time m, say f_x(n; m), is obtained as

f_x(n; m) = x(n) w(m − n).   (1.42)

This simple concept will be used extensively in future developments involving frames of speech. In fact, much of the time in this book the frame will be related to a speech sequence denoted s(n) and it will be unnecessary to employ the subscript s because it will be obvious. We will only use a subscript in discussions where frames are being created from more than one signal.

Assume for the moment that x(n) is a stationary signal for all time. Clearly, the temporal properties of f_x(n; m) are distorted with respect to those of x(n) due to the direct modification of the temporal sequence by the window. Correspondingly, the spectral properties also differ, as the two transforms are effectively convolved. That is, if F_x(ω; m) denotes the DTFT of frame f_x(n; m), then

F_x(ω; m) = (1/2π) ∫_{−π}^{π} X(ω − θ) W(−θ) e^{−jθm} dθ.   (1.43)

Now the relationship between F_x(ω; m) and X(ω) will only be clear from (1.43) to those who are able to visualize the process of convolving complex functions! Most of us do not have such an imagination. However, we

¹¹The ROC includes the unit circle if and only if x(n) is absolutely summable. Therefore, in keeping with our discussion above, only a uniformly convergent DTFT can be obtained by evaluating the corresponding ZT on the unit circle.

¹²If the window were allowed to be centered on n = 0, it would have a purely real DTFT and a zero-phase characteristic. A similar discussion applies to the design of FIR filters by truncation of a desired impulse response (see DSP textbooks cited in Appendix 1.A).

¹³Since we assume windows to be symmetric about their midpoints, this reversal is just to initially shift the leading edge to time zero.
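The framing convention of (1.42) can be sketched directly. In the following, the signal, the window length, and the frame end time m are all arbitrary choices; w(m − n) is nonzero exactly for the N samples n = m − N + 1, ..., m:

```python
# Frame extraction per Eq. (1.42): f_x(n; m) = x(n) * w(m - n),
# using a causal Hamming window of length N ending at time m.
import math

N = 64        # window length
m = 200       # frame ends at sample m

def hamming(i, N):
    # Causal Hamming window, nonzero only for i = 0, ..., N-1.
    if 0 <= i < N:
        return 0.54 - 0.46 * math.cos(2 * math.pi * i / (N - 1))
    return 0.0

x = [math.sin(0.1 * n) for n in range(400)]   # arbitrary test signal

# w(m - n) != 0 for 0 <= m - n <= N - 1, i.e., the N samples ending at n = m.
frame = [x[n] * hamming(m - n, N) for n in range(m - N + 1, m + 1)]

print(len(frame))  # 64
```

Note that the last sample of the frame, at n = m, is weighted by w(0), the small leading-edge value of the window, consistent with the reverse-and-shift convention described above.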
can get some insight into the spectral distortion by assuming, with some loss of generality, that the reversed and shifted window is centered on time n = 0 [m = (N − 1)/2]. Said another way, this simply means that the true signal transform, X(ω), against which we are going to compare our frame's transform, F_x(ω; m), is the one whose signal is assumed to have its time origin in the middle of the window. This, of course, is not always the X(ω) that represents our standard, but we can use it for insight. In this case (1.41) can be used in (1.43) to yield

F_x(ω; m) = (1/2π) ∫_{−π}^{π} X(ω − θ) |W(−θ)| dθ = (1/2π) ∫_{−π}^{π} X(ω − θ) |W(θ)| dθ
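The smoothing kernel |W(θ)| in the convolution above differs markedly across windows. The following sketch (window length chosen arbitrarily) compares the rectangular window's DTFT magnitude at its peak and near its first sidelobe; the oft-quoted figures are roughly −13 dB for the rectangular window's first sidelobe versus roughly −41 dB for the Hamming window's:

```python
# Compare DTFT magnitudes of rectangular and Hamming windows,
# illustrating the smoothing kernel |W(theta)| discussed above.
import cmath, math

N = 64

def dtft_mag(w_seq, omega):
    return abs(sum(w_seq[n] * cmath.exp(-1j * omega * n) for n in range(len(w_seq))))

rect = [1.0] * N
hamm = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

# At omega = 0 the rectangular window's kernel peaks at N (coherent gain).
print(dtft_mag(rect, 0.0))   # 64.0

# Near the first sidelobe peak of the rectangular window (omega ~ 3*pi/N),
# the level relative to the mainlobe is about -13 dB.
side_rect = dtft_mag(rect, 3 * math.pi / N)
print(20 * math.log10(side_rect / dtft_mag(rect, 0.0)))  # about -13.5 dB
```

The lower sidelobes of the tapered (e.g., Hamming) kernels reduce spectral "leakage" at the cost of a wider mainlobe, that is, of coarser frequency resolution in F_x(ω; m).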
These comprise state variables for this system, as we shall see momentarily. Note that

v_i(n + 1) = v_{i+1}(n),   i = 1, 2, ..., M − 1   (1.49)

v_M(n + 1) = x(n) + Σ_{i=1}^{M} a(i) v_{M−i+1}(n).   (1.50)

These are the state equations for the system. Note also that the output can be computed from the state variables at time n using

y(n) = b(0) v_M(n + 1) + Σ_{i=1}^{M} b(i) v_{M−i+1}(n)
     = b(0) x(n) + Σ_{i=1}^{M} [b(i) + b(0) a(i)] v_{M−i+1}(n),   (1.51)

which is called simply the output equation for the system. It is clear that these state variables do comprise a legitimate state for this system according to the definition. For convenience, the state and output equations can be written in vector-matrix form as

v(n + 1) = A v(n) + c x(n)   (1.52)

y(n) = bᵀ v(n) + d x(n),   (1.53)

in which d is the scalar d = b(0), A is the M × M state transition matrix

A = [ 0      1       0       ⋯   0    ]
    [ 0      0       1       ⋯   0    ]
    [ ⋮                           ⋮   ]
    [ 0      0       0       ⋯   1    ]
    [ a(M)   a(M−1)  a(M−2)  ⋯   a(1) ]   (1.54)

and c and b are M-vectors [recall the assumption Q < M and the definitions above],

c = [0  0  0  ⋯  0  1]ᵀ   (1.55)

b = [ b(M−1) + b(0)a(M−1),  b(M−2) + b(0)a(M−2),  ⋯,  b(1) + b(0)a(1) ]ᵀ.   (1.56)

Equations (1.52) and (1.53) are very close to the state-space description of an LTI system that will be needed in our work in a limited way. In fact, because of the way we have chosen to define the state variables here, these equations comprise a lower companion form state-space model, so named because of the form of the state transition matrix A. A simple redefinition of state variables leads to the upper companion form model, which we explore in Problem 1.5.

Finally, in our study of hidden Markov models for speech recognition in Chapter 12, we will have need of a state-space description of a system that has a vector output. In this case the system will naturally arise in state-space form and there will be no need for us to undertake a conversion of an input-output description of the system. The system there will have a state equation similar to (1.52), except that the state transition matrix, A, will generally not be of a special form like the one above (indicating more complicated dependencies among the states). The output equation will take the form

y(n) = B v(n) + d x(n)   (1.57)

in which y(n) and d are P-vectors (P outputs) and B is a P × M matrix. We will have more to say about this system when its need arises.

1.1.7 Minimum-, Maximum-, and Mixed-Phase Signals and Systems

We have discussed the grouping of signals into energy or power categories. Here we restrict our attention to the subclass of real signals with legitimate DTFTs (those that are absolutely summable) and consider another useful categorization.

The specification of the magnitude spectrum of a discrete-time signal is generally not sufficient to uniquely specify the signal or, equivalently, the DTFT of the signal. Consider, for example, the magnitude spectrum, |X(ω)|, shown in Fig. 1.6. This spectrum was actually computed for the signal x₁(n) with z-transform

X₁(z) = [(1 − ζ₁z⁻¹)(1 − ζ₁*z⁻¹)(1 − ζ₂z⁻¹)] / [(1 − p₁z⁻¹)(1 − p₁*z⁻¹)(1 − p₂z⁻¹)],   (1.58)

with ζ₁ = 0.9∠45°, ζ₂ = 0.5, p₁ = 0.7∠135°, and p₂ = −0.5, and therefore has an analytical description

|X₁(e^{jω})| = |(1 − ζ₁e^{−jω})(1 − ζ₁*e^{−jω})(1 − ζ₂e^{−jω})| / |(1 − p₁e^{−jω})(1 − p₁*e^{−jω})(1 − p₂e^{−jω})|.   (1.59)

The true phase characteristic is given by

arg{X₁(e^{jω})} = arg{(1 − ζ₁e^{−jω})(1 − ζ₁*e^{−jω})(1 − ζ₂e^{−jω})} − arg{(1 − p₁e^{−jω})(1 − p₁*e^{−jω})(1 − p₂e^{−jω})}.   (1.60)

The magnitude and phase spectra for the signal x₁(n) are found in Figs. 1.6 and 1.7, respectively, and the pole-zero diagram is shown in Fig. 1.8. If the magnitude spectrum were all that were known to us, however, it would not be possible to deduce this z-transform and corresponding signal with certainty. Indeed, consider the signal x₂(n) with z-transform

X₂(z) = [(z⁻¹ − ζ₁)(z⁻¹ − ζ₁*)(z⁻¹ − ζ₂)] / [(1 − p₁z⁻¹)(1 − p₁*z⁻¹)(1 − p₂z⁻¹)].   (1.61)

From comparisons such as these, one can infer that the minimum-phase signal is the one which, for a given magnitude spectrum, concentrates its energy at the earliest times. If we define the partial energy

E_x(m) ≜ Σ_{n=0}^{m} x²(n),   (1.62)

then it will be true that¹⁴

E_{x_min}(m) ≥ E_x(m)   (1.63)

for any absolutely summable signal x(n) with the same magnitude spectrum, and for any m. Precisely the opposite holds for the maximum-phase signal, say x_max(n),

E_{x_max}(m) ≤ E_x(m)   (1.64)

for any absolutely summable signal x(n) with the same magnitude spectrum, and for any m. The significance of these expressions can be appreciated in Fig. 1.9, where we show the time domain waveforms for x₁(n) above, which we now know is minimum phase, and for x₂(n), which is maximum phase.

FIGURE 1.9. Time domain plots of minimum-phase signal x₁(n) and maximum-phase signal x₂(n). The signals are squared for convenience.

Yet another way to view a minimum-phase signal, particularly when it represents the impulse response of a system, is as follows: If h(n) represents a minimum-phase impulse response of a causal stable system, then the z-domain system function, H(z), will have all of its poles and zeros inside the unit circle. Hence there exists a causal, stable inverse system, H⁻¹(z), such that

H(z) H⁻¹(z) = 1   (1.65)

everywhere in the z-plane. If there were even one zero outside the unit circle in H(z), a stable inverse would not exist, since at least one pole in the inverse would be obliged to be outside the unit circle. The existence of a causal stable inverse z-transform for H(z) is therefore a sufficient condition to assure that the signal h(n) (or its corresponding system) is minimum phase.

Finally, we note that we have assumed that signals in this discussion are generally infinite in duration by allowing them to have poles in their z-transforms. (By restricting our discussion to absolutely summable signals, however, we have constrained the poles to be inside the unit circle.) In the case of a real, finite duration ("all zero"), minimum-phase sequence of length N (perhaps the impulse response of an FIR filter), it can be shown that its maximum-phase counterpart is given by

x_max(n) = x_min(N − 1 − n)   (1.66)

or

X_max(z) = z^{−(N−1)} X_min(z⁻¹).   (1.67)

The concepts of minimum-phase signals and systems will play a key role in the theory of linear prediction and surrounding modeling concepts.

¹⁴A proof of this fact is outlined in Problem 5.36 of (Oppenheim and Schafer, 1989).

1.2 Review of Probability and Stochastic Processes

We will discover in the next chapter that there are two basic classes of speech sounds, "voiced" and "unvoiced." Generally speaking, the former is characterized by deterministic acoustic waveforms, while the latter
corresponds to stochastic waveforms. The difference can be heard in the two sounds present in the word "it," for example. Although random process theory will be necessary to analyze unvoiced signals, we will find that even in the case of voiced sounds it will be very useful to employ analytical techniques which are fundamentally motivated by stochastic process theory, notably the autocorrelation function. In different ways from those used to analyze speech waveforms, we will employ concepts from probability in our study of stochastic models for the coding and recognition of speech. In these and other aspects of our study of speech processing, basic concepts from random process theory will be prerequisite to our pursuits.

As is the case with digital signal processing concepts, it will be necessary for the reader to have a working knowledge of the concepts of probability and stochastic processes, at least at the level of a typical senior or entry-level graduate course. Some of the widely used books in the field are listed in Appendix 1.B, and the reader is encouraged to refer to these textbooks to review concepts as needed.

As noted, one of the central tools of speech processing is the autocorrelation sequence. Several of the more fundamental concepts, in particular stationarity and ergodicity, also play key roles in our work. In the recognition domain, an understanding of basic concepts concerning joint experiments and statistical independence will be essential. It is our purpose here to briefly review these fundamental notions with the autocorrelation sequence and surrounding ideas as a target of this discussion.

It is possible to work with relatively few events and still have a consistent and meaningful theory of probability. A proper "event space" will turn out to be a sigma-field or sigma-algebra over S, which is a set of subsets of S that is closed under complementation, union, and (if S has an infinite number of elements) countable union. Let us call the algebra A. In typical engineering problems, the algebra of events is often all intervals in some continuum of possible outcomes, or the "power set" of discrete outcomes if S is finite and discrete. These and other algebras in different situations are naturally used in problems without much forethought.

The third component, probability, is a normalized measure assigned to these "well thought out" sets of events that adheres to four basic axioms. If P(A) denotes the probability of event A, these are

1. P(S) = 1.
2. P(A) ≥ 0, for all A ∈ A.
3. For two mutually exclusive events A, B ∈ A,
   P(A ∪ B) = P(A) + P(B).   (1.68)
   Mutually exclusive means A ∩ B = ∅, where ∅ is the null event.
4. For a countably infinite set of mutually exclusive events A_i ∈ A, i = 1, 2, ...,
   P(⋃_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i).   (1.69)
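For a finite, discrete sample space the axioms above can be checked by brute-force enumeration. A sketch with a fair six-sided die (the events are arbitrary choices):

```python
# Sanity-check the additivity property (1.68) on a finite sample space
# by enumeration, using a fair six-sided die and exact arithmetic.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
P = {outcome: Fraction(1, 6) for outcome in S}   # uniform measure

def prob(event):
    return sum(P[o] for o in event)

A = {1, 2}          # "roll is at most 2"
B = {5, 6}          # "roll is at least 5"
assert A & B == set()                    # mutually exclusive: A ∩ B = ∅
print(prob(A | B) == prob(A) + prob(B))  # True, Eq. (1.68)
print(prob(S))                           # 1, Axiom 1
```

Here the event algebra is the power set of S, which, as noted above, is the natural choice when S is finite and discrete.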
Combined Experiments

This is a good place to review the notion of a combined experiment.¹⁷ We will have need of this theory only if such experiments have independent events, so we will restrict our discussion accordingly. Formally, an "experiment" is equivalent to the probability space used to treat that experiment. For example, let experiment 1, E₁, be associated with a probability space as follows: The events to which we will assign probabilities consist of all open and closed intervals and points in the sample space. [Similarly, experiment 2, E₂, is associated with the space {S₂, A₂, P₂}.] The combined experiment is then associated with the product sample space S = S₁ × S₂.

¹⁷Carefully note that this term has nothing whatsoever to do with discrete time.

¹⁸Some textbooks use uppercase letters to indicate random variables; still others use boldface. Uppercase letters will have many other significances in this book, and boldface quantities are used to indicate vectors or matrices.

A notation which is more consistent with f_x(x) would be P_x(x), but this has other obvious disadvantages. For example, how would we denote the probability of the event x ≤ x?
1984, Ch. 1) for details]. Note that a discrete random variable will have a pdf that will consist entirely of impulses at the point outcomes of the random variable. The weighting on the impulse at x_i is P(x = x_i).

Returning to (1.84), from the Fundamental Theorem of Calculus, we have

P(a < x ≤ b) = F_x(b) − F_x(a) = ∫_a^b f_x(ξ) dξ,   (1.85)

implying the well-known result that the area under the pdf on the range (a, b] yields the probability that x produces a value in that interval.¹⁹

Some of the commonly used pdf's in speech processing are

1. Gaussian:

f_x(x) = (1/√(2πσ_x²)) exp{−(x − μ_x)²/(2σ_x²)},   (1.86)

where μ_x is the average or mean of x and σ_x² is the variance (σ_x is the standard deviation, discussed below).

2. Uniform:

f_x(x) = { 1/(b − a),   a < x ≤ b
         { 0,           otherwise   (1.87)

for some b > a.

3. Laplacian:

f_x(x) = (1/(√2 σ_x)) exp{−√2 |x| / σ_x},   (1.88)

where σ_x is the standard deviation of x.

¹⁹Care must be taken with impulse functions at the limits of integration if they exist.

Multiple Random Variables

Preliminaries. We are gradually building toward a review of random processes. The next step is to consider relationships among several random variables. We begin by considering relationships between two random variables, noting that many of the concepts we review here have natural generalizations to more than two random variables. At the end of the section, we focus on random vectors, in which some of these generalizations will arise.

In combining experiments above, we encountered the task of combining two sample spaces at the fundamental level. We assumed that the events in the individual sample spaces were independent. Here we implicitly assume that two random variables, say x and y, map the same S into two different range spaces, S_x and S_y. The joint range space is simply a product space,

S_xy = S_x × S_y,   (1.91)

formed in a similar manner to product sample spaces for combined experiments. The joint event algebra, say A_xy, are events chosen from S_xy. For most purposes, these will be open and closed rectangles and points in S_xy. A significant difference between this theory and that of combined experiments is that we do not assume that events in the individual range spaces, S_x and S_y, are independent. We formally assign probabilities to events in A_xy by tracing them back to A to see what event they represent there.

These ideas are readily extended to more than two random variables.

Joint cdf and pdf; Conditional Probability. The joint cdf and joint pdf are defined formally in a manner analogous to the scalar case. A frequently encountered example is the joint (bivariate) Gaussian pdf,

f_xy(x, y) = (1/(2π σ_x σ_y √(1 − ρ²))) exp{−(1/2) Q(x, y)},   (1.94)
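As a quick empirical check on the parameterization of the Laplacian pdf (1.88) — a sketch using inverse-transform sampling, with σ chosen arbitrarily:

```python
# Empirically verify that the Laplacian density of (1.88), written with
# standard deviation sigma, really has that standard deviation.
import math, random

random.seed(0)
sigma = 2.0

def sample_laplacian(sigma):
    # Inverse-CDF sampling for a zero-mean Laplace variable with scale
    # b = sigma / sqrt(2), which gives variance 2*b^2 = sigma^2.
    b = sigma / math.sqrt(2.0)
    u = random.random() - 0.5
    return -b * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

xs = [sample_laplacian(sigma) for _ in range(200_000)]
var = sum(x * x for x in xs) / len(xs)   # mean is zero by symmetry

print(math.sqrt(var))  # close to sigma = 2.0
```

The heavy tails of the Laplacian relative to the Gaussian are one reason it appears in speech coding: prediction residuals of speech are often better modeled as Laplacian than Gaussian.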
where

Q(x, y) = (1/(1 − ρ²)) { ((x − μ_x)/σ_x)² − 2ρ ((x − μ_x)/σ_x)((y − μ_y)/σ_y) + ((y − μ_y)/σ_y)² },   (1.95)

assuming that the pdf exists.

Conditional probabilities are defined by

P(A ∈ A_x | B ∈ A_y) ≜ P(A, B) / P(B),   (1.96)

where the numerator and denominator are determined by traceback to A. If A is the interval x ≤ x, then we have the conditional cdf,

F_{x|y}(x | B) ≜ P([x ≤ x ∈ A_x] | B ∈ A_y) = P(x ≤ x, B) / P(B)   (1.97)

and the conditional pdf

f_{x|y}(x | B) ≜ (d/dx) F_{x|y}(x | B).   (1.98)

All the usual relationships between the cdf and pdf hold with the conditioning information added. For example,

F_{x|y}(x₂ | B) − F_{x|y}(x₁ | B) = ∫_{x₁}^{x₂} f_{x|y}(ξ | B) dξ.   (1.99)

Two random variables x and y are statistically independent if and only if

f_xy(x, y) = f_x(x) f_y(y).   (1.102)

Statistical independence is a very strong condition. It says that outcomes of x and y tend not to be related in any functional way, linear or nonlinear. When two random variables are related linearly, we say that they are correlated. To say that x and y are uncorrelated is to say that there is no linear dependence between them. This does not say that they are necessarily statistically independent, for there can still be nonlinear dependence between them. We will see this issue in the topic of vector quantization (Section 7.2.2), where there will be an effort made to extract not only linear dependency (correlation) out of the speech data, but also nonlinear dependency, to produce efficient coding procedures.

Recall that the expectation of a function g(x) of a random variable is E{g(x)} ≜ ∫_{−∞}^{∞} g(x) f_x(x) dx, assuming that the pdf exists. When g(x) = x, this produces the average or mean value of x, μ_x. Note that when x produces only discrete values, say x₁, x₂, ..., then the pdf consists of impulses and the definition produces

E{g(x)} = Σ_{i=1}^{∞} g(x_i) P(x = x_i).   (1.104)

The definition is readily generalized to functions of two or more random variables. For example,

E{g(x, y)} ≜ ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_xy(x, y) dx dy.   (1.105)

Particularly useful averages are the moments of a random variable. The ith moment of the random variable x is the number E{x^i}, and the ith central moment is E{(x − μ_x)^i}. A special central moment is the second one (i = 2), which we call the variance and denote σ_x². The square root of the variance, σ_x, is called the standard deviation of x.

The i, k joint moment between random variables x and y is the number

E{x^i y^k} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^i y^k f_xy(x, y) dx dy   (1.108)
Ch. 1 / Propaedeutic | 1.2 / Review of Probability and Stochastic Processes
and the i, k joint central moment is the number

\mathcal{E}\{(\underline{x}-\mu_x)^i(\underline{y}-\mu_y)^k\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x-\mu_x)^i(y-\mu_y)^k f_{\underline{x}\,\underline{y}}(x, y)\, dx\, dy.    (1.109)

When i = k = 1, the joint moment is called the correlation between \underline{x} and \underline{y}, and the joint central moment, the covariance. Let us call these numbers r_{xy} and c_{xy}, respectively. A parameter frequently used in the statistical analysis of data (and which appears in the joint Gaussian pdf above) is the correlation coefficient given by

\rho_{xy} = \frac{c_{xy}}{\sigma_x \sigma_y}.    (1.110)

We see that the correlation coefficient is the covariance between \underline{x} and \underline{y} normalized to the product of the individual standard deviations.

Correlation and covariance will occur repeatedly in our study of speech, and it is advisable to master their meanings if they are not already very familiar. This is especially true because the terms "autocorrelation" and "covariance" are used in ways that are not consistent with their definitions in some aspects of speech processing. A related pair of somewhat unfortunate terms²⁰ is the following: \underline{x} and \underline{y} are said to be orthogonal if their correlation is zero, and uncorrelated if their covariance is zero.

The conditional expectation of \underline{y} given an event B is

\mathcal{E}\{\underline{y} \mid B\} \triangleq \int_{-\infty}^{\infty} y f_{\underline{y}|B}(y \mid B)\, dy.    (1.112)

It is well known that the best predictor of \underline{y}, in the sense of least square error, given some event concerning \underline{x}, is given by the conditional expectation. If \underline{x} and \underline{y} are also jointly Gaussian, then the conditional expectation also provides the linear least square error predictor (see textbooks in Appendix 1.B).

Random Vectors. In discussing more than one random variable at a time, say \underline{x}_1, \underline{x}_2, \ldots, \underline{x}_N, it is frequently convenient to package them into a random vector,

\underline{\mathbf{x}} \triangleq [\underline{x}_1\ \underline{x}_2\ \cdots\ \underline{x}_N]^T.    (1.113)

Note that the vector is indicated by a boldface quantity, and the fact that it is a random vector is indicated by the line beneath it. The pdf associated with a random vector is very simply the joint pdf among its component random variables,

f_{\underline{\mathbf{x}}}(x_1, x_2, \ldots, x_N) \triangleq f_{\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_N}(x_1, x_2, \ldots, x_N).    (1.114)

Operations among random vectors follow the usual rules of matrix arithmetic. For example, the operations of inner and outer products of random vectors will be significant in our work. Recall that the inner product of a vector with itself (in this case a random vector, say \underline{\mathbf{x}} = [\underline{x}_1 \cdots \underline{x}_N]^T), the squared l_2 norm, is the sum of its squared components. This can be written in a variety of ways,

\|\underline{\mathbf{x}}\|^2 = \underline{\mathbf{x}}^T\underline{\mathbf{x}} = \sum_{i=1}^{N} \underline{x}_i^2.    (1.115)

Note that the inner product of a random vector is itself a random variable. The outer product, on the other hand, is the product \underline{\mathbf{x}}\,\underline{\mathbf{x}}^T, which creates a random matrix whose (i, j) element is the random variable \underline{x}_i \underline{x}_j. Of course, the inner and outer products may be computed between two different random vectors.

The expectation of a random vector (or matrix) is just the vector (or matrix) of expectations of the individual elements. For example, \mathcal{E}\{\underline{\mathbf{x}}\} is simply the vector of means [\mu_{x_1}\ \mu_{x_2}\ \cdots\ \mu_{x_N}]^T, which we might denote \boldsymbol{\mu}_x. The matrix

\mathbf{R}_{\underline{\mathbf{x}}} \triangleq \mathcal{E}\{\underline{\mathbf{x}}\,\underline{\mathbf{x}}^T\}    (1.116)

is called the correlation matrix for \underline{\mathbf{x}}, since its (i, j) element is the correlation \mathcal{E}\{\underline{x}_i\underline{x}_j\}, and

\mathbf{C}_{\underline{\mathbf{x}}} \triangleq \mathcal{E}\{(\underline{\mathbf{x}} - \boldsymbol{\mu}_x)(\underline{\mathbf{x}} - \boldsymbol{\mu}_x)^T\}    (1.117)

is called the covariance matrix for \underline{\mathbf{x}} for the similar reason.

An example that occurs frequently in engineering problems is the Gaussian random vector, for which any subset of its random variable components has a joint Gaussian pdf. In particular, the joint pdf among the entire set of N is an N-dimensional Gaussian pdf. That is, if \underline{\mathbf{x}} is a Gaussian random vector, then

f_{\underline{\mathbf{x}}}(x_1, \ldots, x_N) = f_{\underline{x}_1, \ldots, \underline{x}_N}(x_1, \ldots, x_N) = \frac{1}{(2\pi)^{N/2}|\mathbf{C}_{\underline{\mathbf{x}}}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_x)^T \mathbf{C}_{\underline{\mathbf{x}}}^{-1} (\mathbf{x} - \boldsymbol{\mu}_x)\right\},    (1.118)

where \mathbf{x} denotes the vector of arguments [x_1 \cdots x_N]^T and \mathbf{C}_{\underline{\mathbf{x}}} and \boldsymbol{\mu}_x are the covariance matrix and mean vector as defined above. It can be shown that this form reduces to (1.95) in the two-dimensional case.

²⁰Speech-processing engineers are not responsible for this terminology!
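The distinction between uncorrelatedness and statistical independence discussed above can be checked numerically. The following sketch (an illustration, not from the text) estimates the correlation coefficient (1.110) for y = x² with x zero-mean Gaussian: the two variables are completely dependent, yet essentially uncorrelated.

```python
import random, math

random.seed(1)
N = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [x * x for x in xs]  # y is a deterministic (nonlinear) function of x

def corrcoef(a, b):
    """Sample correlation coefficient, an estimate of rho_xy in (1.110)."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    sa = math.sqrt(sum((u - ma) ** 2 for u in a) / n)
    sb = math.sqrt(sum((v - mb) ** 2 for v in b) / n)
    return cov / (sa * sb)

rho = corrcoef(xs, ys)
print(abs(rho) < 0.05)  # near zero: uncorrelated, though y depends on x
```

Because x is symmetric about zero, the third moment E{x³} vanishes, so the linear association measured by rho is (up to sampling noise) exactly zero even though knowledge of x determines y completely.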
[Figure 1.11: segments of the L = 3 digitized speech waveforms used in the illustrative experiment below (amplitude versus time, n).]

where each random variable represents a model for the generation of values at its corresponding time. There will be many occasions when we will want to refer to a random process by a name. For example, it is too clumsy to write something like "the random process \{\ldots, \underline{x}(-1), \underline{x}(0), \underline{x}(1), \ldots\} is used to model the speech signal. . . ." What shall we call the random process? This is one place where we shall bow to convention and use a less-than-ideal choice. It is common to refer to a random process by the same name as that used for the random variables which constitute it. For example, the random process in (1.119) would be called simply \underline{x},

\underline{x} = [\ldots, \underline{x}(-1), \underline{x}(0), \underline{x}(1), \ldots] = \{\underline{x}(n), n \in (-\infty, \infty)\},    (1.120)

so that we can write "the random process \underline{x} is used to model the speech signal. . . ." Of course, the problem which arises is that \underline{x} may refer to a random variable or a random process. We could further distinguish a random process by using yet another notation, for example, \underline{s}, but this would only compound the notational problem.

Suppose that we define a simple experiment in which an integer representing one of L digitized speech waveforms is selected at random. For illustrative purposes, we plot segments of all of the waveforms (for L = 3) in Fig. 1.11. We can imagine that each time is governed by a random variable, say \underline{x}(n) at time n, and the ordered collection of these random variables is the underlying random process, \underline{x}. When the experiment is complete, each random variable will go to work mapping the outcome, for example, "waveform 2," to an amplitude level corresponding to that outcome. For example, \underline{x}(8) maps the outcome "waveform 2" to a value 82 in our figure. For this one experiment, therefore, the totality of all the

²¹We will focus on the discrete case because of our primary interest in discrete signals in this book.
random variables will produce a particular waveform from the experimental outcome, each random variable being responsible for one point. This one waveform is called a sample function or realization of the random process. The collection of all realizations (resulting from all the experiments) is called an ensemble. It should be clear that if we select a time, we will get a random variable. If we select an experimental outcome, we get a realization. If we select both a time and an outcome, we get a number, which is the result of the mapping of that outcome to the real line by the random variable at the time we select.

[Figure 1.11 (continued).]

pdf for a Random Process. Associated with any i random variables in a random process is an ith-order pdf. For example, for \underline{x}(n_1), \underline{x}(n_2), and \underline{x}(n_3), we have the third-order density

f_{\underline{x}(n_1)\underline{x}(n_2)\underline{x}(n_3)}(\xi_1, \xi_2, \xi_3).    (1.121)

This is consistent with our previous convention of listing all random variables in the joint pdf as subscripts of f.

Independence of Random Processes. We have reviewed the meaning of independence of random variables above. We must also recall the meaning of independence as it applies to random processes: \underline{x} and \underline{y} are statistically independent if and only if every joint pdf among their random variables factors,

f_{\underline{x}(n_1),\ldots,\underline{x}(n_i),\underline{y}(m_1),\ldots,\underline{y}(m_j)}(\xi_1,\ldots,\xi_i,\nu_1,\ldots,\nu_j) = f_{\underline{x}(n_1),\ldots,\underline{x}(n_i)}(\xi_1,\ldots,\xi_i)\, f_{\underline{y}(m_1),\ldots,\underline{y}(m_j)}(\nu_1,\ldots,\nu_j).    (1.122)

Stationarity. A random process \underline{x} is said to be stationary to order i, or ith-order stationary, if

f_{\underline{x}(n_1),\ldots,\underline{x}(n_i)}(\xi_1,\ldots,\xi_i) = f_{\underline{x}(n_1+\Delta),\ldots,\underline{x}(n_i+\Delta)}(\xi_1,\ldots,\xi_i)    (1.123)

for any times n_1, n_2, \ldots, n_i and any \Delta. This means that the joint pdf does not change if we consider any set of i random variables from \underline{x} with the same relative spacings as the original set (which is arbitrary). If \underline{x} is stationary to any order, then it is said to be strict-sense, or strong-sense, stationary (SSS). We will review a weaker form of stationarity below.

Stationarity has important implications for engineering analysis of a stochastic process. It implies that certain statistical properties of the process are invariant with time, making the process more amenable to modeling and analysis. Consider, for example, the case in which \underline{x} is first-order stationary. Then

f_{\underline{x}(n)}(\xi) = f_{\underline{x}(n+\Delta)}(\xi)    (1.124)

for any n and \Delta, from which it follows immediately that every random variable in \underline{x} has the same mean. In this case, it is reasonable to talk about the average of the random process, but in general there are as many averages as random variables in a random process. This leads us to the important issue of ergodicity.

Ergodicity and Temporal Averages. Consider a random process, \underline{x}, known to be first-order stationary. We might find ourselves in the lab with only one realization of the process, say x_1(n), n \in (-\infty, \infty), wondering whether we could somehow estimate the average of \underline{x}, say \mu_x. In principle, we should acquire a large number of realizations and use them to compute an empirical average (estimate) of any random variable, say \underline{x}(n) at time n. (It wouldn't matter which n, since the averages should all be the same due to stationarity.) This estimate, obtained by averaging down through the ensemble at a point, is referred to as an ensemble average. The ensemble average represents an attempt to estimate \mathcal{E}\{\underline{x}(n)\} at time n, hence to estimate the average of the process. Since we do not have an ensemble, it would be tempting to estimate \mu_x by computing a temporal average of the realization, x_1(n),
\mu_{x_1} = \mathcal{A}\{x_1(n)\} \triangleq \lim_{N\to\infty}\frac{1}{2N+1}\sum_{n=-N}^{N} x_1(n).    (1.125)

Note that we have explicitly used a signal name, x_1, as a subscript of \mu to indicate that it has been computed using the realization rather than an ensemble. Note also the operator \mathcal{A}\{\cdot\} used to indicate the long-term time average. This notation will be used consistently:

\mathcal{A}\{\cdot\} \triangleq \lim_{N\to\infty}\frac{1}{2N+1}\sum_{n=-N}^{N}(\cdot).    (1.126)

When will \mu_{x_1}, the time average, equal \mu_x, the ensemble or statistical average? Generally speaking, a random process is ergodic if ensemble averages can be replaced by time averages.²² In our example, if \mu_{x_1} = \mu_x, then \underline{x} is said to be mean-ergodic, since this property holds for the mean. Ergodicity will be an important assumption in our work with speech because we frequently will have only one realization with which to compute averages. In particular, second-order ergodicity will play an important role and we will look more carefully at this concept shortly.

Correlation and Covariance Applied to Random Processes

Consider two random variables, say \underline{x}(n_1) and \underline{x}(n_2), taken from a random process \underline{x}. Recall that the correlation of these two random variables is \mathcal{E}\{\underline{x}(n_1)\underline{x}(n_2)\}. Since the two random variables in this case are drawn from the same random process, we give this the name autocorrelation and feature it with a special notation,

r_{\underline{x}}(n_1, n_2) \triangleq \mathcal{E}\{\underline{x}(n_1)\underline{x}(n_2)\}.    (1.127)

Similarly, the autocovariance function is given by

c_{\underline{x}}(n_1, n_2) \triangleq \mathcal{E}\{[\underline{x}(n_1) - \mathcal{E}\{\underline{x}(n_1)\}][\underline{x}(n_2) - \mathcal{E}\{\underline{x}(n_2)\}]\}.    (1.128)

It is a simple matter to show that

c_{\underline{x}}(n_1, n_2) = r_{\underline{x}}(n_1, n_2) - \mathcal{E}\{\underline{x}(n_1)\}\,\mathcal{E}\{\underline{x}(n_2)\}.    (1.129)

It follows immediately from the definition of stationarity that if the random process \underline{x} is at least second-order stationary, then the value of the autocorrelation does not depend on which two random variables are selected from \underline{x}, but rather on their separation in time. In this case, we adopt the somewhat sloppy, but very conventional, notation in which r_{\underline{x}}(\eta) denotes the autocorrelation of any two random variables in \underline{x} which are separated by \eta in time,

r_{\underline{x}}(\eta) \triangleq \mathcal{E}\{\underline{x}(n)\underline{x}(n - \eta)\} \quad \text{for any } n.    (1.130)

If a random process is ith-order stationary, it is also (i - 1)th-order stationary. Therefore a second-order stationary process is also first-order stationary and has a constant mean,

\mu_x = \mathcal{E}\{\underline{x}(n)\} \quad \text{for any } n.    (1.131)

This leads us to the definition of a weak form of stationarity, which is often sufficient to allow many useful engineering analyses. A random process \underline{x} is said to be wide-sense, or weak-sense, stationary (WSS) if

1. Its autocorrelation is a function of time difference only, as in (1.130).
2. Its mean is constant, as in (1.131).

We note that

SSS \Rightarrow second-order stationarity \Rightarrow WSS,    (1.132)

but neither of the implications reverses except in the special case of jointly Gaussian random variables (see Problem 1.11).

Finally, but very important, note that if \underline{x} is correlation-ergodic, then the autocorrelation can be computed using a temporal average,

r_x(\eta) = \mathcal{A}\{x(n)x(n-\eta)\} = \lim_{N\to\infty}\frac{1}{2N+1}\sum_{n=-N}^{N} x(n)x(n-\eta),    (1.133)

where x(n) is some realization of \underline{x}. This is an extension of the idea of ergodicity discussed above to a second-order case. Note carefully that the subscript on r is a signal name, x, indicating the use of a signal to compute a time average, rather than random variables to compute an ensemble average. We have already introduced this notation, but it is worth reiterating here so that the reader is clear about its significance.

Since speech processing is an applied discipline, we will frequently use temporal, rather than ensemble, averages in our developments. Of course, this is because we have signals, rather than stochastic models, to deal with. On the other hand, there is often much to be gained by modeling speech as a stochastic process. Accordingly, when a speech signal is thought of as a realization of a stochastic process, the underlying process must be assumed to have the appropriate stationarity and ergodicity properties to allow the computation of meaningful temporal statistics.²³

²²This definition of ergodicity is entrenched in engineering textbooks, but it is not strictly accurate [see (Gray and Davisson, 1986, Ch. 7)].

²³A philosophical point is in order here. A moment's thought will reveal that speech, if thought of as a random process, cannot possibly comprise a stationary random process, since speech is a very dynamic phenomenon. This is an indication of the need for "short-term" analytical tools which can be applied to short temporal regions of assumed stationarity. At this point we begin to use formal theory in some rather ad hoc and ad lib ways. Of course, it is often the case in engineering problems that we use formal theories in rather loose ways in practice. However, the ability to understand the implications of our sloppiness, and the ability to predict and explain success in spite of it, depends entirely on our understanding of the underlying formal principles. In this book, we will stress the dependency of ad hoc methods on formal principles.
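The ensemble-versus-temporal distinction in (1.125), (1.126), and (1.133) is easy to exercise numerically. The following sketch (an illustration, not from the text; the first-order autoregressive process is an assumed example) draws one realization of a stationary, ergodic process and checks that finite temporal averages approach the theoretical ensemble mean and autocorrelation.

```python
import random

random.seed(0)
a, N = 0.5, 200_000

# One realization of x(n) = a*x(n-1) + w(n), with w(n) zero-mean,
# unit-variance Gaussian white noise (an assumed example; WSS and ergodic).
x, xs = 0.0, []
for _ in range(N):
    x = a * x + random.gauss(0.0, 1.0)
    xs.append(x)

def time_avg(seq):
    """Finite-length stand-in for the time-average operator of (1.126)."""
    return sum(seq) / len(seq)

mean_est = time_avg(xs)  # estimates the ensemble mean, which is 0 here
r_est = [time_avg([xs[n] * xs[n - k] for n in range(k, N)]) for k in range(3)]

sigma2 = 1.0 / (1.0 - a * a)  # theoretical r(0) for this process
print(mean_est, r_est)
# Theory gives r(eta) = a**abs(eta) * sigma2, about [1.33, 0.67, 0.33]
```

One long realization suffices here precisely because the assumed process is ergodic; for a nonergodic process the temporal averages could converge to values that differ from the ensemble statistics.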
Multiple Random Processes

We now extend these ideas to the case of two random processes. As a natural extension of the concept of stationarity, we have the following: Two random processes, \underline{x} and \underline{y}, are said to be jointly SSS if

f_{\underline{x}(n_1),\ldots,\underline{x}(n_i),\underline{y}(m_1),\ldots,\underline{y}(m_j)}(\xi_1,\ldots,\xi_i,\nu_1,\ldots,\nu_j) = f_{\underline{x}(n_1+\Delta),\ldots,\underline{x}(n_i+\Delta),\underline{y}(m_1+\Delta),\ldots,\underline{y}(m_j+\Delta)}(\xi_1,\ldots,\xi_i,\nu_1,\ldots,\nu_j)    (1.134)

for any i random variables from \underline{x}, any j from \underline{y}, and for any \Delta. It follows that if \underline{x} and \underline{y} are jointly SSS, then each is individually SSS.

From random variables \underline{x}(n_1) and \underline{y}(n_2), chosen from \underline{x} and \underline{y}, respectively, we can form the cross-correlation,

r_{\underline{x}\,\underline{y}}(n_1, n_2) \triangleq \mathcal{E}\{\underline{x}(n_1)\underline{y}(n_2)\},    (1.135)

and the cross-covariance,

c_{\underline{x}\,\underline{y}}(n_1, n_2) \triangleq \mathcal{E}\{[\underline{x}(n_1) - \mathcal{E}\{\underline{x}(n_1)\}][\underline{y}(n_2) - \mathcal{E}\{\underline{y}(n_2)\}]\}.    (1.136)

Similarly to (1.129), we obtain

c_{\underline{x}\,\underline{y}}(n_1, n_2) = r_{\underline{x}\,\underline{y}}(n_1, n_2) - \mathcal{E}\{\underline{x}(n_1)\}\,\mathcal{E}\{\underline{y}(n_2)\}.    (1.137)

As we did in the individual random process case, it will be useful to have a weaker form of stationarity between two random processes. The following conditions are required for \underline{x} and \underline{y} to be declared jointly WSS:

1. \underline{x} and \underline{y} are individually WSS;
2. r_{\underline{x}\,\underline{y}}(n_1, n_2) is a function of \eta = n_2 - n_1 only.

It is easy to show that joint SSS implies joint WSS (but not the converse). Also, simply by definition, we see that joint WSS implies individual WSS, but, again, the converse is not generally true.

As an extension of the concept of ergodicity to the joint random process case, we note that the cross-correlation can be computed using a temporal average over two realizations if the processes are jointly correlation-ergodic:

r_{xy}(\eta) = \mathcal{A}\{x(n)y(n-\eta)\} = \lim_{N\to\infty}\frac{1}{2N+1}\sum_{n=-N}^{N} x(n)y(n-\eta).    (1.138)

Power Density Spectrum

Single Random Process. A general discussion of this important topic is unnecessary for our work with speech and would take us too far afield. We refer the reader to the textbooks in Appendix 1.B for a general background. For our purposes, it is sufficient to define the power density spectrum (PDS) of a WSS random process \underline{x} as the DTFT of its autocorrelation function,

\Gamma_{\underline{x}}(\omega) \triangleq \sum_{\eta=-\infty}^{\infty} r_{\underline{x}}(\eta) e^{-j\omega\eta}.    (1.139)

Accordingly, the autocorrelation can be computed from the power density spectrum as

r_{\underline{x}}(\eta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Gamma_{\underline{x}}(\omega) e^{j\omega\eta}\, d\omega.    (1.140)

If \underline{x} is also correlation-ergodic and the autocorrelation is computed using time averaging, then, according to our convention, the subscripts will denote realizations. For example,

\Gamma_x(\omega) = \sum_{\eta=-\infty}^{\infty} r_x(\eta) e^{-j\omega\eta}.    (1.141)

The total²⁴ power in a second-order stationary real random process is defined as

P_{\underline{x}} \triangleq \mathcal{E}\{\underline{x}^2(n)\} \quad \text{for any } n.    (1.142)

To make sense of this definition, we recall the definition of the power in a signal, which according to (1.11) is given by²⁵

P_x = \mathcal{A}\{|x(n)|^2\}.    (1.143)

If x(n) happens to be a realization of \underline{x}, and \underline{x} is second-order ergodic, then we see that these two computations are equivalent.

As an aside, we recall that realizations of stationary, ergodic, stochastic processes were listed as a class of power signals in Section 1.2.3. Indeed, we now can appreciate that this is the case. If x(n) is such a realization and is not a power signal, then

P_{\underline{x}} = P_x = 0 \text{ or } \infty    (1.144)

and we encounter a contradiction.

Now that the definition of P_{\underline{x}} makes sense, we note that

P_{\underline{x}} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Gamma_{\underline{x}}(\omega)\, d\omega.    (1.145)

²⁴The word total is used here to connote that the power in all frequencies is considered.

²⁵The absolute value signs appear here because x(n) was assumed to be complex-valued in general in definition (1.11). Since we have focused exclusively upon real random processes, they are superfluous in this discussion.
This result follows immediately from definitions and says that the scaled total area under \Gamma_{\underline{x}}(\omega) yields the total power in \underline{x}, making it a sort of density of power on frequency, much like the pdf is a probability density on its variable of interest. In fact, to find the power in any frequency range, say \omega_1 to \omega_2, for \underline{x} we can compute

\text{Power in } \underline{x} \text{ in frequencies } \omega_1 \text{ to } \omega_2 = \frac{1}{\pi}\int_{\omega_1}^{\omega_2} \Gamma_{\underline{x}}(\omega)\, d\omega.    (1.146)

Finally, we remark that some stochastic processes have all of their power concentrated at discrete frequencies. For example, a process \underline{x} whose random variables are \{\underline{x}(n) = \cos(\omega_0 n + \underline{\theta}), n \in (-\infty, \infty)\}, with \underline{\theta} a random variable, will have all power concentrated at frequency \omega_0. In this case, the autocorrelation (ensemble or temporal) will be periodic with the same frequency, and we must resort to the use of impulses in the PDS, much like our work with the PDS for a periodic deterministic process.

Two Random Processes. Let us focus here on jointly WSS random processes, \underline{x} and \underline{y}, with cross-correlation r_{\underline{x}\,\underline{y}}(\eta). In this case the cross-power spectral density is given by

\Gamma_{\underline{x}\,\underline{y}}(\omega) = \sum_{\eta=-\infty}^{\infty} r_{\underline{x}\,\underline{y}}(\eta) e^{-j\omega\eta}.    (1.147)

We can compute the cross power between the two processes,

P_{\underline{x}\,\underline{y}} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Gamma_{\underline{x}\,\underline{y}}(\omega)\, d\omega,    (1.148)

which is interpretable as the power that the two random processes generate over and above their individual powers due to the fact that they are correlated.

Noise

Realizations of stochastic processes often occur as unwanted disturbances in engineering applications and are referred to as noise. Even when the stochastic signal is not a disturbance, we often employ the term noise. Such will be the case in our speech work, for example, when a noise process appears as the driving function for a model for "unvoiced" speech sounds. (Consider, e.g., the sound that the letter "s" implies.)

One of the most important forms of noise in engineering analysis is (discrete-time) white noise, defined as a stationary process, say \underline{w}, with the property that its power density spectrum is constant over the Nyquist range,

\Gamma_{\underline{w}}(\omega) = \sigma_w^2 \quad \text{for } \omega \in [-\pi, \pi).    (1.149)

Accordingly, the autocorrelation function for white noise is

r_{\underline{w}}(\eta) = \sigma_w^2\,\delta(\eta).    (1.150)

The reader is cautioned to distinguish between continuous-time white noise and the phenomenon we are discussing here. Continuous white noise has infinite power and a flat power density spectrum over all frequencies. Just as the discrete-time impulse cannot be considered as samples of the continuous-time impulse, so discrete-time white noise should not be considered to be samples of continuous-time white noise. In fact, discrete-time white noise may be thought to represent samples of a continuous-time stochastic process which is bandlimited to the Nyquist range and which has a flat power density spectrum over that range.

Random Processes and Linear Systems

It will be useful for us to review a few key results concerning the analysis of LTI discrete-time systems with stochastic inputs. Let us restrict this discussion to WSS, second-order ergodic, stochastic processes.

Consider first an LTI system with discrete-time impulse response h(n). Suppose that x(n), a realization of random process \underline{x}, is input to the system. The output, say y(n), is given by the convolution sum,

y(n) = \sum_{i=-\infty}^{\infty} x(n-i)h(i).    (1.151)

Of course, the same transformation occurs on the input no matter which realization of \underline{x} it happens to be. We could denote this fact by replacing x(n-i) by its corresponding random variable, \underline{x}(n-i), on the right side of (1.151). Without a rigorous argument,²⁶ it is believable that the mapping of these random variables by the convolution sum will produce another random variable (for a fixed n), \underline{y}(n), so we write

\underline{y}(n) = \sum_{i=-\infty}^{\infty} \underline{x}(n-i)h(i).    (1.152)

As n varies, a second random process is created at the output, \underline{y}. We have assumed \underline{x} to be WSS and second-order ergodic. Let us show that the same is true of \underline{y}.

By applying the expectation operator to both sides of (1.152) and interchanging the order of summation on the right, we have

\mathcal{E}\{\underline{y}(n)\} = \sum_{i=-\infty}^{\infty} \mathcal{E}\{\underline{x}(n-i)\}h(i)    (1.153)

or

²⁶This argument centers on concepts of stochastic convergence that are treated in many standard textbooks (see books in Appendix 1.B not labeled "elementary").
\mu_{\underline{y}} = \mu_{\underline{x}} \sum_{i=-\infty}^{\infty} h(i).    (1.154)

Since this result does not depend on n, we see that \underline{y} is stationary in the mean. A similar result obtains with the ensemble means replaced by the temporal means \mu_x and \mu_y if we begin with (1.151) and use temporal averages, so that \underline{y} is also ergodic in the mean.

In a similar way (see Problem 1.14) we can show that the autocorrelation of \underline{y} is dependent only on the time difference in the arguments and is given by

r_{\underline{y}}(\eta) = \sum_{i=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} h(i)h(k)\, r_{\underline{x}}(\eta + k - i),    (1.155)

or, in terms of temporal autocorrelations,

r_y(\eta) = \sum_{i=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} h(i)h(k)\, r_x(\eta + k - i).    (1.156)

We conclude, therefore, that a WSS correlation-ergodic input to an LTI system produces a WSS correlation-ergodic output. This is a fundamental result that will be used implicitly in many places in our work.

Finally, we recall the important relationship between the input and output power spectral densities in the case of LTI systems with WSS inputs,

\Gamma_{\underline{y}}(\omega) = |H(\omega)|^2\, \Gamma_{\underline{x}}(\omega).    (1.157)

This result is derived by taking the DTFT of both sides of (1.155).

1.2.4 Vector-Valued Random Processes

At several places in this book, we will encounter random processes that are vector-valued. A vector-valued random process \underline{\mathbf{x}} is a collection of random vectors indexed by time,²⁷

\underline{\mathbf{x}} \triangleq [\ldots, \underline{\mathbf{x}}(-1), \underline{\mathbf{x}}(0), \underline{\mathbf{x}}(1), \ldots].    (1.158)

Realizations of these random processes comprise vector-valued signals of the form

[\ldots, \mathbf{x}(-1), \mathbf{x}(0), \mathbf{x}(1), \ldots],    (1.159)

which we customarily denote simply \mathbf{x}(n). (Note: We are now employing boldface to indicate vector quantities.)

These sequences will arise in two different ways in our work. In the first case, the elements of each random vector will be random variables representing scalar signal samples, which for some reason are conveniently packaged into vectors. For example, suppose we have a scalar signal (random process) \underline{x} = \{\ldots, \underline{x}(-1), \underline{x}(0), \underline{x}(1), \ldots\}. We might find it necessary to break the signal into 100-point blocks for coding purposes, thereby creating a vector random process

\underline{\mathbf{x}}(0) = [\underline{x}(0)\ \underline{x}(1)\ \cdots\ \underline{x}(99)]^T, \quad \underline{\mathbf{x}}(1) = [\underline{x}(100)\ \underline{x}(101)\ \cdots\ \underline{x}(199)]^T, \quad \underline{\mathbf{x}}(2) = [\underline{x}(200)\ \underline{x}(201)\ \cdots\ \underline{x}(299)]^T, \ldots.    (1.160)

Note that the "time" indices of the vector random process represent a reindexing of the sample times of the original random process.

A second type of vector random process will result from the extraction of vector-valued features from frames of speech. We might, for example, extract 14 features from 160-point frames of speech. These frames may be overlapping, as shown in Fig. 1.12. In some cases we might choose to index the resulting random vectors by the end-times of the frames; in others we might reindex the vector sequence using consecutive integers. In either case, it is clear that the vector sequence comprises a vector-valued random process.

For a vector random process, the mean vector takes the place of the mean in the scalar case, and the autocorrelation matrix plays the role of the autocorrelation. These are

\boldsymbol{\mu}_{\underline{\mathbf{x}}}(n) \triangleq \mathcal{E}\{\underline{\mathbf{x}}(n)\}    (1.161)

and

\mathbf{R}_{\underline{\mathbf{x}}}(n_1, n_2) \triangleq \mathcal{E}\{\underline{\mathbf{x}}(n_1)\underline{\mathbf{x}}^T(n_2)\},    (1.162)

respectively. Note that the mean vector contains the mean of each of the component random variables, and the correlation matrix contains the cross-correlations between each component pair in the vectors. We can also speak of the covariance matrix of the vector random process \underline{\mathbf{x}}, defined as

\mathbf{C}_{\underline{\mathbf{x}}}(n_1, n_2) \triangleq \mathcal{E}\{[\underline{\mathbf{x}}(n_1) - \boldsymbol{\mu}_{\underline{\mathbf{x}}}(n_1)][\underline{\mathbf{x}}(n_2) - \boldsymbol{\mu}_{\underline{\mathbf{x}}}(n_2)]^T\}.    (1.163)

When the vector random process is WSS, we have a stationary mean vector, and correlation and covariance matrices that depend only on time difference. These are defined, for an arbitrary n, as follows:

²⁷Again, we will restrict our attention to real processes, but the complex case is a simple generalization.
\boldsymbol{\mu}_{\underline{\mathbf{x}}} = \boldsymbol{\mu}_{\underline{\mathbf{x}}}(n) \triangleq \mathcal{E}\{\underline{\mathbf{x}}(n)\}    (1.164)

\mathbf{R}_{\underline{\mathbf{x}}}(\eta) \triangleq \mathbf{R}_{\underline{\mathbf{x}}}(n, n-\eta) = \mathcal{E}\{\underline{\mathbf{x}}(n)\underline{\mathbf{x}}^T(n-\eta)\}    (1.165)

\mathbf{C}_{\underline{\mathbf{x}}}(\eta) \triangleq \mathbf{C}_{\underline{\mathbf{x}}}(n, n-\eta) = \mathcal{E}\{[\underline{\mathbf{x}}(n) - \boldsymbol{\mu}_{\underline{\mathbf{x}}}][\underline{\mathbf{x}}^T(n-\eta) - \boldsymbol{\mu}_{\underline{\mathbf{x}}}^T]\}.    (1.166)

Frequently, we are specifically interested in the "zero lag" correlation (or covariance) matrix of a stationary vector random process, which plays the role of the variance of the process. For this case, we will write

\mathbf{R}_{\underline{\mathbf{x}}} \triangleq \mathbf{R}_{\underline{\mathbf{x}}}(0)    (1.167)

and

\mathbf{C}_{\underline{\mathbf{x}}} \triangleq \mathbf{C}_{\underline{\mathbf{x}}}(0)    (1.168)

for simplicity. The reader should carefully compare these notations with (1.116) and (1.117) and discern the difference in meaning.

Finally, we note that there are temporal versions of these three key statistical matrices that are meaningful when appropriate ergodicity conditions hold.

[Figure 1.12: overlapping analysis frames f(i), f(i+1), f(i+2) extracted from a speech waveform (amplitude versus time, n).]

FIGURE 1.12. A vector random process created by extracting vector-valued features from frames of a speech process at periodic intervals. Note: Here we index the features by sequential integers. Later we will establish the convention of indexing features by the time of the leading edge of the sliding window.

1.3 Topics in Statistical Pattern Recognition

Reading Note: Most of the material in this section will not be used until Parts IV and V. The exception is Section 1.3.1, which will first be encountered in Chapter 5.

As in the previous two subsections of this chapter, the material treated here represents a very small sampling of a vast research discipline, with a focus on a few topics which will be significant to us in our speech processing work. Unlike the other two subsections, however, we make no assumption here or in the main text that the reader has a formal background in pattern recognition beyond a casual acquaintance with certain ideas that are inherent in general engineering study. A few example textbooks from this field are listed in Appendix 1.C.

Much of speech processing is concerned with the analysis and recognition of patterns and draws heavily on results from this field. Although many speech processing developments can be successfully understood with a rather superficial knowledge of pattern recognition theory, advanced research and development are not possible without a rigorous understanding. A few advanced speech processing topics in this book will need to be left to the reader's further pursuit, since it is not intended to assume this advanced pattern recognition background, nor is it possible to provide it within the scope of the book.

There are two main branches of pattern recognition, statistical and syntactic.²⁸ Generally speaking, the former deals with statistical relationships among features in a pattern, while the latter approaches patterns as structures that can be composed of primitive patterns according to a set of rules. Although these branches are not exactly distinct, they are quite different in philosophy. In our work, the use of the latter is confined to the special problem of language modeling in automatic speech recognition.
valued random process, or perhaps vectors drawn from two random processes. For the sake of discussion, let us just refer to x and y.

It is sufficient for us to be concerned with vectors drawn from Cartesian spaces. The N-dimensional real Cartesian space, denoted R^N, is the collection of all N-dimensional vectors with real elements. A metric, d(·,·), on R^N is a real-valued function with three properties: For all x, y, z ∈ R^N,

1. d(x, y) ≥ 0.
2. d(x, y) = 0 if and only if x = y.
3. d(x, y) ≤ d(x, z) + d(z, y).

These properties coincide well with our intuitive notions about a proper measure of distance. Indeed, a metric is often used as a distance measure in mathematics and in engineering.^29

Any function that meets the properties in the definition above is a legitimate metric on the vector space. Accordingly, there are many metrics, each having its own advantages and disadvantages. Most of the true metrics that we use in speech processing are particular cases of the Minkowski metric, or close relatives. This metric is defined as follows: Let x_k denote the kth component of the N-vector x. Then the Minkowski metric of order s, or the l_s metric, between vectors x and y is

d_s(x, y) = [Σ_{k=1}^{N} |x_k − y_k|^s]^{1/s}.  (1.177)

We should note that the l_s norm of a vector x, denoted ||x||_s, is defined as

||x||_s ≜ [Σ_{k=1}^{N} |x_k|^s]^{1/s}.  (1.178)

It follows immediately that the l_s metric between the vectors x and y is equivalent to the l_s norm of the difference vector x − y,

d_s(x, y) = ||x − y||_s.  (1.179)

An important generalization of the Euclidean metric is called variously the weighted Euclidean, weighted l_2, or quadratic metric,^30

d_2W(x, y) ≜ √([x − y]^T W [x − y]),  (1.180)

where W is a positive definite matrix that can be used for several purposes discussed below.

Before proceeding, we should be careful to point out that, in theoretical discussions, we might wish to discuss the distance between two stochastic vectors, say x and y. In this case we might write, for example, something like

^28 In fact, syntactic pattern recognition has its roots in the theory of formal languages that was motivated by the study of natural languages (see Chapter 13).
^29 We will, however, encounter some distance measures later in the book that are not true metrics.
^30 The quadratic metric is often defined without the square root, but we employ the square root to make the distance more parallel to the Euclidean metric.
d_2W(x, y) = √([x − y]^T W [x − y]).  (1.181)

The Euclidean distance between these representations is

where

a = x_1 p_1 + x_2 p_2,  (1.184)

We note that the distance would be "incorrect" even if the new basis vectors were orthogonal but not normalized.
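A quick numerical sketch of the metrics defined in (1.178) through (1.180) may be helpful. The code below assumes NumPy; the sample vectors and the identity weighting matrix are our own illustrative choices, not values from the text:

```python
import numpy as np

def minkowski_metric(x, y, s):
    """l_s metric of order s: the l_s norm of the difference x - y, cf. (1.179)."""
    return float(np.sum(np.abs(x - y) ** s) ** (1.0 / s))

def weighted_euclidean(x, y, W):
    """Quadratic (weighted Euclidean) metric of (1.180); W positive definite."""
    d = x - y
    return float(np.sqrt(d @ W @ d))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(minkowski_metric(x, y, 1))            # 7.0 (city-block, s = 1)
print(minkowski_metric(x, y, 2))            # 5.0 (Euclidean, s = 2)
print(weighted_euclidean(x, y, np.eye(3)))  # 5.0 (W = I recovers l_2)
```

With W equal to the identity matrix, the quadratic metric collapses to the ordinary l_2 metric, as the last two printed values confirm.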
What is "wrong" in the second case is the basis upon which the representations are assigned coordinates. The two "coordinates" with the second assignment of basis vectors are not distinct information. Moving in the direction of p'_1 also includes motion in the direction of p'_2, and conversely. Only when these bases are made to correspond to distinct (orthonormal) pieces of information does our sense of distance come back into focus and the Euclidean distance become meaningful. Algebraically, if the basis vectors were made to correspond to a proper orthonormal set, then the Euclidean distance would be appropriate. This would require that we transform the vector representations x' and y' to their representations on a "proper" set of basis vectors before computing the Euclidean distance. In this case, let us just choose to go back to the orthonormal basis set p_1 and p_2, in which case we know that x' and y' are transformed back to the original x and y. Let us call the transformation V. We have, then,

x = Vx'  and  y = Vy'.  (1.189)

In this contrived example, V can be found from (1.189) using simple algebra, because we happen to know what the transformed vectors are. In general, however, finding the transformed representation of a vector corresponding to a change of basis is a simple generalization of the following [see, e.g., (Chen, 1984, p. 17)]: We take the two columns of V to be the representation of p'_1 with respect to basis set {p_1, p_2}, and the representation of p'_2 with respect to {p_1, p_2}, respectively.

Now consider computing the Euclidean distance of the transformed vectors to obtain a meaningful measure of their distance apart,

d_2(Vx', Vy') = √([Vx' − Vy']^T [Vx' − Vy'])
             = √([x' − y']^T V^T V [x' − y'])  (1.190)
             = d_2W(x', y').

The last line in (1.190) denotes the weighted Euclidean distance with weighting matrix W = V^T V. We see that the "meaningful" Euclidean distance for the vectors whose bases are not conducive to proper distance computation can be obtained by using a weighting matrix equivalent to the "square" of the transformation matrix.

An important point to note is that a linear transformation of coordinates does not change the rank ordering of distances from some reference vector. If in the above, for example, there were some vector z' such that

d_2(x', z') < d_2(y', z'),  (1.191)

then it would be true that

d_2(x, z) < d_2(y, z).  (1.192)

Whereas we would want a transformation to make distances more meaningful, intuitively we would not want the ordering of distances to be changed.

In effect, what we have done in the above example is removed the redundant information in the "bad" vector representations that skews our sense of how naturally far apart they are. This is accomplished by linear transformation of the space, or, equivalently, weighting of the distance metric. This example was meant to build intuition about a more realistic and important problem in pattern recognition. We often encounter (random) vectors of features whose elements are highly correlated, or inappropriately scaled. The correlation and scaling effects will occur, for example, when multiple measurements are made on the same process and mixed in the same feature vector. For example, we might measure the average number of zero crossings^31 per norm-sec in a speech frame, and also the average energy. Clearly, there is no reason to believe that these numbers will have similar magnitudes in a given frame, since they represent quite different measurements on the sequence. Suppose, for example, that in one frame we measure 240 "joules," and 0.1 zero crossing per norm-sec. In the next, we measure 300 and 0.05. Are these vector representations based on an appropriate orthonormal basis set so that Euclidean distances are meaningful? This answer could be argued either way, but the question is really academic. Our satisfaction with the distance measure here will depend upon how faithfully it reflects the difference in the frames in light of the measurements. So let us explore the question: Do these two frames represent the same sound? If so, we would like the distance to be small.

In answering this question, we should notice two things about the measurements. First, there could be less information in these measurements than we might assume. It could be the case that zero crossings tend to decrease when energy increases (correlation) so that the combination of changes does not make the two frames as different as the outcome might suggest. This point is reminiscent of the nonorthonormal basis case above. Second, note that the zero crossing measure is so relatively small in amplitude that its effect on the distance is negligible. In order for this feature to have more "discriminatory power" (which does not potentially get lost in numerical roundoff errors^32), the relative scale of the features must be adjusted. (This corresponds to basis vectors of grossly different lengths, orthogonal or not.) An approach to solving this scaling problem is to simply normalize the feature magnitudes so that each has unity variance. Presumably, smaller features will have smaller variances (and conversely) and this will tend to bring the measurements into an appropriate relative scale. The "decorrelation" process is also not difficult; in fact, the scaling can be accomplished simultaneously using the following.
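Before developing that procedure, the change-of-basis discussion above can be checked numerically. In this sketch (NumPy assumed), the matrix V and the primed representations are our own invented values; the point is only that weighting the metric with W = V^T V, as in (1.190), gives exactly the plain Euclidean distance computed after transforming back to the orthonormal basis:

```python
import numpy as np

# Hypothetical change of basis: the columns of V are the "bad" basis vectors
# p1', p2' expressed on the orthonormal basis {p1, p2}
V = np.array([[1.0, 0.8],
              [0.0, 0.6]])

x_p = np.array([1.0, 2.0])    # representation x' on the bad basis
y_p = np.array([3.0, -1.0])   # representation y'

# Route 1: transform back to the orthonormal basis, then use the plain l_2 metric
d_plain = float(np.linalg.norm(V @ x_p - V @ y_p))

# Route 2: stay on the bad basis and weight the metric with W = V^T V, (1.190)
W = V.T @ V
diff = x_p - y_p
d_weighted = float(np.sqrt(diff @ W @ diff))

print(np.isclose(d_plain, d_weighted))   # True
```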
Suppose that the feature vectors between which we are trying to compute a distance are x' and y'. Each is an outcome of random vector x' with mean μ_x' and covariance matrix C_x'. We would like to transform the original random variable x' to a representation, say x, in which all components are uncorrelated and are individually of unity variance. This means that the new covariance matrix C_x should equal I, where I is the identity matrix. According to the heuristic arguments above, Euclidean distance computed on these vectors will then be intuitively appealing. As in the simple vector example above, we will show that the proper Euclidean distance can be computed using an appropriate weighting matrix in the computation.

The requisite transformation on the feature vectors is easily discovered by focusing on the covariance matrix. Since C_x' is a symmetric matrix, it can be written [see, e.g., (Noble, 1969, Ch. 10)]

C_x' = Φ Λ Φ^T,  (1.193)

where Φ is an orthogonal matrix whose columns are the normalized eigenvectors of C_x', and Λ is a diagonal matrix of eigenvalues of C_x'. Therefore,

Φ Λ Φ^T = E{[x' − μ_x'][x' − μ_x']^T},  (1.194)

from which it follows that

= d_2W(x', y').

We see again that a meaningful Euclidean distance between correlated feature vectors can be computed if an appropriate weight is used. It is worth noting that the weighted Euclidean distance which has arisen here is very similar to the Mahalanobis distance that we discuss below.

The linear operation applied to the feature vectors in this procedure is frequently referred to as a prewhitening transformation, since it produces feature vectors whose components are uncorrelated and normalized. This terminology is somewhat abusive because the "white" concept applies to (usually scalar) random processes that are not being considered here, and also because "white" features would have zero means. This latter point would require, for a better analogy, that R_x' rather than C_x' be I. Nevertheless, the terminology is widely used and is well understood by signal processing engineers.

Simplifications of the prewhitening procedure are sometimes employed to avoid the computational expense of using the full covariance matrix in the distance expression. The most common simplification is to assume that the features are mutually uncorrelated, but inappropriately scaled relative to one another. In this case C_x' (it is assumed) has the form

C_x' = Λ,  (1.198)

where Λ is a diagonal matrix whose diagonal elements in general are unequal. The transformation that need be done on each incoming vector is represented by Λ^{−1/2} x'. This amounts to simply normalizing each feature to its standard deviation so that all features may contribute equally to the distance.

c' = argmax_c P(c = c | x = x).  (1.199)

Unfortunately, the training process usually does not permit characterization of the probabilities P(c = c | x = x). Instead what we learn is the probability that a given class will generate certain feature vectors, rather than
P(c = c | x = x) = P(c = c, x = x) / P(x = x)  (1.201)

and

P(x = x | c = c) = P(c = c, x = x) / P(c = c),  (1.202)

from which we have

P(c = c | x = x) = P(x = x | c = c) P(c = c) / P(x = x).  (1.203)

Clearly, the choice of c that maximizes the right side will also be the choice of c that maximizes the left side. Therefore,

c' = argmax_c P(c = c | x = x) = argmax_c P(x = x | c = c) P(c = c).  (1.204)

Furthermore, if the class probabilities are equal,

P(c = c) = 1/K,  c = 1, 2, ..., K,  (1.205)

then

c' = argmax_c P(c = c | x = x) = argmax_c P(x = x | c = c).  (1.206)

Therefore, under the condition of equal a priori class probabilities, the class decision

c' = argmax_c P(x = x | c = c)  (1.207)

is equivalent to the more desirable (1.199) for which we do not have probability distributions.

A quantity related to the probability of an event which is used to make a decision about the occurrence of that event is often called a likelihood measure. Hence, our decision rule based on given feature vector x is to choose the class c that maximizes the likelihood P(x = x | c = c). This is called the maximum likelihood decision.

There is an implicit assumption in the discussion above that the random feature vector may only assume one of a finite number of outcomes. This is evident in the writing of the probability distribution P(x = x | c = c). Where this is not the case, it is frequently assumed that feature vectors associated with a given class are well modeled by a multivariate Gaussian distribution [cf. (1.118)],

where x denotes the N-vector of arguments (features) [x_1, ..., x_N]^T, and C_x|c and μ_x|c are the class-conditional covariance matrix and mean vector. Without belaboring the issue, it is believable based on our previous discussion that an appropriate likelihood measure for this case is the class-conditional density f_x|c(x|c). The class decision is based on maximizing the likelihood,

c' = argmax_c f_x|c(x|c).  (1.209)

We can rid ourselves of the need to compute the exponential by electing instead to maximize ln f_x|c(x|c). This leads to the decision rule

c' = argmin_c {[x − μ_x|c]^T C_x|c^{−1} [x − μ_x|c] + ln det C_x|c}.  (1.210)

Note that the maximization has become a minimization because we have removed a superfluous minus sign from the computation. Notice also that the first term on the right has the form of a weighted Euclidean distance. Let us further develop this point.

The term on the right side of (1.210) is sometimes considered a distance between the given feature vector and the cth class mean, μ_x|c. Accordingly, it provides a measure of "how far x is from class c." For generality, let us replace the specific outcome of the feature vector, x, with its random variable, x, and define the maximum likelihood distance as

d_ml(x, μ_x|c) = [x − μ_x|c]^T C_x|c^{−1} [x − μ_x|c] + ln det C_x|c.  (1.211)

We see that for a multiclass, multivariate Gaussian feature problem, choosing the class that minimizes this distance is equivalent to choosing the maximum likelihood class.

A simplification occurs when all classes share a common covariance matrix, say

C_x ≜ C_x|1 = C_x|2 = ... = C_x|K.  (1.212)

In this case C_x|c can be replaced by C_x in (1.211) and the final ln[·] can be ignored, since it simply adds a constant to all distances. In this case, we obtain

d_M(x, μ_x|c) = [x − μ_x|c]^T C_x^{−1} [x − μ_x|c].  (1.213)

This distance is frequently called the Mahalanobis distance (Mahalanobis, 1936). We see that for a multiclass, multivariate Gaussian feature
problem in which the classes share a common covariance matrix (the way in which features are correlated is similar across classes), choosing the class to which the given feature vector is closest in the sense of the Mahalanobis distance is tantamount to choosing the maximum likelihood class.

Interestingly, we have come full circle in our discussion, for it is apparent that the Mahalanobis distance is nothing more than a "covariance weighted" (squared) Euclidean distance^33 between the feature vector and a special set of deterministic vectors, the means of the classes. Nevertheless, the name Mahalanobis distance is often applied to this distance in this special maximum likelihood problem. Based on our previous discussion, it should be apparent that the Mahalanobis distance represents an appropriate use of the l_2 metric, since the inverse covariance weighting removes the correlation among the features in the vectors.

1.3.4 Feature Selection and Probabilistic Separability Measures

In the preceding section, we discussed a general problem in which a feature vector was associated with one of a number of classes. A subject that we avoided was the selection of features (this process is often called feature extraction) and their evaluation in terms of classification performance. These tasks are inseparable, since performance evaluation is often integrated into the search for appropriate features. In this section we make a few brief comments about these issues. One of the objectives is to let the reader know what material is not being covered with regard to this topic, and why. Another is to touch on the subject of probabilistic separability measures and entropy measures, and to explain their specific relationship to speech processing.

Feature selection and evaluation is a vast subject on which much research has been performed and many papers and books written. To attempt to address this subject in any detail would take us too far afield from the main subject of this book. Several excellent textbooks address this field authoritatively and in detail, and we refer the reader to these books and the research literature for detailed study.^34 Second, the importance of feature evaluation procedures is diminished relative to the early days of speech processing. Although statistical pattern recognition techniques are central to the operation and performance of many speech processing tasks (particularly speech recognition), decades of research and development have led to convergence on a few (spectrally based) features that perform well, and appear to be enduring. This is not to say that new features have not been tried, and that the field is not evolving. Indeed, we have seen, for example, "cepstral" type of features supplant the "LP" type parameters in certain speech recognition tasks during the 1980s. This shift, however, was between two closely related sets of features and was to some extent motivated by computational expediencies. Further, the most frequently cited study behind this shift relies on experimental evidence of improved recognition performance (Davis and Mermelstein, 1980). Although the course of research is very unpredictable, for the foreseeable future, there appear to be no compelling problems that will demand a deep analysis of features.

As if to contradict the statement above, lurking in one little corner of our study (Section 12.2.7) we will mention some directions in speech recognition research that are based on the notions of probabilistic separability and entropy measures. These measures are customarily encountered in the advanced study of feature extraction and evaluation. We conclude this section by broadly discussing the types of feature evaluation, putting the topic of probabilistic separability measures and entropy measures into perspective.

Probabilistic Distance Measures

Ideally, features would be evaluated on their performance in terms of minimizing the rate of classification error. However, error rate is generally a very difficult quantity to evaluate, and other techniques must be employed. Almost all commonly used techniques for feature evaluation involve some attempt to measure the separation of classes when represented by the features.

The simplest techniques for measuring class separation (or interclass distance) are based on distance metrics in multidimensional space, especially the Euclidean distance and its variants, which we discussed extensively above. These measures generally do not utilize much of the probabilistic structure of the classes and therefore do not faithfully represent the degree of overlap of the classes in a statistical sense. The probabilistic separability measures represent an attempt to capture that information in the evaluation. There are two related types of probabilistic separability measures, the "probabilistic distances" and the "probabilistic dependencies."

To illustrate what is meant by a "probabilistic distance," consider the two-class problem for which class-conditional pdf's are shown for two different features, x and y, in Fig. 1.14. Let us assume that the a priori class probabilities are equal, P(c = 1) = P(c = 2). In the first case features (scalars, so we can draw a picture in two dimensions) characterized by random variable x are employed, and f_x|c(x|1) and f_x|c(x|2) are well separated with respect to the feature values. The classes appear to be almost fully separable based on these densities. On the other hand, when

^33 Again, we could introduce a square root into the definition to make this distance exactly a Euclidean metric as defined in (1.180), but that would be breaking with convention. The Mahalanobis distance is almost invariably defined without the square root, and it should be clear that for the maximum likelihood problem, whether the distance is squared or not is of no consequence.
^34 For example, see the textbooks in Appendix 1.C and the IEEE Transactions on Pattern Analysis and Machine Intelligence.
features y are used, the separation is extremely poor. In this case f_y|c(y|1) and f_y|c(y|2) are identical and the classes would be completely inseparable based on this feature. That is, this feature would provide no better performance than simply guessing, or random assignment, of the class identity.

Probabilistic distance measures attempt to capture the degree of overlap of the class pdf's as a measure of their distance apart. In general, these measures take the form

both of which reduce to a Mahalanobis-like distance in the case of Gaussian feature vectors and equal class covariances (see Problem 1.19). Conversely, when the features depend very strongly on their class association, we expect f_x|c(x|c) to be quite different from the mixture pdf. Therefore, a good indicator of the effectiveness of a set of features at separating the classes is given by the probabilistic dependence measures, which quantify the difference between the class-conditional pdf's and the mixture pdf. These measures adhere to the same properties noted above for the probabilistic distance measures and are generally of the form

M(c, x) = Σ_{c=1}^{K} P(c = c) Σ_{l=1}^{L} P(x = x_l | c = c) log_2 [ P(x = x_l, c = c) / (P(x = x_l) P(c = c)) ]
= Σ_{c=1}^{K} Σ_{l=1}^{L} P(x = x_l, c = c) log_2 [ P(x = x_l, c = c) / (P(x = x_l) P(c = c)) ].  (1.220)

This measure, which can be seen to be an indicator of the average deviation of f_x|c(x|c) from f_x(x) [or P(x|c) from P(x)], will be given another interpretation when we discuss entropy concepts in Section 1.4.

Entropy Measures

Entropy measures are based on information-theoretic concepts that quantify the amount of uncertainty associated with the outcome of an experiment. In the pattern recognition context, these measures relate how much uncertainty remains about the class membership once a feature measurement is made. This knowledge quantifies the effectiveness of a set of features at conveying information that assists classification. Although we will have no direct use for entropy measures in this book, we will have several occasions to use the concepts of information and entropy. We will therefore address these issues in the next section, and, for completeness, include some comments on entropy measures in pattern recognition at the end of that section.

1.3.5 Clustering Algorithms

The previous discussions were based on the assumption that labeled (according to class) training features were available from which to infer the underlying probability structure of the classes. In some problems, however, information about the class membership of the training vectors is not provided. It is possible that we might not even know the number of classes represented by the training features. The problem of automatically separating training data into groups representing classes is often solved by a clustering algorithm.

The process of clustering is part of a more general group of techniques commonly referred to as unsupervised learning. As the name would imply, unsupervised learning techniques are concerned with the problem of forming classes from training data without benefit of supervision regarding class membership. Within this group of techniques, clustering algorithms represent a rather ad hoc approach to learning classes, which do not attempt to employ deep analysis of the statistical structure of the data. The more formal unsupervised learning methods are called mode separation techniques (Devijver and Kittler, 1982, Ch. 10), and we shall not have any use for these methods in our study of speech. Rather, clustering methods are based on the heuristic argument that vectors representing the same class should be "close" to one another in the feature space and "far" from vectors representing other classes. Accordingly, one of the distance metrics discussed above is usually employed in the analysis.

There are two basic classes of clustering algorithms. In dynamic clustering, a fixed number of clusters (classes) is used. At each iteration, feature vectors are reassigned according to certain rules until a stable partitioning of the vectors is achieved. We give an important example below. In hierarchical clustering, each feature vector is initially a separate cluster; then at each step of the algorithm, the two most similar clusters (according to some similarity criteria) are merged until the desired number of clusters is achieved.

There are a variety of clustering algorithms, but we focus on only one example of an iterative approach which is widely used in speech processing for a number of tasks. This is usually called the K-means algorithm, but the "K" simply refers to the number of desired classes and can be replaced by any desired index. The operation of the K-means algorithm is straightforward. Feature vectors are continuously reassigned to clusters, and the cluster centroids updated, until no further reassignment is necessary. The algorithm is given in Fig. 1.15.

The version of K-means given here is sometimes called the isodata algorithm. It is different from the original K-means algorithm in that it reassigns the entire set of training vectors before updating the cluster centroids. If means are recomputed after each vector is considered, then the algorithm terminates only after a complete scan of the training set is made without reassignment.

A brief history and more details of the K-means approach from an information theory perspective is given in the paper by Makhoul et al. (1985). In an unpublished 1957 paper [more recently published, see (Lloyd, 1982)], Lloyd, independently of the pattern recognition research efforts, had essentially worked out the isodata algorithm for scalar quantization in pulse code modulation. The generalization of the K-means algorithm to "vector quantization," a technique which we will first encounter in Chapter 7, is sometimes called the generalized Lloyd algo-

FIGURE 1.15. The K-means algorithm.

Initialization: Choose an arbitrary partition of the training vectors {x} into K clusters, denoted A_k, k = 1, 2, ..., K, and compute the mean vector (centroid) of each cluster, x̄_k, k = 1, 2, ..., K.

Recursion:
1. For each feature vector, x, in the training set, assign x to A_k*, where

   k* = argmin_k d(x, x̄_k).  (1.221)

   d(·,·) represents some distance measure in the feature space.
2. Recompute the cluster centroids, and return to Step 1 if any of the centroids change from the last iteration.
rithm (Gray, 1984). A further generalization involves the fact that the K-means approach can also be applied to representations of the clusters other than centroids, and to measures of similarities other than distance metrics (Devijver and Kittler, 1982). A measure of similarity which does not necessarily adhere to the formal properties of a distance metric is often called a distortion measure. Linde et al. (1980) were the first in the communications literature to suggest the use of vector quantization with K-means and nonmetric distortion measures. Consequently, the K-means algorithm (particularly with these generalizations) is frequently called the Linde-Buzo-Gray or LBG algorithm in the speech processing and other communications literature.

Generally, the objective of the LBG algorithm is to find a set of, say, K feature vectors (codes) into which all feature vectors in the training set can be "quantized" with minimum distortion. This is like adjusting the levels of a scalar quantizer to minimize the amount of distortion incurred when a signal is quantized. This set of code vectors comprises a codebook for the feature space. The method is generally described in Fig. 1.16. A slight variation on the LBG method is also shown in Fig. 1.16, which differs in the way in which the algorithm is initialized. In the latter case, the number of clusters is iteratively built up to a desired number (power of two) by "splitting" the existing codes at each step and using these split codes to seed the next iteration.

FIGURE 1.16. The generalized Lloyd or Linde-Buzo-Gray (LBG) algorithm.

Initialization: Choose an arbitrary set of K code vectors, say x̂_k, k = 1, 2, ..., K.

Recursion:
1. For each feature vector, x, in the training set, "quantize" x into code x̂_k*,

Alternative LBG algorithm with "centroid splitting."

Initialization: Find the centroid of the entire population of vectors. Let this be the (only) initial code vector.

Recursion: There are I total iterations where 2^I code vectors are desired. Let the iterations be i = 1, 2, ..., I. For iteration i,
1. "Split" any existing code vector, say x̂, into two codes, say x̂(1 + e) and x̂(1 − e), where e is a small number, typically 0.01. This results in 2^i new code vectors, say x̂_k, k = 1, 2, ..., 2^i.
2. For each feature vector, x, in the training set, "quantize" x into code x̂_k*, where k* = argmin_k d(x, x̂_k). Here d(·,·) represents some distortion measure in the feature space.
3. For each k, compute the centroid of all vectors x such that x̂_k = Q(x) during the present iteration. Let this new set of centroids comprise the new codebook, and, if i < I, return to Step 1.

1.4 Information and Entropy

Reading Note: The material in this section will not be needed until Parts IV and V of the text.

The issues discussed here are a few necessary concepts from the field of information theory. The reader interested in this field should consult one of many widely used books on this subject (see Appendix 1.D).

Note that our need for this material in this text will usually occur in cases in which all random vectors (or variables) take discrete values. We will therefore focus on such cases. Similar definitions and developments exist for continuous random vectors [e.g., (Papoulis, 1984)].

1.4.1 Definitions
P(c = c) = { 1,  c = c'
           { 0,  c ≠ c',   (1.225)

In the first case in which the class probabilities are uniformly distributed, we have complete uncertainty about the association of a given feature vector, and gain the maximum information possible (on the average) when its true association is revealed. On the other hand, in the second case we have no doubt that the true class is c', and no information is imparted with the revelation of the class identity. In either case, the information we receive is in indirect proportion to the probability of the class.^35

The same intuitive arguments apply, of course, to the outcomes of any random variable; the quantity c need not model class outcomes in a pattern recognition problem. Let us therefore begin to view c as a general discrete random variable. In fact, for even broader generality, let us begin to work with a random vector, c, recognizing, of course, that the scalar

the less likely is the value c, the more information we receive. Although information may be defined using any logarithmic base, usually base two is used, in which case I(·) is measured in bits. The sense of this term is as follows: If there are K equally likely outcomes, say c_1, ..., c_K, and each is assigned an integer 1, 2, ..., K, then it requires a binary number with log_2 K bits to identify the index of a particular outcome. In this case, we receive exactly that number of bits of information when it is revealed that the true outcome is c,

I(c = c) = −log_2 P(c = c) = log_2 K.  (1.227)

I(c = c) can therefore be interpreted as the number of binary digits required to identify the outcome c if it is one of 2^{I(c = c)} equally likely possibilities.

In general, of course, information is a random quantity that depends on the outcome of the random variable. We denote this by writing simply I(c). The entropy is a measure of expected information across all outcomes of the random vector,

H(c) ≜ E[I(c)] = − Σ_{i=1}^{K} P(c = c_i) log_2 P(c = c_i).  (1.228)

Now consider N random vectors, say x(1), ..., x(N), each of which produces outcomes from the same finite set,^36 {x_1, ..., x_L}. By a natural generalization of the above, the information associated with the revelation that x(1) = x_{k_1}, ..., x(N) = x_{k_N} is defined as

I[x(1) = x_{k_1}, ..., x(N) = x_{k_N}] ≜ −log_2 P[x(1) = x_{k_1}, ..., x(N) = x_{k_N}],  (1.229)

and the entropy associated with these random variables is

H[x(1), ..., x(N)] ≜ E{I[x(1), ..., x(N)]}
= − Σ_{l_1=1}^{L} ··· Σ_{l_N=1}^{L} P[x(1) = x_{l_1}, ..., x(N) = x_{l_N}] log_2 P[x(1) = x_{l_1}, ..., x(N) = x_{l_N}].  (1.230)

If the random vectors are independent, then

I[x(1), ..., x(N)] = Σ_{n=1}^{N} I[x(n)]  (1.231)

and

H[x(1), ..., x(N)] = Σ_{n=1}^{N} H[x(n)].  (1.232)

In particular, if x(1), ..., x(N) are independent and identically distributed, then

H[x(1), ..., x(N)] = N H[x(n)]  for arbitrary n.  (1.233)

Intuitively, the information received when we learn the outcome, say x_k, of a random vector, x, will be less if we already know the outcome, say y_l, of a correlated random vector, y. Accordingly, we define the conditional information and conditional entropy, respectively, as

I(x = x_k | y = y_l) = −log_2 P(x = x_k | y = y_l)  (1.234)

^35 According to Papoulis (1981), Planck was the first to describe the explicit relationship between probability and information in 1906.
^36 This definition is easily generalized to the case in which all random vectors have different sets of outcomes, but we will not have need of this more general case.
76 en. 1 / Propaedeutic 1.4 / Information and Entropy 77
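The limiting cases discussed above are easy to check numerically. The following sketch (illustrative values only, not from the text) evaluates the entropy definition (1.228) in bits:

```python
import math

def entropy(p):
    """Entropy H(c) = -sum_j P(c = c_j) log2 P(c = c_j) of a discrete
    distribution p = [p_1, ..., p_K], in bits (eq. 1.228)."""
    assert abs(sum(p) - 1.0) < 1e-9, "probabilities must sum to one"
    # Terms with p_j = 0 contribute nothing (p log p -> 0 as p -> 0).
    return -sum(pj * math.log2(pj) for pj in p if pj > 0.0)

# Uniform over K equally likely outcomes: complete uncertainty, H = log2 K.
K = 8
print(entropy([1.0 / K] * K))   # 3.0 bits

# Degenerate distribution as in (1.225): the outcome is certain,
# so the revelation of the class identity carries zero information.
print(entropy([1.0, 0.0, 0.0]))
```

The uniform case reproduces the log2 K bits of (1.227); the degenerate case of (1.225) gives zero entropy, matching the discussion of the class c' above.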
Equation (1.238), in turn, leads to the conclusion that

    M(x, y) = I(x) - I(x | y) = I(y) - I(y | x).                      (1.239)

This result clearly shows the interpretation of the mutual information as the information that is "shared" by the random vectors.

Like an entropy measure, the average mutual information, which we denote M(x, y), is the expected mutual information over all values of the random vectors,

    M(x, y) =def E_{x,y}{ log2 [ P(x, y) / (P(x) P(y)) ] }
        = Σ_{l=1}^{L} Σ_{k=1}^{K} P(x = x_l, y = y_k) log2 [ P(x = x_l, y = y_k) / (P(x = x_l) P(y = y_k)) ].      (1.240)

However, if the random vectors are not independent, then we must use37

    H(x) =def -lim_{N→∞} (1/N) Σ_{l_1=1}^{L} ... Σ_{l_N=1}^{L} P[x(1) = x_{l_1}, ..., x(N) = x_{l_N}] log2 P[x(1) = x_{l_1}, ..., x(N) = x_{l_N}].      (1.244)

If the random vectors are uncorrelated beyond some finite N, then the expression need not contain the limit. Definition (1.244) is useful for theoretical discussions, but it becomes practically intractable for N's much larger than two or three. We will see one interesting application of this expression in our study of language modeling in Chapter 13.

1.4.3 Entropy Concepts in Pattern Recognition

Entropy measures are used in pattern recognition problems. To provide an example of the use of the entropy concepts described above, and

we will simply mean the average mutual information between two random variables at any arbitrary time n. We will write M(x, y) to emphasize that the random variables are taken from the stationary random sources,

    M(x, y) =def Σ_{l=1}^{L} Σ_{k=1}^{K} P(x = x_l, y = y_k) log2 [ P(x = x_l, y = y_k) / (P(x = x_l) P(y = y_k)) ].      (1.247)

37A stationary source with discrete, independent random variables (or vectors) is called a discrete memoryless source in the communications field [see, e.g., (Proakis, 1989, Sec. 2.3.2)].
38We assume here that the random process starts at n = 0.
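The average mutual information of (1.247) is straightforward to compute from a tabulated joint distribution. The following sketch (the example distributions are illustrative, not from the text) evaluates M(x, y) directly from the definition:

```python
import math

def average_mutual_information(P):
    """M(x, y) per (1.240)/(1.247), where P[l][k] = P(x = x_l, y = y_k)."""
    L, K = len(P), len(P[0])
    px = [sum(P[l][k] for k in range(K)) for l in range(L)]  # marginal of x
    py = [sum(P[l][k] for l in range(L)) for k in range(K)]  # marginal of y
    M = 0.0
    for l in range(L):
        for k in range(K):
            if P[l][k] > 0.0:  # zero-probability cells contribute nothing
                M += P[l][k] * math.log2(P[l][k] / (px[l] * py[k]))
    return M

# Independent binary sources share no information: M = 0.
print(average_mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0

# Perfectly correlated binary sources: M equals the 1-bit entropy of either.
print(average_mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```

The two test distributions bracket the possibilities: independence gives M = 0, while a deterministic relationship makes the shared information equal to the full entropy of either source.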
    y(t) = Y e^{jΩt},                                                 (1.252)

where Y = |Y| e^{jφ_y}. Putting the forms (1.250) and (1.252) into (1.248), it is found immediately that the terms e^{jΩt} cancel and the differential equation solution reduces to one of solving an algebraic equation for Y in terms of X, powers of Ω, and the coefficients a_l and b_l. Engineers often take advantage of this fact and solve algebraic phasor equations directly for steady-state solutions, sidestepping the differential equations completely. We have developed constructs such as "impedance" to assist in these simplified solutions (see below).

In fact, recall that phasor analysis amounts to steady-state frequency-domain analysis. In principle, the phasors X and Y are frequency dependent, because we may enter a variety of inputs (actually an uncountably infinite number!) of the form x(t) = |X| cos(Ωt + φ_x), each with a different frequency, Ω; amplitude, |X|; and phase, φ_x, to produce corresponding outputs of form y(t) = |Y| cos(Ωt + φ_y) with frequency-dependent amplitudes and phases. We may reflect this fact by writing the phasors as X(Ω) and Y(Ω). Plugging forms (1.250) and (1.252) into (1.248) with these explicitly frequency-dependent phasors immediately produces the general expression for the output phasor

    Y(Ω) = [ Σ_{l=0}^{m} b_l (jΩ)^l ] / [ 1 + Σ_{l=1}^{n} a_l (jΩ)^l ] X(Ω).      (1.253)

The ratio H(Ω) =def Y(Ω)/X(Ω) is of course the (Fourier) transfer function for the system. Other ratios, in particular, impedances and admittances, result from similar analyses. If, for example, y(t) is a voltage across a discrete electrical component in response to current x(t), then the phasor ratio Z(Ω) = Y(Ω)/X(Ω) resulting from the (usually simple) differential equation governing the component is the impedance (frequency dependent) of that component. The algebraic equations resulting from phasor-based solutions of differential equations mimic the simple "Ohm's law" type relations that arise in DC analysis of resistive circuits. As electrical engineers, we sometimes become so familiar with these simple phasor techniques that we forget their fundamental connection to the underlying differential equation.

In connection with the concepts above, we note that the ratio of phasors is always equivalent to the ratio of the complex signals they represent,

    Y(Ω)/X(Ω) = [Y(Ω) e^{jΩt}] / [X(Ω) e^{jΩt}] = y(t)/x(t).          (1.254)

This fact is sometimes useful in theoretical discussions in which phasor notations have not been defined for certain signals.

We will make use of these ideas in our early work (Chapter 3) concerning analog acoustic modeling of the speech production system. If necessary, the reader should review these topics in any of a number of engineering textbooks [e.g., (Hayt and Kemmerly, 1971)] or textbooks on differential equations [e.g., (Boyce and DiPrima, 1969)].

1.6 Onward to Speech Processing

Thus ends our review and tutorial of selected background material prerequisite to the study of speech processing. The reader will probably want to refer back to this chapter frequently to recall notational conventions and basic analytical tools. Before beginning our formal study, we make a few introductory comments about the speech processing field, and about the organization of the book.

Brief History. The history of speech processing certainly does not begin with the digital signal processing engineer, nor even with the work of electrical engineers. In an interesting article39 surveying some of the history of speech synthesis, Flanagan (1972) notes humankind's fascination with speech and voice from ancient times, and places the advent of the scientific study of speech in the Renaissance, when clever mechanical models were constructed to imitate speech. The first well-documented efforts at mechanical speech synthesis occurred in St. Petersburg and Vienna in the late eighteenth century. The 1930s, a century and a half later, are often considered to be the beginning of the modern speech technology era, in large part due to two key developments at Bell Laboratories. The first was the development of pulse code modulation (PCM), the first digital representation of speech (and other waveforms), which helped to pioneer the field of digital communications. The second was the demonstration of the Vocoder (Voice Coder) by Dudley (1939), a speech synthesizer, the design of which first suggested the possibility of parametric speech representation and coding. The subsequent decades have seen an explosion of activity roughly concentrated into decades. We mention a few key developments: intense research on the basic acoustical aspects of speech production and concomitant interest in electronic synthesizers in the late 1940s through the 1960s (Fant, 1960), which was spurred on by the invention of the spectrograph in 1946 (Potter et al., 1966); advances in analysis and coding algorithms (linear prediction, cepstrum) in the 1960s (see Chapters 5 and 6 in this book) made possible by the new digital computing machines and related work in digital signal processing [e.g., (Cooley and Tukey, 1965)]; development of temporally adaptive speech coding algorithms in the 1970s (see Chapter 7); and vast

39Also see (Schroeder, 1966). Each of these papers, as well as others describing early work, is reprinted in (Schafer and Markel, 1979).
interest in speech recognition research in the 1970s and 1980s and continuing into the 1990s, grounded in the development of dynamic programming techniques, hidden Markov modeling, vector quantization, neural networks, and significant advances in processor architectures and fabrication (see the chapters of Part V).

Research Areas and Text Organization. There is no precise way to partition the speech processing research field into its component areas. Nevertheless, we offer the following first approximation to a partition that can roughly be inferred from the discussion above:

Speech Science (Speech Production and Modeling) (Part II of this book)
Analysis (Part III)
Coding, Synthesis, Enhancement, and Quality Assessment (Part IV)
Recognition (Part V)

We have organized the book around these themes.

Part II is concerned with providing necessary topics in speech science and with early efforts to model speech production, which are grounded in the physics of the biological system. By speech science we mean the use of engineering techniques — spectral analysis, modeling, and so on — in work that is specifically aimed at a better understanding of the physiological mechanisms, anatomy, acoustic, phonetic, and linguistic aspects of normal and abnormal voice and speech production. Naturally, such work is highly interdisciplinary and is least concerned with immediate application of the research results. Needless to say, however, speech science research has been, and continues to be, central to progress in the more applied fields. In Chapter 2, the first chapter in Part II, we examine speech science concepts necessary to "engineer" speech. Our goal is to learn enough about speech to be able to converse with interdisciplinary researchers in various aspects of speech science and speech processing, and to be able to build useful mathematical models of speech production. Chapter 3 begins the quest for a useful mathematical model by building on the science of speech production discussed in Chapter 2. The journey takes us through a discussion of fundamental attempts to model speech production based on the physics of acoustic tubes. These real acoustic models are revealing and provide a firm foundation for the widely used discrete-time model, which will be employed throughout the remainder of the book and whose description is the culmination of the chapter.

Speech analysis research is concerned with processing techniques that are designed to extract information from the speech waveform. In Part III we take up the most important contemporary tools for analyzing speech by computer. Speech is analyzed for many reasons, including analysis for analysis' sake (basic research into phonetics or better models of speech production), but also to reduce it to basic features for coding, synthesis, recognition, or enhancement. Part III of the book, therefore, comprises the engineering foundation upon which speech processing is built. In the first of these topics (Chapter 4) we examine the general issue of processing short terms of a signal. Most engineering courses ignore the fact that, in the real world, only finite lengths of signals are available for processing. This is particularly true in speech, where the signal remains stationary for only milliseconds. The remaining chapters (5 and 6) of Part III introduce the two most important parameterizations of speech in contemporary processing — linear prediction coefficients and cepstral coefficients — their meaning, and the analysis techniques for obtaining them. These parameters are widely used for spectral representations of speech in the areas mentioned above. We shall therefore use them repeatedly as we progress through the material.

Part IV consists of three chapters that cover a rather wide range of topics. This part of the text is concerned with those aspects of speech processing which most directly intersect with the communications technologies. Here we will be concerned with efficient coding for the transmission of speech across channels and its reconstruction at the receiver site. Since the task of synthesis is closely coupled with transmission and reconstruction strategies, we will examine some of the widely used analytical techniques for synthesis in the context of this study. Synthesis for voice response systems, in which a machine is used in place of a human to dispense information, is also an important application domain, and many of the techniques used in communications systems are equally applicable to this problem.

The effectiveness of a coding scheme at preserving the information and the natural quality of the speech can be ascertained by using results from quality assessment research. Accordingly, we include this topic in Part IV (Chapter 9). Related to the assessment of quality is the enhancement of speech that has been corrupted by any of a number of natural or human-made effects, including coding. This issue will also be addressed in Part IV (Chapter 8).

Speech recognition deals with the related problems of designing algorithms that recognize or even understand40 speech, or which identify the speaker (speech recognition versus speaker recognition).41 In Part V, we take up the first of these problems, that of recognizing the speech itself. Chapter 10 overviews the problems encountered in trying to recognize speech using a computer. Chapters 11 and 12 introduce the two most widely used techniques for recognizing speech — dynamic time-warping algorithms and the hidden Markov model. The first is a template match-

40A speech recognizer simply "translates" the message into words, while a speech understanding system would be able to ascertain the meaning of the utterance. Speech understanding algorithms can be used as an aid to recognition, by, for example, disallowing nonsensical concatenations of words to be tried, or by "expecting" certain utterances in various conversational contexts.
41A slight variation on the latter problem is speaker verification, in which the recognizer accepts or rejects the speaker's claim of identity.
ing method following the classical paradigm of statistical pattern recognition with the interesting special problem of time registration of the waveform. The latter is a stochastic method in which statistical characterizations of utterances are automatically learned from training utterances. Chapter 13 introduces the basic principles of language modeling, techniques that reduce entropy by taking advantage of the higher-level structure of spoken utterances to improve recognizer performance. Chapter 14 is a brief introduction to a radically different approach to speech recognition based on massively parallel computing architectures or "artificial neural networks." This field is in its relative infancy compared with techniques based on sequential computing, and it offers interesting challenges and possibilities for future research and development.

Applications. The applications of speech processing are manifold and diverse. In a general way, we have alluded to some of the basic areas above. Among the principal "drivers" of speech processing research in recent years have been the commercial and military support of ambitious endeavors of large scale. These have mainly included speech coding for communications, and speech recognition for an extremely large array of potential applications — robotics, machine data entry by speech, remote control of machines by speech for hazardous or "hands-free" (surgery) environments, communications with pilots in noisy cockpits, and so on. Futuristic machines for human/machine communication and interaction using speech are envisioned (and portrayed in science fiction movies), and in the meantime, more modest systems for recognition of credit card, telephone, and bank account numbers, for example, are in use. In addition, speech processing is employed in "smaller scale" problems such as speaker recognition and verification for military, security, and forensic applications, in biomedicine for the assessment of speech and voice disorders (analysis), and in designing speech and hearing aids for persons with disabilities (analysis and recognition). Inasmuch as speech is the most natural means of communication for almost everyone, the applications of speech processing technology seem nearly limitless, and this field promises to profoundly change our personal and professional lives in coming years.

What Is Not Covered in This Textbook. Speech processing is an inherently interdisciplinary subject. Although the boundaries among academic disciplines are certainly not well defined, this book is written by electrical engineers and tends to focus on topics that have been most actively pursued by digital signal processing engineers.

Significant contributions to this field, especially to speech recognition, have come from research that would usually be classified as computer science. A comprehensive treatment of these "computer science" topics is outside the intended scope of this book. Examples include (detailed discussions of) parsing algorithms for language modeling (see Chapter 13), and knowledge-based and artificial intelligence approaches to recognition42 [e.g., (Zue, 1985)]. Although we briefly discuss the former, we do not address the latter. Another example concerns the use of "semantic" and "pragmatic" knowledge in speech recognition (see Chapters 10 and 13). Semantics and pragmatics are subjects that are difficult to formalize in conventional engineering terms, and their complexity has precluded a significant impact on speech recognition technology outside the laboratory. We treat these issues only qualitatively in this book.

The speech (and hearing) science domains — anatomy and physiology of speech production, acoustic phonetics, linguistics, hearing, and psychophysics — are all subjects that are fundamentally important to speech processing. This book provides an essential engineering treatment of most of these subjects, but a thorough treatment of these topics obviously remains beyond the scope of the book. The reader is referred to Appendix 1.E for some resources in the area.

Finally, the explosive growth in this field brought about by digital computing has made it impossible for us to provide a thorough account of the important work in speech processing prior to about 1965. Essential elements of the analog acoustic theory of speech, upon which much of modern speech processing is based, are treated in Chapter 3 and its appendix. A much more extensive treatment of this subject is found in the book Speech Analysis, Synthesis, and Perception by J. L. Flanagan (1972). This book is a classic textbook in the field and no serious student of speech processing should be unfamiliar with its contents. Other important papers with useful reference lists can be found in the collection (Schafer and Markel, 1979).

Further Information. The appendixes to this chapter provide the reader with lists of books and other supplementary materials for background and advanced pursuit of the topics in this book. In particular, Section 1.E of this appendix is devoted to materials specifically on speech processing. Among the sections are lists of other textbooks, edited paper collections, journals, and some notes on conference proceedings.

1.7 PROBLEMS

1.1. Whereas the unit step sequence, u(n), can be thought of as samples of the continuous-time step, say u_a(t), defined as

    u_a(t) = { 1,  t ≥ 0
             { 0,  t < 0,                                             (1.255)

42This and other papers on knowledge-based approaches are reprinted in (Waibel and Lee, 1990).
a similar relationship does not exist between the discrete-time "impulse," δ(n), and its continuous-time counterpart δ_a(t).

(a) Consider sampling the signal u_a(t) with sample period T to obtain the sequence u(n) =def u_a(nT). If we now subject u(n) to the customary ideal interpolation procedure in an attempt to reconstruct u_a(t) (Proakis and Manolakis, 1992, Sec. 6.3), will the original u_a(t) be recovered? Why or why not?
(b) Roughly sketch the time signal, say û_a(t), and the spectrum Û_a(Ω) of the signal that will be recovered in part (a).
(c) That δ_a(t) cannot be sampled fast enough to preserve the information in the time signal is apparent, since the signal has infinite bandwidth, that is, Δ_a(Ω) = 1. However, to show that any attempt to sample δ_a(t) results in an anomalous sequence, consider what happens in the frequency domain with reference to (1.21). What is the anomaly in the time sequence that causes this strange frequency domain result?
(d) Carefully sketch and numerically label the time signal, say δ̂_a(t),

    E_x =def Σ_{n=-∞}^{∞} |x(n)|^2 = (1/2π) ∫_{-π}^{π} |X(ω)|^2 dω.   (1.257)

TABLE 1.1. Properties of the DTFT.

Property          Time Domain             Frequency Domain
Linearity         a x_1(n) + b x_2(n)     a X_1(ω) + b X_2(ω)
Delay             x(n - d)                e^{-jωd} X(ω)
Modulation        e^{jω_0 n} x(n)         X(ω - ω_0)
Time reversal     x(-n)                   X(-ω)
Multiplication    x(n) y(n)               (1/2π) ∫_{-π}^{π} X(ζ) Y(ω - ζ) dζ = X(ω) * Y(ω)
Convolution       x(n) * y(n)             X(ω) Y(ω)
Conjugation       x*(n)                   X*(-ω)
Differentiation   n x(n)                  j dX(ω)/dω

TABLE 1.2. Properties of the DFT.*

Property               Time Domain           Frequency Domain
Linearity              a x_1(n) + b x_2(n)   a X_1(k) + b X_2(k)
Circular shift         x((n - d)) mod N      W^{kd} X(k)
Modulation             W^{ln} x(n)           X((k + l)) mod N
Circular convolution   x(n) mod N * y(n)     X(k) Y(k)

1.4. (a) Verify the properties of the DFT shown in Table 1.2. The notation W =def e^{-j2π/N} is used for convenience and all time sequences are assumed to be of length N.
(b) Prove Parseval's relation:

    Σ_{n=0}^{N-1} |x(n)|^2 = (1/N) Σ_{k=0}^{N-1} |X(k)|^2.

*The notation W =def e^{-j2π/N} is used, and all sequences are assumed to be of length N.
(d) Of the causal sequences in part (b), how many are minimum phase? Maximum phase?

1.7. (Computer Assignment) Using a signal processing software package, replicate the experiment of Section 1.1.7, that is, reproduce Figures 1.6-1.8.

1.8. Given the joint probability density function f_{x,y}(x, y) for two jointly continuous random variables x and y, verify the following using a pictorial argument:

    P(x_1 < x ≤ x_2, y_1 < y ≤ y_2) = ∫_{-∞}^{x_2} ∫_{-∞}^{y_2} f_{x,y}(x, y) dy dx
        - ∫_{-∞}^{x_1} ∫_{-∞}^{y_2} f_{x,y}(x, y) dy dx - ∫_{-∞}^{x_2} ∫_{-∞}^{y_1} f_{x,y}(x, y) dy dx
        + ∫_{-∞}^{x_1} ∫_{-∞}^{y_1} f_{x,y}(x, y) dy dx.

    f_{x(n_1)x(n_2)}(x_1, x_2) = [1 / (2π σ_x^2 √(1 - ρ_x^2))] e^{-(1/2) Q(x_1, x_2)},      (1.264)

where

    Q(x_1, x_2) = [1 / (1 - ρ_x^2)] { [(x_1 - μ_x)/σ_x]^2 - 2ρ_x [(x_1 - μ_x)/σ_x][(x_2 - μ_x)/σ_x] + [(x_2 - μ_x)/σ_x]^2 },      (1.265)

show that the process is also second-order stationary. Show, in fact, that the process is SSS.

1.12. For a WSS random process x, verify that

    P_x = (1/2π) ∫_{-π}^{π} Γ_x(ω) dω = (1/π) ∫_{0}^{π} Γ_x(ω) dω = r_x(0).      (1.266)

1.13. Show that, if a WSS random process x, which is ergodic in both mean and autocorrelation, is used as input to a stable, linear, time-invariant discrete-time system with impulse response h(n), then the output random process y is also ergodic in both senses.

1.14. Verify (1.155) and (1.156).

    J_B = -ln ∫ √( f_{x|c}(x|1) f_{x|c}(x|2) ) dx.                    (1.269)

Show that this measure reduces to a Mahalanobis-like distance in the case of Gaussian feature vectors and equal class covariances. Hint: Use the fact that

    -(1/2)(x - μ_{x|1})^T C_x^{-1} (x - μ_{x|1}) - (1/2)(x - μ_{x|2})^T C_x^{-1} (x - μ_{x|2})      (1.270)

        = -[(μ_{x|1} - μ_{x|2})/2]^T C_x^{-1} [(μ_{x|1} - μ_{x|2})/2]
          - [x - (μ_{x|1} + μ_{x|2})/2]^T C_x^{-1} [x - (μ_{x|1} + μ_{x|2})/2].      (1.271)
1.20. Two stationary binary sources, x and y, are considered in this problem. In each case the random variables of the source, for example, x(n), n = 1, 2, ..., are statistically independent.
(a) Given P[x(n) = 1] = 0.3 for any n, evaluate the entropy of source x, H(x).
(b) In the source y, the entropy is maximal. Use your knowledge of the meaning of entropy to guess the value P[y(n) = 1]. Explain the reasoning behind your guess. Formally verify that your conjecture is correct.
(c) Given that P[x(n) = x, y(n) = y] = 0.25 for any n and for any possible outcome, (x, y) = (0,0), (0,1), (1,0), (1,1), evaluate the average mutual information, say M(x, y), between the jointly stationary random sources x and y.
(d) Find the probability distribution P[x(n), y(n)] such that the two jointly stationary random sources have no average mutual information.

1.21. Verify (1.231)-(1.233).

APPENDICES: Supplemental Bibliography

1.A Example Textbooks on Digital Signal Processing

Cadzow, J. A. Foundations of Digital Signal Processing and Data Analysis. New York: Macmillan, 1987.
Jackson, L. B. Digital Filters and Signal Processing, 2nd ed. Norwell, Mass.: Kluwer, 1989.
Kuc, R. Introduction to Digital Signal Processing. New York: McGraw-Hill, 1988.
Oppenheim, A. V., and R. W. Schafer. Discrete Time Signal Processing. Englewood Cliffs, N.J.: Prentice Hall, 1989.
Proakis, J. G., and D. G. Manolakis. Digital Signal Processing: Principles, Algorithms, and Applications, 2nd ed. New York: Macmillan, 1992.

1.B Example Textbooks on Stochastic Processes

Davenport, W. B. Random Processes: An Introduction for Applied Scientists and Engineers. New York: McGraw-Hill, 1970. [Elementary]
Gardner, W. A. Introduction to Random Processes with Applications to Signals and Systems, 2nd ed. New York: McGraw-Hill, 1990.
Gray, R. M., and L. D. Davisson. Random Processes: A Mathematical Approach for Engineers. Englewood Cliffs, N.J.: Prentice Hall, 1986.
Grimmett, G. R., and D. R. Stirzaker. Probability and Random Processes. Oxford: Clarendon, 1985.
Helstrom, C. W. Probability and Stochastic Processes for Engineers, 2nd ed. New York: Macmillan, 1991.
Leon-Garcia, A. Probability and Random Processes for Electrical Engineering. Reading, Mass.: Addison-Wesley, 1989. [Elementary]
Papoulis, A. Probability, Random Variables, and Stochastic Processes, 2nd ed. New York: McGraw-Hill, 1984.
Peebles, P. Z. Probability, Random Variables, and Random Signal Principles, 2nd ed. New York: McGraw-Hill, 1987. [Elementary]
Pfeiffer, P. E. Concepts of Probability Theory. New York: Dover, 1965.
Wong, E., and B. Hajek. Stochastic Processes in Engineering Systems. New York: Springer-Verlag, 1984. [Advanced]

1.C Example Textbooks on Statistical Pattern Recognition

Devijver, P. A., and J. Kittler. Pattern Recognition: A Statistical Approach. London: Prentice Hall International, 1982.
Fukunaga, K. Introduction to Statistical Pattern Recognition. New York: Academic Press, 1972.
Jain, A. K., and R. C. Dubes. Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.

1.D Example Textbooks on Information Theory

Blahut, R. E. Principles and Practice of Information Theory. Reading, Mass.: Addison-Wesley, 1987.
Csiszar, I., and J. Korner. Information Theory. New York: Academic Press, 1981.
Gallagher, R. G. Information Theory and Reliable Communication. New York: John Wiley & Sons, 1968.
Guiasu, S. Information Theory with Applications. New York: McGraw-Hill, 1976.
Khinchin, A. Y. Mathematical Foundations of Information Theory. New York: Dover, 1957.
McEliece, R. J. The Theory of Information and Coding. Reading, Mass.: Addison-Wesley, 1977.
1.E Other Resources on Speech Processing

Flanagan, J. L. Speech Analysis, Synthesis, and Perception, 2nd ed. New York: Springer-Verlag, 1972.
Furui, S. Digital Speech Processing. New York: Marcel Dekker, 1989.
Furui, S., and M. Sondhi. Recent Progress in Speech Signal Processing. New York: Marcel Dekker, 1990.
Markel, J. D., and A. H. Gray. Linear Prediction of Speech. New York: Springer-Verlag, 1976.
Morgan, D. P., and C. L. Scofield. Neural Networks and Speech Processing. Norwell, Mass.: Kluwer, 1991.
O'Shaughnessy, D. Speech Communication: Human and Machine. Reading, Mass.: Addison-Wesley, 1987.
Papamichalis, P. E. Practical Approaches to Speech Coding. Englewood Cliffs, N.J.: Prentice Hall, 1987.
Parsons, T. W. Voice and Speech Processing. New York: McGraw-Hill, 1986.
Rabiner, L. R., and R. W. Schafer. Digital Processing of Speech Signals. Englewood Cliffs, N.J.: Prentice Hall, 1978.

1.E.2 Edited Paper Collections

Dixon, N. R., and T. B. Martin, eds., Automatic Speech and Speaker Recognition. New York: IEEE Press, 1979.
Fallside, F., and W. A. Woods, eds., Computer Processing of Speech. London: Prentice Hall International, 1985.
Lea, W. A., ed., Trends in Speech Recognition. Apple Valley, Minn.: Speech Science Publishers, 1980.
Reddy, R., ed., Speech Recognition. New York: Academic Press, 1975.
Schafer, R. W., and J. D. Markel, eds., Speech Analysis. New York: IEEE Press, 1979.

1.E.3 Journals

AT&T Technical Journal (prior to 1985, Bell System Technical Journal).
IEEE Transactions on Signal Processing (prior to 1991, IEEE Transactions on Acoustics, Speech, and Signal Processing, and prior to 1974, IEEE Transactions on Audio and Electroacoustics).
IEEE Transactions on Audio and Speech Processing (initiated in 1993).
Journal of the Acoustical Society of America.
Speech Communication: An Interdisciplinary Journal.

In addition, the Proceedings of the IEEE and the IEEE Signal Processing Magazine occasionally have special issues or individual tutorial papers covering various aspects of speech processing.

1.E.4 Conference Proceedings

The number of engineering conferences and workshops that treat speech processing is vast; we will make no attempt to list them. However, the most widely attended conference in the field, and the forum at which new breakthroughs in speech processing are often reported, is the annual International Conference on Acoustics, Speech, and Signal Processing, sponsored by the Signal Processing Society of the IEEE. The society publishes an annual proceedings of this conference. By scanning the reference lists in these proceedings, as well as those in the journals above, the reader will be led to some of the other important conference proceedings in the area.

Also see Section 1.G.3 of this appendix.

1.F Example Textbooks on Speech and Hearing Sciences
Lehiste, I., ed., Readings in Acoustic Phonetics. Cambridge, Mass.: MIT Press, 1967.
Lieberman, P. Intonation, Perception, and Language. Cambridge, Mass.: MIT Press, 1967.
MacNeilage, P. The Production of Speech. New York: Springer-Verlag, 1983.
Minifie, F., T. Hixon, and F. Williams, eds., Normal Aspects of Speech, Hearing, and Language. Englewood Cliffs, N.J.: Prentice Hall, 1973.
Moore, B. An Introduction to the Physiology of Hearing. London: Academic Press, 1982.
O'Shaughnessy, D. Speech Communication: Human and Machine. Reading, Mass.: Addison-Wesley, 1987.
Perkell, J., and D. Klatt, eds., Invariance and Variability in Speech Processes. Hillside, N.J.: Lawrence Erlbaum Associates, 1986.
Zemlin, W. Speech and Hearing Science, Anatomy and Physiology. Englewood Cliffs, N.J.: Prentice Hall, 1968.

1.G.2 Journals

A few of the widely read journals on ANNs in English are the following:

IEEE Transactions on Neural Networks.
International Journal of Neural Systems.
Neural Computation.
Neural Networks Journal.

In addition, many of the journals listed in Section 1.E.3 of this appendix publish articles on neural network applications to speech processing.

1.G.3 Conference Proceedings

The number of conferences devoted to neural network technology is very large. These are two of the important ones:

IEEE International Conference on Neural Networks.
International Joint Conference on Neural Networks.

Many papers on ANNs related to speech processing are also presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing, which is discussed in Section 1.E.4 of this appendix.
CHAPTER 2

Fundamentals of Speech Science

Reading Notes: This chapter treats qualitative concepts and no special reading from Chapter 1 is required.
2.0 Preamble
Fundamentally, of course, speech processing relies on basic research in the speech and hearing sciences, some of which is centuries old, and much of which is ongoing. Few speech processing engineers have the time or opportunity to become expert in these fundamental sciences, so the field of speech processing remains an inherently multidisciplinary one. Nevertheless, the speech processing engineer needs a sound working knowledge of basic concepts from these areas in order to intelligently analyze and model speech, and to discuss findings with researchers in other fields. The purpose of this chapter is to provide the essential background in these allied fields. We touch upon a rather broad array of interdisciplinary subjects. We can only hope to treat elements of these research areas, and the reader in need of deeper study is encouraged to consult the textbooks in Appendix 1.F.
In order for communication to take place, a speaker must produce a speech signal in the form of a sound pressure wave that travels from the speaker's mouth to a listener's ears. Although the majority of the pressure wave originates from the mouth, sound also emanates from the nostrils, throat, and cheeks. Speech signals are composed of a sequence of sounds that serve as a symbolic representation for a thought that the speaker wishes to relay to the listener. The arrangement of these sounds is governed by rules associated with a language. The scientific study of language and the manner in which these rules are used in human communication is referred to as linguistics. The science that studies the characteristics of human sound production, especially for the description, classification, and transcription of speech, is called phonetics. In this chapter, we deal principally with the latter science. Some material on language, from an analytical point of view, will be found in Chapters 10 and 13.
2.1 Speech Communication

Speech is used to communicate information from a speaker to a listener. Although we focus on the production of speech, hearing is an integral part of the so-called speech chain. Human speech production begins with an idea or thought that the speaker wants to convey to a listener. The speaker conveys this thought through a series of neurological processes and muscular movements to produce an acoustic sound pressure wave that is received by a listener's auditory system, processed, and converted back to neurological signals. To achieve this, a speaker forms an idea to convey, converts that idea into a linguistic structure by choosing appropriate words or phrases to represent that idea, orders the words or phrases based on learned grammatical rules associated with the particular language, and finally adds any additional local or global characteristics such as pitch intonation or stress to emphasize aspects important for overall meaning. Once this has taken place, the human brain produces a sequence of motor commands that move the various muscles of the vocal system to produce the desired sound pressure wave. This acoustic wave is received by the talker's auditory system and converted back to a sequence of neurological pulses that provide necessary feedback for proper speech production. This allows the talker to continuously monitor and control the vocal organs by receiving his or her own speech as feedback.¹ Any delay in this feedback to our own ears can also cause difficulty in proper speech production. The acoustic wave is also transmitted through a medium, which is normally air, to a listener's auditory system. The speech perception process begins when the listener collects the sound pressure wave at the outer ear, converts this into neurological pulses at the middle and inner ear, and interprets these pulses in the auditory cortex of the brain to determine what idea was received.

We can see that in both production and perception, the human auditory system plays an important role in the ability to communicate effectively. The auditory system has both strengths and weaknesses that become more apparent as we study human speech production. For example, one advantage of the auditory system is selectivity in what we wish to listen to. This permits the listener to hear one individual voice in the presence of several simultaneous talkers, known as the "cocktail party effect." We are able to reject competing speech by capitalizing on the phase mismatch in the arriving sound pressure waves at each ear.² A disadvantage of the auditory system is its inability to distinguish signals that are closely spaced in time or frequency: when two tones are spaced close together in frequency, one masks the other, resulting in the perception of a single tone.³ As the speech chain illustrates, there are many interrelationships between production and perception that allow individuals to communicate with one another. Therefore, future research will not only focus on speech production, hearing, and linguistic structure but will also undoubtedly probe the complex interrelations among these areas.

¹Loss of this feedback loop contributes significantly to the degradation in speech quality for individuals who have hearing disabilities.

²Listeners who are hearing impaired in one ear cannot cancel such interference and can therefore listen to only one speaker at a time.

³An audio compact disc which demonstrates a wide collection of these auditory phenomena, produced by the Institute for Perception Research (IPO), Eindhoven, The Netherlands, 1987, is available from the Acoustical Society of America (Houtsma et al., 1987).

2.2 Anatomy and Physiology of the Speech Production System

2.2.1 Anatomy

The speech waveform is an acoustic sound pressure wave that originates from voluntary movements of anatomical structures which make up the human speech production system. Let us first give a very brief overview of these structures.

Figure 2.1 portrays a midsagittal section of the speech system in which we view the anatomy midway through the upper torso as we look on from the right side. The gross components of the system are the lungs, trachea (windpipe), larynx (organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). In technical discussions, the pharyngeal and oral cavities are usually grouped into one unit referred to as the vocal tract, and the nasal cavity is often called the nasal tract.⁴ Accordingly, the vocal tract begins at the output of the larynx and terminates at the input to the lips. The nasal tract begins at the velum (see below) and ends at the nostrils of the nose. Finer anatomical features critical to speech production include the vocal folds or vocal cords, soft palate or velum, tongue, teeth, and lips. The soft tip of the velum, which may be seen to hang down in the back of the oral cavity when the mouth is wide open, is called the uvula. These finer anatomical components move to different positions to produce various speech sounds and are known as articulators by speech scientists. The mandible or jaw is also considered to be an articulator, since it is responsible for both gross and fine movements that affect the size and shape of the vocal tract as well as the positions of the other articulators.

⁴The term vocal tract is often used in imprecise ways by engineers. Sometimes it is used to refer to the combination of all three cavities, and even more often to refer to the entire speech production system. We will be careful in this book not to use "vocal tract" when we mean "speech production system," but it is inevitable that the term will sometimes be used to mean "vocal tract and possibly the nasal tract too, depending on the particular sound being considered."

As engineers, it is useful to think of speech production in terms of an acoustic filtering operation, so let us begin to associate the anatomy with
FIGURE 2.1. A schematic diagram of the human speech production mechanism. [Labeled structures include the nasal cavity, soft palate (velum), nostrils, pharyngeal cavity, tongue, vocal folds, larynx, teeth, lips, lungs, and diaphragm.]

FIGURE 2.2. A block diagram of human speech production. [The diagram runs from muscle force and the lungs, through the larynx and vocal folds, to the pharyngeal, oral, and nasal cavities, producing oral and nasal sound outputs.]
such a technical model. The three main cavities of the speech production system (vocal plus nasal tracts) comprise the main acoustic filter. The filter is excited by the organs below it (and in other ways to be described below), and is loaded at its main output by a radiation impedance due to the lips. The articulators, most of which are associated with the filter itself, are used to change the properties of the system, its form of excitation, and its output loading over time. A simplified acoustic model illustrating these ideas is shown in Fig. 2.2.

Let us look more closely at the main cavities (acoustic filter) of the system that contribute to the resonant structure of human speech. In the average adult male (female), the total length of the vocal tract is about 17 (14) cm. The vocal tract length of an average child is 10 cm. Repositioning of the vocal tract articulators causes the cross-sectional area of the vocal tract to vary along its length from zero (complete closure) to greater than 20 cm². The nasal tract constitutes an auxiliary path for the transmission of sound. A typical length for the nasal tract in an adult male is 12 cm. Acoustic coupling between the nasal and vocal tracts is controlled by the size of the opening at the velum. In general, nasal coupling can substantially influence the frequency characteristics of the sound radiated from the mouth. If the velum is lowered, the nasal tract is acoustically coupled to produce the "nasal" sounds of speech. Velar opening can range from zero to about 5 cm² for an average adult male. For the production of nonnasal sounds, the velum is drawn up tightly toward the back of the pharyngeal cavity, effectively sealing off the entrance to the nasal cavity and decoupling it from the speech production system.

Let us now focus on the larynx. From a technical point of view, the larynx has a simple, but highly significant, role in speech production. Its function is to provide a periodic excitation to the system for speech sounds that we will come to know as "voiced." Roughly speaking, the periodic vibration of the vocal folds is responsible for this voicing (more on this below). From an anatomical (and physiological) point of view, however, the larynx is an intricate and complex organ that has been studied extensively by anatomists and physiologists. A diagram showing the main features of the larynx appears in Fig. 2.3. The main framework of
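The acoustic-filter view described here can be made concrete with a toy simulation: a periodic impulse train (a stand-in for the voiced source) driving a cascade of second-order resonances (stand-ins for the "formants" of the filter). This is only a sketch of the idea, not the book's model; the sample rate, source frequency, resonance frequencies, and bandwidths below are all illustrative assumptions.

```python
import numpy as np

def impulse_train(f0, fs, n):
    """Periodic excitation: a unit impulse every fs/f0 samples (voicing)."""
    e = np.zeros(n)
    e[::int(round(fs / f0))] = 1.0
    return e

def resonator(x, fc, bw, fs):
    """One resonance: a second-order all-pole filter at fc Hz with bandwidth
    bw Hz, applied sample by sample via its difference equation."""
    r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
    a1 = 2.0 * r * np.cos(2.0 * np.pi * fc / fs)
    a2 = -r * r
    b = 1.0 - a1 - a2                            # rough gain normalization
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = b * x[i] \
            + a1 * (y[i - 1] if i >= 1 else 0.0) \
            + a2 * (y[i - 2] if i >= 2 else 0.0)
    return y

fs = 8000                                        # sample rate, Hz (assumed)
e = impulse_train(100.0, fs, 2048)               # 100 Hz source (assumed)
s = resonator(resonator(e, 500.0, 80.0, fs),     # two illustrative resonances
              1500.0, 120.0, fs)

spec = np.abs(np.fft.rfft(s * np.hanning(len(s))))
peak_hz = np.argmax(spec) * fs / len(s)          # strongest spectral line
```

The output spectrum consists of source harmonics shaped by the resonances, so the strongest line falls near the lower resonance: excitation and filter contribute separately, which is the essence of the filtering view.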
FIGURE 2.5. Magnitude spectra for (a) the /s/ sound, and (b) the /I/ sound, in the utterance of "six" in Fig. 2.4. [The frequency axes span 0 to 5 kHz.]

FIGURE 2.6. Bell Laboratories' early phosphor belt spectrograph system. [The diagram shows speech input from a microphone.]

⁶This usually is not a limitation, since phase is not as important as magnitude in a majority of speech applications.
⁷Of course, whether "silence" should be called a form of excitation is debatable, but it is useful to include it for modeling purposes.

⁸In this instance, and in many other cases in this book, we restrict the discussion to consideration of American English. To do otherwise would open up broad discussions that would go well beyond the scope of the book. The principles discussed here will provide a solid foundation for the study of other languages.

⁹One often hears a singer described as having a "beautiful voice." This may indeed be the case, but the audience does not attend the concert to hear the singer's voice!

FIGURE 2.7. A sequence of cross sections of the larynx illustrating a complete phonation cycle. After Vennard (1967).
represented as the square of the air velocity, whereas the potential energy is proportional to the air pressure. As the vocal folds spread apart, air velocity increases significantly through the narrow glottis, which causes a local drop in air pressure. Therefore, when the vocal folds are closed, air pressure and potential energy are high. As the glottis opens, air velocity and kinetic energy increase, while pressure and potential energy decrease. The glottis continues to open until the natural elastic tension of the vocal folds equals the separating force of the air pressure. At this point the glottal opening and rate of airflow have reached their maxima. The kinetic energy that was received by the vocal folds during opening is stored as elastic recoil energy, which in turn causes the vocal folds to begin to close [Fig. 2.7(e)]. Inward movement of the vocal folds gathers momentum, and a suction effect caused by a Bernoulli force¹⁰ occurs when the glottis becomes narrow enough. Both the elastic restoring force and the Bernoulli force act to close the vocal folds abruptly [Fig. 2.7(e)]. The subglottal pressure and elastic restoring forces during closure cause the cycle to repeat.

An example time waveform for the volume velocity¹¹ (defined carefully later) at the glottis is shown in Fig. 2.8(a). The variation in airflow through the glottis results in a periodic open and closed phase for the glottal or source excitation. The magnitude spectrum of one pulse of the glottal waveform is shown in part (b) of the figure. Note the lowpass nature of this spectrum. This will be significant in our future modeling efforts.

¹⁰A Bernoulli force exists whenever there is a difference in fluid pressure between opposite sides of an object. One example of a Bernoulli force in a constricted area of flow occurs when air is blown between parallel sheets of paper held close together. The sheets of paper pull together instead of moving apart because the air velocity is greater, and the pressure is lower, between the sheets than on the outer sides.

¹¹Throughout the book, and especially in this chapter and the next, we will be interested in two volume velocity waveforms. These are the volume velocity at the glottis and the volume velocity at the lips. We shall call these u_glottis(·) and u_lips(·), respectively, when both appear in the same discussion. More frequently, the glottal volume velocity will be the waveform of interest, and when there is no risk of confusion, we shall drop the subscript "glottis" and write simply u(·). Further, we shall be interested in both continuous-time and discrete-time signals in our discussions. To avoid excessive notation, we shall not distinguish the two by writing, for example, u_a(t) and u(n) (where subscript "a" means "analog"). Rather, the arguments t and n will be sufficient in all cases to indicate the difference, and we shall write simply u(t) and u(n). Finally, the frequency variable Ω will be used in the consideration of continuous-time signals to denote "real-world" frequencies, while ω will be used to indicate "normalized" frequencies in conjunction with discrete-time signals (see Section 1.1.1).

FIGURE 2.8. (a) Time waveform of volume velocity of the glottal source excitation. [Time axis 0 to 100 msec.] (b) Magnitude spectrum of one pulse of the volume velocity at the glottis. [Frequency axis in krad/sec.]
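The lowpass character of the glottal pulse spectrum can be illustrated numerically. The pulse shape below is a crude raised-cosine stand-in for the true glottal flow (an assumption for illustration, not the measured waveform in Fig. 2.8); a smooth open phase followed by a zero-flow closed phase is enough to reproduce the lowpass trend noted above.

```python
import numpy as np

fs = 10000                      # sample rate, Hz (assumed)
n_open, n_closed = 60, 40       # 6 ms open phase, 4 ms closed phase (assumed)

# One cycle of a crude glottal pulse: airflow rises and falls smoothly while
# the glottis is open, and is zero while the glottis is closed.
pulse = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n_open) / n_open))
cycle = np.concatenate([pulse, np.zeros(n_closed)])

spec = np.abs(np.fft.rfft(cycle))
freqs = np.fft.rfftfreq(len(cycle), d=1.0 / fs)

# The smooth pulse concentrates its spectral energy at low frequencies.
low = float(np.sum(spec[freqs <= 1000.0] ** 2))
high = float(np.sum(spec[freqs > 1000.0] ** 2))
```

Because the pulse has no sharp discontinuities in flow, its spectrum rolls off quickly; almost all of the energy sits below 1 kHz, mirroring the lowpass spectrum in Fig. 2.8(b).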
Of course, the fundamental period is evident in the speech waveform, as we can see, for example, in Fig. 2.4. The fundamental period is dependent on the size and tension of the speaker's vocal folds at any given instant. Since the average size of the vocal folds in men, for example, is larger than the average in women, the average fundamental frequency of an adult male in speaking a given utterance will often be lower than a female's.

The term pitch is often used interchangeably with fundamental frequency. However, there is a subtle difference. Psychoacousticians (scientists who study the perception of sound) use the term pitch to refer to the perceived fundamental frequency of a sound, whether or not that sound is actually present in the waveform. Speech transmitted over commercial phone lines, for example, is usually bandlimited to about 300-3000 Hz. Nevertheless, a person who is phonating at 110 Hz will be perceived as phonating at 110 Hz by the listener, even though the fundamental frequency of the received waveform cannot be less than 300 Hz. In this case, the psychoacoustician would say that the pitch of the received speech waveform is 110 Hz, while the fundamental frequency is 330 Hz. This quirk of the human auditory system requires that we be careful with these terms. Nevertheless, with this caution, we will routinely use the word "pitch" to mean "fundamental frequency" in this book, since it is conventional to do so. Since we will not be concerned with perceptual phenomena, this will not cause ambiguities to arise.

Everyone has a pitch range to which he or she is constrained by simple physics of his or her larynx. For men, the possible pitch range is usually found somewhere between the two bounds 50-250 Hz, while for women the range usually falls somewhere in the interval 120-500 Hz. Everyone has a "habitual pitch level," which is a sort of "preferred" pitch that will be used naturally on the average. Pitch is shifted up and down in speaking in response to factors relating to stress, intonation, and emotion. Stress refers to a change in fundamental frequency and loudness to signify a change in emphasis of a syllable, word, or phrase. Intonation is associated with the pitch contour over time and performs several functions in a language, the most important being to signal grammatical structure. The marking of sentence, clause, and other boundaries is accomplished through intonation patterns. We shall discuss stress, intonation, and other features related to prosodics in Section 2.3.4.

A More General Look at Excitation Types

The production of any speech sound involves the movement of an airstream. The majority of speech sounds are produced by pushing air from the lungs, through the trachea and pharynx, out through the oral and/or nasal cavities. Since air from the lungs is used, these sounds are called pulmonic; since the air is pushed out, they are also labeled as egressive. All speech sounds in American English, to which we have nominally restricted our discussion, are pulmonic egressive. Other mechanisms for producing an excitation airstream include ejectives, clicks, and implosives. Clicks and implosive sounds are produced when air is drawn into the vocal tract, and are therefore termed ingressive. Ejective sounds occur when only air in the oral cavity is pushed out. Ejectives are found in many Native American languages (e.g., Hopi, Apache, Cherokee), as well as in some African and Caucasian languages. Clicks occur in Southern Bantu languages such as Zulu and Xhosa, and are used in the languages spoken by the Bushmen. Implosives occur in the Native American languages, as well as in many languages spoken in India, Pakistan, and Africa.

2.3 Phonemics and Phonetics

2.3.1 Phonemes Versus Phones

Having considered the physical composition of the speech production system and the manner in which we produce speech sounds, we now focus on the collections of sounds that we use to communicate our thoughts. Once a speaker has formed a thought to be communicated to the listener, he or she (theoretically) constructs a phrase or sentence by choosing from a finite collection of mutually exclusive sounds. The basic theoretical unit for describing how speech conveys linguistic meaning is called a phoneme. For American English, there are about 42 phonemes which are made up of vowels, semivowels, diphthongs, and consonants (nasals, stops, fricatives, affricates). Each phoneme can be considered to be a code that consists of a unique set of articulatory gestures. These articulatory gestures include the type and location of sound excitation as well as the position or movement of the vocal tract articulators.

We must clearly distinguish between phonemes and phones, and phonemics and phonetics. For our purposes, we can think of a phoneme as an ideal sound unit with a complete set of corresponding articulatory gestures. If speakers could exactly and consistently produce (in the case of English) these 42 sounds, speech would amount to a stream of discrete codes. Of course, due to many different factors including, for example, accents, gender, and, most importantly, coarticulatory effects, a given "phoneme" will have a variety of acoustic manifestations in the course of flowing speech. Therefore, any acoustic utterance that is clearly "supposed to be" that ideal phoneme would be labeled as that phoneme. We see, therefore, that from an acoustical point of view, the phoneme really represents a class of sounds that convey the same meaning. Regardless of what is actually uttered for the vowel in "six," if the listener "understands" the word "six," then we would say that the phoneme /I/ was represented in the speech. The phonemes of a language, therefore, comprise a minimal theoretical set of units that are sufficient to convey all meaning in the language. This is to be juxtaposed with the actual sounds that are produced in speaking, which speech scientists call phones. The study
of the abstract units and their relationships in a language is called phonemics, while the study of the actual sounds of the language is called phonetics. More specifically, there are three branches of phonetics, each of which approaches the subject somewhat differently:

1. Articulatory phonetics is concerned with the manner in which speech sounds are produced by the articulators of the vocal system.
2. Acoustic phonetics studies the sounds of speech through analysis of the acoustic waveform.
3. Auditory phonetics studies the perceptual response to speech sounds as reflected in listener trials.

Our work below will represent a blend of articulatory and acoustic analysis.

If a talker is asked to "speak a phoneme" in isolation, the phoneme will be clearly identifiable in the acoustic waveform. However, when spoken in context, phoneme boundaries become increasingly difficult to label. This is due to the physical properties of the speech articulators. Since the vocal tract articulators consist of human tissue, their positioning from one phoneme to the next is not executed by hard mechanical switches, but by movement of muscles that control articulator movement. Accordingly, there is normally a period of transition between phonemes, which under certain conditions can slightly modify the manner in which a phoneme is produced. Therefore, associated with each phoneme is a collection of allophones (variations on phones) that represent slight acoustic variations of the basic unit. Allophones represent the permissible freedom allowed within each language in producing a phoneme, and this flexibility is dependent not only on the phoneme itself, but also on its position within an utterance. Therefore, although we present phonemes in this section as the basic building block for human speech communication, considerable freedom is afforded to the speaker in producing these sounds to convey a thought or concept.

2.3.2 Phonemic and Phonetic Transcription

ously makes a poor phonetic device. In 1888, a group of prominent European phoneticians developed what is known as the International Phonetic Alphabet (IPA) in an effort to facilitate and standardize transcription. The IPA is still widely used and accepted. Part of the IPA is shown in Table 2.1. The complete IPA has sufficient entries to cover phonemes in all the world's languages, and not all are used in all languages and dialects.

The IPA is most appropriate for handwritten transcription, but its main drawback is that it cannot be typed on a conventional typewriter or a computer keyboard. Therefore, a more recent phonetic alphabet was developed under the auspices of the United States Advanced Research Projects Agency (ARPA), and is accordingly called the ARPAbet. There are actually two versions of the ARPAbet, one that uses single-letter symbols, and one that uses all uppercase symbols. The use of all uppercase necessitates some double-letter designators. The two versions of the ARPAbet are given in Table 2.1. Throughout the remainder of this book, we shall consistently use the single-letter ARPAbet symbols for phonetic transcription.

The "raw" symbols shown in Table 2.1 might be more appropriately called a phonemic alphabet because there are no diacritical marks indicated to show allophonic variations. For example, a superscript h is sometimes used to indicate aspiration, the act of delaying the onset of voicing momentarily while exhaling air through a partially open glottis. The difference is heard in the phonemes /p/ (as in "spit") and /pʰ/ (as in "pit"). The difference is subtle, but it is precisely the purpose of diacritical marks to denote subtleties of the phonetic content. In spite of the lack of any significant phonetic information, we shall continue to call the ARPAbet a phonetic alphabet, and the transcriptions employing it phonetic transcriptions, as is customary in the literature.

As has been our convention so far, we shall place phonetic transcriptions between slashes in this book (e.g., /s/). As some examples of phonetic transcriptions of complete English words, consider the entries in Table 2.2.
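A transcription of the sort just described is naturally represented as a list of symbols. In the sketch below, the /s/ and /I/ of "six" come from this chapter's own examples; the /k/ and the idea of a simple lookup lexicon are illustrative assumptions, not the book's method or the contents of Table 2.2.

```python
# A phonemic transcription stored as a list of single-letter ARPAbet-style
# symbols. Lexicon entries here are illustrative, not taken from Table 2.2.
lexicon = {"six": ["s", "I", "k", "s"]}

def transcribe(word):
    """Render a word's transcription between slashes, as in the text."""
    return "/" + " ".join(lexicon[word]) + "/"

print(transcribe("six"))   # prints /s I k s/
```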
[TABLE 2.1. Phonetic symbols: part of the International Phonetic Alphabet alongside the single-letter and all-uppercase versions of the ARPAbet.]

6. The stationarity of the phoneme.

A phoneme is stationary or continuant if the speech sound is produced by a steady-state vocal-tract configuration. A phoneme is noncontinuant if a change in the vocal-tract configuration is required during production of the speech sound. The phonemes in Table 2.3 are classified based on continuant/noncontinuant properties. Vowels, fricatives, affricates, and nasals are all continuant sounds. Diphthongs, liquids, glides, and stops all require a vocal-tract reconfiguration during production and hence are labeled noncontinuant. Due to their required vocal-tract movement, noncontinuant phonemes are generally more difficult to characterize and model than continuant ones.

Let us now study the different classes of phonemes mentioned above in some detail. We will study both articulatory and acoustic features of these classes.

Vowels and Vowellike Phonemes

Vowels. There are 12 principal vowels in American English. Phoneticians often recognize a thirteenth vowel called a schwa vowel, which is a sort of "degenerate vowel" to which many others gravitate when articulated hastily in the course of flowing speech. The phonetic symbol we have adopted for the schwa in this book is /x/. The initial vowel in "ahead" is a schwa vowel. The schwa occurs when the tongue hump does not have time to move into a precise location and assumes a neutral position in the vocal tract, so that the tract approximates a uniform tube. The resulting vowel is short in duration and weak in amplitude. Except for its occurrence as a "lazy" vowel, for our purposes it is not much different from the vowel /A/ occurring in "bud." Therefore, we will speak of a schwa vowel to connote the "unintentional" neutrality of a vowel, but in some of the discussions below will not attempt to distinguish it acoustically from the "proper" vowel /A/.
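The continuant/noncontinuant split described above can be encoded directly. This is a trivial sketch; the class names are exactly those classified in the text and in Table 2.3.

```python
# Continuant vs. noncontinuant phoneme classes, per the classification above.
CONTINUANT = {"vowels", "fricatives", "affricates", "nasals"}
NONCONTINUANT = {"diphthongs", "liquids", "glides", "stops"}

def is_continuant(phoneme_class):
    """True if the class is produced with a steady-state vocal-tract shape."""
    if phoneme_class in CONTINUANT:
        return True
    if phoneme_class in NONCONTINUANT:
        return False
    raise ValueError("unknown class: " + phoneme_class)
```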
TABLE 2.3. Phonemes Used in American English.

Continuant
  Vowels
    Front: /i/, /I/, /e/, /E/, /@/
    Mid: /R/, /x/, /A/
    Back: /u/, /U/, /o/, /c/, /a/
  Fricatives
    Voiced: /v/, ...
    Unvoiced: /f/, ...
    Whisper: /h/
  Affricates: /J/, ...
  Nasals: /m/, ...

Noncontinuant
  Diphthongs
  Semivowels
    Liquids: /r/, /l/
    Glides: /w/, /y/
  Consonants (stops)
    Voiced: /b/, /d/, /g/
    Unvoiced: /p/, /t/, /k/

vowel group. Figure 2.9 shows how the vowels are arranged based on these two articulatory features.

The approximate configurations of the vocal-tract articulators for the vowels in Fig. 2.9 are shown in column (a) of Fig. 2.10. The physiological variation between high-medium-low front vowel versus high-medium-low back vowel can be seen by comparing the vocal-tract profiles for /i, I, @/ with /u, o, a/. Also illustrated in Fig. 2.10 are corresponding acoustic waveforms and vocal-tract frequency representations for each vowel. A variety of acoustic features can be seen in the time waveforms and spectral plots. The time waveforms show that vowels are quasi-periodic due to the cyclical vocal-fold movement at the glottis which serves as excitation. The time waveforms also show that the resonant structure of the vocal tract changes as tongue-hump position and degree of constriction are varied. The changing resonant structure is reflected as shifts in formant frequency locations and bandwidths as pictured in the vocal-tract spectral plots. Vowels can be distinguished by the location of formant frequencies (usually the first three formants are sufficient). As an example, it has been shown through X-ray sketches that the neutral vowel /x/ results in a nearly constant cross-sectional area from the glottis through the lips. The formant frequencies for a male speaker occur near 500, 1500, 2500, 3500 Hz, and so on. F1 and F2 are closely tied to the shape of the vocal-tract articulators. The frequency location of the third formant, F3, is significant to only a few specific sounds. The fourth and higher formants remain relatively constant in frequency regardless of changes in articulation.

Formant frequency locations for vowels are affected by three factors: the overall length of the pharyngeal-oral tract, the location of constrictions along the tract, and the narrowness of the constrictions. A set of rules relating these factors to formants is shown in Table 2.4.
Iii
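The uniform-tube picture behind the 500, 1500, 2500, 3500 Hz pattern can be checked directly: a lossless tube closed at the glottis and open at the lips resonates at odd multiples of c/4L. The sketch below assumes round illustrative values (tract length 17.5 cm, sound speed 35,000 cm/s); these numbers are not taken from the text.

```python
# Resonances of a uniform tube closed at one end (glottis) and open at the
# other (lips): F_n = (2n - 1) * c / (4 * L).
# Assumed round values: L = 17.5 cm, c = 35,000 cm/s.
def uniform_tube_formants(length_cm=17.5, c_cm_per_s=35000.0, n_formants=4):
    """Return the first n_formants resonance frequencies in Hz."""
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

print(uniform_tube_formants())  # [500.0, 1500.0, 2500.0, 3500.0]
```

A shorter tract shifts every resonance upward, which is consistent with the inverse-proportionality rule of Table 2.4 relating tract length to formant location.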
FIGURE 2.10. A collection of features for vowels in American English. Column (a) represents schematic vocal-tract profiles, (b) typical acoustic waveforms, and (c) the corresponding vocal-tract magnitude spectrum for each vowel.
The length of the vocal tract affects the frequency locations of all vowel formants. We see (Table 2.4) that a simple inverse proportionality rule relates the overall vocal-tract length from the glottis to the lips with the location of formant frequencies. However, in general the location and spacing of formants F3 and above are more closely correlated with vocal-tract length than are F1 and F2. The F1 and F2 formants have been shown to be related to the location and amount of constriction in the vocal tract. Figure 2.10(a) shows diagrams of the vocal tract for different vowels based on an X-ray study by Lindblom and Sundberg (1971). Several observations concerning vocal-tract constriction can be made in these diagrams. The tongue positions for front vowels /i, I, e, E, @/ form a series of vocal configurations that are progressively less constricted at the palate. Also, the jaw is more open and the pharynx more constricted for the /E, @/ phonemes versus /i, I/. The back vowels /u, U/ differ in the amount of constriction by the tongue toward the back of the palate. As can also be seen, the lips are more rounded for /u/, and less so for /o/. The back vowels /c, a/ have much less lip constriction, but are formed with progressively more pharyngeal constriction going from /o/ to /a/. Generally speaking, there are two rules that relate vocal-tract constriction and F1. The first is for oral constriction and says that F1 is lowered by any constriction in the front of the oral cavity. Therefore, if the tongue is at all pushed up toward the front or middle of the palate, F1 will be lower than for the neutral vowel /x/ (F1 = 500 Hz). The second rule says that F1 is raised by pharyngeal constriction. Two rules are also seen to relate F2 to vocal-tract constrictions. The frequency of F2 is dependent on whether the tongue constriction is near the front of the oral tract (front vowels) or
The final rule relates the effect of lip-rounding on formant locations. It is seen that lip-rounding tends to lower all formants. As the schematic vocal-tract profiles in Fig. 2.10(a) indicate, lip-rounding plays an important part in forming the back vowels. We will also see that lip-rounding affects frequency characteristics of consonants as well. Lip position begins with wide-open lips for /a/ and progresses toward more constricted (i.e., rounded) lip configurations for /c/, /o/, /U/, and /u/.

The central vowels fall between the front and back vowels. These vowels, /R/ and /A/ (or /x/), are formed with constriction in the central part of the oral tract. The /R/ vowel is formed with the central part of the tongue raised mid-high toward the palate. The tip of the tongue is either lifted toward the front part of the palate or pulled backward along the floor of the mouth in a configuration called retroflexed. Muscles of the …

… formant patterns from the vocal-tract shape. They work best when a single constriction is the dominant feature of the vocal tract. When two constrictions operate on the vocal tract, a rule may or may not apply over the entire range of constrictions. As an example, lip-rounding has a more …

FIGURE 2.11. Average formant locations for vowels in American English (Peterson and Barney, 1952).
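Average formant locations such as those of Fig. 2.11 suggest a simple use of the first two formants: label an unknown vowel by the stored target nearest in the (F1, F2) plane. A minimal sketch follows; the Hz values are rough textbook-style male-speaker averages used as illustrative placeholders, not data read from the figure.

```python
# Nearest-target vowel labeling from measured (F1, F2), a toy classifier.
# The targets below are illustrative placeholder averages in Hz.
VOWEL_TARGETS = {
    "/i/": (270, 2290),
    "/@/": (660, 1720),
    "/a/": (730, 1090),
    "/u/": (300, 870),
}

def classify_vowel(f1, f2):
    """Return the vowel whose (F1, F2) target is closest (squared Euclidean)."""
    return min(VOWEL_TARGETS,
               key=lambda v: (VOWEL_TARGETS[v][0] - f1) ** 2
                           + (VOWEL_TARGETS[v][1] - f2) ** 2)

print(classify_vowel(280, 2250))  # /i/
```

Real systems must cope with the large across-speaker variability noted below, typically by normalizing formants per speaker before comparing against targets.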
FIGURE 2.12. The vowel triangle. Plot of F1 versus F2 for vowels in American English.

Bandwidths of formants can also be used to characterize vowels, although their variation is not as pronounced across vowels as frequency location. Figure 2.13 shows average and extreme values of the first three formant bandwidths from 20 male speakers [each vowel was produced by each speaker twice (Dunn, 1961)]. Formant frequency locations versus bandwidth measurements are shown above the bandwidth results. Vowel bandwidth trends are noted for the three formants (dashed lines for each formant). This graph indicates a tendency for the bandwidth of a formant to increase with its center frequency, with this trend much more pronounced for F3. It is evident that although formant location is of primary importance in characterizing vowels, differences in bandwidth also contribute to the overall vowel character.

FIGURE 2.13. Average formant bandwidths for vowels in American English (Dunn, 1961). Averaged over all vowels, the bandwidths of the first three formants are 49.7, 64.0, and 115.2 Hz, respectively.

Finally, it should be reemphasized that there exists significant variability in vowel formant characteristics (frequencies, spectral magnitudes, bandwidths) across speakers. However, the data in Figs. 2.11, 2.12, and 2.13 have served as useful guidelines for many purposes over several decades.

Diphthongs. Vowels are voiced speech sounds involving a theoretically constant vocal-tract shape. The only articulator movement occurs during the initial transition to, and the exiting transition from, the nominal vocal-tract configuration for the vowel. A diphthong involves an intentional movement from one vowel toward another vowel. There exists some ambiguity as to what constitutes a diphthong, since articulator movement from one vowel toward another might be confused with a sequence of two distinct vowels. A reasonable definition for a diphthong is a vocalic syllable nucleus containing two target vowel positions. The first target vowel is usually longer than the second, but the transition between the targets is longer than either target position (Lehiste and Peterson, 1961). The three diphthongs that are universally accepted are /Y/, /W/, and /O/ (corresponding examples occur in "pie," "out," and "toy"). It should be emphasized that although a diphthong represents a transition from one vowel target to another, frequently it is the case that neither vowel is actually reached. For example, in the diphthong /Y/ ("pie"), the initial vowel position is neither /a/ ("hot") nor /@/ ("had"), and the second is neither /i/ ("heed") nor /I/ ("hid"). The easiest method for determining whether a sound is a vowel or diphthong is to simply produce the sound. If your vocal tract does not maintain a constant shape, or if the sound cannot be sustained without articulator movement and both vocal targets are vowels, then the sound is a diphthong.

Vocal-tract movement for a diphthong can be illustrated by a plot of the F1-F2 transition as in Fig. 2.14. The arrows indicate the direction of motion of the formants versus time. The dashed circles indicate average positions for vowels. Figure 2.14 confirms that the three diphthongs /Y/, /W/, and /O/ move from one vowel target to a second, but in most cases do not achieve either vowel configuration.

FIGURE 2.14. Movements of F1 and F2 for some diphthongs in American English. After Holbrook and Fairbanks (1962).

The three diphthongs /Y/, /W/, /O/ contain two steady-state vowel target configurations. A second group of diphthongized sounds exists that has only one steady-state target. These sounds are usually referred to as diphthongized vowels. They occur in speech because of the tendency to add a "glide" to the beginning or end of long vowels. For example, when we say the word "bay" we usually do not say /be/, but /beI/. The /eI/ sound13 in this example has a long glide as the first element. The sound /oU/ as in "boat" (/boUt/, not /bot/), however, has a short steady-state target as the first element, followed by a long glide as the second. Another example is /uw/ as in "who" (/huw/, not /hu/), which has a long steady glide. Finally, the glide-vowel sequence /yu/ as in "you" is sometimes called a diphthong; however, other glide-vowel sequences do exist, so that singling this one out can be misleading.

13Note that single-letter phonetic symbols do not exist for these diphthongized vowels. We create the appropriate sound from two symbols.

Semivowels. The group of sounds consisting of /w/, /l/, /r/, and /y/ are called semivowels. Semivowels are classified as either liquids (/r/ in "ran," /l/ in "lawn") or glides (/w/ in "wet," /y/ in "yam"). A glide is a vocalic syllable nucleus consisting of one target position, with associated formant transitions toward and away from the target. The amount of time spent moving toward and away from the target is comparable to the amount of time spent at the target position. Glides can be viewed as transient sounds, as they maintain the target position for much less time than vowels. Liquids also possess spectral characteristics similar to vowels, but they are normally weaker than most vowels due to their more constricted vocal tract. Time waveforms and vocal-tract profiles for the beginning and ending positions are shown in Fig. 2.15 for liquids and glides.

Consonants

The consonants represent speech sounds that generally possess vocal-tract shapes with a larger degree of constriction than vowels. Consonants may involve all of the forms of excitation that we discussed in Section 2.2.3. Some consonants may require precise dynamic movement of the vocal-tract articulators for their production. Other consonants, however, may not require vocal articulator motion, so that their sounds, like vowels, are sustained. Such consonants are classified as continuants. Sounds in which the airstream enters the oral cavity and is completely stopped
for a brief period are called stops (the term plosives is also used to describe nonnasal stops). The continuant, nonplosive consonants are the fricatives and nasals. We have chosen to place the semivowels between vowels and consonants, since they require more extensive vocal-tract constriction and articulator movement than vowels. These sounds are produced by movements that form partial constrictions of the vocal tract. The stop consonants /b, d, g, p, t, k/ completely inhibit the breath stream for a portion of the articulatory gesture.

FIGURE 2.15. Time waveforms and vocal-tract profiles for the beginning and ending positions of liquids and glides in American English.

Fricatives. Fricatives are produced by exciting the vocal tract with a steady airstream that becomes turbulent at some point of constriction. The point of constriction is normally used to distinguish fricatives, as shown in Table 2.5. Locations of constrictions for fricative consonants include labiodental (upper teeth on lower lip), interdental (tongue behind
front teeth), alveolar (tongue touching gum ridge), palatal (tongue resting on hard or soft palate), and glottal (vocal folds fixed and tensed). The constriction in the vocal tract (or glottis) results in an unvoiced excitation source; some fricatives also include a voiced excitation component, in which case they have what we have called mixed excitation. Those with simple unvoiced excitation are usually called unvoiced fricatives, while those of mixed excitation are called voiced fricatives.
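The voiced/unvoiced distinction can be cued by where spectral energy concentrates: voiced fricatives carry strong low-frequency (voice-bar) energy, while unvoiced fricatives concentrate energy at mid and high frequencies. The sketch below illustrates the idea on synthetic stand-ins for speech frames; the 1 kHz split, the toy signals, and the naive DFT are illustrative assumptions, not the book's method.

```python
# Fraction of spectral energy below a split frequency, as a crude
# voiced/unvoiced fricative cue.  Naive DFT for clarity, not speed.
import math
import random

def low_band_fraction(frame, fs, split_hz=1000.0):
    """Return low-band energy / total energy over positive frequencies."""
    n = len(frame)
    low = total = 0.0
    for k in range(1, n // 2):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power = re * re + im * im
        total += power
        if k * fs / n < split_hz:
            low += power
    return low / total

fs, n = 8000, 128
# A 150 Hz tone stands in for voice-bar energy; white noise for frication.
voiced_like = [math.sin(2 * math.pi * 150 * i / fs) for i in range(n)]
random.seed(0)
unvoiced_like = [random.gauss(0.0, 1.0) for _ in range(n)]

print(round(low_band_fraction(voiced_like, fs), 2))
print(round(low_band_fraction(unvoiced_like, fs), 2))
```

On real speech this statistic is computed per analysis frame, and a threshold (or a trained classifier) separates the two classes.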
In the unvoiced case, the constriction causes a noise source anterior to the constriction. The location of the constriction serves to determine the fricative sound produced. Unvoiced fricatives include /f/ ("free"), /T/ ("thick"), /s/ ("cease"), /S/ ("mesh"), and /h/14 ("heat"). These are listed in the order in which the point of constriction moves from labiodental to glottal. The constriction also separates the vocal tract into two cavities. The spectral cutoff is inversely proportional to the length of the cavity anterior to the constriction. The back cavity acts as an energy trap, which introduces antiresonances in the lower portion of the frequency spectrum (Heinz and Stevens, 1961). Vocal-tract profiles, time waveforms, and sample vocal-tract frequency responses for unvoiced fricatives are shown in Fig. 2.16. The major constriction and its effect on low-frequency energy content is evident. Time waveforms also reflect the nonperiodic nature of sound production.

14The whisper sound /h/ is normally referred to as an unvoiced glottal fricative.

FIGURE 2.16. A collection of features for unvoiced fricatives in American English. Column (a) represents schematic vocal-tract profiles, (b) typical acoustic waveforms, and (c) the corresponding vocal-tract magnitude spectra.

The voiced fricatives /v/ ("vice"), /D/ ("then"), /z/ ("zephyr"), and /Z/ ("measure") are the voiced counterparts of the unvoiced fricatives /f/, /T/, /s/, /S/, respectively. Voiced fricatives also possess the usual frication noise source caused at the point of major constriction, but also have periodic glottal pulses exciting the vocal tract. Since two points of excitation exist for the vocal tract, the vocal-tract spectra of voiced fricatives are expected to differ from unvoiced fricatives. The labiodental /v/ and interdental /D/ voiced fricatives are almost periodic, resulting from a higher degree of excitation at the glottis. The alveolar /z/ and palatal /Z/ voiced fricatives, on the other hand, are more noiselike, giving rise to significant energy in the high-frequency portion of the spectrum. These voiced fricatives also contain a voice bar, which is a very low-frequency formant (near 150 Hz). The voice bar occurs because the vocal folds are able to vibrate, exciting an occluded oral and nasal cavity. Figure 2.17 illustrates vocal-tract profiles, time waveforms, and sample vocal-tract frequency responses for voiced fricatives. Comparison with Fig. 2.16 reveals the differences in time and frequency characteristics between voiced and unvoiced fricatives. The unvoiced fricatives /f, T, s, S/ have more energy at the middle and high frequencies than at lower frequencies. In contrast, voiced sounds always have more energy in the low frequencies (below 1 kHz) than at high frequencies.

FIGURE 2.17. A collection of features for voiced fricatives in American English. Column (a) represents schematic vocal-tract profiles, (b) typical acoustic waveforms, and (c) the corresponding vocal-tract magnitude spectra.

Affricates. Like glides, liquids, and diphthongs, affricates are dynamic sounds; they are formed by a transition from a stop to a fricative. The two affricates found in American English are the unvoiced affricate /C/ as in "change" and the voiced affricate /J/ as in "jam." The unvoiced affricate /C/ is formed by the production of the unvoiced stop /t/, followed by a transition to the unvoiced fricative /S/. The voiced affricate /J/ is formed by producing a voiced stop /d/, followed by a vocal-tract transition to the voiced fricative /Z/.

Stops (or Plosives). The majority of speech sounds in American English can be described in terms of a steady-state vocal-tract spectrum. Some sounds, such as diphthongs, glides, and liquids, possess one or more vocal-tract target shapes, with smooth movement of the vocal-tract articulators either to or from a target position. The stop consonants /b, d, g, p, t, k/ are transient, noncontinuant sounds that are produced by building up pressure behind a total constriction somewhere along the vocal tract, and suddenly releasing this pressure. This sudden explosion
and aspiration of air characterizes the stop consonants. Stops are sometimes referred to as plosive phonemes, and their mode of excitation is what we called plosive excitation in Section 2.2.3. As summarized in Table 2.5, the closure can be bilabial (/b/ in "be," /p/ in "pea"), alveolar (/d/ in "day," /t/ in "tea"), or velar (/g/ in "go," /k/ in "key").

The unvoiced stop consonants /p, t, k/ are produced by building up air pressure behind the total constriction; aspiration occurs at the glottis before vocal fold movement is required in the ensuing voiced sound. Unvoiced stops usually possess longer periods of frication than voiced stops. The frication and aspiration is called the stop release. The interval of time leading up to the release, during which pressure is built up, is called the stop gap. These features are evident in the time waveforms in Fig. 2.18.

Voiced stops /b, d, g/ are similar to the unvoiced stops, except they include vocal fold vibration that continues through the entire stop or begins after the occlusion release. During the period in which pressure is building behind the oral tract closure, some energy is radiated through the walls of the throat. This leads to the presence of a voice bar in the acoustic analysis.

FIGURE 2.18. A collection of features for voiced and unvoiced stops in American English. Column (a) represents schematic vocal-tract profiles just prior to release, (b) typical acoustic waveforms, and (c) the corresponding vocal-tract magnitude spectra.
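The stop gap and release just described can be located crudely in a waveform by short-term energy: a run of near-silence (the gap) followed by a sudden energy jump (the release burst). The sketch below runs on a synthetic signal; the frame length and jump threshold are arbitrary illustrative choices, not values from the text.

```python
# Crude stop-release detector: flag the first frame whose energy jumps by a
# large factor over the previous frame.
import random

def find_release(signal, frame_len=32, jump=10.0):
    """Return the sample index where the burst frame starts, or None."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame) + 1e-12)  # avoid divide-by-zero
    for i in range(1, len(energies)):
        if energies[i] / energies[i - 1] > jump:
            return i * frame_len
    return None

# Synthetic "stop": 256 samples of silence (the stop gap), then a noise burst.
random.seed(1)
signal = [0.0] * 256 + [random.gauss(0.0, 1.0) for _ in range(256)]
print(find_release(signal))  # 256
```

On real speech, voiced stops complicate this picture: the voice bar keeps the "gap" from being silent, so practical detectors weight high-frequency energy instead of total energy.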
stop occlusion. Stops can also be altered if they appear between vowels, resulting in a tap or flap (Zue and Laferriere, 1979). This variant is produced when one articulator is thrown against another, such as when the tip of the tongue touches the upper teeth or alveolar ridge in "letter." Since stop consonants are transient, their properties are highly influenced by prior or subsequent phonemes (Delattre et al., 1955). Therefore, the waveforms in Fig. 2.18 provide little information for distinguishing the stop consonants. Vocal-tract spectra for stops are sometimes useful; however, a number of allophones can produce radically different spectra at the time of release.

Nasals. The nasal consonants /m, n, G/ are voiced sounds produced by the glottal waveform exciting an open nasal cavity and closed oral cavity. Their waveforms resemble vowels, but are normally weaker in energy due to the limited ability of the nasal cavity to radiate sound (relative to the oral cavity). In forming a nasal, a complete closure is made toward the front of the vocal tract, either at the lips (labial or front nasal /m/ as in "more"), with the tongue resting on the gum ridge (alveolar or mid-nasal /n/ as in "noon"), or by the tongue pressing at the soft or hard palate (velar or back nasal /G/ as in "sing"). The velum is opened wide to allow for sound propagation through the nasal cavity.

Figure 2.19 shows vocal-tract profiles, time waveforms, and vocal-tract frequency responses for the three nasals in American English. The closed oral cavity is still acoustically coupled to the pharyngeal and nasal cavities, and will therefore affect the resulting spectral resonances by trapping energy at certain frequencies. This phenomenon gives rise to antiresonances in the overall vocal system. The length of the closed oral cavity is inversely proportional to the frequency location of the antiresonance. For nasals, formants occur approximately every 850 Hz instead of every 1 kHz. A low formant (F1) is found near 250 Hz that dominates the frequency spectrum. F2 is weak, resulting from the antiresonance of the closed oral cavity, and F3 occurs near 2200 Hz. The antiresonance produces a spectral zero in the frequency response whose location is inversely proportional to the length of the constricted oral cavity. The spectral zero occurs in the range 750-1250 Hz for /m/, 1450-2200 Hz for /n/, and above 3000 Hz for /G/. Bandwidths of nasal formants are normally wider than those for vowels. This is due to the fact that the inner surface of the nasal cavity contains extensive surface area, resulting in higher energy losses due to conduction and viscous friction. Since the human auditory system is only partially able to perceptually resolve spectral nulls, discrimination of nasals based on the place of articulation is normally cued by formant transitions in adjacent sounds.

Although nasals are the only phonemes that incorporate the nasal cavity to produce their resonant frequency structure, some phonemes become nasalized if they precede or follow a nasal sound. This occurs primarily in vowels that precede a nasal, where the velum begins to drop in anticipation of the ensuing consonant. Nasalization produces phonemes that have broader F1 bandwidths and are less peaked than those without nasal coupling. This is caused by damping of the formant resonance by the loss of energy through the opening into the nasal cavity (House and Stevens, 1956). Other spectral changes include spectral zeros that cancel or trap energy within the vocal tract. The degree to which nasalization affects the spectrum depends on the amount of coupling between the two cavities. If the velum is only slightly lowered, weak nasal cavity coupling produces minor changes in the resulting frequency response. The degree of coupling directly influences the location and strength of the spectral zeros. Changes in spectral characteristics for a nasalized vowel (strongly coupled) are shown in Fig. 2.20. The resulting nasalization produces a zero near 600 Hz and cancels most of the third formant. The position of these zeros should not, however, be considered to hold for weak degrees of nasalization of other vowels, since the vocal-tract shape and amount of velar opening interact to determine the position of spectral zeros.
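The inverse relation between the closed oral cavity's length and the zero location can be sketched with a quarter-wavelength approximation, f_zero roughly c/4L, treating the side-branch cavity as a closed tube. The cavity lengths below are illustrative assumptions chosen so the zeros land inside the per-nasal ranges quoted in the text; they are not measured values.

```python
# Quarter-wavelength approximation for the nasal spectral zero: a closed
# oral-cavity side branch of length L (cm) traps energy near c / (4 * L).
C = 35000.0  # speed of sound, cm/s (round value)

def oral_cavity_zero(length_cm):
    """Approximate spectral-zero frequency (Hz) for a closed side cavity."""
    return C / (4.0 * length_cm)

# Longest cavity (lips closed, /m/) gives the lowest zero; the short cavity
# behind a velar closure (/G/) gives the highest.  Lengths are assumptions.
for nasal, length_cm in [("/m/", 8.75), ("/n/", 5.0), ("/G/", 2.5)]:
    print(nasal, oral_cavity_zero(length_cm), "Hz")
```

The same side-branch reasoning explains why the zero moves upward as the closure point moves back in the mouth: the trapped cavity gets shorter.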
FIGURE 2.19. A collection of features for nasals in American English. Column (a) represents schematic vocal-tract profiles, (b) typical acoustic waveforms, and (c) the corresponding vocal-tract magnitude spectra.

FIGURE 2.20. An example of the effects of nasalization on a vowel spectrum (/u/ versus nasalized /u/). Note the appearance of spectral nulls in the nasalized case.

2.3.4 Prosodic Features and Coarticulation

Prosodies

Our discussion of speech has focused on characterizing speech in terms of articulatory phonetics (the manner or place of articulation) and acoustic-phonetics (frequency structure, time waveform characteristics). Speech production, however, involves a complex sequence of articulatory movements, timed so that vocal-tract shapes occur in the desired phoneme sequence order. From an acoustic-phonetics point of view, expressive uses of speech depend on tonal patterns of pitch, syllable stresses, and timing to form rhythmic speech patterns. Here we will focus on acoustic-phonetic aspects of rhythm and pitch as a function of their linguistic structure in American English.15

Timing and rhythms of speech contribute significantly to the formal linguistic structure of speech communication. The tonal and rhythmic aspects of speech are generally called prosodic features. Since they normally extend over more than one phoneme segment, prosodic features are said to be suprasegmental. Prosodic features are created by certain special manipulations of the speech production system during the normal sequence of phoneme production. These manipulations are categorized as either source factors or vocal-tract shaping factors. The source factors are based on subtle changes in the speech breathing muscles and vocal folds, while the vocal-tract shaping factors operate via movements of the upper articulators. The acoustic patterns of prosodic features are heard in systematic changes in duration, intensity, fundamental frequency, and spectral patterns of the individual phonemes. The aim of this section is to …

Prevailing theories have stated that the time-varying vocal-tract transfer function carries most of the information concerning which phonemes are produced, with glottal source excitation characteristics conveying prosodic cues such as intonation and stress. Intonation and stress are the two most important prosodic features that convey linguistic information. Stress is used to distinguish similar phonetic sequences or to highlight a syllable or word against a background of unstressed syllables.16 For example, consider the two phrases "That is insight" and "That is in sight." In the first phrase there is stress on "in" but "sight" is unstressed, while the converse is true in the second phrase. The analytical question that has received a great deal of attention in recent years is how many degrees of stress should be recognized to account for all such contrasts. A structuralist approach17 suggests that four such degrees are normally distinguishable (from strongest to weakest): (1) primary, (2) secondary, (3) tertiary, and (4) weak. It is only possible to distinguish these stress levels for words in isolation. A good example is the compound word "elevator operator" (syllable stress pattern 1-4-3-4 2-4-3-4), which shows each of the stress levels. One could also rise above the syllables to a word, or lexical, stress level, as in the examples "Joe writes on a black board" ("black" used as an adjective) and "Joe writes on a blackboard" ("blackboard" is a noun). The second sentence has additional stress placed on "blackboard."

Intonation refers to the distinctive use of patterns of pitch or melody. An analysis of intonation is performed by considering pitch patterns in terms of contours, for which pitch range, height, and direction of change are generally characterized. Intonation performs several useful functions in language, the most important being to signal grammatical structure.18 Intonation therefore performs a role similar to punctuation in writing. Intonation, however, has much wider contrasts, such as marking sentence, clause, or other boundaries, as well as contrasting grammatical sentence structure such as in questions or statements. For example, consider the rising and falling pitch patterns in the second speaker's utterance in the following two pieces of dialogue. In the first, Charles does not know

15It should be noted, however, that since foreign languages differ from American English in their basic speech building blocks (phonemes), so too will their tonal patterns of pitch, syllable stresses, and sentence timing. Since most students study a non-native language … to consider textbooks cited in Appendix 1.F, especially (Chomsky and Halle, 1968; Ladefoged, 1967; Lieberman, 1967; Perkell and Klatt, 1986).

16Stress can refer to syllabic, word, phrase, or sentence-level stress. Phonologically, it simply is a means of distinguishing degrees of emphasis or contrast.

17A structuralist approach to language focuses on the way linguistic features are described in terms of structures and systems. Structure in linguistics can be found in phonology, semantics, and grammar. These concepts will be further discussed in Chapter 13.
140 Ch. 2 / Fundamentals of Speech Science    2.3 / Phonemics and Phonetics 141
whether Casey is even trying to come to the meeting, and his intonation (rising pitch at the end) indicates a question: …

[A figure contrasts pitch (frequency) contours for a STATEMENT and a QUESTION.]

…tion, the overall subglottal air pressure remains approximately constant. What causes the change in pitch to produce the rising intonation for questions and falling intonation for statements? It is currently believed that the major factor controlling pitch is the tension in the vocal folds. Although studies continue in this area, it appears that the strongest stress of a syllable in a phrase is produced by a combination of increased vocal fold tension and a peak in subglottal pressure. Increased vocal fold tension is applied through small muscles in the larynx.

We summarize the major points concerning stress and intonation as follows:

• The phonetic patterns for stress and intonation are based on a blending of changes in vocal fold tension, peaks and valleys of subglottal air pressure, and variation in duration of vowels and consonants.
• Movement of the fundamental frequency (or pitch) is reflected in changes in subglottal air pressure and vocal fold tension. If either (or both) increases, a corresponding increase in fundamental frequency also occurs.
• The general shape of the subglottal air pressure over a phrase is relatively constant and does not reflect the wide variations seen in the pitch contour.

Based on earlier work by Lieberman (1967), a simple breath group theory of intonation was proposed to explain how patterns of subglottal pressure and vocal fold tension are encoded to relay stress and intonation. The theory states that the intonation contour will naturally fall toward the end of a breath group due to lower subglottal air pressure, just prior to taking a new breath of air. The breath group can then be marked as a question by simply increasing the vocal fold tension. In its simplest form, the breath group theory assumes a high degree of independence between vocal fold tension and subglottal pressure. Recent studies suggest that intonation contours are primarily controlled by vocal fold tension for high-pitched sections, and by subglottal pressure during lower pitch portions of the contour. It is also thought that the vertical position of the larynx determines whether the vocal fold tension or subglottal pressure is the dominant factor.

Other glottal source factors which are related to stress and intonation are the intensity and spectral balance between high and low frequency regions of the glottal source spectrum. An increase in subglottal pressure will increase the intensity of the open phase of the glottal source waveform (increased amplitude of the air puffs produced at the glottis). If a voiced syllable is stressed, the increased subglottal pressure gives rise to a glottal source spectrum with increased high frequency content, relative to the lower frequencies. Therefore, vowels which are stressed will possess higher magnitudes for the higher-frequency formants than unstressed vowels.

Although these glottal source factors contribute greatly to stress and intonation, variations in duration are also used to encode prosodic features. A syllable is normally longer if it appears in the final portion of a breath group. Unstressed words normally contain vowels that are shorter. Also, the type of consonant that follows a vowel greatly influences the duration of that vowel (e.g., consider the duration of the vowels in "fate" and "fade"). Consonant duration is affected by syllable stress and the position of the syllable in the sentence in a manner similar to those observed for vowels. Phoneticians have described rules for prosodic effects of duration for vowels and consonants to explain the origins of speech motor control. Many of these rules form the basis of natural sounding text-to-speech systems. In general, these rules state that the larger the number of subunits (i.e., phonemes, syllables) in a larger unit, the shorter each subunit becomes up to some limit of compressibility. It is suggested that this duration compression occurs due to the limited number of speech units that can be temporarily held in our speech motor memory.

This discussion, though far from complete, has served to relate several factors responsible for prosodic features.

Coarticulation

The production of speech requires movement of the articulators in the vocal tract and the glottal source. Unlike distinct characters found in printed text, phoneme articulations typically overlap each other in time, thereby causing sound patterns to be in transition most of the time. Instead of quick rigid articulator movement between uniform islands of stationary phoneme production, speech is normally produced with smooth movement and timing of the articulators to form the proper vocal-tract shapes to produce the desired phoneme sequence. Coarticulation is the term used to refer to the change in phoneme articulation and acoustics caused by the influence of another sound in the same utterance.

Movements of the articulators are fairly independent of one another. The muscle groups associated with each articulator, as well as mass and position, affect the degree and ease of movement. The movements of the tongue and lips are free to overlap in time; however, the actual tongue articulation for a vowel-consonant-vowel sequence can vary depending on the target positions the tongue must attain for the specific vowels and consonants. Because each phoneme has a different set of articulator position goals (not all of which are strictly required), considerable freedom exists during actual speech production. There are two primary factors underlying coarticulation effects:

1. The specific vocal-tract shapes which must be attained.
2. The motor program for production of the sequence of speech units (consonants, vowels, syllables, words, and phrases).

If an articulatory gesture does not conflict with the following phoneme, the given articulator may move toward a position more appropri…
…however, we shall see that effective speech modeling, synthesis, and recognition techniques are possible without a complete theory of speech motor production, although future techniques certainly will benefit from such emerging theories.
2.4 Conclusions

The purpose of this chapter has been to present an overview of the fundamentals of speech science. The goal has been to provide a sufficient background to pursue applications of digital signal processing to speech. Speech is produced through the careful movement and positioning of the vocal-tract articulators in response to an excitation signal that may be …

…though a complete model of such higher level processing does not yet exist, well-defined rules have been collected. Such knowledge is necessary for natural sounding synthesis and text-to-speech systems. A more detailed treatment of the areas of speech science will prove useful in many of the applications in the following chapters, but it will be more useful to explore these questions after becoming familiar with techniques for obtaining general models of speech production and their use in coding, enhancement, or recognition.
2.5 Problems
2.1 Explain the differences between pitch and fundamental frequency. Which is a function of glottal vocal fold movement? Which is a function of both vocal fold movement and processing by the auditory system?

2.2 Discuss the forces that act to produce a complete phonation cycle during voiced speech production. How do the forces during the phonation vary if the density of air is reduced (e.g., if the speaker is breathing a helium-oxygen mixture)? Suggest a means to remedy this unnatural sounding speech.

2.3 The speech spectrogram has been used for analysis of speech characteristics for many years. Could such a method be used for recognition of speech patterns? What aspects of speech production cannot be resolved from a spectrogram?

2.4 How many phonemes are there in American English? List the attributes which distinguish phonemes into individual speech classes (manner of voicing, place of articulation).

2.5 The following figures are designed to help you become more familiar with vocal-tract shape and speech production. For each drawing, there is only one sound in American English that could be produced with such a vocal-tract position. For each figure, write the phonetic symbol that corresponds to the shape, and circle the articulator(s) most responsible for this phoneme production.

2.6 Suppose that a typical male speaker of average height (6 feet) possesses a vocal tract of length 17 cm. Suppose also that his pitch period is 8 msec. Discuss how voice characteristics would change for a very tall (7 feet) basketball player and an infant (2 feet tall). Assume that their vocal systems are proportioned to their heights.

2.7 The waveform of the sentence "Irish youngsters eat fresh kippers for breakfast," is shown on p. 148. The signal was sampled at 8000 samples/second.
(a) Label all the regions of voiced, unvoiced, and silence.
(b) For the phrase "Irish youngsters," indicate the phonemic boundaries (i.e., mark boundaries for the phonemes /Y/ - /r/ - /I/ - /S/, /y/ - /A/ - /G/ - /s/ - /t/ - /R/ - /z/).
(c) For each vowel sound, estimate the average pitch period.
(d) Is the speaker an adult male, adult female, or a child?
(e) Find the range for pitch frequency, using estimated minimum and maximum pitch period values from part (c).
2.8 It can be shown that vowel formant location varies while producing the following word list: "mat" - "met" - "mit" - "meat." Discuss the differences in vocal-tract articulator position/movement in producing each of these words.

2.9 Vowels are characterized by their formant locations and bandwidths. What spectral features vary across the vowel triangle? How do vocal-tract articulatory movements characterize such a triangle? Does the vowel triangle change for different speakers?

2.10 American English consists entirely of egressive phonemes. List two phonemes that differ only in their type of excitation (i.e., voiced versus unvoiced). Name one language that employs ingressive phonemes.

2.11 Diphthongs are characterized by movement from one vowel target to another. Sketch the initial and final vocal-tract shapes for the /W/ diphthong as in "out." Also sketch the magnitude spectrum for each. Which formants require the greatest articulatory movement?

2.12 What are the differences in vocal-tract articulator movement among diphthongs, liquids, and glides?

2.13 What differentiates the five unvoiced fricatives /f/, /T/, /s/, /S/, /h/? Why are fricatives always lower in energy than vowels?

2.14 Diphthongs, liquids, and glides possess one or more vocal-tract target shapes, with smooth continuant movement of the vocal-tract articulators. Which speech sounds in American English are transient, noncontinuant sounds?
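A sketch for the spectral-estimation steps of Problem 2.17 later in this set (not from the text; a synthetic harmonic signal stands in for the book's vowel file so that the fragment is self-contained):

```python
import numpy as np

FS = 8000                      # sampling rate (Hz), as in Problem 2.17
F0 = 125.0                     # true fundamental of the synthetic "vowel"

# Synthetic quasi-vowel: a few harmonics of F0 (stand-in for the speech file).
n = np.arange(512)
x = sum(a * np.sin(2 * np.pi * k * F0 * n / FS)
        for k, a in [(1, 1.0), (2, 0.6), (3, 0.4), (4, 0.2)])

# Magnitude spectrum of the windowed 512-sample segment.
X = np.abs(np.fft.rfft(x * np.hamming(512)))
freqs = np.fft.rfftfreq(512, d=1.0 / FS)

# Estimate F0 as the strongest spectral peak below 400 Hz.
band = freqs < 400
f0_est = freqs[band][np.argmax(X[band])]
print(round(f0_est, 1))       # near 125 Hz (limited by the 15.6-Hz bin width)
```

With a real vowel recording, the same peak-picking step gives a coarse fundamental-frequency estimate whose resolution is set by the FFT bin spacing FS/512.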
2.15 Nasals differ from other phonemes in that the nasal cavity is part of the speech system's resonant structure. Discuss how the closed oral cavity contributes to the overall magnitude spectrum of a nasal.

[Waveform plot (speech amplitude s(n) versus sample index) of "Irish youngsters eat fresh kippers for breakfast" for Problem 2.7, p. 148.]

2.16 The system of spelling in American English (orthographic transcription) often fails to represent the proper sequence of sounds for word production. For example, the italicized letters in the words below all represent the same sounds:

to two too
sea see
sue shoe
through threw

Also, similar spelling can represent different speech sounds as shown below.

charter character
dime dim
thick then
bottom cat
In other cases, combinations of letters may represent just one sound (e.g., physical, aisle), or no sound at all (e.g., know, psychiatry, pneumonia).

(a) Below are several groups of words. Pronounce the words and compare the sounds corresponding to the italicized letters. Are the sounds from each group the same phoneme? If so, give one additional word which contains the same speech sound. List the phoneme associated with each italicized letter.

1        2       3
suppose  such    open
you      puddle  owl
broom    touch   sold
to       stud    bawl
super    put     coal

4        5        6
passion  slang    logged
fishing  finger   waited
azure    ginger   baulked
cashew   golfing  baited
shift    belong   tract

(b) Perform a phonetic transcription of the following words. Which letters correspond to a single sound, which are silent, and which combine with another to form a single sound?
unique, license, fizzle, orangutan, autumn, sickle, tissue, vegetable, quartz

2.17 (Computer Assignment) (a) In the speech files available for this book (see Preface), you will find a file that contains 16-bit integer samples of the vowel /o/, sampled at an 8-kHz sampling rate. Extract 512 samples of this speech sound and plot the time waveform. Next, apply an FFT to the segment and plot the magnitude spectrum.
(i) What features characterize the time waveform?
(ii) From the magnitude spectrum, estimate the fundamental frequency. Is this a male or female speaker?
(iii) From the same magnitude spectrum, estimate the formant locations. How do your measurements compare to typical values for the vowel?
(b) In the same set of computer files, you will find some vowel files whose phonemes are not labeled. For one or more of these files, repeat the steps of part (a), and determine which vowel the file(s) likely represent(s).

CHAPTER 3
Modeling Speech Production

Reading Notes: The discussion and notational conventions established in Section 1.5 will be used extensively in this chapter, beginning in Section 3.1.3.

3.0 Preamble

The general purpose of this chapter is to develop several related models of speech production that will be used throughout the book. We noted at the end of Chapter 1 that much of modern speech processing is based on extensive engineering research into analog acoustic modeling of speech production that was carried out in the middle decades of the twentieth century. A complete description of this work would require one or more volumes, so a thorough account in this book is not possible. Further, many digital speech processing pursuits require only a basic understanding of the central results of the acoustic theory. The purpose of this chapter is to pursue these core ideas. More elaboration on some aspects of the acoustic theory is found in the appendices of this chapter.

3.1 Acoustic Theory of Speech Production

3.1.1 History

The motivation for understanding the mechanism of speech production lies in the fact that speech is the human being's primary means of communication. Through developments in acoustic theory, many aspects of human voice production are now understood. There are areas such as nonlinearities of vocal fold vibration, vocal-tract articulator dynamics, knowledge of linguistic rules, and acoustic effects of coupling of the glottal source and vocal tract that continue to be studied. The continued pursuit of basic speech analysis has provided new and more realistic means of performing speech synthesis, coding, and recognition.

Early attempts to model and understand speech production resulted in mechanical speaking machines. Modern advances have led to electrical
…shape to the human vocal tract and excited them with a vibrating reed, which, like the vocal folds, interrupted an airstream. In roughly the same period, Wolfgang Ritter von Kempelen (1791) demonstrated a much more successful machine for generating connected utterances.¹ In 1791, von Kempelen published a description of human speech production and his experiments during the two decades he had been working on a speaking machine. Von Kempelen's machine used a bellows to supply air to a reed, which, in turn, excited a single resonator cavity. The resonator cavity was constructed of leather and the cross-sectional area was varied by having the operator squeeze the leather tube resonator for voiced speech sounds (Fig. 3.1). Consonant sounds were simulated using four separate constriction mechanisms controlled by the fingers of the operator's second hand. Using von Kempelen's description, Sir Charles Wheatstone built an improved version of the speaking machine which he demonstrated at the Dublin meeting of the British Association for the Advancement of Sciences in 1835 [see Fig. 3.2, from (Dudley and Tarnoczy, 1950; Wheatstone, 1879)].² Further developments in mechanical speech modeling and synthesis continued into the 1800s and early 1900s. In 1846, J. Faber demonstrated a speech machine called "Euphonia," which possessed characteristics much like a speech organ. This instrument was said to represent a significant improvement over von Kempelen's, because it allowed for variable pitch, permitting singing. Ordinary, whispered, and conversational speech could also be produced. Other machines include an approach using tuning forks by Helmholtz to produce artificial vowels (Helmholtz, 1875). Paget (1930) built models of plasticene and rubber that could individually produce almost every vowel and consonant sound. Wagner (1936) constructed a vowel-copying electrical circuit that controlled the energy level in each of the first four formant frequency regions.

FIGURE 3.1. Von Kempelen's final speaking machine. Adapted from (Dudley and Tarnoczy, 1950).

FIGURE 3.2. Wheatstone's reconstruction of von Kempelen's speaking machine (compressed air chamber, auxiliary bellows, and a section through resonator and reed). Adapted from (Paget, 1930).

One of the first all-electrical networks for modeling speech sounds was developed by J. Q. Stewart (1922). In 1939, one of the first all-electrical speech synthesizers, known as the "Voder" (from voice demonstrator) (Dudley et al., 1939), was demonstrated at the World's Fair (Fig. 3.3). A schematic diagram of the Voder is also shown in Fig. 3.4 (Dudley, 1955). A trained operator was required to "play" the Voder to produce speech.

¹It has been reported that von Kempelen actually began his early work on vowel production in 1769, but that it was not seriously regarded by his scientific colleagues. Although his work received ample publicity, his research ability had been besmirched by an early deception involving a mechanical chess-playing machine. It seems the primary "mechanical" component of this earlier device was a concealed legless man named Worouski, who was a former commander in the Polish regiment and a master chess player (Flanagan, 1972).

²The efforts of von Kempelen would conceivably have a profound impact on speech modeling. As a boy in Edinburgh, Scotland, Alexander Graham Bell had the opportunity of seeing Wheatstone's construction of von Kempelen's machine. Being greatly impressed by his work, and with assistance from his brother Melville, Bell set out to construct his own speaking machine. His interest ultimately led to his U.S. Patent 174465 of the voice telephone.
FIGURE 3.3. The Voder being demonstrated at the New York World's Fair (Dudley, 1940; Dudley and Tarnoczy, 1950).

FIGURE 3.4. A schematic diagram of the Voder (unvoiced random noise source, voiced relaxation oscillator source, resonance controls, mouth-radiator, and loudspeaker). Adapted from (H. Dudley, R. R. Riesz, and S. A. Watkins, 1939).

The operator manipulated 14 keys with his or her fingers, which controlled vocal-tract resonance structure, a wrist bar which operated the excitation carrier (random noise/relaxation oscillator), and a right foot pedal which allowed for variable pitch. Although operator training was quite long (on the order of a year), skilled "players" could produce quite intelligible speech. One of the important aspects of the Voder is the resonance control box, which contained 10 contiguous bandpass filters that span the frequency range of speech. The outputs of the filters were passed through gain controls and were added. Ten fingers were used to operate the gain controls, and three additional keys were provided for transient simulation of stop consonants.

H. K. Dunn (1950) achieved far better quality than that of the Voder with an electrical vocal tract, pictured in Fig. 3.5. The device was based on a vibrating energy source that replaced the vocal folds, and a transmission line model (inductors and capacitors) for the vocal tract using lowpass filter sections that provided the delay experienced by a sound wave traveling along the human vocal tract.

FIGURE 3.5. A front view of the Electrical Vocal Tract developed by H. K. Dunn (1950).
Further developments in speech modeling and synthesis continued with the development of computers. In many regards, advancement in speech modeling led to the development of better speech coding and synthesis methods. Pursuing the historical development of speech modeling further will therefore lead to more recent advances in coding and synthesis, which we shall present in a later chapter. We therefore turn back to the problem of understanding the basic principles of sound propagation and modeling of the human speech production system. Readers interested in further historical discussion of early speech modeling should consider articles by Flanagan (1972) and Klatt (1987).

3.1.2 Sound Propagation

In Chapter 2 we characterized speech sounds in terms of the position and movement of the vocal-tract articulators, variation in their time waveform characteristics, and frequency domain properties such as formant location and bandwidth. This has served to help us understand the differences in how individual phonemes are produced. In this section, we turn to mathematical representations of human speech production to provide a foundation for applications in speech synthesis, coding, enhancement, and recognition. In our brief historical overview of speech modeling, we saw that early attempts at speech modeling and understanding resulted in the construction of mechanical and electrical devices for speech production. The foundations of many of the early attempts to formulate mathematical models are a direct consequence of these early speech production devices. Therefore, the resulting mathematical methods presented here should provide the necessary basis for both analysis and synthesis of speech.

To fully characterize the human speech production system would require a set of partial differential equations that describe the physical principles of air propagation in the vocal system (Beranek, 1954; Morse, 1968; Portnoff, 1973; Rabiner and Schafer, 1978; Sondhi, 1974). Sound generation and propagation requires the characterization of such topics as

1. The time-varying nature of vocal-tract shape.
2. Nasal cavity coupling.
3. The effects of the soft tissue along vocal-tract walls.
4. The effect of subglottal (lungs and trachea) coupling with vocal-tract resonant structure.
5. Losses due to viscous friction along, and heat conduction across, the vocal-tract walls.

Therefore, a complete description requires detailed mathematical analysis and modeling based on acoustic theory and low-viscosity (air) fluid mechanics. Although extensive research has been performed in these areas, a universal theory has not yet emerged. If this is so, we might ask ourselves how the early speech scientists began in their quest to understand and synthesize speech. The answer centers around stationary speech sounds, such as vowels, which offer intriguing parallels with sound generation of early pipe organs. The resulting speech production models presented in this chapter will take on the form of an acoustic tube model. Since it is assumed that more readers are familiar with electrical circuits than acoustics, electrical analogs of the vocal tract based on transmission lines will be considered. These models are subsequently extended to a discrete-time filter representation for use in processing speech by computer. The development of the discrete-time model is the principal goal of this chapter, as it will serve as the basis for most of the techniques described in Chapter 5 and beyond.

The discussion of human speech production in Chapter 2 reveals three separate areas for modeling. These include the source excitation, vocal-tract shaping, and the effect of speech radiation. For example, a single voiced phoneme such as a vowel, modeled over finite time, can be represented as the product of the following three (Fourier) transfer functions:

S(Ω) = U(Ω)H(Ω)R(Ω),    (3.1)

with (see footnote 11 in Chapter 2) U(Ω), the Fourier transform of the glottal volume velocity u(t), representing the voice waveform (source excitation), H(Ω) the dynamics of the vocal tract, and R(Ω) the radiation effects. The voice waveform, u(t), is taken to be a volume velocity, or "flow," waveform. The transfer function H(Ω) represents the ratio of the volume velocity at the output of the vocal system to that at the input to the tract. In this sense, H(Ω) also includes the lips, but in no sense includes the larynx. However, the speech waveform is usually considered to be the sound pressure waveform at the output of the vocal system. H(Ω) does not account for the flow-to-pressure conversion (radiation) function at the lip boundary, which is included in R(Ω). The reader should appreciate this subtlety because of its significance in many discussions to follow.

As implied by the representation (3.1), the majority of modern speech modeling techniques assumes that these components are linear and separable. Accordingly, the production model for the speech signal is assumed to be the concatenation of acoustical, electrical, or digital models, with no coupling between subsystems. Another assumption made is that of planar propagation. Planar propagation assumes that when the vocal folds open, a uniform sound pressure wave is produced that expands to fill the present cross-sectional area of the vocal tract and propagates evenly up through the vocal tract to the lips (see Fig. 3.6).

Assumptions of noncoupling and planar propagation are practical necessities needed to facilitate computationally feasible techniques for speech modeling, coding, and synthesis. Although these conditions are restrictive, they have proved useful in the formulation of the majority of present-day digital speech processing algorithms. It will be useful for us to consider the actual physical mechanism of human speech production
in developing a practical working speech model. However, there are speech scientists who argue that it may not be necessary to model the fine detailed structure and procedures of the physical speech mechanism (i.e., exact articulator movements, characteristics of vocal fold muscles) if we are merely interested in characterizing broad timing and frequency domain properties of the resulting waveform. Thus it may be best to establish a goal for speech model development before initiating the discussion. For many speech applications such as coding, synthesis, and recognition, good performance can be achieved with a speech model that reflects broad characteristics of timing and articulatory patterns, as well as varying frequency properties. Finer speech modeling methods, based on actual physical traits, are necessary for such areas as analysis of vocal fold movement, effects of pathology on human speech production, and experimental studies that seem to suggest the need for alternative speech modeling methods based on physical properties. We begin the discussion by considering a model for the source excitation.

FIGURE 3.6. The classical planar propagation model of sound for speech analysis.

3.1.3 Source Excitation Model

Types of Excitation

Development of an understanding of source characteristics generally requires that we assume independence between source and tract models. Recall from the discussion of speech acoustics in Section 2.2.3 that two basic forms of excitation are possible:

1. Voiced excitation: a periodic movement of the vocal folds resulting in a stream of quasi-periodic puffs of air.
2. Unvoiced excitation: a turbulent noiselike excitation caused by airflow through a narrow constriction.

Recall also that the following forms of excitation represent important combinations and variations of voiced and unvoiced sources which, for modeling purposes, are often featured as distinct categories:

3. … voiced sound depending on which phoneme is being uttered.
4. Whisper: a whisper is an utterance created by forcing air through a partially open glottis (glottal fricative) to excite an otherwise normally articulated utterance.
5. Silence: included as an excitation form for modeling purposes, since there are short time regions in speech in which no sound occurs. The pause before the burst of a plosive is an example of this phenomenon.

For voiced and unvoiced sounds, the speech signal s(t) can be modeled as the convolution of an excitation signal (see below) and the vocal-tract impulse response h(t).³ The modeling of the excitation signal and its spectrum represent the goal of this section.

Voiced Excitation

For voiced or sonorant production, airflow from the lungs is interrupted by quasi-periodic vibration of the vocal folds, as illustrated in Fig. 3.7. Details of the forces responsible for vocal fold movement have been discussed in Section 2.2.3. During voiced activity, the vocal folds enter a

³For the present, we will ignore radiation effects in the modeling.
Ch. 3 / Modeling Speech Production · 3.1 / Acoustic Theory of Speech Production
[Figure 3.7: airflow from the lungs passes the vocal cords and travels through the vocal tract to the lips; the accompanying waveform is plotted against time (msec).]
…ods of rest (i.e., nonspeech activity such as breathing). One means of modeling the voiced source is as a quasi-periodic train of glottal pulses; the shape and timing of each pulse play important roles in a talker's ability to vary source characteristics. An analysis of such factors requires the estimation and reconstruction of the excitation signal from the given speech signal. The generally accepted approach for analysis of glottal source characteristics involves direct estimation of the glottal flow wave. Methods that seek to reconstruct the time excitation waveform are referred to as glottal inverse filtering algorithms. Due to the high-accuracy requirements of these algorithms, most require large amounts of human interaction with little regard to computational efficiency. These algorithms will be discussed below, and also in Chapter 5 as an application of linear prediction analysis.

[Figure 3.8 caption fragment: … waveforms at the mouth for the vowel /a/. After Ishizaka and Flanagan (1972).]

If a discrete-time linear model for the speech production system is used, the z-domain transfer function for the speech signal s(n) may be written as follows:

S(z) = Θ₀U(z)H(z)R(z)   (3.2)
     = Θ₀E(z)G(z)H(z)R(z).   (3.3)

Here the voiced excitation u(n) is modeled as a periodic impulse train e(n)⁴ with pitch period P passed through a glottal shaping filter G(z),⁵

u(n) = Σ_{i=−∞}^{∞} g(n − iP),   (3.4)

where g(n) is the impulse response of the glottal shaping filter. This signal is then used to excite the supraglottal system made up of the vocal-tract transfer function H(z) and the output radiation component R(z). The overall amplitude of the system is controlled by Θ₀.

The task of estimating the glottal source requires the formulation of a procedure that inverts the influence of the vocal-tract and radiation components. Such glottal inverse filtering algorithms reconstruct the time excitation waveform u(n). The main component of the inverse filter is the vocal-tract component, H(z). The filter therefore requires estimates of stationary formant locations and bandwidths. Two approaches have been applied in the past for inverse-filter implementation. The first, based in

⁴It should be noted that the pulse train e(n) has no real counterpart in the physical system. The first modeling component with a physical correlate is the voice waveform u(n).
⁵As an aside, from (3.4) one can see the advantage of using an FIR filter to model G(z).
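To make (3.4) concrete, here is a minimal numerical sketch (not from the text): a hypothetical Rosenberg-style glottal pulse g(n) is convolved with a periodic impulse train e(n) of assumed pitch period P to form the voiced excitation u(n). The pulse shape and all parameter values are illustrative assumptions.

```python
import numpy as np

def glottal_pulse(n_open=60, n_close=20):
    """A simple Rosenberg-style glottal pulse (hypothetical shape):
    a rising half-cosine over the opening phase followed by a
    faster-falling quarter-cosine over the closing phase."""
    opening = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_open) / n_open))
    closing = np.cos(0.5 * np.pi * np.arange(n_close) / n_close)
    return np.concatenate([opening, closing])

def voiced_excitation(g, P, n_samples):
    """u(n) = sum_i g(n - iP): convolve an impulse train of period P
    with the glottal shaping filter's impulse response g(n)."""
    e = np.zeros(n_samples)
    e[::P] = 1.0                      # impulse train e(n)
    return np.convolve(e, g)[:n_samples]

g = glottal_pulse()
u = voiced_excitation(g, P=100, n_samples=400)
# Away from end effects, u is periodic with period P:
print(np.allclose(u[:100], u[100:200]))  # True
```

Because G(z) is FIR here, u(n) is an exact superposition of finitely many shifted pulses, which is the advantage noted in the footnote on (3.4).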
analog technology, involves individual determination of formants requiring user interactive control (Holmes, 1962; Hunt et al., 1978; Lindqvist, 1964, 1965, 1970; Miller, 1959; Nakatsui and Suzuki, 1970; Rothenberg, 1972). The second involves automatic formant measurement by some form of nonstationary linear prediction analysis during the closed-glottis interval (Berouti, 1976; Berouti et al., 1977; Childers, 1977; Hedelin, 1981, 1984, 1986; Hunt et al., 1978; Krishnamurthy and Childers, 1986; Larar et al., 1985; Markel and Wong, 1976; Veeneman and BeMent, 1985; Wong et al., 1979). In both approaches, an estimate of the vocal-tract filter H(ω) and a filter representing radiation at the lips R(ω) is used. Given a measurement of the spectral properties of the sound pressure S(ω), the glottal excitation can, in principle, be estimated to within a gain constant using (3.2),

G(ω) = S(ω) / [H(ω)R(ω)].   (3.5)

[If continuous-time analysis is being used, refer to (3.1), which indicates the same result with ω replaced by Ω.] When individual formant determination is used, each formant is canceled individually using a second-order filter with a complex conjugate pair of zeros. The input signal is assumed to be a stationary vowel with high-frequency components and formants that are clearly separable. Miller (1959) was the first to attempt such an approach. The inverse filter formulated contained only one adjustable formant. Lindqvist (1964, 1965, 1970) later developed an approach that contained four tunable filters to cancel the first four formants and a fixed lowpass filter to attenuate higher formants. Holmes (1962) also developed a system with four tunable filters, but also allowed the user to see the excitation pattern of each formant. Fant and Sonesson (1962) compared inverse filtering to waveforms recorded by a photoelectric pitch determination instrument. Results showed good temporal agreement between the two estimates of the instant of glottal closure. The criterion for filter adjustment is based on the postulate that during the closed-glottis interval, the excitation propagated up through the vocal tract approaches zero. The analysis filter therefore models only the vocal-tract characteristics. This requires that the output signal of the inverse filter be maximally flat during this time.

There are some disadvantages and disagreements in the literature concerning inverse filtering techniques. It is generally accepted that at the instant of glottal closure, all frequencies of the vocal tract are excited equally. This can be seen as large impulses in the residual waveform for vocalic speech. Yet, the results by Holmes (1962, 1976) and Hunt et al. (1978) indicate that there may be a secondary excitation at the instant of glottal opening or even during the closed-glottis interval. These results presumably led researchers such as Atal and Remde (1982) and others to look beyond pitch-excited methods to alternative formulations involving "multipulse excitation" (see Chapter 7). Also, it is necessary to select the vowels to be analyzed carefully, since, for example, nasalization can lead to inaccurate determination of formants. Lindqvist (1964) also identified the problem of coupling of the subglottal and supraglottal system during the open-glottis interval. For these reasons, interactive inverse filter analysis without the use of a pitch detector has almost entirely been confined to male voices, where the closed-glottis interval is well defined and of sufficient duration.

Methods for glottal waveform extraction using linear prediction analysis will be discussed in detail in Chapter 5. Briefly, the techniques are based upon a model of the laryngeal dynamics that includes a closed phase during each cycle. The resulting volume velocity at the glottis u(n) is periodic, ideally resembling a lowpass pulse train as illustrated in Fig. 3.9. The regions of zero volume velocity correspond to closed phases of the glottis. As the vocal folds begin to separate, the volume velocity gradually increases. After reaching their maximum separation, the vocal folds close more rapidly, thus completing the open phase. To extract the glottal waveform, linear prediction analysis is applied during one or more closed-phase regions to estimate the vocal-tract dynamics. Ignoring lip radiation effects, during the closed phase, only transient dynamics of the vocal tract are present in the speech signal. The resulting estimated vocal-tract model, say Ĥ(z), can be used to recover the glottal waveform according to (3.5).

In addition to the two principal approaches for glottal waveform estimation described above, other methods for modeling the excitation source at the glottis are based on direct waveform models. In these techniques a particular glottal pulse shape is adopted, such as a rising slope leading into a half-wave rectified sine pulse. The pulse is parameterized by values that determine the precise shape of the waveform (Hedelin, 1986; Rosenberg, 1971). Such a direct modeling approach is generally acceptable for applications such as speech synthesis, where the goal is to produce natural-sounding speech. However, if a model is to be extracted from a given speech signal for coding purposes, for example, the human glottal pulse shape can vary considerably from the assumed model, thereby causing speech of lower quality.

FIGURE 3.9. Example of an ideal glottal source waveform.
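As a numerical sketch of the inverse-filtering relation (3.5) (entirely synthetic; the single-formant resonator, first-difference radiation model, and excitation spectrum below are illustrative assumptions, not an algorithm from the text), dividing the speech spectrum by the modeled vocal-tract and radiation responses recovers the excitation spectrum exactly when the models match:

```python
import numpy as np

def resonator_response(w, f_res, bw, fs):
    """Frequency response of a two-pole resonator, a stand-in for one
    formant of H(omega); f_res/bw are hypothetical formant frequency
    and bandwidth in Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f_res / fs
    z = np.exp(1j * w)
    return 1.0 / ((1 - r * np.exp(1j * theta) / z) *
                  (1 - r * np.exp(-1j * theta) / z))

fs = 8000
w = 2 * np.pi * np.fft.rfftfreq(512)   # omega on [0, pi]
H = resonator_response(w, f_res=500.0, bw=80.0, fs=fs)
R = 1 - 0.95 * np.exp(-1j * w)         # crude 1 - az^-1 lip radiation model

G_true = np.exp(-1j * 3 * w)           # placeholder excitation spectrum
S = G_true * H * R                     # forward model, eq. (3.2) with unit gain

G_est = S / (H * R)                    # inverse filtering, eq. (3.5)
print(np.allclose(G_est, G_true))      # True
```

In practice, of course, H and R must be estimated from the speech itself, which is exactly where the interactive and closed-phase approaches above differ.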
The analytical methods described above are ordinarily used to estimate the glottal waveform. These methods, however, require an accurate and stable model for the vocal-tract and radiation effects, making them less attractive for modeling of speech in some coding and recognition applications. Fixed mathematical models have also been used to model the glottal source. An example is the electrical analog of a vocal fold model developed by Ishizaka and Flanagan (1972) using measurements of subglottal air pressure, volume velocity at the glottis, vocal fold tension, and the resting cross-sectional area of the opening at the vocal folds. Their model was based on a set of nonlinear differential equations. The coupling of these equations with the partial differential equations used to model the vocal tract results in a time-varying acoustic resistance and inductance. These terms are dependent on the cross-sectional area of the glottis as 1/A_glottis(t). So if the glottis closes [i.e., 1/A_glottis(t) → ∞], the impedances become infinite and the volume velocity becomes zero. In most cases, it is assumed that the coupling between the vocal tract and glottis is weak, so that the glottal impedance can be linearized (Flanagan, 1972) as

Z_glottis(Ω) = R_glottis + jΩL_glottis,   (3.6)

with R_glottis and L_glottis constants during voiced speech activity. The result is the glottal excitation model shown in Fig. 3.10. The term u_glottis(t) represents the volume velocity at the glottis, which has the shape shown in Fig. 3.8. This model will be used to establish boundary conditions once the vocal-tract acoustic tube model has been established.

FIGURE 3.10. An approximate model of the glottis during voiced speech activity. After Ishizaka and Flanagan (1972).

Unvoiced Excitation

The discussion of source excitation modeling would not be complete without addressing unvoiced excitation. Unvoiced excitation includes frication at a point of major constriction along the vocal tract or an explosive form during a stop release. Both are normally modeled using white noise. This source of excitation theoretically has no effect on the shape of the speech spectrum, since its power density spectrum is constant over all frequencies. It is also notable that phase is not a meaningful concept in characterizing noise, since the Fourier transform of a noise process does not, strictly speaking, exist. In fact, little attention is paid to the effect of the excitation phase, since good quality speech can be obtained for synthesis with random phase and since the spectral amplitude is more important than phase for speech perception (further discussed in Chapter 5). Throughout the history of speech modeling, voiced excitation has always received more research attention than the unvoiced case. This is due to studies in speech perception that suggest that accurate modeling of voiced speech is crucial for natural-sounding speech in both coding and synthesis applications.

In conjunction with this problem of unvoiced excitation and phase considerations, it is worth previewing an issue that will be central to many of the discussions in the future. Suppose that we wish to write an analytical model similar to (3.3) for the unvoiced case. It would be natural to simply omit the glottal shaping filter G(z) and simply let "E(z)" represent the driving noise sequence, yielding

S(z) = Θ₀E(z)H(z)R(z),   (3.7)

or in terms of the DTFT,

S(ω) = Θ₀E(ω)H(ω)R(ω).   (3.8)

The cautious reader may object to this expression on the grounds that a DTFT will generally not exist for the excitation process. However, if we let e be the name for the excitation (noise input) process, and s the random process characterizing the resulting output (speech), then it is perfectly proper to write

Γ_s(ω) = Θ₀² |H(ω)|² |R(ω)|² Γ_e(ω),   (3.9)

where, recall, Γ_x(ω) refers to the power density spectrum of the random process x. However, if we interpret E(ω) as representative of some finite-time portion of a realization of e, and similarly for S(ω), then (3.7) becomes a valid working expression. Reflecting on (3.3), we realize that a similar problem occurs there because the DTFT does not strictly exist for a stationary pulse train. The issue of short-term processing in speech work is important in both theoretical and practical terms, and we shall pay a great deal of attention to it in the ensuing chapters. For the remainder of this chapter, however, we shall continue to use somewhat sloppy, but readily interpretable, analysis in this regard.
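The power-spectrum view in (3.9) can be checked numerically. The following sketch (with an assumed one-pole filter standing in for the combined system Θ₀H(ω)R(ω), not a model from the text) filters white noise and compares an averaged periodogram of the output against the squared magnitude response times the flat input power density spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_pole_filter(x, a=0.5):
    """y(n) = x(n) + a*y(n-1): a one-pole stand-in for the overall
    system (the coefficient is an arbitrary illustrative choice)."""
    y = np.empty_like(x)
    acc = 0.0
    for n, xn in enumerate(x):
        acc = xn + a * acc
        y[n] = acc
    return y

# Average many periodograms of filtered white noise.
nfft, nseg, a = 256, 2000, 0.5
psd = np.zeros(nfft // 2 + 1)
for _ in range(nseg):
    e = rng.standard_normal(nfft)          # white noise, Gamma_e = 1
    s = one_pole_filter(e, a)
    psd += np.abs(np.fft.rfft(s))**2 / nfft
psd /= nseg

w = 2 * np.pi * np.fft.rfftfreq(nfft)
theory = 1.0 / np.abs(1 - a * np.exp(-1j * w))**2   # |H(w)|^2 * Gamma_e
print(np.max(np.abs(psd - theory) / theory))        # small relative error
```

The input PSD is flat, so all spectral shape in the output comes from the filter; this is the sense in which the white-noise source "has no effect on the shape of the speech spectrum."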
A Plethora of Transfer Functions

Before continuing, we must emphasize a point that is inherent in the discussion above. There are many transfer functions of interest in modeling speech. Of course, each of G(z), H(z), and R(z) represents a transfer function between two important signals; it is hoped that the reader now appreciates the significance of each.⁶ The main transfer function variations differ in whether or not the effects of the larynx and/or lips are included. Indeed, therefore, H(z) is one such transfer ratio of interest,

H(z) = U_lips(z) / U_glottis(z).   (3.10)

Of the other combinations of outputs and inputs that can be formed, the most important for this book is the "complete" transfer function

Θ(z) ≝ S(z) / E(z).   (3.11)

Depending on whether the sound being modeled is voiced or unvoiced, Θ(z) may or may not contain the glottal dynamics, G(z). Throughout this chapter and those to follow, it is fundamentally important to understand the nature of the system function under discussion.

Pressure and Volume Velocity Relations

A sound wave is produced when the vocal folds vibrate, other articulators move, or by random air particle motion. The propagation follows the laws of physics, which include the conservation of mass, momentum, and energy. Air can be considered a compressible, low-viscosity fluid, which allows the application of laws of fluid mechanics and thermodynamics. If we consider sound propagation in free space, it should be clear that a sound wave propagates radially outward in three dimensions from a point sound source. For human sound production, soft tissue along the vocal tract prevents radial propagation. Therefore, sound waves normally propagate in only one direction along the vocal tract. To simplify vocal-tract modeling, we assume that sound waves obey planar propagation along the axis of the vocal tract, from the glottis toward the lips. This assumption is strictly valid only for frequencies with wavelengths that are large compared to the diameter of the vocal tract (less than approximately 4 kHz). To verify this, we calculate the wavelength of a sound wave at 4 kHz as

λ_4kHz = c / F = (340 m/sec) / (4000 cycles/sec) = 8.5 cm,   (3.12)

where c is the speed of sound in air. Here, the wavelength λ_4kHz of 8.5 cm is much larger than the average diameter of the vocal tract (≈ 2 cm), so that the assumption of planar propagation is reasonable.⁷ In addition to planar propagation, we also assume that the vocal tract can be modeled by a hard-walled, lossless series of tubes. Once we have considered the lossless tube-model case, we will introduce the effects of losses due to soft-wall vibration, heat conduction, and thermal viscosity later in this section.

Consider the varying cross-sectional area model of the vocal tract shown in Fig. 3.11. The two laws that govern sound wave propagation in this tube are the Law of Continuity and Newton's Force Law (force = mass × acceleration),

(1/ρc²) ∂p(x, t)/∂t = −∇ · ṽ(x, y, z, t)   (3.13)

∇p(x, t) = −ρ ∂ṽ(x, y, z, t)/∂t,   (3.14)

FIGURE 3.11. An ideal, variable cross-sectional area vocal tract.

⁶The argument z can be replaced in this discussion by ω if the DTFT is the transform of interest and by Ω if the "analog" FT is being used.
⁷We shall reconsider this topic in Section 3.2.3.
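The wavelength check in (3.12) is plain arithmetic, and it is easy to tabulate for the whole band of interest. A minimal sketch (assuming c = 340 m/sec):

```python
def wavelength_cm(freq_hz, c_m_per_s=340.0):
    """lambda = c / F, eq. (3.12), returned in centimeters."""
    return 100.0 * c_m_per_s / freq_hz

# Compare each wavelength with the ~2 cm average vocal-tract diameter:
for f in (500.0, 1000.0, 2000.0, 4000.0):
    lam = wavelength_cm(f)
    print(f"{f:6.0f} Hz -> lambda = {lam:6.2f} cm")
```

Even at 4 kHz the wavelength (8.5 cm) exceeds the tract diameter severalfold, which is why planar propagation is a workable assumption below that frequency.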
where ∇ indicates the gradient, ∇· the divergence, and where

p(x, t) ≝ deviant sound pressure from ambient, at location x, time t
ṽ(x, y, z, t) ≝ vector velocity of an air particle ζ in the vocal tract at location (x, y, z)
c ≝ speed of sound in air (340 m/sec at sea level)
ρ ≝ density of air in the tube.

If one-dimensional planar propagation is assumed, then all particles at a given displacement x will have the same velocity independent of location (y, z) within the cross-sectional area A. Therefore, analysis is more convenient if we examine the velocity of a volume of air rather than the vector velocity of a single particle ṽ(x, y, z, t). The velocity of a volume of air is defined as

v(x, t) = A(x, t) ṽ_ζ(x, t),   (3.15)

where A(x, t) is the cross-sectional area of the vocal tract, and ṽ_ζ(x, t) the velocity of any particle ζ in that volume. This assumes that each particle in the volume is moving in the same direction and with the same point velocity (the volume follows the planar propagation shown in Fig. 3.6). Since the vector velocity now points in a single direction, we shall henceforth drop the vector notation over the volume velocity and write simply v(x, t). Substituting volume velocity for point velocity in (3.13) and (3.14) results in

−∂v(x, t)/∂x = (1/ρc²) ∂[p(x, t)A(x, t)]/∂t + ∂A(x, t)/∂t   (3.16)

−∂p(x, t)/∂x = ρ ∂[v(x, t)/A(x, t)]/∂t.   (3.17)

These equations characterize the volume velocity v(x, t) and sound pressure p(x, t) along the vocal tract from the glottis (x = 0) to the lips (x = l, or 17.5 cm for a typical male speaker). In general, closed-form solutions for these equations are possible for only a few simple configurations. If, however, the cross-sectional area A(x, t) and associated boundary conditions are specified, numerical solutions are possible. In the case of continuant sounds such as vowels, it is reasonable to assume that the vocal-tract area function A(x, t) does not vary with time. For noncontinuants such as diphthongs, A(x, t) will vary with time. Although detailed measurements of A(x, t) are needed, they are extremely difficult to obtain during production.

[Figure 3.12: X-ray sketch of the vocal tract showing the palate, pharyngeal wall, glottis, and lips, with the corresponding uniform-tube model.]

Lossless Tube Model

As noted above, one of the difficulties encountered in solving (3.16) and (3.17) is the inability to characterize the area function A(x, t) along the vocal tract versus time. In order to gain some understanding of the model governed by these coupled differential equations, let us consider a simplified vocal-tract shape consisting of a tube with uniform cross-sectional area. Let the area be fixed in time and space, so that A(x, t) can be replaced by A as shown in Fig. 3.12. The adequacy of such a tube model has been demonstrated by research comparing the sounds produced by physical models with sounds produced by humans. Natural vowel sounds were first produced by speakers and recorded. During production, measurement techniques using X-ray motion pictures were used to sketch the pharyngeal- and oral-tract shapes for each vowel. Using these sketches, tube models were constructed with the same shapes as those observed in the X-rays. Sound energy was then passed through each tube model, and the emerging sound patterns showed agreement with the natural vowel patterns. This work was initiated by Chiba and Kajiyama (1941), and continued by Dunn (1950), Fant (1960, 1968, 1970),
Jakobsen et al. (1967), Lindblom and Sundberg (1971), Perkell (1969), and Stevens and House (1955, 1961).

Figure 3.12 illustrates an X-ray sketch and corresponding uniform tube configuration that approximates the neutral vowel. The ideal source is represented by a piston, which produces sound waves that propagate along the axis of the vocal-tract model. The assumption of a constant cross-sectional area does not ensure the single uniform tube shown in Fig. 3.12, and we need to account for the actual bend in the human vocal tract for this to be a valid model. In a study of the effects of variable-angle bends in uniform acoustic tubes, Sondhi (1986) found that curvature does not change the points of resonance by more than a few percent from those of a straight tube model.⁸ Since this effect is small, we initially consider the uniform tube case.

Open Termination. First consider the case in which the tube is open at the lips (x = l), so that the deviation from ambient pressure at the lips is zero,

p(l, t) = p_lips(t) = 0.   (3.18)

Further, because we are only interested in steady-state analysis, we let the glottal source be modeled by an exponential excitation,

v(0, t) = v_glottis(t) ≝ V_glottis(Ω)e^{jΩt},   (3.19)

and write the resulting volume velocity at the lips as

v(l, t) = v_lips(t) ≝ V_lips(Ω)e^{jΩt}.   (3.20)

The transfer function for the vocal tract can be found by taking the ratio of the phasor volume velocity at the lips to that at the glottis, or equivalently, the ratio of the corresponding complex signals [recall (1.254)],

H(Ω) = V_lips(Ω)/V_glottis(Ω) = v_lips(t)/v_glottis(t) = 1/cos(Ωl/c).   (3.21)

This function relates the input and output volume velocities for the uniform acoustic tube. The resonant frequencies F_i for this model are found by setting the denominator equal to zero. This occurs when

Ω_i l/c = (π/2)(2i − 1)   for i = 1, 2, 3, 4, ….   (3.22)

Since Ω_i = 2πF_i, tube resonances occur at frequencies

F_i = (c/4l)(2i − 1)   for i = 1, 2, 3, 4, ….   (3.23)

Figure 3.13 shows a plot of the tube transfer function from (3.21), using a tube of length l = 17.5 cm and a speed of sound c = 350 m/sec.⁹ Resonances F₁, F₂, F₃, … occur at 500 Hz, 1500 Hz, 2500 Hz, …, respectively.

FIGURE 3.13. Frequency response for a uniform acoustic tube (magnitude versus frequency F in kHz).

Closed Termination. It is of interest to obtain results for the case in which the oral tract is completely occluded (i.e., closed lip or closed tube condition). Let us reconsider the same uniform acoustic tube terminated by a complete closure. The volume velocity of air at the lips will be zero,

⁸For a 17-cm uniform tube with a 90° bend, variation in formant frequency location from a straight tube was +0.8%, +0.2%, −1.0%, −2.5% for F₁, F₂, F₃, and F₄, respectively.
⁹The speed of sound in air at sea level is 340 m/sec. The speed of sound in moist air at body temperature (37°C) is 350 m/sec.
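Equation (3.23) is easy to evaluate directly. A minimal sketch using the same l = 17.5 cm and c = 350 m/sec as Fig. 3.13 (units kept in centimeters so the arithmetic is exact):

```python
def tube_resonances(num, length_cm=17.5, c_cm_per_s=35000.0):
    """F_i = (2i - 1) * c / (4 l), eq. (3.23): resonances of a uniform
    lossless tube driven at the glottis and open at the lips."""
    return [(2 * i - 1) * c_cm_per_s / (4.0 * length_cm)
            for i in range(1, num + 1)]

print(tube_resonances(3))  # [500.0, 1500.0, 2500.0]
```

The odd-harmonic spacing (500, 1500, 2500, … Hz) is the quarter-wave resonance pattern of a tube closed at one end and open at the other.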
since there is no air leaving the closed tube. The boundary condition at the lip end is

v(l, t) = v_lips(t) = 0.   (3.24)

Steady-state solutions for volume velocity and pressure are derived in Appendix 3.A.1. Of course, without any detailed mathematics, it is obvious that the transfer function for this case is H(Ω) = 0 for any Ω.

Transmission Line Analogy. It is interesting to note that a uniform cylindrical tube as pictured in Fig. 3.12, which has a plane wave propagating through it, is analogous to a uniform electrical transmission line, with sound pressure and volume velocity playing the same roles in acoustic theory as voltage and current (see Table 3.1). A summary of the transmission line-acoustic tube analogies is shown in Table 3.2.

One key idea that follows from the transmission line analogy is the concept of acoustic impedance and admittance. Since the sound pressure and volume velocity in the acoustic system are analogous to voltage and current in the transmission line, respectively, we naturally define the acoustic impedance of the tube at distance x from the glottis as

Z(x, Ω) ≝ p(x, t)/v(x, t) = P(x, Ω)/V(x, Ω),   (3.25)

where p(x, t) and v(x, t) are steady-state solutions as above, and P and V are the corresponding phasor representations. Also by analogy, we define the acoustic characteristic impedance of the tube,

Z₀ = ρc/A.   (3.26)

The acoustic admittance, Y(x, Ω), and acoustic characteristic admittance, Y₀, are defined as the respective reciprocals.

Several interesting uses are made of the acoustic impedance concept in Appendix 3.A.2. First, the input impedances are characterized for the open and closed termination cases. The results are shown in Table 3.2.

TABLE 3.1. Analogies Between Acoustic and Electrical Quantities

  p(x, t)    Sound pressure          v(x, t)   Voltage
  v(x, t)    Volume velocity         i(x, t)   Current
  ρ/A        Acoustic inductance     L         Inductance
  A/(ρc²)    Acoustic capacitance    C         Capacitance
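The entries of Table 3.2 are not legible in this scan, but the open-termination input impedance it summarizes is the standard lossless transmission-line result, Z_in(Ω) = jZ₀ tan(Ωl/c). The sketch below (assumed air density, area, and length values) evaluates it and checks that |Z_in| blows up at exactly the resonances predicted by (3.23), consistent with the 1/cos(Ωl/c) transfer function of (3.21):

```python
import numpy as np

rho, c = 1.2, 350.0        # air density (kg/m^3) and sound speed (m/s)
A, l = 5.0e-4, 0.175       # assumed tube area (m^2) and length (m)

Z0 = rho * c / A           # characteristic acoustic impedance, as in (3.26)

def z_in_open(f_hz):
    """Input impedance of a lossless tube with an open (p = 0) lip
    termination: the standard transmission-line result
    Z_in = j * Z0 * tan(Omega * l / c)."""
    return 1j * Z0 * np.tan(2 * np.pi * f_hz * l / c)

# Purely reactive (no losses); near 500 Hz the magnitude diverges,
# while at 1000 Hz (between resonances) it passes through zero:
for f in (250.0, 499.9, 1000.0):
    print(f, abs(z_in_open(f)) / Z0)
```

The impedance is purely imaginary at every frequency, mirroring the remark above that a lossless tube yields purely reactive impedances in the equivalent T-network.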
Second, just as electrical impedances can be used to study resonance properties of a transmission line, so too can the acoustic impedances be employed to find the resonance frequencies of the acoustic tube. The results are expectedly consistent with those found from the transfer function in (3.21). Finally, the uniform acoustic tube in steady state can be replaced by a T-network of impedances which, in turn, can be used to derive a two-port network of the tube. The resulting network reflects the assumption of no losses in the tube in the occurrence of purely reactive impedances. Details are found in Appendix 3.A.2.

Multitube Lossless Model of the Vocal Tract

Since the production of speech is characterized by changing vocal-tract shape, it might be expected that a more realistic vocal-tract model would consist of a tube that varies as a function of time and displacement along the axis of sound propagation. The formulation of such a time-varying vocal-tract shape can be quite complex. One method of simplifying this model is to represent the vocal tract as a series of concatenated lossless acoustic tubes, as shown in Fig. 3.14. The complete vocal-tract model consists of a sequence of tubes with cross-sectional areas A_k and lengths l_k. The cross-sectional areas and lengths are chosen to approximate the vocal-tract area function A(x) (see Fig. 3.11). If a large number of tubes with short lengths is used, then we can expect the formant structure of the concatenated tubes to approach that of a tube with a continuously varying cross-sectional area.

Before beginning the formal study of the multitube model, we note that there are several "intermediate" models between the single-tube case and the general N-tube case that are instructive and useful. However, since the principal goal of our study is the development of a digital model, we shall relegate these intermediate studies to the appendices, where the interested reader can further pursue the details. The two-tube lossless model, discussed in Appendix 3.B, offers an interesting first step toward generality in that it permits the separate modeling of pharyngeal and oral-tract cavities, while offering an opportunity to exercise the various modeling techniques studied above with tractable mathematics. Three-tube models for nasals and fricatives, discussed in Appendix 3.C, provide the first case in which zeros are present in the model due to the presence of an acoustic "side-cavity" modeling the nasal tract.

We now consider the general N-tube lossless acoustic model, an example of which is illustrated in Fig. 3.14 for N = 6. The cross-sectional areas A₁, A₂, …, A_N are chosen to approximate the varying area function A(x, t). Here, we assume that a stationary sound is to be modeled, so that the time dependency in A(x, t) can be removed. If an appropriate number of tubes is chosen, the resonant frequencies of this concatenated system of tubes can be made to approximate those of the human vocal tract by approximating A(x). This represents an important step in the development of a computational means to model the vocal tract, since this acoustic tube model will serve as the transition from continuous time to a discrete-time model based on digital filter theory.

It can be inferred from the discussion in Appendix 3.A.1 that steady-state solutions for volume velocity and sound pressure at point x_k in the kth section of the model have the following forms [see (3.87) and (3.88) in Appendix 3.A.1],

v_k(x_k, t) = (A_k/ρc)[Ψ_k⁺(Ω)e^{jΩt}e^{−jΩ(x_k/c)} − Ψ_k⁻(Ω)e^{jΩt}e^{+jΩ(x_k/c)}]   (3.27)

p_k(x_k, t) = Ψ_k⁺(Ω)e^{jΩt}e^{−jΩ(x_k/c)} + Ψ_k⁻(Ω)e^{jΩt}e^{+jΩ(x_k/c)},   (3.28)

where the kth tube has length l_k. The variable x_k represents displacement along the tube measured from left to right (i.e., 0 ≤ x_k ≤ l_k), and Ψ_k⁺(Ω) and Ψ_k⁻(Ω) are the complex amplitudes of the rightward (positive direction) and leftward (negative) traveling waves along the kth tube. Suppose that we write

v(x_k, t) = v⁺(x_k, t) − v⁻(x_k, t)   (3.29)

FIGURE 3.14. A vocal-tract model comprised of concatenated acoustic tubes.
-- - - - )
U:-l- L(t) IJ;+I(I -T, + I )
and
I{(I}
-<- - - -
I)~(' - 'l)
«; 1(1)
, .
{ + LV - Th
A~ "'l
I)
I
where Z_{0,k} is the characteristic impedance of the kth section, defined as in (3.26). Equations (3.30) and (3.31) indicate that the steady-state solutions at any point in the section, say x_k = x_k', can be expressed entirely in terms of the positive- and negative-going volume velocity (or pressure) waves at the left (x_k = 0) boundary of the section. This should not be surprising, since there are no losses in the tube, and the volume velocity wave at x_k = 0 will therefore appear at x_k = x_k' an appropriate delay or advance in time later. The magnitude of that delay or advance is x_k'/c. Further, as we have discussed above, the pressure and volume velocity are related through the acoustic impedance. Therefore, for convenience in the following, we employ the following abuse of notation:

    v_k^+(t) ≜ v_k^+(0, t)    (3.32)

    v_k^−(t) ≜ v_k^−(0, t).    (3.33)

We also define

    τ_k ≜ l_k / c,    (3.34)

noting that τ_k is the delay incurred in traveling the length of section k.

FIGURE 3.15. Sound wave propagation at the juncture between two concatenated acoustic tubes. Each tube is fully characterized by its length and cross-sectional area (l_k, A_k), or its delay and reflection coefficient [τ_k = l_k/c, ρ_k = (A_{k+1} − A_k)/(A_{k+1} + A_k)].

To determine the interaction of traveling waves between tubes, consider what happens at the juncture between the kth and (k + 1)st tubes. Since sound wave propagation in a tube obeys the law of continuity and Newton's force law (Halliday and Resnick, 1966), sound pressure and volume velocity must be continuous in both time and space everywhere along the multitube model. Consider the tube juncture shown in Fig. 3.15. Then the law of continuity requires the following boundary conditions to be satisfied at the juncture

    u_k(x_k = l_k, t) = u_{k+1}(x_{k+1} = 0, t)    (3.35)

and

    p_k(x_k = l_k, t) = p_{k+1}(x_{k+1} = 0, t),    (3.36)

indicating that sound pressure and volume velocity at the end of tube k must equal pressure and velocity at the beginning of tube k + 1. If (3.27) and (3.28) are substituted into (3.35) and (3.36) and the notation of (3.32), (3.33), and (3.34) employed, the following boundary relations are obtained:

    v_k^+(t − τ_k) − v_k^−(t + τ_k) = v_{k+1}^+(t) − v_{k+1}^−(t)    (3.37)

    Z_{0,k} [v_k^+(t − τ_k) + v_k^−(t + τ_k)] = Z_{0,k+1} [v_{k+1}^+(t) + v_{k+1}^−(t)].    (3.38)

Figure 3.15 illustrates that at the tube juncture, a portion of the positive traveling wave (moving left to right) in tube k is transmitted to tube k + 1, and a portion is reflected back into tube k. The same occurs for the negative traveling wave (moving right to left) from tube k + 1: a portion is transmitted into tube k (moving right to left), and a portion is reflected into tube k + 1 (moving left to right). We can obtain a relation that illustrates transmitted and reflected sound wave propagation at the juncture if (3.37) and (3.38) are solved for the positive wave transmitted into tube k + 1, v_{k+1}^+(t),

    v_{k+1}^+(t) = v_k^+(t − τ_k) [2A_{k+1} / (A_{k+1} + A_k)] + v_{k+1}^−(t) [(A_{k+1} − A_k) / (A_{k+1} + A_k)]    (3.39)
and the negative transmitted wave transmitted into tube k, v_k^−(t + τ_k) [obtained by subtracting (3.39) from (3.37)],

    v_k^−(t + τ_k) = v_{k+1}^−(t) [2A_k / (A_{k+1} + A_k)] − v_k^+(t − τ_k) [(A_{k+1} − A_k) / (A_{k+1} + A_k)].    (3.40)

FIGURE 3.16. Signal flow graph of a forward moving sound wave (volume velocity) at the junction between two tubes: an incident wave arriving at the tube junction splits into a transmitted wave and a reflected wave.

In deriving (3.39) and (3.40), we have used the fact that Z_{0,k} and Z_{0,k+1} contain a common factor ρc that cancels, leaving cross-sectional areas in the results. From (3.39), we see that the volume velocity wave transmitted into tube k + 1 is composed of a portion transmitted from the forward traveling wave in tube k, v_k^+(t − τ_k), and a portion reflected from the backward traveling wave in tube k + 1, v_{k+1}^−(t). Equation (3.40) indicates that the transmitted backward traveling wave into tube k consists of a partially transmitted backward traveling wave from tube k + 1, v_{k+1}^−(t), and a reflected portion from the forward traveling wave in tube k, v_k^+(t − τ_k). If we assume that the negative traveling wave v_{k+1}^−(t) in tube k + 1 is zero, then (3.39) and (3.40) reduce to

    v_{k+1}^+(t) = v_k^+(t − τ_k) ρ_k^+    (3.41)

    v_k^−(t + τ_k) = −v_k^+(t − τ_k) ρ_k^−,    (3.42)

where the ratios ρ_k^+ and ρ_k^− are the transmission coefficient and reflection coefficient between the kth and (k + 1)st tubes, respectively:

    ρ_k^+ ≜ 2A_{k+1} / (A_{k+1} + A_k)    (3.43)

    ρ_k^− ≜ (A_{k+1} − A_k) / (A_{k+1} + A_k) = ρ_k^+ − 1.    (3.44)

In fact, since the reflection and transmission coefficients are simply related, it has become conventional to use only the reflection coefficients in analytical and modeling work. Let us henceforth adopt this convention and simplify the notation for the reflection coefficient by omitting the minus sign,[10]

    ρ_k ≜ ρ_k^− = (A_{k+1} − A_k) / (A_{k+1} + A_k),    (3.45)

noting that

    −1 ≤ ρ_k < 1.    (3.46)

The reflection coefficients become increasingly important as we strive for a digital filter representation.

Figure 3.16 presents a signal flow graph at the juncture of the kth and (k + 1)st tubes for a forward traveling sound wave. Using reflection coefficients, (3.39) and (3.40) can be written as

    v_{k+1}^+(t) = v_k^+(t − τ_k)(1 + ρ_k) + v_{k+1}^−(t) ρ_k    (3.47)

    v_k^−(t + τ_k) = v_{k+1}^−(t)(1 − ρ_k) − v_k^+(t − τ_k) ρ_k.    (3.48)

Equations (3.47) and (3.48), which are often called the Kelly-Lochbaum equations (Kelly and Lochbaum, 1962), immediately lead to the signal flow graph[11] shown in Fig. 3.17 for the juncture at the kth boundary. This structure was first used by Kelly and Lochbaum to synthesize speech in 1962. Note that this signal flow graph contains information equivalent to the tube diagram in Fig. 3.15, but, whereas the sections were characterized by areas and lengths in the acoustic model, here the sections are more readily characterized by the reflection coefficients and delays.

FIGURE 3.17. A signal flow representation of two lossless acoustic tubes.

[10] In Chapter 5 we will encounter the reflection coefficient again in conjunction with the digital model.

[11] For an introduction to the use of signal flow graphs in signal processing, see, for example, (Oppenheim and Schafer, 1989).
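The junction relations lend themselves to a quick numerical sketch. The following Python fragment (the function names and example areas are ours, not the text's) implements the reflection coefficient (3.45) and the Kelly-Lochbaum update (3.47)-(3.48):

```python
# Sketch of the Kelly-Lochbaum junction equations (3.47)-(3.48).
# Area values and variable names are illustrative, not from the text.

def reflection_coeff(A_k, A_k1):
    """rho_k = (A_{k+1} - A_k) / (A_{k+1} + A_k), per (3.45)."""
    return (A_k1 - A_k) / (A_k1 + A_k)

def junction(v_fwd_in, v_bwd_in, rho):
    """One scattering junction between tubes k and k+1.

    v_fwd_in : v_k^+(t - tau_k), forward wave arriving from tube k
    v_bwd_in : v_{k+1}^-(t), backward wave arriving from tube k+1
    Returns (v_{k+1}^+(t), v_k^-(t + tau_k)) per (3.47)-(3.48).
    """
    v_fwd_out = (1 + rho) * v_fwd_in + rho * v_bwd_in
    v_bwd_out = (1 - rho) * v_bwd_in - rho * v_fwd_in
    return v_fwd_out, v_bwd_out

rho = reflection_coeff(2.0, 3.0)   # A_k = 2, A_{k+1} = 3 -> rho = 0.2
f_out, b_out = junction(1.0, 0.0, rho)
print(rho, f_out, b_out)
```

With the backward wave set to zero, the outputs reduce to the special cases (3.41)-(3.42): the transmitted wave is scaled by 1 + ρ_k and the reflected wave by −ρ_k.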
Figure 3.16 isolates the kth tube junction and illustrates the reflection and transmission of the forward propagating sound wave at the boundary. A propagating sound wave experiences a delay in traveling the length of each tube, which is represented as a signal delay τ_k in Fig. 3.17 for the kth tube. Since each tube is lossless, the propagating wave is not attenuated. When the sound wave reaches a tube boundary, a portion is transmitted (1 + ρ_k), and the remaining portion reflected (−ρ_k). The fraction transmitted or reflected depends on the cross-sectional area mismatch at the tube juncture. If the mismatch is small (i.e., A_k ≈ A_{k+1}), then a larger portion will be transmitted (perfect transmission occurs when A_k = A_{k+1}). If a large mismatch is present (i.e., A_k ≫ A_{k+1}), more of the signal will be reflected.

The signal flow representation for two adjoining sections can be extended to N tubes in the obvious way. At each boundary, the flow and pressure dynamics are described in terms of the reflection coefficient and the delay of the section on the left. In order to complete the (lossless) acoustic model of speech production, we need to consider the boundary effects at the lips and glottis. This is the subject of the following section.

Tube Models for the Glottis and the Lips

The formulation of a tube model for either the glottis or the lips requires consideration of the half-infinite acoustic tube in Fig. 3.18. The figure illustrates a tube of cross-sectional area A and infinite length. For such a tube, an incident sound wave v^+(t) [or p^+(t)] injected at the source x = 0 will propagate to the right indefinitely. Since the propagating wave never experiences a tube boundary, no portion will ever be reflected. Now it is shown by letting l → ∞ in (3.106) in Appendix 3.A.2 that the impedance at location x in a lossless open tube of infinite length is simply the characteristic impedance of the tube:

    lim_{l→∞} Z_open tube(x, Ω) = Z_0.    (3.49)

FIGURE 3.18. A half-infinite acoustic tube of cross-sectional area A, with an incident wave v^+(t) injected at x = 0.

Since this impedance is assumed to be real, a half-infinite acoustic tube can be used to model a real or resistive load. Under such conditions, the characteristic impedance is sometimes replaced by a resistive load R_0.

Let us now consider the junction between the last vocal-tract tube (call it tube N) and the lips. We assert that the infinite open tube makes an appropriate model for the lip radiation. In any case, let us assume that we can model the lips with a tube section, say section N + 1. In most situations, real and synthetic, once the sound wave passes the lip boundary, it radiates outward. If the radiating volume velocity wave meets no obstacles, no reflections will propagate back down into the vocal system. Thus the negative traveling wave v_{N+1}^−(x_{N+1}, t) [or p^−(x_{N+1}, t)], for any x_{N+1} and any t, should be zero (see Fig. 3.19):

    v_{N+1}^−(x_{N+1}, t) = p^−(x_{N+1}, t) = 0    (3.50)

for all x_{N+1} and t. Therefore, we can model the effects of radiation from the lips using a half-infinite tube. The other property which must hold to make the half-infinite tube appropriate is that the radiation load must be purely real. Assuming that this is the case, and that a real impedance Z_lips is encountered by the traveling sound wave as it reaches the lips, then the half-infinite tube with cross-sectional area A_{N+1} such that

    Z_{0,N+1} = ρc / A_{N+1} = Z_lips    (3.51)

is a good model.

FIGURE 3.19. Half-infinite tubes appended at the glottis and at the lips of an N-tube vocal-tract model (first vocal tube through Nth vocal tube).

Let us next derive the signal flow graph for the lip boundary. We use the explicit subscript "lips" to denote the quantities at the lip boundary, so in the present case, ρ_lips ≜ ρ_N. By definition, the reflection coefficient is given by

    ρ_lips = (A_{N+1} − A_N) / (A_{N+1} + A_N),    (3.52)
which can also be expressed as

    ρ_lips = (Z_{0,N} − Z_lips) / (Z_{0,N} + Z_lips).    (3.53)

Now in light of (3.48) and (3.50) we can write

    v_N^−(t + τ_N) = −ρ_lips v_N^+(t − τ_N).    (3.54)

Accordingly, the output volume velocity at the lip boundary is

    u_N(l_N, t) = v_N^+(t − τ_N) − v_N^−(t + τ_N) = (1 + ρ_lips) v_N^+(t − τ_N).    (3.55)

The signal flow diagram reflecting (3.54) and (3.55) has been appended to the output of the tube model in Fig. 3.20.

FIGURE 3.20. Signal flow diagrams of acoustic tube junctures for modeling the glottis and lips.

A half-infinite tube can also be used to model the termination at the glottis. Consider an incident sound wave at the juncture of the glottis and the first vocal-tract tube as shown in Fig. 3.19.[12] A portion of the volume velocity wave will be transmitted into tube 1, and a portion reflected. The reflected portion travels back down through the subglottal vocal system (trachea and lungs). This backward traveling wave therefore does not contribute appreciably to sound production (it is generally assumed that the soft lung tissue absorbs the majority of this energy). In terms of sound propagation and modeling, this backward traveling wave is typically ignored. The glottal termination is therefore modeled as a volume velocity source u_glottis(t) in parallel with a glottal impedance Z_glottis [as in (3.6)]. The glottal impedance is a time-varying acoustic impedance which is a function of the inverse cross-sectional area of the glottis, 1/A_glottis(t). For example, when the glottis is closed [A_glottis(t) = 0], the glottal impedance becomes infinite. Two equivalent "circuit" models are shown in Fig. 3.21. It is very important to notice that these models are not time invariant and that in neither case do the source and impedance vary independently. Consider the current source model, for example. In the case of infinite impedance, the parallel branch containing Z_glottis becomes an open circuit, but the source current is also simultaneously reduced to zero. In general, as the glottis opens and closes, the glottal impedance varies between infinity and a finite value, causing the input volume velocity of the vocal tract to vary between zero and a finite value. The variation in the volume velocity wave is exactly the pulselike glottal flow wave shown earlier in Fig. 3.8. One must be careful to remember this time dependence and source-impedance coupling when using the circuit analogies.

FIGURE 3.21. Transmission line circuit model for the glottis and lips.

If a first-order approximation is used, the time-varying glottal impedance can be approximated by the fixed impedance

    Z_glottis(Ω) = R_glottis + jΩ L_glottis.    (3.57)

The transmission line circuit model shown in Fig. 3.21 may now be considered a linear, time-invariant circuit. As illustrated in this figure, the net volume velocity into the first vocal-tract tube model is obtained by subtracting that portion lost due to glottal impedance (using a current divider relation[13]),

    u_1(0, t) = u_glottis(t) − p_1(0, t) / Z_glottis(Ω).

[12] Here we assume that the vocal tract is modeled with a concatenation of N acoustic tubes. The tube closest to the glottis is tube 1.

[13] For the half-infinite tube used to model the glottis, the location x = 0 is taken to be just inside the juncture of the glottis and first vocal-tract tube. The location x = −∞ is the "left end" of the tube.
Solving for the forward traveling sound wave at the boundary, v_1^+(0, t), we obtain

    v_1^+(0, t) = [(1 + ρ_glottis) / 2] u_glottis(t) + ρ_glottis v_1^−(0, t),    (3.60)

where the glottal reflection coefficient is

    ρ_glottis = [Z_glottis(Ω) − (ρc/A_1)] / [Z_glottis(Ω) + (ρc/A_1)].    (3.61)

The input volume velocity at the glottis is comprised of only that portion transmitted into tube 1, plus that portion of the backward traveling wave in the tube which is reflected at the glottal tube/tube 1 juncture. A signal flow graph depicting wave propagation from (3.60) is included at the input of Fig. 3.20.

Finally, we notice that, like the lip reflection coefficient ρ_lips, the glottal reflection coefficient ρ_glottis is generally frequency dependent because of the involvement of an impedance. However, like Z_lips, Z_glottis is frequently taken to be real for ease in vocal system modeling. Therefore, the terminal effects of the glottis and lips can be represented in terms of an impedance and a volume velocity source for transmission line circuit analysis as in Fig. 3.21, or using signal flow diagrams like Fig. 3.20.

Complete Signal Flow Diagram

As an example of a complete representation of sound wave propagation, we consider a two-tube (N = 2) acoustic model for the vocal tract. We hasten to point out that the choice N = 2 is made for simplicity; in practice the number N would ordinarily be chosen much larger (typically 10-14; see Chapter 5). Using a two-tube signal flow representation for the vocal tract (from Fig. 3.17), and signal flow representations for the glottis and lips (from Fig. 3.20), we obtain the overall system flow diagram in Fig. 3.22. The overall system transfer function is found by taking the ratio of volume velocity at the lips to that at the glottis. We can use phasors or assume complex exponential signals (see Section 1.5). The volume velocity at the lips, u_lips(t), is simply the input sound wave to the half-infinite tube that models the lips,

    u_lips(t) = v_{N+1}^+(0, t) = v_3^+(0, t).    (3.62)

FIGURE 3.22. Overall signal flow diagram for a two-tube acoustic model of speech production.

Accordingly, using Mason's gain rule [see, e.g., (Oppenheim and Schafer, 1989)] on the graph of Fig. 3.22, we have

    H_two-tube(Ω) = U_lips(Ω) / U_glottis(Ω)
                  = [(1 + ρ_glottis)/2] (1 + ρ_lips) (1 + ρ_1) e^{−jΩ(τ_1 + τ_2)}
                    / [1 + ρ_1 ρ_glottis e^{−jΩ2τ_1} + ρ_1 ρ_lips e^{−jΩ2τ_2} + ρ_glottis ρ_lips e^{−jΩ2(τ_1 + τ_2)}].    (3.63)

Some features of this transfer function should be noted. First, the magnitude of the numerator is the product of the multiplier terms in the forward signal path, while the phase term e^{−jΩ(τ_1 + τ_2)} represents the delay experienced by a signal propagating through the forward path. Second, if jΩ is replaced by the complex variable s, the poles of the system function H_two-tube(s) represent the complex resonant frequencies of the system. Fant (1970) and Flanagan (1972) have shown that if cross-sectional areas and lengths are chosen properly, this transfer function can approximate the magnitude spectrum of vowels.

Loss Effects in the Vocal-Tract Tube Model

We should return to the starting point of this discussion and recall some simplifying assumptions made regarding the vocal tract. A number of unrealistic constraints have influenced the outcome of this development. In particular, we assumed that the vocal tract can be appropriately modeled by a hard-walled, lossless series of tubes. In fact, energy losses
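The behavior of (3.63) is easy to probe numerically. The sketch below evaluates the two-tube transfer function on a frequency grid and locates magnitude peaks, which approximate the formant frequencies; the tube delays and reflection-coefficient values are illustrative choices of ours, not values from the text:

```python
# Numerical check of the two-tube transfer function (3.63).
# Tube lengths and reflection coefficients are illustrative values only.
import cmath, math

c = 350.0                           # speed of sound (m/s)
tau1, tau2 = 0.09 / c, 0.085 / c    # one-way tube delays, per (3.34)
rho1, rho_g, rho_l = 0.5, 0.9, 0.7  # juncture, glottal, lip reflection coefficients

def H_two_tube(omega):
    """Evaluate (3.63) at analog frequency omega (rad/s)."""
    e = lambda d: cmath.exp(-1j * omega * d)
    num = 0.5 * (1 + rho_g) * (1 + rho_l) * (1 + rho1) * e(tau1 + tau2)
    den = (1 + rho1 * rho_g * e(2 * tau1)
             + rho1 * rho_l * e(2 * tau2)
             + rho_g * rho_l * e(2 * (tau1 + tau2)))
    return num / den

# Scan 50-4000 Hz; local maxima of |H| approximate the formant frequencies.
freqs = [50 + 2 * i for i in range(1976)]
mags = [abs(H_two_tube(2 * math.pi * f)) for f in freqs]
peaks = [freqs[i] for i in range(1, len(mags) - 1)
         if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]
print(peaks)
```

Replacing jΩ by s and finding the poles of the resulting rational function would give the same resonances analytically, as noted in the text.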
occur during speech due to viscous friction between the flowing air and the tract walls, vibration of the walls, and heat conduction through the walls. A proper analysis would require that we modify (3.16) and (3.17) to reflect these losses and proceed accordingly. Of course, some rather intractable mathematics would result, a fact which is compounded by the frequency dependence of the loss effects. The usual approach to this problem taken by speech engineers is to learn to anticipate certain frequency-domain effects which occur as a result of the tract losses. With a basic understanding of these effects, the speech engineer can usually build successful algorithms and analyses on the simplified lossless model.

The most significant of the losses mentioned above arises due to the tract wall vibration. The walls of the tract are, relatively speaking, rather massive structures and are therefore more responsive to lower frequencies. This fact corresponds to a preferential broadening of the bandwidths of the lower formants. On the other hand, the viscous and thermal losses are most influential at higher frequencies, so these losses tend to broaden the bandwidths of the higher formants. If we use the analysis in (Flanagan, 1972) and (Portnoff, 1973), it can also be shown that the vibration losses tend to slightly raise the formant center frequencies, while the other losses tend to lower them. The net change is a small shift upward with respect to the lossless tube model.

Whereas the lossless tube model is not, strictly speaking, appropriate for real speech due to the ignored loss effects, the analytical models to which these early "lossless" developments will lead will often be able to "adjust themselves" to handle the aberrations which occur in the spectrum. It is therefore important that we clearly understand the lossless case, and that we be able to anticipate what spectral clues may be present that indicate certain real phenomena.

3.1.5 Models for Nasals and Fricatives

The production of the nasal phonemes /n, m, ŋ/ requires that the velum be open and the lips closed. This creates a very different tube configuration from the vocal-tract case discussed above. In the nasal case, the oral cavity creates an acoustical side cavity branching off the main path from glottis to nostrils. Analytically, the result is a set of three acoustical differential equations, the solution of which, for many practical purposes, is not warranted. As in the case of the vocal-tract losses, it is usually sufficient to be aware of frequency-domain effects caused by nasality. In particular, the side cavity introduced by the nasal configuration will tend to trap energy, leading to antiresonances (zeros) that appear in addition to the resonances (poles) in the magnitude spectrum. It has also been found that nasal phonemes tend to have somewhat broader-bandwidth formants at high frequencies (Fujimura, 1962) due to heat and friction losses along the large walls of the nasal tract.

The modeling of consonants in general, and fricatives in particular, is a complex subject, the details of which would take us too far afield. It has generally been found that a three-tube model of the vocal tract works well to model the main front and back cavities connected by a narrow constriction. Each of the tubes produces resonances (formants) whose frequencies are related to the tube lengths. Like nasals, the fricative tract configuration contains cavities that can trap energy, leading to antiresonances in the overall spectrum. It has been found that fricatives exhibit very little energy below the first zero in their spectra, which is usually quite high in the Nyquist range.

Like the vocal-tract loss case, in practice we often use a model that was fundamentally derived using the lossless vocal-tract assumption (without side cavities) to analyze nasals and fricatives. To a lesser extent than in the vocal-tract loss case, the ideal model will be able to handle these antiresonance phenomena in the spectra. However, antiresonances are often more problematic than simple broadening of formant bandwidths because, whereas the latter can be accomplished by a simple shifting of poles in the model, the former requires the presence of spectral zeros for a good representation. However, we shall find that models whose spectra require only poles are to be greatly desired in speech modeling.

3.2 Discrete-Time Modeling

3.2.1 General Discrete-Time Speech Model

We have considered several strategies for modeling the speech production system. Based on early observations of the resonant structure of cylindrical tubes, an acoustic tube model was first considered for speech modeling. An example of a six-tube model is shown in Fig. 3.23(a). We superficially noted the analogy between acoustical systems and the electric transmission line and have occasionally employed that analogy in certain developments. In particular, glottal excitation and load effects from lip radiation were modeled using electric circuit analogies. This analysis, and further consideration of the properties of the forward and backward traveling sound waves in an acoustic model like Fig. 3.23(a), ultimately led to the signal flow representation in Fig. 3.23(b). A more detailed view of the transmission line analogy in Appendix 3.A.2 can be used to develop the model shown in Fig. 3.23(c), in which a series of T-network impedances is used to model the four vocal-tract tubes found in Fig. 3.23(a). It should be stressed that the cross-sectional area and length of each of the acoustic tubes fully characterize each of the three models.

The resulting signal flow diagram in Fig. 3.23(b) suggests that the lossless tube model exhibits characteristics common to a digital filter model. The final signal flow for the four-tube vocal-tract model contains only additions, multiplications, and delays. These operations are easily
implemented in a discrete-time model. The only restriction to satisfy is

    τ_total = Σ_{k=1}^{N} τ_k = L / c,    (3.64)

where τ_total represents the amount of time necessary for a sound wave to travel the entire length of the vocal-tract model, L = Σ_{k=1}^{N} l_k.

FIGURE 3.23. (a) Final acoustic tube model: a glottal tube, four vocal-tract tubes of areas A_1-A_4, and a lip tube. (b) Final signal-flow model for the glottis, vocal tract (tubes 1-4), and lips. (c) An equivalent transmission line circuit representation for the acoustic tube model in (a).

To ensure a smooth transition to the discrete-time domain, let us consider a set of N tubes, each of a fixed common length, say Δx = L/N (see Fig. 3.24). An analysis of wave propagation for this acoustic tube model is equivalent to the previous systems, with the exception that here each tube possesses the same delay

    τ = Δx / c = L / (cN).    (3.65)

This limits the number of variables available for simulating the vocal-tract cross-sectional area function A(x, t). With this restriction, the signal flow model of Fig. 3.23(b) is replaced by that in Fig. 3.25(a), where each tube delay τ_k has been replaced by the constant delay τ.

FIGURE 3.24. A concatenation of seven lossless tubes of equal length. Adapted from (Rabiner and Schafer, 1978).

If a discrete-time impulse is injected into this signal flow model, the earliest an output
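Equation (3.65), together with the sample period T = 2τ chosen shortly, ties the number of equal-length tube sections to an implied sampling rate F_s = 1/T = cN/(2L). A rough numerical sketch, with illustrative values of L and c:

```python
# Relationship between model order and sampling rate implied by (3.65)
# and the choice T = 2*tau. Values of L and c are illustrative.
L = 0.175        # total vocal-tract length (m)
c = 350.0        # speed of sound (m/s)

for N in (6, 10, 14):                 # number of equal-length tube sections
    tau = L / (c * N)                 # per-tube one-way delay, per (3.65)
    T = 2 * tau                       # sample period chosen in the text
    Fs = 1 / T                        # implied sampling frequency in Hz
    print(N, tau, Fs)                 # e.g., N = 10 gives Fs = 10000 Hz
```

This is one reason 10-14 sections pair naturally with the 8-10 kHz sampling rates common in speech processing.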
FIGURE 3.25. (a) A modified signal flow model employing lossless tubes of equal length. (b) A discrete-time signal flow diagram for the model shown in (a). (c) A final discrete-time signal flow diagram using whole-sample delays for the model shown in (a).

would occur is Nτ units of time (4τ seconds for the model pictured). If a sample period of T = 2τ is chosen for the discrete-time system, then Nτ seconds corresponds to a shift of N/2 samples. For an even number of tubes, N/2 is an integer and the output can be obtained by shifting the sequence through the signal flow model. If N is odd, however, output values must be obtained between sample locations, thus requiring an interpolation step. Such a delay in most cases is ignored, since it has little effect for most speech applications.

Therefore, an equivalent discrete-time system can be obtained by substituting sample delays of z^{−1/2}. The half-sample delays shown in Fig. 3.25(b) imply that a sample must be interpolated between each input sample, which is infeasible in practice. To address this issue, consider a volume velocity wave traveling the length of a lossless tube section. Figure 3.25 illustrates that the shape of the incident wave at the entrance of tube j (x_j = 0) will not change until the wave has propagated the full length of the tube, met a tube juncture, been completely or partially reflected, and the reflected wave has traveled back the entire length of the tube. Therefore, for the volume velocity to change at any boundary location along the signal flow graph requires a round-trip delay of 2τ. This is verified by obtaining the cumulative delay of any closed loop in the signal flow graph of Fig. 3.25(a). For this figure, cumulative delays of 2τ, 4τ, 6τ, and 8τ are possible. If only external signal flow model analysis is considered, equivalent loop equations can be obtained if the half-sample delays in the feedback path are reflected up into the forward path, resulting in Fig. 3.25(c). This final discrete-time signal flow model is equivalent to models (a) and (b) in its ability to characterize speech; however, it has the distinct advantage of employing unit sample delays. A consequence of moving the half-sample delays into the forward path is that an additional delay of N/2 samples is introduced. To counteract this effect, a signal "advance" by N/2 samples is added after the lip radiation term. Such an advance presents no difficulty if the discrete-time digital filter representation is to be implemented on the computer. The final discrete-time transfer function for a two-tube vocal-tract model can be shown to have the following form [from (3.63)],

    H_two-tube(z) = U_lips(z) / U_glottis(z)
                  = [(1 + ρ_glottis)/2] (1 + ρ_1) (1 + ρ_lips) z^{−1}
                    / [1 + (ρ_1 ρ_glottis + ρ_1 ρ_lips) z^{−1} + ρ_glottis ρ_lips z^{−2}].    (3.66)

The two-tube z-transform system function in (3.66) is relatively easy to obtain. However, calculation of transfer functions using signal flow analysis becomes increasingly complex as the number of tubes increases. For example, the four-tube vocal-tract model in Fig. 3.25(c) possesses a transfer function of similar form to (3.66), but calculation of the coefficients from flow graphs becomes unwieldy as the model order increases to 10 or more. One solution to this problem is to resort to the use of two-port network models for each section of the tube. This general approach is described in Appendix 3.A.2; its application to the fast computation of discrete-time transfer functions for models with a large number of sections is given in Appendix 3.C. The general N-section lossless model is shown there to have a z-domain system function of the form

    H(z) = [(1 + ρ_glottis)/2] z^{−N/2} ∏_{k=1}^{N} (1 + ρ_k) / [1 − Σ_{k=1}^{N} b_k z^{−k}],    (3.67)
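Equation (3.66) maps directly to a second-order difference equation. The sketch below (the reflection-coefficient values are illustrative choices of ours) reads the filter coefficients off (3.66) and generates the first few samples of the impulse response:

```python
# Direct implementation of the two-tube discrete-time model (3.66) as a
# difference equation; reflection-coefficient values are illustrative.

rho1, rho_g, rho_l = 0.5, 0.9, 0.7

# H(z) = b1*z^-1 / (1 + a1*z^-1 + a2*z^-2), coefficients read off (3.66)
b1 = 0.5 * (1 + rho_g) * (1 + rho1) * (1 + rho_l)
a1 = rho1 * rho_g + rho1 * rho_l
a2 = rho_g * rho_l

def filter_impulse(n_samples):
    """Impulse response of (3.66):
    u_lips(n) = b1*u_g(n-1) - a1*u_lips(n-1) - a2*u_lips(n-2)."""
    y = [0.0] * n_samples
    for n in range(n_samples):
        x_delayed = 1.0 if n == 1 else 0.0   # unit impulse input, seen through b1*z^-1
        y[n] = (b1 * x_delayed
                - (a1 * y[n - 1] if n >= 1 else 0.0)
                - (a2 * y[n - 2] if n >= 2 else 0.0))
    return y

h = filter_impulse(8)
print(h[:4])   # h[0] = 0, h[1] = b1, h[2] = -a1*b1, ...
```

Because both poles of the denominator lie inside the unit circle for these coefficient values, the impulse response decays, consistent with a stable vocal-tract model.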
model GCr)
models.
One final point is essential. As we have already seen in the early dis
cussions of source modeling, there are various transfer functions that are
of interest in speech modeling, and we have urged the reader to under VOIced! Ug[OlIl, (n ) .1 Vocal-uact Speec h
Radiation
stand which transfer function is being considered in a given develop unvoiced
modell/e, l model R( z)
PJ'I" (n) = <en }
sw itch
ment. In the present case, H(z) indicates the system function between the
VOcal. lra,'!)
volume velocity flow at the output of the glottis and that at the lips. By ( parameters
definition, this transfer ratio is that associated with the vocal tract, and
the notation H(z) is consistent with previous developments [see (3.3) and Random
(3.7)]. The reader might wonder, then , about the meaning of tube 0 (glot noise
generator
tal tube) and tube N + I (lip tube) in Figs. 3.23 and 3.24. A careful read
ing of the foregoing will indicate that these "tubes" do not represent the
Gain for nOLSe source
larynx and lips per se, but rather are present as a device for modeling the
boundary effects between the larynx and vocal tract, and vocal tract and FIGURE 3.26. A general discrete-time model for speech production. After
pole model, however, it has been very successfully used in many analysis, where P corresponds to the peak time of the pulse and K the time at
coding, and recognition tasks, as we shall see in our studies. which complete closure occurs.
Each pair of poles in the z-plane at complex conjugate locations
(p_i, p_i*) roughly corresponds to a formant in the spectrum of H(z).
Since H(z) should be a stable system, all poles are inside the unit circle
in the z-plane. If the poles in the z-plane are well separated, good
estimates for formant frequencies and bandwidths are

    F_i ≈ (F_s / 2π) tan⁻¹[ Im(p_i) / Re(p_i) ]                       (3.70)

and

    B_i ≈ −(F_s / π) ln |p_i|,                                        (3.71)

where F_i represents the ith formant and p_i its corresponding pole in
the upper half of the z-plane. Also, F_s = 1/T denotes the sampling
frequency in Hz.

For voiced speech, the digital model must also include a section that
models the laryngeal shaping filter, G(z). Depending on the purpose of
the model, the glottal filter may also be constrained to be an "all-pole"
transfer function like (3.69). It is often suggested that the two-pole
signal,

    g(n) = [aⁿ − βⁿ] u(n),    β < a < 1,  a ≈ 1,                      (3.72)

in which u(n) is the unit step sequence, is an appropriate choice for the
impulse response of the filter. In the sense that this pulse can be made
to have a similar spectral magnitude to many empirical results, this
choice is a good one. Its principal benefit is that it does not require
zeros to model G(z). However, an "all-pole" impulse response
corresponding to any number of poles is incapable of producing realistic
pulse shapes observed in many experiments, because it is constrained to
be of minimum phase. In particular, it is not possible to produce pulse
shapes for which the "opening phase" is "slower" than the "closing phase"
(Problem 3). These features of the pulse have been well documented in
many papers in the literature (e.g., Timke et al., 1948; Deller, 1983).
Many pulse signals have been suggested in the literature; one of the more
popular is due to Rosenberg (1971),

           ⎧ (1/2)[1 − cos(πn/P)],           0 ≤ n ≤ P
    g(n) = ⎨ cos[π(n − P) / 2(K − P)],       P < n ≤ K                (3.73)
           ⎩ 0,                              otherwise,

where P corresponds to the peak time of the pulse and K the time at
which complete closure occurs.

The radiation component, R(z) [or, in terms of impedance and "analog"
frequencies, Z_lips(Ω)], can be thought of as a low-impedance load that
terminates the vocal tract and converts the volume velocity at the lips
to a pressure wave in the far field. In the discussion of tube models, we
assumed this impedance to be real for convenience, but a more accurate
model is given by Z_lips(Ω) such that (Flanagan, 1972)

    |Z_lips(Ω)| = Ω K₁ K₂ / sqrt( K₁² + Ω² K₂² )                      (3.74)

and

    arg{ Z_lips(Ω) } = π/2 − arctan( Ω K₂ / K₁ ),                     (3.75)

where K₁ = 128/9π² and K₂ = 8r/3πc, with r the radius of the opening in
the lips (assumed circular) and c the velocity of sound. From (3.75) it
is clear that Z_lips(Ω) becomes real only asymptotically as the frequency
increases, and tends to a purely imaginary quantity as frequencies
decrease. However, similarly to the case of the glottal model above, it
is often the objective to model only spectral magnitude effects. In this
case we note that Z_lips has a highpass filtering effect: |Z_lips(0)| = 0
and d|Z_lips|/dΩ tends to remain positive on most practical Nyquist
ranges of frequencies (e.g., 0-4000 Hz). A simple digital filter that has
these properties is a differencer,

    R(z) = Z_lips(z) = 1 − z⁻¹.                                       (3.76)

This filter has a single zero at z₀ = 1 in the z-plane. Since there are
occasions in which the inverse of this filter, R⁻¹(z), arises in speech
modeling, it is customary to decrease the radius of z₀ slightly so that
the inverse filter will be stable. In this case the model becomes

    R(z) = 1 − z₀ z⁻¹,    z₀ ≈ 1,  z₀ < 1.                            (3.77)

A second reason for moving the zero off the unit circle is that with a
microphone approximately 30 cm from the speaker's lips, the analysis has
not totally left the acoustic near field; therefore low-frequency
preemphasis by a full 6 dB/octave is not fully justified (Flanagan,
1972).

We have gone to some lengths above to assure that both H(z) and G(z)
could be modeled in some sense with all poles. The inclusion of the
single zero z₀ in R(z) will destroy the "all-pole" nature of the total
model if we do not find a way to "turn the zero into poles." There are
two methods for doing so. We believe that the second is preferable for
the reader studying these modeling techniques for the first time.

3.2 / Discrete-Time Modeling 197
196 Ch. 3 / Modeling Speech Production
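The pole-to-formant mapping of (3.70) and (3.71) is easy to check numerically. The short sketch below is illustrative only (the function name `pole_to_formant` is ours, not a standard routine): it places a pole at a known frequency and radius and recovers the formant estimates.

```python
import cmath
import math

def pole_to_formant(p, fs):
    """Formant frequency and bandwidth (Hz) of a z-plane pole, per (3.70)-(3.71)."""
    # atan2 handles poles with Re(p) < 0, where the plain ratio in (3.70)
    # would need a quadrant correction
    F = (fs / (2 * math.pi)) * math.atan2(p.imag, p.real)
    B = -(fs / math.pi) * math.log(abs(p))
    return F, B

fs = 8000.0
p = 0.95 * cmath.exp(2j * math.pi * 500 / fs)   # pole of radius 0.95 placed at 500 Hz
F, B = pole_to_formant(p, fs)
```

At F_s = 8 kHz, the radius-0.95 pole yields a bandwidth of roughly 131 Hz; moving the radius toward the unit circle narrows the formant, consistent with the ln|p_i| term in (3.71).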
The first method of preserving the all-pole structure of the overall
model is to argue that (3.72) is a good model for the glottal dynamics
and that one of its poles will cancel the zero z₀ in R(z). Although this
argument is satisfactory for the experienced speech processing engineer,
it is potentially the source of several misconceptions for the person new
to the field. We therefore urge the reader to "decouple" the models for
G(z) and R(z) and simply note that the model (3.77) can be written

    R(z) = 1 − z₀ z⁻¹ ≈ 1 / [ Σ_{k=0}^{K} z₀ᵏ z⁻ᵏ ],                  (3.78)

where K is theoretically infinite, but practically finite because
z₀ < 1. Therefore, to the extent that (3.77) is a good model for the lip
radiation characteristic, we see that this model can be represented by an
all-pole filter with a practically finite number of poles.

Let us summarize these important results. Ignoring the technicalities
of z-transform existence [see discussion below (3.7)], we assume that the
output (pressure wave) of the speech production system is the result of
filtering the appropriate excitation by two (unvoiced) or three (voiced)
linear, separable filters. If we ignore the developments above
momentarily, let us suppose that we know "exact" or "true" linear models
of the various components. By this we mean that we (somehow) know models
that will exactly produce the speech waveform under consideration. These
models are only constrained to be linear and stable and are otherwise
unrestricted. In the unvoiced case

    S(z) = E(z) H(z) R(z),                                            (3.79)

where E(z) represents a partial realization of a white noise process. In
the voiced case,

    S(z) = E(z) G(z) H(z) R(z),                                       (3.80)

where E(z) represents a discrete-time impulse train of period P, the
pitch period of the utterance. In the above, G, H, and R represent the
"true" models. Accordingly, the true overall system function is

    Θ(z) = S(z)/E(z) = ⎧ H(z) R(z),         unvoiced case
                       ⎩ G(z) H(z) R(z),    voiced case.              (3.81)

With enough painstaking experimental work, we could probably deduce
reasonable "true" models for any stationary utterance of interest. In
general, we would expect these models to require zeros as well as poles
in their system functions. In fact, there are several arguments above
against the appropriateness of an all-pole model. If we were asked to
deduce a good speech production model, at this point in our study, we
have no grounds for endeavoring to make it all-pole. In fact, the
contrary is true. Yet, as we have repeatedly indicated above, an all-pole
model is often desirable. An important fundamental point for the reader
to understand is that when we focus attention on the all-pole model of
speech production, we do so with the understanding that there might be a
better model if the objective is to exactly generate a given speech
waveform using its model.

The preoccupation with an all-pole model of the speech production
system, however, arises from the fact that a very powerful and simple
computational technique, linear prediction analysis (studied in Chapter
5), exists for deriving an all-pole model of a given speech utterance.
The extracted model will be optimal in a certain sense, but not in the
sense of replicating the waveform from which it is derived. If the
identified model cannot necessarily replicate the waveform, it is natural
to ask, "In what sense is an all-pole model of the speech production
system appropriate?" The answer is inherent in the previous discussions.
Tracing back through the discussions for each filter section, it will be
noticed that in each case (H, G, and R), an argument is made for the
appropriateness of an all-pole filter in the sense of preserving the
spectral magnitude of the signal. Apparently, an all-pole model exists
that will at least produce a waveform with the correct magnitude
spectrum. As we have indicated earlier in the chapter, a waveform with
correct spectral magnitude is frequently sufficient for coding,
recognition, and synthesis.

We should leave this chapter, then, with the following basic
understandings. Computational techniques (to be developed in Chapter 5)
exist with which we can conveniently identify (find the filter
coefficients, or poles, of) an all-pole model from a speech waveform.
This all-pole model will be potentially useful if the objective is to
model the magnitude spectral characteristics of the waveform. Our
knowledge of acoustics might lead us to believe that there is a better
"true" model as formulated in (3.81). However, we will be satisfied to
compute a model, say Θ̂(z), which is accurate in terms of the spectrum.
In fact, we will show in Chapter 5 that the "true" model, Θ(z), and the
estimated all-pole model, Θ̂(z), are clearly related. Ideally, of course,
Θ(z) and Θ̂(z) will have identical magnitude spectra. It will be shown
that if enough poles are included in Θ̂(z), then the all-pole model will
be the minimum-phase part of Θ(z). This is entirely reasonable, since
Θ̂(z) has all of its singularities inside the unit circle and therefore
must be minimum phase.

3.2.3 Other Speech Models

In this chapter, we have discussed several schemes for modeling human
speech production. The initial goal was to develop a modeling approach
that would match as closely as possible the resonant structure (formant
locations) of the human vocal tract that produced the corresponding
speech sound.
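The unvoiced and voiced syntheses (3.79) and (3.80) can be sketched in a few lines. In this illustrative fragment the pole locations, pitch period, and radiation zero are arbitrary choices of ours, and the glottal shaping filter G(z) is omitted for brevity; it is a sketch of the structure, not a production synthesizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical all-pole vocal tract H(z) = 1/A(z): resonances placed by hand
# at 500 and 1500 Hz for an 8-kHz sampling rate (illustrative values only).
fs = 8000.0
poles = np.array([0.97 * np.exp(2j * np.pi * 500 / fs),
                  0.95 * np.exp(2j * np.pi * 1500 / fs)])
a = np.real(np.poly(np.concatenate([poles, poles.conj()])))  # A(z), a[0] = 1

def all_pole_filter(a, x):
    """y(n) = x(n) - sum_{k>=1} a_k y(n-k): direct-form realization of 1/A(z)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] - sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
    return y

# Unvoiced case, (3.79): E(z) is a white-noise realization.
s_unvoiced = all_pole_filter(a, rng.standard_normal(400))

# Voiced case, (3.80) with G(z) omitted: E(z) is an impulse train, pitch period 80.
e = np.zeros(400)
e[::80] = 1.0
s_voiced = all_pole_filter(a, e)
# Radiation R(z) = 1 - z0 z^{-1} with z0 = 0.95, per (3.77).
s_voiced = np.concatenate(([s_voiced[0]], s_voiced[1:] - 0.95 * s_voiced[:-1]))
```

Because all poles lie inside the unit circle, both outputs remain bounded; replacing the impulse train with a shaped glottal pulse would add the G(z) stage of (3.80).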
This notion was motivated by earlier scientists who capitalized on their
knowledge of the physics of sound propagation in musical instruments
such as pipe organs. Early findings prompted the development of
mechanical and electromechanical speaking machines.

The acoustic tube, transmission line, and digital filter models
presented here all assume planar wave propagation along the vocal-tract
axis, under lossless conditions, with a stationary point and type of
excitation. The solution for addressing losses normally entails a
modification to the output vocal-tract frequency response of a lossless
model. The point and type of excitation are likewise assumed fixed over
the analysis interval. The speech model presented in Section 3.2.2
reflects a generally accepted means for characterizing the changing
speech characteristics for the purposes of analysis. In Chapter 5, we
will present a fast algorithm for obtaining speech parameters for this
model given a short interval of speech. The extraction of these "linear
prediction parameters" opens many application areas of interest for the
reader. However, we reemphasize that although this is the generally
accepted model, it is more of a digital filter representation of the
speech waveform than a true model of speech production. Here, we
consider several issues that are not adequately addressed in the
classical digital filter model, but are submitted as issues for the
reader to reflect upon.

The classical digital filter model in Fig. 3.26 provides a switch to
distinguish between voiced and unvoiced speech production. This is a
serious limitation for several reasons. First, human speech production
does not require voicing to turn off immediately prior to an unvoiced
phoneme. Second, some phonemes possess two sources of excitation, vocal
fold movement and a major constriction, resulting in both voiced and
unvoiced forms of excitation. Since the model is based on acoustic tube
theory assuming short-term stationarity, it lacks the ability to
characterize rapidly changing excitation properties such as those found
in plosive sounds like /t/ and /b/. The method for handling such sounds
is to assume short-term stationarity and "hope for the best." Further
research is under way to develop a more realistic means of
characterizing excitation properties. This work has resulted in several
alternative coding schemes, which are discussed in Chapter 7.
Unfortunately, these methods have generally been introduced to address
the existing limitations of the present digital filter model for speech
coding and have only partially addressed the need for formulating
improved human speech production models.

In the discussion of speech modeling, this chapter has assumed a
linear source/filter model for speech production assuming planar sound
propagation [see Fig. 3.27(a)]. This requires excitation at the glottis
to be decoupled from the vocal tract. We should expect that coupling of
the excitation source and vocal tract influences the resulting speech
signal and that such coupling should therefore be incorporated in
production modeling. In addition, the linear source/filter model assumes
that as the vocal folds begin to open, the propagating sound pressure
wave expands to fill the cross-sectional area of the vocal tract.
However, the physics of the larynx suggest that this cannot occur. Sound
pressure and volume velocity measurements within cast models of the
human speech system suggest that the nonlinear fluid dynamics as
illustrated in Fig. 3.27(b) are a more realistic means of characterizing
sound propagation along the vocal tract. Studies by Teager (1980, 1983,
1989) suggest that the vortices located in the false vocal folds provide
the necessary source of excitation during the closed phase of the vocal
folds. It has also been suggested that the propagating sound wave
adheres to the vocal-tract walls, resulting in laminar flow.

[FIGURE 3.27. (a) The classical interpretation of sound propagation
along the vocal tract, with a velocity profile filling the oral cavity.
(b) A nonlinear fluid-dynamic interpretation, with sheetlike laryngeal
jet flow, a reattachment region near the false vocal folds, and flow
lines along the true vocal folds and trachea.]

A complete solution for such a nonlinear fluid dynamic model requires
the solution of the Navier-Stokes equations and has been achieved for a
stationary phoneme; however, it may not be computationally feasible to
obtain a solution for a time-varying speech model.

The foundation of our developments has been the acoustical analysis of
a sequence of acoustic tubes. We have shown that such a system is
all-pole and that it is capable of modeling the formant structure of
resonant phonemes like vowels. This early discussion, combined with the
foreknowledge that the powerful linear prediction method is available to
deduce all-pole models of any speech sound, has predisposed our
discussion to favor poles and to devote little attention to representing
spectral zeros in the model. This is especially important for nasals and
useful for addressing model limitations for fricatives and plosives.
Autoregressive-moving average (ARMA), or pole-zero [e.g., (Proakis and
Manolakis, 1992; Johnson, 1985)], methods have been considered for
speech. The choice of the numbers of poles and zeros is an important
issue in both analysis and synthesis settings. Related to the use of
zeros in speech modeling is the characterization of losses in the vocal
tract. Losses generally affect formant bandwidth more than location. The
use of additional zeros can sometimes provide the necessary modeling
effects in the frequency domain.

Although the issues discussed above are important for accurate speech
production modeling, in many applications the general digital model in
Fig. 3.26 is more than sufficient. In considering a research problem in
speech processing, however, the reader should employ this digital model
with the knowledge of its limitations. We have alluded to some of these
limitations at the end of the preceding section and will return to these
important issues in Chapter 5.

3.3 Conclusions

This chapter has covered some of the basic acoustic theory necessary to
build important models used pervasively in speech. Decades of research
have gone into analog acoustic modeling efforts, which we have reviewed
in very superficial terms. Fortunately for our immediate purposes, a
deep and detailed study of this rich and varied research field is
unnecessary. The reader who becomes seriously involved in the speech
processing field will benefit from a more detailed study of original
research papers and texts.

Remarkably, a brief overview of the acoustical theory has been
sufficient to develop what will emerge in our work as the fundamental
basis of much of contemporary speech processing: the discrete-time model
of Fig. 3.26. We will find this simple model at the heart of many of the
developments in the remaining chapters of the book. One of the most
important conclusions taken from this chapter should be an appreciation
of how the discrete-time model is a direct descendant of a quite naive
physical model, the lossless tube model with some simple glottal and lip
models at the terminals. In future discussions, it will be useful to
remember the model from whence the discrete-time model was derived.
Doing so will add much insight into the appropriateness and weaknesses
of the model in various applications.

3.4 Problems

3.1. (a) Verify that the reflection coefficient at any tube juncture in
an acoustic tube model is bounded in magnitude by unity,

    −1 ≤ μ_k ≤ 1.

(b) Suppose that we were to implement a digital filter realization of an
N-tube acoustic model. It is also known that high-order digital filter
realizations sometimes suffer from issues relating to finite precision
arithmetic. Does the result of part (a) suggest a means of testing the
N-tube digital filter realization for stability?

3.2. (a) Consider a two-tube lossless vocal-tract model (see Figure
3.17). Draw a signal flow diagram using reflection coefficients and
delay elements for the case in which A₁ = 1 cm², l₁ = 9 cm, A₂ = 7 cm²,
and l₂ = 8 cm. Include glottal and lip radiation effects. What phoneme
might this model represent?
(b) Repeat for dimensions A₁ = 0.9 cm², l₁ = 9.5 cm, A₂ = 0.25 cm²,
l₂ = 2 cm, A₃ = 0.5 cm², l₃ = 5 cm.

3.3. Consider the following tube model for the vocal tract:

[Figure: a two-tube vocal tract terminated at the lips.]

The lips have zero radiation impedance.
    u_glottis(t) = volume velocity at the source
    Z_glottis = real glottal source impedance
    A₁, A₂ = areas of tubes
This ideal vocal tract is modeled by the following signal flow graph:

[Signal flow graph: forward branch gains (1 + ρ_g), (1 + ρ₁), (1 + ρ₂)
with half-sample delays; reflection branches −ρ_g, ρ_g, −ρ₁, ρ₁, −ρ₂,
ρ₂; and reverse branch gains (1 − ρ_g), (1 − ρ₁), (1 − ρ₂).]

(a) Compute the ideal vocal-tract values ρ_glottis, ρ₁, ρ₂, and τ in
terms of tube model parameters.
(b) Let this model correspond to a digital system with a sample period
of 2τ. Find the z-transform of the transfer function
U_lips(z)/U_glottis(z).
(c) Draw a possible set of pole and zero locations of
U_lips(z)/U_glottis(z) if Z_glottis = 0.

As suggested in the chapter, more realistic glottal pulse shapes can be
obtained using the Rosenberg pulse

           ⎧ (1/2)[1 − cos(πn/P)],           0 ≤ n ≤ P
    g(n) = ⎨ cos[π(n − P) / 2(K − P)],       P < n ≤ K
           ⎩ 0,                              otherwise.

(c) Plot several impulse responses for various values of P and K. Make
sure that your pulse is "closed" at time n = 64 in every case.
(d) Based on 128-point DFTs, plot the magnitude and phase responses for
two typical examples in part (c).
(e) Compare the time-domain and frequency-domain properties of the two
glottal pulse models. What realistic time-domain feature of the
Rosenberg pulse is apparently unachievable with the two-pole model? How
is this feature manifest in the phase spectrum?

APPENDICES

Reading Note: These appendices provide supplemental material on
acoustical analysis of speech. In mathematical developments involving
steady-state analysis, we omit the argument Ω from phasors and
impedances unless required for clarity.
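Parts (c) and (d) of the glottal-pulse problem can be explored with a short script; the sketch below (the function name `rosenberg` is ours) generates the pulse and its 128-point DFT. Note that K < 64 guarantees the pulse is closed by n = 64.

```python
import numpy as np

def rosenberg(P, K, N=64):
    """Rosenberg glottal pulse: rises over [0, P], falls over (P, K], zero after."""
    n = np.arange(N)
    g = np.zeros(N)
    up = n <= P
    g[up] = 0.5 * (1.0 - np.cos(np.pi * n[up] / P))
    down = (n > P) & (n <= K)
    g[down] = np.cos(np.pi * (n[down] - P) / (2 * (K - P)))
    return g

g = rosenberg(P=40, K=56)      # peak at n = 40, complete closure at n = 56
G = np.fft.fft(g, 128)         # 128-point DFT for magnitude/phase plots
```

The two branches meet continuously at n = P (both equal 1) and the pulse reaches exactly zero at n = K, the features that the minimum-phase two-pole model of the chapter cannot reproduce.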
Since (3.83) and (3.84) are linear constant coefficient differential
equations, their solutions will have the form*

    v(x, t) = v⁺(x, t) + v⁻(x, t)
            = (A/ρc) [ Ψ⁺(Ω) e^{jΩt} e^{−jΩ(x/c)}
                       − Ψ⁻(Ω) e^{jΩt} e^{+jΩ(x/c)} ]                 (3.87)

and

    p(x, t) = p⁺(x, t) + p⁻(x, t)
            = Ψ⁺(Ω) e^{jΩt} e^{−jΩ(x/c)} + Ψ⁻(Ω) e^{jΩt} e^{+jΩ(x/c)}, (3.88)

where v⁺(x, t) and v⁻(x, t) represent traveling waves in the positive
and negative directions, respectively, and similarly for p⁺(x, t) and
p⁻(x, t). The numbers Ψ⁺(Ω) and Ψ⁻(Ω) are complex numbers that are
dependent upon Ω but not x (the effects of x are accounted for in the
additional phase factors). Note that since c represents the speed of
sound in air, +x/c represents the amount of time needed for a
positive-moving sound wave to travel x units along the tract (to the
right), while −x/c represents the time required for a negative traveling
wave to move x units to the left. Hence an additional delay (or phase)
factor appears with each term to compensate for the shift along the x
dimension. It is also worth noting that for a tube of length l, the
delay

    τ = l/c                                                           (3.89)

represents the elapsed time for a plane wave to traverse the entire
length of the tube.

Applying the boundary conditions to (3.87) and (3.88), or
equivalently, writing phasor equations at the boundaries, results in the
following relations, which can be solved for the complex amplitudes,

    (A/ρc) [ Ψ⁺(Ω) − Ψ⁻(Ω) ] = U_glottis,                             (3.90)

together with the pressure-release condition p(l, t) = 0 at the open
termination. The solutions are

    Ψ⁺_open(Ω) = U_glottis (ρc/A) [ e^{jΩ(l/c)} / 2 cos(Ωl/c) ]       (3.92)

    Ψ⁻_open(Ω) = −U_glottis (ρc/A) [ e^{−jΩ(l/c)} / 2 cos(Ωl/c) ].    (3.93)

From (3.87) and (3.88) the steady-state solutions for the volume
velocity and pressure at distance x from the origin in the uniform
lossless tube of length l are

    v(x, t) = U_glottis [ cos(Ω(l − x)/c) / cos(Ωl/c) ] e^{jΩt}       (3.94)

    p(x, t) = j (ρc/A) U_glottis [ sin(Ω(l − x)/c) / cos(Ωl/c) ] e^{jΩt}. (3.95)

Equation (3.20) follows immediately from (3.94) upon letting x = l.

Closed Tube. We derive solutions for the volume velocity and pressure
waveforms at distance x in the lossless uniform tube which is closed at
the termination. The boundary conditions are (3.85) and

    v(l, t) = u_lips(t) = 0.                                          (3.96)

Applying these conditions to (3.87) and (3.88) results in (3.90) and

    (A/ρc) [ Ψ⁺(Ω) e^{−jΩ(l/c)} − Ψ⁻(Ω) e^{+jΩ(l/c)} ] = 0.           (3.97)

Therefore, the complex amplitudes of the traveling waves in the closed
tube case are

    Ψ⁺_closed(Ω) = U_glottis (ρc/A) [ e^{jΩ(l/c)} / j2 sin(Ωl/c) ]    (3.98)

    Ψ⁻_closed(Ω) = U_glottis (ρc/A) [ e^{−jΩ(l/c)} / j2 sin(Ωl/c) ],  (3.99)

from which the steady-state solutions for volume velocity and sound
pressure for the closed tube case follow,

    v(x, t) = U_glottis [ sin(Ω(l − x)/c) / sin(Ωl/c) ] e^{jΩt}       (3.100)

    p(x, t) = −j (ρc/A) U_glottis [ cos(Ω(l − x)/c) / sin(Ωl/c) ] e^{jΩt}. (3.101)

*This is just a generalization of the material discussed in Section
3.1.4 to the case in which the additional independent argument x is
present. For details the reader may consult any standard textbook on
differential equations [e.g., (Boyce and DiPrima, 1969)] or a physics or
engineering textbook treating wave equations.
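The open-tube solution (3.94) predicts resonances wherever cos(Ωl/c) vanishes, that is, at odd multiples of c/4l. A quick numerical check (the 17-cm length and 340 m/s sound speed are assumed, illustrative values for an adult vocal tract):

```python
import math

c, l = 340.0, 0.17        # assumed sound speed (m/s) and tube length (m)

def lip_gain(F_hz):
    """|v(l, t)| / |U_glottis| at the open end, from (3.94): 1 / |cos(Omega l / c)|."""
    return 1.0 / abs(math.cos(2 * math.pi * F_hz * l / c))

# quarter-wavelength resonances F_k = (2k - 1) c / (4 l)
predicted = [(2 * k - 1) * c / (4 * l) for k in (1, 2, 3, 4)]
gains = [lip_gain(Fk) for Fk in predicted]
```

For these values the predicted resonances are 500, 1500, 2500, and 3500 Hz, the familiar near-uniform-tract (schwa-like) formant pattern, and the lossless model's gain blows up at each of them.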
    Z_in|open = p(0, t) / v(0, t) = jZ₀ tan(Ωl/c) = Z₀ tanh(jΩτ),     (3.104)

where τ is defined as the time required for a plane wave to travel the
entire length of the tube [see (3.89)]. In a similar manner we find from
(3.100) and (3.101) in Appendix 3.A.1 that

    Z_in|closed = p(0, t) / v(0, t) = −jZ₀ cot(Ωl/c) = Z₀ coth(jΩτ).  (3.105)

Determination of Resonances Using Z(x, Ω). One method of determining
the resonant modes of a transmission line is to determine the
frequencies for which the impedance approaches infinity (or admittance
approaches zero). We show first that an equivalent procedure can be
carried out with acoustic impedance. In a manner similar to the
derivation of (3.104) and (3.105) we find that for a general distance x
along the tube, the impedances for the open and closed termination cases
are as follows:

    Z_in^open(x, Ω) = Z₀ tanh[ jΩ(τ − x/c) ]                          (3.106)

and

    Z_in^closed(x, Ω) = Z₀ coth[ jΩ(τ − x/c) ].                       (3.107)

T-Network and Two-Port Network Models of the Lossless Tube. The
transfer function H(Ω) in (3.21) was obtained by taking the ratio of
volume velocity at the lips to that at the glottis. This is equivalent
to a ratio of output to input current for a two-port electrical network.
Under steady-state conditions, the uniform tube section can be replaced,
in terms of external observable quantities, by a T-network of impedances
shown in Fig. 3.28. The impedances of this network are described by the
following equations and are independent of the tube terminations (i.e.,
the boundary conditions) (Potter and Fich, 1963; Mason, 1948; Johnson,
1924):

    Z₁ = Z₀ tanh( Γl/2 )                                              (3.110)

    Z₂ = Z₀ csch( Γl )                                                (3.111)

    Z₀ = [ (R + jΩL) / (G + jΩC) ]^{1/2}                              (3.112)

    Γ = [ (R + jΩL)(G + jΩC) ]^{1/2}.                                 (3.113)

In these equations, l represents the length of the transmission line
section (or acoustic tube), and R, G, L, and C the distributed
resistance, conductance, inductance, and capacitance per unit length of
line; Z₀ is the characteristic impedance of the line, and Γ is the
propagation constant.
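The series and shunt arms (3.110) and (3.111) are consistent with the closed-tube input impedance (3.105): with the far end of the T-section open-circuited (acoustically a closed tube, so no current flows in the far series arm), the input impedance is Z₁ + Z₂, and the identity tanh(x/2) + csch(x) = coth(x) reduces it to Z₀ coth(Γl). A numeric spot check, with arbitrary illustrative values for the density, area, and length:

```python
import cmath

rho, c = 1.2, 340.0        # assumed air density (kg/m^3) and sound speed (m/s)
A, l = 5e-4, 0.1           # assumed tube area (m^2) and length (m)
Z0, tau = rho * c / A, l / c

s = 2j * cmath.pi * 700.0            # evaluate on the imaginary axis, s = j*Omega
Z1 = Z0 * cmath.tanh(s * tau / 2)    # series arm, (3.110), with Gamma*l = s*tau
Z2 = Z0 / cmath.sinh(s * tau)        # shunt arm, (3.111)
Zin = Z1 + Z2                        # far end open-circuited: Z1 + Z2 only
coth = Z0 * cmath.cosh(s * tau) / cmath.sinh(s * tau)
```

Here Zin equals Z₀ coth(sτ) = −jZ₀ cot(Ωτ) to machine precision, i.e., the purely reactive closed-tube impedance of (3.105).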
In this section we neglect the dissipative terms R and G, representing
viscous resistance and energy absorption by the tube walls. Substituting
acoustic equivalents for L and C,

    Z₁ = (ρc/A) tanh( jΩl/2c ) = j (ρc/A) tan( Ωl/2c )                (3.116)

    Z₂ = (ρc/A) csch( jΩl/c ) = −j (ρc/A) csc( Ωl/c ).                (3.117)

The impedances are therefore reduced to purely reactive components,
representing a lossless uniform acoustic tube or section of transmission
line.

[FIGURE 3.28. Transmission line T-network (top) and the T-equivalent
acoustic circuit (bottom), with series arms Z₁ and shunt arm Z₂.]

The T-equivalent acoustic circuit employing acoustic quantities is
shown in Fig. 3.28, where volume velocity and sound pressure have been
substituted for current and voltage.

It is also possible to use the T-network analysis to derive a two-port
network model of the lossless tube. To achieve this, it is necessary to
reconsider the input-output relations of the acoustic tube T-network of
Fig. 3.28. In an assumed steady-state condition, let us denote the
phasors associated with the "boundary" volume velocity waveforms as V₁
and V₂, meaning that

    v(0, t) = V₁ e^{jΩt}                                              (3.118)

and

    v(l, t) = V₂ e^{jΩt}.                                             (3.119)

Similar definitions are given to P₁ and P₂ for the pressure waveforms.
In what follows we replace jΩ by the general complex variable s and
write Γl = sτ for the lossless tube. The impedances Z₁₁, Z₂₁, Z₁₂, and
Z₂₂ are found using the two sound pressure equations from the T-network,

    P₁ = V₁ Z₁ + (V₁ − V₂) Z₂                                         (3.121)

    P₂ = (V₁ − V₂) Z₂ − V₂ Z₁.                                        (3.122)

Rearranging (3.122),

    P₂ = V₁ Z₂ − V₂ [ Z₁ + Z₂ ]                                       (3.125)
       = V₁ Z₀ csch(sτ) − V₂ [ Z₀ tanh(sτ/2) + Z₀ csch(sτ) ].         (3.126)

The two-port network impedances Z₁₂ and Z₂₁ are seen to be

    Z₁₂ = −Z₀ csch(sτ)                                                (3.127)

    Z₂₁ = Z₀ csch(sτ).                                                (3.128)
Similarly, using the identity tanh(sτ/2) + csch(sτ) = coth(sτ),

    Z₁₁ = Z₀ coth(sτ)                                                 (3.129)

    Z₂₂ = −Z₀ coth(sτ).                                               (3.130)

The resulting two-port impedance matrix can be written as

    ⎡P₁⎤   ⎡ Z₀ coth(sτ)    −Z₀ csch(sτ) ⎤ ⎡V₁⎤
    ⎣P₂⎦ = ⎣ Z₀ csch(sτ)    −Z₀ coth(sτ) ⎦ ⎣V₂⎦,                      (3.135)

and solving for the output pair (P₂, V₂) in terms of the input pair
(P₁, V₁) gives the chain form

    ⎡P₂⎤   ⎡  cosh(sτ)           −Z₀ sinh(sτ) ⎤ ⎡P₁⎤
    ⎣V₂⎦ = ⎣ −(1/Z₀) sinh(sτ)     cosh(sτ)    ⎦ ⎣V₁⎦.                 (3.136)

This system of equations provides a means for obtaining characteristics
of output pressure and velocity given input quantities and the tube
properties of characteristic impedance Z₀ and delay τ.

For N admittances connected in parallel, resonances occur at the
frequencies for which

    Σ_{i=1}^{N} Y_i = Σ_{i=1}^{N} 1/Z_i = 0.                          (3.137)

The poles of the parallel circuit can therefore be used to determine the
frequencies at which the admittance approaches zero, or impedance
becomes infinite. Our analysis continues by determining the input
admittance for the pharyngeal and oral cavity tubes (Fig. 3.30). In
order to avoid coupling effects between the pharyngeal cavity and
subglottal structures (lungs, trachea, larynx), we consider the analysis
during the period when the vocal folds are completely closed. The input
admittance looking back into the pharyngeal tube is found using the
impedance of a closed tube of length l_ph [from (3.60)], giving
Y_ph = j (A_ph/ρc) tan(Ω l_ph/c), while the admittance looking into the
open oral tube is

    Y_oral = −j (A_oral/ρc) cot( Ω l_oral/c ).                        (3.139)

From (3.137), the poles occur at frequencies Ω where Y_ph + Y_oral = 0,
or

    A_ph tan( Ω l_ph/c ) = A_oral cot( Ω l_oral/c ).                  (3.140)

A graphical solution for the resonant frequencies can be obtained by
plotting A_ph tan(Ω l_ph/c) and A_oral cot(Ω l_oral/c) and noting the
points where the two functions are equal. Figure 3.31 illustrates the
graphical solution for a two-tube model for the vowel /i/ with the tube
dimensions l_ph = 9 cm, A_ph = 8 cm², l_oral = 6 cm, and A_oral = 1 cm².
This particular model produces resonances at F₁ = 250, F₂ = 1875, and
F₃ = 2825 Hz. Typical real vowel formants occur at 270, 2290, and
3010 Hz.
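The graphical solution of (3.140) can be reproduced numerically. The sketch below scans for zeros of A_ph tan(Ωl_ph/c) − A_oral cot(Ωl_oral/c) for the /i/ dimensions above; with an assumed c = 340 m/s the first two resonances land near 250 and 1850 Hz (the figure's 250 and 1875 Hz correspond to a slightly different sound speed), and the third sits near the common asymptote around 2830 Hz.

```python
import numpy as np

c = 340.0                      # assumed speed of sound (m/s)
l_ph, A_ph = 0.09, 8.0         # pharyngeal tube: 9 cm, 8 cm^2
l_or, A_or = 0.06, 1.0         # oral tube: 6 cm, 1 cm^2 (the /i/ model above)

def mismatch(F):
    """A_ph tan(w l_ph/c) - A_or cot(w l_or/c); zero at a resonance, per (3.140)."""
    w = 2 * np.pi * F
    return A_ph * np.tan(w * l_ph / c) - A_or / np.tan(w * l_or / c)

F = np.linspace(100.0, 2500.0, 100000)
y = mismatch(F)
# keep sign changes where |y| is small on both sides, rejecting tan/cot asymptotes
crossing = (np.sign(y[:-1]) != np.sign(y[1:])) \
           & (np.abs(y[:-1]) < 1) & (np.abs(y[1:]) < 1)
roots = 0.5 * (F[:-1] + F[1:])[crossing]
```

The small-magnitude filter is the numerical analog of reading the graph: genuine intersections of the tan and cot branches pass through zero, while the jumps at the asymptotes do not.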
[FIGURE 3.30. Impedance analysis of an acoustic two-tube model for the
vocal tract: a transmission line model (open circuit at the glottis,
short circuit at the lips) and the corresponding acoustic tube model,
with admittances Y_ph and Y_oral at the juncture.]

[FIGURE 3.31. A two-tube approximation for the vowel /i/ (l_ph = 9 cm,
A_ph = 8 cm², l_oral = 6 cm, A_oral = 1 cm²) and a corresponding
graphical solution for its formant frequencies: F₁ = 250 Hz,
F₂ = 1875 Hz, F₃ = 2825 Hz.]
Transmission Line Analysis. Although characterization of formant
locations is important in multi-acoustic tube analysis, it does not
fully represent the frequency response of the model. If we use the
two-tube model in Fig. 3.29, an equivalent electrical circuit consisting
of T-network sections from transmission line theory can be constructed.
The result is shown in Fig. 3.32. The impedance at the lips is assumed
to be zero (open tube, short-circuited line), and that at the glottis to
be infinite (closed tube, open-circuited line). Defining V₁₂ to be the
phasor for the steady-state volume velocity at the juncture between the
pharyngeal and oral tubes, the two loop equations are

    U_lips Z₂A = Z₂B [ V₁₂ − U_lips ]                                 (3.141)

    [ U_glottis − V₁₂ ] Z₁B = V₁₂ [ Z₁A + Z₂A ] + U_lips Z₂A.         (3.142)

Equation (3.141) can then be solved for V₁₂,

    V₁₂ = U_lips [ 1 + Z₂A/Z₂B ],                                     (3.143)

and substituted into (3.142). After simplifying, the following transfer
function of output to input volume velocity results,

    U_lips / U_glottis
        = Z₁B / [ (1 + Z₂A/Z₂B)(Z₁A + Z₂A + Z₁B) + Z₂A ].             (3.144)

If the T-network impedances for each transmission line section,

    Z₁A = (ρc/A₁) tanh( Γ₁l₁/2 ),    Z₂A = (ρc/A₂) tanh( Γ₂l₂/2 ),
    Z₁B = (ρc/A₁) csch( Γ₁l₁ ),      Z₂B = (ρc/A₂) csch( Γ₂l₂ ),      (3.145)

are substituted into (3.144) and the resulting expression simplified,
the final two-tube transfer function is obtained:

    H(Ω) = U_lips / U_glottis
         = A₂ / [ A₁ sinh(Γ₁l₁) sinh(Γ₂l₂)
                  + A₂ cosh(Γ₁l₁) cosh(Γ₂l₂) ].                       (3.146)

The terms A₁ and l₁ correspond to area and length of the pharyngeal
tube, while A₂ and l₂ correspond to dimensions of the oral cavity tube.
Under lossless conditions, Γ₁ = Γ₂ = jΩ/c. From this transfer function,
the poles occur at values of Ω = 2πF for which

    A₁ sinh(jΩl₁/c) sinh(jΩl₂/c) + A₂ cosh(jΩl₁/c) cosh(jΩl₂/c) = 0,  (3.147)

or, simplifying,

    A₁ tan( Ωl₁/c ) = A₂ cot( Ωl₂/c ).                                (3.148)

These resonances occur at the same frequencies as those found using the
parallel input impedance calculation from the previous discussion on
formant location analysis [see (3.140)]. By employing a two-tube model,
a variety of settings for A_ph, l_ph, A_oral, and l_oral can be used to
approximate many articulatory configurations. In Fig. 3.31, an example
analysis for one vowel was illustrated. Several additional two-tube
models and their corresponding pole structures are shown in Fig. 3.33.
For front vowels, Fig. 3.33 confirms that the first and second formants
are widely separated, while back vowels possess first and second
formants that are closely spaced. Generally speaking, as the ratio of
back to front tube areas (A_ph/A_oral) increases, the first formant
location decreases.

[FIGURE 3.32. A two-tube transmission line circuit model, with
Z_glottis = ∞ and Z_lips = 0.]

Two-Port Network Analysis. Analysis of the two-tube vocal-tract model
was achieved by simplifying the transmission line T-network equations.
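The cascading of two-port sections used in this analysis is easy to check numerically against the closed-form two-tube result (3.146). In the sketch below (dimensions from the /i/ model; ρ and c are assumed values), each lossless section is represented by its chain matrix from (3.136); since each chain matrix has unit determinant, imposing P_lips = 0 reduces U_lips/U_glottis to 1/M₁₁ of the product matrix.

```python
import numpy as np

rho, c = 1.2, 340.0                  # assumed air density (kg/m^3), sound speed (m/s)
l1, A1 = 0.09, 8e-4                  # pharyngeal tube (m, m^2)
l2, A2 = 0.06, 1e-4                  # oral cavity tube

def chain(l, A, Omega):
    """Lossless-tube chain matrix of (3.136), relating (P, U) at output to input."""
    g = 1j * Omega * l / c           # s*tau evaluated at s = j*Omega
    Z0 = rho * c / A
    return np.array([[np.cosh(g), -Z0 * np.sinh(g)],
                     [-np.sinh(g) / Z0, np.cosh(g)]])

Omega = 2 * np.pi * 700.0
M = chain(l2, A2, Omega) @ chain(l1, A1, Omega)  # oral section applied after pharyngeal

# With P_lips = 0 and unit chain-matrix determinants, U_lips/U_glottis = 1/M[0,0];
# this should match the closed-form two-tube transfer function (3.146).
H_chain = 1.0 / M[0, 0]
g1, g2 = 1j * Omega * l1 / c, 1j * Omega * l2 / c
H_closed = A2 / (A1 * np.sinh(g1) * np.sinh(g2) + A2 * np.cosh(g1) * np.cosh(g2))
```

The agreement holds at every frequency, which is the point of the two-port formulation: multitube tracts reduce to products of 2x2 matrices rather than ever-larger loop analyses.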
cient impedance mismatch in the transmission line case), then it can be assumed that no coupling exists between front and back tubes.¹⁵ With the assumption of limited or no coupling, the two-port equations (3.136) immediately yield the overall transfer function,

   [P_lips(Ω); U_lips(Ω)] = [cosh(jΩ l_oral/c)   −(ρc/A_oral) sinh(jΩ l_oral/c); −(A_oral/ρc) sinh(jΩ l_oral/c)   cosh(jΩ l_oral/c)]   (3.149)

      × [cosh(jΩ l_ph/c)   −(ρc/A_ph) sinh(jΩ l_ph/c); −(A_ph/ρc) sinh(jΩ l_ph/c)   cosh(jΩ l_ph/c)] [P_glottis(Ω); U_glottis(Ω)].   (3.150)

Again, we consider the period when the glottis is closed and the oral cavity is wide open (i.e., no sound pressure deviation from ambient, P_lips = 0). With these boundary conditions, the following volume velocity transfer function is obtained,

   H(Ω) = U_lips(Ω)/U_glottis(Ω) = F(Ω, l_ph, l_oral, A_ph, A_oral) / [A_oral cot(Ω l_ph/c) − A_ph tan(Ω l_oral/c)].   (3.151)

If the cross-sectional areas A_ph and A_oral are diverse enough (i.e., A_ph ≫ A_oral or A_oral ≫ A_ph), then the characteristic impedances of the two two-port sections are different enough to satisfy the condition for noncoupling. In the majority of cases, tube areas are not so diverse that all coupling is eliminated. It becomes necessary, however, to continue to rely on the two-port equations (3.136) in order to handle multitube configurations. The result in (3.151) suggests a transfer function with the same pole locations as those found in the parallel impedance formant analysis method and the T-network transmission line circuit analysis method.
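The magnitude of (3.151) can also be explored numerically by multiplying per-tube lossless chain matrices (the cosh/sinh entries of (3.149)-(3.150) reduce to cos/sin forms on the frequency axis, since cosh(jx) = cos x and sinh(jx) = j sin x) and applying the open-lip boundary condition; peaks of |H| mark the pole frequencies. A minimal sketch, with illustrative dimensions and physical constants that are assumptions rather than values from the text:

```python
import numpy as np

# Illustrative dimensions (cm^2, cm); c and rho are assumed constants.
c, rho = 35000.0, 0.00114
A_ph, l_ph = 8.0, 9.0
A_or, l_or = 1.0, 6.0

def tube_matrix(A, l, F):
    """Lossless chain matrix of one tube section relating (P, U) at its input
    to (P, U) at its output, with characteristic impedance Z0 = rho*c/A."""
    th = 2.0 * np.pi * F * l / c
    Z0 = rho * c / A
    return np.array([[np.cos(th), 1j * Z0 * np.sin(th)],
                     [1j * np.sin(th) / Z0, np.cos(th)]])

def H_mag(F):
    """|U_lips/U_glottis| with the open-lip boundary condition P_lips = 0."""
    K = tube_matrix(A_ph, l_ph, F) @ tube_matrix(A_or, l_or, F)
    # [P_glottis; U_glottis] = K [0; U_lips]  =>  U_glottis = K[1,1] * U_lips
    return 1.0 / abs(K[1, 1])

F = np.arange(50.0, 3001.0, 1.0)
mags = np.array([H_mag(f) for f in F])
is_peak = (mags[1:-1] > mags[:-2]) & (mags[1:-1] > mags[2:])
peaks = F[1:-1][is_peak]
print(peaks)
```

The located peaks coincide with the roots of the denominator condition A_oral cot(Ω l_ph/c) = A_ph tan(Ω l_oral/c), as (3.151) requires.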
FIGURE 3.33. A collection of two-tube approximations for several phonemes. The solid lines represent cotangent curves. The variable β is the ratio Ω/c = 2πF/c. (a) Vowel /i/, "beet"; (b) vowel /æ/, "had"; (c) vowel /a/, "hot"; (d) schwa vowel /ə/.

¹⁵This assumption is similar to the no-coupling assumption of a high-output-impedance transistor amplifier driving a low-impedance load.

3.C Fast Discrete-Time Transfer Function Calculation

Calculation of vocal-tract transfer functions using signal flow analysis becomes increasingly complex as the number of tubes increases. For example, the four-tube vocal-tract model in Fig. 3.25(c) possesses a transfer function of similar form to (3.66); however, calculation of the coefficients becomes unwieldy as the model order increases to 10 or more. In Appendix 3.A.2, which treats concatenated lossless tubes, we found that it is possible to characterize the input-output relationships of sound pressure and volume velocity using the two-port impedance matrix given in (3.131), where the impedances Z_11, Z_12, Z_21, and Z_22 were found using T-network impedances from a transmission line circuit [Fig. 3.23(c)]. By following this approach, a similar matrix characterization could be obtained for the signal flow model in Fig. 3.25(c).

For this discussion, consider the single-tube discrete-time flow diagram in Fig. 3.34. Following the approach used for a single transmission line circuit, we obtain a matrix Δ_k, which produces forward and backward traveling waves in the (k + 1)st tube given quantities from the kth tube,

   [V⁺_{k+1}(z); V⁻_{k+1}(z)] = Δ_k [V⁺_k(z); V⁻_k(z)] = [δ_11  δ_12; δ_21  δ_22] [V⁺_k(z); V⁻_k(z)].   (3.152)

Writing the equations implied by the flow diagram, we obtain

   V⁺_{k+1}(z) = V⁺_k(z)(1 + ρ_k) z^{−1/2} + V⁻_{k+1}(z) ρ_k   (3.153)

and

   V⁻_k(z) = −V⁺_k(z) ρ_k z^{−1} + V⁻_{k+1}(z)(1 − ρ_k) z^{−1/2}.   (3.154)

Recall that by omitting the distance argument "x" we implicitly refer to the waveform at the left boundary of the section, x = 0 [see (3.32) and (3.33)]. From (3.153) and (3.154), we obtain two matrix forms of the solution, first by solving for V⁺_k(z) and V⁻_k(z) in terms of V⁺_{k+1}(z) and V⁻_{k+1}(z), and second by solving for V⁺_{k+1}(z) and V⁻_{k+1}(z) in terms of V⁺_k(z) and V⁻_k(z). These are

   [V⁺_k(z); V⁻_k(z)] = [ z^{1/2}/(1 + ρ_k)   −ρ_k z^{1/2}/(1 + ρ_k); −ρ_k z^{−1/2}/(1 + ρ_k)   z^{−1/2}/(1 + ρ_k) ] [V⁺_{k+1}(z); V⁻_{k+1}(z)]   (3.155)

and

   [V⁺_{k+1}(z); V⁻_{k+1}(z)] = [ z^{−1/2}/(1 − ρ_k)   ρ_k z^{1/2}/(1 − ρ_k); ρ_k z^{−1/2}/(1 − ρ_k)   z^{1/2}/(1 − ρ_k) ] [V⁺_k(z); V⁻_k(z)].   (3.156)

We write these two equations in compact form as

   V_k(z) = A_k V_{k+1}(z)   (3.157)

and

   V_{k+1}(z) = A_k^{−1} V_k(z),   (3.158)

respectively. Evidently,

   V_{N+1}(z) = A_N^{−1} A_{N−1}^{−1} ··· A_1^{−1} V_1(z) = [∏_{k=N}^{1} A_k^{−1}] V_1(z)   (3.159)

and

   V_1(z) = A_1 A_2 ··· A_N V_{N+1}(z) = [∏_{k=1}^{N} A_k] V_{N+1}(z).   (3.160)
cause the reflection/transmission characteristics of the fourth tube have been lumped with the fifth (half-infinite) tube used to model lip radiation effects. If ρ_lips is used in place of ρ_4, (3.160) can be computed. The equation that relates the boundary condition for the lips [(3.162)] is written by assuming that sound propagating forward into the half-infinite acoustic tube for the lips results in no reflections [U⁻_lips(z) = 0]. Using (3.60), we find that the boundary at the glottis is characterized by the relation (3.163).

The use of equations (3.162) and (3.163) in (3.160) yields the ratio U_glottis(z)/U_lips(z), which can then be reciprocated to obtain H(z). Direct application of equation (3.159) in conjunction with (3.162) and (3.163) is not as straightforward, since (3.163) produces glottal volume velocity U_glottis(z) from input vocal-tract terms V⁺_1(z) and V⁻_1(z). If the tube matrix product is obtained as follows,

   [δ̂_11  δ̂_12; δ̂_21  δ̂_22] = ∏_{k=1}^{N} Â_k,   (3.164)

then, solving for V⁺_1(z) and V⁻_1(z) in the following system of equations,

   U_glottis(z) = [2/(1 + ρ_glottis)   −2ρ_glottis/(1 + ρ_glottis)] [V⁺_1(z); V⁻_1(z)]   (3.165)

   [U_lips(z); 0] = z^{−N/2} [δ̂_11  δ̂_12; δ̂_21  δ̂_22] [V⁺_1(z); V⁻_1(z)].   (3.166)

In general, it can be seen that for a multitube lossless model, the vocal system transfer function can be expressed as

   H(z) = (1 + ρ_glottis) z^{−N/2} ∏_{k=1}^{N} (1 + ρ_k) / [2 z^{−N} D(z)],   (3.168)

where Â_k is the following modified signal flow matrix,

   Â_k = [z   −ρ_k z; −ρ_k   1],   (3.169)

and where

   D(z) = z^N [1 − Σ_{k=1}^{N} b_k z^{−k}].   (3.170)

Substituting back into (3.168) shows the transfer function for the N-tube lossless vocal-tract system,

   H(z) = (1 + ρ_glottis) z^{−N/2} ∏_{k=1}^{N} (1 + ρ_k) / {2 [1 − Σ_{k=1}^{N} b_k z^{−k}]}.   (3.171)

This important result was discussed in Section 3.2.1 [see material below (3.67)].
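Carrying out the matrix product in (3.164) symbolically is tedious for large N. In practice, denominator coefficients of the form appearing in (3.170)-(3.171) can be generated from the reflection coefficients by the standard "step-up" recursion of lossless-tube (and linear prediction) theory; its equivalence to accumulating the matrix product is asserted here from that literature rather than derived. A sketch:

```python
import numpy as np

def tube_denominator(rho):
    """Build D_N(z) = 1 - sum_k b_k z^{-k} from reflection coefficients
    rho[0..N-1] via the standard step-up recursion
        D_0(z) = 1,   D_k(z) = D_{k-1}(z) + rho_k * z^{-k} * D_{k-1}(z^{-1}).
    Returns coefficients [1, -b_1, ..., -b_N] in increasing powers of z^{-1}."""
    d = np.array([1.0])
    for r in rho:
        d_new = np.concatenate([d, [0.0]])
        d_new[1:] += r * d[::-1]      # the z^{-k} D_{k-1}(z^{-1}) term
        d = d_new
    return d

# With all rho_k = 0 the tract is a uniform tube and D(z) = 1 (no shaping
# from the b_k); with |rho_k| < 1 the resulting polynomial is minimum phase.
print(tube_denominator([0.0, 0.0, 0.0]))
print(tube_denominator([0.3, -0.2, 0.5]))
```

The minimum-phase property is what guarantees a stable all-pole vocal-tract model when every |ρ_k| < 1.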
CHAPTER 4

Short-Term Processing of Speech

Reading Notes: No "advanced" topics from Chapter 1 will be required in this chapter. Basic DSP concepts from Section 1.1 will be used without comment. In particular, the DTFT and DFT will play a significant role in Section 4.3.5. Basic concepts from (scalar-valued) random process theory, which was reviewed in Section 1.2.3, will be used throughout. In particular, Section 4.3.1 will require a solid understanding of the correlation properties of stochastic processes.
4.1 Introduction

Engineering courses frequently ignore the fact that all analysis must be done in finite time. The continuous-time Fourier transform, for example, is a remarkably useful tool for signal analysis. In its fundamental definition, however, the Fourier transform requires our knowledge of the signal for all time, and, further, whatever property or feature we are seeking by use of the Fourier transform (a spectrum, a resonance, bandpass energy, etc.) must remain invariant for all time in the signal. Most of us do not have doubly infinite time in which to process a signal, and most signals do not cooperate by remaining "stationary" with respect to the desired measurement forever. This latter point is particularly true of speech signals, which we can expect to "change" every few milliseconds. In the best case, we as engineers often take what we have learned in the "long term" and apply it to the "short term," counting on our good experience and intuition to compensate for the deviation of reality from theory. In the worst case, we simply ignore the discrepancy and hope for the best. The danger in this latter approach is that when the "best" occurs we do not know why, and we may come to rely on a design or approach that has serious weaknesses that might emerge at a most inopportune time.

Digital signal processing engineers are perhaps more aware of the short-term nature of signal analysis than some other specialists, if for no other reason than that our computing machines can only hold and process a finite amount of data at a time. We have also come face to face with the effects of short-term signal analysis in studying topics like the DFT "leakage" phenomenon and FIR filter design. These topics, however, deal with "static" analysis in which a single frame of a signal is operated upon, and the results analyzed with respect to "asymptotic" results. Speech is a dynamic, information-bearing process, however. We
as speech processors cannot be content to analyze short-term effects in a single frame. In this chapter we want to build on our DSP background and formally examine short-term processing from the "dynamic" point of view. Our objective here is to learn about analysis of frames of speech as those frames move through time and attempt to capture transient features of the signal. Another goal is to introduce a number of short-term features that are useful in speech processing, some of which will be vital to our work in future chapters.

4.2 Short-Term Measures from Long-Term Concepts

4.2.1 Motivation

Suppose that we wish to extract some information about a short term or frame of speech spanning the time range n = m − N + 1, …, m. We have an intuitive idea that there is a long-term concept that could provide the needed information if we could "generalize" it to the short term. For example, suppose that it is desired to know whether a speech sequence is voiced or unvoiced on this short term ending at time n = m. We know that voiced speech is generally of higher "power" (average squared value per sample) than unvoiced. The idea, then, would be to employ the concept of average power to assist in the decision. Average power, however, is a long-term concept,¹

   P_s = A{s²(n)},   (4.1)

where A denotes the long-term temporal average² introduced in (1.126).

   ¹If the sequence s(n) were complex, s²(n) should be replaced by |s(n)|² in the following and in similar expressions [see (1.9) and (1.11)]. We will ignore this generality in our work since the sequences we deal with are almost always real.

   ²If s(n) happens to be a realization of a stochastic process s in this discussion, we assume the appropriate stationarity and ergodicity properties to permit the use of temporal averaging.

Our intuition is telling us here that if s_1(n) is an eternally voiced signal, and s_2(n) eternally unvoiced, then the long-term concept of average power could be useful, since it would be true that

   P_{s1} > P_{s2},   (4.2)

and if we could find some "similar" short-term quantities using only the points around m, say P_s(m), it would be true that

   P_{s1}(m) > P_{s2}(m).   (4.3)

[Note that once we have narrowed our selection of points to some small region around m, it is inconsequential to the truth of (4.3) whether s_1 and s_2 remain eternally stationary, as long as they have the desired properties on the selected range of points.] P_s(m) (however it is computed) can be applied to the problem above by determining some threshold on its value, below which the signal is classified as unvoiced on the frame ending at m, above which it is deemed voiced.

4.2.2 "Frames" of Speech

Clearly, we are moving toward the practical problem of working with small ranges of the speech sequence. Before continuing down this path, it is important to review the concept of a "frame," which was first introduced in Section 1.1.5. Formally, we define a frame of speech to be the product of a shifted window with the speech sequence,

   f_s(n; m) ≝ s(n) w(m − n).   (4.4)

For convenience, we will omit the subscript s when the frame is taken from a speech sequence s(n). This will almost always be the case in our work, and we will include subscripts for clarity when it is not. Although, practically, a frame is just a "chunk" of speech which perhaps has been tapered by the window, formally it is a new sequence on n in its own right, which happens to be zero outside the short term n ∈ [m − N + 1, m]. Accordingly, we will often find that short-term processing of the speech is tantamount to long-term processing of a frame. The frame created by this process also depends on the end time, m, so that it has a second argument (and also an implicit argument, N). This formality will be very useful to us in the upcoming discussion.

4.2.3 Approach 1 to the Derivation of a Short-Term Feature and Its Two Computational Forms

In general, suppose that X_s is the long-term feature we have in mind to help us solve a problem. In general, there might be a family of features, each one dependent upon an index λ, so let us write the general long-term feature of the sequence s(n) as X_s(λ). [An example of a feature involving a parameter is the long-term autocorrelation r_s(η), which is indexed by the integer η, indicating the lag. A feature may also be parameterized by a continuous parameter or by a vector of parameters, as we shall see below.] Suppose further that X_s(λ) is computed from s(n) (assuming that s retains the desired property forever) as

   X_s(λ) = S(λ){s(n)} = A{T(λ){s(n)}},   (4.5)

where S(λ) is some operation, generally nonlinear and dependent on λ. For most commonly used long-term features it is found that S(λ) can be decomposed as S(λ) = A ∘ T(λ), where T(λ) is an operation that produces a new sequence on n, and A is the temporal average operator. [See the P_s computation above, for example, where T is easily seen to be the squaring operation T{s(n)} = s²(n).] In light of (4.5), a highly intuitive way to
construct a family of short-term feature(s), X_s(λ; m), "similar to" the long-term feature(s) X_s(λ), is the following.

1. Select the desired N-length frame of s(n) using a window, w(n),

   f(n; m) = s(n) w(m − n).   (4.6)

2. Apply an "S(λ)-like" operation, say S̃(λ), to the frame:

   X_s(λ; m) = S̃(λ){s(n) w(m − n)} = (1/N) Σ_{n=−∞}^{∞} T(λ){f(n; m)},   (4.7)

where it is assumed that S̃(λ) can be decomposed as S̃(λ) = (1/N) Σ_{n=−∞}^{∞} ∘ T̃(λ), just as S(λ) = A ∘ T(λ) in the long term.

It should be noted that T̃(λ) is often the same operation as T(λ). In fact, let us restrict our discussion to such cases. For the short-term power, for example,

   P_s(m) = (1/N) Σ_{n=−∞}^{∞} [s(n) w(m − n)]².   (4.9)

We note that no parameter, λ, is involved here, and T{·} = (·)². Another example that does involve a parameter is the autocorrelation. In the long term we have

   r_s(η) = S(η){s(n)} = A{T(η){s(n)}} = A{s(n) s(n − η)}.   (4.10)

In this case we see that T depends on the parameter η corresponding to the autocorrelation lag, and T(η){s(n)} = s(n) s(n − η). Applying Construction Principle 1,

   r_s(η; m) = (1/N) Σ_{n=−∞}^{∞} T(η){s(n) w(m − n)}
             = (1/N) Σ_{n=−∞}^{∞} s(n) w(m − n) s(n − η) w(m − n + η)   (4.11)
             = (1/N) Σ_{n=m−N+1}^{m} s(n) w(m − n) s(n − η) w(m − n + η).

The operation T(λ) has the property that it produces a new sequence on n from the sequence upon which it operates, say,

   v_T(n; λ) = T(λ){v(n)},   (4.12)

where v_T will also depend on a parameter. [If v(n) is a frame, then both v(n) and v_T will also depend on m.] In addition, T(λ) very often (but not always) has the property that for any two sequences, x(n), v(n),

   T(λ){x(n) v(n)} = T(λ){x(n)} T(λ){v(n)} ≝ x_T(n; λ) v_T(n; λ).   (4.13)

In such cases,

   X_s(λ; m) = (1/N) Σ_{k=−∞}^{∞} T(λ){s(k)} T(λ){w(m − k)}
             = (1/N) Σ_{k=−∞}^{∞} s_T(k; λ) w_T(m − k; λ).   (4.16)

We see that (for any parameter λ) the feature X_s(λ; m) can be computed as the convolution of the sequences (1/N) s_T(n; λ) and w_T(n; λ) evaluated
at time n = m. To compute the feature at any general time n, we can write

   X_s(λ; n) = (1/N) s_T(n; λ) * w_T(n; λ),   (4.17)

where * denotes convolution. This form, depicted in Fig. 4.1, allows the interpretation of the X_s(λ; n) feature as the output of a filter with impulse response w_T(n; λ) when the input is (1/N) s_T(n; λ).

FIGURE 4.1. Computation of the short-term feature X_s(λ; m), which is constructed according to Construction Principle 1 and viewed as a convolution.

Thus far we have assumed w(n) and, hence, presumably, w_T(n; λ), to be a finite duration [or, if viewed as a filter, finite impulse response (FIR)] window. This "output filter" form for computing X_s(λ; n) begs the question as to whether it is actually necessary to employ an FIR filter at the output, particularly in cases in which X_s(λ; n) is a feature that is insensitive to the phase of s(n). If we recognize that w_T(n; λ) will generally be lowpass in spectral nature, it is possible to generalize (4.17) by substituting h_T(n; λ) [any general filter with magnitude spectrum equivalent to that of w_T(n; λ)] for w_T(n; λ) in the computation. It should be noted that the insertion of an infinite impulse response (IIR) output filter into (4.17) will require that X_s(λ; n) be computed for each n, as recent past values of the output will be necessary at each time; this is true even though X_s(λ; n) might only be desired for a select set of "m's." This does not, however, necessarily imply computational inefficiency. Indeed, even FIR (inherently nonrecursive) forms can sometimes be formulated into a recursive computation for greater efficiency. Consider, for example, the short-term power computation using a rectangular window of length N. Using either (4.9) [from (4.8)] or (4.17) yields

   P_s(m) = (1/N) Σ_{n=m−N+1}^{m} s²(n),   (4.18)

which can be computed at any m independent of past values of P_s. Note, however, that (4.18) can be written as a recursion,

   P′_s(n) = P′_s(n − 1) + s²(n) − s²(n − N),   (4.19)

where P′_s(n) ≝ N P_s(n). If P′_s(n) is to be computed at m = 0, N/4, N/2, …, for example, the use of (4.18) requires N squaring operations and (N − 1) additions every N/4 points, or four squaring operations and approximately four additions per n (per speech sample or norm-sec); whereas (4.19) requires only two squares and two adds per n once the initial computation is performed.

Finally, it should be noted that the test of merit of a short-term estimator is the accuracy with which it estimates its long-term counterpart [assuming s(n) remains stationary], that is, the degree to which

   X_s(λ; m) ≈ X_s(λ).   (4.20)

This approximation, in turn, is directly related to the choice of windows in the analysis. A review of this important topic is found in Section 1.1.5. More details on the effects of windows in the design of estimators can be found in the textbooks listed in Appendix 1.A. Let us briefly recall the two considerations in choosing a window: the type of window and its length, N. Generally speaking, for a fixed N, two competing factors emerge in the choice of window type (note in the discussion above that the windows act directly on the speech data): the need not to distort the selected points of the waveform versus the need to smooth the abrupt discontinuity at the window boundaries. (A frequency domain interpretation of these factors is found in Section 1.1.5.) Generally, the latter consideration emerges as primary, and "smoother" windows such as the Hamming are used. In choosing N, again there are two competitive factors: for a fixed window type, increasing N improves the spectral resolution at a given m by providing more information to the computation. An example arising in voiced speech is shown in Fig. 4.2. However, as the window slides through time (to compute features at various m's), long windows make phonetic boundary straddling more likely and events in time are not resolved as well. This phenomenon is also illustrated in Fig. 4.2. The choice of N is highly problem dependent, but as a rule of thumb, speech can be assumed to remain stationary for frames on the order of 20 msec, so that window lengths are chosen accordingly.

Before leaving the generalities of short-term processing to look at some important examples, we examine two more variations on the theme of deducing a short-term feature based on a long-term attribute.

4.2.4 Approach 2 to the Derivation of a Short-Term Feature and Its Two Computational Forms

The long-term temporal average for a speech sequence, s(n), is

   μ_s ≝ A{s(n)}.   (4.21)
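The operation-count comparison between (4.18) and (4.19) made above is easy to check numerically: both forms must produce identical values of P_s(m). A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(5000)
N = 256

def power_direct(s, m, N):
    """P_s(m) per (4.18): (1/N) * sum of s^2 over n = m-N+1 .. m."""
    return np.sum(s[m - N + 1 : m + 1] ** 2) / N

# Recursion (4.19): P'_s(n) = P'_s(n-1) + s^2(n) - s^2(n-N), with P'_s = N*P_s.
m0 = 1000
P_prime = N * power_direct(s, m0, N)
recursive = {m0: P_prime / N}
for n in range(m0 + 1, m0 + 500):
    P_prime += s[n] ** 2 - s[n - N] ** 2
    recursive[n] = P_prime / N

check_at = m0 + 333
print(abs(recursive[check_at] - power_direct(s, check_at, N)))
```

After the single initialization at m0, each step of the recursion costs two squares and two adds, exactly as the text counts.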
where w is any window [or, by the extension below (4.17), any lowpass filter]. Let us define the short-term average operation, A(m), to be such that

   A(m){v(n)} = (1/N) Σ_{n=−∞}^{∞} v(n) w(m − n).

CONSTRUCTION PRINCIPLE 2: If X_s(λ) = A{T(λ){s(n)}}, let

   X_s(λ; m) = A(m){T(λ){s(n)}} = (1/N) Σ_{n=−∞}^{∞} T(λ){s(n)} w(m − n).

FIGURE 4.3. Computation of the short-term feature X_s(λ; m), which is constructed according to Construction Principle 2 and viewed as a convolution.

4.2.5 On the Role of "1/N" and Related Issues

We have explicitly assumed that the long-term speech signal in the above, s(n), is a power signal. Indeed, this is the proper assumption whether we are modeling a voiced or unvoiced phone over infinite time. Recall, from general signal processing theory, however, that if s(n) were an energy signal instead, a slightly different set of long-term features would be used. For example, suppose that x(n) is an energy signal. If we attempt to compute the power of x(n) according to (4.1), or its autocorrelation according to (4.10), we will get zero power, or zero for any argument of the autocorrelation. This is not unexpected since, on the average, there is no power in an energy signal, nor is there any correlation between points at any lag. It is customary in the energy case to use features which sum but do not average,

   X_x(λ) = Σ{T(λ){x(n)}},   (4.26)

where

   Σ{·} ≝ lim_{N→∞} Σ_{n=−N}^{N} {·}.   (4.27)

For example, in place of power we would use

   E_x ≝ Σ_{n=−∞}^{∞} x²(n);   (4.28)

in place of the "power definition" of autocorrelation, we would use

   r_x(η) = Σ_{n=−∞}^{∞} x(n) x(n − η).   (4.29)

In some cases, we have different names for these "energy definition" features, while in others the name remains the same. In the above, for example, since power is average energy, when we do not average, the feature is called the "energy" in x(n). (A subtler but analogous situation arises between the discrete-time Fourier series coefficients and the DTFT.) The "autocorrelation" is an example of a "feature" that bears the same name in either case.

We could, of course, create construction principles for frames of energy signals according to the discussion above. It is plausible, if not obvious, that the short-term results would be identical in form to the "power" results in (4.8) and (4.24), except that the "averaging factor" 1/N would not appear.

The main point of this discussion is to note the following. The factor 1/N in front of the short-term features rarely plays any significant role in short-term analysis. It is therefore often omitted in practice. This should not be taken to mean that the short-term feature is an estimator of an "energy-type" feature. On the contrary, the energy feature would not theoretically exist in the long term. For pedagogical purposes, therefore, we usually leave the factor 1/N in the features to emphasize the fact that the features are theoretically estimators of power-type quantities.

Another point of confusion that might occur to the reader is that the frame itself is an energy signal even though the signal from which it is drawn in our work is a power signal. As a consequence of this, long-term "energy" analysis on the frame (which is entirely proper) will produce the short-term result of Construction Principle 1 without the scale factor 1/N. For example, r_s(η; m) (without the scale factor) can be obtained as

   Σ{f(n; m) f(n − |η|; m)} = Σ_{n=−∞}^{∞} f(n; m) f(n − |η|; m)   (4.30)
                            = Σ_{n=m−N+1}^{m} f(n; m) f(n − |η|; m),

which is easily seen to be equivalent to (4.11) once the scale factor 1/N is included. A similar situation can be described for Construction Principle 2.

The fact that an "energy-type" short-term feature is often used in practice, and that the frame itself is an energy signal, should not be allowed to divert the reader's attention from realizing that the short-term feature is fundamentally an estimator of a "power-type" feature of a power signal. Indeed, the counterpart long-term energy feature does not exist in principle. Implications to the contrary can sometimes cause conceptual problems for the student or researcher working on theoretical problems or interpreting practical data. The averaging factor 1/N, while of no practical significance, is loaded with theoretical meaning, so we leave it in the developments to follow.

We focus now on some important examples of short-term features used in speech processing.
4.3 Example Short-Term Features and Applications

4.3.1 Short-Term Estimates of Autocorrelation

Short-Term Autocorrelation

The autocorrelation sequence will be found to play a central role in many aspects of speech processing. We focus here on several short-term counterparts to the long-term autocorrelation. Recall that we can formalize the autocorrelation as

   r_s(η) = S(η){s(n)} = A{T(η){s(n)}} = A{s(n) s(n − η)}.   (4.31)

Applying a window to the speech gives

   r_s(η; m) = (1/N) Σ_{n=−∞}^{∞} T(η){s(n) w(m − n)}
             = (1/N) Σ_{n=−∞}^{∞} s(n) w(m − n) s(n − |η|) w(m − n + |η|).   (4.35)

the long-term autocorrelation) until the window length becomes infinite. Formally, if s(n) is a realization of a wide-sense stationary (WSS), correlation-ergodic random process, s, then ₁r_s(η; m) can be seen to be one outcome of the random variable ₁r_s(η, m), where

   ₁r_s(η, m) ≝ (1/N) Σ_{n=m−N+1+|η|}^{m} s(n) s(n − |η|),   (4.37)

in which s(n) and s(n − |η|) are random variables drawn from s. Now,

…the η-dependent rectangular window on the speech,

   w(n) = √(N/(N − |η|)),  n = 0, 1, 2, …, N − 1;  0, other n.   (4.40)

2. The variance of ₂r_s(η; m) with respect to r_s(η) becomes large as |η| → N, making this estimator unreliable for large |η| in spite of its unbiasedness (Jenkins and Watts, 1968).

3. ₂r_s(η; m) is a consistent estimator of r_s(η).

Finally, we note another short-term estimator of autocorrelation that avoids the bias problem of ₁r_s(η; m), which can be derived beginning with (4.25):

   ₃r_s(η; m) = (1/N) Σ_{n=−∞}^{∞} T(η){s(n)} w(m − n).   (4.43)

For the two-parameter (covariance-type) case,

   r_s(α, β) = lim_{N→∞} (1/(2N + 1)) Σ_{n=−N}^{N} s(n − α) s(n − β) = A{s(n − α) s(n − β)}.   (4.47)

Generalizing our discussion leading to (4.25) to allow a two-parameter operator, T(α, β), we see that a short-term version of (4.46) is given by

   r_s(α, β; m) = (1/N) Σ_{n=−∞}^{∞} s_T(n; α, β) w(m − n).   (4.48)
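The estimator of (4.37) takes only a few lines of code, and its behavior on a periodic signal anticipates the pitch-detection application discussed next: the estimate peaks near lags equal to the period. A sketch; the synthetic two-harmonic signal is an illustrative stand-in for voiced speech:

```python
import numpy as np

def r1(s, m, N, eta):
    """1r_s(eta; m) per (4.37): (1/N) * sum_{n=m-N+1+|eta|}^{m} s(n) s(n-|eta|)."""
    eta = abs(eta)
    n = np.arange(m - N + 1 + eta, m + 1)
    return float(np.dot(s[n], s[n - eta])) / N

# Synthetic quasi-periodic "voiced" signal with period P = 80 samples.
P = 80
n = np.arange(4000)
s = np.sin(2 * np.pi * n / P) + 0.5 * np.sin(4 * np.pi * n / P)

N, m = 256, 2000
ac = np.array([r1(s, m, N, eta) for eta in range(200)])
peak = int(np.argmax(ac[40:160])) + 40   # search away from the eta = 0 lobe
print(peak)                              # a peak near the period, eta = P
```

Note that the peak can fall a sample or two below the true period: the shrinking summation range in (4.37) biases the estimate downward as |η| grows, which is exactly the bias the text attributes to this estimator.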
where r_s(η; m) is a short-term estimator of autocorrelation. Since Γ_s(ω; m) is the DTFT of the sequence r_s(η; m), the short-term autocorrelation can be obtained by inverting the transform,

   r_s(η; m) = (1/2π) ∫_{−π}^{π} Γ_s(ω; m) e^{jωη} dω.   (4.51)

From (4.51) it is easy to see that the average power in the frame, P̄_s = r_s(0; m), is given by the normalized total area under Γ_s(ω; m),

   P̄_s = (1/2π) ∫_{−π}^{π} Γ_s(ω; m) dω = (1/π) ∫_{0}^{π} Γ_s(ω; m) dω.   (4.52)

In fact, to find the power in any frequency range, say ω_1 to ω_2, for s on the frame, we can compute

   Power in s in frequencies ω_1 to ω_2 on the frame = (1/π) ∫_{ω_1}^{ω_2} Γ_s(ω; m) dω.   (4.53)

The reader is encouraged to compare these definitions and results with those in Section 1.2.3.

When the signal is deterministic, or when s(n) represents a sample function of an ergodic random process, then the temporal short-term autocorrelation, r_s(η; m), is computed and the temporal short-term power density spectrum is defined as its DTFT,

   Γ_s(ω; m) ≝ Σ_{η=−∞}^{∞} r_s(η; m) e^{−jωη}.   (4.54)

We will give another interpretation of the temporal stPDS in terms of the short-term DTFT in Section 4.3.5.

Short-Term Cross-Correlation and Cross-PDS

From any of the short-term estimators of autocorrelation, we can immediately infer a similar short-term cross-correlation function. For example, in a similar manner to the derivation of ₁r(η; m) of (4.36), we could deduce the estimator

   ₁r_{xy}(η; m) ≝ (1/N) Σ_{n=m−N+1+|η|}^{m} x(n) y(n − |η|)   (4.55)

for the two sequences x(n) and y(n). In general, let us simply refer to r_{xy}(η; m). Then the short-term cross-PDS is defined as

   Γ_{xy}(ω; m) ≝ Σ_{η=−∞}^{∞} r_{xy}(η; m) e^{−jωη}.   (4.56)

Application of r_s(η; m) to Pitch Detection

The short-term autocorrelation and covariance are among the most useful computations made on the speech waveform. They will play a central role in many upcoming topics. At present, we do not have the tools to describe these advanced applications; so, for the purposes of illustration, we show some relatively simple uses of the short-term feature as it has been applied to the problem of pitch detection.

From basic probability theory, we know that we can interpret the autocorrelation as an indicator of the degree of linear relationship that exists between any two random variables spaced η apart in a stationary random process. For an ergodic process, therefore, we infer that the autocorrelation relates the degree of expected linear relationship that exists between any two points that are spaced by η in time in a sample waveform. When we move to the short-term case, by purely heuristic arguments, we expect r_s(η; m) to indicate the expected amount of relationship that exists between time points spaced η apart on the window ending at time m.

In Fig. 4.4 we see the waveform for the utterance "three" and plots of r_s(η; m) for two values of m. The short-term estimator of autocorrelation is ₃r_s(η; m) given in (4.44), using an N = 256 point window. Note the strong indication of periodicity [large values of r_s(η; 2756) for η = iP, i an integer and P the pitch period] when the window is entirely over the voiced phoneme /i/ (m = 2756), and the lack of strong periodicity when the window involves an unvoiced region (m = 500). One may get the idea that r_s(η; m) would make a good detector and tracker of pitch. Indeed, this idea has been explored, but direct autocorrelation methods are seldom used in practice because they are more error prone than methods that are only slightly more complicated to implement. (Further methods will be discussed as the material becomes accessible later in the book.) The main problem with direct use of r_s(η; m) is that the first formant frequency, which is often near or even below the fundamental pitch frequency, can interfere with its detection. If the first formant is particularly strong, this can create a competing periodicity in the speech waveform that is manifest in the autocorrelation.⁵ A secondary problem is that the speech is truly only "quasi-periodic," causing the peaks of r_s(η; m) to be less prominent, and, in turn, making peak-picking difficult.

Investigators have attempted various signal preprocessing measures to make the autocorrelation focus more intently on the fundamental period of the waveform. Among these are raising the speech waveform to a large odd (to preserve the sign) power (Atal, 1968), and center clipping. In the latter, the lowpass filtered⁶ time domain waveform is subjected to a nonlinear operation like (Sondhi, 1968)

   ⁵Recall that there can be energy at frequencies other than the harmonics of the pitch frequency because of the windowing in short-term processing.

   ⁶In all techniques discussed, it is conventional to remove high-frequency content in the speech waveform first by lowpass filtering.
242 Ch. 4 I Short-T erm Processing of Speech
4.3 I Example Short-Term Features and Applica tions 243
FIGURE 4.4. Utterance of the word "three" and plots of short-term
autocorrelation r_s(η; m) versus η, based on 256-point Hamming windows
ending at m = 500 and 2756. These two frames correspond to the fricative
/θ/ and the vowel /i/, respectively.

Typically, the clipping limits are set to ±30% of the absolute maximum of
the waveform. The clipping operator of (4.57) for this typical case, and its
effects on a speech waveform for the word "three," are illustrated in
Fig. 4.5. Figures 4.4 and 4.5 can be used to compare the autocorrelation of
the unmodified speech and that of the clipped speech for the word "three."
Other clipping operators have been investigated by Dubnowski et al. (1975).
These are depicted in Fig. 4.6.

Before leaving the issue of center clipping for pitch detection, it should
be noted that this procedure constitutes a "whitening" process on the
spectrum, since it makes the speech more pulselike (see Problem 4.9). Hence
center clipping will tend to emphasize high frequencies in noisy speech and
cause pitch detection performance to deteriorate (Paliwal and Aarskog, 1984).
FIGURE 4.5. The experiment of Fig. 4.4 repeated with center clipping applied
to the speech prior to computing the autocorrelation. Clipping limits are set
to ±30% of the absolute maximum value of the waveform.

FIGURE 4.6. Dubnowski clipping operators, shown in terms of the limits c₊
and c₋. (Ordinarily c₊ = c₋.)
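The center clipper of (4.57) is simple to experiment with. The following
NumPy sketch is illustrative (the function names and the synthetic test
signal are not from the text): it clips at ±30% of the waveform maximum,
computes a short-term autocorrelation, and picks the strongest peak in a
plausible pitch-lag range.

```python
import numpy as np

def center_clip(s, frac=0.3):
    # Clipping operator in the spirit of (4.57) with c+ = -c- = frac*max|s|:
    # samples inside the clipping band are zeroed; those outside are
    # shifted toward zero by the clipping limit.
    c = frac * np.max(np.abs(s))
    out = np.zeros_like(s, dtype=float)
    out[s > c] = s[s > c] - c
    out[s < -c] = s[s < -c] + c
    return out

def st_autocorr(frame):
    # Short-term autocorrelation of one frame, scaled by 1/N.
    N = len(frame)
    return np.correlate(frame, frame, mode="full")[N - 1:] / N

# Quasi-periodic test signal: 125-Hz fundamental plus a strong third
# harmonic playing the role of a "competing periodicity."
fs = 10000
n = np.arange(512)
s = np.sin(2 * np.pi * 125 * n / fs) + 0.8 * np.sin(2 * np.pi * 375 * n / fs)

r = st_autocorr(center_clip(s))
lag = 40 + np.argmax(r[40:200])      # search lags for roughly 50-250 Hz
pitch_hz = fs / lag
print(round(pitch_hz, 1))
```

For this synthetic signal the peak lands on the 80-sample fundamental
period. On real speech the same peak-picking step is exactly where the
formant interference discussed above causes trouble, which is what the
clipping is meant to mitigate.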
4.3.2 Average Magnitude Difference Function

Let us formally define the average magnitude difference function (AMDF) for
an eternally stationary signal, s(n), by

    ΔM_s(η) def= E{ |s(n) − s(n − η)| }.    (4.58)

It should be clear that this family of features, indexed by the time
difference parameter η, takes on small values when η approaches the period
of s(n) (if any), and will be large elsewhere. Accordingly, it can be used
for pitch period estimation. Applying Construction Principle 2 to obtain a
realistic family of features, we have

    ΔM_s(η; m) = (1/N) Σ_{n=m−N+1}^{m} |s(n) − s(n − η)| w(m − n).    (4.59)

The computation of this feature is depicted as a convolution in Fig. 4.7,
and its application to a pitch detection problem is illustrated in Fig. 4.8.

4.3.3 Zero Crossing Measure

The number of zero crossings (the number of times the sequence changes sign)
is also a useful feature in speech analysis. Formally defined in the long
term, the zero crossing measure is

    Z_s def= E{ |sgn[s(n)] − sgn[s(n − 1)]| / 2 },    (4.60)

where

    sgn[s(n)] = { +1,  s(n) > 0
                { −1,  s(n) < 0.    (4.61)

Construction Principle 2 can be used to derive a short-term zero crossing
measure for the N-length interval ending at n = m,

    Z_s(m) = (1/N) Σ_{n=m−N+1}^{m} ( |sgn[s(n)] − sgn[s(n − 1)]| / 2 ) w(m − n).    (4.62)

Before showing an example of this feature, we will review another and use
them together.
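Equation (4.59) translates directly into code. The following NumPy sketch is
illustrative (a synthetic square wave stands in for speech): it computes the
short-term AMDF over a range of lags and reads the pitch period off the
deepest dip.

```python
import numpy as np

def st_amdf(s, m, N, etas, window=None):
    # Short-term AMDF of (4.59) for the N-length frame ending at sample m.
    # `window` holds w(0..N-1); rectangular if omitted. Requires
    # m - N + 1 - max(etas) >= 0 so that s(n - eta) exists.
    w = np.ones(N) if window is None else window
    n = np.arange(m - N + 1, m + 1)
    out = np.empty(len(etas))
    for i, eta in enumerate(etas):
        out[i] = np.mean(np.abs(s[n] - s[n - eta]) * w[::-1])  # w(m - n)
    return out

# 100-Hz square wave sampled at 8 kHz: the period is exactly 80 samples.
fs = 8000
n_all = np.arange(2000)
s = np.sign(np.sin(2 * np.pi * 100 * n_all / fs))

etas = np.arange(20, 200)
d = st_amdf(s, m=1500, N=400, etas=etas)
period = int(etas[np.argmin(d)])
print(period, fs / period)   # deepest dip at the pitch period: 80 samples, 100 Hz
```

Note that the AMDF dips where the autocorrelation peaks, so valley-picking
here plays the role that peak-picking played in the previous section.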
FIGURE 4.7. Short-term AMDF viewed as a convolution.
Definitions

While urging the reader to review the comments in Section 4.2.5, we recall
the short-term power and short-term energy measures for the N-length frame
ending at time m,

    P_s(m) = (1/N) Σ_{n=m−N+1}^{m} s²(n)    (4.63)

and

    E_s(m) = Σ_{n=m−N+1}^{m} s²(n),    (4.64)

respectively.

FIGURE 4.8. Application of the short-term AMDF ΔM_s(η; m) to the detection
of pitch for the word "seven." (a) Signal for the utterance of "seven."
Short-term AMDF computed over rectangular windows of length 256 ending at
(b) m = 2500, (c) m = 3000, (d) m = 4775, (e) m = 5000. Windows in (b) and
(c) are both positioned in "voiced region I" and each produces a similar
pitch frequency estimate (pitch period ≈ 8.3 msec, pitch ≈ 120.5–121.95 Hz).
Windows in (d) and (e) are both positioned in "voiced region II" and
likewise produce similar pitch estimates (pitch period ≈ 10.7–11 msec,
pitch ≈ 90.9–93.46 Hz).
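The measures (4.62)–(4.64) are one-liners in practice. A minimal sketch
follows (rectangular window; the convention sgn[0] = +1 is an arbitrary
choice here, since (4.61) leaves s(n) = 0 undefined):

```python
import numpy as np

def frame_slice(s, m, N):
    # samples s(m-N+1), ..., s(m)
    return s[m - N + 1 : m + 1]

def st_power(s, m, N):
    # (4.63): short-term average power
    return np.mean(frame_slice(s, m, N) ** 2)

def st_energy(s, m, N):
    # (4.64): short-term energy = N times the short-term power
    return np.sum(frame_slice(s, m, N) ** 2)

def st_zero_crossings(s, m, N):
    # (4.62) with a rectangular window; sgn per (4.61), taking sgn[0] = +1
    sgn = np.where(s >= 0, 1.0, -1.0)
    n = np.arange(m - N + 1, m + 1)
    return np.mean(np.abs(sgn[n] - sgn[n - 1]) / 2)

rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * np.arange(400) / 80)   # low frequency: few crossings
noise = rng.standard_normal(400)                 # noise-like: many crossings
print(st_zero_crossings(tone, 399, 256), st_zero_crossings(noise, 399, 256))
```

As the endpoint discussion suggests, noise-like (unvoiced) material shows a
high crossing rate and low energy, while voiced material shows the reverse.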
For example, in Part V we will study several methods for the recognition of
words by comparing their acoustic signals with "template" words in the
recognizer. In some methods it is necessary for the incoming word to be as
free of "nonspeech" regions as possible, to avoid such regions from causing
mismatch. The problem of detecting endpoints would seem to be relatively
trivial but, in fact, it has been found to be very difficult in practice,
except in cases of very high signal-to-("background")-noise ratios. Some of
the principal causes of endpoint detection failures are weak fricatives
(/f/, /θ/, /h/) or voiced fricatives that become unvoiced at the end
("has"), weak plosives at either end (/p/, /t/, /k/), nasals at the end
("gone"), and trailing vowels at the end ("zoo"). We will discuss this
problem in more detail in the context of speech recognition in Chapter 10.

As a medium for illustrating the use of the short-term zero crossing and
energy measures, we briefly discuss the endpoint detection problem. A widely
used method for this task was published by Rabiner and Sambur (1975). In
Fig. 4.9 we see the short-term zero crossing and energy measures⁷ plotted
for the word "four."

FIGURE 4.9. Short-term energy and zero crossing measures plotted for the
word "four." The utterance has a strong fricative at the beginning; hence,
the zero crossing level is initially high, and the energy low. The opposite
is true as the signal enters the voiced portion of the utterance. After
Rabiner and Sambur (1975).

These curves are the result of computing each measure every 10 msec on
frames of length 10 msec. It is assumed that the first 10 frames are
background. They are used to find the mean and variance of each of the
features. In turn, these statistics are used to set "upper" and "lower"
thresholds, τ_u and τ_l, as shown in the figure. The energy curve is then
searched to find the first crossing of the upper threshold τ_u moving toward
the middle of the segment from each end. Then we "back down" to the nearest
crossing of τ_l in each case. This process yields tentative endpoints N₁ and
N₂ in the figure. The double-thresholding procedure prevents the false
indication of endpoints by dips in the energy curve. Now we move toward the
ends from N₁ and N₂ for no more than 25 frames, examining the zero crossing
rate to find three occurrences of counts above the threshold τ_zc. If these
are not found, the endpoint remains at the original estimate. This is the
case with N₂ in Fig. 4.9. If three occurrences are found, then the endpoint
estimate is moved backward (or forward) to the time of the first threshold
crossing. This is the case for N₁ (moved to N̂₁) in the figure.

We can pose the endpoint detection problem as one of discerning speech into
voiced and unvoiced regions. We might, for example, wish to know when a
pitch estimate is necessary in a speech coder. We can infer from Fig. 4.9
that the same measures could be used to at least give an indication of the
voicing state: the zero crossing rate will usually be larger during unvoiced
segments (which are dominated by high frequencies), while the energy will
usually be larger during voiced segments.

4.3.5 Short-Term Fourier Analysis

The Short-Term DTFT⁸

The DTFT of an infinitely voiced (or unvoiced) speech sound does not exist
because the sequence is not of finite energy. However, we are ultimately
going to use an "energy-type" transform on the frame, hence the name
"stDTFT." To adhere to our philosophy of beginning with a long-term
"power-type" computation, however, we obviously should not begin with the
DTFT, since the DTFT will⁹ "blow up." What should we use? To answer this
question, and to facilitate the developments below, let us avoid the
complications of stochastic processes and focus on the voiced case. We will
return to the unvoiced case below. For the voiced case, the discrete Fourier
series (DFS) might come to mind as the appropriate long-term starting point,
but, looking ahead, we realize that the frame will not generally be a
periodic signal. This means that a "short-term DFS" might be less than
satisfying because it would not reveal the spectral nature of the frame in
between the harmonic frequencies.¹⁰ We seem to need a "compromise" between
the Fourier transform and the Fourier series with which to begin, and we
have the additional requirement that the "transform" accommodate power
signals. An excellent choice would be the complex envelope spectrum, which
we first introduced in (1.35). Indeed, a correct interpretation of the
stDTFT we are about to develop is that it is a frequency domain transform
that serves as an estimator of the complex envelope spectrum of the speech.
Accordingly, within a scale factor it is "trying to be" the DFS coefficients
at the harmonics of the signal.¹¹

⁷In fact, the "energy" measure here is actually a magnitude measure similar
to

    M_s(m) = Σ_{n=m−N+1}^{m} |s(n)|.    (4.65)

That this measure gives equivalent information to the energy measure
discussed above should be apparent.

⁸Some of the early work on this topic, based on analog signal analysis, is
found in papers by Fano (1950) and Schroeder and Atal (1962). More recent
work based on discrete-time analysis is reported, for example, in Portnoff
(1980). A list of additional references can be found in the tutorial by
Nawab and Quatieri (1988).

⁹Actually, for a deterministic signal, we can "create" a DTFT with analog
impulse functions at the harmonic frequencies (see Section 1.1.4), but this
does not serve any useful purpose here.

¹⁰Further, in practice, we might not have knowledge of the period of the
waveform, and might not have a good means for estimating it.

¹¹It will become clear below that another correct interpretation views the
stDTFT as the result of an "illegal" long-term DTFT (composed of analog
impulse functions at the harmonics) having been convolved with the spectrum
of the window creating the frame. Hence the stDTFT is an estimate of the
long-term DTFT with potentially severe "leakage" of energy from the
harmonic frequencies.
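The double-threshold energy search described above (after Rabiner and
Sambur, 1975) can be sketched as follows. This is a simplified illustration,
not the published algorithm: the zero-crossing refinement step is omitted,
the threshold multipliers are arbitrary choices, and the per-frame energy
contour is synthetic.

```python
import numpy as np

def endpoints(energy, n_background=10, k_upper=4.0, k_lower=2.0):
    # Upper/lower thresholds from the mean and spread of the assumed
    # background frames; illustrative multipliers, not published values.
    bg = energy[:n_background]
    tau_u = bg.mean() + k_upper * bg.std()
    tau_l = bg.mean() + k_lower * bg.std()
    above = np.where(energy > tau_u)[0]
    if len(above) == 0:
        return None
    n1, n2 = above[0], above[-1]           # first/last crossings of tau_u
    while n1 > 0 and energy[n1 - 1] > tau_l:
        n1 -= 1                            # "back down" to tau_l
    while n2 < len(energy) - 1 and energy[n2 + 1] > tau_l:
        n2 += 1
    return int(n1), int(n2)

bg = np.tile([0.09, 0.11], 6)                          # 12 background frames
burst = np.r_[0.5, 2.0, np.full(10, 5.0), 2.0, 0.5]    # 14 "speech" frames
e = np.r_[bg, burst, bg]
print(endpoints(e))    # brackets the burst: (12, 25)
```

The backing down to the lower threshold is the step that keeps weak onsets
and offsets inside the detected interval, while requiring an initial
crossing of the upper threshold rejects brief background dips and bumps.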
The complex envelope spectrum of the speech is given by [see (1.35)]

    S̄(ω) = (1/P) Σ_{n=0}^{P−1} s(n) e^{−jωn},    (4.66)

where P is the pitch period of the speech. For formal purposes, recall that
this may be written

    S̄(ω) = A{ s(n) e^{−jωn} } = lim_{N→∞} (1/(2N+1)) Σ_{n=−N}^{N} s(n) e^{−jωn}.    (4.67)

Now using either Construction Principle 1 or 2 (see Problem 4.2), then
ignoring the averaging factor 1/N, we derive the short-term DTFT (stDTFT)
(often just called the short-term Fourier transform or short-time Fourier
transform) for an N-length frame ending at time m,

    S_s(ω; m) = Σ_{n=m−N+1}^{m} s(n) w(m − n) e^{−jωn} = Σ_{n=m−N+1}^{m} f(n; m) e^{−jωn},    (4.68)

where w(n) is any window of length N. Now the convention of using the
uppercase S to indicate a transform of the sequence s(n) makes the subscript
unnecessary here, so we will write simply S(ω; m). Note that if we view
S(ω; m) as a set of features of the speech, then ω plays the role of a
continuous parameter, so that there is actually an uncountably infinite
family of features, one for each ω.

Note that by dropping the factor 1/N, we have created an "energy-type"
computation that has a name, DTFT, which is not used to refer to the similar
"power-type" computation. This is similar to what happened when we dropped
the 1/N from the short-term average power to get short-term energy. For
proper interpretation of the short-term energy, however, we encouraged the
reader not to think of it as an estimator of "long-term energy" (which
theoretically does not exist), but rather as short-term average power
(estimating long-term power) with 1/N dropped for convenience. Similarly,
here we encourage the reader to think of the stDTFT not as an estimator of
the DTFT, but rather as an estimator of the complex envelope spectrum with
the scale factor omitted for convenience.

For a given ω, S(ω; m) can be viewed as the convolution of the complex
sequence s(n)e^{−jωn} with the real sequence w(n). This computation is
diagramed in Fig. 4.10. This view is sometimes called the "filtering
interpretation" of the stDTFT. The filtering approach corresponds to the
first expression on the right side of the definition in (4.68). The second
expression corresponds to the "Fourier transform interpretation," since it
involves a conventional DTFT on the frame f(n; m). These interpretations,
while suggesting two ways to compute the stDTFT, do not really provide
clear interpretations of what it means. Hence we have been careful to
provide this information above.

FIGURE 4.10. Computation of S(ω; n) viewed as a convolution.

Some typical short-term magnitude spectra based on the stDTFT for speech
data are shown in Fig. 4.11.¹²

A philosophical remark is in order before leaving the deterministic case. We
have interpreted S(ω; m) as a complex function that is theoretically trying
to estimate DFS coefficients (to within a superfluous scale factor 1/N) when
evaluated at the harmonics. This is a correct interpretation, and perhaps
not a bad one to keep in mind, but it is probably accurate to say that most
DSP engineers do not routinely dwell on the stDTFT in these terms. Instead
they just learn to read them, relate them, and accept them as DTFTs of
"little signals," which are valuable and informative in their own right. In
fact, with experience the reader will probably develop this same intuition
about all of the short-term features we have discussed. The formal framework
in which we have developed them, however, should contribute significantly to
that intuition.

Since there are no "harmonics" present in the unvoiced case, we must also
attempt to find a proper interpretation of the stDTFT when s(n) is a
stochastic signal. The most meaningful way to view S(ω; m) for a random
process is as the "square root" of the stPDS of the process, that is,¹³

    Γ_s(ω; m) = S(ω; m) S*(ω; m) / N² = |S(ω; m)|² / N²,    (4.69)

where Γ_s(ω; m) is defined in (4.54).

In fact, (4.69) holds whether s(n) is a stochastic or deterministic signal.
In a theoretical sense, however, it is more "necessary" for the stochastic
case. In the deterministic case, we feel no uneasiness about dealing with a
Fourier transform, in this case an stDTFT. We have given a perfectly
acceptable interpretation of S(ω; m) for the deterministic case above. We
should feel comfortable that |S(ω; m)|²/N² is a valid stPDS, and that
writing it in these terms is acceptable. On the other hand, for a stochastic
process, for which we have been repeatedly told that Fourier

¹²Of course, these spectra are based on discrete frequency algorithms to be
described below.

¹³The factor 1/N² appears here because we have dropped it from the stDTFT
in the definition.
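Equation (4.68) can be evaluated directly, if slowly, at any set of
frequencies. The sketch below is illustrative (names and the synthetic
"vowel" are not from the text); note that the time index n is kept at its
true location, which is what makes the computation delay preserving.

```python
import numpy as np

def stdtft(s, m, N, omegas, window=None):
    # (4.68): S(omega; m) = sum_{n=m-N+1}^{m} s(n) w(m-n) e^{-j omega n}.
    # Hamming window by default.
    w = np.hamming(N) if window is None else window
    n = np.arange(m - N + 1, m + 1)
    f = s[n] * w[::-1]                    # frame f(n; m) = s(n) w(m - n)
    return np.array([np.sum(f * np.exp(-1j * om * n)) for om in omegas])

# Synthetic voiced-like signal: 120-Hz fundamental with decaying harmonics.
fs = 8000
t = np.arange(4000)
s = sum(a * np.cos(2 * np.pi * h * 120 * t / fs)
        for h, a in [(1, 1.0), (2, 0.6), (3, 0.4)])

omegas = 2 * np.pi * np.arange(512) / 1024    # dense grid on [0, pi)
S = stdtft(s, m=2000, N=512, omegas=omegas)
k = 5 + np.argmax(np.abs(S[5:]))              # skip near-DC bins
print(k * fs / 1024)                          # magnitude peak near 120 Hz
```

In practice one evaluates (4.68) only at the uniformly spaced frequencies of
the short-term DFT, discussed below, where the FFT applies.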
FIGURE 4.11. Some short-term spectral magnitudes based on stDTFTs.
(a) Utterance of vowel /a/. (b, c, d) Short-term magnitude spectra based on
512-point Hamming, Hanning, and rectangular windows, respectively.
transforms make no sense, it is comforting to be able to relate S(ω; m) to a
quantity that "is" meaningful, namely, a power density spectrum. For a
deterministic signal, we can view the stDTFT as an estimator of a Fourier
transform-like function (the complex envelope spectrum), whereas in the
stochastic case, we should only view it as a step toward computing an stPDS,
which, in turn, is an estimate of a long-term PDS. In this sense, the phase
spectrum of the stDTFT is not meaningful in the stochastic case.

Finally, let us note that the proper way to invert an stDTFT is with the
conventional IDTFT inversion formula. The stDTFT is a legitimate DTFT of the
frame f(n; m).

The DFT is inherently a short-term entity. However, we need to be cautious,
especially in theoretical developments, about how we employ the DFT in
short-term analysis. The issue is centered on whether or not we need "delay
preservation" in a particular development. By this we mean we must be
careful with the DFT if it is important to keep track of where the frame is
located in time. If so, we will use what we will call the "short-term DFT."
If not, we can use the "usual" DFT in the customary way. What is the
"customary way" in which we use DFTs? The DFT is inherently defined for a
sequence assumed to have its nonzero portion on the range n = 0, ..., N − 1.

Occasionally, however, a theoretical problem might arise if we use the DFT
in this cavalier fashion on a frame of speech covering the range
n = m − N + 1, ..., m, say f(n; m). Let us consider what happens if we first
shift the frame down to the range n = 0, ..., N − 1, and then compute its
DFT:

    f(n; m) → f(n + m − N + 1; m)    (shift)    (4.71)

    Σ_{n=0}^{N−1} f(n + m − N + 1; m) e^{−jk(2π/N)n}    (DFT for k = 0, ..., N − 1)

      = samples of S(ω; m) e^{jk(2π/N)(m−N+1)} at ω = 2πk/N, k = 0, ..., N − 1.    (4.74)

This is because the stDTFT is "delay preserving," and the usual DFT is not.
It is clear that if we want a discrete Fourier transform that preserves the
proper delay, and is therefore properly samples of S(ω; m), we should use
the short-term DFT (stDFT),

    S(k; m) = Σ_{n=m−N+1}^{m} f(n; m) e^{−jk(2π/N)n},    k = 0, ..., N − 1.

The "usual" DFT, by contrast, operates on the shifted frame,

    S⃗(k; m) = { Σ_{n=0}^{N−1} f⃗(n; m) e^{−jk(2π/N)n},  k = 0, ..., N − 1
              { 0,  otherwise,    (4.77)

where the arrows above f⃗ and S⃗ are just reminders of the shift of the data
before the computation.

Finally, we must discover how to invert the stDFT. Since S(k; m) represents
equally spaced samples of S(ω; m) on the Nyquist range, the fundamental
theory of the DFT tells us that the use of the "usual" IDFT form on these
frequency samples (if not truncated) will produce a periodic (potentially
aliased) replication of the time sequence corresponding to S(ω; m), in this
case f(n; m). That is,

    (1/N) Σ_{k=0}^{N−1} S(k; m) e^{j(2π/N)kn} = Σ_{i=−∞}^{∞} f(n + iN; m).    (4.78)

To obtain an appropriate inverse, we separate out the period corresponding
to i = 0,

    f(n; m) = { (1/N) Σ_{k=0}^{N−1} S(k; m) e^{j(2π/N)kn},  n = m − N + 1, ..., m
              { 0,  other n.    (4.79)

We shall refer to this form as the short-term IDFT (stIDFT).

Although it is probably obvious from the discussion above, for completeness
we note that to invert the "usual" DFT, S⃗(k; m), we use the "usual" IDFT
formula and obtain the version of the frame that has been shifted down to
the time origin, f⃗(n; m) = f(n + m − N + 1; m).

Clearly, we must do some sampling of S(ω; m) along the ω dimension in order
to discretize the (complex-valued) continuous function. This will result in
a discrete two-dimensional sequence,¹⁴ with indices over both the frequency
and time dimensions. We want to send as few of these numbers as possible in
order to preserve channel bandwidth, but we also want to be sure that the
receiver is able to reconstruct the speech from the information sent. We
pursue here some small set of samples from which s(n) is entirely
recoverable.

Let us first fix m and discover how to sample along the ω dimension. From
our discussion above, we see that the samples of S(ω; m) which comprise the
stDFT, namely S(k; m), k = 0, ..., N − 1, are a most natural set of samples
to use. From the stDFT we can completely recover the frame f(n; m), or,
equivalently, the entire stDTFT if desired. Given the frame and knowledge of
the window sequence, it is possible to recover s(n), n = m − N + 1, ..., m.

Now let us consider at what m's we must sample S(k; m). If w(n) is a finite
window known to the receiver, then clearly we can just send one set¹⁵ of
frequency samples (one stDFT) for each adjacent frame of N speech samples.
These can then be used to reconstruct the adjacent frames, and subsequently
the N-point blocks of speech by removing the window. Consequently, the
S(k; m) are computed at times m = 0, N, 2N, .... (Note that this represents
a resampling of time with respect to the original sampling rate on the
speech data.) This method is not consistent with how short-term Fourier-like
methods have been used in coding and synthesis, so we need not belabor this
discussion. It is interesting to note, however, that no coding efficiency is
gained for all this effort, since N/2 complex numbers are sent each N
norm-sec for a net rate of 1 sample per norm-sec. For the same rate, the
original speech waveform can be sent.

Closely related to what we will later call the filter-bank method of
short-term spectral waveform encoding or synthesis is the following.
Consider the computation of S(k; n) at a particular k shown in Fig. 4.12.

¹⁴If the feature parameter is continuous as in the case of S(ω; n), then the
feature is a sequence over the time dimension and a function over the
parameter dimension.

¹⁵Note that N/2, instead of N, is necessary because of the complex symmetry
of the transform.
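Relation (4.74) is easy to confirm numerically. In this sketch (illustrative
setup, using an arbitrary random sequence), the "usual" DFT of the shifted
frame is compared against the delay-preserving stDFT samples multiplied by
the linear phase e^{jk(2π/N)(m−N+1)}:

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(300)
N, m = 64, 200

n = np.arange(m - N + 1, m + 1)
f = s[n] * np.hamming(N)[::-1]          # frame f(n; m), n = m-N+1, ..., m

# Delay-preserving stDFT: samples of S(omega; m) at omega = 2*pi*k/N.
k = np.arange(N)
S_st = np.array([np.sum(f * np.exp(-2j * np.pi * kk * n / N)) for kk in k])

# "Usual" DFT of the frame shifted down to the time origin.
S_usual = np.fft.fft(f)

# (4.74): the two differ by exactly the linear phase e^{jk(2*pi/N)(m-N+1)}.
phase = np.exp(2j * np.pi * k * (m - N + 1) / N)
print(np.allclose(S_usual, S_st * phase))   # True
```

The magnitudes of the two transforms are identical, which is why the
distinction is harmless whenever only the magnitude spectrum is needed (see
the note in Problem 4.9).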
N ote that we have specifically replaced th e computation time m by the quite generally t rue for co m monly used window seque nces (see Problem
more general n, indicating that we mi ght wish to compute S(k; n) at 4.4). T he main benefit that obtains from thi s sch em e is a flexible repre
ever y n. In fact, let us assume that it is computationally necessary, effi senta tio n of the speech that can be manipulated in both time and fre
cient, or convenient to compute S( k ; n) at every n. Th is will allow for q ue ncy. However, this benefit does not outweigh th e di sad vantage of
th e possibility that we n) is not a finite-length window and that a re excessive coding requirements , and simple filter bank syst ems of this
cursive computation is neces sary. The question remains, however, as to type a re used primarily in research. We will return to such syst ems in
whether it is necessary to send S(k; n) for all n even if it is available. The Chapte r 7 to learn to make them more useful and effic ient. The main
answer is clear when we interpret th e system whose impulse response is purpose of thi s brief preview was to come to terms with S (k; n) as a two
wen) as a iowpass filter, which all commonly used windows will be . Con dimensional seque nce and to study its sampling.
sider, for example, the case in which wen) is a Hamming window. If we
define th e bandwidth of the lowpass filt er to be th e positive frequen cy
width of the main spe ctral lobe , th en for th e Hamming window this The Use of Sew; m) in Pitch and Formant Estimation
bandwidth is
T he short-term Fourier transform, in both analog and di screte-time
2]'[
(norm-rps) . (4.80) forms, has been the basis for many important developments in speech
Wb =N
ana lysis and synthesis. For a comprehensive overview the reader is re
ferred to the textbooks by Flanagan (1972) and Rabiner and Schafer
This, therefore, is the nominal Nyquist frequen cy of S (k; n) when consid ( 1978). Other references are noted in footnote 8. Wh ile th e stD T FT re
ered as a sequence on n, and th e a ppropriate sampling rate on the se main s an important tool in some areas of speech processin g (e .g., it
quence is serves as th e basis for some commercial d igital spectrog rap hic analyzers),
in man y co nte mpo ra ry problems spectral features of speech are deduced
4n by oth er techniques. We will study thes e methods as we pr oc eed in th e
ws = 2w b =/i (norm-rp s) . (4.81)
book .
Clearly, the stDTFT can serve as a basis for formant analysis of
From samples (with respect to n) of S (k ; n) taken at this rate, the receiver speech, since it ver y directly contains th e formant information in its
should be able to exactly reconstruct the entire sequence; hence no infor magnitude spectrum. An example of such a formant analysis system is
mation will be lost. The sample times correspond to des cribed by Schafer and Rabiner (1970). The stDTFT has also been
used in a number of algorithms for pitch detection. If it is computed
2n .N
n = i- = 1 , i = .. . , - I, 0, 1, 2, ... (norm-sec) , (4.82) with sufficient spectral resolution, then the harmonics of th e pitch fre
Ws 2 quen cy will be apparent in the short-term spe ctrum. This ide a is th e
basis for the harm onic product spectrum (H P S) (Schroede r, 1968), de
fine d as
so that samples of S( k; n ) with respect to n need only be sent every NI2
co m putations. Note that this interval corresponds to half the window
length. This process in which onl y a fra ction of the sequence values is P(w ; m) d=ef IT
S(rw ; m )
R
(4.8 3)
sent is called decimation . r~ l
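A discrete-frequency version of the harmonic product spectrum is easy to
sketch: compress the magnitude spectrum by integer factors and multiply, so
that the harmonics reinforce at the fundamental. The function below is an
illustrative implementation on a DFT grid, not the book's own continuous-ω
development.

```python
import numpy as np

def harmonic_product_spectrum(frame, R=4, nfft=4096):
    # Multiply the magnitude spectrum by copies of itself compressed by
    # r = 2, ..., R; the harmonic at r*f0 lands on f0 in the r-compressed copy.
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    K = len(mag) // R                      # usable bins after compression
    hps = mag[:K].copy()
    for r in range(2, R + 1):
        hps *= mag[:K * r:r]
    return hps

fs = 8000
n = np.arange(1024)
s = sum(np.cos(2 * np.pi * h * 150 * n / fs) for h in range(1, 5))  # f0 = 150 Hz

hps = harmonic_product_spectrum(s)
k0 = 5 + np.argmax(hps[5:])               # skip near-DC bins
print(k0 * fs / 4096)                     # fundamental estimate near 150 Hz
```

The product suppresses spurious peaks: a candidate frequency scores highly
only if *all* of its first R multiples carry energy, which is exactly the
signature of a pitch harmonic series.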
FIGURE 4.13. The harmonic product spectrum is the product of
frequency-compressed copies of the original spectrum. In each copy the
frequency axis is compressed by a different integer factor so that the
harmonics line up and reinforce the fundamental frequency.

4.4 Conclusions

The fundamental purpose of this chapter has been to provide a formal basis
for the processing of frames of speech as these frames move through time.
Since speech typically remains stationary for ranges of only tens of
milliseconds, this issue is of critical importance in the extraction of
features from speech. These features, in turn, dynamically characterize the
waveform for coding, analysis, or recognition.

Short-term features of speech are generally computed using one of two basic
paradigms, which we have formalized as "construction principles." These both
amount to windowed transformations of long-term attributes of the signal,
and the two construction principles differ in essence by the point at which
the window enters the computation. We have also provided several examples of
short-term features, both to illustrate the formal developments and because
they will be useful in the work to follow. Among the most important of the
short-term features we have discussed are the correlation estimators. These
will play a central role in linear prediction (LP) analysis, which, in turn,
will be at the heart of many future discussions and methods. We take up the
subject of LP analysis in the following chapter.

4.5 Problems

4.1. Recall the "voiced" excitation to the digital model of speech
production,

    e(n) = Σ_{q=−∞}^{∞} δ(n − qP).    (4.84)

(a) Find an expression for the long-term temporal autocorrelation, r_e(η).
(b) Find ¹r(η; m) of (4.36) for η = 0, 1, ..., N − 1 using a rectangular
window of length N, where lP < N < (l + 1)P for some integer l.
(c) Repeat part (b) for ³r(η; m) of (4.44).

4.2. (a) Begin with (4.67) and use Construction Principle 1 to derive the
(scaled) stDTFT given in (4.68).
(b) Repeat part (a) using Construction Principle 2.
(c) In deriving the stDTFT, both construction principles produce the same
result. Can you state general conditions under which the two principles
produce the same short-term estimator?

4.3. (a) Determine whether the alleged properties of the stDFT shown in
Table 4.1 are correct. The notation W def= e^{−j2π/N} is used for
convenience and all frames are of length N.

TABLE 4.1. Alleged Properties of the stDFT.

    Property               Time Domain                  Frequency Domain
    Linearity              a f₁(n; m) + b f₂(n; m)      a S₁(k; m) + b S₂(k; m)
    Circular shift         f(n − d; m) mod N            W^{kd} S(k; m)
    Modulation             W^{ln} f(n; m)               S(k + l; m) mod N
    Circular convolution   f(n; m) mod N ∗ y(n)         S(k; m) Y(k)
(b) Determine whether a form of Parseval's relation holds:

$$\sum_{n=m-N+1}^{m} |f(n;m)|^2 = \frac{1}{N}\sum_{k=0}^{N-1} |S(k;m)|^2. \quad (4.85)$$

(c) Determine whether the stDFT is "symmetric" in the sense that $|S(k;m)| = |S(N-k;m)|$ and $\arg S(k;m) = -\arg S(N-k;m)$ for $k = 0, 1, \ldots, \lfloor N/2 \rfloor$, where $\lfloor N/2 \rfloor$ means the largest integer less than or equal to $N/2$.

4.4. What is the effect on coding efficiency of using the rectangular window in the "filter bank" encoder discussed at the end of Section 4.3.5? Express your answer as the ratio of the number of real numbers per norm-sec required to transmit the stDFT versus the number of real numbers required per norm-sec to send the speech waveform itself. The window bandwidth (in terms of N) can be estimated from Fig. 1.4.

4.5. (a) Verify that Construction Principle 2 can be used to derive the short-term zero-crossing measure,

$$Z_s(m) = \frac{1}{N}\sum_{n=m-N+1}^{m} \frac{|\operatorname{sgn}\{s(n)\} - \operatorname{sgn}\{s(n-1)\}|}{2}\, w(m-n) \quad (4.86)$$

from the long-term definition, (4.60).
(b) If the window is rectangular, show that $Z_s(n)$ can be computed recursively (for all $n$), using a computation of the form

$$Z_s(n) = Z_s(n-1) + \text{other terms}. \quad (4.87)$$

(c) Show the computational structure for your result in part (b).
(d) Suppose that it is desired to have $Z_s(m)$ for $m = 0, N/2, N, 3N/2, 2N, \ldots$. On the average, how many floating point operations (flops) per norm-sec are required to obtain these measures using (i) the defining relation (4.86) and (ii) the recursion (4.87)? For simplicity, let one flop be equivalent to one multiplication operation and ignore all additions.

4.6. Repeat parts (b)-(d) of Problem 4.5 for the short-term energy measure defined as

$$E_s(m) = \sum_{n=m-N+1}^{m} s^2(n). \quad (4.88)$$

In tallying flops, consider a squaring operation to be equivalent to one multiplication.

4.7. (a) Recall the expressions for the power and short-term power of the real sequence $s(n)$,

$$P_s = \mathcal{E}\{s^2(n)\} \quad (4.89)$$

and

$$P_s(m) = \frac{1}{N}\sum_{n=m-N+1}^{m} s^2(n), \quad (4.90)$$

respectively. Clearly, $r_s(0) = P_s$, where $r_s(\eta)$ is the long-term autocorrelation sequence. For which short-term autocorrelation estimators, and for which windows, is it true that

$$r_s(0; m) = P_s(m)? \quad (4.91)$$

(b) What is the relation between the short-term autocorrelation and the short-term energy given in (4.88)?

4.8. (Computer Assignment) Write simple programs for the computation of the short-term AMDF and the general short-term autocorrelation estimator $r(\eta; m)$ given in (4.35). Explore the use of different window types and lengths in estimating the pitch every 1000 samples for the /i/ glissando. (Note: The speaker of the glissando (pitch sweep) is an adult male and the sampling rate of the data is 10 kHz. This should help reduce the number of $\eta$'s considered. This is important to reduce the amount of computation to a reasonable level.)

4.9. (Computer Assignment)
(a) Perform the following operations with a vowel utterance of your choice:
(i) Compute the short-term autocorrelation of (4.35) for a Hamming window of length N = 512 for $\eta = 0, 1, \ldots, 256$.
(ii) Compute the N = 512-point magnitude spectrum of the waveform based on a Hamming window and an stDFT. (Note: The stDFT and conventional DFT are equivalent here because only the magnitude spectrum is required.)
(iii) Repeat steps (i) and (ii) after center clipping the waveform according to (4.57).
(b) Comment on the changes in both the autocorrelation and the spectrum. What do these changes indicate about the effects of the clipping operation on the waveform?
(c) Estimate the pitch using the two autocorrelation results. Which result would provide better performance in an automated procedure?

4.10. (Computer Assignment) Use the short-term AMDF to estimate the pitch in the utterance "seven." Decide upon an appropriate window type and length, frequency of computing the AMDF (i.e., at what $m$'s), and appropriate range of $\eta$. The speaker is an adult male and the sampling rate is 10 kHz.
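As a starting point for the computer assignments above, the short-term autocorrelation and AMDF can be sketched in a few lines of Python. This is an illustrative sketch only: it uses a rectangular window with in-window products for the estimator, and a synthetic 200 Hz sinusoid at a 10 kHz rate stands in for recorded speech; the helper names are our own, not the text's.

```python
import math

def st_autocorr(s, m, N, eta):
    """Short-term autocorrelation r(eta; m) over the frame
    n = m-N+1, ..., m (rectangular window, in-window products)."""
    lo = m - N + 1
    return sum(s[n] * s[n - eta] for n in range(lo + eta, m + 1)) / N

def st_amdf(s, m, N, eta):
    """Short-term average magnitude difference function at lag eta;
    it dips toward zero when eta matches the pitch period."""
    lo = m - N + 1
    return sum(abs(s[n] - s[n - eta]) for n in range(lo + eta, m + 1)) / N

def pitch_period(s, m, N, lags):
    """Pick the lag maximizing the autocorrelation (equivalently,
    one could pick the minimum of the AMDF)."""
    return max(lags, key=lambda eta: st_autocorr(s, m, N, eta))

# Toy "speech": a sinusoid with a 50-sample period (200 Hz at 10 kHz).
s = [math.sin(2 * math.pi * n / 50) for n in range(1000)]
P = pitch_period(s, m=900, N=400, lags=range(20, 120))   # P == 50 here
```

For Problem 4.8, the glissando data would replace `s`, with the estimate recomputed every 1000 samples and different window types substituted for the rectangular one.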
5.1 / Long-Term LP Analysis by System Identification
The block diagram for this system is shown in Fig. 5.1(a). The objective of an LP algorithm is the identification of the parameters² associated with the all-pole system function

$$\hat{\Theta}(z) = \frac{\hat{\Theta}_0}{1 - \sum_{i=1}^{M} \hat{a}(i)\, z^{-i}}, \quad (5.3)$$

which is to serve as an estimated model for the true speech production system, $\Theta(z)$. The form of the estimated model is shown in Fig. 5.1(b).

Before initiating a pursuit of these parameters, it is prudent to wonder why an all-pole model is used, in light of the fact that we have seen fit to initially attribute a pole-zero system to the speech. By acoustic analysis of tube models in Chapter 3, we initially built up a simple model of speech production whose system function turned out to be all-pole. We discussed the fact, however, that there are numerous compelling arguments for the inclusion of zeros in the speech production model. It is well known that certain phoneme classes, most notably the vowels, involve vocal-tract configurations that are acoustically resonant, and are therefore appropriately modeled by all-pole structures (Fant, 1956, 1960; Flanagan, 1972). On the other hand, such phoneme classes as nasals and fricatives, and generally any sound that can be modeled by the inclusion of acoustic "side cavities" in the vocal tract, will contain certain spectral nulls, mathematically corresponding to zeros in the system function. Further, it is also known that the glottal pulse waveform, a component of $\Theta(z)$, is better represented by a filter containing zeros³ (Rosenberg, 1971; Deller, 1983). Ideally, therefore, the method of choice would be to model speech with a pole-zero system function in most cases.

The use of an all-pole model, however, is primarily a matter of analytical necessity. As we shall see, these parameters can be determined using a meaningful strategy (resulting in simple linear equations) applied to the very limited information available about the true system (namely, its output).⁴ Beyond necessity, however, there is an overriding argument against the need for zeros, if it is the purpose of the LP model to preserve the

[FIGURE 5.1. Discrete-time speech production model. (a) "True" model: a DT impulse generator (pitch period P) driving a glottal filter G(z) for voiced speech, or an uncorrelated noise generator for unvoiced speech, selected by a voiced/unvoiced switch with a gain, driving the vocal-tract and lip-radiation filters H(z) and R(z) to produce the speech signal s(n). (b) Model to be estimated using LP analysis.]

²In keeping with the notational convention established in earlier chapters, we could attach a subscript such as s to the parameters, $\hat{a}(i)$, to indicate that these numbers can be viewed as features of the signal s(n), indexed by a parameter i. This subscript is usually not necessary since it is obvious which signal is related to the features under consideration. We shall therefore omit it for simplicity. Note that we have also begun to index filter coefficients by parenthetical arguments rather than subscripts so that we may use subscripts to indicate signals when necessary.

³Recall the discussion about nonminimum-phase systems in Chapters 1 and 3.

⁴Note that an all-pole model is one that requires delayed values of the output only, and minimal information (only the present value) about the input, which is unmeasurable in the speech problem.

⁵This is not to say that the "ear" cannot perceive phase differences. The relationships among two or more frequency components within the same "critical band" (see Section 6.2.4) are very significant to the perception of the sound (Carlyon, 1988). There is also some evidence that phase effects can alter the perception of the first formant (hence, vowel perception) (Darwin and Gardner, 1986), but these effects are apparently not very significant. The ultimate test for the "validity" of any engineering assumption is whether the resulting development serves the intended purpose satisfactorily. Whether or not the ignorance of phase is strictly proper, LP analysis based on this assumption has proved to be eminently useful.
human ear is fundamentally "phase deaf" (Milner, 1970, p. 217). Whatever information is aurally gleaned from the speech is extracted from its magnitude spectrum. Further, as we show, a magnitude, but not a phase, spectrum can be exactly modeled with stable poles. Therefore, the LP model can exactly preserve the magnitude spectral dynamics (the "information") in the speech, but might not retain the phase characteristics. In fact, the stable, all-pole nature of the LP representation constrains such a model to be minimum phase quite apart from the true characteristics of the signal being encoded. We will show that (ideally) the model of (5.3) will have the correct magnitude spectrum, but minimum-phase characteristic with respect to the "true" model. If the objective is to code, store, resynthesize, and so on, the magnitude spectral characteristics, but not necessarily the temporal dynamics, the LP model is perfectly "valid" and useful.

LEMMA 5.1 (SYSTEM DECOMPOSITION LEMMA) Any causal rational system of form (5.1) can be decomposed as

$$\Theta(z) = \Theta_0\, \Theta_{\min}(z)\, \Theta_{\mathrm{ap}}(z), \quad (5.4)$$

where $\Theta_{\min}(z)$ is minimum phase, and $\Theta_{\mathrm{ap}}(z)$ is all-pass, that is, $|\Theta_{\mathrm{ap}}(e^{j\omega})| = 1\ \forall\, \omega$.

Proof.⁶ ...consider only stable causal systems so that only zeros may be outside the unit circle in the z-plane. Construct a pole-zero diagram for $\Theta(z)$. Any pole or zero inside the unit circle is attributed to $\Theta_{\min}(z)$. Any zero outside the unit circle should be reflected to the conjugate reciprocal location inside the unit circle, and the reflected zeros also become part of $\Theta_{\min}$. $\Theta_{\mathrm{ap}}(z)$ contains the reflected zeros in their unreflected positions, plus poles that cancel the reflected ones. $\Theta_{\mathrm{ap}}$ is normalized by appropriate scaling of the gain term. The pole-zero manipulation is illustrated in Fig. 5.2. The gain scaling is seldom important (as we shall

$$\Theta(z) = \Theta_0\, \frac{1 + b(1)z^{-1} + b(2)z^{-2}}{1 - a(1)z^{-1} - a(2)z^{-2}} = \Theta_0\, \frac{(1 - z^{-1}/w^*)(1 - z^{-1}/w)}{(1 - p\,z^{-1})(1 - p^*z^{-1})}.$$

Suppose that we choose to first work on the "external" zero at $z = 1/w^*$. We introduce the conjugate reciprocal zero at $z = w^*$ and the

⁶Although proofs can often be omitted on first reading, it is suggested that you read this one because it leads to an immediate understanding of the lemma.

[FIGURE 5.2. Illustration of the proof of the System Decomposition Lemma. (a) Original pole-zero diagram (poles p, p* marked x; zeros marked o). (b) The zero at 1/w* is reflected to its conjugate reciprocal location w*, then canceled with a pole at the same location. (c) The process is repeated for each "external" zero. (d) $\Theta_{\min}(z)$ gets all "reflected" zeros plus all original poles; $\Theta_{\mathrm{ap}}(z)$ gets all original zeros plus all "canceling" poles. When $\Theta_{\mathrm{ap}}(z)$ is expressed in this form, it should be multiplied by a normalizing constant to make its magnitude unity, and the original gain must be divided by the same constant.]
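The key step of the proof, reflecting a zero to its conjugate reciprocal location, can be checked numerically. In the sketch below (the zero location is an illustrative value, not one from the text), each first-order factor satisfies $|1 - z_0 e^{-j\omega}| = |z_0|\,|1 - (1/z_0^*)e^{-j\omega}|$, which is exactly the license to move zeros inside the unit circle while preserving the magnitude spectrum:

```python
import cmath
import math

def factor_mag(zero, omega):
    """Magnitude of a first-order zero factor, |1 - zero * e^{-j omega}|."""
    return abs(1 - zero * cmath.exp(-1j * omega))

z0 = 1.6 + 0.5j                  # an "external" zero, outside the unit circle
z_refl = 1 / z0.conjugate()      # its conjugate reciprocal location, inside

# The reflected factor, rescaled by |z0|, matches the original factor in
# magnitude at every frequency on the unit circle.
max_err = max(
    abs(factor_mag(z0, w) - abs(z0) * factor_mag(z_refl, w))
    for w in (2 * math.pi * k / 64 for k in range(64))
)
```

The ratio of the two factors is precisely a first-order all-pass term, which is how $\Theta_{\mathrm{ap}}(z)$ accumulates one pole-zero pair per reflected zero.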
...a sufficiently large choice of $I$. [Ponder the fact that the coefficients in each term like (5.7) are decaying exponentially.] In practice, a sufficient model is obtained by neglecting terms beyond a small integer, typically 8-14.

According to Lemmas 5.1 and 5.2, $\Theta(z)$ can be written

$$\Theta(z) = \Theta_0\, \frac{1}{1 - \sum_{i=1}^{I} a(i)\,z^{-i}}\; \Theta_{\mathrm{ap}}(z). \quad (5.8)$$

We will show that the LP model will ideally represent the all-pole minimum-phase portion of $\Theta(z)$. The $\hat{a}(i)$'s will be computed in such a way as to match, if $M = I$, the $a(i)$'s of (5.8). Since $|\Theta_{\mathrm{ap}}(\omega)| = 1$ implies $|\Theta(\omega)| = \Theta_0\,|\Theta_{\min}(\omega)|\ \forall\,\omega$, a correct matching of the $a(i)$ parameters will at least yield a model with a (scaled) correct magnitude spectrum. In case

and $s(n)$ can be interpreted as the output of the minimum-phase component of $\Theta(z)$ driven by a phase-altered version of $e(n)$. Except for this input term, the output $s(n)$ can be predicted using a linear combination of its past $I$ values. In statistical terminology, the output is said to regress on itself, and the model is often called an autoregressive (AR) model in other domains.⁹ The $a(i)$'s form the predictor equation coefficients, and their estimates from LP analysis are often called a linear predictive code, especially in the communications technologies, where they (or related parameters) are used to encode the speech waveform. For this reason, the process of LP analysis is frequently called linear predictive coding, but we will reserve this term for its more proper use in cases where coding is the issue (see Chapter 7). In the literature, the estimates of the numbers $a(i)$,

⁷Some details of this proof are considered in Problem 5.2.

⁸We have taken the liberty of assuming that these z-transforms exist. When they do not, a similar discussion can be based on the power density spectra.

⁹The notation AR(M) is often used to denote an autoregressive model with M parameters, or, equivalently, M poles. This is typically called an AR model "of order M."
$i = 1, 2, \ldots, I$, namely $\hat{a}(i)$, are variously referred to as "a" parameters, $\hat{a}(i)$ parameters, LP parameters, or LP coefficients.

These and similar vector notations will play a central role in our developments. The superscript $T$ is used to denote the transpose.

With these ideas in mind, we pursue the model. We will find the LP ... these "interpretive problems" because each allows us to interpret the resulting model in a different way. The four interpretations of the LP

[Figure 5.3 appears here: block diagrams relating the speech $s(n)$, the all-pole model $\hat{\Theta}(z) = \hat{\Theta}_0/(1 - \sum_{i=1}^{M}\hat{a}(i)z^{-i})$, its output $\hat{s}(n)$, and the error computation; part (c) defines the inverse filter.]

...minimizes $\mathcal{E}\{e^2(n)\}$ when $s(n)$ is input. [The LP parameters are given by $\hat{a}(i) = -a(i)$, $i = 1, 2, \ldots, M$, and the resulting $\hat{A}(z)$ is called the inverse filter (IF) in the LP literature.]

Solution. From part (c) of Fig. 5.3 and (5.14) we have

$$\mathcal{E}\{e^2(n) \mid a(0) = 1\} = \mathcal{E}\Big\{\Big[\sum_{i=0}^{M} a(i)\,s(n-i) - e(n)\Big]^2 \,\Big|\, a(0) = 1\Big\}. \quad (5.16)$$

Differentiating with respect to $a(\eta)$, $\eta = 1, 2, \ldots, M$, and setting the results to zero, we obtain

$$\mathcal{E}\Big\{\Big[\sum_{i=0}^{M} a(i)\,s(n-i)\Big]\,s(n-\eta)\Big\} = \mathcal{E}\{e(n)\,s(n-\eta)\}, \quad \eta = 1, 2, \ldots, M, \quad (5.18)$$

where $r_{es}(\eta) = \mathcal{E}\{e(n)s(n-\eta)\}$ denotes the temporal cross-correlation of the sequences $e(n)$ and $s(n)$, which are either deterministic or realizations of jointly WSS, second-order ergodic random processes; and $r_s(\eta) = \mathcal{E}\{s(n)s(n-\eta)\}$, the temporal autocorrelation of $s(n)$. Now, if we assume $e$ to be a unity variance orthogonal random process¹⁰, meaning that

$$r_e(\eta) = \mathcal{E}\{e(n)\,e(n-\eta)\} = \delta(\eta), \quad (5.19)$$

then, using (5.1), it is easy to show that

$$r_{es}(\eta) = \Theta_0\, \theta(-\eta), \quad \eta \ge 0, \quad (5.20)$$

which is therefore zero for all positive $\eta$. [The violation of (5.19) in the "voiced" case is discussed below.] Finally, therefore,

$$\sum_{i=1}^{M} a(i)\, r_s(\eta - i) = -r_s(\eta), \quad \eta = 1, 2, \ldots, M, \quad (5.21)$$

and recalling that $\hat{a}(i) = -a(i)$ for $i = 1, 2, \ldots, M$, we have

$$\sum_{i=1}^{M} \hat{a}(i)\, r_s(\eta - i) = r_s(\eta), \quad \eta = 1, 2, \ldots, M. \quad (5.22)$$

...in the linear algebra literature, they are sometimes called the normal equations. For convenience, we will often want to have them packaged as a vector-matrix equation

$$\begin{bmatrix} r_s(0) & r_s(1) & r_s(2) & \cdots & r_s(M-1) \\ r_s(1) & r_s(0) & r_s(1) & \cdots & r_s(M-2) \\ r_s(2) & r_s(1) & r_s(0) & \cdots & r_s(M-3) \\ \vdots & & & & \vdots \\ r_s(M-1) & r_s(M-2) & r_s(M-3) & \cdots & r_s(0) \end{bmatrix} \begin{bmatrix} \hat{a}(1) \\ \hat{a}(2) \\ \vdots \\ \hat{a}(M) \end{bmatrix} = \begin{bmatrix} r_s(1) \\ r_s(2) \\ \vdots \\ r_s(M) \end{bmatrix} \quad (5.23)$$

which we will write compactly as

$$\mathbf{R}_s\,\hat{\mathbf{a}} = \mathbf{r}_s \implies \hat{\mathbf{a}} = \mathbf{R}_s^{-1}\mathbf{r}_s. \quad (5.24)$$

Inverse Filtering Interpretation of the LP Model. It is customary to employ a subtle truth when formulating the LP problem: If the input is an orthogonal process, then minimizing the average squared output of the IF (see Interpretive Problem 5.1) is tantamount to minimizing the average squared error in the output. After we have a little experience with LP analysis, this will seem obvious, but, upon first reading, this probably will not seem obvious at all. Let us begin by demonstrating this fact.

Note the following regarding the error quantity minimized above:

$$\mathcal{E}\{[\hat{e}(n) - e(n)]^2 \mid a(0) = 1\} = \mathcal{E}\{\hat{e}^2(n) \mid a(0) = 1\} - 2\,\mathcal{E}\{\hat{e}(n)\,e(n) \mid a(0) = 1\} + \mathcal{E}\{e^2(n)\}. \quad (5.25)$$

If $e$ is a WSS, second-order ergodic, unity variance, orthogonal process, it is not difficult to show (Problem 5.3) that this becomes

...formulate the problem in four other ways. (We will seek solutions to these in Problems 5.4-5.7.) The first of these interprets the result in terms of an optimal inverse filter:

¹⁰Recall that this is equivalent to saying that $e$ is zero mean, uncorrelated.
INTERPRETIVE PROBLEM 5.2 (INVERSE FILTERING) Design the IF of Fig. 5.3 part (c), and (5.14) and (5.15), which minimizes the average squared output (or output power) of the filter.

Linear Prediction Interpretation of the LP Model. Similarly, if

$$\hat{s}(n) = \sum_{i=1}^{M} \hat{a}(i)\, s(n-i) \quad (5.27)$$

is viewed as a prediction of $s(n)$, and therefore

$$P(z) = \sum_{i=1}^{M} \hat{a}(i)\, z^{-i} \quad (5.28)$$

as the FIR prediction filter (see Fig. 5.4), then $e(n)$ is the prediction error (also called the prediction residual), and the problem can be posed as follows.

INTERPRETIVE PROBLEM 5.3 (LINEAR PREDICTION) Find the prediction filter, $P(z)$, which minimizes the average squared prediction error.

A closely related interpretation is based on a well-known result from the field of mean squared error (MSE) estimation called the orthogonality principle (OP). The OP is a general result that applies to many forms of linear and nonlinear estimation problems in which the criterion is to minimize the MSE [see, e.g., (Papoulis, 1984)]. A relevant statement of the OP for the present problem follows. A somewhat more general statement and some elaboration is found in Appendix 5.B.

THEOREM 5.1 (ORTHOGONALITY PRINCIPLE) Consider the speech sequence $s(n)$ to be a realization of a WSS random process $s$ with random variables $s(n)$. The error in prediction sequence in Fig. 5.4 will likewise be a realization of a random process $e$ in this case. The prediction filter $P(z)$ of Fig. 5.4 will produce the minimum mean squared error [$E\{e^2(n)\}$ will be minimum] if and only if the random variable $e(n)$ is orthogonal to the random variables $s(l)$ for $l = n - M, \ldots, n - 1$,

$$E\{e(n)\, s(l)\} = 0, \quad \text{for } l = n - M, \ldots, n - 1, \quad (5.29)$$

for any $n$.

Although we will inherently prove this result in our discussions, explicit proofs are found in many sources [e.g., (Gray and Davisson, 1986, pp. 204-205)].

We can write (5.29) in a slightly different way that will look more familiar,

$$E\{e(n)\, s(n-\eta)\} = 0, \quad \text{for } \eta = 1, \ldots, M, \quad (5.30)$$

or, in terms of the (stochastic) cross-correlation,

$$r_{es}(\eta) = 0, \quad \text{for } \eta = 1, \ldots, M. \quad (5.31)$$

Further note that if $s$ is an ergodic stochastic process, or $s(n)$ a deterministic signal, then condition (5.30) is equivalent to the temporal average condition

$$\mathcal{E}\{e(n)\, s(n-\eta)\} = 0, \quad \text{for } \eta = 1, \ldots, M, \quad (5.32)$$

or

$$r_{es}(\eta) = 0, \quad \text{for } \eta = 1, \ldots, M. \quad (5.33)$$

The reader should compare (5.33) with (5.18). We see that (5.33) is sufficient and necessary to derive the normal equations from (5.18).

This discussion leads us to a variation on Interpretive Problem 5.3:

INTERPRETIVE PROBLEM 5.4 (LINEAR PREDICTION BY ORTHOGONALITY PRINCIPLE) Design the minimum mean squared error prediction filter of Fig. 5.4 using the orthogonality principle.

Spectral Flattening Interpretation of the LP Model. Finally, we note a frequency domain interpretation that will be useful below.

INTERPRETIVE PROBLEM 5.5 (SPECTRAL FLATTENING) Design the model of form (5.3) that minimizes the integral

$$\int_{-\pi}^{\pi} \frac{\Gamma_s(\omega)}{|\hat{\Theta}(\omega)|^2}\, d\omega, \quad (5.34)$$

where $\hat{\Theta}(\omega)$ is the DTFT of the LP model impulse response, and $\Gamma_s(\omega)$ represents the temporal power density spectrum of the signal $s(n)$ [defined in (1.141)].

This problem statement is easily verified by showing that the integral is equivalent to $\mathcal{E}\{e^2(n)\} = r_e(0)$ (see Problem 5.7). Interpretive Problem 5.5 is useful for theoretical purposes, but is least useful for design.

Notice carefully that Interpretive Problems 5.2-5.5 yield the customary LP equations, (5.22) or (5.24), whether or not $e$ is orthogonal, whereas "getting the right answer" [i.e., (5.22)] to Interpretive Problem

[FIGURE 5.4. Prediction error interpretation of the inverse filter: the speech $s(n)$ drives the prediction filter $P(z) = \sum_{i=1}^{M}\hat{a}(i)z^{-i}$, whose output $\hat{s}(n)$ is subtracted from $s(n)$ to form the error in prediction, $e(n)$.]
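The orthogonality conditions (5.30)-(5.33) are easy to observe numerically. In this sketch (all signal values and coefficients are illustrative), a synthetic AR(2) signal is fit by solving the M = 2 normal equations (5.22) directly, and the resulting prediction residual is checked to be (nearly) uncorrelated with s(n-1) and s(n-2):

```python
import random

def temporal_autocorr(s, eta):
    """Long-term temporal autocorrelation estimate r_s(eta)."""
    return sum(s[n] * s[n - eta] for n in range(eta, len(s))) / len(s)

# Synthetic AR(2) signal driven by white Gaussian noise.
random.seed(0)
a_true = (0.6, -0.3)
s = [0.0, 0.0]
for _ in range(20000):
    s.append(a_true[0] * s[-1] + a_true[1] * s[-2] + random.gauss(0.0, 1.0))

# Solve the M = 2 normal equations (5.22) by hand.
r0, r1, r2 = (temporal_autocorr(s, k) for k in range(3))
det = r0 * r0 - r1 * r1
a1 = (r0 * r1 - r1 * r2) / det
a2 = (r0 * r2 - r1 * r1) / det

# Prediction residual e(n) = s(n) - a1 s(n-1) - a2 s(n-2); by the
# orthogonality principle it should be uncorrelated with the past
# samples used in the prediction.
e = [s[n] - a1 * s[n - 1] - a2 * s[n - 2] for n in range(2, len(s))]
r_es = [sum(e[k] * s[k + 2 - eta] for k in range(len(e))) / len(e)
        for eta in (1, 2)]
```

With a (practically) orthogonal driving term, $r_{es}(\eta) \approx 0$ for $\eta = 1, 2$ emerges automatically from the fit, exactly as (5.33) requires, and the estimates land close to the true coefficients.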
5.1 depends explicitly upon $e$ being orthogonal. In this sense, Interpretive Problems 5.2-5.5 are "better" problem statements, but they lack the intuitive quality of Interpretive Problem 5.1. Further, as we shall discuss below, $e(n)$ may usually be considered a realization of a zero-mean uncorrelated process in speech work.

5.2 How Good Is the LP Model?

5.2.1 The "Ideal" and "Almost Ideal" Cases

We now seek to discover how well, and in what sense, the LP model represents the true speech system. Consider first the ideal case in which $M = I$ (model order somehow chosen properly) and $e(n)$ is orthogonal¹² (as assumed). The result is so fundamental that we set it off as a theorem.

THEOREM 5.2 ("IDEAL" IDENTIFICATION CONDITIONS) The conditions $M = I$ and $e(n)$ orthogonal are sufficient and necessary for the model of (5.3), with estimates $\hat{a}(i)$, $i = 1, 2, \ldots, (M = I)$, given by (5.22), to exactly represent the minimum-phase component of $\Theta(z)$.

Proof. A formal proof of this important result is given in Appendix 5.A. The proof is tedious, so we move it out of the reading path and present only a relatively complete sufficiency proof here.

When $M = I$, postmultiplying (5.11) by $s(n)$ and applying the $\mathcal{E}$ operator yields

$$\mathbf{r}_s = \mathbf{R}_s\,\mathbf{a} + \Theta_0\, \mathbf{r}_{e's}, \quad (5.35)$$

where all notation has been defined above except $\mathbf{r}_{e's}$, which is an obvious extension of our conventions. Substituting (5.35) into (5.24), we have

$$\hat{\mathbf{a}} = \mathbf{a} + \Theta_0\, \mathbf{R}_s^{-1}\mathbf{r}_{e's} \implies \tilde{\mathbf{a}} \stackrel{\text{def}}{=} \hat{\mathbf{a}} - \mathbf{a} = \Theta_0\, \mathbf{R}_s^{-1}\mathbf{r}_{e's}. \quad (5.36)$$

...becomes zero, or $P \to \infty$ (in this case, the excitation becomes a single discrete-time impulse).

This claim is shown as follows: Using definition (5.2), we can show that

$$r_{e'}(\eta) = r_e(\eta) = \frac{1}{P}\sum_{q=-\infty}^{\infty} \delta(\eta - qP). \quad (5.37)$$

Putting this result into the easily demonstrable equation

$$r_{e's}(\nu) = \Theta_0 \sum_{\eta=-\infty}^{\infty} r_{e'}(\eta)\, \theta_{\min}(\eta - \nu) \quad (5.38)$$

yields

$$r_{e's}(\nu) = \frac{\Theta_0}{P}\sum_{q=-\infty}^{\infty} \theta_{\min}(qP - \nu) \approx \frac{\Theta_0}{P}\, \theta_{\min}(0)\, \delta(\nu), \quad (5.39)$$

which is valid for large $P$, since stability assures that the IR of the minimum-phase portion of the system will decay exponentially with $n$. Practically speaking, the approximation in (5.39) will hold if the IR of the vocal tract has sufficient time to damp out between glottal excitations. This is usually the case for typically lower-pitched "male" phonation, whereas LP techniques often suffer some problems in this regard for higher-pitched female and children's voices (see Fig. 5.5). That the model is only approximate is due to the assumption that (5.20) holds in the derivation. Here we have shown that assumption is nearly justified, clearly leading to a model that accurately estimates the minimum-phase component of the system.

5.2.2 "Nonideal" Cases

In practice, the "true" order of the system, $I$, is, of course, unknown. An accepted operational rule for the choice of the LP model order is

[FIGURE 5.5. Typical (a) female and (b) male utterances of the vowel /a/, plotted as amplitude versus time, n (norm-sec). Based on the short-term AMDF analyses (AMDF versus shift, eta) in (c) and (d), the estimated pitch frequency for the female voice is 200 Hz, while that for the male is 135.2 Hz (pitch period 7.4 msec). In the latter case, the waveform is seen to damp out to a large extent between pulses, whereas this is seen not to be true in the former.]
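The pitch-period effect behind the "almost ideal" case can be sketched directly. Below (all values illustrative), the same all-pole AR(2) resonator, with pole radius near 0.85, is driven by impulse trains with P = 100 and P = 5; the normal-equation estimates recover the true coefficients far more accurately when the impulse response has room to damp out between excitations:

```python
def ar2_impulse_train(a1, a2, P, N):
    """All-pole AR(2) system driven by a period-P impulse train."""
    s = [0.0, 0.0]
    for n in range(N):
        x = 1.0 if n % P == 0 else 0.0
        s.append(a1 * s[-1] + a2 * s[-2] + x)
    return s[2:]

def lp2_estimate(s):
    """M = 2 LP coefficients from the normal equations (5.22)."""
    N = len(s)
    r = [sum(s[n] * s[n - k] for n in range(k, N)) / N for k in range(3)]
    det = r[0] * r[0] - r[1] * r[1]
    return ((r[0] * r[1] - r[1] * r[2]) / det,
            (r[0] * r[2] - r[1] * r[1]) / det)

a1, a2 = 1.2, -0.72     # conjugate pole pair at radius sqrt(0.72) ~ 0.85

err = {}
for P in (100, 5):      # "low-pitched" versus "high-pitched" excitation
    e1, e2 = lp2_estimate(ar2_impulse_train(a1, a2, P, 2000))
    err[P] = abs(e1 - a1) + abs(e2 - a2)
```

With P = 100 the response between impulses is essentially the decayed impulse response, so the autocorrelation aliasing in (5.39) is negligible and the identification is nearly exact; with P = 5 the aliased terms are substantial and the estimate degrades, mirroring the female/male contrast in Fig. 5.5.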
in the trivial case in which $a(i) = 0$, $i = M+1, M+2, \ldots, I$. In theory we can write

$$\Theta_{\min}(z) = \frac{1}{1 - \sum_{i=1}^{I} a(i)z^{-i}} = \frac{1}{\Big[1 - \sum_{i=1}^{M} a'(i)z^{-i}\Big]\Big[1 - \sum_{i=1}^{L} a''(i)z^{-i}\Big]} \stackrel{\text{def}}{=} \Theta_{\min,M}(z) \times \Theta_{\min,L}(z), \quad (5.41)$$

where $I = M + L$, and interpret the parameters $\hat{a}(i)$ as estimates of the factored subset of coefficients $a'(i)$, $i = 1, 2, \ldots, M$, with errors given by (5.36) with $\mathbf{r}_{e''s}$ replacing $\mathbf{r}_{e's}$, where

$$e''(n) \longleftrightarrow E''(z) \stackrel{\text{def}}{=} \frac{E'(z)}{1 - \sum_{i=1}^{L} a''(i)z^{-i}}. \quad (5.42)$$

As a pertinent aside, note carefully that this same discussion applies to the analysis in which $M$ is "properly chosen" ($M = I$) but $e(n)$ is not orthogonal (and has a z-transform): Diagram the true system as in Fig. 5.6(a), where $e(n)$ is correlated and has a z-transform

$$E(z) = E_{\mathrm{ap}}(z)\, E_{\min}(z). \quad (5.43)$$

[FIGURE 5.6. A correlated input sequence causes an equivalent problem to that caused by an underestimated model order. In (a) the speech is modeled by a correlated input driving an M-order system. This model can be redrawn as in (b), in which an orthogonal input sequence drives a model of order greater than M.]

The attempt is to model $\Theta(z)$, which has an $M$-order minimum-phase component, using an $M$-order ("properly chosen") analysis. But we can rediagram the system as in Fig. 5.6(b), noting that the system is now driven by an uncorrelated input but has a minimum-phase component that is $M + L = I$ order, where $L$ is the order of $E_{\min}(z)$. We have, therefore, an apparent attempt to identify an $M$-order subsystem as before.

In either case, this theoretical interpretation provides no practical insight. Exactly which $M$-order subsystem is identified by such an analysis is indeterminate, and, practically speaking, the meaning of the resulting $\hat{a}(i)$'s (how good or bad, and compared with what?) is unclear. Generally speaking, however, the resulting $\hat{a}(i)$'s will identify the $M$ most resonant poles of the system. This can be seen by recourse to Interpretive Problem 5.5, the spectral flattening interpretation, which asserts that the $M$-order LP model of (5.3), with coefficients derived using (5.22), will have a magnitude spectrum which minimizes the integral (5.34). It is clear that when $|\hat{\Theta}(\omega)|^2$ cannot match $\Gamma_s(\omega)$ exactly, the task of minimizing the area in (5.34) causes $\hat{\Theta}$ to concentrate on the peaks in the spectrum of $\Gamma_s(\omega)$, or the most resonant frequencies in the speech spectrum. This has the following implications for the discussion at hand: In the case $M < I$ discussed above, since we have assumed $e(n)$ orthogonal, we have

$$\Gamma_e(\omega) = 1 \implies \Gamma_s(\omega) = |\Theta(\omega)|^2 \quad (5.44)$$

and the effort to flatten the peaks in the speech spectrum is tantamount to trying to flatten those in the system spectrum only. Only if $\Theta(\omega)$ has peaks that can be sufficiently flattened by $M$ poles can the LP model be asserted to be a reasonable approximation to $\Theta_{\min}(z)$, in spite of the model underestimation.

On the other hand, in the case in which $e(n)$ is correlated, but $M$ is "large enough," we can see with reference to Fig. 5.6(b) that, once again,

$$\Gamma_{e_{\mathrm{ap}}}(\omega) = 1 \implies \Gamma_s(\omega) = |\Theta(\omega)|^2\, \Gamma_e(\omega) \quad (5.45)$$

and $\hat{\Theta}(z)$ will apparently identify $\Theta(z)$ well (in the usual minimum-phase sense) only if $e(n)$ makes no significant contribution to the spectral peaks of the speech spectrum [which is another way to say that $e(n)$ is "not very correlated"].

An example of the above occurs in speech analysis when it is desired to model only the vocal-tract portion of the speech system, which would typically be of order 8 to 10. The entire speech system includes other dynamics, including the glottal system, which typically increase the system order to, say, 14. We may interpret the eighth-order analysis as either the attempt to identify an eighth-order subsystem of a truly fourteenth-order system, or as the attempt to identify a truly eighth-order system (vocal tract) driven by a correlated input (glottal waveform). In either case, the result will be a good minimum-phase approximation to the vocal-tract system, only if the neglected anomaly does not make a significant spectral contribution. Roughly speaking, this will be true for relatively loud,
low-pitched phonation in which the vocal waveform is a relatively uncorrelated, low duty-cycle pulse train.

Our final "nonideal" case is only a slight variation on the above: the case in which $e(n)$ is correlated but has no z-transform. Indeed, this is technically the case with the voiced input to the speech model over the infinite time line. Note, however, that the spectral result of Interpretive Problem 5.5 is still valid, since it requires only that $e(n)$ [hence $s(n)$] have a power density spectrum. As above, we conclude here that only if $\Gamma_e(\omega)$ makes an insignificant contribution to the peakiness of the overall speech spectrum will $\hat{\Theta}(z)$ be a good representation of $\Theta_{\min}(z)$. Indeed, this is the case as $P \to \infty$ (pitch frequency becomes small) as discussed above. However, as $P$ becomes small, the pitch pulses begin to significantly distort the spectrum, and hence the identification.

It is instructive to examine this distortion phenomenon from another point of view. It is often pointed out that, if $\hat{\theta}(n)$ represents the IR of the LP model, $\hat{\Theta}(z)$, then

$$r_{\hat{\theta}}(\eta) = r_s(\eta), \quad \text{for } \eta = 1, 2, \ldots, M. \quad (5.46)$$

In other words, the autocorrelation of the LP model IR matches that of the speech for the first $M$ lags. This is easily shown as follows: According to (5.3),

$$\hat{\theta}(n) = \sum_{i=1}^{M} \hat{a}(i)\,\hat{\theta}(n-i) + \hat{\Theta}_0\,\delta(n) \quad (5.47)$$

so that, for any $\eta > 0$,

$$\mathcal{E}\{\hat{\theta}(n)\,\hat{\theta}(n-\eta)\} = \sum_{i=1}^{M} \hat{a}(i)\,\mathcal{E}\{\hat{\theta}(n-i)\,\hat{\theta}(n-\eta)\} + \hat{\Theta}_0\,\mathcal{E}\{\delta(n)\,\hat{\theta}(n-\eta)\} \quad (5.48)$$

or [recall (5.10)]

$$r_{\hat{\theta}}(\eta) = \sum_{i=1}^{M} \hat{a}(i)\, r_{\hat{\theta}}(\eta - i), \quad \eta = 1, 2, \ldots, M. \quad (5.49)$$

Comparing this result with (5.22) reveals that (5.46) is true.

Now, in the case in which the excitation pitch period, $P$, is small, it is not difficult to demonstrate that $r_s(\eta)$ [hence $r_{\hat{\theta}}(\eta)$] is a severely aliased version of $r_\theta(\eta)$ (see Problem 5.9). This means that $|\hat{\Theta}(\omega)|^2$ is only mathematically obliged to match $|\Theta(\omega)|^2$ at a few "undersampled" (widely spaced) points in the frequency domain. These points represent the widely spaced harmonic frequencies of the excitation, the only information about $\Theta(\omega)$ that can possibly be represented in the speech sequence.

5.2.3 Summary and Further Discussion

We conclude that when the input to the speech production system is an orthogonal sequence (and only then), and when the model order is correctly chosen ($M = I$), then $\hat{\Theta}(z)$ represents the minimum-phase component of $\Theta(z)$, $\Theta_{\min}(z)$, everywhere in the z-plane. The condition of orthogonal $e(n)$ is met for the unvoiced case. We have argued that for sufficiently low pitch, the voiced input may be considered practically orthogonal. Therefore, the LP model will indeed represent the minimum-phase component of the speech.

Some points about the temporal properties of the model signals should be emphasized. When $\hat{\Theta}(z)$ is used for voiced speech reconstruction, the synthesized speech will not be a temporal replication of the speech sequence; rather, it is not hard to demonstrate that if $s'(n)$ represents the synthesized speech, then¹¹

$$S'(\omega) = \frac{\hat{\Theta}_0}{\Theta_0}\, S(\omega)\, e^{-j\varphi_{\mathrm{ap}}(\omega)}, \quad (5.50)$$

where $e^{j\varphi_{\mathrm{ap}}(\omega)}$ is the phase characteristic associated with the all-pass filter. Ideally, the synthesized speech will be a phase-altered, scaled version of $s(n)$. The problem of making synthesized speech sound natural is a complex issue, and it is not the point of this discussion to address that problem. The point is, rather, that speech that is resynthesized using even "ideal" LP parameters will not replicate the original waveform. However, all other factors being "ideal" (perfect transitions among frames, perfect pitch detection, and so on), it is likely the case that the phase scattering would have a negligible effect on the naturalness of the synthetic speech if appropriate scaling is used (see Sec. 5.3.4).

Second, we should take note of the temporal properties of the prediction residual when the speech is inverse filtered. When the IF, $\hat{A}(z)$, is used to filter the speech, the output, $\hat{e}(n)$, is nominally an estimate of the system input, $e(n)$. However, it is easily shown that, in the ideal case,

$$\hat{E}(z) = \Theta_0\, E'(z), \quad (5.51)$$

so that the residual will be an estimate of the all-pass filter output, a phase-altered version of the true input. Therefore, the residual might not be a good representation of the expected input pulse train in the voiced case, as the various frequency components in the waveform can be "smeared" in time due to the phase delays in $\Theta_{\mathrm{ap}}(z)$ that remain in $E'(z)$ and are not present in $E(z)$. Algorithms that use the residual directly for pitch detection, for example, do not always yield good performance, at least in part due to this phenomenon.

When $M$ is underestimated [or, equivalently, when $e(n)$ is not orthogonal], phase and temporal relationships between the LP model and the

¹¹The DTFT $S(\omega)$ does not theoretically exist here, but we can resort to the use of the "engineering DTFT" described near (1.30).
288 Ch. 5 / Linear Prediction Analysis
5.2 / How Good Is the lP Model? 289
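The autocorrelation-matching property expressed by (5.47) and (5.49) is easy to check numerically. The following is a minimal NumPy sketch of ours (not code from the text; names are illustrative): it solves the long-term normal equations for a test signal, builds the impulse response of the resulting all-pole model using the energy-matching gain, and confirms that the model's autocorrelation reproduces the signal autocorrelation for lags 0 through M.

```python
import numpy as np

def biased_autocorr(x, maxlag):
    """Biased autocorrelation estimates r(0), ..., r(maxlag)."""
    N = len(x)
    return np.array([x[:N - k] @ x[k:] / N for k in range(maxlag + 1)])

M = 8
rng = np.random.default_rng(0)
# test signal: white noise colored by an arbitrary stable all-pole recursion
x = rng.standard_normal(16000)
for n in range(2, len(x)):
    x[n] += 1.2 * x[n - 1] - 0.6 * x[n - 2]

r = biased_autocorr(x, M)
R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])  # Toeplitz
a = np.linalg.solve(R, r[1:])     # LP normal equations
G2 = r[0] - a @ r[1:]             # squared gain (residual energy)

# impulse response of the LP model sqrt(G2) / (1 - sum_i a(i) z^-i)
L = 20000
theta = np.zeros(L)
for n in range(L):
    acc = np.sqrt(G2) if n == 0 else 0.0
    for i in range(1, M + 1):
        if n >= i:
            acc += a[i - 1] * theta[n - i]
    theta[n] = acc

# deterministic autocorrelation of the (decayed) model impulse response
r_model = np.array([theta[:L - k] @ theta[k:] for k in range(M + 1)])
assert np.allclose(r_model, r, rtol=1e-4)  # matches r(0)..r(M)
```

Note that the match holds only for the first M+1 lags; beyond lag M the model autocorrelation simply continues the recursion (5.49), which is one way to see the "aliasing" discussed above.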
true system become very unclear. Formally, the problem is the difficult one of determining how well a high-order polynomial is approximated by a polynomial of lower order. In this case we need to resort to spectral arguments alone. We can anticipate that the LP model will approximate the (scaled) magnitude spectrum as well as possible, concentrating on the peaks in the spectrum first. It is also to be noted that, in spite of the total ignorance of the phase throughout these spectral arguments, the resulting model can still be shown to be a minimum-phase filter[15] (Kay and Pakula, 1983; Lang and McClellan, 1980; Burg, 1975). Hence, in spite of our loss of ability to make phase comparisons directly, we can still assert that the LP model is "attempting" to represent the minimum-phase part of the true system.

[15] This is typically done by showing that a pole outside the unit circle can be reflected inside to decrease the mean squared error. Thus the spectrum is preserved and the filter becomes minimum-phase.

FIGURE 5.7. In (a) and (b) the effects of increasing model order on the prediction residual are observed for the vowel /a/. In (a) M = 8, while in (b) M = 14. The increased model order removes more correlation from the speech. [Axes: in (a) and (b), prediction residual versus time, n (norm-sec); in (c), total residual energy versus number of LP parameters, M, with separate voiced and unvoiced curves.]

Finally, it should be noted that, in practice, the choice of M can be made through a study of r_{\hat{e}}(0), the total prediction residual energy [i.e., the value of the integral in (5.34)] as a function of M (see Fig. 5.7) (Chandra and Lin, 1974). If we assume e(n) to be orthogonal, when M
reaches \infty, the integral takes the value \hat{\Theta}_0^2, apparently the minimum possible error energy. For M < \infty, r_{\hat{e}}(0) is a monotonically nonincreasing function of M, which is easily seen as follows: Suppose that the identified parameters \hat{a}(i), i = 1, 2, \ldots, M, result in error energy r_{\hat{e}}^{M}(0). Then the set \hat{a}'(i), i = 1, 2, \ldots, M+1, which are designed to minimize r_{\hat{e}}^{M+1}(0), can always do as well (at minimizing error) as the smaller set, just by letting \hat{a}'(i) = \hat{a}(i), i \le M, and \hat{a}'(M+1) = 0. In this case, r_{\hat{e}}^{M+1}(0) = r_{\hat{e}}^{M}(0).

5.3 Short-Term LP Analysis

We now turn to short-term LP analysis. We will discover that our expertise in short-term processing will allow us to quickly develop the needed practical LP analysis methods.

5.3.1 Autocorrelation Method

Our objective here will be to use the knowledge and intuition gained from the long-term LP problem, combined with our understanding of short-term analysis, to produce an analogous short-term LP analysis. There are two well-known and widely used short-term LP solutions: the "autocorrelation method" and the "covariance method." Although in many ways the latter solution is more analogous to the long-term problem, we will follow the customary procedure of presenting the autocorrelation method first. This method has two main virtues with respect to the covariance method: It employs a "friendlier" (one-argument) estimator for the autocorrelation sequence, and it is always guaranteed (theoretically) to produce a stable LP model (Markel and Gray, 1976).

Desired is an estimate of the LP parameters on the N data points ending at time m: s(m-N+1), s(m-N+2), \ldots, s(m). Let us call the vector of parameter estimates[16] \hat{\mathbf{a}}(m), neglecting the conventional subscript s as above, since it is obvious that these parameters are associated with the sequence s(n). Recall also that \hat{\mathbf{a}}(m) is a vector of M parameters \hat{a}(i; m), i = 1, 2, \ldots, M, again omitting the subscript of \hat{a}_s(i; m).

Recall the LP normal equations given in (5.22). Even if we had a complete record of s(n), n \in (-\infty, \infty), the long-term solution would produce a terrible estimate of the parameters related to the interval of interest, since the autocorrelation would contain information from many different phones. A natural estimator for the LP parameters that uses only the specified data points is the one that results upon insertion of one of the short-term estimators for autocorrelation discussed in Section 4.3.1 [for generality, say r_s(\eta; m)]:

    \sum_{i=1}^{M} \hat{a}(i; m)\, r_s(\eta - i; m) = r_s(\eta; m), \quad \eta = 1, 2, \ldots, M.   (5.52)

This set of equations has the matrix form

    \mathbf{R}_s(m)\,\hat{\mathbf{a}}(m) = \mathbf{r}_s(m),   (5.53)

in which \mathbf{R}_s(m), often called the autocorrelation matrix in the LP literature, is a Toeplitz operator (since all the elements along any diagonal are equal). The solution of equation (5.52) [or (5.53)] is referred to as the autocorrelation method solution.

INTERPRETIVE PROBLEM 5.6 (AUTOCORRELATION PROBLEM) Let f(n; m) be the frame of speech f(n; m) = s(n) w(m - n). Find the linear predictor of f(n; m) which minimizes the total squared error in prediction over all n. (Note that the error is to be minimized over all n, not just the range selected by the window.)

Solution. First let us define some important notation. We let \hat{f}(n; m) indicate the predicted value of the point f(n; m), and \hat{e}(n; m) be the prediction error[17] at n,

    \hat{e}(n; m) \stackrel{\rm def}{=} f(n; m) - \hat{f}(n; m).   (5.54)

We also let \hat{\xi}(m) denote the total squared prediction error over all n related to the analysis of the frame ending at time m. (We have introduced an extra factor of N, the window length, for convenience.) In these terms, we want to minimize

    \hat{\xi}(m) = \frac{1}{N} \sum_{n=-\infty}^{\infty} \hat{e}^2(n; m) = \frac{1}{N} \sum_{n=-\infty}^{\infty} \left[ f(n; m) - \sum_{i=1}^{M} \hat{a}(i; m)\, f(n-i; m) \right]^2.   (5.55)

[16] It is important for the reader to think carefully about notation. We will make every effort to keep the notational conventions consistent with those developed in our earlier work.

[17] The reader should carefully ponder the fact that \hat{e}(n; m) is not a frame of the long-term error, called \hat{e}(n) above; that is, \hat{e}(n; m) \ne \hat{e}(n)\, w(m-n).
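The autocorrelation method just posed can be carried out in a few lines. The sketch below is ours, not the text's (the Hamming window and all names are illustrative choices): it forms the frame f(n; m), computes the one-argument short-term autocorrelation, solves (5.52), and checks the stability guarantee by examining the poles of the resulting model.

```python
import numpy as np

def autocorrelation_lp(s, m, N, M):
    """Autocorrelation-method LP estimate a(1;m)..a(M;m) from the frame
    f(n;m) = s(n) w(m-n), using a length-N Hamming window ending at time m."""
    frame = s[m - N + 1 : m + 1] * np.hamming(N)        # f(n; m)
    r = np.array([frame[:N - k] @ frame[k:] / N for k in range(M + 1)])
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    a = np.linalg.solve(R, r[1:])                        # equations (5.52)
    return a, r

rng = np.random.default_rng(2)
s = rng.standard_normal(4000)
for n in range(2, len(s)):
    s[n] += 1.0 * s[n - 1] - 0.4 * s[n - 2]              # synthetic "speech"

a, r = autocorrelation_lp(s, m=3999, N=256, M=10)
roots = np.roots(np.concatenate(([1.0], -a)))            # poles of the LP model
assert np.all(np.abs(roots) < 1.0)   # stable (minimum-phase), as guaranteed
```

The stability check illustrates the second "virtue" cited above: with a positive-definite Toeplitz autocorrelation matrix, the model poles fall strictly inside the unit circle.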
Differentiating \hat{\xi}(m) with respect to \hat{a}(\eta; m) and setting the result to zero, we obtain

    \frac{\partial \hat{\xi}(m)}{\partial \hat{a}(\eta; m)} = -\frac{2}{N} \sum_{n=-\infty}^{\infty} \left[ f(n; m) - \sum_{i=1}^{M} \hat{a}(i; m)\, f(n-i; m) \right] f(n-\eta; m) = 0.   (5.56)

Upon dividing by 2, moving the summing operation (including the scale factor 1/N) over n across terms, and recalling the definition of f(n; m), it is clear that this expression is equivalent to (5.52) if the short-term autocorrelation is computed using (4.35) and the window specified in the problem.

5.3.2 Covariance Method

Insertion of the two-argument short-term estimator \varphi_s(\eta, \nu; m) into the normal equations instead yields

    \sum_{i=1}^{M} \hat{a}(i; m)\, \varphi_s(\eta, i; m) = \varphi_s(\eta, 0; m), \quad \eta = 1, 2, \ldots, M.   (5.57)

This set of equations has the matrix form

    \Phi_s(m)\,\hat{\mathbf{a}}(m) = \boldsymbol{\varphi}_s(m) \;\Rightarrow\; \hat{\mathbf{a}}(m) = \Phi_s^{-1}(m)\,\boldsymbol{\varphi}_s(m),   (5.58)

where \Phi_s(m) is the M \times M matrix with (\eta, \nu) element \varphi_s(\eta, \nu; m), and \boldsymbol{\varphi}_s(m) is the M-vector with \eta th element \varphi_s(\eta, 0; m). Note that, unlike the autocorrelation method, this estimator uses data points outside the range n \in [m-N+1, m] in computing the \varphi_s(\eta, \nu; m). The significance of this range, however, is that it comprises the data over which the technique minimizes the prediction error energy, as shown by the following problem.

INTERPRETIVE PROBLEM 5.7 (COVARIANCE PROBLEM) Find the linear predictor of s(n) that minimizes the mean squared error on the range[18] n \in [m-N+1, m],

    \xi(m) = \frac{1}{N} \sum_{n=m-N+1}^{m} \left[ s(n) - \hat{s}(n) \right]^2 = \frac{1}{N} \sum_{n=m-N+1}^{m} e^2(n; m).   (5.60)

In Problem 5.10, we will show that the solution to Interpretive Problem 5.7 is (5.57) and also explore the relationship to the orthogonality principle discussed above. This problem represents the usual method for introducing the covariance approach.

Note that a frame of speech is not created first in this problem. The prediction takes place with respect to the unaltered speech. Whereas the autocorrelation method windows the signal and then seeks to minimize the error in predicting the windowed signal over all n, the covariance method seeks to minimize the error in prediction of the unmodified speech only on the specified range of points.

Assume that \Phi_s^{-1}(m) exists[19] and that \mathbf{s}(m-N+1) = \mathbf{0} (recall that this vector represents the M initial conditions on the minimization window). Then the covariance solution is exact if and only if \varphi_e(0, \nu; m) = K\delta(\nu), \nu = 0, 1, \ldots, N-1, where K is any finite constant.

[18] Note that the error sequence here is not the same sequence as the one resulting in the autocorrelation method over the same window [cf. (5.54)].

[19] The necessary and sufficient conditions for the existence of the inverse are found in (Deller, 1984).
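The covariance method of (5.57) and (5.58) is equally short to implement. The sketch below is ours (the AR(2) test signal and all names are illustrative, not from the text): it computes \varphi_s(\eta, \nu; m) directly from the unwindowed speech, using the M samples preceding the minimization range, and confirms that with ample data the estimate approaches the coefficients of the generating system.

```python
import numpy as np

def covariance_lp(s, m, N, M):
    """Covariance-method LP estimate: minimize the prediction error of the
    unwindowed speech on n in [m-N+1, m] (uses M samples before that range)."""
    n = np.arange(m - N + 1, m + 1)
    def phi(eta, nu):
        return (s[n - eta] @ s[n - nu]) / N              # phi_s(eta, nu; m)
    Phi = np.array([[phi(i, j) for j in range(1, M + 1)] for i in range(1, M + 1)])
    rhs = np.array([phi(i, 0) for i in range(1, M + 1)])
    return np.linalg.solve(Phi, rhs)                     # equations (5.58)

rng = np.random.default_rng(3)
e = rng.standard_normal(2000)
s = np.copy(e)
for n in range(2, len(s)):
    s[n] = e[n] + 1.1 * s[n - 1] - 0.57 * s[n - 2]       # known AR(2) system

a = covariance_lp(s, m=1999, N=400, M=2)
assert np.allclose(a, [1.1, -0.57], atol=0.15)           # near the true values
```

Unlike the autocorrelation sketch earlier, no window is applied here; the error is minimized only on the specified range, exactly as the surrounding text describes.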
gets large. We note that although a strict theorem like this does not exist for the autocorrelation method, intuitively one would expect a practically similar result.

A related result, which is useful practically and has no parallel in the long-term case, is as follows:

THEOREM 5.4 Let \Theta(z) be a minimum-phase speech system. Assume that \Phi_s^{-1}(m) exists and that \mathbf{s}(m-N+1) \ne \mathbf{0} (nonzero initial conditions). Then the covariance solution is exact if e(n) = 0, n \in [m-N+1, m]. This condition is not theoretically necessary, but is practically so.

According to this theorem there must be no excitation on the minimization window for exact solution in the covariance case when the initial conditions are nonzero.

The Covariance Method and Classical Least Squares Problems

Note: In this section, it will be useful to assume that the analysis window covers the range n = 1, 2, \ldots, N, so that m = N and m - N + 1 = 1. There is no loss of generality with this choice, and it will help us to set up some equations that will be useful in the future when we make this assumption again.

Many useful insights and practical results can be obtained by recognizing the relationship between the covariance method and classical least squares (least MSE) solutions of an overdetermined system of equations. In fact, the covariance method fits exactly the mold of the conventional least squares estimation problem: Given a series of N \ge M observations, s(1), s(2), \ldots, s(N), and a set of related vectors, \mathbf{s}(1), \mathbf{s}(2), \ldots, \mathbf{s}(N), find the linear predictor (of dimension M) relating them, say \hat{\mathbf{a}}(N), which minimizes the mean squared error (sample variance)

    \xi(N) = \frac{1}{N} \sum_{n=1}^{N} \left[ s(n) - \hat{s}(n) \right]^2,   (5.61)

where \hat{s}(n) is the prediction of the nth observation,

    \hat{s}(n) \stackrel{\rm def}{=} \sum_{i=1}^{M} \hat{a}(i; N)\, s(n-i).   (5.62)

In vector-matrix form we can state the problem: Find the solution, \hat{\mathbf{a}}(N), to the overdetermined system of equations[20]

    \mathbf{S}(N)\,\hat{\mathbf{a}}(N) = \bar{\mathbf{s}}(N),   (5.63)

which minimizes

    \frac{1}{N} \left\| \mathbf{S}(N)\,\hat{\mathbf{a}}(N) - \bar{\mathbf{s}}(N) \right\|^2,   (5.64)

where

    \bar{\mathbf{s}}^T(N) \stackrel{\rm def}{=} [s(1)\;\; s(2)\;\; \cdots\;\; s(N)],   (5.65)

    \mathbf{S}^T(N) \stackrel{\rm def}{=} [\mathbf{s}(1)\;\; \mathbf{s}(2)\;\; \cdots\;\; \mathbf{s}(N)],   (5.66)

and \|\cdot\| indicates the l_2 norm. Posed in this sense, the solution is given by [see, e.g., (Golub and Van Loan, 1989)]

    \frac{1}{N}\, \mathbf{S}^T(N)\, \mathbf{S}(N)\, \hat{\mathbf{a}}(N) = \frac{1}{N}\, \mathbf{S}^T(N)\, \bar{\mathbf{s}}(N).   (5.67)

With a little effort, one can show that

    \Phi_s(N) = \frac{1}{N}\, \mathbf{S}^T(N)\, \mathbf{S}(N)   (5.68)

and

    \boldsymbol{\varphi}_s(N) = \frac{1}{N}\, \mathbf{S}^T(N)\, \bar{\mathbf{s}}(N),   (5.69)

so that (5.67) is precisely the covariance solution (see Problem 5.11).

Two points are worth noting. First, statisticians have long used the technique of weighting certain observations more heavily than others in this problem by solving it subject to the constraint that the weighted squared error, say,

    \xi'(N) = \frac{1}{N} \sum_{n=1}^{N} \lambda(n) \left[ s(n) - \hat{s}(n) \right]^2 = \frac{1}{N} \sum_{n=1}^{N} \lambda(n)\, e^2(n; N),   (5.70)

be minimized. The known solution is given by

    \frac{1}{N}\, \mathbf{S}^T(N)\, \Lambda(N)\, \mathbf{S}(N)\, \hat{\mathbf{a}}(N) = \frac{1}{N}\, \mathbf{S}^T(N)\, \Lambda(N)\, \bar{\mathbf{s}}(N),   (5.71)

where \Lambda(N) is a diagonal matrix whose ith diagonal element is \lambda(i). Not surprisingly, this equation is exactly equivalent to the result we would have obtained in our pursuit of the covariance normal equations if we had started with a weighted error criterion, (5.70), rather than (5.60). In this case, we would have concluded with normal equations of the form

    \Phi_s'(N)\,\hat{\mathbf{a}}(N) = \boldsymbol{\varphi}_s'(N) \;\Rightarrow\; \hat{\mathbf{a}}(N) = \left[ \Phi_s'(N) \right]^{-1} \boldsymbol{\varphi}_s'(N),   (5.72)

[20] The reason for the bar over the vector \bar{\mathbf{s}}(N) is to distinguish it from the notation \mathbf{s}(N), which indicates the vector of past M speech values at time N.
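The equivalence claimed in (5.68) and (5.69) is straightforward to verify numerically. A minimal NumPy sketch of ours (not from the text): build S(N) and \bar{s}(N) for arbitrary data, form the covariance normal equations through (5.68) and (5.69), and compare the answer with a direct least squares solve of the overdetermined system (5.63).

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 4, 300
x = rng.standard_normal(N + M)          # holds s(1-M), ..., s(N) with offset M-1

def samp(n):                            # s(n) for n = 1-M, ..., N
    return x[n + M - 1]

# S(N): row n is s(n)^T = [s(n-1), ..., s(n-M)];  sbar(N) = [s(1), ..., s(N)]^T
S = np.array([[samp(n - i) for i in range(1, M + 1)] for n in range(1, N + 1)])
sbar = np.array([samp(n) for n in range(1, N + 1)])

Phi = S.T @ S / N                       # (5.68): Phi_s(N) = (1/N) S^T S
phi = S.T @ sbar / N                    # (5.69)
a_cov = np.linalg.solve(Phi, phi)       # covariance normal equations (5.67)
a_ls, *_ = np.linalg.lstsq(S, sbar, rcond=None)  # least squares on (5.63)
assert np.allclose(a_cov, a_ls)
```

This is the practical payoff of the identification: any well-tested least squares routine can be used to compute the covariance solution.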
where \Phi_s'(N) and \boldsymbol{\varphi}_s'(N) are the weighted counterparts to the matrices in (5.58); that is, \Phi_s'(N) is the matrix with (\eta, \nu) entries

    \varphi_s'(\eta, \nu; N) = \frac{1}{N} \sum_{n=1}^{N} \lambda(n)\, s(n-\eta)\, s(n-\nu),   (5.73)

and \boldsymbol{\varphi}_s'(N) the M-vector with \nu th element \varphi_s'(0, \nu; N). For future reference, let us note explicitly the relations

    \Phi_s'(N) = \frac{1}{N}\, \mathbf{S}^T(N)\, \Lambda(N)\, \mathbf{S}(N)   (5.74)

and

    \boldsymbol{\varphi}_s'(N) = \frac{1}{N}\, \mathbf{S}^T(N)\, \Lambda(N)\, \bar{\mathbf{s}}(N).   (5.75)

We will call this technique the weighted (short-term) covariance method. The weighted covariance method offers the advantage of emphasizing a datum that might be desirable in some sense by use of a large \lambda(n), or of rejecting an undesirable point. Although there is no general theory for selecting weights to improve the quality of the LP estimate, the weighted strategy has been used for adaptive identification (Deller and Hsu, 1987), in algorithms for glottal waveform deconvolution and formant estimation (discussed below) (Deller and Picache, 1989; Deller and Luk, 1987; Larar et al., 1985; Laebens and Deller, 1983; Veeneman and Bement, 1985; Wong et al., 1979), and in algorithms that significantly improve computational efficiency (Deller and Luk, 1989). In turn, these algorithms find application in aspects of coding and recognition, which are subjects of future chapters.[21]

Second, noting this link between the covariance method and the conventional least squares problem opens the door to the use of many conventional methods of solution that could prove useful in speech processing. In particular, the development of two useful temporally recursive forms of the covariance solution based on these ideas will be found in the next section.

Our next order of business is to study some of the commonly used methods for solving the autocorrelation and covariance equations and their practical considerations.

5.3.3 Solution Methods

Whether the autocorrelation or covariance method is employed, we are faced with the solution of a linear vector-matrix problem of the form Ax = b, in which A is a square matrix and x is the vector sought. In principle, one can use any of a number of well-known techniques that can be found in basic linear algebra books [e.g., (Noble, 1969; Golub and Van Loan, 1989)]. We could even, for example, use some "brute force" method for inverting the matrix A, then forming the product A^{-1}b. A more rational thing to do would be to compute an "LU" decomposition of the matrix A using Gaussian elimination. Any efficient method for solving a system of equations, however, will take advantage of the special structure inherent in the vector-matrix problem. We have noted that both the covariance and autocorrelation methods produce symmetric matrices, while the autocorrelation matrix is additionally Toeplitz. Another very important property in either case, which is true in all but the rarest of practical circumstances (Deller, 1984), is the positive definiteness of the matrix. Our task here is to take advantage of these properties in creating efficient algorithms for the solution of the two methods. We discuss some basic methods which are popular in speech processing, noting that there are many variations on the solutions presented here. There are also many other approaches from classical numerical algebra that could be applied; new architectures and algorithms might focus attention on one or more of these in the future. We will describe an instance of this latter effect below, which was driven by parallel processing architectures in the 1980s. Intelligent use of any of these methods requires that we be aware of their numerical properties, an issue that would take us too far afield in this discussion. However, the reader might wish to refer to any of a number of excellent books that treat this subject [e.g., (Golub and Van Loan, 1989)] before committing a speech processing project to a particular algorithm.

Levinson-Durbin Recursion (Autocorrelation)

Let us begin with the autocorrelation problem, which has the most special structure. In 1947, Levinson (1947) published an algorithm for solving the problem Ax = b in which A is Toeplitz, symmetric, and positive definite, and b is arbitrary.[22] Of course, the autocorrelation equations are exactly of this form, with b having a special relationship to the elements of A. In 1960, Durbin (1960) published a slightly more efficient algorithm for this special case. Durbin's algorithm is often referred to as the Levinson-Durbin (L-D) recursion by speech processing engineers.

The L-D recursion is a recursive-in-model-order solution for the autocorrelation equations. By this we mean that the solution for the desired order-M model is successively built up from lower-order models, beginning with the "0th-order predictor," which is no predictor at all.

From the vector-matrix form of the autocorrelation equations in (5.53) we can write

    -\mathbf{R}_s(m)\,\mathbf{a}^M(m) + \mathbf{r}_s(m) = \mathbf{0}.   (5.76)

[21] In some cases, "weighting" can simply be "binary" in the sense that points are either included in the estimate or they are not. For other time-selective strategies, the reader is referred to (... et al., 1987).

[22] A reformulation of Levinson's algorithm is found in (Robinson, 1967).
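The weighted covariance method of (5.70) through (5.75) can be sketched in the same least squares framework. The code below is ours (random weights are used purely for illustration): it forms the weighted normal equations via (5.74) and (5.75) and confirms that they give the same answer as explicitly minimizing the weighted squared error (5.70) with a square-root-weighted least squares solve.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 3, 200
x = rng.standard_normal(N + M)
S = np.array([[x[n - i] for i in range(1, M + 1)] for n in range(M, M + N)])
sbar = x[M:M + N]

lam = rng.uniform(0.1, 2.0, size=N)     # weights lambda(n); 0 would reject a point
Lam = np.diag(lam)

Phi_w = S.T @ Lam @ S / N               # (5.74)
phi_w = S.T @ Lam @ sbar / N            # (5.75)
a_w = np.linalg.solve(Phi_w, phi_w)     # weighted normal equations (5.71)/(5.72)

# same answer from explicitly minimizing the weighted squared error (5.70)
w = np.sqrt(lam)
a_ls, *_ = np.linalg.lstsq(S * w[:, None], sbar * w, rcond=None)
assert np.allclose(a_w, a_ls)
```

Setting an entry of lam to zero would delete that sample from the estimate, which is the "binary" weighting mentioned in the footnote.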
Be careful to recall that the m's here are time indices indicating the end of the window over which the analysis is being performed. A typical element of \mathbf{R}_s(m), for example, is r_s(\eta; m). A bit of new notation here is the added superscript M on the parameter vector, \mathbf{a}^M(m), used to indicate an Mth-order solution. Since the L-D recursion is recursive in the model order, it is necessary in the following development to add superscripts of this form to several variables to keep track of the order of the solution with which they are associated. Quantities without such superscripts are associated with the Mth (final-order) solution. To avoid confusion, any quantity that needs to be raised to a power will be written in brackets first, for example, [\kappa(3; m)]^2.

As we did above, let us denote by \hat{e}(n; m) the prediction residual sequence due to the autocorrelation estimate \hat{\mathbf{a}}(m) [see (5.54)]. With this definition, it is not difficult to demonstrate that

    r_s(0; m) - \sum_{i=1}^{M} \hat{a}^M(i; m)\, r_s(i; m) = \frac{1}{N} \sum_{n=-\infty}^{\infty} \left[ \hat{e}(n; m) \right]^2.   (5.77)

It is clear that the term on the right can be interpreted as the total energy (scaled by a factor 1/N) in the prediction residual sequence, based on the inverse filter of order M. As we did in (5.55), let us write this quantity as \hat{\xi}^M(m) (except that this time we have added the superscript) and rewrite (5.77) as

    \hat{\xi}^M(m) = r_s(0; m) - \sum_{i=1}^{M} \hat{a}^M(i; m)\, r_s(i; m).   (5.78)

Equation (5.76) may likewise be written

    \left[ \mathbf{r}_s(m) \;\; \mathbf{R}_s(m) \right] \begin{bmatrix} 1 \\ -\mathbf{a}^M(m) \end{bmatrix} = \mathbf{0}.   (5.79)

Appending the energy relation (5.78) yields an augmented Toeplitz system. For M = 3, for example, the augmented equations are

    \begin{bmatrix}
    r_s(0;m) & r_s(1;m) & r_s(2;m) & r_s(3;m) \\
    r_s(1;m) & r_s(0;m) & r_s(1;m) & r_s(2;m) \\
    r_s(2;m) & r_s(1;m) & r_s(0;m) & r_s(1;m) \\
    r_s(3;m) & r_s(2;m) & r_s(1;m) & r_s(0;m)
    \end{bmatrix}
    \begin{bmatrix} 1 \\ -a^3(1;m) \\ -a^3(2;m) \\ -a^3(3;m) \end{bmatrix}
    =
    \begin{bmatrix} \hat{\xi}^3(m) \\ 0 \\ 0 \\ 0 \end{bmatrix}.   (5.82)

Note that the augmented matrix, \mathbf{R}_s^3(m), say, is Toeplitz. Also note that the matrix \mathbf{R}_s^2(m) (associated with the order-two solution) is embedded in \mathbf{R}_s^3(m) in two places. It is the 3 \times 3 matrix obtained by removing either the first row and column, or the fourth row and column, of \mathbf{R}_s^3(m). This consequence of the Toeplitz structure is central to our development.

Now let us assume for the moment that we can write the augmented order-three solution vector in terms of the order-two solution as follows:

    \begin{bmatrix} 1 \\ -a^3(1;m) \\ -a^3(2;m) \\ -a^3(3;m) \end{bmatrix}
    =
    \begin{bmatrix} 1 \\ -a^2(1;m) \\ -a^2(2;m) \\ 0 \end{bmatrix}
    - \kappa(3; m)
    \begin{bmatrix} 0 \\ -a^2(2;m) \\ -a^2(1;m) \\ 1 \end{bmatrix},   (5.83)

where \kappa(3; m) is a constant to be determined [and will be known as a reflection coefficient]. Plugging this result into the augmented equations and using the embedded order-two system, one arrives at (5.84), in which q denotes the resulting cross term. Equation (5.84) implies two scalar equations:

    \hat{\xi}^2(m) - \kappa(3; m)\, q = \hat{\xi}^3(m)   (5.86)

    q - \kappa(3; m)\, \hat{\xi}^2(m) = 0 \;\Rightarrow\; \kappa(3; m) = \frac{q}{\hat{\xi}^2(m)}.   (5.87)

We now have the necessary tools to move from order-two parameters to order-three parameters. Given \hat{\xi}^2(m), we can compute r_s(\eta; m), \eta = 1, 2, 3, from the data, then \kappa(3; m) using (5.89). Then we can compute the order-three LP parameters using (5.90) and (5.91). Finally, we can compute \hat{\xi}^3(m) using (5.88) in case we want to move on to the next step to compute the order-four LP parameters. The general algorithm is just a straightforward generalization of this specific process and is given in Fig. 5.8.
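The order-update relation (5.83) can be checked directly on a small example. The sketch below is ours (the autocorrelation values are arbitrary, chosen only to give a positive-definite Toeplitz matrix): it solves the order-two and order-three systems exactly and confirms that the order-three solution is the order-two vector combined with its reverse through the single coefficient \kappa(3; m), which appears as the last coefficient of the new solution.

```python
import numpy as np

# autocorrelation values r(0)..r(3) of some stationary sequence (positive definite)
r = np.array([1.0, 0.7, 0.4, 0.15])

def lp_solve(M):
    """Exact LP solution of order M from the Toeplitz normal equations."""
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    return np.linalg.solve(R, r[1:M + 1])

a2, a3 = lp_solve(2), lp_solve(3)
kappa3 = a3[2]            # the last coefficient of the new order is kappa(3)

# equation (5.83): a3(i) = a2(i) - kappa3 * a2(3-i) for i = 1, 2
assert np.allclose(a3[:2], a2 - kappa3 * a2[::-1])
```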
There are several important features of the L-D recursion to be noted. We know from previous discussion that the sequence of average error energies should be nonincreasing,

    0 \le \hat{\xi}^{l}(m) \le \hat{\xi}^{l-1}(m) \le \hat{\xi}^{l-2}(m) \le \cdots \le \hat{\xi}^{0}(m).   (5.96)

If we define the normalized error energy at iteration l by the ratio \hat{\xi}^{l}(m)/\hat{\xi}^{0}(m), then it is clear that

    0 \le \hat{\xi}^{l}(m)/\hat{\xi}^{0}(m) \le 1   (5.97)

for any l. This quantity can be monitored at each step to detect numerical instabilities. It is also possible to watch either \hat{\xi}^{l}(m)/\hat{\xi}^{0}(m) or \hat{\xi}^{l}(m) to determine when a sufficient model order has been reached.

Similarly, we can use (5.95) and (5.97) to show that

    |\kappa(l; m)| \le 1 \quad \text{for all } l,   (5.98)

offering another check for numerical stability. As we noted above, these \kappa parameters are often called reflection coefficients because of their close relationship to the reflection coefficients in analog acoustic tube models of the vocal tract (recall Section 3.1.4). We will explore this issue below. Another interpretation, also to be studied in this section, leads to the name partial correlation, or parcor, coefficients for these parameters. These coefficients play an important role in speech analysis and coding applications, and serve as an alternative to the usual autocorrelation LP coefficients as a parametric representation for the speech. Clearly, the M \hat{a}(i; m) parameters can be obtained from the M reflection coefficients, as we have seen above. It is also not difficult to show that the procedure can be "reversed" to obtain the reflection coefficients from the LP parameters.

FIGURE 5.8. The Levinson-Durbin (L-D) recursion applied to the window n \in [m-N+1, m].

Initialization: l = 0
    \hat{\xi}^{0}(m) = scaled total energy in the "error" from an "order 0" predictor
                     = average energy in the speech frame f(n; m) = s(n) w(m-n)
                     = r_s(0; m)

Recursion: For l = 1, 2, \ldots, M,
1. Compute the lth reflection coefficient,

    \kappa(l; m) = \frac{1}{\hat{\xi}^{l-1}(m)} \left\{ r_s(l; m) - \sum_{i=1}^{l-1} \hat{a}^{l-1}(i; m)\, r_s(l-i; m) \right\}.   (5.92)
Closely related to the reflection coefficients are two sets of "transmission" parameters, the log area ratio (LAR) parameters, g(l; m), and the inverse sine (IS) parameters, \sigma(l; m). These parameters are primarily used in speech coding applications (see Section 7.4.5) and are given, respectively, by

    g(l; m) = \frac{1}{2} \log \frac{1 + \kappa(l; m)}{1 - \kappa(l; m)} = \tanh^{-1} \kappa(l; m), \quad l = 1, 2, \ldots, M,   (5.99)

and

    \sigma(l; m) = \frac{2}{\pi} \sin^{-1} \kappa(l; m), \quad l = 1, 2, \ldots, M.   (5.100)

These parameters are used in place of the reflection coefficients because, when a reflection coefficient has a magnitude near unity, the results have been found to be highly sensitive to quantization errors. These transformations warp the amplitude scale of the parameters to decrease this sensitivity. Note that the IS parameters have the advantage of remaining bounded in magnitude by unity,

    |\sigma(l; m)| < 1   (5.101)

for any l. Nevertheless, the LAR parameters seem to be more popular, and no significant difference in performance between the two sets has been found. A study of the quantization properties of these transmission parameters is found in (Viswanathan and Makhoul, 1975).

Lattice Structures Based on the L-D Recursion

The L-D Lattice. The L-D recursion leads to a lattice formulation of the inverse filter computation. Suppose that we let

    \hat{A}^{l}(z; m) \stackrel{\rm def}{=} 1 - \sum_{i=1}^{l} \hat{a}^{l}(i; m)\, z^{-i}.   (5.102)

Using Step 2 of the L-D recursion, which computes the \hat{a}^{l}(i; m)'s in terms of the \hat{a}^{l-1}(i; m)'s and \kappa(l; m), it is not difficult to show that (see Problem 5.16)

    \hat{A}^{l}(z; m) = \hat{A}^{l-1}(z; m) - \kappa(l; m)\, z^{-l}\, \hat{A}^{l-1}(z^{-1}; m).   (5.103)

We let \hat{e}^{l}(n; m) denote the error sequence on the window based on the order-l inverse filter,

    \hat{e}^{l}(n; m) = f(n; m) - \sum_{i=1}^{l} \hat{a}^{l}(i; m)\, f(n-i; m),   (5.104)

where, as usual, f(n; m) denotes the speech frame ending at time m, f(n; m) = s(n) w(m-n). If we employ the notation

    \hat{E}^{l}(z; m) \stackrel{\rm def}{=} \mathcal{Z}\{\hat{e}^{l}(n; m)\}   (5.105)

and

    F(z; m) \stackrel{\rm def}{=} \mathcal{Z}\{f(n; m)\},   (5.106)

where \mathcal{Z} denotes the z-transform operator, then using (5.102) through (5.104), we can write

    \hat{E}^{l}(z; m) = \hat{A}^{l-1}(z; m)\, F(z; m) - \kappa(l; m)\, z^{-l}\, \hat{A}^{l-1}(z^{-1}; m)\, F(z; m)
                     = \hat{E}^{l-1}(z; m) - \kappa(l; m)\, z^{-l}\, \hat{A}^{l-1}(z^{-1}; m)\, F(z; m).   (5.107)

It is clear that the first term on the right in this expression corresponds to \hat{e}^{l-1}(n; m), but what about the second? Let us define

    B^{l}(z; m) \stackrel{\rm def}{=} z^{-(l+1)}\, \hat{A}^{l}(z^{-1}; m)\, F(z; m).   (5.108)

This is close to the second term except that l has been incremented by unity and we have ignored the reflection coefficient, \kappa(l; m). Noting that

    \beta^{l}(n; m) \stackrel{\rm def}{=} \mathcal{Z}^{-1}\{B^{l}(z; m)\} = f(n-l-1; m) - \sum_{i=1}^{l} \hat{a}^{l}(i; m)\, f(n+i-l-1; m),   (5.109)

we see that this sequence amounts to the error in "backward prediction" of f(n-l-1; m) based on l future points. Backward prediction is illustrated in Fig. 5.9.

Now that we understand what B^{l} means, let us go back to (5.107) and write it in these terms,

    \hat{E}^{l}(z; m) = \hat{E}^{l-1}(z; m) - \kappa(l; m)\, B^{l-1}(z; m)   (5.110)

or, in the time domain,

    \hat{e}^{l}(n; m) = \hat{e}^{l-1}(n; m) - \kappa(l; m)\, \beta^{l-1}(n; m).   (5.111)

Also using (5.103) and (5.108), we obtain

    B^{l}(z; m) = z^{-(l+1)} \left[ \hat{A}^{l-1}(z^{-1}; m) - \kappa(l; m)\, z^{l}\, \hat{A}^{l-1}(z; m) \right] F(z; m)
                = z^{-1} \left[ B^{l-1}(z; m) - \kappa(l; m)\, \hat{E}^{l-1}(z; m) \right]   (5.112)

or

    \beta^{l}(n; m) = \beta^{l-1}(n-1; m) - \kappa(l; m)\, \hat{e}^{l-1}(n-1; m).   (5.113)

FIGURE 5.9. "Backward prediction" of the speech sequence. [The point of the frame f(n; m) at time n - l - 1 is "predicted" using l future values; horizontal axis: time, n (norm-sec).]
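The pair of recursions (5.111) and (5.113) can be exercised directly. The sketch below is ours (the random test frame and all names are illustrative): driven only by the reflection coefficients, it propagates the forward and backward errors stage by stage and confirms that the final forward error equals direct inverse filtering of the frame with \hat{A}^{M}(z; m).

```python
import numpy as np

def analysis_lattice(f, kappas):
    """Propagate (5.111)/(5.113) over the frame f; return the order-M
    forward prediction error sequence."""
    e = np.concatenate((f, np.zeros(len(kappas))))   # e^0(n) = f(n), padded
    b = np.concatenate(([0.0], e[:-1]))              # beta^0(n) = f(n-1)
    for k in kappas:
        e_new = e - k * b                                            # (5.111)
        b_new = (np.concatenate(([0.0], b[:-1]))
                 - k * np.concatenate(([0.0], e[:-1])))              # (5.113)
        e, b = e_new, b_new
    return e

rng = np.random.default_rng(8)
f = np.hamming(256) * rng.standard_normal(256)       # a windowed frame f(n; m)
Nf, M = len(f), 6
r = np.array([f[:Nf - k] @ f[k:] / Nf for k in range(M + 1)])

a = np.zeros(0); xi = r[0]; kappas = []              # L-D recursion (Fig. 5.8)
for l in range(1, M + 1):
    k = (r[l] - a @ r[l - 1:0:-1]) / xi
    a = np.concatenate((a - k * a[::-1], [k]))
    xi *= 1.0 - k * k
    kappas.append(k)

e_lattice = analysis_lattice(f, kappas)
e_direct = np.convolve(f, np.concatenate(([1.0], -a)))
assert np.allclose(e_lattice, e_direct)
```

Reversing the final equation, as in (5.114), turns the same set of coefficients into the synthesis lattice.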
The two time-domain equations for the forward and backward prediction errors, (5.111) and (5.113), lead immediately to the analysis lattice structure shown in Fig. 5.10. The computations represented by the analysis lattice are used to convert the speech sequence into the prediction error sequence and are equivalent, in theory, to an inverse filter derived using the autocorrelation method. The synthesis lattice structure can be used to reverse the process and is derived simply by rewriting (5.111) as

    \hat{e}^{l-1}(n; m) = \hat{e}^{l}(n; m) + \kappa(l; m)\, \beta^{l-1}(n; m).   (5.114)

The structure is shown in Fig. 5.11.

FIGURE 5.10. Analysis lattice structure.

FIGURE 5.11. Synthesis lattice structure.

The Itakura-Saito (Parcor) Lattice. The lattice structures derived above are a direct consequence of the L-D recursions and are equivalent to the autocorrelation method. The lattice can, at first glance, be a bit deceiving in this regard because from the diagram it appears that the computations involved in the L-D recursion (in particular, the autocorrelations) have been circumvented. In fact, however, the lattice diagram provides a pictorial view of the evolution of the model order recursions, but it is still quite necessary to compute the L-D type recursions in the process of constructing the lattice. Itakura et al. (1969, 1972), however, demonstrated that the lattice could evolve without explicit computation of the correlations or predictor coefficients. Although their approach is quite novel [see the original papers and (Markel and Gray, 1976, Ch. 2)], we have the information necessary to derive their key result in our developments above. In fact, using the definitions of forward prediction error, \hat{e}(n; m) \equiv \hat{e}^{M}(n; m) [see (5.54) and (5.104)], backward prediction error, \beta(n; m) \equiv \beta^{M}(n; m) [see (5.109)], and total squared (forward) error [see (5.55)], it can be shown that (Marple, 1987, Sec. 7.3.3)

    \kappa(l; m) = \frac{ \sum_{n} \hat{e}^{l-1}(n; m)\, \beta^{l-1}(n; m) }{ \sqrt{ \sum_{n} \left[ \hat{e}^{l-1}(n; m) \right]^2 \; \sum_{n} \left[ \beta^{l-1}(n; m) \right]^2 } }.   (5.115)

Although this derivation is a bit tricky, verifying that it is indeed correct is straightforward. We simply need to plug (5.104) and (5.109) into (5.115) to derive

    \kappa(l; m) = \frac{1}{\hat{\xi}^{l-1}(m)} \left\{ r_s(l; m) - \sum_{i=1}^{l-1} \hat{a}^{l-1}(i; m)\, r_s(l-i; m) \right\},   (5.116)

which is the expression used in Step 1 of the L-D recursion above.

At last we can see the reason for calling \kappa(l; m) a partial correlation (parcor) coefficient. It is evident from (5.115) that \kappa(l; m) represents a normalized cross-correlation between the sequences \hat{e}^{l-1}(n; m) and \beta^{l-1}(n; m), which statisticians commonly call the correlation coefficient.[24] The word "partial" connotes the lower model orders prior to reaching the "complete" correlation at l = M.

The key point is the ability to compute the \kappa(l; m) parameters without recourse to the correlations or prediction parameters. A lattice structure based on this idea is shown in Fig. 5.12.

FIGURE 5.12. The parcor lattice of Itakura and Saito. The "correlator" is used to execute equation (5.115) to deduce the \kappa(l; m) coefficient for the lth stage.

[24] Actually, we are being a bit loose with terminology here, since we are using short-term temporal averages rather than statistical expectations employed in the rigorous definition of a correlation coefficient.
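The equivalence of (5.115) and (5.116) can be confirmed numerically. The sketch below is ours (random frame, illustrative names; the error sequences are generated by convolution rather than by an explicit lattice): at each order it computes the normalized cross-correlation of the order-(l-1) forward and backward errors, summed over the full support of the windowed frame, and checks that it reproduces the reflection coefficient delivered by the L-D recursion.

```python
import numpy as np

rng = np.random.default_rng(7)
f = np.hamming(200) * rng.standard_normal(200)   # a windowed frame f(n; m)
N, M = len(f), 5
r = np.array([f[:N - k] @ f[k:] / N for k in range(M + 1)])

a = np.zeros(0); xi = r[0]
for l in range(1, M + 1):
    k_ld = (r[l] - a @ r[l - 1:0:-1]) / xi       # L-D kappa(l), equation (5.116)

    # forward error via A^{l-1}(z); backward error via z^-l A^{l-1}(z^-1)
    A = np.concatenate(([1.0], -a))              # taps of A^{l-1}(z)
    fwd = np.convolve(f, np.concatenate((A, [0.0])))      # padded to l+1 taps
    bwd = np.convolve(f, np.concatenate(([0.0], A[::-1])))
    k_corr = (fwd @ bwd) / np.sqrt((fwd @ fwd) * (bwd @ bwd))  # (5.115)
    assert np.allclose(k_ld, k_corr)

    a = np.concatenate((a - k_ld * a[::-1], [k_ld]))
    xi *= 1.0 - k_ld * k_ld
```

This is precisely what the "correlator" of Fig. 5.12 computes at each stage, without ever forming the autocorrelation sequence explicitly.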
k .' I
similarity between the signal flow diagram of the acoustic tube model and the lattice structures developed here. Indeed, the "propagation" of errors through the lattice, and the "transmission" and "reflection" of these errors at the section boundaries, is quite analogous to the behaviors of the forward and backward flow waveforms in the acoustic tube model. In fact, both computational structures are implementing an all-pole transfer function, and each section (in either case) is responsible for the implementation of a pole pair. Further, the acoustic reflection coefficient is sufficient to characterize a given tube section, while the lattice reflection coefficient completely characterizes a recursion in the LP computation. Although there are some differences between the tube model and the lattice filter, particularly with regard to details at the inputs and outputs, it is very interesting that such analogous computational structures arise from quite disparate modeling approaches. Whereas the acoustic tube outcome was based on physical considerations, the LP model is based on a system identification approach that makes no overt attempt to model the physics of the system. On the other hand, the fact that the two approaches are either inherently (tube) or explicitly (LP) centered on an all-pole model makes it natural that a similar computational structure could be used to implement the result.

The analogy is manifest in precise terms as follows. Because of the equations we have available without further algebra, it is most convenient to compare the analysis lattice (which produces the excitation from speech) with the dynamics of the reverse flow (from lips to glottis) in the acoustic tube. Let us write (5.110) and (5.112) in matrix form,

    [ E_l(z; m) ]   [ 1                -κ(l; m) ] [ E_{l-1}(z; m) ]
    [ B_l(z; m) ] = [ -z^{-1}κ(l; m)   z^{-1}   ] [ B_{l-1}(z; m) ].   (5.117)

Now let us recall (3.155) in Appendix 3.C, which relates the forward and backward volume flows at the input of section k of the tube model to those at the output. (Recall that section indices increase from lips to glottis in the lattice, while the opposite is true for the tube.) This equation can be written as

    [ V_k^+(z) ]    z^{1/2}  [ 1             -ρ_k   ] [ V_{k+1}^+(z) ]
    [ V_k^-(z) ] = ──────── [ -z^{-1}ρ_k    z^{-1} ] [ V_{k+1}^-(z) ].   (5.118)
                   1 + ρ_k

Except for some scaling and a delay, the computations are seen to be identical in the acoustic tube and lattice sections. The need for the extra factors in the acoustic tube model can be seen by recalling the transfer function for the N-tube acoustic model derived in Appendix 3.C,

    H(z) = G z^{-N/2} / ∏_{k=1}^{N} [1 - b_k z^{-1}].   (5.119)

Not surprisingly, if the impedance of the glottis is assumed to be infinite so that ρ_glottis = 1, then the denominator coefficients can be generated using the L-D recursion of Fig. 5.10 with the reflection coefficients -ρ_l used in place of κ(l; m) (the change of sign is related to the defined directions of flow in the two analyses).

The Burg Lattice

Our main intention here is to show an alternative lattice formulation that is actually identical in form to that in Fig. 5.10, but which has reflection coefficients computed according to a different criterion. Some historical perspective is necessary here before we get into technical details.

J. P. Burg presented a classic paper²⁵ in 1967 (Burg, 1967) in which he was concerned with the fact that we have no knowledge of the autocorrelation sequence in the estimation procedure outside the range 0 ≤ η ≤ M. Effectively, we assume that r_s(η; m) = 0 outside this range. This is a reflection of the fact that any typical window used on the data will zero out the sequence outside the window. We are therefore imposing a generally incorrect constraint on the solution. Burg thought it made more sense to assume the data sequence to be "maximally uncertain," or to contain the "maximum entropy," on the times that are not included in the estimation process. This, in general, will imply a different autocorrelation sequence outside the "known" range 0 ≤ η ≤ M than the zeros that we have assumed. The power density spectrum that goes with this modified autocorrelation sequence is called the maximum entropy (ME) spectrum. In general, we would expect the ME spectrum to be different, of course, from the spectrum of the autocorrelation LP model, |Θ(ω; m)|². However, in one important case, when the data sequence is a Gaussian random process, Burg shows the ME spectrum to be identical to the autocorrelation LP spectrum, being characterized by an all-pole function with polynomial coefficients that are exactly equivalent to the LP a(i) parameters and are computed according to the same normal equations (5.52). Therefore, the autocorrelation method is often called the maximum entropy method (MEM), particularly by people working in spectral estimation problems. One should carefully notice that they are equivalent only in the important special case noted.

²⁵Burg was actually concerned with spectrum estimation, rather than parameter identification, in his paper.
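The order recursions (5.110) and (5.112) that the lattice of Fig. 5.10 implements can be sketched directly. The following is a minimal illustration, not the book's code: the function name and test signal are ours, and we use the common convention in which the backward error is delayed by one sample before each stage (the book's matrix form may place the delay slightly differently, which shifts the backward error by a pure delay but leaves the forward error unchanged).

```python
import numpy as np

def lattice_analysis(s, kappa):
    """Run the analysis lattice: each stage is parameterized by a single
    reflection coefficient, as in the matrix recursion (5.117)."""
    e = s.astype(float).copy()   # forward prediction error, order 0
    b = s.astype(float).copy()   # backward prediction error, order 0
    for k in kappa:
        b_delayed = np.concatenate(([0.0], b[:-1]))  # z^{-1} applied to B_{l-1}
        e_next = e - k * b_delayed                   # forward-error update
        b_next = b_delayed - k * e                   # backward-error update
        e, b = e_next, b_next
    return e, b
```

With no stages the lattice passes the signal through unchanged; with a sensible first reflection coefficient (the lag-1 correlation coefficient of a slowly varying signal) the forward error energy drops below the signal energy, as expected of a prediction-error filter.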
A year later, Burg published another remarkable paper²⁶ in which he was concerned with the estimation of the LP parameters, especially under circumstances in which there are very few data in the window upon which to base estimates of autocorrelation. In this paper he developed the basis for the lattice structure we have been discussing above. Burg was the first to notice that, as in the case of the Itakura-Saito lattice, the computation of autocorrelation could be side-stepped altogether and the whole lattice constructed using only reflection coefficients. What makes his method unusual, and the reason we discuss it here at the end, is that his optimization criterion for determining the a(i) parameters is somewhat different from the usual autocorrelation criterion. In Burg's work he chooses to find the â(m) vector subject to minimization of the sum of forward and backward prediction error energies. Further, at iteration l, â_l(l; m) is computed subject to the minimization of the relevant error quantity, say

    ξ_l(m) ≝ Σ_{n=m-N+1}^{m} [e_l(n; m)]² + [β_l(n; m)]²   (5.120)

with the coefficients â_l(i; m), i = 1, 2, ..., l-1, remaining fixed at their values from the previous order, â_{l-1}(i; m). Since â_l(l; m) = κ(l; m), this can be interpreted as the pursuit of reflection coefficients that are optimized locally with previous reflection coefficients held fixed. Putting (5.111) and (5.113) into (5.120), and differentiating ξ_l(m) with respect to κ(l; m), yields

    κ(l; m) = [ 2 Σ_{n=m-N+1}^{m} e_{l-1}(n; m) β_{l-1}(n; m) ] / [ Σ_{n=m-N+1}^{m} ([e_{l-1}(n; m)]² + [β_{l-1}(n; m)]²) ],   (5.121)

which are the reflection coefficients used in the Burg lattice. It should be noted that, as in the case of the autocorrelation reflection coefficients, the Burg reflection coefficients have the property that

    |κ(l; m)| < 1.   (5.122)

This fact, along with a demonstration that the solutions are unique and provide minima of the ξ_l(m)'s, is found in (Markel and Gray, 1976, Ch. 2).

If we refer to the autocorrelation reflection coefficients as "forward" parcor coefficients (because they are based on minimization of forward error energy), then it is clear that we could also derive a set of "backward" reflection coefficients. The Burg reflection coefficients represent the harmonic mean between the forward and backward coefficients. The denominator represents the arithmetic average of the forward and backward error energies, while the autocorrelation-based reflection coefficients contain a geometric mean in the denominator.

The Burg method, when viewed as a spectral estimation technique, has been found to be prone to "line splitting" and also biased estimates for narrowband processes [see (Marple, 1987, Ch. 8) for discussion and references]. Various weighted error strategies have been proposed to alleviate the bias problem (Swingler, 1980; Kaveh and Lippert, 1983; Helme and Nikias, 1985). A specific application to speech involving an exponential forgetting factor is found in (Makhoul and Cosell, 1981).

In summary, after having developed the lattice in Fig. 5.10 that is governed by (5.111) and (5.113), we recognize that, at each step in the recursion, the next stage of the lattice is parameterized by only one quantity, the reflection coefficient. If we therefore solve for κ(l; m), which minimizes ξ_l(m) under the constraint of fixed previous coefficients, we will derive (5.121). The Burg lattice is often presented this ad hoc way, almost as an afterthought, at the end of discussions on the autocorrelation method. In fact, the Burg method is the predecessor and motivator for the autocorrelation developments.

Finally, it is important to return to the history of the Burg method, and the term MEM. In this "Burg lattice problem," we have essentially solved an "autocorrelation"-type problem with a different error minimization criterion. Again under conditions of a Gaussian input and output to the model, the ME considerations yield the same normal equations as if we had just used the standard autocorrelation approach. For this reason, Burg's lattice is sometimes referred to as the MEM, but one should be careful to remember that this is only strictly appropriate in an important special case. Further, the use of the term MEM here tends to blur the distinction between Burg's two important contributions. His 1967 paper was primarily concerned with the presentation of the ME principle, while the 1968 paper showed the ability to decouple the lattice and reflection coefficients from the autocorrelations. Each of these is a profound result on its own merits.

Decomposition Methods (Covariance)

In the covariance case the equations to be solved are of the form (5.58). In this case we still have symmetry and positive definiteness to take advantage of, but the covariance matrix is not Toeplitz.

The most commonly used methods for solving the covariance equations are based on the decomposition of the covariance matrix into lower and upper triangular matrices, say L and U, such that²⁷

    Φ_s(m) = LU.   (5.123)

²⁶The two bodies of work have become blurred together in the literature, as we discuss below, but each is a distinct contribution to the area.
²⁷Clearly, L and U and related quantities depend on time m, but we will suppress this index for simplicity here.
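The order-recursive computation of (5.121) can be sketched compactly. This is an illustrative implementation under our own naming, not the book's listing; it updates the forward and backward errors stage by stage and reads each new reflection coefficient off the Burg ratio, whose numerator and denominator guarantee |κ| < 1.

```python
import numpy as np

def burg_reflection(s, M):
    """Burg lattice, order-recursive: kappa[l] minimizes the sum of forward
    and backward error energies at order l, giving the ratio in (5.121)."""
    e = s.astype(float).copy()   # forward prediction error
    b = s.astype(float).copy()   # backward prediction error
    kappa = []
    for _ in range(M):
        ef = e[1:]               # forward error over the valid range
        bd = b[:-1]              # one-sample-delayed backward error
        # 2*sum(ef*bd) / (sum(ef^2) + sum(bd^2)): harmonic-mean form
        k = 2.0 * np.dot(ef, bd) / (np.dot(ef, ef) + np.dot(bd, bd))
        kappa.append(k)
        e, b = ef - k * bd, bd - k * ef   # lattice order update
    return np.array(kappa)
```

By the arithmetic-geometric mean inequality, 2|Σ e·b| ≤ Σ e² + Σ b², so every coefficient returned satisfies the stability bound (5.122) without any explicit check.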
Once this decomposition is achieved, then (5.58) can be solved by sequentially solving the equations

    L y = φ_s(m)   (5.124)

and

    U â(m) = y,   (5.125)

in each case using simple algorithms to be described below (Golub and Van Loan, 1989).

Since the matrix Φ_s(m) is symmetric and (assumed) positive definite, the most efficient algorithms for accomplishing the LU decomposition are based on the following results from linear algebra: ...

LΔLᵀ Decomposition. To develop an algorithm based on the LΔLᵀ decomposition, we first write scalar equations for the lower-triangle elements of Φ_s(m) in terms of the quantities on the right side of (5.126). For the off-diagonal elements,

    φ_s(i, j; m) = Σ_{k=1}^{j} L(i, k) Δ(k) L(j, k),   i = 2, ..., M; j = 1, ..., i-1,   (5.128)

and for the diagonal elements,

    φ_s(i, i; m) = Σ_{k=1}^{i} L(i, k) Δ(k) L(i, k),   (5.129)

where L(i, k) is the (i, k)-element of L, and Δ(k) is the kth diagonal element of Δ. Recalling that the diagonal elements of L are unity, we can easily convert these equations into solutions for the off-diagonal elements of L and for the Δ(k)'s. These are as follows:

    L(i, j) = [ φ_s(i, j; m) - Σ_{k=1}^{j-1} L(i, k) Δ(k) L(j, k) ] / Δ(j),   i = 2, ..., M; j = 1, ..., i-1   (5.130)

    Δ(1) = φ_s(1, 1; m)   (5.131)

    Δ(i) = φ_s(i, i; m) - Σ_{k=1}^{i-1} L(i, k) Δ(k) L(i, k),   i = 2, ..., M.   (5.132)

Initialization: Compute Δ(1) using (5.131) and recall L(1, 1) = 1.
Recursion: For i = 2, ..., M,
    For j = 1, ..., i-1,
        Compute L(i, j) using (5.130); save P(i, j) = L(i, j)Δ(j)
    Next j
    Compute Δ(i) using (5.132) with P(i, k) in place of L(i, k)Δ(k)
Next i

²⁸The notation O(·) denotes "on the order of" and indicates approximation.
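The full LΔLᵀ route (decomposition by (5.130)-(5.132), then forward elimination and back-substitution as in (5.124)-(5.125)) can be sketched in a few lines. Indices are 0-based here and the function name is ours; this is an illustration of the technique, not the book's listing.

```python
import numpy as np

def ldl_solve(Phi, c):
    """Solve Phi a = c for symmetric positive-definite Phi via the
    square-root-free LDL^T decomposition, then L y = c (forward
    elimination) and D L^T a = y (back-substitution)."""
    M = Phi.shape[0]
    L = np.eye(M)
    d = np.zeros(M)
    for i in range(M):
        # diagonal entry, analog of (5.132)
        d[i] = Phi[i, i] - np.sum(L[i, :i] ** 2 * d[:i])
        for j in range(i + 1, M):
            # off-diagonal entries of column i, analog of (5.130)
            L[j, i] = (Phi[j, i] - np.sum(L[j, :i] * d[:i] * L[i, :i])) / d[i]
    y = np.zeros(M)
    for i in range(M):                     # forward elimination: L y = c
        y[i] = c[i] - L[i, :i] @ y[:i]
    a = np.zeros(M)
    for i in reversed(range(M)):           # back-substitution: D L^T a = y
        a[i] = y[i] / d[i] - L[i + 1:, i] @ a[i + 1:]
    return a
```

Note that no square roots are taken, which is the practical advantage of LΔLᵀ over the Cholesky procedure discussed next.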
    ...
    Next j
    â(i; m) ← â(i; m)/U(i, i)
Next i

Before leaving the LΔLᵀ decomposition, we note an interesting and useful feature. Similarly to (5.78), we can write the following expression for the average squared error [again let us call it ξ̄(m)] in using the covariance method on the window ending at m:

    φ_s(0, 0; m) - φ_sᵀ(m) â(m) = ξ̄(m).   (5.133)

Using the fact that â(m) = L^{-T} Δ^{-1} y [from (5.125)] and the fact that U = ΔLᵀ, we can express this as

    ξ̄(m) = φ_s(0, 0; m) - φ_sᵀ(m) L^{-T} Δ^{-1} y.   (5.134)

Now using (5.124), we can write

    ξ̄(m) = φ_s(0, 0; m) - yᵀ Δ^{-1} y = φ_s(0, 0; m) - Σ_{k=1}^{M} y²(k)/Δ(k).   (5.135)

For the Cholesky factorization, Φ_s(m) = CCᵀ with C lower triangular, the analogous scalar solutions are

    C(i, j) = [ φ_s(i, j; m) - Σ_{k=1}^{j-1} C(i, k) C(j, k) ] / C(j, j),   i > j   (5.138)

    C(j, j) = sqrt( φ_s(j, j; m) - Σ_{k=1}^{j-1} C²(j, k) ).   (5.139)

These equations can be used to compute the lower triangular matrix C with O(M³/6) flops according to the algorithm in Fig. 5.16. The solution â(m) can then be obtained using the forward elimination and back-substitution algorithms above. It is to be noted that the Cholesky procedure involves square root taking, which is avoided in the LΔLᵀ procedure.

The Cholesky decomposition algorithm shown in Fig. 5.16 computes one column of C at a time. (The computations can also be structured to compute rows.) Note that storage can be saved without computational error by overwriting the memory location assigned to the (i, j) element of Φ_s(m) by C(i, j) for any i and j as the algorithm proceeds.
(Weighted) Recursive Least Squares, Coordinate Rotation Algorithms, and Systolic Processors (Covariance)

The Levinson-Durbin recursion is a recursive-in-model-order solution of the autocorrelation equations. In this material we study methods for solving the covariance equations recursively in time. We will study two widely different algorithms for computing a weighted recursive least squares (WRLS) solution for the LP parameters, which is equivalent to a weighted covariance solution.

The interest in a time recursion in LP analysis of speech is naturally motivated, for example, by the need to adaptively change parameter estimates over time. Although the "conventional" form of WRLS was popular for similar identification problems in the control and systems science domains beginning in the 1960s, speech processing engineers did not widely use the method, but instead resorted to more ad hoc methods for changing estimates over time. Perhaps the most important reason for this is that WRLS requires O(M²) flops per speech sample when computed on a sequential machine, whereas when we focus on the computational complexity of various solution methods in this section, we will discover that we are already familiar with other methods that do the job at the expense of O(M) flops. In the 1980s, however, systolic and other forms of parallel processors made it possible to solve this O(M²) problem at time scales that are many orders of magnitude faster than "real time" in many speech applications. New attention was focused on this algorithm [in particular, for adaptive antenna arrays (McWhirter, 1983)] in that era; from it there promises to arise more elegant solutions to adaptive strategies and discoveries of new uses for weighted covariance estimates in speech. Here we briefly discuss the conventional form of the WRLS algorithm and then focus on the version that is amenable to systolic processing. If an "unweighted" covariance analysis is desired, it is a simple matter to set all the weights to unity in the following discussion.

Let us proceed to develop the conventional WRLS algorithm. The algorithm will consist of two coupled recursions, one for the (inverse of the) covariance matrix, the other for the LP parameter vector.

We have noted two quite different ways of computing the weighted covariance matrix Φ_s^λ(N). The first consists of computing its elements directly from the speech using the scalar equations (5.73). The second, given in (5.74), arose by comparing the normal equations to a classical linear algebra result. A third, which is immediately evident from the second, is a key equation in the development of the first recursion:

    Φ_s^λ(N) = (1/N) Σ_{n=1}^{N} λ(n) s(n) sᵀ(n).   (5.140)

From this form, we can clearly see that

    N Φ_s^λ(N) = (N-1) Φ_s^λ(N-1) + λ(N) s(N) sᵀ(N).   (5.141)

If we define

    Ψ_s(N) ≝ [N Φ_s^λ(N)]^{-1},   (5.142)

then we have

    Ψ_s(N) = [ (N-1) Φ_s^λ(N-1) + λ(N) s(N) sᵀ(N) ]^{-1}.   (5.143)

The following lemma, often called the matrix inversion lemma, or Woodbury's lemma (Ljung and Söderström, 1983, p. 19; Graupe, 1989), will lead immediately to one key equation.

LEMMA 5.3 (MATRIX INVERSION LEMMA) Let A and C be matrices for which inverses exist, and B be a matrix such that BCBᵀ is of the same dimension as A. Then ...
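The way the lemma enters (5.143) is in its rank-one special case: the new inverse Ψ_s(N) follows from the old one by a cheap update, with no fresh matrix inversion. A small numerical check of that identity, with made-up data (the names and dimensions here are illustrative only):

```python
import numpy as np

# Rank-one form of the matrix inversion lemma, as used in the WRLS
# recursion (5.143): inv(Phi + lam*s*s^T) from Psi = inv(Phi).
rng = np.random.default_rng(3)
M = 4
A = rng.standard_normal((M, M))
Phi = A @ A.T + M * np.eye(M)      # plays the role of (N-1)*Phi(N-1)
Psi = np.linalg.inv(Phi)           # previously available inverse
s = rng.standard_normal(M)         # new data vector s(N)
lam = 0.7                          # its weight lambda(N)

direct = np.linalg.inv(Phi + lam * np.outer(s, s))        # brute force
lemma = Psi - (lam * np.outer(Psi @ s, s @ Psi)) / (1.0 + lam * s @ Psi @ s)
assert np.allclose(direct, lemma)
```

The update costs O(M²) operations per sample, which is exactly the per-sample complexity attributed to conventional WRLS in the discussion above.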
where e(n; n-1) has precisely the same meaning as always: the prediction error at time n based on the filter designed on the window ending at n-1. Equation (5.146) is derived as follows. The auxiliary covariance vector on the right side of (5.72) can be written [similarly to (5.141)] ...

FIGURE 5.17. Conventional weighted recursive least squares algorithm.
Initialization: Set â(0) = 0, and Ψ_s(0) according to (5.151).
...

... subject to the weighted error constraint (5.70), where Λ^{1/2}(N) is the diagonal matrix ...

Sequential Method. Thus far, we have studied solutions that, in one way or another, have formed the covariance matrix and then used it to obtain the solution for â(N). In this case, we will find the solution without directly computing Φ_s(N); instead, we will work directly with the system of equations (5.152) or (5.154). Well-established techniques for finding the solution by orthogonal triangularization of the "coefficient" matrix, S(N), are based on the original work of Givens (1958) and Householder (1958). Householder's paper was the first to mention the applicability to the (batch) least squares problem, and the papers of Golub and Businger (1965a, 1965b) were the first to present numerical algorithms. The Givens technique is the focus of our work here.

The "batch" (using all the equations at once) least squares solution of (5.152) by Givens transformation is achieved by "annihilating" elements of S(N) in a row-wise fashion [see, e.g., (Golub and Van Loan, 1989)], "rotating" the information in each element into the upper triangle of the top M rows of the matrix. The plane rotation matrix, say P, used to annihilate the (i, j) element of S(N), say S(i, j; N), is an (M + N) × (M + N) matrix differing from the identity matrix only in its (i, i), (j, j), (i, j), and (j, i) elements, which are as follows:³²

    P(i, i) = P(j, j) = cos θ   (5.156)
    ...

After all of the elements below the triangle have been annihilated, the sequence of transformations takes the form

    R_N R_{N-1} ··· R_1 S(N) â(N) = R_N R_{N-1} ··· R_1 s(N).   (5.161)

Let us further define the global transform matrix

    Π_N ≝ R_N R_{N-1} ··· R_1.   (5.162)

After the application of Π_N to the system, the result is of the form (5.159).³³

To move toward a recursive solution, we add one further equation to the system. For the moment, we leave the previous equations unchanged, meaning in particular that their weights cannot change. Suppose, therefore, that we were to encounter the same system of N equations plus a new equation representing time N + 1,

    [ Λ^{1/2}(N) S(N)        ]              [ Λ^{1/2}(N) s(N)      ]
    [ √λ(N+1) sᵀ(N+1)        ] â(N+1)  =   [ √λ(N+1) s(N+1)       ]   (5.163)

or

    Λ^{1/2}(N+1) S(N+1) â(N+1) = Λ^{1/2}(N+1) s(N+1).   (5.164)

Wishing to find the least squares solution to this system of equations, we might recognize that we are now faced with an N + 1 equation version of the problem that we have just solved with N equations, and simply start over, using precisely the same method. We would once again apply the Givens transformations row-wise, this time denoting the row operation matrices by Q_1, Q_2, ..., Q_{N+1}. Note that the R_n matrices of the N equation problem are (M + N) × (M + N), whereas these Q_n matrices are (M + N + 1) × (M + N + 1). The key to using the solution in a sequential manner is to recognize that the N equation problem is "embedded" in the N + 1 equation problem in the following sense: The operations done to the first N equations in the N + 1 equation problem are identical to those in the N equation problem, as shown in the following lemma.

LEMMA 5.4

    Q_N Q_{N-1} ··· Q_1 = [ Π_N      0_{N×1} ]
                          [ 0ᵀ_{1×N}     1   ].   (5.165)

The proof of this lemma follows immediately from the definitions and some algebraic tedium. By applying Lemma 5.4 to the N + 1 equations of (5.163), it is seen that after N row transformations in the N + 1 equation problem,

    [ Π_N Λ^{1/2}(N) S(N)    ]              [ Π_N Λ^{1/2}(N) s(N)  ]
    [ √λ(N+1) sᵀ(N+1)        ] â(N+1)  =   [ √λ(N+1) s(N+1)       ]   (5.166)

the top M rows of which are exactly (5.160). Therefore, after N row transformations in the N + 1 equation problem, the estimate â(N) is easily generated and the solution begins to assume a "sequential" flavor. Notice that in solving the N + 1 equation problem, it is apparently sufficient to apply the row transformations³⁴ Q_{N+1} R_N R_{N-1} ··· R_1 if the R_n's are restricted to operating on the upper N equations of Λ^{1/2}(N+1) S(N+1). In fact, by induction, we see that, for an N + 1 equation problem, the appropriate sequence of operations is given by

    S_{N+1} S_N S_{N-1} ··· S_1,   (5.167)

where S_n is defined to be the (M + n) × (M + n) matrix appropriate for including the nth equation of the system if the system indeed had only n equations, and which is restricted to operating only upon the top n equations of Λ^{1/2}(N+1) S(N+1).

Further, in considering, for example, the application of S_{N+1} to the system of (5.166), we would find that rows and columns (M + 1) through N serve only to preserve the rows of zeros (caused by annihilation of equations 1 through N) in the matrix on the left, and the vector d_1(N) on the right [see also (5.159)]. Since these rows play no further role in the solution, they and the cited rows and columns of S_{N+1} can be eliminated. Each S_n operator, therefore, can be formulated as an (M + 1) × (M + 1) matrix. This fact is reflected in the algorithm below. Before stating the algorithm, we show how this method is easily modified to be adaptive to the most recent dynamics of the speech signal.

Adaptive Method. One prominent method of causing covariance-like estimators to adapt to the most recent dynamics of the signal and "forget the past" is to include a so-called "forgetting factor" into the recursion. In principle, the covariance method is solved at each N, subject to the error minimization

    ...

for some 0 < α < 1. Comparing this with (5.70), we see that the weights in this case are λ(n) = α^{N-n}, and that they become smaller as we look further back into the history of the data. Note that these weights are time-dependent: at time N + 1, the weights on the first N equations are not the same as they were in the N point problem, but instead would be scaled down by a factor of α. At first, this might seem to make a recursion difficult, but in fact the solution is quite simple.

First let us note that, in this case, the equation weighting matrix for the N point problem, Λ^{1/2}(N), is as follows:

    Λ^{1/2}(N) = diag[ 1 ··· 1  β^{N-1}  β^{N-2} ··· β  1 ],   (5.169)

where β ≝ √α. Consider the system (5.152) or (5.154) with weights given as in (5.169). Suppose that we implement a naive adaptive strategy that introduces a new Λ^{1/2}(N) matrix at each N and reuses the Givens approach to obtain â(N). At time N, we can use the sequential technique suggested above. However, it is important to realize that in the course of solving the N equation problem, the intermediate estimates â(n), n = 1, 2, ..., N-1 (if computed), will not correspond to adaptive estimates. Rather, they are just estimates corresponding to the static set of weights at time N. Nevertheless, suppose that we have used the sequential solution at time N, the row transformation matrices being denoted S_n as in (5.167). Now in the adaptive scheme, the next set of N + 1 equations to be solved is

    [ β Λ^{1/2}(N) S(N)  ]              [ β Λ^{1/2}(N) s(N)  ]
    [ sᵀ(N+1)            ] â(N+1)  =   [ s(N+1)             ].   (5.170)

...

FIGURE 5.18. "Systolic" WRLS Algorithm for Adaptive and Nonadaptive Cases.
Dedicate (M + 1) × M memory locations to the elements of the matrix on the left side of the equation, and (M + 1) × 1 to the auxiliary vector on the right. These are concatenated into a working matrix, W, of dimension (M + 1) × (M + 1). By W(n) we mean the working matrix at time n (including the nth equation), before the application of S_n, and by W'(n), the postrotation matrix.
Initialization:
    W(0) = W'(0) = [ 0_{(M+1)×M} | 0_{(M+1)×1} ]   (5.172)
...
For j = 1, 2, ..., M,
    W(M + 1, j; n) = 0
    ...
    W'(j, k; n) = W(j, k; n) cos θ + W(M + 1, k; n) sin θ
...

³²These quantities, of course, depend on N, but we will once again suppress the index for simplicity.
³³We have avoided the need to process the top M rows in our formulation by filling them with zeros; this makes the following discussion much less cumbersome.
³⁴Because of the restrictions, the matrices in this sequence of operations no longer have compatible dimension and it is not meaningful to interpret this string as a product.
TABLE 5.1. Approximate Number of Floating Point Operations (Flops) Required by Various LP Solutions.ᵃ

Method (Complexity)                      Subtask                          Flops per Sample   Typical Number of Flops (M = 14, N = 256)
Autocorrelation (L-D recursion), O(M)    Windowing                        1                  1
                                         Autocorrelation update           O(M)               14
                                         Solution                         O(M²/N)            0.76
Covariance (Cholesky), O(M)              Covariance update                O(M)               14
                                         Solution                         O(M²/N)            0.76
Lattice (Itakura or Burg), O(5M)         Windowing                        1                  1
                                         Solution (parcor coefficients)   O(5M)              96
Covariance (W)RLS (conventional), O(3M²) Covariance update                O(2M²)             392
                                         Solution                         O(M²)              196
                                         Weights                          3                  3
                                         Forgetting factor                3                  3
Covariance (W)RLS (QR decomposition),    Covariance update                O(M²)              196
O(3M²/2)                                 Solution                         O(M²/2)            98
                                         Weights                          3                  3
                                         Forgetting factor                O(M²/2)            98

ᵃA flop is taken to be one floating point multiplication plus one addition. Further discussion is found in the text.

The lattice methods are the most computationally expensive of the conventional methods but are among the most popular because of several important properties. First, it should be noted that more efficient lattice structures than those described here are to be found in the work of Makhoul (1977) and Strobach (1991). Second, even with slightly more expensive algorithms, lattice methods offer the inherent ability to directly generate the parcor coefficients (which are often preferable to the LP coefficients, see Section 7.4.5), to monitor stability, and to deduce appropriate model orders "on line."

Finally, if sequential computation is used, the most expensive solutions are the temporal WRLS recursions. However, even on a sequential machine, these methods are advantageous for some applications in providing point-by-point parameter estimates and residual monitoring, as well as convenient adaptation capability. Further, contemporary computing technology has made O(M²) algorithms not nearly as prohibitive in terms of actual computation time as was the case in the early days of LP analysis. As noted above, however, the primary motivation for the interest in these solutions is the availability of parallel processing architectures that render the Givens rotation-based WRLS method effectively an O(M) process. More details on how parallel computing can be applied to this method are found in the work of Gentleman and Kung (1981); McWhirter (1983); and Deller, Odeh, and Luk (1991, 1989, 1989). It is also to be noted that, in the late 1980s and early 1990s, a parameter estimation strategy known as set-membership identification was applied to LP analysis. This is, in fact, the main subject addressed in the papers by Deller et al. noted above. In the paper by Deller and Odeh (1991), it is shown that the set-membership approach has the potential to make even the sequential version of WRLS an O(M) algorithm. Various other "fast" RLS algorithms have been proposed in the literature, notably the fast transversal filter (FTF) of Cioffi and Kailath (1984, 1985). The FTF in its most stable form, however, is an O(8M) algorithm, which is close to O(M²) for speech analysis, and has been shown to be very sensitive to finite precision arithmetic effects (Marshall and Jenkins, 1988).
the analyst should keep in mind the stability of the solution. The autocorrelation solution is theoretically guaranteed to represent a stable filter if infinite precision arithmetic is used. In practice, finite wordlength computation can cause unstable solutions,³⁹ but the L-D recursion contains a built-in check for stability embodied in conditions (5.97) and (5.98). Although no such theoretical guarantee or numerical check is available in the covariance case, stability is usually not a problem if the frame is sufficiently large, since the covariance and autocorrelation methods both converge to the same answer as N increases.

³⁹A discussion of finite wordlength effects in LP solutions is found in (Markel and Gray, 1976, Ch. 9).

5.3.4 Gain Computation

Let us return now to the general modeling problem, recalling that we view the total speech production system as consisting of three components, Θ(z) = Θ₀ Θ_min(z) Θ_ap(z). In the discussions above, we have focused on the problem of modeling the minimum-phase component, Θ_min(z), the estimate of which yields a sufficient spectral characterization of the speech for many purposes. Thus far, we have said nothing about the possibility of estimating the gain, Θ₀, of the LP model. Indeed, we have discovered implicitly that the LP parameter estimates are insensitive to this quantity. There are many instances, however, in which the relative gains across frames are important, and we seek an estimation procedure here.
Let us recall (5.11), assuming that our model order, M, can be chosen exactly equal to the "true" order of the system, I, so that I will be replaced by M in that equation:

    s(n) = Σ_{i=1}^{M} a(i) s(n-i) + Θ₀ e'(n) ≝ aᵀs(n) + Θ₀ e'(n).   (5.173)

Recall also that e'(n) represents the phase-scattered version of the driving sequence, e(n), which is a consequence of subjecting e(n) to the all-pass component of Θ(z). Postmultiplying both sides of (5.173) by s(n) and applying the long-term averaging operator, we have

    r_s(0) = Σ_{i=1}^{M} a(i) r_s(i) + Θ₀ r_{e's}(0).   (5.174)

...

    Θ̂₀²(m) = r_s(0; m) - Σ_{i=1}^{M} â(i; m) r_s(i; m),   unvoiced case.   (5.177)

Finally, we show an alternative estimation procedure for the gain. Returning to Fig. 5.4, we can write the following expression for the prediction residual, ê(n),

    ê(n) = -Σ_{i=1}^{M} â(i) s(n-i) + s(n).   (5.179)

If we have done a reasonably good job of selecting the model order, M, then

    ê(n) ≈ Θ₀ e'(n).   (5.180)

Multiplying the left and right sides of (5.179) and (5.180) together, and taking the average, we have ...

Reading Note: The reader might wish to review Section 1.3.1 before studying this material. The topics are closely related, but not highly dependent upon one another.

Remember that R_s(m) signifies the (M + 1) × (M + 1) augmented correlation matrix, defined as ...

How "far" is LP vector b(m') from LP vector a(m)? One way of answering this question is to measure how much "better" a(m) is at predicting its "own" frame than b(m') is. A measure of this is the ratio ξ_b(m)/ξ_a(m), or, taking the logarithm, we define the Itakura distance as

    d_I[a(m), b(m')] = log [ ξ_b(m)/ξ_a(m) ].

We know that

    ξ_a(m) < ξ_b(m)   (5.190)

because ξ_a(m) is the best possible prediction error in the sense of minimizing the average squared prediction error. Note that this measure will always be positive because of condition (5.190). Also note that while this measure is called a "distance," it is not a true metric, because it does not have the required symmetry property. [This expression is often called the Itakura-Saito distance in speech processing (Itakura and Saito, 1968).] Further, if a(m) truly is close to b(m'), then d_I[a(m), b(m')] ≈ d_M[a(m), b(m')].

The Itakura distance is probably the most widely used measure of similarity between LP vectors. In Itakura's original paper (1975), it is introduced for use in an isolated word recognition strategy that has since been named "dynamic time warping." The strategy will be the subject of Chapter 11.
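The distance just defined transcribes directly to code. In the sketch below (our names and test frame, not the book's), ξ_x = x̃ᵀ R_s x̃ is evaluated with the augmented vector x̃ = [1, -x(1), ..., -x(M)]ᵀ and a Toeplitz augmented correlation matrix built from the frame's autocorrelation lags.

```python
import numpy as np

def itakura_distance(a, b, r):
    """d_I[a, b] = log(xi_b / xi_a), where xi_x = x~^T R x~, x~ is the
    augmented vector [1, -x(1), ..., -x(M)]^T, and R is the (M+1)x(M+1)
    augmented (Toeplitz) correlation matrix from lags r[0..M]."""
    M = len(a)
    R = np.array([[r[abs(i - j)] for j in range(M + 1)] for i in range(M + 1)])
    at = np.concatenate(([1.0], -np.asarray(a, float)))
    bt = np.concatenate(([1.0], -np.asarray(b, float)))
    return np.log((bt @ R @ bt) / (at @ R @ at))
```

If a is the frame's own autocorrelation-method LP solution, ξ_a is the minimum of the quadratic form over all augmented vectors with leading 1, so the distance to any other LP vector is nonnegative, and d_I[a, a] = 0; the asymmetry of the measure is also visible, since swapping a and b changes the reference frame's error ratio.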
Preemphasis increases the relative energy of the high-frequency spectrum. Typically, the filter

    P(z) = 1 − μz^{−1}    (5.196)

is used, with μ ≈ 1. This filter is identical in form to the filter used to model the lip radiation characteristic. We know that this filter introduces a zero near z = 1, and a 6-dB per octave shift on the speech spectrum.

The reasons for employing a preemphasis filter are twofold. First, it has been argued that the minimum-phase component of the glottal signal can be modeled by a simple two-real-pole filter whose poles are near z = 1 (see Chapter 3). Further, the lip radiation characteristic, with its zero near z = 1, tends to cancel the spectral effects of one of the glottal poles. By introducing a second zero near z = 1, the spectral contributions of the larynx and lips have been effectively eliminated, and the analysis can be asserted to be seeking parameters corresponding to the vocal tract only. We know that the speech production model is a greatly simplified analytical model of a complex physical system. Accordingly, we must be careful not to overstate the claim that preemphasis results in an LP spectrum or filter that is free of glottal or lip radiation effects. In the worst case, however, it is clear that preemphasis will give the higher formants in the vocal tract a better chance to influence the outcome.

The value of μ is taken in the range 0.9 < μ < 1.0, although the precise value seems to be of little consequence. Of course, preemphasis should not be performed on unvoiced speech, in which case μ ≈ 0. Both Gray and Markel (1974) and Makhoul and Viswanathan (1974) have worked with an "optimal" value of μ given by

    μ = r_s(1; m) / r_s(0; m),    (5.197)

where r_s(η; m) is the usual short-term autocorrelation sequence for the frame. For unvoiced frames this value is small, whereas for voiced frames it is near unity.

The second reason for preemphasis is to prevent numerical instability. The work on this problem has focused on the autocorrelation method, but the deleterious effects can be expected to be even worse in the covariance case (Markel and Gray, 1976, p. 222). If the speech signal is dominated by low frequencies, it is highly predictable, and a large LP model order will result in an ill-conditioned^{38} autocorrelation matrix (Ekstrom, 1973). Makhoul (1975) argues that the ill-conditioning of the autocorrelation matrix becomes increasingly severe as the dynamic range of the spectrum increases. If the spectrum has a general "tilt" that is causing the wide dynamic range, then a first-order inverse filter should be able to "whiten" the spectrum. Indeed, the preemphasis filter may be interpreted as such an inverse filter, and μ given by (5.197) is the optimal coefficient in the sense of MSE.

5.4 Alternative Representations of the LP Coefficients

In this short section we remind the reader of two alternative sets of parameters that are theoretically equivalent to the LP parameters, and which can be derived from them, and introduce two others. It is not the purpose here to study alternative models of speech production, although there are many.^{39}

In the communications technologies, the LP parameters are rarely used directly. Instead, alternative representations of the LP model are employed. These alternate sets of parameters have better quantization and interpolation properties, and have been shown to lead to systems with better speech quality (Viswanathan and Makhoul, 1975). We have already studied three alternative representations: the reflection coefficients, the log area ratio parameters, and the inverse sine parameters. We will see these systems employed in coding methods in Chapter 7. In this brief section, we introduce another alternative representation of the LP parameters, the line spectrum pair, and preview a second, the cepstral parameters. The latter will be discussed thoroughly in the following chapter, after the necessary background has been presented.

5.4.1 The Line Spectrum Pair

In the 1980s, the line spectrum pair (LSP) was introduced as another alternative to the LP parameters. This technology was researched most extensively by the Japanese telephone industry, but some seminal ideas are found in the following papers in English: (Itakura, 1975; Sugamura and Itakura, 1981; Soong and Juang, 1984; Crosmer and Barnwell, 1985).

The LSP is developed by beginning with the z-domain representation of the inverse filter of order M,

    A(z; m) ≝ 1 − Σ_{i=1}^{M} a(i; m) z^{−i}.    (5.198)

Now A(z; m) is decomposed into two (M + 1)-order polynomials,

    P(z; m) = A(z; m) + z^{−(M+1)} A(z^{−1}; m)    (5.199)

and

    Q(z; m) = A(z; m) − z^{−(M+1)} A(z^{−1}; m),    (5.200)

so that

    A(z; m) = [P(z; m) + Q(z; m)] / 2.    (5.201)

^{38} For a general discussion of ill-conditioning, see (Noble, 1969).
^{39} Indeed, there are many such models, including, for example, models including both poles and zeros (Steiglitz, 1977; Konvalinka and Matausek, 1979; El-Jaroudi and Makhoul, 1989), sinusoidal models (McAulay and Quatieri, 1986), orthogonal function expansions (Korenberg and Paarmann, 1990), and time-varying LP models and LP models based on alternative formulations [for a review see (McClellan, 1988)].
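The decomposition (5.199)-(5.201) is easy to exercise numerically. In the sketch below (pole locations arbitrary; numpy only), P and Q are formed by padding the coefficient sequence of A(z; m) with one zero and adding or subtracting its reversal, and the LSP frequencies are read off the polynomial roots:

```python
import numpy as np

# Illustrative stable inverse filter A(z; m) of order M = 4.
poles = np.array([0.90 * np.exp(0.3j * np.pi), 0.85 * np.exp(0.6j * np.pi)])
A = np.real(np.poly(np.concatenate([poles, poles.conj()])))  # [1, -a(1), ..., -a(4)]
M = len(A) - 1

Az = np.concatenate([A, [0.0]])   # coefficients of A(z), padded to degree M + 1
P = Az + Az[::-1]                 # P(z; m) = A(z; m) + z^-(M+1) A(z^-1; m)
Q = Az - Az[::-1]                 # Q(z; m) = A(z; m) - z^-(M+1) A(z^-1; m)

zP, zQ = np.roots(P), np.roots(Q)
on_circle = (np.allclose(np.abs(zP), 1.0, atol=1e-6) and
             np.allclose(np.abs(zQ), 1.0, atol=1e-6))  # all zeros on |z| = 1
# P contributes a real zero at z = -1 and Q one at z = +1; the remaining
# zeros come in conjugate pairs, so M angles (the LSP frequencies) suffice.
lsp = np.sort(np.concatenate([np.angle(zP[zP.imag > 1e-9]),
                              np.angle(zQ[zQ.imag > 1e-9])]))
```

Sorting the angles from both polynomials together exposes the interleaving property discussed in the next section.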
332 Ch. 5 / Linear Prediction Analysis
5.5 / Applications of LP in Speech Analysis 333
[The reader is encouraged to take a small-order A(z; m) polynomial and work out these three equations.] In light of (5.103), P(z; m) can be interpreted as representing an (M + 1)-order lattice (analysis) filter with final reflection coefficient K(M + 1; m) = 1. Similarly for Q(z; m) with K(M + 1; m) = −1. Accordingly, P(z; m) and Q(z; m) correspond to lossless models of the vocal tract with the glottis closed and open, respectively (see Section 5.3.3). In turn, this can be shown to guarantee that all zeros of P and Q lie on the unit circle (Soong and Juang, 1984). In fact, P has a real zero at z = −1, Q a zero at z = 1, and all other zeros are complex and interleaved as shown in Fig. 5.19. These zeros comprise the LSP parameters. The name derives from the fact that each zero pair corresponds to a pole pair in the forward model, which lies on the unit circle. In turn, this pole pair would represent an undamped sinusoid that, in analog terms, would have a line spectrum. Since the zeros occur in complex conjugate pairs for both P and Q, only M unique zeros are needed to specify the model. The zeros are found by iterative search along the unit circle, taking advantage of the interleaving. Although the zeros are complex, their magnitudes are known to be unity, so that only a single real parameter (the frequency or angle) is needed to specify each one. In fact, coding the frequency differences between zeros has been found to be more efficient than coding the frequencies themselves, leading to a 30% improvement in efficiency over the use of log area ratio parameters. Another strategy involves the use of second-order filter sections to reconstruct the speech from the LSP parameters. In this case each section implements one zero pair, and it is sufficient to know only the cosine of the frequency of the pair. This is another way to reduce the dynamic range of the parameters and improve coding efficiency.

Finally, it is to be noted that the LSP parameters are interpretable in terms of the formant frequencies of the model. Each zero of A(z; m) maps into one zero in each of the polynomials P(z; m) and Q(z; m). If the two resulting zeros are close in frequency, it is likely that the "parent" zero in A(z; m) represents a formant (narrow bandwidth) in the model. Otherwise, the original zero is likely to represent a wide-bandwidth spectral feature.

5.4.2 Cepstral Parameters

Because the cepstral parameters are so widely used in the speech recognition domain, and because they are frequently derived from the LP parameters, we mention them here for emphasis and completeness. In Chapter 6 we will discuss the relationship between the cepstral and LP parameters. The reader may wish to glance ahead at (6.44) in Section 6.2.4 to preview the conversion formula.

5.5 Applications of LP in Speech Analysis

Throughout the remainder of the book, we will see the LP parameters play a central role in many coding, synthesis, and recognition strategies. In order to give some illustration of the use of LP in the real world while we are still in the analysis part of the book, we focus briefly on the related problems of pitch, formant, and glottal waveform estimation.

5.5.1 Pitch Estimation

In Section 4.3.1 we briefly discussed the possibility of using the short-term autocorrelation as a detector of pitch and indicated that this is seldom done because it is only slightly less expensive than more reliable methods. Some alternative methods were discussed there, including attempts to "prewhiten" the speech by "clipping" before computing the autocorrelation. The simple inverse filter tracking (SIFT) algorithm of Markel (1972) follows this basic strategy of prewhitening followed by autocorrelation, but the prewhitening step involves the use of the LP-based IF.
FIGURE 5.19. The interleaved zeros of the LSP polynomials P(z; m) and Q(z; m) on the unit circle of the z-plane.

The SIFT algorithm is diagrammed in Fig. 5.20. Initially, the digitized speech is lowpass filtered and decimated in order to suppress superfluous high-frequency content and reduce the amount of necessary computation. In Fig. 5.20, for example, a 10-kHz sampling rate is assumed on s(n), and the sequence is lowpass filtered to exclude frequencies above 800 Hz. The sequence is then decimated by a factor 5:1 to create an effective sampling rate of 2 kHz. To create an IF, a low-order analysis (M = 4) is sufficient, since we would expect no more than two formants in the nominal 1-kHz bandwidth remaining. The short-term LP analysis is typically done on rather small frames of speech (≈ 64 points) for good time resolution. Once the IF is created for a given frame (ending at, say, time m), the frame is passed through it to compute the residual, e(n; m).

FIGURE 5.20. Block diagram of the SIFT algorithm: speech s(n) → 800-Hz lowpass filter → downsampler → window (end time m) → LP analysis (autocorrelation method) → inverse filter A(z; m) → residual e(n; m) → autocorrelation → interpolation → pitch estimate at m.

FIGURE 5.21. Example signals arising in the SIFT algorithm [panels (a)-(f): time waveform (msec), lowpass log magnitude spectrum (0-1 kHz), residual, and related quantities].
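The SIFT steps just described can be sketched as follows. The synthetic voiced signal, the crude moving-average stand-in for the 800-Hz lowpass design, and the 50-400 Hz search range are illustrative assumptions:

```python
import numpy as np

def lp_coeffs(frame, M):
    """Autocorrelation-method LP coefficients a(1..M) for one frame."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(M + 1)])
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    return np.linalg.solve(R, r[1:])

fs = 10000                  # assumed 10-kHz input rate, as in Fig. 5.20
f0 = 125.0                  # "true" pitch of the synthetic voiced signal
n = np.arange(2048)
s = sum(np.cos(2 * np.pi * h * f0 * n / fs) / h for h in range(1, 30))

# SIFT: lowpass filter, 5:1 decimation to 2 kHz, order-4 LP analysis,
# inverse filtering, then autocorrelation of the residual.
s_lp = np.convolve(s, np.ones(8) / 8.0, mode="same")
d = s_lp[::5]
fs2 = fs // 5
a = lp_coeffs(d, 4)
e = d.copy()                # residual e(n; m) = d(n) - sum_i a(i) d(n - i)
for i, ai in enumerate(a, start=1):
    e[i:] -= ai * d[:-i]
r = np.correlate(e, e, mode="full")[len(e) - 1:]
lo, hi = fs2 // 400, fs2 // 50      # search a 50-400 Hz pitch range
lag = lo + int(np.argmax(r[lo:hi]))
pitch_hz = fs2 / lag
```

The interpolation stage of Fig. 5.20, which refines the coarse 2-kHz lag grid, is omitted here for brevity.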
The DFT of the zero-padded LP parameter sequence in (5.204) is

    X(k; m) = 1 − Σ_{n=1}^{M} a(n; m) e^{−j(2π/N)kn} = A(e^{j(2π/N)k}; m),  k = 0, 1, ..., N − 1.    (5.205)

Taking the magnitude and reciprocating each point gives

    |A(e^{j(2π/N)k}; m)|^{−1} = Θ_0^{−1} |Θ(e^{j(2π/N)k}; m)|,    (5.206)

which, for a sufficiently large N, yields a high-resolution representation of the (scaled) speech magnitude spectrum. From this spectrum, local maxima are found, and those of small bandwidths, and perhaps those restricted to certain neighborhoods for classes of phonemes, are selected as formants.
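A minimal sketch of this peak-picking procedure follows; the formant frequencies and bandwidths used to synthesize the frame are assumed values chosen for the demonstration:

```python
import numpy as np

# Build A(z; m) from two known resonances, then recover the formants by
# peak picking the reciprocal inverse-filter spectrum of (5.205)-(5.206).
fs = 8000.0
F_true = (700.0, 1800.0)                  # assumed formant frequencies (Hz)
poles = []
for F, bw in zip(F_true, (80.0, 120.0)):  # assumed bandwidths (Hz)
    r = np.exp(-np.pi * bw / fs)
    poles += [r * np.exp(2j * np.pi * F / fs), r * np.exp(-2j * np.pi * F / fs)]
A = np.real(np.poly(poles))               # [1, -a(1), ..., -a(M)]

N = 1024
X = np.fft.rfft(A, N)                     # (5.205): DFT of the zero-padded sequence
spec = 1.0 / np.abs(X)                    # (5.206): reciprocal IF spectrum
freqs = np.arange(len(spec)) * fs / N

peaks = [k for k in range(1, len(spec) - 1)
         if spec[k] > spec[k - 1] and spec[k] > spec[k + 1]]
peaks.sort(key=lambda k: -spec[k])        # strongest local maxima first
F_est = sorted(freqs[k] for k in peaks[:2])
```

In a real analysis the LP parameters would of course come from a frame of speech rather than from known poles; the known poles simply make the answer checkable.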
5.5.2 Formant Estimation and Glottal Waveform Deconvolution

Formant frequencies and bandwidths are principal analytical features of the speech spectrum. Moreover, they are appealing features because they are clearly related to the articulatory act and to the perception of speech. Accordingly, as we shall see later in our work, formant information is used extensively in coding and recognition of speech.

A problem closely related to formant estimation is that of estimating the time-domain glottal waveform. An understanding of the characteristics of the glottal dynamics is important, for example, in speech coding and synthesis (see Chapter 7), and in algorithms for laryngeal pathology detection (see the references below). In this subsection, we briefly examine the application of LP-based techniques to these two related problems of formant and glottal waveform estimation.

Formant Estimation by Spectral Methods

The first of two simple techniques for formant estimation is based on peak finding in an LP-derived magnitude spectrum. Papers of Atal and Hanauer (1971) and Markel (1972) first described a method of this type, although Schafer and Rabiner (1970) had earlier reported a spectral peak-picking method based on the cepstrum, which we will study in Chapter 6.

Recall that the short-term IF will approximate the inverse of the minimum-phase component of the speech system,

    A(z; m) = 1 − Σ_{i=1}^{M} a(i; m) z^{−i} = Θ^{−1}(z) ≈ Θ_min^{−1}(z).    (5.203)

It is to be noted that an FFT algorithm is employed to obtain (5.205) and that it is not necessary to actually take the reciprocal in (5.206) unless the spectrum is to be displayed. If the IF spectrum is used directly, local minima are sought to represent the formants.

Markel (1972) reports that peak picking of the reciprocal IF spectrum was successful at producing accurate estimates of formant frequencies about 90% of the time in experiments in which he tracked formants in flowing speech. This is a significant improvement over the accuracy that would be expected from an attempt to pick peaks from the unprocessed speech spectrum.

The procedure above involves the computation of the IF spectrum by effectively evaluating A(z; m) at equally spaced points on the unit circle. An enhancement to this procedure suggested by McCandless (1974) involves the evaluation of the spectrum on a circle of radius ρ < 1. This has the effect of making the valleys in the IF spectrum (peaks in the speech spectrum) more pronounced and easier to discern. This is especially important in cases in which two formants are very closely spaced in frequency. The bandwidths are, of course, distorted in this case. To carry out this approach in terms of the method described above, it is a simple matter of weighting the LP parameters by powers of ρ^{−1} before computing the FFT. Clearly, the DFT of

    {1, −ρ^{−1}a(1; m), −ρ^{−2}a(2; m), ..., −ρ^{−M}a(M; m), 0, 0, ..., 0}    (5.207)

is

    1 − Σ_{n=1}^{M} a(n; m) ρ^{−n} e^{−j(2π/N)kn} = A(ρe^{j(2π/N)k}; m),  k = 0, 1, ..., N − 1,    (5.208)

which is the IF spectrum evaluated on the ρ-circle, as required.
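The effect of the ρ-circle evaluation can be sketched as follows; the closely spaced resonance frequencies and bandwidths below are illustrative:

```python
import numpy as np

# Two closely spaced resonances: on the unit circle the reciprocal IF
# spectrum may show a single merged peak, while on the rho-circle (which
# passes nearer the poles) the two peaks separate cleanly.
fs, N = 8000.0, 2048
poles = []
for F in (1400.0, 1550.0):                       # close "formants" (Hz)
    r = np.exp(-np.pi * 150.0 / fs)              # ~150-Hz bandwidths
    poles += [r * np.exp(2j * np.pi * F / fs), r * np.exp(-2j * np.pi * F / fs)]
A = np.real(np.poly(poles))                      # A(z; m) coefficients
M = len(A) - 1

def count_peaks(spec):
    return sum(1 for k in range(1, len(spec) - 1)
               if spec[k] > spec[k - 1] and spec[k] > spec[k + 1])

rho = 0.95
spec_unit = 1.0 / np.abs(np.fft.rfft(A, N))      # evaluation on |z| = 1
# (5.207)-(5.208): weight coefficient i by rho**(-i), then take the DFT.
spec_rho = 1.0 / np.abs(np.fft.rfft(A * rho ** -np.arange(M + 1), N))
n_unit, n_rho = count_peaks(spec_unit), count_peaks(spec_rho)
```

As the text notes, the sharpened peaks come at the cost of distorted bandwidth estimates.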
The vocal system is assumed to be modeled by an L-order all-pole transfer function. We also assign the overall system gain, Θ_0, to H(z) for convenience,

    H(z) = Θ_0 / [1 − Σ_{i=1}^{L} h(i) z^{−i}].    (5.209)

We also assume that the lip radiation characteristic can be modeled by a simple differencer,

    R(z) = 1 − αz^{−1},    (5.210)

with α ≈ 1. It is clear from Fig. 5.22 that the speech follows the corresponding difference equation.

As we have noted, we are chiefly concerned with the first two of these steps. The third step, although theoretically trivial, involves a number of practical difficulties that are treated in the literature.

The single-channel method of Wong, Markel, and Gray for computing the CPIF is formulated in terms of our conventional notation as follows. Let N_w be a window length no longer than the expected size of a CP interval, say, for example, 20 points if the typical cycle period is 100 points. For the N_w-length window ending at time m, compute the L-order covariance solution, letting the average squared error be denoted ξ(m). The procedure is repeated over a range of m's covering at least one cycle of the speech. ξ(m) is monitored as a function of m, and the window producing the minimum is assumed to be the best candidate for computing the CPIF.
Picache (1988) suggests that the technique is not highly reliable without modifications. This is likely due to several basic reasons, all centered on the underlying assumptions. These include:

1. The assumption of a minimum-phase system. The "system," in this case, includes any system between the glottis and the samples in the computer memory. This is critical because any significant "smearing" of the temporal relationships of the frequency components of the glottal signal will render the method invalid. Berouti (1976) describes a method for compensating for the phase characteristic of the recording equipment, for example.

2. The basic assumption of a linear filter driven by the glottal waveform, including no significant coupling between the dynamics of the vocal tract and the larynx. Any deviation from this model destroys some very fundamental assumptions underlying the method. The work of Teager and Teager (1990) is very relevant to this point.

3. The existence of a closed phase in the glottal cycle. That this is not certain has been borne out by the two-channel studies described below.

One of the most difficult aspects of using the single-channel CPIF method is that the "answer" is unknown, and there is no basis for assessing what is apparently a "reasonable" result. Two-channel methods offer some help with this problem.

In two-channel approaches, the EGG signal is used as an indicator of the closed-phase region. Since most of the computational effort in single-channel approaches is in locating the CP, such approaches greatly increase the efficiency of the analysis. Further, the two-channel research of Larar et al. (1985) has cast doubt upon the covariance least-square error as an accurate indicator of the CP interval. Although faster and inherently more accurate, the use of an electroglottograph is not always practical, particularly in on-line vocoding schemes in which the glottal waveform estimate is needed. Therefore, the development of reliable single-channel methods remains an interesting problem. Picache (1988) has described an enhanced covariance-type algorithm that exhibits the potential for more accurate CP location as well as more efficient computation of the estimates. In the paper cited in the opening paragraph of this discussion, Milenkovic (1986) describes a single-channel approach that involves the joint estimation of the vocal-tract model and a linear model of the glottal source. This method is not a CP method and, in fact, shows potential to circumvent many of the problems inherent in assuming the existence of a CP.

5.6 Conclusions

We have been on a long and arduous journey through the concepts, computational methods, and some direct applications of LP analysis. We will continue to see LP analysis play a significant role in the remaining chapters, in endeavors ranging from coding to recognition to enhancement. To say that LP has been important in the development of modern speech processing is to drastically understate its significance. LP analysis has unquestionably become the premiere method for extracting short-term spectral information from speech. In fact, even when LP parameters are not directly used in a design, it is often the case that LP coefficients are first extracted, then converted to another parametric representation. Such is the case with the "cepstral" parameters, to which we turn our attention in the following chapter.

5.7 Problems

5.1. In Fig. 5.3(b), suppose that the impulse response, θ(n), which minimizes the mean square output error, E{ε²(n)}, has been found. Demonstrate that the desired model parameters, a(i), i = 1, 2, ..., are nonlinearly related to the impulse response sequence.

5.2. In this problem we provide some supportive details for the proof of Lemma 5.2.
(a) Demonstrate that any first-order polynomial of the form 1 − z_0 z^{−1} with |z_0| < 1 can be written as the ratio

    1 / [1 + Σ_{k=1}^{∞} z_0^k z^{−k}].    (5.214)

(b) Consider the following polynomial:

    A(z) = Σ_{j=0}^{∞} β(j) z^{−j}    (5.215)
         = [1 − Σ_{i=1}^{M} a(i) z^{−i}] ∏_{l=1}^{L} [1 + Σ_{k=1}^{∞} z_l^k z^{−k}],

in which L < ∞ and |z_l| < 1 for all l. Argue heuristically that for any ε > 0, there exists a J such that for all j > J, |β(j)| < ε. Hint: Begin by examining the case L = 1.

5.3. Demonstrate (5.26).

5.4. Derive the long-term LP equations (5.22) by solving Interpretive Problem 5.2.

5.5. Derive the long-term LP equations (5.22) by solving Interpretive Problem 5.3.
5.7 / Problems 345
5.7. Derive the long-term LP equations (5.22) by solving Interpretive Problem 5.5. (Note: The method of solution is suggested under the problem statement.)

    α^{l+k}(i) = { α^{l}(i),  i = 1, 2, ..., l
                 { 0,         i = l + 1, ..., l + k.    (5.218)

5.10. Solve Interpretive Problem 5.7, showing that the solution results in the covariance method equations (5.57).

5.11. Verify that (5.67) is equivalent to the covariance method LP solution, equation (5.58), resulting from the analysis of the short term of speech s(1), ..., s(N).

5.12. Return to Interpretive Problem 5.7, the covariance method problem, and solve for the linear predictor that minimizes the weighted squared error

    ξ(m) = (1/N) Σ_{n=m−N+1}^{m} λ(n)[s(n) − ŝ(n)]² = (1/N) Σ_{n=m−N+1}^{m} λ(n) e²(n; m).    (5.216)

5.16. Derive the expression

    A^{l}(z; m) = A^{l−1}(z; m) − K(l; m) z^{−l} A^{l−1}(z^{−1}; m),    (5.219)

which was necessary to deduce the lattice structure from the Levinson-Durbin recursion.

5.17. A key to the Itakura-Saito lattice is the computation of the reflection coefficients in a form that makes them interpretable as parcor coefficients,

    K(l; m) = [Σ_{n=m−N+1}^{m} e^{l−1}(n; m) b^{l−1}(n − 1; m)] /
              {[Σ_{n=m−N+1}^{m} (e^{l−1}(n; m))²]^{1/2} [Σ_{n=m−N+1}^{m} (b^{l−1}(n − 1; m))²]^{1/2}}.    (5.220)
5.19. (a) Given that the LΛL^T decomposition exists for a positive definite covariance matrix Φ_s(m), show that a Cholesky decomposition exists.
(b) Show that the converse is true.

5.20. Carefully show that the recursion for the inverse weighted covariance matrix, (5.145), follows from (5.143) and the matrix inversion lemma (Lemma 5.3).

5.21. En route to developing a "systolic array" version of the WRLS algorithm for solving the covariance equations, we established the upper triangular system of equations (5.160),

    T(N) a(N) = d_1(N).    (5.221)

(c) Now compute the DFT of the LP parameters as suggested in (5.204)-(5.206), and estimate the formant frequencies using the resulting spectrum.
(d) Repeat (c) using the McCandless procedure with ρ = 0.95.
(e) Estimate the formant frequencies by simply taking the DFT spectrum of the windowed speech itself.
(f) Discuss your results.

5.24. (a) We introduced the Berouti technique for glottal waveform deconvolution as a special case of the Wong method with the frame size, N_w, chosen as its minimum possible value, N_w = L, with L the order of the vocal-tract model. Show that in this case the covariance equations used on the window become exactly determined.

5.25. Suppose we wish the LP model to fit the speech spectrum only for the frequency range ω_a ≤ ω ≤ ω_b. That is, we want all M LP parameters to be dedicated to this region of the spectrum (with the rest of the spectrum simply ignored). Given a frame of speech on the range n = m − N + 1, ..., m, describe the steps of an autocorrelation-like method for estimating these parameters. (Hint: Consider computing the autocorrelation sequence r_s(0; m), ..., r_s(M; m) using a frequency-domain approach.)

5.26. (This problem is for persons who have studied the proof of Theorem 5.2 in Appendix 5.A.) Verify the upper companion form state space model for the speech signal given in (5.231) and (5.232) using the methods of Section 1.1.6.

5.27. (This problem is for persons who have studied the material in Appendix 5.B.) Give a simple proof of Corollary 5.1. The result should follow quite readily from the orthogonality principle.

APPENDIX

where all notation has been defined above except r_{e′s}, which is an obvious extension of our conventions. Substituting (5.229) into (5.24), we have

    â = a + Θ_0 R_s^{−1} r_{e′s},  so that  ã ≝ â − a = Θ_0 R_s^{−1} r_{e′s},    (5.230)

where ã denotes the error vector. The sufficiency argument is completed by showing that r_{e′s} is the zero vector if e(n) is orthogonal. A convenient framework within which to demonstrate this fact is the upper companion form state space model of the speech production system.

The speech production system has an upper companion form model (see Section 1.1.6)

    s(n + 1) = A s(n) + c e′(n)    (5.231)

and

    s(n) = b^T s(n + 1) + d e′(n),    (5.232)

where

    A = [ a(1)  a(2)  ⋯  a(M)
          I_{(M−1)×(M−1)}    0 ]    (5.233)

and c = (Θ_0 0 0 ⋯ 0)^T; b^T is equivalent to the top row of A, and d = Θ_0. Now, using this formulation, it is not difficult to show that

    s(n) = Σ_{q=0}^{n−q_0−1} A^q c b^T e′(n − q − 1) + A^{n−q_0} s(q_0)    (5.234)

for all n > q_0, in which q_0 is some time for which e′(q_0) = 0. Postmultiplying by e′(n) and applying the E operator yields (5.236). The right side is 0 when r_{e′}(η) = Cδ(η) for any constant, C, and it is easy to show that r_{e′s} then vanishes as well. This proves sufficiency.

[Necessity of e(n) uncorrelated.] Rewrite (5.236),

    r_{e′s} = [c  Ac  A²c  ⋯] [r_{e′}(1)  r_{e′}(2)  ⋯]^T.    (5.237)

Now the first M columns of the matrix in (5.237) constitute the controllability matrix, C, for the state space system which produces the speech. It is well known that, for the upper companion form model under consideration, C has full rank, M, and therefore r_{e′s}, which is of dimension M, can only equal the zero vector if

    r_{e′}(η) = 0,  η = 1, 2, ....    (5.238)
5.B / The Orthogonality Principle 351
THEOREM 5.6 (ORTHOGONALITY PRINCIPLE) A linear filter h(n), constrained to be zero except on n ∈ [n_1, n_2], is the unique minimum MSE filter, h_f(n), iff any random variable in the corresponding error sequence is orthogonal to the random variables in the input used in its computation,

    E{ε(n) q(n − η)} = 0,  η ∈ [n_1, n_2].    (5.240)

A proof of this important theorem can be found in many sources [e.g., (Papoulis, 1984)].

The proof is left to the reader as Problem 5.27.

We will encounter a very important application of this result in Chapter 8. In that instance, q(n) and x_1(n) will correspond to noise sequences that are highly correlated, and x_2(n) to a speech sequence that is uncorrelated with either noise signal. In terms of Fig. 5.23, we would like to estimate x_1(n) from q(n) in order to subtract it from y(n). In these terms, ε(n) can be interpreted as an estimate of x_2(n), the speech sequence. We encounter a problem with this approach, since we do not have access to the sequence x_1(n). It seems, therefore, at first glance, impossible to design the minimum MSE filter to estimate it. However, we also know (because of Corollary 5.1) that the filter will be identical to that produced by trying to estimate y(n) from q(n).^{41} From this point of view, we do have sufficient information to design the desired filter. It is interesting that, when the problem is approached in this way, we discover that the estimate of the speech is also a signal of minimum mean square (power).

FIGURE 5.23. Diagram of the general linear minimum MSE estimation problem.

^{41} The discussion is easily modified to accommodate the prediction of y(n) (meaning that at time n an estimate of y(n + k), k > 0, is desired), or the smoothing of y(n) [meaning that at time n an estimate of y(n − k), k > 0, is desired].
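The orthogonality property (5.240) is easy to verify numerically for a short FIR estimator; the mixing system below is arbitrary and purely illustrative:

```python
import numpy as np

# Solve for the length-3 minimum-MSE filter estimating y(n) from q(n)
# (by least squares over a long record), then check that the error is
# orthogonal to the inputs used: E{eps(n) q(n - eta)} ~ 0, eta = 0, 1, 2.
rng = np.random.default_rng(0)
N = 50000
q = rng.standard_normal(N)
v = rng.standard_normal(N)            # component unexplainable from q
y = np.zeros(N)
y[2:] = 0.8 * q[2:] - 0.5 * q[1:-1] + 0.3 * q[:-2]
y += 0.2 * v

Qmat = np.column_stack([q[2 - k:N - k] for k in range(3)])  # q(n), q(n-1), q(n-2)
yv = y[2:]
h_f, *_ = np.linalg.lstsq(Qmat, yv, rcond=None)             # minimum-MSE taps
eps = yv - Qmat @ h_f                                       # error sequence
orth = np.array([abs(np.mean(eps * Qmat[:, k])) for k in range(3)])
```

The recovered taps approach the true mixing coefficients, and the three sample cross-correlations in `orth` are zero to numerical precision, as (5.240) requires.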
6.1 / Introduction 353

CHAPTER 6

Cepstral Analysis

6.1 Introduction

…designing a system that would remove the unwanted high-frequency spectral energy. The result would then be transformed back into the time domain. Each of the operations taken to produce this filtered result is a linear one, so that the overall operation, say 𝒯, is linear. Only because x_1(n) and w(n) are combined linearly can we be confident that putting the signal x(n) into the filter will produce only the low-frequency part x_1(n); that is,

    𝒯{x(n)} = 𝒯{x_1(n) + w(n)} = 𝒯{x_1(n)} + 𝒯{w(n)} = x_1(n).    (6.2)

If the components were combined in some other way (e.g., convolution), we would generally have no clear idea of the filter's effects on the component parts.
The latter situation above is the case with the speech problem we address. "Cepstral" analysis is motivated by, and is designed for, problems centered on voiced speech. According to our speech production model, voiced speech is composed of a convolved combination of the excitation sequence with the impulse response of the vocal system model. We have access only to the output, yet we often find it desirable to eliminate one of the components so that the other may be examined, coded, modeled, or used in a recognition algorithm.

The elimination of one of two combined signals is, in general, a very difficult problem. However, engineers know a great deal about this type of problem.
We will refer to the former as the complex cepstrum (CC) and the latter as the real cepstrum (RC). The definition of the RC used in our work will make it equivalent to the even part of the CC on the region over which the RC is defined. In fact, there are various definitions of the RC, but all are equivalent to the real part of the CC within a scale factor. The reader is encouraged to glance at Table 6.1 for a preview of the notation and terminology.

The basic difference between the RC and the CC is that the early cepstrum discards phase information about the signal, while the homomorphic cepstrum retains it. Although the CC is more appealing in its formulation, and although the preservation of phase bestows certain properties that are lost with the RC, the CC is often difficult to use in practice, and it is the earlier version that is employed most widely in speech analysis and recognition. In fact, one of the most important applications of cepstral analysis in contemporary speech processing is the representation of an LP model by cepstral parameters. In this case, the signal parameterized is minimum phase, a condition under which the RC and CC are essentially equivalent. Unless the reader intends to use the cepstrum in phase-sensitive applications (e.g., vocoders), he or she may wish to study the RC carefully, the CC casually, and return to the details of the complex case as needed.

Our formal understanding of the weaknesses of cepstral techniques as they pertain to speech processing is based upon theoretical and experimental work with the CC. Since we have tried to structure this chapter so that the reader can study only the RC if desired, this creates a dilemma: to include this material in the context of the CC study would deprive some readers of it, yet treating it outside that context requires that less formality be used in the discussion. We have chosen to sacrifice the formality and describe these findings in a fairly qualitative way in the last section of the chapter. References to the key papers will provide the interested reader the opportunity to further pursue these issues.

6.2 "Real" Cepstrum

6.2.1 Long-Term Real Cepstrum

Definitions and General Concepts

As usual, we will find it convenient to begin our study with a long-term view of the cepstrum, avoiding some of the details of short-term analysis. Once we have laid down the foundations, the transition to the short-term computation will be quick and simple.
TABLE 6.1. Cepstrum Notation and Terminology Used Throughout the Chapter.

    Name                                  Notation for Signal x(n)   Relationship
    Complex cepstrum (CC)                 γ_x(n)
    Real cepstrum (RC)                    c_x(n)                     c_x(n) = γ_{x,even}(n)
    Short-term complex cepstrum (stCC),   γ_x(n; m)
      frame ending at m
    Short-term real cepstrum (stRC),      c_x(n; m)                  c_x(n; m) = γ_{x,even}(n; m)
      frame ending at m

The real cepstrum (RC) of a speech sequence s(n) is defined as

    c_s(n) ≝ 𝓕^{−1}{log |𝓕{s(n)}|} = (1/2π) ∫_{−π}^{π} log |S(ω)| e^{jωn} dω,    (6.6)

in which 𝓕{·} denotes the DTFT. Ordinarily, the natural or base-10 logarithm is used in this computation, but in principle any base can be used. We will assume the natural log throughout. Note carefully that the RC is an even sequence on n, since its DTFT, namely C_s(ω) = log |S(ω)|, is real and even. The computation of the RC is shown in block-diagram form in Fig. 6.1.

We have noted above that the cepstrum will be best understood if we focus on voiced speech.
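The computation in (6.6) can be sketched with a DFT standing in for the DTFT (N large); the test signal and the small floor inside the logarithm are illustrative safeguards, not part of the definition:

```python
import numpy as np

N = 4096
x = np.zeros(N)
x[:64] = np.hanning(64) * np.cos(2 * np.pi * 0.05 * np.arange(64))
S = np.fft.fft(x)
# c_x(n) = IDFT{ log|S| }; the 1e-12 floor guards against log(0) at deep nulls
c = np.real(np.fft.ifft(np.log(np.abs(S) + 1e-12)))
# the RC is even: c(n) = c(-n), i.e., c(n) = c(N - n) in DFT indexing
is_even = bool(np.allclose(c[1:N // 2], c[:N // 2:-1]))
```

The evenness check mirrors the remark above that C_s(ω) = log |S(ω)| is real and even, so its inverse transform must be an even sequence.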
356 Ch. 6 / Cepstral Analysis
6.2 / "Real" Cepstrum 357

FIGURE 6.1. Computation of the RC: voiced speech s(n) → DTFT → log |·| → IDTFT → c_s(n); the intermediate quantity is log |S(ω)| = C_s(ω).

Since we are therefore dealing with power (periodic) signals here, the reader might have detected a potential problem in this "long-term" definition: the DTFT of the signal s(n) does not theoretically exist. However, we can make use of the "engineering DTFT" for power signals, which was reviewed near (1.30). In this case "S(ω)" is taken to mean

    S(ω) = 2π Σ_{k=−∞}^{∞} D(k) δ_a(ω − k 2π/P),    (6.7)

in which δ_a(·) is the analog impulse function, the D(k) are the DFS coefficients for the sequence s(n), and P is its period (pitch period). We will work with a windowed version of the sequence, using a rectangular window of length L + 1 for some arbitrarily large, even integer, L. The DTFT of a hypothetical long-term speech sequence and the DTFT of its windowed version are illustrated in Fig. 6.2(a) and (b), respectively. In the windowed spectrum, the amplitude of the spectral peak at the kth harmonic frequency is [(L + 1)/P] |Θ(2πk/P)| for any k; this becomes apparent upon convolving the spectrum in Fig. 6.2(c) with the hypothetical long-term spectrum shown in part (a). Therefore, the amplitudes of the spectral peaks grow proportionately with L. It is also clear from the convolution that the width of each "main lobe" in the spectrum (one centered on each harmonic frequency) decreases with L. Further, as L becomes large, it can be shown that the energy (area under the squared magnitude spectrum) associated with a small neighborhood around the kth harmonic frequency becomes [2π(L + 1)/P²] |Θ(2πk/P)|². Therefore, for large L, the DTFT S(ω) approximates a set of impulses at the harmonics, and the power associated with the kth harmonic is approximately [2π/P²] |Θ(2πk/P)|². We can therefore imagine that we have replaced a truly long-term analysis with one that approaches the exact case as closely as desired. This will allow us to work with DTFTs and avoid some cumbersome details.
verify momentarily that
Now a moment's thought will re veal that, wit h the possible exception
of a few poi nts near the ends of the win dow, th e same windowed speech
D (k) = ; e(k~7I:» (6.8) seq ue nce will be obtain ed if we wind ow th e excitation seq uence, th en
drive the voc al system model with the windowed e(n) .3 The model we
where 19(w) is the Fourier transfer function of the vocal system model. will use, therefore, for th e generation of the long-term speech signa l is
The use of this "power type" DTFT would seem to be the proper theoret one of ha ving driven th e vocal system with a windowe d versio n of th e ex
ical course of action in this case, but , as sometimes happens, the pres citati on sequen ce. Th e window may be assumed to be a rbit ra rily large
ence of impulse fun ctions would lead us into som e significant theoretical and is on ly a devi ce to remove th e troublesome impulse fun cti on s. We
difficulties. Since the point of long-t erm analysis is to elucidate and mo therefore will not em ploy any sp ecial notation to den ote that th ere is a
tivate, rather than to confuse, we make th e fOllowing small adjustment, lon g window involved, since the "windowed" versions ma y be mad e to
which will circumvent th e problem s with impulse fun ctions. The source arbit ra rily closely approxi mate the truly lon g-term version s.' R ath er, we
o f th e impUlses is th e periodic exc ita tio n seq uence e(n) for whi ch the will continue to call the excitation e(n) and th e speec h sen). The "tilde"
DTFT is
not ation in Fig. 6.2 and in the pr evious paragraph was only a tem porary
dev ice to distin guish the two spectra .
/ 71:
E (w ) =::"-'-
P
L
k=- oo
00 (.w -k-271:) .
co
P (6.9)
Let us now return to the computation of th e RC. T he set of operation s
leading to cs(n) is depicted in Fig. 6. 1. The first t wo o perati ons in the
Since figure can be interpreted as an attempt to tran sform th e signal s(n ) into
a "linea r" domain in the sense that the two parts of the signal which are
Sew) "" E(w) l9(w), 2Recall th at a consistent theory of impulse fu nctions requires that f (fJ)J(a) be defi ned to
(6.10)
mean j (a)o( a) for a conti nuous function, I. of fl.
we see that
2
(6.7) with coefficients give n by (6.8) is indeed th e correct
. 3V!e will later see that th is ab ility to assum e that the window has been app lied to the ex
DTFT for Sew). Let us imagine th at we apply a very long rectangular citatio n, rather than the speech itself, is cri tical to some of the sho rt-ter m spectral
window to the speech, say argu ments.
_T.. ' The rea?e r may feel that we have already entered the realm of short -term pr ocessing
1, z< n<
- !:
2 because a wl !1do~ has been applied. It should be emp hasized here that the essence of long.
we n) "" !e r~ processing 1S the assumption t~at . th e signal is stati onary for all time . Th is assum ption
{ (6. J J) 1S III elTe<:t here, and the window IS Just a conce ptual device to remove so me confusing
0, othe r n, math emat ics,
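The pipeline of Fig. 6.1 can be sketched numerically by standing in the DFT for the DTFT. The sketch below is our own illustration (the function name and FFT size are arbitrary choices, not from the text); it also confirms the evenness of the RC noted after (6.6).

```python
import numpy as np

def real_cepstrum(s, nfft=256):
    """RC via the Fig. 6.1 pipeline: DFT -> log|.| -> inverse DFT.
    The DFT stands in for the DTFT; natural log assumed, as in the text."""
    S = np.fft.fft(s, nfft)
    # A small floor guards against log(0) at exact spectral nulls.
    log_mag = np.log(np.maximum(np.abs(S), 1e-12))
    # log|S| is real and even, so its inverse DFT is real and even as well.
    return np.fft.ifft(log_mag).real

# The RC of any real signal is even on n (modulo the DFT length).
x = np.random.default_rng(0).standard_normal(64)
c = real_cepstrum(x)
print(np.allclose(c[1:], c[:0:-1]))  # True: c(n) = c(-n)
```

With `nfft` much larger than the signal length, the samples approach the DTFT-based RC of (6.6); the aliasing behind this choice is taken up in the short-term discussion later in the chapter.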
FIGURE 6.2. (a) Hypothetical long-term speech spectrum |S(ω)| (shown with an unrealistically long pitch period, P). (b) Spectrum |S̃(ω)| of the same hypothetical signal after application of a "very long" (large L) rectangular window; the peak height at the kth harmonic is [(L+1)/P]|Θ(2πk/P)|. (c) Magnitude spectrum |W(ω)| of the rectangular window, with peak height L+1 and first zero at 2π/(L+1).

Let us denote the operation corresponding to these first two blocks by Q_real{·}; the subscript indicates that the log of a real number is taken (this will not be the case with the CC). We have, then,

    C_s(ω) ≝ Q_real{s(n)} = log|S(ω)|
                          = log|E(ω)Θ(ω)|
                          = log|E(ω)| + log|Θ(ω)|
                          = C_e(ω) + C_θ(ω).    (6.12)

Now that we are in a "linear domain," we are free to apply linear techniques to the new "signal," C_s(ω). In particular, we might wish to apply Fourier analysis to view the "frequency domain" properties of the new "signal." Upon recognizing that C_s(ω) is a periodic function, the appropriate thing to do is to compute a line spectrum for the "signal," that is, to compute Fourier series coefficients for the "harmonics" of the "signal." These would take the form (see Problem 6.1)

    c_s(n) = (1/2π) ∫_{−π}^{π} C_s(ω) cos(ωn) dω.    (6.13)

Now, according to the definition,

    c_s(n) = (1/2π) ∫_{−π}^{π} C_s(ω) e^{jωn} dω,    (6.14)

but C_s(ω) is a real, even function of ω, so that (6.13) and (6.14) produce equivalent results. Therefore, the RC can be interpreted as the Fourier series "line spectrum" of the "signal" C_s(ω).

We have been careful to put quotation marks around terms that are being used in an unusual manner. The "signal" that is being transformed into the "frequency domain" is, in fact, already in what we consider the frequency domain in usual engineering terms. Therefore, the "new" frequency domain was dubbed the "quefrency domain" by Tukey in the earlier work on the RC, and the "cepstrum" was so named because it plays the role of a "spectrum" in the quefrency domain. The index of the RC (which actually is a discrete-time axis) is called the "quefrency axis." The "harmonic frequencies" of C_s(ω), which are actually time indices of the cepstrum, are called rahmonics. There is an entire vocabulary of amusing terms of this sort.

The strategy of the RC, then, is to transform the two convolved components of the speech, e(n) and θ(n), into two additive components, and then to analyze those components with spectral (cepstral) analysis. Clearly, from (6.12),

    c_s(n) = c_e(n) + c_θ(n),    (6.15)

and if the nonzero parts of c_e(n) and c_θ(n) occupy different parts of the quefrency axis, we should be able to examine them as separate entities, which we were unable to do when they were convolved in s(n).

Intuitive Approach to the Cepstrum (Historical Notes)

We have essentially captured above the thought process that led the inventors of the RC to its discovery. However, their early work is based more on engineering intuition than on mathematical formality. The clause above, "if the nonzero parts of c_e(n) and c_θ(n) occupy different parts of the quefrency axis, we should be able to examine them as separate entities," is the key to the early thinking. Noll (1967) was the first to apply and extend the cepstral notions to speech (pitch detection), but his ideas were based on earlier work on seismic signals by Bogert et al. (1963). The history is quite interesting and is related in Noll's paper, in which Tukey is credited with the invention of the cepstrum vocabulary set. With this due credit to the earlier researchers, we will discuss the historical developments from Noll's speech viewpoint, taking some liberties to pose Noll's ideas in our digital signal processing framework.

Viewing the speech spectrum, |S(ω)|, as consisting of a "quickly varying" part, |E(ω)|, and a "slowly varying" part, |Θ(ω)| (see Fig. 6.3), Noll simply took the logarithm

    log|S(ω)| = log|E(ω)| + log|Θ(ω)|    (6.16)

to get these two multiplied pieces into additive ones. The reason for wanting to get two additive pieces was to apply a linear operator, the Fourier transform, knowing that the transform would operate individually on two additive components, and further, knowing precisely what the Fourier transform would do to one quickly varying piece and one slowly varying piece. Noll was thinking, as we did above, of the two signals as "time" signals, one "high-frequency," one "low-frequency," knowing that the "high-frequency" signal would manifest itself at big values of frequency in the "frequency domain," and that the "low-frequency" signal would appear at smaller values of "frequency." Since, in fact, the two "signals" were already in the frequency domain, the new vocabulary was employed, including the word quefrency to describe "frequencies" in this new "frequency domain." Some of the vocabulary is illustrated in Fig. 6.3.

FIGURE 6.3. The motivation behind the RC, and some of the accompanying vocabulary. (a) In the speech magnitude spectrum, |S(ω)|, two components can be identified: a "slowly varying" part (envelope) due to the speech system, |Θ(ω)|, and a "quickly varying" part due to the excitation, |E(ω)|. These components are combined by multiplication (a nonlinear combination). Their time domain counterparts, θ(n) and e(n), are convolved. (b) Once the logarithm of the spectral magnitude is taken, the two convolved signal components, θ(n) and e(n), have additive correlates in the new "signal," C_s(ω) = log|S(ω)| = log|E(ω)| + log|Θ(ω)| (a linear combination). The former corresponds to a slowly varying ("low-quefrency") component of C_s(ω), and the latter to a quickly varying ("high-quefrency") component. (c) When the IDTFT is taken, the slowly varying part yields a "cepstral" component at low quefrencies (smaller values on the time axis), and the component with fast variations results in a "cepstral" component at high quefrencies (larger values on the time axis), near n = P, 2P, 3P, .... The low-quefrency part of the cepstrum therefore represents an approximation to the cepstrum of the vocal system impulse response, c_θ(n), and the high-quefrency part corresponds to the cepstrum of the excitation, c_e(n).
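The additive decomposition (6.15) is easy to check numerically. If toy "excitation" and "vocal system" sequences are combined by circular convolution, their DFTs multiply exactly, so their real cepstra add exactly. The sequences below are our own stand-ins for e(n) and θ(n), not speech data.

```python
import numpy as np

N = 512

def rc(x):
    # DFT-based real cepstrum (no spectral nulls occur for these signals)
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x, N)))).real

e = np.zeros(N)
e[0], e[3] = 1.0, 0.5                    # impulse plus echo: a toy "excitation"
theta = 0.9 ** np.arange(N)              # decaying toy "vocal system" response
s = np.fft.ifft(np.fft.fft(e) * np.fft.fft(theta)).real  # circular convolution

# log|S| = log|E| + log|Theta|  =>  c_s(n) = c_e(n) + c_theta(n), as in (6.15)
print(np.allclose(rc(s), rc(e) + rc(theta)))  # True
```

The separation of the two terms along the quefrency axis, as opposed to their mere additivity, is the part that cannot be guaranteed in general; that caveat is discussed next.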
Before proceeding, we should note a potential source of problems that is often overlooked in discussions of the cepstrum. This "intuitive" level is a good place to flag the issue.⁶ If the reader had a sense of uneasiness upon arrival at (6.16), it is probably because of the lack of intuition that accompanies this development: through the good workings of the logarithm, we see the "quickly varying" and "slowly varying" components added together. We must pause, however, and ask: Are these two additive components still high and low quefrency even though we have subjected them to this nonlinear transformation? An affirmative answer is, of course, critical to the proper operation of the technique. The correct answer is, however, that the components do not always remain sufficiently well separated in quefrency, even though they may appear to be so before the logarithm is taken. The predictability of this adverse phenomenon is not high because of the nonlinearity.

We will return to the important "separation after the log" issue in the last section of the chapter. In the meantime, the reader is urged to keep in mind that cepstral techniques, from the intuitive conceptual level through the most formal developments, are based on a few tenuous assumptions that can cause unpredictable outcomes. We will point out such problems where we can in our discussion of these interesting techniques. Nevertheless, they are there, and the reader should keep this in mind as he or she applies the methods.

6 These comments apply equally well to the CC.

Liftering

Before moving on to a more practical discussion about the RC, let us explore the possibility of doing linear filtering in our new linear domain to select one or the other of the well-separated parts. Since c_e(n) and c_θ(n) are well separated in the quefrency domain, we can, in principle, use the RC to eliminate, say, C_e(ω) from C_s(ω). This process is called liftering (a play on filtering), and in this case we use a "low-time lifter" as in Fig. 6.4 (analogous to a lowpass filter in the usual frequency domain). The output of this process in the quefrency domain is an RC, say ĉ_s(n) ≈ c_θ(n). If we now wish to use this estimate ĉ_θ(n) to obtain an estimate of θ(n) neatly separated from s(n), we need to get out of the quefrency domain, then invert the Q_real operation. Leaving the quefrency domain is easy; we simply apply a DTFT to the RC. Note that this process results in an estimate of log|Θ(ω)| devoid of any excitation components. This process is called cepstral smoothing of the vocal system spectrum, an issue we will study in more detail later. The IDTFT, low-time liftering, and DTFT operations comprise a linear filter operation, say ℒ, in the "new" linear domain that was created by the Q_real operation.

FIGURE 6.4. "Low-time liftering" to remove C_e(ω) from C_s(ω). [Block diagram: s(n) → DTFT → log|·| → IDTFT → c_s(n); the cepstrum is multiplied by a lifter l(n) chosen to remove c_e(n), leaving ĉ_θ(n) ≈ c_θ(n); a DTFT then converts the new "signal" with linearly combined components into an estimate of log|Θ(ω)|.] Note that c_s(n) is an even-symmetric sequence and, consequently, l(n) should be even; only the positive-time parts are illustrated.

When we attempt to return to the original time domain (invert Q_real), we encounter a problem. In applying the Q_real operation, we have discarded the phase spectrum of the original signal. Not surprisingly, therefore, the Q_real operation is not uniquely invertible, since the phase information is irretrievable. Based on the form of Q_real, however, the inverse

    [Q_real]⁻¹{·} ≝ ℱ⁻¹{ exp[·] }    (6.17)

might be proposed, but it is not difficult to see that this will return a time domain signal with a zero-phase characteristic (which is necessarily noncausal). A second candidate for [Q_real]⁻¹, which yields a minimum-phase characteristic, will emerge when we discuss the complex cepstrum (also see Problem 6.2).

We conclude from this discussion that liftering is a useful and meaningful process with the RC for obtaining an estimate of the log spectrum of either of the separated components. That is, we can apply a useful linear operation to the RC. However, if the objective is to return to the original time domain with an estimate of the separated signal, the RC will fail, because its "linearizing" operation is not invertible. To complete this task, we would need a phase-preserving linearizing operation. We will find such an operation when we discuss the complex cepstrum. However, we should not leave the reader with the impression that returning to the original time domain is always the objective; frequently the liftered log spectrum itself is the desired result.
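The zero-phase behavior of the proposed inverse (6.17) can be seen in a few lines: exponentiating the DFT of the RC recovers |S(ω)| exactly, but the returned sequence is even (zero-phase) rather than the original. The test signal below is an arbitrary choice of ours.

```python
import numpy as np

N = 256
x = np.zeros(N)
x[:3] = [1.0, -2.0, 0.5]          # a causal, nonsymmetric test signal

# Q_real: take the log magnitude of the spectrum (the phase is discarded).
c = np.fft.ifft(np.log(np.abs(np.fft.fft(x)))).real       # the RC
# Candidate inverse (6.17): exponentiate and transform back.
y = np.fft.ifft(np.exp(np.fft.fft(c))).real

print(np.allclose(np.abs(np.fft.fft(y)), np.abs(np.fft.fft(x))))  # True: |S| is recovered
print(np.allclose(y, x))                   # False: the original phase is gone
print(np.allclose(y[1:], y[:0:-1]))        # True: y is zero-phase, hence even
```

As the text observes, a minimum-phase reconstruction is also possible; it appears when the complex cepstrum is discussed.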
6.2.2 Short-Term Real Cepstrum

Recall that the long-term RC is

    c_s(n) = (1/2π) ∫_{−π}^{π} [ log | Σ_{l=−∞}^{∞} s(l) e^{−jωl} | ] e^{jωn} dω    (6.18)

for all n. By a simple generalization of Construction Principle 1 in Chapter 4, we derive the following short-term "real" cepstral (stRC) estimator, c_s(n; m), for the N-length frame of speech ending at time m, f(n; m) = s(n)w(m − n):

    c_s(n; m) = (1/2π) ∫_{−π}^{π} [ log | Σ_{l=−∞}^{∞} f(l; m) e^{−jωl} | ] e^{jωn} dω
              = (1/2π) ∫_{−π}^{π} [ log | Σ_{l=m−N+1}^{m} f(l; m) e^{−jωl} | ] e^{jωn} dω    (6.19)

for n = 0, 1, .... This amounts to nothing more than using f(n; m) as input to our "usual" algorithm represented by Fig. 6.1. Accordingly, we should replace the DTFT in Fig. 6.1 by the stDTFT. In Fig. 6.5, we have redrawn Fig. 6.1 for this case to feature the fact that the short-term Fourier transform appears in the intermediate computations. This will be important momentarily.

Before proceeding, it is important to reemphasize that the RC, in this case the stRC, makes use of magnitude spectral information only, and disregards all phase information. In particular, the information about the window delay m is discarded; when the computation is carried out with an N-point stDFT followed by an IDFT (Fig. 6.6), the resulting cepstrum is based on the range n = 0, 1, ..., N − 1, regardless of the window time m. To use the stIDFT would restore the phase information about the window delay and produce a cepstrum on the range n = m − N + 1, ..., m, but no other phase information about the signal would, of course, be restored. It is conventional to use the stRC on the low-time range, so the IDFT is employed. In using the result, we are ordinarily aware of the "m" value associated with a particular computation, and we learn to interpret cepstra based at time zero.

By comparing Figs. 6.5 and 6.6 and using our knowledge of the relationship between the IDTFT and IDFT, we see that

    c̃_s(n; m) = { Σ_{q=−∞}^{∞} c_s(n + qN; m),  n = 0, 1, ..., N − 1
                { 0,                            other n,    (6.20)

where c̃_s(n; m) denotes the result using DFTs, so that c̃_s(n; m) is a periodic, aliased version of the "true" quantity we seek, c_s(n; m). Since the stRC will be of infinite duration (a fact that can be inferred from the material in Section 6.3.1), some aliasing is inevitable. In order that the aliasing not be too severe, one can append zeros onto the speech frame f(n; m) and compute the stDFT, IDFT pair based on more points. It is often necessary to use a significant number of zeros (to extend the effective frame length to, say, 512 or 1024 points) to avoid aliasing.⁷

FIGURE 6.5. Computation of the stRC using the DTFT. [Block diagram: s(n) multiplied by w(m − n) → stDTFT → log|·| → IDTFT → c_s(n; m).]

FIGURE 6.6. Computation of the stRC using the DFT. [Block diagram: s(n) multiplied by w(m − n) → stDFT → log|·| → IDFT → c̃_s(n; m).]

7 An alternative method for alias-free computation of the CC has been proposed by Tribolet (1967).
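The DFT route of Fig. 6.6, with zeros appended to push down the aliasing of (6.20), can be sketched as follows. The frame length, window, and FFT size are illustrative choices of ours, and the "speech" is a synthetic pulse train driven through a one-pole system rather than real data.

```python
import numpy as np

def strc(s, m, N=256, nfft=1024):
    """stRC of the N-point frame of s ending at time m (Fig. 6.6).
    Choosing nfft > N zero-pads the frame, reducing the aliasing in (6.20)."""
    frame = s[m - N + 1 : m + 1] * np.hamming(N)
    S = np.fft.fft(frame, nfft)                     # stDFT of the padded frame
    return np.fft.ifft(np.log(np.maximum(np.abs(S), 1e-12))).real

# Synthetic voiced speech: pulses every P = 80 samples through a one-pole system.
speech = np.zeros(4096)
speech[::80] = 1.0
for n in range(1, len(speech)):
    speech[n] += 0.95 * speech[n - 1]

c = strc(speech, m=2047)
peak = 40 + int(np.argmax(c[40:200]))               # search away from low quefrencies
print(peak)                                         # lands at the pitch period, n = 80
```

The strong cepstral peak at the pitch period anticipates the pitch estimation application taken up in the next section.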
In future discussions, we will not continue to distinguish the RC computed using the DFT from that computed using the DTFT by the "tilde" notation. In fact, most of our study is based on the more abstract DTFT, although practical implementation requires the use of the DFT. The reader is urged to keep in mind the potential aliasing problem when using the practical computation.

6.2.3 Example Applications of the stRC to Speech Analysis and Recognition

In this section we examine three applications of the stRC. The first two involve ideas that we will see used repeatedly in the speech recognition chapters of Part V, and also in the material on vocoding in Chapter 7. The third application is among the most important uses of the cepstrum in modern speech processing. In the first two of these applications, the comments made at the end of the "Historical Notes" in Section 6.2.1 are particularly relevant. The reader might wish to review those few paragraphs before reading these applications.

Pitch Estimation

Cepstral analysis offers yet another way to estimate the important fundamental frequency parameter. We have given heuristic arguments above that the RC should do a good job of separating the excitation and vocal system components of the speech in the quefrency domain. In this section we will formalize that argument and show how this information leads to a convenient pitch detection technique.

The voiced speech signal is modeled in the time domain as the convolution of the pulse train excitation and the impulse response of the vocal system,

    s(n) = e(n) * θ(n).    (6.21)

We consider a frame of speech ending at time m:

    f_s(n; m) = s(n)w(m − n) = [e(n) * θ(n)]w(m − n).    (6.22)

Under conditions explored below, the frame is approximately equivalent to the convolution of a windowed excitation with the vocal system impulse response,

    f_s(n; m) ≈ [e(n)w(m − n)] * θ(n) = f_e(n; m) * θ(n).    (6.23)

This key approximation was first formalized by Oppenheim and Schafer (1968). We elaborate a bit upon their work here. According to (5.2) and (6.21),

    s(n) = Σ_{q=−∞}^{∞} θ(n − qP),    (6.24)

where P is the pitch period (in norm-sec) or the spacing of unit samples in e(n). If the time constants of θ(n) are short compared with the pitch period, that is, if

    θ(n) = 0,  n ≥ P′,    (6.25)

where P′ ≤ P, then over any single period, say n = rP, rP + 1, ..., (r + 1)P − 1,

    s(n) = θ(n − rP).    (6.26)

Now consider the effect of the window over this period,

    f_s(n; m) = θ(n − rP)w(m − n),  n = rP, rP + 1, ..., (r + 1)P − 1.    (6.27)

If the window is long and tapers slowly compared with P′, the duration of θ(n), so that

    w(m − n) ≈ w(m − rP),  n = rP, rP + 1, ..., rP + P′ − 1,    (6.28)

then we can write

    f_s(n; m) ≈ θ(n − rP)w(m − rP),  n = rP, rP + 1, ..., (r + 1)P − 1.    (6.29)

[Note that we have used the fact that θ(n − rP) = 0 on the range n = rP + P′, ..., (r + 1)P − 1 in expanding the time range between (6.28) and (6.29).] Putting the periods back together, we have

    f_s(n; m) ≈ Σ_{q=−∞}^{∞} θ(n − qP)w(m − qP),    (6.30)

which is just the key approximation (6.23).
Now, using the key approximation in (6.23), we seek the stRC of the speech. We do so by noting that the RC of f_s(n; m) will be the sum of the RCs of its two convolved components, f_e(n; m) and θ(n).⁹ First we seek the stRC of the excitation [which is equivalent to the RC of the frame f_e(n; m)]. If there are Q periods of e(n) inside the window used to create f_e(n; m), and if those periods correspond to indices q = q₀, q₀ + 1, ..., q₀ + Q − 1, then

    E(ω; m) = Σ_{q=q₀}^{q₀+Q−1} w(m − qP) e^{−jωqP}.    (6.31)

Now let us define the sequence

    ŵ(q) ≝ { w(m − qP),  q = q₀, ..., q₀ + Q − 1
           { 0,          other q,    (6.32)

so that E(ω; m) = Ŵ(ωP), where Ŵ(·) denotes the DTFT of ŵ(q). Since Ŵ is 2π-periodic in its argument, we conclude that E(ω; m), hence log|E(ω; m)|, will be a periodic function of ω with period 2π/P. The precise form of the log spectrum will depend on the numbers w(m − [q₀ + Q − 1]P), ..., w(m − q₀P), but the most important point to note is the periodicity of the short-term log spectrum, log|E(ω; m)|.

9 There is a subtlety here with which we must be careful. In effect we are seeking the short-term RC of the excitation, and the long-term RC of the impulse response. The stRC of e(n), however, is equivalent to the long-term RC of the frame f_e(n; m) (a common feature extraction occurrence), so that the statement is correct as written.

As in the long-term case [see the discussion surrounding (6.14)], c_e(n; m) can be interpreted as Fourier (cosine) series coefficients for the periodic function log|E(ω; m)|. Since the period of log|E(ω; m)| is 2π/P, "harmonics" (or "rahmonics," as Tukey calls them, since they are in the quefrency domain) occur at times n = i·2π/(2π/P) = iP, i = 0, 1, .... Therefore, the Fourier series computation will produce a result of the form

    c_e(n; m) = Σ_{i=0}^{∞} αᵢ δ(n − iP),    (6.36)

a train of weighted, discrete-time impulses spaced P apart, where, for example,

    α₀ = c_e(0; m) = (1/2π) ∫_{−π}^{π} log|E(ω; m)| dω.    (6.37)

Let us now focus on the vocal system impulse response and discuss its RC. Recall that, because of the theoretical discussion leading to (6.30), we will be seeking the long-term RC for θ(n) here. First we note, similarly to (6.37), that

    c_θ(0) = (1/2π) ∫_{−π}^{π} log|Θ(ω)| dω.    (6.38)

For n ≠ 0, it can be shown that the CC of the vocal system model obeys the envelope bound

    |γ_θ(n)| ≤ (Q_in + P_in + Q_out) β^{|n|}/|n|,  n ≠ 0,    (6.39)

where Q_in (P_in) represents the number of zeros (poles) of Θ(z) inside the unit circle, Q_out is the number of zeros outside the unit circle, and β is the maximum magnitude of the set of all zeros and poles inside the unit circle. We can argue that the RC will also be inside this envelope as follows. The RC is equivalent to the even part of the CC. Suppose that for some n > 0, the RC exceeds the envelope, for example, in the negative direction. Then the odd part of the CC must be sufficiently large so that the CC at n will be inside the envelope. However, at −n, where the RC will also be "too large" in the negative direction, the odd part will also be negative and the sum (CC) will be outside the envelope. From this we conclude that the RC also decays at least as fast as the 1/n envelope. Clearly, this envelope will usually decay quickly with respect to typical values of the pitch period P. Putting these results together, we have that

    c_s(n; m) = c_e(n; m) + c_θ(n),    (6.40)

from which we conclude that c_e(n; m) will appear in c_s(n; m) as a pulse train added to the RC of θ(n). Further, c_θ(n) usually decays very quickly with respect to P, as is apparent in Fig. 6.7, so that

    c_s(n; m) ≈ { c_e(0; m) + c_θ(0),  n = 0
                { c_θ(n),              0 < n < P
                { c_e(n; m),           n ≥ P.    (6.41)

FIGURE 6.7. (a) A frame of speech consisting of the phoneme /i/ selected using a rectangular window. (b) stRC of the frame. The voiced excitation manifests itself as a weighted train of discrete-time impulses in the stRC, while the vocal system impulse response is represented by the low-quefrency, quickly decaying portion of the stRC.

A typical analysis might use frames of N = 256 points. For a sample rate of 10 kHz, this implies the use of overlapping windows of length 256, which are moved 100-200 points each estimate.

Figure 6.8 shows an example of cepstral pitch detection taken from the work of Noll (1967). The plots on the left (in each case, male and female) are a series of short-term log spectra derived by using an N = 400-point Hamming window moved 100 points each computation (so that the frames might be indexed m = 100, 200, ...). The sampling rate on the original data is F_s = 10 kHz. The plots on the right in each case show the stRC according to Noll's definition, which differs from the stRC we have been discussing above by a square and a scale factor (see Problem 6.6). The lack of a clearly defined pitch peak in the first seven cepstra indicates estimation difficulties.
In fact, the height of the cepstral peak depends upon many factors, notably the size, shape, and placement of the window creating the analysis frame (see Problem 6.3). As an extreme example, suppose that the pitch period is quite long and the window size is chosen so that only one period or less appears in a given frame. In this case Q = 1 in the discussion surrounding (6.31), and the reader is encouraged to review the discussion in this light to conclude that c_e(n; m) will no longer consist of a pulse train. Another critical factor is the formant structure of the vocal system. If, for example, the vocal system is essentially a narrowband filter, the periodic component in the spectrum will be masked by the formant filter and no peaks will occur in the stRC. Before attempting to use the method in practice, therefore, the reader is advised to peruse the literature on the subject (Noll, 1967; Rabiner, 1977; Schafer and Rabiner, 1970) to explore the various algorithmic enhancements that have been employed to overcome such difficulties. Finally, let us recall what we can and cannot do with liftering of the stRC in this regard. (Recall the discussion surrounding Fig. 6.4.) Although there is nothing that prevents liftering of the stRC to remove the low-quefrency vocal system component before searching for a pitch peak, such an extra step would seldom serve the purpose of the analysis. Ordinarily, the purpose is the estimation of pitch, for which the stRC itself is adequate.

FIGURE 6.8. log|S(ω; m)| and c_s(n; m) for a series of m's for a male and a female talker in an experiment performed by Noll. (The definition of the stRC used by Noll differs from our definition by a square and a scale factor, but the general information provided is the same.) After Noll (1967).

Formant Estimation

In 1970, Schafer and Rabiner described the process of cepstral smoothing for formant analysis of voiced speech. Although their work is centered on the more formal CC, the same procedure can be achieved with the RC. In fact, we have already discussed the basic procedure of cepstral smoothing in Section 6.2.1. What we can now add to that discussion is the introduction of the more realistic short-term RC and a more formal discussion of the separability of the cepstral components. We emphasize that this technique will produce an estimate of Θ(ω) (the total model transfer function), not just H(ω) (the vocal tract alone).

To review, to obtain an estimate of log|Θ(ω)| from the speech on a window ending at time m, we execute the following steps:

1. Compute the stRC of the speech, c_s(n; m), as above.
2. Multiply c_s(n; m) by a "low-time" window (lifter), l(n), to select c_θ(n):

       ĉ_θ(n) = c_s(n; m) l(n).    (6.42)

   (Note that the lifter l(n) should theoretically be an even function of n, or even-symmetric about time (N − 1)/2 if an N-point DFT is used in the next step.)
3. To get the estimate of log|Θ(ω)|, DTFT (DFT) the estimate of c_θ(n).

This entire procedure is diagramed in Figs. 6.9 and 6.10. Figure 6.10 is the completion of the analysis begun in Fig. 6.7.

Recall that the term "cepstral smoothing" refers to the process of removing the high-quefrency effects of the excitation from the spectrum by this analysis. In the next section we will see the cepstrum used to effect further smoothing on a spectrum that is already smoothed by LP analysis.
7
374 en . 6 / Cepstral A na lys is
(( n ;m )
5
T
[;S
surr rr lDTFT
(stOFT) r-+ log 1·1 ~ 4 Low-lime 4 DTIT
(TOFT ) h fter (O Ff) 7
c::>
or. 4
-::.v'"
w (m - m) c(m m )
U 3
~
€
, ~ :::::: l'o(n;m)
~
t:
0
2
s:
VI
I( n)
FIGURE 6.9. "Cepstral smooth ing" using the stRC. Block diagram of the
computations . Note that the processin~ is almost always terminated after o
achieving the estimate of ce(n) or log 18( w)]. Q~eal is not uniquely defined,
I I I I
and it is therefore not possible to obtain a meaningful estimate of B(n ). - 1\o ,i(\ 45
.~
I 1 ! I I
LIJ ~~
L. J _ A
j \J 35
5 10 15
Q uefrcn ~ y . n (norm-sec)
LP and Cepstral Analysis in Speech Recognition (LP to
(a)
Cepstrum Conversion and Cepstral Distance Measures)
For many years LP analysis has been among the most popular meth I
8 '------ I
ods for extracting spectra l information from speec h. Contributing to thi s
popul arit y is the enor mous amou nt of theoretical and appli ed research
on the technique, which has resulted in very well-understood prop erties
and many efficient and readily ava ilable algorithms. For speech recogni 7.5
t ion (as well as coding), the LP parameters are a very useful spectral rep
resentation of the speech because th ey represent a "smoothed" versio"~ of
the spectrum that has been resolved from the model excitation.
7
However, LP analysis is not without drawbacks (Jua ng et al., 1987).
We emphasized the word " model" in the above paragraph because, as we
know fro m our study in Chap ter 5, LP analysis does not resolve the ~
<n
vocal-tract characteristics from the glottal dynamics. Since these lar yn i 6.5
~
geal characteris tics var y from perso n to person , and even for within GO
52
pers on utterances of the same words , the LP param eters con vey some
informa tion to a speech recognizer t hat degrades performance, pa rticu 6
larly for a speaker-independent system.!? Further, the all-pole constraint
FIGURE 6.10. The cepstral smoothing operation of Fig. 6.9 applied to the real data of Fig. 6.7. In Fig. 6.7 we find the illustration of the input frame for the vowel /i/ and the resulting stRC. (a) By applying a low-time lifter of duration 50 to the stRC, the estimate c_θ(n; 512) results. (b) Computation of the DFT results in an estimate of log|Θ(ω; 512)|, at, of course, ω = 2πk/M, where M is the length of the DFT. [Horizontal axes: (a) quefrency, n (norm-sec); (b) frequency, ω (norm-rps).]
¹⁰A system that recognizes multiple speakers, all or some of whom might not have participated in training the system.
6 .2 I "Real" Cepstrum 377
376 Ch. 6 I Cepstral Analysis
For theoretical reasons, the conversion formula is more easily derived for converting the LP parameters to the (short-term) complex cepstrum (stCC), which we study in Section 6.3, and the stCC is usually used in this context. As we have noted above, however, there is very little difference between the stCC and the stRC for a minimum-phase signal, like the impulse response of the LP model θ̂(n). In this case the stCC, γ_θ̂(n; m), is real, causal, and related to the stRC, c_θ̂(n; m), as

$$\gamma_{\hat\theta}(n;m)=\begin{cases}c_{\hat\theta}(0;m)=\log\hat\theta_{0},& n=0\\ 2c_{\hat\theta}(n;m),& n>0\\ 0,& n<0.\end{cases}\qquad(6.43)$$

The recursion for converting LP to CC parameters is

$$\gamma_{\hat\theta}(n;m)=\begin{cases}\log\hat\theta_{0},& n=0\\ \hat a(n;m)+\sum_{k=1}^{n-1}\frac{k}{n}\,\gamma_{\hat\theta}(k;m)\,\hat a(n-k;m),& n>0,\end{cases}\qquad(6.44)$$

in which â(n; m) is taken to be zero for n ∉ [1, M], and where M denotes the order of the LP model. The proof of this recursion will be the subject of Problem 6.7 after we have studied the CC. Using (6.43), (6.44) is easily modified to compute the stRC rather than the stCC; we will continue our discussion using the RC for consistency with the present material.

Note something very different from the cepstra used previously. In this case, rather than having computed the cepstrum of the speech, we have computed the cepstrum of the impulse response of the LP model. Therefore, we are computing the cepstrum of a sequence that has already been "smoothed" in the sense that the excitation has been removed. The purposes of doing so are generally twofold. First, it is possible to "fine-tune" the smoothing operation with several manipulations on c_θ̂(n; m) before the final manipulation is performed.¹¹ Second, by Parseval's relation, we have that the sum of the squared cepstral coefficients is related to the model spectrum as follows,

$$\sum_{n=0}^{\infty}c_{\hat\theta}^{2}(n;m)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl[\log|\hat\Theta(\omega;m)|\bigr]^{2}\,d\omega.\qquad(6.46)$$

If we wish to ignore the gain of the model [see (6.43)], we can omit c_θ̂(0; m) and write

$$\sum_{n=1}^{\infty}c_{\hat\theta}^{2}(n;m)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl[\log|A^{-1}(\omega;m)|\bigr]^{2}\,d\omega=\frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl[\log|A(\omega;m)|\bigr]^{2}\,d\omega.\qquad(6.47)$$

Accordingly, if we have the cepstral sequences for two LP models, say θ̂₁(n; m) and θ̂₂(n; m), then computing the sum of the squared differences yields the mean squared difference in the log spectra,

$$\sum_{n=1}^{\infty}\bigl[c_{\hat\theta_{1}}(n;m)-c_{\hat\theta_{2}}(n;m)\bigr]^{2}=\frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl[\log|A_{1}^{-1}(\omega;m)|-\log|A_{2}^{-1}(\omega;m)|\bigr]^{2}\,d\omega.\qquad(6.48)$$

This result indicates that the Euclidean distance between two cepstral sequences is a reasonable measure of spectral similarity in the models. In practice, of course, only finite sequences of cepstral parameters, say c_θ̂₁(1; m), ..., c_θ̂₁(L; m) (and similarly for the second model), would be used in the distance computation.¹²

The cepstral parameters may be interpreted as coefficients of a Fourier series expansion of the periodic log spectrum. Accordingly, they are based on a set of orthonormal functions (see Section 1.3.2) and, in a theoretical sense, the Euclidean metric is an appropriate and natural distance measure to use.

¹¹In fact, the most successful technique in this study was a cepstral technique, the mel-cepstrum, which is not based on LP analysis, but rather on a filter bank spectral analysis. We will study the mel-cepstrum below.

¹²Recall also that these numbers would result from having used discrete frequency transforms, which can potentially induce some cepstral aliasing [see (6.20)]. We will assume that care is taken to prevent this problem.
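As a concrete illustration of the discussion above, the recursion (6.44) and the truncated Euclidean distance can be sketched as follows. The function names and the predictor-coefficient convention Θ̂(z) = θ₀/(1 − Σᵢ â(i)z⁻ⁱ) are assumptions for illustration, not the book's code:

```python
import numpy as np

def lp_to_cc(a_hat, n_max, theta0=1.0):
    # Recursion in the spirit of (6.44) for the model
    # Theta(z) = theta0 / (1 - sum_i a_hat[i-1] z^-i).
    M = len(a_hat)
    g = np.zeros(n_max + 1)
    g[0] = np.log(theta0)
    for n in range(1, n_max + 1):
        val = a_hat[n - 1] if n <= M else 0.0
        for k in range(1, n):
            if 1 <= n - k <= M:
                val += (k / n) * g[k] * a_hat[n - k - 1]
        g[n] = val
    return g

def cepstral_distance(c1, c2):
    # Euclidean distance between truncated cepstral sequences, cf. (6.48)/(6.50);
    # the gain term (index 0) is excluded, as suggested by (6.47).
    d = np.asarray(c1, float)[1:] - np.asarray(c2, float)[1:]
    return float(np.sqrt(np.sum(d * d)))
```

For a single-pole model with â(1) = 0.5, the recursion reproduces the known cepstrum 0.5ⁿ/n of the model impulse response.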
Nevertheless, several researchers have tried weighted Euclidean distances to determine whether performance improvements would result. Furui (1981), Paliwal (1982), and Nakatsu et al. (1983) have explored the use of various weighting strategies with the Euclidean distance in speech recognition tasks, in each case achieving small improvements in performance. As Tohkura (1987) points out, however, "there is no clear reason why and how [weighting] works, and how to choose an optimal set of weights." Noting that the cepstral coefficients decrease with n (this will be demonstrated below), and that the variances decrease accordingly, Tohkura suggests the use of a diagonal weighting matrix which normalizes the variances of the individual parameters: Let c_θ(m) be a random L-vector from which the cepstral vectors to be compared, say c_θ₁(m) and c_θ₂(m), are drawn, where

$$c_{\theta}(m)\stackrel{\text{def}}{=}\bigl[c_{\theta}(1;m)\ \cdots\ c_{\theta}(L;m)\bigr]^{T}.\qquad(6.49)$$

In these terms, the Euclidean distance suggested above is

$$d_{2}\bigl[c_{\theta_{1}}(m),c_{\theta_{2}}(m)\bigr]=\sqrt{\bigl[c_{\theta_{1}}(m)-c_{\theta_{2}}(m)\bigr]^{T}\bigl[c_{\theta_{1}}(m)-c_{\theta_{2}}(m)\bigr]}.\qquad(6.50)$$

Tohkura assumes that the off-diagonal terms of the covariance matrix, C_θ, can be ignored so that

$$C_{\theta}=\Lambda,\qquad(6.51)$$

with Λ a diagonal matrix of the variances of the individual coefficients (see Section 1.3.2). The weighted cepstral distance is then defined as

$$d_{2w}\bigl[c_{\theta_{1}}(m),c_{\theta_{2}}(m)\bigr]=\sqrt{\bigl[c_{\theta_{1}}(m)-c_{\theta_{2}}(m)\bigr]^{T}\Lambda^{-1}\bigl[c_{\theta_{1}}(m)-c_{\theta_{2}}(m)\bigr]}.\qquad(6.52)$$

Several aspects of the weighting strategy were investigated by Tohkura. In general, the weighting was responsible for a significant improvement with respect to unweighted cepstral coefficients, admitting recognition rates of 99% (increased from 95% and 96.5%) on two isolated digit utterance databases, for instance. Among the most significant findings is the fact that the weighting of the lower quefrency coefficients is much more significant than that of the higher ones. (Note that the effect of the weighting of low-quefrency coefficients is to deemphasize their significance, since their variances are large.) The likely explanation¹³ (Juang et al., 1987) is that lower quefrencies correspond to slower changes in the model spectrum, which in turn are related to "spectral tilt" that arises due to the glottal dynamics. The glottal dynamics would generally be expected to interfere with correct recognition. Conversely, weighting of cepstral coefficients beyond about n = 8 (this would be approximately the same as the order of the LP analysis) was found to degrade performance. The likely explanation (Juang et al., 1987) is that by normalizing these very small coefficients, one is actually emphasizing features that are prone to numerical errors and artifacts of the computation.

…thereby providing analytical support for Paliwal's method. The sequence n c_θ(n; m) is sometimes called the root power sums sequence [see, e.g., (Hanson and Wakita, 1987)], so Paliwal's weighted distance is called the root power sums measure.

In keeping with Tohkura's findings, which suggest "down weighting" at each end of the cepstral sequence, several lifters have been proposed which taper at both ends (Tohkura, 1987; Juang et al., 1987). The basic forms of these lifters are shown in Fig. 6.11. The analytical form of the raised sine lifter is

$$l(n)=1+\frac{L}{2}\sin\!\left(\frac{\pi n}{L}\right),\qquad n=0,1,\ldots,L,\qquad(6.53)$$

in which L is the length of the cepstral sequence used. Again the lifters should theoretically be made even. Only the positive-time parts are shown in the figure.

FIGURE 6.11. Some example low-time lifters. The second was used successfully by Juang et al. (1987) for digit recognition.

¹³Tohkura and Juang et al. were working contemporaneously at Bell Laboratories on the related research reported in the two papers cited here.
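A minimal sketch of the variance-normalized distance (6.52) and the raised sine lifter (6.53); the function names are illustrative, and in practice the variances would be estimated from training data:

```python
import numpy as np

def raised_sine_lifter(L):
    # Raised sine lifter of (6.53): l(n) = 1 + (L/2) sin(pi n / L), n = 0..L.
    n = np.arange(L + 1)
    return 1.0 + (L / 2.0) * np.sin(np.pi * n / L)

def weighted_cepstral_distance(c1, c2, variances):
    # Variance-normalized distance of (6.52) with a diagonal Lambda:
    # sqrt( (c1 - c2)^T Lambda^{-1} (c1 - c2) ).
    d = np.asarray(c1, float) - np.asarray(c2, float)
    return float(np.sqrt(np.sum(d * d / np.asarray(variances, float))))
```

With unit variances the weighted distance reduces to the ordinary Euclidean distance of (6.50).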
$$F_{\text{mel}}=\frac{1000}{\log 2}\,\log\!\left[1+\frac{F_{\text{Hz}}}{1000}\right],\qquad(6.54)$$

in which F_mel (F_Hz) is the perceived (real) frequency in mels (Hz).¹⁴

Drawing on this idea of a perceptual frequency scale, speech researchers began to investigate the benefits of using a frequency axis which was warped to correspond to the mel scale. The stRC is particularly well suited to this purpose. Figure 6.13 illustrates this approach to computing the mel components on a frequency axis assumed to cover a Nyquist range 0–5 kHz. The remaining components can be set to zero, or, more commonly, a second psychoacoustic principle is invoked, which we now discuss.

Loosely speaking, it has been found that the perception of a particular frequency by the auditory system, say Ω₀,¹⁵ is influenced by energy in a critical band of frequencies around Ω₀ (Schroeder, 1977; Allen, 1985; O'Shaughnessy, 1987, Ch. 4). Further, the bandwidth of a critical band varies with frequency, beginning at about 100 Hz for frequencies below 1 kHz and then increasing logarithmically above 1 kHz.

¹⁴It is interesting to note that the pitch expressed in mels is roughly proportional to the number of nerve cells terminating on the basilar membrane of the inner ear, counting from the apical end to the point of maximal stimulation along the membrane (Stephens and Bate, 1966, p. 238).

¹⁵Note that we are using uppercase Ω here to designate "real-world" frequencies.
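The mapping (6.54) is a one-liner; a small sketch:

```python
import math

def hz_to_mel(f_hz):
    # The mel scale of (6.54): F_mel = (1000 / log 2) * log(1 + F_Hz / 1000).
    return (1000.0 / math.log(2.0)) * math.log(1.0 + f_hz / 1000.0)
```

By construction, 1000 Hz maps to 1000 mels, and each further doubling of (1 + F_Hz/1000) adds another 1000 mels, so the scale is roughly linear below 1 kHz and logarithmic above.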
FIGURE 6.13. Using the stDFT to derive the appropriate frequency components for computing the mel-cepstral coefficients. Twenty mel-frequency components are desired on the Nyquist range 0–5 kHz. Ten are committed to the "linear" range 0–1 kHz. These appear at 100, 200, ..., 1000 Hz. The other 10 are to be at log-distributed frequencies over the range 1–4 kHz. These correspond to the "desired frequencies" shown in the table. If we use a 1024-point stDFT, frequency lines at close approximations to these desired frequencies will be available. These are the "quantized frequencies" shown in the table.

    Desired frequency (Hz)   Quantized frequency (Hz)   DFT index k
    1148                     1152                       118
    1318                     1318                       135
    1514                     1514                       155
    1737                     1738                       178
    1995                     1992                       204
    2291                     2294                       235
    2630                     2627                       269
    3020                     3018                       309
    3467                     3467                       355
    4000                     4004                       410

Therefore, rather than simply using the mel-distributed log magnitude frequency components to compute the stRC, some investigators have suggested using the log total energy in critical bands around the mel frequencies as inputs to the final IDFT. We use the notation Y(i) to denote the log total energy in the ith critical band. This process is illustrated in Fig. 6.14. The meaning of Y(i) will be understood by examining this figure.

FIGURE 6.14. Use of critical band filters in computing the mel-cepstrum c_s(n; m). The filter magnitude spectra are labeled "conceptual" because, in practice, they must be effectively sampled at the frequencies resolved by the DFT; Y(i) and similar quantities are sums of weighted log|S(k; m)| values in the ith critical band. [Note: To obtain energies, log|S(k; m)|² can be used in place of log|S(k; m)|; however, this would only change the result by a scale factor.]

Note how the final IDFT is actually implemented in the figure. We have conceptualized the critical band filters as residing on the Nyquist range only. In principle, to be used as discrete-time filters, their transforms (which are purely real in this case) must be made symmetrical about the Nyquist frequency. Consequently, the critical band filter log energy outputs, designated Y(i) in Fig. 6.14, will be symmetrical about the Nyquist frequency. The integers i index the center frequencies of the critical band filters, each of which is assumed to be centered on one of the frequencies resolved by the original stDFT. In other words, if i indexes center frequency F_{c,i}, then

$$F_{c,i}=k\,\frac{F_{s}}{N'}\quad\text{for some }k,\text{ say }k_{i},\qquad(6.55)$$

for each i, where F_s is the sample frequency, and N′ is the number of points used to compute the stDFT (N-length frame of data plus zero padding). Therefore, let us define

$$Y(k)=\begin{cases}Y(i),& k=k_{i}\\ 0,& \text{otherwise}.\end{cases}$$

The final IDFT, then, is

$$c_{s}(n;m)=\frac{1}{N'}\sum_{k=0}^{N'-1}Y(k)\,e^{j(2\pi/N')kn}.$$

[Note that we continue to use the notation c_s(n; m) to denote the mel-cepstrum.] However, since the sequence Y(k) is symmetrical about N′/2 ("even"), we can replace the exponential by a cosine,

$$c_{s}(n;m)=\frac{1}{N'}\sum_{k=0}^{N'-1}Y(k)\cos\!\left(k\,\frac{2\pi}{N'}\,n\right).\qquad(6.58)$$

Again using the symmetry of Y(k), we can write this as

$$c_{s}(n;m)=\frac{Y(0)}{N'}+(-1)^{n}\frac{Y(N'/2)}{N'}+\frac{2}{N'}\sum_{k=1}^{(N'/2)-1}Y(k)\cos\!\left(k\,\frac{2\pi}{N'}\,n\right)=\frac{2}{N'}\sum_{k=1}^{(N'/2)-1}Y(k)\cos\!\left(k\,\frac{2\pi}{N'}\,n\right),\qquad(6.59)$$

for all n, where we have assumed N′ to be even and Y(0) = Y(N′/2) = 0. Of course, if N_cb denotes the number of critical band filters used on the Nyquist range, then there are only N_cb nonzero terms in the sum. Therefore, let us rewrite the sum as

$$c_{s}(n;m)=\frac{2}{N'}\sum_{i=1}^{N_{cb}}Y(k_{i})\cos\!\left(k_{i}\,\frac{2\pi}{N'}\,n\right)\;\propto\;\sum_{i=1}^{N_{cb}}Y(k_{i})\cos\!\left(k_{i}\,\frac{2\pi}{N'}\,n\right).\qquad(6.60)$$

The lower form in (6.60) (without the superfluous scale factor 2/N′) is usually used to compute the final IDFT. This is the form reflected in Fig. 6.14. Note that the use of only the Nyquist range in the final step requires that the stDFT be computed only over this range in the first operation.

Finally, we note that various types of critical band filters have been employed in practice (Davis and Mermelstein, 1980; Dautrich et al., 1983). The set of filters we have used above are essentially those proposed by Davis and Mermelstein. These critical band filters are particularly simple to use computationally, and have been employed in a number of research efforts [see, e.g., (Paul and Martin, 1988)].

Mel-Frequency Warping of LP-based Cepstral Parameters. A mel-frequency warping can also be included in cepstral parameters derived from LP analysis. In this case the cepstral parameters cannot be derived directly from the LP parameters. Rather, it is necessary to compute the log magnitude spectrum of the LP parameters and then warp the frequency axis to correspond to the mel axis. The cepstrum is then computed in the usual way. Shikano (1985), who approximated the mel warping by a bilinear transform of the frequency axis,¹⁶ reported significant recognition improvement with respect to the "unwarped" LP-based cepstral coefficients [see also (Lee et al., 1990)].

Delta, or Differenced, Cepstrum

In addition to the cepstral or mel-cepstral parameters, another popular feature used in contemporary speech recognition is the delta cepstrum.¹⁷ If c_s(n; m) denotes the stRC or mel-cepstrum for the frames of the signal s(n) ending at time m, then the delta, or differenced, cepstrum at frame m is defined as

$$\Delta c_{s}(n;m)\stackrel{\text{def}}{=}c_{s}(n;m+\delta Q)-c_{s}(n;m-\delta Q)\qquad(6.61)$$

for all n. Here Q represents the number of samples by which the window is shifted for each frame. The parameter δ is chosen to smooth the estimate and typically takes a value of 1 or 2 (look forward and backward one or two frames). A vector of such features at relatively low n's (quefrencies) intuitively provides information about spectral changes that have occurred since the previous frame, although the precise meaning of Δc_s(n; m) for a particular n is difficult to ascertain.

The delta cepstrum can, of course, also be computed for LP-based cepstral parameters. In this case, all instances of c_s(n; m) above would be replaced by c_θ̂(n; m), where, as usual, this notation refers to the cepstrum of the impulse response of the LP model estimated on the frame ending at time m.

Several researchers have argued that the differencing operation in (6.61) is inherently noisy and should be replaced by a polynomial approximation to the derivative (Furui, 1986; Soong and Rosenberg, 1986); however, many investigators have used the simple differencing operation successfully [e.g., see (Lee et al., 1990)].

Typically, 8–14 cepstral coefficients and their "derivatives" are used for speech recognition in systems that employ cepstral techniques (Juang et al., 1987; Rabiner et al., 1989; Lee et al., 1990). This means, for example, that (6.60) and (6.61) would be computed for n = 1, 2, ..., 8–14. From the discussions above, we should feel confident that these "low-time" measures will be associated with the vocal system spectrum and its dynamic changes.

¹⁶Refer to any of the books in Appendix 1.A for a discussion of the bilinear transform.

¹⁷Be careful to distinguish this feature from the differential cepstrum discussed in Section 6.3.2. Whereas the feature discussed here represents a time derivative of the cepstral parameters, the differential cepstrum represents a frequency derivative.
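The Fig. 6.14 pipeline and the delta cepstrum of (6.61) can be sketched as follows. The triangular filter shapes, their mel-spaced placement, and the frame-indexed delta are illustrative assumptions; the book's Davis–Mermelstein filters differ in detail:

```python
import numpy as np

def mel_cepstrum(frame, fs, n_filters=20, n_ceps=12):
    # Sketch of the Fig. 6.14 pipeline: |DFT| -> log energy in mel-spaced
    # triangular critical-band filters -> cosine sum as in (6.60).
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    mel = lambda f: (1000.0 / np.log(2.0)) * np.log1p(f / 1000.0)   # cf. (6.54)
    inv_mel = lambda m: 1000.0 * (2.0 ** (m / 1000.0) - 1.0)
    edges = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    Y = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = np.clip((freqs - lo) / (mid - lo), 0.0, None)
        falling = np.clip((hi - freqs) / (hi - mid), 0.0, None)
        w = np.minimum(rising, falling)               # triangular weight
        Y[i] = np.log(np.sum(w * spec) + 1e-12)       # log "energy" Y(i)
    k = edges[1:-1] / (fs / 2.0) * (N / 2.0)          # bin indices k_i, cf. (6.55)
    n = np.arange(1, n_ceps + 1)[:, None]
    # Cosine sum over the filter centers, the lower form of (6.60).
    return np.sum(Y[None, :] * np.cos(k[None, :] * 2.0 * np.pi * n / N), axis=1)

def delta_cepstrum(ceps, delta=2):
    # Delta cepstrum of (6.61), indexed by frame number: rows of `ceps`
    # are per-frame cepstral vectors; edge frames are clamped.
    C = np.asarray(ceps, float)
    fwd = np.roll(C, -delta, axis=0)
    fwd[-delta:] = C[-1]
    bck = np.roll(C, delta, axis=0)
    bck[:delta] = C[0]
    return fwd - bck
```

A typical recognizer front end would stack the first 8–14 mel-cepstral coefficients with their deltas, as described in the text.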
Log Energy

The measures c_x(0; m) and Δc_x(0; m) (with x standing for either s or θ̂) are often used as relative measures of spectral energy and its change. For the stRC, by definition, c_s(0; m) is the average of log|S(ω; m)| over frequency and hence serves as such a log energy measure.

6.3 Complex Cepstrum

6.3.1 Long-Term Complex Cepstrum

Definition of the CC and of Homomorphic Systems

In describing the history of the RC, we suggested that its invention came out of a somewhat less formal line of thinking than we used in introducing it. The notion of moving signals out of a "convolution" domain and into a "linear" domain in a formal way is perhaps better attributed to a retrospective view of the RC in light of the work of Oppenheim, Schafer, and others on the topic of homomorphic signal processing. Homomorphic signal processing is generally concerned with the transformation of signals combined in nonlinear ways to a linear domain in which they can be treated with conventional techniques, and then the retransformation of the results to the original nonlinear domain. The general field is quite interesting and is applicable to a number of problems other than speech. We will focus our attention on the cepstrum and leave the general theory to the reader's further pursuit of the literature.

The RC fails the general criterion that a homomorphic technique permit a "round-trip" back to the original nonlinear domain. We discussed this problem near (6.17), where the main problem was found to be the discarding of phase by the operator Q_real. With a "simple" adjustment to Q_real, however, we deduce a bona fide homomorphic operation and arrive immediately at the CC. The adjustment is to replace the log|S(ω)| operation (see Fig. 6.1) with the complex logarithm of the complete DTFT, phase included. With this change, the complex cepstrum (CC) of the signal s(n) is defined similarly to the RC,

$$\gamma_{s}(n)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\log S(\omega)\,e^{j\omega n}\,d\omega.\qquad(6.65)$$

FIGURE 6.15. Computation of the CC. [Voiced speech s(n) → DTFT → complex log → IDTFT → γ_s(n), the complex cepstrum (CC).]

In order that Υ_s(ω) ≝ log S(ω) be unique,¹⁸ arg{S(ω)} must be chosen to be an odd continuous function of ω [for details see (Oppenheim and Schafer, 1989, Sec. 12.3)]. This means that when using "canned" routines to find arg{S(ω)} in the process of computing log S(ω), one must be careful to add multiples of 2π when necessary to make arg{S(ω)} meet this criterion. This task, which is called phase unwrapping, although conceptually simple, is difficult in practice. Algorithms for phase unwrapping are discussed in (Tribolet, 1977).

¹⁸We avoid writing Γ_s(ω) to refer to the DTFT of the CC, since this notation would be easily confused with our conventional notation for a power density spectrum. Instead we use Υ_s(ω) (upsilon).

In light of (6.67), we can redraw the computation of the CC in Fig. 6.15 to explicitly feature the real and imaginary parts of the complex log. This is shown in Fig. 6.16. The upper path of the computation, which treats the real part of the complex log, results in the conjugate symmetric
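The computation of Fig. 6.15, with `numpy.unwrap` standing in for a more careful phase-unwrapping algorithm such as Tribolet's, might be sketched as:

```python
import numpy as np

def complex_cepstrum(x, n_fft):
    # CC computation per Fig. 6.15: FFT, complex log with unwrapped phase,
    # inverse FFT.  np.unwrap is a simple stand-in for the more robust
    # phase-unwrapping algorithms of (Tribolet, 1977); generous zero
    # padding (n_fft >> len(x)) limits cepstral aliasing.
    X = np.fft.fft(x, n_fft)
    phase = np.unwrap(np.angle(X))
    # Remove the linear phase ramp (the "D" term of the text) so that the
    # phase is an odd function of frequency.
    k = np.arange(n_fft)
    D = round(phase[n_fft // 2] / np.pi)
    phase = phase - np.pi * D * k / (n_fft // 2)
    return np.real(np.fft.ifft(np.log(np.abs(X)) + 1j * phase))
```

For the minimum-phase sequence x(n) = 0.9ⁿ u(n), the CC is 0.9ⁿ/n for n > 0 and zero for n < 0, which this sketch reproduces to within aliasing error.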
part of the CC, which in this case is real and even. Apparently, the even part of the CC is equivalent to the RC as we have defined it,

$$\gamma_{s,\text{even}}(n)=c_{s}(n).\qquad(6.68)$$

The lower path, which involves the imaginary part of the log, results in the complex antisymmetric part of the CC, which will also turn out to be real, and therefore odd. It is also clear from Fig. 6.16 that the even part of the CC is based on the magnitude spectrum of s(n) only (this is, of course, consistent with our understanding of the RC) while the output of the lower path is based on the phase spectrum of s(n). It is this lower path that is missing when we compute the RC and which preserves the phase information that the RC does not.

FIGURE 6.16. Computation of the CC showing decomposition into its even and odd parts. [The upper path, s(n) → DTFT → log|·| → IDTFT, produces γ_s,even(n); the lower path, based on the unwrapped phase, produces the odd part.]

Before focusing on properties of the CC, we note the construction and meaning of a complete homomorphic system for convolved signals. Just as in the RC case, we can complete a linear operation on the CC by performing a liftering operation,¹⁹ then a forward DTFT. For example, we might wish to "low-time lifter" to remove the effects of the excitation. This is the case shown in Fig. 6.17. Once the linear operation is performed, only an estimate of log Θ(ω) remains, and this "signal" must be reconverted to the original "convolution" domain. This final operation is accomplished through the application of the inverse operation Q_*^{-1}, which in this case exists unambiguously. This too is illustrated in Fig. 6.17.

FIGURE 6.17. "Cepstral smoothing" using the CC. In fact, we have anticipated the need for the short-term CC and have employed it in this analysis. (The details are a straightforward generalization of our knowledge of the stRC and will be covered below.) This figure shows the block diagram of the computations. Note that the processing can continue all the way back to the original time domain to obtain an estimate of θ(n) because of the existence of the operation Q_*^{-1}.

In summary, the convolution to linear domain transformation is denoted Q_* and its inverse is Q_*^{-1}. The IDTFT, low-time liftering, and DTFT operations comprise a linear filtering operation, say ℒ, in the "new" linear domain. Let us denote the set of operations [Q_* − ℒ − Q_*^{-1}] by 𝒜. Then it is true that the overall system follows a sort of generalized superposition principle where the combination rule²⁰ "+" has been replaced by "*",

$$\mathcal{A}\{e(n)*\theta(n)\}=\mathcal{A}\{e(n)\}*\mathcal{A}\{\theta(n)\}.\qquad(6.69)$$

Because 𝒜 retains the same basic algebraic structure as a linear system, we call 𝒜 a homomorphic ("same shape") system. The [Q_* − ℒ − Q_*^{-1}] system for 𝒜 is called the canonical form for the homomorphic system, and Q_* is called the characteristic system for the operation that takes * to +. Since the characteristic system is unique, all homomorphic systems that go from * to * differ only in their linear parts, ℒ.

¹⁹This time the lifter can have both magnitude and phase properties.

²⁰We have said nothing about rules for scalars here. For details see the discussion in (Oppenheim and Schafer, 1968).

Properties of the CC

A study of some of the important analytical properties of the CC will lead us to, among other things, a more formal understanding of some of our intuitive notions about the behavior and usefulness of the RC. We
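The equivalence (6.68) between the even part of the CC and the RC is easy to verify numerically; a sketch using a minimum-phase test signal (so that the phase unwrapping stays trivial):

```python
import numpy as np

# Numerical check of (6.68): for this minimum-phase test signal, the even
# part of the CC equals the RC.
x = 0.9 ** np.arange(48)
N = 512                                           # generous zero padding
X = np.fft.fft(x, N)
log_mag = np.log(np.abs(X))
phase = np.unwrap(np.angle(X))
cc = np.real(np.fft.ifft(log_mag + 1j * phase))   # complex cepstrum
rc = np.real(np.fft.ifft(log_mag))                # real cepstrum
cc_even = 0.5 * (cc + np.roll(cc[::-1], 1))       # (gamma(n) + gamma(-n)) / 2
assert np.allclose(cc_even, rc, atol=1e-10)
assert np.allclose(cc[1], 2.0 * rc[1], atol=1e-10)  # causal CC: gamma(n) = 2 c(n), n > 0
```

The second assertion is the minimum-phase relation (6.43) at work: the causal CC is twice the RC at positive quefrencies.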
restrict our discussion to (speech) sequences that have rational z-transforms which can be put in the form

$$S(z)=S_{0}'\,z^{-D'}\,\frac{\prod_{k=1}^{Q_{\text{in}}}\bigl(1-\zeta_{k}^{\text{in}}z^{-1}\bigr)\prod_{k=1}^{Q_{\text{out}}}\bigl(1-\zeta_{k}^{\text{out}}z^{-1}\bigr)}{\prod_{k=1}^{P_{\text{in}}}\bigl(1-p_{k}^{\text{in}}z^{-1}\bigr)},\qquad(6.70)$$

where ζ_k^in, k = 1, ..., Q_in, and ζ_k^out, k = 1, ..., Q_out, represent the zeros inside and outside the unit circle, respectively, and p_k^in, k = 1, ..., P_in, are the poles inside the unit circle.²¹ Note that complex poles and zeros must occur in conjugate pairs. Note also that D′ should be nonnegative for causality. The first step toward getting a principal result is to multiply and divide the numerator through by −[1/ζ_k^out]z for each zero outside the unit circle, yielding

$$S(z)=S_{0}\,z^{-D}\,\frac{\prod_{k=1}^{Q_{\text{in}}}\bigl(1-\zeta_{k}^{\text{in}}z^{-1}\bigr)\prod_{k=1}^{Q_{\text{out}}}\bigl(1-\psi_{k}^{\text{out}}z\bigr)}{\prod_{k=1}^{P_{\text{in}}}\bigl(1-p_{k}^{\text{in}}z^{-1}\bigr)},\qquad(6.71)$$

in which ψ_k^out ≝ 1/ζ_k^out and

$$S_{0}=S_{0}'\prod_{k=1}^{Q_{\text{out}}}\bigl(-\zeta_{k}^{\text{out}}\bigr),\qquad(6.72)$$

$$D=D'+Q_{\text{out}}.\qquad(6.73)$$

Now, subjecting s(n) [with z-transform as in (6.71)] to the first two operations (Q_*) in Fig. 6.15 yields

$$\log S(\omega)=\log S_{0}+\log e^{-j\omega D}+\sum_{k=1}^{Q_{\text{in}}}\log\bigl(1-\zeta_{k}^{\text{in}}e^{-j\omega}\bigr)+\sum_{k=1}^{Q_{\text{out}}}\log\bigl(1-\psi_{k}^{\text{out}}e^{j\omega}\bigr)-\sum_{k=1}^{P_{\text{in}}}\log\bigl(1-p_{k}^{\text{in}}e^{-j\omega}\bigr)$$
$$=\log|S_{0}|+\sum_{k=1}^{Q_{\text{in}}}\log\bigl(1-\zeta_{k}^{\text{in}}e^{-j\omega}\bigr)+\sum_{k=1}^{Q_{\text{out}}}\log\bigl(1-\psi_{k}^{\text{out}}e^{j\omega}\bigr)-\sum_{k=1}^{P_{\text{in}}}\log\bigl(1-p_{k}^{\text{in}}e^{-j\omega}\bigr)+j(\alpha-\omega D),\qquad(6.74)$$

where α = π if S_0 < 0, and is zero otherwise. We now need to apply the IDTFT operation (see Fig. 6.15) to obtain γ_s(n). If we use the fact that

$$\log(1+a)=a-\frac{a^{2}}{2}+\frac{a^{3}}{3}-\frac{a^{4}}{4}+\cdots\qquad(6.75)$$

for |a| < 1, it is not difficult to show that the IDTFT of (6.74) is

$$\gamma_{s}(n)=\begin{cases}\log|S_{0}|+j\alpha,& n=0\\[4pt] -\sum_{k=1}^{Q_{\text{in}}}\dfrac{(\zeta_{k}^{\text{in}})^{n}}{n}+\sum_{k=1}^{P_{\text{in}}}\dfrac{(p_{k}^{\text{in}})^{n}}{n}+\dfrac{D(-1)^{n+1}}{n},& n>0\\[4pt] \sum_{k=1}^{Q_{\text{out}}}\dfrac{(\psi_{k}^{\text{out}})^{-n}}{n}+\dfrac{D(-1)^{n+1}}{n},& n<0.\end{cases}\qquad(6.76)$$

We need to dwell on the terms jα in the n = 0 case, and D(−1)^{n+1}/n in the n ≠ 0 cases, before proceeding. The former is the only imaginary term in the CC and is nonzero only if the gain term S_0 is negative. It is customary to compute only the real part of the CC, realizing that we might be throwing away an insignificant piece of phase information corresponding to inversion of the waveform. The "D" term represents another bit of phase information and arises from two sources [see (6.73)]: the presence of nonminimum-phase zeros and the possibility of an initial delay of the waveform. It is customary to eliminate this second contribution by shifting the waveform so that it originates at time zero. (Looking ahead, we see that this will be significant in short-term processing, since, in so doing, we will be giving up the information about the delay of the frame, "m.") With these two assumptions, we have

$$\gamma_{s}(n)=\begin{cases}\log|S_{0}|,& n=0\\[4pt] -\sum_{k=1}^{Q_{\text{in}}}\dfrac{(\zeta_{k}^{\text{in}})^{n}}{n}+\sum_{k=1}^{P_{\text{in}}}\dfrac{(p_{k}^{\text{in}})^{n}}{n}+\dfrac{Q_{\text{out}}(-1)^{n+1}}{n},& n>0\\[4pt] \sum_{k=1}^{Q_{\text{out}}}\dfrac{(\psi_{k}^{\text{out}})^{-n}}{n}+\dfrac{Q_{\text{out}}(-1)^{n+1}}{n},& n<0.\end{cases}\qquad(6.77)$$

In principle, we should go back and modify the definition of the CC so that [for a causal sequence, s(n)] the signal is shifted, if necessary, to originate at time zero, and so that the real part of (6.65) is taken.

From (6.77) it also follows, for n > 0, that

$$|\gamma_{s}(n)|\le\sum_{k=1}^{Q_{\text{in}}}\frac{|\zeta_{k}^{\text{in}}|^{n}}{n}+\sum_{k=1}^{P_{\text{in}}}\frac{|p_{k}^{\text{in}}|^{n}}{n}+\frac{Q_{\text{out}}}{n}\le\bigl(Q_{\text{in}}+P_{\text{in}}\bigr)\frac{\beta^{n}}{n}+\frac{Q_{\text{out}}}{n},\qquad(6.79)$$

where β ≝ max(max_k |ζ_k^in|, max_k |p_k^in|); that is, the CC decays at least as fast as 1/n. Further:

3. If s(n) is minimum phase (no zeros outside the unit circle), then γ_s(n) = 0, n < 0 (the CC is causal).

4. Conversely, if s(n) is maximum phase (no poles or zeros inside the unit circle), then γ_s(n) = Q_out(−1)^{n+1}/n, n > 0.

5. The CC is of infinite duration even if s(n) is not. (Recall that a finite duration signal will have only a finite number of zeros and no poles.)

Let us digress for a moment and note an important point about minimum-phase sequences. We do so with the caution that speech and frames of speech (even when they are shifted in time so that they originate at n = 0) are generally not minimum-phase sequences, so we must be careful how we apply this information. However, in one very important application of cepstral analysis, we are concerned with the cepstrum of the impulse response of the LP model of a speech frame. In this case the signal considered is minimum phase and this point is very useful: For a minimum-phase signal, say x(n), the CC, γ_x(n), is completely specified by its even part that, in turn, is precisely the RC, c_x(n). This says that in the minimum-phase case the CC can be obtained directly from the RC.

²¹It is easy to include poles outside the unit circle in this development, but we have no need to do so here.

6.3.2 Short-Term Complex Cepstrum

In short-term processing, the stCC of the frame f(n; m) is defined by taking the real part of (6.65) applied to the shifted frame:

$$\gamma_{s}(n;m)=\mathrm{Real}\left\{\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\left[\sum_{l=0}^{N-1}\vec{f}(l;m)\,e^{-j\omega l}\right]e^{j\omega n}\,d\omega\right\}=\mathrm{Real}\left\{\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\bigl[\vec{S}(\omega;m)\bigr]\,e^{j\omega n}\,d\omega\right\}.\qquad(6.81)$$

Note that the frame has been appropriately shifted down to begin at time zero, and we put an arrow over the frame, f⃗, and the stDTFT, S⃗, as a reminder. (Recall the discussion in Section 4.3.5.) Accordingly, the index m in the argument of the stCC serves only to catalog the position of the frame. The computation of γ_s(n; m) is illustrated in Fig. 6.18. The practical version in which discrete frequency transforms are used is shown in Fig. 6.19. Again the frame is shifted downward so that the "conventional" DFT, rather than the delay-preserving stDFT, is used. As in the case of the RC, a potential for aliasing exists when discrete transforms are used; generally, zero padding of the initial frame is necessary. In this case the DFT-IDFT pair will be based on N′ > N points, where N′ reflects the zero padding.

FIGURE 6.18. Computation of the stCC. [f(n; m) → shift frame → f⃗(n; m) → stDTFT → S⃗(ω; m) → complex log → IDTFT → γ_s(n; m); w(m − n) is the window creating the frame.]
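The closed form (6.77) can be checked numerically against an FFT-based cepstrum for a simple minimum-phase pole-zero system; a sketch:

```python
import numpy as np

# Numerical check of (6.77) for S(z) = (1 - 0.5 z^-1) / (1 - 0.8 z^-1):
# one zero and one pole, both inside the unit circle (Q_out = 0, D = 0), so
#   gamma_s(n) = -(0.5^n)/n + (0.8^n)/n   for n > 0,
#   gamma_s(n) = 0                        for n < 0 (minimum phase, causal CC).
N = 1024
w = 2.0 * np.pi * np.arange(N) / N
zinv = np.exp(-1j * w)                    # z^{-1} evaluated on the unit circle
S = (1.0 - 0.5 * zinv) / (1.0 - 0.8 * zinv)
gamma = np.real(np.fft.ifft(np.log(np.abs(S)) + 1j * np.unwrap(np.angle(S))))
n = np.arange(1, 9)
assert np.allclose(gamma[1:9], -(0.5 ** n) / n + (0.8 ** n) / n, atol=1e-8)
assert np.max(np.abs(gamma[N - 8:])) < 1e-8     # causality: zero for n < 0
assert abs(gamma[0]) < 1e-10                    # log|S_0| = log 1 = 0
```

The same experiment with the zero moved outside the unit circle would reveal the anticausal ψ-terms and the Q_out(−1)^{n+1}/n contribution of (6.77).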
FIGURE 6.19. Computation of the stCC using the DFT-IDFT pair. [The frame is shifted, and an N′-point DFT, complex log, and IDFT are used; the output is a potentially aliased version of the theoretical γ_s(n; m) in Figure 6.18.]

6.3.3 Example Application of the stCC to Speech Analysis

As we have repeatedly indicated, much of speech analysis is successfully performed with the simpler RC. However, as we have seen above, the CC is more yielding of theoretical results, which, in turn, provide insight into the behavior of the RC. Indeed, some key points in our discussion of pitch and formant estimation using the stRC were based upon promised future results from the CC. We now look briefly at the related problems of pitch and formant analysis using the stCC, with a primary interest in strengthening the same discussion about the stRC, and a secondary interest in showing that the stCC itself can be used for these purposes.

Of course, as in the stRC case, the stCC of the speech is the sum of the individual cepstra of the excitation and vocal system impulse response:

$$\gamma_{s}(n;m)=\gamma_{e}(n;m)+\gamma_{\theta}(n).\qquad(6.82)$$

Note that we have one long-term and one short-term CC on the right side of the equation. Recall that this is a consequence of assuming that the window which creates the frame is slowly varying with respect to the impulse response. The reader is encouraged to review the discussion surrounding (6.23) to refresh his or her memory on this point if it is not clear. As in the case of the RC, the key point is to demonstrate that the two CCs are well separated on the quefrency axis.

In a very similar manner to our work on c_e(n; m), it can be shown that γ_e(n; m) takes the form of a pulse train with decaying weights. Whereas we are able to deduce only the form of the pulse train for the stRC [see (6.36)], here a precise expression can be derived. It is (Oppenheim and Schafer, 1968)

$$\gamma_{e}(n;m)=\sum_{q=0}^{Q-1}\gamma_{v_{m}}(q)\,\delta(n-qP),\qquad(6.83)$$

where v_m(n) ≝ w(m − nP), and γ_{v_m}(n) is its CC. As usual, w(m − n) is the window used to create the frame m, and it is assumed that Q pulses of e(n) occur inside the window.

We are also well prepared to make a strong case for the fast decay of γ_θ(n) with respect to P, since we now know that any CC will decay as 1/|n|. To be explicit, let us write the z-transform of the vocal system impulse response θ(n) in the form of (6.71),

$$\Theta(z)=\theta_{0}\,z^{-Q_{\text{out}}}\,\frac{\prod_{k=1}^{Q_{\text{in}}}\bigl(1-\zeta_{k}^{\text{in}}z^{-1}\bigr)\prod_{k=1}^{Q_{\text{out}}}\bigl(1-\psi_{k}^{\text{out}}z\bigr)}{\prod_{k=1}^{P_{\text{in}}}\bigl(1-p_{k}^{\text{in}}z^{-1}\bigr)}.\qquad(6.84)$$

Then we have, according to (6.77),

$$\gamma_{\theta}(n)=\begin{cases}\log|\theta_{0}|,& n=0\\[4pt] -\sum_{k=1}^{Q_{\text{in}}}\dfrac{(\zeta_{k}^{\text{in}})^{n}}{n}+\sum_{k=1}^{P_{\text{in}}}\dfrac{(p_{k}^{\text{in}})^{n}}{n}+\dfrac{Q_{\text{out}}(-1)^{n+1}}{n},& n>0\\[4pt] \sum_{k=1}^{Q_{\text{out}}}\dfrac{(\psi_{k}^{\text{out}})^{-n}}{n}+\dfrac{Q_{\text{out}}(-1)^{n+1}}{n},& n<0.\end{cases}\qquad(6.85)$$

This is the justification for (6.39), which played a key role in the similar discussion about the RC. Finally, therefore, we conclude that

$$\gamma_{s}(n;m)=\begin{cases}\gamma_{e}(0;m)+\gamma_{\theta}(0),& n=0\\[4pt] -\sum_{k=1}^{Q_{\text{in}}}\dfrac{(\zeta_{k}^{\text{in}})^{n}}{n}+\sum_{k=1}^{P_{\text{in}}}\dfrac{(p_{k}^{\text{in}})^{n}}{n}+\dfrac{Q_{\text{out}}(-1)^{n+1}}{n},& 0<n<P\\[4pt] \sum_{k=1}^{Q_{\text{out}}}\dfrac{(\psi_{k}^{\text{out}})^{-n}}{n}+\dfrac{Q_{\text{out}}(-1)^{n+1}}{n},& -P<n<0\\[4pt] \gamma_{e}(n;m)\ \text{(weighted pulse train)},& |n|>P,\end{cases}\qquad(6.86)$$

which is similar to (6.41), which we obtained for the stRC.

Although more rigorous than our similar discussion of the stRC, the conclusion is essentially the same. We find that the component excitation and vocal system cepstra are well separated in the quefrency domain, the excitation portion consisting of a high-time pulse train and the significant contribution of the vocal system occurring in the low-time region below quefrency P. In Fig. 6.7 we studied the stRC for the speech frame representing the vowel /i/. In Fig. 6.20(a) we show the stCC for the same speech frame as well as a repetition and extension of the stRC of Fig. 6.7, which appears in Fig. 6.20(b). Note that although the pitch peaks are still present and well separated from the vocal system information in the
6 .3 I Complex Cepstrum 397
defined D⁻¹ operation exists. Although this is seldom the purpose of using cepstral analysis, it is interesting to note this ability, which is lost
6.3.4 Variations on the Complex Cepstrum

At least two interesting variations on the complex cepstrum have appeared in the literature. Each differs from the cepstrum discussed in this chapter in the nonlinear operation applied to the spectrum. In the spectral root cepstrum (Lim, 1979) the logarithm is replaced by the operation of raising the DTFT to the power a, say S^a(ω). This system was developed explicitly for the pulse deconvolution problem and has been applied to some simple voiced speech analysis-synthesis experiments in Lim's paper.

A second variation, called the differential cepstrum (Polydoros and Fam, 1981), involves the replacement of the logarithm by the derivative of the log. Since

    (d/dω) log S(ω) = [dS(ω)/dω] [1/S(ω)],    (6.87)

the phase unwrapping problems inherent in the usual cepstrum are not present with this operation. The properties and potential applications of the differential cepstrum are discussed in the paper cited. The technique has not found much application to speech analysis, and the reader should be careful to distinguish the differential cepstrum from the delta, or differenced, cepstrum discussed in Section 6.2.4. The former represents the dynamic behavior of the spectrum with respect to frequency, while the latter represents a time derivative.

FIGURE 6.20. (a) The stCC of the same frame of speech analyzed in Fig. 6.7. (b) The stRC repeated for convenience and extended to 512 points to show symmetry.
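As a concrete illustration of the spectral root variation, the following sketch (not from the text) applies a magnitude-only version to a toy voiced-speech model; the power a = 0.5, the toy signal, and the magnitude-only simplification (which sidesteps phase unwrapping, much as the RC does for the CC) are all assumptions made here:

```python
import numpy as np

def spectral_root_cepstrum(x, a=0.5, nfft=512):
    """Magnitude-only sketch of the spectral root idea: IDFT of |X|**a."""
    X = np.fft.fft(x, nfft)
    return np.real(np.fft.ifft(np.abs(X) ** a))

# Toy "voiced" frame: a decaying system response excited by a pulse train.
h = 0.9 ** np.arange(64)     # crude vocal-system impulse response
e = np.zeros(256)
e[::50] = 1.0                # pitch period P = 50 samples
x = np.convolve(e, h)[:256]

c = spectral_root_cepstrum(x)
# Since |X|**a is real and nonnegative, |c[n]| <= c[0]: the low-time
# region dominates, while excitation structure appears near multiples of P.
print(round(float(c[0]), 3))
```

As with the RC, discarding the phase gives easy computation at the price of losing invertibility of the operation.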
6.3. Consider the use of the stRC to estimate the pitch described in Section 6.2.3. The theoretical developments underlying this method depend critically upon the assumption that the window used to create the frame

6.5. (a) From the properties of the CC in Section 6.3.1, we can deduce that for a causal, real, minimum-phase sequence s(n), the CC, γ_s(n), will also be real and causal. Use this fact to deduce the following relationship between the CC and RC:
Of course, we have taken the liberty of converting Noll's definition into discrete-time terms.

We will pose this problem in "long-term" quantities and let the reader take care of the "short-term" processing details.
CHAPTER 7
Speech Coding and Synthesis
Reading Notes: In addition to the general DSP and stochastic process material in Chapter 1, many developments in this chapter will depend on topics from Sections 1.3 and 1.4.

Generally speaking, there are two fundamental approaches to analyzing signals and systems. The first is to work with "real" time waveforms, using temporal averages and time arguments to obtain desired results. In this case, appropriate stationarity and ergodicity properties are assumed when results from stochastic process theory are required. The second general approach is to work initially with theoretical models, in particular stochastic processes, then move the essential findings into the practical "waveform" world by appropriate stationarity and ergodicity assumptions. The second approach is often more appealing in its formality, generalizability, and rigor, but it does not always lend itself as well to ad hoc techniques that can be exploited for specific speech tasks.¹ Speech processing is inherently an "applied" discipline in which we often have "more waveforms than models," and the "temporal" approach is frequently more appealing or even essential. This approach has been taken throughout most of the book thus far.

Historically, however, some of the material in this chapter (especially Section 7.2) has been developed and presented in a probabilistic setting. The basic problem addressed, speech coding, is amenable to, and indeed benefits from, a formal modeling approach. Therefore, for the first time in the book, the material will be initially presented in such a framework. The reader is encouraged to review the material in Section 1.2, with special attention to a review of the notation. In particular, it is important to have a clear understanding of the notations used to describe random processes and their key concepts. Recall that we use x to indicate a random process. The notation {x(n), n ∈ (some time interval, say N)} indicates the set of random variables that comprise x, that is,

    x = {x(n), n ∈ N}.    (7.1)

Finally, "x(n)" [meaning {x(n), n ∈ N}] is used to indicate a sample function or realization of x. Vector counterparts to these notations will also be used for vector-valued stochastic processes, and, although our discussions have centered on discrete-time random processes, we will occasionally have reason to work with continuous-time processes as well.

¹An interesting discussion of the duality between time and ensemble analysis is found in Gardner (1990).
speech signal. The efficient digital representation of the speech signal makes it possible to achieve bandwidth efficiency in the transmission of the signal over a variety of communication channels, or to store it efficiently on a variety of magnetic and optical media, for example, tapes and disks. Since the digitized speech is ultimately converted back to analog form for the user, an important consideration in speech coding is the level of signal distortion introduced by the digital conversion process.

Over the past four or five decades, a variety of speech coding techniques has been proposed, analyzed, and developed. In this chapter we describe the most important of these methods, which may be subdivided into two general categories, waveform coders and voice coders (vocoders).

In waveform encoding we attempt to directly encode speech waveforms in an efficient way by exploiting the temporal and/or spectral characteristics of speech signals. In contrast, voice encoding involves the representation of the speech signal by a set of parameters, the estimation of the parameters from frames of speech, and the efficient encoding of these parameters in digital form for possible transmission or storage.

In our discussion of speech coding techniques, we will assume that the analog speech signal is confined by filtering to a nominal bandwidth of 3-4 kHz. Then, the signal is sampled at a minimum rate of 8000 samples per second to avoid aliasing. In most of the waveform coding techniques to be described, the samples are processed one at a time by quantizing and encoding each sample separately. We call this process scalar quantization and coding. In contrast, a block of samples may be quantized as a single entity and the index of the resulting code vector may be transmitted to the receiver. In such a case we call the quantization process vector quantization. The latter may be used either for waveform encoding or in a vocoder.

We begin by considering scalar and vector quantization from a general viewpoint. Then we describe waveform encoding techniques in Section 7.3 and vocoder techniques in Section 7.4.

7.2 Optimum Scalar and Vector Quantization

In this section we consider the general problem of coding the output of an analog source from a theoretical viewpoint. In the speech coding problem, the "analog source" may be viewed as the speaker who produces analog acoustic waveforms. Formally, the "analog source" is a continuous-time stochastic process.

To begin, let us suppose that an analog source emits a message waveform x_a(t), which may be considered a sample function of a continuous-time² stochastic process x_a. The subscript a is used to remind the reader that we are temporarily working with an analog waveform drawn from an "analog" stochastic process (source). We assume that x_a is a stationary, correlation-ergodic, stochastic process with an autocorrelation function r_{x_a}(τ) and also a power spectral density function Γ_{x_a}(Ω). Furthermore, let us assume that x_a is a bandlimited stochastic process, that is,

    Γ_{x_a}(Ω) = 0 for |Ω| > 2πW,    (7.2)

where W is the bandwidth in Hz. From the sampling theorem, we know that any sample function x_a(t) may be represented without loss of information by its samples, say x(n) ≝ x_a(nT) for −∞ < n < ∞, as long as T < 1/2W. Of course, the sequence x(n) can be considered a realization of the discrete-time stochastic process, say x, which consists of the countable random variables x_a(nT) drawn from x_a.

The sampling process converts the output of an analog source into an equivalent discrete-time sequence of samples. The samples are then quantized in amplitude and encoded as a binary sequence. Quantization of the amplitudes of the sampled signal results in waveform distortion and, hence, a loss in signal fidelity. The minimization of this distortion is considered below from the viewpoint of optimizing the quantizer characteristics.

Our treatment considers two cases, scalar quantization and vector quantization. A scalar quantizer operates on a single sample at a time and represents each sample by a sequence of binary digits. In contrast, a vector quantizer operates on N signal samples (N > 1) at a time and thus quantizes the signal vectors in N-dimensional space.

²Here we briefly work with a continuous-time random process. The concepts used here are analogous to those discussed for the discrete-time case in Section 1.2. The reader is referred to the textbooks listed in Appendix 1.B for details.

7.2.1 Scalar Quantization

Consider the sequence of samples x(n), a realization of a discrete-time stochastic process x, which is created by appropriate sampling of a bandlimited stochastic analog source as described above. The sequence x(n) is input to the quantizer, which is assumed to have L = 2^R levels. The number of bits per sample is therefore

    R = log₂ L.    (7.3)

The units of R may also be designated bits per normalized second (bpn), since we consider the interval of time between each sample to be a normalized second. Clearly, the quantity RF_s represents the bit rate per real time, and is measured in bits per second (bps). We shall use the abbreviations bpn and bps throughout our discussion.

Now let us denote the output of the quantizer by

    x̂(n) = Q[x(n)],    (7.4)

where Q[·] represents the mapping (assumed functional) from the sequence x(n) to the L discrete levels. We also assume that the marginal probability density function (pdf) of the stationary stochastic process x_a is known and is denoted f_{x_a(t)}(ξ) for any t. Because of stationarity, this pdf does not depend on t. Obviously, the sampled process x is also stationary and

    f_{x(n)}(ξ) = f_{x_a(t)}(ξ) for arbitrary choices of t and n.    (7.5)

Since the first-order pdf is the same for any time in either random process, for simplicity we will adopt the notation f_x(ξ) to mean

    f_x(ξ) ≝ f_{x(n)}(ξ) = f_{x_a(t)}(ξ) for arbitrary t, n.    (7.6)

We wish to design the optimum scalar quantizer that minimizes the error in the following sense. Let q denote the random process that models the quantization error sequence. Realizations of q are of the form

    q(n) ≝ x̂(n) − x(n) = Q[x(n)] − x(n),    (7.7)

and the random variables in the process are formally described as

    q(n) = x̂(n) − x(n) = Q[x(n)] − x(n).    (7.8)

In a temporal sense, we desire to find the quantization mapping, Q, that minimizes the average of some function of the error sequence, say h[q(n)]. Assuming appropriate stationarity and ergodicity properties, we find Q that minimizes

    D ≝ E{h(q(n))}    (7.9)

for an arbitrary n. Using (7.8) and recalling (7.6), we find that the quantity to minimize becomes

    D = ∫_{−∞}^{∞} h(Q(ξ) − ξ) f_x(ξ) dξ.    (7.10)

In general, an optimum quantizer is one that minimizes D by optimally selecting the output levels and the corresponding input range of each output level. This optimization problem has been considered by Lloyd (1957) and Max (1960), and the resulting optimum quantizer is usually called the Lloyd-Max quantizer.

For a uniform quantizer, the output levels are specified as

    x̂(n) = x̂_k ≝ (2k − 1)Δ/2, when (k − 1)Δ ≤ x(n) < kΔ,    (7.11)

where Δ is the step size, as shown in Fig. 7.1. When the uniform quantizer is symmetric with an even number of levels, the average distortion in (7.10) may be expressed as

    D = 2 Σ_{k=1}^{(L/2)−1} ∫_{(k−1)Δ}^{kΔ} h[(2k − 1)Δ/2 − ξ] f_x(ξ) dξ
        + 2 ∫_{((L/2)−1)Δ}^{∞} h[(L − 1)Δ/2 − ξ] f_x(ξ) dξ.    (7.12)

FIGURE 7.1. Input-output characteristic of a uniform quantizer with step size Δ; output levels ±Δ/2, ±3Δ/2, ±5Δ/2, ±7Δ/2 and the corresponding 3-bit code words (000 through 111) are shown.
which

    h(a) = a².    (7.14)

Max (1960) evaluated the optimum step size Δ_opt and the minimum MSE when the pdf is Gaussian, that is,

    f_x(ξ) = (1/√(2π)) e^{−ξ²/2}.    (7.15)

Some of these results are given in Table 7.1. We observe that the minimum mean square distortion D_min decreases by a little more than 5 dB for each doubling of the number of levels L. Hence each additional bit that is employed in a uniform quantizer with optimum step size Δ_opt for a Gaussian-distributed signal amplitude reduces the distortion by more than 5 dB.

TABLE 7.1. Optimum Uniform Quantizer for a Gaussian Random Variable (Max, 1960).

Number of        Optimum Step    Minimum MSE    10 log D_min
Output Levels    Size Δ_opt      D_min          (dB)
 2               1.596           0.3634          −4.4
 4               0.9957          0.1188          −9.25
 8               0.5860          0.03744        −14.27
16               0.3352          0.01154        −19.38

By relaxing the constraint that the quantizer be uniform, the distortion can be reduced further. In this case, we let the output level

    x̂(n) = y_k, when x_{k−1} ≤ x(n) < x_k.    (7.16)

As a special case, we again consider minimizing the mean square value of the distortion. In this case, h(a) = a²; hence (7.18) becomes

    x_k = (y_k + y_{k+1})/2,  k = 1, 2, ..., L − 1,    (7.20)

which is the midpoint between y_k and y_{k+1}. The endpoints are

    x_L = ∞,  x_0 = −∞.    (7.21)

The corresponding equations determining the numbers {y_k} are

    y_k = [∫_{x_{k−1}}^{x_k} ξ f_x(ξ) dξ] / [∫_{x_{k−1}}^{x_k} f_x(ξ) dξ],  k = 1, 2, ..., L.    (7.22)

Thus y_k is the centroid (mean value) of f_x(·) between x_{k−1} and x_k. These equations may be solved numerically for any given f_x(ξ).

Tables 7.2 and 7.3 give the results of this optimization obtained by Max (1960) for the optimum four-level and eight-level quantizers of a Gaussian-distributed signal amplitude.

TABLE 7.2. Optimum Four-Level Quantizer for a Gaussian Random Variable (Max, 1960).

Level k    x_k        y_k
1          −0.9816    −1.510
2           0.0       −0.4528
3           0.9816     0.4528
4           ∞          1.510

TABLE 7.3. Optimum Eight-Level Quantizer for a Gaussian Random Variable (Max, 1960).

Level k    x_k        y_k
1          −1.748     −2.152
2          −1.050     −1.344
3          −0.5006    −0.7560
4           0.0       −0.2451
5           0.5006     0.2451
6           1.050      0.7560
7           1.748      1.344
8           ∞          2.152

D_min = 0.03454; 10 log D_min = −14.62 dB.
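The two optimality conditions (7.20) and (7.22) suggest a simple alternating iteration (Lloyd's method). The sketch below is illustrative rather than the book's procedure; the integration grid, starting levels, and iteration count are arbitrary choices. It reproduces the Table 7.2 values:

```python
import numpy as np

# Lloyd iteration for the optimum four-level Gaussian quantizer:
# alternate the midpoint condition (7.20) and centroid condition (7.22).
dx = 1e-4
xi = np.arange(-8, 8, dx)
pdf = np.exp(-xi ** 2 / 2) / np.sqrt(2 * np.pi)

def lloyd_max(L, iters=100):
    y = np.linspace(-2.0, 2.0, L)            # initial output levels
    for _ in range(iters):
        thr = (y[:-1] + y[1:]) / 2           # thresholds at midpoints (7.20)
        cell = np.searchsorted(thr, xi)      # cell index of each grid point
        y = np.array([np.sum(xi[cell == k] * pdf[cell == k])
                      / np.sum(pdf[cell == k]) for k in range(L)])  # (7.22)
    return thr, y

thr, y = lloyd_max(4)
print(np.round(thr, 4), np.round(y, 4))  # compare with Table 7.2
```

The same loop with L = 8 recovers the Table 7.3 entries.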
The signal amplitude is Gaussian-distributed with zero mean and unit variance. In Table 7.4 we compare the minimum mean square distortion of a uniform quantizer to that of a nonuniform quantizer. From the results of this table, we observe that the difference in the performance of the two types of quantizers is relatively small for small values of R (less than 0.5 dB for R < 3), but increases as R increases. For example, at R = 5, the nonuniform quantizer is approximately 1.5 dB better than the uniform quantizer.

It is instructive to plot the minimum distortion as a function of the bit rate R = log₂ L (bpn) for both the uniform and nonuniform quantizers. These curves are illustrated in Fig. 7.2. The functional dependence of the distortion D on the bit rate R may be expressed as D(R). This function is called the distortion-rate function for the corresponding quantizer. We observe that the distortion-rate function for the optimum nonuniform quantizer falls below that of the optimum uniform quantizer.

FIGURE 7.2. Distortion versus rate curves for a discrete-time, memoryless Gaussian source: optimum uniform and nonuniform quantizers, entropy coding, and the distortion-rate function D(R) = 2^{−2R} for the Gaussian source.

TABLE 7.4. Comparison of Optimum Uniform and Nonuniform Quantizers for a Gaussian Random Variable (Max, 1960; Paez and Glisson, 1972).

           10 log₁₀ D_min
R (bpn)    Uniform (dB)    Nonuniform (dB)
1           −4.4            −4.4
2           −9.25           −9.30
3          −14.27          −14.62
4          −19.38          −20.22
5          −24.57          −26.02
6          −29.83          −31.89
7          −35.13          −37.81

Any quantizer reduces the continuous-amplitude source x_a into a discrete-amplitude source, say x̂. As above, suppose that the discrete values taken by the quantized source are

    {y_k, 1 ≤ k ≤ L}.    (7.23)

This set of discrete entities is sometimes called an alphabet of the discrete source. Let us denote the probabilities associated with the symbols or letters from the source by {P(y_k) = P_k}. If the random variables, x̂(n), from the source are statistically independent, the discrete source is said to be memoryless. We know from Section 1.4.2 that such a source has entropy

    H(x̂) = −Σ_{k=1}^{L} P_k log₂ P_k.    (7.24)

An algorithm due to Huffman (1952) provides an efficient method for source encoding based on the notion that the more probable symbols (or
blocks of symbols) be assigned fewer bits and the less probable symbols be assigned more bits. The Huffman encoding algorithm yields a variable-length code in which the average number of bits per letter can be made as close to H(x̂) as desired. We call this coding method entropy coding.

For example, the optimum four-level nonuniform quantizer for the Gaussian-distributed signal amplitude given by (7.15) results in the probabilities P₁ = P₄ = 0.1635 for the two outer levels and P₂ = P₃ = 0.3365 for the two inner levels. The entropy for this discrete source is H(x̂) = 1.911 bits per letter. With entropy coding (Huffman coding) of blocks of output letters, we can achieve the minimum distortion of −9.30 dB with 1.911 bits per letter instead of 2 bits per letter. Max (1960) has given the entropy for the discrete source symbols resulting from quantization. Table 7.5 lists the values of the entropy for the nonuniform quantizer. These values are also plotted in Fig. 7.2 and labeled "entropy coding."

TABLE 7.5. Entropy of the Output of an Optimum Nonuniform Quantizer for a Gaussian Random Variable (Max, 1960).

R (bpn)    Entropy (bits/letter)    Distortion, 10 log₁₀ D_min (dB)
1          1.0                       −4.4
2          1.911                     −9.30
3          2.825                    −14.62
4          3.765                    −20.22
5          4.730                    −26.02

From this discussion we conclude that the quantizer can be optimized when the pdf of the continuous source output x_a is known. The optimum quantizer of L = 2^R levels results in a minimum distortion of D(R), where R = log₂ L bpn. Thus this distortion can be achieved by simply representing each quantized sample by R bits. However, more efficient coding is possible. The discrete source output that results from quantization is characterized by a set of probabilities {P_k} that can be used to design efficient variable-length codes (Huffman codes) for the source output (entropy coding). The efficiency of any coding method can be compared, as described below, with the distortion-rate function or, equivalently, the rate-distortion function for the discrete-time, continuous-amplitude source that is characterized by the given pdf.

Rate-Distortion and Distortion-Rate Functions

It is interesting to compare the performance of the optimal uniform and nonuniform quantizers described above with the best achievable performance attained by any quantizer. For such a comparison, we present some basic results from information theory, due to Shannon, which we state in the form of theorems.

THEOREM 7.1 RATE-DISTORTION FUNCTION FOR A MEMORYLESS GAUSSIAN SOURCE (SHANNON, 1959) The minimum information rate (bpn) necessary to represent the output of a discrete-time, continuous-amplitude, memoryless stationary Gaussian source [corresponding to a random process x with random variables x(n)] based on an MSE distortion measure per symbol (single-letter distortion measure) is

    R_g(D) = (1/2) log₂ (σ_x²/D),   0 ≤ D ≤ σ_x²
           = 0,                     D > σ_x²,    (7.25)

where σ_x² is the variance of the Gaussian source output. The function R_g(D) is called the rate-distortion function for the source (the subscript "g" is used to denote the memoryless Gaussian source).

We should note that (7.25) implies that no information needs to be transmitted when the distortion D ≥ σ_x². Specifically, D = σ_x² can be obtained by using zeros in the reconstruction of the signal. For D > σ_x², we can use statistically independent, zero-mean Gaussian noise samples with a variance of D − σ_x² for the reconstruction. R_g(D) is plotted in Fig. 7.3.

The rate-distortion function R(D) of a source is associated with the following basic source coding theorem in information theory.

FIGURE 7.3. Rate-distortion function (bpn versus normalized distortion D/σ_x²) for a continuous-amplitude, memoryless Gaussian source.
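The probabilities and entropy quoted above can be recomputed directly from the Gaussian CDF and the Table 7.2 thresholds; a quick check (not from the text):

```python
import math

# Cell probabilities and entropy (7.24) for the optimum four-level
# quantizer, using the thresholds 0 and +/-0.9816 from Table 7.2.
def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_outer = 1.0 - Phi(0.9816)       # ~0.163; the text quotes 0.1635
p_inner = Phi(0.9816) - Phi(0.0)  # ~0.337; the text quotes 0.3365
probs = [p_outer, p_inner, p_inner, p_outer]

H = -sum(p * math.log2(p) for p in probs)
print(round(p_outer, 4), round(H, 3))  # entropy ~1.911 bits per letter
```

A Huffman code designed over blocks of these four letters can approach the 1.911 bits-per-letter figure, which is what the "entropy coding" curve of Fig. 7.2 depicts.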
THEOREM 7.2 SOURCE CODING WITH A DISTORTION MEASURE (SHANNON, 1959) There exists a coding scheme that maps the source output into code words such that for any given distortion D, the minimum rate R(D) bpn is sufficient to reconstruct the source output with an average distortion that is arbitrarily close to D.

It is clear, therefore, that the rate-distortion function R(D) for any source represents (by definition) a lower bound on the source rate that is possible for a given level of distortion.

Let us return to the result in (7.25) for the rate-distortion function of a memoryless Gaussian source. If we reverse the functional dependence between D and R, we may express D in terms of R as

    D_g(R) = 2^{−2R} σ_x².    (7.26)

    R(D) ≤ (1/2) log₂ (σ_x²/D),  0 ≤ D ≤ σ_x².    (7.28)

A proof of this theorem is given by Berger (1971). It implies that the Gaussian source requires the maximum rate among all other sources for a specified level of mean square distortion. Thus the rate-distortion R(D) of any continuous-amplitude, memoryless source with zero mean and finite variance σ_x² satisfies the condition R(D) ≤ R_g(D). Similarly, the distortion-rate function of the same source satisfies the condition

    D(R) ≤ D_g(R) = 2^{−2R} σ_x².    (7.29)

A lower bound on the rate-distortion function also exists. This is called the Shannon lower bound for an MSE distortion measure, and is given as

    R*(D) = H(x) − (1/2) log₂ 2πeD,    (7.30)

where H(x) is the differential entropy of the continuous-amplitude, memoryless source, defined as³

    H(x) ≝ −∫_{−∞}^{∞} f_x(ξ) log₂ f_x(ξ) dξ.

The difference between the upper and lower bounds may be expressed in decibels as

    Δ_D ≝ 6[R_g(D) − R*(D)] dB.    (7.38)

³We temporarily ignore the convention established in (7.6) in order to give a precise definition. Also compare (1.243).
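A small computation (illustrative, not from the text) of the Gaussian distortion-rate function (7.26) makes the per-bit behavior concrete:

```python
import math

# Gaussian distortion-rate function (7.26): D_g(R) = 2**(-2R) * sigma^2.
# Each added bit lowers the distortion bound by about 6.02 dB.
sigma2 = 1.0

def Dg(R):
    return 2.0 ** (-2 * R) * sigma2

for R in range(1, 6):
    print(R, round(10 * math.log10(Dg(R)), 2))

gain_per_bit = 10 * math.log10(Dg(2) / Dg(3))
print(round(gain_per_bit, 2))  # ~6.02 dB per bit
```

Note that at R = 2 the bound is −12.04 dB, while the best scalar (nonuniform) quantizer of Table 7.4 achieves only −9.30 dB; entropy coding and vector quantization are ways of closing part of this gap.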
The relations (7.37) and (7.38) allow us to compare the lower bound on the distortion with the upper bound, which is the distortion for the Gaussian source. We note that D*(R) also decreases at −6 dB per bit. We should also mention that the differential entropy H(x) is upper bounded by H_g(x), as shown by Shannon (1948).

Table 7.6 lists four pdf's that are models commonly used for source signal distributions. Shown in the table are the differential entropies, the differences in rates in bpn, and the difference in distortion between the upper and lower bounds. Paez and Glisson (1972) have shown that the gamma probability distribution provides a good model for speech signal amplitude. The optimum quantization levels for this amplitude distribution are given in Table 7.7 for L = 2, 4, 8, 16, 32. The signal variance has been normalized to unity. From Table 7.6 we note that the gamma pdf shows the greatest deviation from the Gaussian. The Laplacian pdf is the most similar to the Gaussian, and the uniform pdf ranks second of the pdf's shown in Table 7.6. These results provide some benchmarks on the difference between the upper and lower bounds on distortion and rate.

[TABLE 7.6. Differential entropies, rate differences (bpn), and distortion differences (dB) between the upper and lower bounds for four common pdf models: uniform, Gaussian, Laplacian, and gamma.]

Before concluding this section, let us consider a continuous-time bandlimited Gaussian source with spectral density

    Γ_{x_a}(Ω) = σ_x²/(2W),  |Ω| ≤ 2πW
               = 0,          otherwise.    (7.39)

When the output of this source is sampled at the Nyquist rate, the samples are uncorrelated and, since the source is Gaussian, the samples are also statistically independent. Hence the equivalent discrete-time Gaussian source, x, is memoryless. The rate-distortion function for each sample is given by (7.25). Therefore, the rate-distortion function for the bandlimited white Gaussian source in bps is

    R_g(D) = W log₂ (σ_x²/D),  0 ≤ D ≤ σ_x².    (7.40)

The corresponding distortion-rate function is

    D_g(R) = 2^{−R/W} σ_x²,    (7.41)

which, when expressed in decibels and normalized by σ_x², becomes

    10 log [D_g(R)/σ_x²] = −3R/W.    (7.42)

The more general case in which the Gaussian process is neither white nor bandlimited has been treated by Gallager (1968) and Goblick and Holsinger (1967).
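As a numerical illustration of (7.40)-(7.42), consider hypothetical numbers (neither is from the text): a bandwidth of W = 4 kHz and a target distortion 36 dB below the signal variance.

```python
import math

# Rate required by (7.40) for a bandlimited white Gaussian source.
W = 4000.0          # hypothetical bandwidth, Hz
target_db = -36.0   # hypothetical target: D is 36 dB below sigma^2

D_over_sigma2 = 10.0 ** (target_db / 10.0)
R_bps = W * math.log2(1.0 / D_over_sigma2)   # from (7.40)
print(round(R_bps))                          # about 47.8 kbps

# Consistency with (7.42): 10 log10(D_g(R)/sigma^2) = -3R/W
# (the 3 is a rounded 3.01, so this lands near, not exactly at, -36).
print(round(-3.0 * R_bps / W, 1))
```

This is only a bound for the idealized Gaussian source; real coders for 4-kHz speech reach comparable quality at far lower rates by exploiting the structure of speech, as the remainder of the chapter describes.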
[TABLE 7.7. Optimum quantization levels for a gamma-distributed signal amplitude, L = 2, 4, 8, 16, 32, with the signal variance normalized to unity (Paez and Glisson, 1972).]

7.2.2 Vector Quantization

continuous-amplitude source is memoryless. If, in addition, the signal samples or parameters are statistically dependent, we can exploit the dependency by jointly quantizing blocks of samples or parameters and thus achieve a distortion lower than that achieved by scalar quantization.

Specifically, consider the random vector

    x = [x(1) x(2) ··· x(N)]ᵀ,    (7.43)

where the x(j) are real random variables. Later these random variables will correspond to signal samples, or perhaps to parameters such as LP coefficients characterizing a frame of speech. Consequently, we index them by integers in parentheses as a foreshadowing of this fact. The random vector x is governed by a joint pdf

    f_x(ξ₁, ..., ξ_N) = f_{x(1),...,x(N)}(ξ₁, ..., ξ_N).    (7.44)

Formally, the quantizer maps the random vector x to another random vector y of dimension N,

    y = [y(1) y(2) ··· y(N)]ᵀ.    (7.45)

Let us express this mapping as

    y = Q(x).    (7.46)

The vector y has a special distribution in that it may only take one of L (deterministic) vector values in ℝᴺ. Therefore, its pdf will consist of L impulses over the N-dimensional hyperplane. Let us designate these L values by y₁, ..., y_L.

Basically, the vector quantization of x may be viewed as a pattern recognition problem involving the classification of the outcomes of the random variable x, say x, into a discrete number of categories or "cells" in N-space in a way that optimizes some fidelity criterion, such as mean square distortion. For example, consider the quantization of the outcomes of the two-dimensional vectors x = [x(1) x(2)]ᵀ. The two-dimensional space is partitioned into cells as illustrated in Fig. 7.4, where we have arbitrarily selected hexagonal-shaped cells {C_k}. All input vectors that fall in cell C_k are "quantized" into the vector y_k, which is shown in Fig. 7.4 as the center of the hexagon. In this example, there are
L = 37 vectors, one for each of the 37 cells into which the two-dimensional space has been partitioned. Under a VQ scheme, if x is a vector of signal samples to be transmitted or stored, for example, then only the index k of the cell to which x is assigned is actually transmitted or stored.

In general, quantization of the N-dimensional vector x into an N-dimensional vector y_k introduces a quantization error or a distortion d(x, y_k). In a statistical sense, the average distortion over the set of input vectors is

    D = Σ_{k=1}^{L} P(x ∈ C_k) E{d(x, y_k) | x ∈ C_k}.    (7.47)

FIGURE 7.4. An example of quantization in two-dimensional space. The kth "cell" C_k is quantized to its centroid y_k.

Thus far, our discussion has been formulated in terms of abstract (stochastic) quantities. Before proceeding with the theoretical developments, let us point out some of the practical uses of VQ. In practice, of course, we will be faced with the task of associating a real input vector, say x, with one of the vectors y₁, ..., y_L. As noted above, x is to be thought of as an outcome of the vector-valued random variable x. x might represent an N-length frame of speech that is to be coded. For example, suppose

    x = [elements f(n; m), n = m − N + 1, ..., m],    (7.48)

where f(n; m) is our usual notation for a frame of speech ending at time m for a given window. However, VQ is not limited to quantizing a block of signal samples of a source waveform. It can also be applied to quantizing a set of parameters extracted from the data. Examples include linear predictive coefficients, in which case x is the M-dimensional LP vector

    x = [a(1; m) ··· a(M; m)]ᵀ.    (7.49)

These parameters can be considered and quantized as a block by applying some appropriate distortion measure. Ordinarily, the Itakura distance measure would be used as the measure of distortion.

Alternative sets of parameters that may be quantized as a block and transmitted to the receiver are reflection coefficients, log-area ratio parameters, and inverse sine parameters. These were introduced in Section 5.3.3. In these cases the vector x takes the forms

    x = [κ(1; m) ··· κ(M; m)]ᵀ,    (7.50)

    x = [g(1; m) ··· g(M; m)]ᵀ,    (7.51)

and

    x = [s(1; m) ··· s(M; m)]ᵀ,    (7.52)

respectively, for coding of the N-length frame ending at time m. The l₂ and l∞ norms are typically used as measures of distortion for these parameters (Makhoul et al., 1985).

The cells {C_k} and vectors {y_k} are selected so that the average distortion is minimized over all L-level quantizers. There are two conditions for optimality:

1. The optimal quantizer employs a nearest neighbor selection rule, which may be expressed mathematically as follows: Let x be a vector to be classified (quantized). Then

    Q(x) = y_k (x ∈ C_k)    (7.54)

if and only if

    d(x, y_k) ≤ d(x, y_j) for k ≠ j, 1 ≤ j ≤ L.    (7.55)

2. Each output vector y_k is chosen to minimize the average distortion in cell C_k, say D_k. In other words, y_k is the vector in C_k such that

    y_k = argmin_y D_k = argmin_y E{d(x, y) | x ∈ C_k}
        = argmin_y ∫···∫_{ξ ∈ C_k} d(ξ, y) f_x(ξ₁, ..., ξ_N) dξ₁ ··· dξ_N.    (7.56)

quantization problem. In general, we expect the code vectors to be closer together in regions where the joint pdf is large and farther apart in regions where f_x(·) is small.

As an upper bound on the performance of a vector quantizer, we may use the performance of the optimal scalar quantizer, which can be applied to each component of the vector, as described in the preceding section. On the other hand, the best performance that can be achieved by optimum VQ is given by the rate-distortion function or, equivalently, the distortion-rate function.

The distortion-rate function, which was introduced in the preceding section, may be defined in the context of VQ as follows. In this case we have from (7.47) that the average distortion D resulting from representing x(m) by y(m) (for arbitrary m) is E{d[x(m), y(m)]}. It is useful to express this on a "per dimension" basis; that is, we define

    D_N ≝ D/N = E{d[x(m), y(m)]}/N.    (7.59)

Now the information in the output process, which consists exclusively of the vectors y_k, 1 ≤ k ≤ L, can be transmitted at an average bit rate of

    R̄ = H(y)/N bpn,    (7.60)

where H(y) is the entropy of the quantized source output,

    H(y) = −Σ_{k=1}^{L} P_k log₂ P_k.    (7.61)

For a given average rate R̄, the minimum achievable distortion per dimension is D_N(R̄). Letting the dimension N approach infinity, we obtain

    D(R̄) = lim_{N→∞} D_N(R̄),    (7.63)

where D(R̄) is the distortion-rate function introduced in Section 7.2.1. It is apparent from this development that the distortion-rate function can be approached arbitrarily closely by increasing the size N of the vectors.

The development above is predicated on the assumption that the joint pdf f_x(·) of the data vector is known. However, in practice, the joint pdf may not be known. In such a case, it is possible to select the quantized output vectors adaptively from a set of training vectors, using the K-means algorithm described in Section 1.3.5. This algorithm iteratively
envision a vector-valued input source (stationary stochastic process), say subdivides the training vectors into L clusters such that the two neces
x, consisting of a sequence of random vectors x(m). Consider, for exam sary conditions for optimality are practically satisfied.
ple, that each x(m) represents a block (frame) of N speech samples end It is appropriate that we digress momentarily to remind the reader of
ing at time m~ as in (7.48). Each input vector is then quantized to the term LBG algorithm that often appears in the present context. In
produce an output vector random pro cess, say y, with vector random var Section 1.3.5 we noted that Lloyd (1957), in considering scalar quantiza
iables I(m). The transformation is of the form tion for pulse code modulation, had essentially developed the K-means
I(m) = Q [ ~(m) ] (7.57) al. (1980) were the first in the communications field to suggest the use of
p rIem) = Yk] = PI:' (7.58) th e speech processing and other communications literature, is frequently
,
called the Linde-Buzo-Gray (LBO) algorithm. The LBO algorithm and a
slight variation are detailed in Fig. 1.16. ~. l.l
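The two optimality conditions and the iterative K-means (LBG-style) design just described can be sketched as follows. This is a minimal illustration, not the exact procedure of Fig. 1.16; it assumes a squared-error distortion in place of a general d(x, y), and the function names are our own.

```python
import random

def train_codebook(training, L, iters=20):
    """K-means (LBG-style) codebook design for vector quantization.

    Alternates the two necessary conditions for optimality:
    1. nearest-neighbor partition of the training set into cells C_k,
    2. centroid update: each y_k minimizes the average distortion in C_k
       (for squared-error distortion, that minimizer is the cell mean).
    """
    codebook = random.sample(training, L)          # initial output vectors
    for _ in range(iters):
        cells = [[] for _ in range(L)]
        for x in training:                         # condition 1: nearest neighbor
            k = min(range(L),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(x, codebook[j])))
            cells[k].append(x)
        for k, cell in enumerate(cells):           # condition 2: centroid rule
            if cell:
                codebook[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def quantize(x, codebook):
    """Return the index k of the nearest code vector (the value transmitted)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))
```

In practice the iteration is stopped when the average distortion no longer decreases appreciably, rather than after a fixed number of passes.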
Once we have selected the output vectors {y_k, 1 ≤ k ≤ L}, we have established what is known as a codebook. Each input signal vector x(m) is quantized to the output vector that is nearest to it according to the distortion measure that is adopted. If the computation involves evaluating the distance between x(m) and each of the L possible output vectors {y_k}, the procedure constitutes a full search. If we assume that each computation requires N floating point operations (flops),⁴ the computational requirement for a full search is

    C = NL   (7.64)

flops per input vector.

If we select L to be a power of two, then log_2 L is the number of bits required to represent each vector. Now, if R denotes the bit rate per sample [per component or dimension of x(m)], we have NR = log_2 L and, hence, the computational cost is

    C = N 2^{NR}.   (7.65)

Note that the number of computations grows exponentially with the dimensionality parameter N and the bit rate R per dimension. Because of this exponential increase of the computational cost, VQ has been applied to low bit rate source encoding, such as in coding reflection coefficients or log-area ratios in linear predictive coding.
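The exponential growth of the full-search cost in (7.65) is easy to see numerically; the helper below simply evaluates C = N·2^{NR} (the function name is ours):

```python
def full_search_flops(N, R):
    """Full-search cost C = N * 2**(N*R) flops per input vector, (7.65).

    One distance computation of N flops is needed for each of the
    L = 2**(N*R) code vectors in the codebook.
    """
    return N * 2 ** (N * R)

# e.g., at R = 2 bits per dimension, doubling N from 4 to 8
# raises the cost from 1024 to 524288 flops per input vector.
```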
The computational cost associated with full search can be reduced by slightly suboptimum algorithms [see Cheng et al. (1984) and Gersho (1982)]. A particularly simple approach is to construct the codebook based on a binary tree search. Binary tree search is a hierarchical clustering method for partitioning the N-dimensional space in a way that reduces the computational cost of the search to be proportional to log_2 L. This method begins by subdividing the N-dimensional training vectors into two regions using the K-means algorithm with K = 2. Thus we obtain two regions and two corresponding centroids, say y_1 and y_2. In the next step, all points that fall into the first region are further subdivided into two regions by using the K-means algorithm with K = 2. Thus we obtain two centroids, say y_{1,1} and y_{1,2}. This procedure is repeated for the second region to yield the two centroids y_{2,1} and y_{2,2}. Thus the N-dimensional space is divided into four regions, each region having its corresponding centroid. This process is repeated until we have subdivided the N-dimensional space into L = 2^B regions, where B = NR is the number of bits per code vector. (Note that B is an integer.) The corresponding code vectors may be viewed as terminal nodes in a binary tree, as shown in Fig. 7.5.

FIGURE 7.5. Uniform tree for binary search vector quantization.

Given a signal (or parameter) vector x(m), the search begins by comparing x(m) with y_1 and y_2. If d[x(m), y_1] < d[x(m), y_2], we eliminate the half of the tree stemming from y_2. Then we compute d[x(m), y_{1,i}], i = 1, 2. If d[x(m), y_{1,1}] > d[x(m), y_{1,2}], we eliminate the part of the tree stemming from y_{1,1} and continue the binary search along y_{1,2}. The search terminates after B steps when we reach a terminal node.

The computational cost of the binary tree search is

    C = 2N log_2 L = 2NB   (7.66)

flops, which is linear in B compared with the exponential cost for full search. Although the cost has been significantly reduced, the memory required to store the vectors has actually increased from NL to approximately 2NL. The reason for this increase is that we now have to store the vectors at the intermediate nodes in addition to the vectors at the terminal nodes.

This binary tree search algorithm generates a uniform tree. In general, the resulting codebook will be suboptimum in the sense that the codewords result in more distortion compared to the codewords generated by the unconstrained method corresponding to a full search. Some improvement in performance may be obtained if we remove the restriction that the tree be uniform. In particular, a codebook resulting in lower distor-
⁴We again define a flop to be one multiplication and one addition.
7 .2 I Optimum S c alar and Vector Quantization 433
432 Ch. 7 I Speech Cod ing and Synthesis
tion is obtained by subdividing the cluster of test vectors having the largest total distortion at each step in the process. Thus, in the first step, the N-dimensional space is divided into two regions. In the second step, we select the cluster with the larger distortion and subdivide it. Now, we have three clusters. The next subdivision is performed on the cluster having the largest distortion. Thus we obtain four clusters and we repeat the process. The net result is that we generate a nonuniform code tree as illustrated in Fig. 7.6 for the case L = 7. Note that L is no longer constrained to be a power of two. In Section 7.4.6 we compare the distortion of these two suboptimum search schemes with that for full search.

To demonstrate the benefits of vector quantization compared with scalar quantization, we present the following example, due to Makhoul et al. (1985).

EXAMPLE

Let x = [x(1) x(2)]^T be a random vector with uniform joint pdf

    f_x(ξ_1, ξ_2) = 1/(ab)  for ξ ∈ C,  and 0 otherwise,   (7.67)

where C is the rectangular region illustrated in Fig. 7.7. Note that the rectangle is rotated by 45° relative to the horizontal axis. Also shown in Fig. 7.7 are the marginal densities f_{x(1)}(ξ_1) and f_{x(2)}(ξ_2), each supported on an interval of width (a + b)/√2. If we quantize the outcomes of x(1) and x(2) separately by using uniform intervals of length Δ, each component requires L_{x(i)} = (a + b)/(√2 Δ) levels.

Hence the number of bits needed for coding the vector outcome x = [x(1) x(2)]^T using scalar quantization is

    R_{x,SQ} = R_{x(1)} + R_{x(2)} = log_2 L_{x(1)} + log_2 L_{x(2)} = log_2 [(a + b)² / (2Δ²)].   (7.69)

Thus the scalar quantization of each component is equivalent to vector quantization with the total number of levels

    L_{x,SQ} = L_{x(1)} L_{x(2)} = (a + b)² / (2Δ²).   (7.70)

We observe that this approach is equivalent to covering the large square that encloses the rectangle by square cells, where each cell represents one of the L_{x,SQ} quantized regions. Since f_x(ξ) = 0 except for ξ ∈ C, this encoding is wasteful and results in an increased bit rate.
FIGURE 7.6. Nonuniform tree for binary search vector quantization.
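The B-step binary tree search described earlier can be sketched as follows; this is an illustrative fragment in which the node-labeling scheme and the squared-error distortion are our choices, not the text's:

```python
def tree_search(x, tree, B):
    """Binary tree-search VQ (uniform tree, as in Fig. 7.5).

    At each of the B levels, compare x with the two child code vectors
    and descend along the nearer one, so the cost is 2N flops per level
    and 2NB in total, (7.66).  `tree` maps a path string ('1', '2',
    '11', '12', ...) to the code vector stored at that node; the
    returned path of length B identifies the terminal-node codeword.
    """
    def d(u, v):  # squared error stands in for a general distortion d(.,.)
        return sum((a - b) ** 2 for a, b in zip(u, v))

    path = ""
    for _ in range(B):
        left, right = tree[path + "1"], tree[path + "2"]
        path += "1" if d(x, left) < d(x, right) else "2"
    return path
```

Only the path (B bits) needs to be transmitted, exactly as with the full-search index.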
If we were to cover only the region C for which f_x(ξ) ≠ 0 with squares having area Δ², the total number of levels that will result is the area of the rectangle divided by Δ²,

    L_{x,VQ} = ab / Δ².   (7.71)

Therefore, the difference in bit rate between the scalar and vector quantization methods is

    R_{x,SQ} − R_{x,VQ} = log_2 [(a + b)² / (2ab)].   (7.72)

For instance, if a = 4b, the difference in bit rate is

    R_{x,SQ} − R_{x,VQ} = 1.64 bits per vector.   (7.73)

Vector quantization has been applied to several types of speech encoding methods, including both waveform and model-based methods. In model-based methods such as linear predictive coding, VQ has made possible the coding of speech at rates below 1000 bps [see papers by Buzo et al. (1980); Roucos et al. (1982); and Paul (1983)]. When applied to waveform encoding methods, it is possible to obtain good-quality encoded speech at 16,000 bps or, equivalently, at R = 2 bpn. With additional computational complexity, it may be possible in the future to implement waveform encoders producing good-quality speech at a rate of R = 1 bpn.

7.3 Waveform Coding

7.3.1 Introduction

Methods for digitally representing the temporal or spectral characteristics of speech waveforms are generally called waveform encoding. In this section we describe several time domain and frequency domain waveform encoding techniques. These techniques have been widely used in practice.

With the exception of Section 7.3.4, the methods considered in Section 7.3 involve scalar quantization of time samples or frequency samples of a speech signal. In contrast to optimum scalar quantization described in Section 7.2.1, which requires knowledge of the pdf of the signal samples, we make few assumptions about specific statistical properties of speech signals. Consequently, the waveform encoding techniques to be described do not achieve the theoretically optimal performance that is achievable when the pdf of the signal samples is known. Nevertheless, the techniques described below are relatively robust and, at sufficiently high bit rates, provide high-quality speech.

7.3.2 Time Domain Waveform Coding

In this section we describe several waveform quantization and encoding techniques that have been applied to speech signals. In particular, we consider pulse code modulation (PCM), differential pulse code modulation (DPCM), delta modulation (DM), and several adaptive versions of these methods.

The quantized waveform may be modeled mathematically as

    s̃(n) = s(n) + q(n),   (7.74)

where s̃(n) represents the quantized value of s(n) and q(n) represents the quantization error, which we treat as an additive noise. Assuming that a uniform quantizer is used and the number of levels is sufficiently large, the quantization noise is well characterized statistically as a realization of a stationary random process q in which each of the random variables q(n) has the uniform pdf

    f_{q(n)}(ξ) = 1/Δ,  −Δ/2 ≤ ξ ≤ Δ/2,   (7.75)

where the step size of the quantizer is Δ = 2^{−R}. The mean square value of the quantization error is

    E{q²(n)} = Δ²/12 = 2^{−2R}/12   (7.76)

for arbitrary n. Measured in decibels, the mean square value of the noise is

    10 log_10 (Δ²/12) = 10 log_10 (2^{−2R}/12) = −6R − 10.8 dB.   (7.77)
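The −6 dB-per-bit behavior of (7.76)-(7.77) can be checked empirically. The sketch below assumes the signal is normalized to [−1/2, 1/2) so that Δ = 2^{−R}; the function names are ours:

```python
import math

def quantize_uniform(s, R):
    """Uniform R-bit (mid-rise) quantization of samples in [-0.5, 0.5),
    with step size delta = 2**-R as in the text."""
    delta = 2.0 ** -R
    return [delta * math.floor(x / delta) + delta / 2 for x in s]

def noise_power_db(s, R):
    """Empirical 10*log10 E{q^2}; theory (7.77) predicts -6R - 10.8 dB."""
    sq = quantize_uniform(s, R)
    q = [a - b for a, b in zip(sq, s)]
    return 10 * math.log10(sum(v * v for v in q) / len(q))
```

For a busy signal that exercises many quantizer cells, the measured noise power tracks the theoretical value closely; each added bit buys roughly 6 dB.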
436 Ch. 7 / Speech COding and Synthesis
7.3 / Waveform Coding 437
FIGURE 7.8. Input-output magnitude characteristic for a logarithmic compressor. [Axes: input magnitude |s| and output magnitude, each running from 0 to 1.0.]

FIGURE 7.9. Comparison of μ-law and A-law nonlinearities (A = 87.56).⁵

⁵A piecewise linear approximation to the characteristic for μ = 255 is used in practice.
between successive samples is relatively small. Consequently, an encoding scheme that exploits the redundancy in the samples will result in a lower bit rate for the source output.

A relatively simple solution is to encode the differences between successive samples rather than the samples themselves. The resulting technique is called differential pulse code modulation (DPCM). Since differences between samples are expected to be smaller than the actual sampled amplitudes, fewer bits are required to represent the differences. In this case we quantize and transmit the differenced speech sequence

    e(n) = s(n) − s(n − 1).   (7.80)

In keeping with the comments in the introductory paragraph, we note that the differencing procedure is a simple attempt to remove redundancy (correlation) from the speech sequence. One way to see this is to think about the spectral "tilt" on the speech spectrum, which is lessened by 6 dB per octave by the differencing operation. We can also easily show by taking long-term averages that the temporal variance (power) of the sequence e(n) is related to that of s(n) as

    σ_e² = 2σ_s² [1 − r_s(1)/σ_s²],   (7.81)

where we have assumed that the temporal averages of both sequences are zero. Therefore, as long as r_s(1)/r_s(0) = r_s(1)/σ_s² > 0.5 (indicating sufficient correlation), the differencing will reduce the long-term power in the signal. In fact, the long-term temporal autocorrelation ratio r_s(1)/r_s(0) typically exceeds 0.8.

More generally, let s(n) denote the current sample from the source and let ŝ(n) denote the predicted value of s(n), defined as

    ŝ(n) = Σ_{i=1}^{M} a(i) s(n − i).   (7.82)

Thus ŝ(n) is a weighted linear combination of the past M samples and the a(i)'s are the LP coefficients. If we seek the prediction coefficients that minimize the MSE between s(n) and its predicted value, then the optimal solution is exactly the set of LP coefficients derived in Chapter 5. (In fact, Interpretive Problem 5.2 in Section 5.1.2 is precisely the design problem.) The prediction error⁶ sequence is the difference

    e(n) = s(n) − ŝ(n) = s(n) − Σ_{i=1}^{M} a(i) s(n − i).   (7.83)

Before proceeding, we should point out that long-term notations are intentionally used in the above. The objective of the LP procedure is not to do an excellent job of prediction on a sample-by-sample basis, but rather to remove correlation in a broad sense for more efficient quantization. Accordingly, a less-than-perfect predictor will suffice. Also, the LP parameters can be computed over a very long corpus of speech data and can be built into the quantizer as static parameters. The case in which the LP parameters are dynamically estimated is discussed below.

Having described the method for determining the predictor coefficients, we now consider the block diagram of a practical DPCM system, shown in Fig. 7.10. In this configuration, the predictor is implemented with a feedback loop around the quantizer. The input to the predictor is denoted as s̃(n), which represents the signal sample s(n) modified by the quantization process, and the output of the predictor is

    ŝ(n) = Σ_{i=1}^{M} a(i) s̃(n − i).   (7.84)

FIGURE 7.10. (a) Block diagram of a DPCM encoder. (b) DPCM decoder at the receiver.

⁶In Chapter 5 this residual signal is called ê(n) because it is nominally an estimate of the vocal system model input, which was there called e(n). We omit the "hat" in this material for simplicity.
The difference

    e(n) = s(n) − ŝ(n)   (7.85)

is the input to the quantizer and ẽ(n) denotes the output. Each value of the quantized prediction error ẽ(n) is encoded into a sequence of binary digits and transmitted over the channel to the receiver. The quantized error ẽ(n) is also added to the predicted value ŝ(n) to yield s̃(n).

At the receiver, the same predictor that was used at the transmitting end is synthesized and its output ŝ(n) is added to ẽ(n) to yield s̃(n). The signal s̃(n) is the desired excitation for the predictor and also the desired output sequence from which the reconstructed signal s_a(t) is obtained by filtering, as shown in Fig. 7.10(b).

The use of feedback around the quantizer, as described above, ensures that the error in s̃(n) is simply the quantization error q(n) = e(n) − ẽ(n) and that there is no accumulation of previous quantization errors in the implementation of the decoder. That is,
    q(n) = e(n) − ẽ(n) = s(n) − s̃(n).   (7.86)

In an adaptive quantizer, the step size is matched to the slowly varying power of the input. A simple feed-forward approach estimates the short-term variance of the signal from the most recent N samples,

    σ̂²(n) = (1/N) Σ_{i=n−N}^{n−1} s̃²(i).   (7.88)

Then, the step size for the quantizer is chosen in proportion to the estimate σ̂(n).
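The feedback arrangement of Fig. 7.10 can be sketched as follows. This is a minimal first-order illustration: the static coefficient a and the simple uniform quantizer are our stand-ins for the general M-tap predictor and quantizer of the text. It exhibits the no-accumulation property: the reconstruction differs from s(n) only by the current sample's quantization error.

```python
import math

def dpcm(s, a=0.9, delta=0.05):
    """First-order DPCM with the quantizer inside the prediction loop.

    Predictor:  shat(n) = a * stilde(n-1)   (hypothetical static tap)
    Quantizer:  uniform mid-rise with step `delta`
    Encoder and decoder run the same recursion on the quantized error,
    so stilde(n) - s(n) = etilde(n) - e(n): the per-sample quantization
    error only, with no accumulation at the decoder.
    """
    q = lambda e: delta * (math.floor(e / delta) + 0.5)
    stilde_prev = 0.0
    recon = []
    for x in s:
        shat = a * stilde_prev      # prediction from quantized history
        e = x - shat                # prediction error, (7.85)
        et = q(e)                   # quantized error sent to the channel
        stilde_prev = shat + et     # same recursion the decoder runs
        recon.append(stilde_prev)
    return recon
```

Because the predictor is driven by s̃(n) at both ends, the reconstruction error is bounded by Δ/2 for every sample, as the test below checks.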
coefficients on a sample-by-sample basis by using a gradient-type algorithm. Similar gradient-type algorithms have also been devised for adapting the filter coefficients a(i) and b(i) of the DPCM system shown in Fig. 7.10. Details on such algorithms can be found in (Jayant and Noll, 1984). Some recursive algorithms for computing the LP parameters are also discussed in Section 5.3.3.

A 32-kbps ADPCM standard has been established by CCITT (Consultative Committee for International Telephone and Telegraph) for interna-

FIGURE 7.12. Example of a quantizer with an adaptive step size (Jayant, 1974). [Inputs in the bands delimited by ±Δ, ±2Δ, ±3Δ map to output magnitudes Δ/2, 3Δ/2, 5Δ/2, 7Δ/2, each output level carrying a 3-bit code and a step-size multiplier M(1)-M(4).]

TABLE 7.8. Multiplication Factors for Adaptive Step Size Adjustment (Jayant, 1974).

              PCM                     DPCM
        2      3      4         2      3      4    (bits)
M(1)   0.60   0.85   0.80      0.80   0.90   0.90
M(2)   2.20   1.00   0.80      1.60   0.90   0.90
M(3)          1.00   0.80             1.25   0.90
M(4)          1.50   0.80             1.70   0.90
M(5)                 1.20                    1.20
M(6)                 1.60                    1.60
M(7)                 2.00                    2.00
M(8)                 2.40                    2.40
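Jayant's one-word-memory step-size adaptation can be sketched with the 2-bit PCM multipliers from Table 7.8, M(1) = 0.60 and M(2) = 2.20. The quantizer structure and the clamping limits below are our simplifications:

```python
def jayant_adaptive_pcm(s, m=(0.60, 2.20), step0=0.1,
                        step_min=1e-4, step_max=1.0):
    """2-bit adaptive quantization with Jayant's step-size rule.

    Each sample is coded as a sign plus one of two magnitude levels
    (inner level step/2, outer level 3*step/2).  The magnitude level
    l(n) in {1, 2} selects the multiplier for the next step:
    step(n+1) = step(n) * M(l(n)), with M taken from Table 7.8.
    Returns the reconstructed samples.
    """
    step = step0
    out = []
    for x in s:
        level = 1 if abs(x) < step else 2         # inner or outer magnitude
        y = step / 2 if level == 1 else 3 * step / 2
        out.append(y if x >= 0 else -y)
        step = min(max(step * m[level - 1], step_min), step_max)
    return out
```

Large inputs keep selecting the outer level and grow the step (M(2) > 1); small inputs shrink it (M(1) < 1), so the quantizer tracks the signal's short-term amplitude.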
Now, since

    q(n) = ẽ(n) − e(n) = ẽ(n) − [s(n) − ŝ(n)],   (7.93)

the decoder output again differs from s(n) only by the per-sample quantization error, just as in the nonadaptive DPCM system.

FIGURE 7.13. ADPCM with adaptation of the predictor.

a pole-zero predictor of the form shown in Fig. 7.11. A gradient algorithm is employed to adaptively adjust the coefficients of the pole-zero predictor. A description of the 32-kbps ADPCM coding standard is given in the paper by Benvenuto et al. (1986).

Delta Modulation

Delta modulation (DM) may be viewed as a simplified form of DPCM in which a two-level (1-bit) quantizer is used in conjunction with a fixed first-order predictor.

FIGURE 7.14. (a) Block diagram of a DM system. (b) An equivalent realization of a DM system.
ẽ(n). Hence an equivalent realization of the one-step predictor is an accumulator with an input equal to the quantized error signal ẽ(n). In general, the quantized error signal is scaled by some value, say Δ₁, which is called the step size. This equivalent realization is illustrated in Fig. 7.14(b). In effect, the encoder shown in Fig. 7.14(b) approximates a waveform s_a(t) by a linear staircase function. In order for the approximation to be relatively good, the waveform s_a(t) must change slowly relative to the sampling rate. This requirement implies that the sampling rate must be several (at least five) times the Nyquist rate.

FIGURE 7.15. An example of slope overload distortion and granular noise in a DM encoder.

every iteration. The quantized error sequence ẽ(n) provides a good indication of the slope characteristics of the waveform being encoded. When the quantized error ẽ(n) changes signs between successive iterations, this indicates that the slope of the waveform in the locality is relatively small. On the other hand, when the waveform has a steep slope, successive values of the error ẽ(n) are expected to have identical signs. From these observations, it is possible to devise algorithms that decrease or increase the step size depending on successive values of ẽ(n). A relatively simple rule devised by Jayant (1970) is to vary adaptively the step size accord-

FIGURE 7.16. An example of variable step size DM encoding.
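An adaptive DM loop in the spirit of Jayant's (1970) rule can be sketched as follows; the multiplier K and the clamping limits are hypothetical choices, not values from the text:

```python
def adaptive_dm(s, step0=0.05, K=1.5, step_min=1e-3, step_max=1.0):
    """Adaptive delta modulation with a 1-bit quantizer and accumulator.

    The step size is multiplied by K when two successive bits agree
    (a steep slope, fighting slope overload) and divided by K when they
    alternate (the granular region, reducing granular noise).
    Returns (bits, reconstruction).
    """
    acc, step, prev_bit = 0.0, step0, 0
    bits, recon = [], []
    for x in s:
        bit = 1 if x >= acc else -1
        step = min(max(step * (K if bit == prev_bit else 1 / K),
                       step_min), step_max)
        acc += bit * step          # accumulator = the one-step predictor
        bits.append(bit)
        recon.append(acc)
        prev_bit = bit
    return bits, recon
```

On a ramp input the repeated same-sign bits grow the step until the staircase keeps pace, illustrating how adaptation counters slope overload.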
The parameters a, k₁, and k₂ are selected such that 0 < a < 1 and k₁ > k₂ > 0. For more discussion on this and other variations of ADM, the interested reader is referred to papers by Jayant (1974) and Flanagan et al. (1979), and to the extensive reference lists contained in these papers.

FIGURE 7.17. An example of a DM system with adaptive step size.

For a single-tap pitch predictor, the optimal coefficient is

    b(1; m) = r_s(P; m) / r_s(0; m),   (7.100)

where r_s(η; m) represents a short-term estimator of the autocorrelation of the speech sequence s(n) corresponding to the processing of the frame f(n; m).

In practice, the true pitch period may not be an exact multiple, namely P, of the sampling period 1/F_s. In such a case, we may use a third-order pitch predictor of the form

    f̂(n; m) = b(1; m) f(n − P + 1; m) + b(2; m) f(n − P; m) + b(3; m) f(n − P − 1; m),   (7.101)
performed either on the time domain waveform in each subband or on the frequency domain representation of the corresponding time domain waveform.

[(a) Encoder. Block diagram: the frame f(n; m) is quantized and encoded for the channel, with spectral prediction and pitch parameter estimation blocks supplying side information.]

Filter Bank Spectrum Analyzer

In Section 4.3.5 we introduced the filter bank method of waveform encoding; let us briefly recall the principles of the system described there. The stDFT of the N-length frame ending at time m is

    S(k; m) = Σ_{n=m−N+1}^{m} s(n) w(m − n) e^{−jk(2π/N)n},  k = 0, …, N − 1.   (7.103)

We have argued that these samples are sufficient to specify the entire nonzero part of the time domain sequence, f(n; m), or equivalently its short-term Fourier transform S(ω; m) for any ω. We have also argued that because the window acts as a lowpass filter, it is sufficient to encode the stDFT only at a decimated set of frame-ending times (note the N : 1 decimators in Fig. 7.19).
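Equation (7.103) can be evaluated directly as a sketch (a direct O(N²) sum, not the efficient filter bank realization of Fig. 7.19):

```python
import cmath

def stdft(s, m, N, w):
    """Short-term DFT of the N-length frame ending at time m, (7.103):

        S(k; m) = sum_{n=m-N+1}^{m} s(n) w(m-n) exp(-j k (2*pi/N) n)

    `s` is the signal (indexable for n = m-N+1 .. m) and `w` is the
    analysis window, indexed 0 .. N-1.
    """
    return [sum(s[n] * w[m - n] * cmath.exp(-1j * k * (2 * cmath.pi / N) * n)
                for n in range(m - N + 1, m + 1))
            for k in range(N)]
```

For a rectangular window and a constant frame, all the energy lands in the k = 0 bin, which gives a quick sanity check.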
FIGURE 7.19. Filter bank spectrum analyzer.

FIGURE 7.20. Filter bank synthesizer.
defined for a finite sequence that begins at the time origin. Accordingly, we first shift the frame down to begin at time n = 0 [as in (4.76)],

    f⃗(n; m) = f(n + m − N + 1; m).   (7.106)

The short-term DCT (stDCT) is defined as

    C⃗(k; m) = 2 Σ_{n=0}^{N−1} f⃗(n; m) cos[π(2n + 1)k / (2N)],  k = 0, …, N − 1,
    and 0 otherwise,   (7.107)

where the arrows above f and C are simply reminders that this definition includes the shifting of the frame so that it "begins" at time 0. Then the inverse stDCT is given by

    f⃗(n; m) = (1/N) [ ½ C⃗(0; m) + Σ_{k=1}^{N−1} C⃗(k; m) cos[π(2n + 1)k / (2N)] ],  n = 0, …, N − 1,
    and 0 otherwise,   (7.108)

so that the frame returns to the time range n = 0, …, N − 1. Note that the spectral parameters C⃗(k; m) are real-valued. In fact, it is not difficult to show that C⃗(0; m) = Re{X⃗(0; m)}, where X⃗(k; m) denotes the stDFT of the shifted frame, with analogous expressions relating C⃗(k; m) to X⃗(k; m) for the remaining k.

The bit allocation for encoding the spectral samples C⃗(k; m), k = 0, 1, …, N − 1, is done adaptively by monitoring the power as a function of frequency. For this purpose, the N frequency samples are subdivided into L nonoverlapping frequency bands. Each of the N/L spectral values in a frequency band is squared, the logarithm is computed, and the resulting logarithmic values are added together to form an average logarithmically scaled power level for the band. Thus we obtain the value P_s(k; m), k = 0, 1, …, N − 1, corresponding to each of the spectral coefficients.

Based on the values of the P_s(k; m), the number of bits allocated to coding each of the spectral components C⃗(k; m) is estimated from the formula

    R_k = R/N + ½ [ P_s(k; m) − (1/N) Σ_{i=0}^{N−1} P_s(i; m) ],   (7.110)

where R is the total number of bits available to encode the entire block of spectral samples. Note that R/N is the average number of bits per spectral component. The second term serves to increase or decrease the number of bits relative to the average, depending on whether the power level of the spectral component is above or below the average power level among the N spectral components.

In general, the R_k will not take integer values and must be rounded off to integers. If R_k < 0, it must be set to zero. In addition to coding the C⃗(k; m), the spectral power measurements P_s(k; m), k = 1, …, L, must also be transmitted to the receiver as side information.
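The allocation rule (7.110), together with the rounding and clipping just described, can be sketched as follows (the function name is ours; note that the rounded allocations need not sum exactly to R, and practical coders apply a final adjustment):

```python
def allocate_bits(P, R):
    """Adaptive transform coder bit allocation, (7.110).

    P[k] is the log-power level associated with spectral component k.
    Each component gets the average R/N bits plus half its log-power
    excess over the mean; negative allocations are clipped to zero and
    all values are rounded to integers.
    """
    N = len(P)
    mean_P = sum(P) / N
    return [max(0, round(R / N + 0.5 * (p - mean_P))) for p in P]
```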
FIGURE 7.23. Block diagram of an adaptive transform coder (Tribolet and Crochiere, 1979). [The speech analyzer computes and quantizes side information and performs bit assignment and step-size computation with interpolation before transmission to the channel; the speech synthesizer inverts these steps to produce the output speech samples.]

7.4 Vocoders

The waveform coding techniques described in Section 7.3 are based on either a sample-by-sample or a frame-by-frame speech waveform representation, either in the time or frequency domain. In contrast, the methods described in this section are based on the representation of a speech signal by an all-pole model of the vocal system, upon which we have based numerous results in our work. For convenience, the model is repeated as Fig. 7.24.

Recall that in this model the speech production system is modeled as an all-pole filter. For voiced speech, the excitation is a periodic impulse train with period equal to the pitch period of the speech. For unvoiced speech, the excitation is a white noise sequence. In addition, there can be an estimated gain parameter included in the model. Basically, the differ-
provides toll-quality speech at about 32,000 bps. With adaptive methods, the bit rate can be reduced further. In particular, APC and ATC provide toll-quality speech at a bit rate of about 16,000 bps.

In order to reduce the bit rate of waveform encoders even further, we can resort to vector waveform quantization, which encodes one frame of speech samples at a time. In particular, we segment the speech waveform samples in frames of N samples and design a codebook containing vectors that are obtained by the K-means algorithm described in Section 1.3.5. Since computational complexity and memory requirements are an important consideration in the design of the codebook, it is necessary to keep the frame size relatively small.

One approach that simplifies the problem to some extent is to extract short-term spectral and pitch information and to code these separately. Therefore, we are left with coding the residual, which tends to be white. Vector quantization is applied to the residual and the index of the quantized vector is transmitted. At the receiver the residual provides the excitation for the synthesis filter.

Besides the computational complexity and memory costs in the use of vector quantization for waveform encoding, another problem is the potential discontinuity in the synthesis of the waveform due to the quantization of frames of signal samples. With vector waveform quantization it is likely that the end of one quantized vector will not match the beginning of the subsequent quantized vector. This problem is particularly serious at low data rates where code vectors in the codebook are far apart. A remedy is to overlap samples in adjacent frames at the expense of an increase in the bit rate.

In spite of these difficulties, there is much research activity devoted to vector waveform quantization with the objective of achieving toll-quality speech at rates in the range of 8000-10,000 bps.

FIGURE 7.24. Model of speech production to be estimated using LP analysis. [(a) A voiced/unvoiced switch selects between an impulse generator with pitch period P (followed by a glottal filter G(z)) and a white noise generator; the gained excitation drives the vocal tract filter H(z) and lip radiation filter R(z) to produce the speech signal. (b) Equivalent form in which the excitation, with gain estimate, drives a single all-pole filter Θ(z).]
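The synthesis side of the Fig. 7.24 model can be sketched as follows; the coefficients, gain, pitch period, and frame length in any call are hypothetical, and the glottal and lip-radiation shaping of panel (a) is absorbed into the single all-pole filter as in panel (b):

```python
import random

def lp_synthesize(a, gain, voiced, pitch_period, n_samples):
    """Synthesize one frame of speech from the LP model of Fig. 7.24.

    `a` holds the all-pole coefficients a(1..M) of the synthesis filter.
    Voiced frames use a periodic impulse train with the given pitch
    period as excitation; unvoiced frames use white noise.
    """
    if voiced:
        e = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    else:
        e = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
    s = []
    for n in range(n_samples):
        acc = gain * e[n]
        for i, ai in enumerate(a, start=1):   # all-pole recursion on past output
            if n - i >= 0:
                acc += ai * s[n - i]
        s.append(acc)
    return s
```

A vocoder transmits only a, the gain, the voicing decision, and the pitch period for each frame, which is the source of its bandwidth compression.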
ent vocoders described below estimate the model parameters from frames of speech (speech analysis), encode and transmit the parameters to the receiver on a frame-by-frame basis, and reconstruct the speech signal from the model (speech synthesis) at the receiver.

Vocoders usually provide more bandwidth compression than is possible with waveform coding. In particular, the techniques described in this section result in communication-quality speech at data rates in the range of 2400-9600 bps. Our discussion includes the channel vocoder, the cepstral (homomorphic) vocoder, the phase vocoder, the formant vocoder, and the linear predictive coder. The last of these is the most widely used in practice today.

bits, the resulting bit rate is in the range of 2400-3200 bps. Further reductions in bit rate to about 1200 bps can be achieved by exploiting the frequency correlations of the spectral magnitudes. In particular, we may use companded PCM for the first band and DPCM to encode the other spectral samples across the frequency band within each frame.

At the receiver, the speech is synthesized as shown in Fig. 7.26. The signal samples are passed through D/A converters whose outputs are multiplied by the voiced or unvoiced signal sources; the resulting signals are passed through corresponding bandpass filters. The outputs of the bandpass filters are summed to form the output synthesized speech signal.

The channel vocoder is the first and oldest vocoder to have been studied and implemented (Dudley, 1939). When implemented with modern

FIGURE 7.26. Block diagram of synthesizer for a channel vocoder.
~
digital signal processing techniques, it provides communication-quality speech at about 2400 bps.
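The synthesis path of Fig. 7.26 can be sketched as follows. This is a minimal illustration, not the book's implementation: the windowed-sinc bandpass design, the band edges, and the default pitch value are all assumptions made for the sketch.

```python
import numpy as np

def bandpass_fir(f_lo, f_hi, fs, taps=101):
    """Windowed-sinc bandpass FIR, built as the difference of two lowpass designs."""
    n = np.arange(taps) - (taps - 1) / 2
    def lowpass(fc):
        return np.sinc(2 * fc / fs * n) * (2 * fc / fs) * np.hamming(taps)
    return lowpass(f_hi) - lowpass(f_lo)

def synthesize_frame(mags, bands, fs, frame_len, pitch_hz=120.0, voiced=True):
    """One frame of channel-vocoder synthesis per Fig. 7.26: scale a common
    excitation by each channel magnitude, bandpass it, and sum the bands."""
    if voiced:
        src = np.zeros(frame_len)
        src[::int(fs / pitch_hz)] = 1.0   # impulse train at the pitch period
    else:
        src = np.random.randn(frame_len)  # noise source for unvoiced frames
    out = np.zeros(frame_len)
    for mag, (f_lo, f_hi) in zip(mags, bands):
        out += np.convolve(mag * src, bandpass_fir(f_lo, f_hi, fs), mode="same")
    return out
```

In a complete decoder the per-band magnitudes would come from the decoded channel signals of Fig. 7.26 rather than being passed in directly.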
7.4.2 The Phase Vocoder

…speech signal. By compromising the signal quality contained in the relative phase information, we achieve greater bandwidth compression and, hence, a lower bit rate.

A typical channel of a phase vocoder analyzer is shown in Fig. 7.27. …compute the magnitude and phase derivative of the speech signal in each band. These are sampled at a nominal rate of 50-60 samples per second. (Note that sampling the magnitude and phase of the speech signal at a …)

Coding of the spectral magnitude may be done as in the channel vocoder by using log PCM and DPCM, with lower frequencies having greater precision. The phase derivative is usually coded linearly using 2-3 bpn. The resulting bit rate is about 7200 bps.

At the receiver, the signal is synthesized as shown in the block diagram in Fig. 7.28. First, the phase in each channel is integrated. Then the signal magnitude and the resulting phase are used to form two lowpass in-phase and quadrature signal components, which are interpolated and translated in frequency by multiplying each component by cos(ω_k n) and sin(ω_k n), where ω_k represents the normalized frequency translation. The two signal components are then added to form the bandpass speech signal for each of the frequency bands.

Due to the limitations indicated above, the phase vocoder has not been widely used in practice.
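The per-channel analysis and synthesis just described can be sketched as follows. The heterodyne-and-average stand-in for the channel filter of Fig. 7.27, and all parameter values, are assumptions made for this sketch, not the text's design.

```python
import numpy as np

def analyze_channel(x, fs, fc, hop, win_len):
    """Magnitude and phase-derivative track for one phase-vocoder channel:
    heterodyne the signal to the band center fc and lowpass it by averaging
    over the analysis window."""
    n = np.arange(len(x))
    baseband = x * np.exp(-2j * np.pi * fc * n / fs)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.array([baseband[s:s + win_len].mean() for s in starts])
    mags = np.abs(frames)
    # Phase derivative approximated by the wrapped phase increment per hop.
    dphi = np.angle(frames[1:] * np.conj(frames[:-1]))
    return mags, dphi

def synthesize_channel(mags, dphi, fs, fc, hop):
    """Fig. 7.28 path: integrate the phase derivative, interpolate the
    magnitude and phase tracks, then remodulate to the band center."""
    phase = np.concatenate(([0.0], np.cumsum(dphi)))
    t_coarse = np.arange(len(mags)) * hop
    t_fine = np.arange(t_coarse[-1] + 1)
    m = np.interp(t_fine, t_coarse, mags)
    p = np.interp(t_fine, t_coarse, phase)
    return m * np.cos(2 * np.pi * fc * t_fine / fs + p)
```

Because only the phase *increments* are transmitted, the absolute phase of each band is lost, which is the quality compromise discussed above.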
by exciting the filter with a white noise sequence. In either case, the
vocal system response is slowly varying, whereas the excitation is varying more rapidly.

This basic difference between the excitation and the vocal system response is exploited by the cepstral vocoder. We recall from Chapter 6 that by processing frames of a speech signal in the cepstral domain we can separate the slowly varying vocal system spectrum from the faster varying periodic spectrum due to the pitch. Thus we can separate these two spectra and estimate the characteristics of the vocal system.

As illustrated in Fig. 7.29, the analysis system computes the short-term cepstrum [called the short-term real cepstrum (stRC) in Chapter 6], c(n; m), for a given frame f(n; m), and then uses a low-time lifter to extract the component of the cepstrum corresponding to the speech production model impulse response,⁸ c_θ(n; m). Typically, a time window of 2-3 msec is sufficient to exclude the effect of the excitation. Pitch estimation can also be performed by analyzing the cepstral component due to the excitation, c_e(n; m), which is extracted by means of a high-time lifter. Cepstral analysis of a real speech waveform is illustrated in Figs. 6.7, 6.8, and 6.10.

The low-time liftered cepstrum, c_θ(n; m), is coded using 5-6 bpn, and the resulting bits are transmitted, along with the side information on voiced/unvoiced speech and pitch period, to the receiver. The resulting bit rate is in the range of 6000-10,000 bps.
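The low-time/high-time liftering described above can be sketched as follows. The 3-msec cutoff follows the 2-3 msec window quoted in the text; the FFT-based real cepstrum and the small logarithm floor are implementation assumptions.

```python
import numpy as np

def real_cepstrum(frame):
    """Short-term real cepstrum of one windowed frame (Chapter 6):
    inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))

def lifter_split(cep, fs, cutoff_ms=3.0):
    """Split the cepstrum into a low-time part (vocal system component)
    and a high-time part (excitation component, used for pitch estimation)."""
    cut = int(cutoff_ms * 1e-3 * fs)
    low = np.zeros_like(cep)
    # Keep the first `cut` quefrencies and their mirror (real-signal symmetry).
    low[:cut] = cep[:cut]
    low[-cut + 1:] = cep[-cut + 1:]
    high = cep - low
    return low, high
```

A pitch estimate can then be taken as the location of the dominant peak in the high-time part, as the text indicates.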
FIGURE 7.29. Block diagram of cepstral vocoder analyzer.

⁸The careful reader will recall that c_θ(n; m) should not have a second (frame) argument. The reason can be found in the discussion leading to (6.30). However, we include the argument m here to keep track of the frame upon which we are working.

A block diagram of the synthesizer for the speech signal at the receiver is shown in Fig. 7.30. The cepstral component c_θ(n; m) is subjected to the customary nonlinear processing that results in the estimate θ̂(n; m) of the system impulse response corresponding to the frame f(n; m) (compare Section 6.2.3). The speech signal is synthesized by convolving the vocal system response with the appropriate excitation, which is generated at the receiver. Edge effects can be reduced by interpolating or smoothing the cepstral responses from frame to frame. Recall that this series of operations will destroy all phase information [see the discussion surround-
ing (6.17)]. In particular, the placement of the frame in time is lost when the "real" cepstrum is used, and the formalities here should not be interpreted to the contrary. In practice, this is not a problem, as the receiver will correctly concatenate incoming information.

FIGURE 7.30. Block diagram of cepstral vocoder synthesizer.

More significantly, the loss of phase information from the speech spectra degrades speech quality. Phase information can be preserved by computing the complex cepstrum (see Section 6.3). However, the additional computation, coupled with the associated phase unwrapping that is necessary at the receiver, renders this approach undesirable from a computational viewpoint. Further, one of the general findings of Quatieri's work discussed in Section 6.4 (Quatieri, 1979) is that the complex cepstrum, by preserving phase information at significant computational cost and algorithmic overhead, does not significantly outperform the real cepstrum in terms of speech quality. Even with the use of the real cepstrum, the computational complexity of the cepstral vocoder is its main disadvantage for practical use (see also Section 6.4).

Another type of homomorphic vocoder is based on an analysis-by-synthesis method. The stRC is used to estimate the vocal system impulse response at the encoder in each frame, as described above. Then the synthesized speech is generated at the encoder by exciting the vocal system filter model. The difference between the synthetic speech and the original speech signal constitutes an error signal, which is spectrally weighted to emphasize lower frequencies and then minimized by optimizing the excitation signal. Optimal excitation sequences are typically computed over four or five blocks within the frame duration, meaning that the excitation is updated more frequently than the vocal system filter.⁹

⁹For clarity in the ensuing discussions, we will use the term block to indicate a subframe interval. The frame will indicate the usual analysis interval over which the speech is analyzed. Ordinarily this will correspond to the interval over which the vocal-tract characterization is estimated. The characterization of the excitation, however, will often be updated more frequently, over blocks of the analysis frame.

A method for determining the excitation sequence has been described by Chung and Schafer (1990). The excitation signal is determined dynamically every few milliseconds within the frame under analysis, and takes one of three forms. For unvoiced speech, the excitation sequence e(n) is selected from a Gaussian codebook of sequences and has the form

e(n) = βp_k(n),  (7.111)

where k is the index of the sequence selected from the codebook, and β is a scale factor. For voiced frames, the excitation sequence e(n) consists of two sequences, selected from a time-variant queue of past excitations, of the form

e(n) = β₁e(n − d₁) + β₂e(n − d₂),  (7.112)

where β₁ and β₂ are scale factors and d₁ and d₂ are appropriately selected values of delay. For speech frames in which the excitation signal is classified as mixed, the excitation sequence e(n) is modeled as the sum of a Gaussian codebook sequence p_k(n) and a sequence selected from an interval of the past excitation, that is,

e(n) = β₁p_k(n) + β₂e(n − d),  (7.113)

where d is the delay.

FIGURE 7.31. Analysis-by-synthesis method for obtaining the excitation sequence e(n).
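The three excitation forms (7.111)-(7.113) can be sketched as follows. The helper `delayed`, the default block length, and the assumption d ≥ n_blk (so the delayed segment lies entirely in the past) are indexing conveniences introduced for this sketch.

```python
import numpy as np

def delayed(past, d, n_blk):
    """Samples e(n - d) for the current block, read from the queue of past
    excitation samples.  Assumes d >= n_blk so the slice stays in the past."""
    start = len(past) - d
    return past[start:start + n_blk]

def excitation(mode, codebook, past, k=0, beta1=1.0, beta2=0.0,
               d1=40, d2=80, n_blk=40):
    """One excitation block in the three forms of (7.111)-(7.113)."""
    if mode == "unvoiced":                                    # (7.111)
        return beta1 * codebook[k, :n_blk]
    if mode == "voiced":                                      # (7.112)
        return beta1 * delayed(past, d1, n_blk) + beta2 * delayed(past, d2, n_blk)
    # "mixed": codebook entry plus a segment of past excitation  (7.113)
    return beta1 * codebook[k, :n_blk] + beta2 * delayed(past, d1, n_blk)
```

In the actual coder the indices, delays, and scale factors would be chosen by the error minimization described next, not passed in by hand.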
Spectral weighting is performed on the vocal-tract impulse response and the original speech frame prior to speech synthesis, as shown in Fig. 7.31. Note that the perceptually weighted speech sequence is denoted as y(n), that is,

y(n) = f(n; m) * w(n),  (7.114)

where f(n; m) is the incoming speech frame of length N, w(n) is the impulse response of the weighting filter, and "*" denotes convolution. The excitation is applied to the weighted vocal-tract response to produce a synthetic speech sequence ŷ(n). For example, the response of the weighted vocal-tract filter to the mixed excitation may be expressed as

ŷ(n) = β₁ŷ₁(n) + β₂ŷ₂(n),  (7.115)

where ŷ₁(n) is the response of the weighted vocal-tract filter to p_k(n) and ŷ₂(n) is its response to e(n − d). The parameters β₁, β₂, k, and d are selected to minimize the error energy over small blocks of time between the weighted speech y(n) and the synthetic speech ŷ(n). For discussion purposes, let us assume that the time range of interest is n = n′, n′ + 1, . . . , n″. To simplify the optimization process, the minimization is performed in two steps. First, β₂ and d are determined to minimize the error energy

ξ₂ = Σ_{n=n′}^{n″} [y(n) − β₂ŷ₂(n)]².  (7.116)

For a given value of d, the optimum value of β₂, say β̂₂(d), is easily shown to be

β̂₂(d) = [Σ_{n=n′}^{n″} y(n)ŷ₂(n)] / [Σ_{n=n′}^{n″} ŷ₂²(n)].  (7.117)

By restricting the delay d to a small range, the optimization of d is performed by exhaustive search, and the resulting β̂₂(d) is obtained from (7.117).

Once these two parameters are determined, the optimum choices of β₁ and k are made based on the minimization of the error energy between the residual signal y_r(n) = y(n) − β₂ŷ₂(n) and β₁ŷ₁(n). Thus β₁ and k are chosen by an exhaustive search of the Gaussian codebook to minimize

ξ₁ = Σ_{n=n′}^{n″} [y_r(n) − β₁ŷ₁(n)]².  (7.118)

For any given sequence from the codebook, the optimum choice for β₁ is

β̂₁ = [Σ_{n=n′}^{n″} y_r(n)ŷ₁(n)] / [Σ_{n=n′}^{n″} ŷ₁²(n)].  (7.119)

The optimum excitation parameters and the vocal system impulse response (or its cepstrum sequence) are coded and transmitted to the decoder. The synthesized speech is generated as shown in Fig. 7.32 by exciting the vocal system filter with the excitation signal e(n). The overlap-add method may be used to combine adjacent output speech records at the receiver.

This analysis-by-synthesis homomorphic vocoder has been implemented by Chung and Schafer (1990). A frame duration of 20 msec was used for the vocal-tract analysis (160 samples at an 8-kHz sampling rate) and a 5-msec block duration (40 samples) for determining the excitation. The codebook employed 256 zero-mean Gaussian codewords of 40 samples each. A bit rate of about 3000 bps was achieved with this implementation.

FIGURE 7.32. Synthesizer for the homomorphic vocoder. The controller is included to concatenate and interpolate incoming information from different frames and different excitation blocks within frames.

7.4.4 Formant Vocoders

A formant vocoder may be viewed as a type of channel vocoder that estimates the first three or four formants in a segment of speech and their corresponding bandwidths. It is this information, plus the pitch period, that is encoded and transmitted to the receiver.

For a given frame of speech, each formant may be characterized by a two-pole digital filter of the form

Θ_k(z) = θ_k / [(1 − p_k e^{jω_k} z⁻¹)(1 − p_k e^{−jω_k} z⁻¹)],   k = 1, 2, 3, 4,  (7.120)

where θ_k is a gain factor, p_k is the distance of the complex-valued pole pair from the origin, and ω_k = 2πF_k T. Also, F_k is the frequency of the kth formant in Hz, and T is the sampling period in seconds. The bandwidth is determined by the distance of the pole from the unit circle.

The formants can be estimated by linear prediction or by cepstral analysis. The most difficult aspect of the analysis is to obtain accurate estimates of the formants, especially when two formants are very close together. In such a case, the chirp z-transform can provide better frequency resolution and may be used for formant estimation. Nevertheless, this problem has hindered the use of this vocoder in practical applications.

The synthesizer for the formant vocoder may be realized as a cascade of two-pole filters, one filter for each formant, as shown in Fig. 7.33. An alternative realization is a parallel bank of two-pole filters with adjustable gain parameters. In this configuration, the overall filter contains zeros whose values depend on the gain parameters. Therefore, the values of the gain parameters must be carefully selected.

For unvoiced speech, the formant vocoder can be simplified to fewer…
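The cascade synthesizer built from resonators of the form (7.120) can be sketched as follows. The bandwidth-to-pole-radius rule p = exp(−πB_k T) is a common approximation, not something derived in the text, and the direct-form recursion is an implementation choice for the sketch.

```python
import numpy as np

def formant_filter_coeffs(F_k, bw_k, fs, gain=1.0):
    """Denominator of the two-pole resonator in (7.120)."""
    T = 1.0 / fs
    p = np.exp(-np.pi * bw_k * T)       # pole radius set from the bandwidth
    w = 2 * np.pi * F_k * T             # omega_k = 2*pi*F_k*T
    # (1 - p e^{jw} z^-1)(1 - p e^{-jw} z^-1) = 1 - 2p cos(w) z^-1 + p^2 z^-2
    return gain, np.array([1.0, -2 * p * np.cos(w), p * p])

def synthesize(excitation, formants, bandwidths, fs):
    """Cascade of two-pole filters, one per formant (Fig. 7.33)."""
    y = np.asarray(excitation, dtype=float)
    for F_k, B_k in zip(formants, bandwidths):
        g, a = formant_filter_coeffs(F_k, B_k, fs)
        out = np.zeros_like(y)
        for n in range(len(y)):
            # Direct-form recursion; out[-1]/out[-2] are still zero at n = 0.
            out[n] = g * y[n] - a[1] * out[n - 1] - a[2] * out[n - 2]
        y = out
    return y
```

The parallel-bank alternative mentioned in the text would sum the resonator outputs instead of chaining them, at the cost of the gain-dependent zeros discussed above.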
7.4.5 Linear Predictive Coding

Generalities

As described in Chapter 5, the objective of LP analysis is to estimate the parameters of an all-pole model of the vocal tract. Related problems are to determine the type of excitation and also to estimate the pitch period and the gain parameter.

Suppose that these pieces of information have been determined for a given frame of speech to be transmitted or stored. This is called a linear predictive coding (LPC) problem, a name that is frequently applied to LP analysis whether or not coding is the issue. Typically, the pitch period requires 6 bits, and the gain parameter may be represented by 5 bits after its dynamic range is compressed logarithmically. If the prediction coefficients were to be coded, they would require 8-10 bits per coefficient for accurate representation. The reason for such high accuracy is that relatively small changes in the prediction coefficients result in a large change in the pole positions of the filter model. The accuracy requirements are lessened by transmitting the reflection coefficients, which have a smaller dynamic range, that is, |κ(i; m)| < 1 (see Section 5.3.3). These are adequately represented by 6 bits per coefficient. Thus, for a 10th-order predictor (i.e., five pole pairs), the total number of bits assigned to the model parameters per frame is 72. If the model parameters are changed every 15-30 msec, the resulting bit rate is in the range 2400-4800 bps. Since the reflection coefficients are usually transmitted to the receiver, the synthesis filter at the receiver is implemented as a lattice filter, as shown in Fig. 7.34.

The coding of the reflection coefficients can be improved further by first performing a nonlinear transformation of the coefficients. A problem arises when some of the reflection coefficients are very close to ±1. In such a case, the quantization error introduced by coding significantly affects the quality of the synthesized speech. By means of an appropriate
nonlinear transformation of the reflection coefficients, the scale is warped so that we can apply a uniform quantizer to the transformed coefficients. Thus we obtain an equivalent nonuniform quantizer. The desired transformation should expand the scale near the values of ±1.

Two nonlinear transformations that accomplish this objective are the inverse sine transform and the inverse hyperbolic tangent [or log-area ratio (LAR)] transform. The inverse sine maps the κ(i; m) into

g(i; m) = (2/π) sin⁻¹ κ(i; m),   1 ≤ i ≤ M.  (7.121)

The line spectral pair (LSP) parameters (Section 5.4.1) may also serve as an efficient alternative to direct use of the LP parameters.

Several methods have been devised for generating the excitation sequence…
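The inverse sine warping (7.121) followed by a uniform quantizer can be sketched as follows. The 6-bit level count follows the text; the rounding scheme is an assumption of the sketch.

```python
import numpy as np

def inverse_sine_transform(kappa):
    """Warp reflection coefficients per (7.121): g = (2/pi) * arcsin(kappa)."""
    return (2.0 / np.pi) * np.arcsin(kappa)

def quantize_reflection(kappa, bits=6):
    """Uniformly quantize the warped value on [-1, 1], then map back.
    Near |kappa| = 1 the warping expands the scale, so the effective
    quantizer in the kappa domain is nonuniform, as the text requires."""
    levels = 2 ** bits
    step = 2.0 / levels
    g = inverse_sine_transform(np.asarray(kappa, dtype=float))
    g_hat = np.clip(np.round(g / step) * step, -1.0, 1.0)
    return np.sin(np.pi * g_hat / 2.0)
```

Because sin and arcsin are inverses on this range, the reconstructed coefficient always stays inside [-1, 1], preserving the stability condition |κ(i; m)| < 1 up to the quantizer's resolution.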
LPC-10 Algorithm

The earliest LPC-based vocoders followed the classic all-pole model of Fig. 5.1 or Fig. 7.24 directly in the use of a pitch-pulse excitation for synthesis of voiced speech, and a noise source excitation for unvoiced. Such a vocoder first caught the public's attention in the educational toy "Speak-and-Spell" produced by Texas Instruments in the 1970s. The basic method is still in use in many vocoding applications and is employed by the U.S. Government as one speech coding standard. The algorithm is usually called LPC-10, a reference to the fact that 10 coefficients are typically employed.

…an unvoiced decision is made, only four coefficients are used. In the stored or transmitted bit stream, 41 bits are used for the reflection coefficients, 7 for pitch and the voiced/unvoiced bit, and 5 for the gain. One additional bit is used for synchronization. Accordingly, a total of 54 bits per frame are sent, yielding a bit rate of 2400 bps.
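The bit budget quoted above can be checked with a few lines of arithmetic. The 22.5-msec frame duration is implied by, not stated in, the text.

```python
# Bit budget per LPC-10 frame, as itemized in the text.
reflection_bits = 41
pitch_and_voicing_bits = 7
gain_bits = 5
sync_bits = 1
frame_bits = reflection_bits + pitch_and_voicing_bits + gain_bits + sync_bits

# At 2400 bps the implied frame duration is frame_bits / 2400 seconds,
# which falls inside the 15-30 msec update range quoted for LP parameters.
frame_duration_ms = 1000.0 * frame_bits / 2400.0
```

With 54 bits per frame, the frame duration works out to exactly 22.5 msec, i.e., about 44.4 frames per second.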
~
"5
~
~
Residual Excited Linear Prediction Vocoder
Spe ech Qualit y in LPC can be improved at th e expense of a higher bi t
rate by computing and tran sm itting a residual error, as don e in the case
o f DPC M. There are various ways in which this can be do ne . On e ~
"is '0
meth od is illustrat ed in th e block diagram of Fi g. 7.35. Once th e LPC o
U
m odel and excita tion parameters are est imated from a fra me of speech, o
>
the speech is synth esized at th e tr ansmit tel' and subtract ed from th e orig ~
..J
ina l speech signal to form a resid ual error. The residual erro r is quan W
tized , co de d , and tr an smitted to th e recei vel' along with th e model !f
pa rameters. At th e receiver th e signal is synt hesized by adding th e resid .1 'J
o
o,
ua l erro r signal to th e signa l generated from th e model. Th us th e addi i<.
'-'
..J
'0
tion of the residual erro r improves the qualit y of th e synthesized speech . 22
'(3
An other approac h th at produces a res idual error is shown in Fig. 7.36. x
OJ
In this case. th e origina l speech signal is pa ssed through th e in verse (all 'E:
."'" ... (ij
ze ro) filte r to gen erate th e residual e rror signal. This signa l has a rela - ,
;;-;=:
,-
OJ
::;,
~
0> - '0
ti vel y flat spectr um, but it s m ost important frequen cy com po nen ts for ..,j 'w
~
ctl
r-r-r-r " ]~
.S< "
Buffer [ (11 : 111 ) + e( n : n:j ·1 r. -
. ;:! -..: .2
" (II) and
window
l
Re, j,lual ct ~-;:;. (p
N
>.
- error
(ij
c:
"U
-r: "-
>
~
~
E
i~
«
!: ...... <~ <D
~
LP param eters ~
Q..
'"
~
~
E r-
~ I LP
lii'(i: 11/) I ....l c: W
".... ~
ana ly, b E a l:t
::l
n o
c iL
Exviuuion
{ 0,. gain estimate o
To
'J:
c, ':;'
~ ~
V/ U. l!edsion channel
parameter> d ~ §
P. pitch es timate e
¥~
r
e ::
~ ~
LP
synlbe, is
model
"- ~
2 2:=
':§ .~
- "":'
c:
'"
,
FIGURE 7.35. LPC encoder with residual error transmitted. [Note that the
error signal e(n; m) is not the prediction residual], 475
7 .4 I Vocoders 477
476 CM. 7 / Speech Coding and Sy nthesis
Input speec h
improving speech quality are contained in the frequency range below 1000 Hz. Ideally, we would like to transmit the entire residual signal spectrum. However, in order to reduce the bit rate, we pass the residual error through a lowpass filter with bandwidth 1000 Hz, decimate its output, and encode the decimated signal. Usually, the decimated signal is transformed into the frequency domain via the DFT, and the magnitude and phase of the frequency components are coded and transmitted to the receiver.

At the receiver, the residual error signal is transformed into the time domain, interpolated, and filtered. The resulting signal contains no high-frequency information, which must be restored in some manner. A simple method for regenerating the high-frequency components is to pass the signal through a full-wave rectifier and then flatten the resulting spectrum by filtering. The lowpass and highpass signals are summed, as shown in Fig. 7.37, and the resulting residual error signal is used to excite the LPC model. We note that this method does not require pitch information and voicing information. The residual error signal provides the excitation to the all-pole LPC model. This LPC vocoder is called a residual excited linear prediction (RELP) vocoder; it provides communication-quality speech at about 9600 bps.

FIGURE 7.37. Synthesizer for a residual excited LPC vocoder. The controller is included to concatenate and interpolate incoming information from different frames.
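The regeneration step of Fig. 7.37 can be sketched as follows. This is a crude stand-in: the DC removal, the RMS-matching "flattening," and the windowed-sinc highpass are all implementation assumptions, not the book's design.

```python
import numpy as np

def regenerate_highband(lowband, fs, cutoff=1000.0, taps=101):
    """RELP decoder step from Fig. 7.37: full-wave rectify the interpolated
    lowband residual to create high-frequency energy, crudely flatten it,
    highpass it, and add it back to the lowband."""
    rect = np.abs(lowband)                       # full-wave rectifier
    rect = rect - rect.mean()                    # remove the DC the rectifier adds
    # "Flatten" by matching the lowband's RMS (a crude whitening stand-in).
    rect *= np.sqrt(np.mean(lowband ** 2) / (np.mean(rect ** 2) + 1e-12))
    # Windowed-sinc highpass: unit impulse minus a lowpass design.
    n = np.arange(taps) - (taps - 1) / 2
    lp = np.sinc(2 * cutoff / fs * n) * (2 * cutoff / fs) * np.hamming(taps)
    hp = -lp
    hp[(taps - 1) // 2] += 1.0
    high = np.convolve(rect, hp, mode="same")
    return lowband + high                        # summed as in Fig. 7.37
```

The rectifier is what supplies energy above the 1000-Hz cutoff: it generates harmonics of the lowband content, which the highpass filter then isolates.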
Multipulse LPC Vocoder

One of the shortcomings of RELP is the need to regenerate the high-frequency components at the decoder. The regeneration scheme results in a crude approximation of the high frequencies. The multipulse LPC method described in this section is a time-domain method that results in a better excitation signal for the LPC vocal system filter.

Multipulse LPC is an analysis-by-synthesis method, due to Atal and Remde (1982), which has the basic configuration shown in Fig. 7.38. The LPC filter coefficients are determined from the speech signal samples by the conventional methods described in Chapter 5. Let Θ(z) denote the system function of the all-pole filter, which is usually realized as a lattice filter. This filter is to be understood to have been computed over the N-length speech frame, f(n; m), which ends at time m. The output of the filter is the synthetic speech (frame), say f̂(n; m), which is subtracted from the original speech signal to form the residual error sequence,¹⁰ say

ε(n; m) = f(n; m) − f̂(n; m).  (7.123)

FIGURE 7.38. Analysis-by-synthesis method for obtaining the multipulse excitation.

¹⁰The reader should appreciate that this is not the residual sequence associated with the LP filter design.

The error sequence is passed through a perceptual error weighting filter with system function

W(z) = Θ(z/c)/Θ(z) = A(z)/A(z/c),  (7.124)

where c is a parameter in the range 0 < c ≤ 1 that is used to control the noise spectrum weighting. Note that when c = 1, there is no weighting of the noise spectrum, and when c = 0, W(z) = A(z) = 1/Θ(z). In practice, the range 0.7 ≤ c ≤ 0.9 has proved effective. Let us denote the perceptually weighted error by ε_w(n; m), noting that

ε_w(n; m) = ε(n; m) * w(n).  (7.125)

The multipulse excitation consists of a short sequence of pulses (discrete-time impulses) whose amplitudes and locations are chosen to minimize the energy in ε_w(n; m) over small blocks of the current frame. (For discussion purposes, let us concern ourselves with a block over the time range n = n′, n′ + 1, . . . , n″.) For simplicity, the amplitudes and locations of the
impulses are obtained sequentially by minimizing the error energy for one pulse at a time. In practice only a few impulses, typically 4-8 every 5 msec, are sufficient to yield high-quality synthetic speech.

In particular, let us consider the placement of a pulse of amplitude a₁(k) at location n = k. Assuming that this is the initial pulse proposed, prior to its insertion we will have (see Fig. 7.38)

ε⁰(n; m) = f(n; m),  (7.126)

where the superscript indicates the number of pulses in the excitation. In general, let us write εⁱ(n; m) and ε_wⁱ(n; m) to denote the residual and perceptually weighted residual sequences with i pulses in the excitation. Further, let us denote the response of the cascaded filters Θ(z)W(z) = 1/A(z/c) to this initial impulse by a₁(k)θ_w(n − k). It should be clear, therefore, that ε_w¹(n; m) can be written

ε_w¹(n; m) = f(n; m) * w(n) − a₁(k)θ_w(n − k) = ε_w⁰(n; m) − a₁(k)θ_w(n − k).  (7.127)

We seek to minimize

ξ₁ = Σ_{n=n′}^{n″} [ε_w¹(n; m)]² = Σ_{n=n′}^{n″} [ε_w⁰(n; m) − a₁(k)θ_w(n − k)]².  (7.128)

Setting the derivative of ξ₁ with respect to a₁(k) to zero yields

Σ_{n=n′}^{n″} [ε_w⁰(n; m) − a₁(k)θ_w(n − k)]θ_w(n − k) = 0,  (7.129)

or, equivalently,

a₁(k) = ρ_εθ(k)/ρ_θθ(k),  (7.130)

where ρ_εθ(k) = Σ_{n=n′}^{n″} ε_w⁰(n; m)θ_w(n − k) and ρ_θθ(k) = Σ_{n=n′}^{n″} θ_w²(n − k). We can eliminate a₁(k) from (7.128) by substituting the optimum value â₁(k). Thus we obtain

ξ₁(k) = Σ_{n=n′}^{n″} [ε_w⁰(n; m)]² − ρ²_εθ(k)/ρ_θθ(k).  (7.131)

Hence the optimum location for the first impulse is the value of k, say k₁, that minimizes ξ₁(k) or, equivalently, maximizes ρ²_εθ(k)/ρ_θθ(k). Once k₁ is obtained, â₁(k₁) is computed from (7.130).

After the location and the amplitude of the first impulse are determined, its weighted response is subtracted from ε_w⁰(n; m) to form a new sequence ε_w¹(n; m). The location and amplitude of the second impulse are then determined by minimizing the error energy

ξ₂ = Σ_{n=n′}^{n″} [ε_w¹(n; m) − a₂(k₂)θ_w(n − k₂)]²  (7.132)

in a similar manner. This procedure for determining the locations and amplitudes of the impulses is repeated until the perceptually weighted error is reduced below some specified level or the number of pulses reaches the maximum that can be encoded at some specified bit rate.

There are several variations of the basic multipulse LPC method described above. In particular, Singhal and Atal (1984) have noted that, for voiced speech, the multipulse LPC excitation sequence shows a significant correlation from one pitch period to the next. This observation suggests that the perceptually weighted error can be further reduced by including a long-delay correlation filter in cascade with the speech system Θ(z), as shown in Fig. 7.39. This filter is usually implemented as a pitch predictor with system function

Θ_p(z) = θ_p / (1 − b z^{−P}),  (7.133)

FIGURE 7.39. Analysis-by-synthesis method for obtaining the multipulse excitation with a pitch synthesis filter.
where 0 < b < 1, P is an estimate of the number of samples in the pitch period, and θ_p is a scale factor. Thus the pitch filter Θ_p(z) provides the long-term correlation in the excitation and, thus, the correlation in the multipulse excitation is reduced.

Other variations of the basic multipulse LPC method can be devised by adopting different strategies in the optimization of the pulse-signal amplitudes. For example, as the location and amplitude of each new pulse is obtained, one can go back and reoptimize the amplitudes of the previous pulses. Alternatively, we can perform a joint optimization of all the amplitudes of the pulses by solving a set of linear equations after all the pulse locations have been determined. This and several other variations of the basic scheme described above have been suggested in the paper by Lefevre and Passien (1985).

In conclusion, multipulse LPC has proved to be an effective method for synthesizing good-quality speech at 9600 bps. Multipulse LPC vocoders have been implemented for commercial use [see (Putnins et al., 1985)] and have been used for airborne mobile satellite telephone service. The information that is transmitted concerning the excitation sequence includes the locations of the pulses, an overall scale factor corresponding to the largest pulse amplitude, and the pulse amplitudes relative to the overall scale factor. The scale factor is logarithmically quantized, usually to 6 bits. The amplitudes are linearly quantized, usually into 4 bits (one of 16 levels). The pulse locations are usually encoded by means of a differential coding scheme. The excitation parameters are usually updated every 5 msec, while the LPC vocal-tract parameters and the pitch period estimate are updated less frequently, say every 20 msec. Thus a nominal bit rate of 9600 bps is obtained.
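The sequential pulse search of (7.126)-(7.132) can be sketched as follows. This is a minimal illustration that searches every location in the block; `target_w` plays the role of ε_w⁰(n; m) and `theta_w` of the weighted impulse response θ_w(n).

```python
import numpy as np

def multipulse_search(target_w, theta_w, n_pulses):
    """Place pulses one at a time, each maximizing rho^2/rho_tt per (7.131),
    with the amplitude given by the normal equation (7.130)."""
    L = len(target_w)
    th = np.zeros(L)
    th[:min(len(theta_w), L)] = theta_w[:L]       # pad/truncate to block length

    def shifted(k):
        h = np.zeros(L)
        h[k:] = th[:L - k]                        # theta_w(n - k), causal shift
        return h

    residual = np.asarray(target_w, dtype=float).copy()
    locations, amplitudes = [], []
    for _ in range(n_pulses):
        best_score, best_k, best_a = -1.0, 0, 0.0
        for k in range(L):
            h = shifted(k)
            num = float(np.dot(residual, h))      # rho_{eps,theta}(k)
            den = float(np.dot(h, h)) + 1e-12     # rho_{theta,theta}(k)
            score = num * num / den               # maximized per (7.131)
            if score > best_score:
                best_score, best_k, best_a = score, k, num / den
        locations.append(best_k)
        amplitudes.append(best_a)
        residual -= best_a * shifted(best_k)      # subtract, then repeat (7.132)
    return locations, amplitudes
```

Each iteration is a one-dimensional matched-filter search; subtracting the chosen pulse's weighted response before the next search is exactly the update that leads to (7.132).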
Code-Excited Linear Prediction Vocoder
Bull er a nd I Side
As indicated in th e p reced ing section, multi pulsc LPC p rovides good r - - - - - - - - - - - I LP
q uality s peech signals at 9600 bps. Further reductions in the bit ra te can I r - - - intormatio n
Gain I"
be ac h ieved by better se lect io n o f th e exci ta t ion sequence e(n ). .r
Code-ex cited linear prediction (CELP) is an a na lysis-b y-syn t hes is Pitch
Gaussian
met ho d (sec Schroeder and Ata l, 1985) in whi ch th e excita ti o n sequence excitatio n
-ynthcsis ( (II : III )
is selected from a codebook of ze ro-me an G aussia n sequen ces. Hence the cod eboo k tille r
excitat ion secuence is a sto chasti c signal selecte d from a stored code bo ok Perceptual
of s uch sequences. we ighting
fi lte r 11'1: 1
The C ELP synthe sizer is sh own i n Fig. 7.40. It consists o f t he ca scade 1 I Pe rceptually
o f tw o all -pole fi lters, with coefficients that a re upd at ed per iod ica lly. The ( .. (II ; ms weighted
r J I error
first filt er is a long-delay p itch filt er used to gene rate th e pi tch period ic Compute C:1Tl1l
ity in vo iced speech. Th is filter t ypi ca lly ha s th e form given by (7. 133), energy I •
Index of
where it s param e ters ca n be de te rmin ed by m in i m izing th e p red ict ion {square nnd ,uml
exd lation
er ro r ene rgy. afte r p itch esti m at ion. ov er a fra me du ra ti on of 5 m sec. Th e seq uence
seco nd filter is a sh ort-del ay all -pole (voca l-tract) filter used to ge ne rate
FIGURE 7.41 . CELP analysis-by-synthes]s coder.
th e spectral en velo pe (fo rma nt s) o f the speech signal. Thi s filte r t ypically
482 Ch. 7 I Speech Coding an d Synthesis 7 .4 I Vo coders 483
excitation sequence selection is performed every 5 msec. Thus we have 40 samples per 5-msec block interval. The excitation sequences consist of 40 samples each, stored in the codebook. A codebook of 1024 sequences has been found to be sufficiently large to yield good-quality speech. For such a codebook size, we require 10 bits to send the codeword index. Hence the bit rate for the excitation is 2000 bps (10 bits every 5 msec). The transmission of the pitch predictor parameters and the spectral predictor parameters brings the bit rate to about 4800 bps. Methods for allocating bits dynamically (adaptively) have also been devised; they are described in papers by Kroon and Atal (1988), Yong and Gersho (1988), Jayant and Chen (1989), Taniguchi et al. (1989), and Akamine and Miseki (1990).

CELP has also been used to achieve toll-quality speech at 16,000 bps with a relatively low delay. Although other types of vocoders produce high-quality speech at 16,000 bps, these vocoders typically buffer 10-20 msec of speech samples and encode speech on a frame-by-frame basis. As a consequence, the one-way delay is of the order of 20-40 msec. However, with a modification of the basic CELP, it is possible to reduce the one-way delay to about 2 ms.

The low-delay version of CELP is achieved by using a backward-adaptive predictor with a gain parameter and an excitation vector size as small as 5 samples. A block diagram of the low-delay CELP encoder, as implemented by Chen (1990), is shown in Fig. 7.42. Note that the pitch predictor used in the conventional forward-adaptive coder is eliminated. In order to compensate for the loss in pitch information, the LPC predictor order is increased significantly, typically to an order of about 50. The LPC coefficients are also updated more frequently, typically every 2.5 ms, by performing LPC analysis on previously quantized speech. A 5-sample excitation vector corresponds to an excitation block duration of 0.625 msec at an 8-kHz sampling rate. Hence the LPC analysis frame may be four times as long as the excitation block size.

In Chen's implementation, the logarithm of the excitation gain is adapted every subframe excitation block by employing a 10th-order adaptive linear predictor in the logarithmic domain. The coefficients of the logarithmic-gain predictor are updated every four blocks by performing an LPC analysis of previously quantized and scaled excitation signal blocks. The perceptual weighting filter is also 10th order and is updated once every four blocks by employing an LPC analysis on frames of the input speech signal of duration 2.5 msec. Finally, the excitation codebook in the low-delay CELP is also modified compared to conventional CELP. A 10-bit excitation codebook is employed, but the codewords are closed-loop optimized to take into account the effects of predictor adaptation and gain adaptation (Chen, 1990).

FIGURE 7.42. Low-delay CELP coder.

Another variation of the conventional CELP vocoder is the vector sum excited linear prediction (VSELP) vocoder described in the paper by Gerson and Jasiuk (1990). The VSELP coder and decoder basically differ in the method by which the excitation sequence is formed. To be specific, we consider the 8000-bps VSELP vocoder described by Gerson and Jasiuk.

A block diagram of the VSELP decoder is shown in Fig. 7.43. As we observe, there are three excitation sources. One excitation is obtained from the pitch (long-term) predictor state. The other two excitation sources are obtained from two codebooks, each containing 128 codewords. The outputs from the three excitation sources are multiplied by their corresponding gains and summed. The LPC synthesis filter is implemented as a 10-pole filter and its coefficients are coded and

FIGURE 7.43. VSELP decoder.
transmitted every 20 ms. These coefficients are updated in each 5-ms frame by interpolation. The excitation parameters are also updated every 5 ms. The 128 codewords in each of the two codebooks are constructed from two sets of seven basis codewords (vectors) by forming linear combinations of the seven basis codewords. Initially, the seven basis codewords in each codebook are selected as zero-mean Gaussian sequences. However, these basis codewords are optimized over a training database by minimizing the total perceptually weighted error.

The long-term filter state is also a codebook with 128 codeword sequences, where each sequence corresponds to a given lag (pitch period) of the filter given by (7.133). In each 5-msec frame, the codewords from this codebook are filtered through the speech system filter Θ(z) and correlated with the input speech sequence. The filtered codeword that has the highest correlation to the speech is used to update the history array and the lag is transmitted to the decoder. Thus the update occurs by appending the best-filtered codeword to the history codebook, and the oldest sample in the history array is discarded. The result is that the long-term state becomes an adaptive codebook.

The three excitation sequences are selected sequentially from each of the three codebooks. Each codebook search attempts to find the codeword that minimizes the total energy of the perceptually weighted error. Then, once the codewords have been selected, the three gain parameters are optimized. Joint gain optimization is sequentially accomplished by orthogonalizing each weighted codeword vector to each of the previously selected weighted excitation vectors prior to the codebook search. These parameters are vector quantized to one of 256 eight-bit vectors and transmitted in every 5-ms frame.

In addition to these gain parameters, the lag in the pitch filter is estimated and transmitted every 5 msec as a 7-bit number. Also, the average speech energy is estimated and transmitted once every 20 msec as a 5-bit number. The 10 LPC coefficients are represented by reflection coefficients and are quantized by use of scalar quantization. In Table 7.9 we summarize the number of bits transmitted to the decoder for the various speech signal parameters and the excitations.

TABLE 7.9. Bit Allocations for 8000-bps VSELP.

Parameter                        Bits/5-ms Frame    Bits/20 ms
10 LPC coefficients                     --               38
Average speech energy                   --                5
Excitation codewords from
  two VSELP codebooks                   14               56
Gain parameters                          8               32
Lag of pitch filter                      7               28
Total                                   29              159

Finally, we observe that an adaptive spectral postfilter is employed in VSELP following the LPC synthesis filter. This postfilter is a pole-zero filter of the form

B_s(z) = B(z) / A(z/c),    (7.134)

where A(z) is the denominator of the LPC synthesis filter, and B(z) is the numerator polynomial whose coefficients are determined adaptively to smooth the speech signal spectrum.

The 8000-bps VSELP vocoder described in the paper by Gerson and Jasiuk (1990) has been adopted by the Telecommunications Industry Association (TIA) as the standard speech coder for use in North American digital cellular telephone systems.

7.4.6 Vector Quantization of Model Parameters

In the description of the vocoders given above, VQ is widely used to efficiently represent excitation sequences and other signal parameters and, thus, to reduce the bit rate over the channel. For example, in a channel vocoder, the outputs of the channel bank can be quantized as a vector instead of quantizing each filter output separately. The same applies to the analysis outputs of the phase vocoder, the cepstral vocoder, and the formant vocoder. Thus the bit rate can be significantly reduced as a result of the greater efficiency afforded by VQ relative to scalar quantization. Side information such as pitch and voicing information is usually quantized separately, since it is not highly correlated with the other signal parameters.

Vector quantization has proved to be particularly efficient in LPC, where speech coding rates in the range of 200-800 bps have been achieved [see, e.g., (Wong et al., 1982)]. It is customary to apply VQ to the log-area ratios g(k; m), k = 1, ..., M, which are obtained directly from the LP coefficients. The commonly used distortion measure for VQ of the LP parameters is the Itakura distance (see Section 5.3.5).

As an example, suppose that the analysis rate in LPC is 50 frames per second. Typically, 10-13 bits per frame are needed for VQ of the log-area ratios. With 6 bits per frame for the pitch and 1 bit per frame for the voicing information, the total rate is about 20 bits per frame and, hence, a bit rate of 1000 bps.

To illustrate the benefits of VQ vis-à-vis scalar quantization for the log-area ratios in LPC, let us refer to Fig. 7.44, taken from the paper by Makhoul et al. (1985). In this case the speech signal was filtered to 5 kHz and sampled at 10 kHz. The LPC model had 14 coefficients. A total of 60,000 frames were used to adaptively train the scalar and vector quantizers. The LPC coefficients were transformed to log-area ratios for scalar
and vector coding. The MSE was the distortion measure selected.¹¹ Shown in Fig. 7.44 is the MSE plotted as a function of the number of bits per vector (hence the number of centroids in the codebook). The performance of two types of scalar quantizers is shown in the figure. The first one, labeled (a), represents the performance of the Lloyd-Max quantizer, where each log-area ratio coefficient is optimally assigned a number of bits, in proportion to its variance, so as to minimize the MSE, as shown by Makhoul et al. (1985). More bits are assigned as the variance grows. The second scalar quantizer, labeled (b), uses optimum bit allocation after performing a rotation operation to remove linear dependence from the vector of log-area ratios. The codebook for the vector quantizer was generated by the K-means algorithm. The bit allocation for the scalar quantizers was based on empirical pdf's obtained from the data.

We note from Fig. 7.44 that VQ requires about 10 bits per vector for M = 14 and reduces the bit rate per frame by 5 bits relative to the scalar quantizer (b) and 8 bits per frame relative to the scalar quantizer (a).

FIGURE 7.44. Normalized MSE in quantizing log-area ratios (LARs) using three methods: (a) scalar quantization with bit allocation; (b) scalar quantization with bit allocation, preceded by eigenvector rotation; and (c) vector quantization. The 3-bit reduction from (a) to (b) takes advantage of linear dependencies (correlation), and the additional 5-bit reduction from (b) to (c) takes advantage largely of nonlinear dependencies (Makhoul et al., 1985).

¹¹This means the average squared Euclidean distance between a vector and its assigned centroid over the entire population of vectors.

Codebook design for VQ of the model parameters can be implemented by using the K-means algorithm, as described in Sections 1.4.5 and 7.2.2. An important issue is the search through the codebook. As indicated in Section 7.2.2, a full search through the codebook requires ML multiplications and additions to quantize each input vector. For the example treated above, M = 14 and L = 2¹⁰ = 1024, so that ML = 14,336, which is very large. This computational burden is significantly reduced by using the binary tree search method. In such a case, the cost becomes 2M log₂ L = 280 multiplications and additions. On the other hand, the memory requirement is now 2ML = 28,672, which is twice as much as for the full search.

We conclude with a comparison of the MSE performance of full search, uniform binary search, and nonuniform binary search for the LPC example with M = 14, given above. The results shown in Fig. 7.45, due to Makhoul et al. (1985), illustrate such a comparison. We note that the nonuniform binary tree search method is only slightly inferior to the full search and only slightly better than uniform binary search. Consequently, the increase in distortion for a given bit rate resulting from binary tree search compared to full search is relatively small. Considering the computational savings resulting from the use of the binary tree search algorithm, we find that the compromise in performance is well justified.

FIGURE 7.45. Comparison of MSE when quantizing LARs with three types of vector quantization: (a) uniform binary search, (b) nonuniform binary search, and (c) full search (Makhoul et al., 1985).
7.5 Measuring the Quality of Speech Compression Techniques

Clearly, a trade-off exists between the efficiency of representation and the fidelity of the resulting speech: bit rate versus speech quality. In essence, the history of speech coding represents an effort to continually expand the envelope of the efficiency-to-quality curve. The relationship between transmission bit rate and quality is shown in Fig. 7.46. Of the two quantities in question, bit rate is highly quantifiable, whereas the latter is subject to interpretation. When competing digital speech coding systems are to be evaluated relative to one another, a method that is repeatable, meaningful, and able to reliably measure the sound quality of the speech is needed.

FIGURE 7.46. The range of speech coding transmission rates versus associated quality. After Flanagan et al. (1979).

One possible technique for evaluating speech coding algorithms is to use a standardized procedure that employs human listeners to evaluate "goodness." The general test procedure entails having the human evaluation group listen to example coded utterances and recording their opinions of the quality relative to the coded examples and/or known test cases (e.g., standard LPC at 22 kbps, 16 kbps, 10 kbps). Such tests result in subjective measures of speech quality because they are based on the listener group's perception of quality. These tests have been used by industry and the military for more than 30 years to evaluate voice communication systems (Flanagan, 1979; Hecker, 1967; Tribolet, 1978). The tests are known and well understood. Examples include the diagnostic rhyme test, the mean opinion score, and the diagnostic acceptability measure (Voiers, 1977). (These measures are discussed in Chapter 9.) Such tests, however, are expensive in terms of time and resources, requiring, for example, trained listener groups and large amounts of coded data. They are also difficult to administer and are often suspect due to the inherent nonrepeatability of human responses. In general, results from one coder evaluation cannot be compared reliably with another unless the test environment is preserved (same listener group, speech corpus, and presentation order). Many of the subjective tests are based on pairwise comparisons. In order for the listener group to establish statistically significant results, the coding distortion between examples must be fairly large. This makes it difficult to use subjective tests to determine optimal vocoder parameter settings, including, for example, bit allocation and codebook size. Since only gross settings can be compared, such tests seldom provide much insight into factors that may lead to improvements in the evaluated systems (Quackenbush, 1988).

An alternative method for evaluating speech coding algorithms is based on a computable measure of quality, which is roughly equivalent in accuracy to that of subjective tests. These tests measure the distortion between the input and output signal from coding systems. These objective measures are classified into waveform distortion measures, including, for example, signal-to-noise ratio, and frequency-spectrum distortion measures, such as LPC-based distance measures. The choice of which objective quality measure to use in coder evaluation rests on the measure's ability to predict subjective quality. Some measures are better than others for particular environments (e.g., performance in broadband channel noise versus impulsive noise). In order to determine general coder performance, an objective measure that reliably predicts subjective quality for a broad range of distortions is desirable. In Chapter 9, we discuss a number of subjective and objective intelligibility and quality tests. Before describing them, however, we take up another area in which these quality assessment measures will be useful. Chapter 8 treats the issue of enhancing speech that has been corrupted or distorted by some technical or natural process. Clearly, this area is closely related to the speech coding techniques we have discussed in the present chapter.

7.6 Conclusions

Waveform encoding of speech has been widely used for several decades on telephone channels, and consequently, the various coding methods described in this chapter are well developed. In general, waveform encoding methods yield high (toll)-quality speech at bit rates of 16,000 bps and higher with reasonable implementation complexity. For lower bit rates, complex VQ encoding methods are necessary to achieve high-quality speech. With present VLSI technology, real-time implementation of such highly complex VQ methods is expensive and, hence, impractical.

For speech communication over mobile cellular radio channels, channel bandwidth is more scarce and, hence, bandwidth constraints are more severe than on wireline telephone channels. In this case there is a great need for low-bit-rate speech coding. This is an application area where LPC-based encoding schemes employing VQ, such as CELP and VSELP, are particularly suitable. It is anticipated that by the mid-1990s, LPC-based VQ techniques will be developed to provide communication-quality speech at rates of 2000-2400 bps. This technology will lead to a quadrupling of the channel capacity for mobile cellular radio communications systems compared with today's CELP and VSELP capabilities.
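As a concrete instance of the waveform distortion measures discussed in Section 7.5, the segmental signal-to-noise ratio (SNRseg) averages per-frame SNRs in decibels rather than forming one global energy ratio. The sketch below is a minimal illustration; the 160-sample default frame (20 ms at 8 kHz) and the skipping of silent or error-free frames are common conventions, not a prescription from the text.

```python
import math

def segmental_snr(clean, coded, frame=160):
    """Average over frames of 10*log10(frame signal energy / frame error energy)."""
    snrs = []
    for start in range(0, len(clean) - frame + 1, frame):
        s = clean[start:start + frame]
        e = [a - b for a, b in zip(s, coded[start:start + frame])]
        e_signal = sum(v * v for v in s)
        e_error = sum(v * v for v in e)
        if e_signal > 0.0 and e_error > 0.0:   # skip silent / error-free frames
            snrs.append(10.0 * math.log10(e_signal / e_error))
    return sum(snrs) / len(snrs)
```

For example, a coded signal equal to 0.99 times the original has a frame error energy 10⁻⁴ times the signal energy, so every frame, and hence the average, scores 40 dB. Because each frame contributes equally, quiet frames are weighted as heavily as loud ones, which is why segmental SNR tracks perceived quality better than a single long-term SNR.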
7.7 Problems

7.1. Let x be a stationary first-order Markov source. The random variables x(n) denote the output (state) of the source at discrete time n. The source is characterized by the state probabilities P(x_i) ≝ P[x(n) = x_i] for arbitrary n, where {x_i, i = 1, 2, ..., L} is the set of labels for the L possible states, and by the transition probabilities

P(x_k | x_i) ≝ P[x(n) = x_k | x(n − 1) = x_i]  for i, k = 1, 2, ..., L

and for arbitrary n. The entropy (see Section 1.5) of the Markov source is

H(x) = Σ_{k=1}^{L} P(x_k) H(x | x_k).    (7.135)

FIGURE 7.47. First-order Markov source.

r_x(η) = 1 for η = 0, 1/2 for η = ±1, and 0 otherwise.    (7.137)

(a) Determine the prediction coefficient, a(1), of the first-order minimum MSE predictor,

x̂(n) = a(1) x(n − 1),    (7.138)

where the sequence x(n) represents any realization of x.

(b) Repeat part (a) for the second-order predictor,

x̂(n) = a(1) x(n − 1) + a(2) x(n − 2).    (7.139)

FIGURE 7.48. Joint pdf of random variables x_1, x_2.
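The entropy formula (7.135) in Problem 7.1 can be exercised numerically: H(x | x_k) is the entropy of row k of the transition matrix, weighted by the stationary state probabilities. The two-state chain below uses arbitrary illustration values, not the probabilities of Fig. 7.47.

```python
import math

def entropy(p):
    # Entropy in bits of a discrete probability distribution
    return -sum(q * math.log2(q) for q in p if q > 0.0)

def markov_entropy(P, T):
    # H(x) = sum_k P(x_k) * H(x | x_k), per (7.135); row k of T is P(. | x_k)
    return sum(P[k] * entropy(T[k]) for k in range(len(P)))

# Two-state example; the stationary probabilities P solve P = P T
T = [[0.9, 0.1],
     [0.5, 0.5]]
P = [5.0 / 6.0, 1.0 / 6.0]
```

The Markov entropy here (about 0.557 bits per symbol) is below the memoryless entropy of the state distribution (about 0.650 bits), and that gap is exactly the redundancy a predictive coder can exploit.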
(a) Evaluate the bit rates required for uniform quantization of x_1 and x_2 separately (scalar quantization), and combined (vector) quantization of (x_1, x_2).

(b) Determine the difference in bit rate when a = 4b.

7.4. In evaluating the performance of a uniform quantizer, it is common practice to model the quantization error as a random process, say e, which is independent of the random process modeling the continuous-amplitude signal, say s. This assumption is not valid if the number of quantization levels is small. However, by adding a small dither noise to the signal prior to the quantization, as shown in Fig. 7.49, we can change the statistical character of the error signal.

(a) Suppose that the dither noise sequence, d(n), is modeled as the realization of a zero-mean, white noise random process d with pdf

f_d(n)(d) = 1/Δ for |d| ≤ Δ/2, and 0 otherwise,    (7.141)

for arbitrary n. Let e_d(n) denote the quantization error sequence with dither,

e_d(n) ≝ s(n) − x_q(n),    (7.142)

where x_q(n) is the quantized sequence with dither. Show that the stochastic process modeling e_d(n), say e_d, is statistically independent of s.

(b) Also let e_u(n) denote the quantization error sequence without dither,

e_u(n) ≝ s(n) − s_q(n),    (7.143)

where s_q(n) is the quantized sequence without dither. Show that the variances associated with the stochastic processes e_d and e_u, say σ_d² and σ_u², satisfy the relation

σ_d² > σ_u².    (7.144)

(c) Now suppose we subtract the dither noise from x_q(n) (dotted path in Fig. 7.49), so that the resulting error is

e(n) = s(n) − x_q(n) + d(n).    (7.145)

Determine the variance, say σ², for the random process e, and show that

σ² = Δ²/12.    (7.146)

7.5. Let s(n) be a realization of a stationary, zero-mean random process s, whose autocorrelation sequence r_s(η) is nonzero if and only if |η| ≤ N. The sequence v(n) is defined as

v(n) = s(n) − a s(n − D),    (7.147)

where a is a constant and D is a delay. Here v(n) is modeled by a stationary random process v.

(a) Determine the variance σ_v² for D > N and compare it to the variance σ_s². Which is larger?

(b) Determine the variance σ_v² for D < N and compare it to the variance σ_s². Which is larger?

(c) Determine the values of a which minimize the variances σ_v² for the cases of parts (a) and (b). Find σ_v² in each case and compare it to σ_s².

7.6. This problem will review the complex cepstrum (CC) and real cepstrum (RC) treated in Chapter 6. Compute the (long-term) CC and RC for the sequence

x(n) = δ(n) + b δ(n − D),    (7.148)

where b is a scale factor and D is a delay.

7.7. This problem is for persons who have studied the material on quadrature mirror filters in Appendix 7.A. By using (7.160), derive the equations corresponding to the structure for the polyphase synthesis section shown in Fig. 7.50.

FIGURE 7.49. Addition of dither noise to signal prior to quantization.

FIGURE 7.50. Polyphase filter structure for the QMF bank.
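The behavior described in Problem 7.4 can be explored by simulation before it is attacked analytically. The experiment below (an illustration with an arbitrary step size and a uniform test signal, neither of which is part of the problem statement) estimates the error variance without dither as in (7.143), with additive dither as in (7.142), and with the dither subtracted as in (7.145); the last estimate should land near Δ²/12.

```python
import random

def quantize(x, step):
    # Uniform (mid-tread) quantizer with step size `step`
    return step * round(x / step)

random.seed(0)
step = 0.25
N = 200000
sum_u = sum_d = sum_sub = 0.0
for _ in range(N):
    s = random.uniform(-1.0, 1.0)               # continuous-amplitude signal sample
    d = random.uniform(-step / 2, step / 2)     # dither with the pdf of (7.141)
    e_u = s - quantize(s, step)                 # error without dither, (7.143)
    e_d = s - quantize(s + d, step)             # error with dither, (7.142)
    e_sub = e_d + d                             # dither subtracted, (7.145)
    sum_u += e_u * e_u
    sum_d += e_d * e_d
    sum_sub += e_sub * e_sub
var_u, var_d, var_sub = sum_u / N, sum_d / N, sum_sub / N
```

With these settings the estimates come out near Δ²/12 ≈ 0.0052 without dither, roughly twice that with additive dither (consistent with (7.144)), and back at Δ²/12 once the dither is subtracted, as (7.146) asserts.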
APPENDIX

7.A Quadrature Mirror Filters

The basic building block in applications of quadrature mirror filters (QMF) is the two-channel QMF bank shown in Fig. 7.51. This is a multirate digital filter structure that employs two decimators in the signal analysis section and two interpolators in the signal synthesis section. The lowpass and highpass filters in the analysis section have impulse responses h_0(n) and h_1(n), respectively. Similarly, the lowpass and highpass filters contained in the synthesis section have impulse responses g_0(n) and g_1(n), respectively.

We know that for an M-fold decimator the relationship between the z-transforms of the input and output signals, X(z) and Y(z), respectively, is

Y(z) = (1/M) Σ_{l=0}^{M−1} X(z^{1/M} e^{−j2πl/M}).    (7.149)

If X_{s,0}(ω) and X_{s,1}(ω) represent the two inputs to the synthesis section, the output is simply

X̂(ω) = X_{s,0}(2ω) G_0(ω) + X_{s,1}(2ω) G_1(ω).    (7.152)

FIGURE 7.51. Two-channel QMF bank.

Now, suppose we connect the analysis filters to the corresponding synthesis filters, so that X_{s,0}(ω) = X_{a,0}(ω) and X_{s,1}(ω) = X_{a,1}(ω). Then, by substituting from (7.151) into (7.152), we obtain

X̂(ω) = (1/2)[H_0(ω) G_0(ω) + H_1(ω) G_1(ω)] X(ω)
      + (1/2)[H_0(ω − π) G_0(ω) + H_1(ω − π) G_1(ω)] X(ω − π).    (7.153)

The first term in (7.153) is the desired signal output from the QMF bank. The second term represents the effect of aliasing, which we would like to eliminate. Hence we require that

H_0(ω − π) G_0(ω) + H_1(ω − π) G_1(ω) = 0.    (7.154)

This condition can be simply satisfied by selecting G_0(ω) and G_1(ω) as

G_0(ω) = H_1(ω − π),    G_1(ω) = −H_0(ω − π).    (7.155)

As a consequence, H_0(ω) and H_1(ω) have mirror-image symmetry about the frequency ω = π/2, as shown in Fig. 7.22. To be consistent with the constraint in (7.155), we select the lowpass filter G_0(ω) as

G_0(ω) = 2H(ω).    (7.158)

The scale factor of two in g_0(n) and g_1(n) corresponds to the interpolation factor that is used to normalize the overall frequency response of the QMF. With this choice of the filter characteristics, the component due to
aliasing vanishes. Thus the aliasing resulting from decimation in the analysis section of the QMF bank is perfectly canceled by the image signal spectrum that arises due to interpolation. As a result, the two-channel QMF behaves as a linear, time-invariant system.

If we substitute H_0(ω), H_1(ω), G_0(ω), and G_1(ω) into the first term of (7.153), we obtain

X̂(ω) = [H²(ω) − H²(ω − π)] X(ω).    (7.161)

Ideally, the two-channel QMF bank should have unity gain,

|H²(ω) − H²(ω − π)| = 1    (7.162)

for all ω, where H(ω) is the frequency response of a lowpass filter. Furthermore, it is also desirable for the QMF to have linear phase.

Now, let us consider the use of a linear phase filter H(ω). Hence H(ω) may be expressed in the form

H(ω) = H_r(ω) e^{−jω(N−1)/2},    (7.163)

where H_r(ω) is the DTFT of the "undelayed" version of h(n) and N is the duration of the filter impulse response. Then

H²(ω) = H_r²(ω) e^{−jω(N−1)} = |H(ω)|² e^{−jω(N−1)}    (7.164)

and

H²(ω − π) = H_r²(ω − π) e^{−j(ω−π)(N−1)} = (−1)^{N−1} |H(ω − π)|² e^{−jω(N−1)}.    (7.165)

Therefore, the overall transfer function of the two-channel QMF that employs linear-phase FIR filters is characterized by the function

A(ω) = |H(ω)|² + |H(ω − π)|²,    (7.168)

which avoids the problem of a zero at ω = π/2. For N even, the ideal two-channel QMF should satisfy the condition

A(ω) = |H(ω)|² + |H(ω − π)|² = 1    (7.169)

for all ω, which follows from (7.168). Unfortunately, the only filter frequency response function that satisfies (7.169) is one with the trivial magnitude spectrum |H(ω)|² = cos²(aω). Consequently, any nontrivial linear-phase FIR filter H(ω) will introduce some amplitude distortion.

The amount of amplitude distortion introduced by a nontrivial linear-phase FIR filter in the QMF can be minimized by optimizing the FIR filter coefficients. A particularly effective method is to select the filter coefficients of H(ω) such that A(ω) is made as flat as possible while simultaneously minimizing (or constraining) the stopband energy of H(ω). This approach leads to the minimization of the integral squared error

ε = w ∫_{ω_s}^{π} |H(ω)|² dω + (1 − w) ∫_{0}^{π} [A(ω) − 1]² dω,    (7.170)

where w is a weighting factor in the range 0 < w < 1, and ω_s is the stopband frequency. In performing the optimization, the FIR filter is constrained to be symmetric (linear phase). This optimization is easily done numerically on a digital computer. This approach has been used by Johnston (1980) and Jain and Crochiere (1984) to design two-channel QMFs. Optimum filter coefficients have been tabulated by Johnston (1980).

As an alternative to the use of linear-phase FIR filters, we may design an IIR filter that satisfies the all-pass constraint given by (7.162). For this purpose, elliptic filters provide especially efficient designs. Since the QMF would introduce some phase distortion, the signal at the output of the QMF can be passed through an all-pass phase equalizer that is designed to minimize phase distortion.

In addition to these two methods for QMF design, one can also design the two-channel QMFs to completely eliminate both amplitude and phase distortion as well as to cancel aliasing distortion. Smith and Barnwell

where b_0 is some constant. Hence all the even-numbered samples are zero except at n = 0. The zero-phase requirement implies that b(n) = b(−n). The frequency response of such a filter is

B(ω) = Σ_{n=−K}^{K} b(n) e^{−jωn},    (7.172)
where K is odd. Also. JJ (w ) sa tisfies the cond ition that B(w) + Btn - w) be
eq ua l to a constant for all fre quenci es. T he typical fre q ue ncy response
characteri stic of a ha lf-ba nd filter is shown in Fig. 7.52. We note that the
band-edge frequencies w I' an d w are symmetric about co = Tt/2. and the
j
peak pa ssba nd and stopband errors ar e equal. We also note that the filter
",.,G""
Ampli'"'" ~'
rcspon-e of
B ..(:1
o
....' ;r ~'
/ \ /\.--; :~'
can be made causal by introducing a delay of K samples. o p, '\
"0
~
.,..,
t: pol yphase filters are implemented fo r each decirnator and two for each
s"
' t = f,2 .;) +:.0..' ::=
I' ,
J'j"
inte rp olator. Howeve r, when we employ linear-phase FI R filters, th e sym
metry properties of the analys is filt ers and synt hesis filters allow us to
simplify the stru cture and reduce the number of polyphase filters in t he
analysis section to two and in the synthesis section to a not her two .
o I
• :/I J~
7\ I '
r l )
( )::;
"
------..- ~~
To dem onstra te this co nst r uction , let us assume that th e filters a re
I - f. linear-phase F IR filters of length N (N even), which hav e im pu lse re
..oJ Ll,."
,fl .' sponses given by (7 .157). T hen the outpu ts of th e anal ysis filter p air, after
FIGURE 7.52. The typical frequency response characteristic of a half-band filter. decimation by a factor of two, may be expressed as
    x_{a,k}(m) = Σ_{n=−∞}^{∞} (−1)^{kn} h(n) x(2m − n),   k = 0, 1,       (7.180)

               = Σ_{i=0}^{1} Σ_{l=−∞}^{∞} (−1)^{k(2l+i)} h(2l + i) x(2m − 2l − i)

               = Σ_{l=0}^{(N/2)−1} p₀(l) x(2m − 2l)
                 + (−1)^k Σ_{l=0}^{(N/2)−1} p₁(l) x(2m − 2l − 1),          (7.182)

where p_i(l) = h(2l + i), i = 0, 1, are the polyphase components of h(n).

This expression corresponds to the polyphase filter structure for the analysis section that is shown in Fig. 7.50. Note that the commutator rotates counterclockwise, and the filter with impulse response p₀(m) processes the even-numbered samples of the input sequence, while the filter with impulse response p₁(m) processes the odd-numbered samples of the input signal.

In a similar manner, by using (7.160), we can obtain the structure for the polyphase synthesis section, which is also shown in Fig. 7.50. This derivation is left as an exercise for the reader (Problem 7.7). Note that the commutator also rotates counterclockwise.

Finally, we observe that the polyphase filter structure shown in Fig. 7.50 is approximately four times more efficient than the direct-form FIR filter realization.

Tutorial treatments of the two-channel QMF and its generalization to multichannel QMF banks are given in two papers by Vaidyanathan (1987, 1990).
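As a numerical check on the identity above, the following NumPy sketch (with an arbitrary random filter h and test signal x, not taken from the text) computes one decimated analysis output both directly from (7.180) and through the polyphase components p₀(l) = h(2l), p₁(l) = h(2l + 1) as in (7.182):

```python
import numpy as np

def analysis_direct(h, x, k, m):
    # x_{a,k}(m) = sum_n (-1)^{k n} h(n) x(2m - n)          (7.180)
    total = 0.0
    for n in range(len(h)):
        idx = 2 * m - n
        if 0 <= idx < len(x):
            total += ((-1) ** (k * n)) * h[n] * x[idx]
    return total

def analysis_polyphase(h, x, k, m):
    # Same output via the two polyphase branches of (7.182).
    p0, p1 = h[0::2], h[1::2]          # p_i(l) = h(2l + i)
    total = 0.0
    for l in range(len(p0)):
        i0, i1 = 2 * m - 2 * l, 2 * m - 2 * l - 1
        if 0 <= i0 < len(x):
            total += p0[l] * x[i0]
        if 0 <= i1 < len(x):
            total += ((-1) ** k) * p1[l] * x[i1]
    return total

rng = np.random.default_rng(0)
h = rng.standard_normal(8)             # length N = 8 (N even)
x = rng.standard_normal(64)
for k in (0, 1):
    for m in range(4, 28):
        assert np.isclose(analysis_direct(h, x, k, m),
                          analysis_polyphase(h, x, k, m))
```

Each branch filters at the decimated rate with only N/2 taps, which is the source of the roughly fourfold efficiency gain noted above.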
CHAPTER 8
Speech Enhancement

8.1 Introduction

In many speech communication settings, the presence of background interference causes the quality or intelligibility of speech to degrade. When a speaker and listener communicate in a quiet environment, information exchange is easy and accurate. However, a noisy environment reduces the listener's ability to understand what is said. In addition to interpersonal communication, speech can also be transmitted across telephone channels, loudspeakers, or headphones. The quality of speech, therefore, can also be influenced in data conversion (microphone), transmission (noisy data channels), or reproduction (loudspeakers and headphones). The purpose of many enhancement algorithms is to reduce background noise, improve speech quality, or suppress channel or speaker interference. In this chapter, we discuss the general problem of speech enhancement with particular focus on algorithms designed to remove additive background noise for improving speech quality. In our discussion, background noise will refer to any additive broadband noise component (examples include white Gaussian noise, aircraft cockpit noise, or machine noise in a factory environment). Other speech processing areas that are sometimes included in a discussion of speech enhancement include suppression of distortion from voice coding algorithms, suppression of a competing speaker in a multispeaker setting, enhancing speech produced by a deficient speech production system (examples include speakers with a pathology or divers breathing a helium-oxygen mixture), and enhancing speech for hearing-impaired listeners. Since the range of possible applications is broad, we will generally limit our discussion to enhancement algorithms directed at improving speech quality in additive broadband noise for speakers and listeners with normal production and auditory systems.

The problem of enhancing speech degraded by additive background noise has received considerable attention in the past two decades. Many
approaches have been taken, each attempting to capitalize on specific characteristics or constraints, all with varying degrees of success. The success of an enhancement algorithm depends on the goals and assumptions used in deriving the approach. Depending on the specific application, a system may be directed at one or more objectives, such as improving overall quality, increasing intelligibility, or reducing listener fatigue. The objective of achieving higher quality and/or intelligibility of noisy speech may also contribute to improved performance in other speech applications, such as speech compression, speech recognition, or speaker verification.

As in any engineering problem, it is useful to have a clear understanding of the objectives and the ability to measure system performance in achieving those objectives. When we consider noise reduction, we normally think of improving a signal-to-noise ratio (SNR). It is important to note, however, that this may not be the most appropriate performance criterion for speech enhancement. All listeners have an intuitive understanding of speech quality, intelligibility, and listener fatigue. However, these areas are not easy to quantify in most speech enhancement applications, since they are based on subjective evaluation of the processed signal. Due in part to the efforts of researchers in the speech coding area, testing methods do exist for measuring quality and intelligibility. Although methods for assessing speech quality will be addressed in Chapter 9, we will find it convenient to refer to algorithm performance in terms of these measures later in this chapter. Although the testing methods and measures were originally formulated to quantify distortion introduced by speech coding algorithms, we will find that they can be used to quantify performance for enhancement applications as well.

In this chapter, we will examine an assortment of techniques that attempt to improve the quality or intelligibility of speech. Actual quality improvement will be subject to certain assumptions, such as the type of additive noise, interfering speakers, single or multiple data channels, and available signal bandwidth. Figure 8.1 presents a general speech enhancement framework, illustrating the possible sources of distortion and applications in which an enhanced speech signal is needed. Most speech enhancement techniques focus on reducing the effects of noise introduced at the source. Distortion at the source can be either additive background noise or one or more competing speakers. Speech enhancement can also be useful for reducing distortion introduced in a speech coding algorithm. This distortion is by far the most widely studied of the possible source distortions, due in part to the need for efficient speech compression techniques by the communications industry. Noise can also be introduced during transmission, as shown in Fig. 8.1. Whatever the origin of the disturbance, the job of the speech enhancement algorithm is to enhance the speech signal prior to processing by the auditory system. Recently, it has also been shown that front-end speech enhancement can be useful for other speech processing applications, such as processing before coding or recognition.

FIGURE 8.1. Typical sources of degradation for speech enhancement applications.

The majority of enhancement techniques seek to reduce the effects of broadband additive noise. Generally speaking, constraints placed on the input speech signal improve the potential for separating speech from background noise.¹ However, such systems also become more sensitive to "deviations" from these constraints. The same reasoning holds for noise assumptions. Confining the noise type improves the chances of removing it, but at the expense of dedicating the technique to a specific interference, such as wideband, narrowband, or a competing speaker.

¹We will see in Part V that constraints on speech will aid in the success of recognition.
8.2 Classification of Speech Enhancement Methods

There are a number of ways in which speech enhancement systems can be classified. A broad grouping is concerned with the manner in which the speech is modeled. Some techniques are based on stochastic process models of speech, while others are based on perceptual aspects of speech. Systems based on stochastic process models rely on a given mathematical criterion. Systems based on perceptual criteria attempt to improve aspects important in human perception. For example, one technique may concentrate on improving the quality of consonants, since consonants are known to be important for intelligibility in a manner disproportionate to overall signal energy. Such methods will be discussed in more detail in Section 8.7.

Enhancement algorithms can also be partitioned depending on whether a single-channel or dual-channel (or multichannel) approach is used. For single-channel applications, only a single microphone is available. Characterization of noise statistics must be performed during periods of silence between utterances, requiring a stationarity assumption on the background noise. In situations such as voice telephone or radio communications, only a single channel is available. In dual-channel algorithms, the acoustic sound waves arrive at each sensor at slightly different times (one is normally a delayed version of the other). Multi- or dual-channel enhancement techniques are based on one of two scenarios. In the first, a primary channel contains speech with additive noise, and a second channel contains a sample noise signal correlated with the noise in the primary channel. Normally, an acoustic barrier exists between sensors to ensure that no speech leaks into the noise reference channel. In the second scenario, no acoustic barrier exists, so the enhancement algorithm must address the issue of cross-talk or employ a multisensor beamforming solution. In our discussion, we shall concentrate on methods that assume that (1) the noise distortion is additive, (2) the noise and speech signals are uncorrelated, and (3) only one input channel is available (except for adaptive noise cancelation, where we will assume a dual-channel scenario).

Beyond the classifications based on specific details of the approach, there are four broad classes of enhancement that differ substantially in the general approaches taken. Each of these classes has its own set of assumptions, advantages, and limitations. The first class, addressed in Section 8.3, concentrates on the short-term spectral domain. These techniques suppress noise by subtracting an estimated noise bias (in the power spectral, Fourier transform, or autocorrelation domain) found during nonspeech activity in single-microphone cases, or from a reference microphone in a dual-channel setting. In Section 8.4, we discuss the second class of enhancement techniques, which is based on speech modeling using iterative methods. These systems focus on estimating model parameters that characterize the speech signal, followed by resynthesis of the noise-free signal based on noncausal Wiener filtering. These enhancement techniques estimate speech parameters in noise based on autoregressive, constrained autoregressive, or autoregressive-moving average models. This class of enhancement techniques requires a priori knowledge of noise and speech statistics and generally results in iterative enhancement schemes. The third class of systems, discussed in Section 8.5, is based on "adaptive noise canceling" (ANC). Traditional ANC is formulated using a dual-channel time or frequency domain environment based on the "least mean square" (LMS) algorithm. Although other enhancement algorithms can benefit from a reference channel, successful ANC requires one. The last area of enhancement is based on the periodicity of voiced speech. These methods employ fundamental frequency tracking using either single-channel ANC (a special application) or adaptive comb filtering of the harmonic magnitude spectrum. Fundamental frequency-tracking methods are discussed in Section 8.6.

In this chapter, we consider only a small subset of the possible topics of enhancing speech degraded by noise. Specifically, we address the problem of speech degraded by additive noise as follows:

    y(n) = s(n) + G d(n),                                                 (8.1)

where in general y(n), s(n), and d(n) are realizations of stochastic processes, with s(n) representing the original "clean" speech signal, d(n) the degrading noise, and G a gain term that controls SNR. Many practical problems fall into the category of additive distortion, and those that do not can at times be transformed so that they satisfy an additive noise assumption. Consider, for example, a convolutional noise degradation,²

    y(n) = s(n) * G d(n),                                                 (8.2)

where * represents convolution. If a homomorphic signal transformation is first applied (see Chapter 6), the resulting transformed speech signal and noise component are additive (Oppenheim et al., 1968). Another type of noise distortion is the signal-dependent quantization noise found in coding applications such as PCM (see Chapter 7). It has been shown that such noise can be converted to signal-independent noise using a pseudonoise sequence technique (Roberts, 1962).

²A similar argument can also be constructed for multiplicative noise.

Another consideration in speech enhancement is that assessment of performance is ultimately related to an evaluation by a human listener. Based on the enhancement context, evaluation may depend on quality, intelligibility, or some other auditory attribute. Therefore, algorithm performance must incorporate aspects of human speech perception. Some techniques are motivated heavily by a mathematical criterion; others focus more on perceptual properties. It is therefore desirable to consider a mathematical criterion that is consistent in some way with human perception. Although no optimum criterion yet exists, some are better than
others. Examples of mathematical criteria that are not particularly well correlated with perceptual quality include mean square error (MSE) and SNR. Performance can also be measured using subjective or objective quality measurement techniques.
Let s. d. and I' be stochastic p rocesses re present ing speech , no ise , an d
noisy sp eec h, re~pectively. T he p rocess ~ is ass umed to be u ncorr elated ,
We have seen that the voc al tract must vary in o rd er to generate dif
fer ent speec h sounds. Thi s m ovem ent is re flec ted in a ti me -va r ying lin
e ar t ra nsfer function . On a sho rt -te rm ba sis, it is reasonable to assume with autocorrelation func tio n
(8.3)
that this syst em is stati onary. Therefore, man y speec h e n ha ncemen t tech r!t(10 = Dij(r/).
niques operate on a frame-by-frame basis and base en ha nce me n t on as
pect s of the slowly varyin g lin ear syste m th at reflects p roperties o f sp eech where Do is some co nstant. N ote t hat o f the three ra nd o m p rocesses, only
production. Some of th e these a spect s include en ha nce ment of t he sho rt d ca n reasonably be a ssumed stat io nary. Reali zat io ns sen), d( n), and yen)
term spect ral en velope wi th respect to forman t location , am p litu de. and are related by
ba ndwidth. a s we ll a s t he ha rm onic st ru cture o f vo iced speec h so und s. (8 .4)
y (n) = sen) + d en).
Let us begi n b y assuming (unrea listi ca lly, of co urse ) th at S a nd, hence , Y
8.3 Short-Term Spectral Amplitude Techniques are sta tio nar y processes. Becau se ~ is an unco rrelat ed p roc ess, it follo ws
immed ia tely t hat
8.3.1 Introduction ~, ( w) = ~ (OJ) + fi w ).
(8 .5)
We begin our discussion of sp eech en hance me n t by con sidering meth
where r;.(v) is t he ternporal'' PDS ' o f t he ran d o m proce ss L- Clearly,
ods th at focus the processing in th e sho rt -term spect ra l d omain . Su ch
techn iques seek to enhance a no isy speech signa l by subtracti ng an esti gi ven ~.(w ) and an es t imate o f ~ (w) , say [d«V) , it is possib le to es ti mate
mated noise bias. The particular d oma in in which the su bt ra ct io n proc the P DS of t he unco rrup ted speech as
ess take s place leads to severa l a lte rnat i ve formulat io n s. T he m o st
po pular by far is spectral sub tra ct io n, since it was one of the earliest an d
r,( w) = ~. «(J) - [ J(w), (8 .6)
perhaps th e easie st to implement . T his ha s led to m an y alt ern ati ve meth Although t heo reticall y inter est ing, t his an a lysis ha s li tt le practical sig ni fi
ods invol ving th e "s ubt ra ct ion d o m ai n " a nd var iati o ns o f a s ubs eq uen t cance. since we d eal wi th real wavefo rms ove r short time fra mes . It d oes ,
nonlin ear process ing step. however. suggest the esse nce o f the sp ectra l su bt rac tio n a pproach to noi se
Sh ort-term spectral d om ain m ethod s perform all th eir p roc essing o n elim inat io n . The simplicity o f the app roach is e vide n t .
the spect ral m agn itude." Th e en hance ment procedure is performed o ver Let us now beco m e m o re realist ic and dro p the st ationari ty ass u m p
fr ames b y obtaining th e sh ort-term m agnitude a nd ph a se o f the noisy tions on s and y. We a re given a signal y en) (a re alizat ion of y) a nd the
sp ee ch spect ru m, subtracting a n est ima te d noise ma gnitude spectru m task of est im atin g the co rrespon d ing speec h sen). Recogn izing th at , a t
fr o m th e spee ch magnitude spect ru m, and in verse t ransforming thi s spec best. sen) will be "l oca lly stationar y" ove r ShOl1 t im e ran ge s, we select a
tral a m plit ud e using th e ph a se of th e origi nal degra ded spe ec h. Since frame of y en). say J;.(n; m ). using a w indow o f len gth N e nd ing at t ime
background noise d egrad es bo th the spe ctral magnitud e and phase . it is m. I,.(n:m ) = y (n)w(in - 11). It foll ow s fro m (8.4) th at th e selected frame
rea sonable to question the performance of a technique tha t d ocs not ad can be expressed in terms o f the u nd erlying spe ech and noi se fra mes,
dress noi sy phase. We sha ll ex plic it ly d iscus s th is issue below.
. ~,< 1l : JIl) = / (11; m ) +J~ ( I1 : m ). (8.7)
8.3.2 Spectral Subtraction By a na logy to (8.6) we migh t thi nk to us e st P DS6 and est im ate
Generalities ~ ( w: m) = ~.(w; Ill) - ri cI): m ), (8.8)
w here we rec a ll that the st PDS is d e fi ned as r..(W: m) and tp,.(w: m) are both obtained fro m the st D T FT (i n pract ice .
th e stOFf) o f the present noisy sp eech frame ,
rr (c/)" 111) = ~ ~
N L r (I]' 111)(' -/"" :
J' , (8 .9 ) 5 .. (z»; 111) = \S,.(W: 111 ) 1e J'P ,(w . m ) = r :/2(w: 111)" J"" ,(W;"I) , (8.16 )
'1= -00
with r,,( ,/: 111) the short-te rm autocorrelatio n . T he care ful reade r will rec and rd(w: m) ca n be estimated using any fram e of t he signa l in wh ich
ognize t hat t his expressi o n is only valid 10 th e ext ent that rJ. rema ins "lo speec h is not present , o r from a reference cha nnel with noise on ly.
call y uncorrelai ed," meaning t ha t
r,,(Il ; m ) = Dob(IJ) , (R.I0)
Details, Enhancements. and Applications
a nd al so uncorrelated with ~,
S pectra l Subtr action Varia tions a nd Gener ali za tions . Many va ria tio ns o n
rd.(l/: m) = 0 (8. 11) t he basic st ra tegy above arc fo und in t he lite ra tu re . These a re b est placed
for all 1/. i n perspect ive by presenting a generalized app ro ac h due to Wei ss a nd
Wherea s the long-term POS of (8.5) is part o f a m at hem a tica l m o d el Aschkenasy (1983). N ote that the estimato r (8 . 15) ca n be written
that is related to time waveforms in only an abst ract way, the sa me is not
true of th e stPOS of (8 .8). In fact, th e st POS is relat ed to th e st DT f T in
S,(w: m) = [I S,(w: m)1~ - Isr/w; m)12r "ew ,.(", :ml. (8. 17)
a simple way. For example [see (4.69) in Secti on 4. 3.5],
2
Sy(w: m)S;.(w: m) 1 Sy (w: m) 1
A gene ra lize d est imator is giv en by
r;.(w:m) = N2 = _., (8.12)
usin g t he autoco rrelation. The time domai n approac h ca n also be used ~ <C~
for values of a ot her than two. in which case the name generalized corte :>
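A minimal sketch of the generalized estimator, assuming a single windowed frame and a noise magnitude estimate obtained elsewhere (e.g., from silent frames); with a = 2 it reduces to the power-domain form of (8.17). Negative differences are clamped to zero here (half-wave rectification):

```python
import numpy as np

def spectral_subtract(y_frame, noise_mag, a=2.0):
    # Generalized subtraction of (8.18): subtract the noise magnitude
    # estimate raised to the power a, keep the noisy phase, half-wave
    # rectify negative differences, and invert back to the time domain.
    S_y = np.fft.rfft(y_frame)
    mag_y, phase_y = np.abs(S_y), np.angle(S_y)
    diff = np.maximum(mag_y ** a - noise_mag ** a, 0.0)
    return np.fft.irfft(diff ** (1.0 / a) * np.exp(1j * phase_y),
                        n=len(y_frame))

# Sanity check: with a zero noise estimate the frame passes through unchanged.
frame = np.hanning(256) * np.sin(2 * np.pi * 0.05 * np.arange(256))
out = spectral_subtract(frame, np.zeros(129))
```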
Negative Spectral Components. In addition to the differences in spectral processing, there is another important aspect of spectral subtraction that is handled differently across various algorithms. From (8.15) [or (8.18)] it is observed that the estimated speech magnitude spectrum is not guaranteed to be positive. Different systems remedy this by performing half-wave rectification or full-wave rectification, or by using a weighted difference coefficient. Most techniques use half-wave rectification (i.e., set negative portions to zero). Forcing negative spectral magnitude values to zero, however, introduces a "musical" tone artifact in the reconstructed speech. This anomaly represents the major limitation of spectral subtraction techniques. In the following material, we pursue details and enhancements to the basic paradigm that have been tried in research and practice.

Each such method produces an estimate of the uncorrupted speech frame, f_s(n; m). We survey a few techniques here with the purpose of pointing out further technical enhancements. Research results for specific spectral subtraction systems are presented here (a comparison with other enhancement techniques can be found in Section 8.7). Basic estimation of the short-term spectral magnitude has resulted in a variety of methods, such as spectral subtraction (Boll, 1979) and correlation subtraction.⁷

⁷For historical interest, a discussion of the INTEL system is included in Appendix 8.A.
The system proposed by Boll (1978, 1979) attempts to reduce spectral error by applying three processing steps once the spectral magnitude has been found. The three steps are magnitude averaging, half-wave rectification, and residual noise reduction. The process of magnitude averaging reduces spectral error by performing local averaging of the spectral magnitudes. The magnitude-averaged spectrum is found using the sample mean

    |S̄_y(ω; m_i)| ≝ (1/(2I + 1)) Σ_{l=i−I}^{i+I} |S_y(ω; m_l)|,            (8.22)

where m_{i−I}, …, m_{i+I} index 2I + 1 frames centered on the "current" frame at m_i. Therefore, the resultant estimator for the speech stDTFT, using the noisy phase φ_y(ω; m_i) from the original distorted speech, is

    Ŝ_s(ω; m_i) = [|S̄_y(ω; m_i)| − |S̄_d(ω; m_i)|] e^{jφ_y(ω; m_i)},        (8.23)

where |S̄_d(ω; m_i)| = Γ̂_d^{1/2}(ω; m_i), an estimate of the magnitude spectrum of the noise frame f_d(n; m_i). The estimator is seen to be of the spectral magnitude type.

The magnitude-averaging method works well if the time waveform is stationary. Unfortunately, the value of I in (8.22) is limited by the short-term stationarity assumption. Therefore, only a few frames of data can be used in averaging. Boll's second processing step is half-wave rectification, which reduces the mean noise level by an amount |S̄_d(ω; m)|. With this rectification, low-variance coherent noise is approximately eliminated. The disadvantage of half-wave rectification is that it is possible for the speech-plus-noise spectrum to be less than |S̄_d(ω; m)|, and consequently speech information is removed. This step is the major inadequacy of most spectral subtraction techniques, since it is a nonlinear processing step with no mathematical basis other than the requirement that the spectral magnitude be positive.

The last step in Boll's algorithm is residual noise reduction. After half-wave rectification, the spectral bands of speech plus noise above the threshold |S̄_d(ω; m)| remain, thereby preserving a residual noise component. The argument at this point is that residual noise can be reduced by replacing the present frame value with a minimum value from adjacent frames. The question that arises is, why should such a method work? The answer is that if, for some ω, |Ŝ_s(ω; m)| is less than the maximum noise residual, and if it varies from frame to frame, then there is a high probability that the spectrum at that frequency is due to noise. Therefore, the noise can be suppressed by taking the minimum from adjacent frames. If |Ŝ_s(ω; m)| is less than the maximum noise residual, but approximately constant between adjacent frames, then a high probability exists that the spectrum at that frequency represents low-energy speech; taking the minimum will not affect the information content. Finally, if |Ŝ_s(ω; m)| is greater than the maximum noise residual, then speech is present in the signal at that frequency, and subtracting the noise bias is enough. Boll evaluated this algorithm for speech distorted by helicopter noise. Figure 8.3 shows short-term vocal system spectra of noisy and enhanced helicopter speech. The results showed that spectral subtraction alone does not increase intelligibility as measured by the diagnostic rhyme test (see Chapter 9), but does increase quality, especially in the areas of increased pleasantness and inconspicuousness of the noise background. It was also shown that magnitude averaging does reduce the effects of musical tones caused by errors in accurate noise bias estimation.

A further enhancement to spectral processing is to introduce a weighted subtraction term k as

    Ŝ_s(ω; m) = [|S_y(ω; m)|^a − k|Ŝ_d(ω; m)|^a]^{1/a} e^{jφ_y(ω; m)}.      (8.24)

Berouti et al. (1979) considered such a method with a = 2. Their results showed that if the weighted subtraction term k is increased (i.e., overestimating the noise spectrum), musical tone artifacts can be reduced. It was also desirable to adjust k to maintain a minimum and maximum spectral floor based on the estimated input SNR, as shown in Fig. 8.4. As with all forms of spectral subtraction, negative values from the subtraction |S_y(ω; m)|^a − k|Ŝ_d(ω; m)|^a can be removed by full-wave or half-wave rectification. A frequency-dependent subtraction term k(ω) was also considered.

Another approach that further modifies spectral subtraction was proposed by McAulay and Malpass (1979). In this method, a spectral decomposition of a frame of noisy speech is performed and a particular spectral line is attenuated based on how much the speech-plus-noise power exceeds an estimate of the background noise. The noise at each frequency component is assumed to be Gaussian, resulting in a maximum likelihood estimate of |Ŝ_s(ω; m)|. A further extension, also due to McAulay and Malpass, is to scale the input frequency response |S_y(ω; m)| by the probability that speech is present in the input signal.

FIGURE 8.3. Examples of (a) noisy and (b) enhanced short-term vocal system spectra of speech degraded with helicopter noise.
Their reasoning is that the degree of attenuation should depend on the likelihood that a frame contains only noise. In the resulting rule (8.25), P(noise|SNR, m_i) is the probability of only noise being present at the frame ending at m_i, given an estimate of the present SNR, and the times m_l index the end-times of the frames used in the magnitude averaging. An extension to this approach was proposed by Hansen (1991), in which a noise-adaptive boundary detector was used to partition speech into voiced/transitional/unvoiced speech sections to allow for a variable noise suppression based on the input speech class, followed by the application of morphological-based spectral constraints to reduce frame-to-frame variation.

FIGURE 8.5. Intelligibility scores of a spectral subtraction for enhancement of speech degraded by wide-band random noise. Adapted from Lim (1978).
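The weighted subtraction rule (8.24) is often paired with a spectral floor so that oversubtracted bins do not go all the way to zero; the floor parameter beta below is an assumption of this sketch rather than a value from the text:

```python
import numpy as np

def oversubtract(mag_y, mag_d, k=2.0, a=2.0, beta=0.01):
    # (8.24) with a floor: subtract k times the noise estimate
    # (oversubtraction), but keep each bin above beta times the noise
    # power, which helps suppress "musical" tone artifacts.
    diff = mag_y ** a - k * mag_d ** a
    floor = beta * mag_d ** a
    return np.maximum(diff, floor) ** (1.0 / a)

mag = oversubtract(np.array([2.0, 1.0]), np.array([1.0, 1.0]))
# First bin: (4 - 2)^(1/2) = sqrt(2); second bin hits the floor: sqrt(0.01) = 0.1
```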
performed for both half- and full-wave rectification, employing 1–5 frames of magnitude averaging. The evaluation was performed under identical conditions (same distorted utterances, same global SNR estimates). Table 8.1 summarizes the results. Full-wave rectification resulted in improvement over a wider range of SNR, although half-wave rectification had greater improvement over the restricted SNR band of 5–10 dB. In addition, magnitude averaging using frames that look ahead performed poorer than the corresponding equivalent looking back in time. For both rectification approaches, magnitude averaging provided improved quality.

Dual-Channel Spectral Subtraction. The spectral subtraction methods discussed thus far have focused on single-channel techniques. Researchers have also considered dual-channel spectral subtraction methods. Specifically, Hanson et al. (1983), Childers and Lee (1987), and Naylor and Boll (1987) have all considered various forms of spectral subtraction for the purposes of co-talker separation. These methods normally require some a priori knowledge of the speaker characteristics (normally fundamental frequency contours) to assist in the enhancement process. The method proposed by Hanson and Wong (1984) considers a power exponent a and the phase difference between speech from two competing speakers. Their results show that magnitude subtraction (i.e., a = 1) is preferable at low SNR. Although estimation of pitch and voicing were necessary, they were

8.4 Speech Modeling and Wiener Filtering

8.4.1 Introduction

Short-term Wiener filtering is an approach in which a frequency weighting for an optimum filter is first estimated from the noisy speech, y(n). The linear estimator of the uncorrupted speech s(n), which minimizes the MSE criterion, is obtained by filtering y(n) with a noncausal Wiener filter. This filter requires a priori knowledge of both speech and noise statistics, and therefore must also adapt to changing characteristics. In a single-channel framework, noise statistics must be obtained during silent frames. Also, since noise-free speech is not available, a priori statistics must be based upon y(n), resulting in an iterative estimation scheme. The estimation of speech parameters in an all-pole model assuming an additive white Gaussian noise distortion was investigated by Lim and Oppenheim (1978), and later generalized for a colored noise degradation by Hansen and Clements (1985). This approach attempts to solve for the maximum a posteriori estimate of a speech waveform in additive white Gaussian noise with the requirement that the signal be the response from an all-pole process. Crucial to the success of this approach is the accuracy of the estimates of the all-pole parameters at each iteration. The estimation procedures that result in linear equations without background noise become nonlinear when noise is introduced. However, by using a suboptimal procedure, an iterative algorithm results in which the estima-
abl e 10 show a significant increase in in telligibil ity. which has proven to
tion procedure is linear at each iterat ion .
be a diff icult task for the co mpeti ng spea ker problem. T he greatest im
prove me nt occurred for low SN R (- 12 d B) with sm aller levels o f im
pr oveme nt as SNR increased . Finally, Ariki ct a l. ( 1986) consi de red a
two-dimensional spectral smo oth ing and spect ral am plitude transforma 8.4.2 Iterative Wiener Filtering
tion method . No ise proc essing is per formed in the time versus cepstrum
We begin with the same setu p used in th e spect ral subtracti on prob
do main. resu lting in improved forma nt ch aracterization with respec t to
lem: s. d. and yare sto chastic processes represent ing sp eech . noise. and
conventiona l freq uency s ubt raction.
noisy-speech. respectively.~ The process d is assumed to be uncorrelated
as in (8.3 ): sen ). dell), and yen) represent random variables from th e re
spec tive processes. and s(II)-:d(n), an d y (ll ) denote realizatio ns. Appropri
8.3.3 Summary of Short-Term Spectral Magnitude Methods ate ergodicity properties are assumed so tha t time averaging may be used
In this section , we have co nsidered spee ch enhan cement techniq ues in place of statistical averaging when desirable. Soon we will encounter
th at focus th eir processing in t he sh ort- te rm spectral domain . These an estimator for the rando m p rocess s. a nd the est imator itself will b e a
me thods are based on su btracti o n of an est imated noise bias foun d dur andom process. In anticipation of this estimator. we define th e notati ons
ing nons peech acti vity or fro m a refere nce channel. T he techniq ues differ i, ~(n). and .~ ( n ) . The noise is additive, so
in t he domai n in whic h subt ract ion is performed, the power expo nent. )' (n) = 5(11 ) + den). (8.26)
the presence or ab senc e o f t he weighted subtraction coe fficie nt based on
freq uency an d/ or probab ilit y of speech, as well as postprocessing wit h
ha lf- or full-wave rect ificatio n. or magnitude aver agi ng.
'We shall see below that the results do not change if the uncorrupted speech is dcterm in
"The I takura-Saito quality measure is discusse d in Chapter 9. istic . Therefore, we assume a stochastic process for generality.
TABLE 8.1. Spectral subtraction quality results across SNR for the rectification and magnitude-averaging variants discussed in the text.
Our goal is to formulate a linear filter with which to produce an estimate of s(n), say ŝ(n), which is optimal in the MSE sense. That is, we desire a filter with impulse response h†(n) such that with input y(n) the output is an estimator ŝ(n) for which

ξ = E{[s(n) - ŝ(n)]²}    (8.27)

is minimized. For the moment, we allow the filter to be IIR and even noncausal. The reader may recall that this filter, called the noncausal Wiener filter, was derived in Problem 5.8 using the orthogonality principle. The result in the frequency domain is

H†(ω) = Γ_s(ω) / [Γ_s(ω) + Γ_d(ω)],    (8.28)

where Γ_s(ω) and Γ_d(ω) are the PDS¹⁰ for the processes s and d.

In practice, the filter (8.28) cannot be directly applied to the noisy input speech signal, since s is only short-term stationary, and the PDS Γ_s(ω) is generally unknown. One way to approximate the noncausal Wiener filter is to adapt the filter characteristics on a frame-by-frame basis by using the stPDS,

Ĥ†(ω; m) = Γ̂_s(ω; m) / [Γ̂_s(ω; m) + Γ̂_d(ω; m)].    (8.29)

The hats over the stPDS are reminders that these spectra must be estimated. For a single-channel enhancement scheme, the noise power spectrum Γ̂_d(ω; m) is estimated during periods of silence. In dual-channel scenarios, the noise estimate is updated whenever the speech spectrum is reestimated. Estimating the speech stPDS is a more difficult problem, which we address momentarily.

Given the filter response Ĥ†(ω; m), the short-term speech spectrum is then obtained by filtering the noisy speech signal as

Ŝ_s(ω; m) = Ĥ†(ω; m) S_y(ω; m),    (8.30)

either in the time or frequency domain. We should note, however, that Ĥ†(ω; m) has a zero-phase spectrum, so that the output phase of the enhanced speech spectrum Ŝ_s(ω; m) is simply the noisy phase from S_y(ω; m). Therefore, like spectral subtraction methods, adaptive Wiener filtering focuses its processing only in the spectral magnitude domain, but ends up attributing the same phase characteristic to the speech that is used in the spectral subtraction method.

Let us now return to the problem of estimating Γ̂_s(ω; m), which, as we have indicated, is not a trivial problem. Indeed, if we had a good estimate of Γ_s(ω; m), this would imply that we approximately knew the magnitude spectrum of the speech in the frame, since

Γ_s(ω; m) = |S_s(ω; m)|².    (8.31)

However, it is the speech in the frame that we are trying to estimate [see (8.30)] and it is unlikely that we would have an accurate estimate of its spectrum. One approach to the speech spectrum estimation problem is to use an iterative procedure in which an ith estimate of Γ_s(ω; m), say Γ̂_s(ω; m, i) [or |Ŝ_s(ω; m, i)|²], is used to obtain an (i + 1)st filter estimate, say Ĥ†(ω; m, i + 1). In the next sections, we consider several methods for modeling speech in such an iterative framework.

Generalizations of Wiener filtering have been studied in other areas of signal processing. One approach for image restoration employs a noise scale term k and a power exponent a, given by

H†(ω) = [Γ_s(ω) / (Γ_s(ω) + k Γ_d(ω))]^a.    (8.32)

The numbers a and k can be varied to obtain filters with different frequency characteristics. If we were to set a = 1 and k = 1, (8.32) reverts back to the standard Wiener filter in (8.28). If we set a = 1/2 and k = 1, then (8.32) is equivalent to power spectral filtering. Again, due to the short-term stationarity assumption, (8.32) must be modified for processing on a frame-by-frame basis similarly to (8.29).

8.4.3 Speech Enhancement and All-Pole Modeling

We know from our studies in Chapter 5 that over a given frame of speech, say f_s(n; m) = s(n)w(m - n), an all-pole model of the form

Θ(z; m) = Θ₀(m) / [1 - Σ_{i=1}^{M} a(i; m) z^{-i}]    (8.33)

is frequently sufficient to accurately model the magnitude spectrum of the frame. The a(i; m)'s are the short-term LP coefficients as defined in Chapter 5, where techniques for their estimation in noise-free speech are discussed. Techniques for estimating these parameters in noisy speech (which is the case here) have been considered by Magill and Un (1976), Kobatake et al. (1978), and Lim and Oppenheim (1978, 1979).

The method by Lim and Oppenheim is based on maximum a posteriori (MAP) estimation of the LP coefficients, gain, and noise-free speech. The method is an iterative one in which the LP parameters and speech frame are repeatedly reestimated. In the following, f_s(n; m) is the (unknown) underlying frame of noise-free speech that we desire to estimate. For simplicity, and without loss of generality, let us take m = N, the window length. Here f_y(n; N) is the observed frame of noisy speech. Also, a(N) is our usual notation for the (unknown) M-vector of LP parameters over the frame, and Θ₀(N) is our usual notation for the (unknown) model gain. For simplicity in the discussion to follow, we define:

s_k ≜ kth estimate (from the kth iteration) of the vector
     s ≜ [f_s(1; N) f_s(2; N) ⋯ f_s(N; N)]^T
y ≜ [f_y(1; N) f_y(2; N) ⋯ f_y(N; N)]^T    (8.34)
s_I ≜ given or estimated initial conditions for the s_k vector
a_k ≜ kth estimate of the vector a(N)
g_k ≜ kth estimate of the model gain Θ₀(N).

It is assumed that all unknown parameters are random with a priori Gaussian pdf's. The resulting MAP estimator, which maximizes the conditional pdf of the parameters given the observations, corresponds to maximizing¹¹ p(a_k | s_{k-1}), which in general requires the solution of a set of nonlinear equations for the additive white Gaussian noise (AWGN) case. In the noisy case, the estimator requires a_k, g_k, and s_I be chosen to maximize the pdf¹² p(a_k, g_k, s_I | y). Essentially, we wish to perform joint MAP estimation of the LP speech modeling parameters and noise-free speech by maximizing the joint density p(a_k, s_k | y, g_k, s_I), where the terms g_k and s_I are assumed to be known (or estimated). Lim and Oppenheim consider a suboptimal solution employing sequential MAP estimation of s_k followed by MAP estimation of a_k, g_k given s_k. The sequential MAP estimation of s_k is equivalent to noncausal Wiener filtering of the noisy speech y. Lim and Oppenheim showed that this technique, under certain conditions, increases the joint likelihood of a_k and s_k with each iteration. It can also be shown to be the optimal solution in the MSE sense for a white noise distortion [with, say, Γ_d(ω) = σ_d²]. The resulting equation for estimating the noise-free speech is simply the optimum Wiener filter (8.28),

H†(ω; N, k) = Γ̂_s(ω; N, k) / [Γ̂_s(ω; N, k) + Γ̂_d(ω; m)],    (8.35)

where the extra index k is included to indicate the kth iteration (this will become clear below). If d is white noise, then Γ̂_d(ω; m) can be replaced by σ_d². If the Gaussian assumption of the unknown parameters holds, this is the optimum processor in a MSE sense. If the Gaussian assumption does not hold, this filter is the best linear processor for obtaining the next speech estimate s_{k+1}. With this relation, sequential MAP estimation of the LP parameters and the speech frame generally follows these steps:

1. Find a_k = argmax_a p(a | s_{k-1}, y, g_{k-1}, s_I).
2. Find s_k = argmax_s p(s | a_k, y, g_k, s_I).

The first step is performed via LP parameter estimation and the second step through adaptive Wiener filtering. The final implementation of the algorithm is presented in Fig. 8.6. This approach can also be extended to the colored noise case as shown in (Hansen, 1985). The noise spectral density, or noise variance for the white Gaussian case, must be estimated during nonspeech activity in the single-channel framework.

FIGURE 8.6. Enhancement algorithm based on all-pole modeling/noncausal Wiener filtering: (1) an AWGN distortion; (2) a nonwhite distortion.

Step 1. Estimate a_k from s_k, using either:
   a. the first M values as the initial condition vector, or
   b. always assuming a zero initial condition s_I = 0.
Step 2. Estimate ŝ_{k+1} given the present estimates a_k and g_k:
   a. Using a_k, estimate the speech spectrum:
      Γ̂_s(ω; N, k) = g_k² / |1 - a_k^T e|²,
      where e is the vector [e^{-jω} e^{-j2ω} ⋯ e^{-jMω}]^T.
   b. Calculate the gain term g_k² using Parseval's theorem.
   c. Estimate either the degrading (1) white noise variance σ_d², or (2) colored noise spectrum Γ̂_d(ω; N), from a period of silence closest to the utterance.
   d. Construct the noncausal Wiener filter:
      (1) H†(ω; N, k) = Γ̂_s(ω; N, k) / [Γ̂_s(ω; N, k) + σ_d²]
      (2) H†(ω; N, k) = Γ̂_s(ω; N, k) / [Γ̂_s(ω; N, k) + Γ̂_d(ω; N)]
   e. Filter the degraded speech f_y to produce ŝ_{k+1}.
   f. Repeat until some specified error criterion is satisfied.

¹⁰Again, please read PDS as "power density spectrum" or "spectra" as appropriate.
¹¹We omit the conventional subscripts from the pdf's in this discussion because the meaning is clear without them. Further, we use the symbol p rather than f for the pdf to avoid confusion with the frame notation.
¹²The unknowns in this case are the LP model parameters, gain, and initial conditions for the predictor.
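The iterative procedure of Fig. 8.6 (AWGN case) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are ours, the LP step solves the autocorrelation normal equations directly rather than via a Levinson-Durbin recursion, and a fixed iteration count stands in for the error criterion of step f.

```python
import numpy as np

def lp_coeffs(frame, order):
    """Autocorrelation-method LP analysis; solves the normal equations
    directly (a Levinson-Durbin recursion would be the usual choice)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    gain_sq = max(r[0] - a @ r[1:order + 1], 1e-12)   # prediction-error power
    return a, gain_sq

def iterative_wiener(y, noise_var, order=10, n_iter=4, nfft=512):
    """Sketch of Fig. 8.6, AWGN case: alternate LP estimation (step 1)
    and noncausal Wiener filtering of the noisy frame (step 2)."""
    Y = np.fft.rfft(y, nfft)
    s_hat = y.copy()                                   # initialize with noisy frame
    for _ in range(n_iter):
        a, g2 = lp_coeffs(s_hat, order)                # step 1: a_k from s_k
        w = 2 * np.pi * np.arange(nfft // 2 + 1) / nfft
        A = 1 - sum(a[i] * np.exp(-1j * w * (i + 1)) for i in range(order))
        gamma_s = g2 / np.abs(A) ** 2                  # step 2a: all-pole stPDS
        H = gamma_s / (gamma_s + noise_var)            # step 2d: Wiener filter (1)
        s_hat = np.fft.irfft(H * Y, nfft)[:len(y)]     # step 2e: filter noisy speech
    return s_hat
```

In practice the text reports that quality peaks after roughly three to four such iterations, so `n_iter` would be kept small.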
8.4.4 Sequential Estimation via EM Theory

In this section, we continue to use the simplified notation defined above. The basic sequential MAP estimation procedure above can be formulated in an alternate way.¹³ The estimate-maximize (EM) algorithm was first introduced by Dempster et al. (1977) as a technique for obtaining maximum likelihood estimates from incomplete data. In the EM algorithm, the observations are considered "incomplete" with respect to some original set (which is considered "complete"). The algorithm iterates between estimating the sufficient statistics of the "complete" data, given the observations and a current set of parameters (E step), and maximizing the likelihood of the complete data, using the estimated sufficient statistics (M step). If the unknown model parameters are distributed in a Gaussian fashion, then it can be shown that the EM approach employing maximum likelihood estimation is equivalent to the original sequential MAP estimation procedure developed by Lim and Oppenheim. To see this, consider a vector of noisy speech data [recall definitions (8.34)],

y = s + d,    (8.36)

where d has the obvious meaning in light of (8.34), and where the noise is zero mean, Gaussian, with Γ_d(ω) = σ_d². The basic problem, as above, is to estimate a(N) and the speech frame f_s(n; m) (vector s) given the frame f_y(n; m) (vector y). Now, if we view the observed data vector y as being incomplete and specify some complete data set s that is related to y by the relation

H(s) = y,    (8.37)

where H(·) is a noninvertible (many-to-one) transformation, the EM algorithm (at iteration k) is directed at finding the maximum likelihood estimate of the model parameters, say,

a* = argmax_a log p_{y,a}(y, a),    (8.38)

with p_{y,a}(·,·) the pdf for what in the present context may be considered random vectors y and a. The algorithm is iterative, with a₀ defined as the initial guess and a_k defined by induction as follows:

a_k = argmax_a E{log p_{s,a}(s, a) | y, a_{k-1}}.    (8.39)

The basic idea behind this approach is to choose a_k such that the log-likelihood of the complete data, log p_{s,a}(s, a), is maximized. However, the joint density function p_{s,a}(s, a) is not available. Therefore, instead of maximizing the log-likelihood, we maximize its expectation given the observed data y in (8.39). In addition, the current estimate of the parameters a_k is used rather than the actual (unknown) a(N). For this reason, the conditional expectation is not exact. The algorithm therefore iterates, using each new parameter estimate to improve the conditional expectation on the next iteration cycle (the E step), and then uses this conditional estimate to improve the next parameter estimate (the M step).

The EM approach is similar to the two-step MAP estimation procedure of Lim and Oppenheim; the main difference is that the error criterion here is to maximize the expected log-likelihood function given observed or estimated speech data. Feder et al. (1988, 1989) formulated such a method for dual-channel noise cancelation applications where a controlled level of cross-talk was present. Their results showed improved performance over a traditional least-MSE estimation procedure.

¹³We will encounter this method again in the study of hidden Markov models in Chapter 12.

8.4.5 Constrained Iterative Enhancement

Although traditional adaptive Wiener filtering is straightforward and useful from a mathematical point of view, there are several factors that make application difficult. Hansen and Clements (1987, 1988, 1991) considered an alternative formulation based on iterative Wiener filtering augmented with speech-specific constraints in the spectral domain. This method was motivated by the following observations. First, the traditional Wiener filter scheme is iterative with sizable computational requirements. Second, and more important, although the original sequential MAP estimation technique is shown to increase the joint likelihood of the speech waveform and all-pole parameters, a heuristic convergence criterion must be employed. This is a disturbing drawback if the approach is to be used in environments requiring automatic speech enhancement. Hansen and Clements (1985) performed an investigation of this technique for AWGN, and a generalized version for additive nonwhite, nonstationary aircraft interior noise. Objective speech quality measures, which have been shown to be correlated with subjective quality (Quackenbush et al., 1985, 1988), were used in the evaluation. This approach was found to produce significant levels of enhancement for white Gaussian noise in three to four iterations. Improved all-pole parameter estimation was also observed in terms of reduced MSE. Only if the pdf is unimodal and the initial estimate for a_k is such that the local maximum equals the global maximum is the procedure equivalent to the joint MAP estimate of a_k, g_k, and s_k.

Some interesting anomalies were noted that motivated development of the alternative enhancement procedure based on spectral constraints. First, as additional iterations were performed, individual formants of the speech consistently decreased in bandwidth and shifted in location, as indicated in Fig. 8.7. Second, frame-to-frame pole jitter was observed across time. Both effects contributed to unnatural-sounding speech. Third, although the sequential MAP estimation technique was shown to increase the joint likelihood of the speech waveform and all-pole parameters, a heuristic convergence criterion had to be employed. Finally, the original technique employs no explicit frame-to-frame constraints, though it is normally assumed that the characteristics of speech are short-term stationary.

FIGURE 8.7. Variation in vocal-tract response across iterations: (a) original, (b) distorted original, (c) 4 iterations, (d) 8 iterations.

The alternative algorithms are based on sequential two-step MAP estimation of the LP parameters and noise-free speech waveform. In order to improve parameter estimation, reduce frame-to-frame pole jitter across time, and provide a convenient and consistent terminating criterion, a variety of spectral constraints were introduced between MAP estimation steps. These constraints are applied based on the presence of perceptually important speech characteristics found during the enhancement procedure. The enhancement algorithms impose spectral constraints on all-pole parameters across time (interframe) and iterations (intraframe), which ensure that

1. The all-pole speech model is stable.
2. The model possesses speech-like characteristics (e.g., poles are not too close to the unit circle, causing abnormally narrow bandwidths).
3. The vocal system characteristics do not vary wildly from frame to frame when speech is present.

Due to the imposed constraints, improved estimates â_k result. In order to increase numerical accuracy and eliminate inconsistencies in pole ordering, the line spectral pair (see Section 5.4.1) transformation was used to implement most of the constraint requirements. The imposition of these constraints helps in obtaining an optimal terminating iteration and improves speech quality by reducing the effects of these anomalies.

The constrained iteration method attempts to bridge the gap between the two broad enhancement philosophies, where the basic sequential MAP estimation procedure serves as the mathematical basis for enhancement while the imposition of constraints between MAP estimation steps attempts to improve aspects important in human perception. Figure 8.8 illustrates results from a single frame of speech for the traditional Wiener filtering method (unconstrained) and the constrained approach. Further discussion of quality improvement for iterative speech enhancement methods will be found in Section 8.7. Another speech modeling approach using a dual-channel framework by Nandkumar and Hansen (1992) extends this method by employing auditory-based constraints. Improvement in speech quality was also demonstrated over a portion of the TIMIT database (see Section 13.8).

FIGURE 8.8. Variation in vocal-tract response across iterations for (1a-d) the Lim-Oppenheim (1978, 1979) unconstrained enhancement algorithm and (2a-d) the Hansen-Clements (1987) constrained enhancement algorithm: (a) original, (b) distorted original, (c) 4 iterations, (d) 8 iterations.

8.4.6 Further Refinements to Iterative Enhancement

All-pole modeling has been shown to be successful in characterizing uncorrupted speech. Techniques have been proposed for estimating all-pole model parameters from noisy observations by Lim and Oppenheim (1978) and Done and Rushforth (1979). Although all-pole modeling of speech has been used in many speech applications, it is known that some sounds are better modeled by a pole-zero system (Flanagan, 1972; Rabiner and Schafer, 1978; O'Shaughnessy, 1988). Musicus and Lim (1979) considered a generalized MAP estimation procedure based on a pole-zero model for speech. Essentially, the procedure requires MAP estimation of the predictor coefficients for both denominator and numerator polynomials, followed by MAP estimation of the noise-free speech through the use of an adaptive Wiener filter. Paliwal and Basu (1987) considered a speech enhancement method based on Kalman filtering. A delayed Kalman filtering method was found to perform better than a traditional Wiener filtering scheme. Another refinement proposed by Gibson et al. (1991) considers scalar and vector Kalman filters in an iterative framework in place of the adaptive Wiener filter for removal of colored noise. Other enhancement techniques based on speech modeling have employed vector quantization and a noisy-based distance metric to determine a more suitable noise-free speech frame for enhancement (Gibson et al., 1988; O'Shaughnessy, 1988). Such methods require a training phase to characterize a speaker's production system. Another speaker-dependent enhancement approach by Ephraim et al. (1988, 1989) employs a hidden Markov model (HMM) to characterize the uncorrupted speech. The parameter set of the HMM is estimated using a K-means clustering algorithm, followed by sequential estimation of the noise-free speech, and HMM state sequences and mixture coefficients. (The HMM is discussed in Chapter 12.) The speech signal estimation process also results in a noncausal Wiener filtering procedure.

The majority of speech-model-based enhancement methods result in iterative procedures. For these methods, a termination criterion is needed. Normally, this is accomplished by listening to successive iterations of processed speech and subjectively determining the iteration with the "best" resulting quality. This knowledge is then used to terminate the procedure at that iteration. Such testing procedures may need to be repeated as noise types or distortion levels vary. Another means of determining the iteration with highest quality is to use objective speech quality measures (Chapter 9).

Finally, we note that noncausal Wiener filtering techniques have also been employed. Lim and Oppenheim (1978) considered such an iterative approach for the AWGN case. Their results showed improvement in speech quality for enhancement at various SNRs. In addition, improvement in all-pole parameter estimation as measured by reduced MSE was also observed. This method was evaluated by Hansen and Clements (1985) for AWGN and slowly varying aircraft cockpit noise. White Gaussian noise results are shown in Table 8.1. This evaluation confirmed that good speech quality can be achieved if the iterative procedure is terminated between three and four iterations. For a colored noise distortion, the method of characterization for the background noise greatly influences enhancement performance. Evaluations with colored noise revealed that a Bartlett spectral estimate (Kay and Marple, 1981) produced higher levels of speech quality compared with other spectral estimation methods (e.g., maximum entropy, Burg, Pisarenko, or periodogram estimates). Further discussion of Wiener filtering performance can be found in Section 8.7.3.

8.4.7 Summary of Speech Modeling and Wiener Filtering

In this section, we have considered speech enhancement techniques that enhance speech by first estimating speech modeling parameters, and then resynthesizing the enhanced speech with the aid of either a noncausal adaptive (over time) Wiener filter or a delayed Kalman filter. The techniques differ in how they parameterize the speech model, the criterion for speech enhancement (MSE, MAP estimation, ML estimation, perceptual criteria), and whether they require single- or dual-channel inputs.

8.5 Adaptive Noise Canceling

8.5.1 Introduction

The general technique of adaptive noise canceling (ANC) has been applied successfully to a number of problems that include speech, aspects of electrocardiography, elimination of periodic interference, elimination of echoes on long-distance telephone transmission lines, and adaptive antenna theory. The initial work on ANC began in the 1960s. Adaptive noise canceling refers to a class of adaptive enhancement algorithms based on the availability of a primary input source and a secondary reference source. The primary input source is assumed to contain speech plus additive noise,

y(n) = s(n) + d₁(n),    (8.40)

where, as usual, these sequences are realizations of stochastic processes y, s, and d₁. The secondary or reference channel receives an input d₂(n), the realization of a stochastic process d₂ that may be correlated with d₁ but not s (see Fig. 8.9). All random processes are assumed WSS and appropriately ergodic so that time waveforms can be used in the following analysis.

The adaptive noise canceler consists of an adaptive filter that acts on the reference signal to produce an estimate of the noise, which is then subtracted from the primary input. The overall output of the canceler is used to control any adjustments made to the coefficients of the adaptive filter (often called "tap weights" in this context; see Fig. 8.10). The criterion for adjusting these weights is usually to minimize the mean square energy of the overall output (this might seem odd, but see below). The research area of adaptive filter theory is rich in algorithms and applications. For example, textbooks by Haykin (1991), Messerschmitt (1984), and Proakis et al. (1992) develop an adaptive filter framework and discuss applications in system identification, adaptive channel equalization, adaptive spectral analysis, adaptive detection, echo cancelation, and adaptive beamforming. In this section, we will limit our discussion to the application of ANC for speech enhancement.

FIGURE 8.9. Flow diagram of adaptive noise canceling: the primary channel y(n) = s(n) + d₁(n) enters a summing junction; the adaptive filter acts on the reference channel d₂(n) to produce the noise estimate d̂₁(n), which is subtracted to give the canceler output ε(n), the enhanced signal.
530 ell. 8 / Speech Enhancement 8 ,5 I Ad ap tive No,so Ca nceling 53 1
Most enhancement techniques, such as spectral subtraction and speech-modeling-based approaches, can be generalized to operate in a dual-channel system. However, unlike spectral subtraction and adaptive Wiener filtering, ANC usually requires a secondary reference channel. In Section 8.6.2, we discuss a special ANC approach that takes advantage of the periodicity of voiced speech to obviate the second channel.

Initial studies on ANC can be traced to Widrow and his co-workers at Stanford in 1965, and Kelly at Bell Laboratories. In the work by Widrow, an adaptive line enhancer was developed to cancel 60-Hz interference at the output of an electrocardiographic amplifier and recorder. This work was later described in a paper by Widrow et al. (1975). The adaptive line enhancer and its application as an adaptive detector were patented by McCool et al. (1980, 1981). The steady-state behavior of the adaptive line enhancer was later studied by Anderson and Satorius (1983) for stationary inputs consisting of finite-bandwidth signals embedded in a white Gaussian noise background. Kelly, also in 1965, developed an adaptive filter for echo cancelation that uses the speech signal itself to adapt the filter. This work was later recognized by Sondhi (1967). The echo canceler and its refinements by Sondhi are described in patents by Kelly and Logan (1970) and Sondhi (1970).

8.5.2 ANC Formalities and the LMS Algorithm

The classical approach to dual-channel adaptive filtering, based on a least-MSE criterion [the acronym used for "least (or minimum) MSE" in this context is often "LMS"], was first formulated by Widrow and Hoff (1960, 1975). This technique has the major advantage of requiring no a priori knowledge of the noise signal. Figure 8.9 illustrates the basic structure of an adaptive noise canceler.

Our extensive work with LP in Chapter 5 will permit us to get some needed results very quickly. One interesting analytical detail should be pointed out before pursuing these results. In this development it will be entirely sufficient to work exclusively with long-term analysis. Although short-term quantities can be introduced in an obvious place (similar to the transition in the LP developments), this would turn out to be unnecessary here. The reason is that an estimation procedure at the end of the development inherently produces an algorithm that can realistically work in real time frames and in the presence of nonstationary signal dynamics. We will therefore be able to avoid the details of short-term processing in the following discussion with no loss of practical value.

All signals in Fig. 8.9 are assumed to be realizations of WSS stochastic processes with appropriate ergodicity properties so that we may use time waveforms in the analysis. The meaning of each of the signals has been discussed in the introduction. It was explained there that the objective of the adaptive filter¹⁵ in Fig. 8.9 is to estimate the noise sequence d1(n) from d2(n) in order that the noise can be removed from y(n). This seems very reasonable from the diagram. With this interpretation, the output of the noise canceler can be interpreted as an estimate, say ŝ(n), of the uncorrupted speech s(n). The filter is FIR with estimated tap weights, say ĥ(i), i = 0, 1, ..., M − 1, so that

    d̂1(n) = Σ_{i=0}^{M−1} ĥ(i) d2(n − i).    (8.41)

For convenience, let us define the M-vector of weight estimates

    ĥ = [ĥ(0) ĥ(1) ... ĥ(M − 1)]^T.    (8.42)

Figure 8.10 illustrates the LMS adaptive filter structure.

FIGURE 8.10. The LMS adaptive filter.

Now our experience would lead us to discern that a natural optimization criterion is to minimize the MSE between the sequences d1(n) and d̂1(n). Unfortunately, the signal d1(n) is not measurable, so we will be unable to design on this basis. However, a result developed in Appendix 5.B allows us to achieve the same objective from a different viewpoint. It is shown there that attempting to estimate d1(n) using d2(n) and a least-MSE criterion is equivalent to estimating d1(n) plus any signal that is orthogonal to d2(n). In this case, therefore, we may attempt to estimate y(n) from d2(n) and derive an identical filter to that which would be obtained for estimating d1(n). It is interesting that in this interpretation the signal s(n) is interpreted as an error [call it e(n)], which is to be minimized in mean square. Therefore, the ANC is sometimes described as having been designed by minimizing its output power (or energy in the short-term case).

¹⁵The filter is not really "adaptive" yet because we are working with a long-term situation in which the relevant properties of all signals are assumed to remain forever stationary.
In keeping with the alternative, but equivalent, optimization criterion, ĥ is chosen such that

    ĥ = argmin_h E{ [ y(n) − Σ_{i=0}^{M−1} h(i) d2(n − i) ]² }.    (8.43)

The minimization leads to the conditions

    r_yd2(η) − Σ_{i=0}^{M−1} ĥ(i) r_d2(η − i) = 0,    η ∈ [0, M − 1],    (8.44)

or, in matrix-vector notation similar to previous developments [see (5.23) and (5.24)],

    R_d2 ĥ = r_yd2.    (8.45)

The remaining issue is the solution of (8.45) for the filter tap weights. Indeed, this set of equations differs from those derived for the LP problem only in the presence of cross-correlations in the auxiliary vector on the right side. Accordingly, it possesses all of the symmetry properties of the LP normal equations and can be solved by any of the methods discussed there.

Viewing the MSE as a function of the weights, it is possible to imagine a "surface" plotted over the ĥ(0)-ĥ(1) plane. Now, by definition from (8.46), the gradient of the MSE V(h) with respect to weight h(η) is

    ∂V(h)/∂h(η) = −2 E{ y(n) d2(n − η) − Σ_{i=0}^{M−1} h(i) d2(n − i) d2(n − η) }
                = −2 [ r_yd2(η) − Σ_{i=0}^{M−1} h(i) r_d2(η − i) ],    (8.47)

or, by differentiating with respect to the entire weight vector at once,

    −(1/2) ∂V(h)/∂h = r_yd2 − R_d2 h.    (8.48)

Therefore, we have chosen an unbiased estimate of g, since the joint wide-sense stationarity of y(n) and d2(n) follows from the original assumptions. The approximation chosen also amounts to using the sample error surface due to each incoming point as an estimate of the error surface associated with the entire frame of data. The approximation makes use of the gradient of the MSE, but does not require any squaring or differentiation operations. The resulting recursion is

    ĥ^n = ĥ^{n−1} − Δ_n ĝ^n,    (8.51)

where ĝ^n indicates the estimated gradient associated with time n. In practice, a fixed step size, Δ_n = Δ, is often used for ease of implementation and to allow for adaptation of the estimate over time as the dynamics of the signal change. This simple algorithm was first proposed by Widrow and Hoff (1960) and is now widely known as the LMS algorithm.

Notice that without ever explicitly resorting to short-term analysis, we have an algorithm that is immediately practically applicable. This is because the approximation made in (8.50) is a short-term estimator of the cross-correlation. Because the estimator is very short term (one point), the LMS algorithm has the potential to track time-varying signal dynamics. A discussion of this point is beyond the scope of the present discussion, so we refer the interested reader to one of the cited textbooks.

The convergence, stability, and other properties of LMS have been studied extensively; refer to Widrow and Stearns (1985) or Proakis et al. (1992) for details. It has been shown using long-term analysis (Widrow, 1975) that, starting with an arbitrary initial weight vector, the LMS algorithm will converge in the mean and remain stable as long as the following condition on the step-size parameter Δ is satisfied,

    0 < Δ < 1/λ_max,    (8.52)

where λ_max refers to the largest eigenvalue of the matrix R_d2. In practice, the bounds in equation (8.52) are generally modified to ensure a working margin for system stability (Horwitz and Senne, 1981; Tate and Goodyear, 1983).

Many alternative approaches for recursive tap filter weight estimation can be found in the literature for controls, system identification, and adaptive filter theory. The interested reader is encouraged to consider texts by Haykin (1991), Messerschmitt (1984), Bellanger (1987), and Proakis et al. (1992).

8.5.3 Applications of ANC

Experimental Research and Development

One of the advantages of dual-channel ANC is that speech with either stationary or nonstationary noise can be processed. In general, the two microphones are required to be sufficiently separated in space, or to contain an acoustic barrier between them, to achieve noise cancelation. In this section, we consider several applications of ANC to the problem of enhancing degraded speech.

One of the earlier dual-channel evaluations of ANC for speech was conducted by Boll and Pulsipher (1980). Two adaptive algorithms were investigated: the LMS approach of Widrow et al. (1976) and the gradient lattice approach of Griffiths (1978).¹⁶ Each approach was compared in terms of degree of noise power reduction, algorithm settling time, and degree of speech enhancement. Based on earlier simulation studies (Pulsipher et al., 1979), the typical FIR adaptive filter necessary to estimate the input noise characteristics required 1500 tap weights. Such large filter lengths result in misadjustment, defined by Widrow et al. (1976) as the ratio of excess MSE to minimum MSE. This notion of misadjustment is an important design criterion for dual-channel ANC, since large misadjustment leads to pronounced echo in the resulting speech signal. This occurs because of the adaptive structure of the FIR ANC filter. Fortunately, the echo can be reduced by decreasing the adaptation step size used in updating filter weights, but this increases the settling time of the adaptive filter. Both the LMS and gradient lattice approaches provide comparable noise power reduction. Employing step sizes that correspond to 5% misadjustment, both algorithms converge after 20 seconds of input with a just-noticeable level of echo. The major points from this study suggest that LMS or gradient-lattice-based ANC can provide noise suppression in the time domain, but that a large tap delay filter is needed. Also, for all of their simulations, Boll and Pulsipher placed the reference microphone directly next to the noise source to eliminate the need for delay estimation caused by noise arriving at each microphone at different time instances.

Although noise cancelation can be achieved using LMS or gradient lattice ANC, computational requirements become increasingly demanding as adaptive filter lengths grow to as many as 1500 taps. An alternative method of adaptive filtering, based on the complex form of the LMS algorithm, can result in a substantial savings in computation by performing the noise cancelation in the frequency rather than the time domain. The frequency domain LMS adaptive filter is shown in Fig. 8.11. The structure is similar to the conventional time domain filter shown in Fig. 8.10; however, input data are frame-processed through input and output N-point FFTs. The filter coefficients are complex and are updated only once per frame using a complex LMS update equation (Widrow et al., 1975).

¹⁶The ANC method by Griffiths employs a lattice filter framework, rather than the tapped delay lines (FIR filters) used by the other methods. It has been shown that the successive orthogonalization provided by the lattice offers an adaptive convergence rate that cannot be achieved with tapped-delay lines.
An alternative approach for frequency domain ANC can be formulated by explicit estimation of the filter instead of a gradient method such as LMS. Boll (1980) proposed such a method, where the auto- and cross-power spectral estimates are used (see Fig. 8.12). Comparable performance to that of LMS and gradient lattice was observed, but with a substantial computational savings. In related work, Reed and Feintuch (1981) considered the statistical behavior of Boll's frequency domain adaptive canceler with white noise inputs. They developed expressions for the mean and variance of the adaptive filter weights, and compared the performance to a time domain canceler. It was shown that the transient responses of both time and frequency domain implementations are the same, but that the inverse transform of the steady-state mean weights of the frequency domain canceler may differ from the steady-state mean weights of the time domain canceler due to frame processing effects of the stDFT. If the signal delay between inputs is small compared with the length of the filter, the steady-state mean weight vector for each canceler is essentially the same. Since time domain approaches can introduce echo in the enhanced speech, such frequency domain adaptive filters serve as a computationally efficient alternative.

Although our discussion of ANC has focused on the dual-channel framework, these systems can be extended to higher dimensions.
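One way to realize Boll's explicit-estimation idea is to form frame-averaged auto- and cross-power spectral estimates and take their ratio as the filter transfer function. The sketch below is our own simplified rendering under that assumption (no overlap, no windowing, names are ours), not Boll's published algorithm; note that multiplication in the DFT domain implements circular rather than linear convolution, so a practical version would zero-pad the frames.

```python
import numpy as np

def freq_domain_anc(y, d2, nfft=256):
    """Frequency domain canceler via explicit filter estimation.

    Estimates H(k) = S_{y d2}(k) / S_{d2 d2}(k) from frame-averaged
    spectra, filters the reference d2 with H, and subtracts the result
    from the primary y, frame by frame (no overlap in this sketch).
    """
    nframes = len(y) // nfft
    Y = np.fft.rfft(y[: nframes * nfft].reshape(nframes, nfft), axis=1)
    D2 = np.fft.rfft(d2[: nframes * nfft].reshape(nframes, nfft), axis=1)
    # Averaged cross- and auto-power spectral estimates
    S_yd2 = np.mean(Y * np.conj(D2), axis=0)
    S_d2 = np.mean(np.abs(D2) ** 2, axis=0)
    H = S_yd2 / (S_d2 + 1e-12)
    # Apply the single estimated filter to every frame of the reference
    S_hat = Y - H[None, :] * D2
    return np.fft.irfft(S_hat, n=nfft, axis=1).ravel()
```

The filter is computed once from all frames here; a streaming version would update the averaged spectra recursively.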
In general, the reference microphone must be positioned away from the primary microphone, so that it picks up as little speech as possible. This must be true so that the algorithm does not cancel the speech instead of the noise. In the previously cited study by Boll and Pulsipher (1980), the reference microphone was placed directly next to the noise source, and the primary was placed near the weak signal source but as far as possible from the reference microphone. This satisfies the ANC constraint of a high SNR in the primary channel and a low SNR in the reference channel.

One case in which microphone spacing is not an issue is in aircraft cockpit environments. This has received considerable interest as a means of improving the performance of existing communication systems. In this case, the pilot's oxygen facemask serves as an acoustic barrier between the two sensors, thereby ensuring that the SNR of the primary sensor is much greater than the SNR of the reference sensor, while permitting close sensor spacing. Many aspects of the cockpit noise problem have been studied. The interested reader is referred to papers by Harrison et al. (1984, 1986), Darlington et al. (1985), Powell et al. (1987), and Rodriguez et al. (1987).
Cross Talk Within Dual-Channel ANC

In the foregoing discussion of ANC, we enforced a requirement that the primary and reference channels be well separated, either physically or by virtue of an acoustic barrier. If the microphones are too close to one another, cross talk occurs. A typical adaptive filter will thereby suppress a portion of the input speech characteristics. One means of addressing this problem is to place a second adaptive filter in the feedback loop. Consider the two channel signals

    y1(n) = s1(n) + d1(n)    (8.55)
    y2(n) = s2(n) + d2(n).    (8.56)

Here, the primary channel y1(n) contains speech and a degrading noise component. Due to the close proximity of the secondary microphone, the reference channel contains a low-level speech signal s2(n) in high-level noise. We assume that the SNR of the primary channel is higher than that for the reference (SNR1 > SNR2). Under such conditions, the low-level speech represents "interference" in the desired noise reference. If a speech reference can be found, then a second adaptive filter can be used to cancel the speech interference in the reference channel. This in turn results in an improved noise reference with which to filter the primary channel. Such a framework for suppressing cross talk using a dual adaptive filter feedback loop is shown in Fig. 8.13. We assume a speech reference s_ref(n) exists as input to adaptive filter I. The tap weights of adaptive filter I are adjusted to produce the best estimate of the low-level speech interference s2(n) in the MSE sense. This can be done using the standard LMS algorithm or other gradient-descent methods. The estimate ŝ2(n) is subtracted from y2(n) to produce the estimate d̂2(n), which is also the MSE estimate assuming uncorrelated speech and noise. The estimated noise signal d̂2(n) is now used as input to adaptive filter II. A second set of tap weights is adjusted to produce the best estimate of the primary interference d1(n). The estimate is subtracted from the primary input, resulting in the estimated speech signal ŝ1(n). This represents the enhanced signal as well as the reference for adaptive filter I.

Such a method was considered by Zinser et al. (1985) and Mirchandani et al. (1986) for speech spoken in a helicopter background noise environment. Their evaluations showed increases in SNR in the range 9-11 dB. An increase in intelligibility of 21 points, as measured by the diagnostic rhyme test,¹⁷ was also obtained. However, considerable variation in performance was observed using a real-time implementation with different microphones. Implementation issues as well as microphone placement can affect ANC performance. These results do suggest a good potential for effective noise canceling in high-noise environments.

8.5.4 Summary of ANC Methods

In this section, we have considered several methods of dual-channel ANC for enhancing noisy speech. Earlier evaluations that placed the reference microphone directly next to the noise source revealed promising noise cancelation performance. However, further studies that focus on the coherence of the primary and reference microphones suggest that such performance may not be achievable in actual environments. Computational issues, from direct-form lattice to gradient-descent techniques based on the LMS algorithm, have suggested a variety of ANC implementations. Time versus frequency domain formulations have also been discussed. We found that time domain approaches lend themselves to real-time implementation, but require close microphone placement to avoid long filter lengths and the introduction of echo in the processed speech. Frequency domain approaches offer significant reductions in computational requirements and equivalent levels of performance. Finally, while some enhancement methods require accurate characterization or estimation of the noisy speech characteristics, ANC requires only a noise reference, with no a priori knowledge of the input speech characteristics. In high-noise environments, where estimation of such a priori knowledge may not be sufficiently accurate, ANC offers a viable means for speech enhancement.

Finally, there is another solution to the problems caused by distant microphone spacing, which has been treated briefly above. This is to deliberately place the microphones as close as possible and to address the issue of speech signal cross talk directly. This subject is discussed further in Appendix 8.B.
FIGURE 8.13. Suppression of cross talk employing two adaptive noise filters within a feedback loop.

8.6 Systems Based on Fundamental Frequency Tracking

8.6.1 Introduction

In this section, we discuss enhancement techniques that are based on tracking the fundamental frequency contour. Such approaches include single-channel ANC, adaptive comb filtering, and enhancement based on harmonic selection or scaling. These techniques capitalize on the periodicity of voiced speech.

¹⁷See Chapter 9 for a complete discussion of the diagnostic rhyme test and other intelligibility tests.
8.6.2 Single-Channel ANC

Generally speaking, ANC can only be employed when a second channel is available. Suppose, however, that we could simulate a reference using data from the primary channel, containing noise correlated with the noise in the primary channel. Under these conditions, traditional ANC can be applied. Sambur (1978) proposed such an approach where, instead of canceling noise in the primary channel, the speech signal is canceled.

In dual-channel ANC, the success of the adaptive filter depends on the availability of a good noise reference input that is free of cross talk. In most speech enhancement applications, a reference noise channel is not available; therefore, many enhancement techniques must estimate noise characteristics during periods of silence (periods between speech activity) and assume the noise characteristics to be stationary during speech activity. Extracting a noise reference from the input has some disadvantages.

In Sambur's method, the enhanced speech is estimated as

    ŝ(n) = Σ_{i=0}^{M−1} ĥ(i) y(n − T0 − i),    (8.59)

where T0 is the pitch period in samples and ĥ(i), i = 0, ..., M − 1, are the FIR filter weights, identified in a similar way to the methods described in Section 8.5.2.

Sambur investigated this approach for additive white noise and quantization noise. The pitch period was estimated using an average magnitude difference function (Section 4.3.2) and nonlinear smoothing (Rabiner et al., 1975). Since this method exploits the periodicity of the input signal, in principle it should only be applied for voiced speech. For unvoiced sections, one of two procedures may be applied. The first approach is to pass the noisy unvoiced speech through the system unprocessed; the second method is to keep the LMS filter response constant and process the unvoiced speech.

In Sambur's work, this approach is shown to improve quality for additive white noise in the SNR range 0-10 dB, with higher levels of improvement as the severity of the degrading noise increases. Improved SNR resulted as the LMS filter length M was increased from 6 to 14, especially for lower initial SNRs. Figure 8.16 shows improvements in SNR for varying filter lengths [from equation (8.59)]. It was observed that the more severe the noise, the more dramatic the improvement in SNR. Subjective evaluations were also performed. Listeners concluded that the speech was more pleasant to listen to and "appeared" to have more intelligibility, although no formal tests were performed to determine the level of intelligibility before and after processing. Performance in the presence of quantization noise from a variable-rate delta modulation system was also determined. The LMS adaptive filter removed some of the "granular" quality of the quantized speech. This degradation possesses two types of noise: slope overload (step size too small) and granular noise (hunting due to a too-large step size). ANC removes the granular noise, since it is signal independent and broadband, but leaves slope overload noise unaffected, since it is signal dependent. Sambur also considered this scheme for an LP analysis/synthesis system and found improved all-pole parameter estimation, especially at low SNR.

FIGURE 8.16. Improvement in SNR for a single-channel ANC method using the fundamental frequency contour. Adapted from Sambur (1978a).

One of the main limitations of single-channel ANC is the requirement of accurate pitch estimation. Sambur's method was modified by Varner et al. (1983) by removing the pitch estimator and obtaining a reference signal through the use of a low-order DPCM adaptive predictor (model order three). The resulting reference signal contains correlated speech plus uncorrelated noise, which is filtered using an LMS adaptive filter. No quality or intelligibility results were reported, but experiments involving waveforms representing steady-state vowels showed improvement. Finally, another approach (Kim and Un, 1986) attempts to remove the pitch estimator by developing an ANC using both forward- and backward-adaptive filters. The method requires a speech/silence discriminator for narrowband noisy speech and obtains levels of enhancement similar to those resulting in Sambur's work.

Although single-channel ANC has been formulated in the time domain, a frequency domain generalization is also possible, following Boll's (1980) dual-channel frequency domain ANC approach.

8.6.3 Adaptive Comb Filtering

Corrupting noise can take many forms. In some applications speech is degraded by an underlying process that is periodic, resulting in a noise spectrum that also possesses periodic structure. Two methods are available for reducing such noise: adaptive comb filtering (ACF) (Lim, 1979) and time domain harmonic scaling (TDHS) (Malah et al., 1979). Time domain harmonic scaling will be addressed in the next subsection.

Adaptive comb filtering is similar in its basic assumptions to single-channel LMS-based ANC. Since voiced speech is quasi-periodic, its magnitude spectrum contains a harmonic structure. If the noise is nonperiodic, its energy will be distributed throughout the spectrum.
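Before turning to the comb filter itself, the single-channel canceler of (8.59) can be sketched as follows: the reference is simply the input delayed by one pitch period T0 (assumed known and constant here), and an LMS filter retains the component that repeats from period to period. Names and parameters are our own illustration, not Sambur's implementation.

```python
import numpy as np

def single_channel_anc(y, T0, M=16, mu=0.002):
    """Single-channel ANC using the pitch period T0 (cf. Eq. (8.59)).

    The reference is the input delayed by one pitch period; the filter
    output s_hat(n) = sum_i h(i) y(n - T0 - i) retains what repeats
    from period to period (voiced speech) and rejects broadband noise,
    which is uncorrelated across a pitch period.
    """
    N = len(y)
    h = np.zeros(M)
    s_hat = np.zeros(N)
    for n in range(T0 + M - 1, N):
        x = y[n - T0 - M + 1: n - T0 + 1][::-1]   # y(n-T0), y(n-T0-1), ...
        s_hat[n] = h @ x                           # enhanced-speech estimate
        e = y[n] - s_hat[n]                        # error = noise estimate
        h = h + 2 * mu * e * x                     # LMS update
    return s_hat
```

Note that here the filter output, not the error, is the enhanced signal: the adaptation "cancels" the speech from the error, exactly as described above.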
FIGURE 8.17. An example of adaptive comb filtering. (a) A time waveform of a typical section of voiced speech. (b) The amplitude spectrum of noise-corrupted voiced speech. (c) Frequency response of a typical comb filter. (d) Resulting voiced speech spectrum after comb filtering.

The spectrum of Fig. 8.17(b) shows that the speech harmonics are concentrated in small energy bands, but that the corrupting noise (white Gaussian in this case) is broadband and distributed throughout the frequency band. It is clear that the magnitude response contains enough energy to allow the accurate estimation of the fundamental frequency component. A filter can be implemented that passes the fundamental frequency plus harmonics while rejecting frequency components between harmonics. Ideally, spacing between each "tooth" in the comb filter should remain constant throughout the voiced section of speech. Unfortunately, speakers normally vary their pitch and therefore require the comb filter to adapt as data are processed.

A typical block diagram for an adaptive comb filter is shown in Fig. 8.18. The comb filter has large values at the specified fundamental frequency F0 and its harmonics, and low values between. The filter is usually implemented in the time domain as

    ŝ(n) = Σ_{i=−L}^{L} c(i) y(n − i T0),    (8.60)

where the c(i) are the filter coefficients, T0 is the fundamental period in samples, and L is a small constant (typically 1-6) which represents the number of pitch periods used forward and backward in time for the filtering process.

Since a comb filter can only be used to enhance noisy voiced speech, a method must be available with which to handle unvoiced speech or silence sections. Two approaches are typical. First, the comb filter can be turned off by setting c(k) = 0 for all k ≠ 0. This has the effect of passing the unvoiced speech through the filter unprocessed. Figure 8.18 shows that a scaling term is used for the unvoiced (or silence) data path. The scaling term (which is typically in the range of 0.3-0.6) is necessary because applying an ACF to voiced sounds reduces the noise energy present. Failure to apply attenuation in unvoiced or silence sections results in unnatural emphasis of unvoiced speech sounds with respect to voiced sounds. The second method for processing unvoiced speech is to maintain a constant set of filter coefficients, obtained from the last voiced speech frame, and process the unvoiced sounds or silence as if they were voiced. This technique has not been as successful as the first.

FIGURE 8.18. Block diagram of a typical adaptive comb filter.
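The fixed comb filter of (8.60) is easily sketched for the simple choice of uniform weights c(i) = 1/(2L + 1); the function name and parameters below are our own, and a real ACF would additionally switch its coefficients on voicing decisions as in Fig. 8.18.

```python
import numpy as np

def comb_filter(y, T0, L=2):
    """Fixed comb filter of Eq. (8.60) with uniform weights.

    Averages 2L+1 pitch-aligned samples, c(i) = 1/(2L+1): the periodic
    (voiced) component is preserved, while uncorrelated noise power is
    reduced by roughly the factor 2L+1.
    """
    N = len(y)
    c = np.full(2 * L + 1, 1.0 / (2 * L + 1))
    s_hat = np.copy(y)                             # edges left unprocessed
    for n in range(L * T0, N - L * T0):
        taps = y[n - L * T0: n + L * T0 + 1: T0]   # y(n - i*T0), i = -L..L
        s_hat[n] = c @ taps
    return s_hat
```

With L = 2 the filter spans five pitch periods, consistent with the 30-40-msec stationarity limit discussed below for typical pitch periods.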
The filter of (8.60) is typically implemented in the time domain, whereby the output is an average of delayed and weighted versions of the noisy input. Choosing the delay to correspond to the fundamental period in y(n) results in an averaging process that strengthens all components with that period, and attenuates or cancels others that have no period or a period different from the original fundamental. Clearly, the success of this process is inherently dependent on the accuracy of the estimate of F0 for the desired signal. Certainly, the performance of the comb filter is best when the pitch does not change over the analysis window [e.g., 2LT0 + 1 samples]. The estimation of F0 can in fact be a difficult task, since speech signals often vary from one pitch period to the next. The problem of F0 changing within an analysis window can be approached with the following modification to (8.60):

    ŝ(n) = Σ_{i=−L}^{L} c(i) y(n − i T0 + ζ_i).    (8.61)

Here individual pitch period durations are required over the entire data analysis window. These measurements are then used as the timing adjustments ζ_i to align adjacent pitch periods. As the number of pitch periods used in the comb filter, L, increases, so does the need to include the alignment factors ζ_i. Therefore, simple ACFs set ζ_i = 0 and keep the number of pitch periods used for filtering to a minimum (e.g., L = 1, resulting in three periods). More sophisticated approaches incorporate larger numbers of pitch periods (e.g., L = 6, resulting in 13 periods), and perform pitch alignment using the ζ_i timing terms. It is desirable to include as many periods as possible, since the number L is inversely proportional to the bandwidth of each tooth in the comb filter. Larger values of L, therefore, produce narrower harmonics for the filter and allow for further noise removal. However, due to the short-term stationarity property of speech, the number of pitch periods must not represent more than a 30-40-msec duration, thereby limiting the range of L. It has been shown (Lim et al., 1978) that although small increases in SNR occur as the filter length in pitch periods increases from 3 to 7 to 13, intelligibility scores decrease.

Malah and Cox (1982) proposed a generalized comb filtering technique suited to the time-varying nature of speech. They showed that classical comb filtering distorts the speech somewhat, but that an approach which adapts itself both globally and locally to the time-varying nature of speech improves performance. Lim et al. (1978), using wideband random noise, and Perlmutter et al. (1977), using a competing speaker, evaluated this adaptive technique for varying filter lengths. Nonsense sentences and exact pitch information (obtained from the noise-free speech) were used in both evaluations. Figure 8.19 illustrates intelligibility scores for both distortions. The competing speaker problem resulted in a decrease in intelligibility. Also, decreases in intelligibility were usually observed at various SNRs for wideband random noise. In general, it is not realistic to assume accurate pitch information at low SNR, so intelligibility scores should be lower still. Even with decreases in intelligibility, both studies mention that processed speech sounded "less noisy" due to the system's ability to increase the local SNR. No quality tests were performed to verify this hypothesis, however.

8.6.4 Harmonic Selection

If the degrading noise source is a competing speaker, then an enhancement technique similar to comb filtering can be formulated in which spectral harmonics of each speaker are separated based on external pitch estimates. Parsons (1976) proposed such a method in which a high-resolution, short-term spectrum is used to separate competing speakers. All processing is performed in the frequency domain. Speech is generated based on that portion of the spectral content which corresponds to the primary speaker.
0 - '0
~. .t:.
7 p'l ..111 'o;'(iI.'o(1~
I ~ i'i, ':-h pc riod~
6, , ~ 1),r.I't'b rt" ' Il .)o,j ~
n iqu e that applies a time-varying weight to each pitch per iod . The ~ f)
met hod is similar to normal comb filter ing, ex cept th a t the we ights a s ~
- >oil
P" "
', ~~2
;%
~ w
' 0" - . ~ WI
' /,l::i.
signed to samples across the pitch period arc not fixed . T he ge nera lized ;.:'":.
~ .~
co mb filter has been shown to reduce fram e-rate n oise for an ad a pt ive ..0 JI)
~'"
~
~
tran sfo rm coder (Cox and Malah, 19~H). The metho d ha s als o sho wn 1 10 .= l(J
~{
':.
promise in smearing the structure of sim ulat ed pe ri o dic interference.
Malah and Cox also reported some preli m inary experim en ts fo r a com -5 '0 -<
'"
S\ R ~ '. \) \ · 1F1 1
pe ti ng speaker problem (one male and one fema le). Their ini tial fin d ings ~ ;'(K ~ lt( , ( , H!.)
ters are more effective than a cascade of two fix ed co m b filte rs. FIGURE 8.19. Intelligibility results for adaptive comb filtering for (a) a
Fi nally. an ACF techniq ue int ro d uced by F ra zier et al. ( 1976) form u competing speaker problem (Perlmutter et al., 1977), and (b) wideband
la tes a filte r th at adjusts itself both glo ba lly a nd lo ca lly to the tim e random noise (Lim et al., 1978).
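The harmonic selection idea can be illustrated with a toy frequency-domain sketch. This is not Parsons's algorithm; it simply zeroes all DFT bins except those near harmonics of an externally supplied pitch estimate for the primary speaker. The bin-spaced pitches, the `width` parameter, and the two-tone "speakers" are illustrative assumptions:

```python
import numpy as np

def harmonic_select(y, f0_bin, width=2):
    """Keep only DFT bins within `width` bins of each harmonic of the
    primary speaker's fundamental (f0 given in DFT bins)."""
    Y = np.fft.rfft(y)
    keep = np.zeros(Y.size, dtype=bool)
    h = f0_bin
    while h < Y.size:
        keep[max(h - width, 0):h + width + 1] = True
        h += f0_bin
    return np.fft.irfft(np.where(keep, Y, 0.0), n=y.size)

N = 4096
n = np.arange(N)
# Primary speaker: fundamental at DFT bin 32, with one harmonic.
primary = np.sin(2 * np.pi * 32 * n / N) + 0.5 * np.sin(2 * np.pi * 64 * n / N)
# Competing speaker: fundamental at bin 50, away from the primary's harmonics.
competing = 0.8 * np.sin(2 * np.pi * 50 * n / N) + 0.4 * np.sin(2 * np.pi * 100 * n / N)
y = primary + competing

est = harmonic_select(y, f0_bin=32)
print(np.mean((y - primary) ** 2), np.mean((est - primary) ** 2))
```

Because the competing speaker's harmonics here fall between those of the primary speaker, retaining only the primary's harmonic bins removes almost all of the interference; in practice the two pitch contours must first be estimated and may occasionally collide.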
between these teeth, while block interpolation increases the gap space. The choice of the time domain window will determine the shape of the resulting "bandpass" filter around each pitch tooth. Figure 8.21(b) and (c) illustrates the frequency domain consequences of decimation and interpolation.

FIGURE 8.21. Frequency interpretation of decimation and interpolation operations required by TDHS. (c) Interpolated speech.

Time domain harmonic scaling was originally proposed by Malah (1979) for use in perceptually reducing periodic noise in speech. In a later study, Cox and Malah (1981) proposed a hybrid system that uses both ACF and TDHS. An additional benefit of their system is time-scale reduction of input speech for waveform coding and isolated word recognition. Due to its time domain implementation, the choice of an appropriate window greatly influences noise cancelation performance. The hybrid system consists of an expansion of the speech spectrum using TDHS, followed by a contraction also using TDHS. This expansion-contraction method does not reduce interference to the same extent as simple TDHS; however, the resulting speech signal is less distorted. In effect, the expansion-compression operation works like a time-varying comb filter.
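The frequency-domain interpretation of decimation referred to in Figure 8.21 can be checked numerically: discarding every other sample expands the spectrum, doubling a tone's normalized frequency. A minimal sketch, in which the tone frequency and signal length are arbitrary choices:

```python
import numpy as np

# A sinusoid at normalized frequency f0 (cycles/sample).
f0, N = 0.05, 512
n = np.arange(N)
x = np.sin(2 * np.pi * f0 * n)

# Decimate by 2; no anti-alias filter is needed here since 2*f0 < 0.5.
y = x[::2]

def peak_freq(sig):
    """Normalized frequency of the largest spectral peak."""
    spec = np.abs(np.fft.rfft(sig))
    return np.argmax(spec) / len(sig)

print(peak_freq(x), peak_freq(y))  # roughly 0.05 and 0.10
```

Interpolation performs the inverse operation, compressing the spectrum; TDHS exploits this pair of mappings pitch-synchronously rather than on the raw waveform as done here.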
focus on enhancing voiced speech. Therefore, they generally perform poorly for unvoiced speech sections. It is generally known that consonants possess a disproportionate amount of linguistic information when compared with vowels. Since vowels (voiced speech) usually possess larger amounts of energy, broadband noise degradation tends to mask unvoiced sections more than voiced, thus causing decreased intelligibility. Employing a technique that attempts to improve quality in voiced sections may in fact decrease overall speech quality and/or intelligibility. For this reason, these methods are not normally used to attenuate broadband additive noise. Instead, their main area of successful application has been in reducing the effects of a competing speaker, where distinct fundamental frequency contours can be identified (e.g., competing male and female speakers).

8.7 Performance Evaluation

8.7.1 Introduction

A unified performance evaluation of the four areas of speech enhancement considered in this chapter would be a difficult task because of the differences in the assumptions and applications of the varying enhancement algorithms. An algorithm that reduces broadband additive noise may not be appropriate for the competing speaker problem. Generally speaking, a comparative evaluation is only valid if the same test conditions are maintained (same input speech, noise, and quality/intelligibility evaluation methods). At the outset of this chapter, we identified three possible goals for speech enhancement: (1) improving quality, (2) improving intelligibility, and (3) reducing listener fatigue. Since these improvement criteria are based on aspects of human perception, it is necessary to understand how distortion is perceived in noisy speech applications. A careful system evaluation will therefore require the use of either subjective or objective measures of speech quality. In this section, we summarize some performance evaluations of speech enhancement algorithms. In Section 8.7.2 we discuss aspects of perceptual speech enhancement based on various types of highpass, bandpass, or lowpass filtering and/or clipping of the speech waveform. These methods represent operations that greatly influence the intelligibility of speech. In Section 8.7.3, we discuss performance for each of the four classes of enhancement.

8.7.2 Enhancement and Perceptual Aspects of Speech

It is well known that accurate models of speech production in noise can lead to improved speech quality. However, in addition to production aspects, there are also auditory perception aspects that can be exploited in speech enhancement. These are not as well understood as production aspects; however, there are a number of commonly accepted features that contribute to the success of enhancement systems. It is known, for example, that consonants play an important role in speech intelligibility even though they represent a small fraction of the overall signal energy. It is also known that the short-term spectrum is important for speech perception. Formant location is the most important feature, with bandwidth, amplitude, or spectral slope as secondary features. In addition, the second formant is more important perceptually than the first formant (Thomas, 1968; Agrawal and Lin, 1975).

Licklider and Pollack (1948), Thomas (1968), Thomas and Niederjohn (1970, 1972), Thomas and Ohley (1972), and Niederjohn and Grotelueschen (1970a, 1970b, 1970c, 1976) examined the effects of frequency distortion (highpass, bandpass, and lowpass filtering) and amplitude distortion based on infinite-peak clipping on the intelligibility of speech prior to noise degradation. These methods are applicable where the noise-free speech is available for processing, prior to the introduction of noise (e.g., this might occur in voice communication systems). Their results show that removing the first formant (highpass filtering) followed by infinite-peak clipping can increase intelligibility, but with devastating effects on quality (an example of their intelligibility performance is shown in Fig. 8.22). This procedure increases the energy of high-frequency components (most notably the consonants) while decreasing the low-energy frequency components (most notably the vowels). Since this increases the relative SNR, the final noise degradation does not affect the consonants as much as it would without processing.

FIGURE 8.22. Highpass filtering and amplitude modification for intelligibility enhancement. Noise level is 90 dB SPL (sound pressure level). Results are shown for (a) normal speech in noise, (b) highpass filtered and clipped speech in noise, (c) highpass filtered speech in noise, and (d) highpass filtered speech with amplitude compression in noise. Adapted from Niederjohn and Grotelueschen (1976).
8.7 / Performance Evaluation 555
554 Ch. 8 / Speech Enhancement
These experiments show that the first formant is more important in terms of perceived quality and that higher frequencies are more important for intelligibility. It has also been shown that a mild level of highpass filtering alone can increase the crispness of speech by limiting the first formant, without significant loss in speech intelligibility.

As we have seen, speech quality depends upon a good representation of the short-term spectral magnitude, whereas the phase is relatively unimportant. Auditory processing of the spectral magnitude of speech suggests a phenomenon that can be used to suppress narrowband interference. Auditory masking is a perceptual characteristic that allows the human auditory system to suppress background noise. The auditory system can sometimes mask one sound with another. For example, consider a narrowband noise source such as a varying sine wave or an artificial noise component. Such distortions can be more annoying, since they fatigue the auditory system faster than broadband noise. An enhancement system can therefore introduce broadband noise in an effort to mask narrowband or artificial noise.

In addition to such methods as lowpass or highpass filtering, one might consider applying a Wiener filter based on the long-term power spectrum of speech. Such processing is optimal in the MSE sense if the signal is stationary and the additive noise is white Gaussian (it is the optimal linear filter if the noise is nonwhite). However, such a method is limited, since speech is not truly stationary. Even if the signal were truly stationary, the criterion upon which the Wiener filter is based is not a particularly effective error criterion for speech enhancement. To illustrate, it is possible to modify a speech signal in ways that increase the MSE while improving the perceived quality of the speech. A second example can be found in modification of the phase characteristics of speech. If speech is filtered with an all-pass filter (with some phase-distorting characteristics), no audible difference is perceived, but a substantial MSE can result. Although the MSE is sensitive to the phase of the speech spectrum, only the spectral magnitude is important for perception.

A study by Hansen et al. (1987, 1991) considered an evaluation of the following three enhancement methods: (1) noncausal (unconstrained) Wiener filtering (Lim et al., 1978), (2) spectral subtraction with magnitude averaging (Boll, 1979), and (3) two inter- and intraframe spectral-constrained Wiener filtering methods. Hansen et al. showed that unconstrained Wiener filtering based on an LP all-pole model tends to produce speech with overly narrow bandwidths. However, an alternative iterative speech enhancement scheme that employs spectral constraints placed on redundancies in the human speech production process, as represented by the LSP parameters, has been shown to overcome this limitation. Figure 8.23 compares quality improvement for each of the three techniques for AWGN. Quality measures for a theoretical limit were obtained by substituting the noise-free LP coefficients into the unconstrained Wiener filter, thereby requiring only one additional iteration to obtain the estimated speech signal. These results show that good quality improvement can be achieved with all three methods. Figure 8.24 shows time versus frequency plots of processed speech from unconstrained and constrained Wiener filtering. A later study (Hansen et al., 1988) compared the performance in colored aircraft cockpit noise. Improvement in speech quality was also demonstrated. Traditional (noncausal) Wiener filtering outperforms spectral subtraction with magnitude averaging for this type of distortion.

FIGURE 8.23. Comparison of enhancement algorithms over SNR (vertical axis: Itakura-Saito likelihood measure). (a) Original distorted speech. (b) Boll: spectral subtraction using magnitude averaging. (c) Lim-Oppenheim: unconstrained Wiener filtering. (d) Hansen-Clements: employing interframe constraints (FF-LSP:T). (e) Hansen-Clements: employing inter- and intraframe constraints (FF-LSP:T,Auto:I). (f) Theoretical limit: using undistorted LPC coefficients.
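The all-pass observation above (and Problem 8.1 below) is easy to verify numerically: a unit-magnitude, phase-only modification leaves the magnitude spectrum untouched while producing a large waveform MSE. A minimal sketch, with an arbitrary quadratic phase and a white-noise stand-in for speech:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 2048)

# All-pass "filter" applied in the DFT domain: unit magnitude at every
# bin, quadratic phase (a purely phase-distorting operation).
X = np.fft.rfft(x)
k = np.arange(X.size)
phase = 2 * np.pi * 8 * (k / X.size) ** 2
phase[0] = 0.0    # keep the DC bin real
phase[-1] = 0.0   # keep the Nyquist bin real
y = np.fft.irfft(X * np.exp(-1j * phase), n=x.size)

mse = np.mean((x - y) ** 2)
mag_err = np.max(np.abs(np.abs(np.fft.rfft(y)) - np.abs(X)))
print(mse, mag_err)  # substantial MSE, yet magnitude spectra agree
```

The MSE here is on the order of the signal power even though the two signals have identical magnitude spectra, which is precisely why waveform MSE correlates poorly with perceived quality.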
FIGURE 8.24. Time versus frequency plots of the sentence "Cats and dogs each hate the other." The original and distorted original (additive white Gaussian noise, SNR = +5 dB) are shown above. The lower left-hand plot is the response after three iterations of the unconstrained noncausal Wiener filtering approach. The lower right-hand plot is the frequency response after six iterations of an inter- plus intraframe constrained approach.

Although the constrained iterative enhancement algorithms produced speech of higher quality, computational requirements increased over the unconstrained approach.

8.8 Conclusions

In this chapter, we have considered a variety of approaches for speech enhancement and have incorporated them into a common framework. Due to the large number of applications, assumptions concerning interference, and available input channels, an almost unlimited number of enhancement systems could have been considered.

In conclusion, it should be emphasized that many enhancement systems improve the ratio of speech to noise, and therefore improve quality. This might be all that is important in certain applications in which the context of the material is known to the listener so that intelligibility is not of concern. However, the majority of speech enhancement algorithms actually reduce intelligibility, and those that do not generally degrade the quality. This balance between quality and intelligibility suggests that considerable work remains to be done in speech enhancement.

Further, as we have seen here, several enhancement algorithms are designed to improve a mathematical criterion. Although attractive in a mathematical sense, most error criteria are not well correlated with auditory perception. Consequently, the use of both subjective and objective quality measures is necessary to meaningfully compare enhancement algorithm performance. Also, the evaluation of an enhancement algorithm will depend on its ultimate application.

As these concluding comments suggest, speech enhancement continues to be an important research area for improving communications between speaker and listener, speaker and vocoder, or speaker and speech recognizer. In particular, the task of evaluating the effectiveness of an enhancement technique continues to be an important area of investigation. We now turn to this issue in Chapter 9.

8.9 Problems

8.1. (Computer Assignment) In this problem, we will investigate the appropriateness of the MSE as a criterion for speech enhancement.
(a) Design a filter which is all-pass with a nonzero phase response. Filter a speech signal with this filter. Play both the original and filtered signals. Can you hear any perceptual differences? If so, list them.
(b) Write a small routine that finds the MSE between two input signals. Find the average MSE between the original and an all-pass filtered speech signal. What can you say about the magnitude of the MSE with respect to any perceptual differences you were able to hear?
(c) Can you suggest an alternate error criterion that might be more appropriate?

8.2. In the discussion of speech enhancement, we addressed methods in the following four areas: (i) short-term spectral amplitude techniques, (ii) speech modeling and Wiener filtering, (iii) adaptive noise canceling, and (iv) fundamental frequency tracking systems. Consider a single algorithm from each area and discuss one advantage or disadvantage of each method. Note possible speaker and/or noise environments in which a technique may be more useful.

8.3. Suppose that we consider the following speech enhancement scenario of hands-off cellular phone dialing in a running automobile (i.e., a speech recognizer is used to automatically dial a desired phone number).
(a) What assumptions would you make on the speech and noise signals in order to formulate a speech enhancement conditioner prior to recognition? What are the design considerations (in terms of mathematical criteria, perceptual criteria, and performance improvement) involved in such an enhancement application? Discuss any trade-offs, restrictions, or limitations in your proposed solution.
(b) Now, suppose that the cellular phone system is based on LP analysis. We wish to extend the speech enhancement conditioner to reduce automobile noise prior to LP encoding within the cellular phone (i.e., the output of the enhancement system will be used for automatic recognition when dialing, and for voice communications when phone communication is established). What additional design considerations may influence your proposed solution in part (a)?

8.4. In the discussion of spectral subtraction, we identified a problem of resulting "musical tones" that occurs during subtraction.
(a) Discuss why such residual noise persists after spectral subtraction processing and why magnitude averaging can reduce such effects.
(b) Suggest an alternate means of characterizing the background noise interference that might further reduce these tones.

8.5. Consider a speech signal s(n) that is corrupted by an additive sinusoid of the following form:

y(n) = s(n) + g A sin(2πf₀t),    (8.62)

where g is a fixed term that is adjusted to ensure a desired overall average SNR, and f₀ a fixed sinusoidal frequency.
(a) Assume that s(n) is a steady-state vowel (i.e., deterministic over the time interval of interest) and A a uniformly distributed random variable in the range [0, 1]. Can a speech enhancement solution be formulated based on spectral subtraction [i.e., a = 1 in (8.18)]? If so, find it. If not, why?
(b) Suppose that a random phase is added to the sinusoid as follows:

y(n) = s(n) + g A sin(2πf₀t + φ).    (8.63)

How would your solution to part (a) change?
(c) Now, let the speech signal s(n) be corrupted by a nonlinear amplitude term as

y(n) = sgm[s(n)] · s(n),    (8.64)

where

sgm[w] = { −1, if w > 0
         {  1, if w < 0.    (8.65)

What effect does this have on the degraded speech signal y(n)? Does a spectral subtraction solution exist? Why or why not?

8.6. One extension to spectral subtraction for reduction of musical tone artifacts is magnitude averaging as discussed in (Boll, 1978). Consider each class of speech (i.e., vowels, stops, etc.) and discuss what trade-offs exist for the frame duration choice for magnitude averaging. What speech classes would benefit the most from magnitude averaging, and which would be most seriously affected?

8.7. Given the most general form of spectral subtraction [see (8.25)], explain how you would adjust a frequency-dependent weighting factor, k(ω), in order to obtain improved enhancement for the following speech classes: vowels, nasals, stops, fricatives, silence, or pauses. In selecting k(ω), consider the broad spectral characteristics of each speech class.

8.8. Consider a signal s(n) that is corrupted by an additive white noise component w(n) as

y(n) = s(n) + g w(n),    (8.66)

where g is a fixed term that is adjusted to ensure a desired overall average SNR.
(a) Assume that s(n) is a steady-state sinusoid. Derive the optimum noncausal Wiener filter solution for s(n). Can a direct form solution be found, or is there an iterative solution?
(b) Assume s(n) to be a steady-state vowel. How would your answer to part (a) change?
(c) Consider the speech signal s(n) to contain discrete samples of the word "stay." Discuss how the Wiener filter will change as each phoneme is processed.
(d) Suppose that the exponent a in (8.32) is varied. How would the resulting Wiener filter's characteristics change for the word "stay" with a = 1 and a = 2?

8.9. One method of noncausal Wiener filtering is based on an all-pole model for the speech signal. Assume that speech is degraded by additive white Gaussian noise.
(a) From our discussion of the various LP methods from Chapter 5, discuss the trade-offs in enhancement using the autocorrelation, covariance, and lattice filter methods of LP analysis. Will stability be an issue?
(b) Suppose that an ARMA (pole-zero) model is used for speech in the Wiener filter. Discuss the sensitivity of the filter to errors in speech model zero location estimation. How does this compare to errors in speech model pole location estimation?

8.10. Consider a simple two-channel processor, consisting of a primary microphone and a reference microphone that are connected. The reference microphone output, y_ref(n), is weighted by a value w and then subtracted from the output of the primary microphone, y_pri(n). Show that the mean-square value of the output of this two-channel processor is minimized when the filter weight w achieves the optimum value

w_opt = E[y_pri(n) y*_ref(n)] / E[|y_ref(n)|²].    (8.67)
560 Ch. 8 J Speech Enhancement a.A J The INTEL System 561
8.11. Start with the formula for the estimation error for LMS adaptive 8.18. (Computer Assignment) You are given an isolated-word uttered in
noise cancellation : noise-free conditions. Add white Gaussian noise to this word at an SNR
e(n) = yl(n) - hT Y2(n) , (8 .68) of 5.0 dB.
(a) First, assume that the speech is a stationary process across the
where y\(n) == sen) + den) and sen) is the desired speech signal, Ii the tap entire word. Construct a stationary Wiener filter for the entire
weight filter vector, and Y2(n) the reference channel input vector at time n word and obtain an estimate of the n oise-free signal. Next , per
(assume no cross-talk). Find a relation for the gradient of the instantane form the same process for a steady-state sound (e.g., vowel) ex
I
ous squared error e(nW in terms of sen), Ii, and Y2(n). tracted from the given word. Compare the performance in the
8.12. Consider an ANC scenario as described by (8.40). An estimate of two cases in terms of output SNR and resulting MSE, and com
the primary channel noise, dt(n) , is given by (8.41). Show that minimiz ment on the differences .
ing the MSE between ~I and 41
is equivalent to minimizing the signal es (b) Partition the input word utterance into phonemes. Construct a
st ationary Wiener filter for each phonene and filter each sepa
timate ~ = ~ - ~(' in a mean-square sense.
rately. Join the enhanced phonemes and compare its speech
8.13. Beginning with the optimization criterion of adaptive noise cancel quality (waveform characteristics, overall SNR, and spectral
lation from (8.43 ), derive the minimum MSE solution of (8.45). What as representation) with the fixed Wiener filter from the first part
sumptions must be made to arrive at this relation? of part (a).
8.14. Derive a recursive (or lattice) [east squares solution to the ANC 8. 19. (Computer Assignment) Obtain two degraded speech files by cor
problem defined in (8.40-8.43). Discuss its advantage or disadvantages rupting an isolated-word with white Gaussian noise then white Gaussian
with respect to an LMS solution. noise plus a slowly varying sinusoid. Construct a program for time
8.15 . Given the MSE criterion for the ANC scena rio as in Problem 8. [2 , domain ANC using LMS. Filter each degraded utterance using correlated
discuss and illustrate the effect of cross-talk (i.e., primary speech compo noise reference. Discuss the enhancement performance for each noise
nent) in th e reference channel on the signal estimate, s.
Suggest methods condition. Can ANC be used to attenuate (i) a slowly varying sinusoid,
you feel might reduce this effect. (ii) white Gaussian noise, or (iii) slowly varying colored noise?
It is known that speakers A and B possess vocal-tract lengths of 12 em INTEL,l° along with other forms of spectral subtraction, fall under the
and 17 em , respectively. Suggest a frequency domain scheme to separate general category of noise suppression prefilters (no assumed a priori
the two speakers . What are the trade-offs and assumptions in volved in knowledge of noi se statistics). In this approach a spectral decomposition
such a strategy. of a frame of noisy speech is performed and a particular spectral line is
attenuated depending on how much the measured speech plus noise
8.17. (Computer Assignment) We consider the effects of noise on an iso power exceeds an estimate of the background noise power. One approach
lated word utterance. Corrupt a word uttered by a male speaker using in particular, developed by Weiss and Aschkenasy (1974, 1983) imple
three additive noise sources: white Gaussian, a sinusoid, and another iso ments a real-time audio processor using the INTEL system and a tone
lated word uttered by a female. Adjust the gain of the noise sources so component suppression filter.
that the SNR is 5 dB in each case. Weiss and Aschkenasy (1974) originally developed INTEL as a gener
(a) Listen to each corrupted file and comment on perceived distor alization of spectral subtraction. Figure 8.2(a) and (b) illustrates the dif
tion. Noting that the SNR is the same in each case, what can be ferences between INTEL and Boll's spectral subtraction . The former
said about noise effects on vowels , consonants , and upon gen involves raising the noisy speech magnitude spectrum to a power a. Later
eral in telligibility? work (1983) resulted in a real-time filter that removes interference from
(b) Assuming noise stationarity. find an average power spectrum for
each noise source and perform spectral subtraction as illus
trated in Fig. 8.2(a). Comment on the results.
'%is section is included for historical purposes.
562 Ch. 8 I Speech Enhancement 8.A I Th e INTEL System 563
recei ved or recorded data, t ermed th e com puteriz ed audio processor. Th e T he pr imary use of INTEL is t o attenuate additive wide ba nd ran dom
system is com pr ise d of t wo processin g sec tions. The first process ing op er noi se. The input signal is transformed to a cepstra l domain . An estima te
at ion, called digital spectra! shaping (055), det ect s an d attenuates impul of th e noise cepstrum is su btrac te d; the result in g cepstra l dat a are used
sive and tonal noi se. T he second is INT EL, which is used to attenuat e to reform th e e nhanced speech. Figure 8.26 illust ra tes the INTEL pro
add itive wid eb an d ra ndom noise . ced ure . (Not e th at th e power term a has been se t t o -t
in th is im
For DSS to be effect ive at tone suppress io n, three steps mu st b e ac plementation. )
complished . First, th e to ne mu st be det ect ed accurate ly. Next, the syst em F igu re 8. 26 shows that the difference bet ween INT EL and spe ctral
must remove the maximum amount of to ne ene rgy once it has been de subtract ion is the added transform pair, which results in the subtraction
t ected, whil e removi ng a mi nimum a mo un t of speech energy. The last operat ion being carried out in the cepstrum domain . Improvement over
ste p requires that the rege nerate d speec h be maxi mally free of di sconti to ne su bt ract ion results from differences in ce pst ra l characterist ics be
nu ities a nd di stortion . T he det ecti on p roc ess ex ploits differences between tween th e speech and random noi se. Above a quefre ncy of 0.5 rnsec, the
speech a nd nois e. Tone noi se is more stat io na ry in both frequency and n oise e nergy falls off qu ickly while sp eech ene rgy is still present at the
a mp litude compared with th e q uas i-sta t ionary speech. In the magnitude pitch peri od and its harmonics. Ther efore, if a noi se-onl y cepstrum can
spe ct ru m of tone no ise, peaks re sult at t he frequen cies of the tones. In be fou nd , su btract ing it from th e speech plus no ise cepstru m greatly re
contrast , the speech spectr um is smoo th ove r the ent ire frequency ba nd du ces th e bro adband random n oise. H ow to compu te th e n oise-only
with smooth peaks a t the fo r mant frequen cies and finite nonzero
bandwidth s. To minimize th e speech vers us no ise overlap in the fre
Inp ut speech
quency domain , ton e e ne rgy shou ld be co nce nt ra te d into as narrow a
spectrum as possibl e. Fi gure 8.25 illustrates thi s processing section .
To achieve minimal ove rla p between the speech an d tone spectra, an +
Segment
appropriate weighting fun cti on on the ti me se ries must be chosen. Choice
of analysis fram e length m ust also be co ns ide re d in the segm entation + I t
portion. The INTEL system uses a fra me len gth of 200 msec, with a
Bartlett window overla pped by 50%. T his a pp roach for tone remo val
works well wh en ton e freq uencies a re differe nt fro m p ea ks in the sp eech
spectrum . The grea te r th e differ ence, th e more successful the procedure
becomes . If, ho wever, the t on e com pone nt is random, the success of this
approach is limited. Thi s m oti vat ed th e formul ation of the second pro
t
Phase
spectrum
sr t
Amp litude
spect rum
TW J x No ise cepstrum
- W 2 x P resent average cep ,trum
cessing sect ion. , t 1
New cepsuum
Perform ( . )2
Input speech
t
FFT
+
Se gment t
Subtract c
fFf
t
lFFT
+ t
Decide tonelspecc h sect ions
~ (o)'
+
Sub tra ct maximum amoun t 01" tone energy
Comb ine
+
IfFI'
t
LFFT
Reform speech
t
Output speech
FIGURE 8.25. Tone remov al using a computerized audio processor. FIGURE 8.26. INTEL procedure for remov lnq add itive wideband random noise.
564 en. 8 I Speech Enha ncem ent 8.8 I Addressing Cro ss- Talk in Dual -Ch anne l ANC 565
cepstrum must therefore be addressed. To accomplish this, a "lossy" moving average of the noise cepstrum is formed, and the noise cepstrum is then able to follow changes in the noise distribution. The two weights for the present and update noise cepstra must be chosen during processing. Once the algorithm is able to track the noise cepstrum, the choice of scale factor for subtraction between noisy speech and noise-only cepstra must be made. Weiss and Aschkenasy concluded that three scale factors were adequate for processing, with the process being carried out as shown in Fig. 8.27. Results indicate that the choice of the K scalars is dependent on SNR, but somewhat independent of the particular noise distribution. Their system used two sets of scale factors: one for use above 6 dB, the second for use below 0 dB. Between 0 and 6 dB, both sets produced similar results.

[FIGURE 8.27. Cepstral subtraction: (speech-plus-noise cepstrum) - K x (average noise cepstrum) = estimated speech-only cepstrum.]

The INTEL system is useful for reducing wideband random noise, but many disadvantages exist. Since the average noise cepstrum is built up over time, any long section of silence (1 sec or more) drives the average noise cepstrum to zero. When the signal reappears, a loud noise burst results until the noise cepstrum can be reestablished. Update weights for the noise cepstrum must also be selected. This choice depends on the speed with which the system tracks changes in the noise characteristics. Scale factors for subtraction of the cepstra must also be chosen in some optimal fashion. Although Weiss and Aschkenasy found these to be somewhat independent of noise distribution for the cases analyzed, this may not be true for all types of noise. Although no quantitative results were presented in this investigation, it was suggested that some improvement in intelligibility is possible. This depends on how accurately the noise cepstrum can be updated from silent frames. In most instances, any tones that are encountered will be random. In addition, although INTEL is an improvement over spectral subtraction, it still suffers from the musical tone artifacts found in spectral subtraction.

8.8 Addressing Cross-Talk in Dual-Channel ANC

This appendix presents an expanded ANC framework that incorporates dual-channel cross talk. We present the basic application ideas without concern for the details of short-term processing.

When cross talk is present in an ANC system, a path must exist in the block diagram for speech to enter the reference channel. The speech component in the reference channel could simply be a greatly attenuated speech component, or some filtered version due to propagation, differences in microphone characteristics, or an acoustic barrier between the two microphones. Figure 8.28 shows a block diagram of the expanded approach. The filters H1(ω) and H2(ω) model the frequency-dependent attenuation that the signals s(n) and d(n) experience due to the separation of the primary and reference microphones. The adaptive filter H(ω) provides an estimate d̂1(n) of the noise component d1(n) in the primary channel so that the resulting error term e(n) is an estimate of the desired speech s1(n).

[FIGURE 8.28. Block diagram of the dual-channel ANC configuration with cross talk; the error e(n) forms the enhanced signal.]

The output from the primary channel can be written as

    y1(n) = s1(n) + d1(n) = s1(n) + [d2(n) * h2(n)],    (8.70)

where d2(n) * h2(n) is the degrading noise component resulting from the convolution of the input noise source and the impulse response of the noise-shaping filter. Similarly, the reference channel signal can be written as

    y2(n) = s2(n) + d2(n) = [s1(n) * h1(n)] + d2(n),    (8.71)

where h1(n) is the impulse response of the speech-shaping filter. Under typical conditions, the SNR of the primary channel, SNR1, is much higher than that for the reference (SNR1 >> SNR2). The output of the ANC can be written, assuming an adaptive filter impulse response h(n), as

    ŝ1(n) = y1(n) - d̂1(n)
          = s1(n) + [d2(n) * h2(n)] - [y2(n) * h(n)].    (8.72)

After applying a DTFT and substituting the relation for Y2(ω), (8.72) becomes

    Ŝ(ω) = S1(ω) + [D2(ω)H2(ω)] - [Y2(ω)H(ω)]
         = S1(ω) + D2(ω)H2(ω) - [D2(ω)H(ω) + S1(ω)H1(ω)H(ω)].    (8.73)

If we assume that the adaptive filter can adequately approximate the noise-shaping filter [i.e., H(ω) = H2(ω)] to the extent that the spectral magnitude of D2(ω)H2(ω) - D2(ω)H(ω) is small when compared with S1(ω) - S1(ω)H1(ω)H(ω), then the spectral output of the ANC employing a cross-talk filter is

    e(ω) = S1(ω)[1 - H1(ω)H(ω)],    (8.74)

or, assuming H(ω) = H2(ω),

    e(ω) = S1(ω)[1 - H1(ω)H2(ω)].    (8.75)

Therefore, if we could ensure the magnitude of H1(ω)H2(ω) to be much less than unity across the speech frequency band of interest, the resulting transform e(ω) would represent speech with a minimum of distortion.

Harrison, Lim, and Singer (1984, 1986) considered such a dual-channel ANC approach in a fighter cockpit environment. In their study, the condition |H1(ω)H2(ω)| << 1 was satisfied through the use of an oxygen mask, which served as an acoustic barrier between the two microphones. The primary microphone was located inside the mask and the reference microphone was located outside. The oxygen mask provided a barrier that attenuates signals by about 10 dB. This attenuation applies to both the ambient noise-shaping filter H2(ω) and the speech-shaping filter H1(ω), so that the combined attenuation is approximately 20 dB. It should be noted that these are, in fact, shaping filters, so that the speech component in the primary channel s1(n) should possess different spectral characteristics than that in the reference s2(n). Since both inputs contain speech, updating the adaptive filter coefficients during speech activity could lead to distorted speech. Instead, Harrison et al. allowed the filter coefficients to adapt only during silence or noise and held them constant during speech activity. This can easily be accomplished by setting the adaptation parameter to zero during speech. This implicitly assumes that the noise-shaping filter H2(ω) does not change significantly when speech activity occurs. By directly employing a cross-talk filter, Harrison et al. found a significant reduction in the required filter length for adequate enhancement [from 1500 for Boll and Pulsipher (1980) to 100]. An adaptation time of 120 msec was required for convergence of the LMS algorithm starting from zero initial conditions. Using the exact least squares lattice formulation (Morf and Lee, 1978), convergence from zero initial conditions was achieved in less than 20 msec. In a noise field created using a single source, an improvement of 11 dB in SNR was obtained.

The improvement obtained by Harrison et al., however, does not occur for all frequencies. In a related study, Darlington, Wheeler, and Powell (1985) attempted to address the practical question of why ANC is successful only for aircraft cockpit noise at low frequencies. The noise levels experienced within fighter cockpit environments typically exceed 90 dB SPL over the entire speech frequency band. For the British helmet and oxygen mask considered in their study, the frequency response showed a 30-dB attenuation above 2 kHz, dropping to 0 dB below 100 Hz.[11] The use of ANC in the noise transmission problem across an oxygen mask depends upon the successful identification of the linear filter relating the exterior reference pressure to the noise component detected by the interior mask microphone. Poor coherence in noise transmission across an oxygen mask contributes to the failure of ANC at high frequencies. In addition, the distributed nature of the noise sources within a cockpit environment suggests that a diffuse model could be applied successfully. Darlington et al. point out that in a fully diffuse field, the random incidence of sound at any point causes the coherence between the pressure at two points to decrease as the points become farther apart. Using a notation adopted from Piersol (1978), the coherence between two points x and y is defined as

    γ²xy = [sin(K0·W) / (K0·W)]²,    (8.76)

where K0 is the wave number and W the spacing between the points. The coherence between the reference noise and primary noise component defines the proportion of the primary noise power that is linearly related to the reference signal. Therefore, the magnitude coherence between two WSS random processes, x and y, is defined as

    γ²xy(ω) = |Γxy(ω)|² / [Γx(ω)Γy(ω)],    (8.77)

where Γxy(ω) is the cross-PDS of x and y, and Γx(ω) and Γy(ω) the auto-PDS for x and y. We will refer to the magnitude-squared coherence γ²xy(ω) as simply the coherence of x and y. It can be shown that the coherence represents the fraction of Γy(ω) that is related to x by a linear filter. For this reason, we can think of the coherence as a frequency-dependent correlation coefficient. Since ANC uses the noise reference signal to estimate the primary noise signal, a large coherence between primary and reference noise signals is necessary if ANC is to be effective.

[11] It would in fact be desirable to increase the noise attenuation characteristics for such passive barriers. It may not be possible to introduce such modifications in present mask/helmet configurations; however, future designs should certainly consider the issue of noise.
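The magnitude-squared coherence of (8.77), and the diffuse-field coherence of (8.76), can be estimated numerically. The sketch below is an illustration under stated assumptions, not production code: the cross- and auto-PDS are estimated by averaging raw periodograms over frames (no windowing or overlap), and a plain O(N²) DFT keeps it self-contained. Note that with a single frame the estimate is identically 1, so averaging over several frames is essential.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) DFT, to keep the sketch self-contained."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def magnitude_squared_coherence(x_frames, y_frames):
    """Estimate of Eq. (8.77): |Gamma_xy|^2 / (Gamma_x * Gamma_y),
    with each PDS estimated by averaging periodograms over frames."""
    X = [dft(f) for f in x_frames]
    Y = [dft(f) for f in y_frames]
    M, N = len(X), len(X[0])
    coh = []
    for k in range(N):
        gxy = sum(X[m][k] * Y[m][k].conjugate() for m in range(M)) / M
        gx = sum(abs(X[m][k]) ** 2 for m in range(M)) / M
        gy = sum(abs(Y[m][k]) ** 2 for m in range(M)) / M
        coh.append(abs(gxy) ** 2 / (gx * gy) if gx * gy > 0 else 0.0)
    return coh

def diffuse_field_coherence(k0, w):
    """Eq. (8.76): coherence between two points spaced w apart in a
    fully diffuse field, where k0 is the wave number."""
    return (math.sin(k0 * w) / (k0 * w)) ** 2
```

If y is any fixed linear filtering of x, the estimated coherence is 1 at every bin with energy; coherence below 1 indicates noise or nonlinearity in the path, which is exactly why poor coherence across the mask defeats ANC at high frequencies.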
CHAPTER 9

Speech Quality Assessment

9.1 Introduction

9.1.1 The Need for Quality Assessment

There are two speech processing areas where speech quality assessment is of primary concern: speech coding or synthesis and speech enhancement. Historically, the majority of quality testing methods have been formulated to evaluate the performance of speech coding algorithms. The need for such testing methods became apparent during the development of new analog communication systems for speech data in the 1950s. The advent of digital speech coding algorithms in the 1960s, which led to the fully digital private branch exchange (PBX) speech communication systems of the 1980s, motivated the formulation of new, more sophisticated quality assessment tools for voiceband coding systems. Such tools were essential for optimum coding-system design and effective communications network planning. Recently, many of the same tests or measures have also been successfully applied to quantify improvement for speech enhancement algorithms.

Since this is a rapidly changing field, our purpose in addressing speech quality evaluation is not to make the reader an expert, but rather to expose the reader to useful and reliable methods by which coding or enhancement algorithms can be evaluated. As an example, suppose you were to listen to an old analog record of a speech. If the record has been played many times, the audio may contain hiss, crackle, or other types of background noise. In listening to the recording, you certainly would have some impression of the signal quality. As a listener, you might choose to judge the "processed" speech signal along more than one overall quality scale. For example, on a scale of one to five, how natural sounding is the speech? Does it contain background hiss? Does it sound mechanical? Although the most meaningful judgment of such issues comes from the ultimate human listener, quantifiable means of judging speech quality must be available for testing new coding and enhancement algorithms in research and development laboratories. This is the purpose of quality assessment methods.

To establish a fair means of comparing speech coding or enhancement algorithms, a variety of quality assessment techniques has been formulated. Generally speaking, tests fall into two classes: subjective quality measures and objective quality measures. Subjective measures are based on comparisons of original and processed speech data by a listener or group of listeners, who subjectively rank the quality of speech along a predetermined scale. Objective quality measures are based on a mathematical comparison of the original and processed speech signals. Most objective quality measures quantify quality with a numerical distance measure or a model of how the auditory system interprets quality. Since the distortion introduced by speech coding systems, background noise, and enhancement algorithms varies, a collective body of quality measures and tests has emerged for different applications.

Generally speaking, there are three areas in which it is desirable to measure distortion due to noise. First, as we briefly discussed at the end of Chapter 7, we may want a quality measure for the amount of distortion introduced by a speech compression algorithm. This type of noise can be additive, but in many cases is nonlinear and signal-dependent. Such evaluations can be used to determine parameter settings or bit allocation for parameters, to improve quality, or to further reduce data rates. Second, we may wish to measure the level of distortion introduced by a noisy environment (during speech collection) or during transmission over a noisy channel. In most cases, this noise is additive, broadband, and statistically independent of the speech signal. An exception might be a burst of bit errors over a channel, which may arise, for example, in a speech communication satellite uplink because of sunspot activity. Finally, we may wish to measure the performance of an enhancement algorithm to see if the quality of the processed speech has been improved. Consistency of input speech test data as well as listener groups is essential if the performance of speech algorithms is to be properly compared. Since the noise or distortion introduced in each of the areas mentioned is quite different, the applicability of each quality test will vary. The reader should keep this in mind as we discuss each measure or test.

We begin the main body of this chapter with a discussion of subjective quality measures and illustrate their use in speech coding applications. Next, we consider several objective quality measures and discuss their use in quantifying performance for both coding and enhancement. Finally, the last section briefly discusses the interrelationship between subjective and objective measures. The literature is rich with papers devoted to quality assessment. Several tutorials have been written on both subjective and objective quality measures. One of the most complete treatments is that by Quackenbush et al. (1988). However, as the research continues to mature, measures that are even better able to quantify quality and intelligibility will become available.

Throughout this chapter, any reference to processed speech refers to resynthesized speech from a speech coding system, or enhanced speech from an enhancement algorithm. To be entirely consistent with the research literature, this chapter should focus only on quality assessment of speech coding/compression systems, since most if not all quality assessment techniques were formulated for rating such systems. However, we have taken the liberty of expanding this chapter, for several reasons. First, many of the quality tests used for coder evaluation have also been used for enhancement evaluation. Some vocoding algorithms introduce noise or distortion that is similar in spectral content to noise which enhancement algorithms seek to remove. Finally, many quality testing methods used to measure the perceived noise introduced through a speech compression/resynthesis procedure can also be used to measure noise introduced from a surrounding environment during data acquisition.

9.1.2 Quality Versus Intelligibility

In order to evaluate speech processing algorithms, it would be useful to be able to identify the similarities and differences in perceived quality and subjectively measured intelligibility. Engineers have some feeling for the "merit" of their processing systems. This feeling, which is difficult to describe, let alone measure, is referred to as perceived quality. The quality of speech addresses "how" a speaker conveys an utterance and may include such attributes as "naturalness" or speaker recognizability. In contrast, speech intelligibility is concerned with what the speaker has said: the meaning or information content behind the words. At present we do not clearly understand the interrelationship between perceived quality and intelligibility. Ordinarily, unintelligible speech would not be judged to be high quality; however, the converse need not be true. For example, a very mechanical-sounding synthetic utterance may be highly intelligible. Accordingly, intelligibility can be considered to be one of many "dimensions" of the abstract notion of quality. In the following material, we shall treat intelligibility in this way. When we juxtapose quality and intelligibility, we are really referring to all of the features that contribute to "quality" that are not necessarily required for understanding what is spoken.

The difficulty in separating the notions of quality and intelligibility is due, in part, to the difficulty in isolating and characterizing those acoustic correlates of quality or intelligibility in speech. However, extensive research has been carried out in developing both subjective and objective tests to ascertain quality and intelligibility. These tests have been used extensively in evaluating speech coding/transmission systems (Flanagan, 1972; Kayser, 1981; Kitawaki et al., 1984; McDermott et al., 1978a, 1978b; Tribolet, 1978). In addition, research on the statistical correlation between objective and subjective measures has been performed in order to formulate good objective measures of quality.

9.2 Subjective Quality Measures

Subjective measures are based on the opinion of a listener or a group of listeners of the quality of an utterance. As suggested by Hecker and Williams (1966), one means of classifying subjective quality measures is to group measures as utilitarian or analytical. Utilitarian measures employ testing procedures that are both efficient and reliable and that produce a measure of speech quality on a unidimensional scale. The main advantage is that a single number results, which can be used to directly compare speech processing systems.

In contrast, analytical methods seek to identify the underlying psychological components that determine perceived quality. These methods are oriented more toward characterizing speech perception than measuring perceived quality and typically use more than one dimension for reporting results (e.g., rough to smooth, bright to muffled). Studies in this area include those by Kruskal (1964a, 1964b), McGee (1964), McDermott (1969), and Voiers (1964). In this section, we will focus on utilitarian approaches, which can be further divided into those that test for intelligibility and those that measure other aspects of quality. As we have noted above, intelligibility can be viewed as one aspect of quality, since high-quality speech generally implies good intelligibility. However, the converse is not necessarily true. Quality tests (other than those measuring intelligibility) are usually employed to evaluate systems with high intelligibility scores, since low intelligibility is generally a good indicator of poor quality.

TABLE 9.1. Quality Measures Discussed in This Chapter.

    Test                                              Type of Test
    Modified rhyme test (MRT)                         Subjective intelligibility
    Diagnostic rhyme test (DRT)                       Subjective intelligibility
    Isometric absolute judgment (IAJ)                 Subjective quality
    Mean opinion score (MOS)                          Subjective quality
    Paired acceptability rating (PAR)                 Subjective quality
    Parametric absolute judgment (PAJ)                Subjective quality
    Quality acceptance rating test (QUART)            Overall subjective quality
    Diagnostic acceptability measure (DAM)            Overall subjective quality
    Articulation index (AI)                           Objective intelligibility
    Signal-to-noise ratio (SNR)                       Objective quality
    Segmental SNR (SNRseg) [frame-based SNR]          Objective quality
    Frequency-weighted segmental SNR (SNRfw-seg)
      [frame-based SNR with spectral weighting]       Objective quality
    Itakura log-likelihood measure                    Objective quality
    Log-area ratio measure (LAR)                      Objective quality
    Other LP-based measures                           Objective quality
    Weighted-spectral slope measure
      (or Klatt measure) (WSSM)                       Objective quality

The tests and their acronyms to be studied in this chapter appear in Table 9.1. Early subjective measures focused on speech intelligibility, one important aspect of overall speech quality. Several tests have been formulated using rhyme word lists, such as the modified rhyme test (MRT) and the diagnostic rhyme test (DRT). Here listeners are presented with rhyming words that differ only in their leading consonantal phonemes. The quality of the speech processing system is based on the
9.2 I Subjective Quality Measures 573
572 en. 9 I Speech Quality Assessment
where Ntest is the number of tests, Ncorrect is the number of correct responses, and Nincorrect is the number of incorrect responses. A typical summary of DRT results is shown in Table 9.4. A system that produces "good"-quality speech should have a DRT score in the range 85-90. As the table shows, application of the DRT will pinpoint exactly why a speech processing system fails, providing valuable insight for further algorithm design. The DRT has enjoyed more widespread use than the MRT and provides very reliable results.

9.2.2 Quality Tests

Intelligibility tests such as the DRT have been widely accepted primarily because they are well defined, accurate, and repeatable. However, they test intelligibility, which is only one facet of the multidimensional space that makes up overall speech quality. Tests that distinguish among speech systems of high intelligibility are usually called speech quality tests. One direct method of measuring speech quality is the MOS. A second, more systematic method is the DAM.

Mean Opinion Score

An opinion rating method can be used to assess the degree of quality for a speech processing system. For the case of voice telephone transmission systems, five grades of quality are distinguished. Although other quality measures evaluate a wider array of speech characteristics that comprise overall quality, the MOS is the most widely used subjective quality measure (IEEE, 1969). In this method, listeners rate the speech under test on a five-point scale where a listener's subjective impressions are assigned a numerical value (Table 9.5). A training phase is sometimes used before evaluation in order to "anchor" the group of listeners. If a training phase is not used, anchor test phrases with known MOS levels are submitted for listener evaluation. Both procedures normalize listener bias for those who always judge processed speech to be low or high in quality. A standard set of reference signals must be used if the test is to be compared with results from other test sessions. The MOS has been used extensively for evaluation of speech coding algorithms (Daumer and Cavanaugh, 1978; Daumer, 1982; Goodman, 1979; Kitawaki et al., 1984). An advantage of the MOS test is that listeners are free to assign their own meanings of "good" to the processed speech. This makes the test applicable to a wide variety of distortions. At the same time, however, this freedom offers a disadvantage in that a listener's scale of "goodness" can vary greatly (Voiers, 1976). Selection of the subjects, as well as the instructions given to the subjects, can affect opinion scores. Finally, particular attention must be used in maintaining a consistent test condition framework (i.e., the order of presentation, type of speech samples, presentation method, and listening environmental conditions).
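Computationally, the MOS is just an average over listeners, optionally preceded by a bias correction using the anchor material described above. The normalization step below is an illustrative assumption (a simple additive offset estimated from anchor phrases), not a standardized procedure.

```python
def mean_opinion_score(ratings):
    """MOS: the average of listener ratings on the five-point scale of Table 9.5."""
    return sum(ratings) / len(ratings)

def remove_listener_bias(ratings, anchor_ratings, anchor_true_mos):
    """Illustrative additive bias correction (an assumption, not a standard):
    shift a listener's ratings by the offset between their scores on anchor
    phrases and the anchors' known MOS level."""
    bias = mean_opinion_score(anchor_ratings) - anchor_true_mos
    return [r - bias for r in ratings]
```

A listener who systematically rates everything one point high on the anchors has that point subtracted everywhere, which is the intent of the anchoring procedures described in the text.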
574
Grave 100.0 0.00 100.0 0.00 0.0 0.00 100.0 0.00
Acute 100.0 0.00 99.0 1.04 1.0 1.04 99.5 0.52
Sustention 99.5 0.52 96.9 1.52 2.6 U5 98.2 0.92
Voiced 100.0 0.00 96.9 2.19 3.1 2.19 98.4 1.10
Unvoiced 99.0 1.04 96.9 2.19 2.1 1.36 97.9 1.57
Sibilation 99.0 0.68 100.0 0.00 -r.o 0.68 99.5 0.34
Voiced 99.0 1.04 100.0 0.00 -1.0 1.04 99.5 0.52
Unvoiced 99.0 1.04 100.0 0.00 - 1.0 1.04 99.5 0.52
Graveness 89.1 1.35 94.3 2.22 -5.2 3.22 91.7 0.88
Voiced 99.0 1.04 96.9 2.19 2.1 2.61 97.9 1.11
Unvoiced 79.2 2.23 91.7 4.17 -12.5 6.10 85.4 1.36
Plosive 97.9 1.36 96.9 1.52 1.0 2.4 6' 97.4 0.76
Nouplosive 80.2 2.70 91.7 4.45 -11.5 6.67 85.9 1.56
Speaker LL CH RH
VI
....
-...l
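The DRT score whose components (Ntest, Ncorrect, Nincorrect) are defined just before Table 9.4 is a chance-adjusted percentage: with two rhyming alternatives per item, subtracting wrong answers from right ones corrects for guessing. A minimal sketch, assuming the standard adjustment:

```python
def drt_score(n_correct, n_incorrect, n_tests):
    """Chance-adjusted DRT score in percent:
    100 * (N_correct - N_incorrect) / N_test.
    A listener guessing between the two alternatives scores near 0;
    'good'-quality systems score roughly 85-90."""
    return 100.0 * (n_correct - n_incorrect) / n_tests
```

Applied per attribute (grave, sustention, sibilation, and so on), this scoring yields exactly the kind of per-attribute breakdown shown in Table 9.4, which is what lets the DRT pinpoint why a system fails.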
TABLE 9.5. Mean Opinion Score Five-Point Scale.

    Rating    Speech Quality    Level of Distortion
    5         Excellent         Imperceptible
    4         Good              Just perceptible but not annoying
    3         Fair              Perceptible and slightly annoying
    2         Poor              Annoying but not objectionable
    1         Unsatisfactory    Very annoying and objectionable

versus transmission rate. The results reveal that direct comparison for MOS is difficult, even with carefully controlled experimental conditions.

FIGURE 9.1. A summary of listener opinion quality using the MOS for a fixed-predictor ADPCM coder with error-free transmission for seven countries (Britain, Canada, France, Italy, Japan, Norway, USA). The solid curve is the average MOS measure over all countries. After Goodman and Nash (1982).

Studies have shown that reference signals used as part of the evaluation help normalize the MOS so that systems tested at different times and places can be compared in a more reliable manner (Kitawaki et al., 1984; Nakatsui and Mermelstein, 1982; Goodman et al., 1976). One reference signal that is often used is speech degraded by varying amounts of multiplicative white noise, since its distortion with respect to quality is similar to adaptive waveform coder noise (Schroeder, 1968; Law and Seymour, 1962). The speech-to-speech-correlated-noise ratio, referred to as Q, is used as such a reference. In coder evaluation, reference signals with a range of Q values are evaluated by the listener group during the MOS test. A plot of MOS versus the Q of the reference signals is obtained (such as that shown in Fig. 9.2). This transforms MOS to an opinion-equivalent Q, which can be used to compare quality performance across coding systems such as the waveform coders (APC-AB, ATC, ADPCM, log-PCM) and source coders (LSP, PARCOR) in Fig. 9.3.

FIGURE 9.2. A plot of speech-to-speech-correlated-noise ratio Q in dB versus mean opinion score.

FIGURE 9.3. A summary of quality results versus transmission rate for several waveform and source coders. The opinion-equivalent speech-to-speech-correlated-noise ratio Q is shown in decibels.
Diagnostic Acceptability Measure

The DAM (Voiers, 1977) is used for evaluating medium- to high-quality speech. The DAM is unlike other subjective measures in that it incorporates a multidimensional approach. The DAM evaluates a speech signal on 16 separate scales, divided into three categories: signal quality, background quality, and total quality. The multiplicity of scales is an important feature, since it allows the listener to judge signal and background characteristics separately. This fine-grained structure of the DAM ensures that listeners are not required to compromise their subjective impressions. Most listeners agree on the presence of a distortion, but differ on preference. Since the DAM solicits separate reactions from the listener regarding perceived speech signal, background, and total quality, it tends to minimize the sampling error (bias) associated with individual listeners.

FIGURE 9.4. The DAM rating form. After Quackenbush et al. (1988). [The form instructs the listener to "make a slash at the appropriate point on each scale to indicate the degree to which this transmission sample is characterized by the indicated quality"; each scale runs from "negligible" (10) to "extreme" (100), with speech-signal scales such as MUFFLED (smothered) and FLUTTERING (pulsating), and background scales such as HISSING (simmering, fizzing), CHIRPING (cheeping, clicking), RUMBLING, ROARING (rushing, gushing), CRACKLING, and SCRATCHING (staticy).]

Since the goal of enhancement is to produce speech that is perceived by the auditory system to be natural and free of degradation, it is understandable that subjective quality measures be the preferable means of quality assessment. However, as speech compression and enhancement algorithms become increasingly complex, it becomes imperative to be able to distinguish even the most subtle differences in processed speech quality. Further, subjective measures generally serve as a means of obtaining a broad measure of performance. For example, we may wish to investigate the performance of an LPC algorithm for varying types of feedforward or feedbackward predictors. In order for subjective measures to be useful, quality differences must be large enough to be distinguishable in the listener group. If only marginal quality differences exist, it may be difficult to fix algorithm parameters for the resulting quality. Subjective testing requires significant time and personnel resources. For some classes of distortions, it may not always be exactly reproducible. We therefore turn to the class of objective speech quality measures that are reliable, easy to implement, and have been shown to be good predictors of subjective quality.

The performance criterion for an objective speech quality measure is its correlation with subjective quality estimates. To obtain the correlation coefficient, both a subjective and an objective measure must be applied to a database of processed speech. A correlation analysis is applied to determine the ability of the objective quality measure to predict quality as judged by the listeners in the subjective evaluation. Such evaluation procedures have been performed by research laboratories over extensive data
562 Ch. 9 I Speech Qua lity As s essmen t 9 .3 I Obj ect ive Quality M easures 583
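The correlation analysis just described can be sketched in a few lines of code. The paired scores below are hypothetical, and NumPy is assumed to be available; the Pearson coefficient computed here plays the role of the figure of merit reported for objective measures later in the chapter (Table 9.7).

```python
import numpy as np

# Hypothetical paired scores for one coder evaluated on 8 utterances:
# 'dam' holds subjective DAM composite-acceptability ratings, 'obj' an
# objective distortion measure (larger distortion -> lower quality).
dam = np.array([78.0, 74.0, 69.0, 66.0, 61.0, 55.0, 50.0, 44.0])
obj = np.array([0.21, 0.25, 0.33, 0.35, 0.44, 0.52, 0.58, 0.67])

# Pearson correlation coefficient between the two score sets; a strong
# negative value means the objective distortion tracks subjective quality.
rho = np.corrcoef(dam, obj)[0, 1]
print(round(rho, 3))
```

A coefficient near plus or minus 1 indicates that the objective measure closely tracks the subjective ratings; here the sign is negative because larger distortion corresponds to lower rated quality.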
These databases incorporate a variety of vocoder distortions (Barnwell and Voiers, 1979; Barnwell et al., 1984; Barnwell, 1985; Quackenbush, 1985; Voiers, 1977). A representative summary is included at the end of this chapter.

[FIGURE 9.5. Subjective quality as measured by the DAM versus presence of additive background noise, for SNRs of 0, 6, 12, 18, 24, and 30 dB; the horizontal axis shows the signal, background, and parametric composite DAM score profiles (males). After Quackenbush et al. (1988). Figure not reproduced.]

Quality assessment via an objective measure provides a quantitative, repeatable, and accurate means of comparing vocoder performance. All objective measures make a direct comparison between an original (or reference) waveform and the processed (resynthesized or enhanced) version. Since a direct comparison is made, it is necessary that the original and processed speech waveforms be synchronized. The common method of implementing an objective measure is to partition speech into frames of typically 10-30 msec in duration, then to compute a distance/distortion measure for each frame. Most measures weigh differences in spectral characteristics between original and processed data. A final measure is formed by combining the distortion measures from each frame. One or more mathematical conditions are normally used in formulating objective measures. These include positive definiteness, symmetry, and the triangle inequality (see Section 1.3.1). If all three conditions are met, the measure is called a metric. In the next sections, we discuss five classes of objective speech quality measures.

9.3.1 Articulation Index

One of the first widely accepted objective speech quality measures is the AI. The AI was originally proposed by French and Steinberg (1947) for quality assessment of analog signals. Other researchers (Flanagan, 1972; House et al., 1965; Kryter, 1962a, 1962b) subsequently developed the AI measure. Although the AI measures only one aspect of quality, intelligibility, it is quite accurate. The AI assumes that the intelligibility of a processed signal is equal to the component intelligibility losses across a set of frequency bands that span the speech spectrum. The frequency limits for each band are normally associated with the critical bands for the human auditory system. In the study by French and Steinberg, the mel scale was used (see Section 6.2.4), although others have also been proposed. Table 9.6 summarizes 20 typical frequency bands used to formulate the AI. The AI assumes that distortion in one band is independent of losses in other bands. Another underlying assumption is that the distortion present in the noisy speech results from either additive noise or signal attenuation. Other processing steps must be incorporated if the AI measure is to be used for other types of distortion (e.g., signal-dependent noise).

TABLE 9.6. Frequency Bands (in Hz) of Equal Contribution to the Articulation Index.

Number   Limits       Mean      Number   Limits       Mean
1        200-330       270      11       1600-1830    1740
2        330-430       380      12       1830-2020    1920
3        430-560       490      13       2020-2240    2130
4        560-700       630      14       2240-2500    2370
5        700-840       770      15       2500-2820    2660
6        840-1000      920      16       2820-3200    3000
7        1000-1150    1070      17       3200-3650    3400
8        1150-1310    1230      18       3650-4250    3950
9        1310-1480    1400      19       4250-5050    4650
10       1480-1660    1570      20       5050-6100    5600

The articulation within a frequency band is defined as that fraction of the original speech energy perceivable by the listener. Speech is deemed perceivable if it is above the ear's threshold of hearing and below the threshold of pain (see references in Appendix 1.F). If the dynamic range of the processed speech falls entirely within this band in the absence of noise, the AI would have a measure of 1.0. In practice, some residual noise is normally present in the enhanced signal; therefore, the noise spectrum is measured and partitioned into each of the bands and compared to processed signal energy. The final AI is equal to that fraction of the dynamic range of the signal which is below the threshold of pain, above the threshold of hearing, and above the background noise-masking spectrum. Therefore, the bandwidth of each of the filters in Table 9.6 is such that each contributes equally to speech intelligibility. One way to measure AI is to compute the SNR for each band j, for j = 1, ..., 20, and average the measures. SNR measures for each frequency band must be
limited by the threshold of pain and hearing (in our relation, we limit the SNR to 30 dB). The measure is formulated as

    AI = (1/20) Σ_{j=1}^{20} min(SNR_j, 30)/30.    (9.2)

This relation is a "long-term" measure in the sense that each SNR_j is computed over the duration of the entire waveform.

It should be emphasized that the AI is a good predictor of intelligibility for the analog communication systems for which it was originally designed. In many digital coding or speech enhancement applications, noise characteristics become signal-dependent and thereby violate the underlying AI assumptions. For example, speech enhancement algorithms such as spectral subtraction with half-wave rectification, or nonlinear coders, produce noise artifacts that cannot be modeled as additive, or independent across frequency bands. Therefore, care must be taken in using AI measures for speech evaluation. Once a number has been obtained for the AI, it is necessary to relate it to intelligibility. Articulation tests are subject to considerable variability and their results depend strongly on the testing technique, data, and procedure. Usually, it is more relevant to consider differences in intelligibility scores between systems under similar test conditions. As an example, Figure 9.6 illustrates empirical relations between the intelligibility score and the AI for several test conditions.
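A minimal sketch of the AI computation in (9.2), assuming NumPy and per-band SNRs that have already been measured. Clipping negative band SNRs to 0 dB is an added assumption not stated in (9.2), so that a band buried in noise contributes nothing rather than a negative amount.

```python
import numpy as np

# Band edges (Hz) from Table 9.6: 20 bands of equal contribution to the AI.
BAND_EDGES_HZ = [200, 330, 430, 560, 700, 840, 1000, 1150, 1310, 1480, 1660,
                 1830, 2020, 2240, 2500, 2820, 3200, 3650, 4250, 5050, 6100]

def articulation_index(band_snr_db):
    """AI per (9.2): clip each of the 20 band SNRs to at most 30 dB,
    normalize by 30, and average. The lower clip at 0 dB is an added
    assumption (see lead-in)."""
    snr = np.clip(np.asarray(band_snr_db, dtype=float), 0.0, 30.0)
    return float(np.mean(snr / 30.0))

# A signal at least 30 dB above the noise in every band yields AI = 1.0.
print(articulation_index([35.0] * 20))   # -> 1.0
```

The band edges are carried along only for reference; in a full implementation they would drive the band-limited SNR measurements themselves.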
    E_e = Σ_{n=-∞}^{∞} e²(n) = Σ_{n=-∞}^{∞} [s(n) - ŝ(n)]².    (9.4)

Such a measure requires the noise-free signal as a reference in order to determine improvement. Therefore, such measures are used primarily in simulation, where both degraded and noise-free speech signals are available.
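As a sketch of these waveform-matching measures, the long-term SNR built from the error energy of (9.4), and the frame-averaged segmental SNR of (9.7) discussed below, might be computed as follows (NumPy assumed; the frame length is a free parameter):

```python
import numpy as np

def classical_snr_db(s, s_hat):
    # Long-term SNR in dB: total signal energy over total error energy,
    # with the error e(n) = s(n) - s_hat(n) as in (9.4).
    s = np.asarray(s, dtype=float)
    e = s - np.asarray(s_hat, dtype=float)
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(e ** 2))

def segmental_snr_db(s, s_hat, frame_len=256):
    # Segmental SNR: average the per-frame SNRs (in dB) over all full
    # frames, so quiet frames count as much as loud voiced frames.
    s = np.asarray(s, dtype=float)
    e = s - np.asarray(s_hat, dtype=float)
    frames = len(s) // frame_len
    per_frame = [
        10.0 * np.log10(np.sum(s[j * frame_len:(j + 1) * frame_len] ** 2)
                        / np.sum(e[j * frame_len:(j + 1) * frame_len] ** 2))
        for j in range(frames)
    ]
    return float(np.mean(per_frame))

# Toy check: a 100-Hz sinusoid at 8 kHz corrupted by a small constant offset.
s = np.sin(2 * np.pi * 100 * np.arange(2048) / 8000.0)
s_hat = s + 0.01
print(classical_snr_db(s, s_hat), segmental_snr_db(s, s_hat))
```

For this benign, stationary distortion the two measures nearly agree; they diverge when the error is concentrated in low-energy segments of the utterance.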
The principal benefit of the SNR quality measure is its mathematical simplicity. The measure represents an average error over time and frequency for a processed signal. It has been well documented, however, that classical SNR is a poor estimator of speech quality for a broad range of distortions, because the long-term average is dominated by high-energy voiced segments, while noise has a greater perceptual effect in low-energy segments (e.g., unvoiced fricatives). A much-improved quality measure can be obtained if SNR is measured over short frames and the results averaged. The frame-based measure is called the segmental SNR (SNR_seg), and is formulated as

    SNR_seg = (1/M) Σ_{j=0}^{M-1} 10 log_10 [ Σ_{n∈F_j} s²(n) / Σ_{n∈F_j} e²(n) ],    (9.7)

where F_j denotes the set of sample indices in the jth of M frames.

The frequency-weighted segmental SNR (SNR_fw-seg) allows weights to be applied in each band and thereby produces an SNR measure more closely related to a listener's perceived notion of quality. In this measure, the short-term signal energy contained in the kth frequency band of the frame of noise-free speech indexed by m is compared with the similar quantity for the noise sequence e(n). Studies have shown the SNR_fw-seg measure to be a better predictor of speech quality than the classical SNR or SNR_seg measures.

9.3.3 Itakura Measure
errors in formant location and bandwidth than to the spectral valleys between peaks.

We noted in Section 5.3.5 that the Itakura distance is not a metric because it does not have the required property of symmetry. That is, if d_I(·,·) denotes the Itakura distance, and a(m) and b(m') two LP vectors between which we desire the distance, then in general d_I(a(m), b(m')) ≠ d_I(b(m'), a(m)).

In fact, any set of LP-based parameters could be used in place of the LAR parameters found in (9.11). A generalized LP spectral distance measure between two frames ending at times m and m' is based on a weighted Minkowski metric of order p,

    d_p(m, m') = [ Σ_i w(i) |a(i; m) - b(i; m')|^p ]^{1/p},

where a(i; m) and b(i; m') denote the ith LP-based parameters of the two frames, and w(i) is a weighting on parameter i.
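A sketch of such a weighted Minkowski distance between LP-derived parameter vectors (the vectors and unit weights below are hypothetical; NumPy assumed):

```python
import numpy as np

def lp_minkowski_distance(a, b, w=None, p=2.0):
    """Weighted Minkowski distance of order p between two LP-derived
    parameter vectors a (frame ending at m) and b (frame ending at m')."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    w = np.ones_like(a) if w is None else np.asarray(w, dtype=float)
    return float(np.sum(w * np.abs(a - b) ** p) ** (1.0 / p))

# With unit weights and p = 2 this reduces to the Euclidean distance
# between the parameter sets (e.g., LAR vectors of two frames).
a = np.array([0.9, -0.4, 0.2, -0.1])
b = np.array([0.7, -0.3, 0.1, -0.2])
print(round(lp_minkowski_distance(a, b), 4))   # -> 0.2646
```

Unlike the Itakura distance, this family of measures is symmetric in its two arguments by construction.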
The procedure for finding the WSSM distortion measure involves four steps. First, the spectral slope in each frequency band of each signal is found using the following computations (the index m refers to the current frame):

    Δ|S(k; m)| = |S(k+1; m)| - |S(k; m)|
    Δ|Ŝ(k; m)| = |Ŝ(k+1; m)| - |Ŝ(k; m)|    (9.13)

for k = 1, ..., 36, where S(k; m) is the stDFT of the original reference spectrum evaluated at the center frequency of band k, and Ŝ(k; m) is the similar quantity for the processed spectrum. The magnitude spectra are expressed in dB even though this is not explicit in (9.13). The second step is to calculate a weight for each band. The magnitude of each weight reflects whether the band is near a spectral peak or valley, and whether the peak is the largest in the spectrum. For a given frame time m, Klatt computes the weight for each spectrum separately, then averages the two sets of weights to obtain w_{k,m}, k = 1, ..., 36. After obtaining the set of band weights, the third step is to form the per-frame spectral distance measure, say

    d_WSSM(|S(ω; m)|, |Ŝ(ω; m)|) = K + Σ_{k=1}^{36} w_{k,m} [ Δ|S(k; m)| - Δ|Ŝ(k; m)| ]²,    (9.14)

where the term K is related to overall sound pressure level of the reference and processed utterances, and also may be adjusted to increase overall performance.

The Klatt measure possesses several properties that are attractive for quality assessment. First, it is not required to identify and time-align speech formants prior to spectral distance computation. Second, no prior knowledge is required for normalizing differences in spectral tilt. Finally, the measure implicitly weights spectral differences due to varying bandwidths in the filters used to estimate the short-time spectrum. This in turn yields a perceptually meaningful frequency weighting.

Quality measures can be grouped by phoneme, by speech class (vowels, nasals, fricatives), or by speaker, to name but a few possibilities. Although mean quality is an important measure of performance for a coding or enhancement algorithm, consistency is also important. As an example, let us assume that we have two coding algorithms to compare. Coder A does an outstanding job for all voiced sounds, but does poorly for unvoiced. Coder B performs marginally for both voiced and unvoiced speech. We also assume that the test data used for evaluation have a higher concentration of voiced versus unvoiced speech frames, and that the mean quality for coder A is superior to that of marginal coder B. A listener group might prefer coder B over coder A because the overall quality is more consistent. This aspect, referred to as listener fatigue, can be addressed by computing the variance of each quality measure (though few studies include this value). An even better means of representing global speech quality is to estimate a pdf for the resulting measure. This gives a clear indication of algorithm performance.

9.3.7 Example Applications

We close the discussion of objective quality measures with several illustrative examples.

[FIGURE 9.7. Example of Itakura quality measure for ADPCM coded speech: (a) distortion for a frame across frequency; (b) original and coded speech waveforms with frame-to-frame quality measure. Figure not reproduced.]
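Before walking through the figures, the WSSM computation of (9.13)-(9.14) can be sketched in code. Two simplifications are assumed here: flat band weights stand in for Klatt's peak/valley weighting (which requires the full spectral shape), and the squared-difference form of (9.14) is taken as given. NumPy is assumed.

```python
import numpy as np

def wssm_frame_distance(S_db, S_hat_db, K=0.0):
    """Sketch of the per-frame WSSM of (9.13)-(9.14).
    S_db, S_hat_db: 37-point dB magnitude spectra sampled at the band
    centers (one extra point so that 36 slopes can be formed)."""
    # Step 1 (9.13): spectral slope in each band of each spectrum.
    dS = np.diff(np.asarray(S_db, dtype=float))
    dS_hat = np.diff(np.asarray(S_hat_db, dtype=float))
    # Step 2: band weights; a flat placeholder replaces Klatt's
    # peak/valley weighting (an assumption, see lead-in).
    w = np.ones(len(dS)) / len(dS)
    # Step 3 (9.14): weighted distance between the two sets of slopes.
    return float(K + np.sum(w * (dS - dS_hat) ** 2))

# Identical spectra give zero distance regardless of overall level,
# since only slopes enter the measure.
S = np.linspace(0.0, 36.0, 37)
print(wssm_frame_distance(S, S + 6.0))   # -> 0.0
```

The zero result for a level-shifted copy of the spectrum illustrates the tilt-insensitivity property noted above: only slope differences are penalized.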
Figure 9.7(a) illustrates the accumulated distortion represented by an objective measure. The spectral envelope for a single speech frame before and after ADPCM coding is shown (the plot illustrates distortion versus frequency). Quality can also be seen as a function of time in Fig. 9.7(b). Here, the sentence "Only the best players enjoy popularity" spoken by a male is shown before and after ADPCM coding. A frame-to-frame Itakura quality measure d_I is also shown. Essentially, the area under each frame measure corresponds to the accumulated distortion introduced by the coding process. The results show that the coder performs well for steady-state voiced frames, with decreasing performance for dynamically changing speech characteristics (e.g., stop consonants such as /p/ and /t/). Quality measures can also be illustrated in histogram form (Hansen and Nandkumar, 1991, 1992), as shown in Fig. 9.8. Here, an example of the three objective quality measures, Itakura d_I, log-area-ratio d_LAR, and weighted spectral slope d_WSSM, for a 4.8 kbit/sec CELP coder (Kemp et al., 1989; Campbell et al., 1991) is shown in histogram form.* Deviation from the mean, as well as concentration in the distribution tails, helps to identify any variability in speech quality from the given coding algorithm.

Finally, an example of objective speech quality versus time for noisy conditions and speech enhancement is shown in Fig. 9.9. Figure 9.9(a) and (b) are plots of the male speech waveform in noise-free and noisy conditions. The same sentence as in Fig. 9.7 was degraded with 5-dB SNR of additive white Gaussian noise. Figure 9.9(c) illustrates the distortion introduced by additive noise via the frame-to-frame Itakura d_I quality measure between Fig. 9.9(a) and (b). Finally, Fig. 9.9(d) shows the result of a single-channel Wiener filter (see Section 8.4), which was used to enhance the speech waveform from Fig. 9.9(b) (three iterations were used). The waveform shows a reduction in noise during periods of silence, as well as a decrease in distortion from the resulting frame-to-frame Itakura quality measure plot in Fig. 9.9(e).

*To obtain these histograms, 100 sentences from the TIMIT database (see Section 13.8) were processed, resulting in approximately 37,000 frame-to-frame measures (Hansen and Nandkumar, 1991).

1. The LAR measure, which requires an order of magnitude less computation than other spectral distance measures, demonstrates a competitive correlation coefficient.
TABLE 9.7. Comparison of the Average Correlation Coefficient |ρ| Between Objective and Subjective Speech Quality (as Measured by Composite Acceptability of the DAM).

Objective Quality Measure               |ρ|
SNR                                     0.24*
SNR_seg                                 0.77*
SNR_fw-seg                              0.93*
LP-based measures:
  LP coefficients                       0.06
  Reflection coefficients               0.46
  Log predictor coefficients            0.11
  Log reflection coefficients           0.11
  Linear area ratios                    0.24
  Log-area ratios                       0.62
  Itakura distance                      0.59
  Linear spectral distance              0.38
  Inverse linear spectral distance      0.63
  Log spectral distance                 0.60

2. Of those measures employing an aural model, the WSSM possesses the highest correlation coefficient with subjective quality.

3. The best predictors of subjective quality are composite measures, which are those formed using multiple linear regression on sets of simple measures. The high degree of correlation results by selecting a number of parameters in the composite objective measure that yield maximum correlation. The performance of composite measures can be considered an estimate of the limit of the ability of objective measures to predict subjective quality.
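Observation 3 can be illustrated with a least-squares sketch: regress subjective scores on several simple objective measures to form a composite measure. All numbers below are hypothetical, and NumPy is assumed.

```python
import numpy as np

# Hypothetical per-utterance objective measures (columns: Itakura d_I,
# LAR distance, segmental SNR in dB) and subjective quality scores y.
X = np.array([[0.9, 2.1, 12.0],
              [0.7, 1.8, 14.0],
              [0.5, 1.2, 17.0],
              [0.4, 0.9, 19.0],
              [0.3, 0.7, 21.0]])
y = np.array([48.0, 54.0, 63.0, 68.0, 73.0])

# Append an intercept column and solve the least-squares problem for the
# regression weights; A @ c is then the fitted composite measure.
A = np.hstack([X, np.ones((len(X), 1))])
c, *_ = np.linalg.lstsq(A, y, rcond=None)

composite = A @ c
rho_comp = float(np.corrcoef(composite, y)[0, 1])
print(rho_comp > 0.99)   # -> True
```

Because the weights are chosen to maximize agreement with the subjective data, the resulting correlation bounds what any of the individual measures could achieve on the same data.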
[FIGURE 9.9. Examples of the Itakura quality measure for speech degraded by additive noise and enhanced speech. Figure not reproduced.]

9.5 Problems

9.1. Discuss the differences between speech quality and speech intelligibility in a medium-bandwidth speech communication system. How does speech coding transmission rate affect speech quality? Define the following terms: synthetic quality, communications quality, toll quality, and broadcast quality.
9.2. We wish to construct a subjective quality-based rhyme test to measure coder performance for unvoiced stop consonants. Using the MRT and DRT as models, devise such a test. Include (i) word list(s), (ii) scoring method, (iii) a test of your procedure, and (iv) a suggested confidence measure.

9.3. Consider speech communications over a typical telephone channel. Under these conditions, it is normally assumed that the transmitted utterance is ideal bandpass filtered with a lower cutoff frequency of 200 Hz and an upper cutoff of 3600 Hz.
(a) What are the effects on quality and intelligibility due to this processing and why?
(b) Now suppose that the passband of the channel introduces a spectral tilt of 2 dB/octave. How will this affect subjective speech quality?

9.4. Suppose that a narrow-band noise signal d(n) is used to corrupt an input speech signal s(n) that occupies a 4 kHz bandwidth, resulting in a distorted signal ŝ(n). Let the noise signal be uniform in the band [1.0, 1.2] kHz.
(a) Write an expression for the global SNR (i.e., over the entire speech waveform).
(b) Write an expression for a time-domain segmental SNR (time-domain frames of N samples).
(c) Write an expression for a frequency-domain segmental SNR (frequency-domain blocks of 200 Hz).
(d) Discuss the trade-offs encountered in the use of each of these measures to predict speech quality. Which is the better predictor of speech quality for this distortion and why?

9.5. (Computer Assignment) A speech-coding strategy occupies a [0, 4] kHz bandwidth. We wish to determine the effect on speech quality of increasing additive white Gaussian noise. Using speech from a male speaker, degrade an input utterance with global SNRs of 5, 10, 20 dB.
(a) Using frames of 128 samples, with a 64-sample overlap from frame to frame, find the segmental SNR measure for the three degraded speech waveforms. Discuss the differences in segmental SNR versus global SNR.
(b) Suppose that the listener has a mild hearing loss so that distortion above 3 kHz is not perceived. Construct a new segmental-based SNR measure that does not include distortion outside the auditory band of this listener.

9.6. Consider an ADPCM vocoder. Assume that a fixed predictor is used with adaptive step size (see Chapter 7).
(a) What is the impact on the LAR quality measure of increasing the data rate from 16 kbit/sec to 32 kbit/sec?
(b) Suppose that the ADPCM vocoder is modified to have fixed step size but adaptive prediction. Assuming a similar increase in transmission rate to that in (a), how will the LAR measure change? Is it better to adapt step size or prediction when considering the LAR quality measure?

9.7. In Fig. 9.8, we see that quality measures for ADPCM are poor for stop consonants, but good for vowels. Discuss why this is true. Is the human auditory system more sensitive to distortion in formant location or formant amplitude? Why?

9.8. (Computer Assignment) Construct a digital simulator for a neutral vowel using four complex pole pairs. For discussion purposes, assume that the Nyquist frequency is 4 kHz. Excite this filter with an 8 msec periodic pulse train and obtain a 1-sec output waveform. Use this as the reference waveform.
(a) Using the original 8-pole filter, decrease the radial pole locations by 5% (i.e., move the eight poles toward the origin in the z-plane). This introduces a distortion in formant amplitude and bandwidth. Obtain 1 sec of the distorted waveform. Using LP analysis, find the frame-to-frame Itakura quality measure for the formant bandwidth/amplitude distorted signal (use the original signal as reference).
(b) Using the original 8-pole filter, increase the frequency locations of all poles by 5% (e.g., if the first pole-pair is at 500 Hz, the modified pole-pair will be at 525 Hz). This introduces a distortion in formant location. Obtain 1 sec of the distorted waveform and find the frame-to-frame Itakura quality measure.
(c) Compare frame-to-frame Itakura quality measures for distortion in formant location versus bandwidth/amplitude. Which distortion introduces a greater loss in speech quality? Repeat the process for a 10% shift in pole location for each distortion.
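A possible starting point for the simulator in Problem 9.8 is sketched below; the formant frequencies and pole radius are chosen here purely for illustration, and NumPy is assumed. Only the reference waveform of the problem statement is generated.

```python
import numpy as np

FS = 8000                               # sampling rate: Nyquist frequency 4 kHz
FORMANTS_HZ = [500, 1500, 2500, 3500]   # an illustrative neutral-vowel choice
RADIUS = 0.95                           # pole radius, controls formant bandwidth

# Build the all-pole denominator A(z) as a product of complex pole-pair
# factors (1 - 2 r cos(w) z^-1 + r^2 z^-2).
a = np.array([1.0])
for f in FORMANTS_HZ:
    pole = RADIUS * np.exp(2j * np.pi * f / FS)
    a = np.convolve(a, [1.0, -2.0 * pole.real, abs(pole) ** 2])

# One second of an 8-msec (64-sample at 8 kHz) periodic pulse train.
x = np.zeros(FS)
x[::64] = 1.0

# Run the all-pole difference equation y(n) = x(n) - sum_k a[k] y(n-k).
y = np.zeros_like(x)
for n in range(len(x)):
    acc = x[n]
    for k in range(1, len(a)):
        if n >= k:
            acc -= a[k] * y[n - k]
    y[n] = acc

print(len(y), len(a) - 1)   # -> 8000 8
```

Parts (a) and (b) of the problem can then be attempted by scaling RADIUS or the entries of FORMANTS_HZ before rebuilding the filter, and comparing LP analyses of the two outputs.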
CHAPTER 10
The Speech Recognition Problem
Reading Note: This chapter is very descriptive; no special reading in Chapter 1 is required.
10.1 Introduction
In this short chapter we introduce the field of speech recognition and discuss the problems that make this endeavor so challenging.
601
602 Ch. 10 / The Speech Recognition Problem    10.1 / Introduction 603
Most systems employed in practical applications are of the small-vocabulary or isolated-word type. Existing systems for more "natural" human-machine communication remain primarily experimental. No existing system, even of those being used in practical applications, is highly robust to environmental noise (office noise, factory noise, extraneous speech, etc.). All perform significantly better if required to recognize only a single speaker who "trains" the system. Even if the system is used to recognize multiple speakers, performance is generally improved if the system users are also the trainers. Whether a single- or a multispeaker system, utterances of cooperative speakers (who articulate clearly, do not use words outside the vocabulary, etc.) are more easily recognized. Although some existing systems take advantage of the grammatical structure of the language, only experimental systems have more abstract "cognitive" abilities like discerning meaning or learning from mistakes. When such qualities are present, they appear in very primitive measure.

In contrast, most natural application domains that would benefit from speech recognition do not use discrete, clearly articulated utterances by a single person in a quiet environment, nor is it generally possible to have the system trained by its user population. In fact, speech recognition systems, to be maximally beneficial and universally applicable, must be capable of recognizing continuous speech, and must be able to recognize multiple speakers with possibly diverse accents, speaking styles, and different vocabularies and grammatical tendencies (perhaps even multiple languages); must be able to recognize poorly articulated speech; and must have the ability to recognize speech in noisy environments. Furthermore, all of these capabilities must come in an affordable, sufficiently small system that can operate in real time. Ideally, too, the system should adapt and learn new lexical, syntactic, semantic, and pragmatic information,¹ just as a human can. When placed in this perspective, the field of speech recognition is seen to be in its very early infancy.

The enormity of the problem notwithstanding, great progress has been made in recent decades, and the work that has been accomplished is certainly not without practical value. Small-vocabulary systems, even those requiring discrete word inputs, can be employed in many relatively simple applications to improve the efficiency of entering information to a machine²: in manufacturing environments (e.g., for sorting tasks), in applications where the hands are unavailable (in surgery, to assist a person with motor disabilities, in a darkroom, in a cockpit), or in applications where the user must remain in remote contact with the machine (over the phone, in hazardous environments). Systems constrained for use in certain "task domains" can be extremely useful in those domains. The possibilities are almost unlimited; one can imagine applications, for example, in library or other information retrieval functions, in air traffic control towers, in emergency response centers, in certain medical environments, or in any situation in which specific tasks tend to restrict the vocabulary and the message content. While application of existing technology proceeds, several promising technologies are contributing to laboratory and commercial systems that attempt to solve one or more of the challenging problems noted above. These methods will be the principal focus of our study.

¹These terms are defined below.
²Speech input is about twice as fast as information entry by a skilled typist (Kaplan, 1976).

As we shall discuss, the first step in solving the speech recognition problem was to understand its complexity. Categorizing and discussing the dimensions of the problem is the central issue of this chapter. In light of this overwhelming complexity as it is now understood, it is natural to wonder whether the goal noted in the opening paragraphs is a realistic one. The future will answer that question, and perhaps the reader will contribute to the solution, but a sense of optimism can be derived from assessing recent progress. Very realistic speech recognition systems have been developed in the short period since the invention of the digital computer. In the early 1970s, commercial systems became available that were quite remarkable in their time. These systems addressed a speech recognition problem considered nearly trivial by contemporary standards. They were designed to recognize discrete utterances (usually words) in relatively noise-free environments. The systems employed small vocabularies (10-100 words), and were often used in cases in which the machine was required to recognize only the speaker who trained it. In contrast, near the end of the 1980s researchers at IBM, for example, had developed an experimental system capable of recognizing a vocabulary of 20,000 words when uttered in isolation, or naturally spoken utterances drawn from a 5000-word vocabulary. Not only was such a system almost unimaginable in 1970, but the thought that it would be implemented in a desktop computer the size of a small suitcase probably would have been considered very unrealistic.

Another perspective on modern speech recognition technology is achieved by looking back beyond 1970. Prior to about 1950, the digital computer did not exist. Therefore, what we have called "recent" progress comprises more than half of the period over which speech recognition research has been conducted. Just as the digital computer gave rise to the modern era, so too have advances occurred in proportion to computing speed and memory size.

New interdisciplinary knowledge and improved computing technologies continue to advance the state of the art with each passing year. Tracking this progress is difficult, because the field does not move linearly; there are many different problems involved in the attempt to recognize speech, and each evolving system, whether commercial or research, tends to focus on some aspect of the whole. We will nevertheless
attempt to piece together this status and its history as we go along in our study.³

10.1.2 Discovering Our Ignorance

Ironically, a realization of the fact that we are in the infancy of the speech recognition field is a consequence of several decades of painstaking research into what was once thought to be a relatively straightforward problem. In the 1950s researchers conceived of a machine that was to be called the "phonetic typewriter" (Fry and Denes, 1958; Dreyfus and Graf, 1962). The goal was to use acoustic features of speech and knowledge of phonetics to turn flowing speech into a phonetic transcription, and eventually into a graphemic (conventionally written) transcription of the message. These researchers did not anticipate the extreme difficulty of the task. The relative ease with which we humans communicate using speech obscures the awesome complexity of the task as it is now appreciated; it is a testimony to the remarkable ability of the "human computer." (This latter fact has inspired some researchers to investigate the use of "neural network" architectures in speech recognition. We will have more to say about this in Chapter 14.) In a 1976 article in the IEEE Spectrum (Kaplan, 1976), Dr. James Flanagan of Bell Laboratories made the following observation, which is a largely accurate reflection of the state of the technology even as we approach the turn of the century: "The problem of speech recognition has not been solved, primarily because the speech communication process is a subtle one. Many of its fundamentals are not well understood. For example, while most researchers recognize that a short-time frequency spectrum of speech bears important information, the human ear and brain are not a laboratory spectrum analyzer. We do not completely understand the inner ear, and what happens beyond the auditory nerve [relaying neural signals from the inner ear to higher auditory centers in the brain] is almost a total mystery." Indeed, as with many research endeavors, decades of research have served [...]

[...] the "spectral analyzer" view still prevails. We noted in Chapter 3 that Teager and Teager (1990) have urged the speech processing community to consider that analysis techniques based on linear models are quite inappropriate and are hypothetically responsible for hindering greater speech recognition success.

10.1.3 Circumventing Our Ignorance

Although our fundamental understanding of the speech process remains incomplete, a major asset to the field has been the explosive advances in digital computing based on very-large-scale integration of circuit components beginning in the 1980s. Computing speed and abundant memory combined with specialized architectures and signal processors have made it possible to execute enormously complex algorithms that would have been unthinkable in the early days of computing. What is more, researchers can proceed with high confidence that speech recognition strategies that today are implementable only when using vast laboratory systems of networked processors will be run on small systems in the future. It is interesting to note the comments of Dr. Frederick Jelinek of IBM in the same 1976 IEEE Spectrum article: "Computers are still too slow and too expensive. Ten years ago [1966] they were even too slow to carry out research [emphasis added] in speech recognition. New research is possible, but continuous-speech recognition products [emphasis added], by present techniques, would be quite costly. Because programming, even in today's high-level computing languages, is difficult, research is slow. It takes a very long time to test out the simplest experimental idea."

Whereas it is possible to mistake Jelinek's comment for one that might have been made today, relatively speaking, we have come much further in addressing the need for greater computing power than in addressing Flanagan's and Teager and Teager's concern for a more complete view of
to point out how little we know ab out a very co mplex pro blem. Many the speech communication process. In a sense we have used th e st rengt h
years after these com ments were made, th e speec h pro cess rem ains fun of the former to compensate fo r the relative wea kness of the latter. Some
damenta lly mysterious. and the engineering view of the brai n as a "spec brief discussions of hardware capab ilities will be mad e a t ap pro priate
points in t he following chapters.
lRegrel1ably, we cannot possi bly give proper credit to the vast number of resea rchers at Although advances in hardware have been a major boon to speech rec
many labora to ries and com pa nies aro und the world who have advanced this field . We can ogni tio n technology, cert a in co ncept ual advances also un derl ie high
on ly hope to give a sampli ng of the sys tems th at rep rese nt va rious concepts. As a mailer of
po licy. we will avoi d d iscu ssio ns of specific co m mer cia l systems a nd focus on research de performance "software." f ndee d, in th e same era as the concerns abo ve
velo pme nts. except in a small nu m ber of cases in wh ich the wo rk re pre se nts landmark a d were raised , researchers were beginn ing to bu ild systems bas ed o n
van ces in t he field . Th e rea der is encouraged to peruse the Proceedings of the IEEE stochastic models that would, in effect. learn their own representat ions o f
Int ernationa] Conferences on A COIlSlI CS, Speech, and Signal Processing (ICASSP). for exam
ple. where new resu lts are often first reported. The papers in these Proceedings also offer the speech process rather than having it deterministically encoded using
extensive reference lis ts that will d irect the reader to other SOUrces of informat io n. In par experts' kno wledg e. This act ually rep resen ts anoth er compensat ion for
ticula r; a wonderfully com prehensive surve y o~ speech recogni tion ad va nces is given by J .
Ma ria n i in the 1989 ICASSP Proceedings (Maria ni , 198 9). Also, in 1990 a co llect io n of pa
lack of precise modeling information. Stochastic approac hes circum
pe rs o n the subject of speech recognition was compiled by Waibel and l ee ( 1990). This col vent ed the need for extraordinary amounts of complex infor mat io n nec
lect io n present s some o f the sem in a l wo rk 111 th e field. foc usi ng prin cipall y. bu t not essary to write "deterministic" progra ms , but at the same time placed
e xclusivelv, on wor k in t he Uni ted Sta tes. The papers in th is collection also co nta in many
use ful reference lists . • heavy demands upo n com pu t ing systems fo r bo th tra ining recognitio n
ta sks. [For example, it took more than 13 hours on a DEC-l090 corn
606 Ch. 10 I Th e Spe ech Recog nilicn P' obl em
10.2 I The "Dlmen slons o f Difficulty· 607
put er to compile th e net work necessary for one of the fir st succe ssful
large- vocabulary (10 11 words) co nt inuo us-speech rccognizers , H AR P·Y. at 10.2.1 Speaker-Dependent Versus Speaker-Independent
Ca rnegie-Mello n Un iversi ty in 1976 (Reddy, 1976: Lowerr e a nd Red dv. Recognition
1980)J. Stoch astic approa ch es (based on li near acoustic mo del s). as ~e Most speech re cognitio n algorithms. in principle. can be used in eithe r
sha ll see. a re no w firmly entrenched in most contemporary speech recog a "spea ker-dependent "' or "speaker-i ndependent " mod e, an d th e designa
nit ion sy stem s with m ode ra te to la rge vocabularies. La rge-vocab u la r y. tion fo r a partic ula r syste m d epen ds upon th e mode of train ing. A
co nt ino us-speech re cognitio n syste ms still pose man y cha llen ges fo r com speaker-dependenr recognizer uses th e utte ra nces o f a si ngle speaker to
puter technologists to cre a te fa ste r h a rd wa re and so ftware a nd much lea r n the parameters (or models) that characteri ze the system's int ern al
m or e co m p ac t c ircu itry nec essa ry fo r th ese sy ste m s to mo ve fro m th e mod el of the speech process. T he sys tem is the n used spec ifica lly for rec
lab orat ory to th e real wo rld . H oweve r, it appears likely tha t cont in uo us ognizing the speech of its traine r. Accordingly, th e rec ognize r win yield
speec h re cognition sys te ms, and hum a n- machi ne interactio n, will play a relati vely high recognition results compared with a speaker-independent
s igni f ica nt role in societ ies of the no t-too-d is ta nt fu t u re. recognizer, which is trained b y multiple speakers a nd use d to reco gni ze
man y sp eaker s (who may be outside o f t he train ing popul ati on). Al
though more ac cu ra te. t he appa rent di sad vant age or a speaker-dependent
10.2 The "Dimensions of Difficulty" system is th e need to retrain the system ea ch ti me it is to be used wit h a
new spe aker. Beyond the accu racy/ co nvenience tra de-o ff is t he issu e o f
In Sect io n 10.1, we descri bed th e ge nera l goals of th e speech recognition nec essity. A tel epho ne system lsee. e.g .. (Wilpo n et a l., 1990)1 th at must
t ask a nd generally sugges te d so me o f t he m aj o r problems involved . In respond to inq uiri es from the public is n ecessar ily spe a ke r-inde pe nde n t,
th is section we wish 10 m o re fo rma lly di scu ss wh at Waibel and Lee whil e a system used to recognize th e se verely dy sa rt hric spee ch of a per
(1 990 , p. 2) ha ve call ed th e "dimensio ns of di ffic ulty " in speech rccogni so n with speec h d isa b ilit ies (see. e.g., (De lle r et al., 199 1lJ mu st be
t ion. We a d d res s t he questi on of wha t fac to rs influ en ce the success Or tra ined to tha t perso n's speec h. Bot h t ype s o f s yste ms, the refore, a re used
failure of a speec h recogn it ion syst em and dicta te th e degre e of sophisti in practice, and bo t h have been st udied exte ns ively in th e laborat or y.
ca tion neces sary in th e design of th e system . Thes e fa ctors a re en ume r Before co nt inuing. let us note th at so me au tho rs d ist inguis h bet ween
ate d as ans wers to th e fo llowing Quest ion s :
speaker-indepe nden t systems for wh ich th e t rai ning popul ati ons are th e
1. Is th e system re q ui red to recogni ze a sp ecific ind iv id ua l or multi ple same as the use rs, a nd those for which th e tr ai ni ng po pu lat ions a re dif
sp ea ke rs (incl ud ing, perh a ps. a ll speakers)? fere nt fro m the use rs. I n th e fo rm e r case , the term m ultiple speak er is
2. What is th e size of th e vocabu lary? . used whil e th e term "s peaker indepe nde nt" is reserved for the latter. We
3. Is th e speec h to be en tere d in d iscret e units (usua lly word s) with shall no t ma ke th is d ist inction in the fo llo wing. H owe ver , it is important
distinct pauses a m o ng th em (disc re te ut tera nce recogn it io n), or as a to ta ke note of th is issu e in co m p a r ing th e pe r for ma nce of var io us
continuous utt erance (co nnec ted or co ntin uo us recognit io n- to be syste m s.
d istinguishe d belo w).
4. Wh at is th e ex te n t o f ambig uity (e .g.. "know" a nd "no ") an d aco us
ti c co n fusa bility (e.g .. "bee," "see," "pea " ) in th e vocab ula ry? 10.2.2 Vocabulary Size
5. Is tb e syste m to be o perate d in a q uiet o r noisy env iron me nt, and Clearly. we wou ld exp ect perfo rmance and speed o f a pa rt icular recog
wh at is th e nature o f th e env ironmen tal no ise if it exists? nize r to degrade with inc reasin g voca bula ry size. As a r ule of thumb ,
6. What are the lin gui stic co nst ra i n ts pla ced upon t he speech. and some speech resea rcher s estim ate t ha t the d iff ic ulty of th e recogn itio n
what linguistic kn owled ge is b uill into th e recogn izer? problem increases logar it hmically with the size of the voca bular y. M ern
We consider each of th ese Questi on s seq ue ntia lly in th e follo wing sub ry req uirem ent s also increase with increasing vocab ular y size, th ough
sec tio ns .' (as we will be able to in fer fro m the study below) gen era lly no t so much
as a con seq uence o f th e increasing num be r o r words , but rather as a re
' These "dimensions" tend to focus on the task to be accom plished an d accordingly su lt of the increasin g complexity of the recognition task that larger vo
might make the reader th ink of the correlati ve difficultv in finding theo retical and algori th ca bu la ries im p ly.
mic solutio ns. However. another facet of this challenge is becoming clear as some of the Speec h rec ognit ion system s o r a lgo rit hm s arc gen e ra lly cl assified as
more difficult algorithmic problems are being solved: the availab ility of necessary memory
resources with which to implement the "solutions." Many exist ing a lgorithms require too sm all. medium . or large vocabu lar y. Th ere is som e va r ia t io n in th e litera
much memory to even he tested. Therefore. the computing resou rces necessary to irnple t ure o n the q ua n tificat ion of these term s, but as a rule of thu m b. sm all
men! a " solution" arc becoming a very real part of the difficulty implied by the more ex voca bu la r y systems a re th ose whic h have vocabulary sizes in the range of
trem e answers to these q uestions.
1-99 wo rds: mediu m . 100- 999 wor ds: an d large. 1000 words or more.
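The rule-of-thumb ranges just quoted can be written down directly. The following is a minimal sketch using exactly the thresholds above; the function name is ours for illustration, not standard terminology:

```python
def vocabulary_class(n_words: int) -> str:
    """Classify a recognizer by vocabulary size using the rule of thumb
    in the text: small = 1-99 words, medium = 100-999, large = 1000+."""
    if n_words < 1:
        raise ValueError("a vocabulary must contain at least one word")
    if n_words <= 99:
        return "small"
    if n_words <= 999:
        return "medium"
    return "large"
```

As the next paragraphs caution, these labels are loose: in a context where 200,000-word recognizers exist, even a 1000-word capability might reasonably be called "small."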
Since recognizers have been designed for 200,000 words, a 1000-word capability might be called "small" in some contexts, so we need to be careful about the meaning of these loose classifications. Small-vocabulary (as defined here) systems have become routinely available and have been used in tasks such as credit card or telephone number recognition, and in sorting systems (recognizing destinations) for shipping tasks. The focus of the medium-sized vocabulary systems has been experimental laboratory systems for continuous-speech recognition research (driven in part by the availability of standardized databases, to be discussed in Chapter 13). Large-vocabulary systems have been used for commercial products currently aimed at such applications as office correspondence and document retrieval. These systems have been of the isolated-word type in which the speaker must utter each word discretely from the others. It is important to keep in mind that a given-size vocabulary can require far more effort for a speaker-independent system than a speaker-dependent one. Continuous-speech recognition is also much more difficult than discrete utterance recognition (see below); thus, vocabulary size is only one measure of difficulty. It is also true that "linguistic constraints" (see below) can reduce the "per word" difficulty of a fixed-size vocabulary.

As we shall see, for small vocabularies and relatively constrained tasks (e.g., recognizing numerical strings), simple discrete utterance or connected-word recognition strategies can often be employed. [Footnote: These terms are defined below.] In these cases, models for each word in the vocabulary are resident in the system and the list can be exhaustively searched for each word to be recognized. As vocabularies become larger and recognition tasks more complicated, training and storing models for each word is generally impossible and models for subword units (e.g., syllables, phonemes) are employed. [Footnote: It should be noted that multiple models for each subword unit are frequently necessary to account for coarticulatory and phonological effects. This is yet another effect that is driven upward by increasing vocabularies.] Simple exhaustive search of all possible messages (built from these subword units) also becomes unmanageable, and much more sophisticated search algorithms that pare down the number of items searched must be designed. Significant to these algorithms are "linguistic constraints" on the search that eliminate unmeaningful and grammatically incorrect constructions. We discuss this important issue below. Also complicating the recognition task as vocabularies become larger is the potential for an increased number of confusable items in the vocabulary. This issue is also discussed below.

10.2.3 Isolated-Word Versus Continuous-Speech Recognition

Henceforth in this discussion, we will use the term sentence to mean any string of words to be recognized that is presumably taken from the vocabulary under consideration. The "sentence" can be what we would ordinarily think of as a grammatically correct sentence, a simple string of digits, or even, in the "degenerate case," a single word.

Isolated-Word Recognition. Discrete-utterance recognizers are trained with discrete renditions of speech units. Since the discrete utterances are usually words, this form of speech recognition is usually called isolated-word recognition (IWR). In the recognition phase, it is assumed that the speaker deliberately utters sentences with sufficiently long pauses between words (typically, a minimum of 200 msec is required) so that silences are not confused with weak fricatives and gaps in plosives. Single-word sentences are special cases with "infinite" pauses. The fact that boundaries between words can be located significantly simplifies the speech recognition task. These boundaries are located in various technical ways, including the use of an endpoint detection algorithm to mark the beginning and end (or candidate sets of beginnings and ends) of a word. This is the simplest form of recognition strategy, and it requires a cooperative speaker. It is nevertheless very suitable for certain applications, particularly those in which single-word commands from a small vocabulary are issued to a machine at "lengthy" intervals. A good example of an application with such intervals arises in the sorting machine application noted above, in which the operator utters a destination as each package presents itself on a conveyor.

When the vocabulary size is large, isolated-word recognizers need to be specially constructed and trained using subword models. Further, if sentences composed of isolated words are to be recognized, the performance can be enhanced by exploiting probabilistic (or simply ordering) relationships among words ("syntactic" knowledge) in the sentences. We will be better able to comment on this issue after describing language constraints.

Continuous-Speech Recognition. The most complex recognition systems are those which perform continuous-speech recognition (CSR), in which the user utters the message in a relatively (or completely) unconstrained manner. First, the recognizer must be capable of somehow dealing with unknown temporal boundaries in the acoustic signal. Second, the recognizer must be capable of performing well in the presence of all the coarticulatory effects and sloppy articulation (including insertions and deletions) that accompany flowing speech. As an example of cross-word coarticulation effects, the /z/ in "zoo" is pronounced somewhat differently in the utterances of "St. Louis Zoo" and "Cincinnati Zoo." The latter tends to be a true /z/ (voiced fricative) sound, whereas in the former, the voicing tends to be missing. As an example of how intra- as well as interword articulation degenerates in continuous speech, speak the question, "Did you find her?" as discrete words, and then naturally. The latter likely results in "Didjoo (or Didja) finder?" Whereas the CSR problem does not in the extreme case require any cooperation from the speaker, it must compensate for this fact by employing algorithms that
are robust to the myriad nuances of flowing speech. CSR systems are the most natural from the user's point of view. They will be essential in many applications in which large populations of naive users interact with the recognizer.

In line with the issue of speaker cooperation, it is worth noting that even IWR systems must be robust to some of the anomalies of continuous speech if used with naive speakers. Often the pause between words by persons who are asked to speak in discrete utterances is not sufficient or even existent. "Pausing" is a very subjective speaking behavior that is sometimes not manifested acoustically. In general, obtaining cooperation from speakers is not simple, and speech recognizers must be robust in handling resulting problems. Pausing is only one such noncooperative behavior. Others include the inclusion of extraneous speech or noise, and use of out-of-vocabulary words (Wilpon et al., 1990; Asadi et al., 1991).

In large-vocabulary CSR systems, the same two considerations as in the IWR case apply. Words must be trained as subword units, and interword relationships must be exploited for good performance. There is further pressure in the continuous-speech case to model words in ways that capture the intra- and interword phonological variations, and, perhaps, to learn and exploit probabilistic relationships among subword units ("lexical" and "phonological" knowledge), just as we do with the word relationships ("syntax") at a more macro level of analysis.

"Connected-Speech" Recognition. In small-vocabulary, continuous-speech applications, a recognition technique called connected-speech recognition is sometimes used. It is important to note that the term "connected speech" refers to the recognition strategy rather than to the speech itself. In general, the speech is uttered in a continuous manner.

In the connected-speech technique, the sentence is decoded by patching together models built from discrete words and matching the complete utterance to these concatenated models. The system usually does not attempt to model word-boundary allophonic effects, nor sloppy intra- or interword articulation. There is an implicit assumption that, while distinct boundaries cannot be located among words, the words are reasonably well articulated. In general, this assumption is violated by the speaker, but the results are improved by speaker cooperation. An example of a "cooperative speaker" application would be the entry of strings of digits representing a credit card number by a sales clerk who has been instructed to "speak slowly and pronounce the digits carefully." An example of an "uncooperative speaker" application would be voice dialing of phone numbers from a public phone. In that case, the average caller (who does not understand speech recognition technology) is not likely to be very cooperative even if asked to be so (Wilpon et al., 1990), and the problem becomes one of recognizing continuous speech. We cannot overemphasize the fact that connected-speech recognition is really recognition of continuous speech, since we intend to use this case to introduce continuous-speech recognition techniques in Chapter 13.

When probabilistic relationships among words (syntax) are known, these can be exploited in the connected-speech recognition approach to improve performance. In the phone-dialing example, we might model the sentences as random strings with equal probabilities of any digit in any time slot, or there might be certain probabilistic relationships among the digits due to higher frequency of calls to one area, for example. In the latter case we could employ this syntactic knowledge to improve performance.

Endpoint Detection. We conclude this subsection by revisiting the problem of endpoint detection, since the proper detection of the onset and termination of the speech amidst background noise is central to the success of many IWR strategies. The problem of endpoint detection has been described as an application of the short-term energy and zero-crossing measures in Section 4.3.4. The approach discussed there was widely used in practice in the 1980s and continues to be useful in a limited number of simpler systems and applications. An even simpler approach that can be used when the speech is more severely bandlimited (to, say, below 3 kHz) relies on threshold settings on the energy only and is described in (Lamel et al., 1981; Wilpon et al., 1984).

More recently, the endpoint detection problem has been addressed using techniques arising from the study of CSR. In this approach, the acoustic signal is modeled as a continuum of silence (or background noise), followed by the desired utterance, then more silence. In this case, the precise location of endpoints is determined in conjunction with the strategy used to actually recognize the words. We shall see how this is accomplished in the next three chapters (see, in particular, Section 12.4.2). With this approach, the endpoint detection stage may be used to provide initial estimates (sometimes crude) or sets of estimates for use in the higher stages.

The somewhat uninspiring problem of endpoint detection would seem to be rather easily solved. In fact, it is often very problematic in practice. Particularly troublesome are words that begin or end in low-energy phonemes like fricatives or nasals, or words that end in unvoiced plosives in which the silence before the release might be mistaken for the end of the word (see Fig. 10.1). Some speakers also habitually allow their words to trail off in energy (see Fig. 10.2). Others tend to produce bursts of breath noise at the ends of words (see Fig. 10.3). Background noise is also an obvious potential source of interference with the correct location of endpoints, with transient noises often causing more of a challenge than stationary noise that can be well modeled.

When the older endpoint technology is used, a short-term energy measure is the principal and most natural feature to use for detection, but each of the problems mentioned above interferes with the effective use of energy measures. Accordingly, safeguards are built into endpoint detectors. The most fundamental problem is the nonstationary nature of the intensity of the speech across words. Thresholds fixed on one relatively
[FIGURE 10.1. Acoustic waveform for the word "six" illustrating the two effects in the text. Axes: amplitude versus time, n (norm-sec).]

[FIGURE 10.2. Acoustic waveforms, panels (a) and (b); caption not recovered in this scan. Axes: amplitude versus time, n (norm-sec).]
loud word might be entirely inappropriate for endpoint detection on a weaker phonation (even of the same word). Therefore, thresholds are generally normalized to the general energy level of a word, or, alternatively, the intensity of the words can be normalized to use fixed thresholds. Such normalization procedures help to compensate for the fact that when words occur sequentially as part of a spoken sentence, the average energy level tends to decay as the sentence progresses. Further, to a large extent, the energy normalization alleviates the problem noted above with weak sounds at the beginnings or ends of words. The problem with terminal plosives, however, requires that a number of frames of "background" (silence) be determined before the end of the word is declared. Similarly, the problem of transient background sounds being detected as word onsets can be overcome by requiring that a certain number of frames be "above background" in intensity before a word is declared. If this criterion is met, the initial point of the word is found by backtracking to the point of the initial onset of energy. An example technique is described in (Lamel et al., 1981).

The problem of breath noise at the ends of words has been addressed in the paper by Rabiner and Sambur (1975), in which they suggest the preliminary inclusion of the breath noise as part of the word, and then its omission by secondary processing. [Footnote: This is the paper upon which our discussion in Section 4.3.4 was based.]

Generally, whether the older or newer endpoint detection method is used, the fact that sources of noise can often appear to be valid speech sounds (and vice versa) is also problematic. The solutions to this problem depend very much on the information available about the noise and the technological framework in which the IWR problem is embedded. Some techniques for dealing with the problems of speech in noise were discussed in Chapter 8.

10.2.4 Linguistic Constraints

The most abstract problem involved in speech recognition is endowing the recognizer with the appropriate "language constraints." Whether we view phones, phonemes, syllables, or words as the basic unit of speech, language (or linguistic) constraints are generally concerned with how these fundamental units may be concatenated, in what order, in what context, and with what intended meaning. As we discuss below, this problem is more involved than simply programming the correct grammatical rules for the language. Clearly, the more constrained the rules of language in the recognizer, the less freedom of expression the user has in constructing spoken messages. The challenge of language modeling is to balance the need for maximally constraining the "pathways" that messages may take in the recognizer, while minimizing the degree to which the speaker's freedom of expression is diminished. A measure of the extent to which a given language model constrains permissible discourse in a recognizer is given by the "perplexity" of the language model. [Footnote: Another term used is "habitability."] This term roughly means the average number of branches at any decision point when the decoding of messages is viewed as the search of paths through a graph of permissible utterances. We will define perplexity more formally in Chapter 13.

Let us begin the consideration of language constraints by posing an abstract model of natural languages. Peirce's model of language (Hartshorne and Weiss, 1935) as described by Rabiner and Levinson (1981) includes four components of the natural language code: symbolic, grammatical, semantic, and pragmatic. [Footnote: We have replaced Peirce's words "syntax" and "grammar" for more consistency with our "engineering" formalisms in the following. We will reserve the word syntax to refer to the rules that govern how words may combine.] The symbols of a language are defined to be the most fundamental units from which all messages are ultimately composed. In the spoken form of a language, for example, the symbols might be words or phonemes, whereas in the written form, the alphabet of the language might serve as the symbols. Rabiner and Levinson write that "[f]or spoken English, the 40 or so basic sounds or phonemes are a reasonable choice [for the symbols of the spoken form of the language]. Although they are subject to substantial variation, they do correlate highly with measurable spectral parameters." For the purpose of discussion, let us adopt Rabiner and Levinson's suggestion and use the phonemes as the symbols of the language. The grammar of the language is concerned with how symbols are related to one another to form ultimate message units. If we consider the sentence to be the ultimate message unit, and we choose phonemes as symbols, then how words are formed from phonemes is properly considered as part of Peirce's grammar, as well as the manner in which words form sentences. How phonemes form words is governed by lexical constraints, and how words form sentences by syntactic constraints. Lexical and syntactic constraints are both components of the grammar.

Before continuing the discussion of Peirce's model of language, let us view the following "sentences" proposed by Reddy (1976), in light of our definition of grammar. Next to each sentence is a description of its conformity to the linguistic concepts discussed above. These conclusions will be drawn from the discussion below.

1. Colorless paper packages crackle loudly. [grammatically correct]
2. Colorless yellow ideas sleep furiously. [grammatically correct, semantically incorrect]
3. Sleep roses dangerously young colorless. [grammatically (syntactically) incorrect]
4. Ben burada ne yaptigimi bilmiyorum. [grammatically (lexically) incorrect]

[Footnote: According to Reddy, this is a Turkish sentence.]
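The rough notion of perplexity used above, the average number of branches at a decision point, can be illustrated numerically. The formal definition is deferred to Chapter 13; the sketch below uses the standard information-theoretic form (2 raised to the entropy of the branching distribution), which reduces to exactly N for a uniform choice among N branches:

```python
import math

def perplexity(branch_probs):
    """Perplexity 2**H of the probability distribution over the branches
    leaving a decision point. For a uniform choice among N branches the
    entropy H is log2(N), so the perplexity is exactly N."""
    entropy = -sum(p * math.log2(p) for p in branch_probs if p > 0)
    return 2.0 ** entropy

# Phone-dialing example from Section 10.2.3: ten equiprobable digits
# in each time slot gives an effective branching factor of 10.
uniform_digits = [0.1] * 10

# If calls concentrate on a few digits, the same ten branches yield a
# lower perplexity -- a more constrained, easier search.
skewed_digits = [0.5, 0.3] + [0.025] * 8
```

With `uniform_digits` the function returns 10, matching the branching intuition, while `skewed_digits` yields a smaller value even though ten branches still exist.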
Suppose that in a Peircian model, the grammatical component includes all (lexical) rules that form English words from symbols (phonemes) as well as the classification of these words into parts of speech and the (syntactic) rules by which these parts of speech may be combined to form sentences. Sentence 4 is grammatically incorrect in that the words are not legal concatenations of symbols in English. Sentence 4, therefore, fails at the lowest level (the word or lexical level) of grammatical constraints. Whereas sentence 2 is meaningless, sentences 1 and 2 are both grammatically correct in that they consist of proper English words (correct lexically) and the words are correctly concatenated (correct syntactically) according to the rules of English. Although lexically correct, sentence 3 does not obey the rules of English and therefore fails at a higher level of grammar (the syntactic level). Sentences 1 and 2, therefore, are permitted in our language model.

The grammar of a language is, in principle, arbitrary, in the sense that any rule for combining symbols may be posed. We have witnessed this in declaring that sentence 2 to be a grammatically correct message. On the other hand, semantics is concerned with the way in which symbols are combined to form meaningful communication. Systems imbued with semantic knowledge traverse the line between speech recognition and speech understanding, and draw heavily upon artificial intelligence research on knowledge representation. If our recognizer is semantically constrained so that only meaningful English sentences are permitted, then sentence 2 will clearly fail to be a candidate message. Likewise, sentence 3 is semantically improper. Sentence 4 will presumably fail at the lexical level of grammatical testing and will not be subjected to semantic scrutiny.

Beyond simple "nonsense/meaningful" decisions about symbol strings, semantic processors are often used to impose meaning upon incomplete, ambiguous, noisy, or otherwise hard-to-understand speech. A noisy utterance recognized as "[Indeterminant word], thank you" with the aid of semantic processing might immediately be hypothesized to be either "No, thank you" or "Yes, thank you" with high likelihood. A semantic processor could also choose between "Know, thank you" and "No, thank you," two phrases that might be equally likely without it.

Finally, the pragmatics component of the language model is concerned with the relationship of the symbols to their users and the environment of the discourse. This aspect of language is very difficult to formalize. To understand the nature of pragmatic knowledge, consider this sentence: "He saw that gas can burn." Depending on the nature of the conversation, the word "can" might be either a noun (He saw a gas can burning, and it was that one.) or a verb (He saw that gas is able to burn.). A similar problem occurs with the phrase "rocking chair," which can refer either to a type of chair or a chair that is in the process of tilting back and forth. A source of pragmatic knowledge within a recognizer must be able to discern among these various meanings of symbol strings, and hence find the correct decoding.

The components of Peirce's abstract language model are in essence constraints on the way in which "sound symbols" may form complete utterances in the spoken form of a language. Implicit or explicit banks of linguistic knowledge resident in speech recognizers, sometimes called knowledge sources, can usually be associated with a component of Peirce's model. Among the first speech recognition systems to successfully use linguistic constraints on a grand scale was the HARPY system developed at Carnegie-Mellon University (CMU) (Lowerre and Reddy, 1980) as part of the ARPA Speech Understanding Project (Klatt, 1977). Among the principal investigators of the system was Professor D. Raj Reddy of CMU. Reddy, writing in the Proceedings of the IEEE in 1976 (Reddy, 1976), gives the following introduction to the use of knowledge sources in speech recognition: "[A] native speaker uses, subconsciously, his knowledge of the language, the environment, and the context in understanding a sentence. . . . [S]ources of knowledge include the characteristics of the speech sounds (phonetics), variability in pronunciation (phonology), the stress and intonation patterns of speech (prosodics), the sound patterns of words (lexicon), the grammatical structure of the language . . . (syntax),(11) the meaning of the words and sentences (semantics), and the context of the conversation (pragmatics)." The block diagram of a general speech recognizer showing these sources of knowledge at their appropriate levels of the hierarchy is shown in Fig. 10.4. Note that the "prosodics" knowledge source is shown to interact with both the language and acoustic processors. The acoustic processor is generally considered to be that segment of the recognizer which interfaces the acoustic waveform with the "intelligent" language sector of the recognizer by reducing the waveform to a parametric or feature representation. Since, however, the prosodic content of the utterance is intricately tied to the acoustic content of the waveform, it is difficult to segregate the function of the prosodic knowledge source from the acoustic processing.

We hasten to point out that Fig. 10.4 is a very general system diagram. It is clear from our discussion so far, and issues to be covered later, that speech recognition is an exceedingly complex problem. Accordingly, attempts to solve the problem are manifold and diverse, resulting in numerous and various hardware and software systems. Further, as we have also noted, different speech processing applications require differing degrees of recognition capability, which leads to more diversity among recognizers. Figure 10.4 encompasses most existing systems, but there are probably exceptions. It is certainly the case that not all real-world systems will have all the features shown in the figure.

Existing speech recognizers can be classified into two broad categories, which indicate the direction of the "flow of information" in Fig. 10.4. If the acoustic processing is used to hypothesize various words, phones, and so on, and then these hypotheses are "processed upward" to see if they

(11) In this sentence, Reddy has used the word "grammar" in the more conventional way (learned in primary school), in which it is equivalent to syntax.
[Figure 10.4 appears here. Only its labels are recoverable: the knowledge sources (pragmatic, semantic, syntactic, lexical, and phonological knowledge) on one side; the correlate Peircian model components (pragmatics, semantics, syntax, lexicon, symbols) on the other; the intermediate message and meaningful sentence hypotheses; and the acoustic waveform at the bottom.]

FIGURE 10.4. Block diagram of a general speech recognizer showing the acoustic and linguistic processors. To the left are the knowledge sources placed at the appropriate level of decision-making in the language hierarchy. To the right are the correlate components of the abstract Peircian language model. The illustrated system uses phones as the basic symbols of the language. LD = linguistic decoder; AD = acoustic decoder.

can be pieced together in a manner following the "higher level" rules, then the system is said to operate in a bottom-up mode. The earlier recognition systems, including HARPY, employ bottom-up processing. Roughly speaking, top-down processing begins with sentence hypotheses being posed at the highest levels of the processor. These hypotheses are then scrutinized at each of the lower levels for likelihood of representing the spoken utterance, each level calling on the next lower level to provide information that it uses in its assessment. Ultimately, the acoustic processor is called upon to ascertain whether the acoustics are consistent with the lowest hypothesized abstract symbols; for example, a set of phones. Top-down processing requires much more complex and computationally intensive processing systems than does bottom-up processing. In the early 1990s, researchers began to focus upon techniques that employ a combination of the two types of processing. The theory and practical applications of each of these strategies will be discussed extensively in Chapter 13.

In summary, sources of knowledge will constrain the recognition process and help convert an unmanageably complex decoding process into a tractable one. We will discuss aspects of language modeling at several key points in this part of the book, with a formal introduction to language modeling in Chapter 13. However, our treatment of the highest levels of linguistic constraints, semantics and pragmatics, will be only superficial, since they do not lend themselves well to formal discussion (at least in conventional engineering ways). When we describe certain recognition systems in Chapter 13, we will point out some of their attempts to employ these levels of the hierarchy.

… words. As the term implies, acoustically ambiguous words are those that are indistinguishable in their spoken renditions: "know" and "no"; and "two," "to," and "too" are sets of examples. In terms of our formal discussion of language models, we can say that these words consist of the same linguistic symbols. At an acoustic level, therefore, these words are indistinguishable, unless they can be resolved through prosodic subtleties. Ordinarily, higher levels of the recognizer would be called upon to make the correct recognition.

On the other hand, confusability refers to the extent to which words can be easily confused because of partial acoustic similarity. The words for the digits 0 through 9 are rather dissimilar acoustically, the most confusable being "one" and "nine" (because of the final nasal) and "five" and "nine" (because of the strong diphthong). The vocabulary consisting of the words for the letters of the alphabet, however, is highly confusable, primarily because of the set B, C, D, E, G, P, T, V, but also because of the sets F, S, X; A, H, J, K; I, Y; and so on. In each case, the utterances are only discernible by correct recognition of nonvowel phonemes, which are relatively weak in contrast to the vowels. Whereas resolving confusability can be assisted at higher levels of processing in the recognizer, this problem is theoretically solvable at the acoustic level, and there is no substitute for a high-quality acoustic front end in this regard.
… is a meaningful sentence.

    9 / 2 + 7 + 1 - 1 / 4 + 5 - 5 + 5 + 0 + 0 + 0 - 0 + 2 - 1 - 1    (10.2)

Pragmatic: When "speaking" to a child, each word in a sentence is usually no more than five digits long. There are no constraints for adult listeners.

(a) Determine whether each of the following sentences is lexically, syntactically, and semantically proper.
    (i) 2 + 3 + 4 / 1 + 8 + 0 + 0 / 7 - 5 - 5 - 5 + 8 + 9 / 0 + 1 - 1 + 2 - 2 + 3 - 3 + 4 - 4 + 5 - 5 + 6 - 6 + 7 - 7 + 8 - 8 + 9.
    (ii) 2 + 3 + 4 / 1 + 8 + 0 / 7 - 5 - 5 - 5 + 8 + 9 / 0 + 1 - 1 + 2 - 2 + 3 - 3 + 4 - 4 + 5 - 5 + 6 - 6 + 7 - 7 + 8 - 8 + 9.
    (iii) 2 + 3 + 4 / 1 + 8 + 0 + 0 / 3 + 7 - 5 - 5 - 5 + 8 + 9 / 0 + 1 - 1 + 2 - 2 + 3 - 3 + 4 - 4 + 5 - 5 + 6 - 6 + 7 - 7 + 8 - 8 + 9.
    (iv) 2 + 3 + 4 / 1 + 8 + 0 + 0 / 7 - 5 - 5 - 5 + 8 + 9 / 1 - 1 + 2 - 2 + 3 - 3 + 4 - 4 + 5 - 5 + 6 - 6 + 7 - 7 + 8 - 8 + 9.

(b) Suppose that you are the linguistic analyzer inside of a speech recognizer. An utterance known to have been read from a children's story has the following representation in the recognizer:

    1 + 8 - 1 / 1 + 2 + 7 + 5 / 3 - 6 + 1 - 1 - 5 + 8.    (10.3)

The proper representation of this sentence has been corrupted by the possible addition of one extra digit and sign (or sign and digit) at the beginning and/or end of each word. Find the correct representation, explaining specifically how the sources of linguistic knowledge are used.

10.3. During the course of a radio newscast, the announcer who is discussing the political situation in the Middle East exclaims, "Turkey's role in the Persian Gulf following these messages." Explain how a speech recognizer might deduce an absurd translation of this utterance, and the type of processing that could be used to discern the correct pronouncement. What facts might you program into the processor that would prevent the amusing translation?

10.4. Consider the digits to be symbols in a language of telephone numbers in some geographic region (your city, state, province, or country) whose possible phone numbers are familiar to you. Write a grammar for the language. What is the implication of restricted length(s) of the symbol strings for the success of recognition?

CHAPTER 11

Dynamic Time Warping

Reading Notes: The concept of distances among feature vectors will play an important role in this chapter. The reader might wish to review Sections 1.3.1 and 5.3.5.

11.1 Introduction

In this chapter we begin in earnest our study of the technical methods used in speech recognition. There are two basic classes of methods upon which almost all contemporary speech recognition algorithms using sequential computation are based. The first class that we study in this chapter is based on a form of template matching. These methods draw heavily upon conventional feature-based approaches developed for general statistical pattern recognition problems. Accordingly, we will be able to make use of our general background along these lines from Section 1.3. However, the speech problem has an interesting and important nuance that does not arise in all template-matching problems. This is the need to appropriately temporally align the features of the test utterance with those of the reference utterance before computing a match score. To solve this problem, we will exploit the principles of "dynamic programming," a subject which is taken up first in the chapter. Because one feature string is "warped" (stretched or compressed in time) to fit the other, and because dynamic programming is used to accomplish this task, the class of feature-matching approaches used in speech recognition is often referred to as dynamic time warping (DTW).

Dynamic time warping has been successfully employed in simple applications requiring relatively straightforward algorithms and minimal hardware. The technique had its genesis in IWR, but has also been applied to CSR using the connected-speech strategy. Since DTW requires a template (or concatenation of templates) to be available for any utterance to be recognized, the method does not generalize well (to accommodate the numerous sources of variation in speech) and it is not generally used for complex tasks involving large vocabularies. It is also not used for CSR except in the connected-speech paradigm.(1)

(1) These comments apply to DTW as we shall study it in this chapter. They are not true for the "hidden Markov model" that we study in Chapter 12 and which may be considered a stochastic form of DTW.
624 Ch. 11 / Dynamic Time Warping    11.2 / Dynamic Programming 625
We will examine the several facets of DTW in this chapter, and then move on to study the second general class of methods based on a "stochastic" approach, the "hidden Markov model," in Chapter 12. We will find that the hidden Markov model can be considered a generalization of DTW; accordingly, it is also heavily based on dynamic programming methods. In turn, many "higher level" problems in speech recognition are based on the theory of hidden Markov modeling. Careful attention to detail in this chapter, therefore, will pay off in much of our future study.

[A figure appears here showing the i-j search grid with terminal node (I, J); only its axis labels are recoverable.]
noting that this cost is apparently "Markovian" in its dependence on the immediate predecessor node only.(2) For consistency, we will always assume that d_T[.] is a nonnegative quantity, and that any transition originating at (0, 0) is costless. This latter assumption usually means

    d_T[(i, j) | (0, 0)] = 0,  for all (i, j),    (11.3)

although variations may occur.

In the Type N case, costs are associated with the nodes themselves, rather than with the transitions among them. Let us define the notation

    d_N(i, j) ≝ cost associated with node (i, j)    (11.4)

for any i and j. In general, we will choose d_N(., .) to be nonnegative, and we will insist that the problem be set up so that node (0, 0) is costless. Usually, this means

    d_N(0, 0) = 0,    (11.5)

but variations occur; for example, d_N(0, 0) would be taken to be unity if the node costs were combined by multiplication.

The Type B case is that in which both transitions and nodes have associated costs. The transition and node costs are usually combined by addition at a given node,

    d_B[(i_k, j_k) | (i_{k-1}, j_{k-1})] ≝ d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] + d_N(i_k, j_k).    (11.6)

The most frequent exception to this case is when they are combined by multiplication,

    d_B[(i_k, j_k) | (i_{k-1}, j_{k-1})] ≝ d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] × d_N(i_k, j_k).    (11.7)

Note that in the multiplication case we would want d_T[(i, j) | (0, 0)] = 1 and d_N(0, 0) = 1 for a "costless" initiation.

Since most often we will be dealing with "Type B" distance quantities, for convenience we usually drop the subscript B and write simply

    d[(i_k, j_k) | (i_{k-1}, j_{k-1})] ≝ d_B[(i_k, j_k) | (i_{k-1}, j_{k-1})]    (11.8)

to refer to such a cost. When we need to specifically use a Type T or Type N distance, we shall use the appropriate subscript for clarity. However, note that there is no loss of generality in always using a Type B distance, since it subsumes the Type T and N varieties as special cases [see (11.6) and (11.7)].

The distance associated with a complete path is usually taken as the sum of the costs of the transitions and/or nodes along the path, which we can now express as

    D = Σ_{k=1}^{K} d[(i_k, j_k) | (i_{k-1}, j_{k-1})],    (11.9)

in which K is the number of transitions in the path and i_0 = 0, j_0 = 0, i_K = I, and j_K = J. The objective, therefore, is to find the path that minimizes D.(3) The most common variations on this problem include cases in which D is found by multiplication of the individual costs,

    D = Π_{k=1}^{K} d[(i_k, j_k) | (i_{k-1}, j_{k-1})],    (11.10)

and cases in which D is to be maximized rather than minimized. The methods described below are applicable with obvious modifications when either or both of these variations occur.

EXAMPLE (Traveling Salespeople)

Before proceeding, let us consider three examples in which the three cost-assignment paradigms would be used. Suppose that each grid point in the i-j plane represents a city on a map. In the first example, suppose that the transitional cost into a city is the time it takes to travel from the previous city. There are no restrictions on which cities may be visited from any given city. The intercity cost is independent of the history of travel prior to the current transition, so that the costs will be Markov. Also suppose that the time it takes to pass through a city is negligible, so that there are no costs assigned directly to the nodes. This, therefore, represents a Type T cost assignment to the grid. If a salesman wants to travel from city (0, 0) to (I, J) at the least cost (shortest time), we would find his path through the grid subject to minimization of D in (11.9), where, in this case, the distances d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] are the intercity distances.(4)

As an important aside, we should follow up on a point made in footnote 2. It is possible to use the transition costs to formally prevent certain predecessor nodes (or local paths) from being "legal." If, for example, there is no direct route from city (p, q) to (r, s), we could assign

    d[(r, s) | (p, q)] = ∞    (11.11)

(2) In fact, these transition costs may be non-Markovian in the following sense. Whether the transition into (i_k, j_k) from (i_{k-1}, j_{k-1}) is part of a "legal" path (in accordance with path constraints placed on some problems) may depend on the history of the path leading up to (i_{k-1}, j_{k-1}). As a practical matter, however, we can treat these quantities as Markov, making sure that our path-search algorithms do not seek out paths which are "illegal." Only in this case will we be able to consistently use Markov transition costs without creating theoretical conflicts in later developments. This point will become clearer when we actually discuss such algorithms.

(3) This total distance appears to omit the cost of the node (0, 0), but recall that we insist that d_N(0, 0) = 0.

(4) We should be careful to distinguish this problem, and the one to follow involving a saleswoman's journey through the grid, from what is often called the Traveling Salesman Problem (TSP). In the TSP, the salesman must visit all cities en route from (0, 0) to (I, J) and do so over the shortest path. The TSP is a much more difficult problem than those posed here. A further mention of the TSP will be found below.
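The cost conventions above can be made concrete with a short sketch. The following Python fragment is illustrative only: the cost functions and the example path are invented for the illustration and are not taken from the text; only the combination rules of (11.3), (11.5), (11.6), and (11.9) come from the discussion.

```python
# Sketch of the Type T / Type N / Type B cost conventions and the total
# path distance (11.9). All numeric values here are made up.

def d_T(node, pred):
    """Type T: cost of the transition pred -> node (hypothetical values)."""
    (i, j), (ip, jp) = node, pred
    if pred == (0, 0):
        return 0.0                      # (11.3): leaving the origin is costless
    return abs(i - ip) + abs(j - jp)    # e.g., an intercity travel time

def d_N(node):
    """Type N: cost attached to the node itself (hypothetical values)."""
    i, j = node
    if node == (0, 0):
        return 0.0                      # (11.5): the origin node is costless
    return 0.5 * j

def d_B(node, pred):
    """Type B: additive combination of transition and node costs, as in (11.6)."""
    return d_T(node, pred) + d_N(node)

def path_distance(path):
    """Total distance D of a complete path, accumulated by addition as in (11.9)."""
    return sum(d_B(path[k], path[k - 1]) for k in range(1, len(path)))

path = [(0, 0), (1, 1), (2, 1), (3, 2)]
print(path_distance(path))  # -> 5.0
```

Combining by multiplication as in (11.7) and (11.10) would only change the `+` in `d_B` to `*` and the `sum` to a product, with the origin costs set to unity.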
to formally prevent this transition from occurring on any path. However, we will explicitly avoid the use of these quantities for this purpose, assuming instead that illegal transitions are handled by the search algorithm without recourse to these costs. [The salesman knows that the bridge is out between cities (p, q) and (r, s) and does not even consider that transition.] To attempt to formally "block" certain transitions using infinite costs will cause us to have to resort to non-Markovian transition probabilities in later work, because a transition that may be "blocked" to a path coming from one "direction" may be permissible to a path coming from another. Indeed, this means that the probabilities must, in fact, be non-Markov, but prohibiting assignment (11.11) will allow us to work with the probabilities as though they were Markov.

On the other hand, suppose that a saleswoman has a telephone in her automobile, so that she can transact business while in transit from one city to the next.(5) In this case our model includes no transition costs (time in the car is not "costly"), but it does include costs associated with each city through which she passes. It is known in advance that at each city (i, j) a certain number of units of her product [d_N(i, j) ≥ 0], which were sold on a trial basis, will be returned. To find her optimal journey, we would find her least-cost path subject to minimizing D. This problem, of course, represents a Type N cost assignment to the grid.

… [Note that in this case, D would be accumulated by addition as in (11.9), but the individual d_B costs would be constructed by multiplication as in (11.7).] We will have an opportunity to use each of these DP problem types in our speech work.

Let us temporarily refocus our attention on the broader class of paths that need not start at the origin nor end at the terminal node. In general, let the path begin and end at arbitrary nodes (s, t) and (u, v), respectively. We define the notation

    (s, t) ⇝ (u, v)    (11.12)

to be the best path (in the sense of minimum cost) leading from (s, t) to (u, v). Also denote by

    (s, t) ⇝^(w,x) (u, v)    (11.13)

the best path segment from (s, t) to (u, v) which also passes through (w, x). In these terms, the BOP can be stated as follows (Bellman, 1957, p. 83).

BELLMAN OPTIMALITY PRINCIPLE

    (s, t) ⇝^(w,x) (u, v) = (s, t) ⇝ (w, x) ⊕ (w, x) ⇝ (u, v)    (11.14)

for any s, t, u, v, w, and x, such that 0 ≤ s, w, u ≤ I and 0 ≤ t, x, v ≤ J, where ⊕ denotes concatenation of the path segments.

The consequences of this result for efficient algorithm development are quite significant. In particular, it implies that

    (0, 0) ⇝^(i_{k-1}, j_{k-1}) (i_k, j_k) = (0, 0) ⇝ (i_{k-1}, j_{k-1}) ⊕ (i_{k-1}, j_{k-1}) ⇝ (i_k, j_k),    (11.15)

so that we need not re-enumerate complete paths en route to (i_k, j_k). It is sufficient to simply extend (0, 0) ⇝ (i_{k-1}, j_{k-1}) over the shortest path segment possible to reach (i_k, j_k). If we define

    D_min(i, j) ≝ distance from (0, 0) to (i, j) over the best path
               = distance associated with (0, 0) ⇝ (i, j)    (11.16)

and

    D_min[(i_k, j_k) | (i_{k-1}, j_{k-1})] ≝ distance of the best partial path to (i_k, j_k) through (i_{k-1}, j_{k-1})
               = distance associated with (0, 0) ⇝^(i_{k-1}, j_{k-1}) (i_k, j_k),    (11.17)

then, as a direct consequence of (11.15), we can write

    D_min[(i_k, j_k) | (i_{k-1}, j_{k-1})] = D_min(i_{k-1}, j_{k-1}) + d[(i_k, j_k) | (i_{k-1}, j_{k-1})].    (11.18)

(5) She calls from the previous city and warns of her arrival. The customer has this length of time to ponder the situation.
This expression describes the distance of the best path beginning at (0, 0) and (eventually) arriving at (i_k, j_k) from (i_{k-1}, j_{k-1}). The globally optimal path arriving at (i_k, j_k) can be found by considering the set of the best path segments arriving from all possible predecessor nodes and taking the one with minimum distance. The optimal path to (i_k, j_k) therefore has distance

    D_min(i_k, j_k) = min over (i_{k-1}, j_{k-1}) of { D_min(i_{k-1}, j_{k-1}) + d[(i_k, j_k) | (i_{k-1}, j_{k-1})] }.    (11.19)

Note that (11.19) relates the distance associated with (0, 0) ⇝ (i_k, j_k), but it does not tell us which nodes are on the path. In some problems it will be sufficient to simply know the distance of the shortest path, but in others we will wish to know precisely which path is the shortest one. A simple way to keep track of the nodes on paths is the following: Once the optimal partial path to (i_k, j_k) is found, we simply record the immediate predecessor node on the partial path, nominally at a memory location attached to (i_k, j_k). Suppose that we define

    Ψ(i_k, j_k) ≝ immediate predecessor node of (i_k, j_k) on the optimal partial path (0, 0) ⇝ (i_k, j_k),    (11.20)

which is recorded as

    Ψ(i_k, j_k) = argmin over (i_{k-1}, j_{k-1}) of { D_min(i_{k-1}, j_{k-1}) + d[(i_k, j_k) | (i_{k-1}, j_{k-1})] }.    (11.21)

The nodes on an optimal partial path can then be recovered by backtracking, beginning at (i_k, j_k). In particular, if (i_k, j_k) = (I, J), the terminal node in the grid, then the globally optimal path can be reconstructed. If we let (i*_k, j*_k) denote the kth index pair on the path (0, 0) ⇝ (I, J) (assumed to be K nodes long), then

    (i*_K, j*_K) = (I, J)
    (i*_{K-1}, j*_{K-1}) = Ψ(I, J) = Ψ(i*_K, j*_K)
    (i*_{K-2}, j*_{K-2}) = Ψ(Ψ(I, J)) = Ψ(i*_{K-1}, j*_{K-1})
    ⋮    (11.22)

[A grid diagram appears here showing node (i, j) and its neighbors (i-1, j), (i, j-1), and (i, j+1).]

FIGURE 11.2. A "difficult" grid search to perform by DP. For any "ordered search" of the processing of nodes (e.g., bottom to top along columns beginning from the left, right to left along rows beginning at the bottom) the predecessor nodes will not be available for extension.
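As a rough illustration of how (11.19), (11.21), and (11.22) work together, the following Python sketch runs the recursion on a small grid. The grid size, predecessor rule, and cost function are invented for the example; only the D_min recursion, the backpointer Ψ (here `psi`), and the backtracking scheme follow the text.

```python
# Minimal sketch of the recursion (11.19), backpointer recording (11.21),
# and backtracking (11.22). Grid size, costs, and allowed moves are
# illustrative assumptions only.

I, J = 3, 3

def d(node, pred):
    # hypothetical Type B-like cost: unit step cost plus a small node cost
    i, j = node
    return 1.0 + 0.1 * j

def predecessors(node):
    # assume moves of one unit east, north, or northeast (a layered grid)
    i, j = node
    cands = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
    return [(p, q) for (p, q) in cands if 0 <= p <= I and 0 <= q <= J]

D_min = {(0, 0): 0.0}
psi = {}
# process nodes in an order that guarantees predecessors are already done
for i in range(I + 1):
    for j in range(J + 1):
        if (i, j) == (0, 0):
            continue
        # (11.19)/(11.21): minimize over all legal predecessors
        D_min[(i, j)], psi[(i, j)] = min(
            (D_min[p] + d((i, j), p), p) for p in predecessors((i, j)))

# backtrack from the terminal node (I, J) as in (11.22)
node, path = (I, J), [(I, J)]
while node != (0, 0):
    node = psi[node]
    path.append(node)
path.reverse()
print(D_min[(I, J)], path)
```

With these made-up costs the cheapest route is the three-step diagonal, since every extra step adds at least a unit of cost.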
times required to break the problem down into stages of decisions that follow a pattern yielding a useful recurrence relation.(6) Once found, the memory and computational requirements of the algorithm are sometimes extraordinary, as the number of predecessors to keep track of at each stage grows exponentially. However, we do not wish to overstate the problem. There is nothing that invalidates the BOP when the search becomes complicated, and a DP algorithm can be a powerful tool that is often far superior to algorithms based on more naive considerations. [For example, the TSP can be solved using O(IJ2^{IJ}) flops, whereas a complete enumeration of all paths in the grid requires O((IJ - 1)!).]

Although challenges may arise in structuring DP searches, there is often a structure to a sequential decision problem that imposes neat constraints on which nodes may follow others. Fortunately, this will be the case in every attempt to employ DP in our speech work, as we will be dealing with grids that are close to what have been called layered graphs (Papadimitriou and Steiglitz, 1982, Sec. 18.6). Generally speaking, this means that the nodes can be lined up in columns (layers) in such a way that there are no "backward" transitions (from right to left) in the grid. These constraints will have a significant impact on the DP algorithm we develop and its computational complexity, in most cases making the algorithm very easy to construct and inexpensive to compute.

EXAMPLE DP ALGORITHM (Traveling Salespeople Revisited)

Before proceeding, let us consider an example of how structured paths can yield very simple DP algorithms. We return to the problem of the salesman trying to map a shortest path through the grid of cities. If the salesman is required to drive eastward (in the positive i direction) by exactly one unit with each city transition, this lends a special structure to the problem. In fact, this means that any "legal" predecessor to node (i_k, j_k) will be of the form (i_k - 1, j_{k-1}). Only knowledge of a certain set of optimal paths must be available (those in the "column" just to the "west" of i_k) in order to explore extension to (i_k, j_k). Figure 11.3 illustrates this idea. An example DP algorithm that finds the optimal path (0, 0) ⇝ (I, J) for this problem is shown in Fig. 11.4. The algorithm proceeds by successively extending, on a column-by-column basis, all possible path segments(7) in the grid according to the inherent prescription above. Each time a path segment is optimally extended, the predecessor node is recorded. Eventually, all paths that lead to (I, J) from predecessor nodes of the form (I - 1, p) are found, and the globally optimal path is selected from among them according to (11.19). The best path can then be reconstructed, if desired, by backtracking.

[A grid diagram appears here.]

FIGURE 11.3. If all transitions on all paths must move exactly one unit to the "east" in the grid, then only optimal paths to predecessor nodes (i - 1, j'), for all j', must be known to find the optimal path to (i, j). The search policy should naturally be to complete paths along columns (nodes along a given column can be treated in any order), proceeding from left to right. The only memory requirement is a location to hold the minimum distance, and one for the optimal predecessor, at each (i, j).

FIGURE 11.4. Example dynamic programming algorithm for the "eastbound salesman problem."

    Initialization: "Origin" of all paths is node (0, 0).
        For j = 1, 2, . . . , J
            D_min(1, j) = d[(1, j) | (0, 0)]
            Ψ(1, j) = (0, 0)
        Next j

    Recursion: For i = 2, 3, . . . , I
        For j = 1, 2, . . . , J
            Compute D_min(i, j) according to (11.19).
            Record Ψ(i, j) according to (11.21).
        Next j
    Next i

    Termination: Distance of optimal path (0, 0) ⇝ (I, J) is D_min(I, J).
        Best path is found by backtracking as in (11.22).

(6) One trick that is sometimes useful is to start at the terminal node and work backward. The BOP says that an optimal path to (u, v) through (w, x) must terminate in an optimal path from (w, x) to (u, v). Therefore, one can start at (I, J) and find all predecessor nodes; then find all optimal paths to those predecessors; and so on.

(7) Those that can reach (I, J).
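The algorithm of Fig. 11.4 is simple enough to transcribe directly. The sketch below assumes a hypothetical intercity cost function (moving between rows costs extra time, an invented choice); everything else follows the Initialization/Recursion/Termination structure of the figure.

```python
# Direct transcription of the "eastbound salesman" DP of Fig. 11.4:
# every transition moves one unit east, so the predecessors of (i, j)
# are the nodes (i-1, j') in the previous column.

def eastbound_dp(I, J, d):
    """d(node, pred) -> transition cost; returns (D_min(I, J), optimal path)."""
    D = {}
    psi = {}
    # Initialization: "origin" of all paths is node (0, 0)
    for j in range(1, J + 1):
        D[(1, j)] = d((1, j), (0, 0))
        psi[(1, j)] = (0, 0)
    # Recursion: extend column by column, as in (11.19) and (11.21)
    for i in range(2, I + 1):
        for j in range(1, J + 1):
            D[(i, j)], psi[(i, j)] = min(
                (D[(i - 1, jp)] + d((i, j), (i - 1, jp)), (i - 1, jp))
                for jp in range(1, J + 1))
    # Termination: backtrack from (I, J) as in (11.22)
    node, path = (I, J), [(I, J)]
    while node != (0, 0):
        node = psi[node]
        path.append(node)
    return D[(I, J)], path[::-1]

# hypothetical intercity travel times: one unit east plus a penalty per row change
cost = lambda node, pred: 1.0 + abs(node[1] - pred[1])
best_D, best_path = eastbound_dp(3, 3, cost)
print(best_D, best_path)
```

Note that the search touches each of the IJ nodes once and examines J predecessors per node, so the complexity is O(IJ^2), in contrast to the exponential enumeration of all paths.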
11.3 Dynamic Time Warping Applied to IWR

11.3.1 DTW Problem and Its Solution Using DP

With a firm grasp of the principles of DP, we are now prepared to apply these ideas to our first speech recognition technique.

Dynamic time warping (DTW) is fundamentally a feature-matching scheme that inherently accomplishes "time alignment" of the sets of reference and test features through a DP procedure. By time alignment we mean the process by which temporal regions of the test utterance are matched with appropriate regions of the reference utterance. In this section we will focus on discrete utterances as the unit to be recognized. For the sake of discussion, we will assume the usual case in which these discrete utterances are words, although the reader should keep in mind that there is nothing that theoretically precludes the use of these principles on subword units, or even complete paragraphs (if such is to be considered a "discrete utterance"). The need for time alignment arises not only because different utterances of the same word will generally be of different durations, but also because phonemes within words will be of different durations across utterances.

Generally speaking, DTW is used in IWR to match an incoming test word (represented by a string of features) with numerous reference words (also represented by feature strings). The reference word with the best match score is declared the "recognized" word. In fairness to the correct reference string, the test features should be aligned with it in the manner that gives the best matching score, to prevent time differences from unduly influencing the match.⁸

Early attempts at compensating for time differences among words consisted of simple linear expansion or compression of the time axis of the test utterance. This procedure is highly dependent upon the correct determination of endpoints and makes no attempt to align intraword phonetic events. In an effort to compensate for this latter deficiency, some researchers tried to line up identifiable events within the test and reference utterances by using energy measures [e.g., (Pruzansky, 1963)]. These techniques were the predecessors to DTW in the sense that they performed a nonlinear mapping of the test utterance onto the reference utterance.

This subject is frequently introduced by asserting that the DTW paradigm offers a systematic method of finding a nonlinear mapping, or "warping," of the time axis of the test utterance onto that of the reference utterance, which effects an optimal match in a certain sense. The name "DTW" derives in part from this apparent quest for an optimal warping function. The problem with placing too much emphasis upon the pursuit of a warping "function" is that frequently the mapping turns out not to be a function at all. In this case an ad hoc strategy is used to maintain the semblance of a pursuit of mappings. All of this eventually is boiled down to a simple DP algorithm, with the mapping inherent in the optimal path found and serving only as a diversion from the main issues. Therefore, we approach this topic by avoiding the usual starting point and going directly to the DP problem. We begin by looking at a simple version of this technique, then add specific details.

⁸On the other hand, the reader might wonder about the possibility of warping the test word in such a way that it fits an incorrect reference string quite well. For example, "spills" could be time-warped in such a way that it matches "pills" very well. Measures will be taken to prevent such anomalies.

Suppose that we have reduced both a test and a reference utterance of a word to strings of features extracted from the acoustic speech waveform. For example, in each case we might compute 14th-order LP vectors on frames of length N = 128 that are shifted by 64 points before each computation. Let us denote the test utterance LP vectors by a(m) and those of the reference utterance by b(m), where m, as usual, denotes the end point of the frame in each case. If the test waveform is 1280 samples long, and the reference waveform is 1536 samples long (each assumed to begin at n = 0), this procedure will result in the following feature strings:

    test features: a(127), a(191), a(255), ..., a(1279)    (11.23)

    reference features: b(127), b(191), b(255), ..., b(1535).    (11.24)

In a second example, suppose that we compute 10 cepstral coefficients on the same two signals, in this case using frames of length N = 256 that are shifted by 128 points each time. Let us call the vector of test cepstral coefficients c(m), and the vector of reference coefficients d(m). In this case we have the following feature strings:

    test features: c(255), c(383), c(511), ..., c(1279)    (11.25)

    reference features: d(255), d(383), d(511), ..., d(1535).    (11.26)

Whether we are dealing with the LP parameters, cepstral parameters, or some other feature strings from some other problem, let us reindex the strings so that they are indexed by simple integers and refer to them as follows:

    test features: t(1), t(2), t(3), ..., t(i), ..., t(I)    (11.27)

    reference features: r(1), r(2), r(3), ..., r(j), ..., r(J).    (11.28)

It is clear that the indices i and j are only related to the original sample times in the acoustic waveform through knowledge of the frame end times in the analysis. Nevertheless, it is customary to refer to the i and j axes upon which we will lay out our features as "time" axes.
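The frame bookkeeping behind (11.23) through (11.28) can be sketched directly. A minimal illustration, using the text's first example (N = 128, shift 64, a 1280-sample test and a 1536-sample reference); the function name and list representation are ours, not the text's:

```python
# Frame end times m for the feature strings of (11.23)-(11.28).

def frame_end_times(n_samples, frame_len, shift):
    """End point m of each analysis frame (0-indexed samples)."""
    return list(range(frame_len - 1, n_samples, shift))

test_ends = frame_end_times(1280, 128, 64)  # 127, 191, ..., 1279
ref_ends = frame_end_times(1536, 128, 64)   # 127, 191, ..., 1535

# Reindexing by simple integers as in (11.27)-(11.28): t(i) is the feature
# vector computed on the frame ending at test_ends[i-1], and r(j) the one
# ending at ref_ends[j-1].
I, J = len(test_ends), len(ref_ends)
```

With these lengths, the test string has I = 19 elements and the reference string J = 23, which are the index ranges of the i and j "time" axes.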
We now develop the formal DTW problem and show how it can be quickly solved using a DP algorithm. Our objective is to match the test and reference features so that they are appropriately aligned. By this procedure we mean that the features are matched pairwise so that the best global matching score for the two strings is obtained. In order to quantify the match score, let us denote the "local" cost of the kth pairing by

    0 ≤ d_N(i_k, j_k) = cost of matching t(i_k) with r(j_k).    (11.29)

This notation, of course, is a foreshadowing of the fact that we are setting up a DP search problem with (at least) Type N cost assignments to the grid. In the example above, in which the features consist of LP vectors, a likely cost function would be one of the LP-based distance measures discussed earlier. The total cost of a pairing of the two strings is then

    D = Σ_{k=1}^{K} d_N(i_k, j_k).    (11.32)

Associated with any pairing of the two feature strings, say,

    t(i_k) matched with r(j_k), for k = 1, 2, ..., K,    (11.33)

is the path

    (i_k, j_k), for k = 1, 2, ..., K.    (11.34)

In particular, associated with the optimal feature pairing, say,

    t(i_k*) matched with r(j_k*), for k = 1, 2, ..., K*,    (11.35)

is the path

    (i_k*, j_k*), for k = 1, 2, ..., K*.    (11.36)

FIGURE 11.5. Test and reference feature vectors associated with the i and j coordinates of the search grid, respectively.

Now to node (i, j) in the grid we assign the cost d_N(i, j) as in (11.29). The cost of any path of form (11.34) is naturally D of (11.32). Therefore, (11.36), which corresponds to a minimum of D, represents the minimum-cost, or shortest-distance, path of form (11.34) through the grid. We have therefore reduced our feature mapping problem to a shortest-distance path search through the i-j plane, and should feel quite confident that we can solve the problem using DP given the right constraints on the search.

An enhancement to the shortest-path problem is appropriate before continuing. Although the Type N problem setup above captures the essence of the DTW problem, it is often desirable to solve a Type B problem instead. The reason is as follows: In making the transition into, say, node (i_k, j_k), we may wish to assign a local cost based not only on the pairing of features t(i_k) with r(j_k), viz., d_N(i_k, j_k), but also upon the "trajectory" of the path taken to enter node (i_k, j_k). (Think about the effect of the manner of the transition upon the warping of the two feature strings.) Accordingly, we might wish to use

    d[(i_k, j_k) | (i_{k-1}, j_{k-1})] ≝ d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] × d_N(i_k, j_k)    (11.37)

as a measure of cost at node (i_k, j_k). In this case (11.32) becomes

    D = Σ_{k=1}^{K} d[(i_k, j_k) | (i_{k-1}, j_{k-1})].    (11.38)

In fact, it is prudent to normalize the final distance measure so that paths of different lengths receive an equal opportunity to be optimal. (Imagine, for example, the correct word being penalized simply because its test string of features is longer than that of an incorrect match.) One rational method of normalization is to express D = D_min(I, J) on an "average cost per node" basis. A moment's reflection will indicate that the appropriate calculation is to divide D of (11.38) by the sum of the transition costs,

    D̄ ≝ [ Σ_{k=1}^{K} d[(i_k, j_k) | (i_{k-1}, j_{k-1})] ] / [ Σ_{k=1}^{K} d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] ].    (11.39)
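The Type B combination (11.37) and the normalization (11.39) can be illustrated for a single candidate path. Everything below is a sketch with stand-in choices, not the text's definitions: d_N is taken to be a Euclidean distance between feature vectors, and d_T the number of test frames advanced, one of the transition cost types discussed later in this section.

```python
import math

def d_N(t_vec, r_vec):
    # Stand-in node cost: Euclidean distance between feature vectors.
    return math.dist(t_vec, r_vec)

def d_T(node, prev):
    # Stand-in transition cost: number of test frames advanced.
    return node[0] - prev[0]

def path_cost(path, t, r):
    """Total cost D of (11.38) and normalized cost of (11.39) for one path.

    path -- list of (i, j) node pairs, 1-indexed, excluding the origin.
    t, r -- test and reference feature strings.
    """
    total, trans_total = 0.0, 0.0
    prev = (0, 0)                            # formal origin of the search
    for (i, j) in path:
        dT = d_T((i, j), prev)
        total += dT * d_N(t[i - 1], r[j - 1])  # Type B combination, (11.37)
        trans_total += dT
        prev = (i, j)
    return total, total / trans_total
```

Because each term of (11.38) is a node cost weighted by a transition cost, dividing by the summed transition costs in (11.39) yields a genuine weighted average cost per node.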
We will discover later that the ability to incorporate this normalization with a DP solution is not always possible.

Finally, before pursuing the details of the DP problem, let us elaborate on a point made at the outset of this discussion. There we noted that the DTW problem is often introduced in a somewhat different manner in tutorial material and in the early literature on the subject [e.g., (Myers et al., 1980)]. It is often stated that the objective of the DTW problem is to find an optimal warping function, say w(·), which maps the i axis onto the j axis in a manner that best aligns the features in the sense of minimizing D of (11.32). The warping function, therefore, nominally creates a relation of the form

    j = w(i).    (11.40)

We can now appreciate that the difficulty with this approach is that the optimal mapping is not always functional: more than one r(j) may be associated with a single t(i). This, for example, will occur if the test string is shorter than the reference string,⁹ and it is mandated that every r(j) must be associated with some t(i). The analytical method proposed to resolve this problem in (Myers et al., 1980), for example, involves two mapping functions, one for each of the i and j axes, onto a third axis. The indices along this third axis may be considered as integers that count the nodes in the path (like k above); hence, in essence, the mappings are simply used to create node pairs. Whether this method or some related interpretation of the "generalized" warping procedure is used [e.g., (Parsons, 1986, p. 298)], the problem quickly boils down to a path-search problem of the type we have described above. Therefore, we have simply chosen to begin with the path-search problem to avoid unnecessary details.

⁹It might occur to the reader to simply reverse the roles of the reference and test axes in this case to preserve the functional form of w(·). However, this cannot always be done without some effect on the recognition rate (Myers et al., 1980).

We now proceed to the issue of solving the DTW search by DP.

11.3.2 DTW Search Constraints

In Section 11.2, we discussed in some detail the effects of search constraints on the eventual form of a DP algorithm. Grid searches in DTW problems are usually very highly structured, both to limit the amount of computation and to assure the appropriateness of regions matched between the test and reference strings. Let us concentrate on this latter issue first, then come back and discuss computational complexity at the end of the section.

The search of the grid is usually subject to four basic types of constraints having to do with physical arguments about the data and reasonable matching. These are discussed in the following subsections.

Endpoint Constraints and "Word Spotting." In some approaches to DTW, the endpoints of the test and reference strings are assumed to match to a reasonable degree. In others, the endpoints are assumed to be virtually unknown and are found inherently in the DTW search.

The strictest form of endpoint constraints in a DTW algorithm requires that the endpoints match exactly:

    t(1) must be paired with r(1) on any candidate path    (11.41)

    t(I) must be paired with r(J) on any candidate path.    (11.42)

Items (11.41) and (11.42) simply imply that any path we examine must begin at (1, 1) and end at (I, J). Recall that we formally let the DP search originate at a fictitious (0, 0) in the grid. The requirement that the path "begin at (1, 1)" here simply means that the only allowable transition out of (0, 0) will be to node (1, 1). It is also important to recall that we insisted in earlier discussions that any Type N assignment of costs to the grid include a zero cost for the node (0, 0). The relevance of this requirement in the present situation should be evident. The reader is encouraged to return to (11.19) and (11.20) and note the effect of these endpoint constraints on the initialization of the recursion.

A much less constrained approach (Bridle and Brown, 1979) uses the DTW search itself to automatically locate endpoints in a feature string by finding the candidate set of points (beginning and end) that yields the best match. This technique is sometimes called continuous scanning or simply the Bridle algorithm. Word spotting is the process of automatically determining the presence of a word (i.e., the features representing the word) in the context of a longer test string of features that represent, in general, a multiword utterance. However, word spotting can also be used to locate a single word whose endpoints are unknown. As illustrated in Fig. 11.6, suppose we lay the test string (length I) out along the abscissa in the customary manner, and place a single-word reference template along the ordinate. For simplicity, let us assume that the endpoints of the reference template are known exactly. Now suppose we allow DTW search paths to begin at any point along the test axis, and to end at any point along the "top" of the search space. Of course, any acceptable path will need to adhere to reasonable constraints such as monotonicity and minimum word length (see below). A moment's thought will indicate that this liberal beginning- and ending-point policy corresponds to highly unconstrained endpoints for the test word. In other words, we do not know where in the test string the hypothesized (reference) word might reside, so we are willing to try any reasonable sets of beginning and ending points. A path with a favorable cost, say between test times i' and i'' (see Fig. 11.6), may be considered to be the result of matching the proper segment of the test string (i.e., between its correct endpoints) to the reference string. This results in the reference word becoming a candidate for the recognized word, and also a set of candidate endpoints for the test word. We will see this process used inherently in the "one-stage" algorithm, and more explicitly when we discuss the "grammar-driven connected-word recognition" system, in Section 11.4.4. Many of the formal details of the procedure will be presented in the context of the one-stage algorithm.

Methods exist which are somewhat intermediate between the assumption of known endpoints and the use of word spotting. In Section 10.2.3, we discussed the difficulties encountered in locating the beginning and end of a discrete utterance by direct methods. Relaxing the endpoint constraints on a DTW grid search offers one method for making the algorithm's performance less dependent upon the precise location of these endpoints without complete recourse to the Bridle approach. One method of relaxing endpoint constraints simply consists of "opening up the ends" of the search region of the grid. For example, as illustrated in Fig. 11.7, the initial transition of the path [which for formal purposes is still anchored at (0, 0)] is now permitted to arrive at any of the series of nodes (1, 1) to (1, 1 + ε) in the vertical direction (flexibility in the reference direction), and (1, 1) to (1 + ε, 1) in the horizontal (flexibility in the test direction). Similar flexibility is also found at the other end of the search; Fig. 11.7 indicates the potential endpoint nodes at each end of the search. This method has been referred to as UELM [for unrestricted endpoint, local minimum].

FIGURE 11.7. "Relaxing" the endpoint constraints at both ends of the grid search allows for some uncertainty in the initial and final points in time for both the reference and test waveforms. Regions A and D correspond to uncertainty in the reference string beginning and end, while B and C correspond to uncertainty in the test string endpoints.

Monotonicity. The path should be monotonic. This means that

    i_{k-1} ≤ i_k and j_{k-1} ≤ j_k,    (11.43)
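Mechanically, the monotonicity requirement (11.43) and the ε-relaxed endpoints are simple predicates on a candidate path. A sketch under assumptions of ours (a path given as an explicit node list; relaxation applied symmetrically at both ends), not the text's formal machinery:

```python
# Sketch: legality predicates for a candidate DTW path under the
# monotonicity constraint (11.43) and epsilon-relaxed endpoints.

def is_monotonic(path):
    """path is a list of (i, j) pairs; (11.43): i and j never decrease."""
    return all(i1 <= i2 and j1 <= j2
               for (i1, j1), (i2, j2) in zip(path, path[1:]))

def endpoints_ok(path, I, J, eps=0):
    """Start within (1,1)..(1+eps,1) or (1,1)..(1,1+eps); end near (I,J)."""
    (i0, j0), (iK, jK) = path[0], path[-1]
    start_ok = (i0 == 1 and j0 <= 1 + eps) or (j0 == 1 and i0 <= 1 + eps)
    end_ok = (iK == I and jK >= J - eps) or (jK == J and iK >= I - eps)
    return start_ok and end_ok
```

In practice neither predicate is checked on finished paths; both are built into the transitions and the initialization/termination of the DP recursion, as the following subsections describe.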
These inequalities require, in turn, that any candidate path not go "south" or "west" at any time.¹⁰ Physically, this requires, for example, that features of the test waveform must never be matched to features in the reference waveform that are earlier in time than those already matched. This prevention of the path from doubling back on itself is critical to prevent high match scores from very inappropriate warpings. An example is shown in Fig. 11.8.

¹⁰The reader should note carefully that there is no implication that i_k = i_{k-1} + 1 or that j_k = j_{k-1} + 1.

FIGURE 11.8. The monotonicity constraint requires that the search path not make a transition in the negative i or negative j direction at any time. Moving "southward" (transition A in the figure) causes a past reference vector, r(j), to be reused, and moving "westward" (transition B) causes a past test vector, t(i), to be reused. Preventing the path from doubling back on itself is critical to preventing high match scores from very inappropriate warpings.

Global Path Constraints. A widely used global search region is due to Itakura, in which the compression and expansion factors are each restricted to two. A parallelogram region follows from the following arguments. Worst-case paths (compression factor two, slope ½, and expansion factor two, slope two) are drawn beginning at the tied endpoint (1, 1). It is clear that these paths will generally not intersect with the other obligatory endpoint (I, J) unless they make very sharp "turns" at the top or right boundary of the grid and follow trajectories that are in serious violation of the compression or expansion limits. Therefore, noting that any path which enters (I, J) must not represent compression or expansion of more than two over a long range of test points, worst-case paths are drawn entering (I, J). The interior and boundary nodes of the parallelogram formed by the intersection of the four worst-case paths are deemed appropriate nodes for search.

FIGURE 11.9. The Itakura global path search constraints for maximum compression and expansion factors of two. The method for constructing the parallelogram is evident from the figure and is formally described in the text.

Note that the Itakura parallelogram degenerates to the single linear path connecting (1, 1) with (I, J) when J = cI or I = cJ, where c is the maximum allowable compression or expansion factor. The parallelogram will allow the exploration of the most paths when I ≈ J. When c = 2 and I ≈ J, then about J²/3 grid points are used, implying that J²/3 costs of the form (11.29) need to be computed. This is to be compared with about J² to be computed if the entire grid is searched. We see, then, that search constraints reduce computation as well as place physically reasonable boundaries on the matching process.
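The parallelogram construction above reduces to four inequalities on (i, j), so membership is easy to test. A sketch under our own formulation (a node must be reachable from (1, 1) and able to reach (I, J) along paths of local slope between 1/c and c); the function name and counting demo are ours:

```python
# Membership test for the Itakura parallelogram with compression/
# expansion factor c = 2.

def in_itakura_parallelogram(i, j, I, J, c=2.0):
    lo_from_start = 1 + (i - 1) / c      # slope-1/c line out of (1,1)
    hi_from_start = 1 + c * (i - 1)      # slope-c line out of (1,1)
    lo_to_end = J - c * (I - i)          # slope-c line into (I,J)
    hi_to_end = J - (I - i) / c          # slope-1/c line into (I,J)
    return (lo_from_start <= j <= hi_from_start and
            lo_to_end <= j <= hi_to_end)

# For I = J the searched fraction of the grid approaches 1/3, matching
# the J^2/3 node count quoted in the text:
I = J = 99
count = sum(in_itakura_parallelogram(i, j, I, J)
            for i in range(1, I + 1) for j in range(1, J + 1))
```

Counting the admissible nodes for I = J = 99 gives roughly a third of the 99 × 99 grid, in line with the J²/3 figure.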
A second, simpler type of global search region is imposed by requiring that any node, say (i_k, j_k), on any path to be considered satisfy

    |i_k - j_k| ≤ W,    (11.44)

where W is called the "window width." This constraint generates a simple strip around the purely linear path, as shown in Fig. 11.10. The savings in computation resulting from this search region is explored in Problem 11.2.

FIGURE 11.10. A simple global search region that restricts the search to a region of width 2W + 1 around the purely linear path. Shown: W = 2.

Both the Itakura and windowed search regions illustrate the fact that global path constraints are often inseparably related to local path constraints, to which we now turn.

Local Path Constraints. Whereas global path constraints are used to restrict the amount of compression or expansion of the test waveform over long ranges of time, local path constraints are used to restrict the local range of a path in the vicinity of a given node in the grid. It is usual to specify the constraints by indicating all legal sets of predecessor nodes to a given node, but this is not universally the case. Myers et al. (1980), for example, have suggested the specification of successor nodes by posing a set of transitions (successors) to (i_{k-1}, j_{k-1}).

FIGURE 11.11. "Local constraints" on DTW path search. In each case is shown the "legal" local paths by which a global path may arrive at the node (i, j). Cases (a) to (d) are considered by Sakoe and Chiba, (h) by Itakura, and (b) and (e) to (h) by Myers et al. in papers cited in the text. In case (h), there is an explicitly forbidden path shown by ×. Also shown are the maximum and minimum global expansions of the test with respect to the reference waveform, E_max and E_min, which can result under the various local constraints. The relationship between local and global constraints is described in the text.

In some cases, this dependence goes back as far as three nodes from the node (i_k, j_k) under consideration. In principle, this means that our Type T cost assignments to the grid are no longer Markov. This is mildly disturbing, since all of our discussions thus far have been based on an assumption that decisions could be made sequentially without recourse to past nodes. The problem is resolved in a manner that preserves the Markov nature of the transition costs as follows: The BOP states that the best path from (s, t) to (u, v), which passes through (w, x), consists of the optimal path (s, t) → (w, x) concatenated with the optimal path (w, x) → (u, v). There is no requirement that (w, x) be an immediate predecessor node to (u, v) on the path. As a consequence, we can easily generalize (11.18) as

    D_min[(i_k, j_k) | (i_{k-p}, j_{k-p})] = D_min[(i_{k-p}, j_{k-p})] + d̂[(i_k, j_k) | (i_{k-p}, j_{k-p})],    (11.45)

where (i_{k-p}, j_{k-p}) is some legal ("distant") predecessor node to (i_k, j_k) which is p nodes back on the path of interest, and where

    d̂[(i_k, j_k) | (i_{k-p}, j_{k-p})] ≝ Σ_{m=0}^{p-1} d[(i_{k-m}, j_{k-m}) | (i_{k-m-1}, j_{k-m-1})].    (11.46)

(Note that local path constraints should be defined so that a unique path exists between any "(i_{k-p}, j_{k-p})" and (i_k, j_k), so that no ambiguity exists in computing d̂. This is true, for example, of all constraints shown in Fig. 11.11.) To find the optimal path to (i_k, j_k), therefore, we simply take the minimum over all distant predecessors,

    D_min(i_k, j_k) = min over "(i_{k-p}, j_{k-p})" { D_min[(i_k, j_k) | (i_{k-p}, j_{k-p})] }
                   = min over "(i_{k-p}, j_{k-p})" { D_min[(i_{k-p}, j_{k-p})] + d̂[(i_k, j_k) | (i_{k-p}, j_{k-p})] }.    (11.47)

We have been a bit sloppy in writing (11.47), since different predecessor nodes may have different values of p; hence the appearance of quotes around the minimization argument (i_{k-p}, j_{k-p}). A simple illustration in Fig. 11.12 will make this point clear.

FIGURE 11.12. Illustration of the use of (11.47). Note that the value "p" is generally variable across different local paths within the constraint, as is apparent from this example. d̂[(i_k, j_k) | (i_{k-p}, j_{k-p})] is the accumulated cost along some local path with p transitions. If one of the "outer" paths is used in this example, then p = 2; if the "inner" path is used, p = 1.

As usual, we should be careful with the initialization of this recursion. For most points (i_k, j_k), the legal predecessor nodes (i_{k-p}, j_{k-p}) will be determined by the local path constraints. At the outset, however, the initial point of interest will be (1, 1) and its only legal predecessor, according to our formal convention, is (0, 0). Once we recall (11.20), the recursion is ready for use from the outset.

The purpose of a local path constraint is to limit the amount of expansion or compression of the test waveform in a small neighborhood preceding (i_k, j_k). For example, the four Sakoe and Chiba constraints shown in Fig. 11.11 require that any path make no more than m horizontal or vertical transitions without first making n diagonal transitions. Slope ratios of 0, ½, 1, and 2 are found in the figure. We wish to relate these to the global constraints discussed above. Suppose that for a given local path constraint type, there are R possible "local" paths over which to reach (i_k, j_k). For example, for the local constraint in Fig. 11.12, there are five possible paths to (i_k, j_k). If Δ_i^r and Δ_j^r represent the total change in the i and j directions, respectively, over local path r, then the maximum and minimum expansion of the test waveform with respect to the reference are given by

    Δ_max = max_{r=1,...,R} Δ_j^r / Δ_i^r    (11.48)

    Δ_min = min_{r=1,...,R} Δ_j^r / Δ_i^r,    (11.49)

respectively. Myers et al. (1980) then give the result for the permissible search region in the i-j plane for the case in which Δ_min = Δ_max^{-1}. A pair (i, j) is found in the global search space only if both of the following conditions are met:

    1 + (i - 1)/Δ_max ≤ j ≤ 1 + Δ_max(i - 1)    (11.50)

    J + Δ_max(i - I) ≤ j ≤ J + (i - I)/Δ_max.    (11.51)

There are four implied inequalities on j. By setting each of these to equalities, we obtain the four lines that intersect to form the boundaries of the search region. The reader is encouraged to ponder this issue in light of the discussion surrounding Fig. 11.9.

It should also be noted that the monotonicity constraint, as well as the global path constraints, is (or should be) inherent in the local path constraints. Said another way, local path constraints that violate the requirement of monotonicity should not be chosen.
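The recursion (11.45) through (11.47), with "distant" predecessors and accumulated local-path costs d̂, can be sketched as follows. This is an illustrative sketch, not the book's algorithm: the local-path set below is an assumed Itakura-style variant (one of the families in Fig. 11.11), each path written as a list of (Δi, Δj) steps, and transition costs d_T are omitted so that d̂ of (11.46) reduces to a sum of node costs along the local path.

```python
# Sketch of DP recursion (11.47) with multi-step local paths.
# The third local path has p = 2 transitions, so its d-hat accumulates
# two node costs, as in (11.46).

LOCAL_PATHS = [
    [(1, 0)],          # horizontal step
    [(1, 1)],          # diagonal step
    [(1, 1), (0, 1)],  # two-step path through an intermediate node
]

def dtw_recursion(d):
    """d[i][j] holds the node cost d_N(i+1, j+1); returns D_min(I, J)."""
    INF = float("inf")
    I, J = len(d), len(d[0])
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0          # zero cost at the formal origin (0, 0)
    D[1][1] = d[0][0]      # only transition out of (0,0) is to (1,1)
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if i == 1 and j == 1:
                continue
            for path in LOCAL_PATHS:
                # Walk backward to the distant predecessor, summing d-hat.
                pi, pj, dhat = i, j, 0.0
                for di, dj in reversed(path):
                    dhat += d[pi - 1][pj - 1]
                    pi, pj = pi - di, pj - dj
                if pi >= 0 and pj >= 0 and D[pi][pj] < INF:
                    D[i][j] = min(D[i][j], D[pi][pj] + dhat)
    return D[I][J]
```

Note that the predecessor of each local path is determined unambiguously by walking its steps backward, which is exactly the uniqueness condition noted after (11.46).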
As we have discussed, a refinement to the local path constraints is the inclusion of transition costs of the form d_T[(i_{k-m}, j_{k-m}) | (i_{k-m-1}, j_{k-m-1})] along the various local path sectors. Generally, these weights are used to discourage particular types of deviation away from a linear path through a local neighborhood. Four transition cost types have been proposed by Sakoe and Chiba (1978). These are

    d_T[(i_{k-m}, j_{k-m}) | (i_{k-m-1}, j_{k-m-1})] = min[i_{k-m} - i_{k-m-1}, j_{k-m} - j_{k-m-1}]    (11.52)

    d_T[(i_{k-m}, j_{k-m}) | (i_{k-m-1}, j_{k-m-1})] = max[i_{k-m} - i_{k-m-1}, j_{k-m} - j_{k-m-1}]    (11.53)

    d_T[(i_{k-m}, j_{k-m}) | (i_{k-m-1}, j_{k-m-1})] = i_{k-m} - i_{k-m-1}    (11.54)

    d_T[(i_{k-m}, j_{k-m}) | (i_{k-m-1}, j_{k-m-1})] = [i_{k-m} - i_{k-m-1}] + [j_{k-m} - j_{k-m-1}].    (11.55)

The reader is encouraged to think about the manner in which each of these transition cost strategies influences the path evolution. Note that some cost strategies, when applied to certain path constraints, can result in zero transition costs, a clearly inappropriate result since it makes the matching at the ensuing node "cost-free." An example of this phenomenon occurs if costs of type (11.52) are applied to the "top" local path in Fig. 11.12. To circumvent this problem, Sakoe and Chiba have suggested "smoothing" the transition costs along each local path by replacing each cost by the average cost along the local path. An example is shown in Fig. 11.13.

FIGURE 11.13. Adding transition costs to the local constraints of Fig. 11.11. The use of certain general transition costs can sometimes create results that are inappropriate because certain arcs become "cost-free." An example occurs in (a), in which transition cost type (11.52) is applied to the shown local constraints. In (b), "smoothing" is used to prevent this anomaly. Under smoothing, each cost along a local path is replaced by the average cost along the local path.

Finally, let us return to the issue of normalizing the optimal distance in order to express it on a "cost per node" basis. We first introduced this notion when we suggested the normalization (11.39), which is repeated here for convenience,

    D̄ ≝ [ Σ_{k=1}^{K} d[(i_k, j_k) | (i_{k-1}, j_{k-1})] ] / [ Σ_{k=1}^{K} d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] ].    (11.56)

A problem with this normalization is that the normalizing cost (path "length") in the denominator is dependent upon the particular path for transition cost types (11.52) and (11.53). In a DP algorithm, where optimization is done locally, a path-dependent normalization is clearly inappropriate, and we must resort to an arbitrary normalization factor. A natural choice is to simply use I (the number of elements in the test string) as a normalization factor (Myers et al., 1980). In this case a DTW algorithm using (11.52) [(11.53)] will be biased toward the use of longer [shorter] paths. In the case of the other two weight types, the normalization is path-independent, since it is easily shown that

    Σ_{k=1}^{K} d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] = Σ_{k=1}^{K} [i_k - i_{k-1}] = I    (11.57)

and

    Σ_{k=1}^{K} d_T[(i_k, j_k) | (i_{k-1}, j_{k-1})] = Σ_{k=1}^{K} ([i_k - i_{k-1}] + [j_k - j_{k-1}]) = I + J.    (11.58)

A detailed study of these various constraint and transition cost strategies has been reported by Myers et al. Their work is based on a 39-word vocabulary (alphabet, digits, and the words "stop," "error," "repeat") with speaker-dependent trials. Local constraints tested were (b) and (e) to (h) in Fig. 11.11. Generally, they find that all constraints perform similarly with respect to accuracy, except (g), which is significantly worse. As expected, restricting path range improves computational efficiency, but at the expense of increased error rate. Interestingly, placing the test sequence on the abscissa improves accuracy, especially if transition cost (11.54) is used. Finally, the DTW algorithms perform best when the test and reference strings are about the same length. The reader is referred to the Myers paper for details.

11.3.3 Typical DTW Algorithm: Memory and Computational Requirements

To summarize the discussion thus far, let us set up a typical DTW algorithm and examine some of the details. The example search grid is illustrated in Fig. 11.14.

Let us first consider the computational requirements of the algorithm. Suppose we choose to work with the Sakoe and Chiba local constraints shown in item (b) of Fig. 11.11. We know from (11.50) and (11.51) that the search region will be restricted in this case to the parallelogram with slopes ½ and 2. Note that we have additionally relaxed the endpoint constraints by a parameter ε.
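The "two-array, in-place" organization described in this subsection can be sketched as follows. This is a simplified stand-in, not Fig. 11.15 itself: the predecessor set {(i-1, j-1), (i-1, j-2), (i-2, j-1)} is an assumed substitute for the Sakoe and Chiba constraint of Fig. 11.11(b), node costs only are used, the full grid is scanned rather than just the parallelogram, the endpoint relaxation is applied only in the reference (j) direction, and the final distance is normalized by I.

```python
# Sketch of a typical DTW algorithm using two J-length arrays,
# computed column by column over i with j scanned "from top to bottom."

def typical_dtw(d, eps=0):
    """d[i][j] holds the node cost for grid point (i+1, j+1)."""
    INF = float("inf")
    I, J = len(d), len(d[0])
    # delta1[j] holds D_min(i-1, j); delta2[j] holds D_min(i-2, j).
    delta1 = [INF] * (J + 1)
    delta2 = [INF] * (J + 1)
    delta1[1] = d[0][0]            # D_min(1,1): path anchored at (1,1)
    for i in range(2, I + 1):
        current = [INF] * (J + 1)
        for j in range(J, 1, -1):  # j = J, J-1, ..., 2
            best = min(delta1[j - 1], delta1[j - 2], delta2[j - 1])
            if best < INF:
                current[j] = best + d[i - 1][j - 1]
        delta2, delta1 = delta1, current   # swap before the next i
    # Termination: search the relaxed endpoint region, normalized by I.
    return min(delta1[j] for j in range(J - eps, J + 1)) / I
```

Only the two arrays are retained, so the memory cost stays at roughly 2J locations regardless of I; recovering the optimal path itself would additionally require an O(IJ) backtracking matrix, as the text notes.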
J FIGURE 11.15. Typical DTW algorithm for discrete utterance recogn ition.
Recursion: For i = 2, ,r
Slope 2 , ./ For J = J , ,2
Compute Dmon(l,) using (J 1.4 7).
(N ote special case for i = 2.)
02(J) = 0IU)
1 +£ 01(J) '" D m 1n (i , j )
[Note that Dmon [(i - 2,j )] is held in 02(J) and
Dmin(i -l,j )] is held in 0 1(J).]
Ne xt j
N ext i
1+ £ I-f I
Termination: Best path has cost
FIGURE 11.14. Illustration of the search space for the example DTW
algorithm of Section 11.3.3. _ . {Dm ;n(J, } )/I. i> J - e, . .. ,J
D=mm
DmtnU, J)/ I , i = 1 - e, . . . , 1.
Taken together, the number of distance computations and the number of DP searches are often used as a measure of the computational complexity of a DTW algorithm. We will use these measures in our discussion here and in others to follow, but point out that there are deeper issues involving addressing and computation that should be considered in the serious evaluation of any DTW algorithm [see (Silverman and Morgan, 1990)].

Next, we consider the memory requirements. In light of (11.47), we see that to compute Dmin(i, j) for some i, and for any j, it is only necessary that the past quantities Dmin(i − 1, j) and Dmin(i − 2, j) be available for all j. Let us therefore employ two 1 × J arrays, say δ1(j) and δ2(j), j = 1, ..., J, respectively, to record these quantities. The essential memory requirement is therefore modest, totaling only 2J locations. Further, note that the computation can be done "in place." We sequentially move through the values of i and, for a given i, compute Dmin(i, j) "from top to bottom," j = J, J − 1, ..., 2 (only j's within the parallelogram are actually computed). In-place computation refers to the fact that Dmin(i, J) may replace δ2(J), Dmin(i, J − 1) may replace δ2(J − 1), and so on. Arrays δ2(·) and δ1(·) are then swapped before proceeding to the next i. If for some reason backtracking to find the optimal path is desired, then, additionally, a matrix of size O(IJ) must be allocated to hold the backtracking information.

In Fig. 11.15 we sketch an algorithm for these constraints. Although this example algorithm illustrates the general principles of a DTW solution, the details will, of course, change with different constraints on the search. A second example will be considered in Problem 11.4.

11.4 DTW Applied to CSR

11.4.1 Introduction

At the outset of our discussion of DTW, we noted that there is nothing that theoretically precludes the use of the basic DTW approach with any unit of speech declared to be a discrete unit. Practically, however, as template lengths increase, difficulties in both performance and computational expense become significant. In longer utterances, variations in speaking rate, prosody, and articulation become so vast that local path constraints can no longer be imposed without unacceptable risk of pruning correct paths,¹¹ often in early stages of the search. The alternative is to greatly relax constraints, but at unacceptable computational cost.

DTW has therefore been applied in a connected-speech paradigm in which individual reference templates representing words are, in principle, concatenated to form utterance-length templates that are then compared

¹¹This means to terminate an incomplete path due to unacceptable likelihood (see Section 11.4.5).
652 Ch. 11 I Dynamic Time Warping '1 .4 I DTW A pp lied to CS A 65 3
against the string of feature vectors representing the test utterance. Recall that what distinguishes this strategy as a connected-speech approach is the fact that the reference "model" consists of individually trained pieces. If all concatenations of words are to be tried, then for a V-word vocabulary, with K reference templates per word, and utterances of up to L words in length, O[(KV)^L] reference strings must be matched against each test utterance. Recalling that a search of the DTW grid for a test string of length I and reference string of length J requires O(μIJ) distance computations and DP searches (μ is a fraction, typically 1/3, to account for global path constraints), we see that an exhaustive search over the vocabulary space requires O[μ(KV)^L IJ̄] distance computations, where J̄ denotes the average single-word reference string length (in number of frames). This is clearly prohibitive for all but the smallest V's and L's. For example, in a phone-dialing-by-speech problem, Ney (1984) cites the following typical numbers: V = 10 (number of digits), K = 10, L = 12 (maximum number of words in a phone number), I = 360 (typical test string length), J̄ = 35. In this case, we find that exhaustive search requires O(10^17) distance computations to recognize a phone number. We shall also see that for at least one method of organizing the computations of an exhaustive search, the number of "global" memory locations required (in addition to the "local" memory required for the basic DP procedure) is O(3KVLI). For our example, this amounts to O(10^6), which, in the early days of DTW (1970s), was in itself prohibitive of exhaustive search. Therefore, much effort has been devoted to finding efficient algorithms requiring less storage for this DTW search. We will generally describe some important results of these efforts in the following sections.

We will discuss three specific algorithms that have been widely used in connected-speech recognition by DTW. These are the "level-building" algorithm (Myers and Rabiner, 1981a, 1981b), the "one-stage" or "Bridle" algorithm (Bridle and Brown, 1979; Bridle et al., 1982), and the "grammar-driven" algorithm (Pawate et al., 1987). The first two of these fall into two general classes of algorithms, which have been called fixed-memory and vocabulary-dependent memory algorithms, respectively, for reasons that will become apparent (Silverman and Morgan, 1990).

search over all possible reference-word concatenations. This technique was first published by Myers and Rabiner (1981a) and, to some extent, it can be viewed as an attempt to streamline an earlier method reported by Sakoe (1979) called the "two-level" method. Sakoe's method compares reference strings in two passes, one for the individual words, the second for the complete utterance. Level building combines these two tasks and is significantly more efficient.

To understand the operation of LB, first let us picture a large search grid with the test utterance laid out along the i axis, as in Fig. 11.16. The Itakura parallelogram is usually imposed as a global path constraint, and we have shown this region on the grid. Assuming that we know approximately how many, say L, words are present in the test utterance,¹² let us partition the reference axis into L "levels" into which we will "plug" var-

[FIGURE 11.16: the LB search grid, with the test utterance along the i axis, the reference axis partitioned into levels (Level 2, Level 3, ... marked), and the beginning points of level 2 indicated.]
For r¹, the algorithm proceeds to find the best path to each of the possible ending points of level 1, utilizing conventional DP techniques while adhering to any local path constraints that may apply. Backtracking information is recorded so that paths may be retraced to their origins.¹³ Let us assume for simplicity that the initial endpoints of the reference and test strings are tied [(1, 1) must occur on every path], which, recall, means that the formal starting node (0, 0) may only make a transition to (1, 1). This constraint is easily relaxed. It is worth recalling that grid points (in this case within a level) are usually, for a fixed i, evaluated sequentially along the j dimension until the topmost allowable point in the search is reached. Then the search proceeds to the next set of vertical grid points (i is incremented). In this case, this procedure creates vertical "stripes" of evaluated nodes that terminate at the upper boundaries of the levels.

The top of the first level corresponds to j = J1 and a certain range along the i axis, say E1, to connote "endpoints of level 1." When the top boundary of the first level is reached using r¹, three pieces of information are stored in a three-level array for (i, J1), for each i ∈ E1:

1. An identifier of the word associated with r¹ (we assume that this is simply the integer superscript 1);
2. D¹min(i, J1)/i, where D¹min(·, ·) represents the usual minimum-distance computation as in (11.47), with the superscript used to associate this quantity with r¹, and where the division by i represents normalization to path length;
3. The starting point of the path leading to (i, J1), in this case, (0, 0).

The purposes of these data will become clear as we proceed.

1. ℓ(i, J1), index of the word associated with the best path to (i, J1);
2. D̂min(i, J1), cost of the best path to (i, J1);
3. (0, 0), starting node of the best path to (i, J1).

Moving to level 2 is straightforward. For convenience, let us index the points along the ordinate of level 2 by j = 1, 2, ..., J2, rather than j = J1 + 1, J1 + 2, ..., J1 + J2, which might seem more consistent with the physical layout of the problem. For formal reasons, we also reindex the important grid points [those formally designated (i, J1), i ∈ E1] as (i, 0), i ∈ E1. Now there are certain grid points in level 2 that represent possible continuations of paths arising out of level 1. These correspond to j = 1 and a certain range of i, say B2, to connote "beginning of level 2." Formally, we can consider the set of points (i, 0), i ∈ E1, as a set of origins for paths in level 2. Also, for formal record-keeping purposes, we assign the cost in the second level of the storage array at each (i, 0) to that node in the form of a Type N cost. We must remember to append this cost to any path originating from one of these nodes.¹⁴

At the top boundary of level 2, a three-part storage array is set up to hold the three pieces of information discussed above. This information must be recorded for each node (i, J2), i ∈ E2. For each r^v, a DP search is carried out from each (i, 0), i ∈ E1, to each (i, J2), i ∈ E2. Whenever a lower-cost path is found to, say, (i′, J2), the information in its storage array is replaced by the new word index, new cost, and originating point of the superior path.¹⁵

The procedure described for level 2 is then repeated for successive levels, until the final level, say L, is reached. If there is a single tied endpoint, say (I, JL), then the three-level array will contain the best word in level L, the cost of the global path, and the origin of the best path in

¹²If the test utterance has fewer than L words, we will still be able to find the correct solution. This will be discussed below.

¹³This information is obviously superfluous at level 1 if (0, 0) is the only allowable origin, but it will be essential at higher levels.

¹⁴Note that, for the first time in our discussions, we have an effective Type N cost assignment to origin nodes.

¹⁵We now see that it is really only necessary to record the appropriate i (i′ ∈ E1) value of the originating node, since the j value is known.
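The level-by-level bookkeeping just described can be condensed into a short sketch. Again, this is only our illustration, not the algorithm of the original papers: it assumes one reference template per word, a simple symmetric local constraint, no Itakura parallelogram, and no path-length normalization, and all names (lb_recognize, trace, origins) are invented.

```python
import numpy as np

def lb_recognize(test, templates, L):
    """Level-building sketch. test: I x d array of feature frames;
    templates: dict word -> (J x d) reference array; L: number of levels.
    Each level runs a DP search per word from the previous level's ending
    frames and keeps, per ending frame, the best (word, origin frame)."""
    I = len(test)
    INF = float("inf")
    origins = [0.0] + [INF] * I      # tied start: level 1 must begin at frame 0
    trace = []                        # per level: (word, entry frame) per end frame
    for _ in range(L):
        best = [INF] * (I + 1)
        info = [None] * (I + 1)
        for word, ref in templates.items():
            J = len(ref)
            D = np.full((I + 1, J + 1), INF)   # accumulated costs for this word
            E = np.zeros((I + 1, J + 1), dtype=int)
            D[:, 0] = origins                  # paths may enter at any level origin
            E[:, 0] = np.arange(I + 1)         # remember the entry frame
            for i in range(1, I + 1):
                for j in range(1, J + 1):
                    cands = [(D[i-1, j], E[i-1, j]),      # horizontal
                             (D[i-1, j-1], E[i-1, j-1]),  # diagonal
                             (D[i, j-1], E[i, j-1])]      # vertical
                    c, e = min(cands)
                    if c < INF:
                        D[i, j] = c + float(np.linalg.norm(test[i-1] - ref[j-1]))
                        E[i, j] = e
            for i in range(1, I + 1):          # top boundary: keep cheapest word
                if D[i, J] < best[i]:
                    best[i], info[i] = D[i, J], (word, int(E[i, J]))
        trace.append(info)
        origins = best                          # level ends = next level's origins
    words, i = [], I                            # backtrack from tied final endpoint
    for level in reversed(trace):
        word, i = level[i]
        words.append(word)
    return list(reversed(words))
```

Each pass over the vocabulary plays the role of one level: the ending frames of level l, with their accumulated costs, become the origins of level l + 1, and the per-frame (word, entry-frame) records support the backtracking described in the text.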
Enhancements and Further Details

Researchers at AT&T have introduced a number of independently controllable variables that can be used to improve the performance, and in some cases the efficiency, of the LB algorithm (Myers and Rabiner, 1981b; Rabiner and Levinson, 1981; Rabiner et al., 1984). Most of these parameters are illustrated in Fig. 11.18. Parameters like the illustrated δ′ and δ″ characterize regions of uncertainty at the beginnings and ends of reference strings in the levels, while δ‴ accounts for uncertainty in the end time of the test string. M is a multiplier used to determine the sizes of the beginning-point sets (called B_l for the lth level in the discussion above) at each level. Here ε is a sort of window-width parameter that dictates the amount of permissible local warp along the reference string. This distance is taken around the path that is currently least costly, and accordingly the method is called a local minimum search. Such a procedure significantly reduces the allowable search space, thereby decreasing the computational load of the search. Not illustrated are the parameters, say τmin and τmax, that are cost bounds inside of which an accumulated cost must be at a level boundary for the continuation of the associated path. The upper bound comprises a pruning mechanism for ridding the search of unlikely paths, while the lower bound prevents unfair competition by paths that have apparently not been subjected to proper comparisons. The reader is referred to the original papers for details on the use of these parameters.

A second enhancement is concerned with the fact that LB, which inherently segments the test string into component word strings, sometimes inserts incorrect words into the final decoding. Efforts to presegment the test string could potentially alleviate this problem, but at the expense of other errors caused by erroneous boundary decisions. A hybrid approach has been proposed in a paper by Brassard (1985) in which the test string is first segmented into regions approximating syllables; then a local minimum-search-type DTW algorithm is used to adjust these hypothesized boundaries. Spurious insertions of short words are discouraged by high penalties for path trajectories representing large compression of the test string. Details are found in Brassard's paper.

We noted early in the discussion that an L-level procedure could be used to find the best match to reference strings of length L − 1, L − 2, .... This is accomplished by simply seeking paths that end at grid points (I, J_{L−1}), (I, J_{L−2}), .... This idea is illustrated in Fig. 11.19, where the global path constraints have been modified so that the terminal end of the parallelogram is "opened up." Generally, the local and global path constraints can be modified on an ad hoc basis. Rabiner and Levinson (1981) point out that such a multiple-path-length approach is particularly useful when several possible path lengths are known a priori. This would be the case, for example, in recognizing phone numbers.

Finally, we note that by doubling the storage at the end of each level, it is possible to record a second-best path to each endpoint, thus offering the possibility of alternative paths to the "best" path deduced using only one storage array. Notice that by taking combinations of first- and second-best path segments among the levels, 2^L global paths are possible candidates. This procedure can be extended to more than two storage arrays, the limiting case being that for which all candidate reference templates have a ranking at the endpoints of the levels. In this case, LB
The one-stage (OS) approach is so named because the recognition is accomplished by finding an optimal path through a DP grid in "one stage" of computation, rather than by building a series of path "levels" as in the LB technique. The name also contrasts the technique with the earlier "two-level" method of Sakoe (1979), which was briefly described above. The OS method was first described by Vintsyuk (1971), but was not well known until a similar algorithm was reported by Bridle and Brown (1979). A tutorial on the OS method is found in (Ney, 1984). More recently, an enhanced version of the Bridle algorithm has been described by Miller et al. (1987). Ney cites several papers describing systems based on the OS approach, while another is described briefly in (Silverman and Morgan, 1990), and more fully in the paper by Miller et al.

FIGURE 11.20. The OS algorithm can be conceptualized as a search over a three-dimensional grid. Shown is the case of three reference strings. The path indicated in this figure happens to sequentially move through r¹, r², and then r³. In fact, the optimal path may start at the "southwest" corner of any grid and end at the "northeast" corner of any grid, sequentially traversing the grids in any order, including repeated traversals.

In many ways, the OS approach is much simpler than the LB algorithm and its predecessors. It is also more efficient computationally, and
often requires less memory. Because, as we shall see, the memory requirement is proportional to the size of the vocabulary, V, the method has been classified as a vocabulary-dependent memory technique (Silverman and Morgan, 1990).

The Algorithm and Its Complexity and Memory Requirements

thinking in this way.

FIGURE 11.21. (a) Sakoe and Chiba local constraints, redrawn here for convenient reference. (b) Cross-boundary constraints for the OS algorithm [involving the nodes (i − 1, 1, v) and (i, 1, v) at the bottom of r^v, and (i − 1, J_{v′}, v′) at the top of r^{v′}].

write an "upgraded" version of (11.47) [the sentence under (11.47) explains the meaning of the quotes around the minimization argument]:

    Dmin(ik, jk, vk) = "min" { Dmin[(ik, jk, vk) | (ik−1, jk−1, vk−1)] }
                     = "min" { Dmin(ik−1, jk−1, vk−1) + d[(ik, jk, vk) | (ik−1, jk−1, vk−1)] },    (11.61)

where the minimization is over the allowable predecessor nodes (ik−1, jk−1, vk−1).

straints on the transitions that may be made with respect to the v dimension. Only when a path reaches the upper boundary of one of the grids associated with a particular reference string (particular r^v) should it be possible to exit that grid. In that case, it should be required that the path continue at the bottom of another grid (possibly the same one, to allow the same word to occur consecutively in an utterance). We have, therefore, a set of within-template transition rules that govern the path search while the path evaluation is internal to one of the word grids, and a set of between-template transition rules that are operative at the top boundaries.

The within-template rules correspond to the usual sorts of local and ... (i, 1, v) may be preceded by (i − 1, 1, v) and (i − 1, J_{v′}, v′) for any v′, including v′ = v. In this case, we can write (11.61) as

    Dmin(i, 1, v) = min { Dmin[(i, 1, v) | (i − 1, 1, v)],
                          min over v′ { Dmin[(i, 1, v) | (i − 1, J_{v′}, v′)] } }
                  = min { Dmin(i − 1, 1, v) + d[(i, 1, v) | (i − 1, 1, v)],
                          min over v′ { Dmin(i − 1, J_{v′}, v′) + d[(i, 1, v) | (i − 1, J_{v′}, v′)] } }.

from "top to bottom" (j = J_v, J_v − 1, ..., 1) and replacing the term Dmin(i − 1, j, v) [held in δ(j, v)] by Dmin(i, j, v) as soon as it is computed. Once again, the computations are performed in "vertical stripes," as we have frequently found to be the case. Since one such column vector is needed for each v, the total amount of local storage necessary is O(V Jmax). The vth column of the matrix δ holds the information for word v. Formally and practically, the dimension of δ is Jmax × V, where

    Jmax = max over v of J_v.    (11.65)

In general, two or more columns might be necessary to hold the relevant distances for a particular v (see the example in Section 11.3.3). In this
case the number of memory locations necessary for the δ matrix is O(ν Jmax V), where ν is a small integer.

In an effort to discover further memory needs, let us next consider the "global memory" required to store the information for recovering the final utterance. As with the LB algorithm, we are not usually interested in recovering the entire node sequence of the optimal path through the 3-D grid. If we were, we could keep a backtracking record at each node in the 3-D grid and, upon reaching the end of the optimal path, simply backtrack to discover the sequence of templates used. In this case an O(I × Jmax × V) matrix of backtracking information must be stored. This is unnecessarily expensive, however, since the boundary-crossing information would suffice to recover the word sequence. What we ultimately need to know is simply this: For any path reaching the top boundary of a grid [i.e., some grid point, say (i, J_v, v)], what was the last top boundary grid point, say (i′, J_{v′}, v′), along that path? Knowing these top boundary grid points allows reconstruction of the word sequence. Let us explore how these pieces of information can be stored in a very efficient manner.

Note that if the "preceding" boundary point is (i′, J_{v′}, v′), then, according to (11.64), the "entry point" to grid v is (i′ + 1, 1, v). However, also according to (11.64), any path with a node of the form (i′ + 1, 1, w) as an entry point to a word w will also have as its preceding point the node (i′, J_{v′}, v′). This says that there is only one possible exit point at frame i′, the one from word v′. Let us, therefore, create a 1 × I row vector, say w(·) (to connote best word), and hold in location i the word with the best exiting path at that frame. When all updating is complete for a particular i, the matrix δ(j, v) will hold all distances to the top boundaries of words in its locations δ(J_v, v), v = 1, 2, ..., V. It should be clear ... point on that path is recorded in location (j, v) of β. Equivalently,

    e(i) = β[J_{v°(i)}, v°(i)].    (11.67)

FIGURE 11.22. Illustration of the two external arrays in the OS algorithm. (The vector w(·) holds, in location i, an indication of which word grid produced the best exiting path at frame i.)

where v°(i) is given in (11.66). The use of the two external arrays w(·) and e(·) is illustrated in Fig. 11.22. Note that, whereas the global memory arrangement is simpler here than in the LB case, the local memory requirement is more complex and the quantity of local memory depends on the vocabulary size, V.

The essential details of the OS method have now been completely discussed, and we are prepared to state the algorithm in a general form. In spite of the fact that the developments above required a bit of thought, the algorithm, which is shown in Fig. 11.23, is a particularly simple one.

Finally, let us note the computational complexity of the OS algorithm. If we employ the Sakoe local constraint assumed throughout our discussion above, there is effectively no global constraint on the search space
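Pulling these pieces together, the OS bookkeeping with the internal columns δ, the best-word vector w(·), and the exit-frame vector e(·) might be sketched as follows. This is our illustration only, not the algorithm of Fig. 11.23: it assumes one template per word, a simplified local constraint with predecessors (i − 1, j) and (i − 1, j − 1), and entry into any grid at its bottom from the best exit at frame i − 1; all names are invented.

```python
import numpy as np

def os_recognize(test, templates):
    """One-stage (OS) connected-word DTW sketch. test: I x d array;
    templates: list of (word, J x d array). Within-template predecessors
    (illustrative): (i-1, j) and (i-1, j-1); between-template rule: a grid
    is entered at its bottom (j = 0 here) from the best word exit recorded
    at frame i-1. The external arrays w(.) and e(.) record, per frame, the
    best exiting word and the frame of the previous exit on that path."""
    I, V = len(test), len(templates)
    INF = float("inf")
    dist = lambda i, ref, j: float(np.linalg.norm(test[i] - ref[j]))
    delta = [np.full(len(ref), INF) for _, ref in templates]     # D_min(i-1, ., v)
    orig = [np.full(len(ref), -1, dtype=int) for _, ref in templates]
    w = [-1] * I                      # best exiting word index at each frame
    e = [-1] * I                      # frame of the previous best exit
    prev_exit_cost, prev_exit_frame = 0.0, -1   # virtual start before frame 0
    for i in range(I):
        new_delta, new_orig = [], []
        for v, (_, ref) in enumerate(templates):
            J = len(ref)
            nd = np.full(J, INF)
            no = np.full(J, -1, dtype=int)
            for j in range(J):
                # candidates: horizontal (i-1, j); diagonal (i-1, j-1) or,
                # at the grid bottom, cross-boundary entry from the last exit
                cands = [(delta[v][j], orig[v][j])]
                if j > 0:
                    cands.append((delta[v][j-1], orig[v][j-1]))
                else:
                    cands.append((prev_exit_cost, prev_exit_frame))
                c, o = min(cands)
                if c < INF:
                    nd[j] = c + dist(i, ref, j)
                    no[j] = o
            new_delta.append(nd)
            new_orig.append(no)
        delta, orig = new_delta, new_orig
        # top boundaries: record the best exiting word at this frame
        cost, v = min((delta[v][-1], v) for v in range(V))
        w[i], e[i] = v, int(orig[v][-1])
        prev_exit_cost, prev_exit_frame = cost, i
    # backtracking: the word sequence in reverse order is w(I), w(e(I)), ...
    words, i = [], I - 1
    while i >= 0:
        words.append(templates[w[i]][0])
        i = e[i]
    return list(reversed(words))
```

For example, with one-word templates for "a" and "b", a test string that says "a" then "b" decodes to ["a", "b"]; only one column of costs per word grid is kept between frames, in the spirit of the O(V Jmax) local storage discussed in the text.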
[Conclusion of the algorithm of Fig. 11.23:]

    For j = 2, ..., J_v
        Update δ(j, v).
    Next j
Next v

Backtracking: The optimal sequence of words in reverse order is w(I), w(e(I)), w(e(e(I))), ....

and the number of distances and DP searches required is VIJ̄. The computational load is a small factor larger in this case than with LB, but the algorithm is simpler and the amount of memory required is significantly less.

Before leaving the OS algorithm, let us seek to understand from whence its main benefit derives with respect to exhaustive search. In the LB algorithm, we found that the reduction in computational effort arose from the willingness to optimize over reference strings at each level of the search. Here it is clear that no such attempt is made, since any reference string may begin a path at any level. However, the benefit derives in this case from optimizing over origins at each boundary, so that only one path survives the transition across a boundary at any given frame i. In this sense, the OS algorithm is much more similar to the standard DP search through a grid, and, indeed, the application of the DP method is more straightforward.

Syntactic Information

Although the OS algorithm provides a highly structured way to implement connected-word recognition, it is frequently the case that the computational demands of the algorithm preclude real-time implementation. Consider the phone-dialing task introduced in Section 11.4.1, for example. The vocabulary size is V = 10, the number of reference templates per word is K = 10, and the typical reference template length is J̄ = 35. This means that about KVJ̄ = 3500 grid points must be analyzed at each frame of the test string. If the frames are analyzed at 10-msec intervals, this means that 3.5 × 10^5 grid points must be searched per second. Even for

Syntactic knowledge is employed in papers by Ney, Mergel, et al. ... Syntactic information frequently takes the form of a graph indicating permissible word sequencing. A transition in the graph represents the production of one word in a sentence, while a complete path through the ... (including the fact that the same word may be pronounced differently in different contexts). If the same word appears in different transitions of the syntax graph, it may have a different set of reference templates for each transition. This fact tends to effectively increase the size of the vocabulary, since each occurrence is effectively treated as a separate word. Whereas the computational load of the unrestricted OS algorithm is O(VIJ̄), without further consideration the OS algorithm supplemented by syntax would be O(V′IJ̄), where V′ is the size of the enlarged vocabulary. Of course, the syntax graph has another effect that more than compensates for the small increase in the effective vocabulary: the number of paths, or possible word sequences, is drastically reduced with respect to an unrestricted vocabulary. Let us examine these points more closely in the context of the OS algorithm, in particular using the work of Ney et al. to illustrate these ideas.

In the study by Ney et al., three German speech databases are employed. These databases and the surrounding language (for our purposes, syntax) models are described in the paper by Mergel and Paeseler (1987). We will focus on just a few salient details here. In the syntax graph representing the simplest database, for example, there are 3481 allowable transitions (words), each of which is represented by its own set of reference templates. (For our purposes it is sufficient to assume that there is one reference template per transition.) Since there are approximately 900 words in the vocabulary, the syntax graph effectively increases the vocabulary by a factor of four. However, in an unrestricted situation, every word may be followed by 900 other words, meaning that (recall Fig. 11.20) 900 other search grids may be entered upon exiting from each of the word grids. Said another way, paths will traverse almost every point in each of the 900 grids, requiring an evaluation of each of the 900J̄ points in the "search space." On the other hand, with the benefit of the syntax graph, exiting points from grids will enter only a "few" of the successor grids: in the Ney experiments, frequently 10 or fewer (Mergel and
Paeseler, 1987). As a consequence, although 3481 grids may now be imagined to be stacked as in Fig. 11.20, a vast majority of the search space remains "inactive," and points need not be evaluated in those regions. In addition, a "beam search" is implemented (see below), which further reduces the number of paths. The algorithmic methods for processing only "active" regions of the search space involve a series of list-processing operations. The interested reader is referred to the original papers for details.

Whereas unrestricted word sequences would have resulted in the need to evaluate 3481J̄ grid points for each test frame, the syntax-directed processing plus beam search was found to reduce this to typically 2% of this number. Even with this very significant reduction, Ney estimates that the execution of 50 million instructions/sec would be necessary to carry out the search in real time if frames are processed each 10 msec. Adjusting the threshold of the beam search results in more or fewer paths being retained, and little gain was found by increasing the number of paths preserved to greater than the nominal 2%.

With the extremely low percentage of paths actually searched in these experiments, it is natural to wonder how often the correct path might be missed. Ney et al. also computed the scores for the correct paths and found that, over six speakers, for one of the databases the correct path was missed only 0.57% of the time.

Some results of the experiments by Ney et al. involving the one database discussed above are shown in Table 11.1. To put into perspective the number of grid points searched (DP searches) per frame, we note that if the average reference template is taken to be J̄ = 50, then the total number of possible grid points that could be searched each frame is 3481 × 50 ≈ 1.74 × 10^5.

We have discussed the issue of syntax in a rather cursory and qualitative way here. Similar ideas concerning language modeling and its effects on recognition will be studied in further detail in future chapters.

TABLE 11.1. Results from Syntax-Driven OS Algorithm Used for Connected-Speech Recognition. After Ney et al. (1992).

Speaker    DP Searches per Frame    Error Rate (%)    Missed Paths
F-0?       13,900                   20.7              3/376
F-01       6,000                    9.9               1/564
M-01       7,600                    15.6              3/376
M-02       7,400                    13.1              9/564
M-03       8,800                    7.8               0/564
M-10       8,600                    14.7              0/564

Note: The error rate is the number of inserted, deleted, or confused words as a percentage of the total number of correct words. The number of missed paths indicates the number of times the correct path through the search space was certainly missed, as a proportion of the total number of recognized sentences.

11.4.4 A Grammar-Driven Connected-Word Recognition System

To the extent that syntactic information is included in the LB and OS approaches, the incorporation of linguistic information represents a simple form of "top-down" information flow. By this we mean that word strings are hypothesized from "above" and the acoustic processing is used to determine whether the observation sequences would support these hypotheses. Researchers at Texas Instruments have used a different form of recognizer in which the flow of information is "bottom-up" (McMahan and Price, 1986; Pawate et al., 1987; Picone, 1990). In the Pawate paper, this system is referred to as a grammar-driven connected-word recognizer (GDCWR). We shall discuss the basic approach here; for details of hardware implementations, the reader is referred to the first two papers cited above.

In the GDCWR, DTW processing is used for word spotting (see Section 11.3.2) to locate words in the test string. These candidate words are then hypothesized (along with their nominal endpoints) to a decoder that evaluates the likelihoods of various concatenations of hypothesized words. This and similar approaches are sometimes called "N-best" methods because the N (a predetermined number) best hypotheses are submitted for further processing.

As an aside, we note that the word-spotting process is very similar to the LB algorithm in certain ways. The basic difference is that, in word spotting, the paths from searches over different reference strings do not interact at the acoustic processing level. Accordingly, if carried out in the manner suggested here, the acoustic search task of the GDCWR approach has similar complexity and memory requirements for a similar vocabulary. We will therefore not belabor this issue.

The sentence hypothesizer has the task of patching together the hypothesized words in an order indicated by the given endpoints. We will study such linguistic processing systems in much more quantitative detail in Chapter 13. However, the approach considered here is among the earliest language processing systems and can be understood in general terms quite easily. Consider a lattice of permissible word strings as shown in Fig. 11.24. Neglecting the details of endpoint justification, we can imagine that the word-spotting analysis provides costs for traversing a given path in the lattice in the form of likelihoods that the corresponding word will appear in that time slot in a string. With these costs assigned, the lattice can be searched using DP principles, and the word string corresponding to the path with the best overall score can be declared the spoken sentence.

Clearly, there are nontrivial issues involved in matching endpoints of the hypothesized words. Some flexibility in this matching (allowing some overlap or gaps) can be incorporated to allow for uncertainties in the endpoints (McMahan and Price, 1986). Related to this time-synchrony issue is the fact that multiple hypotheses (with different boundaries) can occupy the same path through the word lattice. Methods for handling
670 Ch. 11 / Dynamic Time Warping
(Figure: a lattice with Start and End nodes; each transition represents a word.)
FIGURE 11.24. Lattice of permissible word strings in the language (syntax)
processor of the GDCWR.

these synchronization problems will be treated in Chapters 12 and 13.
Generally speaking, this issue is handled by variations of the DP search
strategy.

11.4.5 Pruning and Beam Search

A final, and very important, constraint that is often imposed in large
DP searches is that no path should be extended from a node, say (i, j),
for which D_min(i, j) is unacceptably large. The threshold depends, of
course, on the measure of cost employed and the nature of the problem.
This phenomenon leads to many paths that are terminated early in the
search grid, short of their potential for becoming a complete path. This
clipping of undesirable paths is called pruning.*

In searching the DTW grid in the syntax-driven OS method above (or
in the case of the GDCWR system, the syntax lattice itself), for example,
it is usually the case that relatively few partial paths sustain sufficient
probabilities [small enough D_min(i, j) values] to be considered candidates
for extension to the optimal path. In order to cut down on what can be
an extraordinary number of paths and computations, a pruning procedure
is frequently employed that terminates consideration of unlikely
paths. Most of the paths will be pruned because they will be so unlikely
that their extensions are not warranted. This procedure is often referred
to as a beam search (Lowerre and Reddy, 1980), since only paths that remain
inside a certain acceptable "beam" of likelihoods are retained.
Those that fall outside the beam are pruned. A simple beam, for example,
would consist of all paths at frame i whose costs fall within, say, δ(i)
of the best path. If, for example, (i*, j*) is the grid point with the best path
at i, then the path to any (i, j) will be a candidate for extension at frame
i + 1 only if

    D_min(i, j) ≤ D_min(i*, j*) + δ(i),    (11.68)

where δ(i) is usually taken to be a constant.

In fact, pruning can be important both for reducing computation and,
where memory is allocated as needed, for reducing memory requirements.
When memory is statically declared (as in the LB and OS algorithms),
there is little memory benefit to pruning. This is one of the
considerations that led to the development of the GDCWR system described
above, in which pruning could be carried out at the syntax level
(Picone, 1992).

We will see this same beam search idea used again in searching the
hidden Markov model in Chapter 12, and then again at the linguistic
level of processing in Chapter 13.

*The metaphor used here is pruning undesirable branches from plants and trees.

11.4.6 Summary of Resource Requirements for DTW Algorithms

Table 11.2 provides a convenient summary of computational and
memory requirements for the various DTW algorithms discussed. In the
connected-word cases, the numerical values in square brackets indicate
requirements for the speech-telephone-number-dialing problem introduced
in Section 11.4.1.

TABLE 11.2. Summary of Computational and Memory Requirements for
Various DTW Algorithms.

              Typical Memory Requirements    Typical Computational Requirements
DTW Task      Local         Global           Distance Measures   DP Searches
Single word   2J            (+IJ if back-    μIJ                 μIJ
                            tracking used)
LB            2J            3IL              O(μVILJ)            O(μVILJ)
                            [O(10³)]         [O(10⁵)]            [O(10⁵)]
OS            2VJ           2I               O(μVIJ)             O(μVIJ)
              [O(10³)]      [O(10³)]         [O(10⁴)]            [O(10⁴)]

Note: In the connected-word cases, the numerical values in square brackets indicate
requirements for the speech-telephone-number-dialing problem introduced in Section 11.4.1.
μ is a fraction, typically 1/3, arising from the search constraints; L is the number of levels; J
is the average single-word reference template length; and V is the vocabulary size. It is assumed
that only one reference template is used per word. If K are used, V should be replaced
by KV and the numerical estimates modified accordingly. When implemented as
suggested in Section 11.4.4, the GDCWR system requires similar amounts of resources.
However, a multiprocessor system for this task is described in (Pawate et al., 1987).
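Returning to the beam search of Section 11.4.5: criterion (11.68) amounts to a single threshold test per grid point at each frame. The following is a minimal sketch, assuming a constant beam width δ and costs stored in a dictionary keyed by grid node; the data layout and values are hypothetical, not the book's algorithm.

```python
# Beam pruning for one frame of a DP search, in the spirit of (11.68).
# `costs` maps grid nodes (i, j) at the current frame to accumulated
# costs D_min(i, j); `delta` plays the role of the constant beam width.

def prune_frame(costs, delta=10.0):
    """Keep only nodes whose cost is within `delta` of the frame's best."""
    best = min(costs.values())
    return {node: c for node, c in costs.items() if c <= best + delta}

frame_costs = {(7, 4): 12.0, (7, 5): 3.5, (7, 6): 4.1, (7, 7): 30.2}
survivors = prune_frame(frame_costs, delta=5.0)
# Only nodes within 5.0 of the best cost (3.5) survive: (7, 5) and (7, 6).
```

Paths pruned here are simply never extended at frame i + 1, which is where the computational (and, with dynamic allocation, memory) savings arise.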
Some details should be kept in mind when comparing these numbers.
First is that distance metrics are often redundantly computed in the LB
scheme (see the discussion surrounding Fig. 11.16). We have noted that
in some cases this redundant computation can be avoided by computing
all distances before beginning the LB process. This requires the a priori
computation and storage of O(μVIJ) distances. The reader should note
how this will change the entries in the LB row of the table. It should also
be noted that the "enhanced" LB algorithm of Myers and Rabiner
(1981b) results in a savings of about a factor of two in computational
load. Also, pruning benefits may be obtained from the GDCWR system
that are difficult to achieve with the LB and OS algorithms.

11.5 Training Issues in DTW Algorithms

We conclude with a discussion of a very important issue in DTW-based
recognition: the construction of appropriate reference templates. That
this topic is at the end of the chapter should not be construed as an indication
of insignificance. On the contrary, proper "training" of the DTW
search is of utmost importance. However, it is also a problem to which
there is no simple or well-formulated solution. The appearance of this
issue here, therefore, is an instance of saving the worst problem for last.
However, its appearance here is also appropriate because it represents
one of the central contrasts between the DTW method and the two recognition
approaches to follow. Whereas DTW will be found to be very
similar in many fundamental ways to the hidden Markov model (HMM)
(Chapter 12), and to have many connections to some of the artificial
neural network (ANN) approaches (Chapter 14), we will find that these
latter techniques (particularly the HMM, to which DTW is more closely
related) possess vastly superior training methods. The HMM and the ANN
can be trained in a "supervised" paradigm in which the model can learn
the statistical makeup of the exemplars. The application of the HMM to
speech recognition in the 1980s was a revolutionary development, largely
because of the supervised training aspect of the model. We will, of
course, have more to say about the HMM in the succeeding chapter. Let
us focus upon some of the methods used to train the DTW search.

In the simplest DTW task, speaker-dependent IWR, it is frequently
sufficient to store unaltered feature strings from one or more utterances
of each word. This is the mode that has been more or less implied in the
foregoing discussions. This simple training strategy has been called casual
training (Itakura, 1975; Rabiner and Wilpon, 1980). A major drawback
of this strategy is that the quality of any reference template can be established
only experimentally. Another problem is that robustness can be increased
only by increasing the number of utterances of the same word,
which increases computation linearly.

Another technique that has been employed in the IWR problem is that
of averaging. In this strategy, two or more utterances are time-aligned
with respect to one another (by DTW!), then the feature strings are averaged
to give a single reference template (Martin, 1975; Sambur and
Rabiner, 1976). It is important, of course, to average features for which
linear averaging is meaningful. A common choice is the autocorrelation
sequence, the average of which may then be converted into another desired
feature. Averaging tends to minimize the risk of very spurious and
unreliable templates, but at the same time may create a template that is
poorer than some of the component strings. Commercial systems have
used this method successfully (Martin, 1976).

When DTW is applied in a speaker-independent and/or connected-word
recognition system, three further problems emerge. First, different
speakers tend to pronounce the same word in different ways.
Second, speakers tend to produce isolated words (for training) that are
of longer duration than if they were part of continuous speech. Third,
in connected-speech applications, the coarticulatory effects between
words are not modeled by single-word reference templates. The third
problem can only be remedied by painstaking training procedures that
attempt to model all such effects. Because more efficient and robust
methods have been developed (in particular, the HMM), such procedures
have not been explored for DTW. The second problem is important,
of course, if multiple reference templates are to be combined. In
this case, some form of time normalization is necessary. Several methods
of time normalization have been used, including the use of DTW
itself to time-align various templates, and techniques for linear compression
or expansion of the reference templates (Myers and Rabiner,
1981a, 1981b).

The first problem, that of various pronunciations occurring in the
speaker-independent problem, has been addressed through clustering
techniques. This strategy has produced very good results in both speaker-dependent
(Rabiner and Wilpon, 1979) and speaker-independent applications
(Levinson et al., 1979). In the clustering approach, multiple, say P
(typically 50-100), utterances of a word are reduced to Q < P clusters
that, in turn, are represented by one template each. The clusters are
achieved by assigning a "feature vector" to each reference template, consisting
of its distances from all P exemplars determined by DTW. These
feature vectors are then clustered using a method such as the K-means,
or isodata, algorithms discussed in Section 1.3.5. There are many details
associated with clustering techniques, including the number of clusters to
be used, setting criteria and thresholds for separating clusters, and modifying
algorithms to operate properly with distances as features. An extensive
study of these issues is found in the paper by Rabiner et al. (1979),
which addresses the task of speaker-independent recognition of a small
vocabulary (digits, alphabet, "stop," "error," "repeat"). The results show a
significant improvement of clustered templates with respect to randomly
selected ones. The referenced paper also provides a summary of results
from several studies of IWR using the various training techniques discussed
above.
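The averaging strategy described above can be sketched as follows. The alignment path and feature values here are hypothetical, and scalar features stand in for (say) autocorrelation vectors, for which linear averaging is meaningful; in practice the path would come from a DTW search.

```python
# Template averaging (a sketch): a test utterance's feature string is warped
# onto a reference template via a DTW alignment path, and aligned frames
# are then averaged to form a single merged reference template.

def average_templates(ref, test, path):
    """Average `test` into `ref` along a DTW path of (test_idx, ref_idx) pairs."""
    # Collect the test frames aligned to each reference frame.
    aligned = {j: [] for j in range(len(ref))}
    for i, j in path:
        aligned[j].append(test[i])
    merged = []
    for j, r in enumerate(ref):
        warped = sum(aligned[j]) / len(aligned[j])   # mean of aligned test frames
        merged.append((r + warped) / 2.0)            # average with the reference
    return merged

ref = [1.0, 2.0, 3.0]                    # scalar features for simplicity
test = [1.0, 1.0, 3.0, 5.0]
path = [(0, 0), (1, 0), (2, 1), (3, 2)]  # hypothetical DTW alignment
print(average_templates(ref, test, path))    # → [1.0, 2.5, 4.0]
```

With more than two utterances, the same idea applies repeatedly, or all strings can be warped to a common time base before averaging.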
In summary, training a DTW algorithm is an important task, particularly
in speaker-independent systems, which must be given proper attention
for successful results. The training issue is one that will contrast
greatly with those used in the methods to follow in the remaining
chapters.

11.6 Conclusions

The central focus in this chapter has been the DTW algorithm and its
various applications and implementations in IWR and CSR. Dynamic
time warping is a landmark technology that represents one of the first
major breakthroughs in modern speech recognition. Although it is no
longer as widely used as technologies to be discussed in later chapters,
the fundamental basis for its operation, DP, will be found at the heart of
much of what we have yet to study. Therefore, the hard work in this
chapter will pay dividends in future study.

The next issue we take up is that of the most prevalent speech
recognition technique, the hidden Markov model (HMM). We will find
that the HMM is very similar to a DTW algorithm in certain ways, but
that a stochastic component to the HMM provides many useful properties
for training, recognition, and general robustness.

In this chapter, we have also seen some glimpses of the usefulness of
language processing in speech recognition. It will be interesting to see
how some of the same principles we have learned in this chapter will
continue to be built upon in Chapter 13.

11.7 Problems

11.1. In this problem we reconsider the algorithm for the "eastbound
salesman problem" shown in Fig. 11.4.
(a) In the given algorithm, find the minimum distance path from
(0,0) to (5,4) (I = 5 and J = 4) if the distance measure used is
the simple Euclidean distance in the plane,

    d[(i, j) | (k, l)] ≝ d_E[(i, j) | (k, l)] = √[(i − k)² + (j − l)²].    (11.69)

In initializing the "first" cities, (1, j'), j' = 1, 2, 3, 4, use (11.69).
Indicate which cities are on the optimal path, and give that
path's distance. Is the optimal path unique?
(b) Suppose that the salesman's boss gets the idea that better coverage
of the territory would be obtained (more cities would be
visited) if the salesman were required to take the maximum distance
path from (0,0) to (I, J). Modify the algorithm to accommodate
this change, and find the maximum distance path.
Indicate the distance of, and the cities on, this path. (Note: The
strict "eastbound" constraint still applies.)
(c) Was the boss correct about visiting more cities? Is there a less
expensive (shorter path) way to visit the same number of (or
more) cities?

11.2. In a DTW search, suppose that I ≈ J, and that we apply a window
width constraint under which no node (i_k, j_k) may be on the optimal path
for which

    |i_k − j_k| > W.    (11.70)

Approximately how many costs of form (11.29) need to be computed?
Your answer will be in terms of I and W.

11.3. One measure of the computational effort of any sequential decision
problem solved by DP is the number of times the equation of form
(11.47) is used. This is sometimes called the number of DP searches.
(a) Consider an LB procedure in which the test utterance is I
frames long, the average word-reference template is J frames,
the vocabulary size is V, the number of levels is L, and continuity
constraints are applied which reduce the search space to a
factor of μ times the total grid. Give a clear argument showing
that the number of DP searches carried out in finding the optimal
path through the entire series of levels is O(μVILJ), as
claimed in Section 11.4.2.
(b) Using the same figures I, J, V, and L from part (a), repeat the
analysis for the OS algorithm employing the Sakoe and Chiba
local path constraints as in Section 11.4.3.

11.4. (a) Modify the DTW algorithm of Section 11.3.3 to accommodate
the following:
(i) Initial endpoint (1, 1) and final endpoint (I, J) only;
(ii) The use of local constraint (d) of Fig. 11.11.
(b) What is the minimum and maximum global expansion of the
test waveform with respect to the reference waveform with the
modified algorithm?
(c) How many distance measures and DP searches will be performed
per test waveform, assuming I = J?

11.5. Modify the algorithm in Section 11.3.3 for use with an LB search.
In particular, include the two following backtracking arrays: ψ₁(j)
and ψ₂(j), j = 1, 2, ..., J, and explain how they are used.

11.6. In a phone-dialing application, the 10 digits* 0 (zero), 1, ..., 9 are
to be recognized using DTW. There are 30 reference templates (three for
each digit) against which each incoming utterance is compared. Suppose

*Frequently the utterance "oh" is included in a digit recognition problem, but we ignore
it here for simplicity.
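Since DP is the engine beneath every algorithm in this chapter, a minimal sketch of the basic DTW recursion may be a useful reference in working the problems above. It uses an unweighted symmetric local path and a simple absolute-difference local distance on scalar features; it is not the constrained algorithm of Section 11.3.3, and the data are hypothetical.

```python
# Minimal DTW: accumulate D(i, j) = d(i, j) + min over the three symmetric
# predecessors, a DP recursion of the general form discussed in this chapter.

def dtw(test, ref):
    I, J = len(test), len(ref)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0                                   # anchored start point
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = abs(test[i - 1] - ref[j - 1])       # local distance
            D[i][j] = d + min(D[i - 1][j - 1],      # diagonal
                              D[i - 1][j],          # vertical
                              D[i][j - 1])          # horizontal
    return D[I][J]                                  # anchored end point

print(dtw([1.0, 2.0, 3.0], [1.0, 3.0]))             # → 1.0
```

Each evaluation of the inner `min` is one "DP search" in the sense of Problem 11.3, so this unconstrained version performs exactly IJ of them.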
that a typical utterance is 0.5 sec long and is sampled at 8 kHz. The sequence
is then reduced to a vector of feature parameters over 256-point
frames which are shifted by 128 points each computation. This typical
length should be assumed for both reference and test strings. Also assume
that an Itakura parallelogram with slopes 1/2 and 2 is applied to the
CHAPTER 12
The Hidden Markov Model
likely that one or more of these methods will be the basis for future
large-scale speech understanding systems when such systems finally
emerge. The term "stochastic approach" is used to indicate that models
are employed that inherently characterize some of the variability in the
speech. This is to be contrasted with the straightforward deterministic
use of the speech data in template-matching (e.g., DTW) approaches, in
which no attempt at probabilistic modeling of variability is present. The
term "structural methods" is also used to describe the stochastic approaches,
since each of these methods is based upon a model with a very
important mathematical "structure."

Two very different types of stochastic methods have been researched.
The first, the hidden Markov model (HMM), is amenable to computation
on conventional sequential computing machines. The HMM will be the
subject of this chapter. Speech research has driven most of the engineering
interest in the HMM in the last three decades, and the HMM has
been the basis for several successful large-scale laboratory and commercial
speech recognition systems. In contrast, the second class of stochastic
techniques in speech recognition to be discussed, that based on the artificial
neural network (ANN), has been a small part of a much more general
research effort to explore alternative computing architectures with some
superficial resemblances to the massively parallel "computing" of biological
neural systems. Results of the application of ANNs to speech research
lag far behind those for HMMs because of the relative youth of the technology.
Because of the interest in ANNs and their potential for exciting
new technologies, we will give a brief synopsis of ANN research as it applies
to speech recognition in Chapter 14.

The history of the HMM precedes its use in speech processing, and
the model only gradually became widely known and used in the speech field. The
introduction of the HMM into the speech recognition field is generally
attributed to the independent work of Baker at Carnegie-Mellon University
(Baker, 1975a, 1975b), and Jelinek and colleagues at IBM (Jelinek et
al., 1975, 1976). Apparently, work on the HMM was also in progress at
the Institute for Defense Analysis in the early 1970s (Poritz, 1988).¹ Interestingly,
the field of syntactic pattern recognition was independently
evolving during this same period, principally due to the research efforts
of the late Professor K. S. Fu at Purdue University (Fu, 1982). In the
1980s the close relationships between certain theories of syntactic pattern
recognition and the HMM were recognized and exploited.

An exposition of the history of the HMM prior to the speech processing
work is related by Levinson (1985) and Poritz (1988). Levinson cites
the paper of Dempster et al. (1977), which indicates that the roots of the
theory can be traced to the 1950s, when statisticians were studying the
problem of characterizing random processes for which incomplete observations
were available. Their approach was to model the problem as a
"doubly stochastic process" in which the observed data were thought to
be the result of having passed the "true" (hidden) process through a "censor"
that produced the second process (observed). Both processes were to
be characterized using only the one that could be observed. The resulting
identification algorithm came to be known as the estimate-maximize
(EM) algorithm (Dempster, 1977). In the 1960s and early 1970s, Baum
and colleagues (1966, 1967, 1968, 1970, 1972) worked on a special case
of the HMM and developed what can be considered a special case of the
EM algorithm, the forward-backward (F-B) algorithm [also called the
Baum-Welch reestimation algorithm], for HMM parameter estimation
and decoding in time which is linear in the length of the observation
string. As we shall see, the F-B algorithm turns an otherwise computationally
intractable problem into an easily solvable one. Because the original
work is developed in very abstract terms and published in journals
not widely read by engineers, it took several years for the methods to
come to fruition in the speech recognition problem. Once realized, however,
the impact of this technology has been extraordinary.

¹The Poritz paper cites unpublished lectures of J. D. Ferguson of IDA in 1974.

12.2 Theoretical Developments

12.2.1 Generalities

Introduction

In Chapter 13 we will realize that an HMM is, in fact, a "stochastic finite
state automaton," a type of abstract "machine" used to model a
speech utterance. The utterance may be a word, a subword unit, or, in
principle, a complete sentence or paragraph. In small-vocabulary systems,
the HMM tends to be used to model words, whereas in larger-vocabulary
systems, the HMM is used for subword units like phones. We will be
more specific about this issue at appropriate points in the discussion. In
order to introduce the operation of the HMM, however, it is sufficient to
assume that the unit of interest is a word. We will soon discover that
there is no real loss of generality in discussing the HMM from this point
of view.

We have become accustomed to the notion of reducing a speech utterance
to a string of features. In discussing DTW, we referred to the feature
string representing the word to be recognized as the set of "test
features," and denoted it

    test features: t(1), t(2), t(3), ..., t(i), ..., t(I).    (12.1)

The reader is encouraged to review the discussion leading to (11.27) and
(11.28), with particular attention to the link between the short-term feature
analysis and the resulting test string. In the HMM literature, it is
customary to refer to the string of test features as the observations or
FIGURE 12.1. Typical HMM with six states.

meaning that the state transition at time t does not depend upon the history
of the state sequence prior to time t − 1. (In theory, nothing precludes
dependence upon n past states, but the complexity of training such a
model, and of recognition using it, increases dramatically with each increment;
the memory requirements likewise grow exponentially with n.) Under
this condition, the random sequence is called a (first-order) Markov
process [see, e.g., (Leon-Garcia, 1989, Ch. 8)]. When the random variables
of a Markov process take only discrete values (often integers), then
it is called a Markov chain. The state sequence x in the HMM is a Markov
chain, since its random variables assume integer values corresponding to
the states of the model. Further, since the state transition probabilities do
not depend on t, the Markov chain is said to be homogeneous in time.

Although the difference is only formal, in some discussions in the literature
involving the Mealy form, the transitions rather than the states
will be indexed. That is, each transition will be given a label and the sequence
of transitions, rather than the sequence of states, will be featured.
On the few occasions when it is necessary for us to feature the transitions
rather than the states in this text, since our states will always be labeled
anyway, we will simply refer to the transition between states j and i
by u_{i|j}. This should be thought of as an integer labeling the transition
referenced. Rather than a state sequence, we can discuss the transition
sequence modeled by random process u, with the random variables u(t).
It should be clear that, for an arbitrary t,

    P(u(t) = u_{i|j}) = P(x(t) = i | x(t − 1) = j) = a(i|j),    (12.6)

so that the matrix of transition probabilities is identical to (12.3). (This
probability is frequently denoted a_ij in the literature, but we denote it
more explicitly here to remind the reader of its meaning in terms of its
defining probability.) This makes perfect sense, since (12.3) is the matrix
of probabilities of making the transitions. It is also clear that the transition
sequence is a homogeneous Markov chain.

For convenience in the following discussion involving the Moore form,
let us define the state probability vector at time t to be the S-vector

    π(t) ≝ [P(x(t) = 1)  P(x(t) = 2)  ···  P(x(t) = S)]ᵀ.    (12.7)

(The symbol π is sometimes used in the literature to denote the initial
state probability vector, which, in our case, is π(1).) It should be clear
from our discussion above that, for any t,

    π(t) = A π(t − 1).    (12.8)

In fact, given the initial state probability vector, π(1), it is a simple matter
to show by recursion that

    π(t) = A^(t−1) π(1).    (12.9)

Taken together, therefore, the state transition matrix and the initial state
probability vector completely specify the probability of residing in any
state at any time.

Let us now turn our attention to the observations. The observation sequence
may also be modeled as a discrete-time stochastic process, say y,
with random variables y(t). For the Moore form, upon entering a state,
say state i, at time t, an observation is generated. The generation of the
particular observations is governed by the probability density function
f_{y(t)|x(t)}(ξ|i), which we will call the observation pdf for state i. Both y(t)
and ξ are, in general, M-dimensional vectors, where M is the dimension
of the vector feature extracted from the speech. For mathematical tractability,
it is customary to make the unrealistic assumption that the random
process y has independent and identically distributed random
variables, y(t). In particular, this means that f_{y(t)|x(t)}(ξ|i) is not dependent
upon t, and we write

    f_{y|x}(ξ|i) ≝ f_{y(t)|x(t)}(ξ|i)   for arbitrary t.    (12.10)

For the Mealy form, the observation densities are slightly different.
These are

    f_{y|u}(ξ|u_{i|j}) ≝ f_{y(t)|u(t)}(ξ|u_{i|j})    (12.11)

for arbitrary t. (Remember that u_{i|j} is just an integer indexing the transition
from state j to i.)

In a formal sense, then, a Moore-form HMM, say m, is comprised of
the set of mathematical entities

    m = {S, π(1), A, {f_{y|x}(ξ|i), 1 ≤ i ≤ S}},    (12.12)

each of which has been described in the discussion above. A similar
characterization of the Mealy form is given by

    m = {S, π(1), A, {f_{y|u}(ξ|u_{i|j}), 1 ≤ i, j ≤ S}}.    (12.13)

Henceforth in our discussion, we will focus on the Moore form of the
HMM unless otherwise noted.

The Two HMM Problems

Given this formal description of an HMM, we now examine two key
issues centered on the training and use of the HMM. These are the
following:

1. Given a series of training observations for a given word, how do we
train an HMM to represent the word? This amounts to finding a
procedure for estimating an appropriate state transition matrix, A,
and observation pdf's, f_{y|x}(ξ|i), for each state. This represents the
HMM training problem.

2. Given a trained HMM, how do we find the likelihood that it produced
an incoming speech observation sequence? This represents
the recognition problem.
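The propagation rules (12.8) and (12.9) are easy to sketch directly. The two-state chain below is hypothetical; A is stored with a(i|j) in row i, column j, so each column sums to one.

```python
# A sketch of the state-probability recursion (12.8): pi(t) = A pi(t-1),
# so that pi(t) = A^(t-1) pi(1) as in (12.9). Entry A[i][j] holds
# a(i+1 | j+1) = P(x(t) = i+1 | x(t-1) = j+1).

A = [[0.9, 0.2],
     [0.1, 0.8]]
pi1 = [1.0, 0.0]             # start in state 1 with certainty

def propagate(A, pi, steps):
    """Apply (12.8) `steps` times to the state probability vector `pi`."""
    for _ in range(steps):
        pi = [sum(A[i][j] * pi[j] for j in range(len(pi)))
              for i in range(len(A))]
    return pi

pi3 = propagate(A, pi1, 2)                 # pi(3) = A A pi(1)
print([round(p, 6) for p in pi3])          # → [0.83, 0.17]
```

Because A is column-stochastic and π(1) sums to one, every π(t) produced this way remains a valid probability vector, which is exactly the "completely specify" claim above.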
We begin our pursuit of answers to these questions by examining a case
of the HMM with a simple form for the observation pdf's.

    B ≝ [b(k|i)] =
        ⎡ b(1|1)   b(1|2)   ···   b(1|S−1)   b(1|S) ⎤
        ⎢   ⋮        ⋮                ⋮        ⋮    ⎥
        ⎣ b(K|1)   b(K|2)   ···   b(K|S−1)   b(K|S) ⎦ ,    (12.16)

fully trained HMMs of the form (12.21) for each word in the vocabulary
are available, and that we desire to deduce which of these words a given
(quantized) observation sequence y(1), ..., y(T) represents. That is, we
want to determine the likelihood that each of the models could have produced
the observation sequence.

Let us begin by defining some simple notation that will be of critical
importance in our discussions. For simplicity, we denote any partial sequence
of observations in time, say {y(t₁), y(t₁ + 1), y(t₁ + 2), ..., y(t₂)}, by

    y_{t₁}^{t₂} ≝ {y(t₁), y(t₁ + 1), y(t₁ + 2), ..., y(t₂)}.    (12.22)

In particular, the forward partial sequence of observations at time t is

    y_1^t ≝ {y(1), y(2), ..., y(t)},    (12.23)

and the backward partial sequence of observations at time t is

    y_{t+1}^T ≝ {y(t + 1), y(t + 2), ..., y(T)}.    (12.24)

The term "backward" is used here to connote that the sequence can be
obtained by starting at the last observation and working backward. Note
that the backward partial sequence at time t does not include the observation
at time t. It is also useful to note that the forward partial sequence
at time T is the complete sequence of observations that, for convenience,
we denote simply

    y ≝ y_1^T.    (12.25)

If we wish to denote a partial sequence of random variables in any of the
senses above, we will simply underscore the quantity, y_1^t. This, of
course, is an abuse of notation, since y_1^t is not a random variable but a
model for a partial realization of the random process y; however, the
meaning should be clear.

A key question is: "What is meant by the 'likelihood' of an HMM?"
There are two general measures of likelihood used in the recognition
problem. Each leads to its own recognition algorithm, so we must consider
them individually.

"Any Path" Method (F-B Approach). A most natural measure of likelihood
of a given HMM, say m, would be P(m | y = y). However, the available
data will not allow us to characterize this statistic during the

through the model. The second approach to be considered will insist that
the likelihood be based on the best state sequence through m.

Let us first consider a "brute force" approach to the computation of
P(y|m). Consider a specific state sequence through the HMM of proper
length T, say ℐ = (i₁, i₂, ..., i_T). The probability of the observation sequence
being produced over this state sequence is

    P(y | ℐ, m) = b(y(1)|i₁) b(y(2)|i₂) ··· b(y(T)|i_T).    (12.26)

The probability of the state sequence ℐ is

    P(ℐ | m) = P(x(1) = i₁) a(i₂|i₁) a(i₃|i₂) ··· a(i_T|i_{T−1}).    (12.27)

Therefore,

    P(y, ℐ | m) = b(y(1)|i₁) b(y(2)|i₂) ··· b(y(T)|i_T)
                  × P(x(1) = i₁) a(i₂|i₁) a(i₃|i₂) ··· a(i_T|i_{T−1}).    (12.28)

In order to find P(y|m), we need to sum this result over all possible
paths (mutually exclusive events),

    P(y|m) = Σ_{all ℐ} P(y, ℐ | m).    (12.29)

Unfortunately, direct computation of (12.29) requires O(2T·S^T) flops,
since there are S^T possible state sequences, and for each ℐ [each term in
the sum of (12.29)] about 2T computations are necessary. This amount of
computation is infeasible for even small values of S and T. For example,
if S = 5 and T = 100, then

    2 × 100 × 5¹⁰⁰ ≈ 1.6 × 10⁷²    (12.30)

computations are required per HMM. This is clearly prohibitive, and a
more efficient solution must be found.

The so-called forward-backward (F-B) algorithm of Baum et al.
(Baum and Eagon, 1967; Baum and Sell, 1968) can be used to efficiently
compute P(y|m). To develop this method, we need to define a "forward-going"
and a "backward-going" probability sequence. Let us define
α(y_1^t, i) to be the joint probability of having generated the partial forward
sequence y_1^t and having arrived at state i at the tth step, given HMM m,

    α(y_1^t, i) ≝ P(y_1^t = y_1^t, x(t) = i | m).    (12.31)

Whereas α(y_1^t, i) accounts for a forward path search ending at a certain
training process . We will therefore resort to th e use of p{J:. = the yl m; state. we wiII also need a quantity 10 accoun t for th e rest of the sea rch .
probability that th e obse rvat ion sequence J' is prod uced , given the model et /3(.1';+11 i) denote the probability of gen erati ng th e "backward" partial
m. The reader is reminded of th e di scussion in Section 1.3.3 that j ust i
fies this substitution . T he reason for the n am e "any path " method is that
1Precisely (2T - l)S T mult iplications and ST - 1 add itions are req uired (Rabiner. 1989).
the likelihood to be co mputed here is based o n the probability that th e
lAs men tione d in the int rod uction. this is also called the Baum -Wclch reestimation algo
observations could have been p rodu ced using a llY st ate sequence (path) rithm; we will see why in Section 12.2.2 when we use it in th e training problem.
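The combinatorial blow-up in (12.29)-(12.30) is easy to see in code. The sketch below uses a hypothetical two-state, two-symbol model (the probabilities are illustrative only, not from the text) and enumerates all S^T state sequences, so it is usable only for tiny T, which is exactly the point:

```python
from itertools import product

# Hypothetical toy model: pi[i] = P(x(1) = i), a[i][j] = P(x(t+1) = j | x(t) = i),
# b[i][k] = P(y(t) = k | x(t) = i); states and symbols are 0-indexed.
pi = [0.6, 0.4]
a  = [[0.7, 0.3],
      [0.4, 0.6]]
b  = [[0.5, 0.5],
      [0.1, 0.9]]

def brute_force_likelihood(y, pi, a, b):
    """Sum P(y, I | m) over all S**T state sequences, as in (12.26)-(12.29)."""
    S, T = len(pi), len(y)
    total = 0.0
    for seq in product(range(S), repeat=T):      # every state sequence of length T
        p = pi[seq[0]] * b[seq[0]][y[0]]         # initial state and first emission
        for t in range(1, T):
            p *= a[seq[t - 1]][seq[t]] * b[seq[t]][y[t]]
        total += p
    return total

print(brute_force_likelihood([0, 1, 1, 0], pi, a, b))
```

Even for this toy model, T = 20 would already visit over a million sequences; the forward recursion avoids the enumeration entirely.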
688 Ch. 12 / The Hidden Markov Model
12.2 / Theoretical Developments 689
Let \beta(\mathbf{y}_{t+1}^T \mid i) denote the probability of generating the "backward" partial sequence \mathbf{y}_{t+1}^T using model m, given that the state sequence emerges from state i at time t,

    \beta(\mathbf{y}_{t+1}^T \mid i) \triangleq P(\underline{\mathbf{y}}_{t+1}^T = \mathbf{y}_{t+1}^T \mid \underline{x}(t) = i, m).   (12.32)

Actually, we will discover that only the \alpha sequence is necessary to compute P(\mathbf{y} \mid m), which is our present objective, but the \beta sequence will be very useful in future developments.

Suppose that we now lay the states out in time to form a lattice as shown in Fig. 12.2. At time t, we have arrived at state i and have somehow managed to compute \alpha(\mathbf{y}_1^t, i). Suppose further that we wish to compute \alpha(\mathbf{y}_1^{t+1}, j) for some state j at the next time. If there were only one path to state j at t + 1, that arising from i at t (see Fig. 12.2), then clearly

    \alpha(\mathbf{y}_1^{t+1}, j) = \alpha(\mathbf{y}_1^t, i)\, P(\underline{x}(t+1) = j \mid \underline{x}(t) = i) \times P(\underline{y}(t+1) = y(t+1) \mid \underline{x}(t+1) = j)
                                 = \alpha(\mathbf{y}_1^t, i)\, a(j \mid i)\, b(y(t+1) \mid j).   (12.33)

Now if there is more than one state "i" at time t through which we can get to j at time t + 1, then we should simply sum the possibilities,

    \alpha(\mathbf{y}_1^{t+1}, j) = \sum_{i=1}^{S} \alpha(\mathbf{y}_1^t, i)\, a(j \mid i)\, b(y(t+1) \mid j).   (12.34)

This equation suggests a lattice-type computation that can be used to compute the \alpha sequence for each state and for progressively larger t's. This lattice is illustrated in Fig. 12.2. By definition of \alpha, it is clear that the recursion is initiated by setting

    \alpha(\mathbf{y}_1^1, j) = P(\underline{x}(1) = j)\, b(y(1) \mid j)   (12.35)

for each j.

By a similar line of reasoning, the following backward recursion in time can be derived for the \beta sequence:

    \beta(\mathbf{y}_{t+1}^T \mid i) = \sum_{j=1}^{S} \beta(\mathbf{y}_{t+2}^T \mid j)\, a(j \mid i)\, b(y(t+1) \mid j).   (12.36)

This recursion is initialized by defining \mathbf{y}_{T+1}^T to be a (fictitious) partial sequence such that

    \beta(\mathbf{y}_{T+1}^T \mid i) \triangleq \begin{cases} 1, & \text{if } i \text{ is a legal final state} \\ 0, & \text{otherwise} \end{cases}   (12.37)

where a "legal final state" is one at which a path through the model may end. Note that \beta(\mathbf{y}_{t+1}^T \mid i) will only be used in our developments for 1 \le t \le T - 1, so this last definition is only a convenience to start the recursion.

Now we note that

    P(\underline{\mathbf{y}} = \mathbf{y}, \underline{x}(t) = i \mid m) = \alpha(\mathbf{y}_1^t, i)\, \beta(\mathbf{y}_{t+1}^T \mid i)   (12.38)

so that, summing over the mutually exclusive states at any time t,

    P(\mathbf{y} \mid m) = \sum_{i=1}^{S} \alpha(\mathbf{y}_1^t, i)\, \beta(\mathbf{y}_{t+1}^T \mid i).   (12.39)

The desired likelihood can therefore be obtained at any time slot in the lattice by forming and summing the F-B products as in (12.39). In particular, however, we can work at the final time t = T, and by inserting (12.37) into (12.39) obtain

    P(\mathbf{y} \mid m) = \sum_{\text{legal final } i} \alpha(\mathbf{y}_1^T, i).   (12.40)

This expression makes it unnecessary to work with the backward recursion in order to obtain the desired likelihood. Further, a simple algorithm is evident in equations (12.34), (12.35), and (12.40). This is shown in Fig. 12.3.

FIGURE 12.2. State lattice used to derive the forward recursion. (Observations y(t-1), y(t), y(t+1) are laid out along the time axis t-1, t, t+1.)
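The recursions (12.34)-(12.37) translate directly into code. In this sketch the lists are 0-indexed, so alpha[t][j] plays the role of \alpha(\mathbf{y}_1^{t+1}, j) and beta[t][i] the role of \beta(\mathbf{y}_{t+2}^T \mid i); the model arguments (pi, a, b) are hypothetical arrays shaped as in the text:

```python
def forward(y, pi, a, b):
    """Forward recursion (12.34)-(12.35): alpha[t][j] is the joint probability
    of the first t+1 observations and being in state j at that step."""
    S, T = len(pi), len(y)
    alpha = [[0.0] * S for _ in range(T)]
    for j in range(S):                               # initialization (12.35)
        alpha[0][j] = pi[j] * b[j][y[0]]
    for t in range(1, T):                            # lattice update (12.34)
        for j in range(S):
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in range(S)) * b[j][y[t]]
    return alpha

def backward(y, a, b, final_states):
    """Backward recursion (12.36)-(12.37), initialized to 1 on legal final states."""
    S, T = len(a), len(y)
    beta = [[0.0] * S for _ in range(T)]
    for i in range(S):                               # initialization (12.37)
        beta[T - 1][i] = 1.0 if i in final_states else 0.0
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(beta[t + 1][j] * a[i][j] * b[j][y[t + 1]] for j in range(S))
    return beta
```

Summing alpha[t][i] * beta[t][i] over i gives the same value of P(y | m) at every time slot, as (12.39) promises.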
FIGURE 12.3. Computation of P(\mathbf{y} \mid m) using the forward recursion of the F-B algorithm.

    Initialization: Initialize \alpha(\mathbf{y}_1^1, j) for j = 1, \ldots, S using (12.35).
    Recursion: For t = 2, \ldots, T
        For j = 1, \ldots, S
            Update \alpha(\mathbf{y}_1^t, j) using (12.34).
        Next j
    Next t

The forward recursion requires only O(S^2 T) flops,^9 a dramatic savings with respect to the same computation carried out directly in (12.30). The key to this reduction is eliminating redundant computation. In turn, this is a consequence of noting that all possible state sequences must merge into one of S states at time t. By summing likelihoods locally at those nodes as we progress through time, the combinatorial explosion of computations that must be performed if paths are considered individually is avoided.

We should note a practical problem here. In the course of computing the \alpha(\cdot, \cdot) sequence in the algorithm above, many probabilities are multiplied together [see (12.34)]. This is also true for the computation of the \beta(\cdot \mid \cdot) sequence [see (12.36)], which will be used in the training procedure to be discussed below. Frequently, these computations cause numerical problems as the results begin to underflow the machine's precision capability. To remedy this problem, a scaling procedure, to be described in Section 12.2.5, is employed. The algorithm above should be thought of as only a theoretical result, which must be "scaled" in most cases for practical implementation. The scaling procedure, once understood, is easily added to the algorithm.

"Any Path" Method 2 (State Space Approach). As indicated above, Method 1 is generally of O(S^2 T) complexity per model. As claimed in footnote 9, however, the computation is frequently less, due to a model structure (discussed in Section 12.3.2) that does not permit transitions from each state to all others [many of the a(j \mid i)'s are zero]. Typically, the complexity is O(3ST). A method to assure O(ST) is suggested by Deller and Snider (1990). The method is based on viewing the HMM as a state space system for which (12.8) and (12.19) comprise the state equation and observation equation, respectively:

    \boldsymbol{\pi}(t) = \mathbf{A}\boldsymbol{\pi}(t-1) + \delta(t)\mathbf{u}(t)   (12.42)
    \mathbf{p}(t) = \mathbf{B}\boldsymbol{\pi}(t).   (12.43)

For formal reasons, we have added an input term, \mathbf{u}(t), to the state equation, defined such that \mathbf{u}(0) = \boldsymbol{\pi}(1) and \mathbf{u}(t) is arbitrary with bounded elements otherwise. Here \delta(t) is the unit sample sequence, and \boldsymbol{\pi}(t) is unchanged by this transformation. Since

    P(\mathbf{y} \mid m) = \prod_{t=1}^{T} P(\underline{y}(t) = y(t) \mid m),   (12.46)

we can simply use (12.44) and (12.45) at each step to compute P(\underline{y}(t) = y(t) \mid m). Note that if y(t) = k, then only the kth element of \mathbf{p}(t) need be computed at t. Since \mathbf{A} is diagonal, it is easily seen that the number of operations necessary to compute (12.46) is O(ST). In certain cases, it is possible to reduce the average search cost per model to O([1 - \kappa]ST), where 0 \le \kappa < 1, by further manipulations on the set of HMMs represented in this formulation, by combining essentially redundant computation. For details, the reader is referred to (Deller and Snider, 1990).

In addition to the small computational advantage of the state space formulation, a numerical advantage is gained by restructuring the computations into this form. We noted above that Method 1 is subject to serious numerical problems unless a rather elaborate scaling procedure is applied. This problem is avoided here because the accumulation of small numbers by multiplication is concentrated into a single equation, (12.46), and a simple way to avoid small numbers in this computation is to accumulate (by addition) negative logarithms,

    -\log P(\mathbf{y} \mid m) = -\sum_{t=1}^{T} \log P(\underline{y}(t) = y(t) \mid m).   (12.47)

^9 In many HMM applications, the \mathbf{A} matrix is very sparse (mostly zero elements) and the complexity is typically O(3ST). An alternative "any path" method that assures O(ST) complexity in general is discussed below.
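The negative-log accumulation (12.47) can be sketched by renormalizing the forward vector at every step. This is in the spirit of the scaling procedure the text defers to Section 12.2.5, not the Deller-Snider state-space implementation itself; each per-step normalizer is then a quantity of moderate size whose logarithms can be summed safely:

```python
import math

def neg_log_likelihood(y, pi, a, b):
    """-log P(y|m) accumulated by addition, avoiding underflow.

    The forward vector is renormalized to sum to 1 at each step; the
    normalizer c then equals P(y(t) | y(1..t-1), m), so the negative
    logs of the c's add up to -log P(y|m), in the spirit of (12.47)."""
    S, T = len(pi), len(y)
    alpha = [pi[j] * b[j][y[0]] for j in range(S)]
    nll = 0.0
    for t in range(T):
        if t > 0:
            alpha = [sum(alpha[i] * a[i][j] for i in range(S)) * b[j][y[t]]
                     for j in range(S)]
        c = sum(alpha)                    # step normalizer
        nll -= math.log(c)
        alpha = [v / c for v in alpha]    # keep the vector well scaled
    return nll
```

No product of probabilities longer than one time step is ever formed, so the machine's precision is never threatened, however long the observation string.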
It should be clear that -\log P(\mathbf{y} \mid m) serves equally well as a measure of likelihood. Of course, this likelihood measure becomes more favorable as it becomes smaller.

A careful examination of the two "any path" searches will reveal that they are very similar. The second method amounts to a reorganization of the computations of the first.

"Best Path" Method (Viterbi Approach). In the "any path" method taken above, the likelihood measure assigned to an HMM is based on the probability that the model generated the observation sequence using any sequence of states of length T. An alternative likelihood measure, which is slightly less expensive to compute, is based on the probability that the HMM could generate the given observation sequence using the best possible sequence of states. Formally, we seek the number P(\mathbf{y}, \mathcal{I}^* \mid m), where

    \mathcal{I}^* \triangleq \arg\max_{\mathcal{I}} P(\mathbf{y}, \mathcal{I} \mid m)   (12.48)

is the best state sequence, so that the likelihood is the probability of joint occurrence along the best path through the states. We therefore have the added complication of finding \mathcal{I}^* concurrently with computing the likelihood.

As the reader might have surmised, this problem can be reduced to a sequential optimization problem that is amenable to DP. Consider forming a grid in which the observations are laid out along the abscissa, and the states along the ordinate, as illustrated in Fig. 12.4. (This is nothing more than the "lattice" diagram studied in Fig. 12.2 viewed in a different way.) Each point in the grid is indexed by a time, state pair (t, i). In searching this grid, we impose two simple restrictions:

1. Sequential grid points along any path must be of the form (t, i), (t + 1, j), where 1 \le i, j \le S. This says that every path must advance in time by one, and only one, time step for each path segment.
2. Final grid points on any path must be of the form (T, i_T), where i_T is a legal final state in the model.

The reader is encouraged to ponder these restrictions to determine their reasonableness in relation to the model search.

FIGURE 12.4. Search grid for the HMM viewed as a DP problem. (States 1, \ldots, S along the ordinate, frame index t = 1, 2, \ldots, T along the abscissa; paths may end only at legal final states.)

Suppose that we assign a Type N cost to any node in the grid as follows (the reason for the primes on the following distance quantities will become clear below):

    d_N'(t, i) \triangleq b(y(t) \mid i)   (12.49)

and a Type T cost to any transition as

    d_T'[(t, i) \mid (t-1, j)] \triangleq a(i \mid j)   (12.50)

for any i and j and for arbitrary t > 1. Also, to account for initial state probabilities, we can allow all paths to originate at a fictitious and costless (Type N) node (0, 0), which makes a transition of cost (Type T) P(\underline{x}(1) = i) to any initial node of the form (1, i). Upon arriving at the initial node, the path will also incur a Type N node cost of the form b(y(1) \mid i). The accumulated (Type B) cost associated with any transition, say (t-1, j) to (t, i), is therefore

    d'[(t, i) \mid (t-1, j)] = d_T'[(t, i) \mid (t-1, j)]\, d_N'(t, i) = a(i \mid j)\, b(y(t) \mid i)   (12.51)

for t > 1, and

    d'[(1, i) \mid (0, 0)] = d_T'[(1, i) \mid (0, 0)]\, d_N'(1, i) = P(\underline{x}(1) = i)\, b(y(1) \mid i).   (12.52)

Now let us consider a complete path through the grid of the form

    (0, 0), (1, i_1), (2, i_2), \ldots, (T, i_T).   (12.53)

Multiplying the accumulated costs (12.51) and (12.52) along this path gives the total path cost

    D' = \prod_{t=1}^{T} a(i_t \mid i_{t-1})\, b(y(t) \mid i_t),   (12.55)

where, for convenience, we have defined

    a(i_1 \mid i_0) = a(i_1 \mid 0) \triangleq P(\underline{x}(1) = i_1).   (12.56)

Then it is clear that D', the "cost" of the path (12.53), is equivalent to its probability of occurrence jointly with the observation sequence \mathbf{y}. Formally, since knowing the state sequence \mathcal{I} = (i_1, i_2, \ldots, i_T) is equivalent to knowing the path (12.53), we can write

    D' = P(\mathbf{y}, \mathcal{I} \mid m).   (12.57)

Therefore, the best path (state sequence), \mathcal{I}^*, will be the one of maximum cost,

    [D']^* = P(\mathbf{y}, \mathcal{I}^* \mid m).   (12.58)

We have once again reduced our problem to a path search problem that is solvable by DP. The reader will no doubt recall that all of the path searches we have encountered in our work thus far have required solution for the minimum-cost path. Here we want the maximum-cost (probability) path. This presents no problem, for the BOP is equally valid when optimization implies maximizing cost. The reader should be able to return to (11.16) and rewrite the ensuing discussion for the case in which maximization is required and return with a ready-made solution for the present problem. However, we circumvent the need to do so by making a small modification to the procedure, which is often used in practice.

In (12.51), (12.52), and (12.55), and at every step along the way to the DP solution suggested above, products of probabilities are formed. These products often become very small, causing numerical problems (often fatal) to arise in the solution. Further, the formation of products is usually more computationally expensive than sums. Therefore, we use the simple trick of taking negative logarithms,^11 which we have used before, to turn the expensive, numerically problematic multiplications of probabilities into sums. Accordingly, the cost of the complete path (12.53) becomes

    D = \sum_{t=1}^{T} d[(t, i_t) \mid (t-1, i_{t-1})],   (12.59)

where

    D \triangleq -\log D'   (12.60)

and

    d[(t, i_t) \mid (t-1, i_{t-1})] \triangleq -\log d'[(t, i_t) \mid (t-1, i_{t-1})] = [-\log a(i_t \mid i_{t-1})] + [-\log b(y(t) \mid i_t)].   (12.61)

This reveals that we can do the path search by the following assignments: Let the Type N cost at a node (t, i_t) be

    d_N(t, i_t) \triangleq -\log b(y(t) \mid i_t) = -\log P(\underline{y}(t) = y(t) \mid \underline{x}(t) = i_t)   (12.62)

and the Type T cost of any transition in the grid be

    d_T[(t, i_t) \mid (t-1, i_{t-1})] = -\log a(i_t \mid i_{t-1}) = -\log P(\underline{x}(t) = i_t \mid \underline{x}(t-1) = i_{t-1})   (12.63)

for any i_t and i_{t-1} and for arbitrary t > 1. As above, to account for initial state probabilities, we can allow all paths to originate at a fictitious and costless (Type N) node (0, 0), which makes a transition of cost (Type T) -\log P(\underline{x}(1) = i_1) to any initial node of the form (1, i_1). Upon arriving at the initial node, the path will also incur a Type N node cost of the form -\log b(y(1) \mid i_1). Further, if we let the accumulated (Type B) cost associated with any transition, say (t-1, i_{t-1}) to (t, i_t), be obtained by adding the Type T and N costs, then we obtain precisely (12.61). Finally, if we also accumulate Type B costs by addition as we move along a path, then we obtain exactly (12.59). Clearly, seeking the path with minimum D under this setup is equivalent to seeking the path of maximum D', and we have reduced our problem to a shortest-path search with this "log cost" conversion.

Given our vast experience with DP algorithms, developing the steps for this shortest-path problem is quite simple. We first let

    D_{\min}(t, i_t) \triangleq \text{distance from } (0, 0) \text{ to } (t, i_t) \text{ over the best path}   (12.64)
for t > 1. Note that, because all predecessor nodes to (t, i_t) must come from "time slot" t - 1, the right side is really only a minimization over the previous states, and we can write

    D_{\min}(t, i_t) = \min_{i_{t-1}} \{ D_{\min}(t-1, i_{t-1}) + d[(t, i_t) \mid (t-1, i_{t-1})] \}   (12.66)

for t > 1. Writing the local distance quantity on the right in terms of its model parameters, we have

    D_{\min}(t, i_t) = \min_{i_{t-1}} \{ D_{\min}(t-1, i_{t-1}) + [-\log a(i_t \mid i_{t-1})] + [-\log b(y(t) \mid i_t)] \}   (12.67)

for t > 1. This recursion can also be used for t = 1 if we recall that D_{\min}(0, 0) = 0 [see (11.20)], i_0 means 0, and a(i_1 \mid 0) is defined as the initial state probability for state i_1 as in (12.56).

At the end of the search the quantity

    D^* = \min_{\text{legal } i_T} \{ D_{\min}(T, i_T) \}   (12.68)

will be the negative logarithm of the probability of joint occurrence of the observation sequence \mathbf{y} and the best state sequence, say \mathcal{I}^* = (i_1^*, i_2^*, \ldots, i_T^*), for producing it. The minimization over "legal i_T" implies that more than one final state might be possible. Clearly, the best of these final states is

    i_T^* = \arg\min_{\text{legal } i_T} \{ D_{\min}(T, i_T) \}.   (12.69)

For comparisons across models, this quantity is just as useful as the probability itself, and there is usually no reason to convert it. It should be kept carefully in mind, however, that small logs are favorable because they imply large probabilities. As noted in Section 1.3.3, a measure derived from a probability and used as a measure of likelihood of a certain model or occurrence is often called simply a likelihood. We will use this term to refer to the negative log probability in our discussions.

In certain situations, it might be desirable to recover the state sequence associated with the best path through the HMM. This is equivalent to recovering the best path through the grid in the search problem. We are well acquainted with the backtracking procedure to accomplish this task. Let us define \Psi(t, i_t) to be the best last state on the optimal partial path ending at (t, i_t). Then we have

    \Psi(t, i_t) = \arg\min_{i_{t-1}} \{ D_{\min}(t-1, i_{t-1}) + d[(t, i_t) \mid (t-1, i_{t-1})] \}.   (12.71)

We will incorporate this backtracking procedure into the algorithm to follow without further discussion.

We have now developed the essential steps, based on DP principles, for computing the likelihood for a particular HMM in light of the observation sequence \mathbf{y} = \mathbf{y}_1^T. We summarize the steps in a formal algorithm shown in Fig. 12.5. This technique is often called the Viterbi algorithm, since it was first suggested by A. J. Viterbi in the context of decoding random sequences (Viterbi, 1967; Viterbi and Omura, 1979; Forney, 1973). The algorithm is also sometimes called the "stochastic form of DP," an allusion to the fact that the Type N costs involved are stochastic quantities.

    Initialization: "Origin" of all paths is node (0, 0).
    For i = 1, 2, \ldots, S
        D_{\min}(1, i) = -\log[a(i \mid 0)\, b(y(1) \mid i)] = -\log[P(\underline{x}(1) = i)\, b(y(1) \mid i)]
        \Psi(1, i) = 0
    Next i
    Recursion: For t = 2, 3, \ldots, T
        For i_t = 1, 2, \ldots, S
            Compute D_{\min}(t, i_t) according to (12.67).
            Record \Psi(t, i_t) according to (12.71).
        Next i_t
    Next t

In conclusion, let us compare the two general approaches to decoding the HMM. The comparison is simple upon recognizing the following. Suppose that we had defined \alpha(\cdot, \cdot) in the "any path" approach such that

    \alpha(\mathbf{y}_1^{t+1}, j) = \max_i \alpha(\mathbf{y}_1^t, i)\, a(j \mid i)\, b(y(t+1) \mid j).   (12.72)

This corresponds to replacing the sum in (12.34) by a maximization and to taking the maximum over all paths entering a node in the lattice of Fig. 12.2. A little thought will convince the reader that this is precisely the distance to grid point (t + 1, j) in the Viterbi approach (if we do not use the logarithms). Therefore, the methods are similar, but the Viterbi approach requires (S - 1)T fewer additions per HMM. In practice, significantly fewer computations are typically required with the Viterbi search (Picone, 1992).

Either the Viterbi or F-B algorithm can generate likelihoods for paths that never appeared in the training data. This feature represents an effective "smoothing" of the probability distributions associated with the HMM.

to any (t, i_t) will only be a candidate for extension at time t + 1 if

This discussion will need to be generalized somewhat when we begin to use HMMs in more sophisticated ways, but the extension of the main ideas will be apparent.

Although we are "borrowing" this notation from the Mealy form, it should be understood that we are seeking parameters for a Moore-form model. It should also be intuitive that a similar procedure could be derived for the Mealy form [see, e.g., (Bahl et al., 1983)].

In fact, the term "beam search" was first employed in a system using an HMM-like structure (Lowerre and Reddy, 1980).
Let us also define a new set of random processes \underline{y}_j, 1 \le j \le S, which have random variables \underline{y}_j(t) that model the observation being emitted at state j at time t (this may include the "null observation"). The symbol "\cdot", which is, in effect, used to indicate an arbitrary state above, will also be used to indicate an arbitrary time in the following.

Now suppose that we have a model, m, and an observation sequence, \mathbf{y} = \mathbf{y}_1^T. If we are just beginning the process, m may consist of any legitimate probabilities. We first compute the following numbers:

    \xi(i, j; t) \triangleq P(\underline{x}(t) = u_{j|i} \mid \mathbf{y}, m)
                 = P(\underline{x}(t) = u_{j|i}, \mathbf{y} \mid m) / P(\mathbf{y} \mid m)
                 = \begin{cases} \alpha(\mathbf{y}_1^t, i)\, a(j \mid i)\, b(y(t+1) \mid j)\, \beta(\mathbf{y}_{t+2}^T \mid j) / P(\mathbf{y} \mid m), & t = 1, \ldots, T-1 \\ 0, & \text{other } t \end{cases}   (12.77)

where the sequences a, b, \alpha, and \beta are defined in (12.4), (12.14), (12.31), and (12.32), respectively;^15

    \gamma(i; t) \triangleq P(\underline{x}(t) \in u_{\cdot|i} \mid \mathbf{y}, m)
                 = \begin{cases} \alpha(\mathbf{y}_1^t, i)\, \beta(\mathbf{y}_{t+1}^T \mid i) / P(\mathbf{y} \mid m), & t = 1, 2, \ldots, T-1 \\ 0, & \text{other } t \end{cases}   (12.78)

    \nu(j; t) \triangleq P(\underline{x}(t) \in u_{j|\cdot} \mid \mathbf{y}, m)
              = \begin{cases} \gamma(j; t), & t = 1, 2, \ldots, T-1 \\ \alpha(\mathbf{y}_1^T, j) / P(\mathbf{y} \mid m), & t = T \\ 0, & \text{other } t \end{cases}
              = \begin{cases} \alpha(\mathbf{y}_1^t, j)\, \beta(\mathbf{y}_{t+1}^T \mid j) / P(\mathbf{y} \mid m), & t = 1, 2, \ldots, T \\ 0, & \text{other } t \end{cases}   (12.79)

and

    \delta(j, k; t) \triangleq P(\underline{y}_j(t) = k \mid \mathbf{y}, m)
                    = \begin{cases} \nu(j; t), & \text{if } y(t) = k \text{ and } 1 \le t \le T \\ 0, & \text{otherwise} \end{cases}
                    = \begin{cases} \alpha(\mathbf{y}_1^t, j)\, \beta(\mathbf{y}_{t+1}^T \mid j) / P(\mathbf{y} \mid m), & y(t) = k \text{ and } 1 \le t \le T \\ 0, & \text{otherwise.} \end{cases}   (12.80)

Note that we make extensive use of both the forward and backward probability sequences, \alpha(\cdot, \cdot) and \beta(\cdot \mid \cdot), in these computations. Now from these four quantities we compute four related key results:

    \xi(i, j; \cdot) = P(\underline{x}(\cdot) \in u_{j|i} \mid \mathbf{y}, m) = \sum_{t=1}^{T-1} \xi(i, j; t),   (12.81)

    \gamma(i; \cdot) = P(\underline{x}(\cdot) \in u_{\cdot|i} \mid \mathbf{y}, m) = \sum_{t=1}^{T-1} \gamma(i; t),   (12.82)

    \nu(j; \cdot) = P(\underline{x}(\cdot) \in u_{j|\cdot} \mid \mathbf{y}, m) = \sum_{t=1}^{T} \nu(j; t),   (12.83)

    \delta(j, k; \cdot) = P(\underline{y}_j(\cdot) = k \mid \mathbf{y}, m) = \sum_{t=1}^{T} \delta(j, k; t) = \sum_{t \ni y(t) = k} \nu(j; t).   (12.84)

We now give important and intuitive interpretations to the four key quantities (12.81)-(12.84). Let us define the random variables:

    \underline{n}(u_{j|i}) \triangleq number of transitions u_{j|i} for an arbitrary observation sequence of length T and an arbitrary model,   (12.85)

    \underline{n}(u_{\cdot|i}) \triangleq number of transitions from the set u_{\cdot|i} for an arbitrary observation sequence of length T and an arbitrary model,   (12.86)

    \underline{n}(u_{j|\cdot}) \triangleq number of transitions from the set u_{j|\cdot} for an arbitrary observation sequence of length T and an arbitrary model,   (12.87)

    \underline{n}(\underline{y}_j(\cdot) = k) \triangleq number of times observation k and state j occur jointly for an arbitrary observation sequence of length T and an arbitrary model.   (12.88)

^15 The notation \underline{x}(t) \in u_{\cdot|i} is used below to mean that the random variable \underline{x}(t) takes a value from among those in the set u_{\cdot|i}.
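Once \alpha and \beta are in hand, the quantities \xi(i, j; t) of (12.77) and \gamma(i; t) of (12.78) are simple normalized products. The sketch below recomputes both sequences for a hypothetical model, treating every state as a legal final state:

```python
def fb_statistics(y, pi, a, b):
    """Compute xi(i,j;t) of (12.77) and gamma(i;t) of (12.78); all states
    are treated as legal final states, so beta is initialized to 1."""
    S, T = len(pi), len(y)
    alpha = [[0.0] * S for _ in range(T)]       # forward (12.34)-(12.35)
    for j in range(S):
        alpha[0][j] = pi[j] * b[j][y[0]]
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in range(S)) * b[j][y[t]]
    beta = [[1.0] * S for _ in range(T)]        # backward (12.36)-(12.37)
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(beta[t + 1][j] * a[i][j] * b[j][y[t + 1]] for j in range(S))
    p_y = sum(alpha[T - 1])                     # P(y | m), as in (12.40)
    # xi[t][i][j]: probability of the transition i -> j at step t, given y and m
    xi = [[[alpha[t][i] * a[i][j] * b[j][y[t + 1]] * beta[t + 1][j] / p_y
            for j in range(S)] for i in range(S)] for t in range(T - 1)]
    # gamma[t][i]: probability of leaving state i at step t, given y and m
    gamma = [[alpha[t][i] * beta[t][i] / p_y for i in range(S)] for t in range(T - 1)]
    return xi, gamma, p_y
```

A quick sanity check: at each t the gamma values sum to 1 over the states, and summing xi over the destination state recovers gamma, mirroring the relationship between (12.77) and (12.78).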
Now it is easy to show that

    \xi(i, j; \cdot) = E\{\underline{n}(u_{j|i}) \mid \mathbf{y}, m\}   (12.89)

    \gamma(i; \cdot) = E\{\underline{n}(u_{\cdot|i}) \mid \mathbf{y}, m\}   (12.90)

    \nu(j; \cdot) = E\{\underline{n}(u_{j|\cdot}) \mid \mathbf{y}, m\}   (12.91)

    \delta(j, k; \cdot) = E\{\underline{n}(\underline{y}_j(\cdot) = k) \mid \mathbf{y}, m\}.   (12.92)

For example, since there is either one transition u_{j|i} at any time t, or there is none, we have

    E\{\underline{n}(u_{j|i}; t) \mid \mathbf{y}, m\} = [1 \times P(u_{j|i} \mid \mathbf{y}, m)] + [0 \times P(\text{not } u_{j|i} \mid \mathbf{y}, m)] = \xi(i, j; t).   (12.93)

The interpretations of the other three quantities are verified in a similar manner.

With these interpretations, it is easy to see that the following are reasonable estimates of the model parameters:

    \bar{a}(j \mid i) = \frac{E\{\underline{n}(u_{j|i}) \mid \mathbf{y}, m\}}{E\{\underline{n}(u_{\cdot|i}) \mid \mathbf{y}, m\}} = \frac{\xi(i, j; \cdot)}{\gamma(i; \cdot)}   (12.94)

    \bar{b}(k \mid j) = \frac{E\{\underline{n}(\underline{y}_j(\cdot) = k) \mid \mathbf{y}, m\}}{E\{\underline{n}(u_{j|\cdot}) \mid \mathbf{y}, m\}} = \frac{\delta(j, k; \cdot)}{\nu(j; \cdot)}   (12.95)

    \bar{P}(\underline{x}(1) = i) = \gamma(i; 1).   (12.96)

Now retrieving results from above, we have

    \bar{a}(j \mid i) = \frac{\sum_{t=1}^{T-1} \alpha(\mathbf{y}_1^t, i)\, a(j \mid i)\, b(y(t+1) \mid j)\, \beta(\mathbf{y}_{t+2}^T \mid j)}{\sum_{t=1}^{T-1} \alpha(\mathbf{y}_1^t, i)\, \beta(\mathbf{y}_{t+1}^T \mid i)}   (12.97)

    \bar{b}(k \mid j) = \frac{\sum_{t \ni y(t) = k} \alpha(\mathbf{y}_1^t, j)\, \beta(\mathbf{y}_{t+1}^T \mid j)}{\sum_{t=1}^{T} \alpha(\mathbf{y}_1^t, j)\, \beta(\mathbf{y}_{t+1}^T \mid j)}   (12.98)

    \bar{P}(\underline{x}(1) = i) = \frac{\alpha(\mathbf{y}_1^1, i)\, \beta(\mathbf{y}_2^T \mid i)}{P(\mathbf{y} \mid m)}.   (12.99)

Having computed (12.97)-(12.99) for all i, j, k, we now have the parameters of a new model, say \bar{m}.

The procedure that we have just been through will probably seem a bit strange and unconvincing after pondering it for a moment. Indeed, what we have done is taken a model, m, and used it in conjunction with the training observation, \mathbf{y}, to compute quantities with which to produce a new model, \bar{m}. Recall that our ideal objective is to use the training string, \mathbf{y}, in order to find the model, say m^\circ, such that

    m^\circ = \arg\max_m P(\mathbf{y} \mid m).   (12.100)

Now for a given training sequence \mathbf{y}, P(\mathbf{y} \mid m) is generally a nonlinear function of the many parameters that make up the model m. This function will accordingly have many local maxima in the multidimensional space. This idea is portrayed in two dimensions in Fig. 12.6. The optimal model, m^\circ, corresponds to the global maximum of the criterion function. The significance of what we have done in the foregoing is that repeated reestimation of the model according to these steps is guaranteed to converge to an \bar{m} corresponding to a local maximum of P(\mathbf{y} \mid m) (Baum and Sell, 1968). That is, either \bar{m} = m, or P(\mathbf{y} \mid \bar{m}) > P(\mathbf{y} \mid m). The model will always improve under the reestimation procedure unless its parameters already represent a local maximum. So this reestimation does not necessarily produce the best possible model, m^\circ. Accordingly, it is common practice to run the algorithm several times with different sets of initial parameters and to take as the trained model the \bar{m} that yields the largest value of P(\mathbf{y} \mid \bar{m}).

The F-B reestimation algorithm is summarized in Fig. 12.7. In Section 12.2.2 we noted the numerical problems inherent in the use of this F-B algorithm. We remind the reader that a scaling procedure is generally necessary for practical implementation.
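A complete F-B reestimation pass, i.e., (12.97)-(12.99), can be sketched as below (hypothetical toy model; every state is treated as a legal final state, so \beta is initialized to 1 everywhere). Iterating the step realizes the loop summarized in Fig. 12.7:

```python
def baum_welch_step(y, pi, a, b):
    """One F-B reestimation pass: returns (a_bar, b_bar, pi_bar) per
    (12.97)-(12.99), with all states treated as legal final states."""
    S, K, T = len(pi), len(b[0]), len(y)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]
    for j in range(S):
        alpha[0][j] = pi[j] * b[j][y[0]]
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in range(S)) * b[j][y[t]]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(beta[t + 1][j] * a[i][j] * b[j][y[t + 1]] for j in range(S))
    # (12.97): expected i->j transitions over expected departures from i
    a_bar = [[sum(alpha[t][i] * a[i][j] * b[j][y[t + 1]] * beta[t + 1][j]
                  for t in range(T - 1))
              / sum(alpha[t][i] * beta[t][i] for t in range(T - 1))
              for j in range(S)] for i in range(S)]
    # (12.98): occupancy of j while emitting k over total occupancy of j
    b_bar = [[sum(alpha[t][j] * beta[t][j] for t in range(T) if y[t] == k)
              / sum(alpha[t][j] * beta[t][j] for t in range(T))
              for k in range(K)] for j in range(S)]
    p_y = sum(alpha[T - 1])
    pi_bar = [alpha[0][i] * beta[0][i] / p_y for i in range(S)]   # (12.99)
    return a_bar, b_bar, pi_bar
```

Each returned row is automatically a valid probability distribution, and by the convergence result cited in the text the new model can never decrease the training likelihood.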
This scaling technique will be described as part of our exploration of practical issues in Section 12.2.5.

FIGURE 12.7. The F-B reestimation algorithm.

    1. Use m = \{S, \mathbf{A}, \mathbf{B}, \boldsymbol{\pi}(1), \{\mathbf{y}_k, 1 \le k \le K\}\} and \mathbf{y} = \mathbf{y}_1^T to compute (12.81)-(12.84).
    2. Reestimate the model (call the new model \bar{m}) using (12.94)-(12.96).
    3. It will be true that
           P(\mathbf{y} \mid \bar{m}) \ge P(\mathbf{y} \mid m).
       If
           P(\mathbf{y} \mid \bar{m}) - P(\mathbf{y} \mid m) \ge \varepsilon,
       return to Step 1 with m = \bar{m}. Otherwise STOP.
    4. Repeat the above steps with several initial models to find a favorable local maximum of P(\mathbf{y} \mid \bar{m}).

Finally, we note that the F-B procedure, as well as the Viterbi procedure to follow, are sometimes called reestimation by recognition for the obvious reason.

Viterbi Reestimation. Although the F-B algorithm is the most popular algorithm for training the discrete observation HMM, a simpler and equally effective algorithm is available based on the Viterbi decoding approach to recognition (Picone, 1990; Fu, 1982). Suppose, as above, we have a given model, m, and a training sequence \mathbf{y} = \mathbf{y}_1^T with which to reestimate the model parameters. It is assumed that a particular state is designated as the initial state for the model so that the initial state probabilities need not be estimated. We first evaluate the likelihood, P(\mathbf{y} \mid m), using Viterbi decoding as described in Section 12.2.2. Along the way, we keep track of the following tallies:

    n(u_{j|i}) \triangleq number of transitions u_{j|i}   (12.101)

    n(u_{\cdot|i}) \triangleq number of transitions from the set u_{\cdot|i}   (12.102)

    n(u_{j|\cdot}) \triangleq number of transitions from the set u_{j|\cdot}   (12.103)

    n(\underline{y}_j(\cdot) = k) \triangleq number of times observation k and state j occur jointly.   (12.104)

From these tallies the parameters are reestimated as

    \bar{a}(j \mid i) = \frac{n(u_{j|i})}{n(u_{\cdot|i})}   (12.105)

    \bar{b}(k \mid j) = \frac{n(\underline{y}_j(\cdot) = k)}{n(u_{j|\cdot})}.   (12.106)

Having computed (12.105)-(12.106), we now have the parameters of a new model, say \bar{m}. The same, or additional, training strings can be used in further iterations.

The Viterbi reestimation algorithm can be shown to converge to a proper characterization of the underlying observations (Fu, 1982, Ch. 6; Lee and Fu, 1972), and has been found to yield models of comparable performance to those trained by F-B reestimation (Picone, 1990). Further, whereas an HMM corresponds to a particular type of formal grammar (a "regular" or "finite state" grammar), this Viterbi-type training procedure can be used for a broader class of grammars. This issue will be discussed in Chapter 13. Finally, the Viterbi approach is more computationally efficient than the F-B procedure.

12.2.3 The Continuous Observation HMM

Introduction

We now return to the more general case in which the observations are continuous and vector-valued, corresponding to the "unquantized" vectors of features drawn from the speech. In this case the formal description of the HMM contains a multivariate pdf characterizing the distribution of observations within each state. This is given in (12.12), which we repeat here for convenience,

    m = \{S, \boldsymbol{\pi}(1), \mathbf{A}, \{f_{\underline{\mathbf{y}}|\underline{x}}(\boldsymbol{\xi} \mid i), 1 \le i \le S\}\}.   (12.107)

Recall that we write f_{\underline{\mathbf{y}}|\underline{x}}(\boldsymbol{\xi} \mid i) rather than f_{\underline{\mathbf{y}}(t)|\underline{x}(t)}(\boldsymbol{\xi} \mid i) because the process \underline{\mathbf{y}} is assumed to have independent and identically distributed random variables, \underline{\mathbf{y}}(t).

Recognition

We once again need to concern ourselves with the two central problems of training and recognition.
706 Ch. 12 I The Hidd en Markov MOde l
12 .2 I Theoretica l Develop ments 707
above. In the recognition problem, for any incoming observation, say y(t), we define the likelihood of generating observation y(t) in state j as b(y(t)|j),* and simply proceed with the recognition methods described above. The resulting measure computed for a model, m, and an observation string, say y = y_1^T,† will be P(y|m) if the F-B ("any path") method is used.

A widely used member of this class is the Gaussian mixture density, which is of the form

b(ξ|i) = Σ_{m=1}^{M} c_im N(ξ; μ_im, C_im),   (12.109)

in which c_im is the mixture coefficient for the mth component for state i, and N(·) denotes a multivariate Gaussian pdf with mean μ_im and covariance matrix C_im. In order for b(ξ|i) to be a properly normalized pdf, the mixture coefficients must be nonnegative and satisfy the constraint

Σ_{m=1}^{M} c_im = 1,   1 ≤ i ≤ S.   (12.110)

For a sufficiently large number of mixture densities, M, (12.109) can be used to arbitrarily accurately approximate any continuous pdf. Note that, as a special case, a single Gaussian pdf (M = 1) may be used to model the observations at any state.

Reestimation formulas have been derived for the three quantities c_il, μ_il, and C_il for mixture density l in state i (Liporace, 1982; Juang, 1985; Juang et al., 1986). Suppose that we define

γ(i; l, t) ≝ P(x(t) = i, with y(t) produced in accordance with mixture density l | y, m),   (12.111)

which may be computed as

γ(i; l, t) = [α(y_1^t, i) β(y_{t+1}^T | i) / Σ_{j=1}^{S} α(y_1^T, j)] · [c_il N(y(t); μ_il, C_il) / Σ_{m=1}^{M} c_im N(y(t); μ_im, C_im)].   (12.112)

The reestimates are then

c̄_il = Σ_{t=1}^{T} γ(i; l, t) / Σ_{t=1}^{T} Σ_{l=1}^{M} γ(i; l, t),   (12.113)

μ̄_il = Σ_{t=1}^{T} γ(i; l, t) y(t) / Σ_{t=1}^{T} γ(i; l, t),   (12.114)

C̄_il = Σ_{t=1}^{T} γ(i; l, t) [y(t) − μ̄_il][y(t) − μ̄_il]^T / Σ_{t=1}^{T} γ(i; l, t).   (12.115)

An heuristic interpretation of (12.113)-(12.115) is as follows. Equation (12.113) represents a ratio of the expected number of times the path is in state i and using the lth mixture component to generate the observation to the number of times the path resides in state i. Equation (12.114) is a weighted time average of the observation vectors, weighted according to the likelihood of their having been produced by mixture density l in state i. The computation of the covariance reestimate is a similarly weighted temporal average.

As in the discrete observation case, the use of these reestimation formulas will ultimately lead to a model, m̄, which represents a local maximum of the likelihood P(Y|m̄). However, finding a good local maximum depends rather critically upon a reasonable initial estimate of the vector and matrix parameters μ_il and C_il for each i and l. The reader should be able to discern that this is intuitively correct as he or she ponders the training process. This means that we must somehow be able to use the training data to derive meaningful initial estimates prior to executing

*How should this likelihood be defined for the Mealy-form HMM?
†The notation y = y_1^T and similar notations are just the obvious vector versions of (12.22)-(12.25).
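The mixture density of (12.109) is straightforward to evaluate numerically. The following sketch is not from the text: it assumes diagonal covariance matrices purely for brevity, and the parameter values are illustrative. It computes b(ξ|i) as a weighted sum of Gaussian pdf's, enforcing the constraint (12.110) on the weights:

```python
import math

def gaussian_diag(x, mean, var):
    """Multivariate Gaussian pdf with a diagonal covariance (list of variances)."""
    logp = 0.0
    for xk, mk, vk in zip(x, mean, var):
        logp += -0.5 * (math.log(2 * math.pi * vk) + (xk - mk) ** 2 / vk)
    return math.exp(logp)

def mixture_density(x, weights, means, vars_):
    """b(x | i) = sum_m c_im N(x; mu_im, C_im), as in (12.109), diagonal C_im."""
    assert abs(sum(weights) - 1.0) < 1e-9   # constraint (12.110)
    return sum(c * gaussian_diag(x, m, v)
               for c, m, v in zip(weights, means, vars_))

# Two-component mixture in two dimensions (made-up parameters for one state i):
weights = [0.6, 0.4]
means   = [[0.0, 0.0], [3.0, 3.0]]
vars_   = [[1.0, 1.0], [0.5, 0.5]]
b_near = mixture_density([0.1, -0.1], weights, means, vars_)
b_far  = mixture_density([10.0, 10.0], weights, means, vars_)
```

With M = 1 the routine reduces to a single Gaussian, the special case noted above.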
the reestimation steps. Below we describe a convenient procedure for deriving good separation of the mixture components for use with a Viterbi procedure. We will discuss this issue further in Section 13.5.4, where the HMMs will be trained in the context of a continuous-speech recognizer. The same general principles discussed there apply to this simpler case assumed here in which each HMM represents a discrete utterance.

Of course, the observation densities comprise only one part of the model description. We still need to find estimates for the state transition and initial state probabilities in the continuous observation case. However, we need not do any further work on this issue, since these probabilities have exactly the same meaning and structure in the continuous observation case and can be found directly from (12.94) and (12.96). The complete training algorithm will therefore be similar to the F-B algorithm for the discrete case.

[Figure 12.8: error rate versus number of mixture densities per state (one model per digit) for models with 5, 6, 10, and 11 states.]

Viterbi Procedure. If a Viterbi approach is used, the mean vectors and covariance matrices for the observation densities are reestimated by

μ̄_i = (1/N_i) Σ_{t: y(t)→i} y(t),   (12.116)

C̄_i = (1/N_i) Σ_{t: y(t)→i} [y(t) − μ̄_i][y(t) − μ̄_i]^T,   (12.117)

where the sums run over those times t at which y(t) is assigned to state i along the best path, and N_i is the number of such vectors. When M > 1 mixture components appear in a state, then the observation vectors assigned to that state must be subdivided into M subsets prior to averaging. This can be done by clustering, using, for example, the K-means algorithm (see Section 1.3.5) with K = M. If there are N_il vectors assigned to the lth mixture in state i, then the mixture coefficient c_il is reestimated as

c̄_il = N_il / N_i.   (12.118)

Some results showing the effect of the number of mixture components per state in a digit recognition experiment (Rabiner et al., 1989) are shown in Fig. 12.8. Further experiments are reported in (Wilpon et al., 1991).

12.2.4 Inclusion of State Duration Probabilities in the Discrete Observation HMM

One of the benefits of using HMMs is that they obviate a complete a priori characterization of the acoustic structure of an utterance to be modeled. If properly "seeded" (see Section 13.5.4), the HMM is capable of "self-organizing" the acoustic data into a meaningful and effective model. Nevertheless, the states of the model are often thought of, to a first approximation, as representing distinct acoustical phenomena in the utterance, such as a vowel sound in a word or a transition between phonemes in a word. The number of states in a model is sometimes chosen to correspond to the expected number of such phenomena. For example, if an HMM is used to model a phoneme (rather than a complete word), then three states might be used: one to capture the transition on either end of the phoneme, and one for the "steady-state" portion. However, the HMM "organizes itself" to maximize an analytic criterion, and not necessarily to correspond to some acoustic structure that the designer may have in mind.

Nevertheless, experimental evidence suggests that states frequently represent identifiable acoustic phenomena. To the extent that this is the
case, the conventional HMM as described above has a serious flaw. Acoustic phenomena in speech tend not to be exponentially distributed in duration. One would expect a given phoneme in a given position in a word, for example, to have a normally distributed duration across different renditions of the word. Yet, the durations of states within a conventional HMM have exponential probability distributions. By this we mean the following: Suppose we know that a given HMM, at time t, enters state i. What is the probability that the duration of stay in state i is d frames long? From our knowledge of the transition probabilities, it is easy to show that

P(duration of stay in state i = d) = [a(i|i)]^{d−1} [1 − a(i|i)],   (12.119)

a probability that decays exponentially with d.

The remedy explored here is to build explicit duration probability distributions into the model. The training and recognition methods must consequently contain an F-B procedure for computing P(y|m) for the recognition phase. In this case the HMM includes the duration probability distributions at the states,

m = {S, π(1), A, B, P(τ_i = d) for 1 ≤ i ≤ S, 1 ≤ d ≤ D}.   (12.121)

Note that we have included a maximum allowable duration, D, for each state. Also note that the diagonal elements of A need not be estimated.

We begin by redefining the forward-going probability sequence,

α(y_1^t, i) ≝ P(y_1^t, x(t) = i, x(t+1) ≠ i | m).   (12.122)

The probability of the observations y_1^t jointly with a particular state sequence i_1, ..., i_r and its durations d_1, ..., d_r contains, for each state visited, factors of the form

··· a(i_r | i_{r−1}) P(τ_{i_r} = d_r) b(y(d_1 + ··· + d_{r−1} + 1) | i_r) ··· b(y(t) | i_r),   (12.125)

where the durations along the state sequence sum to t. By induction, we can write (12.125) as

α(y_1^t, j) = Σ_{i=1, i≠j}^{S} Σ_{d=1}^{D} α(y_1^{t−d}, i) a(j|i) P(τ_j = d) Π_{s=t−d+1}^{t} b(y(s)|j)   (12.126)

for t > D, where D is the maximum allowable duration in any state.

To initialize the recursion, we need the values of α(y_1^t, j) for t ∈ [1, D] and j ∈ [1, S]. These can be shown to be as follows:

α(y_1^1, j) = P(x(1) = j) P(τ_j = 1) b(y(1)|j),   1 ≤ j ≤ S,   (12.127)

α(y_1^2, j) = P(x(1) = j) P(τ_j = 2) Π_{s=1}^{2} b(y(s)|j) + Σ_{k=1, k≠j}^{S} α(y_1^1, k) a(j|k) P(τ_j = 1) b(y(2)|j),   1 ≤ j ≤ S,   (12.128)

α(y_1^t, j) = P(x(1) = j) P(τ_j = t) Π_{s=1}^{t} b(y(s)|j) + Σ_{d=1}^{t−1} Σ_{k=1, k≠j}^{S} α(y_1^{t−d}, k) a(j|k) P(τ_j = d) Π_{s=t−d+1}^{t} b(y(s)|j),   1 ≤ j ≤ S,   (12.129)

and so forth, until the α(y_1^D, j) for 1 ≤ j ≤ S are computed.

As an aside, we note that

P(y|m) = Σ_{i=1}^{S} α(y_1^T, i),   (12.130)

so that this forward sequence can be used to evaluate the model with respect to an incoming string y for recognition purposes. This is exactly analogous to what happens in the case of the conventional model ("any path" approach), so we need not say anything further about the recognition problem.

Now let us define further forward and backward sequences as follows:

α'(y_1^t, i) ≝ P(y_1^t, x(t) ≠ i, x(t+1) = i | m),   (12.131)

β(y_{t+1}^T | i) ≝ P(y_{t+1}^T | x(t) = i, x(t+1) ≠ i, m),   (12.132)

β'(y_{t+1}^T | i) ≝ P(y_{t+1}^T | x(t) ≠ i, x(t+1) = i, m).   (12.133)

The reader is encouraged to study each of these definitions to understand its significance. The "unprimed" sequences are concerned with the case in which the residency in state i comes to an end at time t, while the "primed" sequences involve the case in which state i begins at the next observation time, t + 1. It is not difficult to demonstrate the following relationships among the "primed" and "unprimed" forward and backward sequences:

α'(y_1^t, i) = Σ_{j=1}^{S} α(y_1^t, j) a(i|j),   (12.134)

α(y_1^t, i) = Σ_{d=1}^{D} α'(y_1^{t−d}, i) P(τ_i = d) Π_{s=t−d+1}^{t} b(y(s)|i),   (12.135)

β(y_{t+1}^T | i) = Σ_{j=1}^{S} β'(y_{t+1}^T | j) a(j|i),   (12.136)

β'(y_{t+1}^T | i) = Σ_{d=1}^{D} β(y_{t+d+1}^T | i) P(τ_i = d) Π_{s=t+1}^{t+d} b(y(s)|i).   (12.137)

In these terms, the reestimation formulas for the parameters of the model are as follows:

ā(j|i) = Σ_{t=1}^{T} α(y_1^t, i) a(j|i) β'(y_{t+1}^T | j) / Σ_{j=1}^{S} Σ_{t=1}^{T} α(y_1^t, i) a(j|i) β'(y_{t+1}^T | j),   (12.138)

b̄(k|i) = Σ_{t: y(t)=k} Σ_{τ<t} [α'(y_1^τ, i) β'(y_{τ+1}^T | i) − α(y_1^τ, i) β(y_{τ+1}^T | i)] / Σ_{t=1}^{T} Σ_{τ<t} [α'(y_1^τ, i) β'(y_{τ+1}^T | i) − α(y_1^τ, i) β(y_{τ+1}^T | i)],   (12.139)

P̄(τ_i = d) = Σ_{t=1}^{T} α'(y_1^t, i) P(τ_i = d) β(y_{t+d+1}^T | i) Π_{s=t+1}^{t+d} b(y(s)|i) / Σ_{d=1}^{D} Σ_{t=1}^{T} α'(y_1^t, i) P(τ_i = d) β(y_{t+d+1}^T | i) Π_{s=t+1}^{t+d} b(y(s)|i),   (12.140)

P̄(x(1) = i) = P(x(1) = i) β(y_1^T | i) / P(y_1^T | m).   (12.141)
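The exponential duration law of (12.119) is easy to verify numerically. This small sketch (illustrative, not from the text) shows that the pmf implied by a self-transition probability a(i|i) always peaks at d = 1 and decays monotonically, which is exactly the flaw that motivates explicit duration distributions:

```python
def implicit_duration_pmf(a_ii, d):
    """P(stay in state i lasts exactly d frames) = a(i|i)^(d-1) * (1 - a(i|i)),
    the geometric law of (12.119) implied by a conventional HMM."""
    return (a_ii ** (d - 1)) * (1.0 - a_ii)

a_ii = 0.8
pmf = [implicit_duration_pmf(a_ii, d) for d in range(1, 200)]
mode = pmf.index(max(pmf)) + 1   # most probable duration is always d = 1
```

A bell-shaped duration density (e.g., a discretized Gaussian) cannot arise this way, no matter how a(i|i) is chosen.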
*At first this might seem inconsistent with our definition of the initial state probability. However, the estimate that we use for this quantity is precisely the probability of being in state i at time t = 1 given the training observations. So this estimate is a proper one.

and write

ᾱ(y_1^t, i) = c(t) α̂(y_1^t, i).   (12.146)

The recursion becomes

α̂(y_1^t, i) = Σ_{j=1}^{S} ᾱ(y_1^{t−1}, j) a(i|j) b(y(t)|i).   (12.150)

This expression makes it possible to compute ᾱ(y_1^t, i), i = 1, ..., S, directly from ᾱ(y_1^{t−1}, j), j = 1, ..., S, thereby obviating the "intermediate" sequence α̂(y_1^t, i).†

This is reasonable, since the objective of the scaling is simply to keep the numbers in a useful dynamic range, and since the α's and β's tend to be of the same order of magnitude. With this scaling strategy, it can be shown that the backward sequences may be scaled by the same factors c(t). Further, with this choice of scaling, it is not difficult to show that the scale factors cancel in the reestimation ratio, so that the scaled values can be used in the reestimation equation for the state transition probabilities without modification. In a similar manner, the symbol probabilities can be computed from the "usual" form, (12.98), with the scaled forward and backward sequences inserted.

Let us now explore the practical effects of this scaling strategy. Using (12.151) in (12.153), we have

†Recall that β(y_{t+1}^T | i) is associated with time t.
Clearly, it is not appropriate to use the scaled α values in this expression. However, from (12.154) we see that

Σ_{i=1}^{S} ᾱ(y_1^t, i) = 1   (12.159)

for any t. In particular, we can write

Σ_{i=1}^{S} ᾱ(y_1^T, i) = [Π_{τ=1}^{T} c(τ)] Σ_{i=1}^{S} α(y_1^T, i) = [Π_{τ=1}^{T} c(τ)] P(y|m) = 1,   (12.160)

so that

P(y|m) = 1 / Π_{τ=1}^{T} c(τ).   (12.161)

Since the product of the c(τ)'s is likely to be extremely large, we can compute the logarithm instead,

−log P(y|m) = Σ_{τ=1}^{T} log c(τ),   (12.162)

which provides the necessary likelihood measure.

When multiple training observations are used, the reestimation formulas generalize in the obvious way. For example, the symbol probabilities become

b̄(k|j) = Σ_{l=1}^{L} γ^{(l)}(j, k; ·) / Σ_{l=1}^{L} γ^{(l)}(j; ·),   (12.166)

where the superscript l indicates the result for the lth observation, say y^{(l)}, of which there are a total of L. The length of observation y^{(l)} is T_l. In terms of the forward and backward sequences, the reestimates may be written

ā(j|i) = Σ_{l=1}^{L} [1/P(y^{(l)}|m)] Σ_{t=1}^{T_l−1} α^{(l)}(y_1^t, i) a(j|i) b(y^{(l)}(t+1)|j) β^{(l)}(y_{t+2}^{T_l} | j) / Σ_{l=1}^{L} [1/P(y^{(l)}|m)] Σ_{t=1}^{T_l−1} α^{(l)}(y_1^t, i) β^{(l)}(y_{t+1}^{T_l} | i),   (12.167)

b̄(k|j) = Σ_{l=1}^{L} [1/P(y^{(l)}|m)] Σ_{t: y^{(l)}(t)=k} α^{(l)}(y_1^t, j) β^{(l)}(y_{t+1}^{T_l} | j) / Σ_{l=1}^{L} [1/P(y^{(l)}|m)] Σ_{t=1}^{T_l} α^{(l)}(y_1^t, j) β^{(l)}(y_{t+1}^{T_l} | j).   (12.168)
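The scaling strategy of (12.159)-(12.162) amounts to normalizing the forward variables at each frame and accumulating the logarithms of the normalizers. A minimal discrete-observation sketch (the toy model parameters are assumed purely for illustration):

```python
import math

def forward_loglik_scaled(pi, A, B, obs):
    """Scaled forward pass: normalize alpha at each t and accumulate
    log P(y|m) as the sum of log normalizers, avoiding underflow
    (cf. (12.159)-(12.162))."""
    S = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(S)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[j] * A[j][i] for j in range(S)) * B[i][obs[t]]
                     for i in range(S)]
        norm = sum(alpha)
        loglik += math.log(norm)
        alpha = [a / norm for a in alpha]   # scaled alphas now sum to 1
    return loglik

pi  = [0.7, 0.3]
A   = [[0.9, 0.1], [0.2, 0.8]]   # A[i][j] = P(next = j | current = i)
B   = [[0.6, 0.4], [0.1, 0.9]]   # B[i][k] = P(observe k | state i)
obs = [0, 1, 1, 0, 1]
ll = forward_loglik_scaled(pi, A, B, obs)
```

For long observation strings the unscaled forward variables underflow double precision, while the accumulated log-likelihood remains well behaved.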
The corresponding scaled forms are

ā(j|i) = Σ_{l=1}^{L} Σ_{t=1}^{T_l−1} ᾱ^{(l)}(y_1^t, i) a(j|i) b(y^{(l)}(t+1)|j) β̄^{(l)}(y_{t+2}^{T_l} | j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l−1} ᾱ^{(l)}(y_1^t, i) β̄^{(l)}(y_{t+1}^{T_l} | i),   (12.171)

b̄(k|j) = Σ_{l=1}^{L} Σ_{t: y^{(l)}(t)=k} ᾱ^{(l)}(y_1^t, j) β̄^{(l)}(y_{t+1}^{T_l} | j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} ᾱ^{(l)}(y_1^t, j) β̄^{(l)}(y_{t+1}^{T_l} | j),   (12.172)

where ᾱ and β̄ indicate the scaled values.

Finally, we note that the use of the Viterbi reestimation approach with multiple observations is very straightforward. In fact, we mentioned this idea when the method was introduced in Section 12.2.2. With a review of (12.105) and (12.106), the enhancement will be apparent.

12.2.7 Alternative Optimization Criteria in the Training of HMMs

Prerequisite Chapter 1 Reading: Sections 1.3.4 and 1.4, especially Subsection 1.4.3.

Thus far, we have adopted a maximum likelihood (ML) approach to the design of an HMM.^20 This philosophy asserts that a model is "good" if its parameters are adjusted to maximize the probability P(y|m) of generating the observation (training) sequences for which it is "responsible." Although the maximum likelihood technique has yielded many encouraging results in practice, there are two fundamental conceptual problems with this approach. First, the signal (as reflected in the observation sequence) might not adhere to the constraints of the HMM, or there might be insufficient data to properly train the HMM even if the model is generally accurate. Second, the ML approach does not include any means of effecting "negative training," in which a model is trained to not only respond favorably to its own class, but to discriminate against productions of other models. These conceptual problems gave rise to research into minimum discrimination information (MDI) approaches, based on the measure

DI = ∫ f(x | ξ = 1) log [ f(x | ξ = 1) / f(x | ξ = 2) ] dx

[cf., e.g., (1.216)], where ξ = 1 indicates the signal, and ξ = 2, the HMM, and where ξ is some vector of parameters that characterizes the signal. Of course, a small DI measure indicates good agreement between the signal and model densities.

In the first of the papers noted above, the ML approach to HMM training is shown to be a special case of the MDI approach. Another special case is the maximum average mutual information (MMI) approach, which was researched by Bahl et al. (1986) in response to the problems with the ML approach noted above. The MMI approach gives a good sense of the "negative training" aspect of MDI approaches.

Suppose that we have R different HMMs to be trained (e.g., each representing a different word), m_1, ..., m_R. Let us denote by m the random variable indicating a model outcome in an experiment (e.g., m = 1 indicates that m_1 is chosen). Let us also assume that we have L training observation strings of lengths T_1, ..., T_L, say y^{(1)}, ..., y^{(L)}. As we have done in the past, let us use the abusive notation "y = y^{(k)}" to mean that realization k has occurred. By a slight generalization of (1.240) to allow for the random vector string, the average mutual information between the random quantities y and m is

M(y, m) = Σ_{l=1}^{L} Σ_{r=1}^{R} P(y = y^{(l)}, m = m_r) log [ P(y = y^{(l)}, m = m_r) / (P(y = y^{(l)}) P(m = m_r)) ].   (12.173)

Note that this can be written in terms of the model likelihoods P(y^{(l)} | m_r). If y^{(l)} is to be used to train m_l, then, if we assume that P(y = y^{(l)}, m = m_r) ≈ δ(l − r), we can approximate (12.174) by an expression involving pairwise terms of the form

D(m_1, m_2) ≝ E{ log P(y^{(2)} | m_1) − log P(y^{(2)} | m_2) }.   (12.176)

^20 We have alluded to the fact that the HMM is often used at higher levels of a speech recognizer to model linguistic constraints. In the present discussion, we restrict our attention to the use of the HMM at the acoustic level, as we have tacitly done throughout this chapter.
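Equation (12.173) is a plain double sum over a joint probability table. A small sketch follows; the joint probabilities are hypothetical, chosen only to illustrate the two extremes of perfect coupling and independence:

```python
import math

def average_mutual_information(joint):
    """M(y, m) = sum_l sum_r P(y^l, m_r) log[ P(y^l, m_r) / (P(y^l) P(m_r)) ],
    as in (12.173); joint[l][r] = P(y = y^(l), m = m_r)."""
    L, R = len(joint), len(joint[0])
    Py = [sum(joint[l]) for l in range(L)]                      # marginal over m
    Pm = [sum(joint[l][r] for l in range(L)) for r in range(R)] # marginal over y
    M = 0.0
    for l in range(L):
        for r in range(R):
            p = joint[l][r]
            if p > 0:
                M += p * math.log(p / (Py[l] * Pm[r]))
    return M

# Each training string produced only by "its" model (maximal coupling):
diag  = [[0.5, 0.0], [0.0, 0.5]]
# Strings carry no information about which model produced them:
indep = [[0.25, 0.25], [0.25, 0.25]]
M_diag = average_mutual_information(diag)
M_ind  = average_mutual_information(indep)
```

Maximizing this quantity over the model parameters rewards models that assign high likelihood to their own training strings and low likelihood to the others, which is the "negative training" effect described above.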
Another important practical consideration is the structure and size of the model. Recall that by structure we mean the pattern of allowable state transitions, and by size the number of states to be included. If the HMM is a discrete-observation type, then size also includes the number of levels in the observation codebook, and if continuous-observation, the number of mixtures in the observation densities. At the beginning of Section 12.2.4 we discussed the fact that HMMs are used in part to avoid specific a priori statistical and structural characterization of the speech signal. Accordingly, there is not an exact science dictating the size or structure of the model in specific situations. Certain general guidelines are recognized, however.

FIGURE 12.10. Phone model used in the SPHINX system. This HMM is of the Mealy type, which generates observations upon transition. The phone model has only three distinct observation distributions, which are shared among the several transitions. These are labeled B, M, and E, to denote beginning, middle, and end.

As we also noted in Section 12.2.4, experimental evidence suggests that states frequently represent identifiable acoustic phenomena. Therefore, the number of states is often chosen to roughly correspond to the expected number of such phenomena in the utterance. If words are being modeled with discrete observations, for example, 5-10 states are typically used to capture the phones in the utterances. Continuous-observation HMMs typically use more states, often one per analysis frame [e.g., (Picone, 1989)]. If HMMs are used to model discrete phones, three states are sometimes used: one each for the onset and exiting transitions, and one for the steady-state portion of the phone. An enhancement of this three-state phone model used in the SPHINX system (Lee et al., 1990) is shown in Fig. 12.10, and will be discussed further below. As a cruder measure, the average length of the utterances is sometimes used to determine the number of necessary states. A peculiar version of this idea is embodied in the "fenone" model, which we will discuss in the next subsection. A fenone is an acoustic unit of speech that is nominally one frame long. Accordingly, the HMMs for fenones are very small. Example fenone structures are shown in Fig. 12.11. Both the SPHINX and fenone models as shown use the Mealy form of the HMM.

The relationship of the number of states to the performance of the HMM is very imprecise, and, in practice, it is often necessary to experiment with different model sizes to determine an appropriate number. Figure 12.8, for example, shows a plot of error rate versus the number of states in a digit recognition experiment (Rabiner, 1989). Another study of digit recognition (Picone, 1989) suggests that the number of states should be allowed to vary across models for better results.

The most general structure for the HMM is the so-called ergodic^21 model, which allows unconstrained state transitions. In this case none of the elements in the transition probability matrix, A, is constrained to be zero. An example with six states is shown in Fig. 12.12. Such a structure does not coincide well with speech utterances because it does not attempt to model the sequential ordering of events in the signal. Whereas it can be used to provide more flexibility in the generation of observations,^22 this advantage comes at the increased risk of converging on an unsatisfactory local maximum in the training process (see Section 12.2.2). Interestingly, and perhaps not surprisingly, when used with speech, the ergodic model will often train so that it essentially represents a sequential structure (backward transition probabilities turn out zero).

The model structure generally adopted for speech recognition is a left-to-right or Bakis (Bakis, 1976) model. A typical six-state example is shown in Fig. 12.13. Other models illustrated above are seen to be of the left-to-right variety. The Bakis model has the property that states can be aligned in such a way that only left-to-right transitions are possible. Further, it has a well-defined initial and final state. Such a model naturally suits a signal like speech that varies in time from left to right. If the states are numbered sequentially from left to right, then the transition probability matrix will be lower triangular,

a(j|i) = 0, for all j < i.   (12.178)

Frequently, the model includes the additional constraint that no more than one or two states may be skipped in any transition. For example, if only one skip is allowed, as shown in Fig. 12.13, then

^21 For the definition of an ergodic Markov chain, see, for example, (Leon-Garcia, 1989; Grimmett and Stirzaker, 1985).
^22 By this we mean, for example, that a path could "jump backward" to pick up a symbol not found in a particular state.
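A Bakis transition matrix with the one-skip constraint can be generated mechanically. In the sketch below, the probabilities p_stay, p_next, and p_skip are placeholders, and the renormalization near the final state is one simple convention (the text does not prescribe one):

```python
def bakis_transition_matrix(S, p_stay=0.5, p_next=0.3, p_skip=0.2):
    """Left-to-right (Bakis) transition matrix with at most one skipped state:
    a(j|i) = 0 for j < i, per (12.178), and also for j > i + 2."""
    A = [[0.0] * S for _ in range(S)]
    for i in range(S):
        candidates = [(i, p_stay), (i + 1, p_next), (i + 2, p_skip)]
        allowed = [(j, p) for j, p in candidates if j < S]
        total = sum(p for _, p in allowed)
        for j, p in allowed:
            A[i][j] = p / total   # renormalize where targets fall off the end
    return A

A = bakis_transition_matrix(6)
```

Note that the final state necessarily becomes absorbing, a(i_f | i_f) = 1, matching the boundary conditions discussed next.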
12.3 / Practical Issues 727
a(j|i) = 0, for all j > i + 2.   (12.179)

In addition,

a(i_f | i_f) = 1   (12.180)

and

P(x(1) = i_1) = 1,   (12.181)

where i_f and i_1 represent the final and initial states, respectively. (In Fig. 12.13, i_f = 6 and i_1 = 1.) We will see some variations on this structure when we discuss specific forms of the HMM later in the chapter.

The choice of a "constrained" model structure like the Bakis does not require any modification of the training procedures described above. In the F-B case, it is easily seen that any parameter initially set to zero will remain zero throughout the training [see (12.97)-(12.99)].
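The claim that zero-valued parameters stay zero under F-B reestimation is easy to check numerically: the numerator of the transition reestimate contains a(j|i) itself as a factor. A toy sketch (three-state left-to-right model with made-up numbers; the update shown is the standard transition reestimate, cf. (12.97)):

```python
def forward_vars(pi, A, B, obs):
    """Unscaled forward variables alpha[t][i] for a short observation string."""
    S = len(pi)
    al = [[pi[i] * B[i][obs[0]] for i in range(S)]]
    for t in range(1, len(obs)):
        al.append([sum(al[t - 1][j] * A[j][i] for j in range(S)) * B[i][obs[t]]
                   for i in range(S)])
    return al

def backward_vars(A, B, obs):
    """Unscaled backward variables beta[t][i]."""
    S, T = len(A), len(obs)
    be = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            be[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * be[t + 1][j]
                           for j in range(S))
    return be

def reestimate_A(pi, A, B, obs):
    """One F-B update of the transition matrix; each numerator term carries
    the factor A[i][j], so zero entries remain zero."""
    S, T = len(pi), len(obs)
    al, be = forward_vars(pi, A, B, obs), backward_vars(A, B, obs)
    num = [[sum(al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j]
                for t in range(T - 1)) for j in range(S)] for i in range(S)]
    return [[num[i][j] / sum(num[i]) for j in range(S)] for i in range(S)]

pi = [1.0, 0.0, 0.0]
A  = [[0.5, 0.5, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]   # left-to-right zeros
B  = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
A_new = reestimate_A(pi, A, B, [0, 1, 0, 1, 1])
```

The same argument applies to observation probabilities initialized to zero.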
HMM, but a parallel development exists for the Mealy form (see the Bahl paper). Two states are said to be tied if they share common observation probability distributions (or covariance matrices, etc.). For simplicity, consider the three-state HMM shown in Fig. 12.15(a), which is assumed to be trained using a conventional method like the F-B algorithm in conjunction with a set of training sequences, say Y. In Fig. 12.15(b), the "same" HMM is assumed to have been trained with states 1 and 2 tied. To deduce the parameters for the tied model requires only a simple modification of the training algorithms discussed in Section 12.2.2 (see Problem 12.10). At the completion of training, b(k|1) = b(k|2) for all observations, k. The advantage of the model with tied states (which is essentially a two-state model) is that the same amount of training data is used to deduce fewer parameters, which will therefore be of lower variance.

FIGURE 12.14. Typical plot of distortion versus log codebook size.

being more common. Since a larger codebook implies increased computation, there is an incentive to keep the codebook as small as possible without decreasing performance. Like the other parametric decisions that must be made about an HMM, this issue too is principally guided by experimental evidence. Since the centroids loosely correspond to different acoustic phenomena in the speech, some rough guidance is provided by examining the acoustic complexity of the vocabulary. If, for example, the vocabulary is very small, fewer symbols might be sufficient. In a related, but very unusual circumstance, the speaker might be restricted in the number of sounds he or she can reliably produce, for example, because of a speech disability. In this case a smaller codebook might also be preferable, even with a relatively complex vocabulary.
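One way to realize the state tying described above is to pool the expected symbol counts of the tied states before normalizing, so the tied states end the update with identical distributions. A sketch with made-up counts follows; the book's actual modification of the F-B algorithm is left to Problem 12.10:

```python
def reestimate_tied_b(counts, ties):
    """Symbol probabilities with tied states: the states in each tie group pool
    their (state, symbol) occupancy counts and share one distribution, so the
    pooled training data estimates fewer parameters (hence lower variance)."""
    S, K = len(counts), len(counts[0])
    b = [[0.0] * K for _ in range(S)]
    for group in ties:
        pooled = [sum(counts[i][k] for i in group) for k in range(K)]
        total = sum(pooled)
        for i in group:
            b[i] = [c / total for c in pooled]
    return b

# counts[i][k]: expected number of times symbol k is emitted from state i
counts = [[8.0, 2.0], [6.0, 4.0], [1.0, 9.0]]
b = reestimate_tied_b(counts, ties=[[0, 1], [2]])   # states 0 and 1 tied
```

After the update, b(k|0) = b(k|1) for every symbol k, while the untied state keeps its own distribution.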
model parameters are not unacceptably small (Rabiner et al., 1983). If A_t and B_t denote the state transition and observation probability matrices obtained from training, for example, we might set

Ā = εA_t + (1 − ε)A,
B̄ = εB_t + (1 − ε)B.

(1980) and at BBN (Schwartz et al., 1984). Experiments have included
As we move up the hierarchy of linguistic units, we next come to the phonemes. Recall that a phoneme is actually an abstract unit that may have multiple acoustic manifestations. From a modeling point of view, the phoneme can be modeled by creating a network of several phone models. This network might provide multiple paths through the phone models.

[Figure: networks of phone models (Phone 1, Phone 2) combined to form phoneme models (Phoneme 1, Phoneme 2, ...).]

The concepts of language modeling could almost be covered in this chapter devoted to HMMs. The HMM-like language model is not universally employed in speech recognition systems, however, and as time goes on more and more variations are being developed. Therefore, we treat language models in a separate chapter, and must remain content for the present to examine some work that does not involve language models.
(see Sections 4.3.4 and 10.2.3) in both training and recognition tasks. In this case the model represents only speech, and observation strings to be recognized must be carefully processed to represent only speech. The second approach is to intentionally include nonspeech on both ends of the training samples so that it gets accounted for in each individual model (presumably in the initial and final states). During recognition, the observation strings may optionally contain nonspeech at either end. The third technique is to train separate models for the nonspeech and to concatenate them to either end of the models which represent the words only. In this case, like the second, the incoming observation strings for recognition may have nonspeech at either end that will be properly accounted for by the model.

The third approach to accounting for nonspeech presents us again with the problem of hooking together two or more HMMs in an appropriate way. In our discussion of word models above, we did not worry about how such models would be trained and used. This, of course, is an important issue. Indeed, as we travel up the hierarchy to more complex speech recognition problems, this task will be increasingly more complex and important. Our analysis of this problem here will lay the groundwork for these more sophisticated systems. Figure 12.18 shows the nonspeech model concatenated to the initial and final states of the models for each word in the vocabulary. Note that a transition is present for the path to bypass the initial nonspeech model. The complete network amounts to nothing more than a large HMM that we must now train.

In the training phase, each of the individual word models is first "seeded" by a preliminary training pass using an observation string known to be the spoken word only. The nonspeech models are also trained with nonspeech segments. The models are then joined as shown in Fig. 12.18, with some preliminary estimates of transition probabilities on the juncture transitions. A complete training database is then used to retrain the network of models. When training with a particular word, the path is constrained so that it may pass only through that word's model (all other transition probabilities temporarily set to zero at the "boundaries"). The network of HMMs is trained with the entire database in this manner, using the methods for multiple training observations discussed in Section 12.2.6. After a complete pass through the training data is completed, the parameters of the network are updated.

A remarkable feature of the procedure described above is that the boundaries between speech and nonspeech need not be known in the training data (except to create the seed models). Although the benefits of this feature are apparent here, we will later see that this ability of HMM networks to automatically locate boundaries is profoundly significant to CSR systems.

Of course, multiple passes through the database are necessary to appropriately train the network. Convergence to a useful model is usually achieved in a few iterations.

In the recognition phase, the large HMM network is searched in the same manner as we search any HMM (see Sections 12.2.2 and 12.2.3). In this case, a beam search is an important tool because of the large number of paths that can be generated through the network. At the completion of the search, the recognized word is discovered by backtracking through the maximum likelihood path.
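Backtracking through the maximum-likelihood path is the final step of any Viterbi-style search. This minimal two-state sketch (a toy model, with no beam pruning) shows the backpointer mechanism; in a word network, the states visited identify the recognized word:

```python
import math

def safe_log(p):
    """log p, with log 0 mapped to -inf so impossible paths never win."""
    return math.log(p) if p > 0 else -math.inf

def viterbi(pi, A, B, obs):
    """Viterbi search with backpointers; the best path is recovered by
    backtracking from the best final state (log domain avoids underflow)."""
    S, T = len(pi), len(obs)
    logp = [[-math.inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    for i in range(S):
        logp[0][i] = safe_log(pi[i] * B[i][obs[0]])
    for t in range(1, T):
        for j in range(S):
            best_i = max(range(S), key=lambda i: logp[t-1][i] + safe_log(A[i][j]))
            logp[t][j] = (logp[t-1][best_i] + safe_log(A[best_i][j])
                          + safe_log(B[j][obs[t]]))
            back[t][j] = best_i
    state = max(range(S), key=lambda i: logp[T-1][i])
    path = [state]
    for t in range(T - 1, 0, -1):   # follow the backpointers
        state = back[t][state]
        path.append(state)
    path.reverse()
    return path

pi = [1.0, 0.0]
A  = [[0.6, 0.4], [0.0, 1.0]]   # a simple two-state left-to-right model
B  = [[0.9, 0.1], [0.1, 0.9]]
path = viterbi(pi, A, B, [0, 0, 1, 1])
```

A beam search differs only in that states whose scores fall too far below the frame's best are discarded before the next frame is processed.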
We know that in large-vocabulary systems, isolated words will likely be modeled by the concatenation of subword HMMs of units such as phones or fenones. In fact, we might want to include various concatenations of subword models to account for several possible phonetic transcriptions of the word.^25 An example of such a "composite" model is shown in Fig. 12.16. While the task is more complicated in terms of record keeping, the methods for training and recognition are essentially similar to the case of concatenated "nonspeech-speech-nonspeech" models discussed above. Since this task may be considered a special case of the CSR strategies, we defer specific details until Section 13.7, when we will be better prepared to address this issue.

Some example research on HMM-based IWR has been compiled by Mariani (1989). Speaker-dependent IWR studies using a fenone-based HMM were conducted by IBM on the highly confusable 62-word "key-

FIGURE 12.18. "Nonspeech" HMMs concatenated to the initial and final states of the models of each word in the vocabulary.

^25 Inherent in this network is a set of lexical rules indicating how subword units are combined to form words. These rules, indeed, form a set of linguistic constraints. How these rules were discovered and built into the network will not concern us here, but the problem of learning the linguistic rules from a database will be an important issue in our study of language models in Chapter 13.
12.4 / First View of Recognition Systems Based on HMMs 739
board" vocabulary, which includes the alphabet, digits, and punctuation marks (Bahl et al., 1988), as part of the broader "TANGORA" effort to be described in Chapter 13. These studies yielded low error rates. At Lincoln Laboratory, researchers have studied the effects of different types of speaking modes (loud, soft, fast, etc.) and the effects of noise on speaker-dependent HMM-based IWR (Paul et al., 1986). Continuous-observation HMMs with "multistyle training" yielded a 0.7% error rate on a 105-word database. This work was later extended to a medium-vocabulary (207 words) speaker-dependent CSR system, which is detailed in (Paul and Martin, 1988).

In the speaker-independent realm, the Centre National de la Recherche Scientifique (CNRS) in France has used continuous-observation HMMs for the recognition of isolated digits (85% recognition) and signs of the Zodiac (89% recognition) over the public phone lines.

12.4.3 CSR by the Connected-Word Strategy Without Syntax

In one widely read paper (Rabiner et al., 1989), researchers at AT&T Bell Laboratories have explored the application of isolated-word HMMs to the recognition of connected digits (connected utterances of the words for the numerals 0-9 plus the word "oh"). Recall that in our earlier work, we have emphasized the fact that the term connected-speech recognition is a reference to the decoding technique rather than the manner in which the speaker necessarily utters the message. In general, the speech is uttered in a "continuous" manner without cooperation of the speaker. Such is the case with the digit study discussed here.

As we will see below, knowledge of the syntax of a language (which words may follow which others) significantly improves performance of the LB algorithm. The other details are essentially identical to those discussed in Section 11.4.2, so we need not belabor the issue.

The AT&T digit study (Rabiner et al., 1989) offers an interesting application of many of the methods we have studied in our work. The individual models are whole-word HMMs of continuous-observation type with Gaussian mixture pdf's used in each state to characterize the observations. The main observations consist of cepstral features derived from LP analysis of each frame. Interestingly, a second (energy) feature is also computed in each frame, and the observation likelihood b(y(t)|i) is replaced by the product of this quantity with the probability of the energy observation (see the paper for details). State duration probability distributions are also included in the models. The algorithm was tested using single-speaker-trained (same speaker in training and recognition), multispeaker-trained (same speakers in training and recognition), and speaker-independent (different speakers in training from those used in recognition) models. Multiple HMMs were used for each digit. Some typical results are shown in Table 12.1. These research efforts have been applied in various practical domains including speech telephone dialing, credit card number verification systems, and automatic data entry [see several articles in (AT&T Technical Journal, 1990)]. Another speaker-independent digit recognition application to telephone dialing is found in the work of Jouvet et al. (1986).

Many other research centers have contributed to the problem of connected-digit recognition [e.g., (Doddington, 1989)]. The AT&T work was described above because it serves as a good example application of many concepts studied above. We will return to this problem after we have more experience with the concepts of syntax and grammars in Chapter 13.
m u ltiword utterances . Digit str ings represent ch a lle ngi ng tasks for speech
recognizers in that, if the di gits are equall y li kel y in each ti m e slot, then TABLE 12.1. Typical Results of an AT&T Digit Recognition Study. From
th ere is no syntax. > M ethods for recogni zin g di git st rin gs h ave be en (Rabiner, 1989).
based on detailed st atist ical models and D TW (Bush and Kopec. J 987 :
Trai ning Set T esting S et
Rabiner et ai. , 19 8 6a , 19 8 6b ; Bo cch ieri a n d Dodd ingto n, 1986 ). The re
se arc h d escribed :1ere is b ased on HMM s used in a n LB approach . Unknown Kn own U n k nown Known
Our ba ckground will all ow us to und er stand th e ge ne ral prin ciples of length length length length
HMM-based LB in a few se n te nces . In di scu ssin g DTW.ba scd LB. we Mode strings strings strings strings
lea r ned h ow to la yout th e problem a s a seque nce or
gri d sco rc hes using Speaker t ra ined
Vite rh ] decoding. The tricky part of the a lgorithm was setting up the ap (5 0 talker s) 0,78
0 .39 0.1 6 0.35
propri ate arrays 10 "interface" th e grid searches a t th e bo u n d a ries o f th e Mu lt ispea ker
va r io us level s. Further, we have al so le arned 110\\0' to v iew t he H rvr r-.,! rec (50 talkers) 1.74 0 .98 2.85 1.65
og nit io n problem as a Vit erbi sea rch o f a gr id. LB on H MM s co nsists of Speaker ind epende n t
inse rt ing the HMM grid searc hes in place of the D T\V grid search es in (112111 3 ta lkers) 1.24 0. 36 2.94 1.75
Note: The algo rith m was tes ted using single-speake r. tra ine d (sam e speaker in train in g a nd
I eI' l'Th
13. is is a n example of what we WIlt ca ll a Type O. or unrcstnctcd, gra m ma r III C ha p reco gn it io n). mult ispeaker-tra ined (same speakers in traini ng and recogn itio n). and speaker
in depende nt (different speakers in training from t hose used in recognitio n) mo d els. Multi
ple HMMs were used fo r each d igit.
1 ~ . :J I t""I VU IO " ......
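The level-building idea above replaces DTW grid searches with HMM Viterbi searches. As a minimal illustration of the underlying operation, the following sketch scores an observation sequence against a discrete-observation HMM using the log-domain Viterbi recursion. It is a toy example with illustrative arrays, not the AT&T system's continuous-observation, duration-modeling recognizer.

```python
import math

def safe_log(p):
    """log p, with log 0 taken as minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_log_score(A, B, pi, obs):
    """Log probability of the best state path for a discrete-observation HMM.

    A[i][j]: transition probability from state i to state j
    B[i][k]: probability of emitting symbol k in state i
    pi[i]:   initial state probability
    obs:     observation sequence given as symbol indices
    """
    S = len(A)
    # Initialization with the first observation.
    delta = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(S)]
    # Recursion: extend the best partial path into each state j.
    for y in obs[1:]:
        delta = [max(delta[i] + safe_log(A[i][j]) for i in range(S))
                 + safe_log(B[j][y])
                 for j in range(S)]
    return max(delta)
```

In an isolated-word (or level-building) recognizer, each word model would be scored this way and the best-scoring model, or model sequence, selected.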
740 Ch. 12 / The Hidden Markov Model

12.4.4 Preliminary Comments on Language Modeling Using HMMs

We are finally ready to discuss the most challenging speech recognition problem, that of recognizing utterances from large vocabularies, particularly those uttered as "continuous" speech. It is at this point in our study that we can no longer avoid the issue of linguistic constraints. This is true for two reasons. First, as we already know, large vocabularies require the recognizer to interface with the acoustic signal at a subword level, since all word models and variations cannot be stored and searched. This requires the presence of lexical knowledge, which indicates how words are formed from more basic units. (When confronted with this problem above, we simply assumed that the lexical knowledge was somehow provided.) Second, even if whole-word models could be used, the presence of grammatical rules above the word level is necessary to reduce the number of word strings searched, and by doing so decreases the entropy of the search, resulting in better performance. We will therefore study the issues of language modeling and CSR together in Chapter 13.

Why all of these remarks in a chapter devoted to HMMs? Systems involving linguistic constraints comprise the most researched area of speech recognition in recent years, and it is in this domain that the powerful "self-organizing" ability of the HMM has had the most significant payoff. As we have noted several times above, we will find that HMMs play a role not only at the acoustic level of the processing, but frequently at the linguistic levels as well. This is because an HMM is in essence a "finite state automaton," an abstract machine that can generate a language produced by a "regular grammar." Because of its simplicity, a regular grammar is often used to model the speech production code. It is when the language is modeled with other than a regular grammar that the recognizer contains some non-HMM aspects.

With this indication that the theory of HMMs that we have painstakingly worked through in this chapter will be central to the interesting problems to follow, we proceed to the next level of our study of speech recognition.

12.5 Problems

12.1. The Baum-Welch F-B algorithm involves both a "forward" probability recursion and a "backward" probability recursion. The forward recursion is given by (12.34) and is initialized by (12.35). This result was carefully developed in the text.
(a) The backward recursion, (12.36), is developed by a similar line of reasoning. Give a careful development of that recursion.
(b) Explain how the backward recursion should be initialized.

12.2. In this problem we seek an analytical relationship between the two "any path" decoding methods. Suppose that you were given the state space model of (12.42) and (12.43). Show how to use this form of the HMM to compute alpha(y_1^t, i), the forward probability in the F-B algorithm, for t = 1, 2, ..., T and for i = 1, 2, ..., S.

12.3. In developing the Viterbi search, we concluded that the optimal model of length T is the one which maximizes

    D' = prod_{t=1}^{T} delta[(t, i_t) | (t-1, i_{t-1})]
       = prod_{t=1}^{T} a(i_t | i_{t-1}) b(y(t) | i_t),    (12.185)

where

    a(i_1 | i_0) =def P(x(1) = i_1).    (12.186)

This represents the first DP search encountered for which the maximal-cost path is desired. We circumvented the need to worry about this case by taking the negative logarithm of each side of the cost equation, then minimizing the cost. In this problem, we develop the Viterbi algorithm based directly upon (12.185).
(a) Give a simple argument showing that the BOP works equally well for finding the maximal-cost path.
(b) Modify the Viterbi algorithm in Fig. 12.5 so that it is based upon a maximization of (12.185).
(c) What is wrong with this modified algorithm in practice? Give some rough numbers to support your answer.

12.4. Hidden Markov modeling depends critically upon the unrealistic assumption that the observations in the sequence y_1^T = y are statistically independent. (This has often been cited in the literature as a possible explanation for many of the HMM's failings.) One place to simply see how important this assumption is to the methods is in the state space equations (12.42) and (12.43).
(a) Explain where and how the equations would change if the observations within a state were assumed mu-dependent for every state [this means that the random variable y(t) is independent of y(u) only if |t - u| > mu]. (Note: The main point, that the storage requirements and computational complexity increase beyond practicality if mu > 0, can be appreciated by considering mu = 1. You may wish to consider this case, then comment on what happens as mu increases.)
(b) If the state sequence were additionally nu-dependent, how would the equations change? (Again nu = 1 is sufficient to make the point.)
(c) In either of the cases above, does a simple adjustment to the training process seem likely? Would sufficient training data be available?
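As a concrete reference point for the forward ("any path") recursion that Problems 12.1 and 12.2 work with, the sketch below implements it for a discrete-observation HMM. The variable names are illustrative only; it is a minimal sketch, not the scaled implementation a practical system would need.

```python
def forward_probabilities(A, B, pi, obs):
    """alpha[t][i], the joint probability of y_1 ... y_t and x(t) = i.

    A[i][j]: transition probability from state i to state j
    B[i][k]: probability of emitting symbol k in state i
    pi[i]:   initial state probability
    obs:     observation sequence given as symbol indices
    """
    S = len(A)
    # Initialization: joint probability of the first symbol and each state.
    alpha = [pi[i] * B[i][obs[0]] for i in range(S)]
    trellis = [alpha]
    # Recursion: sum over predecessor states ("any path"), then emit.
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(S)) * B[j][y]
                 for j in range(S)]
        trellis.append(alpha)
    return trellis

# P(y | model) is the sum of the final column: sum(trellis[-1]).
```

Summing the last column over states gives the "any path" likelihood of the whole sequence, in contrast to the single best path scored by the Viterbi recursion.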
12.6. (a) It is desired to deduce a continuous observation HMM with scalar observations. The observation densities are to be modeled by a single Gaussian pdf at each state. There are S states in the model. Sketch the part of the training (reestimation) algorithm that infers these densities from the training data.
(b) Repeat (a) for a Viterbi procedure.

12.7. HMM m1 has been trained on utterances of the word "on." A second HMM, say m2, is trained on the word "off." A speech-recognition system containing these two models is to be used to control a nuclear power plant.
(a) m1 has a single Gaussian observation pdf at each state. Following training, it is discovered that, for any state, the feature vectors have approximately independent, mean-zero, unity-variance elements. Use (12.67) to show that the likelihood of m1 having generated any observation sequence is independent of the particular observation sequence. Give an expression for that likelihood. [Note: x(t) in (12.67) should be replaced by y(t) because vector-valued observations are used in this problem.]
(b) m2 exhibits similar observation pdf's to those of m1. Give an analytical argument that no matter which word is spoken, the same word (either "on" or "off") will always be recognized.
(c) In practice, of course, utterances of "on" and "off" are equally likely (they occur alternately). Find the long-term recognition rate (probability of correct recognition) using the HMMs m1 and m2. Would it be just as effective to randomly select "on" or "off" in response to an utterance? How close to the nuclear power plant would you like to live? (Express your answer in light-years.)
(d) Suppose that for m1 each state has a single observation pdf similar to that described in (a) except the mean vectors are different. The mean feature vector for state i is mu_{y|i}. Show that the Euclidean metric is an appropriate measure of likelihood to use in (12.67),

    -log b(y(t)|i) = d_2(y(t), mu_{y|i}).    (12.187)

(e) Suppose that m1 emerged with different state mean vectors [as in (d)] and with approximately equal covariances among the states. Let C_y be the average covariance matrix. Show that the Mahalanobis distance is an appropriate cost to use in (12.67).

12.8. ...tively replaced by uniform duration distributions,

    P(d_i = d) = 1/(D_{i,out} - D_{i,in} + 1),    D_{i,in} <= d <= D_{i,out}
               = 0,                               other d.    (12.189)

(b) A similar approach to that in (a) is to allow the search to remain in state i at no cost for D_{i,in} to D_{i,out} observations, but to impose a severe cost penalty for leaving earlier or later. Modify the Viterbi search accordingly.

12.9. Verify (12.157) and derive the similar scaled reestimation formula for the observation probabilities.

12.10. (a) Show how to modify the F-B algorithm for the discrete observation HMM so that the observation probability vectors are shared (tied) between states i_1 and i_2.
(b) Repeat (a) for Viterbi training.

12.11. The following problem is based on an analysis by Juang and Rabiner (1985) used to motivate the need for an HMM distance measure like the one discussed in Section 12.2.8. Consider two discrete observation HMMs, say m1 and m2, with the following associated statistics:

    A_1 = | p    1-p |     B_1 = | q    1-q |     pi(1) = | 1/2 |
          | 1-p  p   |,          | 1-q  q   |,            | 1/2 |,

    A_2 = | r    1-r |     B_2 = | s    1-s |     pi(2) = | 1/2 |
          | 1-r  r   |,          | 1-s  s   |,            | 1/2 |.    (12.190)
(a) If there are only two possible observations, called y_1 and y_2, show that P[y(T)|m1] = P[y(T)|m2] if

    s = (p + q - 2pq - r) / (1 - 2r).    (12.191)

When (12.191) holds, m1 and m2 are very similar, since they tend to generate the same observation sequences. However, we show next that this similarity need not be evident in the matrices that comprise the models.
(b) Suppose that p = 0.6, q = 0.7, and r = 0.2. What is s such that (12.191) holds? Are the two corresponding HMMs apparently similar by viewing their matrices?
(c) In vector spaces, vector norms are said to induce a norm upon matrices [see, e.g., (Noble, 1969, Sec. 13.2)]. For example, the Euclidean norm of a matrix D is ...

Language Modeling

Reading Notes:
1. This chapter will not require any of the specialized topics in Chapter 12. The reader might wish to casually review the comments at the beginning of Section 1.3 to help put this material in perspective.
2. If it has been a while since the reader has studied Chapter 10, especially Section 10.2.4, it would be worthwhile to review the concepts and terminology.

1. Even recognition techniques with no apparent linguistic processing can be framed as special cases.
746 Ch. 13 / Language Modeling

that IWR is still used in linguistic processing systems where applicable, but the processing fits neatly into the general framework and need not be considered separately.2 As we will discover, the same comments apply to the "connected-speech" processing technique. It is more effective to view such a system as a special case of general CSR processing. In fact, we will employ the connected-speech approach as a simple way to introduce some of the fundamental principles of linguistic processing.

Of course, all of the comments above will be better understood in retrospect. At the outset, however, it is important to note that language processing tends to unify all speech recognition techniques into a general framework. Although large-vocabulary CSR benefits greatly from linguistic processing, and is essentially impossible without it, language modeling is not just a sophisticated theory employed to assist with the very complex problem of large-vocabulary CSR.

Language modeling research comprises a vast endeavor with a long and interesting history in which numerous and varied systems have been produced. Our goal here will be to provide an overview of some of the basic operating principles, particularly those related to HMMs, and to briefly sketch some of the historical and contemporary developments and systems. Following the study of this chapter, the reader who is interested in specific details of systems will be prepared to pursue the literature on the subject.

Our first task is to learn some formal techniques for the development and study of language models.

13.2 Formal Tools for Linguistic Processing

13.2.1 Formal Languages

Our goals in this section are to gain a working knowledge of some of the concepts of formal languages and automata theory. We will not pursue these topics in depth. The interested reader is referred to one of the standard texts on the subject [e.g., (Fu, 1982; Hopcroft and Ullman, 1979)].

In Chapter 10 we described a model of natural language due to Peirce that included symbolic, grammatical, semantic, and pragmatic components. We then went on to describe knowledge sources due to Reddy that could be embedded in a speech recognizer. These are shown in Fig. 10.4 in relation to the formal components of Peirce's abstract model. Recall that the grammar is a set of rules by which symbols (phonemes in our discussion there) may be properly combined in the natural language. Recall also that the language is the set of all possible combinations of symbols.

The concept of a "language" and a "grammar" can be generalized to any phenomenon that can be viewed as generating structured entities by building them from primitive patterns according to certain rules. Speech is one such patterned phenomenon, but we can also construct grammars that govern the formation of images, strings of binary digits, or computer code. The primitive patterns associated with a "formal language" are called "terminals" of the language. The language itself is defined to be the set of all terminal strings that can be produced by the rules of the "formal grammar." We will make these notions more concrete below.

For our purposes, we can define an automaton to be an abstract machine capable of carrying out the rules of a grammar to produce any element of the language. We will discover that an HMM is in essence a "finite state automaton," an abstract machine that can generate a language produced by a "regular grammar." Because of its simplicity, a regular grammar is often used to model the speech production code at all levels of linguistic and acoustic processing. This is why the HMM figures so prominently in the CSR problem. It is when the language is modeled with other than a regular grammar that the recognizer will contain some "non-HMM" aspects. In order to put these issues in perspective, our first task in this chapter will be to learn some principles of formal language modeling.

Formally, a grammar of a language is a four-tuple

    G = (V_n, V_t, P, S),    (13.1)

in which V_n and V_t are the nonterminal and terminal vocabularies (finite sets), P is a finite set of production rules, and S is the starting symbol for all productions. The sets V_n and V_t are disjoint, and their union, say V, is called the vocabulary of the language. We denote by V_t* the set of all possible strings that can be composed from the elements of V_t. (A similar meaning is given to V_n* and V*.) The rules in P are of the form3

    alpha -> beta,    (13.2)

where alpha and beta are strings over V, with alpha containing at least one element of V_n.

By definition, L(G) is the language with which the grammar is associated. We can describe the language more formally as

    L(G) = {w : w in V_t* and S => w},    (13.3)

where S => w denotes that the terminal string w is derivable from S by application of the rules in P.

2. One can correctly argue that IWR, as studied in previous chapters, is a special case of the connected-speech CSR problem when no language model is used. This is, of course, correct, but unlike the present case, attempting to treat IWR as a special case of CSR there would have been pedagogically inappropriate because the known temporal boundaries are very significant in that problem.

3. In the following, lowercase roman letters will indicate strings over V_t*, uppercase roman letters strings over V_n*, and lowercase Greek letters strings over V*.
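These definitions can be made concrete with a toy grammar. The sketch below uses hypothetical rules (not from the text) in the right-linear special case of (13.2), where every rule has the form A -> tB or A -> t, and enumerates the terminal strings of L(G) up to a given length:

```python
# Toy right-linear grammar G = (Vn, Vt, P, S):
#   S -> a A | b        A -> a A | b
# Each production is stored as (terminal, next_nonterminal_or_None).
P = {
    "S": [("a", "A"), ("b", None)],
    "A": [("a", "A"), ("b", None)],
}

def language(max_len):
    """Enumerate the terminal strings of L(G) with length <= max_len."""
    strings = set()
    stack = [("", "S")]   # (terminals generated so far, current nonterminal)
    while stack:
        prefix, nonterminal = stack.pop()
        for terminal, nxt in P[nonterminal]:
            s = prefix + terminal
            if len(s) > max_len:
                continue
            if nxt is None:          # rule A -> t: the derivation ends
                strings.add(s)
            else:                    # rule A -> t B: keep deriving
                stack.append((s, nxt))
    return strings
```

Here L(G) is the set of strings a^n b with n >= 0, so `language(3)` returns {"b", "ab", "aab"}.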
where W_1^N denotes the random variable string w(1), ..., w(N), w_1^N denotes the partial realization w(1), ..., w(N), and the sum is taken over all such realizations.4 Since the words in a language are not likely to be independent, we use (13.12), recognizing that it reduces to (13.11) when they are independent. For an ergodic source, we can compute the entropy using a "temporal" average,

    H(w) = - lim_{N -> inf} (1/N) log P(W_1^N = w_1^N).    (13.13)

In practice, the longer the sentence (larger N) used to estimate H, the better will be the estimate; H represents the average number of bits of information inherent in a word in the language. In turn, this means that H(w) bits must be extracted by the recognizer from the acoustic data on the average in order to recognize each word.

Of course, the probabilities P(W_1^N = w_1^N) are unknown and must be estimated from training data (which can be viewed as example productions of the grammar). Let us call the estimates P^(W_1^N = w_1^N), and the resulting entropy measure H^(w),

    H^(w) = - lim_{N -> inf} (1/N) log P^(W_1^N = w_1^N).    (13.14)

It can be shown (Jelinek, 1990) that H^ > H if w is properly ergodic.

Although the entropy provides a perfectly valid measure of difficulty, ...

4. W_1^N is similar to the abusive notation we have used before to denote a random model of a partial realization.

13.2.3 Bottom-Up Versus Top-Down Parsing

To obtain a more concrete feeling for formal grammatical rules, let us consider an example. In terms of a natural language (Peircian model) we consider the sentence to be a complete utterance, and phonemes to be the basic symbols of the language. Figure 13.1 illustrates the idea of a grammar for this choice of utterance and symbol. For one example utterance, "The child cried as she left in the red plane," the set of rules by which phonemes may ultimately compose this sentence may be inferred from this figure. We see rules for how phonemes form words, how words are classified into parts of speech, how parts of speech form phrases, and how phrases form sentences. In the other direction, we see how this sentence can ultimately be decomposed into its component symbols by a series of rules. The sentence is rewritten in terms of phrases (a noun phrase [NP] and a prepositional phrase [PP]) and parts of speech (verbs [V], a conjunction [CONJ], and a pronoun [PRON]). Then the phrases are decomposed into parts of speech, the parts of speech indicators produce words, and the words are finally decomposed into phonemes.5

Although the example above involves a natural language, there is nothing preventing us from viewing it as a formal language. The terminals are taken to be phonemes and all other quantities but S are taken as nonterminals. Some of the production rules are evident in the example. For example, coming down the right side of the figure, we see the production rules

5. We should note that this is a phonemic transcription, meaning that not much of the phonetic detail is rendered. For details see (Ladefoged, 1975).
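The estimate (13.14) can be illustrated on a toy corpus. The sketch below uses hypothetical data and a unigram model, i.e., the independent-word case (13.11), to compute a finite-N, per-word entropy estimate in bits; a real language model would use joint (e.g., n-gram) probabilities.

```python
import math
from collections import Counter

def entropy_per_word(train_words, test_words):
    """-(1/N) log2 Phat(w_1^N) in bits, with Phat a unigram estimate.

    A finite-N, independent-word stand-in for (13.14). Assumes every
    test word was seen in training (otherwise its estimate is zero and
    the logarithm is undefined).
    """
    counts = Counter(train_words)
    total = len(train_words)
    log_prob = sum(math.log2(counts[w] / total) for w in test_words)
    return -log_prob / len(test_words)
```

For a training set in which "a" and "b" each occur half the time, any test sentence over {"a", "b"} yields exactly 1 bit per word, matching the intuition that H measures the average information the recognizer must extract per word.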
[Figure 13.1 appears here: a tree deriving the example sentence from S through phrases (NP, PP), parts of speech (V, CONJ, PRON, ART, PREP, ADJ, N), and words, down to phonemic symbols.]

FIGURE 13.1. Production of the utterance "The child cried as she left in the red plane" according to grammatical (syntactic and lexical) rules. The nonterminal vocabulary for this example corresponds to phrases, parts of speech, and words: NP = noun phrase, PP = prepositional phrase, V = verb, CONJ = conjunction, PRON = pronoun, ART = article, PREP = preposition, ADJ = adjective. The terminals are phonemic symbols.

where we have used uppercase to denote the nonterminal quantities and lowercase for the terminals.

We implied above that the grammatical rules (for a particular sentence) may be viewed in either the upward (compose the sentence) or downward (decompose the sentence) direction. Clearly, the information is equivalent. There is a fundamental technical difference, however, inherent in the direction in which the grammatical rules are viewed. The process of determining whether a set of production rules exists in a grammar for composing (from terminals) or decomposing (into terminals) a sentence is called parsing. A parsing algorithm that seeks rules for composition, that is, which uses the grammatical rules in the direction

    terminals -> sentence,    (13.19)

is called a bottom-up parser. On the other hand, if the rules are used for decomposition,

    sentence -> terminals,    (13.20)

the algorithm is a top-down parser.

Let us examine how parsing with a grammar can be used in the speech recognition problem. We will first look at this issue in broad terms and become more specific as the discussion proceeds. In the bottom-up case, the grammar aids in recognition by disallowing symbol combinations that are not in the language. It can also be used to assign a likelihood to legitimate symbol strings if the grammar is stochastic. The part of the recognizer that converts the acoustic data into linguistic symbols is sometimes called the acoustic decoder (AD). Suppose that the AD has processed the utterance and has hypothesized a set of phones (symbols or terminals). The linguistic decoder (LD) then goes to work applying the linguistic (in this case grammatical) constraints. Working from the bottom up, the LD assembles the hypothesized terminals into successively larger units (parsing), guided at each step by the legitimacy and likelihood of the string it is creating. Those hypotheses that are illegal or unlikely would be abandoned before reaching the end of the utterance. This idea will show up as a pruning measure later on. Of course, multiple evolving hypotheses may be considered simultaneously.

In the top-down case, the grammar again prohibits illegal symbol strings, but in this case, none is ever hypothesized. In this case the grammar serves to restrict the number of symbol combinations that must be considered by the recognizer at the acoustic level. The process begins at the top, where the LD hypothesizes a sentence in the language. The rules of the grammar are then used to deduce a possible set of phones corresponding to the hypothesized sentence. This process would also produce an a priori likelihood of the derived phones according to the statistical structure of the grammar. The complete likelihood of this phone string given the data (or vice versa) is then computed by the AD. More realistically, the LD would only use the grammar to deduce phones from the left for a given sentence hypothesis and would abandon that sentence if the phone string were turning out to be unlikely.

The main disadvantage of the bottom-up approach is that a sentence cannot be recognized unless each of its symbols is hypothesized by the AD. The bottom-up approach also does not take advantage of the linguistic constraints in decoding the acoustic signal. On the other hand, the language must be highly constrained (with respect to natural discourse) for the top-down approach to be practical, because, in principle, (at least the beginning of) every possible sentence must be hypothesized by the LD. However, there are many applications that involve constrained languages with relatively small vocabularies. Some of the tasks that have been explored are data entry, military resource management, archive document retrieval, and office dictation. Most of the research on language modeling since the early 1980s has focused on the top-down approach. Accordingly, most of the systems we will describe are based on top-down ...
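The top-down direction (13.20) can be sketched with a toy right-linear grammar (hypothetical rules, not from the text): the parser expands nonterminals from the start symbol and abandons any hypothesis whose derived terminals fail to match the input, which is exactly the early-abandonment (pruning) idea described above.

```python
# Hypothetical right-linear rules:  S -> a A | b,   A -> a A | b.
P = {
    "S": [("a", "A"), ("b", None)],
    "A": [("a", "A"), ("b", None)],
}

def top_down_parse(nonterminal, symbols):
    """True if `symbols` is derivable from `nonterminal` (top-down search)."""
    if not symbols:
        return False
    for terminal, nxt in P[nonterminal]:
        if symbols[0] != terminal:
            continue                  # hypothesis fails at the first symbol
        if nxt is None:
            if len(symbols) == 1:     # derivation and input end together
                return True
        elif top_down_parse(nxt, symbols[1:]):
            return True
    return False
```

A bottom-up parser would instead start from the observed terminals and seek rules that compose them, per (13.19); the two directions use the same rule set.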
odelin g 13 .3 I HMMs, Finite Stil1e AulOmatil (FSA) . and Regular GramrcClrS 755
ever, we introd uce the equiva len ce be tween the HM M and the "finite
sta te a uto maton ," and di scuss their relation shi p to regula r grammars.
.-.~
~ ~'~
~ _.
::- .:::
:: ~ ;; 'fJ' 0 '{.r
-L ro
cr- r:
T hese are the "observation s" as so ciated with th e t ra nsit ion s o f the consistency. it is that the term " H M M " is usuall y used to indicate the
model, and they are a na logo us to th e observations y (t ) a t the ac o ustic acoustic m od el, b ut t here are except ions to th is rul e as wel l.
lev el. Howev er. I and t are very d iffe ren t indices: ( indexes t h e QCOlIH i c The next point is criti cal. We sh ow th at a n FS A has a one-t o-one co r
fr a me in the speech d ata a n d is ordinarily regularly syn ch ro n ized with resp ondence with a regular gr amm a r. With th is knowle dg e. we wi ll b e
the o rigina l samples of th e data; I. o n th e othe r ha nd . in d exes th e word able to use the FSA as a parser fo r a reg ula r la nguage . To show this for
lim ber in a n utt e ran ce an d is on ly loosel y related to t im e. Let X be a our p re se nt exa m p le, we tr eat th e I I words for the di git s a s the ter mi
random process with random variables :rU). 1= 0 , 1. 2.. . . , wh ich --;-nodel na ls. N ote that we have spelled ou t th e d igit s in low ercase lette rs above
the state sequence through the mod el. and {J be the ra nd o m p ro cess with p reci sel y so th ey wo uld loo k like terminals in this part o f o u r d isc ussio n :
ran dom va ria bles Uil ), 1= 1, 2 , . . .. wh ic h model th e transitions on a " ze ro ," " o h, " " o n e ," . .. , " n in e ." We vi ew th e sta te designators as
path . Also let (JI ll be an identifying label of the transitio n from state J to non te rminals, with t he st a rt ing sta te . labeled S, playin g the role of the
state f . T h e st a te qu ant itie s are labeled with uppercase lette rs be ca use ,'oot o f th e gra m m a r. The producti o n ru les of the gra m mar a re as fo llows:
th ey will have an int erpretation a s nonterrninals i n a fo rmal gram m a r
below. The tra nsitions are labeled with uppercase lett e rs be cause th ey arc ( 13.2 3)
closelv related to th e states. but the transiti ons will pl ay no direct role in
S !?2. ze ro, II
the formal grammar. N ow, by direct analogy to t he develo p m ent s a t the - p, - (13.24)
5 .....:., oh, A
acoustic lev el. we ca n define the (stale) transition probabilities. ( 13.2 5)
S !2 o ne . 13
Aul.! ) d~Ip(!::!. (/)= U' IJ) = P( :l(/ ) = fl ~(I - l) =J )
- p - ( 13.2 6)
( 13.2 1) 5 --:. fi ve . B
fo r a rb it ra ry I. a nd th e observation iword or digit) proba bilities, !2 ze ro. A ( 13.2 7)
A
(t 3 .2 8)
b(1I'(!) 1UI IJ dgJ P( ~ (!) = 1I"(/)I i!(/) = VI II) (13 .22) A !2. o h ..:1
1', one, 13 ( 13. 29 )
a lso fo r a rbitrary I. Finall y. for co mplet en ess. we note th at th e state A
probability vector prior to wo rd I is t he vect or n U) wi th »i t h element A I' ,
-'0
.
SIX ,
B ( 13 .30 )
P (~(/) = In,} in wh ic h 1m, In = 1, 2, . . . . M re p rese n ts so me o rd er in g of ( 13.3 1)
R ~ one. fj
th e M states . The m os t sign ific a nt o f these ve ctor s is t he initial state
probability vector" n (0 ). Since we a lways d esignate t he state S as the ini B ~ two, C (13.3 2)
tial state, there is only on e nonzero ele me n t in n (0), th e one co rresp on d (1 3.3 3)
ing to P ( ~( O ) = S). The re ad e r sh o u ld be able to extract all of these C' ~ two , C
probabi lities for the prese nt examp le from Fig. 13.2. [t shou ld also be C ~ three. i5 ( 13. 34)
clea r that a ll of the training a n d recogn it io n tech n iq ues t hat we ca re fu lly ( 13 .35)
d eveloped for the acoust ic HM M ap p ly eq uall y well to th is mod el.
15 I' ll three. 15
In th e par la n ce of form al language th eo r y. t he H M M is kn own as a jI - 1'" - ( 13.36)
D - fo ur. £
nile state automaton (F SA ). T he wo r d stochastic o r nondeterministic ( 13.37)
E ~ fo u r. E
might al so precede this name to indica te that th e re are m u ltiple tran si
ti o ns an d obser vati o ns that can be genera ted by a n y move in th e dia E ~ fiv e. F ( 13.38)
gram. and that these are governed by proba bi listic rules [for de ta ils see. ( 13.39)
e .g., ( Fu, 1982: H o pc ro ft a nd U llma n . 1979)] . We will beg in to use the
F 1'17 zero . F
term FSA for an HMM that is being used to m od e l linguistic in fo rm a F Pl' oh, F ( 13.40)
t io n . and reserve th e te rm " H M M " fo r th e acou st ic-le vel mod el. Th e (13 .41 )
reader should appreciate that, for our purposes. the two names refe r to
F PI" fi ve , F
abstract m odels that are eq u iva le nt e ven t h ough they are us ed to m o del f• P'
....:c.ll .
SIX, C ( 13.42)
d i ffe rent phenomena . We should also point out that th e re is no st an d ard ( 13.4 3)
usage of t hese term s i n the sp eech p rocessing literature. If there is a ny
G I'l l six . C
C I'll sev en , R ( 13.44)
"For a Mealy-form HMM , it makes the most sense to index the initial sta te b ~ "t i mc~ O.
so that the first transition (observatio n) corresponds to " t i m e" I . R I2; seven . n ( 13 .4 5)
758 en. 13 / Languag e MOd elin g
13 .4 I A - Bo tt o m- Up' Parsing Ex ample 75 9
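To make the generative view of the FSA concrete, the sketch below walks an automaton from the root S, emitting one terminal per transition — exactly the top-down application of a set of production rules. The states, words, and probabilities in RULES are invented placeholders (the actual probabilities of Fig. 13.2 are not reproduced in the text); a final production Q -> w is marked with a None successor.

```python
import random

# Hypothetical digit FSA in the spirit of Fig. 13.2:
# state -> list of (probability, terminal word, next state).
# A None next state marks a final production (Q -> w) that ends the sentence.
RULES = {
    "S": [(0.4, "zero", "A"), (0.2, "oh", "A"), (0.3, "one", "B"), (0.1, "five", "B")],
    "A": [(0.3, "zero", "A"), (0.2, "oh", "A"), (0.5, "one", "B")],
    "B": [(0.6, "two", "C"), (0.4, "nine", None)],
    "C": [(0.5, "two", "C"), (0.5, "nine", None)],
}

def generate(rng):
    """Traverse the FSA from the root S, emitting one terminal per transition.
    This is the forward (top-down) application of the production rules."""
    state, words = "S", []
    while state is not None:
        r, acc = rng.random(), 0.0
        for p, word, nxt in RULES[state]:
            acc += p
            if r <= acc:          # sample a production in proportion to p
                words.append(word)
                state = nxt
                break
    return words

rng = random.Random(0)
sentence = generate(rng)
```

Because every sentence in this toy grammar terminates through a final production emitting "nine," any generated string ends with that word; the per-state probabilities must sum to one for the sampling loop to be well defined.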
    H -p24-> four, H      (13.46)
    H -p25-> eight, I     (13.47)
    I -p26-> nine, J      (13.48)
    I -p'27-> eight, I    (13.49)
    I -p''27-> eight      (13.50)
    J -p'28-> nine, J     (13.51)
    J -p''28-> nine       (13.52)

Note that each of these productions takes one of the two general forms

    Q -p-> q, R    (13.53)
or
    Q -p-> r,      (13.54)

where Q, R ∈ V_N, and q, r ∈ V_T. Note further that the probabilities associated with the production rules are related to the FSA as follows. Consider, for example, the rule

    A -p7-> one, B.    (13.55)

This rule corresponds to a jump from state A to state B in the model with the accompanying generation of observation "one." Accordingly, it is clear from the state diagram that

    p7 = (0.9)(0.8) = 0.72.    (13.56)

With the exception of the rules involving primed probabilities, the rule probabilities are related to the state diagram probabilities in a similar way. In general, the probability, p, associated with the rule

    Q -p-> q, R    (13.57)

is given in terms of the FSA probabilities as

    p = P(x(l) = R | x(l-1) = Q) P(w(l) = q | x(l) = R, x(l-1) = Q)
      = P(u(l) = U_{R|Q}) P(w(l) = q | u(l) = U_{R|Q})    (13.58)

for arbitrary l > 0. In turn, we know that this can be written

    p = a(R|Q) b(q | U_{R|Q}),    (13.59)

where Q, R ∈ V_N (equivalently, Q and R are states), and q ∈ V_T [equivalently, q is a word (digit) in the natural vocabulary].

The one small nuance occurs at the end of the production rule list, where we have, for example, both J -p'28-> nine, J and J -p''28-> nine. This is simply so that the grammar may generate a final terminal without generating a further nonterminal. The rules (13.49)-(13.52) are therefore necessary for formal reasons. In fact, we could accomplish the same thing in the state diagram by including a phantom final state to which states H, I, and J could make final transitions without generating a new state name. Frequently, however, we do not account for the fact that a state transition is a final one in training and using an HMM or FSA. In the present state diagram, for example, the final word is generated by making a final transition into either state I or J, then simply remaining there. We can compensate for this little discrepancy between the FSA and the grammar by allowing the combined production rules

    I -p27-> eight, I, or eight    (13.60)
    J -p28-> nine, J, or nine      (13.61)

where p_k = p'_k + p''_k for k = 27, 28. With this provision, the state diagram and the grammar correspond precisely.

This example illustrates the equivalence between an FSA or HMM and a regular grammar. Note the following:

1. Traversing a complete path in the FSA (and accumulating probabilities along the path) and generating the corresponding sentence corresponds exactly to the forward or top-down application of a set of production rules to generate the sentence.

2. Given a sentence, exploring the FSA to determine whether a path (or set of productions) exists to generate that sentence corresponds exactly to the bottom-up application of the production rules to determine whether the sentence conforms to the grammar.

Although the FSA offers a simple and highly structured manner to carry out the parsing in either direction, whether it is more convenient to formally view the grammatical structure as a set of production rules, or as an FSA, depends on the nature of the approach taken. The FSA is useful when the linguistic decoding is structured as a dynamic programming search. This will always be the case in the top-down approaches we study that involve regular grammars. On the other hand, if the FSA is just being used as "a graph of the production rules," then the grammatical structure might just as well be thought of in terms of those rules. In the discussions to follow, it will be useful to view the linguistic constraints primarily in terms of the FSA.

13.4 A "Bottom-Up" Parsing Example

In this section, we take a brief look at an example of a bottom-up parsing-based CSR system. We do so by studying a relatively simple example system consisting of the addition of a grammar (linguistic constraints) to the LB approach for digit recognition. The methods described
here apply equally well to either DTW- or HMM-based LB, but we will focus on the latter. To put this example in the perspective of the introductory comments of the chapter, note that the LB approach is a "connected-speech" method that does not use grammatical rules below the word level. This is because the word can be used as a fundamental unit of speech, since such systems involve small vocabularies. In this sense, the following example involves a simpler system than might ordinarily be imbued with a language model, but its simplicity will allow us to focus on the principles rather than the details. After mastering these basic concepts, we will build onto the system by adding grammar at the subword level.

As we proceed through this example, pay careful attention to how the grammar is used. It will be seen that the production rules of the grammar will be used from the "bottom up." This will imply that the LB sector of the system (which comprises the AD) is responsible for a very important task common to all language-based systems, especially CSR systems. This is what we might call the "temporal recordkeeping" function. Because the time boundaries between speech units (in this case words) are unknown, many different regions of the speech must be tried as candidate regions for the first word, second word, and so on. Further, any legitimate solution must be composed of words whose corresponding temporal regions are adjacent and nonoverlapping. (Actually, some systems have provisions for resolving "overlapping" hypotheses. We will briefly describe one such system at the end of this section.) As we discuss the example below, the reader is also encouraged to notice how the LB algorithm performs the recordkeeping function. (Actually, we are already familiar with how LB does this, but it is useful to review our understanding in this new situation.) Toward the end of this section, we will discuss how the LD could be made responsible for the temporal recordkeeping in a similar system.

We assume the language model for the digits that was developed in Section 13.3 and illustrated as an FSA in Fig. 13.2. Our objective is to examine now how we can use this linguistic knowledge in assisting in the recognition process in the LB problem. Let us first move down to the acoustic-level processing, and suppose that we have found N candidate word strings, say w^(1), ..., w^(N), in order of best to worst cost, as a consequence of applying the LB search. (Recall that multiple-sentence hypotheses can be generated by keeping track of the second-, third-, ..., Nth-best paths through the LB "grid" -- see Section 11.4.2.) Let the corresponding cost measures be denoted D^(1)_min(T, J_L), ..., D^(N)_min(T, J_L), where J_L is the indicator for a terminal state in the final HMM. (We have assumed all the strings to be L words long, but this need not be the case.) We now intend to subject these strings to "linguistic scrutiny" using the information in the LD's FSA. Before doing so, however, a practical point should be noted.

In Section 11.4.2 we noted the increased memory and computation necessary to seek out multiple-sentence hypotheses using the LB algorithm. In Section 13.2.3 we discussed the possibility of the AD in a bottom-up scheme hypothesizing words (or other units) from the left and receiving "guidance" from the LD as to whether continuation of that string is advisable. Indeed, such an approach can be taken with the LB algorithm. A "linguistic cost" can be integrated with the "acoustic cost" on a "per level" basis in order to find a single-sentence hypothesis consistent with both bodies of information. Although simple in concept, the formal description of such an algorithm is complicated, so this case does not lend itself well to the learning process. Therefore we examine the simpler case in which complete-sentence hypotheses are submitted by the AD to the LD. In Problem 13.3 we return to the enhancement.

Each of the word strings submitted by the AD to the LD is now "recognized" by the LD as though it were an "observation" sequence submitted to an "HMM" for a likelihood score. It should be clear that the decoding of the LD FSA with respect to the observations may proceed according to any of the recognition methods for discrete-observation HMMs described in Chapter 12 (with minor modifications for the Mealy-form FSA). The result for string w^(k) is either P(w^(k) | G) or P(w^(k), s* | G), depending on whether an "any path" or Viterbi approach is used (we assume the former in this discussion). G (to denote grammar) is the name for the FSA model in the LD, and, as usual, s* means the best path through the model.

The proposed method for combining the AD and LD costs is very simple. The cost of the kth best LB hypothesis at the acoustic level may be written

    D^(k)_min(T, J_L) = -log P(y, s^(k) | m^(k)),    (13.62)

where s^(k) represents the globally (over all levels) optimal path through the set of L HMMs m^(k), and J_L is the label of a final state in the final model. Since the HMM string has a unique correspondence to the word string, let us write

    D^(k)_min(T, J_L) = -log P(y, s^(k) | w^(k)).    (13.63)

Now we alter the acoustic cost for string w^(k) by simply adding to it the linguistic cost, say -log P(w^(k) | G), to obtain

    C^(k)_min ≝ -log P(y, s^(k) | w^(k)) P(w^(k) | G).    (13.64)

(We add because of the logs. In effect, we are multiplying probabilities.) The globally optimal word string, say w*, is taken to be the one of minimum acoustic plus linguistic cost,

    w* = w^(k*),    (13.65)
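The LD's treatment of a submitted word string as an "observation" sequence can be sketched as a forward ("any path") pass over the grammar FSA. The toy grammar below is hypothetical (it is not Fig. 13.2, whose probabilities are not reproduced in the text); final productions Q -> w are marked by a None successor:

```python
# Forward ("any path") scoring of a word string by the LD's FSA, viewed as
# a discrete-observation Mealy-form HMM.  RULES maps a state to
# (probability, word, next state) productions; the numbers are invented.
RULES = {
    "S": [(0.6, "zero", "A"), (0.4, "oh", "A")],
    "A": [(0.5, "zero", "A"), (0.5, "one", "B")],
    "B": [(1.0, "nine", None)],   # final production ends the sentence
}

def string_probability(words):
    """P(w | G): sum the path probabilities over all derivations of `words`."""
    alpha = {"S": 1.0}            # state -> accumulated path probability
    done = 0.0
    for i, w in enumerate(words):
        nxt = {}
        for state, mass in alpha.items():
            for p, word, ns in RULES[state]:
                if word != w:
                    continue
                if ns is None:    # final production: only legal on last word
                    if i == len(words) - 1:
                        done += mass * p
                else:
                    nxt[ns] = nxt.get(ns, 0.0) + mass * p
        alpha = nxt
    return done

prob = string_probability(["zero", "zero", "one", "nine"])
```

A string outside the language simply accumulates no probability mass, so the same routine doubles as a membership test for the grammar.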
where

    k* = argmin_k C^(k)_min = argmin_k {-log P(y, s^(k) | w^(k)) P(w^(k) | G)}.    (13.66)

Clearly, the grammatical information has the opportunity to influence the outcome of the final string decision. Consider, for example, a case in which the best acoustic choice, w^(1), is a very low probability sentence in the language. With reference to the FSA in Fig. 13.2, for example, we see that the outcome

    w^(1) = zero, one, (two)^5, three, four, five, six, seven, eight, nine,    (13.67)

where (two)^5 means a string of five "two's," is much less likely than, say,

    w^(2) = zero, one, two, three, four, (five)^5, six, seven, eight, nine.    (13.68)

In this case, even though D^(1)_min(T, J_L) might be quite small, C^(1)_min might be quite large relative to C^(2)_min.

Let us examine more specifically what the linguistic information is adding to the decision. Ideally, under any strategy we would like to find the word string, w*, with maximum likelihood in light of the observations; in other words, we want

    w* = argmax_w P(w | y).    (13.69)

(We are ignoring the "best path" dependency of the acoustic decoding, but its inclusion would not change the basic argument here. Of course, we are also only able to select arguments from the strings w^(1), ..., w^(N), which are provided to the LD.) Using only acoustic information with no grammar, we must settle for

    ŵ = argmax_w P(y | w).    (13.70)

It is clear from these developments that the optimal word string to arise from this method is

    w* = argmax_w P(y | w) P(w | G).    (13.71)

A moment's thought will reveal that G is simply an alternative way to formalize the unique random process that generates word strings. Therefore,

    P(w | G) = P(w = w) ≝ P(w),    (13.72)

where we employed the abusive notation w = w to mean that the random process w (or G) has produced the partial realization w. Putting (13.72) into (13.71), we have

    w* = argmax_w P(y | w) P(w) = argmax_w P(y, w).    (13.73)

Apparently, we maximize the joint probability of w and y with the selection of the word string above. Although this seems intuitively reasonable, we can show that this choice is even better than is apparent. Consider the fact that P(y) is a fixed number that does not depend on the choice of word string. Accordingly, we choose w* to maximize P(y, w)/P(y) and achieve the same answer as in (13.73). However,

    w* = argmax_w P(y, w)/P(y) = argmax_w P(w | y),    (13.74)

so we have apparently achieved the ideal result of (13.69) with the method suggested above. Finally, note from (13.73) that when every word string is equally likely [P(w) does not depend on w], then the acoustic decision alone is the ideal one. This is entirely expected, since equally likely word strings essentially imply "no grammar." From this formality, therefore, we conclude the obvious, that the language information accounts for the unequal likelihoods of word strings in making the final decision.

As we noted at the outset of this discussion, the LB algorithm, which plays the role of the AD in the above system, is principally responsible for keeping track of timing information. We could, of course, shift this burden to the LD section of the system by recording at the states of the FSA information necessary to piece together hypotheses coming up from the acoustic processor. In this case, the AD is essentially reduced to a hypothesizer that attempts to match HMMs or DTW reference templates to various regions of the speech. For example, the AD might begin by attempting to match a series of candidate first-word HMMs, m_1, ..., m_11, to a range of possible time intervals, say frames [1, t_1], [1, t_1 + 1], ..., [1, t_1 + r]. Any word with an HMM of sufficient likelihood would be hypothesized to the LD (along with the timing information). Hypotheses for the second word would be found by testing a second (perhaps the same) set of models on a different set of intervals. In general, the second set of time intervals will extensively overlap with those of the first set (remember that the time boundaries between the words are not known), and the LD is responsible for piecing together an appropriately timed set of hypotheses. Provisions are often made for some overlap at the boundaries of the proposed words. The process carried out by the AD here is what we have previously termed word spotting (Section 11.4.4), because it attempts to spot regions corresponding to words in the string of observations.

In essence, the developments above provide the formal details for the GDCWR system described in Section 11.4.4 (McMahan and Price, 1986; Pawate et al., 1987; Picone, 1990). The GDCWR uses an FSA to bottom-up parse words that are hypothesized by a DTW algorithm at the acoustic level. The DTW algorithm is used to implement word spotting
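A minimal sketch of the combined-cost selection of (13.64)-(13.66): rescore an N-best list by adding the linguistic cost -log P(w|G) to each acoustic cost and take the minimum. The acoustic costs and language-model probabilities below are invented numbers standing in for the outputs of the AD and the LD:

```python
import math

# Hypothetical N-best list: (word string, acoustic cost D_min, P(w|G)).
hypotheses = [
    ("zero one two two two two two three", 41.2, 1e-9),   # acoustically best
    ("zero one two three four five",       41.9, 2e-4),   # linguistically likely
    ("oh one two three four five",         43.0, 1e-4),
]

def combined_cost(d_acoustic, p_language):
    # C = D - log P(w|G): adding log costs multiplies the probabilities,
    # as in (13.64).
    return d_acoustic - math.log(p_language)

# (13.65)-(13.66): the globally optimal string minimizes the combined cost.
best = min(hypotheses, key=lambda h: combined_cost(h[1], h[2]))
```

Here the acoustically best hypothesis is overruled: its tiny language probability contributes a large -log P(w|G) penalty, so the second string wins overall.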
as described above. This system highlights the point that other strategies beside HMMs may be used by the AD to hypothesize words.

Before leaving this example, there are some additional points to be made about the linguistic decoding problem. First, although words were used as the basic unit of speech in this example, similar principles could be used to add grammatical constraints to systems using other terminals, such as phonemes, although this is not common with bottom-up systems, which represent an older technology. Second, we have said very little about the use of "heuristics" in the decoding problem. It might have occurred to the reader that it would be senseless, for example, to try the word "seven" in the second level of the LB algorithm, because there is no possibility of such an occurrence in the language. Of course, algorithms are written to take advantage of such entropy-decreasing information. In fact, for more complex grammars, many heuristic techniques are employed to improve the performance of the parser (Fu, 1982, Ch. 5). A final point, implied by the last comment, is that other grammars may be used in the LD. We will have more to say about this issue in our discussion of top-down parsers, to which we now turn.

13.5 Principles of "Top-Down" Recognizers

13.5.1 Focus on the Linguistic Decoder

Many research groups, particularly beginning in the early 1980s, have ... directly with the AD. The linguistic processor is also responsible for the temporal recordkeeping function. We point out that this latter task can be handled in many ways algorithmically, and it is the objective here to indicate one general strategy which will admit understanding of the basic notions.

One of the main advantages of the presence of a regular grammar in a problem is that the search for a solution can be carried out using a DP approach. This is so because, as we have discovered previously, the proposition of a sequence of production rules from a regular grammar corresponds to the navigation through a path in the corresponding FSA. The production rules that best explain the observations correspond to the optimal path through the graph in a certain sense. We will illustrate this using our example, but first it is useful to develop some formalities.

The notation employed in the following discussion is that developed in Section 13.3. In anticipation of the use of DP, we define the following costs. Consider a transition in the FSA corresponding to a production Q -p-> w_k, R, where w_k represents one of the words (digits) in the natural vocabulary. Suppose this production is producing the lth word in the string. In terms of the FSA, this production involves both the transition u(l) = U_{R|Q} and the generation of the observation w(l) = w_k. Recall that probability p accounts for both the state transition and the word generation. In terms of the various quantities we have defined,

    p = p(Q -> w_k, R)
      = P(x(l) = R | x(l-1) = Q) P(w(l) = w_k | x(l) = R, x(l-1) = Q).

For this lower-level search we need to introduce another cost. To any attempt to associate the transition Q -p-> w_k, R with the observations y_{t'+1}^t, we assign the cost

    d_A[(t, R) | (t', Q), w_k] ≝ -log P(y_{t'+1}^t | w_k) - log p.    (13.80)

(Two points are made here. First, in previous work we have assigned a Type N cost to an event at a node. Here the cost is associated with the transition because the word is generated during the transition. Second, this definition is essentially the analog of (12.61) used in the search of the HMM in Chapter 12. Accordingly, it might seem to make sense to define d_A with the notation

    d_A[(l, R) | (l-1, Q)] ≝ [-log a(R|Q)] + [-log b(w(l) | U_{R|Q})].

However, there are two subtle reasons for not doing so. The first reason is that the FSA "time" variable l is superfluous in this development, since we will keep track of word ordering by synchronizing the processing to the acoustic-level frames. The other reason is that, unlike the HMM search where the acoustic observations y(t) are known, here the word "observations" w(l) are unknown and must be hypothesized by the FSA. This requires that we speak of the word in a more particular form, w_k.)

Now we wish to attach some "w_k" to the word string already present at Q by making a transition to R and generating w_k in the process. Note, however, that there will, in general, be many competing transitions coming into R vying for the honor of completing the best word string to state R. There may be multiple strings coming over the same transition in the FSA because of the different observations that can be generated on that transition. There may even be multiple strings coming over the same transition with the same observation generated because of the different time intervals involved. The minimization over all of these competing extensions is expressed as

    D_min(t, R) = min over {(t', Q), w_k : Q -p-> w_k, R and t' < t} of
                  {D_min(t', Q) + d_A[(t, R) | (t', Q), w_k]}.    (13.82)

It appears that we will be able to implement the linguistic constraints using a Viterbi DP approach to searching the FSA. Before returning to the example to see how this is done, let us take care of one more important detail.

In order to keep track of the best word (terminal) string at (t, R), let us define

    Ψ(t, R) ≝ minimum-cost word string to state R associated with observations y_1^t.    (13.83)

(This is an alternative to backtracking for finding the globally optimal word sequence at the end of the search.) If the optimal path extension into state R at t was over the transition Q' -p-> w_k, R and over the time interval [t' + 1, t], then

    Ψ(t, R) = Ψ(t', Q') ⊕ w_k,    (13.84)

where ⊕ means concatenation.

We now illustrate this Viterbi search using the digit recognition example. Following the conversion of an acoustic utterance into an observa-
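A compact sketch of the search defined by (13.80)-(13.84), under simplifying assumptions: a two-rule toy grammar with an explicit final state "F", a fixed minimum word duration, and a toy acoustic_logp() standing in for the word-HMM log likelihoods log P(y_{t'+1}^t | w_k) that the AD would report:

```python
import math

# Invented grammar: state -> [(rule probability, word, next state)].
RULES = {"S": [(0.7, "one", "B"), (0.3, "five", "B")],
         "B": [(1.0, "nine", "F")]}          # "F" plays the final state R_f

def acoustic_logp(word, t0, t1):
    # Toy scorer: "one" fits frames 1..3 well and "nine" fits frames 4..6.
    good = {("one", 0, 3), ("nine", 3, 6)}
    return -1.0 if (word, t0, t1) in good else -8.0

def search(T, min_dur=3):
    D = {(0, "S"): 0.0}                      # D_min(t', Q), cf. (13.82)
    Psi = {(0, "S"): []}                     # best word string, cf. (13.83)
    for t in range(min_dur, T + 1):
        # snapshot so entries added at this t are not re-expanded
        for (t0, Q), base in [kv for kv in D.items() if kv[0][0] <= t - min_dur]:
            for p, word, R in RULES.get(Q, []):
                # d_A[(t,R)|(t0,Q),w_k] = -log P(y_{t0+1}^t | w_k) - log p
                cost = base - acoustic_logp(word, t0, t) - math.log(p)
                if cost < D.get((t, R), math.inf):
                    D[(t, R)] = cost
                    Psi[(t, R)] = Psi[(t0, Q)] + [word]   # (13.84)
    return Psi.get((T, "F"))

best_string = search(6)
```

Keeping the word string in Psi alongside each cost is the bookkeeping alternative to backtracking mentioned in the text; a practical system would add beam pruning on D before extending hypotheses.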
tion string, y = y_1^T, the LD initiates the process as follows. For some assumed minimum time at which the first word may end in the observation string, say t_1, the LD hypothesizes the presence of all words, say w_k, indicated by production rules of form

    S -p-> w_k, R    (13.85)

to be in the time interval [1, t_1]. This means the FSA is searched for all initial transitions (from S) and the corresponding words are hypothesized to be represented in the observations in the interval [1, t_1]. In the present case words "zero," "oh," "one," and "five" would be hypothesized, corresponding to the productions (see Fig. 13.2)

    S -p1-> zero, A    (13.86)
    S -p2-> oh, A      (13.87)
    S -p3-> one, B     (13.88)
    S -p4-> five, B.   (13.89)

The request goes down to the AD for a likelihood measure for these words for the observation string y_1^{t_1}. The likelihoods P(y_1^{t_1} | zero), P(y_1^{t_1} | oh), P(y_1^{t_1} | one), and P(y_1^{t_1} | five) are reported back. The best costs and words to states A and B are computed according to (13.82) and (13.84). For example,

    D_min(t_1, A) = min over {(0, S) -w_k-> A} {D_min(0, S) - log P(y_1^{t_1} | w_k) - log p}
                  = min {-log P(y_1^{t_1} | zero) - log p_1, -log P(y_1^{t_1} | oh) - log p_2},    (13.90)

where D_min(0, S) ≝ 0. If either of the minimum costs to states A or B is excessive, this means that the corresponding (best) transition into that state does not explain the observations sufficiently well and that path is not started. This amounts to "pruning" the path before it is even originated (an unlikely practical event). If a path is completed to state A, for example, then the state is flagged as having an active sentence hypothesis at time t_1. For simplicity, we will say that "(t_1, A) is active." Similarly, (t_1, B) is active if a successful path is initiated there. By definition we say that state S has an active sentence hypothesis at time 0; that is, (0, S) is always active. For activated states the word string (in this case, just one word) is recorded according to (13.83). By definition,

    Ψ(0, S) ≝ ∅, the null string in the language.    (13.91)

Continuing on to time t_1 + 1, t_1 + 2, ... is simple. For each state R, and for successively larger t's, all transitions of form Q -p-> w_k, R and all active hypotheses (t', Q) for t' < t compete for extension of their paths to R. The extension of lowest cost according to (13.82) completes the optimal path to R at time t. For example, let us assume that at the first frame t_1, only state A becomes activated in our example FSA. Then at frame t = t_1 + 1 this would happen in our example: We see that only states A and B have predecessor states which have active hypotheses. The paths that could be extended to A are the active hypotheses at (t_1, A) and (0, S). The active hypotheses that could be extended to B are at (t_1, A) and (0, S). The operative transitions are associated with rules

    S -p1-> zero, A
    S -p2-> oh, A
    S -p3-> one, B
    S -p4-> five, B
    A -p5-> zero, A
    A -p6-> oh, A
    A -p7-> one, B
    A -p8-> six, B.    (13.92)

Note that it is extraordinarily unlikely that either of the last two rules would be used, since this would correspond to the generation of a word using only one acoustic observation.

After all states have been examined at time t, only if D_min(t, R) is sufficiently small does R become activated at t. The best word string to R at t is recorded as in (13.84). The solution is ultimately found in Ψ(T, R_f), where T is the length of the observation string and

    R_f ≝ argmin_{R_f} D_min(T, R_f),    (13.93)

where R_f is any permissible final state in the FSA.

As would be expected, relatively few hypotheses remain active at any given t. This is because relatively few paths involve the appropriate linguistic constraints, HMMs, and time warping to match the incoming acoustic observations. Most of the paths will be pruned because they will be so unlikely that their extensions are not warranted. We saw a similar pruning process take place at the acoustic level to control the number of paths through the acoustic-level HMMs. Similarly to that pruning process, the present procedure is also referred to as a beam search (Lowerre and Reddy, 1980), since only paths that remain inside a certain acceptable beam of likelihoods are retained. Those that fall outside the beam are pruned. A simple beam, for example, would consist of all paths (hypotheses) at time t whose cost fell within, say, δ(t) of the best hypothesis. If R̂ is the active state associated with the best hypothesis at t, then any other state R will only become active at t if

    D_min(t, R) ≤ D_min(t, R̂) + δ(t).    (13.94)
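The beam criterion (13.94) amounts to a one-line filter over the states scored at time t; the costs and the beam width δ below are invented:

```python
# Beam pruning per (13.94): after all states are scored at time t, a state
# stays active only if its cost lies within delta of the best hypothesis.
def prune(costs, delta):
    """costs: dict state -> D_min(t, state).  Return the surviving states."""
    best = min(costs.values())
    return {s for s, c in costs.items() if c <= best + delta}

active = prune({"A": 10.2, "B": 11.0, "C": 19.5}, delta=3.0)
```

Widening δ trades computation for a smaller chance of pruning away the path that would ultimately have been globally optimal.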
13.5.2 Focus on the Acoustic Decoder

Thus far we have not paid any attention to the process taking place in the AD. At this level we have (for the digit problem) 11 or more HMMs representing the words in the vocabulary. The AD is receiving requests for likelihoods of the form P(y_{t'}^t | w_k) for 1 ≤ t' < t ≤ T. Recall that P(y_{t'}^t | w_k) means P(y_{t'}^t | m_k), where m_k is the HMM representing word (terminal) w_k. [In fact, if Viterbi decoding is used in the HMM search, the likelihood reported back will be P(y_{t'}^t, s* | m_k), where s* is the best state sequence through m_k.] It is important to understand that the HMMs can operate in synchrony with the LD in certain ways.

First, we need to recognize that an entirely parallel search procedure is taking place in each of the HMMs to that occurring in the FSA at the linguistic level. Let us concern ourselves with all requests for P(y_{t'}^t | w_k) for a fixed time t'. Assuming that t - t' is at least as large as the number of states in m_k, then each state in m_k has the potential to be active with a viable best path to it at time t. (Ordinarily, for a Bakis model all states would remain active after a sufficient number of observations.) This is nothing new. It is just a different way to view the Viterbi decoding process we discussed in Section 12.2.2. At each new observation frame, all states to which transitions can be made from active states are updated in a similar manner to that which occurs in the FSA.

[FIGURE 13.4. Word HMM decomposed into a set of appropriately connected phone models.]

As we have discussed previously, in larger-vocabulary systems, say 1000-100,000 words [see, e.g., the systems described in Section 13.9 and (Levinson et al., 1988; Deng et al., 1988; Dumouchel et al., 1988)], the collection of data for, and training of, individual word models is prohibitive. In this case, subword units such as phones, diphones, phonemes, or syllables must be employed. This will mean that the AD will work, for example, at the phone level, and the LD will be charged with hypothesizing sentences in terms of phones.

Let us return for a moment to the idea that a two-level system can be compiled into one big FSA by inserting word HMMs into the transitions of the sentence FSA as in Fig. 13.3. In this case, the compiled FSA obviates the need for explicit communication between levels. In fact, in our current example, if we insert the digit HMMs directly into the transitions that produce the corresponding digit in the LD (see Fig. 13.3), it is not difficult to discover how to search this FSA to carry out precisely the same search as was described above. We explore this issue in Problem 13.4.

It can be appreciated from this simple example that one of the major concerns in an algorithm incorporating a language model is the recordkeeping function. Although we have discussed the basic principles of this task above, we have not thought very deeply about practical implementation. A good example of a system that processes continuous speech according to the principles above is found in the papers by Ney et al. (1987, 1992), which also describe the details of recordkeeping. These methods were described in Section 11.4.3. In essence, the Ney papers describe an extension of the one-stage DTW algorithm discussed in Section 11.4.3 to include syntactic information. The recordkeeping is accomplished through a series of list processing operations. This paper represents one of the first reports of the integration of stochastic language models into the CSR problem. A more recent tutorial is found in the paper by Lee and Rabiner (1989).

When a request for P(y_{t'}^t | w_k), where w_k is a word, goes down to the word level from the sentence level, the word level must locate an FSA representing word w_k whose first acoustic observation input is y(t'), and report back the likelihood of the best path reaching one of its legal final states at time t. In turn, the discovery of this path will be the result of searching this word FSA in the same manner in which we searched the sentence-level FSA in the two-level case. At each frame time, t', t' + 1, ..., t, we will have attempted to extend each active hypothesis using Viterbi decoding. Each of these attempts will have involved requests for likelihoods of phones over
FSA represents a larger regular grammar with the acoustic observations
as terminals . Conversely, we know that this large FSA can be decom certain fra me ranges. PCr;::'\
z;J. where ::; is a phone. Such a request will
posed into a regular grammar with words as te r m in a ls and HMMs at req uire locat ing a phone HMM for z) that has considered the observa
the word level representing another regular gra mm ar with acoustic ob tions y;::' a n d repo rting back t he likel ihood of its best path to a final
servations as terminals. Now suppose each word were represente d by se state. While the reco rd keepin g is m o re complicated with more levels, the
ries of appropriately connected phone H.MM models. as in Fig. 13.4. basic pr inciples remain the same . A gen e ra l algorithm for performing the
This representation of a word a m o u nt s to a "large" regular grammar mu lti level search is sho wn in Fig. 13.6.
with acoustic observations as t er m inals. By analogy, this word represen Th e d iscussions above have employe d a V it e rb i a lgo r it hm in which,
tation could be decomposed into a regular grammar with phones as ter i n principle . a ll st a t es were co nsid e red to se e whether extensions could
minals, and HMMs at the phone level. This process cr eat es a three-level be m ade fr om active hypothese s. Th is is th e way in which we are ac
system in which the phone HMM m odels now co m pris e the AD. and a cu sto med to u si ng the V ite rb i a lgo r it hm - m o v ing forward one step by
new layer is added to the LD. The result for our d igit recognitio n sys co nne cti n g p aths fro m be h in d . Fo r a large language model in which
tem is illustrated in Fig. 13 .5. Of course. further le vels coul d be added re la t ively few st ates are likely to be act ive at any time. it is sometimes
by. for example, decomposing wo rds int o sy lla bles a n d then syllables mo re effic ien t to it erat e o ver t h e act ive hyp ot heses at time I rather
into phones. However, most CSR systems op erat e with two le vels using t han o ve r all th e st ates. In this case th e hypot he ses are pushed forward
words at the ac oustic le vel , or three levels with p hone model s at the o ne step t o all poss ibl e next st at es , an d then the Viterbi algorithm is
acoust ic level. pe r fo rm e d on all next states rec eiving m ore than one extended path.
The method for finding a m a xim u m like lihood solut ion in a three In this ca se th e al go rithm abo ve is slight ly modified, as shown in Fig.
level system is a natural extensio n o f the me thod s use d with two leve ls.
13.7.
Again each operation is frame-synchro nous across levels. When a re q uest
774 en . 13 / Language MOd el ing
13 .5 I Principles o f -To p -D ow n" Reco gn izers 77 5
LD
FIGURE 13.6. State-based Viterbi decoding of a CSR system based on a
Se ntence hyp othesize;
regular gramma r.
Req uest for Phonc T'erm inalion: Sel ect hypot hesis in sen tence-level FSA asso ciated with highest
phone li keli hoods
likel ihood,
likeliho od path to a fin al state.
These differences emerge when we try to follow up on the suggestion made above that we simply consider the CSR system as one big HMM. Suppose we attempt to place all of the word or phone (or other unit) HMMs in their proper locations in the big network and then train them with example utterances (sentences). The first assumption inherent in this suggestion is that we know the production rules of the grammar! Indeed we might not, and deducing the grammar "manually" from the training data could be an overwhelming task. The first problem, then, is learning the grammar of the language. Then we might recognize another major difference between this big HMM and one that would be used in the isolated-word case. In the isolated-word case, the entire model represents a class (word) within the set of classes (vocabulary). In the CSR case, a path represents a class (sentence) within the universal set of classes (language).¹⁸ This means that we must find a way to train only one path at a time, which will turn out to be a simple problem. Finally, and perhaps most significantly, in the isolated-word case it is known exactly which observation frames are associated with the HMM, whereas in the CSR case we are faced with a string of observations whose temporal associations with the individual HMMs in the AD are generally unknown.

In order to solve these problems, it is useful to once again decouple the models of the AD from those in the LD. Having done so, let us consider the problem of learning the production rules of the grammar. The technique is very simple. To train the sentence-level FSA, which has words as terminals, we enter orthographic transcriptions of sentences (sentences written out as words) as training "observation" sequences. A version of the F-B algorithm for the Mealy rather than the Moore version of an FSA (Bahl et al., 1983) is then used to estimate the transition probabilities of the FSA. Alternatively, a Viterbi-like algorithm can be used to estimate the transition probabilities (Fu, 1982, Sec. 6.6). If we were working with Moore forms of the FSA, each of these would be identical to the corresponding approach taken to training isolated-word HMMs. In either case, the entire database of training sequences should be entered before updating the FSA parameters. Note that this procedure is tantamount to "estimating" the production rules of the grammar and their associated probabilities.

As with a "standard" HMM, the structure of the FSA must be prescribed before the probabilities can be assigned in the above procedure. In this case this simply means choosing the number of states (nonterminals) and transitions (terminals) to be included. Initially, each transition should be allowed to generate every observation with nonzero probability, just as each state in a Moore HMM may generate each observation. The F-B algorithm operates in the same fashion, moving from (arbitrary) initial transition probabilities to those which represent a local maximum likelihood with respect to the observations.

If there is a second level of the LD, orthographic transcriptions of words in terms of phones, if they are available, can be used to estimate parameters for each of the word FSAs. If these transcriptions are not available, the word FSAs can be estimated along with the acoustic HMMs, as we discuss below.

This is a good point to reemphasize that we are dealing strictly with an LD based on a regular grammar. There are other grammars and linguistic structures that can be used, and each has its own training procedure. These will be discussed below.

Having trained the LD, let us consider the training of the HMMs in the AD. We now return to the problem that a given speech utterance must be used to train only one path of the global FSA. This problem is very easy to solve. For a given training utterance, we temporarily create a small system that recognizes only the corresponding sentence. This is done, in principle, by setting to zero the probabilities of all production rules (transition probabilities) that do not lead to the sentence in its finest known decomposition. By "finest known decomposition" we mean, for example, that if we are dealing with a three-level system and our training sentences are transcribed to the phone level, then presumably the LD grammar is trained all the way through the word-level FSA. In this case we freeze all the production probabilities leading down to the appropriate set of phone models (in the AD) and (in principle) set the rest of the probabilities in the LD to zero. This procedure, in effect, has selected one path through the global FSA. If the sentence is transcribed only in terms of words, and we are training a three-level recognizer, then the probabilities in the sentence-level FSA leading to the appropriate words are frozen, and the rest are set to zero. Assuming that the correct production rules to decompose these words into alternative sets of phones are known,¹⁹ even though the probabilities are not, then the set of acoustic observations can be used to train both the phone models and the word models. In this case we have just constrained the global model to a small set of paths that could conceivably produce the sentence.

We now intend to use an F-B or Viterbi approach to estimate the unknown probabilities on the selected path of the global FSA. We should point out that neither of these procedures will be affected by the fact that some of the probabilities on the path are fixed. For the Viterbi case this is self-evident, and for the F-B case this result is proven by Baum (1972). Now we must face the problem of not knowing where the temporal boundaries are in the acoustic observation string. If these boundaries were known, we could simply use the data segments to train the individual phone models. However, the creation of sufficiently large databases marked according to phonetic time boundaries is generally impractical. One of the most remarkable properties of the HMM comes to our aid in this situation. Researchers have discovered that these HMMs can be trained in context as long as reasonable "seed" models are used to initiate the estimation procedure. This means that the entire observation sequence for a sentence can be presented to the appropriate string of HMMs and the models will tend to "soak up" the part of the observation sequence corresponding to their words or phones, for example. This ability of the HMM has revolutionized the CSR field, since it obviates the time-consuming procedure of temporally marking a database.

¹⁸This means that either the F-B approach or the Viterbi approach can be used to decode an isolated-word HMM, but only the Viterbi method can be used for the CSR HMM.
¹⁹This just means that all of the phonetic transcriptions of each word are available, even though we do not know which transcription is appropriate for the present sentence. Formally speaking, it means that we know the characteristic grammar for each word even though we do not know the production probabilities.

However, good initial models must be present in the system before the training begins. This might require some "manual" work to "hand excise" words or phones from some speech samples.²⁰ Frequently, however, seed models consist of previously trained word or phone models, or are derived from an available marked database²¹ [see, e.g., (Lee et al., 1990)]. The seed models need not be excellent representations of the application database, but they must be sufficiently good that they "attract" the proper portions of the observation sequences in the training data. Otherwise, of course, the acoustic-level HMMs will not represent the intended speech unit and recognition performance will be degraded.

As Picone (1990) discusses, seed models are often generated by an iterative process in which a crude model is successively refined into an acceptably good one. He cites, for example, the training of a continuous-observation mixture Gaussian density model in which a hand-excised phone is modeled by a five-state model, one state for each analysis frame in the phone (see Fig. 13.8). Rather than begin with an accurate set of covariance matrices for each state, the procedure begins with a single covariance matrix for all states, and then is iteratively refined.

FIGURE 13.8. Training of a continuous-observation mixture Gaussian density model. A hand-excised phone (training token) is modeled by a five-state model, one state for each analysis frame in the phone. In this case, the analysis frames consist of nonoverlapping portions of the utterance chosen by an expert phonetician by examining the spectrographic trace. (The spectrograph gives a frequency versus time record in which the energy in a frequency band is indicated by the intensity of the trace in the corresponding frequency region.) After Picone (1990).

One point should be emphasized before leaving the issue of training. After the seed models are installed and the training process begins, parameters of the system are not changed until the entire training database has been entered. This is true whether the F-B or Viterbi approach is used. With a little thought about the process, the reason will become evident. The situation is analogous to the use of multiple training sequences in a single HMM.

Finally, we note that in the late 1980s and early 1990s, researchers began to employ a second phase of training based upon the discrimination techniques introduced in Section 12.2.7 for IWR (Ephraim et al., 1989; Bahl et al., 1988). The application of these techniques to CSR was first reported in (Lee and Mahajan, 1989). More recently, Chow (1990) has proposed a method based on the N-best search. These techniques are generally based upon the discrimination information or cross-entropy approaches briefly described in Section 12.2.7, but a more ad hoc approach follows the corrective training approach of Bahl et al. (1988). In fact, the less formal technique was found to outperform the maximum mutual information approach (a form of discrimination information), and some speculations on possible reasons are given in the cited paper.

²⁰This process would likely be accomplished by an expert phonetician using a speech editing program.
²¹We will discuss standard databases in Section 13.8.

13.6 Other Language Models

In this section we examine some alternatives to regular grammars for modeling languages.

13.6.1 N-Gram Statistical Models

Much of the pioneering research on HMMs in speech processing was carried out at IBM in an effort to develop a large-vocabulary speech recognizer for office dictation. This work has been ongoing since 1972, and in 1986 the IBM group reported on a 5000-20,000-word isolated-word
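The grammar-training step described in this section (entering the whole database of orthographic transcriptions, then updating the FSA parameters) reduces, in the Viterbi-style case with a deterministic FSA, to relative-frequency counting of transition usage. The sketch below is a toy illustration under that assumption; the start-state name and the `step` interface are invented for the example.

```python
from collections import defaultdict

def estimate_transition_probs(sentences, step, start="S"):
    """Count Mealy-FSA transition usage over a training database of
    word-string "observations", then normalize per state.

    sentences : iterable of sentences, each a list of words (terminals)
    step      : function (state, word) -> next_state, or None if the
                word cannot be generated from that state
    Counts are accumulated over the entire database before any
    normalization, as the text prescribes.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        state = start
        for word in sentence:
            nxt = step(state, word)
            if nxt is None:          # sentence not accepted by the grammar
                break
            counts[state][(word, nxt)] += 1
            state = nxt
    return {s: {arc: c / sum(arcs.values()) for arc, c in arcs.items()}
            for s, arcs in counts.items()}
```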
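The "training in context" idea above, in which each unit model soaks up its own span of a sentence's observations, rests on chaining the sentence's unit HMMs into one composite left-to-right model per training utterance. A minimal sketch, assuming each unit is summarized here by its transition matrix alone, and that whatever probability mass is missing from a unit's final row is its exit probability:

```python
def concatenate_hmms(transition_mats):
    """Chain unit HMMs (word or phone models) for one training sentence.

    transition_mats : list of square transition matrices (lists of
    lists), one per unit in the sentence's transcription, each in
    left-to-right (Bakis) form. Returns one global matrix in which the
    exit mass of a unit's last state feeds the first state of the next
    unit, so that a single F-B or Viterbi pass over the whole utterance
    updates every unit in context.
    """
    n_total = sum(len(A) for A in transition_mats)
    big = [[0.0] * n_total for _ in range(n_total)]
    offset = 0
    for i, A in enumerate(transition_mats):
        n = len(A)
        for r in range(n):
            for c in range(n):
                big[offset + r][offset + c] = A[r][c]
        if i + 1 < len(transition_mats):
            # leftover mass of the unit's last state enters the next unit
            big[offset + n - 1][offset + n] = 1.0 - sum(A[n - 1])
        offset += n
    return big
```

A single F-B or Viterbi pass over the full utterance against this composite matrix then trains all of the units simultaneously, as described in the text.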
Λ(w_1^k) ≝ Σ_{t=1}^{T} P(w_1^k, y_1^t) Σ_{w'} P(w', y_{t+1}^T | w_1^k, y_1^t),     (13.96)

where w' represents any word string which can follow w_1^k. The first probability accounts for the likelihood that string w_1^k is associated with observations y_1^t, and the second probability accounts for the likelihood that the remaining observations y_{t+1}^T are attributed to some word string continuing w_1^k. The sum over w' in the second term is, exactly, P(y_{t+1}^T | w_1^k, y_1^t); this quantity is approximated as

P(y_{t+1}^T | w_1^k, y_1^t) ≈ P(y_{t+1}^T | y_1^t) ≈ ∏_{τ=t+1}^{T} P(y(τ) | y_{τ-τ'}^{τ-1}).     (13.97)

The latter approximation becomes better as τ' increases, but τ' = 1 is usually adequate. The terms P(y(τ) | y_{τ-τ'}^{τ-1}) can be estimated from training data. Noting that P(w_1^k, y_1^t) = P(w_1^k) P(y_1^t | w_1^k), we can write (13.96) as

Λ(w_1^k) = P(w_1^k) Σ_{t=1}^{T} P(y_1^t | w_1^k) P(y_{t+1}^T | y_1^t).     (13.98)

²²This did not happen in the Viterbi decoding methods used for regular grammar systems, in which pruned paths lost the competition with other subpaths of identical length.
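Equations (13.97) and (13.98) are easy to exercise numerically. The sketch below assumes the AD has already reported the prefix likelihoods P(y_1^t | w_1^k) and that the frame-prediction terms P(y(τ) | y(τ-1)) of (13.97), with τ' = 1, have been tabulated; the function and variable names are invented for the illustration.

```python
def tail_probs(frame_pred):
    """Tabulate P(y_{t+1}^T | y_1^t) for t = 1..T using the product
    form of (13.97) with tau' = 1.

    frame_pred : list with frame_pred[tau - 1] = P(y(tau) | y(tau - 1))
    """
    T = len(frame_pred)
    tails = [1.0] * T                  # at t = T nothing remains
    for t in range(T - 1, 0, -1):      # build the product from the back
        tails[t - 1] = frame_pred[t] * tails[t]
    return tails

def partial_path_likelihood(p_w, p_prefix, tails):
    """Lambda(w_1^k) of (13.98): the language-model probability of the
    partial word string times the summed prefix/tail likelihoods."""
    return p_w * sum(p * q for p, q in zip(p_prefix, tails))
```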
The reader is now encouraged to review the discussion surrounding (13.73) to be reminded of the appropriateness of this likelihood measure. One difference between Λ(w) and the measure that would have been computed for w using the Viterbi approach above is that here we attempt to truly maximize P(y, w) (in spite of all the approximations), whereas in the Viterbi approach we maximize the probability P(y, s, w), where s represents a state sequence through the complete network. The difference is similar to that which exists between the Viterbi and "any path" approaches to decoding an HMM.

The BJM technique also involves a second major difference in the manner in which the LD and AD are searched for the optimal solution. The method is a pruned version of the best-first search algorithm used in artificial intelligence research [see, e.g., (Nilsson, 1971)]. Let us illustrate this for the digit recognition problem for which the LD network is shown in Fig. 13.9. Suppose we initially pursue all hypotheses arising from state S. This will cause us to arrive at states i and ii and to require likelihoods Λ(zero), Λ(oh), Λ(one), and Λ(five). In turn, by examining (13.98), we see that we will need

1. P(w(1) = zero), P(w(1) = oh), P(w(1) = one), and P(w(1) = five).
2. P(y_1^t | w(1) = zero), P(y_1^t | w(1) = oh), P(y_1^t | w(1) = one), and P(y_1^t | w(1) = five) for t = 1, ..., T.
3. P(y_{t+1}^T | y_1^t) for t = 1, ..., T.

Let us assume that the initial probabilities for each word in item 1 are found in a lookup table considered part of the LD. The quantities in item 3 are found using (13.97) with τ' = 1 and a lookup table for the quantities P(y(τ) | y(τ - 1)). Repeated calls are made to the AD to compute the quantities P(y_1^t | w(1)) for each w(1) and each t. These quantities are stored for a future purpose. Upon completion of these likelihoods, we put each of these single-word partial paths in a stack in decreasing order of likelihood. Suppose, for example, the result is

zero   Λ(zero)
one    Λ(one)
five   Λ(five)     (13.100)
oh     Λ(oh)

Since "zero" is at the top of the stack, we attempt to extend this partial path by one word, say to w_1^2 = zero-zero, w_1^2 = zero-oh, w_1^2 = zero-one, and w_1^2 = zero-six. Suppose that we try zero-one first. We then need

1. P(w_1^2) = P(w(1)) P(w(2) | w(1)).
2. P(y_1^t | w_1^2) for t = 1, ..., T.
3. P(y_{t+1}^T | y_1^t) for t = 1, ..., T.

Item 1 requires the quantities P(w(2) | w(1)), more linguistic information which is stored in the LD. (In this case, these quantities represent Markov probabilities, but they are not associated with state transitions in the LD state diagram. Rather, they correspond to Markov dependencies between couples of state transition arcs in the diagram. This is simply a matter of the way in which the problem is structured in this case.) The quantity P(w_1^2) is stored for the next extension. The item 2 quantities are easy to compute because we have stored the quantities P(y_1^t | w(1)) for each w(1) and each t. We have, therefore,

P(y_1^t | w_1^2) = P(y_1^{t'} | w(1)) P(y_{t'+1}^t | w(2)).     (13.101)

These quantities are stored for further path extensions. Finally, the item 3 quantities have all been computed in the first iteration and have presumably been stored.

After Λ(zero-one) is computed, this candidate partial path is placed in the stack in its appropriate ranking. If it is no longer at the top, the new top candidate is extended in a similar manner and also placed in the stack in the correct order. Each time, the topmost partial path is extended by one word, and the result is placed back in the stack in the proper order. After an iteration is complete, the partial path at the top of the stack is examined to see whether it is a complete path. If it is, then it is declared the optimal path, because extending any partial path below it in the stack cannot result in a path with a better likelihood. It is to be noted that partial paths may fall below some acceptable likelihood and not be placed in the stack even if there is sufficient room. It is also the case that stack size must be limited, and certain partial paths might be lost because of insufficient room. The former has been called soft pruning and the latter hard pruning (Venkatesh et al., 1991).

It is instructive to move out to the next step in the search. Suppose that the stack resulting from the above is as follows:

zero   Λ(zero)
  ⋮
oh     Λ(oh)     (13.102)
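The extend-the-top-candidate loop illustrated above can be sketched generically with a priority queue. This fragment ignores the FSA constraints of Fig. 13.9 and treats the scoring function as a black box standing in for Λ(·) of (13.98); the vocabulary, scores, and completion test in the example are invented.

```python
import heapq

def stack_decode(words, score, is_complete, max_len=10):
    """Pruned best-first ("stack") search over word strings.

    words       : vocabulary of candidate one-word extensions
    score       : function mapping a word tuple to its likelihood,
                  standing in for Lambda(w_1^k)
    is_complete : predicate marking complete sentences
    The topmost partial path is repeatedly extended by one word; the
    search stops the first time a complete path reaches the top.
    """
    # heapq is a min-heap, so likelihoods are stored negated
    stack = [(-score((w,)), (w,)) for w in words]
    heapq.heapify(stack)
    while stack:
        neg, path = heapq.heappop(stack)
        if is_complete(path):
            return path, -neg    # nothing below can be extended to beat it
        if len(path) < max_len:
            for w in words:
                ext = path + (w,)
                heapq.heappush(stack, (-score(ext), ext))
    return None, 0.0
```

As in the text, declaring the first complete top-of-stack path optimal presumes that extending a path cannot increase its likelihood.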
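The language-model quantities used in the extensions above, P(w(1)), P(w(2) | w(1)), and so on, are in practice estimated from transcribed training text by relative-frequency counting. A toy sketch, in which the start-symbol padding is an assumption of this example:

```python
from collections import defaultdict

def ngram_probs(sentences, N=2):
    """Estimate conditional word probabilities P(w(l) | history) by
    counting, where the history is the N - 1 preceding words.
    Sentences are padded with a start symbol so that sentence-initial
    words also receive conditional probabilities.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        padded = ["<s>"] * (N - 1) + list(sent)
        for l in range(N - 1, len(padded)):
            history = tuple(padded[l - N + 1:l])
            counts[history][padded[l]] += 1
    return {h: {w: c / sum(ws.values()) for w, c in ws.items()}
            for h, ws in counts.items()}
```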
The fact that a single "zero" is at the top of the stack is an indication that there is at least one more extension of this word. Indeed we know that there are three others: zero-zero, zero-oh, and zero-six. After computing the likelihood for each of these partial paths in the next three steps, suppose the stack contains

zero-one    Λ(zero-one)
zero-oh     Λ(zero-oh)
one         Λ(one)
zero-zero   Λ(zero-zero)     (13.103)
zero-six    Λ(zero-six)
five        Λ(five)

Now we go to the top and extend the partial path "zero-one" to, say, "zero-one-two." Again, for generality, let us speak of w_1^3, w_1^2, w(1), w(2), and w(3). We need

1. P(w_1^3) = P(w(1)) P(w(2) | w(1)) P(w(3) | w_1^2) = P(w_1^2) P(w(3) | w_1^2).
2. P(y_1^t | w_1^3) for t = 1, ..., T.
3. P(y_{t+1}^T | y_1^t) for t = 1, ..., T.

Again, the item 3 quantities are already available, and the quantities in item 2 are computed similarly to (13.101). It is the first set of quantities on which we need to focus.

In this case the LD needs to have knowledge of the quantities P(w(3) | w_1^2). [Note that P(w_1^2) was stored at the last step.] In general (for a kth extension), we will need P(w(k) | w_1^{k-1}). Even for a small vocabulary like the digits, it is easy to see that this requirement quickly becomes prohibitive. It is customary, therefore, to use the approximation

P(w_1^k) ≈ ∏_{l=1}^{k} P(w(l) | w_{l-N+1}^{l-1})     (13.104)

for some small N. The LD contains a statistical characterization of the language consisting of the probabilities P(w(l) | w_{l-N+1}^{l-1}). This is called an N-gram model of the language. A 2-gram or bigram model assumes a Markovian dependency between words. For most vocabularies the use of N-gram models for N > 3 (a 3-gram is a trigram model) is prohibitive.

While the BJM technique is sometimes characterized as being based upon a statistical language model as opposed to a grammatical model [see, e.g., (Waibel and Lee, 1990, p. 447)], the BJM language can, in fact, be viewed as a regular stochastic language. This point is made in the paper by Levinson (1985). For a trigram model, for example, we see that every word is generated by a rule of the form²³

A -p-> qB,     (13.105)

where, if q represents the kth word in the string, then p = P(q | w_{k-2}^{k-1}) [A and B being nonterminals encoding the two preceding words]. Accordingly, the generation of the words can be modeled as an HMM and the N-gram probabilities can be inferred from training data using an F-B algorithm. In fact, the F-B algorithm reduces to a relatively simple counting procedure in this case. For details see (Bahl et al., 1983; Jelinek, 1990).

As pointed out by Bahl et al., the stack search procedure is not exhaustive and the decoded sentence might not be the most likely one. This can happen, for example, when a poorly articulated word (frequently short "function" words like "a," "the," and "of") causes a poor acoustic match. For this reason, a modified procedure is sometimes used in which all partial paths in the stack with likelihoods within, say Δ, of the maximum likelihood are extended before a final decision is made. A large Δ implies more computation, but lower risk of discounting the correct path. Finally, we note that Venkatesh et al. (1991) have worked with a multiple stack search that can be used when scattered, rather than sequential left-to-right, evaluations are made on the word models.

13.6.2 Other Formal Grammars

In principle, any formal grammar can be used to model the language in the LD. We will make some brief comments on this issue in this section, but an extensive treatment of the higher levels of the Chomsky hierarchy would take us well beyond the scope of this text. As pointed out by Levinson (1985), any finite language can be generated by a regular grammar, but one motivation for using other grammars is to make the language model conform to a more conventional model of natural language. For example, natural linguistic rules are often presented in context-sensitive form (see, e.g., Fig. 13.1). The disadvantage of higher grammars is the increased complexity encountered in the corresponding parsing algorithms. The number of operations required for Viterbi decoding of a string w using a regular grammar is proportional to |V_N|² × |w|, where |V_N| is the size of the nonterminal vocabulary and |w| represents the length of the given sentence. Let us keep this number in mind as we discuss some further methods.

Context-free languages are generally parsed using the Cocke-Younger-Kasami (CYK) algorithm or Earley's algorithm. The CYK algorithm was first developed by Cocke, but independently published by Kasami²⁴

²³Also see the discussion of the Paeseler and Ney (1989) work in Section 13.9.
²⁴A more convenient reference to this work is (Kasami and Torii, 1969).
(1965) and Younger (1967). Earley's method, sometimes called the chart parsing algorithm, was published in 1970 (Earley, 1970). The CYK algorithm is essentially a DP approach, whereas Earley's algorithm uses a central data structure called a chart to effectively combine intermediate subparses to reduce redundant computation. Each algorithm requires O(|w|³) operations, but Earley's method reduces to O(|w|²) if there are no ambiguities in the grammar (Fu, 1982, Sec. 5.5). More recently, Paeseler (1988) has published a modification of Earley's method that uses a beam-search procedure to reduce the complexity to linear in the length of the input string. Example CSR systems based on the Paeseler and CYK algorithms can be found in Paeseler's paper and in (Ney et al., 1987), respectively.

Left-right (LR) parsing is an efficient algorithm for parsing context-free languages that was originally developed for programming languages [e.g., (Hopcroft and Ullman, 1979; Fu, 1982)]. A generalized LR parsing algorithm has been applied to the CSR problem by Kita et al. (1989) and Hanazawa et al. (1990). The resulting system is called HMM-LR because it is based on HMM analysis of phones driven by predictive LR parsing. The HMM-LR system will be described in Section 13.9.

In discussing the BJM method in Section 13.6.1, we made the point that the trigram model could be posed as a regular grammar and consequently trained using an F-B-type algorithm. It is interesting to note that a system described by Paeseler and Ney (1989) uses a trigram model of word categories (a triclass model) in a system in which the statistical language model manifests itself as a regular grammar in one form and a context-free grammar in a more efficient form.²⁵ The set of words (terminals) in this system is partitioned into word classes that share common characteristics. Each word within a given category is equiprobable to all others. In one form of the system, states (nonterminals) represent couples (remember the trigram dependency) of word categories. If the categories are C_1, C_2, ..., C_N, then any state (nonterminal) is of the form

A = (C_i, C_j).     (13.106)

Since each word within a category is equiprobable, the probability of generating a particular word of category C_k is

p″ = 1 / |C_k|,     (13.107)

where |C_k| is the number of words in category C_k. The network can therefore be thought of as a manifestation of productions of the form

A -p-> aB,     (13.108)

where A, B ∈ V_N, a ∈ V_T, and p = p′p″, with p′ the probability of the category transition. This, of course, represents a regular grammar. One transition in the network is illustrated in Fig. 13.10(a). Since every state that may make a transition to B in the above must generate |C_k| observations on the transition, a more efficient set of productions is formed as follows. Let us artificially create two nonterminals, B_in and B_out, from every nonterminal B = (C_j, C_k) in the above. Now if we allow rules of the form

A_out -p′-> B_in     (13.109)

and

B_in -p″-> a B_out,     (13.110)

where a, p′, and p″ have meanings identical to those above, then the number of necessary transitions in the network is reduced by a factor equal to the number of categories. This revised transition scheme is also illustrated in Fig. 13.10.

FIGURE 13.10. State model of one production rule in Paeseler and Ney's triclass grammar. (a) "Direct" form corresponding to production rules of form (13.108); there are |C_k| possible outputs for every transition into B. (b) Revised transition corresponding to rules of form (13.109) and (13.110).

²⁵A similar idea is the use of a "tri-POS" (parts of speech) statistical model (Derouault and Merialdo, 1986; Dumouchel et al., 1988).

We see that these rules technically comprise a context-free grammar, although we know it is a thinly disguised regular grammar. It should also be apparent that the latter structure is equivalent to a Moore-form FSA in which the nontermi-
7BB e n. 13 / Language Mod eling 13. 7 / tWR as "CSR " 7 89
nals ar~ generated at th e states. Rather than split the states apart in process. A discussion of thi s issue and related references are found in
the re vised network . we could simply allow the states to generate th e Levinson's paper.
words.
Specialized gra mmars have also been used for CSR. AII f[III Cnlec! trc _
.. k T ' in
st tton lU?II'or ' (A N) gram m a rs (Wood s. 1970. 1983 ) a re similar to 13.7 IWR As "CSR"
cont ext-Ir.ee grammars but arc more effic ient due to the merger of com
mon parsi ng paths. T hese grammars were developed speci ficallv for nat In Section 12.4.2. we left open for further discussion the task of recogniz
ural language processing. An ATN grammar was used in the :'HWIM" . :1g isolated words whose models were composed of smaller subword
system (Wolf and Woods , 1977. 1980 ). discus sed furt her in Section models. We said that this task could be considered a special case of CSR.
13.9, in com bination with an "island-driven" strategy in which reliable Indeed. we are now in a posit ion to understand this comment. It should
phones, words, or phrases were located using an initi al scan. and then be clear tha t we could conside r the word. rather than the sentence, as the
bui lt upon using a " midd le-out" search. This app roa ch is novel in its di ultimate production in a language. where the word is compose d of termi
vergence from conventional left-to-right parsing. Stochast ic unification nals corresponding to th e basic subword s. A formal grammar could be
gram m ars represent generalizations of formal gra mmars in wh'ich fea constructed whose production rules would ult irnatel y prod uce words
tu res are added to t he clements of the forma l vocabular v. Thcv have from the terminals. Now with sentence replaced by word and II'0rd re
been used in speech processing to model contextual ' information placed by subword (whatever its form ). any of the discussion above per
(Hemphill and P icone. 1989: Shie ber, 1986) and to add natural lan taining to a two-level CSR system would ap ply equa lly well to t his
guage feature s (person . num ber. mood , etc .) to the no nte rminal cle simple r prob lem.
ments of the gramma r (Chow and Roucos, 1989). The inclusion of An accompanying recognizer of isolated words based on these formali
feat ure information in the gram ma r represents a step toward speech un ties would have an LD consist ing of a parser used to hypothesize termi
dcrs tan ding in its prov ision of linguist ic knowledge beyond the gram nal strings, and an AD that would provide the acoustic matches for the
matical st ruct ure. In their paper, Hem phill a nd Picone int roduce the models corresponding to the term inals. If. as would be likely." a regular
basic unification gra mmar formalism and argue that viewing the speech grammar underlay the linguistic process, then the LD ","a uld be repre
production proce ss as based on a grammar rather than an FSA (in th e sentable as an FSA. and the entire discussion of training and recognition
regula r gramm ar case ) has computational adv antages when a chart algorithms for two-level systems would be applicable to this isolated
pars ing algor ithm is used to generate the hypot heses. In th e paper by word system.
Chow and Roueos. a speech und ersta ndin g system is presented that em We should also recognize that higher-level linguistic infor mat ion can
ploys an augme nted co ntext -free grammar. be used to assist in the IWR problem when the isolated words com prise
As we travel farther up the Chomsky hierarchy, complexity of parsing sentences or longer messages. A grammar or other model that contai ns
algori thm s incre ases drastically, and these grammars have not found infor mation about how words may compri se sentences is an entropy
much application to the s peech recognition problem . A DP-type parsing educing device that can assist in improved recognition perfor mance. In
algorithm for context-sensitive gra mmars has been reported by Tanaka a system recognizing sentences comprised of isolated words, therefore. we
and Fu (1978) of which the comp lexity is exponential in 111'1. For unre might have two "coupled" component s of the LD. one parsing words into
stricted grammars. there exists no universal parsing algorithm (Fu, 1982. subwords, the other overseeing t he order in which words are hypot he
p. 55), alt hough algorithms do exist for special cases (Hopcro ft and sized in the first place. T his description could. of course. also describe
Ullman, 1979, pp. 267-268), A discussion of this issue and a useful bib he operation of a CSR recognizer. and a litt le thought should convince
liography on the general subject is given by Levinson ( 1985). the reader that the present problem is jus t a special case of the CSR
Before leaving this sect ion. we should reiterate a point that has been roblern.> ' An important example of this approach is the IWR version of
made several times in earlier discussions: Both F- B-like and Viterbi-like the IBM TANGO RA recognizer, a system built on the princ iples of the
approaches exist for the inference of the probabilities of t he production BJM methods discussed above. We will say more abo ut TANGORA
rules of a 1'0rmal stochastic grammar. given the characteri stic grammar below. .
(Levinson. 1985; Fu, 1982). Recall tha t for a regular grammar: th is t~ sk
is equivalent to the problem of finding the probabil ities associated WIth IOHowever. context-depende nt phone mod els of words have been found to im prove rec
an HMM or FSA with a fixed structure. Consequently. the fact that these ognitio n (Bahl et al., 1980; Schwan z et al.. 1984 ).
trai ning algorithms exist is not sur prising. In fact, however. any st? 27 How would the top -down algorithm discussed in Section 13.5 be modified to accom
mo date known tem poral boundaries in the observa tion string (correspond ing to isolated
chastic grammar may be shown to have a corre lative doubly stocha stic word s)?
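To make the word-as-sentence reduction above concrete, here is a minimal sketch (ours, not drawn from any system in this chapter) of an isolated-word recognizer whose word models are concatenations of single-state, discrete-observation subword HMMs, scored by Viterbi decoding. The subword inventory, the two-word lexicon, and all probabilities are invented for illustration.

```python
import numpy as np

# Each subword unit is a single-state discrete-observation HMM:
# (self-loop probability, emission distribution over symbols {0, 1}).
SUBWORDS = {
    "a": (0.6, np.array([0.9, 0.1])),   # tends to emit symbol 0
    "b": (0.6, np.array([0.1, 0.9])),   # tends to emit symbol 1
}

# The lexicon plays the role of the grammar: each word is a string of
# subword terminals (both words here are hypothetical).
LEXICON = {"word1": ["a", "b"], "word2": ["b", "a"]}

def viterbi_log_score(units, obs):
    """Viterbi log-likelihood of obs through the left-to-right chain of
    single-state subword HMMs named in `units`."""
    loops = np.array([SUBWORDS[u][0] for u in units])
    emis = np.stack([SUBWORDS[u][1] for u in units])      # (S, 2)
    S = len(units)
    delta = np.full(S, -np.inf)
    delta[0] = np.log(emis[0, obs[0]])                    # must start in first unit
    for t in range(1, len(obs)):
        stay = delta + np.log(loops)                      # remain in same subword
        move = np.full(S, -np.inf)
        move[1:] = delta[:-1] + np.log(1.0 - loops[:-1])  # advance to next subword
        delta = np.maximum(stay, move) + np.log(emis[np.arange(S), obs[t]])
    return delta[-1]                                      # must end in last unit

def recognize(obs):
    """Score every lexicon entry against the observations; the AD/LD
    split collapses to this single loop in the isolated-word case."""
    return max(LEXICON, key=lambda w: viterbi_log_score(LEXICON[w], obs))
```

Replacing each single-state unit with a full phone HMM, and the lexicon with a stochastic grammar over word sequences, turns the same recursion into the two-level CSR decoder discussed earlier in the chapter.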
13.8 Standard Databases for Speech Recognition Research

A significant boon to the field of speech recognition research has been the availability of standardized databases for system testing, which appeared in the mid- to late 1980s. Among those most frequently cited in the literature are the DARPA Resources Management Database (DRMD) (Price et al., 1988), the TIMIT Acoustic Phonetic Database (Fisher et al., 1986), and the Texas Instruments/National Bureau of Standards (TI/NBS) Database of Connected Digits (Leonard, 1984).

The DRMD is a 1000-word database containing material for speaker-dependent, speaker-independent, and speaker-adaptive recognition. It is based on 21,000 English-language utterances about naval resource management, collected from 160 speakers with an array of dialects. The material is partitioned into training and testing data sets. The availability of the DRMD was published in 1988 in the Price et al. paper referenced above, and it has been widely used to test large-vocabulary CSR systems.

The TIMIT database represents another DARPA-supported project. TIMIT is a phonetically transcribed database that was digitally recorded by the Texas Instruments Corporation (TI) and transcribed at the Massachusetts Institute of Technology (MIT). The material for the database was selected by MIT, TI, and the Stanford Research Institute (SRI). It contains the data for 4200 sentences spoken by 630 talkers of various dialects. Data for 420 of the talkers are used as a training database, while the others' data comprise the testing data. Details of the TIMIT database are found in the paper by Fisher et al.

Finally, the TI/NBS28 database is a collection of digit utterances (0-9, "oh") for use in speaker-independent trials. The data include the speech of 300 men, women, and children recorded in a quiet environment. The material includes strings of digits ranging from one (isolated) to seven long. Details of the TI/NBS database can be found in the paper by Leonard. Also noteworthy is the first industry-standard database, TI-46, which was also developed by TI. This collection contains the alphabet, digits, and several command words. It is still being used for neural network research (Chapter 14), among other applications.

Whereas the databases listed above are English-language collections, it is certain that other-language databases will be developed in the future. At the time of writing this book, the Acoustical Society of Japan is planning to release a Japanese-language database. The SAM project is a European effort to develop large multilingual databases (Mariani, 1989).

Several of the databases discussed above are available on CD-ROM from the U.S. National Institute of Standards and Technology (NIST). Ordering instructions are given in the preface to this book.

28 The National Bureau of Standards (NBS) is the former name of the NIST mentioned below.

13.9 A Survey of Language-Model-Based Systems

We conclude this chapter with a brief survey of some of the speech recognition systems of historical and contemporary significance. Any such survey will necessarily be incomplete, as we cannot hope to cover the vast array of systems and techniques that have been proposed and implemented over the years. Our objective will be to present a few systems that illustrate various approaches and concepts described in the foregoing material. We will also see some of the higher-level language sources come into play that we have only briefly discussed in this chapter. Finally, this survey will give an indication of the performance capabilities of contemporary speech recognition systems. With one exception, we explicitly focus on systems that are principally research systems and avoid discussion of commercial products. Of course, the ultimate purpose of speech recognition research and development is application to practical problems. Although the speech recognition field is still relatively young, and although many challenging problems remain, many interesting applications have taken place. The reader is referred to (AT&T, 1990), for example, to read about some of these endeavors.

The more recent of the systems described below represent evolving research. Accordingly, we can only provide a brief synopsis of the operating principles extant at the time of completion of this book. The reader is encouraged to consult the literature to discover recent advances in these and other systems. New results are often first reported in the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (see Appendix 1.E.4). As an example of this continuing evolution, we note that artificial neural network technology, which we take up in the following chapter, has begun to be integrated into some of the contemporary systems described below. In Chapter 14, we shall briefly return to this issue and discuss some of these details.

ARPA Speech Understanding Project. In the United States, the modern era of large-scale automatic speech recognition was ushered in by the Advanced Research Projects Agency (ARPA) of the Department of Defense when, in 1971, it announced a five-year development program with the goal of significantly advancing the field of speech understanding.29 The ARPA goals for a prototype system are shown in Table 13.1 along with the features of the HARPY system of Carnegie-Mellon University, the only system to exceed all of the stated goals. Klatt (1977) has written a review of the ARPA project that compares and contrasts the architectures and operating principles of four of the systems that resulted from the study.30

29 In spite of the name "speech understanding," only one of the systems described below will be seen to use linguistic knowledge above the syntactic level.

30 In this paper Klatt also gives a useful list of citations to earlier work in speech recognition. In particular, he recommends the paper by Reddy (1976) from which we quoted in Section 10.2.4.
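The comparisons that follow (Tables 13.1 and 13.2) are stated partly in terms of branching factor and perplexity. As a reminder of the standard perplexity computation, here is a minimal sketch; the probabilities used in it are invented for illustration.

```python
import math

def perplexity(probs):
    """Perplexity of a test sequence: the exponential of the average
    negative log-probability the model assigns to each word."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# A model certain of every word has perplexity 1. A model that spreads
# probability uniformly over a V-word vocabulary has perplexity V, so a
# task with no grammar over 1000 words has perplexity 1000 regardless of
# the test-set length; a strong grammar lowers the effective number of
# choices per word.
certain = perplexity([1.0, 1.0, 1.0])       # 1.0
uniform = perplexity([1.0 / 1000] * 20)     # ~1000.0
```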
TABLE 13.1. ARPA 1971 Five-Year Goals for a Prototype Speech Recognition System, Along with the Features of the HARPY System of Carnegie-Mellon University. After Klatt (1977).

ARPA Five-Year Goals (November 1971)       HARPY Characteristics (November 1976)
Accept connected speech                    Yes
from many                                  5 speakers (3 male, 2 female)
cooperative speakers                       yes
in a quiet room                            computer terminal room
using a good microphone                    close-talking microphone
with slight tuning per speaker             20 training sentences per talker
accepting 1000 words                       1011 words
using an artificial syntax                 average branching factor < 33
in a constraining task                     document retrieval
yielding less than 10% semantic error      5% semantic error
in a few times real time                   80 times real time
on a 100-MIPS machine                      on a 0.4-MIPS PDP-KA10, using 256K
                                           of 36-bit words, costing $5 per
                                           sentence processed

TABLE 13.2. The Four Systems Resulting from the 1971 ARPA Study and Their Gross Performance Figures. After Klatt (1977).

System                       Sentences Understood (%)    Perplexity
CMU HARPY                    95                          33
CMU HEARSAY II               91, 74                      33, 46
BBN HWIM                     44                          195
System Development Corp.     24                          105

Note: Percentages are based on more than 100 sentences spoken by several talkers, except for HEARSAY II, which was tested with a smaller data set.

We give a brief synopsis of the ARPA research with the reminder that these systems, while remarkable achievements in their era, naturally represent early technologies that do not reflect the current state of the art. Accordingly, the reader may wish to casually read through these descriptions for general information only.

The four systems resulting from the ARPA study and their gross performance figures are listed in Table 13.2. Note that the perplexity is used as a measure of the difficulty of the recognition task. As noted by Klatt, given the different branching factors, it is difficult to determine absolute performance differences between the systems. These four systems employed vastly different approaches, but all employed a form of top-down governance of the processing.31 Briefly, these systems are:

1. The HARPY system of Carnegie-Mellon University (CMU) (Lowerre and Reddy, 1980). The basis for HARPY is a massive 15,000-state network that includes lexical representations, syntax, and word boundary rules compiled into a single framework. The resulting FSA is decoded with respect to acoustic measures (see below) using DP and beam search. A predecessor of HARPY is the DRAGON system (Baker, 1975), also developed at CMU, which also employed decoding of an FSA using a breadth-first DP search.32 The addition of the beam-search concept in HARPY, however, vastly improved computational efficiency.

The acoustic processing in the HARPY system consists of extracting 14 LP parameters from 10-msec frames of speech. Interestingly, the frames are combined (by summing correlation matrices) if sufficiently similar (according to the Itakura distance) in order to reduce processing time and smooth noise effects. The resulting "acoustic segments" (typically 2-3 frames) are classified into one of 98 groups using the Itakura distance.

31 In addition to the references cited with each system, Klatt's paper gives a comprehensive description of each of the systems.

32 This term connotes the fact that all paths are extended in parallel, rather than extending the highest-likelihood paths first as is the case, for example, in the BJM technique.

2. The HEARSAY II system of CMU (Lesser et al., 1975). The HEARSAY system has a radically different architecture from that of HARPY. Information from all sources of acoustic and linguistic knowledge (all below the semantic level) are integrated on a "blackboard" that serves as a controller for the processing. The knowledge processors are quite compartmentalized and relatively easily modified. One component, the "word verifier," is in the form of a HARPY-like FSA for composing words from subword units. The interaction with the acoustic signal occurs in an island-driven fashion in which a high-scoring seed word is sought from which to expand to complete sentence hypotheses. A CYK parser is employed in this process in order to consider many hypotheses in parallel.

Acoustic processing in the HEARSAY II system consists of computing the peak-to-peak amplitudes and zero-crossing measures on 10-msec nonoverlapping frames of both a preemphasized speech waveform and a smoothed version of it. Frames of similar measures are grouped into intervals and these intervals, in turn, are classified by manner of articulation using a decision tree and a series of threshold tests on the four parameters.

3. The HWIM ("Hear What I Mean") system of BBN, Inc. (Wolf and Woods, 1977, 1980). The HWIM system also uses an island-driven strategy in which complete hypotheses are built from high-scoring seed words. Lexical decoding is accomplished using a sophisticated network of phonological rules applied to a lattice of possible segmentations of the waveform. Word hypotheses are guided from above by syntactic and semantic knowledge, which takes the form
of an ATN grammar (see Section 13.6.2). The complete hypotheses are developed using best-first search.

Acoustic processing in the HWIM system consists of extracting formants (by LP analysis), energy in various frequency bands, zero crossings, and fundamental frequency on 20-msec (Hamming-windowed) frames every 10 msec.

4. The Systems Development Corporation (SDC) speech understanding system (Ritea, 1975). The SDC system generates a set of alternative phonetic transcriptions from acoustic and phonetic processing that are stored in a "matrix" for processing from above. The system then generally follows the basic paradigm of left-to-right, best-first search using a phonetic "mapper" that interfaces the syntactic and lexical knowledge sources with the phonetic transcription hypotheses from the acoustic processing.

Acoustic analysis in the SDC system involves computation of energy and zero-crossing measures on 10-msec intervals, pitch estimation by a center-clipping and autocorrelation technique (Gillman, 1975), and LP analysis on 25.6-msec Hamming-windowed frames every 10 msec.

Perhaps one of the most significant findings of the ARPA project is manifest in the HARPY approach: efficiently applied grammatical constraints can comprise a powerful tool in achieving highly accurate performance. This is true in spite of the relatively simple acoustic processing used in the HARPY system. This finding has clearly influenced the course of speech recognition research in the ensuing years.

TANGORA. Concurrent with the ARPA projects, work (continuing today) was in progress at IBM on the application of statistical methods in automatic speech recognition. In particular, early work on the HMM was published by Jelinek (1976) [and independently during the same period by Baker (1975) at CMU]. In 1983, IBM researchers published the paper (Bahl et al., 1983) to which we have frequently referred in our discussions. This paper indicates significantly better performance on the speaker-dependent, constrained-task CSR than was achieved by HARPY. In this paper are many of the seminal ideas on the use of HMMs in the CSR problem. Since we have described the basic technologies in Section 13.6.1, let us just mention here a manifestation of the research. In the early 1980s the IBM group focused their attention on the problem of office dictation. The result, announced in 1984, was a large-vocabulary (5000 words plus a spelling facility), speaker-dependent, isolated-word,33 near-real-time recognition system built on a vast platform of computing facilities including a mainframe computer and a workstation (Jelinek, 1985). By 1987, the system was scaled down to operate in real time in a personal computer with four special-purpose signal processing boards, while having an expanded vocabulary scalable from 5000 to 20,000 words (Averbuch et al., 1987). In 1989, the extension to continuous speaker-dependent utterances of the 5000-word vocabulary was announced (Bahl et al., 1989). The system is called the TANGORA system, named for Albert Tangora, who is listed in the 1986 Guinness Book of World Records as the world's fastest typist (Bahl et al., 1988).

33 The system uses the techniques of Section 13.6.1, which are applicable to CSR. The IWR problem may be considered a special case in which (word) temporal boundaries are known in the observation sequence.

The TANGORA system is based on discrete-observation HMM models using a 200-symbol VQ codebook and a trigram language model. The concept of fenonic baseforms was added in the 1988 report of the work (Bahl et al., 1988) (see Section 12.3.4) in order to decrease the training time for a new speaker. It is interesting that the training of the 5000-word system requires about 20 minutes to read 100 sentences composed of 1200 words, 700 of which are distinct. Some typical performance results for TANGORA as compiled by Picone (1990) are shown in Table 13.3.

TABLE 13.3. Typical Performance Results for TANGORA as Compiled by Picone (1990).

Recognition Task                              Word Error Rate (%)
5000-word office correspondence               2.9
20,000-word office correspondence             5.4
2000 most frequent words in office            2.5
  correspondence using phonetic baseforms
2000 most frequent words in office            0.7
  correspondence using fenonic baseforms

The BYBLOS System. In recent years, ARPA has become DARPA (the Defense Advanced Research Projects Agency), and several systems have been developed under contracts from this agency. Among them are the BYBLOS and SPHINX systems. The BYBLOS system, developed at BBN, Inc. (Chow et al., 1987; Kubala et al., 1988), is a speaker-dependent CSR system intended for large-vocabulary applications. Its system structure and search mode (decoding is based on the F-B algorithm) closely conform to the general description earlier in this chapter of top-down processors. In fact, the reader might wish to read the paper by Chow et al., because its organization is parallel to our discussions above and will therefore be an easily followed explanation of a real-world CSR system.

One of the unique features of BYBLOS is the inclusion of context-dependent phone models, which were described in Section 12.3.4. This approach allows the models to capture coarticulatory effects. In fact, the name BYBLOS is quite significant in this regard: Byblos is the name of an ancient Phoenician town (now Jubeil, Lebanon) where the first phonetic writing was discovered. Researchers at BBN chose the name to emphasize the phonetic basis for the system. In 1986, when the system was
conceived, there was a widespread belief among the speech processing community that stochastic speech recognition systems based on phonetic units were unfeasible (Makhoul, 1991). … mel-cepstral coefficients every 10 msec using a 20-msec window. The acoustic models are discrete-observation HMMs based on VQ using a 256-symbol codebook.

BYBLOS has been tested under various conditions and with various tasks. In (Kubala et al., 1988), the 1000-word DRMD is used as the test material. Three grammar models were employed for word hypotheses. The first is a regular grammar (FSA) of perplexity 9, the second a word-pair grammar (bigram grammar without probabilities) of perplexity 60, and the third a null grammar (no grammar) of perplexity 1000 (simply equal to the number of words). Some typical results with data from the DRMD are shown in Table 13.4. Other results are found in (Chow et al., 1987; Kubala et al., 1988).

More recently, N-best search has been incorporated into the BYBLOS system (Schwartz and Chow, 1990; Schwartz et al., 1992); 1990 experiments were performed on the speaker-dependent portion of the DRMD using (in the N-best search) simple language models consisting of no … CSR in speaker-independent mode.

Like BYBLOS, the SPHINX system generally follows the basic principles of top-down linguistic processing using Viterbi decoding described earlier in the chapter. The most interesting features of the system occur at the lowest levels of the linguistic and acoustic processing.

SPHINX is based on context-dependent discrete-observation phone models, which in this work are referred to as triphones. The basic phone HMM topology used in SPHINX is illustrated in Fig. 12.10. One thousand such phone models were trained on the 1000 most frequent naturally occurring triphones in the DRMD, which was used to test the system (7000 triphones were found in the data). The HMMs are discrete-observation, but have an interesting feature that the three 256-symbol codebooks used (cepstral, differential cepstral, and energy features) are derived from LP analysis and are each coded separately. Each transition in the FSA ultimately generates three features and their probabilities are combined. Word duration models are also included in a later version of the system. In addition, measures are taken to account for poorly articulated "function" words such as "a," "the," "of," and so on.

TABLE 13.4. Typical recognition results for the BYBLOS system using data from the DRMD. (The body of this table is not legible in the scan.)
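The word-pair grammar used to test BYBLOS (a bigram grammar stripped of its probabilities) is easy to visualize in code. The sketch below is illustrative only: the tiny vocabulary is invented, and the mean successor count shown is merely a crude "branching factor" stand-in for the perplexity figures quoted above, since true perplexity involves probabilities.

```python
# A word-pair grammar records only which words may follow which; during
# search it prunes any word whose predecessor does not license it.
ALLOWED = {                      # word -> permitted successors (invented)
    "<s>": {"show", "list"},
    "show": {"ships", "ports"},
    "list": {"ships"},
    "ships": {"</s>"},
    "ports": {"</s>"},
}

def legal(sentence):
    """True if every adjacent word pair is licensed by the grammar."""
    words = ["<s>"] + sentence + ["</s>"]
    return all(b in ALLOWED.get(a, set()) for a, b in zip(words, words[1:]))

def mean_branching_factor():
    """Average number of successors per word: a crude analogue of the
    perplexity figures quoted for such grammars."""
    return sum(len(s) for s in ALLOWED.values()) / len(ALLOWED)
```

Attaching a probability to each licensed pair turns this table back into the bigram grammar from which the word-pair grammar is derived.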
Several language models were derived from the DRMD for testing of SPHINX. These included the same three grammars used to test BYBLOS above plus a bigram model of perplexity 20. Many data indicating performance of SPHINX under various conditions and strategies are given in the paper by Lee et al. Some typical recognition results using the DRMD with different grammars and with various improved versions of the system are shown in Table 13.5.

TABLE 13.5. Typical Sentence Recognition Results for the SPHINX System Using the DARPA Resources Management Database with Different Grammars and Various Enhancements to the System. After Lee et al. (1990).

                               Grammar
System Version         Null             Word Pair        Bigram
Baseline               31.1% (25.8%)    61.8% (58.1%)    76.1% (74.8%)
3 codebooks and        45.6% (40.1%)    83.3% (81.1%)    88.8% (87.9%)
  4 feature sets
Word duration          55.1% (49.6%)    85.7% (83.8%)    91.4% (90.6%)

Note: Percentages in parentheses represent word recognition rates.

The LINCOLN System. In Chapter 12, we mentioned the efforts of researchers at Lincoln Laboratories to model speech under various conditions of speed, stress, emotion, and so on. In 1989, Paul (1989) reported on the efforts to expand this research to the large-vocabulary CSR domain for both speaker-dependent and speaker-independent trials. The DRMD was used with a word-pair grammar. Continuous-observation, Gaussian mixture density HMMs are used to model context-sensitive phones. Accordingly, this work involves a system and tasks with similarities to the SPHINX work discussed above. For details, the reader is referred to the paper by Paul.

DECIPHER. The DECIPHER system, developed at SRI International, is based on similar principles to systems discussed above (Murveit and Weintraub, 1988; Weintraub et al., 1989; Cohen et al., 1990). This system is notable for its careful attention to modeling of phonological details such as cross-word coarticulatory effects and speaker-specific phonological adaptation. Based on experiments with the DRMD, comparisons to (1988) SPHINX and BYBLOS results in the 1989 Weintraub paper indicate improved performance as a consequence of phonological modeling.

ATR HMM-LR System. Researchers at ATR Interpreting Telephony Research Laboratories (ATR) in Kyoto have developed the HMM-LR system, which is based on direct parsing of HMM phone models without any intermediate structures such as phoneme or word models (Kita et al., 1989; Hanazawa et al., 1990). We have already briefly described LR parsing at the end of Section 13.9. [LR parsing has more recently been incorporated into the SPHINX system (Kita and Ward, 1991).] In HMM-LR, phones are modeled by discrete-observation HMMs employing multiple codebooks including cepstral differences, energy, and an LP-based spectral measure described in (Sugiyama and Shikano, 1981). The training material is a 5456-isolated-word database developed by ATR.

In experiments reported in the 1991 paper by Kita and Ward, the task was to recognize short Japanese phrases (bunsetsu) uttered by three male speakers and one female speaker. A vocabulary of 1035 words and a grammar of estimated perplexity greater than 100 were used. A phrase recognition rate of 88.4% was achieved with the correct phrase appearing in the top five choices 99% of the time. The results were shown to have benefited from duration models and a speaker adaptation routine.

CSELT. As part of the European joint research effort ESPRIT, researchers at the Centro Studi e Laboratori Telecomunicazioni (CSELT) and the Università di Salerno have developed a 1000-word continuous-speech recognizer that employs a unique word hypothesizer (Fissore et al., 1989). Using a kind of "N-best" approach, the CSELT system selects words on the basis of coarse phonetic description, and then refines the hypotheses using more detailed matching. Acoustic decoding is based on HMM phone models and mel-cepstral coefficients. In a language of perplexity 25, experiments involving two speakers uttering 214 sentences produced a word accuracy of 94.5% and a correct sentence rate of 89.3%.

Philips Research Laboratory. A natural evolution of the work of Ney et al., which we have discussed extensively in this chapter, is a 10,000-word continuous-speech recognition system described in (Steinbiss et al., 1990; Ney et al., 1992).

Connected-Digit Recognition with Language Models. On a number of occasions we have mentioned the efforts of researchers at AT&T Bell Laboratories to solve the difficult problem of recognition of continuous strings of speaker-independent digits. These efforts have contributed greatly to the understanding of the basic properties of the HMM. In particular, many results on the application of continuous-observation HMMs were developed during the course of this research. We have already discussed the LB algorithm for both DTW- and HMM-based systems, which has also emerged from these efforts. For details of this work, the reader is referred to the paper by Rabiner et al. (1989) and its references. Also noteworthy is an implementation of a continuous-observation HMM-based LB algorithm with a finite-state grammar on a systolic processor developed at AT&T (Roe et al., 1989). This latter work represents a conversion of the problem from a "level synchronous" processing scheme, which we saw in the bottom-up parsing example, to one using a "frame-
s ync h ro nou s" stra tegy, as we us ed in th e t op-down ap proach . Some typi
800 Ch . 13 / Languag e Modeling 13 .10 / C o nc lusions 801
ca l result s illu st ra t ing th e effec ts of vari o us model pa rameters were given d ures we have discussed several ti mes previously are a primitive form of
in C hapte r 12 . this integration . As syste m s become more complex . it is like ly th a t t h is
Re sea rchers at Te xas In st ru m e nts (TI) Corpora tio n have also contnj.. bi di rec t io na l sea rch stra tegy will b ec o me more p re val e nt.
ut ed sign if ica n tly 10 th e conn ect ed -d igit recogn it io n p ro ble m . A st udy
described in (Dod dington . 19 89) u se s a tec h n iq ue k n own as "pho
n eti c discriminants" to m a xi m ize d isc r im inat io n in fo r mation among
continuous-observation HMM s re p resent ing t h e d igits. In the T1 re 13.10 Conclusions
search, the TUNBS digit d atabase was used as t he evaluat io n material ,
and a three-level FSA structure -sentence , word . phone-i s employed to We ha ve now co m ple ted o ur stu d y o f t he atte m pts TO autom at ically rec
model the language, excep t for th e d igit recogn iti on experiments in og n ize speech using sequen t ial computing mach ines . T he journe y has
which two levels we re used . Ta b le 13.6 shows re sults fo r four classes of been long and d eta ile d , a nd it is cl ea r th at a tremen d o u s a m o un t of ef
mod els. ex plained in the tabl e note. fo rt a nd inge nui ty has gone into fi ndin g so lu t io n s for various sub p ro b
le ms . It is eq ually cl ear t h a t we st ill ha ve much to le arn , a nd th at th e
drea m of a naturally conversa nt m ach ine re m ains a d ist ant goal. Th e re
Very Large Vocabulary Systems. Se veral research grou ps have wo rked on sults of the last seve ral decad es have been humbling, but have a lso p ro
t he problem of recognizin g very large vo cab ula rie s with the a id of lan vided much ho pe tha t o ne o r m ore so lut io n s w ill even tua lly be found.
guage model s. At INRS and Bell N orth ern in Ca na da . investigators have W hat ever the even tual so lu tio n . fro m ou r cur re nt va n tage p oint it se em s
wo rke d on a 75 ,OOO-word spe a ker-de pe nde n t system with several differ like ly that dynamic p rogram m ing. langua ge m od eling, a n d fa st er com p ut
ent language models. The be st performance was achieved with a trigram ing will be a part of it. M a ny expert s have al so argued that , in sp ite of
model with whi ch 90% recogniti on was o bt ai ne d . h e vast ly sign ifica nt pe r fo r m a nce impro vem ents bro ught about by lan
At IBM in Paris , exp erim ent s ha ve b een con d uc te d o n a 20 0 .00 0- wo rd guage models . language m od els alon e wi ll n ot ult im ate ly yie ld satis fac
vo ca b u la r y in which the e nt r y m ode is spea ker-depen de nt sylla b le-b y to r y performance. O n e em e rging tre nd is th e use of langu age models in
sy lla b le utterances (Me rialdo, 1987). Other exa m p le syste ms ha ve been bid irectional (both to p -do w n a nd bott o m -u p) h yp oth esis fo rmation . More
reported b y Kimu ra ( 199 0) a n d Me ise l et a l. ( 1991 ). wo rk on th e t ec h no logies at ac o usti c level, a n d m ore fu ndamen tally on
t he speec h prod uct ion m o d el itse lf. will be need ed .
St ill other researchers ha ve begu n to exp lore th e p o ssibilit y t h at a radi
VOYAGER. We have agreed t o rem ain " be low" the high er-level knowl
ca lly different co m put ing a rc hitectu re mi gh t hol d promi se . To t h is rel a
e dge sources. such a s se ma nt ics . in ou r st udy . The inte rested reader may
tive ly infant speech t ec h no logy, the a rt ificial neu ral n etwork, we briefly
wi sh 10 explore th e papers o n th e VOYiGER syst em de veloped at MIT as
a n example of a syste m with su ch n atural language components (Zuc et t u rn in t he ne xt cha pter.
a l., 1990). An int er esting fe ature of thi s system is the integration of both
top-d own and bottom-up sea rc h ing. Tn a se nse . the X -best search procc-
TABLE 13.6. Texas Instruments' Independent Digit Recognition Study. After Picone (1990).

                           Sentence              Word
Observation Model          Error Rate (%)        Error Rate (%)
Pooled covariance          3.5                   1.3
Diagonal covariance        3.1                   1.2
Full covariance            2.1                   0.8
Confusion discriminants    1.5                   0.5

Note: The four model classes are as follows: (1) The pdf's of the states of the model share a common ("pooled") covariance matrix. (2) Each state density is modeled with a diagonal covariance matrix (features assumed uncorrelated). (3) A full covariance matrix is present at each state. (4) "Phonetic discriminants" are employed. The percentage represents string error rate. No grammar is employed.

13.11 Problems

13.1. (a) A formal grammar G_1 has the terminal set V_T = {a, b} and corresponding language

L(G_1) = { a^i b^j | i, j = 1, 2, ... }.  (13.111)

Give a possible set of nonterminals and production rules for G_1, in the process showing that G_1 is a finite state grammar.

(b) Suppose that a second grammar, G_2, with the same terminal set, V_T, has corresponding language

L(G_2) = { a^i b^i | i = 1, 2, ... }.  (13.112)

Repeat part (a) for this grammar, in the process arguing that G_2 cannot be a finite state grammar.
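The finite-state claim in part (a) can be checked mechanically. The following is a minimal sketch (not from the text; the state names are illustrative) of a three-state automaton accepting L(G_1) = { a^i b^j | i, j >= 1 }:

```python
# Three-state finite automaton for L(G1) = { a^i b^j : i, j >= 1 }.
# State "B" is the only accepting state.

TRANSITIONS = {
    ("S", "a"): "A",   # at least one a
    ("A", "a"): "A",   # further a's
    ("A", "b"): "B",   # at least one b
    ("B", "b"): "B",   # further b's
}

def accepts(string):
    """Return True iff string is in L(G1)."""
    state = "S"
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:       # no transition: reject
            return False
    return state == "B"
```

A language accepted by such a finite automaton is generated by a finite state (regular) grammar, while no machine with finitely many states can count the matched a's and b's that L(G_2) in part (b) requires.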
13.2. (a) Suppose that the word sequence of a language obeys the trigram (second-order Markov) assumption

P(w(l) | w(l-1), w(l-2), w(l-3)) = P(w(l) | w(l-1), w(l-2)).  (13.113)

Show that the entropy of the language is

H(w) = -lim_{L→∞} (1/L) Σ_{l=1}^{L} log P(w(l) | w(l-1), w(l-2)).  (13.114)

[Hint: Use (13.113).]

(b) Explain how to estimate the perplexity of the language of part (a) experimentally.
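Part (b) is commonly answered by computing an empirical perplexity: accumulate trigram log probabilities over a long sample and exponentiate the negative per-word average. The sketch below is illustrative only (the function names are hypothetical, and the unsmoothed relative-frequency estimator is just one possible choice):

```python
import math
from collections import defaultdict

def train_trigram(corpus):
    """Relative-frequency trigram estimates P(w | u, v) from a word list.
    (A practical estimator would smooth unseen trigrams, which here
    would receive probability zero.)"""
    tri, bi = defaultdict(int), defaultdict(int)
    for u, v, w in zip(corpus, corpus[1:], corpus[2:]):
        tri[(u, v, w)] += 1
        bi[(u, v)] += 1
    return lambda u, v, w: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

def perplexity(model, sample):
    """PP = exp( -(1/L) * sum_l log P(w(l) | w(l-1), w(l-2)) )."""
    logsum = count = 0
    for u, v, w in zip(sample, sample[1:], sample[2:]):
        logsum += math.log(model(u, v, w))
        count += 1
    return math.exp(-logsum / count)
```

For a sample the trigram model predicts perfectly, the estimate approaches its minimum value of 1; for a vocabulary of W equiprobable, independent words it approaches W.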
13.3. In a general way, describe a bottom-up recognition system in which the AD hypothesizes words from the left and receives guidance from the LD as to whether continuation of a string is advisable. In particular, describe an LB algorithm in which a linguistic cost is integrated with the acoustic cost on a per-level basis in order to find a single sentence hypothesis consistent with both bodies of information. Notice that the LD must remain relatively simple, leaving the recordkeeping function to the AD, if the processing is to remain bottom-up.

13.4. Figure 13.11(a) represents a small linguistic-level FSA that interacts with the acoustic-level word-unit HMMs. The search for an optimal path through the utterance proceeds in accordance with the method described in Section 13.5. In Fig. 13.11(b), the two levels have been compiled into a single large FSA network. Carefully describe a Viterbi-like algorithm for searching the compiled FSA with respect to the observation string y, which will produce the same optimal sentence hypothesis as that produced by the two-level processor. In particular, make clear the means by which your algorithm traverses the boundaries between models in the compiled FSA.

[Figure 13.11: (a) a linguistic decoder (FSA) requesting word likelihoods from an acoustic decoder built on word-level HMMs; (b) the two levels compiled into a single FSA.]
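A hedged sketch of the kind of search problem 13.4 asks about is given below. The compiled FSA is represented simply as a list of arcs, each carrying a word label and a language log probability, and acoustic_score stands in for the word-model (HMM) log likelihoods; these names and the flat organization are illustrative assumptions, not the algorithm of Section 13.5:

```python
import math

def viterbi_fsa(arcs, start, finals, T, acoustic_score):
    """arcs: list of (src, dst, word, lang_logprob) transitions of the
    compiled FSA.  acoustic_score(word, t0, t1) is assumed to return the
    log likelihood that observations t0+1..t1 were produced by the HMM
    for `word`.  Returns the best log score over final states once all
    T observations are consumed."""
    best = {(start, 0): 0.0}   # (FSA state, frames consumed) -> log score
    for t0 in range(T):
        for src, dst, word, lp in arcs:
            if (src, t0) not in best:
                continue
            # Crossing a model boundary: the word's HMM must account for
            # observations t0+1..t1 before the path enters state dst.
            for t1 in range(t0 + 1, T + 1):
                score = best[(src, t0)] + lp + acoustic_score(word, t0, t1)
                if score > best.get((dst, t1), -math.inf):
                    best[(dst, t1)] = score
    return max((best.get((f, T), -math.inf) for f in finals), default=-math.inf)
```

The boundary traversal the problem asks about happens in the inner loop: a path leaves state src only after a complete word model has absorbed a block of observations, so word-internal bookkeeping never leaks into the FSA search.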
13.6. Give a formal argument that the top-down digit recognizer in Section 13.5 ultimately finds the digit string w*, where
13.7. The stack decoding algorithm described in Section 13.6.1 can be greatly simplified if the word boundaries in the observation string (and, therefore, the number of words) are known. For simplicity assume that each utterance is K words long. Suppose t_k is the last observation corresponding to w(k), for k = 1, 2, ..., K (t_K = T). In fact, the concern for fairness in comparing paths of different word lengths can be circumvented by providing K stacks and only entering paths of length k in the kth stack.

(a) Without concern for practical algorithm details, generally describe the operation of such a multiple-stack decoding algorithm with a best-first procedure in this known-boundary case. Begin by defining a greatly simplified version of the likelihood A(w_1^k) found in (13.96).

(b) If memory allocation for the complete set of stacks is limited, so that the kth stack may only hold, say, L_k paths, what is the appropriate relationship among the lengths L_1, L_2, ..., L_K? Why?

(c) Generally speaking, how does the procedure devised in parts (a) and (b) differ from a "Viterbi" approach to the same problem? Include a comparison of the hard and soft pruning aspects of the two solutions.

13.8. (a) Argue that any N-gram model of a language can be posed as a language governed by a stochastic finite state grammar, G. The N-gram relationship is among the words, and the words are considered the terminal elements in G.

(b) If there are W words in the vocabulary of part (a), show that W terminals, W^N nonterminals, and W^(N+1) productions are required in G. (Hint: A partial state diagram might be useful in your argument.)

The Artificial Neural Network

Reading Notes: This chapter requires no special topics from Chapter 1.

14.1 Introduction

In this final chapter we treat an emerging computing technology and its application to speech recognition. The application of artificial neural networks (ANNs) to speech recognition is the youngest and least well understood of the recognition technologies. Somewhat unlike other technologies and theories we have discussed for speech recognition (DTW, HMM, language modeling), the speech recognition problem has been an important application of ANN technologies but has not been a principal driver of the development of the ANN field. In fact, the application of ANN strategies to speech recognition occupies a small corner in a vast field of theories and applications centering on these computing networks. This research endeavor has not yet matured to the point at which a general technical framework exists, although many of the well-known paradigms have been carefully formalized independently. The field currently consists of numerous and varied architectures and techniques tied together by a common computing "philosophy" that is radically different from that underlying the von Neumann computer. To attempt to describe the totality of this field would take us well beyond the scope of this chapter, and well beyond what is needed for an initial understanding of ANN applications to speech recognition. We will mainly restrict ourselves to the study of a single ANN (called variously a "multilayer perceptron" or "feedforward neural net," among other names) and enhancements that have been at the center of most ANN applications to automatic speech recognition. Further, we will largely restrict our discussion to technologies that build on familiar ideas from previous chapters.

The ANN research field has a rich, often controversial, history. The work has been highly interdisciplinary, or perhaps more accurately cross-disciplinary, having received attention from physiologists, psychologists, linguists, physicists, mathematicians, computer scientists, and engineers. Research in this field has arisen out of interest in neurology, cognition, perception, vision, speech, linear and nonlinear systems theory, algorithms, and VLSI architectures, among other topics. To a greater or lesser extent, depending on the research aim, attempts have been made to relate ANNs to the operation of the human nervous system and its basic "computing" elements (more on this later). Many of these varied
and fascinating aspects of the subject will necessarily be omitted from our discussion here. For a clear and comprehensive treatment of these issues and an extensive bibliography, the reader is encouraged to see the example textbooks and resources in Appendix 1.G. In particular, a comprehensive but concise historical survey of the field is given in the appendix of the Simpson (1990) monograph. Of course, many references cited in the chapter will point to information for further study.

The ANN is based on the notion that complex "computing" operations can be implemented by the massive integration of individual computing units, each of which performs an elementary computation. "Memories" are stored, computations performed, and relations formed through patterns of activity of these simple units, rather than through sequences of logical operations used in conventional von Neumann machines. Motivation for such a computing structure is derived from the human central nervous system (CNS). The CNS consists of an estimated 10^11 to 10^14 nerve cells, or neurons, each of which typically interacts with 10^3 to 10^4 other neurons as inputs. A simplified anatomical model of a neuron is shown in Fig. 14.1.¹ The cell body, or soma, contains the nucleus of the neuron and exhibits protrusions (dendrites), which receive inputs from the axons of other neurons. The axon is the "transmission line," which transmits an electrical pulse (action potential) to neurons farther down the network. Axons in the human body range in diameter from 0.5 μm to 20 μm, and in length from about a millimeter to over a meter. The velocity of propagation of a pulse depends on the diameter of the axon and other properties of the cell composition, and can range from 0.5 m/sec (small diameters) to 100 m/sec. Whether a receiving neuron will "fire" (send an action potential down its own axon) depends on the inputs received on its dendrites from other neurons. Some of these inputs may be excitatory (tending to promote firing, "positive"), while others may be inhibitory ("negative"). In most cases the mechanism by which information is transmitted from axon to dendrite (across the gap, or synapse, of about 200 Angstroms) is actually chemical. Neurotransmitters, released by excitatory axons into the synapse, change the electrical properties of the receiving neuron's membrane and promote the transmission of an action potential. Inhibitory axons release substances that hyperpolarize the cell in a manner which prevents action potential formation. Small nerve cells can be stimulated at a rate of about 250 Hz, while large fibers can carry 10 times that many pulses per second.

[Figure: labels include "Axon from another cell," "Synapse (gap)," "Axon," and "Output region of axon."]
FIGURE 14.1. Simplified anatomical model of the biological neuron.

¹The following discussion is grossly simplified. For a complete description of the human nervous system, see, for example, (Aidley, 1971; Guyton, 1979; Sherwood, 1989).

In a simple model, the response of the receiving neuron is "all or none." That is, the neuron either receives sufficient excitatory input to produce an action potential or it does not. The amplitude of the resulting pulse is not affected linearly by the magnitude of the accumulated inputs. In this sense, the neuron is a nonlinear system whose output represents a nonlinear transformation of its inputs. Note that if the CNS is viewed as a "computer," its only resource for forming computations and relations, or invoking memories or generalizations, resides in the firing patterns of its manifold and richly interconnected simple cells.

Lest we form unrealistic beliefs about our own understanding of the human brain, we must cautiously point out that the portrayal of the neuron given above is a grossly simplified picture of a very complex and varied cell. Physiologists have a significant understanding of the anatomy and chemical and electrical properties of the nerve cell (well beyond the simple description given here), but the operation of the totality of the human brain remains fundamentally mysterious. Whether the concept of the brain as a "computer" that computes with electrical firing patterns has any validity is purely speculative. How such firing patterns could be related to higher cognitive functions such as reasoning, planning, or decision making is unknown. If our model of the brain as a massively powerful computer has any validity, ANNs might already be providing primitive clues to these unknowns. It is critically important to keep in mind, however, that the ANN is motivated by a highly speculative notion of how the brain might function. It is composed of "cells" that are crude, greatly simplified models of the biological neuron. Demonstrations by ANNs of "vision," "speech recognition," or other forms of pattern recognition, or creation of artificial cognitive function by these networks, may ultimately assist in a clearer understanding of brain function. On the other hand, the operation of ANNs might ultimately have almost nothing to do with how the brain functions. Only the future will answer these questions. In the meantime we must be wary of exaggerated claims about the relations of ANNs to the human CNS.
14.2 The Artificial Neuron

In this section we introduce the basic processing unit of the ANN and define some useful vocabulary.

The processing elements of ANNs (also called cells, neurons, nodes, or threshold logic units) are in essence models of the simplified version of the biological neuron given above. A single artificial neuron is shown in Fig. 14.2. Like the simple model of the biological neuron, the artificial neuron has an input region that receives signals from other cells, a "cell body" that integrates incoming signals and determines the output according to a thresholding function, and an output region that carries the cell's response to future cells in the network. Weights are included on each connection to future cells in order to model differing effects of one cell upon several others.

[Figure: "dendrites" receive incoming signals from other cells; the "cell body" consists of an integrator of incoming signals and a thresholding function; the "axon" carries signals to other neurons, with weights on the output region of the axon included to model differing effects on future neurons.]
FIGURE 14.2. (a) A single artificial neuron. (b) The axon in the model is superfluous and is frequently omitted. We leave a small axon, which will later be labeled with the cell's output. Also, inputs are usually shown entering a single point (dendrite) on the node.

Whereas the thresholding function of the biological neuron was said to be "all or none," artificial neuron models include a variety of thresholding functions, some of which are illustrated in Fig. 14.3. In general, we will say that S(·) is a thresholding function if

1. S(u) is a monotonically nondecreasing function of u.
2. lim_{u→+∞} S(u) = c_1 and lim_{u→-∞} S(u) = c_2, with |c_1|, |c_2| < ∞.

[Figure: four typical thresholding function shapes, including step, linear, ramp, and sigmoid.]
FIGURE 14.3. Typical thresholding functions used in artificial neurons. The sigmoid follows the relation S(u) = (1 + e^{-u})^{-1}. Threshold functions are also used in which the lower saturation is -1. A sigmoidlike function with this property is the hyperbolic tangent S(u) = tanh(u).

Now let us further formalize the artificial neuron. We label some cell, say node k, by n_k. Suppose that there are N inputs arriving at n_k, as shown in Fig. 14.4. Frequently, these inputs will be the weighted outputs of predecessor cells, as shown in the figure. For simplicity, let us assume that the N predecessor cells to n_k are n_i, i = 1, 2, ..., N, and that k > N. y_j denotes the output of n_j for any j (frequently called the activation of n_j), and w_ik is the weight on the connection to n_k from n_i. Typically, and not surprisingly, the integration region of the cell body formally sums the incoming inputs, so that (ignoring the auxiliary input momentarily)

y_k = S( Σ_{i=1}^{N} w_ik y_i ) ≝ S(w_k^T y').  (14.1)
[Figure: the cell n_k receiving the weighted outputs of previous cells.]
FIGURE 14.4. Formal labeling of the artificial neuron.

If, for example, S(·) is the step threshold function shown in Fig. 14.3, then the output of n_k is as shown in Fig. 14.5. Note that the vector y' represents the collection of inputs from predecessor cells, and w_k the vector of weights leading to n_k. Note also the definition

u_k ≝ w_k^T y'.  (14.2)

This is the scalar quantity that gets passed to the nonlinear thresholding function "inside" the cell. A cell that integrates the incoming weighted inputs by addition, then subjects the result to a nonlinearity (especially the step function) in this fashion, is often called a perceptron.

[Figure: y_k = S(u_k), with u_k = w_k^T y'.]
FIGURE 14.5. Computing the output of n_k with inputs y', input weights w_k, summing integration, and a step threshold function. Note that u_k indicates the integrated value that is subjected to the nonlinear thresholding.

Sometimes an auxiliary input is included as an input to the perceptron to serve as a thresholding bias.² For example, if we include the input θ_k, the activation of n_k becomes

y_k = S(w_k^T y' + θ_k).  (14.3)

²One instance in which we will need to be careful with this issue is in the drawing of "decision regions" in the feature space, since the bias input is not a real feature. We will have more to say about this issue later.

A rather different type of cell determines its activation from the distance between the input vector y' and the connection weights w_k:

y_k = S( Σ_i (w_ik - y'_i)^2 ) = S(||w_k - y'||^2).  (14.5)

Such cells are said to implement radial basis functions (Broomhead and Lowe, 1988; Bridle, 1988; Moody, 1990) in at least two important practical cases. In the first case we introduce a thresholding bias in the following manner:

y_k = S(||w_k - y'||^2 - θ_k).  (14.6)

Now we let S(·) be a "reversed step" threshold function illustrated in Fig. 14.7. Under these circumstances, the cell is activated (produces unity output) if the distance between the input and the connection "weights" is less than θ_k. In other words, the cell is activated if its input vector is found inside a ball of radius θ_k in N-dimensional space. This idea is illustrated for the case N = 2 in Fig. 14.8.
A second case of radial basis function implementation involves no biasing [i.e., output of form (14.5)] and employs a Gaussian nonlinearity

S(u) = e^{-u}.  (14.7)

In this case, the activation is constant on any ball centered on w_k. The illustration for N = 2 is found in Fig. 14.9. Clearly, the activation level for a Gaussian radial basis function is continuous-valued. However, we might choose to declare the neuron inactive if

y_k < ρ_k,  (14.8)

for some ρ_k (ρ_k must be less than unity). In this case, it is easily shown that the neuron is active for inputs y' such that

||w_k - y'|| < sqrt( log(1/ρ_k) ) ≝ θ_k.  (14.9)

In the N-dimensional case, this means that the unit will be activated only if the input resides inside a ball of radius θ_k.

[Figure: a two-input cell n_k with input y' = [y'_1 y'_2]^T and weights w_k = [w_1k w_2k]^T; the cell is activated if y' lies in the shaded region, a ball of radius θ_k about w_k.]
FIGURE 14.8. A radial basis function cell n_k computing a Euclidean distance between its weights w_k and its inputs y' and passing that result through a biased reversed step nonlinearity will be activated for inputs within a radius θ_k of its weight vector. The case of two inputs to n_k is illustrated here.

FIGURE 14.9. A radial basis function cell n_k computing a Euclidean distance with no bias between its weights w_k and its inputs y' and passing that result through a Gaussian nonlinearity will output constant activation levels for inputs on concentric balls about its weight vector. The case of two inputs to n_k is illustrated here.

In either radial basis function implementation discussed above, there is an effective (or exact) ball of influence of radius θ_k inside of which the input will activate the neuron. For this reason, radial basis function neurons are sometimes called radius-limited perceptrons.

14.3 Network Principles and Paradigms

14.3.1 Introduction

Having introduced the basic principles of the artificial neuron, we can now begin to connect these units into networks of cells comprising computing machines. The resulting ANNs are so named because their topologies consist of "axon-to-dendrite" linkages of the individual cells, reminiscent in a primitive way of the patterns of biological neurons. The computing power of the resulting network derives from the complex interaction of many simple nonlinear elements that perform their operations in parallel. This form of computation stands in stark contrast to the sequential operation of the von Neumann computing machine.

Artificial neural networks have several advantages relative to sequential machines. First, the ability to adapt is at the very center of ANN operations. Adaptation takes the form of adjusting the connection weights in order to achieve desired mappings. Furthermore, ANNs can continue to adapt and learn (sometimes on-line), which is extremely useful in
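The Gaussian case of (14.7) through (14.9) can be sketched the same way (again an illustrative fragment, not from the text): the continuous activation e^{-||w_k - y'||^2}, together with the activity test (14.8), which is equivalent to the radius test (14.9):

```python
import math

def rbf_activation(weights, inputs):
    """Gaussian activation e^{-||w - y'||^2}: S(u) = e^{-u} applied to
    the squared distance of (14.5)."""
    d2 = sum((w - y) ** 2 for w, y in zip(weights, inputs))
    return math.exp(-d2)

def is_active(weights, inputs, rho):
    """Test (14.8): active iff activation >= rho, 0 < rho < 1.
    Equivalent to ||w - y'|| < theta_k = sqrt(log(1/rho)), as in (14.9)."""
    return rbf_activation(weights, inputs) >= rho
```

The activation is unity when the input coincides with the weight vector and decays on concentric balls about it, which is why such a cell has a well-defined ball of influence.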
ces sing and reco~nIzlng speech . Adapta tion (learn ing) algorithms ":On
unuc to be a major focus of research in the ANN field. Seco nd . AN:'-ls
tend to be mo re rob~st or fau lt-tolerant than von Ne um ann machines he
cause t~c ~ e two rk I S com pose? o f ma ny intercon nect ing neuro ns, all
com put ing III pa rallel. and the failure of a few p rocessin g uni ts can often
E! 3
b~ co~ pe n sa t~d ~~r by r~d u n da ncy in the . network . S.im ilarly. A:'\ Ns can
olten generalize from incomplete or nOISy data. FInally. Ai\'N s. when
used as classifie rs, d o not req uire stro ng statis tica l cha racteriza tion or
para meterization of data.
Alt hough ANN s ca n perfo r m ma ny co mp ut ing functions. they are
o ften used in speech p rocessing to imple ment patt ern recog nit io n- to as
'-Iii • 9
F. i~ b t t:.\ Clllpbr 1'3 IlL' rn"
~~~3
war d , in wh ich a n ou tp u t pa tte rn res ults that ide nt ifies the class
membershi p of the in p ut patt er n. Th e sec ond is a vector qua ntiza tion
(VQ ) func t io n in whi ch vecto r input patt erns are quantized into a class
index by th e netwo rk. In som e sense the se two fun cti on s appear to be
a bout the same task. In AN N d isc ussi ons , however, the VQ terminology
is usually reserved for a part icu lar t yp e of ANN a rchi tect ure that is
tr a ined quite differe nt ly (more in keeping wit h the basic no t at ion of VQ
as we know it from past study) than networks for more genera l types of
pattern associators. A third subtype of classifier is the so-called content-addressable memory or associative memory network. This type of network is used to produce a "memorized" pattern or "class exemplar" as output in response to an input, which might be a noisy or incomplete pattern from a given class. An example of the operation of a content-addressable memory network is shown in Fig. 14.10, which is taken from the article by Lippmann (1987). For discussion of content-addressable memory ANNs, the reader is referred to (Kohonen, 1987). We will have more to say about the first two of these classifier subtypes later.

FIGURE 14.10. Illustration of a content-addressable memory ANN (Lippmann, 1987). Behavior of a Hopfield network [see, e.g., (Simpson, 1990)] when used as a content-addressable memory. A 120-node ANN was trained using the eight exemplar patterns shown in (a). The pattern for the digit 3 was corrupted by randomly reversing each bit with a probability of 0.25, then applied to the network at time zero. Outputs at time zero and after the first seven iterations are shown in (b).

In addition to pattern recognizers, a second general type of ANN is a feature extractor. The basic function of such an ANN is the reduction of large input vectors to small output vectors (features) that effectively indicate the classes represented by the input patterns. In essence, the feature extractor is charged with decreasing the dimension of the representation space by removing nonessential or redundant information. It is also sometimes the case that feature representations will appear as patterns of activation internal to the network rather than at the output [e.g., (Waibel et al., 1989; Elman and Zipser, 1987)].

The provision of a taxonomy of ANN architectures is difficult, since the number of possible interconnections among neurons is unlimited. However, as research and development matures, a few specific preeminent architectures are emerging [e.g., (Simpson, 1990)]. As stated earlier in the chapter, we will concentrate here primarily upon the type of architecture known as "multilayer perceptron." It is upon this architecture that much of the recent research in speech recognition has been based. A second quite different architecture, the "learning vector quantizer," has been used for VQ of speech and related tasks. We will also briefly describe this architecture in the following.

14.3.2 Layered Networks: Formalities and Definitions

Many ANN architectures can be conveniently viewed as "layers" of cells, as illustrated in Fig. 14.11. A layered structure is one that may be described as follows: A group of N_1 cells designated layer 1 receive as their inputs weighted versions of the external inputs of the network. There are N_0 external inputs, one of which may correspond to a bias. The remaining cells in the network (above layer 1) can be grouped into layers 2, 3, ..., L, such that cells in layer l receive as inputs weighted outputs of cells in layer l - 1. The outputs of the final layer, L, are the external outputs of the network. We use the term "weighted outputs" loosely to mean some combination of the labels on the connections (weights) with the outputs of the lower layer. This combination rule depends on the operation of the cell (see Section 14.2). A hidden layer is one containing cells whose outputs cannot be measured directly. According to our framework, each layer is hidden except layer L.

14.3 / Network Principles and Paradigms 817

FIGURE 14.11. [A layered ANN: layer l contains N_l nodes; individual weights connect the output vector of layer l - 1 to the cells of layer l; the outputs of the final layer L form the external output vector.]
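The content-addressable recall illustrated in Fig. 14.10 can be sketched with a toy Hopfield-style network. This is not the 120-node digit network of the figure: the two 16-bit exemplar patterns, the Hebbian outer-product training rule, and the synchronous sign-update recall used below are illustrative assumptions.

```python
# Toy sketch of a Hopfield-style content-addressable memory (cf. Fig. 14.10).
# Two bipolar (+1/-1) exemplars are stored; a corrupted version of one is
# applied, and the network settles back to the memorized pattern.

def train_hopfield(patterns):
    """Hebbian outer-product weights with zero diagonal (assumed rule)."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += x[i] * x[j]
    return W

def recall(W, v, iterations=5):
    """Synchronous sign updates; returns the settled pattern."""
    n = len(v)
    for _ in range(iterations):
        v = [1 if sum(W[i][j] * v[j] for j in range(n)) >= 0 else -1
             for i in range(n)]
    return v

A = [1] * 8 + [-1] * 8      # two mutually orthogonal exemplars
B = [1, -1] * 8
W = train_hopfield([A, B])

noisy = A[:]                 # corrupt three bits of exemplar A
for i in (0, 1, 2):
    noisy[i] = -noisy[i]
print(recall(W, noisy) == A)  # prints True: the memorized pattern is restored
```

With well-separated exemplars like these, a single update already restores the stored pattern; heavily loaded networks behave less gracefully.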
To avoid confusion, note that some authors consider the set of inputs to comprise a layer, and might even draw "cells" at the bottom of the network to receive these inputs. These cells are strictly formal, performing an identity mapping on every input x_i, and therefore having no practical significance. We will avoid cells at the bottom of the network. Nevertheless, in reading the literature and comparing some development with our discussion here, the reader should note whether the inputs are counted as a layer in the other work.

A feedforward, or nonrecurrent, ANN is one for which no cell has a connection path leading from its output back to its input. If such a path can be found, there is feedback in the ANN and the architecture is called recurrent. Layering as we have described it above depends not only on nonrecurrence, but also upon sequential connection. By this we mean that cells in layer l must be connected to cells in layer l + m, where m must not only be positive (no feedback) but must be exactly unity (sequential). A layered ANN can sustain a few feedback or nonsequential connections without losing its basic layered pattern, but too many such connections erode the fundamental network structure. Such excessive deviations from the layered structure will not arise in our discussion.

A multilayer perceptron (MLP), or feedforward ANN, is a nonrecurrent layered network in which each of the cells is governed by an activation rule of the form (14.1). A thresholding bias may also be included in each cell of the MLP, and we will assume that it is accounted for as in (14.4) and not show it explicitly. Whereas the perceptron (single cell), as we have defined it above, involves a step nonlinearity, the term "MLP" often ...
One of the most interesting and useful aspects of ANNs is their ability to "learn" to implement certain computations. Learning refers to the process of weight adjustment to achieve the desired aim. The MLP and LVQ involve two distinctly different forms of learning paradigms. In supervised learning, which is used to train the MLP, a series of training pairs (input vectors and desired output vectors, or targets) are "shown" to the network and the weights adjusted according to some algorithm. The objective is to reproduce the entire population of target outputs as closely as possible in some sense. Many iterations through the training pairs might be necessary for the learning algorithm to converge on some set of weights (if convergence is possible). On the other hand, some ANN architectures like the LVQ are organized to be trained by unsupervised learning. In this case, the network automatically adjusts its own weights so that (training) inputs that are similar in some sense produce similar (or identical) outputs. In effect, the resulting network may be used to classify input data according to the outputs they produce. The training and use of such a network are reminiscent of a clustering procedure used in statistical pattern recognition. Indeed, the one example of a self-organizing ANN that we will consider, the LVQ, is used to achieve VQ. (We will also find that LVQs can also be trained in a supervised mode.)

14.3.3 The Multilayer Perceptron

Some History and Rationale Behind the MLP

In 1957, Frank Rosenblatt, working at Cornell University, created one of the first ANNs with the ability to learn. Rosenblatt developed his network by building on the earlier layered logic concept of McCulloch and Pitts (1943). Rosenblatt first worked with the single-cell perceptron (which he so named). In the context of Rosenblatt's early work, the term "perceptron" specifically involves the step threshold nonlinearity. He developed a supervised learning procedure guaranteed to converge to weights that would accurately classify two-class data under a certain condition. For discussion purposes, we have drawn and labeled a perceptron in Fig. 14.12. In anticipation of using more than one cell in the future, we label this cell n_k.

We seek to discover necessary conditions for the convergence of the perceptron weights. Recall the expression for the activation

    y_k = S(w_k^T x) = { 1,  w_k^T x > 0
                       { 0,  w_k^T x < 0.        (14.10)
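The activation (14.10), together with the bias convention of Fig. 14.12 (last input fixed at -1, with its weight holding the threshold θ_k), can be sketched as follows. The particular weights, and the resulting boundary x_1 + x_2 = 1.5, are hypothetical choices for illustration.

```python
# Sketch of the perceptron activation (14.10) with the bias convention of
# Fig. 14.12: the third input is fixed at -1 and its weight w_k3 holds the
# threshold theta_k, so the decision line need not pass through the origin.

def perceptron_output(w, x):
    """y_k = S(w^T x): step nonlinearity, firing (1) when w^T x > 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if activation > 0 else 0

# Hypothetical weights giving the decision line x1 + x2 = 1.5:
# w_k1 = w_k2 = 1.0 and w_k3 = theta_k = 1.5, with x3 = -1.
w_k = [1.0, 1.0, 1.5]
print(perceptron_output(w_k, [1.0, 1.0, -1.0]))  # above the line: prints 1
print(perceptron_output(w_k, [0.0, 0.0, -1.0]))  # below the line: prints 0
```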
Since a significant part of our study has been devoted to the very important HMM, it is worth making a few comparisons in the basic technology. ANNs (in particular MLPs) and HMMs are fundamentally similar in that both have the ability to learn from training data. The process by which HMMs are trained may likewise be considered a form of supervised learning. What is learned in each case can be quite dissimilar, both in content and in philosophy, even if both models are being applied to the same problem. The HMM learns the statistical nature of observation sequences presented to it, while the ANN may learn any number of things, such as the classes (e.g., words) to which such sequences are assigned. Although influenced by the statistical nature of the observations, the internal structure of the ANN that is learned is not statistical.(7) In its basic form, an ANN requires a fixed-length input, whereas we have discussed the convenient time normalization property of HMMs in Chapter 12. To a greater or lesser degree depending on the design, both systems can be robust to noise, to missing data in the observations, to missing exemplars in the training, and so on. Therefore, although different in philosophy, HMMs and ANNs do have important similarities. However, there is a fundamental difference in the two technologies. Even ...

(7) See the state-space interpretation of the HMM in Section 12.2.2.

FIGURE 14.12. (a) The perceptron. (b) Decision hyperplane (line) for the case N_0 = 3; the third dimension is used for biasing: x_3 = -1, w_k3 = θ_k (see text). [In (b), the intersection of the plane w_k^T x = 0 with the x_1-x_2 plane is the line w_k1 x_1 + w_k2 x_2 = θ_k.]

820 Ch. 14 / The Artificial Neural Network
The boundary between the two "decisions" y_k = 0 and y_k = 1 in the input vector space is the hyperplane

    w_k^T x = 0.        (14.11)

If one of the input components, say x_{N_0}, actually corresponds to a bias (x_{N_0} = -1 and w_{k,N_0} = θ_k), then, since x_{N_0} takes a constant value -1, (14.11) represents an (N_0 - 1)-dimensional hyperplane in the "true" feature space given by

    Σ_{i=1}^{N_0 - 1} w_{ki} x_i - θ_k = 0.        (14.12)

The inclusion of a bias permits the construction of a decision boundary that is not constrained to pass through the origin (see Problem 14.1). (The bias can be learned like any other weight.) The case of a two-dimensional (both dimensions represent "true" features) space is illustrated in Fig. 14.12(b). The decision hyperplane (in this case, a line) can be placed in any orientation in R^2 by appropriate choice of weights, since a bias term (representing a third dimension) is included in the perceptron.
If the inputs represent two classes whose vectors are separable by a hyperplane (linearly separable), then a set of weights can be found to exactly distinguish them via the output of the perceptron. Rosenblatt's perceptron learning (PL) algorithm is proven to converge to this set of weights (Block, 1962; Nilsson, 1965). Examples of classes which can and cannot be exactly classified by a perceptron are shown in Fig. 14.13. We note that the bias term can be removed if it is sufficient for the decision boundary to pass through the origin in the feature space. It is unlikely, however, that this would be known a priori.

FIGURE 14.13. Regions labeled A and B represent regions in the feature space from which feature vectors representing classes A and B may be drawn. (a) The two classes are apparently separable by a line (in general, a hyperplane). (b) The classes are not linearly separable.

The Rosenblatt PL algorithm is shown in Fig. 14.14. It can be seen that the weights are adjusted (the perceptron learns) only when an error occurs between the actual output and the target (training) output of the network. The parameter η, which takes values η ∈ [0, 1], controls the rate of learning. A trade-off is encountered in the choice of η. This parameter must be chosen large enough to adapt quickly in the presence of errors, yet small enough to allow the weight estimates to stabilize when appropriate values have been reached.
To "teach" the perceptron (adjust its weights) we present the network with a series of training patterns, say {(x(p), τ_k(p)), p = 1, 2, ..., P}, where x(p) is the pth input and τ_k(p) the pth target. Let y_k(p) denote the actual output in response to x(p) and w_k(p) be the vector of weights following the presentation of training pair p. It is important to understand the meaning of the index p. There will always, of course, be a finite number, P, of training patterns. During training procedures, however, each pattern will be applied multiple times to the network, so that effectively many more than P training pairs will be employed. When an index p > P is encountered, it should be interpreted as follows:

    If p > P, then
        x(p) means [x(·)]_{modulo P} evaluated at "time" p,
        τ_k(p) means [τ_k(·)]_{modulo P} evaluated at p,        (14.13)

where [x(·)] represents the original P-length sequence of inputs, and similarly for [τ_k(·)].

FIGURE 14.14. Perceptron learning algorithm (two-class problem). [Recoverable steps: for each training pair, compute the output y_k(p) and the error e_k(p) = τ_k(p) - y_k(p); adjust the weights only when an error occurs, w_k(p) = w_k(p - 1) + η e_k(p) x(p); next p; terminate when the weights change negligibly according to some criterion. The cell is named, and related quantities show subscript k, in anticipation of the use of more than one perceptron.]

FIGURE 14.15. (a) A layer of three perceptrons can be used to classify three linearly separable classes, A, B, and C. (b) For the case N_0 = 3 with the third input (x_3 = -1) corresponding to a bias, the hyperplane (line) boundary formed by perceptron n_A separates class A from all others.

As an aside, we note that the Rosenblatt algorithm becomes exactly the well-known LMS algorithm of Widrow and Hoff (Widrow and Stearns, 1985) when the hard-limiting threshold is replaced by the simple linear mapping S(u) = u. The LMS algorithm was described in Chapter 8, and we will see it recur in our study of MLPs later in the chapter. For a sufficiently small η, the LMS algorithm converges asymptotically to the weights that minimize the total squared error between the target and actual outputs in the training population. A perceptron trained with LMS will create a boundary that meets this criterion whether or not it exactly separates the classes. When classes are not linearly separable, LMS avoids the problem of oscillating (nonconverging) weights that can occur with the PL algorithm.

... book Perceptrons was published in 1969 (Minsky and Papert, 1969). Exploiting the fact that the perceptron is only capable of linear separability, Minsky and Papert prove a number of theorems that enumerate apparent weaknesses of the perceptron. The exclusive OR (XOR) problem, in which classes are not separable by a straight line (see Fig. 14.16), was used to illustrate the perceptron's inadequacies. This problem is still frequently employed today as a test of the efficacy of an ANN architecture.

FIGURE 14.16. [The XOR problem: the two classes (T and F) are not separable by a single line in the x_1-x_2 plane.]

A single layer of perceptrons (as illustrated in Fig. 14.15) can be used to separate multiple classes as long as each class can be separated from all others by a hyperplane. The idea is simply to "assign" a perceptron to each class and to let that cell correspond to a hyperplane separating that class from all others. In Fig. 14.15, for instance, the cell labeled n_A corresponds to the hyperplane (line) labeled A in the weight space. When an element of class A appears in the data, this perceptron will output a "one," and a "zero" otherwise. The perceptrons for classes B and C operate similarly. As noted above, any perceptron corresponding to a hyperplane that does not pass through the origin must contain a bias weight ...
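The teaching procedure just described can be run end to end on a tiny example. The following is a minimal sketch of the PL algorithm of Fig. 14.14, under the assumptions that η = 1, the initial weights are zero, and the four augmented training patterns below (hypothetical, linearly separable) use x_3 = -1 for the bias.

```python
# Minimal run of Rosenblatt's PL algorithm (Fig. 14.14) on a hypothetical
# linearly separable two-class set. Inputs are augmented with x_3 = -1 so
# the bias (threshold) is learned like any other weight.

def step(u):
    return 1 if u > 0 else 0

def train_perceptron(samples, eta=1.0, max_epochs=100):
    w = [0.0, 0.0, 0.0]                     # initial weights (arbitrary)
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            e = target - y                  # weights change only on error
            if e != 0:
                errors += 1
                w = [wi + eta * e * xi for wi, xi in zip(w, x)]
        if errors == 0:                     # termination criterion
            break
    return w

samples = [([2.0, 2.0, -1.0], 1), ([3.0, 1.0, -1.0], 1),
           ([0.0, 0.0, -1.0], 0), ([-1.0, 1.0, -1.0], 0)]
w = train_perceptron(samples)
print(all(step(sum(wi * xi for wi, xi in zip(w, x))) == t
          for x, t in samples))            # prints True
```

Because the two classes are linearly separable, the convergence theorem guarantees that the error-driven updates stop after a finite number of mistakes.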
... or its learning algorithm. Rosenblatt as well as Minsky and Papert were aware of the benefits of using multiple layers in the perceptron architecture. In fact, a two-layer perceptron with a single output cell is capable of distinguishing between any two classes that fall in open or closed convex regions in the feature space. A three-layer perceptron can form arbitrarily complex decision regions if a sufficient number of cells is provided (Lippmann, 1987). These regions are illustrated in Fig. 14.17. The proof of this fact provides some insight into the number of nodes. However, Rosenblatt was unable to find a learning algorithm for a two-layer perceptron,(8) and Minsky and Papert express some doubts in the 1969 version of Perceptrons as to whether a learning algorithm would be achievable for the MLP.

In some historical accounts, the book Perceptrons is almost blamed for having dealt an unnecessarily devastating blow to ANN research. In others, the book is viewed as an accurate criticism of weaknesses of the perceptron, which actually advanced the field by its extensive analysis, and which was misinterpreted by the research community. Block (1970) wrote a review of Perceptrons in an attempt to clarify some of these mistaken impressions. Minsky and Papert authored a revised version of Perceptrons in the late 1980s (Minsky and Papert, 1988). Whichever view one takes, the 1969 edition of Perceptrons coincides with the end of the initial period of excitement about the ANN. It was not until the popularizing of the back-propagation algorithm by Rumelhart et al. (Rumelhart et al., 1986) that the excitement about ANNs began to recur.

Before discussing back-propagation and learning in MLPs, a few details should be elaborated upon. We have noted that a three-layer perceptron is capable of learning arbitrarily complex decision regions in the input feature space given the proper number of nodes in each layer. Of course, by generalization of our previous discussion, it should be intuitively clear that a three-layer network with C outputs could be found to distinguish among C classes comprising arbitrary regions in the input space. However, these results deal with specific problems of class assignment with binary-valued outputs. When the step nonlinearities in the network are replaced by smoother thresholding functions like the sigmoid, similar behaviors to those described above are observed if the resulting network is used as a classifier (maximum output indicates the class). Not unexpectedly, however, these networks result in more complex decision regions and are more difficult to analyze. Nevertheless, the smooth nonlinearities make a network amenable to training by contemporary learning algorithms and are in this sense preferable. Also, it should not be construed that the performance of networks will be inferior because of the more complex decision regions. On the contrary, it might be greatly improved (and intuitively one might expect this). However, it is more difficult to predict and analyze performance with the more complex networks.

FIGURE 14.17. [Decision regions formed by step-nonlinearity networks, illustrated on the XOR problem and on meshed classes: a single-cell perceptron forms a hyperplane boundary; a two-layer MLP with a single output forms convex open or closed polytopes; a three-layer MLP with a single output forms regions of arbitrary complexity (depending on the number of cells).]

Beyond the classification problem, in general, we may wish to execute a general mapping R^{N_0} → R^{N_L} using an ANN, and question whether such a network is possible. In 1989 Hornik et al. proved their so-called Representation Theorems, which show that sigmoidal feedforward architectures can represent any mapping to any degree of accuracy given a sufficient number of hidden neurons (Hornik et al., 1989). In fact, the convergence of the mapping to the desired mapping is uniform over the input space with the number of cells. Although theoretically interesting and encouraging, the representation theorems offer no guide to the number ...
History. The publication of the two-volume Parallel Distributed Processing in 1986 (Rumelhart et al., 1986) helped to create a renaissance of ANN research, in particular by popularizing the back-propagation (BP) algorithm (or generalized delta-rule, as it is called in the book). In this same era, Sejnowski and Rosenberg (1986) used the BP algorithm to successfully train a text-to-speech synthesizer dubbed "NETtalk." The NETtalk experiments were made more dramatic by audio renditions of the synthesized speech at various stages of learning, which were played at conference sessions and for the print and television media. A long-awaited breakthrough in ANN research, the ability to effectively train MLPs, had apparently arrived. [It was not long, however, before similar developments were uncovered in the earlier literature (Werbos, 1974; Parker, 1982). White (1989) has also shown that BP is a special case of stochastic approximation, which has been researched since the 1950s (Tsypkin, 1973). The relative dormancy of the ANN field for 15 years had apparently kept this earlier work from having an impact.] The initial excitement about BP, however, was responsible for many misunderstandings and exaggerated claims about its properties and capabilities. At first many investigators interpreted BP as a means of avoiding the weaknesses of gradient descent (which Minsky and Papert had criticized in Perceptrons). It is now appreciated that this is not the case. Further, the algorithm and the architecture were seemingly confused initially. It is now understood that BP cannot imbue an architecture with classification properties or other performance capabilities that are not theoretically predictable. For instance, a two-layer perceptron cannot exhibit decision regions other than those shown in Fig. 14.17 by virtue of BP (or any other) training.

Research and practical experience has led to clearer understanding of the BP algorithm's behavior and has moderated performance expectations. In particular, it is now appreciated that BP is a stochastic gradient descent algorithm subject to the same types of convergence problems as ...

    ξ(W) = (1/2) Σ_{p=1}^{P} [τ(p) - y^L(p, W)]^T [τ(p) - y^L(p, W)]        (14.15)

         = (1/2) Σ_{p=1}^{P} Σ_{k=1}^{N_L} [τ_k(p) - y_k^L(p, W)]².        (14.16)

We have included the factor 1/2 in this expression purely for mathematical convenience, which will become clear below. Note that we have also shown the explicit dependence of the outputs, y_k^L, on the choice of weights. In general, all inputs and outputs to the nodes in the MLP will depend on the weights in the layers below them. We will explicitly show the dependence of certain of these input and output quantities on weights of interest in the discussion to follow.

The objective of the BP algorithm is to find the weights, say W*, that minimize ξ(W). If there are, say, N_w weights in the MLP, then a plot of ξ over the N_w-dimensional hyperplane (each dimension representing one weight) is called an error surface. Since the MLP implements a nonlinear mapping, in general there will be multiple minima in the error surface,(6) and ideally we would like to find W* which corresponds to the global minimum. In practice, we must settle for locating weights corresponding to a local minimum, perhaps repeating the procedure several times to find a "good" local minimum. Some measures for analyzing the error surface to assist in this task are described in (Burrascano and Lucci, 1990).

There is no known way to simultaneously adjust all weights in an MLP in a single training step to find a minimum of ξ. In fact, BP attempts to find a minimum by tackling a much more modest task. Not only does BP consider only one weight at a time (holding all others constant), but it also considers only a single training pattern's error surface

(6) Recall the HMM training problem in Chapter 12.
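As a small numerical illustration of the criterion (14.16), hypothetical targets and network outputs for P = 3 patterns and N_L = 2 output nodes give:

```python
# Numerical illustration of the training criterion (14.16). The targets
# tau(p) and outputs y^L(p, W) below are hypothetical numbers, not from
# any network in the text.

targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # tau(p), p = 1..3
outputs = [[0.8, 0.1], [0.3, 0.6], [0.9, 0.7]]   # y^L(p, W)

# xi(W) = 1/2 * sum over p and k of [tau_k(p) - y_k^L(p, W)]^2
xi_W = 0.5 * sum((t - y) ** 2
                 for tp, yp in zip(targets, outputs)
                 for t, y in zip(tp, yp))
print(round(xi_W, 10))   # prints 0.2
```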
at a time.(8) By this we mean that in using BP, we (in principle) consider only the error surface, say ξ(p, W), due to a single training pair (τ(p), x(p)), and repeat the procedure independently for each p. [In fact, as in the Rosenblatt PL algorithm, each training pattern can be introduced many times, so that the rules for interpreting the index p shown in (14.13) are in effect here.] The issue of whether such a procedure should be expected to converge to a minimum of the "summed" error surface will be discussed below. Clearly, ξ(p, W) is given by

    ξ(p, W) = (1/2) Σ_{k=1}^{N_L} [τ_k(p) - y_k^L(p, W)]².        (14.17)

Suppose that we are currently working on weight w_kj^L, which has recently been adjusted to value w_kj^L(p - 1) by processing pattern p - 1. Back-propagation works by moving w_kj^L slightly away from w_kj^L(p - 1) in the direction that causes ξ(p, W) to decrease along the corresponding dimension. The adjusted value will naturally be called w_kj^L(p) in the ensuing discussion. To sense in which direction ξ(p, W) is decreasing along the w_kj^L dimension at the value w_kj^L(p - 1), we evaluate the partial derivative

    ∂ξ(p, W)/∂w_kj^L  evaluated at  w_kj^L = w_kj^L(p - 1).        (14.18)

If the gradient is positive at that point, then subtracting a small quantity from w_kj^L(p - 1) corresponds to moving downhill on the error surface, and vice versa. This means that w_kj^L should be adjusted according to the learning rule

    w_kj^L(p) = w_kj^L(p - 1) - η(p) ∂ξ(p, W)/∂w_kj^L |_{w_kj^L = w_kj^L(p - 1)},        (14.19)

where η(p) is a small learning constant that generally depends on p (more on this sequence below). For the obvious reason, BP is called a gradient descent algorithm.

Finding an expression for the derivative, especially for weights below the output layer L, is made tractable by considering a single weight at a time, as we shall appreciate momentarily. Let us begin with a weight, say w_kj^L, in the output layer to be adjusted in response to a training pattern at time p. All other weights are treated as constants held at whatever value they assume at time p - 1. Beginning with (14.17), it is easy to show that

    ∂ξ(p, W)/∂w_kj^L = -[τ_k(p) - y_k^L(p, w_kj^L)] S'[u_k^L(p, w_kj^L)] y_j^{L-1}(p),        (14.20)

where S'(a) denotes the derivative of the thresholding function evaluated at a,(9) and where all other notation is defined in Fig. 14.18. Let us denote the error at node n_k^L in response to training pattern p by

    e_k^L(p, w_kj^L) ≝ τ_k(p) - y_k^L(p, w_kj^L),        (14.21)

so that

    ∂ξ(p, W)/∂w_kj^L = -e_k^L(p, w_kj^L) S'[u_k^L(p, w_kj^L)] y_j^{L-1}(p).        (14.22)

Combining (14.22) with (14.19) provides the necessary mechanism for computing the updated weight w_kj^L(p).

As an aside, we note that if S is taken to be the step threshold function,(10) then the equation used to modify the weights here is identical to the rule used in Rosenblatt's PL algorithm. If it was not apparent before, we can now appreciate why Rosenblatt's method is called a gradient descent algorithm as well.

After applying the procedure described above to each weight leading to layer L, we move down to layer L - 1. Clearly, a similar equation to (14.22) could be formed with respect to these nodes if target values for them were somehow available. How does one find such target values? The answer becomes evident upon making a "brute force" attempt to compute the gradient of ξ(p, W) with respect to a weight below the output layer, say w_kj^{L-1}. Finding this derivative is likewise made tractable by focusing on this single weight, even though it is theoretically desirable to change the entire set of weights together. The relevant notation is illustrated in Fig. 14.18. The gradient is

    ∂ξ(p, W)/∂w_kj^{L-1} = ∂/∂w_kj^{L-1} { (1/2) Σ_{l=1}^{N_L} [τ_l(p) - y_l^L(p, w_kj^{L-1})]² }.        (14.23)

(8) This type of identification problem is often called stochastic approximation. The error surface ξ(p, W) for each p may be considered a realization of a random variable, say ξ(W), which is parameterized by the matrix W. Ideally, we would like to find the value of W that minimizes E{ξ(W)}, but we must be content to work with the realizations. A significant amount of research has been done on this general class of problems. Much of the foundation for the subject, especially as it applies to learning systems, is laid in the classic work of Tsypkin (1973). A rigorous application of stochastic learning theory to the study of the BP algorithm is found in the paper by Stankovic and Milosavljevic (1991).

(9) BP can only theoretically be used if S is differentiable.

(10) Differentiable everywhere except at the origin.
Recognizing that (see Fig. 14.18)

    y_l^L(p, w_kj^{L-1}) = S(u_l^L(p, w_kj^{L-1})) = S[w_lk^L y_k^{L-1}(p, w_kj^{L-1}) + O_1]
                         = S[w_lk^L S(u_k^{L-1}(p, w_kj^{L-1})) + O_1]
                         = S[w_lk^L S(w_kj^{L-1} y_j^{L-2}(p) + O_2) + O_1],        (14.24)

where O_1 and O_2 are "other terms" which are not dependent upon w_kj^{L-1}, we can apply the chain rule of differential calculus to obtain

    ∂ξ(p, W)/∂w_kj^{L-1} = -[ Σ_{l=1}^{N_L} e_l^L(p, w_kj^{L-1}) S'[u_l^L(p, w_kj^{L-1})] w_lk^L(p) ]
                             × S'[u_k^{L-1}(p, w_kj^{L-1})] y_j^{L-2}(p).        (14.25)

The term set off in brackets is called the back-propagated error at node n_k^{L-1} in response to pattern p, and we give it the notation

    e_k^{L-1}(p, w_kj^{L-1}) ≝ Σ_{l=1}^{N_L} e_l^L(p, w_kj^{L-1}) S'[u_l^L(p, w_kj^{L-1})] w_lk^L(p),        (14.26)

so that

    ∂ξ(p, W)/∂w_kj^{L-1} = -e_k^{L-1}(p, w_kj^{L-1}) S'[u_k^{L-1}(p, w_kj^{L-1})] y_j^{L-2}(p).        (14.27)

Comparing this expression with (14.22) reveals the reason for the name back-propagated error. In fact, if node n_k^{L-1} were an external node with error equivalent to e_k^{L-1}(p), then by the same means we arrived at the "outer layer" expression (14.22), we would have obtained (14.27). In this sense, we can think of node n_k^{L-1} as having a target value

    τ_k^{L-1}(p) = y_k^{L-1}(p, w_kj^{L-1}(p - 1)) + e_k^{L-1}(p, w_kj^{L-1}(p - 1)),        (14.28)

but we have no explicit need for this quantity in the development.

Having now found an expression for the gradient of ξ(p, W) with respect to w_kj^{L-1}, we now evaluate it at w_kj^{L-1} = w_kj^{L-1}(p - 1) and use (14.19) to adapt the weight. Note that the significance of the back-propagated error is that of a recordkeeping device. The process is one of computing the required gradient, which could, in principle, be done without recourse to this bookkeeping.

Upon moving down to lower hidden layers in the network, a similar procedure would reveal that, for weight w_kj^l,

    ∂ξ(p, W)/∂w_kj^l = -e_k^l(p, w_kj^l) S'[u_k^l(p, w_kj^l)] y_j^{l-1}(p),        (14.29)

where e_k^l(p, w_kj^l) is the back-propagated error at node n_k^l,

    e_k^l(p, w_kj^l) ≝ Σ_{r=1}^{N_{l+1}} e_r^{l+1}(p, w_kj^l) S'[u_r^{l+1}(p, w_kj^l)] w_rk^{l+1}(p).        (14.30)

FIGURE 14.18. Notation used in development of the BP algorithm: (a) Layer L; (b) layer L - 1. [The figure labels the integrated values u_k^l(p), the node outputs y_k^l(p) and y_j^{l-1}(p), and the weights w_kj^l.]

With only two exceptions, (14.29) and (14.30) are general expressions that can be used to compute the gradient of the error surface for any weight in the network. The exceptions occur at the top and bottom of the network. For l = L, (14.21) must be used to compute e_k^L, since this error ...
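The recursion (14.26)-(14.27) can be exercised on a tiny network. The following sketch assumes a hypothetical two-cell, two-layer sigmoid network (here layer L-2 is simply the input x), and checks the analytic gradient for one hidden weight against a central finite difference.

```python
# Sketch of the back-propagated error recursion (14.26)-(14.27) for a tiny
# two-layer sigmoid network. Sizes and numbers are hypothetical.
import math

def S(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(W1, W2, x):
    """Integrated values and outputs for layers L-1 (u1, y1) and L (u2, y2)."""
    u1 = [sum(wij * xj for wij, xj in zip(row, x)) for row in W1]
    y1 = [S(u) for u in u1]
    u2 = [sum(wij * yj for wij, yj in zip(row, y1)) for row in W2]
    y2 = [S(u) for u in u2]
    return u1, y1, u2, y2

def xi(W1, W2, x, tau):        # single-pattern error surface (14.17)
    _, _, _, y2 = forward(W1, W2, x)
    return 0.5 * sum((t - y) ** 2 for t, y in zip(tau, y2))

W1 = [[0.3, -0.1], [0.2, 0.4]]   # layer L-1 weights (2 cells, 2 inputs)
W2 = [[0.5, -0.3], [-0.2, 0.1]]  # layer L weights (2 cells)
x, tau = [1.0, -0.5], [1.0, 0.0]

u1, y1, u2, y2 = forward(W1, W2, x)
eL = [t - y for t, y in zip(tau, y2)]              # output errors (14.21)
dS2 = [S(u) * (1.0 - S(u)) for u in u2]            # S'(u) for the sigmoid
dS1 = [S(u) * (1.0 - S(u)) for u in u1]

k, j = 0, 1                                        # hidden weight w_kj^{L-1}
e_back = sum(eL[l] * dS2[l] * W2[l][k] for l in range(2))   # (14.26)
analytic = -e_back * dS1[k] * x[j]                          # (14.27)

eps = 1e-6                                         # central-difference check
W1p = [row[:] for row in W1]; W1p[k][j] += eps
W1m = [row[:] for row in W1]; W1m[k][j] -= eps
numeric = (xi(W1p, W2, x, tau) - xi(W1m, W2, x, tau)) / (2 * eps)
assert abs(numeric - analytic) < 1e-6
```

The bracketed sum e_back is exactly the bookkeeping device the text describes: the output errors, weighted back through W2, stand in for the missing hidden-layer targets.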
where 0 < α < 1 (typically 0.9). The momentum term allows a larger value of η to be used without causing oscillations in the solution.

Several investigators have explored the possibility of linearizing the dynamics of a feedforward network around the "current" set of weights, then applying some form of linear processing in conjunction with BP to implement the weight modification (Kollias and Anastassiou, 1989; Azimi-Sadjadi et al., 1990; Ghiselli-Crippa and El-Jaroudi, 1991; Hunt, 1992; Deller and Hunt, 1992). The Hunt work is based on a QR-decomposition-based RLS algorithm of the type described in Chapter 5. For nodewise training, the fundamental algorithm turns out to be theoretically equivalent to the Azimi-Sadjadi method except for the RLS implementation. A significant improvement in convergence performance was achieved using the method of Hunt and Deller with respect to the BP algorithm, and some results are given in Table 14.1.

[Table 14.1]

Finally, several research groups have been concerned with the extension of the BP algorithm to recurrent networks. Studies are reported in (Almeida, 1987, 1988; Rohwer and Forrest, 1987; Samad and Harper, 1987; Atiya, 1988; Pineda, 1987, 1988; Williams and Zipser, 1988).
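The gradient recursion in (14.29)-(14.30), combined with the momentum-augmented weight update discussed above, can be sketched for a small one-hidden-layer network. This is our own illustrative code, not taken from the text: the network size, learning constants, squared-error criterion, and the OR-gate training set are assumptions chosen only to keep the example short.

```python
import math
import random

random.seed(0)
s = lambda u: 1.0 / (1.0 + math.exp(-u))   # sigmoid nonlinearity s(u)
ds = lambda y: y * (1.0 - y)               # s'(u) expressed through y = s(u)

# 2-input, 2-hidden, 1-output network; each cell also sees a constant -1 bias input.
# W[l][k][j] is the weight from node j of layer l-1 to node k of layer l.
W  = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)],
      [[random.uniform(-1, 1) for _ in range(3)]]]
dW = [[[0.0] * 3 for _ in range(2)], [[0.0] * 3]]  # previous updates (momentum memory)
eta, alpha = 0.3, 0.9                              # step size and momentum, 0 < alpha < 1

def forward(x):
    y0 = list(x) + [-1.0]
    y1 = [s(sum(w * v for w, v in zip(W[0][k], y0))) for k in range(2)] + [-1.0]
    y2 = s(sum(w * v for w, v in zip(W[1][0], y1)))
    return y0, y1, y2

def train_step(x, t):
    y0, y1, y2 = forward(x)
    d2 = (t - y2) * ds(y2)                   # top-layer (l = L) error term
    d1 = [ds(y1[k]) * d2 * W[1][0][k] for k in range(2)]  # back-propagated as in (14.30)
    for j in range(3):                       # descend the gradient of (14.29),
        dW[1][0][j] = eta * d2 * y1[j] + alpha * dW[1][0][j]  # plus the momentum term
        W[1][0][j] += dW[1][0][j]
    for k in range(2):
        for j in range(3):
            dW[0][k][j] = eta * d1[k] * y0[j] + alpha * dW[0][k][j]
            W[0][k][j] += dW[0][k][j]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # hypothetical OR task
for _ in range(2000):
    for x, t in data:
        train_step(x, t)
```

With α = 0.9, the effective step along a persistent descent direction is roughly η/(1 - α), which is why a smaller η can be used without slowing convergence.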
The learning vector quantizer (LVQ) is a network that is very useful in speech recognition technology. This network was introduced by Kohonen (1981). The LVQ, which is shown in Fig. 14.20, resembles the MLP, but is designed to function in a very different mode. The network consists of a single layer of cells, which differ from a single layer of perceptrons by the interconnections among the top layer. These connections are present in order to implement what is frequently called competitive learning, in which the output cells "compete" for the right to respond to a given input. In turn, this learning scheme is central to the unsupervised learning procedure used to train the network.

We have seen the VQ technique employed in various aspects of speech coding and recognition. As the name implies, the LVQ ANN and learning procedure are designed to carry out the VQ task. Analogously to a clustering algorithm (which, in fact, the LVQ implements), the LVQ is presented with training inputs only, and is required to form weights that
produce outputs which effectively classify the input patterns into meaningful groupings.

The so-called single-winner unsupervised learning algorithm for the LVQ is shown in Fig. 14.21. The notation used in the algorithm is defined in Fig. 14.20, and the rule for interpreting index k* is given in (14.13). Note that all input vectors are normalized to have unit length. This is critical to correct operation in general, and is very important in speech processing. The procedure automatically determines the N_1 best reference vectors (like cluster centers) needed to represent the space spanned by the N_0-dimensional input vectors x(p), p = 1, 2, ..., P. Each of these learned reference vectors is used as a weight vector on the connections leading to an output cell. As usual, the weights leading to cell n_k are denoted w_k. Once the network is trained, an arbitrary input pattern will cause a unity response in output cell n_{k*}, where w_{k*} is the weight vector closest to the input in Euclidean norm. This is analogous to determining which of N_1 cluster centers is closest, and quantizing (classifying) the vector accordingly.

FIGURE 14.20. The LVQ architecture.

As long as the learning constant η decays with time [typically, e.g., η(p) = p^{-1}], the LVQ learning algorithm has been shown to converge to a tessellation of the input space with the weight vectors as centroids (Kohonen, 1987, Sec. 7.5). Extensive learning times are often required.

Learning vector quantizers can also be used in a multiple-winner unsupervised learning mode in which a "neighborhood" of cells around the winning cell has its weights reinforced. The learning rule is simply modified in this case so that if a training pattern x(p) is closest to, say, n_{k*}, then the weights associated with any cell in a neighborhood, say N_{k*}, of n_{k*} are adjusted according to

    Δw_j = η(p)[x(p) - w_j],    n_j ∈ N_{k*}.    (14.34)

The output layer is often arranged into a planar topology so that the neighborhood of a cell is comprised of all cells within a given radius.

Finally, we note that the LVQ can also be trained in a supervised mode. The training vectors are marked according to one of N_1 classes. When vector x(p) correctly indicates cell n_k (in the sense of minimum distance to its weights, as above), then w_k is reinforced,

    Δw_k = +η(p)[x(p) - w_k].    (14.35)

If class k is incorrectly indicated, then the plus sign in (14.35) is replaced by a negative sign so that w_k is "moved away from x(p)."

Having established the basic principles of the ANN and two important architectures for speech processing, we now turn to some applications of ANN technology in speech recognition.
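The single-winner procedure (random initial weights on [0, 1], a decaying learning constant η(p) = 1/p, and reinforcement of the nearest cell only) can be sketched as follows. The two-dimensional training points are hypothetical, chosen only to form two visible clusters, and for brevity the sketch omits the unit-length normalization of the inputs discussed above.

```python
import random

random.seed(1)

def train_lvq(patterns, n_cells, passes=50):
    """Single-winner unsupervised LVQ training, in the spirit of Fig. 14.21."""
    dim = len(patterns[0])
    # Initialization: assign initial weight values randomly on [0, 1]
    w = [[random.random() for _ in range(dim)] for _ in range(n_cells)]
    p = 0
    for _ in range(passes):                  # cycle through the training patterns
        for x in patterns:
            p += 1
            eta = 1.0 / p                    # decaying learning constant, eta(p) = 1/p
            # find k* = argmin_k || w_k(p-1) - x(p) ||
            k = min(range(n_cells),
                    key=lambda c: sum((wi - xi) ** 2 for wi, xi in zip(w[c], x)))
            # adjust the winner only; all other cells keep their weights
            w[k] = [wi + eta * (xi - wi) for wi, xi in zip(w[k], x)]
    return w

# two well-separated clusters in the plane (hypothetical data)
pts = [(0.1, 0.1), (0.15, 0.05), (0.05, 0.2), (0.9, 0.9), (0.95, 0.8), (0.85, 0.95)]
w = train_lvq(pts, n_cells=2)
```

Each step moves only the winning cell toward the current pattern, so after training the weight vectors sit near the two cluster centers; they quantize the input space exactly as a clustering algorithm would.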
FIGURE 14.21. Single-winner unsupervised learning algorithm for the LVQ.

    Initialization: Assign initial weight values randomly on [0, 1].
                    Select initial learning constant η(0) (may be decreased with time).
    Recursion:      For p = 1, 2, ... (cycle through training patterns)
                        Find k* = argmin_k || w_k(p - 1) - x(p) ||
                        Adjust w_{k*}: w_{k*}(p) = w_{k*}(p - 1) + η(p)[x(p) - w_{k*}(p - 1)]
                        All k ≠ k*: w_k(p) = w_k(p - 1)
                    Next p
    Termination:    Stop when weights change negligibly according to some criterion.

14.4 Applications of ANNs in Speech Recognition

In this section we describe some example applications of ANNs in speech recognition. The focus here will be upon technologies that employ the MLP and LVQ and that build upon our background in conventional speech recognition approaches. Many other application examples are discussed and cited in (Lippmann, 1989) and in the materials listed in Appendix 1.G.

14.4.1 Presegmented Speech Material

Because the ANN represents a relatively new technology, much of the research into its speech recognition capabilities has focused on the fundamental problem of classifying static presegmented speech. Table 14.2,
TABLE 14.3 (concluded)

Iso and Watanabe (1990, 1991): Predictive network; large-vocabulary, speaker-dependent IWR
Levin (1990): Predictive network; multispeaker connected-digit recognition

FIGURE 14.22. The TDNN cell. y(p) is the nominal vector of inputs to n_k at time p. The D delayed values of this vector are also entered; w_d represents the weight vector on the dth delayed copy of the input.
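The computation performed by one such cell can be sketched as follows. The function name is ours and the frame and weight values are arbitrary illustrative numbers, not drawn from the experiments described in this section.

```python
import math

def tdnn_cell(frames, weights):
    """One TDNN cell: sigmoid( sum_d w_d . y(p - d) ), where frames[d] is the
    d-th delayed copy of the input vector and weights[d] is the weight vector
    applied to it (frames[0] and weights[0] act on the present input)."""
    u = sum(wv * yv for w_d, y_d in zip(weights, frames)
                    for wv, yv in zip(w_d, y_d))
    return 1.0 / (1.0 + math.exp(-u))

# present input plus two delays (D = 2), three features each -- illustrative values
frames  = [[0.2, 0.4, 0.1], [0.3, 0.0, 0.5], [0.1, 0.1, 0.1]]
weights = [[0.5, -0.2, 0.1], [0.3, 0.2, -0.4], [0.1, 0.0, 0.2]]
out = tdnn_cell(frames, weights)
```

With D = 2 this is exactly the situation of the layer 1 cells described below, which see the present input vector x(p) plus the two delays x(p-1) and x(p-2).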
version of the TDNN was designed to classify the voiced stops /b, d, g/ in the context of various Japanese words. The TDNN has three layers. The first contains eight cells of the type shown in Fig. 14.22, and the second contains three. At "time p," the layer 1 cells receive the "present" plus two "delays" of the input vector, say x(p), x(p-1), x(p-2), while the layer 2 cells receive the present plus four delays of the output of layer 1, say y^1(p), ..., y^1(p-4). Each of the cells in layer 2 is assigned to one of the consonants. Let these nodes be labeled n_b^2, n_d^2, n_g^2. Then one can view the output vector from layer 2 as consisting of three components, say

    y^2(p) = [y_b^2(p)  y_d^2(p)  y_g^2(p)]^T.    (14.36)

The output layer nodes also contain TDNN units assigned to the individual consonants, but each is responsible for integrating temporal information only. Thus n_b^3, for example, receives (scalar) outputs from n_b^2 only, but taken over the present plus eight delays, y_b^2(p), ..., y_b^2(p-8). Fairly extensive preprocessing of the data was required for the experiments performed with the original TDNN. Fifteen frames of 16 mel-scale energy parameters based on the FFT were centered around the hand-labeled onset of the vowel. Frames are effectively computed every 10 msec, and each vector is then normalized. For many contexts and 4000 tokens (2000 used for training, 2000 for testing) taken from three speakers, the TDNN provided a 1.5% error, compared with 6.5% for a discrete-observation HMM approach used on the same feature vectors. The more recent papers by Waibel et al. (1989b, 1989c) describe an approach for merging subnetworks into larger TDNNs for recognizing the complete set of consonants.

The initial TDNN was trained using BP and provides yet another example of the extraordinary times to convergence that may result with this algorithm, especially with larger networks. The authors report that "800 learning samples were used and between 20,000 and 50,000 iterations of the BP loop were run over all training samples." The modular approach to designing the TDNN (Waibel et al., 1989b) was devised in part to remedy the learning time problem.

Other examples of ANNs designed to learn dynamic aspects of speech are found in the papers cited in Table 14.3. In particular, the reader might wish to compare the time-state neural network of Komori (1991) with the TDNN described above. Another interesting approach uses MLPs to predict patterns rather than to classify them (Iso and Watanabe, 1990, 1991; Levin, 1990; Tebelskis and Waibel, 1990). The predictive method has some interesting relationships to linear prediction theory and HMMs, which we have studied in detail.

14.4.3 ANNs and Conventional Approaches

Finally, we review some approaches that combine ANN computing with conventional algorithms discussed in earlier chapters, in particular DTW, HMM, and Viterbi search. A summary of the approaches to be described here appears in Table 14.4. The ANN contribution to these techniques is principally to serve as an alternative computing structure for carrying out the necessary mathematical operations. The main advantage in this regard is the development of more compact and efficient hardware for real-time implementation. The ANN strategy can also enhance the distance or likelihood computing task by incorporating context
TABLE 14.4. Example Applications Combining ANNs with Conventional Technologies. Partially adapted from Lippmann (1989).

Study: Approach/Problem

Bourlard and Wellekens (1987): MLPs compute distance scores in DTW
Sakoe et al. (1989): MLPs compute distance scores in DTW; delayed features used
Lerner and Deller (1991): Non-MLP preprocessor learns time-frequency representations used in DTW
Franzini et al. (1990): Connectionist Viterbi training, a hybrid

[Figure: an MLP with 50 output nodes and 50 or 200 hidden nodes.]
These range from hybrid approaches that combine the ANN and HMM paradigms (Cheng et al., 1992; Franzini et al., 1990; Franzini et al., 1989; Huang and Lippmann, 1988; Morgan and Bourlard, 1990; Ramesh et al., 1992; Singer and Lippmann, 1992; Sun et al., 1990) to networks that directly implement the computations required of the HMM (Lippmann and Gold, 1987; Niles and Silverman, 1990). In addition, the network developed by Niles and Silverman is capable of learning the probability structure of the model.

As an example of the hybrid approach, we describe the work of Franzini et al. (1990) because it is closely related to research studied in Chapter 12. The so-called connectionist Viterbi training (CVT) is designed to recognize continuous speech and was tested on the 6000 digit strings in the TI/NBS database (see Section 13.8). One thousand of the strings were used for training. Discrete-symbol (cepstral vector) HMMs similar to those used in the SPHINX system were employed as word-dependent phone models, which, in turn, were concatenated into word models and then digit sentence models. The sentence network was trained using the customary F-B procedure (see Section 12.2.2). The symbol strings of the training utterances were then segmented along the various paths using Viterbi backtracking so that their corresponding speech samples could be associated with the various arcs in the network. Each arc was then assigned a recurrent MLP. Each MLP was then trained using BP to compute the output probabilities in response to the original speech in frames of 70 msec surrounded by three left and three right frames. Iterative realignment and retraining was used to improve performance. For unknown-length strings, the word recognition rate was 98.5% using the CVT and the string accuracy was 95.0%, while for known-length strings, these tallies were 99.1 and 96.1%. It is to be noted that these results have poorer error rates than some of the best HMM-based digit results by a factor of about three (Doddington, 1989). However, improvements to the CVT system were reported in 1991 (Haffner et al., 1991) that reduced the error rate on the same task by more than 50% [see also (Haffner, 1992)].

One of the primary advantages of the CVT system is that it obviates the use of VQ and the concomitant distortion of the feature vectors. The output distributions in the HMM are also represented in the MLP without unsubstantiated statistical assumptions. There is also evidence that ANNs might be superior to HMMs at static pattern classification (Waibel et al., 1989), and thus it might be beneficial to replace low-level HMM processing with ANNs when possible.

It is noteworthy that the CVT presented a formidable BP training problem. The researchers were able to scale the problem to a tractable number of floating point operations by taking several measures to speed the BP algorithm. For details see (Franzini et al., 1990).

The Viterbi net reported by Lippmann and Gold (1987) was among the first approaches to integrating HMM-like technology and ANNs. In response to an input feature vector, the network computes a quantity that is proportional to the log probability computed by a Viterbi decoder (HMM). The weights of the network cannot be learned, but rather must be downloaded from a conventional training algorithm (like the F-B algorithm). In isolated-word tests with the Lincoln Laboratory stress-style database (Lippmann et al., 1987), the system performed very comparably to robust HMM models.

The HMM network of Niles and Silverman (1990) is a fully HMM-equivalent network in its ability to learn and compute with the probabilistic structure. The network has recurrent connections, as one might expect, due to the inherent feedback in the HMM dynamics [see (12.42) and (12.43)]. One of the interesting aspects of this study is the demonstrated relationship between the BP algorithm used to train MLPs and the F-B algorithm for HMMs. Hochberg et al. (1991) have reported results of recognition experiments for the HMM network. For a vocabulary consisting of the alphabet, digits, and two control words, models were trained with vector-quantized cepstral coefficients, delta cepstral coefficients, and energy and delta energy features computed every 10 msec over 40-msec frames. About three minutes of speech per 38 talkers required three hours of training (using five workstations operating in parallel) for an F-B-like procedure, and 12 hours for a gradient-ascent maximum likelihood procedure. Continuous utterances were segmented using a Viterbi backtracking procedure as in the CVT. Experiments were performed with both bigram and null grammars with a number of different training strategies. Results are reported in (Hochberg et al., 1991). The findings are somewhat inconclusive and are discussed in detail in the paper. Nevertheless, they demonstrate potential for this interesting approach.

14.4.4 Language Modeling Using ANNs

Several research groups have explored the possibility of using ANNs to model language information. Tasks have included N-gram word category prediction (Nakamura and Shikano, 1989), modeling of a regular grammar (Liu et al., 1990), modeling of a context-free grammar (Sun et al., 1990), and the integration of TDNNs with a parsing strategy (Sawai, 1991). Semantic and other more abstract information has also been modeled in relatively simple experiments using ANNs. A particularly interesting study is reported by Gorin et al. (1991) on the adaptive acquisition of language using a connectionist network. For a review of other selected systems, the reader is referred to (Morgan and Scofield, 1991, Ch. 8), and to the general references cited in Appendix 1.G.

14.4.5 Integration of ANNs into the Survey Systems of Section 13.9

Not surprisingly, ANN technology has been integrated into some of the recent systems surveyed in Section 13.9, or has been used to develop alternative systems. The research group at BBN responsible for the
BYBLOS system has developed a new system based on "segmental neural nets" and HMMs integrated through the use of the N-best search approach. The ANNs are used for improved phonetic modeling in this system (Austin et al., 1992). ANNs have been employed in the DECIPHER system to estimate the output probabilities of the HMMs (Renals et al., 1992). In another recent study, the LR parsing approach developed at ATR has been combined with a TDNN (Sawai, 1991). The trend in these and similar studies has been to use the ANN technology to implement a specialized function that it performs well. Given the complexity of training very-large-scale ANNs, this trend seems likely to continue until major breakthroughs in training methods take place.

14.5 Conclusions

We began our study of speech recognition in Chapter 10 with an enumeration of the challenges faced by speech recognition engineers. Whereas tremendous progress has been achieved in addressing these problems in the past several decades, the performance and capabilities of solutions on sequential machines remain far short of human recognition. The ANN represents an opportunity to explore new and unconventional approaches to these difficult problems. ANN solutions can potentially add massively parallel computing and alternative strategies for adaptation to the techniques upon which speech processing engineers can draw. The current state of ANN research and development for speech recognition, however, lags far behind that of conventional methods, and the ultimate impact of this relatively immature field is uncertain.

We have explored in this chapter some basic principles underlying the ANN concept and two general types of ANN architectures, the MLP and the LVQ, that have natural application to speech recognition. The renaissance of interest in ANNs has been made possible in large part by the discovery of a training algorithm, BP, for the more difficult of the two, the MLP. The challenge of finding such a training method, and the arduous task of understanding its convergence properties, are both reflective of a central difference between the ANN and more conventional engineering approaches: the ANN is generally a nonlinear system. Herein lies its power and much of its mystery, and with the encouraging results reported here and elsewhere come new challenges for speech processing engineers to explain and unify these results and help to build a general theory of ANN computing. It is important to keep in mind, however, that learning all about ANNs (or any other technology) is not necessarily the key to building large-scale, robust, continuous-speech recognizers. In focusing on the technology and not the basic problems, we might once again advance the technical power underlying the systems without gaining much understanding of the deep and complex problems we are trying to solve. In this sense, recent efforts to model various quantifiable aspects of audition [for reviews, see (Lippmann, 1989; Greenberg, 1988)] might emerge as more fruitful than the more popular ANN approaches described in this chapter.

We have noted that many other ANN architectures have been explored, and have encouraged the interested reader to pursue the literature in the field. It should also be pointed out that ANN applications to speech have included tasks other than recognition, and recognition tasks other than those discussed here. Some example applications include the following:

1. Keyword spotting (Morgan et al., 1990; Anderson, 1991).
2. Synthesis (Sejnowski and Rosenberg, 1986; Scordilis and Gowdy, 1989; Rahim and Goodyear, 1989).
3. Articulatory modeling (Xue et al., 1990).
4. Enhancement and noise robustness (Tamura and Waibel, 1988; Tamura, 1989; Paliwal, 1990; Barbier and Chollet, 1991; Mathan and Miclet, 1991).
5. Voiced-unvoiced-silence discrimination (Ghiselli-Crippa and El-Jaroudi, 1991).
6. Speaker recognition and verification (Bennani et al., 1990; Morgan et al., 1989; Oglesby and Mason, 1990, 1991).

14.6 Problems

14.1. Consider the simple perceptron shown in Fig. 14.24(a). The thresholding function is the customary step threshold defined in Fig. 14.3. Also, θ represents a biasing weight, and the input to its connection is always -1. Three classification problems are posed below. In each case, two-dimensional input training vectors, {x(p), p = 1, ..., P}, come from one of two classes, A or B. The training vectors are drawn from populations that are uniformly distributed over the regions shown in Fig. 14.24(b).
(a) Find a set of weights, w = [w_1 w_2 θ]^T, to which the Rosenblatt PL algorithm might converge for the class regions in (i). Is the bias connection to the perceptron necessary in this case?
(b) Repeat part (a) for the class regions in (ii).
(c) For the class regions in (iii), will the PL algorithm be able to find an exact classification boundary in the feature space? Can you speculate on what boundary the LMS algorithm might deduce? Assume that all classification errors are equally costly.

14.2. In order to better understand the development of the BP algorithm, do the following:
(a) Verify the gradient expression (14.20).
(b) Find explicit expressions for o_1 and o_2 in (14.24).
(c) Verify the gradient expression (14.25).
(ii) Plot the 30 training vectors (excluding the constant -1) in two dimensions, along with the three decision boundaries learned by the network:

    w_{k,1} x_1 + w_{k,2} x_2 = θ_k,    k = 1, 2, 3.    (14.39)

(iii) Comment on your results. How many training vectors are misclassified by the decision boundaries?
(b) (i) Train the weights of a three-output LVQ to achieve the classification problem above.
(ii) Plot the "cluster centers" (indicated by the weight vectors of the LVQ) on the same plot with the feature vectors from part (a).
(iii) Comment on your results. How many training vectors are misclassified by the LVQ?
(c) Experiment with various MLP structures with two inputs and three outputs, using different numbers of layers and different thresholding functions. Using the BP algorithm, can you find at least one structure that will provide superior performance to the perceptron and LVQ networks designed above in classifying the training vectors?

FIGURE 14.24. Figures for Problem 14.1. (a) Perceptron. (b) Feature vectors are uniformly distributed over the regions shown in three situations.
Bibliography
ANDERS?N. c... a n~ E. ~ATORIl;JS . :Ad?~t i ve cnhan~e mcnt o f finite bandwidth sig
nals III wh ite G a ussian nor se. IEE E Transactions on Acoustics. Speech. and AZIMI-SADJADI, M., S. CITRtN, and S. SHEEDVASH. "Supervised learning process
S igna l Processing. vol. 3 1, pp . 17- 2 8, Feb. 1983. of multilayer perceptro n neural net work s using fast least sq ua res," Proceedi ngs
ANDERSON, T. "S pe a ker-ind ependent ph o nem e recogn it io n wit h an auditory of the IEEE International Conference on Acoustics, Speech, an d S ignal Process
mode l and a neural net wo rk: A comparison with t rad itio na l techniques," Pre; ing, Al buquerque, N .M .. vol. 3, pp . 138 1- / 384, 1990.
ceedings of the IE EE International Conf erence on Aco ustics. Sp eech. and Signal BAHL, L. R., R. BAKIS, J. B ELLAGARDA el a l. " La rge voca bula ry na tural language
Processi ng, Toro nt o, Ca nad a , vol , I , pp . 149- 152, 199 1. cont in uo us sp eech recognition," Proceedings of the IE EE Internat ional Confer
AMt , M . "M app ing a bili t ies of th ree-layer neural net works." Proceedings of the ence on Acoustics. S peech, and S ignal Processing, G la sgow, Scotland , vol, I,
International Joint Conf erence on Neura l Networks, Wash in gto n. D .C .. vol. I. pp, 46 5-46 7, 1989.
pp. 4 19-42 3, J u ne 198 9.
BAHt , L. R , R. BAKlS, P. S. Cm-IEN et al. "Recognition results with several exper
ARIKl. Y . K. K-UJMOTO, and T. SAKAI. "Acoust ic noise reduction method by two ime n ta l acoustic processors," Proceedings of th e lEEE International Confer
d imen sio nal spect ru m smoo th ing and spectral amplitud e tr ansformation," Pro ence on Acoust ics, Speech, and SIgnal Processing, Washington , D .C., vol . J,
ceedings of the IEEE l nternationat Conf erence On Acoustics, Speech. and Signal p p. 249-25 1, 1979.
Processing. To kyo, J apan, pp , 97 - 100, 1986.
- - - . "Further results on the recognition of a continuously read corpus," Pro
ASADI . A. , R SCHWA RTZ, a nd 1. M AKHOUL. "Aut o matic modell ing for adding new ceedings of the IEEE International Conference on ?'! couslics. Speech, and Signal
wo rd s to a large-vocabulary continuous speech recogn ition system," Proceed. Processing, Denver, Colo., vol. 2, pp. 872-875, 1980.
ings of the IE E E Internat ional Conf erence on Acoustics, Sp eech , and Signal BAHL, L. R., P.· F. BROWN, P. V. DESOUZA et al. " Ma xim u m mutual information
Processing, Toro nt o, Canad a , vol, l. pp. 30 5- 308, 199 J. est im at io n of hidden Markov model parameters for speech recognition," Pro
ATAL, B. S . "A u to m atic speech recogn it io n based on pitch contours." Ph.D. dis ceedings of the IEEE International Conferen ce 0/1 Acoustics, Speech , and Signal
sertatio n, Polytechnic Inst itute of Brooklyn, New York, 1968. Processing, Tokyo, Japan, vol. l , pp . 49-52, 1986.
ATAL, B. S., and 1. S. HANAUER . "Speech analysis and synthesis by linear predic - -- . "'A new algorithm for the estimation of hidden Markov model parame
tio n of the speech wave ." Journal of the Acoustical Society ofAmerirn, vol . 50 , ters," Proceedings oj the IEEE International Conference on Acoustics, Sp eech,
pp. 637-655. 1971. and Signal Processing , New York, vol . 1, pp. 49 3-4 96, 1988.
ATAL, B. S., and J. R. REMDE. "A new model of LPC excitation for producing BAH t., L R.. P. F BROWN, P. V. DESOUZA et al. "Acoustic Markov models used in
natu ra l-so u nd ing speech at low bit rates," Proceedings of the IEEE Internathe TAN G O R A speech recognition system," Proceedings of the IEEE Interna
tional Conference on AcoUSlics, Speech, and Signal Processing, Paris, pp. 614 tional Conference on Acoustics, Speech , and Signal Processing, New York, vol,
6 17, M ay 1982.
r, pp. 497-500, 1988.
ATAL, B. S., a nd M. R. SCHROEDER. "Predictive coding of speech signals ." In BA HL,' L. R., S . K. DAS, P. V. DESOUSA et al . "Some experiments with large
Y Ko nas i, ed .. Report of the 6th International Congress 0 11 Acoustics, Tokyo , voca b ula ry isolated word sentence recognition," Proceedings of the II:.EE Inter
J apa n, J 96 8.
national Conference on Acoustics, Spee ch, and Signal Processing, San Diego,
- - - . "Ad a p t ive predictive coding of speech signals," Bel! System Technical Calif., vol. 2, paper 26.5, 1984.
Journal, vol. 49, pp. 1973-1986, 1970.
ATIYA, A. "Learning on a general network." In D. Anderson, ed., Proceedings of the 1987 IEEE Conference on Neural Information Processing Systems-Natural and Synthetic. New York: American Institute of Physics, pp. 22-30, 1988.
AT&T System Technical Journal, vol. 69, Sept.-Oct. 1990.
AUSTIN, S., G. ZAVALIAGKOS, J. MAKHOUL et al. "Speech recognition using segmental neural nets," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, Calif., vol. 1, pp. I-625-I-628, 1992.
AVERBUCH, A., L. BAHL, R. BAKIS et al. "An IBM-PC based large-vocabulary isolated-utterance speech recognizer," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, vol. 1, pp. 53-56, 1986.
AVERBUCH, A., L. BAHL, R. BAKIS et al. "Experiments with the TANGORA 20,000 word speech recognizer," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 701-704, 1987.
BAHL, L. R., F. JELINEK, and R. L. MERCER. "A maximum likelihood approach to continuous speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 5, pp. 179-190, Mar. 1983.
BAKER, J. K. "Stochastic modeling for automatic speech understanding." In D. R. Reddy, ed., Speech Recognition. New York: Academic Press, pp. 521-542, 1975. Reprinted in (Waibel and Lee, 1990).
---. "The DRAGON system - An overview," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 24-29, Feb. 1975. Reprinted in (Dixon and Martin, 1979).
BAKIS, R. "Continuous speech word recognition via centisecond acoustic states," Proceedings of the 91st Annual Meeting of the Acoustical Society of America, Washington, D.C., 1976.
BARBIER, L., and G. CHOLLET. "Robust speech parameters extraction for word recognition in noise using neural networks," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 145-148, 1991.
BARNHART, C. L., ed. The American College Dictionary. New York: Random House, 1964.
Bibliography
BARNWELL, T. P. "Correlation analysis of subjective and objective measures for speech quality," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Denver, Colo., pp. 706-709, 1980.
---. "A comparison of parametrically different objective speech quality measures using correlation analysis with subjective quality results," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Denver, Colo., pp. 710-713, 1980.
---. "Improved objective quality measures for low bit speech compression," National Science Foundation, Final Technical Report ECS-8016712, 1985.
BARNWELL, T. P., and A. M. BUSH. "Statistical correlation between objective and subjective measures for speech quality," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., pp. 595-598, 1978.
BARNWELL, T. P., M. A. CLEMENTS, S. R. QUACKENBUSH et al. "Improved objective measures for speech quality testing," DCA Final Technical Report, no. DCA100-83-C-0027, Sept. 1984.
BARNWELL, T. P., and W. D. VOIERS. "An analysis of objective measures for user acceptance of voice communication systems," DCA Final Technical Report, no. DCA100-78-C-0003, Sept. 1979.
BARRON, A. R. "Statistical properties of artificial neural networks," Proceedings of the IEEE Conference on Decision and Control, Tampa, Fla., vol. 1, pp. 280-285, 1989.
BARRON, A. R., and R. L. BARRON. "Statistical learning networks: A unifying view," Proceedings of the Symposium on the Interface Between Statistics and Computing Science, Reston, Va., Apr. 1988.
BAUM, L. E. "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1-8, 1972.
BAUM, L. E., and J. A. EAGON. "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of the American Mathematical Society, vol. 73, pp. 360-363, 1967.
BAUM, L. E., and T. PETRIE. "Statistical inference for probabilistic functions of finite state Markov chains," Annals of Mathematical Statistics, vol. 37, pp. 1554-1563, 1966.
BAUM, L. E., T. PETRIE, G. SOULES et al. "A maximization technique in the statistical analysis of probabilistic functions of Markov chains," Annals of Mathematical Statistics, vol. 41, pp. 164-171, 1970.
BAUM, L. E., and G. R. SELL. "Growth functions for transformations on manifolds," Pacific Journal of Mathematics, vol. 27, pp. 211-227, 1968.
BELLANGER, M. G. Adaptive Digital Filters and Signal Analysis. New York: Marcel Dekker, 1987.
BELLMAN, R. "On the theory of dynamic programming," Proceedings of the National Academy of Sciences, vol. 38, pp. 716-719, 1952.
---. "The theory of dynamic programming," Bulletin of the American Mathematical Society, vol. 60, pp. 503-516, 1954.
---. Dynamic Programming. Princeton, N.J.: Princeton University Press, 1957.
BELLMAN, R., and S. E. DREYFUS. Applied Dynamic Programming. Princeton, N.J.: Princeton University Press, 1962.
BENNANI, Y., F. FOGELMAN, and P. GALLINARI. "A neural net approach to automatic speaker recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 265-268, 1990.
BENVENUTO, N., G. BERTOCCI, and W. R. DAUMER. "The 32-kb/s ADPCM coding standard," AT&T Technical Journal, vol. 65, pp. 12-22, Sept.-Oct. 1986.
BERANEK, L. L. Acoustics. New York: McGraw-Hill, 1954.
BERGER, T. Rate Distortion Theory. Englewood Cliffs, N.J.: Prentice Hall, 1971.
BEROUTI, M. G. "Estimation of the glottal volume velocity by the linear prediction inverse filter." Ph.D. dissertation, University of Florida, 1976.
BEROUTI, M. G., D. G. CHILDERS, and A. PAIGE. "Glottal area versus glottal volume velocity," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, Conn., vol. 1, pp. 33-36, 1977.
BEROUTI, M., J. MAKHOUL, and R. SCHWARTZ. "Enhancement of speech corrupted by acoustic noise," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Washington, D.C., pp. 208-211, 1979.
BLOCK, H. "The perceptron: A model for brain functioning, I; Analysis of a four-layer series-coupled perceptron, II," Review of Modern Physics, vol. 34, pp. 123-142, 1962.
---. "A review of 'Perceptrons,'" Information and Control, vol. 17, pp. 501-522, 1970.
BLUM, J. R. "Multidimensional stochastic approximation procedure," Annals of Mathematical Statistics, vol. 25, pp. 737-744, 1954.
BOCCHIERI, E. L., and G. R. DODDINGTON. "Frame specific statistical features for speaker-independent speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 755-764, Aug. 1986.
BODENHAUSEN, U., and A. WAIBEL. "Learning the architecture of neural networks for speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 117-120, 1991.
BOGERT, B. P., M. J. R. HEALY, and J. W. TUKEY. "The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking." In M. Rosenblatt, ed., Proceedings of the Symposium on Time Series Analysis. New York: John Wiley & Sons, pp. 209-243, 1963.
BOLL, S. F. "Suppression of noise in speech using the SABER method," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., pp. 606-609, 1978.
---. "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, Apr. 1979.
---. "Adaptive noise canceling in speech using the short-time transform," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Denver, Colo., pp. 692-695, 1980.
BOLL, S. F., and D. C. PULSIPHER. "Suppression of acoustic noise in speech using two microphone adaptive noise cancellation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 751-753, Dec. 1980.
BOURLARD, H., and C. J. WELLEKENS. "Speech pattern discriminations and multi-layer perceptrons," Computer Speech and Language, Dec. 1987.
BOYCE, W. E., and R. C. DIPRIMA. Elementary Differential Equations and Boundary Value Problems. New York: John Wiley & Sons, 1969.
BRASSARD, J.-P. "Integration of segmenting and nonsegmenting approaches in continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., vol. 3, pp. 1217-1220, 1985.
BREITKOPF, P., and T. P. BARNWELL. "Segmentation preclassification for improved objective speech quality measures," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, Ga., pp. 1101-1104, 1981.
BRIDLE, J. "Neural network experience at the RSRE Speech Research Unit," Proceedings of the ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan, 1988.
BRIDLE, J. S., and M. D. BROWN. "Connected word recognition using whole word templates," Proceedings of the Institute for Acoustics, Autumn Conference, pp. 25-28, Nov. 1979.
BRIDLE, J. S., R. M. CHAMBERLAIN, and M. D. BROWN. "An algorithm for connected word recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, vol. 2, pp. 899-902, 1982.
BROOMHEAD, D. S., and D. LOWE. "Radial basis functions, multivariable functional interpolation, and adaptive networks," Technical Report RSRE Memorandum No. 4148, Royal Signals and Radar Establishment, Malvern, Worcester, England, 1988.
BURG, J. P. "Maximum entropy spectral analysis," Proceedings of the 37th Meeting of the Society of Exploration Geophysicists, 1967.
---. "Maximum entropy spectral analysis." Ph.D. dissertation, Stanford University, 1975.
BURR, B. J., B. D. ACKLAND, and N. WESTE. "Array configurations for dynamic time warping," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, pp. 119-128, 1984.
BURR, D. J. "Experiments on neural net recognition of spoken and written text," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, pp. 1162-1168, July 1988.
BURRASCANO, P., and P. LUCCI. "A learning rule eliminating local minima in multi-layer perceptrons," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 2, pp. 865-868, 1990.
BURRUS, C. S. "Efficient Fourier transform and convolution." Chapter 4 in J. S. Lim and A. V. Oppenheim, eds., Advanced Topics in Signal Processing. Englewood Cliffs, N.J.: Prentice Hall, 1988.
BURTON, D. K., J. E. SHORE, and J. T. BUCK. "Isolated-word speech recognition using multi-section vector quantization codebooks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, pp. 837-849, Aug. 1985.
BUSH, M. A., and G. E. KOPEC. "Network-based connected digit recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 1401-1413, 1987.
BUSINGER, P. A., and G. H. GOLUB. "Linear least squares solutions by Householder transformations," Numerical Mathematics, vol. 7, pp. 269-276, 1965.
BUZO, A., A. H. GRAY, JR., R. M. GRAY et al. "Speech coding based upon vector quantization," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 562-574, Oct. 1980.
CAMPANELLA, S. J., and G. S. ROBINSON. "A comparison of orthogonal transformations for digital speech processing," IEEE Transactions on Communications, vol. 19, part 1, pp. 1045-1049, Dec. 1971.
CARLYON, R. Personal communication, 1988.
CHABRIES, D. M., R. W. CHRISTIANSEN, R. H. BREY et al. "Application of the LMS adaptive filter to improve speech communication in the presence of noise," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, pp. 148-151, 1982.
CHANDRA, S., and W. C. LIN. "Experimental comparisons between stationary and non-stationary formulations of linear prediction applied to speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, pp. 403-415, 1974.
CHEN, C. T. Linear System Theory and Design. New York: Holt, Rinehart and Winston, 1984.
CHEN, J. H. "High-quality 16 kb/s speech coding with a one-way delay less than 2 ms," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 453-456, 1990.
CHENG, D. Y., A. GERSHO, B. RAMAMURTHI et al. "Fast search algorithms for vector quantization and pattern matching," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., paper 9.11, 1984.
CHENG, Y., D. O'SHAUGHNESSY, V. GUPTA et al. "Hybrid segmental-LVQ/HMM for large vocabulary speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, vol. 1, pp. I-593-I-596, 1992.
CHIBA, T., and M. KAJIYAMA. The Vowel: Its Nature and Structure. Tokyo: Tokyo Kaiseikan Pub. Co., 1941.
CHILDERS, D. G. "Laryngeal pathology detection," CRC Reviews in Bioengineering, vol. 2, pp. 375-424, 1977.
CHILDERS, D. G., and C. K. LEE. "Co-channel speech separation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., pp. 181-184, 1987.
CHILDERS, D. G., D. P. SKINNER, and R. C. KEMERAIT. "The cepstrum: A guide to processing," Proceedings of the IEEE, vol. 65, pp. 1428-1443, Oct. 1977.
CHOMSKY, N. "Three models for the description of language," IRE Transactions on Information Theory, vol. 2, pp. 113-124, 1956.
---. "On certain formal properties of grammars," Information and Control, vol. 2, pp. 137-167, 1959.
---. "A note on phrase structure grammars," Information and Control, vol. 2, pp. 393-395, 1959.
CHOMSKY, N., and G. A. MILLER. "Finite state languages," Information and Control, vol. 1, pp. 91-112, 1958.
CHOW, Y. L. "Maximum mutual information estimation of HMM parameters for continuous speech recognition using the N-best algorithm," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 2, pp. 701-704, 1990.
CHOW, Y. L., M. O. DUNHAM, O. A. KIMBALL et al. "BYBLOS: The BBN continuous speech recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 1, pp. 89-92, 1987.
CHOW, Y. L., and S. ROUCOS. "Speech understanding using a unification grammar," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 727-730, 1989.
CHOW, Y. L., R. M. SCHWARTZ, S. ROUCOS et al. "The role of word-dependent coarticulatory effects in a phoneme-based speech recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, vol. 3, pp. 1593-1596, 1986.
CHU, P. L., and D. G. MESSERSCHMITT. "A weighted Itakura-Saito spectral distance measure," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, pp. 545-560, Aug. 1982.
CHUNG, J. H., and R. W. SCHAFER. "Excitation modeling in a homomorphic vocoder," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 25-28, 1990.
CHURCHILL, R. V. Complex Variables and Applications, 2nd ed. New York: McGraw-Hill.
CIOFFI, J. M., and T. KAILATH. "Fast recursive least squares transversal filters for adaptive filtering," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, pp. 304-337, Apr. 1984.
---. "Windowed fast transversal filter adaptive algorithms with normalization," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, pp. 607-625, June 1985.
COHEN, M., H. MURVEIT, J. BERNSTEIN et al. "The DECIPHER speech recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 77-80, 1990.
COOLEY, J. W., and J. W. TUKEY. "An algorithm for the machine computation of the complex Fourier series," Mathematics of Computation, vol. 19, pp. 297-301, 1965.
COX, R. V., and D. MALAH. "A technique for perceptually reducing periodically structured noise in speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, Ga., pp. 1089-1092, 1981.
CRAVERO, M., R. PIERACCINI, and F. RAINERI. "Definition and evaluation of phonetic units for speech recognition by hidden Markov models," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, vol. 3, pp. 2235-2238, 1986.
CROCHIERE, R. E., J. E. TRIBOLET, and L. R. RABINER. "An interpretation of the log likelihood ratio as a measure of waveform coder performance," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 367-376, Aug. 1980.
CROSMER, J. R., and T. P. BARNWELL. "A low bit rate segment vocoder based on line spectrum pairs," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., vol. 1, pp. 240-243, 1985.
CURTIS, R. A., and R. J. NIEDERJOHN. "An investigation of several frequency domain processing methods for enhancing the intelligibility of speech in wideband random noise," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., pp. 602-605, 1978.
DARLINGTON, P., P. D. WHEELER, and G. A. POWELL. "Adaptive noise reduction in aircraft communication systems," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., pp. 716-719, 1985.
DARWIN, C. J., and R. B. GARDNER. "Mistuning a harmonic of a vowel: Grouping and phase effects on vowel efficiency," Journal of the Acoustical Society of America, vol. 79, pp. 838-845, Mar. 1986.
DAUMER, W. R. "Subjective comparison of several efficient speech coders," IEEE Transactions on Communications, vol. 30, pp. 655-662, Apr. 1982.
DAUMER, W. R., and J. R. CAVANAUGH. "A subjective comparison of selected digital coders for speech," Bell System Technical Journal, vol. 57, pp. 3109-3165, Nov. 1978.
DAVIS, S. B., and P. MERMELSTEIN. "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, Aug. 1980.
DELATTRE, P. C., A. M. LIBERMAN, and F. S. COOPER. "Acoustic loci and transitional cues for consonants," Journal of the Acoustical Society of America, vol. 27, no. 4, pp. 769-773, July 1955.
DELLER, J. R. "Some notes on closed phase glottal inverse filtering," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 917-919, Aug. 1981.
---. "On the time domain properties of the two-pole model of the glottal waveform and implications for LPC," Speech Communication: An Interdisciplinary Journal, vol. 2, pp. 57-63, 1983.
---. "On the identification of autoregressive systems excited by periodic signals of unknown phase," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, pp. 638-641, 1984.
DELLER, J. R., and D. HSU. "An alternative adaptive sequential regression algorithm and its application to the recognition of cerebral palsy speech," IEEE Transactions on Circuits and Systems, vol. 34, pp. 782-787, July 1987.
DELLER, J. R., D. HSU, and L. J. FERRIER. "On the use of hidden Markov modelling for recognition of dysarthric speech," Computer Methods and Programs in Biomedicine, vol. 2, pp. 125-139, June 1991.
DELLER, J. R., and S. D. HUNT. "A simple 'linearized' learning algorithm which outperforms back-propagation," Proceedings of the International Joint Conference on Neural Networks, Baltimore, Md., vol. III, pp. 133-138, 1992.
DELLER, J. R., and T. C. LUK. "Set-membership theory applied to linear prediction analysis of speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 653-656, 1987.
---. "Linear prediction analysis of speech based on set-membership theory," Computer Speech and Language, vol. 3, pp. 301-327, 1989.
DELLER, J. R., and S. F. ODEH. "Implementing the optimal bounding ellipsoid algorithm on a fast processor," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 1067-1070, 1989.
---. "Adaptive set-membership identification in O(m) time for linear-in-parameters models," IEEE Transactions on Signal Processing, May 1993.
DELLER, J. R., and G. P. PICACHE. "Advantages of a Givens rotation approach to temporally recursive linear prediction analysis of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 429-431, Mar. 1989.
DELLER, J. R., and R. K. SNIDER. "'Quantized' hidden Markov modelling for efficient recognition of cerebral palsy speech," IEEE International Symposium on Circuits and Systems, New Orleans, La., vol. 3, pp. 2041-2044, 1990.
DEMPSTER, A. P., N. M. LAIRD, and D. B. RUBIN. "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, pp. 1-88, 1977.
DENG, L., M. LENNIG, V. GUPTA et al. "Modeling acoustic-phonetic detail in an HMM-based large vocabulary speech recognizer," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 509-512, 1988.
DENTINO, M., J. MCCOOL, and B. WIDROW. "Adaptive filtering in the frequency domain," Proceedings of the IEEE, vol. 66, pp. 1658-1659, Dec. 1978.
DEROUAULT, A.-M. "Context-dependent phonetic Markov models for large vocabulary speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 1, pp. 360-363, 1987.
DEROUAULT, A.-M., and B. MERIALDO. "Natural language modeling for phoneme-to-text transcription," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, pp. 742-749, Nov. 1986.
DEVIJVER, P. A., and J. KITTLER. Pattern Recognition: A Statistical Approach. London, England: Prentice Hall International, 1982.
DIDAY, E., and J. C. SIMON. "Cluster analysis." In K. S. Fu, ed., Digital Pattern Recognition. New York: Springer-Verlag, 1976.
DIXON, N. R., and T. B. MARTIN, eds. Automatic Speech and Speaker Recognition. New York: IEEE Press, 1979.
DODDINGTON, G. R. "Phonetically sensitive discriminants for improved speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 556-559, 1989.
DODDINGTON, G. R., and T. B. SCHALK. "Speech recognition: Turning theory into practice," IEEE Spectrum, pp. 26-32, Sept. 1981.
DONE, W. J., and C. K. RUSHFORTH. "Estimating the parameters of a noisy all-pole process using pole-zero modeling," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Washington, D.C., pp. 228-231, 1979.
DREYFUS-GRAF, J. "Phonetograph und Schallwellen-Quantelung," Proceedings of the Stockholm Speech Communication Seminar, Stockholm, Sweden, Sept. 1962.
DUBNOWSKI, J. J., R. W. SCHAFER, and L. R. RABINER. "Real time digital hardware pitch detector," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 2-8, Feb. 1976.
DUDLEY, H. "The vocoder," Bell Labs Record, vol. 17, pp. 122-126, 1939. Reprinted in (Schafer and Markel, 1979).
---. "The carrier nature of speech," Bell System Technical Journal, vol. 19, pp. 495-515, 1940.
---. "Fundamentals of speech synthesis," Journal of the Audio Engineering Society, vol. 3, pp. 170-185, 1955.
DUDLEY, H., R. R. RIESZ, and S. S. A. WATKINS. "A synthetic speaker," Journal of the Franklin Institute, vol. 227, pp. 739-764, 1939.
DUDLEY, H., and T. H. TARNOCZY. "The speaking machine of Wolfgang von Kempelen," Journal of the Acoustical Society of America, vol. 22, pp. 151-166, 1950.
DUMOUCHEL, P., V. GUPTA, M. LENNIG et al. "Three probabilistic language models for a large-vocabulary speech recognizer," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 513-516, 1988.
DUNN, H. K. "The calculation of vowel resonances, and an electrical vocal tract," Journal of the Acoustical Society of America, vol. 22, pp. 740-753, 1950.
---. "Methods of measuring vowel formant bandwidths," Journal of the Acoustical Society of America, vol. 33, pp. 1737-1746, Dec. 1961.
DURBIN, J. "Efficient estimation of parameters in moving-average models," Biometrika, vol. 46, parts 1 and 2, pp. 306-316, 1959.
---. "The fitting of time series models," Review of the International Statistical Institute, vol. 28, pp. 233-243, 1960.
DUTOIT, D. "Evaluation of speaker-independent isolated-word recognition systems over telephone network," Proceedings of the European Conference on Speech Technology, Edinburgh, Scotland, pp. 241-244, 1987.
EARLEY, J. "An efficient context-free parsing algorithm," Communications of the Association for Computing Machinery, vol. 13, pp. 94-102, 1970.
EKSTROM, M. P. "A spectral characterization of the ill-conditioning in numerical deconvolution," IEEE Transactions on Audio and Electroacoustics, vol. 21, pp. 344-348, Aug. 1973.
EL-JAROUDI, A., and J. MAKHOUL. "Speech analysis using discrete spectral modelling," Proceedings of the 32nd Midwest Symposium on Circuits and Systems, Champaign, Ill., vol. 1, pp. 85-88, 1989.
ELMAN, J. L., and D. ZIPSER. "Learning the hidden structure of speech," ICS Report 8701, University of California at San Diego, 1987.
EPHRAIM, Y., A. DEMBO, and L. R. RABINER. "A minimum discrimination information approach for hidden Markov modeling," IEEE Transactions on Information Theory, vol. 35, pp. 1001-1013, Sept. 1989.
EPHRAIM, Y., and D. MALAH. "Speech enhancement using optimal non-linear spectral amplitude estimation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Boston, pp. 24.1.1-4, 1983.
---. "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, pp. 1109-1121, Dec. 1984.
EPHRAIM, Y., D. MALAH, and B. H. JUANG. "On the application of hidden Markov models for enhancing noisy speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, pp. 533-536, 1988.
---. "Speech enhancement based upon hidden Markov modeling," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, pp. 353-356, May 1989.
EYKHOFF, P. System Identification. New York: John Wiley & Sons, 1974.
FAIRBANKS, G. "Test of phonetic differentiation: The rhyme test," Journal of the Acoustical Society of America, vol. 30, pp. 596-600, July 1958.
FANO, R. M. "Short-time autocorrelation functions and power spectra," Journal of the Acoustical Society of America, vol. 22, pp. 546-550, Sept. 1950.
FANT, C. G. M. "Analysis and synthesis of speech processes." In B. Malmberg, ed., Manual of Phonetics. Amsterdam: North-Holland, 1968.
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, pp. 699-709, Oct. 1982.
FANT, G., and B. SONESSON. "Indirect studies of glottal cycles by synchronous inverse filtering and photo-electrical glottography," Quarterly Progress and Status Report, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, Sweden, vol. 4, 1962.
FEDER, M., and A. V. OPPENHEIM. "A new class of sequential and adaptive algorithms with applications to noise cancellation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, pp. 557-560, 1988.
FEDER, M., A. OPPENHEIM, and E. WEINSTEIN. "Maximum-likelihood noise cancellation in microphones using estimate-maximize algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1846-1856, Feb. 1989.
FERRARA, E. R., and B. WIDROW. "Multichannel adaptive filtering for signal enhancement," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 766-775, June 1981.
FISHER, W. M., G. R. DODDINGTON, and K. M. GOUDIE-MARSHALL. "The DARPA speech recognition research database: Specifications and status," Proceedings of the DARPA Speech Recognition Workshop, pp. 93-99, 1986.
FISSORE, L., P. LAFACE, G. MICCA et al. "A word hypothesizer for a large vocabulary continuous speech understanding system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 453-456, 1989.
FLANAGAN, J. L. Speech Analysis, Synthesis, and Perception, 2nd ed. New York: Springer-Verlag, 1972.
---. "Voices of men and machines," Journal of the Acoustical Society of America, vol. 51, pp. 1375-1387, Mar. 1972.
---. "Speech coding," IEEE Transactions on Communications, vol. 27, pp. 710-736, Apr. 1979.
FORNEY, G. D. "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, pp. 268-278, Mar. 1973.
FRANZINI, M. A., K.-F. LEE, and A. WAIBEL. "Connectionist Viterbi training: A new hybrid method for continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990.
FRY, D. B., and P. DENES. "The solution of some fundamental problems in mechanical speech recognition," Language and Speech, vol. 1, pp. 35-58, 1958.
FU, K. S. Syntactic Pattern Recognition and Applications. Englewood Cliffs, N.J.: Prentice Hall, 1982.
FURUI, S. "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 254-272, Apr. 1981.
---. "Speaker-independent isolated word recognition using dynamic features of the speech spectrum," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 52-59, Feb. 1986.
GABEL, R. A., and R. A. ROBERTS. Signals and Linear Systems, 2nd ed. New York: John Wiley & Sons, 1980.
GABRIEL, C. M. "Machine parlante de M. Faber," Journal de Physique, vol. 8, pp. 274-275, 1879.
GALLAGER, R. G. Information Theory and Reliable Communication. New York: John Wiley & Sons, 1968.
GARDNER, W. A. Introduction to Random Processes with Applications to Signals and Systems, 2nd ed. New York: McGraw-Hill, 1990.
GENTLEMAN, W. M., and H. T. KUNG. "Matrix triangularization by systolic arrays," Proceedings of the Society of Photo-optical Instrumentation Engineers (Real Time Signal Processing IV), San Diego, Calif., vol. 298, pp. 19-26, 1981.
GERSHO, A. "On the structure of vector quantizers," IEEE Transactions on Information Theory, vol. 28, pp. 157-166, Mar. 1982.
GERSON, I. A., and M. A. JASIUK. "Vector sum excited linear prediction (VSELP) speech coding at 8 kbps," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 461-464, 1990.
GHISELLI-CRIPPA, T., and A. EL-JAROUDI. "A fast neural network training algorithm and its application to voiced-unvoiced-silence classification of speech,"
GIVENS, W. "Computation of plane unitary rotations transforming a general matrix to triangular form," Journal of the Society for Industrial and Applied Mathematics, vol. 6, pp. 26-50, 1958.
GLINSKI, S., T. M. LALUMIA, D. CASSIDAY et al. "The graph search machine (GSM): A programmable processor for connected word speech recognition and other applications," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 1, pp. 519-522, 1987.
GOBLICK, T. J., JR., and J. L. HOLSINGER. "Analog source digitization: A comparison of theory and practice," IEEE Transactions on Information Theory, vol. 13, pp. 323-326, Apr. 1967.
GOLUB, G. H. "Numerical methods for solving least squares problems," Numerical Mathematics, vol. 7, pp. 206-216, 1965.
GOLUB, G. H., and C. F. VAN LOAN. Matrix Computations, 2nd ed. Baltimore, Md.: Johns Hopkins University Press, 1989.
GOODMAN, D. J., B. J. MCDERMOTT, and L. H. NAKATANI. "Subjective evaluation of PCM coded speech," Bell System Technical Journal, vol. 55, pp. 1087-1109, Oct. 1976.
GOODMAN, D., and R. D. NASH. "Subjective quality of the same speech transmission conditions in seven different countries," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, pp. 642-654, Apr. 1982.
GOODMAN, D., C. SCAGLIA, R. E. CROCHIERE et al. "Objective and subjective performance of tandem connections of waveform coders with an LPC vocoder," Bell System Technical Journal, vol. 58, pp. 601-629, Mar. 1979.
GOODWIN, G. C., and K. S. SIN. Adaptive Prediction, Filtering, and Control. Englewood Cliffs, N.J.: Prentice Hall, 1984.
GOREN, A. L., and R. SHIVELY. "The ASPEN parallel computer, speech recognition and parallel dynamic programming," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 976-979, 1987.
GORIN, A., S. LEVINSON, and A. GERTNER. "Adaptive acquisition of spoken language," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 2, pp. 805-808, 1991.
GRAUPE, D. Time Series Analysis, Identification, and Adaptive Filtering, 2nd ed.
Proceedings of the IEEE I nternational Confer ence on Aco ustics, Sp eech, and Ma la bar, F1a: Krieger, i 98 9.
Signal Processing, Toronto. Canada, vol, I , pp . 441-444. 199 1.
GRAY, A. H ., a nd J . D . MARK EL. "A spec t ral flatn ess m easure fo r st udy ing th e
G IBSON, J . D., " O n reflection coeffic ien ts a nd the Cho lcsky d ecomposit ion:' autoco rrel a tion method of linear pr ediction of sp eech an al ys is," IE EE Transa c
IEEE Transactions on Acoustics. Speech . and S ignal Processing, vo l. 25. pp . tion s on Acoustics, Sp eech, and S ignal Processing, vol . 22. pp . 20 7-2 17, 19 74 .
93-96 , Feb. 19 77.
- - . "Distance measu res for sp ee ch processing," IEEE Transactions on Acous
G IBSON, J. D., T. R. FISHER, an d B. Ko o . " Est ima tio n a nd vec tor q uan t izatio n of
lies, Speech. and Signal Processing, vo l. 24 , pp . 380-391 , 1976 .
noisy speech ," Proceedings 0/ the IEEE Int ernational Conference all Acoustics,
Speech, and S ignal Processing, N ew York, PP. 54 1-544 , 1988. GRAY, R . M ., A. Bu zo, A. H . GRAY et al. " Disto rt io n measures for spee ch pro
cessing," IE EE Transactions on Acousucs , Speec h, and S ignal Processing, vol ,
GIBSON, J . D .. B. Ko o. and S. D. G RAY. " F ilte ring o f colored noise for speec h en
28 , pp . 367-376, Au g. 1980.
hancement and co d ing, " IEEE Tran sactions on Acoustics. Speech, and Signal
Processing, vol . 39, pp . 1732-1744, Aug . 1991. GRA,Y, R . M ., a nd L. D . DAVI SSON. Random Processes: A Mat hem atical Approach
l or Engineers. En glewood C liffs, N .J .: Prenti ce Hall. 1986.
G ILLMAN. R . A. "A fa st freq uency d o ma in pitch algorithm" (a bs tract). Journal of
the Acoustical S ociety of America. vol . 58 , p, S63(A). 197 5. G REEFKES, J. A. "A di gita lly companded delta modulat io n modem for speech
tra nsm ission ," Proceedings of the IE EE l nt ernat ional Confe rence on Com m uni
G IVENS, W.• "Co m p uta t io n of plan e unita ry rotations transfo rm ing a general rna-
cati ons, p p. 7.33- 7.48, June 1970.
866 Bibliography / Bibliography 867
GREENBERG, S. "The ear as a speech analyzer," Journal of Phonetics, vol. 16, pp. 139-149, 1988.
GRIFFITHS, L. J. "An adaptive lattice structure for noise-canceling applications," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., vol. 1, pp. 87-90, 1978.
GRIMMETT, G. R., and D. R. STIRZAKER. Probability and Random Processes. Oxford, England: Clarendon, 1985.
GUO, H., and S. B. GELFAND. "Analysis of gradient descent learning algorithms for multilayer feedforward networks," IEEE Transactions on Circuits and Systems, vol. 38, pp. 883-894, Aug. 1991.
GUYTON, A. C. Physiology of the Human Body. Philadelphia: Saunders, 1979.
HAFFNER, P. "Connectionist word level classification in speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, vol. 1, pp. I-621-I-624, 1992.
HAFFNER, P., M. FRANZINI, and A. WAIBEL. "Integrating time alignment and neural networks for high performance continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 105-108, 1991.
HALLIDAY, D., and R. RESNICK. Physics (Parts I and II). New York: John Wiley & Sons, 1966.
HANAZAWA, T., K. KITA, S. NAKAMURA et al. "ATR HMM-LR continuous speech recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 53-56, 1990.
HANSEN, J. H. L. "A new speech enhancement algorithm employing acoustic endpoint detection and morphological based spectral constraints," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, pp. 901-904, 1991.
HANSEN, J. H. L., and O. N. BRIA. "Lombard effect compensation for robust automatic speech recognition in noise," Proceedings of 1990 International Conference on Spoken Language Processing, Kobe, Japan, pp. 1125-1128, Nov. 1990.
HANSEN, J. H. L., and M. A. CLEMENTS. "Enhancement of speech degraded by non-white additive noise," Final Technical Report submitted to Lockheed Corp., DSPL-85-6, Georgia Institute of Technology, Atlanta, Ga., Aug. 1985.
———. "Constrained iterative speech enhancement with application to automatic speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, pp. 561-564, 1988.
———. "Constrained iterative speech enhancement with application to speech recognition," IEEE Transactions on Signal Processing, vol. 39, pp. 795-805, Apr. 1991.
———. "Iterative speech enhancement with spectral constraints," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 1, pp. 189-192, Apr. 1987.
———. "Objective quality measures applied to enhanced speech," Proceedings of the Acoustical Society of America, 110th Meeting, Nashville, Tenn., p. C11, Nov. 1985.
———. "Stress compensation and noise reduction algorithms for robust speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 266-269, 1989.
———. "Use of objective speech quality measures in selecting effective spectral estimation techniques for speech enhancement," Proceedings of the IEEE 32nd Midwest Symposium on Circuits and Systems, Champaign, Ill., pp. 105-108, 1989.
HANSON, B. A., and H. WAKITA. "Spectral slope distance measures with linear prediction analysis for word recognition in noise," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 968-973, July 1987.
HANSON, B. A., and D. Y. WONG. "The harmonic magnitude suppression (HMS) technique for intelligibility enhancement in the presence of interfering speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., pp. 18A.5.1-4, 1984.
HANSON, B. A., D. Y. WONG, and B. H. JUANG. "Speech enhancement with harmonic synthesis," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Boston, pp. 24.2.1-4, 1983.
HARRISON, W. A., J. S. LIM, and E. SINGER. "Adaptive noise cancellation in a fighter cockpit environment," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., pp. 18A.4.1-4, 1984.
———. "A new application of adaptive noise cancellation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 21-27, Feb. 1986.
HARTSHORNE, C., and P. WEISS, eds. Collected Papers of Charles Sanders Peirce. Cambridge, Mass.: Harvard University Press, 1935.
HAYKIN, S. Adaptive Filter Theory, 2nd ed. Englewood Cliffs, N.J.: Prentice Hall, 1986.
HAYT, W. H., and J. E. KEMMERLY. Engineering Circuit Analysis, 2nd ed. New York: McGraw-Hill, 1971.
HECHT-NIELSEN, R. "Kolmogorov's mapping neural network existence theorem," Proceedings of the 1st IEEE International Conference on Neural Networks, San Diego, Calif., vol. III, pp. 11-15, 1987.
HECKER, M. H. L., and C. E. WILLIAMS. "Choice of reference conditions for speech preference tests," Journal of the Acoustical Society of America, vol. 39, pp. 946-952, Nov. 1966.
HEDELIN, P. "QD - An algorithm for non-linear inverse filtering," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, Ga., vol. 1, pp. 366-369, 1981.
———. "A glottal LPC-vocoder," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., pp. 1.6.1-4, 1984.
———. "High quality glottal LPC-vocoding," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, pp. 465-468, 1986.
HEINZ, J. M., and K. N. STEVENS. "On the properties of voiceless fricative consonants," Journal of the Acoustical Society of America, vol. 33, pp. 589-596, 1961.
HELME, B., and C. L. NIKIAS. "Improved spectrum performance via a data-adaptive weighted Burg technique," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, pp. 903-910, Aug. 1985.
HELMHOLTZ, H. L. F. VON. Sensations of Tone. Translated by A. J. Ellis (1875). New York: Dover, 1954.
HEMPHILL, C., and J. PICONE. "Speech recognition in a unification grammar framework," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 723-726, 1989.
HOFFMAN, K., and R. KUNZE. Linear Algebra. Englewood Cliffs, N.J.: Prentice Hall, 1961.
HOLMES, J. N. "An investigation of the volume velocity waveform at the larynx during speech by means of an inverse filter." In G. Fant, ed., Proceedings of the Speech Communication Seminar, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, Sweden, p. B4, 1962.
———. "An investigation of the volume velocity waveform at the larynx during speech by means of an inverse filter," Congress Report: 4th International Congress on Acoustics, Copenhagen, Denmark, 1962.
———. "Formant excitation before and after glottal closure," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.
HOPCROFT, J. E., and J. D. ULLMAN. Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley, 1979.
HORNIK, K., M. STINCHCOMBE, and H. WHITE. "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
HOROWITZ, K. D., and K. D. SENNE. "Performance advantage of complex LMS for controlling narrow-band adaptive arrays," IEEE Transactions on Circuits and Systems, vol. 28, 1981.
HOUSE, A. S., and K. N. STEVENS. "Analog studies of the nasalization of vowels," Journal of Speech and Hearing Disorders, vol. 21, pp. 218-232, 1956.
HOUSE, A. S., C. E. WILLIAMS, M. H. L. HECKER et al. "Articulation-testing methods: Consonantal differentiation with a closed response set," Journal of the Acoustical Society of America, vol. 37, pp. 158-166, 1965.
HOUTSMA, A. J. M., T. D. ROSSING, and W. M. WAGENAARS. Auditory Demonstrations. Institute for Perception Research (IPO), Eindhoven, Netherlands, 1987. Available from the Acoustical Society of America.
HSIA, T. C. Identification: Least Squares Methods. Lexington, Mass.: Heath, 1977.
HUANG, W. M., and R. P. LIPPMANN. "Neural networks and traditional classifiers." In D. Anderson, ed., Proceedings of the 1987 IEEE Conference on Neural Information Processing Systems - Natural and Synthetic. New York: American Institute of Physics.
HUFFMAN, D. A. "A method for the construction of minimum redundancy codes," Proceedings of the IRE, vol. 40, pp. 1098-1101, Sept. 1952.
HUNT, S. D. "Layer-wise training of feedforward neural networks based on linearization and selective data processing," Ph.D. dissertation, Michigan State University, 1992.
HUNT, M. J., J. S. BRIDLE, and J. N. HOLMES. "Interactive digital inverse filtering and its relation to linear prediction methods," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., pp. 880-883, 1980.
"IEEE recommended practice for speech quality measurements," IEEE Transactions on Audio and Electroacoustics, pp. 227-246, Sept. 1969.
IRIE, B., and S. MIYAKE. "Capabilities of three-layered perceptrons," Proceedings of the 2nd IEEE International Conference on Neural Networks, San Diego, Calif., vol. I, pp. 641-648, 1988.
IRWIN, M. J. "Reduction of broadband noise in speech by spectral weighting," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Denver, Colo., pp. 1045-1051, 1980.
ISHIZAKA, K., and J. FLANAGAN. "Synthesis of voiced sounds from a two-mass model of the vocal cords," Bell System Technical Journal, pp. 1233-1268, 1972.
ISO, K., and T. WATANABE. "Speaker-independent word recognition using a neural prediction model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., pp. 441-444, 1990.
———. "Large vocabulary speech recognition using neural prediction model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, 1991.
ITAKURA, F. "Minimum prediction residual principle applied to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 67-72, Feb. 1975. Reprinted in (Waibel and Lee, 1990).
ITAKURA, F., and S. SAITO. "Analysis-synthesis telephony based on the maximum likelihood method," Proceedings of the 6th International Congress on Acoustics, Japan, pp. C17-C20, 1968.
———. "Speech analysis-synthesis system based on the partial autocorrelation coefficient," Proceedings of the Acoustical Society of Japan Meeting, 1969.
———. "On the optimum quantization of feature parameters in the PARCOR speech synthesizer," Record of the IEEE Conference on Speech Communication.
ITAKURA, F., S. SAITO, Y. KOIKE et al. "An audio response unit based on partial correlation," IEEE Transactions on Communications, vol. 20, pp. 792-796, 1972.
JACKSON, L. B. Digital Filters and Signal Processing, 2nd ed. Norwell, Mass.: Kluwer, 1989.
JAIN, V. K., and R. E. CROCHIERE. "Quadrature mirror filter design in the time domain," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, pp. 353-361, Apr. 1984.
JAIN, A. J., A. WAIBEL, and D. S. TOURETZKY. "PARSEC: A structured connectionist parsing system for spoken language," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, vol. 1, pp. I-205-I-208, 1992.
JAKOBSON, R., C. G. FANT, and M. HALLE. Preliminaries to Speech Analysis: Distinctive Features and Their Correlates. Cambridge, Mass.: M.I.T. Press, 1967.
JAYANT, N. S. "Adaptive delta modulation with a one-bit memory," Bell System Technical Journal, pp. 321-342, Mar. 1970.
———. "Digital coding of speech waveforms: PCM, DPCM, and DM quantizers," Proceedings of the IEEE, vol. 62, pp. 611-632, May 1974.
———. Waveform Quantization and Coding. New York: IEEE Press, 1976.
JAYANT, N. S., and J. H. CHEN. "Speech coding with time-varying bit allocation to excitation and LPC parameters," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 65-68, 1989.
JAYANT, N. S., and P. NOLL. Digital Coding of Waveforms. Englewood Cliffs, N.J.: Prentice Hall, 1984.
JELINEK, F. "Continuous speech recognition by statistical methods," Proceedings of the IEEE, vol. 64, pp. 532-556, Apr. 1976.
———. "Development of an experimental discrete dictation recognizer," Proceedings of the IEEE, vol. 73, pp. 1616-1624, Nov. 1985.
———. "Self-organized language modeling for speech recognition." In (Waibel and Lee, 1990).
JELINEK, F., L. R. BAHL, and R. L. MERCER. "Design of a linguistic statistical decoder for the recognition of continuous speech," IEEE Transactions on Information Theory, vol. 21, pp. 250-256, May 1975.
JENKINS, G. M., and D. G. WATTS. Spectral Analysis and Its Applications. San Francisco, Calif.: Holden-Day, 1968.
JOHNSON, C. R. Lectures on Adaptive Parameter Estimation. Englewood Cliffs, N.J.: Prentice Hall, 1988.
JOHNSTON, J. D. "A filter family designed for use in quadrature mirror filter banks," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., vol. 1, pp. 291-294, 1980.
JOUVET, D., J. MONNE, and D. DUBOIS. "A new network-based, speaker-independent connected-word recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, vol. 2, pp. 1109-1112, 1986.
JUANG, B.-H. "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T System Technical Journal, vol. 64, pp. 1235-1249, July-Aug. 1985.
JUANG, B.-H., S. E. LEVINSON, and M. M. SONDHI. "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Transactions on Information Theory, vol. 32, pp. 307-309, Mar. 1986.
JUANG, B.-H., and L. R. RABINER. "A probabilistic distance measure for hidden Markov models," AT&T System Technical Journal, vol. 64, pp. 391-408, Feb. 1985.
JUANG, B.-H., L. R. RABINER, and J. G. WILPON. "On the use of bandpass liftering in speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 947-954, July 1987.
JUTLAND, F. C., G. CHOLLET, and N. DEMASSIEUX. "VLSI architectures for dynamic time warping using systolic arrays," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., vol. 2, paper 34A.5, 1984.
KAISER, J. F. "Reproducing the cocktail party effect" (abstract), Journal of the Acoustical Society of America, vol. 32, p. 918, July 1960.
KAMMERER, B., and W. KUPPER. "Experiments for isolated-word recognition with single and multilayer perceptrons," Abstracts of the 1st Annual International Neural Network Society Meeting, Boston, in Neural Networks, vol. 1, p. 302, 1988.
KANG, G. S., and D. C. COULTER. "600 bits per second voice digitizer (linear predictive formant vocoder)," Naval Research Laboratory Report, 1976.
KANG, G. S., et al. "Multirate processor for digital voice communications," Naval Research Laboratory Report 8295, 1979.
KAPLAN, G. "Words into action I," IEEE Spectrum, vol. 17, pp. 22-26, June 1980.
KASAMI, T. "An efficient recognition and syntax algorithm for context-free languages," Scientific Report AFCRL-65-758, Bedford, Mass.: Air Force Cambridge Research Laboratory, 1965.
KASAMI, T., and K. TORII. "A syntax analysis procedure for unambiguous context-free grammars," Journal of the Association for Computing Machinery, vol. 16, pp. 423-431, 1969.
KAVEH, M., and G. A. LIPPERT. "An optimum tapered Burg algorithm for linear prediction and spectral analysis," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 31, pp. 438-444, Apr. 1983.
KAY, S., and L. PAKULA. "Simple proofs of the minimum phase property of the prediction error filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 31, p. 510, Apr. 1983.
KAY, S. M., and S. L. MARPLE. "Spectrum analysis - A modern perspective," Proceedings of the IEEE, vol. 69, pp. 1380-1419, Nov. 1981.
KAYSER, J. A. "The correlation between subjective and objective measures of coded speech quality and intelligibility following noise corruptions," M.S. thesis, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio, Dec. 1981.
KELLY, J. L., and C. C. LOCHBAUM. "Speech synthesis," Proceedings of the 4th International Congress on Acoustics, vol. G42, pp. 1-4, 1962. Also appears in Proceedings of the Stockholm Speech Communications Seminar, Royal Institute of Technology, Stockholm, Sweden, 1962.
KELLY, J. L., and R. F. LOGAN. Self-Adaptive Echo Canceller. U.S. Patent 3,500,000, Mar. 10, 1970.
KIM, J. W., and C. K. UN. "Enhancement of noisy speech by forward/backward adaptive digital filtering," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, vol. 1, pp. 89-92, 1986.
KIMURA, S. "100,000-word recognition system using acoustic segment networks," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 61-64, 1990.
KITA, K., T. KAWABATA, and H. SAITO. "HMM continuous speech recognition using predictive LR parsing," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 703-706, 1989.
KITA, K., and W. H. WARD. "Incorporating LR parsing into SPHINX," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 269-272, 1991.
KITAWAKI, N., M. HONDA, and K. ITOH. "Speech quality assessment methods for speech coding systems," IEEE Communications Magazine, vol. 22, pp. 26-33, Oct. 1984.
KLATT, D. "A digital filter bank for spectral matching," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, pp. 573-576, 1976.
———. "Prediction of perceived phonetic distance from critical-band spectra: A first step," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, pp. 1278-1281, 1982.
———. "Review of the ARPA speech understanding project," Journal of the Acoustical Society of America, vol. 62, pp. 1324-1366, Dec. 1977. Reprinted in (Dixon and Martin, 1979) and (Waibel and Lee, 1990).
———. "Review of text-to-speech conversion for English," Journal of the Acoustical Society of America, vol. 82, pp. 737-793, Sept. 1987.
KOBATAKE, H., J. INARI, and S. KAKUTA. "Linear predictive coding of speech signals in a high ambient noise environment," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 472-475, Apr. 1978.
KOENIG, W. "A new frequency scale for acoustic measurements," Bell Telephone Laboratory Record, vol. 27, pp. 299-301, 1949.
KOENIG, W., H. K. DUNN, and L. Y. LACY. "The sound spectrograph," Journal of the Acoustical Society of America, vol. 17, pp. 19-49, July 1946.
KOFORD, J., and G. GRONER. "The use of an adaptive threshold element to design a linear optimal pattern classifier," IEEE Transactions on Information Theory, vol. 12, pp. 42-50, Jan. 1966.
KOHONEN, T. "Automatic formation of topological maps in a self-organizing system." In E. Oja and O. Simula, eds., Proceedings of the 2nd Scandinavian Conference on Image Analysis, pp. 214-220, 1981. See also (Kohonen, 1987).
———. Content-Addressable Memories, 2nd ed. New York: Springer-Verlag, 1987.
———. "An introduction to neural computing," Neural Networks, vol. 1, pp. 3-16, 1988.
KOLLIAS, S., and D. ANASTASSIOU. "An adaptive least squares algorithm for the efficient training of artificial neural networks," IEEE Transactions on Circuits and Systems, vol. 36, pp. 1092-1101, Aug. 1989.
KOLMOGOROV, A. N. "On the representation of continuous functions of many variables by superposition of functions of one variable and addition," Dokl. Akad. Nauk USSR, vol. 114, pp. 953-956, 1957.
KOMORI, Y. "Time state neural networks (TSNN) for phoneme identification by considering temporal structure of phonemic features," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 125-128, 1991.
KONVALINKA, I. S., and M. R. MATAUSEK. "On the simultaneous estimation of poles and zeros in speech analysis, and ITIF: Iterative inverse filtering algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 485-492, Oct. 1979.
KORENBERG, M. J., and L. D. PAARMANN. "An orthogonal ARMA identifier with automatic order estimation for biological modelling," Annals of Biomedical Engineering, vol. 17, pp. 571-592, 1989.
KOSKO, B. Neural Networks and Fuzzy Systems. Englewood Cliffs, N.J.: Prentice Hall, 1992.
KRISHNAMURTHY, A. K. "Two channel analysis for formant tracking and inverse filtering," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., vol. 3, pp. 36.3.1-36.3.4, 1984.
KRISHNAMURTHY, A. K., and D. G. CHILDERS. "Two channel speech analysis," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 730-743, Aug. 1986.
KROON, P., and B. S. ATAL. "Strategies for improving the performance of CELP coders at low bit rates," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 151-154, Apr. 1988.
KRUSKAL, J. B. "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, vol. 29, pp. 1-27, 1964.
———. "Nonmetric multidimensional scaling: A numerical method," Psychometrika, vol. 29, pp. 115-129, 1964.
KRYTER, K. D. "Methods for the calculation of the articulation index," Journal of the Acoustical Society of America, vol. 34, pp. 1689-1697, Nov. 1962.
———. "Validation of the articulation index," Journal of the Acoustical Society of America, vol. 34, pp. 1698-1702, Nov. 1962.
KUBALA, F., Y. CHOW, A. DERR et al. "Continuous speech recognition results of the BYBLOS system on the DARPA 1000-word resource management data-
base," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 291-294, 1988.
KUO, C. J., J. R. DELLER, and A. K. JAIN. "Transform encryption coding of images," submitted to IEEE Transactions on Image Processing, Aug. 1992.
LADEFOGED, P. A Course in Phonetics. New York: Harcourt Brace Jovanovich, 1975.
LAEAENS, J. L., and J. R. DELLER. "'STSST' - A silent input selective sequential identifier for AR systems," Proceedings du Neuvieme Colloque sur le Traitement du Signal et ses Applications, Nice, France, vol. 2, pp. 989-994, 1983.
LAMEL, L. F., L. R. RABINER, A. E. ROSENBERG et al. "An improved endpoint detector for isolated word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 777-785, 1981.
LANG, S. W., and G. E. HINTON. "The development of the time-delay neural network architecture for speech recognition," Technical Report No. CMU-CS-88-152, Carnegie-Mellon University, 1988.
LANG, S. W., and J. H. MCCLELLAN. "Frequency estimation with maximum entropy spectral estimators," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 716-724, Dec. 1980.
LARAR, J. N., Y. A. ALSAKA, and D. G. CHILDERS. "Variability in closed phase analysis of speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., vol. 2, pp. 1089-1092, 1985.
LAW, H. B., and R. A. SEYMOUR. "A reference distortion system using modulated noise," Proceedings of the IEE, pp. 484-485, Nov. 1962.
LEE, C.-H., and L. R. RABINER. "A frame synchronous network search algorithm for connected word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1649-1658, Nov. 1989.
LEE, C.-H., F. K. SOONG, and B.-H. JUANG. "A segment model based approach to speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 501-504, 1988.
LEE, H. C., and K. S. FU. "A stochastic syntax analysis procedure and its application to pattern classification," IEEE Transactions on Computers, vol. 21, pp. 660-666, July 1972.
LEE, K.-F., H.-W. HON, and D. R. REDDY. "An overview of the SPHINX speech recognition system," IEEE Transactions on Signal Processing, vol. 38, pp. 35-45, Jan. 1990.
LEE, K.-F., and S. MAHAJAN. "Corrective and reinforcement learning for speaker-independent continuous speech recognition," Technical Report No. CMU-CS-89-100, Carnegie-Mellon University, Jan. 1989.
LEE, R. K. C. Optimal Estimation, Identification, and Control. Cambridge, Mass.: M.I.T. Press, 1964.
LEFEVRE, J. P., and O. PASSIEN. "Efficient algorithms for obtaining multipulse excitation for LPC coders," Proceedings of the IEEE International Conference on
LEHISTE, I., and G. E. PETERSON. "Transitions, glides, and diphthongs," Journal of the Acoustical Society of America, vol. 33, pp. 268-277, Mar. 1961.
LEONARD, R. G. "A database for speaker-independent digit recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., vol. 3, paper 42.11, 1984.
LEON-GARCIA, A. Probability and Random Processes for Electrical Engineering. Reading, Mass.: Addison-Wesley, 1989.
LERNER, S. Z., and J. R. DELLER. "Speech recognition by a self-organizing feature finder," International Journal of Neural Systems, vol. 2, pp. 55-78, 1991.
LESSER, V. R., R. D. FENNELL, L. D. ERMAN et al. "Organization of the HEARSAY II speech understanding system," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 11-23, Feb. 1975.
LEVIN, E. "Word recognition using hidden control neural architecture," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 433-436, 1990.
LEVINSON, N. "The Wiener RMS (root mean square) error criterion in filter design and prediction," Journal of Mathematical Physics, vol. 25, pp. 261-278, 1947. Also appears as Appendix B in (Wiener, 1949).
LEVINSON, S. E. "Structural methods in automatic speech recognition," Proceedings of the IEEE, vol. 73, pp. 1625-1650, Nov. 1985.
———. "Continuously variable duration hidden Markov models for automatic speech recognition," Computer Speech and Language, vol. 1, pp. 29-45, Mar. 1986.
LEVINSON, S. E., A. LJOLJE, and L. G. MILLER. "Large vocabulary speech recognition using a hidden Markov model for acoustic/phonetic classification," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 505-508, 1988.
LEVINSON, S. E., L. R. RABINER, A. E. ROSENBERG et al. "Interactive clustering techniques for selecting speaker-independent reference templates for isolated word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 134-141, Apr. 1979.
LEVINSON, S. E., L. R. RABINER, and M. M. SONDHI. "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," Bell System Technical Journal, vol. 62, pp. 1035-1074, Apr. 1983.
LICKLIDER, J. C. R., and I. POLLACK. "Effects of differentiation, integration, and infinite peak clipping upon the intelligibility of speech," Journal of the Acoustical Society of America, vol. 20, pp. 42-51, Jan. 1948.
LIM, J. S. "Evaluation of a correlation subtraction method for enhancing speech degraded by additive white noise," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, pp. 471-472, Oct. 1978.
———. "Spectral root homomorphic deconvolution system," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 223-232, June 1979.
LIM, J. S., and A. V. OPPENHEIM. "All-pole modeling of degraded speech," IEEE
Acoustics, Sp eech. and S ign al Processing, Tampa , F la.. vo l, 2. pp. 95i - 960. Tra nsactions Oil Acoustics, Speech . and Signal Processing. vol, 26 , pp. 197-2 10,
19 85. June 1978 .
876 Biblio g rap hy
B ibliography 817
———. "Enhancement and bandwidth compression of noisy speech," Proceedings of the IEEE, vol. 67, pp. 1586-1604, Dec. 1979.
LIM, J. S., A. V. OPPENHEIM, and L. D. BRAIDA. "Evaluation of an adaptive comb filtering method for enhancing speech degraded by white noise addition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, pp. 354-358, Aug. 1978.
LINDBLOM, B. E. F., and J. E. F. SUNDBERG. "Acoustic consequences of lip, tongue, jaw, and larynx movement," Journal of the Acoustical Society of America, vol. 50, pp. 1166-1179, 1971.
LINDE, Y., A. BUZO, and R. M. GRAY. "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, pp. 84-95, Jan. 1980.
LINDQVIST, J. "Inverse filtering: Instrumentation and techniques," Quarterly Progress & Status Report, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, Sweden, vol. 4, 1964.
———. "Studies of the voice source by means of inverse filtering," Quarterly Progress & Status Report, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, Sweden, vol. 2, 1965.
———. "The voice source studied by means of inverse filtering," Quarterly Progress & Status Report, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, Sweden, vol. 1, 1970.
LIPORACE, L. A. "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Transactions on Information Theory, vol. 28, pp. 729-734, Sept. 1982.
LIPPMANN, R. P. "An introduction to computing with neural nets," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 4, pp. 4-22, Apr. 1987.
———. "Review of research on neural networks for speech recognition," Neural Computation, vol. 1, Mar. 1989. Reprinted in (Waibel and Lee, 1990).
LIPPMANN, R. P., and B. GOLD. "Neural-net classifiers useful for speech recognition," Proceedings of the 1st IEEE International Conference on Neural Networks, San Diego, Calif., vol. IV, pp. 417-425, 1987.
LIPPMANN, R. P., E. A. MARTIN, and D. B. PAUL. "Multi-style training for robust isolated-word speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 705-708, 1987.
LIU, Y. D., G. Z. SUN, H. H. CHEN et al. "Grammatical inference and neural network state machines," Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., vol. 1, pp. 285-288, Jan. 1990.
LJUNG, L., and T. SÖDERSTRÖM. Theory and Practice of Recursive Identification. Cambridge, Mass.: M.I.T. Press, 1983.
LLOYD, S. P. "Least squares quantization in PCM," Bell Laboratories Technical Note, 1957. Reprinted in IEEE Transactions on Information Theory, vol. 28, pp. 129-137, Mar. 1982.
LOWERRE, B. T., and D. R. REDDY. "The HARPY speech understanding system." In W. A. Lea, ed., Trends in Speech Recognition. Englewood Cliffs, N.J.: Prentice Hall, 1980.
LUCKE, H., and F. FALLSIDE. "Expanding the vocabulary of a connectionist recognizer trained on the DARPA resource management corpus," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, vol. 1, pp. I-605-I-608, 1992.
LUSTERNIK, L. A., and V. J. SOBOLEV. Elements of Functional Analysis. New York: Halsted (Wiley), 1974.
MCAULAY, R. J., and M. L. MALPASS. "Speech enhancement using a soft-decision noise suppression filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 137-145, Apr. 1980.
MCAULAY, R. J., and T. F. QUATIERI. "Speech analysis-synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 744-754, Aug. 1986.
MCCANDLESS, S. S. "An algorithm for automatic formant extraction using linear prediction spectra," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, pp. 135-141, Apr. 1974.
———. "Modifications to formant tracking algorithm of April 1974," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 192-193, Apr. 1976.
MCCLELLAN, J. H. "Parametric signal modelling," Chapter 1 in J. S. Lim and A. V. Oppenheim, eds., Advanced Topics in Signal Processing. Englewood Cliffs, N.J.: Prentice Hall, 1988.
MCCULLOCH, W. S., and W. PITTS. "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
MCCOOL, J. M. et al. Adaptive Line Enhancer. U.S. Patent 4,238,746, Dec. 9, 1980.
MCCOOL, J. M. et al. An Adaptive Detector. U.S. Patent 4,243,935, Jan. 6, 1981.
MCDERMOTT, B. J. "Multidimensional analysis of circuit quality judgments," Journal of the Acoustical Society of America, vol. 45, pp. 774-781, Mar. 1969.
MCDERMOTT, B. J., C. SCAGLIOLA, and D. J. GOODMAN. "Perceptual and objective evaluation of speech processed by adaptive differential PCM," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., pp. 581-585, Apr. 1978.
———. "Perceptual and objective evaluation of speech processed by adaptive differential PCM," Bell System Technical Journal, vol. 57, pp. 1597-1619, May 1978.
MCDERMOTT, E., and S. KATAGIRI. "Shift-invariant, multi-category phoneme recognition using Kohonen's LVQ2," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 81-84, 1989.
MCGEE, V. E. "Semantic components of the quality of processed speech," Journal of Speech and Hearing Research, vol. 7, pp. 310-323, 1964.
MCMAHAN, M., and R. B. PRICE. "Grammar driven connected word recognition on the TI-SPEECH board," Proceedings of Speech Tech, New York, pp. 88-91, 1986.
MACNEILAGE, P. F. "Motor control of serial ordering of speech," Psychological Review, vol. 77, pp. 182-196, 1970.
MCWHIRTER, J. G. "Recursive least squares solution using a systolic array," Proceedings of the Society of Photooptical Instrumentation Engineers (Real Time Signal Processing IV), San Diego, Calif., vol. 431, pp. 105-112, 1983.
MAHALANOBIS, P. C. "On the generalized distance in statistics," Proceedings of the National Institute of Science (India), vol. 12, pp. 49-55, 1936.
MAKHOUL, J. "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, pp. 561-580, Apr. 1975.
———. "Spectral linear prediction: Properties and applications," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 283-296, June 1975.
———. "Stable and efficient lattice methods for linear prediction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, pp. 423-428, Oct. 1977.
———. Personal communication, 1991.
MAKHOUL, J., and L. COSELL. "Adaptive lattice analysis of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 654-659, June 1981.
MAKHOUL, J., S. ROUCOS, and H. GISH. "Vector quantization in speech coding," Proceedings of the IEEE, vol. 73, pp. 1551-1588, Nov. 1985.
MAKHOUL, J., and R. VISWANATHAN. "Adaptive preprocessing for linear predictive speech compression systems" (abstract), Journal of the Acoustical Society of America, vol. 55, p. 475, 1974.
MALAH, D., and R. V. COX. "Time-domain algorithms for harmonic bandwidth reduction and time scaling of speech signals," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 121-133, Apr. 1979.
———. "A generalized comb filtering technique for speech enhancement," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, vol. 1, pp. 160-163, 1982.
MANN, J. R., and F. M. RHODES. "A wafer-scale DTW multiprocessor," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, vol. 3, pp. 1557-1560, 1986.
MARIANI, J. "Recent advances in speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 429-440, May 1989.
MARKEL, J. D. Formant Trajectory Estimation from a Linear Least Squares Inverse Filter Formulation. Monograph No. 7. Santa Barbara, Calif.: Speech Communication Research Laboratory, Inc., 1971.
———. "The Prony method and its application to speech analysis" (abstract), Journal of the Acoustical Society of America, vol. 49, p. 105, Jan. 1971.
———. "Digital inverse filtering: A new tool for formant trajectory estimation," IEEE Transactions on Audio and Electroacoustics, vol. 20, pp. 129-137, June 1972.
———. "The SIFT algorithm for fundamental frequency estimation," IEEE Transactions on Audio and Electroacoustics, vol. 20, pp. 367-377, Dec. 1972.
MARKEL, J. D., and A. H. GRAY. Linear Prediction of Speech. New York: Springer-Verlag, 1976.
MARKEL, J. D., A. H. GRAY, and H. WAKITA. Linear Prediction of Speech: Theory and Practice. Monograph No. 10. Santa Barbara, Calif.: Speech Communication Research Laboratory, Inc., 1973.
MARKEL, J. D., and D. Y. WONG. "Considerations in the estimation of glottal volume velocity waveforms," Proceedings of the Acoustical Society of America, 91st Meeting, RR6, 1976.
MARPLE, S. L. Digital Spectral Analysis with Applications. Englewood Cliffs, N.J.: Prentice Hall, 1987.
MARSHALL, D. F., and W. K. JENKINS. "A fast quasi-Newton adaptive filtering algorithm," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing.
MARTIN, T. B. "Practical applications of voice input to machine," Proceedings of the IEEE, vol. 64, pp. 487-501, Apr. 1976.
MATAUSEK, M. R., and V. S. BATALOV. "A new approach to the determination of the glottal waveform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 616-622, Dec. 1980.
MATHAN, L., and L. MICLET. "Rejection of extraneous input in speech recognition applications using multi-layer perceptrons and the trace of HMMs," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 93-96, 1991.
MAX, J. "Quantizing for minimum distortion," IRE Transactions on Information Theory, vol. 6, pp. 7-12, Mar. 1960.
MEISEL, W. S., M. T. ANIKST, S. S. PIRZADEH et al. "The SSI large-vocabulary speaker-independent continuous speech recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 337-340, 1991.
MERGEL, D., and A. PAESELER. "Construction of language models for spoken database queries," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 844-847, 1987.
MERIALDO, B. "Speech recognition using a very large size dictionary," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 1, pp. 364-367, 1987.
MESSERSCHMITT, D. G. Adaptive Filters. Norwell, Mass.: Kluwer, 1984.
MILENKOVIC, P. "Glottal inverse filtering by joint estimation of an AR system with a linear input model," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 28-42, Feb. 1986.
MILLER, L. G., and A. L. GORIN. "A structured network architecture for adaptive language acquisition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, Calif., vol. 1, pp. I-201-I-204, 1992.
MILLER, R. L. "Nature of the vocal cord wave," Journal of the Acoustical Society of America, vol. 31, pp. 667-677, 1959.
MILLER, S. M., D. P. MORGAN, H. F. SILVERMAN et al. "Real-time evaluation system for a real-time connected speech recognizer," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 801-804, 1987.
MILNER, P. M. Physiological Psychology. New York: Holt, Rinehart and Winston, 1970.
MINSKY, M. L., and S. PAPERT. Perceptrons. Cambridge, Mass.: M.I.T. Press, 1969.
———. Perceptrons, 2nd ed. Cambridge, Mass.: M.I.T. Press, 1988.
MIRCHANDANI, G., R. C. GAUS, and L. K. BECHTEL. "Performance characteristics of a hardware implementation of the cross-talk resistant adaptive noise canceller," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, pp. 93-96, Apr. 1986.
MIYOSHI, Y., K. YAMAMOTO, R. MIZOGUCHI et al. "Analysis of speech signals of short pitch period by sample selective linear prediction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 1233-1240, Sept. 1987.
MOODY, J. "Fast learning in multi-resolution hierarchies." In D. S. Touretzky, ed., Advances in Neural Information Processing Systems. San Mateo, Calif.: Morgan-Kaufmann, 1989.
MORGAN, D. P., L. RIEK, D. P. LOCONTO et al. "A comparison of neural networks and traditional classification techniques for speaker identification," Proceedings of the Military and Government Speech Technology Conference, Washington, D.C., pp. 238-242, 1989.
MORGAN, D. P., and C. L. SCOFIELD. Neural Networks and Speech Processing. Norwell, Mass.: Kluwer, 1991.
MORGAN, D. P., C. L. SCOFIELD, T. M. LORENZO et al. "A keyword spotter which incorporates neural networks for secondary processing," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 113-116, 1990.
MORGAN, N., and H. BOURLARD. "Continuous speech recognition using multilayer perceptrons with hidden Markov models," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 413-416, 1990.
MORSE, P. M., and K. U. INGARD. Theoretical Acoustics. New York: McGraw-Hill, 1968.
MURVEIT, H., and M. WEINTRAUB. "1000-word speaker-independent continuous speech recognition using hidden Markov models," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 115-118, 1988.
MUSICUS, B. R. "An iterative technique for maximum likelihood parameter estimation on noisy data." M.S. thesis, Massachusetts Institute of Technology, June 1979.
MUSICUS, B. R., and J. S. LIM. "Maximum likelihood parameter estimation on noisy data," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Washington, D.C., pp. 224-227, Apr. 1979.
MYERS, C. S., and L. R. RABINER. "A level building dynamic time warping algorithm for connected word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 284-296, Apr. 1981.
———. "Connected digit recognition using a level building DTW algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 351-363, June 1981.
MYERS, C. S., L. R. RABINER, and A. E. ROSENBERG. "Performance tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 622-635, Dec. 1980.
NAGATA, K., Y. KATO, and S. CHIBA. "Spoken digit recognizer for the Japanese language," Proceedings of the 4th International Congress on Acoustics, 1962.
NAKAMURA, M., and K. SHIKANO. "A study of English word category prediction based on neural networks," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 731-734, 1989.
NAKATSU, R., H. NAGASHIMA, J. KOJIMA et al. "A speech recognition method for telephone voice," Transactions of the Institute of Electronics, Information, and Computer Engineers (in Japanese), vol. J66-D, pp. 377-384, Apr. 1983.
NAKATSUI, M., and P. MERMELSTEIN. "Subjective speech-to-noise ratio as a measure of speech quality for digital waveform coders," Journal of the Acoustical Society of America, vol. 72, pp. 1136-1144, Oct. 1982.
NAKATSUI, M., and J. SUZUKI. "Method of observation of glottal-source wave using digital inverse filtering in the time domain," Journal of the Acoustical Society of America, vol. 47, pp. 664-665, 1970.
NANDKUMAR, S., and J. H. L. HANSEN. "Dual-channel iterative speech enhancement with constraints based on an auditory spectrum," submitted to IEEE Transactions on Signal Processing, 1992.
———. "Dual-channel speech enhancement with auditory spectrum based constraints," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, pp. 36.7.1-4, 1992.
National Institute of Standards and Technology (NIST). "Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database" (prototype), Gaithersburg, Md., 1988.
NAWAB, S. H., and T. F. QUATIERI. "Short time Fourier transform." Chapter 6 in J. S. Lim and A. V. Oppenheim, eds., Advanced Topics in Signal Processing. Englewood Cliffs, N.J.: Prentice Hall, 1988.
NAYLOR, A. W., and G. R. SELL. Linear Operator Theory. New York: Holt, Rinehart and Winston, 1971.
NAYLOR, J. A., and S. F. BOLL. "Techniques for suppression of an interfering talker in co-channel speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., pp. 205-208, 1987.
NEY, H. "The use of a one stage dynamic programming algorithm for connected word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, pp. 263-271, Apr. 1984.
NEY, H., R. HAEB-UMBACH, B. TRAN et al. "Improvements in beam search for 10,000-word continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, vol. 1, pp. I-9-I-12, 1992.
NEY, H., D. MERGEL, A. NOLL et al. "A data-driven organization of the dynamic programming beam search for continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 833-836, 1987.
———. "Data-driven search organization for continuous speech recognition," IEEE Transactions on Signal Processing, vol. 40, pp. 272-281, Feb. 1992.
NIEDERJOHN, R. J., and J. H. GROTELUESCHEN. "The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 277-282, Aug. 1976.
NILES, L. T., and H. F. SILVERMAN. "Combining hidden Markov models and neural network classifiers," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 417-420, 1990.
NILSSON, N. J. Learning Machines. New York: McGraw-Hill, 1965.
———. Problem Solving Methods in Artificial Intelligence. New York: McGraw-Hill, 1971.
NIRANJAN, M., and F. FALLSIDE. "Neural networks and radial basis functions in classifying static speech patterns," Technical Report CUED/F-INFENG/TR 22, Cambridge University, Cambridge, England, 1988.
NOBLE, B. Applied Linear Algebra. Englewood Cliffs, N.J.: Prentice Hall, 1969.
NOCERINO, N., F. K. SOONG, L. R. RABINER et al. "Comparative study of several distortion measures for speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., vol. 1, pp. 25-28, 1985.
NOLL, A. M. "Cepstrum pitch determination," Journal of the Acoustical Society of America, vol. 41, pp. 293-309, Feb. 1967.
OGATA, K. State Space Analysis of Control Systems. Englewood Cliffs, N.J.: Prentice Hall, 1967.
OGLESBY, J., and J. S. MASON. "Optimization of neural models for speaker identification," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 261-264, 1990.
———. "Radial basis function networks for speaker recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 393-396, 1991.
OPPENHEIM, A. V. "Generalized superposition," Information and Control, vol. 11, pp. 528-536, Nov.-Dec. 1967.
———. "A speech analysis-synthesis system based on homomorphic filtering," Journal of the Acoustical Society of America, vol. 45, pp. 458-465, 1969.
OPPENHEIM, A. V., and R. W. SCHAFER. "Homomorphic analysis of speech," IEEE Transactions on Audio and Electroacoustics, vol. 16, pp. 221-226, June 1968.
———. Discrete Time Signal Processing. Englewood Cliffs, N.J.: Prentice Hall, 1989.
OPPENHEIM, A. V., R. W. SCHAFER, and T. G. STOCKHAM, JR. "Nonlinear filtering of multiplied and convolved signals," Proceedings of the IEEE, vol. 56, pp. 1264-1291, Aug. 1968.
O'SHAUGHNESSY, D. Speech Communication: Human and Machine. Reading, Mass.: Addison-Wesley, 1987.
———. "Speech enhancement using vector quantization and a formant distance measure," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, pp. 549-552, 1988.
OSTENDORF, M., and S. ROUCOS. "A stochastic segment model for phoneme-based continuous speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1857-1869, Dec. 1989.
PAESELER, A. "Modification of Earley's algorithm for speech recognition." In H. Niemann et al., eds., Recent Advances in Speech Understanding and Dialog Systems. New York: Springer-Verlag, 1988.
PAESELER, A., and H. NEY. "Continuous speech recognition using a stochastic language model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 719-722, 1989.
PAEZ, M. D., and T. H. GLISSON. "Minimum mean squared error quantization in speech PCM and DPCM systems," IEEE Transactions on Communications, vol. 20, pp. 225-230, Apr. 1972.
PAGET, SIR RICHARD. Human Speech. London and New York: Harcourt, 1930.
PALIWAL, K. K. "On the performance of the quefrency-weighted cepstral coefficients in vowel recognition," Speech Communication: An Interdisciplinary Journal, vol. 1, pp. 151-154, May 1982.
———. "Evaluation of various linear prediction parametric representations in vowel recognition," Signal Processing, vol. 4, pp. 323-327, July 1982.
———. "Neural net classifiers for robust speech recognition under noisy environments," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 429-432, 1990.
PALIWAL, K. K., and A. AARSKOG. "A comparative performance evaluation of pitch estimation methods for TDHS/sub-band coding of speech," Speech Communication: An Interdisciplinary Journal, vol. 3, pp. 253-259, 1984.
PALIWAL, K. K., and A. BASU. "A speech enhancement method based on Kalman filtering," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 1, pp. 177-180, 1987.
PAPADIMITRIOU, C., and K. STEIGLITZ. Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs, N.J.: Prentice Hall, 1982.
PAPAMICHALIS, P. E. Practical Approaches to Speech Coding. Englewood Cliffs, N.J.: Prentice Hall, 1987.
PAPOULIS, A. Probability, Random Variables, and Stochastic Processes, 2nd ed. New York: McGraw-Hill, 1984.
PARKER, D. B. "Learning logic," Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, 1982.
PARSONS, T. W. "Separation of speech from interfering speech by means of harmonic selection," Journal of the Acoustical Society of America, vol. 60, no. 4, pp. 911-918, Oct. 1976.
———. Voice and Speech Processing. New York: McGraw-Hill, 1986.
PAUL, D. B. "An 800 bps adaptive vector quantization vocoder using a perceptual
distance measure," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Boston, vol. 1, pp. 73-76, 1983.
PAUL, D. B. "The LINCOLN robust continuous speech recognizer," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 449-452, 1989.
PAUL, D. B., R. P. LIPPMANN, R. P. CHEN et al. "Robust HMM-based techniques for recognition of speech produced under stress and in noise," Proceedings of Speech Tech, New York, 1986.
PAUL, D. B., and E. A. MARTIN. "Speaker stress-resistant continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 283-286, 1988.
PAWATE, B. I., M. L. MCMAHAN, R. H. WIGGINS et al. "Connected word recognizer on a multiprocessor system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 1151-1154, 1987.
PEELING, S. M., and R. K. MOORE. "Experiments in isolated digit recognition using the multi-layer perceptron," Technical Report No. 4073, Royal Signals and Radar Establishment, Malvern, Worcester, England, 1987.
PERKELL, J. Physiology of Speech Production. Research Monograph No. 53. Cambridge, Mass.: M.I.T. Press, 1969.
PERLMUTTER, Y. M., L. D. BRAIDA, R. H. FRAZIER et al. "Evaluation of a speech enhancement system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, Conn., pp. 212-215, 1977.
PETERSON, G. E., and H. L. BARNEY. "Control methods used in a study of the vowels," Journal of the Acoustical Society of America, vol. 24, pp. 175-184, 1952.
PETERSON, T. L., and S. F. BOLL. "Acoustic noise suppression in the context of a perceptual model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, Ga., pp. 1086-1088, 1981.
PICACHE, G. P. "A Givens rotation algorithm for single channel formant tracking and glottal waveform deconvolution." M.S. thesis, Northeastern University, Boston, 1988.
PICONE, J. "On modeling duration in context in speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 421-424, 1989.
———. "Continuous speech recognition using hidden Markov models," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 7, pp. 26-41, July 1990.
———. Written communication, 1992.
PINEDA, F. "Generalization of back-propagation to recurrent neural networks," Physical Review Letters, vol. 59, pp. 2229-2232, 1987.
———. "Generalization of back-propagation to recurrent and high-order neural networks." In D. Anderson, ed., Proceedings of the 1987 IEEE Conference on Neural Information Processing Systems: Natural and Synthetic. New York: American Institute of Physics, pp. 602-611, 1988.
POLYDOROS, A., and A. T. FAM. "The differential cepstrum: Definition and properties," Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 1, pp. 77-80, 1981.
PORITZ, A. B. "Hidden Markov models: A guided tour," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 7-13, 1988.
PORTNOFF, M. R. "Representations of signals and systems based on the short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 55-69, Feb. 1980.
PORTNOFF, M. R., and R. W. SCHAFER. "Mathematical considerations in digital simulations of the vocal tract" (abstract), Journal of the Acoustical Society of America, vol. 53, p. 294, Jan. 1973.
POTTER, R. K., G. A. KOPP, and H. G. KOPP. Visible Speech. New York: Van Nostrand, 1947. Reprinted: New York: Dover, 1966.
POWELL, G. A., P. DARLINGTON, and P. D. WHEELER. "Practical adaptive noise reduction in the aircraft cockpit environment," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., pp. 173-176, 1987.
PREUSS, R. D. "A frequency domain noise cancellation preprocessor for narrowband speech communications systems," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Washington, D.C., pp. 212-215, 1979.
PRICE, P. J., W. FISHER, J. BERNSTEIN et al. "A database for continuous speech recognition in a 1000-word domain," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 651-654, 1988.
PROAKIS, J. G. Digital Communications, 2nd ed. New York: McGraw-Hill, 1989.
PROAKIS, J. G., and D. G. MANOLAKIS. Digital Signal Processing: Principles, Algorithms, and Applications, 2nd ed. New York: Macmillan, 1992.
PROAKIS, J. G., C. RADER, F. LING et al. Advanced Topics in Signal Processing. New York: Macmillan, 1992.
PRUZANSKY, S. "Pattern-matching procedure for automatic talker recognition," Journal of the Acoustical Society of America, vol. 35, pp. 354-358, 1963.
PULSIPHER, D. C., S. F. BOLL, C. K. RUSHFORTH et al. "Reduction of nonstationary acoustic noise in speech using LMS adaptive noise cancelling," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Washington, D.C., pp. 204-208, 1979.
PUTNINS, Z. A., G. A. WILSON, I. KOMAR et al. "A multi-pulse LPC synthesizer for telecommunications use," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., pp. 989-992, 1985.
QUACKENBUSH, S. R. "Objective measures of speech quality." Ph.D. dissertation, Georgia Institute of Technology, 1985.
QUACKENBUSH, S. R., T. P. BARNWELL, and M. A. CLEMENTS. Objective Measures of Speech Quality. Englewood Cliffs, N.J.: Prentice Hall, 1988.
QUATIERI, T. F. "Minimum and mixed phase speech analysis-synthesis by adaptive homomorphic deconvolution," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 328-335, Aug. 1979.
QUENOT, G., J. L. GAUVAIN, J. J. GANGOLF et al. "A dynamic time warp VLSI
Markov models or templates," Computer Speech and Language, vol. 1, pp. 167-197, Dec. 1986.
RABINER, L. R., J. G. WILPON, A. QUINN et al. "On the application of embedded
p rocessor fo r continuo us speech recognition," Proceedings of the fE E t ; Interna. d igit traini ng to speaker independent co nnect ed d igit training," IEEE Transac
tiona I Conf erence on Acous tics, Speech, and Signal Processing, To kyo. Japan. tions on A coustics, Sp eech, an d Signal Processin g, vol , 32. pp. 272-280, Apr.
vel , 3, pp . 1549- 1552, 1986. .
1984.
RABINER , L. R . "On the use of autocorrelat io n a na lysis for p itch detection," R ABfNER, L. R., J. G. W ILPON, and F. K. SOONCi . "High performance connected
IE EE Transa ctio ns on A coustics, Speech, and Signal Processing, vol. 26. pp . digit recognition us ing hid d en Ma rko v models," fEEE Transac tion s all Acous
24-33, Feb. 1977.
tics, Sp eech. and Signal Processing, vel. 37, pp. 1214-\225, Aug. 1989.
-- . "O n creating reference templates for sp ea ker independent recognit io n of RADER, C. M., and S. SUNDAMURTHY. "Wafer scale systolic array for adaptive an
isolated words ," I EE E Transactions 0 11 Aco ustics, Speech. and Si gnal Process t enna processing," Proceedings of the IEEE ln ternational Conference on Acous
ing. vol, 26, pp, 34- 42, Feb. 1978.
tics. Sp eech , and S ignal Processin g. N ew York, vol, IV, pp. 2069-2071 , 1988.
- -- . "A tu torial o n hi d den Markov mod els and sele cted appl ications in speech RAHIM, M., and C. GOODYEAR . "Articulatory synthesis with the aid of a neural
recogn iti o n," Proceedings of the IEEE, vol, 77, pp. 257-2 85, Feb . 1989. net ." Proceedings of the IEEE Internauonal Conferen ce on Acoustics, Spee ch,
R AHI NER, L. R ., an d S. E. LEVINSON. " Isolated and co nnected word recognition: an d Signal Processing, Glasgow, Sc otland , vol . I, pp. 227-230, 1989.
Theo r y a nd selected a ppl icat io ns," IEE E Transactions 011 Com m unications, RAMESH, P., S. MTAGIRI, and C.-H. LEE. "A new connected word recognition al
vo l, 29, p p. 62 1- 659, May 198 1.
gori thm based on HMM/LVQ segmentation and LVQ classification," Proceed
RABINE R, L. R., S. E. LEVI NSO:-l, A. E. ROSENBERG et al. "Sp ea ke r-ind cpend e nt ings of the IEEE International Conference on Acoustics, Speech, and Signal
recognition of iso lated word s usin g clustering t echniques," IEEE Transacti ons Processing, Toronto, Canada, vol. I, pp. 113-116, 1991.
On Acoustics, Speech, and Signal Processing . vol. 27 . pp. 336-349. Aug. 1979. REDDY, D. R . "Speech recognition by machine: A review," Proceedings of the
R ·\ BINER, L. R , S. E. LEVINSON, and M . M . SOND HI. "On the application of vec IEEE, vol, 64 , pp , 501-531, Apr. 1976 . Reprinted in (Dixon and Martin,
tor quantization a nd hidd en Marko v models to speaker-independent isolated 1989) and (Waibel and Lee, 1990).
word recognition ," Bell S ystem Technical Journal, vol. 62, pp. 1075-1105, - - - . "Words into action II: A task oriented system," IEEE Spectrum, vol, 17.
1983. p p. 26-28 , June 1980.
RABINER, L R., A. E. ROSENBERG, and S. E. LEVI NSO l'-'. "Considerations in dy R EED, F. A., a nd P. L. F EINTLJCH. "A comparison of LMS adaptive cancellers im
namic time warping a lgorit h ms for discrete utterance recognit ion," IEEE ple me nt ed in the frequency domain and the time domain," IEEE Transactions
Transactions on ACOUSlics, Speech , and Signal Processing, vol. 26, pp. 575-582, on ACOLLHics. Speech, and Signal Processing, vol. 29, pp. 770-775, June 1981,
, Dec. 1978.
R£NALS, S.. N. MORGAN , M. COHEN et al. "Connectionist probability estimation
RABINER, L. R , a nd M. R SA1!BUR, "An algo r it h m for determining the endpoints in the DECIPHER speech recognition system ," Proceedin gs of the IEEE Inter
o f isolated uttera nces," Bell System Technical Journal, "-0 1. 54, pp . 297-3 \ S. national Conferen ce on Acoustics. Speech. and Signal Processing, San Fran
Fe b. 1975.
cisco, vol , I, pp. 1-60 1-1-604. 1992 .
R ABINER. L. R., M . SAMBUR, and C. SCHMIDT. "Applications of a non-linear RICHARDS, D. L., and J. SWAFf'IEUJ, "Assess me n t of speech communication
smoothing algorithm to speech processi ng;' IEEE Transaction s on ACOUSlics, links, " Proceedings of the lEE,. vol . 106, pp . 77-89. Mar. 1959.
S peech. and Signal Processin g, vol. 23, pp. 552-557, D ec. 1975.
RITEA, B. "Automati c speech understanding systems," Proceedings of the 11th
RABINER, L. R ., and R . W. SCHAFER. Digital Processing of Speech Signals.
IEEE Computer Society Conference. Washington , D.c.. pp. 3 J 9-322, 1975.
Englewood Cliffs , N .J.: Prent ice Hal l. 19 78.
ROBBINS, H" and S. MUNRO. '"A stochastic approx imation method," Annals of
RABIN ER, L. R . and J . G . WILPON . "Ap p lica tio ns of clustering tech n iq ues to
Math ematical Statistics, vol , 22, pp. 400-407 , 1951.
spea ke r-t ra ined iso lated wo rd rec ogn it io n." Bell System Technical Journal, vol.
ROE, D. B., A. L. GORIN, and P. RAMESH. "Incorporating syntax into the level building algorithm on a tree-structured parallel computer," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 778-781, 1989.
ROHWER, R., and B. FORREST. "Training time dependence in neural networks," Proceedings of the 1st IEEE International Conference on Neural Networks, San Diego, Calif., vol. II, pp. 701-708, 1987.
ROSENBERG, A. E. "Effects of glottal pulse shape on the quality of natural vowels," Journal of the Acoustical Society of America, vol. 49, pp. 583-590, Feb. 1971.
ROSENBERG, A. E., L. R. RABINER, J. G. WILPON et al. "Demisyllable-based isolated word recognition system," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 31, pp. 713-726, June 1983.
ROSENBLATT, F. "The perceptron: A perceiving and recognizing automaton," Cornell Aeronautical Laboratory Report 85-460-1, 1957.
———. Principles of Neurodynamics. Washington, D.C.: Spartan Books, 1962.
ROTHENBERG, M. "The glottal volume velocity waveform during loose and tight voiced glottal adjustment," Actes du 7ème Congrès International des Sciences Phonétiques, Montreal, Canada, pp. 380-388, 1972.
ROUCOS, S., and M. O. DUNHAM. "A stochastic segment model for phoneme-based continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 1, pp. 73-76, 1987.
ROUCOS, S., R. SCHWARTZ, and J. MAKHOUL. "Segment quantization for very-low-rate speech coding," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, pp. 1565-1569, 1982.
RUMELHART, D. E., G. E. HINTON, and R. J. WILLIAMS. "Learning internal representations by error propagation." Chapter 8 in D. E. Rumelhart and J. L. McClelland, eds., Parallel Distributed Processing, Vol. 1: Foundations. Cambridge, Mass.: M.I.T. Press, 1986.
RUSSELL, M. J., and R. K. MOORE. "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., vol. 1, pp. 5-8, 1985.
SAKOE, H. "Two-level DP matching: A dynamic programming based pattern recognition algorithm for connected word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 588-595, Dec. 1979.
SAKOE, H., and S. CHIBA. "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, pp. 43-49, Feb. 1978.
SAKOE, H., R. ISOTANI, K. YOSHIDA et al. "Speaker-independent word recognition using dynamic programming neural networks," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 29-32, 1989.
SAMAD, T., and P. HARPER. "Associative memory storage using a variant of the generalized delta rule," Proceedings of the 1st IEEE International Conference on Neural Networks, San Diego, Calif., vol. III, pp. 173-184, 1987.
SAMBUR, M. R. "LMS adaptive filtering for enhancing the quality of noisy speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., pp. 610-613, Apr. 1978.
———. "Adaptive noise canceling for speech signals," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 5, pp. 419-423, Oct. 1978.
SAMBUR, M. R., and L. R. RABINER. "A statistical decision approach to recognition of connected digits," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 550-558, 1976.
SAWAI, H. "TDNN-LR continuous speech recognition system using adaptive incremental TDNN training," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 53-56, 1991.
SCHAFER, R. W. "Echo removal by generalized linear filtering." Ph.D. dissertation, Massachusetts Institute of Technology, 1968.
SCHAFER, R. W., and J. D. MARKEL, eds. Speech Analysis. New York: John Wiley & Sons, 1979.
SCHAFER, R. W., and L. R. RABINER. "System for automatic formant analysis of voiced speech," Journal of the Acoustical Society of America, vol. 47, pp. 634-648, Feb. 1970.
SCHARF, B. "Critical bands." In J. V. Tobias, ed., Foundations of Modern Auditory Theory. New York: Academic Press, pp. 157-202, 1970.
SCHROEDER, M. R. "Vocoders: Analysis and synthesis of speech," Proceedings of the IEEE, vol. 54, pp. 720-734, May 1966.
———. "Period histogram and product spectrum: New methods for fundamental frequency measurement," Journal of the Acoustical Society of America, vol. 43, pp. 829-834, Apr. 1968.
———. "Recognition of complex acoustic signals," Life Science Research Reports, vol. 55, pp. 323-328, 1977.
SCHROEDER, M. R., and B. S. ATAL. "Generalized short-time power spectra and autocorrelation," Journal of the Acoustical Society of America, vol. 34, pp. 1679-1683, Nov. 1962.
———. "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., pp. 937-940, 1985.
SCHWARTZ, R., S. AUSTIN, F. KUBALA et al. "New uses for the N-best sentence hypotheses within the BYBLOS speech recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, vol. 1, pp. I-1-I-4, 1992.
SCHWARTZ, R. M., and Y. L. CHOW. "The N-best algorithm: An efficient and exact procedure for finding the N most likely sentence hypotheses," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 81-84, 1990.
SCHWARTZ, R. M., Y. L. CHOW, O. A. KIMBALL et al. "Context dependent modeling for acoustic-phonetic recognition of continuous speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., pp. 1205-1208, 1985.
SCHWARTZ, R. M., Y. L. CHOW, S. ROUCOS et al. "Improved hidden Markov modelling of phonemes for continuous speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., vol. 3, paper 35.6, 1984.
SCHWARTZ, R. M., J. KLOVSTAD, J. MAKHOUL et al. "A preliminary design of a phonetic vocoder based on a diphone model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Denver, Colo., pp. 32-35, 1980.
SCORDILIS, M. S., and J. N. GOWDY. "Neural network based generation of fundamental frequency contours," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 219-222, 1989.
SEJNOWSKI, T. J., and C. R. ROSENBERG. "NETtalk: A parallel network that learns to read aloud," Technical Report JHU/EECS-86/01, Johns Hopkins University, 1986.
SHANNON, C. E. "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, 1948.
———. "A coding theorem for a discrete source with a fidelity criterion," IRE National Convention Record, part 4, pp. 142-163, Mar. 1959.
SHERWOOD, L. Human Physiology. St. Paul, Minn.: West Publishing, 1989.
SHIEBER, S. M. "An introduction to unification-based approaches to grammar." CSLI Lecture Notes no. 4, Center for the Study of Language and Information, Stanford University, 1986.
SHIKANO, K. "Evaluation of LPC spectral matching measures for phonetic unit recognition" (technical report), Computer Science Department, Carnegie Mellon University, May 1985.
SHORE, J. E., and D. K. BURTON. "Discrete utterance speech recognition without time alignment," IEEE Transactions on Information Theory, vol. 29, pp. 473-491, 1983.
SILVERMAN, H. F., and D. P. MORGAN. "The application of dynamic programming to connected speech recognition," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 7, pp. 6-25, July 1990.
SIMPSON, P. K. Artificial Neural Systems. New York: Pergamon Press, 1990.
SINGER, E., and R. LIPPMANN. "A speech recognizer using radial basis function neural networks in an HMM framework," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, vol. 1, pp. I-629-I-632, 1992.
SINGHAL, S., and B. S. ATAL. "Improving performance of multi-pulse LPC coders at low bit rates," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., vol. 1, pp. 131-134, 1984.
SMITH, M. J. T., and T. P. BARNWELL. "A procedure for designing exact reconstruction filter banks for tree structured subband coders," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., pp. 27.1.1-27.1.4, Mar. 1984.
SONDHI, M. M. "An adaptive echo canceller," Bell System Technical Journal, vol. 46, pp. 497-511, 1967.
———. Closed Loop Adaptive Echo Canceller Using Generalized Filter Networks. U.S. Patent 3,499,999, Mar. 10, 1970.
———. "Model for wave propagation in a lossy vocal tract," Journal of the Acoustical Society of America, vol. 55, pp. 1070-1075, May 1974.
———. "New methods of pitch extraction," IEEE Transactions on Audio and Electroacoustics, vol. 16, pp. 262-268, June 1968.
———. "Resonances of a bent vocal tract," Journal of the Acoustical Society of America, vol. 79, pp. 1113-1116, Apr. 1986.
SOONG, F. K., and B.-H. JUANG. "Line spectrum pair and speech compression," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, Calif., vol. 1, pp. 1.10.1-4, 1984.
SOONG, F. K., and A. E. ROSENBERG. "On the use of instantaneous and transitional spectral information in speaker recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, vol. 2, pp. 877-880, 1986.
STANKOVIC, S. S., and M. M. MILOSAVLJEVIC. "Training of multi-layer perceptrons by stochastic approximation." Chapter 7 in Vol. IV of P. Antognetti and V. Milutinovic, eds., Neural Networks: Concepts, Applications, and Implementations. Englewood Cliffs, N.J.: Prentice Hall, 1991.
STEIGLITZ, K. "On the simultaneous estimation of poles and zeros in speech analysis," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, pp. 429-433, Oct. 1977.
STEIGLITZ, K., and B. DICKINSON. "The use of time domain selection for improved linear prediction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, pp. 34-39, Jan. 1977.
STEINBISS, V., A. NOLL, A. PAESELER et al. "A 10,000-word continuous speech recognition system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 57-60, 1990.
STEPHENS, R. W. B., and A. E. BATE. Acoustics and Vibrational Physics. New York: St. Martin's Press, 1966.
STEVENS, K. N., and A. S. HOUSE. "An acoustical theory of vowel production and some implications," Journal of Speech and Hearing Research, vol. 4, p. 303, 1961.
———. "Development of a quantitative description of vowel articulation," Journal of the Acoustical Society of America, vol. 27, pp. 484-493, 1955.
STEVENS, S. S., and J. VOLKMANN. "The relation of pitch to frequency," American Journal of Psychology, vol. 53, p. 329, 1940.
STEWART, J. Q. "An electrical analogue of the vocal organs," Nature, vol. 110, p. 311, 1922.
STOCKHAM, T. G. "The application of generalized linearity to automatic gain control," IEEE Transactions on Audio and Electroacoustics, vol. 16, pp. 828-842, June 1968.
STROBACH, P. "New forms of Levinson and Schur algorithms," IEEE Signal Processing Magazine, pp. 12-36, Jan. 1991.
STRUBE, H. W. "Determination of the instant of glottal closure from the speech wave," Journal of the Acoustical Society of America, vol. 56, pp. 1625-1629, 1974.
SUGAMURA, N., and F. ITAKURA. "Speech data compression by LSP analysis-synthesis technique," Transactions of the Institute of Electronics, Information, and Computer Engineers, vol. J64-A, pp. 599-606, 1981.
SUGIYAMA, M., and K. SHIKANO. "LPC peak weighted spectral matching measure," Transactions of the Institute of Electronics, Information, and Computer Engineers, vol. J64-A, pp. 409-416, 1981.
SUN, G. Z., H. H. CHEN, Y. C. LEE et al. "Recurrent neural networks, hidden Markov models, and stochastic grammars," Proceedings of the International Joint Conference on Neural Networks, San Diego, Calif., vol. I, pp. 729-734, June 1990.
SWINGLER, D. N. "Frequency errors in MEM processing," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 257-259, Apr. 1980.
TAKAHASHI, J. I., S. HATTORI, T. KIMURA et al. "A ring array processor architecture for highly parallel dynamic time warping," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 1302-1309, Oct. 1986.
TAMURA, S. "An analysis of a noise reduction neural network," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 3, pp. 2001-2004, 1989.
TAMURA, S., and A. WAIBEL. "Noise reduction using connectionist models," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 553-556, 1988.
TANAKA, E., and K. S. FU. "Error correcting parsers for formal languages," IEEE Transactions on Computers, vol. 27, pp. 605-615, July 1978.
TANIGUCHI, T., S. UNAGAMI, and R. M. GRAY. "Multimode coding: Applications to CELP," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 156-159, 1989.
TATE, C. N., and C. C. GOODYEAR. "Note on the convergence of linear predictive filters, adapted using the LMS algorithm," IEE Transactions, vol. 130, pp. 61-64, Apr. 1983.
TEAGER, H. M. "Some observations on oral air flow during phonation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 599-601, Oct. 1980.
TEAGER, H. M., and S. M. TEAGER. "A phenomenological model for vowel production in the vocal tract." In R. G. Daniloff, ed., Speech Sciences: Recent Advances. San Diego, Calif.: College-Hill Press, pp. 73-109, 1983.
———. "Evidence for nonlinear production mechanisms in the vocal tract," NATO Advanced Study Institute, Speech Production and Modelling, Château Bonas, France, July 17-29, 1989. Also in W. J. Hardcastle and A. Marchal, eds., Proceedings of the NATO ASI. Norwell, Mass.: Kluwer, 1990.
TEBELSKIS, J., and A. WAIBEL. "Large vocabulary recognition using linked predictive neural networks," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 437-440, 1990.
THOMAS, I. B. "The influence of first and second formants on the intelligibility of clipped speech," Journal of the Audio Engineering Society, vol. 16, pp. 182-185, Apr. 1968.
THOMAS, I. B., and R. J. NIEDERJOHN. "Enhancement of speech intelligibility at high noise levels by filtering and clipping," Journal of the Audio Engineering Society, vol. 16, pp. 412-415, Oct. 1968.
———. "The intelligibility of filtered-clipped speech in noise," Journal of the Audio Engineering Society, vol. 18, pp. 299-303, June 1970.
THOMAS, I. B., and W. J. OHLEY. "Intelligibility enhancement through spectral weighting," IEEE Conference on Speech Communications and Processing, pp. 360-363, 1972.
THOMAS, I. B., and A. RAVINDRAN. "Intelligibility enhancement of already noisy speech signals," Journal of the Audio Engineering Society, vol. 22, pp. 234-236, May 1974.
TIMKE, R. H., H. VON LEDEN, and P. MOORE. "Laryngeal vibrations: Measurement of the glottic wave," American Medical Association Archives of Otolaryngology, vol. 68, pp. 1-19, July 1958.
TOHKURA, Y. "A weighted cepstral distance measure for speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 1414-1422, Oct. 1987.
TREMAIN, T. E. "The government standard linear predictive coding algorithm: LPC-10," Speech Technology, vol. 1, pp. 40-49, Apr. 1982.
TRIBOLET, J. M. "A new phase unwrapping algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, pp. 170-177, Apr. 1977.
TRIBOLET, J. M., P. NOLL, B. J. McDERMOTT et al. "A study of complexity and quality of speech waveform coders," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Okla., pp. 586-590, 1978.
TSENG, H. P., M. J. SABIN, and E. A. LEE. "Fuzzy vector quantization applied to hidden Markov modeling," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, Tex., vol. 2, pp. 641-644, 1987.
TSYPKIN, YA. Z. Foundations of the Theory of Learning Systems. Translated by Z. J. Nikolic. Orlando, Fla.: Academic Press, 1973.
UN, C. K., and K. Y. CHOI. "Improving LPC analysis of noisy speech by autocorrelation subtraction method," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, Ga., pp. 1082-1085, 1981.
VAIDYANATHAN, P. P. "Quadrature mirror filter banks, M-band extensions and perfect reconstruction techniques," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 4, pp. 4-20, July 1987.
———. "Multirate digital filters, filter banks, polyphase networks and applications," Proceedings of the IEEE, vol. 78, pp. 56-93, Jan. 1990.
VARNER, L. W., T. A. MILLER, and T. E. EGER. "A simple adaptive filtering technique for speech enhancement," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Boston, pp. 24.3.1-4, 1983.
VEENEMAN, D. E., and S. L. BEMENT. "Automatic glottal inverse filtering of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, pp. 369-377, Apr. 1985.
VENKATESH, C. G., J. R. DELLER, and C. C. CHIU. "A graph partitioning approach to signal decoding" (technical report), Speech Processing Laboratory, Department of Electrical Engineering, Michigan State University, Aug. 1990.
VERHELST, W., and O. STEENHAUT. "A new model for the complex cepstrum of voiced speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 43-51, Feb. 1986.
———. "On short-time cepstra of voiced speech," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, vol. 1, pp. 311-314, 1988.
VINTSYUK, T. K. "Element-wise recognition of continuous speech composed of words from a specified dictionary," Kibernetika, vol. 7, pp. 133-143, Mar.-Apr. 1971.
VISWANATHAN, R., and J. MAKHOUL. "Quantization properties of the transmission parameters in linear predictive systems," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 309-321, June 1975.
VISWANATHAN, V. R., J. MAKHOUL, and W. H. RUSSELL. "Towards perceptually consistent measures of spectral distance," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, pp. 485-488, 1976.
VISWANATHAN, V. R., W. H. RUSSELL, and A. W. HUGGINS. "Objective speech quality evaluation of medium and narrowband real-time speech coders," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Boston, pp. 543-546, 1983.
VITERBI, A. J. "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, pp. 260-269, Apr. 1967.
VITERBI, A. J., and J. K. OMURA. Principles of Digital Communication and Coding. New York: McGraw-Hill, 1979.
VOIERS, W. D. "Perceptual bases of speaker identity," Journal of the Acoustical Society of America, vol. 36, pp. 1065-1073, June 1964.
———. "Diagnostic acceptability measure for speech communication systems," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, Conn., pp. 204-207, 1977.
———. "Diagnostic evaluation of speech intelligibility." In M. E. Hawley, ed., Speech Intelligibility and Speaker Recognition. Stroudsburg, Pa.: Dowden, Hutchinson, and Ross, pp. 374-387, 1977.
———. "Methods of predicting user acceptance of voice communications systems." Final Report, Dynastat, Inc., DCA100-74-C-0056, July 1976.
———. "Interdependencies among measures of speech intelligibility and speech quality," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Denver, Colo., pp. 703-705, 1980.
WAGNER, K. W. "Ein neues elektrisches Sprechgerät zur Nachbildung der menschlichen Vokale," Abhandl. d. Preuss. Akad. d. Wissenschaft, 1936.
WAIBEL, A., T. HANAZAWA, G. HINTON et al. "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 328-339, Mar. 1989. Reprinted in (Waibel and Lee, 1990).
WAIBEL, A., and K.-F. LEE, eds. Readings in Speech Recognition. Palo Alto, Calif.: Morgan Kaufmann, 1990.
WAIBEL, A., H. SAWAI, and K. SHIKANO. "Consonant recognition by modular construction of large phonemic time-delay neural networks," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 1, pp. 112-115, 1989. Reprinted in (Waibel and Lee, 1990).
———. "Modularity and scaling in large phonemic time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1888-1898, Dec. 1989.
WANG, D. L., and J. S. LIM. "The unimportance of phase in speech enhancement," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, pp. 679-681, Aug. 1982.
WATROUS, R., B. LADENDORF, and G. KUHN. "Complete gradient optimization of a recurrent network applied to /b/, /d/, /g/ discrimination," Journal of the Acoustical Society of America, vol. 87, pp. 1301-1309, Mar. 1990.
WATROUS, R., and L. SHASTRI. "Learning phonetic features using connectionist networks," Proceedings of the 1st IEEE International Conference on Neural Networks, San Diego, Calif., vol. IV, pp. 381-388, 1987.
WIENER, N. Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Cambridge, Mass.: M.I.T. Press, 1949.
WEINTRAUB, M., H. MURVEIT, M. COHEN et al. "Linguistic constraints in hidden Markov model based speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, Scotland, vol. 2, pp. 699-702, 1989.
WEISS, M. R., and E. ASCHKENASY. "Computerized audio processor," Rome Air Development Center, Final Report RADC-TR-83-109, May 1983.
WEISS, M. R., E. ASCHKENASY, and T. W. PARSONS. "Study and development of the INTEL technique for improving speech intelligibility," Nicolet Scientific Corp., Final Report NSC-FR/4023, Dec. 1974.
WERBOS, P. "Beyond regression: New tools for prediction and analysis in the behavioral sciences." Ph.D. dissertation, Harvard University, 1974.
The Scientific Papers of Sir Charles Wheatstone, London and Westminster Review, vol. 28, 1879. Also Proceedings of the British Association of Advanced Science Notices, p. 14, 1835.
WHITE, H. "Learning in artificial neural networks: A statistical perspective," Neural Computation, vol. 1, pp. 425-469, 1989.
WIDROW, B., J. R. GLOVER, J. M. MCCOOL et al. "Adaptive noise cancelling: Principles and applications," Proceedings of the IEEE, vol. 63, pp. 1692-1716, Dec. 1975.
Bibliography 897
WIDROW, B., and M. E. HOFF. "Adaptive switching circuits," IRE WESCON Convention Record, pp. 96-104, 1960.
WIDROW, B., J. M. MCCOOL, and M. BALL. "The complex LMS algorithm," Proceedings of the IEEE, vol. 63, pp. 719-720, Apr. 1975.
WIDROW, B., J. M. MCCOOL, M. G. LARIMORE et al. "Stationary and nonstationary learning characteristics of the LMS adaptive filter," Proceedings of the IEEE, vol. 64, pp. 1151-1162, Aug. 1976.
WIDROW, B., P. MANTEY, L. GRIFFITHS et al. "Adaptive antenna systems," Proceedings of the IEEE, vol. 55, pp. 2143-2159, Dec. 1967.
———. "Adaptive filters." In R. Kalman and N. DeClaris, eds., Aspects of Network and System Theory. New York: Holt, Rinehart and Winston, pp. 563-587, 1971.
WIDROW, B., and S. D. STEARNS. Adaptive Signal Processing. Englewood Cliffs, N.J.: Prentice Hall, 1985.
WILLIAMS, R., and D. ZIPSER. "A learning algorithm for continually running fully recurrent neural networks," ICS Report 8805, University of California at San Diego, 1988.
WILPON, J. G., C.-H. LEE, and L. R. RABINER. "Improvements in connected digit recognition using higher order spectral and energy features," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, vol. 1, pp. 349-352, 1991.
WILPON, J. G., R. P. MIKKILINENI, D. B. ROE et al. "Speech recognition: From the laboratory to the real world," AT&T System Technical Journal, vol. 69, pp. 14-24, Oct. 1990.
WILPON, J. G., L. R. RABINER, and T. B. MARTIN. "An improved word detection algorithm for telephone quality speech incorporating both syntactic and semantic constraints," AT&T System Technical Journal, vol. 63, pp. 479-497, 1984.
WINTZ, P. A. "Transform picture coding," Proceedings of the IEEE, vol. 60, pp. 880-920, July 1972.
WOLF, J. J., and W. A. WOODS. "The HWIM speech understanding system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, Conn., vol. 2, pp. 784-787, 1977.
———. "The HWIM speech understanding system." In W. A. Lea, ed., Trends in Speech Recognition. Englewood Cliffs, N.J.: Prentice Hall, 1980.
WONG, D. Y., J. D. MARKEL, and A. H. GRAY. "Least squares glottal inverse filtering from the acoustic speech waveform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 350-355, Aug. 1979.
WONG, E., and B. HAJEK. Stochastic Processes in Engineering Systems. New York: Springer-Verlag, 1984.
WOODS, W. A. "Transition network grammars for natural language analysis," Communications of the Association for Computing Machinery, vol. 13, pp. 591-606, Oct. 1970.
———. "Language processing for speech understanding." Chapter 12 in F. Fallside and W. A. Woods, eds., Computer Speech Processing. Englewood Cliffs, N.J.: Prentice Hall, 1983. Reprinted in (Waibel and Lee, 1990).
XUE, Q., Y.-H. HU, and P. MILENKOVIC. "Analysis of the hidden units of the multi-layer perceptron and its application in acoustic-to-articulatory mapping," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 2, pp. 869-872, 1990.
YONG, M., and A. GERSHO. "Vector excitation coding with dynamic bit allocation," Proceedings of the IEEE GLOBECOM, pp. 290-294, Dec. 1988.
YOUNGER, D. H. "Recognition and parsing of context-free languages in time n³," Information and Control, vol. 10, pp. 189-208, 1967.
ZELINSKI, R., and P. NOLL. "Adaptive transform coding of speech signals," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, pp. 299-309, Aug. 1977.
ZINSER, R. L., G. MIRCHANDANI, and J. B. EVANS. "Some experimental and theoretical results using a new adaptive filter structure for noise cancellation in the presence of crosstalk," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Fla., pp. 32.6.1-4, 1985.
ZUE, V. "The use of speech knowledge in automatic speech recognition," Proceedings of the IEEE, vol. 73, pp. 1602-1615, Nov. 1985.
ZUE, V., J. GLASS, D. GOODINE et al. "The Voyager speech understanding system: Preliminary development and evaluation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, N.M., vol. 1, pp. 73-76, 1990.
ZUE, V., and M. LAFERRIERE. "Acoustic study of medial /t, d/ in American English," Journal of the Acoustical Society of America, vol. 66, pp. 1039-1050, 1979.
ZWICKER, E. "Subdivision of the audible frequency range into critical bands," Journal of the Acoustical Society of America, vol. 33, p. 248, Feb. 1961.
Index
Adaptive noise canceling (ANC), 505, 528
ARPAbet, 117
Backtracking algorithm, 630, 696
Bellman optimality principle (BOP), 623, 629, 660, 694, 741
Buccal cavity, 101
Clustering, 70
  dynamic, 71
  hierarchical, 71
Code-excited linear prediction (CELP), 474-476, 489
Coherence, 567
Compandor, 437
Complex cepstrum: see Cepstrum
Cosine transform, 359
Cross-correlation, 48
Cross-covariance, 48
Cross entropy: see Discrimination information
CSELT system, 799
Databases, 790
DECIPHER system, 798, 846
Deleted interpolation, 729
Delta, or differenced, cepstrum, 385
Delta modulation (DM), 435, 444-446, 544
Discrete utterance recognition: see Isolated word recognition
Distinctive features, 573
Distortion-rate function, 416, 418-424, 426-427
Endpoint detection, 246-251, 609, 611
Entropy coding, 418
Event, 30
Excitation types, 110, 159
  unvoiced, 110, 159
Fenones, 724, 732-733, 795
Field (of events), 30
Filter bank method, 259, 451-452, 589
Formant vocoder, 469-471
Gaussian pdf, 36
Glottis, 110
  open phase of, 112
Half-band filter, 497
Harmonic product spectrum (HPS), 261
HARPY, 606, 616, 618, 714, 791-792
Hearing [see also Auditory system], 85
HMM-LR system, 786, 798-799, 846
Implosive sounds, 115
Independence: see Statistical independence
Information, 74
Inner product, 41
In-place computation (in DTW), 650
Itakura distance: see Distance measures
Kang-Coulter procedure, 338
Language, formal, 746-749
Language processor or decoder: see Linguistic decoder
Lattice structures, 302-309, 471-472
  Burg, 307-309
  relation to lossless tube model, 306
LINCOLN system, 798
Linguistics, 85, 99
Lloyd algorithm, 71
LU decomposition, 309
Mandible, 101
Mel, 380
Modified rhyme test (MRT), 571, 573
Moments (statistical), 39
  joint central, 40
Moore form HMM, 680
Mu-law compander, 436-437
Nasal cavity, 101, 136
Nasal tract, 101-103, 175
NETtalk, 826
Neural network: see Artificial neural network
Norm
  Chebyshev or l∞, 57
  l2, or Euclidean, 41
  Minkowski or lp, 56
Normalized frequency, 6
Normalized time, 6
Open phase (of glottis), 112
Orthogonality principle, 278-279, 350, 520
Outer product, 41
Paired acceptability rating (PAR), 572
Pattern recognition
  statistical, 55
Pausing, 610
Perplexity, 614, 749-751, 791-801, 802
Phone, 115
Phonemic transcription, 116-118
Phonetic discriminants, 800
Phonetic transcription, 116-118
Pole-zero model, 200, 527
Power
  short-term, 226, 228, 230-231, 246
Probabilistic separability measure, 67
Probability
  conditional, 36, 38
  joint, 31
Quality acceptance rating test (QUART), 572
Rahmonics, 159
Random process, 42
Random vectors, 40
  Gaussian, 41
Real cepstrum: see Cepstrum
Realization, 44
Root power sums measure, 378
Sample function, 44
Sample point, 30
Seed models, 776
Semivowels, 128
Short-time features: see Short-term features
SIFT algorithm, 333-336
Sigma-algebra (of events), 31
Sigmoid function: see Threshold functions
Source modeling: see Excitation
Spectrogram, 109, 146
State space structure, 21, 348
State transition matrix (for HMM), 681
Statistical expectation, 39
  conditional, 40
  of random processes, 44
Steady-state response, 79
Subband coding (SBC), 453-455
Subglottal pressure, 111, 160
Terminals (of a formal language), 747
Tied states (HMM), 729, 743
Time alignment, 634
Time-delay neural network, 839-841, 846
TIMIT database, 790
Toeplitz operator (matrix), 291
Trachea, 101
Trigram model: see N-gram model
Tri-POS model, 786
Unification grammar, 788
Uniform quantizer, 413
Variance, 39
Whisper, 110
Windows, 16, 231
  Blackman, 19
  Hanning, 19
  Kaiser, 19
  rectangular, 18
Zero crossing measure, 245, 246-251