Incorporating Knowledge Sources Into Statistical Speech Recognition

This document provides an overview of a book about incorporating knowledge sources into statistical speech recognition. The book presents a graphical framework called GFIKS that allows various knowledge sources to be incorporated into hidden Markov models (HMMs) for acoustic modeling in automatic speech recognition (ASR) systems. The framework uses Bayesian networks to represent the probabilistic relationships between different knowledge sources, such as background noise, accent, gender, and phonetic knowledge. This allows a simplified joint probability model to be constructed and estimated using limited training data, while maintaining performance improvements over traditional HMM-based ASR. The book evaluates the approach on large-vocabulary continuous speech recognition tasks.

Incorporating Knowledge Sources into Statistical Speech Recognition

Lecture Notes in Electrical Engineering


Incorporating Knowledge Sources into Statistical Speech Recognition. Sakti, Sakriani; Markov, Konstantin; Nakamura, Satoshi; Minker, Wolfgang. 978-0-387-85829-6
Intelligent Technical Systems. Martínez Madrid, Natividad; Seepold, Ralf E.D. (Eds.). 978-1-4020-9822-2
Languages for Embedded Systems and their Applications. Radetzki, Martin (Ed.). 978-1-4020-9713-3
Multisensor Fusion and Integration for Intelligent Systems. Lee, Sukhan; Ko, Hanseok; Hahn, Hernsoo (Eds.). 978-3-540-89858-0
Designing Reliable and Efficient Networks on Chips. Murali, Srinivasan. 978-1-4020-9756-0
Trends in Communication Technologies and Engineering Science. Ao, Sio-Iong; Huang, Xu; Wai, Ping-kong Alexander (Eds.). 978-1-4020-9492-7
Functional Design Errors in Digital Circuits: Diagnosis, Correction and Repair. Chang, Kai-hui; Markov, Igor; Bertacco, Valeria. 978-1-4020-9364-7
Traffic and QoS Management in Wireless Multimedia Networks: COST 290 Final Report. Koucheryavy, Y.; Giambene, G.; Staehle, D.; Barcelo-Arroyo, F.; Braun, T.; Siris, V. (Eds.). 978-0-387-85572-1
Proceedings of the 3rd European Conference on Computer Network Defense. Siris, V.; Ioannidis, S.; Anagnostakis, K.; Trimintzios, P. (Eds.). 978-0-387-85554-7
Intelligentized Methodology for Arc Welding Dynamical Processes: Visual Information Acquiring, Knowledge Modeling and Intelligent Control. Chen, Shan-Ben; Wu, Jing. 978-3-540-85641-2
Proceedings of the European Computing Conference: Volume 2. Mastorakis, Nikos; Mladenov, Valeri; Kontargyri, Vassiliki T. (Eds.). 978-0-387-84818-1
Proceedings of the European Computing Conference: Volume 1. Mastorakis, Nikos; Mladenov, Valeri; Kontargyri, Vassiliki T. (Eds.). 978-0-387-84813-6
Electronics System Design Techniques for Safety Critical Applications. Sterpone, Luca. 978-1-4020-8978-7
Data Mining and Applications in Genomics. Ao, Sio-Iong. 978-1-4020-8974-9
Continued after index

Sakriani Sakti Konstantin Markov Satoshi Nakamura Wolfgang Minker

Incorporating Knowledge Sources into Statistical Speech Recognition

Sakriani Sakti
NICT/ATR Spoken Language Communication Research Laboratories, Keihanna Science City, Kyoto, Japan

Konstantin Markov
NICT/ATR Spoken Language Communication Research Laboratories, Keihanna Science City, Kyoto, Japan

Satoshi Nakamura
NICT/ATR Spoken Language Communication Research Laboratories, Keihanna Science City, Kyoto, Japan

Wolfgang Minker
University of Ulm, Ulm, Germany

ISBN 978-0-387-85829-6
e-ISBN 978-0-387-85830-2
DOI: 10.1007/978-0-387-85830-2

Library of Congress Control Number: 2008942803

© Springer Science+Business Media, LLC 2009
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper.

springer.com

This book is dedicated to our parents and families for their support and endless love

Preface

State-of-the-art automatic speech recognition (ASR) systems use statistical data-driven methods based on hidden Markov models (HMMs). Although such approaches have proved to be efficient choices, ASR systems often perform much worse than human listeners, especially in the presence of unexpected acoustic variability. To improve performance, we usually rely on collecting more data to train more detailed models. However, such resources are rarely available, since variability in speech arises from many different factors, and thus a huge amount of training data would be required to cover all possible variabilities. In other words, it is not enough to handle these variabilities by relying solely on statistical models. The systems need additional knowledge about speech that could help to handle these sources of variability; otherwise, only a limited level of success can be achieved. Many researchers are aware of this problem, and various attempts have been made to integrate knowledge-based and statistical approaches more explicitly. However, incorporating various additional knowledge sources often leads to a complicated model, where achieving optimal performance is not feasible due to insufficient resources or data sparseness. As a result, input space resolution may be lost due to non-robust estimates and the increased number of unseen patterns. Moreover, decoding with large models may also become cumbersome and sometimes even impossible. This book addresses the problem of developing efficient ASR systems that can maintain a balance between utilizing wide-ranging knowledge of speech variability and keeping the training/recognition effort feasible, while also improving speech recognition performance. It provides an efficient general framework to incorporate additional knowledge sources into state-of-the-art statistical ASR systems.
The framework can be applied to many existing ASR problems with their respective model-based likelihood functions in flexible ways. Since there are various types of knowledge sources from different domains, it may be difficult to formulate a probabilistic model without learning the dependencies between the sources. To solve such problems in a unified way, the


work reported in this book adopts the Bayesian network (BN) framework. This approach allows the probabilistic relationships between information sources to be learned. Another advantage of the BN framework lies in the fact that it facilitates the decomposition of the joint probability density function (PDF) into a linked set of local conditional PDFs, based on the junction tree algorithm. Consequently, a simplified form of the model can be constructed and reliably estimated using a limited amount of training data. This book focuses on the acoustic modeling problem, arguably the central part of any speech recognition system. The incorporation of various knowledge sources, including background noise, accent, gender, and wide phonetic knowledge information, into the modeling is also discussed. Such an application often suffers from data sparseness and memory constraints. First, the additional knowledge sources are incorporated at the HMM state distribution; then, they are incorporated at the level of HMM phonetic modeling. The presented approaches are experimentally verified on large-vocabulary continuous-speech recognition (LVCSR) tasks. The book closes with a summary of the described methods and the results of the evaluations.
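As a toy illustration of the decomposition idea sketched above (a minimal sketch with hypothetical names and parameter values, not code from the book), the joint PDF of an observation X, an HMM state Q, and one extra knowledge variable K such as gender can be factored as P(X, K | Q) = P(X | Q, K) P(K | Q), so the usual state output likelihood P(X | Q) is recovered by marginalizing K out of the local conditional PDFs:

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical parameters for a single HMM state: the knowledge variable K
# (here gender, "female"/"male") selects a different Gaussian P(X | Q, K),
# and P(K | Q) weights them -- the BN factorization
# P(X, K | Q) = P(X | Q, K) * P(K | Q).
state = {
    "female": {"p_k_given_q": 0.5, "mean": 1.0, "var": 0.5},
    "male":   {"p_k_given_q": 0.5, "mean": -1.0, "var": 0.5},
}

def state_output_likelihood(x):
    """P(x | Q): marginalize the knowledge variable K out of the local PDFs."""
    return sum(k["p_k_given_q"] * gaussian_pdf(x, k["mean"], k["var"])
               for k in state.values())

print(state_output_likelihood(0.0))
```

In the HMM/BN models discussed in the book, this role is played by Gaussians indexed by knowledge values such as gender, accent, noise type, or wide phonetic context; the two-component case above merely shows how a set of small local conditional PDFs recombines into a single state likelihood without ever estimating the full joint model directly.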

Contents

1 Introduction and Book Overview  1
1.1 Automatic Speech Recognition - A Way of Human-Machine Communication  1
1.2 Approaches to Speech Recognition  4
1.2.1 Knowledge-based Approaches  4
1.2.2 Corpus-based Approaches  6
1.3 State-of-the-art ASR Performance  7
1.4 Studies on Incorporating Knowledge Sources  10
1.4.1 Sources of Variability in Speech  10
1.4.2 Existing Ways of Incorporating Knowledge Sources  12
1.4.3 Major Challenges to Overcome  15
1.5 Book Outline  16
2 Statistical Speech Recognition  19
2.1 Pattern Recognition Overview  19
2.2 Theory of Hidden Markov Models  22
2.2.1 Markov Chain  22
2.2.2 General Form of an HMM  23
2.2.3 Principle Cases of HMM  25
2.3 Pattern Recognition for HMM-Based ASR Systems  35
2.3.1 Front-end Feature Extraction  36
2.3.2 HMM-Based Acoustic Model  43
2.3.3 Pronunciation Lexicon  49
2.3.4 Language Model  50
2.3.5 Search Algorithm  51
3 Graphical Framework to Incorporate Knowledge Sources  55
3.1 Graphical Model Representation  56
3.1.1 Probability Theory  56
3.1.2 Graphical Model  59
3.1.3 Junction Tree Algorithm  63

3.2 Procedure of GFIKS  68
3.2.1 Causal Relationship between Information Sources  70
3.2.2 Direct Inference on Bayesian Network  71
3.2.3 Junction Tree Decomposition  72
3.2.4 Junction Tree Inference  75
3.3 Practical Issues of GFIKS  75
3.3.1 Types of Knowledge Sources  75
3.3.2 Different Levels of Incorporation  76
4 Speech Recognition Using GFIKS  79
4.1 Applying GFIKS at the HMM State Level  79
4.1.1 Causal Relationship between Information Sources  80
4.1.2 Inference  81
4.1.3 Enhancing Model Reliability  81
4.1.4 Training and Recognition Issues  82
4.2 Applying GFIKS at the HMM Phonetic-unit Level  83
4.2.1 Causal Relationship between Information Sources  83
4.2.2 Inference  85
4.2.3 Enhancing the Model Reliability  85
4.2.4 Deleted Interpolation  86
4.2.5 Training and Recognition Issues  86
4.3 Experiments with Various Knowledge Sources  87
4.3.1 Incorporating Knowledge at the HMM State Level  87
4.3.2 Incorporating Knowledge at the HMM Phonetic-unit Level  116
4.4 Experiments Summary and Discussion  132
5 Conclusions and Future Directions  139
5.1 Conclusions  139
5.1.1 Theoretical Issues  139
5.1.2 Application Issues  140
5.1.3 Experimental Issues  141
5.2 Future Directions: A Roadmap to a Spoken Language Dialog System  142
A Speech Materials  145
A.1 AURORA TIDigit Corpus  145
A.2 TIMIT Acoustic-Phonetic Speech Corpus  146
A.3 Wall Street Journal Corpus  148
A.4 ATR Basic Travel Expression Corpus  150
A.5 ATR English Database Corpus  150

B ATR Software Tools  153
B.1 Generic Properties of ATRASR  153
B.2 Data Preparation  153
B.3 SSS Data Generating Tools  155
B.4 Acoustic Model Training Tools  155
B.5 Language Model Training Tools  157
B.6 Recognition Tools  157
C Composition of Bayesian Wide-phonetic Context  163
C.1 Proof Using Bayes's Rule  163
C.2 Variants of Bayesian Wide-phonetic Context Model  164
D Statistical Significance Testing  169
D.1 Statistical Hypothesis Testing  169
D.2 The Use of the Sign Test for ASR  172

References  175
Index  189

List of Figures

1.1 A machine that recognizes the speech waveform of a human utterance as "Good night."  2
1.2 Knowledge-based ASR system.  4
1.3 Speech spectrogram reading, which corresponds to the word sequence "Good night."  5
1.4 Corpus-based statistical ASR system.  6
1.5 2003 NIST benchmark ASR test history (After Pallett, 2003, © 2003 IEEE).  7
1.6 TC-STAR ASR evaluation campaign (After Choukri, 2007, © TC-STAR).  8
1.7 S curve of ASR technology progress and the predicted performance from combining deep knowledge with a statistical approach.  9
1.8 Incorporating knowledge into a corpus-based statistical ASR system.  16
2.1 Pattern recognition: Establishing a mapping from multidimensional measurement space X to three-class target decision space Y.  20
2.2 Pattern recognition approach for ASR: Establishing a mapping from measurement space X of the speech signal to target decision space Y of word strings.  22
2.3 Simple three-state Markov chain for daily weather.  22
2.4 HMM of the daily weather, where there is no deterministic meaning on any state.  24
2.5 Left-to-right HMM of the daily weather.  25
2.6 Process flow on a trellis diagram of a 3-state HMM with time length T.  26
2.7 Forward probability function representation (for j=1).  27
2.8 Backward probability function representation (for i=1).  28
2.9 Example of finding the best path on a trellis diagram using the Viterbi algorithm.  30
2.10 Graphical interpretation of the EM algorithm.  31
2.11 Forward-backward probability function representation.  32
2.12 A generic automatic speech recognition system, composed of five components: feature extraction, acoustic model, pronunciation lexicon, language model, and search algorithm.  36
2.13 Source-filter model of the speech signal x[n] = e[n] * h[n].  37
2.14 Source-filter separation by cepstral analysis.  37
2.15 (a) A windowed speech waveform. (b) The spectrum of Figure 2.15(a). (c) The resulting cepstrum. (d) The Fourier transform of the low-quefrency component.  38
2.16 MFCC feature extraction technique, which generates a 25-dimensional feature vector xt for each frame.  41
2.17 A summary of the feature extraction process, producing a feature vector that corresponds to one point in a multi-dimensional space.  43
2.18 Discrete HMM observation density, where the emission statistics or HMM state output probabilities are represented by discrete symbols.  44
2.19 Continuous GMM, where the continuous observation space is modeled using mixture Gaussians (state-specific). They are weighted and added to compute the emission statistic likelihoods (HMM state output probabilities).  45
2.20 Structure example of the monophone /a/ HMM acoustic model.  46
2.21 Structure example of the triphone /a-, a, a+/ HMM acoustic model.  47
2.22 Shared-state structures of the triphone /a-, a, a+/ HMM acoustic model.  47
2.23 An example of a phonetic decision tree for the HMM state of the triphone with the central phoneme /ay/.  48
2.24 Contextual splitting and temporal splitting of the SSS algorithm (After Jitsuhiro, 2005).  49
2.25 Example of a tree-based pronunciation lexicon.  50
2.26 Multi-level probability estimation of statistical ASR.  52
3.1 Incorporating knowledge into a corpus-based statistical ASR system.  55
3.2 Two equivalent models that can be obtained from each other through arc reversal of Bayes's rule, since P(a,b) = P(b,a).  60
3.3 Graphical representation of P(a | b1, b2, ..., bn).  60
3.4 Three BNs with different arrow directions over the same random variables a, b, and c. They appear in the case of serial, diverging, and converging connections, respectively.  61

3.5 Example of a BN topology describing conditional relationships among a, b, c, d, e, f, g, and h.  63
3.6 Moral and triangulated graph of Figure 3.5.  64
3.7 Junction graph of Figure 3.5.  66
3.8 The resulting junction tree.  66
3.9 Clique C1 = [a, b, d] in the original graph of Figure 3.5.  67
3.10 General procedure of GFIKS (graphical framework to incorporate additional knowledge sources).  69
3.11 (a) BN topology describing the conditional relationship between data D and model M. (b) BN topology describing the conditional relationship among D, M, and additional knowledge K.  70
3.12 Examples of BN topologies describing the conditional relationship among data D, model M, and several knowledge sources K1, K2, ..., KN.  71
3.13 (a) BN topology describing the conditional relationship among D, M, K1, and K2. (b) Moral and triangulated graph of Figure 3.13(a). (c) Equivalent BN topology. (d) Moral and triangulated graph of Figure 3.13(c). (e) Junction tree of Figure 3.13(d).  73
3.14 (a) Equivalent BN topology of the BN shown in Figure 3.12(a). (b) Corresponding junction tree.  74
3.15 Incorporating knowledge sources at the HMM state level (denoted by a small box) and the phonetic-unit level (denoted by a large box).  77
4.1 (a) Applying GFIKS at the HMM state level. (b) BN topology structure describing the conditional relationship between HMM state Q and observation vector X.  80
4.2 BN topology structure after incorporating additional knowledge sources K1, K2, ..., KN in the HMM state distribution P(X, Q) (assuming that all K1, K2, ..., KN are independent given Q).  81
4.3 Example of observation space modeling by BN, where each value of Ki corresponds to a different Gaussian.  82
4.4 (a) Applying GFIKS at the HMM phonetic-unit level. (b) BN topology structure describing the conditional relationship between the HMM phonetic model and observation segment Xs.  84
4.5 BN topology structure after incorporating additional knowledge sources K1, K2, ..., KN in the HMM phonetic model (assuming that all K1, K2, ..., KN are independent given the phonetic model).  84
4.6 Rescoring procedure with the composition models.  87
4.7 BN topology structure showing the conditional relationship among HMM state Q, observation vector X, and the additional knowledge source of gender information G.  88

4.8 Recognition accuracy rates of the proposed HMM/BN, which are comparable with those of other systems from the Hub and Spoke Paradigm for Continuous Speech Recognition Evaluation for the primary condition of the WSJ Hub2-5k task.  93
4.9 BN topology structure describing the conditional relationship between HMM state Q, observation vector X, and additional knowledge sources of noise type N and SNR value S.  94
4.10 Comparison of different systems: HMM, DBN (Bilmes et al., 2001), and proposed HMM/BN.  98
4.11 BN topologies of the left state (a), center state (b), and right state (c) of LR-HMM/BN for modeling a pentaphone context /a--, a-, a, a+, a++/.  99
4.12 BN topologies of the left state (a), center state (b), and right state (c) of LRC-HMM/BN for modeling a pentaphone context /a--, a-, a, a+, a++/.  100
4.13 Observation space modeling by BN, where a different value of the second following context CR corresponds to a different Gaussian.  101
4.14 Knowledge-based phoneme classes of the observation space.  102
4.15 Determining the distance metric by Euclidean distance.  103
4.16 Data-driven phoneme classes of the observation space.  103
4.17 Recognition accuracy rates of pentaphone LR-HMM/BN using knowledge-based second preceding and following context clustering.  106
4.18 Recognition accuracy rates of pentaphone LRC-HMM/BN using knowledge-based second preceding and following context clustering.  107
4.19 Recognition accuracy rates of pentaphone LR-HMM/BN and LRC-HMM/BN using data-driven Gaussian clustering.  108
4.20 Comparing recognition accuracy rates of triphone HMM and pentaphone HMM/BN models with a fixed and a varied number of mixture components per state, but having the same 15 mixture components per state on average.  109
4.21 Topology of fLRC-HMM/BN for modeling a pentaphone context /a--, a-, a, a+, a++/, where the state PDF has additional variables CL and CR representing the second preceding and following contexts, respectively.  110
4.22 (a) fLRCG-HMM/BN topology with additional knowledge G, CL, and CR. (b) fLRCA-HMM/BN topology with additional variables A, CL, and CR. (c) fLRCAG-HMM/BN topology with additional knowledge A, G, CL, and CR.  111
4.23 Recognition accuracy rates of proposed HMM/BN models having identical numbers of 5, 10, and 20 mixture components per state.  113

4.24 Comparing recognition accuracy rates of different systems: triphone HMM baseline, pentaphone HMM baseline, and the proposed pentaphone HMM/BN models having the same five mixture components per state.  115
4.25 BN topology structure describing the conditional relationship among Xs, the phonetic model, CL, and CR.  116
4.26 (a) Equivalent BN topology. (b) Moral and triangulated graph of Figure 4.26(a). (c) Junction tree of Figure 4.26(b).  117
4.27 (a) Conventional triphone model. (b) Conventional pentaphone model. (c) Bayesian pentaphone model composition C1L3R3, consisting of the preceding/following triphone-context unit and center-monophone unit.  119
4.28 Rescoring procedure with pentaphone composition models: C1L3R3 or C3L4R4.  120
4.29 N-best rescoring mechanism.  121
4.30 Recognition accuracy rates of the Bayesian triphone model.  122
4.31 Recognition accuracy rates of Bayesian pentaphone models.  124
4.32 Relative reductions in WER by the Bayesian triphone C1L2R2 model from the monophone baseline and by the Bayesian pentaphone C1L3R3 model from the triphone baseline.  125
4.33 Recognition accuracy rates of the conventional pentaphone C5 and the proposed Bayesian pentaphone C1L3R3 models with different amounts of training data.  126
4.34 BN topology structure describing the conditional relationship among Xs, the phonetic model, CL, CR, A, and G.  127
4.35 (a) Equivalent BN topology of Figure 4.34. (b) Moral and triangulated graph of Figure 4.35(a). (c) Corresponding junction tree.  127
4.36 Rescoring procedure with the accent-gender-dependent pentaphone composition models: C1L3R3, C1L3R3-A, C1L3R3-G, and C1L3R3-AG.  128
4.37 Comparing recognition accuracy rates of different systems: triphone HMM baseline, pentaphone HMM baseline, and proposed pentaphone models having the same 5, 10, and 20 mixture components per state.  130
4.38 Comparing recognition accuracy rates of different systems: triphone HMM baseline, pentaphone HMM baseline, and proposed models incorporating knowledge sources at the HMM state and phonetic-unit levels.  136
5.1 Roadmap to a spoken language dialog system incorporating other knowledge sources at higher ASR levels.  143
B.1 The ATRASR phoneme-based SSS data creation for phone-unit model training.  156
B.2 The ATRASR topology training for each phone acoustic-unit model.  158
B.3 The ATRASR embedded training for a whole HMnet.  159
B.4 The recognition process using ATRASR tools.  160
C.1 Bayesian pentaphone model composition. (a) C5, the conventional pentaphone model. (b) Bayesian C1L3R3, composed of the preceding/following triphone-context unit and center-monophone unit. (c) Bayesian C3L4R4, composed of the preceding/following tetraphone-context unit and center-triphone-context unit. (d) Bayesian C1Lsk3Rsk3, composed of the preceding/following skip-triphone-context unit and center-monophone unit. (e) Bayesian C1C3Csk3, composed of the center skip-triphone-context unit, center triphone-context unit, and center-monophone unit.  167
D.1 The distribution of the population according to the null hypothesis (H0 is true), with the upper tail of the rejection region for P.  171

List of Tables

4.1 English phoneme set . . . 90
4.2 1993 Hub and Spoke CSR evaluation on Hub 2: 5k read WSJ task (Kubala et al., 1994; Pallett et al., 1994) . . . 92
4.3 HMM/BN system performance on Hub 2: 5k read WSJ task . . . 93
4.4 Recognition accuracy rates (%) for proposed HMM/BN on AURORA2 task . . . 97
4.5 Knowledge-based phoneme classes based on manner of articulation . . . 101
4.6 Recognition accuracy rates (%) for proposed pentaphone HMM/BN model using fLRC-HMM/BN (see Figure 4.22) on a test set of matching accents with different numbers of mixture components . . . 114
4.7 Recognition accuracy rates (%) for proposed pentaphone HMM/BN model using fLRC-HMM/BN (see Figure 4.22) on a test set of mismatched accents with 15 mixture components . . . 115
4.8 Recognition accuracy rates (%) for proposed Bayesian pentaphone C1L3R3-AG (see Eq. (4.30)) on a test set of matching accents with different numbers of mixture components . . . 131
4.9 Recognition accuracy rates (%) for proposed Bayesian pentaphone C1L3R3-AG model (see Eq. (4.30)) on a test set of mismatched accents with 15 mixture components . . . 132
4.10 Summary of incorporating various knowledge sources at the HMM state level . . . 134
4.11 Summary of incorporating various knowledge sources at the HMM phonetic unit level . . . 135
A.1 Dialect distribution of speakers . . . 147
A.2 Speech materials of TIMIT database . . . 148
A.3 Statistics on the TIMIT database . . . 148
A.4 Text sentence materials of ATR English speech database . . . 151
A.5 Speech materials of ATR English speech database . . . 151

Glossary

AM  Acoustic model
ARPA  Advanced Research Projects Agency
ASR  Automatic speech recognition
A-STAR  Asian speech translation advanced research
ATR  Advanced Telecommunication Research
AUS  Australian
BN  Bayesian network
BRT  British
BTEC  Basic travel expression corpus
BU  Boston University
C1  Center monophone unit
C3  Center triphone context
Csk3  Center skip-triphone context
C5  Center pentaphone context
CCCC  CSR corpus coordinating committee
CNRS-LIMSI  France's National Center for Scientific Research
CPD  Conditional probability distribution
CPT  Conditional probability table
CSR  Continuous speech recognition
C-STAR  Consortium for speech translation advanced research
CU  Cambridge University
DAG  Directed acyclic graph
DARPA  Defense Advanced Research Projects Agency
DBN  Dynamic Bayesian network
DCT  Discrete cosine transform
DEL  Deletions
DI  Deleted interpolation
DSR  Distributed speech recognition
EDB  English database
ELRA  European language resources association


EM  Expectation-maximization
EPPS  European Parliament Plenary Sessions
fLRC-HMM/BN  Full HMM/BN for left, right and center state
fLRCA-HMM/BN  Full HMM/BN for left, right and center state, including accent dependency
fLRCAG-HMM/BN  Full HMM/BN for left, right and center state, including accent and gender dependency
fLRG-HMM/BN  Full HMM/BN for left, right and center state, including gender dependency
FFT  Fast Fourier transform
GDHMM  Gender-dependent Hidden Markov model
GFIKS  Graphical framework to incorporate additional knowledge sources
GIHMM  Gender-independent Hidden Markov model
GMM  Gaussian mixture model
HMM  Hidden Markov model
ICASSP  International conference on acoustics, speech and signal processing
ICSI  International Computer Science Institute
ICSLP  International conference on spoken language processing
IEEE  Institute of Electrical and Electronics Engineers
IEICE  Institute of Electronics, Information and Communication Engineers
Imp  Improvement
INS  Insertions
L3  Left triphone context
L4  Left tetraphone context
LM  Language model
LPC  Linear prediction coefficients
LRC-HMM/BN  HMM/BN for left, right and center state
LR-HMM/BN  HMM/BN for left and right state
Lsk3  Left skip-triphone context
LVCSR  Large-vocabulary continuous-speech recognition
MAD  Machine translation aided dialogue
MAP  Maximum a posteriori
MDL  Minimum description length
MFCC  Mel-frequency cepstral coefficients
MIT  Massachusetts Institute of Technology
ML  Maximum likelihood
MLLR  Maximum likelihood linear regression
MSG  Modulation-filtered spectrogram
MT  Machine translation
NIST  National Institute of Standards and Technology
NOVO  Noise voice composition
PDF  Probability density function


PLP  Perceptual linear prediction
PMC  Parallel model combination
R3  Right triphone context
R4  Right tetraphone context
Rel  Relative
Resc  Rescoring
Rsk3  Right skip-triphone context
S2ST  Speech-to-speech translation
SD  Speaker dependent
SI  Speaker independent
SIL  Silence
SLC  Spoken Language Communication
SNR  Signal-to-noise ratio
SSS  Successive state splitting
STQ  Speech processing, transmission and quality
SUB  Substitutions
SWB  Switchboard
TC-STAR  Technology and corpora for speech to speech translation research
TI  Texas Instruments
US  United States
VQ  Vector quantization
WER  Word error rate
WFST  Weighted finite state transducers
WSJ  Wall Street Journal
