
Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury

Deep Neural Networks for Acoustic Modeling in Speech Recognition

[Four research groups share their views]
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

Introduction
New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR). The biggest single advance occurred nearly four decades ago with the introduction of the expectation-maximization (EM) algorithm for training HMMs (see [1] and [2] for informative historical reviews of the introduction of HMMs). With the EM algorithm, it became possible to develop speech recognition systems for real-world tasks using the richness of GMMs [3] to represent the relationship between HMM states and the acoustic input. In these systems the acoustic input is typically represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) [4] computed from the raw waveform and their first- and second-order temporal differences [5]. This nonadaptive but highly engineered preprocessing of the waveform is designed to discard the large amount of information in waveforms that is considered to be irrelevant for discrimination and to express the remaining information in a form that facilitates discrimination with GMM-HMMs.
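Purely as an illustration of how such per-frame inputs are typically assembled (this sketch is not taken from the systems described in this article; the array sizes, padding choice, and helper names are invented for the example), the following NumPy code appends first- and second-order differences to a precomputed MFCC matrix and stacks a context window of frames:

```python
import numpy as np

def add_deltas(feats, width=2):
    """Append first- and second-order temporal differences to a
    (num_frames, num_coeffs) feature matrix."""
    def delta(x):
        # Simple regression-style delta over +/- `width` frames.
        num = np.zeros_like(x)
        denom = 2.0 * sum(k * k for k in range(1, width + 1))
        padded = np.pad(x, ((width, width), (0, 0)), mode="edge")
        for k in range(1, width + 1):
            num += k * (padded[width + k:width + k + len(x)] -
                        padded[width - k:width - k + len(x)])
        return num / denom

    d1 = delta(feats)
    d2 = delta(d1)
    return np.concatenate([feats, d1, d2], axis=1)

def context_window(feats, left=5, right=5):
    """Stack +/- context frames so each row can be fed to a network."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    rows = [padded[i:i + len(feats)] for i in range(left + right + 1)]
    return np.concatenate(rows, axis=1)

# Made-up sizes: 300 frames of 13 MFCCs -> 11-frame windows of 39-d features.
mfcc = np.random.randn(300, 13)            # placeholder for real MFCCs
inputs = context_window(add_deltas(mfcc))  # shape (300, 11 * 39)
```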


GMMs have a number of advantages that make them suitable for modeling the probability distributions over vectors of input features that are associated with each state of an HMM. With enough components, they can model probability distributions to any required level of accuracy, and they are fairly easy to fit to data using the EM algorithm. A huge amount of research has gone into finding ways of constraining GMMs to increase their evaluation speed and to optimize the tradeoff between their flexibility and the amount of training data available to avoid serious overfitting [6].

The recognition accuracy of a GMM-HMM system can be further improved if it is discriminatively fine-tuned after it has been generatively trained to maximize its probability of generating the observed data, especially if the discriminative objective function used for training is closely related to the error rate on phones, words, or sentences [7]. The accuracy can also be improved by augmenting (or concatenating) the input features (e.g., MFCCs) with "tandem" or bottleneck features generated using neural networks [8], [9]. GMMs are so successful that it is difficult for any new method to outperform them for acoustic modeling.

Despite all their advantages, GMMs have a serious shortcoming: they are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. For example, modeling the set of points that lie very close to the surface of a sphere only requires a few parameters using an appropriate model class, but it requires a very large number of diagonal Gaussians or a fairly large number of full-covariance Gaussians. Speech is produced by modulating a relatively small number of parameters of a dynamical system [10], [11], and this implies that its true underlying structure is much lower-dimensional than is immediately apparent in a window that contains hundreds of coefficients. We believe, therefore, that other types of model may work better than GMMs for acoustic modeling if they can more effectively exploit information embedded in a large window of frames.

Artificial neural networks trained by backpropagating error derivatives have the potential to learn much better models of data that lie on or near a nonlinear manifold. In fact, two decades ago, researchers achieved some success using artificial neural networks with a single layer of nonlinear hidden units to predict HMM states from windows of acoustic coefficients [9]. At that time, however, neither the hardware nor the learning algorithms were adequate for training neural networks with many hidden layers on large amounts of data, and the performance benefits of using neural networks with a single hidden layer were not sufficiently large to seriously challenge GMMs. As a result, the main practical contribution of neural networks at that time was to provide extra features in tandem or bottleneck systems.

Over the last few years, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training DNNs that contain many layers of nonlinear hidden units and a very large output layer. The large output layer is required to accommodate the large number of HMM states that arise when each phone is modeled by a number of different "triphone" HMMs that take into account the phones on either side. Even when many of the states of these triphone HMMs are tied together, there can be thousands of tied states. Using the new learning methods, several different research groups have shown that DNNs can outperform GMMs at acoustic modeling for speech recognition on a variety of data sets including large data sets with large vocabularies.

This review article aims to represent the shared views of research groups at the University of Toronto, Microsoft Research (MSR), Google, and IBM Research, who have all had recent successes in using DNNs for acoustic modeling. The article starts by describing the two-stage training procedure that is used for fitting the DNNs. In the first stage, layers of feature detectors are initialized, one layer at a time, by fitting a stack of generative models, each of which has one layer of latent variables. These generative models are trained without using any information about the HMM states that the acoustic model will need to discriminate. In the second stage, each generative model in the stack is used to initialize one layer of hidden units in a DNN and the whole network is then discriminatively fine-tuned to predict the target HMM states. These targets are obtained by using a baseline GMM-HMM system to produce a forced alignment.

In this article, we review exploratory experiments on the TIMIT database [12], [13] that were used to demonstrate the power of this two-stage training procedure for acoustic modeling. The DNNs that worked well on TIMIT were then applied to five different large-vocabulary continuous speech recognition (LVCSR) tasks by three different research groups whose results we also summarize. The DNNs worked well on all of these tasks when compared with highly tuned GMM-HMM systems, and on some of the tasks they outperformed the state of the art by a large margin. We also describe some other uses of DNNs for acoustic modeling and some variations on the training procedure.

Training deep neural networks
A DNN is a feed-forward, artificial neural network that has more than one layer of hidden units between its inputs and its outputs. Each hidden unit, j, typically uses the logistic function (the closely related hyperbolic tangent is also often used, and any function with a well-behaved derivative can be used) to map its total input from the layer below, x_j, to the scalar state, y_j, that it sends to the layer above:

y_j = logistic(x_j) = 1 / (1 + e^{-x_j}),   x_j = b_j + \sum_i y_i w_{ij},   (1)

where b_j is the bias of unit j, i is an index over units in the layer below, and w_{ij} is the weight on a connection to unit j from unit i in the layer below.
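A minimal NumPy sketch of the forward computation in (1), with illustrative layer sizes and random weights standing in for trained ones:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(acoustic_input, weights, biases):
    """Propagate a minibatch through the hidden layers using (1).

    weights[l] has shape (units_below, units_in_layer_l) and
    biases[l] has shape (units_in_layer_l,).
    """
    y = acoustic_input
    for W, b in zip(weights, biases):
        x = y @ W + b          # total input x_j = b_j + sum_i y_i w_ij
        y = logistic(x)        # scalar state y_j sent to the layer above
    return y

# Illustrative sizes only: 429-d input window, two hidden layers of 512 units.
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.01, (429, 512)), rng.normal(0, 0.01, (512, 512))]
biases = [np.zeros(512), np.zeros(512)]
hidden = forward(rng.normal(size=(32, 429)), weights, biases)  # (32, 512)
```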


For multiclass classification, output unit j converts its total input, x_j, into a class probability, p_j, by using the "softmax" nonlinearity

p_j = exp(x_j) / \sum_k exp(x_k),   (2)

where k is an index over all classes.

DNNs can be discriminatively trained (DT) by backpropagating derivatives of a cost function that measures the discrepancy between the target outputs and the actual outputs produced for each training case [14]. When using the softmax output function, the natural cost function C is the cross entropy between the target probabilities d and the outputs of the softmax, p:

C = -\sum_j d_j log p_j,   (3)

where the target probabilities, typically taking values of one or zero, are the supervised information provided to train the DNN classifier.

For large training sets, it is typically more efficient to compute the derivatives on a small, random "minibatch" of training cases, rather than the whole training set, before updating the weights in proportion to the gradient. This stochastic gradient descent method can be further improved by using a "momentum" coefficient, 0 < \alpha < 1, that smooths the gradient computed for minibatch t, thereby damping oscillations across ravines and speeding progress down ravines:

\Delta w_{ij}(t) = \alpha \Delta w_{ij}(t-1) - \epsilon \partial C / \partial w_{ij}(t).   (4)

The update rule for biases can be derived by treating them as weights on connections coming from units that always have a state of one.

To reduce overfitting, large weights can be penalized in proportion to their squared magnitude, or the learning can simply be terminated at the point at which performance on a held-out validation set starts getting worse [9]. In DNNs with full connectivity between adjacent layers, the initial weights are given small random values to prevent all of the hidden units in a layer from getting exactly the same gradient.

DNNs with many hidden layers are hard to optimize. Gradient descent from a random starting point near the origin is not the best way to find a good set of weights, and unless the initial scales of the weights are carefully chosen [15], the backpropagated gradients will have very different magnitudes in different layers. In addition to the optimization issues, DNNs may generalize poorly to held-out test data. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters. This makes them capable of modeling very complex and highly nonlinear relationships between inputs and outputs. This ability is important for high-quality acoustic modeling, but it also allows them to model spurious regularities that are an accidental property of the particular examples in the training set, which can lead to severe overfitting. Weight penalties or early stopping can reduce the overfitting but only by removing much of the modeling power. Very large training sets [16] can reduce overfitting while preserving modeling power, but only by making training very computationally expensive. What we need is a better method of using the information in the training set to build multiple layers of nonlinear feature detectors.

Generative pretraining
Instead of designing feature detectors to be good for discriminating between classes, we can start by designing them to be good at modeling the structure in the input data. The idea is to learn one layer of feature detectors at a time with the states of the feature detectors in one layer acting as the data for training the next layer. After this generative "pretraining," the multiple layers of feature detectors can be used as a much better starting point for a discriminative "fine-tuning" phase during which backpropagation through the DNN slightly adjusts the weights found in pretraining [17]. Some of the high-level features created by the generative pretraining will be of little use for discrimination, but others will be far more useful than the raw inputs. The generative pretraining finds a region of the weight-space that allows the discriminative fine-tuning to make rapid progress, and it also significantly reduces overfitting [18].

A single layer of feature detectors can be learned by fitting a generative model with one layer of latent variables to the input data. There are two broad classes of generative model to choose from. A directed model generates data by first choosing the states of the latent variables from a prior distribution and then choosing the states of the observable variables from their conditional distributions given the latent states. Examples of directed models with one layer of latent variables are factor analysis, in which the latent variables are drawn from an isotropic Gaussian, and GMMs, in which they are drawn from a discrete distribution. An undirected model has a very different way of generating data. Instead of using one set of parameters to define a prior distribution over the latent variables and a separate set of parameters to define the conditional distributions of the observable variables given the values of the latent variables, an undirected model uses a single set of parameters, W, to define the joint probability of a vector of values of the observable variables, v, and a vector of values of the latent variables, h, via an energy function, E:

p(v, h; W) = e^{-E(v, h; W)} / Z,   Z = \sum_{v', h'} e^{-E(v', h'; W)},   (5)

where Z is called the partition function.

If many different latent variables interact nonlinearly to generate each data vector, it is difficult to infer the states of the latent variables from the observed data in a directed model because of a phenomenon known as "explaining away" [19]. In undirected models, however, inference is easy provided the latent variables do not have edges linking them. Such a restricted class of undirected models is ideal for layerwise pretraining because each layer will have an easy inference procedure.
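To make (5) concrete, the following toy sketch (not from the article; the bilinear energy and the tiny layer sizes are chosen only so that Z can be enumerated exactly) computes the partition function and a marginal by brute force. For realistic numbers of units this enumeration is intractable, which is why the approximate learning procedures described next are needed:

```python
import numpy as np
from itertools import product

def energy(v, h, a, b, W):
    """Bilinear energy over binary visible v and hidden h; this particular
    form is the RBM energy introduced in the next section."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Toy sizes so that Z in (5) can be enumerated exactly.
rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3
a, b = rng.normal(size=n_vis), rng.normal(size=n_hid)
W = rng.normal(size=(n_vis, n_hid))

configs_v = [np.array(c) for c in product([0, 1], repeat=n_vis)]
configs_h = [np.array(c) for c in product([0, 1], repeat=n_hid)]

# Z sums exp(-E) over every joint configuration.
Z = sum(np.exp(-energy(v, h, a, b, W)) for v in configs_v for h in configs_h)

def p_joint(v, h):
    return np.exp(-energy(v, h, a, b, W)) / Z

# Marginal probability of one visible vector: sum the joint over all h.
v0 = configs_v[5]
p_v0 = sum(p_joint(v0, h) for h in configs_h)
print(Z, p_v0)
```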


We start by describing an approximate learning algorithm for a restricted Boltzmann machine (RBM), which consists of a layer of stochastic binary "visible" units that represent binary input data connected to a layer of stochastic binary hidden units that learn to model significant nonindependencies between the visible units [20]. There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections. An RBM is a type of Markov random field (MRF) but differs from most MRFs in several ways: it has a bipartite connectivity graph, it does not usually share weights between different units, and a subset of the variables are unobserved, even during training.

An efficient learning procedure for RBMs
A joint configuration, (v, h), of the visible and hidden units of an RBM has an energy given by

E(v, h) = -\sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij},   (6)

where v_i, h_j are the binary states of visible unit i and hidden unit j, a_i, b_j are their biases, and w_{ij} is the weight between them. The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function as in (5), and the probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors:

p(v) = (1/Z) \sum_h e^{-E(v, h)}.   (7)

The derivative of the log probability of a training set with respect to a weight is surprisingly simple:

(1/N) \sum_{n=1}^{N} \partial log p(v^n) / \partial w_{ij} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model},   (8)

where N is the size of the training set and the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. The simple derivative in (8) leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:

\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}),   (9)

where \epsilon is a learning rate.

The absence of direct connections between hidden units in an RBM makes it very easy to get an unbiased sample of \langle v_i h_j \rangle_{data}. Given a randomly selected training case, v, the binary state, h_j, of each hidden unit, j, is set to one with probability

p(h_j = 1 | v) = logistic(b_j + \sum_i v_i w_{ij})   (10)

and v_i h_j is then an unbiased sample. The absence of direct connections between visible units in an RBM makes it very easy to get an unbiased sample of the state of a visible unit, given a hidden vector:

p(v_i = 1 | h) = logistic(a_i + \sum_j h_j w_{ij}).   (11)

Getting an unbiased sample of \langle v_i h_j \rangle_{model}, however, is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a very long time. Alternating Gibbs sampling consists of updating all of the hidden units in parallel using (10) followed by updating all of the visible units in parallel using (11).

A much faster learning procedure called contrastive divergence (CD) was proposed in [20]. This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using (10). Once binary states have been chosen for the hidden units, a "reconstruction" is produced by setting each v_i to one with a probability given by (11). Finally, the states of the hidden units are updated again. The change in a weight is then given by

\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon}).   (12)

A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.

CD works well even though it is only crudely approximating the gradient of the log probability of the training data [20]. RBMs learn better generative models if more steps of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, but for the purposes of pretraining feature detectors, more alternations are generally of little value and all the results reviewed here were obtained using CD1, which does a single full step of alternating Gibbs sampling after the initial update of the hidden units. To suppress noise in the learning, the real-valued probabilities rather than binary samples are generally used for the reconstructions and the subsequent states of the hidden units, but it is important to use sampled binary values for the first computation of the hidden states because the sampling noise acts as a very effective regularizer that prevents overfitting [21].

Modeling real-valued data
Real-valued data, such as MFCCs, are more naturally modeled by linear variables with Gaussian noise, and the RBM energy function can be modified to accommodate such variables, giving a Gaussian-Bernoulli RBM (GRBM):

E(v, h) = \sum_{i \in vis} (v_i - a_i)^2 / (2\sigma_i^2) - \sum_{j \in hid} b_j h_j - \sum_{i,j} (v_i / \sigma_i) h_j w_{ij},   (13)

where \sigma_i is the standard deviation of the Gaussian noise for visible unit i.

The two conditional distributions required for CD1 learning are

p(h_j = 1 | v) = logistic(b_j + \sum_i (v_i / \sigma_i) w_{ij})   (14)

p(v_i | h) = N(a_i + \sigma_i \sum_j h_j w_{ij}, \sigma_i^2),   (15)

where N(\mu, \sigma^2) is a Gaussian.
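A minimal NumPy sketch of one CD1 update for a binary RBM, following (10)-(12); the minibatch handling, learning rate, and sizes are illustrative assumptions rather than the exact recipes used in the experiments reviewed here:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, lr=0.01, rng=np.random.default_rng(0)):
    """One CD1 weight update on a minibatch: binary hidden samples drive the
    reconstruction, while real-valued probabilities are used for the
    statistics, as recommended above."""
    # Positive phase: p(h=1|v) from (10), then sample binary hidden states.
    ph_data = logistic(v_data @ W + b)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Reconstruction from (11), using probabilities rather than samples.
    v_recon = logistic(h_sample @ W.T + a)
    ph_recon = logistic(v_recon @ W + b)

    n = v_data.shape[0]
    dW = (v_data.T @ ph_data - v_recon.T @ ph_recon) / n   # (12)
    da = (v_data - v_recon).mean(axis=0)
    db = (ph_data - ph_recon).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db

# Illustrative sizes only.
rng = np.random.default_rng(0)
n_vis, n_hid = 64, 32
W = rng.normal(0, 0.01, (n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)
batch = (rng.random((16, n_vis)) < 0.3).astype(float)
W, a, b = cd1_update(batch, W, a, b)
```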


Learning the standard deviations of a GRBM is problematic for reasons described in [21], so for pretraining using CD1, the data are normalized so that each coefficient has zero mean and unit variance, the standard deviations are set to one when computing p(v | h), and no noise is added to the reconstructions. This avoids the issue of deciding the right noise level.

Stacking RBMs to make a deep belief network
After training an RBM on the data, the inferred states of the hidden units can be used as data for training another RBM that learns to model the significant dependencies between the hidden units of the first RBM. This can be repeated as many times as desired to produce many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data. The RBMs in a stack can be combined in a surprising way to produce a single, multilayer generative model called a deep belief net (DBN) [22]. Even though each RBM is an undirected model, the DBN (not to be confused with a dynamic Bayesian net, which is a type of directed model of temporal data that unfortunately has the same acronym) formed by the whole stack is a hybrid generative model whose top two layers are undirected (they are the final RBM in the stack) but whose lower layers have top-down, directed connections (see Figure 1).

To understand how RBMs are composed into a DBN, it is helpful to rewrite (7) and to make explicit the dependence on W:

p(v; W) = \sum_h p(h; W) p(v | h; W),   (16)

where p(h; W) is defined as in (7) but with the roles of the visible and hidden units reversed. Now it is clear that the model can be improved by holding p(v | h; W) fixed after training the RBM, but replacing the prior over hidden vectors p(h; W) by a better prior, i.e., a prior that is closer to the aggregated posterior over hidden vectors that can be sampled by first picking a training case and then inferring a hidden vector using (14). This aggregated posterior is exactly what the next RBM in the stack is trained to model.

As shown in [22], there is a series of variational bounds on the log probability of the training data, and furthermore, each time a new RBM is added to the stack, the variational bound on the new and deeper DBN is better than the previous variational bound, provided the new RBM is initialized and learned in the right way. While the existence of a bound that keeps improving is mathematically reassuring, it does not answer the practical issue, addressed in this article, of whether the learned feature detectors are useful for discrimination on a task that is unknown while training the DBN. Nor does it guarantee that anything improves when we use efficient short-cuts such as CD1 training of the RBMs.

One very nice property of a DBN that distinguishes it from other multilayer, directed, nonlinear generative models is that it is possible to infer the states of the layers of hidden units in a single forward pass. This inference, which is used in deriving the variational bound, is not exactly correct but is fairly accurate. So after learning a DBN by training a stack of RBMs, we can jettison the whole probabilistic framework and simply use the generative weights in the reverse direction as a way of initializing all the feature-detecting layers of a deterministic feed-forward DNN. We then just add a final softmax layer and train the whole DNN discriminatively. Unfortunately, a DNN that is pretrained generatively as a DBN is often still called a DBN in the literature. For clarity, we call it a DBN-DNN.

Interfacing a DNN with an HMM
After it has been discriminatively fine-tuned, a DNN outputs probabilities of the form p(HMMstate | AcousticInput). But to compute a Viterbi alignment or to run the forward-backward algorithm within the HMM framework, we require the likelihood p(AcousticInput | HMMstate). The posterior probabilities that the DNN outputs can be converted into the scaled likelihood by dividing them by the frequencies of the HMM states in the forced alignment that is used for fine-tuning the DNN [9]. All of the likelihoods produced in this way are scaled by the same unknown factor of p(AcousticInput), but this has no effect on the alignment. Although this conversion appears to have little effect on some recognition tasks, it can be important for tasks where training labels are highly unbalanced (e.g., with many frames of silences).

Phonetic Classification and Recognition on TIMIT
The TIMIT data set provides a simple and convenient way of testing new approaches to speech recognition. The training set is small enough to make it feasible to try many variations of a new method, and many existing techniques have already been benchmarked on the core test set, so it is easy to see if a new approach is promising by comparing it with existing techniques that have been implemented by their proponents [23].

Experience has shown that performance improvements on TIMIT do not necessarily translate into performance improvements on large vocabulary tasks with less controlled recording conditions and much more training data. Nevertheless, TIMIT provides a good starting point for developing a new approach, especially one that requires a challenging amount of computation.

Mohamed et al. [12] showed that a DBN-DNN acoustic model outperformed the best published recognition results on TIMIT at about the same time as Sainath et al. [23] achieved a similar improvement on TIMIT by applying state-of-the-art techniques developed for large vocabulary recognition. Subsequent work combined the two approaches by using state-of-the-art, DT speaker-dependent features as input to the DBN-DNN [24], but this produced little further improvement, probably because the hidden layers of the DBN-DNN were already doing quite a good job of progressively eliminating speaker differences [25].
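As a small illustration of the posterior-to-scaled-likelihood conversion described in the section on interfacing a DNN with an HMM above (the state counts, flooring, and numbers are invented for the example), the division by the state priors is usually carried out in the log domain:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-8):
    """Convert DNN outputs p(state | acoustics) into the scaled likelihoods
    needed by the HMM decoder by dividing by the state priors estimated
    from the forced alignment (done in the log domain for stability)."""
    log_priors = np.log(state_counts / state_counts.sum() + floor)
    return log_posteriors - log_priors   # log p(acoustics | state) + const

# Illustrative numbers only: 3 frames, 5 tied states.
posteriors = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                       [0.2, 0.5, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.6, 0.1, 0.1]])
counts = np.array([5000.0, 1000.0, 800.0, 600.0, 400.0])  # from the alignment
loglik = scaled_log_likelihoods(np.log(posteriors), counts)
```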


[Fig1] The sequence of operations used to create a DBN with three hidden layers and to convert it to a pretrained DBN-DNN. First, a GRBM is trained to model a window of frames of real-valued acoustic coefficients. Then the states of the binary hidden units of the GRBM are used as data for training an RBM. This is repeated to create as many hidden layers as desired. Then the stack of RBMs is converted to a single generative model, a DBN, by replacing the undirected connections of the lower level RBMs by top-down, directed connections. Finally, a pretrained DBN-DNN is created by adding a "softmax" output layer that contains one unit for each possible state of each HMM. The DBN-DNN is then DT to predict the HMM state corresponding to the central frame of the input window in a forced alignment.
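The following schematic sketch mirrors the sequence of operations in Figure 1; train_rbm stands in for a CD-trained (G)RBM as sketched earlier and is an assumption of the example, not code from the systems described here:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_stack(data, layer_sizes, train_rbm):
    """Greedy layer-wise pretraining as in Figure 1: each RBM is trained on
    the hidden activities of the one below, and its weights initialize one
    layer of the eventual DBN-DNN. `train_rbm(inputs, n_hidden)` is assumed
    to return (W, hidden_biases) from a CD-trained (G)RBM."""
    weights, biases, layer_input = [], [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(layer_input, n_hidden)
        weights.append(W)
        biases.append(b)
        layer_input = logistic(layer_input @ W + b)  # "copy" step in Figure 1
    return weights, biases

def add_softmax_layer(weights, biases, n_states, rng=np.random.default_rng(0)):
    """Add the randomly initialized softmax output layer, one unit per
    possible HMM state, before discriminative fine-tuning."""
    n_top = weights[-1].shape[1]
    weights.append(rng.normal(0, 0.01, (n_top, n_states)))
    biases.append(np.zeros(n_states))
    return weights, biases

# Dummy stand-in for CD training, just to show the call pattern.
dummy_rbm = lambda x, nh: (np.random.default_rng(0).normal(0, 0.01, (x.shape[1], nh)),
                           np.zeros(nh))
w, b = pretrain_stack(np.random.randn(100, 429), [512, 512, 512], dummy_rbm)
w, b = add_softmax_layer(w, b, n_states=183)
```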
The DBN-DNNs that worked best on the TIMIT data formed the starting point for subsequent experiments on much more challenging large vocabulary tasks that were too computationally intensive to allow extensive exploration of variations in the architecture of the neural network, the representation of the acoustic input, or the training procedure.

For simplicity, all hidden layers always had the same size, but even with this constraint it was impossible to train all possible combinations of number of hidden layers [1, 2, 3, 4, 5, 6, 7, 8], number of units per layer [512, 1,024, 2,048, 3,072], and number of frames of acoustic data in the input layer [7, 11, 15, 17, 27, 37]. Fortunately, the performance of the networks on the TIMIT core test set was fairly insensitive to the precise details of the architecture and the results in [13] suggest that any combination of the numbers in boldface probably has an error rate within about 2% of the very best combination. This robustness is crucial for methods such as DBN-DNNs that have a lot of tuneable metaparameters. Our consistent finding is that multiple hidden layers always worked better than one hidden layer and, with multiple hidden layers, pretraining always improved the results on both the development and test sets in the TIMIT task. Details of the learning rates, stopping criteria, momentum, L2 weight penalties, and minibatch size for both the pretraining and fine-tuning are given in [13].

Table 1 compares DBN-DNNs with a variety of other methods on the TIMIT core test set. For each type of DBN-DNN the architecture that performed best on the development set is reported. All methods use MFCCs as inputs except for the three marked "fbank" that use log Mel-scale filter-bank outputs.

Preprocessing the waveform for deep neural networks
State-of-the-art ASR systems do not use filter-bank coefficients as the input representation because they are strongly correlated, so modeling them well requires either full-covariance Gaussians or a huge number of diagonal Gaussians. MFCCs offer a more suitable alternative as their individual components are roughly independent, so they are much easier to model using a mixture of diagonal covariance Gaussians. DBN-DNNs do not require uncorrelated data and, on the TIMIT database, the work reported in [13] showed that the best performing DBN-DNNs trained with filter-bank features had a phone error rate 1.7% lower than the best performing DBN-DNNs trained with MFCCs (see Table 1).


[Table 1] Comparisons among the reported speaker-independent (SI) phonetic recognition accuracy results on the TIMIT core test set with 192 sentences.

Method | PER
CD-HMM [26] | 27.3%
Augmented conditional random fields [26] | 26.6%
Randomly initialized recurrent neural nets [27] | 26.1%
Bayesian triphone GMM-HMM [28] | 25.6%
Monophone HTMs [29] | 24.8%
Heterogeneous classifiers [30] | 24.4%
Monophone randomly initialized DNNs (six layers) [13] | 23.4%
Monophone DBN-DNNs (six layers) [13] | 22.4%
Monophone DBN-DNNs with MMI training [31] | 22.1%
Triphone GMM-HMMs DT w/ BMMI [32] | 21.7%
Monophone DBN-DNNs on fbank (eight layers) [13] | 20.7%
Monophone mcRBM-DBN-DNNs on fbank (five layers) [33] | 20.5%

Fine-tuning DBN-DNNs to optimize mutual information
In the experiments using TIMIT discussed above, the DNNs were fine-tuned to optimize the per frame cross entropy between the target HMM state and the predictions. The transition parameters and language model scores were obtained from an HMM-like approach and were trained independently of the DNN weights. However, it has long been known that sequence classification criteria, which are more directly correlated with the overall word or phone error rate, can be very helpful in improving recognition accuracy [7], [35], and the benefit of using such sequence classification criteria with shallow neural networks has already been shown by [36]-[38]. In the more recent work reported in [31], one popular type of sequence classification criterion, maximum mutual information (MMI), proposed as early as 1986 [7], was successfully applied to learn DBN-DNN weights for the TIMIT phone recognition task. MMI optimizes the conditional probability p(l_{1:T} | v_{1:T}) of the whole sequence of labels, l_{1:T}, with length T, given the whole visible feature utterance v_{1:T}, or equivalently the hidden feature sequence h_{1:T} extracted by the DNN:

p(l_{1:T} | v_{1:T}) = p(l_{1:T} | h_{1:T}) = exp(\sum_{t=1}^{T} \gamma_{ij} \phi_{ij}(l_{t-1}, l_t) + \sum_{t=1}^{T} \sum_{d=1}^{D} \lambda_{l_t,d} h_{td}) / Z(h_{1:T}),   (17)

where the transition feature \phi_{ij}(l_{t-1}, l_t) takes on a value of one if l_{t-1} = i and l_t = j, and otherwise takes on a value of zero, where \gamma_{ij} is the parameter associated with this transition feature, h_{td} is the dth dimension of the hidden unit value at the tth frame at the final layer of the DNN, and where D is the number of units in the final hidden layer. Note the objective function of (17) derived from mutual information [35] is the same as the conditional likelihood associated with a specialized linear-chain conditional random field. Here, it is the topmost layer of the DNN below the softmax layer, not the raw speech coefficients of MFCC or PLP, that provides "features" to the conditional random field.

To optimize the log conditional probability p(l^n_{1:T} | v^n_{1:T}) of the nth utterance, we take the gradient over the activation parameters \lambda_{kd}, transition parameters \gamma_{ij}, and the lower-layer weights of the DNN, w_{ij}, according to

\partial log p(l^n_{1:T} | v^n_{1:T}) / \partial \lambda_{kd} = \sum_{t=1}^{T} (\delta(l^n_t = k) - p(l^n_t = k | v^n_{1:T})) h^n_{td}   (18)

\partial log p(l^n_{1:T} | v^n_{1:T}) / \partial \gamma_{ij} = \sum_{t=1}^{T} [\delta(l^n_{t-1} = i, l^n_t = j) - p(l^n_{t-1} = i, l^n_t = j | v^n_{1:T})]   (19)

\partial log p(l^n_{1:T} | v^n_{1:T}) / \partial w_{ij} = \sum_{t=1}^{T} [\lambda_{l_t,d} - \sum_{k=1}^{K} p(l^n_t = k | v^n_{1:T}) \lambda_{kd}] h^n_{td} (1 - h^n_{td}) x^n_{ti}.   (20)

Note that the gradient \partial log p(l^n_{1:T} | v^n_{1:T}) / \partial w_{ij} above can be viewed as back-propagating the error \delta(l^n_t = k) - p(l^n_t = k | v^n_{1:T}), versus \delta(l^n_t = k) - p(l^n_t = k | v^n_t) in the frame-based training algorithm.

In implementing the above learning algorithm for a DBN-DNN, the DNN weights can first be fine-tuned to optimize the per frame cross entropy. The transition parameters can be initialized from the combination of the HMM transition matrices and the "phone language" model scores, and can be further optimized by tuning the transition features while fixing the DNN weights before the joint optimization. Using the joint optimization with careful scheduling, we observe that the sequential MMI training can outperform the frame-level training by about 5% relative within the same system in the same laboratory.

Convolutional DNNs for phone classification and recognition
All the previously cited work reported phone recognition results on the TIMIT database. In recognition experiments, the input is the acoustic input for the whole utterance while the output is the spoken phonetic sequence. A decoding process using a phone language model is used to produce this output sequence. Phonetic classification is a different task where the acoustic input has already been labeled with the correct boundaries between different phonetic units and the goal is to classify these phones conditioned on the given boundaries. In [39], convolutional DBN-DNNs were introduced and successfully applied to various audio tasks including phone classification on the TIMIT database. In this model, the RBM was made convolutional in time by sharing weights between hidden units that detect the same feature at different times. A max-pooling operation was then performed, which takes the maximal activation over a pool of adjacent hidden units that share the same weights but apply them at different times. This yields some temporal invariance.

Although convolutional models along the temporal dimension achieved good classification results [39], applying them to phone recognition is not straightforward. This is because temporal variations in speech can be partially handled by the dynamic programming procedure in the HMM component, and those aspects of temporal variation that cannot be adequately handled by the HMM can be addressed more explicitly and effectively by hidden trajectory models [40].
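A rough sketch of weight-sharing along one axis followed by max-pooling, in the spirit of the convolutional models just described; the filter shapes and sizes are invented for the example and the loop-based convolution is purely illustrative:

```python
import numpy as np

def conv1d_maxpool(positions, filters, pool_size=3):
    """Apply one shared set of filters at every position along a single axis
    (time in [39], frequency in [34]) and max-pool adjacent activations.

    positions: (num_positions, dims_per_position) slice of the input
    filters:   (num_filters, filter_width, dims_per_position)
    """
    num_pos = positions.shape[0]
    num_filters, width, _ = filters.shape
    num_windows = num_pos - width + 1

    # Convolution: the same filter weights are reused at every position.
    acts = np.empty((num_windows, num_filters))
    for t in range(num_windows):
        patch = positions[t:t + width]                  # (width, dims)
        acts[t] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))

    # Max-pooling over pools of adjacent positions gives some invariance
    # to small shifts along the shared axis.
    trimmed = acts[: (num_windows // pool_size) * pool_size]
    pooled = trimmed.reshape(-1, pool_size, num_filters).max(axis=1)
    return pooled

# Illustrative sizes only: 40 filter-bank channels treated as the shared axis.
rng = np.random.default_rng(0)
one_frame = rng.normal(size=(40, 1))          # one frame, 40 channels
filters = rng.normal(0, 0.1, (16, 8, 1))      # 16 filters spanning 8 channels
features = conv1d_maxpool(one_frame, filters)
```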


The work reported in [34] applied local convolutional filters with max-pooling to the frequency rather than time dimension of the spectrogram. Sharing weights and pooling over frequency was motivated by the shifts in formant frequencies caused by speaker variations. It provides some speaker invariance while also offering noise robustness due to the band-limited nature of the filters. [34] only used weight-sharing and max-pooling across nearby frequencies because, unlike features that occur at different positions in images, acoustic features occurring at very different frequencies are very different.

A summary of the differences between DNNs and GMMs
Here we summarize the main differences between the DNNs and GMMs used in the TIMIT experiments described so far in this article. First, one major element of the DBN-DNN, the RBM, which serves as the building block for pretraining, is an instance of a "product of experts" [20], in contrast to mixture models that are a "sum of experts." Product models have only very recently been explored in speech processing, e.g., [41]. Mixture models with a large number of components use their parameters inefficiently because each parameter only applies to a very small fraction of the data, whereas each parameter of a product model is constrained by a large fraction of the data. Second, while both DNNs and GMMs are nonlinear models, the nature of the nonlinearity is very different. A DNN has no problem modeling multiple simultaneous events within one frame or window because it can use different subsets of its hidden units to model different events. By contrast, a GMM assumes that each datapoint is generated by a single component of the mixture so it has no efficient way of modeling multiple simultaneous events. Third, DNNs are good at exploiting multiple frames of input coefficients, whereas GMMs that use diagonal covariance matrices benefit much less from multiple frames because they require decorrelated inputs. Finally, DNNs are learned using stochastic gradient descent, while GMMs are learned using the EM algorithm or its extensions [35], which makes GMM learning much easier to parallelize on a cluster machine.

Comparing DBN-DNNs with GMMs for Large-Vocabulary Speech Recognition
The success of DBN-DNNs on TIMIT tasks starting in 2009 motivated more ambitious experiments with much larger vocabularies and more varied speaking styles. In this section, we review experiments by three different speech groups on five different benchmark tasks for large-vocabulary speech recognition. To make DBN-DNNs work really well on large vocabulary tasks it is important to replace the monophone HMMs used for TIMIT (and also for early neural network/HMM hybrid systems) with triphone HMMs that have many thousands of tied states [42]. Predicting these context-dependent states provides several advantages over monophone targets. They supply more bits of information per frame in the labels. They also make it possible to use a more powerful triphone HMM decoder and to exploit the sensible classes discovered by the decision tree clustering that is used to tie the states of different triphone HMMs. Using context-dependent HMM states, it is possible to outperform state-of-the-art BMMI trained GMM-HMM systems with a two-hidden-layer neural network without using any pretraining [43], though using more hidden layers and pretraining works even better.

Bing-Voice-Search speech recognition task
The first successful use of acoustic models based on DBN-DNNs for a large vocabulary task used data collected from the Bing mobile voice search application (BMVS). The task used 24 h of training data with a high degree of acoustic variability caused by noise, music, side-speech, accents, sloppy pronunciation, hesitation, repetition, interruptions, and mobile phone differences. The results reported in [42] demonstrated that the best DNN-HMM acoustic model trained with context-dependent states as targets achieved a sentence accuracy of 69.6% on the test set, compared with 63.8% for a strong, minimum phone error (MPE)-trained GMM-HMM baseline.

The DBN-DNN used in the experiments was based on one of the DBN-DNNs that worked well for the TIMIT task. It used five pretrained layers of hidden units with 2,048 units per layer and was trained to classify the central frame of an 11-frame acoustic context window using 761 possible context-dependent states as targets. In addition to demonstrating that a DBN-DNN could provide gains on a large vocabulary task, several other important issues were explicitly investigated in [42]. It was found that using tied triphone context-dependent state targets was crucial and clearly superior to using monophone state targets, even when the latter were derived from the same forced alignment with the same baseline. It was also confirmed that the lower the error rate of the system used during forced alignment to generate frame-level training labels for the neural net, the lower the error rate of the final neural-net-based system. This effect was consistent across all the alignments they tried, including monophone alignments, alignments from ML-trained GMM-HMM systems, and alignments from DT GMM-HMM systems.

Further work after that of [42] extended the DNN-HMM acoustic model from 24 h of training data to 48 h and explored the respective roles of pretraining and fine-tuning the DBN-DNN [44]. As expected, pretraining is helpful in training the DBN-DNN because it initializes the DBN-DNN weights to a point in the weight-space from which fine-tuning is highly effective. However, a moderate increase of the amount of unlabeled pretraining data has an insignificant effect on the final recognition results (69.6% to 69.8%), as long as the original training set is fairly large. By contrast, the same amount of additional labeled fine-tuning training data significantly improves the performance of the DNN-HMMs (accuracy from 69.6% to 71.7%).
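The sentence error rates quoted later in Table 3 for this task are simply the complements of these accuracies; as a quick check of the arithmetic:

```python
# Sentence accuracies reported above: DNN-HMM 69.6%, MPE GMM-HMM baseline 63.8%.
dnn_err = 100.0 - 69.6          # 30.4% sentence error rate (as in Table 3)
gmm_err = 100.0 - 63.8          # 36.2%
relative_reduction = (gmm_err - dnn_err) / gmm_err
print(f"{dnn_err:.1f}%, {gmm_err:.1f}%, {100 * relative_reduction:.1f}% relative")
```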


Switchboard speech recognition task
The DNN-HMM training recipe developed for the Bing voice search data was applied unaltered to the Switchboard speech recognition task [43] to confirm the suitability of DNN-HMM acoustic models for large vocabulary tasks. Before this work, DNN-HMM acoustic models had only been trained with up to 48 h of data [44] and hundreds of tied triphone states as targets, whereas this work used over 300 h of training data and thousands of tied triphone states as targets. Furthermore, Switchboard is a publicly available speech-to-text transcription benchmark task that allows much more rigorous comparisons among techniques.

The baseline GMM-HMM system on the Switchboard task was trained using the standard 309-h Switchboard-I training set. Thirteen-dimensional PLP features with windowed mean-variance normalization were concatenated with up to third-order derivatives and reduced to 39 dimensions by a form of linear discriminant analysis (LDA) called heteroscedastic LDA (HLDA). The SI crossword triphones used the common left-to-right three-state topology and shared 9,304 tied states. The baseline GMM-HMM system had a mixture of 40 Gaussians per (tied) HMM state that were first trained generatively to optimize a maximum likelihood (ML) criterion and then refined discriminatively to optimize a boosted maximum-mutual-information (BMMI) criterion. A seven-hidden-layer DBN-DNN with 2,048 units in each layer and full connectivity between adjacent layers replaced the GMM in the acoustic model. The trigram language model, used for both systems, was trained on the training transcripts of the 2,000 h of the Fisher corpus and interpolated with a trigram model trained on written text.

The primary test set is the FSH portion of the 6.3-h Spring 2003 National Institute of Standards and Technology rich transcription set (RT03S). Table 2, extracted from the literature, shows a summary of the core results. Using a DNN reduced the word error rate (WER) from the 27.4% of the baseline GMM-HMM (trained with BMMI) to 18.5%, a 33% relative reduction. The DNN-HMM system trained on 309 h performs as well as combining several speaker-adaptive (SA), multipass systems that use vocal tract length normalization (VTLN) and nearly seven times as much acoustic training data (the 2,000-h Fisher corpus) (18.6%; see the last row in Table 2).

[Table 2] Comparing five different DBN-DNN acoustic models with two strong GMM-HMM baseline systems that are DT. SI training on 309 h of data and single-pass decoding were used for all models except for the GMM-HMM system shown on the last row, which used SA training with 2,000 h of data and multipass decoding including hypotheses combination. In the table, "40 mix" means a mixture of 40 Gaussians per HMM state and "15.2 nz" means 15.2 million nonzero weights. WERs in % are shown for two separate test sets, Hub5'00-SWB and RT03S-FSH.

Modeling technique | #params [10^6] | Hub5'00-SWB | RT03S-FSH
GMM, 40 mix DT 309 h SI | 29.4 | 23.6 | 27.4
NN 1 hidden layer x 4,634 units | 43.6 | 26.0 | 29.4
+ 2 x 5 neighboring frames | 45.1 | 22.4 | 25.7
DBN-DNN 7 hidden layers x 2,048 units | 45.1 | 17.1 | 19.6
+ updated state alignment | 45.1 | 16.4 | 18.6
+ sparsification | 15.2 nz | 16.1 | 18.5
GMM 72 mix DT 2,000 h SA | 102.4 | 17.1 | 18.6

Detailed experiments [43] on the Switchboard task confirmed that the remarkable accuracy gains from the DNN-HMM acoustic model are due to the direct modeling of tied triphone states using the DBN-DNN, the effective exploitation of neighboring frames by the DBN-DNN, and the strong modeling power of deeper networks, as was discovered in the Bing voice search task [44], [42]. Pretraining the DBN-DNN leads to the best results but it is not critical: for this task, it provides an absolute WER reduction of less than 1% and this gain is even smaller when using five or more hidden layers. For underresourced languages that have smaller amounts of labeled data, pretraining is likely to be far more helpful.

Further study [45] suggests that feature-engineering techniques such as HLDA and VTLN, commonly used in GMM-HMMs, are more helpful for shallow neural nets than for DBN-DNNs, presumably because DBN-DNNs are able to learn appropriate features in their lower layers.

Google Voice Input speech recognition task
Google Voice Input transcribes voice search queries, short messages, e-mails, and user actions from mobile devices. This is a large vocabulary task that uses a language model designed for a mixture of search queries and dictation.

Google's full-blown model for this task, which was built from a very large corpus, uses an SI GMM-HMM model composed of context-dependent crossword triphone HMMs that have a left-to-right, three-state topology. This model has a total of 7,969 senone states and uses as acoustic input PLP features that have been transformed by LDA. Semitied covariances (STCs) are used in the GMMs to model the LDA transformed features, and BMMI [46] was used to train the model discriminatively.

Jaitly et al. [47] used this model to obtain approximately 5,870 h of aligned training data for a DBN-DNN acoustic model that predicts the 7,969 HMM state posteriors from the acoustic input. The DBN-DNN was loosely based on one of the DBN-DNNs used for the TIMIT task. It had four hidden layers with 2,560 fully connected units per layer and a final "softmax" layer with 7,969 alternative states. Its input was 11 contiguous frames of 40 log filter-bank outputs with no temporal derivatives. Each DBN-DNN layer was pretrained for one epoch as an RBM and then the resulting DNN was discriminatively fine-tuned for one epoch. Weights with magnitudes below a threshold were then permanently set to zero before a further quarter epoch of training. One third of the weights in the final network were zero.
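A minimal sketch of the magnitude-based sparsification step just described (the threshold, sizes, and the masked update are illustrative assumptions):

```python
import numpy as np

def sparsify(weights, threshold):
    """Permanently zero weights whose magnitude falls below a threshold and
    return a mask so that later updates can keep them at zero."""
    masks = [np.abs(W) >= threshold for W in weights]
    pruned = [W * m for W, m in zip(weights, masks)]
    return pruned, masks

def masked_sgd_step(weights, grads, masks, lr=0.001):
    """A further training step that respects the sparsity pattern."""
    return [(W - lr * g) * m for W, g, m in zip(weights, grads, masks)]

# Illustrative only: prune a single random weight matrix.
rng = np.random.default_rng(0)
W = [rng.normal(0, 0.05, (2560, 2560))]
pruned, masks = sparsify(W, threshold=0.02)
frac_zero = 1.0 - masks[0].mean()   # fraction of weights set to zero
```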


In addition to the DBN-DNN training, sequence-level discriminative fine-tuning of the neural network was performed using MMI, similar to the method proposed in [37]. Model combination was then used to combine results from the GMM-HMM system with the DNN-HMM hybrid, using the SCARF framework [47]. Viterbi decoding was done using the Google system [48] with modifications to compute the scaled log likelihoods from the estimates of the posterior probabilities and the state priors. Unlike the other systems, it was observed that for Voice Input it was essential to smooth the estimated priors for good performance. This smoothing of the priors was performed by rescaling the log priors with a multiplier that was chosen by using a grid search to find a joint optimum of the language model weight, the word insertion penalty, and the smoothing factor.

On a test set of anonymized utterances from the live Voice Input system, the DBN-DNN-based system achieved a WER of 12.3%, a 23% relative reduction compared to the best GMM-based system for this task. MMI sequence discriminative training gave an error rate of 12.2% and model combination with the GMM system 11.8%.

YouTube speech recognition task
In this task, the goal is to transcribe YouTube data. Unlike the mobile voice input applications described above, this application does not have a strong language model to constrain the interpretation of the acoustic information, so good discrimination requires an accurate acoustic model.

Google's full-blown baseline, built with a much larger training set, was used to create approximately 1,400 h of aligned training data. This was used to create a new baseline system for which the input was nine frames of MFCCs that were transformed by LDA. SA training was performed, and decision tree clustering was used to obtain 17,552 triphone states. STCs were used in the GMMs to model the features. The acoustic models were further improved with BMMI. During decoding, ML linear regression (MLLR) and feature space MLLR (fMLLR) transforms were applied.

The acoustic data used for training the DBN-DNN acoustic model were the fMLLR-transformed features. The large number of HMM states added significantly to the computational burden, since most of the computation is done at the output layer. To reduce this burden, the DNN used only four hidden layers with 2,000 units in the first hidden layer and only 1,000 in each of the layers above.

About ten epochs of training were performed on this data before sequence-level training and model combination. The DBN-DNN gave an absolute improvement of 4.7% over the baseline system's WER of 52.3%. Sequence-level fine-tuning of the DBN-DNN further improved results by 0.5% and model combination produced an additional gain of 0.9%.

English Broadcast News speech recognition task
DNNs have also been successfully applied to an English broadcast news task. Since a GMM-HMM baseline creates the initial training labels for the DNN, it is important to have a good baseline system. All GMM-HMM systems created at IBM use the following recipe to produce a state-of-the-art baseline system. First, SI features are created, followed by SA-trained (SAT) and DT features. Specifically, given initial PLP features, a set of SI features are created using LDA. Further processing of LDA features is performed to create SAT features using VTLN followed by fMLLR. Finally, feature- and model-space discriminative training is applied using the BMMI or MPE criterion.

Using alignments from a baseline system, [32] trained a DBN-DNN acoustic model on 50 h of data from the 1996 and 1997 English Broadcast News Speech Corpora [37]. The DBN-DNN was trained with the best-performing LVCSR features, specifically the SAT+DT features. The DBN-DNN architecture consisted of six hidden layers with 1,024 units per layer and a final softmax layer of 2,220 context-dependent states. The SAT+DT feature input into the first layer used a context of nine frames. Pretraining was performed following a recipe similar to [42].

Two phases of fine-tuning were performed. During the first phase, the cross entropy loss was used. For cross entropy training, after each iteration through the whole training set, loss is measured on a held-out set and the learning rate is annealed (i.e., reduced) by a factor of two if the held-out loss has grown or improves by less than a threshold of 0.01% from the previous iteration. Once the learning rate has been annealed five times, the first phase of fine-tuning stops. After weights are learned via cross entropy, these weights are used as a starting point for a second phase of fine-tuning using a sequence criterion [37] that utilizes the MPE objective function, a discriminative objective function similar to MMI [7] but which takes into account phoneme error rate.

A strong SAT+DT GMM-HMM baseline system, which consisted of 2,220 context-dependent states and 50,000 Gaussians, gave a WER of 18.8% on the EARS Dev-04f set, whereas the DNN-HMM system gave 17.5% [50].
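The annealing schedule used in the first fine-tuning phase above can be sketched as follows (a rough illustration; the exact bookkeeping of the relative improvement is an assumption of this sketch):

```python
def anneal_learning_rate(initial_lr, held_out_losses, improvement_threshold=1e-4,
                         factor=2.0, max_anneals=5):
    """After each pass through the training set, halve the learning rate
    whenever the held-out loss grows or improves by less than the threshold
    (0.01% here), and stop after five annealings."""
    lr, anneals, schedule = initial_lr, 0, []
    for prev, curr in zip(held_out_losses, held_out_losses[1:]):
        relative_gain = (prev - curr) / prev
        if relative_gain < improvement_threshold:
            lr /= factor
            anneals += 1
        schedule.append(lr)
        if anneals >= max_anneals:
            break          # first phase of fine-tuning stops here
    return schedule

# Illustrative held-out cross-entropy values after successive epochs.
print(anneal_learning_rate(0.008, [2.10, 1.95, 1.90, 1.897, 1.899, 1.898, 1.898]))
```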


Alternative pretraining methods for DNNs
Pretraining DNNs as generative models led to better recognition results on TIMIT and subsequently on a variety of LVCSR tasks. Once it was shown that DBN-DNNs could learn good acoustic models, further research revealed that they could be trained in many different ways. It is possible to learn a DNN by starting with a shallow neural net with a single hidden layer. Once this net has been trained discriminatively, a second hidden layer is interposed between the first hidden layer and the softmax output units and the whole network is again DT. This can be continued until the desired number of hidden layers is reached, after which full backpropagation fine-tuning is applied.

This type of discriminative pretraining works well in practice, approaching the accuracy achieved by generative DBN pretraining, and further improvement can be achieved by stopping the discriminative pretraining after a single epoch instead of multiple epochs as reported in [45]. Discriminative pretraining has also been found effective for the architectures called "deep convex network" [51] and "deep stacking network" [52], where pretraining is accomplished by convex optimization involving no generative models.
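A minimal PyTorch sketch of this layer-by-layer discriminative procedure is given below. It is not the exact recipe of [45]: the layer sizes, epoch counts, and random stand-in data are placeholders, and whether the output layer is reused or reinitialized after each insertion is a detail the text does not specify (it is reused here).

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Assumed sizes: 429 input coefficients (11 frames of 39-d features),
    # 2,000 units per hidden layer, and 2,000 tied context-dependent states.
    n_in, n_hid, n_states, n_layers = 429, 2000, 2000, 5

    # Random stand-in data so the sketch runs end to end.
    frames = torch.randn(1024, n_in)
    states = torch.randint(0, n_states, (1024,))
    loader = DataLoader(TensorDataset(frames, states), batch_size=128, shuffle=True)

    def train_discriminatively(model, epochs):
        opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

    # Start with a single hidden layer feeding the softmax output layer.
    output_layer = nn.Linear(n_hid, n_states)          # produces the state logits
    hidden = [nn.Linear(n_in, n_hid), nn.Sigmoid()]
    model = nn.Sequential(*hidden, output_layer)
    train_discriminatively(model, epochs=1)

    # Interpose a new hidden layer below the output units and retrain,
    # until the desired depth is reached.
    for _ in range(n_layers - 1):
        hidden += [nn.Linear(n_hid, n_hid), nn.Sigmoid()]
        model = nn.Sequential(*hidden, output_layer)
        train_discriminatively(model, epochs=1)

    # Full backpropagation fine-tuning of the complete stack.
    train_discriminatively(model, epochs=10)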

Purely discriminative training of the whole DNN from random initial weights works much better than had been thought, provided the scales of the initial weights are set carefully, a large amount of labeled training data is available, and minibatch sizes over training epochs are set appropriately [45], [53]. Nevertheless, generative pretraining still improves test performance, sometimes by a significant amount.

Layer-by-layer generative pretraining was originally done using RBMs, but various types of autoencoder with one hidden layer can also be used (see Figure 2). On vision tasks, performance similar to RBMs can be achieved by pretraining with "denoising" autoencoders [54] that are regularized by setting a subset of the inputs to zero or "contractive" autoencoders [55] that are regularized by penalizing the gradient of the activities of the hidden units with respect to the inputs. For speech recognition, improved performance was achieved on both TIMIT and Broadcast News tasks by pretraining with a type of autoencoder that tries to find sparse codes [56].

[Fig2] An autoencoder is trained to minimize the discrepancy between the input vector and its reconstruction of the input vector on its output units. If the code units and the output units are both linear and the discrepancy is the squared reconstruction error, an autoencoder finds the same solution as principal components analysis (PCA) (up to a rotation of the components). If the output units and the code units are logistic, an autoencoder is quite similar to an RBM that is trained using CD, but it does not work as well for pretraining DNNs unless it is strongly regularized in an appropriate way. If extra hidden layers are added before and/or after the code layer, an autoencoder can compress data much better than PCA [17].
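The following NumPy sketch shows the denoising variant in its simplest form: one logistic hidden layer is trained to reconstruct its clean input from a copy in which a random subset of the inputs has been set to zero, and the learned weights would then initialize one layer of the DNN. The layer sizes, corruption fraction, and learning rate are illustrative assumptions rather than values from [54].

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_cases, lr = 429, 512, 2048, 0.05

    X = rng.normal(size=(n_cases, n_in))           # stand-in real-valued inputs
    W1 = 0.01 * rng.normal(size=(n_hid, n_in))     # encoder weights
    b1 = np.zeros(n_hid)
    W2 = 0.01 * rng.normal(size=(n_in, n_hid))     # decoder weights
    b2 = np.zeros(n_in)

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    for epoch in range(10):
        # Corrupt the input by zeroing a random 20% subset of its components.
        mask = rng.random(X.shape) > 0.2
        Xc = X * mask

        H = logistic(Xc @ W1.T + b1)               # logistic code units
        R = H @ W2.T + b2                          # linear reconstruction
        E = R - X                                  # error against the CLEAN input

        # Gradients of the mean squared reconstruction error.
        dW2 = E.T @ H / n_cases
        db2 = E.mean(axis=0)
        dH = (E @ W2) * H * (1.0 - H)
        dW1 = dH.T @ Xc / n_cases
        db1 = dH.mean(axis=0)

        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= lr * g

        print(epoch, float((E ** 2).mean()))

    # W1 and b1 would then be used to initialize one hidden layer of the DNN.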
Alternative fine-tuning methods for DNNs
Very large GMM acoustic models are trained by making use of the parallelism available in compute clusters. It is more difficult to use the parallelism of cluster systems effectively when training DBN-DNNs. At present, the most effective parallelization method is to parallelize the matrix operations using a GPU. This gives a speed-up of between one and two orders of magnitude, but the fine-tuning stage remains a serious bottleneck, and more effective ways of parallelizing training are needed. Some recent attempts are described in [52] and [57].

Most DBN-DNN acoustic models are fine-tuned by applying stochastic gradient descent with momentum to small minibatches of training cases. More sophisticated optimization methods that can be used on larger minibatches include nonlinear conjugate gradient [17], limited-memory BFGS (L-BFGS) [58], and "Hessian-free" methods adapted to work for DNNs [59]. However, the fine-tuning of DNN acoustic models is typically stopped early to prevent overfitting, and it is not clear that the more sophisticated methods are worthwhile for such incomplete optimization.
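The momentum update mentioned here is simple to state; the sketch below applies it to a toy quadratic objective purely to show the bookkeeping of one velocity per parameter and one update per minibatch. The momentum coefficient, learning rate, and minibatch schedule are placeholders, and in a real system the gradient would come from backpropagation through the DNN, with training stopped early on a development set.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy parameter and a noisy target: each "minibatch" yields a stochastic
    # gradient of 0.5 * ||W - (target + noise)||^2.
    W = rng.normal(size=(10, 5))
    target = rng.normal(size=(10, 5))

    velocity = np.zeros_like(W)
    learning_rate, momentum = 0.01, 0.9

    for epoch in range(5):
        for _ in range(200):                        # minibatches per epoch
            noisy_target = target + 0.1 * rng.normal(size=target.shape)
            grad = W - noisy_target                 # stochastic gradient
            velocity = momentum * velocity - learning_rate * grad
            W += velocity
        print("epoch", epoch, "distance to target:",
              float(np.linalg.norm(W - target)))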
Other Ways of Using Deep Neural Networks for Speech Recognition
The previous section reviewed experiments in which GMMs were replaced by DBN-DNN acoustic models to give hybrid DNN-HMM systems in which the posterior probabilities over HMM states produced by the DBN-DNN replace the GMM output model. In this section, we describe two other ways of using DNNs for speech recognition.

Using DBN-DNNs to provide input features for GMM-HMM systems



Here we describe a class of methods where neural networks are used to provide the feature vectors that the GMM in a GMM-HMM system is trained to model. The most common approach to extracting these feature vectors is to DT a randomly initialized neural net with a narrow bottleneck middle layer and to use the activations of the bottleneck hidden units as features. For a summary of such methods, commonly known as the tandem approach, see [60] and [61].

Recently, [62] investigated a less direct way of producing feature vectors for the GMM. First, a DNN with six hidden layers of 1,024 units each was trained to achieve good classification accuracy for the 384 HMM states represented in its softmax output layer. This DNN did not have a bottleneck layer and was therefore able to classify better than a DNN with a bottleneck. Then the 384 logits computed by the DNN as input to its softmax layer were compressed down to 40 values using a 384-128-40-384 autoencoder. This method of producing feature vectors is called AE-BN because the bottleneck is in the autoencoder rather than in the DNN that is trained to classify HMM states.
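A PyTorch sketch of the AE-BN pipeline is shown below: the logits of a state-classification DNN are compressed by a 384-128-40-384 autoencoder, and the 40-d bottleneck activations become the features on which the GMM-HMM system is trained. The classification network here is an untrained stand-in with an assumed input size and depth, the frames are random, and the training schedule and the discriminative steps described below are omitted.

    import torch
    import torch.nn as nn

    n_states, n_bottleneck = 384, 40

    # Stand-in for the state-classification DNN (six hidden layers of 1,024
    # units in the text; truncated here). Only its 384 output logits are used.
    classifier = nn.Sequential(
        nn.Linear(440, 1024), nn.Sigmoid(),
        nn.Linear(1024, 1024), nn.Sigmoid(),
        nn.Linear(1024, n_states),
    )

    # 384-128-40-384 autoencoder; the 40-unit code layer is the bottleneck.
    encoder = nn.Sequential(nn.Linear(n_states, 128), nn.Sigmoid(),
                            nn.Linear(128, n_bottleneck))
    decoder = nn.Sequential(nn.Sigmoid(), nn.Linear(n_bottleneck, 128),
                            nn.Sigmoid(), nn.Linear(128, n_states))

    optimizer = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()),
                                lr=0.01, momentum=0.9)

    frames = torch.randn(512, 440)                  # random stand-in input frames
    for step in range(20):
        logits = classifier(frames).detach()        # targets of the autoencoder
        optimizer.zero_grad()
        reconstruction = decoder(encoder(logits))
        loss = nn.functional.mse_loss(reconstruction, logits)
        loss.backward()
        optimizer.step()

    # 40-d AE-BN features that the GMM-HMM acoustic model would be trained on.
    ae_bn_features = encoder(classifier(frames)).detach()
    print(ae_bn_features.shape)                     # torch.Size([512, 40])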
Bottleneck feature experiments were conducted on 50 h and 430 h of data from the 1996 and 1997 English Broadcast News Speech collections and English broadcast audio from TDT-4. The baseline GMM-HMM acoustic model trained on 50 h was the same acoustic model described in the section "English Broadcast News Speech Recognition Task." The acoustic model trained on 430 h had 6,000 states and 150,000 Gaussians. Again, the standard IBM LVCSR recipe described in the aforementioned section was used to create a set of speaker-adapted (SA), discriminatively trained (DT) features and models.

All DBN-DNNs used SAT features as input. They were pretrained as DBNs and then discriminatively fine-tuned to predict target values for 384 HMM states that were obtained by clustering the context-dependent states in the baseline GMM-HMM system. As in the section "English Broadcast News Speech Recognition Task," the DBN-DNN was trained using the cross-entropy criterion, followed by the sequence criterion with the same annealing and stopping rules.

After the training of the first DBN-DNN terminated, the final set of weights was used for generating the 384 logits at the output layer. A second 384-128-40-384 DBN-DNN was then trained as an autoencoder to reduce the dimensionality of the output logits. The GMM-HMM system that used the feature vectors produced by the AE-BN was trained using feature and model space discriminative training. Both pretraining and the use of deeper networks made the AE-BN features work better for recognition. To fairly compare the performance of the system that used the AE-BN features with the baseline GMM-HMM system, the acoustic model of the AE-BN features was trained with the same number of states and Gaussians as the baseline system.

Table 4 shows the results of the AE-BN and baseline systems on both 50 and 430 h, for different steps in the LVCSR recipe described in the section "English Broadcast News Speech Recognition Task." On 50 h, the AE-BN system offers a 1.3% absolute improvement over the baseline GMM-HMM system, which is the same improvement as the DBN-DNN, while on 430 h the AE-BN system provides a 0.5% improvement over the baseline. The 17.5% WER is the best result to date on the Dev-04f task, using an acoustic model trained on 50 h of data. Finally, the complementarity of the AE-BN and baseline methods is explored by performing model combination on both the 50- and 430-h tasks. Table 4 shows that model combination provides an additional 1.1% absolute improvement over the individual systems on the 50-h task, and a 0.5% absolute improvement over the individual systems on the 430-h task, confirming the complementarity of the AE-BN and baseline systems.

[Table 4] WER in % on English Broadcast News.

    LVCSR stage          50 h (baseline / AE-BN)    430 h (baseline / AE-BN)
    FSA                  24.8 / 20.6                20.2 / 17.6
    +fBMMI               20.7 / 19.0                17.7 / 16.6
    +BMMI                19.6 / 18.1                16.5 / 15.8
    +MLLR                18.8 / 17.5                16.0 / 15.5
    Model combination    16.4                       15.0

Instead of replacing the coefficients usually modeled by GMMs, neural networks can also be used to provide additional features for the GMM to model [8], [9], [63]. DBN-DNNs have recently been shown to be very effective in such tandem systems. On the Aurora2 test set, pretraining decreased WERs by more than one third for speech with signal-to-noise levels of 20 dB or more, though this effect almost disappeared for very high noise levels [64].

Using DNNs to estimate articulatory features for detection-based speech recognition
A recent study [65] demonstrated the effectiveness of DBN-DNNs for detecting subphonetic speech attributes (also known as phonological or articulatory features [66]) in the widely used Wall Street Journal speech database (5k-WSJ0). Thirteen MFCCs plus first- and second-temporal derivatives were used as the short-time spectral representation of the speech signal. The phone labels were derived from the forced alignments generated using a GMM-HMM system trained with ML, and that HMM system had 2,818 tied-state, crossword triphones, each modeled by a mixture of eight Gaussians. The attribute labels were generated by mapping phone labels to attributes, simplifying the overlapping characteristics of the articulatory features. The 22 attributes used in the recent work, as reported in [65], are a subset of the articulatory features explored in [66] and [67]. DBN-DNNs achieved less than half the error rate of shallow neural nets with a single hidden layer. DNN architectures with five to seven hidden layers and up to 2,048 hidden units per layer were explored, producing greater than 90% frame-level accuracy for all 21 attributes tested in the full DNN system. On the same data, DBN-DNNs also achieved a very high per frame phone classification accuracy of 86.6%. This level of accuracy for detecting subphonetic fundamental speech units may allow a new family of flexible speech recognition and understanding systems that make use of phonological features in the full detection-based framework discussed in [65].
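To make the label-mapping step concrete, the small Python sketch below converts frame-level phone labels into binary attribute targets. The attribute inventory and the particular assignments are simplified illustrations only; they are not the 22-attribute set used in [65].

    # Illustrative mapping from phone labels to binary articulatory-attribute
    # targets. The attribute set and the assignments here are simplified
    # examples, not the inventory used in [65]-[67].
    ATTRIBUTES = ["vowel", "stop", "fricative", "nasal", "voiced", "labial", "coronal"]

    PHONE_TO_ATTRS = {
        "aa": {"vowel", "voiced"},
        "b":  {"stop", "voiced", "labial"},
        "p":  {"stop", "labial"},
        "s":  {"fricative", "coronal"},
        "z":  {"fricative", "voiced", "coronal"},
        "m":  {"nasal", "voiced", "labial"},
    }

    def attribute_targets(phone_label):
        """Turn a frame's phone label into a 0/1 target vector, one entry per
        attribute, which is what the attribute-detection DNNs are trained to predict."""
        attrs = PHONE_TO_ATTRS[phone_label]
        return [1 if a in attrs else 0 for a in ATTRIBUTES]

    # Frame-level phone alignments become frame-level attribute targets.
    alignment = ["s", "aa", "m", "b", "aa"]
    targets = [attribute_targets(p) for p in alignment]
    print(targets[0])   # [0, 0, 1, 0, 0, 0, 1] for "s"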



Summary and Future Directions
When GMMs were first used for acoustic modeling, they were trained as generative models using the EM algorithm, and it was some time before researchers showed that significant gains could be achieved by a subsequent stage of discriminative training using an objective function more closely related to the ultimate goal of an ASR system [7], [68]. When neural nets were first used, they were trained discriminatively. It was only recently that researchers showed that significant gains could be achieved by adding an initial stage of generative pretraining that completely ignores the ultimate goal of the system. The pretraining is much more helpful in deep neural nets than in shallow ones, especially when limited amounts of labeled training data are available. It reduces overfitting, and it also reduces the time required for discriminative fine-tuning with backpropagation, which was one of the main impediments to using DNNs when neural networks were first used in place of GMMs in the 1990s. The successes achieved using pretraining led to a resurgence of interest in DNNs for acoustic modeling. Retrospectively, it is now clear that most of the gain comes from using DNNs to exploit information in neighboring frames and from modeling tied context-dependent states. Pretraining is helpful in reducing overfitting, and it does reduce the time taken for fine-tuning, but similar reductions in training time can be achieved with less effort by careful choice of the scales of the initial random weights in each layer.

The first method to be used for pretraining DNNs was to learn a stack of RBMs, one per hidden layer of the DNN. An RBM is an undirected generative model that uses binary latent variables, but training it by ML is expensive, so a much faster, approximate method called CD is used. This method has strong similarities to training an autoencoder network (a nonlinear version of PCA) that converts each datapoint into a code from which it is easy to approximately reconstruct the datapoint. Subsequent research showed that autoencoder networks with one layer of logistic hidden units also work well for pretraining, especially if they are regularized by adding noise to the inputs or by constraining the codes to be insensitive to small changes in the input. RBMs do not require such regularization because the Bernoulli noise introduced by using stochastic binary hidden units acts as a very strong regularizer [21].

We have described how four major speech research groups achieved significant improvements in a variety of state-of-the-art ASR systems by replacing GMMs with DNNs, and we believe that there is the potential for considerable further improvement. There is no reason to believe that we are currently using the optimal types of hidden units or the optimal network architectures, and it is highly likely that both the pretraining and fine-tuning algorithms can be modified to reduce the amount of overfitting and the amount of computation. We therefore expect that the performance gap between acoustic models that use DNNs and ones that use GMMs will continue to increase for some time.

Currently, the biggest disadvantage of DNNs compared with GMMs is that it is much harder to make good use of large cluster machines to train them on massive data sets. This is offset by the fact that DNNs make more efficient use of data, so they do not require as much data to achieve the same performance, but finding better ways of parallelizing the fine-tuning of DNNs is still a major issue.

Authors
Geoffrey Hinton ([email protected]) received his Ph.D. degree from the University of Edinburgh in 1978. He spent five years as a faculty member at Carnegie Mellon University, Pittsburgh, Pennsylvania, and he is currently a distinguished professor at the University of Toronto. He is a fellow of the Royal Society and an honorary foreign member of the American Academy of Arts and Sciences. His awards include the David E. Rumelhart Prize, the International Joint Conference on Artificial Intelligence Research Excellence Award, and the Gerhard Herzberg Canada Gold Medal for Science and Engineering. He was one of the researchers who introduced the back-propagation algorithm. His other contributions include Boltzmann machines, distributed representations, time-delay neural nets, mixtures of experts, variational learning, CD learning, and DBNs.
Li Deng ([email protected]) received his Ph.D. degree from the University of Wisconsin–Madison. In 1989, he joined the Department of Electrical and Computer Engineering at the University of Waterloo, Ontario, Canada, as an assistant professor, where he became a tenured full professor in 1996. In 1999, he joined MSR, Redmond, Washington, as a senior researcher, where he is currently a principal researcher. Since 2000, he has also been an affiliate professor in the Department of Electrical Engineering at the University of Washington, Seattle, teaching the graduate course of computer speech processing. Prior to MSR, he also worked or taught at Massachusetts Institute of Technology, ATR Interpreting Telecommunications Research Laboratories (Kyoto, Japan), and Hong Kong University of Science and Technology. In the general areas of speech recognition, signal processing, and machine learning, he has published over 300 refereed papers in leading journals and conferences and three books. He is a Fellow of the Acoustical Society of America (ASA), the International Speech Communication Association (ISCA), and the IEEE. He was ISCA's Distinguished Lecturer in 2010–2011. He has been granted over 50 patents and has received awards/honors bestowed by IEEE, ISCA, ASA, Microsoft, and other organizations, including the latest 2011 IEEE Signal Processing Society (SPS) Meritorious Service Award. He served on the Board of Governors of the IEEE SPS (2008–2010), and as editor-in-chief of IEEE Signal Processing Magazine (2009–2011). He is currently the editor-in-chief of IEEE Transactions on Audio, Speech, and Language Processing (2012–2014). He is the general chair of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2013.



Dong Yu ([email protected]) received a Ph.D. degree in computer science from the University of Idaho, an M.S. degree in computer science from Indiana University at Bloomington, an M.S. degree in electrical engineering from the Chinese Academy of Sciences, and a B.S. degree (with honors) in electrical engineering from Zhejiang University (China). He joined Microsoft Corporation in 1998 and MSR in 2002, where he is a researcher. His current research interests include speech processing, robust speech recognition, discriminative training, spoken dialog systems, voice search technology, machine learning, and pattern recognition. He has published more than 90 papers in these areas and is the inventor/coinventor of more than 40 granted/pending patents. He is currently an associate editor of IEEE Transactions on Audio, Speech, and Language Processing (2011–present), has been an associate editor of IEEE Signal Processing Magazine (2008–2011), and was the lead guest editor of the Special Issue on Deep Learning for Speech and Language Processing (2010–2011), IEEE Transactions on Audio, Speech, and Language Processing.
George E. Dahl ([email protected]) received a B.A. degree in computer science with highest honors from Swarthmore College and an M.Sc. degree from the University of Toronto, where he is currently completing a Ph.D. degree with a research focus in statistical machine learning. His current main research interest is in training models that learn many levels of rich, distributed representations from large quantities of perceptual and linguistic data.
Abdel-rahman Mohamed ([email protected]) received his B.Sc. and M.Sc. degrees from the Department of Electronics and Communication Engineering, Cairo University, in 2004 and 2007, respectively. In 2004, he worked in the speech research group at RDI Company, Egypt. He then joined the ESAT-PSI speech group at the Katholieke Universiteit Leuven, Belgium. In September 2008, he started his Ph.D. degree at the University of Toronto. His research focus is in developing machine learning techniques to advance human language technologies.
Navdeep Jaitly ([email protected]) received his B.A. degree from Hanover College and an M.Math degree from the University of Waterloo in 2000. After receiving his master's degree, he developed algorithms and statistical methods for analysis of proteomics data at Caprion Pharmaceuticals in Montreal and at Pacific Northwest National Labs in Washington. Since 2008, he has been pursuing a Ph.D. degree at the University of Toronto. His current interests lie in machine learning, speech recognition, computational biology, and statistical methods.
Andrew Senior ([email protected]) received his Ph.D. degree from the University of Cambridge and is a research scientist at Google. Before joining Google, he worked at IBM Research in the areas of handwriting, audio-visual speech, face, and fingerprint recognition as well as video privacy protection and visual tracking. He edited Privacy Protection in Video Surveillance; coauthored Springer's Guide to Biometrics and over 60 scientific papers; holds 26 patents; and is an associate editor of the journal Pattern Recognition. His research interests range across speech and pattern recognition, computer vision, and visual art.
Vincent Vanhoucke ([email protected]) received his Ph.D. degree from Stanford University in 2004 for research in acoustic modeling and is a graduate of the Ecole Centrale Paris. From 1999 to 2005, he was a research scientist with the speech R&D team at Nuance, in Menlo Park, California. He is currently a research scientist at Google Research, Mountain View, California, where he manages the speech quality research team. Previously, he was with Like.com (now part of Google), where he worked on object, face, and text recognition technologies.
Patrick Nguyen ([email protected]) received his doctorate degree from the Swiss Federal Institute of Technology (EPFL) in 2002. In 1998, he founded a company developing a platform for real-time foreign exchange trading. He was with the Panasonic Speech Technology Laboratory from 2000 to 2004, in Santa Barbara, California, and with MSR in Redmond, Washington, from 2004 to 2010. He is currently a research scientist at Google Research, Mountain View, California. His area of expertise revolves around statistical processing of human language, and in particular, speech recognition. He is mostly known for segmental conditional random fields and eigenvoices. He was on the organizing committee of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2011, and he co-led the 2010 Johns Hopkins University (JHU) Workshop on Speech Recognition. He currently serves on the Speech and Language Technical Committee of the IEEE SPS.
Tara Sainath ([email protected]) received her Ph.D. degree in electrical engineering and computer science from Massachusetts Institute of Technology in 2009. The main focus of her Ph.D. work was in acoustic modeling for noise robust speech recognition. She joined the Speech and Language Algorithms group at IBM T.J. Watson Research Center upon completion of her Ph.D. degree. She organized a special session on sparse representations at INTERSPEECH 2010 in Japan. In addition, she has been a staff reporter for the IEEE Speech and Language Processing Technical Committee Newsletter. She currently holds 15 U.S. patents. Her research interests mainly focus on acoustic modeling, including sparse representations, deep belief networks, adaptation methods, and noise robust speech recognition.
Brian Kingsbury ([email protected]) received the B.S. degree (high honors) in electrical engineering from Michigan State University, East Lansing, in 1989 and the Ph.D. degree in computer science from the University of California, Berkeley, in 1998. Since 1999, he has been a research staff member in the Department of Human Language Technologies, IBM T.J. Watson Research Center, Yorktown Heights, New York. His research interests include large-vocabulary speech transcription, audio indexing and analytics, and information retrieval from speech. From 2009 to 2011, he served on the IEEE SPS's Speech and Language Technical Committee, and from 2010 to 2012 he was an ICASSP area chair. He is currently an associate editor of IEEE Transactions on Audio, Speech, and Language Processing.


References
[1] J. Baker, L. Deng, J. Glass, S. Khudanpur, Chin Hui Lee, N. Morgan, and D. O'Shaughnessy, "Developments and directions in speech recognition and understanding, part 1," IEEE Signal Processing Mag., vol. 26, no. 3, pp. 75–80, May 2009.
[2] S. Furui, Digital Speech Processing, Synthesis, and Recognition. New York: Marcel Dekker, 2000.
[3] B. H. Juang, S. Levinson, and M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Trans. Inform. Theory, vol. 32, no. 2, pp. 307–309, 1986.
[4] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, 1990.
[5] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, pp. 254–272, 1981.
[6] S. Young, "Large vocabulary continuous speech recognition: A review," IEEE Signal Processing Mag., vol. 13, no. 5, pp. 45–57, 1996.
[7] L. Bahl, P. Brown, P. de Souza, and R. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP, 1986, pp. 49–52.
[8] H. Hermansky, D. P. W. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proc. ICASSP, Los Alamitos, CA: IEEE Computer Society, 2000, vol. 3, pp. 1635–1638.
[9] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Norwell, MA: Kluwer, 1993.
[10] L. Deng, "Computational models for speech production," in Computational Models of Speech Pattern Processing. New York: Springer-Verlag, 1999, pp. 199–213.
[11] L. Deng, "Switching dynamic system models for speech articulation and acoustics," in Mathematical Foundations of Speech and Language Processing. New York: Springer-Verlag, 2003, pp. 115–134.
[12] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.
[13] A. Mohamed, G. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[15] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. AISTATS, 2010, pp. 249–256.
[16] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Comput., vol. 22, pp. 3207–3220, 2010.
[17] G. E. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[18] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proc. 24th Int. Conf. Machine Learning, 2007, pp. 473–480.
[19] J. Pearl, Probabilistic Inference in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
[20] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, pp. 1771–1800, 2002.
[21] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Tech. Rep. UTML TR 2010-003, Dept. Comput. Sci., Univ. Toronto, 2010.
[22] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, pp. 1527–1554, 2006.
[23] T. N. Sainath, B. Ramabhadran, and M. Picheny, "An exploration of large vocabulary tools for small vocabulary phonetic recognition," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2009.
[24] A. Mohamed, T. N. Sainath, G. E. Dahl, B. Ramabhadran, G. E. Hinton, and M. Picheny, "Deep belief networks using discriminative features for phone recognition," in Proc. ICASSP, 2011.
[25] A. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Proc. ICASSP, 2012.
[26] Y. Hifny and S. Renals, "Speech recognition using augmented conditional random fields," IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp. 354–365, 2009.
[27] A. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[28] J. Ming and F. J. Smith, "Improved phone recognition using Bayesian triphone models," in Proc. ICASSP, 1998, pp. 409–412.
[29] L. Deng and D. Yu, "Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition," in Proc. ICASSP, 2007, pp. 445–448.
[30] A. Halberstadt and J. Glass, "Heterogeneous measurements and multiple classifiers for speech recognition," in Proc. ICSLP, 1998.
[31] A. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," in Proc. Interspeech, 2010.
[32] T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky, "Exemplar-based sparse representation features: From TIMIT to LVCSR," IEEE Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.
[33] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted Boltzmann machine," in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds., 2010, pp. 469–477.
[34] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. ICASSP, 2012.
[35] X. He, L. Deng, and W. Chou, "Discriminative learning in sequential pattern recognition—A unifying review for optimization-oriented speech recognition," IEEE Signal Processing Mag., vol. 25, no. 5, pp. 14–36, 2008.
[36] Y. Bengio, R. De Mori, G. Flammia, and F. Kompe, "Global optimization of a neural network—Hidden Markov model hybrid," in Proc. EuroSpeech, 1991.
[37] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP, 2009, pp. 3761–3764.
[38] R. Prabhavalkar and E. Fosler-Lussier, "Backpropagation training for multilayer conditional random field based phone recognition," in Proc. ICASSP, 2010, pp. 5534–5537.
[39] H. Lee, P. Pham, Y. Largman, and A. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 1096–1104.
[40] L. Deng, D. Yu, and A. Acero, "Structured speech modeling," IEEE Trans. Audio Speech Lang. Processing, vol. 14, pp. 1492–1504, 2006.
[41] H. Zen, M. Gales, Y. Nankaku, and K. Tokuda, "Product of experts for statistical parametric speech synthesis," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 3, pp. 794–805, Mar. 2012.
[42] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pretrained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.
[43] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech, 2011, pp. 437–440.
[44] D. Yu, L. Deng, and G. Dahl, "Roles of pretraining and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in Proc. NIPS Workshop Deep Learning and Unsupervised Feature Learning, 2010.
[45] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. IEEE ASRU, 2011, pp. 24–29.
[46] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for model and feature-space discriminative training," in Proc. ICASSP, 2008.
[47] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, "An application of pretrained deep neural networks to large vocabulary speech recognition," in Proc. Interspeech, 2012.
[48] G. Zweig, P. Nguyen, D. V. Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, M. Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, G. S. V. S. Sivaram, S. Bowman, and J. Kao, "Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 summer workshop," in Proc. ICASSP, 2011, pp. 5044–5047.
[49] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
[50] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Improvements in using deep belief networks for large vocabulary continuous speech recognition," Speech and Language Algorithm Group, IBM, Tech. Rep. UTML TR 2010-003, Feb. 2011.
[51] L. Deng and D. Yu, "Deep convex network: A scalable architecture for speech pattern classification," in Proc. Interspeech, 2011.
[52] L. Deng, D. Yu, and J. Platt, "Scalable stacking and learning for building deep architectures," in Proc. ICASSP, 2012.
[53] D. Yu, L. Deng, G. Li, and F. Seide, "Discriminative pretraining of deep neural networks," U.S. Patent Filing, Nov. 2011.
[54] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371–3408, 2010.
[55] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in Proc. 28th Int. Conf. Machine Learning, 2011.
[56] C. Plahl, T. N. Sainath, B. Ramabhadran, and D. Nahamoo, "Improved pre-training of deep belief networks using sparse encoding symmetric machines," in Proc. ICASSP, 2012.
[57] B. Hutchinson, L. Deng, and D. Yu, "A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition," in Proc. ICASSP, 2012.
[58] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, "On optimization methods for deep learning," in Proc. 28th Int. Conf. Machine Learning, 2011.
[59] J. Martens, "Deep learning via Hessian-free optimization," in Proc. 27th Int. Conf. Machine Learning, 2010.
[60] N. Morgan, "Deep and wide: Multiple layers in automatic speech recognition," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, Jan. 2012.
[61] G. Sivaram and H. Hermansky, "Sparse multilayer perceptron for phoneme recognition," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, Jan. 2012.
[62] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Proc. ICASSP, 2012.
[63] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf, P. Jain, H. Hermansky, D. Ellis, G. Doddington, B. Chen, O. Cretin, H. Bourlard, and M. Athineos, "Pushing the envelope aside [speech recognition]," IEEE Signal Processing Mag., vol. 22, no. 5, pp. 81–88, Sept. 2005.
[64] O. Vinyals and S. V. Ravuri, "Comparing multilayer perceptron to deep belief network tandem features for robust ASR," in Proc. ICASSP, 2011, pp. 4596–4599.
[65] D. Yu, S. Siniscalchi, L. Deng, and C. Lee, "Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition," in Proc. ICASSP, 2012.
[66] L. Deng and D. Sun, "A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features," J. Acoust. Soc. Amer., vol. 85, no. 5, pp. 2702–2719, 1994.
[67] J. Sun and L. Deng, "An overlapping-feature based phonological model incorporating linguistic constraints: Applications to speech recognition," J. Acoust. Soc. Amer., vol. 111, no. 2, pp. 1086–1101, 2002.
[68] P. C. Woodland and D. Povey, "Large scale discriminative training of hidden Markov models for speech recognition," Comput. Speech Lang., vol. 16, pp. 25–47, 2002.

[SP]

