Exploring the Role of Pitch-Adaptive Cepstral
Features in Context of Children’s Mismatched ASR
Rohit Sinha, S Shahnawazuddin and Patri Satya Karthik
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, Guwahati -781039, India
{rsinha, s.syed, s.patri}@iitg.ernet.in
Abstract—The presented work explores the role of pitch-adaptive cepstral features in the context of automatic speech recognition (ASR) of children's speech on acoustic models trained on adults' speech. On account of the large acoustic mismatch between the training and test data, highly degraded recognition rates are noted for such cases. Earlier studies have shown that the said acoustic mismatch is aided by the insufficient smoothing of pitch harmonics in the case of mel-frequency cepstral coefficient (MFCC) features for child speakers. Motivated by that, in this work, we explore pitch-adaptive cepstral features for reducing the sensitivity to gross pitch variations. For this purpose, a simple technique based on adaptive cepstral truncation is employed for deriving the pitch-adaptive MFCCs. We have also explored the existing STRAIGHT-based MFCCs for contrast. Both approaches are found to result in significant and similar improvements for the children's mismatched ASR case. The effectiveness of the adaptive-truncation-based approach is also demonstrated in the context of deep-neural-network-based acoustic models. Further, it is shown that the effectiveness of the existing feature normalization techniques remains intact even with the use of the proposed features.

Index Terms—Children's speech recognition, pitch-adaptive features, STRAIGHT-based MFCC, DNN.

978-1-5090-1746-1/16/$31.00 © 2016 IEEE

I. INTRODUCTION

There are a number of speech-based applications where an automatic speech recognition (ASR) system is required to recognize the speech of both adult and child speakers (male/female). Some such applications are voice-based search, speech-based information retrieval, language learning tools, entertainment, etc. [1], [2], [3], [4], [5]. In those cases, a severe degradation in recognition performance is noted when children's speech is recognized on acoustic models trained using adults' speech and vice versa. This happens due to the differences in the acoustic attributes of adults' and children's speech. The major factors causing those differences are differences in the vocal-tract geometry, less precise control of the articulators, and a less refined ability to control suprasegmental aspects such as prosody, as summarized in [6]. Further, higher fundamental and formant frequencies and greater spectral variability are also noted in the case of children's speech, as reported in [7], [8]. Consequently, acoustic models trained on adults' speech are found unsuitable for recognizing children's speech and vice versa. One can, therefore, conclude that there is a need for improving the recognition of children's or adults' speech in acoustically mismatched cases. In this paper, we present our efforts towards improving the recognition of children's speech on acoustic models trained using adults' speech. The task of recognizing children's speech on adult-data-trained acoustic models is referred to as children's mismatched ASR in this work.

As mentioned earlier, child speakers exhibit higher fundamental frequencies in comparison to adult speakers. This affects the front-end speech parameterization process, resulting in severe pitch-dependent distortions. The front-end speech parameterization in ASR systems involves short-time analysis of the speech signal. The speech signal is generally analyzed using overlapping windows (Hamming, Hanning, etc.). For every frame of speech, the short-time Fourier transform (STFT) is computed, followed by mel-scale warping of the magnitude spectrum. The mel-scale mapping involves a bank of nonuniform-bandwidth triangular filters whose centre frequencies lie on the mel scale. The log-energies at the output of the mel-filterbank are converted to a cepstral representation by taking the discrete cosine transform (DCT). The vector obtained by low-time liftering of the resulting cepstra yields the mel-frequency cepstral coefficient (MFCC) [9] features. In general, the MFCC features are expected to be free from the effect of the pitch of the speech signal. In [10], [11], it is shown that the MFCC features do get affected for child (higher-pitch) speakers in contrast to adult (lower-pitch) speakers. Due to the insufficient smoothing of the pitch harmonics present in the magnitude spectrum of the windowed speech signal, ripples appear in the lower frequency region of the spectrum, as shown in Fig. 1. Consequently, the dynamic range of the higher-order MFCCs is reported to get enhanced in the case of children's speech [12]. In the case of acoustic models trained on adults' speech using mixtures of Gaussians, the variances corresponding to the higher-order coefficients are usually low. On decoding the children's speech feature vectors with respect to such acoustic models, the involved Mahalanobis distance metric happens to enhance the distance score for the higher-order feature coefficients. This, in turn, leads to a degradation in the recognition performance. In our recent work [13], low-rank feature projections were explored for reducing the aforementioned mismatch in the variances.

In order to address the aforementioned pitch-dependent distortions, we explore pitch-adaptive signal processing for speech parameterization in this paper. One such technique is the STRAIGHT-based spectral analysis reported in [15].
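The conventional front-end outlined above (windowing, STFT, mel filterbank, log compression, DCT, low-time liftering) can be sketched as follows. This is a minimal illustration: the 512-point FFT, the filterbank construction, and the function names are our assumptions, not the exact configuration used in the experiments reported later.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with centre frequencies uniformly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):            # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):            # falling edge of the triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, fs, n_filters=21, n_ceps=13):
    # Hamming window -> magnitude spectrum -> mel filterbank -> log -> DCT-II
    n_fft = 512
    win = frame * np.hamming(len(frame))
    spec = np.abs(np.fft.rfft(win, n_fft))
    energies = mel_filterbank(n_filters, n_fft, fs) @ spec
    log_e = np.log(energies + 1e-10)
    # DCT-II; keeping only the first n_ceps coefficients is the low-time liftering
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_e   # base features C0..C12
```

With a 25 ms frame at 8 kHz (200 samples), `mfcc_frame` returns the 13 base coefficients C0–C12 referred to throughout the paper.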
Fig. 1. Spectral plots corresponding to the central frame of the vowel /IY/ (extracted from TIMIT [14]) for varying values of F0 (100 Hz, 200 Hz, and 300 Hz). For deriving each of these spectra, the base MFCC features (C0–C12) are transformed back to the frequency domain using a 100-point IDCT. The effect of pitch-dependent distortions is quite evident, especially for F0 = 300 Hz. Note that shifts of -2 dB and -4 dB are added to the plots corresponding to F0 = 200 Hz and F0 = 300 Hz to make them separable.

Fig. 2. The smoothed spectra for the cases F0 = 200 Hz and F0 = 300 Hz obtained using (a) STRAIGHT analysis, and (b) the proposed approach. To make the curves distinguishable, an intentional shift of -2 dB is added to the lower plots. The same frames that were used in Fig. 1 are used in this study as well.
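The spectra in Fig. 1 are obtained by mapping the base cepstra back to the frequency axis. A minimal sketch of that inversion (an inverse DCT evaluated at 100 points, as mentioned in the caption; the exact basis weighting is our illustrative assumption) could look like:

```python
import numpy as np

def ceps_to_spectrum(ceps, n_points=100):
    """Map base cepstra (e.g. MFCCs C0..C12) back to a smoothed log
    spectrum by evaluating an inverse DCT on n_points frequency samples."""
    n_ceps = len(ceps)
    k = np.arange(n_ceps)
    # Inverse DCT-II basis sampled at n_points positions along the axis
    t = (2 * np.arange(n_points) + 1) / (2 * n_points)
    basis = np.cos(np.pi * np.outer(t, k))
    # C0 carries the mean level; halve its weight as in the orthogonal IDCT
    weights = np.ones(n_ceps)
    weights[0] = 0.5
    return basis @ (weights * ceps)
```

Applying this to the 13 base MFCCs of a frame yields the 100-point smoothed log-spectral envelope whose low-frequency ripples Fig. 1 highlights for high-pitched speech.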
Further, we also present an adaptive-cepstral-truncation-based spectrum smoothing approach to enhance the pitch robustness of MFCCs. The proposed approach is noted to be less sensitive to errors in frame-specific pitch estimation, unlike the STRAIGHT-based approach. Apart from the conventional Gaussian-mixture-based hidden Markov modeling (GMM-HMM), the proposed approach is also evaluated on an ASR system based on deep neural networks (DNN) [16]. Furthermore, we have also explored the existing dominant speaker normalization approaches in the context of the proposed pitch-adaptive MFCCs.

The remainder of this paper is organized as follows: In Section II, the proposed pitch-adaptive feature extraction approach is discussed. The experimental evaluation of the explored approaches is presented in Section III. Evaluation of the proposed approach in the DNN domain is given in Section IV. Finally, the paper is concluded in Section V.

II. PITCH-ADAPTIVE CEPSTRAL FEATURES

The spectral analysis of STRAIGHT proposes a pitch-adaptive window having equivalent resolution in both the time and frequency domains [15]. In contrast to the conventional STFT-based spectral analysis employing a Hamming window, a window that is Gaussian in both time and frequency is used. The STRAIGHT analysis results in a smoothed spectrum which is not affected by the signal periodicity. MFCCs derived using the STRAIGHT spectral analysis were employed for speech recognition in [17] and were found to be inferior to the conventional MFCCs. This degradation was attributed to the smoothing function used after pitch-adaptive windowing, as argued in that work. On removing the smoothing function, an improved recognition performance was reported in [18]. In that work, the training and test speech was from adult speakers; consequently, the reported improvements were small. Motivated by those studies, we explored the feasibility of employing STRAIGHT-based MFCCs in the context of children's mismatched ASR. Since the pitch-dependent distortions are much more severe in this case, much greater improvements in recognition performance are expected. The effect of STRAIGHT-based pitch-adaptive signal processing on the spectra derived from the resulting STRAIGHT-MFCC features is shown in Fig. 2(a). As evident from the shown plots, better spectral smoothing is achieved with STRAIGHT analysis compared to the default MFCCs (see Fig. 1).

The TEMPO algorithm employed in the STRAIGHT framework is observed to be computationally expensive. Moreover, the approach is quite sensitive to errors in frame-specific pitch estimation, as reported in [17]. Motivated by these observations, we present a simpler spectral smoothing approach that is less sensitive to errors in frame-specific pitch estimates. The steps in the proposed scheme are as follows: First, the spectral representation of the speech signal is obtained using STFT analysis with a fixed-duration Hamming window. For each frame, the log-compressed magnitude spectrum is derived and transformed to the cepstral domain using an inverse discrete Fourier transform (IDFT). Note that all these processing steps are essentially equivalent to linear filtering; thus the periodicity of the speech excitation is retained in the cepstral domain. Now a suitable low-time lifter is applied and the liftered cepstrum is transformed back to the spectral domain using the DFT. The block diagram of the proposed smoothed-spectrum derivation approach is shown in Fig. 4. Given the smoothed spectrum, the MFCCs are derived following the usual steps.

In order to determine the required duration of the low-time lifter, we make use of only the average pitch value of each speech utterance. The pitch value for an utterance can be obtained through a number of tools, viz. TEMPO [15], the RAPT algorithm [19], or WaveSurfer-based pitch tracking [20]. Though the frame-specific pitch values may differ among these tools, the average pitch is generally observed to be the same in all the cases.
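The smoothing steps listed above can be sketched as follows. The rule used here for the lifter length (just below one pitch period, i.e. about fs/F0 cepstral samples, derived from the utterance-average pitch) is our illustrative assumption; the text only states that the average pitch determines the lifter duration.

```python
import numpy as np

def smooth_spectrum(frame, fs, avg_f0):
    """Pitch-adaptive spectral smoothing by low-time liftering:
    log magnitude -> IDFT (cepstrum) -> truncation -> DFT.
    The lifter is cut just below one pitch period (fs/avg_f0 samples)
    so the excitation peak in the cepstrum is removed; this specific
    length rule is an assumption for illustration."""
    n_fft = 512
    win = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(win, n_fft)) + 1e-10)
    ceps = np.fft.ifft(log_mag).real            # real cepstrum via IDFT
    cutoff = int(0.9 * fs / avg_f0)             # below the first pitch peak
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0
    lifter[-cutoff + 1:] = 1.0                  # keep the symmetric half
    smoothed_log = np.fft.fft(ceps * lifter).real   # back to spectral domain
    return smoothed_log[: n_fft // 2 + 1]
```

The smoothed log spectrum returned here would then feed the usual mel-filterbank and DCT stages to produce the pitch-adaptive MFCCs.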
Fig. 3. Variance plots for the base MFCC features C1–C12 for the vowel /IY/ corresponding to two broad pitch (F0) ranges (F0 < 150 Hz and F0 > 220 Hz), with panels for the static, proposed, and STRAIGHT-based features. For this analysis, the feature vectors for nearly 2000 speech frames corresponding to the central portion of the vowel extracted from TIMIT are used. For the higher F0 range, the mismatch in the variances of the higher-order coefficients (C8–C12) is evident in the case of the static/default MFCCs (leftmost pane). A significant reduction in the variance mismatch is achieved by the proposed and the STRAIGHT-based approaches, as evident from the figure.
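The analysis behind Fig. 3 amounts to grouping base feature vectors by their broad F0 range and computing per-coefficient variances. A minimal sketch, with hypothetical feature and pitch arrays standing in for the TIMIT frames, is:

```python
import numpy as np

def variance_by_pitch_group(features, f0):
    """Split base MFCC vectors (rows: frames, cols: C0..C12) into the two
    broad F0 ranges used in Fig. 3 and return the per-coefficient variance
    of C1..C12 for each group."""
    low = features[f0 < 150.0]       # lower-pitch (typically adult) frames
    high = features[f0 > 220.0]      # higher-pitch (typically child) frames
    # Variance of C1..C12 within each pitch group
    return low[:, 1:13].var(axis=0), high[:, 1:13].var(axis=0)
```

Comparing the two returned vectors for the higher-order coefficients (C8–C12) reproduces the kind of variance mismatch the figure illustrates.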
[Block diagram: Input speech → Framing and windowing → DFT → Log-magnitude → IDFT → Low-time liftering → DFT → Smoothed spectrum, with a side path: Computing average pitch → Determining lifter length.]
Fig. 4. The block diagram for the proposed pitch-adaptive liftering approach for spectral smoothing.

Hence, in contrast to STRAIGHT, the proposed technique is less sensitive to errors in the frame-wise pitch estimate. The effect of applying pitch-adaptive liftering on the spectra derived from the resulting MFCCs is shown in Fig. 2(b). Like the STRAIGHT-based MFCCs, a smoother spectrum is obtained by the proposed approach despite the application of a global lifter to all frames.

As mentioned earlier, pitch-dependent distortions lead to an increase in the dynamic range of the higher-order MFCCs in the case of children's speech. In order to visualize that, we collected the base MFCC feature vectors for nearly 2000 speech frames corresponding to the central portion of the vowel /IY/ extracted from the TIMIT database [14]. These feature vectors were grouped into two broad pitch (F0) ranges, i.e., F0 < 150 Hz and F0 > 220 Hz. Using all the feature vectors belonging to a particular group, the variance was computed. The variance plots for the two F0 ranges are shown in Fig. 3. The leftmost pane corresponds to the case of the default (or static) feature extraction process. Due to the differences in the pitch values, an increase in the variance of the higher-order MFCCs is evident. The same process was repeated for the pitch-adaptive features as well and the corresponding variance plots are also shown in Fig. 3. It is to be noted that the mismatch in the variance is reduced substantially by the use of pitch-adaptive cepstral features. Therefore, the acoustic models trained on the pitch-adaptive features are expected to result in better recognition performance than those obtained by the use of the default/static features.

III. EXPERIMENTAL EVALUATION

A speaker-independent (SI) ASR system is developed for experimental evaluation using the Kaldi speech recognition toolkit [21]. The GMM-HMM-based acoustic modeling approach is employed for the same. The WSJCAM0 British English speech corpus [22] is used for learning the SI GMM-HMM parameters. This database consists of 15.5 hours of training speech (CAMtr) from 92 adult male/female speakers. The adults' speech test set (CAMts), used for the matched-case testing, consists of 0.6 hours of data from 20 speakers. The number of words in the CAMts set is 5,608. The effectiveness of the explored approaches in the mismatched condition is evaluated using the children's speech test set (PFts) of the PF-STAR British English speech database [23]. This set contains 1.1 hours of speech data from 60 child speakers with a total of 5,067 words. All speech data is re-sampled to an 8 kHz rate to simulate the telephone speech scenario.

For extracting the default (or static) MFCC features, the length of the Hamming window is chosen to be 25 ms. The frame rate and the pre-emphasis factor are selected to be 100 Hz and 0.97, respectively. The 13-dimensional base features (C0–C12) are computed using a 21-channel mel-filterbank. This is followed by time-splicing of the base features with a context size of 9 frames (±4). Next, linear discriminant analysis followed by a maximum likelihood linear transform is applied for dimensionality reduction and de-correlation. The final feature dimension is chosen to be 39, which is generally reported to be suitable for adults' ASR systems.
TABLE I
Comparison of the recognition performances obtained for different kinds of pitch-adaptive and default MFCC features on adults' and children's test sets.

Type of MFCC feature                 | WER (in %)
                                     | CAMts (adults) | PFts (children)
Default / Static                     | 12.15          | 62.55
STRAIGHT-based                       | 11.21          | 52.34
Proposed adaptive-truncation-based   | 10.97          | 50.78

The frame rate and frame width are kept the same for the proposed method as well. In the case of the STRAIGHT-based analysis, pitch estimation is done for each frame, which is 25 ms in duration, and a frame shift of 10 ms is maintained. In the default mode, STRAIGHT analysis requires a frame shift of 1 ms and a frame width of 80 ms for the estimation of pitch. In order to have the same number of frames in all the cases and to keep the latency substantially low, we chose the aforementioned values. The default MFCC feature extraction is performed with the aid of the MATLAB toolbox VOICEBOX [24].

For the GMM-HMM-based SI system development, decision-tree-based state tying is employed for cross-word triphone acoustic modeling. A 3-state left-to-right HMM with 8 diagonal-covariance Gaussian components per state is used for modeling each context-dependent triphone. The standard MIT-Lincoln 5k Wall Street Journal bi-gram language model (LM) is used for decoding the CAMts set. The used LM has a perplexity of 95.3 for CAMts and there are no out-of-vocabulary (OOV) words. The lexicon used in this case has a total of 5,850 words including the pronunciation variations. The use of the MIT-Lincoln LM in decoding the PFts set was not found suitable at all and resulted in a meaningless word error rate of 110.97%. This is attributed to the large differences in the word-list and word counts across the adults' and children's datasets. To prevent this linguistic mismatch from affecting the results in this study, a domain-specific bigram LM is employed for PFts decoding. This LM is trained on the transcripts of the speech data in PF-STAR excluding PFts (i.e., training transcripts only). The employed domain-specific bigram LM has an OOV rate of 1.20% and a perplexity of 95.8 for the PFts set. A lexicon of 1,969 words including the pronunciation variations is used. The word error rate (WER) metric is used as the measure of recognition performance.

The recognition performances for the explored pitch-adaptive MFCCs along with those of the default MFCC features for the adults' and children's test sets are given in Table I. Note that both the pitch-adaptive approaches have resulted in significant improvements for the children's (mismatched) as well as the adults' (matched) speech recognition cases. Interestingly, the proposed approach of deriving pitch-adaptive MFCCs, though based on very simplistic assumptions, is noted to result in competitive performance when contrasted with the existing STRAIGHT-based MFCCs. Moreover, improvements are also noticeable in the matched-case testing.

IV. EVALUATION OF PITCH-ADAPTIVE FEATURES IN DNN-BASED SYSTEM

Recently developed DNN-based acoustic modeling has been reported to achieve significant improvements in the matched-case recognition performance over the conventional GMM-based modeling. DNN-based acoustic modeling was explored for children's speech recognition in [25]. Motivated by those works, in this section we employ it for evaluating the proposed pitch-adaptive features.

In the DNN architecture, 8 hidden layers with tanh nonlinearities are employed. The objective function is the cross-entropy criterion, i.e., for each frame, the log-probability of the correct class. A soft-max layer representing the log-posteriors of the output labels corresponding to the context-dependent HMM states is used as the output layer. An initial learning rate of 0.015 is selected, which is reduced to 0.002 over 20 epochs. An extra 10 epochs are employed after reducing the learning rate to 0.002. A preconditioned form of stochastic gradient descent is employed during DNN training in Kaldi. The minibatch size for the neural net training is selected as 512. The lexicon and the LM employed for decoding PFts remain the same as discussed earlier.

For evaluating the effectiveness of the proposed pitch-adaptive features, two separate DNN systems, one using the default MFCC features and the other using the proposed adaptive MFCC features, are developed. Table II shows the relative improvements obtained in the DNN framework with the proposed adaptive MFCCs. A significant reduction in the WER can be noted with the use of the pitch-adaptive features. For contrast, Table II also shows the corresponding performances for the conventional GMM-based system.

We have also explored the effect of the existing feature normalization techniques for reducing the acoustic mismatch in the context of children's mismatched ASR. The approaches explored in this study are feature-space maximum likelihood linear regression (fMLLR) and vocal tract length normalization (VTLN). The employed fMLLR transform is estimated using the GMM-based system applying speaker adaptive training, as reported in [26]. For VTLN, warped features are computed for each of the utterances in the PFts set by varying the warp factor from 0.88 to 1.12 in steps of 0.02. The differently warped feature sets are aligned against the GMM-HMM model under the constraint of the first-pass hypothesis. The warp factor resulting in the highest likelihood is chosen as optimal. During decoding on the DNN-based system, the optimally warped features are employed. Furthermore, in order to further enhance the recognition performance, VTLN is combined with fMLLR. To do so, the optimally warped features are used in the estimation of the fMLLR transform for the test data. The WERs for those experiments are also given in Table II. We can note that the effectiveness of the existing feature normalization techniques is preserved in the context of the proposed adaptive features.
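The VTLN warp-factor search described above reduces to a grid search over 0.88–1.12 in steps of 0.02. In the sketch below, `score_fn` is a hypothetical stand-in for the real pipeline of warped-feature extraction and forced alignment against the GMM-HMM under the first-pass hypothesis; only the selection logic is shown.

```python
# Candidate warp factors: 0.88 to 1.12 in steps of 0.02, as in the text
warp_factors = [round(0.88 + 0.02 * k, 2) for k in range(13)]

def select_warp_factor(utterance, score_fn):
    """Pick the warp factor maximizing the acoustic-model likelihood of
    the warped features; score_fn(utterance, alpha) is a placeholder for
    warped-feature extraction plus alignment against the GMM-HMM."""
    return max(warp_factors, key=lambda alpha: score_fn(utterance, alpha))
```

The winning factor's features are then passed to the DNN system for decoding, and optionally on to fMLLR estimation when the two normalizations are combined.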
TABLE II
Percentage relative improvement in the recognition performances obtained through the proposed adaptive-truncation-based MFCC features in combination with existing speaker normalization techniques. These evaluations are done separately in the context-dependent GMM- and DNN-based acoustic modeling paradigms.

Acoustic  |                              WER (in %)
modeling  | No normalization    |       VTLN          |      fMLLR          |  VTLN + fMLLR
approach  | Def.  Prop.  Rel.   | Def.  Prop.  Rel.   | Def.  Prop.  Rel.   | Def.  Prop.  Rel.
GMM-HMM   | 62.55 50.78  19     | 35.06 27.62  21     | 43.53 35.83  18     | 25.82 21.42  17
DNN-HMM   | 43.32 39.32   9     | 23.57 21.43   9     | 24.25 22.33   8     | 17.00 15.90   6
Moreover, the VTLN and the fMLLR are found to result in additive reductions in WER.

V. CONCLUSION

In this work, we have evaluated pitch-adaptive signal-processing-based features in the context of children's mismatched ASR. A novel approach for deriving pitch-adaptive MFCCs is presented which employs adaptive cepstral truncation for smoothing the speech spectrum prior to feature computation. On contrasting with the existing STRAIGHT-based MFCCs, the proposed approach is found to be closely competitive. At the same time, it is noted for its robustness to errors in frame-specific pitch estimation. The effectiveness of the proposed pitch-adaptive features is also demonstrated in the DNN-based acoustic modeling paradigm. Further, it is shown that in DNN-based modeling an additional improvement in recognition performance in the context of children's mismatched ASR is possible by combining the fMLLR and the VTLN.

ACKNOWLEDGMENT

The authors wish to thank Prof. Hideki Kawahara for providing us with the opportunity to use the STRAIGHT code.

REFERENCES

[1] A. Hagen, B. Pellom, and R. Cole, "Children's speech recognition with application to interactive books and tutors," in Proc. Automatic Speech Recognition and Understanding (ASRU), Nov 2003, pp. 186–191.
[2] R. Nisimura, A. Lee, H. Saruwatari, and K. Shikano, "Public speech-oriented guidance system with adult and child discrimination capability," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2004, vol. 1, pp. 433–436.
[3] Linda Bell and Joakim Gustafson, "Children's convergence in referring expressions to graphical objects in a speech-enabled computer game," in Proc. Interspeech, 2007, pp. 2209–2212.
[4] Johan Schalkwyk, Doug Beeferman, Françoise Beaufays, Bill Byrne, Ciprian Chelba, Mike Cohen, Maryam Kamvar, and Brian Strope, "Your word is my command: Google search by voice: A case study," in Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, chapter 4, pp. 61–90, 2010.
[5] Andreas Hagen, Bryan Pellom, and Ronald Cole, "Highly accurate children's speech recognition for interactive reading tutors using subword units," Speech Communication, vol. 49, no. 12, pp. 861–873, 2007.
[6] A. Potamianos and S. Narayanan, "Robust recognition of children's speech," IEEE Trans. Speech and Audio Process., vol. 11, no. 6, pp. 603–616, November 2003.
[7] S. Eguchi and I. J. Hirsh, "Development of speech sounds in children," Acta Oto-Laryngologica. Supplementum, vol. 257, pp. 1–51, 1969.
[8] R. D. Kent, "Anatomical and neuromuscular maturation of the speech mechanism: Evidence from acoustic studies," JSHR, vol. 19, pp. 421–447, 1976.
[9] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoustics, Speech and Signal Process., vol. 28, no. 4, pp. 357–366, Aug 1980.
[10] S. Ghai and R. Sinha, "Exploring the role of spectral smoothing in context of children's speech recognition," in Proc. Interspeech, 2009, pp. 1607–1610.
[11] Shweta Ghai, Addressing Pitch Mismatch for Children's Automatic Speech Recognition, Ph.D. thesis, Department of EEE, Indian Institute of Technology Guwahati, India, October 2011.
[12] Rohit Sinha and Shweta Ghai, "On the use of pitch normalization for improving children's speech recognition," in Proc. Interspeech, 2009, pp. 568–571.
[13] S. Shahnawazuddin, Hemant Kathania, and Rohit Sinha, "Enhancing the recognition of children's speech on acoustically mismatched ASR system," in Proc. IEEE TENCON, 2015.
[14] William M. Fisher, George R. Doddington, and Kathleen M. Goudie-Marshall, "The DARPA speech recognition research database: specifications and status," in Proc. DARPA Workshop on Speech Recognition, 1986, pp. 93–99.
[15] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain de Cheveigné, "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[16] George Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Trans. Speech and Audio Process., vol. 20, no. 1, pp. 30–42, January 2012.
[17] G. Garau and S. Renals, "Combining spectral representations for large-vocabulary continuous speech recognition," IEEE Trans. Speech and Audio Process., vol. 16, no. 3, pp. 508–518, March 2008.
[18] G. Garau and S. Renals, "Pitch adaptive features for LVCSR," in Proc. Interspeech, 2008, pp. 2402–2405.
[19] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Elsevier, 1995.
[20] Kåre Sjölander and Jonas Beskow, "WaveSurfer - an open source speech tool," in Proc. Interspeech, 2000, pp. 464–467.
[21] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in Proc. Automatic Speech Recognition and Understanding (ASRU), Dec 2011.
[22] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, "WSJCAM0: A British English speech corpus for large vocabulary continuous speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1995, pp. 81–85.
[23] A. Batliner, M. Blomberg, S. D'Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, and M. Wong, "The PF STAR children's speech corpus," in Proc. Interspeech, 2005, pp. 2761–2764.
[24] Mike Brookes, "VOICEBOX: Speech Processing Toolbox for MATLAB," 2005.
[25] R. Serizel and D. Giuliani, "Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition," in Proc. Spoken Language Technology (SLT) Workshop, 2014, pp. 135–140.
[26] Shakti P. Rath, Daniel Povey, Karel Veselý, and Jan Černocký, "Improved feature processing for deep neural networks," in Proc. Interspeech, 2013.