Digital Signal Processing: S. Shahnawazuddin, Nagaraj Adiga, Hemant K. Kathania, Gayadhar Pradhan, Rohit Sinha
Article history: Available online 16 May 2018

Keywords: Pitch-adaptive spectral estimation; TANDEM STRAIGHT; Vocal-tract length normalization; Speaking-rate normalization; Glottal closure instants; Zero-frequency filtering

Abstract

In the context of automatic speech recognition (ASR) systems, the front-end acoustic features should not be affected by signal periodicity (pitch period). Motivated by this fact, we have studied the role of a pitch-synchronous spectrum estimation approach, referred to as TANDEM STRAIGHT, in this paper. TANDEM STRAIGHT results in a smoother spectrum devoid of pitch harmonics to a large extent. Consequently, the acoustic features derived using the smoothed spectra outperform the conventional Mel-frequency cepstral coefficients (MFCC). The experimental evaluations reported in this paper are performed on speech data from a wide range of speakers belonging to different age groups, including children. The proposed features are found to be effective for all groups of speakers. To further improve the recognition of children’s speech, the effect of vocal-tract length normalization (VTLN) is studied. The inclusion of VTLN further improves the recognition performance. We have also performed a detailed study on the effect of speaking-rate normalization (SRN) in the context of children’s speech recognition. An SRN technique based on the anchoring of glottal closure instants estimated using zero-frequency filtering is explored in this regard. SRN is observed to be highly effective for child speakers belonging to different age groups. Finally, all the studied techniques are combined for effective mismatch reduction. In the case of the children’s speech test set, the use of the proposed features results in a relative improvement of 21.6% over the MFCC features even after combining VTLN and SRN.

© 2018 Elsevier Inc. All rights reserved.
https://doi.org/10.1016/j.dsp.2018.05.003
S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151 143
ther through principal component analysis (PCA) or heteroscedastic linear discriminant analysis (HLDA) performed on the training data. Since a significant amount of spectral information was retained, projecting the acoustic features to a lower-dimensional subspace was shown to outperform the cepstral-truncation-based scheme. At the same time, both these techniques suffered from a common problem, i.e., the recognition rates for the adults’ speech deteriorated due to cepstral truncation or low-rank feature projection. Motivated by these facts, we have explored pitch-adaptive spectral estimation in this work to reduce the ill-effects of signal periodicity. This not only yields smoother spectra but also avoids the loss of critical information, unlike the aforementioned approaches.

The difference in pitch is not the only factor that affects the performance in the case of children’s speech recognition on ASR systems trained using adults’ data. Another important mismatch factor is the difference in the speaking-rates of the two groups of speakers. Earlier studies have highlighted that variation in the speaking-rate affects both the perception and production of phones [25]. Degradation in ASR performance has been reported when the speaking-rate is exceptionally fast or slow [26–28]. A few earlier works have explored explicit speaking-rate modification for improving children’s ASR [29,23]. Therefore, we have also studied the effect of speaking-rate normalization (SRN) in this paper along with pitch-adaptive spectral estimation.

Spectral smoothening is required to reduce the ill-effects of the pitch harmonics. In order to derive smoothed spectra, several methods have been proposed over the years [32–34]. Kawahara et al. proposed a pitch-adaptive spectral analysis technique named STRAIGHT which gives equivalent resolution in both the time and frequency domains. After adaptive windowing, interpolation is performed to obtain smoothed vocal-tract spectra that are not affected by interference arising from signal periodicity. STRAIGHT provides a high-quality analysis-synthesis framework, which is commonly used in speech synthesis [35] and voice-conversion [36].

The MFCC features derived using the STRAIGHT-based spectra were employed for ASR in [37] and were found to be inferior to the conventional ones. Later, it was figured out that the cause of degradation was a smoothening function used after the pitch-adaptive windowing which led to over-smoothing. On removing that smoothening function, an enhanced recognition performance was obtained, as reported in [38]. Further, legacy STRAIGHT is computationally expensive. To alleviate these problems, TANDEM-STRAIGHT was introduced for spectrum estimation [30]. Motivated by the success of TANDEM STRAIGHT in speech synthesis and voice-conversion, its role in ASR is studied in this paper. In the following, a brief review of spectrum estimation through TANDEM STRAIGHT is presented. This is followed by a discussion on the effects of spectral smoothing, which is vital for ASR.
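The core TANDEM idea can be illustrated compactly: the short-time power spectrum of a periodic signal fluctuates with a period equal to the pitch period, and averaging two power spectra computed half a pitch period apart largely cancels that fluctuation. The sketch below is a minimal illustration of this principle, not the authors’ implementation; assuming the fundamental frequency f0 is known (TANDEM-STRAIGHT estimates it internally) and using a pitch-adaptive window of two pitch periods.

```python
import numpy as np

def tandem_spectrum(x, fs, t, f0):
    """Temporally stable power spectrum at time t (in seconds): average two
    windowed power spectra taken half a pitch period (1/(2*f0)) apart, which
    largely cancels the periodic fluctuation of the short-time spectrum.
    Minimal sketch of the TANDEM principle; f0 is assumed known."""
    T0 = 1.0 / f0
    win_len = int(round(2 * T0 * fs))          # pitch-adaptive window: 2 periods
    w = np.hanning(win_len)
    nfft = 1 << (win_len - 1).bit_length()     # next power of two

    def power_spec(center_s):
        c = int(round(center_s * fs))
        seg = x[c - win_len // 2: c - win_len // 2 + win_len]
        return np.abs(np.fft.rfft(seg * w, nfft)) ** 2

    # two analysis points, half a pitch period apart
    return 0.5 * (power_spec(t) + power_spec(t + T0 / 2))
```

For a purely periodic input the two component spectra have complementary temporal ripples, so their average is nearly independent of the exact analysis instant.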
Fig. 2. Comparison of STFT spectrogram, STRAIGHT spectrogram, and TANDEM spectrogram derived from a segment of speech from a high-pitched child speaker.
pitch-adaptive spectral estimation. Hence, we computed the variance of the cepstral coefficients for the vowel /IY/ obtained after spectral smoothening, and the same is shown in Fig. 1 (bottom panel). There is a significant reduction in the variance of each of the coefficients due to spectral smoothening. At the same time, the mismatch in the variance for the two pitch ranges is also reduced.

The variance plots for the vowel /IY/ extracted from the adults’ and children’s speech databases used in this study are shown in Fig. 4. For both classes of speakers, there is a significant reduction in the variance of the cepstral coefficients due to pitch-adaptive spectral estimation. It is worth noting that, when a conventional approach is used, the variance of the higher-order coefficients in the case of children’s speech is greater than that for adults’ speech. The variances of the higher-order cepstral coefficients become almost similar when spectral smoothening is done prior to the computation of the cepstral coefficients.

Fig. 4. Variance plots for the base MFCC features (C1–C12) for the vowel /IY/. The feature vectors for nearly 4000 speech frames corresponding to the central portion of the vowel extracted from the adults’ speech data were used for this analysis. Similarly, 3000 frames corresponding to the central portion of the vowel were extracted from the children’s speech. In the default case, the variance for the children’s data is much higher when compared to the adults’ case. A significant reduction in variance mismatch is achieved by pitch-adaptive spectral smoothening through TANDEM STRAIGHT. The feature vectors employed in these analyses have been normalized using cepstral mean and variance normalization.

After obtaining the smoothed spectra, the front-end features are derived using the usual steps. For the sake of clarity, the steps involved in the proposed front-end feature extraction technique are summarized in Fig. 5. The proposed acoustic features are referred to as TS-MFCC.

Fig. 5. Block diagram outlining the proposed front-end speech parameterization technique.

3. Experimental evaluations

The speech data used for training the ASR system was obtained from the British English speech corpus WSJCAM0 [39]. The train set derived from WSJCAM0 consisted of 15.5 hours of speech data from 92 male/female adult speakers between the ages of 18–55 years. The total number of utterances in the train set was 7852, with a total of 132,778 words. In order to evaluate the effectiveness of the existing MFCC as well as the proposed features, five different test sets were created. The details of the test sets are as follows:

• ADSet 1: This test set was derived from the WSJCAM0 corpus. It consisted of 0.6 hours of speech from 20 adult male/female speakers with a total of 5608 words.
• CHSet 2: This test set consisted of nearly 1 hour of speech data from 17 child speakers obtained from the British English speech database PF-STAR [40]. The ages of the child speakers in this test set lie between 4–7 years. The total number of words in this set was 4804.
• CHSet 3: This test set was also derived from the PF-STAR corpus. It consisted of almost 1 hour of speech data from 15 child speakers with a total of 4664 words. The ages of the child speakers in this test set lie between 8–10 years.
• CHSet 4: This test set was also derived from the PF-STAR corpus. It consisted of nearly 1 hour of speech data from 14 child speakers with a total of 4924 words. The ages of the child speakers in this test set lie between 11–14 years.
• CHSet 5: This test set was also derived from the PF-STAR corpus. It consisted of 1.1 hours of speech data from 60 child speakers with a total of 5067 words. The ages of the child speakers in this test set lie between 4–13 years.

It is worth highlighting here that the test data was obtained from speakers whose data was not included in the training set. Even among the test sets, none of the speakers were common. Moreover, children’s speech was not used for training in order to simulate large differences in pitch and speaking-rate. The age-wise splitting of utterances in each of the children’s speech test sets is given in Table 1. Since the amount of speech data available in PF-STAR corresponding to each of the considered ages is unbalanced, CHSet 5 does not have equal representation. It is to be noted that CHSet 5 was not created by pooling the data from CHSet 2, CHSet 3 and CHSet 4. The experimental studies reported in this paper were performed on wide-band (WB) speech data (sampled at 16 kHz). As the PF-STAR database is originally sampled at 22,050 samples per second, we down-sampled the speech files using MATLAB for consistency.

3.2. Front-end acoustic feature extraction

For computing the MFCC features, the speech data was first analyzed into overlapping short-time frames using a Hamming window of length 20 ms with a frame-shift of 10 ms. A pre-emphasis factor of 0.97 was used during feature extraction. A 40-channel Mel-filterbank was used for warping the linear spectra to the Mel-scale before computing the 13-dimensional base MFCC features. Next,
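The baseline front-end described above can be sketched in plain numpy with the stated parameters (20 ms Hamming frames, 10 ms shift, 0.97 pre-emphasis, 40 Mel channels, 13 base cepstra). This is a minimal sketch, not the authors’ exact implementation; the 512-point FFT size is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_ms=20, shift_ms=10,
         preemph=0.97, n_filt=40, n_ceps=13, nfft=512):
    """Baseline MFCC front-end with the parameters reported in Sec. 3.2."""
    # pre-emphasis
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # framing with a Hamming window
    flen = int(fs * frame_ms / 1000)
    fshift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(sig) - flen) // fshift
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)
    # power spectrum
    pspec = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 40-channel triangular Mel filterbank
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feat = np.log(np.maximum(pspec @ fbank.T, 1e-10))
    # DCT-II of the log filterbank energies; keep the first 13 coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return feat @ dct.T
```

In the proposed TS-MFCC front-end the power-spectrum step above would be replaced by the pitch-adaptively smoothed TANDEM-STRAIGHT spectrum, with the Mel-filterbank and DCT stages unchanged.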
Table 1
Age-wise splitting of the number of utterances taken from the child speakers to create each of the test sets.
Fig. 6. Time-domain waveforms and spectrograms illustrating the effect of increasing or decreasing the speaking-rate using the ZFF-GCI-based approach.
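The ZFF-GCI-based approach referenced in Fig. 6 relies on locating glottal closure instants via zero-frequency filtering: difference the signal, pass it through a cascade of two zero-frequency resonators, remove the slowly varying trend, and take the positive zero crossings. The sketch below follows this standard recipe; the average-pitch value used to size the trend-removal window is an assumed parameter, and this is not the authors’ exact implementation.

```python
import numpy as np

def zff_gcis(s, fs, avg_pitch_hz=120.0):
    """Glottal closure instants (sample indices) via zero-frequency filtering."""
    x = np.diff(s, prepend=s[0])               # difference to remove DC offset
    y = x.astype(np.float64)
    for _ in range(2):                          # cascade of two zero-frequency
        y = np.cumsum(np.cumsum(y))             # resonators: 1/(1 - z^-1)^2 each
    # trend removal: subtract the local mean over ~1.5 pitch periods, 3 passes
    n = int(1.5 * fs / avg_pitch_hz) | 1        # odd window length
    kern = np.ones(n) / n
    for _ in range(3):
        y = y - np.convolve(y, kern, mode="same")
    # GCIs at negative-to-positive zero crossings of the filtered signal
    return np.nonzero((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
```

Once the GCIs are anchored, speaking-rate modification amounts to repeating or dropping pitch-synchronous segments between successive GCIs, which changes duration without disturbing the pitch contour.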
plotted WERs are with respect to CHSet 5. The scaling factor for this study was varied from 1.25 to 1.55 in steps of 0.05. SRN results in additive improvements for both the MFCC as well as the proposed features. Further, the WERs are lower in the case of the proposed TS-MFCC features. Finally, we concatenated VTLN and fMLLR with SRN to further improve the performance. The WERs for this study are listed in Table 4. The three approaches are observed to be additive. Moreover, the proposed TS-MFCC features are noted to be superior in this case as well. The best-case WER is presented in bold to highlight this fact.

Fig. 8. WERs for the children’s test set (CHSet 5) with respect to GMM-HMM and DNN-HMM systems trained on adults’ speech illustrating the effect of speaking-rate normalization.
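The relative-improvement figures reported alongside Table 4 follow the usual definition, 100 × (WER_before − WER_after) / WER_before, applied here to the WERs before and after adding SRN:

```python
def relative_improvement(wer_before, wer_after):
    """Relative WER reduction in percent."""
    return 100.0 * (wer_before - wer_after) / wer_before

# Table 4 entries: WER after VTLN+fMLLR vs. after additionally applying SRN
print(round(relative_improvement(23.01, 13.54), 1))  # GMM, MFCC    -> 41.2
print(round(relative_improvement(20.72, 11.22), 1))  # GMM, TS-MFCC -> 45.8
print(round(relative_improvement(15.00, 11.28), 1))  # DNN, MFCC    -> 24.8
print(round(relative_improvement(12.48, 9.78), 1))   # DNN, TS-MFCC -> 21.6
```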
4. Conclusion

A novel front-end speech parameterization technique for ASR is presented in this paper. The proposed feature extraction approach includes a pitch-adaptive spectral estimation module prior to computing the acoustic features. The use of pitch-adaptive signal processing helps reduce the ill-effects of signal periodicity. Consequently, the derived acoustic features are more robust towards pitch variations compared to the conventional MFCC features. The same has been validated experimentally in this paper. Furthermore, the reported experimental evaluations have been performed on several different test sets comprising speech data from speakers belonging to different age groups. Breaking up the speech data into different age groups helps in better understanding the age-
dependence of the pitch-induced acoustic mismatch. For the task of decoding the children’s speech test sets using the adult data trained DNN-HMM system, relative improvements of 13–17% are obtained when the proposed TS-MFCC features are used instead of the conventional MFCC features. In addition to that, the proposed features are subjected to linear frequency warping for reducing the ill-effects of formant scaling. The inclusion of VTLN leads to further improvements in recognition performance. We have also studied the effect of speaking-rate normalization on children’s speech in this work. When SRN is performed, the proposed features result in further reductions in WER. After combining VTLN, fMLLR, and SRN, the use of the proposed features in place of the MFCC features for training DNN-HMM systems results in a relative improvement of 21.6% with respect to the children’s speech test set.

Table 4
WERs (in %) for the children’s speech test set (CHSet 5) with respect to adult data trained ASR systems demonstrating the effect of concatenating VTLN and fMLLR with SRN on MFCC as well as proposed features.

Acoustic model   Feature kind   VTLN+fMLLR   + SRN   Relative improv. (%)
GMM              MFCC           23.01        13.54   41.2
                 TS-MFCC        20.72        11.22   45.8
DNN              MFCC           15.00        11.28   24.8
                 TS-MFCC        12.48         9.78   21.6

References

[1] L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
[2] G.E. Hinton, L. Deng, D. Yu, G. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition, Signal Process. Mag. 29 (6) (2012) 82–97.
[3] G. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural networks for large vocabulary speech recognition, IEEE Trans. Speech Audio Process. 20 (1) (2012) 30–42.
[4] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, B. Strope, Your word is my command: Google search by voice: a case study, in: Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, 2010, pp. 61–90, Ch. 4.
[5] A. Hagen, B. Pellom, R. Cole, Children’s speech recognition with application to interactive books and tutors, in: Proc. ASRU, 2003, pp. 186–191.
[6] A. Hagen, B. Pellom, R. Cole, Highly accurate children’s speech recognition for interactive reading tutors using subword units, Speech Commun. 49 (12) (2007) 861–873.
[7] S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process. 28 (4) (1980) 357–366, https://doi.org/10.1109/TASSP.1980.1163420.
[8] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 57 (4) (1990) 1738–1752.
[9] S.P. Rath, D. Povey, K. Veselý, J. Černocký, Improved feature processing for deep neural networks, in: Proc. INTERSPEECH, 2013.
[10] V. Digalakis, D. Rtischev, L. Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures, IEEE Trans. Speech Audio Process. 3 (1995) 357–366.
[11] L. Lee, R. Rose, A frequency warping approach to speaker normalization, IEEE Trans. Speech Audio Process. 6 (1) (1998) 49–60.
[12] S. Ghai, R. Sinha, Exploring the role of spectral smoothing in context of children’s speech recognition, in: Proc. INTERSPEECH, 2009, pp. 1607–1610.
[13] R. Sinha, S. Shahnawazuddin, Assessment of pitch-adaptive front-end signal processing for children’s speech recognition, Comput. Speech Lang. 48 (2018) 103–121.
[14] W.M. Fisher, G.R. Doddington, K.M. Goudie-Marshall, The DARPA speech recognition research database: specifications and status, in: Proc. DARPA Workshop on Speech Recognition, 1986, pp. 93–99.
[15] S. Ghai, R. Sinha, A study on the effect of pitch on LPCC and PLPC features for children’s ASR in comparison to MFCC, in: Proc. INTERSPEECH, 2011, pp. 2589–2592.
[16] S. Ghai, R. Sinha, Analyzing pitch robustness of PMVDR and MFCC features for children’s speech recognition, in: Proc. Signal Processing and Communications (SPCOM), 2010.
[17] M. Russell, S. D’Arcy, Challenges for computer recognition of children’s speech, in: Proc. Speech and Language Technologies in Education (SLaTE), 2007.
[18] A. Potamianos, S. Narayanan, Robust recognition of children’s speech, IEEE Trans. Speech Audio Process. 11 (6) (2003) 603–616.
[19] S. Lee, A. Potamianos, S.S. Narayanan, Acoustics of children’s speech: developmental changes of temporal and spectral parameters, J. Acoust. Soc. Am. 105 (3) (1999) 1455–1468.
[20] M. Gerosa, D. Giuliani, S. Narayanan, A. Potamianos, A review of ASR technologies for children’s speech, in: Proc. Workshop on Child, Computer and Interaction, 2009, pp. 7:1–7:8.
[21] H. Liao, G. Pundak, O. Siohan, M.K. Carroll, N. Coccaro, Q. Jiang, T.N. Sainath, A.W. Senior, F. Beaufays, M. Bacchiani, Large vocabulary automatic speech recognition for children, in: Proc. INTERSPEECH, 2015, pp. 1611–1615.
[22] S. Shahnawazuddin, N. Adiga, H.K. Kathania, Effect of prosody modification on children’s ASR, IEEE Signal Process. Lett. 24 (11) (2017) 1749–1753.
[23] S. Ghai, Addressing Pitch Mismatch for Children’s Automatic Speech Recognition, Ph.D. thesis, Department of EEE, Indian Institute of Technology Guwahati, India, October 2011.
[24] H.K. Kathania, S. Shahnawazuddin, R. Sinha, Exploring HLDA based transformation for reducing acoustic mismatch in context of children speech recognition, in: Proc. International Conference on Signal Processing and Communications, 2014, pp. 1–5.
[25] J.L. Miller, L.E. Volaitis, Effect of speaking rate on the perceptual structure of a phonetic category, Percept. Psychophys. 46 (6) (1989) 505–512.
[26] M.A. Siegler, R.M. Stern, On the effects of speech rate in large vocabulary speech recognition systems, in: Proc. ICASSP, vol. 1, 1995, pp. 612–615.
[27] N. Mirghafori, E. Fosler, N. Morgan, Towards robustness to fast speech in ASR, in: Proc. ICASSP, vol. 1, 1996, pp. 335–338.
[28] N. Morgan, E. Fosler, N. Mirghafori, Speech recognition using on-line estimation of speaking rate, in: Proc. EUROSPEECH, 1997, pp. 2079–2082.
[29] G. Stemmer, C. Hacker, S. Steidl, E. Nöth, Acoustic normalization of children’s speech, in: Proc. INTERSPEECH, 2003, pp. 1313–1316.
[30] H. Kawahara, M. Morise, Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework, Sadhana 36 (5) (2011) 713–727.
[31] S. Prasanna, D. Govind, K.S. Rao, B. Yegnanarayana, Fast prosody modification using instants of significant excitation, in: Int. Conf. on Speech Prosody, 2010.
[32] J. Makhoul, Linear prediction: a tutorial review, Proc. IEEE 63 (4) (1975) 561–580.
[33] A.V. Oppenheim, Speech analysis-synthesis system based on homomorphic filtering, J. Acoust. Soc. Am. 45 (2) (1969) 458–465.
[34] R.J. McAulay, T.F. Quatieri, Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. Acoust. Speech Signal Process. 34 (4) (1986) 744–754.
[35] H. Zen, T. Toda, M. Nakamura, K. Tokuda, Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005, IEICE Trans. Inf. Syst. E90-D (1) (2007) 325–333.
[36] T. Toda, H. Saruwatari, K. Shikano, Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum, in: Proc. ICASSP, vol. 2, 2001, pp. 841–844.
[37] G. Garau, S. Renals, Combining spectral representations for large-vocabulary continuous speech recognition, IEEE Trans. Speech Audio Process. 16 (3) (2008) 508–518.
[38] G. Garau, S. Renals, Pitch adaptive features for LVCSR, in: Proc. INTERSPEECH, 2008, pp. 2402–2405.
[39] T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition, in: Proc. ICASSP, vol. 1, 1995, pp. 81–84.
[40] A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, M. Wong, The PF_STAR children’s speech corpus, in: Proc. INTERSPEECH, 2005, pp. 2761–2764.
[41] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi speech recognition toolkit, in: Proc. ASRU, 2011.

Syed Shahnawazuddin received his B.E. degree in Electronics and Communication Engineering from Visvesvaraya Technological University, Karnataka, India, in 2008. He then obtained his Ph.D. degree from the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, in 2016. He is currently working as Assistant Professor in the Department of Electronics and Communication Engineering at National Institute of Technology Patna, India. His research interests are speech signal processing, speech recognition, keyword spotting, speaker recognition and machine learning.

Nagaraj Adiga received his B.E. degree in Electronics and Communication Engineering from University Visvesvaraya College of Engineering, Bengaluru, India, in 2008. He was a Software Engineer in Alcatel-Lucent India Private Limited, Bengaluru, India, from 2008 to 2011, mainly focusing on next-generation high-leverage optical transport networks. He then obtained his Ph.D. degree from the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, in 2017. He is currently pursuing a PostDoc in the Department of Computer Science, University of Crete, Greece. His research interests are speech processing, speech synthesis, voice conversion, speech recognition, voice pathology and machine learning.

Hemant Kumar Kathania received his B.E. degree in Electronics and Communication Engineering from the University of Rajasthan, Jaipur, India, in 2008, and M.Tech. degree in Electronics and Electrical Engineering from the Indian Institute of Technology Guwahati, India, in 2012. Presently he is working as an Assistant Professor in the Department of Electronics and Communication Engineering, National Institute of Technology (NIT) Sikkim, India, and also pursuing his Ph.D. from NIT Sikkim. His current research interests include speech signal processing, speech recognition and machine learning.

Gayadhar Pradhan received his M.Tech. and Ph.D. degrees in Electronics and Electrical Engineering from Indian Institute of Technology Guwahati, India, in 2009 and 2013, respectively. He is currently working as Assistant Professor in the Department of Electronics and Communication Engineering at National Institute of Technology Patna, India. His research interests are speech signal processing, speaker recognition and speech recognition.

Rohit Sinha received the M.Tech. and Ph.D. degrees in Electrical Engineering from Indian Institute of Technology Kanpur, in 1999 and 2005, respectively. From 2004 to 2006, he was a Post-Doctoral Researcher with the Machine Intelligence Laboratory, Cambridge University, Cambridge, U.K. Since 2006, he has been with IIT Guwahati, India, where he is currently a Full Professor with the Department of Electronics and Electrical Engineering. His research interests include speaker normalization/adaptation in the context of automatic speech recognition, speaker verification, noise robust speech processing, and audio segmentation.