Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
12 views10 pages

Digital Signal Processing: S. Shahnawazuddin, Nagaraj Adiga, Hemant K. Kathania, Gaydhar Pradhan, Rohit Sinha

Speech Signal Processing

Uploaded by

jayantrout.fca
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views10 pages

Digital Signal Processing: S. Shahnawazuddin, Nagaraj Adiga, Hemant K. Kathania, Gaydhar Pradhan, Rohit Sinha

Speech Signal Processing

Uploaded by

jayantrout.fca
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Digital Signal Processing 79 (2018) 142–151

Contents lists available at ScienceDirect

Digital Signal Processing


www.elsevier.com/locate/dsp

Studying the role of pitch-adaptive spectral estimation and


speaking-rate normalization in automatic speech recognition
S. Shahnawazuddin a,∗ , Nagaraj Adiga b , Hemant K. Kathania c , Gaydhar Pradhan a ,
Rohit Sinha d
a
Department of Electronics and Communication Engineering, National Institute of Technology, Patna, India
b
Department of Computer Science, University of Crete, Greece
c
Department of Electronics and Communication Engineering, National Institute of Technology, Sikkim, India
d
Department of Electronics and Electrical Engineering, Indian Institute of Technology, Guwahati, India

a r t i c l e i n f o a b s t r a c t

Article history: In the context of automatic speech recognition (ASR) systems, the front-end acoustic features should not
Available online 16 May 2018 be affected by signal periodicity (pitch period). Motivated by this fact, we have studied the role of pitch-
synchronous spectrum estimation approach, referred to as TANDEM STRAIGHT, in this paper. TANDEM
Keywords:
STRAIGHT results in a smoother spectrum devoid of pitch harmonics to a large extent. Consequently,
Pitch-adaptive spectral estimation
the acoustic features derived using the smoothed spectra outperform the conventional Mel-frequency
TANDEM STRAIGHT
Vocal-tract length normalization cepstral coefficients (MFCC). The experimental evaluations reported in this paper are performed on
Speaking-rate normalization speech data from a wide range of speakers belonging to different age groups including children. The
Glottal closure instants proposed features are found to be effective for all groups of speakers. To further improve the recognition
Zero-frequency filtering of children’s speech, the effect of vocal-tract length normalization (VTLN) is studied. The inclusion of
VTLN further improves the recognition performance. We have also performed a detailed study on the
effect of speaking-rate normalization (SRN) in the context of children’s speech recognition. An SRN
technique based on the anchoring of glottal closure instants estimated using zero-frequency filtering is
explored in this regard. SRN is observed to be highly effective for child speakers belonging to different
age groups. Finally, all the studied techniques are combined for effective mismatch reduction. In the case
of children’s speech test set, the use of proposed features results in a relative improvement of 21.6% over
the MFCC features even after combining VTLN and SRN.
© 2018 Elsevier Inc. All rights reserved.

1. Introduction ulary speech recognition task is no more a problem. Furthermore,


fast and efficient techniques for training the network parameters
Automatic speech recognition (ASR) is the task of generating have been developed [2]. As a result of the progress made in the
textual output by decoding a digitally acquired acoustic pressure research on speech processing, a number of speech-based user ap-
wave using machines (computers). With the introduction of hidden plications have been developed, e.g., voice-based web search, read-
Markov models (HMM) for statistically learning the acoustic and ing tutors, language learning tools and entertainment [4–6].
linguistic attributes of speech [1], rapid progress has been made in Another factor that plays an important role in ASR is the front-
ASR. In initial HMM-based systems, the observation densities for end speech parameterization module. The primary objective of
the HMM states were modeled using the continuous density Gaus- front-end speech parameterization process is to extract the in-
sian mixture models (GMM). Nowadays, the GMM is fast being re- formation relevant to the task and discard the rest. As a result,
the front-end acoustic features result in a compact representation
placed by deep neural networks (DNN) [2,3]. Since the computers
of raw speech signal. This significantly reduces the computational
available these days have excellent computing power, training very
cost. The development of front-end speech parameterization tech-
deep neural nets on large amounts of speech data for large vocab-
niques mimicking the human perception mechanism further aided
in boosting the performance of ASR systems. Two dominant speech
parameterization approaches used in ASR are the Mel-frequency
*
Corresponding author.
cepstral coefficients (MFCC) [7] and perceptual linear prediction
E-mail addresses: [email protected] (S. Shahnawazuddin), [email protected]
(N. Adiga), [email protected] (H.K. Kathania), [email protected] (G. Pradhan), coefficients (PLPC) [8]. In last two decades, further advancements
[email protected] (R. Sinha). have taken place in the ASR domain which includes normalization

https://doi.org/10.1016/j.dsp.2018.05.003
1051-2004/© 2018 Elsevier Inc. All rights reserved.
S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151 143

of the acoustic features prior to modeling by applying a set of lin-


ear transformations [9].
The performance of ASR systems deployed in the aforemen-
tioned user applications involving human machine interactions is
affected by a number of factors. One among those is the inter-
speaker variability such as age, gender, accent, speaking-rate, emo-
tion and health conditions of the speakers contributing to the
training and test speech. To impart robustness towards those fac-
tors, statistical models are trained on a large amount of speech
data collected from a different class of speakers. In addition to that,
techniques like feature-space maximum likelihood linear regres-
sion (fMLLR) [10] or vocal-tract length normalization (VTLN) [11]
are commonly included to reduce the ill-effects of inter-speaker
variability. Unfortunately, acoustic variations due to the aforemen-
tioned factors are so diverse that it is almost impossible to enhance
the robustness towards all the factors simultaneously. In this paper,
we attempt to reduce the ill-effects due to two of the dominant
mismatch factors viz. the pitch differences among the speakers and
the speaking-rate variability. In the following subsections, a brief
review of existing research on the aforementioned aspects is pre-
sented.

1.1. Motivation and prior art

As mentioned earlier, MFCC is one of the most commonly em-


ployed front-end acoustic features. In the case of MFCC features,
the speech signal is first analyzed into short-time overlapping
frames. Next, the spectral representation is obtained by short-time
Fourier transform (STFT). This is followed by warping the spectra
to a non-uniform frequency scale by applying a triangular Mel-
filterbank. Logarithmic compression of the filtered power spectrum
is done next. The real cepstrum (RC) is then obtained by applying
discrete cosine transform (DCT). The final MFCC features used for
learning system parameters are obtained by low-time liftering of
the cepstral coefficients. Fig. 1. Variance plots for the base MFCC features (C 1 –C 12 ) for vowel /IY/ corre-
In general, due to involved low-time liftering, it is expected that sponding to two broad pitch (F 0 ) ranges. The feature vectors for nearly 2600 speech
the MFCC features would be largely free from the effect of excita- frames (for each group) corresponding to the central portion of the vowel were used
tion. On the contrary, the MFCC features do get affected by the for this analysis. For the chosen F 0 ranges, the mismatch in the variance of higher-
order coefficients (especially C 11 and C 12 ) is evident in the default case. The bottom
signal periodicity especially for the high-pitched speakers in com-
pane highlights the reduction of variance mismatch with the TANDEM-STRAIGHT-
parison to the low-pitched ones [12,13]. The signal periodicity due based pitch-adaptive spectral estimation. These analyses were performed on the
to the excitation source is not well smoothed out while analyzing data extracted from TIMIT corpus. The feature vectors employed in these analyses
the signals having higher pitch values (>200 Hz) on warping of the have been normalized using cepstral mean and variance normalization.
frequency scale. This is mainly attributed to the narrow bandwidth
(≈100 Hz) of the lower-channel filters in the Mel-filterbank. Con-
The problem becomes much more challenging when we try to
sequently, some ill-effects of the pitch (or signal periodicity) are
transcribe children’s speech on ASR systems trained using adult’s
still present in the derived features, especially for the high-pitched
data and vice-versa. Despite the advances made in research on
speakers, even after low-time liftering. Hence, the MFCC features
ASR, a severe degradation in recognition performance can still be
exhibit enhanced variances for the higher-order coefficients for
noticed in such cases. The primary reason behind this observation
the high-pitched speakers in contrast to those for the low-pitched
is that the acoustic/linguistic properties of adults’ and children’s
speakers. To highlight this phenomenon, a study conducted on
speech differ substantially due to morphological and physiologi-
voiced speech frames from different pitch groups for several dif-
cal differences [17–20]. Consequently, achieving high recognition
ferent vowels was reported in [12]. We repeated that study for
performance for both adult and child speakers on an ASR sys-
a single vowel1 and the result for the same is summarized in
tem becomes quite challenging. Several studies for addressing the
Fig. 1. A mismatch in the variance of the higher-order coefficients
acoustic mismatch in the context of children’s ASR have been re-
(C 8 –C 12 ) for the two pitch groups is easily noticeable. In [12], a
ported [21,22,13].
more detailed study was reported quantifying the changes in the
For high-pitched child speakers, the problem due to insufficient
variance as well as the mean of the MFCC features on multiple
vowels due to pitch. Furthermore, several other front-end cepstral spectral smoothening becomes more pronounced. As argued in [12,
features such as the linear prediction cepstral coefficient (LPCC), 23], spectral smoothening can be increased by reducing the length
PLPC and the perceptual minimum variance distortionless response of the low-time lifter. Even though such an approach improves
(PMVDR) were analyzed and found to be sensitive to the variation the recognition performance for child speakers, there is a loss of
in the average pitch values [15,16]. relevant spectral information when a large number of cepstral co-
efficients are truncated. Motivated by that, low-rank feature pro-
jection was explored in [24]. In those works, the acoustic features
1
Since reliable vowel markings are available in the TIMIT database [14], this anal- were projected to lower-dimensional subspace prior to learning the
ysis was performed on the vowel data extracted from the same. model parameters. The low-rank projection matrix was derived ei-
144 S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151

ther through principal component analysis (PCA) or heteroscedastic smoothening is required to reduce the ill-effects of the pitch har-
linear discriminant analysis (HLDA) performed on the training data. monics. In order to derive smoothed spectra, several methods have
Since a significant amount of spectral information was retained, been proposed over the years [32–34]. Kawahara et al. proposed a
projecting the acoustic features to a lower-dimensional subspace pitch-adaptive spectral analysis technique named STRAIGHT which
was shown to outperform the cepstral-truncation-based scheme. gives equivalent resolution both in time and frequency domains.
At the same time, both these techniques suffered from a common After adaptive windowing, interpolation is done to get smoothed
problem i.e., the recognition rates for the adults’ speech got dete- vocal-tract spectra which is not affected by interference arising
riorated due to cepstral truncation or low-rank feature projection. from signal periodicity. STRAIGHT provides a high-quality analysis-
Motivated by these facts, we have explored pitch-adaptive spectral synthesis framework, which is commonly used in speech synthe-
estimation in this work to reduce the ill-effects of signal periodic- sis [35] and voice-conversion [36].
ity. This will not only yield smoother spectra but also avoid loss of The MFCC features derived using the STRAIGHT-based spectra
critical information, unlike the aforementioned approaches. were employed for ASR in [37] and were found to be inferior to
The differences in the pitch is not the only factor that af- the conventional ones. Later, it was figured out that the cause
fects the performance in the case of children’s speech recogni- of degradation was a smoothening function used after the pitch-
tion on ASR systems trained using adults’ data. Another impor- adaptive windowing which led to over-smoothing. On removing
tant mismatch factor is the difference in the speaking-rates for that smoothening function, an enhanced recognition performance
the two groups of speakers. Earlier studies have highlighted that was obtained as reported in [38]. Further, legacy STRAIGHT is
the variation in the speaking-rate affects both perception and pro- computationally expensive. To alleviate these problems, TANDEM-
duction of phones [25]. Degradation in the ASR performance has STRAIGHT was introduced for spectrum estimation [30]. Motivated
been reported when the speaking-rate is exceptionally fast or by the success of TANDEM STRAIGHT in speech synthesis and
slow [26–28]. A few earlier works had explored explicit speaking- voice-conversion, its role in ASR is studied in this paper. In the
rate modification for improving children’s ASR [29,23]. Therefore, following, a brief review of spectrum estimation through TANDEM
we have also studied the effect of speaking-rate normalization STRAIGHT is presented. This followed by a discussion on the effects
(SRN) in this paper along with pitch-adaptive spectral estimation. of spectral smoothing which is vital for ASR.

1.2. Contributions 2.1. Overview of TANDEM STRAIGHT

In TANDEM STRAIGHT, to compute smoothed spectra of a pe-


In order to address the above-mentioned shortcomings, an ap-
riodic signal, a time-window function which covers two harmonic
proach for spectral smoothing based on TANDEM STRAIGHT [30]
is proposed for ASR task. TANDEM STRAIGHT results in smoothed pitch periods is considered. The Fourier transform of time-window
vocal-tract envelope estimation devoid of any source effects due function H (ω) has negligible side lobes. Consider a signal x(t )
to the involved pitch-synchronous signal analysis. Moreover, it consisting of two sinusoidal periodic components separated by
is faster compared to the conventional STRAIGHT-based spectral ω0 = 2Tπ :
0
envelope estimation method. The MFCC features computed af-
ter spectral smoothing are found to outperform the conventional x(t ) = e jkω0 t + α e j (k+1)ω0 t +β (1)
MFCC features. Consistence improvements are noted for the task where α and β represent real numbers. The corresponding fre-
of transcribing adults’ as well as children’s speech. The experimen- quency domain representation is given as follows (by assuming
tal evaluations reported in this paper are done on speech data
k = 0 for simplicity):
divided into five different sets based on the age of the speakers.
The first test set consists of speech data from adult speakers only. X (ω) = δ(ω) + α e j β δ(ω − ω0 ). (2)
The remaining four test sets consist of speech data from children
belonging to the following age groups, 4–7, 8–10, 11–14 and 4–14 The power spectra of windowed test signal is then given by:
years, respectively. This mode of experimental evaluation helps in
understanding the relative impact of studied techniques on a dif- S (ω, t ) = | H (ω)|2 + α 2 | H (ω − ω0 )|2
ferent class of speakers. + 2α H (ω) H (ω − ω0 ) cos(ω0 t + β). (3)
To further improve the recognition performance for children’s
speech, the effect of VTLN is also studied. Combining spectral The third term in the above equation is time-dependent and rep-
smoothening with frequency scaling through VTLN, leads to ad- resents the temporal dependency in the spectrum estimation. The
ditive gains. In addition to that, we have explored the role of third term can be canceled by taking the opposite polarity with a
speaking-rate normalization as stated earlier. A technique based window at t + T 0 /2. The spectrum without any temporal fluctua-
on the anchoring of glottal closure instants (GCIs) to compen- tion i.e., the TANDEM spectrum T (ω, t ) is now given as:
sate for the speaking-rate variation is used in this work. The said  
technique comprises of two subtasks, (i) computing the GCI lo- 1
T (ω, t ) = S (ω, t ) + S (ω, t + T 0 /2) . (4)
cations and (ii) using the GCIs as anchor points for modifying 2
the speaking-rate. The approach based on zero-frequency filter- The envelope of TANDEM spectrum T (ω, t ) happens to be quite
ing (ZFF) is used to compute the GCI locations. The ZFF-GCI-based smooth, yet it closely follows the envelope of raw power spec-
technique is faster and more accurate compared to other methods trum obtained using fast Fourier transform (FFT). Thus avoiding
for modifying speaking-rate [31]. over-smoothing. At the same time, it is reported to be faster than
conventional STRAIGHT [30].
2. Pitch-adaptive spectral estimation However, a similar strategy cannot be applied directly to speech
signal because of finer resolution required to capture the dynam-
As highlighted earlier, the conventional approach for extracting ics of speech sounds. The periodic excitation of a set of resonators,
front-end acoustic features does not explicitly depend on pitch- such as the vocal-tract, by a pulse train, happens to be a sampling
adaptive signal processing. This leads to insufficient smoothening operation on the frequency axis. Alternatively, it may be consid-
of the spectra, especially for the high-pitched speaker. Spectral ered as an analog-to-digital (discrete) conversion on the frequency
S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151 145

Fig. 2. Comparison of STFT spectrogram, STRAIGHT spectrogram, and TANDEM spectrogram derived from a segment of speech from a high-pitched child speaker.

axis. Since the overall process consists of both analog-to-discrete as


well as discrete-to-analog conversions, Kawahara et al. introduced
consistent-sampling-based envelope estimation for speech sounds.
Consistent sampling provides a solution that recovers each of the
spectral levels. At the same time, the spectral variations due to sig-
nal periodicity are effectively suppressed. The spectral levels of the
smeared spectrum are compensated at the harmonic frequencies
to recover their original levels. The compensating digital filter de-
signed for this recovery is based on consistent sampling. Instead of
completely recovering the original spectrum, consistent sampling
only requires the re-sampled values to be recovered. The detailed
procedure of consistent-sampling-based enveloped estimation for
speech sound is given in [30].
The obtained TANDEM STRAIGHT spectrogram of a segment
of children’s speech is compared with the conventional STFT and
STRAIGHT-based spectrograms shown in Fig. 2. In the conventional
STFT spectrogram, we can see that the pitch harmonic structure is
still clearly visible. On the other hand, STRAIGHT spectrum results
in a over-smoothed spectrum. In TANDEM STRAIGHT, the spectra is
less smoothed and the harmonic structure is also removed due to
pitch synchronous analysis with faster computation. In this study,
we have used the TANDEM spectra for MFCC feature computation
as it is pitch-adaptive and hence, suitable for speech recognition
especially for high-pitched speakers.

2.2. Analyzing the effect of spectral smoothening

The log-compressed power spectra obtained by the conven-


tional approach employing pitch-independent signal processing for
the vowel /IY/ are shown in Fig. 3. The vowel frames for the adult Fig. 3. Power spectra for vowel /IY/ extracted from the speech data belonging to
male and female speakers are obtained from the WSJCAM0 cor- adult male, adult female and child speakers, respectively. The degree of spectral
smoothening due to pitch-adaptive signal processing is evident.
pus [39] while that for the child speaker is extracted from the
PF-STAR [40] database, respectively. Since only word/sentence level
transcription is available with these databases, forced-alignment STRAIGHT shown in Fig. 3. As a consequence of pitch-adaptive esti-
with respect to trained acoustic models was used to derive the mation, the pitch harmonics are highly suppressed in all the three
frame-level boundaries for the vowel. Forced-alignment was done studied cases.
under the constraints of true transcription. The degree of spectral As argued earlier, insufficient spectral smoothening leads to an
smoothening achieved through the pitch-adaptive spectral estima- increase in the variance of the cepstral coefficients, especially for
tion is evident from the power spectra for the case of TANDEM the high-pitched speakers. This aspect had motivated us to explore
146 S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151

to as TANDEM-STRAIGHT-MFCC (TS-MFCC) in the remaining of this


paper.

3. Experimental evaluations

The simulation studies performed for evaluating the effective-


ness of proposed front-end acoustic features are presented in this
section.

3.1. Details of the speech corpora employed

The speech data used for training the ASR system was obtained
from the British English speech corpus WSJCAM0 [39]. The train
set derived from WSJCAM0 consisted of 15.5 hours of speech data
from 92 male/female adult speakers between the ages 18–55 years.
The total number of utterances in the train set was 7852 with a
total 132,778 words. In order to evaluate the effectiveness of the
existing MFCC as well as the proposed features, five different test
sets were created. The details of the test sets are as follows:

• ADSet 1: This test set was derived from the WSJCAM0 corpus.
This test set consisted of 0.6 hours of speech from 20 adult
male/female speakers with a total of 5608 words.
• CHSet 2: This test set consisted of nearly 1 hour of speech
data from 17 child speakers obtained from the British English
speech database PF-STAR [40]. The age of the child speakers
in this test set lies in between 4–7 years. The total number of
Fig. 4. Variance plots for the base MFCC features (C 1 –C 12 ) for vowel /IY/. The feature words in this set was 4804.
vectors for nearly 4000 speech frames corresponding to the central portion of the • CHSet 3: This test set was also derived from the PF-STAR cor-
vowel extracted from the adults’ speech data were used for this analysis. Similarly,
pus. It consisted of almost 1 hour of speech data from 15
3000 frames corresponding to the central portion of the vowel was extracted from
the children’s speech. In the default case, the variance for children’s data is much child speakers with a total of 4664 words. The age of the child
higher when compared to adults’ case. Significant reduction in variance mismatch is speakers in this test set lies in between 8–10 years.
achieved by pitch-adaptive spectral smoothening through TANDEM STRAIGHT. The • CHSet 4: This test set was also derived from the PF-STAR cor-
feature vectors employed in these analyses have been normalized using cepstral
pus. It consisted of nearly 1 hour of speech data from 14 child
mean and variance normalization.
speakers with a total of 4924 words. The age of the child
speakers in this test set lies in between 11–14 years.
• CHSet 5: This test set was also derived from the PF-STAR cor-
pus. This set consisted of 1.1 hours of speech data from 60
child speakers with a total of 5067 words. The age of the child
speakers in this test set lies in between 4–13 years.

Its worth highlighting here that, the test data was obtained from
Fig. 5. Block diagram outlining the proposed front-end speech parameterization those speakers whose data was not included in the training set.
technique. Even among the test sets, none of the speakers were common.
Moreover, children’s speech was not used for training in order
pitch-adaptive spectral estimation. Hence, we computed the vari- to simulate large differences in pitch and speaking-rate. Age-wise
ance of the cepstral coefficients for the vowel /IY/ obtained after splitting of utterances in each of the children’s speech test sets
spectral smoothening and the same is shown in Fig. 1 (bottom is given in Table 1. Since the amount of speech data available in
panel). There is a significant reduction in the variance of each of PF-STAR corresponding to each of the considered age is unbal-
the coefficients due to spectral smoothening. At the same time, the anced, CHSet 5 does not have equal representation. It is to note
mismatch in the variance for the two pitch ranges is also reduced. that, CHSet 5 was not created by pooling the data from CHSet 2,
The variance plots for vowel /IY/ extracted from adults’ and CHSet 3 and CHSet 4. The experimental studies reported in this
children’s speech databases used in this study are shown in Fig. 4. paper were performed on wide-band (WB) speech data (sampled
For both the class of speakers, there is a significant reduction in at 16 kHz rates). As the PF-STAR database is originally sampled at
the variance of the cepstral coefficients due to pitch-adaptive spec- 22,050 samples per second, we down-sampled the speech files us-
tral estimation. It is worth noting that, when a conventional ap- ing MATLAB for consistency.
proach is used, the variance of the higher-order coefficients in the
case of children’s speech is more than that for adults’ speech. The 3.2. Front-end acoustic feature extraction
variance for the higher-order cepstral coefficients become almost
similar when spectral smoothening is done prior to the computa- For computing the MFCC features, speech data was first ana-
tion of the cepstral coefficients. lyzed into overlapping short-time frames using Hamming window
After obtaining the smoothed spectra, the front-end features are of length 20 ms with frame-shift of 10 ms. A pre-emphasis factor
derived using the usual steps. For the sake of clarity, the steps in- of 0.97 was used during feature extraction. A 40-channel Mel-
volved in the proposed front-end feature extraction technique are filterbank was used for warping the linear spectra to Mel-scale
summarized in Fig. 5. The proposed acoustic features are referred before computing the 13-dimensional base MFCC features. Next,
S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151 147

Table 1
Age-wise splitting of the number of utterances taken from the child speakers to create each of the test sets.

Test Age of the speaker (in years)


set 4 5 6 7 8 9 10 11 12 13 14
Number of utterances
CHSet 2 6 9 46 39 0 0 0 0 0 0 0
CHSet 3 0 0 0 0 33 35 32 0 0 0 0
CHSet 4 0 0 0 0 0 0 0 31 25 17 27
CHSet 5 2 0 8 8 39 6 42 16 2 2 0

time-splicing of the base MFCC features was performed consid- Table 2


ering a context size of 9 frames. In other words, 4 frames to WERs for the adults’ and children’s speech test sets with respect to adult data
trained GMM-HMM and DNN-HMM systems. Percentage relative improvements over
the left and to the right of the current analysis frame were ap-
the respective baselines are also tabulated.
pended to it making the feature vector dimension equal to 117.
This was followed by dimensionality reduction and de-correlation Acoustic Test set WER (in %) Relative
model MFCC TS-MFCC improv. (%)
using linear discriminant analysis (LDA) and maximum likelihood
linear transformation (MLLT) to obtain 40 dimensional feature vec- GMM Adult – ADSet 1 7.24 6.56 9.4
tors. The window length in the case of TS-MFCC is pitch-adaptive Child CHSet 2 (4–7 years) 85.44 72.40 15.3
CHSet 3 (8–10 years) 20.31 17.13 15.7
while the frame-shift was chosen as 10 ms. The number of chan-
CHSet 4 (11–14 years) 18.91 16.47 12.9
nels in the Mel-filterbank was chosen to be 40 in the case of CHSet 5 (4–14 years) 33.52 30.14 10.1
TS-MFCC features as well. As in the case of MFCC, time-splicing
DNN Adult – ADSet 1 5.89 5.28 10.4
followed by LDA+MLLT was performed on the base TS-MFCC to ob-
Child CHSet 2 (4–7 years) 56.79 49.40 13.0
tain 40 dimensional feature vectors. Cepstral mean and variance CHSet 3 (8–10 years) 15.30 12.90 15.7
normalization (CMVN) were applied to both the acoustic feature CHSet 4 (11–14 years) 11.48 9.73 15.2
kinds. In addition to CMVN, feature normalization was done using CHSet 5 (11–14 years) 19.27 15.89 17.5
feature-space maximum likelihood linear regression. This was done
to boost the robustness towards speaker-dependent variations. The
fMLLR transformations for the training and test data were gener- was trained on the transcripts of speech data in PF-STAR exclud-
ated using speaker adaptive training [9]. ing the test sets. The OOV rate and perplexity of the employed
bigram LM with respect to CHSet 5 were 1.20% and 95.8, respec-
3.3. ASR system specifications tively. Further, a lexicon consisting of 1969 words including the
pronunciation variations was employed. The word error rate (WER)
The ASR systems were developed on the 15.5 hours adults’ metric was used for evaluating the recognition performance.
speech data from WSJCAM0 speech corpus using the Kaldi toolkit
[41] (accessed on June 2017). Context-dependent hidden Markov 3.4. Evaluation results
models were employed for statistically learning the temporal vari-
ations in the speech data. For the initial studies, the observa- The baseline WERs for the five test sets, when MFCC features
tion probabilities for the HMM states were generated using Gaus- are used for system development and evaluation, are given in Ta-
sian mixture models. For GMM-HMM-based ASR systems, context- ble 2. Compared to ADSet 1, the recognition rates are extremely
dependent cross-word triphone models consisting of a 3-states poor for the children’s speech test sets (CHSet 2–5). It’s worth
HMM with 8 diagonal covariance Gaussian components per state noting that, despite the use of CMVN and fMLLR, the WERs are
were used. Further, decision tree-based state tying was performed much degraded as noted in other similar works. Furthermore, even
with the maximum number of tied-states (senones) being fixed at among the children, the WER is higher for the child speakers in the
2000. 4–7 age group. The WERs reduce as the age of the child speakers
Once significant improvements were obtained by using the pro- is increased. These results highlight the adverse effect of differ-
posed TS-MFCC features with respect to the GMM-HMM systems, ences in vocal-tract geometry on ASR. For very young children, the
acoustic modeling based on DNN was explored next. Prior to learn- vocal-organs are much smaller than the adults’. Consequently, the
ing parameters of the DNN-HMM-based ASR system, the fMLLR- pitch is higher and hence the pitch-induced acoustic mismatch is
normalized feature vectors were time-spliced once again consider- also more. As the child grows up, the geometry of the vocal-tract
ing a context size of 9. The number of hidden layers in the DNN- changes accordingly. As a result, the acoustic mismatch gets re-
HMM system was chosen as 8 with each layer consisting of 1024 duced.
hidden nodes. The nonlinearity in the hidden layers was modeled The effect of spectral smoothening on the pitch-induced acous-
using the tanh function. The initial learning rate for training the tic mismatch is evident from the WERs for TS-MFCC features given
DNN-HMM parameters was set at 0.005 which was reduced to in Table 2. The WERs are noted to reduce for all the five test sets
0.0005 in 15 epochs. Extra 5 epochs were employed after reduc- when pitch-adaptive signal processing is used prior to computing
ing the learning rate. The minibatch size for neural net training acoustic features. Similar trends noted for GMM-HMM- as well as
was selected as 512. The initial state-level alignments employed in DNN-HMM-based ASR systems. The percentage relative improve-
DNN training were generated using the earlier trained GMM-HMM ments obtained by using TS-MFCC features in place MFCC are also
system. included in Table 2 to highlight the effectiveness of the proposed
For decoding adults’ speech test (ADSet 1), MIT-Lincoln 5k Wall features.
Street Journal bi-gram language model (LM) was used. The MIT-
Lincoln LM has a perplexity of 95.3 with respect to ADSet 1 while 3.5. Improving children’s speech recognition
there are no out-of-vocabulary (OOV) words. The lexicon employed
in this case consisted of 5850 words including the pronuncia- As stated earlier, due to morphological and physiological dif-
tion variations. While decoding the children’s speech test sets, a ferences, the acoustic/linguistic properties of adults’ and children’s
domain-specific 1.5k bigram LM was employed. This bigram LM speech differ substantially. The pitch in the case of children’s
148 S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151

Table 3 uous speech recognition task, a relative improvement of nearly 4%


WERs for the children’s speech test set (CHSet 5) with respect to adult data trained was reported in that work.
ASR systems demonstrating the effect of concatenating VTLN with fMLLR on MFCC
Motivated by these results, we have also explored the effect
as well as proposed features.
of speaking-rate normalization on TS-MFCC features. Unlike the
Acoustic Feature WER (in %) Relative earlier works on children’s ASR, a prosody modification technique
model kind fMLLR VTLN+fMLLR improv. (%)
based on anchoring glottal closure instants is studied in this paper.
GMM MFCC 33.52 23.01 31.4 The GCIs are, in turn, derived from the given speech data using
TS-MFCC 30.14 20.72 31.3 zero-frequency filtering. The ZFF-GCI-based technique is faster and
DNN MFCC 19.27 15.00 22.2 more accurate compared to other methods [31]. Further, the exper-
TS-MFCC 15.89 12.48 21.5 imental evaluations are also performed on DNN-based ASR in this
work while the earlier reported studies had employed GMM-HMM
only.
speech is much higher due to the smaller size of vocal organs. In The ZFF-GCI-based technique for the speaking-rate normaliza-
addition to that, the formant frequencies also get scaled up in the tion (SRN) is summarized in the following:
case of children’s speech. Furthermore, the speaking-rate is also
slower for children when compared to adult speakers. The pitch-
• The speech signal is first passed through the ZFF filter twice.
induced acoustic mismatch can be substantially reduced by using
The transfer function H 1 ( z) for this filter is:
the proposed acoustic features as validated earlier. In this sec-
tion, we further improve the recognition performance for children’s 1 1
speech by addressing the issues of formant scaling and speaking- H 1 ( z) = = . (5)
(1 − z−1 )2 1 − 2z−1 + z−2
rate differences.
• Next, the trend in the filtered output is removed by using
3.5.1. Vocal-tract length normalization moving average filter with window length being equal to the
Earlier works have highlighted the fact that the vocal organs pitch period.
of children are smaller when compared to adult males or fe- • The positive zero-crossings of filtered output give the GCI lo-
males [17–20]. Consequently, substantial formant scaling takes cations. These GCIs are then used as the anchor points for
place in the case of children’s speech. The ill-effects of formant speaking-rate modification.
scaling can be largely addressed by linear-frequency warping • To modify the speaking-rate of a speaker, the GCI interval is
through the VTLN technique. Hence, VTLN was explored next to interpolated and re-sampled according to the SRN factor.
reduce the ill-effects of formant scaling. A set of linear frequency • Next, the modified GCI location is derived from the re-sampled
warping factors lying in the range of 0.70 to 1.12 was employed GCI interval. The new locations will be the modified GCI loca-
in this study. The warp factor value was varied in steps of 0.02. tions to accomplish the desired modification.
To determine the optimal warping factor, a maximum-likelihood • To reconstruct the speaking-rate normalized speech, the wave-
grid search was performed under the constraints of the first-pass form samples in the original GCI intervals are copied to the
hypothesis. The first-pass hypothesis was, in turn, obtained by de- corresponding modified GCI locations. To increase the dura-
coding the unwarped features using the developed acoustic mod- tion, some of the GCI intervals are repeated according to mod-
els. The optimally warped feature vectors were then re-decoded ification factor. Similarly, for decreasing the duration, some of
after performing fMLLR-based normalization. the intervals are deleted.
The effect of concatenating VTLN and fMLLR on the MFCC and
proposed features are demonstrated using WERs given in Table 3. Fig. 6 illustrates the effect of SRN applied to a speech signal
Only CHSet 5 is used for this study as it consists of speech data with increase and decrease in modification factor. From the figure,
from child speakers belonging to all the age groups. Significant re- we can see that the shape of the waveforms as well as the for-
ductions in WER are noted by the application of frequency warping mant transitions remain intact even after increasing or decreasing
through VTLN. At the same time, the proposed TS-MFCC features the speaking-rate. As mentioned earlier, speaking-rate is increased
are still noted to be superior to the conventional ones. The best by removing some adjacent samples of nearby GCI interval from
case WER is presented in bold to highlight the same. Even after the original samples based on ZFF-GCI technique. Similarly, when
combining VTLN and fMLLR, the proposed features result in a rel- speaking-rate is decreased, adjacent samples are repeated using
ative improvement of 21.5% over the MFCC case as evident from the GCI as the anchoring point.
the last column in Table 3. Since the speaking-rate is lesser in the case of children, the
scaling factor was varied from 1.10 to 1.55 in steps of 0.05. The
3.5.2. Speaking-rate normalization (SRN) correspondingly modified test data was then decoded to improve
In the case of young children, the average vowel durations are the recognition rates. The WER profiles for CHSet 2, CHSet 3, and
reported to be longer than those for the adults. As a result, chil- CHSet 4 with respect to GMM-HMM systems are shown in Fig. 7.
dren have lower speaking-rate than adults [19]. In [23], the mean The WERs are for the case when MFCC features are used for sys-
speaking-rates for adults’ and children’s speech was reported to tem development and evaluation. Increasing the speaking-rate is
be 2.03 syllables/sec and 1.79 syllables/sec, respectively. Due to noted to reduce the WER for all the three groups of child speak-
the differences in the speaking-rate, an acoustic mismatch occurs ers. For the three test sets, CHSet 2, CHSet 3, and CHSet 4, the
when children’s speech is transcribed using adult data trained relative reduction in WER over the corresponding baselines due to
ASR system. A few earlier works had explored explicit speaking- SRN are 45%, 50% and 26%, respectively. In other words, greater re-
rate modification for improving children’s ASR [29,23]. In [29], duction is obtained for speakers lying in the 4–10 years age group.
speaking-rate normalization was done using pitch-synchronous For the speaker belonging to the 11–14 years age group, there is a
overlap and add (PSOLA) algorithm. On the other hand, pitch- fast saturation in the WER profile with an increase in scaling fac-
synchronous time-scaling (PSTS) was employed for speaking-rate tor. These experimental results highlight that SRN is more effective
normalization in [23]. On connected digit recognition task, explicit for younger children.
speaking-rate normalization was reported to result in 9% relative Next, we studied the effect of combining proposed features
improvement over the baseline performance [23]. For the contin- with SRN. The WER profiles for those are shown in Fig. 8. The
S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151 149

Fig. 6. Time-domain waveforms and spectrograms illustrating the effect of increasing or decreasing the speaking-rate using the ZFF-GCI-based approach.

Fig. 7. WERs illustrating the effect of speaking-rate normalization on the recognition


of children’s test sets with respect to GMM-HMM system trained on adult’s speech
employing MFCC features.

plotted WERs are with respect to the CHSet 5. The scaling fac-
tor for this study was varied from 1.25 to 1.55 in steps of 0.05.
SRN results in additive improvements for both the MFCC as well as
proposed features. Further, the WERs are lower in the case of pro-
posed TS-MFCC features. Finally, we concatenated VTLN and fMLLR
with SRN to further improve the performance. The WERs for this
study are enlisted in Table 4. The three approaches are observed
to be additive. Moreover, the proposed TS-MFCC features are noted Fig. 8. WERs for the children’s test set (CHSet 5) with respect to GMM-HMM and
to be superior in this case well. The best case WER is presented in DNN-HMM systems trained on adult’s speech illustrating the effect of speaking-rate
bold to highlight this fact. normalization.

4. Conclusion quently, the derived acoustic features are more robust towards
pitch variations compared to the conventional MFCC features. The
A novel front-end speech parameterization technique for ASR is same has been validated experimentally in this paper. Furthermore,
presented in this paper. The proposed feature extraction approach the reported experimental evaluations have been performed on
includes a pitch-adaptive spectral estimation module prior to com- several different test sets comprising of speech data from speak-
puting the acoustic features. The use of pitch-adaptive signal pro- ers belonging to different age groups. Breaking up speech data
cessing helps reduce the ill-effects of signal periodicity. Conse- in different age groups helps in better understanding the age-
150 S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151

Table 4 [18] A. Potaminaos, S. Narayanan, Robust recognition of children speech, IEEE Trans.
WERs for the children’s speech test set (CHSet 5) with respect to adult data trained Speech Audio Process. 11 (6) (2003) 603–616.
ASR systems demonstrating the effect of concatenating VTLN and fMLLR with SRN [19] S. Lee, A. Potamianos, S.S. Narayanan, Acoustics of children’s speech: devel-
on MFCC as well as proposed features. opmental changes of temporal and spectral parameters, J. Acoust. Soc. Am.
105 (3) (1999) 1455–1468.
Acoustic Feature WER (in %) Relative
[20] M. Gerosa, D. Giuliani, S. Narayanan, A. Potamianos, A review of ASR tech-
model kind VTLN+fMLLR + SRN improv. (%)
nologies for children’s speech, in: Proc. Workshop on Child, Computer and
GMM MFCC 23.01 13.54 41.2 Interaction, 2009, pp. 7:1–7:8.
TS-MFCC 20.72 11.22 45.8 [21] H. Liao, G. Pundak, O. Siohan, M.K. Carroll, N. Coccaro, Q. Jiang, T.N. Sainath,
A.W. Senior, F. Beaufays, M. Bacchiani, Large vocabulary automatic speech
DNN MFCC 15.00 11.28 24.8 recognition for children, in: Proc. INTERSPEECH, 2015, pp. 1611–1615.
TS-MFCC 12.48 9.78 21.6 [22] S. Shahnawazuddin, N. Adiga, H.K. Kathania, Effect of prosody modification on
children’s ASR, IEEE Signal Process. Lett. 24 (11) (2017) 1749–1753.
[23] S. Ghai, Addressing Pitch Mismatch for Children’s Automatic Speech Recogni-
dependence of the pitch-induced acoustic mismatch. For the task tion, Ph.D. thesis, Department of EEE, Indian Institute of Technology Guwahati,
of decoding children’s speech test sets using adult data trained India, October 2011.
[24] H.K. Kathania, S. Shahnawazuddin, R. Sinha, Exploring HLDA based transforma-
DNN-HMM system, relative improvements of 13–17% are obtained
tion for reducing acoustic mismatch in context of children speech recognition,
when the proposed TS-MFCC features are used instead of conven- in: Proc. International Conference on Signal Processing and Communications,
tional MFCC features. In addition to that, the proposed features are 2014, pp. 1–5.
subjected to linear frequency warping for reducing the ill-effects [25] J.L. Miller, L.E. Volaitis, Effect of speaking rate on the perceptual structure of a
phonetic category, Percept. Psychophys. 46 (6) (1989) 505–512.
of formant scaling. The inclusion of VTLN leads to further im-
[26] M.A. Siegler, R.M. Stern, On the effects of speech rate in large vocabulary
provements in recognition performance. We have also studied the speech recognition systems, in: Proc. ICASSP, vol. 1, 1995, pp. 612–615.
effect of speaking-rate normalization on children’s speech in this [27] N. Mirghafori, E. Fosler, N. Morgan, Towards robustness to fast speech in ASR,
work. When SRN is performed, the proposed features result in fur- in: Proc. ICASSP, vol. 1, 1996, pp. 335–338.
ther reductions in WER. After combining VTLN, fMLLR, and SRN, [28] N. Morgan, E. Fosler, N. Mirghafori, Speech recognition using on-line estimation
of speaking rate, in: Proc. EUROSPEECH, 1997, pp. 2079–2082.
the use of proposed features in place of MFCC features for train-
[29] G. Stemmer, C. Hacker, S. Steidl, E. Nöth, Acoustic normalization of children’s
ing DNN-HMM systems results in a relative improvement of 21.6% speech, in: Proc. INTERSPEECH, 2003, pp. 1313–1316.
with respect to the children’s speech test set. [30] H. Kawahara, M. Morise, Technical foundations of TANDEM-STRAIGHTs, a
speech analysis, modification and synthesis framework, Sadhana 36 (5) (2011)
References 713–727.
[31] S. Prasanna, D. Govind, K.S. Rao, B. Yegnanarayana, Fast prosody modification
using instants of significant excitation, in: Int. Conf. on Speech Prosody, 2010.
[1] L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Inc.,
[32] J. Makhoul, Linear prediction: a tutorial review, Proc. IEEE 63 (4) (1975)
Upper Saddle River, NJ, USA, 1993.
561–580.
[2] G.E. Hinton, L. Deng, D. Yu, G. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Van-
[33] A.V. Oppenheim, Speech analysis-synthesis system based on homomorphic fil-
houcke, P. Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic
tering, J. Acoust. Soc. Am. 45 (2) (1969) 458–465.
modeling in speech recognition, Signal Process. Mag. 29 (6) (2012) 82–97.
[34] R.J. McAulay, T.F. Quatieri, Speech analysis/synthesis based on a sinusoidal rep-
[3] G. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural
resentation, IEEE Trans. Acoust. Speech Signal Process. 34 (4) (1986) 744–754.
networks for large vocabulary speech recognition, IEEE Trans. Speech Audio
[35] H. Zen, T. Toda, M. Nakamura, K. Tokuda, Details of the Nitech HMM-based
Process. 20 (1) (2012) 30–42.
speech synthesis system for the blizzard challenge 2005, IEICE Trans. Inf. Syst.
[4] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kam-
E90-D (1) (2007) 325–333.
var, B. Strope, Your word is my command: Google search by voice: A case
[36] T. Toda, H. Saruwatari, K. Shikano, Voice conversion algorithm based on Gaus-
study, in: Advances in Speech Recognition: Mobile Environments, Call Centers
sian mixture model with dynamic frequency warping of STRAIGHT spectrum,
and Clinics, 2010, pp. 61–90, Ch. 4.
[5] A. Hagen, B. Pellom, R. Cole, Children’s speech recognition with application to in: Proc. ICASSP, vol. 2, 2001, pp. 841–844.
interactive books and tutors, in: Proc. ASRU, 2003, pp. 186–191. [37] G. Garau, S. Renals, Combining spectral representations for large-vocabulary
[6] A. Hagen, B. Pellom, R. Cole, Highly accurate children’s speech recognition continuous speech recognition, IEEE Trans. Speech Audio Process. 16 (3) (2008)
for interactive reading tutors using subword units, Speech Commun. 49 (12) 508–518.
(2007) 861–873. [38] G. Garau, S. Renals, Pitch adaptive features for LVCSR, in: Proc. INTERSPEECH,
[7] S. Davis, P. Mermelstein, Comparison of parametric representations for mono- 2008, pp. 2402–2405.
syllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. [39] T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: a British English
Speech Signal Process. 28 (4) (1980) 357–366, https://doi.org/10.1109/TASSP. speech corpus for large vocabulary continuous speech recognition, in: Proc.
1980.1163420. ICASSP, vol. 1, 1995, pp. 81–84.
[8] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, J. Acoust. [40] A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker,
Soc. Am. 57 (4) (1990) 1738–1752. M. Russell, M. Wong, The PF_STAR children’s speech corpus, in: Proc. INTER-
[9] S.P. Rath, D. Povey, K. Veselý, J. Černocký, Improved feature processing for deep SPEECH, 2005, pp. 2761–2764.
neural networks, in: Proc. INTERSPEECH, 2013. [41] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hanne-
[10] V. Digalakis, D. Rtischev, L. Neumeyer, Speaker adaptation using constrained mann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The
estimation of Gaussian mixtures, IEEE Trans. Speech Audio Process. 3 (1995) Kaldi speech recognition toolkit, in: Proc. ASRU, 2011.
357–366.
[11] L. Lee, R. Rose, A frequency warping approach to speaker normalization, IEEE
Trans. Speech Audio Process. 6 (1) (1998) 49–60. Syed Shahnawazuddin received his B.E. degree in Electronics and
[12] S. Ghai, R. Sinha, Exploring the role of spectral smoothing in context of chil- Communication Engineering from Visvesvaraya Technological University,
dren’s speech recognition, in: Proc. INTERSPEECH, 2009, pp. 1607–1610. Karnataka, India in 2008. He then obtained his Ph.D. degree from the
[13] R. Sinha, S. Shahnawazuddin, Assessment of pitch-adaptive front-end signal Department of Electronics and Electrical Engineering, Indian Institute of
processing for children’s speech recognition, Comput. Speech Lang. 48 (2018)
Technology Guwahati, in 2016. He is currently working as Assistant pro-
103–121.
fessor in the Department of Electronics and Communication Engineering
[14] W.M. Fisher, G.R. Doddington, K.M. Goudie-Marshall, The DARPA speech recog-
nition research database: specifications and status, in: Proc. DARPA Workshop at National Institute of Technology Patna, India. His research interests are
on Speech Recognition, 1986, pp. 93–99. speech signal processing, speech recognition, keyword spotting, speaker
[15] S. Ghai, R. Sinha, A study on the effect of pitch on LPCC and PLPC fea- recognition and machine learning.
tures for children’s ASR in comparison to MFCC, in: Proc. INTERSPEECH, 2011,
pp. 2589–2592.
Nagaraj Adiga received his B.E. degree in Electronics and Communi-
[16] S. Ghai, R. Sinha, Analyzing pitch robustness of PMVDR and MFCC features for
children’s speech recognition, in: Proc. Signal Processing and Communications cation Engineering from University Visvesvaraya College of Engineering,
(SPCOM), 2010. Bengaluru, India, in 2008. He was a Software Engineer in Alcatel-Lucent
[17] M. Russell, S. D’Arcy, Challenges for computer recognition of children’s speech, India Private Limited, Bengaluru, India, from 2008 to 2011, mainly focus-
in: Proc. Speech and Language Technologies in Education (SLaTE), 2007. ing on next-generation high-leverage optical transport networks. He then
S. Shahnawazuddin et al. / Digital Signal Processing 79 (2018) 142–151 151

obtained his Ph.D. degree from the Department of Electronics and Elec- Gayadhar Pradhan received his M.Tech. and Ph.D. degrees in Electron-
trical Engineering, Indian Institute of Technology Guwahati, in 2017. He ics and Electrical Engineering from Indian Institute of Technology Guwa-
is currently pursuing PostDoc in the Department of Computer science, hati, India, in 2009 and 2013, respectively. He is currently working as
University of Crete, Greece. His research interests are speech processing, Assistant professor in the Department of Electronics and Communication
speech synthesis, voice conversion, speech recognition, voice pathology Engineering at National Institute of Technology Patna, India. His research
and machine learning. interests are speech signal processing, speaker recognition and speech
recognition.

Hemant Kumar Kathania received his B.E. degree in Electronics and Rohit Sinha received the M.Tech. and Ph.D. degrees in Electrical Engi-
Communication Engineering from the University of Rajasthan, Jaipur, India, neering from Indian Institute of Technology Kanpur, in 1999 and 2005, re-
in 2008, and M.Tech. degree in Electronics and Electrical Engineering from spectively. From 2004 to 2006, he was a Post-Doctoral Researcher with the
the Indian Institute of Technology Guwahati, India, in 2012. Presently he Machine Intelligence Laboratory, Cambridge University, Cambridge, U.K.
is working as an Assistant Professor in the Department of Electronics and Since 2006, he has been with IIT Guwahati, India, where he is currently
Communication Engineering, National Institute of Technology (NIT) Sikkim, a Full Professor with the Department of Electronics and Electrical Engi-
India and also pursuing his Ph.D form NIT Sikkim. His current research in- neering. His research interests include speaker normalization/adaptation
terests include speech signal processing, speech recognition and machine in context of automatic speech recognition, speaker verification, noise ro-
learning. bust speech processing, and audio segmentation.

You might also like