Guest Editorial:
Special Issue on Advances in Deep Learning Based Speech Processing

Xiao-Lei Zhang (a), Lei Xie (b), Eric Fosler-Lussier (c), Emmanuel Vincent (d)

(a) School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China
(b) Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China
(c) Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
(d) Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France

Email addresses: [email protected] (Xiao-Lei Zhang), [email protected] (Lei Xie), [email protected] (Eric Fosler-Lussier), [email protected] (Emmanuel Vincent)
Deep learning has triggered a revolution in speech processing. The revolution started with the successful application of deep neural networks to automatic speech recognition, and quickly spread to other speech processing topics, including speech analysis, speech enhancement and separation, speaker and language recognition, speech synthesis, and spoken language understanding. This tremendous success has been achieved through the long-term evolution of neural network technologies, as well as the explosion of speech data and the fast development of computing power.

Despite this success, deep learning based speech processing still faces many challenges for real-world deployment. For example, when the distance between a speaker and a microphone array is larger than 10 meters, the word error rate of a speech recognizer may exceed 50%. End-to-end deep learning based speech processing systems have shown potential advantages over hybrid systems, yet they still require large-scale labelled speech data. Deep learning based speech synthesis has become highly competitive with traditional methods at producing human-sounding speech; however, the models are not stable, lack controllability, and are still too large and slow to be deployed on mobile and IoT devices.

Accordingly, new theoretical methods in deep learning and speech processing are required to tackle these challenges, as well as to yield novel insights into new directions and problems.

This Special Issue provides a collection of state-of-the-art research works on recent advances in deep learning based speech processing, presenting novel contributions that address theoretical and practical aspects of deep learning related speech processing techniques. After a rigorous review of the 66 high-quality articles that were submitted, 26 articles were selected for inclusion in this Special Issue. They cover major topics of speech processing, including speech enhancement and separation (8 articles), speech recognition (5 articles), speaker and language recognition (5 articles), speech synthesis (3 articles), and speech emotion recognition (2 articles). Three articles on general topics of speech processing are included as well. A brief summary of these articles is provided herein.

Speech enhancement and separation aims to recover clean speech from noisy speech that may include various kinds of interference signals. It is conventionally formulated as a signal processing problem. Recently, deep learning has made breakthroughs in speech enhancement and separation, particularly in adverse acoustic environments, and has quickly become the new research paradigm. However, many challenges remain, even in the deep learning era. A creative contribution was made by Zhang et al. [1] to the classic Active Noise Control (ANC) problem. They formulated ANC as a supervised learning problem and proposed a deep learning approach, called deep ANC. Unlike other deep learning techniques for speech enhancement, which use deep networks to predict waveforms of speech or their alternatives, they employed deep learning to encode the optimal control parameters corresponding to different noises and environments, thereby making deep learning applicable to ANC.
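To make this supervised-learning view of ANC concrete, the following Python sketch trains a hypothetical network to map reference-noise frames to an anti-noise signal; the architecture, loss, and all names here are illustrative assumptions, not details of the actual model in [1].

# Minimal sketch of ANC as supervised learning (illustrative; not the
# exact deep ANC model of [1]). A network maps reference-noise frames
# to an anti-noise signal whose superposition with the primary noise
# should approach silence at the error microphone.
import torch
import torch.nn as nn

class AncNet(nn.Module):
    """Hypothetical frame-in, frame-out controller network."""
    def __init__(self, frame_len=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, frame_len),  # predicted anti-noise frame
        )

    def forward(self, ref_noise):
        return self.net(ref_noise)

model = AncNet()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy supervised target: the ideal anti-noise is assumed available
# during training (e.g., simulated through the acoustic paths).
ref = torch.randn(32, 256)       # reference-microphone frames
ideal_anti_noise = -ref          # placeholder target for this sketch

optim.zero_grad()
loss = nn.functional.mse_loss(model(ref), ideal_anti_noise)
loss.backward()
optim.step()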
Monaural speech enhancement and separation is a historically difficult topic, particularly in the presence of reverberation. A good deep architecture that incorporates sufficient prior knowledge is important for real-world applications. In this Special Issue, several advanced deep architectures were proposed, and new state-of-the-art performance was reported. Xian et al. [2] proposed a convolutional fusion network for monaural speech enhancement, which fully exploits cross-band information. Chen et al. [3] proposed dual-stream deep attractor networks with multi-domain learning to efficiently perform both dereverberation and separation. Li et al. [4] proposed to generate speech spectra for speech separation via a new type of adversarial training framework, named the µ-law spectrum generative adversarial network. Borgström et al. [5] proposed an end-to-end neural network architecture based on self-attention for hearing aids. Huang et al. [6] proposed a lightweight speaker extraction model, named TinyWASE, obtained by compressing speaker extraction models with ultra-low precision quantization and knowledge distillation, which is able to run on resource-constrained devices.
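This compression recipe can be illustrated by a minimal Python sketch, assuming a uniform fake-quantizer with a straight-through gradient and simple output-level distillation; it shows the general idea rather than the actual TinyWASE design of [6].

# Minimal sketch of low-precision quantization plus knowledge
# distillation (illustrative; TinyWASE [6] differs in architecture
# and training details).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, bits=2):
    """Uniform fake-quantization with a straight-through gradient."""
    scale = w.abs().max().clamp(min=1e-8)
    levels = 2 ** (bits - 1) - 1
    q = torch.round(w / scale * levels) / levels * scale
    return w + (q - w).detach()  # forward: quantized, backward: identity

class QuantLinear(nn.Linear):
    def forward(self, x):
        return F.linear(x, fake_quantize(self.weight), self.bias)

teacher = nn.Linear(128, 128)    # stands in for the full-precision model
student = QuantLinear(128, 128)  # low-precision student
optim = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(64, 128)         # a batch of input features
with torch.no_grad():
    teacher_out = teacher(x)     # distillation target
optim.zero_grad()
loss = F.mse_loss(student(x), teacher_out)
loss.backward()
optim.step()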
Multichannel or multimodal speech enhancement and separation extends the monaural case by incorporating additional information sources to further improve performance. In this Special Issue, two such works are included. Li et al. [7] proposed a novel dual-channel deep-neural-network-based Generalized Sidelobe Canceller (GSC) structure, called nnGSC. The core idea is to make each module of the traditional GSC fully learnable, and to use an acoustic model to jointly optimize speech recognition with the GSC. Chen et al. [8] proposed a visual embedding approach that improves embedding-aware speech enhancement by synchronizing visual lip frames at the phone and place-of-articulation levels.

Automatic Speech Recognition (ASR) is the task of transcribing speech into text. It was the first big breakthrough of deep learning in applications. However, low-resource ASR remains challenging in real-world applications. This Special Issue contains several works on this topic with specific applications. Iranzo-Sánchez et al. [9] presented a state-of-the-art streaming speech translation system, in which the neural models integrated in the ASR and machine translation components are carefully adapted in terms of their training and decoding procedures. They addressed the low-resource and streaming problems of ASR simultaneously. Yang et al. [10] proposed an unsupervised pre-training approach that jointly utilizes speech data from two languages (the learner's native and target languages) to address the data sparsity problem of non-native mispronunciation recognition. Zhao et al. [11] proposed an end-to-end keyword spotting system that introduces an attention mechanism and a novel energy scorer to make decisions using the locations of the keywords. Their experimental results under four low-resource conditions demonstrate the effectiveness of the system. Liu et al. [12] proposed incremental training with a revised loss function, data augmentation, and fine-grained training, which improves the accuracy for low-resource or even unseen user-defined keywords while maintaining high accuracy for pre-defined keywords.

Besides the low-resource problem, how to train an ASR system efficiently on a large amount of data is also critically important. Haider et al. [13] presented a novel Natural Gradient and Hessian-Free (NGHF) optimisation framework for neural network training that can operate efficiently in a distributed manner. Their experiments show that NGHF not only achieves larger word error rate reductions than standard stochastic gradient descent or Adam, but also requires orders of magnitude fewer parameter updates.

Speaker recognition is the task of recognizing the identity of a speaker. Deep learning based speaker recognition dominates the topic at present. In this Special Issue, an overview article of deep learning based speaker recognition was included [14]. It summarizes the subtasks of speaker recognition, including speaker identification, speaker verification, speaker diarization, and robust speaker recognition. Regarding the research paradigms of speaker recognition, it covers DNN/i-vector, x-vector, stage-wise diarization, and end-to-end diarization. Many acoustic features and datasets are summarized as well.

From this overview, one sees that different parts of speech contribute unequally to the voiceprint of a speaker; attention mechanisms are therefore helpful. In this respect, Miao et al. [15] improved the conventional attention mechanism with a novel mixed-order attention for low-level frame features, together with a non-local attention mechanism and a dilated residual structure that balance fine-grained local information and multi-scale long-range information, achieving a much wider context than purely local attention. Shi et al. [16] proposed a frame-level encoder with attention over segments of speech, together with a segment-level attention, to construct an utterance representation.
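The two-level pooling idea of [16] can be sketched as stacked attentive-pooling layers; the Python code below is a minimal illustration under assumed dimensions and a simple linear scoring function, not the authors' exact architecture.

# Minimal sketch of frame-level then segment-level attention pooling
# (illustrative of the hierarchical idea in [16]).
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):              # x: (..., steps, dim)
        w = torch.softmax(self.score(x), dim=-2)
        return (w * x).sum(dim=-2)     # attention-weighted average

dim = 64
frame_pool = AttentivePool(dim)        # pools frames within a segment
segment_pool = AttentivePool(dim)      # pools segment embeddings

frames = torch.randn(4, 50, dim)       # 4 segments of 50 frames each
segments = frame_pool(frames)          # (4, dim): one vector per segment
utterance = segment_pool(segments)     # (dim,): utterance-level embedding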
Given that, empirically, no single acoustic feature or model is best across all test scenarios, another route to improving speaker recognition performance is to combine complementary acoustic features or models. Sun et al. [17] proposed a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different neural network components, including 2-dimensional self-attentive, gated additive, and bilinear pooling structures. Language recognition shares quite a similar research trend with speaker recognition. Li et al. [18] investigated the efficiency of integrating multiple acoustic features for language recognition, and further explored two kinds of training constraints to integrate the features: one option introduces auxiliary classification constraints with adaptive loss weights in the feature encoder sub-networks, and the other introduces a canonical correlation analysis constraint to maximize the correlation between the different feature representations.
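A minimal Python sketch of such embedding combination follows, assuming a simple concatenate-and-project fusion; this is only one of several combination structures considered in [17], and all dimensions are illustrative.

# Minimal sketch of fusing complementary speaker embeddings
# ("d-vectors") into a single combined vector (illustrative of the
# general combination idea in [17]).
import torch
import torch.nn as nn

class EmbeddingFusion(nn.Module):
    def __init__(self, dims, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(sum(dims), out_dim)

    def forward(self, embeddings):     # list of (batch, dim_i) tensors
        return self.proj(torch.cat(embeddings, dim=-1))

fusion = EmbeddingFusion(dims=[256, 512, 128])
d_vectors = [torch.randn(8, d) for d in (256, 512, 128)]
c_vector = fusion(d_vectors)           # (8, 256) combined embedding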
Speech synthesis, a.k.a. Text-To-Speech (TTS), is the task of generating speech from text. A challenging problem is how to generate speech both efficiently and vividly. Liu et al. [19] proposed a TTS model, named FastTalker, for high-quality speech synthesis at low computational cost. The core idea is to use a non-autoregressive context decoder to generate acoustic features efficiently, and then add a shallow autoregressive acoustic decoder on top of it to recover the temporal information of the acoustic signal. Dahmani et al. [20] first presented an expressive audiovisual corpus, and then proposed to learn an emotional latent representation with a conditional variational auto-encoder for text-driven expressive audiovisual speech synthesis. Nallanthighal et al. [21] emphasized the importance of respiration in voice for TTS by exploring deep learning architectures for sensing the breathing signal and respiratory parameters from speech. The conclusions of their study may also help us understand respiratory health from one's speech.
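The two-stage decoding idea described for FastTalker can be sketched as follows in Python; the modules, sizes, and recurrence are illustrative assumptions, not the architecture of [19].

# Minimal sketch of a non-autoregressive coarse decoder followed by a
# shallow autoregressive refiner (illustrative of the idea in [19]).
import torch
import torch.nn as nn

class NarContextDecoder(nn.Module):
    """Parallel (non-autoregressive) coarse feature generator."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, dim))

    def forward(self, text_hidden):        # (batch, frames, dim)
        return self.net(text_hidden)       # all frames at once

class ShallowArDecoder(nn.Module):
    """One-layer recurrent refiner conditioned on the coarse features."""
    def __init__(self, dim=80):
        super().__init__()
        self.rnn = nn.GRU(dim * 2, dim, batch_first=True)

    def forward(self, coarse):             # (batch, frames, dim)
        outputs, prev, state = [], torch.zeros_like(coarse[:, :1]), None
        for t in range(coarse.size(1)):    # frame-by-frame refinement
            inp = torch.cat([coarse[:, t:t + 1], prev], dim=-1)
            prev, state = self.rnn(inp, state)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)

text_hidden = torch.randn(2, 100, 80)      # upsampled text encodings
coarse = NarContextDecoder()(text_hidden)  # fast parallel pass
refined = ShallowArDecoder()(coarse)       # shallow autoregressive pass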
Speech emotion recognition aims to identify the feeling of a speaker from his/her voice. It can be categorized into discrete speech emotion recognition, where the emotion states are discrete values, and continuous dimensional speech emotion recognition, where the emotion states lie in a continuous space. Regarding discrete speech emotion recognition, Zhao et al. [22] presented an efficient self-attention residual dilated network incorporating the Connectionist Temporal Classification (CTC) loss to address the challenging issue of modelling long-term temporal dependencies of emotional speech.
Regarding continuous dimensional speech emotion recognition, Peng et al. [23] investigated multi-resolution representations of an auditory perception model, and proposed a novel feature, called the multi-resolution modulation-filtered cochleagram, for predicting the valence and arousal values of emotional primitives.

Some new concepts that fall outside the above topics are also accommodated in this large Special Issue. For example, Guizzo et al. [24] proposed a novel anti-transfer learning strategy to make a deep model focus on its target task: anti-transfer learning avoids learning representations that have been learned for an orthogonal task, i.e., one that is not relevant and potentially confounding for the target task, such as speaker identity for speech recognition or speech content for emotion recognition. This extends the potential uses of pre-trained models, which have become increasingly available.
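The anti-transfer idea can be expressed as an auxiliary penalty; the Python sketch below assumes a frozen encoder pre-trained on the orthogonal task and a cosine-similarity penalty, which is an illustrative rendering of the strategy in [24] rather than its exact formulation.

# Minimal sketch of an anti-transfer penalty: discourage features that
# resemble those of a frozen encoder trained on an orthogonal task.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(40, 128)             # encoder being trained
classifier = nn.Linear(128, 10)          # target-task head
orthogonal_encoder = nn.Linear(40, 128)  # frozen, pre-trained elsewhere
for p in orthogonal_encoder.parameters():
    p.requires_grad_(False)

x = torch.randn(16, 40)                  # a batch of input features
y = torch.randint(0, 10, (16,))          # target-task labels
lam = 0.1                                # anti-transfer weight (assumed)

feats = encoder(x)
task_loss = F.cross_entropy(classifier(feats), y)
# Penalize cosine similarity to the orthogonal task's representations.
anti = F.cosine_similarity(feats, orthogonal_encoder(x), dim=-1).abs().mean()
(task_loss + lam * anti).backward()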
Some interesting applications of speech processing technologies were included as well. Beguš [25] proposed two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs, named ciwGAN (Categorical InfoWave Generative Adversarial Network) and fiwGAN (Featural InfoWaveGAN), which combine the deep convolutional GAN architecture for audio data with the information-theoretic extension of GANs, and proposed a new latent space structure that can model featural learning simultaneously with higher-level classification. Gupta et al. [26] proposed a novel Residual Network (ResNet)-based technique that takes short-duration speech segments as input to improve the performance and applicability of detecting impaired speech.

The Guest Editors would like to thank the authors for their high-quality contributions, and the Editors-in-Chief for their support throughout the process and realization of this Special Issue. The Guest Editors are also grateful to all the reviewers who helped ensure the quality of the articles included in this issue.

References

[1] H. Zhang, D. Wang, Deep ANC: A deep learning approach to active noise control, Neural Networks 141 (2021) 1-10.
[2] Y. Xian, Y. Sun, W. Wang, S. M. Naqvi, Convolutional fusion network for monaural speech enhancement, Neural Networks 143 (2021) 97-107.
[3] H. Chen, P. Zhang, A dual-stream deep attractor network with multi-domain learning for speech dereverberation and separation, Neural Networks 141 (2021) 238-248.
[4] H. Li, Y. Xu, D. Ke, K. Su, µ-law SGAN for generating spectra with more details in speech enhancement, Neural Networks 136 (2021) 17-27.
[5] B. J. Borgström, M. S. Brandstein, G. A. Ciccarelli, T. F. Quatieri, C. J. Smalt, Speaker separation in realistic noise environments with applications to a cognitively-controlled hearing aid, Neural Networks 140 (2021) 136-147.
[6] Y. Huang, Y. Hao, J. Xu, B. Xu, Compressing speaker extraction model with ultra-low precision quantization and knowledge distillation, Neural Networks 154 (2022) 13-21.
[7] G. Li, S. Liang, S. Nie, W. Liu, Z. Yang, Deep neural network-based generalized sidelobe canceller for dual-channel far-field speech recognition, Neural Networks 141 (2021) 225-237.
[8] H. Chen, J. Du, Y. Hu, L.-R. Dai, B.-C. Yin, C.-H. Lee, Correlating subword articulation with lip shapes for embedding aware audio-visual speech enhancement, Neural Networks 143 (2021) 171-182.
[9] J. Iranzo-Sánchez, J. Jorge, P. Baquero-Arnal, J. A. Silvestre-Cerdà, A. Giménez, J. Civera, A. Sanchis, A. Juan, Streaming cascade-based speech translation leveraged by a direct segmentation model, Neural Networks 142 (2021) 303-315.
[10] L. Yang, K. Fu, J. Zhang, T. Shinozaki, Non-native acoustic modeling for mispronunciation verification based on language adversarial representation learning, Neural Networks 142 (2021) 597-607.
[11] Z. Zhao, W.-Q. Zhang, End-to-end keyword search system based on attention mechanism and energy scorer for low resource languages, Neural Networks 139 (2021) 326-334.
[12] L. Liu, M. Yang, X. Gao, Q. Liu, Z. Yuan, J. Zhou, Keyword spotting techniques to improve the recognition accuracy of user-defined keywords, Neural Networks 139 (2021) 237-245.
[13] A. Haider, C. Zhang, F. L. Kreyssig, P. C. Woodland, A distributed optimisation framework combining natural gradient with Hessian-free for discriminative sequence training, Neural Networks 143 (2021) 537-549.
[14] Z. Bai, X.-L. Zhang, Speaker recognition based on deep learning: An overview, Neural Networks 140 (2021) 65-99.
[15] X. Miao, I. McLoughlin, W. Wang, P. Zhang, D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition, Neural Networks 139 (2021) 201-211.
[16] Y. Shi, Q. Huang, T. Hain, H-vectors: Improving the robustness in utterance-level speaker embeddings using a hierarchical attention model, Neural Networks 142 (2021) 329-339.
[17] G. Sun, C. Zhang, P. C. Woodland, Combination of deep speaker embeddings for diarisation, Neural Networks 141 (2021) 372-384.
[18] L. Li, Z. Li, Y. Liu, Q. Hong, Deep joint learning for language recognition, Neural Networks 141 (2021) 72-86.
[19] R. Liu, B. Sisman, Y. Lin, H. Li, FastTalker: A neural text-to-speech architecture with shallow and group autoregression, Neural Networks 141 (2021) 306-314.
[20] S. Dahmani, V. Colotte, V. Girard, S. Ouni, Learning emotions latent representation with CVAE for text-driven expressive audiovisual speech synthesis, Neural Networks 141 (2021) 315-329.
[21] V. S. Nallanthighal, Z. Mostaani, A. Härmä, H. Strik, M. Magimai-Doss, Deep learning architectures for estimating breathing signal and respiratory parameters from speech recordings, Neural Networks 141 (2021) 211-224.
[22] Z. Zhao, Q. Li, Z. Zhang, N. Cummins, H. Wang, J. Tao, B. W. Schuller, Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition, Neural Networks 141 (2021) 52-60.
[23] Z. Peng, J. Dang, M. Unoki, M. Akagi, Multi-resolution modulation-filtered cochleagram feature for LSTM-based dimensional emotion recognition from speech, Neural Networks 140 (2021) 261-273.
[24] E. Guizzo, T. Weyde, G. Tarroni, Anti-transfer learning for task invariance in convolutional neural networks for speech processing, Neural Networks 142 (2021) 238-251.
[25] G. Beguš, CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with generative adversarial networks, Neural Networks 139 (2021) 305-325.
[26] S. Gupta, A. T. Patil, M. Purohit, M. Parmar, M. Patel, H. A. Patil, R. C. Guido, Residual neural network precisely quantifies dysarthria severity-level based on short-duration speech segments, Neural Networks 139 (2021) 105-117.