Nonlinear Speech Synthesis

Stephen McLaughlin
Signals and Systems Group,
Department of Electronics and Electrical Engineering,
University of Edinburgh, The King's Buildings,
Edinburgh, EH9 3JL, Scotland, UK
Email: sml@ee.ed.ac.uk
Tel: (+44)-131-650-5578, Fax: (+44)-131-650-6554
Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
erates on the speech signal in a different manner. The pitch synchronous overlap add (PSOLA) [10] approach is non-parametric, as opposed to the harmonic method, which actually decomposes the signal into explicit source and vocal tract models. PSOLA is reported to give good quality, natural-sounding synthetic speech for moderate pitch and time modifications. Slowing down the speech by a large factor (greater than two) does introduce artifacts due to the repetition of PSOLA bells. Some tonal artifacts (e.g. whistling) also appear with large pitch scaling, especially for higher pitch voices, such as female speakers and children.

McAulay and Quatieri developed a speech generation model that is based on a glottal excitation signal made up of a sum of sine waves [11]. They then used this model to perform time-scale and pitch modification. Starting with the assumption made in the linear model of speech that the speech waveform x(t) is the output generated by passing an excitation waveform e(t) through a linear filter h(t), the excitation is defined as a sum of sine waves of arbitrary amplitudes, frequencies and phases. A limitation of all these techniques is that they use the linear model of speech as a basis.

3 Nonlinearities in speech

There are known to be a number of nonlinear effects in the speech production process. Firstly, it has been accepted for some time that the vocal tract and the vocal folds do not function independently of each other, but that there is in fact some form of coupling between them when the glottis is open [12], resulting in significant changes in formant characteristics between open and closed glottis cycles [13]. More controversially, Teager and Teager [14] have claimed (based on physical measurements) that voiced sounds are characterised by highly complex air flows in the vocal tract involving jets and vortices, rather than well behaved laminar flow. In addition, the vocal folds will themselves be responsible for further nonlinear behaviour, since the muscle and cartilage which comprise the larynx have nonlinear stretching qualities. Such nonlinearities are routinely included in attempts to model the physical process of vocal fold vibration, which have focussed on two or more mass models [2, 3, 15], in which the movement of the vocal folds is modelled by masses connected by springs, with nonlinear coupling. Observations of the glottal waveform have shown that this waveform can change shape at different amplitudes [16], which would not be possible in a strictly linear system, where the waveform shape is unaffected by amplitude changes.

In order to arrive at the simplified linear model, a number of major assumptions are made:

• the vocal tract and speech source are uncoupled (thus allowing source-filter separation);

• airflow through the vocal tract is laminar;

• the vocal folds vibrate in an exactly periodic manner during voiced speech production;

• the configuration of the vocal tract will only change slowly.

These imply a loss of information, which means that the full speech signal dynamics can never be properly captured. These inadequacies can be seen in practice in speech synthesis where, at the waveform generation level, current systems tend to produce an output signal that lacks naturalness. This is true even of concatenation techniques, which copy and modify actual speech segments.

4 Poincare maps and epoch marking

This section discusses how nonlinear techniques can be applied to pitch marking of continuous speech. We wish to locate the instants in the time domain speech signal at which the glottis is closed. A variety of existing methods can be employed to locate the epochs: abrupt change detection [17], maximum likelihood epoch detection [18] and dynamic programming [19]. All of the above techniques are sound and generally provide good epoch detection. The technique presented here should not be viewed as a direct competitor to the methods outlined above. Rather, it is an attempt to show the practical application of ideas from nonlinear dynamical theory to a real speech processing problem. Its performance in clean speech is comparable to many of the techniques discussed above.

In nonlinear processing, a d-dimensional system can be reconstructed in an m-dimensional state space from a single-dimension time series by a process called embedding. Takens' theorem states that m ≥ 2d + 1 for an adequate reconstruction [20], although in practice it is often possible to reduce m. An alternative is the singular value decomposition (SVD) embedding [21], which may be more attractive in real systems where noise is an issue.

A Poincare map is often used in the analysis of dynamical systems. It replaces the flow of an n-th order continuous system with an (n − 1)-th order discrete time map. Considering a three-dimensional attractor, a Poincare section slices through the flow of trajectories, and the resulting crossings form the Poincare map. Re-examining the attractor reconstructions of voiced speech shown above, it is evident that these three-dimensional attractors can also be reduced to two-dimensional maps.¹ Additionally, these reconstructions are pitch synchronous, in that one revolution of the attractor is equivalent to one pitch period. This has previously been used for cyclostationary analysis and synchronisation [22]; here we examine its use for epoch marking.

¹ Strictly, these attractor reconstructions are discrete time maps and not continuous flows. However, it is possible to construct a flow vector between points and use this for the Poincare section calculation.
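As an illustration, the delay-embedding step can be sketched as follows. This is a minimal NumPy sketch; the function name and the toy signal are illustrative and not taken from the paper.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Time-delay embedding: row i is (x[i+(m-1)*tau], ..., x[i+tau], x[i]),
    an m-dimensional state-space point (Takens: m >= 2d + 1 suffices)."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[(m - 1 - j) * tau:(m - 1 - j) * tau + n]
                            for j in range(m)])

# Toy signal standing in for a frame of voiced speech (period ~80 samples).
t = np.arange(2000)
x = np.sin(2 * np.pi * t / 80) + 0.05 * np.sin(2 * np.pi * t / 13)
Y = delay_embed(x, m=3, tau=10)
print(Y.shape)  # (1980, 3)
```

An SVD embedding, as used later in the paper, would additionally project such delay vectors onto their principal directions to suppress noise.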
Figure 1: Results for the voiced section of "came along" from the Keele database for a female speaker. From top to bottom: the signal; the epochs as calculated by the algorithm; the laryngograph signal; the pitch contour (Hz) resulting from the algorithm.

Figure 2: Results for the voiced section of "raining" from the BT Labs database for a male speaker. From top to bottom: the signal; the epochs as calculated by the algorithm; the processed laryngograph signal; the pitch contour (Hz) resulting from the algorithm.

The basic processing steps required for a waveform of N points are as follows:

1. Mark y_GCI, a known GCI in the signal.

2. Perform an SVD embedding on the signal to generate the attractor reconstruction in 3D state space.

3. Calculate the flow vector, h, at the marked point y_GCI on the attractor.

4. Detect crossings of the Poincare section, Σ, at this point in state space by sign changes of the scalar product between h and the vector y_i − y_GCI for all 1 ≤ i ≤ N points.

5. Points on Σ which are within the same portion of the manifold as y_GCI are the epochs.
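The five processing steps above can be sketched as follows. This is an illustrative NumPy sketch: a simple distance threshold stands in for the "same portion of the manifold" criterion, and a synthetic limit cycle stands in for a voiced-speech attractor.

```python
import numpy as np

def poincare_epochs(Y, i_gci, radius):
    """Epoch marking via Poincare-section crossings.
    Y      : (N, 3) state-space reconstruction (e.g. from an SVD embedding)
    i_gci  : index of a known glottal closure instant (GCI) in Y
    radius : keep only crossings close to the marked point, a simple
             stand-in for the same-portion-of-manifold test."""
    y_gci = Y[i_gci]
    h = Y[i_gci + 1] - Y[i_gci]              # flow vector at the marked point
    s = (Y - y_gci) @ h                      # scalar product for every point
    crossings = np.where(np.diff(np.sign(s)) != 0)[0]   # sign changes
    near = np.linalg.norm(Y[crossings] - y_gci, axis=1) < radius
    return crossings[near]                   # epoch indices

# Demo on a synthetic limit cycle (period ~80.5 samples), GCI at sample 100.
t = np.arange(2000)
Y = np.column_stack([np.sin(2 * np.pi * (t - d) / 80.5) for d in (0, 10, 20)])
epochs = poincare_epochs(Y, i_gci=100, radius=0.5)
```

The detected epochs should then be spaced approximately one pitch period (here about 80 samples) apart.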
"Vhen dealing with real speech signals a number of prac input and global feedback in order to perform the map
tical issues have to be considered. The input signal must pmg
be treated on a frame-by-frame basis, within which the x(n) A(x(n - 1)) (1)
speech is assumed stationary. Finding the correct inter
=
section points on the Poincare section is also a difficult where x(n - 1) is the delay vector with non-unit delays,
task due to the complicated structure of the attractor. and A is the nonlinear mapping function [24].
Two different data sets were used to test the perform The initial approach taken [25] used a Kalman-based
ance of the algorithm, giving varying degrees of realistic RBF network, which has all of the network parameters
speech and hence difficulty. trained by the extended Kalman filter algorithm. The
only parameter that must be specified is the number
1. Keele "Cniversity pitch extraction database [23]. of centres to use. This gives good prediction results,
This database provides speech and laryngograph but there are many problems with resynthesis. In par
data from 15 speakers reading phonetically bal ticular, they report that extensive manual fine-tuning
anced sentences. of the parameters such as dimension, embedding delay
and number and initial positions of the centres are re
2. BT Labs continuous speech. 2 phrases, spoken by 4
quired. Even with this tuning, synthesis of some sounds
speakers, were processed manually to extract a data
with complicated phase space reconstructions does not
set of continuous voiced speech. Laryngograph data
work [24].
was also available.
In order to overcome this problem, Kubin resorted
The signals were up-sampled to 22.05 kHz, the BT data to a technique that uses all of the data points in the
was originally sampled at 12 kHz and the Keele signals training data frame as centres [24]. Although this gives
at 20 kHz. All the signals had 16 bit resolution. correct resynthesis, even allowing the resynthesis of con
Fig. 1 shows the performance of the algorithm on a tinuous speech using a frame-adaptive approach, it is
voiced section taken from the phrase "a traveller came unsatisfactory due to the very large number of varying
parameters, and cannot be seen as actually learning the dynamics of the speech generating system.

Following their dynamical analysis of the Japanese vowel /a/, Tokuda et al. constructed a feed-forward neural network to perform synthesis [26]. Their structure has three layers, with five neurons in the input layer, forty neurons in the hidden layer, and one in the output layer. The time delay in the input delay vector is set at τ = 3 and the weights are learnt by back propagation. Using global feedback, they report successful resynthesis of the Japanese vowel /a/. The signal is noisy, but preserves natural human speech qualities. No further results in terms of speech quality or resynthesis of other vowels are given.

An alternative neural network approach was proposed by Narashimhan et al. This involves separating the voiced source from the vocal tract contribution, and then creating a nonlinear dynamical model of the source [27]. This is achieved by first inverse filtering the speech signal to obtain the linear prediction (LP) residual. Next, the residual waveform is low-pass filtered at 1 kHz, then normalised to give a unit amplitude envelope. This processed signal is used as the training data in a time delay neural network with global feedback. The NN structure reported is extremely complex, consisting of a 30 tap delay line input and two hidden layers of 15 and 10 sigmoid activation functions, with the network training performed using back propagation through time. Finally, the NN model is used in free-running synthesis mode to recreate the voiced source. This is applied to an LP filter in order to synthesise speech. They show that the NN model successfully preserves the jitter of the original excitation signal.

5.2 RBF network for synthesis

A well known nonlinear modelling approach is the radial basis function (RBF) neural network. It is generally composed of three layers: an input layer of source nodes, a nonlinear hidden layer and an output layer giving the network response. The hidden layer performs a nonlinear transformation mapping the input space to a new space, in which the problem can be better solved. The output is the result of linearly combining the hidden space, multiplying each hidden layer output by a weight whose value is determined during the training process.

The general equation of an RBF network with an input vector x and a single output is

F(x(n)) = Σ_{j=1}^{P} w_j φ(||x − c_j||)   (2)

where there are P hidden units, each of which is weighted by w_j. The hidden units, φ(||x − c_j||), are radially symmetric functions about the point c_j, called a centre, in the hidden space, with ||·|| being the Euclidean vector norm [28]. The actual choice of nonlinearity does not appear to be crucial to the performance of the network. There are two distinct strategies for training an RBF network. The most common approach divides the problem into two steps. Firstly, the centre positions and bandwidths are fixed using an unsupervised approach, not dependent on the network output. Then the weights are trained in a supervised manner so as to minimise an error function.

Following from the work of Kubin et al., a nonlinear oscillator structure is used. The RBF network is used to approximate the underlying nonlinear dynamics of a particular stationary voiced sound, by training it to perform the prediction

x_{i+1} = F(x_i)   (3)

where x_i = {x_i, x_{i−τ}, ..., x_{i−(m−1)τ}} is a vector of previous inputs spaced by some delay of τ samples, and F is a nonlinear mapping function. From a nonlinear dynamical theory perspective, this can be viewed as a time delay embedding of the speech signal into an m-dimensional state space to produce a state space reconstruction of the original d-dimensional system attractor. The embedding dimension is chosen in accordance with Takens' embedding theorem [20], and the embedding delay, τ, is chosen as the first minimum of the average mutual information function [29]. The other parameters that must be chosen are the bandwidth, the number and position of the centres, and the length of training data to be used. With these set, the determination of the weights is linear in the parameters and is solved by minimising a sum of squares error function, E_s(F), over the N samples of training data:

E_s(F) = (1/2) Σ_{i=1}^{N} (x_i − x̂_i)²   (4)

where x̂_i is the network approximation of the actual speech signal x_i. Incorporating Equation 2 into the above and differentiating with respect to the weights, then setting the derivative equal to zero, gives the least squares problem [30], which can be written in matrix form as

Φw = x   (5)

where Φ is an N × P matrix of the outputs of the centres, x is the target vector of length N, and w is the P-length vector of weights. This can be solved by standard matrix inversion techniques.

Two types of centre positioning strategy were considered:

1. Data subset. Centres are picked as points from around the state space reconstruction. They are chosen pseudo-randomly, so as to give an approximately uniform spacing of centres about the state space reconstruction.
2. Hyper-lattice. An alternative, data independent
approach is to spread the centres uniformly over an
m-dimensional hyper-lattice.
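Equations (2)-(5) amount to forming the design matrix Φ and solving a linear least-squares problem for the weights. The following is an illustrative NumPy sketch, with Gaussian basis functions and a toy periodic signal in place of speech; function names and parameter values are not from the paper.

```python
import numpy as np

def rbf_design(X, centres, bw):
    """N x P matrix Phi of Gaussian basis outputs phi(||x - c_j||)."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(d / bw) ** 2)

def train_rbf(X, y, centres, bw):
    """Solve Phi w = x in the least-squares sense (Equations 4-5)."""
    Phi = rbf_design(X, centres, bw)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Toy one-step prediction x_{i+1} = F(x_i) (Equation 3) on a periodic signal.
x = np.sin(2 * np.pi * np.arange(1000) / 80)
m, tau = 3, 10
n = len(x) - (m - 1) * tau - 1
X = np.column_stack([x[(m - 1 - j) * tau:(m - 1 - j) * tau + n]
                     for j in range(m)])
y = x[(m - 1) * tau + 1:(m - 1) * tau + 1 + n]   # next sample after each vector
centres = X[::n // 50][:50]                      # "data subset" centre strategy
w = train_rbf(X, y, centres, bw=0.5)
rms = np.sqrt(np.mean((rbf_design(X, centres, 0.5) @ w - y) ** 2))
```

In resynthesis mode the network output would be fed back into the delay vector, closing the nonlinear oscillator loop.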
5.3 Synthesis

Figure 4: Spectra for examples of the vowel /u/, corresponding to the signals in Figure 3.

mapping function to make a well-posed problem [32]. The selection of an appropriate value for the regularisation parameter, λ, is done by the use of cross-validation [30]. After choosing all the other network parameters, these are held constant and λ is varied. For each value of λ, the MSE on an unseen validation set is calculated. The MSE curve should have a minimum indicating the best value of λ for generalisation. With the regularisation parameter chosen by this method, the 7D resynthesis gave correct results for all of the signals except KH /i/ and KH /u/ when using the data subset method of centre selection. However, only two signals (CA /i/ and MC /i/) were correctly resynthesised by the hyper-lattice method. It was found that λ needed to be increased significantly to ensure correct resynthesis for all the signals when the hyper-lattice was used. Achieving stable resynthesis inevitably comes at some cost. By forcing smoothness onto the approximated function, there is the risk that some of the finer detail of the state space reconstruction will be lost. Therefore, for best results, λ should be set at the smallest possible value that allows stable resynthesis.

The performance of the regularised RBF network as a nonlinear speech synthesiser is now measured by examining the time and frequency domains, as well as the dynamical properties. In addition to comparing the output of the nonlinear synthesiser to the original speech signal, the synthetic speech from a traditional linear prediction synthesiser is also considered. In this case, the LP filter coefficients were found from the original vowel sound (analogous to the training stage of the RBF network). The estimate (F_s + 4) [33] was used to set the number of filter taps to 26. Then, using the source-filter model, the LP filter was excited by a Dirac pulse train to produce the desired length LP synthesised signal. The distance between Dirac pulses was set to be equal to the average pitch period of the original signal. In this way, the three vowel sounds for each of the four speakers in the database were synthesised.

Figure 3 shows the time domain waveforms for the original signal, the LP synthesised signal and the two RBF synthesised signals, for the vowel /u/, speaker MC. Figure 4 shows the corresponding frequency domain plots of the signals, and the spectrograms are shown in Figure 5. In these examples, the regularisation parameter λ was set at 0.01 for the hyper-lattice, and 0.005 for the data subset. In the linear prediction case, the technique attempts to model the spectral features of the original; hence the reasonable match seen in the spectrum (although the high frequencies have been over-emphasised), but the lack of resemblance in the time domain. The RBF techniques, on the other hand, resemble the original in the time domain, since it is from this that the state space reconstruction is formed, although the spectral plots show the higher frequencies have not been well modelled by this method. This is because the networks have missed some of the very fine
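The LP baseline described above can be sketched as follows: autocorrelation-method LP coefficients, then excitation of the all-pole filter by a Dirac pulse train. This is illustrative NumPy code; a toy AR(2) signal stands in for a vowel, whereas the paper uses 26 taps at 22.05 kHz.

```python
import numpy as np

def lp_coeffs(x, order):
    """Autocorrelation-method LP coefficients a_k such that
    x[n] ~ sum_k a_k x[n-1-k]. (The paper sets order = Fs_kHz + 4,
    i.e. 26 taps at 22.05 kHz; the demo below uses order 2.)"""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lp_synthesise(a, pitch_period, length):
    """Excite the all-pole LP filter with a Dirac pulse train whose
    spacing equals the average pitch period of the original signal."""
    e = np.zeros(length)
    e[::pitch_period] = 1.0                  # Dirac pulse train
    y = np.zeros(length)
    for n in range(length):
        # y[n] = e[n] + sum_k a_k * y[n-1-k]
        y[n] = e[n] + sum(a[k] * y[n - 1 - k]
                          for k in range(len(a)) if n - 1 - k >= 0)
    return y

# Toy "vowel": a stable AR(2) process with known coefficients 0.9, -0.5.
rng = np.random.default_rng(0)
exc = rng.standard_normal(5000)
x = np.zeros(5000)
for n in range(2, 5000):
    x[n] = 0.9 * x[n - 1] - 0.5 * x[n - 2] + exc[n]
a = lp_coeffs(x, order=2)
y = lp_synthesise(a, pitch_period=80, length=800)
```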
Data type                    MC (male)   CA (female)   Average
Hyper-lattice jitter (%)     0.470       1.14          0.697
Data subset jitter (%)       0.482       0.663         0.521
Original jitter (%)          0.690       0.685         0.742
Hyper-lattice shimmer (%)    1.00        1.33          0.922
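The jitter and shimmer figures in the table are cycle-to-cycle perturbation percentages. The sketch below uses one common definition (mean absolute difference of consecutive periods or amplitudes, relative to their mean) and is not necessarily the exact algorithm of [34].

```python
import numpy as np

def jitter_percent(periods):
    """Mean absolute cycle-to-cycle pitch-period perturbation as a
    percentage of the mean period (one common definition; [34] gives
    a more elaborate measurement algorithm)."""
    p = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer_percent(amplitudes):
    """Analogous cycle-to-cycle peak-amplitude perturbation."""
    a = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)

# Example: pitch periods (in samples) measured between successive epochs.
print(round(jitter_percent([100, 102, 100, 103, 101]), 3))  # 2.223
```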
Figure 6: Synthesised vowel sounds together with desired and measured pitch profiles.

Figure 7: Synthesised vowel sounds together with desired and measured pitch profiles.
the data-based model of the vowel dynamics possesses

References

[4] G. Fant, Acoustic Theory of Speech Production.
[8] J. Page and A. Breen, "The Laureate text-to-speech system, architecture and applications," BT Technical Journal, vol. 14, pp. 57-67, January 1996.

[9] M. Beutnagel, A. Conkie, J. Schroeter, and A. Syrdal, "The AT&T Next-Gen TTS system," in Joint Meeting of ASA, EAA, and DAGA, (Berlin, Germany), March 1999.

[10] E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, pp. 453-467, 1990.

[11] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, pp. 744-754, August 1986.

[12] T. Koizumi, S. Taniguchi, and S. Hiromitsu, "Glottal source-vocal tract interaction," Journal of the Acoustical Society of America, vol. 78, pp. 1541-1547, November 1985.

[13] D. M. Brookes and P. A. Naylor, "Speech production modelling with variable glottal reflection coefficient," in International Conference on Acoustics, Speech and Signal Processing, pp. 671-674, 1988.

[14] H. M. Teager and S. M. Teager, "Evidence of nonlinear sound production mechanisms in the vocal tract," in Proceedings of the NATO Advanced Study Institute on Speech Production and Modelling, (Bonas, France), pp. 241-261, July 1989.

[15] I. Steinecke and H. Herzel, "Bifurcations in an asymmetric vocal-fold model," Journal of the Acoustical Society of America, vol. 97, pp. 1874-1884, March 1995.

[16] J. Schoentgen, "Non-linear signal representation and its application to the modelling of the glottal waveform," Speech Communication, vol. 9, pp. 189-201, 1990.

[17] R. J. DiFrancesco and E. Moulines, "Detection of glottal closure by jumps in the statistical properties of the speech signal," Speech Communication, vol. 9, pp. 401-418, December 1990.

[18] Y. M. Cheng and D. O'Shaughnessy, "Automatic and reliable estimation of glottal closure instant and period," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, pp. 1805-1815, December 1989.

[19] D. Talkin, "Voicing epoch determination with dynamic programming," Journal of the Acoustical Society of America, vol. 85, Supplement 1, p. S149, 1989.

[20] F. Takens, "Detecting strange attractors in turbulence," in Proceedings of Symposium on Dynamical Systems and Turbulence (A. Dold and B. Eckmann, eds.), pp. 366-

[23] F. Plante, G. F. Meyer, and W. A. Ainsworth, "A pitch extraction reference database," in EUROSPEECH'95, vol. 1, pp. 837-840, September 1995.

[24] G. Kubin, "Synthesis and coding of continuous speech with the nonlinear oscillator model," in International Conference on Acoustics, Speech and Signal Processing, (Atlanta, Georgia), pp. 267-270, May 1996.

[25] M. Birgmeier, Kalman-trained Neural Networks for Signal Processing Applications. PhD thesis, Technical University of Vienna, Vienna, 1996.

[26] I. Tokuda, R. Tokunaga, and K. Aihara, "A simple geometrical structure underlying speech signals of the Japanese vowel /a/," International Journal of Bifurcation and Chaos, vol. 6, no. 1, pp. 149-160, 1996.

[27] K. Narashimhan, J. C. Principe, and D. Childers, "Nonlinear dynamic modeling of the voiced excitation for improved speech synthesis," in International Conference on Acoustics, Speech and Signal Processing, (Phoenix, Arizona), pp. 389-392, March 1999.

[28] B. Mulgrew, "Applying radial basis functions," IEEE Signal Processing Magazine, vol. 13, pp. 50-65, March 1996.

[29] A. M. Fraser and H. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, pp. 1134-1140, 1986.

[30] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[31] I. Mann, An Investigation of Nonlinear Speech Synthesis and Pitch Modification Techniques. PhD thesis, University of Edinburgh, 1999.

[32] S. Haykin and J. Principe, "Making sense of a complex world," IEEE Signal Processing Magazine, vol. 15, pp. 66-81, May 1998.

[33] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Prentice-Hall, 1978.

[34] J. Schoentgen and R. de Guchteneere, "An algorithm for the measurement of jitter," Speech Communication, vol. 10, pp. 533-538, 1991.

[35] J. Stark, D. Broomhead, M. Davies, and J. Huke, "Takens embedding theorems for forced and stochastic systems," in Proceedings of 2nd World Congress of Nonlinear Analysis, 1996.

[36] J. Stark, "Delay embeddings for forced systems: Deterministic forcing," Journal of Nonlinear Science, vol. 9, pp. 255-332, 1999.

[37] H. Haas and G. Kubin, "Multi-band nonlinear oscillator model for speech," in 32nd Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 338-