
Nonlinear Speech Synthesis *

Stephen McLaughlin
Signals and Systems Group,
Department of Electronics and Electrical Engineering,
University of Edinburgh, The King's Buildings,
Edinburgh, EH9 3JL, Scotland, UK

Email: sml@ee.ed.ac.uk
Tel: (+44)-131-650-5578, Fax: (+44)-131-650-6554

ABSTRACT

This paper examines the how and the why of nonlinear speech synthesis. It discusses why nonlinear speech synthesis should be considered, reviews the recent history and describes in detail a variety of approaches to the problem. It argues that while modern concatenative speech synthesisers produce intelligible speech, they are very inflexible and often lack a human quality. The paper does not suggest that nonlinear speech synthesisers are ready to replace conventional approaches, but rather that they offer some potential advantages, although there is a considerable amount of research still to be carried out.

1 Introduction

Speech synthesis is a complex task that aims to produce natural-sounding speech. While working systems that produce intelligible speech have existed since the 1970s, the final aim of producing a synthesiser that is indistinguishable from a human speaker has still to be realised. There remain a number of problems at all stages of the process, including the actual generation of the speech signal itself with the required intonation. This paper is structured as follows: a brief review of conventional linear approaches is followed by a quick review of the nonlinearities which exist in speech generation. Then an example of nonlinear techniques applied to epoch marking is presented, followed by two sections on nonlinear speech synthesis. Finally some conclusions are drawn.

2 Conventional Speech Synthesis Approaches

Conventionally the main approaches to speech synthesis depend on the type of modelling used. This may be a model of the speech organs themselves (articulatory synthesis), a model derived from the speech signal (waveform synthesis), or alternatively the use of pre-recorded segments extracted from a database and joined together (concatenative synthesis).

Modelling the actual speech organs is an attractive approach, since it can be regarded as being a model of the fundamental level of speech production. An accurate articulatory model would allow all types of speech to be synthesised in a natural manner, without having to make many of the assumptions required by other techniques (such as attempting to separate the source and vocal tract parts out from one signal) [1-3]. Realistic articulatory synthesis is an extremely complex process, and the data required is not at all easy to collect. As such, it has not to date found any commercial application and is still more of a research tool.

Waveform synthesisers derive a model from the speech signal as opposed to the speech organs. This approach is derived from the linear source-filter theory of speech production [4]. The simplest form of waveform synthesis is based on linear prediction (LP) [5]. The resulting quality is extremely poor for voiced speech, sounding very robotic.

Formant synthesis uses a bank of filters, each of which represents the contribution of one of the formants. The best known formant synthesiser is the Klatt synthesiser [6], which has been exploited commercially as DECTalk. The synthesised speech quality is considerably better than that of the LP method, but still lacks naturalness, even when an advanced voice-source model is used [7].

Concatenation methods involve joining together pre-recorded units of speech which are extracted from a database. It must also be possible to change the prosody of the units, so as to impose the prosody required for the phrase that is being generated. The concatenation technique provides the best quality synthesised speech available at present. It is used in a large number of commercial systems, including British Telecom's Laureate [8] and the AT&T Next-Gen system [9]. Although there is a good degree of naturalness in the synthesised output, it is still clearly distinguishable from real human speech, and it may be that more sophisticated parametric models will eventually overtake it.

Techniques for time and pitch scaling of sounds held in a database are also extremely important. Two main techniques for time-scale and pitch modification in concatenative synthesis can be identified, each of which operates on the speech signal in a different manner.

* This work was supported by BT, EPSRC and the Royal Society.

The pitch synchronous overlap add (PSOLA) [10] approach is non-parametric, as opposed to the harmonic method, which actually decomposes the signal into explicit source and vocal tract models. PSOLA is reported to give good quality, natural-sounding synthetic speech for moderate pitch and time modifications. Slowing down the speech by a large factor (greater than two) does introduce artifacts due to the repetition of PSOLA bells. Some tonal artifacts (e.g. whistling) also appear with large pitch scaling, especially for higher pitch voices, such as female speakers and children.

McAulay and Quatieri developed a speech generation model that is based on a glottal excitation signal made up of a sum of sine waves [11]. They then used this model to perform time-scale and pitch modification. Starting with the assumption made in the linear model of speech that the speech waveform x(t) is the output generated by passing an excitation waveform e(t) through a linear filter h(t), the excitation is defined as a sum of sine waves of arbitrary amplitudes, frequencies and phases. A limitation of all these techniques is that they use the linear model of speech as a basis.

3 Nonlinearities in speech

There are known to be a number of nonlinear effects in the speech production process. Firstly, it has been accepted for some time that the vocal tract and the vocal folds do not function independently of each other, but that there is in fact some form of coupling between them when the glottis is open [12], resulting in significant changes in formant characteristics between open and closed glottis cycles [13]. More controversially, Teager and Teager [14] have claimed (based on physical measurements) that voiced sounds are characterised by highly complex air flows in the vocal tract involving jets and vortices, rather than well behaved laminar flow. In addition, the vocal folds will themselves be responsible for further nonlinear behaviour, since the muscle and cartilage which comprise the larynx have nonlinear stretching qualities. Such nonlinearities are routinely included in attempts to model the physical process of vocal fold vibration, which have focussed on two or more mass models [2, 3, 15], in which the movement of the vocal folds is modelled by masses connected by springs, with nonlinear coupling. Observations of the glottal waveform have shown that this waveform can change shape at different amplitudes [16], which would not be possible in a strictly linear system where the waveform shape is unaffected by amplitude changes.

In order to arrive at the simplified linear model, a number of major assumptions are made:

• the vocal tract and speech source are uncoupled (thus allowing source-filter separation);
• airflow through the vocal tract is laminar;
• the vocal folds vibrate in an exactly periodic manner during voiced speech production;
• the configuration of the vocal tract will only change slowly.

These imply a loss of information which means that the full speech signal dynamics can never be properly captured. These inadequacies can be seen in practice in speech synthesis where, at the waveform generation level, current systems tend to produce an output signal that lacks naturalness. This is true even of concatenation techniques which copy and modify actual speech segments.

4 Poincare maps and epoch marking

This section discusses how nonlinear techniques can be applied to pitch marking of continuous speech. We wish to locate the instants in the time domain speech signal at which the glottis is closed. A variety of existing methods can be employed to locate the epochs: abrupt change detection [17], maximum likelihood epoch detection [18] and dynamic programming [19]. All of the above techniques are sound and generally provide good epoch detection. The technique presented here should not be viewed as a direct competitor to the methods outlined above. Rather it is an attempt to show the practical application of ideas from nonlinear dynamical theory to a real speech processing problem. The performance in clean speech is comparable to many of the techniques discussed above.

In nonlinear processing a d-dimensional system can be reconstructed in an m-dimensional state space from a single dimension time series by a process called embedding. Takens' theorem states that m ≥ 2d + 1 for an adequate reconstruction [20], although in practice it is often possible to reduce m. An alternative is the singular value decomposition (SVD) embedding [21], which may be more attractive in real systems where noise is an issue.
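To make the embedding step concrete, the sketch below (Python/NumPy; the function names and the window default are ours, not the paper's) builds a plain time-delay embedding and a Broomhead-King style SVD embedding of a scalar signal:

```python
import numpy as np

def delay_embed(x, m, tau):
    # Row t is the delay vector (x[t], x[t - tau], ..., x[t - (m - 1) * tau]).
    start = (m - 1) * tau
    return np.stack([x[start - j * tau: len(x) - j * tau] for j in range(m)],
                    axis=1)

def svd_embed(x, m, window=10):
    # SVD embedding: embed with unit delay in a wide window, then rotate
    # the delay vectors onto their top m principal directions.
    Y = delay_embed(x, window, 1)
    Y = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Y @ Vt[:m].T
```

A call such as y = svd_embed(frame, 3) gives the kind of three-dimensional attractor reconstruction assumed in the Poincare section calculation that follows.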
A Poincare map is often used in the analysis of dynamical systems. It replaces the flow of an n-th order continuous system with an (n − 1)-th order discrete time map. Considering a three dimensional attractor, a Poincare section slices through the flow of trajectories and the resulting crossings form the Poincare map. Re-examining the attractor reconstructions of voiced speech shown above, it is evident that these three dimensional attractors can also be reduced to two dimensional maps.¹ Additionally, these reconstructions are pitch-synchronous, in that one revolution of the attractor is equivalent to one pitch period. This has previously been used for cyclostationary analysis and synchronisation [22]; here we examine its use for epoch marking.

¹ Strictly these attractor reconstructions are discrete time maps and not continuous flows. However it is possible to construct a flow vector between points and use this for the Poincare section calculation.

Figure 1: Results for the voiced section of "came along" from the Keele database for a female speaker. From top to bottom: the signal; the epochs as calculated by the algorithm; the laryngograph signal; the pitch contour (Hz) resulting from the algorithm.

Figure 2: Results for the voiced section of "raining" from the BT Labs database for a male speaker. From top to bottom: the signal; the epochs as calculated by the algorithm; the processed laryngograph signal; the pitch contour (Hz) resulting from the algorithm.

The basic processing steps required for a waveform of N points are as follows:

1. Mark y_GCI, a known GCI in the signal.

2. Perform an SVD embedding on the signal to generate the attractor reconstruction in 3D state space.

3. Calculate the flow vector, h, at the marked point y_GCI on the attractor.

4. Detect crossings of the Poincare section, Σ, at this point in state space by sign changes of the scalar product between h and the vector y_i − y_GCI for all 1 ≤ i ≤ N points.

5. Points on Σ which are within the same portion of the manifold as y_GCI are the epochs.
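A minimal sketch of steps 3-5 follows, assuming y is an N × 3 reconstruction such as svd_embed(frame, 3) above and gci is the index of the marked closure. Approximating the flow vector by a forward difference, and the "same portion of the manifold" test by a simple distance threshold, are our simplifications rather than the paper's exact construction:

```python
import numpy as np

def poincare_epochs(y, gci, radius):
    h = y[gci + 1] - y[gci]        # flow vector at the marked GCI (step 3)
    s = (y - y[gci]) @ h           # signed side of the Poincare section
    cross = np.where(np.sign(s[:-1]) != np.sign(s[1:]))[0] + 1    # step 4
    d = np.linalg.norm(y[cross] - y[gci], axis=1)
    return cross[d < radius]       # same part of the manifold (step 5)
```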
When dealing with real speech signals a number of practical issues have to be considered. The input signal must be treated on a frame-by-frame basis, within which the speech is assumed stationary. Finding the correct intersection points on the Poincare section is also a difficult task due to the complicated structure of the attractor.

Two different data sets were used to test the performance of the algorithm, giving varying degrees of realistic speech and hence difficulty.

1. Keele University pitch extraction database [23]. This database provides speech and laryngograph data from 15 speakers reading phonetically balanced sentences.

2. BT Labs continuous speech. Two phrases, spoken by 4 speakers, were processed manually to extract a data set of continuous voiced speech. Laryngograph data was also available.

The signals were up-sampled to 22.05 kHz; the BT data was originally sampled at 12 kHz and the Keele signals at 20 kHz. All the signals had 16 bit resolution.

Fig. 1 shows the performance of the algorithm on a voiced section taken from the phrase "a traveller came along wrapped in a warm cloak", spoken by a female speaker. There is considerable change in the signal, and hence in the attractor structure, in this example, yet the epochs are sufficiently well located when compared against the laryngograph signal.

In Fig. 2, which is a voiced section from the phrase "see if it's raining" spoken by a male speaker, the epochs are well located for the first part of the signal, but some slight loss of synchronisation can be seen in the latter part.

5 Nonlinear Synthesis Approaches

5.1 Neural network synthesis background

Kubin and Birgmeier reported an attempt to use an RBF network approach to speech synthesis. They propose the use of a nonlinear oscillator, with no external input and global feedback, in order to perform the mapping

x(n) = A(x(n − 1))   (1)

where x(n − 1) is the delay vector with non-unit delays, and A is the nonlinear mapping function [24].

The initial approach taken [25] used a Kalman-based RBF network, which has all of the network parameters trained by the extended Kalman filter algorithm. The only parameter that must be specified is the number of centres to use. This gives good prediction results, but there are many problems with resynthesis. In particular, they report that extensive manual fine-tuning of parameters such as dimension, embedding delay and the number and initial positions of the centres is required. Even with this tuning, synthesis of some sounds with complicated phase space reconstructions does not work [24].

In order to overcome this problem, Kubin resorted to a technique that uses all of the data points in the training data frame as centres [24]. Although this gives correct resynthesis, even allowing the resynthesis of continuous speech using a frame-adaptive approach, it is unsatisfactory due to the very large number of varying parameters, and cannot be seen as actually learning the dynamics of the speech generating system.

Following their dynamical analysis of the Japanese vowel /a/, Tokuda et al. constructed a feed-forward neural network to perform synthesis [26]. Their structure has three layers, with five neurons in the input layer, forty neurons in the hidden layer, and one in the output layer. The time delay in the input delay vector is set at τ = 3 and the weights are learnt by back propagation. Using global feedback, they report successful resynthesis of the Japanese vowel /a/. The signal is noisy, but preserves natural human speech qualities. No further results in terms of speech quality or resynthesis of other vowels are given.

An alternative neural network approach was proposed by Narashimhan et al. This involves separating the voiced source from the vocal tract contribution, and then creating a nonlinear dynamical model of the source [27]. This is achieved by first inverse filtering the speech signal to obtain the linear prediction (LP) residual. Next the residue waveform is low-pass filtered at 1 kHz, then normalised to give a unit amplitude envelope. This processed signal is used as the training data in a time delay neural network with global feedback. The NN structure reported is extremely complex, consisting of a 30 tap delay line input and two hidden layers of 15 and 10 sigmoid activation functions, with the network training performed using back propagation through time. Finally, the NN model is used in free-running synthesis mode to recreate the voiced source. This is applied to a LP filter in order to synthesise speech. They show that the NN model successfully preserves the jitter of the original excitation signal.

5.2 RBF network for synthesis

A well known nonlinear modelling approach is the radial basis function neural network. It is generally composed of three layers, made up of an input layer of source nodes, a nonlinear hidden layer and an output layer giving the network response. The hidden layer performs a nonlinear transformation mapping the input space to a new space, in which the problem can be better solved. The output is the result of linearly combining the hidden space, multiplying each hidden layer output by a weight whose value is determined during the training process.

The general equation of an RBF network with an input vector x and a single output is

F(x(n)) = Σ_{j=1}^{P} w_j φ(‖x − c_j‖)   (2)

where there are P hidden units, each of which is weighted by w_j. The hidden units, φ(‖x − c_j‖), are radially symmetric functions about the point c_j, called a centre, in the hidden space, with ‖·‖ being the Euclidean vector norm [28]. The actual choice of nonlinearity does not appear to be crucial to the performance of the network. There are two distinct strategies for training an RBF network. The most common approach divides the problem into two steps. Firstly the centre positions and bandwidths are fixed using an unsupervised approach, not dependent on the network output. Then the weights are trained in a supervised manner so as to minimise an error function.

Following from the work of Kubin et al., a nonlinear oscillator structure is used. The RBF network is used to approximate the underlying nonlinear dynamics of a particular stationary voiced sound, by training it to perform the prediction

x_{i+1} = F(x_i)   (3)

where x_i = {x_i, x_{i−τ}, ..., x_{i−(m−1)τ}} is a vector of previous inputs spaced by some delay of τ samples, and F is a nonlinear mapping function. From a nonlinear dynamical theory perspective, this can be viewed as a time delay embedding of the speech signal into an m-dimensional state space to produce a state space reconstruction of the original d-dimensional system attractor. The embedding dimension is chosen in accordance with Takens' embedding theorem [20] and the embedding delay, τ, is chosen as the first minimum of the average mutual information function [29]. The other parameters that must be chosen are the bandwidth, the number and position of the centres, and the length of training data to be used. With these set, the determination of the weights is linear in the parameters and is solved by minimising a sum of squares error function, E_s(F), over the N samples of training data:

E_s(F) = (1/2) Σ_{i=1}^{N} (x_i − x̂_i)²   (4)

where x̂_i is the network approximation of the actual speech signal x_i. Incorporating Equation 2 into the above and differentiating with respect to the weights, then setting the derivative equal to zero gives the least-squares problem [30], which can be written in matrix form as

Φw = x   (5)

where Φ is an N × P matrix of the outputs of the centres; x is the target vector of length N; and w is the P-length vector of weights. This can be solved by standard matrix inversion techniques.
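Once the centres and bandwidth are fixed, Equations 2-5 reduce to a linear least-squares fit. The sketch below assumes a Gaussian nonlinearity and includes an optional ridge term λ, anticipating the regularisation introduced in Section 5.3; all names are illustrative:

```python
import numpy as np

def rbf_design(X, centres, bw):
    # Phi[i, j] = phi(||x_i - c_j||), here with a Gaussian basis function.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bw ** 2))

def train_rbf(X, targets, centres, bw, lam=0.0):
    # Equation 5, Phi w = x, solved in the least-squares sense with an
    # optional ridge term: w = (Phi^T Phi + lam I)^(-1) Phi^T x.
    Phi = rbf_design(X, centres, bw)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ targets)

def rbf_predict(v, centres, bw, w):
    # Equation 2 evaluated for a single input vector v.
    return (rbf_design(v[None, :], centres, bw) @ w).item()
```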

Two types of centre positioning strategy were considered:

1. Data subset. Centres are picked as points from around the state space reconstruction. They are chosen pseudo-randomly, so as to give an approximately uniform spacing of centres about the state space reconstruction.

2. Hyper-lattice. An alternative, data independent approach is to spread the centres uniformly over an m-dimensional hyper-lattice.

5.3 Synthesis

From analysis, an initial set of parameters with which to attempt resynthesis was chosen. The parameters were set at the following values: bandwidth = 0.8 for hyper-lattice, 0.5 for data subset; dimension = 7; number of centres = 128; hyper-lattice size = 1.0; training length = 1000.

Figure 3: Time domain examples of the vowel /u/, speaker MC. Top row: original signal (left) and linear prediction synthesised signal (right); bottom row: RBF network synthesised signal, hyper-lattice (left) and data subset (right).

For each vowel in the database, the weights were learnt, with the centres either on a 7D hyper-lattice, or chosen as a subset of the training data. The global feedback loop was then put in place to allow free-running synthesis. The results gave varying degrees of success, from constant (sometimes zero) outputs, through periodic cycles not resembling the original speech signal and noise-like signals, to extremely large spikes at irregular intervals on otherwise correct waveforms [31].
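The global feedback loop can be sketched as follows, reusing the hypothetical delay_embed and RBF helpers above: the network is trained as a one-step predictor on the embedded vowel, the first few real samples seed the delay line, and thereafter the synthesiser output is fed back as input. The evenly strided centre choice merely stands in for the pseudo-random data-subset selection:

```python
import numpy as np

def free_run(x_train, n_out, m=7, tau=1, bw=0.5, lam=0.0, n_centres=128):
    Y = delay_embed(x_train, m, tau)                  # state space reconstruction
    X, targets = Y[:-1], x_train[(m - 1) * tau + 1:]  # one-step prediction pairs
    centres = X[::max(1, len(X) // n_centres)][:n_centres]
    w = train_rbf(X, targets, centres, bw, lam)
    state = list(x_train[: (m - 1) * tau + 1])        # seed with real samples
    out = []
    for _ in range(n_out):                            # global feedback loop
        hist = np.asarray(state)
        v = hist[[-1 - j * tau for j in range(m)]]    # current delay vector
        x_next = rbf_predict(v, centres, bw, w)
        out.append(x_next)
        state.append(x_next)
    return np.array(out)
```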
These results implied that a large number of the mapping functions learnt by the network suffered from some form of instability. This could have been due to a lack of smoothness in the function, in which case regularisation theory was the ideal solution. Regularisation theory applies some prior knowledge, or constraints, to the mapping function to make a well-posed problem [32].

Figure 4: Spectra for examples of the vowel /u/, corresponding to the signals in Figure 3.

The selection of an appropriate value for the regularisation parameter, λ, is done by the use of cross-validation [30]. After choosing all the other network parameters, these are held constant and λ is varied. For each value of λ, the MSE on an unseen validation set is calculated. The MSE curve should have a minimum indicating the best value of λ for generalisation. With the regularisation parameter chosen by this method, the 7D resynthesis gave correct results for all of the signals except KH /i/ and KH /u/ when using the data subset method of centre selection. However, only two signals (CA /i/ and MC /i/) were correctly resynthesised by the hyper-lattice method. It was found that λ needed to be increased significantly to ensure correct resynthesis for all the signals when the hyper-lattice was used. Achieving stable resynthesis inevitably comes at some cost. By forcing smoothness onto the approximated function there is the risk that some of the finer detail of the state space reconstruction will be lost. Therefore, for best results, λ should be set at the smallest possible value that allows stable resynthesis. The performance of the regularised RBF network as a nonlinear speech synthesiser is now measured by examining the time and frequency domains, as well as the dynamical properties.

In addition to comparing the output of the nonlinear synthesiser to the original speech signal, the synthetic speech from a traditional linear prediction synthesiser is also considered. In this case, the LP filter coefficients were found from the original vowel sound (analogous to the training stage of the RBF network). The estimate Fs + 4 (with Fs in kHz) [33] was used to set the number of filter taps to 26. Then, using the source-filter model, the LP filter was excited by a Dirac pulse train to produce the desired length LP synthesised signal. The distance between Dirac pulses was set to be equal to the average pitch period of the original signal. In this way, the three vowel sounds for each of the four speakers in the database were synthesised.
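For reference, a sketch of the linear prediction baseline just described: autocorrelation-method LP coefficients of order Fs/1000 + 4 (26 taps at 22.05 kHz), with a Dirac pulse train at the average pitch period driving the all-pole synthesis filter. The direct Toeplitz solve below stands in for the usual Levinson-Durbin recursion:

```python
import numpy as np

def lp_synth(x, fs, pitch_period, n_out):
    order = int(fs / 1000) + 4                    # e.g. 26 taps at 22.05 kHz
    r = np.correlate(x, x, "full")[len(x) - 1: len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1: order + 1])       # LP predictor coefficients
    e = np.zeros(n_out)
    e[::pitch_period] = 1.0                       # Dirac pulse train excitation
    y = np.zeros(n_out)
    for n in range(n_out):                        # all-pole synthesis filter
        k = min(order, n)
        y[n] = e[n] + a[:k] @ y[n - k: n][::-1]
    return y
```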
Figure 3 shows the time domain waveforms for the original signal, the LP synthesised signal and the two RBF synthesised signals, for the vowel /u/, speaker MC. Figure 4 shows the corresponding frequency domain plots of the signals, and the spectrograms are shown in Figure 5. In these examples, the regularisation parameter λ was set at 0.01 for the hyper-lattice, and 0.005 for the data subset. In the linear prediction case, the technique attempts to model the spectral features of the original. Hence the reasonable match seen in the spectrum (although the high frequencies have been over-emphasised), but the lack of resemblance in the time domain. The RBF techniques, on the other hand, resemble the original in the time domain, since it is from this that the state space reconstruction is formed, although the spectral plots show the higher frequencies have not been well modelled by this method. This is because the networks have missed some of the very fine variations of the original time domain waveform, which may be due to the regularisation.

Data type                    MC (male)   CA (female)   Average
Hyper-lattice jitter (%)       0.470       1.14          0.697
Data subset jitter (%)         0.482       0.663         0.521
Original jitter (%)            0.690       0.685         0.742
Hyper-lattice shimmer (%)      1.00        1.33          0.922
Data subset shimmer (%)        0.694       7.65          2.34
Original shimmer (%)           4.21        7.06          5.17

Table 1: Percentage jitter and shimmer in original and synthesised waveforms (hyper-lattice and data subset), averaged over the vowels /i/, /a/ and /u/ for each speaker, and as an average over the database.

Figure 5: Wide-band spectrograms for examples of the vowel /u/, corresponding to the signals in Figure 3.

Further spectrogram examples for different vowels and speakers follow the same pattern, with the size of λ being seen to influence the quality of the signal at high frequencies.
5.4 Jitter and shimmer

Jitter and shimmer measurements were made on all of the original and RBF synthesised waveforms, using epoch detection² over a 500 msec window. Jitter is defined as the variation in length of individual pitch periods and for normal, healthy speech should be between 0.1 and 1% of the average pitch period [34]. Table 1 shows the results of the average pitch length variation, expressed as a percentage of the average pitch period length. Results for both centre placing techniques are presented, with the jitter measurements of the original speech data. The hyper-lattice synthesised waveforms contain more jitter than the data subset signals, and both values are reasonable compared to the original.

Shimmer results (the variations in energy in each pitch cycle) for the original and synthesised waveforms are also displayed in Table 1. It can be seen that in general there is considerably less shimmer on the synthesised waveforms as compared to the original, which will detract from the quality of the synthetic speech.

² Using Entropic Laboratory's ESPS Epoch function.
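The two measures can be sketched as below, given epoch positions in samples from any of the detectors in Section 4. The mean absolute difference between consecutive cycles, as a percentage of the mean, is one common perturbation formula; the paper does not spell out its exact estimator:

```python
import numpy as np

def jitter_shimmer(x, epochs):
    periods = np.diff(epochs)                     # individual pitch periods
    jitter = 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    energy = np.array([np.sum(x[a:b] ** 2)        # energy of each pitch cycle
                       for a, b in zip(epochs[:-1], epochs[1:])])
    shimmer = 100.0 * np.mean(np.abs(np.diff(energy))) / np.mean(energy)
    return jitter, shimmer
```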
put. Our results show that this method allows the vowel
6 Incorporating Pitch into the Nonlinear Synthesis Method

The approach adopted here is to model the vocal tract as a forced nonlinear oscillator and to embed an observed scalar time-series of a vowel with pitch information into a higher dimensional space. This embedding, when carried out correctly, will reconstruct the data onto a higher dimensional surface which embodies the dynamics of the vocal tract (see, for example, [35, 36] for issues regarding embedding).

Previous studies, discussed above, have successfully modelled stationary (i.e. constant pitch) vowel sounds using nonlinear methods, but these have very limited use since the pitch cannot be modified to include prosody information. The new approach described here resolves this problem by including pitch information in the embedding. Specifically, a non-stationary vowel sound is extracted from a database and, using standard pitch extraction techniques, a pitch contour is calculated for the time series so that each time domain sample has an associated pitch value. In the present study, measurements of rising pitch vowel sounds, where the pitch rises through the length of the time series, have been used as the basis for modelling; see, for example, Figure 1.

The time series is then embedded in an m-dimensional space, along with the pitch contour, to form an (m+1)-dimensional surface. A mixed embedding delay between time samples (greater than unity) is used to capture the variable time scales present in the vowel waveform. The (m+1)-dimensional surface is modelled by a nearest neighbour approach, which predicts the next time series sample given a vector of previous time samples and a pitch value (it is envisaged that more sophisticated modelling techniques will be incorporated at a later date).
Synthesis is then performed by a modification of the nonlinear oscillator approach [37], whereby the input signal is removed and the delayed synthesiser output is fed back to form the next input sample. In contrast to previous techniques, the required pitch contour is also passed into the model as an external forcing input. Our results show that this method allows the vowel sound to be generated correctly for arbitrary specified pitch contours (within the input range of pitch values), even though the training data is only made up of the rising vowel time series and its associated pitch contour. In addition, sounds of arbitrary duration can be readily synthesised by simply running the oscillator for the required length of time. Typical synthesis results are shown in Figures 6 and 7.
Figure 6: Synthesised vowel sounds (RV1 and RV4) together with desired and measured pitch profiles.

Figure 7: Synthesised vowel sounds (RV5 and RV6) together with desired and measured pitch profiles.

It can be seen that the sinusoidal pitch contour of the synthesised sound is quite different from the rising pitch profile of the measured data; the duration of the synthesised data is also somewhat longer than that of the measured data. The small offset evident between desired and synthesised pitch contours is attributed to minor calibration error. The initial results presented here are encouraging, indeed perhaps somewhat surprisingly so. Specifically, good synthesis results are obtained using a simple nearest neighbour embedding model with only sparse data (typically around 1000 data points embedded in a space of dimension 17, corresponding to a very low density of around only 1.5 data points per dimension). The measured pitch excitation data is also limited: a simple rising pitch profile with a small number of data points at each specific pitch value.

7 Conclusions

In view of these observations, it seems likely that the data-based model of the vowel dynamics possesses an important degree of structure, perhaps reflecting physiological considerations, that requires further investigation. It is also clear that, whilst encouraging, there is still some way to go in overcoming the limitations of the approach. It is clear that speech is a nonlinear process and that if we are to achieve the holy grail of truly natural sounding synthetic speech this must be accounted for. It is also clear that nonlinear synthesis techniques offer some potential to achieve this, although a great deal of research work remains to be done.

8 Acknowledgements

The contributions of my colleague Iain Mann to this work are gratefully acknowledged.

References

[1] B. Gabioud, "Articulatory models in speech synthesis," in Fundamentals of Speech Synthesis and Speech Recognition, pp. 215-230. John Wiley & Sons, 1994.

[2] K. Ishizaka and J. L. Flanagan, "Synthesis of voiced sounds from a two-mass model of the vocal chords," Bell System Technical Journal, vol. 51, pp. 1233-1268, July-August 1972.

[3] T. Koizumi, S. Taniguchi, and S. Hiromitsu, "Two-mass models of the vocal cords for natural sounding voice synthesis," Journal of the Acoustical Society of America, vol. 82, pp. 1179-1192, October 1987.

[4] G. Fant, Acoustic Theory of Speech Production. Mouton, 1960.

[5] J. Markel and A. Gray, Linear Prediction of Speech. Berlin: Springer-Verlag, 1976.

[6] D. H. Klatt, "Software for a cascade/parallel formant synthesiser," Journal of the Acoustical Society of America, vol. 67, pp. 971-995, 1980.

[7] M. Edgington, A. Lowry, P. Jackson, A. Breen, and S. Minnis, "Overview of current text-to-speech techniques: Part II - prosody and speech generation," BT Technical Journal, vol. 14, pp. 84-99, January 1996.

[8] J. Page and A. Breen, "The Laureate text-to-speech system, architecture and applications," BT Technical Journal, vol. 14, pp. 57-67, January 1996.

[9] M. Beutnagel, A. Conkie, J. Schroeter, and A. Syrdal, "The AT&T Next-Gen TTS system," in Joint Meeting of ASA, EAA, and DAGA, (Berlin, Germany), March 1999.

[10] E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, pp. 453-467, 1990.

[11] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, pp. 744-754, August 1986.

[12] T. Koizumi, S. Taniguchi, and S. Hiromitsu, "Glottal source-vocal tract interaction," Journal of the Acoustical Society of America, vol. 78, pp. 1541-1547, November 1985.

[13] D. M. Brookes and P. A. Naylor, "Speech production modelling with variable glottal reflection coefficient," in International Conference on Acoustics, Speech and Signal Processing, pp. 671-674, 1988.

[14] H. M. Teager and S. M. Teager, "Evidence of nonlinear sound production mechanisms in the vocal tract," in Proceedings of the NATO Advanced Study Institute on Speech Production and Modelling, (Bonas, France), pp. 241-261, July 1989.

[15] I. Steinecke and H. Herzel, "Bifurcations in an asymmetric vocal-fold model," Journal of the Acoustical Society of America, vol. 97, pp. 1874-1884, March 1995.

[16] J. Schoentgen, "Non-linear signal representation and its application to the modelling of the glottal waveform," Speech Communication, vol. 9, pp. 189-201, 1990.

[17] R. J. DiFrancesco and E. Moulines, "Detection of glottal closure by jumps in the statistical properties of the speech signal," Speech Communication, vol. 9, pp. 401-418, December 1990.

[18] Y. M. Cheng and D. O'Shaughnessy, "Automatic and reliable estimation of glottal closure instant and period," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, pp. 1805-1815, December 1989.

[19] D. Talkin, "Voicing epoch determination with dynamic programming," Journal of the Acoustical Society of America, vol. 85, Supplement 1, p. S149, 1989.

[20] F. Takens, "Detecting strange attractors in turbulence," in Proceedings of Symposium on Dynamical Systems and Turbulence (A. Dold and B. Eckmann, eds.), pp. 366-381, Lecture Notes in Mathematics, 1980.

[21] D. S. Broomhead and G. P. King, "On the qualitative analysis of experimental dynamical systems," in Nonlinear Phenomena and Chaos, pp. 113-144. Bristol: Adam Hilger, 1986.

[22] G. Kubin, "Poincare sections for speech," in Proceedings of the 1997 IEEE Workshop on Speech Coding, (Pocono Manor, USA), September 1997.

[23] F. Plante, G. F. Meyer, and W. A. Ainsworth, "A pitch extraction reference database," in EUROSPEECH'95, vol. 1, pp. 837-840, September 1995.

[24] G. Kubin, "Synthesis and coding of continuous speech with the nonlinear oscillator model," in International Conference on Acoustics, Speech and Signal Processing, (Atlanta, Georgia), pp. 267-270, May 1996.

[25] M. Birgmeier, Kalman-trained Neural Networks for Signal Processing Applications. PhD thesis, Technical University of Vienna, Vienna, 1996.

[26] I. Tokuda, R. Tokunaga, and K. Aihara, "A simple geometrical structure underlying speech signals of the Japanese vowel /a/," International Journal of Bifurcation and Chaos, vol. 6, no. 1, pp. 149-160, 1996.

[27] K. Narashimhan, J. C. Principe, and D. Childers, "Nonlinear dynamic modeling of the voiced excitation for improved speech synthesis," in International Conference on Acoustics, Speech and Signal Processing, (Phoenix, Arizona), pp. 389-392, March 1999.

[28] B. Mulgrew, "Applying radial basis functions," IEEE Signal Processing Magazine, vol. 13, pp. 50-65, March 1996.

[29] A. M. Fraser and H. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, pp. 1134-1140, 1986.

[30] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[31] I. Mann, An Investigation of Nonlinear Speech Synthesis and Pitch Modification Techniques. PhD thesis, University of Edinburgh, 1999.

[32] S. Haykin and J. Principe, "Making sense of a complex world," IEEE Signal Processing Magazine, vol. 15, pp. 66-81, May 1998.

[33] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Prentice-Hall, 1978.

[34] J. Schoentgen and R. de Guchteneere, "An algorithm for the measurement of jitter," Speech Communication, vol. 10, pp. 533-538, 1991.

[35] J. Stark, D. Broomhead, M. Davies, and J. Huke, "Takens embedding theorems for forced and stochastic systems," in Proceedings of 2nd World Congress of Nonlinear Analysis, 1996.

[36] J. Stark, "Delay embeddings for forced systems: Deterministic forcing," Journal of Nonlinear Science, vol. 9, pp. 255-332, 1999.

[37] H. Haas and G. Kubin, "Multi-band nonlinear oscillator model for speech," in 32nd Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 338-342, 1998.

