0% found this document useful (0 votes)

21 views8 pages

Nonlinear Speech Synthesis

This document discusses nonlinear speech synthesis. It reviews conventional linear approaches and their limitations. It then discusses using nonlinear techniques like Poincare maps to mark glottal closure epochs more accurately than existing linear methods. The document argues that while modern speech synthesis is intelligible, it lacks human quality and nonlinear methods may offer advantages, but more research is still needed.

Uploaded by

b194768

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views8 pages

Nonlinear Speech Synthesis

Uploaded by

b194768

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Nonlinear Speech Synthesis *

Stephen �McLaughlin
Signals and S,ystems Group:
Department of Electronics and Electrical Engineering:
:
Universi(yof Edinburgh: T he King s Buildings:
Edinburgh: ERg 3JL: Scotland: UK

Email: smH=iJee.ed.ac.uk
Tel (+44)-131-650-5578, Fax: (+44)-131-650-6554

ABSTRACT the fundamental level of speech production. An accur

This paper examines the how and the why of nonlin ate articulatory model would allow all types of speech to
be synthesised in a natural manner, without having to
ear speech synthesis. It discusses why nonlinear speech
make many of the assumptions required by other tech
synthesis should be considered, reviews the recent his
tory and describes in detail a variety of approaches to niques (such as attempting to separate the source and
the problem. It argues that while modern concatenative vocal tract parts out from one signal) [1-3]. Realistic
speech synthesisers produce speech which is intelligible, articulatory synthesis is an extremely complex process,
and the data required is not at all easy to collect. As
however they are very inflexible and often lack a hu
such, it has not to date found any commercial applica
man quality. The paper does not suggest that nonlinear
speech synthesisers are ready to replace conventional ap tion and is still more of a research tool.
proaches, but rather that they offer some potential ad vVaveform synthesisers derive a model from the speech
vantages but there is a considerable amount of research signal as opposed to the speech organs. This approach
still to be carried out. is derived from the linear source-filter theory of speech
production [4]. The simplest form of waveform synthesis
1 Introduction is based on linear prediction (LP) [5]. The resulting
Speech synthesis is a complex task that aims to produce quality is extremely poor for voiced speech, sounding
naturally-sounding speech. vVhile working systems that very robotic.
produce intelligible speech have existed since the 1970's, Formant synthesis uses a bank of filters, each of
the final aim of producing a synthesiser that is indistin 'which represents the contribution of one of the form
guishable from a human speaker has still to be realised. ants. The best knuwn formant synthesiser is the Klatt
There remain a number of problems at all stages of the synthesiser [6], 'which has been exploited commercially
process, including the actual generation of the speech as DECTalk. The synthesised speech quality is consid
signal itself with the required intonation. This paper erably better than that of the LP method, but still lacks
is structured as follows, a brief review of conventional naturalness, even 'when an advanced voice-source rnodel
linear based approaches is followed by a quick review of is used [7].
nonlinearities which exist in speech generation. Then an
Concatenation methods involve joining together pre
example of nonlinear techniques applied to epoch mark
recorded units of speech which are extracted from a
ing is presented followed by two sections on nonlinear
database. It must also be possible to change the prosody
speech synthesis. Finally some conclusions are drawn.
of the units, so as to impose the prosody required for the
2 Conventional Speech Synthesis Approaches
phrase that is being generated. The concatenation tech
nique provides the best quality synthesised speech avail
Conventionally the main approaches to speech synthesis able at present. It is used in a large number of commer
depend on the type of modelling used. This may be a cial systems, including British Telecomm's Laureate [8]
model of the speech organs themselves (articulatory syn and the AT&T I\ext-Gen system [9]. Although there is
thesis), a model derived from the speech signal (wave a good degree of naturalness in the synthesised output,
form synthesis), or alternatively the use of pre-recorded it is still clearly distinguishable from real human speech,
segments extracted from a database and joined together and it may be that more sophisticated parametric mod
(concatenative synthesis). els will eventually overtake it.
Modelling the actual speech organs is an attractive Techniques for time and pitch scaling of sounds held
approach, since it can be regarded as being a model of in a database are also extremely important. Two main
This work was snpport.ed by BT, EPSHC and t.he Hoyal techniques for time-scale and pitch modification in con
catenative synthesis can be identified, each of which op-
•

Society.

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
erates on the speech signal in a different manner. The • the vocal folds vibrate in an exactly periodic man
pitch synchronous overlap add (PSOLA) [1 0] approach ner during voiced speech production;
is non-parametric as opposed to the harmonic method,
the configuration of the vocal tract will only change
'which actually decomposes the signal into explicit source
•

slowly;
and vocal tract models. PSOLA is reported to give good
quality, natural-sounding synthetic speech for moderate These imply a loss of information which means that
pitch and time modifications. Sluwing duwn the speech the full speech signal dynamics can never be properly
by a large factor (greater than two) does introduce arti captured. These inadequacies can be seen in practice
facts due to the repetition of PSOLA bells. Some tonal in speech synthesis where, at the waveforrn generation
artifacts ( e.g. whistling) also appear 'with large pitch level, current systems tend to produce an output signal
scaling, especially for higher pitch voices, such as female that lacks naturalness. This is true even of concaten
speakers and children. ation techniques which copy and modify actual speech
McAulay and Quatieri developed a speech genera segments.
tion model that is based on a glottal excitation signal
made up of a sum of sine waves [11]. They then used 4 Poincare ll1aps and epoch ll1arking

this model to perform time-scale and pitch modifica The section discusses how nonlinear techniques can be
tion. Starting with the assumption made in the linear applied to pitch marking of continuous speech. vVe wish
model of speech that the speech waveform x(t) is the to locate the instants in the time domain speech signal at
output generated by passing an excitation waveform e(t) which the glottis is closed. A variety of existing methods
through a linear filter h(t), the excitation is defined as can be employed to locate the epochs. These are Abrupt
a sum of sine waves of arbitrary amplitudes, frequencies change detection [17], Maximum Likelihood epoch de
and phases. A limitation of all these techniques is that tection [18] and Dynamic programming [19]. All of the
they use the linear model of speech as a basis. above techniques are sound and generally provide good
epoch detection. The technique presented here should
3 Nonlinearities in speech
not be viewed as a direct competitor to the methods
There are known to be a number of nonlinear effects outlined above. Rather it is an attempt to show the
in the speech production process. Firstly, it has been practical application of ideas from nonlinear dynamical
accepted for some time that the vocal tract and the vo theory to a real speech processing problem. The per
cal folds do not function independently of each other, formance in clean speech is comparable to many of the
but that there is in fact some form of coupling between techniques discussed above.
them when the glottis is open [12] resulting in signi In nonlinear processing a d-dimensional system can
ficant changes in formant characteristics between open be reconstructed in an m-dimensional state space from
and closed glottis cycles [13]. More controversially, a single dimension time series by a process called em
Teager and Teager [14] have claimed (based on phys bedding. Takens' theorem states that m 2: 2d + 1 for an
ical measurements) that voiced sounds are characterised adequate reconstruction [20], although in practice it is
by highly complex air flows in the vocal tract involving often possible to reduce m. An alternative is the singu
jets and vortices, rather than well behaved laminar flow. lar value decomposition (SVD) embedding [21], which
In addition, the vocal folds will themselves be respons may be more attractive in real systems where noise is
ible for further nonlinear behaviour, since the muscle an Issue.
and cartilage which comprise the larynx have nonlinear A Poincare map is often used in the analysis of dy
stretching qualities. Such nonlinearities are routinely namical systems. It replaces the flow of an n-th or
included in attempts to model the physical process of der continuous system with an (n - 1)-th order discrete
vocal fold vibration, which have focussed on two or more time map. Considering a three dimensional attractor
mass models [2, 3, 15], in which the movement of the vo a Poincare section slices through the flow of trajector
cal folds is modelled by masses connected by springs, ies and the resulting crossings form the Poincare map.
with nonlinear coupling. Observations of the glottal Re-examining the attractor reconstructions of voiced
,vaveforrn have shown that this waveforrn can change speech shown above, it is evident that these three dimen
shape at different amplitudes [16] which would not be sional attractors can also be reduced to two dimensional
possible in a strictly linear system where the waveform maps.I Additionally, these reconstructions are pitch
shape is unaffected by amplitude changes. synchronous, in that one revolution of the attractor is
In order to arrive at the simplified linear model, a equivalent to one pitch period. This has previously
number of major assumptions are made: been used for cyclostationary analysis and synchronisa
tion [22]; here we examine its use for epoch marking.
• the vocal tract and speech source are uncoupled
1 St.rid,]y t.hese aUrad,or rec.onst.nld,ions ar" disc.ret.e t.im" maps
(thus allowing source-filter separation); and not. c.ont.imlOlIs flows. However it. is possih]e t.o c.onst.nld a
flow vedor bet.ween point.s and lise t.his for t.he Poinc.ar� sedion
• airflow through the vocal tract is laminar; calculation.

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
Figure 1: Results for the voiced section of "came along" Figure 2: Results for the voiced section of "raining" from
from the J{eele database for a female speaker. From top the BT Labs database for a male speaker. From top
to bottom: the signal; the epochs as calculated by the al to bottom: the signal; the epochs as calculated by the
gorithm; the laryngograph signal; the pitch contour (Hz) algorithm; the processed laryngograph signal; the pitch
resulting from the algorithm. contour (Hz) resulting from the algorithm.

The basic processing steps required for a waveform of along wrapped in a warm cloak", spoken by a female
N points are as follows: speaker. There is considerable change in the signal, and
hence in the attractor structure, in this example, yet
1. Mark YGCI, a known Gel in the signal.
the epochs are sufficiently well located when compared
2. Perform an SVD embedding on the signal to gener against the laryngograph signal.
ate the attractor reconstruction in 3D state space. In Fig. 2, which is a voiced section from the phrase
"see if it's raining" spoken by a male speaker, the epochs
3. Calculate the flow vector, h, at the marked point are well located for the first part of the signal, but some
YGCT on the attractor.
slight loss of synchronisation can be seen in the latter
4. Detect crossings of the Poincare section, E, at this part.
point in state space by signs changes of the scalar
5 Nonlinear Synthesis Approaches
product between h and the vector Yi - YGCI for all
1 ::; i ::; N points. 5. 1 Neural network synthesis background

5. Points on E 'which are 'within the same portion of Kubin and Birgmeier reported an attempt made to use
the manifold as YGCT are the epochs. a RBF network approach to speech synthesis. They pro
pose the use of a nonlinear oscillator, with no external
"Vhen dealing with real speech signals a number of prac input and global feedback in order to perform the map
tical issues have to be considered. The input signal must pmg
be treated on a frame-by-frame basis, within which the x(n) A(x(n - 1)) (1)
speech is assumed stationary. Finding the correct inter
=

section points on the Poincare section is also a difficult where x(n - 1) is the delay vector with non-unit delays,
task due to the complicated structure of the attractor. and A is the nonlinear mapping function [24].
Two different data sets were used to test the perform The initial approach taken [25] used a Kalman-based
ance of the algorithm, giving varying degrees of realistic RBF network, which has all of the network parameters
speech and hence difficulty. trained by the extended Kalman filter algorithm. The
only parameter that must be specified is the number
1. Keele "Cniversity pitch extraction database [23]. of centres to use. This gives good prediction results,
This database provides speech and laryngograph but there are many problems with resynthesis. In par
data from 15 speakers reading phonetically bal ticular, they report that extensive manual fine-tuning
anced sentences. of the parameters such as dimension, embedding delay
and number and initial positions of the centres are re
2. BT Labs continuous speech. 2 phrases, spoken by 4
quired. Even with this tuning, synthesis of some sounds
speakers, were processed manually to extract a data
with complicated phase space reconstructions does not
set of continuous voiced speech. Laryngograph data
work [24].
was also available.
In order to overcome this problem, Kubin resorted
The signals were up-sampled to 22.05 kHz, the BT data to a technique that uses all of the data points in the
was originally sampled at 12 kHz and the Keele signals training data frame as centres [24]. Although this gives
at 20 kHz. All the signals had 16 bit resolution. correct resynthesis, even allowing the resynthesis of con
Fig. 1 shows the performance of the algorithm on a tinuous speech using a frame-adaptive approach, it is
voiced section taken from the phrase "a traveller came unsatisfactory due to the very large number of varying

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
parameters, and cannot be seen as actually learning the not appear to be crucial to the performance of the net
dynamics of the speech generating system. work. There are two distinct strategies for training an
Folluwing their dynamical analysis of the Japanese RBF network. The most common approach divides the
vuwel /a/, Tokuda et al. constructed a feed-forward problem into two steps. Firstly the centre positions and
neural network to perform synthesis [26]. Their struc bandwidths are fixed using an unsupervised approach,
ture has three layers, 'with five neurons in the input not dependent on the network output. Then the weights
layer, forty neurons in the hidden layer, and one in the are trained in a supervised rnanner so as to rninirnise an
output layer. The time delay in the input delay vec error function.
tor is set at T 3 and the 'weights are learnt by back
= Following from the work of Kubin et al., a nonlinear
propagation. Csing global feedback, they report success oscillator structure is used. The RBF network is used
ful resynthesis of the Japanese vuwel /a/. The signal is to approximate the underlying nonlinear dynamics of
noisy, but preserves natural human speech qualities. Ko a particular stationary voiced sound, by training it to
further results in terms of speech quality or resynthesis perform the prediction
of other vowels are given.
'An alternative neural network approach was pro Xi+! = F(x;) (3)
posed by Karashimhan et aZ. This involves separat
ing the voiced source from the vocal tract contribution, where Xi {Xi, X(i-T)' ..., x(i-(m-1)T)} is a vector of
=

and then creating a nonlinear dynamical model of the previous inputs spaced by some delay T samples, and
source [27]. This is achieved by first inverse filtering F is a nonlinear mapping function. From a nonlin
the speech signal to obtain the linear prediction (LP) ear dynamical theory perspective, this can be viewed
residual. Kext the residue waveform is low-pass filtered as a time delay embedding of the speech signal into an
at 1 kHz, then normalised to give a unit amplitude en m-dimensional state space to produce a state space re
velope. This processed signal is used as the training construction of the original d-dimensional system at
data in a time delay neural network with global feed tractor. The embedding dimension is chosen in accord
back. The KK structure reported is extremely complex, ance with Takens' embedding theorem [20] and the em
consisting of a 30 tap delay line input and two hidden bedding delay, T, is chosen as the first minimum of the
layers of 15 and 1 0 sigmoid activation functions, with average mutual information function [29]. The other
the network training performed using back propagation parameters that must be chosen are the bandwidth, the
through time. Finally, the KK model is used in free number and position of the centres, and the length of
running synthesis mode to recreate the voiced source. training data to be used. "Vith these set, the determ
This is applied to a LP filter in order to synthesise ination of the weights is linear in the parameters and is
speech. They shuw that the KK model successfully pre solved by minimising a sum of squares error function,
serves the jitter of the original excitation signal. Es (F), over the N samples of training data:

N
5.2 RBF network for synthesis I
(4)
2
E.. (F)
A

(x.' - x.·)
2L
A

I. I.
A well known nonlinear modelling approach is the ra
= -

i=1
dial basis function neural network. It is generally com
posed of three layers, made up of an input layer of source where Xi is the network approximation of the actual
nodes, a nonlinear hidden layer and an output layer giv speech signal Xi. Incorporating Equation 2 into the
ing the network response. The hidden layer performs a above and differentiating with respect to the weights,
nonlinear transformation mapping the input space to a then setting the derivative equal to zero gives the least
rIe,'\' space, in which the problem can be better solved. squares problem [30], which can be written 1Il matrix
The output is the result of linearly combining the hidden form as
space, multiplying each hidden layer output by a weight (5)
whose value is determined during the training process.
The general equation of an RBF network with an in where <r> is an N xP matrix of the outputs of the centres;
put vector x and a single output is x is the target vector of length N; and w is the P length
vector of weights. This can be solved by standard matrix
p inversion techniques.
F(x(n)) =
L Wj(1'(llx - Cj II) (2) Two types of centre positioning strategy were con
j=1 sidered:

where there are P hidden units, each of which is l. Data subset. Centres are picked as points from
weighted by Wj. The hidden units, (1'(11x - Cj II), are ra around the state space reconstruction. They are
dially symmetric functions about the point Cj, called a chosen pseudo-randomly, so as to give an approx
centre, in the hidden space, with 11.11 being the Euclidean imately uniform spacing of centres about the state
vector norm [28]. The actual choice of nonlinearity does space reconstruction.

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
2. Hyper-lattice. An alternative, data independent
approach is to spread the centres uniformly over an
m-dimensional hyper-lattice.

5. 3 Synthesis

From analysis, an initial set of parameters 'with 'which to

atternpt resynthesis 'were chosen. The pararneters 'were
set at the folluwing values:
Band,vidth =0.8 for hyper-lattice, 0.5 for data sub
set; Dimension 7; Kumber of centres
= 128; Hyper
= Figure 3: Time domain examples of the vowel lui,
lattice size = l.0; Training length 1000;= speaker MG. Top row: original signal (left) and linear
For each vowel in the database, the weights were prediction synthesised signal (right); Bottom row: RBF
learnt, with the centres either on a 7D hyper-lattice, or network synthesised signal, hyper-lattice (left) and data
chosen as a subset of the training data. The global feed subset (right).
back loop was then put in place to allow free-running
synthesis. The results gave varying degrees of success,
from constant (sometimes zero) outputs, through peri
odic cycles not resembling the original speech signal and
noise-like signals, to extremely large spikes at irregular
intervals on otherwise correct waveforms [31].
These result implied that a large number of the map
ping functions learnt by the network suffered from some
form of instability. This could have been due to a lack
of smoothness in the function, in which case regularisa
tion theory was the ideal solution. Regularisation the
ory applies some prior knowledge, or constraints, to the
Fr�qu�llC} (1:L:)

mapping function to make a well-posed problem [32]. Figure 4: Spectrums for examples of the vowel luI, cor
The selection of an appropriate value for the regu responding to the signals in Figure 3.
larisation parameter, A is done by the use of cross
validation [30]. After choosing all the other network
parameters, these are held constant and A is varied. For
each value of A, the MSE on an unseen validation set ate (F., + 4) [33] was used to set the number of filter
is calculated. The MSE curve should have a minimum taps to 26. Then, using the source-filter model, the
indicating the best value of A for generalisation. "Vith LP filter was excited by a Dirac pulse train to produce
the regularisation parameter chosen by this method, the the desired length LP synthesised signal. The distance
7D resynthesis gave correct results for all of the signals between Dirac pulses was set to be equal to the aver
except KH Iii and KH lui when using the data subset age pitch period of the original signal. In this way, the
method of centre selection. However, only two signals three vowel sounds for each of the four speakers in the
(CA Iii and MC Iii) were correctly resynthesised by the database were synthesised.
hyper-lattice method. It was found that A needed to be Figure 3 shows the time domain waveforms for the ori
increased significantly to ensure correct resynthesis for ginal signal, the LP synthesised signal and the two RBF
all the signals when the hyper-lattice was used. Achiev synthesised signals, for the vowel lui, speaker MC. Fig
ing stable resynthesis inevitably comes at some cost. ure 4 shows the corresponding frequency domain plots
By forcing smoothness onto the approximated function of the signals, and the spectrograms are shown in Fig
there is the risk that some of the finer detail of the state ure 5. In these examples, the regularisation parameter
space reconstruction will be lost. Therefore, for best A was set at 0. 01 for the hyper-lattice, and 0.005 for
results, A should be set at the smallest possible value the data subset. In the linear prediction case, the tech
that allows stable resynthesis. The performance of the nique attempts to model the spectral features of the
regularised RBF network as a nonlinear speech synthes original. Hence the reasonable match seen in the spec
iser is now measured by examining the time and fre trum (although the high frequencies have been over
quency domains, as well as the dynamical properties. emphasised), but the lack of resemblance in the time
In addition to comparing the output of the nonlinear domain. The RBF techniques, on the other hand, re
synthesiser to the original speech signal, the synthetic semble the original in the time domain, since it is from
speech from a traditional linear prediction synthesiser is this that the state space reconstruction is formed, al
also considered. In this case, the LP filter coefficients though the spectral plots show the higher frequencies
were found from the original vowel sound (analogous to have not been well modelled by this method. This is
the training stage of the RBF network) . The estim- because the networks have missed some of the very fine

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
Data type MC CA Average
(male) (female) (female)

.
8000
6000
Hyper-lattice jitter (%) 0.470 1.14 0.697
4000 Data subset jitter (%) 0.482 0.663 0.521
2000 Original jitter (%) 0.690 0.685 0.742

� l:J
"',
o 0.1 0.2 Hyper-lattice shimmer (%) 1.00 1.33 0.922
j';me(,j

REF; H"p<:r lalli"� REF; Dal" ",bsd

Data subset shimmer (%) 0.694 7.65 2.34
: Original shimmer (%) 4.21 7.06 5.17
�8l �: �8 0�:
�' �'"
�, 6000 Table 1: Percentage jitter and shimmer in original and
� 4()()() 14()()(
;.S 2000 , I
2000 synthesised waveforms (hyper-lattice and data subset),
"o , averaged over the vowels Ii/, lal and lui for each
0.1 0.2 o 0.1 0.2
j';me(,j I;me('j
speaker, and as an average over the database.
Figure 5: Wide-band spectrograms for examples of the
vowel lui, corresponding to the signals in Figure 3. vocal tract, see for example, [35,36] for issues regarding
embedding).
Previous studies, discussed above, have successfully
variations of the original time domain waveform, which modelled stationary (i.e. constant pitch) vowel sounds
mav be due to the regularisation. using nonlinear methods, but these have very limited
use since the pitch cannot be modified to include pros
Further spectrogram examples for different vowels
odv information. The new approach described here re
and speakers follow the same pattern, with the size of A
sol�'es this problem by including pitch information in the
being seen to influence the quality of the signal at high
embedding. Specifically, a non-stationary vowel sound
frequencies.
is extracted from a database and, using standard pitch
extraction techniques 1, a pitch contour is calculated for
5.4 Jitter and shiIllIller
the time series so that each time domain sample has an
Jitter and shimmer measurements were made on all associated pitch value. In the present study measure
of the original and RBF synthesised waveforms, us ments of rising pitch vowel sounds, where the pitch rises
ing epoch detection2 over a 500 msec windmil'-. Jit through the length of the time series, have been used as
ter is defined as the variation in length of individual the basis for modelling; see, for example, figure 1.
pitch periods and for normal, healthy speech should be The time series is then embedded in an m-dimensional
between 0.1 and 1 % of the average pitch period [34]. space, along 'with the pitch contour, to form an (m+I)
Table 1 shows the results of the average pitch length dimensional surface. A mixed embedding delay between
variation, expressed as a percentage of the average pitch time samples (greater than unity) is used to capture
period length. Results for both centre placing tech the variable time scales present in the vowel waveform.
niques are presented, with the jitter measurements of The (m+I)-dimensional surface is modelled by a nearest
the original speech data. The hyper-lattice synthesised neighbour approach, which predicts the next time series
waveforms contain more jitter than the data subset sig sample given a vector of previous time samples and a
nals, and both values are reasonable compared to the pitch value (it is envisaged that more sophisticated mod
original. elling techniques will be incorporated at a later date).
Shimmer results (the variations in energy each pitch Svnthesis is then performed by a modification of the
cycle) for the original and synthesised 'waveforms are nonlinear oscillator approach [37], 'whereby the input
also displayed in Table 1. It can be seen that in general signal is removed and the delayed synthesiser output is
there is considerably less shimmer on the synthesised fed back to form the next input sample. In contrast
'waveforms as compared to the original, 'which 'will de to previous techniques, the required pitch contour is
tract from the quality of the synthetic speech. also passed into the model as an external forcing in
put. Our results show that this method allows the vowel
6 Incorporating Pitch into the Nonlinear Syn
sound to be generated correctly for arbitrary specified
thesis Method
pitch contours (within the input range of pitch values),
The approach adopted here is to model the vocal tract as even though the training data is only made up of the
a forced nonlinear oscillator and to embed an observed rising vowel time series and its associated pitch contour.
scalar time-series of a vowel with pitch information into In addition, sounds of arbitrary duration can be read
a higher dimensional space. This embedding, when car ily synthesised by simply running the oscillator for the
ried out correctly, will reconstruct the data onto a higher required length of time. Typical synthesis results are
dimensional surface which embodies the dynamics of the shown. It can be seen that the sinusoidal pitch con
tour of the svnthesised sound is quite different from the
2Using Entropic Laboratory's ESPS Epoch function. rising pitch � rofile of the measured data; the duration

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
RVI RV5
O.5'TTTC-,----,-;-;-rTTTC,,---;-;-;-;-m

'00 :=== Timc(s)

,OO � Timc(s)

50� �
) ---------�
o
Timc(s)
0.32
50�
o
�
J ------------
Timc(s)
0.32

RV4 RV6
O.5',------rr-,,-.-"';-r--,--,

C::d
O,J2
Time(s) Time(s)

"�
()() ............ .. . ..................................................................
g
�

=====:J
p:

50'-I---------------< 50) -i--

- -----------<
O,J2 O.J2
Time(s) Time(s)

Figure 6: Synthesised vowel sounds together with desired Figure 7: Synthesised vowel sounds together with desired
and measured pitch profile and measured pitch profiles

of the synthesised data is also somewhat longer than 8 AcknowledgeD1ents

that of the measured data. The small offset evident The contributions of my colleague lain Mann to this
between desired and synthesised pitch contours is at work are gratefully acknowledged.
tributed to minor calibration error. The initial results
presented here are encouraging. Indeed, perhaps some References
what surprisingly so. Specifically, good synthesis results [1] B. Gabioud, Fundamentals of Speech Synthesis and
are obtained using a simple nearest neighbour embed Speech Recognition, ch. Articulatory Models in Speech
ding model with only sparse data (typically around 1000 Synthesis, pp. 215 - 230. John "Viley & Sons, 1994.
data points embedded in a space of dimension 17, cor [2] K. Ishizaka and J. L. Flanagan, "Synthesis of voiced
responding to a very low density of around only 1.5 data sounds from a two-mass model of the vocal chords,"
points per dimension). A limited measured pitch excit Bell System Technical Journal, vol. 51, pp. 1233 -1268,
ation data: a simple rising pitch profile 'with a small July-August 1972.
number of data points at each specific pitch value. [3] T. Koizumi, S. Taniguchi, and S. Hiromitsu, "Two
mass models of the vocal cords for natural sounding
7 Conclusions
voice synthesis," Journal of the Acoustical Society of
In view of these observations, it seems likely that America, vol. 82, pp. 1179 - 1192, October 1987.

the data-based model of the vowel dynamics possesses [4] G. Fant, Acoustic Theory of Speech Production.

an important degree of structure, perhaps reflecting Mouton, 1960.

physiological considerations, that requires further in [5] J. Markel and A. Gray, Linear Prediction of Speech.
vestigation. It is also clear that whilst encouraging there Berlin: Springer-Verlag, 1976.
is still some way to go in overcoming the limitations of [6] D. H. Klatt, "Software for a cascade/parallel formant
the approach. It is clear that Speech is a nonlinear pro synthesiser," Journal of the Acoustical Society of Amer
cess and that if we are to achive the holy grail of truly ica, vol. 67, pp. 971 - 995, 1980.
natural sounding synthetic speech that this must be ac [7] M. Edgington, A. Lowry, P. Jackson, A. Breen, and
counted for. It is also clear that nonlinear synthesis S. Minnis, "Overview of current text-to-speech tech
techniques offer some potential to achive this although niques: Part II - prosody and speech generation," B T
a great deal of research work remains to be done. Technical Journal, vol. 14, pp. 84 - 99, January 1996.

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.
[8] J. Page and A. Breen, "The laureate text-to-speech [23] F. Plante, G. F. Meyer, and VV. A. Ainsworth, "A pitch
system, architecture and applications," B T Technical extraction reference database," in EUROSPEECH'95,
Journal, voL 14, pp. 57 - 67, January 1996. voL 1, pp. 837 - 840, September 1995.
[9] M. Beutnagel, A. Conkie, J. Schroeter, and A. Syrdal, [24] G. Kubin, "Synthesis and coding of continuous speech
"The at&t next-gen tts system," in Joint "11eeting of with the nonlinear oscillator model," in International
ASA, EAA, and DACA, (Berlin, Germany), March Conference on Acoustics, Speech and Signal Processing,
1999. (Atlanta, Georgia), pp. 267 - 270, May 1996.
[10] E. Moulines and F. Charpentier, "Pitch synchronous [25] M. Birgmeier, Kalman-trained 1Veural1Vetworks for Sig

waveform processing techniques for text-to-speech syn nal Processing Applications.PhD thesis, Technical Uni
thesis using diphones," Speech Communication, voL 9, versity of Vienna, Vienna, 1996.
pp. 453 - 467, 1990. [26] L Tokuda, R. Tokunaga, and K. Aihara, "A simple geo
[11] R. McAulay and T. Quatieri, "Speech ana- metrical structure underlying speech signals of the Ja
lysis/synthesis based on a sinusoidal representation," panese vowel /a/," International Journal of Bifurcation
IEEE Transactions on Audio, Speech and Signal and Chaos, voL 6, no. 1, pp. 149 - 160, 1996.

Processing, voL 34, pp. 744 - 754, August 1986. [27] K. Narashimhan, J. C. Principe, and D. Childers, "Non
[12] T. Koizumi, S. Taniguchi, and S. Hiromitsu, "Glottal linear dynamic modeling of the voiced excitation for im
source-vocal tract interaction," Journal of the Acous proved speech synthesis," in International Conference
on Acoustics, Speech and Signal Processing, (Phoenix,
tical Society of America, voL 78, pp. 1541 - 1547,
November 1985. Arizona), pp. 389 - 392, March 1999.

[13] D. M. Brookes and P. A. Naylor, "Speech production [28] B. Mulgrew, "Applying radial basis functions," IEEE
Signal Processing Magazine, voL 13, pp. 50 - 65, March
modelling with variable glottal reflection coefficient,"
in International Conference on Acoustics, Speech and 1996.
Signal Processing, pp. 671 - 674, 1988. [29] A. M. Fraser and H. Swinney, "Independent coordin
ates for strange attractors from mutual information,"
[14] H. M. Teager and S. M. Teager, "Evidence of nonlin
Physical Review A, voL 33, pp. 1134 - 1140, 1986.
ear sound production mechanisms in the vocal tract,"
in Proceedings of the NA TO Advanced Study Institute [30] C. M. Bishop, Neural Networks for Pattern Recognition.
on Speech Production and Modelling, (Bonas, France), Oxford University Press, 1995.
pp. 241 - 261, July 1989. [31] L Mann, An Investigation of1Vonlinear Speech Synthesis

[15] L Steinecke and H. Herzel, "Bifurcations in an asym and Pitch Modification Techniques. PhD thesis, Uni
metric vocal-fold model," Journal of the Acoustical So versity of Edinburgh, 1999.
ciety of America, voL 97, pp. 1874 - 1884, March 1995. [32] S. Haykin and J. Principe, "Making sense of a com
[16] J. Schoentgen, "Non-linear signal representation and its plex world," IEEE Signal Processing Magazine, voL 15,
application to the modelling of the glottal waveform," pp. 66 - 81, May 1998.
Speech Communication, voL 9, pp. 189 - 201, 1990. [33] L. R. Rabiner and R. \V. Schafer, Digital Processing of
Speech Signals. Prentice-Hall, 1978.
[17] R. J. DiFrancesco and E. Moulines, "Detection of glot
tal closure by jumps in the statistical properties of the [34] J. Schoentgen and R. de Guchteneere, "An algorithm
speech signal," Speech Communication, voL 9, pp. 401 for the measurement of jitter," Speech Communication,
- 418, December 1990. voL 10, pp. 533 - 538, 1991.
[18] Y. M. Cheng and D. O'Shaughnessy, "Automatic [35] J. Stark, D. Broomhead, M. Davies, and J. Huke,
and reliable estimation of glottal closure instant and "Takens embedding theorems for forced and stochastic
period," IEEE Transactions on Audio, Speech and Sig systems," in Proceedings of 2nd World Congress of Non
nal Processing, voL 37, pp. 1805 - 1815, December 1989. linear Analysis, 1996.

[19] D. Talkin, "Voicing epoch determination with dynamic [36] J. Stark, "Delay embeddings for forced systems: De
programming," Journal of the Acoustical Society of terministic forcing," Journal of Nonlinear Science,
America, voL 85, Supplement 1, p. S149, 1989. voL 9, pp. 255-332, 1999.
[20] F. Takens, "Detecting strange attractors in turbulence," [37] H. Haas and G. Kubin, "Multi-band nonlinear oscil
in Proceedings of Symposium on Dynamical Systems and lator model for speech," in 32nd Asilomar Conference
Turbulence (A. Dold and B. Eckmann, eds.), pp. 366 - on Signals, Systems and Computers, voL 1, pp. 338 -

381, Lecture Notes in Mathematics, 1980. 342, 1998.

[21] D. S. Broomhead and G. P. King, 1Vonlinear Phenom

ena and Chaos, ch. On the Qualitative Analysis of Ex
perimental Dynamical Systems, pp. 113 - 144. Bristol:
Adam Hilger, 1986.
[22] G. Kubin, "Poincare sections for speech," in Proceedings
of the 1997 IEEE Workshop on Speech Coding, (Pocono
Manor, USA), September 1997.

Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on May 07,2024 at 20:22:45 UTC from IEEE Xplore. Restrictions apply.

Challenges in Speech Synthesis: David Suendermann, Harald Höge, and Alan Black
No ratings yet
Challenges in Speech Synthesis: David Suendermann, Harald Höge, and Alan Black
15 pages
An Introduction To Speech Recognition B. Plannere
No ratings yet
An Introduction To Speech Recognition B. Plannere
69 pages
Career Choice Influences Guide
100% (5)
Career Choice Influences Guide
2 pages
Tcode Description: 000000sensitivity: Internal Restricted
No ratings yet
Tcode Description: 000000sensitivity: Internal Restricted
7 pages
Method To Study Speech Synthesis
No ratings yet
Method To Study Speech Synthesis
43 pages
Keller 01 Naturalness
No ratings yet
Keller 01 Naturalness
12 pages
Removal of Spectral Discontinuity in ConcatenatedSpeech Waveform
No ratings yet
Removal of Spectral Discontinuity in ConcatenatedSpeech Waveform
5 pages
Synthesis: Models of Speech
No ratings yet
Synthesis: Models of Speech
6 pages
The Development of Pashto Speech Synthesis System
No ratings yet
The Development of Pashto Speech Synthesis System
4 pages
Neural Speech Synthesis
No ratings yet
Neural Speech Synthesis
63 pages
Test 1
No ratings yet
Test 1
77 pages
Articulatory and Formant Synthesis Guide
No ratings yet
Articulatory and Formant Synthesis Guide
49 pages
Formant Synthesis
No ratings yet
Formant Synthesis
21 pages
Discrete Time Processing of Speech Signa
No ratings yet
Discrete Time Processing of Speech Signa
12 pages
A Tutorial On Speech Synthesis Models
No ratings yet
A Tutorial On Speech Synthesis Models
8 pages
Chapter-3: Theory of TTS
No ratings yet
Chapter-3: Theory of TTS
26 pages
Bhaashika: Telugu Tts System: Dr. K.V.N.Sunitha
No ratings yet
Bhaashika: Telugu Tts System: Dr. K.V.N.Sunitha
9 pages
Marathi Speech Synthesis A Review
No ratings yet
Marathi Speech Synthesis A Review
4 pages
Text To Speech Synthesis TTS
No ratings yet
Text To Speech Synthesis TTS
7 pages
Theory and Application of Digital Speech Processing by L. R. Rabiner and R. W. Schafer
No ratings yet
Theory and Application of Digital Speech Processing by L. R. Rabiner and R. W. Schafer
35 pages
Speech Signal Processing
No ratings yet
Speech Signal Processing
41 pages
Unit 2 Sound or Audio System
No ratings yet
Unit 2 Sound or Audio System
29 pages
Arabic Text To Speech Synthesizer
No ratings yet
Arabic Text To Speech Synthesizer
14 pages
Speech Synthesis in Indian Languages: Abstract
No ratings yet
Speech Synthesis in Indian Languages: Abstract
4 pages
Using 5 Ms Segments in Concatenative Speech Synthesis: Toshio Hirai Seiichi Tenpaku
No ratings yet
Using 5 Ms Segments in Concatenative Speech Synthesis: Toshio Hirai Seiichi Tenpaku
6 pages
HG3052 CourseOutline SpeechSynthesisRecognition AY2019-20 SEM1 Update Sep10
No ratings yet
HG3052 CourseOutline SpeechSynthesisRecognition AY2019-20 SEM1 Update Sep10
6 pages
Introduction To Digital Speech Processing
No ratings yet
Introduction To Digital Speech Processing
42 pages
Synopsis
No ratings yet
Synopsis
11 pages
Jarvis Digital Life Assistant IJERTV2IS1237 PDF
No ratings yet
Jarvis Digital Life Assistant IJERTV2IS1237 PDF
6 pages
Festival Hindi Pxc3893287
No ratings yet
Festival Hindi Pxc3893287
6 pages
0401229
No ratings yet
0401229
4 pages
Super Listener: 2. Signal Processing
No ratings yet
Super Listener: 2. Signal Processing
4 pages
Reconocimiento de Voz - MATLAB
No ratings yet
Reconocimiento de Voz - MATLAB
5 pages
The Main Principles of Text-to-Speech Synthesis System: January 2010
No ratings yet
The Main Principles of Text-to-Speech Synthesis System: January 2010
8 pages
2.2 Speech Processing: - Speech Synthesis. - Speech Recognition. - Speech Coding
No ratings yet
2.2 Speech Processing: - Speech Synthesis. - Speech Recognition. - Speech Coding
7 pages
A Dynamical System Model For Generating Fundamental Frequency For Speech Synthesis
No ratings yet
A Dynamical System Model For Generating Fundamental Frequency For Speech Synthesis
15 pages
Phonetics 2
No ratings yet
Phonetics 2
14 pages
Digital Speech Processing
No ratings yet
Digital Speech Processing
7 pages
Speech Synthesis - Christopher Mwololo Fred
No ratings yet
Speech Synthesis - Christopher Mwololo Fred
18 pages
Ann LA2 Project
No ratings yet
Ann LA2 Project
23 pages
The Use of Speech Synthesis in Exploring DifferentThe Use of Speech Synthesis in Exploring Different
No ratings yet
The Use of Speech Synthesis in Exploring DifferentThe Use of Speech Synthesis in Exploring Different
12 pages
DSP Speech Recognition Analysis
No ratings yet
DSP Speech Recognition Analysis
32 pages
Speechsynthesis
No ratings yet
Speechsynthesis
6 pages
Deep Learning Based NLP Techniques
No ratings yet
Deep Learning Based NLP Techniques
7 pages
1709 07552 PDF
No ratings yet
1709 07552 PDF
138 pages
7.0 Speech Signals and Front-End Processing: References: 1. 3.3, 3.4 of Becchetti
No ratings yet
7.0 Speech Signals and Front-End Processing: References: 1. 3.3, 3.4 of Becchetti
50 pages
Text-To-Speech Synthesis Using Concatena
No ratings yet
Text-To-Speech Synthesis Using Concatena
4 pages
Seminar Presentation: Topic: Speech Recognition
No ratings yet
Seminar Presentation: Topic: Speech Recognition
26 pages
Speech Trainer Kit Using Laryngeal Vibrations
No ratings yet
Speech Trainer Kit Using Laryngeal Vibrations
5 pages
Speech Signal Analysis and Coding: Dr. Arun Kumar
No ratings yet
Speech Signal Analysis and Coding: Dr. Arun Kumar
52 pages
Speech Processing
No ratings yet
Speech Processing
11 pages
117105145
No ratings yet
117105145
649 pages
Human Speech Communication
No ratings yet
Human Speech Communication
44 pages
Speech and Audio Processing and Coding
No ratings yet
Speech and Audio Processing and Coding
52 pages
As R Tutorial
No ratings yet
As R Tutorial
16 pages
Detecting AI-Synthesized Speech Using Bispectral Analysis
No ratings yet
Detecting AI-Synthesized Speech Using Bispectral Analysis
6 pages
(Ebook) Text-To-Speech Synthesis by Paul Taylor ISBN 9780521899277, 0521899273 PDF Version
No ratings yet
(Ebook) Text-To-Speech Synthesis by Paul Taylor ISBN 9780521899277, 0521899273 PDF Version
78 pages
Simulation and System Modeling
No ratings yet
Simulation and System Modeling
16 pages
Topic 11 Golf Course Programming For Cultural Practice
No ratings yet
Topic 11 Golf Course Programming For Cultural Practice
26 pages
Soal Ulangan Genap3
No ratings yet
Soal Ulangan Genap3
7 pages
Unit 4
No ratings yet
Unit 4
2 pages
Physical Health Assessment Pre-Sim Prep
No ratings yet
Physical Health Assessment Pre-Sim Prep
3 pages
Excel Practical Assignments
No ratings yet
Excel Practical Assignments
88 pages
PLSQL
100% (1)
PLSQL
195 pages
Promotion Form
No ratings yet
Promotion Form
2 pages
Thinking Through Drawing
No ratings yet
Thinking Through Drawing
35 pages
Ethics and Human Interface
No ratings yet
Ethics and Human Interface
17 pages
White Paper On Green Hydrogen Investment Opportunities in AP
No ratings yet
White Paper On Green Hydrogen Investment Opportunities in AP
52 pages
Chapter 3 - EMT
No ratings yet
Chapter 3 - EMT
44 pages
Company Profile: Board of Directors
No ratings yet
Company Profile: Board of Directors
6 pages
Trivia Quiz: Environment, History, Geography, Sports, Politics
No ratings yet
Trivia Quiz: Environment, History, Geography, Sports, Politics
4 pages
Verrier Elwin, Sarat Chandra Roy - The Agaria (1992, Oxford University Press, USA)
No ratings yet
Verrier Elwin, Sarat Chandra Roy - The Agaria (1992, Oxford University Press, USA)
383 pages
The Cinema of Federico Fellini - 30197: Syllabus
100% (1)
The Cinema of Federico Fellini - 30197: Syllabus
4 pages
Oxford Rooftops 5th - Reinforcement and Extension-26
No ratings yet
Oxford Rooftops 5th - Reinforcement and Extension-26
1 page
Indonesia's Local Gov Innovation
No ratings yet
Indonesia's Local Gov Innovation
14 pages
Exam Center Data
No ratings yet
Exam Center Data
2 pages
The Ultimate 5-Ingredient Cookbook - Fast and Flavorful 5 Ingredients or Less Recipes For Any Skill Leve
100% (1)
The Ultimate 5-Ingredient Cookbook - Fast and Flavorful 5 Ingredients or Less Recipes For Any Skill Leve
105 pages
Chapter - 3
No ratings yet
Chapter - 3
44 pages
Avogadro's Hypothesis on Gas Molecules
No ratings yet
Avogadro's Hypothesis on Gas Molecules
11 pages
Classroom Visits and Observing The Teaching Learning Situation
No ratings yet
Classroom Visits and Observing The Teaching Learning Situation
36 pages
Multicasting in TCP/IP Protocols
No ratings yet
Multicasting in TCP/IP Protocols
48 pages
Methods For Testing Tar and Bituminous Materials - Determination of Specific Gravity
100% (1)
Methods For Testing Tar and Bituminous Materials - Determination of Specific Gravity
10 pages
Foundations of Microeconomics 7 Ed Bade
No ratings yet
Foundations of Microeconomics 7 Ed Bade
307 pages
Social Movements
No ratings yet
Social Movements
8 pages

Nonlinear Speech Synthesis

Uploaded by

Nonlinear Speech Synthesis

Uploaded by

Nonlinear Speech Synthesis *

ABSTRACT the fundamental level of speech production. An accur­

From analysis, an initial set of parameters 'with 'which to

REF; H"p<:r lalli"� REF; Dal" ",bsd

'00 :=== Timc(s)

50'-I---------------< 50) -i--

of the synthesised data is also somewhat longer than 8 AcknowledgeD1ents

an important degree of structure, perhaps reflecting Mouton, 1960.

381, Lecture Notes in Mathematics, 1980. 342, 1998.

[21] D. S. Broomhead and G. P. King, 1Vonlinear Phenom­

You might also like

ABSTRACT the fundamental level of speech production. An accur

[21] D. S. Broomhead and G. P. King, 1Vonlinear Phenom