ROBUST PITCH DETERMINATION USING
NONLINEAR STATE-SPACE EMBEDDING
Dmitry E. Terez
SoundMath Technologies, LLC
6 N. 9th Street, Millville, New Jersey, 08332, USA
[email protected]

ABSTRACT

A robust method for detecting periodicity and measuring fundamental frequency in speech and other signals is proposed. The method is based on concepts originally developed for analyzing chaotic time-series. A signal segment is transformed into a trajectory in m-dimensional state space by using an embedding procedure. Close pairs of points on the trajectory, with the distance between them less than a prescribed neighborhood radius, are found and their time separations are computed. A periodicity histogram for the distribution of computed time separations is characterized by distinct peaks corresponding to the pitch period and its multiples for periodic regions, and by the absence of such peaks for aperiodic regions. The proposed method does not suffer from the limitations of other short-term pitch-estimation techniques. Updated information and demo software can be found at www.soundmathtech.com/pitch.

1. INTRODUCTION

Pitch determination has a long history in speech research. Accurate pitch estimation plays a very important role in speech compression, speech recognition and synthesis, as well as in the musical world. A large number of pitch determination algorithms (PDAs) have been developed to date. Most of them can be loosely classified as time-domain or short-term analysis PDAs [1]. The most popular and reliable techniques in use today (for example, those based on correlation, spectrum or cepstrum) are short-term methods operating on short segments of a signal. None of them, however, has been found fully satisfactory on real speech.

One of the reasons for this deficiency is the linear nature of the signal processing employed by many conventional methods. Human speech production, meanwhile, is a complex nonlinear and non-stationary process. Its complete and most accurate description can only be achieved in terms of nonlinear fluid dynamics (although this kind of description cannot be used directly for building DSP devices). Traditionally, though, it has been described using techniques like the source-filter model and spectral analysis. These techniques work very well for many aspects of speech analysis, but they are inherently limited in their ability to describe the true dynamics of speech production.

Consequently, to study such nonlinear aspects of speech production as the excitation function, it is advantageous to dismiss traditional linear techniques and to use a more general nonlinear approach. Without making too many simplifying assumptions one can state that (voiced) speech is generated by a relatively low-dimensional nonlinear dynamical system. The active degrees of freedom and state variables of this system are not observable directly. The important question, then, is how to recover and describe the underlying low-dimensional dynamics from a single one-dimensional observable, the speech signal.

One of the profound results established in chaos theory is the celebrated Takens’ embedding theorem [2], which states that it is possible to reconstruct a state space topologically equivalent to the original state space of a system from a single observable. Some attempts have already been made to apply nonlinear and chaotic signal analysis methods to speech processing [3]. In particular, it was previously noted that the pitch period can be measured in state space by using Poincaré sections [3]. However, a truly reliable and accurate method for pitch detection using state-space embedding of a signal has not been proposed to date.

This paper introduces a robust and general method for pitch estimation in state space. Theoretical matters are discussed first, followed by an implementation overview and preliminary results.

2. STATE-SPACE EMBEDDING

The evolution of a nonlinear dynamical system can be described by a point (vector) moving along some trajectory in an abstract state space (also called phase space elsewhere), where the coordinates of the point are independent degrees of freedom of the system. An embedding procedure is used to reconstruct a state space from a scalar signal. The most popular embedding procedure is Takens’ method of delays [2, 4]. Vectors x(i) in m-dimensional state space are formed from time-delayed values of a signal s(i):

x(i) = [s(i), s(i-d), s(i-2d), ..., s(i-(m-1)d)],    (1)

where m is the embedding dimension and d is the chosen delay value (in samples). Since speech signals are non-stationary, the embedding procedure is applied to short consecutive segments, or speech frames. Figures 1(b) and 3(b) show examples of time-delay embedding for a vowel and a fricative. It is generally found that voiced speech can be sufficiently embedded in three dimensions, whereas unvoiced speech has a high-dimensional nature [3]. The short-term nature of the proposed method makes determination of the true embedding dimension unnecessary. Good results can be achieved even with m=2, despite the fact that a signal trajectory can have many self-intersections. In our implementation of the method a constant embedding dimension m=3 is used. The number of dimensions can be further increased, but beyond 4 or 5 no noticeable improvement can be observed for most practical purposes. The choice of an optimal delay parameter d depends on the sampling rate and signal properties. The delay should be large enough for a reconstructed trajectory to be maximally “open” in state space on average. On the other hand, it is desirable to keep d relatively small for better time resolution. For each sampling rate we use a constant value of d for all frames.
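The method of delays in (1) can be illustrated with a short Python sketch. It is not taken from the PDA described here: the function name delay_embed and the use of NumPy are assumptions made for illustration, and the defaults m=3 and d=12 simply mirror the values quoted for the 16 kHz examples in Figs. 1 and 3.

```python
import numpy as np

def delay_embed(s, m=3, d=12):
    """Time-delay embedding following Eq. (1) (illustrative helper, name is hypothetical).

    Returns an array of shape (M, m) whose row for time index i is
    [s[i], s[i-d], ..., s[i-(m-1)*d]], with rows in time order.
    """
    s = np.asarray(s, dtype=float)
    offset = (m - 1) * d              # first index i that has a full delay history
    M = len(s) - offset               # number of reconstructed state-space points
    if M <= 0:
        raise ValueError("frame too short for the requested m and d")
    # Column j holds s(i - j*d) for i = offset .. len(s)-1.
    return np.column_stack(
        [s[offset - j * d : len(s) - j * d] for j in range(m)]
    )
```

With these defaults a 200-sample frame yields M = 200 - 24 = 176 points in three dimensions. Rescaling a frame to fit a unit cube, as mentioned later in connection with the neighborhood radius, is left out of this sketch.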
Figure 1. (a) Speech frame of the sustained vowel // (female, 16 kHz) and the 3-D trajectories reconstructed using (b) time-delay embedding (d=12 samples) and (c) SVD-embedding (SVD-window of 30 samples).

Figure 2. (a) Space-time separation plot, (b) periodicity histogram, (c) normalized periodicity histogram and (d) normalized unbiased autocorrelation function for the vowel // from Fig. 1 (time-delay embedding, d=12). Axes: spatial distance r versus temporal separation in samples, Δt, in (a); number of distances in (b). Panel (a) is annotated "577 distances".

Figure 3. (a) Speech frame of the fricative /S/ (female, 16 kHz) and the 3-D trajectories reconstructed using (b) time-delay embedding (d=12 samples) and (c) SVD-embedding (SVD-window of 30 samples).

Figure 4. (a) Space-time separation plot, (b) periodicity histogram, (c) normalized periodicity histogram and (d) normalized unbiased autocorrelation function for the fricative /S/ from Fig. 3 (time-delay embedding, d=12). Axes: spatial distance r versus temporal separation in samples, Δt, in (a); number of distances in (b). Panel (a) is annotated "372 distances".
Time-delay embedding is the method of choice for our PDA implementation. It is possible, however, to use other embedding techniques, as long as they preserve topological properties of the original state space of a system. One particular alternative embedding technique, implemented and tested with our PDA, is singular value decomposition (SVD) embedding introduced in [5] (Figs. 1c and 3c). SVD-embedding has some advantages over time-delay embedding due to its smoothing capabilities, leading to improved results on some types of signals (e.g. voiced fricatives, noisy speech). However, in most cases smoothing can also be achieved by simply performing moderate low-pass filtering of a signal before embedding it. Overall, SVD-embedding can be a useful alternative to time-delay embedding, but its computational cost makes it less practical for real-time implementation.

3. PERIODICITY HISTOGRAM

Each pair of points on the reconstructed trajectory is separated in state space by some distance r and in time by some Δt (in number of samples). The Euclidean spatial distance measure (or its square) is preferred, but other reasonably defined norms are also possible. This can be visualized by making a scatter plot of r versus Δt for each possible pair of points. Thus, we arrive at the space-time separation plot introduced in [6] to visualize the properties of chaotic time-series. Figures 2(a) and 4(a) show typical scatter plots for a vowel and a fricative (only the lower parts of the entire plots are shown). One can see from Fig. 2(a) that for a steady periodic vowel, data points with small r tend to concentrate around time separation values Δt corresponding to the fundamental pitch period and its integer multiples, whereas for an aperiodic fricative in Fig. 4(a) they are randomly distributed along the Δt axis.

One can choose some neighborhood radius r in state space and find all pairs of points on the trajectory with the distance between points less than r. This can be illustrated by dissecting a space-time separation plot with a horizontal line and selecting all data points (spatio-temporal distances) below this line, as shown in Figures 2(a) and 4(a). For each found pair of points the time separation between the points, in number of samples, is calculated. A periodicity histogram is then computed, where each bin accumulates the total number of found pairs whose time separation equals the bin index. For a sequence of M points (vectors) x_i (i=1…M) in m-dimensional state space, the periodicity histogram can be formally defined as

\[ \mathrm{hist}(k, r) = \sum_{i=1}^{M-k} H\left(r - |x_i - x_{i+k}|\right), \tag{2} \]

where k is a bin index (k=0…(M-1)), r is a chosen neighborhood radius, |x_i - x_{i+k}| is the Euclidean distance in state space and H is the Heaviside function. This form of histogram definition was used in [7] for qualitative analysis of chaotic time-series.

The periodicity histograms computed in this way for a vowel and a fricative are shown in Figures 2(b) and 4(b). One can see two sharp peaks in Figure 2(b) at the positions corresponding to the fundamental pitch period and its doubled value.

Since the summation interval in (2) shrinks linearly with increasing k, the histogram has a bias: the upper bound is not the same for all bins and is a linearly decaying function of k, as shown by the slanting lines in Figs. 2(b) and 4(b). In order to remove this bias, each bin can be normalized with respect to its upper bound, to obtain a normalized periodicity histogram:

\[ \mathrm{nhist}(k, r) = \frac{1}{M-k} \sum_{i=1}^{M-k} H\left(r - |x_i - x_{i+k}|\right). \tag{3} \]

Normalized histograms, corresponding to the histograms in Figs. 2(b) and 4(b), are shown in Figs. 2(c) and 4(c).

One can observe some analogy with the conventional definitions of the biased and unbiased auto-correlation function [8]. Similar to the unbiased auto-correlation, the normalized periodicity histogram has a large variance at larger bin indices k approaching M, making those higher bins statistically less reliable when searching for peaks. In practice, a search range is prescribed, which excludes the regions close to both edges of a histogram.

A characteristic property of the periodicity histogram is its dependence on the chosen neighborhood radius r. The magnitudes of histogram peaks are directly affected by the choice of r. A constant value of r for all frames (embedded and normalized to fit into a unit cube in state space) can be chosen (e.g. r=0.15) and shows good results on average. However, optimal accuracy and resolution for each frame cannot be achieved with constant r. It is advantageous, therefore, to choose a neighborhood radius r for each frame independently, in order to make the main histogram peaks more pronounced and easy to select. As a rule, r should be kept relatively small for clean and steady periodic signals (such as the one in Fig. 1a). For such signals the main peaks saturate very quickly at the upper bound when the value of r is increased. Further increasing r can lead to widening of the main peaks and, consequently, to loss of accuracy. On the other hand, it is desirable to increase the radius r for transitional and noisy voiced segments, in order to have statistically reliable histogram peaks. To this end, we have developed a simple and efficient adaptive procedure that can choose an appropriate value of r for each frame. The procedure works iteratively by checking the highest peak’s magnitude, adjusting r and re-computing the histogram for the new value of r. After several iterations the highest peak is brought either to the prescribed magnitude range (e.g. 0.8-0.95) or to the magnitude attained with the maximal allowed neighborhood radius r.

It is interesting to note that the present method emphasizes local properties in state space, as opposed to the global nature of the correlation function. The advantage of our method over correlation-based techniques is evident from Figs. 2(c) and 2(d).

4. IMPLEMENTATION

Being a short-term PDA in nature, our implementation of the method includes the three usual stages [9]: (a) signal pre-processing, (b) generation of pitch period candidates and (c) post-processing.

The basic method works well on raw speech waveforms and does not explicitly require any signal pre-processing. However, some signal pre-conditioning, like moderate low-pass filtering, can improve the quality of results in many cases.

Pitch candidates are selected from a normalized periodicity histogram computed with an appropriate value of r for each speech frame. The magnitude of the largest peak between the low and high pitch search bounds is determined first. Then, all local peaks in the valid search range with magnitudes exceeding some prescribed fraction (e.g. 50 %) of the largest peak are found and their positions are stored as frame pitch period candidates. Some optional smoothing can be applied to a histogram before searching for local peaks.

Pitch candidates obtained as described above for steady periodic speech frames (e.g. Fig. 1a) usually include only the true pitch period and its integer multiples. Selecting the lowest multiple can give a reliable local pitch estimate for such frames. However, due to the nature of the problem, it is still necessary to analyze more than one consecutive frame, in order to obtain smooth pitch tracks and correctly detect voicing state transitions.

As with other short-term pitch-determination methods, there are different possible approaches to accomplish post-processing or pitch-tracking in conjunction with our method, ranging from simple median filtering to sophisticated dynamic programming procedures. For our PDA implementation we have developed an algorithm based on dynamic programming and utilizing the properties of the periodicity histogram. The algorithm performs simultaneous pitch and voicing state determination. Details of the algorithm will be described elsewhere. A variant of the pitch-tracking procedure with a fixed latency time of one or two frames was also implemented for use with a low-bit-rate vocoder.

5. COMPUTATIONAL EFFICIENCY

The proposed method requires finding close pairs of points in a set of M points in m-dimensional space, where M is proportional to the sampling rate. For relatively small M this can be accomplished in a straightforward way by computing M^2/2 distances between all possible pairs of points. In this case, computation of (squared) Euclidean distances is the most expensive part of the procedure. At higher sampling rates it becomes beneficial to use more sophisticated methods to avoid explicit computation of distances.
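The straightforward all-pairs computation can be sketched in Python as follows. This is a minimal illustration of definitions (2) and (3) under stated assumptions, not the implementation evaluated in this paper: the function name periodicity_histograms and the use of NumPy are hypothetical, indices are 0-based, and the Heaviside function is taken as a strict "distance less than r" test, which matches the verbal description of close pairs.

```python
import numpy as np

def periodicity_histograms(X, r):
    """Brute-force periodicity histogram (illustrative sketch, name is hypothetical).

    X : (M, m) array of state-space points in time order.
    r : neighborhood radius.
    Returns (hist, nhist), where hist[k] counts pairs (i, i+k) closer than r
    and nhist[k] = hist[k] / (M - k).
    """
    X = np.asarray(X, dtype=float)
    M = X.shape[0]
    # Squared Euclidean distances between all pairs of points, O(M^2) memory and time.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    close = d2 < r * r                      # compare squared values, no square roots
    # Bin k is the number of close pairs on the k-th superdiagonal (time separation k).
    hist = np.array(
        [np.count_nonzero(np.diagonal(close, offset=k)) for k in range(M)]
    )
    nhist = hist / (M - np.arange(M))       # divide by the upper bound M - k
    return hist, nhist
```

For an exactly periodic trajectory the normalized bins at the period and its multiples reach 1, while bin 0 is always 1 because every point is at zero distance from itself. The full M-by-M distance table is affordable only for small M, which is the limitation that motivates the faster neighbor-searching methods mentioned in this section.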
Finding nearest neighbors in m-dimensional space is a task frequently encountered in chaotic time-series analysis [10] and is a well-studied subject in computational geometry. A number of fast neighbor-searching algorithms have been developed. A fast box-assisted algorithm for finding close pairs, somewhat similar to the one described in [10], was implemented as an optional experimental feature for our PDA. Its initial evaluation shows that noticeable speed-ups are possible at higher sampling rates on clean and steady periodic signals, as long as the size of the boxes can be kept relatively small.

Another possible approach to reducing computational cost is to make an initial crude estimate with a down-sampled version of a signal, then compute a histogram at the original sampling rate, but only in the vicinity of prominent peaks. This technique is routinely used with correlation-based PDAs [9], but it is also directly applicable to the present method.

6. RESULTS AND DISCUSSION

A formal evaluation of our PDA on a large speech database is currently under way and will be reported elsewhere. In the preliminary evaluation, it was tested on clean speech samples from TIMIT, as well as on some noisy and band-limited speech and artificially generated signals. Particular attention was paid to the percentage of gross pitch determination errors, mostly represented by frames in transitional regions incorrectly classified as voiced or unvoiced and, occasionally, pitch-doubling errors. This required visual inspection of problematic regions after computation was done. Ten male-produced and ten female-produced sentences by 10 different speakers from the TIMIT database were used for evaluation. The results are very encouraging: after some tuning of the parameters the gross error rate was reduced to about 5 % for female and 6 % for male speech samples. Some modifications to the method were also implemented which allow achieving sub-sample resolution.

In summary, the proposed method appears to overcome some serious limitations of other short-term PDAs relying on computing correlation, spectrum or cepstrum. The following combination of properties distinguishes our method from other short-term techniques:

- The signal under analysis can be of arbitrary complexity: the method is not sensitive to formant structure.
- On clean periodic signals reliable results can be obtained with a frame size a little larger than one pitch period, the best time resolution possible with time-domain methods.
- The method shows robust performance on noisy and band-limited speech signals.

It is worth noting that the more reliable pitch candidates obtained with the present method (as compared to, for example, correlation-based PDAs) account for lower average latency times in the dynamic programming procedure.

Figure 5 shows some typical output of our PDA obtained with a fixed frame size of 200 samples (time step of 100) on clean female speech (with no smoothing of the pitch contour).

Figure 5. (top) Speech waveform of “untimely” (TIMIT, female, 16 kHz) and the output pitch contour (bottom). Axes: samples (horizontal); pitch period (vertical, bottom panel).

Current efforts are concentrated on optimizing parameters of the dynamic programming algorithm and tuning its performance on a large speech database, in order to further reduce the gross error rate. The low-delay version of our PDA is currently being incorporated into an improved low-bit-rate vocoder.

7. CONCLUSIONS

Methodologies originally developed for analyzing chaotic time-series have been successfully applied to the pitch determination problem. The proposed new method does not suffer from the limitations of other short-term pitch-estimation techniques. The method has been implemented in a computationally efficient PDA, and the preliminary evaluation results show its robust performance on real speech.

8. REFERENCES

[1] Hess, W., “Pitch and voicing determination”, in Advances in speech signal processing, eds. M. M. Sondhi and S. Furui, Marcel Dekker, New York, 1992.
[2] Takens, F., “Detecting strange attractors in turbulence”, in Lecture Notes in Mathematics, Vol. 898, eds. D.A.Rand and L.S.Young, Springer, Berlin, 1981.
[3] Kubin, G., “Nonlinear Processing of Speech”, in Speech Coding and Synthesis, Elsevier, 1995.
[4] Kantz, H., and Schreiber, T., Nonlinear Time Series Analysis, Cambridge University Press, 1998.
[5] Broomhead, D. S., and King, G., “Extracting qualitative dynamics from experimental data”, Physica D 20, 1986.
[6] Provenzale, A. et al., “Distinguishing between low-dimensional dynamics and randomness in measured time series”, Physica D 58, 1992.
[7] Gilmore, C.G., “A new test for chaos”, Journal of Economic Behavior and Organization, v. 22, Elsevier, 1993.
[8] Bendat, J.S. and Piersol, A.G., Random Data: Analysis and Measurement Procedures, Wiley & Sons, NY, 1971.
[9] Talkin, D., “A robust algorithm for pitch tracking (RAPT)”, in Speech Coding and Synthesis, Elsevier, 1995.
[10] Schreiber, T., “Efficient neighbor searching in nonlinear time series analysis”, Int. J. Bifurcation and Chaos, 5, 1995.