Lip Pronunciation
1 Introduction
Lip-reading and bimodal recognition research is following the same trend as the earlier studies on the transmission of linguistic information through the auditory channel. The first stage focused on visual intelligibility tests, i.e. on the quantification of the information transmitted by the visual channel. In the second stage, research proceeds with the identification of the characteristics of the signal which transmit the information. To that purpose, various devices capturing and recording distal and proximal signals have to be designed, built and tuned, and various techniques for the creation of synthetic stimuli for experimental tests have to be developed. Adequate theories of the visual perception of articulatory movements and of the bimodal perception of speech can be elaborated only on the basis of a great amount of experimental data, sufficient to capture the complexity of the whole phenomenon and, possibly, cross-linguistic in nature, so that the fundamental mechanisms can be separated from more language-specific characteristics. The experimental data presented in the following are intended to contribute to this second stage of the research (Magno Caldognetto et al., 1995). In fact, they constitute the natural development of previous studies carried out at CSRF on auditory (Magno Caldognetto and Vagges, 1990a, 1990b) and visual (Magno Caldognetto, Vagges and Ferrero, 1980) intelligibility tests, which enabled us to quantify and verify the characteristics of the phonological information transmitted separately by each channel. As illustrated in
Figure 1, the intelligibility of visible articulatory movements, as expected and in parallel with other languages, is high only for bilabial (/p/, /b/) and labiodental (/f/, /v/) consonants, while correct identifications gradually decrease from anterior to posterior loci of articulation. As for the manner of articulation, the visual identification of nasals and of all voiced consonants is particularly difficult, since neither the movements of the velum nor those of the vocal folds are visible. On the contrary, in auditory intelligibility tests under various noise masking conditions (Magno Caldognetto, Ferrero and Vagges, 1982; Magno Caldognetto, Vagges and Ferrero, 1980, 1988), nasals, liquids (laterals and trills) and all sonorant consonants are well identified. The analysis of identification errors enabled us to build the two dendrograms illustrated in Figure 2. The groups of consonants confused visually (visemes) tend to share the locus of articulation, while the clusters obtained in the auditory identification tests correspond to unvoiced, voiced and sonorant consonants, which are characterized by different spectral patterns implying different manners of articulation and different activity of the vocal folds. These results are similar to those obtained for other languages by Summerfield (1987) or Mohamadi and Benoit (1992), and support the idea of an auditory-visual bimodal synergism relevant to the development of theories of language acquisition by normal and pathological infants and of speech communication, and to various technological applications, such as audio-visual speech synthesis (Benoit et al., 1992; Cohen and Massaro, 1990) or audio-visual speech recognition systems (Petajan, 1984).
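The dendrogram construction mentioned above, grouping consonants by how often they are confused with one another, can be sketched in a few lines. The confusion matrix below is invented for illustration (it is not the data behind Figure 2), and a greedy agglomerative merge stands in for a full hierarchical clustering:

```python
# Sketch: deriving consonant clusters ("visemes") from a confusion matrix
# by greedy agglomerative clustering. The matrix is a toy example, NOT the
# paper's data: rows are the intended consonants, columns the responses.

consonants = ["p", "b", "f", "v", "t", "d"]
# confusion[i][j] = how often consonant i was identified as consonant j
confusion = [
    [40, 35,  2,  2,  1,  0],
    [34, 41,  1,  3,  0,  1],
    [ 2,  1, 38, 37,  1,  1],
    [ 1,  2, 36, 39,  1,  1],
    [ 1,  0,  1,  1, 42, 35],
    [ 0,  1,  1,  1, 36, 41],
]

def similarity(a, b):
    """Average mutual confusion between two clusters (lists of indices)."""
    vals = [(confusion[i][j] + confusion[j][i]) / 2.0
            for i in a for j in b]
    return sum(vals) / len(vals)

def cluster(k):
    """Merge the most mutually confused clusters until only k remain."""
    groups = [[i] for i in range(len(consonants))]
    while len(groups) > k:
        a, b = max(((a, b) for a in range(len(groups))
                           for b in range(a + 1, len(groups))),
                   key=lambda ab: similarity(groups[ab[0]], groups[ab[1]]))
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return [[consonants[i] for i in g] for g in groups]

print(cluster(3))  # the bilabial, labiodental and alveolar pairs merge
```

With the toy matrix, /p b/, /f v/ and /t d/ end up in separate clusters, mirroring the place-of-articulation groupings of the visual dendrogram.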
Figure 1. Results of the intelligibility tests for Italian consonants: (a) loci of articulation, (b) manners
of articulation, (c) voiced/unvoiced opposition.
Figure 2. Cluster analysis of (a) visual (Magno Caldognetto, Vagges and Ferrero, 1980) and (b)
auditory (Magno Caldognetto and Vagges, 1990a, 1990b) confusions among Italian consonants.
The subjects are placed in the field of view of two CCD TV cameras at 1.5 meters from them. The markers are lit by an infrared stroboscope, invisible to the subject in order to avoid any disturbance. ELITE is characterized by a two-level architecture. The first level includes an interface to the environment and a fast processor for shape recognition (FPSR). The outputs of the TV cameras are sent at a frame rate of 100 Hz to the FPSR, which performs marker recognition based on a cross-correlation algorithm implemented in real time by pipelined parallel hardware. This algorithm allows the system to be used even in adverse lighting conditions, since it can discriminate the markers from reflections of different shapes, even brighter ones. Furthermore, since several pixels are recognized for each marker, the cross-correlation algorithm allows the computation of the weighted center of mass, increasing the accuracy of the system up to 0.1 mm over a 28 cm field of view. The coordinates of the recognized markers
are sent to the second level which is constituted by a general purpose personal
computer. This level provides for 3D coordinate reconstruction, starting from the
2D perspective projections, by means of a stereophotogrammetric procedure which
allows a free positioning of the TV cameras. The 3D data coordinates are then
used to evaluate the parameters described hereinafter. ELITE also records the acoustic signal simultaneously with the articulatory signals. In this study the markers were placed, as illustrated in Figure 4, on the central points of the vermilion border of the upper and lower lips, at the corners of the lips, and at the center of the chin. Markers placed on the tip of the nose and on the ear lobes served as reference points to eliminate the effects of head movement. Table 1 lists the most relevant articulatory movements and parameters which can be analyzed using ELITE. The axes are directly related to phonological features: x corresponds to lip-rounding, y to lip-protrusion and z to lip-opening.
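The sub-pixel accuracy mentioned above comes from the intensity-weighted center of mass of the pixels flagged for each marker. A minimal sketch on a toy pixel patch (this is an illustration of the idea, not ELITE's actual code):

```python
# Sketch of sub-pixel marker localization: once marker pixels are identified,
# the intensity-weighted centroid gives coordinates well below one-pixel
# resolution. The 2-D patch below is a made-up marker blob.

def center_of_mass(patch):
    """Intensity-weighted centroid (row, col) of a 2-D pixel patch."""
    total = sum(v for row in patch for v in row)
    r = sum(i * v for i, row in enumerate(patch) for v in row) / total
    c = sum(j * v for row in patch for j, v in enumerate(row)) / total
    return r, c

# A blob whose true center falls between pixel columns 1 and 2:
patch = [
    [0, 10, 10, 0],
    [0, 30, 30, 0],
    [0, 10, 10, 0],
]
print(center_of_mass(patch))  # (1.0, 1.5): sub-pixel along the column axis
```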
Figure 4. Position of the reflecting markers and of the reference planes.
Table 1. Articulatory parameters measured by ELITE.

abbreviation   meaning                            definition
ULz            upper lip vertical movement        d(m2, Ω)
LLz            lower lip vertical movement        d(m3, Ω)
ULy            upper lip frontal movement         d(m2, ∆)
LLy            lower lip frontal movement         d(m3, ∆)
RCx            right corner horizontal movement   d(m4, Σ)
LCx            left corner horizontal movement    d(m5, Σ)
Jz             jaw vertical movement              d(m6, Ω)
velocities     ∂p/∂t
accelerations  ∂²p/∂t²
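The definitions in Table 1 reduce to two operations: the distance of a marker from a reference plane, and numerical differentiation of the resulting 100 Hz trajectory. A minimal sketch, with made-up plane and marker values:

```python
# Sketch of the Table 1 parameter definitions: each parameter is the distance
# of a marker from a reference plane (Ω, ∆ or Σ), and velocities follow by
# differentiating the 100 Hz trajectory. All numbers are illustrative.

def plane_distance(m, p0, n):
    """Signed distance of 3-D point m from the plane through p0 with unit normal n."""
    return sum((mi - pi) * ni for mi, pi, ni in zip(m, p0, n))

def central_velocity(x, dt):
    """Central-difference velocity of a sampled trajectory x (len >= 3)."""
    return [(x[i + 1] - x[i - 1]) / (2 * dt) for i in range(1, len(x) - 1)]

# Transversal plane Ω through the origin with normal along z:
omega_p0, omega_n = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)

# Hypothetical lower-lip marker m3 trajectory sampled at 100 Hz:
m3_frames = [(0.0, 0.0, 20.0), (0.0, 0.0, 17.0), (0.0, 0.0, 12.0), (0.0, 0.0, 9.0)]
LLz = [plane_distance(m, omega_p0, omega_n) for m in m3_frames]  # mm
LLz_vel = central_velocity(LLz, dt=0.01)                         # mm/s

print(LLz)      # [20.0, 17.0, 12.0, 9.0]
print(LLz_vel)  # both values ≈ -400 mm/s (steady closing)
```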
2.1 Vowels
Figure 5. Definition of the target points which characterize the spatial properties of vowels and consonants, for two articulatory parameters, Jz and LOH. Points 1 and 3 refer to vowels, while point 2 refers to consonants.
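The target-point definition of Figure 5 can be sketched directly: in a V1CV2 utterance the vowel targets (points 1 and 3) are the two opening maxima of a parameter such as LOH, and the consonant target (point 2) is the minimum between them. The contour below is a made-up LOH trace, not measured data:

```python
# Sketch of target-point extraction for a V1CV2 item (cf. Figure 5).
# The LOH contour is invented for illustration.

def vcv_targets(x):
    """Return (i_v1, i_c, i_v2): flanking opening maxima and the minimum between them."""
    i_c = min(range(1, len(x) - 1), key=lambda i: x[i])     # consonant: interior minimum
    i_v1 = max(range(0, i_c), key=lambda i: x[i])           # V1: maximum before the constriction
    i_v2 = max(range(i_c + 1, len(x)), key=lambda i: x[i])  # V2: maximum after it
    return i_v1, i_c, i_v2

loh = [22, 30, 34, 25, 12, 4, 11, 24, 28, 21]   # hypothetical LOH (mm), /'apa/-like
v1, c, v2 = vcv_targets(loh)
print(v1, c, v2)                   # 2 5 8
print(loh[v1], loh[c], loh[v2])    # 34 4 28
```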
Figure 6. Hierarchical clustering of the stressed and unstressed vowels with respect to five articulatory
parameters.
Moreover, the extension of its movement always shows greater values than Jz (cf. Table 2). It is clear that the lips not only move in synergy with the jaw, but also in an independent, specific manner for rounding and protrusion movements. LOW divides both stressed and unstressed vowels into two groups: rounded vowels and unrounded vowels (see Fig. 6). As for the two protrusions, LLy is the parameter that best distinguishes both stressed and unstressed vowels. As shown in Fig. 6, stressed and unstressed vowels are divided into four groups, i.e., two degrees of protrusion and two degrees of retraction. In particular, for stressed vowels, a higher degree of protrusion characterizes /u/ and /o/ with respect to /ɔ/, while /a/ and /ɛ/ are more retracted than /i/ and /e/ (see Table 2). Based on the parameters (jaw opening, lower lip protrusion and lip width) that distinguished the vowels most significantly, a three-dimensional representation of the stressed and unstressed vowel space was plotted in Figure 7. The data confirm the cooccurrence of rounding and protrusion for the vowels. In fact, all the vowels with positive values of lip width, /i, e, ɛ, a/, also have negative values for both upper and lower lip protrusion; that is, unrounded vowels are always also non-protruded. Similarly, vowels with negative lip width values, i.e. the rounded vowels /ɔ, o, u/, are characterized by positive values of upper and lower lip protrusion; that is, they are also protruded. Jaw opening and lower lip protrusion are the parameters that best distinguish the vowels. It should be noted that differences in jaw opening with respect to lip height may be due to the marker placed on the chin: the position of this marker was influenced not only by jaw opening but also by the movement of the skin, especially during lip protrusion.
Based on the values of the parameters analyzed, the reduction of the unstressed with respect to the stressed vowels was confirmed. Moreover, the unstressed mid vowels are more similar to the stressed mid-high /e/ and /o/ than to the mid-low stressed /ɛ/ and /ɔ/.
Figure 7. 3D representation of the (a) stressed and (b) unstressed vowel space.
2.2 Consonants
Table 3. Normalized mean values (mm) pooled over subjects (4) and repetitions (5), for each articulatory parameter and each consonant. Parameters of the isolated cardinal vowels are also given for reference.
As for opening (LOH), the Table shows that the labial opening values increase from bilabial to mediopalatal consonants, that is, following the degree of tongue retraction. It should be underlined that the degree of constriction of the vowels is always greater than that of the consonants, and that there are negative values for the bilabials /p/ and /b/: when a bilabial plosive is produced, the lips not only come into contact but also undergo a certain degree of compression. Moreover, the pairs of homorganic voiced/unvoiced consonants are not distinguished by different values of LOH. As for rounding (LOW), all the examined consonants were spread compared to the rest position, in particular /f/, /t/ and /d/, whose values are similar to those obtained, for the same parameter, for the vowel /i/. As for protrusion (LLy), spread consonants are also retracted, while only /S/ is characterized by a certain degree of protrusion; consequently it can be considered a labialized palatal fricative. It is worth noticing that only a 3D description of consonantal targets makes it possible to distinguish vowels from consonants and consonants among themselves. In the production of vowels, rounding (i.e. negative values of LOW) and protrusion (i.e. positive values of LLy) are concurrent, whereas in the production of consonants they can be independent; in fact, /S/ is protruded but not retracted. Even for consonants similar in one feature, such as LOH for the alveolars /t/, /d/ and /s/, differentiation is ensured by the two remaining features, LOW and LLy.
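The point that consonantal targets become separable only when all three labial dimensions are considered can be sketched as a distance computation in (LOH, LOW, LLy) space. The target triples below are invented for illustration, not the values of Table 3:

```python
# Sketch: separability of consonantal targets in the full 3-D labial space
# (LOH, LOW, LLy) versus a single dimension. Values are hypothetical.
import math

targets = {                     # (LOH, LOW, LLy) in mm, made-up
    "t": (6.0, -4.0, -2.0),
    "d": (6.0, -3.5, -1.5),     # nearly identical to /t/ in LOH alone
    "S": (7.0, -1.0,  2.5),     # protruded: positive LLy sets /S/ apart
}

def dist(a, b, dims):
    """Euclidean distance restricted to the given coordinate indices."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))

# On LOH alone /t/ and /S/ differ by only 1 mm; in the full 3-D space the
# rounding and protrusion components make the separation much larger.
print(dist(targets["t"], targets["S"], [0]))        # 1.0
print(dist(targets["t"], targets["S"], [0, 1, 2]))  # 5.5
```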
2.2.2 Spatial coarticulatory effects
Consonantal spatial targets are subject to variation depending on the flanking vowel context. In order to evaluate the relevance of these variations, the effects of symmetrical vocalic contexts, constituted by the three cardinal vowels /a/, /i/ and /u/, on the labial movements of two different plosive consonants were examined: the bilabial unvoiced plosive /p/, in which the lips constitute the primary articulator, and the apicodental unvoiced plosive /t/, where, on the contrary, the primary articulator is the tip of the tongue and the lips move due to coarticulatory effects. As illustrated in Figure 9 for /’VpV/ non-sense stimuli, flanking vowels determine different visual shapes for /p/. Considering all three parameters, the context /u/ distinguishes itself from the contexts /i/ and /a/, which are more similar to each other. In particular, a reduced compression (LOH) is evident for /p/ in the context /u/, probably due to the presence of protrusion. Since /p/ is the consonant showing the greatest degree of labial constriction, a more significant difference between the data relating to the three contextual vowels and the consonantal target can be observed. Considering all three labial dimensions, relevant coarticulatory variations are noticeable also for /t/ in /’VtV/ non-sense stimuli (see Fig. 9). In the contexts /i/ and /a/ the degree of labial constriction for the consonant /t/ is evident and always lower than that of the corresponding stressed and unstressed flanking vowels. As regards the LOW parameter, the values for /t/ show a spreading effect in relation to each flanking vowel; the contexts /i/ and /a/ determine, however, greater LOW values than the context /u/. Also for protrusion, in the contexts /i/ and /a/ there is a clear effect of lip retraction for the consonantal targets, while the context /u/ determines protrusion values similar to those of the vowels. Future research should obviously be devoted to studying the coarticulatory effects of consonants on vowels and of asymmetric vowel contexts on consonants (for a review, see Farnetani, 1995).
Figure 9. Coarticulation effect of flanking vowels on the target consonants (/’VpV/, /’VtV/ stimuli).
Figure 10. Target points and temporal definition for analyzing dynamic articulatory parameters.
Table 4. Some possible target points and some possible temporal definitions characterizing the dynamic characteristics of visible articulatory movements in the production of consonants.
These analyses are in progress for all the consonants. As an example, Table 5 reports the values obtained for the duration of the vowel-to-consonant closure (TMc) and consonant-to-vowel opening (TMo) movements relative to the LOH, LLy and LOW parameters for all the consonants previously examined for their spatial characteristics (C = /p/, /b/, /f/, /t/, /d/, /s/, /S/).
TMc    /p/  /b/  /f/  /t/  /d/  /s/  /S/
LOH    257  247  303  297  313  335  219
LLy    235  326  281  284  325  325  344
LOW    172  169  400  429  459  434  410

TMo    /p/  /b/  /f/  /t/  /d/  /s/  /S/
LOH    180  154  203  186  158  187  244
LLy    190  266  229  216  279  334  289
LOW    179  116  233  279  222  348  240
Table 5. Mean values of the duration of the closure and opening movements.
Closing movements tend to be always longer than opening ones. This effect can be related to the different prosodic characteristics of the initial (stressed) and final (unstressed) vowel, which present different LOH values (cf. § 2). The reported values also show a different behaviour of the three parameters for each consonant.
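Durations such as TMc and TMo, and the interval from peak velocity to peak closure (TVc), can be read off a parameter trace once its extrema are located. A sketch on an invented LOH contour sampled at ELITE's 100 Hz:

```python
# Sketch of movement-duration measurement on a V1CV2 parameter trace:
# TMc runs from the V1 opening maximum to the consonantal minimum, TMo from
# that minimum to the V2 maximum, and TVc from the (most negative) velocity
# peak to peak closure. The trace is invented, not measured data.

FRAME_DT = 0.01  # s, 100 Hz sampling

def key_frames(x):
    """Indices of the V1 maximum, consonantal minimum and V2 maximum."""
    i_c = min(range(1, len(x) - 1), key=lambda i: x[i])
    i_v1 = max(range(0, i_c), key=lambda i: x[i])
    i_v2 = max(range(i_c + 1, len(x)), key=lambda i: x[i])
    return i_v1, i_c, i_v2

def closing_opening_durations(x):
    """(TMc, TMo) in ms."""
    i_v1, i_c, i_v2 = key_frames(x)
    return (i_c - i_v1) * FRAME_DT * 1000.0, (i_v2 - i_c) * FRAME_DT * 1000.0

def tvc(x):
    """Time (ms) from peak closing velocity to peak closure."""
    i_v1, i_c, _ = key_frames(x)
    v = {i: (x[i + 1] - x[i - 1]) / (2 * FRAME_DT) for i in range(i_v1 + 1, i_c)}
    i_pv = min(v, key=v.get)        # most negative velocity = fastest closing
    return (i_c - i_pv) * FRAME_DT * 1000.0

loh = [20, 28, 33, 27, 18, 9, 3, 8, 16, 23, 27, 24]   # hypothetical LOH (mm)
print(closing_opening_durations(loh))   # both movements ≈ 40 ms here
print(tvc(loh))                         # ≈ 20 ms
```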
In what follows, only an example of the presently available data will be presented, namely the data relative to the vowel-to-consonant closure movements of the upper (UL) and lower lip (LL) for the bilabial voiced and unvoiced stop consonants /p/ and /b/, produced 5 times by 4 talkers within non-sense stimuli characterized by symmetric contexts (/’VCV/: V = /i, a, u/) (Magno Caldognetto et al., 1989). The data presented in Tables 6 and 7 refer to the mean values and standard deviations of some of the previously described spatio-temporal parameters, pooled over all the subjects and all the repetitions. The role of the voiced/unvoiced opposition and of the different vocalic contexts was investigated by means of a series of three-way ANOVAs (3 vowels, 2 consonants and 4 subjects as a between factor) for each of the spatio-temporal measurements.
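A full three-way ANOVA with interactions is best left to a statistics package; as a minimal hand-computed sketch of the logic, here is the main-effect F statistic for one factor (the vowel) on invented TMc-like durations. The numbers are illustrative only, not the paper's measurements:

```python
# Sketch: one-way main-effect F statistic, the building block of the ANOVA
# design described above. Data are invented closure durations (ms).

def one_way_f(groups):
    """F statistic for a one-way between-groups design."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # between-groups sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-groups sum of squares: spread of observations around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

tmc_by_vowel = {          # hypothetical TMc values per vowel context
    "a": [153, 160, 148, 157],
    "i": [182, 176, 189, 180],
    "u": [157, 150, 161, 154],
}
F = one_way_f(list(tmc_by_vowel.values()))
print(round(F, 1))        # a large F: the vowel context matters
```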
              /’apa/  /’ipi/  /’upu/                   /’aba/  /’ibi/  /’ubu/
UL  TMc msec     153     182     157    UL  TMc msec      169     187     143
    s.d.          71      72      47        s.d.           62      46      31
    TVc msec      56      67      76        TVc msec       60      77      65
    s.d.          26      46      33        s.d.           20      32      21
LL  TMc msec     207     185     199    LL  TMc msec      234     174     167
    s.d.          39      33      30        s.d.           61      32      32
    TVc msec      95     102      98        TVc msec       68      76     104
    s.d.          24      36      15        s.d.           12      13     118
Table 7. Mean values and standard deviations, pooled over all subjects and repetitions, of the temporal
characteristics of the upper and lower lips in non-sense ‘VCV stimuli (C=/p, b/, V=/a, i, u/).
Both for the upper and lower lips, the temporal characteristics of the closing
movement are not affected by the voiced/unvoiced contrast. In fact, neither the
duration of the closing movement nor the time interval between peak velocity and
peak closure are affected by the type of consonant. Our data on the duration of the
closing movement are not in agreement with those reported by Summers (1987),
who noted a longer duration for /b/ than /p/. As for the spatial characteristics, the
voiced/unvoiced contrast does not affect the onset of the closing movement, but
affects the peak closing position as well as the displacement. The data show that
there is a greater degree of lip compression during the bilabial closure for the
voiceless stop /p/ than for the voiced stop /b/. As shown in Table 6, the peak
closure (PC) values show that the distance of the marker on the upper lip from the
reference plane is always greater for /p/ than /b/. On the other hand, the peak
closure values for the lower lip are always smaller for /p/ than /b/. This different
degree of compression can be compared to the greater pressure of bilabial contact
for word-final /p/ than /b/, as observed by Lubker and Parris (1970). Moreover, it
can be compared to the reduced degree of linguopalatal contact noted for the
voiced stop /d/ compared to the voiceless /t/ discovered by Farnetani (1989). The
peak velocity values are also higher for /p/ than /b/, although this trend did not
reach significance for the lower lip. Our data on the lip closure velocity are in
agreement with the data obtained for the bilabial voiced and voiceless stops in
word-final position, by Sussman, MacNeilage and Hanson (1973), Smith and
McLean-Muse (1987), Summers (1987) and Flege (1988). As for the vowel
context, the data show that it affects the duration of the closing movement, but not
the time interval between peak velocity and peak closure position. With respect to
the spatial characteristics, the onset of the closing movement depends on the vowel
quality, for both the upper and lower lips. The peak closure position for the lower
lip is affected by the flanking vowel, in particular, /u/ shows different peak closure
values from /a/ and /i/. This difference may be due to the characteristic of lip
protrusion which cooccurs with the rounding feature in Italian. The displacement
values are, as expected, affected by the vowel quality, while the peak velocity
values depend on the vowel context only with respect to the lower lip. In summary,
the main effects of the voiced/unvoiced contrast on the bilabial closure movement
are a greater degree of lip compression and a greater closure velocity for /p/ than
/b/. The effect of the flanking vowels is evident in the different closure duration of
the upper and lower lips and in the different closure velocity of the lower lip.
Moreover, the time interval between the peak velocity and the peak closure is affected neither by the voiced/unvoiced contrast nor by the vowel context. The asymmetric behaviour of the most relevant parameters for the upper (UL) and lower (LL) lip is visualized in Figure 11.
Figure 11. ULz and LLz values for /p/ and /b/ in the /’aCa/, /’iCi/ and /’uCu/ contexts.
The previously presented data make evident the important role of context in determining the variability of segmental targets and of the movements of single articulators. The design of a coarticulation model and the discovery of coarticulatory rules will be quite important for future visible speech synthesis and recognition applications. In previous research, reported in Magno Caldognetto et al. (1992) and focused on the analysis of lip-rounding in bisyllabic sequences of the type /ti’Cu/, /tiC’Cu/ and /ti’CCCu/, very strong anticipatory coarticulation effects were discovered, as exemplified in Figure 12.
Figure 12. Anticipatory coarticulation effect for lip-rounding. Example for /ti’stru/ and LOW
parameter.
The durations of the rounding movements differ depending on the type and number of intervocalic consonants, while the rounding movement appears to be independent of the presence of a syllabic boundary. Our data seem to be in agreement with the "look-ahead" theory (Perkell, 1980) rather than the "time-locked" theory (Fowler et al., 1980), because they do not appear to confirm the predictions of the intrinsic-timing control models, which predict an equal duration of the rounding movement before the beginning of the vowel /u/. In our data, rounding movements present different durations and always begin during the occlusion of the initial consonant /t/ or during the front unrounded vowel /i/, which rather supports a model of extrinsic-timing control. Future research should examine not only the duration of the rounding movement and the spatial characteristics of the maximum and minimum of the rounding target, but also the kinematic characteristics of the movement. In fact, maximum and minimum values can be reached using different strategies. Assessing the velocity and acceleration of the rounding parameter can provide evidence for the "hybrid" or "two-stage anticipatory coarticulation" model (Perkell, 1990). The variation between subjects should also be underlined: irrespective of the model of speech production the data may point to, the specificity of individual strategies of motor control should not be ignored.
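The look-ahead vs. time-locked contrast reduces to one measurement: when does the rounding movement begin relative to the acoustic onset of /u/? A sketch on invented LOW traces (the velocity threshold and all values are assumptions for illustration); a time-locked model predicts equal onset-to-/u/ intervals across cluster lengths, a look-ahead model does not:

```python
# Sketch: measuring anticipatory rounding onset from a LOW trace (mm, 100 Hz).
# Traces and threshold are invented; this is not the paper's analysis code.

FRAME_DT = 0.01  # s

def rounding_onset(low, threshold=-50.0):
    """First frame whose velocity (mm/s) falls below threshold: lips start narrowing."""
    for i in range(1, len(low) - 1):
        v = (low[i + 1] - low[i - 1]) / (2 * FRAME_DT)
        if v < threshold:
            return i
    return None

def anticipation_ms(low, u_onset_frame):
    """How long before the acoustic onset of /u/ the rounding movement begins."""
    return (u_onset_frame - rounding_onset(low)) * FRAME_DT * 1000.0

# Hypothetical short vs. long intervocalic cluster, /u/ onset at frame 15 in both:
low_short = [52] * 10 + [51, 49, 46, 44, 43, 42, 42]
low_long  = [52] * 5 + [51, 49, 47, 45, 44, 43, 42, 42, 42, 42, 42, 42]
print(anticipation_ms(low_short, 15), anticipation_ms(low_long, 15))
```

With the longer cluster the movement starts earlier relative to /u/, the pattern the look-ahead account predicts.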
3.1 Introduction
Audio-visual automatic speech recognition (ASR) systems can be conceived
with the aim of improving speech recognition performance, mostly in noisy
conditions (Silsbee and Bovik, 1993). Various studies of human speech perception
have demonstrated that visual information plays an important role in the process of
speech understanding (Massaro, 1987), and, in particular, "lip-reading" seems to
be one of the most important secondary information sources (Dodd and Campbell,
1987). Moreover, even if the auditory modality definitely represents the most
important flow of information for speech perception, the visual channel allows
subjects to better understand speech when background noise strongly corrupts the
audio channel (MacLeod and Summerfield, 1987). Mohamadi and Benoit (1992)
reported that vision is almost unnecessary in rather clean acoustic conditions (S/N
> 0 dB), while it becomes essential when the noise highly degrades acoustic
conditions (S/N <= 0 dB).
3.2 Method
The system being described takes advantage of jaw- and lip-reading capability, making use of ELITE (Magno Caldognetto et al., 1989) in conjunction with an auditory model of speech processing (Seneff, 1988) which has shown great robustness in noisy conditions (Cosi, 1992). The speech signal, acquired in synchrony with the articulatory data, is prefiltered and sampled at 16 kHz, and a joint synchrony/mean-rate auditory model of speech processing (Seneff, 1988) is applied, producing 80 spectral-like parameters at a 500 Hz frame rate. In the experiments being described, the spectral-like parameters and the frame rate were reduced to 40 and 250 Hz respectively in order to speed up the system training time. Input stimuli are segmented by SLAM, a recently developed semi-automatic segmentation and labeling tool (Cosi, 1993) working on auditory model parameters. Both audio and visual parameters, singly or jointly, are used to train, by means of the Back Propagation for Sequences (BPS) algorithm (Gori, Bengio and De Mori, 1989), an artificial Recurrent Neural Network (RNN) to recognize the input stimuli. A block diagram of the overall system is shown in Figure 13, where both the audio and the visual channel are shown together with the RNN utilized in the recognition phase.
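Before audio and visual parameters can be presented jointly to the network, the two streams must share a frame rate: auditory-model frames arrive at 250 Hz, articulatory ones at 100 Hz. The paper does not spell out the alignment step, so the linear-interpolation sketch below is purely an assumed illustration of one plausible way to do it:

```python
# Sketch (assumption, not the paper's method): aligning a 100 Hz articulatory
# trace with 250 Hz auditory-model frames by linear interpolation.

AUDIO_HZ, ARTIC_HZ = 250.0, 100.0

def resample_linear(x, src_hz, dst_hz, n_out):
    """Linearly interpolate a 1-D sequence x (sampled at src_hz) at dst_hz."""
    out = []
    for k in range(n_out):
        t = k * src_hz / dst_hz          # fractional index into the source stream
        i = min(int(t), len(x) - 2)
        frac = t - i
        out.append(x[i] * (1 - frac) + x[i + 1] * frac)
    return out

jz = [10.0, 12.0, 16.0, 14.0, 11.0]                  # hypothetical 100 Hz Jz trace
jz_250 = resample_linear(jz, ARTIC_HZ, AUDIO_HZ, n_out=10)
# a joint frame then concatenates 40 auditory coefficients with the
# interpolated articulatory values
print([round(v, 2) for v in jz_250])
```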
3.3 Experiments
The results obtained in two phonetic classification experiments will be illustrated, the first dealing with a Talker Dependent (TD) environment (Cosi, 1994) and the second with a Talker Independent (TI) environment (Cosi, 1995). For both experiments, the input data consist of disyllabic symmetric /’VCV/ non-sense words, where C = /p, t, k, b, d, g/ and V = /a, i, u/, uttered by 4 talkers (2 male and 2 female) in the TD case and by 10 male talkers in the TI condition. All the subjects were northern Italian university students, aged between 19 and 22, and were paid volunteers. They repeated each of the selected non-sense words five times, in random order. The talker sits comfortably on a chair, with a microphone in front of him, and utters the experimental paradigm words at the request of the operator. In this study, the movements of the markers placed on the central points of the vermilion border of the upper and lower lips, together with the movements of the markers placed at the corners of the mouth (markers 2, 3, 4, 5 of Fig. 4), were analyzed. A total of 14 parameters, 7 movements plus their instantaneous velocities, constitutes the articulatory vector, which has been used together with the acoustic vector in order to represent the target stimuli. The chosen articulatory parameters were (see Fig. 4 and Table 1): ULz, LLz, ULy, LLy, LOH, LOW, Jz and their velocities.
Figure 14. Network structures in the three different experimental settings: ACoustic, ARticulatory,
ACoustic+ARticulatory (see text).
In order to use a learning algorithm which is "local" in space and time, thus reducing the computational complexity (in other words, an algorithm which can operate on each neuron using only information relative to its afferent neurons and only the present input frame, without information explicitly related to previous frames), dynamic nodes were concentrated in the hidden layer only. With this constraint, the required "local" conditions for the learning algorithm are satisfied. The learning strategy was based on the BPS algorithm (Gori, Bengio and De Mori, 1989), and only two supervision frames were chosen in order to speed up the training time, as illustrated in Figure 15. The first, focused on articulatory parameters, was positioned in the middle frame of the target plosive, the ‘closure’ zone, while the second, focused on acoustic parameters, was positioned in the penultimate frame, the ‘burst’ zone. A 20 ms delay, corresponding to 5 frames, was used for the hidden-layer dynamic neurons. A 54 (40+14) input × 20 (14+6) hidden × 6 output RNN, as illustrated in Figure 14, was considered. Not all connections between the input and the hidden layer were allowed, but only those within each of the two modalities, which were thus kept disjoint. Various parameter reduction schemes and various alternative network structures were explored, but those described above represent the best choice in terms of learning speed and recognition performance.
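The "dynamic neurons only in the hidden layer" idea can be sketched as a forward pass: each hidden unit has local feedback taps (the Z⁻¹ delays of Figure 14) on its own past activations, while input and output layers stay static, which is what keeps learning local in space and time. Sizes and weights below are toy values, not the 54 × 20 × 6 network of the paper:

```python
# Sketch of a dynamic-hidden-layer recurrent net: per-neuron delay lines on
# the hidden activations only (cf. Figure 14). Toy dimensions and random
# weights; illustrative only.
import math, random

random.seed(0)
N_IN, N_HID, N_OUT, N_DELAY = 4, 3, 2, 5   # 5 taps ~ a 20 ms delay at 250 Hz

W_in = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
w_fb = [[random.uniform(-0.3, 0.3) for _ in range(N_DELAY)] for _ in range(N_HID)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(N_HID)] for _ in range(N_OUT)]

def forward(frames):
    """Run a sequence of input frames through the net; output layer is static."""
    hist = [[0.0] * N_DELAY for _ in range(N_HID)]   # per-neuron activation history
    outputs = []
    for x in frames:
        h = []
        for j in range(N_HID):
            a = sum(W_in[j][i] * x[i] for i in range(N_IN))
            # local feedback: each unit sees only its OWN delayed activations
            a += sum(w_fb[j][d] * hist[j][d] for d in range(N_DELAY))
            y = math.tanh(a)
            hist[j] = [y] + hist[j][:-1]             # shift the delay line
            h.append(y)
        outputs.append([sum(W_out[k][j] * h[j] for j in range(N_HID))
                        for k in range(N_OUT)])
    return outputs

seq = [[0.1, 0.4, -0.2, 0.3] for _ in range(6)]      # constant toy input
outs = forward(seq)
print(len(outs), len(outs[0]))                       # 6 2
```

Even with a constant input the output drifts over the first frames, because the hidden units integrate their own history through the delay taps.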
Figure 15. Position of the supervision frames on the /t/ signal (16 kHz), with the AC (40) and AR (14) parameter streams.
3.4 Results
mean 78 67 99 98
Table 8. TD, “clean” condition, correct classification results (%)
Figure 16. Input articulatory parameter influence for the plosives, relative to talker AN (see text).
Table 10. Input articulatory parameter influence for the plosives, relative to talker AN.
In order to test the power of the bimodal approach, all three experiments were repeated eliminating the visual information, thus retaining only the audio channel input. The 40 input × 14 hidden × 6 output RNN utilized in this case is exactly the audio subnet of the global net utilized in the bimodal environment, as indicated in Fig. 13. Results for this case are illustrated in Table 13.
        E1    E2    E3
Mean   68.9  58.3  65.0
Table 13. TI mean correct recognition rate with only audio information.
3.5 Discussion
As indicated by direct inspection of Tables 11-13, recognition performance improves significantly when both the audio and the visual channel are active. Looking at Table 12, which refers to the talker-pooled (TP) results, good generalization power can be attributed to the chosen RNN, given that the TI results were surprisingly better than the TP results.
REFERENCES
Magno Caldognetto E., Vagges K., Ferrero F.E. and Cosi P. (1995) La lettura labiale: dati sperimentali e problemi teorici, Proc. IV Convegno Nazionale Informatica Didattica e Disabilità, Napoli, 9-11 Nov., 1995 (to be published).
Magno Caldognetto E., Vagges K. and Ferrero F.E. (1980), Un test di confusione
fra le consonanti dell’italiano: primi risultati, Atti del Seminario “La percezione
del linguaggio” (Firenze, 17-20 dicembre 1980), Accademia della Crusca 123-
179.
Magno Caldognetto E., Ferrero F.E. and Vagges K. (1982), Intelligibilità delle
consonanti dell’italiano in condizioni di mascheramento (S/R), di filtraggio passa-
alto (PA) e passa-basso (PB), Bollettino Italiano di Audiologia e Foniatria, vol. 5,
163-172.
Benoit C., Lallouache T., Mohamadi T., and Abry C., (1992) A Set of French
Visemes for Visual Speech Synthesis, in Bailly G., Benoit C., and Sawallis T.R.
(Eds.), Talking machines: Theories, Models, and Designs, North-Holland,
Amsterdam, 485-504.
Cohen M.M. and Massaro D., (1990) Behavior Research Methods, Instruments
and Computers, Vol. 22 (2), 260-263.
Magno Caldognetto E., Vagges K., Borghese N.A., and Ferrigno G., (1989)
Automatic Analysis of Lips and Jaw Kinematics in VCV Sequences, Proc. of
Eurospeech 1989, Vol. 2:453-456.
Abry C., and Boe L.J., (1986) "Laws" for Lips, Speech Communication, 5, 97-
104.
Magno Caldognetto E., Vagges K. and Zmarich C., (1995) Visible Articulatory
Characteristics of the Italian Stressed and Unstressed Vowels, Proc. of ICPhS 95,
Stockholm, 14-19 August, 1995, Vol. 1, 366-369.
Fromkin V., (1964) Lip Positions in American English Vowels, Language and
Speech, 7, 217-225.
Farnetani E., (1995) Labial Coarticulation, in Quaderni del Centro di Studio per le
Ricerche di Fonetica, Vol. 13, 57-81.
Sussman H.M., MacNeilage P.F., and Hanson R.J., (1973) Labial and mandibular
dynamics during the production of bilabial consonants: Preliminary observations,
J. of Speech and Hearing Research, 16, 397-420.
Flege J., (1988), The development of skill in producing word-final English stops:
Kinematic parameters, J. Acoust. Soc. Am., 84 (5), 1639-1652.
Magno Caldognetto E., Vagges K., Ferrigno G., and Busà G. (1992) Lip Rounding
Coarticulation in Italian, Proc. of International Conference on Spoken Language
Processing, Banff 1992, Vol. 1: 61-64.
Perkell, J.S., (1980) Phonetic Features and the Physiology of Speech Production in
B. Butterworth (ed.), Language Production, Academic Press, London, Vol. 1, 337-
372.
Fowler, C.A., Rubin P., Remez R.E. and Turvey M.T. (1980) Implications for
Speech Production of a General Theory of Action, in B. Butterworth (ed.),
Language Production, Academic Press, London, Vol. 1, 373-420.
Perkell, J.S., (1990) Testing Theories of Speech Production: Implication of Some
Analyses of Variable Articulation Data, Proc. NATO ASI, Speech Production and
Modelling, pp. 263-288.
Massaro D.W. (1987), Speech Perception by Ear and Eye: a Paradigm for
Psychological Inquiry, Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Dodd B. and Campbell R., Eds., (1987), Hearing by Eye: The Psychology of Lip-
Reading, Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Gori M., Bengio Y. and De Mori R. (1989), "BPS: A Learning Algorithm for
Capturing the Dynamical Nature of Speech", Proc. IEEE IJCNN89, Washington, June 18-22, 1989, Vol. II, pp. 417-432.
Cosi P., Magno Caldognetto E., Vagges K., Mian G.A. and Contolini M. (1994),
“Bimodal Recognition Experiments with Recurrent Neural Networks”,
Proceedings of IEEE ICASSP-94, Adelaide, Australia, 19-22 April, 1994, Vol. 2,
Session 20.8, pp. 553-556.
Cosi P., Dugatto M., Ferrero F., Magno Caldognetto E., and Vagges K. (1995),
Bimodal Recognition of Italian Plosives, Proc. 13th International Congress of
Phonetic Sciences, ICPhS95, Stockholm, Sweden, 1995.