Cascade

Uploaded by

godilhamza01

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

22 views6 pages

Cascade

Uploaded by

godilhamza01

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

A Cascade Sequence-to-Sequence Model for Chinese Mandarin

Lip Reading
Ya Zhao Rui Xu Mingli Song
Zhejiang Provincial Key Laboratory Zhejiang Provincial Key Laboratory Zhejiang Provincial Key Laboratory
of Service Robots of Service Robots of Service Robots
Zhejiang University Zhejiang University Zhejiang University
[email protected] [email protected] [email protected]

ABSTRACT speech recognition offers an alternative way to understand speech.

Lip reading aims at decoding texts from the movement of a speaker’s Besides, lip reading has practical potential in improved hearing
mouth. In recent years, lip reading methods have made great progress aids, security, and silent dictation in public spaces. Lip reading
for English, at both word-level and sentence-level. Unlike English, is essentially a difficult problem, as most lip reading actuations,
however, Chinese Mandarin is a tone-based language and relies besides the lips and sometimes tongue and teeth, are latent and
on pitches to distinguish lexical or grammatical meaning, which ambiguous. Several seemingly identical lip movements can produce
significantly increases the ambiguity for the lip reading task. In different words.
this paper, we propose a Cascade Sequence-to-Sequence Model for Thanks to the recent development of deep learning, English-
Chinese Mandarin (CSSMCM) lip reading, which explicitly mod- based lip reading methods have made great progress, at both word-
els tones when predicting sentence. Tones are modeled based on level [9, 13] and sentence-level [1, 8]. However, as the language of
visual information and syntactic structure, and are used to predict the most number of speakers, there is only a little work for Chinese
sentence along with visual information and syntactic structure. Mandarin lip reading in the multimedia community. Yang et al. [14]
In order to evaluate CSSMCM, a dataset called CMLR (Chinese present a naturally-distributed large-scale benchmark for Chinese
Mandarin Lip Reading) is collected and released, consisting of over Mandarin lip-reading in the wild, named LRW-1000, which contains
100,000 natural sentences from China Network Television website. 1,000 classes with 718,018 samples from more than 2,000 individual
When trained on CMLR dataset, the proposed CSSMCM surpasses speakers. Each class corresponds to the syllables of a Mandarin
the performance of state-of-the-art lip reading frameworks, which word composed of one or several Chinese characters. However,
confirms the effectiveness of explicit modeling of tones for Chinese they perform only word classification for Chinese Mandarin lip
Mandarin lip reading. reading but not at the complete sentence level. LipCH-Net [15] is the
first paper aiming for sentence-level Chinese Mandarin lip reading.
CCS CONCEPTS LipCH-Net is a two-step end-to-end architecture, in which two deep
neural network models are employed to perform the recognition
• Computing methodologies → Machine translation; Computer
of Picture-to-Pinyin (mouth motion pictures to pronunciations)
vision; Neural networks.
and the recognition of Pinyin-to-Hanzi (pronunciations to texts)
respectively. Then a joint optimization is performed to improve the
KEYWORDS overall performance.
lip reading, datasets, multimodal Belong to two different language families, English and Chinese
ACM Reference Format: Mandarin have many differences. The most significant one might
Ya Zhao, Rui Xu, and Mingli Song. 2019. A Cascade Sequence-to-Sequence be that: Chinese Mandarin is a tone language, while English is not.
Model for Chinese Mandarin Lip Reading. In ACM Multimedia Asia (MMAsia The tone is the use of pitch in language to distinguish lexical or
’19), December 15–18, 2019, Beijing, China. ACM, New York, NY, USA, 6 pages. grammatical meaning - that is, to distinguish or to inflect words
https://doi.org/10.1145/3338533.3366579 1 . Even two words look the same on the face when pronounced,
they can have different tones, thus have different meanings. For
1 INTRODUCTION example, even though "练习" (which means practice) and "联系"
Lip reading, also known as visual speech recognition, aims to predict (which means contact) have different meanings, but they have the
the sentence being spoken, given a silent video of a talking face. In same mouth movement. This increases ambiguity when lip reading.
noisy environments, where speech recognition is difficult, visual So the tone is an important factor for Chinese Mandarin lip reading.
Based on the above considerations, in this paper, we present
Permission to make digital or hard copies of all or part of this work for personal or CSSMCM, a sentence-level Chinese Mandarin lip reading network,
classroom use is granted without fee provided that copies are not made or distributed which contains three sub-networks. Same as [15], in the first sub-
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM network, pinyin sequence is predicted from the video. Different
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, from [15], which predicts pinyin characters from video, pinyin
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from [email protected].
is taken as a whole in CSSMCM, also known as syllables. As we
MMAsia ’19, December 15–18, 2019, Beijing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6841-4/19/12. . . $15.00
https://doi.org/10.1145/3338533.3366579 1 https://en.wikipedia.org/wiki/Tone_(linguistics)
MMAsia ’19, December 15–18, 2019, Beijing, China Ya Zhao, Rui Xu, and Mingli Song

𝒑𝟏 𝒑𝟐 𝒑𝟑
know, Mandarin Chinese is a syllable-based language and sylla- 𝒕𝟏 𝒕𝟐 𝒕𝟐
bles are their logical unit of pronunciation. Compared with pinyin 𝐩
𝐆𝐑𝐔𝐞
𝐩
𝐆𝐑𝐔𝐞 𝐆𝐑𝐔𝐞
𝐩

characters, syllables are a longer linguistic unit, and can reduce the 𝐌𝐋𝐏 𝐌𝐋𝐏 𝐌𝐋𝐏
p
difficulty of syllable choices in the decoder by sequence-to-sequence Attentiont
attention-based models [17]. Chen et al. [6] find that there might 𝐆𝐑𝐔𝐝𝐭 𝐆𝐑𝐔𝐝𝐭 𝐆𝐑𝐔𝐝𝐭
be a relationship between the production of lexical tones and the Attentionvt
visible movements of the neck, head, and mouth. Motivated by
𝐆𝐑𝐔𝐞𝐯 𝐆𝐑𝐔𝐞𝐯 𝐆𝐑𝐔𝐞𝐯
this observation, in the second sub-network, both video and pinyin
sequence is used as input to predict tone. Then in the third sub- 𝐕𝐆𝐆 𝐕𝐆𝐆 𝐕𝐆𝐆
network, video, pinyin, and tone sequence work together to predict
𝒙𝟏 𝒙𝟐 𝒙𝟑
the Chinese character sequence. At last, three sub-networks are
jointly finetuned to improve overall performance.
As there is no public sentence-level Chinese Mandarin lip read- Figure 1: The tone prediction sub-network.
ing dataset, we collect a new Chinese Mandarin Lip Reading dataset
called CMLR based on China Network Television broadcasts con- • 𝑝 𝑦 𝑥 = σ𝑧 σ𝑡 𝑝 𝑦 𝑧, 𝑡, 𝑥 𝑝
follow: ÕÕ
taining talking faces together with subtitles of what is said. P(y|x) = P(y|p, t, x)P(t |p, x)P(p|x), (1)
In summary, our major contributions are as follows. p t
• We argue that tone is an important factor for Chinese Man- The meaning of these symbols is given in Table 1.
darin lip reading, which increases the ambiguity compared As shown in Equation (1), the whole problem is divided into three
with English lip reading. Based on this, a three-stage cas- parts, which corresponds to pinyin prediction, tone prediction, and
cade network, CSSMCM, is proposed. The tone is inferred character prediction separately. Each part will be described in detail
by video and syntactic structure, and are used to predict sen- below.
tence along with visual information and syntactic structure.
• We collect a ’Chinese Mandarin Lip Reading’ (CMLR) dataset, 2.1 Pinyin Prediction Sub-network
consisting of over 100,000 natural sentences from national The pinyin prediction sub-network transforms video sequence into
news program "News Broadcast". The dataset will be released pinyin sequence, which corresponds to P(p|x) in Equation (1). This
as a resource for training and evaluation. sub-network is based on the sequence-to-sequence architecture
• Detailed experiments on CMLR dataset show that explicitly with attention mechanism [2]. We name the encoder and decoder
modeling tone when predicting Chinese sentence performs the video encoder and pinyin decoder, for the encoder process
a lower character error rate. video sequence, and the decoder predicts pinyin sequence. The
input video sequence is first fed into the VGG model [4] to extract
Table 1: Symbol Definition visual feature. The output of conv5 of VGG is appended with global
average pooling [12] to get the 512-dim feature vector. Then the
Symbol Definition 512-dim feature vector is fed into video encoder. The video encoder
GRUve GRU unit in video encoder can be denoted as:
(hve )i = GRUve ((hve )i−1 , VGG(x i )).
p p
GRUe , GRUd GRU unit in pinyin encoder and pinyin decoder (2)
GRUe, GRUd
t t GRU unit in tone encoder and tone decoder
y When predicting pinyin sequence, at each timestep i, video encoder
GRUd GRU unit in character decoder
outputs are attended to calculate a context vector c iv :
Attentionvp attention between pinyin decoder and video en-
p p p
coder. The superscript indicates the encoder and (hd )i = GRUd ((hd )i−1 , pi−1 ), (3)
the subscript indicates the decoder.
p
x, y, p, t video, character, pinyin, and tone sequence c iv = hve · Attentionvp ((hd )i , hve ), (4)
i p
softmax(MLP((hd )i , c iv )).
timestep
p P(pi |p <i , x) = (5)
h ev , h e , h et video encoder output, pinyin encoder output, tone
encoder output
cv , cp, c t video content, pinyin content, tone content
2.2 Tone Prediction Sub-network
As shown in Equation (1), tone prediction sub-network (P(t |p, x))
takes video and pinyin sequence as inputs and predict correspond-
ing tone sequence. This problem is modeled as a sequence-to-
2 THE PROPOSED METHOD sequence learning problem too. The corresponding model architec-
In this section, we present CSSMCM, a lip reading model for Chi- ture is shown in Figure 1.
nese Mandarin. As mention in Section 1, pinyin and tone are both In order to take both video and pinyin information into consid-
important for Chinese Mandarin lip reading. Pinyin represents how eration when producing tone, a dual attention mechanism [8] is
to pronounce a Chinese character and is related to mouth move- employed. Two independent attention mechanisms are used for
ment. Tone can alleviate the ambiguity of visemes (several speech video and pinyin sequence. Video context vectors c iv and pinyin
p
sounds that look the same) to some extent and can be inferred from context vectors c i are fused when predicting a tone character at
visible movements. Based on this, the lip reading task is defined as each decoder step.
A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading MMAsia ’19, December 15–18, 2019, Beijing, China

𝒑𝟏 𝒑𝟐 𝒑𝟑 𝒑𝟏 𝒑𝟐

𝐩 𝐩 𝐩 𝐩 𝐩
𝐆𝐑𝐔𝐞 𝐆𝐑𝐔𝐞 𝐆𝐑𝐔𝐞 𝐆𝐑𝐔𝐞𝐱 𝐆𝐑𝐔𝐞𝐱 𝐆𝐑𝐔𝐝 𝐆𝐑𝐔𝐝 𝒕𝟏 𝒕𝟐
𝐆𝐑𝐔𝐞𝒕

𝒕𝟑 p
𝒚𝟏 𝒚𝟐 𝒚𝟑 𝐕𝐆𝐆 𝐕𝐆𝐆 𝐩 𝐩
Attentionc 𝐆𝐑𝐔𝐞 𝐆𝐑𝐔𝐞 𝐆𝐑𝐔𝐝𝐭 𝐆𝐑𝐔𝐝𝐭 𝒚𝟏 𝒚𝟐
𝐌𝐋𝐏 𝐌𝐋𝐏 𝐌𝐋𝐏 𝒙𝟏 𝒙𝟐
𝐆𝐑𝐔𝐞𝐭 𝐆𝐑𝐔𝐞𝐭 𝐆𝐑𝐔𝐝𝐜 𝐆𝐑𝐔𝐝𝐜
𝐆𝐑𝐔𝐞𝒕

𝒕𝟐 Attentiontc
𝐆𝐑𝐔𝐝𝐜 𝐆𝐑𝐔𝐝𝐜 𝐆𝐑𝐔𝐝𝐜
Attentionvc
𝐆𝐑𝐔𝐞𝒕

𝒕𝟏
𝐆𝐑𝐔𝐞𝐯 𝐆𝐑𝐔𝐞𝐯 𝐆𝐑𝐔𝐞𝐯 Figure 3: The overall of the CSSMCM network. The attention
module is omitted for sake of simplicity.
𝐕𝐆𝐆 𝐕𝐆𝐆 𝐕𝐆𝐆

𝒙𝟏 𝒙𝟐 𝒙𝟑 • 𝑝 𝑦 𝑥 = σ σ 𝑝 𝑦 𝑧, 𝑡, 𝑥 𝑝 𝑡 𝑧, 𝑥 𝑝(𝑧|𝑥)
𝑧 𝑡 (16) with
We replace Equation (6) with Equation (17), Equation
Equation (18). Then, the three sub-networks are jointly trained and
Figure 2: The character prediction sub-network. the overall loss function is defined as follows:
L = Lp + L t + Lc , (19)
The video encoder is the same as in Section 2.1 and the pinyin where Lp , Lt and Lc stand for loss of pinyin prediction sub-network,
encoder is: tone prediction sub-network and character prediction sub-network
p p p
(he )i = GRUe ((he )i−1 , pi−1 ). (6) respectively, as defined below.
The tone decoder takes both video encoder outputs and pinyin Õ
Lp = − log P(pi |p <i , x),
encoder outputs to calculate context vector, and then predicts tones:
i
(hdt )i = GRUtd ((hdt )i−1 , ti−1 ), (7) Õ
Lt = − log P(ti |t <i , x, p), (20)
c iv = hve · Attentionvt ((hdt )i , hve ), (8) Õ
i
p p p Lc = − log P(c i |c <i , x, p, t).
c i = he · Attentiont ((hdt )i , he ),
p
(9)
p i
P(ti |t <i , x, p) = softmax(MLP((hdt )i , c iv , c i )). (10)
2.5 Training Strategy
2.3 Character Prediction Sub-network
To accelerate training and reduce overfitting, curriculum learn-
The character prediction sub-network corresponds to P(y|p, t, x) in ing [8] is employed. The sentences are grouped into subsets ac-
Equation (1). It considers all the pinyin sequence, tone sequence cording to the length of less than 11, 12-17, 18-23, more than 24
and video sequence when predicting Chinese character. Similarly, Chinese characters. Scheduled sampling proposed by [3] is used
we also use attention based sequence-to-sequence architecture to to eliminate the discrepancy between training and inference. At
model this equation. Here the attention mechanism is modified into the training stage, the sampling rate from the previous output is
triplet attention mechanism: selected from 0.7 to 1. Greedy decoder is used for fast decoding.

(hdc )i = GRUcd ((hdc )i−1 , yi−1 ), (11) 3 DATASET

c iv = hve · Attentionvc ((hdc )i , hve ), (12) In this section, a three-stage pipeline for generating the Chinese
p p p
= he · Attentionc ((hdc )i , he ),
p Mandarin Lip Reading (CMLR) dataset is described, which includes
ci (13)
video pre-processing, text acquisition, and data generation. This
c it = hte · Attentiontc ((hdc )i , hte ), (14) three-stage pipeline is similar to the method mentioned in [8], but
p
P(c i |c <i , x, p, t) = softmax(MLP((hdc )i , c iv , c i , c it )).
(15) considering the characteristics of our Chinese Mandarin dataset,
For the following needs, the formula of tone encoder is also listed we have optimized some steps and parts to generate a better quality
as follows: lip reading dataset. The three-stage pipeline is detailed below.
(hte )i = GRUte ((hte )i−1 , ti−1 ). (16) Video Pre-processing. First, national news program "News
Broadcast" recorded between June 2009 and June 2018 is obtained
2.4 CSSMCM Architecture from China Network Television website. Then, the HOG-based face
detection method is performed [11], followed by an open source
The architecture of the proposed approach is demonstrated in Fig-
platform for face recognition and alignment. The video clip set of
ure 3. For better display, the three attention mechanisms are not
eleven different hosts who broadcast the news is captured. During
shown in the figure. During the training of CSSMCM, the outputs
the face detection step, using frame skipping can improve efficiency
of pinyin decoder are fed into pinyin encoder, the outputs of tone
while ensuring the program quality.
decoder into tone encoder:
p p p
Text Acquisition. Since there is no subtitle or text annotation in
(he )i = GRUe ((he )i−1 , MLP((hdt )i , c iv , c i )),
p
(17) the original "News Broadcast" program, FFmpeg tools 2 are used to
p p
(hte )i = GRUte ((he )i−1 , MLP((hdc )i , c iv , c i , c it )). (18) 2 https://ffmpeg.org/
MMAsia ’19, December 15–18, 2019, Beijing, China Ya Zhao, Rui Xu, and Mingli Song

Table 2: The CMLR dataset. Division of training, validation 4.2 Compared Methods and Evaluation
and test data; and the number of sentences, phrases and Protocol
characters of each partition.
We list here the compared methods and the evaluation protocol.
Set # sentences # phrases # characters
Table 3: The detailed comparison between CSSMCM and
Train 71,452 22,959 3,360
other methods on the CMLR dataset. V, P, T, C stand for
Validation 10,206 10,898 2,540 video, pinyin, tone and character. V2P stands for the trans-
Test 20,418 14,478 2,834 formation from video sequence to pinyin sequence. VP2T
All 102,076 25,633 3,517 represents the input are video and pinyin sequence and the
output is sequence of tone. OVERALL means to combine the
sub-networks and make a joint optimization.
extract the corresponding audio track from the video clip set. Then
through the iFLYTEK 3 ASR, the corresponding text annotation Models sub-network CER PER TER
of the video clip set is obtained. However, there is some noise in WAS - 38.93% - -
these text annotation. English letters, Arabic numerals, and rare
V2P - 27.96% -
punctuation are deleted to get a more pure Chinese Mandarin lip LipCH-Net-seq P2C 9.88% - -
reading dataset. OVERALL 34.07% 39.52% -
Data Generation. The text annotation acquired in the previous
V2P - 27.96% -
step also contains timestamp information. Therefore, video clip set P2T - - 6.99%
is intercepted according to these timestamp information, and then CSSMCM-w/o video
PT2C 4.70 % - -
the corresponding word, phrase, or sentence video segment of the OVERALL 42.23% 46.67% 13.14%
text annotation are obtained. Since the text timestamp information V2P - 27.96% -
may have a few uncertain errors, some adjustments are made to VP2T - - 6.14%
the start frame and the end frame when intercepting the video CSSMCM
VPT2C 3.90% - -
segment. It is worth noting that through experiments, we found OVERALL 32.48% 36.22% 10.95%
that using OpenCV 4 can capture clearer video segment than the
FFmpeg tools.
WAS: The architecture used in [8] without the audio input. The
Through the three-stage pipeline mentioned above, we can ob-
decoder output Chinese character at each timestep. Others keep
tain the Chinese Mandarin Lip Reading (CMLR) dataset containing
unchanged to the original implementation.
more than 100,000 sentences, 25,000 phrases, 3,500 characters. The
LipCH-Net-seq: For a fair comparison, we use sequence-to-
dataset is randomly divided into training set, validation set, and
sequence with attention framework to replace the Connectionist
test set in a ratio of 7:1:2. Details are listed in Table 2.
temporal classification (CTC) loss [10] used in LipCH-Net [15]
Further details of the dataset and the download links can be
when converting picture to pinyin.
found on the web page: https://www.vipazoo.cn/CMLR.html.
CSSMCM-w/o video: To evaluate the necessity of video infor-
mation when predicting tone, the video stream is removed when
4 EXPERIMENTS predicting tone and Chinese characters. In other word, video is only
4.1 Implementation Details used when predicting the pinyin sequence. The tone is predicted
The input images are 64 × 128 in dimension. Lip frames are trans- from the pinyin sequence. Tone information and pinyin information
formed into gray-scale, and the VGG network takes every 5 lip work together to predict Chinese character.
frames as an input, moving 2 frames at each timestep. For all sub- We tried to implement the Lipnet architecture [1] to predict Chi-
networks, a two-layer bi-direction GRU [7] with a cell size of 256 is nese character at each timestep. However, the model did not con-
used for the encoder and a two-layer uni-direction GRU with a cell verge. The possible reasons are due to the way CTC loss works and
size of 512 for the decoder. For character and pinyin vocabulary, we the difference between English and Chinese Mandarin. Compared
keep characters and pinyin that appear more than 20 times. [sos], to English, which only contains 26 characters, Chinese Mandarin
[eos] and [pad] are also included in these three vocabularies. The contains thousands of Chinese characters. When CTC calculates
final vocabulary size is 371 for pinyin prediction sub-network, 8 for loss, it first adds blank between every character in a sentence, that
tone prediction sub-network (four tones plus a neutral tone), and causes the number of the blank label is far more than any other
1,779 for character prediction sub-network. Chinese character. Thus, when Lipnet starts training, it predicts
The initial learning rate was 0.0001 and decreased by 50% every only the blank label. After a certain epoch, "的" character will occa-
time the training error did not improve for 4 epochs. CSSMCM is sionally appear until the learning rate decays to close to zero.
implemented using pytorch library and trained on a Quadro 64C For all experiments, Character Error Rate (CER) and Pinyin Er-
P5000 with 16GB memory. The total end-to-end model was trained ror Rate (PER) are used as evaluation metrics. CER is defined as
for around 12 days. ErrorRate = (S + D + I )/N , where S is the number of substitutions,
D is the number of deletions, I is the number of insertions to get
3 https://www.xfyun.cn/ from the reference to the hypothesis and N is the number of words
4 http://docs.opencv.org/2.4.13/modules/refman.html in the reference. PER is calculated in the same way as CER. Tone
A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading MMAsia ’19, December 15–18, 2019, Beijing, China

Table 4: Examples of sentences that CSSMCM correctly predicts while other methods do not. The pinyin and tone sequence
corresponding to the Chinese character sentence are also displayed together. GT stands for ground truth.

Method Chinese Character Sentence Pinyin Sequence Tone Sequence

GT 既让老百姓得实惠 ji rang lao bai xing de shi hui 44334224
WAS 介项老百姓姓事会 jie xiang lao bai xing xing shi hui 44334444
LipCH-Net-seq 既让老百姓的吃贵 ji rang lao bai xing de chi gui 44334014
CSSMCM 既让老百姓得实惠 ji rang lao bai xing de shi hui 44334224
GT 有效应对当前半岛局势 you xiao ying dui dang qian ban dao ju shi 3444124324
WAS 有效应对当天半岛趋势 you xiao ying dui dang tian ban dao qu shi 3444114314
LipCH-Net-seq 有效应对党年半岛局势 you xiao ying dui dang nian ban dao ju shi 3444324324
CSSMCM 有效应对当前半岛局势 you xiao ying dui dang qian ban dao ju shi 3444124324

Table 5: Failure cases of CSSMCM. These show that when predicting characters with the same lip
shape but different tones, other methods are often unable to predict
向全球价值链中高端迈进 correctly. However, CSSMCM can leverage the tone information to
GT
xiang quan qiu jia zhi lian zhong gao duan mai jin predict successfully.
向全球下试联中高端迈进 Apart from the above results, Table 5 also lists some failure cases
CSSMCM
xiang quan qiu xia shi lian zhong gao duan mai jin
of CSSMCM. The characters that CSSMCM predicts wrong are
随着我国医学科技的进步 usually homophones or characters with the same final as the ground
GT
sui zhe wo guo yi xue ke ji de jin bu truth. In the first example, "价" and "下" have the same final, ia,
随着我国一水科技的信步
CSSMCM while "一" and "医" are homophones in the second example. Unlike
sui zhe wo guo yi shui ke ji de jin bu
English, if one character in an English word is predicted wrong,
the understanding of the transcriptions has little effect. However, if
there is a character predicted wrong in Chinese words, it will greatly
Error Rate (TER) is also included when analyzing CSSMCM, which affect the understandability of transcriptions. In the second example,
is calculated in the same way as above. CSSMCM mispredicts "医学" ( which means medical) to "一水"
(which means all). Although their first characters are pronounced
4.3 Results the same, the meaning of the sentence changed from Now with the
progress of medical science and technology in our country to It is now
Table 3 shows a detailed comparison between various sub-network
with the footsteps of China’s Yishui Technology.
of different methods. Comparing P2T and VP2T, VP2T consid-
ers video information when predicting the pinyin sequence and
4.4 Attention Visualisation
achieves a lower error rate. This verifies the conjecture of [6] that
the generation of tones is related to the motion of the head. In terms Figure 4 (a) and Figure 4 (b) visualise the alignment of video frames
of overall performance, CSSMCM exceeds all the other architecture and Chinese characters predicted by CSSMCM and WAS respec-
on the CMLR dataset and achieves 32.48% character error rate. It tively. The ground truth sequence is "同时他还向媒体表示". Com-
is worth noting that CSSMCM-w/o video achieves the worst re- paring Figure 4 (a) with Figure 4 (b), the diagonal trend of the
sult (42.23% CER) even though its sub-networks perform well when video attention map got by CSSMCM is more obvious. The video
trained separately. This may be due to the lack of visual information attention is more focused where WAS predicts wrong, i.e. the area
to support, and the accumulation of errors. CSSMCM using tone in- corresponding to "还向". Although WAS mistakenly predicts the
formation performs better compared to LipCH-Net-seq, which does "媒体" as "么体", the "媒体" and the "么体" have the same mouth
not use tone information. The comparison results show that tone shape, so the attention concentrates on the correct frame.
is important when lip reading, and when predicting tone, visual It’s interesting to mention that in Figure 5, when predicting the
information should be considered. i-th character, attention is concentrated on the i + 1-th tone. This
Table 4 shows some generated sentences from different meth- may be because attention is applied to the outputs of the encoder,
ods. CSSMCM-w/o video architecture is not included due to its which actually includes all the information from the previous i + 1
relatively lower performance. These are sentences other methods timesteps. The attention to the tone of i + 1-th timestep serves as
fail to predict but CSSMCM succeeds. The phrase "实惠" (which the language model, which reduces the options for generating the
means affordable) in the first example sentence, has a tone of 2, 4 character at i-th timestep, making prediction more accurate.
and its corresponding pinyin are shi, hui. WAS predicts it as "事
会" (which means opportunity). Although the pinyin prediction is 5 SUMMARY AND EXTENSION
correct, the tone is wrong. LipCH-Net-seq predicts "实惠" as "吃贵" In this paper, we propose the CSSMCM, a Cascade Sequence-to-
(not a word), which have the same finals "ui" and the corresponding Sequence Model for Chinese Mandarin lip reading. CSSMCM is
mouth shapes are the same. It’s the same in the second example. designed to predicting pinyin sequence, tone sequence, and Chinese
"前, 天, 年" have the same finals and mouth shapes, but the tone is character sequence one by one. When predicting tone sequence, a
different. dual attention mechanism is used to consider video sequence and
MMAsia ’19, December 15–18, 2019, Beijing, China Ya Zhao, Rui Xu, and Mingli Song
SUHGLFWHG&KLQHVHVHQWHQFH

SUHGLFWHG&KLQHVHVHQWHQFH

YLGHRIUDPHV YLGHRIUDPHV
(a) (b)

Figure 4: Video-to-text alignment using CSSMCM (a) and WAS (b).

REFERENCES
SUHGLFWHG&KLQHVHVHQWHQFH

[1] Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas.
2016. Lipnet: Sentence-level lipreading. arXiv preprint (2016).
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine
Translation by Jointly Learning to Align and Translate. international conference
on learning representations (2015).
[3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam M. Shazeer. 2015. Sched-
uled sampling for sequence prediction with recurrent Neural networks. neural
information processing systems (2015), 1171–1179.
[4] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014.
Return of the Devil in the Details: Delving Deep into Convolutional Nets.. In
British Machine Vision Conference 2014.
[5] C. Julian Chen, Ramesh A. Gopinath, Michael D. Monkowski, Michael A. Picheny,
and Katherine Shen. 1997. New methods in continuous Mandarin speech recog-
nition.. In EUROSPEECH.
[6] Trevor H Chen and Dominic W Massaro. 2008. Seeing pitch: Visual information
for lexical tones of Mandarin-Chinese. The Journal of the Acoustical Society of
America 123, 4 (2008), 2356–2366.
SUHGLFWHGWRQHVHQWHQFH [7] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau,
Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase
Representations using RNN Encoder–Decoder for Statistical Machine Translation.
Figure 5: Aligenment between output characters and pre- In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
dicted tone sequences using CSSMCM. Processing (EMNLP). 1724–1734.
[8] Joon Son Chung, Andrew W Senior, Oriol Vinyals, and Andrew Zisserman. 2017.
Lip Reading Sentences in the Wild. In CVPR. 3444–3453.
[9] Joon Son Chung and Andrew Zisserman. 2016. Lip reading in the wild. In Asian
pinyin sequence at the same time. When predicting the Chinese Conference on Computer Vision. Springer, 87–103.
character sequence, a triplet attention mechanism is proposed to [10] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber.
take all the video sequence, pinyin sequence, and tone sequence 2006. Connectionist temporal classification: labelling unsegmented sequence
data with recurrent neural networks. In Proceedings of the 23rd international
information into consideration. CSSMCM consistently outperforms conference on Machine learning. 369–376.
other lip reading architectures on the proposed CMLR dataset. [11] Davis E King. 2009. Dlib-ml: A machine learning toolkit. Journal of Machine
Lip reading and speech recognition are very similar. In Chinese Learning Research 10, Jul (2009), 1755–1758.
[12] Min Lin, Qiang Chen, and Shuicheng Yan. 2014. Network In Network. interna-
Mandarin speech recognition, there have been kinds of different tional conference on learning representations (2014).
acoustic representations like syllable initial/final approach, syllable [13] Stavros Petridis and Maja Pantic. 2016. Deep complementary bottleneck features
for visual speech recognition. In 2016 IEEE International Conference on Acoustics,
initial/final with tone approach, syllable approach, syllable with Speech and Signal Processing (ICASSP). IEEE, 2304–2308.
tone approach, preme/toneme approach [5] and Chinese Character [14] Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang,
approach [16]. In this paper, the Chinese character is chosen as the Jingyun Xiao, Keyu Long, Shiguang Shan, and Xilin Chen. 2018. LRW-1000: A
Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild. arXiv
output unit. However, we find that the wrongly predicted characters preprint arXiv:1810.06990 (2018).
severely affect the understandability of transcriptions. Using larger [15] Xiaobing Zhang, Haigang Gong, Xili Dai, Fan Yang, Nianbo Liu, and Ming Liu.
output units, like Chinese words, maybe can alleviate this problem. 2019. Understanding Pictograph with Facial Features: End-to-End Sentence-level
Lip Reading of Chinese. In AAAI 2019 : Thirty-Third AAAI Conference on Artificial
Intelligence.
6 ACKNOWLEDGEMENTS [16] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. 2018. A Comparison of Mod-
eling Units in Sequence-to-Sequence Speech Recognition with the Transformer
This work is supported by National Key Research and Development on Mandarin Chinese. international conference on neural information processing
Program (2018AAA0101503) , National Natural Science Founda- 2018 (2018), 210–220.
[17] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. 2018. Syllable-Based Sequence-
tion of China (61572428,U1509206), Key Research and Development to-Sequence Speech Recognition with the Transformer in Mandarin Chinese.
Program of Zhejiang Province (2018C01004), and the Program of In- Proc. Interspeech 2018 (2018), 791–795.
ternational Science and Technology Cooperation (2013DFG12840).

International Journal of Cognitive Computing in Engineering: Harsh Ahlawat, Naveen Aggarwal, Deepti Gupta
No ratings yet
International Journal of Cognitive Computing in Engineering: Harsh Ahlawat, Naveen Aggarwal, Deepti Gupta
37 pages
Learning and Behavior 9th Edition Full Version Download
82% (11)
Learning and Behavior 9th Edition Full Version Download
17 pages
Hap Id 12534903
100% (3)
Hap Id 12534903
2 pages
DevOps Engineer Learning Path Guide
No ratings yet
DevOps Engineer Learning Path Guide
10 pages
A Novel Machine Lip Reading Model A Novel Machine Lip Reading Model
No ratings yet
A Novel Machine Lip Reading Model A Novel Machine Lip Reading Model
6 pages
Deep Learning Lip-Reading Survey
No ratings yet
Deep Learning Lip-Reading Survey
8 pages
Silent Speech Interpretation Using Ai: Dr.M. Hemalatha, M. Akshayaa
No ratings yet
Silent Speech Interpretation Using Ai: Dr.M. Hemalatha, M. Akshayaa
3 pages
Deep Audio-Visual Speech Recognition
No ratings yet
Deep Audio-Visual Speech Recognition
13 pages
Chung 18
No ratings yet
Chung 18
28 pages
Hybrid Attention Mechanisms in 3D CNN For Noise-Resilient Lip Reading in Complex Environments
No ratings yet
Hybrid Attention Mechanisms in 3D CNN For Noise-Resilient Lip Reading in Complex Environments
11 pages
LRW-1000: A Naturally-Distributed Large-Scale Benchmark For Lip Reading in The Wild
No ratings yet
LRW-1000: A Naturally-Distributed Large-Scale Benchmark For Lip Reading in The Wild
8 pages
Deep Learning Lip-Reading Survey
No ratings yet
Deep Learning Lip-Reading Survey
22 pages
Decoder-Encoder LSTM For Lip Reading: Souheil Fenghour Daqing Chen Perry Xiao
No ratings yet
Decoder-Encoder LSTM For Lip Reading: Souheil Fenghour Daqing Chen Perry Xiao
5 pages
Toward Language-Independent Lip Reading A Transfer Learning Approach
No ratings yet
Toward Language-Independent Lip Reading A Transfer Learning Approach
4 pages
A Lip Reading Method Based On 3D Convolutional Vision Transformer
No ratings yet
A Lip Reading Method Based On 3D Convolutional Vision Transformer
8 pages
NMITCON 2024 Submitted 23 24 B27 Lip Reading Using Deep Learning Project Paper
No ratings yet
NMITCON 2024 Submitted 23 24 B27 Lip Reading Using Deep Learning Project Paper
8 pages
Deep Learning for Visual Lip Reading
No ratings yet
Deep Learning for Visual Lip Reading
15 pages
Deep Learning For Lip Reading Using Audio-Visual Information For Urdu Language
No ratings yet
Deep Learning For Lip Reading Using Audio-Visual Information For Urdu Language
5 pages
Lipsound2: Self-Supervised Pre-Training For Lip-To-Speech Reconstruction and Lip Reading
No ratings yet
Lipsound2: Self-Supervised Pre-Training For Lip-To-Speech Reconstruction and Lip Reading
11 pages
Vision Based Lip Reading System Using Deep Learning: July 2022
No ratings yet
Vision Based Lip Reading System Using Deep Learning: July 2022
7 pages
Afouras Et Al - 2018 - Deep Lip Reading
No ratings yet
Afouras Et Al - 2018 - Deep Lip Reading
8 pages
Lip Reading with CNN for Noisy Environments
No ratings yet
Lip Reading with CNN for Noisy Environments
5 pages
LipReadNet: Advancing Lip Reading
No ratings yet
LipReadNet: Advancing Lip Reading
6 pages
Lipreading With 3D-2D-Cnn BLSTM-HMM and Word-Ctc Models
No ratings yet
Lipreading With 3D-2D-Cnn BLSTM-HMM and Word-Ctc Models
5 pages
Developing Phoneme-Based Lip-Reading Sentences System For Silent Speech Recognition
No ratings yet
Developing Phoneme-Based Lip-Reading Sentences System For Silent Speech Recognition
10 pages
DL Review
No ratings yet
DL Review
4 pages
Developing Phoneme Based Lip Reading Sentences System For Silent Speech Recognition
No ratings yet
Developing Phoneme Based Lip Reading Sentences System For Silent Speech Recognition
10 pages
Second Paper
No ratings yet
Second Paper
7 pages
Learning Spatio-Temporal Features With Two-Stream Deep 3D Cnns For Lipreading
No ratings yet
Learning Spatio-Temporal Features With Two-Stream Deep 3D Cnns For Lipreading
13 pages
2001 08702v1
No ratings yet
2001 08702v1
6 pages
A Multimodal German Dataset For Automatic Lip Reading Systems and Transfer Learning
No ratings yet
A Multimodal German Dataset For Automatic Lip Reading Systems and Transfer Learning
8 pages
Analysis of Lip-Reading Using Deep Learning Techniques A Review
No ratings yet
Analysis of Lip-Reading Using Deep Learning Techniques A Review
6 pages
Lip Reading Using CNN and LTSM
No ratings yet
Lip Reading Using CNN and LTSM
9 pages
Lip Reading Word Classification: Abiel Gutierrez Stanford University Zoe-Alanah Robert Stanford University
No ratings yet
Lip Reading Word Classification: Abiel Gutierrez Stanford University Zoe-Alanah Robert Stanford University
9 pages
Lip-Reading Dataset Construction
No ratings yet
Lip-Reading Dataset Construction
6 pages
Callip: Lipreading Using Contrastive and Attribute Learning: Yiyang Huang Xuefeng Liang Chaowei Fang
No ratings yet
Callip: Lipreading Using Contrastive and Attribute Learning: Yiyang Huang Xuefeng Liang Chaowei Fang
9 pages
Applsci 13 05389
No ratings yet
Applsci 13 05389
2 pages
ANN Paper
No ratings yet
ANN Paper
7 pages
1 s2.0 S2666764923000450 Main
No ratings yet
1 s2.0 S2666764923000450 Main
10 pages
Automatic Speech Recognition History
No ratings yet
Automatic Speech Recognition History
9 pages
Lip Reading With Hahn Convolutional Neural Networks
No ratings yet
Lip Reading With Hahn Convolutional Neural Networks
28 pages
Cep Report
No ratings yet
Cep Report
21 pages
Engineering Science and Technology, An International Journal
No ratings yet
Engineering Science and Technology, An International Journal
10 pages
23MCI10142, 23MCI10007 - Project Report
No ratings yet
23MCI10142, 23MCI10007 - Project Report
38 pages
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
No ratings yet
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
10 pages
Acoustic New
No ratings yet
Acoustic New
36 pages
Lip Reading Using External Viseme Decoding: 1 Javad Peymanfard 2 Mohammad Reza Mohammadi 3 Hossein Zeinali
No ratings yet
Lip Reading Using External Viseme Decoding: 1 Javad Peymanfard 2 Mohammad Reza Mohammadi 3 Hossein Zeinali
5 pages
2.1 s2.0 S0925231225009610 Main
No ratings yet
2.1 s2.0 S0925231225009610 Main
10 pages
Lipreading Using A Comparative Machine Learning Approach
No ratings yet
Lipreading Using A Comparative Machine Learning Approach
7 pages
ANN Paper
No ratings yet
ANN Paper
6 pages
Prajwal Sub-Word Level Lip Reading With Visual Attention CVPR 2022 Paper
No ratings yet
Prajwal Sub-Word Level Lip Reading With Visual Attention CVPR 2022 Paper
11 pages
Microcomputerschool 1994 Neural Networks in Speech
No ratings yet
Microcomputerschool 1994 Neural Networks in Speech
19 pages
Jarvis Digital Life Assistant IJERTV2IS1237 PDF
No ratings yet
Jarvis Digital Life Assistant IJERTV2IS1237 PDF
6 pages
An 20advanced 20NLP 20framework 20-Formatted 20paper-Libre
No ratings yet
An 20advanced 20NLP 20framework 20-Formatted 20paper-Libre
12 pages
Batch A3
No ratings yet
Batch A3
7 pages
Multi-Grained Spatio-Temporal Modeling For Lip-Reading: Wangchenhao17@mails - Ucas.ac - CN
No ratings yet
Multi-Grained Spatio-Temporal Modeling For Lip-Reading: Wangchenhao17@mails - Ucas.ac - CN
11 pages
Conversion of NNLM To Back Off Language Model in ASR
No ratings yet
Conversion of NNLM To Back Off Language Model in ASR
4 pages
Modeling of Speech Recognition Based On Deep Learning: Min Zhang
No ratings yet
Modeling of Speech Recognition Based On Deep Learning: Min Zhang
8 pages
Lip Reading Using Image Processing
No ratings yet
Lip Reading Using Image Processing
4 pages
Review I - Documentation Format
No ratings yet
Review I - Documentation Format
20 pages
Learning Individual Speaking Styles For Accurate L
No ratings yet
Learning Individual Speaking Styles For Accurate L
11 pages
Chi Square Test
No ratings yet
Chi Square Test
11 pages
Why Choose Jolly Phonics Flyer - 250125 - 035602
No ratings yet
Why Choose Jolly Phonics Flyer - 250125 - 035602
8 pages
SLW Investment Group
No ratings yet
SLW Investment Group
30 pages
Hydraulic System CX31 (UENR4778-01)
No ratings yet
Hydraulic System CX31 (UENR4778-01)
4 pages
Day Trading Capital Management Plan
No ratings yet
Day Trading Capital Management Plan
38 pages
Easter Events & Weather Forecast
No ratings yet
Easter Events & Weather Forecast
10 pages
Review of Anthropometric Considerations For Tractor Seat Design
No ratings yet
Review of Anthropometric Considerations For Tractor Seat Design
9 pages
Canara - Epassbook - 2024-05-13 09:12:52.002054
No ratings yet
Canara - Epassbook - 2024-05-13 09:12:52.002054
65 pages
01
No ratings yet
01
314 pages
POLITICAL SCIENCE Most Important Questions (Prashant Kirad) PDF Political Parties Elections
No ratings yet
POLITICAL SCIENCE Most Important Questions (Prashant Kirad) PDF Political Parties Elections
1 page
The Importance of Corporate Communications During Financial Crisis
No ratings yet
The Importance of Corporate Communications During Financial Crisis
12 pages
WEEK 4 - Hiking PPT With Youtube Links
No ratings yet
WEEK 4 - Hiking PPT With Youtube Links
25 pages
Random Vibration Fatigue Analysis of Car Roof Luggage Carrier - Gulsevincler 2021
No ratings yet
Random Vibration Fatigue Analysis of Car Roof Luggage Carrier - Gulsevincler 2021
12 pages
MCQ
67% (3)
MCQ
274 pages
Nanto Company Profile & Introduction Letter & ISO
No ratings yet
Nanto Company Profile & Introduction Letter & ISO
15 pages
Contact Process for Sulphuric Acid
No ratings yet
Contact Process for Sulphuric Acid
8 pages
SanyaMidha FullStackWebDeveloper Resume
100% (1)
SanyaMidha FullStackWebDeveloper Resume
1 page
Cost Concepts Quiz
No ratings yet
Cost Concepts Quiz
11 pages
Agency Sales Call Script
No ratings yet
Agency Sales Call Script
4 pages
11 Ergonomics in Osh
No ratings yet
11 Ergonomics in Osh
9 pages
VFD Application Checklist
No ratings yet
VFD Application Checklist
3 pages
05 Dispute
No ratings yet
05 Dispute
29 pages
Link Game PPSSPP (Sfile
100% (1)
Link Game PPSSPP (Sfile
9 pages
MyEdBC Family Portal Instructional Manual
No ratings yet
MyEdBC Family Portal Instructional Manual
6 pages
Chapter 7 Input Tax Credit Under GST
No ratings yet
Chapter 7 Input Tax Credit Under GST
28 pages
Combustion Tutorials 3dsmax Elements
No ratings yet
Combustion Tutorials 3dsmax Elements
30 pages

Cascade

Uploaded by

Cascade

Uploaded by

A Cascade Sequence-to-Sequence Model for Chinese Mandarin

ABSTRACT speech recognition offers an alternative way to understand speech.

(hdc )i = GRUcd ((hdc )i−1 , yi−1 ), (11) 3 DATASET

Method Chinese Character Sentence Pinyin Sequence Tone Sequence

Figure 4: Video-to-text alignment using CSSMCM (a) and WAS (b).

You might also like