Lipreading Using a Comparative Machine Learning Approach

Ziad Thabet, Faculty of Computer Science, Misr International University, Cairo, Egypt, [email protected]
Amr Nabih, Faculty of Computer Science, Misr International University, Cairo, Egypt, [email protected]
Karim Azmi, Faculty of Computer Science, Misr International University, Cairo, Egypt, [email protected]
Youssef Samy, Faculty of Computer Science, Misr International University, Cairo, Egypt, [email protected]
Ghada Khoriba, Faculty of Computers and Information, Helwan University, Cairo, Egypt, [email protected]
Mai Elshehaly, School of Computing, University of Leeds, Leeds, UK, [email protected]
Abstract—Lipreading is the process of interpreting spoken words by observing lip movement. It plays a vital role in human communication and speech understanding, especially for hearing-impaired individuals. Automated lipreading approaches have recently been used in applications such as biometric identification, silent dictation, forensic analysis of surveillance camera capture, and communication with autonomous vehicles. However, lipreading is a difficult process that poses several challenges to human- and machine-based approaches alike. This is because the large number of phonemes in human language is visually represented by a smaller number of lip movements (visemes). Consequently, the same viseme may represent several phonemes, which confuses any lipreader. In this paper, we present a detailed study of machine learning approaches for the real-time visual recognition of spoken words. Our focus on real-time performance is motivated by the recent trend of using lipreading in autonomous vehicles. Nine different classifiers were implemented and tested for lipreading, and their confusion matrices over different groups of words are reported. The three best-performing classifiers were Gradient Boosting, Support Vector Machine (SVM), and logistic regression, with accuracies of 64.7%, 63.5%, and 59.4%, respectively.

Index Terms—Lipreading, Classification, Autonomous Vehicles, Speech Recognition.

I. INTRODUCTION

Lipreading, widely known as visual speech recognition (VSR), is a process that aims to interpret and understand spoken words using only the visual signal produced by lip movement. Lipreading plays a crucial role in both human-human and human-computer interaction. For example, people use lipreading in their daily conversations to understand one another in noisy environments and in situations where the audio speech signal is not readily comprehensible. For the same reason, the skill of lipreading has long been mastered by individuals with hearing impairment: it enables them to understand speech and maintain social activities without relying on the perception of sounds.

The recent advent of novel machine learning and signal processing approaches has increased researchers' interest in automating the process of lipreading. This attention is motivated by the promising results of lipreading in application areas such as human-computer interaction, forensic analysis of surveillance camera capture, biometric identification, silent dictation, and autonomous vehicles [1].

However, the recognition of lip motion presents several challenges to linear classifiers, mainly because the features used in classification are calculated from a sequence of shapes that the lips take, known as "visemes". The number of visemes that the lips can take is between 10 and 14 [2], whereas the number of phonemes (i.e., acoustic sounds) that can be produced by these visemes exceeds 50. This mismatch between the visual and audio signals creates new horizons in machine learning research: it motivates the quest for improved visual features and classifiers to bridge the gap between what has been spoken and what is visually perceived.

In this paper, we present LipDrive: a novel system for visual speech recognition that targets autonomous vehicles as an application. We focus on autonomous vehicles because of the thriving nature of this application area and the possibilities that lipreading offers within it. A human-computer interaction approach is taken to characterize the challenges and opportunities of lipreading in facilitating communication between humans and autonomous vehicles, especially in noisy car environments. Furthermore, we present a comparative analysis of nine different linear classifiers that we tested in LipDrive. Their performance was studied in lipreading using raw visual features as well as a preprocessed feature set. Through our experimental results, we aim to propose a set of guidelines that can steer the choice of classification method and preprocessing steps for researchers working in the area of lipreading.

The main contributions of this paper can be summarized as follows:
• A novel lipreading system called LipDrive that is to be deployed in an autonomous vehicle setting
• A comparative analysis of nine classifiers for lipreading
• Experimental results using preprocessed and raw visual features for classification
• A set of design guidelines for visual speech recognition

Section II provides a full description of the state of the art in lipreading research. The rest of the paper is organized as follows: Section III describes the LipDrive system, Section IV outlines the experimental approach, and Section V presents the results of our comparative analysis. Finally, Section VI offers our concluding remarks and lays the foundation for our future work.

II. RELATED WORK

Assael et al. [3] showed that lip movements can be extracted while speaking and converted into written text, and that the conversion process can operate at the sentence level instead of the word level. They cited several difficulties encountered during their experiments, such as designing and learning the facial features and predicting the sentence itself, and they applied different deep learning approaches to extract the lip movements and classify the spoken words.

Chung et al. [4] showed that lip recognition systems can understand spoken words using only visual features, and that such systems can help recognize spoken words in corrupted videos that lack audio. Aiming to build a system that reads lips independently of the speaker, they collected a large dataset from TV broadcasts and built deep learning architectures that effectively learn and recognize hundreds of words.

Garg et al. [5] discussed different methods for predicting words and phrases from videos without their audio tracks. They also noted that visual lipreading is important in human-computer interaction and can replace audio speech recognition, which may struggle in noisy environments and with the variation of inputs caused by different people speaking with different accents. They concatenated a fixed number of images as input to a pre-trained VGGNet model, used nearest-neighbor interpolation to normalize the number of images per sequence, and fed the features extracted by the VGGNet model to LSTM and RNN models to classify the word.

Rathee [6] defined lipreading as the recognition of lip movement patterns while speaking and noted that visual speech recognition has motivated researchers towards lipreading, since speech recognition systems face major problems in noisy environments and lipreading can help hearing- or speech-impaired people communicate normally with others. The proposed automated lipreading algorithm consists of two main steps: feature extraction and word classification. Feature extraction comprises five steps: video acquisition, face and mouth detection, intensity equalization, keypoint extraction, and geometric feature extraction. Word classification is done using a Learning Vector Quantization neural network.

Lesani et al. [7] introduced lip authentication as a new method for mobile phone security. This method could be used in mobile banking applications to ensure the security of customers' accounts: the mobile phone camera captures the lip movements and sends them to lipreading algorithms that classify the security word, such as a password.

III. SYSTEM OVERVIEW

In order to reach high accuracy with real-time recognition of spoken words, the LipDrive system consists of six different data processing stages. Figure 1 depicts these six stages in the form of a pipeline, and a detailed description of each stage is given in this section.

Fig. 1. System Overview

A. Image Acquisition

The image acquisition stage receives a raw video as input. This video captures a spoken word within a specific environment. This stage aims to create a sequence of frames, or images, from the captured video and to reduce the effect of environmental factors on the quality of the frames in the sequence.

The captured video is first sliced into individual frames using the OpenCV Python library. Next, the acquired frames are converted to gray scale, also using OpenCV. The resulting frames are then passed along to the next stage for feature extraction.
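The following Python sketch illustrates this stage, assuming OpenCV is installed and imported as cv2; the video file name in the usage comment is illustrative only.

import cv2

def acquire_frames(video_path):
    """Slice a video into gray-scale frames (a sketch of the acquisition stage)."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:  # no more frames in the video
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    capture.release()
    return frames

# Example usage (illustrative file name):
# gray_frames = acquire_frames("about_00001.mp4")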
B. Feature Extraction

The goal of the feature extraction stage is to reduce the size of the images that are received from the acquisition stage. This step is motivated by the famous curse of dimensionality in machine learning [8]. Namely, if we were to use the original images as input to the classifier, each pixel would represent a feature. The number of pixels is typically large and varies with image resolution and camera quality, which threatens the reliability and efficiency of reading the lips from the received images. Furthermore, the majority of the captured pixels are irrelevant to the classifier.
Therefore, instead of passing videos and pictures, we extract only the needed features from the videos. This is realized by passing the gray-scale images to the feature extraction stage using DLib, a modern C++ library that implements a multitude of machine learning algorithms [9]. The "Shape Predictor 68 Face Landmarks" model is used to detect the human face in each image and to extract the 68 landmarks of the face. These landmarks represent points on the mouth, nose, eyes, and so forth, as shown in Figure 2. The number of landmarks is then reduced to the twenty points of each frame that represent the features of the lips, as shown in Figure 3. The points of each frame are then translated to the Z-order by calculating a z-value that maps a 2D point (x, y) to a one-dimensional value. This value is calculated by interleaving the binary representations of its coordinate values.

Fig. 2. Face Detection using DLib

Fig. 3. Lip Feature Extraction
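A minimal sketch of this step is shown below, assuming the dlib Python bindings and the publicly distributed shape_predictor_68_face_landmarks.dat model file; the mouth corresponds to landmark indices 48-67, and interleave_bits is an illustrative Morton (Z-order) encoding.

import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of x and y into one integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def lip_features(gray_frame):
    """Return Z-order values for the 20 mouth landmarks (points 48-67) of the first detected face."""
    faces = detector(gray_frame)
    if not faces:
        return None  # no face detected in this frame
    shape = predictor(gray_frame, faces[0])
    mouth = [shape.part(i) for i in range(48, 68)]
    return [interleave_bits(p.x, p.y) for p in mouth]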
C. Cropping

Taking into consideration the various positions of the user in front of the camera, which yield different positions of the speaker's facial landmarks, we crop the image to the mouth level to reduce the environmental variability in the extracted features. In addition, the distance between the speaker's face and the camera can vary from one speaker to another, which makes the extracted features ambiguous at times. To unify them, all images are normalized to the same width and to a proportional height. Equation 1 defines the calculation of a normalized point (p*_x, p*_y) from each point (p_x, p_y) on the lips, in which a normalization scale is defined as θ:

(p*_x, p*_y) = (p_x × R, p_y × (H × R) / H)    (1)

where R = θ / (X_max − p_x) and H = Y_max − p_y.
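As a worked sketch of Equation (1), the helper below applies the normalization to a single lip point; the function name and the example values of θ, X_max, and Y_max are illustrative, and the formula is kept exactly as printed.

def normalize_point(px, py, x_max, y_max, theta):
    """Normalize a lip landmark (px, py) according to Equation (1)."""
    r = theta / (x_max - px)   # R = theta / (X_max - p_x)
    h = y_max - py             # H = Y_max - p_y
    # Note: (h * r) / h reduces algebraically to r; kept to mirror Equation (1).
    return px * r, py * (h * r) / h

# Example with illustrative values: theta = 100, (X_max, Y_max) = (320, 240).
# print(normalize_point(150, 200, 320, 240, 100))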
D. Concatenation

Individual frames are passed through the face detection, feature extraction, cropping, and normalizing processes described above. However, since a word is rarely classified from an individual frame, we concatenate the frames back to form a sequence of feature vectors (Figure 4). This process creates a training dataset that has the sequence of feature vectors as input and the spoken word as the class label. For example, if the word "ABOUT" is captured in 10 frames, each of which contains 20 features, this leads to 200 features that form the sequence for that word.

Fig. 4. Frames Concatenation
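A minimal sketch of the concatenation step, assuming per-frame feature lists such as those produced in the feature extraction stage; NumPy is used only for convenience.

import numpy as np

def concatenate_sequence(frame_features):
    """Flatten a list of per-frame feature vectors into one sequence vector.

    For example, 10 frames of 20 lip features each become a single
    200-dimensional vector labelled with the spoken word.
    """
    return np.concatenate([np.asarray(f, dtype=float) for f in frame_features])

# Example: 10 frames x 20 features -> one 200-dimensional training sample.
# sample = concatenate_sequence(per_frame_lip_features)  # shape (200,)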
E. Training and Validation

In this stage, sequences of feature vectors are collected for a number of words. We strive to create a large enough training dataset, so for each word we consider a number of videos that capture the same word as spoken by different individuals. The feature vector sequences for a set of words are then fed into the classifier for training and model generation purposes. Next, a different dataset is fed to the classifier for validation.

F. Classification

Once a classifier model is built and validated during the previous stage, we get to the point of real-time lipreading. In this stage, we extract the same features from the user's face. We note here that videos are captured via a portable device and streamed continuously to the server for feature extraction. The extracted features are then fed to the classifier model so that it can predict the spoken word, which is then translated into a command. Finally, the server responds by executing the intended command.
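The sketch below condenses the training, validation, and classification stages using scikit-learn; the randomly generated arrays and the COMMANDS mapping are placeholders standing in for the concatenated sequence vectors, word labels, and vehicle commands described above.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: 100 sequences of 200 features each, labelled with 5 words.
rng = np.random.default_rng(0)
words = ["ABOUT", "AROUND", "ATTACK", "BENEFITS", "BETWEEN"]
X_train, y_train = rng.random((100, 200)), rng.choice(words, 100)
X_val, y_val = rng.random((30, 200)), rng.choice(words, 30)

# Training and validation.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

# Classification: a streamed video reduced to one 200-dimensional sequence
# vector is classified, and the word is mapped to an (illustrative) command.
COMMANDS = {word: "command_" + word.lower() for word in words}
sequence = rng.random((1, 200))
predicted_word = model.predict(sequence)[0]
print(predicted_word, "->", COMMANDS[predicted_word])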
IV. DESIGN OF EXPERIMENTS

The six-stage approach is tested with nine different classifiers. The purpose of our experiments is to gain a deeper understanding of the strengths and weaknesses of the classifiers while attempting to classify different words with varying visual and phonetic similarities.

A. Dataset

In order to cover a breadth of training words and to have a large dictionary, we have to work on a large-scale dataset. In our experiments, we used a benchmark dataset that consists of about one million instances of 500 different words spoken by different speakers [10]. Each word has 1000 training videos, 50-100 validation videos, and 50-100 testing videos. All videos are 29 frames long with a duration of 1.16 seconds, and the spoken word occurs in the middle of the video. In addition, the speakers' positions are not fixed: their faces are not always looking directly at the camera, some speakers talk facing someone next to them, and some faces are far from the camera, which makes the dataset more challenging.

B. Procedure

Using our large-scale dataset, composed of sequences of vectors and labels, we tested different classifiers in order to fit the data. After extracting the lip features, we went through several stages. First, we used 5 words from our training dataset to feed each classifier; those words are "ABOUT", "AROUND", "ATTACK", "BENEFITS", and "BETWEEN". These words were chosen due to their visual and phonetic similarity. For example, "ABOUT" and "AROUND" have almost the same lip movements, which makes it more challenging for the classifiers to distinguish them. Second, we predicted different sequences for the same 5 words, but from our testing dataset. Third, we calculated the accuracy of each classifier using predefined scoring functions. Fourth, we visualized each classifier's confusion matrix.
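A sketch of this procedure with scikit-learn is given below; the classifier list mirrors the experiments reported in the next section, and the randomly generated arrays are placeholders for the concatenated sequences of the five chosen words.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

classifiers = {
    "Naive Bayes": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "SGD": SGDClassifier(),
    "MLP": MLPClassifier(solver="lbfgs", max_iter=500),
    "AdaBoost": AdaBoostClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

# Placeholder data standing in for the concatenated lip-feature sequences.
rng = np.random.default_rng(1)
words = ["ABOUT", "AROUND", "ATTACK", "BENEFITS", "BETWEEN"]
X_train, y_train = rng.random((200, 200)), rng.choice(words, 200)
X_test, y_test = rng.random((50, 200)), rng.choice(words, 50)

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print(name, accuracy_score(y_test, predictions))
    print(confusion_matrix(y_test, predictions, labels=words))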
V. EXPERIMENTAL RESULTS

In this section, the accuracy achieved by each classifier is reported, the confusion matrix resulting from each is visualized, and some general guidelines are discussed based on our findings.
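The confusion matrices shown in the figures below can be rendered with scikit-learn and matplotlib; a minimal sketch follows, with purely illustrative counts that are not the paper's results.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

words = ["ABOUT", "AROUND", "ATTACK", "BENEFITS", "BETWEEN"]
# Illustrative counts only: rows are true words, columns are predicted words.
cm = np.array([
    [30,  5,  2,  1, 12],
    [ 6, 28,  3,  2, 11],
    [ 2,  4, 35,  5,  4],
    [ 1,  3,  6, 36,  4],
    [10,  8,  2,  3, 27],
])
ConfusionMatrixDisplay(cm, display_labels=words).plot(cmap="Blues")
plt.title("Confusion matrix (illustrative)")
plt.show()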
1) Experiment 1 - Naive Bayes (NB): We used the Naive Bayes classifier, which is based on Bayes' theorem for object classification, and we obtained an accuracy of 26.6%. Figure 5 depicts the confusion matrix for this experiment; as it shows, Naive Bayes performs very poorly when dealing with a large number of features.

Fig. 5. Naive Bayes' Confusion Matrix

2) Experiment 2 - Quadratic Discriminant Analysis (QDA): We used Quadratic Discriminant Analysis, which fits class densities to the data and is based on Bayes' theorem, and we obtained an accuracy of 32.3%. Figure 6 depicts the confusion matrix for this experiment.

Fig. 6. Quadratic Discriminant Analysis's Confusion Matrix

3) Experiment 3 - SGDClassifier: We used the SGDClassifier and obtained an accuracy of 45.9%. Figure 7 depicts the confusion matrix for this experiment.

Fig. 7. SGDClassifier's Confusion Matrix
4) Experiment 4 - Multi-Layer Perceptron Classifier (MLP): We used the Multi-Layer Perceptron, a neural network classifier that optimizes the log-loss function using LBFGS. We obtained an accuracy of 48.3%. Figure 8 depicts the confusion matrix for this experiment.

Fig. 8. Multi-Layer Perceptron Classifier's Confusion Matrix

5) Experiment 5 - AdaBoost Classifier: We used the AdaBoost classifier, which fits a model on the training dataset and then fits additional copies of that model on the same data with the weights of misclassified instances adjusted. We obtained an accuracy of 54.5%. Figure 9 depicts the confusion matrix for this experiment.

Fig. 9. AdaBoost Classifier's Confusion Matrix

6) Experiment 6 - Linear Discriminant Analysis Classifier (LDA): We used the Linear Discriminant Analysis classifier, which fits class densities to the data and is based on Bayes' theorem. We obtained an accuracy of 56.1%. Figure 10 depicts the confusion matrix for this experiment.

Fig. 10. Linear Discriminant Analysis Classifier's Confusion Matrix

7) Experiment 7 - Logistic Regression Classifier (LR): We used the Logistic Regression classifier, which analyzes independent variables to determine an outcome. We obtained an accuracy of 59.4%. Figure 11 depicts the confusion matrix for this experiment.
Fig. 11. Logistic Regression Classifier's Confusion Matrix

8) Experiment 8 - Support Vector Machine Classifier (SVM): We used the Support Vector Machine classifier, which analyzes data for classification and regression analysis. We obtained an accuracy of 63.5%. Figure 12 depicts the confusion matrix for this experiment.

Fig. 12. Support Vector Machine's Confusion Matrix

9) Experiment 9 - Gradient Boosting Classifier: We used the Gradient Boosting classifier, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. We obtained an accuracy of 64.7%. Figure 13 depicts the confusion matrix for this experiment.

Fig. 13. Gradient Boosting Classifier's Confusion Matrix

A. Discussion of Results

The results show that most of the classifiers confuse the words "ABOUT" and "BETWEEN", and that the accuracies of the classifiers are fairly close to one another; this is due to the small number of words being trained. However, as the number of words to be trained increases, the accuracy of the linear classifiers starts to decrease in proportion to the number of words. Thus, neural network classifiers become essential for a large-scale dataset, and this was clear when we started to use the large dataset with the MLP classifier. Meanwhile, the use of CNNs is recommended in order to obtain promising results. In addition, to ensure high accuracy with real-time processing, we recommend using RNN and LSTM classifiers.

VI. CONCLUSION AND FUTURE WORK

Lipreading is a new way of enhancing speech recognition; however, there are several constraints on reaching high accuracy. One of these constraints is the varying lighting conditions that the camera may face, since the lipreading process is mainly conducted under ideal lighting conditions. In addition, the position of the speaker's face relative to the camera matters: the speaker has to look directly at the camera to ensure clear detection. Not only the position of the speaker but also the distance between the speaker and the camera has to be close enough to detect the lips clearly, so the speaker should avoid being too far away when delivering a command. Furthermore, working at the phoneme level would widen the set of words that can be detected and would make detection easier. We also believe that using audio-visual methods would increase the accuracy, meaning that we can depend on both the sound and the lip movements to ensure high accuracy. Extracting features using the discrete wavelet transform (DWT) may also give a considerable boost to classifier accuracy, since it reduces the number of dimensions being processed.
REFERENCES

[1] A. Hassanat, "Visual speech recognition," arXiv preprint arXiv:1409.1411, 2014.
[2] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden, "Improving visual features for lip-reading," in Auditory-Visual Speech Processing 2010, 2010.
[3] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: end-to-end sentence-level lipreading," 2016.
[4] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Asian Conference on Computer Vision, pp. 87-103, Springer, 2016.
[5] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu, "Towards better analysis of deep convolutional neural networks," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 91-100, 2017.
[6] N. Rathee, "A novel approach for lip reading based on neural network," in Computational Techniques in Information and Communication Technologies (ICCTICT), 2016 International Conference on, pp. 421-426, IEEE, 2016.
[7] F. S. Lesani, F. F. Ghazvini, and R. Dianat, "Mobile phone security using automatic lip reading," in e-Commerce in Developing Countries: With focus on e-Business (ECDC), 2015 9th International Conference on, pp. 1-5, IEEE, 2015.
[8] P. Domingos, "A few useful things to know about machine learning," Communications of the ACM, vol. 55, no. 10, pp. 78-87, 2012.
[9] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, "Face detection in color images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-706, 2002.
[10] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Asian Conference on Computer Vision, 2016.