
International Research Journal of Engineering and Technology (IRJET) | Volume: 07 Issue: 03 | Mar 2020 | www.irjet.net | e-ISSN: 2395-0056 | p-ISSN: 2395-0072

Automatic Lip Reading: Classification of Words and Phrases using Convolutional Neural Network

Nikhil Kesarkar1, Poornachandra P. Kongara2, Manthan Kothari3, Asst. Prof. Suresh Mestry4

1,2,3 B.E. Computer Science, Rajiv Gandhi Institute of Technology, Mumbai University, Maharashtra, India
4 Assistant Professor, Dept. of Computer Science Engineering, Rajiv Gandhi Institute of Technology, Mumbai University, Maharashtra, India
Abstract - Previous studies on human-machine interaction have determined that visual information can augment speech recognition accuracy, especially in noisy surroundings. Here we show a model for predicting words from video data without audio. Although existing models have succeeded in incorporating visual data into speech recognition, each has some deficiency. To overcome this, we pre-process the data by using the haar-cascade model [2] to detect and crop around the subject's face in all frames of the video data, and then use the sequence of frames as input to the model. Our proposed lip-reading classification model is unique in its usability across all ranges of speakers.

Key Words: deep learning, automatic lip-reading, machine learning, convolutional neural network, image processing.

1. INTRODUCTION

Lip-reading is understanding lip movement in order to make out the spoken words when there is no access to sound. Human beings identify a speaker's speech using multimodal information: besides speech audio, visual data such as lip and tongue movement can also aid in understanding speech. Using visual information, that is, watching lip movements to understand what a subject is saying, is referred to as lip-reading. Lip-reading can help in understanding speech, especially in noisy environments or in environments with no sound at all. This paper probes the task of recognizing speech from video data without audio. We propose a neural network with a definite amount of success in this classification task. The input we provide to our algorithm is a sequence of images taken from video data. We use a convolutional neural network (CNN) model to output one of 10 words and 10 phrases spoken by the subject.

Given that video traffic on the internet is growing at a steady rate, the proposed model could help extract and process video data to find missing information or gain insights into the actions or topics that occur in a video. Applications of lip-reading classification range from profanity detection on social media sites to a live lip-reading mobile application.

In the past, several research efforts have been made, but they were more focused on gesture recognition, which makes this project a relatively new and exciting field to explore. A few systems and applications for lip-reading exist, but most of them do not use neural networks; instead they use other machine learning techniques. More advanced visual speech recognition models, such as Google DeepMind's LipNet [3], were published only a few months ago.

2. RELATED WORKS

This section covers the existing work that has been done in the field. Machine learning methods are used in many of the approaches; the concept of deep learning has emerged only in the last few years. The effect of coarticulation and the diversity of visual features are the two main challenges involved in the lip-reading process.

The authors of [6] developed different methods for predicting words and phrases from videos without accompanying audio. They also discussed that the visual lip-reading process is important in human-machine interaction and could replace audio speech recognition, which machines find difficult in noisy environments and with speakers' differing accents. The researchers concatenated a fixed number of images as input to a pre-trained VGG model. The nearest-neighbour interpolation method was used to normalize the sequence of images, and the features extracted by VGG-Net were fed to an LSTM/RNN for word classification.

Rathee [7] defines lip-reading as the recognition of lip movement patterns while speaking. Speech recognition systems face major problems in noisy environments, and lip-reading can help the hearing impaired communicate with hearing people. The proposed algorithm has two main steps: feature extraction and word classification. Feature extraction proceeds in five steps: video acquisition, face and mouth detection, intensity equalization, key-point extraction, and geometric feature extraction. Word classification is done using a Learning Vector Quantization neural network.


The lip-reading system of [8] proposed a convolutional neural network-based feature extractor. The speaker's mouth-region images, along with phoneme labels, are used for network training. In that paper, six independent CNN models are used for six different speakers, giving an average phoneme recognition rate of 58% for 40 phonemes across the six speakers.

In [9], the processing pipeline that performs lip-reading is based purely on neural networks. A single structure is formed by stacking Long Short-Term Memory layers. The inputs are raw mouth images, and the performance of the network is experimentally measured and compared to a standard SVM classifier. Evaluation is performed on data from 19 speakers of the GRID corpus. Using the end-to-end neural network-based architecture, the best word accuracy reported in that paper is 79.6%.

In [10], the authors proposed a Deep Convolutional Neural Network classifier along with a VGG model. The VidTIMIT database is used, which includes 43 people speaking short sentences with head movements and a suitable amount of delay. Accuracies for the different classifiers used are FRS 93.53%, ARS 80.66%, and fusion 97.33%.

In [11], the authors proposed a methodology in which a deep belief network is used along with a conventional HMM and a VPR model. The CUAVE dataset is used, which includes 36 speakers uttering over 7000 connected and isolated digits. The reported PER for the classifiers used is 69.36%; when VPR is used, accuracy increases to 45.63%. The advantage of this method is that it is simple and fast to interpret, and accurate results are produced when the input is similar to the dataset. However, accuracy is very low for all of the classifiers used, and prediction becomes difficult when the input differs greatly from the training dataset.

3. DATASET AND FEATURES

The MIRACL-VC1 dataset [1] is used here. It consists of 15 speakers, 10 women and 5 men, each uttering a set of ten words and ten phrases ten times.

It is a lip-reading dataset consisting of both depth and colour images, and it can be used in a variety of research fields such as visual speech recognition, face detection, and biometrics. The dataset consists of a synchronized sequence of colour and depth images for each instance (both 640x480 pixels), collected at 15 frames per second. For each word or phrase, there are approximately 4 to 27 image frames. The dataset contains a total of 3000 instances (15 speakers uttering 10 words and 10 phrases, 10 times each, i.e. 15*20*10).

The words and phrases in the dataset are shown in Table 1.

Words: Begin, Choose, Connection, Navigation, Next, Previous, Start, Stop, Hello, Web

Phrases: Stop navigation, Excuse me, I am sorry, Thank you, Good bye, I love this game, Nice to meet you, You are welcome, How are you?, Have a good time

Table 1: Words and Phrases in the dataset

In addition to the dataset as published, we added 2 more speakers who utter the same set of ten words and ten phrases ten times each, so that the model gets more data for training and validation. In total we therefore have 3400 instances (17 speakers * 20 items * 10 repetitions). We split the dataset into two parts, training and testing: 80% of each speaker's instances (2720 overall) for training and 20% of each speaker's instances (680 overall) for testing.
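To make the split concrete, here is a minimal sketch of the per-speaker 80/20 split in Python (the directory layout and file naming are our assumptions for illustration, not the authors' actual script):

```python
import os
import random

# Assumed layout: instances/<speaker_id>/<instance_id>.png, one
# concatenated image per uttered word or phrase (illustrative only).
DATA_DIR = "instances"

train_set, test_set = [], []
for speaker in sorted(os.listdir(DATA_DIR)):
    items = sorted(os.listdir(os.path.join(DATA_DIR, speaker)))
    random.Random(42).shuffle(items)   # reproducible shuffle per speaker
    cut = int(0.8 * len(items))        # 80% of *each* speaker for training
    train_set += [(speaker, f) for f in items[:cut]]
    test_set += [(speaker, f) for f in items[cut:]]

# With 17 speakers x 200 instances each, this yields 2720 train / 680 test.
```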

4. METHODOLOGY

4.1 Pre-Processing

4.11 Face Extraction

Pre-processing was an important task performed on the dataset. Our pre-processing extracts the facial part of each speaker. This step was done with the Python libraries dlib and OpenCV, along with a pretrained model, the haar cascade model [2]; the points of the facial structure were obtained, and we then used these points to crop the speaker's face, as shown in Figure 1.

Figure 1: Elimination of background & extraction of face
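A minimal sketch of this face-extraction step with OpenCV's bundled pretrained Haar cascade is shown below (the paper also mentions dlib for facial landmark points; for brevity this sketch crops on the cascade's face detection alone, and the helper name is ours):

```python
import cv2

# OpenCV ships the pretrained frontal-face Haar cascade [2] with the library.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame):
    """Detect the largest face in a BGR frame and return the cropped region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                     # no face found in this frame
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])  # keep largest detection
    return frame[y:y + h, x:x + w]      # background eliminated, face kept
```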

4.12 Concatenation and Resizing

Once we have the cropped faces, we concatenate the frames of each utterance into a single image, as shown below. This puts all the frames of a speaker uttering a word into a single image. The concatenation works as follows: if there are 6 frames, we place 3 images in the 1st row and 3 in the 2nd row, as shown in Figure 2.

Figure 2: Concatenation of 6 Frames

If we have 7 frames, we place 6 frames as above and the 7th frame in a 3rd row. To fill this row completely, the 7th frame is added twice more in the 3rd row, as shown in Figure 3.

Figure 3: Concatenation of 7 Frames

Due to the concatenation, the size of the individual concatenated frames varies, so we have to reshape them to one single size for all frames. For our model, we set the size to (224, 224).
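A sketch of this concatenation-and-resizing step follows (three faces per row, the last frame repeated to fill an incomplete final row, then one resize to 224x224; the fixed per-cell size is our assumption, since tiling requires equal-sized cells):

```python
import cv2
import numpy as np

def concatenate_frames(faces, cols=3, cell=(100, 100), out_size=(224, 224)):
    """Tile cropped face images into a grid, 3 per row, padding the last
    row by repeating the final frame, then resize to a single fixed size."""
    faces = [cv2.resize(f, cell) for f in faces]  # equal cells for tiling
    while len(faces) % cols != 0:
        faces.append(faces[-1])         # e.g. 7 frames -> 7th repeated twice
    rows = [np.hstack(faces[i:i + cols]) for i in range(0, len(faces), cols)]
    grid = np.vstack(rows)              # e.g. 6 frames -> 2 rows of 3
    return cv2.resize(grid, out_size)   # uniform (224, 224) input for the CNN
```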

4.2 CNN Implementation

In this section we describe the implementation of the CNN model that we used to classify the words and phrases in the dataset. Our CNN model consists of 4 convolutional layers and 2 fully connected layers. The fully connected layers end in a softmax activation that produces a probability for every word and phrase, of which we take the highest. The CNN model structure is shown in Figure 4.

Figure 4: CNN Model Architecture

Each individual concatenated frame is fed to the convolutional neural network. Once training and validation are complete, the CNN model is saved to disk for prediction.
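A hedged Keras sketch of such a network is given below (4 convolutional layers and 2 fully connected layers ending in a 20-way softmax for the 10 words and 10 phrases; the filter counts, kernel sizes, and pooling are our assumptions, as the paper specifies only the layer counts):

```python
from tensorflow.keras import layers, models

def build_model(num_classes=20, input_shape=(224, 224, 3)):
    """4 convolutional layers + 2 fully connected layers, softmax output."""
    return models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),             # first FC layer
        layers.Dense(num_classes, activation="softmax"),  # class probabilities
    ])
```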

4.3 Workflow

Figure 5: Workflow

Figure 5 shows the workflow. Prediction for lip reading takes a video as input, from which frames are extracted at a rate of 15 fps. The face is cropped from the extracted frames using the haar-cascade model [2]. Concatenation and resizing are then applied to the cropped face frames, and the concatenated images are resized to 224x224 pixels.
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2469
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072

The resized images are then given as input to the CNN model
for classification. The result obtained is the classified word
or phrase.
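A sketch of this end-to-end prediction path is shown below; it assumes the crop_face and concatenate_frames helpers from the earlier sketches are in scope, and the model file name is illustrative:

```python
import numpy as np
import cv2
from tensorflow.keras.models import load_model

def predict_utterance(video_path, labels, model_path="lipread_cnn.h5"):
    """Video -> frames -> face crops -> concatenated grid -> class label."""
    model = load_model(model_path)          # CNN previously saved to disk
    capture = cv2.VideoCapture(video_path)  # dataset videos run at 15 fps
    faces = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        face = crop_face(frame)             # Haar-cascade crop (Section 4.11)
        if face is not None:
            faces.append(face)
    capture.release()
    grid = concatenate_frames(faces)        # tile + resize to 224x224
    probs = model.predict(grid[np.newaxis] / 255.0)[0]  # assumed 0-1 scaling
    return labels[int(np.argmax(probs))]    # highest-probability word/phrase
```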

5. RESULT AND ANALYSIS

For this paper we considered accuracy as the primary measure in our result analysis. Along with it, we used a confusion matrix for understanding the errors. We achieved a maximum of 99.26% training accuracy and 80.44% validation accuracy.

The prediction breakdown can be seen in the confusion matrix in Figure 6.

Figure 6: Confusion Matrix

The model was trained for 50 epochs with a learning rate of 0.0001. Its validation accuracy was very low at the beginning, about 9 to 10%, but rose to 80.44% by the end of the 50 epochs, as shown in Figure 7.
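A hedged sketch of this training configuration follows (the paper states only the learning rate of 0.0001 and the 50 epochs; the Adam optimizer, categorical cross-entropy loss, and variable names are our assumptions):

```python
from tensorflow.keras.optimizers import Adam

model = build_model()  # the 4-conv / 2-FC network sketched in Section 4.2
model.compile(optimizer=Adam(learning_rate=1e-4),  # paper's learning rate
              loss="categorical_crossentropy",     # assumed for 20-way softmax
              metrics=["accuracy"])

# x_train/y_train, x_val/y_val: concatenated 224x224 images and one-hot
# labels from the 80/20 per-speaker split described in Section 3.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=50)                     # trained for 50 epochs
model.save("lipread_cnn.h5")                       # saved to disk for prediction
```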

Figure 7: Model Accuracy

The validation loss started at 3.0576 and was reduced to about 0.7504 by the last epoch. The training loss started at approximately 3 and was almost 0.04 at the end, as shown in Figure 8.

Figure 8: Model Loss


6. CONCLUSION

Overall, we found that the model improved due to the inclusion of the pre-trained face detection model. The data augmentation proved helpful for seen speakers. With this model you cannot determine the words or phrases spoken by unseen speakers, as the dataset was not that large, and it was difficult to avoid overfitting for unseen speakers. Thus different models and hyperparameters would fit better depending on whether seen or unseen speakers are used for testing and validation. Further use of pre-trained models would help reduce overfitting, and the addition of techniques such as regularization would reduce the overfitting problem to a greater extent.

This project is easily extensible and raises the question of how to perform visual speech recognition on a much larger or wider corpus (e.g. an English dictionary). How could we interpret the video as text by adding audio data? The MIRACL-VC1 dataset [1] includes phrase inputs, which is an interesting area for exploration, and in real life phrases are preferred over isolated words.

ACKNOWLEDGEMENT

We wish to express our sincere gratitude to Dr. Sanjay U. Bokade, Principal, and Dr. Satish Y. Ket, H.O.D. of the Department of Computer Engineering of Rajiv Gandhi Institute of Technology, for providing us an opportunity to do our project work on Automatic Lip Reading: Classification of Words and Phrases using Convolutional Neural Network. This project bears the imprint of many people. We sincerely thank our project guide Prof. Suresh R. Mestry for his guidance and encouragement in carrying out this work. Finally, we would like to thank our colleagues and friends who helped us in completing the project work successfully.

REFERENCES

[1] Ahmed Rekik, Achraf Ben-Hamadou, and Walid Mahdi, "A new visual speech recognition approach for RGB-D cameras," in Image Analysis and Recognition - 11th International Conference, ICIAR 2014, Vilamoura, Portugal, October 22-24, 2014, Proceedings, Part II, pp. 21–28, 2014.
[2] Rafael Padilla, Cicero Filho, and Marly Costa, "Evaluation of Haar Cascade Classifiers for Face Detection," 2012.
[3] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas, "LipNet: Sentence-level lipreading," CoRR, abs/1611.01599, 2016.
[4] Ziad Thabet, Amr Nabih, Karim Azmi, Youssef Samy, Ghada Khoriba, and Mai Elshealy, "Lip Reading using a Comparative ML Approach," 2018 IEEE.
[5] Amirsina Torfi, Seyed Mehdi Iranmanesh, Nasser Nasrabadi, and Jeremy Dawson, "3D Convolutional neural networks for cross audio-visual matching recognition," IEEE Access (ISSN 2169-3536), 2017.
[6] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu, "Towards better analysis of deep convolutional neural networks," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 91–100, 2017.
[7] N. Rathee, "A novel approach for lip reading based on neural network," in 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), pp. 421–426, IEEE, 2016.
[8] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Lipreading using convolutional neural network," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[9] M. Wand, J. Koutník, et al., "Lipreading with long short-term memory," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6115–6119, IEEE, 2016.
[10] Sagar Vegad, Harsh Patel, Hanqi Zhuang, and Mehul R. Naik, "Audio-Visual Person Recognition Using Deep Convolutional Neural Networks," 2017.
[11] F. Vakhshiteh and F. Almasganj, "Lip-reading via deep neural network using appearance-based visual features," in 2017 24th National and 2nd International Iranian Conference on Biomedical Engineering (ICBME), pp. 1–6, IEEE, 2017.

BIOGRAPHIES

Nikhil Kesarkar, Undergraduate Student, B.E. Computer Engineering, MCT Rajiv Gandhi Institute of Technology, Mumbai University, Mumbai

Poornachandraprasad Kongara, Undergraduate Student, B.E. Computer Engineering, MCT Rajiv Gandhi Institute of Technology, Mumbai University, Mumbai

Manthan Kothari, Undergraduate Student, B.E. Computer Engineering, MCT Rajiv Gandhi Institute of Technology, Mumbai University, Mumbai

Mr. Suresh R. Mestry, Assistant Professor in the Department of Computer Engineering, with 13 years of teaching experience. His specialization areas are Machine Learning, Artificial Intelligence, and NLP.
