Journal of Multidisciplinary Engineering Science Studies (JMESS)
ISSN: 2458-925X
Vol. 7 Issue 5, May - 2021
Cloud-Based Human Sign Language Digit
Classification Using CNN: A Case Study of
King’s-Center, Akure, Nigeria.
Ajibike Eunice Akin-Ponnle
The Federal Polytechnic,
Ile-Oluji, Ondo state,
Nigeria.
Abstract— Gesture communication or sign language is very important to the hearing impaired to communicate among themselves. Also, the hearing impaired oftentimes need to relate with other people who do not have hearing difficulty, be it at official, educational, social, or recreational gatherings. However, not everyone understands how to sign, and this often leads to a communication barrier between the hearing-impaired and hearing people. Therefore, building machines and/or systems to classify gestures, so that the hearing and the hearing-impaired can relate to and understand one another, becomes imperative. This led to the interesting subject of gesture communication classification known as 'sign language'. In this study, a machine learning method, namely deep learning, was adopted to classify gestures and arrive at a convenient model, using a holistic convolutional neural network (CNN) to recognise digits shown by the hands of an interpreter in a worship centre in Nigeria. The CNN is known for simultaneously extracting a comprehensive feature representation from the input image and learning a classifier for each desired output. How to implement a CNN for hand sign language digit classification from scratch in Google Colab and train the model in a cloud environment is described in this study. Cloud-based training permits the use of powerful GPUs, which reduces the training time in comparison with training on a CPU. Experimental results show 93.5% accuracy on the test set.

Keywords— cloud-based deep model training; communication barrier; convolutional neural networks; digit recognition; hand sign language.

I. INTRODUCTION

Human hand sign, known as gesture, originates as a manner of natural communication which was used not only by the hearing impaired but by normally hearing persons as well, depending on the situation and premise surrounding the particular kind of communication being undertaken. In some situations, people enjoy using gesture communication, and they do so intuitively and naturally. In any case, gesture can be generally enjoyable once the natural essence of sending and receiving information effectively, with understanding from one person to another, is properly established. Gesture has been an adopted manner of communication among the hearing impaired. An example of gesture communication is shown in figure 1.

Fig 1: Hand Sign Representation of Digits from 0-9, left-to-right, top-to-down.

Communication between people is said to be a process of information and idea sharing, either verbal or non-verbal. Non-verbal communication is defined as silent communication with a person or among people, without the use of speech, in order to gain the audience of the listeners [1]. It is effective when both the sending and receiving channels are properly established. Hand sign language is considered a non-verbal communication in this study, and a scenario is drawn of 300 people, among whom are 60 hearing impaired, gathering together to worship at King's Centre, Akure, Nigeria. This is taken as a ratio of 1:4 of hearing-impaired to hearing persons in the total audience. The challenges arising in this kind of congregation range from, but are not limited to: the inability of hearing members of the congregation to understand the hearing impaired and communicate with them freely; the inability of the hearing-impaired to express themselves freely before the remaining members of the congregation; and the inability of the hearing-impaired to communicate their needs or worries to the leaders of the congregation.

Human hand sign language recognition is classified into alphabet and digit classification, each of which has critical welfare applications for people with hearing impairment. A hand sign language digit recognition system should receive a hand image as input and yield a number between 0 and 9 corresponding to the given input image, while an alphabet recognition system takes a hand image describing a letter as input and outputs that letter. In this study, a hand sign
www.jmess.org
JMESSP13420738 3899
language recognition is studied for the hearing impaired congregants described in this paper, who have difficulty in carrying out specific instructions relating to numbers during the time of worship. As a scenario, figure 2 presents a chart of how effectively the hearing impaired were able to understand how to open books to certain page numbers. For this situation, the 60 people who could not verbally understand what the leader was saying (and who depended on the leader gesturing correctly) fell into 4 categories, from A to D.

Fig 2: Different Levels of Comprehension for Hearing Impairment.

Category 'A' opened their books to a wrong page; category 'B' did not open at all, because they could not understand and so did not bother; category 'C' managed to open by following what other people around them did, so some were correct and others wrong, depending on whether the examples they followed got it right; the last category, 'D', was able to open correctly. Hence it can be convincingly argued that the leader effectively communicated with about 85% of the total number of people present. This is shown in figure 3.

Fig 3: Levels of Effective Communication at King's Centre.

Therefore, building a machine and/or system that can classify hand signs or gestures appropriately, in order to enable the hearing and the hearing-impaired to relate among themselves and communicate effectively with others, is very important. In this paper, a holistic convolutional neural network (CNN), a type of deep learning, which is a machine learning method, was adopted to classify gestures and arrive at a convenient model to recognize digits shown by the hands of an interpreter at King's Centre in Nigeria. Experimental results show 93.5% accuracy on the test set.

The remaining sections of this paper are grouped as: literature review; data description, visualization and statistical analysis; data pre-processing; methodology; results and discussion; and conclusion.

II. LITERATURE REVIEW

Not everyone understands how to sign, or how to correctly recognize gesture communication. This has brought about much research on the subject of 'sign language', trying to adopt various methods of machine learning to classify gestures and arrive at a convenient model.

Gesture is a form of sign language which can be presented in various other forms, such as hand shapes, facial expressions, and movements of the hands [2]. An example is American Sign Language, which has been primarily adopted as a means of communication for the deaf in countries such as Canada, the United States of America, and other nations of the world [3]. However, gesture has to be correctly classified.

In an effort to create an effective means of communication between the hearing-impaired and hearing people, there is a need to design automated systems that enable the hearing impaired to communicate among themselves, as well as with hearing people, since not everyone understands how to sign.

The sign for digits is a type of 'sign language', as there are 10 digits from '0' to '9', as can be seen in Fig 1. The sign digit could also take the form of a hand rotation, as presented by the researchers in [3].

State of the Art. In the quest to overcome the challenges of gesture classification and recognition, many studies have been carried out, with designs of electrical and mechanical machines as well as robots [4], [5]. This study reviews some of the studies carried out to classify digits using CNNs, and some other situations where CNNs were used for classification generally, thereby comparing and contrasting the accuracy reported for CNNs in different studies. A 10-layer CNN to detect sign language of hand rotations at different angles was proposed by Abdul Kalam et al. [3], whereby they prepared 7000 rotated images from 700 sign digit images, and with the proposed model they were able to arrive at 97.28 percent accuracy on the 7000 rotated images. This is one of the recent related studies in the literature. Suharjito et al. [2] reviewed extensively and reported several applications of CNNs in computer vision projects that recognise sign language as text or speech, with different results. An accuracy as high as 95.5 percent was reported as part of the results obtained by a group of researchers who used a 3D CNN for the creation of LipNet, a design made to perform lipreading by visual means.
This was used to recognize end-to-end sentence lipreading. Also, it was reported in [2] that a CNN was used for Italian sign language recognition, whereby 20 Italian gestures could be recognised with an accuracy of 91.7 percent. A particular design, the hybrid CNN-Hidden Markov Model (CNN-HMM), which was used for a continuous sign language recognition system and has sequence modelling capabilities, was also presented. Likewise, a study on hand gesture recognition using CNNs was carried out by M. Han et al. in [6], where they propose a biologically inspired convolutional neural network in an attempt to reduce the difficulty of gesture recognition from camera images. They adopted a Gaussian skin model and background subtraction for the CNN training and test data, to filter out non-skin colours in the images, thereby obtaining a 93.8 percent classification rate in their experiment. Yawei Hou and Huailin Zhao [7], in an attempt to improve results for handwritten digit recognition, studied a combined depth network of a CNN and a BP neural network, and reported a more accurate result from the combined network than when the two were separately implemented on the same dataset. Z. Lu et al. [8] implemented one-shot learning hand gesture recognition with their lightweight 3D network using spatial-temporal separable 3D CNNs, in an attempt to form a more satisfactory classification while handling very few samples or only one gesture class. Their model was tested using the ChaLearn gesture dataset, among others, and it was reported that they obtained a satisfactory result. Z. Hu and X. Zhu [9] propose the use of a CNN approach for hand gesture recognition from RGB images, in order to simulate a real-time scenario where the depth image for RGB-D (the RGB image and depth approach, which has been much studied in the literature) is not available. Their experimental result was reported to have better recognition accuracy. However, with all the reported studies on CNNs in the literature, there is not yet enough reporting on digit classification in particular. Therefore, in this research, a model of hand digit classification using a CNN is presented.

III. METHODOLOGY

A Convolutional Neural Network (CNN) is a deep learning algorithm [2]. It is a feed-forward neural network, inspired by the human visual cortex, containing convolution and subsampling layers, and it is found very helpful in the field of computer vision. The complexity of processing is usually reduced by the use of several kernels in each layer [3].

A. Data Description and Processing

In this study, the 'Sign Language Digits Dataset' employed comprises 2180 images, collected from 218 students, with 10 images per person; every individual in this dataset used only their right hand to show digits between 0 and 9. The size of the images is 100 x 100, with RGB channels. Several samples of this dataset are given in figure 4.

Fig 4: Sample Images from Sign Language Digits Dataset.

The Sign Language Digits Dataset images are 100 x 100 with a consistent background, and the hands are located in the center of the image. The images are normalized from [0, 255] to [0, 1] pixel values, because a CNN can learn better from a normalized dataset [10].

In addition, to boost the performance of the model, several common data augmentation techniques, such as rotation, width-shifting, height-shifting, shearing, and zooming, were used. However, horizontal and vertical flipping were avoided, as these augmentations deform the usual data; since all the images in the dataset are taken of the right hand, horizontal flipping cannot help the model learn new features from the existing dataset. Also, while preparing the dataset, 20% of the data was randomly selected for the test phase, and the rest was used for learning the model.

B. Recognition Process

The recognition process with a holistic CNN has two main steps: 1) extracting discriminative features from the input image, and 2) learning a classifier based on the image features and their ground truth labels. For the first step, which is feature extraction, four convolutional layers, each followed by an activation function and regularization layers, were used. Given that the problem is a multi-class classification task, an output layer with 10 nodes was required to predict the probability distribution of an image belonging to each of the 10 classes. To this end, a fully connected layer with 10 neurons and a softmax activation function was used. Detailed information on the model summary is given in Fig. 5. The analysis is as follows:

1) All the activation functions in the model are Rectified Linear Unit (ReLU) activation functions. The rectified linear activation function is a piece-wise linear function that outputs the input directly if it is positive; otherwise, it outputs zero. This has become the default activation function for many types of neural networks, because a model that uses it is easier to train and often achieves better performance.
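As a minimal sketch of the data preparation described above — normalizing pixel values from [0, 255] to [0, 1] and randomly holding out 20% of the data for testing — the following NumPy code illustrates the idea. The function name, array shapes, and random dummy data are illustrative assumptions, not taken from the paper's own code:

```python
import numpy as np

def prepare_dataset(images, labels, test_fraction=0.2, seed=0):
    """Normalize uint8 images from [0, 255] to [0, 1] and randomly
    split the data into training and test subsets (80/20 here)."""
    # Scale pixel values so the CNN learns from a normalized dataset.
    x = images.astype(np.float32) / 255.0

    # Shuffle indices reproducibly, then hold out the test fraction.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(len(x) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (x[train_idx], labels[train_idx]), (x[test_idx], labels[test_idx])

# Dummy data shaped like the Sign Language Digits Dataset:
# 2180 RGB images of size 100x100 with digit labels 0-9.
images = np.random.randint(0, 256, size=(2180, 100, 100, 3), dtype=np.uint8)
labels = np.random.randint(0, 10, size=2180)
(train_x, train_y), (test_x, test_y) = prepare_dataset(images, labels)
```

The augmentations mentioned above (rotation, width/height shifting, shearing, zooming, with flipping disabled) would typically be applied to the training subset only, for example with a Keras image-augmentation generator.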
2) Several dropout layers were used in the model as regularization layers, to prevent the model from early overfitting.
3) A flatten layer was used to provide the features as a vector to the classifier.
4) Max pooling layers were used to compress the features at each level of the model, so that finally the most important feature representation of the given input image was extracted.
5) The parameters of the model are the learned weights, which are updated in each epoch of the training phase. The final weights are used to predict the test set and evaluate the accuracy performance of the model.

Fig. 5: Model summary. Red colored lines indicate the convolutional layers and the blue line shows the fully connected layer with 10 output nodes.

Google Colab was adopted to train and test the proposed model for hand sign language digit recognition. Google Colab is a free cloud service based on Jupyter Notebook that provides a free GPU to train and test the model (in a 12-hour active session). After connecting to a runtime, the GPU was selected as the runtime type and located using Listing 1:

\begin{lstlisting}
[language=Python, caption=Checking for the Colab GPU]
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
\end{lstlisting}
Listing 1: Checking for the Colab GPU.

The dataset was then downloaded from GitHub using Listing 2:

\begin{lstlisting}
[language=Python, caption=Downloading the dataset]
!git clone https://github.com/ardamavi/Sign-Language-Digits-Dataset
\end{lstlisting}
Listing 2: Downloading the dataset.

The model was trained under two different settings of augmentation. When basic augmentations such as several degrees of rotation, shearing, zooming, random cropping, etc., were applied, a 1.5 percent improvement (from 92.0 to 93.5) in the test performance was observed.

As shown in Fig 6, when the model is trained for more than 150 epochs, it starts to overfit the training subset; consequently, the validation loss does not improve. To avoid early overfitting, firstly, a dropout rate of 0.5 was used, in which half of the learned values in a layer are dropped; this way, the model automatically focuses on the more critical features. Secondly, the learning phase was stopped at epoch 150. However, based on Fig 6, training could also be stopped at epoch 200, as the difference between validation loss and training loss remains negligible.

Fig. 6: Loss value visualization in the course of training.
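The architecture enumerated above — four convolutional layers with ReLU activations, max pooling and dropout for regularization, a flatten layer, and a 10-way softmax classifier — can be sketched in Keras roughly as follows. The filter counts, kernel sizes, and optimizer are illustrative assumptions, since the exact values appear only in the Fig. 5 model summary and are not restated in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(100, 100, 3), num_classes=10):
    """A shallow holistic CNN in the spirit of the paper's model:
    4 conv layers (ReLU) with max pooling, dropout for regularization,
    a flatten layer, and a fully connected 10-way softmax output.
    Filter counts and kernel sizes here are illustrative only."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.5),                 # drop half the activations
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.5),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                    # features as a vector
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
```

Training would then call `model.fit` on the (augmented) training subset with a validation split, stopping around epoch 150 as discussed above.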
CONCLUSION

In this work, a CNN-based model was implemented to extract a feature representation from the input hand image and classify it. As the dataset is small and easy (i.e., with a consistent background), a very shallow model with only 4 convolutional layers was applied. However, if large datasets could be collected, deeper models (such as Inception, ResNet, MobileNet, etc.) could be used without any concern about over-fitting to the training data. Consequently, one could learn from larger datasets and extend the generalization ability of the model, which is a critical feature in real-world applications such as the congregation scenario described in this paper.

REFERENCES

[1] D. Phutela, "The Importance of Non-Verbal Communications", IUP Journal of Soft Skills, vol. 9, no. 4, pp. 43, 2015.

[2] Suharjito, M. C. Ariesta, F. Wiryana and G. P. Kusuma, "A Survey of Hand Gesture Recognition Methods in Sign Language Recognition", Pertanika J. Sci. Technol., vol. 26, no. 4, pp. 1659-1675, 2018.

[3] A. Kalam, N. I. Mondal, and B. Ahmed, "Rotation Independent Digit Recognition in Sign Language", International Conference on Electrical, Computer and Communication Engineering (ECCE), February 2019.

[4] A. B. Jmaa, W. Mahdi, Y. B. Jemaa, and A. B. Hmadou, "A New Approach for Digit Recognition Based on Hand Gesture Analysis", International Journal of Computer Science Security, vol. 2, no. 1, 2009.

[5] S. K. Keshari, S. Tyagi, N. Tomar, and S. Goel, "Aphonic's Voice: A Hand Gesture Based Approach to Convert Sign Language to Speech", IEEE 2nd International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), 2019.

[6] M. Han, J. Chen, L. Li, and Y. Chang, "Visual Hand Gesture Recognition with Convolution Neural Network", 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2016.

[7] Y. Hou and H. Zhao, "Handwritten Digit Recognition Based on Depth Neural Network", International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), 2017.

[8] Z. Lu, S. Qin, L. Li, D. Zhang, K. Xu, and Z. Hu, "One-Shot Learning Hand Gesture Recognition Based on Lightweight 3D Convolutional Neural Networks for Portable Applications on Mobile Systems", IEEE Access, pp. 131732-131748, September 2019.

[9] Z. Hu and X. Zhu, "Gesture detection from RGB hand image using modified convolutional neural network", IEEE 2nd International Conference on Information Systems and Computer Aided Education (ICISCAE), pp. 143-146, 2019.

[10] https://github.com/ardamavi/Sign-Language-Digits-Dataset, visited on 30th June 2020.