
IAENG International Journal of Computer Science, 45:4, IJCS_45_4_09

Static Hand Gesture Recognition Based on Gaussian Mixture Model and Partial Differential Equation

Qinghe Zheng, Xinyu Tian, Shilei Liu, Mingqiang Yang, Hongjun Wang and Jiajie Yang

Abstract—In the hand gesture recognition process, it is difficult for manually designed features to achieve good results under changeable gestures and complex backgrounds. In this paper, we propose a hand gesture recognition method based on a Gaussian skin color model and a deep convolutional neural network (DCNN). For gesture images with different backgrounds, we first use the Gaussian skin color model to segment the gesture area, and then use the DCNN to establish the gesture classification model. Finally, we use the back propagation algorithm based on partial differential equations to train the neural network on the pure gesture data samples so that it converges to the global optimum, and obtain the classification results. The model combines the processes of feature extraction and classification, simulates biological visual transmission and cognition, and effectively avoids the subjectivity and limitations of artificial features. The model also reduces the size and complexity of the network by using weight sharing and pooling technology. Experimental results show that the method is efficient for gesture representation and classification. The average classification accuracies under two datasets (indoor and outdoor environments) are both more than 99%. Compared with traditional methods, the proposed method has higher classification accuracy and speed.

Index Terms—hand gesture recognition, Gaussian mixture color model, deep convolutional neural network, partial differential equation

Manuscript received August 13, 2017; revised October 19, 2017. This work was supported by National Natural Science Foundation of China (Grant 61571275) and Shandong Provincial Natural Science Foundation (Grant ZR2014FM030, ZR2014FM010).
Qinghe Zheng, Shilei Liu, Mingqiang Yang (Corresponding author) and Hongjun Wang are with the School of Information Science and Engineering, Shandong University, Qingdao 266237, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Xinyu Tian is with the College of Mechanical and Electrical Engineering, Shandong Management University, Changqing, Jinan, Shandong, China (e-mail: [email protected]).
Jiajie Yang is with the Department of Science, University of British Columbia, Vancouver V6T1Z4, British Columbia, Canada (e-mail: [email protected]).

I. INTRODUCTION

With the rapid development of human-computer interaction technology in recent decades, hand gesture recognition technology has been widely used in smart offices, hospital monitoring, intelligent education and other fields. It bridges the gap of communication between equipment and people. Recently, gesture-based human-computer interaction technology [1] [56] has become an important direction for future development, providing an intelligent and natural way for human-computer interaction. From medical rehabilitation to electronic control, gesture interaction has been widely used. Gesture recognition technology [2, 59] is the basis of gesture interaction and is becoming a research hotspot in the field of human-computer interaction. As the mainstream of gesture recognition, computer vision based gesture recognition techniques [3, 4] face two main problems: hand segmentation and gesture classification.

Hand segmentation is the premise of gesture recognition, and the quality of the extracted gesture area directly affects the accuracy of hand gesture recognition. Skin color [5, 6] is one of the most discriminative features of the hand. However, varied gestures, complex background environments, different light sources and color shift in practical applications all lead to changes in skin color. Distortions in the shape of the hand, including bending and reversing, also change the shadows and the angle of the light source, so that the skin color of the hand area may not be consistent across the hand, or may even differ greatly. For contour-based hand segmentation algorithms [7], there are two difficulties. One is that the initial contour is hard to obtain because of the rotation or bending of the hand. The other is that it is difficult for the algorithm to converge due to the presence of deep concave regions in the gesture. In some improved models, the increased number of iterations and amount of computation degrade real-time performance. Motion-based hand segmentation methods are divided into the frame difference method [8] and the background subtraction method [9]. The frame difference method uses the differences between adjacent image frames to determine whether moving objects (i.e., hands) appear in the foreground. The background subtraction method first constructs a background image model, and then segments the hand by comparing the background image and the gesture image. Many experiments show that changes of light and shadow produced during movement and dynamic changes of the background affect the segmentation results. At present, no method achieves good segmentation results under all applications and background conditions.

In terms of hand classification, Grimes [10] first used a data glove to accomplish this task. Gesture recognition based on the data glove requires the operator to wear the glove, from which sensor information, including finger movement tracks and timing information, can be obtained. The computer then analyzes and identifies the acquired data to complete the human-computer interaction process. Although recognition methods based on data gloves have fast speed and high accuracy,


the equipment is expensive and uncomfortable to use, which is not a natural way for human-computer interaction. Vision based hand gesture recognition is a method of classifying a sequence of images containing hand gestures. The gesture classification methods based on computer vision are mainly divided into the following categories: 1. the template matching method; 2. classification methods based on geometric features; 3. classification methods based on deep convolutional neural networks; 4. classification methods based on the Hidden Markov Model (HMM). The template matching method first converts the image sequence into a set of static shape patterns and then compares them with pre-stored behavior templates during the classification process. Finally, the closest known template is selected as the recognition result of the measured behavior. Bobick [11] converts the motion information of the target into two two-dimensional templates, namely the MEI (Motion Energy Image) and MHI (Motion History Image), and then uses the Mahalanobis distance to test the similarity between samples and templates. The problem with this method is that it cannot remove the influence of movement time. The classification method based on geometric features [12] uses the edge of the contour and the regional structure features of the hand gesture as recognition features, and has good adaptability and stability in hand gesture recognition. However, its learning ability is not strong, and the recognition rate does not improve significantly as the sample size grows. The statistical method based on the deep convolutional neural network [13] [14] can realize complex nonlinear mappings and resists interference, so it is widely used in static hand gesture classification. But the learning ability of a shallow neural network is weak; it easily falls into bad local optima and suffers from overfitting. HMM [15] has a strong ability to describe the spatial-temporal variations of gesture signals, but it has a large number of state probability density values and parameters to estimate, which makes recognition slower. At present, existing methods need further research on robust gesture feature extraction and representation.

In this paper, we propose a gesture recognition method combining a skin color model and a deep convolutional neural network. For the hand gesture images collected in different backgrounds, the Gaussian skin model is first used to segment the gesture area. Then we use the DCNN to establish the hand gesture classification model, which combines the gesture feature extraction and classification processes. It can effectively avoid the subjectivity and limitations of artificially designed features by simulating visual transmission and cognition. Finally, we use the backpropagation algorithm based on partial differential equations to train the DCNN, and then obtain the classification results on the test set. The structure of this paper is organized as follows. In Section II, we introduce the related work. In Section III, we introduce the image preprocessing process, that is, using the Gaussian color model to extract the key area of the hand in the image. In Section IV, we describe the construction process of the deep CNN for hand gesture classification. The experiments used to validate the effectiveness and rationality of the method are described in Section V. Finally, we discuss the conclusion, the shortcomings of the algorithm and future work in Section VI.

II. RELATED WORK

According to the data input devices, existing gesture recognition methods are divided into two categories: gesture recognition based on data gloves and that based on computer vision. The gesture recognition algorithm based on computer vision is cheap and easy to operate, which is useful for the development of natural human-computer interaction. Therefore, we focus our research on hand gesture recognition algorithms based on computer vision. In the first part, we introduce some of the relevant research results on gesture segmentation. In the second part, we introduce the progress of gesture classification techniques based on computer vision.

A. Related Work of Hand Segmentation

In computer vision, color spaces mainly include RGB, HSV and HIS, and the most commonly used is the RGB color space. Researchers can segment the image by using depth information from the Kinect sensor or by fusing the RGB and depth information. The former focuses on the speed of the algorithm, while the latter focuses on classification accuracy. In [16], the hand segmentation task is considered as a depth clustering problem, and the pixels are grouped at different depth levels. By analyzing the human gesture dimension, a threshold is determined which corresponds to the depth of the hand. Lee [17] uses the K-means clustering algorithm and a predefined threshold to perform hand detection, and the hand pattern is analyzed by the convex hull to locate the fingers. Both of these methods assume that the hand is closest to the sensor, and the effect of the algorithm is greatly affected by the accuracy of the Kinect depth data. Manuel Caputo [18] uses Kinect-generated skeletal data to determine hand positions, and determines the size of the human hand at different depths by looking up tables storing standard hand information.

If there is no depth information, the difficulty of hand segmentation increases. Wang [19] uses the RGB and YCbCr spaces for threshold segmentation, and then applies a variety of morphological operations for image pre-processing. This method reduces the computational complexity and improves the computation speed, realizing the representation of contour features. Marin [20] uses the inter-frame difference method and a skin color model to calibrate the dynamic gesture area, which requires little computation, is fast, and can overcome the influence of shape change to a certain extent. Oikonomidis [21] integrates color information for hand detection, and converts the hand detection problem into the problem of labeling pixels as hand or non-hand. The skin detection operator on the RGB image and the clustering operator on the depth image are the two conditions for confirming a hand pixel, and the hand region is the intersection of the two pixel sets. Jiang [22] treats the different features as different operators, applies the continuity principle to the depth and color information of adjacent pixels in the hand and non-hand areas, and searches from the palm of the hand as the starting point to


ensure that all the pixels form a connected effective area. This method avoids the requirement of traditional depth-based approaches that the hand be located at the front, handles multiple targets, and effectively solves the problem of uneven color and depth data of the hand. Cao [23] presents a monocular vision gesture segmentation method that combines skin color information with motion information. By analyzing the characteristics of skin color and the chroma separation of color in YCbCr space, together with the clustering characteristics of the Cb-Cr plane, a Gaussian mixture model is constructed and a skin error decision algorithm with a minimum total error criterion is proposed.

On the other hand, deep learning has developed rapidly, and a series of segmentation algorithms based on deep CNNs [55] [57] [58] have been proposed, which are highly instructive. Neverova [24] proposes a deep learning based hand posture estimation method for gesture segmentation, which requires little label information. It uses unlabeled data and synthetic data as training samples. The key to this work is to integrate structure information into the model structure, which improves the inference speed. It is found that adding unlabeled samples brings significant improvements in segmentation results compared with purely supervised training. Zhang [25] builds a hybrid structure which includes three models: a depth model (depth with morphology), a skin model (skin color) and a background model (codebook algorithm). The inputs are the overlap rates between any two models, and the segmentation is done by a three-layer neural network, which reflects the consistency of the results of the models. Based on the theory of the relaxation labeling technique, Akinin [26] proposed a master-slave segmentation method using a Kohonen neural network. This method can effectively segment images with low SNR and can realize real-time processing. The experimental results show that it is better than the optimum discriminant threshold segmentation method and the matrix preserving threshold segmentation method. Long [27] proposed the fully convolutional network (FCN) for image segmentation, which tries to recover the category of each pixel from abstract features. It has achieved very good results in image semantic segmentation, but the algorithm requires a large number of samples for training and takes a lot of time to complete the semantic segmentation task.

B. Related Work of Hand Gesture Classification

In gesture classification, Bilal [28] chooses skin color features and uses a particle filter to detect the palm and fingers, and then uses the template matching method to perform hand gesture classification. Bjorn [29] uses human skin colour and motion features to track the human hand, and uses the nearest neighbor classifier for hand gesture classification, achieving a classification accuracy of 82.2% on the 100 sign words of Irish sign language. Kaufmann [30] proposes a gesture classification method using an intelligent evolutionary algorithm, which achieves a classification accuracy of 85.0% on the 192 sign words of American sign language. Weng [31] uses a bias color model to segment human hands, and uses multi-feature fusion to perform hand gesture classification. Flasinski and Myslinski [32] construct a Gaussian skin color model and propose a gesture graph analytic classification method based on this model. Ren [33] integrates the information of skin color, movement and edge, extracts the characteristic lines that reflect the structural features of the human hand, and finally extracts the model parameters of the translation-invariant plane to complete the hand gesture classification. Through fuzzy operations on the background, movement and color of the video stream in the space and time domains, Zhu [34] segments the hand and uses the pyramid image to extract features for gesture classification. Similarly, Yang [35] proposes a gesture classification algorithm based on the spatial distribution features of the hand gesture, which uses a Gaussian brightness model to segment the skin area and uses a search window to filter the skin area to realize gesture location. Lahamy and Lichti [36] use hand color camera arrays to obtain depth information for hand gesture classification. Vogler and Metaxas [37] use three vertical color RGB cameras and a position tracker as gesture input devices, and achieve a gesture classification accuracy of 89.9% on 53 isolated sign words.

Among end-to-end gesture classification algorithms based on DCNNs, a large number of models have achieved remarkable results. Among them, the three models AlexNet [38], VGGNet [39] and GoogLeNet [40] have achieved good results in the field of image classification. AlexNet contains 5 convolutional layers, 3 pooling layers and 3 fully connected layers; it won the championship in ILSVRC 2012 and increased the classification accuracy from 74.3% to 83.6%. VGGNet-16 consists of 13 convolutional layers, 5 pooling layers, 3 fully connected layers and 1 softmax layer; it ranked first in the object localization task and second in the image classification task in ILSVRC 2014. Its outstanding contribution is to demonstrate that using very small convolution kernels (3×3) and increasing the network depth can effectively improve classification, and it generalizes well to other datasets. GoogLeNet not only keeps the sparsity of the network structure but also obtains the high computing performance of dense matrices by using the Inception structure. The model won the championship in the image classification task in ILSVRC 2014 because of its excellent classification ability. Furthermore, a large number of training techniques are used in these frameworks to avoid overfitting and to improve the generalization ability of DCNNs, such as regularization, dropout and data augmentation.

III. IMAGE PREPROCESSING

In this section, we mainly introduce the preprocessing steps applied to gesture images before we feed them into the DCNN. In part A, we introduce a color balance algorithm to remove the interference caused by unstable light sources and other factors. Part B introduces the gesture area extraction method based on the Gaussian mixture model. Part C introduces the reconstruction process of the whole gesture area based on morphological operations and the single connectivity method. In fact, parts B and C are designed to remove interference from complex backgrounds for hand gesture recognition.

A. Color Balance Algorithm

Due to the impact of illumination, the acquired gesture images exhibit color shift when the lighting is not ideal, which leads to larger noise in the gesture data. Therefore, we introduce the color balance algorithm called the gray world


theory (GWT) [41] for color correction. The GWT algorithm is based on the gray world hypothesis, which assumes that, for an image with a large number of color variations, the average values of the three components R, G and B tend to the same gray value. The color balance algorithm applies this assumption to the image to be processed, eliminating the impact of ambient light and recovering the original scene image. The GWT algorithm is expressed as follows:

$$R'(C) = \frac{\mathrm{gray}}{\bar{R}}\, R(C) \qquad (1)$$

$$G'(C) = \frac{\mathrm{gray}}{\bar{G}}\, G(C) \qquad (2)$$

$$B'(C) = \frac{\mathrm{gray}}{\bar{B}}\, B(C) \qquad (3)$$

where

$$\mathrm{gray} = \frac{\bar{R} + \bar{G} + \bar{B}}{3}$$

and $\bar{R}$, $\bar{G}$ and $\bar{B}$ represent the mean values of the three color channels R, G and B, respectively. $R(C)$, $G(C)$, $B(C)$ and $R'(C)$, $G'(C)$, $B'(C)$ represent the color values of pixel C before and after correction, respectively. The effect of the algorithm is shown in Fig. 1. It can be seen that the GWT algorithm achieves color correction on color-biased images.

Fig. 1. Three sets of images before and after correction.
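As a concrete illustration of (1)-(3), the gray world correction reduces to one gain per channel. The following is a minimal NumPy sketch, not the authors' code; the input layout (an H×W×3 RGB array) is an assumption.

import numpy as np

def gray_world_correct(img):
    """Gray world color balance following (1)-(3).

    img: H x W x 3 RGB array."""
    img = img.astype(np.float64)
    means = img.reshape(-1, 3).mean(axis=0)    # channel means R, G, B
    gray = means.mean()                        # gray = (R + G + B) / 3
    out = img * (gray / means)                 # per-channel gains of (1)-(3)
    return np.clip(out, 0, 255).astype(np.uint8)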
B. Hand Gesture Area Extraction

Because of the complex backgrounds of gesture images and the changeable skin color under different light sources, a reliable skin color model is needed to detect the gesture area. Some results [42] show that the color difference between different races is far less than the difference in chromaticity. The YCbCr color space has the advantage of separating luminance from chrominance, and it offers better clustering and stability. It is insensitive to rotation of skin color and to racial differences, and skin color in this space approximately follows a Gaussian distribution. Therefore, in the YCbCr space, a Gaussian distribution is used to model the skin color, the probability that each point in the image belongs to skin color is calculated, and the gesture region can then be segmented.

Single Gaussian skin color model. The skin color model based on the Gaussian distribution is calculated as shown in (4):

$$P(Cb, Cr) = \exp\!\left(-0.5\,(x - m)^{T} C^{-1} (x - m)\right) \qquad (4)$$

where

$$x = (Cb, Cr)^{T}, \quad m = E(x), \quad C = E\{(x - m)(x - m)^{T}\}$$

and Cb and Cr represent the blue and red concentration offset components, respectively. By calculating the probability P that each pixel in the image belongs to the hand, we can establish a complete skin probability distribution matrix, and use the maximum interclass variance method with an adaptive threshold to binarize the skin color probability matrix. In the binary image, a bright area with pixel value 1 is denoted a skin color point, and a dark area with pixel value 0 is denoted a non-skin color point.
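A sketch of the single Gaussian skin model of (4), assuming OpenCV for the YCbCr conversion and for Otsu's method (the maximum interclass variance thresholding mentioned above); the mean m and inverse covariance C_inv would be estimated from labeled skin pixels, which is an assumption here.

import cv2
import numpy as np

def skin_mask(img_bgr, m, C_inv):
    """Per-pixel skin probability of (4) in CbCr, binarized by Otsu's method."""
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    cb, cr = ycrcb[..., 2], ycrcb[..., 1]            # OpenCV stores Y, Cr, Cb
    d = np.stack([cb - m[0], cr - m[1]], axis=-1)    # x - m at every pixel
    maha = np.einsum('...i,ij,...j->...', d, C_inv, d)
    prob = np.exp(-0.5 * maha)                       # P(Cb, Cr) in [0, 1]
    prob8 = np.round(prob * 255).astype(np.uint8)
    _, mask = cv2.threshold(prob8, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return prob, mask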
The establishment of the mixture model based on expectation maximization. As a nonparametric estimation method for the mathematical modeling of unknown data from observed data, the K-component mixture model is expressed in random form, and the discrete variable Z used to describe the observed data follows a multinomial distribution:

$$p(x) = \sum_{i=1}^{K} \alpha_i\, p_i(x, \theta_i), \qquad \alpha_i \ge 0 \qquad (5)$$

$$Z \sim \mathrm{Multinomial}(\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_K) \qquad (6)$$

$$x \mid Z \sim p_Z(x, \theta_Z) \qquad (7)$$

$$\sum_{i=1}^{K} \alpha_i = 1 \qquad (8)$$

where $p_i(x, \theta_i)$ is the probability that the variable x belongs to the i-th component, K is the number of single Gaussian distributions, Z is the single-mode mark corresponding to the multinomial distribution to which x belongs, $\theta_i$ is the distribution parameter of each single Gaussian distribution, and $\alpha_i$ is the corresponding weight. The above mixture model has higher flexibility than a single-mode distribution.

Through maximum likelihood estimation, we choose the likelihood function of the model as the objective function of the parameter estimation. The optimal solution of the parameters corresponds to the extreme point of the likelihood function. The natural logarithm of the likelihood, $l(\theta)$, for K single models and n training samples, and its derivative with respect to the parameters of the distribution family, are given by

$$l(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k\, p(x_i; \theta) \qquad (9)$$

$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\alpha_j\, p(x_i; \theta)}{\sum_{k=1}^{K} \alpha_k\, p(x_i; \theta)}\, \frac{\partial \log p(x_i; \theta)}{\partial \theta_j} \qquad (10)$$

The expectation maximization method estimates the two factors in (9) by alternately calculating the expectation of the weight parameters (the E step) and estimating the parameters of the distribution family (the M step).

E step: calculate the weights according to the current model parameters:

$$\omega_{ij} = Q_i^{t}\big(z^{(i)} = j\big) = p\big(z^{(i)} = j \mid x^{(i)}, \theta^{(t)}\big) \qquad (11)$$

M step: solve the following equation with the weights fixed at their current values:

$$\frac{\partial}{\partial \theta^{(t)}} \sum_{i} \sum_{z^{(i)}} \omega_j^{(i)} \log \frac{p\big(x^{(i)}, z^{(i)}; \theta^{(t)}\big)}{Q_i^{(t)}\big(z^{(i)}\big)} = 0 \qquad (12)$$

After obtaining the updated value of the clustering parameter θ, we update the multinomial parameters as

$$\alpha_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(z^{(i)} = k\big) \qquad (13)$$

where $\mathbb{1}$ is the indicator function of component membership. From the updating results of each E step and M step, it can be proved that the iterative process increases the objective value of (9) monotonically, so a stable optimization process is obtained. The specific procedure is shown in Algorithm 1.
Algorithm 1  The EM process
Input: K, Z, X, θ
Model initialization
for t = 1 … N do
    Calculate the posterior expectation: ω(t) = arg max l(X, θ(t), ω(t−1))
    Update the parameters of the distribution family: θ(t+1) = arg max l(X, ω(t), θ(t))
until convergence
Output: θ*
Learning method of the mixture model. Assume that the mixture of K Gaussian models of an unknown multi-dimensional distribution is expressed as

$$f(x) = \sum_{i} \alpha_i\, N(x \mid t_i, \Sigma_i) \qquad (14)$$

where N denotes the multidimensional Gaussian distribution, $t_i$ is the multidimensional mean vector and $\Sigma_i$ is the corresponding covariance matrix. The goal of the learning process is to estimate all the parameters in (14) from the training data x.

In the E step, the posterior expectation is computed by

$$\omega_{ij} = P\big(z^{(i)} = j \mid x^{(i)}; \alpha, t, \Sigma\big) \qquad (15)$$

In the M step, the weights obtained above are fixed, and the likelihood of the training data is expressed as

$$l(\alpha, t, \Sigma) = \sum_{i=1}^{n} \sum_{z^{(i)}=1}^{K} \log p\big(x^{(i)}, z^{(i)}; \alpha, t, \Sigma\big) \qquad (16)$$

$$= \sum_{i} \sum_{j} \omega_j^{(i)} \log \frac{\alpha_j \exp\!\big(-0.5\,(x_i - t_j)^{T} \Sigma_j^{-1} (x_i - t_j)\big)}{(2\pi)^{3/2}\, |\Sigma_j|^{1/2}\, \omega_j^{(i)}} \qquad (17)$$

Fixing the parameters $\alpha_j$ and $\Sigma_j$, we calculate the partial derivative of the likelihood with respect to $t_q$:

$$\frac{\partial l}{\partial t_q} = \sum_{i=1}^{n} \omega_q^{(i)} \big(\Sigma_q^{-1} x^{(i)} - \Sigma_q^{-1} t_q\big) \qquad (18)$$


Accordingly, the updated mean parameter is obtained as

$$t_q = \sum_{i=1}^{n} \omega_q^{(i)} x^{(i)} \Big/ \sum_{i=1}^{n} \omega_q^{(i)} \qquad (19)$$

Finally, we implement the parameter update of the Gaussian mixture model. Then we apply the Gaussian mixture model to gesture image segmentation; the preliminary results are shown in Fig. 2.

Fig. 2. The hand gesture region detected by the mixed Gaussian skin color model.
Fig. 3. The hand gesture region processed by morphological operations.
Fig. 4. The hand gesture region after the single connected operation.
Fig. 5. Reconstructed RGB image of the hand gesture region.
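To make Algorithm 1 and the updates (11)-(19) concrete, here is a compact NumPy sketch of one possible EM implementation for a K-component Gaussian mixture; the initialization and the fixed iteration count are simplifying assumptions.

import numpy as np

def em_gmm(X, K, n_iter=100, eps=1e-8):
    """EM for a K-component Gaussian mixture (sketch of Algorithm 1).

    X: (n, d) data; returns weights alpha, means t, covariances S."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    alpha = np.full(K, 1.0 / K)
    t = X[rng.choice(n, size=K, replace=False)]          # initial means
    S = np.stack([np.cov(X.T) + eps * np.eye(d)] * K)    # initial covariances
    for _ in range(n_iter):
        # E step, (11)/(15): responsibilities w[i, j] via log-densities
        logw = np.empty((n, K))
        for j in range(K):
            diff = X - t[j]
            maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(S[j]), diff)
            _, logdet = np.linalg.slogdet(S[j])
            logw[:, j] = (np.log(alpha[j] + eps)
                          - 0.5 * (maha + logdet + d * np.log(2 * np.pi)))
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M step, (13) and (19): weights and means; covariances analogously
        Nk = w.sum(axis=0)
        alpha = Nk / n
        t = (w.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - t[j]
            S[j] = (w[:, j, None] * diff).T @ diff / Nk[j] + eps * np.eye(d)
    return alpha, t, S

In practice the same fit is available off the shelf, e.g. sklearn.mixture.GaussianMixture(n_components=K).fit(cbcr_pixels) on the CbCr values of an image; the from-scratch version above only shows where (11), (13) and (19) enter.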
C. Gesture Area Reconstruction

In fact, the mixed Gaussian skin color model usually cannot directly achieve satisfactory segmentation results because of interference from complex backgrounds and unstable illumination. So we need to combine some other methods to reconstruct the gesture area.

1. Morphological method. In the binary image, the edges of the gesture area contain holes, burrs or incomplete contours of different sizes. The morphological dilation algorithm extends the bright regions of the binary image, while the erosion algorithm extends the dark regions. Morphological opening is applied to the binary image by erosion followed by dilation with a 3×3 all-one matrix, which removes the isolated noise and edge bulges in the binary image. The output is shown in Fig. 3.

2. Single connected method. There are still isolated blocks [43] in the gesture image after morphological processing, which are generated by the skin color model segmentation. These areas have an adverse effect on the classification results, so we use the marker connectivity method to eliminate the small areas. In Fig. 3, each small block area of the binary image is marked as {a_i | i = 1, 2, ⋯}. Then we calculate the area value S_i of each marked region, and obtain the maximum region according to (20):

$$S_{\max} = \max\{S_i \mid i \in N\} \qquad (20)$$

where $S_{\max}$ represents the maximum region area, i.e., the area of the gesture region, and N is the number of regions. We then traverse the entire binary image to determine whether the mark of each pixel belongs to the largest region. The gray value of pixels belonging to the maximum region is set to 1, and all others to 0. The resulting gesture area is shown in Fig. 4.

3. Reconstruction of the gesture area. After the single connected processing of the gesture binary image, the gesture area is complete and the background is flat. We then use the stroke method [44] to extract the edge of the gesture and rebuild the gesture color image. Finally, we obtain the gesture segmentation results, as shown in Fig. 5.
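A minimal OpenCV sketch of reconstruction steps 1 and 2: opening with the 3×3 all-one matrix, then keeping only the largest connected region as in (20). The mask convention (uint8, 0/255) is an assumption.

import cv2
import numpy as np

def reconstruct_mask(mask):
    """Morphological opening, then largest-component selection per (20)."""
    kernel = np.ones((3, 3), np.uint8)
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # erode then dilate
    n, labels, stats, _ = cv2.connectedComponentsWithStats(opened)
    if n <= 1:                                               # background only
        return opened
    areas = stats[1:, cv2.CC_STAT_AREA]                      # S_i of each region
    largest = 1 + int(np.argmax(areas))                      # index of S_max
    return np.where(labels == largest, 255, 0).astype(np.uint8)

The RGB gesture image of step 3 can then be rebuilt by masking the color-balanced input with this binary map.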
IV. CONSTRUCTION OF DCNN

In this section, we introduce the deep convolutional neural network model for hand gesture classification, including its specific structure, loss function, activation function, and training method and process.

A. DCNN Model

A DCNN is a multilayer neural network with multiple two-dimensional planes at each layer, each consisting of multiple independent neurons. The network consists of convolutional layers, pooling layers, fully connected layers and a softmax layer, which ensures that feature extraction and image classification are performed simultaneously. The network realizes displacement, scaling and distortion invariance of the image information in three ways: local receptive fields, weight sharing and down-sampling. The local receptive field means that the neurons of each layer are connected to a local region of the previous layer. Through the local receptive field, the network can extract primary structural features of the image, such as corners, colors, directions and edges. Weight sharing reduces the number of parameters that need to be trained and greatly enhances the generalization ability of the network. Down-sampling layers usually follow convolutional layers; they reduce the resolution of the feature maps and increase the displacement, scaling and distortion invariance of the network, making the trained weights more conducive to image classification.

In the convolutional layer, the feature maps of the previous layer are convolved with the kernels, and the results are passed through an activation function to obtain the feature maps of this layer. Each output feature map can be correlated with the convolution results of multiple input feature maps through shared weights. The neurons of each feature map in the convolutional layer are calculated by

$$X_j^{l} = f\!\left(\sum_{i \in M_j} X_i^{l-1} * k_{ij}^{l} + B_j^{l}\right) \qquad (21)$$

where l denotes the position of the convolutional layer in the structure, $k_{ij}$ is the convolution kernel connecting the i-th input feature map to the j-th output feature map, f and B are the activation function and bias respectively, and $M_j$ indexes the set of input feature maps.

The pooling layer usually follows the convolutional layer and performs a down-sampling operation on the input feature maps. The number of output feature maps equals the number of input feature maps, while the dimension of each output map is scaled down. The values of the neurons of each feature map in the sampling layer are calculated by

$$X_j^{l} = f\!\left(\beta_j^{l}\, p(x_j^{l-1}) + B_j^{l}\right) \qquad (22)$$

where p and β represent the down-sampling function and the weights, respectively.

The last layer of the DCNN is the softmax layer; that is, the network outputs the classification results through the softmax function, defined as

$$p(i) = \frac{\exp(\theta_i^{T} x)}{\sum_{k=1}^{K} \exp(\theta_k^{T} x)} \qquad (23)$$

where p(i) is the probability that the classification result is class i, x and θ are the input feature vector and the parameters of the fully connected layer, respectively, and K is the number of image classes.
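The three layer types of (21)-(23) can be written out directly; this toy NumPy forward pass is purely illustrative (real implementations use optimized convolution routines), and the valid-convolution boundary handling is an assumption.

import numpy as np

def conv_layer(X, k, b, f):
    """(21): X is (C_in, H, W); k is (C_out, C_in, kh, kw); b is (C_out,)."""
    C_out, C_in, kh, kw = k.shape
    H, W = X.shape[1] - kh + 1, X.shape[2] - kw + 1
    out = np.zeros((C_out, H, W))
    for j in range(C_out):
        for y in range(H):
            for x in range(W):
                out[j, y, x] = np.sum(X[:, y:y + kh, x:x + kw] * k[j]) + b[j]
    return f(out)   # activation applied to the summed feature maps

def max_pool(X, s=2):
    """(22) with p = max over s x s windows (beta = 1, B = 0 assumed)."""
    C, H, W = X.shape
    X = X[:, :H // s * s, :W // s * s]
    return X.reshape(C, H // s, s, W // s, s).max(axis=(2, 4))

def softmax(z):
    """(23): class probabilities from the last fully connected layer."""
    e = np.exp(z - z.max())
    return e / e.sum()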


B. Loss Function and Network Training Method

The weight training method of the deep CNN adopts batch gradient descent based on back propagation. Since the last layer of the network is the softmax layer, we use the log-likelihood function as the loss function, as shown in (24):

$$J = -\sum_{k} y_k \log(a_k) \qquad (24)$$

where $a_k$ and $y_k$ represent the output value and the true value of the k-th neuron, respectively. It can be seen from (24) that the higher the output probability of the correct neuron, the smaller the loss function. The goal of network training is to find the minimum of the loss function J by continually updating the parameters ω and b. The network weights and biases are updated by

$$\omega_{ij}^{l} = \omega_{ij}^{l} - \alpha \frac{\partial J}{\partial \omega_{ij}^{l}} \qquad (25)$$

$$b_{ij}^{l} = b_{ij}^{l} - \alpha \frac{\partial J}{\partial b_{ij}^{l}} \qquad (26)$$

where

$$\frac{\partial J}{\partial \omega_{ij}^{l}} = \frac{1}{N} \sum_{j=1}^{N} a_j^{L-1} \big(a_i^{L} - y_i\big), \qquad \frac{\partial J}{\partial b_{ij}^{l}} = \frac{1}{N} \sum_{j=1}^{N} \big(a_j - y_j\big)$$

α is the learning rate and N is the number of samples per batch. From the above gradient transfer method based on partial differential equations, it can be seen that the softmax function combined with the log-likelihood function trains the deep CNN very well, and there is no problem of the learning speed slowing down during training.
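A sketch of what (24)-(26) mean at the output layer: with softmax outputs and the log-likelihood loss, the error signal reduces to (a − y), which is exactly why the learning speed does not degrade. Mini-batch shapes and names below are assumptions.

import numpy as np

def output_layer_grads(A_prev, A, Y):
    """Gradients of J = -sum(y log a) for a softmax output layer.

    A_prev: (N, H) previous-layer activations;
    A: (N, K) softmax outputs; Y: (N, K) one-hot labels."""
    N = A.shape[0]
    delta = A - Y                    # softmax + log-likelihood: no f' factor
    dW = A_prev.T @ delta / N        # batch-averaged dJ/dw as in (25)
    db = delta.mean(axis=0)          # batch-averaged dJ/db as in (26)
    return dW, db

def sgd_step(W, b, dW, db, lr):
    """(25)-(26): w <- w - alpha dJ/dw, b <- b - alpha dJ/db."""
    return W - lr * dW, b - lr * db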
C. Network Structure used for Gesture Classification

Considering the complex changes of gestures under scale transformation, rotation and translation, we need to construct an appropriate deep CNN for gesture image classification. The frame of the network consists of convolutional layers and max-pooling layers, and the last layer is the output layer. There are many different feature maps in each convolutional layer; different feature maps are obtained by different convolution kernels, and each feature map represents one kind of gesture feature. Moreover, compared with traditional CNNs, we dramatically increase the number of convolution kernels in each convolutional layer to improve the utilization of the image information, and the size of the kernels changes according to the depth of the network. The max-pooling layer uses a scaling factor of 2, i.e., pooling over 2×2 windows. The roughness of the features depends on the degree of scaling at down-sampling, and pooling windows of size 2×2 are a good choice. The activation function of the output of each layer is the Swish function, which has good nonlinear mapping properties, as given in (27):

$$\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(x), \qquad \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \qquad (27)$$

where x represents the output of the previous layer. At present, there are various activation functions for neural network structures, such as ReLU, Tanh, Sigmoid, PReLU, Softplus and so on.

Fig. 6. Curves of various activation functions.

As shown in Fig. 6, the output gradients of the Swish function increase gradually and stabilize near 1 when the inputs are greater than zero, so the activation function does not suffer from vanishing gradients. On the other hand, even when the inputs are less than 0, the outputs are correspondingly small but not equal to 0, so there is no neuronal silence phenomenon of the kind induced by the ReLU activation function during training. And the whole network structure is

sparse, and the overfitting problem of the deep CNN is greatly alleviated. The target output of the neural network is a floating-point vector whose component for the correct class is 1 while the remaining values are 0. In the experimental section, we verify and compare the classification results of the DCNN under various activation functions to show the effectiveness of the Swish function.

In order to determine the optimal number of convolutional layers and feature maps, various DCNNs with different structures were tested in experiments. The results show that the network achieves the highest classification accuracy when it consists of nine convolutional layers. So we finally use the deep CNN structure shown in Fig. 7. In the first three convolutional layers of the structure, each layer consists of 128 convolution kernels of size 5×5. In the middle three convolutional layers, each layer consists of 256 convolution kernels of size 3×3. The last three convolutional layers contain 512 kernels of size 3×3 per layer. There is a max-pooling layer behind every three convolutional layers. The network finally connects three fully connected layers and outputs the classification results from the softmax layer.

Fig. 7. DCNN structure for hand gesture classification.
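Under stated assumptions, the described structure could be sketched in PyTorch as below. The conv/pool arrangement, kernel counts and Swish activations follow the text; the padding scheme and the widths of the three fully connected layers are not given in the paper and are assumptions — this is a reconstruction, not the authors' code.

import torch.nn as nn

def conv_block(c_in, c_out, k):
    """Three conv layers with Swish (x * sigmoid(x), i.e. nn.SiLU),
    followed by the 2x2 max-pooling layer placed after every three."""
    layers = []
    for i in range(3):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, k, padding=k // 2),
                   nn.SiLU()]
    return layers + [nn.MaxPool2d(2)]

model = nn.Sequential(
    *conv_block(3, 128, 5),       # 3 layers of 128 kernels, size 5x5
    *conv_block(128, 256, 3),     # 3 layers of 256 kernels, size 3x3
    *conv_block(256, 512, 3),     # 3 layers of 512 kernels, size 3x3
    nn.Flatten(),                 # 224x224 input -> 512 x 28 x 28 here
    nn.Linear(512 * 28 * 28, 4096), nn.SiLU(), nn.Dropout(0.4),  # FC widths assumed
    nn.Linear(4096, 1024), nn.SiLU(), nn.Dropout(0.4),
    nn.Linear(1024, 10),          # ten gesture classes
    nn.Softmax(dim=1),
)

In actual training one would usually feed the pre-softmax logits to a cross-entropy loss, which is the log-likelihood criterion of Section IV-B.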
D. Training Details and Process

In order to retain as much image detail as possible, we pre-process the sample images in the experimental dataset. By interpolating and normalizing the data, the network receives RGB images with a fixed size of 224×224×3. Another pre-processing step is subtracting the mean RGB value, computed on the training set, from each pixel. Due to the depth and the large number of parameters of the deep CNN model, directly using a small dataset easily leads to overfitting. So we first pre-train the network on the ImageNet dataset [45], and then fine-tune it on the target dataset to get the final classification results.

The batch size of the network is determined by the number of classes and samples of the training set, and the detailed parameters are given in the experimental section. During the training process, the three fully connected layers randomly drop neurons with a probability of 40%, which greatly improves the generalization ability of the network; in the test phase, the output values of the fully connected layers are correspondingly scaled to 2.5 times their original values. At the same time, in order to avoid oscillation at the end of training, the learning rate decreases gradually as the training epochs increase. When the validation set meets the requirements or the loss function falls below a manually set threshold, the training process ends. The training and testing process of the algorithm is shown in Fig. 8.

Fig. 8. Training and testing process of the deep CNN.
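The test-time rescaling described above compensates for the units disabled during training; note that 2.5 = 1/0.4, i.e. the factor is the reciprocal of a 40% keep rate. A common equivalent is inverted dropout, where this compensation is applied during training instead, as in the hedged sketch below (whether the 40% in the text is the drop rate or the keep rate is an interpretation).

import numpy as np

def inverted_dropout(a, keep_prob, train):
    """Inverted dropout: scale kept activations by 1/keep_prob while training,
    so no rescaling (e.g. the 2.5x above for keep_prob = 0.4) is needed at test."""
    if not train:
        return a
    mask = np.random.rand(*a.shape) < keep_prob
    return a * mask / keep_prob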

V. EXPERIMENTS

The experiments consist of three parts. In the first part, we introduce the two hand gesture datasets used in the experiments. In part B, we demonstrate the effectiveness of the segmentation algorithm based on the mixed Gaussian skin color model. In part C, we show the classification results and testing speed of the DCNN on the two datasets, and analyze the learning process of the DCNN through visualized convolution kernels and feature maps.

A. Two Datasets

The datasets in the experiments consist of two parts. The first dataset is a set of hand gesture images with indoor backgrounds, which is used to validate the effectiveness of the DCNN model. For this dataset, we collected ten classes of hand gestures with an ordinary camera, as shown in Fig. 9(a). The ten categories represent the numbers 0 to 9, respectively. The dataset consists of 3000 images with different rotation angles against a single background. The backgrounds of the images come from an indoor laboratory with stabilized lighting.

The second dataset used in the experiments is the standard hand gesture dataset of the National University of Singapore (NUS), which has a total of 3000 images in 10 categories, as shown in Fig. 9(b). All images come from various outdoor backgrounds, and the background interference (e.g., similar edges, textures and colors) is deliberately intensified.

Fig. 9. Different kinds of hand gesture image samples in the two databases. (a) shows samples of the indoor database, and (b) shows samples of the NUS database under outdoor conditions.

Fig. 10. Segmentation results of gesture samples in the two databases. (a) shows the segmentation results of the hand gesture samples under indoor conditions; (b) shows the segmentation results of the hand gesture samples under outdoor conditions.

TABLE I
THE CLASSIFICATION RESULTS OF SAMPLES BEFORE AND AFTER SEGMENTATION IN THE TWO DATABASES (ALL VALUES IN %)

Indoor database  | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Original image   |  92.1   |  91.3   |  93.3   |  88.0   |  88.2   |  87.5   |  90.0   |  87.8   |  94.4   |  94.0    | 90.7
Segmented image  | 100.0   |  98.7   |  98.0   |  98.4   | 100.0   |  97.4   |  99.3   |  98.3   | 100.0   |  99.9    | 99.0

NUS database     | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Original image   |  89.2   |  88.7   |  86.3   |  90.1   |  88.0   |  87.6   |  88.1   |  89.0   |  86.9   |  87.1    | 88.1
Segmented image  |  98.1   |  96.9   |  96.5   |  98.4   |  98.4   |  97.7   |  97.6   |  98.2   |  96.4   |  96.3    | 97.5

B. Hand Segmentation Results

In this part, we show the segmentation results of hand gesture images in the two datasets, as shown in Fig. 10.

It can be seen from Fig. 10(a) that the algorithm effectively segments the entire hand area against the simple background, which is favorable for subsequent classification. It avoids the adverse effects caused by complex backgrounds and noise, and lets the neural network extract the most stable and effective features for hand gesture classification.

Fig. 10(b) shows the segmentation results of hand gesture samples in the NUS dataset. Although there are still some wrongly segmented pixels in the edge regions, the overall segmentation effect is satisfactory. The segmented gesture images greatly reduce the complex background area of the original image samples.

Since segmentation is not our final goal, we do not compare the experimental results with other segmentation algorithms. We will illustrate the effectiveness of the image segmentation algorithm by showing the image classification results in the next section.

C. Hand Gesture Classification Results

In the image input phase of the training process, since we use the batch gradient descent method to train the whole network, the images of each batch are randomly selected from the dataset. In the training process of the DCNN, the training time is relatively long due to the huge number of parameters in the neural network, so the performance of the machine has an obvious impact on training time. In the experiments, we accelerated the training process on the GPU. We built the workstation with an NVIDIA GTX 1080, which runs about 400 times faster than a CPU i5 6600K.

The classification results compared with the baseline. We first compare the classification accuracy on the two datasets before and after segmentation, as shown in Table I. We compared the classification accuracy of the ten categories and the average classification accuracy (mAP) on the two datasets. The highest classification accuracy of each column is shown in bold. We fed the image samples before and after segmentation into the network respectively; in the training process, the training method and the strategy to prevent overfitting were the same. It can be clearly seen that the classification accuracy of the segmented images in the ten categories of the two datasets is higher than that of the original images. The average classification accuracies of the segmented hand gesture images on the two datasets are 99.0% and 97.5% respectively, while those of the original hand gesture images are only 90.7% and 88.1% respectively. The effective segmentation of the hand gesture improves the classification accuracy by about 10%; the complicated background and noise have adverse effects on the classification of gestures. We then draw the convergence curves of the network training process on the


TABLE II
THE CLASSIFICATION RESULTS OF DCNN WITH DIFFERENT ACTIVATION FUNCTIONS IN THE TWO DATABASES (ALL VALUES IN %)

Indoor database | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Sigmoid   |  89.1 | 86.4 | 86.0 | 86.6 |  90.1 | 85.9 | 87.4 | 86.8 |  87.9 |  88.2 | 87.4
Tanh      |  90.9 | 88.7 | 87.2 | 89.1 |  90.2 | 87.9 | 90.0 | 89.9 |  90.7 |  91.1 | 89.6
Softplus  |  98.2 | 97.6 | 96.0 | 97.1 |  98.0 | 96.9 | 97.4 | 97.2 |  98.0 |  97.8 | 97.4
ReLU      |  98.8 | 97.4 | 97.1 | 97.1 |  98.3 | 96.9 | 97.5 | 97.5 |  98.5 |  98.1 | 97.7
PReLU     |  98.7 | 98.6 | 97.8 | 98.0 |  99.2 | 96.6 | 98.2 | 99.0 |  99.0 |  99.2 | 98.4
Swish     | 100.0 | 98.5 | 98.0 | 98.5 | 100.0 | 97.5 | 99.0 | 98.5 | 100.0 | 100.0 | 99.0

NUS database | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Sigmoid   | 88.2 | 87.3 | 87.3 | 88.1 | 89.0 | 88.3 | 87.8 | 89.2 | 86.7 | 85.6 | 87.8
Tanh      | 89.1 | 88.3 | 87.7 | 89.3 | 89.0 | 88.4 | 88.7 | 89.1 | 87.5 | 87.0 | 88.4
Softplus  | 96.8 | 95.9 | 95.7 | 95.8 | 96.3 | 95.6 | 96.0 | 97.2 | 96.3 | 96.8 | 96.2
ReLU      | 95.8 | 95.6 | 94.4 | 95.8 | 97.0 | 95.6 | 96.4 | 96.6 | 96.7 | 96.1 | 96.0
PReLU     | 96.3 | 95.7 | 96.7 | 96.7 | 98.5 | 96.4 | 97.0 | 97.5 | 95.2 | 94.4 | 96.4
Swish     | 98.1 | 96.9 | 96.5 | 98.4 | 98.4 | 97.7 | 97.6 | 98.2 | 96.4 | 96.3 | 97.5

TABLE III
THE CLASSIFICATION RESULTS OF DIFFERENT NEURAL NETWORK MODELS IN THE TWO DATABASES (ALL VALUES IN %)

Indoor database | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Alexnet     |  96.7 | 97.5 | 96.0 | 95.4 |  96.6 | 95.9 | 95.3 | 96.1 |  97.0 |  97.5 | 96.4
VGG-16      |  99.0 | 98.1 | 96.5 | 97.2 |  98.8 | 97.9 | 98.6 | 98.6 |  98.8 |  98.1 | 98.2
VGG-19      |  98.7 | 97.4 | 97.0 | 97.5 |  99.0 | 96.2 | 96.2 | 97.8 |  98.0 |  98.8 | 97.7
GoogLeNet   |  98.8 | 97.1 | 98.9 | 97.4 |  98.5 | 96.5 | 96.8 | 97.5 |  98.1 |  98.4 | 97.8
Our model   | 100.0 | 98.5 | 98.0 | 98.5 | 100.0 | 97.5 | 99.0 | 98.5 | 100.0 | 100.0 | 99.0

NUS database | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Alexnet     | 95.2 | 93.3 | 92.9 | 95.1 | 95.0 | 94.8 | 94.1 | 94.5 | 91.9 | 92.2 | 93.9
VGG-16      | 94.7 | 94.4 | 96.6 | 95.4 | 95.3 | 95.0 | 95.3 | 98.4 | 94.1 | 94.0 | 95.3
VGG-19      | 96.7 | 94.2 | 95.9 | 93.0 | 93.1 | 95.2 | 95.7 | 96.0 | 93.7 | 96.6 | 95.1
GoogLeNet   | 98.5 | 96.0 | 94.7 | 96.9 | 93.8 | 97.8 | 93.0 | 94.2 | 93.9 | 95.4 | 95.1
Our model   | 98.1 | 96.9 | 96.4 | 98.4 | 98.4 | 97.8 | 97.6 | 98.2 | 96.4 | 96.3 | 97.5

NUS dataset, as shown in Fig. 11. The red curve represents the training iterations on the original images, while the blue curve represents the training iterations on the segmented gesture images. It can be seen from the figure that training on segmented gesture images converges faster than on the original images, by about 50 epochs. The hand gesture images after segmentation contain far less useless information, which helps the network converge globally to the right location. Moreover, the iteration curve of the original images oscillates over a large range at the end of training, which is not what we expect to see.

On the basis of the segmented hand gesture images, we verify the effect of various activation functions in our DCNN structure, as shown in Table II. In the table, we compare the classification accuracy of the 10 classes and the average classification accuracies under the two datasets. The highest classification accuracy of each column is shown in bold.

It can be seen from the classification results that the network with the Swish function has the highest classification accuracy on the hand gestures in the two datasets. The network with the PReLU function ranks second in classification accuracy due to its stable learning property. Neuronal necrosis under ReLU hinders the training of the deep layers, so the classification accuracy of the network with the ReLU function is lower than that of the network with the PReLU function. The networks with the Tanh and Sigmoid functions cannot effectively train the deep layers because of the gradient vanishing problem, so their classification accuracies are the lowest.

After determining the activation function, we compare multiple DCNN structures and determine the optimal structure for hand gesture classification. In Table III, we compare the classification results of four structures: Alexnet, VGG-16, VGG-19, and GoogLeNet. Based on the Alexnet,


TABLE IV
THE COMPARISON OF CLASSIFICATION RESULTS OF DIFFERENT ALGORITHMS IN THE TWO DATABASES (ALL VALUES IN %)

Indoor database | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Stergiopoulou [46] |  96.6 | 94.0 | 94.3 | 94.3 |  97.2 | 94.7 | 95.0 | 94.4 |  97.1 |  96.9 | 95.5
Jiang [47]         |  94.5 | 92.3 | 93.0 | 93.5 |  95.4 | 92.1 | 94.3 | 93.2 |  93.6 |  95.2 | 93.7
Cai [48]           |  97.0 | 95.8 | 96.9 | 94.6 |  96.2 | 94.8 | 95.5 | 95.9 |  96.1 |  96.7 | 95.9
Caffe-DAG [49]     |  98.2 | 96.5 | 95.9 | 95.8 |  98.0 | 95.4 | 96.6 | 96.2 |  97.8 |  98.8 | 96.9
Triesch [50]       |  90.4 | 91.2 | 89.8 | 87.9 |  91.0 | 90.1 | 90.8 | 90.2 |  91.0 |  91.6 | 90.4
DeCAF [51]         |  98.0 | 95.5 | 94.9 | 98.8 |  96.8 | 96.7 | 99.2 | 95.6 |  97.1 |  97.4 | 97.0
Pisharady [52]     |  97.0 | 95.7 | 93.7 | 98.8 |  95.8 | 97.7 | 99.1 | 96.6 |  96.2 |  95.4 | 96.6
PHOG+SVM           |  92.2 | 89.4 | 91.4 | 92.1 |  91.1 | 91.0 | 89.5 | 89.9 |  90.0 |  91.2 | 90.8
SIFT+SVM           |  92.1 | 92.7 | 91.9 | 91.6 |  94.0 | 90.8 | 92.4 | 92.2 |  93.2 |  92.7 | 92.4
Hu+GLCM+SVM        |  95.1 | 96.4 | 96.1 | 95.7 |  97.8 | 96.9 | 96.6 | 95.9 |  97.0 |  97.5 | 96.5
Hu+GLCM+GRNN       |  97.2 | 98.6 | 95.5 | 95.9 |  97.4 | 97.7 | 96.5 | 96.1 |  98.0 |  98.3 | 97.1
Our model          | 100.0 | 98.5 | 98.0 | 98.5 | 100.0 | 97.5 | 99.0 | 98.5 | 100.0 | 100.0 | 99.0

NUS database | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 | mAP
Stergiopoulou [46] | 95.7 | 93.8 | 93.0 | 95.5 | 95.1 | 94.4 | 94.4 | 95.6 | 92.5 | 92.2 | 94.2
Jiang [47]         | 93.1 | 90.2 | 90.7 | 94.1 | 94.5 | 92.8 | 93.0 | 94.1 | 91.2 | 91.7 | 92.5
Cai [48]           | 94.6 | 93.2 | 93.4 | 96.0 | 95.1 | 94.6 | 94.3 | 96.2 | 92.9 | 92.5 | 94.3
Caffe-DAG [49]     | 94.6 | 94.9 | 97.2 | 98.5 | 95.5 | 96.2 | 95.0 | 96.1 | 95.7 | 93.5 | 95.7
Triesch [50]       | 88.0 | 87.4 | 87.2 | 88.5 | 89.1 | 88.4 | 88.2 | 89.7 | 88.0 | 87.6 | 88.2
DeCAF [51]         | 95.2 | 92.3 | 92.0 | 93.7 | 94.4 | 93.6 | 92.2 | 93.0 | 91.2 | 92.3 | 93.0
Pisharady [52]     | 93.3 | 93.9 | 94.1 | 96.0 | 95.4 | 94.9 | 94.4 | 96.5 | 93.7 | 91.8 | 94.4
PHOG+SVM           | 91.7 | 91.9 | 92.2 | 93.3 | 94.0 | 92.6 | 92.6 | 94.1 | 91.7 | 92.1 | 92.6
SIFT+SVM           | 90.1 | 89.7 | 89.6 | 91.2 | 93.0 | 90.9 | 90.1 | 91.1 | 88.9 | 89.6 | 90.4
Hu+GLCM+SVM        | 96.0 | 95.6 | 94.9 | 95.5 | 96.2 | 95.9 | 95.7 | 96.7 | 96.6 | 94.1 | 95.7
Hu+GLCM+GRNN       | 97.7 | 96.5 | 94.4 | 96.7 | 96.1 | 96.6 | 94.5 | 96.0 | 94.7 | 96.4 | 96.0
Our model          | 98.1 | 96.9 | 96.5 | 98.4 | 98.4 | 97.7 | 97.6 | 98.2 | 96.4 | 96.3 | 97.5

VGGNet increased the depth of the network, giving it a more outstanding ability to express features. GoogLeNet increased the width of the network by adding the Inception structure, in which convolution kernels of different sizes within the same convolutional layer enhance the feature extraction ability of the network at different scales. Drawing on the advantages of both, we built the final DCNN framework through a large number of experiments, as shown in Fig. 7.

The classification results compared with some other algorithms. We then compare the classification accuracy of various algorithms, as shown in Table IV. The recognition time of each algorithm for a single image is shown in Table V.

In terms of classification performance, it can be seen that our algorithm has the highest average classification accuracy. The accuracy on the two hand gesture datasets is improved from 97.1% to 99.0% and from 96.0% to 97.5%, respectively. The reason is that the multilayer CNN shares weights, so the network structure has a limited number of parameters. Therefore, the network has good generalization ability; concretely, the accuracy on the test set is very close to the classification accuracy on the training set. The training process takes full advantage of the self-organizing ability of the gradient descent learning method based on partial differential equations, which forces the network to extract the most effective image features in different feature spaces. When a traditional single-hidden-layer neural network is used for gesture classification, the network has no feature abstraction capability for the input data: features of the target area must be selected manually as the input of the network, and this artificial selection of features is subjective, so it is difficult to establish a complete description of the hand gesture. Stergiopoulou [46] constructs a CNN model to recognize gestures. The inputs of the network are binary gesture images, which improve the classification efficiency of the network and give better real-time performance. However, the binary images lack the local information of the gesture, so the network cannot describe texture and can only describe hand shape features. Therefore, its classification accuracy is lower than that of the method proposed in this paper.


TABLE V
COMPARISON OF TESTING TIME FOR A SINGLE IMAGE OF VARIOUS METHODS

Method             | Testing time/frame (s)
Stergiopoulou [46] | 0.005
Jiang [47]         | 0.710
Cai [48]           | 0.120
Caffe-DAG [49]     | 0.057
Triesch [50]       | 0.026
DeCAF [51]         | 0.044
Pisharady [52]     | 0.026
PHOG+SVM           | 0.116
SIFT+SVM           | 0.492
Hu+GLCM+SVM        | 0.015
Hu+GLCM+GRNN       | 0.027
Our model          | 0.034
TABLE VI and can process nearly 30 images per second. In the training
CLASSIFICATION ACCURACIES OF VARIOUS ACTIVATION FUNCTION phase, the property of sparse expression of the activation
Activation Classification accuracy Classification accuracy function makes the scale of the network smaller. In the testing
fuctions of indoor dataset of outdoor dataset phase, the network automatically extracts the features of
sigmoid 80.4% 74.9%
tanh 90.2% 84.2%
gesture image through forward propagation, and the
softplus 97.1% 95.6% recognition process takes shorter time. The method of
ReLU 98.4% 96.1% Hu+GLCM+SVM method is the fastest, because the feature
Leaky-ReLU 98.3% 96.2% extraction process of Hu invariant moment is very simple, but
P-ReLU 98.6% 96.6% the classification accuracy is significantly lower than the
maxout 98.8% 97.2%
DCNN based method.
Swish 99.0% 97.5%
Fig. 11. Convergence curves in the training process of DCNN.

Generalized regression neural network (GRNN) and support vector machine (SVM) are two commonly used methods in image classification; both train quickly and have strong nonlinear mapping ability. In the comparison experiment, once the main body of the gesture is detected, we select the first and second moments among the Hu invariant moment parameters as global features, and select the energy, contrast and correlation parameters of the gray level co-occurrence matrix (GLCM) as local features. We then feed the combined five-dimensional feature vector into the GRNN and SVM for training and classification. Among traditional feature descriptors, invariant moments can represent the overall shape of objects and are invariant to translation, rotation and scale transformations, while the gray level co-occurrence matrix has better texture description ability and can capture detailed image information. However, when the change of gesture is large, a single shape feature cannot describe it well, and texture features based on pixel gray values are sensitive to object deformation and changes of view angle. Jiang [47] used Hu invariant moments to extract seven gesture parameters as features and classified them with a BP neural network. Cai [48] uses four custom geometric invariants as features to classify gestures. However, these features do not describe gestures as well as the DCNN, and the accuracies of these algorithms on the ten gesture categories are both lower than that of our DCNN-based algorithm.
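As an illustration of this baseline, the sketch below extracts the five-dimensional Hu+GLCM feature, assuming OpenCV, a recent scikit-image and scikit-learn; the distance and angle settings of the co-occurrence matrix are illustrative assumptions, not our exact experimental configuration.

```python
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def hu_glcm_features(gray):
    """Two global Hu moments plus three GLCM texture statistics."""
    # Global shape features: the first two Hu invariant moments.
    hu = cv2.HuMoments(cv2.moments(gray)).flatten()
    # Local texture features from the gray level co-occurrence matrix.
    glcm = graycomatrix(gray, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    return np.array([hu[0], hu[1],
                     graycoprops(glcm, 'energy')[0, 0],
                     graycoprops(glcm, 'contrast')[0, 0],
                     graycoprops(glcm, 'correlation')[0, 0]])

# Hypothetical usage: `images` are segmented 8-bit grayscale gesture images.
# X = np.stack([hu_glcm_features(img) for img in images])
# clf = SVC(kernel='rbf').fit(X, labels)
```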
In terms of recognition efficiency, although the DCNN based approach does not achieve the fastest recognition speed, it meets the real-time requirements of practical applications and can process nearly 30 images per second. In the training phase, the sparse expression property of the activation function keeps the scale of the network small. In the testing phase, the network automatically extracts the features of the gesture image through forward propagation, so the recognition process takes little time. The Hu+GLCM+SVM method is the fastest, because the feature extraction process of Hu invariant moments is very simple, but its classification accuracy is significantly lower than that of the DCNN based method.
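The per-frame figures in Table V can be reproduced with a simple timing loop; below is a minimal sketch in PyTorch, where the input resolution and the model handle `net` are illustrative assumptions.

```python
import time
import torch

net.eval()  # `net` is the trained DCNN (assumed defined elsewhere)
x = torch.randn(1, 3, 96, 96)  # one image; 96x96 is an assumed input size

with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        net(x)
    per_frame = (time.perf_counter() - start) / 100  # average over 100 runs

print(f"average testing time: {per_frame:.4f} s/frame")
```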
The comparison with various activation functions. We compared different activation functions in the DCNN on the two datasets to observe the effect of the Swish function. The classification results are shown in Table VI. Comparing the experimental results of the activation functions on the two datasets, we can clearly see that Swish achieves the highest classification accuracy: it improves on the results of maxout by 0.2% and 0.3%, respectively.
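Among the functions in Table VI, Swish is simple to state: Swish(x) = x · sigmoid(βx). A minimal PyTorch sketch follows, with β fixed to 1 as an illustrative default rather than a tuned value.

```python
import torch

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x). Smooth and non-monotonic; close to ReLU
    # for large positive x while keeping small negative activations alive.
    return x * torch.sigmoid(beta * x)
```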
Visualization analysis and understanding of the learning process of DCNN. The confusion matrix is a visualization of classification results commonly used in supervised learning, which can intuitively express the precision and recall of a classification model. Element Mi,j in the confusion matrix represents the number of samples of class i that are assigned to class j. The values of the matrix can be normalized to between 0 and 1, and based on this ratio we give each element of the matrix a hue from blue to red. The elements on the main diagonal represent the proportions of samples that are correctly classified, so the closer the color of the main diagonal of the confusion matrix is to red, the higher the classification rate. We draw the confusion matrices for the classification results on the two datasets, as shown in Fig. 12 (a) and (b). It can be seen from the figure that only a few samples are misclassified into other categories.
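A minimal sketch of this visualization, assuming scikit-learn and matplotlib; the blue-to-red hue is obtained here with the 'coolwarm' colormap, an illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion(y_true, y_pred, n_classes):
    # Row-normalize so entry (i, j) is the fraction of class i assigned to j.
    cm = confusion_matrix(y_true, y_pred, labels=range(n_classes)).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)
    plt.imshow(cm, cmap='coolwarm', vmin=0.0, vmax=1.0)
    plt.colorbar()
    plt.xlabel('Predicted class')
    plt.ylabel('True class')
    plt.show()
```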
In order to observe the effect of the image segmentation operation on the DCNN training process and the converged results, we use the t-SNE [53] and largevis [54] methods to reduce the dimension of the feature vectors of the last fully connected layer in the DCNN, and show the distribution of the two-dimensional features in the plane, as shown in Fig. 13. The left image shows the visualization result of t-SNE and the right image the result of largevis. In the figure, we show the feature distribution of the testing samples of the 10 categories of the NUS dataset. It can be clearly seen that the features of the original images are messy and loose due to the interference of the background, while the features of the segmented images are more compact, which is more conducive to establishing a good classification boundary.
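The t-SNE half of this figure can be reproduced as follows, assuming scikit-learn; `features` and `labels` are illustrative names for the penultimate-layer activations and the 10 class indices, not variables defined in this paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# `features`: (n_samples, n_dims) activations of the last fully connected
# layer; `labels`: gesture class indices used only to color the points.
embedded = TSNE(n_components=2, perplexity=30.0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap='tab10', s=5)
plt.title('t-SNE of DCNN activation features')
plt.show()
```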
Fig. 12. Confusion matrices of classification results on the two datasets. (a) is the confusion matrix of the classification results of gesture samples in the indoor environment, and (b) is the confusion matrix of the classification results of gesture samples in the outdoor environment.

Fig. 13. Distribution visualization of deep convolutional activation features of the last fully connected layer. (a) shows the feature distribution of the original gesture images, and (b) shows the feature distribution of the segmented gesture images.
In order to analyze and understand the learning process of the DCNN, we input the pre-processed gesture images into the network and observe the feature maps of its convolutional layers, as shown in Fig. 14. Fig. 14(a) shows some of the feature maps that are strongly activated in the first convolutional layer, while Fig. 14(b) shows the feature maps that are strongly activated in the second convolutional layer. Feature maps 3, 4, 6 and 12 in Fig. 14(b) extract structural information about the joints and edges of the gesture, which can be understood as local features of the gesture. The remaining feature maps show the contour information of the gesture, which can be understood as global features. In Fig. 14(a), feature maps 6 and 12 are complementary images in which the same hand gesture regions are represented by different gray levels. Comparing Figs. 14(a) and (b) shows that the feature planes of the upper layer are mapped to different gray levels by the convolution operation, and that the features extracted by the DCNN are robust to illumination changes.

Fig. 14. The output feature maps of the first and second convolutional layers in the deep CNN.
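This kind of inspection is easy to reproduce with forward hooks; in the PyTorch sketch below, the attribute names conv1 and conv2 are hypothetical stand-ins for the first two convolutional layers of the trained model.

```python
import torch

activations = {}

def save_activation(name):
    # Build a hook that stores a layer's output feature maps under `name`.
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()
    return hook

# `net` is the trained DCNN; conv1/conv2 are assumed attribute names.
net.conv1.register_forward_hook(save_activation('conv1'))
net.conv2.register_forward_hook(save_activation('conv2'))

with torch.no_grad():
    net(preprocessed_batch)  # one forward pass fills `activations`

# activations['conv1'][0, k] is the k-th feature map, viewable as an image.
```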
In summary, the variety of convolution kernels in the fully trained deep CNN endows it with the ability to extract different features. Since all processing of the subsequent feature maps in the network is based on the output of the previous layer, the network can gradually extract and assemble more distinct and stable features. It therefore builds a comprehensive description of the images, and the hand gesture classification accuracy can be improved.

VI. CONCLUSION

In this paper, we construct a DCNN model for hand gesture classification. We first use the mixed Gaussian skin color model to segment the complete hand gesture area, and then use the DCNN to classify hand gestures of different scales, shapes and background environments on two datasets. The back propagation algorithm based on partial differential equations enables the DCNN to fully learn discriminative representations of the images. Finally, the model achieves classification accuracies of 99.0% and 97.5% on the two datasets respectively, which shows its superiority over other methods. Moreover, the experimental results illustrate that the proposed method has the following advantages.
1. The effective segmentation of hand areas in complicated backgrounds guarantees effective feature extraction, reducing redundant information and noise, improving the classification accuracy of the DCNN, and accelerating the convergence of the training process.
2. In dealing with two-dimensional images, the DCNN extracts the spatial features of images through convolution and down-sampling. It avoids the subjectivity of hand-crafted features, and is invariant to translation, scaling and rotation.
3. The DCNN extracts the global and local information of hand gestures comprehensively, so that the description of the hand gesture is more complete and accurate.
4. By using local receptive fields and weight sharing, the scale of the network parameters is greatly reduced and the computational efficiency is improved significantly.
In the future, we plan to build an end-to-end hand gesture recognition system that performs segmentation and classification jointly, and we will consider how to integrate prior information such as skin color and shape into the training process of the system.

ACKNOWLEDGMENT

The authors appreciate the indoor hand gesture dataset produced by the computer vision and machine learning lab of Shandong University. And the authors wish to gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
REFERENCES

[1] K. Niyazi, V. Kumar, S. Mahe and S. Vyawahare, “Mouse Simulation Using Two Coloured Tapes,” International Journal of Information Sciences and Techniques, vol. 2, no. 2, pp. 57-63, 2012.
[2] S. Mitra and T. Acharya, “Gesture Recognition: A Survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 37, no. 3, pp. 311-324, 2007.
[3] J. Davis and M. Shah, “Visual Gesture Recognition,” in IEE Proceedings of Vision, Image and Signal Processing, vol. 141, no. 2, pp. 101-106, 1994.
[4] Z. Lu et al., “A Hand Gesture Recognition Framework and Wearable Gesture-based Interaction Prototype for Mobile Devices,” IEEE Transactions on Human-Machine Systems, vol. 44, no. 2, pp. 293-299, 2017.
[5] L. Bretzner, I. Laptev and T. Lindeberg, “Hand Gesture Recognition Using Multi-scale Colour Features, Hierarchical Models and Particle Filtering,” in IEEE 5th International Conference on Automatic Face and Gesture Recognition, pp. 423-428, 2002.
[6] B. Stenger, “Template-based Hand Pose Recognition Using Multiple Cues,” in Asian Conference on Computer Vision, pp. 551-560, 2006.
[7] M. C. Ornellas, “A Deformable Contour Based Approach to Hand Image Segmentation,” in Proceedings of First International Conference on Cyber Crime Investigation, pp. 10-18, 2004.
[8] D. Wu et al., “Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1583-1597, 2016.
[9] A. Ogihara, H. Matsumoto and A. Shiozaki, “Hand Region Extraction by Background Subtraction with Renewable Background for Hand Gesture Recognition,” in IEEE International Symposium on Intelligent Signal Processing and Communications, pp. 227-230, 2007.
[10] H. Kenn, F. V. Megen and R. Sugar, “A Glove-based Gesture Interface for Wearable Computing Applications,” in VDE International Forum on Applied Wearable Computing, pp. 1-10, 2007.
[11] J. W. Davis and A. F. Bobick, “The Representation and Recognition of Human Movement Using Temporal Templates,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 928-934, 1997.
[12] Y. Liu, Y. Yin and S. Zhang, “Hand Gesture Recognition Based on Hu Moments in Interaction of Virtual Reality,” in IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics, pp. 145-148, 2012.
[13] S. Murthy and R. S. Jadon, “Hand Gesture Recognition Using Neural Networks,” in IEEE Advance Computing Conference, pp. 134-138, 2010.
[14] E. Stergiopoulou and N. Papamarkos, “Hand Gesture Recognition Using a Neural Network Shape Fitting Technique,” Engineering Applications of Artificial Intelligence, vol. 22, no. 8, pp. 1141-1158, 2009.
[15] N. Liu, B. C. Lovell and P. J. Kootsookos, “Evaluation of HMM Training Algorithms for Letter Hand Gesture Recognition,” in IEEE International Symposium on Signal Processing and Information Technology, pp. 648-651, 2004.
[16] R. Y. Tara, P. I. Santosa and T. B. Adji, “Hand Segmentation From Depth Image Using Anthropometric Approach in Natural Interface Development,” International Journal of Scientific and Engineering Research, vol. 3, no. 5, pp. 1-4, 2012.
[17] U. Lee and J. Tanaka, “Hand Controller: Image Manipulation Interface Using Fingertips and Palm Tracking with Kinect Depth Data,” in Proceedings of the Asia Pacific Conference on Computer Human Interaction, pp. 705-706, 2012.
[18] M. Caputo, K. Denker, B. Dums and G. Umlauf, “3D Hand Gesture Recognition Based on Sensor Fusion of Commodity Hardware,” Mensch und Computer, vol. 12, no. 1, pp. 293-302, 2012.
[19] X. Wang, “Hand Gesture Recognition Based on BP Neural Network in Complex Background,” Computer Applications and Software, vol. 30, no. 3, pp. 247-95, 2013.
[20] G. Marin, F. Dominio and P. Zanuttigh, “Hand Gesture Recognition with Jointly Calibrated Leap Motion and Depth Sensor,” Multimedia Tools and Applications, vol. 75, no. 22, pp. 1-25, 2016.
[21] I. Oikonomidis, N. Kyriazis and A. Argyros, “Efficient Model-Based 3D Tracking of Hand Articulations Using Kinect,” in British Machine Vision Conference, pp. 1-11, 2011.
[22] P. Gupta, “An Efficient Slap Fingerprint Segmentation and Hand Classification Algorithm,” Neurocomputing, vol. 142, no. 1, pp. 464-477, 2014.
[23] X. Cao, J. Zhao and M. Li, “Monocular Vision Gesture Segmentation Based on Skin Color and Motion Detection,” Journal of Hunan University, vol. 38, no. 1, pp. 78-83, 2011.
[24] N. Neverova, C. Wolf et al., “Hand Segmentation with Structured Convolutional Learning,” in Asian Conference on Computer Vision, pp. 687-702, 2015.
[25] X. Zhang, Z. Ye, L. Jin and S. Xu, “A New Writing Experience: Finger Writing in the Air Using a Kinect Sensor,” IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 85-93, 2013.
[26] M. V. Akinin, A. I. Taganov, M. B. Nikiforov and A. V. Sokolova, “Image Segmentation Algorithm Based on Self-organized Kohonen's Neural Maps and Tree Pyramidal Segmenter,” in IEEE Mediterranean Conference on Embedded Computing, pp. 168-170, 2015.
[27] J. Long, E. Shelhamer and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[28] S. Bilal, R. Akmeliawati, M. Salami, A. Shafie and E. E. Bouhabba, “A Hybrid Method Using Haar-like and Skin-color Algorithm for Hand Posture Detection,” in IEEE International Conference on Mechatronics and Automation, pp. 934-939, 2010.
[29] Y. She, Q. Wang et al., “A Real-Time Hand Gesture Recognition Approach Based on Motion Features of Feature Points,” in IEEE International Conference on Computational Science and Engineering, pp. 1096-1102, 2014.
[30] B. Kaufmann, J. Louchet and E. Lutton, “Hand Posture Recognition Using Real-time Artificial Evolution,” in Applications of Evolutionary Computation, pp. 251-260, 2013.
[31] C. Weng, Y. Li, M. Zhang, K. Guo, X. Tang et al., “Robust Hand Posture Recognition Integrating Multi-cue Hand Tracking,” in International Conference on E-Learning and Games, Changchun, China, pp. 497-508, 2010.
[32] M. Flasiński and S. Myśliński, “On the Use of Graph Parsing for Recognition of Isolated Hand Postures of Polish Sign Language,” Pattern Recognition, vol. 43, no. 6, pp. 2249-2264, 2010.
[33] H. Ren and G. Xu, “Hand Gesture Recognition Based on Characteristic Curves,” Journal of Software, vol. 12, no. 5, pp. 987-993, 2002.
[34] J. Y. Zhu, “Hand Gesture Recognition Based on Structure Analysis,” Chinese Journal of Computers, vol. 29, no. 12, pp. 27-32, 2006.
[35] B. Yang et al., “Gesture Recognition in Complex Background Based on Distribution Features of Hand,” Journal of Computer-Aided Design and Computer Graphics, vol. 22, no. 10, pp. 1841-1848, 2010.
[36] Z. Li and R. Jarvis, “Real Time Hand Gesture Recognition Using a Range Camera,” in Australasian Conference on Robotics and Automation, pp. 21-27, 2009.
[37] C. Vogler and D. Metaxas, “Adapting Hidden Markov Models for ASL Recognition by Using Three-dimensional Computer Vision Methods,” in IEEE International Conference on Systems, Man, and Cybernetics, pp. 156-161, 1997.
[38] A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[39] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
[40] C. Szegedy, W. Liu and Y. Jia, “Going Deeper with Convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[41] E. Y. Lam, “Combining Gray World and Retinex Theory for Automatic White Balance in Digital Photography,” in IEEE Proceedings of the 9th International Symposium on Consumer Electronics, pp. 134-139, 2005.
[42] A. Amjad, A. Griffiths and M. N. Patwary, “Multiple Face Detection Algorithm Using Colour Skin Modelling,” IET Image Processing, vol. 6, no. 8, pp. 1093-1101, 2012.
[43] H. Liu and Y. Zhang, “Curvature Computing of BOF Flame Boundary Based on Differential Chain Code,” Computer Engineering and Applications, vol. 49, no. 7, pp. 171-170, 2013.
[44] H. Liu and Y. Zhang, “State Recognition of BOF Based on Flame Image Features and GRNN,” Computer Engineering and Applications, vol. 47, no. 6, pp. 7-10, 2011.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[46] E. Stergiopoulou and N. Papamarkos, “Hand Gesture Recognition Using a Neural Network Shape Fitting Technique,” Engineering Applications of Artificial Intelligence, vol. 22, no. 8, pp. 1141-1158, 2009.
[47] J. Li and Q. Ruan, “Research of Gesture Recognition Based on Neural Networks,” Journal of Beijing Jiaotong University, vol. 30, no. 5, pp. 32-36, 2006.
[48] J. Cai, J. Cai, X. Liao, H. Huang and Q. Ding, “Preliminary Study on Hand Gesture Recognition Based on Convolutional Neural Network,” Computer Systems and Applications, vol. 24, no. 4, pp. 113-117, 2015.
[49] S. Yang, D. Ramanan, “Multi-scale Recognition with DAG-CNNs,” in
IEEE International Conference on Computer Vision, pp. 1215-1223,
2015.
[50] J. Triesch and D. Malsburg, “A System for Person-Independent Hand
Posture Recognition against Complex Backgrounds,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no.
12, pp. 1449-1453, 2001.
[51] J. Donahue et al., “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” in International Conference on Machine Learning, pp. 647-655, 2013.
[52] P. K. Pisharady, P. Vadakkepat and A. P. Loh, “Attention Based Detection and Recognition of Hand Postures Against Complex Backgrounds,” International Journal of Computer Vision, vol. 101, no. 3, pp. 403-419, 2013.
[53] L. van der Maaten and G. Hinton, “Visualizing Data Using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579-2605, 2008.
[54] J. Tang et al., “Visualizing Large-scale and High-dimensional Data,” in Proceedings of the International Conference on World Wide Web, Montreal, Canada, pp. 287-297, 2016.
[55] Q. Zheng et al., “A Bilinear Multi-Scale Convolutional Neural
Network for Fine-grained Object Classification,” IAENG International
Journal of Computer Science, vol. 45, no. 2, pp. 340-352, 2018.
[56] H. Song, M. Yi, J. Huang, and Y. Pan, “Bernstein Polynomials Method
for a Class of Generalized Variable Order Fractional Differential
Equations,” IAENG International Journal of Applied Mathematics, vol.
46, no. 4, pp. 437-444, 2016.
[57] A. Hambarde, M. F. Hashmi and A. Keskar, “Robust Image Authentication Based on HMM and SVM Classifiers,” Engineering Letters, vol. 22, no. 4, pp. 183-193, 2014.
[58] Q. Feng, “Jacobi Elliptic Function Solutions For Fractional Partial
Differential Equations,” IAENG International Journal of Applied
Mathematics, vol. 46, no. 1, pp. 121-129, 2016.
[59] Q. Zheng et al., “Improvement of Generalization Ability of Deep CNN
via Implicit Regularization in Two-Stage Training Process,” IEEE
Access, vol. 6, pp. 15844-15869, 2018.

Qinghe Zheng was born in Jining, Shandong, China in 1993. He received his B.S. degree from Xi'an University of Posts and Telecommunications in 2014 and his M.S. degree from Shandong University in 2018. He is now pursuing his Ph.D. degree at Shandong University. His research interests are computer vision and machine learning.