the equipment is expensive and uncomfortable to use, which is not a natural way for human-computer interaction. Vision-based hand gesture recognition is a method of classifying a sequence of images containing hand gestures. The gesture classification methods based on computer vision are mainly divided into the following categories: 1. Template matching. 2. Classification based on geometric features. 3. Classification based on deep convolutional neural networks. 4. Classification based on Hidden Markov Models (HMM). The template matching method first converts the image sequence into a set of static shape patterns and then compares them with pre-stored behavior templates during the classification process; finally, the closest known template is selected as the recognition result of the measured behavior. Bobick [11] converts the motion information of the target into two two-dimensional templates, namely the MEI (Motion Energy Image) and the MHI (Motion History Image), and then uses the Mahalanobis distance to test the similarity between samples and templates. The problem with this method is that it cannot remove the influence of movement time. The classification method based on geometric features [12] uses the edge of the contour and the regional structure features of the hand gesture as recognition features, and has good adaptability and stability in hand gesture recognition. However, its learning ability is not strong, and the recognition rate does not improve significantly when the sample size becomes larger. The statistical method based on deep convolutional neural networks [13] [14] can realize complex nonlinear mappings and is resistant to interference, so it is widely used in static hand gesture classification; but the learning ability of a shallow neural network is limited, and it easily falls into poor local optima and overfits. The HMM [15] has a strong ability to describe the spatial-temporal variations of gesture signals, but it has a large number of state probability density values and parameters to be estimated, which makes recognition slower. At present, the existing methods need further research on robust gesture feature extraction and representation.

In this paper, we propose a gesture recognition method combining a skin color model and a deep convolutional neural network. For the hand gesture images collected in different backgrounds, the Gaussian skin model is first used to segment the gesture area. Then we use the DCNN to establish the hand gesture classification model, which combines the gesture feature extraction and classification processes. It can effectively avoid the subjectivity and limitation of artificially designed features by simulating visual transmission and cognition. Finally, we use the backpropagation algorithm based on partial differential equations to train the DCNN, and then obtain the classification results on the test set. The structure of this paper is organized as follows. In Section II, we introduce the related work of our method. In Section III, we introduce the image preprocessing process, that is, using the Gaussian color model to extract the key area of the hand in the image. In Section IV, we describe the construction process of the deep CNN for hand gesture classification. The experiments used to validate the effectiveness and rationality of the method are described in Section V. Finally, we discuss the conclusion and shortcomings of the algorithm and the future work in Section VI.

II. RELATED WORK

According to the different data input devices, the existing gesture recognition methods are divided into two categories: gesture recognition based on data gloves and gesture recognition based on computer vision. The gesture recognition algorithm based on computer vision is cheap and easy to operate, which is useful for the development of natural human-computer interaction. Therefore, gesture recognition based on computer vision is the main emphasis of our research, and we focus on the active perception of hand gesture recognition algorithms based on computer vision. In the first part, we introduce some of the relevant research results on gesture segmentation. In the second part, we introduce the progress of gesture classification techniques based on computer vision.

A. Related Work of Hand Segmentation

In computer vision, the color spaces mainly consist of RGB, HSV and HSI, and the most commonly used is the RGB color space. Researchers can segment the image by using depth information from the Kinect sensor or by fusing the RGB and depth information. The former focuses on the speed of the algorithm, while the latter focuses on the classification accuracy. In [16], the hand segmentation task is considered as a depth clustering problem, and the pixels are grouped at different depth levels; by analyzing the dimensions of the human gesture, a threshold corresponding to the depth of the hand is determined. Lee [17] uses the K-means clustering algorithm and a predefined threshold to perform hand detection, and the hand pattern is analyzed by the convex hull to locate the fingers. Both of these methods assume that the hand is closest to the sensor, and the effect of the algorithm is greatly affected by the accuracy of the Kinect depth data. Caputo [18] uses Kinect-generated skeletal data to determine hand positions, and determines the size of the human hand at different depths by looking up tables storing standard hand information.

If there is no depth information, the difficulty of hand segmentation increases. Wang [19] uses the RGB and YCbCr spaces for threshold segmentation, and then uses a variety of morphological operations for image pre-processing. This method reduces the computational complexity and improves the computation speed, realizing the representation of contour features. Marin [20] uses the inter-frame difference method and the skin color model to calibrate the dynamic gesture area, which requires little computation, is fast, and can overcome the influence of shape change to a certain extent. Oikonomidis [21] integrates the color information for hand detection and converts the hand detection problem into labeling pixels as hand or non-hand. The skin detection operator on the RGB image and the clustering operator on the depth image are the two conditions for confirming hand pixels, and the hand region is the intersection of the two pixel sets. Jiang [22] treats the different features as different operators, applies the continuity principle to the depth and color information of adjacent pixels in the hand and non-hand areas, and searches from the palm of the hand as the starting point to ensure that all the pixels form a connected effective area. This method avoids the requirement that the hand must be located at the front, as well as the multiple-target problem of traditional depth-based methods, and effectively handles uneven color and depth data of the hand. Cao [23] presents a monocular vision gesture segmentation method that combines skin color information with motion information. By analyzing the characteristics of skin color and chroma separation in the YCbCr space and the clustering characteristics of the Cb-Cr plane, a Gaussian mixture model is constructed and a skin error decision algorithm with the minimum total error criterion is proposed.

On the other hand, deep learning has developed rapidly, and a series of segmentation algorithms based on deep CNNs [55] [57] [58] have been proposed, which are highly instructive. Neverova [24] proposes a deep learning based hand posture estimation method for gesture segmentation, which requires little label information and uses unlabeled data and synthetic data as training samples. The key to this work is to integrate structure information into the model structure, which improves the inference speed. It is found that adding unlabeled samples yields significant improvements in the segmentation results compared with purely supervised training. Zhang [25] builds a hybrid structure which includes three models: a depth model (depth with morphology), a skin model (skin color) and a background model (codebook algorithm). The inputs are the overlap rates between any two models, and the segmentation is done by a three-layer neural network, which reflects the consistency of the results of the models. Based on the theory of the relaxation labeling technique, Akinin [26] proposed a master-slave segmentation method using a Kohonen neural network. This method can effectively segment images with low SNR and can realize real-time processing; the experimental results show that it is better than the optimum discriminant threshold segmentation method and the matrix preserving threshold segmentation method. Long [27] proposed the fully convolutional network (FCN) for image segmentation, which tries to recover the category of each pixel from the abstract features. It has achieved very good results in image semantic segmentation, but the algorithm requires a large number of samples for training and takes a long time to complete the semantic segmentation task.

B. Related Work of Hand Gesture Classification

In gesture classification, Bilal [28] chooses skin color features and uses a particle filter to detect the palm and fingers, and then uses the template matching method to perform hand gesture classification. Bjom [29] uses human skin colour and motion features to track the human hand, and uses the nearest neighbor classifier for hand gesture classification, which achieves a classification accuracy of 82.2% for 100 sign words of Irish Sign Language. Kaufmann [30] proposes a gesture classification method using an intelligent evolutionary algorithm, which achieves a classification accuracy of 85.0% for 192 sign words of American Sign Language. Weng [31] uses a bias color model to segment human hands, and uses multi-feature fusion to perform hand gesture classification. Flasinski and Myslinski [32] construct a Gaussian skin color model and propose a gesture graph analytic classification method based on this model. Ren [33] integrates the information of skin color, movement and edge, extracts the characteristic lines that reflect the structural features of the human hand, and finally extracts the model parameters of the translation-invariant plane to complete the hand gesture classification. Through fuzzy operations on the background, movement and color of the video stream in the space and time domains, Zhu [34] segments the hand and uses the pyramid image to extract features for gesture classification. Similarly, Yang [35] proposes a gesture classification algorithm based on the spatial distribution features of the hand gesture, which uses a Gaussian brightness model to segment the skin area and uses a search window to filter the skin area to realize the gesture location. Lahamy and Lichti [36] use hand color camera arrays to obtain depth information for hand gesture classification. Vogler and Metaxas [37] use three vertical color RGB cameras and a position tracker as gesture input devices, and achieve a classification accuracy of 89.9% on 53 isolated sign words.

In the case of end-to-end gesture classification algorithms based on DCNNs, a large number of models have achieved remarkable results. Among them, the three models AlexNet [38], VGGNet [39] and GoogLeNet [40] have achieved good results in the field of image classification. AlexNet contains 5 convolutional layers, 3 pooling layers and 3 fully connected layers; it won the championship in ILSVRC 2012 and increased the classification accuracy from 74.3% to 83.6%. VGGNet-16 consists of 13 convolutional layers, 5 pooling layers, 3 fully connected layers and 1 softmax layer; it ranked first in the object localization task and second in the image classification task in ILSVRC 2014. Its outstanding contribution is to demonstrate that using very small convolution kernels (3×3) and increasing the network depth can effectively improve the classification effect, and it generalizes well to other datasets. GoogLeNet can not only keep the sparsity of the network structure but also obtain the high computing performance of dense matrices by using the Inception structure; the model won the championship in the image classification task in ILSVRC 2014 because of its excellent classification ability. Furthermore, a large number of training techniques are used in these frameworks to avoid overfitting and to improve the generalization ability of DCNNs, such as regularization, dropout and data augmentation.

III. IMAGE PREPROCESSING

In this section, we mainly introduce the preprocessing steps applied to gesture images before we feed them into the DCNN. In part A, we introduce a color balance algorithm to remove the interference caused by unstable light sources and other factors. Part B introduces the gesture area extraction method based on the Gaussian mixture model. Part C introduces the reconstruction process of the whole gesture area based on morphological operations and the single connectivity method. In fact, parts B and C are designed to remove interference from complex backgrounds for hand gesture recognition.

A. Color Balance Algorithm

Due to the impact of illumination, the acquired gesture images exhibit a color shift when the lighting is not ideal, which leads to larger noise in the gesture data. Therefore, we introduce a color balance algorithm based on the gray world transform (GWT) [41] to correct the color cast of the collected images.
Under the gray world assumption, the mean gray level of an image is computed from the means of its three color channels,

\overline{Gray} = \frac{\overline{R} + \overline{G} + \overline{B}}{3},

and each channel of every pixel is then rescaled by the ratio between \overline{Gray} and the corresponding channel mean,

R'(C) = R(C)\,\frac{\overline{Gray}}{\overline{R}}, \quad G'(C) = G(C)\,\frac{\overline{Gray}}{\overline{G}}, \quad B'(C) = B(C)\,\frac{\overline{Gray}}{\overline{B}},

where \overline{R}, \overline{G} and \overline{B} represent the mean values of the three channel colors R, G and B, respectively, and R(C), G(C), B(C) and R'(C), G'(C), B'(C) represent the color values of pixel C before and after correction, respectively. The effect of the algorithm is shown in Fig. 1. It can be seen that the GWT algorithm achieves color correction on color-biased images.

Fig. 1. Three sets of images before and after correction.
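As a rough illustration, the following NumPy sketch applies the gray world correction described above; the function name, the small epsilon and the clipping range are our own choices, not taken from the paper.

```python
import numpy as np

def gray_world_balance(image):
    """Gray world color balance for an H x W x 3 RGB image."""
    img = image.astype(np.float64)
    # Per-channel means and the overall gray level (R + G + B) / 3.
    channel_means = img.reshape(-1, 3).mean(axis=0)
    gray = channel_means.mean()
    # Rescale each channel so that its mean moves to the gray level.
    gains = gray / (channel_means + 1e-6)
    return np.clip(img * gains, 0, 255)
```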
B. Hand Gesture Area Extraction

Because of the complex backgrounds of the gesture images and the variability of skin color under different light sources, a reliable skin color model is needed to detect the gesture area. Some results [42] show that the skin color difference between different races is far smaller in chrominance than in luminance. The YCbCr color space has the advantage of separating luminance from chrominance, shows better clustering properties and stability, is insensitive to skin color rotation and racial differences, and the skin chrominance approximately follows a Gaussian statistical law in the Cb-Cr plane. The single Gaussian skin color model is

P(Cb, Cr) = \exp\left(-0.5\,(x - m)^{T} C^{-1} (x - m)\right)   (4)

where

x = (Cb, Cr)^{T}, \quad m = E(x), \quad C = E\{(x - m)(x - m)^{T}\}.

A single Gaussian is too restrictive for skin color under changing illumination, so the skin color is modeled by a mixture of K Gaussian components,

p(i) = \sum_{i=1}^{K} \alpha_i\, p(x; z, \theta_i)   (8)

where p(i) is the probability of the variable x belonging to the target class, the parameter K corresponds to the K single Gaussian distributions, Z represents the single-mode label, which follows a multinomial distribution, \theta is the distribution parameter corresponding to each single Gaussian distribution, and \alpha_i is the corresponding weight. The mixture model has higher flexibility than the single-mode distribution.

Through maximum likelihood estimation, we choose the likelihood function of the model as the objective function of the parameter estimation; the optimal solution of the parameters corresponds to the extreme position of the likelihood function. The natural logarithm of the likelihood l(\theta) for the K single models and the n training samples, and its derivative with respect to the parameters of the distribution family, are given by

l(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k\, p(x_i; \theta_k)   (9)

\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\alpha_j\, p(x_i; \theta_j)}{\sum_{k=1}^{K} \alpha_k\, p(x_i; \theta_k)}\, \frac{\partial \log p(x_i; \theta_j)}{\partial \theta_j}   (10)

Since the likelihood cannot be maximized in closed form, the parameters are estimated iteratively with the EM algorithm. At iteration t, the lower bound of the log-likelihood is

\sum_{i} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}   (12)

The mixing weight of the k-th component is updated as

\alpha_k^{(t)} = \sum_{i=1}^{N} l(z^{(i)} = k) / N   (13)

and the estimated mixture density is

f(x) = \sum_{i} \alpha_i\, N(x \mid \mu_i^{t}, \Sigma_i)   (14)

Setting the derivative of the expected log-likelihood with respect to the mean \mu_q^{t} to zero gives

\sum_{i=1}^{n} \omega_q^{(i)} \left(\Sigma_q^{-1} x^{(i)} - \Sigma_q^{-1} \mu_q^{t}\right) = 0   (18)
Accordingly, the updated mean parameter can be obtained by

\mu_q^{t} = \left(\sum_{i=1}^{n} \omega_q^{(i)} x^{(i)}\right) \Big/ \left(\sum_{i=1}^{n} \omega_q^{(i)}\right)   (19)

Finally, we implement the parameter update of the Gaussian mixture model. Then we apply the Gaussian mixture model to gesture image segmentation, and the preliminary results are shown in Fig. 2.

Fig. 2. The hand gesture region detected by the mixed Gaussian skin color model.
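For illustration, one possible implementation of this skin color segmentation uses scikit-learn's GaussianMixture, whose EM fitting stands in for the hand-derived updates above, on the chrominance channels. The component count and log-likelihood threshold below are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_skin_gmm(skin_samples_bgr, n_components=3):
    """Fit a Gaussian mixture to the chrominance (Cr, Cb) of known skin pixels."""
    ycrcb = cv2.cvtColor(skin_samples_bgr, cv2.COLOR_BGR2YCrCb)
    crcb = ycrcb[..., 1:3].reshape(-1, 2).astype(np.float64)
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(crcb)

def skin_mask(image_bgr, gmm, log_lik_threshold=-10.0):
    """Mark pixels whose chrominance is likely under the fitted skin model."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    crcb = ycrcb[..., 1:3].reshape(-1, 2).astype(np.float64)
    scores = gmm.score_samples(crcb)              # per-pixel log-likelihood
    mask = (scores > log_lik_threshold).reshape(image_bgr.shape[:2])
    return mask.astype(np.uint8) * 255
```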
C. Gesture Area Reconstruction

In fact, the mixed Gaussian skin color model usually cannot achieve satisfactory segmentation results directly because of the interference from complex backgrounds and unstable illumination, so we need to combine some other methods to reconstruct the gesture area.

1. Morphological method. In the binarized image, the edges of the gesture area contain holes, burrs or incomplete contours of different sizes. The morphological dilation operation extends the bright regions in the binary image, while the erosion operation extends the dark regions. A morphological opening, i.e., erosion followed by dilation with a 3×3 all-one structuring element, is applied to the binary image, which removes isolated noise and bulges along the edges. The output is shown in Fig. 3.

2. Single connected method. There are still isolated blocks [43] in the gesture image after the morphological processing, which are generated by the segmentation of the skin color model. These areas have an adverse effect on the classification results, so we use the marker connectivity method to eliminate the small areas. In Fig. 3, each small block area of the binary image is marked as {a_i | i = 1, 2, ⋯}. Then we calculate the area value S_i of each marked region, and obtain the maximum region according to (20):

S_{max} = \max\{S_i \mid i = 1, \dots, N\}   (20)

where S_{max} represents the maximum region area, i.e., the area of the gesture region, and N is the number of all regions. We then traverse the entire binary image to determine whether the mark of each pixel belongs to the largest region; the gray value of pixels belonging to the maximum region is set to 1, and all others to 0. The resulting gesture area is shown in Fig. 4.

Fig. 4. The hand gesture region after the single connected operation.

3. Reconstruction of the gesture area. After the single connected processing of the gesture binary image, the gesture area is complete and the background is flat. We then use the stroke method [44] to extract the edge of the gesture and rebuild the gesture color image. Finally, we obtain the gesture segmentation results, as shown in Fig. 5.

Fig. 5. Reconstructed RGB image of the hand gesture region.
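A compact OpenCV sketch of reconstruction steps 1 and 2 above (opening with a 3×3 all-one kernel, then keeping only the largest connected region as in (20)) might look as follows; the function name and connectivity choice are ours.

```python
import cv2
import numpy as np

def reconstruct_gesture_area(mask):
    """Clean a binary skin mask: 3x3 opening, then keep the largest region."""
    kernel = np.ones((3, 3), np.uint8)
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # Label the connected regions a_i and measure their areas S_i.
    num, labels, stats, _ = cv2.connectedComponentsWithStats(opened, connectivity=8)
    if num <= 1:
        return opened
    areas = stats[1:, cv2.CC_STAT_AREA]      # skip background label 0
    largest = 1 + int(np.argmax(areas))      # S_max = max{S_i}
    return np.where(labels == largest, 255, 0).astype(np.uint8)
```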
IV. CONSTRUCTION OF DCNN

In this section, we introduce the deep convolutional neural network used for hand gesture classification. The feature maps of the network are two-dimensional planes at each layer, consisting of multiple independent neurons. The network consists of convolutional layers, pooling layers, fully connected layers and a softmax layer, which ensures that feature extraction and image classification are performed simultaneously. The network realizes displacement, scaling and distortion invariance of the image information in three ways, namely local receptive fields, weight sharing and down-sampling. The local receptive field means that the neurons of each layer are connected to a local region of the upper layer; through the local receptive field, the network can extract the primary structural features of the image, such as corners, color, direction and edges. Weight sharing reduces the number of parameters that need to be trained and greatly enhances the generalization ability of the network. There are usually down-sampling layers behind the convolutional layers, which reduce the resolution of the feature maps and increase the displacement, scaling and distortion invariance of the network, making the trained weights more conducive to image classification.

In the convolutional layer, the feature maps of the previous layer are convolved with the kernels, and the results are passed through an activation function to obtain the feature maps of this layer. Each output feature map can be correlated with the convolution results of multiple input feature maps through shared weights. The neurons of each feature map in the convolutional layer can be calculated by

X_j^{l} = f\left(\sum_{i \in M_j} X_i^{l-1} * k_{ij}^{l} + B_j^{l}\right)   (21)

where l represents the position of the convolutional layer in the structure, k_{ij} represents the convolution kernels from the j-th layer to the i-th layer, f and B represent the activation function and bias respectively, and X_j is the set of input maps.

The pooling layer usually follows the convolutional layer and performs a down-sampling operation on the input feature maps. The number of output maps equals the number of input maps, and the dimension of the output maps is scaled down. The values of the neurons of each feature map in the sampling layer can be calculated by

X_j^{l} = f\left(\beta_j^{l}\, p(x_j^{l-1}) + B_j^{l}\right)   (22)

where p and \beta represent the down-sampling function and weights, respectively.

The last layer of the DCNN is the softmax layer, that is, the network outputs the classification results through the softmax function, which is defined as

p(i) = \frac{\exp(\theta_i^{T} x)}{\sum_{k=1}^{K} \exp(\theta_k^{T} x)}   (23)

The network is trained with the gradient descent backpropagation algorithm, in which the weights and biases are updated as

b_{ij}^{l} \leftarrow b_{ij}^{l} - \eta\, \frac{\partial J}{\partial b_{ij}^{l}}   (26)

where

\frac{\partial J}{\partial \omega_{ij}^{l}} = \frac{1}{N} \sum_{j=1}^{N} a_j^{L-1} (a_i^{L} - y_i), \qquad \frac{\partial J}{\partial b_{ij}^{l}} = \frac{1}{N} \sum_{j=1}^{N} (a_j - y_j)
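As a small worked example, the softmax output of (23) and the gradient descent update of (26) can be written in NumPy as follows; the learning rate is an illustrative value.

```python
import numpy as np

def softmax(logits):
    """p(i) = exp(theta_i^T x) / sum_k exp(theta_k^T x), cf. Eq. (23)."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(params, grads, lr=0.01):
    """Gradient descent update of Eq. (26): theta <- theta - lr * dJ/dtheta."""
    return [p - lr * g for p, g in zip(params, grads)]
```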
With the Swish activation function adopted in the network, the activations become sparse and the overfitting problem of the deep CNN is greatly alleviated. The target output of the neural network is a floating-point vector whose component corresponding to the correct class is 1 and whose remaining values are 0. In the experimental section, we verify and compare the classification results of the DCNN under various activation functions to show the effectiveness of the Swish function.

In order to determine the optimal number of convolutional layers and feature maps, various DCNNs with different structures are tested in the experiments. The results show that when the network consists of nine convolutional layers, it has the highest classification accuracy, so we finally use the deep CNN structure shown in Fig. 7. In the first three convolutional layers of the structure, each layer consists of 128 convolution kernels of size 5×5. In the middle three convolutional layers, each layer consists of 256 convolution kernels of size 3×3. The last three convolutional layers contain 512 kernels of size 3×3 per layer.
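A rough PyTorch sketch consistent with the layer counts just described is given below. The pooling positions, the global pooling before the classifier and the classifier width are assumptions on our part, since the paper's Fig. 7 is not reproduced here.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # Convolution followed by the Swish activation (x * sigmoid(x)), called SiLU in PyTorch.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2), nn.SiLU())

# Nine convolutional layers: 3 x (128, 5x5), 3 x (256, 3x3), 3 x (512, 3x3).
model = nn.Sequential(
    conv_block(3, 128, 5), conv_block(128, 128, 5), conv_block(128, 128, 5),
    nn.MaxPool2d(2),
    conv_block(128, 256, 3), conv_block(256, 256, 3), conv_block(256, 256, 3),
    nn.MaxPool2d(2),
    conv_block(256, 512, 3), conv_block(512, 512, 3), conv_block(512, 512, 3),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(512, 10),          # ten gesture classes (digits 0-9)
)
```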
V. EXPERIMENTS

The experiments consist of three parts. In the first part, we introduce the two hand gesture datasets used in the experiments. In part B, we demonstrate the effectiveness of the segmentation algorithm based on the mixed Gaussian skin color model. In part C, we show the classification results and testing speed of the DCNN on the two datasets to demonstrate its effectiveness, and analyze the learning process of the DCNN through visualized convolution kernels and feature maps.

Fig. 9. Different kinds of hand gesture image samples in the two databases. (a) shows the samples of the indoor database, and (b) shows the samples of the NUS database under the outdoor condition.

A. Two Datasets

The datasets in the experiments consist of two parts. The first dataset is a set of hand gesture images with an indoor background, which is used to validate the effectiveness of the DCNN model. In this dataset, we collect ten classes of hand gestures through an ordinary camera, as shown in Fig. 9(a). The ten categories represent the numbers 0 to 9, respectively. The dataset consists of 3000 images with different rotation angles under a single background. The background of the images comes from a light-stabilized indoor laboratory.

The second dataset used in the experiment is the standard hand gesture dataset of the National University of Singapore (NUS), which has a total of 3000 images in 10 categories, as shown in Fig. 9(b). All images come from various outdoor backgrounds, and the background interference (e.g., similar edges, textures and colors) is deliberately intensified.
Fig. 10. Segmentation results of gesture samples in the two databases. (a) shows the segmentation results of the hand gesture samples under indoor conditions. (b) shows the segmentation results of the hand gesture samples under outdoor conditions.
TABLE I
THE CLASSIFICATION RESULTS OF SAMPLES BEFORE AND AFTER SEGMENTATION IN TWO DATABASES
Indoor database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Original image 92.1 91.3 93.3 88.0 88.2 87.5 90.0 87.8 94.4 94.0 90.7
Segmented image 100.0 98.7 98.0 98.4 100.0 97.4 99.3 98.3 100.0 99.9 99.0
NUS database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Original image 89.2 88.7 86.3 90.1 88.0 87.6 88.1 89.0 86.9 87.1 88.1
Segmented image 98.1 96.9 96.5 98.4 98.4 97.7 97.6 98.2 96.4 96.3 97.5
TABLE II
THE CLASSIFICATION RESULTS OF DCNN WITH DIFFERENT ACTIVATION FUNCTIONS IN TWO DATABASES
Indoor database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Sigmoid 89.1 86.4 86.0 86.6 90.1 85.9 87.4 86.8 87.9 88.2 87.4
Tanh 90.9 88.7 87.2 89.1 90.2 87.9 90.0 89.9 90.7 91.1 89.6
Softplus 98.2 97.6 96.0 97.1 98.0 96.9 97.4 97.2 98.0 97.8 97.4
ReLU 98.8 97.4 97.1 97.1 98.3 96.9 97.5 97.5 98.5 98.1 97.7
PReLU 98.7 98.6 97.8 98.0 99.2 96.6 98.2 99.0 99.0 99.2 98.4
Swish 100.0 98.5 98.0 98.5 100.0 97.5 99.0 98.5 100.0 100.0 99.0
NUS database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Sigmoid 88.2 87.3 87.3 88.1 89.0 88.3 87.8 89.2 86.7 85.6 87.8
Tanh 89.1 88.3 87.7 89.3 89.0 88.4 88.7 89.1 87.5 87.0 88.4
Softplus 96.8 95.9 95.7 95.8 96.3 95.6 96.0 97.2 96.3 96.8 96.2
ReLU 95.8 95.6 94.4 95.8 97.0 95.6 96.4 96.6 96.7 96.1 96.0
PReLU 96.3 95.7 96.7 96.7 98.5 96.4 97.0 97.5 95.2 94.4 96.4
Swish 98.1 96.9 96.5 98.4 98.4 97.7 97.6 98.2 96.4 96.3 97.5
TABLE III
THE CLASSIFICATION RESULTS OF DIFFERENT NEURAL NETWORK MODELS IN TWO DATABASES
Indoor database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Alexnet 96.7 97.5 96.0 95.4 96.6 95.9 95.3 96.1 97.0 97.5 96.4
VGG-16 99.0 98.1 96.5 97.2 98.8 97.9 98.6 98.6 98.8 98.1 98.2
VGG-19 98.7 97.4 97.0 97.5 99.0 96.2 96.2 97.8 98.0 98.8 97.7
GoogLeNet 98.8 97.1 98.9 97.4 98.5 96.5 96.8 97.5 98.1 98.4 97.8
Our model 100.0 98.5 98.0 98.5 100.0 97.5 99.0 98.5 100.0 100.0 99.0
NUS database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Alexnet 95.2 93.3 92.9 95.1 95.0 94.8 94.1 94.5 91.9 92.2 93.9
VGG-16 94.7 94.4 96.6 95.4 95.3 95.0 95.3 98.4 94.1 94.0 95.3
VGG-19 96.7 94.2 95.9 93.0 93.1 95.2 95.7 96.0 93.7 96.6 95.1
GoogLeNet 98.5 96.0 94.7 96.9 93.8 97.8 93.0 94.2 93.9 95.4 95.1
Our model 98.1 96.9 96.4 98.4 98.4 97.8 97.6 98.2 96.4 96.3 97.5
We record the convergence curves of the DCNN during training on the NUS dataset, as shown in Fig. 11. The red curve represents the training iterations on the original images, while the blue curve represents the training iterations on the segmented gesture images. It can be seen from the figure that the training process on the segmented gesture images converges about 50 epochs earlier than that on the original images. The hand gesture images after segmentation contain far less useless information, which helps the network converge globally to the right location. Moreover, the iteration curve of the original images oscillates over a large range at the end of the training, which is not what we expect to see.

On the basis of the segmented hand gesture images, we verify the effect of various activation functions in our DCNN structure, as shown in Table II. In the table, we compare the per-class classification accuracies of the 10 classes and the average classification accuracies on the two datasets. The highest classification accuracy in each column is shown in bold.

It can be seen from the classification results that the network with the Swish function has the highest classification accuracy on the hand gestures in the two datasets. The network with the PReLU function ranks second due to its stable learning property; because the neuron necrosis problem of ReLU hampers the effective training of deep networks, the classification accuracy of the network with the ReLU function is lower than that of the network with the PReLU function. The networks with the Tanh and Sigmoid functions cannot effectively train the deep layers of the network because of the gradient vanishing problem, so their classification accuracy is the lowest.

After determining the activation function, we compare multiple DCNN structures and determine the optimal structure for hand gesture classification. In Table III, we compare the classification results of the four structures AlexNet, VGG-16, VGG-19 and GoogLeNet.
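For reference, the activation functions compared in Table II (and later in Table VI) can be written as follows; the PReLU slope is shown as a fixed constant here, although it is a learned parameter in the network.

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def softplus(x):           return np.logaddexp(0.0, x)       # log(1 + exp(x))
def relu(x):               return np.maximum(0.0, x)
def prelu(x, alpha=0.25):  return np.where(x > 0, x, alpha * x)
def swish(x):              return x * sigmoid(x)
```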
TABLE IV
THE COMPARISON OF CLASSIFICATION RESULTS OF DIFFERENT ALGORITHMS IN TWO DATABASES
Indoor database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Stergiopoulou [46] 96.6 94.0 94.3 94.3 97.2 94.7 95.0 94.4 97.1 96.9 95.5
Jiang [47] 94.5 92.3 93.0 93.5 95.4 92.1 94.3 93.2 93.6 95.2 93.7
Cai [48] 97.0 95.8 96.9 94.6 96.2 94.8 95.5 95.9 96.1 96.7 95.9
Caffe-DAG [49] 98.2 96.5 95.9 95.8 98.0 95.4 96.6 96.2 97.8 98.8 96.9
Triesch [50] 90.4 91.2 89.8 87.9 91.0 90.1 90.8 90.2 91.0 91.6 90.4
DeCAF [51] 98.0 95.5 94.9 98.8 96.8 96.7 99.2 95.6 97.1 97.4 97.0
Pisharady [52] 97.0 95.7 93.7 98.8 95.8 97.7 99.1 96.6 96.2 95.4 96.6
PHOG+SVM 92.2 89.4 91.4 92.1 91.1 91.0 89.5 89.9 90.0 91.2 90.8
SIFT+SVM 92.1 92.7 91.9 91.6 94.0 90.8 92.4 92.2 93.2 92.7 92.4
Hu+GLCM+SVM 95.1 96.4 96.1 95.7 97.8 96.9 96.6 95.9 97.0 97.5 96.5
Hu+GLCM+GRNN 97.2 98.6 95.5 95.9 97.4 97.7 96.5 96.1 98.0 98.3 97.1
Our model 100.0 98.5 98.0 98.5 100.0 97.5 99.0 98.5 100.0 100.0 99.0
NUS database (values in %)   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Class 9   Class 10   mAP
Stergiopoulou [46] 95.7 93.8 93.0 95.5 95.1 94.4 94.4 95.6 92.5 92.2 94.2
Jiang [47] 93.1 90.2 90.7 94.1 94.5 92.8 93.0 94.1 91.2 91.7 92.5
Cai [48] 94.6 93.2 93.4 96.0 95.1 94.6 94.3 96.2 92.9 92.5 94.3
Caffe-DAG [49] 94.6 94.9 97.2 98.5 95.5 96.2 95.0 96.1 95.7 93.5 95.7
Triesch [50] 88.0 87.4 87.2 88.5 89.1 88.4 88.2 89.7 88.0 87.6 88.2
DeCAF [51] 95.2 92.3 92.0 93.7 94.4 93.6 92.2 93.0 91.2 92.3 93.0
Pisharady [52] 93.3 93.9 94.1 96.0 95.4 94.9 94.4 96.5 93.7 91.8 94.4
PHOG+SVM 91.7 91.9 92.2 93.3 94.0 92.6 92.6 94.1 91.7 92.1 92.6
SIFT+SVM 90.1 89.7 89.6 91.2 93.0 90.9 90.1 91.1 88.9 89.6 90.4
Hu+GLCM+SVM 96.0 95.6 94.9 95.5 96.2 95.9 95.7 96.7 96.6 94.1 95.7
Hu+GLCM+GRNN 97.7 96.5 94.4 96.7 96.1 96.6 94.5 96.0 94.7 96.4 96.0
Our model 98.1 96.9 96.5 98.4 98.4 97.7 97.6 98.2 96.4 96.3 97.5
Based on AlexNet, VGGNet increases the depth of the network, giving the network a more outstanding ability to express features. GoogLeNet increases the width of the network by adding the Inception structure, and the convolution kernels of different sizes within the same convolutional layer enhance the feature extraction ability of the network at different scales. Drawing on the advantages of both, we build the final DCNN framework through a large number of experiments, as shown in Fig. 7.

The classification results compared with other algorithms. We compare the classification accuracy of the various algorithms, as shown in Table IV. The recognition time of each algorithm for a single image is shown in Table V.

In terms of classification performance, it can be seen that our algorithm has the highest average classification accuracy. The accuracy on the two hand gesture datasets is improved from 97.1% to 99.0% and from 96.0% to 97.5%, respectively. The reason is that the multilayer CNN uses weight sharing and the network structure has a limited number of weights. Therefore, the network has good generalization ability; concretely, the accuracy on the test set is very close to the classification accuracy on the training set. The training process takes full advantage of the self-organizing ability of the gradient descent learning method based on partial differential equations, which can force the network to extract the most effective features of the images in different feature spaces. When a traditional single-hidden-layer neural network is used in gesture classification, the network has no feature abstraction capability for the input data; it is necessary to manually select features of the target area as the input of the network, and this artificial selection of features is subjective, so it is difficult to establish a complete description of the hand gesture. Stergiopoulou [46] constructs a CNN model to recognize gestures. The inputs of the network are binary gesture images, which improve the classification efficiency of the network and give better real-time performance. However, the binary images lack the local information of the gesture, so the network has no texture description ability and can only describe the features of the hand shape. Therefore, its classification accuracy is lower than that of the method proposed in this paper.
TABLE V
COMPARISON OF TESTING TIME FOR A SINGLE IMAGE OF VARIOUS METHODS
Method                  Testing time/frame (s)
Stergiopoulou [46]      0.005
Jiang [47]              0.710
Cai [48]                0.120
Caffe-DAG [49]          0.057
Triesch [50]            0.026
DeCAF [51]              0.044
Pisharady [52]          0.026
PHOG+SVM                0.116
SIFT+SVM                0.492
Hu+GLCM+SVM             0.015
Hu+GLCM+GRNN            0.027
Our model               0.034

TABLE VI
CLASSIFICATION ACCURACIES OF VARIOUS ACTIVATION FUNCTIONS
Activation function     Accuracy (indoor dataset)   Accuracy (outdoor dataset)
Sigmoid                 80.4%                       74.9%
Tanh                    90.2%                       84.2%
Softplus                97.1%                       95.6%
ReLU                    98.4%                       96.1%
Leaky-ReLU              98.3%                       96.2%
P-ReLU                  98.6%                       96.6%
Maxout                  98.8%                       97.2%
Swish                   99.0%                       97.5%

Fig. 11. Convergence curves in the training process of DCNN.

Generalized regression neural network (GRNN) and support vector machine (SVM) are two commonly used methods in image classification, and these two methods have the advantages of fast training and strong nonlinear mapping ability. In the comparison experiment, once the main body of the gesture is detected, we select the first and second moments of the Hu invariant moment parameters as the global features, and the energy, contrast and correlation parameters of the gray level co-occurrence matrix as the local features. We then feed the combined five-dimensional features into the GRNN and SVM for training and classification. Among traditional feature descriptors, invariant moments can represent the overall shape of objects and have good invariance to translation, rotation and scale transformations; the gray level co-occurrence matrix has better texture description ability and can describe the detail information of the image. However, when the change of the gesture is large, a single shape feature cannot describe it well, and the texture feature based on pixel gray values is sensitive to the deformation of the object and the change of view angle. Jiang [47] used Hu invariant moments to extract seven parameters of the gestures as features and classified them with a BP neural network. Cai [48] uses four custom geometric invariants as features to classify gestures. However, these features do not describe the gestures as well as the DCNN, and the accuracy of both of these algorithms on the ten categories of gestures is lower than that of our DCNN-based algorithm.

In terms of recognition efficiency, although the DCNN-based approach does not achieve the fastest recognition speed, it meets the real-time requirements of practical applications and can process nearly 30 images per second. In the training phase, the sparse expression property of the activation function keeps the scale of the network smaller. In the testing phase, the network automatically extracts the features of the gesture image through forward propagation, and the recognition process takes a shorter time. The Hu+GLCM+SVM method is the fastest, because the feature extraction process of the Hu invariant moments is very simple, but its classification accuracy is significantly lower than that of the DCNN-based method.

The comparison with various activation functions. We compared different activation functions on the DCNN under the two datasets to observe the effect of the Swish function. The classification results are shown in Table VI. Comparing the experimental results of the activation functions on the two datasets, we can clearly see that Swish achieves the highest classification results. It increases the classification results of the maxout function by 0.2% and 0.3%, respectively.

Visualization analysis and understanding of the learning process of DCNN. The confusion matrix is a kind of visualization of classification results commonly used in supervised learning, which can intuitively express the precision and recall of a classification model. Element M_{i,j} in the confusion matrix represents the number of samples of class i that are assigned to class j. The values of the matrix can be normalized between 0 and 1, and based on this ratio we give each element of the matrix a hue from blue to red. The elements on the main diagonal represent the proportion of samples that are correctly classified; the closer the color of the main diagonal of the confusion matrix is to red, the higher the classification rate. We draw the confusion matrices for the classification results on the two datasets, as shown in Fig. 12 (a) and (b). It can be seen from the figure that only a few samples are misclassified into other categories.

Fig. 12. Confusion matrices of the classification results in the two databases. (a) is the confusion matrix of the classification results of gesture samples in the indoor environment, and (b) is the confusion matrix of the classification results of gesture samples in the outdoor environment.

In order to observe the effect of the image segmentation operation on the DCNN training process and the convergence results, we use the t-SNE [53] and LargeVis [54] methods to reduce the dimension of the feature vector of the last fully connected layer in the DCNN, and show the distribution of the two-dimensional features in the plane, as shown in Fig. 13. The left image represents the visualization result of t-SNE and the right image represents the visualization result of LargeVis. In the figure, we show the feature distribution of the testing samples of the 10 categories of the NUS dataset. It can be clearly seen from the graph that the features of the original images are messy and loose due to the interference of the background.
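A minimal sketch of this visualization step, assuming the last fully connected layer features have already been extracted into a NumPy array, could use scikit-learn's t-SNE as follows; the perplexity value is an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_embedding(features, labels):
    """Project DCNN features (n_samples x n_dims) to 2-D and color by class."""
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
    plt.title("t-SNE embedding of DCNN features")
    plt.show()
```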
ACKNOWLEDGMENT

The authors acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

REFERENCES

[1] K. Niyazi, V. Kumar, S. Mahe and S. Vyawahare, “Mouse Simulation Using Two Coloured Tapes,” International Journal of Information Sciences and Techniques, vol. 2, no. 2, pp. 57-63, 2012.
[2] S. Mitra and T. Acharya, “Gesture Recognition: A Survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 37, no. 3, pp. 311-324, 2007.
[3] J. Davis and M. Shah, “Visual Gesture Recognition,” IEE Proceedings - Vision, Image and Signal Processing, vol. 141, no. 2, pp. 101-106, 1994.
[4] Z. Lu et al., “A Hand Gesture Recognition Framework and Wearable Gesture-based Interaction Prototype for Mobile Devices,” IEEE Transactions on Human-Machine Systems, vol. 44, no. 2, pp. 293-299, 2017.
[5] L. Bretzner, I. Laptev and T. Lindeberg, “Hand Gesture Recognition Using Multi-scale Colour Features, Hierarchical Models and Particle Filtering,” in IEEE 5th International Conference on Automatic Face and Gesture Recognition, pp. 423-428, 2002.
[6] B. Stenger, “Template-based Hand Pose Recognition Using Multiple Cues,” in Asian Conference on Computer Vision, pp. 551-560, 2006.
[7] M. C. Ornellas, “A Deformable Contour Based Approach to Hand Image Segmentation,” in Proceedings of the First International Conference on Cyber Crime Investigation, pp. 10-18, 2004.
[8] D. Wu et al., “Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 38, no. 8, pp. 1583-1597, 2016.
[9] A. Ogihara, H. Matsumoto and A. Shiozaki, “Hand Region Extraction by Background Subtraction with Renewable Background for Hand Gesture Recognition,” in IEEE International Symposium on Intelligent Signal Processing and Communications, pp. 227-230, 2007.
[10] H. Kenn, F. V. Megen and R. Sugar, “A Glove-based Gesture Interface for Wearable Computing Applications,” in VDE International Forum on Applied Wearable Computing, pp. 1-10, 2007.
[11] J. W. Davis and A. F. Bobick, “The Representation and Recognition of Human Movement Using Temporal Templates,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 928-934, 1997.
[12] Y. Liu, Y. Yin and S. Zhang, “Hand Gesture Recognition Based on HU Moments in Interaction of Virtual Reality,” in IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics, pp. 145-148, 2012.
[13] S. Murthy and R. S. Jadon, “Hand Gesture Recognition Using Neural Networks,” in IEEE Advance Computing Conference, pp. 134-138, 2010.
[14] E. Stergiopoulou and N. Papamarkos, “Hand Gesture Recognition Using a Neural Network Shape Fitting Technique,” Engineering Applications of Artificial Intelligence, vol. 22, no. 8, pp. 1141-1158, 2009.
[15] N. Liu, B. C. Lovell and P. J. Kootsookos, “Evaluation of HMM Training Algorithms for Letter Hand Gesture Recognition,” in IEEE International Symposium on Signal Processing and Information Technology, pp. 648-651, 2004.
[16] R. Y. Tara, P. I. Santosa and T. B. Adji, “Hand Segmentation From Depth Image Using Anthropometric Approach in Natural Interface Development,” International Journal of Scientific and Engineering Research, vol. 3, no. 5, pp. 1-4, 2012.
[17] U. Lee and J. Tanaka, “Hand Controller: Image Manipulation Interface Using Fingertips and Palm Tracking with Kinect Depth Data,” in Proceedings of the Asia Pacific Conference on Computer Human Interaction, pp. 705-706, 2012.
[18] M. Caputo, K. Denker, B. Dums and G. Umlauf, “3D Hand Gesture Recognition Based on Sensor Fusion of Commodity Hardware,” Mensch und Computer, vol. 12, no. 1, pp. 293-302, 2012.
[19] X. Wang, “Hand Gesture Recognition Based on BP Neural Network in Complex Background,” Computer Applications and Software, vol. 30, no. 3, pp. 247-95, 2013.
[20] G. Marin, F. Dominio and P. Zanuttigh, “Hand Gesture Recognition with Jointly Calibrated Leap Motion and Depth Sensor,” Multimedia Tools and Applications, vol. 75, no. 22, pp. 1-25, 2016.
[21] I. Oikonomidis, N. Kyriazis and A. Argyros, “Efficient Model-Based 3D Tracking of Hand Articulations Using Kinect,” in British Machine Vision Conference, pp. 1-11, 2011.
[22] P. Gupta, “An Efficient Slap Fingerprint Segmentation and Hand Classification Algorithm,” Neurocomputing, vol. 142, no. 1, pp. 464-477, 2014.
[23] X. Cao, J. Zhao and M. Li, “Monocular Vision Gesture Segmentation Based on Skin Color and Motion Detection,” Journal of Hunan University, vol. 38, no. 1, pp. 78-83, 2011.
[24] N. Neverova, C. Wolf et al., “Hand Segmentation with Structured Convolutional Learning,” in Asian Conference on Computer Vision, pp. 687-702, 2015.
[25] X. Zhang, Z. Ye, L. Jin and S. Xu, “A New Writing Experience: Finger Writing in the Air Using a Kinect Sensor,” IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 85-93, 2013.
[26] M. V. Akinin, A. I. Taganov, M. B. Nikiforov and A. V. Sokolova, “Image Segmentation Algorithm Based on Self-organized Kohonen's Neural Maps and Tree Pyramidal Segmenter,” in IEEE Mediterranean Conference on Embedded Computing, pp. 168-170, 2015.
[27] J. Long, E. Shelhamer and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[28] S. Bilal, R. Akmeliawati, M. Salami, A. Shafie and E. E. Bouhabba, “A Hybrid Method Using Haar-like and Skin-color Algorithm for Hand Posture Detection,” in IEEE International Conference on Mechatronics and Automation, pp. 934-939, 2010.
[29] Y. She, Q. Wang et al., “A Real-Time Hand Gesture Recognition Approach Based on Motion Features of Feature Points,” in IEEE International Conference on Computational Science and Engineering, pp. 1096-1102, 2014.
[30] B. Kaufmann, J. Louchet and E. Lutton, “Hand Posture Recognition Using Real-time Artificial Evolution,” in Applications of Evolutionary Computation, pp. 251-260, 2013.
[31] C. Weng, Y. Li, M. Zhang, K. Guo, X. Tang et al., “Robust Hand Posture Recognition Integrating Multi-cue Hand Tracking,” in International Conference on E-Learning and Games, Changchun, China, pp. 497-508, 2010.
[32] M. Flasiński and S. Myśliński, “On the Use of Graph Parsing for Recognition of Isolated Hand Postures of Polish Sign Language,” Pattern Recognition, vol. 43, no. 6, pp. 2249-2264, 2010.
[33] H. Ren and G. Xu, “Hand Gesture Recognition Based on Characteristic Curves,” Journal of Software, vol. 12, no. 5, pp. 987-993, 2002.
[34] J. Y. Zhu, “Hand Gesture Recognition Based on Structure Analysis,” Chinese Journal of Computers, vol. 29, no. 12, pp. 27-32, 2006.
[35] B. Yang et al., “Gesture Recognition in Complex Background Based on Distribution Features of Hand,” Journal of Computer-Aided Design and Computer Graphics, vol. 22, no. 10, pp. 1841-1848, 2010.
[36] Z. Li and R. Jarvis, “Real Time Hand Gesture Recognition Using a Range Camera,” in Australasian Conference on Robotics and Automation, pp. 21-27, 2009.
[37] C. Vogler and D. Metaxas, “Adapting Hidden Markov Models for ASL Recognition by Using Three-dimensional Computer Vision Methods,” in IEEE International Conference on Systems, Man, and Cybernetics, pp. 156-161, 1997.
[38] A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[39] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
[40] C. Szegedy, W. Liu and Y. Jia, “Going Deeper with Convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[41] E. Y. Lam, “Combining Gray World and Retinex Theory for Automatic White Balance in Digital Photography,” in Proceedings of the 9th IEEE International Symposium on Consumer Electronics, pp. 134-139, 2005.
[42] A. Amjad, A. Griffiths and M. N. Patwary, “Multiple Face Detection Algorithm Using Colour Skin Modelling,” IET Image Processing, vol. 6, no. 8, pp. 1093-1101, 2012.
[43] H. Liu and Y. Zhang, “Curvature Computing of BOF Flame Boundary Based on Differential Chain Code,” Computer Engineering and Applications, vol. 49, no. 7, pp. 171-170, 2013.
[44] H. Liu and Y. Zhang, “State Recognition of BOF Based on Flame Image Features and GRNN,” Computer Engineering and Applications, vol. 47, no. 6, pp. 7-10, 2011.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[46] E. Stergiopoulou and N. Papamarkos, “Hand Gesture Recognition Using a Neural Network Shape Fitting Technique,” Engineering Applications of Artificial Intelligence, vol. 22, no. 8, pp. 1141-1158, 2009.