Hand Gesture Recognition Using Diff Models
RESEARCH ARTICLE
1 School of Medical Information and Engineering, Xuzhou Medical University, Xuzhou, China, 2 Engineering
Research Center of Medical and Health Sensing Technology, Xuzhou Medical University, Xuzhou, China
operational spaces. This severely restricts the range of human expressive actions and diminishes work efficiency. Consequently, to enhance the naturalness of HMI, the next generation of HMI technology needs to be human-centric, diversified, and intelligent [3]. In real-life situations, besides verbal communication, gestures serve as one of the most significant means for humans to convey information, enabling direct and effective expression of user needs.
Research conducted by Liu et al. pointed out that hand gestures constitute a significant part of
human communication, with advantages including high flexibility and rich meaning, making
them an important modality in HMI [4]. Consequently, Hand Gesture Recognition (HGR)
has emerged as a new type of HMI technology and has become a research hotspot with enor-
mous potential in various domains. For instance, in the healthcare domain, capturing and ana-
lyzing physiological characteristics related to finger movements can significantly assist in
studying and developing appropriate rehabilitation postures [5]. In the field of mechanical
automation, interaction between fingers and machines can be achieved by detecting finger
motion trajectories [6]. In the field of virtual reality, defining different gesture commands
allows users to control the movements of virtual characters from a first-person perspective [7].
Research on HGR can be classified into two categories based on the methods of acquiring
gesture data: vision-based HGR and wearable device-based HGR. Vision-based HGR relies on
cameras as the primary tools for capturing gesture data. They offer advantages such as low cost
and no direct contact with the human hands. However, despite the success of high-quality
cameras, vision-based systems still have some inherent limitations, including a restricted field
of view and high computational costs [8, 9]. In certain scenarios, robust results may require
the combined data acquisition from multiple cameras due to issues like depth and occlusion
[10, 11]. Consequently, the presence of these aforementioned challenges often hinders vision-
based HGR methods from achieving optimal performance. In recent years, wearable device-
based HGR has witnessed rapid development due to advancements in sensing technology and
widespread sensor applications. Compared to vision-based approaches, wearable device-based
HGR eliminates the need to consider camera distribution and is less susceptible to external
environmental factors such as lighting, occlusion, and background interference. Data gloves
represent a typical example of wearable devices used in HGR. These gloves are equipped with
position tracking sensors that enable real-time capture of spatial motion trajectory information
of users’ hand postures. Based on predefined algorithms, gesture actions can be recognized,
mapped to corresponding response modules, thus completing the HMI process. HGR sys-
tems based on data gloves have become a research hotspot in the relevant field. These systems
offer several advantages, including stable acquisition of gesture data, reduced interference
from complex environments and satisfactory modeling and recognition results, especially
when dealing with large-scale gesture data [12].
In the field of HGR, researchers primarily focus on two types of gestures: static gestures and
dynamic gestures. Static HGR systems analyze hand posture data at a specific moment to
determine its corresponding meaning. However, static gesture data only provide spatial infor-
mation of hand postures at each moment, while temporal information of hand movements is
disregarded. As a result, the actual semantic information conveyed is limited, making it chal-
lenging to extend to complex real-world applications. Dynamic HGR systems, on the other
hand, deal with information regarding the changes in hand movement postures over a period
of time. These systems require a comprehensive consideration of both spatial and temporal
aspects of hand postures. Clearly, compared to static gestures, dynamic gestures can convey
richer semantic information and better align with people’s actual needs in real-life scenarios.
Although numerous research efforts have been dedicated to dynamic HGR algorithms, most
are based on vision systems, and the challenge of dynamic HGR using data gloves remains.
The dynamic gesture investigated in this study is the seven-step handwashing technique, a
crucial procedure in the healthcare field. Proper handwashing can effectively reduce the
probability of disease transmission. Our work applies the seven-step handwashing to medical
simulation training, where users wear data gloves to perform the handwashing process. Addi-
tionally, we design an automated dynamic gesture recognition algorithm to assess whether
users correctly execute the specified hand gesture steps. Specifically, we developed a data
glove-based dynamic HGR algorithm in this paper by incorporating deep learning techniques.
This algorithm considers both spatial and temporal information of gesture data. Firstly, the
Convolutional Neural Network (CNN) is utilized to extract local features of gesture data at
each moment. Subsequently, these features are incorporated into the Bidirectional Long
Short-Term Memory (BiLSTM) structure to model the temporal relationships. Finally, an
attention mechanism is employed to enhance the gesture features and output the recognition
results of dynamic gestures. In summary, this paper makes three main contributions:
1. Within the context of medical simulation, a data glove-based seven-step handwashing
dynamic hand gesture data collection process was defined, and dynamic hand gesture data
from 32 subjects were collected following this procedure.
2. A novel data glove-based dynamic HGR algorithm, called Attention-based CNN-BiLSTM
Network (A-CBLN), was designed by combining deep learning techniques with the charac-
teristics of dynamic gesture data. A-CBLN integrates the advantages of CNN and BiLSTM,
effectively capturing the spatiotemporal features of gesture data, and further enhancing the
features using an attention mechanism, resulting in precise recognition of dynamic
gestures.
3. Extensive experiments were conducted to verify the effectiveness of the A-CBLN algorithm
for dynamic gesture recognition, and key parameter settings within A-CBLN were thor-
oughly discussed. The results obtained from the test dataset demonstrated that our pro-
posed method outperformed other comparative algorithms in terms of accuracy, precision,
recall and F1-score.
The remaining sections of this paper are organized as follows. In Section 2, we review recent
works related to HGR, with a particular focus on data glove-based HGR methods. Section 3
provides a detailed description of the proposed algorithm for dynamic gesture recognition.
Section 4 encompasses the data collection methodology for gestures and provides implementa-
tion details of the conducted experiments. The relevant experimental results and analysis are
presented in Section 5, followed by a concise summary of this paper in Section 6.
2. Related works
In recent years, research in the HGR field has focused on two main aspects: the type of gesture
data (static or dynamic) and the sensors used for data collection (visual systems or wearable
devices). This section provides an overview of relevant studies in HGR, emphasizing research
involving wearable devices like data gloves.
Static hand gesture recognition research primarily focuses on analyzing the spatial features
of gesture data without considering its temporal variations. This type of research is primarily
applied in sign language recognition scenarios. A static hand gesture recognition system based
on wavelet transform and neural networks was proposed by Karami et al. [13]. The system
operated by taking hand gesture images acquired by a camera as input and extracting image
features using Discrete Wavelet Transform (DWT). These features were fed into a neural net-
work for classification. In the experimental section, 32 Persian sign language (PSL) letter
symbols were selected for investigation. The training was conducted on 416 images, while test-
ing was performed on 224 images, resulting in a test accuracy of 83.03%. Thalange et al. [14]
introduced two novel feature extraction techniques, Combined Orientation Histogram and
Statistical (COHST) Features and Wavelet Features, to address the recognition of static sym-
bols representing numbers 0 to 9 in American Sign Language. Hand gesture data was collected
using a 5-megapixel network camera and processed with different feature extraction methods
before input into a neural network for training. The proposed approach achieved an outstand-
ing average recognition rate of 98.17%. Moreover, a novel data glove with 14 sensor units was
proposed by Wu et al. [15], who explored its performance in static hand gesture recognition.
They defined 10 static hand gestures representing digits 0–9 and collected data from 10 sub-
jects, with 50% of the data used for training and the remaining 50% for testing. By employing a
neural network for classification experiments, they achieved an impressive overall recognition
accuracy of 98.8%. Lee et al. [16] introduced a knitted glove capable of pattern recognition for
hand poses and designed a novel CNN model for hand gesture classification experiments. The
experimental results demonstrated that the proposed CNN structure effectively recognized 10
static hand gestures, with classification accuracies ranging from 79% to 97% for different ges-
tures and an average accuracy of 89.5%. However, they only recruited 10 subjects for the exper-
iments. Antillon et al. [17] developed an intelligent diving glove capable of recognizing 13
static hand gestures for underwater communication. They employed five classical machine
learning classification algorithms and conducted training on hand gesture data from 24 sub-
jects, with testing performed on an independent group of 10 subjects. The experimental results
indicated that all classification algorithms achieved satisfactory hand gesture recognition per-
formance in dry environments, with accuracies ranging from 95% to 98%. The performance
slightly declined in underwater experimental conditions, with accuracies ranging from 81% to
94%. Yuan et al. [18] developed a wearable gesture recognition system that can simultaneously
recognize ten types of numeric gestures and nine types of complex gestures. They utilized the
Multilayer Perceptron (MLP) algorithm to recognize 19 static gestures with 100% accuracy,
showcasing the strong capabilities of deep learning technology in the field of HGR. However,
it is worth noting that the sample data in their experimental section was derived solely from
four male volunteers. Moreover, a data glove based on flexible sensors was utilized by Ge et al.
[19] to accurately predict the final hand gesture before the completion of the user’s hand
movement in real time. They constructed a gesture dataset called Flex-Gesture, which con-
sisted of 16 common gestures, each comprising 3000 six-dimensional flexion data points.
Additionally, they proposed a multimodal data feature fusion approach and employed a com-
bination of neural networks and support vector machines (SVM) as classifiers. The system
achieved a remarkable prediction accuracy of 98.29% with a prediction time of only 0.2329 ms.
However, it should be noted that the data glove-based system had certain limitations as it did
not consider temporal information in the hand gestures. It is worth mentioning that the
authors believe that incorporating deep learning algorithms with temporal features analysis
could potentially yield more effective results.
Unlike static gesture recognition, dynamic gesture recognition requires considering the
spatial information of hand movements and their temporal variations. With the rapid
advancement of deep learning techniques, researchers have extensively investigated structures
such as Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) and
applied them to real-time dynamic gesture recognition problems. Nguyen et al. [20] presented
a novel approach for continuous dynamic gesture recognition using RGB video input. Their
method comprises two main components: a gesture localization module and a gesture classifi-
cation module. The former aims to separate gestures using a BiLSTM network to segment con-
tinuous gesture sequences. The latter aims to classify gestures and efficiently combine data
from multiple channels, including RGB, optical flow, and 3D key pose positions, using two 3D
CNNs and a Long Short-Term Memory (LSTM). The method was evaluated on three publicly
available datasets, achieving an average Jaccard index of 0.5535. Furthermore, Paweł et al. [21]
developed a system capable of rapidly and effectively recognizing hand gestures in hand-body
language using a dedicated glove with ten sensors. Their experiments defined 22 hand-body
language gestures and recorded 2200 gesture data samples (10 participants, each gesture action
repeated 10 times). Three machine learning classifiers were employed for training and testing,
resulting in a high sensitivity rate of 98.32%. The pioneering work of Emmanuel et al. [22]
introduced the use of CNN for grasp classification using piezoelectric data gloves. Experimen-
tal data were collected from five participants, each performing 30 object grasps following
Schlesinger’s classification method. The results demonstrated that the CNN architecture
achieved the highest classification accuracy (88.27%). It is worth mentioning that the authors
plan to leverage the strengths of both CNN and RNN in future work to improve gesture pre-
diction accuracy. Lee et al. [23] developed a real-time dynamic gesture recognition data glove.
They employed neural network structures such as LSTM, fully connected layers, and novel ges-
ture localization and recognition algorithms. This allowed the successful classification of 11
dynamic finger gestures with a gesture recognition time of less than 12 ms. Yuan et al. [24]
designed a data glove equipped with 3D flexible sensors and two wristbands and proposed a
novel deep feature fusion network to capture fine-grained gesture information. They first
fused multi-sensor data using a CNN structure with residual connections and then modeled
long-range dependencies of complex gestures using LSTM. Experimental results demonstrated
the effectiveness of this approach in classifying complex hand movements, achieving a maxi-
mum precision of 99.3% on the American Sign Language dataset. Wang et al. [25] combined
attention mechanism with BiLSTM and designed a deep learning algorithm capable of effec-
tively recognizing 10 types of dynamic gestures. Their proposed method achieved an accuracy
of 98.3% on the test dataset, showing a 14.5% improvement compared to a standalone LSTM
model. This indicates that incorporating attention mechanism can effectively enhance the
model’s understanding of gesture semantics. Dong et al. [12] introduced a novel dynamic ges-
ture recognition algorithm called DGDL-GR. Built upon deep learning, this algorithm com-
bined CNN and temporal convolutional networks (TCN) to simultaneously extract temporal
and spatial features of hand movements. They defined 10 gestures according to relevant stan-
dards and recruited 20 participants for testing. The experimental results demonstrated that
DGDL-GR achieved the highest recognition accuracy (0.9869), surpassing state-of-the-art
algorithms such as CNN and LSTM. Hu et al. [26] explored deep learning-based gesture recog-
nition using surface electromyography (sEMG) signals and proposed a hybrid CNN and RNN
structure with attention mechanism. In this framework, CNN was employed for feature extrac-
tion from sEMG signals, while RNN was utilized for modeling the temporal sequence of the
signals. Experimental results on multiple publicly available datasets revealed that the perfor-
mance of the hybrid CNN-RNN structure was superior to individual CNN and RNN modules.
Despite the existence of a large body of research on HGR, research on dynamic gesture rec-
ognition using data gloves is still limited, especially in exploring the feasibility of applying deep
learning in this field. Therefore, this study focused on the intelligent recognition of handwash-
ing steps in the context of medical simulation. We utilized data gloves as the medium for
dynamic gesture data collection and selected the seven-step handwashing series of dynamic
gestures as the research target. Specifically, we considered the characteristics of dynamic ges-
tures, including local feature variations in spatial positions and temporal changes in sequences.
We systematically combined structures such as CNN, BiLSTM, and attention mechanism and
designed a deep learning algorithm for dynamic gesture recognition based on data gloves. The
next section will provide a detailed introduction to the proposed algorithm framework.
3. Methodology
3.1. Convolutional neural network (CNN)
A classic CNN architecture was designed by LeCun et al. in 1998 [27], which achieved
remarkable performance in handwritten digit recognition tasks. Compared to traditional
neural network structures, CNN exhibits characteristics of local connectivity and weight
sharing [28]. Consequently, CNN can improve the learning efficiency of neural networks
and effectively avoid overfitting issues caused by excessive parameters. The classic CNN
architecture consists of three components: the convolutional layer, the pooling layer, and
the fully connected layer.
The convolutional layer’s core component is the convolutional kernel (or weight matrix).
Each convolutional kernel multiplies and sums the corresponding receptive field elements in
the input data. This operation is repeated by sliding the kernel with a certain stride on the
input data until the entire data has been processed for feature extraction. Finally, these feature
maps are typically generated as the output of the convolutional layer through a non-linear acti-
vation function. It is worth mentioning that multiple convolutional kernels are usually chosen
to extract more diverse features since each kernel extracts different feature information. ReLU
[29] is the most popular activation function in CNNs; it retains the portions of the input features that are greater than 0 and sets the remaining portions to 0.
The pooling layer, also known as the down-sampling layer, compresses the features of
the input data using pooling kernels. Similar to the convolutional kernels, each pooling kernel
slides over the input data with a certain stride, preserving either the maximum value or the
average value of the elements within the corresponding receptive field. This process continues
until the feature extraction of the entire data is completed. The pooling layer is typically placed
after the convolutional layers to reduce the dimensionality of the feature maps, thereby reduc-
ing the computational complexity of the entire network.
In classification tasks, the input data undergoes feature extraction by passing through mul-
tiple convolutional and pooling layers, and the resulting feature maps are flattened and fed
into the fully connected layer. The fully connected layer usually consists of a few hidden layers
and a softmax classifier, which further processes the extracted features and outputs a probability distribution over the classes.
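For illustration, the following minimal PyTorch sketch (our own toy example, not the network used later in this paper; the 28×28 input size and channel counts are arbitrary) stacks the three components described above: a convolutional layer with ReLU activation, a max-pooling layer, and a fully connected softmax classifier.

```python
import torch
import torch.nn as nn

class MinimalCNN(nn.Module):
    """Toy CNN: convolution -> ReLU -> max pooling -> fully connected softmax classifier."""
    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)  # local connectivity, shared weights
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2)                           # down-sample the feature maps
        self.fc = nn.Linear(16 * 14 * 14, num_classes)                    # assumes 28x28 inputs

    def forward(self, x):
        x = self.pool(self.act(self.conv(x)))
        x = torch.flatten(x, start_dim=1)          # flatten feature maps before the FC layer
        logits = self.fc(x)
        return torch.softmax(logits, dim=1)        # probability distribution over the classes

probs = MinimalCNN()(torch.randn(4, 1, 28, 28))    # batch of 4 single-channel 28x28 inputs
```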
information from the current cell state ct that should be transmitted to the current hidden
state ht. Fig 1 illustrates the internal structure of a LSTM unit.
The LSTM unit has three inputs at time t: the current input xt, the previous hidden state ht−1, and the previous cell state ct−1. After being regulated by the gate functions, two outputs are obtained: the current hidden state ht and the current cell state ct. Specifically, the output of ft is obtained by linearly transforming the current input xt and the previous hidden state ht−1, followed by the application of the sigmoid activation function. This process can be expressed by Formula 1.
$f_t = \sigma(w_f \ast [h_{t-1}, x_t] + b_f) \quad (1)$
Here, the weight matrix and bias vector of ft are represented by wf and bf, respectively. The
sigmoid activation function, denoted by σ, is applied. The value of ft ranges from 0 to 1, where
a value closer to 0 indicates that information will be discarded, and a value closer to 1 implies
more information will be preserved. The computation process of the input gate it is similar to
that of ft, and the specific formula is as follows.
$i_t = \sigma(w_i \ast [h_{t-1}, x_t] + b_i) \quad (2)$
Here, the weight matrix and bias vector of it are represented by wi and bi, respectively. The
sigmoid activation function, denoted by σ, is applied. Subsequently, ĉt is introduced to represent the current input state. ĉt is obtained by linearly transforming and applying the tanh activation function to the concatenation of the current input xt and the previous hidden state ht−1. In other words, the state information contained in xt and ht−1 is integrated to form a new state representation. ĉt and the previous cell state ct−1 are then used to calculate and update the current cell state ct. The specific formulas for the aforementioned process are as follows.

$\hat{c}_t = \tau(w_c \ast [h_{t-1}, x_t] + b_c) \quad (3)$

$c_t = f_t \ast c_{t-1} + i_t \ast \hat{c}_t \quad (4)$

Where the weight matrix and bias vector of ĉt are denoted by wc and bc, respectively. The symbol τ represents the tanh activation function. The calculation process of ot is similar to that of ft and it. Additionally, the current hidden state ht is determined by ct and ot, and their computation formulas are as follows.

$o_t = \sigma(w_o \ast [h_{t-1}, x_t] + b_o) \quad (5)$

$h_t = o_t \ast \tau(c_t) \quad (6)$
Where the weight matrix and bias vector of ot are denoted by wo and bo, respectively. The sym-
bol σ represents the sigmoid activation function, and τ represents the tanh activation function.
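A minimal NumPy sketch of a single LSTM time step is given below, written only to make Formulas (1)–(6) concrete; the weight shapes and random initialization are placeholder assumptions, and Formulas (4) and (5) are the standard cell-state and output-gate updates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w_f, b_f, w_i, b_i, w_c, b_c, w_o, b_o):
    """One LSTM time step following Formulas (1)-(6); each gate acts on [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])        # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(w_f @ z + b_f)             # forget gate, Formula (1)
    i_t = sigmoid(w_i @ z + b_i)             # input gate, Formula (2)
    c_hat = np.tanh(w_c @ z + b_c)           # candidate input state, Formula (3)
    c_t = f_t * c_prev + i_t * c_hat         # cell state update, Formula (4)
    o_t = sigmoid(w_o @ z + b_o)             # output gate, Formula (5)
    h_t = o_t * np.tanh(c_t)                 # hidden state, Formula (6)
    return h_t, c_t

# toy dimensions: input size 4, hidden size 3; weights are random placeholders
rng = np.random.default_rng(0)
W = lambda: rng.standard_normal((3, 7))      # 7 = hidden (3) + input (4)
b = lambda: np.zeros(3)
h, c = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3),
                 W(), b(), W(), b(), W(), b(), W(), b())
```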
LSTM addresses the issue of vanishing gradients during training by incorporating a series
of gate mechanisms. However, as LSTM only propagates information in one direction, it can
only learn forward features and not capture backward features. To overcome this limitation,
Graves et al. introduced BiLSTM based on LSTM [33]. BiLSTM effectively combines a pair of
forward and backward LSTM sequences, inheriting the advantages of LSTM while addressing
the unidirectional learning problem. This integration allows BiLSTM to effectively capture
contextual information in sequential data. From a temporal perspective, BiLSTM analyzes
both the "past-to-future" and "future-to-past" directions of data flow, enabling better explora-
tion of temporal features in the data and improving the utilization efficiency of the data and
the predictive accuracy of the model.
Fig 2 depicts the unfolding of the BiLSTM network along the time axis, including t−1, t,
and t+1. In the diagram, x represents the input to the model, h represents the hidden layer
states, and o represents the output. BiLSTM can handle both forward and backward temporal
information, thus having a hidden layer for each direction without mutual influence. The
sequential data are separately fed into the forward LSTM layer and the backward LSTM layer,
resulting in the computation of the hidden state $\overrightarrow{h_t}$ in the forward LSTM layer and the hidden state $\overleftarrow{h_t}$ in the backward LSTM layer. It is important to note that the forward LSTM layer performs forward computation (from time 1 to time t), while the backward LSTM layer performs backward computation (from time t to time 1). The forward hidden states $\overrightarrow{h_t}$ and backward hidden states $\overleftarrow{h_t}$ for each time step are computed and stored accordingly. These two sets of hidden state information are further processed to generate the output of the BiLSTM. The process described above can be expressed using the following formulas.
$\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}) \quad (7)$

$\overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}}) \quad (8)$

$o_t = W_{\overrightarrow{h}} \ast \overrightarrow{h_t} + W_{\overleftarrow{h}} \ast \overleftarrow{h_t} \quad (9)$

Here, ot represents the output of the BiLSTM, and LSTM denotes the LSTM unit. $\overrightarrow{h_t}$ represents the output of the forward LSTM layer, and $\overleftarrow{h_t}$ represents the output of the backward LSTM layer. $W_{\overrightarrow{h}}$ and $W_{\overleftarrow{h}}$ are the weight matrices for $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, respectively.
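In practice, the forward and backward passes of Formulas (7)–(9) do not need to be implemented by hand. The sketch below uses PyTorch's built-in bidirectional LSTM and a linear layer as the weighted combination of the two hidden-state streams; the layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class BiLSTMBlock(nn.Module):
    """Bidirectional LSTM: forward and backward hidden states are concatenated,
    then linearly combined, playing the role of Formula (9)."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.bilstm = nn.LSTM(input_size, hidden_size,
                              batch_first=True, bidirectional=True)
        self.combine = nn.Linear(2 * hidden_size, output_size)  # weights for both directions

    def forward(self, x):                  # x: (batch, time, input_size)
        h, _ = self.bilstm(x)              # h: (batch, time, 2 * hidden_size)
        return self.combine(h)             # o_t at every time step

out = BiLSTMBlock(input_size=8, hidden_size=16, output_size=16)(torch.randn(2, 20, 8))
```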
$s_t = \tau(w_h \ast h_t + b_h) \quad (10)$

Where st represents the attention score and ht represents the input vector. wh and bh represent the weight matrix and bias vector of the attention mechanism, respectively, and τ represents the tanh activation function. Next, st is normalized and the attention weights at can be obtained using the softmax function, which can be expressed as follows:

$a_t = \dfrac{\exp(s_t)}{\sum_{t} \exp(s_t)} \quad (11)$

Finally, the weighted sum of at and ht is computed to obtain the final output enhanced by the attention mechanism.

$s = \sum_{t} a_t h_t \quad (12)$
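The attention computation of Formulas (10)–(12) can be written compactly as a small module. The following sketch is an illustrative implementation under the assumption that the scores are produced by a single linear layer followed by tanh, as in Formula (10).

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention pooling over time steps, following Formulas (10)-(12)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.w_h = nn.Linear(hidden_size, 1)   # weight matrix w_h and bias b_h

    def forward(self, h):                      # h: (batch, time, hidden_size)
        s = torch.tanh(self.w_h(h))            # attention scores s_t, Formula (10)
        a = torch.softmax(s, dim=1)            # attention weights a_t, Formula (11)
        return (a * h).sum(dim=1)              # weighted sum over time, Formula (12)

context = AttentionPooling(hidden_size=16)(torch.randn(2, 20, 16))  # -> (2, 16)
```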
recognition. Therefore, in this paper, we proposed to combine CNN, BiLSTM, and the atten-
tion mechanism, presenting a novel framework for dynamic gesture recognition called Atten-
tion-based CNN-BiLSTM Network (A-CBLN). A-CBLN effectively integrates the advantages
of different types of neural networks, thereby improving the predictive accuracy of dynamic
gesture recognition. Fig 3 illustrates the pipeline of dynamic gesture recognition based on
A-CBLN.
Specifically, as shown in Fig 3, the A-CBLN consists of five main components. The input
layer transforms the data collected by the data glove into the model’s input format: T×L×1,
where T represents the number of samples of gesture data collected within a specified time
range, L represents the feature dimension of the gesture data returned by the data glove, and 1
represents the number of channels. The CNN layer performs feature extraction and
dimensionality reduction using two convolutional operations and one max-pooling operation.
It is worth noting that we did not employ 1D convolution or a standard square 2D kernel for feature extraction; instead, we utilized a 2D convolution with a kernel size of 1×3, which extracts spatial features from the gesture data at each time step without mixing information across the temporal dimension. The BiLSTM layer provides additional modeling of the long-term dependencies of
gesture features. Both the CNN layer and the BiLSTM layer use the ReLU activation function.
The AM layer helps the network better understand the specific meaning of gesture features.
The FC layer utilizes fully connected layers to flatten the features and further reduce the
dimensionality. Finally, it outputs the probability prediction of the current dynamic gesture
through the softmax function. Table 1 presents the specific parameter settings for each net-
work layer in A-CBLN.
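To make the pipeline concrete, the following schematic PyTorch sketch assembles the five components in the order described above (two 1×3 convolutions, one max-pooling operation, BiLSTM, attention, and fully connected output). The channel counts, hidden sizes, pooling size, and the example values of T and L are placeholders rather than the exact settings listed in Table 1.

```python
import torch
import torch.nn as nn

class ACBLN(nn.Module):
    """Schematic A-CBLN: CNN with 1x3 kernels -> BiLSTM -> attention -> FC.
    Channel counts, hidden sizes and pooling size are placeholders, not Table 1 values."""
    def __init__(self, seq_len_T, feat_dim_L, num_classes=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),  # spatial features only
            nn.Conv2d(16, 32, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),                                 # pool along the feature axis
        )
        self.bilstm = nn.LSTM(32 * (feat_dim_L // 2), 64,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(128, 1)                                         # attention scoring
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):                       # x: (batch, T, L, 1)
        b, T, L, _ = x.shape
        f = self.cnn(x.permute(0, 3, 1, 2))     # -> (batch, 32, T, L // 2)
        f = f.permute(0, 2, 1, 3).reshape(b, T, -1)   # one feature vector per time step
        h, _ = self.bilstm(f)                   # temporal modelling
        a = torch.softmax(torch.tanh(self.attn(h)), dim=1)
        s = (a * h).sum(dim=1)                  # attention-enhanced gesture feature
        return self.fc(s)                       # class logits; softmax applied at loss/inference time

logits = ACBLN(seq_len_T=50, feat_dim_L=22)(torch.randn(4, 50, 22, 1))
```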
We summarized the training and validation process of the A-CBLN algorithm in Algorithm
1. The two inputs of the algorithm include gesture data X and gesture label y, and the output is
the trained model weight w*. The predefined main parameters include epoch, batch size, and
best validation accuracy. In line 1, the gesture dataset is divided into a training dataset and a validation dataset according to a certain proportion. In line 2, the A-CBLN model randomly initializes its parameters according to the default method. In line 3, the model begins the training and validation process. For each epoch, we first feed the training dataset into the model in batches (the training data is Xtrain and the training labels are ytrain). Next, we follow the feature extraction process of A-CBLN to extract features from Xtrain and finally generate the corresponding prediction probabilities ŷtrain (lines 6 to 9). In line 10, the cross-entropy loss function is used to calculate the loss between ytrain and ŷtrain, and the Adam optimizer is used to update the parameters. After the training process is completed, we start the validation process, calculating the accuracy of the current model on the validation dataset and recording it as Vacc (line 11). We then compare the current Vacc with the best validation accuracy: if Vacc is larger, the currently trained model parameters w* are saved and the best validation accuracy is updated to Vacc (lines 12–14). After all training epochs are completed, the entire training and validation process of A-CBLN ends (line 15).
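A condensed sketch of the training and validation loop in Algorithm 1 is shown below; it assumes PyTorch tensors with integer class labels, and the epoch count, batch size, and learning rate are placeholder values.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_acbln(model, X_train, y_train, X_val, y_val, epochs=50, batch_size=32, lr=1e-3):
    """Training/validation loop mirroring Algorithm 1: cross-entropy loss, Adam optimizer,
    and retention of the weights with the best validation accuracy."""
    loader = DataLoader(TensorDataset(X_train, y_train), batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_acc, best_state = 0.0, None

    for epoch in range(epochs):
        model.train()
        for xb, yb in loader:                       # feed the training data in mini-batches
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)         # loss between y_train and predictions
            loss.backward()
            optimizer.step()                        # Adam parameter update

        model.eval()
        with torch.no_grad():                       # validation accuracy V_acc
            v_acc = (model(X_val).argmax(dim=1) == y_val).float().mean().item()
        if v_acc > best_acc:                        # keep the best model weights w*
            best_acc = v_acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_state, best_acc
```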
4. Experiments
4.1. Data glove
The wearable gesture data acquisition device used in this study is the VRTRIX™ Data Glove (http://www.vrtrix.com.cn/). The core component of this glove is a
9-axis MEMS (Micro Electro Mechanical System) inertial sensor, which can capture real-time
motion data related to finger joints and enable the reproduction of the hand postures assumed
by the operator during motion execution. The transmission of data from the glove employs
wireless transmission technology, where the data captured by the sensors on both hands can
be wirelessly transmitted to a personal computer (PC) through the wireless transmission mod-
ule on the back of the hand for real-time rendering. In addition, the VRTRIX™ Data Glove
provides a low-level Python API interface, allowing users to access joint pose data of the data
glove, facilitating secondary development. It has been widely used in fields such as industrial
simulation, mechanical control, and scientific research data acquisition.
Once the data glove is properly worn, the left hand has a total of 11 inertial sensors for cap-
turing finger gestures. Specifically, each finger is assigned 2 sensors, while 1 sensor is allocated
to the back of the hand. The number and distribution of sensors on the right hand are identical
to those on the left hand. Table 2 presents the key parameters of the data glove
used in this study.
$f_{norm} = \dfrac{f - f_{min}}{f_{max} - f_{min}} \quad (13)$
Here, f represents the input data, fnorm refers to the normalized data, fmin and fmax represent
the minimum and maximum values of the input data, respectively.
To evaluate the performance of the proposed gesture recognition model, we divided the data into training, validation, and test datasets in a ratio of 8:1:1. Accordingly, the data from 26 subjects were used for training, while the data from the remaining 6 subjects were evenly split between the validation and test datasets.
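The sketch below illustrates this preprocessing: min-max normalization following Formula (13) and a subject-wise 26/3/3 split. The per-column normalization, the small epsilon, and the subject ordering are our own assumptions rather than details stated above.

```python
import numpy as np

def min_max_normalize(f):
    """Formula (13): scale each feature column of f to [0, 1]."""
    f_min, f_max = f.min(axis=0), f.max(axis=0)
    return (f - f_min) / (f_max - f_min + 1e-8)   # epsilon guards against constant columns (not in Formula 13)

def split_by_subject(subject_ids, n_train=26, n_val=3):
    """Subject-wise split: 26 training subjects, 3 validation, remaining 3 for testing."""
    subjects = np.unique(subject_ids)
    train_s = subjects[:n_train]
    val_s = subjects[n_train:n_train + n_val]
    test_s = subjects[n_train + n_val:]
    return (np.isin(subject_ids, train_s),
            np.isin(subject_ids, val_s),
            np.isin(subject_ids, test_s))
```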
$Acc = \dfrac{TP + TN}{TP + TN + FP + FN} \quad (14)$

$Pr = \dfrac{TP}{TP + FP} \quad (15)$

$Re = \dfrac{TP}{TP + FN} \quad (16)$

$F1 = \dfrac{2 \times Pr \times Re}{Pr + Re} \quad (17)$
Here, TP represents the number of true positive samples, which are the samples that are
correctly predicted as positive. TN represents the number of true negative samples, which are
the samples that are correctly predicted as negative. FP represents the number of false positive
samples, which are the samples that are actually negative but predicted as positive. FN repre-
sents the number of false negative samples, which are the samples that are actually positive but
predicted as negative.
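For completeness, these metrics can be computed directly from the confusion-matrix counts as sketched below; macro-averaging the per-class precision, recall, and F1-score over the seven gesture classes is our own assumption, since the averaging scheme is not stated explicitly.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes=7):
    """Accuracy plus macro-averaged precision, recall and F1-score built from TP/FP/FN counts,
    following Formulas (15)-(17)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)                          # overall accuracy
    pr, re, f1 = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0               # precision, Formula (15)
        r = tp / (tp + fn) if tp + fn else 0.0               # recall, Formula (16)
        pr.append(p)
        re.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)     # F1-score, Formula (17)
    return acc, np.mean(pr), np.mean(re), np.mean(f1)
```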
highest validation accuracy of 88.95% at 50 epochs. This may be due to the fact that the pure
LSTM structure fails to focus on the local features within the dynamic handwashing steps. For
instance, actions like rubbing or rotating are of utmost importance in understanding the
semantic meaning conveyed by the gestures. In contrast, the best validation accuracy of the
Attention-BiLSTM structure is improved, peaking at 45 epochs (92.77%). Nevertheless, its training process remains unstable, a limitation also attributed to the structure's limited ability to capture local features. By combining CNN and LSTM, the model can not only perceive the local features of dynamic gestures in spatial changes but also capture their temporal variations. As a result, the recognition ability is significantly improved; the CNN-LSTM model achieves an accuracy of 93.71% at 48 epochs. Finally, our proposed A-CBLN incorporates the attention mechanism, further enhancing the model's understanding of different gesture semantics. Consequently, it exhibits the strongest learning capability during the training process. Its validation accuracy stabilizes after 18 epochs and consistently outperforms the other models, peaking at 32 epochs (93.62%).
For each model, the weights with the best performance on the validation dataset were preserved for further analysis. As shown in Table 4, all models perform well
on the test dataset, with prediction accuracy exceeding 87%. Further observation reveals that
the pure LSTM and Attention-BiLSTM models have relatively lower prediction accuracy
(87.43% and 91.43% respectively), while the hybrid CNN-LSTM structure significantly
improves the prediction accuracy to 93.38%. This is consistent with our previous analysis, indi-
cating that the hybrid CNN-LSTM structure possesses stronger feature extraction capability
for dynamic gesture data. Finally, our proposed A-CBLN model demonstrates the best predic-
tive performance for dynamic gestures, achieving optimal values in all evaluation metrics, with
an accuracy of 95.05%, precision of 95.43%, recall of 95.25%, and F1-score of 95.22%. Com-
pared to the pure LSTM structure, it improves by 7.62%, 5.84%, 7.32%, and 7.78% in accuracy,
precision, recall, and F1-score, respectively.
actions such as "finger bending" and "rotational friction," leading to misjudgment by the
model. Overall, A-CBLN demonstrates good overall recognition performance for the seven
dynamic gestures, with an average accuracy exceeding 95%.
6. Conclusion
This paper aims to investigate the problem of dynamic gesture recognition based on data
gloves. Based on deep learning techniques, we proposed a dynamic gesture recognition algo-
rithm called A-CBLN, which combines structures such as CNN, BiLSTM, and attention mech-
anism to capture the spatiotemporal features of dynamic gestures to the maximum extent. We
selected the commonly used seven-step handwashing method in the medical simulation
domain as the research subject and validated the performance of the proposed model in recog-
nizing the seven dynamic gestures. The experimental results demonstrated that our proposed
approach effectively addresses the task of dynamic gesture recognition and achieved superior
prediction results compared to similar models, with an accuracy of 95.05%, precision of
95.43%, recall of 95.25%, and F1-score of 95.22% on the test dataset. In the future, we plan to
further improve our approach in the following aspects: (1) design more efficient feature extrac-
tion modules to enhance the discriminability of gestures with similar action sequences; (2)
recruit more subjects to increase the dataset size and improve the model’s generalization abil-
ity; (3) explore the fusion of multimodal data captured by infrared cameras to enhance the rec-
ognition performance of the model.
Author Contributions
Conceptualization: Jiawei Wu, Peng Ren, Xiao Zhang.
Data curation: Jiawei Wu, Chen Zhao.
Formal analysis: Boming Song, Ran Zhang.
Methodology: Jiawei Wu, Peng Ren, Boming Song, Ran Zhang.
Resources: Chen Zhao.
Writing – original draft: Jiawei Wu.
Writing – review & editing: Peng Ren, Xiao Zhang.
References
1. Shah A, Ali B, Habib M, Frnda J, Ullah I, Shahid Anwar M. An ensemble face recognition mechanism
based on three-way decisions. Journal of King Saud University—Computer and Information Sciences.
2023 Apr; 35(4):196–208. https://doi.org/10.1016/j.jksuci.2023.03.016
2. Guo L, Lu Z, Yao L. Human-Machine Interaction Sensing Technology Based on Hand Gesture Recogni-
tion: A Review. IEEE Trans Human-Mach Syst. 2021 Aug; 51(4):300–9. https://doi.org/10.1109/THMS.
2021.3086003
3. Li Y, Huang J, Tian F, Wang H, Dai G. Gesture interaction in virtual reality. Virtual Reality & Intelligent Hardware. 2019 Feb; 1(1):84–112. https://doi.org/10.3724/SP.J.2096-5796.2018.0006
4. Liu H, Wang L. Gesture recognition for human-robot collaboration: A review. International Journal of
Industrial Ergonomics. 2018 Nov; 68:355–67. https://doi.org/10.1016/j.ergon.2017.02.004
5. Peng Y, Wang J, Pang K, Liu W, Meng J, Li B. A Physiology-Based Flexible Strap Sensor for Gesture
Recognition by Sensing Tendon Deformation. IEEE Sensors J. 2021 Apr 1; 21(7):9449–56. https://doi.
org/10.1109/JSEN.2021.3054562
6. Song X, Peng Y, Hu B, Liu W. Characterization of the fine hand movement in badminton by a smart
glove. Instrumentation Science & Technology. 2020 Jul 3; 48(4):443–58. https://doi.org/10.1080/
10739149.2020.1737814
28. Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: Introduction, application,
and perspective in the big data era. Methods. 2019 Aug; 166:4–21. https://doi.org/10.1016/j.ymeth.
2019.04.008 PMID: 31022451
29. Nair V, Hinton G E. Rectified linear units improve restricted boltzmann machines. Proceedings of the
27th international conference on machine learning (ICML-10). 2010: 807–814.
30. Connor J, Martin R, Atlas L. Recurrent neural networks and robust time series prediction. IEEE Trans
Neural Netw. 1994 Mar; 5(2):240–54. https://doi.org/10.1109/72.279188 PMID: 18267794
31. Shi H, Miao K, Ren X. Short-term load forecasting based on CNN-BiLSTM with Bayesian optimization
and attention mechanism. Concurrency and Computation. 2023 Aug; 35(17). https://doi.org/10.1002/
cpe.6676
32. Greff K, Srivastava RK, Koutnik J, Steunebrink BR, Schmidhuber J. LSTM: A Search Space Odyssey.
IEEE Trans Neural Netw Learning Syst. 2017 Oct; 28(10):2222–32. https://doi.org/10.1109/TNNLS.
2016.2582924 PMID: 27411231
33. Graves A, Fernández S, Schmidhuber J. Bidirectional LSTM networks for improved phoneme classifi-
cation and recognition. International conference on artificial neural networks. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2005: 799–804. https://doi.org/10.1007/11550907_126
34. Luong M T, Pham H, Manning C D. Effective Approaches to Attention-based Neural Machine Transla-
tion. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
2015: 1412–1421. https://doi.org/10.18653/v1/D15-1166
35. Liu J, Wei B, Cai M, Xu Y. Dynamic gesture recognition based on CNN-LSTM-Attention. 2021 IEEE
International Conference on Signal Processing, Communications and Computing (ICSPCC). IEEE,
2021: 1–6. https://doi.org/10.1109/ICSPCC52875.2021.9565034
36. Pittet D, Allegranzi B, Boyce J. The World Health Organization Guidelines on Hand Hygiene in Health
Care and Their Consensus Recommendations. Infect Control Hosp Epidemiol. 2009 Jul; 30(7):611–22.
https://doi.org/10.1086/600379 PMID: 19508124
37. Toro-Ossaba A, Jaramillo-Tigreros J, Tejada JC, Peña A, López-González A, Castanho RA. LSTM
Recurrent Neural Network for Hand Gesture Recognition Using EMG Signals. Applied Sciences. 2022
Sep 27; 12(19):9700. https://doi.org/10.3390/app12199700
38. Skaria S, Huang D, Al-Hourani A, Evans R J, Lech M. Deep-learning for hand-gesture recognition with
simultaneous thermal and radar sensors. 2020 IEEE SENSORS. IEEE, 2020: 1–4. https://doi.org/10.
1109/SENSORS47125.2020.9278683