Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 256 (2025) 198–205
CENTERIS – International Conference on ENTERprise Information Systems / ProjMAN –
International Conference on Project MANagement / HCist – International Conference on
Health and Social Care Information Systems and Technologies 2024
Hand Gesture Recognition using Machine Learning
Caminate Na Rang a, Paulo Jerónimo a, Carlos Mora b, Sandra Jardim b,*
a Polytechnic Institute of Tomar, Quinta do Contador, 2300-313 Tomar, Portugal
b Smart Cities Research Center, Polytechnic Institute of Tomar, Quinta do Contador, 2300-313 Tomar, Portugal
Abstract
Sign language recognition is a growing area of research, with applications ranging from gestural communication to controlling
devices using gestures. One of the challenges inherent to sign language recognition is the ability to translate gestures into
meaningful information, such as letters, words or even sentences. Machine Learning, which has emerged as a powerful tool for
solving a wide variety of complex problems, namely in the field of computer vision, plays a key role, enabling computers to
understand and interpret complex gestures. In this paper, we present a Machine Learning model focused on classifying hand
gestures that represent the letters of the Latin alphabet. The objective of this work is to create a solution capable of accurately
identifying which letter of the Latin alphabet is being represented by a hand gesture in an image. Hand gestures are classified
using the Random Forest Machine Learning classification model, which is fed with the vector of features extracted from the
region of interest in the image. To implement the proposed approach, a database of RGB images of hand gestures was created, and
the characteristics of the gestures were extracted using the MediaPipe open-source framework. The solution presents hand gesture
classification precisions by class ranging between 74.4% and 98.8%, with an accuracy of 92.3%, which represents an improvement
over previous approaches.
© 2025 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the CENTERIS - International Conference on ENTERprise Information
Systems / ProjMAN - International Conference on Project MANagement / HCist - International Conference on Health and Social Care
Information Systems and Technologies
Keywords: Hand gesture recognition; image classification; machine learning; random forest
* Corresponding author. Tel.: +351 249 328 100; fax: +351 249 328 186.
E-mail address: [email protected]
DOI: 10.1016/j.procs.2025.02.112
1. Introduction
In the domain of sign language, a sign encompasses three primary components: manual features, which involve
gestures performed with the hands to convey meaning through hand shape and motion; non-manual features,
including facial expressions or body posture, which can either be integrated into a sign or alter its meaning; and
finger spelling, a method of gesturally spelling out words in the spoken language of the local community [1]. Sign
language is as intricate as any spoken language, with several practical applications, from helping disabled people to
communicate to enabling humans to interact with devices in an intuitive and natural way. Each sign language
comprises numerous signs, each distinguished from the next by subtle variations in hand shape, movement,
positioning, non-manual features, or contextual usage. Frequently, these elements involve swift and intricate
movements, posing considerable challenges in the task of recognizing sign language.
As fundamental components of sign language, hand gestures encompass a set of factors such as hand movement,
shape, orientation, alignment, and positioning of the fingers in relation to the hands and body [2]. This critical
component of sign language interpretation presents its own set of challenges, as it is marked by notable similarities
between gestures across different categories, considerable variations within the same category, and frequent
obstructions in hand shapes [1]. The challenge is to translate complex and variable gestures into comprehensive
legible information, as alphabetic letters, words, or even complete sentences. Using Machine Learning (ML)
techniques, it is possible to develop computer models capable of perceiving these differences and accurately
identifying and classifying different hand gestures [3].
Machine Learning, a branch of Artificial Intelligence (AI) and computer science focusing on using data and
algorithms to enable computational applications/systems to imitate the way that humans learn, gradually improving
their accuracy, has emerged as a revolutionary discipline, transforming the way we approach complex problems in
several areas. In this context, sign language recognition is gaining prominence as a fascinating and increasingly
relevant application [4]. Driven by access to large volumes of data and technological evolution, particularly
regarding computational processing capacity, the ML area has evolved at an impressive pace.
As a subfield of ML, Deep Learning (DL) distinguishes itself by the type of data it uses, as well as by the
methods it employs in the learning process. ML algorithms rely on structured and
labeled data to generate predictions, meaning that specific attributes are identified in the input data and organized
into structured formats. In the case of unstructured data, it is normally pre-processed with a view to transforming it
into a structured layout. On the other hand, DL streamlines the preprocessing steps inherent in traditional machine
learning approaches. These sophisticated algorithms can manipulate and analyze unstructured data, such as texts and
images, excelling in automatic feature extraction, thus reducing dependence on human intervention.
DL algorithms exhibit remarkable complexity, with several types of artificial neural networks tailored to tackle
specific challenges or datasets. Artificial neural networks strive to emulate the functioning of the human brain by
integrating data inputs, weights, and biases. Through this interconnected framework, they aim to effectively identify,
classify, and describe patterns present within the dataset. Among the different types of neural networks,
Convolutional Neural Networks (CNN) have proven to be particularly effective in computer vision tasks, such as
image classification [5].
Due to the ability of ML and DL models to recognize objects in images, several approaches have been proposed for the automatic
recognition of hand gestures. For the most part, existing methods are limited in terms of the
number of hand gestures they are capable of recognizing, and present accuracies and processing times that are still
incompatible with applicability in a real context.
This paper presents a proposed approach for the recognition of static hand gestures, where the classification of
gestures is achieved using the Random Forest ML model. The complete pipeline of a solution is presented, ranging
from capturing images of hand gestures, to recognizing the letter of the Latin alphabet they represent, through
extracting features from the collected images, and creating an annotated dataset, with which the aforementioned
model is trained and tested.
This paper is structured as follows: Section 2 presents a literature review of related work to highlight research
developed on sign language recognition. Section 3 illustrates the methodology behind our model, while Section
4 covers the results and evaluation of the proposed approach. Finally, Section 5 presents the conclusions of
the work developed, also indicating possible research directions to improve the results achieved.
2. Related work
Over the past few years, progress in AI-powered methodologies has brought substantial changes to the domain of
sign language hand gesture recognition. Researchers have used a blend of specialized hardware alongside ML and
DL frameworks to forge ahead with more sophisticated models. For example, Rautaray and Agrawal [6] designed a system for
gestural interaction between the user and a computer in a dynamic environment. The gesture recognition system
employs image processing techniques for detection, segmentation, tracking, and recognition of hand gestures to
convert them into meaningful commands. The proposed interface can be effectively applied to various applications
such as image browsers, games, etc.
Some years later, Haria et al. [7] proposed a markerless hand gesture recognition system for tracking both
static and dynamic hand gestures. In the authors' proposed system, while detected static gestures are translated into
actions, dynamic gestures are utilized for interactions. The results presented by the authors indicate that human-
computer interaction can be achieved with minimal hardware requirements.
Ceolini et al. [3] present a fully neuromorphic sensor fusion approach for hand gesture recognition, consisting of
an event-based vision sensor and three different neuromorphic processors. They utilize the event-based camera,
called DVS, along with two neuromorphic platforms, Loihi and ODIN + MorphIC. EMG signals are recorded using
traditional electrodes and then converted into spikes to be fed into chips. They collected a dataset of five sign
language gestures where visual and electromyographic signals are synchronized.
More recently, Chang et al. [5] developed a hand gesture recognition system aimed at enhancing authentic,
efficient, and effortless human-computer interactions without additional devices, particularly for the speech-impaired
community, which relies solely on hand gestures for communication. The algorithm of this system consists of two
phases. The first phase involves Region of Interest Segmentation based on the color space segmentation technique,
with a predefined color range that removes pixels (hand) from the background region of interest (pixels not in the
desired interest area). The second phase of the system involves inputting the segmented images into a Convolutional
Neural Network (CNN) model for image categorization. For image training, the Python Keras package was used.
The system addressed the need for image segmentation in hand gesture recognition.
Nogales et al. [8] proposed the evaluation of a model with both manual feature extraction and automatic feature
extraction. Manual feature extraction was conducted using statistical functions of central tendency, while automatic
extraction was performed through CNN and BiLSTM. These features were also assessed with classifiers such as
Softmax, ANN, and SVM.
Cruz et al. [9] introduced a Reinforcement Learning (RL) approach to classify EMG-IMU signals obtained using
a Myo Armband sensor. For this purpose, they developed an agent based on the Deep Q-learning (DQN) algorithm
to learn a policy from online experiences for classifying EMG-IMU signals. They then tested the HGR system to
control two different robotic platforms. The first is a three-degree-of-freedom (3DoF) tandem helicopter test bench,
and the second is a six-degree-of-freedom (6DoF) UR5 virtual robot. They employ a designed system for hand
gesture recognition (HGR) and the inertial measurement unit (IMU) integrated into the Myo sensor to command and
control the movement of both platforms. The motion of the helicopter test bench and the UR5 robot is controlled by
a PID controller scheme. Experimental results demonstrate the effectiveness of using the proposed HGR system
based on DQN to control both platforms with rapid and precise responses.
Harini et al. [10] proposed a methodology for hand gesture recognition using Self-Organizing Map (SOM) with
Deep Convolutional Neural Network (DCNN). The experiments were conducted on a dataset consisting of 30 static
gestures and 6 dynamic gestures, and evaluated on an IIITA-ROBITA ISL gesture database to demonstrate
effectiveness. The proposed algorithm was then implemented to control household appliances.
John et al. [11] proposed a DenseNet-based architecture called Multidilated Convolution DenseNet (MDCDN),
which combines multidilated convolution and DenseNet to automatically extract features. The benefits of high-level
deep learning techniques are leveraged for hand gesture recognition. Python is used for architecture evaluation. The
proposed outcome is estimated in terms of accuracy, recall, F-measure, precision, etc., using real datasets ASL, ISL,
Massey, and HSR. Each dataset contains a large number of gesture classes, and their images have an equal amount
of uniform and complex backgrounds.
3. Materials and Methods
This section outlines in detail the methodology followed to develop the hand gesture recognition model. It is
divided into three subsections, detailing the data collection process, image feature extraction, and the hand
gesture recognition model.
3.1. Data Collection
The data collection was carried out by capturing images with the webcam of a Dell Mobile Precision Workstation
7560, an HD RGB camera with a resolution of 0.92 megapixels and a diagonal viewing angle of 74.9 degrees. The
image capture process took place under controlled lighting and surrounding conditions to ensure data consistency
and quality. For each of the 26 letters of the Latin alphabet, 2,000 images of the corresponding hand gestures were
captured, totaling 52,000 images. In this way, the database created is made up of a total of 54,000 RGB images,
where 52,000 represent hand gestures corresponding to the 26 letters of the Latin alphabet and 2,000 correspond
to hand gestures not representing any of these letters. The images have a resolution of 224 x 224 pixels. To ensure
data comprehensiveness, the hand gestures were performed by 5 individuals.
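The paper does not include the capture script itself; the following is a minimal sketch of how such a collection step could be implemented with OpenCV, assuming one folder per class. The folder layout, file names, and class list are illustrative assumptions, not the authors' exact procedure.

```python
# Hedged sketch of the image-capture step (assumed implementation).
# Requires: pip install opencv-python
import os
import cv2

CLASSES = [chr(c) for c in range(ord('A'), ord('Z') + 1)] + ["nothing"]  # 27 classes
IMAGES_PER_CLASS = 2000      # number of images per class reported in the paper
IMG_SIZE = (224, 224)        # image resolution reported in the paper
DATA_DIR = "data"            # hypothetical output folder

cap = cv2.VideoCapture(0)    # default webcam
for label in CLASSES:
    class_dir = os.path.join(DATA_DIR, label)
    os.makedirs(class_dir, exist_ok=True)
    input(f"Show the gesture for '{label}' and press Enter to start capturing...")
    count = 0
    while count < IMAGES_PER_CLASS:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, IMG_SIZE)                     # normalize resolution
        cv2.imwrite(os.path.join(class_dir, f"{label}_{count:04d}.jpg"), frame)
        count += 1
cap.release()
```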
3.2. Dataset
The dataset was created using the MediaPipe open-source framework developed by Google [12], which
allows the processing, analysis, and extraction of information from different types of media, such as images, videos,
and audio. This framework provides a wide range of functionalities for object detection and tracking, facial
recognition, pose estimation, hand detection, among others. For this work, the Hands Landmarks functionality [13]
was used, with which it is possible to detect the keypoint localization of 21 hand-knuckle coordinates within the
detected hand regions (Fig. 1). This functionality is based on two models: a palm detection model, which locates the
hand in the image; and a hand landmark detection model, which identifies specific hand landmarks in the cropped
hand image defined by the palm detection model.
Fig. 1. The 21 hand reference points detectable with MediaPipe framework.
To build the dataset, the collected images were processed with the Hands Landmarks model from the MediaPipe
framework, and the coordinates of the hand landmarks in each image were extracted and structured into an
annotated dataset. Each image was classified according to the alphabet letter represented by the hand. Fig. 2
illustrates the hand gesture images and the corresponding reference points for the letters B, D, and I.
Fig. 2. Keypoints detection for hand gesture corresponding to (a) the letter B; (b) the letter D; (c) the letter I.
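This extraction step can be sketched as follows with MediaPipe's Hands solution. The file layout, the helper name extract_features, and the choice of using only the (x, y) coordinates of the 21 landmarks are assumptions made for illustration; they are not the authors' exact implementation.

```python
# Hedged sketch of the landmark-extraction step using MediaPipe Hands.
# Requires: pip install mediapipe opencv-python numpy
import os
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_features(image_path, hands):
    """Return a flat vector of the 21 (x, y) hand landmarks, or None if no hand is found."""
    image = cv2.imread(image_path)
    if image is None:
        return None
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))  # MediaPipe expects RGB
    if not results.multi_hand_landmarks:
        return None                          # no hand detected in this image
    landmarks = results.multi_hand_landmarks[0].landmark
    return np.array([[lm.x, lm.y] for lm in landmarks]).flatten()    # 42 values

features, labels = [], []
with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
    for label in sorted(os.listdir("data")):                         # one folder per class
        for name in os.listdir(os.path.join("data", label)):
            vec = extract_features(os.path.join("data", label, name), hands)
            if vec is not None:
                features.append(vec)
                labels.append(label)

# Persist the annotated dataset of landmark coordinates and class labels.
np.savez("hand_landmarks_dataset.npz", X=np.array(features), y=np.array(labels))
```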
3.3. Hand Gesture Recognition Model
To classify hand gestures, the Random Forest ML model [14] was used, an algorithm widely applied to solve
classification and regression problems. Random Forest belongs to the category of ensemble methods, where
different models are combined to obtain a single result. This characteristic makes the algorithms more robust and
complex, which leads to a higher computational cost, usually accompanied by better results. More specifically,
Random Forest is a combination of many decision trees based on the Bagging (Bootstrap Aggregating) method,
in which a set of base learners is generated, each trained on a bootstrap sample, and their outputs are combined to predict a result.
The main idea behind the Random Forest model is combining many decision trees into a big forest via the
Bagging method, making predictions based on the outputs of those trees.
The key points of Random Forest Model are the following:
• Bootstrap Sampling: During training, multiple training samples are created using the bootstrap sampling technique.
This involves randomly selecting instances from the training set with replacement. These samples are used to
train individual trees.
• Decision Tree Construction: A decision tree is constructed for each training sample. Often, the created decision
trees are limited in depth and have randomness in node splitting decisions.
• Majority Voting: When making a prediction, each tree in the ensemble provides a prediction, and the most
frequent class or value is chosen as the final prediction (for classification problems), or the average is calculated
(for regression problems).
• Variance Reduction: Random Forest helps reduce variance compared to a single decision tree, making it less
prone to overfitting. This is achieved by combining multiple independent trees.
In the context of this study, the Random Forest model is fed with the vector of features extracted from hand
gesture images along with the corresponding class, where each class corresponds to a letter of the Latin alphabet.
An additional class, "nothing", was also considered, which covers images of hand gestures for which no
corresponding letter can be identified.
To train the classification model, 67% of the collected images were used, corresponding to approximately 35,000
images. To ensure the minimization of bias and the maximization of the representativeness of the training dataset, a
random selection of images was carried out, balanced across the 27 existing classes.
The model was tested with the remaining images, which represent 33% of the total, corresponding to
approximately 17,000 images.
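A minimal training sketch with scikit-learn is shown below. Only the stratified 67%/33% split follows the paper; the hyperparameters (e.g., the number of trees), file names, and the use of joblib for persistence are assumptions.

```python
# Hedged training sketch using scikit-learn.
# Requires: pip install scikit-learn numpy joblib
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = np.load("hand_landmarks_dataset.npz")   # dataset built in the previous sketch
X, y = data["X"], data["y"]

# 67% training / 33% testing, balanced (stratified) across the 27 classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

# Random Forest: an ensemble of bagged decision trees with majority voting.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
joblib.dump(model, "hand_gesture_rf.joblib")   # reused later by the real-time application
```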
Fig. 3. Model architecture (a) training phase; (b) testing phase.
Caminate Na Rang et al. / Procedia Computer Science 256 (2025) 198–205 203
To evaluate the performance of the model in real and dynamic situations, an application was developed in Python
that uses the integrated camera of a computer, with which hand gestures are captured in real time through the
integration of the MediaPipe framework. The application makes use of the pre-trained Random Forest model,
classifying the image corresponding to the user's hand gesture, the result of which can be viewed in real time. This
practical approach, which goes beyond a reserved set of test images, demonstrates the capacity of the classification model
in situations very close to real ones, providing a robust assessment of its effectiveness.
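A possible structure for such a real-time demo is sketched below, assuming the model trained in the previous sketch is loaded with joblib; the window handling, overlay style, and quit key are illustrative choices rather than the authors' implementation.

```python
# Hedged sketch of the real-time classification application.
# Requires: pip install mediapipe opencv-python numpy joblib scikit-learn
import cv2
import numpy as np
import joblib
import mediapipe as mp

model = joblib.load("hand_gesture_rf.joblib")   # pre-trained Random Forest
mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)
with mp_hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            vec = np.array([[p.x, p.y] for p in lm]).flatten().reshape(1, -1)
            letter = model.predict(vec)[0]       # predicted letter (or "nothing")
            cv2.putText(frame, str(letter), (30, 60),
                        cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
        cv2.imshow("Hand gesture recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):    # press 'q' to quit
            break
cap.release()
cv2.destroyAllWindows()
```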
4. Results and Discussion
The performance of the developed model was evaluated not only on test images, but also in real-time use
situations. Fig. 4 illustrates the visualization of the hand gesture classification results provided by the developed
application.
Fig. 4. Hand gesture classification results.
To evaluate the performance of the developed classification system, the metrics precision, recall and F1 Score
were calculated for each class. Additionally, to provide a more concise summary of the classification system
performance, macro- and micro-averaging metrics were also computed.
Considering, for each class, TP the number of true positives (hand gestures correctly classified), FN the number
of false negatives (hand gestures considered not to belong to the class in which they should be classified), and FP
the number of false positives (hand gestures classified in a wrong class), the metrics precision, recall and F1 Score
are defined by equations 1, 2 and 3, respectively.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$F1\ \mathrm{Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$
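On a held-out test set, these per-class metrics can be obtained directly, for example with scikit-learn's classification_report; the variable names below reuse the hypothetical training sketch given earlier and are assumptions.

```python
# Per-class precision, recall and F1 score (equations 1-3) on the test split.
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
```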
Table 1 presents the classification performance evaluation metrics for each class.
Table 1. Classification evaluation metrics per class. All values are presented as a percentage (%).
Class  Precision  Recall  F1 Score    Class  Precision  Recall  F1 Score    Class  Precision  Recall  F1 Score
A 92.8 97.7 95.2 J 96.4 93.6 94.9 S 79.7 75.0 77.3
B 96.7 98.0 97.3 K 97.6 93.7 95.6 T 74.4 73.6 74.0
C 96.8 96.5 96.6 L 98.8 97.1 97.9 U 88.6 94.9 91.6
D 92.3 98.6 95.3 M 82.8 84.0 83.4 V 96.1 95.9 96.0
E 96.2 97.7 97.0 N 83.7 82.6 83.1 W 98.3 96.9 97.6
F 98.0 96.6 97.3 O 96.5 96.8 96.6 X 98.5 98.5 98.5
G 97.1 76.6 85.6 P 90.5 92.0 91.3 Y 98.3 90.2 94.1
H 80.7 97.7 88.4 Q 91.9 90.4 91.1 Z 98.6 95.0 96.7
I 93.8 96.5 95.1 R 97.6 91.7 94.6 “nothing” 81.7 93.8 87.3
204 Caminate Na Rang et al. / Procedia Computer Science 256 (2025) 198–205
Regarding the overall performance of the classification system, and considering N the total number of classes,
the macro- and micro-averaged precision and recall were computed according to equations 4, 5, 6, and 7,
respectively.
$$\mathrm{Precision}_{\text{Macro-average}} = \frac{\mathrm{Precision}_{\text{Class }A} + \mathrm{Precision}_{\text{Class }B} + \dots + \mathrm{Precision}_{\text{Class }nothing}}{N} \qquad (4)$$

$$\mathrm{Precision}_{\text{Micro-average}} = \frac{TP_{\text{Class }A} + \dots + TP_{\text{Class }nothing}}{(TP_{\text{Class }A} + FP_{\text{Class }A}) + \dots + (TP_{\text{Class }nothing} + FP_{\text{Class }nothing})} \qquad (5)$$

$$\mathrm{Recall}_{\text{Macro-average}} = \frac{\mathrm{Recall}_{\text{Class }A} + \mathrm{Recall}_{\text{Class }B} + \dots + \mathrm{Recall}_{\text{Class }nothing}}{N} \qquad (6)$$

$$\mathrm{Recall}_{\text{Micro-average}} = \frac{TP_{\text{Class }A} + \dots + TP_{\text{Class }nothing}}{(TP_{\text{Class }A} + FN_{\text{Class }A}) + \dots + (TP_{\text{Class }nothing} + FN_{\text{Class }nothing})} \qquad (7)$$
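These macro- and micro-averaged values correspond to scikit-learn's average='macro' and average='micro' options; a short sketch follows, reusing the hypothetical y_test and y_pred variables from the earlier examples.

```python
# Macro- and micro-averaged precision and recall (equations 4-7).
from sklearn.metrics import precision_score, recall_score

for avg in ("macro", "micro"):
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    print(f"{avg}-average precision: {p:.3f}, {avg}-average recall: {r:.3f}")
```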
Table 2 presents the macro- and micro-averaged evaluation metrics of the developed classification system.
Table 2. System performance evaluation.
Evaluation Metric           Value
Macro-average precision     92.4%
Micro-average precision     92.3%
Macro-average recall        92.3%
Micro-average recall        92.3%
By analyzing the results in Table 2, we find that, as expected, the values of micro-average precision and micro-
average recall are the same. This derives from the fact that a false negative for a given class corresponds to a false
positive for another class.
The results obtained demonstrate a remarkable performance of the Random Forest model in recognizing most of
the hand gestures representing the letters of the Latin alphabet. The accuracy of 92.3%, which equals micro-average
precision and recall, is a positive indicator of the success of the Random Forest model in classifying hand gestures,
which, as mentioned, is one of the components of sign language. However, detailed analysis by class reveals
variations in performance according to different hand gestures.
When examining performance by class, we observe that classes S and T present precisions below 80%, while
class G has a similarly low recall value. The same can be observed for classes M and N, whose precision values are
just above 80%. Such results are justifiable given the similarity between hand gestures corresponding to different letters,
such as M and N, or S and T, which the model has some difficulty distinguishing.
Another challenge to consider is the position of the fingers, which, for some letters, may vary slightly from person to
person.
Comparing the results obtained with previous studies on hand gesture recognition, we can state that the approach
proposed in this paper achieved competitive performance. However, it is important to highlight that direct
comparison with previous studies should not be taken in absolute terms, as it depends on several factors such as the
image databases, experimental configurations, and/or algorithms used.
The practical implications of these results are significant, as the precise recognition of hand gestures has
applications in several areas, such as the communication of individuals unable to communicate verbally, human-
machine interaction, virtual reality, among others. The performance obtained by the proposed model reveals its
potential to be incorporated into application systems and devices in a real context, in more or less complex
situations.
5. Conclusions and Future Work
Being the oldest method of human communication, sign language is a form of non-verbal communication that
uses various parts of the body, where interpretation focuses on hand gestures, facial emotions, and body posture. In
addition to their importance in interpreting sign language, static hand gestures have application in several areas, such
as human-machine interaction, augmented reality, the aviation industry, among others. This applicability is one of
the factors that has driven the interest of the scientific community in the study and development of efficient
approaches for automatic hand gesture recognition. The performance of these approaches must be analyzed not only
from the point of view of the accuracy achieved in gesture recognition, but also regarding their processing time,
which must be compatible with applications in a real context.
In this paper we present a solution for hand gesture recognition, using the Random Forest ML classification
model. To implement the proposed approach, a database of 54,000 RGB images of hand gestures was created, from
which characteristics were extracted using the MediaPipe framework. The solution presents hand gesture
classification precisions by class that vary between 74.4% and 98.8%, with an accuracy of 92.3%, representing an
improvement compared to previous approaches. Despite these results, there is still room to improve the recognition
accuracy of some hand gestures, namely those corresponding to the letters of the Latin alphabet G, M, N, S and T.
Possible directions for future research include increasing the size of the dataset, investigating different image
preprocessing techniques, as well as exploring different machine/deep learning algorithms.
Acknowledgements
This work has been funded by the Portuguese Foundation for Science and Technology (FCT), under the Project
UIDB/05567/2020.
References
[1] Alaghband, Marie, Hamid Reza Maghroor and Ivan Garibay. (2023) “A survey on sign language literature.” Machine Learning with
Applications 14: 100504. https://doi.org/10.1016/j.mlwa.2023.100504.
[2] Oudah, Munir, Ali Al-Naji and Javaan Chahl. (2020) “Hand Gesture Recognition Based on Computer Vision: A Review of Techniques.”
Journal of Imaging 6 (8): 73. https://doi.org/10.3390/jimaging6080073.
[3] Ceolini, Enea, Charlotte Frenkel, Sumit Bam Shrestha, Gemma Taverni, Lyes Khacef, Melika Payvand and Elisa Donati. (2020) “Hand-
Gesture Recognition Based on EMG and Event-Based Camera Sensor Fusion: A Benchmark in Neuromorphic Computing.” Frontiers in
Neuroscience 14:637. https://doi.org/10.3389/fnins.2020.00637.
[4] Eid, Ahmed and Friedhelm Schwenker. (2023) “Visual Static Hand Gesture Recognition Using Convolutional Neural Network.” Algorithms
16 (8): 361. https://doi.org/10.3390/a16080361.
[5] Chang Victor, Rahman Olamide Eniola, Lewis Golightly and Qianwen Ariel Xu. (2023) “An Exploration into Human–Computer Interaction:
Hand Gesture Recognition Management in a Challenging Environment.” SN Computer Science 4:441. https://doi.org/10.1007/s42979-023-
01751-y.
[6] Rautaray, Siddharth S. and Anupam Agrawal. (2012) “Real Time Gesture Recognition System for Interaction in Dynamic Environment.”
Procedia Technology 4:595–599. https://doi.org/10.1016/j.protcy.2012.05.095.
[7] Haria, Aashni, Archanasri Subramanian, Nivedhitha Asokkumar, Shristi Poddar, and Jyothi S Nayak. (2017) “Hand Gesture Recognition for
Human Computer Interaction.” Procedia Computer Science 115:367–374. https://doi.org/10.1016/j.procs.2017.09.092.
[8] Nogales, Rubén E. and Marco E. Benalcázar. (2023) “Hand Gesture Recognition Using Automatic Feature Extraction and Deep Learning
Algorithms with Memory.” Big Data and Cognitive Computing 7:102. https://doi.org/10.3390/bdcc7020102.
[9] Cruz, Patricio J., Juan Pablo Vásconez, Ricardo Romero, Alex Chico, Marco E. Benalcázar, Robin Álvarez, Lorena Isabel Barona López and
Ángel Leonardo Valdivieso Caraguay. (2023) “A Deep Q-Network based hand gesture recognition system for control of robotic platforms.”
Scientific Reports 13:2045–2322. https://doi.org/10.1038/s41598-023-34540-x.
[10] Harini, K. and S. Uma Maheswari. (2023) “A novel static and dynamic hand gesture recognition using self organizing map with deep convolutional
neural network.” Automatika 64:1128–1140. https://doi.org/10.1080/00051144.2023.2251229.
[11] John, Jogi and Shrinivas Deshpande. (2023) “Static hand gesture recognition using multi-dilated DenseNet-based deep learning architecture.”
The Imaging Science Journal 71:221–243. https://doi.org/10.1080/13682199.2023.2179965.
[12] Google Developers. MediaPipe Framework, available at https://developers.google.com/mediapipe.
[13] Google Developers. MediaPipe Framework: Hand landmarks detection guide, available at
https://developers.google.com/mediapipe/solutions/vision/hand_landmarker.
[14] Breiman, Leo. (2001) “Random forests.” Machine Learning 45 (1): 5–32.