In My Perspective, in My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition
Fig. 1. Overview of our method. From the sequence of input frames f_1, f_2, f_3..f_n representing an action, the 2D hand poses Ph^L_2D, Ph^R_2D and the bounding box of the manipulated object Po_2D with its label Po_l are extracted. In this study, four distinct state-of-the-art hand pose methods are implemented and tested. Object information is retrieved using YOLOv7 [35]. Pose information is embedded into a vector describing each frame. The sequence of vectors is processed by a transformer-based deep neural network to predict the final action class.
Abstract— Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research in understanding 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses in the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective, capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Secondly, we present a robust action recognition architecture from 2D hand and object poses. This method incorporates EffHandEgoNet and a transformer-based action recognition method. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves an accuracy of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that using 2D skeletal data is a robust approach for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects the overall performance. The code is available at https://github.com/wiktormucha/effhandegonet.
I. INTRODUCTION

The growing interest in egocentric vision research is evident from the release of large dedicated datasets like EPIC-KITCHENS [9], Ego4D [16], and H2O [19]. One of the challenges in this domain is the task of action recognition, focused on determining the action performed by the user in the video [28]. Research in egocentric action recognition is crucial due to broad potential application fields, including augmented and virtual reality, nutritional behaviour analysis, and Active Assisted Living (AAL) technologies for lifestyle analysis [28] or assistance [43]. ADLs targeted by AAL technologies (e.g., drinking, eating and food preparation) are all based on manual operations and manipulations of objects, which motivates research focused on hand-based action recognition.

Current works on egocentric action recognition focus on 3D hand pose [33], [10], [19], despite the absence of wearable depth sensors in the market. Consequently, these studies resort to estimating depth from RGB frames, introducing complexities and yielding pose prediction errors around 40 mm [33], [19] (equivalent to a 20.5% error, considering an average human hand size of 18 cm). While depth maps could be directly acquired through sensors, this necessitates inconvenient custom setups, as illustrated in Fig. 2. As an alternative to 3D hand pose estimation, 2D estimations are reported to be more accurate at the time of this study considering the percentage error (13.4% for 2D [44] against 20.5% for 3D [33], [10], [19]). Additionally, the existing literature provides examples of 2D-based action recognition achieving higher accuracy in non-egocentric setups compared to 3D-based methods [13]. These factors underscore the need for further exploration of the potential advantages of 2D pose estimation for egocentric action recognition.
Our study, based on 2D keypoints, investigates this gap with the goal of bridging the distance between research and practical applications, enabling the utilization of off-the-shelf RGB egocentric cameras. An overview of our approach is shown in Fig. 1, which displays the 2D hand and object poses obtained by our methods. It includes an example of a sequence representing an action captured from an egocentric perspective of the H2O Dataset [19].

The proposed method is applicable to recordings from recently released wearable RGB smart glasses like RayBan Stories1 and Snapchat Spectacles2. These devices allow the use of only a single RGB image, despite incorporating dual cameras, and offer improved comfort, quality and lightweight design for egocentric vision compared to head-mounted cameras or self-made RGB-D sensors (see Fig. 2). This is particularly noteworthy given the lack of commercially available wearable RGB-D devices on the market at the time of the study. The release of these user-friendly glasses is expected to increase the availability of single-image RGB datasets, thus fostering egocentric vision research. Lastly, works on 3D lifting [12], [23], which are the current state of the art for 3D human pose estimation, highlight the importance of accurate 2D pose predictions as input for the generation of 3D poses using lifting algorithms. Our 2D hand pose estimation method is proven to predict accurate 2D poses, which means that our method is not limited to direct action recognition from 2D keypoints, but holds potential as a versatile first step for 3D pose estimation through lifting algorithms.
Fig. 2: Currently, comfortable wearable RGB-D cameras are not readily available in the market. Left: A self-made RGB-D setup [19]. Right: User wearing RayBan glasses with an integrated RGB camera1.

Our contributions are the following:
1) A state-of-the-art architecture for single 2D hand pose prediction, EffHandNet, surpassing other methods in every metric on the single-hand FreiHAND [45].
2) A novel architecture for modelling 2D hand pose from the egocentric perspective, EffHandEgoNet, that outperforms state-of-the-art methods on the H2O [19] and FPHA [15] datasets. This includes models for both 2D and 3D pose estimation, with the latter projected into 2D space for a fair comparison.
3) A novel method for egocentric action recognition using 2D data on the subject's hands and object poses from YOLOv7 [35]. Our method distinguishes itself from other studies by employing a reduced set of input information, leading to faster inference per action. By leveraging YOLOv7, which has demonstrated excellent performance at recognizing diverse labels in large datasets, we enhance the potential for easy generalization to various tasks and datasets in the future. A comprehensive evaluation on the H2O Dataset showcases superior accuracy, reaching 91.32%, outperforming other state-of-the-art methods. In single-handed experiments on the FPHA Dataset, our method demonstrates robustness, achieving a state-of-the-art performance of 94.43% when only one hand manipulates an object.
4) We present extensive experiments and ablations performed on the H2O Dataset, showing the influence of the hand pose method by comparing four different pose estimation techniques. Further analysis shows the importance of each input (left and right hand, object position) for the results of action recognition.

The paper is organised as follows: Section II presents related work in egocentric hand keypoint estimation and hand-based action recognition, highlighting areas for improvement. Section III describes our methods and implementation. Evaluation and experiments are presented in Section IV. Section V concludes the study, summarising its main findings.

1 RayBan Stories - https://www.ray-ban.com/usa/ray-ban-stories (accessed 03 July 2023)
2 Snapchat Spectacles - https://www.spectacles.com (accessed 03 July 2023)
II. RELATED WORK

Action recognition has been extensively studied in the literature, with approaches employing diverse modalities such as RGB images, depth information, and skeletal data [31]. In our study, we focus on egocentric hand-based action recognition using 2D keypoints estimated from a single RGB image. Therefore, we present related work in egocentric vision for hand pose estimation and hand-based action recognition.

A. Egocentric Hand Keypoint Description

Hand pose estimation in egocentric vision is challenged by self-occlusion during movements, a limited field of view, and diverse perspectives, making effective generalisation difficult. Some works address these challenges by employing RGB-D sensors [26], [40], [15]. In addition, the depth modality could enhance user privacy [24], [25], but the adoption of depth-sensing devices in the market is limited, requiring users to wear uncomfortable devices (see Fig. 2). To leverage the benefits of 3D keypoints, some researchers employ neural networks for depth estimation and subsequent conversion from 2D to 3D space using intrinsic camera parameters [33], [19]. For instance, Tekin et al. [33] use a single RGB image to directly calculate the 3D pose of a hand through a Convolutional Neural Network (CNN), which outputs a 3D grid where each cell contains the probability of the target pose value. Kwon et al. [19] follow this approach but estimate the pose of both hands. This work, however, reports a mean End-Point Error (EPE) of 37 mm for hand pose estimation on the H2O Dataset, which, considering the average human hand size of 18 cm, results in a 20% error, leaving room for improvement. Cho et al. [7] use a CNN with a transformer-based network for per-frame 3D pose reconstruction. Wen et al. [38] propose to use sequence information to reconstruct depth, mitigating the issue of occlusions. However, these works focus on 3D pose estimation from the egocentric perspective. Despite the practicality of RGB camera glasses for real-world applications, 2D hand-based approaches are not widely adopted in the egocentric domain. One of the few works in egocentric 2D hand pose is an early study by Liang et al. [20]: a single RGB camera is employed to estimate hand pose and its distance from the sensor using a Conditional Regression Forest; however, this study lacks quantitative results. Another approach by Wang et al. [37] introduces a cascaded CNN for 2D hand pose prediction, involving two stages: hand mask creation and hand pose prediction.
Beyond the egocentric domain, a review of available applications reveals that improvements in network architectures for regular pose estimation and keypoint prediction are also applicable to egocentric hand pose prediction. This is demonstrated in the study by Bauligu et al. [2], who successfully adapt OpenPose [4] to the egocentric perspective. However, despite continuous improvements in single-hand pose performance, its application in the egocentric environment remains relatively unexplored, prompting the focus of this study. Recent advancements in 2D hand pose estimation include PoseResNet50 [8], which is based on a residual connection architecture, and Santavas et al. [30], who reinforce a CNN with a self-attention module. Zhang et al. [44] propose MediaPipe, which employs single-shot detection to identify hand regions that are then passed through the network to estimate the final pose of each hand.

Our contribution differs from available works, focusing on 2D prediction based on a single RGB image. We implement top-down and bottom-up methods and compare them with the state of the art in the egocentric environment, presenting quantitative and qualitative results (see supplement). We assess their transition from a standard to an egocentric perspective by employing them to build a 2D hand-based action recognition system.
B. Hand-Based Egocentric Action Recognition

A common strategy for action recognition involves jointly processing hand and object information. Cartas et al. [6] propose CNN-based object detectors to estimate the positions of primary regions (hands) and secondary regions (objects). Temporal information from these regions is then processed by a Long Short-Term Memory (LSTM) network. Nguyen et al. [27] transition from bounding box information to 2D skeletons of a single hand, estimated by a CNN from RGB and depth images. The joints of these skeletons are aggregated using spatial and temporal Gaussian aggregation, and action recognition is performed using a learnable Symmetric Positive Definite (SPD) matrix.

With the rise of 3D-based hand pose estimation algorithms, the scientific community has increasingly focused on egocentric action understanding using 3D information [33], [10], [19]. Tekin et al. [33] estimate 3D hand and object poses from a single RGB frame using a CNN, embedding temporal information for predicting action classes with an LSTM. Other techniques employ graph networks, such as Das et al. [10], who present an architecture with a spatio-temporal graph CNN, describing finger movement using separate sub-graphs. Kwon et al. [19] construct sub-graphs for each hand and object, which are merged into a multigraph model, allowing for learning interactions between these components. Wen et al. [38] use a transformer-based model with estimated 3D hand pose and object label input. Cho et al. [7] enrich the transformer inputs with object pose and hand-object contact information. However, these studies do not employ sensor-acquired depth data. Instead, they estimate points in 3D space using neural networks and intrinsic camera parameters [33], [19], [38], [7].

In contrast, our study aims to fill the gap of egocentric action recognition based on 2D hand pose. We introduce a novel architecture that outperforms existing solutions, based on 2D keypoints estimated with a lower percentage error than 3D-based methods. It effectively leverages information from both hands and relies solely on RGB images. Moreover, by bypassing the architecture responsible for lifting points to the third dimension, we reduce network complexity and achieve a faster inference time per action, which is a critical factor in certain applications.
III. EGOCENTRIC ACTION RECOGNITION BASED ON 2D HAND POSE

The proposed method performs action recognition from image sequences by processing pose information describing hands and interacting objects in 2D space. We consider fine-grained actions of a user manipulating different objects with their hands, e.g., opening a bottle or pouring milk. The pipeline overview is presented in Fig. 1, where f_n corresponds to the processed frames. It consists of three separate blocks: object detection, hand pose estimation and, finally, action recognition using a transformer encoder block and fully connected layers.
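The following is a minimal sketch of how these three blocks could be chained per frame. The function names (detect_object, estimate_hands, classify_action) are illustrative placeholders, not the interfaces of the released code.

```python
# Illustrative pipeline sketch: object detection -> hand pose -> action class.
# All callables are hypothetical placeholders standing in for YOLOv7, the hand
# pose network and the transformer-based action head described below.
def recognise_action(frames, detect_object, estimate_hands, classify_action):
    descriptors = []
    for frame in frames:
        obj_box, obj_label = detect_object(frame)       # Po_2D, Po_l
        left_2d, right_2d = estimate_hands(frame)       # Ph^L_2D, Ph^R_2D
        descriptors.append((left_2d, right_2d, obj_box, obj_label))
    return classify_action(descriptors)                 # one action label
```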
A. Object Detection and Pose Estimation

The first step in the pipeline is object detection, which is carried out with the pre-trained YOLOv7 network [35]. Following that, transfer learning is applied on the H2O Dataset, using the Ground Truth (GT) object pose. This process involves transforming the 6D GT representation of objects into a 2D bounding box. In each frame, denoted as f_n, the interacting object is represented by Po_2D(x, y) ∈ R^{4×2}, where each point corresponds to a corner of its bounding box. Additionally, Po_l ∈ R^1 represents the object's label.
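A minimal sketch of how such a 2D box label can be derived from a 6D object annotation is shown below. It assumes the eight corners of the object's 3D bounding box are already expressed in camera coordinates and that the 3×3 intrinsic matrix K is available; this is not the exact preprocessing of the released code.

```python
# Hedged sketch: project the 3D box corners of the annotated object and take the
# enclosing axis-aligned rectangle as Po_2D. corners_cam: (8, 3) in camera space,
# K: (3, 3) camera intrinsics. Assumed setup, not the authors' exact script.
import numpy as np

def bbox_2d_from_object(corners_cam: np.ndarray, K: np.ndarray) -> np.ndarray:
    proj = (K @ corners_cam.T).T          # perspective projection, shape (8, 3)
    uv = proj[:, :2] / proj[:, 2:3]       # divide by depth -> pixel coordinates
    (x0, y0), (x1, y1) = uv.min(axis=0), uv.max(axis=0)
    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]])  # Po_2D in R^{4x2}
```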
B. Hand Pose Estimation

Alongside the object pose, each frame f_n contains the hand poses of the subject conducting the action. The hand pose is estimated with an independent network part dedicated to inferring the position of 21 keypoints j in each hand. Each point j expresses the position of the wrist or a finger joint, following the standard approach of hand pose description [1]. Each hand pose is represented by Ph^{L,R}_2D(x, y) ∈ R^{j×2}, where L and R denote the left or right hand and x, y represent the coordinates of point j.
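Concretely, each hand is therefore a 21×2 array of pixel coordinates; the sketch below only illustrates this data layout (the index convention is an assumption, not taken from [1]).

```python
# Data layout only: one 2D hand pose is a 21x2 array of (x, y) pixel coordinates.
import numpy as np

NUM_JOINTS = 21                               # wrist + 4 joints per finger
WRIST = 0                                     # common ordering, assumed here
left_hand_2d = np.zeros((NUM_JOINTS, 2), dtype=np.float32)    # Ph^L_2D
right_hand_2d = np.zeros((NUM_JOINTS, 2), dtype=np.float32)   # Ph^R_2D
```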
Fig. 3: Our EffHandEgoNet architecture to resolve the keypoint prediction problem. Input images are resized to 512 pixels and passed through the network, which produces 21 heatmaps for each hand's keypoints via the two upsamplers. The size of the layers is illustrative.
1) EffHandNet - Top-Down Single Hand Model: Hand instances appearing in the scene are detected and extracted from the image as R^L, R^R ∈ R^{3×w×h}, w, h = 128, using a finetuned YOLOv7 [35]. Further, these regions R^L, R^R are passed to the model that predicts hand keypoints in a single-hand image. Image features F_M ∈ R^{1280×4×4} are extracted with EfficientNetV2-S [32], which provides state-of-the-art accuracy with a moderate number of trainable parameters (21.45M). Further, the feature matrix F_M is passed through an upsampler which generates heatmaps H ∈ R^{J×w×h}, where each cell represents the probability of joint J occurring at that location. We create the EffHandNet model with an upsampler containing a stack of five transposed convolutions, a pointwise convolution for channel reduction and a probability layer, inspired by Xiao et al. [39]. Finally, H is transformed into P_2D by choosing the highest-probability cell for each joint J.
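A sketch in the spirit of this description is given below: an EfficientNetV2-S feature extractor, five transposed convolutions, a pointwise convolution producing 21 heatmaps, and an argmax decoding step. Channel widths, activation choices and the sigmoid probability layer are assumptions rather than the released configuration.

```python
# Sketch of a top-down single-hand estimator (assumed hyperparameters).
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

class SingleHandNet(nn.Module):
    def __init__(self, num_joints: int = 21):
        super().__init__()
        # 128x128 hand crop -> 1280x4x4 feature map
        self.backbone = efficientnet_v2_s(weights="DEFAULT").features
        layers, in_ch = [], 1280
        for out_ch in (256, 256, 128, 128, 64):        # five stride-2 steps: 4 -> 128
            layers += [nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.upsampler = nn.Sequential(*layers)
        self.head = nn.Conv2d(in_ch, num_joints, kernel_size=1)   # pointwise reduction

    def forward(self, crop):                            # crop: (B, 3, 128, 128)
        heatmaps = torch.sigmoid(self.head(self.upsampler(self.backbone(crop))))
        return heatmaps                                 # (B, 21, 128, 128)

def decode_keypoints(heatmaps):
    """Pick the highest-probability cell of each heatmap as the joint location."""
    b, j, h, w = heatmaps.shape
    idx = heatmaps.flatten(2).argmax(dim=-1)            # (B, 21)
    return torch.stack([idx % w, idx // w], dim=-1)     # (B, 21, 2) as (x, y)
```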
2) EffHandEgoNet - Bottom-Up Egocentric Model: The challenge to overcome in the egocentric view concerning hand pose description is the accurate modelling of interactions between two hands and objects, where the top-down approach tends to fail. To close this performance gap, we introduce EffHandEgoNet, a modified EffHandNet that performs bottom-up keypoint estimation in the egocentric perspective. It simultaneously processes the two interacting hands in the scene, improving pose prediction robustness. The network consists of an EfficientNetV2-S backbone extracting features from the image I ∈ R^{3×w×h}, w, h = 512. The extracted features F_M ∈ R^{1280×16×16} are handed to two independent upsamplers, one per hand, following the accurate EffHandNet approach, and to the handedness modules, built from linear layers, responsible for predicting each hand's presence h^L, h^R ∈ R^2. The upsamplers consist of three transposed convolutions with batch normalisation and ReLU activation, except for the last layer, which is followed by a pointwise convolution. The heatmaps H^L and H^R are transformed into P^L_2D, P^R_2D. The full architecture is illustrated in Fig. 3.
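The sketch below mirrors this description: a shared backbone on the full 512×512 frame, two independent per-hand upsamplers, and two small presence heads. Channel widths, the pooling in the presence heads and the output heatmap resolution are assumptions, not the released configuration.

```python
# Sketch of a bottom-up two-hand estimator in the spirit of EffHandEgoNet
# (assumed hyperparameters).
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

def make_upsampler(num_joints: int = 21) -> nn.Sequential:
    layers, in_ch = [], 1280
    for out_ch in (256, 128, 64):                       # three stride-2 steps: 16 -> 128
        layers += [nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.Conv2d(in_ch, num_joints, kernel_size=1))   # pointwise heatmap head
    return nn.Sequential(*layers)

class TwoHandEgoNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 512x512 frame -> 1280x16x16 feature map
        self.backbone = efficientnet_v2_s(weights="DEFAULT").features
        self.up_left, self.up_right = make_upsampler(), make_upsampler()
        # presence heads h^L, h^R in R^2 (pooling before the linear layer is assumed)
        self.presence_left = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1280, 2))
        self.presence_right = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1280, 2))

    def forward(self, img):                             # img: (B, 3, 512, 512)
        feats = self.backbone(img)
        heat_l, heat_r = self.up_left(feats), self.up_right(feats)   # (B, 21, 128, 128)
        return heat_l, heat_r, self.presence_left(feats), self.presence_right(feats)
```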
C. Action Recognition

The representation of each action sequence consists of frames [f_1, f_2, f_3..f_n], where n ∈ [1..N] and N = 20 is chosen heuristically. These frames embed the flattened poses of the hands Ph^L_2D, Ph^R_2D and of the object Po_2D, Po_l. If fewer than N frames represent an action, zero padding is applied, while actions longer than N frames are sub-sampled. The input vector V_seq is a concatenation of frames f_n ∈ R^93:

f_n = [Ph^L_2D, Ph^R_2D, Po_2D, Po_l]    (1)

V_seq = [f_1, f_2, .., f_n], n ∈ [1..20]    (2)
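As a consistency check, the 93 dimensions of f_n decompose as 2 hands × 21 keypoints × 2 coordinates (84) + 4 box corners × 2 coordinates (8) + 1 object label. A minimal sketch of Eqs. (1) and (2), including the padding and sub-sampling described above, follows; uniform sub-sampling of long actions is an assumption.

```python
# Sketch of Eq. (1) and Eq. (2): build the 93-d frame descriptor and the fixed
# length sequence V_seq. Uniform sub-sampling of long actions is an assumption.
import numpy as np

N_FRAMES = 20                                           # N, chosen heuristically

def frame_descriptor(left_2d, right_2d, obj_box, obj_label):
    """left_2d, right_2d: (21, 2); obj_box: (4, 2); obj_label: scalar -> (93,)."""
    return np.concatenate([left_2d.ravel(), right_2d.ravel(),
                           obj_box.ravel(), [obj_label]]).astype(np.float32)

def build_sequence(frame_vectors):
    """Zero-pad short actions and sub-sample long ones to exactly N frames."""
    seq = np.stack(frame_vectors)                       # (n, 93)
    if len(seq) < N_FRAMES:
        pad = np.zeros((N_FRAMES - len(seq), seq.shape[1]), dtype=seq.dtype)
        seq = np.concatenate([seq, pad])
    elif len(seq) > N_FRAMES:
        keep = np.linspace(0, len(seq) - 1, N_FRAMES).astype(int)
        seq = seq[keep]
    return seq                                          # V_seq: (20, 93)
```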
The final stage of the pipeline involves processing the sequence vector V_seq to embed temporal information and perform action classification. For this objective, we employ a model inspired by the Visual Transformer [11]. It is constructed using a standard Transformer Encoder block [34]. The input vector V_seq representing an action is linearised using a fully connected layer to x_lin. The resulting x_lin is combined with a classification token and a positional embedding, following [11]. The number of encoder layers is set to 2, and encoder dropout and attention dropout are applied with a probability of 0.2. The number of network parameters is only 31K, and the inference time on an RTX 3090 GPU is 6.2 ms. The detailed pipeline is depicted in Fig. 4.
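A sketch of such a head is shown below, following the description and Fig. 4 (each 93-d frame linearised to 42-d, a classification token, learned positional embeddings, two encoder layers with dropout 0.2, and a classifier over 36 action classes). The number of attention heads and the feed-forward width are assumptions, not reported values.

```python
# Sketch of the transformer-based action head (assumed nhead and feed-forward width).
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    def __init__(self, in_dim=93, embed_dim=42, seq_len=20, num_classes=36):
        super().__init__()
        self.linearise = nn.Linear(in_dim, embed_dim)               # f_n -> x_lin
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=6,
                                           dim_feedforward=2 * embed_dim,
                                           dropout=0.2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, v_seq):                    # v_seq: (B, 20, 93)
        x = self.linearise(v_seq)                # (B, 20, 42)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                      # temporal mixing
        return self.head(x[:, 0])                # classify from the [CLS] token

# Example: logits = ActionTransformer()(torch.randn(4, 20, 93))  -> shape (4, 36)
```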
IV. EXPERIMENTS

The proposed action recognition approach is evaluated on two different datasets: the H2O Dataset [19] contains actions performed with both hands, and the FPHA Dataset [15] contains actions performed with only one hand. The choice of these datasets is motivated by following other studies in the field of hand-based egocentric action recognition [19], [38] to allow comparison with existing work. Besides this, the provided GT hand pose and object information allows us to perform extensive experiments and ablations regarding different inputs, which are not possible in other egocentric datasets, i.e., [9], [16]. In addition, the single-hand estimation task is evaluated on the non-egocentric FreiHAND [45].
A. Datasets

1) FreiHAND: This is a dataset for hand pose and shape estimation from a single RGB image. In this study, it is used for the training and evaluation of the network responsible for detecting single-hand keypoints. It contains 130K images of hands annotated with 3D coordinates that can be transformed to 2D using the given camera intrinsics. There are 32K images of hands captured on a green screen, and the rest are created by background augmentation. In addition, there are
Fig. 4: Our procedure for action recognition. From the sequence of frames f_1, f_2, f_3..f_n, the hand poses Ph^L_2D, Ph^R_2D are estimated using the EffHandEgoNet model and the object pose Po_2D, Po_l is extracted with YOLOv7 [35]. Each sequence frame f_n is linearised from shape R^93 to R^42. With added positional embedding and classification token, this information creates an input for the transformer encoder implemented following [11], repeated ×2 times, which embeds the temporal information. Finally, the multi-layer perceptron predicts one of the 36 action labels.
V. CONCLUSION