Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
8 views10 pages

In My Perspective, in My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Uploaded by

maxkingpublish
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views10 pages

In My Perspective, in My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Uploaded by

maxkingpublish
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

In My Perspective, In My Hands:

Accurate Egocentric 2D Hand Pose and Action Recognition


Wiktor Mucha and Martin Kampel
Computer Vision Lab, TU Wien, Favoritenstr. 9/193-1, 1040 Vienna, Austria
arXiv:2404.09308v1 [cs.CV] 14 Apr 2024

Fig. 1. Overview of our method. From the sequence of input frames f1 , f2 , f3 ..fn representing action, 2D hands pose P hL R
2D ,P h2D and bounding box
of manipulated objects P o2D with its labels P ol are extracted. Under the study, four distinct state-of-the-art hand pose methods are implemented and
tested. Object information is retrieved using YOLOv7 [35]. Pose information is embedded into a vector describing each frame. The sequence of vectors is
processed via the transformer-based deep learning neural network to predict the final action class.

Abstract— Action recognition is essential for egocentric video the video [28]. Research in egocentric action recognition is
understanding, allowing automatic and continuous monitoring crucial due to broad potential application fields, including
of Activities of Daily Living (ADLs) without user effort. Existing augmented and virtual reality, nutritional behaviour analysis,
literature focuses on 3D hand pose input, which requires
computationally intensive depth estimation networks or wearing and Active Assisted Living (AAL) technologies for lifestyle
an uncomfortable depth sensor. In contrast, there has been analysis [28] or assistance [43]. ADLs targeted by AAL
insufficient research in understanding 2D hand pose for egocen- technologies (e.g., drinking, eating and food preparation)
tric action recognition, despite the availability of user-friendly are all based on manual operations and manipulations of
smart glasses in the market capable of capturing a single RGB objects, which motivates research focused on hand-based
image. Our study aims to fill this research gap by exploring
the field of 2D hand pose estimation for egocentric action action recognition.
recognition, making two contributions. Firstly, we introduce Current works on egocentric action recognition focus on
two novel approaches for 2D hand pose estimation, namely 3D hand pose [33], [10], [19], despite the absence of wear-
EffHandNet for single-hand estimation and EffHandEgoNet,
tailored for an egocentric perspective, capturing interactions able depth sensors in the market. Consequently, these studies
between hands and objects. Both methods outperform state- resort to estimating depth from RGB frames, introducing
of-the-art models on H2O and FPHA public benchmarks. complexities and yielding pose prediction errors around 40
Secondly, we present a robust action recognition architecture mm [33], [19] (equivalent to a 20.5% error, considering an
from 2D hand and object poses. This method incorporates average human hand size of 18 cm). While depth maps
EffHandEgoNet, and a transformer-based action recognition
method. Evaluated on H2O and FPHA datasets, our archi- could be directly acquired through sensors, this necessitates
tecture has a faster inference time and achieves an accuracy inconvenient custom setups, as illustrated in Fig. 2. As
of 91.32% and 94.43%, respectively, surpassing state of the an alternative to 3D hand pose estimation, 2D estimations
art, including 3D-based methods. Our work demonstrates that are reported to be more accurate at the time of this study
using 2D skeletal data is a robust approach for egocentric action considering percental error (13.4% for 2D [44] against 20.5%
understanding. Extensive evaluation and ablation studies show
the impact of the hand pose estimation approach, and how each for 3D [33], [10], [19]). Additionally, the existing literature
input affects the overall performance. The code is available at provides examples of 2D-based action recognition achieving
https://github.com/wiktormucha/effhandegonet. higher accuracy in non-egocentric setups compared to 3D-
based methods [13]. These factors underscore the need for
I. I NTRODUCTION further exploration of the potential advantages of 2D pose
The growing interest in egocentric vision research is estimation for egocentric action recognition.
evident from the release of large dedicated datasets like Our study, based on 2D keypoints, investigates this gap
EPIC-KITCHENS [9], Ego4D [16], and H2O [19]. One of the with the goal of bridging the distance between research and
challenges in this domain is the task of action recognition, practical applications, enabling the utilization of off-the-shelf
focused on determining the action performed by the user in RGB egocentric cameras. An overview of our approach is
prehensive evaluation on the H2O Dataset showcases
superior accuracy, reaching 91.32%, outperforming
other state-of-the-art methods. In single-handed experi-
ments on the FPHA Dataset, our method demonstrates
robustness, achieving a state-of-the-art performance of
94.43% when only one hand manipulates an object.
Fig. 2: Currently, comfortable wearable RGB-D cameras are not readily 4) We present extensive experiments and ablations per-
available in the market. Left: A self-made RGB-D setup [19]. Right: User
wearing RayBan glasses with an integrated RGB camera1 . formed on H2O Dataset, showing the influence of
the hand pose method by comparing four different
pose estimation techniques. Further analysis shows the
shown in Fig. 1, which displays the 2D hand and object importance of each input (left and right hand, object
poses obtained by our methods. It includes an example of a position) for the results of action recognition.
sequence representing an action captured from an egocentric
The paper is organised as follows: Section II presents re-
perspective of the H2O Dataset [19]. The proposed method
lated work in egocentric hand keypoints estimation and hand-
is applicable to recordings from recently released wearable
based action recognition highlighting areas for improvement.
RGB smart-glasses like RayBan Stories1 and Snapchat Spec-
Section III describes our methods and implementation. Eval-
tacles2 . These devices allow the use of only a single RGB im-
uation and experiments are presented in Section IV. Section
age, despite incorporating dual cameras, and offer improved
V concludes the study summarising its main findings.
comfort, quality and lightweight design for egocentric vision
compared to head-mounted cameras or self-made RGB-D II. R ELATED W ORK
sensors (see Fig. 2). This is particularly noteworthy given the Action recognition has been extensively studied in the lit-
lack of commercially available wearable RGB-D devices on erature, with approaches employing diverse modalities such
the market at the time of the study. The release of these user- as RGB images, depth information, and skeletal data [31]. In
friendly glasses is expected to increase the availability of our study, we focus on egocentric hand-based action recogni-
single-image RGB datasets, thus fostering egocentric vision tion using 2D keypoints estimated from a single RGB image.
research. Lastly, works on 3D lifting [12], [23], which are Therefore, we present related work in egocentric vision for
the current state of the art for 3D human pose estimation, hand pose estimation and hand-based action recognition.
highlight the importance of accurate 2D pose predictions as
input for the generation of 3D poses using lifting algorithms. A. Egocentric Hand Keypoint Description
Our 2D hand pose estimation method is proven to predict Hand-pose estimation in egocentric vision is challenged
accurate 2D poses, which means that our method is not by self-occlusion during movements, limited field of view,
limited to direct action recognition from 2D keypoints, but and diverse perspectives, making effective generalisation
holds potential as a versatile first step for 3D pose estimation difficult. Some works address these challenges by employing
through lifting algorithms. RGB-D sensors [26], [40], [15]. In addition, the depth
Our contributions are the following: modality could enhance user privacy [24], [25], but the
1) A state-of-the-art architecture for single 2D hand pose adoption of depth-sensing devices in the market is limited,
prediction, EffHandNet, surpassing other methods in requiring users to wear uncomfortable devices (see Fig. 2).
every metric on the single-hand FreiHAND [45]. To leverage the benefits of 3D keypoints, some researchers
2) A novel architecture for modelling 2D hand pose employ neural networks for depth estimation and subsequent
from the egocentric perspective, EffHandEgoNet that conversion from 2D to 3D space using intrinsic camera
outperforms state-of-the-art methods on H2O [19] and parameters [33], [19]. For instance, Tekin et al. [33] use
FPHA [15] datasets. This includes models for both 2D a single RGB image to directly calculate the 3D pose of a
and 3D pose estimation, with the latter projected into hand through a Convolutional Neural Network (CNN) which
2D space for a fair comparison. generates a 3D grid as the output of CNN, where each cell
3) A novel method for egocentric action recognition using contains the probability of target pose value. Kwon et al. [19]
2D data on the subject’s hands and object poses from follow this approach but estimates pose for both hands. This
YOLOv7 [35]. Our method distinguishes itself from work, however, reports a mean End-Point Error (EPE) of
other studies by employing a reduced set of input 37mm for hand pose estimation in the H2O Dataset, which,
information, leading to faster inference per action. By considering the average human hand size of 18 cm, results
leveraging YOLOv7, which has demonstrated excellent in a 20% error, leaving room for improvement. Cho et al. [7]
performance at recognizing diverse labels in large uses CNN with the transformer-based network for 3D pose-
datasets, we enhance the potential for easy generaliza- per-frame reconstruction. Wen et al. [38] propose to use se-
tion to various tasks and datasets in the future. A com- quence information to reconstruct depth omitting the issue of
occlusions. However, these works focus on 3D pose estima-
1 RayBan Stories - https://www.ray-ban.com/usa/ tion from the egocentric perspective. Despite the practicality
ray-ban-stories (accessed 03 July 2023)
2 Snapchat Spectacles - https://www.spectacles.com (accessed of RGB camera glasses for real-world applications, 2D hand-
03 July 2023) based approaches are not widely adopted in the egocentric
domain. One of a few works in egocentric 2D hand pose is model, allowing for learning interactions between these com-
an early study by Liang et al. [20]. A single RGB camera ponents. Wen et al. [38] use a transformer-based model with
is employed to estimate hand pose and its distance from the estimated 3D hand pose and object label input. Cho et al. [7]
sensor, employing a Conditional Regression Forest; however, enrich the transformer inputs with object pose and hand-
this study lacks quantitative results. Another approach by object contact information. However, these studies do not
Wang et al. [37] introduces a cascaded CNN for 2D hand employ sensor-acquired depth data. Instead, they estimate
pose prediction, involving two stages: hand mask creation points in 3D space using neural networks and intrinsic
and hand pose prediction. camera parameters [33], [19], [38], [7].
Beyond the egocentric domain, a review of available In contrast, our study aims to fill the gap of egocentric
applications reveals that improvements in network architec- action recognition based on 2D hand pose. We introduce a
tures for regular pose estimation and keypoint prediction novel architecture based on 2D keypoints, estimated with
are also applicable to the egocentric hand pose prediction. a lower percentage error in pose prediction compared to
This is demonstrated in the study by Bauligu et al. [2], who 3D-based methods that outperform existing solutions. It
successfully adapt OpenPose [4] to the egocentric perspec- effectively leverages information from both hands and relies
tive. However, despite continuous improvements in single- solely on RGB images. Moreover, by bypassing the archi-
hand pose performance, its application in the egocentric tecture responsible for lifting points to the third dimension,
environment remains relatively unexplored, prompting the we reduce network complexity and achieve a faster inference
focus of this study. Recent advancements in 2D hand pose time per action, which is a critical factor in certain applica-
estimation include PoseResNet50 [8], which is based on tions.
residual connection architecture and Santavas et al. [30], who
reinforce a CNN by a self-attention module. Zhang et al. [44] III. E GOCENTRIC ACTION R ECOGNITION BASED ON 2D
proposes MediaPipe, which employs single shot detection to H AND P OSE
identify hand regions passed further through the network to
estimate the final pose for each hand. The proposed method performs action recognition from
Our contribution differs from available works, focusing image sequences by processing pose information describing
on 2D prediction based on a single RGB image. We imple- hands and interacting objects in a 2D space. We consider
ment top-down and bottom-up methods and compare them fine-grained actions of a user manipulating different objects
with state-of-the-art in the egocentric environment presenting with his hands, e.g. opening a bottle or pouring milk. The
quantitative and qualitative results [supplement]. We assess pipeline overview is presented in Fig. 1 where fn corresponds
their transition from a standard to an egocentric perspective to processed frames. It is constructed of three separate
by employing them to build a 2D hand-based action recog- blocks object detection, hand pose estimation and finally,
nition system. action recognition using transformer encoder block and fully
connected layers.
B. Hand-Based Egocentric Action Recognition
A common strategy for action recognition involves pro- A. Object Detection and Pose Estimation
cessing jointly hand and object information. Cartas et al. [6]
The first step in the pipeline is object detection, which
propose CNN-based object detectors to estimate the po-
is carried out employing the pre-trained YOLOv7 network
sitions of primary regions (hands) and secondary regions
[35]. Following that, transfer learning is used on the H2O
(objects). Temporal information from these regions is then
Dataset, using the Ground Truth (GT) object pose. This
processed by a Long Short-Term Memory (LSTM) network.
process involves transforming from a 6D GT representation
Nguyen et al. [27] transition from bounding box information
of objects to a 2D bounding box. In each frame, denoted
to 2D skeletons of a single hand estimated by CNN from
as fn , the interacting object is represented by Po 2D (x, y) ∈
RGB and depth images. The joints of these skeletons are
R4×2 , where each point corresponds to the corners of its
aggregated using spatial and temporal Gaussian aggregation,
bounding box. Additionally, Po l ∈ R1 represents object’s
and action recognition is performed using a learnable Sym-
label.
metric Positive Definite (SPD) matrix.
With the rise of 3D-based hand pose estimation algo-
B. Hand Pose Estimation
rithms, the scientific community has increasingly focused on
egocentric action understanding using 3D information [33], Each frame fn contains alongside the object pose, the
[10], [19]. Tekin et al. [33] estimate 3D hand and object hands pose of a subject conducting action. This hand pose
poses from a single RGB frame using CNN, embedding is estimated with an independent network part dedicated to
temporal information for predicting action classes using inferring the position of 21 keypoints j in each hand. Each
LSTM. Other techniques employ graph networks, such as point j expresses the position of wrist and fingers’ joints
Das et al. [10], who present an architecture with a spatio- following the standard approach of hand pose description
temporal graph CNN, describing finger movement using [1]. Each hand pose is represented by Ph L,R
2D (x, y), ∈ R
j×2

separate sub-graphs. Kwon et al. [19] construct sub-graphs where L and R describes the left or right hand and x, y
for each hand and object, which are merged into a multigraph represents the point j.
Fig. 3: Our EffHandEgoNet architecture to resolve keypoint prediction problem. Input images are resized to 512 pixels and passed through the network to
produce 21 heatmaps for each of the hand’s keypoints by both upsamplers. The size of the layers is illustrative.

1) EffHandNet - Top-Down Single Hand Model: Hand actions longer than N frames are sub-sampled. The input
instances appearing in the scene are detected and extracted vector Vseq is a concatenation of frames fn ∈ R93 .
from the image as RL , RR ∈ R3×w×h , w, h = 128 using
fn = [P hL R
2D , P h2D , P o2D , P ol ] (1)
finetuned YOLOv7 [35]. Further, these regions RL , RR are
passed to the model that predicts hand keypoints in a single- Vseq = [f1 , f2 ..fn ], n ∈ [1..20] (2)
hand image. Image features FM ∈ R1280×4×4 are extracted
with EfficientNetV2-S [32]. It provides state-of-the-art accu- The final stage of the pipeline involves processing the
racy with a medium amount of trainable parameters (21.45M sequence vector Vseq to embed temporal information and
parameters). Further, the feature matrix FM is passed via an perform action classification. For this objective, we employ
upsampler which generates heatmap H ∈ RJ×w×h where a model inspired by the Visual Transformer [11]. It is con-
each cell represents the probability of joint J occurrence. structed using a standard Transformer Encoder block [34].
We create EffHandNet model with the upsampler containing The input vector Vseq representing an action is linearised
a stack of five transposed convolutions and pointwise convo- using a fully connected layer to xlin . The resulting xlin
lution for channel reduction and probability layer inspired by is combined with a classification token, and a positional
Xiao et al. [39]. Finally, H is transformed into P2D choosing embedding following [11]. The number of encoder layers
highest probability cell for each J. is set to 2, and encoder dropout and attention dropout are
applied with a probability of 0.2. The number of network
2) EffHandEgoNet - Bottom-Up Egocentric Model: The
parameters is equal to only 31K and the inference time on
challenge to overcome in the egocentric view concerning
the RTX 3090 GPU equals 6.2 ms. The detailed pipeline is
hand pose description is the accurate modelling of interac-
depicted in Fig. 4.
tions between two hands and objects, where the top-down
approach tends to fail. To close this performance gap, we IV. E XPERIMENTS
introduce EffHandEgoNet, a modified EffHandNet to execute The proposed action recognition approach is evaluated on
bottom-up keypoints estimation in the egocentric perspective. two different datasets. H2O Dataset [19] contains actions
It simultaneously allows the processing of two interacting performed with both hands and FPHA Dataset [15] contains
hands in the scene, improving the pose prediction robustness. actions performed with only one hand. The choice of these
The network consists of EfficientNetV2-S backbone extract- datasets is motivated by following other studies in the
ing features from the image I ∈ R3×w×h , w, h = 512. field of hand-based egocentric action recognition [19], [38]
Extracted features FM ∈ R1280×16×16 are handed to two to allow comparison with existing work. Besides this, the
independent upsamplers for each of the hands, following the provided GT hand pose and object information allows us
accurate EffHandNet approach and to the handness modules to perform extensive experiments and ablations regarding
responsible for predicting each hand’s presence hL , hR ∈ R2 different inputs which are not possible in other egocentric
built from linear layers. The upsamplers consist of three datasets, i.e., [9], [16]. In addition, single-hand estimation
transposed convolutions with batch normalisation and ReLU task is evaluated on the non-egocentric FreiHAND [45].
activation except the last layer followed by a pointwise
convolution. Heatmaps H L and H R are transformed into A. Datasets
L R
P2D , P2D . The full architecture is illustrated in Fig. 3. 1) FreiHAND: It is a dataset for hand pose and shape
estimation from a single RGB image. In this study, it is used
C. Action Recognition for the training and evaluation of a network responsible for
The representation of each action sequence consists of detecting single-hand keypoints. It contains 130K images of
frames [f1 , f2 , f3 ..fn ], where n ∈ [1..N ] and N = 20 is hands annotated with 3D coordinates that can be transformed
chosen heuristically. These frames embed flattened poses of to 2D using the given camera intrinsics. There are 32K
hands P hL R
2D , P h2D and object P o2D , P ol . If fewer than N images of hands captured on a green screen, and the rest are
frames represent an action, zero padding is applied, while created by background augmentation. In addition, there are
Fig. 4: Our procedure for action recognition. From the sequence of frames f1 , f2 , f3 ..fn hand pose P hL R
2D ,P h2D is estimated using EffHandEgoNet
model and object pose P o2D , P ol is extracted with YOLOv7 [35]. Each sequence frame fn is linearised from shape R93 to R42 . With added positional
embedding and classification token, this information creates an input for the transformer encoder implemented following [11] repeated ×2 times, which
embeds the temporal information. Finally, the multi-layer perceptron predicts one of the 36 action labels.

TABLE I: Results of 2D single-hand models on FreiHAND dataset. Refer-


4K images for testing, including real backgrounds referenced enced results are reported by the authors of the methods, while unreferenced
in this study as final test. results are computed by us using open-source implementations.
2) H2O Dataset: It provides GT for studying hand- Method Year PCK0.2↑ EPE↓ AUC↑
based actions and object interactions with two hands. It test subset from random data split 80/10/10
includes multi-view RGB-D images with action labels for 36 PoseResNet50 [8] 2020 99.20% 3.27 86.8
MediaPipe 2020 71.77% 7.45 79.7
classes constructed from verb and object labels, 3D poses Santavas et al. [30] 2020 - 4.00 87.0
for both hands giving j = 2 × 21, 6D poses and meshes EffHandNet 2024 98.70% 2.24 92.1
for manipulated objects, GT camera poses, and scene point EffHandNet+P 2024 99.32% 1.59 93.5
clouds. The actions were performed by four people. The final test subset
MediPipe 2020 81.73% 5.29 83.9
training, validation and test subsets for the action recognition PoseResNet50 2020 87.48% 4.32 86.0
and hand pose estimation tasks are provided. In the action EffHandNet 2024 88.76% 4.19 86.5
recognition subset there are 569 clips for training, 122 for EffHandNet+P 2024 91.08% 3.67 87.9
validation and 242 for testing.
3) FPHA: This dataset contains egocentric RGB-D
recordings of actions performed with one hand by a user Gradient Descent (SGD) and Intersection over Union (IoU)
wearing a camera. The recordings have been captured by loss function with a weight decay of 10-5 and a momentum
six different people. Besides RGB-D, they include 3D hand of 0.9. After 800 epochs, the momentum is reduced to 0
keypoint annotations of a single hand with 21 keypoints and the learning rate is reduced by two. EffHandEgoNet is
giving j = 21, 26 object labels and 45 action classes trained and evaluated on H2O and FPHA datasets, using
containing a verb and an object. In the study, we follow the a similar approach, the momentum and learning rate are
data split provided by the dataset authors for 600 training reduced after 60 epochs and the loss combines weighted IoU
actions and 575 actions for evaluation. (pose) and Cross-Entropy (handness) losses of each hand. For
both processes, the data is augmented with random cropping,
B. Metrics horizontal flipping, vertical flipping, resizing, rotating and
To evaluate 2D hand pose estimation task we follow blurring following [3]. Model weights are saved for the
standard metrics used in other studies about 2D hand pose smallest EPE in the validation subset.
[44], [8], [30]. These metrics in contrast to 3D pose metrics 2) Results of EffHandNet in Single-Hand Task: Table
provide information in image space. Mean End-Point Error I presents the results of our method compared to existing
(EPE) describes the Euclidean distance in pixels between approaches for single-hand pose estimation. We follow a
the predicted and the GT hand keypoint, the Percentage random dataset split in proportions of 80/10/10 following
of Correct Keypoints (PCK) answers how many predicted other authors [8], [30]. We provide mean results of 10
points have an EPE under a given threshold, normalised by runs to reduce the randomness factor. EffHandNet performs
the width of the hand bounding box, and the Area Under the comparable to other works on FreiHAND dataset in the
Curve (AUC) gives PCK values over thresholds from 0 to 2D domain with an accuracy of 98.79% in PCK0.2, EPE
1. The results of action recognition are reported in terms of equal 1.97 and AUC 92.7%. Other studies [8], [30] do not
classification accuracy following studies [7], [19], [38]. For declare results on 2D hand keypoint prediction in a final
each video representing an action in a test set, a predicted test subset of FreiHAND. Our method in final test results in
action label is compared to the GT label. better performance in all three metrics. The final test subset
with real-world background images shows that some of the
C. Hand Pose Evaluation predictions are outside the hand region. The reason for this is
1) Experiment Setup: EffHandNet is trained on Frei- the absence of real-world background in the training samples.
HAND dataset with 80/10/10 proportions for training, vali- This is confirmed by improved results with central cropping
dation and evaluation. Optimisation is done using Stochastic of the images with fixed value referenced as EffHandNet+P.
TABLE II: Results for 2D hand pose estimation in egocentric H2O Dataset. TABLE III: Results in accuracy of action recognition methods on H2O and
The table includes hand detection accuracy, hand pose estimation PCK0.2, FPHA datasets. Inputs of methods are: Img stands for semantic features
EPE and AUC metrics in pixels for an image size of 1280x720. Results are extracted from an image using CNN network, Hand P. and Obj P. stand for
calculated using open-source implementations and authors’ model weights. pose information type for hands and objects, and Obj L. stands for object
label. Results are from referenced papers.
Method: Year Acc.↑ PCK0.2↑ EPE↓ AUC↑ H2O Dataset
H2O Dataset Method: Year Img H. P. Obj P. Obj L. Acc.↑
PoseResNet50 [8] 2020 99.47 74.42% 26.69 81.4 C2D [36] 2018 ✓ ✗ ✗ ✗ 70.66
MediaPipe [44] 2020 96.93 86.22% 21.22 85.1 I3D [5] 2017 ✓ ✗ ✗ ✗ 75.21
HTT [38] 2023 - 84.75 19.94 84.8 SlowFast [14] 2019 ✓ ✗ ✗ ✗ 77.69
H2OTR [7] 2023 - 95.55 12.46 89.4 H+O [33] 2019 ✗ 3D 6D ✓ 68.88
EffHandNet 2024 99.47 76.27% 22.52 82.0 ST-GCN [41] 2018 ✗ 3D 6D ✓ 73.86
EffHandEgoNet 2024 99.91 97.38% 9.80 90.7 TA-GCN [19] 2021 ✗ 3D 6D ✓ 79.25
FPHA Dataset HTT [38] 2023 ✓ 3D ✗ ✓ 86.36
H2OTR [7] 2023 - 94.67 17.50 89.3 H2OTR [7] 2023 ✗ 3D 6D ✓ 90.90
HTT [38] 2023 - 92.07 18.07 88.7 Ours 2024 ✗ 2D 2D ✓ 91.32
Ours 2024 - 96.37 15.20 88.5 FPHA Dataset
Method: Year Img H. P. Obj P. Obj L. Acc.↑
FPHA [15] 2018 ✗ 3D - ✓ 78.73
3) Results of 2D Egocentric Hand Pose Models: The H+O [33] 2019 ✗ 3D - ✓ 82.43
evaluation of the hand pose models in the egocentric perspec- Coll. [42] 2020 ✓ 3D - ✓ 85.22
HTT [38] 2023 ✓ 3D - ✓ 94.09
tive is performed in the test subsets of the H2O and FPHA VPA [29] 2021 ✗ 3D - ✓ 95.93
datasets. The results are presented in Table II, including Ours 2024 ✗ 2D - ✓ 94.43
hand detection accuracy, PCK.02, EPE and AUC. On H2O,
the best results are obtained by our EffHandEgoNet, which H2O Dataset, where in FPHA Dataset it is done after 100
models both hands with a PCK02 of 97.38%, an EPE of 9.80 and 1000 epochs. Hyperparameters and augmentations are
and an AUC of 90.7% compared to other approaches. The selected based on the best-performing set in the validation
performance of non-egocentric methods such as MediaPipe, subset. Every run is repeated five times to reduce the effect
PoseResNet50 and EffHandNet drops significantly in this of random initialisation of the network and mean results with
scenario due to the challenging self-occluded hand frames standard deviations are reported.
where pose estimation is more complex. This is also visible 2) Results: We test our approach with the best performing
at the hand detection stage, where MediaPipe performs the hand pose estimation method EffHandEgoNet and object
worst, mistaking the left hand for the right [supplement]. detection implementation in H2O Dataset. Fine-tuning of
In addition, EffHandEgoNet is tested on FPHA Dataset with the YOLOv7 model results in 0.995 [email protected], making the
state-of-the-art methods HTT [38] and H2OTR [7], which are object data valid for higher-level inference. The proposed
also outperformed. Hand detection is omitted in this part, as method results in 89.17%±1.56, with the best model reach-
no labels are provided. In both datasets, H2OTR [7] and HTT ing an accuracy of 91.32%, outperforming all state-of-the-art
[38] are transformed to image space as the original papers methods.
only provide 3D metrics. Nevertheless, H2O Dataset only consider scenarios of
users performing actions with both hands. To prove the
D. Action Recognition Evaluation applicability of the proposed method in single-hand tasks, we
1) Experiment Setup: To mitigate overfitting between the assess our model in FPHA Dataset, which contains actions
validation and training subsets, our training strategy incorpo- with a single hand. Since there is no provided GT object
rates various augmentations applied to the sequence vectors pose for all actions, we add to the EffHandEgoNet hand
of keypoints, denoted as Vseq . In the process of training our pose estimation model an additional network that predicts
method in H2O Dataset, beneficial augmentations include a class of manipulated objects. Our approach results in a
horizontal and vertical flipping, random rotations, and ran- mean accuracy of 93.66% ± 0.49 with the best result of
dom cropping, following Buslaev et al. [3]. Moreover, we 94.43%, which positions our method alongside state-of-the-
implement an effective augmentation strategy by randomly art methods, confirming a robust performance also in single-
masking either the hands or object positions. This involves hand actions.
randomly setting the corresponding values of the hand or Table III presents a comparison of state-of-the-art action
object in frame fn to zero. The models are trained with recognition methods and their results in the H2O Dataset
keypoints predicted from the previous stage. For FPHA and FPHA Dataset reported in original studies. Following
Dataset the input vector is reduced by the object bounding other works we report our best results. To facilitate a
box and second hand to shape fn ∈ R43 due to this dataset fair comparison, the table provides information about the
constraints. For both datasets, input sequence frames are input types of action recognition modules, where Img stands
randomly sub-sampled during training and uniformly sub- for semantic features extracted from an image using CNN
sampled for validation and testing. The models are trained network, Hand P. and Obj P. stand for pose information
with a batch size bs = 64, AdamW optimiser, cross-entropy type for hands and objects, and Obj L. stands for object
loss function, and a learning rate lr = 0.001 reduced label. In the case of FPHA Dataset, the comparison is done
by a factor 0.5 after 900 epochs every 200 epochs for on the whole test subset where object information is not
TABLE IV: Results for 2D hand pose estimation in egocentric H2O dataset.

Model: Param. Inf.[ms] PCK0.2↑ EPE↓ AUC↑


SwinV2T [21] 31.4M 16.02 85.76% 18.87 0.851
MobileNetV3 [18] 7.3M 7.59 92.95% 13.21 0.884
ResNet50 [17] 30.3M 6.96 95.46% 11.66 0.895
ConvNext [22] 31.7M 7.17 95.67% 11.48 0.896
Eff-NetV2S [32] 25.2M 21.47 97.38% 9.80 0.907

Fig. 5: Inference time and accuracy per single action of state-of-the-art


methods on H2O Dataset. Our method predicts the fastest with the highest
accuracy.

available. Due to this, some studies implemented in H2O


Dataset are omitted as they report results only on a subset
with object information, making a comparison unfair. Our
approach uses the least amount of information compared
to other methods and still achieves the best result in H2O
Fig. 6: EPE results for different methods in edge scenarios for overlapping
Dataset and competitive accuracy in FPHA Dataset. and fully separated hands in H2O Dataset.
E. Inference Time per Action
We measure the inference time of our method by calculat-
architecture type. The results are shown in the Table IV. The
ing the prediction time of a single action in the H2O Dataset.
best performance is observed with EfficientNetV2S [32], but
The study is conducted on the H2O Dataset specifically, as
it is associated with the highest inference time. The faster
it presents a more complex scenario involving interactions
alternative are ConvNext Tiny [22] followed by ResNet50
with objects by both hands. This differs from the FPHA
[17] or MobileNetV3 Large [18]. The worst performance is
dataset, where the reported results in Table III do not incor-
observed for SwinV2T Tiny. [21], which overfits the test data.
porate object pose information. The evaluation is performed
All inference times are measured on the RTX 3090 GPU.
by averaging inference times over 1000 trials on NVIDIA
GeForce RTX 3090 GPU, ensuring robustness and reliability. 2) Hand pose in frames with overlapping hands: To un-
Our method is compared with HTT [38] and H2OTR [7] derstand how different models perform with self-occlusion,
methods based on 3D hand pose input, as they are the only we divide the test data in egocentric H2O Datasets into two
open-source implementations that allow such a comparison edge scenarios. The first subset consists of frames in which
on the H2O Dataset at the time of this study. the hands of individuals performing actions are separated
The results demonstrate the improvement of our method by a distance of 30% of the image width, and the second
over HTT [38] and H2OTR [7] in terms of inference speed includes hands that overlap by at least 10% of the image
and accuracy. We achieve an accuracy of 91.32% with an width. The results presented in Fig. 6 of the EPE for each
inference time of 95.994 ± 0.959 ms. In contrast, HTT [38] method in these scenarios show the superior performance of
needs 98.916 ± 6.461 to reach 86.36% and H2OTR [7] with our EffHandEgoNet. Not only its performance is the best, but
90.90% is about 14 times slower and performs the inference the gap between scenarios is marginal, proving the robustness
in 1355.396 ± 22 ms. The results are shown in Fig. 5, where of our method to self-occlusions.
each method is visualised as a circle whose size depicts the 3) Action Recognition and Hand Pose Models: The
number of its parameters. challenge of estimating hand poses from an egocentric per-
spective, where the movement involves the user performing
F. Ablation Study actions, is notable due to occlusions from both the hands and
Various ablations regarding input data types are performed the manipulated objects. This complexity leads to a decrease
to expand the evaluation of our approaches. We present the in performance when moving from a non-egocentric to an
study of the action recognition performance with distinct egocentric viewpoint, as evidenced by the different results
methods for hand pose estimation and the impact of each presented in Table I and Table II. We investigate the impact
input, i.e., left or right hand, on the final accuracy. All of hand pose estimation methods on action recognition
experiments are performed in H2O Dataset with fixed seeds accuracy in two ways.
to ensure reproducibility. In our first experiment, the action recognition model is
1) EffHandEgoNet Feature Extractor: To find the best- trained using GT poses following the same strategy outlined
performing future extractor, we experiment with different in Section IV-D.1, which results in a validation accuracy of
backbones. The selection is based on network size and 96.72% and 92.97% for the test. During the testing phase,
TABLE V: Results of our ablation study depending on inputs: left hand
pose, right hand pose and object pose.
Method: Left Hand Right Hand Obj Pose Acc. [%] ↑
GT ✓ ✓ ✓ 92.97
GTPose III ✓ ✓ ✗ 76.03
GTPose IV ✓ ✗ ✓ 73.14
GTPose V ✗ ✓ ✓ 79.33

The disparity in accuracy for hand type underlines that one of


the hands plays a more influential function in human actions,
but information regarding both poses is critical for successful
performance.

V. C ONCLUSION

Under this study, two novel 2D hand pose estimation mod-


els were developed to target the challenges of the egocentric
perspective. The top-down approach, EffHandNet, resulted
in the improvement of all three used metrics in the single-
hand FreiHAND dataset and state-of-the-art performance in
the egocentric H2O Dataset. The bottom-up EffHandEgoNet
model improved state-of-the-art performance in every metric
Fig. 7: Accuracy of action recognition depending on hand pose methods in on egocentric H2O and FPHA datasets, accurately estimating
H2O Dataset. On the top model trained with GT pose and tested with the the most challenging scenarios, including overlapping hands
estimated pose, on the bottom trained and tested with the estimated pose. and occlusions. Further, estimated hand poses alongside
object detection were used in a novel egocentric action
recognition model, where each frame is described by a vector
different hand pose estimation methods are used to evalu- containing the hand pose and the object bounding box. The
ate how their differing levels of pose estimation precision sequence of vectors is processed by a transformer-based
influence the final results. The results for action recognition neural network. Evaluation of our model on FPHA Dataset
show a correlation with the accuracy of hand pose estimation, showed competitive performance with 94.43% accuracy, con-
ranging from a low of 79.33% for the MediaPipe network firming accurate performance for actions performed with a
to a high of 90.49% for the EffHandEgoNet network. The single hand. Results obtained on H2O Dataset, where two
results of each method are shown in the upper panel of Fig.7. hands are involved in the action, resulted in 91.32% accuracy,
In the second one, the action recognition network is trained outperforming state-of-the-art and achieving a faster infer-
from scratch for each hand pose estimation method using the ence time.
predicted poses. While close performance is observed in the Additional experiments and ablations demonstrated the
validation subsets, the test set shows a similar correlation be- impact of hand pose estimation methods. Approaches origi-
tween hand pose methods as observed in our first experiment nally designed for a non-egocentric perspective, despite state-
(lower panel in Fig.7). of-the-art performance in single-hand datasets, showed a
Performance in action recognition is shown to be directly decrease in performance on egocentric pose estimation and
related to hand pose estimation precision, which highlights action recognition tasks compared to our egocentric approach
the importance of accurate hand pose input, particularly EffHandEgoNet. We showed that accurate pose description
in egocentric scenarios where actions involve interactions is essential for correct action understanding and that to
between both hands manipulating objects. achieve the best performance, it is necessary to use a method
4) Different Network Inputs for Action Recognition: capable of correctly modelling interactions between hands
Experiments with each network input highlight their respec- and manipulated objects like EffHandEgoNet. Our study
tive importance for action recognition accuracy, as shown demonstrates that in certain scenarios, i.e. reducing model
in Table V alongside a comparison to a complete network. complexity to reduce inference speed, 2D pose information
To streamline our analysis and avoid the influence of esti- is a promising alternative to estimated 3D pose for egocentric
mation algorithms for each network part, we perform this action recognition, as it competes strongly with non-pose and
experiment using GT information, involving the ablation of 3D hand pose-based methods.
various network inputs. GTPose III does not incorporate an
object bounding box, reducing the performance to 76.03%, VI. ACKNOWLEDGEMENTS
which emphasises the importance of including object pose
information. GTPose IV contains only left hand pose and This work was supported by VisuAAL ITN H2020 (grant
object information and results in 73.14%. GTPose V uses agreement No. 861091) and by Vienna Science and Technol-
the right hand with object information and results in 79.33%. ogy Fund (grant agreement No. ICT20-055).
R EFERENCES [21] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao,
Z. Zhang, L. Dong, et al. Swin Transformer V2: Scaling Up Capacity
[1] A. Bandini and J. Zariffa. Analysis of the Hands in Egocentric and Resolution. In Proceedings of the IEEE/CVF Conference on
Vision: a Survey. IEEE Transactions on Pattern Analysis and Machine Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
Intelligence, 2020. [22] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A
[2] G. Baulig, T. Gulde, and C. Curio. Adapting Egocentric Visual ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference
Hand Pose Estimation Towards a Robot-Controlled Exoskeleton. In on Computer Vision and Pattern Recognition, pages 11976–11986,
Proceedings of the European Conference on Computer Vision (ECCV) 2022.
Workshops, pages 0–0, 2018. [23] S. Mehraban, V. Adeli, and B. Taati. MotionAGFormer: Enhancing 3D
[3] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, Human Pose Estimation with a Transformer-GCNFormer Network. In
and A. A. Kalinin. Albumentations: Fast and Flexible Image Augmen- Proceedings of the IEEE/CVF Winter Conference on Applications of
tations. Information, 11(2):125, 2020. Computer Vision, pages 6920–6930, 2024.
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime Multi-Person [24] W. Mucha and M. Kampel. Addressing Privacy Concerns in Depth
2D Pose Estimation Using Part Affinity Fields. In Proceedings of the Sensors. In Computers Helping People with Special Needs: 18th
IEEE Conference on Computer Vision and Pattern Recognition, pages International Conference, ICCHP-AAATE 2022, Lecco, Italy, July 11–
7291–7299, 2017. 15, 2022, Proceedings, Part II, pages 526–533. Springer, 2022.
[5] J. Carreira and A. Zisserman. Quo Vadis, Action Recognition? a [25] W. Mucha and M. Kampel. Beyond privacy of depth sensors in active
New Model and the Kinetics Dataset. In Proceedings of the IEEE and assisted living devices. In Proceedings of the 15th International
Conference on Computer Vision and Pattern Recognition, pages 6299– Conference on PErvasive Technologies Related to Assistive Environ-
6308, 2017. ments, pages 425–429, 2022.
[6] A. Cartas, P. Radeva, and M. Dimiccoli. Contextually Driven First- [26] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and
Person Action Recognition from Videos. In Presentation at EPIC@ C. Theobalt. Real-time Hand Tracking Under Occlusion From an
ICCV2017 Workshop, page 8, 2017. Egocentric RGB-D Sensor. In Proceedings of the IEEE International
[7] H. Cho, C. Kim, J. Kim, S. Lee, E. Ismayilzada, and S. Baek. Conference on Computer Vision, pages 1154–1163, 2017.
Transformer-Based Unified Recognition of Two Hands Manipulating [27] X. S. Nguyen, L. Brun, O. Lézoray, and S. Bougleux. A Neural
Objects. In Proceedings of the IEEE/CVF Conference on Computer Network Based on SPD Manifold Learning for Skeleton-based Hand
Vision and Pattern Recognition, pages 4769–4778, 2023. Gesture Recognition. In Proceedings of the IEEE/CVF Conference on
[8] M. Contributors. OpenMMLab Pose Estimation Toolbox and Bench- Computer Vision and Pattern Recognition, pages 12036–12045, 2019.
mark. https://github.com/open-mmlab/mmpose, 2020. [28] A. Núñez-Marcos, G. Azkune, and I. Arganda-Carreras. Egocen-
[9] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, tric Vision-based Action Recognition: a Survey. Neurocomputing,
E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and 472:175–197, 2022.
M. Wray. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. [29] A. Sabater, I. Alonso, L. Montesano, and A. C. Murillo. Domain and
In European Conference on Computer Vision (ECCV), 2018. View-point Agnostic Hand Action Recognition. IEEE Robotics and
[10] P. Das and A. Ortega. Symmetric Sub-graph Spatio-temporal Graph Automation Letters, 6(4):7823–7830, 2021.
Convolution and its Application in Complex Activity Recognition. [30] N. Santavas, I. Kansizoglou, L. Bampis, E. Karakasis, and A. Gaster-
In ICASSP 2021-2021 IEEE International Conference on Acoustics, atos. Attention! a Lightweight 2D Hand Pose Estimation Approach.
Speech and Signal Processing (ICASSP), pages 3215–3219. IEEE, IEEE Sensors Journal, 21(10):11488–11496, 2020.
2021. [31] Z. Sun, Q. Ke, H. Rahmani, M. Bennamoun, G. Wang, and J. Liu.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, Human Action Recognition from Various Data Modalities: a Review.
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, IEEE, 2022.
J. Uszkoreit, and N. Houlsby. An Image is Worth 16x16 Words: Trans- [32] M. Tan and Q. Le. Efficientnetv2: Smaller Models and Faster Training.
formers for Image Recognition at Scale. In International Conference In International Conference on Machine Learning, pages 10096–
on Learning Representations, 2021. 10106. PMLR, 2021.
[12] D. Drover, R. MV, C.-H. Chen, A. Agrawal, A. Tyagi, and [33] B. Tekin, F. Bogo, and M. Pollefeys. H+ O: Unified Egocentric Recog-
C. Phuoc Huynh. Can 3D Pose be Learned from 2D Projections nition of 3D Hand-object Poses and Interactions. In Proceedings of the
Alone? In Proceedings of the European Conference on Computer IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Vision (ECCV) Workshops, pages 0–0, 2018. pages 4511–4520, 2019.
[13] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai. Revisiting Skeleton- [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
based Action Recognition. In Proceedings of the IEEE/CVF Confer- Gomez, Ł. Kaiser, and I. Polosukhin. Attention is All You Need.
ence on Computer Vision and Pattern Recognition, pages 2969–2978, Advances in Neural Information Processing Systems, 30, 2017.
2022. [35] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao. YOLOv7: Train-
[14] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast Networks for able Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object
Video Recognition. In Proceedings of the IEEE/CVF International Detectors. In Proceedings of the IEEE/CVF Conference on Computer
Conference on Computer Vision, pages 6202–6211, 2019. Vision and Pattern Recognition, pages 7464–7475, 2023.
[15] G. Garcia-Hernando, S. Yuan, S. Baek, and T.-K. Kim. First-Person [36] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local Neural
Hand Action Benchmark With RGB-D Videos and 3D Hand Pose Networks. In Proceedings of the IEEE Conference on Computer Vision
Annotations. In Proceedings of the IEEE Conference on Computer and Pattern Recognition, pages 7794–7803, 2018.
Vision and Pattern Recognition, pages 409–419, 2018. [37] Y. Wang, C. Peng, and Y. Liu. Mask-Pose Cascaded CNN for 2D Hand
[16] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, Pose Estimation From Single Color Image. IEEE Transactions on
J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4D: Around the Circuits and Systems for Video Technology, 29(11):3258–3268, 2018.
World in 3,000 Hours of Egocentric Video. In Proceedings of the [38] Y. Wen, H. Pan, L. Yang, J. Pan, T. Komura, and W. Wang. Hierarchi-
IEEE/CVF Conference on Computer Vision and Pattern Recognition, cal Temporal Transformer for 3D Hand Pose Estimation and Action
pages 18995–19012, 2022. Recognition from Egocentric RGB Videos. In Proceedings of the
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning IEEE/CVF Conference on Computer Vision and Pattern Recognition,
for Image Recognition. In Proceedings of the IEEE Conference on pages 21243–21253, 2023.
Computer Vision and Pattern Recognition, pages 770–778, 2016. [39] B. Xiao, H. Wu, and Y. Wei. Simple Baselines for Human Pose
[18] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, Estimation and Tracking. In Proceedings of the European Conference
W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. Searching for Mo- on Computer Vision (ECCV), pages 466–481, 2018.
bileNetV3. In Proceedings of the IEEE/CVF International Conference [40] W. Yamazaki, M. Ding, J. Takamatsu, and T. Ogasawara. Hand Pose
on Computer Vision, pages 1314–1324, 2019. Estimation and Motion Recognition Using Egocentric RGB-D Video.
[19] T. Kwon, B. Tekin, J. Stühmer, F. Bogo, and M. Pollefeys. H2O: In 2017 IEEE International Conference on Robotics and Biomimetics
Two Hands Manipulating Objects for First Person Interaction Recog- (ROBIO), pages 147–152. IEEE, 2017.
nition. In Proceedings of the IEEE/CVF International Conference on [41] S. Yan, Y. Xiong, and D. Lin. Spatial Temporal Graph Convolutional
Computer Vision (ICCV), pages 10138–10148, October 2021. Networks for Skeleton-based Action Recognition. In Proceedings of
[20] H. Liang, J. Yuan, and D. Thalman. Egocentric Hand Pose Estimation the AAAI Conference on Artificial Intelligence, volume 32, 2018.
and Distance Recovery in a Single RGB Image. In 2015 IEEE [42] S. Yang, J. Liu, S. Lu, M. H. Er, and A. C. Kot. Collaborative Learning
International Conference on Multimedia and Expo (ICME), pages 1–6. of Gesture Recognition and 3D Hand Pose Estimation with Multi-
IEEE, 2015. Order Feature Analysis. In Computer Vision–ECCV 2020: 16th Eu-
ropean Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part III 16, pages 769–786. Springer, 2020.
[43] K. Zhan, S. Faux, and F. Ramos. Multi-scale Conditional Random
Fields for First-person Activity Recognition. In 2014 IEEE Inter-
national Conference on Pervasive Computing and Communications
(PerCom), pages 51–59. IEEE, 2014.
[44] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L.
Chang, and M. Grundmann. Mediapipe Hands: On-device Real-time
Hand Tracking. arXiv preprint arXiv:2006.10214, 2020.
[45] C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and
T. Brox. FreiHAND: a Dataset For Markerless Capture of Hand Pose
and Shape from Single RGB Images. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 813–822, 2019.

You might also like