Technical report of HCB team for Multiview Egocentric Hand Tracking Challenge on HANDS 2024 Challenge

Haohong Kuang1, Yang Xiao1†, Changlong Jiang1, Jinghong Zheng1, Hang Xu1, Ran Wang2, Zhiguo Cao1, Min Du3, Zhiwen Fang4,5, Joey Tianyi Zhou6,7

1 Key Laboratory of Image Processing and Intelligent Control, Ministry of Education, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
2 School of Journalism and Information Communication, Huazhong University of Science and Technology, Wuhan 430074, China
3 PICO, ByteDance, Beijing 100089, China
4 School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
5 Department of Rehabilitation Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
6 Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), Singapore
7 Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore
haohong kuang, Yang Xiao, changlongj, deepzheng, hang xu, rex wang, [email protected], [email protected], [email protected], [email protected]
† Yang Xiao is the corresponding author (Yang [email protected]). This technical report has been patented and cannot be used for commercial purposes without our permission.

Abstract

In this report, we introduce the method proposed for 3D hand pose tracking in the HANDS@ECCV2024 challenge based on UmeTrack: the multiview egocentric hand tracking challenge, which aims to track hand poses in calibrated stereo videos utilizing pre-calibrated hand shapes. We provide a novel method for estimating accurate 3D hand poses from stereo images. Given that stereo images can provide 3D spatial information, the main idea of our method is to leverage this stereo information to guide the estimation of the MANO pose and transformation. Specifically, an effective cross-view feature fusion mechanism is utilized to accurately estimate the 2D poses for both views, which are then lifted to 3D space and used to calculate the MANO transformation. Besides, a MANO pose optimization method is proposed to alleviate the performance gap between MANO positions and joint coordinates. Finally, our method achieves a FINGERTIP PCK AUC of 70.81% on the UmeTrack dataset, securing first place in the challenge.

Figure 1. The features from the dual-view images are fused through cross-view feature fusion. The MANO Decoder is used to regress the hand pose parameters and hand side, while the Pose Decoder is used to predict the 2D joint coordinates for both views.

1. Introduction

With the development of the field of computer vision and the rise of the XR/VR domains, 3D hand pose and shape estimation has become increasingly important in aiding the understanding of human interaction with the surrounding environment.

Previous methods [8, 11] primarily focus on hand pose estimation from a single image, treating multi-view images as individual inputs. For 3D hand pose recovery, these approaches often neglect the inter-view relationships. In contrast, Remelli et al. [9] propose a method that uses camera parameters to transform image features into a unified latent representation space. Our proposed cross-view feature fusion, however, allows the model to learn the inter-view relationships on its own, and the final results demonstrate the effectiveness of our approach. Some methods [2, 5, 12] estimate the MANO [10] parameters and the parameters of a weak-perspective camera, but this approach loses depth information and cannot accurately localize the hand in the world coordinate system.
In this work, we propose a novel method for 3D hand pose tracking from stereo videos. First, based on the A2J-Transformer [4] network, an effective cross-view feature fusion mechanism is utilized to accurately provide the MANO pose parameters and 2D hand poses from the dual views. Then, by lifting the 2D poses to 3D space, the transformation matrix between MANO coordinates and world coordinates can be calculated. Finally, due to the ground-truth performance gap between MANO positions and joint coordinates in the UmeTrack dataset, we introduce a MANO pose optimization method to improve performance. Besides, a temporal smoothing process is used to incorporate temporal information. Ultimately, our approach proves to be effective, taking first place in the pose tracking challenge on the UmeTrack dataset.

2. Method

A2J-Transformer is a powerful 3D hand pose estimation method that takes a single RGB image as input and outputs a root-relative 3D pose. Based on A2J-Transformer, we explore the potential of extending it to handle dual-view images. Our approach takes synchronized dual-view images as input and performs inter-view feature fusion. The enhanced model then outputs the MANO pose parameters and the 2D joint positions for both views. Through triangulation, we can reconstruct the 3D hand pose in the world coordinate system and subsequently derive the wrist transformation. This process is explained in detail below.

Cross-view feature fusion. We use the Swin Transformer [6, 7] as our visual feature extractor to capture multi-scale image features, with weights shared between the two viewpoints. Then, we employ the Deformable Transformer [13] to fuse the multi-scale image features from the two viewpoints and improve the feature representation. This enables information exchange between the two perspectives and effectively improves the accuracy of hand joint localization.
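For concreteness, the sketch below illustrates one way such a weight-shared, two-view fusion block could be wired up. It is only a minimal illustration of the idea under stated assumptions: the backbone interface, the use of standard multi-head self-attention in place of deformable attention, and all dimensions are placeholders rather than the exact architecture used in our model.

```python
# Minimal sketch of cross-view feature fusion (assumed interfaces, not the exact model).
# A weight-shared backbone extracts per-view tokens; attention over the concatenated
# tokens of both views lets each view attend to the other. The actual method uses
# deformable attention over multi-scale features; plain self-attention is used here
# only for brevity.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int = 256, heads: int = 8):
        super().__init__()
        self.backbone = backbone              # shared between the two views
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_v1: torch.Tensor, img_v2: torch.Tensor):
        # Same weights applied to both views -> (B, N, C) token sequences per view.
        tok1, tok2 = self.backbone(img_v1), self.backbone(img_v2)
        tokens = torch.cat([tok1, tok2], dim=1)        # joint token sequence of both views
        fused, _ = self.attn(tokens, tokens, tokens)   # inter-view information exchange
        fused = self.norm(tokens + fused)
        n = tok1.shape[1]
        return fused[:, :n], fused[:, n:]              # fused features for view 1 and view 2
```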
Back to world system. Taking the 2D point $c_1 = [u_1, v_1, 1]$ in one view as an example, its corresponding point in the other view is $c_2 = [u_2, v_2, 1]$. We can obtain the 3D point $C = [X, Y, Z, 1]$ by solving the following over-determined system of equations:

\[
A = \begin{bmatrix}
v_1 p^1_{31} - p^1_{21} & v_1 p^1_{32} - p^1_{22} & v_1 p^1_{33} - p^1_{23} & v_1 p^1_{34} - p^1_{24} \\
u_1 p^1_{31} - p^1_{11} & u_1 p^1_{32} - p^1_{12} & u_1 p^1_{33} - p^1_{13} & u_1 p^1_{34} - p^1_{14} \\
v_2 p^2_{31} - p^2_{21} & v_2 p^2_{32} - p^2_{22} & v_2 p^2_{33} - p^2_{23} & v_2 p^2_{34} - p^2_{24} \\
u_2 p^2_{31} - p^2_{11} & u_2 p^2_{32} - p^2_{12} & u_2 p^2_{33} - p^2_{13} & u_2 p^2_{34} - p^2_{14}
\end{bmatrix} \tag{1}
\]

\[
AC = 0 \tag{2}
\]

where $P^i$ represents the projection matrix of the camera corresponding to point $c_i$:

\[
P^i = \begin{bmatrix}
p^i_{11} & p^i_{12} & p^i_{13} & p^i_{14} \\
p^i_{21} & p^i_{22} & p^i_{23} & p^i_{24} \\
p^i_{31} & p^i_{32} & p^i_{33} & p^i_{34}
\end{bmatrix} \tag{3}
\]

Because of prediction noise, we use SVD to solve the system $AC = 0$ for the 3D point $C$ with the least error.
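As a concrete reference, the following is a minimal NumPy sketch of this DLT-style triangulation: it builds the matrix $A$ from the two projection matrices and 2D detections, and takes the right singular vector with the smallest singular value as the least-squares solution of $AC = 0$. Function and variable names are illustrative, not taken from our codebase.

```python
# Minimal sketch of SVD-based triangulation for one joint (illustrative names).
import numpy as np


def triangulate_point(P1: np.ndarray, P2: np.ndarray,
                      uv1: np.ndarray, uv2: np.ndarray) -> np.ndarray:
    """P1, P2: 3x4 projection matrices; uv1, uv2: 2D detections (u, v) in each view."""
    u1, v1 = uv1
    u2, v2 = uv2
    # Rows of the over-determined system A C = 0 (cf. Eq. (1)).
    A = np.stack([
        v1 * P1[2] - P1[1],
        u1 * P1[2] - P1[0],
        v2 * P2[2] - P2[1],
        u2 * P2[2] - P2[0],
    ])
    # The right singular vector with the smallest singular value minimizes ||A C||.
    _, _, Vt = np.linalg.svd(A)
    C = Vt[-1]
    return C[:3] / C[3]          # dehomogenize to (X, Y, Z)


# Triangulating all hand joints is just a loop over the per-view 2D predictions, e.g.:
# joints_3d = np.stack([triangulate_point(P1, P2, j1, j2)
#                       for j1, j2 in zip(pose2d_view1, pose2d_view2)])
```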
After obtaining the 3D joints through triangulation, we use the Kabsch algorithm to solve for the wrist transformation by aligning the MANO joints with the triangulated 3D joints, where the MANO joints are computed from the pose parameters estimated by the model with the wrist transformation set to zero.
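A compact sketch of this rigid alignment step is given below, again with illustrative names and the assumption that both joint sets are ordered identically; it returns the rotation and translation that map the zero-transformation MANO joints onto the triangulated joints.

```python
# Minimal sketch of Kabsch alignment between MANO joints (wrist transform = identity)
# and the triangulated 3D joints (illustrative, assumes matching joint order).
import numpy as np


def kabsch(mano_joints: np.ndarray, tri_joints: np.ndarray):
    """Both inputs are (J, 3); returns rotation R (3x3) and translation t (3,)."""
    mu_m, mu_t = mano_joints.mean(0), tri_joints.mean(0)
    H = (mano_joints - mu_m).T @ (tri_joints - mu_t)   # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_m
    return R, t                                        # wrist transformation estimate
```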
Decoder. The overall design of the Decoder follows that of A2J-Transformer. For the MANO Decoder, since no anchor prediction offsets are needed, we modify it to predict the MANO pose parameters and the hand type. For the Pose Decoder, since 3D anchors are not needed, we set the anchor points on a 2D plane.

Augmentation. In the training phase, we use image flipping as a simple data augmentation method, in which the handedness of the hand is also changed. At test time, we use test-time augmentation (TTA), which takes both the original image and the flipped image as inputs, thereby obtaining results with higher accuracy.
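The sketch below shows one plausible form of this flip-based TTA for the 2D joint branch: the flipped image is passed through the network, its predictions are mirrored back, and the two sets of predictions are averaged. The model interface and the simple averaging rule are assumptions for illustration, not our exact implementation.

```python
# Illustrative flip-based TTA for 2D joint prediction (assumed model interface).
import torch


def predict_with_flip_tta(model, image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W); returns averaged 2D joints (B, J, 2) in pixel coordinates."""
    W = image.shape[-1]
    joints = model(image)                              # prediction on the original image
    joints_f = model(torch.flip(image, dims=[-1]))     # prediction on the flipped image
    joints_f[..., 0] = (W - 1) - joints_f[..., 0]      # mirror the x coordinates back
    # The flipped branch also swaps handedness, mirroring the training-time augmentation;
    # here the two joint predictions are simply averaged.
    return 0.5 * (joints + joints_f)
```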
Postprocess. Since our network does not leverage temporal information from the videos when predicting 2D joint coordinates, the triangulated 3D joint points exhibit temporal instability. To address this, we apply a simple yet effective temporal smoothing method. Specifically, a Gaussian smoothing window is used to perform a weighted average of the 3D keypoint coordinates across neighboring frames within the window, producing the final smoothed result. The smoothed 3D joint coordinates then serve as the optimization target: the MANO theta parameters and wrist transformation output by the network are used as initial values, and we optimize them to obtain the final MANO theta parameters and wrist transformation.
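A minimal sketch of this post-processing step follows, assuming a differentiable MANO layer that maps theta and a wrist transform to 3D joints; the `mano_joints` callable, the smoothing sigma, and the optimizer settings are all assumptions for illustration.

```python
# Illustrative post-processing: Gaussian temporal smoothing of triangulated joints,
# then refinement of the MANO theta / wrist transform against the smoothed joints.
import numpy as np
import torch
from scipy.ndimage import gaussian_filter1d


def smooth_joints(joints_3d: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """joints_3d: (T, J, 3) triangulated joints; Gaussian smoothing along the time axis."""
    return gaussian_filter1d(joints_3d, sigma=sigma, axis=0)


def refine_mano(theta0, wrist0, target_joints, mano_joints, steps: int = 100):
    """theta0 / wrist0: network outputs used as initial values.
    mano_joints(theta, wrist) -> (J, 3) is an assumed differentiable MANO forward pass."""
    theta = theta0.clone().requires_grad_(True)
    wrist = wrist0.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta, wrist], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(mano_joints(theta, wrist), target_joints)
        loss.backward()
        opt.step()
    return theta.detach(), wrist.detach()
```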
3. Experiment

For model training, we use only the UmeTrack [3] and HOT3D [1] datasets, with supervision provided by the MANO ground truth. The training is conducted on an NVIDIA RTX 3090 GPU, with each minibatch containing 16 image pairs. We employ the Adam optimizer, starting with an initial learning rate of 1e-4, which decays by a factor of 10 every 10 epochs. The model is trained for a total of 20 epochs.
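For reference, this schedule corresponds roughly to the following optimizer setup; it is a sketch only, and the model and data loader are placeholders rather than part of the report.

```python
# Sketch of the reported schedule: Adam, lr 1e-4, x0.1 every 10 epochs, 20 epochs total.
import torch


def build_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    return optimizer, scheduler


# Typical usage inside a training loop (placeholder loop):
# for epoch in range(20):
#     for batch in loader:        # 16 image pairs per minibatch
#         ...                     # forward, loss, backward, optimizer.step()
#     scheduler.step()
```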
Since the model itself lacks the ability to handle image sequences, we apply a temporal smoothing filter to ensure the temporal consistency of our results. Additionally, we observe that the 3D joint points reconstructed via triangulation exhibit a lower Mean Per-Joint Position Error (MPJPE) on the validation set than those derived from the model's predicted MANO parameters. This suggests that the triangulated 3D joint points can be used to further refine the model's predicted MANO parameters, which serve as the initial values in the optimization process.

In Table 1, we present our results on the test set with TTA as well as temporal smoothing and post-optimization (S&O). The table shows that our model has a solid baseline performance, with an improvement of 0.26 when TTA is applied. Incorporating temporal smoothing and post-optimization further enhances the performance by another 0.5.

ID   Base   TTA   S&O   FINGERTIP PCK AUC (↑)
1    √                  70.05
2    √      √           70.31
3    √      √     √     70.81

Table 1. Quantitative results on the UmeTrack test set.

We conduct our experiments on HANDS@ECCV2024 challenge task 4: the multiview egocentric hand tracking challenge. The results are shown in Table 2. It can be observed that our method achieves first place on all three metrics in the final evaluation.

User         MPJPE (↓)   PCK AUC (↑)   FINGERTIP PCK AUC (↑)
ppjj         18.70       65.54         56.97
JVHANDS      14.21       72.23         67.63
HCB (ours)   12.87       75.66         70.81

Table 2. Performance comparison on the HANDS@ECCV2024 challenge task 4 UmeTrack dataset.
4. Conclusion

In this report, we introduce a highly accurate hand pose estimation method designed for dual-view inputs. Our network model can effectively predict dual-view 2D keypoints and MANO pose parameters with high consistency, without utilizing camera parameters or video sequence information, which lays a solid foundation for our subsequent triangulation reconstruction process. We test our method on the UmeTrack dataset, where it demonstrates promising results.

References

[1] Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Fan Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, et al. Introducing HOT3D: An egocentric dataset for 3D hand and object tracking. arXiv preprint arXiv:2406.09598, 2024.
[2] Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3D hand shape and pose from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10843–10852, 2019.
[3] Shangchen Han, Po-chen Wu, Yubo Zhang, Beibei Liu, Linguang Zhang, Zheng Wang, Weiguang Si, Peizhao Zhang, Yujun Cai, Tomas Hodan, et al. UmeTrack: Unified multi-view end-to-end hand tracking for VR. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
[4] Changlong Jiang, Yang Xiao, Cunlin Wu, Mingyang Zhang, Jinghong Zheng, Zhiguo Cao, and Joey Tianyi Zhou. A2J-Transformer: Anchor-to-joint transformer network for 3D interacting hand pose estimation from a single RGB image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8846–8855, 2023.
[5] Dominik Kulon, Riza Alp Guler, Iasonas Kokkinos, Michael M Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4990–5000, 2020.
[6] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[7] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
[8] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation, 2016.
[9] Edoardo Remelli, Shangchen Han, Sina Honari, Pascal Fua, and Robert Wang. Lightweight multi-view 3D pose estimation through camera-disentangled representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6040–6049, 2020.
[10] Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022.
[11] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019.
[12] John Yang, Hyung Jin Chang, Seungeui Lee, and Nojun Kwak. SeqHAND: RGB-sequence-based 3D hand pose and shape estimation. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII, pages 122–139. Springer, 2020.
[13] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.