Case Study 3
OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, No. 1, January 2021
Abstract—Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in
images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed
method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with
individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in
the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We
demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both
runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated
foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to
running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of
OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
Index Terms—2D human pose estimation, 2D foot keypoint estimation, real-time, multiple person, part affinity fields
1 INTRODUCTION
Fig. 2. Overall pipeline. (a) Our method takes the entire image as the input for a CNN to jointly predict (b) confidence maps for body part detection and
(c) PAFs for part association. (d) The parsing step performs a set of bipartite matchings to associate body part candidates. (e) We finally assemble
them into full body poses for all people in the image.
In this work, we make several extensions to our earlier work [3]. We prove that PAF refinement is critical and sufficient for high accuracy, removing the body part confidence map refinement while increasing the network depth. This leads to a faster and more accurate model. We also present the first combined body and foot keypoint detector, created from an annotated foot dataset that will be publicly released. We prove that combining both detection approaches not only reduces the inference time compared to running them independently, but also maintains their individual accuracy. Finally, we present OpenPose, the first open-source library for realtime body, foot, hand, and facial keypoint detection.

3 METHOD

Fig. 2 illustrates the overall pipeline of our method. The system takes, as input, a color image of size $w \times h$ (Fig. 2a) and produces the 2D locations of anatomical keypoints for each person in the image (Fig. 2e). First, a feedforward network predicts a set of 2D confidence maps $S$ of body part locations (Fig. 2b) and a set of 2D vector fields $L$ of part affinity fields (PAFs), which encode the degree of association between parts (Fig. 2c). The set $S = (S_1, S_2, \ldots, S_J)$ has $J$ confidence maps, one per part, where $S_j \in \mathbb{R}^{w \times h}$, $j \in \{1 \ldots J\}$. The set $L = (L_1, L_2, \ldots, L_C)$ has $C$ vector fields, one per limb, where $L_c \in \mathbb{R}^{w \times h \times 2}$, $c \in \{1 \ldots C\}$. We refer to part pairs as limbs for clarity, but some pairs are not human limbs (e.g., the face). Each image location in $L_c$ encodes a 2D vector (Fig. 1). Finally, the confidence maps and the PAFs are parsed by greedy inference (Fig. 2d) to output the 2D keypoints for all people in the image.

3.1 Network Architecture

Our architecture, shown in Fig. 3, iteratively predicts affinity fields that encode part-to-part association, shown in blue, and detection confidence maps, shown in beige. The iterative prediction architecture, following [20], refines the predictions over successive stages, $t \in \{1, \ldots, T\}$, with intermediate supervision at each stage.

The network depth is increased with respect to [3]. In the original approach, the network architecture included several 7x7 convolutional layers. In our current model, the receptive field is preserved while the computation is reduced, by replacing each 7x7 convolutional kernel by 3 consecutive 3x3 kernels. While the number of operations for the former is $2 \times 7^2 - 1 = 97$, it is only 51 for the latter. Additionally, the output of each of the 3 convolutional kernels is concatenated, following an approach similar to DenseNet [52]. The number of non-linearity layers is tripled, and the network can keep both lower-level and higher-level features. Sections 5.2 and 5.3 analyze the accuracy and runtime speed improvements, respectively.

Fig. 3. Architecture of the multi-stage CNN. The first set of stages predicts PAFs $L^t$, while the last set predicts confidence maps $S^t$. The predictions of each stage and their corresponding image features are concatenated for each subsequent stage. Convolutions of kernel size 7 from the original approach [3] are replaced with 3 layers of convolutions of kernel size 3 which are concatenated at their end.

3.2 Simultaneous Detection and Association

The image is analyzed by a CNN (initialized by the first 10 layers of VGG-19 [53] and fine-tuned), generating a set of feature maps $F$ that is input to the first stage. At this stage, the network produces a set of part affinity fields (PAFs) $L^1 = \phi^1(F)$, where $\phi^1$ refers to the CNNs for inference at Stage 1. In each subsequent stage, the predictions from the previous stage and the original image features $F$ are concatenated and used to produce refined predictions,

$L^t = \phi^t(F, L^{t-1}), \quad \forall\, 2 \le t \le T_P$,   (1)

where $\phi^t$ refers to the CNNs for inference at Stage $t$, and $T_P$ to the total number of PAF stages. After $T_P$ iterations, the process is repeated for the confidence map detection, starting from the most refined PAF prediction,

$S^{T_P} = \rho^t(F, L^{T_P}), \quad \forall\, t = T_P$,   (2)

$S^t = \rho^t(F, L^{T_P}, S^{t-1}), \quad \forall\, T_P < t \le T_P + T_C$,   (3)

where $\rho^t$ refers to the CNNs for inference at Stage $t$, and $T_C$ to the total number of confidence map stages.

This approach differs from [3], where both the PAF and confidence map branches were refined at each stage. Hence, the amount of computation per stage is reduced by half.
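To make the staged prediction of Eqs. (1)-(3) concrete, below is a minimal PyTorch-style sketch of the inference pass. It is not the released (Caffe-based) OpenPose implementation; the stage builder `make_stage`, the channel counts, and the default stage numbers are illustrative placeholders.

```python
import torch
import torch.nn as nn


class MultiStagePose(nn.Module):
    """Illustrative sketch of the staged refinement in Eqs. (1)-(3).

    `backbone` plays the role of the fine-tuned VGG-19 feature extractor F,
    and `make_stage(in_ch, out_ch)` must return an nn.Module mapping that many
    input channels to that many output channels. Channel counts and default
    stage numbers are placeholders, not the released configuration.
    """

    def __init__(self, backbone, make_stage, paf_ch, cm_ch,
                 feat_ch=128, T_P=4, T_C=2):
        super().__init__()
        self.backbone = backbone
        # Stage 1 sees only the image features F; later PAF stages also see L^{t-1}.
        self.paf_stages = nn.ModuleList(
            [make_stage(feat_ch, paf_ch)] +
            [make_stage(feat_ch + paf_ch, paf_ch) for _ in range(T_P - 1)])
        # Confidence-map stages see F and the final PAFs (and S^{t-1} after the first).
        self.cm_stages = nn.ModuleList(
            [make_stage(feat_ch + paf_ch, cm_ch)] +
            [make_stage(feat_ch + paf_ch + cm_ch, cm_ch) for _ in range(T_C - 1)])

    def forward(self, image):
        F = self.backbone(image)                          # feature maps F
        L = self.paf_stages[0](F)                         # L^1 = phi^1(F)
        for stage in self.paf_stages[1:]:                 # Eq. (1)
            L = stage(torch.cat([F, L], dim=1))
        S = self.cm_stages[0](torch.cat([F, L], dim=1))   # Eq. (2)
        for stage in self.cm_stages[1:]:                  # Eq. (3)
            S = stage(torch.cat([F, L, S], dim=1))
        return L, S
```

The intermediate supervision applied at each stage (Section 3.1) is omitted here for brevity.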
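The kernel substitution described in Section 3.1 can be sketched as a small block that stacks three 3x3 convolutions (preserving the 7x7 receptive field) and concatenates their outputs, DenseNet-style [52]. This is an illustrative PyTorch sketch with assumed channel widths, not the exact layer configuration of the released model.

```python
import torch
import torch.nn as nn


class TripleConv3x3(nn.Module):
    """Three consecutive 3x3 convolutions replacing one 7x7 convolution.

    The 7x7 receptive field is preserved, the outputs of the three kernels are
    concatenated (similar to DenseNet [52]), and the number of non-linearities
    is tripled. Channel widths are illustrative, not the released values.
    """

    def __init__(self, in_ch, mid_ch=96):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
        # PReLU rather than ReLU, one of the accuracy tweaks noted in Section 5.2.
        self.act1, self.act2, self.act3 = nn.PReLU(), nn.PReLU(), nn.PReLU()

    def forward(self, x):
        y1 = self.act1(self.conv1(x))
        y2 = self.act2(self.conv2(y1))
        y3 = self.act3(self.conv3(y2))
        # Keep both lower- and higher-level features via concatenation.
        return torch.cat([y1, y2, y3], dim=1)
```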
where the limb width $\sigma_l$ is a distance in pixels, the limb length is $l_{c,k} = \lVert x_{j_2,k} - x_{j_1,k} \rVert_2$, and $v_\perp$ is a vector perpendicular to $v$.

The groundtruth part affinity field averages the affinity fields of all people in the image,

$L_c(p) = \frac{1}{n_c(p)} \sum_k L_{c,k}(p)$,   (10)

where $n_c(p)$ is the number of non-zero vectors at point $p$ across all $k$ people.

During testing, we measure association between candidate part detections by computing the line integral over the corresponding PAF along the line segment connecting the candidate part locations. In other words, we measure the alignment of the predicted PAF with the candidate limb that would be formed by connecting the detected body parts. Specifically, for two candidate part locations $d_{j_1}$ and $d_{j_2}$, we sample the predicted part affinity field $L_c$ along the line segment to measure the confidence in their association:

$E = \int_{u=0}^{u=1} L_c(p(u)) \cdot \frac{d_{j_2} - d_{j_1}}{\lVert d_{j_2} - d_{j_1} \rVert_2} \, du$,   (11)

where $p(u)$ interpolates the position of the two body parts $d_{j_1}$ and $d_{j_2}$,

$p(u) = (1-u)\, d_{j_1} + u\, d_{j_2}$.   (12)

For each body part $j$, we obtain a set of detection candidates $D_j = \{d_j^m : m \in \{1 \ldots N_j\}\}$ from the confidence maps, where $N_j$ is the number of candidates of part $j$ and $d_j^m \in \mathbb{R}^2$ is the location of the $m$-th detection candidate of body part $j$. These part detection candidates still need to be associated with other parts from the same person—in other words, we need to find the pairs of part detections that are in fact connected limbs. We define a variable $z_{j_1 j_2}^{mn} \in \{0, 1\}$ to indicate whether two detection candidates $d_{j_1}^m$ and $d_{j_2}^n$ are connected, and the goal is to find the optimal assignment for the set of all possible connections, $Z = \{z_{j_1 j_2}^{mn} : \text{for } j_1, j_2 \in \{1 \ldots J\},\, m \in \{1 \ldots N_{j_1}\},\, n \in \{1 \ldots N_{j_2}\}\}$.

If we consider a single pair of parts $j_1$ and $j_2$ (e.g., neck and right hip) for the $c$-th limb, finding the optimal association reduces to a maximum weight bipartite graph matching problem [54]. This case is shown in Fig. 5b. In this graph matching problem, nodes of the graph are the body part detection candidates $D_{j_1}$ and $D_{j_2}$, and the edges are all possible connections between pairs of detection candidates. Additionally, each edge is weighted by Eq. (11)—the part affinity aggregate. A matching in a bipartite graph is a subset of the edges chosen in such a way that no two edges share a node. Our goal is to find a matching with maximum weight for the chosen edges,

$\max_{Z_c} E_c = \max_{Z_c} \sum_{m \in D_{j_1}} \sum_{n \in D_{j_2}} E_{mn} \cdot z_{j_1 j_2}^{mn}$,   (13)

subject to

$\forall m \in D_{j_1}, \; \sum_{n \in D_{j_2}} z_{j_1 j_2}^{mn} \le 1$,   (14)
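In practice, the integral in Eq. (11) is approximated by sampling and summing over uniformly spaced values of $u$. A minimal sketch, assuming the PAF for one limb type is stored as an H x W x 2 array and using nearest-neighbor sampling for brevity:

```python
import numpy as np


def paf_score(paf_c, d1, d2, num_samples=10):
    """Approximate E in Eq. (11): average dot product between the PAF and the
    unit vector of the candidate limb, sampled along the segment d1 -> d2.

    `paf_c` is an (H, W, 2) array for one limb type; d1, d2 are (x, y) points.
    Nearest-neighbor sampling is used for brevity (bilinear would be smoother).
    """
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    limb = d2 - d1
    norm = np.linalg.norm(limb)
    if norm < 1e-6:
        return 0.0
    v = limb / norm                                   # (d2 - d1) / ||d2 - d1||
    scores = []
    for u in np.linspace(0.0, 1.0, num_samples):      # p(u) = (1 - u) d1 + u d2, Eq. (12)
        p = (1.0 - u) * d1 + u * d2
        x, y = int(round(p[0])), int(round(p[1]))
        scores.append(paf_c[y, x] @ v)                # L_c(p(u)) . v
    return float(np.mean(scores))
```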
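Per limb type, Eqs. (13) and (14) define a small maximum-weight bipartite matching; the full multi-person problem is relaxed with the greedy parsing discussed later in the paper. A minimal sketch of such a greedy per-limb matcher, building on the `paf_score` helper above (the candidate format and the score threshold are illustrative assumptions):

```python
def match_limb(cands_j1, cands_j2, paf_c, score_thresh=0.05):
    """Greedy relaxation of the bipartite matching in Eqs. (13)-(14) for one
    limb type: sort all candidate connections by their PAF score and accept
    each one only if neither endpoint is already used, so no two limbs of the
    same type share a part. `cands_j1`/`cands_j2` are lists of (x, y) candidate
    locations; the score threshold is an illustrative value.
    """
    scored = []
    for m, d1 in enumerate(cands_j1):
        for n, d2 in enumerate(cands_j2):
            e = paf_score(paf_c, d1, d2)
            if e > score_thresh:
                scored.append((e, m, n))
    connections, used_m, used_n = [], set(), set()
    for e, m, n in sorted(scored, reverse=True):      # highest affinity first
        if m not in used_m and n not in used_n:
            connections.append((m, n, e))
            used_m.add(m)
            used_n.add(n)
    return connections
```

As reported in Section 5.1, parsing based on PAF scores performs close to parsing with ground-truth connections (79.4 versus 81.6 percent mAP), which is why this greedy relaxation suffices in place of an optimal matching solver.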
Fig. 9. Foot keypoint analysis. (a) Foot keypoint annotations, consisting of big toes, small toes, and heels. (b) Body-only model example in which the right ankle is not properly estimated. (c) Analogous body+foot model example, where the foot information helps predict the right ankle location.

Fig. 10. Keypoint annotation configuration for the 3 datasets.
keypoint detector. The library also includes 3D keypoint pose detection, by performing 3D triangulation with nonlinear Levenberg-Marquardt refinement [64] over the results of multiple synchronized camera views.

The inference time of OpenPose outperforms all state-of-the-art methods, while preserving high-quality results. It is able to run at about 22 FPS on a machine with an Nvidia GTX 1080 Ti while preserving high accuracy (Section 5.3). OpenPose has already been used by the research community for many vision and robotics topics, such as person re-identification [56], GAN-based video retargeting of human faces [57] and bodies [58], Human-Computer Interaction [59], 3D pose estimation [60], and 3D human mesh model generation [61]. In addition, the OpenCV library [65] has included OpenPose and our PAF-based network architecture within its Deep Neural Network (DNN) module.

4.2 Extended Foot Keypoint Detection

Existing human pose datasets ([66], [67]) contain limited body part types. The MPII dataset [66] annotates ankles, knees, hips, shoulders, elbows, wrists, necks, torsos, and head tops, while COCO [67] also includes some facial keypoints. For both of these datasets, foot annotations are limited to ankle position only. However, graphics applications such as avatar retargeting or 3D human shape reconstruction ([61], [68]) require foot keypoints such as the big toe and heel. Without foot information, these approaches suffer from problems such as the candy-wrapper effect, floor penetration, and foot skate. To address these issues, a small subset of foot instances out of the COCO dataset is labeled using the Clickworker platform [69]. It is split into 14K annotations from the COCO training set and 545 from the validation set. A total of 6 foot keypoints are labeled (see Fig. 9a). We consider the 3D coordinate of the foot keypoints rather than the surface position. For instance, for the exact toe positions, we label the area between the connection of the nail and skin, and also take depth into consideration by labeling the center of the toe rather than the surface.

Using our dataset, we train a foot keypoint detection algorithm. A naive foot keypoint detector could have been built by using a body keypoint detector to generate foot bounding box proposals, and then training a foot detector on top of it. However, this method suffers from the top-down problems stated in Section 1. Instead, the same architecture previously described for body estimation is trained to predict both the body and foot locations. Fig. 10 shows the keypoint distribution for the three datasets (COCO, MPII, and COCO+foot). The body+foot model also incorporates an interpolated point between the hips to allow the connection of both legs even when the upper torso is occluded or out of the image. We find evidence that foot keypoint detection implicitly helps the network to more accurately predict some body keypoints, in particular leg keypoints such as ankle locations. Fig. 9b shows an example where the body-only network was not able to predict the ankle location. By including foot keypoints during training, while maintaining the same body annotations, the algorithm can properly predict the ankle location in Fig. 9c. We quantitatively analyze the accuracy difference in Section 5.5.

5 DATASETS AND EVALUATIONS

We evaluate our method on three benchmarks for multi-person pose estimation: (1) the MPII human multi-person dataset [66], which consists of 3844 training and 1758 testing groups of multiple interacting individuals in highly articulated poses with 14 body parts; (2) the COCO keypoint challenge dataset [67], which requires simultaneously detecting people and localizing 17 keypoints (body parts) in each person (including 12 human body parts and 5 facial keypoints); (3) our foot dataset, which is a subset of 15K annotations out of the COCO keypoint dataset. These datasets collect images in diverse scenarios that contain many real-world challenges such as crowding, scale variation, occlusion, and contact. Our approach placed first at the inaugural COCO 2016 keypoints challenge [70], and significantly exceeded the previous state-of-the-art results on the MPII multi-person benchmark. We also provide a runtime analysis comparison against Mask R-CNN and Alpha-Pose to quantify the efficiency of the system, and analyze the main failure cases. Fig. 17 shows some qualitative results from our algorithm.

5.1 Results on the MPII Multi-Person Dataset

For comparison on the MPII dataset, we use the toolkit [1] to measure mean Average Precision (mAP) of all body parts following the "PCKh" metric from [66]. Table 1 compares mAP performance between our method and other approaches on the official MPII testing sets. We also compare the average inference/optimization time per image in seconds. For the 288-image subset, our method outperforms previous state-of-the-art bottom-up methods [2] by 8.5 percent mAP. Remarkably, our inference time is 6 orders of magnitude
less. We report a more detailed runtime analysis in Section 5.3. For the entire MPII testing set, our method without scale search already outperforms previous state-of-the-art methods by a large margin, i.e., a 13 percent absolute increase in mAP. Using a 3-scale search (x0.7, x1 and x1.3) further increases the performance to 75.6 percent mAP. The mAP comparison with previous bottom-up approaches indicates the effectiveness of our novel feature representation, PAFs, to associate body parts. Based on the tree structure, our greedy parsing method achieves better accuracy than a graph-cut optimization formulation based on a fully connected graph structure [1], [2].

TABLE 1
Results on the MPII Dataset

Subset of 288 images as in [1]
Method               Hea   Sho   Elb   Wri   Hip   Kne   Ank   mAP   s/image
Deepcut [1]          73.4  71.8  57.9  39.9  56.7  44.0  32.0  54.1  57995
Iqbal et al. [41]    70.0  65.2  56.4  46.1  52.7  47.9  44.5  54.7  10
DeeperCut [2]        87.9  84.0  71.9  63.9  68.8  63.8  58.1  71.2  230
Newell et al. [48]   91.5  87.2  75.9  65.4  72.2  67.0  62.1  74.5  -
ArtTrack [47]        92.2  91.3  80.8  71.4  79.1  72.6  67.8  79.3  0.005
Fang et al. [6]      89.3  88.1  80.7  75.5  73.7  76.7  70.0  79.1  -
Ours                 92.9  91.3  82.3  72.6  76.0  70.9  66.8  79.0  0.005

Full testing set
Method                 Hea   Sho   Elb   Wri   Hip   Kne   Ank   mAP   s/image
DeeperCut [2]          78.4  72.5  60.2  51.0  57.2  52.0  45.4  59.5  485
Iqbal et al. [41]      58.4  53.9  44.5  35.0  42.2  36.7  31.1  43.1  10
Levinkov et al. [71]   89.8  85.2  71.8  59.6  71.1  63.0  53.5  70.6  -
ArtTrack [47]          88.8  87.0  75.9  64.9  74.2  68.8  60.5  74.3  0.005
Fang et al. [6]        88.4  86.5  78.6  70.4  74.4  73.0  65.8  76.7  -
Newell et al. [48]     92.1  89.3  78.9  69.8  76.2  71.6  64.7  77.5  -
Fieraru et al. [72]    91.8  89.5  80.4  69.6  77.3  71.7  65.5  78.0  -
Ours (one scale)       89.0  84.9  74.9  64.2  71.0  65.6  58.1  72.5  0.005
Ours                   91.2  87.6  77.7  66.8  75.4  68.9  61.7  75.6  0.005

Top: Comparison results on the testing subset defined in [1]. Middle: Comparison results on the whole testing set. Testing without scale search is denoted as "(one scale)".

In Table 2, we show comparison results for the different skeleton structures shown in Fig. 6. We created a custom validation set consisting of 343 images from the original MPII training set. We train our model based on a fully connected graph, and compare results by selecting all edges (Fig. 6b, approximately solved by Integer Linear Programming), and minimal tree edges (Fig. 6c, approximately solved by Integer Linear Programming, and Fig. 6d, solved by the greedy algorithm presented in this paper). Both methods yield similar results, demonstrating that it is sufficient to use minimal edges. We trained our final model to only learn the minimal edges to fully utilize the network capacity, denoted as Fig. 6d (sep). This approach outperforms Fig. 6c and even Fig. 6b, while maintaining efficiency. The smaller number of part association channels required (13 edges of a tree versus 91 edges of a graph) facilitates training convergence.

Fig. 11. mAP curves over different PCKh thresholds on the MPII validation set. (a) mAP curves of self-comparison experiments. (b) mAP curves of PAFs across stages.

Fig. 11a shows an ablation analysis on our validation set. For the threshold of PCKh-0.5 [66], the accuracy of our PAF method is 2.9 percent higher than one-midpoint and 2.3 percent higher than two intermediate points, generally outperforming the method of midpoint representation. The PAFs, which encode both position and orientation information of human limbs, are better able to distinguish the common cross-over cases, e.g., overlapping arms. Training with masks of unlabeled persons further improves the performance by 2.3 percent because it avoids penalizing the true positive prediction in the loss during training. If we use the ground-truth keypoint location with our parsing algorithm, we can obtain a mAP of 88.3 percent. In Fig. 11a, the mAP obtained using our parsing with GT detection is constant across different PCKh thresholds due to no localization error. Using GT connection with our keypoint detection achieves a mAP of 81.6 percent. It is notable that our parsing algorithm based on PAFs achieves a similar mAP as when based on GT connections (79.4 versus 81.6 percent). This indicates that parsing based on PAFs is quite robust in associating correct part detections. Fig. 11b shows a comparison of performance across stages. The mAP increases monotonically with the iterative refinement framework. Fig. 4 shows the qualitative improvement of the predictions over stages.

5.2 Results on the COCO Keypoints Challenge

The COCO training set consists of over 100K person instances labeled with over 1 million keypoints. The testing set contains "test-challenge" and "test-dev" subsets, which have roughly 20K images each. The COCO evaluation defines the object keypoint similarity (OKS) and uses the mean average precision (AP) over 10 OKS thresholds as the main competition metric [70]. The OKS plays the same role as the IoU in object detection. It is calculated from the scale of the person and the distance between predicted and GT points. Table 3 shows results from top teams in the challenge.
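The OKS itself is not restated in the paper; the sketch below follows the standard COCO definition, a Gaussian falloff of the prediction-to-ground-truth distance normalized by object area and per-keypoint constants k_i, whose values come from the COCO evaluation toolkit and are passed in here rather than hard-coded.

```python
import numpy as np


def oks(pred, gt, visible, area, k):
    """Object keypoint similarity under the standard COCO definition:
    exp(-d_i^2 / (2 * s^2 * k_i^2)) averaged over labeled keypoints, where s^2
    is the ground-truth person area and k_i are per-keypoint constants taken
    from the COCO toolkit (not restated here).

    pred, gt: (J, 2) arrays of keypoint coordinates; visible: (J,) boolean mask
    of labeled GT keypoints; area: GT person area; k: (J,) constants.
    """
    d2 = np.sum((np.asarray(pred, float) - np.asarray(gt, float)) ** 2, axis=1)
    e = d2 / (2.0 * area * np.asarray(k, float) ** 2 + np.finfo(float).eps)
    sim = np.exp(-e)
    return float(np.mean(sim[visible])) if np.any(visible) else 0.0
```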
It is noteworthy that our method has a higher drop in accuracy when considering only people of higher scales (APL).

TABLE 2
Comparison of Different Structures on Our Custom Validation Set

Method          Hea   Sho   Elb   Wri   Hip   Kne   Ank   mAP   s/image
Fig. 6b         91.8  90.8  80.6  69.5  78.9  71.4  63.8  78.3  362
Fig. 6c         92.2  90.8  80.2  69.2  78.5  70.7  62.6  77.6  43
Fig. 6d         92.0  90.7  80.0  69.4  78.4  70.1  62.3  77.4  0.005
Fig. 6d (sep)   92.4  90.4  80.9  70.8  79.5  73.1  66.5  79.1  0.005

In Table 4, we report self-comparisons on the COCO validation set. If we use the GT bounding box and a single-person CPM [20], we can achieve an upper bound for the top-down approach using CPM, which is 62.7 percent AP. If we use the state-of-the-art object detector, Single Shot MultiBox Detector (SSD) [74], the performance drops 10 percent. This comparison indicates that the performance of top-down approaches relies
heavily on the person detector. In contrast, our original bottom-up method achieves 58.4 percent AP. If we refine the results by applying a single-person CPM on each rescaled region of the estimated persons parsed by our method, we gain a 2.6 percent overall AP increase. We only update estimations on predictions in which both methods roughly agree, resulting in improved precision and recall. The new architecture without CPM refinement is approximately 7 percent more accurate than the original approach, while being twice as fast.

We analyze the effect of PAF refinement over confidence map estimation in Table 5. We fix the computation to a maximum of 6 stages, distributed differently across the PAF and confidence map branches. We can extract 3 conclusions from this experiment. First, PAF requires a higher number of stages to converge and benefits more from refinement stages. Second, increasing the number of PAF channels mainly improves the number of true positives, even though they might not be too accurate (higher AP50). However, increasing the number of confidence map channels further improves the localization accuracy (higher AP75). Third, we prove that the accuracy of the part confidence maps highly increases when using PAF as a prior, while the opposite results in a 4 percent absolute accuracy decrease. Even the model with only 4 stages (3 PAF - 1 CM) is more accurate than the computationally more expensive 6-stage model that first predicts confidence maps (3 CM - 3 PAF). Some other additions that further increased the accuracy of the new models with respect to the original work are PReLU over ReLU layers and Adam optimization instead of SGD with momentum. Differently from [3], we do not refine the current approach with CPM [20] to avoid harming the speed.

TABLE 3
COCO Test-Dev Leaderboard [73], "*" Indicates That No Citation Was Provided

Team                     AP    AP50  AP75  APM   APL
Top-Down Approaches
Megvii [43]              78.1  94.1  85.9  74.5  83.3
MSRA [44]                76.5  92.4  84.0  73.0  82.7
The Sea Monsters*        75.9  92.1  83.0  71.7  82.1
Alpha-Pose [6]           71.0  87.9  77.7  69.0  75.2
Mask R-CNN [5]           69.2  90.4  76.0  64.9  76.3
Bottom-Up Approaches
METU [50]                70.5  87.7  77.2  66.1  77.3
TFMAN*                   70.2  89.2  77.0  65.6  76.3
PersonLab [49]           68.7  89.0  75.4  64.1  75.5
Associative Emb. [48]    65.5  86.8  72.3  60.6  72.6
Ours                     64.2  86.2  70.1  61.0  68.8
Ours [3]                 61.8  84.9  67.5  57.1  68.2

Top: some of the highest top-down results. Bottom: highest bottom-up results.

TABLE 4
Self-Comparison Experiments on the COCO Validation Set

TABLE 5
Self-Comparison Experiments on the COCO Validation Set

Method         AP    AP50  AP75  APM   APL   Stages
5 PAF - 1 CM   65.3  85.2  71.3  62.2  70.7  6
4 PAF - 2 CM   65.2  85.3  71.4  62.3  70.1  6
3 PAF - 3 CM   65.0  85.1  71.2  62.4  69.4  6
4 PAF - 1 CM   64.8  85.3  70.9  61.9  69.6  5
3 PAF - 1 CM   64.6  84.8  70.6  61.8  69.5  4
3 CM - 3 PAF   61.0  83.9  65.7  58.5  65.3  6

CM refers to confidence map, while the numbers express the number of estimation stages for PAF and CM. Stages refers to the total number of PAF and CM stages. Reducing the number of stages increases the runtime performance.

5.3 Inference Runtime Analysis

We compare 3 state-of-the-art, well-maintained, and widely used multi-person pose estimation libraries: OpenPose [4], based on this work, Mask R-CNN [5], and Alpha-Pose [6]. We analyze the inference runtime performance of the 3 methods in Fig. 12. The Megvii (Face++) [43] and MSRA [44] GitHub repositories do not include the person detector they use and only provide pose estimation results given a cropped person. Thus, we cannot know their exact runtime performance, and they have been excluded from this analysis. Mask R-CNN is only compatible with Nvidia graphics cards, so we perform the analysis on a system with an NVIDIA 1080 Ti. As top-down approaches, the inference times of Mask R-CNN, Alpha-Pose, Megvii, and MSRA are roughly proportional to the number of people in the image. To be more precise, they are proportional to the number of proposals that their person detectors extract. In contrast, the inference time of our bottom-up approach is invariant to the number of people in the image. The runtime of OpenPose consists of two major parts: (1) CNN processing time, whose complexity is O(1), constant with varying number of people; (2) multi-person parsing time.
TABLE 6
Runtime Difference between the 3 Models Released in OpenPose with CUDA and CPU-Only Versions, Running on a NVIDIA GeForce GTX-1080 Ti GPU and an i7-6850K CPU
TABLE 7
Foot Keypoint Analysis on the Foot Validation Set

TABLE 9
Vehicle Keypoint Validation Set

Fig. 14. Vehicle keypoint detection examples from the validation set. The keypoint locations are successfully estimated under challenging scenarios, including overlapping between cars, cropped vehicles, and different scales.
Fig. 15. Common failure cases: (a) rare pose or appearance, (b) missing or false part detection, (c) overlapping parts, i.e., part detections shared by two persons, (d) wrong connection associating parts from two persons, (e-f) false positives on statues or animals.

Fig. 16. Common foot failure cases: (a) foot or leg occluded by the body, (b) foot or leg occluded by another object, (c) foot visible but leg occluded, (d) shoe and foot not aligned, (e) false negatives when the foot is visible but the rest of the body is occluded, (f) soles of the feet are usually not detected (rare in training), (g) swap between right and left body parts.
Fig. 17. Results containing viewpoint and appearance variation, occlusion, crowding, contact, and other common imaging artifacts.
task. To demonstrate this, we have run the same network architecture for the task of vehicle keypoint detection [79]. Once again, we use mean average precision over 10 OKS thresholds for the evaluation. The results are shown in Table 9. Both the average precision and recall are higher than in the body keypoint task, mainly because we are using a smaller and simpler dataset. This initial dataset consists of image annotations from 19 different cameras. We have used the frames from the first 18 cameras as a training set, and the frames from the last camera as a validation set. No variations in the model architecture or training parameters have been made. We show qualitative results in Fig. 14.

5.7 Failure Case Analysis

We have analyzed the main cases where the current approach fails in the MPII, COCO, and COCO+foot validation sets. Fig. 15 shows an overview of the main body failure cases, while Fig. 16 shows the main foot failure cases. Fig. 15a refers to non-typical poses and upside-down examples, where the predictions usually fail. Increasing the rotation augmentation visually seems to partially solve these issues, but the global accuracy on the COCO validation set is reduced by about 5 percent. A different alternative is to run the network using different rotations and keep the poses with the higher confidence. Body occlusion can also lead to false negatives and high localization error. This problem is inherited from the dataset annotations, in which occluded keypoints are not included. In highly crowded images where people are overlapping, the approach tends to merge annotations from different people, while missing others, due to the overlapping PAFs that make the greedy multi-person parsing fail. Animals and statues also frequently lead to false positive errors. This issue could be mitigated by adding more negative examples during training to help the network distinguish between humans and other humanoid figures.

6 CONCLUSION

Realtime multi-person 2D pose estimation is a critical component in enabling machines to visually understand and interpret humans and their interactions. In this paper, we present an explicit nonparametric representation of the keypoint association that encodes both position and orientation of human limbs. Second, we design an architecture that jointly learns part detection and association. Third, we demonstrate that a greedy parsing algorithm is sufficient to produce high-quality parses of body poses, and preserves efficiency regardless of the number of people. Fourth, we prove that PAF refinement is far more important than combined PAF and body part location refinement, leading to a substantial increase in both runtime performance and accuracy. Fifth, we show that combining body and foot estimation into a single model boosts the accuracy of each component individually and reduces the inference time of running them sequentially. We have created a foot keypoint dataset consisting of 15K foot keypoint instances, which we will publicly release. Finally, we have open-sourced this work as OpenPose [4], the first realtime system for body, foot, hand, and facial keypoint detection. The library is being widely used today for many research topics involving human analysis, such as human re-identification, retargeting, and Human-Computer Interaction. In addition, OpenPose has been included in the OpenCV library [65].

ACKNOWLEDGMENTS

We acknowledge the effort from the authors of the MPII and COCO human pose datasets. These datasets make 2D human pose estimation in the wild possible. This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior / Interior Business Center (DOI/IBC) contract number D17PC00340.

REFERENCES

[1] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, "Deepcut: Joint subset partition and labeling for multi person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4929-4937.
[2] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "Deepercut: A deeper, stronger, and faster multi-person pose estimation model," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 34-50.
[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1302-1310.
[4] G. Hidalgo, Z. Cao, T. Simon, S.-E. Wei, H. Joo, and Y. Sheikh, "OpenPose library," [Online]. Available: https://github.com/CMU-Perceptual-Computing-Lab/openpose
[5] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980-2988.
[6] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, "RMPE: Regional multi-person pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2353-2362.
[7] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," Int. J. Comput. Vis., vol. 61, pp. 55-79, 2005.
[8] D. Ramanan, D. A. Forsyth, and A. Zisserman, "Strike a Pose: Tracking people by finding stylized poses," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 271-278.
[9] M. Andriluka, S. Roth, and B. Schiele, "Monocular 3D pose estimation and tracking by detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 623-630.
[10] M. Andriluka, S. Roth, and B. Schiele, "Pictorial structures revisited: People detection and articulated pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1014-1021.
[11] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, "Poselet conditioned pictorial structures," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 588-595.
[12] Y. Yang and D. Ramanan, "Articulated human detection with flexible mixtures of parts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2878-2890, Dec. 2013.
[13] S. Johnson and M. Everingham, "Clustered pose and nonlinear appearance models for human pose estimation," in Proc. British Mach. Vis. Conf., 2010, pp. 5-15.
[14] Y. Wang and G. Mori, "Multiple tree models for occlusion and spatial constraints in human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 710-724.
[15] L. Sigal and M. J. Black, "Measure locally, reason globally: Occlusion-sensitive articulated pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2006, pp. 2041-2048.
[16] X. Lan and D. P. Huttenlocher, "Beyond trees: Common-factor models for 2D human pose recovery," in Proc. IEEE Int. Conf. Comput. Vis., 2005, pp. 470-477.
[17] L. Karlinsky and S. Ullman, "Using linking features in learning non-parametric part models," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 326-339.
[18] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, "Human pose estimation using body parts dependent joint regressors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3041-3048.
[19] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 2597-2602.
[20] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4724-4732.
[21] W. Ouyang, X. Chu, and X. Wang, "Multi-source deep learning for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2337-2344.
[22] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 648-656.
[23] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 1799-1807.
[24] X. Chen and A. Yuille, "Articulated pose estimation by a graphical model with image dependent pairwise relations," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 1736-1744.
[25] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1653-1660.
[26] V. Belagiannis and A. Zisserman, "Recurrent human pose estimation," in Proc. 12th IEEE Int. Conf. Automatic Face Gesture Recognit., 2017, pp. 468-475.
[27] A. Bulat and G. Tzimiropoulos, "Human pose estimation via convolutional part heatmap regression," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 717-732.
[28] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, "Multi-context attention for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5669-5678.
[29] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, "Learning feature pyramids for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1290-1299.
[30] Y. Chen, C. Shen, X.-S. Wei, L. Liu, and J. Yang, "Adversarial posenet: A structure-aware convolutional network for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1221-1230.
[31] W. Tang, P. Yu, and Y. Wu, "Deeply learned compositional models for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 190-206.
[32] L. Ke, M.-C. Chang, H. Qi, and S. Lyu, "Multi-scale structure-aware network for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 731-746.
[33] T. Pfister, J. Charles, and A. Zisserman, "Flowing convnets for human pose estimation in videos," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1913-1921.
[34] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, "Pose machines: Articulated pose estimation via inference machines," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 33-47.
[35] S. Hochreiter, Y. Bengio, and P. Frasconi, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in Field Guide to Dynamical Recurrent Networks, J. Kolen and S. Kremer, Eds. Piscataway, NJ, USA: IEEE Press, 2001.
[36] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statistics, 2010, pp. 249-256.
[37] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157-166, Mar. 1994.
[38] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele, "Articulated people detection and pose estimation: Reshaping the future," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3178-3185.
[39] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, "Using k-poselets for detecting people and localizing their keypoints," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3582-3589.
[40] M. Sun and S. Savarese, "Articulated part-based model for joint object detection and pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 723-730.
[41] U. Iqbal and J. Gall, "Multi-person pose estimation with local joint-to-person associations," in Proc. Eur. Conf. Comput. Vis. Workshop, 2016, pp. 627-642.
[42] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, "Towards accurate multi-person pose estimation in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3711-3719.
[43] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, "Cascaded pyramid network for multi-person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7103-7112.
[44] B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 472-487.
[45] M. Eichner and V. Ferrari, "We are family: Joint pose estimation of multiple persons," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 228-242.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770-778.
[47] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele, "Arttrack: Articulated multi-person tracking in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1293-1301.
[48] A. Newell, Z. Huang, and J. Deng, "Associative embedding: End-to-end learning for joint detection and grouping," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 2274-2284.
[49] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, "Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 282-299.
[50] M. Kocabas, S. Karagoz, and E. Akbas, "MultiPoseNet: Fast multi-person pose estimation using pose residual network," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 437-453.
[51] X. Nie, J. Feng, J. Xing, and S. Yan, "Pose partition networks for multi-person pose estimation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 705-720.
[52] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2261-2269.
[53] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015, pp. 521-534.
[54] D. B. West, et al., Introduction to Graph Theory, vol. 2, Upper Saddle River, NJ, USA: Prentice Hall, 2001.
[55] H. W. Kuhn, "The Hungarian method for the assignment problem," in Naval Research Logistics Quarterly. Hoboken, NJ, USA: Wiley Online Library, 1955.
[56] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y.-G. Jiang, and X. Xue, "Pose-normalized image generation for person re-identification," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 661-678.
[57] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-gan: Unsupervised video retargeting," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 119-135.
[58] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody dance now," in Proc. Eur. Conf. Comput. Vis. Workshop, 2018, pp. 625-633.
[59] L. Gui, K. Zhang, Y.-X. Wang, X. Liang, J. M. Moura, and M. M. Veloso, "Teaching robots to predict human motion," in Proc. Int. Conf. Intell. Robots Syst., 2018, pp. 562-567.
[60] P. Panteleris, I. Oikonomidis, and A. Argyros, "Using a single RGB frame for real time 3D hand pose estimation in the wild," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 436-445.
[61] H. Joo, T. Simon, and Y. Sheikh, "Total capture: A 3D deformation model for tracking faces, hands, and bodies," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8320-8329.
[62] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt, "VNect: Real-time 3D human pose estimation with a single RGB camera," ACM Trans. Graph., vol. 36, 2017, Art. no. 44.
[63] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, "Hand keypoint detection in single images using multiview bootstrapping," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4645-4653.
[64] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," J. Society Ind. Appl. Math., vol. 11, no. 2, pp. 431-441, 1963.
[65] G. Bradski, "The OpenCV library," Dobb's J. Softw. Tools, 2000.
[66] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3686-3693.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740-755.
[68] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Trans. Graph., vol. 34, 2015, Art. no. 248.
[69] "ClickWorker," 2010. [Online]. Available: https://www.clickworker.com
[70] "MSCOCO keypoint evaluation metric," 2016. [Online]. Available: http://mscoco.org/dataset/#keypoints-eval
[71] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres, "Joint graph decomposition & node labeling: Problem, algorithms, applications," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1904-1912.
[72] M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele, "Learning to refine human pose estimation," in Proc. Comput. Vis. Pattern Recognit. Workshop, 2018, pp. 205-214.
[73] "MSCOCO keypoint leaderboard," 2016. [Online]. Available: http://mscoco.org/dataset/#keypoints-leaderboard
[74] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21-37.
[75] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3296-3297.
[76] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91-99.
[77] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6517-6525.
[78] G. Moon, J. Chang, and K. M. Lee, "Posefix: Model-agnostic general human pose refinement network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019.
[79] N. Dinesh Reddy, M. Vo, and S. G. Narasimhan, "Carfusion: Combining point tracking and part detection for dynamic 3D reconstruction of vehicles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1906-1915.

Zhe Cao received the BS degree in computer science from Wuhan University, China, in 2015, and the MS degree in robotics from Carnegie Mellon University, in 2017, advised by Dr. Yaser Sheikh. He is working toward the PhD degree in computer vision at the University of California, Berkeley, advised by Dr. Jitendra Malik. His research interests include computer vision and deep learning. He is a student member of the IEEE.

Gines Hidalgo received the BS degree in telecommunications from Universidad Politecnica de Cartagena, Spain, in 2014, and the MS degree in robotics from Carnegie Mellon University, in 2019, which he obtained while working as a research associate with the Robotics Institute, advised by Dr. Yaser Sheikh. He is a research engineer at Epic Games. His research interests include human pose estimation, computationally efficient deep learning, and computer vision. He is a student member of the IEEE.

Tomas Simon received the BS degree in telecommunications from Universidad Politecnica de Valencia, the MS degree in robotics from Carnegie Mellon University, which he obtained while working at the Human Sensing Lab advised by Fernando De la Torre, and the PhD degree from Carnegie Mellon University, advised by Yaser Sheikh and Iain Matthews, in 2017. He is a research scientist at Facebook Reality Labs. His research interests lie mainly in using computer vision and machine learning to model faces and bodies.

Shih-En Wei received the BS degree in electrical engineering and the MS degree in communication engineering from National Taiwan University, Taipei, Taiwan, and the MS degree in robotics from Carnegie Mellon University, in 2016, advised by Dr. Yaser Sheikh. He is a research engineer at Facebook Reality Labs. His research interests include computer vision and machine learning.

Yaser Sheikh is the director of the Facebook Reality Lab in Pittsburgh and is an associate professor at the Robotics Institute at Carnegie Mellon University. His research is broadly focused on machine perception of social behavior, spanning computer vision, computer graphics, and machine learning. With colleagues, he has won Popular Science's "Best of What's New" Award, the Honda Initiation Award (2010), the best student paper award at CVPR (2018), best paper awards at WACV (2012), SAP (2012), SCA (2010), and ICCV THEMIS (2009), the best demo award at ECCV (2016), and placed first in the MSCOCO Keypoint Challenge (2016); he has also received the Hillman Fellowship for Excellence in Computer Science Research (2004). Yaser has served as a senior committee member at leading conferences in computer vision, computer graphics, and robotics including SIGGRAPH (2013, 2014), CVPR (2014, 2015, 2018), ICRA (2014, 2016), ICCP (2011), and served as an Associate Editor of CVIU. His research is sponsored by various government research offices, including NSF and DARPA, and several industrial partners including the Intel Corporation, the Walt Disney Company, Nissan, Honda, Toyota, and the Samsung Group. His research has been featured by various media outlets including The New York Times, The Verge, Popular Science, BBC, MSNBC, New Scientist, Slashdot, and WIRED.