Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
7 views15 pages

Case Study3

The document presents OpenPose, a real-time multi-person 2D pose estimation system that utilizes Part Affinity Fields (PAFs) to accurately detect and associate body parts in images. This method improves runtime performance and accuracy by refining PAFs independently, and introduces a combined body and foot keypoint detector. OpenPose is the first open-source library for multi-person pose detection, capable of identifying body, foot, hand, and facial keypoints efficiently.

Uploaded by

gilbertlo680
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views15 pages

Case Study3

The document presents OpenPose, a real-time multi-person 2D pose estimation system that utilizes Part Affinity Fields (PAFs) to accurately detect and associate body parts in images. This method improves runtime performance and accuracy by refining PAFs independently, and introduces a combined body and foot keypoint detector. OpenPose is the first open-source library for multi-person pose detection, capable of identifying body, foot, hand, and facial keypoints efficiently.

Uploaded by

gilbertlo680
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

172 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO.

1, JANUARY 2021

OpenPose: Realtime Multi-Person 2D Pose


Estimation Using Part Affinity Fields
Zhe Cao , Student Member, IEEE, Gines Hidalgo , Student Member, IEEE,
Tomas Simon , Shih-En Wei, and Yaser Sheikh

Abstract—Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in
images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed
method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with
individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in
the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We
demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both
runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated
foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to
running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of
OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.

Index Terms—2D human pose estimation, 2D foot keypoint estimation, real-time, multiple person, part affinity fields

1 INTRODUCTION

I Nthis paper, we consider a core component in obtaining


a detailed understanding of people in images and vid-
eos: human 2D pose estimation—or the problem of local-
for each person detection, a single-person pose estimator
is run. In contrast, bottom-up approaches are attractive as
they offer robustness to early commitment and have the
izing anatomical keypoints or “parts”. Human estimation potential to decouple runtime complexity from the num-
has largely focused on finding body parts of individuals. ber of people in the image. Yet, bottom-up approaches do
Inferring the pose of multiple people in images presents a not directly use global contextual cues from other body
unique set of challenges. First, each image may contain an parts and other people. Initial bottom-up methods ([1],
unknown number of people that can appear at any posi- [2]) did not retain the gains in efficiency as the final parse
tion or scale. Second, interactions between people induce required costly global inference, taking several minutes
complex spatial interference, due to contact, occlusion, or per image.
limb articulations, making association of parts difficult. In this paper, we present an efficient method for multi-
Third, runtime complexity tends to grow with the number person pose estimation with competitive performance on
of people in the image, making realtime performance a multiple public benchmarks. We present the first bottom-up
challenge. representation of association scores via Part Affinity Fields
A common approach is to employ a person detector (PAFs), a set of 2D vector fields that encode the location and
and perform single-person pose estimation for each detec- orientation of limbs over the image domain. We demon-
tion. These top-down approaches directly leverage exist- strate that simultaneously inferring these bottom-up repre-
ing techniques for single-person pose estimation, but sentations of detection and association encodes sufficient
suffer from early commitment: if the person detector fails– global context for a greedy parse to achieve high-quality
as it is prone to do when people are in close proximity– results, at a fraction of the computational cost.
there is no recourse to recovery. Furthermore, their run- An earlier version of this manuscript appeared in[3]. This
time is proportional to the number of people in the image, version makes several new contributions. First, we prove
that PAF refinement is crucial for maximizing accuracy,
while body part prediction refinement is not that important.
 Z. Cao is with the Berkeley Artificial Intelligence Research Lab (BAIR), Uni- We increase the network depth but remove the body part
versity of California, Berkeley, CA 94709 USA. E-mail: [email protected]. refinement stages (Sections 3.1 and 3.2). This refined network
 G. Hidalgo and Y. Sheikh are with the Robotics Institute, Carnegie
Mellon University, Pittsburgh, PA 15213 USA. E-mail: gines@alumni. increases both speed and accuracy by approximately 200 and
cmu.edu, [email protected]. 7 percent, respectively (Sections 5.2 and 5.3). Second, we
 T. Simon and S.-E. Wei are with the Facebook Reality Labs, Pittsburgh, PA present an annotated foot dataset1 with 15K human foot
15213 USA. E-mail: {tomas.simon, shih-en.wei}@oculus.com.
instances that has been publicly released (Section 4.2), and
Manuscript received 13 Dec. 2018; revised 24 May 2019; accepted 4 July 2019.
Date of publication 17 July 2019; date of current version 3 Dec. 2020.
(Corresponding author: Zhe Cao.)
Recommended for acceptance by V. Lepetit. 1. Dataset webpage: https://cmu-perceptual-computing-lab.github.
Digital Object Identifier no. 10.1109/TPAMI.2019.2929257 io/foot_keypoint_dataset/

0162-8828 ß 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
CAO ET AL.: OPENPOSE: REALTIME MULTI-PERSON 2D POSE ESTIMATION USING PART AFFINITY FIELDS 173

at the end of each stage to address the problem of vanishing


gradients [35], [36], [37] during training. Newell et al. [19]
also showed intermediate supervisions are beneficial in a
stacked hourglass architecture. However, all of these meth-
ods assume a single person, where the location and scale of
the person of interest is given.
Multi-Person Pose Estimation. For multi-person pose esti-
mation, most approaches [5], [6], [38], [39], [40], [41], [42],
[43], [44] have used a top-down strategy that first detects
people and then have estimated the pose of each person
independently on each detected region. Although this str-
ategy makes the techniques developed for the single per-
son case directly applicable, it not only suffers from early
commitment on person detection, but also fails to capture
the spatial dependencies across different people that req-
Fig. 1. Top: Multi-person pose estimation. Body parts belonging to the uire global inference. Some approaches have started to
same person are linked, including foot keypoints (big toes, small toes, consider inter-person dependencies. Eichner et al. [45]
and heels). Bottom left: Part Affinity Fields (PAFs) corresponding to the
extended pictorial structures to take a set of interacting peo-
limb connecting right elbow and wrist. The color encodes orientation.
Bottom right: A 2D vector in each pixel of every PAF encodes the posi- ple and depth ordering into account, but still required a per-
tion and orientation of the limbs. son detector to initialize detection hypotheses. Pishchulin
et al. [1] proposed a bottom-up approach that jointly labels
we show that a combined model with body and foot key- part detection candidates and associated them to individual
points can be trained preserving the speed of the body-only people, with pairwise scores regressed from spatial offsets
model while maintaining its accuracy (Section 5.5). Third, of detected parts. This approach does not rely on person
we demonstrate the generality of our method by applying it detections, however, solving the proposed integer linear
to the task of vehicle keypoint estimation (Section 5.6). programming over the fully connected graph is an NP-hard
Finally, this work documents the release of OpenPose [4]. problem and thus the average processing time for a single
This open-source library is the first available realtime system image is on the order of hours. Insafutdinov et al. [2] built
for multi-person 2D pose detection, including body, foot, on [1] with a stronger part detectors based on ResNet [46]
hand, and facial keypoints (Section 4). We also include a run- and image-dependent pairwise scores, and vastly improved
time comparison to Mask R-CNN [5] and Alpha-Pose [6], the runtime with an incremental optimization approach,
showing the computational advantage of our bottom-up but the method still takes several minutes per image, with a
approach (Section 5.3). limit of at most 150 part proposals. The pairwise representa-
tions used in [2], which are offset vectors between every
pair of body parts, are difficult to regress precisely and thus
2 RELATED WORK a separate logistic regression is required to convert the pair-
Single Person Pose Estimation. The traditional approach to wise features into a probability score.
articulated human pose estimation is to perform inference In earlier work [3], we present part affinity fields (PAFs), a
over a combination of local observations on body parts and representation consisting of a set of flow fields that encodes
the spatial dependencies between them. The spatial model unstructured pairwise relationships between body parts of
for articulated pose is either based on tree-structured graph- a variable number of people. In contrast to [1] and [2], we
ical models [7], [8], [9], [10], [11], [12], [13], which parametri- can efficiently obtain pairwise scores from PAFs without an
cally encode the spatial relationship between adjacent parts additional training step. These scores are sufficient for a
following a kinematic chain, or non-tree models [14], [15], greedy parse to obtain high-quality results with realtime
[16], [17], [18] that augment the tree structure with addi- performance for multi-person estimation. Concurrent to
tional edges to capture occlusion, symmetry, and long- this work, Insafutdinov et al. [47] further simplified their
range relationships. To obtain reliable local observations of body-part relationship graph for faster inference in single-
body parts, Convolutional Neural Networks (CNNs) have frame model and formulated articulated human tracking as
been widely used, and have significantly boosted the accu- spatio-temporal grouping of part proposals. Recenetly,
racy on body pose estimation [19], [20], [21], [22], [23], [24], Newell et al. [48] proposed associative embeddings which
[25], [26], [27], [28], [29], [30], [31], [32]. Tompson et al. [23] can be thought as tags representing each keypoint’s group.
used a deep architecture with a graphical model whose They group keypoints with similar tags into individual peo-
parameters are learned jointly with the network. Pfister ple. Papandreou et al. [49] proposed to detect individual
et al. [33] further used CNNs to implicitly capture global keypoints and predict their relative displacements, allowing
spatial dependencies by designing networks with large a greedy decoding process to group keypoints into person
receptive fields. The convolutional pose machines architec- instances. Kocabas et al. [50] proposed a Pose Residual Net-
ture proposed by Wei et al. [20] used a multi-stage architec- work which receives keypoint and person detections, and
ture based on a sequential prediction framework [34]; then assigns keypoints to detected person bounding boxes.
iteratively incorporating global context to refine part confi- Nie et al. [51] proposed to partition all keypoint detections
dence maps and preserving multimodal uncertainty from using dense regressions from keypoint candidates to cent-
previous iterations. Intermediate supervisions are enforced roids of persons in the image.

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
174 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO. 1, JANUARY 2021

Fig. 2. Overall pipeline. (a) Our method takes the entire image as the input for a CNN to jointly predict (b) confidence maps for body part detection and
(c) PAFs for part association. (d) The parsing step performs a set of bipartite matchings to associate body part candidates. (e) We finally assemble
them into full body poses for all people in the image.

In this work, we make several extensions to our earlier Additionally, the output of each one of the 3 convolutional
work [3]. We prove that PAF refinement is critical and suffi- kernels is concatenated, following an approach similar to
cient for high accuracy, removing the body part confidence DenseNet [52]. The number of non-linearity layers is tripled,
map refinement while increasing the network depth. This and the network can keep both lower level and higher level
leads to a faster and more accurate model. We also present features. Sections 5.2 and 5.3 analyze the accuracy and run-
the first combined body and foot keypoint detector, created time speed improvements, respectively.
from an annotated foot dataset that will be publicly rele-
ased. We prove that combining both detection approaches 3.2 Simultaneous Detection and Association
not only reduces the inference time compared to running The image is analyzed by a CNN (initialized by the first 10
them independently, but also maintains their individual layers of VGG-19 [53] and fine-tuned), generating a set of
accuracy. Finally, we present OpenPose, the first open- feature maps F that is input to the first stage. At this stage,
source library for real time body, foot, hand, and facial key- the network produces a set of part affinity fields (PAFs)
point detection. L1 ¼ f1 ðFÞ, where f1 refers to the CNNs for inference at
Stage 1. In each subsequent stage, the predictions from the
3 METHOD previous stage and the original image features F are
concatenated and used to produce refined predictions,
Fig. 2 illustrates the overall pipeline of our method. The sys-
tem takes, as input, a color image of size w  h (Fig. 2a) and Lt ¼ ft ðF; Lt1 Þ; 82  t  TP ; (1)
produces the 2D locations of anatomical keypoints for each
person in the image (Fig. 2e). First, a feedforward network where ft refers to the CNNs for inference at Stage t, and TP
predicts a set of 2D confidence maps S of body part loca- to the number of total PAF stages. After TP iterations, the
tions (Fig. 2b) and a set of 2D vector fields L of part affinity process is repeated for the confidence maps detection, start-
fields (PAFs), which encode the degree of association ing in the most updated PAF prediction,
between parts (Fig. 2c). The set S ¼ ðS1 ; S2 ; . . . ; SJ Þ has J
confidence maps, one per part, where Sj 2 Rwh , STP ¼ rt ðF; LTP Þ; 8t ¼ TP ; (2)
j 2 f1 . . . Jg. The set L ¼ ðL1 ; L2 ; . . . ; LC Þ has C vector fields,
one per limb, where Lc 2 Rwh2 , c 2 f1 . . . Cg. We refer to St ¼ rt ðF; LTP ; St1 Þ; 8TP < t  TP þ TC ; (3)
part pairs as limbs for clarity, but some pairs are not human
limbs (e.g., the face). Each image location in Lc encodes a 2D where rt refers to the CNNs for inference at Stage t, and TC
vector (Fig. 1). Finally, the confidence maps and the PAFs to the number of total confidence map stages.
are parsed by greedy inference (Fig. 2d) to output the 2D This approach differs from [3], where both the PAF and
keypoints for all people in the image. confidence map branches were refined at each stage. Hence,
the amount of computation per stage is reduced by half. We
3.1 Network Architecture
Our architecture, shown in Fig. 3, iteratively predicts affin-
ity fields that encode part-to-part association, shown in
blue, and detection confidence maps, shown in beige. The
iterative prediction architecture, following [20], refines the
predictions over successive stages, t 2 f1; . . . ; T g, with inter-
mediate supervision at each stage.
The network depth is increased with respect to [3]. In the
original approach, the network architecture included sev-
eral 7x7 convolutional layers. In our current model, the Fig. 3. Architecture of the multi-stage CNN. The first set of stages pre-
dicts PAFs Lt , while the last set predicts confidence maps St . The pre-
receptive field is preserved while the computation is dictions of each stage and their corresponding image features are
reduced, by replacing each 7x7 convolutional kernel by 3 concatenated for each subsequent stage. Convolutions of kernel size 7
consecutive 3x3 kernels. While the number of operations for from the original approach [3] are replaced with 3 layers of convolutions
of kernel 3 which are concatenated at their end.
the former is 2  72  1 ¼ 97, it is only 51 for the latter.

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
CAO ET AL.: OPENPOSE: REALTIME MULTI-PERSON 2D POSE ESTIMATION USING PART AFFINITY FIELDS 175

Fig. 4. PAFs of right forearm across stages. Although there is confusion


between left and right body parts and limbs in early stages, the estimates
are increasingly refined through global inference in later stages.
Fig. 5. Part association strategies. (a) The body part detection candi-
dates (red and blue dots) for two body part types and all connection can-
empirically observe in Section 5.2 that refined affinity field didates (grey lines). (b) The connection results using the midpoint
predictions improve the confidence map results, while the (yellow dots) representation: correct connections (black lines) and incor-
opposite does not hold. Intuitively, if we look at the PAF rect connections (green lines) that also satisfy the incidence constraint.
(c) The results using PAFs (yellow arrows). By encoding position and ori-
channel output, the body part locations can be guessed. entation over the support of the limb, PAFs eliminate false associations.
However, if we see a bunch of body parts with no other
information, we cannot parse them into different people.
Fig. 4 shows the refinement of the affinity fields across body part j for person k in the image. The value at location
stages. The confidence map results are predicted on top of p 2 R2 in Sj;k is defined as,
the latest and most refined PAF predictions, resulting in a !
barely noticeable difference across confidence map stages. jjp  x jj 2
Sj;k ðpÞ ¼ exp 
j;k 2
To guide the network to iteratively predict PAFs of body ; (7)
s2
parts in the first branch and confidence maps in the second
branch, we apply a loss function at the end of each stage.
We use an L2 loss between the estimated predictions and where s controls the spread of the peak. The groundtruth
the groundtruth maps and fields. Here, we weight the loss confidence map predicted by the network is an aggregation
functions spatially to address a practical issue that some of the individual confidence maps via a max operator,
datasets do not completely label all people. Specifically, the
loss function of the PAF branch at stage ti and loss function Sj ðpÞ ¼ max Sj;k ðpÞ: (8)
k
of the confidence map branch at stage tk are:
C X
X
WðpÞ  kLtci ðpÞ  Lc ðpÞk22 ;
t We take the maximum of the
fLi ¼ (4)
c¼1 p confidence maps instead of the
average so that the precision of
J X
X nearby peaks remains distinct,
WðpÞ  kSjk ðpÞ  Sj ðpÞk22 ;
t t
fSk ¼ (5)
as illustrated in the right figure.
j¼1 p
At test time, we predict confi-
where Lc is the groundtruth PAF, Sj is the groundtruth part dence maps, and obtain body part candidates by perform-
confidence map, and W is a binary mask with WðpÞ ¼ 0 ing non-maximum suppression.
when the annotation is missing at the pixel p. The mask
is used to avoid penalizing the true positive predictions 3.4 Part Affinity Fields for Part Association
during training. The intermediate supervision at each stage Given a set of detected body parts (shown as the red and
addresses the vanishing gradient problem by replenishing blue points in Fig. 5a), how do we assemble them to form
the gradient periodically [20]. The overall objective is the full-body poses of an unknown number of people? We
need a confidence measure of the association for each pair
X
TP þTC
TPX
of body part detections, i.e., that they belong to the same
f¼ fLt þ fSt : (6) person. One possible way to measure the association is to
t¼1 t¼TP þ1
detect an additional midpoint between each pair of parts on
a limb and check for its incidence between candidate part
3.3 Confidence Maps for Part Detection detections, as shown in Fig. 5b. However, when people
To evaluate fS in Eq. (6) during training, we generate the crowd together—as they are prone to do—these midpoints
groundtruth confidence maps S from the annotated 2D are likely to support false associations (shown as green lines
keypoints. Each confidence map is a 2D representation of in Fig. 5b). Such false associations arise due to two limita-
the belief that a particular body part can be located in any tions in the representation: (1) it encodes only the position,
given pixel. Ideally, if a single person appears in the image, and not the orientation, of each limb; (2) it reduces the
a single peak should exist in each confidence map if the cor- region of support of a limb to a single point.
responding part is visible; if multiple people are in the Part Affinity Fields (PAFs) address these limitations.
image, there should be a peak corresponding to each visible They preserve both location and orientation information
part j for each person k. across the region of support of the limb (as shown in
We first generate individual confidence maps Sj;k for Fig. 5c). Each PAF is a 2D vector field for each limb, also
each person k. Let xj;k 2 R2 be the groundtruth position of shown in Fig. 1d. For each pixel in the area belonging to a

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
176 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO. 1, JANUARY 2021

particular limb, a 2D vector encodes the direction that


points from one part of the limb to the other. Each type of
limb has a corresponding PAF joining its two associated
body parts.
Consider a single limb shown in the figure below. Let
xj1 ;k and xj2 ;k be the groundtruth positions of body parts j1
and j2 from the limb c for per-
son k in the image. If a point p
Fig. 6. Graph matching. (a) Original image with part detections. (b)
lies on the limb, the value at K-partite graph. (c) Tree structure. (d) A set of bipartite graphs.
Lc;k ðpÞ is a unit vector that
points from j1 to j2 ; for all other
points, the vector is zero-valued. limbs. We score each candidate limb using the line integral
To evaluate fL in Eq. (6) during training, we define the computation on the PAF, defined in Eq. (11). The problem of
groundtruth PAF, Lc;k , at an image point p as finding the optimal parse corresponds to a K-dimensional
 matching problem that is known to be NP-Hard [54]
v if p on limb c; k (Fig. 6c). In this paper, we present a greedy relaxation that
Lc;k ðpÞ ¼ (9)
0 otherwise. consistently produces high-quality matches. We speculate
the reason is that the pair-wise association scores implicitly
Here, v ¼ ðxj2 ;k  xj1 ;k Þ=jjxj2 ;k  xj1 ;k jj2 is the unit vector in encode global context, due to the large receptive field of the
the direction of the limb. The set of points on the limb is PAF network.
defined as those within a distance threshold of the line seg- Formally, we first obtain a set of body part detection can-
ment, i.e., those points p for which didates DJ for multiple people, where DJ ¼ fdm j : for j 2
f1 . . . Jg; m 2 f1 . . . Nj gg, where Nj is the number of can-
0  v  ðp  xj1 ;k Þ  lc;k and jv?  ðp  xj1 ;k Þj  s l ;
didates of part j, and dm j 2 R is the location of the mth detec-
2

where the limb width s l is a distance in pixels, the limb tion candidate of body part j. These part detection candidates
length is lc;k ¼ jjxj2 ;k  xj1 ;k jj2 , and v? is a vector perpendicu- still need to be associated with other parts from the same per-
lar to v. son—in other words, we need to find the pairs of part detec-
The groundtruth part affinity field averages the affinity
tions that are in fact connected limbs. We define a variable
fields of all people in the image,
j1 j2 2 f0; 1g to indicate whether two detection candidates dj1
m
zmn
1 X  and dnj2 are connected, and the goal is to find the optimal assi-
Lc ðpÞ ¼ L ðpÞ; (10) gnment for the set of all possible connections, Z ¼ fzmn
nc ðpÞ k c;k j1 j2 :
for j1 ; j2 2 f1 . . . Jg; m 2 f1 . . . Nj1 g; n 2 f1 . . . Nj2 gg.
where nc ðpÞ is the number of non-zero vectors at point p If we consider a single pair of parts j1 and j2 (e.g., neck
across all k people. and right hip) for the cth limb, finding the optimal associa-
During testing, we measure association between can- tion reduces to a maximum weight bipartite graph match-
didate part detections by computing the line integral over ing problem [54]. This case is shown in Fig. 5b. In this graph
the corresponding PAF along the line segment connecting matching problem, nodes of the graph are the body part
the candidate part locations. In other words, we measure the detection candidates Dj1 and Dj2 , and the edges are all pos-
alignment of the predicted PAF with the candidate limb that sible connections between pairs of detection candidates.
would be formed by connecting the detected body parts. Additionally, each edge is weighted by Eq. (11)—the part
Specifically, for two candidate part locations dj1 and dj2 , we affinity aggregate. A matching in a bipartite graph is a sub-
sample the predicted part affinity field, Lc along the line seg- set of the edges chosen in such a way that no two edges
ment to measure the confidence in their association: share a node. Our goal is to find a matching with maximum
Z weight for the chosen edges,
u¼1
dj2  dj1
E¼ Lc ðpðuÞÞ  du; (11) X X
u¼0 jjdj2  dj1 jj2 max Ec ¼ max Emn  zmn
j1 j2 ; (13)
Zc Zc
m2Dj1 n2Dj2
where pðuÞ interpolates the position of the two body parts
dj1 and dj2 , X
s:t: 8m 2 Dj1 ; j1 j2  1;
zmn (14)
pðuÞ ¼ ð1  uÞdj1 þ udj2 : (12)
n2Dj2

In practice, we approximate the integral by sampling and


X
summing uniformly-spaced values of u. 8n 2 Dj2 ; j1 j2  1;
zmn (15)
m2Dj1
3.5 Multi-Person Parsing Using PAFs
We perform non-maximum suppression on the detection where Ec is the overall weight of the matching from limb
confidence maps to obtain a discrete set of part candidate type c, Z c is the subset of Z for limb type c, and Emn is the
locations. For each part, we may have several candidates, part affinity between parts dm n
j1 and dj2 defined in Eq. (11).
due to multiple people in the image or false positives Eqs. (14) and (15) enforce that no two edges share a node,
(Fig. 6b). These part candidates define a large set of possible i.e., no two limbs of the same type (e.g., left forearm) share a

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
CAO ET AL.: OPENPOSE: REALTIME MULTI-PERSON 2D POSE ESTIMATION USING PART AFFINITY FIELDS 177

Fig. 7. Importance of redundant PAF connections. (a) Two different people


are wrongly merged due to a wrong neck-nose connection. (b) The higher
confidence of the right ear-shoulder connection avoids the wrong nose-
neck link.
Fig. 8. Output of OpenPose, detecting body, foot, hand, and facial
keypoints in real-time. OpenPose is robust against occlusions including
during human-object interaction.
part. We can use the Hungarian algorithm [55] to obtain the
optimal matching.
When it comes to finding the full body pose of multiple 4 OPENPOSE
people, determining Z is a K-dimensional matching prob-
A growing number of computer vision and machine learn-
lem. This problem is NP-Hard [54] and many relaxations
ing applications require 2D human pose estimation as an
exist. In this work, we add two relaxations to the optimiza-
input for their systems [56], [57], [58], [59], [60], [61], [62]. To
tion, specialized to our domain. First, we choose a minimal
help the research community boost their work, we have
number of edges to obtain a spanning tree skeleton of
publicly released OpenPose [4], the first real-time multi-
human pose rather than using the complete graph, as
person system to jointly detect human body, foot, hand, and
shown in Fig. 6c. Second, we further decompose the match-
facial keypoints (in total 135 keypoints) on single images.
ing problem into a set of bipartite matching subproblems
See Fig. 8 for an example of the whole system.
and determine the matching in adjacent tree nodes indepen-
dently, as shown in Fig. 6d. We show detailed comparison
results in Section 5.1, which demonstrate that minimal 4.1 System
greedy inference well-approximates the global solution at a Available 2D body pose estimation libraries, such as Mask
fraction of the computational cost. The reason is that the R-CNN [5] or Alpha-Pose [6], require their users to imple-
relationship between adjacent tree nodes is modeled explic- ment most of the pipeline, their own frame reader (e.g.,
itly by PAFs, but internally, the relationship between non- video, images, or camera streaming), a display to visualize
adjacent tree nodes is implicitly modeled by the CNN. This the results, output file generation with the results (e.g., JSON
property emerges because the CNN is trained with a large or XML files), etc. In addition, existing facial and body key-
receptive field, and PAFs from non-adjacent tree nodes also point detectors are not combined, requiring a different
influence the predicted PAF. library for each purpose. OpenPose overcome all of these
With these two relaxations, the optimization is decom- problems. It can run on different platforms, including
posed simply as: Ubuntu, Windows, Mac OSX, and embedded systems (e.g.,
XC Nvidia Tegra TX2). It also provides support for different
max E ¼ max Ec : (16) hardware, such as CUDA GPUs, OpenCL GPUs, and CPU-
Z Zc
c¼1
only devices. The user can select an input between images,
We therefore obtain the limb connection candidates for each video, webcam, and IP camera streaming. He can also select
limb type independently using Eqs. (13), (14), and (15). whether to display the results or save them on disk, enable
With all limb connection candidates, we can assemble the or disable each detector (body, foot, face, and hand), enable
connections that share the same part detection candidates pixel coordinate normalization, control how many GPUs to
into full-body poses of multiple people. Our optimization use, skip frames for a faster processing, etc.
scheme over the tree structure is orders of magnitude faster OpenPose consists of three different blocks: (a) body
than the optimization over the fully connected graph [1], [2]. +foot detection, (b) hand detection [63], and (c) face detec-
Our current model also incorporates redundant PAF con- tion. The core block is the combined body+foot keypoint
nections (e.g., between ears and shoulders, wrists and detector (Section 4.2). It can alternatively use the original
shoulders, etc.). This redundancy particularly improves the body-only models [3] trained on COCO and MPII datasets.
accuracy in crowded images, as shown in Fig. 7. To handle Based on the output of the body detector, facial bounding
these redundant connections, we slightly modify the multi- box proposals can roughly be estimated from some body
person parsing algorithm. While the original approach part locations, in particular ears, eyes, nose, and neck. Anal-
started from a root component, our algorithm sorts all pair- ogously, the hand bounding box proposals are generated
wise possible connections by their PAF score. If a connec- with the arm keypoints. This methodology inherits the
tion tries to connect 2 body parts which have already been problems of top-down approaches discussed in Section 1.
assigned to different people, the algorithm recognizes that The hand keypoint detector algorithm is explained in fur-
this would contradict a PAF connection with a higher confi- ther detail in [63], while the facial keypoint detector has
dence, and the current connection is subsequently ignored. been trained in the same fashion as that of the hand

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
178 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO. 1, JANUARY 2021

Fig. 9. Foot keypoint analysis. (a) Foot keypoint annotations, consisting Fig. 10. Keypoint annotation configuration for the 3 datasets.
of big toes, small toes, and heels. (b) Body-only model example at which
right ankle is not properly estimated. (c) Analogous body+foot model
example, the foot information helps predict the right ankle location.
to predict both the body and foot locations. Fig. 10 shows
the keypoint distribution for the three datasets (COCO,
keypoint detector. The library also includes 3D keypoint MPII, and COCO+foot). The body+foot model also incorpo-
pose detection, by performing 3D triangulation with non- rates an interpolated point between the hips to allow the
linear Levenberg-Marquardt refinement [64] over the connection of both legs even when the upper torso is
results of multiple synchronized camera views. occluded or out of the image. We find evidence that foot
The inference time of OpenPose outperforms all state-of- keypoint detection implicitly helps the network to more
the-art methods, while preserving high-quality results. It is accurately predict some body keypoints, in particular leg
able to run at about 22 FPS in a machine with a Nvidia keypoints, such as ankle locations. Fig. 9b shows an exam-
GTX 1080 Ti while preserving high accuracy (Section 5.3). ple where the body-only network was not able to predict
OpenPose has already been used by the research commu- ankle location. By including foot keypoints during training,
nity for many vision and robotics topics, such as person re- while maintaining the same body annotations, the algo-
identification [56], GAN-based video retargeting of human rithm can properly predict the ankle location in Fig. 9c. We
faces [57] and bodies [58], Human-Computer Interaction [59], quantitatively analyze the accuracy difference in Section 5.5.
3D pose estimation [60], and 3D human mesh model genera-
tion [61]. In addition, the OpenCV library [65] has included 5 DATASETS AND EVALUATIONS
OpenPose and our PAF-based network architecture within
its Deep Neural Network (DNN) module. We evaluate our method on three benchmarks for multi-per-
son pose estimation: (1) MPII human multi-person data-
4.2 Extended Foot Keypoint Detection set [66], which consists of 3844 training and 1758 testing
Existing human pose datasets ([66], [67]) contain limited groups of multiple interacting individuals in highly articu-
body part types. The MPII dataset [66] annotates ankles, lated poses with 14 body parts; (2) COCO keypoint challenge
knees, hips, shoulders, elbows, wrists, necks, torsos, and dataset [67], which requires simultaneously detecting people
head tops, while COCO [67] also includes some facial key- and localizing 17 keypoints (body parts) in each person
points. For both of these datasets, foot annotations are lim- (including 12 human body parts and 5 facial keypoints); (3)
ited to ankle position only. However, graphics applications our foot dataset, which is a subset of 15K annotations out of
such as avatar retargeting or 3D human shape reconstruc- the COCO keypoint dataset. These datasets collect images in
tion ([61], [68]) require foot keypoints such as big toe and diverse scenarios that contain many real-world challenges
heel. Without foot information, these approaches suffer such as crowding, scale variation, occlusion, and contact.
from problems such as the candy wrapper effect, floor pene- Our approach placed first at the inaugural COCO 2016 key-
tration, and foot skate. To address these issues, a small sub- points challenge [70], and significantly exceeded the pre-
set of foot instances out of the COCO dataset is labeled vious state-of-the-art results on the MPII multi-person
using the Clickworker platform [69]. It is split up with 14K benchmark. We also provide runtime analysis comparison
annotations from the COCO training set and 545 from the against Mask R-CNN and Alpha-Pose to quantify the effi-
validation set. A total of 6 foot keypoints are labeled (see ciency of the system and analyze the main failure cases.
Fig. 9a). We consider the 3D coordinate of the foot keypoints Fig. 17 shows some qualitative results from our algorithm.
rather than the surface position. For instance, for the exact
toe positions, we label the area between the connection of 5.1 Results on the MPII Multi-Person Dataset
the nail and skin, and also take depth into consideration by For comparison on the MPII dataset, we use the toolkit [1] to
labeling the center of the toe rather than the surface. measure mean Average Precision (mAP) of all body parts
Using our dataset, we train a foot keypoint detection following the “PCKh” metric from [66]. Table 1 compares
algorithm. A n€aive foot keypoint detector could have been mAP performance between our method and other appro-
built by using a body keypoint detector to generate foot aches on the official MPII testing sets. We also compare the
bounding box proposals, and then training a foot detector average inference/optimization time per image in seconds.
on top of it. However, this method suffers from the top- For the 288 images subset, our method outperforms previous
down problems stated in Section 1. Instead, the same archi- state-of-the-art bottom-up methods [2] by 8.5 percent mAP.
tecture previously described for body estimation is trained Remarkably, our inference time is 6 orders of magnitude

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
CAO ET AL.: OPENPOSE: REALTIME MULTI-PERSON 2D POSE ESTIMATION USING PART AFFINITY FIELDS 179

TABLE 1
Results on the MPII Dataset

Method Hea Sho Elb Wri Hip Kne Ank mAP s/image
Subset of 288 images as in [1]
Deepcut [1] 73.4 71.8 57.9 39.9 56.7 44.0 32.0 54.1 57995
Iqbal et al. [41] 70.0 65.2 56.4 46.1 52.7 47.9 44.5 54.7 10
DeeperCut [2] 87.9 84.0 71.9 63.9 68.8 63.8 58.1 71.2 230
Newell et al. [48] 91.5 87.2 75.9 65.4 72.2 67.0 62.1 74.5 -
ArtTrack [47] 92.2 91.3 80.8 71.4 79.1 72.6 67.8 79.3 0.005
Fang et al. [6] 89.3 88.1 80.7 75.5 73.7 76.7 70.0 79.1 -
Ours 92.9 91.3 82.3 72.6 76.0 70.9 66.8 79.0 0.005
Full testing set Fig. 11. mAP curves over different PCKh thresholds on MPII validation
set. (a) mAP curves of self-comparison experiments. (b) mAP curves of
DeeperCut [2] 78.4 72.5 60.2 51.0 57.2 52.0 45.4 59.5 485 PAFs across stages.
Iqbal et al. [41] 58.4 53.9 44.5 35.0 42.2 36.7 31.1 43.1 10
Levinko et al. [71] 89.8 85.2 71.8 59.6 71.1 63.0 53.5 70.6 -
ArtTrack [47] 88.8 87.0 75.9 64.9 74.2 68.8 60.5 74.3 0.005 a tree versus 91 edges of a graph) needed facilitates the
Fang et al. [6] 88.4 86.5 78.6 70.4 74.4 73.0 65.8 76.7 -
Newell et al. [48] 92.1 89.3 78.9 69.8 76.2 71.6 64.7 77.5 -
training convergence.
Fieraru et al. [72] 91.8 89.5 80.4 69.6 77.3 71.7 65.5 78.0 - Fig. 11a shows an ablation analysis on our validation
Ours (one scale) 89.0 84.9 74.9 64.2 71.0 65.6 58.1 72.5 0.005 set. For the threshold of PCKh-0.5 [66], the accuracy of our
Ours 91.2 87.6 77.7 66.8 75.4 68.9 61.7 75.6 0.005
PAF method is 2.9 percent higher than one-midpoint and
Top: Comparison results on the testing subset defined in [1]. Middle: Compari- 2.3 percent higher than two intermediate points, generally
son results on the whole testing set. Testing without scale search is denoted outperforming the method of midpoint representation. The
as “(one scale)”. PAFs, which encode both position and orientation informa-
tion of human limbs, are better able to distinguish the com-
less. We report a more detailed runtime analysis in Section mon cross-over cases, e.g., overlapping arms. Training with
5.3. For the entire MPII testing set, our method without scale masks of unlabeled persons further improves the perfor-
search already outperforms previous state-of-the-art meth- mance by 2.3 percent because it avoids penalizing the true
ods by a large margin, i.e., 13 percent absolute increase on positive prediction in the loss during training. If we use the
mAP. Using a 3 scale search (0:7, 1 and 1:3) further ground-truth keypoint location with our parsing algorithm,
increases the performance to 75.6 percent mAP. The mAP we can obtain a mAP of 88.3 percent. In Fig. 11a, the mAP
comparison with previous bottom-up approaches indicate obtained using our parsing with GT detection is constant
the effectiveness of our novel feature representation, PAFs, across different PCKh thresholds due to no localization error.
to associate body parts. Based on the tree structure, our Using GT connection with our keypoint detection achieves a
greedy parsing method achieves better accuracy than a mAP of 81.6 percent. It is notable that our parsing algorithm
graphcut optimization formula based on a fully connected based on PAFs achieves a similar mAP as when based on GT
graph structure [1], [2]. connections (79.4 versus 81.6 percent). This indicates parsing
In Table 2, we show comparison results for the different based on PAFs is quite robust in associating correct part
skeleton structures shown in Fig. 6. We created a custom detections. Fig. 11b shows a comparison of performance
validation set consisting of 343 images from the original across stages. The mAP increases monotonically with the
MPII training set. We train our model based on a fully con- iterative refinement framework. Fig. 4 shows the qualitative
nected graph, and compare results by selecting all edges improvement of the predictions over stages.
(Fig. 6b, approximately solved by Integer Linear Program-
ming), and minimal tree edges (Fig. 6c, approximately
5.2 Results on the COCO Keypoints Challenge
solved by Integer Linear Programming, and Fig. 6d, solved
by the greedy algorithm presented in this paper). Both The COCO training set consists of over 100K person inst-
methods yield similar results, demonstrating that it is suffi- ances labeled with over 1 million keypoints. The testing
cient to use minimal edges. We trained our final model to set contains “test-challenge” and “test-dev”subsets, which
only learn the minimal edges to fully utilize the network have roughly 20K images each. The COCO evaluation def-
capacity, denoted as Fig. 6d (sep). This approach outper- ines the object keypoint similarity (OKS) and uses the mean
forms Fig. 6c and even Fig. 6b, while maintaining efficiency. average precision (AP) over 10 OKS thresholds as the main
The fewer number of part association channels (13 edges of competition metric [70]. The OKS plays the same role as the
IoU in object detection. It is calculated from the scale of
the person and the distance between predicted and GT
points. Table 3 shows results from top teams in the challenge.
TABLE 2 It is noteworthy that our method has a higher drop in accu-
Comparison of Different Structures on Our Custom racy when considering only people of higher scales (APL ).
Validation Set
In Table 4, we report self-comparisons on the COCO vali-
Method Hea Sho Elb Wri Hip Kne Ank mAP s/image dation set. If we use the GT bounding box and a single person
CPM [20], we can achieve an upper-bound for the top-down
Fig. 6b 91.8 90.8 80.6 69.5 78.9 71.4 63.8 78.3 362
Fig. 6c 92.2 90.8 80.2 69.2 78.5 70.7 62.6 77.6 43 approach using CPM, which is 62.7 percent AP. If we use the
Fig. 6d 92.0 90.7 80.0 69.4 78.4 70.1 62.3 77.4 0.005 state-of-the-art object detector, Single Shot MultiBox Detector
Fig. 6d (sep) 92.4 90.4 80.9 70.8 79.5 73.1 66.5 79.1 0.005 (SSD)[74], the performance drops 10 percent. This compari-
son indicates the performance of top-down approaches rely

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
180 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 43, NO. 1, JANUARY 2021

TABLE 3 TABLE 5
COCO Test-Dev Leaderboard [73], “*” Indicates That No Self-Comparison Experiments on the COCO Validation Set
Citation Was Provided
Method AP AP50 AP75 APM APL Stages
50 75 M L
Team AP AP AP AP AP 5 PAF - 1 CM 65.3 85.2 71.3 62.2 70.7 6
Top-Down Approaches 4 PAF - 2 CM 65.2 85.3 71.4 62.3 70.1 6
Megvii [43] 78.1 94.1 85.9 74.5 83.3 3 PAF - 3 CM 65.0 85.1 71.2 62.4 69.4 6
MRSA [44] 76.5 92.4 84.0 73.0 82.7 4 PAF - 1 CM 64.8 85.3 70.9 61.9 69.6 5
The Sea Monsters* 75.9 92.1 83.0 71.7 82.1 3 PAF - 1 CM 64.6 84.8 70.6 61.8 69.5 4
Alpha-Pose [6] 71.0 87.9 77.7 69.0 75.2 3 CM - 3 PAF 61.0 83.9 65.7 58.5 65.3 6
Mask R-CNN [5] 69.2 90.4 76.0 64.9 76.3 CM refers to confidence map, while the numbers express the number of estima-
Bottom-Up Approaches tion stages for PAF and CM. Stages refers to the number of PAF and CM
stages. Reducing the number of stages increases the runtime performance.
METU [50] 70.5 87.7 77.2 66.1 77.3
TFMAN* 70.2 89.2 77.0 65.6 76.3
PersonLab [49] 68.7 89.0 75.4 64.1 75.5 over ReLU layers and Adam optimization instead of SGD
Associative Emb. [48] 65.5 86.8 72.3 60.6 72.6
Ours 64.2 86.2 70.1 61.0 68.8 with momentum. Differently to [3], we do not refine the cur-
Ours [3] 61.8 84.9 67.5 57.1 68.2 rent approach with CPM [20] to avoid harming the speed.

Top: some of the highest top-down results. Bottom: highest bottom-up results.
5.3 Inference Runtime Analysis
We compare 3 state-of-the-art, well-maintained, and widely-
heavily on the person detector. In contrast, our original bot-
used multi-person pose estimation libraries, OpenPose [4],
tom-up method achieves 58.4 percent AP. If we refine the
based on this work, Mask R-CNN [5], and Alpha-Pose [6].
results by applying a single person CPM on each rescaled
We analyze the inference runtime performance of the 3 meth-
region of the estimated persons parsed by our method, we
ods in Fig. 12. Megvii (Face++) [43] and MSRA [44] GitHub
gain a 2.6 percent overall AP increase. We only update esti-
repositories do not include the person detector they use and
mations on predictions in which both methods roughly agree,
only provide pose estimation results given a cropped person.
resulting in improved precision and recall. The new architec-
Thus, we cannot know their exact runtime performance and
ture without CPM refinement is approximately 7 percent
have been excluded from this analysis. Mask R-CNN is only
more accurate than the original approach, while increasing
compatible with Nvidia graphics cards, so we perform the
the speed 2.
analysis on a system with a NVIDIA 1080 Ti. As top-down
We analyze the effect of PAF refinement over confidence
approaches, the inference times of Mask R-CNN, Alpha-
map estimation in Table 5. We fix the computation to a max-
Pose, Megvii, and MSRA are roughly proportional to the
imum of 6 stages, distributed differently across the PAF and
number of people in the image. To be more precise, they are
confidence map branches. We can extract 3 conclusions
proportional to the number of proposals that their person
from this experiment. First, PAF requires a higher number
detectors extract. In contrast, the inference time of our bot-
of stages to converge and benefits more from refinement
tom-up approach is invariant to the number of people in the
stages. Second, increasing the number of PAF channels
image. The runtime of OpenPose consists of two major parts:
mainly improves the number of true positives, even though
(1) CNN processing time whose complexity is Oð1Þ, constant
they might not be too accurate (higher AP 50 ). However,
with varying number of people; (2) multi-person parsing
increasing the number of confidence map channels further
improves the localization accuracy (higher AP 75 ). Third, we
prove that the accuracy of the part confidence maps highly
increases when using PAF as a prior, while the opposite
results in a 4 percent absolute accuracy decrease. Even the
model with only 4 stages (3 PAF - 1 CM) is more accurate
than the computationally more expensive 6-stage model
that first predicts confidence maps (3 CM - 3 PAF). Some
other additions that further increased the accuracy of the
new models with respect to the original work are PReLU

TABLE 4
Self-Comparison Experiments on the COCO Validation Set

Method AP AP50 AP75 APM APL


Fig. 12. Inference time comparison between OpenPose, Mask R-CNN,
GT Bbox + CPM [20] 62.7 86.0 69.3 58.5 70.6 and Alpha-Pose (fast Pytorch version). While OpenPose inference time
SSD [74] + CPM [20] 52.7 71.1 57.2 47.0 64.2 is invariant, Mask R-CNN and Alpha-Pose runtimes grow linearly with
Ours [3] 58.4 81.5 62.6 54.4 65.1 the number of people. Testing with and without scale search is denoted
+ CPM refinement 61.0 84.9 67.5 56.3 69.3 as “max accuracy” and “1 scale”, respectively. This analysis was per-
Ours 65.3 85.2 71.3 62.2 70.7 formed using the same images for each algorithm and a batch size of 1.
Each analysis was repeated 1000 times and then averaged. This was all
Our new body+foot model outperforms the original work in [3] by 6.9 percent. performed on a system with a Nvidia 1080 Ti and CUDA 8.

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on February 02,2023 at 02:54:45 UTC from IEEE Xplore. Restrictions apply.
CAO ET AL.: OPENPOSE: REALTIME MULTI-PERSON 2D POSE ESTIMATION USING PART AFFINITY FIELDS 181

TABLE 6
Runtime Difference between the 3 Models Released
in OpenPose with CUDA and CPU-Only Versions,
Running in a NVIDIA GeForce GTX-1080 Ti GPU
and a i7-6850K CPU

Method CUDA CPU-only


Original MPII model 73 ms 2309 ms
Original COCO model 74 ms 2407 ms
Body+foot model 36 ms 10396 ms
Fig. 13. Trade-off between speed and accuracy for the main entries of
MPII and COCO models refer to our work in [3]. the COCO Challenge. We only consider those approaches that either
release their runtime measurements (methods with an asterisk) or their
code (rest). Algorithms with several values represent different resolution
time, whose complexity is Oðn2 Þ, where n represents the configurations. AlphaPose, METU, and single-scale OpenPose provide
number of people. However, the parsing time is two orders the best results considering the trade-off between speed and accuracy.
of magnitude less than the CNN processing time. For The remaining methods are both slower and less accurate than at least
one of these 3 approaches.
instance, the parsing takes 0.58 ms for 9 people while the
CNN takes 36 ms.
In Table 6, we analyze the difference in inference time both slower and less accurate than at least one of these 3
between the models released in OpenPose, i.e., the MPII methods. Overall, top-down approaches (e.g., AlphaPose)
and COCO models from [3] and the new body+foot model. provide better results for images with few people, but their
Our new combined model is not only more accurate, but is speed considerably drops for images with many people. We
also 2 faster than the original model when using the GPU also observe that the accuracy metrics might be misleading.
version. Interestingly, the runtime for the CPU version is We see in Section 5.2 that PersonLab [49] achieves higher
5x slower compared to that of the original model. The new accuracy than our method. However, our multi-scale
architecture consists of many more layers, which requires a approach simultaneously provides both higher speed and
higher amount of memory, while the number of operations accuracy than the versions for which they report runtime
is significantly fewer. Graphic cards seem to benefit more results. Note that no runtime results are provided in [49] for
from the reduction in number of operations, while the CPU their most accurate (but slower) models.
version seems to be significantly slower due to the higher
memory requirements. OpenCL and CUDA performance 5.5 Results on the Foot Keypoint Dataset
cannot be directly compared to each other, as they require To evaluate the foot keypoint detection results obtained
different hardware, in particular, different GPU brands. using our foot keypoint dataset, we calculate the mean aver-
age precision and recall over 10 OKS, as done in the COCO
5.4 Trade-off between Speed and Accuracy evaluation metric. There are only minor differences between
In the area of object detection, Huang et al. [75] show that the combined and body-only approaches. In the combined
region-proposed methods (e.g., Faster-rcnn [76]) achieve training scheme, there exist two separate and completely
higher accuracy, while single-shot methods (e.g., YOLO [77], independent datasets. The larger of the two datasets con-
SSD [74]) present higher runtime performance. Analogously sists of the body annotations while the smaller set contains
in human pose estimation, we observe that top-down both body and foot annotations. The same batch size used
approaches also present higher accuracy but lower speed for the body-only training is used for the combined training.
compared to bottom-up methods, especially for images with Nevertheless, it contains only annotations from one dataset
multiple people. The main reason for the lower accuracy of at a time. A probability ratio is defined to select the dataset
bottom-up approaches is their limited resolution. While top- from which to pick each batch. A higher probability is
down methods individually crop and feed each detected per- assigned to select a batch from the larger dataset, as the
son into their networks, bottom-up methods have to feed the number of annotations and diversity is much higher. Foot
whole image at once, resulting in smaller resolution per per- keypoints are masked out during the back-propagation
son. For instance, Moon et al. [78] show that refinement over pass of the body-only dataset to avoid harming the net with
our original work in [3] (by applying a larger cropped image non-labeled data. In addition, body annotations are also
patch) results in a higher accuracy boost than refinement masked out from the foot dataset. Keeping these annota-
over other top-down approaches. As hardware gets faster tions yields a small drop in accuracy, probably due to over-
and increases its memory, bottom-up methods with higher fitting, as those samples are repeated in both datasets.
resolution might be able to reduce the accuracy gap with Table 7 shows the foot keypoint accuracy for our valida-
respect to top-down approaches. tion set. This set is created from a subset of the COCO vali-
Additionally, current human pose performance metrics are purely based on keypoint accuracy, while speed is ignored. In order to provide a more complete comparison, we display both speed and accuracy for the top entries of the COCO Challenge in Fig. 13. Given those results, single-scale OpenPose should be chosen for maximum speed, AlphaPose for maximum accuracy, and METU for a trade-off between both of them. The remaining approaches are both slower and less accurate than at least one of these three methods.

We see in Section 5.2 that PersonLab [49] achieves higher accuracy than our method. However, our multi-scale approach simultaneously provides both higher speed and accuracy than the versions for which they report runtime results. Note that no runtime results are provided in [49] for their most accurate (but slower) models.

5.5 Results on the Foot Keypoint Dataset
To evaluate the foot keypoint detection results obtained using our foot keypoint dataset, we calculate the mean average precision and recall over 10 OKS thresholds, as done in the COCO evaluation metric. There are only minor differences between the combined and body-only approaches. In the combined training scheme, there exist two separate and completely independent datasets: the larger of the two consists of the body annotations, while the smaller set contains both body and foot annotations. The same batch size used for the body-only training is used for the combined training; nevertheless, each batch contains annotations from only one dataset at a time. A probability ratio is defined to select the dataset from which to pick each batch. A higher probability is assigned to selecting a batch from the larger dataset, as the number of annotations and their diversity are much higher. Foot keypoints are masked out during the back-propagation pass of the body-only dataset to avoid harming the net with non-labeled data. In addition, body annotations are also masked out from the foot dataset; keeping these annotations yields a small drop in accuracy, probably due to overfitting, as those samples are repeated in both datasets.
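For illustration, the batch selection and masking scheme described above can be summarized in a short sketch. The channel layout (N_BODY, N_FOOT), the dataset handles, and the loss form are simplified stand-ins, since the paper does not specify the training pipeline at this level of detail; this is a minimal NumPy sketch, not the released implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical channel layout: the first N_BODY confidence-map channels
    # correspond to body parts, the remaining N_FOOT to foot parts.
    N_BODY, N_FOOT = 25, 6
    P_BODY_ONLY = 0.95  # probability of drawing the batch from the larger dataset

    def pick_dataset(body_only_data, body_foot_data):
        """Each batch comes entirely from one dataset, chosen by a fixed ratio."""
        return body_only_data if rng.random() < P_BODY_ONLY else body_foot_data

    def channel_mask(batch_is_body_only):
        """Zero the gradient of channels that have no ground-truth labels:
        foot channels for body-only batches; body channels for foot batches
        (body annotations are masked out of the foot dataset)."""
        mask = np.ones(N_BODY + N_FOOT, dtype=np.float32)
        if batch_is_body_only:
            mask[N_BODY:] = 0.0
        else:
            mask[:N_BODY] = 0.0
        return mask

    def masked_l2(pred, target, mask):
        """Per-channel masked L2 loss for heatmaps of shape [C, H, W]."""
        diff = (pred - target) * mask[:, None, None]
        return float(np.sum(diff ** 2))

    # Minimal demonstration with random heatmaps.
    print(pick_dataset("COCO body-only", "COCO+foot"))
    pred = rng.standard_normal((N_BODY + N_FOOT, 46, 46)).astype(np.float32)
    target = np.zeros_like(pred)
    print(masked_l2(pred, target, channel_mask(batch_is_body_only=True)))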

Table 7 shows the foot keypoint accuracy for our validation set. This set is created from a subset of the COCO validation set, in particular from the images in which the ankles of all people are visible and annotated. This results in a simpler validation set compared to that of COCO, leading to higher precision and recall numbers compared to those of body detection (Table 4). Qualitatively, we find a higher amount of jitter and a larger number of detection errors compared to body keypoint prediction. We believe 14K training annotations are not a sufficient number to train a robust foot detector, considering that over 100K instances are used for the body keypoint dataset. Rather than using the whole batch with either only foot or only body annotations, we also tried using a mixed batch where samples from both datasets (either COCO or COCO+foot) could be fed to the same batch, maintaining the same probability ratio. However, the network accuracy was slightly reduced: by mixing the datasets with an unbalanced ratio, we effectively assign a very small batch size for foot, hindering foot convergence.

TABLE 7
Foot Keypoint Analysis on the Foot Validation Set

Method                            AP    AR    AP75  AR75
Body+foot model (5 PAF - 1 CM)    77.9  82.5  82.1  85.6

In Table 8, we show that there is almost no accuracy difference in the COCO test-dev set with respect to the same network architecture trained only with body annotations. We compared the model consisting of 5 PAF and 1 confidence map stages, with a 95 percent probability of picking a batch from the COCO body-only dataset, and 5 percent of choosing from the body+foot dataset. There is no architecture difference compared to the body-only model other than the increase in the number of outputs to include the foot CM and PAFs.

TABLE 8
Self-Comparison Experiments for Body on the COCO Validation Set

Method                      AP    AP50  AP75  APM   APL
Body-only (5 PAF - 1 CM)    65.2  85.0  70.9  62.1  70.5
Body+foot (5 PAF - 1 CM)    65.3  85.2  71.3  62.2  70.7

Foot keypoints are predicted but ignored for the evaluation.
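For reference, the AP and AR numbers above follow the COCO-style evaluation based on Object Keypoint Similarity (OKS), averaged over 10 thresholds. The sketch below simplifies the official COCO toolkit: the per-keypoint falloff constants and the detection-matching and precision-recall machinery are omitted, so it is illustrative rather than the evaluation code actually used.

    import numpy as np

    def oks(pred, gt, visible, area, k):
        """Object Keypoint Similarity between a predicted and a ground-truth
        pose. pred, gt: [N, 2] keypoint coordinates; visible: [N] booleans;
        area: ground-truth object area; k: [N] per-keypoint constants."""
        d2 = np.sum((pred - gt) ** 2, axis=1)
        sim = np.exp(-d2 / (2.0 * area * k ** 2 + np.spacing(1)))
        return float(sim[visible].mean()) if visible.any() else 0.0

    # AP/AR are averaged over 10 OKS thresholds, 0.50:0.05:0.95, as in COCO.
    THRESHOLDS = np.arange(0.50, 1.00, 0.05)

    def mean_over_thresholds(oks_scores):
        """Toy average: fraction of one-to-one matched poses above each
        threshold. The real metric additionally ranks detections by score
        and integrates a precision-recall curve."""
        s = np.asarray(oks_scores)
        return float(np.mean([(s >= t).mean() for t in THRESHOLDS]))

    # Example: two good matches and one poor one.
    print(mean_over_thresholds([0.98, 0.93, 0.41]))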
5.6 Vehicle Pose Estimation
Our approach is not limited to human body or foot keypoints, but can be generalized to any keypoint annotation task.
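In practice, such a generalization mostly amounts to redefining the part list and the limb (PAF) connections, which in turn fix the number of network output channels. The sketch below uses a hypothetical vehicle skeleton; the part names and connections are illustrative and do not reproduce the actual definition used for the dataset of [79], which the vehicle experiment described next is built on.

    # Hypothetical vehicle skeleton (names and connections are assumptions).
    VEHICLE_PARTS = [
        "front_left_wheel", "front_right_wheel", "rear_left_wheel",
        "rear_right_wheel", "front_left_light", "front_right_light",
        "rear_left_light", "rear_right_light",
    ]
    VEHICLE_LIMBS = [  # pairs of part indices connected by a PAF
        (0, 1), (2, 3), (0, 2), (1, 3), (4, 5), (6, 7), (0, 4), (1, 5),
    ]

    # One confidence map per part (plus background), and one 2D vector
    # field (x and y channels) per limb, exactly as for the human body.
    n_confidence_maps = len(VEHICLE_PARTS) + 1
    n_paf_channels = 2 * len(VEHICLE_LIMBS)
    print(n_confidence_maps, n_paf_channels)  # 9 16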







To demonstrate this, we have run the same network architecture for the task of vehicle keypoint detection [79]. Once again, we use mean average precision over 10 OKS thresholds for the evaluation. The results are shown in Table 9. Both the average precision and recall are higher than in the body keypoint task, mainly because we are using a smaller and simpler dataset. This initial dataset consists of image annotations from 19 different cameras. We use the frames from the first 18 cameras as a training set, and the frames from the last camera as a validation set. No variations in the model architecture or training parameters have been made. We show qualitative results in Fig. 14.

TABLE 9
Vehicle Keypoint Validation Set

Method                     AP    AR    AP75  AR75
Vehicle keypoint detector  70.1  77.4  73.0  79.7

Fig. 14. Vehicle keypoint detection examples from the validation set. The keypoint locations are successfully estimated under challenging scenarios, including overlapping between cars, cropped vehicles, and different scales.
5.7 Failure Case Analysis
We have analyzed the main cases where the current approach fails in the MPII, COCO, and COCO+foot validation sets. Fig. 15 shows an overview of the main body failure cases, while Fig. 16 shows the main foot failure cases. Fig. 15a refers to atypical poses and upside-down examples, where the predictions usually fail. Increasing the rotation augmentation visually seems to partially solve these issues, but the global accuracy on the COCO validation set is reduced by about 5 percent. A different alternative is to run the network using different rotations and keep the poses with the highest confidence, as sketched below. Body occlusion can also lead to false negatives and high localization error. This problem is inherited from the dataset annotations, in which occluded keypoints are not included. In highly crowded images where people are overlapping, the approach tends to merge annotations from different people while missing others, due to the overlapping PAFs that make the greedy multi-person parsing fail. Animals and statues also frequently lead to false-positive errors. This issue could be mitigated by adding more negative examples during training to help the network distinguish between humans and other humanoid figures.

Fig. 15. Common failure cases: (a) rare pose or appearance, (b) missing or false part detections, (c) overlapping parts, i.e., part detections shared by two persons, (d) wrong connection associating parts from two persons, (e-f) false positives on statues or animals.

Fig. 16. Common foot failure cases: (a) foot or leg occluded by the body, (b) foot or leg occluded by another object, (c) foot visible but leg occluded, (d) shoe and foot not aligned, (e) false negatives when the foot is visible but the rest of the body is occluded, (f) soles of the feet are usually not detected (rare in training), (g) swap between right and left body parts.

Fig. 17. Results containing viewpoint and appearance variation, occlusion, crowding, contact, and other common imaging artifacts.
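A minimal sketch of that multi-rotation alternative follows. Here run_openpose is a hypothetical wrapper around the detector returning a list of [N, 3] pose arrays of (x, y, score) rows, and the overall confidence is taken as the mean keypoint score, which is one plausible choice among several; mapping the winning poses back to the original image frame is omitted.

    import numpy as np

    def rotation_ensemble(image, run_openpose, angles=(0, 90, 180, 270)):
        """Run the detector on several rotated copies of the image and keep
        the result with the highest mean keypoint confidence."""
        best_poses, best_conf = [], float("-inf")
        for angle in angles:
            rotated = np.rot90(image, k=angle // 90)  # lossless right-angle rotation
            poses = run_openpose(rotated)
            if not poses:
                continue
            conf = float(np.mean([p[:, 2].mean() for p in poses]))
            if conf > best_conf:
                best_conf, best_poses = conf, poses
        return best_poses  # note: still expressed in the winning rotated frame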
ited: People detection and articulated pose estimation,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1014–1021.
[11] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Poselet
6 CONCLUSION conditioned pictorial structures,” in Proc. IEEE Conf. Comput. Vis.
Realtime multi-person 2D pose estimation is a critical compo- Pattern Recognit., 2013, pp. 588–595.
[12] Y. Yang and D. Ramanan, “Articulated human detection with
nent in enabling machines to visually understand and inter- flexible mixtures of parts,” IEEE Trans. Pattern Anal. Mach. Intell.,
pret humans and their interactions. In this paper, we present vol. 35, no. 12, pp. 2878–2890, Dec. 2013.
an explicit nonparametric representation of the keypoint asso- [13] S. Johnson and M. Everingham, “Clustered pose and nonlinear
ciation that encodes both position and orientation of human appearance models for human pose estimation,” in Proc. British
Mach. Vis. Conf., 2010, pp. 5–15.
limbs. Second, we design an architecture that jointly learns [14] Y. Wang and G. Mori, “Multiple tree models for occlusion and
part detection and association. Third, we demonstrate that a spatial constraints in human pose estimation,” in Proc. Eur. Conf.
greedy parsing algorithm is sufficient to produce high-quality Comput. Vis., 2008, pp. 710–724.
[15] L. Sigal and M. J. Black, “Measure locally, reason globally: Occlu-
parses of body poses, and preserves efficiency regardless of
sion-sensitive articulated pose estimation,” in Proc. IEEE Conf.
the number of people. Fourth, we prove that PAF refinement Comput. Vis. Pattern Recognit., 2006, pp. 2041–2048.
is far more important than combined PAF and body part loca- [16] X. Lan and D. P. Huttenlocher, “Beyond trees: Common-factor
tion refinement, leading to a substantial increase in both models for 2D human pose recovery,” in Proc. IEEE Int. Conf. Com-
put. Vis., 2005, pp. 470–477.
runtime performance and accuracy. Fifth, we show that com- [17] L. Karlinsky and S. Ullman, “Using linking features in learning
bining body and foot estimation into a single model boosts non-parametric part models,” in Proc. Eur. Conf. Comput. Vis.,
the accuracy of each component individually and reduces the 2012, pp. 326–339.
inference time of running them sequentially. We have created [18] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, “Human pose
estimation using body parts dependent joint regressors,” in Proc.
a foot keypoint dataset consisting of 15K foot keypoint instan- IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3041–3048.
ces, which we will publicly release. Finally, we have open- [19] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks
sourced this work as OpenPose [4], the first realtime system for human pose estimation,” in Proc. Eur. Conf. Comput. Vis., 2016,
pp. 2597–2602.
for body, foot, hand, and facial keypoint detection. The library [20] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convo-
is being widely used today for many research topics involving lutional pose machines,” in Proc. IEEE Conf. Comput. Vis. Pattern
human analysis, such as human re-identification, retargeting, Recognit., 2016, pp. 4724–4732.

REFERENCES
[1] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, "Deepcut: Joint subset partition and labeling for multi person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4929–4937.
[2] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "Deepercut: A deeper, stronger, and faster multi-person pose estimation model," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 34–50.
[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1302–1310.
[4] G. Hidalgo, Z. Cao, T. Simon, S.-E. Wei, H. Joo, and Y. Sheikh, "OpenPose library," [Online]. Available: https://github.com/CMU-Perceptual-Computing-Lab/openpose
[5] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[6] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, "RMPE: Regional multi-person pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2353–2362.
[7] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," Int. J. Comput. Vis., vol. 61, pp. 55–79, 2005.
[8] D. Ramanan, D. A. Forsyth, and A. Zisserman, "Strike a pose: Tracking people by finding stylized poses," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 271–278.
[9] M. Andriluka, S. Roth, and B. Schiele, "Monocular 3D pose estimation and tracking by detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 623–630.
[10] M. Andriluka, S. Roth, and B. Schiele, "Pictorial structures revisited: People detection and articulated pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1014–1021.
[11] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, "Poselet conditioned pictorial structures," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 588–595.
[12] Y. Yang and D. Ramanan, "Articulated human detection with flexible mixtures of parts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2878–2890, Dec. 2013.
[13] S. Johnson and M. Everingham, "Clustered pose and nonlinear appearance models for human pose estimation," in Proc. British Mach. Vis. Conf., 2010, pp. 5–15.
[14] Y. Wang and G. Mori, "Multiple tree models for occlusion and spatial constraints in human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 710–724.
[15] L. Sigal and M. J. Black, "Measure locally, reason globally: Occlusion-sensitive articulated pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2006, pp. 2041–2048.
[16] X. Lan and D. P. Huttenlocher, "Beyond trees: Common-factor models for 2D human pose recovery," in Proc. IEEE Int. Conf. Comput. Vis., 2005, pp. 470–477.
[17] L. Karlinsky and S. Ullman, "Using linking features in learning non-parametric part models," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 326–339.
[18] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, "Human pose estimation using body parts dependent joint regressors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3041–3048.
[19] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 2597–2602.
[20] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4724–4732.
[21] W. Ouyang, X. Chu, and X. Wang, "Multi-source deep learning for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2337–2344.
[22] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 648–656.
[23] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 1799–1807.
[24] X. Chen and A. Yuille, "Articulated pose estimation by a graphical model with image dependent pairwise relations," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 1736–1744.
[25] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1653–1660.
[26] V. Belagiannis and A. Zisserman, "Recurrent human pose estimation," in Proc. 12th IEEE Int. Conf. Automatic Face Gesture Recognit., 2017, pp. 468–475.
[27] A. Bulat and G. Tzimiropoulos, "Human pose estimation via convolutional part heatmap regression," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 717–732.
[28] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, "Multi-context attention for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5669–5678.
[29] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, "Learning feature pyramids for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1290–1299.
[30] Y. Chen, C. Shen, X.-S. Wei, L. Liu, and J. Yang, "Adversarial posenet: A structure-aware convolutional network for human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1221–1230.
[31] W. Tang, P. Yu, and Y. Wu, "Deeply learned compositional models for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 190–206.
[32] L. Ke, M.-C. Chang, H. Qi, and S. Lyu, "Multi-scale structure-aware network for human pose estimation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 731–746.
[33] T. Pfister, J. Charles, and A. Zisserman, "Flowing convnets for human pose estimation in videos," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1913–1921.
[34] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, "Pose machines: Articulated pose estimation via inference machines," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 33–47.
[35] S. Hochreiter, Y. Bengio, and P. Frasconi, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," in Field Guide to Dynamical Recurrent Networks, J. Kolen and S. Kremer, Eds. Piscataway, NJ, USA: IEEE Press, 2001.
[36] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statistics, 2010, pp. 249–256.
[37] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[38] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele, "Articulated people detection and pose estimation: Reshaping the future," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3178–3185.
[39] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, "Using k-poselets for detecting people and localizing their keypoints," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3582–3589.
[40] M. Sun and S. Savarese, "Articulated part-based model for joint object detection and pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 723–730.
[41] U. Iqbal and J. Gall, "Multi-person pose estimation with local joint-to-person associations," in Proc. Eur. Conf. Comput. Vis. Workshop, 2016, pp. 627–642.
[42] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, "Towards accurate multi-person pose estimation in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3711–3719.
[43] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, "Cascaded pyramid network for multi-person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7103–7112.
[44] B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 472–487.
[45] M. Eichner and V. Ferrari, "We are family: Joint pose estimation of multiple persons," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 228–242.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[47] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele, "Arttrack: Articulated multi-person tracking in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1293–1301.
[48] A. Newell, Z. Huang, and J. Deng, "Associative embedding: End-to-end learning for joint detection and grouping," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 2274–2284.
[49] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, "Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 282–299.
[50] M. Kocabas, S. Karagoz, and E. Akbas, "MultiPoseNet: Fast multi-person pose estimation using pose residual network," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 437–453.
[51] X. Nie, J. Feng, J. Xing, and S. Yan, "Pose partition networks for multi-person pose estimation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 705–720.
[52] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2261–2269.
[53] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015, pp. 521–534.
[54] D. B. West, et al., Introduction to Graph Theory, vol. 2, Upper Saddle River, NJ, USA: Prentice Hall, 2001.
[55] H. W. Kuhn, "The Hungarian method for the assignment problem," in Naval Research Logistics Quarterly. Hoboken, NJ, USA: Wiley Online Library, 1955.
[56] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y.-G. Jiang, and X. Xue, "Pose-normalized image generation for person re-identification," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 661–678.
[57] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-gan: Unsupervised video retargeting," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 119–135.
[58] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody dance now," in Proc. Eur. Conf. Comput. Vis. Workshop, 2018, pp. 625–633.
[59] L. Gui, K. Zhang, Y.-X. Wang, X. Liang, J. M. Moura, and M. M. Veloso, "Teaching robots to predict human motion," in Proc. Int. Conf. Intell. Robots Syst., 2018, pp. 562–567.
[60] P. Panteleris, I. Oikonomidis, and A. Argyros, "Using a single RGB frame for real time 3D hand pose estimation in the wild," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 436–445.
[61] H. Joo, T. Simon, and Y. Sheikh, "Total capture: A 3D deformation model for tracking faces, hands, and bodies," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8320–8329.
[62] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt, "VNect: Real-time 3D human pose estimation with a single RGB camera," ACM Trans. Graph., vol. 36, 2017, Art. no. 44.
[63] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, "Hand keypoint detection in single images using multiview bootstrapping," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4645–4653.
[64] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," J. Society Ind. Appl. Math., vol. 11, no. 2, pp. 431–441, 1963.
[65] G. Bradski, "The OpenCV library," Dobb's J. Softw. Tools, 2000.
[66] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3686–3693.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[68] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Trans. Graph., vol. 34, 2015, Art. no. 248.
[69] "ClickWorker," 2010. [Online]. Available: https://www.clickworker.com
[70] "MSCOCO keypoint evaluation metric," 2016. [Online]. Available: http://mscoco.org/dataset/#keypoints-eval

[71] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres, "Joint graph decomposition & node labeling: Problem, algorithms, applications," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1904–1912.
[72] M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele, "Learning to refine human pose estimation," in Proc. Comput. Vis. Pattern Recognit. Workshop, 2018, pp. 205–214.
[73] "MSCOCO keypoint leaderboard," 2016. [Online]. Available: http://mscoco.org/dataset/#keypoints-leaderboard
[74] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[75] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3296–3297.
[76] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99.
[77] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6517–6525.
[78] G. Moon, J. Chang, and K. M. Lee, "Posefix: Model-agnostic general human pose refinement network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019.
[79] N. Dinesh Reddy, M. Vo, and S. G. Narasimhan, "Carfusion: Combining point tracking and part detection for dynamic 3D reconstruction of vehicles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1906–1915.

Zhe Cao received the BS degree in computer science from Wuhan University, China, in 2015, and the MS degree in robotics from Carnegie Mellon University, in 2017, advised by Dr. Yaser Sheikh. He is working toward the PhD degree in computer vision at the University of California, Berkeley, advised by Dr. Jitendra Malik. His research interests lie mainly in computer vision and deep learning. He is a student member of the IEEE.

Gines Hidalgo received the BS degree in telecommunications from Universidad Politecnica de Cartagena, Spain, in 2014, and the MS degree in robotics from Carnegie Mellon University, in 2019, which he obtained while working as a research associate with the Robotics Institute, advised by Dr. Yaser Sheikh. He is a research engineer at Epic Games. His research interests include human pose estimation, computationally efficient deep learning, and computer vision. He is a student member of the IEEE.

Tomas Simon received the BS degree in telecommunications from Universidad Politecnica de Valencia, the MS degree in robotics from Carnegie Mellon University, which he obtained while working at the Human Sensing Lab advised by Fernando De la Torre, and the PhD degree from Carnegie Mellon University, advised by Yaser Sheikh and Iain Matthews, in 2017. He is a research scientist at Facebook Reality Labs. His research interests lie mainly in using computer vision and machine learning to model faces and bodies.

Shih-En Wei received the BS degree in electrical engineering and the MS degree in communication engineering from National Taiwan University, Taipei, Taiwan, and the MS degree in robotics from Carnegie Mellon University, in 2016, advised by Dr. Yaser Sheikh. He is a research engineer at Facebook Reality Labs. His research interests include computer vision and machine learning.

Yaser Sheikh is the director of the Facebook Reality Lab in Pittsburgh and is an associate professor at the Robotics Institute at Carnegie Mellon University. His research is broadly focused on machine perception of social behavior, spanning computer vision, computer graphics, and machine learning. With colleagues, he has won Popular Science's "Best of What's New" Award, the Honda Initiation Award (2010), the best student paper award at CVPR (2018), best paper awards at WACV (2012), SAP (2012), SCA (2010), and ICCV THEMIS (2009), the best demo award at ECCV (2016), and placed first in the MSCOCO Keypoint Challenge (2016); he has also received the Hillman Fellowship for Excellence in Computer Science Research (2004). Yaser has served as a senior committee member at leading conferences in computer vision, computer graphics, and robotics, including SIGGRAPH (2013, 2014), CVPR (2014, 2015, 2018), ICRA (2014, 2016), and ICCP (2011), and has served as an associate editor of CVIU. His research is sponsored by various government research offices, including NSF and DARPA, and several industrial partners, including the Intel Corporation, the Walt Disney Company, Nissan, Honda, Toyota, and the Samsung Group. His research has been featured by various media outlets, including The New York Times, The Verge, Popular Science, BBC, MSNBC, New Scientist, slashdot, and WIRED.
