
DiffPose: Toward More Reliable 3D Pose Estimation

Jia Gong1†  Lin Geng Foo1†  Zhipeng Fan2§  Qiuhong Ke3  Hossein Rahmani4  Jun Liu1‡
1 Singapore University of Technology and Design  2 New York University  3 Monash University  4 Lancaster University
{jia_gong,lingeng_foo}@mymail.sutd.edu.sg, [email protected], [email protected], [email protected], jun_[email protected]
† Equal contribution; § Currently at Meta; ‡ Corresponding author

arXiv:2211.16940v3 [cs.CV] 9 Apr 2023

Abstract

Monocular 3D human pose estimation is quite challenging due to the inherent ambiguity and occlusion, which often lead to high uncertainty and indeterminacy. On the other hand, diffusion models have recently emerged as an effective tool for generating high-quality images from noise. Inspired by their capability, we explore a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process. We incorporate novel designs into our DiffPose to facilitate the diffusion process for 3D pose estimation: a pose-specific initialization of pose uncertainty distributions, a Gaussian Mixture Model-based forward diffusion process, and a context-conditioned reverse diffusion process. Our proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP. Project page: https://gongjia0208.github.io/Diffpose/.

Figure 1. Overview of our DiffPose framework. In the forward process (denoted with blue dotted arrows), we gradually diffuse a "ground truth" 3D pose distribution H_0 with low indeterminacy towards a 3D pose distribution with high uncertainty H_K by adding noise ε at every step, which generates intermediate distributions to guide model training. Before the reverse process, we first initialize the indeterminate 3D pose distribution H_K from the input. Then, during the reverse process (denoted with red solid arrows), we use the diffusion model g, conditioned on the context information from the 2D pose sequence, to progressively transform H_K into a 3D pose distribution H_0 with low indeterminacy.

1. Introduction

3D human pose estimation, which aims to predict the 3D coordinates of human joints from images or videos, is an important task with a wide range of applications, including augmented reality [5], sign language translation [21] and human-robot interaction [40], attracting a lot of attention in recent years [23, 46, 50, 52]. Generally, the mainstream approach is to conduct 3D pose estimation in two stages: the 2D pose is first obtained with a 2D pose detector, and then 2D-to-3D lifting is performed (where the lifting process is the primary aspect that most recent works [2, 10, 16, 17, 19, 32, 54] focus on). Yet, despite the considerable progress, monocular 3D pose estimation still remains challenging. In particular, it can be difficult to accurately predict 3D pose from monocular data due to many challenges, including the inherent depth ambiguity and the potential occlusion, which often lead to high indeterminacy and uncertainty.

On the other hand, diffusion models [12, 38] have recently become popular as an effective way to generate high-quality images [33]. Generally, diffusion models are capable of generating samples that match a specified data distribution (e.g., natural images) from random (indeterminate) noise through multiple steps in which the noise is progressively removed [12, 38]. Intuitively, such a paradigm of progressive denoising helps to break down the large gap between distributions (from a highly uncertain one to a determinate one) into smaller intermediate steps [39], and thus helps the model converge towards smoothly generating samples from the target data distribution.

Inspired by the strong capability of diffusion models to generate realistic samples even from a starting point with high uncertainty (e.g., random noise), here we aim to tackle 3D pose estimation, which also involves handling the uncertainty and indeterminacy of 3D poses, with diffusion models. In this paper, we propose DiffPose, a novel framework that represents a new brand of diffusion-based 3D pose estimation approach, and which also follows the mainstream two-stage pipeline. In short, DiffPose models the 3D pose estimation procedure as a reverse diffusion process, where we progressively transform a 3D pose distribution with high uncertainty and indeterminacy towards a 3D pose with low uncertainty.
Intuitively, we can consider the determinate ground truth 3D pose as particles in the context of thermodynamics, where the particles are neatly gathered and form a clear pose with low indeterminacy at the start; eventually, these particles stochastically spread over the space, leading to high indeterminacy. This process of particles evolving from low indeterminacy to high indeterminacy is the forward diffusion process. The pose estimation task aims to perform precisely the opposite of this process, i.e., the reverse diffusion process: we receive an initial 2D pose that is indeterminate and uncertain in 3D space, and we want to shed the indeterminacy to obtain a determinate 3D pose distribution containing high-quality solutions.

Overall, our DiffPose framework consists of two opposite processes: the forward process and the reverse process, as shown in Fig. 1. In short, the forward process generates supervisory signals of intermediate distributions for training purposes, while the reverse process is a key part of our 3D pose estimation pipeline that is used for both training and testing. Specifically, in the forward process, we gradually diffuse a "ground truth" 3D pose distribution H_0 with low indeterminacy towards a 3D pose distribution with high indeterminacy that resembles the 3D pose's underlying uncertainty distribution H_K. We obtain samples from the intermediate distributions along the way, which are used during training as step-by-step supervisory signals for our diffusion model g. To start the reverse process, we first initialize the indeterminate 3D pose distribution (H_K) according to the underlying uncertainty of the 3D pose. Then, our diffusion model g is used in the reverse process to progressively transform H_K into a 3D pose distribution with low indeterminacy (H_0). The diffusion model g is optimized using the samples from the intermediate distributions (generated in the forward process), which guide it to smoothly transform the indeterminate distribution H_K into accurate predictions.

However, there are several challenges in the above forward and reverse processes. Firstly, in 3D pose estimation, we start the reverse diffusion process from an estimated 2D pose which has high uncertainty in 3D space, instead of starting from random noise as in existing image generation diffusion models [12, 38]. This is a significant difference, as it means that the underlying uncertainty distribution of each 3D pose can differ. Thus, we cannot design the output of the forward diffusion steps to converge to the same Gaussian noise as in previous image generation diffusion works [12, 38]. Moreover, the uncertainty distribution of 3D poses can be irregular and complicated, making it hard to characterize via a single Gaussian distribution. Lastly, it can be difficult to perform accurate 3D pose estimation with just H_K as input. This is because our aim is not just to generate any realistic 3D pose, but rather to predict accurate 3D poses corresponding to our estimated 2D poses, which often requires more context information to achieve.

To address these challenges, we introduce several novel designs in our DiffPose. Firstly, we initialize the indeterminate 3D pose distribution H_K based on extracted heatmaps, which captures the underlying uncertainty of the desired 3D pose. Secondly, during forward diffusion, to generate the indeterminate 3D pose distributions that eventually (after K steps) resemble H_K, we add noise to the ground truth 3D pose distribution H_0, where the noise is modeled by a Gaussian Mixture Model (GMM) that characterizes the uncertainty distribution H_K. Thirdly, the reverse diffusion process is conditioned on context information from the input video or frame in order to better leverage the spatial-temporal relationships between frames and joints. Then, to effectively use the context information and perform the progressive denoising to obtain accurate 3D poses, we design a GCN-based diffusion model g.

The contributions of this paper are threefold: (i) We propose DiffPose, a novel framework which represents a new brand of method with the diffusion architecture for 3D pose estimation, and which can naturally handle the indeterminacy and uncertainty of 3D poses. (ii) We propose various designs to facilitate 3D pose estimation, including the initialization of the 3D pose distribution, a GMM-based forward diffusion process, and a conditional reverse diffusion process. (iii) DiffPose achieves state-of-the-art performance on two widely used human pose estimation benchmarks.
2. Related Work

3D Human Pose Estimation. Existing monocular 3D pose estimation methods can roughly be categorized into two groups: frame-based methods and video-based ones. Frame-based methods predict the 3D pose from a single RGB image. Some works [7-9, 30, 31, 42] use Convolutional Neural Networks (CNNs) to output a human pose from the RGB image, while many works [26, 46, 51, 52] first detect the 2D pose and then use it to regress the 3D pose. On the other hand, video-based methods tend to exploit temporal dependencies between frames in the video clip. Most video-based methods [2, 3, 6, 10, 14, 32, 34, 35, 44, 45, 54] extract 2D pose sequences from the input video clip via a 2D pose detector, and focus on distilling the crucial spatial-temporal information from these 2D pose sequences for 3D pose estimation. To encode spatial-temporal information, existing works explore CNN-based frameworks with temporal convolutions [3, 32], GCNs [2, 6], or Transformers [34, 54]. Notably, several works [17, 19, 36] aim to alleviate the uncertainty and indeterminacy in 3D pose estimation by designing models that can generate multiple hypothesis solutions from a single input. Different from all the aforementioned works, DiffPose is formulated as a distribution-to-distribution transformation process, where we train a diffusion model to smoothly denoise from the indeterminate pose distribution to a pose distribution with low indeterminacy. By framing the 3D pose estimation procedure as a reverse diffusion process, DiffPose can naturally handle the indeterminacy and uncertainty for 3D pose estimation.

Denoising Diffusion Probabilistic Models (DDPMs). DDPMs (called diffusion models for short) have emerged as an effective approach to learn a data distribution that is straightforward to sample from. Introduced by Sohl-Dickstein et al. [37] for image generation, DDPMs have been further simplified and accelerated [12, 38], and enhanced [1, 28, 29, 53] in recent years. Previous works have explored applying diffusion models to various generation tasks, including image inpainting [25] and text generation [20]. Here, we explore using diffusion models to tackle 3D pose estimation with our DiffPose framework. Unlike these generation tasks [20, 25] that often start the generation process from random noise, our pose estimation process starts from an estimated 2D pose with uncertainty and indeterminacy in 3D space, where the uncertainty distribution differs for each pose and can also be irregular and difficult to characterize. We also design a GCN-based architecture as our diffusion model g, and condition it on spatial-temporal context information to aid the reverse diffusion process and obtain accurate 3D poses.

3. Background on Diffusion Models

Diffusion models [12, 38] are a class of probabilistic generative models that learn to transform noise h_K ~ N(0, I) into a sample h_0 by recurrently denoising h_K, i.e., (h_K → h_{K-1} → ... → h_0). This denoising process is called reverse diffusion. Conversely, the process (h_0 → h_1 → ... → h_K) is called forward diffusion.

To allow the diffusion model to learn the reverse diffusion process, a set of intermediate noisy samples {h_k}_{k=1}^{K-1} is needed to bridge the source sample h_0 and the Gaussian noise h_K. Specifically, forward diffusion is conducted to generate these samples, where the posterior distribution q(h_{1:K} | h_0) from h_0 to h_K is formulated as:

q(h_{1:K} | h_0) := \prod_{k=1}^{K} q(h_k | h_{k-1}),   (1)

q(h_k | h_{k-1}) := N_{pdf}\Big(h_k \,\Big|\, \sqrt{\tfrac{\alpha_k}{\alpha_{k-1}}}\, h_{k-1},\; \big(1 - \tfrac{\alpha_k}{\alpha_{k-1}}\big) I\Big),   (2)

where N_{pdf}(h_k | ·) refers to the likelihood of sampling h_k conditioned on the given parameters, and \alpha_{1:K} \in (0, 1]^K is a fixed decreasing sequence that controls the noise scaling at each diffusion step. Using the known statistical results for the combination of Gaussian distributions, the posterior for the diffusion process up to step k can be formulated as:

q(h_k | h_0) := \int q(h_{1:k} | h_0)\, dh_{1:k-1} = N_{pdf}\big(h_k \,\big|\, \sqrt{\alpha_k}\, h_0,\; (1 - \alpha_k) I\big).   (3)

Thus, h_k can be expressed as a linear combination of the source sample h_0 and a noise variable \epsilon, where each element of \epsilon is sampled from N(0, 1), as follows:

h_k = \sqrt{\alpha_k}\, h_0 + \sqrt{1 - \alpha_k}\, \epsilon.   (4)

Hence, when a long decreasing sequence \alpha_{1:K} is set such that \alpha_K ≈ 0, the distribution of h_K will converge to a standard Gaussian, i.e., h_K ~ N(0, I). This indicates that the source signal h_0 will eventually be corrupted into Gaussian noise, which conforms to the non-equilibrium thermodynamics phenomenon of the diffusion process [37].

Using the sample h_0 and the noisy samples {h_k}_{k=1}^{K} generated by forward diffusion, the diffusion model g (which is often a deep network parameterized by θ) is optimized to approximate the reverse diffusion process. Specifically, although the exact formulations may differ [12, 37, 38], each reverse diffusion step can be expressed as a function f that takes h_k and the diffusion model g as input to generate an output h_{k-1} as follows:

h_{k-1} = f(h_k, g).   (5)

Finally, during testing, a Gaussian noise h_K can be easily sampled, and the reverse diffusion step introduced in Eq. 5 can be recurrently performed to generate a high-quality sample h_0 using the trained diffusion model g.
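To connect Eqs. 3-5 to code, the snippet below is a minimal sketch, in PyTorch, of the closed-form forward sampling of Eq. 4 and the generic reverse loop of Eq. 5. It is our illustration rather than the paper's implementation; the linear β schedule mirrors the one reported in the Supplementary, and g stands in for any trained denoising network.

```python
import torch

# Assumed schedule: alpha_bar[k] plays the role of alpha_k in Eq. 4,
# a fixed decreasing sequence with alpha_bar[K-1] close to 0.
K = 50
betas = torch.linspace(1e-4, 2e-3, K)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(h0: torch.Tensor, k: int) -> torch.Tensor:
    """Sample h_k directly from h_0 via Eq. 4 (no step-by-step loop needed)."""
    eps = torch.randn_like(h0)
    return alpha_bar[k].sqrt() * h0 + (1.0 - alpha_bar[k]).sqrt() * eps

@torch.no_grad()
def reverse_diffuse(g, hK: torch.Tensor) -> torch.Tensor:
    """Generic reverse process of Eq. 5: h_K -> ... -> h_0.
    Here g is any trained denoising network mapping (h_k, k) -> h_{k-1}."""
    h = hK
    for k in reversed(range(K)):
        h = g(h, k)  # one reverse diffusion step
    return h
```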
4. Proposed Method: DiffPose

Given an RGB image frame I_t or a video clip V_t = {I_τ}_{τ=t-T}^{t+T}, the goal of 3D human pose estimation is to predict the 3D coordinates of all J keypoints of the human body in I_t. In this paper, inspired by diffusion-based generative models that can recurrently shed the indeterminacy in an initial distribution (e.g., a Gaussian distribution) to reconstruct a high-quality determinate sample, we frame the 3D pose estimation task as constructing a determinate 3D pose distribution (H_0) from the highly indeterminate pose distribution (H_K) via diffusion models, which can handle the uncertainty and indeterminacy of 3D poses.

As shown in Fig. 2, we conduct pose estimation in two stages: (i) initializing the indeterminate 3D pose distribution H_K based on extracted heatmaps, which capture the underlying uncertainty of the input 2D pose in 3D space; and (ii) performing the reverse diffusion process, where we use a diffusion model g to progressively denoise the initial distribution H_K to a desired high-quality determinate distribution H_0, from which we can sample h_0 ∈ R^{3×J} to synthesize the final 3D pose h_s.

Figure 2. Illustration of our DiffPose framework during inference. First, we use the Context Encoder φ_ST to extract the spatial-temporal context feature f_ST from the given 2D pose sequence. We also generate a diffusion step embedding f_D^k for each k-th diffusion step. Then, we initialize the indeterminate pose distribution H_K using heatmaps derived from an off-the-shelf 2D pose detector and depth distributions that can either be computed from the training set or predicted by the Context Encoder φ_ST. Next, we sample N noisy poses {h_K^i}_{i=1}^{N} from H_K, which are required for performing the distribution-to-distribution mapping. We feed these N poses into the diffusion model K times, where the diffusion model g is also conditioned on f_ST and f_D^k at each step, to obtain {h_0^i}_{i=1}^{N}, which represents the high-quality determinate distribution H_0. Lastly, we use the mean of {h_0^i}_{i=1}^{N} as our final 3D pose h_s.
In Sec. 4.1, we describe how to initialize the 3D distribution H_K from an input 2D pose such that it effectively captures the uncertainty in 3D space. Then, we explain our forward diffusion process in Sec. 4.2 and the reverse diffusion process in Sec. 4.3. After that, we present the detailed training and testing process in Sec. 4.4. Finally, the architecture of our diffusion network is detailed in Sec. 4.5.

4.1. Initializing the 3D Pose Distribution H_K

In previous diffusion models [11, 12, 38], the reverse diffusion process often starts from random noise, which is progressively denoised to generate a high-quality output. However, in 3D pose estimation, our input is instead an estimated 2D pose that has its own uncertainty characteristics in 3D space. To aid our diffusion model in handling the uncertainty and indeterminacy of each input 2D pose in 3D space, we would like to initialize a corresponding 3D pose distribution H_K that captures the uncertainty of the 3D pose. The reverse diffusion process can then start from a distribution H_K with sample-specific knowledge (in contrast to Gaussian noise with no prior information), which leads to better performance. Below, we describe how we construct the x, y, and z uncertainty distributions for each joint of an input pose.

Initializing the (x, y, z) distribution. Intuitively, the x and y uncertainty distribution contains information regarding the likely regions of the image where the joints are located, and can roughly be seen as the outcome of "outwards" diffusion from the ground-truth positions. It can be difficult to capture such 2D pose uncertainty distributions, which are often complicated and also vary across the different joints of a given pose. To address this, we take advantage of the available prior information to model the uncertainty of the 2D pose. Notably, the 2D pose is often estimated from the image with an off-the-shelf 2D pose detector (e.g., CPN [4]), which first extracts heatmaps depicting the likely area of the image where each joint is located, before predicting the 2D joint locations from the extracted heatmaps. These heatmaps therefore naturally reveal the uncertainty of the 2D pose predictions. Hence, for the input 2D pose, we use the corresponding heatmaps from the off-the-shelf 2D pose detector as the x and y distribution.

However, we are unable to obtain the z distribution in the same way, as it is not known to the 2D pose detector. Instead, one way to compute the z distribution is to calculate the occurrence frequencies of z values in the training data, which yields a histogram for every joint. We also explore another approach, where the uncertain z distribution is initialized by the Context Encoder (introduced in Sec. 4.3), which we empirically observe to lead to faster convergence.
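As an illustration of this initialization, the sketch below samples noisy poses from H_K by drawing (x, y) from the detector heatmaps and z from a per-joint training-set histogram. The function and argument names are hypothetical; the paper does not publish this routine.

```python
import numpy as np

def sample_initial_poses(heatmaps, z_hist, z_bins, n_samples=5):
    """Hypothetical sketch of initializing H_K (Sec. 4.1), not the authors' code.

    heatmaps: (J, H, W) per-joint 2D heatmaps from an off-the-shelf detector,
              used as the (x, y) uncertainty distribution.
    z_hist:   (J, B) per-joint histogram of depth values from the training set.
    z_bins:   (B,) bin centers of the depth histogram.
    Returns:  (n_samples, J, 3) noisy poses h_K^i drawn from H_K.
    """
    J, H, W = heatmaps.shape
    poses = np.zeros((n_samples, J, 3))
    for j in range(J):
        p_xy = np.clip(heatmaps[j].ravel(), 0, None)
        p_xy = p_xy / p_xy.sum()                      # normalize heatmap
        p_z = z_hist[j] / z_hist[j].sum()             # normalize depth histogram
        for i in range(n_samples):
            idx = np.random.choice(H * W, p=p_xy)     # (x, y) from the heatmap
            y, x = divmod(idx, W)
            z = np.random.choice(z_bins, p=p_z)       # z from the depth histogram
            poses[i, j] = (x, y, z)
    return poses
```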
4.2. Forward Pose Diffusion

After initializing the indeterminate distribution H_K, the next step in our 3D pose estimation pipeline is to progressively reduce its uncertainty (H_K → H_{K-1} → ... → H_0) using the diffusion model g through the reverse diffusion process. However, to attain this progressive denoising capability, the diffusion model g requires "ground truth" intermediate distributions as supervisory signals during training. Here, we obtain samples from these intermediate distributions via the forward diffusion process, where we take a ground truth 3D pose distribution H_0 and gradually add noise to it, as shown in Fig. 1. Specifically, given a desired determinate pose distribution H_0, we define the forward diffusion process as (H_0 → H_1 → ... → H_K), where K is the maximum number of diffusion steps. In this process, we aim to progressively increase the indeterminacy of H_0 towards the underlying pose uncertainty distribution H_K obtained in Sec. 4.1, such that we can obtain samples from intermediate distributions that correspond to H_1, ..., H_K, which allows us to optimize the diffusion model g to smoothly perform the step-by-step denoising.

DiffPose Forward Diffusion. For DiffPose, we do not want to diffuse our 3D pose towards standard Gaussian noise. This is because our indeterminate distribution H_K is not random noise, but is instead an (x, y, z) distribution reflecting the 3D pose uncertainty, and it has more complex characteristics. This has several implications. Firstly, the region of uncertainty for each joint and each coordinate of the initial pose distribution H_K can be different. Secondly, the mean locations of the joints should not all be treated as equal to the origin (i.e., 0 along all dimensions), due to the constraints of the body structure. For these reasons, the basic generative diffusion process (in Sec. 3) cannot appropriately model the uncertainty of the initialized pose distribution H_K (as described in Sec. 4.1) for our 3D pose estimation task, which motivates us to design a new forward diffusion process.

Designing such a forward diffusion process can be challenging, because the uncertainty distribution H_K, which is based on heatmaps, often has irregular and complex shapes, and it is not straightforward to express H_K mathematically. To overcome this, we propose to use a Gaussian Mixture Model (GMM) to model the uncertainty distribution H_K for 3D pose estimation, as a GMM can characterize intractable and complex distributions [18, 28] and is very effective at representing heatmap-based distributions [43]. Then, based on the fitted GMM, we perform a corresponding GMM-based forward diffusion process. Specifically, we set the number of Gaussian components in the GMM to M, and use the Expectation-Maximization (EM) algorithm to optimize the GMM parameters φ_GMM to fit the target distribution H_K as follows:

\max_{\phi_{GMM}} \prod_{i=1}^{N_{GMM}} \sum_{m=1}^{M} \pi_m N_{pdf}(h_K^i \,|\, \mu_m, \Sigma_m),   (6)

where h_K^1, ..., h_K^{N_{GMM}} are N_{GMM} poses sampled from the pose distribution H_K, and φ_GMM = {µ_1, Σ_1, π_1, ..., µ_M, Σ_M, π_M} refers to the GMM parameters. Here, µ_m ∈ R^{3J} and Σ_m ∈ R^{3J×3J} are the mean vector and covariance matrix of the m-th Gaussian component, and π_m ∈ [0, 1] is the probability that any sample h_K^i is drawn from the m-th mixture component (\sum_{m=1}^{M} \pi_m = 1).

Next, we want to run the forward diffusion process on the ground truth pose distribution H_0 such that, after K steps, the generated noisy distribution becomes equivalent to the fitted GMM distribution φ_GMM, which we henceforth denote as Ĥ_K because it is a GMM-based representation of H_K. To achieve this, we can modify Eq. 4 as follows:

\hat{h}_k = \mu^G + \sqrt{\alpha_k}\,(h_0 - \mu^G) + \sqrt{1 - \alpha_k} \cdot \epsilon^G,   (7)

where ĥ_k is a generated sample from the generated distribution Ĥ_k (it carries no superscript since Eq. 7 describes how to generate a single sample), µ^G = \sum_{m=1}^{M} 1_m \mu_m, \epsilon^G ~ N(0, \sum_{m=1}^{M} 1_m \Sigma_m), and 1_m ∈ {0, 1} is a binary indicator for the m-th component such that \sum_{m=1}^{M} 1_m = 1 and Prob(1_m = 1) = \pi_m. In other words, we first select a component m̂ via sampling according to the respective probabilities π_m, and set only 1_{m̂} to 1. Then, we sample the Gaussian noise from that component m̂ using µ_{m̂} and Σ_{m̂}. Notably, as \alpha_K ≈ 0, ĥ_K is drawn from the fitted GMM model, i.e., ĥ_K = µ^G + \epsilon^G ~ N(\sum_{m=1}^{M} 1_m \mu_m, \sum_{m=1}^{M} 1_m \Sigma_m). Thus, this allows us to generate samples from {Ĥ_1, ..., Ĥ_K} as supervisory signals. More details can be found in the Supplementary.
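The following sketch shows one way Eqs. 6 and 7 could be realized in code: φ_GMM is fitted with scikit-learn's GaussianMixture (our choice of library; the paper only specifies EM with M components), and the forward samples are then generated from the selected component. Treat this as an illustration under the same assumptions as the earlier snippets, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_to_HK(hK_samples: np.ndarray, M: int = 5) -> GaussianMixture:
    """Fit phi_GMM to N_GMM flattened (N_GMM, 3J) poses sampled from H_K
    via EM, maximizing the likelihood of Eq. 6."""
    gmm = GaussianMixture(n_components=M, covariance_type="full")
    gmm.fit(hK_samples)
    # gmm.means_ (M, 3J), gmm.covariances_ (M, 3J, 3J), gmm.weights_ (M,)
    return gmm

def gmm_forward_diffuse(h0_flat, gmm, alpha_bar):
    """GMM-based forward process of Eq. 7 for one set of samples.

    h0_flat:   (3J,) flattened ground truth pose h_0.
    alpha_bar: (K,) decreasing sequence alpha_k with alpha_bar[-1] close to 0.
    Returns:   list of K samples [h_hat_1, ..., h_hat_K].
    """
    # Select one component m_hat with probability pi_m (the indicator 1_m).
    m_hat = np.random.choice(len(gmm.weights_), p=gmm.weights_)
    mu_g = gmm.means_[m_hat]
    samples = []
    for a_k in alpha_bar:
        # eps^G drawn from the selected component's covariance
        eps_g = np.random.multivariate_normal(
            np.zeros_like(mu_g), gmm.covariances_[m_hat])
        samples.append(mu_g + np.sqrt(a_k) * (h0_flat - mu_g)
                       + np.sqrt(1.0 - a_k) * eps_g)
    return samples
```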
4.3. Reverse Diffusion for 3D Pose Estimation

As shown in Fig. 1, the reverse diffusion process aims to recover a determinate 3D pose distribution H_0 from the indeterminate pose distribution H_K, where H_K has been discussed in Sec. 4.1. In the previous subsection, we represented H_K via a GMM in order to generate the intermediate distributions {Ĥ_1, ..., Ĥ_K}. Here, we use these distributions to optimize our diffusion model g (parameterized by θ) to learn the reverse diffusion process (Ĥ_K → ... → Ĥ_1 → H_0), progressively shedding the indeterminacy of Ĥ_K to reconstruct the determinate source distribution H_0. The architecture of the diffusion model g is described in Sec. 4.5.

Context Encoder φ_ST. It is difficult to directly perform the reverse diffusion process using only Ĥ_K as the input of the diffusion model g, because g then observes little context information from the input videos/images, making it hard for g to generate accurate poses from the indeterminate distribution H_K. Therefore, we propose to utilize the available context information from the input to guide g towards more accurate predictions: the context information can constrain the model's denoising based on the observed inputs.

Specifically, to guide the diffusion model g, we leverage the spatial-temporal context. The context information can be extracted from the 2D pose sequence derived from V_t (or just a single 2D pose derived from I_t if V_t is not available). This context information aids the reverse diffusion process, providing additional information to the diffusion model g that helps to reduce uncertainty and generate more accurate 3D poses. To achieve this, we introduce the Context Encoder φ_ST to extract the spatial-temporal information f_ST from the 2D pose sequence, and condition the reverse diffusion process on f_ST (as shown in Fig. 2).

Reverse Diffusion Process. Overall, our reverse diffusion process aims to recover a determinate pose distribution H_0 from the indeterminate pose distribution Ĥ_K (during training) or H_K (during testing). Here, we describe the reverse diffusion process during training and use the Ĥ_K notation. We first use the Context Encoder φ_ST to extract f_ST from the 2D pose sequence. Moreover, to allow the diffusion model to denoise samples appropriately at each diffusion step, we also generate a unique step embedding f_D^k to represent the k-th diffusion step via the sinusoidal function. Then, for a noisy pose ĥ_k sampled from Ĥ_k, we use the diffusion model g, conditioned on the diffusion step k and the spatial-temporal context feature f_ST, to progressively reconstruct ĥ_{k-1} from ĥ_k as follows:

\hat{h}_{k-1} = g_\theta(\hat{h}_k, f_{ST}, f_D^k), \quad k \in \{1, ..., K\}.   (8)

4.4. Overall Training and Testing Process

Overall, for each sample during training, we (i) initialize H_K; (ii) use H_0 and H_K to generate supervisory signals {Ĥ_1, ..., Ĥ_K} via the forward process; and (iii) run K steps of the reverse process starting from Ĥ_K and optimize with our generated signals. During testing, we (i) initialize H_K and (ii) run K steps of the reverse process starting from H_K to obtain the final prediction h_s. More details are described below.

Training. First, from the input sequence V_t (or frame I_t), we extract the 2D heatmaps together with the estimated 2D pose via an off-the-shelf 2D pose detector [4]. Then, we compute the z distribution, either from the training set or as predicted by the Context Encoder φ_ST. After that, we initialize H_K based on the 3D distribution of each joint, and use the EM algorithm to obtain the best-fit GMM parameters φ_GMM = {µ_1, Σ_1, π_1, ..., µ_M, Σ_M, π_M} for H_K. Based on φ_GMM, we use the ground truth 3D pose h_0 to directly generate N sets of ĥ_1, ..., ĥ_K via Eq. 7, i.e., {ĥ_1^i, ..., ĥ_K^i}_{i=1}^{N}. Specifically, we first sample a component m̂^i for each i-th set according to the probabilities {π_m}_{m=1}^{M}, and use the m̂^i-th Gaussian component to directly add noise for the i-th set {ĥ_1^i, ..., ĥ_K^i}. Next, we extract the spatial-temporal context f_ST using the Context Encoder φ_ST. Then, we optimize the model parameters θ to reconstruct ĥ_{k-1}^i from ĥ_k^i in a step-wise manner. Following previous works on diffusion models [12, 38], we formulate our loss L as follows (where ĥ_0^i = h_0 for all i):

L = \sum_{i=1}^{N} \sum_{k=1}^{K} \big\| g_\theta(\hat{h}_k^i, f_{ST}, f_D^k) - \hat{h}_{k-1}^i \big\|_2^2.   (9)

Testing. Similar to the start of the training procedure, during testing we first initialize H_K and also extract f_ST. Then, we perform the reverse diffusion process, where we sample N poses from H_K (h_K^1, h_K^2, ..., h_K^N) and recurrently feed them into the diffusion model g K times, to obtain N high-quality 3D poses (h_0^1, h_0^2, ..., h_0^N). We need N noisy poses here because we are mapping from one distribution to another distribution. Then, to obtain the final high-quality and reliable pose h_s, we calculate the mean of the N denoised samples {h_0^1, ..., h_0^N}.
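Putting Sec. 4.4 together, a minimal sketch of the test-time procedure could look as follows. It assumes a trained diffusion model g with the conditional signature of Eq. 8, plus a context feature and a step-embedding function; all names are illustrative rather than the paper's API.

```python
import torch

@torch.no_grad()
def diffpose_inference(g, f_ST, step_embed, hK_samples, K=50):
    """Sketch of DiffPose inference (Sec. 4.4), under assumed interfaces.

    g:           model mapping (h_k, f_ST, f_D^k) -> h_{k-1} as in Eq. 8.
    f_ST:        spatial-temporal context feature from the Context Encoder.
    step_embed:  function k -> f_D^k (sinusoidal step embedding).
    hK_samples:  (N, J, 3) noisy poses sampled from H_K.
    """
    h = hK_samples
    for k in range(K, 0, -1):          # K conditional reverse steps
        h = g(h, f_ST, step_embed(k))  # Eq. 8: h_k -> h_{k-1}
    return h.mean(dim=0)               # final pose h_s = mean of the N samples
```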
4.5. DiffPose Architecture

Our framework consists of two sub-networks: a diffusion network (g) that performs the steps of the reverse process, and a Context Encoder (φ_ST) that extracts the context feature from the 2D pose sequence (or frame).

Main Diffusion Model g. We adopt a lightweight GCN-based architecture for g to perform 3D pose estimation via diffusion, which is modified from [52]. The graph convolution layer treats the human skeleton as a graph (with joints as the nodes), and effectively encodes the topological information between joints for 3D human pose estimation. Moreover, we interlace GCN layers with Self-Attention layers, which can encode global relationships between non-adjacent joints and allow for a better structural understanding of the 3D human pose as a whole. As shown in Fig. 2, our diffusion model g mainly consists of 3 stacked GCN-Attention Blocks with residual connections, where each GCN-Attention Block comprises two standard GCN layers and a Self-Attention layer. A GCN layer is added at the front and the back of these stacked GCN-Attention Blocks to control their embedding size.

Specifically, the starting GCN layer maps the input h_k ∈ R^{J×3} to a latent embedding E ∈ R^{J×128}. In parallel, we extract the spatial-temporal context information f_ST ∈ R^{J×128}. To provide the model with information regarding the current step number k, we also generate a diffusion step embedding f_D^k ∈ R^{J×256} using the sinusoidal function. Then, we combine these embeddings to form features v_1 ∈ R^{J×256}, where E and f_ST are first concatenated along the second dimension, before adding f_D^k to the result. The features v_1 are then fed into the stack of 3 GCN-Attention Blocks, which all have exactly the same structure. The output features from the last GCN-Attention Block are fed into the final GCN layer, which maps them to an output pose h_{k-1} ∈ R^{J×3}. Then, we feed h_{k-1} back into g as input to perform another reverse step. After the final, K-th step, we obtain the output pose h_0 ∈ R^{J×3}.

Context Encoder φ_ST. In this paper, we leverage a transformer-based network [49] to capture the spatial-temporal context information in the 2D pose sequence V_t. Note that if we do not have a video, we only input a single frame I_t and utilize [52] instead.
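To illustrate the data flow of one GCN-Attention Block, here is a simplified PyTorch sketch. The graph convolution is reduced to an adjacency-weighted linear layer for brevity; the actual layer follows [52], so every detail below should be read as an assumption rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """Simplified skeleton-graph convolution: aggregate neighbor features
    via a normalized adjacency A (J x J), then project. A stand-in for the
    standard GCN layer of [52], used here only to show the data flow."""
    def __init__(self, dim_in, dim_out, A):
        super().__init__()
        self.register_buffer("A", A)
        self.fc = nn.Linear(dim_in, dim_out)

    def forward(self, x):           # x: (B, J, dim_in)
        return self.fc(self.A @ x)  # neighbor aggregation, then projection

class GCNAttentionBlock(nn.Module):
    """One block of g: two GCN layers interlaced with self-attention,
    wrapped in a residual connection (dimensions assumed, per Sec. 4.5)."""
    def __init__(self, dim, A, heads=4):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(dim, dim, A)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gcn2 = SimpleGCNLayer(dim, dim, A)

    def forward(self, v):           # v: (B, J, dim), e.g. dim = 256
        h = torch.relu(self.gcn1(v))
        h, _ = self.attn(h, h, h)   # global joint-to-joint relationships
        h = self.gcn2(h)
        return v + h                # residual connection
```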
5. Experiments

We evaluate our method on two widely used datasets for 3D human pose estimation: Human3.6M [15] and MPI-INF-3DHP [27]. Specifically, we conduct experiments to evaluate the performance of our method in two scenarios: video-based and frame-based 3D pose estimation.

Human3.6M [15] is the largest benchmark for 3D human pose estimation, consisting of 3.6 million images captured from four cameras, where 15 daily activities are performed by 11 subjects. For video-based 3D pose estimation, we follow previous works [3, 24, 32] to train on five subjects (S1, S5, S6, S7, S8) and test on two subjects (S9 and S11). For frame-based 3D pose estimation, we follow [46, 51, 52] to train on the (S1, S5, S6, S7, S8) subjects and test on the (S9, S11) subjects. We report the mean per joint position error (MPJPE) and Procrustes MPJPE (P-MPJPE). The former computes the Euclidean distance between the predicted joint positions and the ground truth positions; the latter is the MPJPE after the predicted results are aligned to the ground truth via a rigid transformation. Due to page limitations, we move P-MPJPE results to the Supplementary.

MPI-INF-3DHP [27] is a large 3D pose dataset captured in both indoor and outdoor environments, with 1.3 million frames. Following [3, 22, 27, 54], we train DiffPose using all activities from 8 camera views in the training set and evaluate on valid frames in the test set. Here, we report MPJPE, the Percentage of Correct Keypoints (PCK) with a threshold of 150 mm, and the Area Under Curve (AUC) for a range of PCK thresholds to compare our performance with other methods in the video-based setting.

Implementation Details. We set the number of pose samples N to 5 and the number of reverse diffusion steps K to 50. We fit Ĥ_K via a GMM with 5 kernels (M = 5) for forward diffusion, and accelerate our diffusion inference procedure for all experiments via the acceleration technique DDIM [38], with which only five steps are required to complete the reverse diffusion process. For video pose estimation, we set the Context Encoder φ_ST to follow [49], and for frame-based pose estimation, we set φ_ST to follow [52]. The Context Encoder φ_ST is pre-trained on the training set to predict (x, y, z), then frozen during diffusion model training; we use it to produce the features f_ST and also to initialize the z distribution. For video-based pose estimation, we follow [2, 32] to use the detected 2D pose (using CPN [4]) and the ground truth 2D pose on Human3.6M, and use the ground truth 2D pose on MPI-INF-3DHP. For frame-based pose estimation, we follow [51, 52] to use the 2D pose detected by [4] and the ground truth 2D pose to conduct experiments on Human3.6M. More details are in the Supplementary.

5.1. Comparison with State-of-the-art Methods

Video-based Results on Human3.6M. We follow [32, 48, 49] to use 243 frames for 3D pose estimation and compare our method against existing works on Human3.6M in Tab. 1. As shown at the top of Tab. 1, our method achieves the best MPJPE results using the detected 2D pose, and significantly outperforms the SOTA method [49] by around 4 mm. This shows that DiffPose can effectively improve monocular 3D pose estimation. Moreover, we also conduct experiments using the ground truth 2D pose as input, and report our results at the bottom of Tab. 1. Our DiffPose again outperforms all previous methods by a large margin.

Table 1. Video-based results on Human3.6M in millimeters under MPJPE. Top table shows the results on detected 2D poses. Bottom table shows the results on ground truth 2D poses.

MPJPE (CPN)  Dir  Disc  Eat  Greet  Phone  Photo  Pose  Pur  Sit  SitD  Smoke  Wait  WalkD  Walk  WalkT  Avg
Pavllo [32]  45.2 46.7 43.3 45.6 48.1 55.1 44.6 44.3 57.3 65.8 47.1 44.0 49.0 32.8 33.9 46.8
Liu [24]     41.8 44.8 41.1 44.9 47.4 54.1 43.4 42.2 56.2 63.6 45.3 43.5 45.3 31.3 32.2 45.1
Zeng [48]    46.6 47.1 43.9 41.6 45.8 49.6 46.5 40.0 53.4 61.1 46.1 42.6 43.1 31.5 32.6 44.8
Zheng [54]   41.5 44.8 39.8 42.5 46.5 51.6 42.1 42.0 53.3 60.7 45.5 43.3 46.1 31.8 32.2 44.3
Li [19]      39.2 43.1 40.1 40.9 44.9 51.2 40.6 41.3 53.5 60.3 43.7 41.1 43.8 29.8 30.6 43.0
Shan [34]    38.4 42.1 39.8 40.2 45.2 48.9 40.4 38.3 53.8 57.3 43.9 41.6 42.2 29.3 29.3 42.1
Zhang [49]   37.6 40.9 37.3 39.7 42.3 49.9 40.1 39.8 51.7 55.0 42.1 39.8 41.0 27.9 27.9 40.9
Ours         33.2 36.6 33.0 35.6 37.6 45.1 35.7 35.5 46.4 49.9 37.3 35.6 36.5 24.4 24.1 36.9

MPJPE (GT)   Dir  Disc  Eat  Greet  Phone  Photo  Pose  Pur  Sit  SitD  Smoke  Wait  WalkD  Walk  WalkT  Avg
Pavllo [32]  35.2 40.2 32.7 35.7 38.2 45.5 40.6 36.1 48.8 47.3 37.8 39.7 38.7 27.8 29.5 37.8
Liu [24]     34.5 37.1 33.6 34.2 32.9 37.1 39.6 35.8 40.7 41.4 33.0 33.8 33.0 26.6 26.9 34.7
Zeng [48]    34.8 32.1 28.5 30.7 31.4 36.9 35.6 30.5 38.9 40.5 32.5 31.0 29.9 22.5 24.5 32.0
Zheng [54]   30.0 33.6 29.9 31.0 30.2 33.3 34.8 31.4 37.8 38.6 31.7 31.5 29.0 23.3 23.1 31.3
Li [19]      27.7 32.1 29.1 28.9 30.0 33.9 33.0 31.2 37.0 39.3 30.0 31.0 29.4 22.2 23.0 30.5
Shan [34]    28.5 30.1 28.6 27.9 29.8 33.2 31.3 27.8 36.0 37.4 29.7 29.5 28.1 21.0 21.0 29.3
Zhang [49]   21.6 22.0 20.4 21.0 20.8 24.3 24.7 21.9 26.9 24.9 21.2 21.5 20.8 14.7 15.7 21.6
Ours         18.6 19.3 18.0 18.4 18.3 21.5 21.5 19.1 23.6 22.3 18.6 18.8 18.3 12.8 13.9 18.9

Video-based Results on MPI-INF-3DHP. We also evaluate our method on MPI-INF-3DHP. Here, we use 81 frames as our input due to the shorter video length of this dataset. The results in Tab. 2 demonstrate that our method achieves the best performance, showing the efficacy of our DiffPose in improving performance in outdoor scenes.

Table 2. Video-based results on MPI-INF-3DHP.
Method       PCK ↑  AUC ↑  MPJPE ↓
Pavllo [32]  86.0   51.9   84.0
Wang [44]    86.9   62.1   68.1
Zheng [54]   88.6   56.4   77.1
Li [24]      93.8   63.3   58.0
Zhang [49]   94.4   66.5   54.9
Ours         98.0   75.9   29.1

Frame-based Results on Human3.6M. To further investigate the efficacy of DiffPose, we evaluate it in a more challenging setting: frame-based 3D pose estimation. Here, we only extract context information from the single input frame via our Context Encoder φ_ST. Our results on Human3.6M are reported in Tab. 3.

Table 3. Frame-based results on Human3.6M in millimeters under MPJPE. Top table shows the results on detected 2D poses. Bottom table shows the results on ground truth 2D poses.

MPJPE (CPN)    Dir  Disc  Eat  Greet  Phone  Photo  Pose  Pur  Sit  SitD  Smoke  Wait  WalkD  Walk  WalkT  Avg
Pavlakos [31]  67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9
Martinez [26]  51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9
Sun [41]       52.8 54.8 54.2 54.3 61.8 67.2 53.1 53.6 71.7 86.7 61.5 53.4 61.6 47.1 53.4 59.1
Yang [47]      51.5 58.9 50.4 57.0 62.1 65.4 49.8 52.7 69.2 85.2 57.4 58.4 60.1 43.6 47.7 58.6
Hossain [13]   48.4 50.7 57.2 55.2 63.1 72.6 53.0 51.7 66.1 80.9 59.0 57.3 62.4 46.6 49.6 58.3
Zhao [51]      48.2 60.8 51.8 64.0 64.6 53.6 51.1 67.4 88.7 57.7 73.2 65.6 48.9 64.8 51.9 60.8
Liu [23]       46.3 52.2 47.3 50.7 55.5 67.1 49.2 46.0 60.4 71.1 51.5 50.1 54.5 40.3 43.7 52.4
Xu [46]        45.2 49.9 47.5 50.9 54.9 66.1 48.5 46.3 59.7 71.5 51.4 48.6 53.9 39.9 44.1 51.9
Zhao [52]      45.2 50.8 48.0 50.0 54.9 65.0 48.2 47.1 60.2 70.0 51.6 48.7 54.1 39.7 43.1 51.8
Ours           42.8 49.1 45.2 48.7 52.1 63.5 46.3 45.2 58.6 66.3 50.4 47.6 52.0 37.6 40.2 49.7

MPJPE (GT)     Dir  Disc  Eat  Greet  Phone  Photo  Pose  Pur  Sit  SitD  Smoke  Wait  WalkD  Walk  WalkT  Avg
Martinez [26]  37.7 44.4 40.3 42.1 48.2 54.9 44.4 42.1 54.6 58.0 45.1 46.4 47.6 36.4 40.4 45.5
Hossain [13]   35.2 40.8 37.2 37.4 43.2 44.0 38.9 35.6 42.3 44.6 39.7 39.7 40.2 32.8 35.5 39.2
Zhao [51]      37.8 49.4 37.6 40.9 45.1 41.4 40.1 48.3 50.1 42.2 53.5 44.3 40.5 47.3 39.0 43.8
Liu [23]       36.8 40.3 33.0 36.3 37.5 45.0 39.7 34.9 40.3 47.7 37.4 38.5 38.6 29.6 32.0 37.8
Xu [46]        35.8 38.1 31.0 35.3 35.8 43.2 37.3 31.7 38.4 45.5 35.4 36.7 36.8 27.9 30.7 35.8
Zhao [52]      32.0 38.0 30.4 34.4 34.7 43.3 35.2 31.4 38.0 46.2 34.2 35.7 36.1 27.4 30.6 35.2
Ours           28.8 32.7 27.8 30.9 32.8 38.9 32.2 28.3 33.3 41.0 31.0 32.1 31.5 25.9 27.5 31.6
As shown at the top of Tab. 3, our DiffPose surpasses all existing methods in average MPJPE using detected 2D poses. At the bottom of Tab. 3, we observe that DiffPose also outperforms all methods by a large margin when ground truth 2D poses are used.

Qualitative results. In the first four columns of Fig. 3, we provide visualizations of the reverse diffusion process, where the step k decreases from 15 to 0. We observe that DiffPose can progressively narrow the gap between the sampled poses and the ground-truth pose. Moreover, we compare our method with the current SOTA method [49], which shows that our method can generate more reliable 3D pose solutions, especially for ambiguous body parts.

Figure 3. Qualitative results. The red 3D pose corresponds to the ground truth. Under occlusion, our DiffPose predicts a pose that is more accurate than previous methods (circled in orange).

5.2. Ablation Study

To verify the impact of each proposed design, we conduct extensive ablation experiments on the Human3.6M dataset using detected 2D poses in the video-based setting.

Impact of the Diffusion Process. We first evaluate the diffusion process's effectiveness. Here, we build two baseline models: (1) Baseline A, which has the same structure as our diffusion model but conducts 3D pose estimation in a single step; and (2) Baseline B, which has nearly the same architecture as our diffusion model, but stacks it multiple times to approximate the computational complexity of DiffPose. Note that both baselines are optimized to predict the 3D human pose directly, instead of learning the reverse diffusion process. We report the results of the baselines and DiffPose in Tab. 4. The performance of both baselines is much worse than that of DiffPose, which indicates that the performance improvement of our method comes from the designed diffusion pipeline.

Table 4. Ablation study for the diffusion pipeline.
Method      MPJPE  P-MPJPE
Baseline B  44.3   33.7
Baseline A  41.1   32.8
DiffPose    36.9   28.6

Impact of GMM. To validate the effect of the GMM design, we consider two alternative ways to train our diffusion model: (1) Stand-Diff, where we directly adopt the basic forward diffusion process introduced in Eq. 4 for model training; and (2) GMM-Diff, where we utilize a GMM to fit the initial 3D pose distribution H_K and generate the intermediate distributions for model training. Moreover, we vary the number of GMM kernels M (from 1 to 9) to investigate the characteristics of the GMM in pose diffusion. We report the results for different M in Tab. 5. The experiments show that our GMM-based design significantly outperforms the baseline Stand-Diff, which shows the effectiveness of using a GMM to approximate H_K. Moreover, we observe that 5 kernels (M = 5) are sufficient to effectively capture the uncertainty distribution.

Table 5. Ablation study for the GMM design.
Method          MPJPE  P-MPJPE
Stand-Diff      40.1   31.1
GMM-Diff (M=1)  38.0   29.7
GMM-Diff (M=5)  36.9   28.6
GMM-Diff (M=9)  36.5   28.5

Impact of the context feature f_ST. Another crucial component to explore is the role of the spatial-temporal context f_ST in our method. Here, we evaluate the performance when using various context encoders [34, 49] to obtain f_ST. As shown in Tab. 6, our DiffPose achieves good performance with both models. We also find that DiffPose significantly outperforms both context encoders on their own, which verifies the efficacy of our approach.

Table 6. Ablation study for f_ST.
Method       MPJPE  P-MPJPE
[34]         42.1   34.4
Ours + [34]  39.3   31.8
[49]         40.9   32.6
Ours + [49]  36.9   28.7

Impact of the number of reverse diffusion steps K and the sample number N. To further investigate the characteristics of our pose diffusion process, we conduct several experiments with different numbers of diffusion steps (K) and samples (N), and plot the results in Fig. 4. We observe that the MPJPE first drops significantly until K = 50, and shows only minor improvements when K > 50. Thus, we use 50 diffusion steps (K = 50) in our method, which can effectively and efficiently shed indeterminacy. On the other hand, we find that model performance improves with the number of samples N until N = 5, after which performance stays roughly consistent.

Figure 4. Evaluation of parameters K and N.

Inference Speed. In Tab. 7, we compare the speed of DiffPose with existing methods in terms of Frames Per Second (FPS). Our DiffPose with DDIM acceleration achieves a competitive speed compared with the current SOTA [49] while obtaining better performance. Moreover, even without DDIM acceleration, the FPS of our model is still higher than 170, which satisfies most real-time requirements.

Table 7. Analysis of speed. Our method runs efficiently, yet outperforms SOTA significantly.
Method             MPJPE  FPS
Li [19]            43.0   328
Zhang [49]         40.9   974
DiffPose w/o DDIM  36.7   173
DiffPose w/ DDIM   36.9   671

6. Conclusion

This paper presents DiffPose, a novel diffusion-based framework that handles the uncertainty and indeterminacy in monocular 3D pose estimation. DiffPose first initializes the indeterminate 3D pose distribution and then recurrently sheds the indeterminacy in this distribution to obtain the final high-quality 3D human pose distribution for reliable pose estimation. Extensive experiments show that the proposed DiffPose achieves state-of-the-art performance on two widely used benchmark datasets.
Acknowledgments. This work is supported by MOE AcRF Tier 2 (Proposal ID: T2EP20222-0035), National Research Foundation Singapore under its AI Singapore Programme (AISG-100E-2020-065), and SUTD SKI Project (SKI 2021_02_06). This work is also supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215.

References

[1] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981-17993, 2021.
[2] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2272-2281, 2019.
[3] Tianlang Chen, Chen Fang, Xiaohui Shen, Yiheng Zhu, Zhili Chen, and Jiebo Luo. Anatomy-aware 3d human pose estimation with bone-based pose decomposition. IEEE Transactions on Circuits and Systems for Video Technology, 32(1):198-209, 2021.
[4] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7103-7112, 2018.
[5] Manuela Chessa, Guido Maiello, Lina K Klein, Vivian C Paulun, and Fabio Solari. Grasping objects in immersive virtual reality. In 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 1749-1754. IEEE, 2019.
[6] Hai Ci, Chunyu Wang, Xiaoxuan Ma, and Yizhou Wang. Optimizing network structure for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2262-2271, 2019.
[7] Zhipeng Fan, Jun Liu, and Yao Wang. Adaptive computationally efficient network for monocular 3d hand pose estimation. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16, pages 127-144. Springer, 2020.
[8] Zhipeng Fan, Jun Liu, and Yao Wang. Motion adaptive pose estimation from compressed videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11719-11728, 2021.
[9] Lin Geng Foo, Jia Gong, Zhipeng Fan, and Jun Liu. System-status-aware adaptive network for online streaming video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
[10] Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. Unified pose sequence modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
[11] Tianpei Gu, Guangyi Chen, Junlong Li, Chunze Lin, Yongming Rao, Jie Zhou, and Jiwen Lu. Stochastic trajectory prediction via motion indeterminacy diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17113-17122, 2022.
[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[13] Mir Rayat Imtiaz Hossain and James J Little. Exploiting temporal information for 3d human pose estimation. In ECCV, pages 68-84, 2018.
[14] Wenbo Hu, Changgong Zhang, Fangneng Zhan, Lei Zhang, and Tien-Tsin Wong. Conditional directed graph convolution for 3d human pose estimation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 602-611, 2021.
[15] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325-1339, 2013.
[16] Rawal Khirodkar, Visesh Chari, Amit Agrawal, and Ambrish Tyagi. Multi-instance pose networks: Rethinking top-down pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3122-3131, 2021.
[17] Chen Li and Gim Hee Lee. Generating multiple hypotheses for 3d human pose estimation with mixture density network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9887-9895, 2019.
[18] Jonathan Li and Andrew Barron. Mixture density estimation. Advances in Neural Information Processing Systems, 12, 1999.
[19] Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13147-13156, 2022.
[20] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022.
[21] Xing Liang, Anastassia Angelopoulou, Epaminondas Kapetanios, Bencie Woll, Reda Al Batat, and Tyron Woolfe. A multi-modal machine learning approach and toolkit to automate recognition of early stages of dementia among british sign language users. In European Conference on Computer Vision, pages 278-293. Springer, 2020.
[22] Jiahao Lin and Gim Hee Lee. Trajectory space factorization for deep video-based 3d human pose estimation. In BMVC, 2019.
[23] Kenkun Liu, Rongqi Ding, Zhiming Zou, Le Wang, and Wei Tang. A comprehensive study of weight sharing in graph networks for 3d human pose estimation. In European Conference on Computer Vision, pages 318-334. Springer, 2020.
[24] Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, and Vijayan Asari. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5064-5073, 2020.
[25] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461-11471, 2022.
[26] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In IEEE ICCV, pages 2640-2649, 2017.
[27] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 International Conference on 3D Vision (3DV), pages 506-516. IEEE, 2017.
[28] Eliya Nachmani, Robin San Roman, and Lior Wolf. Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582, 2021.
[29] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162-8171. PMLR, 2021.
[30] Sungheon Park, Jihye Hwang, and Nojun Kwak. 3d human pose estimation using convolutional neural networks with 2d pose information. In European Conference on Computer Vision, pages 156-169. Springer, 2016.
[31] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. In IEEE CVPR, pages 7025-7034, 2017.
[32] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7753-7762, 2019.
[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022.
[34] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In ECCV, pages 461-478, 2022.
[35] Wenkang Shan, Haopeng Lu, Shanshe Wang, Xinfeng Zhang, and Wen Gao. Improving robustness and accuracy via relative information encoding in 3d human pose estimation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 3446-3454, 2021.
[36] Saurabh Sharma, Pavan Teja Varigonda, Prashast Bindal, Abhishek Sharma, and Arjun Jain. Monocular 3d human pose estimation by generation and ordinal ranking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2325-2334, 2019.
[37] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265. PMLR, 2015.
[38] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[39] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[40] Srinath Sridhar, Anna Maria Feit, Christian Theobalt, and Antti Oulasvirta. Investigating the dexterity of multi-finger input for mid-air text entry. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3643-3652, 2015.
[41] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In IEEE ICCV, pages 2602-2611, 2017.
[42] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529-545, 2018.
[43] Chen Wang, Feng Zhang, Xiatian Zhu, and Shuzhi Sam Ge. Low-resolution human pose estimation. Pattern Recognition, 126:108579, 2022.
[44] Jingbo Wang, Sijie Yan, Yuanjun Xiong, and Dahua Lin. Motion guided 3d pose estimation from videos. In European Conference on Computer Vision, pages 764-780. Springer, 2020.
[45] Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, and Wenjun Zhang. Deep kinematics analysis for monocular 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 899-908, 2020.
[46] Tianhan Xu and Wataru Takano. Graph stacked hourglass networks for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16105-16114, 2021.
[47] Wei Yang, Wanli Ouyang, Xiaolong Wang, Jimmy Ren, Hongsheng Li, and Xiaogang Wang. 3d human pose estimation in the wild by adversarial learning. In IEEE CVPR, pages 5255-5264, 2018.
[48] Ailing Zeng, Xiao Sun, Fuyang Huang, Minhao Liu, Qiang Xu, and Stephen Lin. Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach. In European Conference on Computer Vision, pages 507-523. Springer, 2020.
[49] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13232-13242, 2022.
[50] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N. Metaxas. Semantic graph convolutional networks for 3d human pose regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3425-3435, 2019.
[51] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N Metaxas. Semantic graph convolutional networks for 3d human pose regression. In IEEE CVPR, pages 3425-3435, 2019.
[52] Weixi Zhao, Weiqiang Wang, and Yunjie Tian. Graformer: Graph-oriented transformer for 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20438-20447, 2022.
[53] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, and Min Lin. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023.
[54] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11656-11665, 2021.
DiffPose: Toward More Reliable 3D Pose Estimation (Supplementary)

Jia Gong1† Lin Geng Foo1† Zhipeng Fan2§ Qiuhong Ke3 Hossein Rahmani4 Jun Liu1‡
1 Singapore University of Technology and Design, 2 New York University, 3 Monash University, 4 Lancaster University
{jia gong,lingeng foo}@mymail.sutd.edu.sg, [email protected], [email protected], [email protected], jun [email protected]
1. Additional Details of GMM Forward Diffusion

In Section 4.2 of the main paper, we describe the GMM-based forward diffusion process. Here, we explain it in more detail, particularly how it can be framed in a step-wise formulation. We first re-state Eq. 7 of the main paper as follows:

$$\hat{h}_k = \mu_G + \sqrt{\alpha_k}\,(h_0 - \mu_G) + \sqrt{1 - \alpha_k}\cdot\epsilon_G\,, \qquad (1)$$

where $\mu_G = \sum_{m=1}^{M} \mathbb{1}_m \mu_m$, $\epsilon_G \sim \mathcal{N}\big(0, \sum_{m=1}^{M} \mathbb{1}_m \Sigma_m\big)$, and $\mathbb{1}_m \in \{0, 1\}$ is a binary indicator for the $m$-th component such that $\sum_{m=1}^{M} \mathbb{1}_m = 1$ and $\mathrm{Prob}(\mathbb{1}_m = 1) = \pi_m$.

We remark that Eq. 1 directly formulates $\hat{h}_k$ as a function of $h_0$ instead of $\hat{h}_{k-1}$, because this clearly expresses the aim of our GMM-based forward diffusion design, i.e., that the generated $\hat{h}_1, \ldots, \hat{h}_K$ converge to the fitted GMM model $\phi_{GMM}$. Yet, we note that the step-wise formulation of $\hat{h}_k$ in terms of $\hat{h}_{k-1}$ can still be defined, if necessary. First, we sample according to the probabilities $\{\pi_m\}_{m=1}^{M}$ and select a Gaussian component $\hat{m}$, i.e., $\mathbb{1}_{\hat{m}} = 1$. Next, we calculate $\tilde{h}_0$, a "centered" version of $h_0$, using $\tilde{h}_0 = h_0 - \mu_G$, where $\mu_G = \sum_{m=1}^{M} \mathbb{1}_m \mu_m = \mu_{\hat{m}}$. Then, we follow the step-wise formulation:

$$\tilde{h}_k = \sqrt{\frac{\alpha_k}{\alpha_{k-1}}}\,\tilde{h}_{k-1} + \sqrt{1 - \frac{\alpha_k}{\alpha_{k-1}}}\,\epsilon_G\,, \qquad (2)$$

where $\epsilon_G \sim \mathcal{N}\big(0, \sum_{m=1}^{M} \mathbb{1}_m \Sigma_m\big)$, which is equivalent to $\epsilon_G \sim \mathcal{N}(0, \Sigma_{\hat{m}})$. After taking $k$ steps of Eq. 2 starting from $\tilde{h}_0$, we get:

$$\tilde{h}_k = \sqrt{\alpha_k}\,\tilde{h}_0 + \sqrt{1 - \alpha_k}\cdot\epsilon_G\,. \qquad (3)$$

The result of the step-wise formulation is thus equivalent to Eq. 1, as we can simply "de-center" $\tilde{h}_0$ and $\tilde{h}_k$ by substituting $\tilde{h}_0 = h_0 - \mu_G$ and $\tilde{h}_k = \hat{h}_k - \mu_G$.
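To make the one-shot sampling in Eq. 1 concrete, a minimal NumPy sketch is given below. This is illustrative rather than our released implementation: the function and variable names (e.g., gmm_forward_diffuse, alpha_bar) are our own, and the GMM parameters (pi_m, mu_m, Sigma_m) are assumed to have been fitted already, as described in Section 3.

```python
import numpy as np

def gmm_forward_diffuse(h0, k, alpha_bar, pis, mus, sigmas, rng=None):
    """Draw a diffused pose h_k from the ground-truth pose h_0 in one shot (Eq. 1).

    h0:        (D,) flattened 3D pose, D = 3 * number of joints
    alpha_bar: (K,) decreasing schedule with alpha_bar[k-1] = alpha_k
    pis:       (M,) GMM mixing weights pi_m
    mus:       (M, D) component means mu_m
    sigmas:    (M, D, D) component covariances Sigma_m
    """
    if rng is None:
        rng = np.random.default_rng()
    # Sample the binary indicator: pick one component m_hat with probability pi_m.
    m_hat = rng.choice(len(pis), p=pis)
    mu_G = mus[m_hat]
    # eps_G ~ N(0, Sigma_m_hat), the noise of the selected component.
    eps_G = rng.multivariate_normal(np.zeros_like(mu_G), sigmas[m_hat])
    a = alpha_bar[k - 1]
    # Eq. 1: shrink the centered pose toward mu_G and add scaled component noise.
    return mu_G + np.sqrt(a) * (h0 - mu_G) + np.sqrt(1.0 - a) * eps_G
```

Note that composing $k$ applications of the step-wise update in Eq. 2 yields exactly this marginal (Eq. 3 after de-centering), which is why the one-shot form suffices for generating training targets.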
2. Additional Details of Diffusion Network g

In order to provide information to the model regarding the current step number $k$, we generate a diffusion step embedding $f_D^k \in \mathbb{R}^{J \times 256}$ using the sinusoidal function. Specifically, at each even index $2j$ of $f_D^k$, we set the element $f_D^k[2j]$ to $\sin(k/10000^{2j/256})$, while at each odd index $2j+1$, we set the element $f_D^k[2j+1]$ to $\cos(k/10000^{2j/256})$.
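As a concrete illustration of the embedding above, a short PyTorch sketch might look as follows. The function name is our own, and replicating the same embedding across the $J$ joints is our assumption for matching the $J \times 256$ shape.

```python
import torch

def diffusion_step_embedding(k: int, num_joints: int, dim: int = 256) -> torch.Tensor:
    """Sinusoidal step embedding f_D^k of shape (num_joints, dim)."""
    j = torch.arange(dim // 2, dtype=torch.float32)   # j = 0, ..., dim/2 - 1
    freqs = k / torch.pow(10000.0, 2.0 * j / dim)     # k / 10000^(2j/dim)
    emb = torch.empty(dim)
    emb[0::2] = torch.sin(freqs)                      # even indices 2j
    emb[1::2] = torch.cos(freqs)                      # odd indices 2j + 1
    # Repeat the same step embedding for every joint to obtain a (J, dim) tensor.
    return emb.unsqueeze(0).expand(num_joints, -1)
```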
3. More Implementation Details

In the forward diffusion process, we generate the decreasing sequence $\alpha_{1:K}$ via the formula $\alpha_k = \prod_{i=1}^{k}(1 - \beta_i)$, where $\beta_{1:K}$ is a sequence linearly interpolated from 1e-4 to 2e-3. To optimize the GMM parameters $\phi_{GMM}$, we sample 1000 poses from $H_K$ (i.e., $N_{GMM} = 1000$) and then fit a GMM to $H_K$.

For model pre-training, the Context Encoder $\phi_{ST}$ is first trained on the training set to predict 3D poses from 2D poses. Then we adopt the Adam optimizer [7] to train our diffusion model g, where the initial learning rate is set to 1e-4 with a decay rate of 0.9 after ten epochs, and the batch size is set to 4096. Our DiffPose is implemented in PyTorch and can be trained on a single GeForce RTX 3090 GPU within 96 hours.
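For concreteness, the noise schedule described above can be computed in a few lines of NumPy. The sketch below follows the stated hyper-parameters (a linear $\beta_{1:K}$ from 1e-4 to 2e-3), with variable names of our own choosing.

```python
import numpy as np

def make_alpha_schedule(K: int, beta_start: float = 1e-4, beta_end: float = 2e-3):
    """Return the decreasing sequence alpha_1:K used in the forward process."""
    betas = np.linspace(beta_start, beta_end, K)   # beta_1:K, linearly interpolated
    return np.cumprod(1.0 - betas)                 # alpha_k = prod_{i=1..k} (1 - beta_i)
```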
4. Experiment Results on Human3.6M under P-MPJPE (Protocol 2)

Tab. 1 and Tab. 2 present the video-based and frame-based results of our DiffPose on Human3.6M under P-MPJPE, where the input 2D poses are detected by CPN [1]. As shown in Tab. 1, our DiffPose significantly outperforms the state-of-the-art methods [8, 21], achieving the best result on every action and a clear margin on average. Moreover, from Tab. 2, we observe that our method also achieves promising performance in the challenging frame-based setting.

† Equal contribution; § Currently at Meta; ‡ Corresponding author

Table 1. Video-based results on Human3.6M with detected 2D poses in millimeters under P-MPJPE.
P-MPJPE Dir. Disc. Eat Greet Phone Photo Pose Pur. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Lin [9] 32.5 35.3 34.3 36.2 37.8 43.0 33.0 32.2 45.7 51.8 38.4 32.8 37.5 25.8 28.9 36.8
Pavllo [15] 34.1 36.1 34.4 37.2 36.4 42.2 34.4 33.6 45.0 52.5 37.4 33.8 37.8 25.6 27.3 36.5
Liu [12] 32.3 35.2 33.3 35.8 35.9 41.5 33.2 32.7 44.6 50.9 37.0 32.4 37.0 25.2 27.2 35.6
Zheng et al. [24] 32.5 34.8 32.6 34.6 35.3 39.5 32.1 32.0 42.8 48.5 34.8 32.4 35.3 24.5 26.0 34.6
Li [8] 31.5 34.9 32.8 33.6 35.3 39.6 32.0 32.2 43.5 48.7 36.4 32.6 34.3 23.9 25.1 34.4
Zhang [21] 28.0 30.9 28.6 30.7 30.4 34.6 28.6 28.1 37.1 47.3 30.5 29.7 30.5 21.6 20.0 30.6
ours 26.3 29.0 26.1 27.8 28.4 34.6 26.9 26.5 36.8 39.2 29.4 26.8 28.4 18.6 19.2 28.7

Table 2. Frame-based results on Human3.6M with detected 2D poses in millimeters under P-MPJPE.
P-MPJPE Dir. Disc. Eat Greet Phone Photo Pose Pur. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Sun [17] 42.1 44.3 45.0 45.4 51.5 53.0 43.2 41.3 59.3 73.3 51.0 44.0 48.0 38.3 44.8 48.3
Martinez [13] 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7
Pavlakos [14] 34.7 39.8 41.8 38.6 42.5 47.5 38.0 36.6 50.7 56.8 42.6 39.6 43.9 32.1 36.5 41.8
Liu [11] 35.9 40.0 38.0 41.5 42.5 51.4 37.8 36.0 48.6 56.6 41.8 38.3 42.7 31.7 36.2 41.2
ours 33.9 38.2 36.0 39.2 40.2 46.5 35.8 34.8 48.0 52.5 41.2 36.5 40.9 30.3 33.8 39.2
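For reference, P-MPJPE (Protocol 2) rigidly aligns each predicted pose to the ground truth via a similarity transform (rotation, translation, and scale obtained by Procrustes analysis) before measuring the mean per-joint position error. Below is a minimal NumPy sketch of this standard metric; it is illustrative and not our evaluation code.

```python
import numpy as np

def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE in mm. pred, gt: (J, 3) joint arrays."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g                 # center both poses
    U, s, Vt = np.linalg.svd(p.T @ g)             # orthogonal Procrustes via SVD
    R = U @ Vt
    if np.linalg.det(R) < 0:                      # avoid an improper reflection
        Vt[-1] *= -1
        s[-1] *= -1
        R = U @ Vt
    scale = s.sum() / (p ** 2).sum()              # optimal isotropic scale
    aligned = scale * p @ R + mu_g                # align prediction to ground truth
    return np.linalg.norm(aligned - gt, axis=1).mean()
```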

5. Additional Results

In this section, we further investigate the performance of our method in the frame-based scenario, by conducting experiments on Human3.6M [6].

3D Pose visualization. First, we qualitatively compare our method with the state-of-the-art method [22] in this setting, and present the results in Fig. 1. We observe that our method predicts more reliable and accurate poses, especially for novel human gestures (e.g., the first and second rows in Fig. 1) and occluded body parts (e.g., the third and fourth rows in Fig. 1).

Figure 1. Qualitative comparison between GraFormer [22] and our method. The red colored 3D pose corresponds to the ground truth.

Forward diffusion process visualization. Extending from our results in Tab. 5 of the main paper, here we qualitatively compare our GMM-based forward diffusion process with the standard diffusion process (as described in Sec. 3 of our main paper). As shown in Fig. 2, the standard diffusion process recurrently adds noise to the source sample and tends to spread the joints' positions across the whole space. In contrast, our GMM-based diffusion process adds noise according to pose-specific information (obtained from heatmaps) and the data distribution, which generates noise in a more constrained manner. Thus, during training, the GMM-based diffusion process allows us to initialize an ĤK that captures the uncertainty of the 3D pose, which boosts the performance of DiffPose.

Reverse diffusion process visualization. We visualize the poses reconstructed by our diffusion model with and without the context information fST. Note that the model without fST means that no context decoder is used. From the last column of Fig. 3, we observe that both variants reconstruct realistic human poses, while the model with fST predicts more accurate poses. Moreover, compared to the unconditioned reverse diffusion process (i.e., the model without fST), the model conditioned on fST converges to the desired pose faster.

6. Future Work

In this work, we explore a novel diffusion-based framework to tackle monocular 3D pose estimation. Future work includes further investigation into the architecture of the diffusion network, as well as extensions to the online setting [2, 5, 19], the few-shot setting [18, 23], and other pose-based tasks [3, 4, 10, 16, 20].
[Figure 2 panels: standard forward diffusion (source sample; k=5, k=10, k=15) and GMM-based forward diffusion (k=5, k=10, k=15)]
Figure 2. Qualitative comparison between the standard diffusion forward process and our GMM-based forward diffusion process.

[Figure 3 panels: our DiffPose without fST and our DiffPose with fST, each shown at k=15, k=10, k=5, k=0]
Figure 3. Qualitative comparison between our reverse diffusion process conditioned on context information fST (bottom), against a reverse diffusion process without using fST (top).

References

[1] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7103–7112, 2018.
[2] Lin Geng Foo, Jia Gong, Zhipeng Fan, and Jun Liu. System-status-aware adaptive network for online streaming video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
[3] Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. ERA: Expert retrieval and assembly for early action prediction. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV, pages 670–688. Springer, 2022.
[4] Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. Unified pose sequence modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
[5] Amirhossein Habibian, Davide Abati, Taco S Cohen, and Babak Ehteshami Bejnordi. Skip-convolutions for efficient video processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2695–2704, 2021.
[6] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2013.
[7] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[8] Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool. MHFormer: Multi-hypothesis transformer for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13147–13156, 2022.
[9] Jiahao Lin and Gim Hee Lee. Trajectory space factorization for deep video-based 3d human pose estimation. In BMVC, 2019.
[10] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal LSTM with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
[11] Kenkun Liu, Rongqi Ding, Zhiming Zou, Le Wang, and Wei Tang. A comprehensive study of weight sharing in graph networks for 3d human pose estimation. In European Conference on Computer Vision, pages 318–334. Springer, 2020.
[12] Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, and Vijayan Asari. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5064–5073, 2020.
[13] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In IEEE ICCV, pages 2640–2649, 2017.
[14] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. In IEEE CVPR, pages 7025–7034, 2017.
[15] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
[16] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12026–12035, 2019.
[17] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In IEEE ICCV, pages 2602–2611, 2017.
[18] Jonathan Taylor, Jamie Shotton, Toby Sharp, and Andrew Fitzgibbon. The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 103–110. IEEE, 2012.
[19] Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, and Larry S Davis. LiteEval: A coarse-to-fine framework for resource efficient video recognition. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[20] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[21] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. MixSTE: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13232–13242, 2022.
[22] Weixi Zhao, Weiqiang Wang, and Yunjie Tian. GraFormer: Graph-oriented transformer for 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20438–20447, 2022.
[23] Yunqing Zhao and Ngai-Man Cheung. FS-BAN: Born-again networks for domain generalization few-shot classification. IEEE Transactions on Image Processing, 2023.
[24] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11656–11665, 2021.
