
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Category-Level Articulated Object Pose Estimation

Xiaolong Li1∗   He Wang2∗   Li Yi3   Leonidas Guibas2   A. Lynn Abbott1   Shuran Song4
1 Virginia Tech   2 Stanford University   3 Google Research   4 Columbia University
∗ indicates equal contributions.
articulated-pose.github.io

Abstract

This paper addresses the task of category-level pose estimation for articulated objects from a single depth image. We present a novel category-level approach that correctly accommodates object instances previously unseen during training. We introduce the Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH) – a canonical representation for different articulated objects in a given category. As the key to achieving intra-category generalization, the representation constructs a canonical object space as well as a set of canonical part spaces. The canonical object space normalizes the object orientation, scale, and articulation (e.g., joint parameters and states), while each canonical part space further normalizes its part pose and scale. We develop a deep network based on PointNet++ that predicts ANCSH from a single depth point cloud, including part segmentation, normalized coordinates, and joint parameters in the canonical object space. By leveraging the canonicalized joints, we demonstrate: 1) improved performance in part pose and scale estimation using the induced kinematic constraints from joints; 2) high accuracy for joint parameter estimation in camera space.

Figure 1. Category-level articulated object pose estimation. Given a depth point cloud of a novel articulated object from a known category, our algorithm estimates: part attributes, including part segmentation, poses, scales, and amodal bounding boxes; and joint attributes, including joint parameters and joint states.

1. Introduction

Our environment is populated with articulated objects, ranging from furniture such as cabinets and ovens to small tabletop objects such as laptops and eyeglasses. Effectively interacting with these objects requires a detailed understanding of their articulation states and part-level poses. Such understanding is beyond the scope of typical 6D pose estimation algorithms, which have been designed for rigid objects [31, 25, 24, 28]. Algorithms that do consider object articulations [13, 14, 12, 16] often require the exact object CAD model and the associated joint parameters at test time, preventing them from generalizing to new object instances.

In this paper, we adopt a learning-based approach to perform category-level pose estimation for articulated objects. Specifically, we consider the task of estimating per-part 6D poses and 3D scales, joint parameters (i.e., type, position, axis orientation), and joint states (i.e., joint angle) of a novel articulated object instance in a known category from a single depth image. Here, object instances from one category share a known kinematic chain composed of a fixed number of rigid parts connected by certain types of joints. We are particularly interested in the two most common joint types: revolute joints that cause 1D rotational motion (e.g., door hinges), and prismatic joints that allow 1D translational movement (e.g., drawers in a cabinet). An overview is shown in Figure 1. To achieve this goal, several major challenges need to be addressed:

First, to handle novel articulated objects without knowing exact 3D CAD models, we need to find a shared representation for different instances within a given category. The representation needs to accommodate the large variations in part geometry, joint parameters, joint states, and self-occlusion patterns. More importantly, for learning on such diverse data, the representation needs to facilitate intra-category generalization.

Second, in contrast to a rigid object, an articulated object is composed of multiple rigid parts, leading to a higher degree of freedom in its pose. Moreover, the parts are connected and constrained by certain joints, and hence their poses are not independent.
It is challenging to accurately estimate poses in such a high-dimensional space while complying with physical constraints.

Third, various types of joints provide different physical constraints and priors for part articulations. Designing a framework that can accurately predict the parameters and effectively leverage the constraints for both revolute and prismatic joints remains an open research problem.

To address the representation challenge, we propose a shared category-level representation for different articulated object instances, namely the Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH). Concretely, ANCSH is a two-level hierarchy of canonical spaces, composed of a Normalized Articulated Object Coordinate Space (NAOCS) at the root level and a set of Normalized Part Coordinate Spaces (NPCSs) at the leaf level. In the NAOCS, object scales, orientations, and joint states are normalized. In the NPCS of each rigid part, the part pose and scale are further normalized. We note that NAOCS and NPCS are complementary to each other: NAOCS provides a canonical reference at the object level, while NPCSs provide canonical part references. The two-level reference frames from ANCSH allow us to define per-part poses as well as joint attributes for previously unseen articulated object instances at the category level.

To address the pose estimation challenge, we segment objects into multiple rigid parts and predict the normalized coordinates in ANCSH. However, separate per-part pose estimation could easily lead to physically impossible solutions, since joint constraints are not considered. To conform with the kinematic constraints introduced by joints, we estimate joint parameters in the NAOCS from the observation, mathematically model the constraints based upon the joint type, and then leverage the kinematic priors to regularize the part poses. We formulate articulated pose fitting from the ANCSH to the depth observation as a combined optimization problem, taking both part pose fitting and joint constraints into consideration. In this work we mainly focus on 1D revolute joints and 1D prismatic joints, while the formulation can be extended to model and support other types of joints.

Our experiments demonstrate that leveraging the joint constraints in the combined optimization leads to improved performance in part pose and scale prediction. However, leveraging joint constraints to regularize part poses requires high-accuracy joint parameter predictions, which is itself very challenging. Instead of directly predicting joint parameters in camera space, we leverage predictions in the NAOCS, where joints are posed in a canonical orientation, e.g., the revolute joints always point upward for eyeglasses. By transforming joint parameter predictions from the NAOCS back to camera space, we further demonstrate superior accuracy for camera-space joint parameter predictions.

In summary, the primary contribution of our paper is a unified framework for category-level articulated pose estimation. In support of this framework, we design:

· A novel category-level representation for articulated objects – the Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH).

· A PointNet++ based neural network that is capable of predicting ANCSH for previously unseen articulated object instances from a single depth input.

· A combined optimization scheme that leverages ANCSH predictions along with induced joint constraints for part pose and scale estimation.

· A two-step approach for high-accuracy joint parameter estimation that first predicts joint parameters in the NAOCS and then transforms them into camera space using part poses.

2. Related Work

This section summarizes related work on pose estimation for rigid and articulated objects.

Rigid object pose estimation. Classically, the goal of pose estimation is to infer an object's 6D pose (3D rotation and 3D location) relative to a given reference frame. Most previous work has focused on estimating instance-level pose by assuming that exact 3D CAD models are available. For example, traditional algorithms such as iterative closest point (ICP) [4] perform template matching by aligning the CAD model with an observed 3D point cloud. Another family of approaches aims to regress the object coordinates onto its CAD model for each observed object pixel, and then uses voting to solve for the object pose [6, 7]. These approaches are limited by the need for exact CAD models of particular object instances.

Category-level pose estimation aims to infer an object's pose and scale relative to a category-specific canonical representation. Recently, Wang et al. [28] extended the object coordinate based approach to perform category-level pose estimation. The key idea behind the intra-category generalization is to regress the coordinates within a Normalized Object Coordinate Space (NOCS), where the sizes are normalized and the orientations are aligned for objects in a given category. Whereas the work by [28] focuses on pose and size estimation for rigid objects, the work presented here extends the NOCS concept to accommodate articulated objects at both the part and object level. In addition to pose, our work also infers joint information and addresses particular problems related to occlusion.

Articulated object pose estimation. Most algorithms that attempt pose estimation for articulated objects assume that instance-level information is available.
Figure 2. The Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH) is a category-level object representation composed of a Normalized Articulated Object Coordinate Space (NAOCS) on top of a set of Normalized Part Coordinate Spaces (NPCSs), one per part. Here we show two examples of the ANCSH representation (points are colored according to their corresponding coordinates in the NAOCS/NPCS). Note that NAOCS sets the object articulations to pre-defined states, so all the joints in the NAOCS are canonicalized; e.g., the axes of the revolute joints in the eyeglasses example all point upward and the joint angles are right angles. For each individual part, NPCS maintains the part orientation as in the NAOCS but zero-centers its position and normalizes its scale.
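To make the per-part normalization described in the caption concrete, the following is a minimal numpy sketch that maps a part's points (already oriented as in the NAOCS) into its NPCS by zero-centering the tight bounding box and scaling its diagonal to length 1, following the normalization described in Sec. 4.1. The function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def part_to_npcs(points):
    """Map a part's 3D points (already oriented as in the NAOCS) into its NPCS.

    The tight bounding box is zero-centered and the points are uniformly
    scaled so that the box diagonal has length 1, mirroring the per-part
    NOCS-style normalization described in Sec. 4.1.
    """
    bb_min, bb_max = points.min(axis=0), points.max(axis=0)
    center = 0.5 * (bb_min + bb_max)             # bounding-box center
    diagonal = np.linalg.norm(bb_max - bb_min)   # bounding-box diagonal length
    return (points - center) / diagonal

# Example: a toy part whose NPCS coordinates end up zero-centered with diagonal 1.
part_points = np.random.rand(1000, 3) * np.array([0.3, 0.1, 0.05])
npcs_points = part_to_npcs(part_points)
```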

These approaches often use CAD models for particular instances along with known kinematic parameters to constrain the search space and to recover the pose separately for different parts [18, 9]. Michel et al. [18] used a random forest to vote for pose parameters on canonical body parts for each point in a depth image, followed by a variant of the Kabsch algorithm to estimate joint parameters using RANSAC-based energy minimization. Desingh et al. [9] adopted a generative approach using a Markov Random Field formulation, factoring the state as individual parts constrained by their articulation parameters. However, these approaches only consider known object instances and cannot handle part and kinematic variations. A recent work [1] also tries to handle novel objects within the same category by training a mixture density model [5] on depth images; their method can infer the kinematic model from the probability predictions of a mixture of Gaussians. However, they do not explicitly estimate part-level poses, and their simplified geometry predictions, such as length and width, are for the whole object with scale variation only.

Another line of work relies on active manipulation of an object to infer its articulation pattern [13, 14, 12, 16, 32]. For example, Katz et al. [14] use a robot manipulator to interact with articulated objects while RGB-D videos are recorded. The 3D points are then clustered into rigid parts according to their motion. Although these approaches can perform pose estimation for unknown objects, they require the input to be a sequence of images that observe an object's different articulation states, whereas our approach is able to perform the task using a single depth observation.

Human body and hand pose estimation. Two specific articulated classes have gained considerable attention recently: the human body and the human hand. For human pose estimation, approaches have been developed using end-to-end networks to predict 3D joint locations directly [17, 23, 19], using dense correspondence maps between 2D images and 3D surface models [3], or estimating full 3D shape through 2D supervision [15, 20]. Similarly, techniques for hand pose estimation (e.g., [27, 11]) leverage dense coordinate regression, which is then used for voting for 3D joint locations. Approaches for both body and hand pose estimation are often specifically customized for those object types, relying on a fixed skeletal model with class-dependent variability (e.g., expected joint lengths) and strong shape priors (e.g., using a parametric body shape model for low-dimensional parameterization). Also, such hand/body approaches accommodate only revolute joints. In contrast, our algorithm is designed to handle generic articulated objects with varying kinematic chains, allowing both revolute and prismatic joints.

3. Problem Statement

The input to the system is a 3D point cloud P = {p_i ∈ R^3 | i = 1, ..., N} backprojected from a single depth image representing an unknown object instance from a known category, where N denotes the number of points. We know that all objects from this category share the same kinematic chain composed of M rigid parts {S^(j) | j = 1, ..., M} and K joints with known types {J_k | k = 1, ..., K}. The goal is to segment the point cloud into rigid parts {S^(j)}, recover the 3D rotations {R^(j)}, 3D translations {t^(j)}, and sizes {s^(j)} for the parts in {S^(j)}, and predict the joint parameters {φ_k} and states {θ_k} for the joints in {J_k}. In this work, we consider 1D revolute joints and 1D prismatic joints, parameterized as follows. For a revolute joint, its joint parameters include the direction of the rotation axis u_k^(r) as well as a pivot point q_k on the rotation axis; its joint state is defined as the relative rotation angle about u_k^(r) between the two connected parts compared with a pre-defined rest state. For a prismatic joint, its joint parameter is the direction of the translation axis u_k^(t), and its joint state is defined as the relative translation distance along u_k^(t) between the two connected parts compared with a pre-defined rest state.
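For reference, the back-projection mentioned above is the standard pinhole-camera operation. A minimal numpy sketch is given below; the intrinsic matrix K and the assumption of a metric depth map are illustrative and not specified by the paper.

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a depth image (H x W, in meters) into an N x 3 point cloud.

    K is the 3x3 pinhole intrinsic matrix; pixels with zero depth
    (missing measurements) are discarded.
    """
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```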

4. Method

ANCSH provides a category-specific reference frame defining per-part poses as well as joint attributes for previously unseen articulated object instances. In Sec. 4.1, we first explain ANCSH in detail. In Sec. 4.2, we then present a deep neural network capable of predicting the ANCSH representation. Sec. 4.3 describes how the ANCSH representation is used to jointly optimize part poses with explicit joint constraints. Last, we describe how we compute joint states and deduce camera-space joint parameters in Sec. 4.4.

4.1. ANCSH Representation

Our ANCSH representation is inspired by and closely related to the Normalized Object Coordinate Space (NOCS) [28], which we briefly review here. NOCS is defined as a 3D space contained within a unit cube and was introduced in [28] to estimate the category-level 6D pose and size of rigid objects. For a given category, the objects are consistently aligned by their orientations in the NOCS. Furthermore, these objects are zero-centered and uniformly scaled so that their tight bounding boxes are all centered at the origin of the NOCS with a diagonal length of 1. NOCS provides a reference frame for rigid objects in a given category so that the object pose and size can then be defined using the similarity transformation from the NOCS to the camera space. However, NOCS is limited for representing articulated objects: instead of the object pose and size, we care more about the poses and states of the individual parts and joints, which are not addressed by NOCS.

To define category-level per-part poses and joint attributes, we present ANCSH, a two-level hierarchy of normalized coordinate spaces, as shown in Figure 2. At the root level, NAOCS provides an object-level reference frame with normalized pose, scale, and articulation; at the leaf level, NPCS provides a reference frame for each individual part. We explain both NPCS and NAOCS in detail below.

NAOCS. To construct a category-level object reference frame for a collection of objects, we first bring all the object articulations into a set of pre-defined rest states. Basically, for each joint J_k, we manually define its rest state θ_k^0 and then set the joint into this state. For example, we define the rest states of the two revolute joints in the eyeglasses category to be right angles, and we define the rest states of all drawers to be closed. In addition to normalizing the articulations, NAOCS applies the same normalization used in [28] to the objects, including zero-centering, aligning orientations, and uniform scaling.

As a canonical object representation, NAOCS has the following advantages: 1) the joints are set to predefined states, so accurately estimating joint parameters in NAOCS, e.g., the direction of the rotation/translation axis, becomes an easy task; 2) with the canonical joints, we can build simple mathematical models to describe the kinematic constraints regarding each individual joint in NAOCS.

NPCS. For each part, NPCS further zero-centers its position and uniformly scales it as is done in [28], while keeping its orientation unchanged from the NAOCS. In this respect, NPCS is defined similarly to NOCS [28], but for individual parts instead of whole objects. NPCS provides a part reference frame, and we can define the part pose and scale as the transformation from NPCS to camera space. Note that corresponding parts of different object instances are aligned in NPCS, which facilitates intra-category generalization and enables predictions for unseen instances.

Relationship between NPCS, NAOCS, and NOCS. Both NPCS and NAOCS are inspired by the NOCS representation and designed for handling a collection of articulated objects from a given category. Therefore, similar to NOCS, both representations encode canonical information and enable generalization to new object instances. However, each of the two representations has its own advantages in modeling articulated objects and hence provides complementary information. Thus, our ANCSH leverages both NPCS and NAOCS to form a comprehensive representation of both parts and articulations.

On the one hand, NPCSs normalize the position, orientation, and size of each part. Therefore, the transformations between NPCSs and camera space can naturally be used to compute per-part 3D amodal bounding boxes, which are not well represented by the NAOCS alone. On the other hand, NAOCS looks at the parts from a holistic view, encoding the canonical relationship of different parts in the object space. NAOCS provides a parent reference frame for the NPCSs and allows a consistent definition of the joint parameters across different parts. We hence model joints and predict joint parameters in the NAOCS instead of the NPCSs. The joint parameters can be used to deduce joint constraints, which can regularize the poses between connected parts. Note that the information defined in NPCS and NAOCS is not mutually exclusive – each NPCS can be transformed into its counterpart in NAOCS by a uniform scaling and translation. Therefore, instead of independently predicting the full NAOCS representation, our network predicts the scaling and translation parameters for each object part and directly applies them to the corresponding NPCS to obtain the NAOCS estimation.

4.2. ANCSH Network

We devise a deep neural network capable of predicting the ANCSH representation for unseen articulated object instances.

As shown in Figure 3, the network takes a depth point cloud P as input, and its four heads output rigid part segmentation, dense coordinate predictions in each NPCS, transformations from each NPCS to the NAOCS, and joint parameters in the NAOCS, respectively. The network is based on two modules adapted from the PointNet++ [21] segmentation architecture.

Figure 3. The ANCSH network leverages two PointNet++ [21] modules to predict the ANCSH representation, including part segmentation, NPCS coordinates, transformations (1D scaling and 3D translation) from each NPCS to the NAOCS, and joint parameters in the NAOCS. This figure illustrates the eyeglasses case with only revolute joints, but the network structure also applies to objects with revolute and prismatic joints.

The part segmentation head predicts a per-point probability distribution over the M rigid parts. The NPCS head predicts M coordinates {c_i^(j) ∈ R^3 | j = 1, ..., M} for each point p_i. We use the predicted part label to select the corresponding NPCS. This design helps to inject the geometry prior of each part into the network and hence specializes the network for part-specific predictions. We design the segmentation network and the NPCS regression network to share the same PointNet++ backbone and only branch at the last fully-connected layers.

The NAOCS head predicts the transformations {G^(j)} from each NPCS to the NAOCS and computes the coordinates in the NAOCS using the predicted transformations. Since part orientations are the same between NPCS and NAOCS, the network only needs to estimate a 3D translation G_t^(j) and a 1D scaling G_s^(j) for the NPCS of each part S^(j). Similar to the NPCS head, this head predicts for each point p_i dense transformations G_t,i^(j) and G_s,i^(j) for the NPCS of each part S^(j). We use the predicted segmentation label to select the per-point translation G_t,i and scaling G_s,i. The NAOCS coordinates can then be represented as {g_i | g_i = G_s,i c_i + G_t,i}. Finally, we compute G_s^(j) and G_t^(j) by averaging over the points {p_i ∈ S^(j)}.

The last head infers joint parameters {φ'_k} for each joint J_k in the NAOCS (we use the prime here to distinguish the NAOCS parameters from camera-space parameters). We consider the following two types of joints: a 1D revolute joint, whose parameters include the rotation axis direction and the pivot point position, namely φ'_k = (u'_k^(r), q'_k); and a 1D prismatic joint, whose parameter is the translation axis direction, φ'_k = (u'_k^(t)). We adopt a voting scheme to accurately predict joint parameters, in which we first associate points to each joint via a labeling scheme and then let the points vote for the parameters of their associated joint.

We define a per-point joint association {a_i | a_i ∈ {0, 1, ..., K}}, where label k means the point p_i is associated with the joint J_k and label 0 means no association with any joint. We use the following heuristics to provide the ground-truth joint association: for a revolute joint J_k, if a point p_i belongs to one of its two connecting parts and is within a distance σ from its rotation axis, then we set a_i = k; for a prismatic joint, we associate it with all the points on its corresponding moving part. We empirically find that σ = 0.2 leads to a non-overlapping joint association on our data.

In addition to predicting the joint association, the joint parameter head performs dense regression of the associated joint parameters. To be more specific, for each point p_i, the head regresses a 7D vector v_i ∈ R^7. The first three dimensions of v_i form a unit vector, which represents either u'^(r) for a revolute joint or u'^(t) for a prismatic joint. The remaining four dimensions are dedicated to the pivot point q' in case the point is associated with a revolute joint. Since the pivot point of a 1D revolute joint is not uniquely defined (it can move arbitrarily along the rotation axis), we instead predict the projection of p_i onto the rotation axis of its associated revolute joint by regressing a 3D unit vector for the projection direction and a scalar for the projection distance. For training, we only supervise the matched dimensions of v_i for points p_i with a_i ≠ 0. We use the ground-truth joint parameters φ'_{a_i} associated with joint J_{a_i} as the supervision. During inference, we use the predicted joint association to interpret v_i. We perform a voting step to get the final joint parameter prediction φ'_k, where we simply average the predictions from the points associated with each joint J_k. Note that the NAOCS head and the joint parameter head share the second PointNet++ as their backbone, since they both predict attributes in the NAOCS.

Loss functions: We use the relaxed IoU loss [32] L_seg for part segmentation as well as for the joint association loss L_association. We use a mean-square loss L_NPCS for NPCS coordinate regression. We use a mean-square loss L_NAOCS for the NAOCS head to supervise the per-point translations {G_t,i^(j)}_{i,j} and scalings {G_s,i^(j)}_{i,j}. We again use a mean-square loss L_joint for the joint parameter predictions. Our total loss is given by L = λ1 L_seg + λ2 L_NPCS + λ3 L_NAOCS + λ4 L_association + λ5 L_joint, where the loss weights are set to [1, 10, 1, 1, 1].
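To make the inference-time use of these heads concrete, the sketch below (numpy, with illustrative names; not the authors' code) assembles per-point NAOCS coordinates from the NPCS predictions and the predicted per-point scale/translation, and averages the per-point votes into a single joint axis estimate, as described above.

```python
import numpy as np

def assemble_naocs(npcs_coords, scale_per_point, trans_per_point):
    """Per-point NAOCS coordinates: g_i = G_s,i * c_i + G_t,i (Sec. 4.2)."""
    return scale_per_point[:, None] * npcs_coords + trans_per_point

def vote_joint_axis(joint_assoc, axis_pred, joint_id):
    """Average the axis votes of points associated with one joint, then renormalize."""
    votes = axis_pred[joint_assoc == joint_id]   # (n, 3) predicted unit vectors
    mean_axis = votes.mean(axis=0)
    return mean_axis / np.linalg.norm(mean_axis)

# Assumed shapes: npcs_coords (N, 3) selected per-point NPCS coordinates,
# scale_per_point (N,) and trans_per_point (N, 3) from the NAOCS head,
# joint_assoc (N,) predicted association labels (0 = no joint),
# axis_pred (N, 3) per-point axis-direction votes from the joint parameter head.
```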

4.3. Pose Optimization with Kinematic Constraints

Given the output of our ANCSH network, including the part segmentation, {c_i} for each point p_i, {G_t^(j), G_s^(j)} for each part S^(j), and {φ'_k} for each joint J_k, we now estimate the 6D poses and sizes {R^(j), t^(j), s^(j)} for each part S^(j).

Considering a part S^(j), for the points {p_i ∈ S^(j)} we have their corresponding NPCS predictions {c_i | p_i ∈ S^(j)}. We could follow [28] to perform pose fitting, where the Umeyama algorithm [26] is adopted within a RANSAC [10] framework to robustly estimate the 6D pose and size of a single rigid object. However, without leveraging joint constraints, naively applying this approach to each individual part in our setting would easily lead to physically impossible part poses. To cope with this issue, we propose the following optimization scheme leveraging kinematic constraints for estimating the part poses. Without the kinematic constraints, the energy function E_vanilla over all part poses can be written as E_vanilla = Σ_j e_j, where

e_j = (1 / |S^(j)|) Σ_{p_i ∈ S^(j)} || p_i − (s^(j) R^(j) c_i + t^(j)) ||^2

We then introduce the kinematic constraints by adding an energy term e_k for each joint to the energy function. In concrete terms, our modified energy function is E_constrained = Σ_j e_j + λ Σ_k e_k, where e_k is defined differently for each type of joint. For a revolute joint J_k with parameters φ'_k = (u'_k^(r), q'_k) in the NAOCS, assuming it connects part S^(j1) and part S^(j2), we define e_k as:

e_k = || R^(j1) u'_k^(r) − R^(j2) u'_k^(r) ||^2

For a prismatic joint J_k with parameters φ'_k = (u'_k^(t)) in the NAOCS, again assuming it connects part S^(j1) and part S^(j2), we define e_k as:

e_k = μ || R^(j1) (R^(j2))^T − I ||^2 + Σ_{j = j1, j2} || [R^(j) u'_k^(t)]_× δ_{j1,j2} ||^2

where [·]_× converts a vector into the matrix that realizes the cross product with other vectors, and δ_{j1,j2} is defined as:

δ_{j1,j2} = t^(j2) − t^(j1) + s^(j1) R^(j1) G_t^(j1) − s^(j2) R^(j2) G_t^(j2)

To minimize our energy function E_constrained, we can no longer separately solve for the different part poses using the Umeyama algorithm. Instead, we first minimize E_vanilla using the Umeyama algorithm to initialize our estimates of the part poses. Then we fix {s^(j)} and adopt a non-linear least-squares solver to further optimize {R^(j), t^(j)}, as is commonly done for bundle adjustment [2]. Similar to [28], we also use RANSAC for outlier removal.

Finally, for each part S^(j), we use the fitted R^(j), t^(j), s^(j) and the NPCS coordinates {c_i | p_i ∈ S^(j)} to compute an amodal bounding box, the same as in [28].

4.4. Camera-Space Joint Parameters and Joint States Estimation

Knowing {R^(j), t^(j), s^(j), G_t^(j), G_s^(j)} for each part, we can compute the joint states {θ_k} and deduce the joint parameters {φ_k} in camera space from the NAOCS joint parameters {φ'_k}. For a revolute joint J_k connecting parts S^(j1) and S^(j2), we compute its parameters φ_k = (u_k^(r), q_k) in camera space as:

u_k^(r) = (R^(j1) + R^(j2)) u'_k^(r) / || (R^(j1) + R^(j2)) u'_k^(r) ||

q_k = (1/2) Σ_{j = j1, j2} [ (s^(j) R^(j) / G_s^(j)) (q'_k − G_t^(j)) + t^(j) ]

The joint state θ_k can be computed as:

θ_k = arccos( (trace(R^(j2) (R^(j1))^T) − 1) / 2 )

For a prismatic joint J_k connecting parts S^(j1) and S^(j2), we compute its parameter φ_k = (u_k^(t)) in camera space similarly to u_k^(r) for revolute joints, and its state θ_k is simply || δ_{j1,j2} ||.

5. Evaluation

5.1. Experimental Setup

Evaluation Metrics. We use the following metrics to evaluate our method (a small numpy sketch of the rotation and joint-parameter metrics is given after the Datasets paragraph below).

· Part-based metrics. For each part, we evaluate the rotation error measured in degrees, the translation error, and the 3D intersection over union (IoU) [22] of the predicted amodal bounding box.

· Joint states. For each revolute joint, we evaluate the joint angle error in degrees. For each prismatic joint, we evaluate the error of the relative translation amount.

· Joint parameters. For each revolute joint, we evaluate the orientation error of the rotation axis in degrees, and the position error using the minimum line-to-line distance. For each prismatic joint, we compute the orientation error of the translation axis.

Datasets. We have evaluated our algorithm using both synthetic and real-world datasets. To generate the synthetic data, we mainly use object CAD models from [29] along with drawer models from [30]. Following the same rendering pipeline with random camera viewpoints, we use PyBullet [8] to generate on average 3000 testing images of unseen object instances for each object category, which do not overlap with our training data. For the real data, we evaluated our algorithm on the dataset provided by Michel et al. [18], which contains depth images of 4 different objects captured using a Kinect.
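The sketch below illustrates the rotation and joint-parameter metrics listed above. It is a minimal numpy version with illustrative names; exact conventions (e.g., whether the axis orientation error is sign-agnostic) are assumptions not specified in the text.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between predicted and ground-truth rotations, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def axis_orientation_error_deg(u_pred, u_gt):
    """Angle between predicted and ground-truth joint axis directions, in degrees.

    The absolute value makes the error insensitive to the sign of the axis;
    this sign handling is an assumption.
    """
    cos = abs(np.dot(u_pred, u_gt)) / (np.linalg.norm(u_pred) * np.linalg.norm(u_gt))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

def line_to_line_distance(q1, u1, q2, u2):
    """Minimum distance between two 3D lines (point q, unit direction u each)."""
    n = np.cross(u1, u2)
    if np.linalg.norm(n) < 1e-8:                 # (nearly) parallel axes
        d = q2 - q1
        return np.linalg.norm(d - np.dot(d, u1) * u1)
    return abs(np.dot(q2 - q1, n)) / np.linalg.norm(n)
```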

| Category | Method | Rotation Error ↓ | Translation Error ↓ | 3D IoU % ↑ | Joint State Error ↓ | Joint Angle Error ↓ | Joint Distance Error ↓ |
|---|---|---|---|---|---|---|---|
| Eyeglasses | NPCS | 4.0°, 7.7°, 7.2° | 0.044, 0.080, 0.071 | 86.9, 40.5, 41.4 | 8.8°, 8.4° | - | - |
| | NAOCS | 4.2°, 12.1°, 13.5° | 0.157, 0.252, 0.168 | - | 13.7°, 15.1° | - | - |
| | ANCSH | 3.7°, 5.1°, 3.7° | 0.035, 0.051, 0.057 | 87.4, 43.6, 44.5 | 4.3°, 4.5° | 2.2°, 2.3° | 0.019, 0.014 |
| Oven | NPCS | 1.3°, 3.5° | 0.032, 0.049 | 75.8, 88.5 | 4.0° | - | - |
| | NAOCS | 1.7°, 4.7° | 0.036, 0.090 | - | 5.1° | - | - |
| | ANCSH | 1.1°, 2.2° | 0.030, 0.046 | 75.9, 89.0 | 2.1° | 0.8° | 0.024 |
| Washing Machine | NPCS | 1.1°, 2.0° | 0.043, 0.056 | 86.9, 88.0 | 2.3° | - | - |
| | NAOCS | 1.1°, 3.3° | 0.072, 0.119 | - | 3.1° | - | - |
| | ANCSH | 1.0°, 1.4° | 0.042, 0.053 | 87.0, 88.3 | 1.0° | 0.7° | 0.008 |
| Laptop | NPCS | 11.6°, 4.4° | 0.098, 0.044 | 35.7, 93.6 | 14.4° | - | - |
| | NAOCS | 12.4°, 4.9° | 0.110, 0.049 | - | 15.2° | - | - |
| | ANCSH | 6.7°, 4.3° | 0.062, 0.044 | 41.1, 93.0 | 9.7° | 0.5° | 0.017 |
| Drawer | NPCS | 1.9°, 3.5°, 2.4°, 1.8° | 0.032, 0.038, 0.024, 0.025 | 82.8, 71.2, 71.5, 79.3 | 0.026, 0.031, 0.046 | - | - |
| | NAOCS | 1.5°, 2.5°, 2.5°, 2.0° | 0.044, 0.045, 0.073, 0.054 | - | 0.043, 0.066, 0.048 | - | - |
| | ANCSH | 1.0°, 1.1°, 1.2°, 1.5° | 0.024, 0.021, 0.021, 0.033 | 84.0, 72.1, 71.7, 78.6 | 0.011, 0.020, 0.030 | 0.8°, 0.8°, 0.8° | - |

Table 1. Performance comparison on unseen object instances; each cell lists per-part (or per-joint) values. The categories eyeglasses, oven, washing machine, and laptop contain only revolute joints; the drawer category contains three prismatic joints.

Baselines. There are no existing methods for category-level articulated object pose estimation. We therefore use ablated versions of our system for baseline comparisons.

· NPCS. This algorithm predicts part segmentation and NPCS for each part (without the joint parameters). The prediction allows the algorithm to infer the part pose, the amodal bounding box for each part, and the joint state for each revolute joint by treating each part as an independent rigid body. However, it is not able to perform a combined optimization with the kinematic constraints.

· NAOCS. This algorithm predicts part segmentation and the NAOCS representation for the whole object instance. The prediction allows the algorithm to infer part poses and joint states, but not the amodal bounding boxes for each part, since the amodal bounding boxes are not defined in the NAOCS alone. Note that the part pose here is defined from the NAOCS to camera space, different from the one we define based upon NPCS. We measure the error in the observed object scale so that it is comparable with our method.

· Direct joint voting. This algorithm directly votes for joint-associated parameters in camera space, including offset vectors and orientations for each joint, from the point cloud using a PointNet++ segmentation network.

Our final algorithm predicts the full ANCSH representation, which includes NPCS, joint parameters, and the per-point global scaling and translation values that can be used together with the NPCS prediction to compute the NAOCS.

5.2. Experimental Results

Figure 4 presents qualitative results, and Table 1 summarizes the quantitative results. The following paragraphs provide our analysis and discussion of the results.

Effect of combined optimization. First, we want to examine how the combined optimization, using both predicted joint parameters and predicted part poses, influences the accuracy of articulated object pose estimation. To see this, we compare the performance of NPCS and ANCSH, where NPCS performs per-part pose estimation and ANCSH performs a combined optimization using the full kinematic chain to constrain the result. The results in Table 1 show that the combined optimization of joint parameters and part poses consistently improves the predicted results for almost all object categories and on almost all evaluation metrics. The improvement is particularly salient for thin object parts such as the two temples of the eyeglasses (the parts that extend over the ears), where the per-part method produces large pose errors due to the limited number of visible points and shape ambiguity. This result demonstrates that the joint parameters predicted in the NAOCS can regularize the part poses based on kinematic chain constraints during the combined pose optimization step and improve the pose estimation accuracy.

Joint parameters estimation. Predicting the location and orientation of joints in camera space directly, with all degrees of freedom, is challenging. Our approach predicts the joint parameters in the NAOCS, since it provides a canonical representation where the joint axes usually have a strong orientation prior. We further use a voting-based scheme to reduce the prediction noise. Given joint axis predictions in the NAOCS, we leverage the transformation between the NAOCS and NPCS to compute the corresponding joint parameters in NPCS. Based on the high-quality prediction of part poses, we then transform the joint parameters into the camera coordinate frame. Compared to a direct voting baseline using PointNet++, our approach significantly improves the joint axis prediction for unseen instances (Table 2).
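As a concrete illustration of this two-step scheme, the following minimal numpy sketch applies the closed-form expressions of Sec. 4.4 to move a revolute joint's NAOCS axis and pivot into camera space using the two connected parts' estimated poses. Variable names and the dictionary layout are illustrative; this is a sketch of the published formulas, not the authors' code.

```python
import numpy as np

def revolute_joint_to_camera(u_naocs, q_naocs, parts):
    """Transform a revolute joint's NAOCS axis and pivot into camera space (Sec. 4.4).

    `parts` is a two-element list; each entry holds the part's estimated
    camera-space rotation R (3x3), translation t (3,), size s (scalar),
    and the NAOCS-head outputs G_s (scalar) and G_t (3,).
    """
    R1, R2 = parts[0]["R"], parts[1]["R"]
    u_cam = (R1 + R2) @ u_naocs
    u_cam /= np.linalg.norm(u_cam)                 # rotate by both parts, renormalize
    q_cam = np.zeros(3)
    for p in parts:                                # average the two per-part mappings
        q_cam += (p["s"] / p["G_s"]) * (p["R"] @ (q_naocs - p["G_t"])) + p["t"]
    return u_cam, 0.5 * q_cam

def revolute_joint_state(R1, R2):
    """Joint angle (radians) from the relative rotation of the two connected parts."""
    cos = (np.trace(R2 @ R1.T) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))
```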

Figure 4. Qualitative results. The top two rows show test results on unseen object instances from the Shape2Motion dataset [29] and the SAPIEN dataset [30] (drawer category only). The bottom two rows show test results on seen instances in the real-world dataset [18]. Here we visualize the predicted amodal bounding box for each part. Color images are for visualization only.

| Category | Method | Angle error | Distance error |
|---|---|---|---|
| Eyeglasses | PointNet++ | 2.9°, 15.7° | 0.140, 0.197 |
| | ANCSH | 2.2°, 2.3° | 0.019, 0.014 |
| Oven | PointNet++ | 27.0° | 0.024 |
| | ANCSH | 0.8° | 0.024 |
| Washing Machine | PointNet++ | 8.7° | 0.010 |
| | ANCSH | 0.7° | 0.008 |
| Laptop | PointNet++ | 29.5° | 0.007 |
| | ANCSH | 0.5° | 0.017 |
| Drawer | PointNet++ | 4.9°, 5.0°, 5.1° | - |
| | ANCSH | 0.8°, 0.8°, 0.8° | - |

Table 2. A comparison of joint parameter estimation. Here PointNet++ denotes the direct joint voting baseline.

Generalization to real depth images. We have also tested our algorithm's ability to generalize to real-world depth images on the dataset provided in [18]. The dataset contains video sequences captured with a Kinect for four different object instances. Following the same training protocol, we train the algorithm with synthetically rendered depth images of the provided object instances. We then test the pose estimation accuracy on the real-world depth images. We adopt the same evaluation metric as [18], which uses 10% of the object part diameter as the threshold to compute the Averaged Distance (AD) accuracy, and test the performance on each sequence separately. Although our algorithm is not specifically designed for instance-level pose estimation and the network has never been trained on any real-world depth images, our algorithm achieves strong performance on par with or even better than the state of the art. On average, our algorithm achieves 96.25%, 92.3%, 96.9%, and 79.8% AD accuracy on the whole kinematic chain of the laptop, cabinet, cupboard, and toy train instances, respectively. For detailed results on each part in all the test sequences, as well as more visualizations, please refer to the supplementary material.

6. Conclusion

This paper has presented an approach for category-level pose estimation of articulated objects from a single depth image. To accommodate unseen object instances with large intra-category variations, we introduce a novel object representation, namely the Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH). We further devise a deep neural network capable of predicting ANCSH from a single depth point cloud. We then formulate articulated pose fitting from the ANCSH predictions as a combined optimization problem, taking both part pose errors and joint constraints into consideration. Our experiments demonstrate that the ANCSH representation and the combined optimization scheme significantly improve the accuracy of both part pose prediction and joint parameter estimation.

Acknowledgement: This research is supported by a grant from the Toyota-Stanford Center for AI Research and by resources provided by Advanced Research Computing in the Division of Information Technology at Virginia Tech. We thank the Vision and Learning Lab at Virginia Tech for help with the visualization tool. We are also grateful for the financial and hardware support from Google.

References

[1] Ben Abbatematteo, Stefanie Tellex, and George Konidaris. Learning to generalize kinematic models to novel objects. In Proceedings of the Third Conference on Robot Learning, 2019.
[2] Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In European Conference on Computer Vision, pages 29–42. Springer, 2010.
[3] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
[4] Paul J Besl and Neil D McKay. A method for registration of 3-D shapes. PAMI, 1992.
[5] Christopher M Bishop. Mixture density networks. 1994.
[6] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6D object pose estimation using 3D object coordinates. In European Conference on Computer Vision, pages 536–551. Springer, 2014.
[7] Eric Brachmann, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, et al. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3364–3372, 2016.
[8] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2018.
[9] Karthik Desingh, Shiyang Lu, Anthony Opipari, and Odest Chadwicke Jenkins. Factored pose estimation of articulated objects using efficient nonparametric belief propagation. arXiv preprint arXiv:1812.03647, 2018.
[10] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[11] Liuhao Ge, Zhou Ren, and Junsong Yuan. Point-to-point regression PointNet for 3D hand pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 475–491, 2018.
[12] Karol Hausman, Scott Niekum, Sarah Osentoski, and Gaurav S Sukhatme. Active articulation model estimation through interactive perception. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3305–3312. IEEE, 2015.
[13] Dov Katz and Oliver Brock. Manipulating articulated objects with interactive perception. In 2008 IEEE International Conference on Robotics and Automation, pages 272–277. IEEE, 2008.
[14] Dov Katz, Moslem Kazemi, J Andrew Bagnell, and Anthony Stentz. Interactive segmentation, tracking, and kinematic modeling of unknown 3D articulated objects. In 2013 IEEE International Conference on Robotics and Automation, pages 5003–5010. IEEE, 2013.
[15] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6050–6059, 2017.
[16] Roberto Martín-Martín, Sebastian Höfer, and Oliver Brock. An integrated approach to visual perception of articulated objects. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5091–5097. IEEE, 2016.
[17] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG), 36(4):44, 2017.
[18] Frank Michel, Alexander Krull, Eric Brachmann, Michael Ying Yang, Stefan Gumhold, and Carsten Rother. Pose estimation of kinematic chain instances via object coordinate regression. In BMVC, pages 181–1, 2015.
[19] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7025–7034, 2017.
[20] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 459–468, 2018.
[21] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[22] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
[23] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 2602–2611, 2017.
[24] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3D orientation learning for 6D object detection from RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 699–715, 2018.
[25] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
[26] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, (4):376–380, 1991.
[27] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Dense 3D regression for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2018.
[28] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2642–2651, 2019.
[29] Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, and Kai Xu. Shape2Motion: Joint analysis of motion parts and attributes from 3D shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019.
[30] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment, 2020.
[31] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
[32] Li Yi, Haibin Huang, Difan Liu, Evangelos Kalogerakis, Hao Su, and Leonidas Guibas. Deep part induction from articulated object pairs. In SIGGRAPH Asia 2018 Technical Papers, page 209. ACM, 2018.
