6D Pose Estimation For Textureless Objects On RGB Frames Using Multi-View Optimization
Abstract— 6D pose estimation of textureless objects is a valuable but challenging task for many robotic applications. In this work, we propose a framework to address this challenge using only RGB images acquired from multiple viewpoints. The core idea of our approach is to decouple 6D pose estimation into separate 3D translation and 3D rotation estimation steps, each solved with a multi-view optimization formulation.

... boost the object pose estimation performance using only RGB images [23], [24], [25], [26], [27]. However, due to the scale, depth, and perspective ambiguities inherent to a single viewpoint, RGB-based solutions usually have low accuracy for the final estimated 6D poses.
This work was supported by Epson Canada Ltd.
*Jun Yang and Steven L. Waslander are with the University of Toronto Institute for Aerospace Studies and Robotics Institute. {jun.yang, steven.waslander}@robotics.utias.utoronto.ca
†Wenjie Xue and Sahar Ghavidel are with Epson Canada. {mark.xue, sahar.ghavidel}@ea.epson.com

II. RELATED WORK

A. Object Pose Estimation from a Single RGB Image
Fig. 1: An overview of the proposed multi-view object pose estimation pipeline with a two-step optimization formulation.
Many approaches have been presented in recent years to address the pose estimation problem for textureless objects with only RGB images. Due to the lack of appearance features, traditional methods usually tackle the problem via holistic template matching techniques [21], [34], [35], but are susceptible to scale changes and cluttered environments. More recently, deep learning techniques, such as convolutional neural networks (CNNs), have been employed to overcome these challenges. As two pioneering methods, SSD-6D [23] and PoseCNN [24] developed CNN architectures to estimate the 6D object pose from a single RGB image. In comparison, some recent works leverage CNNs to first predict 2D object keypoints [36], [37], [26] or dense 2D-3D correspondences [38], [39], [27], [40], and then compute the pose from the 2D-3D correspondences with a PnP algorithm [41]. Although these methods show good 2D detection results, the accuracy of the final 6D poses is generally low.

B. Object Pose Estimation from Multiple Viewpoints

Multi-view approaches aim to resolve the scale and depth ambiguities that commonly occur in the single-viewpoint setting and to improve the accuracy of the estimated poses. Traditional works utilize local features [42], [43] and cannot handle textureless objects. Recently, the multi-view object pose estimation problem has been revisited with neural networks. These approaches use an offline, batch-based optimization formulation, where all frames are given at once, to obtain a single consistent scene interpretation [44], [28], [20], [30]. Compared to batch-based methods, other works solve the multi-view pose estimation problem in an online manner. They estimate camera poses and object poses simultaneously, known as object-level SLAM [5], [6], [45], [7], or estimate object poses with known camera poses [1], [29], [31]. Although these methods show performance improvements with only RGB images, they still face difficulty in dealing with object scales, rotational symmetries, and measurement uncertainties.

With the per-frame neural network predictions as measurements, our work resolves the depth and scale ambiguities via a decoupled formulation. It also explicitly handles rotational symmetries and measurement uncertainties within an incremental online framework.

III. APPROACH OVERVIEW AND PROBLEM FORMULATION

Given the 3D object model and multi-view images, the goal of 6D object pose estimation is to estimate the rigid transformation T_wo ∈ SE(3) from the object model frame O to a global (world) frame W. We assume that the camera poses T_wc ∈ SE(3) with respect to the world frame are known. They can be obtained from robot forward kinematics and eye-in-hand calibration when the camera is mounted on the end-effector of a robotic arm [46], or from off-the-shelf SLAM methods for a hand-held camera [47], [48].

Given measurements Z_1:k up to viewpoint k, we aim to estimate the posterior distribution of the 6D object pose P(R_wo, t_wo | Z_1:k). Direct computation of this distribution is generally not feasible since the object translation t_wo and rotation R_wo have distinct distributions. Specifically, the translation distribution P(t_wo) is straightforward and expected to be unimodal. In contrast, the distribution of the object rotation P(R_wo) is less obvious due to complex uncertainties arising from shape symmetries, appearance ambiguities, and possible occlusions. Inspired by [29], we decouple the pose posterior P(R_wo, t_wo | Z_1:k) into:

P(R_wo, t_wo | Z_1:k) = P(R_wo | Z_1:k, t_wo) P(t_wo | Z_1:k)    (1)

where P(t_wo | Z_1:k) can be formulated as a unimodal Gaussian distribution N(t_wo | µ, Σ), and P(R_wo | Z_1:k, t_wo) is the rotation distribution conditioned on the input images Z_1:k and the 3D translation t_wo. To represent the complex rotation uncertainties, similar to [42], we formulate P(R_wo | Z_1:k, t_wo) as a mixture of Gaussians:

P(R_wo | Z_1:k, t_wo) = Σ_{i=1}^{N} w_i N(R_wo | µ_i, Σ_i)    (2)

which consists of N Gaussian components. The coefficient w_i denotes the weight of the i-th mixture component, and µ_i and Σ_i are its mean and covariance, respectively.
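To make the decoupled belief of Equations 1 and 2 concrete, the following Python sketch shows one possible in-memory representation: a single Gaussian over the translation and a weighted list of Gaussian components over the rotation. This is an illustration only, not the authors' implementation, and all class and attribute names are our own.

import numpy as np
from dataclasses import dataclass, field

@dataclass
class TranslationBelief:
    # Unimodal Gaussian over the 3D translation t_wo, as in Equation 1.
    mean: np.ndarray = field(default_factory=lambda: np.zeros(3))
    cov: np.ndarray = field(default_factory=lambda: np.eye(3))

@dataclass
class RotationComponent:
    # One Gaussian component of the rotation mixture in Equation 2:
    # 'mean' is a 3x3 rotation matrix, 'cov' is a 3x3 covariance in the
    # tangent space so(3), and 'weight' is the mixture coefficient w_i.
    mean: np.ndarray
    cov: np.ndarray
    weight: float

@dataclass
class ObjectPoseBelief:
    translation: TranslationBelief
    rotation_mixture: list  # list of RotationComponent

    def best_rotation(self):
        # Point estimate of R_wo: the mean of the highest-weight component.
        return max(self.rotation_mixture, key=lambda c: c.weight).mean

# Example: a belief with a single rotation hypothesis at the identity.
belief = ObjectPoseBelief(TranslationBelief(),
                          [RotationComponent(np.eye(3), 0.01 * np.eye(3), 1.0)])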
Fig. 2: Illustration of the object, world, and camera coordinates. The 3D translation t_wo is the coordinate of the object model origin in the world frame. We can estimate the translation by localizing the per-frame 2D center of the object, u_k, and minimizing the re-projection errors with known camera poses T_wc,k.

Our proposed decoupling formulation implies a useful correlation between translation and rotation in the image domain. The 3D translation estimate t_wo is independent of the object's rotation and encodes the center and scale information of the object. By applying the camera pose T_wc,k at frame k, the estimated 3D translation t_co,k in the camera frame provides the scale and 2D center of the object in the image. Based on it, the per-frame object rotation measurement R_co,k can be estimated from the object's visual appearance in the image. With this formulation, our multi-view framework comprises two main steps, summarized in Figure 1. In the first step (Section IV), we estimate the 3D translation t_wo by integrating the per-frame neural network outputs into an optimization formulation. The network outputs the segmentation mask and the 2D projection u_k of the object's 3D center, and the object's 3D translation t_wo is estimated by minimizing the 2D re-projection error across views. Given the estimated 3D translation t_wo and segmentation mask, in the second step (Section V), we re-crop a rotation-independent Region of Interest (RoI) for each object with the estimated scale. We then feed the RoI into a rotation estimator to get the per-frame 3D rotation measurement R_co,k. The final object rotation R_wo is obtained by an optimization approach with explicit handling of shape symmetries and a max-mixture formulation [32] to counteract measurement uncertainties.

IV. 3D TRANSLATION ESTIMATION

As illustrated in Figure 2, the 3D translation t_wo is the coordinate of the object model origin in the world frame. Since the camera pose T_wc is known, it is equivalent to solve for the translation from the object model origin to the camera optical center, t_co = [t_x, t_y, t_z]^T. Given an RGB image from an arbitrary camera viewpoint, the translation t_co can be recovered by the following back-projection, assuming a pinhole camera model:

[t_x, t_y, t_z]^T = [ (u_x − c_x) t_z / f_x,  (u_y − c_y) t_z / f_y,  t_z ]^T    (3)

where f_x and f_y denote the camera focal lengths and [c_x, c_y]^T is the principal point. We define u = [u_x, u_y]^T as the projection of the object model origin O and call it the 2D center of the object in the rest of the paper. We can see that if we can localize the object center u in the image and estimate the depth t_z, then t_co (or t_wo) is solved. In our framework, we predict the per-frame 2D object center u with a neural network and estimate the depth t_z with a multi-view optimization formulation.
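As a small, self-contained illustration of the back-projection in Equation 3 (a generic pinhole-camera sketch under our own function and parameter names, not code from the paper), the camera-frame translation can be recovered from the 2D center and a depth hypothesis as follows.

import numpy as np

def back_project_center(u, t_z, fx, fy, cx, cy):
    # Recover t_co = [t_x, t_y, t_z] from the 2D object center u = (u_x, u_y)
    # and a depth hypothesis t_z, assuming a pinhole camera model (Equation 3).
    u_x, u_y = u
    t_x = (u_x - cx) * t_z / fx
    t_y = (u_y - cy) * t_z / fy
    return np.array([t_x, t_y, t_z])

# Example with hypothetical intrinsics: the principal point maps onto the
# optical axis, so the recovered translation is [0, 0, t_z].
t_co = back_project_center((320.0, 240.0), 0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)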
Our per-frame 2D object center localization network is shown in the upper part of Figure 1. Our network architecture is based on PVNet [26]. To deal with multiple instances in the scene, we first use the off-the-shelf YOLOv5 detector [49] to obtain 2D bounding boxes of the objects. The detections are then cropped, resized to 128x128, and fed into the network. The network predicts pixel-wise binary labels and a 2D vector field pointing towards the object center. A RANSAC-based voting scheme is finally utilized to estimate the mean u_k and covariance Σ_k of the object center at frame k. For more details of the object center localization prediction, we refer the reader to [26].
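The following sketch illustrates the idea behind RANSAC-based center voting from a pixel-wise vector field. It is a simplified stand-in for the PVNet-style scheme cited above; the sampling strategy, inlier test, and covariance estimate here are our own, not the procedure of [26].

import numpy as np

def ransac_vote_center(pixels, dirs, n_hyp=128, cos_thresh=0.99, rng=None):
    # Estimate the 2D object center (mean and covariance) from a vector field.
    # pixels: (N, 2) array of pixel coordinates inside the object mask.
    # dirs:   (N, 2) array of unit vectors at those pixels pointing to the center.
    rng = np.random.default_rng() if rng is None else rng
    hyps, scores = [], []
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        A = np.stack([dirs[i], -dirs[j]], axis=1)        # 2x2 system of two rays
        if abs(np.linalg.det(A)) < 1e-6:                 # skip near-parallel rays
            continue
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        c = pixels[i] + s * dirs[i]                      # candidate center
        v = c[None, :] - pixels                          # vectors pixel -> candidate
        v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-9)
        votes = np.sum(v * dirs, axis=1) > cos_thresh    # pixels agreeing with c
        hyps.append(c)
        scores.append(votes.sum())
    if not hyps:
        raise ValueError("no valid center hypotheses")
    hyps, scores = np.array(hyps), np.array(scores)
    good = hyps[scores >= 0.8 * scores.max()]            # keep well-supported hypotheses
    mean = good.mean(axis=0)
    cov = np.cov(good, rowvar=False) + 1e-6 * np.eye(2)
    return mean, cov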
Given a sequence of measurements, we can estimate the object 3D translation t_wo with a maximum likelihood estimation (MLE) formulation. By assuming a uni-modal Gaussian error model, we solve it with a nonlinear least squares (NLLS) optimization approach. The optimization is formulated by creating measurement residuals that constrain the object translation t_wo with the object center u_k, its covariance Σ_k, and the known camera pose T_wc,k at viewpoint k,

r_k(t_wo) = π( T_wc,k^{-1} t_wo ) − u_k    (4)

where π is the perspective projection function. The full problem becomes the minimization of the cost L across all viewpoints,

L = Σ_k ρ_H( r_k^T Σ_k^{-1} r_k )    (5)

where Σ_k is the covariance matrix estimated by the localization network for the object center u_k, and ρ_H is the Huber norm used to reduce the impact of outliers on the optimization. We initialize each object's translation t_wo from the first frame using the diagonal length of its 2D bounding box, similar to [23]. With the known camera pose T_wc,k, we perform object association based on epipolar geometry constraints and the estimated translations t_wo,1:k−1 up to viewpoint k−1. Detections that are not associated with any existing object are initialized as new objects.

We solve the NLLS problem (Equations 4 and 5) with an iterative Gauss-Newton procedure:

J_t_wo^T Σ_z^{-1} J_t_wo δt_wo = J_t_wo^T Σ_z^{-1} r(t_wo)    (6)
where Σ_z is the stacked measurement covariance matrix up to the current frame, obtained from the object center localization network (upper part of Figure 1). The overall Jacobian, J_t_wo, is stacked from the per-frame Jacobian matrices J_t_wo,k.
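A compact sketch of the multi-view translation refinement in Equations 4 to 6 is given below. It uses a plain Gauss-Newton loop with a simple Huber-style re-weighting standing in for ρ_H; the helper names, the intrinsics tuple, and the loop details are our own assumptions rather than the authors' exact solver.

import numpy as np

def project(p_c, fx, fy, cx, cy):
    # Perspective projection pi of a camera-frame point onto the image plane.
    x, y, z = p_c
    return np.array([fx * x / z + cx, fy * y / z + cy])

def proj_jacobian(p_c, fx, fy, cx, cy):
    # Jacobian of the projection with respect to the camera-frame point.
    x, y, z = p_c
    return np.array([[fx / z, 0.0, -fx * x / z**2],
                     [0.0, fy / z, -fy * y / z**2]])

def refine_translation(t_wo, T_wc_list, centers, covs, intrinsics,
                       iters=10, huber_delta=3.0):
    # Gauss-Newton refinement of the object translation t_wo (Equations 4-6).
    # T_wc_list: list of 4x4 camera-to-world poses T_wc,k.
    # centers:   list of measured 2D object centers u_k.
    # covs:      list of 2x2 measurement covariances Sigma_k.
    fx, fy, cx, cy = intrinsics
    t = np.asarray(t_wo, dtype=float).copy()
    for _ in range(iters):
        H = np.zeros((3, 3)); b = np.zeros(3)
        for T_wc, u, S in zip(T_wc_list, centers, covs):
            R_cw = T_wc[:3, :3].T
            p_c = R_cw @ (t - T_wc[:3, 3])            # object origin in camera frame
            r = project(p_c, fx, fy, cx, cy) - u      # residual of Equation 4
            J = proj_jacobian(p_c, fx, fy, cx, cy) @ R_cw
            S_inv = np.linalg.inv(S)
            # Huber-style down-weighting of large whitened residuals (rho_H in Eq. 5).
            e = np.sqrt(r @ S_inv @ r)
            w = 1.0 if e <= huber_delta else huber_delta / e
            H += w * J.T @ S_inv @ J
            b += w * J.T @ S_inv @ r
        delta = np.linalg.solve(H, -b)                # normal equations of Equation 6
        t += delta
        if np.linalg.norm(delta) < 1e-6:
            break
    return t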
Fig. 3: (a) The inference of the object RoI size l_s from the projective ratio l_s = (z_r / z_s) l_r. (b) Left: the rendered object template at a canonical distance. Middle: incorrect rotation estimates due to the scale change. Right: re-cropped object RoI using the translation estimate, leading to the correct result.

V. 3D ROTATION ESTIMATION

The procedure for estimating the object rotation R_wo is shown in the lower part of Figure 1. We first adopt a template-matching (TM) based approach, LINE-2D [21], to obtain the per-frame rotation measurement R_co,k. The acquired measurements from multiple viewpoints are then integrated into an optimization scheme. We handle rotational symmetries explicitly given the object CAD model. To counteract measurement uncertainties (e.g., from appearance ambiguities), a max-mixture formulation [32] is used to recover a globally consistent set of object pose estimates. Note that the acquisition of the rotation measurement R_co,k is not limited to LINE-2D [21] or TM-based approaches and can be superseded by other holistic methods [50], [34], [23], [25].
A. Per-Frame Rotation Measurement

Given the object 3D model, LINE-2D renders object templates from a view sphere in the offline training stage (bottom middle in Figure 1). At run-time, it utilizes the gradient response of the input RGB or grayscale image for template matching, and a confidence score is provided based on the matching quality. In general, the TM-based approach suffers from scale changes, and the object templates need to be generated at multiple distances and scales. In our work, instead of training multi-scale templates, which increases the run-time complexity, we fix the 3D translation to a canonical centroid distance t_r = [0, 0, z_r]. At run-time, given the 3D translation hypothesis t_co = [x_s, y_s, z_s] from the object origin to the camera center (obtained from t_wo and the camera pose T_wc), we can re-crop the RoI from the image. The RoI size l_s is determined by l_s = (z_r / z_s) l_r, where l_r and z_r are the RoI size and canonical distance at training time, respectively. This process is illustrated in Figure 3a. Note that the RoI is a square region and is independent of the object's rotation. To further reduce the gap between the rendered templates and RoI images, we take the segmentation mask from the object center localization network (upper part of Figure 1) and then feed the re-cropped object RoI into the LINE-2D estimator to get a per-frame rotation measurement R_co,k, as shown in Figure 3b.
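The square RoI re-cropping described above can be sketched as follows. This is an assumption-laden illustration: plain NumPy cropping with nearest-neighbour resizing, our own function names, and a caller-supplied output size; it is not the paper's actual cropping code.

import numpy as np

def recrop_square_roi(image, t_co, intrinsics, l_r, z_r, out_size=128):
    # Re-crop a rotation-independent square RoI around the projected object
    # center, with side length l_s = (z_r / z_s) * l_r (Section V-A).
    # t_co: translation hypothesis [x_s, y_s, z_s] in the camera frame.
    # l_r, z_r: RoI size (pixels) and canonical distance used at training time.
    fx, fy, cx, cy = intrinsics
    x_s, y_s, z_s = t_co
    u = fx * x_s / z_s + cx                   # projected 2D object center
    v = fy * y_s / z_s + cy
    l_s = (z_r / z_s) * l_r                   # scale the RoI by the depth ratio
    half = l_s / 2.0
    h, w = image.shape[:2]
    x0, x1 = max(int(round(u - half)), 0), min(int(round(u + half)), w)
    y0, y1 = max(int(round(v - half)), 0), min(int(round(v + half)), h)
    crop = image[y0:y1, x0:x1]
    assert crop.size > 0, "object RoI falls outside the image"
    # Nearest-neighbour resize of the crop to a canonical template size.
    ys = np.linspace(0, crop.shape[0] - 1, out_size).round().astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_size).round().astype(int)
    return crop[np.ix_(ys, xs)]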
B. Optimization formulation

Generally, estimating the object 3D rotation from a sequence of measurements can also be formulated as an MLE problem:

X̂ = argmax_X Π_k p(z_k | X)    (7)

where X denotes the object 3D rotation R_wo to be estimated. The measurement z_k here is the object's rotation with respect to the camera frame, R_co,k, obtained as described in Section V-A. The measurement model is a function of the camera pose (rotation part) R_wc,k and the object rotation R_wo in the world frame:

h(R_wo, R_wc,k) = R_wc,k^{-1} R_wo    (8)

We formulate the optimization problem by creating the residual between R_wo and the per-frame measurement R_co,k:

r_k(R_wo) = log( R_co,k h(R_wo, R_wc,k)^{-1} )^∨    (9)

where r_k(R_wo) is expressed in the Lie algebra so(3). To handle rotational symmetries, we consider them explicitly together with the measurement R_co,k in Equation 9. Generally, when an object has symmetries, there exists a set of rotations that leave the object's appearance unchanged:

S(R_co) = { R'_co ∈ SO(3)  s.t.  G(R_co) = G(R'_co) }    (10)

where G(R_co) is the rendered image of the object under rotation R_co (assuming the same object translation). We can then update the measurement R_co,k in Equation 9 to R̄_co,k:

R̄_co,k = argmin_{R'_co,k ∈ S(R_co,k)} ‖ log( R'_co,k h(R_wo, R_wc,k)^{-1} )^∨ ‖    (11)

where ‖·‖ denotes the absolute angle of a 3D rotation vector φ, and R̄_co,k is the updated rotation measurement that has the minimal loss relative to R_wo.
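To illustrate the symmetry handling of Equations 9 to 11, the sketch below selects, from a discrete set of symmetry rotations, the measurement candidate closest to the current rotation estimate. The enumeration of S(R_co) as {R_co · S} for object-frame symmetries S, and all function names, are our assumptions.

import numpy as np

def so3_log(R):
    # Logarithm map SO(3) -> so(3), returned as a 3-vector (axis * angle).
    cos_t = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def rotation_residual(R_wo, R_wc, R_co_meas):
    # Residual of Equation 9: log( R_co_meas * h(R_wo, R_wc)^-1 )^vee,
    # with h(R_wo, R_wc) = R_wc^-1 R_wo (Equation 8).
    h = R_wc.T @ R_wo
    return so3_log(R_co_meas @ h.T)

def symmetry_aware_measurement(R_co_meas, sym_rotations, R_wo, R_wc):
    # Replace the raw measurement by the symmetry-equivalent rotation that is
    # closest in angle to the current estimate (Equations 10 and 11).
    best, best_angle = R_co_meas, np.inf
    for S in sym_rotations:
        cand = R_co_meas @ S
        angle = np.linalg.norm(rotation_residual(R_wo, R_wc, cand))
        if angle < best_angle:
            best, best_angle = cand, angle
    return best, best_angle

# Example: an object with a 180-degree symmetry about its z-axis.
# Rz = np.diag([-1.0, -1.0, 1.0]); sym_rotations = [np.eye(3), Rz]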
C. Measurement ambiguities

Due to complex uncertainties, such uni-modal estimates are still not sufficient to adequately represent the rotation uncertainties. To this end, we now consider a sum-mixture of Gaussians as the likelihood function:

p(z̄_k | X) = Σ_{i=1}^{N} w_i N(µ_i, Σ_i)    (12)

where z̄_k is the updated measurement (using Equation 11), each N(µ_i, Σ_i) represents a distinct Gaussian distribution, and w_i is the weight of component i.

Fig. 4: Max-mixtures for processing the rotation measurements. Note that we show the distribution on only one axis for demonstration purposes. (a) Acquired rotation measurements from different viewpoints. (b) Mixture distribution after two viewpoints. (c) Mixture distribution after five viewpoints.

The problem with a sum-mixture is that the MLE solution is no longer simple and falls outside the scope of common NLLS optimization approaches. Instead, we consider the max-marginal and solve the problem with the following max-mixture formulation [32]:

p(z̄_k | X) = max_{i=1:N} w_i N(µ_i, Σ_i)    (13)

The max operator acts as a selector, keeping the problem a common NLLS optimization. Note that the max-mixture does not make a permanent choice: in each iteration of the optimization, only one of the Gaussian components is selected and optimized. In particular, given a new rotation measurement R̄_co,k at frame k, we evaluate each Gaussian component in Equation 13 by computing the absolute rotation angle error θ_k,i between R̄_co,k and h(R_wo,i, R_wc,k),

θ_k,i = ‖ log( R̄_co,k h(R_wo,i, R_wc,k)^{-1} )^∨ ‖    (14)

and select the one with the minimal angle error. To reduce the impact of outliers, the selected Gaussian component accepts a rotation measurement only if the rotation angle error θ_k,i is less than a pre-defined threshold (30° in our implementation). If the measurement R̄_co,k is not accepted by any Gaussian component, it is considered a new component and added to the current Gaussian-mixture model. We optimize the object rotation R_wo within each component by operating on the tangent space so(3):

J_φ_wo^T Λ_z J_φ_wo δφ_wo = J_φ_wo^T Λ_z r(R_wo)    (15)

where r(R_wo) and J_φ_wo are the stacked rotation residual vector and Jacobian matrix, respectively, across multiple viewpoints. We approximate the weight matrix Λ_z by placing the LINE-2D confidence scores on its diagonal elements.

To compute the weight w_i for each Gaussian component, we accumulate the LINE-2D confidence scores c_i from the rotation measurements within each individual component across the viewpoints. The weight can then be approximated as w_i = c_i / Σ_i c_i. This processing is illustrated in Figure 4. Given the measurements from two viewpoints, the object rotation distribution P(R_wo) is represented with two Gaussian components (green and red) with similar weights. A third component (yellow) is added when observing more viewpoints. As a result of receiving more rotation measurements (after five viewpoints), the weight of the correct component (green) becomes higher than that of the false hypotheses.
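The max-mixture bookkeeping described in this subsection (component selection by absolute angle error, the 30° gate, creation of new components, and confidence-based weights) can be sketched as follows. The data structure and update rule are our own simplified illustration, and the per-component re-optimization of Equation 15 is only indicated by a comment.

import numpy as np

class RotationMaxMixture:
    # Minimal max-mixture bookkeeping for rotation components (Section V-C).
    # Each component stores a world-frame rotation hypothesis R_wo and the
    # accumulated LINE-2D confidence used for the weight w_i = c_i / sum(c).
    def __init__(self, angle_thresh_deg=30.0):
        self.components = []          # list of dicts: {"R_wo": 3x3, "conf": float}
        self.angle_thresh = np.deg2rad(angle_thresh_deg)

    def add_measurement(self, R_co_meas, R_wc, confidence):
        # World-frame rotation implied by this single-view measurement.
        R_wo_meas = R_wc @ R_co_meas
        # Select the component with the minimal absolute angle error (Equation 14).
        best_i, best_angle = None, np.inf
        for i, comp in enumerate(self.components):
            R_rel = R_wo_meas @ comp["R_wo"].T
            angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
            if angle < best_angle:
                best_i, best_angle = i, angle
        if best_i is not None and best_angle < self.angle_thresh:
            # Accept the measurement: accumulate confidence; in the full method
            # the component mean is then re-optimized on so(3) via Equation 15.
            self.components[best_i]["conf"] += confidence
        else:
            # Otherwise spawn a new Gaussian component.
            self.components.append({"R_wo": R_wo_meas, "conf": confidence})

    def weights(self):
        c = np.array([comp["conf"] for comp in self.components])
        return c / c.sum()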
Fig. 5: Examples of (a) the generated synthetic data and (b) real images from the RealSense camera.

VI. EXPERIMENTS

A. Datasets, Baselines and Evaluation Metrics

We evaluate our framework on the recently released ROBI dataset [33]. It provides multiple camera viewpoints and ground-truth 6D poses for textureless, reflective industrial parts. The objects were placed in challenging bin scenarios and captured with two sensors: a high-cost Ensenso camera and a low-cost RealSense camera. For network training purposes, we generate 80,000 synthetic images using the Blender software [51] with the Bullet physics engine [52] and train our object center localization network on synthetic data only. Figure 5 presents some examples of our generated synthetic images. We picked five textureless objects and evaluated on both the Ensenso and RealSense test sets.

Quantitatively, we compare our approach with CosyPose [28], a state-of-the-art multi-view pose fusion solution that takes the object pose estimates from individual viewpoints as input and optimizes the overall scene consistency. Note that CosyPose is an offline, batch-based solution that is agnostic to any particular pose estimator. For a fair comparison, we use the same pose estimator (LINE-2D template matching with the same bounding boxes, object centers, and segmentation masks from our object center localization network), and we additionally provide CosyPose with the known camera poses. To feed reliable single-view estimates to CosyPose, we use two strategies to obtain the scale information for the LINE-2D pose estimator. For the first strategy, we generate templates at multiple distances at training time (9 in our experiments) and perform standard template matching.
                      Ensenso, 4 Views           Ensenso, 8 Views           RealSense, 4 Views         RealSense, 8 Views
Objects               CosyPose  CosyPose  Ours   CosyPose  CosyPose  Ours   CosyPose  CosyPose  Ours   CosyPose  CosyPose  Ours
Input Modality        RGB       RGBD      RGB    RGB       RGBD      RGB    RGB       RGBD      RGB    RGB       RGBD      RGB
Tube Fitting          51.6      80.8      86.1   66.2      86.7      88.7   42.6      64.7      77.9   72.1      76.5      85.2
Chrome Screw          39.7      66.1      64.9   58.6      75.3      67.8   60.0      62.9      67.1   72.9      84.3      81.4
Eye Bolt              36.5      77.0      78.4   62.1      87.8      83.8   26.5      58.8      79.4   55.9      91.1      91.1
Gear                  33.3      83.9      81.5   45.6      85.2      86.4   41.7      77.8      75.0   61.1      83.3      86.1
Zigzag                51.7      75.8      82.7   60.3      89.7      94.8   53.6      71.4      78.6   64.3      78.6      89.3
ALL                   42.6      76.7      78.7   58.6      84.9      84.3   44.9      67.1      75.6   65.3      82.8      86.6

TABLE I: 6D object pose estimation results on the Ensenso and RealSense test sets of the ROBI dataset, evaluated with the correct detection rate metric. There are a total of nine scenes for the Ensenso test set and four scenes for the RealSense test set.