6D Pose Estimation For Textureless Objects On RGB Frames Using Multi-View Optimization
Abstract— 6D pose estimation of textureless objects is a valuable but challenging task for many robotic applications. In this work, we propose a framework to address this challenge using only RGB images acquired from multiple viewpoints. The core idea of our approach is to decouple 6D pose estimation into separate 3D translation and 3D rotation estimation steps, each solved with a multi-view optimization formulation.

... boost the object pose estimation performance using only RGB images [23], [24], [25], [26], [27]. However, due to the scale, depth, and perspective ambiguities inherent to a single viewpoint, RGB-based solutions usually have low accuracy for the final estimated 6D poses.
This work was supported by Epson Canada Ltd.
*Jun Yang and Steven L. Waslander are with the University of Toronto Institute for Aerospace Studies and Robotics Institute. {jun.yang, steven.waslander}@robotics.utias.utoronto.ca
†Wenjie Xue and Sahar Ghavidel are with Epson Canada. {mark.xue, sahar.ghavidel}@ea.epson.com

II. RELATED WORK

A. Object Pose Estimation from a Single RGB Image
Fig. 1: An overview of the proposed multi-view object pose estimation pipeline with a two-step optimization formulation.
Many approaches have been presented in recent years to address the pose estimation problem for textureless objects with only RGB images. Due to the lack of appearance features, traditional methods usually tackle the problem via holistic template matching techniques [21], [34], [35], but are susceptible to scale changes and cluttered environments. More recently, deep learning techniques, such as convolutional neural networks (CNNs), have been employed to overcome these challenges. As two pioneering methods, SSD-6D [23] and PoseCNN [24] developed CNN architectures to estimate the 6D object pose from a single RGB image. In comparison, some recent works leverage CNNs to first predict 2D object keypoints [36], [37], [26] or dense 2D-3D correspondences [38], [39], [27], [40], and then compute the pose from the 2D-3D correspondences with a PnP algorithm [41]. Although these methods show good 2D detection results, the accuracy of the final 6D poses is generally low.

B. Object Pose Estimation from Multiple Viewpoints

Multi-view approaches aim to resolve the scale and depth ambiguities that commonly occur in the single-viewpoint setting and to improve the accuracy of the estimated poses. Traditional works utilize local features [42], [43] and cannot handle textureless objects. Recently, the multi-view object pose estimation problem has been revisited with neural networks. These approaches use an offline, batch-based optimization formulation, where all frames are given at once, to obtain a single consistent scene interpretation [44], [28], [20], [30]. Compared to batch-based methods, other works solve the multi-view pose estimation problem in an online manner. They estimate camera poses and object poses simultaneously, known as object-level SLAM [5], [6], [45], [7], or estimate object poses with known camera poses [1], [29], [31]. Although these methods show performance improvements with only RGB images, they still face difficulty in dealing with object scales, rotational symmetries, and measurement uncertainties.

With the per-frame neural network predictions as measurements, our work resolves the depth and scale ambiguities via a decoupled formulation. It also explicitly handles rotational symmetries and measurement uncertainties within an incremental online framework.

III. APPROACH OVERVIEW AND PROBLEM FORMULATION

Given the 3D object model and multi-view images, the goal of 6D object pose estimation is to estimate the rigid transformation T_wo ∈ SE(3) from the object model frame O to a global (world) frame W. We assume that the camera poses T_wc ∈ SE(3) with respect to the world frame are known. They can be obtained from robot forward kinematics and eye-in-hand calibration when the camera is mounted on the end-effector of a robotic arm [46], or from off-the-shelf SLAM methods for a hand-held camera [47], [48].

Given measurements Z_1:k up to viewpoint k, we aim to estimate the posterior distribution of the 6D object pose P(R_wo, t_wo | Z_1:k). Direct computation of this distribution is generally not feasible since the object translation t_wo and rotation R_wo have distinct distributions. Specifically, the translation distribution P(t_wo) is straightforward and expected to be unimodal. In contrast, the distribution of the object rotation P(R_wo) is less obvious due to complex uncertainties arising from shape symmetries, appearance ambiguities, and possible occlusions. Inspired by [29], we decouple the pose posterior P(R_wo, t_wo | Z_1:k) into:

P(R_wo, t_wo | Z_1:k) = P(R_wo | Z_1:k, t_wo) P(t_wo | Z_1:k)    (1)

where P(t_wo | Z_1:k) can be formulated as a unimodal Gaussian distribution N(t_wo | µ, Σ), and P(R_wo | Z_1:k, t_wo) is the rotation distribution conditioned on the input images Z_1:k and the 3D translation t_wo. To represent the complex rotation uncertainties, similar to [42], we formulate P(R_wo | Z_1:k, t_wo) as a mixture of Gaussians:

P(R_wo | Z_1:k, t_wo) = Σ_{i=1}^{N} w_i N(R_wo | µ_i, Σ_i)    (2)

which consists of N Gaussian components. The coefficient w_i denotes the weight of the i-th mixture component, and µ_i and Σ_i are its mean and covariance, respectively.
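To make the decoupled belief of Equations 1 and 2 concrete, the following Python sketch shows one possible in-memory representation: a single Gaussian over the translation and a weighted list of Gaussian components over the rotation. This is an illustration only, not the authors' implementation, and all class and attribute names are our own.

import numpy as np
from dataclasses import dataclass, field

@dataclass
class TranslationBelief:
    # Unimodal Gaussian over the 3D translation t_wo, as in Equation 1.
    mean: np.ndarray = field(default_factory=lambda: np.zeros(3))
    cov: np.ndarray = field(default_factory=lambda: np.eye(3))

@dataclass
class RotationComponent:
    # One Gaussian component of the rotation mixture in Equation 2:
    # 'mean' is a 3x3 rotation matrix, 'cov' is a 3x3 covariance in the
    # tangent space so(3), and 'weight' is the mixture coefficient w_i.
    mean: np.ndarray
    cov: np.ndarray
    weight: float

@dataclass
class ObjectPoseBelief:
    translation: TranslationBelief
    rotation_mixture: list  # list of RotationComponent

    def best_rotation(self):
        # Point estimate of R_wo: the mean of the highest-weight component.
        return max(self.rotation_mixture, key=lambda c: c.weight).mean

# Example: a belief with a single rotation hypothesis at the identity.
belief = ObjectPoseBelief(TranslationBelief(),
                          [RotationComponent(np.eye(3), 0.01 * np.eye(3), 1.0)])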
Fig. 2: Illustration of the object, world, and camera coordinates. The 3D translation t_wo is the coordinate of the object model origin in the world frame. We can estimate the translation by localizing the per-frame 2D center of the object, u_k, and minimizing the re-projection errors with known camera poses T_wc,k.

Our proposed decoupling formulation implies a useful correlation between translation and rotation in the image domain. The 3D translation estimate t_wo is independent of the object's rotation and encodes the center and scale information of the object. By applying the camera pose T_wc,k at frame k, the estimated 3D translation t_co,k in the camera frame provides the scale and 2D center of the object in the image. Based on it, the per-frame object rotation measurement R_co,k can be estimated from the object's visual appearance in the image. With this formulation, our multi-view framework comprises two main steps, summarized in Figure 1. In the first step (Section IV), we estimate the 3D translation t_wo by integrating the per-frame neural network outputs into an optimization formulation. The network outputs the segmentation mask and the 2D projection u_k of the object's 3D center, and the object's 3D translation t_wo is estimated by minimizing the 2D re-projection error across views. Given the estimated 3D translation t_wo and segmentation mask, in the second step (Section V), we re-crop a rotation-independent Region of Interest (RoI) for each object with the estimated scale. We then feed the RoI into a rotation estimator to get the per-frame 3D rotation measurement R_co,k. The final object rotation R_wo is obtained by an optimization approach with explicit handling of shape symmetries and a max-mixture formulation [32] to counteract measurement uncertainties.

IV. 3D TRANSLATION ESTIMATION

As illustrated in Figure 2, the 3D translation t_wo is the coordinate of the object model origin in the world frame. Since the camera pose T_wc is known, it is equivalent to solve for the translation from the object model origin to the camera optical center, t_co = [t_x, t_y, t_z]^T. Given an RGB image from an arbitrary camera viewpoint, the translation t_co can be recovered by the following back-projection, assuming a pinhole camera model:

[t_x, t_y, t_z]^T = [ (u_x − c_x) t_z / f_x,  (u_y − c_y) t_z / f_y,  t_z ]^T    (3)

where f_x and f_y denote the camera focal lengths and [c_x, c_y]^T is the principal point. We define u = [u_x, u_y]^T as the projection of the object model origin O and call it the 2D center of the object in the rest of the paper. We can see that if we can localize the object center u in the image and estimate the depth t_z, then t_co (or t_wo) is solved. In our framework, we predict the per-frame 2D object center u with a neural network and estimate the depth t_z with a multi-view optimization formulation.
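As a small, self-contained illustration of the back-projection in Equation 3 (a generic pinhole-camera sketch under our own function and parameter names, not code from the paper), the camera-frame translation can be recovered from the 2D center and a depth hypothesis as follows.

import numpy as np

def back_project_center(u, t_z, fx, fy, cx, cy):
    # Recover t_co = [t_x, t_y, t_z] from the 2D object center u = (u_x, u_y)
    # and a depth hypothesis t_z, assuming a pinhole camera model (Equation 3).
    u_x, u_y = u
    t_x = (u_x - cx) * t_z / fx
    t_y = (u_y - cy) * t_z / fy
    return np.array([t_x, t_y, t_z])

# Example with hypothetical intrinsics: the principal point maps onto the
# optical axis, so the recovered translation is [0, 0, t_z].
t_co = back_project_center((320.0, 240.0), 0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)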
Our per-frame 2D object center localization network is shown in the upper part of Figure 1. Our network architecture is based on PVNet [26]. To deal with multiple instances in the scene, we first use the off-the-shelf YOLOv5 detector [49] to obtain 2D bounding boxes of the objects. The detections are then cropped, resized to 128x128, and fed into the network. The network predicts pixel-wise binary labels and a 2D vector field pointing towards the object center. A RANSAC-based voting scheme is finally utilized to estimate the mean u_k and covariance Σ_k of the object center at frame k. For more details of the object center localization prediction, we refer the reader to [26].
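The following sketch illustrates the idea behind RANSAC-based center voting from a pixel-wise vector field. It is a simplified stand-in for the PVNet-style scheme cited above; the sampling strategy, inlier test, and covariance estimate here are our own, not the procedure of [26].

import numpy as np

def ransac_vote_center(pixels, dirs, n_hyp=128, cos_thresh=0.99, rng=None):
    # Estimate the 2D object center (mean and covariance) from a vector field.
    # pixels: (N, 2) array of pixel coordinates inside the object mask.
    # dirs:   (N, 2) array of unit vectors at those pixels pointing to the center.
    rng = np.random.default_rng() if rng is None else rng
    hyps, scores = [], []
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        A = np.stack([dirs[i], -dirs[j]], axis=1)        # 2x2 system of two rays
        if abs(np.linalg.det(A)) < 1e-6:                 # skip near-parallel rays
            continue
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        c = pixels[i] + s * dirs[i]                      # candidate center
        v = c[None, :] - pixels                          # vectors pixel -> candidate
        v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-9)
        votes = np.sum(v * dirs, axis=1) > cos_thresh    # pixels agreeing with c
        hyps.append(c)
        scores.append(votes.sum())
    if not hyps:
        raise ValueError("no valid center hypotheses")
    hyps, scores = np.array(hyps), np.array(scores)
    good = hyps[scores >= 0.8 * scores.max()]            # keep well-supported hypotheses
    mean = good.mean(axis=0)
    cov = np.cov(good, rowvar=False) + 1e-6 * np.eye(2)
    return mean, cov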
Given a sequence of measurements, we can estimate the object 3D translation t_wo with a maximum likelihood estimation (MLE) formulation. By assuming a uni-modal Gaussian error model, we solve it with a nonlinear least squares (NLLS) optimization approach. The optimization is formulated by creating measurement residuals that constrain the object translation t_wo with the object center u_k, its covariance Σ_k, and the known camera pose T_wc,k at viewpoint k,

r_k(t_wo) = π( T_wc,k^{-1} t_wo ) − u_k    (4)

where π is the perspective projection function. The full problem becomes the minimization of the cost L across all viewpoints,

L = Σ_k ρ_H( r_k^T Σ_k^{-1} r_k )    (5)

where Σ_k is the covariance matrix estimated by the localization network for the object center u_k, and ρ_H is the Huber norm used to reduce the impact of outliers on the optimization. We initialize each object's translation t_wo from the first frame using the diagonal length of its 2D bounding box, similar to [23]. With the known camera pose T_wc,k, we perform object association based on epipolar geometry constraints and the estimated translations t_wo,1:k−1 up to viewpoint k−1. Detections that are not associated with any existing object are initialized as new objects.

We solve the NLLS problem (Equations 4 and 5) with an iterative Gauss-Newton procedure:

J_t_wo^T Σ_z^{-1} J_t_wo δt_wo = J_t_wo^T Σ_z^{-1} r(t_wo)    (6)
where Σ_z is the stacked measurement covariance matrix up to the current frame, obtained from the object center localization network (upper part of Figure 1). The overall Jacobian, J_t_wo, is stacked from the per-frame Jacobian matrices J_t_wo,k.
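A compact sketch of the multi-view translation refinement in Equations 4 to 6 is given below. It uses a plain Gauss-Newton loop with a simple Huber-style re-weighting standing in for ρ_H; the helper names, the intrinsics tuple, and the loop details are our own assumptions rather than the authors' exact solver.

import numpy as np

def project(p_c, fx, fy, cx, cy):
    # Perspective projection pi of a camera-frame point onto the image plane.
    x, y, z = p_c
    return np.array([fx * x / z + cx, fy * y / z + cy])

def proj_jacobian(p_c, fx, fy, cx, cy):
    # Jacobian of the projection with respect to the camera-frame point.
    x, y, z = p_c
    return np.array([[fx / z, 0.0, -fx * x / z**2],
                     [0.0, fy / z, -fy * y / z**2]])

def refine_translation(t_wo, T_wc_list, centers, covs, intrinsics,
                       iters=10, huber_delta=3.0):
    # Gauss-Newton refinement of the object translation t_wo (Equations 4-6).
    # T_wc_list: list of 4x4 camera-to-world poses T_wc,k.
    # centers:   list of measured 2D object centers u_k.
    # covs:      list of 2x2 measurement covariances Sigma_k.
    fx, fy, cx, cy = intrinsics
    t = np.asarray(t_wo, dtype=float).copy()
    for _ in range(iters):
        H = np.zeros((3, 3)); b = np.zeros(3)
        for T_wc, u, S in zip(T_wc_list, centers, covs):
            R_cw = T_wc[:3, :3].T
            p_c = R_cw @ (t - T_wc[:3, 3])            # object origin in camera frame
            r = project(p_c, fx, fy, cx, cy) - u      # residual of Equation 4
            J = proj_jacobian(p_c, fx, fy, cx, cy) @ R_cw
            S_inv = np.linalg.inv(S)
            # Huber-style down-weighting of large whitened residuals (rho_H in Eq. 5).
            e = np.sqrt(r @ S_inv @ r)
            w = 1.0 if e <= huber_delta else huber_delta / e
            H += w * J.T @ S_inv @ J
            b += w * J.T @ S_inv @ r
        delta = np.linalg.solve(H, -b)                # normal equations of Equation 6
        t += delta
        if np.linalg.norm(delta) < 1e-6:
            break
    return t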
Fig. 3: (a) The inference of the object RoI size l_s from the projective ratio l_s = (z_r / z_s) l_r. (b) Left: the rendered object template at a canonical distance. Middle: incorrect rotation estimates due to the scale change. Right: re-cropped object RoI using the translation estimate, leading to the correct result.

V. 3D ROTATION ESTIMATION

The procedure for estimating the object rotation R_wo is shown in the lower part of Figure 1. We first adopt a template-matching (TM) based approach, LINE-2D [21], to obtain the per-frame rotation measurement R_co,k. The acquired measurements from multiple viewpoints are then integrated into an optimization scheme. We handle rotational symmetries explicitly given the object CAD model. To counteract measurement uncertainties (e.g., from appearance ambiguities), a max-mixture formulation [32] is used to recover a globally consistent set of object pose estimates. Note that the acquisition of the rotation measurement R_co,k is not limited to LINE-2D [21] or TM-based approaches and can be superseded by other holistic methods [50], [34], [23], [25].
A. Per-Frame Rotation Measurement

Given the object 3D model, LINE-2D renders object templates from a view sphere in the offline training stage (bottom middle in Figure 1). At run-time, it utilizes the gradient response of the input RGB or grayscale image for template matching, and a confidence score is provided based on the matching quality. In general, the TM-based approach suffers from scale changes, and the object templates need to be generated at multiple distances and scales. In our work, instead of training multi-scale templates, which increases the run-time complexity, we fix the 3D translation to a canonical centroid distance t_r = [0, 0, z_r]. At run-time, given the 3D translation hypothesis t_co = [x_s, y_s, z_s] from the object origin to the camera center (obtained from t_wo and the camera pose T_wc), we can re-crop the RoI from the image. The RoI size l_s is determined by l_s = (z_r / z_s) l_r, where l_r and z_r are the RoI size and canonical distance at training time, respectively. This process is illustrated in Figure 3a. Note that the RoI is a square region and is independent of the object's rotation. To further reduce the gap between the rendered templates and RoI images, we take the segmentation mask from the object center localization network (upper part of Figure 1) and then feed the re-cropped object RoI into the LINE-2D estimator to get a per-frame rotation measurement R_co,k, as shown in Figure 3b.
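The square RoI re-cropping described above can be sketched as follows. This is an assumption-laden illustration: plain NumPy cropping with nearest-neighbour resizing, our own function names, and a caller-supplied output size; it is not the paper's actual cropping code.

import numpy as np

def recrop_square_roi(image, t_co, intrinsics, l_r, z_r, out_size=128):
    # Re-crop a rotation-independent square RoI around the projected object
    # center, with side length l_s = (z_r / z_s) * l_r (Section V-A).
    # t_co: translation hypothesis [x_s, y_s, z_s] in the camera frame.
    # l_r, z_r: RoI size (pixels) and canonical distance used at training time.
    fx, fy, cx, cy = intrinsics
    x_s, y_s, z_s = t_co
    u = fx * x_s / z_s + cx                   # projected 2D object center
    v = fy * y_s / z_s + cy
    l_s = (z_r / z_s) * l_r                   # scale the RoI by the depth ratio
    half = l_s / 2.0
    h, w = image.shape[:2]
    x0, x1 = max(int(round(u - half)), 0), min(int(round(u + half)), w)
    y0, y1 = max(int(round(v - half)), 0), min(int(round(v + half)), h)
    crop = image[y0:y1, x0:x1]
    assert crop.size > 0, "object RoI falls outside the image"
    # Nearest-neighbour resize of the crop to a canonical template size.
    ys = np.linspace(0, crop.shape[0] - 1, out_size).round().astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_size).round().astype(int)
    return crop[np.ix_(ys, xs)]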
B. Optimization formulation

Generally, estimating the object 3D rotation from a sequence of measurements can also be formulated as an MLE problem:

X̂ = argmax_X Π_k p(z_k | X)    (7)

where X denotes the object 3D rotation R_wo to be estimated. The measurement z_k here is the object's rotation with respect to the camera frame, R_co,k, obtained as described in Section V-A. The measurement model is a function of the camera pose (rotation part) R_wc,k and the object rotation R_wo in the world frame:

h(R_wo, R_wc,k) = R_wc,k^{-1} R_wo    (8)

We formulate the optimization problem by creating the residual between R_wo and the per-frame measurement R_co,k:

r_k(R_wo) = log( R_co,k h(R_wo, R_wc,k)^{-1} )^∨    (9)

where r_k(R_wo) is expressed in the Lie algebra so(3). To handle rotational symmetries, we consider them explicitly together with the measurement R_co,k in Equation 9. Generally, when an object has symmetries, there exists a set of rotations that leave the object's appearance unchanged:

S(R_co) = { R'_co ∈ SO(3)  s.t.  G(R_co) = G(R'_co) }    (10)

where G(R_co) is the rendered image of the object under rotation R_co (assuming the same object translation). We can then update the measurement R_co,k in Equation 9 to R̄_co,k:

R̄_co,k = argmin_{R'_co,k ∈ S(R_co,k)} ‖ log( R'_co,k h(R_wo, R_wc,k)^{-1} )^∨ ‖    (11)

where ‖·‖ denotes the absolute angle of a 3D rotation vector φ, and R̄_co,k is the updated rotation measurement that has the minimal loss relative to R_wo.
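To illustrate the symmetry handling of Equations 9 to 11, the sketch below selects, from a discrete set of symmetry rotations, the measurement candidate closest to the current rotation estimate. The enumeration of S(R_co) as {R_co · S} for object-frame symmetries S, and all function names, are our assumptions.

import numpy as np

def so3_log(R):
    # Logarithm map SO(3) -> so(3), returned as a 3-vector (axis * angle).
    cos_t = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def rotation_residual(R_wo, R_wc, R_co_meas):
    # Residual of Equation 9: log( R_co_meas * h(R_wo, R_wc)^-1 )^vee,
    # with h(R_wo, R_wc) = R_wc^-1 R_wo (Equation 8).
    h = R_wc.T @ R_wo
    return so3_log(R_co_meas @ h.T)

def symmetry_aware_measurement(R_co_meas, sym_rotations, R_wo, R_wc):
    # Replace the raw measurement by the symmetry-equivalent rotation that is
    # closest in angle to the current estimate (Equations 10 and 11).
    best, best_angle = R_co_meas, np.inf
    for S in sym_rotations:
        cand = R_co_meas @ S
        angle = np.linalg.norm(rotation_residual(R_wo, R_wc, cand))
        if angle < best_angle:
            best, best_angle = cand, angle
    return best, best_angle

# Example: an object with a 180-degree symmetry about its z-axis.
# Rz = np.diag([-1.0, -1.0, 1.0]); sym_rotations = [np.eye(3), Rz]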
C. Measurement ambiguities

Due to complex uncertainties, such uni-modal estimates are still not sufficient to adequately represent the rotation uncertainties. To this end, we now consider a sum-mixture of Gaussians as the likelihood function:

p(z̄_k | X) = Σ_{i=1}^{N} w_i N(µ_i, Σ_i)    (12)

where z̄_k is the updated measurement (using Equation 11), each N(µ_i, Σ_i) represents a distinct Gaussian distribution, and w_i is the weight of component i.

Fig. 4: Max-mixtures for processing the rotation measurements. Note that we show the distribution on only one axis for demonstration purposes. (a) Acquired rotation measurements from different viewpoints. (b) Mixture distribution after two viewpoints. (c) Mixture distribution after five viewpoints.

The problem with a sum-mixture is that the MLE solution is no longer simple and falls outside the scope of common NLLS optimization approaches. Instead, we consider the max-marginal and solve the problem with the following max-mixture formulation [32]:

p(z̄_k | X) = max_{i=1:N} w_i N(µ_i, Σ_i)    (13)

The max operator acts as a selector, keeping the problem a common NLLS optimization. Note that the max-mixture does not make a permanent choice: in each iteration of the optimization, only one of the Gaussian components is selected and optimized. In particular, given a new rotation measurement R̄_co,k at frame k, we evaluate each Gaussian component in Equation 13 by computing the absolute rotation angle error θ_k,i between R̄_co,k and h(R_wo,i, R_wc,k),

θ_k,i = ‖ log( R̄_co,k h(R_wo,i, R_wc,k)^{-1} )^∨ ‖    (14)

and select the one with the minimal angle error. To reduce the impact of outliers, the selected Gaussian component accepts a rotation measurement only if the rotation angle error θ_k,i is less than a pre-defined threshold (30° in our implementation). If the measurement R̄_co,k is not accepted by any Gaussian component, it is considered a new component and added to the current Gaussian-mixture model. We optimize the object rotation R_wo within each component by operating on the tangent space so(3):

J_φ_wo^T Λ_z J_φ_wo δφ_wo = J_φ_wo^T Λ_z r(R_wo)    (15)

where r(R_wo) and J_φ_wo are the stacked rotation residual vector and Jacobian matrix, respectively, across multiple viewpoints. We approximate the weight matrix Λ_z by placing the LINE-2D confidence scores on its diagonal elements.

To compute the weight w_i for each Gaussian component, we accumulate the LINE-2D confidence scores c_i from the rotation measurements within each individual component across the viewpoints. The weight can then be approximated as w_i = c_i / Σ_i c_i. This processing is illustrated in Figure 4. Given the measurements from two viewpoints, the object rotation distribution P(R_wo) is represented with two Gaussian components (green and red) with similar weights. A third component (yellow) is added when observing more viewpoints. As a result of receiving more rotation measurements (after five viewpoints), the weight of the correct component (green) becomes higher than that of the false hypotheses.
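The max-mixture bookkeeping described in this subsection (component selection by absolute angle error, the 30° gate, creation of new components, and confidence-based weights) can be sketched as follows. The data structure and update rule are our own simplified illustration, and the per-component re-optimization of Equation 15 is only indicated by a comment.

import numpy as np

class RotationMaxMixture:
    # Minimal max-mixture bookkeeping for rotation components (Section V-C).
    # Each component stores a world-frame rotation hypothesis R_wo and the
    # accumulated LINE-2D confidence used for the weight w_i = c_i / sum(c).
    def __init__(self, angle_thresh_deg=30.0):
        self.components = []          # list of dicts: {"R_wo": 3x3, "conf": float}
        self.angle_thresh = np.deg2rad(angle_thresh_deg)

    def add_measurement(self, R_co_meas, R_wc, confidence):
        # World-frame rotation implied by this single-view measurement.
        R_wo_meas = R_wc @ R_co_meas
        # Select the component with the minimal absolute angle error (Equation 14).
        best_i, best_angle = None, np.inf
        for i, comp in enumerate(self.components):
            R_rel = R_wo_meas @ comp["R_wo"].T
            angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
            if angle < best_angle:
                best_i, best_angle = i, angle
        if best_i is not None and best_angle < self.angle_thresh:
            # Accept the measurement: accumulate confidence; in the full method
            # the component mean is then re-optimized on so(3) via Equation 15.
            self.components[best_i]["conf"] += confidence
        else:
            # Otherwise spawn a new Gaussian component.
            self.components.append({"R_wo": R_wo_meas, "conf": confidence})

    def weights(self):
        c = np.array([comp["conf"] for comp in self.components])
        return c / c.sum()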
Fig. 5: Examples of (a) the generated synthetic data and (b) real images from the RealSense camera.

VI. EXPERIMENTS

A. Datasets, Baselines and Evaluation Metrics

We evaluate our framework on the recently released ROBI dataset [33]. It provides multiple camera viewpoints and ground-truth 6D poses for textureless, reflective industrial parts. The objects were placed in challenging bin scenarios and captured with two sensors: a high-cost Ensenso camera and a low-cost RealSense camera. For network training purposes, we generate 80,000 synthetic images using the Blender software [51] with the Bullet physics engine [52] and train our object center localization network on synthetic data only. Figure 5 presents some examples of our generated synthetic images. We picked five textureless objects and evaluated on both the Ensenso and RealSense test sets.

Quantitatively, we compare our approach with CosyPose [28], a state-of-the-art multi-view pose fusion solution that takes the object pose estimates from individual viewpoints as input and optimizes the overall scene consistency. Note that CosyPose is an offline, batch-based solution that is agnostic to any particular pose estimator. For a fair comparison, we use the same pose estimator (LINE-2D template matching with the same bounding boxes, object centers, and segmentation masks from our object center localization network), and we additionally provide CosyPose with the known camera poses. To feed reliable single-view estimates to CosyPose, we use two strategies to obtain the scale information for the LINE-2D pose estimator. For the first strategy, we generate templates at multiple distances at training time (9 in our experiments) and perform standard template matching.
                      Ensenso, 4 Views           Ensenso, 8 Views           RealSense, 4 Views         RealSense, 8 Views
Objects               CosyPose  CosyPose  Ours   CosyPose  CosyPose  Ours   CosyPose  CosyPose  Ours   CosyPose  CosyPose  Ours
Input Modality        RGB       RGBD      RGB    RGB       RGBD      RGB    RGB       RGBD      RGB    RGB       RGBD      RGB
Tube Fitting          51.6      80.8      86.1   66.2      86.7      88.7   42.6      64.7      77.9   72.1      76.5      85.2
Chrome Screw          39.7      66.1      64.9   58.6      75.3      67.8   60.0      62.9      67.1   72.9      84.3      81.4
Eye Bolt              36.5      77.0      78.4   62.1      87.8      83.8   26.5      58.8      79.4   55.9      91.1      91.1
Gear                  33.3      83.9      81.5   45.6      85.2      86.4   41.7      77.8      75.0   61.1      83.3      86.1
Zigzag                51.7      75.8      82.7   60.3      89.7      94.8   53.6      71.4      78.6   64.3      78.6      89.3
ALL                   42.6      76.7      78.7   58.6      84.9      84.3   44.9      67.1      75.6   65.3      82.8      86.6

TABLE I: 6D object pose estimation results on the Ensenso and RealSense test sets of the ROBI dataset, evaluated with the correct detection rate metric. There are a total of nine scenes for the Ensenso test set and four scenes for the RealSense test set.