AirLab Summer School Series:
Visual Odometry Tutorial
Yafei Hu
July 07, 2020
Part 1
Fundamentals of Computer Vision
1. Pinhole camera projection model
Camera projection is a transformation (mapping) between the 3D world and the 2D image.
This mapping is described as x = P X, where
x: 2D image point, P: 3x4 projection matrix, X: 3D world point
The projection consists of two parts:
world coordinate system (3D) → camera coordinate system (3D) → image coordinate system (2D)
1. Pinhole camera projection model
[R, t] transforms a 3D point in the world coordinate system, denoted X_W,
to the camera coordinate system, denoted X_C: X_C = R X_W + t
In homogeneous coordinates: X̃_C = T X̃_W, with T = [R, t; 0^T, 1] (a 4 × 4 matrix)
1. Pinhole camera projection model
Does this [R, t] mean the rotation and translation of the camera itself?
In homogeneous coordinates: X̃_C = T X̃_W, T = [R, t; 0^T, 1]
1. Pinhole camera projection model
e.g. no rotation, just pure translation:
[Figure: a 3D point, the camera center C, and the vectors X_W and X_C]
X_C = X_W - C
C is the translation of the camera relative to the world coordinate system, so the translation in the transform is t = -C rather than the camera's own translation.
1. Pinhole camera projection model
The real rotation and translation of the camera are given by the inverse of the transformation matrix T:
T^{-1} = [R^T, -R^T t; 0^T, 1]
where R^T is the rotation and C = -R^T t is the translation of the camera,
relative to the world coordinate system.
1. Pinhole camera projection model
An object is projected onto the image plane through a pinhole camera:
[Figure: pinhole camera with the image plane behind the pinhole and the virtual image plane in front]
1. Pinhole camera projection model
Now let's see how a 3D point in the camera coordinate system, denoted X_C, is projected into the image coordinate system (image plane), denoted x_im.
[Figure: camera coordinate axes, the image plane, and the projection of X_C onto x_im]
C is the camera center (or optical center) and p = (c_x, c_y) is the principal point.
1. Pinhole camera projection model
Take one plane of the camera coordinate system as an example:
X_C = [X, Y, Z]^T
[Figure: side view showing the similar triangles formed by X_C, the camera center, and its projection x_im on the image plane at focal length f]
1. Pinhole camera projection model
What is the coordinate x in the image coordinate system?
(Hint: use the properties of similar triangles.)
x = f X / Z,  y = f Y / Z
which is a mapping from 3D Euclidean space to 2D Euclidean space.
Is it linear?
What does this remind you of?
1. Pinhole camera projection model
What is the coordinate x in the image coordinate system?
(Hint: use the properties of similar triangles.)
x = f X / Z,  y = f Y / Z
which is a mapping from 3D Euclidean space to 2D Euclidean space.
Is it linear? NO! (It divides by the depth Z.)
What does this remind you of? Homogeneous coordinates!
1. Pinhole camera projection model
Considering the principal point p, whose coordinate in the 2D image coordinate system is [c_x, c_y]:
x = f X / Z + c_x,  y = f Y / Z + c_y
This gives us the coordinate of a projected 3D point in the 2D image coordinate system.
How do we write this mapping in matrix form?
1. Pinhole camera projection model
In homogeneous coordinates, Z [x, y, 1]^T = K X_C, with
K = [f, 0, c_x; 0, f, c_y; 0, 0, 1]
(more generally f_x and f_y may differ).
K is called the camera intrinsic matrix, the camera intrinsic parameters, or the calibration matrix.
1. Pinhole camera projection model
Combining with the 3D transformation: Z x̃ = K (R X_W + t)
Sometimes we write it in this form: x̃ ~ K [R | t] X̃_W = P X̃_W
[R | t] is called the camera extrinsic parameters.
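To make the full projection pipeline concrete, here is a minimal NumPy sketch (my own illustration, not code from the tutorial; the K, R, t values are made up):

import numpy as np

# Example intrinsics and extrinsics (made-up values for illustration)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                      # no rotation
t = np.array([0.0, 0.0, 1.0])      # camera shifted along the optical axis

def project(X_w, K, R, t):
    """Project a 3D world point to pixel coordinates: x ~ K (R X_w + t)."""
    X_c = R @ X_w + t              # world -> camera coordinates
    x_h = K @ X_c                  # camera -> homogeneous image coordinates
    return x_h[:2] / x_h[2]        # divide by the depth Z

print(project(np.array([0.1, -0.2, 4.0]), K, R, t))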
2. Epipolar constraints
2.1 Epipolar line
● The 3D point X projected to x must lie on the ray that passes through the first camera's optical center and x
● The projection of this ray onto the second image forms a line, called the epipolar line
● x and x' form a correspondence
[Figure: 3D point X, its projection x on image plane 1, and its projection x' on the epipolar line in image plane 2]
2. Epipolar constraints
2.2 Epipoles
Now consider a number of points in the first image.
Each point x_1, x_2 and x_3 is associated with a ray; the corresponding epipolar lines in the second image all intersect at a single point, the epipole.
[Figure: image planes 1 and 2, with the rays through x_1, x_2, x_3 and the epipolar lines converging at the epipole]
2. Epipolar constraints
Special cases of epipoles:
(1) Camera movement is a pure translation perpendicular to the optical axis (parallel to the
image plane)
The epipolar lines are parallel and the epipole is at infinity
2. Epipolar constraints
Special cases of epipoles:
(2) Camera movement is a pure translation along the optical axis
The epipoles have the same coordinates in both images. Epipolar lines form a radial pattern.
2. Epipolar constraints
2.3 Epipolar plane
The optical centers C and C', the 3D point X, and its projections on the images, x and x', lie in a
common plane π, called the epipolar plane.
2. Epipolar constraints
2.4 Baseline
The camera baseline is the line joining the optical centers C and C'.
● Baseline intersects each image plane at the epipoles e and e’.
● Any plane π containing the baseline is an epipolar plane, and intersects the image planes in
corresponding epipolar lines l and l’.
● As the position of the 3D point X varies, the epipolar planes “rotate” about the baseline. This
family of planes is known as an epipolar pencil. All epipolar lines intersect at the epipole.
3. The Essential matrix
The essential matrix, denoted E, is a 3 × 3 matrix that encodes the epipolar constraint.
[Figure: two views with epipolar lines l, l' and epipoles e, e']
3. The Essential matrix
We define the essential matrix E as E = [t]_× R, where [t]_× is the skew-symmetric matrix of the translation t.
This also gives the epipolar constraint x'^T E x = 0, where x and x' are the matched points in normalized camera coordinates.
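For completeness, the standard derivation of this constraint (a textbook sketch in the slide's notation): with X' = R X + t, the vectors x', t and R x are coplanar, and the cross product can be written with the skew-symmetric matrix

[t]_\times = \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix},
\qquad
x'^\top (t \times R x) = x'^\top [t]_\times R\, x = x'^\top E\, x = 0 .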
4. The Fundamental matrix
Now we give the definition of the fundamental matrix, denoted F.
The fundamental matrix also encodes the epipolar constraint, but directly in pixel coordinates: x'^T F x = 0.
Relationship between the fundamental matrix F and the essential matrix E: F = K'^{-T} E K^{-1} (equivalently E = K'^T F K).
E is a special case of F, for calibrated cameras (normalized image coordinates).
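A one-line sketch of why this relationship holds (using normalized coordinates x̂ = K^{-1} x, x̂' = K'^{-1} x'):

\hat{x}'^\top E\, \hat{x} = 0 \;\Longrightarrow\; x'^\top K'^{-\top} E\, K^{-1} x = 0 \;\Longrightarrow\; F = K'^{-\top} E\, K^{-1} .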
4. The Fundamental matrix
More about the fundamental matrix.
Epipolar line: given an image point x in one view, the epipolar line in the other view is l' = F x.
According to the epipolar constraint, x'^T F x = 0.
Substituting the epipolar line, we have x'^T l' = 0.
What does this mean?
4. The Fundamental matrix
Equation of a line in the 2D image plane: a x + b y + c = 0,
sometimes represented as l = [a, b, c]^T.
If a point x' = [x, y, 1]^T lies on a line l', then x'^T l' = 0.
This shows the epipolar constraint from another perspective: x' lies on the epipolar line l' = F x.
4. The Fundamental matrix
What about the epipole?
The epipole e' also lies on every epipolar line: e'^T l' = 0.
With the definition of the epipolar line, l' = F x,
so we have e'^T F x = 0 for every x, which means e'^T F = 0.
Taking the transpose of both sides: F^T e' = 0 (and similarly F e = 0).
4. The Fundamental matrix
Given a set of matched points (x_i, x_i'), the relationship between F and the matches is x_i'^T F x_i = 0 for every correspondence.
The fundamental matrix can be computed with the 8-point algorithm (8 correspondences).
More recently, with known intrinsics, the relative pose (essential matrix) can be solved from only 5 correspondences [1].
[1] David Nistér, An Efficient Solution to the Five-Point Relative Pose Problem, PAMI 2004
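To make the 8-point algorithm concrete, here is a minimal NumPy sketch (my own illustration, not the tutorial's code); Hartley's coordinate normalization is omitted for brevity:

import numpy as np

def eight_point(x1, x2):
    """Minimal 8-point sketch: x1, x2 are (N, 2) matched pixel coordinates, N >= 8."""
    N = x1.shape[0]
    A = np.zeros((N, 9))
    for i, ((u, v), (up, vp)) in enumerate(zip(x1, x2)):
        # Each correspondence gives one linear constraint x2^T F x1 = 0
        A[i] = [up * u, up * v, up, vp * u, vp * v, vp, u, v, 1.0]
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)          # null vector of A, i.e. F up to scale
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                        # enforce the rank-2 constraint
    return U @ np.diag(S) @ Vt

Each correspondence contributes one row of the linear system, the null vector of A gives F up to scale, and the final SVD enforces the rank-2 constraint.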
5. RANSAC Algorithm
Recall feature matching:
Incorrectly matched feature points are called outliers.
We want fewer outliers!
5. RANSAC Algorithm
RANdom SAmple Consensus (RANSAC) algorithm in general:
Try to fit a line to some data that contains outliers.
1. Sample (randomly) the minimum number of points required to fit the model
2. Solve for the model parameters using the samples
3. Score by the fraction of inliers within a preset threshold of the model
Repeat 1-3 until the best model is found with high confidence.
[Figure: line-fitting iterations; the hypothesis with the largest number of inliers is kept]
5. RANSAC Algorithm
RANdom SAmple Consensus (RANSAC) algorithm for computing the fundamental matrix F:
1. In each iteration, randomly choose 8 point correspondences from all correspondences
2. Compute the fundamental matrix F from these 8 correspondences
3. Count the inliers: go through all N correspondences and keep those whose epipolar error satisfies
d(x_i', F x_i) ≤ threshold
4. Choose the F with the maximum number of inliers
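In practice this loop is usually delegated to a library; the sketch below (with synthetic correspondences and deliberately corrupted matches as outliers, all values made up) uses OpenCV's RANSAC-based estimator:

import cv2
import numpy as np

# Synthetic correspondences with outliers, then RANSAC-based F estimation
rng = np.random.default_rng(1)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
X = rng.uniform([-2, -2, 4], [2, 2, 10], size=(200, 3))        # random 3D points
R, _ = cv2.Rodrigues(np.array([[0.0], [0.05], [0.02]]))        # second-camera rotation
t = np.array([0.3, 0.0, 0.0])                                  # second-camera translation

def project(Xw, R, t):
    Xc = Xw @ R.T + t
    x = Xc @ K.T
    return x[:, :2] / x[:, 2:]

pts1 = project(X, np.eye(3), np.zeros(3))
pts2 = project(X, R, t)
pts2[:40] += rng.uniform(-80, 80, size=(40, 2))                # corrupt 20% of the matches

# RANSAC: repeatedly fit F from 8 random correspondences, keep the F with the most inliers
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
print("estimated inliers:", int(inlier_mask.sum()), "out of", len(pts1))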
6. Solve camera pose from Essential matrix
Take the SVD of the essential matrix: E = U Σ V^T, with Σ = diag(σ, σ, 0).
The essential matrix must have two equal singular values and one zero singular value.
Define the matrix W = [0, -1, 0; 1, 0, 0; 0, 0, 1].
The translation t and rotation R can then be computed as t = ±u_3 (the third column of U) and R = U W V^T or R = U W^T V^T.
Of the four (R, t) combinations, the valid one places the triangulated points in front of both cameras.
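This decomposition, including the check that selects the valid (R, t) pair, is also available as a library routine; a self-contained OpenCV sketch on synthetic data (all values made up):

import cv2
import numpy as np

# Project random 3D points into two views, then recover the relative pose from E
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(100, 3))     # 3D points in front of the first camera
R_gt, _ = cv2.Rodrigues(np.array([[0.0], [0.1], [0.0]]))   # small rotation about the y axis
t_gt = np.array([0.5, 0.0, 0.0])                           # translation along x

def project(Xw, R, t):
    Xc = Xw @ R.T + t
    x = Xc @ K.T
    return x[:, :2] / x[:, 2:]

pts1 = project(X, np.eye(3), np.zeros(3))
pts2 = project(X, R_gt, t_gt)

E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
# recoverPose tests the four (R, t) candidates from the SVD of E and keeps the one
# that places the triangulated points in front of both cameras
_, R_est, t_est, _ = cv2.recoverPose(E, pts1, pts2, K)
print("recovered R:\n", R_est, "\nrecovered t (unit norm, scale is unobservable):\n", t_est)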
7. Feature Detector and Descriptor
Introduction
● Mainly about point features
● Point features can be used to find a sparse set of correspondences between different images
What kinds of feature points might one use to establish a set of correspondences between these images?
[Figure: two pairs of images to be matched]
7. Feature Detector and Descriptor
Introduction
● Mainly about point features
● Point features can be used to find a sparse set of correspondences between different images
What kinds of feature points might one use to establish a set of correspondences between these images?
Corners and edges! These points are often called interest points.
[Figure: two pairs of images to be matched]
7. Feature Detector and Descriptor
Types of detectors and descriptors
Interest point detectors: Harris corner detector, Good Features to Track (Shi-Tomasi corner detector)
Gradient-based detectors and descriptors: SIFT, SURF
Binary detectors and/or descriptors: FAST (detector), BRIEF (descriptor),
ORB (Oriented FAST and Rotated BRIEF)
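A minimal OpenCV sketch of ORB detection and matching (my own illustration; synthetic images are used so the snippet runs stand-alone, in practice you would pass consecutive video frames):

import cv2
import numpy as np

# Two synthetic "frames": a textured image and a horizontally shifted copy
rng = np.random.default_rng(0)
img1 = rng.integers(0, 256, (480, 640), dtype=np.uint8)
img2 = np.roll(img1, shift=8, axis=1)

orb = cv2.ORB_create(nfeatures=2000)                 # Oriented FAST detector + Rotated BRIEF descriptor
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance; crossCheck keeps mutual best matches
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(kp1), "keypoints,", len(matches), "matches")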
8. Optical Flow and Lucas-Kanade Algorithm
8.1 Optical flow
Estimate the motion of each pixel between two consecutive image frames, based on two assumptions: 1. brightness constancy, 2. small motion.
What we want to estimate is the 2D motion vector of each pixel from time t to time t'.
8. Optical Flow and Lucas-Kanade Algorithm
8.1 Optical flow
Brightness constancy assumption:
The 2D projection of the same 3D point 'moves' across different frames.
The brightness constancy assumption tells us that the pixel intensity of these 2D projection points does not change:
I(x(t), y(t), t) = C, where C is a constant.
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Solve for the motion vectors:
According to the brightness constancy assumption, I(x + dx, y + dy, t + dt) = I(x, y, t),
where dx, dy and dt denote the changes in the x and y directions and in time t.
According to the small motion assumption, we can take the 1st-order Taylor expansion:
I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt
So we have (∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt = 0.
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Dividing by dt, we have
(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0
Let's explain what these variables mean:
∂I/∂x and ∂I/∂y are the gradients of the pixel intensity in the x and y directions, respectively;
dx/dt and dy/dt are the velocities of the pixel's motion in the x and y directions, respectively;
∂I/∂t is the gradient of the pixel intensity over time.
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Using a more compact notation for these variables:
I_x u + I_y v + I_t = 0
where I_x, I_y are the gradients of the pixel intensity, u and v the velocities in the x and y directions, and I_t the gradient of the pixel intensity over time.
In vector form we have [I_x, I_y] [u, v]^T = -I_t.
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
We don't use just one pixel; usually we use a patch and assume all pixels in the patch have the same motion.
e.g. for a w × w patch with n = w² pixels, pixel k in the patch gives
[I_x, I_y]_k [u, v]^T = -(I_t)_k
So for the n pixels we obtain n such equations, stacked into one linear system.
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
We define A as the n × 2 matrix whose k-th row is [(I_x)_k, (I_y)_k], and b as the n-vector with entries -(I_t)_k.
So we have A [u, v]^T = b.
What is the solution for the [u, v] vector?
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Two unknowns need two equations.
But what if we have more than two equations?
[u, v]^T = (A^T A)^{-1} A^T b
This is the least-squares solution of this overdetermined linear system,
and (A^T A)^{-1} A^T is called the pseudoinverse (Moore-Penrose inverse) of matrix A.
This algorithm is called the Lucas-Kanade algorithm [1]; a small sketch follows the reference below.
The Lucas-Kanade algorithm can be used to track keypoints in visual SLAM.
[1] Simon Baker and Iain Matthews, Lucas-Kanade 20 Years On: A Unifying Framework, in International Journal of Computer Vision 56(3),
221–255, 2004
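A minimal NumPy sketch of the per-patch least-squares solve derived above (my own illustration: single level, no pyramid, no iterative refinement); in practice one would use a pyramidal implementation such as OpenCV's calcOpticalFlowPyrLK:

import numpy as np

def lk_flow_patch(I0, I1, x, y, w=7):
    """Solve A [u, v]^T = b for one w x w patch centered at (x, y)."""
    h = w // 2
    P0 = I0[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    P1 = I1[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    Iy, Ix = np.gradient(P0)                           # spatial gradients of the patch
    It = P1 - P0                                       # temporal gradient
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)     # n x 2
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)     # pseudoinverse solution
    return u, v

# Tiny synthetic check: a smooth textured image shifted by one pixel to the right
ys, xs = np.mgrid[0:64, 0:64]
I0 = np.sin(xs / 3.0) + np.cos(ys / 4.0)
I1 = np.roll(I0, shift=1, axis=1)
print(lk_flow_patch(I0, I1, 32, 32, w=9))   # expect u close to 1, v close to 0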
9. Direct Methods for Camera Pose and Depth Estimation
Drawbacks of feature-based SLAM:
● Computing feature descriptors is time-consuming, especially for gradient-based features such as SIFT
● Features discard much useful information in the images
● Dense reconstruction is impossible from sparse features alone
Possible solutions:
● Use raw image pixels for matching after running a feature detector (semi-direct methods)
● Match raw pixels directly, without even a feature detector (direct methods)
9. Direct Methods for Camera Pose and Depth Estimation
Back to the two-view geometry figure:
[Figure: two-view geometry between a reference frame and a target frame]
9. Direct Methods for Camera Pose and Depth Estimation
For a 3D point X (taking the reference camera frame as the world frame):
its projection on the reference image frame is x_ref ~ K X.
Given its depth Z in the reference frame, the 3D point can also be expressed as X = Z K^{-1} x̃_ref.
Then project the 3D point into another frame with the estimated R, t and Z:
x_tgt ~ K (R Z K^{-1} x̃_ref + t)
If we do this for many 3D points and their projections, it forms an image: each pixel of the reference image is 'moved' to its new 2D position in the target image coordinates.
This process is called warping.
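A per-pixel sketch of this warp (my own illustration; the K, R, t and depth values are made up):

import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

def warp_pixel(u, v, Z, R, t):
    """Move one reference pixel (u, v) with depth Z into the target image:
    back-project with K^-1, transform with (R, t), re-project with K."""
    X_ref = Z * (K_inv @ np.array([u, v, 1.0]))   # 3D point in the reference camera frame
    X_tgt = R @ X_ref + t                          # same point in the target camera frame
    x = K @ X_tgt
    return x[:2] / x[2]                            # pixel position in the target image

R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])                      # example relative motion between the frames
print(warp_pixel(320.0, 240.0, 5.0, R, t))         # where the central pixel lands in the target view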
9. Direct Methods for Camera Pose and Depth Estimation
According to the brightness constancy assumption, if we had perfect estimates of R, t and Z, the warped image and the target image would be identical.
In practice they never are, so we define the following photometric loss:
E_photo = Σ_i || I_ref(i) - I_warp(i) ||²
where I_ref(i) and I_warp(i) are the pixel intensities of the two images.
By minimizing the photometric loss, we can optimize the estimated R, t and Z.
How do we minimize this loss?
This is a typical non-linear least-squares optimization problem;
standard solutions are the Gauss-Newton and Levenberg-Marquardt algorithms.
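As a reminder of the generic Gauss-Newton update for such a problem (a textbook sketch, not necessarily the slides' exact derivation), with r(ξ) the stacked photometric residuals as a function of the pose/depth parameters ξ and J = ∂r/∂ξ:

\delta\xi = -\left(J^\top J\right)^{-1} J^\top r, \qquad \xi \leftarrow \xi \oplus \delta\xi

Levenberg-Marquardt additionally damps the normal equations, replacing J^\top J with J^\top J + \lambda\,\mathrm{diag}(J^\top J).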
Part 2
Traditional Feature based and Direct VO/VSLAM
1. Feature based: ORB-SLAM
Overview of ORB-SLAM
[Figure: ORB-SLAM system overview; bundle adjustment is used for optimization]
Raúl Mur-Artal, J. M. M. Montiel and Juan D. Tardós, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, TRO 2015
1. Feature based: ORB-SLAM
Results
[Figures: feature matching under scale changes and with dynamic objects; results on the KITTI odometry dataset]
Raúl Mur-Artal, J. M. M. Montiel and Juan D. Tardós, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, TRO 2015
2. Dense Direct Method: DTAM
Depth estimation: a projective photometric cost volume is built by accumulating a pixel-level photometric error over many overlapping frames.
Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time, ICCV 2011
2. Dense Direct Method: DTAM
[Figure: plots of the single-pixel photometric functions ρ(u) and the resulting total data cost C(u)]
So even direct methods need texture!
Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time, ICCV 2011
2. Dense Direct Method: DTAM
Incremental cost volume construction:
[Figure: the current inverse depth map, extracted as the per-pixel minimum of the cost volume, as 2, 10 and 30 overlapping images are used in the data term (from left to right)]
Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time, ICCV 2011
2. Dense Direct Method: DTAM
Camera pose is tracked by minimizing a photometric error between the live image and the view synthesized from the dense model.
The whole DTAM system estimates depth and pose in an interleaved manner.
Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time, ICCV 2011
3. Semi-Dense Direct Method: LSD-SLAM
Framework
Jakob Engel, Thomas Schöps and Daniel Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV 2014
3. Semi-Dense Direct Method: LSD-SLAM
Weighted photometric loss: the photometric residuals are variance-normalized and evaluated with a robust (Huber) norm.
Jakob Engel, Thomas Schöps and Daniel Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV 2014
3. Semi-Dense Direct Method: LSD-SLAM
Qualitative results: mapping and semi-dense map estimation
Semi-dense means only using pixels with a sufficiently high image gradient (edges and well-textured regions).
Jakob Engel, Thomas Schöps and Daniel Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV 2014
Part 3
Learning based VO
1. Supervised method
1.1 PoseNet
Approach: camera pose regression with a CNN.
The camera pose is represented as a translation x and a rotation q in quaternion form.
With ground-truth supervision, the training loss (from the paper) is
loss = || x̂ - x ||_2 + β || q̂ - q / ||q|| ||_2
where β balances the translation and rotation errors.
The network architecture is modified from GoogLeNet.
Alex Kendall, Matthew Grimes, Roberto Cipolla, PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, in ICCV 2015
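A hedged PyTorch sketch of a PoseNet-style loss (my own illustration; the β value and tensor shapes are arbitrary):

import torch

def posenet_loss(x_pred, q_pred, x_gt, q_gt, beta=250.0):
    """L2 translation error plus a beta-weighted quaternion error,
    with the ground-truth quaternion normalized to unit length.
    beta is a scene-dependent hyperparameter (value here arbitrary)."""
    t_err = torch.norm(x_pred - x_gt, dim=-1)
    q_err = torch.norm(q_pred - q_gt / q_gt.norm(dim=-1, keepdim=True), dim=-1)
    return (t_err + beta * q_err).mean()

# Toy usage with random predictions and targets (batch of 4 poses)
x_pred, x_gt = torch.randn(4, 3), torch.randn(4, 3)
q_pred, q_gt = torch.randn(4, 4), torch.randn(4, 4)
print(posenet_loss(x_pred, q_pred, x_gt, q_gt))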
1. Supervised method
1.1 PoseNet
Experimental results:
Alex Kendall, Matthew Grimes, Roberto Cipolla, PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, in ICCV 2015
1. Supervised method
1.2 DeepVO
Network architecture
Sen Wang, Ronald Clark, Hongkai Wen, Niki Trigoni, DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional
Neural Networks, in ICRA 2017
1. Supervised method
1.2. DeepVO
Experimental Results
Camera pose estimation results on the KITTI VO dataset
Sen Wang, Ronald Clark, Hongkai Wen, Niki Trigoni, DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional
Neural Networks, in ICRA 2017
2. Self-supervised method
SfM-Learner
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe, Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017
2. Self-supervised method
SfM-Learner
Warping and photometric loss: each target pixel p_t is warped into the source view using the predicted depth and pose, p_s ~ K T̂_{t→s} D̂_t(p_t) K^{-1} p_t, and the view-synthesis (photometric) loss is L_vs = Σ_s Σ_p | I_t(p) - Î_s(p) |.
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe, Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017
2. Self-supervised method
SfM-Learner
Network structures:
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe, Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017
3. Hybrid method
DVSO
Overview
Nan Yang, Rui Wang, Jörg Stückler, and Daniel Cremers, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry, ECCV 2018
3. Hybrid method
DVSO
Depth estimation
Nan Yang, Rui Wang, Jörg Stückler, and Daniel Cremers, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry, ECCV 2018
3. Hybrid method
DVSO
Pose estimation on KITTI:
Nan Yang, Rui Wang, Jörg Stückler, and Daniel Cremers, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry, ECCV 2018
Recommended Open Source
Visual Odometry/SLAM Implementations
Tutorial Purpose: PySLAM: https://github.com/luigifreda/pyslam
Feature based: ORB-SLAM 2: https://github.com/raulmur/ORB_SLAM2
Direct Method: DSO: https://github.com/JakobEngel/dso
Learning based: SfM-Learner: https://github.com/ClementPinard/SfmLearner-Pytorch