
AirLab Summer School Series:

Visual Odometry Tutorial

Yafei Hu
July 07, 2020
Part 1

Fundamentals of Computer Vision


1. Pinhole camera projection model
Camera projection is a transformation (mapping) between the 3D world and the 2D image.

This mapping is described as x = PX, where x is a 2D image point, P is the projection matrix, and X is a 3D world point.

The projection consists of two parts:

world coordinate system (3D) → camera coordinate system (3D) → image coordinate system (2D)
1. Pinhole camera projection model
[R, t] transforms a 3D point in the world coordinate system, denoted X_W, to the camera coordinate system, denoted X_C:

X_C = R X_W + t

or, in homogeneous coordinates:

[X_C; 1] = [R t; 0 1] [X_W; 1] = T [X_W; 1]
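As an illustration only (not from the slides), a minimal NumPy sketch of this world-to-camera transform in homogeneous coordinates, with made-up example rotation and translation values:

import numpy as np

# Assumed example extrinsics: rotate 30 degrees about Z, then translate.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, -0.2, 1.0])

# 4x4 homogeneous transform T = [R t; 0 1]
T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = t

X_w = np.array([1.0, 2.0, 5.0, 1.0])   # 3D world point in homogeneous coordinates
X_c = T @ X_w                          # the same point in the camera coordinate system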
1. Pinhole camera projection model
Does this [R, t] mean the rotation and translation of the camera itself?

In homogeneous coordinates the transform is [X_C; 1] = T [X_W; 1] with T = [R t; 0 1]; the camera's own pose is given by the inverse of T, as shown below.
1. Pinhole camera projection model
e.g. no rotation, just pure translation of the camera to position C:

X_C = X_W - C

Here C is the camera center, i.e. the translation of the camera relative to the world coordinate system, so in this case t = -C.
1. Pinhole camera projection model
The actual rotation and translation of the camera are given by the inverse of the transformation matrix T:

T^(-1) = [R^T  -R^T t; 0  1]

where R^T (= R^(-1)) is the rotation and C = -R^T t is the translation of the camera relative to the world coordinate system.
1. Pinhole camera projection model
An object is projected onto the image plane through the pinhole camera.

(Figure: image plane behind the pinhole and the equivalent virtual image plane in front of it.)
1. Pinhole camera projection model
Then let's see how a 3D point in the camera coordinate system, denoted X_C, is projected to the image coordinate system (image plane), with its projection denoted x_im.

(Figure: projection geometry, where C is the camera center (or optical center) and p = (c_x, c_y) is the principal point.)
1. Pinhole camera projection model
Take one plane of the camera coordinate system as an example, with the 3D point X_C = [X, Y, Z]^T and its projection x_im.

(Figure: side view of the projection through the camera center onto the image plane.)
1. Pinhole camera projection model
What is the coordinate of x_im in the image coordinate system?
(Hint: use properties of similar triangles)

By similar triangles, a point X_C = [X, Y, Z]^T maps to

x = f X / Z,   y = f Y / Z

which is a mapping from 3D Euclidean space to 2D Euclidean space.

Is it linear? NO! (The division by Z makes it non-linear.)
What does this remind you of? Homogeneous coordinates!
1. Pinhole camera projection model
Considering the principal point p, whose coordinate in the 2D image coordinate system is [c_x, c_y], the projection becomes

x = f X / Z + c_x,   y = f Y / Z + c_y

This gives us the coordinates of a projected 3D point in the 2D image coordinate system.

How can we write this mapping in matrix form?
1. Pinhole camera projection model

In matrix form, using homogeneous coordinates:

Z [x; y; 1] = [f 0 c_x; 0 f c_y; 0 0 1] [X; Y; Z] = K X_C

K is called the camera intrinsic matrix, camera intrinsic parameters, or calibration matrix (in general with separate focal lengths f_x and f_y).
1. Pinhole camera projection model
Combining with the 3D rigid-body transformation from world to camera coordinates:

x ≃ K [R | t] X_W

Sometimes we write this as x ≃ P X_W, with P = K [R | t] the 3 x 4 projection matrix.

[R|t] is called the camera extrinsic parameters.
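As an illustrative sketch only (the parameter values below are made up), the full projection x ≃ K [R|t] X_W in NumPy:

import numpy as np

def project(K, R, t, X_w):
    """Project a 3D world point to pixel coordinates with a pinhole camera."""
    X_c = R @ X_w + t                 # world -> camera coordinates
    x = K @ X_c                       # camera -> homogeneous image coordinates
    return x[:2] / x[2]               # perspective divide

# Assumed example parameters (not from the slides).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                         # no rotation
t = np.array([0.0, 0.0, 0.0])         # camera at the world origin
print(project(K, R, t, np.array([0.5, -0.2, 4.0])))  # pixel (u, v)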


2. Epipolar constraints
2.1 Epipolar line
● The 3D point X projected to x must lie on the ray that passes through the first optical center and x
● The projection of this ray onto the second camera's image plane forms a line, called the epipolar line
● x and x' form a correspondence

(Figure: two image planes observing the same 3D point X, with x in image plane 1 and x' in image plane 2.)
2. Epipolar constraints
2.2 Epipoles
Now consider a number of points in the first image: each point x_1, x_2, x_3 is associated with a ray through the first optical center. Each ray projects to an epipolar line in the second image, and all of these epipolar lines intersect at a single point, the epipole, which is the image of the first camera's optical center in the second view.

(Figure: rays through image plane 1 and the corresponding epipolar lines in image plane 2.)
2. Epipolar constraints
Special cases of epipoles:
(1) Camera movement is a pure translation perpendicular to the optical axis (parallel to the
image plane)
The epipolar lines are parallel and the epipole is at infinity
2. Epipolar constraints
Special cases of epipoles:
(2) Camera movement is a pure translation along the optical axis
The epipoles have the same coordinates in both images. Epipolar lines form a radial pattern.
2. Epipolar constraints
2.3 Epipolar plane
The optical centers C and C’, 3D point X, and its projection on images x and x’ lie in a
common plane π, called epipolar plane.
2. Epipolar constraints
2.4 Baseline
The camera baseline is the line joining the optical centers C and C'.

● Baseline intersects each image plane at the epipoles e and e’.


● Any plane π containing the baseline is an epipolar plane, and intersects the image planes in
corresponding epipolar lines l and l’.
● As the position of the 3D point X varies, the epipolar planes “rotate” about the baseline. This
family of planes is known as an epipolar pencil. All epipolar lines intersect at the epipole.
3. The Essential matrix
The essential matrix, denoted E, is a 3 x 3 matrix that encodes the epipolar constraint.

(Figure: epipolar lines l, l' and epipoles e, e' in the two images.)
3. The Essential matrix

We define the essential matrix E as

E = [t]_x R

where [t]_x is the skew-symmetric matrix of the translation t. For normalized image coordinates x̂ = K^(-1) x and x̂' = K'^(-1) x', this also gives the epipolar constraint

x̂'^T E x̂ = 0
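A small NumPy sketch (with made-up pose values) that builds E = [t]_x R and checks the epipolar constraint for a point pair generated from a known 3D point; here R, t map points from camera 1 to camera 2 (X2 = R X1 + t):

import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed relative pose between the two cameras (illustrative values).
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])         # pure translation along x (stereo-like)
E = skew(t) @ R

# A 3D point in the first camera frame and its normalized projections.
X1 = np.array([0.3, -0.1, 5.0])
X2 = R @ X1 + t                        # the same point in the second camera frame
x1 = X1 / X1[2]                        # normalized homogeneous coordinates [x, y, 1]
x2 = X2 / X2[2]
print(x2 @ E @ x1)                     # ~0: the epipolar constraint holds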


4. The Fundamental matrix
Then we give the definition of the fundamental matrix, denoted F. It operates directly on pixel coordinates and also encodes the epipolar constraint:

x'^T F x = 0

Relationship between the fundamental matrix F and the essential matrix E:

E = K'^T F K,   equivalently   F = K'^(-T) E K^(-1)

so E can be seen as a special case of F for calibrated (normalized) coordinates.
4. The Fundamental matrix
More about the fundamental matrix.
Epipolar line: given an image point x in one view, we can get the epipolar line l' in the other view:

l' = F x

According to the epipolar constraint, x'^T F x = 0. Substituting the epipolar line, we have

x'^T l' = 0

What does this mean?
4. The Fundamental matrix
The equation of a line in the (2D) image plane:

a x + b y + c = 0

sometimes represented as the homogeneous vector l' = [a, b, c]^T. If a point x' = [x, y, 1]^T lies on the line l', then

x'^T l' = 0

This shows the epipolar constraint from another perspective: x' lies on the epipolar line l' = F x.

4. The Fundamental matrix
What about the epipole?
The epipole e' also lies on every epipolar line l':

e'^T l' = 0

With the definition of the epipolar line, l' = F x, so we have

e'^T F x = 0   for all x,   hence   e'^T F = 0

Taking the transpose of both sides: F^T e' = 0. Similarly, F e = 0, so the epipoles are the null vectors of F.

4. The Fundamental matrix
Given a set of matched points {x_i <-> x'_i}, each correspondence gives one linear constraint on F:

x'_i^T F x_i = 0

The fundamental matrix can be computed with the 8-point algorithm (8 correspondences).

More recently, the essential matrix can be estimated from only 5 correspondences [1].

[1] David Nistér, An Efficient Solution to the Five-Point Relative Pose Problem, PAMI 2004
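A hedged example of how this is commonly done with OpenCV; the point arrays below are random placeholders you would replace with real matched pixel coordinates:

import numpy as np
import cv2

# pts1, pts2: (N, 2) arrays of matched pixel coordinates (placeholders here).
pts1 = (np.random.rand(20, 2) * 640).astype(np.float32)
pts2 = pts1 + (np.random.rand(20, 2) * 2.0).astype(np.float32)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])

# 8-point algorithm for the fundamental matrix (needs >= 8 correspondences).
F, mask_F = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

# 5-point algorithm (Nister) for the essential matrix, with RANSAC for robustness.
E, mask_E = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)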
5. RANSAC Algorithm
Recall feature matching: incorrectly matched feature points are called outliers.

We want fewer outliers!
5. RANSAC Algorithm
RANdom SAmple Consensus (RANSAC) algorithm in general: try to fit a line to some data that contains outliers.

1. Sample (randomly) the minimum number of points required to fit the model
2. Solve for the model parameters using the samples
3. Score by the fraction of inliers within a preset threshold of the model

Repeat 1-3 until the best model is found with high confidence (see the line-fitting sketch below).
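A minimal, illustrative RANSAC line-fitting sketch; the synthetic data, threshold and iteration count are arbitrary choices, not values from the slides:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: points on the line y = 2x + 1 plus noise, with some gross outliers.
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.2, 100)
y[:20] += rng.uniform(-20, 20, 20)                     # 20 outliers
pts = np.column_stack([x, y])

best_model, best_inliers = None, 0
for _ in range(200):
    i, j = rng.choice(len(pts), 2, replace=False)      # 1. minimal sample: 2 points
    (x1, y1), (x2, y2) = pts[i], pts[j]
    if np.isclose(x1, x2):
        continue
    a = (y2 - y1) / (x2 - x1)                          # 2. fit the model y = a x + b
    b = y1 - a * x1
    resid = np.abs(pts[:, 1] - (a * pts[:, 0] + b))    # 3. score by inlier count
    inliers = np.sum(resid < 0.5)
    if inliers > best_inliers:
        best_model, best_inliers = (a, b), inliers

print(best_model, best_inliers)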
5. RANSAC Algorithm
RANdom SAmple Consensus (RANSAC) algorithm for computing the fundamental matrix F:
1. In each iteration, randomly choose 8 point correspondences from all correspondences
2. Compute the fundamental matrix F from these 8 correspondences
3. Count the inliers: search all N correspondences and keep those whose epipolar error (e.g. |x'^T F x|) is ≤ a threshold
4. Choose the F with the maximum number of inliers (a sketch follows below)
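A compact NumPy sketch of steps 1-4, using the (unnormalized) 8-point algorithm and the algebraic error |x'^T F x| as the inlier test; both choices are simplifications for illustration (real implementations normalize coordinates and use a geometric error such as the Sampson distance):

import numpy as np

def eight_point(x1, x2):
    """Linear 8-point estimate of F from homogeneous correspondences x2^T F x1 = 0."""
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                           # null vector of A, reshaped to 3x3
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                                         # enforce rank 2
    return U @ np.diag(S) @ Vt

def ransac_fundamental(x1, x2, iters=1000, thresh=1e-2, rng=np.random.default_rng(0)):
    """x1, x2: (N, 3) homogeneous pixel correspondences."""
    best_F, best_mask = None, np.zeros(len(x1), bool)
    for _ in range(iters):
        idx = rng.choice(len(x1), 8, replace=False)    # 1. sample 8 correspondences
        F = eight_point(x1[idx], x2[idx])              # 2. fit F from the sample
        err = np.abs(np.sum(x2 * (x1 @ F.T), axis=1))  # 3. algebraic error |x2^T F x1|
        mask = err <= thresh
        if mask.sum() > best_mask.sum():               # 4. keep F with most inliers
            best_F, best_mask = F, mask
    return best_F, best_mask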
6. Solve camera pose from Essential matrix
Take the SVD of the essential matrix: E = U diag(1, 1, 0) V^T.

The essential matrix must have two singular values that are equal and a third that is zero.

Define the matrix

W = [0 -1 0; 1 0 0; 0 0 1]

Translation t and rotation R can then be computed as

t = ±u_3 (the last column of U),   R = U W V^T  or  U W^T V^T

This yields four candidate (R, t) pairs; the correct one is the pair for which triangulated points lie in front of both cameras.
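An illustrative NumPy sketch of this decomposition; it returns all four candidate pairs, and the cheirality check that selects among them is omitted (OpenCV's cv2.recoverPose performs that selection):

import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs encoded by an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Make sure we produce proper rotations (det = +1); E is only defined up to sign.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                        # translation, up to sign and scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]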


7. Feature Detector and Descriptor
Introduction
● Mainly about point features
● Point features can be used to find a sparse set of correspondences between different images

What kinds of feature points might one use to establish a set of correspondences between these images? Corners and edges! These points are often called interest points.

(Figure: two pairs of images to be matched.)
7. Feature Detector and Descriptor
Types of detectors and descriptors
Interest point detectors: Harris corner detector, Good Features to Track (Shi-Tomasi corner detector)

Gradient-based detectors and descriptors: SIFT, SURF

Binary detectors and/or descriptors: FAST (detector), BRIEF (descriptor), ORB (Oriented FAST and Rotated BRIEF)
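A brief OpenCV sketch of detecting and matching ORB features between two images; the file names are placeholders:

import cv2

# Placeholder file names; replace with real image paths.
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)            # FAST keypoints + rotated BRIEF descriptors
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance is appropriate for binary descriptors such as ORB/BRIEF.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "matches")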
8. Optical Flow and Lucas-Kanade Algorithm
8.1 Optical flow
Estimate the motion of each pixel between two consecutive image frames, based on two assumptions: 1. brightness constancy, 2. small motion.

What we want to estimate is the 2D motion vector of each pixel from time t to time t'.


8. Optical Flow and Lucas-Kanade Algorithm
8.1 Optical flow
Brightness constancy assumption:
The 2D projection of the same 3D point 'moves' across different frames.

The brightness constancy assumption tells us that the pixel intensity of these 2D projection points does not change:

I(x(t), y(t), t) = C

where C is a constant.
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Solving for the motion vectors: according to the brightness constancy assumption,

I(x + dx, y + dy, t + dt) = I(x, y, t)

where dx, dy and dt denote the changes in the x, y directions and in time t.

According to the small-motion assumption, we can do a first-order Taylor expansion:

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt

So we have

(∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt = 0
8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Dividing by dt, we have

(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0

Let's explain what these variables mean:

∂I/∂x, ∂I/∂y are the gradients of the pixel intensity in the x and y directions, respectively;
dx/dt, dy/dt are the velocities of the pixel's motion in the x and y directions, respectively;
∂I/∂t is the gradient of the pixel intensity over time.


8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Using a more compact notation,

I_x u + I_y v + I_t = 0

where I_x, I_y are the gradients of the pixel intensity, u and v the velocities in the x and y directions, and I_t the gradient of the pixel intensity over time.

In vector form we have

[I_x  I_y] [u; v] = -I_t


8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
We don't use just one pixel; usually we use a patch and assume all the pixels in this patch have the same motion.

e.g. a w × w patch with n = w² pixels: for some pixel k in this patch we have

[I_x  I_y]_k [u; v] = -(I_t)_k,   k = 1, ..., n

So for n pixels we have n such equations, one per pixel.


8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
We define

A = [ (I_x)_1 (I_y)_1; ... ; (I_x)_n (I_y)_n ],   b = -[ (I_t)_1; ... ; (I_t)_n ]

So we have

A [u; v] = b

What is the solution for the [u, v] vector?


8. Optical Flow and Lucas-Kanade Algorithm
8.2 Lucas-Kanade algorithm
Two unknowns need two equations. But what if we have more than two equations?

[u; v] = (A^T A)^(-1) A^T b

This is the least-squares solution of this overdetermined linear system, and (A^T A)^(-1) A^T is called the pseudoinverse (Moore-Penrose inverse) of the matrix A.

This algorithm is called the Lucas-Kanade algorithm [1].

The Lucas-Kanade algorithm can be used to track key-points in visual SLAM.

[1] Simon Baker and Iain Matthews, Lucas-Kanade 20 Years On: A Unifying Framework, International Journal of Computer Vision 56(3), 221-255, 2004
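A minimal single-patch, single-level sketch of the least-squares solve above (illustrative only; practical trackers use image pyramids and iterative refinement, e.g. cv2.calcOpticalFlowPyrLK):

import numpy as np

def lk_flow_patch(I0, I1, x, y, half=7):
    """Estimate the (u, v) motion of the patch centred at integer pixel (x, y)
    from frame I0 to frame I1 (both 2D grayscale arrays)."""
    win0 = I0[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    win1 = I1[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    Iy, Ix = np.gradient(win0)          # spatial gradients of the patch
    It = win1 - win0                    # temporal gradient
    A = np.column_stack([Ix.ravel(), Iy.ravel()])
    b = -It.ravel()
    # Least-squares solution (u, v) = (A^T A)^-1 A^T b
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)
    return uv                           # [u, v]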
9. Direct Methods for Camera Pose and Depth Estimation
Drawbacks of feature-based SLAM:

● Computing feature descriptors is time-consuming, especially for gradient-based features, e.g. SIFT
● Discards much useful information in the images
● Impossible for dense reconstruction

Possible solutions:
● Use raw image pixels for matching after running a feature detector (semi-direct method)
● Match raw pixels directly, without even a feature detector (direct method)
9. Direct Methods for Camera Pose and Depth Estimation
Back to the two-view geometry figure (reference frame on the left, target frame on the right).


9. Direct Methods for Camera Pose and Depth Estimation
For a point X in the 3D world coordinate system (taken here to coincide with the reference camera frame), its projection onto the reference image is

x_ref ≃ K X

The 3D point can also be expressed from its pixel and depth as

X = Z K^(-1) x_ref

Then project the 3D point into the other (target) frame with the estimated R, t and depth Z:

x_tgt ≃ K (R Z K^(-1) x_ref + t)

If we do this for many 3D points and their projections, this forms an image: it is formed by 'moving' each original pixel of the reference image to its new 2D position in the target image coordinates. This process is called warping.
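A sparse (per-pixel) warping sketch in NumPy, directly following the equations above; dense direct methods do this for whole images with interpolation, which is omitted here:

import numpy as np

def warp_points(K, R, t, pts_ref, depth):
    """Warp reference pixels (N, 2) with per-pixel depth (N,) into the target view."""
    ones = np.ones((len(pts_ref), 1))
    x_ref = np.hstack([pts_ref, ones]).T          # homogeneous pixels, shape (3, N)
    X = np.linalg.inv(K) @ x_ref * depth          # back-project: X = Z * K^-1 * x_ref
    X_tgt = R @ X + t[:, None]                    # rigid transform into the target frame
    x_tgt = K @ X_tgt                             # project into the target image
    return (x_tgt[:2] / x_tgt[2]).T               # perspective divide, shape (N, 2)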
9. Direct Methods for Camera Pose and Depth Estimation
According to the brightness constancy assumption, if we had perfect estimates of R, t and Z, the warped image and the target image would be identical.

In practice they are not, so we define the following photometric loss:

L = Σ_i ( I_ref(i) - I_warp(i) )²

where I_ref(i) and I_warp(i) are the pixel intensities of the reference image and the warped image at pixel i.

By minimizing the photometric loss, we can optimize the estimated R, t and Z.

How do we minimize this loss? It is a typical non-linear least-squares optimization problem; standard solutions are the Gauss-Newton and Levenberg-Marquardt algorithms.
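A toy sketch of evaluating the photometric loss given reference pixels and their warped positions (for example produced by a warp like the one above); nearest-neighbour sampling is used for brevity, whereas real systems interpolate bilinearly and minimize over (R, t, Z) with Gauss-Newton or Levenberg-Marquardt:

import numpy as np

def photometric_loss(I_ref, I_tgt, pts_ref, pts_warp):
    """Sum of squared intensity differences between reference pixels (integer (x, y))
    and their warped positions in the target image."""
    u = np.round(pts_warp[:, 0]).astype(int)
    v = np.round(pts_warp[:, 1]).astype(int)
    h, w = I_tgt.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)   # keep pixels that land inside the image
    ref_vals = I_ref[pts_ref[valid, 1], pts_ref[valid, 0]].astype(float)
    tgt_vals = I_tgt[v[valid], u[valid]].astype(float)
    r = ref_vals - tgt_vals                            # photometric residuals
    return np.sum(r ** 2)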
Part 2

Traditional Feature based and Direct VO/VSLAM


1. Feature based: ORB-SLAM
Overview of ORB-SLAM (figure: system overview, including bundle adjustment).

Raul Mur-Artal, J. M. M. Montiel and Juan D. Tardós, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, TRO 2015
1. Feature based: ORB-SLAM
Results

(Figures: feature matching across different scales and with dynamic objects; results on the KITTI odometry dataset.)

Raul Mur-Artal, J. M. M. Montiel and Juan D. Tardós, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, TRO 2015
2. Dense Direct Method: DTAM

Depth estimation uses a projective photometric cost volume: for each reference pixel and each sampled (inverse) depth, a pixel-level photometric error is computed by reprojecting the pixel into a set of overlapping frames and comparing intensities, and these errors are accumulated over the frames to form the cost volume.

Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time
2. Dense Direct Method: DTAM

(Figure: plots of the single-pixel photometric functions ρ(u) and the resulting total data cost C(u).)

So even direct methods need texture!

Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time
2. Dense Direct Method: DTAM
Incremental cost volume construction:

(Figure: the current inverse depth map, extracted as the current minimum cost for each pixel, as 2, 10 and 30 overlapping images are used in the data term, from left to right.)

Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time
2. Dense Direct Method: DTAM

Camera pose tracking is done by minimizing a whole-image photometric error between the live frame and the dense model projected at the estimated pose.

The whole DTAM system estimates depth and pose in an interleaved manner.

Richard A. Newcombe, Steven J. Lovegrove and Andrew J. Davison, DTAM: Dense Tracking and Mapping in Real-Time
3. Semi-Dense Direct Method: LSD-SLAM

Framework

Jakob Engel, Thomas Schöps and Daniel Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV 2014
3. Semi-Dense Direct Method: LSD-SLAM

Weighted photometric loss: the photometric residuals are normalized by their estimated variance (which accounts for depth uncertainty) and combined with a robust Huber norm.

Jakob Engel, Thomas Schöps and Daniel Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV 2014
3. Semi-Dense Direct Method: LSD-SLAM
Qualitative results: mapping and semi-dense map estimation

Semi-dense means only using pixels with sufficiently high image gradient (edges and strongly textured regions).

Jakob Engel, Thomas Schöps and Daniel Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV 2014
Part 3

Learning based VO
1. Supervised method
1.1 PoseNet
Approach: camera pose regression with a CNN.
The camera pose is represented as a translation x and a rotation q in quaternion form.

With ground-truth supervision, the training loss is

loss = || x̂ - x ||_2 + β || q̂ - q / ||q|| ||_2

The network architecture is modified from GoogLeNet.

Alex Kendall, Matthew Grimes, Roberto Cipolla, PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, in ICCV 2015
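A hedged PyTorch sketch of a PoseNet-style pose regression loss; the value of beta is scene-dependent in the paper, and 500 here is just a placeholder:

import torch

def pose_loss(x_pred, q_pred, x_gt, q_gt, beta=500.0):
    """PoseNet-style loss: translation error plus beta-weighted quaternion error."""
    q_gt_unit = q_gt / q_gt.norm(dim=-1, keepdim=True)   # compare against a unit quaternion
    t_err = (x_pred - x_gt).norm(dim=-1)
    r_err = (q_pred - q_gt_unit).norm(dim=-1)
    return (t_err + beta * r_err).mean()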
1. Supervised method
1.1 PoseNet
Experimental results:

Alex Kendall, Matthew Grimes, Roberto Cipolla, PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, in ICCV 2015
1. Supervised method
1.2 DeepVO

Network architecture

Sen Wang, Ronald Clark, Hongkai Wen, Niki Trigoni, DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional
Neural Networks, in ICRA 2017
1. Supervised method
1.2. DeepVO
Experimental Results

Camera pose estimation results on the KITTI VO dataset


Sen Wang, Ronald Clark, Hongkai Wen, Niki Trigoni, DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional
Neural Networks, in ICRA 2017
2. Self-supervised method
SfM-Learner

Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe, Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017
2. Self-supervised method
SfM-Learner

Warping and photometric loss:

Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe, Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017
2. Self-supervised method
SfM-Learner

Network structures:

Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe, Unsupervised Learning of Depth and Ego-Motion from Video, CVPR 2017
3. Hybrid method

DVSO
Overview

Nan Yang, Rui Wang, Jorg Stuckler, and Daniel Cremers, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry, ECCV
2018
3. Hybrid method

DVSO

Depth estimation

Nan Yang, Rui Wang, Jorg Stuckler, and Daniel Cremers, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry, ECCV
2018
3. Hybrid method

DVSO

Pose estimation on KITTI:

Nan Yang, Rui Wang, Jorg Stuckler, and Daniel Cremers, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry, ECCV
2018
Recommended Open Source
Visual Odometry/SLAM Implementations

Tutorial Purpose: PySLAM: https://github.com/luigifreda/pyslam

Feature based: ORB-SLAM 2: https://github.com/raulmur/ORB_SLAM2

Direct Method: DSO: https://github.com/JakobEngel/dso

Learning based: SfM-Learner: https://github.com/ClementPinard/SfmLearner-Pytorch
