Computer Vision 5
Feature-based alignment is the problem of estimating the motion between two or more sets
of matched 2D or 3D points.
2D alignment using least squares. Given a set of matched feature points {(x_i, x'_i)} and a planar parametric transformation of the form

x' = f(x; p),

how can we produce the best estimate of the motion parameters p? The usual way to do this is to use least squares, i.e., to minimize the sum of squared residuals

E_LS = Σ_i ||r_i||² = Σ_i ||f(x_i; p) − x̂'_i||²,

where r_i = x̂'_i − x̃'_i is the residual between the measured location x̂'_i and its corresponding current predicted location x̃'_i = f(x_i; p).
1. Accurate Alignment: The primary aim is to accurately align two images by finding the
transformation parameters that best match corresponding features between them.
2. Feature Detection: Identify distinctive key points or features in each image that are robust to
changes in scale, rotation, and illumination.
3. Feature Description: Describe the detected key points in a way that allows for reliable
matching across images despite variations in viewpoint or lighting conditions.
4. Feature Matching: Match corresponding features between the images to establish
correspondences, which are used to estimate the transformation that aligns the images.
5. Transformation Estimation: Estimate the transformation parameters (such as translation,
rotation, scale, etc.) that align one image with another based on the matched feature
correspondences.
6. Robustness: Ensure that the alignment process is robust to noise, occlusions, and other
artifacts present in the images.
7. Efficiency: Perform the alignment process efficiently to allow for real-time or near- real-time
performance in applications such as augmented reality or video processing.
8. Applicability: Enable alignment for various computer vision tasks such as image stitching, pose estimation, and 3D reconstruction.
Minimizing this energy with respect to p leads to the normal equations A p = b, where A = Σ_i Jᵀ(x_i) J(x_i) is called the Hessian, b = Σ_i Jᵀ(x_i) Δx_i, J = ∂f/∂p is the Jacobian of the transformation, and Δx_i = x̂'_i − x_i are the measured displacements. For the case of pure translation, the resulting equations have a particularly simple form, i.e., the translation is the average translation between corresponding points or, equivalently, the translation of the point centroids.
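As a concrete illustration, here is a minimal NumPy sketch of this least squares estimation for an affine transform and, as a special case, pure translation. The function names and the affine parameter ordering are illustrative choices, not taken from the text.

```python
import numpy as np

def estimate_affine(x, x_prime):
    """Least-squares 2D affine transform x' ~ A [x; 1].
    x, x_prime: (N, 2) arrays of matched points."""
    N = x.shape[0]
    # Each correspondence contributes two equations in the six affine
    # parameters (a00, a01, tx, a10, a11, ty).
    J = np.zeros((2 * N, 6))
    J[0::2, 0:2] = x
    J[0::2, 2] = 1.0
    J[1::2, 3:5] = x
    J[1::2, 5] = 1.0
    b = x_prime.reshape(-1)
    # Solve the overdetermined system in a least squares sense;
    # this is equivalent to solving the normal equations (J^T J) p = J^T b.
    p, *_ = np.linalg.lstsq(J, b, rcond=None)
    return p.reshape(2, 3)

def estimate_translation(x, x_prime):
    """For pure translation, the solution is the mean displacement,
    i.e., the translation between the point centroids."""
    return (x_prime - x).mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 100, (20, 2))
    A_true = np.array([[1.02, -0.05, 3.0], [0.04, 0.98, -2.0]])
    xp = x @ A_true[:, :2].T + A_true[:, 2] + rng.normal(0, 0.1, (20, 2))
    print(estimate_affine(x, xp))
    print(estimate_translation(x, xp))
```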
Uncertainty weighting. The above least squares formulation assumes that all feature points
are matched with the same accuracy. This is often not the case, since certain points may fall
into more textured regions than others. If we associate a scalar variance estimate σ_i² with each correspondence, we can instead minimize the weighted least squares problem E_WLS = Σ_i σ_i⁻² ||r_i||².
Application: Panography
One of the simplest (and most fun) applications of image alignment is a special form of
image stitching called panography. In a panograph, images are translated and optionally
rotated and scaled before being blended with simple averaging.
In most of the examples seen on the web, the images are aligned by hand for best artistic
effect.
Consider a simple translational model. We want all the corresponding features in different images to line up as best as possible. Let t_j be the location of the jth image coordinate frame in the global composite frame and x_ij be the location of the ith matched feature in the jth image. In order to align the images, we wish to minimize the least squares error

E = Σ_ij ||t_j + x_ij − x_i||²,

where x_i is the consensus (average) position of feature i in the global composite frame.
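A sketch of this translational panography alignment is given below, written as an alternating least squares solve. Alternating between the feature positions x_i and the image offsets t_j, and anchoring the first image to remove the global translation ambiguity, are illustrative implementation choices rather than details given in the text.

```python
import numpy as np

def align_panograph(obs, n_images, n_feats, iters=50):
    """obs: list of (i, j, xij) tuples, where xij is the 2D position of
    feature i observed in image j.  Minimizes sum ||t_j + x_ij - x_i||^2
    by alternation, assuming every feature and image appears in at least
    one observation."""
    t = np.zeros((n_images, 2))
    x = np.zeros((n_feats, 2))
    for _ in range(iters):
        # Update each global feature position as the mean of its predictions.
        sums, counts = np.zeros((n_feats, 2)), np.zeros(n_feats)
        for i, j, xij in obs:
            sums[i] += t[j] + xij
            counts[i] += 1
        x = sums / counts[:, None]
        # Update each image offset as the mean residual over its features.
        sums, counts = np.zeros((n_images, 2)), np.zeros(n_images)
        for i, j, xij in obs:
            sums[j] += x[i] - xij
            counts[j] += 1
        t = sums / counts[:, None]
        t -= t[0]   # gauge fix: anchor the first image at the origin
    return t, x
```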
To minimize a non-linear least squares problem, we iteratively find an update Δp to the current parameter estimate p by minimizing

E(p + Δp) ≈ Σ_i ||J(x_i; p) Δp − r_i||²,

where J = ∂f/∂p is the Jacobian of the transformation evaluated at the current estimate. For the other 2D motion models, the derivatives in Table 8.1 are all fairly straightforward, except for the projective transform (homography), whose per-pixel division requires use of the quotient rule.
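To make the iteration concrete, here is a small Gauss–Newton sketch for a rigid 2D transform parameterized by (θ, t_x, t_y). The parameterization and function names are illustrative assumptions, not the text's prescription.

```python
import numpy as np

def gauss_newton_rigid2d(x, x_prime, iters=10):
    """Iteratively estimate p = (theta, tx, ty) minimizing
    sum_i ||f(x_i; p) - x'_i||^2 with f(x; p) = R(theta) x + t."""
    theta, tx, ty = 0.0, 0.0, 0.0
    for _ in range(iters):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        pred = x @ R.T + np.array([tx, ty])
        r = (x_prime - pred).reshape(-1)       # residuals (measured - predicted)
        # Jacobian of f w.r.t. (theta, tx, ty), two rows per point.
        dR = np.array([[-s, -c], [c, -s]])     # dR/dtheta
        J = np.zeros((2 * len(x), 3))
        J[:, 0] = (x @ dR.T).reshape(-1)
        J[0::2, 1] = 1.0
        J[1::2, 2] = 1.0
        # Gauss-Newton update: solve (J^T J) dp = J^T r.
        dp = np.linalg.solve(J.T @ J, J.T @ r)
        theta, tx, ty = theta + dp[0], tx + dp[1], ty + dp[2]
    return theta, np.array([tx, ty])
```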
3D alignment
Instead of aligning 2D sets of image features, many computer vision applications require the
alignment of 3D points. In the case where the 3D transformations are linear in the motion
parameters, e.g., for translation, similarity, and affine, regular least squares can be used.
The case of rigid (Euclidean) motion, which arises more frequently and is often called the absolute orientation problem, requires slightly different techniques. If only scalar weightings are being used (as opposed to full 3D per-point anisotropic covariance estimates), the weighted centroids of the two point clouds can first be computed and subtracted, leaving only the rotation between the two centered point sets to be estimated.
One commonly used technique is called the orthogonal Procrustes algorithm and involves computing the singular value decomposition (SVD) of the 3 × 3 correlation matrix C = Σ_i x̂'_i x̂_iᵀ = U Σ Vᵀ formed from the centered points. The rotation matrix is then obtained as R = U Vᵀ. (Verify this for yourself when x̂'_i = R x̂_i.)
Another technique is the absolute orientation algorithm for estimating the unit quaternion corresponding to the rotation matrix R, which involves forming a 4 × 4 matrix from the entries in C and then finding the eigenvector associated with its largest positive eigenvalue.
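A minimal NumPy sketch of the orthogonal Procrustes step described above follows. The reflection guard via the sign of the determinant is a standard safeguard included here for completeness; the function name is illustrative.

```python
import numpy as np

def procrustes_rotation(x, x_prime):
    """Estimate R and t such that x' ≈ R x + t for matched 3D points.
    x, x_prime: (N, 3) arrays."""
    c0, c1 = x.mean(axis=0), x_prime.mean(axis=0)
    xc, xpc = x - c0, x_prime - c1        # subtract the point centroids
    C = xpc.T @ xc                         # 3x3 correlation matrix
    U, S, Vt = np.linalg.svd(C)
    # Guard against a reflection (det = -1) by flipping the last axis.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    t = c1 - R @ c0
    return R, t
```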
Pose estimation
Estimating a camera's pose (position and orientation) from known 3D-to-2D correspondences is also known as extrinsic calibration, as opposed to the intrinsic calibration of internal camera parameters such as focal length. The problem of recovering pose from three correspondences, which is the minimal amount of information necessary, is known as the perspective-3-point problem (P3P), with extensions to larger numbers of points collectively known as PnP. In this section, we look at some of the techniques that have been developed to solve such problems, starting with the direct linear transform (DLT), which recovers a 3 × 4 camera matrix, followed by other "linear" algorithms, and then looking at statistically optimal iterative algorithms.
Linear algorithms
The simplest way to recover the pose of the camera is to form a set of linear equations analogous to those used for 2D motion estimation from the camera matrix form of perspective projection,

x_i = (p_00 X_i + p_01 Y_i + p_02 Z_i + p_03) / (p_20 X_i + p_21 Y_i + p_22 Z_i + p_23),
y_i = (p_10 X_i + p_11 Y_i + p_12 Z_i + p_13) / (p_20 X_i + p_21 Y_i + p_22 Z_i + p_23),

where (x_i, y_i) are the measured 2D feature locations and (X_i, Y_i, Z_i) are the known 3D feature locations. As before, this system of equations can be solved in a linear fashion for the unknowns in the camera matrix P by multiplying both sides by the denominator. Because P is only known up to a scale, we can either fix one of the entries, e.g., p_23 = 1, or find the smallest singular vector of the set of linear equations. The resulting algorithm is called the direct linear transform (DLT). To compute the unknowns in P, at least six correspondences between 3D and 2D locations must be known.
Fig: Pose estimation by the direct linear transform and by measuring visual angles and
distances between pairs of points.
As with the case of estimating homographies, more accurate results for the entries in P can be obtained by directly minimizing this set of equations using non-linear least squares with a small number of iterations. Note that instead of taking the ratios of the X/Z and Y/Z values as above, it is also possible to take the cross product of the 3-vector (x_i, y_i, 1) image measurement and the 3D ray (X_i, Y_i, Z_i) and set the three elements of this cross product to 0. The resulting three equations, when interpreted as a set of least squares constraints, in effect compute the squared sine of the angle between the two rays.
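The DLT described above can be sketched as follows: assemble two linear equations per 2D–3D correspondence by clearing the denominators, then take the right singular vector with the smallest singular value. This sketch assumes at least six correspondences and omits the usual data normalization for brevity.

```python
import numpy as np

def dlt_camera_matrix(X, x):
    """Estimate the 3x4 camera matrix P from 3D points X (N, 3)
    and 2D measurements x (N, 2) using the direct linear transform."""
    rows = []
    for (Xi, Yi, Zi), (xi, yi) in zip(X, x):
        Xh = [Xi, Yi, Zi, 1.0]
        # Two homogeneous equations per correspondence, obtained by
        # multiplying both sides by the projection denominator.
        rows.append(Xh + [0, 0, 0, 0] + [-xi * v for v in Xh])
        rows.append([0, 0, 0, 0] + Xh + [-yi * v for v in Xh])
    A = np.asarray(rows)
    # P is only defined up to scale: take the singular vector associated
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```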
CNN-based pose estimation
As with other areas on computer vision, deep neural networks have also been applied to
pose estimation. Some representative papers include for object pose estimation, and papers
such as and discussed in Section on location recognition. There is also a very active
community around estimating pose from RGB-D images, with the most recent papers
evaluated on the BOP (Benchmark for 6DOF Object Pose)
The most accurate and flexible way to estimate pose is to directly minimize the squared (or robust) reprojection error for the 2D points as a function of the unknown pose parameters in (R, t), and optionally K, using non-linear least squares,

E = Σ_i ρ(r_i),   with   r_i = x̂_i − x̃_i,

where r_i is the current residual vector (2D error in predicted position) and the partial derivatives are with respect to the unknown pose parameters (rotation, translation, and optionally calibration). The robust loss function ρ is used to reduce the influence of outlier correspondences. The resulting projection equations can be written as a chain of simpler transformations,

x̃_ij = f(p_i; c_j, q_j, k),

i.e., the 3D point p_i is first transformed into the camera frame using the rotation R(q_j) and camera center c_j, then perspectively divided, and finally mapped through the intrinsic calibration parameters k.
Note that in these equations, we have indexed the camera centers c_j and camera rotation quaternions q_j by an index j, in case more than one pose of the calibration object is being used.
The advantage of this chained set of transformations is that each one has a simple partial derivative with respect both to its parameters and to its input. Thus, once the predicted value of x̃_i has been computed based on the 3D point location p_i and the current values of the pose parameters (c_j, q_j, k), we can obtain all of the required partial derivatives using the chain rule,

∂r_i/∂p^(k) = (∂r_i/∂x̃_i)(∂x̃_i/∂p^(k)),
where p^(k) indicates one of the parameter vectors that is being optimized. (This same "trick" is used in neural networks as part of backpropagation.) The one special case in this formulation that can be considerably simplified is the computation of the rotation update. Instead of directly computing the derivatives of the 3 × 3 rotation matrix R(q) as a function of the unit quaternion entries, you can prepend the incremental rotation matrix ΔR(ω) to the current rotation matrix and compute the partial derivative of the transform with respect to these parameters, which results in a simple cross product of the backward chaining partial derivative and the outgoing 3D vector.
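In practice, this robust reprojection-error minimization can be sketched with an off-the-shelf solver. The example below uses scipy.optimize.least_squares with a Huber loss and parameterizes the rotation with a rotation vector via SciPy's Rotation class; this is a convenience substitution for the quaternion/incremental-rotation scheme described above, not the text's exact formulation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_pose(K, X, x, rvec0, t0):
    """Minimize robust reprojection error over (rvec, t).
    K: 3x3 intrinsics, X: (N, 3) 3D points, x: (N, 2) measurements,
    rvec0, t0: initial rotation vector and translation."""
    def residuals(params):
        R = Rotation.from_rotvec(params[:3]).as_matrix()
        t = params[3:]
        cam = X @ R.T + t                    # points in the camera frame
        proj = cam @ K.T
        pred = proj[:, :2] / proj[:, 2:3]    # perspective division
        return (pred - x).ravel()            # stacked 2D reprojection residuals

    p0 = np.concatenate([rvec0, t0])
    res = least_squares(residuals, p0, loss="huber", f_scale=1.0)
    return res.x[:3], res.x[3:]
```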
The inference of human pose (head, body, and limb locations and attitude) from a single image can be viewed as yet another kind of segmentation task. We have already discussed some pose estimation techniques in the section on pedestrian detection. Starting with early seminal work, 2D and 3D pose detection and estimation rapidly developed into an active research area, with important advances and datasets.
Application: Location recognition
One of the most exciting applications of pose estimation is in the area of location
recognition, which can be used both in desktop applications (“Where did I take this holiday
snap?”) and in mobile smartphone applications. The latter case includes not only finding out
your current location based on a cell-phone image, but also providing you with navigation
directions or annotating your images with useful information, such as building names and
restaurant reviews (i.e., a pocketable form of augmented reality). This problem is also often
called visual (or image-based) localization.
Some approaches to location recognition assume that the photos consist of architectural
scenes for which vanishing directions can be used to pre-rectify the images for easier
matching.
The main difficulty in location recognition is in dealing with the extremely large community
(user-generated) photo collections on websites such as Flickr.
When the database images are themselves captured as an overlapping sequence, the overlap between adjacent database images can be used to verify and prune potential matches using "temporal" filtering, i.e., requiring the query image to match nearby overlapping database images before accepting the match. Similar ideas have been used to improve location recognition from panoramic video sequences. Recognizing indoor locations inside buildings and shopping malls poses its own set of challenges, including textureless areas and repeated elements.
Some of the most recent approaches to localization use deep networks to generate feature descriptors, perform large-scale instance retrieval, map images to 3D scene coordinates, or perform end-to-end scene coordinate regression, absolute pose regression (APR), or relative pose regression (RPR). Recent evaluations of these techniques have shown that classical approaches based on feature matching followed by geometric pose optimization typically outperform pose regression approaches in terms of accuracy and generalization. The Long-Term Visual Localization benchmark has a leaderboard listing the best-performing localization systems. Another variant on location recognition is the automatic discovery of landmarks, i.e., frequently photographed objects and locations.
The concept of organizing the world's photo collections by location has even been recently extended to organizing all of the universe's (astronomical) photos in an application called astrometry. The technique used to match any two star fields is to take quadruplets of nearby
stars (a pair of stars and another pair inside their diameter) to form a 30-bit geometric hash
by encoding the relative positions of the second pair of points using the inscribed square
as the reference frame. Traditional information retrieval techniques (k-d trees built for
different parts of a sky atlas) are then used to find matching quads as potential star field
location hypotheses, which can then be verified using a similarity transform.
Triangulation
Triangulation is the problem of determining a point's 3D position from a set of corresponding image locations and known camera positions. One of the simplest ways to solve it is to find the 3D point p that lies closest to all of the 3D rays corresponding to the 2D matching feature locations {x_j} observed by cameras with centers {c_j}. These rays originate at c_j in the directions v̂_j obtained by rotating the calibrated image measurements into the world frame, v̂_j = N(R_j⁻¹ K_j⁻¹ x_j), where N(·) denotes normalization to unit length. The optimal value for p, which lies closest to all of the rays, can be computed as a regular least squares problem by summing the squared perpendicular distances r_j² from p to each ray and finding the minimizing value of p.
An alternative formulation, which is more statistically optimal and which can produce significantly better estimates if some of the cameras are closer to the 3D point than others, is to minimize the residual in the measurement equations

x_j = (p_00^(j) X + p_01^(j) Y + p_02^(j) Z + p_03^(j)) / (p_20^(j) X + p_21^(j) Y + p_22^(j) Z + p_23^(j)),
y_j = (p_10^(j) X + p_11^(j) Y + p_12^(j) Z + p_13^(j)) / (p_20^(j) X + p_21^(j) Y + p_22^(j) Z + p_23^(j)),

where (x_j, y_j) are the measured 2D feature locations and the p_kl^(j) are the known entries in camera matrix P_j. As before, this set of non-linear equations can be converted into a linear least squares problem by multiplying both sides by the denominator, again resulting in the direct linear transform (DLT) formulation. Note that if we use homogeneous coordinates p = (X, Y, Z, W), the resulting set of equations is homogeneous and is best solved as a singular value decomposition (SVD) or eigenvalue problem (looking for the smallest singular vector or eigenvector). If we set W = 1, we can use regular linear least squares, but the resulting system may be singular or poorly conditioned, e.g., if all of the viewing rays are nearly parallel, as occurs for points far away from the cameras.
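The homogeneous DLT triangulation just described can be sketched as follows, stacking two linear constraints per view and taking the smallest right singular vector; the function name is illustrative.

```python
import numpy as np

def triangulate_dlt(Ps, xs):
    """Triangulate one 3D point from J views.
    Ps: list of 3x4 camera matrices, xs: list of (x, y) measurements."""
    A = []
    for P, (x, y) in zip(Ps, xs):
        # Two linear constraints per view, e.g. x * (P_row3 . p) = P_row1 . p.
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    A = np.asarray(A)
    _, _, Vt = np.linalg.svd(A)
    p = Vt[-1]
    return p[:3] / p[3]   # dehomogenize (X, Y, Z, W) -> (X, Y, Z)
```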
So far in our study of 3D reconstruction, we have always assumed that either the 3D point
positions or the 3D camera poses are known in advance. In this section, we take our first
look at structure from motion, which is the simultaneous recovery of 3D structure and pose
from image correspondences. In particular, we examine techniques that operate on just two
frames with point correspondences. We divide this section into the study of classic "n-point" algorithms, special (degenerate) cases, projective (uncalibrated) reconstruction, and self-calibration for cameras whose intrinsic calibrations are unknown.
Consider a 3D point p being viewed from two cameras whose relative position can be encoded by a rotation R and a translation t. As we do not know anything about the camera positions, without loss of generality, we can set the first camera at the origin c₀ = 0 and at a canonical orientation R₀ = I. The 3D point p₀ = d₀x̂₀, observed in the first image along the ray direction x̂₀ at a distance d₀, is mapped into the second image by the transformation

d₁x̂₁ = R(d₀x̂₀) + t,

where x̂₀ and x̂₁ are the (local) ray direction vectors. Taking the cross product of both sides with t in order to annihilate it on the right-hand side yields

d₁[t]×x̂₁ = d₀[t]×Rx̂₀,

and taking the dot product of both sides with x̂₁ gives

d₀x̂₁ᵀ([t]×R)x̂₀ = d₁x̂₁ᵀ[t]×x̂₁ = 0,

because the right-hand side is a triple product with two identical entries. (Another way to say this is that the cross product matrix [t]× is skew-symmetric and returns 0 when pre- and post-multiplied by the same vector.) We therefore arrive at the basic epipolar constraint

x̂₁ᵀEx̂₀ = 0,   where   E = [t]×R

is called the essential matrix.
An alternative way to derive the epipolar constraint is to notice that, for the cameras to be oriented so that the rays intersect in 3D at point p, the vector connecting the two camera centers and the rays corresponding to pixels x₀ and x₁, namely t, x̂₁, and Rx̂₀, must be coplanar. This requires that the triple product

x̂₁ · (t × Rx̂₀) = 0,

which is again the epipolar constraint.
Eight-point algorithm. Given this fundamental relationship, how can we use it to recover the camera motion encoded in the essential matrix E? If we have N corresponding measurements {(x_i0, x_i1)}, we can form N homogeneous equations in the nine elements of E = {e_00 . . . e_22}, which can be solved up to scale as a linear least squares problem, e.g., by taking the singular vector associated with the smallest singular value.
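A sketch of this eight-point estimation is shown below. It assumes the measurements are already calibrated ray directions (K⁻¹ applied), and the final projection onto a valid essential matrix (two equal singular values, third zero) follows the standard recipe rather than anything stated in the text.

```python
import numpy as np

def essential_eight_point(x0, x1):
    """x0, x1: (N, 3) arrays of matched, calibrated homogeneous rays
    satisfying x1_i^T E x0_i = 0.  Returns the estimated E (up to scale)."""
    # Each correspondence gives one homogeneous equation in the 9 entries of E.
    A = np.stack([np.outer(b, a).ravel() for a, b in zip(x0, x1)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the space of valid essential matrices.
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt
```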
Self-calibration
The results of structure from motion computation are much more useful if a metric
reconstruction is obtained, i.e., one in which parallel lines are parallel, orthogonal walls are
at right angles, and the reconstructed model is a scaled version of reality. Over the years, a
large number of self-calibration (or auto-calibration) techniques have been developed for
converting a projective reconstruction into a metric one, which is equivalent to recovering
the unknown calibration matrices K_j associated with each image. In situations where
additional information is known about the scene, different methods may be employed. For
example, if there are parallel lines in the scene, three or more vanishing points, which are
the images of points at infinity, can be used to establish the homography for the plane at
infinity, from which focal lengths and rotations can be recovered. If two or more finite
orthogonal vanishing points have been observed, the single-image calibration method
based on vanishing points can be used instead. In the absence of such scene knowledge, it is also possible to work directly with the two-frame epipolar geometry: the fundamental matrix equations can be rewritten in terms of the unknown focal lengths, which can then be recovered from their numerators and denominators.
While two-frame techniques are useful for reconstructing sparse geometry from stereo
image pairs and for initializing larger-scale 3D reconstructions, most applications can benefit
from the much larger number of images that are usually available in photo collections and
videos of scenes.
In this section, we briefly review an older technique called factorization, which can provide
useful solutions for short video sequences, and then turn to the more commonly used
bundle adjustment approach, which uses non-linear least squares to obtain optimal
solutions under general camera configurations.
When processing video sequences, we often get extended feature tracks from which it is
possible to recover the structure and motion using a process called factorization. Consider
the tracks generated by a rotating ping-pong ball that has been marked with dots to make its shape and motion more discernible. We can readily see from the shape of the tracks that the moving object must be a sphere, but how can we infer this mathematically? It turns out that, under orthography or related models discussed below, the shape and motion can be recovered simultaneously using a singular value decomposition.
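A minimal sketch of this orthographic factorization is given below: the measurement matrix is centered per row and factored with a rank-3 SVD. It stops before the metric upgrade (solving for the 3 × 3 mixing matrix that makes the per-frame camera rows orthonormal), which is noted in the comments.

```python
import numpy as np

def factorize_orthographic(W):
    """W: (2F, P) measurement matrix stacking the x and y coordinates of P
    tracked points over F frames.  Returns an affine motion matrix M (2F, 3)
    and shape matrix S (3, P) such that W_registered ≈ M @ S."""
    # Subtract the per-row mean (the image-plane translation of each frame).
    W0 = W - W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    # Under orthography, the registered measurement matrix has rank 3.
    M = U[:, :3] * np.sqrt(s[:3])
    S = np.sqrt(s[:3])[:, None] * Vt[:3]
    # A full implementation would next solve for a 3x3 matrix Q that makes
    # the rows of M @ Q orthonormal per frame (the metric upgrade).
    return M, S
```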
Once the rotation matrices and 3D point locations have been recovered, there still exists a
bas-relief ambiguity, i.e., we can never be sure if the object is rotating left to right or if its
depth reversed version is moving the other way. (This can be seen in the classic rotating
Necker Cube visual illusion.) Additional cues, such as the appearance and disappearance of
points, or perspective effects, both of which are discussed below, can be used to remove
this ambiguity.
For motion models other than pure orthography, e.g., for scaled orthography or
paraperspective, the approach above must be extended in the appropriate manner. Such
techniques are relatively straightforward to derive from first principles; more details can be
found in papers that extend the basic factorization approach to these more flexible models.
Additional extensions of the original factorization algorithm include multi-body rigid motion,
sequential updates to the factorization, the addition of lines and planes and re-scaling the
measurements to incorporate individual location uncertainties.
A disadvantage of factorization approaches is that they require a complete set of tracks, i.e., each point must be visible in each frame, for the factorization to work. One way to deal with this problem is to first apply factorization to smaller, denser subsets and then use the known camera (motion) or point (structure) estimates to hallucinate additional missing values, which allows more features and cameras to be incrementally incorporated.
Perspective and projective factorization. Another disadvantage of regular factorization is that it cannot deal with perspective cameras. One way to get around this problem is to perform
an initial affine (e.g., orthographic) reconstruction and to then correct for the perspective
effects in an iterative manner (Christy and Horaud 1996). This algorithm usually converges
in three to five iterations, with the majority of the time spent in the SVD computation.
Bundle adjustment
As we have mentioned several times before, the most accurate way to recover structure and
motion is to perform robust non-linear minimization of the measurement (re-projection)
errors, which is commonly known in the photogrammetry (and now computer vision)
communities as bundle adjustment.
The biggest difference between these formulas and full bundle adjustment is that our feature
location measurements xij now depend not only on the point (track) index i but also on the
camera pose index j,
and that the 3D point positions pi are also being simultaneously updated. In addition, it is
common to add a stage for radial distortion parameter estimation
if the cameras being used have not been pre-calibrated, as shown in the figure.
While most of the boxes (transforms) have previously been explained, the leftmost box has not. This box performs a robust comparison of the predicted and measured 2D locations x̂_ij and x̃_ij after re-scaling by the measurement noise covariance Σ_ij. In more detail, this operation can be written as

e_ij = x̂_ij − x̃_ij,   E_RB = Σ_ij ρ(e_ijᵀ Σ_ij⁻¹ e_ij),

i.e., the residual is whitened by the measurement covariance and then passed through a robust loss ρ.
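A small sketch of this covariance-rescaled robust comparison follows, using a Huber penalty as an illustrative choice of ρ (the text does not prescribe a particular robust function).

```python
import numpy as np

def robust_reprojection_cost(x_meas, x_pred, Sigmas, delta=1.0):
    """Sum of rho(||e_ij||_Sigma) over all measurements.
    x_meas, x_pred: (M, 2); Sigmas: (M, 2, 2) per-measurement covariances."""
    cost = 0.0
    for e, S in zip(x_meas - x_pred, Sigmas):
        # Whiten the 2D residual by the measurement noise covariance.
        r2 = e @ np.linalg.solve(S, e)     # squared Mahalanobis norm
        r = np.sqrt(r2)
        # Huber robust penalty: quadratic near 0, linear for outliers.
        cost += 0.5 * r2 if r <= delta else delta * (r - 0.5 * delta)
    return cost
```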
The advantage of the chained representation introduced above is that it not only makes the
computations of the partial derivatives and Jacobians simpler but it can also be adapted to
any camera configuration. Consider for example a pair of cameras mounted on a robot that
is moving around in the world, as shown in Figure 11.15a. By replacing the rightmost two
transformations in Figure 11.14 with the transformations shown in Figure 11.15b, we can
simultaneously recover the position of the robot at each time and the calibration of each
camera with respect to the rig, in addition to the 3D structure of the world.
Exploiting sparsity
Large bundle adjustment problems, such as those involving reconstructing 3D scenes from
thousands of internet photographs can require solving non-linear least squares problems
with millions of measurements (feature matches) and tens of thousands of unknown
parameters (3D point positions and camera poses). Unless some care is taken, these kinds of problems can become intractable, because the (direct) solution of dense least squares problems is cubic in the number of unknowns.
Fortunately, structure from motion is a bipartite problem in structure and motion. Each
feature point xij in a given image depends on one 3D point position pi and one 3D camera
pose (Rj , cj ). This is illustrated in Figure 11.16a, where each circle (1–9) indicates a 3D
point, each square (A–D) indicates a camera, and lines (edges) indicate which points are
visible in which cameras (2D features). If the values for all the points are known or fixed, the
equations for all the cameras become independent, and vice versa.
If we order the structure variables before the motion variables in the Hessian matrix A (and hence also the right-hand side vector b), we obtain a structure for the Hessian shown in Figure 11.16c. When such a system is solved using sparse Cholesky factorization, the fill-in occurs only in the smaller motion Hessian; for very large problems, it is also common to explore the use of iterative (conjugate gradient) techniques for the solution of bundle adjustment problems.
In more detail, the reduced motion Hessian is computed using the Schur complement,

A′_CC = A_CC − A_PCᵀ A_PP⁻¹ A_PC   (and   b′_C = b_C − A_PCᵀ A_PP⁻¹ b_P),

where A_PP is the point (structure) Hessian (the top left block of Figure 11.16c), A_PC is the point-camera Hessian (the top right block), and A_CC and A′_CC are the motion Hessians before and after the point variable elimination (the bottom right block of Figure 11.16c). Notice that A′_CC has a non-zero entry between two cameras if they see any 3D point in common. This is indicated with dashed arcs in Figure 11.16a and light blue squares in Figure 11.16c.
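A dense sketch of this Schur-complement step is given below. Real systems exploit the block-diagonal structure of A_PP and never form these matrices densely; the names are illustrative.

```python
import numpy as np

def reduced_camera_system(A_PP, A_PC, A_CC, b_P, b_C):
    """Eliminate the point (structure) variables from the normal equations
        [A_PP   A_PC ] [dp]   [b_P]
        [A_PC^T A_CC ] [dc] = [b_C]
    and return the reduced motion Hessian and right-hand side."""
    A_PP_inv_A_PC = np.linalg.solve(A_PP, A_PC)
    A_CC_reduced = A_CC - A_PC.T @ A_PP_inv_A_PC        # Schur complement
    b_C_reduced = b_C - A_PC.T @ np.linalg.solve(A_PP, b_P)
    return A_CC_reduced, b_C_reduced

def back_substitute_points(A_PP, A_PC, b_P, dc):
    """Recover the point updates once the camera updates dc are known."""
    return np.linalg.solve(A_PP, b_P - A_PC @ dc)
```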
Whenever there are global parameters present in the reconstruction algorithm, such as
camera intrinsics that are common to all of the cameras, or camera rig calibration
parameters such as those shown in Figure 11.15, they should be ordered last (placed along
the right and bottom edges of A) to reduce fill-in.
Application: Match move
One of the neatest applications of structure from motion is to estimate the 3D motion of a
video or film camera, along with the geometry of a 3D scene, in order to superimpose 3D
graphics or computer-generated images (CGI) on the scene. In the visual effects industry,
this is known as the match move problem (Roble 1999), as the motion of the synthetic 3D
camera used to render the graphics must be matched to that of the real-world camera. For
very small motions, or motions involving pure camera rotations, one or two tracked points
can suffice to compute the necessary visual motion.
For planar surfaces moving in 3D, four points are needed to compute the homography,
which can then be used to insert planar overlays, e.g., to replace the contents of advertising
billboards during sporting events. The general version of this problem requires the
estimation of the full 3D camera pose along with the focal length (zoom) of the lens and
potentially its radial distortion parameters.
When the 3D structure of the scene is known ahead of time, pose estimation techniques
such as view correlation or through-the-lens camera control (Gleicher and Witkin 1992) can be used. For more complex scenes, it is usually preferable to recover the 3D structure simultaneously with the camera motion using structure-from-motion techniques.
The trick with using such techniques is that to prevent any visible jitter between the synthetic
graphics and the actual scene, features must be tracked to very high accuracy and ample
feature tracks must be available in the vicinity of the insertion location.
The most general algorithms for structure from motion make no prior assumptions about the
objects or scenes that they are reconstructing. In many cases, however, the scene contains
higher-level geometric primitives, such as lines and planes. These can provide information
complementary to interest points and also serve as useful building blocks for 3D modeling
and visualization. Furthermore, these primitives are often arranged in particular
relationships, i.e., many lines and planes are either parallel or orthogonal to each other.
Sometimes, instead of exploiting regularity in the scene structure, it is possible to take
advantage of a constrained motion model.
For example, if the object of interest is rotating on a turntable, i.e., around a fixed but unknown axis, specialized techniques can be used to recover this motion. In other
situations, the camera itself may be moving in a fixed arc around some center of rotation
(Shum and He 1999). Specialized capture setups, such as mobile stereo camera rigs or
moving vehicles equipped with multiple fixed cameras, can also take advantage of the
knowledge that individual cameras are (mostly) fixed with respect to the capture rig, as
shown in Figure.
Line-based techniques
It is well known that pairwise epipolar geometry cannot be recovered from line matches
alone, even if the cameras are calibrated. To see this, think of projecting the set of lines in
each image into a set of 3D planes in space. You can move the two cameras around into
any configuration you like and still obtain a valid reconstruction for 3D lines. When lines are visible in three or more views, however, the trifocal tensor can be used to transfer lines from one pair of images to another.
One widely used technique matches 2D lines based on the average of 15 × 15 pixel correlation scores evaluated at all pixels along their common line segment intersection. In this system, the epipolar geometry is assumed to be known, e.g., computed from point matches. For wide baselines, all possible homographies corresponding to planes passing through the 3D line are used to warp pixels and the maximum correlation score is used. For triplets of images, the trifocal tensor is used to verify that the lines are in geometric correspondence before evaluating the correlations between line segments. Figure 11.22 shows the results of using this system.
Instead of reconstructing 3D lines, another approach uses RANSAC to group lines into likely coplanar subsets.
Four lines are chosen at random to compute a homography, which is then verified for these
and other plausible line segment matches by evaluating color histogram-based correlation
scores. The 2D intersection points of lines belonging to the same plane are then used as
virtual measurements to estimate the epipolar geometry, which is more accurate than using
the homographies directly.
Plane-based techniques
In scenes that are rich in planar structures, e.g., in architecture, it is possible to directly
estimate homographies between different planes, using either feature-based or intensity-
based methods. In principle, this information can be used to simultaneously infer the camera
poses and the plane equations, i.e., to compute plane-based structure from motion.
It can be shown that a fundamental matrix can be directly computed from two or more homographies using algebraic manipulations and least squares. Unfortunately, this approach often
performs poorly, because the algebraic errors do not correspond to meaningful reprojection
errors.
A better approach is to hallucinate virtual point correspondences within the areas from
which each homography was computed and to feed them into a standard structure from
motion algorithm. An even better approach is to use full bundle adjustment with explicit
plane equations, as well as additional constraints to force reconstructed co-planar features
to lie exactly on their corresponding planes. (A principled way to do this is to establish a
coordinate frame for each plane, e.g., at one of the feature points, and to use 2D in-plane
parameterizations for the other points.)
Simultaneous localization and mapping (SLAM)
While the computer vision community has been studying structure from motion, i.e., the reconstruction of sparse 3D models from multiple images and videos, since the early 1980s, the mobile robotics community has in parallel been studying the automatic construction of 3D maps from moving robots. In robotics, the problem was formulated as the simultaneous estimation of 3D robot and landmark poses (Figure 11.23) and was known as probabilistic mapping and simultaneous localization and mapping (SLAM). In the computer vision community, the problem was originally called visual odometry, although that term is now usually reserved for shorter-range motion estimation that does not involve building a global map with loop closing.
Early versions of such algorithms used range-sensing techniques, such as ultrasound, laser range finders, or stereo matching, to estimate local 3D geometry, which could then be fused into a 3D model.
SLAM differs from bundle adjustment in two fundamental aspects. First, it allows for a variety of sensing devices, instead of being restricted to tracked or matched feature points. Second, it solves the localization problem online, i.e., with no or very little lag in providing the current sensor pose. This makes it the method of choice for time-critical robotics applications such as autonomous navigation.
Application: Autonomous navigation
Since the early days of artificial intelligence and robotics, computer vision has been used to enable manipulation for dextrous robots and navigation for autonomous robots (Janai, Güney et al. 2020; Kubota 2019). Some of the earliest vision-based navigation systems include the Stanford Cart and CMU Rover, the Terregator, and the CMU Navlab, which originally could only advance 4 m every 10 s (< 1 mph) and which was also the first system to use a neural network for driving. The early algorithms and technologies advanced rapidly, with the VaMoRs system operating a 25 Hz Kalman filter loop and driving on roads with good lane markings at 100 km/h. By the mid 2000s, when DARPA introduced their Grand Challenge and Urban Challenge, vehicles equipped with both range-finding lidars and stereo cameras were able to traverse rough outdoor terrain and navigate city streets at regular human driving speeds. These systems led to the formation of industrial research projects at companies such as Google and Tesla, as well as numerous startups, many of which exhibit their vehicles at computer vision conferences.
Translational alignment
The simplest way to establish an alignment between two images or image patches is to shift one image relative to the other. Given a template image I0(x) sampled at discrete pixel locations {x_i = (x_i, y_i)}, we wish to find where it is located in image I1(x). A least squares solution to this problem is to find the minimum of the sum of squared differences (SSD) function

E_SSD(u) = Σ_i [I1(x_i + u) − I0(x_i)]² = Σ_i e_i²,

where u = (u, v) is the displacement and e_i = I1(x_i + u) − I0(x_i) is called the residual error (or the displaced frame difference in the video coding literature). (We ignore for the moment the possibility that parts of I0 may lie outside the boundaries of I1 or be otherwise not visible.) The assumption that corresponding pixel values remain the same in the two images is often called the brightness constancy constraint.
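As a baseline, the SSD minimum can be found by a brute-force search over integer displacements, as in the following NumPy sketch; the padding convention and function name are illustrative assumptions.

```python
import numpy as np

def ssd_translation(I0, I1, search=16):
    """Find the integer shift u = (u, v) minimizing the SSD between the
    template I0 and a window of I1.  Assumes I1 is padded by `search`
    pixels on each side relative to I0 (so every tested shift is valid)."""
    h, w = I0.shape
    best, best_u = np.inf, (0, 0)
    for v in range(-search, search + 1):
        for u in range(-search, search + 1):
            window = I1[search + v: search + v + h, search + u: search + u + w]
            e = window - I0          # residual (displaced frame difference)
            ssd = float(np.sum(e * e))
            if ssd < best:
                best, best_u = ssd, (u, v)
    return best_u, best
```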
Color images can be processed by summing differences across all three color channels, although it is also possible to first transform the images into a different color space or to only use the luminance (which is often done in video encoders).
Robust error metrics. We can make the above error metric more robust to outliers by replacing the squared error terms with a robust function ρ(e_i),

E_SRD(u) = Σ_i ρ(I1(x_i + u) − I0(x_i)) = Σ_i ρ(e_i).
The robust norm ρ(e) is a function that grows less quickly than the quadratic penalty associated with least squares. One such function, sometimes used in motion estimation for video coding because of its speed, is the sum of absolute differences (SAD) metric or L1 norm,

E_SAD(u) = Σ_i |I1(x_i + u) − I0(x_i)| = Σ_i |e_i|.

However, because this function is not differentiable at the origin, it is not well suited to gradient-descent approaches such as the ones presented here. Instead, a smoothly varying function that is quadratic for small values but grows more slowly away from the origin is often used, such as the Geman–McClure function

ρ_GM(x) = x² / (1 + x²/a²),
where a is a constant that can be thought of as an outlier threshold. An appropriate value for the threshold can itself be derived using robust statistics, e.g., by computing the median absolute deviation, MAD = med_i |e_i|, and multiplying it by a constant (≈1.4) to obtain a robust estimate of the standard deviation of the inlier noise process. More recent work proposes a generalized robust loss function that can model various outlier distributions and thresholds, and also has a Bayesian method for estimating the loss function parameters.
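The robust penalty with a MAD-derived scale can be sketched as follows; the Geman–McClure form and the 1.4826 MAD-to-sigma factor are standard choices used here for illustration, and the 3-sigma outlier threshold is an assumption of this sketch.

```python
import numpy as np

def geman_mcclure(e, a):
    """Robust penalty: quadratic for small |e|, saturating for outliers."""
    return e * e / (1.0 + (e * e) / (a * a))

def robust_ssd(I0_vals, I1_vals):
    """Sum of robust residuals between corresponding pixel samples."""
    e = I1_vals - I0_vals
    sigma = 1.4826 * np.median(np.abs(e)) + 1e-8   # MAD-based robust scale
    a = 3.0 * sigma                                 # outlier threshold
    return np.sum(geman_mcclure(e, a))
```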
Spatially varying weights. The error metrics above ignore the fact that, for a given alignment, some of the pixels being compared may lie outside the original image boundaries. Furthermore, we may want to partially or completely downweight the contributions of certain pixels. For example, we may want to selectively "erase" some parts of an image from consideration when stitching a mosaic where unwanted foreground objects have been cut out. For applications such as background stabilization, we may want to downweight the middle part of the image, which often contains independently moving objects being tracked by the camera.
All of these tasks can be accomplished by associating a spatially varying per-pixel weight with each of the two images being matched. The error metric then becomes the weighted (or windowed) SSD function

E_WSSD(u) = Σ_i w0(x_i) w1(x_i + u) [I1(x_i + u) − I0(x_i)]².
Hierarchical motion estimation
The simplest solution is to do a full search over some range of shifts, using either integer or
sub-pixel steps. This is often the approach used for block matching in motion compensated
video compression, where a range of possible motions (say, ±16 pixels) is explored.
To accelerate this search process, hierarchical motion estimation is often used: an image
pyramid is constructed and a search over a smaller number of discrete pixels
(corresponding to the same range of motion) is first performed at coarser levels.
The motion estimate from one level of the pyramid is then used to initialize a smaller local
search at the next finer level. Alternatively, several seeds (good solutions) from the coarse
level can be used to initialize the fine-level search. While this is not guaranteed to produce
the same result as a full search, it usually works almost as well and is much faster.
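A coarse-to-fine sketch of this idea is shown below: simple 2x-downsampled pyramids are built, the shift is estimated at the coarsest level, and each finer level doubles and then refines the estimate with a small local search. The wrap-around shift via np.roll and all function names are simplifications made for this sketch.

```python
import numpy as np

def downsample(I):
    """Simple 2x downsampling by averaging 2x2 blocks (no pre-filtering)."""
    h, w = I.shape[0] // 2 * 2, I.shape[1] // 2 * 2
    I = I[:h, :w]
    return 0.25 * (I[0::2, 0::2] + I[1::2, 0::2] + I[0::2, 1::2] + I[1::2, 1::2])

def ssd_search(I0, I1, center, radius):
    """Search integer shifts within `radius` of `center`, return the best one."""
    best, best_u = np.inf, center
    for dv in range(-radius, radius + 1):
        for du in range(-radius, radius + 1):
            u, v = center[0] + du, center[1] + dv
            # Wrap-around shift is a simplification of proper boundary handling.
            shifted = np.roll(np.roll(I1, -v, axis=0), -u, axis=1)
            e = shifted - I0
            ssd = float(np.sum(e * e))
            if ssd < best:
                best, best_u = ssd, (u, v)
    return best_u

def hierarchical_translation(I0, I1, levels=3, radius=4):
    """Coarse-to-fine estimation of a single translational motion."""
    pyr0, pyr1 = [I0], [I1]
    for _ in range(levels - 1):
        pyr0.append(downsample(pyr0[-1]))
        pyr1.append(downsample(pyr1[-1]))
    u = (0, 0)
    for lvl in reversed(range(levels)):
        # Upsample the motion estimate from the coarser level ...
        u = (2 * u[0], 2 * u[1]) if lvl < levels - 1 else u
        # ... and refine it with a small local search at this level.
        u = ssd_search(pyr0[lvl], pyr1[lvl], u, radius)
    return u
```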
Fourier-based alignment
Windowed correlation.
Phase correlation.
Parametric motion