
UNIT II

FEATURE-BASED ALIGNMENT & MOTION ESTIMATION


2D and 3D feature-based alignment – Pose estimation – Geometric intrinsic calibration –
Triangulation – Two-frame structure from motion – Factorization – Bundle adjustment –
Constrained structure and motion – Translational alignment – Parametric motion – Spline-
based motion – Optical flow – Layered motion.

2D and 3D feature-based alignment

Feature-based alignment is the problem of estimating the motion between two or more sets
of matched 2D or 3D points.

2D alignment using least squares

Given a set of matched feature points {(x_i, x'_i)} and a planar parametric transformation of the form

x'_i = f(x_i; p),

how can we produce the best estimate of the motion parameters p? The usual way to do
this is to use least squares, i.e., to minimize the sum of squared residuals

E_LS(p) = Σ_i ||r_i||²,

where r_i = x̂'_i − x̃'_i is the residual between the measured location x̂'_i and its corresponding current predicted
location x̃'_i = f(x_i; p).

The primary goals of 2D feature-based alignment in computer vision are as follows:

1. Accurate Alignment: The primary aim is to accurately align two images by finding the
transformation parameters that best match corresponding features between them.
2. Feature Detection: Identify distinctive key points or features in each image that are robust to
changes in scale, rotation, and illumination.
3. Feature Description: Describe the detected key points in a way that allows for reliable
matching across images despite variations in viewpoint or lighting conditions.
4. Feature Matching: Match corresponding features between the images to establish
correspondences, which are used to estimate the transformation that aligns the images.
5. Transformation Estimation: Estimate the transformation parameters (such as translation,
rotation, scale, etc.) that align one image with another based on the matched feature
correspondences.
6. Robustness: Ensure that the alignment process is robust to noise, occlusions, and other
artifacts present in the images.
7. Efficiency: Perform the alignment process efficiently to allow for real-time or near-real-time
performance in applications such as augmented reality or video processing.
8. Applicability: Enable alignment for various computer vision tasks such as image stitching,
object recognition, image registration, and augmented reality.


The minimum can be found by solving the symmetric positive definite (SPD) system of
normal equations

A p = b,

where A = Σ_i J^T(x_i) J(x_i) is called the Hessian, b = Σ_i J^T(x_i) Δx_i is built from the pointwise
displacements Δx_i = x'_i − x_i, and J = ∂f/∂p is the Jacobian of the transformation with respect to
the motion parameters. For the case of pure translation, the resulting
equations have a particularly simple form, i.e., the translation is the average translation
between corresponding points or, equivalently, the translation of the point centroids.

Uncertainty weighting. The above least squares formulation assumes that all feature points
are matched with the same accuracy. This is often not the case, since certain points may fall
into more textured regions than others. If we associate a scalar variance estimate σ_i² with each
correspondence, we can instead minimize the weighted least squares problem

E_WLS(p) = Σ_i σ_i^{-2} ||r_i||².
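As an illustration, the following is a minimal sketch that solves the (optionally weighted) normal equations for a 2D affine alignment between matched point sets using NumPy; the function name and the per-point weights argument are illustrative, not part of any standard API.

```python
import numpy as np

def estimate_affine_lsq(pts_src, pts_dst, weights=None):
    """Estimate a 2D affine transform x' ~ A x + t from matched points
    by (weighted) linear least squares.  pts_src and pts_dst are (N, 2)
    arrays of corresponding feature locations; weights are optional
    per-point scalar weights (e.g., inverse variances)."""
    N = len(pts_src)
    w = np.ones(N) if weights is None else np.asarray(weights, dtype=float)
    # Design matrix: each correspondence contributes two rows,
    # one for the x' component and one for the y' component.
    J = np.zeros((2 * N, 6))
    target = np.zeros(2 * N)
    for i, ((x, y), (xp, yp)) in enumerate(zip(pts_src, pts_dst)):
        J[2 * i]     = [x, y, 1, 0, 0, 0]
        J[2 * i + 1] = [0, 0, 0, x, y, 1]
        target[2 * i], target[2 * i + 1] = xp, yp
    W = np.repeat(w, 2)                      # weight both rows of each point
    A = J.T @ (W[:, None] * J)               # "Hessian"  A = J^T W J
    b = J.T @ (W * target)                   # right-hand side  b = J^T W x'
    p = np.linalg.solve(A, b)                # normal equations  A p = b
    return p.reshape(2, 3)                   # [[a00, a01, tx], [a10, a11, ty]]
```

In practice, OpenCV's cv2.estimateAffine2D (or cv2.estimateAffinePartial2D) provides a robust, RANSAC-based alternative to this direct least squares fit.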
Application: Panography

One of the simplest (and most fun) applications of image alignment is a special form of
image stitching called panography. In a panograph, images are translated and optionally
rotated and scaled before being blended with simple averaging.
In most of the examples seen on the web, the images are aligned by hand for best artistic
effect.
Consider a simple translational model. We want all the corresponding features in the different
images to line up as well as possible. Let t_j be the location of the jth image coordinate frame
in the global composite frame and x_ij be the location of the ith matched feature in the jth
image. In order to align the images, we wish to minimize the least squares error

E_PLS = Σ_ij ||t_j + x_ij − x_i||²,

where x_i is the consensus (average) position of feature i in the global composite frame.

To minimize a non-linear least squares problem, we iteratively find an update Δp to the
current parameter estimate p by minimizing the linearized sum of squared residuals

E(Δp) = Σ_i ||J(x_i; p) Δp − r_i||².

For the other 2D motion models, the derivatives in Table 8.1 are all fairly straightforward, except
for the projective 2D motion (homography), which arises in image-stitching applications.

3D alignment
Instead of aligning 2D sets of image features, many computer vision applications require the
alignment of 3D points. In the case where the 3D transformations are linear in the motion
parameters, e.g., for translation, similarity, and affine, regular least squares can be used.

The case of rigid (Euclidean) motion,

E_R3D = Σ_i ||x'_i − R x_i − t||²,

which arises more frequently and is often called the absolute orientation problem, requires
slightly different techniques. If only scalar weightings are being used (as opposed to full 3D
per-point anisotropic covariance estimates), the weighted centroids of the two point clouds,
c and c', can be used to estimate the translation t = c' − R c, leaving only the rotation R between
the two centered point sets {x̂_i = x_i − c} and {x̂'_i = x'_i − c'} to be estimated.

One commonly used technique is called the orthogonal Procrustes algorithm and involves
computing the singular value decomposition (SVD) of the 3 × 3 correlation matrix

C = Σ_i x̂'_i x̂_i^T = U Σ V^T.

The rotation matrix is then obtained as R = U V^T (verify for yourself that this gives zero residual when x̂'_i = R x̂_i).
Another technique is the absolute orientation algorithm for estimating the unit quaternion
corresponding to the rotation matrix R, which involves forming a 4 × 4 matrix from the
entries in C and then finding the eigenvector associated with its largest positive eigenvalue.
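The following is a minimal sketch of the orthogonal Procrustes (absolute orientation) procedure just described, assuming two matched (N, 3) point arrays; the sign correction on the last singular direction guards against reflections.

```python
import numpy as np

def absolute_orientation(X, Xp):
    """Rigid (Euclidean) alignment of two matched 3D point sets via the
    orthogonal Procrustes algorithm: returns R, t such that Xp ~ R @ x + t.
    X and Xp are (N, 3) arrays of corresponding points."""
    c, cp = X.mean(axis=0), Xp.mean(axis=0)              # point-cloud centroids
    Xc, Xpc = X - c, Xp - cp                             # centered coordinates
    C = Xpc.T @ Xc                                       # 3x3 correlation matrix
    U, S, Vt = np.linalg.svd(C)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt                                       # rotation from the SVD
    t = cp - R @ c                                       # translation from centroids
    return R, t
```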

Pose estimation

This pose estimation problem is also known as extrinsic calibration, as opposed to the intrinsic
calibration of internal camera parameters such as focal length. The problem of recovering
pose from three correspondences, which is the minimal amount of information necessary, is
known as the perspective-3-point problem (P3P), with extensions to larger numbers of
points collectively known as PnP. In this section, we look at some of the techniques that
have been developed to solve such problems, starting with the direct linear transform (DLT),
which recovers a 3 × 4 camera matrix, followed by other “linear” algorithms, and then
looking at statistically optimal iterative algorithms.

Linear algorithms
The simplest way to recover the pose of the camera is to form a set of rational linear
equations analogous to those used for 2D motion estimation from the camera matrix form of
perspective projection,

x_i = (p_00 X_i + p_01 Y_i + p_02 Z_i + p_03) / (p_20 X_i + p_21 Y_i + p_22 Z_i + p_23),
y_i = (p_10 X_i + p_11 Y_i + p_12 Z_i + p_13) / (p_20 X_i + p_21 Y_i + p_22 Z_i + p_23),

where (x_i, y_i) are the measured 2D feature locations and (X_i, Y_i, Z_i) are the known 3D feature
locations. This system of equations can be solved in a linear fashion for the
unknowns in the camera matrix P by multiplying out the denominator on both sides of each
equation. Because P is only known up to a scale, we can either fix one of the entries, e.g., p_23
= 1, or find the smallest singular vector of the set of linear equations. The resulting algorithm
is called the direct linear transform (DLT). To compute the unknowns in P, at least six
correspondences between 3D and 2D locations must be known.

Fig: Pose estimation by the direct linear transform and by measuring visual angles and
distances between pairs of points.

As with the case of estimating homographies, more accurate results for the entries in P can
be obtained by directly minimizing the set of ratio equations using non-linear least squares with a
small number of iterations. Note that instead of taking the ratios of the X/Z and Y/Z values
as above, it is also possible to take the cross product of the 3-vector (x_i, y_i, 1) image measurement
and the 3D ray (X, Y, Z) and set the three elements of this cross product to 0. The resulting
three equations, when interpreted as a set of least squares constraints, in effect compute
the squared sine of the angle between the two rays.
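As a concrete illustration, OpenCV's solvePnP performs this kind of pose recovery (a DLT-style initialization followed by iterative refinement of the reprojection error). The data below are synthetic and purely illustrative: known 3D points are projected with a ground-truth pose and then that pose is recovered.

```python
import cv2
import numpy as np

# Assumed intrinsics and (for simplicity) no lens distortion.
K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0.,   0.,   1.]])
dist = np.zeros(5)

# At least six non-coplanar 3D points and their 2D projections are needed.
object_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],
                       [0, 0, 1], [1, 0, 1]], dtype=np.float64)
rvec_true = np.array([0.1, -0.2, 0.05])      # hypothetical ground-truth pose
tvec_true = np.array([0.3, -0.1, 5.0])
image_pts, _ = cv2.projectPoints(object_pts, rvec_true, tvec_true, K, dist)

# Recover the pose: DLT-style initialization, then iterative
# (Levenberg-Marquardt) minimization of the reprojection error.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)                   # rotation vector -> 3x3 matrix
print(ok, rvec.ravel(), tvec.ravel())
```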
CNN-based pose estimation

As with other areas of computer vision, deep neural networks have also been applied to
pose estimation, both for object pose estimation and for the location recognition task
discussed later in this section. There is also a very active
community around estimating pose from RGB-D images, with the most recent papers
evaluated on the BOP (Benchmark for 6D Object Pose Estimation) datasets.

Iterative non-linear algorithms

The most accurate and flexible way to estimate pose is to directly minimize the squared (or
robust) reprojection error for the 2D points as a function of the unknown pose parameters in
(R, t) and optionally K using non-linear least squares. We can write the projection equations as

x̃_i = f(p_i; R, t, K)

and iteratively minimize the robustified linearized reprojection errors

E_NLP = Σ_i ρ(||r_i + Δr_i||),

where r_i = x̂_i − x̃_i is the current residual vector (2D error in predicted position) and the update
Δr_i is expressed in terms of the partial derivatives with respect to the unknown pose parameters
(rotation, translation, and optionally calibration). The robust loss function ρ is used to reduce the
influence of outlier correspondences. The resulting projection equations can be written as a chain
of simpler transformations,

y_ij = R(q_j) (p_i − c_j)   (rigid-body transform),
x̄_ij = y_ij / z_ij         (perspective division),
x̃_ij = K(x̄_ij; k)          (conversion to sensor coordinates).
Note that in these equations, we have indexed the camera centers cj and camera rotation
quaternions qj by an index j, in case more than one pose of the calibration object is being
used.
The advantage of this chained set of transformations is that each one has a simple partial
derivative with respect both to its parameters and to its input. Thus, once the predicted
value of ~xi has been computed based on the 3D point location pi and the current values of
the pose parameters (c_j, q_j, k), we can obtain all of the required partial derivatives using
the chain rule,

∂x̃_i/∂p^(k) = (∂x̃_i/∂x̄_i) (∂x̄_i/∂y_i) (∂y_i/∂p^(k)),

where p^(k) indicates one of the parameter vectors that is being optimized. (This same “trick”
is used in neural networks as part of backpropagation.) The one special case in this
formulation that can be considerably simplified is the computation of the rotation update.
Instead of directly computing the derivatives of the 3 × 3 rotation matrix R(q) as a function
of the unit quaternion entries, you can prepend an incremental rotation matrix ΔR(ω)
to the current rotation matrix and compute the partial derivative of the transform
with respect to these incremental parameters, which results in a simple cross product of the
backward-chaining partial derivative and the outgoing 3D vector.
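A minimal sketch of such an iterative refinement, using scipy.optimize.least_squares with a Huber loss as the robust penalty; the packing of the pose into a 6-vector (rotation vector plus translation) and the absence of lens distortion are simplifying assumptions made here for illustration.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, object_pts, image_pts, K):
    """Residuals r_i = x_hat_i - x_tilde_i for a pose packed as
    params = [rvec (3), tvec (3)]; distortion is ignored in this sketch."""
    rvec, tvec = params[:3], params[3:6]
    proj, _ = cv2.projectPoints(object_pts, rvec, tvec, K, None)
    return (image_pts - proj.reshape(-1, 2)).ravel()

def refine_pose(rvec0, tvec0, object_pts, image_pts, K):
    # The Huber loss downweights outlier correspondences, as in the
    # robustified non-linear least squares formulation described above.
    x0 = np.hstack([np.ravel(rvec0), np.ravel(tvec0)])
    res = least_squares(reprojection_residuals, x0, loss='huber', f_scale=2.0,
                        args=(object_pts, image_pts, K))
    return res.x[:3], res.x[3:6]
```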
The inference of human pose (head, body, and limb locations and attitude) from a single
image can be viewed as yet another kind of segmentation task. We have already
discussed some pose estimation techniques in the section on pedestrian detection. Starting with
early seminal work, 2D and 3D pose detection and estimation rapidly developed into an active
research area, with important advances in both algorithms and datasets.
Application: Location recognition

One of the most exciting applications of pose estimation is in the area of location
recognition, which can be used both in desktop applications (“Where did I take this holiday
snap?”) and in mobile smartphone applications. The latter case includes not only finding out
your current location based on a cell-phone image, but also providing you with navigation
directions or annotating your images with useful information, such as building names and
restaurant reviews (i.e., a pocketable form of augmented reality). This problem is also often
called visual (or image-based) localization.

Some approaches to location recognition assume that the photos consist of architectural
scenes for which vanishing directions can be used to pre-rectify the images for easier
matching.

The main difficulty in location recognition is in dealing with the extremely large community
(user-generated) photo collections on websites such as Flickr.

In the latter case, the overlap between adjacent database images can be used to verify and
prune potential matches using “temporal” filtering, i.e., requiring the query image to match
nearby overlapping database images before accepting the match. Similar ideas have been
used to improve location recognition from panoramic video sequences. Recognizing indoor
locations inside buildings and shopping malls poses its own set of challenges, including
textureless areas and repeated elements
Some of the most recent approaches to localization use deep networks to generate feature
descriptors, perform large-scale instance retrieval, map images to 3D scene coordinates, or
perform end-to-end scene coordinate regression, absolute pose regression (APR), or relative
pose regression (RPR). Recent evaluations of these techniques have shown that classical
approaches based on feature matching followed by geometric pose optimization
typically outperform pose regression approaches in terms of accuracy and generalization.
The Long-Term Visual Localization benchmark has a leaderboard listing the best-
performing localization systems. Another variant on location recognition is the automatic
discovery of landmarks, i.e., frequently photographed objects and locations.

The concept of organizing the world’s photo collections by location has even been recently
extended to organizing all of the universe’s (astronomical) photos in an application called
astrometry. The technique used to match any two star fields is to take quadruplets of nearby
stars (a pair of stars and another pair inside their diameter) to form a 30-bit geometric hash
by encoding the relative positions of the second pair of points using the inscribed square
as the reference frame. Traditional information retrieval techniques (k-d trees built for
different parts of a sky atlas) are then used to find matching quads as potential star field
location hypotheses, which can then be verified using a similarity transform.
Triangulation

The problem of determining a point’s 3D position from a set of corresponding image
locations and known camera positions is known as triangulation. This problem is the converse of the
pose estimation problem.

One of the simplest ways to solve this problem is to find the 3D point p that lies closest to
all of the 3D rays corresponding to the 2D matching feature locations {x_j} observed by cameras
{P_j}, where c_j is the jth camera center. These rays originate at c_j in a direction

v̂_j = N(R_j^{-1} K_j^{-1} x_j),

where N(v) = v/||v|| normalizes a vector v to unit length.

The nearest point to p on this ray, which we denote as q_j, minimizes the distance

||c_j + d_j v̂_j − p||²,

which has a minimum at d_j = v̂_j · (p − c_j). Hence,

q_j = c_j + (v̂_j v̂_j^T)(p − c_j),

and the squared distance between p and q_j is

r_j² = ||(I − v̂_j v̂_j^T)(p − c_j)||².

The optimal value for p, which lies closest to all of the rays, can be computed as a regular
least squares problem by summing over all the r_j² and finding the optimal value of p,

p = [ Σ_j (I − v̂_j v̂_j^T) ]^{-1} [ Σ_j (I − v̂_j v̂_j^T) c_j ].

An alternative formulation, which is more statistically optimal and which can produce
significantly better estimates if some of the cameras are closer to the 3D point than others,
is to minimize the residual in the measurement equations

x_j = (p_j00 X + p_j01 Y + p_j02 Z + p_j03) / (p_j20 X + p_j21 Y + p_j22 Z + p_j23),
y_j = (p_j10 X + p_j11 Y + p_j12 Z + p_j13) / (p_j20 X + p_j21 Y + p_j22 Z + p_j23),

where (x_j, y_j) are the measured 2D feature locations and {p_j00, ..., p_j23} are the known entries
in camera matrix P_j. As before, this set of non-linear equations can be converted into a linear
least squares problem by multiplying both sides by the denominator, again resulting in the
direct linear transform (DLT) formulation. Note that if we use homogeneous coordinates p =
(X, Y, Z, W), the resulting set of equations is homogeneous and is best solved as a singular
value decomposition (SVD) or eigenvalue problem (looking for the smallest singular vector
or eigenvector). If we set W = 1, we can use regular linear least
squares, but the resulting system may be singular or poorly conditioned, i.e., if all of the
viewing rays are parallel, as occurs for points far away from the camera.

For this reason, it is generally preferable to parameterize 3D points using homogeneous
coordinates, especially if we know that there are likely to be points at greatly varying
distances from the cameras. Of course, minimizing the set of observations using non-linear
least squares is preferable to using linear least squares, regardless of the representation
chosen.
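A minimal sketch of the homogeneous DLT triangulation just described, assuming known 3 × 4 camera matrices; for the two-view case, OpenCV's cv2.triangulatePoints offers an equivalent routine.

```python
import numpy as np

def triangulate_dlt(P_list, x_list):
    """Triangulate one 3D point from two or more views by the DLT.
    P_list: list of 3x4 camera matrices; x_list: list of (x, y) measurements.
    Returns the homogeneous point (X, Y, Z, W) as the smallest singular
    vector of the stacked linear constraints."""
    A = []
    for P, (x, y) in zip(P_list, x_list):
        # Each view contributes two linear equations in (X, Y, Z, W).
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    p = Vt[-1]
    return p / p[3]    # de-homogenize (ill-conditioned if W is close to 0)
```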

Two-frame structure from motion

So far in our study of 3D reconstruction, we have always assumed that either the 3D point
positions or the 3D camera poses are known in advance. In this section, we take our first
look at structure from motion, which is the simultaneous recovery of 3D structure and pose
from image correspondences. In particular, we examine techniques that operate on just two
frames with point correspondences. We divide this section into the study of classic “n- point”
algorithms, special (degenerate) cases, projective (uncalibrated) reconstruction, and self-
calibration for cameras whose intrinsic calibrations are unknown.

Eight, seven, and five-point algorithms

Consider a 3D point p being viewed from two cameras whose relative position
can be encoded by a rotation R and a translation t. As we do not know anything about the
camera positions, without loss of generality, we can set the first camera at the origin c_0 = 0
and at a canonical orientation R_0 = I. The 3D point p_0 = d_0 x̂_0, observed in the first image at
location x̂_0 and at a z distance of d_0, is mapped into the second image by the transformation

d_1 x̂_1 = R (d_0 x̂_0) + t,

where x̂_j = K_j^{-1} x_j are the (local) ray direction vectors. Taking the cross product of the two
(interchanged) sides with t in order to annihilate it on the right-hand side yields

d_1 [t]_× x̂_1 = d_0 [t]_× R x̂_0.

Taking the dot product of both sides with x̂_1 yields

d_0 x̂_1^T [t]_× R x̂_0 = d_1 x̂_1^T [t]_× x̂_1 = 0,

because the right-hand side is a triple product with two identical entries. (Another way to say
this is that the cross-product matrix [t]_× is skew symmetric and returns 0 when pre- and
post-multiplied by the same vector.) We therefore arrive at the basic epipolar constraint

x̂_1^T E x̂_0 = 0,

where E = [t]_× R is called the essential matrix.

An alternative way to derive the epipolar constraint is to notice that, for the cameras to be
oriented so that the rays intersect in 3D at point p, the vector connecting the two camera
centers and the rays corresponding to pixels x_0 and x_1 must be coplanar.
This requires that the triple product of these three vectors vanish, which again yields the
epipolar constraint x̂_1^T E x̂_0 = 0.

Eight-point algorithm. Given this fundamental relationship, how can we use it to
recover the camera motion encoded in the essential matrix E? If we have N corresponding
measurements {(x_i0, x_i1)}, we can form N homogeneous equations in the nine elements of
E = {e_00 . . . e_22},

x̂_i1^T E x̂_i0 = Z_i ⊗ E = z_i · f = 0,

where ⊗ indicates an element-wise multiplication and summation of matrix elements, and z_i
and f are the vectorized forms of the Z_i = x̂_i1 x̂_i0^T and E matrices. Given N ≥ 8 such
equations, we can compute an estimate (up to scale) for the entries in E using an SVD, i.e., by
taking the singular vector associated with the smallest singular value. In practice, the image
measurements are first rescaled (normalized) so that the equations are better conditioned; by
undoing these scale operations, the original essential matrix E can be recovered.
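In practice, the essential matrix and the relative pose can be recovered robustly from point matches with OpenCV, as in the following sketch (the input arrays and the RANSAC threshold are illustrative):

```python
import cv2
import numpy as np

def two_frame_motion(pts0, pts1, K):
    """Estimate the essential matrix E = [t]_x R robustly from (N, 2) arrays
    of matched feature locations in two frames, then decompose it into a
    rotation R and a translation t (recovered only up to scale)."""
    E, inliers = cv2.findEssentialMat(pts0, pts1, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)
    return R, t, inliers
```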

Self-calibration
The results of structure from motion computation are much more useful if a metric
reconstruction is obtained, i.e., one in which parallel lines are parallel, orthogonal walls are
at right angles, and the reconstructed model is a scaled version of reality. Over the years, a
large number of self-calibration (or auto-calibration) techniques have been developed for
converting a projective reconstruction into a metric one, which is equivalent to recovering
the unknown calibration matrices Kj associated with each image. In situations where
additional information is known about the scene, different methods may be employed. For
example, if there are parallel lines in the scene, three or more vanishing points, which are
the images of points at infinity, can be used to establish the homography for the plane at
infinity, from which focal lengths and rotations can be recovered. If two or more finite
orthogonal vanishing points have been observed, the single-image calibration method
based on vanishing points can be used instead

In the absence of such scene cues, the unknown focal lengths can instead be estimated directly
from the epipolar geometry (fundamental matrices) between pairs of images, whose entries
encode the unknown focal lengths.

Application: View morphing


An interesting application of basic two-frame structure from motion is view morphing (also
known as view interpolation,), which can be used to generate a smooth 3D animation from
one view of a 3D scene to another. To create such a transition, you must first smoothly
interpolate the camera matrices, i.e., the camera positions, orientations, and focal lengths.
While simple linear interpolation can be used (representing rotations as quaternions), a more
pleasing effect is obtained by easing in and easing out the camera parameters, e.g., using a
raised cosine, as well as moving the camera along a more circular trajectory. To generate
in-between frames, either a full set of 3D correspondences needs to be established or 3D
models (proxies) must be created for each reference view. One of the simplest widely used
approaches is to just triangulate the set of matched feature points in each image, e.g., using
Delaunay triangulation. As the 3D points are re-projected into their intermediate views, pixels
can be mapped from their original source images to their new views using affine or projective
mapping. The final image is then composited using a linear blend of the two reference images,
as with usual morphing.
Factorization

While two-frame techniques are useful for reconstructing sparse geometry from stereo
image pairs and for initializing larger-scale 3D reconstructions, most applications can benefit
from the much larger number of images that are usually available in photo collections and
videos of scenes.
In this section, we briefly review an older technique called factorization, which can provide
useful solutions for short video sequences, and then turn to the more commonly used
bundle adjustment approach, which uses non-linear least squares to obtain optimal
solutions under general camera configurations

When processing video sequences, we often get extended feature tracks from which it is
possible to recover the structure and motion using a process called factorization. Consider
the tracks generated by a rotating ping pong ball, which has been marked with dots to make
its shape and motion more discernible. We can readily see from the shape of the tracks
that the moving object must be a sphere, but how can we infer this mathematically? It turns
out that, under orthography or related models we discuss below, the shape and motion can
be recovered simultaneously using a singular value decomposition.
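A minimal sketch of the rank-3 factorization step under an orthographic camera assumption: the centered 2F × N measurement matrix of tracked point coordinates is split into motion and shape factors via the SVD. The metric upgrade that enforces orthonormal camera rows is omitted here, so the result is recovered only up to an affine ambiguity.

```python
import numpy as np

def factorize_orthographic(W):
    """Rank-3 factorization of a 2F x N measurement matrix W whose rows hold
    the x- and y-coordinates of N points tracked over F frames.
    Returns the motion matrix M (2F x 3) and shape matrix S (3 x N)."""
    W = W - W.mean(axis=1, keepdims=True)      # subtract per-row centroids
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3]      # keep the rank-3 subspace
    M = U3 * np.sqrt(s3)                       # affine motion (camera) matrix
    S = np.sqrt(s3)[:, None] * Vt3             # affine shape (3D point) matrix
    return M, S
```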

Once the rotation matrices and 3D point locations have been recovered, there still exists a
bas-relief ambiguity, i.e., we can never be sure if the object is rotating left to right or if its
depth reversed version is moving the other way. (This can be seen in the classic rotating
Necker Cube visual illusion.) Additional cues, such as the appearance and disappearance of
points, or perspective effects, both of which are discussed below, can be used to remove
this ambiguity.

For motion models other than pure orthography, e.g., for scaled orthography or
paraperspective, the approach above must be extended in the appropriate manner. Such
techniques are relatively straightforward to derive from first principles; more details can be
found in papers that extend the basic factorization approach to these more flexible models.
Additional extensions of the original factorization algorithm include multi-body rigid motion,
sequential updates to the factorization, the addition of lines and planes and re-scaling the
measurements to incorporate individual location uncertainties.
A disadvantage of factorization approaches is that they require a complete set of tracks,
i.e., each point must be visible in each frame, for the factorization approach to work. One way
to deal with this problem is to first apply factorization to smaller, denser subsets and then
use known camera (motion) or point (structure) estimates to hallucinate additional missing
values, which allows more features and cameras to be incorporated incrementally.
Perspective and projective factorization. Another disadvantage of regular factorization is that
it cannot deal with perspective cameras. One way to get around this problem is to perform
an initial affine (e.g., orthographic) reconstruction and to then correct for the perspective
effects in an iterative manner (Christy and Horaud 1996). This algorithm usually converges
in three to five iterations, with the majority of the time spent in the SVD computation.

Bundle adjustment

As we have mentioned several times before, the most accurate way to recover structure and
motion is to perform robust non-linear minimization of the measurement (re-projection)
errors, which is commonly known in the photogrammetry (and now computer vision)
communities as bundle adjustment.
The biggest difference between these formulas and full bundle adjustment is that our feature
location measurements xij now depend not only on the point (track) index i but also on the
camera pose index j,

and that the 3D point positions pi are also being simultaneously updated. In addition, it is
common to add a stage for radial distortion parameter estimation

if the cameras being used have not been pre-calibrated, as shown in Figure.

While most of the boxes (transforms) have previously been explained, the leftmost box has
not. This box performs a robust comparison of the predicted and measured 2D locations x̂ij
and x̃ij after re-scaling by the measurement noise covariance Σij, i.e., it computes a robustified,
covariance-weighted residual between the two locations.
The advantage of the chained representation introduced above is that it not only makes the
computations of the partial derivatives and Jacobians simpler but it can also be adapted to
any camera configuration. Consider for example a pair of cameras mounted on a robot that
is moving around in the world, as shown in Figure 11.15a. By replacing the rightmost two
transformations in Figure 11.14 with the transformations shown in Figure 11.15b, we can
simultaneously recover the position of the robot at each time and the calibration of each
camera with respect to the rig, in addition to the 3D structure of the world.

Exploiting sparsity

Large bundle adjustment problems, such as those involving reconstructing 3D scenes from
thousands of internet photographs can require solving non-linear least squares problems
with millions of measurements (feature matches) and tens of thousands of unknown
parameters (3D point positions and camera poses). Unless some care is taken, these kinds
of problem can become intractable, because the (direct) solution of dense least squares
problems is cubic in the number of unknowns.

Fortunately, structure from motion is a bipartite problem in structure and motion. Each
feature point xij in a given image depends on one 3D point position pi and one 3D camera
pose (Rj , cj ). This is illustrated in Figure 11.16a, where each circle (1–9) indicates a 3D
point, each square (A–D) indicates a camera, and lines (edges) indicate which points are
visible in which cameras (2D features). If the values for all the points are known or fixed, the
equations for all the cameras become independent, and vice versa.
If we order the structure variables before the motion variables in the Hessian matrix A (and
hence also the right-hand side vector b), we obtain a structure for the Hessian shown in
Figure 11.16c. When such a system is solved using sparse Cholesky factorization, the fill-in
occurs in the smaller motion Hessian; for very large problems, iterative (conjugate gradient)
techniques can be used instead to solve the bundle adjustment equations.

In more detail, the reduced motion Hessian is computed using the Schur complement,

A'_CC = A_CC − A_PC^T A_PP^{-1} A_PC,

where A_PP is the point (structure) Hessian (the top left block of Figure 11.16c), A_PC is the
point-camera Hessian (the top right block), and A_CC and A'_CC are the motion Hessians
before and after the point variable elimination (the bottom right block of Figure 11.16c).

Notice that A'_CC has a non-zero entry between two cameras if they see any 3D point in
common. This is indicated with dashed arcs in Figure 11.16a and light blue squares in
Figure 11.16c.

Whenever there are global parameters present in the reconstruction algorithm, such as
camera intrinsics that are common to all of the cameras, or camera rig calibration
parameters such as those shown in Figure 11.15, they should be ordered last (placed along
the right and bottom edges of A) to reduce fill-in.
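The bipartite sparsity pattern described above can be exploited directly in general-purpose solvers. The sketch below builds such a pattern for scipy.optimize.least_squares, assuming a simple parameterization of six pose parameters per camera and three coordinates per point; the residual function itself is not shown.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.optimize import least_squares

def ba_sparsity(n_cams, n_pts, cam_idx, pt_idx):
    """Jacobian sparsity pattern for bundle adjustment: each 2D observation
    depends only on one camera (6 pose parameters here) and one 3D point
    (3 parameters).  cam_idx / pt_idx give the camera and point index of
    each observation."""
    cam_idx, pt_idx = np.asarray(cam_idx), np.asarray(pt_idx)
    m = 2 * len(cam_idx)                       # two residuals per observation
    n = 6 * n_cams + 3 * n_pts
    A = lil_matrix((m, n), dtype=int)
    obs = np.arange(len(cam_idx))
    for k in range(6):                         # camera pose block
        A[2 * obs, 6 * cam_idx + k] = 1
        A[2 * obs + 1, 6 * cam_idx + k] = 1
    for k in range(3):                         # 3D point block
        A[2 * obs, 6 * n_cams + 3 * pt_idx + k] = 1
        A[2 * obs + 1, 6 * n_cams + 3 * pt_idx + k] = 1
    return A

# least_squares(residual_fn, x0, jac_sparsity=ba_sparsity(...),
#               method='trf', x_scale='jac', loss='huber')
# then only estimates the non-zero Jacobian entries, keeping large
# problems tractable.
```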
Application: Match move

One of the neatest applications of structure from motion is to estimate the 3D motion of a
video or film camera, along with the geometry of a 3D scene, in order to superimpose 3D
graphics or computer-generated images (CGI) on the scene. In the visual effects industry,
this is known as the match move problem (Roble 1999), as the motion of the synthetic 3D
camera used to render the graphics must be matched to that of the real-world camera. For
very small motions, or motions involving pure camera rotations, one or two tracked points
can suffice to compute the necessary visual motion.
For planar surfaces moving in 3D, four points are needed to compute the homography,
which can then be used to insert planar overlays, e.g., to replace the contents of advertising
billboards during sporting events. The general version of this problem requires the
estimation of the full 3D camera pose along with the focal length (zoom) of the lens and
potentially its radial distortion parameters

When the 3D structure of the scene is known ahead of time, pose estimation techniques
such as view correlation or through-the-lens camera control (Gleicher and Witkin 1992) can
be used. For more complex scenes, it is usually preferable to recover the 3D
structure simultaneously with the camera motion using structure-from-motion techniques.
The trick with using such techniques is that to prevent any visible jitter between the synthetic
graphics and the actual scene, features must be tracked to very high accuracy and ample
feature tracks must be available in the vicinity of the insertion location

Constrained structure and motion

The most general algorithms for structure from motion make no prior assumptions about the
objects or scenes that they are reconstructing. In many cases, however, the scene contains
higher-level geometric primitives, such as lines and planes. These can provide information
complementary to interest points and also serve as useful building blocks for 3D modeling
and visualization. Furthermore, these primitives are often arranged in particular
relationships, i.e., many lines and planes are either parallel or orthogonal to each other.

Sometimes, instead of exploiting regularity in the scene structure, it is possible to take
advantage of a constrained motion model.
For example, if the object of interest is rotating on a turntable, i.e., around a fixed but
unknown axis, specialized techniques can be used to recover this motion. In other
situations, the camera itself may be moving in a fixed arc around some center of rotation
(Shum and He 1999). Specialized capture setups, such as mobile stereo camera rigs or
moving vehicles equipped with multiple fixed cameras, can also take advantage of the
knowledge that individual cameras are (mostly) fixed with respect to the capture rig, as
shown in Figure.

Line-based techniques

It is well known that pairwise epipolar geometry cannot be recovered from line matches
alone, even if the cameras are calibrated. To see this, think of projecting the set of lines in
each image into a set of 3D planes in space. You can move the two cameras around into
any configuration you like and still obtain a valid reconstruction for 3D lines. When lines are
visible in three or more views, however, the trifocal tensor can be used to
transfer lines from one pair of images to another.

One widely used technique for matching 2D lines is based on the average of 15
× 15 pixel correlation scores evaluated at all pixels along their common line segment
intersection. In this approach, the epipolar geometry is assumed to be known, e.g.,
computed from point matches. For wide baselines, all possible homographies corresponding
to planes passing through the 3D line are used
to warp pixels and the maximum correlation score is used. For triplets of images, the trifocal
tensor is used to verify that the lines are in geometric correspondence before evaluating the
correlations between line segments. Figure 11.22 shows the results of using this system.

Instead of reconstructing 3D lines, an alternative approach uses RANSAC to group lines into likely coplanar subsets.
Four lines are chosen at random to compute a homography, which is then verified for these
and other plausible line segment matches by evaluating color histogram-based correlation
scores. The 2D intersection points of lines belonging to the same plane are then used as
virtual measurements to estimate the epipolar geometry, which is more accurate than using
the homographies directly.

Another approach is a 3D modeling system that constructs calibrated panoramas from multiple
images and then has the user draw vertical and horizontal lines in the image to demarcate
the boundaries of planar regions. The lines are used to establish an absolute rotation for
each panorama and are then used (along with the inferred vertices and planes) to build a
3D structure, which can be recovered up to scale from one or more images.

Plane-based techniques

In scenes that are rich in planar structures, e.g., in architecture, it is possible to directly
estimate homographies between different planes, using either feature-based or intensity-
based methods. In principle, this information can be used to simultaneously infer the camera
poses and the plane equations, i.e., to compute plane-based structure from motion.

A fundamental matrix can, in principle, be computed directly from two or more homographies
using algebraic manipulations and least squares. Unfortunately, this approach often
performs poorly, because the algebraic errors do not correspond to meaningful reprojection
errors.

A better approach is to hallucinate virtual point correspondences within the areas from
which each homography was computed and to feed them into a standard structure from
motion algorithm. An even better approach is to use full bundle adjustment with explicit
plane equations, as well as additional constraints to force reconstructed co-planar features
to lie exactly on their corresponding planes. (A principled way to do this is to establish a
coordinate frame for each plane, e.g., at one of the feature points, and to use 2D in-plane
parameterizations for the other points.)
Simultaneous localization and mapping (SLAM)

While the computer vision community has been studying structure from motion, i.e., the
reconstruction of sparse 3D models from multiple images and videos, since the early
1980s, the mobile robotics community has in parallel been studying the automatic
construction of 3D maps from moving robots. In robotics, the problem was formulated as
the simultaneous estimation of 3D robot and landmark poses (Figure 11.23), and was
known as probabilistic mapping and simultaneous localization and mapping (SLAM). In the
computer vision community, the problem was originally called visual odometry, although that
term is now usually reserved for shorter-range motion estimation that does not involve
building a global map with loop closing.

Early versions of such algorithms used range-sensing techniques, such as ultrasound, laser
range finders, or stereo matching, to estimate local 3D geometry, which could then be fused
into a 3D model.

SLAM differs from bundle adjustment in two fundamental aspects. First, it allows for a
variety of sensing devices, instead of just being restricted to tracked or matched feature
points. Second, it solves the localization problem online, i.e., with no or very little lag in
providing the current sensor pose. This makes it the method of choice for time-critical
robotics applications such as autonomous navigation.

Some of the important milestones in SLAM include:


• the application of SLAM to monocular cameras;
• parallel tracking and mapping (PTAM), which split the front end (tracking) and back
end (mapping) processes (Figure 11.24) onto two separate threads running at different rates
(Figure 11.27), and which was later implemented entirely on a camera phone.
As you can tell from this very brief overview, SLAM is an incredibly rich and rapidly evolving
field of research, full of challenging robust optimization and real-time performance problems.
A good source
for finding a list of the most recent papers and algorithms is the KITTI Visual Odometry/SLAM
Evaluation.

Application: Autonomous navigation

Since the early days of artificial intelligence and
robotics, computer vision has been used to enable manipulation for dextrous robots and
navigation for autonomous robots (Janai, Güney et al. 2020; Kubota 2019). Some of the
earliest vision-based navigation systems include the Stanford Cart and CMU Rover, the
Terregator, and the CMU Navlab, which originally could only advance 4 m every 10 sec
(< 1 mph) and which was also the first system to use a neural network for driving. The early
algorithms and technologies advanced rapidly, with the VaMoRs system operating a 25 Hz
Kalman filter loop and driving on roads with good lane markings at 100 km/h. By the mid-2000s,
when DARPA introduced their Grand Challenge and Urban Challenge, vehicles equipped with
both range-finding lidars and stereo cameras were able to traverse rough outdoor terrain and
navigate city streets at regular human driving speeds. These systems led to the formation of
industrial research projects at companies such as Google and Tesla, as well as numerous
startups, many of which exhibit their vehicles at computer vision conferences.

Translational alignment

The simplest way to establish an alignment between two images or image patches is to shift
one image relative to the other. Given a template image I0(x) sampled at discrete pixel
locations {x_i = (x_i, y_i)}, we wish to find where it is located in image I1(x). A least squares solution to this
problem is to find the minimum of the sum of squared differences (SSD) function

E_SSD(u) = Σ_i [I1(x_i + u) − I0(x_i)]² = Σ_i e_i²,

where u = (u, v) is the displacement and e_i = I1(x_i + u) − I0(x_i) is called the residual error (or
the displaced frame difference in the video coding literature). (We ignore for the moment
the possibility that parts of I0 may lie outside the boundaries of I1 or be otherwise not
visible.) The assumption that corresponding pixel values remain the same in the two images
is often called the brightness constancy constraint.

In general, the displacement u can be fractional, so a suitable interpolation function must be
applied to image I1(x). In practice, a bilinear interpolant is often used, but bicubic
interpolation can yield slightly better results.

Color images can be processed by summing differences across all three color channels,
although it is also possible to first transform the images into a different color space or to only
use the luminance (which is often done in video encoders).

Robust error metrics. We can make the above error metric more robust to outliers by replacing
the squared error terms with a robust function ρ(e_i),

E_SRD(u) = Σ_i ρ(I1(x_i + u) − I0(x_i)) = Σ_i ρ(e_i).

The robust norm ρ(e) is a function that grows less quickly than the quadratic penalty
associated with least squares. One such function, sometimes used in motion estimation for
video coding because of its speed, is the sum of absolute differences (SAD) metric or L1 norm,

E_SAD(u) = Σ_i |I1(x_i + u) − I0(x_i)| = Σ_i |e_i|.

However, because this function is not differentiable at the origin, it is not well
suited to gradient-descent approaches such as the ones presented later. Instead, a smoothly
varying function that is quadratic for small values but grows more slowly away from the
origin is often used, such as the Geman–McClure function

ρ_GM(x) = x² / (1 + x²/a²),

where a is a constant that can be thought of as an outlier threshold. An appropriate value for
the threshold can itself be derived using robust statistics, e.g., by computing the median
absolute deviation, MAD = med_i |e_i|, and multiplying it by a constant factor (≈1.4) to obtain a
robust estimate of the standard deviation of the inlier noise process. More recently, a generalized
robust loss function has been proposed that can model various outlier distributions and
thresholds and that comes with a Bayesian method for estimating the loss function parameters.

Spatially varying weights. The error metrics above ignore the fact that for a given alignment,
some of the pixels being compared may lie outside the original image boundaries.
Furthermore, we may want to partially or completely downweight the contributions of certain
pixels. For example, we may want to selectively “erase” some parts of an image from
consideration when stitching a mosaic where unwanted foreground objects have been cut
out. For applications such as background stabilization, we may want to downweight the
middle part of the image, which often contains independently moving objects being tracked
by the camera.

All of these tasks can be accomplished by associating a spatially varying per-pixel weight
with each of the two images being matched. The error metric then becomes the weighted
(or windowed) SSD function

E_WSSD(u) = Σ_i w0(x_i) w1(x_i + u) [I1(x_i + u) − I0(x_i)]².
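A brute-force version of this (optionally weighted) SSD search over integer displacements might look as follows; the function name, the search range, and the normalization by overlap area are illustrative choices.

```python
import numpy as np

def ssd_search(I0, I1, search=16, weights=None):
    """Find the integer shift u = (u, v) that minimizes the (optionally
    weighted) SSD between template I0 and image I1 over a +/- `search`
    pixel range.  I0 and I1 are float grayscale arrays of the same size;
    pixels falling outside the image boundaries are simply ignored."""
    h, w = I0.shape
    best, best_uv = np.inf, (0, 0)
    for v in range(-search, search + 1):
        for u in range(-search, search + 1):
            # Region of I0 whose shifted counterpart stays inside I1.
            y0, y1 = max(0, -v), min(h, h - v)
            x0, x1 = max(0, -u), min(w, w - u)
            e = I1[y0 + v:y1 + v, x0 + u:x1 + u] - I0[y0:y1, x0:x1]
            if weights is not None:
                e = weights[y0:y1, x0:x1] * e
            score = np.mean(e ** 2)          # normalize by overlap area
            if score < best:
                best, best_uv = score, (u, v)
    return best_uv
```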
Hierarchical motion estimation

The simplest solution is to do a full search over some range of shifts, using either integer or
sub-pixel steps. This is often the approach used for block matching in motion compensated
video compression, where a range of possible motions (say, ±16 pixels) is explored.

To accelerate this search process, hierarchical motion estimation is often used: an image
pyramid is constructed and a search over a smaller number of discrete pixels
(corresponding to the same range of motion) is first performed at coarser levels.

The motion estimate from one level of the pyramid is then used to initialize a smaller local
search at the next finer level. Alternatively, several seeds (good solutions) from the coarse
level can be used to initialize the fine-level search. While this is not guaranteed to produce
the same result as a full search, it usually works almost as well and is much faster.
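A coarse-to-fine sketch built on the SSD search above, using OpenCV's pyrDown to construct a Gaussian image pyramid; the boundary wrap-around introduced by the pre-shift is ignored for brevity.

```python
import cv2
import numpy as np

def hierarchical_align(I0, I1, levels=4, search=4):
    """Coarse-to-fine translational alignment: estimate the shift at the
    coarsest pyramid level with a small full search, then double and refine
    the estimate at each finer level (uses ssd_search defined above)."""
    pyr0, pyr1 = [I0], [I1]
    for _ in range(levels - 1):
        pyr0.append(cv2.pyrDown(pyr0[-1]))   # Gaussian image pyramid
        pyr1.append(cv2.pyrDown(pyr1[-1]))
    u = np.zeros(2)
    for l in range(levels - 1, -1, -1):      # coarsest level first
        u *= 2                               # upsample the motion estimate
        # Pre-shift I1 by the current estimate, then search for a small update.
        shifted = np.roll(pyr1[l], (-int(u[1]), -int(u[0])), axis=(0, 1))
        du = ssd_search(pyr0[l], shifted, search=search)
        u += np.array(du, dtype=float)
    return u
```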
Fourier-based alignment

When the search range corresponds to a significant fraction of the larger image (as is the
case in image stitching), the hierarchical approach may not work that well, as it is often not
possible to coarsen the representation too much before significant features are blurred away.
In this case, a Fourier-based approach may be preferable.

Windowed correlation.

Unfortunately, the Fourier convolution theorem only applies when the summation over x_i is
performed over all the pixels in both images, using a circular shift of the image when accessing
pixels outside the original boundaries. While this is acceptable for small shifts and comparably
sized images, it makes no sense when the images overlap by a small amount or one image is
a small subset of the other.

Phase correlation.

A variant of regular correlation that is sometimes used for motion estimation is phase
correlation. Here, the spectrum of the two signals being matched is whitened by dividing each
per-frequency product by the magnitudes of the two Fourier transforms before the final inverse
Fourier transform is taken.
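A minimal NumPy sketch of phase correlation as just described (no windowing or sub-pixel peak refinement); OpenCV also provides cv2.phaseCorrelate, which additionally supports Hanning windowing.

```python
import numpy as np

def phase_correlation(I0, I1, eps=1e-8):
    """Estimate the translation between two equally sized grayscale images:
    whiten the cross-power spectrum by dividing out the Fourier magnitudes,
    then locate the peak of its inverse transform.  Returns an integer
    (dy, dx) shift; the sign convention depends on which image is the
    reference."""
    F0, F1 = np.fft.fft2(I0), np.fft.fft2(I1)
    cross = F0 * np.conj(F1)
    R = cross / (np.abs(cross) + eps)          # whitened (phase-only) spectrum
    corr = np.real(np.fft.ifft2(R))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative shifts (circular wrap).
    if dy > I0.shape[0] // 2:
        dy -= I0.shape[0]
    if dx > I0.shape[1] // 2:
        dx -= I0.shape[1]
    return dy, dx
```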

Parametric motion

In this section, we consider motion estimation approaches that estimate the parameters of a
motion model. As previously discussed, these models can be applied to a coherently moving
region of support. An important special case is when a single region of support corresponding
to the whole image is selected.

In this case, referred to as global motion estimation, the dominant motion is estimated. This
dominant motion typically results from camera motion, such as dolly, track, boom, pan, tilt,
and roll, which are widely used cinematic techniques in filmmaking and video production.
Hereafter, we describe two classes of techniques for parametric motion estimation. We also
discuss difficulties arising due to outliers, and related robust estimators.

Indirect parametric motion estimation. A first class of approaches indirectly computes the
motion parameters from a dense motion field rather than from the image pixels. More
specifically, a dense motion field is first estimated, and then the parametric motion model is
fitted to the obtained motion vectors. A least mean square (LMS) technique is commonly used
for this model fitting; that is, the motion parameters are derived from closed-form least squares
expressions fitted to the estimated motion vectors, as sketched below.
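As sketched below, fitting a 6-parameter affine motion model to a dense flow field reduces to two independent linear least squares problems; the input format assumes an (H, W, 2) flow array such as that returned by cv2.calcOpticalFlowFarneback.

```python
import numpy as np

def fit_affine_to_flow(flow):
    """Indirect parametric (global) motion estimation: fit the affine model
    u(x, y) = (a0 + a1*x + a2*y,  a3 + a4*x + a5*y) to a dense flow field
    by linear least squares.  `flow` is an (H, W, 2) array of motion vectors."""
    H, W = flow.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    X = np.column_stack([np.ones(H * W), xs.ravel(), ys.ravel()])
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    # Two independent least squares fits, one per flow component.
    a_u, *_ = np.linalg.lstsq(X, u, rcond=None)
    a_v, *_ = np.linalg.lstsq(X, v, rcond=None)
    return np.hstack([a_u, a_v])               # [a0, a1, a2, a3, a4, a5]
```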
PART A
1. What is bundle adjustment?

2. Define reprojection error.

3. What is geometric intrinsic calibration?

4. Briefly explain triangulation in structure from motion.

5. What does the essential matrix encode?

6. What is factorization in SfM?

7. Define optical flow.

8. What is layered motion?

9. Describe translational alignment.

10. What is pose estimation?


PART B

1. Compare and contrast the factorization method (e.g., Tomasi–Kanade) and bundle adjustment for SfM
2. Describe the triangulation process using a linear SVD-based
method:
3. Explain the formulation and optimization in bundle adjustment:
4. Explain how geometric constraints (e.g., parallelism, coplanarity)
can be integrated into SfM:
5. Explain translational alignment methods for image registration or
motion estimation:
6. Compare parametric (e.g., affine, homography) motion models
with spline-based motion:
7. Explain the optical flow formulation and two classical computation
methods:
8. Describe the concept of layered motion models in image
sequences:
9. Explain the interplay between intrinsic calibration and pose
estimation:
10. Explain the full pipeline of two-frame SfM
PART C
1. What is the goal of feature-based alignment?
2. Define the task and its applications
3. Explain the pipeline: feature detection → description → matching
→ geometric model estimation
4. How does RANSAC improve robustness in feature alignment?
5. What are intrinsic camera parameters and how are they
estimated?
6. Explain triangulation: solving for 3D point from multiple views via
linear least squares or SVD, then refine with bundle adjustment
7. Describe how a checkerboard pattern is used: detect corners, use
DLT, then refine via non-linear minimization
8. What are intrinsic camera parameters and how are they
estimated?
9. Compare structure-from-motion vs. optical flow: what each uses
and reconstructs.
10. What challenges arise in pose estimation and motion analysis
(e.g. noise, occlusions)?
11. Why use spline models for camera trajectories/video sequences?
Discuss temporal smoothness.
12. Parametric motion estimation: how to estimate motion parameters
(translation, rotation) via reprojection error minimization.