CVML Mulakat Notlari (CVML Interview Notes)

SIFT / SURF vs HOG:

Reference: https://medium.com/@deepanshut041/introduction-to-sift-scale-invariant-feature-transform-65d7f3a72d40
The scale space of an image is a function L(x, y, σ) produced by convolving the input image with a Gaussian kernel (blurring) at different scales. Scale space is separated into octaves, and the number of octaves and scales depends on the size of the original image. So we generate several octaves of the original image; each octave's image size is half of the previous one.

Within an octave, images are progressively blurred using the Gaussian blur operator. Mathematically, "blurring" refers to the convolution of the Gaussian operator with the image. The Gaussian blur has a particular expression, or "operator", that is applied to each pixel; the result is the blurred image.

Now we use those blurred images to generate another set of images, the Difference of Gaussians (DoG). These DoG images are great for finding interesting keypoints in the image. The difference of Gaussians is obtained as the difference of the Gaussian blurring of an image at two different scales, say σ and kσ. This process is repeated for the different octaves of the image in the Gaussian pyramid.

Up to now, we have generated a scale space and used it to calculate the Difference of Gaussians, which serves as a scale-invariant approximation of the Laplacian of Gaussian. Each pixel is then compared with its 8 neighbours in the same scale as well as 9 pixels in the next scale and 9 pixels in the previous scale, for a total of 26 comparisons. If it is a local extremum, it is a potential keypoint; this basically means that the keypoint is best represented at that scale.
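A minimal sketch of building one octave of blurred images and their DoG with OpenCV; the values of sigma, k and the number of scales are illustrative choices rather than SIFT's exact parameters, and "image.jpg" is a placeholder:

import cv2
import numpy as np

def dog_octave(gray, sigma=1.6, k=2 ** 0.5, num_scales=5):
    # Blur the image at progressively larger sigmas within one octave.
    blurred = [cv2.GaussianBlur(gray, (0, 0), sigma * (k ** i))
               for i in range(num_scales)]
    # Difference of Gaussians: subtract adjacent blur levels.
    return [cv2.subtract(blurred[i + 1], blurred[i])
            for i in range(num_scales - 1)]

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
dogs = dog_octave(img)
# The next octave would repeat this on the image downsampled by a factor of 2.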

The SIFT descriptor takes a 16x16 neighbourhood around the keypoint and divides it into sixteen 4x4 windows. Over each of these 4x4 windows it computes an 8-bin histogram of oriented gradients, which gives the 128-dimensional (16 × 8) descriptor. While computing each histogram, it also interpolates between neighbouring angle bins. A Gaussian of half the window size, centered at the center of the 16x16 block, is used to weight the values across the whole 16x16 descriptor.

HOG, on the other hand, only computes a simple histogram of oriented gradients, as the name says.

1) In SIFT, Gaussian smoothing is applied in order to compute the DoG (difference of Gaussians). Scale-space extrema detection then gives the feature points, and for each feature point a histogram of oriented gradients is computed over a 16x16 neighbourhood, which yields a 128-length descriptor. HOG, in contrast, computes edge gradients over the whole image and the orientation of each pixel, and builds a histogram from them.
2) HOG is used to extract global features, whereas SIFT extracts local features.
3) SIFT is scale and rotation invariant, whereas HOG is not.
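A minimal sketch contrasting the two, assuming OpenCV >= 4.4 (where SIFT lives in the main module) and a placeholder "image.jpg"; the 64x128 window is HOGDescriptor's default person-detection window, chosen here just for illustration:

import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT: local features -- one 128-D descriptor per detected keypoint.
sift = cv2.SIFT_create()
keypoints, sift_desc = sift.detectAndCompute(img, None)   # sift_desc has shape N x 128

# HOG: a single global descriptor for a fixed-size window.
hog = cv2.HOGDescriptor()
window = cv2.resize(img, (64, 128))
hog_desc = hog.compute(window)   # one long feature vector for the whole window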

LBP:
The LBP feature vector, in its simplest form, is created in the following manner:

● Divide the examined window into cells (e.g. 16x16 pixels for each cell).
● For each pixel in a cell, compare the pixel to each of its 8 neighbors (on its left-top,
left-middle, left-bottom, right-top, etc.). Follow the pixels along a circle, i.e. clockwise
or counter-clockwise.
● Where the center pixel's value is greater than the neighbor's value, write "0".
Otherwise, write "1". This gives an 8-digit binary number (which is usually converted
to decimal for convenience).
● Compute the histogram, over the cell, of the frequency of each "number" occurring
(i.e., each combination of which pixels are smaller and which are greater than the
center). This histogram can be seen as a 256-dimensional feature vector.
● Optionally normalize the histogram.
● Concatenate (normalized) histograms of all cells. This gives a feature vector for the
entire window.
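A small NumPy sketch of the basic LBP code for a single cell, using the comparison rule described above (the cell size and the fixed clockwise neighbour order are illustrative):

import numpy as np

def lbp_cell_histogram(cell):
    # cell: 2-D uint8 array; returns a 256-bin histogram of LBP codes.
    h, w = cell.shape
    codes = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = cell[y, x]
            # The 8 neighbours, visited clockwise starting at the top-left.
            neigh = [cell[y-1, x-1], cell[y-1, x], cell[y-1, x+1], cell[y, x+1],
                     cell[y+1, x+1], cell[y+1, x], cell[y+1, x-1], cell[y, x-1]]
            # Center greater than neighbour -> "0", otherwise "1".
            bits = ['0' if center > n else '1' for n in neigh]
            codes.append(int(''.join(bits), 2))
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist

# Concatenating the (normalized) histograms of all cells gives the window's feature vector.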

ORB (Oriented FAST and Rotated BRIEF):


ORB is basically a fusion of the FAST keypoint detector and the BRIEF descriptor, with many modifications to enhance performance. First it uses FAST to find keypoints, then applies the Harris corner measure to find the top N points among them. It also uses an image pyramid to produce multi-scale features.
Unlike BRIEF, ORB is comparatively scale and rotation invariant while still employing the
very efficient Hamming distance metric for matching. As such, it is preferred for real-time
applications.

FAST:
1. Select a pixel p in the image which is to be identified as an interest point or not. Let its intensity be I_p.
2. Select an appropriate threshold value t.
3. Consider a circle of 16 pixels around the pixel under test.
4. The pixel p is a corner if there exists a set of n contiguous pixels on the circle (of 16 pixels) which are all brighter than I_p + t, or all darker than I_p − t; n was chosen to be 12.
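A sketch of ORB in OpenCV, detecting keypoints in two placeholder images and matching the binary descriptors with the Hamming distance:

import cv2

img1 = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)          # FAST + Harris ranking + image pyramid internally
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors -> Hamming distance; crossCheck keeps only mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)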

Motion Estimation:
• Optical flow
– Recover image motion at each pixel from spatio-temporal image brightness variations
• Feature tracking
– Extract visual features (corners, textured areas) and “track” them over multiple frames

Bag of Words:

We detect features, extract descriptors from each image in the dataset, and build a visual dictionary. Detecting features and extracting descriptors can be done with feature extraction algorithms (for example, SIFT, KAZE, etc.). Next, we cluster the descriptors (using k-means, DBSCAN or another clustering algorithm). The centers of the clusters are used as the visual dictionary's vocabulary (the visual words). Finally, for each image, we make a frequency histogram counting how often each visual word occurs in that image. Those histograms are our bag of visual words (BOVW).

Use Nearest neighbour or SVM for classification.
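A sketch of this pipeline assuming OpenCV SIFT and scikit-learn's KMeans; the vocabulary size of 100 and the training-image list are illustrative:

import cv2
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(images, n_words=100):
    sift = cv2.SIFT_create()
    per_image_desc = [sift.detectAndCompute(img, None)[1] for img in images]
    # Cluster all descriptors; the cluster centers are the visual dictionary.
    kmeans = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(per_image_desc))
    hists = []
    for desc in per_image_desc:
        words = kmeans.predict(desc)                     # assign each descriptor to a visual word
        hist, _ = np.histogram(words, bins=n_words, range=(0, n_words))
        hists.append(hist)
    return np.array(hists), kmeans

# The returned histograms can then be fed to a nearest-neighbour classifier or an SVM.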

Histogram comparison: Earth Mover's Distance, correlation, chi-square, Bhattacharyya
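A minimal sketch of three of these metrics using OpenCV's compareHist on two toy float32 histograms:

import cv2
import numpy as np

h1 = np.array([0.2, 0.5, 0.3], dtype=np.float32)
h2 = np.array([0.3, 0.4, 0.3], dtype=np.float32)

corr = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)          # higher means more similar
chisq = cv2.compareHist(h1, h2, cv2.HISTCMP_CHISQR)         # lower means more similar
bhatt = cv2.compareHist(h1, h2, cv2.HISTCMP_BHATTACHARYYA)  # lower means more similar
# Earth Mover's Distance takes "signatures" (weights plus coordinates) instead; see cv2.EMD.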

Harris Corner and Shi-Tomasi Corner:

The Harris corner detector basically looks at the difference in intensity for a displacement of (u, v) in all directions. With w(x, y) as the window function, this is expressed as:

E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²

We are looking for windows that produce a large E value. To achieve that, the term inside the square brackets has to take high values.

Applying a Taylor series expansion and keeping the first-order derivatives, for small shifts [u, v] we get a bilinear approximation:

E(u, v) ≈ [u v] M [u; v],   where   M = Σ_{x,y} w(x, y) [ I_x², I_x·I_y ; I_x·I_y, I_y² ]

It was found that the eigenvalues of this matrix M determine how suitable a window is. A score, R, is calculated for each window:

R = det(M) − k · (trace(M))²

The Shi-Tomasi corner detector is based entirely on the Harris corner detector. However, one slight variation in the selection criterion made this detector much better than the original; it works quite well in cases where even the Harris corner detector fails. Here is the minor change that Shi and Tomasi made to the original Harris corner detector:

The Harris corner detector has a corner selection criterion: a score is calculated for each pixel, and if the score is above a certain value, the pixel is marked as a corner. The score is computed from the two eigenvalues; that is, the two eigenvalues are passed to a function, which manipulates them and gives back a score. Shi and Tomasi suggested that this function should be done away with, and only the eigenvalues themselves should be used to check whether the pixel is a corner or not.
The scoring function in the Harris corner detector was given by:

R = λ1 · λ2 − k · (λ1 + λ2)²

Instead of this, Shi-Tomasi proposed:

R = min(λ1, λ2)

If R is above a threshold, the window is marked as a corner.
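A short sketch of both detectors in OpenCV (cornerHarris uses the det/trace score, goodFeaturesToTrack thresholds on the minimum eigenvalue); the parameter values and "image.jpg" are illustrative:

import cv2
import numpy as np

gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Harris: R = det(M) - k * trace(M)^2, computed per pixel.
harris_response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
harris_corners = harris_response > 0.01 * harris_response.max()   # boolean corner mask

# Shi-Tomasi: keep points whose min(lambda1, lambda2) is large enough.
shi_tomasi_pts = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                         qualityLevel=0.01, minDistance=10)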

KLT (Kanade–Lucas–Tomasi) Tracker:
• Find a good point to track (Harris corner)
• Use the intensity second-moment matrix and the difference across frames to find the displacement
• Iterate and use coarse-to-fine search to deal with larger movements
• When creating long tracks, check the appearance of the registered patch against the appearance of the initial patch to find points that have drifted
---
1. Detect Harris corners in the first frame of the video.
2. For each detected Harris corner, compute the motion between consecutive frames using optical flow (translation) and a local affine transformation.
3. Now link these motion vectors from frame-to-frame to track the corners.
4. Generate new Harris corners after a specific number of frames (say, 10 to 20) to
compensate for new points entering the scene or to discard the ones going out of the
scene.
5. Track the new and old Harris points.
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with multiple parameters
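A minimal sketch of this pipeline with OpenCV: Shi-Tomasi corners tracked by pyramidal (coarse-to-fine) Lucas-Kanade, re-detecting points every 15 frames; the file name and parameter values are illustrative:

import cv2

cap = cv2.VideoCapture("video.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=10)

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: track each point from the previous frame to this one.
    new_points, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    points = new_points[status.flatten() == 1].reshape(-1, 1, 2)   # keep successfully tracked points
    prev_gray = gray
    frame_idx += 1
    if frame_idx % 15 == 0:   # periodically re-detect to pick up new corners entering the scene
        points = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                         qualityLevel=0.01, minDistance=10)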

Correlation filter based tracking and KCF:


The basic idea of correlation filter tracking is to estimate an optimal image filter such that filtering the input image produces a desired response. The desired response is typically of Gaussian shape centered at the target location, so the score decreases with the distance from the target.

The filter is trained from translated (shifted) instances of the target patch. At test time, the response of the filter is evaluated and its maximum gives the new position of the target. The filter is trained online and updated with every frame so that the tracker adapts to moderate target changes.

A major advantage of the correlation filter tracker is its computational efficiency: the computation can be performed efficiently in the Fourier domain, so the tracker runs faster than real time, at several hundred FPS.

----

Filter based trackers model the appearance of objects using filters trained on example
images. The target is initially selected based on a small tracking window centered on the
object in the first frame. From this point on, tracking and filter training work together. The
target is tracked by correlating the filter over a search window in the next frame; the location
corresponding to the maximum value in the correlation output indicates the new position of
the target. An online update is then performed based on that new location.

To create a fast tracker, correlation is computed in the Fourier domain using the Fast Fourier Transform (FFT) [15]. First, the 2D Fourier transforms of the input image, F = F(f), and of the filter, H = F(h), are computed. The Convolution Theorem states that correlation becomes an element-wise multiplication in the Fourier domain. Using the ⊙ symbol to denote element-wise multiplication and ∗ to indicate the complex conjugate, correlation takes the form:
G = F ⊙ H∗ (1)
The correlation output is transformed back into the spatial domain using the inverse FFT.
The bottleneck in this process is computing the forward and inverse FFTs so that the entire
process has an upper bound time of O(P log P) where P is the number of pixels in the
tracking window.
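A NumPy sketch of equation (1): correlate a window with a filter via the FFT and read the target's new position off the peak of the response map (the array contents here are placeholders):

import numpy as np

def correlate_fft(window, filt):
    F = np.fft.fft2(window)
    H = np.fft.fft2(filt, s=window.shape)   # zero-pad the filter to the window size
    G = F * np.conj(H)                      # element-wise product with the conjugate, G = F ⊙ H*
    return np.real(np.fft.ifft2(G))         # back to the spatial domain

window = np.random.rand(64, 64)
filt = np.random.rand(64, 64)
response = correlate_fft(window, filt)
dy, dx = np.unravel_index(np.argmax(response), response.shape)   # peak = new target position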

BACF (background aware correlation filter tracking):


Learning CF trackers in the frequency domain, however, comes at the high cost of learning from circularly shifted examples of the foreground target. These shifted patches are implicitly generated through the circulant property of correlation in the frequency domain and are used as negative examples for training the filter [20]. All shifted patches are plagued by circular boundary effects and are not truly representative of negative patches in real-world scenes.
These boundary effects have been shown to have a drastic impact on tracking performance,
due to a number of factors. First, learning from limited shifted patches may lead to training
an over-fitted filter which is not well generalized to rapid visual deformation e.g. caused by
fast motion [10]. Second, the lack of real negative training examples can drastically degrade
the robustness of such trackers against cluttered background, and as a result, increase the
risk of tracking drift specifically when the target and background display similar visual cues.
Third, discarding background information from the learning process may reduce the tracker’s
ability to distinguish the target from occluding patches. This limits the potential of such trackers to re-detect the target after an occlusion or out-of-plane movement.
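A tiny NumPy sketch of why circularly shifted patches are synthetic: np.roll wraps pixels around the border, just as the circulant structure implicitly does, so the "background" in a shifted patch is really wrapped-around foreground:

import numpy as np

patch = np.arange(16).reshape(4, 4)
shifted = np.roll(patch, shift=(1, 1), axis=(0, 1))
# The last row and column of `patch` reappear at the top/left of `shifted`:
# a circular boundary artifact, not genuine background content around the target.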

BACF is capable of learning/updating filters from real negative examples densely extracted
from the background. We demonstrate that learning trackers from negative background
patches, instead of shifted foreground patches, achieves superior accuracy with real-time
performance. This paper offers the following contributions:
• We propose a new correlation filter for real-time visual tracking. Unlike prior CF-based trackers, in which negative examples are limited to circularly shifted patches, our tracker is trained from real negative training examples extracted from the background.

CRF:
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting
structured data, such as sequences, trees and lattices. The underlying idea is that of
defining a conditional probability distribution over label sequences given a particular
observation sequence, rather than a joint distribution over both label and observation
sequences. The primary advantage of CRFs over hidden Markov models is their conditional
nature, resulting in the relaxation of the independence assumptions required by HMMs in
order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a
weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional
Markov models based on directed graphical models. CRFs outperform both MEMMs and
HMMs on a number of real-world tasks in many fields, including bioinformatics,
computational linguistics and speech recognition.
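For a linear-chain CRF, this conditional distribution has the standard form (the notation here is the usual textbook one, not taken from these notes' sources):

p(y | x) = (1 / Z(x)) · exp( Σ_t Σ_k λ_k · f_k(y_{t−1}, y_t, x, t) )

where the f_k are feature functions defined over adjacent labels and the observation sequence, the λ_k are their learned weights, and Z(x) normalizes over all possible label sequences.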

---
Hidden Markov Models are generative, and give output by modeling the joint probability
distribution. On the other hand, Conditional Random Fields are discriminative, and model the
conditional probability distribution. CRFs don’t rely on the independence assumption (that
the labels are independent of each other), and avoid label bias. One way to look at it is that
Hidden Markov Models are a very specific case of Conditional Random Fields, with constant
transition probabilities used instead. HMMs relate to naive Bayes in the same way that CRFs relate to logistic regression: the HMM is the sequence version of the (generative) naive Bayes classifier, while the linear-chain CRF is the sequence version of (discriminative) logistic regression.

Hough Transform

The Hough transform is a technique which can be used to isolate features of a particular
shape within an image. Because it requires that the desired features be specified in some
parametric form, the classical Hough transform is most commonly used for the detection of
regular curves such as lines, circles, ellipses, etc. A generalized Hough transform can be
employed in applications where a simple analytic description of a feature(s) is not possible.

The Hough transform is a feature extraction technique used in image analysis, computer
vision, and digital image processing. The purpose of the technique is to find imperfect
instances of objects within a certain class of shapes by a voting procedure. This voting
procedure is carried out in a parameter space, from which object candidates are obtained as
local maxima in a so-called accumulator space that is explicitly constructed by the algorithm
for computing the Hough transform.
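A short sketch of the classical line Hough transform in OpenCV: edge pixels vote in the (rho, theta) parameter space and peaks in the accumulator are returned as lines (the thresholds and "image.jpg" are illustrative):

import cv2
import numpy as np

gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 50, 150)

# Each returned entry is the (rho, theta) of an accumulator peak with at least 200 votes.
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=200)

# Circles vote in a 3-parameter (x, y, r) space; cv2.HoughCircles works analogously.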
