SJ CV
Computer Vision is a multidisciplinary field of artificial intelligence (AI) and computer science that
enables machines to interpret, understand, and make decisions based on visual data, such as images
and videos. The goal of computer vision is to automate tasks that the human visual system can
perform, such as recognizing objects, understanding scenes, and detecting patterns.
Computer vision systems often use a combination of image processing, machine learning, and deep
learning techniques to analyze and interpret visual information.
Computer vision has a wide range of applications across multiple domains, including:
1. Object Detection
2. Facial Recognition
3. Medical Imaging
4. Autonomous Vehicles
5. Image Segmentation
6. Gesture Recognition
7. Video Surveillance
1. Object Detection
Object detection refers to the ability of a computer vision system to identify and locate objects
within an image or video. The system not only detects the presence of objects but also marks their
locations with bounding boxes, making it easier to track and recognize them in real time.
How it Works:
• Computer vision models are trained using large datasets containing labeled images, where
each object of interest is annotated with its category and location.
• Deep learning techniques, particularly Convolutional Neural Networks (CNNs), are often
used for object detection. CNNs can learn complex features of images such as edges,
textures, and shapes, which helps the system to identify objects.
• Popular object detection models include YOLO (You Only Look Once), Faster R-CNN, and SSD
(Single Shot Multibox Detector).
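To make this concrete, here is a minimal sketch of running a pretrained detector; it uses torchvision's Faster R-CNN, and the image path "street.jpg", the 0.8 score threshold, and the weights argument (which varies slightly across torchvision versions) are assumptions made for illustration:

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Minimal sketch: pretrained Faster R-CNN from torchvision.
# "street.jpg" is a placeholder path; in older torchvision versions the
# weights argument is pretrained=True instead of weights="DEFAULT".
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("street.jpg").convert("RGB")
with torch.no_grad():
    predictions = model([to_tensor(image)])[0]

# Keep detections above an (assumed) confidence threshold; each detection
# consists of a bounding box (x1, y1, x2, y2), a class label index, and a score.
for box, label, score in zip(predictions["boxes"],
                             predictions["labels"],
                             predictions["scores"]):
    if score > 0.8:
        print(label.item(), round(score.item(), 2), box.tolist())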
Applications:
• Retail and E-commerce: Object recognition can help in product tagging, inventory
management, and recommendation systems.
• Robotics: Autonomous robots use object detection to navigate and interact with their
environment.
Example:
In an autonomous car, object detection is used to identify pedestrians, vehicles, traffic signs, and
obstacles. This allows the vehicle to navigate safely by recognizing and avoiding potential hazards.
2. Facial Recognition
Facial recognition is a computer vision technology that identifies or verifies a person based on their
facial features. It is widely used in security, authentication, and surveillance applications.
How it Works:
• Facial recognition systems work by analyzing the unique features of the face such as the
distance between the eyes, nose shape, jawline, and other facial landmarks.
• The system typically involves face detection (locating the face in an image) followed by
feature extraction (identifying distinguishing facial characteristics). These features are then
compared to a database of known faces.
• Techniques like deep learning, particularly using CNNs, are frequently applied to achieve
accurate facial recognition, even under varying lighting conditions, poses, and expressions.
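As an illustration of the face-detection step only (locating the face before feature extraction and matching), here is a minimal OpenCV sketch; the image path "person.jpg" and the detector parameters are assumed example values:

import cv2

# Minimal sketch of the face-detection stage using OpenCV's bundled
# Haar cascade; "person.jpg" is a placeholder image path.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("person.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Returns a list of (x, y, width, height) rectangles, one per detected face.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", image)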
Applications:
• Security: Facial recognition is used for access control in secure facilities, airports, or personal
devices (e.g., unlocking phones).
• Retail and Marketing: Stores use facial recognition to analyze customer demographics and
provide personalized shopping experiences.
Example:
In smartphones, facial recognition allows users to unlock their device by simply looking at it. The
system compares the live image to a stored facial template and grants access if there’s a match.
Radial Distortion
Radial distortion is one of the key factors to account for in camera calibration, as it directly impacts
the accuracy of geometric transformations, 3D reconstruction, and other computer vision tasks. It
occurs due to imperfections in the camera lens, which cause light rays to diverge or converge in
a nonlinear manner. Understanding and correcting radial distortion is crucial for ensuring that the
camera model produces accurate, realistic images.
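The radial part of the commonly used Brown–Conrady distortion model (the same radial model used by OpenCV's calibration routines) expresses the distorted normalized coordinates as a polynomial in the distance from the optical center, with coefficients k1, k2, k3 estimated during calibration:

x_distorted = x (1 + k1 r^2 + k2 r^4 + k3 r^6)
y_distorted = y (1 + k1 r^2 + k2 r^4 + k3 r^6),   where r^2 = x^2 + y^2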
Camera calibration is the process of determining the intrinsic and extrinsic parameters of a camera to
correct for distortions and establish an accurate geometric model of the scene. If radial distortion is
not accounted for, calibration and the tasks that depend on it suffer in several ways:
Incorrect Mapping of Points: Without correction for radial distortion, points in an image won't map
to their true positions in the 3D world. Straight lines (like those in calibration patterns) may appear
curved, leading to incorrect measurement of the camera’s internal parameters (focal length, principal
point, and lens distortion). This results in inaccurate modeling of the scene, affecting tasks like object
tracking, 3D reconstruction, and camera pose estimation.
Reduced Accuracy of Depth Estimation: Calibration is crucial for determining how far objects are
from the camera (depth). Radial distortion can skew the depth perception of objects, particularly at
the edges of the image.
Errors in Geometry-Based Algorithms: Algorithms that rely on accurate camera calibration, such as
those based on epipolar geometry and homography estimation, can also suffer due to radial distortion.
Inaccurate Camera Matrix: The camera matrix contains intrinsic parameters (focal length, principal
point, and skew), and the distortion coefficients are estimated alongside it; together they map 3D
world coordinates to 2D image coordinates. Radial distortion can distort the camera's apparent focal
length, making it difficult to compute accurate mappings.
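The following is a minimal calibration-and-undistortion sketch with OpenCV; the 9x6 chessboard pattern, the image filenames, and the use of raw corner locations (without sub-pixel refinement) are simplifying assumptions:

import glob
import cv2
import numpy as np

# Minimal calibration sketch, assuming a set of chessboard images with a
# 9x6 pattern of inner corners (pattern size and file names are placeholders).
pattern_size = (9, 6)
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

object_points, image_points = [], []
for path in glob.glob("calib_*.jpg"):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        object_points.append(objp)
        image_points.append(corners)

# K is the 3x3 camera matrix; dist holds the distortion coefficients
# (k1, k2, p1, p2, k3).
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, gray.shape[::-1], None, None)

# Undistort a new image using the estimated parameters.
undistorted = cv2.undistort(cv2.imread("scene.jpg"), K, dist)
cv2.imwrite("scene_undistorted.jpg", undistorted)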
Barrel Distortion:
• This type of distortion happens when the image magnification decreases with distance from
the optical center (the center of the lens).
• It typically occurs with wide-angle lenses, causing straight lines to curve outward (towards
the image boundary).
Pincushion Distortion:
• The opposite of barrel distortion, pincushion distortion occurs when the image magnification
increases with distance from the optical center.
• Straight lines near the edge of the image bend inward, making the image look like a
pincushion.
Camera calibration is the process of determining the intrinsic and extrinsic parameters of a camera
to correct distortions and establish an accurate mapping between the 3D world and the 2D image.
The goal of calibration is to enable accurate measurements from images, such as reconstructing 3D
scenes, measuring distances, or performing visual tasks like object tracking and motion analysis.
In a nutshell, camera calibration helps to:
• Determine the camera's internal properties (intrinsic parameters) such as the focal length, principal
point, and lens distortions.
• Determine the camera's position and orientation relative to the scene (extrinsic parameters).
The pinhole camera model provides the geometric foundation for this. Its key components are:
1. Pinhole (Optical Center): The small hole through which light enters the camera. This is the
point where all the light rays from a scene converge.
2. Image Plane: The 2D surface (sensor or film) where the image is formed.
3. Focal Length (f): The distance from the pinhole to the image plane. This determines how
"zoomed in" or "zoomed out" the image will be. A longer focal length results in a narrower
field of view, and a shorter focal length provides a wider field of view.
4. Principal Point (c_x, c_y): The point on the image plane where the optical axis intersects. In
most practical cameras, this is close to the center of the image.
5. Optical Axis: The line that passes through the center of the pinhole and is perpendicular to
the image plane. The optical axis is assumed to be aligned with the camera’s coordinate
system.
6. Field of View: The extent of the observable world seen through the camera at any given
moment. This depends on the focal length and the size of the image sensor.
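To make the projection concrete, here is a small worked sketch of the standard pinhole equations u = f*X/Z + c_x and v = f*Y/Z + c_y; the focal length, principal point, and 3D point are arbitrary assumed values:

# Worked pinhole-projection sketch; f, (c_x, c_y) and the 3D point are
# arbitrary assumed values, not taken from any specific camera.
f = 500.0                 # focal length in pixels
c_x, c_y = 320.0, 240.0   # principal point (center of a 640x480 sensor)

X, Y, Z = 2.0, 1.0, 10.0  # a 3D point in camera coordinates (metres)

# Perspective projection onto the image plane.
u = f * X / Z + c_x       # 500*2/10 + 320 = 420.0
v = f * Y / Z + c_y       # 500*1/10 + 240 = 290.0
print(u, v)               # (420.0, 290.0)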
Despite its simplicity, the pinhole camera model is foundational in numerous computer vision and
photogrammetry applications, such as camera calibration, 3D reconstruction, and pose estimation.
The intrinsic parameters of a camera define the internal characteristics that affect how the 3D world
is projected onto the 2D image plane. These parameters are crucial for understanding the geometry
of the camera and correcting distortions in the captured image.
Focal Length (f): The focal length is the distance between the camera’s lens and the image plane
(sensor). It determines the magnification or zoom level of the camera. In the camera matrix, it is
typically represented in terms of the horizontal and vertical focal lengths, f_x and f_y (expressed in
pixels).
Principal Point (c_x, c_y): The principal point is the point on the image plane where the optical axis
(the line passing through the center of the lens) intersects the image plane. This is typically close to
the center of the image, but in some cases it may not be exactly at the center.
Skew (s): The skew parameter describes the angle between the x and y axes of the image plane. It
accounts for non-perpendicular axes in the image. In most modern cameras, the skew is close to 0,
meaning that the x and y axes are perpendicular. However, in some specialized cameras (e.g., non-
rectangular sensors), the skew can be non-zero.
Aspect Ratio (p_x, p_y): The aspect ratio defines the scaling of the image in the x and y directions.
This is particularly relevant when the pixel sizes in the x and y directions are not equal. The values
p_x and p_y correspond to the pixel size in the horizontal and vertical directions.
Camera Matrix: The intrinsic parameters are represented together in the camera matrix K, a 3x3
matrix that projects 3D points (in camera coordinates) onto 2D image coordinates.
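In the standard convention, with the symbols defined above, the camera matrix can be written as:

K = [ f_x   s    c_x ]
    [  0    f_y  c_y ]
    [  0    0     1  ]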
Importance of Intrinsic Parameters:
• Image Rectification: Intrinsic parameters help correct for lens distortions and misalignments in the
image plane.
• 3D Reconstruction: By knowing the intrinsic parameters, you can better map 2D image points to
their corresponding 3D points in the world.
• Camera Calibration: Intrinsic parameters are vital for calibrating a camera, allowing for the
correction of geometric distortions and enabling accurate measurement from images.
Image Formation
The process of image formation describes how a camera captures a 3D scene and projects it onto a
2D image plane (sensor or film). This process is governed by the principles of optics and geometry,
where light from the scene passes through the camera lens and forms an image on the image plane.
Below is a step-by-step breakdown of the image formation process:
1. Light Rays from the Scene
• Every point in the 3D world emits or reflects light that travels in various directions. Some of
this light reaches the camera lens.
• For simplicity, we assume that the light is made up of rays that travel in straight lines.
2. The Lens and Aperture
• The camera has a lens that focuses light onto the image plane. The lens is typically a convex
lens, which bends light rays toward the optical center.
• The aperture controls how much light passes through the lens. It is a small opening that
allows light to enter and reach the image plane. The size of the aperture affects the exposure
and depth of field.
3. Projection onto the Image Plane
• The lens focuses light rays from different points in the scene onto corresponding points on
the image plane.
• In the idealized pinhole camera model, light rays pass through a small hole (the pinhole) and
project an inverted image of the scene onto the image plane.
• The image is formed by the perspective projection, where a 3D point in the world is mapped
to a 2D point on the image plane.
• The image is inverted because rays from the top of an object in the scene are projected to
the bottom of the image plane, and vice versa.
4. Focal Length and Object Distance
• The focal length of the lens plays a key role in determining the scale and field of view of the
image. It is the distance between the lens and the image plane, where the lens focuses light.
• A short focal length results in a wide field of view, whereas a long focal length provides a
zoomed-in, narrower field of view.
• The distance between the camera and the object also affects the size of the image formed on
the image plane. For objects farther away, the image becomes smaller, while closer objects
form larger images.
5. Image Plane
• The image plane is where the light rays converge and form the image. This is typically a flat
surface or sensor (in digital cameras).
• The image formed on the image plane is a 2D projection of the 3D scene, and its quality is
influenced by factors like focal length, aperture size, sensor resolution, and lens quality.
6. Image Inversion
• As mentioned earlier, the image formed on the image plane is inverted due to the nature of
projection through the lens (pinhole model).
• Real-world cameras use additional lenses or software to correct this inversion and produce
an upright image.
In computer vision, the process of image formation can be mathematically described using a
projection matrix, which maps 3D world coordinates to 2D image coordinates. This process involves
perspective projection and is governed by the following equations.
Let (X, Y, Z) represent a 3D point (expressed in the camera's coordinate frame), and (u, v) represent
the corresponding 2D point on the image plane. The relationship is given by:

u = f_x (X / Z) + c_x
v = f_y (Y / Z) + c_y

More generally, in homogeneous coordinates, s [u, v, 1]^T = K [R | t] [X, Y, Z, 1]^T, where K is the
camera matrix, [R | t] holds the extrinsic parameters (rotation and translation), and s is a scale factor.
Perspective Projective Geometry
Perspective Projective Geometry refers to the mathematical principles that govern how a 3D object
is projected onto a 2D plane, mimicking how we perceive depth and space. In this system, objects
appear smaller the farther away they are, and parallel lines converge at vanishing points. The pinhole
camera model is often used to represent perspective projection, which relates 3D points in the world
to 2D points in an image.
Advantages:
o Objects are represented with true size variation based on their distance.
o Example: A car near the camera appears larger than a car far in the background.
o Crucial in graphics and video games for creating immersive environments that mimic
real-world perception.
Challenges:
1. Computational Complexity:
o Involves intricate calculations for 3D-to-2D mapping, camera calibration, and depth
estimation.
2. Non-linear Distortion:
o Distortion increases near the edges of the image, where objects may appear
stretched.
o Example: Wide-angle lenses cause barrel distortion, curving straight lines at the
edges.
3. Size Exaggeration of Near Objects:
o Objects too close to the camera may appear disproportionately large, affecting
realism.
o Example: A small object near the camera can look unnaturally large.
4. Loss of Parallelism:
o Parallel lines in the scene appear to converge at a vanishing point in the image.
o Example: Railroad tracks appear to meet at a point on the horizon, distorting real-
world parallelism.
Segmentation
Segmentation in image processing refers to the process of dividing an image into distinct regions or
segments, typically based on characteristics like color, intensity, texture, or other visual attributes.
The goal is to simplify the representation of an image or make it more meaningful for further
analysis, such as object recognition, scene interpretation, or image compression.
Segmentation divides the image into regions that are homogeneous according to some predefined
criteria. This process is crucial for applications in computer vision, such as object detection, tracking,
and medical imaging.
Graph-based Segmentation
Graph-based segmentation is a technique that uses graph theory to partition an image into regions
with similar attributes, such as color, intensity, or texture. The image is represented as a graph where
each pixel is treated as a node, and the edges between nodes represent the similarity between
pixels.
1. Graph Construction:
o Nodes: Each pixel is treated as a node in the graph.
o Edges: An edge is created between two neighboring pixels (nodes), and the weight of
the edge represents the similarity or dissimilarity between those two pixels.
o Weighting Function: The weight can be computed using a variety of metrics such as
the difference in pixel intensity, color, or texture.
2. Graph Cut:
o The goal is to partition the graph into several disjoint subgraphs (segments) such that
the edges within each segment are much stronger (similar) than the edges between
different segments.
o Minimum Cut: A graph cut technique is used to partition the graph. It finds a way to
cut the graph into two disjoint sets by minimizing the sum of the weights of the
edges between the sets, thus ensuring that pixels in the same segment are similar to
each other while maximizing the dissimilarity between segments.
o Normalized Cut: In practice, instead of using just the minimum cut, normalized cut
(Ncut) is often used to ensure that the segmentation does not overly favor larger
regions or segments.
▪ It normalizes the cut cost by taking into account both the total weight of
edges within each segment and the total weight of edges between
segments.
3. Region Merging:
o After the initial segmentation, neighboring regions with similar characteristics can be
merged together to form more meaningful segments.
o This is particularly useful when the initial segmentation produces too many small
segments or when fine details need to be aggregated into larger, more coherent
regions.
4. Segmentation Output:
o The final output is a set of disjoint segments in which pixels are similar to one
another and distinct from pixels in neighboring segments (a short code sketch is
given at the end of this Graph-based Segmentation discussion).
Advantages:
1. Accuracy:
o Graph-based methods are highly accurate, particularly when handling images with
complex textures or boundaries. By modeling the image as a graph, this technique
can capture global relationships between pixels, leading to precise boundaries for
segmentation.
2. Flexibility:
o Graph-based methods can be adapted for a wide range of similarity measures,
making them versatile in handling various types of images and applications (e.g.,
color-based, texture-based, or edge-based segmentation).
3. Handling Complex Shapes:
o Unlike traditional segmentation methods that may struggle with irregularly shaped
objects, graph-based segmentation can handle objects with complex shapes or non-
linear boundaries.
Disadvantages:
1. Computational Complexity:
o The complexity of graph cut algorithms grows quadratically with the number of
pixels, making them inefficient for real-time or large-scale applications.
2. Over-segmentation:
o Depending on the threshold used in the graph cut, the algorithm may over-segment
the image, resulting in too many small regions that need further processing or
merging.
3. Sensitivity to Parameters:
o The result depends heavily on the choice of similarity measure and the threshold
used when cutting the graph.
4. Difficulty with Homogeneous Regions:
o While graph-based methods are effective for detecting edges and boundaries, they
might struggle to segment homogeneous regions (e.g., large, uniformly colored
areas), as there is little distinction between pixels in such regions.
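As referenced above, here is a minimal sketch of graph-based segmentation using scikit-image's Felzenszwalb method (one widely used graph-based algorithm, not necessarily the exact cut formulation described above); the scale, sigma, and min_size values and the sample image are assumed:

import matplotlib.pyplot as plt
from skimage import data, segmentation, color

# Minimal sketch of graph-based segmentation using the Felzenszwalb method
# from scikit-image; scale, sigma, and min_size are assumed example values.
image = data.astronaut()
labels = segmentation.felzenszwalb(image, scale=100, sigma=0.5, min_size=50)

# Visualize each segment with its mean color.
out = color.label2rgb(labels, image, kind="avg")
plt.imshow(out)
plt.axis("off")
plt.show()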
Region Splitting and Region Merging is a technique used in image segmentation that works by
partitioning an image into regions based on some homogeneity criteria (e.g., color, intensity,
texture). The idea is to divide the image into smaller parts (splitting) and then combine them based
on similarity (merging), ultimately yielding segments that are more meaningful for further analysis or
processing.
This method is particularly popular for binary segmentation tasks where the goal is to partition an
image into regions that share certain properties.
Region Splitting
Region Splitting is the first step in this technique, where the image is divided into smaller regions
based on a predefined criterion such as intensity or color similarity.
1. Initial Region:
o The process typically begins with the entire image treated as a single region.
2. Split Criteria:
o The region is checked against a homogeneity criterion. If the region does not meet
the criterion (i.e., it’s not uniform enough in terms of color, intensity, or texture), the
region is divided (split) into smaller subregions.
3. Recursive Process:
o This splitting process continues recursively, dividing the image into smaller and
smaller regions until each region meets the homogeneity condition. The
homogeneity condition is typically defined by a threshold, where regions must have
pixel values (color or intensity) that are similar within each region.
4. Termination:
o The splitting stops when all regions meet the homogeneity criterion, and no further
division is necessary.
Region Merging
After splitting the image into smaller regions, Region Merging is applied to combine adjacent regions
that are homogeneous. The purpose is to reduce over-segmentation and merge smaller regions that
are similar in characteristics.
1. Initial Regions:
o After the splitting step, the image consists of many small regions, which may not yet
represent meaningful structures or objects.
2. Merging Criteria:
o Regions that are adjacent to each other are checked for similarity using the same
homogeneity criterion. If two adjacent regions meet the similarity condition, they
are merged into a larger region.
3. Recursive Merging:
o This merging process is recursive, where regions are repeatedly merged as long as
they are similar. The merging continues until no more regions can be merged
because all adjacent regions are sufficiently distinct.
4. Termination:
o The process terminates when no more regions can be merged without violating the
homogeneity condition.
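Below is a minimal sketch of the splitting step as a recursive quadtree split on a grayscale image; the standard-deviation homogeneity test, the threshold, and the synthetic test image are assumptions, and the merging pass is only summarized in the closing comment:

import numpy as np

# Minimal sketch of the region-splitting step (quadtree splitting) on a
# grayscale image; the homogeneity test and threshold are assumed choices.
def split_regions(img, threshold=15.0, min_size=8):
    """Recursively split the image into homogeneous rectangular regions.
    Returns a list of (row, col, height, width) blocks."""
    regions = []

    def recurse(r, c, h, w):
        block = img[r:r + h, c:c + w]
        # Homogeneity criterion: standard deviation of pixel intensities.
        if block.std() <= threshold or h <= min_size or w <= min_size:
            regions.append((r, c, h, w))
            return
        h2, w2 = h // 2, w // 2
        recurse(r, c, h2, w2)
        recurse(r, c + w2, h2, w - w2)
        recurse(r + h2, c, h - h2, w2)
        recurse(r + h2, c + w2, h - h2, w - w2)

    recurse(0, 0, img.shape[0], img.shape[1])
    return regions

# Example on a synthetic image: a bright square on a dark background.
img = np.zeros((64, 64), dtype=float)
img[16:48, 16:48] = 200.0
blocks = split_regions(img)
print(len(blocks), "homogeneous blocks")

# A merging pass would then compare mean intensities of adjacent blocks and
# combine those whose difference is below the same threshold.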
Semantic Segmentation
Semantic Segmentation is a type of image segmentation where each pixel in an image is classified
into a predefined category or class. Unlike traditional image segmentation techniques, which
typically focus on dividing an image into regions based on homogeneous attributes, semantic
segmentation aims to assign each pixel to a specific class, such as "car," "tree," or "road," based on
the object or feature it represents.
In other words, semantic segmentation involves pixel-wise classification of an image, where every
pixel is labeled with a category. For example, in a street scene, all the pixels corresponding to the
road are labeled as "road," all pixels corresponding to cars are labeled as "car," and so on.
1. Input Image:
o The process starts with an input image that typically contains multiple objects,
regions, and features.
2. Pixel-Level Labeling:
o Every pixel in the image is classified into one of the predefined categories. This can
be done using various machine learning or deep learning algorithms, such as
Convolutional Neural Networks (CNNs), which are effective at extracting spatial
features from images.
3. Output:
o The result is an image where each pixel is assigned a label corresponding to its
category. The output image is typically a label map or segmentation mask, where
each pixel is assigned a class, and different classes are represented by different
colors.
4. Applications:
o Medical Imaging: Used for identifying specific regions in medical scans, such as
tumor detection or organ segmentation.
o Agriculture: Semantic segmentation is used for classifying crops, weeds, and other
elements in satellite imagery.
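As a brief illustration, here is a minimal sketch of pixel-wise labeling with a pretrained DeepLabV3 model from torchvision; the image path "street.jpg" is a placeholder, and the weights argument may differ across torchvision versions:

import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Minimal sketch of pixel-wise semantic segmentation with a pretrained
# DeepLabV3 model; in older torchvision versions use pretrained=True.
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("street.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    output = model(batch)["out"][0]   # shape: (num_classes, H, W)

# The label map assigns one class index to every pixel (the segmentation mask).
label_map = output.argmax(dim=0)
print(label_map.shape, label_map.unique())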
Differences from Classical Image Segmentation
Focus on Class Labels vs. Regions:
• Semantic segmentation focuses on labeling each pixel with a semantic class, which means it
recognizes and classifies every object or feature in the image, regardless of whether it’s
physically separated by boundaries.
• Classical image segmentation (e.g., region-based segmentation) groups pixels into regions
based on visual similarity or color/texture, without necessarily attributing a specific class to
each region. The primary goal is to divide the image into meaningful segments, not
necessarily to assign a semantic class to every pixel.
Granularity (Pixel Level vs. Region Level):
• Semantic segmentation works at a pixel level, ensuring that every pixel is classified
individually, providing fine-grained information about the image content.
• Classical image segmentation often works on a region level, dividing the image into larger,
continuous regions based on features like intensity or texture, without explicitly recognizing
object categories.
Computational Complexity:
• Semantic segmentation typically requires more advanced models, especially deep learning
techniques like Fully Convolutional Networks (FCNs) or U-Net architectures, which require
large labeled datasets for training. This makes the process computationally expensive.
• Classical segmentation can be done using simpler algorithms such as thresholding, k-means
clustering, or region growing, which are generally less computationally intensive but may
not achieve the same level of precision as semantic segmentation.
Interpretation of Results:
• In semantic segmentation, the results are more interpretable because the segmentation
corresponds directly to real-world objects or classes (e.g., roads, cars, buildings).
Motion-Based Segmentation
Optical Flow refers to the pattern of apparent motion of objects or pixels between two consecutive
frames in a video. It is caused by the relative motion between the camera and the scene. The goal of
optical flow-based segmentation is to estimate the motion vector for each pixel and use this
information to segment objects that are moving.
How It Works:
• Optical flow estimation is typically done using algorithms like the Horn-Schunck or Lucas-
Kanade methods, which estimate the velocity (or flow) of each pixel in an image by analyzing
the changes in pixel intensities between successive frames.
• The optical flow vectors represent the displacement of pixels from one frame to the next.
• Motion segmentation is achieved by grouping pixels with similar motion patterns into
clusters, where each cluster corresponds to an object or region that moves in a consistent
direction.
1. Calculate Optical Flow: The optical flow vectors are computed for each pixel using a flow
estimation algorithm.
2. Group Pixels by Motion: Pixels with similar motion vectors (e.g., similar velocity, direction)
are grouped together. These groups correspond to moving objects or background regions.
3. Segment Moving Objects: The moving objects can then be extracted by thresholding or
clustering the motion vectors, distinguishing the objects of interest from the stationary
background.
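Here is a minimal sketch of these steps using OpenCV's dense Farneback optical flow, followed by a crude motion mask obtained by thresholding the flow magnitude; the video path, the Farneback parameters, and the 2.0-pixel threshold are assumed example values:

import cv2
import numpy as np

# Minimal sketch of dense optical flow with the Farneback method in OpenCV;
# "video.mp4" and the magnitude threshold are placeholder assumptions.
cap = cv2.VideoCapture("video.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # flow is an (H, W, 2) array of per-pixel displacement vectors.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    # Crude motion segmentation: pixels whose displacement exceeds a threshold.
    moving_mask = (magnitude > 2.0).astype(np.uint8) * 255
    cv2.imshow("moving pixels", moving_mask)
    if cv2.waitKey(1) == 27:   # Esc to quit
        break
    prev_gray = gray

cap.release()
cv2.destroyAllWindows()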
Example:
In a video of a person walking, the optical flow algorithm calculates the motion vectors for each pixel
between consecutive frames. The moving pixels corresponding to the person will form a distinct
motion pattern, and these pixels can be grouped to segment the person from the background.
Advantages:
• Works well for small motions where the movement is relatively smooth and consistent
across pixels.
Disadvantages:
• May fail for large motions or fast-moving objects, as the assumption of small, gradual pixel
changes may not hold.
• Can be computationally expensive, especially for high-resolution video and complex motion
patterns.
Background Subtraction
Background subtraction is a technique used to detect moving objects in a sequence of video frames
by comparing the current frame to a background model. The idea is to identify the pixels that differ
significantly from the background and classify them as foreground (i.e., moving objects).
How It Works:
• Background model: The background model is a representation of the static scene, which is
updated over time. This model may be a simple static image or a more complex dynamic
model, depending on the algorithm.
• Foreground detection: To detect motion, each pixel in the current frame is compared to the
corresponding pixel in the background model. If the pixel value differs significantly from the
background, it is classified as foreground (i.e., part of a moving object).
• Updating the background: Over time, the background model is updated to accommodate
gradual changes in the scene (e.g., lighting conditions, shadows, etc.).
1. Initialize the Background Model: Build an initial model of the static scene, for example from
the first frame or from an average of the first few frames.
2. Compare Current Frame with Background: For each pixel in the current frame, compare it
with the background model. If the pixel value significantly deviates, it is classified as
foreground.
3. Update Background Model: After processing each frame, the background model is updated
to account for new static elements in the scene, like changes in lighting or new objects
becoming part of the background.
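A minimal sketch of this pipeline using OpenCV's built-in MOG2 background subtractor (which maintains and updates the background model internally); the video path and the parameter values are assumptions:

import cv2

# Minimal sketch of background subtraction with OpenCV's MOG2 model;
# "entrance.mp4" is a placeholder video path.
cap = cv2.VideoCapture("entrance.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Foreground mask: non-zero pixels differ significantly from the
    # background model, which MOG2 updates automatically over time.
    fg_mask = subtractor.apply(frame)
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) == 27:   # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()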
Example:
In a surveillance video, a camera might be monitoring an entrance. Background subtraction will allow
the system to detect moving people entering or exiting by comparing the current frame to the
background (which could be an empty hallway or door frame).
Advantages:
• Works well in stationary scenes: Particularly effective when the background is relatively
static, and only foreground objects are moving.
Disadvantages:
• Challenges with dynamic backgrounds: The method struggles when there are significant
changes in the background, such as lighting changes, shadows, or moving trees.
• Sensitivity to noise: Small noise or changes in the environment can be falsely detected as
foreground.
Clustering
Clustering is an unsupervised machine learning technique used to group similar data points into
clusters or groups. The goal of clustering is to organize data into such groups that data points within
each group are more similar to each other than to those in other groups. This method is useful for
data exploration, pattern recognition, anomaly detection, and data summarization.
Common applications of clustering include:
• Image segmentation
• Document classification
• Market research
• Pattern recognition
The K-Means algorithm is one of the most widely used clustering algorithms. It aims to partition the
data into K distinct, non-overlapping clusters, where each data point belongs to the cluster whose
center (or centroid) is nearest.
1. Initialization:
o Randomly select K data points from the dataset to serve as initial centroids (the
center of each cluster).
2. Assignment Step:
o Assign each data point in the dataset to the nearest centroid based on a chosen
distance metric (usually Euclidean distance).
o This step forms K clusters, where each data point is assigned to one of the K
centroids.
3. Update Step:
o After assigning all data points to the nearest centroid, recompute the centroids of
each cluster. The new centroid is the mean of all the data points assigned to the
cluster.
4. Repeat:
o Repeat the Assignment Step and Update Step until the centroids no longer change
significantly (i.e., convergence is reached) or the maximum number of iterations is
met.
5. Termination:
o The algorithm terminates when the centroids stabilize, and further iterations do not
result in significant changes in the clusters.
Consider a 2D dataset of points with their coordinates, and we want to cluster them into 2 clusters
using the K-Means algorithm.
Step-by-Step Example:
Let's take a simple example where we have the following data points:
(1, 2), (2, 3), (3, 3), (6, 6), (8, 8), (9, 9)
Step 1: Initialization
• Choose K = 2 and initialize two centroids randomly (in the sketch below, two of the data
points are used as assumed starting centroids).
Step 2: Assignment
• Assign each point to the nearest centroid by calculating the Euclidean distance between each
point and the centroids.
Step 3: Update
• Recompute each centroid as the mean of the points assigned to it.
Step 4: Repeat
• Reassign points and update centroids until the centroids no longer change, indicating that the
algorithm has converged. Here, the points (1, 2), (2, 3), (3, 3) form one cluster and (6, 6),
(8, 8), (9, 9) form the other.
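Below is a minimal NumPy sketch of K-Means on the six example points; choosing (1, 2) and (6, 6) as the initial centroids is an assumption made for illustration (any reasonable initialization converges to the same two clusters here):

import numpy as np

# Data points from the example above.
X = np.array([[1, 2], [2, 3], [3, 3], [6, 6], [8, 8], [9, 9]], dtype=float)

# Assumed initial centroids (two of the data points, chosen for illustration).
centroids = np.array([[1, 2], [6, 6]], dtype=float)

for _ in range(10):                       # a few iterations are enough here
    # Assignment step: index of the nearest centroid for every point.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break                             # converged
    centroids = new_centroids

print(labels)      # [0 0 0 1 1 1]
print(centroids)   # approx. [[2.0, 2.67], [7.67, 7.67]]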
Advantages:
1. Efficiency:
o K-Means is computationally efficient; each iteration only requires computing
distances between every point and the K centroids.
2. Scalability:
o It scales well to large datasets, particularly with optimized or mini-batch variants.
3. Simplicity:
o The algorithm is easy to understand and implement.
4. Flexibility:
o K-Means can be used in many different domains and is adaptable to many clustering
problems.
Disadvantages:
1. Choice of K:
o The number of clusters K must be pre-specified, and choosing the right value can
be challenging. An incorrect K can lead to poor clustering results.
2. Assumption of Cluster Shape:
o K-Means assumes that clusters are roughly spherical and similar in size, which may
not always be the case in real-world data.
3. Sensitivity to Initialization:
o The final clusters can depend on the randomly chosen initial centroids; different
initializations may produce different results.
4. Outliers:
o K-Means can be sensitive to outliers because they can skew the centroids' positions.