Image Detection and Segmentation

Other Computer Vision Tasks
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Classification + Localization: CAT (single object)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain
Semantic Segmentation

Label each pixel in the image with a category label.
Don’t differentiate instances, only care about pixels.
Figure: two images labeled per pixel (first: Sky, Trees, Cat, Grass; second: Sky, Trees, Cow, Grass).
Semantic Segmentation Idea: Sliding Window

Extract a patch from the full image, then classify the center pixel of the patch with a CNN (in the figure: Cow, Cow, Grass).

Problem: Very inefficient! Not reusing shared features between overlapping patches.

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Semantic Segmentation Idea: Fully Convolutional

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Input: 3 x H x W -> Conv -> Conv -> Conv -> Conv (convolutions: D x H x W) -> Scores: C x H x W -> argmax -> Predictions: H x W

Problem: convolutions at original image resolution will be very expensive ...
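The last step of the pipeline above (Scores: C x H x W -> argmax -> Predictions: H x W) can be sketched in plain Python. This is only an illustration: the function name `argmax_over_classes` and the toy score values are made up here, and a real network would do this on tensors rather than nested lists.

```python
def argmax_over_classes(scores):
    """scores is a nested list shaped [C][H][W]; returns [H][W] class indices."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    predictions = []
    for i in range(height):
        row = []
        for j in range(width):
            # Pick the class whose score is largest at pixel (i, j).
            best = max(range(num_classes), key=lambda c: scores[c][i][j])
            row.append(best)
        predictions.append(row)
    return predictions


# Toy scores (hypothetical values): C = 3 classes, H = 2, W = 2.
scores = [
    [[0.1, 0.7], [0.2, 0.1]],  # class 0
    [[0.8, 0.2], [0.3, 0.1]],  # class 1
    [[0.1, 0.1], [0.5, 0.8]],  # class 2
]
predictions = argmax_over_classes(scores)
print(predictions)  # [[1, 0], [2, 2]]
```

Each output entry is a class index, so the prediction map has the same height and width as the input image but no channel dimension.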
Semantic Segmentation Idea: Fully Convolutional

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: pooling, strided convolution
Upsampling: ???

Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
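The two fixed upsampling schemes above can be reproduced with a short sketch. The function names and the `factor` parameter are illustrative choices, not something the slides define; the 2 x 2 input is the one shown on the slide.

```python
def nearest_neighbor_upsample(x, factor=2):
    """Copy each input value into a factor x factor block of the output."""
    return [[x[i // factor][j // factor] for j in range(len(x[0]) * factor)]
            for i in range(len(x) * factor)]


def bed_of_nails_upsample(x, factor=2):
    """Place each input value in the top-left corner of its block, zeros elsewhere."""
    out = [[0] * (len(x[0]) * factor) for _ in range(len(x) * factor)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            out[i * factor][j * factor] = value
    return out


x = [[1, 2], [3, 4]]
nn_out = nearest_neighbor_upsample(x)
nails_out = bed_of_nails_upsample(x)
for row in nn_out:
    print(row)
for row in nails_out:
    print(row)
```

Neither scheme has learnable parameters; they only decide where the existing values go in the larger grid.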

In-Network upsampling: “Max Unpooling”

Max Pooling: remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8

… Rest of the network …

Max Unpooling: use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4

Corresponding pairs of downsampling and upsampling layers
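The pooling and unpooling pair above can be sketched as follows. This is a minimal illustration for 2 x 2 windows with stride 2; the helper names are hypothetical, and ties between equal values are broken by taking the first maximum, which the slides do not specify.

```python
def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2; also returns where each max came from."""
    pooled, positions = [], []
    for i in range(0, len(x), 2):
        pooled_row, pos_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, pos = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            pos_row.append(pos)
        pooled.append(pooled_row)
        positions.append(pos_row)
    return pooled, positions


def max_unpool_2x2(values, positions, out_height, out_width):
    """Write each value back at the remembered max position; zeros elsewhere."""
    out = [[0] * out_width for _ in range(out_height)]
    for value_row, pos_row in zip(values, positions):
        for value, (i, j) in zip(value_row, pos_row):
            out[i][j] = value
    return out


x = [[1, 2, 6, 3],
     [3, 5, 2, 1],
     [1, 2, 2, 1],
     [7, 3, 4, 8]]
pooled, positions = max_pool_2x2(x)
# Later in the network, a 2 x 2 feature map is unpooled with the saved positions.
unpooled = max_unpool_2x2([[1, 2], [3, 4]], positions, 4, 4)
print(pooled)
for row in unpooled:
    print(row)
```

The values being unpooled come from a later layer, while the positions come from the matching pooling layer earlier in the network.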
Learnable Upsampling: Transpose Convolution

Recall: Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4, Output: 4 x 4
Dot product between filter and input

Recall: Normal 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4, Output: 2 x 2
Dot product between filter and input
Filter moves 2 pixels in the input for every one pixel in the output
Stride gives ratio between movement in input and output
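The input and output sizes quoted above follow from the usual output-size arithmetic for a convolution along one dimension. A small sketch (the function name is illustrative):

```python
def conv_output_size(input_size, kernel_size, stride, pad):
    """Output length along one dimension of a standard convolution."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


stride1 = conv_output_size(4, kernel_size=3, stride=1, pad=1)
stride2 = conv_output_size(4, kernel_size=3, stride=2, pad=1)
print(stride1)  # 4
print(stride2)  # 2
```

With stride 1 and pad 1 a 3 x 3 filter preserves the 4 x 4 size, and with stride 2 the same filter halves it to 2 x 2.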
Learnable Upsampling: Transpose Convolution

3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2, Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input

Other names:
-Deconvolution (bad)
-Upconvolution
-Fractionally strided convolution
-Backward strided convolution
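A note on the sizes: under the output-size convention that common frameworks document for transpose convolution (assumed here, it is not stated on the slides), a 2 x 2 input with a 3 x 3 filter, stride 2 and pad 1 comes out 3 x 3, and one extra row and column of output padding is needed to reach the 4 x 4 shown. The sketch below only checks that arithmetic; the function name and the `output_padding` parameter are illustrative.

```python
def transpose_conv_output_size(input_size, kernel_size, stride, pad, output_padding=0):
    """Output length along one dimension of a transpose convolution (dilation 1)."""
    return (input_size - 1) * stride - 2 * pad + kernel_size + output_padding


plain = transpose_conv_output_size(2, kernel_size=3, stride=2, pad=1)
padded = transpose_conv_output_size(2, kernel_size=3, stride=2, pad=1, output_padding=1)
print(plain)   # 3
print(padded)  # 4
```

Either way the stride controls the upsampling factor, which is the point the slide makes.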
Transpose Convolution: 1D Example

Input: [a, b]
Filter: [x, y, z]
Output: [ax, ay, az + bx, by, bz]

Output contains copies of the filter weighted by the input, summing where they overlap in the output.
Need to crop one pixel from output to make output exactly 2x input.
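The 1D example can be checked numerically with a short sketch. The numbers chosen for a, b, x, y, z are arbitrary, the function name is illustrative, and cropping the last element is an assumption, since the slide does not say which end is cropped.

```python
def transpose_conv1d(inputs, filt, stride=2):
    """Each input value scales a copy of the filter; copies are summed where they overlap."""
    out_len = (len(inputs) - 1) * stride + len(filt)
    out = [0] * out_len
    for i, weight in enumerate(inputs):
        for k, f in enumerate(filt):
            out[i * stride + k] += weight * f
    return out


a, b = 2, 3            # input
x, y, z = 1, 10, 100   # filter
full = transpose_conv1d([a, b], [x, y, z])
print(full)  # [ax, ay, az + bx, by, bz] = [2, 20, 203, 30, 300]

# Assumption: crop the final element so the output is exactly 2x the input length.
cropped = full[:4]
print(cropped)  # [2, 20, 203, 30]
```

The middle entry is the only place where the two filter copies overlap, which is why it is a sum of two products.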
Semantic Segmentation Idea: Fully Convolutional

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: pooling, strided convolution
Upsampling: unpooling or strided transpose convolution

Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Classification + Localization

Treat localization as a regression problem!

Network: often pretrained on ImageNet (transfer learning), produces a vector of size 4096.
Fully Connected: 4096 to 1000 -> Class Scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...) -> Softmax Loss (correct label: Cat)
Fully Connected: 4096 to 4 -> Box Coordinates (x, y, w, h) -> L2 Loss (correct box: (x’, y’, w’, h’))
Multitask Loss: Softmax Loss + L2 Loss

This image is CC0 public domain
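The multitask loss described above can be sketched in plain Python. This is an illustration under assumptions: the classifier input is taken to be raw (unnormalized) scores, the function names are made up, and the `box_weight` factor balancing the two terms is a hypothetical hyperparameter that the slides do not mention.

```python
import math


def softmax_cross_entropy(scores, correct_index):
    """Softmax loss on raw class scores, with the usual max-shift for stability."""
    shift = max(scores)
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return -(scores[correct_index] - shift - log_sum)


def l2_loss(predicted_box, correct_box):
    """Sum of squared differences between predicted and correct (x, y, w, h)."""
    return sum((p - c) ** 2 for p, c in zip(predicted_box, correct_box))


def multitask_loss(scores, correct_index, predicted_box, correct_box, box_weight=1.0):
    """Classification loss plus (weighted) box regression loss."""
    return (softmax_cross_entropy(scores, correct_index)
            + box_weight * l2_loss(predicted_box, correct_box))


# Toy example: two classes with equal scores, box off by 1 in the height.
loss = multitask_loss([0.0, 0.0], 0, [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(loss)
```

Both heads share the same 4096-dimensional feature vector, so gradients from the two loss terms flow into the same network.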
Aside: Human Pose Estimation

Represent pose as a set of 14 joint positions:
Left / right foot
Left / right knee
Left / right hip
Left / right shoulder
Left / right elbow
Left / right hand
Neck
Head top

This image is licensed under CC-BY 2.0.
Johnson and Everingham, "Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation", BMVC 2010
Aside: Human Pose Estimation

Vector: 4096 -> Left foot: (x, y), Right foot: (x, y), …, Head top: (x, y)
Each predicted joint gets an L2 loss against its correct position, e.g. correct left foot: (x’, y’), correct head top: (x’, y’).
Loss: sum of the per-joint L2 losses

Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
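The pose loss, a sum of per-joint L2 losses over the 14 joints, can be sketched as below. The function name and the toy coordinates are hypothetical.

```python
def pose_loss(predicted_joints, correct_joints):
    """Sum of squared distances between predicted and correct (x, y) joint positions."""
    total = 0.0
    for (px, py), (cx, cy) in zip(predicted_joints, correct_joints):
        total += (px - cx) ** 2 + (py - cy) ** 2
    return total


# Toy example with 14 joints: only the first joint is off, by 1 in x.
predicted = [(0.0, 0.0)] * 14
correct = [(1.0, 0.0)] + [(0.0, 0.0)] * 13
joint_loss = pose_loss(predicted, correct)
print(joint_loss)  # 1.0
```

Because the number of joints is fixed, the network can regress all 28 coordinates from one feature vector.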
Object Detection

Object Detection as Regression?

Each image needs a different number of outputs!
CAT: (x, y, w, h) -> 4 numbers
DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) -> 12 numbers
DUCK: (x, y, w, h), DUCK: (x, y, w, h), …. -> Many numbers!
Object Detection as Classification: Sliding Window

Apply a CNN to many different crops of the image, CNN classifies each crop as object or background.

Crop 1: Dog? NO. Cat? NO. Background? YES
Crop 2: Dog? YES. Cat? NO. Background? NO
Crop 3: Dog? NO. Cat? YES. Background? NO

Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
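To get a feel for how quickly the number of crops grows, here is a rough count for square windows. The image size, window sizes and stride are made-up values for illustration, and the function name is hypothetical.

```python
def count_windows(image_h, image_w, window_h, window_w, stride):
    """Number of window positions that fit fully inside the image."""
    if window_h > image_h or window_w > image_w:
        return 0
    rows = (image_h - window_h) // stride + 1
    cols = (image_w - window_w) // stride + 1
    return rows * cols


# Hypothetical setup: 224 x 224 image, three square window sizes, stride 8.
total = sum(count_windows(224, 224, size, size, 8) for size in (32, 64, 128))
print(total)  # 625 + 441 + 169 = 1235
```

Even this coarse grid with three scales and no aspect-ratio variation needs over a thousand CNN forward passes per image.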
Region Proposals
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
R-CNN: Problems
• Ad hoc training objectives
  • Fine-tune network with softmax classifier (log loss)
  • Train post-hoc linear SVMs (hinge loss)
  • Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
  • 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
  • Fixed by SPP-net [He et al. ECCV14]

Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Slide copyright Ross Girshick, 2015; source. Reproduced with permission.

Fast R-CNN