8-Image Detection and Segmentation
Other Computer Vision Tasks
● Semantic Segmentation
● Classification + Localization
● Object Detection
● Instance Segmentation
Semantic Segmentation
Don’t differentiate instances, only care about pixels
[Figure: two example images with per-pixel labels: Sky, Trees, Cat, Grass and Sky, Trees, Cow, Grass]
Semantic Segmentation Idea: Sliding Window
Extract patch from the full image, then classify the center pixel with a CNN
[Figure: patches from the full image classified as Cow, Cow, Grass]
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Problem: Very inefficient! Not reusing shared features between overlapping patches
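The sliding-window idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `sliding_window_segment`, `toy_classifier`, and the `patch_size` parameter are hypothetical names, and the toy classifier (a brightness threshold) stands in for the CNN. The point is that the classifier runs once per pixel on heavily overlapping patches.

```python
import numpy as np

def sliding_window_segment(image, classify_patch, patch_size=3):
    """Label every pixel by classifying the patch centered on it.

    image: array of shape (H, W, 3); patch_size is assumed to be odd.
    """
    pad = patch_size // 2
    # Replicate border pixels so that edge pixels also get a full patch.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + patch_size, j:j + patch_size]
            # One full classifier call per pixel: H * W calls in total.
            labels[i, j] = classify_patch(patch)
    return labels

def toy_classifier(patch):
    # Hypothetical stand-in for the CNN: label 1 if the patch is bright.
    return int(patch.mean() > 0.5)

image = np.zeros((4, 4, 3))
image[:, 2:] = 1.0  # right half of the image is bright
labels = sliding_window_segment(image, toy_classifier)
```

Neighboring patches share most of their pixels, yet nothing computed for one patch is reused for the next, which is the inefficiency the slide points out.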
Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Input: 3 x H x W
Convolutions: D x H x W
Scores: C x H x W
Predictions: H x W
Problem: convolutions at original image resolution will be very expensive ...
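The last step of the fully convolutional design, going from scores of shape C x H x W to predictions of shape H x W, can be sketched as a per-pixel argmax over the class axis. The function name `predict_labels` and the random scores are assumptions made for illustration; in a real network the scores would come from the convolutional layers.

```python
import numpy as np

def predict_labels(scores):
    """Turn per-class scores of shape (C, H, W) into a label map of shape (H, W)."""
    # For each pixel, pick the class with the highest score.
    return scores.argmax(axis=0)

rng = np.random.default_rng(0)
C, H, W = 5, 4, 6                      # illustrative sizes only
scores = rng.normal(size=(C, H, W))    # stand-in for network output
predictions = predict_labels(scores)
print(predictions.shape)               # (4, 6)
```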
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Downsampling: Pooling, strided convolution
Upsampling: ???
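In-Network upsampling: “Unpooling”
Nearest neighbor (input 2 x 2, output 4 x 4):
1 2      1 1 2 2
3 4      1 1 2 2
         3 3 4 4
         3 3 4 4
“Bed of nails” (input 2 x 2, output 4 x 4):
1 2      1 0 2 0
3 4      0 0 0 0
         3 0 4 0
         0 0 0 0
[Figure: a 2 x 2 max pooling layer reduces a 4 x 4 input to the 2 x 2 output (5 6 / 7 8) and remembers which element of each window was the max. After the rest of the network, the matching unpooling layer places its 2 x 2 input (1 2 / 3 4) at the remembered positions of a 4 x 4 output and fills all other positions with 0.]
Corresponding pairs of downsampling and upsampling layers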
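The three fixed (non-learnable) upsampling schemes shown in this figure can be sketched in NumPy as follows. The function names are hypothetical, the inputs are single-channel 2-D arrays, and the pooling window is assumed to tile the input exactly. This is a sketch, not a framework implementation.

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    """Copy each value into a factor x factor block."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    """Put each value in the top-left corner of its block, zeros elsewhere."""
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

def max_pool_with_indices(x, size=2):
    """Max pooling that also remembers where in each window the max was."""
    h, w = x.shape
    blocks = (x.reshape(h // size, size, w // size, size)
               .transpose(0, 2, 1, 3)
               .reshape(h // size, w // size, size * size))
    return blocks.max(axis=2), blocks.argmax(axis=2)

def max_unpool(x, idx, size=2):
    """Place each value at the position remembered by the paired pooling layer."""
    h, w = x.shape
    out = np.zeros((h * size, w * size), dtype=x.dtype)
    rows = np.arange(h)[:, None] * size + idx // size
    cols = np.arange(w)[None, :] * size + idx % size
    out[rows, cols] = x
    return out

small = np.array([[1, 2], [3, 4]])
nn_out = nearest_neighbor_unpool(small)   # 4 x 4, each value repeated in a block
nails_out = bed_of_nails_unpool(small)    # 4 x 4, values at even rows and columns
```

`max_unpool` needs the indices returned by the corresponding `max_pool_with_indices` call, which is why downsampling and upsampling layers come in pairs.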
Learnable Upsampling: Transpose Convolution
Recall: Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4, Output: 4 x 4
Dot product between filter and input
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4, Output: 2 x 2
Dot product between filter and input
3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2, Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
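The two rules for transpose convolution, "input gives weight for filter" and "sum where output overlaps", can be sketched directly in NumPy. The function name `transpose_conv2d` is hypothetical, and the sketch handles one channel with no padding. With a 3 x 3 filter, stride 2, and a 2 x 2 input, the uncropped result is 5 x 5. Cropping it to the 4 x 4 shown on the slide depends on a padding convention the slide does not spell out, so the sketch leaves the output uncropped.

```python
import numpy as np

def transpose_conv2d(x, w, stride=2):
    """Transpose convolution of a 2-D input x with a square 2-D filter w (no padding)."""
    h, wd = x.shape
    k = w.shape[0]
    out = np.zeros(((h - 1) * stride + k, (wd - 1) * stride + k))
    for i in range(h):
        for j in range(wd):
            # The input value scales a copy of the filter. The copy is added
            # into the output at a position that moves `stride` pixels per
            # input pixel, and overlapping regions are summed.
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.ones((3, 3))          # in a network these weights would be learned
out = transpose_conv2d(x, w, stride=2)
print(out.shape)             # (5, 5)
```

With an all-ones filter the center output value receives a contribution from all four input values, which makes the summing of overlaps easy to see.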
Classification + Localization
[Figure: overview of the four tasks, captioned “No objects, just pixels”, “Single Object”, “Multiple Object”. This image is CC0 public domain.]
Fully Connected: 4096 to 1000
Class Scores: Cat: 0.9, Dog: 0.05, Car: 0.01, ...
Softmax Loss
Vector: 4096
Head top: (x, y), ...
L2 loss against the correct head top: (x’, y’)
+ Loss: the individual losses are summed into one loss
Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
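Adding a softmax loss on the class scores to an L2 loss on predicted coordinates can be sketched as below. All function names are hypothetical, and the `weight` that balances the two terms is an assumption that does not appear on the slides.

```python
import numpy as np

def softmax_loss(scores, label):
    """Cross-entropy of the softmax over a 1-D vector of class scores."""
    shifted = scores - scores.max()                     # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def l2_loss(pred, target):
    """Squared Euclidean distance between predicted and correct coordinates."""
    return float(np.sum((pred - target) ** 2))

def multitask_loss(scores, label, pred_xy, true_xy, weight=1.0):
    # `weight` is an assumed hyperparameter balancing the two terms.
    return softmax_loss(scores, label) + weight * l2_loss(pred_xy, true_xy)

scores = np.zeros(3)                       # three classes, all equally likely
loss = multitask_loss(scores, 0,
                      np.array([1.0, 2.0]),    # predicted (x, y)
                      np.array([1.0, 4.0]))    # correct (x', y')
```

Because both terms are computed from the same 4096-dimensional feature vector, one backward pass through the summed loss trains both outputs.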
Object Detection
Object Detection as Regression?
Each image needs a different number of outputs!
CAT: (x, y, w, h): 4 numbers
DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h): 12 numbers
DUCK: (x, y, w, h), DUCK: (x, y, w, h), ….: Many numbers!
Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the image; the CNN classifies each crop as object or background.
Crop 1: Dog? NO. Cat? NO. Background? YES.
Crop 2: Dog? YES. Cat? NO. Background? NO.
Crop 3: Dog? YES. Cat? NO. Background? NO.
Crop 4: Dog? NO. Cat? YES. Background? NO.
Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
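A quick count shows how fast the number of crops grows. The helper `count_windows` and the image size, window sizes, and stride used below are illustrative assumptions, not values from the slides.

```python
def count_windows(height, width, window_sizes, stride):
    """Count the crops a sliding-window detector would have to classify."""
    total = 0
    for win_h, win_w in window_sizes:
        if win_h > height or win_w > width:
            continue  # this window does not fit in the image
        rows = (height - win_h) // stride + 1
        cols = (width - win_w) // stride + 1
        total += rows * cols
    return total

# Hypothetical setting: a 480 x 640 image, three window sizes, stride 8.
n_crops = count_windows(480, 640, [(64, 64), (128, 128), (256, 256)], stride=8)
print(n_crops)
```

Every one of these crops would need its own forward pass through the CNN, and the count grows further with each extra scale or aspect ratio.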
Region Proposals
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region
proposals in a few seconds on CPU
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
R-CNN: Problems
• Ad hoc training objectives
  • Fine-tune network with softmax classifier (log loss)
  • Train post-hoc linear SVMs (hinge loss)
  • Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
  • 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
  • Fixed by SPP-net [He et al. ECCV14]
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
R-CNN vs SPP vs Fast R-CNN
Problem: Runtime dominated by region proposals!
Faster R-CNN: Make CNN do proposals!
Insert Region Proposal Network (RPN) to predict proposals from features
Aside: Object Detection + Captioning = Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
Instance Segmentation
Mask R-CNN
Classification Scores: C
Box coordinates (per class): 4 * C
Mask (per class): C x 14 x 14
He et al, “Mask R-CNN”, arXiv 2017
Mask R-CNN: Very Good Results!
Mask R-CNN: Also does pose
Recap:
● Semantic Segmentation
● Classification + Localization
● Object Detection
● Instance Segmentation
Thank you
Any Questions?