Lecture 5 Segmentation
Image Segmentation
Ziwei Liu
刘子纬
https://liuziwei7.github.io/
Slide Credits
Instance Segmentation:
• Object Detection
• Mask R-CNN
• Joint Mask+X Prediction
Open Problems:
Many Structured Prediction Tasks
Classification: CAT
Semantic Segmentation: GRASS, CAT, TREE, SKY
Object Detection: DOG, DOG, CAT
Instance Segmentation: DOG, DOG, CAT
Fully Convolutional Network
Semantic Segmentation Idea: Sliding Window
Full image
Extract patch
Classify center pixel with CNN
Example outputs: Cow, Cow, Grass
Problem: Very inefficient! Not reusing shared features between overlapping patches
Farabet et al, “Learning Hierarchical Features for Scene Labeling”, TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
“Fully” Convolution
Fully Convolutional Network
Design a network as a bunch of convolutional
layers to make predictions for pixels all at once!
Input: 3 x H x W
Convolutions: D x H x W
Scores: C x H x W
Predictions: H x W
Loss function: Per-Pixel cross-entropy
Long et al, “Fully convolutional networks for semantic segmentation”, CVPR 2015
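A minimal sketch of the per-pixel cross-entropy loss named above, written with plain Python lists so it runs without dependencies. The function names `per_pixel_cross_entropy` and `predict` are illustrative and not taken from the slides; a real network would compute this on tensors.

```python
import math

def per_pixel_cross_entropy(scores, labels):
    """Mean cross-entropy over all pixels.

    scores: nested list of shape C x H x W (one score map per class).
    labels: nested list of shape H x W with integer class ids.
    """
    num_classes = len(scores)
    height, width = len(labels), len(labels[0])
    total = 0.0
    for r in range(height):
        for c in range(width):
            pixel_scores = [scores[k][r][c] for k in range(num_classes)]
            # Numerically stable log-sum-exp over the class axis.
            top = max(pixel_scores)
            log_norm = top + math.log(sum(math.exp(s - top) for s in pixel_scores))
            # Negative log-probability of the true class at this pixel.
            total += log_norm - pixel_scores[labels[r][c]]
    return total / (height * width)

def predict(scores):
    """Per-pixel argmax over classes: C x H x W scores -> H x W labels."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda k: scores[k][r][c])
             for c in range(width)] for r in range(height)]
```

The loss averages one softmax cross-entropy term per pixel, which is why the network can be trained on all pixels at once.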
Uh oh! 100 layers, batch size of 20 = 238GB of memory!
Downsampling and Upsampling
Fully Convolutional Network
Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Downsampling: Pooling, strided convolution
Upsampling: ???
In-Network Upsampling: “Unpooling”
Bed of Nails
Input (C x 2 x 2):
1 2
3 4
Output (C x 4 x 4):
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
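The bed-of-nails example above can be sketched for a single channel as follows. The function name `bed_of_nails_unpool` and the `stride` parameter are illustrative assumptions.

```python
def bed_of_nails_unpool(x, stride=2):
    """Upsample a 2D grid: each input value goes to the top-left corner
    of a stride x stride output block, and the rest is filled with zeros."""
    height, width = len(x), len(x[0])
    out = [[0] * (width * stride) for _ in range(height * stride)]
    for r in range(height):
        for c in range(width):
            out[r * stride][c * stride] = x[r][c]
    return out
```

Calling it on the 2 x 2 input `[[1, 2], [3, 4]]` reproduces the 4 x 4 output shown on the slide.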
In-Network Upsampling: “Unpooling”
Input: C x 2 x 2 Output: C x 4 x 4
Bicubic interpolation: use three closest neighbors in x and y to construct cubic approximations
(This is how we normally resize images!)
In-Network Upsampling: “Max Unpooling”
Max Pooling: Remember which position had the max
Input (4 x 4):
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output (2 x 2):
5 6
7 8
Rest of net
Max Unpooling: Place into remembered positions
Input (2 x 2):
1 2
3 4
Output (4 x 4):
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
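A small sketch of max pooling that remembers the argmax positions, paired with the matching max unpooling. The names `max_pool_with_indices` and `max_unpool` are illustrative, and the sketch assumes a single channel whose height and width are divisible by the window size.

```python
def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    height, width = len(x), len(x[0])
    pooled, indices = [], []
    for r0 in range(0, height, size):
        pooled_row, index_row = [], []
        for c0 in range(0, width, size):
            window = [(x[r][c], (r, c))
                      for r in range(r0, r0 + size)
                      for c in range(c0, c0 + size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def max_unpool(y, indices, out_shape):
    """Place each value of y at its remembered position; zeros elsewhere."""
    height, width = out_shape
    out = [[0] * width for _ in range(height)]
    for y_row, index_row in zip(y, indices):
        for value, (r, c) in zip(y_row, index_row):
            out[r][c] = value
    return out
```

Pooling the 4 x 4 slide input gives `[[5, 6], [7, 8]]`, and unpooling `[[1, 2], [3, 4]]` with the remembered positions gives the 4 x 4 output from the slide.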
Learnable Upsampling: Transposed Convolution
Recall: Normal 3 x 3 convolution, stride 1, pad 1
Dot product between input and filter
Input: 4 x 4 Output: 4 x 4
Recall: Normal 3 x 3 convolution, stride 2, pad 1
Dot product between input and filter
Input: 4 x 4 Output: 2 x 2
Convolution with stride > 1 is “Learnable Downsampling”
Can we use stride < 1 for “Learnable Upsampling”?
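The input and output sizes quoted for the two convolutions follow from the standard output-size formula. A short sketch, where the helper name `conv_output_size` is illustrative:

```python
def conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1

# 3 x 3 convolution on a 4 x 4 input with pad 1:
same_size = conv_output_size(4, kernel=3, stride=1, pad=1)   # stride 1 keeps 4
half_size = conv_output_size(4, kernel=3, stride=2, pad=1)   # stride 2 gives 2
```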
3 x 3 convolution transpose, stride 2
Input: 2 x 2 Output: 4 x 4
Weight filter by input value and copy to output
Filter moves 2 pixels in output for every 1 pixel in input
Sum where output overlaps
Transposed Convolution: 1D example
Input: [a, b]
Filter: [x, y, z]
Output: [ax, ay, az + bx, by, bz]
• Output has copies of filter weighted by input
• Stride 2: Move 2 pixels in output for each pixel in input
• Sum at overlaps
This has many names:
- Deconvolution (bad)!
- Upconvolution
- Fractionally strided convolution
- Backward strided convolution
- Transposed Convolution (best name)
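The 1D example can be checked with a few lines of code. This is a minimal sketch with no padding or output cropping; the function name `transposed_conv1d` is illustrative.

```python
def transposed_conv1d(inputs, kernel, stride=2):
    """1D transposed convolution: add a copy of the kernel, scaled by each
    input value, into the output at steps of `stride`."""
    out_len = (len(inputs) - 1) * stride + len(kernel)
    out = [0] * out_len
    for i, value in enumerate(inputs):
        for k, weight in enumerate(kernel):
            # Overlapping contributions are summed.
            out[i * stride + k] += value * weight
    return out
```

With inputs `[a, b]` and filter `[x, y, z]`, the middle output element receives both `a*z` and `b*x`, which is the "sum at overlaps" step.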
Transposed Convolution
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Low-res: D3 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Loss function: Per-Pixel cross-entropy
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Skip Connections
Retaining Spatial Details
How do you send details forward in the network?
You copy the activations forward.
Subsequent layers at the same resolution figure out how to fuse things.
Result from Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
U-Net
An extremely popular architecture, originally used for biomedical image segmentation.
Ronneberger et al, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI 2015
Spatial Contexts
The Importance of Spatial Contexts
$\mathrm{Unary} = -\sum_i \ln p_i(\mathrm{label})$
Liu et al, “Semantic Image Segmentation via Deep Parsing Network”, ICCV 2015
Spatial Contexts: Markov Random Field
Energy Function
$\min E = \mathrm{Unary} + \mathrm{Pair}$
Unary Term
$\mathrm{Unary} = -\sum_i \ln p_i(\mathrm{label})$
Pairwise Term
Example: $\mathrm{diss}(i, j) = 0.8$
[Figure: pairwise label matrix over the classes bkg, aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv; color scale from penalty to favor]
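A hedged sketch of evaluating the energy E = Unary + Pair for one candidate labeling. The unary term follows the slide. The pairwise term is an assumption made for illustration: a label-pair cost table `pair_cost` summed over 4-connected neighbours, which is not necessarily the form used in the cited paper. The function name `mrf_energy` is also illustrative.

```python
import math

def mrf_energy(probs, labels, pair_cost):
    """Energy of a labeling under a simple grid MRF.

    probs[r][c][k]: predicted probability of class k at pixel (r, c).
    labels[r][c]: class assigned to pixel (r, c).
    pair_cost[a][b]: assumed penalty when neighbouring pixels take labels a and b.
    """
    height, width = len(labels), len(labels[0])
    unary = 0.0
    pair = 0.0
    for r in range(height):
        for c in range(width):
            # Unary term: negative log-probability of the chosen label.
            unary += -math.log(probs[r][c][labels[r][c]])
            # Count each 4-connected pair once (right and down neighbours).
            if c + 1 < width:
                pair += pair_cost[labels[r][c]][labels[r][c + 1]]
            if r + 1 < height:
                pair += pair_cost[labels[r][c]][labels[r + 1][c]]
    return unary + pair
```

Inference then amounts to searching for the labeling with the lowest energy; with a positive cost for mismatched neighbours, smooth labelings are preferred when the unary evidence is ambiguous.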
[Figure: 3 x H x W image → CNN → C x H x W → CNN → C x H x W]
Prediction ($\hat{y}$), Ground-Truth ($y$)
Accuracy: $\mathrm{mean}(\hat{y} = y)$
Intersection over Union (IoU): $|\hat{y} \cap y| \,/\, |\hat{y} \cup y|$
Intersection over union, averaged over classes
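Both metrics can be computed directly from a predicted label map and a ground-truth label map. A minimal sketch with illustrative function names; skipping classes that appear in neither map is an assumed convention, and other conventions exist.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    pairs = [(p, t) for p_row, t_row in zip(pred, truth)
             for p, t in zip(p_row, t_row)]
    return sum(p == t for p, t in pairs) / len(pairs)

def mean_iou(pred, truth, num_classes):
    """Intersection over union per class, averaged over classes."""
    pairs = [(p, t) for p_row, t_row in zip(pred, truth)
             for p, t in zip(p_row, t_row)]
    ious = []
    for k in range(num_classes):
        intersection = sum(p == k and t == k for p, t in pairs)
        union = sum(p == k or t == k for p, t in pairs)
        if union > 0:  # assumed convention: skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)
```

Mean IoU is less forgiving than pixel accuracy when a large background class dominates the image, since every class contributes equally to the average.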
• Cow
• Grass
Things and Stuff
This image is CC0 public domain
• Stuff
Structured Prediction Tasks
Object Detection: Detects individual object instances, but only gives box (Only things!)
Semantic Segmentation: Gives per-pixel labels, but merges instances (Both things and stuff)
• Sky
• Cow
• Grass
Structured Prediction Tasks: Instance Segmentation
Classification, Semantic Segmentation, Object Detection, Instance Segmentation
Liu et al, “Multi-Objective Convolutional Learning for Face Labeling”, CVPR 2015
Face Parsing
Semantic Segmentation
Instance Segmentation
Instance Segmentation:
Detect all objects in the image, and identify the pixels that belong to each object (Only things!)
Region-based CNN (R-CNN)
Classify each region
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
Fast R-CNN vs “Slow” R-CNN
Fast R-CNN: Apply differentiable cropping to shared image features
“Slow” R-CNN: Run CNN independently for each region
Problem: Runtime dominated by region proposals!
Recall: Region proposals computed by heuristic “Selective Search” algorithm on CPU. Let’s learn them with a CNN instead!
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Faster R-CNN: Learnable Region Proposals
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015 Figure copyright 2015, Ross Girshick
Object Detection:
Faster R-CNN
Mask R-CNN
Instance Segmentation: Mask Prediction
• Feature upsampling using adaptive kernels that incorporate contexts
Feature map: $H \times W$
Extract corresponding reassembly kernel $\mathcal{W}_{l'}$ of size $k_{up} \times k_{up}$
Extract $k_{up} \times k_{up}$ nearby region of a location: $N(\chi_l, k_{up})$
Feature reassembly: $N(\chi_l, k_{up}) \otimes \mathcal{W}_{l'}$
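A rough sketch of the reassembly step for one output location and one channel, assuming the product between the nearby region and the reassembly kernel is a weighted sum over the neighbourhood. The function name `reassemble`, the zero padding at the borders, and the odd kernel size are assumptions made for illustration.

```python
def reassemble(features, kernel, row, col):
    """Weighted sum of the k x k neighbourhood centred at (row, col).

    features: H x W nested list (one channel).
    kernel: k x k nested list of weights, with k assumed odd.
    """
    k = len(kernel)
    half = k // 2
    height, width = len(features), len(features[0])
    total = 0.0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r, c = row + dr, col + dc
            # Positions outside the feature map contribute zero (assumed padding).
            if 0 <= r < height and 0 <= c < width:
                total += kernel[dr + half][dc + half] * features[r][c]
    return total
```

Unlike a transposed convolution, which shares one learned filter across all positions, the kernel here can differ per location, which is what lets the upsampling adapt to context.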
Beyond Instance Segmentation
Instance Segmentation: Separate object instances, but only things
Semantic Segmentation: Identify both things and stuff, but doesn’t separate instances
Panoptic Segmentation
Label all pixels in the image (both things and stuff)
For “thing” categories also separate into instances
Example labels: Sky, Trees, Cow #1, Cow #2, Grass
Kirillov et al, “Panoptic Segmentation”, CVPR 2019
Kirillov et al, “Panoptic Feature Pyramid Networks”, CVPR 2019
Panoptic Segmentation
e.g. 17 keypoints:
- Nose
- Left / Right eye
- Left / Right ear
- Left / Right shoulder
- Left / Right elbow
- Left / Right wrist
- Left / Right hip
- Left / Right knee
- Left / Right ankle
Person image is CC0 public domain
Mask R-CNN: Instance Segmentation and Keypoint Estimation
Heads: Mask Prediction, Keypoint prediction
CNN + RPN → RoI Align → 256 x 14 x 14 → Conv … → Keypoint masks: K x 56 x 56
Ground-truth has one “pixel” turned on per keypoint. Train with softmax loss
He et al, “Mask R-CNN”, ICCV 2017
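A minimal sketch of the keypoint loss for a single keypoint, assuming the softmax runs over all spatial positions of that keypoint's mask and the ground truth is the one position that is turned on. The function name `keypoint_softmax_loss` is illustrative.

```python
import math

def keypoint_softmax_loss(logits, target):
    """Cross-entropy over spatial positions for one keypoint.

    logits: H x W nested list of scores (e.g. 56 x 56 for one of the K masks).
    target: (row, col) of the single ground-truth pixel.
    """
    flat = [score for row in logits for score in row]
    # Numerically stable log-sum-exp over all H * W positions.
    top = max(flat)
    log_norm = top + math.log(sum(math.exp(s - top) for s in flat))
    r, c = target
    return log_norm - logits[r][c]
```

The total keypoint loss would then average this quantity over the K keypoints of each region.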
Joint Instance Segmentation and Pose Estimation
Guler et al, “DensePose: Dense Human Pose Estimation in the Wild”, CVPR 2018
General Idea: Add Per-Region “Heads” to Faster / Mask R-CNN!
Heads shown: Mask Prediction, Keypoint prediction
Per-Region Heads: Each receives the features after RoI Pool / RoI Align, makes some prediction per-region
Open World
MMDetection:
https://github.com/open-mmlab/mmdetection
Next Time:
Transformers