
AI6126 Advanced Computer Vision
Last update: 9 February 2022

Image Segmentation
Ziwei Liu (刘子纬)
https://liuziwei7.github.io/
Slide Credits
• Justin Johnson, EECS 498/598
• David Fouhey, EECS 442
• Paper Authors
Outline
Semantic Segmentation:
• Fully Convolutional Network
• Skip Connections
• Spatial Contexts

Instance Segmentation:
• Object Detection
• Mask R-CNN
• Joint Mask+X Prediction

Open Problems

Many Structured Prediction Tasks
Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

This image is CC0 public domain
Part I:
Semantic Segmentation
Structured Prediction Tasks: Semantic Segmentation

Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

Semantic Segmentation
This image is CC0 public domain

Label each pixel in the image with a category label.
Don't differentiate instances; only care about pixels.
(Figure: per-pixel labels for two example images: Sky / Cat / Grass and Sky / Cow / Grass.)
Fully Convolutional Network
Semantic Segmentation Idea: Sliding Window

Extract a patch from the full image, then classify the center pixel with a CNN.
(Figure: patches around three pixels classified as Cow, Cow, Grass.)

Farabet et al, “Learning Hierarchical Features for Scene Labeling”, TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Problem: Very inefficient! Not reusing shared features between overlapping patches.
“Fully” Convolutional
Fully Convolutional Network

Design a network as a bunch of convolutional layers to make predictions for all pixels at once!

Input (3 x H x W) → Conv → Conv → Conv → Conv → Scores (C x H x W) → argmax → Predictions (H x W)
Intermediate convolutional feature maps: D x H x W
Loss function: per-pixel cross-entropy

Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
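As a concrete illustration, here is a minimal PyTorch sketch of this all-at-once design; the layer count and channel widths are illustrative assumptions, not the architecture from the paper.

```python
# Minimal sketch of a fully convolutional network: conv layers at full
# resolution, ending in per-pixel class scores (shapes are assumptions).
import torch
import torch.nn as nn

num_classes = 21  # e.g. Pascal VOC: 20 classes + background (assumption)

fcn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, num_classes, kernel_size=1),   # per-pixel class scores
)

x = torch.randn(1, 3, 64, 64)      # input: 3 x H x W
scores = fcn(x)                    # scores: C x H x W
preds = scores.argmax(dim=1)       # predictions: H x W
print(scores.shape, preds.shape)   # (1, 21, 64, 64) (1, 64, 64)
```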
Problem #1: Effective receptive field size is linear in the number of conv layers: with L 3x3 conv layers, the receptive field is 1 + 2L.

Problem #2: Convolution on high-resolution images is expensive! Recall that the ResNet stem aggressively downsamples.
Why Not Stack Convolutions?

n 3x3 convs have a receptive field of 2n + 1 pixels.
How many convolutions until the receptive field is >= 200 pixels? 2n + 1 >= 200 gives n = 100.
Why Not Stack Convolutions?

Suppose 200 3x3 filters/layer, H = W = 400.
Storage/layer/image: 200 * 400 * 400 * 4 bytes = 122 MB
Uh oh! With 100 layers and a batch size of 20, that's 238 GB of memory!
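A quick back-of-envelope check of these numbers (fp32 activations, so 4 bytes per value):

```python
# Activation memory for one conv layer on one image, fp32 (4 bytes/value).
C, H, W = 200, 400, 400
bytes_per_layer = C * H * W * 4
print(bytes_per_layer / 2**20)             # ≈ 122 MB per layer per image
print(bytes_per_layer * 100 * 20 / 2**30)  # 100 layers, batch 20: ≈ 238 GB
```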
Downsampling and Upsampling
Fully Convolutional Network

Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Downsampling: pooling, strided convolution
Upsampling: ???

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
In-Network Upsampling: “Unpooling”

Bed of Nails: input (C x 2 x 2)
1 2
3 4
becomes output (C x 4 x 4)
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0

Nearest Neighbor: the same input becomes
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
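Both schemes are easy to reproduce in PyTorch on the slide's 2x2 example (a minimal sketch; `F.interpolate` is the standard call for nearest-neighbor upsampling, and bed-of-nails can be written with strided assignment):

```python
# Sketch of the two unpooling schemes on the slide's 2x2 example.
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.], [3., 4.]]).view(1, 1, 2, 2)  # C x 2 x 2

# Nearest neighbor: each value is copied into a 2x2 block.
nearest = F.interpolate(x, scale_factor=2, mode="nearest")

# Bed of nails: values go to the top-left of each 2x2 block, rest stay zero.
bed = torch.zeros(1, 1, 4, 4)
bed[..., ::2, ::2] = x
print(nearest.squeeze(), bed.squeeze(), sep="\n")
```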
In-Network Upsampling: Bilinear Interpolation

Input (C x 2 x 2):
1 2
3 4

Output (C x 4 x 4):
1.00 1.25 1.75 2.00
1.50 1.75 2.25 2.50
2.50 2.75 3.25 3.50
3.00 3.25 3.75 4.00

Use the two closest neighbors in x and y to construct linear approximations.
In-Network Upsampling: Bicubic Interpolation

Input (C x 2 x 2):
1 2
3 4

Output (C x 4 x 4):
0.68 1.02 1.56 1.89
1.35 1.68 2.23 2.56
2.44 2.77 3.32 3.65
3.11 3.44 3.98 4.32

Use the three closest neighbors in x and y to construct cubic approximations.
(This is how we normally resize images!)
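Both interpolation modes are available through `F.interpolate`; with the default `align_corners=False`, the bilinear call reproduces the 1.00...4.00 grid above exactly, and the bicubic call should match the bicubic grid up to rounding (note how cubic interpolation overshoots the input range, as on the slide):

```python
# The same 2x2 input upsampled with bilinear and bicubic interpolation.
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.], [3., 4.]]).view(1, 1, 2, 2)
print(F.interpolate(x, size=4, mode="bilinear"))  # 1.00, 1.25, ..., 4.00
print(F.interpolate(x, size=4, mode="bicubic"))   # overshoots: < 1 and > 4
```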
In-Network Upsampling
In-Network Upsampling: “Max Unpooling”

Max Pooling: remember which position had the max. Max Unpooling: place values into the remembered positions.

Input:      Max pooled:   ...rest of net...   Values to unpool:   Output:
1 2 6 3     5 6                               1 2                 0 0 2 0
3 5 2 1     7 8                               3 4                 0 1 0 0
1 2 2 1                                                           0 0 0 0
7 3 4 8                                                           3 0 0 4

Pair each downsampling layer with an upsampling layer.

Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
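PyTorch implements exactly this pairing via `return_indices`; a small sketch on the slide's input (here the pooled values are unpooled directly, without the intervening "rest of the net"):

```python
# Max pooling that remembers argmax positions, paired with max unpooling.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.tensor([[1., 2., 6., 3.],
                  [3., 5., 2., 1.],
                  [1., 2., 2., 1.],
                  [7., 3., 4., 8.]]).view(1, 1, 4, 4)

pooled, idx = pool(x)      # pooled = [[5, 6], [7, 8]], idx stores positions
out = unpool(pooled, idx)  # max values return to their original positions
print(out.squeeze())
```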
Learnable Upsampling: Transposed Convolution

Recall: a normal 3 x 3 convolution with stride 1, pad 1 takes a dot product between the input and the filter at each position. Input: 4 x 4 → Output: 4 x 4.

With stride 2, pad 1, the same filter visits every other position. Input: 4 x 4 → Output: 2 x 2.

Convolution with stride > 1 is “learnable downsampling”.
Can we use stride < 1 for “learnable upsampling”?
Learnable Upsampling: Transposed Convolution

3 x 3 convolution transpose, stride 2. Input: 2 x 2 → Output: 4 x 4.

Weight the filter by each input value and copy it to the output; the filter moves 2 pixels in the output for every 1 pixel in the input, and outputs are summed where they overlap.

Note: this gives a 5 x 5 output; trim one pixel from the top and left to give a 4 x 4 output.
Transposed Convolution: 1D example

Input: [a, b]; filter: [x, y, z]; stride 2.
Output: [ax, ay, az + bx, by, bz]

• The output has copies of the filter weighted by the input
• Stride 2: the filter moves 2 pixels in the output for each pixel in the input
• Sum at overlaps

This operation has many names:
- Deconvolution (bad!)
- Upconvolution
- Fractionally strided convolution
- Backward strided convolution
- Transposed convolution (best name)
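A small sketch verifying the 1D example with `F.conv_transpose1d`, plus the 2D stride-2 case; `output_padding=1` with `padding=1` is one common way to get an exact 2x upsampling instead of the 5x5-then-trim behavior (the concrete numbers below are arbitrary):

```python
# 1D check: input [a, b], filter [x, y, z], stride 2 -> [ax, ay, az+bx, by, bz]
import torch
import torch.nn.functional as F

a, b = 2.0, 3.0
x_, y_, z_ = 10.0, 20.0, 30.0
inp = torch.tensor([[[a, b]]])        # (N, C_in, L)
w = torch.tensor([[[x_, y_, z_]]])    # (C_in, C_out, K)
print(F.conv_transpose1d(inp, w, stride=2))
# tensor([[[20., 40., 90., 60., 90.]]]) = [ax, ay, az+bx, by, bz]

# 2D: stride-2 transposed conv as learnable 2x upsampling. With padding=1 and
# output_padding=1, a 2x2 input maps to exactly 4x4 (no 5x5-then-trim).
up = torch.nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                              padding=1, output_padding=1)
print(up(torch.randn(1, 1, 2, 2)).shape)  # torch.Size([1, 1, 4, 4])
```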
Transposed Convolution

Convolution: the filter is a little lens that looks at a pixel.
Transposed convolution: the filter is a tile used to build up the image.

Image credit: ifixit.com, thespruce.com
Fully Convolutional Network

Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: pooling, strided convolution
Upsampling: interpolation, transposed convolution

Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Loss function: per-pixel cross-entropy

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Skip Connections
Retaining Spatial Details

While the output is H x W, just upsampling often produces results that lack detail and are not aligned with the image. Why? Information about details is lost during downsampling!

Result from Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Retaining Spatial Details

Where is the useful information about the high-frequency details of the image?
(Figure: feature maps at successive network stages A through E.)

Result from Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Retaining Spatial Details

How do you send details forward in the network? You copy the activations forward, and subsequent layers at the same resolution figure out how to fuse things.

Result from Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
U-Net

Extremely popular architecture; it was originally used for biomedical image segmentation.

Ronneberger et al, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI 2015
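A minimal sketch of the copy-and-fuse idea with a single U-Net-style skip connection (channel widths and depth are illustrative assumptions, far smaller than the real U-Net):

```python
# Tiny U-Net-style model: copy high-res activations forward, concatenate
# them after upsampling, and let the next layer fuse the two.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self, c=3, num_classes=21):
        super().__init__()
        self.enc1 = nn.Conv2d(c, 32, 3, padding=1)               # high-res
        self.enc2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)    # downsample
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)        # upsample
        self.dec = nn.Conv2d(64, num_classes, 3, padding=1)      # fuse

    def forward(self, x):
        e1 = F.relu(self.enc1(x))      # H x W (kept for the skip)
        e2 = F.relu(self.enc2(e1))     # H/2 x W/2
        d = self.up(e2)                # back to H x W
        d = torch.cat([d, e1], dim=1)  # "copy the activations forward"
        return self.dec(d)             # subsequent layer fuses them

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # (1, 21, 64, 64)
```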
Spatial Contexts
The Importance of Spatial Contexts

(Image credit: A. Torralba)

What's this? (No cheating!)
(a) Keyboard?  (b) Hammer?  (c) Old cell phone?  (d) Xbox controller?
(The next slide shows the same object with its surrounding context, which resolves the ambiguity.)

Image credit: COCO dataset
Spatial Contexts: Dilated Convolution

• The receptive field of an element x in layer k + 1 is the set of elements in layer k that influence it
• The receptive field of an element in the 2^i-dilated feature map has size (2^{i+2} − 1) x (2^{i+2} − 1)
• The receptive field grows exponentially while the number of parameters stays constant

Yu et al, “Multi-Scale Context Aggregation by Dilated Convolutions”, ICLR 2016

(Figure: standard convolution vs. dilated convolution.)
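In PyTorch, dilation is just an argument to `nn.Conv2d`; setting `padding=dilation` for a 3x3 kernel keeps the spatial size fixed while the receptive field grows:

```python
# A dilated 3x3 convolution: same 9 weights, larger receptive field.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for d in (1, 2, 4):  # dilation 2^i, as in the multi-scale context module
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    print(d, conv(x).shape)  # spatial size stays 32 x 32
```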


Spatial Contexts: Pyramid Scene Parsing Network

• Hierarchical global prior, containing information at different scales and varying among different sub-regions
• Pyramid pooling module builds a global scene prior on the final feature map
• 1x1 convolutions reduce the number of channels

Zhao et al, “Pyramid Scene Parsing Network”, CVPR 2017
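A simplified sketch of a pyramid-pooling-style module (the bin sizes and channel split are assumptions for illustration, not the exact PSPNet configuration):

```python
# Pyramid-pooling sketch: pool to several grid sizes, reduce channels with
# 1x1 convs, upsample back, and concatenate with the original features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, c, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(c, c // len(bins), 1))
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        priors = [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                align_corners=False) for s in self.stages]
        return torch.cat([x] + priors, dim=1)  # scene prior + local features

feats = torch.randn(1, 256, 16, 16)
print(PyramidPooling(256)(feats).shape)  # (1, 512, 16, 16)
```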
Spatial Contexts: Markov Random Field

Energy function: min E = Unary + Pair

Unary term (per-pixel label confidence, e.g. p_i(label = 'table') = 0.8):
    Unary = -\sum_i \ln p_i(label)

Pairwise term (consistency between pixels i and j; e.g. diss(i, j) = 0.8 for appearance consistency, cost(i; label = 'table') = 0.1 for label consistency):
    Pair = \sum_{i,j} cost(i) \cdot diss(i, j)

Liu et al, “Semantic Image Segmentation via Deep Parsing Network”, ICCV 2015
Spatial Contexts: Markov Random Field

(Figure: learned label-context matrix over the Pascal VOC categories (bkg, aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv); entries range from penalty to favor.)
Spatial Contexts: Markov Random Field

(Figure: qualitative results comparing Original Image, Ground Truth, Unary Term, Triple Penalty, Label Contexts, and Joint Tuning.)
Train and Evaluate FCN
Train Fully Convolutional Network

The CNN maps the 3 x H x W image to C x H x W class scores; apply a C-way classification (softmax cross-entropy) loss at every pixel:

    -\log \frac{\exp((Wx)_{y_i})}{\sum_k \exp((Wx)_k)}

Image credit: Everingham et al, Pascal VOC 2012.
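In PyTorch this per-pixel loss needs no special handling: `nn.CrossEntropyLoss` accepts spatial score maps directly and averages over all pixels:

```python
# Per-pixel softmax cross-entropy: (N, C, H, W) scores vs. (N, H, W) labels.
import torch
import torch.nn as nn

N, C, H, W = 2, 21, 64, 64
scores = torch.randn(N, C, H, W, requires_grad=True)  # CNN output
labels = torch.randint(0, C, (N, H, W))               # ground-truth class map
loss = nn.CrossEntropyLoss()(scores, labels)          # averaged over pixels
loss.backward()
print(loss.item())
```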
Evaluate Fully Convolutional Network

The CNN maps the 3 x H x W input image to C x H x W class scores. How do we convert the final scores into labels? Argmax over labels at every pixel.

Prediction (ŷ) and ground truth (y) are images where each pixel is one of C classes.
Accuracy: mean(ŷ == y)
Mean IoU: intersection over union, averaged over classes
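A sketch of both metrics computed from integer label maps (`segmentation_metrics` is a hypothetical helper, not from the lecture):

```python
# Pixel accuracy and per-class IoU averaged over classes (mean IoU).
import torch

def segmentation_metrics(pred, gt, num_classes):
    acc = (pred == gt).float().mean()
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (gt == c)).sum()
        union = ((pred == c) | (gt == c)).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter.float() / union)
    return acc, torch.stack(ious).mean()

pred = torch.randint(0, 3, (64, 64))
gt = torch.randint(0, 3, (64, 64))
print(segmentation_metrics(pred, gt, num_classes=3))
```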
Intersection over Union (IoU)

How can we compare our prediction to the ground-truth bounding box or mask?

Intersection over Union (IoU), also called “Jaccard similarity” or “Jaccard index”: the area of the intersection divided by the area of the union.

IoU > 0.5 is “decent”, IoU > 0.7 is “pretty good”, IoU > 0.9 is “almost perfect”.
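For axis-aligned boxes the computation is a few lines (a sketch; `torchvision.ops.box_iou` provides a batched equivalent):

```python
# IoU for two axis-aligned boxes in (x1, y1, x2, y2) form.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```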
Part II:
Instance Segmentation
Structured Prediction Tasks

Object Detection: detects individual object instances, but only gives a box. Semantic Segmentation: gives per-pixel labels, but merges instances.
(Figure: detection boxes vs. per-pixel Sky / Cow / Grass labels.)
Things and Stuff
This image is CC0 public domain

Things: object categories that can be separated into object instances (e.g. cats, cars, people)

Stuff: object categories that cannot be separated into instances (e.g. sky, grass, water, trees)
Structured Prediction Tasks

Object Detection: detects individual object instances, but only gives a box (only things!). Semantic Segmentation: gives per-pixel labels, but merges instances (both things and stuff).
Structured Prediction Tasks: Instance Segmentation

Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

Semantic or Instance Segmentation?

(Figure: face parsing, shown both as semantic segmentation and as instance segmentation.)
Liu et al, “Multi-Objective Convolutional Learning for Face Labeling”, CVPR 2015
Instance Segmentation

Instance Segmentation: detect all objects in the image, and identify the pixels that belong to each object (only things!)

Approach: perform object detection, then predict a segmentation mask for each object!

This image is CC0 public domain
Object Detection
Region Proposals
• Find a small set of boxes that are likely to cover all objects
• Often based on heuristics: e.g. look for “blob-like” image regions
• Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Region-based CNN

Classify each region.
Bounding box regression: predict a “transform” to correct the RoI: 4 numbers (t_x, t_y, t_h, t_w)

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015.
Region-based CNN (R-CNN)

Problem: Very slow! Need to do ~2k forward passes for each image!
Solution: Run the CNN *before* warping!
Fast R-CNN vs “Slow” R-CNN

Fast R-CNN: apply differentiable cropping to shared image features. “Slow” R-CNN: run the CNN independently for each region.

Problem: runtime is dominated by region proposals!
Recall: region proposals are computed by the heuristic “Selective Search” algorithm on the CPU; let's learn them with a CNN instead!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
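The “differentiable cropping” step is available directly as RoI Align in torchvision; a small sketch, where the stride-16 feature map and the box coordinates are made-up example values:

```python
# Differentiable cropping of shared features with RoIAlign.
# Boxes are (batch_index, x1, y1, x2, y2) in input-image coordinates;
# spatial_scale maps them onto the downsampled feature map.
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 256, 50, 64)                # backbone features, stride 16
boxes = torch.tensor([[0, 32., 48., 256., 320.]])  # one RoI on an 800x1024 image
crops = roi_align(feats, boxes, output_size=(7, 7),
                  spatial_scale=1 / 16, aligned=True)
print(crops.shape)                                 # (1, 256, 7, 7)
```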
Faster R-CNN: Learnable Region Proposals
Object Detection: Faster R-CNN

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015. Figure copyright 2015, Ross Girshick.
Mask R-CNN
Instance Segmentation: Mask Prediction

CNN + RPN → RoI Align → 256 x 14 x 14 features → Conv → Conv → 256 x 14 x 14

Heads per RoI:
• Classification scores: C
• Box coordinates (per class): 4*C
• Predict a mask for each of C classes: C x 28 x 28

He et al, “Mask R-CNN”, ICCV 2017
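torchvision ships a pretrained Mask R-CNN, so the full pipeline can be run in a few lines (a sketch assuming a recent torchvision; the `weights` argument name has varied across releases):

```python
# Running a pretrained Mask R-CNN from torchvision. Each output dict has
# 'boxes', 'labels', 'scores', and per-instance 'masks' of shape (1, H, W).
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)  # RGB image with values in [0, 1]
with torch.no_grad():
    (out,) = model([image])
print(out["boxes"].shape, out["masks"].shape)
```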
Mask R-CNN: Example Training Targets

Mask R-CNN: Very Good Results!
Accurate Classification and Localization
Hybrid Task Cascade

• Information fusion between bbox and mask prediction


Chen et al, “Hybrid Task Cascade for Instance Segmentation”, CVPR 2019
Content-Aware Upsampling

• Feature upsampling using adaptive kernels that incorporate context

Wang et al, “CARAFE: Content-Aware ReAssembly of FEatures”, ICCV 2019

Content-Aware Upsampling

For each output location, extract the k_up x k_up nearby region N(x_l, k_up) of the low-res feature map, extract the corresponding reassembly kernel W_l', and apply the reassemble operation ⊗ (feature reassembly). A large field of view comes from large k_up x k_up reassembly kernels.
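A simplified, unofficial sketch of this reassembly idea (kernel prediction via a 1x1 compressor plus a 3x3 conv and pixel shuffle; sizes are assumptions, and the real CARAFE implementation differs in details):

```python
# CARAFE-style reassembly sketch: predict a softmax-normalized k_up x k_up
# kernel for every upsampled location, then reassemble the neighborhood.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFESketch(nn.Module):
    def __init__(self, c, c_mid=64, k_up=5, scale=2):
        super().__init__()
        self.k, self.s = k_up, scale
        self.compress = nn.Conv2d(c, c_mid, 1)
        self.kernel_pred = nn.Conv2d(c_mid, scale**2 * k_up**2, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        k, s = self.k, self.s
        # Predict kernels, spread them over the upsampled grid, normalize.
        kern = F.pixel_shuffle(self.kernel_pred(self.compress(x)), s)
        kern = F.softmax(kern, dim=1)                  # (B, k*k, sH, sW)
        # Gather each location's k x k neighborhood of input features.
        nbrs = F.unfold(x, k, padding=k // 2)          # (B, c*k*k, H*W)
        nbrs = nbrs.view(b, c * k * k, h, w)
        nbrs = F.interpolate(nbrs, scale_factor=s, mode="nearest")
        nbrs = nbrs.view(b, c, k * k, s * h, s * w)
        return (nbrs * kern.unsqueeze(1)).sum(dim=2)   # feature reassembly

print(CARAFESketch(32)(torch.randn(1, 32, 8, 8)).shape)  # (1, 32, 16, 16)
```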
Beyond Instance Segmentation

Instance Segmentation: separates object instances, but only things. Semantic Segmentation: identifies both things and stuff, but doesn't separate instances.
(Figure: instance masks for two cows vs. per-pixel Sky / Cow / Grass labels.)
Panoptic Segmentation

Label all pixels in the image (both things and stuff).
For “thing” categories, also separate into instances.
(Figure: Sky, Trees, Grass labeled as stuff; Cow #1 and Cow #2 as separate instances.)

Kirillov et al, “Panoptic Segmentation”, CVPR 2019
Kirillov et al, “Panoptic Feature Pyramid Networks”, CVPR 2019
Joint Mask + X Prediction
Beyond Instance Segmentation: Human Keypoints
Represent the pose of a human
by locating a set of keypoints

e.g. 17 keypoints:
- Nose
- Left / Right eye
- Left / Right ear
- Left / Right shoulder
- Left / Right elbow
- Left / Right wrist
- Left / Right hip
- Left / Right knee
- Left / Right ankle
Person image is CC0 public domain
Mask R-CNN: Keypoint Estimation

Add a keypoint prediction head alongside the mask prediction head.

He et al, “Mask R-CNN”, ICCV 2017
Mask R-CNN: Keypoint Estimation

Heads per RoI: classification scores (C), box coordinates per class (4*C), segmentation mask (C x 28 x 28), plus one mask for each of the K different keypoints (e.g. left ankle, right ankle).

CNN + RPN → RoI Align → 256 x 14 x 14 → Conv… → Keypoint masks: K x 56 x 56
Ground truth has one “pixel” turned on per keypoint. Train with a softmax loss.

He et al, “Mask R-CNN”, ICCV 2017
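A sketch of that loss: treat each of the K keypoint heatmaps as a classification over its 56*56 positions (shapes follow the slide; the target locations below are random stand-ins):

```python
# Keypoint training loss: softmax cross-entropy over the 56x56 positions
# of each keypoint heatmap, with a one-hot ground-truth location.
import torch
import torch.nn.functional as F

K, S = 17, 56
heatmaps = torch.randn(K, S, S)                      # predicted keypoint masks
target_xy = torch.randint(0, S, (K, 2))              # GT pixel per keypoint
target_idx = target_xy[:, 1] * S + target_xy[:, 0]   # flatten (x, y) -> index
loss = F.cross_entropy(heatmaps.view(K, S * S), target_idx)
print(loss.item())
```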
Joint Instance Segmentation and Pose Estimation

He et al, “Mask R-CNN”, ICCV 2017
Guler et al, “DensePose: Dense Human Pose Estimation in the Wild”, CVPR 2018
General Idea: Add Per-Region “Heads” to Faster / Mask R-CNN!

Per-Region Heads: each receives the features after RoI Pool / RoI Align and makes some prediction per region (e.g. mask prediction, keypoint prediction).

He et al, “Mask R-CNN”, ICCV 2017
3D Shape Prediction: Mask R-CNN + Mesh Head

Predict a 3D mesh per region! The mesh predictor is just another per-region head receiving the features after RoI Pool / RoI Align.

Mask R-CNN: 2D image -> 2D shapes. Mesh R-CNN: 2D image -> 3D shapes.

He, Gkioxari, Dollár, and Girshick, “Mask R-CNN”, ICCV 2017
Gkioxari, Malik, and Johnson, “Mesh R-CNN”, ICCV 2019
Open Problems

Problem I: The Devil is in the Tails
Open Long-Tailed Recognition
(Figure: class distribution in an open world, from head classes through tail classes to open classes.)

Liu et al, “Large-Scale Long-Tailed Recognition in an Open World”, CVPR 2019
Gupta et al, “LVIS: A Dataset for Large Vocabulary Instance Segmentation”, CVPR 2019
Wang et al, “Seesaw Loss for Long-Tailed Instance Segmentation”, CVPR 2021

Problem II: The Blessing of Dimensionality
Hong et al, “LiDAR-based Panoptic Segmentation via Dynamic Shifting Network”, CVPR 2021
Summary: Many Structured Prediction Tasks!

Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

This image is CC0 public domain
Deep Structured Prediction: Resources

ICCV Tutorial on Instance-Level Visual Recognition:
https://instancetutorial.github.io/

MMDetection:
https://github.com/open-mmlab/mmdetection
Next Time:
Transformers
