
AI6126 Advanced Computer Vision
Last update: 9 February 2022

Image Segmentation
Ziwei Liu (刘子纬)
https://liuziwei7.github.io/
Slide Credits
• Justin Johnson, EECS 498/598
• David Fouhey, EECS 442
• Paper Authors
Outline
Semantic Segmentation:
• Fully Convolutional Network
• Skip Connections
• Spatial Contexts

Instance Segmentation:
• Object Detection
• Mask R-CNN
• Joint Mask+X Prediction

Open Problems

Many Structured Prediction Tasks
Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

This image is CC0 public domain
Part I:
Semantic Segmentation
Structured Prediction Tasks: Semantic Segmentation

Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

Semantic Segmentation
This image is CC0 public domain

Label each pixel in the image with a category label.
Don't differentiate instances; only care about pixels.
(Figure: per-pixel labels for two example images: Sky / Cat / Grass and Sky / Cow / Grass.)
Fully Convolutional Network
Semantic Segmentation Idea: Sliding Window

Extract a patch from the full image, then classify the center pixel with a CNN.
(Figure: patches around three pixels classified as Cow, Cow, Grass.)

Farabet et al, “Learning Hierarchical Features for Scene Labeling”, TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Problem: Very inefficient! Not reusing shared features between overlapping patches.
“Fully” Convolutional
Fully Convolutional Network

Design a network as a bunch of convolutional layers to make predictions for all pixels at once!

Input (3 x H x W) → Conv → Conv → Conv → Conv → Scores (C x H x W) → argmax → Predictions (H x W)
Intermediate convolutional feature maps: D x H x W
Loss function: per-pixel cross-entropy

Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
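As a concrete illustration, here is a minimal PyTorch sketch of this all-at-once design; the layer count and channel widths are illustrative assumptions, not the architecture from the paper.

```python
# Minimal sketch of a fully convolutional network: conv layers at full
# resolution, ending in per-pixel class scores (shapes are assumptions).
import torch
import torch.nn as nn

num_classes = 21  # e.g. Pascal VOC: 20 classes + background (assumption)

fcn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, num_classes, kernel_size=1),   # per-pixel class scores
)

x = torch.randn(1, 3, 64, 64)      # input: 3 x H x W
scores = fcn(x)                    # scores: C x H x W
preds = scores.argmax(dim=1)       # predictions: H x W
print(scores.shape, preds.shape)   # (1, 21, 64, 64) (1, 64, 64)
```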
Problem #1: Effective receptive field size is linear in the number of conv layers: with L 3x3 conv layers, the receptive field is 1 + 2L.

Problem #2: Convolution on high-resolution images is expensive! Recall that the ResNet stem aggressively downsamples.
Why Not Stack Convolutions?

n 3x3 convs have a receptive field of 2n + 1 pixels.
How many convolutions until the receptive field is >= 200 pixels? 2n + 1 >= 200 gives n = 100.
Why Not Stack Convolutions?

Suppose 200 3x3 filters/layer, H = W = 400.
Storage/layer/image: 200 * 400 * 400 * 4 bytes = 122 MB
Uh oh! With 100 layers and a batch size of 20, that's 238 GB of memory!
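A quick back-of-envelope check of these numbers (fp32 activations, so 4 bytes per value):

```python
# Activation memory for one conv layer on one image, fp32 (4 bytes/value).
C, H, W = 200, 400, 400
bytes_per_layer = C * H * W * 4
print(bytes_per_layer / 2**20)             # ≈ 122 MB per layer per image
print(bytes_per_layer * 100 * 20 / 2**30)  # 100 layers, batch 20: ≈ 238 GB
```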
Downsampling and Upsampling
Fully Convolutional Network

Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Downsampling: pooling, strided convolution
Upsampling: ???

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
In-Network Upsampling: “Unpooling”

Bed of Nails: input (C x 2 x 2)
1 2
3 4
becomes output (C x 4 x 4)
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0

Nearest Neighbor: the same input becomes
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
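Both schemes are easy to reproduce in PyTorch on the slide's 2x2 example (a minimal sketch; `F.interpolate` is the standard call for nearest-neighbor upsampling, and bed-of-nails can be written with strided assignment):

```python
# Sketch of the two unpooling schemes on the slide's 2x2 example.
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.], [3., 4.]]).view(1, 1, 2, 2)  # C x 2 x 2

# Nearest neighbor: each value is copied into a 2x2 block.
nearest = F.interpolate(x, scale_factor=2, mode="nearest")

# Bed of nails: values go to the top-left of each 2x2 block, rest stay zero.
bed = torch.zeros(1, 1, 4, 4)
bed[..., ::2, ::2] = x
print(nearest.squeeze(), bed.squeeze(), sep="\n")
```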
In-Network Upsampling: Bilinear Interpolation

Input (C x 2 x 2):
1 2
3 4

Output (C x 4 x 4):
1.00 1.25 1.75 2.00
1.50 1.75 2.25 2.50
2.50 2.75 3.25 3.50
3.00 3.25 3.75 4.00

Use the two closest neighbors in x and y to construct linear approximations.
In-Network Upsampling: Bicubic Interpolation

Input (C x 2 x 2):
1 2
3 4

Output (C x 4 x 4):
0.68 1.02 1.56 1.89
1.35 1.68 2.23 2.56
2.44 2.77 3.32 3.65
3.11 3.44 3.98 4.32

Use the three closest neighbors in x and y to construct cubic approximations.
(This is how we normally resize images!)
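Both interpolation modes are available through `F.interpolate`; with the default `align_corners=False`, the bilinear call reproduces the 1.00...4.00 grid above exactly, and the bicubic call should match the bicubic grid up to rounding (note how cubic interpolation overshoots the input range, as on the slide):

```python
# The same 2x2 input upsampled with bilinear and bicubic interpolation.
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.], [3., 4.]]).view(1, 1, 2, 2)
print(F.interpolate(x, size=4, mode="bilinear"))  # 1.00, 1.25, ..., 4.00
print(F.interpolate(x, size=4, mode="bicubic"))   # overshoots: < 1 and > 4
```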
In-Network Upsampling
In-Network Upsampling: “Max Unpooling”

Max Pooling: remember which position had the max. Max Unpooling: place values into the remembered positions.

Input:      Max pooled:   ...rest of net...   Values to unpool:   Output:
1 2 6 3     5 6                               1 2                 0 0 2 0
3 5 2 1     7 8                               3 4                 0 1 0 0
1 2 2 1                                                           0 0 0 0
7 3 4 8                                                           3 0 0 4

Pair each downsampling layer with an upsampling layer.

Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
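PyTorch implements exactly this pairing via `return_indices`; a small sketch on the slide's input (here the pooled values are unpooled directly, without the intervening "rest of the net"):

```python
# Max pooling that remembers argmax positions, paired with max unpooling.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.tensor([[1., 2., 6., 3.],
                  [3., 5., 2., 1.],
                  [1., 2., 2., 1.],
                  [7., 3., 4., 8.]]).view(1, 1, 4, 4)

pooled, idx = pool(x)      # pooled = [[5, 6], [7, 8]], idx stores positions
out = unpool(pooled, idx)  # max values return to their original positions
print(out.squeeze())
```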
Learnable Upsampling: Transposed Convolution

Recall: a normal 3 x 3 convolution with stride 1, pad 1 takes a dot product between the input and the filter at each position. Input: 4 x 4 → Output: 4 x 4.

With stride 2, pad 1, the same filter visits every other position. Input: 4 x 4 → Output: 2 x 2.

Convolution with stride > 1 is “learnable downsampling”.
Can we use stride < 1 for “learnable upsampling”?
Learnable Upsampling: Transposed Convolution

3 x 3 convolution transpose, stride 2. Input: 2 x 2 → Output: 4 x 4.

Weight the filter by each input value and copy it to the output; the filter moves 2 pixels in the output for every 1 pixel in the input, and outputs are summed where they overlap.

Note: this gives a 5 x 5 output; trim one pixel from the top and left to give a 4 x 4 output.
Transposed Convolution: 1D example

Input: [a, b]; filter: [x, y, z]; stride 2.
Output: [ax, ay, az + bx, by, bz]

• The output has copies of the filter weighted by the input
• Stride 2: the filter moves 2 pixels in the output for each pixel in the input
• Sum at overlaps

This operation has many names:
- Deconvolution (bad!)
- Upconvolution
- Fractionally strided convolution
- Backward strided convolution
- Transposed convolution (best name)
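A small sketch verifying the 1D example with `F.conv_transpose1d`, plus the 2D stride-2 case; `output_padding=1` with `padding=1` is one common way to get an exact 2x upsampling instead of the 5x5-then-trim behavior (the concrete numbers below are arbitrary):

```python
# 1D check: input [a, b], filter [x, y, z], stride 2 -> [ax, ay, az+bx, by, bz]
import torch
import torch.nn.functional as F

a, b = 2.0, 3.0
x_, y_, z_ = 10.0, 20.0, 30.0
inp = torch.tensor([[[a, b]]])        # (N, C_in, L)
w = torch.tensor([[[x_, y_, z_]]])    # (C_in, C_out, K)
print(F.conv_transpose1d(inp, w, stride=2))
# tensor([[[20., 40., 90., 60., 90.]]]) = [ax, ay, az+bx, by, bz]

# 2D: stride-2 transposed conv as learnable 2x upsampling. With padding=1 and
# output_padding=1, a 2x2 input maps to exactly 4x4 (no 5x5-then-trim).
up = torch.nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                              padding=1, output_padding=1)
print(up(torch.randn(1, 1, 2, 2)).shape)  # torch.Size([1, 1, 4, 4])
```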
Transposed Convolution

Convolution: the filter is a little lens that looks at a pixel.
Transposed convolution: the filter is a tile used to build up the image.

Image credit: ifixit.com, thespruce.com
Fully Convolutional Network

Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Downsampling: pooling, strided convolution
Upsampling: interpolation, transposed convolution

Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Loss function: per-pixel cross-entropy

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Skip Connections
Retaining Spatial Details

While the output is H x W, just upsampling often produces results that lack detail and are not aligned with the image. Why? Information about details is lost during downsampling!

Result from Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Retaining Spatial Details

Where is the useful information about the high-frequency details of the image?
(Figure: feature maps at successive network stages A through E.)

Result from Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Retaining Spatial Details

How do you send details forward in the network? You copy the activations forward, and subsequent layers at the same resolution figure out how to fuse things.

Result from Long et al, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
U-Net

Extremely popular architecture; it was originally used for biomedical image segmentation.

Ronneberger et al, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI 2015
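A minimal sketch of the copy-and-fuse idea with a single U-Net-style skip connection (channel widths and depth are illustrative assumptions, far smaller than the real U-Net):

```python
# Tiny U-Net-style model: copy high-res activations forward, concatenate
# them after upsampling, and let the next layer fuse the two.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self, c=3, num_classes=21):
        super().__init__()
        self.enc1 = nn.Conv2d(c, 32, 3, padding=1)               # high-res
        self.enc2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)    # downsample
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)        # upsample
        self.dec = nn.Conv2d(64, num_classes, 3, padding=1)      # fuse

    def forward(self, x):
        e1 = F.relu(self.enc1(x))      # H x W (kept for the skip)
        e2 = F.relu(self.enc2(e1))     # H/2 x W/2
        d = self.up(e2)                # back to H x W
        d = torch.cat([d, e1], dim=1)  # "copy the activations forward"
        return self.dec(d)             # subsequent layer fuses them

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # (1, 21, 64, 64)
```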
Spatial Contexts
The Importance of Spatial Contexts

(Image credit: A. Torralba)

What's this? (No cheating!)
(a) Keyboard?  (b) Hammer?  (c) Old cell phone?  (d) Xbox controller?
(The next slide shows the same object with its surrounding context, which resolves the ambiguity.)

Image credit: COCO dataset
Spatial Contexts: Dilated Convolution

• The receptive field of an element x in layer k + 1 is the set of elements in layer k that influence it
• The receptive field of an element in the 2^i-dilated feature map has size (2^{i+2} − 1) x (2^{i+2} − 1)
• The receptive field grows exponentially while the number of parameters stays constant

Yu et al, “Multi-Scale Context Aggregation by Dilated Convolutions”, ICLR 2016

(Figure: standard convolution vs. dilated convolution.)
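In PyTorch, dilation is just an argument to `nn.Conv2d`; setting `padding=dilation` for a 3x3 kernel keeps the spatial size fixed while the receptive field grows:

```python
# A dilated 3x3 convolution: same 9 weights, larger receptive field.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for d in (1, 2, 4):  # dilation 2^i, as in the multi-scale context module
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    print(d, conv(x).shape)  # spatial size stays 32 x 32
```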


Spatial Contexts: Pyramid Scene Parsing Network

• Hierarchical global prior, containing information at different scales and varying among different sub-regions
• Pyramid pooling module builds a global scene prior on the final feature map
• 1x1 convolutions reduce the number of channels

Zhao et al, “Pyramid Scene Parsing Network”, CVPR 2017
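A simplified sketch of a pyramid-pooling-style module (the bin sizes and channel split are assumptions for illustration, not the exact PSPNet configuration):

```python
# Pyramid-pooling sketch: pool to several grid sizes, reduce channels with
# 1x1 convs, upsample back, and concatenate with the original features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, c, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(c, c // len(bins), 1))
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        priors = [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                align_corners=False) for s in self.stages]
        return torch.cat([x] + priors, dim=1)  # scene prior + local features

feats = torch.randn(1, 256, 16, 16)
print(PyramidPooling(256)(feats).shape)  # (1, 512, 16, 16)
```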
Spatial Contexts: Markov Random Field

Energy function: min E = Unary + Pair

Unary term (per-pixel label confidence, e.g. p_i(label = 'table') = 0.8):
    Unary = -\sum_i \ln p_i(label)

Pairwise term (consistency between pixels i and j; e.g. diss(i, j) = 0.8 for appearance consistency, cost(i; label = 'table') = 0.1 for label consistency):
    Pair = \sum_{i,j} cost(i) \cdot diss(i, j)

Liu et al, “Semantic Image Segmentation via Deep Parsing Network”, ICCV 2015
Spatial Contexts: Markov Random Field

(Figure: learned label-context matrix over the Pascal VOC categories (bkg, aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv); entries range from penalty to favor.)
Spatial Contexts: Markov Random Field

(Figure: qualitative results comparing Original Image, Ground Truth, Unary Term, Triple Penalty, Label Contexts, and Joint Tuning.)
Train and Evaluate FCN
Train Fully Convolutional Network

The CNN maps the 3 x H x W image to C x H x W class scores; apply a C-way classification (softmax cross-entropy) loss at every pixel:

    -\log \frac{\exp((Wx)_{y_i})}{\sum_k \exp((Wx)_k)}

Image credit: Everingham et al, Pascal VOC 2012.
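In PyTorch this per-pixel loss needs no special handling: `nn.CrossEntropyLoss` accepts spatial score maps directly and averages over all pixels:

```python
# Per-pixel softmax cross-entropy: (N, C, H, W) scores vs. (N, H, W) labels.
import torch
import torch.nn as nn

N, C, H, W = 2, 21, 64, 64
scores = torch.randn(N, C, H, W, requires_grad=True)  # CNN output
labels = torch.randint(0, C, (N, H, W))               # ground-truth class map
loss = nn.CrossEntropyLoss()(scores, labels)          # averaged over pixels
loss.backward()
print(loss.item())
```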
Evaluate Fully Convolutional Network

The CNN maps the 3 x H x W input image to C x H x W class scores. How do we convert the final scores into labels? Argmax over labels at every pixel.

Prediction (ŷ) and ground truth (y) are images where each pixel is one of C classes.
Accuracy: mean(ŷ == y)
Mean IoU: intersection over union, averaged over classes
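A sketch of both metrics computed from integer label maps (`segmentation_metrics` is a hypothetical helper, not from the lecture):

```python
# Pixel accuracy and per-class IoU averaged over classes (mean IoU).
import torch

def segmentation_metrics(pred, gt, num_classes):
    acc = (pred == gt).float().mean()
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (gt == c)).sum()
        union = ((pred == c) | (gt == c)).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter.float() / union)
    return acc, torch.stack(ious).mean()

pred = torch.randint(0, 3, (64, 64))
gt = torch.randint(0, 3, (64, 64))
print(segmentation_metrics(pred, gt, num_classes=3))
```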
Intersection over Union (IoU)

How can we compare our prediction to the ground-truth bounding box or mask?

Intersection over Union (IoU), also called “Jaccard similarity” or “Jaccard index”: the area of the intersection divided by the area of the union.

IoU > 0.5 is “decent”, IoU > 0.7 is “pretty good”, IoU > 0.9 is “almost perfect”.
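For axis-aligned boxes the computation is a few lines (a sketch; `torchvision.ops.box_iou` provides a batched equivalent):

```python
# IoU for two axis-aligned boxes in (x1, y1, x2, y2) form.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```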
Part II:
Instance Segmentation
Structured Prediction Tasks

Object Detection: detects individual object instances, but only gives a box. Semantic Segmentation: gives per-pixel labels, but merges instances.
(Figure: detection boxes vs. per-pixel Sky / Cow / Grass labels.)
Things and Stuff
This image is CC0 public domain

Things: object categories that can be separated into object instances (e.g. cats, cars, people)

Stuff: object categories that cannot be separated into instances (e.g. sky, grass, water, trees)
Structured Prediction Tasks

Object Detection: detects individual object instances, but only gives a box (only things!). Semantic Segmentation: gives per-pixel labels, but merges instances (both things and stuff).
Structured Prediction Tasks: Instance Segmentation

Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

Semantic or Instance Segmentation?

(Figure: face parsing, shown both as semantic segmentation and as instance segmentation.)
Liu et al, “Multi-Objective Convolutional Learning for Face Labeling”, CVPR 2015
Instance Segmentation

Instance Segmentation: detect all objects in the image, and identify the pixels that belong to each object (only things!)

Approach: perform object detection, then predict a segmentation mask for each object!

This image is CC0 public domain
Object Detection
Region Proposals
• Find a small set of boxes that are likely to cover all objects
• Often based on heuristics: e.g. look for “blob-like” image regions
• Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Region-based CNN

Classify each region.
Bounding box regression: predict a “transform” to correct the RoI: 4 numbers (t_x, t_y, t_h, t_w)

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015.
Region-based CNN (R-CNN)

Problem: Very slow! Need to do ~2k forward passes for each image!
Solution: Run the CNN *before* warping!
Fast R-CNN vs “Slow” R-CNN

Fast R-CNN: apply differentiable cropping to shared image features. “Slow” R-CNN: run the CNN independently for each region.

Problem: runtime is dominated by region proposals!
Recall: region proposals are computed by the heuristic “Selective Search” algorithm on the CPU; let's learn them with a CNN instead!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
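The “differentiable cropping” step is available directly as RoI Align in torchvision; a small sketch, where the stride-16 feature map and the box coordinates are made-up example values:

```python
# Differentiable cropping of shared features with RoIAlign.
# Boxes are (batch_index, x1, y1, x2, y2) in input-image coordinates;
# spatial_scale maps them onto the downsampled feature map.
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 256, 50, 64)                # backbone features, stride 16
boxes = torch.tensor([[0, 32., 48., 256., 320.]])  # one RoI on an 800x1024 image
crops = roi_align(feats, boxes, output_size=(7, 7),
                  spatial_scale=1 / 16, aligned=True)
print(crops.shape)                                 # (1, 256, 7, 7)
```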
Faster R-CNN: Learnable Region Proposals
Object Detection: Faster R-CNN

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015. Figure copyright 2015, Ross Girshick.
Mask R-CNN
Instance Segmentation: Mask Prediction

CNN + RPN → RoI Align → 256 x 14 x 14 features → Conv → Conv → 256 x 14 x 14

Heads per RoI:
• Classification scores: C
• Box coordinates (per class): 4*C
• Predict a mask for each of C classes: C x 28 x 28

He et al, “Mask R-CNN”, ICCV 2017
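torchvision ships a pretrained Mask R-CNN, so the full pipeline can be run in a few lines (a sketch assuming a recent torchvision; the `weights` argument name has varied across releases):

```python
# Running a pretrained Mask R-CNN from torchvision. Each output dict has
# 'boxes', 'labels', 'scores', and per-instance 'masks' of shape (1, H, W).
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)  # RGB image with values in [0, 1]
with torch.no_grad():
    (out,) = model([image])
print(out["boxes"].shape, out["masks"].shape)
```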
Mask R-CNN: Example Training Targets

Mask R-CNN: Very Good Results!
Accurate Classification and Localization
Hybrid Task Cascade

• Information fusion between bbox and mask prediction


Chen et al, “Hybrid Task Cascade for Instance Segmentation”, CVPR 2019
Content-Aware Upsampling

• Feature upsampling using adaptive kernels that incorporate context

Wang et al, “CARAFE: Content-Aware ReAssembly of FEatures”, ICCV 2019

Content-Aware Upsampling

For each output location, extract the k_up x k_up nearby region N(x_l, k_up) of the low-res feature map, extract the corresponding reassembly kernel W_l', and apply the reassemble operation ⊗ (feature reassembly). A large field of view comes from large k_up x k_up reassembly kernels.
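A simplified, unofficial sketch of this reassembly idea (kernel prediction via a 1x1 compressor plus a 3x3 conv and pixel shuffle; sizes are assumptions, and the real CARAFE implementation differs in details):

```python
# CARAFE-style reassembly sketch: predict a softmax-normalized k_up x k_up
# kernel for every upsampled location, then reassemble the neighborhood.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFESketch(nn.Module):
    def __init__(self, c, c_mid=64, k_up=5, scale=2):
        super().__init__()
        self.k, self.s = k_up, scale
        self.compress = nn.Conv2d(c, c_mid, 1)
        self.kernel_pred = nn.Conv2d(c_mid, scale**2 * k_up**2, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        k, s = self.k, self.s
        # Predict kernels, spread them over the upsampled grid, normalize.
        kern = F.pixel_shuffle(self.kernel_pred(self.compress(x)), s)
        kern = F.softmax(kern, dim=1)                  # (B, k*k, sH, sW)
        # Gather each location's k x k neighborhood of input features.
        nbrs = F.unfold(x, k, padding=k // 2)          # (B, c*k*k, H*W)
        nbrs = nbrs.view(b, c * k * k, h, w)
        nbrs = F.interpolate(nbrs, scale_factor=s, mode="nearest")
        nbrs = nbrs.view(b, c, k * k, s * h, s * w)
        return (nbrs * kern.unsqueeze(1)).sum(dim=2)   # feature reassembly

print(CARAFESketch(32)(torch.randn(1, 32, 8, 8)).shape)  # (1, 32, 16, 16)
```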
Beyond Instance Segmentation

Instance Segmentation: separates object instances, but only things. Semantic Segmentation: identifies both things and stuff, but doesn't separate instances.
(Figure: instance masks for two cows vs. per-pixel Sky / Cow / Grass labels.)
Panoptic Segmentation

Label all pixels in the image (both things and stuff).
For “thing” categories, also separate into instances.
(Figure: Sky, Trees, Grass labeled as stuff; Cow #1 and Cow #2 as separate instances.)

Kirillov et al, “Panoptic Segmentation”, CVPR 2019
Kirillov et al, “Panoptic Feature Pyramid Networks”, CVPR 2019
Joint Mask + X Prediction
Beyond Instance Segmentation: Human Keypoints
Represent the pose of a human
by locating a set of keypoints

e.g. 17 keypoints:
- Nose
- Left / Right eye
- Left / Right ear
- Left / Right shoulder
- Left / Right elbow
- Left / Right wrist
- Left / Right hip
- Left / Right knee
- Left / Right ankle
Person image is CC0 public domain
Mask R-CNN: Keypoint Estimation

Add a keypoint prediction head alongside the mask prediction head.

He et al, “Mask R-CNN”, ICCV 2017
Mask R-CNN: Keypoint Estimation

Heads per RoI: classification scores (C), box coordinates per class (4*C), segmentation mask (C x 28 x 28), plus one mask for each of the K different keypoints (e.g. left ankle, right ankle).

CNN + RPN → RoI Align → 256 x 14 x 14 → Conv… → Keypoint masks: K x 56 x 56
Ground truth has one “pixel” turned on per keypoint. Train with a softmax loss.

He et al, “Mask R-CNN”, ICCV 2017
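A sketch of that loss: treat each of the K keypoint heatmaps as a classification over its 56*56 positions (shapes follow the slide; the target locations below are random stand-ins):

```python
# Keypoint training loss: softmax cross-entropy over the 56x56 positions
# of each keypoint heatmap, with a one-hot ground-truth location.
import torch
import torch.nn.functional as F

K, S = 17, 56
heatmaps = torch.randn(K, S, S)                      # predicted keypoint masks
target_xy = torch.randint(0, S, (K, 2))              # GT pixel per keypoint
target_idx = target_xy[:, 1] * S + target_xy[:, 0]   # flatten (x, y) -> index
loss = F.cross_entropy(heatmaps.view(K, S * S), target_idx)
print(loss.item())
```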
Joint Instance Segmentation and Pose Estimation

He et al, “Mask R-CNN”, ICCV 2017
Guler et al, “DensePose: Dense Human Pose Estimation in the Wild”, CVPR 2018
General Idea: Add Per-Region “Heads” to Faster / Mask R-CNN!

Per-Region Heads: each receives the features after RoI Pool / RoI Align and makes some prediction per region (e.g. mask prediction, keypoint prediction).

He et al, “Mask R-CNN”, ICCV 2017
3D Shape Prediction: Mask R-CNN + Mesh Head

Predict a 3D mesh per region! The mesh predictor is just another per-region head receiving the features after RoI Pool / RoI Align.

Mask R-CNN: 2D image -> 2D shapes. Mesh R-CNN: 2D image -> 3D shapes.

He, Gkioxari, Dollár, and Girshick, “Mask R-CNN”, ICCV 2017
Gkioxari, Malik, and Johnson, “Mesh R-CNN”, ICCV 2019
Open Problems

Problem I: The Devil is in the Tails
Open Long-Tailed Recognition
(Figure: class distribution in an open world, from head classes through tail classes to open classes.)

Liu et al, “Large-Scale Long-Tailed Recognition in an Open World”, CVPR 2019
Gupta et al, “LVIS: A Dataset for Large Vocabulary Instance Segmentation”, CVPR 2019
Wang et al, “Seesaw Loss for Long-Tailed Instance Segmentation”, CVPR 2021

Problem II: The Blessing of Dimensionality
Hong et al, “LiDAR-based Panoptic Segmentation via Dynamic Shifting Network”, CVPR 2021
Summary: Many Structured Prediction Tasks!

Classification    | Semantic Segmentation   | Object Detection | Instance Segmentation
CAT               | GRASS, CAT, TREE, SKY   | DOG, DOG, CAT    | DOG, DOG, CAT
No spatial extent | No objects, just pixels | Multiple objects | Multiple objects

This image is CC0 public domain
Deep Structured Prediction: Resources

ICCV Tutorial on Instance-Level Visual Recognition:
https://instancetutorial.github.io/

MMDetection:
https://github.com/open-mmlab/mmdetection
Next Time:
Transformers
