Object Detection: Convolution-Only Models
Copyright 2019 RESTRICTED CIRCULATION
Object Localisation & Detection (single object)
Source: https://towardsdatascience.com/evolution-of-object-detection-and-localization-algorithms-e241021d8bad
Multiple Objects with Sliding Window
• Slide a window across the image and run the simple CNN classifier that we built earlier on every crop
• Stride can vary
• Window size can vary
• Computation cost is huge (slow models)
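To make the cost concrete, here is a minimal sketch of window generation. The function name, image size, window size and strides are illustrative assumptions, not values from the slides; each generated window would still need its own forward pass through the classifier.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (x, y, w, h) crops covering the image left to right, top to bottom."""
    win_w, win_h = window
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# Every window is cropped and classified separately, so the window count
# is a rough proxy for the computation cost.
coarse = list(sliding_windows(256, 256, window=(64, 64), stride=32))
fine = list(sliding_windows(256, 256, window=(64, 64), stride=8))
print(len(coarse), len(fine))  # 49 625
```

Shrinking the stride from 32 to 8 pixels multiplies the number of CNN evaluations by more than twelve, and that is for a single window size only.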
Issues
• Objects come in multiple aspect ratios
• Multiple bounding boxes for the same object
• Overlapping objects are not handled properly
• Overlapping windows go through repeated convolutions instead of sharing features
Localisation and detection as a single convolutional pass
• Usual CNN layers
• Image is divided into a grid
• Output is a 3×3×8 tensor (one 8-value vector per grid cell)
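A rough sketch of how a training target with that shape could be built. The slide gives only the 3×3×8 shape; reading 8 as 1 (Pc) + 4 (box) + 3 (classes) is an assumption, chosen to match the 1+4+C layout described on the YOLO slide. The function name and the normalised (cx, cy, w, h) box format are also assumptions.

```python
import numpy as np

GRID = 3          # 3x3 grid, as on the slide
NUM_CLASSES = 3   # assumption: 8 = 1 (Pc) + 4 (box) + 3 (classes)

def encode_target(objects, grid=GRID, num_classes=NUM_CLASSES):
    """objects: list of (cx, cy, w, h, class_id), box values in [0, 1] image units."""
    target = np.zeros((grid, grid, 1 + 4 + num_classes), dtype=np.float32)
    for cx, cy, w, h, class_id in objects:
        col = min(int(cx * grid), grid - 1)   # cell containing the box centre
        row = min(int(cy * grid), grid - 1)
        target[row, col, 0] = 1.0             # Pc: an object is present
        target[row, col, 1:5] = (cx, cy, w, h)
        target[row, col, 5 + class_id] = 1.0  # one-hot class
    return target

target = encode_target([(0.5, 0.5, 0.2, 0.3, 1)])
print(target.shape)  # (3, 3, 8)
```

Only the cell that contains the object's centre is marked; all other cells stay at zero, which stands for background.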
Evaluation: IOU (Intersection Over Union)
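IOU is the area of overlap between two boxes divided by the area of their union, so it ranges from 0 (disjoint) to 1 (identical). A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; that box format is an assumption, not something the slides specify.

```python
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7, so 1/7
```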
Non-Max Suppression (NMS)
• Multiple detections of the same object must be brought down to one
• Discard all bounding boxes with Pc < 0.6 (Pc = predicted probability that the box contains an object)
• Pick the box with the highest Pc and discard all boxes which have IOU > 0.5 with that box
• Repeat on the remaining boxes until every box has been either picked or discarded
• For multiple classes, NMS needs to be done separately for each class
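The steps above can be sketched as follows. The thresholds 0.6 and 0.5 come from the slide; the function names and the (box, Pc, class_id) detection format are assumptions made for illustration.

```python
def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, score_thresh=0.6, iou_thresh=0.5):
    """detections: list of (box, pc, class_id). Returns the detections that survive."""
    kept = []
    for cls in sorted({d[2] for d in detections}):       # NMS runs separately per class
        candidates = [d for d in detections
                      if d[2] == cls and d[1] >= score_thresh]  # drop Pc < 0.6
        candidates.sort(key=lambda d: d[1], reverse=True)
        while candidates:
            best = candidates.pop(0)                      # highest remaining Pc
            kept.append(best)
            # Discard boxes that overlap the picked box too much.
            candidates = [d for d in candidates
                          if iou(best[0], d[0]) <= iou_thresh]
    return kept

detections = [
    ((10, 10, 50, 50), 0.90, 0),
    ((12, 12, 52, 52), 0.80, 0),      # near-duplicate of the first box
    ((100, 100, 150, 150), 0.70, 0),  # a different object of the same class
    ((11, 11, 51, 51), 0.85, 1),      # overlaps the first box but is another class
    ((200, 200, 220, 220), 0.30, 0),  # below the Pc threshold
]
for box, pc, cls in nms(detections):
    print(cls, pc, box)
```

The class-1 box survives even though it overlaps the best class-0 box, because suppression only compares boxes within the same class.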
Anchor Boxes
• One grid cell might need to output multiple bounding boxes, possibly for multiple classes
• We can simply output multiple predictions for each grid cell
• The number of such predictions per grid cell is called the number of anchor boxes
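A small sketch of what this looks like in a training target. The slides do not say how an object is assigned to one of the outputs of its cell; matching each object to the anchor whose shape overlaps it best is a common convention and is used here as an assumption. The anchor shapes, grid size and class count are hypothetical.

```python
import numpy as np

def shape_iou(wh_a, wh_b):
    """IOU of two boxes sharing the same centre; only width and height matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

# Hypothetical anchor shapes (width, height) in image units: one tall, one wide.
ANCHORS = [(0.2, 0.6), (0.6, 0.2)]

def encode(objects, grid=3, num_classes=3, anchors=ANCHORS):
    """objects: list of (cx, cy, w, h, class_id) in [0, 1] image units."""
    target = np.zeros((grid, grid, len(anchors), 1 + 4 + num_classes),
                      dtype=np.float32)
    for cx, cy, w, h, class_id in objects:
        row = min(int(cy * grid), grid - 1)
        col = min(int(cx * grid), grid - 1)
        # Assumed rule: use the anchor slot whose shape best matches the object.
        a = max(range(len(anchors)), key=lambda i: shape_iou((w, h), anchors[i]))
        target[row, col, a, 0] = 1.0
        target[row, col, a, 1:5] = (cx, cy, w, h)
        target[row, col, a, 5 + class_id] = 1.0
    return target

# A tall object and a wide object whose centres fall in the same (middle) cell.
t = encode([(0.5, 0.5, 0.25, 0.7, 0), (0.5, 0.55, 0.7, 0.25, 1)])
print(t.shape, t[1, 1, :, 0])
```

With a single output per cell the second object would overwrite the first; with two anchor slots both are kept.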
YOLO
• A CNN with a tensor output is used to build the model (the training data needs to be prepared according to the grid size)
• Output is n×n×A×(1+4+C)
• n = grid size, A = number of anchor boxes, 1 = probability of object vs background, 4 = bounding box coordinates, C = number of classes being considered
• Use NMS at prediction time to clean up the bounding boxes (separately for each class)
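As an illustration of the output layout, the sketch below slices a tensor of shape n×n×A×(1+4+C) into its parts. The values of n, A and C are illustrative, the tensor is random data standing in for a real network output, and the 0.6 cut-off is borrowed from the NMS slide.

```python
import numpy as np

n, A, C = 13, 5, 20   # illustrative values, not taken from the slides
rng = np.random.default_rng(0)
output = rng.random((n, n, A, 1 + 4 + C))   # stand-in for the network output

pc = output[..., 0]              # (n, n, A)    object vs background
boxes = output[..., 1:5]         # (n, n, A, 4) bounding box coordinates
class_scores = output[..., 5:]   # (n, n, A, C) one score per class

keep = pc >= 0.6                                    # first filter, before NMS
kept_boxes = boxes[keep]                            # (K, 4)
kept_classes = class_scores[keep].argmax(axis=-1)   # (K,) most likely class
print(output.shape, kept_boxes.shape, kept_classes.shape)
```

The surviving boxes would then go through NMS, one class at a time.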
SSD: Single Shot MultiBox Detector
• Issue with YOLO: it does not detect objects at different scales very well
• SSD adds convolutional layers of multiple scales on top of the features created by VGG16
• Predictions are made from several of these convolutional outputs
• Outputs of early layers help predict objects at a finer scale because their receptive fields are limited to smaller areas of the image
• As we move deeper, the layers' receptive fields grow larger and they favour predicting larger objects
• Unlike YOLO, SSD does not split the image into a grid of arbitrary size but predicts offsets from predefined anchor boxes (called “default boxes” in the paper) for every location of the feature map.
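A minimal sketch of how default boxes could be laid out over feature maps of several resolutions. The feature map sizes, scales and aspect ratios below are hypothetical values picked to keep the example small; they are not the configuration used in the paper.

```python
def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) default boxes in [0, 1] image units for one square feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            cx = (col + 0.5) / fmap_size   # centre of this feature map location
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * ar ** 0.5      # same area, different shape
                h = scale / ar ** 0.5
                boxes.append((cx, cy, w, h))
    return boxes

# Hypothetical (feature map size, box scale) pairs:
# early maps are fine-grained with small boxes, later maps coarse with large boxes.
levels = [(8, 0.2), (4, 0.4), (2, 0.6), (1, 0.8)]
all_boxes = [b for size, scale in levels for b in default_boxes(size, scale)]
print(len(all_boxes))  # (64 + 16 + 4 + 1) * 3 = 255
```

The network then predicts class scores and coordinate offsets for every one of these boxes in a single forward pass.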
SSD Architecture
Object Detection: Region Proposal Based Models
What is a Region Proposal?
• Region proposal is the process of identifying parts of an image [rectangles] which have a high chance of containing an object instead of background
• Selective Search is a common approach for coming up with region proposals
• It's pretty fast, with high recall [many of the proposed regions might not contain any object, but almost all the objects will be contained in the proposed regions]
• It's not part of the network being built
• For a deep dive into Selective Search:
• https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
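To illustrate what "high recall" means for proposals, here is a rough sketch that measures the fraction of ground-truth objects covered by at least one proposal. The boxes are hand-written hypothetical data, not output from Selective Search, and the IOU >= 0.5 coverage criterion is an assumption.

```python
def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def proposal_recall(gt_boxes, proposals, iou_thresh=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    covered = sum(1 for gt in gt_boxes
                  if any(iou(gt, p) >= iou_thresh for p in proposals))
    return covered / len(gt_boxes)

gt_boxes = [(10, 10, 60, 60), (100, 100, 180, 160)]
proposals = [
    (12, 8, 58, 62),       # covers the first object
    (95, 105, 185, 165),   # covers the second object
    (200, 200, 240, 240),  # background
    (0, 0, 20, 20),        # background
    (150, 20, 190, 50),    # background
]
useful = sum(1 for p in proposals if any(iou(g, p) >= 0.5 for g in gt_boxes))
print(proposal_recall(gt_boxes, proposals), useful, len(proposals))
```

Here every object is covered although only two of the five proposals contain an object; the downstream classifier is expected to reject the rest.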
R-CNN
Issues with R-CNN
• Very slow training because the convnet is run separately on every region proposal
• Prediction is also very slow: 47 seconds/image
Fast R-CNN
• Instead of running convnet feature extraction once per region proposal, the region proposals are projected onto a single convnet feature map
• A linear output from the fully connected layers is used for bounding box regression
• The loss is actually a composite one, containing both classification and regression losses. We can use a weighted sum [weight as a hyperparameter] instead of a simple sum.
• Removing the repeated application of the convnet gives a huge reduction in prediction time as well as training time
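Two of the ideas above can be sketched in a few lines: projecting a proposal onto the shared feature map, and combining the two losses with a weight. The feature stride of 16, the use of cross-entropy and smooth L1 as the two loss terms, and all function names are assumptions for illustration; the slides only say that classification and regression losses are combined.

```python
import math

def project_roi(box, stride=16):
    """Map an image-space (x1, y1, x2, y2) box to feature map cells (assumed stride 16)."""
    x1, y1, x2, y2 = box
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))

def cross_entropy(probs, true_class):
    """Classification loss for one region, given softmax probabilities."""
    return -math.log(probs[true_class])

def smooth_l1(pred, target):
    """Regression loss summed over the four box values."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def detection_loss(probs, true_class, box_pred, box_target,
                   reg_weight=1.0, background=0):
    """Weighted sum of classification and regression losses for one region."""
    loss = cross_entropy(probs, true_class)
    if true_class != background:          # background regions have no box to regress
        loss += reg_weight * smooth_l1(box_pred, box_target)
    return loss

print(project_roi((32, 48, 160, 208)))    # (2, 3, 10, 13)
print(detection_loss([0.1, 0.7, 0.2], 1, (0.1, 0.2, 0.0, 2.0), (0, 0, 0, 0),
                     reg_weight=2.0))
```

Raising `reg_weight` makes the optimiser care more about box accuracy relative to classification accuracy, which is why it is treated as a hyperparameter.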
R-CNN Vs Fast R-CNN
Notice that during prediction, most of the time in Fast R-CNN is taken by the external region proposal process. Faster R-CNN makes the region proposals part of the network as well, which speeds things up further.
Faster R-CNN
Pixel Level Masks: Mask R-CNN
Semantic Segmentation
• Pixel-level classification
• Doesn’t differentiate between two objects of the same class if they are adjacent [no mask boundaries]
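A tiny sketch of why adjacent objects merge. The scores are hypothetical per-pixel class scores for a strip of six pixels, with channel 0 as background and channel 1 as a single object class; imagine two objects of that class occupying pixels 1-2 and 3-4.

```python
import numpy as np

# Hypothetical scores of shape (height=1, width=6, classes=2).
scores = np.array([[[0.9, 0.1], [0.2, 0.8], [0.1, 0.9],
                    [0.3, 0.7], [0.2, 0.8], [0.8, 0.2]]])

# Semantic segmentation: each pixel independently takes its best class.
label_map = scores.argmax(axis=-1)
print(label_map)  # [[0 1 1 1 1 0]]
```

The output is one unbroken run of class 1, so nothing in the label map says where the first object ends and the second begins.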
Mask R-CNN
• Upper branch is essentially doing what Faster R-CNN does
• Lower branch does semantic segmentation inside each bounding box, for each class. This combination eventually gives instance segmentation
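A simplified sketch of how per-box masks turn into instance segmentation. It assumes each detection already comes with a boolean mask resized to its box; the function name and the toy data are hypothetical, and the real network's mask resolution and resizing step are left out.

```python
import numpy as np

def paste_instances(image_shape, detections):
    """detections: list of ((x1, y1, x2, y2), mask), mask shape (y2 - y1, x2 - x1)."""
    instance_map = np.zeros(image_shape, dtype=np.int32)   # 0 = background
    for instance_id, ((x1, y1, x2, y2), mask) in enumerate(detections, start=1):
        region = instance_map[y1:y2, x1:x2]   # a view into the full map
        region[mask] = instance_id            # writes through to instance_map
    return instance_map

# Two adjacent objects of the same class on a strip of six pixels.
detections = [
    ((1, 0, 3, 1), np.ones((1, 2), dtype=bool)),
    ((3, 0, 5, 1), np.ones((1, 2), dtype=bool)),
]
print(paste_instances((1, 6), detections))  # [[0 1 1 2 2 0]]
```

Because each mask is tied to its own bounding box, the two adjacent objects receive different ids even though they share a class.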