
Car Detection with and without Motion Features

Kajsa Sundbeck

Master’s thesis
2016:E52

CENTRUM SCIENTIARUM MATHEMATICARUM

Centre for Mathematical Sciences


Mathematics
Abstract

Object detection is a large research area and has been investigated for many different purposes and objects, such as faces, pedestrians and vehicles. Depending on the application there are different limitations to adjust to, but also possibilities to take advantage of.
The purpose of this thesis is to investigate the improvement of an existing
car detector when the detections are performed in video sequences. The original
detector uses only individual frames and the new one utilizes the video format
by adding motion features in the detection.
Motion features are features which give information about motion in the im-
age. In this work the motion features used come from background images which
are calculated from previous frames in the video. The input to the detector is
both the image and the background image and there is movement in the image
where the images differ.
When evaluation was performed on the training data, the detectors performed better with motion features than without. This indicates that the motion features may carry additional information.
On the validation data, the detectors using motion features detected only moving objects. However, even though the evaluation was restricted to an area of the image where the vehicles were always moving, no significant improvement from the use of motion features could be found.
Acknowledgements

I would like to thank my supervisors Håkan Ardö and Mikael Nilsson for sharing their knowledge and passion for computer vision and image analysis. I often did not know that I needed help, but when I got it my work always took giant leaps ahead. I am also very grateful to them for making me finish now, because I could have continued for another few months if no one had stopped me.
Thank you to my examiner Kalle Åström for being very supportive.
I would also like to thank Roman Juránek who was in the team that wrote the
detector I continued to develop and always took time to answer my questions.

Contents

1 Introduction
  1.1 Machine learning
  1.2 InDeV
  1.3 Related work

2 Features
  2.1 Image channels
  2.2 Aggregate Channel Features
  2.3 Motion features

3 Machine learning
  3.1 Decision trees
    3.1.1 Random forest
    3.1.2 Training decision trees
  3.2 Cascade of classifiers
  3.3 The AdaBoost algorithm
  3.4 Sub-sampling
    3.4.1 Weighted sampling

4 Object detection
  4.1 Sliding window detector
  4.2 Feature pyramids
  4.3 Non-maximal suppression
  4.4 Evaluation of a detector

5 The data
  5.1 Ground truth
  5.2 Negative training data

6 Detector implementation
  6.1 Settings
  6.2 Performing detections
  6.3 Training the detector
    6.3.1 Constructing one decision tree
    6.3.2 Training the cascade of classifiers
  6.4 Adding motion features to the detector

7 Results
  7.1 Verification of implementation
  7.2 Validation
    7.2.1 Detectors B and BMF
    7.2.2 Detector CMF
    7.2.3 Detector DMF
    7.2.4 Detector EMF
    7.2.5 Summary
    7.2.6 Time consumption
  7.3 Evaluation on the manually recorded video
  7.4 Detection examples

8 Discussion
  8.1 Overfitting
  8.2 Time consumption
  8.3 The data
  8.4 Evaluation on the manually recorded video

9 Conclusions

Bibliography

List of abbreviations

ACF Aggregate Channel Features


InDeV In-depth Understanding of Accident Causation for Vulnerable Road
Users

MF Motion Features
NMS Non-Maximal Suppression
PRC Precision Recall Curve

Chapter 1

Introduction

Road traffic safety at different locations is today evaluated manually, which is time consuming and inexact. Being able to automatically track the trajectories of road users would facilitate the work of discovering and assessing dangerous situations in, for example, intersections. This knowledge could be an important part of developing improvements that increase safety.
The first task in tracking is object detection. The objects of interest are
detected in each frame of a video and then the trajectory can be extracted by
identifying an object in consecutive frames and translating the image coordi-
nates to real world coordinates. This thesis will focus on the object detection
problem.
By taking advantage of the fact that the detections are made in video se-
quences and not only individual frames something called motion features can
be used. To perform the detections both the image and a background image,
calculated from previous frames in the video, are used and the detector can take
advantage of the fact that where the image and the background differ there is
(probably) motion. The aim of this thesis is to find out if motion features can
improve the detection of cars.

1.1 Machine learning


Object detection is a very complex problem. Although the appearance of cars or other objects varies greatly in size, colour, perspective and so on, they should all be identified.
One solution would be to create a model of what the object can look like, but that would be very complicated due to the large variations in unconstrained settings. Another way is to provide a framework for the detections and let the computer train on many image examples together with the desired output. If the amount of data is sufficiently large and captures the variation well, and the algorithm can extract the pattern, then a general model can be obtained.
This automated building of the model from examples is called machine learn-
ing.

1.2 InDeV
This thesis is a part of the project InDeV, “In-depth Understanding of Accident Causation for Vulnerable Road Users” [1], which is a European research project about road safety working towards a better understanding of accident causation. In order to fully understand the dangers in traffic it might not be sufficient to look at accident statistics, since there is not enough data to evaluate. One part of InDeV is therefore to analyse incidents which do not lead to accidents but are still similar situations. These incidents are more frequent and could therefore yield sufficient data, and even make it possible to take appropriate measures before any real accidents have happened.
If road users can be tracked in a critical traffic situation, such as an intersec-
tion, the trajectories can be collected and analysed to find dangerous situations
automatically.

1.3 Related work


Object detection and machine learning have been investigated thoroughly, and this work builds on many important previous findings. Some of the most significant works for this thesis are briefly presented below.
Machine learning is used to solve very complex problems, but using too complex learning methods carries a large risk of overfitting: the model adapts very well to the training data but is not general enough to describe previously unseen data. A solution to this is boosting, which uses weak classifiers, each only slightly better than random guessing, and combines many of them into one strong classifier. AdaBoost is one of the most important boosting algorithms and was first presented by Freund and Schapire [10].
A very influential work in the object recognition field was made by Viola and
Jones in 2004 [16]. They created a framework for face detection but it works for
detecting many other types of objects as well. The main contributions that are
relevant to this project are how to combine many weak classifiers in a cascade
structure and the way each weak classifier is built, depending on previous stages
in the cascade, using AdaBoost.
Several recent works are based on the principles of AdaBoost and the work of Viola and Jones, for example the ACF detector developed by Dollár et al., who contributed the use of Aggregate Channel Features (ACF). Previously, other kinds of features had been used, such as Haar-like features, which are differences of sums of pixels in different regions of the image [16]. Dollár et al. found that ACF, which are simply the pixel values in various channels of the image, considerably increased the detection accuracy compared to previous methods [6]. In object detection the task is to find objects in images regardless of scale. Dollár et al. also showed that most features only need to be computed on a coarse set of scales and can be estimated by extrapolation on the dense set of scales in between. This is very important since it speeds up the process considerably and makes it possible to use many different channels.
Juránek et al. [12] continued to work on the ACF detector and also added pose estimation to their detector. In this work, the framework of Juránek et al. is the baseline that is explored and further developed.

Chapter 2

Features

A feature is a characteristic of something that can be described with a number,


either a real number or a limited set of numbers where each number has a
predefined meaning. A fixed set of features collected in a vector can be used to
identify or evaluate different objects or images. A feature vector for a road user
could be:

x = (Number of wheels, Maximal speed (km/h), Does it have a motor? (yes = 1, no = 0))^T,    (2.1)

in that case a car could have the feature vector

xcar = (4, 180, 1)^T.
Feature vectors describing images often consist of several thousand features,
for example Haar-like features [14], which are differences of sums of pixel val-
ues in many different rectangles in the image, or Aggregate Channel Features
(ACF), which are all pixel values in multiple channels of the image. ACF will
be explained in the following sections.

2.1 Image channels


A normal RGB image has three colour channels: red, green and blue. Each of them alone is a grey scale image depicting the same motive but showing different properties. Images can have many other channels as well, produced using linear or non-linear transformations of the image, for example the grey scale image and gradient histogram images in various directions. Figure 2.1 shows the grey scale channel, the gradient magnitude channel and gradient histogram channels in six directions of an image depicting an intersection.
The gradient magnitude channel [6] is calculated from a discrete m × n
signal I(x, y), for example the grey scale image. The discrete first derivatives,
∂I/∂x and ∂I/∂y, are calculated (usually by taking the 1D differences between
adjacent pixels) and then the gradient magnitude image, M, is given by

M(i, j)^2 = (∂I/∂x)(i, j)^2 + (∂I/∂y)(i, j)^2.    (2.2)

Figure 2.1: The top left image shows the grey scale channel of an image, the
top middle image is the gradient magnitude channel and the other images are
the gradient histogram channels in six directions.

An example of a gradient magnitude channel is the second image shown in Figure 2.1.
Each of the gradient histogram channel images represents a direction and
they describe where in the image the gradient has that direction and how big
the gradient is in that point [6, 7]. To compute the gradient histogram channels
the orientations of the gradients, O, are needed:

O(i, j) = arctan( (∂I/∂x)(i, j) / (∂I/∂y)(i, j) ).    (2.3)

The gradient orientations are quantized into Q bins, where Q is the desired number of directions of the gradient histograms, resulting in O(i, j) ∈ {1, . . . , Q}. The gradient histogram channels Hq, for q = 1, . . . , Q, become

Hq(i, j) = M(i, j) · 1(O(i, j) = q),    (2.4)
where 1 is the indicator function. An example of gradient histogram channel
images in six orientations are the six last images displayed in Figure 2.1.
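As a concrete illustration, the channel computation described above can be sketched in a few lines of Python/NumPy. This is not the code used in the thesis; the function name is illustrative, the derivative filters are the simple 1D differences mentioned above, and the orientation quantization is one possible choice.

import numpy as np

def compute_channels(gray, Q=6):
    """Grey scale, gradient magnitude and Q gradient histogram channels."""
    gray = gray.astype(float)
    # Discrete first derivatives as 1D differences between adjacent pixels.
    dIdx = np.zeros_like(gray); dIdx[:, 1:] = gray[:, 1:] - gray[:, :-1]
    dIdy = np.zeros_like(gray); dIdy[1:, :] = gray[1:, :] - gray[:-1, :]

    M = np.sqrt(dIdx**2 + dIdy**2)          # gradient magnitude, Eq. (2.2)
    O = np.arctan2(dIdx, dIdy)              # gradient orientation, cf. Eq. (2.3)

    # Quantize the (unsigned) orientations into Q bins and build the
    # gradient histogram channels, Eq. (2.4).
    bins = np.floor((O % np.pi) / np.pi * Q).astype(int) % Q
    H = [M * (bins == q) for q in range(Q)]
    return [gray, M] + H                    # 2 + Q channels in total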

2.2 Aggregate Channel Features


Aggregate Channel Features (ACF) consist simply of single pixel value look-ups in various image channels. The image channels are downscaled by taking sums of blocks of pixels and then smoothed; in that way each pixel look-up depends on several pixel values in the original image channel, which makes it more robust [6].

If the detector analyses windows of size 20 × 20 and uses the three RGB colour channels, a gradient magnitude channel and six gradient histogram channels, that results in 10 channels of 20 × 20 pixels. The total number of features is then 10 · 20 · 20 = 4000.
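A minimal sketch of the aggregation step, assuming 4 × 4 pixel blocks and a simple 3 × 3 box-filter smoothing (the exact block size and smoothing used by the detector are not specified here):

import numpy as np

def aggregate_channel(channel, block=4):
    """Downscale a channel by summing block x block pixel blocks, then smooth."""
    h, w = channel.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    blocks = channel[:h, :w].reshape(h // block, block, w // block, block)
    agg = blocks.sum(axis=(1, 3))                # sum of each block of pixels
    # Box-filter smoothing so that each feature depends on neighbouring blocks too.
    padded = np.pad(agg, 1, mode='edge')
    rows, cols = agg.shape
    smoothed = sum(padded[i:i + rows, j:j + cols]
                   for i in range(3) for j in range(3)) / 9.0
    return smoothed

The ACF feature vector of a window is then simply all pixel values of all aggregated channels, for example np.concatenate([aggregate_channel(c).ravel() for c in channels]).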

2.3 Motion features


Motion Features (MF) are additional features that give information about motion. They can be obtained for example by analysing an image together with the previous frame, or together with a background image which ideally does not contain any of the things that are moving. An example of an image and its background image is shown in Figure 2.2. From the background image new channels are computed; these can be for example colour channels, the grey scale channel, the gradient magnitude channel or gradient histogram channels, just as for the original image. The purpose of motion features is that the detector should take advantage of the fact that the image and the background differ where there is motion.
A detector which uses the same channels for the normal and the motion feature channels will have twice the number of features compared to a detector with the same channels but no motion features.

(a) Original image. (b) Background image.

Figure 2.2: Original image and background image. When the detector receives both images together as input, it is said to use motion features.

Chapter 3

Machine learning

Machine learning is used to solve problems where the amount of data is too big or the problem is too complex for humans to program directly, for example face detection or product recommendations. When machine learning is used for object detection, instead of having a person decide which characteristics the computer should use to recognize a pattern, the computer is given many examples of what it should find, and also many examples of what is not the object, and finds the pattern on its own. This is called supervised learning, since the computer gets the examples together with the correct answer or label (“object” or “not the object”). There is also unsupervised learning, where the computer only gets examples and should divide them into groups by finding the differences and similarities between the samples. Unsupervised learning is used for finding structures or groups in data, for example whether it is possible to form groups of music tastes from data on what different people listen to. In this thesis only supervised learning will be used and from here on machine learning will imply supervised machine learning.

3.1 Decision trees


There are many model approaches for machine learning such as neural networks
or support vector machines. The model used in this project is decision trees.
Decision trees can be represented as a graph with a tree structure. A sample passes through the tree from the top node, confusingly called the root, and depending on a binary split function the sample is passed on to the left or right branch, where a new node is reached and another function determines the next direction of the sample. After some number of layers the sample reaches one of the leaves, which are the terminal nodes. Depending on which leaf the sample ends up in, it is given a class label or a target value, depending on the application. Figure 3.1 shows an example of a decision tree which classifies different kinds of vehicles using the feature vector example presented in (2.1).

3.1.1 Random forest


To evaluate many features a decision tree either has to grow larger or a number
of small decision trees can be combined into a sequence. The problem with

[Figure 3.1 diagram: root node x1 > 3; if true, node x2 > 50 with leaves Car (true) and Tractor (false); if false, node x3 = 1 with leaves Motorcycle (true) and Bicycle (false).]

Figure 3.1: An example of a decision tree which classifies different kinds of vehicles using the feature vector x = [Number of wheels, Maximal speed (km/h), Has motor? (yes = 1, no = 0)]. One starts from the top node and moves downwards in the tree, to the left if the statement at the node is true and to the right if it is false. The class of the vehicle is determined by the bottom node where it ends up. Of course this tree cannot classify all vehicles correctly. Problems will occur for cars which drive too slowly, three-wheeled cars, electrical bicycles etcetera, and for vehicles which do not belong to any of the categories. To handle that, a more complex classifier is necessary.
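To make the figure concrete, the toy tree can be written directly as code. This is only an illustration of how a sample is routed through binary splits; the feature vector layout follows (2.1) and the thresholds follow Figure 3.1.

def classify_vehicle(x):
    """Toy decision tree from Figure 3.1.

    x = (number of wheels, maximal speed in km/h, has motor: 1 or 0)
    """
    wheels, max_speed, has_motor = x
    if wheels > 3:                                             # root node: x1 > 3
        return "Car" if max_speed > 50 else "Tractor"          # node: x2 > 50
    else:
        return "Motorcycle" if has_motor == 1 else "Bicycle"   # node: x3 = 1

print(classify_vehicle((4, 180, 1)))   # -> Car, the example x_car from Chapter 2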

[Figure 3.2 diagram: a sample x is passed through a sequence of trees and the leaf responses are summed, h13 + h21 + h32 + · · · + hn4 = H.]

Figure 3.2: A forest of decision trees. In this example a sample x is tested in all trees in a sequence and the classification response H consists of the sum of the results hml, where m is the tree number and l is the leaf number.
the results hml where m is the tree number and l is the leaf number.

making one big tree is that it gets very sensitive to overfitting. A sequence of small trees on the other hand, where the splits in the nodes are created with some randomness, can grow very large without suffering from overfitting [4]. The results from testing a sample in all the trees are then added up or averaged in the end. An illustration of a forest of decision trees, where the result from each tree is summed to give the classification response, is shown in Figure 3.2.

3.1.2 Training decision trees


One way of training the trees is according to the Random Forest framework [4]. A tree is built recursively from the root to the leaves. Each node is trained by creating n random splits of the set X which reached the node, each dividing it into two subsets X^L and X^R. The split which gets the lowest value of some error function I(X) is chosen for the node. The same procedure is applied to X^L and X^R until the maximal number of layers is reached or too few training samples reach a node.

3.2 Cascade of classifiers


A sequence of weak classifiers, for example a Random forest, can be structured as a cascade of classifiers by dividing it into layers, with filters in between which discard windows that do not seem to contain the object of interest [16]. Only samples which make it through the entire cascade are classified as correct detections, see the illustration in Figure 3.3. The advantage of the cascade structure is that samples which are easy to identify as non-objects are filtered out early in the cascade and therefore do not require as much time and computing power, so more focus is put on the more difficult cases. Each layer is trained according to the AdaBoost algorithm, which will be explained in the next section. It has been shown that a classifier consisting of a cascade of decision trees constructed using the AdaBoost algorithm produces significantly lower error rates than single decision trees do [11].

3.3 The AdaBoost algorithm


The name AdaBoost is short for Adaptive Boosting, and boosting is a way of combining many weak classifiers into one strong classifier, for example many short decision trees in a Random forest. The idea of AdaBoost is to give weights to the


Figure 3.3: The weak classifiers in a cascade are grouped in layers. When a sub-window is tested it can get discarded by any of the filters between the layers; it will then not be processed further and is considered a “non-object”. A window which passes through the entire cascade gets a detection score and is processed further together with the other sub-windows of the image which also made it through the cascade.

training samples such that previously misclassified samples have more impact on the future training.
Below, in Algorithm 1, one version called Real AdaBoost is presented, but there exist many others [11].
The training set consists of N samples x belonging to either of two classes, y = 1 or y = −1. Initially all samples xi get equal weights wi = 1/N. For each weak classifier the following procedure is repeated: classifier number m is trained using the weights, for example following the Random forest framework described in Section 3.1.2, and letting the error function I(X) depend on the weights. A class probability estimate pm(x), the probability of x being of class y = 1 based on the new weak classifier, is obtained. The probability estimate is determined by the weak classifier’s output for the samples and the weights they have. A classification response fm(x) is calculated from the probabilities. The classification response and the old weights are used to compute new weights such that the less probable a sample is estimated to be of its actual class, the more its weight is increased. The new weights are normalized and the next classifier is trained. The final output of the classifier is the sign of the sum of all classification responses.

Algorithm 1 Real AdaBoost

1. Start with weights wi = 1/N, i = 1, 2, . . . , N.

2. Repeat for m = 1, 2, . . . , M:

   (a) Fit the classifier to obtain a class probability estimate pm(x) = P̂w(y = 1|x) ∈ [0, 1], using the weights wi of the training data.

   (b) Set fm(x) ← (1/2) log( pm(x) / (1 − pm(x)) ) ∈ R.

   (c) Set wi ← wi exp[−yi fm(xi)], i = 1, 2, . . . , N, and renormalize so that Σi wi = 1.

3. Output the sign of the classifier: sign( Σ_{m=1}^{M} fm(x) ).
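A compact sketch of Algorithm 1 in Python follows. It is only meant to show the weight bookkeeping; fit_weak_classifier is a hypothetical helper that, given the current weights, returns a function giving the class probability estimate pm(x) for samples, for example a small decision tree trained as in Section 3.1.2.

import numpy as np

def real_adaboost(X, y, M, fit_weak_classifier):
    """Train M weak classifiers with Real AdaBoost (Algorithm 1).

    X: (N, d) samples, y: (N,) labels in {-1, +1}.
    fit_weak_classifier(X, y, w) -> p, a function mapping samples to P(y=1|x).
    """
    N = len(y)
    w = np.full(N, 1.0 / N)                       # step 1: uniform weights
    responses = []
    for m in range(M):
        p = fit_weak_classifier(X, y, w)          # step 2a: fit using the weights
        eps = 1e-12                               # avoid log(0)
        f = lambda x, p=p: 0.5 * np.log((p(x) + eps) / (1 - p(x) + eps))  # step 2b
        w = w * np.exp(-y * f(X))                 # step 2c: re-weight the samples
        w = w / w.sum()                           # renormalize
        responses.append(f)
    # Final classifier: the sign of the summed responses (step 3).
    return lambda x: np.sign(sum(f(x) for f in responses))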

3.4 Sub-sampling
A way to increase stability, and also to speed up the training process, is to only use smaller active sets, instead of the entire training set, in some parts of the training.
One way of training a sequence of classifiers using smaller active sets is Bagging [3]. Each weak classifier is trained using a subset of the training samples. After training, the weak classifier is evaluated using the entire training set to decide the output score. Bagging stands for Bootstrap aggregating, because Bootstrapping is used for the re-sampling and the final result is then aggregated, for example by voting or averaging, over all the weak classifiers. Bootstrapping means in this case that the sampling is performed uniformly with replacement; in other words, all samples have the same probability of being selected and the subset may contain repetitions.

3.4.1 Weighted sampling


With Bagging the sub-sampling is performed uniformly, but various methods of sub-sampling using the AdaBoost weights have been investigated.
When the samples for the active set are chosen simply by taking the ones with the highest weights it is called Trimming [11]. When Trimming is used, all samples in the set are unique.
The samples can also be drawn randomly using the weights as probabilities, which is called weighted sampling [9]. One implementation of this method is to let the samples have different intervals on a unit length segment, where the lengths of the intervals are the weights of the samples. Random numbers between 0 and 1 are generated and the samples corresponding to the intervals where the random numbers fall are chosen. Since multiple numbers can end up within the same interval there is a risk of repetitions of samples in the subset.
To decrease the risk of choosing the same sample multiple times, quasi-random sampling was proposed by Kalal et al. [13]. If N samples should be chosen, then the line segment is divided into N equally long parts and each random number is restricted to one part.

The sub-sampling method used in this project is a mix of Trimming and Quasi-random sampling: some samples are chosen using Trimming and Quasi-random sampling is performed among the rest of the samples [13]. A sketch of the quasi-random sampling step is given below.
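A minimal sketch of the quasi-random weighted sampling step, assuming the weights are normalized to sum to one; the partition into n equal sub-segments follows the description above.

import numpy as np

def quasi_random_weighted_sample(weights, n):
    """Draw n indices with probabilities given by the weights (quasi-random sampling).

    The unit segment is split into the samples' weight intervals; one random
    number is drawn in each of n equal sub-segments, which spreads the draws
    out and reduces the risk of picking the same sample many times.
    """
    cumulative = np.cumsum(weights)                        # right edge of each interval
    points = (np.arange(n) + np.random.rand(n)) / n        # one draw per sub-segment
    idx = np.searchsorted(cumulative, points)              # interval each point lands in
    return np.minimum(idx, len(weights) - 1)

For example, indices = quasi_random_weighted_sample(w, 1000) could be used to pick the active set for one boosting iteration.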

Chapter 4

Object detection

This thesis is about object detection, which should not be confused with object
recognition. In recognition we have an object of a certain class and want to
identify to which subclass the object belongs, for example identifying who a
person in a photo is. An object detection system should on the other hand be
able to distinguish objects of a certain class from “everything else”, meaning
finding out if and where there is for example a person in a photo. The framework
of detecting objects in an image and how the result can be evaluated is described
in the following sections of this chapter.

4.1 Sliding window detector


A common framework for detectors is the sliding or scanning window paradigm [14]. The training of the object detector is performed using samples of some fixed size m × n. When using the detector to find objects in an image, the whole image is scanned, feeding every possible window of size m × n to the detector, which determines whether it contains the object and how good the detection is. The result is a number of boxes in the image which contain the object; a sketch of the scanning loop is given below.
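A schematic sliding-window loop, assuming a hypothetical score_window classifier and a stride of one pixel; real implementations typically use a larger stride and operate on the channel images rather than on the raw image.

def sliding_window_detect(image, score_window, m=20, n=20, threshold=0.0):
    """Scan every m x n window and keep those that score above the threshold."""
    detections = []
    H, W = image.shape[:2]
    for top in range(H - m + 1):
        for left in range(W - n + 1):
            window = image[top:top + m, left:left + n]
            score = score_window(window)                      # classifier score
            if score > threshold:
                detections.append((left, top, n, m, score))   # (x, y, width, height, score)
    return detections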

4.2 Feature pyramids


The detector can only handle windows of size m × n, as described in the previous section. In order to be able to find objects which are bigger or smaller, the image channels, instead of the detector, are computed at various scales, creating pyramids of images of different sizes in which the detection is performed, but always on windows of size m × n. Since channel images are the same as features when we are using ACF, the image pyramids are called feature pyramids.
An image which is downscaled to half of the original size is said to be one octave down. A common setting is to use eight scales per octave, and the number of octaves depends on the variation in size of the objects in the images. Computing the channels of the image at all scales requires a lot of computational power. It has been shown that, for the channels described and used in this thesis, it is sufficient to compute the channels from the image at each octave and then scale the channels to the intermediate scales [6], which considerably speeds up the process.
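With eight scales per octave the scale factors form a geometric sequence; a small sketch of how the pyramid scales could be enumerated (the number of octaves here is an assumption; see Section 6.1 for the actual detection scale range):

scales_per_octave = 8
octaves = 3
scales = [2 ** (-k / scales_per_octave) for k in range(octaves * scales_per_octave + 1)]
# 1.0, 0.917, 0.841, ..., 0.5 (one octave down), ..., 0.125 (three octaves down)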

4.3 Non-maximal suppression
At the location of an object of interest, many overlapping windows at many scales will be analysed and will probably yield positive detections. The edges of the window which contains a positive detection are called the bounding box of the detection. Figure 4.1 shows an image and 100 randomly chosen detections of the over 1000 produced. The next step for the detector is to use a Non-Maximal Suppression (NMS) algorithm in order to get only one bounding box per object. The resulting detections are those with the highest score such that
no bounding boxes are overlapping by more than some amount a. The area
of overlap between two bounding boxes B1 and B2 is measured by taking the
intersection of the pixels within the boxes divided by the union of the pixels of
the boxes:

Aoverlap = area(B1 ∩ B2) / area(B1 ∪ B2).    (4.1)
The NMS algorithm compares all detections; if two detections overlap, the one with the lower score is suppressed.
The non-maximal suppression algorithm used in this project has time complexity O(n^2) [5].
Finally, the output of the detector consists of the detections which are not suppressed by NMS and have a score above some pre-set threshold θfinal.
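A minimal greedy NMS sketch based on the overlap measure (4.1); boxes are (x, y, width, height, score) tuples and the pairwise comparisons mirror the O(n^2) behaviour mentioned above.

def overlap(b1, b2):
    """Area of overlap (4.1) between two boxes given as (x, y, w, h, score)."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def non_maximal_suppression(detections, a=0.5):
    """Keep the highest-scoring detections so that no two kept boxes overlap more than a."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(overlap(det, k) <= a for k in kept):
            kept.append(det)
    return kept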

Figure 4.1: Of all detections found in this image before NMS, 100 were chosen randomly. The detections are all centred around the cars but are of different sizes or at slightly different locations. Note that the parked cars are not detected; this is because the detector is using motion features and only detects moving cars.

4.4 Evaluation of a detector
Evaluation of a detector is performed by detecting objects in images where all objects, or all objects within a certain area, have correct bounding boxes around them; this data is called the ground truth. The bounding boxes produced by the detector are compared to the ground truth. A detected bounding box is considered correct if the area of overlap (4.1) with a ground truth bounding box is greater than 50 %, according to the Pascal Visual Objects Challenge [8]. Correct detections are called true positives, tp, incorrect detections are false positives, fp, and ground truth bounding boxes that do not have matching detections are called false negatives, fn.
The detections in the output depend on the threshold θfinal mentioned in the
previous section. With a low θfinal many detections will be let through. This
will hopefully yield many true positives and few false negatives but there is a
risk that the output will also contain many false positive detections. When the
threshold is high on the other hand, only the best detections are let through to
the output and the number of false positives will decrease, but the drawback is
that also the true positives will decrease and the false negatives increase.
The evaluation of the detector does not depend on how the threshold θfinal
is chosen; rather, all possible thresholds are tested. For every threshold θ the numbers tpθ, fpθ and fnθ are counted, and the precision, a measure of how many of the detections in the output are relevant, and the recall, the fraction of the present objects which are detected, are calculated:

precisionθ = tpθ / (tpθ + fpθ),    (4.2)

recallθ = tpθ / (tpθ + fnθ).    (4.3)
One way of presenting the result of detections is to plot precision against recall in a Precision Recall Curve (PRC) for the different thresholds. Another option is the F1 score, which is a total grade of the performance of the detector. The F1 score is a number between 0 and 1 which combines the precision and recall:

F1 = 2 · maxθ ( precisionθ · recallθ / (precisionθ + recallθ) ),    (4.4)

where F1 = 1 means that there exists a threshold θ which perfectly separates the correct and incorrect detections and that the correct detections consist of all the objects in the image, and F1 = 0 means that there are no correct detections at all.
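A sketch of this evaluation, assuming a hypothetical helper match_detections(detections, ground_truth, theta) that returns the counts (tp, fp, fn) for one score threshold using the 50 % overlap criterion; the F1 score is then the maximum over the tested thresholds as in (4.4).

def precision_recall_curve(detections, ground_truth, thresholds, match_detections):
    """Compute (precision, recall) pairs for a list of score thresholds."""
    curve = []
    for theta in thresholds:
        tp, fp, fn = match_detections(detections, ground_truth, theta)
        precision = tp / (tp + fp) if tp + fp > 0 else 1.0   # Eq. (4.2)
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0      # Eq. (4.3)
        curve.append((precision, recall))
    return curve

def f1_score(curve):
    """F1 as the best harmonic mean of precision and recall, Eq. (4.4)."""
    return max((2 * p * r / (p + r)) if p + r > 0 else 0.0 for p, r in curve)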

Chapter 5

The data

The data used for the training and most of the testing comes from the PDTV dataset [2], consisting of images from an intersection in Minsk taken by four different cameras during the summer of 2010. The views of the cameras are shown in Figure 5.1. The images used for training were filmed on eight different occasions; with all four cameras this results in 32 video sequences. The test images consist of one longer sequence filmed by all cameras.

(a) Camera 1 (b) Camera 2

(c) Camera 3 (d) Camera 4

Figure 5.1: Images from the four different cameras that were used in this work.
The yellow lines are the zones where the evaluation is performed.

The detector was also tested on additional, manually recorded video material. These images did not have ground truth, so evaluation was only done manually by counting how many cars were detected. This is not part of the evaluation of the detectors and was only done to see how the detector worked on images from a different place than where the training images came from. The additional video has a much higher resolution, and in these videos it is possible to identify persons and cars, so the results cannot be published due to ethical aspects.

5.1 Ground truth


The ground truth of the PDTV dataset consists of the size of the road user, the cameras’ calibration matrices and, for every time point the road user is visible in any of the cameras, the location and front direction of the road user in 3D world coordinates. For training and testing, however, 2D bounding boxes of road users in the image plane and optionally object viewpoint vectors in camera coordinates are needed. The ground truth of the images is calculated through the following steps (a sketch of the projection is given after the list):
• For each time point at which a road user has ground truth, the 3D bounding box in world coordinates is calculated from the position, orientation and size of the vehicle.

• The corners of the bounding box are projected into each of the cameras.
• For every camera where the road user is projected within the image, which means that the road user is visible in that camera, a 2D bounding box is calculated from the projection. This is illustrated in Figure 5.2, which shows the projected 3D box around the road user in an image and the resulting 2D bounding box.
• This bounding box is added to the list of annotated vehicles of that image.
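A minimal sketch of the projection step, assuming a 3 × 4 camera projection matrix P and the eight corners of the 3D box in world coordinates; the 2D bounding box is the smallest axis-aligned box enclosing the projected corners, as in Figure 5.2.

import numpy as np

def project_3d_box(corners_world, P):
    """Project 8 corners (shape (8, 3)) with camera matrix P (3 x 4) and return
    the enclosing 2D bounding box as (x_min, y_min, x_max, y_max)."""
    homogeneous = np.hstack([corners_world, np.ones((8, 1))])   # to homogeneous coordinates
    projected = (P @ homogeneous.T).T                           # (8, 3) image points
    xy = projected[:, :2] / projected[:, 2:3]                   # normalize by depth
    x_min, y_min = xy.min(axis=0)
    x_max, y_max = xy.max(axis=0)
    return x_min, y_min, x_max, y_max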
Each of the eight video sequences that were used for training had ground truth annotations of two vehicles. With all four cameras, and using all the mirrored images as well, this results in 4134 annotations of vehicles in 3112 images. The PDTV dataset includes one video sequence where all road users in the inner part of the intersection have ground truth data; this sequence was used for testing. An area where all cars in the entire sequence are annotated was extracted and only detections within that area were evaluated. To be able to evaluate the motion features better, the area was limited to the parts of the intersection where the vehicles are always moving. The outlines of the test zones are marked in Figure 5.1. In total there were 3004 annotations of vehicles to be found within the test zone.
An overview of the number of annotations in the different sets are showed
in Table 5.1.

Set Training Testing


Camera number 1 2 3 4 1 2 3 4
# annotations 1414 1480 898 342 1252 974 538 240
Total 4134 3004

Table 5.1: Number of annotated cars used for training and testing for each
camera.

The input samples to the detector training have the form X = {x, v, y}, where x ∈ R^{M×N×C} is an M × N window with C channels, v ∈ R^3 is an object viewpoint vector and y ∈ {−1, 1} is the class label, where y = 1 means that the window contains a car and y = −1 means that it does not. The set of windows with y = 1 is the positive data and the windows without cars are the negative data. An object viewpoint vector is the forward direction of the vehicle in relation to the camera.

Figure 5.2: To get the 2D bounding box of a road user in an image the 3D
box around it is projected into the camera. The smallest enclosing box is the
resulting bounding box.

5.2 Negative training data


To create the negative training data two images from each of the 32 scenes of
the training set have been used. All vehicles in those images have been covered
with black boxes to get images consisting of only non-cars. The same mask was
added to the background of the images because the negative images must also
have motion features for the motion features detectors. An example of an image
from the negative training data is displayed in Figure 5.3. The images were also
mirrored resulting in a total of 128 images used as negative training data. The
detector uses cut out windows from the negative images to train on. For the
first layer in the cascade these windows are taken randomly at different scales
and locations. For later layers, detections in the negative images, meaning false
positive detections, made by the incomplete detector are used instead.

(a) Negative image (b) Negative background

Figure 5.3: The left figure shows an example of an image used as negative training data; its background, which is needed if the detector has motion features, is shown to the right. The negative image and its background image are very similar, but notice that there are people in 5.3a but not in 5.3b.

Chapter 6

Detector implementation

The detector used and further developed in this project was written by Juránek et al. [12] and is a boosted cascade detector with a random forest of decision trees using aggregate channel features. Juránek et al. used the same eight channels as in [6], that is, the image in grey scale, the gradient magnitude and the gradient histogram in six directions. The detector also performs pose estimation, but that is not used in this thesis, neither for training nor testing, and it will therefore not be described in this report.

6.1 Settings
The detector uses windows of size 20 × 20 and performs the detection on image sub-windows from size 40 × 40 up to 123 × 123, with eight scales per octave.
The channels used are the same as in [12], calculated from the image but also from the background image when motion features are used. The channels and their numbering are presented in Table 6.1.

Channel # Channel name


1 The image in grey scale
2 Gradient magnitude of the image
3-8 Gradient histogram 1-6 of the image
9 The background in grey scale
10 Gradient magnitude of the background
11-16 Gradient histogram 1-6 of the background

Table 6.1: The channels used and their numbering.

6.2 Performing detections


The detector consists of a sequence of TD = 1024 trees t which all have depth 2, except the final TR = 128 trees which have depth 4. Each node has a binary split function, see the definition in the next section, which directs samples to the left or right node. The classification response h is a score where negative values indicate that the sample does not depict a car and positive values that it does; the
bigger the absolute value is, the more certain the result. Depending on which leaf l in a tree tm a sample x ends up in, it gets a classification response hm(x) = hml. The total response H^m after tree number m is the accumulation of the partial responses from the previous trees:

H^m(x) = Σ_{k=1}^{m} hk(x).    (6.1)

Each stage m in the cascade has an assigned threshold θm which terminates the detection if H^m(x) < θm. The values of the thresholds are

θm = −1    for m ∈ {4, 8, 16, 64, 256},
θm = −∞    otherwise.    (6.2)

After most stages all samples get to pass (θ = −∞), but after some of the stages, more frequently in the beginning, the threshold is θ = −1, so that samples which get large negative responses are rejected. The thresholds with θ = −1 are the filters described in Section 3.2, and the sets of consecutive trees between the filters are the layers. An overview of how the classification of one sub-window is performed in the cascade was shown in Figure 3.3.
The windows which pass through the entire cascade are recognized as objects of interest and H^TD is the detection score. Among overlapping detections, all detections except the one with the highest score are suppressed by the Non-Maximal Suppression algorithm. The time consumption of NMS increases rapidly when the output of the cascade becomes larger, because the time complexity is O(n^2). For practical reasons, if the output after the cascade is bigger than 3000 bounding boxes, the set is reduced to 3000 by drawing samples at random before performing NMS.
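The soft-cascade evaluation of one window can be sketched as follows; trees is assumed to be a list of functions returning the leaf response hm(x) of a window, and the stage thresholds follow (6.2).

import math

def evaluate_cascade(window, trees, reject_stages={4, 8, 16, 64, 256}, reject_threshold=-1.0):
    """Accumulate the tree responses and reject early if the running score falls
    below the threshold at one of the filter stages (Eq. 6.2). Returns the
    detection score, or None if the window is rejected as a non-object."""
    H = 0.0
    for m, tree in enumerate(trees, start=1):
        H += tree(window)                                  # partial response, Eq. (6.1)
        theta = reject_threshold if m in reject_stages else -math.inf
        if H < theta:
            return None                                    # rejected after stage m
    return H                                               # passed the whole cascade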

6.3 Training the detector


The implementation of the detector training is described in this section. First it is explained how the training of each individual tree is executed and then the training of the entire forest. It will be shown that the cascade structure of the classifier does not only affect how detections are performed but also the set-up of the training.

6.3.1 Constructing one decision tree


The training data at each stage in the cascade consists of training samples
xi which are 20 × 20 × C image patches, where C is the number of channels,
together with a label yi ∈ {−1, 1}. The patches with label y = −1 are non-
object samples cut out from the negative training images. The positive samples
have label y = 1 and are cut out from the positive images at the location and
scale indicated by the ground truth.
The detector is trained according to the Real AdaBoost algorithm described
in Algorithm 1. The implementation of the algorithm in this application is
described below.

All the samples have weights wi which are either set to 1/N, where N is the number of samples, if it is the first tree, step (1) in the algorithm, or have the values given in the previous stage.
The fitting of the classifier in step (2a) is done according to the Random Forest framework described in Section 3.1.2. A subset of 1000 positive and 1000 negative samples is used as the active set, see Section 3.4, to fit the
classifier. The subset is obtained using quasi-random sampling and repeated
samples may occur. Each node in the tree is trained recursively from the set X
of samples which reached the node. A number n of random splits are generated
from binary splitting functions. A splitting function is created from the values at two pixel locations in any of the channels, a, b ∈ Z^3, and a threshold δ. The pixel locations a and b are chosen randomly and the differences δi between the pixel values at those locations are calculated for all samples in the set X:

δi = xi(a) − xi(b).    (6.3)

The threshold δ for splitting the sets into two subsets, X L and X R , is obtained
by taking a random value between the lowest and highest value of all the dif-
ferences δi . Which subset a sample will belong to depends on the result of the
inequality
xi (a) − xi (b) < δ. (6.4)
If either of the new subsets are too small the split is discarded. New splits
are generated until n approved splits are obtained or enough attempts have been
performed. The split with the lowest error I(X) is used in the node. For this project the error function used is

I(X) = Σ_{s∈{L,R}} (|X^s| / |X|) · E(X^s),    (6.5)

where E(X) is the classification error, which will be defined shortly. The error function I(X) should, as mentioned in Section 3.3, depend on the weights from AdaBoost. These enter since the error E(X) depends on W+ and W−, which are the sums of the weights of the samples from each class in the node:

E(X) = 2 W+ W−.    (6.6)

Ideally the positive and negative samples end up in separate branches; in that case W+ = 0 in one and W− = 0 in the other, so the product in (6.6) is zero in both branches and therefore the error function (6.5) is also zero.
It is possible to limit the space of the positions a = (a1, a2, a3) and b = (b1, b2, b3) by a matrix

Q = {qi,j} ∈ M_{C×C}([0, 1]),    (6.7)

where qi,j = 0 means that no pair a, b can be such that a3 = i and b3 = j. This results in the pixel value differences only being taken between or within certain channels.
For example, if a detector uses features from three channels and

Q = ( 1 0 0
      0 1 1
      0 1 0 ),    (6.8)

then the binary split function can only be created from differences of features within channel one, within channel two, or between features of channels two and three.
In step (2a) in Algorithm 1 the entire training set is used to calculate the class probability estimate pm(x). That is the estimated probability that a sample x has label y = 1 (car) given the leaf the sample ended up in, using the weights of the training samples. In this application the probability is given by

pm(x) = P̂w(y = 1|x) = W+ / (W+ + W−).    (6.9)
The next step of the AdaBoost algorithm, step (2b), is dedicated to calculating the classification responses fm(x), but here the notation hm(x) will be used instead. In a decision tree the number of possible response values is the number of leaves in the tree. The classification response hm(x) is given by the leaf l where x ends up. If Wl+ and Wl− are the sums of the weights of the training samples of each class which were directed to the leaf l in tree number m, then the classification response becomes, according to Algorithm 1 and (6.9):

hm(x) ← (1/2) log( pm(x) / (1 − pm(x)) ) = (1/2) log( (Wl+ / (Wl+ + Wl−)) / (1 − Wl+ / (Wl+ + Wl−)) ) = (1/2) log( Wl+ / Wl− ).    (6.10)

The classification responses of the leaves, which are used when detections are performed, coincide with the classification responses of the training samples. The classification response h of leaf l in tree m is

hml = (1/2) log( Wl+ / Wl− ).    (6.11)

Lastly the weights wi of the training samples xi are updated. Rewriting step (2c) in AdaBoost with the notation used above, the weights become

wi ← wi exp[−yi hm(xi)] = exp[−yi H^m(xi)],   i = 1, 2, . . . , N,    (6.12)

where N is the number of training samples. The weights are then normalized before the training of the next tree starts.
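A sketch of how one candidate split could be generated and scored under this scheme; patches is an (N, 20, 20, C) array of training windows, labels are in {−1, +1} and weights are the AdaBoost weights. The names are illustrative and the allowed-channel matrix Q is ignored here for brevity.

import numpy as np

def random_split(patches, labels, weights, rng):
    """Generate one random binary split (Eqs. 6.3-6.4) and return it with its error (6.5)."""
    N, H, W, C = patches.shape
    a = (rng.integers(H), rng.integers(W), rng.integers(C))    # first pixel location
    b = (rng.integers(H), rng.integers(W), rng.integers(C))    # second pixel location
    diffs = patches[:, a[0], a[1], a[2]] - patches[:, b[0], b[1], b[2]]
    delta = rng.uniform(diffs.min(), diffs.max())              # random threshold
    goes_left = diffs < delta

    def node_error(mask):
        w_pos = weights[mask & (labels == 1)].sum()
        w_neg = weights[mask & (labels == -1)].sum()
        return 2 * w_pos * w_neg                               # E(X), Eq. (6.6)

    # Weighted sum of the branch errors, Eq. (6.5).
    error = sum(mask.sum() / N * node_error(mask) for mask in (goes_left, ~goes_left))
    return (a, b, delta), error

The node then keeps the best of n such candidates, for example min((random_split(X, y, w, rng) for _ in range(n)), key=lambda s: s[1]).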

6.3.2 Training the cascade of classifiers


The detector’s trees are trained iteratively and during the training each layer of the cascade is one bootstrap¹ iteration [15]. Negative samples which get low scores, that is, samples which are easy to identify as negative, are filtered out. To replace them, the incomplete detector is used to make detections in the negative images. Since those do not contain any cars, the obtained samples will be images of “non-objects” which are difficult for the detector to identify as such. Figure 6.1 shows a schematic diagram of the bootstrapping.
¹ Here bootstrapping is not used in the statistical sense of random sampling with replacement, but as a way of re-sampling from detections.


Figure 6.1: During training the negative dataset is updated to get successively
more difficult images for the detector to work with. The negative samples which
are easy for the detector to identify as “non-objects” are filtered out between
each layer and replaced by the same number of new samples which come from
detections made in images from the negative training data.

6.4 Adding motion features to the detector


The background image used for the motion features can be calculated in many
ways. It could be the median image over all the tested images in the same scene
and the same background would be used for all of them, but that would not
work in a real time setting since then we do not have all images on before hand.
Another problem with a completely static background image is if the scene
changes, such as the change in sunlight during the day or cars which stop for
parking or leave their parking place, then we want the background to be able to
change. In this project the background images were calculated recursively. At a time t the background image Mt is calculated from the previous background image Mt−1 and the current image It:

Mt = It,                                                      if t = 0,
Mt = Mt−1 + (100/t)(Mt−1 < It) − (100/t)(Mt−1 > It),          if 0 < t ≤ 10000,    (6.13)
Mt = Mt−1 + 0.01(Mt−1 < It) − 0.01(Mt−1 > It),                if t > 10000.

The inequalities and arithmetic between the images are computed pixel-wise. In the PDTV dataset the video sequences were too short and the background image did not have time to tune in. To solve this problem an initial background image B was calculated as the median image over the entire sequence. Then it was assumed that the time was t > 10000 and the following images were calculated using formula (6.13).
The background channels used were either the same as for the original images, which resulted in a total of 16 channels, or only the grey scale image and the gradient magnitude, in which case the detector used 10 channels in total.
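The recursive background update (6.13) amounts to nudging each background pixel towards the current frame by a step that shrinks over time; a sketch, assuming floating point images of equal size:

import numpy as np

def update_background(M_prev, I_t, t):
    """One step of the recursive background model in Eq. (6.13)."""
    if t == 0:
        return I_t.copy()
    step = 100.0 / t if t <= 10000 else 0.01          # large steps early, 0.01 later
    # Move each background pixel one step towards the current frame (pixel-wise).
    return M_prev + step * (M_prev < I_t) - step * (M_prev > I_t)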

Chapter 7

Results

To evaluate whether motion features can improve detection in video sequences, four different motion feature settings were tried out and compared to a detector without motion features. The training procedure was also evaluated by training and testing on the same data.

7.1 Verification of implementation


Two verification datasets were set up by taking the positive samples in the ground truth and the negative windows from the last bootstrap iteration of the training of two detectors, one with and the other without motion features. Detectors consisting of only eight decision trees were trained and tested on the verification datasets. Figure 7.1 shows the PRC curves, and it is clear that the motion features detector gets a better result.

[Figure 7.1 plot: precision recall curves; with MF (mean(F1) = 0.7615), without MF (mean(F1) = 0.6482).]

Figure 7.1: The PRC for detectors consisting of 8 trees each which were trained
and tested on the same data. To see the variation between different runs three
detectors with motion features and three without were trained and tested.

7.2 Validation
The evaluation was performed by comparing precision recall curves and F1 scores for detectors with and without motion features.
The first detectors to be evaluated are called A and AMF. Both were trained on all images in the training set and tested on all images in the test set, see Table 5.1. Detector A uses channels 1-8 and AMF additionally uses channels 9-16, which come from the background image. The channel numbers are given in Table 6.1. Figure 7.2 shows the precision recall curves for the four cameras and overall. The F1 scores can be found in Table 7.1.

Detector Camera 1 Camera 2 Camera 3 Camera 4 Overall


A 0.776 0.701 0.612 0.625 0.683
AMF 0.809 0.647 0.481 0.752 0.640

Table 7.1: The F1 scores of the detectors A and AMF when performing detections
in images from different cameras and the overall score.

The detector with motion features performs better when the detection is performed in images from Camera 1 and Camera 4, the ones filmed from above, both according to the F1 score and the PRC. But the results on the images from the other cameras, and the overall result, are better for the detector without motion features.
To simplify and delimit the problem, detectors only handling images filmed from above were tested further. In total, four different settings for motion feature detectors were tested by training and evaluating on images from Camera 1 and 4. The result was compared to one detector without motion features. All settings of the detectors are summarized in Table 7.2 and will also be explained in more detail in their respective sections.

Detector   # train   # test   views   Q-matrix        MF    channels
A          4134      3004     all     1               no    1-8
AMF        4134      3004     all     1               yes   1-16
B          1756      1492     above   1               no    1-8
BMF        1756      1492     above   1               yes   1-16
CMF        1756      1492     above   [1 I; I 0]      yes   1-16
DMF        1756      1492     above   1               yes   1-10
EMF        1756      1492     above   [1 U; U^T V]    yes   1-10

Table 7.2: The settings of the detectors which are presented in the results. The first columns present the number of samples in the training and testing sets, the next from which perspective the cameras are filming. The Q-matrix is the one defined by Equation (6.7), written in block form where 1 denotes a block of ones, 0 a block of zeros and I the identity block. The channels used can be all or some of the 16 channels presented in Table 6.1. Channels 9-16 are produced from the background image; if some or all of those are used then it is a motion features detector. The matrices U and V are sparse matrices which are presented in Section 7.2.4.


Figure 7.2: The precision recall curves of the four cameras and the overall result of the detectors A and AMF. The result is clearly better for the detector with motion features for Camera 1 and 4, the ones from above, but the detector without MF works better for the cameras from the side and also has a better overall result.

7.2.1 Detectors B and BMF


The detectors B and BMF have the same settings as A and AMF except that only images from Camera 1 and 4 are used. Due to the randomized elements of the tree construction, different realizations will yield different results, and for that reason ten trainings of B and BMF were carried out. The ones which got the highest F1 scores are plotted in Figure 7.3. The detector B got a slightly better F1 score and precision recall curve than BMF, but BMF reaches a slightly higher recall.
To find out if and how much the motion features are employed in BMF, the channels of the features used in the tree splits were counted. A matrix P is created by going through all the splits (6.3) and counting the number of times each channel pair is used. The channels of a pair are given by the third coordinates of a = (a1, a2, a3) and b = (b1, b2, b3), but since (a, b) uses the same channels as (b, a) those are counted as the same feature pair and the
[Figure 7.3 plot: precision recall curves; with MF (F1 = 0.8858), without MF (F1 = 0.8913).]

Figure 7.3: The PRC of the detector BMF (with MF) is plotted against the
PRC of the detector B (without MF).

matrix P becomes upper triangular. The matrix is calculated using the formula

P(i, j) = Σk 1( (a_3^k = i ∧ b_3^k = j) ∨ (a_3^k = j ∧ b_3^k = i) ),   if i ≤ j,
P(i, j) = 0,                                                            if i > j,    (7.1)

where the sum is over all splits k in the cascade and 1 is the indicator function.
The P matrices of B and BMF are displayed in Figure 7.4 together with a
histogram over the number of times channels from the image, the background
or a combination of the two are used in the tree splits of BMF . The channels of
the background are used a lot and the most common feature pair is background
image channel together with background gradient channel.

7.2.2 Detector CMF


The detector CMF has a Q-matrix (6.7) which is constructed based on the idea that a motion features detector should make use of the difference between the image and its background to find movement in the image, and not apply differences within or between background channels. The use of pixels in the image channels is unlimited, but a background channel feature can only be used together with a pixel of the same channel type in the image. The Q matrix therefore becomes

QC = ( 1 I
       I 0 ),    (7.2)

where the block matrices have size 8 × 8 and 1 is a matrix consisting of ones, 0 is a matrix consisting of zeros and I is the identity matrix.
Out of 10 detectors trained with this QC matrix, the precision recall curve of the one which got the highest F1 score is shown in Figure 7.5. For comparison, the PRC of the detector B is plotted again in this figure. The detector B performs better than CMF both regarding the F1 score and the precision recall curve, although the difference is not very big.

[Figure 7.4, panel (a): channel pairs used in the tree splits of detector B; panel (b): channel pairs used in the tree splits of detector BMF; panel (c): number of feature pairs in the tree splits of BMF consisting of image (I) or background (BG) channels: (I,I) 1126, (I,BG) 1533, (BG,BG) 536.]

Figure 7.4: An overview of the channels which are used by the detectors B and
BMF . Each pixel in (a) and (b) represents how many times a channel pair was
used in the binary split functions in the trees. The histogram (c) shows the
number of times the channels of the image and the background were used in
different combinations in the tree splits of BMF .

[Figure 7.5 plot: precision recall curves; with MF (F1 = 0.8866), without MF (F1 = 0.8913).]

Figure 7.5: The PRC of the detector CMF (with MF) is plotted against the PRC
of the detector B (without MF). The splits in the trees of the motion features
detector are limited by a matrix QC .

7.2.3 Detector DMF


Having too many features can lead to problems with overfitting. To restrict the
number of features but still having motion features a detector DMF , with the
original eight channels from the image but fewer channels from the background,
is created. The background channels which are utilized the most by the detector
BMF are the grey scale channel and the gradient histogram channel, see Figure
7.4, and therefore these are the channels chosen for the background features of
DMF . Out of 10 realisations of DMF detectors the PRC of the one with the best
F1 score is plotted together with the PRC of B in Figure 7.6. Both PRC and
F1 score are a little bit better for B than for DMF .

7.2.4 Detector EMF


The detector EMF is also inspired by the features used by BMF. The most frequently used feature channel pairs which include background channels are grey scale image with grey scale background, and grey scale background with background gradient, see Figure 7.4. For that reason the EMF detector gets a Q matrix which allows all combinations of features from the image channels but only uses the background channels in the two pairs just mentioned. The histogram channels of the background are therefore not used.

[Figure 7.6 plot: precision vs. recall; with MF (F1 = 0.8808), without MF (F1 = 0.8913)]

Figure 7.6: The PRC of the detector DMF (with MF) is plotted against the
PRC of the detector B (without MF). The motion features detector uses only
the first 2 channels from the background.

The matrix becomes
$$
Q_E =
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}. \qquad (7.3)
$$

The PRC of the one EMF detector out of 10 which got the highest F1 score is
plotted together with the PRC of the non-motion features detector B in Figure
7.7. The result of EMF is marginally better than that of B.
Figure 7.8 displays the number of times different channel combinations are
used in the tree splits of EMF . The two pairs which contain background channels
were the most frequently used.

7.2.5 Summary
The detector settings B, BMF, CMF, DMF and EMF were used to train 10 detectors each. The precision recall curves in the previous sections only show the detector which got the highest F1 score, but the variation between the realisations was big. In Figure 7.9 the F1 scores for all trained detectors are displayed as box plots. The detectors B and EMF got the highest top scores and the detector CMF got the highest median and also the highest lower whisker, but the variation is big, the ranges of the box plots all overlap and no detector stands out as significantly better.
[Figure 7.7 plot: precision vs. recall; with MF (F1 = 0.8935), without MF (F1 = 0.8913)]

Figure 7.7: The PRC of the detector EMF (with MF) is plotted against the
PRC of the detector B (without MF). The EMF detector uses only two channels
from the background image and the tree splits are limited by a Q-matrix. The
detectors perform almost equally and the F1 score of EMF is slightly higher.

Figure 7.8: The number of times different feature channel pairs occur in the
tree splits of the detector EMF .

[Figure 7.9 box plots: F1 score (approximately 0.86–0.89) for the detectors B, BMF, CMF, DMF and EMF.]

Figure 7.9: Ten detectors were trained with B, BMF , CMF , DMF and EMF
settings and their F1 scores are visualized as box plots.


7.2.6 Time consumption


The detector B performs detections at approximately two frames per second
and the motion features detectors at one frame per second.
An important contribution to the time consumption of cascades of classifiers
is how early in the cascade non-objects can be sorted out. The equally per-
forming detectors B and EMF are compared in this aspect, see the left graph
in Figure 7.10. The graph shows the mean number of windows processed by B
and EMF at every stage. At the end of every layer sub-windows are discarded,
creating the staircase shape of the graph. The motion features detector is better
at filtering out windows at early layers in the cascade, for example at the third
layer less than 8000 windows are processed by EMF compared to over 16000
windows for B. At the end of the cascade it is the other way around and the
detector with motion features has more windows to process than the detector without. The behaviour of being fast at sorting out windows early in the cascade but slower at the end was observed for all the detectors BMF, CMF, DMF and EMF when compared to the detector B.
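The staircase behaviour follows from how the cascade is evaluated; a rough sketch of the early rejection at the end of each layer is given below, where the function and parameter names are illustrative and not the actual implementation.

```python
def evaluate_window(window, trees, stage_ends, stage_thresholds):
    """Sketch of cascade evaluation with early rejection. The trees are applied
    in order and at the end of each stage the accumulated score is compared
    against that stage's rejection threshold."""
    score = 0.0
    stage = 0
    for t, tree in enumerate(trees):
        score += tree(window)                      # weak classifier response
        if stage < len(stage_ends) and t + 1 == stage_ends[stage]:
            if score < stage_thresholds[stage]:    # window discarded at this layer
                return None
            stage += 1
    return score                                   # window kept as a candidate detection
```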
The graph to the right in Figure 7.10 shows the number of detections that
are output after the cascade and passed on to the non-maximal suppression. On average 315 detections reached the non-maximal suppression algorithm for B and for EMF the number was 585. What is more important in this case, since non-maximal suppression is $O(n^2)$, is that the number of detections processed by non-maximal suppression reaches much higher values for EMF than for B, 2152 compared to 1119. The time consumption of non-maximal suppression was measured by sending different numbers of bounding boxes to the function. The result is displayed in Figure 7.11. Fitting the measurements to a quadratic function using ordinary least squares estimation gives a function for NMS's time consumption:
$$
y = 6.05 \cdot 10^{-7} x^2, \qquad (7.4)
$$
which is also plotted in the same figure.
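A sketch of how such a measurement and fit could be reproduced is given below; the greedy NMS and the random boxes are simplified stand-ins for the detector's actual implementation and data, used only to illustrate the quadratic behaviour and the least squares fit.

```python
import time
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Plain greedy non-maximal suppression on boxes (x1, y1, x2, y2). Every
    kept box is compared against all candidates, which is what makes the worst
    case quadratic in the number of detections."""
    order = np.argsort(scores)[::-1]
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[:, 0])
        y1 = np.maximum(boxes[i, 1], boxes[:, 1])
        x2 = np.minimum(boxes[i, 2], boxes[:, 2])
        y2 = np.minimum(boxes[i, 3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas - inter)
        suppressed |= iou > iou_thr
    return keep

# Time NMS for different numbers of random boxes and fit t = c * n^2 with
# ordinary least squares (no intercept), mirroring the measurement behind (7.4).
ns, ts = [], []
rng = np.random.default_rng(0)
for n in [500, 1000, 1500, 2000, 2500, 3000]:
    xy = rng.uniform(0, 1000, (n, 2))
    boxes = np.hstack([xy, xy + rng.uniform(20, 60, (n, 2))])
    t0 = time.perf_counter()
    nms(boxes, rng.uniform(size=n))
    ns.append(n)
    ts.append(time.perf_counter() - t0)
n2 = np.asarray(ns, dtype=float) ** 2
c = np.dot(ts, n2) / np.dot(n2, n2)   # least squares estimate of c in t = c * n^2
print(c)
```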

[Figure 7.10 plots: number of windows (logarithmic scale, roughly 10^3–10^5) vs. cascade stage (0, 4, 8, 16, 64, 256, 1024) and at NMS, for the detectors with and without MF.]

Figure 7.10: The left graph shows the mean number of windows processed during
the stages of the cascade when the equally performing detectors B and EMF
are making detections in the images in the test set. In total 257500 windows
inside an image are analysed but after every layer more and more windows are
filtered out. The right graph shows the number of detections that are output
by the cascade and passed on to the non-maximal suppression algorithm. Note the non-linear scale on the x-axis and that the last layer is three times longer than the first five layers together. It is also important to note that the y-axis is logarithmic and that, for example, at layers three and four the number of windows is approximately double for the detector without motion features compared to the detector with motion features.

A way to cut time consumption is to take advantage of the fact that the background image changes very slowly, so the computed image pyramids can be reused for a number of consecutive frames. This sped up the detection by approximately 0.02 seconds per image, but since the time consumption is dominated by the non-maximal suppression it was not a significant improvement of the total time.
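A rough sketch of such caching, with hypothetical class and function names, could look as follows.

```python
class BackgroundPyramidCache:
    """Sketch of reusing the background channel pyramid for several consecutive
    frames. `compute_channel_pyramid` is a stand-in for the detector's actual
    channel and pyramid computation."""

    def __init__(self, compute_channel_pyramid, refresh_every=25):
        self.compute = compute_channel_pyramid
        self.refresh_every = refresh_every
        self.cached = None
        self.age = 0

    def get(self, background_image):
        # Recompute the pyramid only every `refresh_every` frames, since the
        # background model changes very slowly between consecutive frames.
        if self.cached is None or self.age >= self.refresh_every:
            self.cached = self.compute(background_image)
            self.age = 0
        self.age += 1
        return self.cached
```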

7.3 Evaluation on the manually recorded video


Usually one does not have the possibility of training a detector with images from the location where it is going to be used. Therefore the detectors A and AMF were also tested on a video sequence filmed from a similar perspective as Camera 2 and 3 but at a different intersection. Some settings of the image pyramids had to be adjusted since the resolution and size of the images and the relative size of the objects were different. There was no ground truth for this video sequence, so no real evaluation could be performed, but some things were noted. The detectors worked and could detect cars in another setting. The non-motion features detector A made many detections of static background objects which were not cars. The motion features detectors did not make as many false positive detections since they almost only detected moving objects.

[Figure 7.11 plot: time [s] vs. # bounding boxes (0–3,000); measured time and the fitted curve y = 6.05 · 10^{-7} x^2.]

Figure 7.11: The blue line with circles shows the experimentally found time
acquired by non-maximal suppression to process different numbers of bounding
boxes. The red line is the quadratic function which is fitted to the measurements
using ordinary least squares estimation.


7.4 Detection examples


Examples of detections performed by the detectors B and EMF are shown in Figure 7.12. The detections are plotted together with the ground truth annotations and are marked with colours representing whether the detections are correct, incorrect or ignored.
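What counts as a correct detection is decided by its overlap with the ground truth; a minimal sketch of such matching is given below, where the greedy strategy and the 0.5 intersection-over-union threshold are assumptions made for illustration and not necessarily the exact protocol used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_detections(detections, ground_truth, thr=0.5):
    """Greedy matching: each detection, in order of decreasing score, claims the
    best unmatched ground truth box with IoU at least `thr`. Returns one
    correct/incorrect label per detection (in score order) and the set of
    detected ground truth boxes."""
    matched_gt = set()
    labels = []
    for det in sorted(detections, key=lambda d: -d["score"]):
        best, best_iou = None, thr
        for k, gt in enumerate(ground_truth):
            overlap = iou(det["box"], gt)
            if k not in matched_gt and overlap >= best_iou:
                best, best_iou = k, overlap
        if best is not None:
            matched_gt.add(best)
        labels.append(best is not None)
    return labels, matched_gt
```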

[Figure 7.12 panels: (a) Detector B, (b) Detector EMF]

Figure 7.12: Detections performed by B and EMF. The solid lines are the ground truth annotations and the dashed lines are the detections. Green markings mean correct detections or detected ground truth, yellow detections and ground truth are ignored because they are outside the zone, and red markings mean incorrect detections or not detected ground truth. The white numbers are the detection scores. Note that parked cars are detected by B but usually not by EMF. One parked car is detected by EMF but it got a low score.

Chapter 8

Discussion

In the implementations of motion features detectors in this work all image chan-
nels were kept and experiments were made by trying different settings in the
application of the background channels. This means that no information was
removed, only new information was added. If the machine learning algorithm
works properly then when training and testing on the same data and more in-
formation is added the result should only stay the same or become improved.
The result of the verification test, see Figure 7.1, confirmed this and therefore it
was assumed that the implementation of the motion features was correct. Since
the verification result was significantly better for the motion features detector
and the background channel features were frequently used, there was additional
information in the motion features in the training set. The question, however, is whether that information is general and can be applied to the test set.
There was a slight improvement in some aspects for a few motion features
detectors but no significant difference could be found when testing on the vali-
dation data. For some detectors the use of motion features even seemed to make
the detection worse. Some possible reasons for this and ideas for further work are discussed in this chapter.

8.1 Overfitting
The background channels were frequently used by the motion features detectors,
see Figure 7.4 and 7.8. It seems as though the detectors find a lot of information
in the background channels during training, since they are used a lot, but that
the extracted pattern is not general enough. One problem when using a bigger
number of features is that the risk of overfitting increases. The solution to that
is increasing the training set. Lack of data is a very common problem when
working with machine learning and in that aspect this project was no exception.
The training set from all cameras consisted of 3112 images; this can be compared to Juránek et al., who trained their detector with 20000 images [12]. A bigger
amount of data would probably reduce overfitting and also produce a result with
less variation between different realisations which would make investigation and
evaluation easier.
Throughout this work all the image channels (channels 1–8) were
used for all settings. Further experiments could be made varying this parameter.

Using fewer of the image channels could be a solution to overfitting since that
would decrease the number of features.

8.2 Time consumption


For real time applications, such as running a surveillance system that evaluates traffic safety continuously, the time consumption is very important.
All motion features detectors were better at filtering out many windows as
non-cars early in the cascade. The reason for this might be that they mainly
find the moving cars, while detectors without motion features also detect parked
cars and there might be a lot of them. At the very last stages the number of
detections of B decreased rapidly and the non-motion features detector got
fewer windows to process than all the motion features detectors. The last stage
is three times longer than the other stages together and the last 128 trees have
double depth so as it turned out the number of operations in the cascade evened
out. An idea could be to use motion features only in the beginning of the cascade
to quickly filter out detections.
The number of detections that had to be processed by non-maximal suppression had the greatest impact on the time consumption. The main reason that the detectors which used motion features were slower was that the number of windows output after the cascade was much larger. The most efficient way to make detection faster would be to make the non-maximal suppression algorithm faster or to reduce the number of windows it has to process.

A consequence of this was that, for practical reasons, the number of detections sometimes needed to be reduced randomly before non-maximal suppression.
The motion features detector AMF which used images of all views got an output
after the cascade of sometimes over 10000 bounding boxes. That would take
approximately a minute to process by non-maximal suppression (7.4). Reducing
the number of windows randomly of course might worsen the detection perfor-
mance since the best detections could get discarded. Increasing the threshold
until a reasonable number of detections were discarded was not an option since
some cars are more difficult to detect than others and that would only throw
away all difficult and therefore low scoring detections and keep too many detec-
tions of the easier cases.

8.3 The data


The negative images were created by covering the cars with black boxes, but it would have been better if the negative images instead had annotations of the locations of the cars and the detector were only allowed to sample from sub-windows without cars. In the beginning of the training there will probably be black boxes in the negative training samples, since they are sampled randomly, and the detector will learn that those are part of the background. In the following bootstrap iterations fewer and fewer windows will include black areas, which is good, but the early stages of the detector will still redundantly focus on discarding windows with uniform black areas.
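A minimal sketch of the suggested alternative, sampling negative windows that avoid the annotated car boxes, is given below; the box format and the overlap test are assumptions made for the example.

```python
import random

def overlaps(a, b):
    """True if two axis-aligned boxes (x1, y1, x2, y2) intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def sample_negative_windows(image_size, car_boxes, window_size, n_samples, seed=0):
    """Draw random sub-windows that do not overlap any annotated car box,
    instead of blacking the cars out in the negative images."""
    rng = random.Random(seed)
    w_img, h_img = image_size
    w, h = window_size
    windows = []
    for _ in range(100 * n_samples):    # give up eventually if the image is mostly cars
        if len(windows) == n_samples:
            break
        x, y = rng.randint(0, w_img - w), rng.randint(0, h_img - h)
        window = (x, y, x + w, y + h)
        if not any(overlaps(window, box) for box in car_boxes):
            windows.append(window)
    return windows
```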
Considering the low amount of data the result was already quite good with-
out the motion features and there might not be much room for improvement.

Maybe motion features could be more of an asset in the detection in a more
difficult setting.

8.4 Evaluation on the manually recorded video


It was not possible to draw any conclusion about which detector performed better on the manually recorded data, except that the motion features detectors made fewer false positive detections on the surrounding static objects.

Chapter 9

Conclusions

The motion features detectors were implemented by using, in addition to the channel features computed from an image, channels from a background image which did not contain any moving objects.
Due to lack of data the problem was simplified by limiting the original
dataset to images taken by the cameras viewing the scene from above.
Four detectors with different settings for the use of the background chan-
nels were analysed and compared to a detector without motion features. The
motion features detectors yielded better results than the non-motion features detector when evaluating on the training data. When evaluation was performed on a test set the motion features detectors sometimes got equal or worse results compared to the detector without motion features. One reason for this could be that a larger number of features increases the risk of overfitting and requires a larger training set to give a proper result, but the dataset available was limited. The results of a few of the motion features detectors were slightly better in some aspects, but the variation was big and no significant improvement could be shown.
Motion features detectors can only be used when all objects of interest are
moving. If that is the case then an advantage is that for example parked cars,
which have been standing in the same spot for a long time, will become a part
of the background and will not be detected. Most static objects which a normal
detector can mistake for a car will not be found by a detector which is using motion features. It could be investigated further whether there would be a benefit from
using motion features only in the beginning of a cascade to sort out uninteresting
static objects rapidly.

Bibliography

[1] In-depth understanding of Accident Causation for Vulnerable Road Users


(InDeV). http://www.indev-project.eu/InDeV/EN/Home/home_node.html.
[2] PDTV dataset. ftp://barbapappa.tft.lth.se/.
[3] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[4] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[5] Piotr Dollár. Piotr's Computer Vision Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[6] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature
pyramids for object detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 36(8):1532–1545, 2014.
[7] Piotr Dollár, Zhuowen Tu, Pietro Perona, and Serge Belongie. Integral channel features. In BMVC, 2009.
[8] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams,
John Winn, and Andrew Zisserman. The pascal visual object classes
challenge: A retrospective. International Journal of Computer Vision,
111(1):98–136, 2015.
[9] François Fleuret and Donald Geman. Stationary features and cat detection.
Journal of Machine Learning Research, 9(Nov):2549–2578, 2008.
[10] Yoav Freund and Robert E Schapire. A desicion-theoretic generalization
of on-line learning and an application to boosting. In European conference
on computational learning theory, pages 23–37. Springer, 1995.
[11] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic
regression: a statistical view of boosting (with discussion and a rejoinder
by the authors). The annals of statistics, 28(2):337–407, 2000.
[12] Roman Juranek, Adam Herout, Marketa Dubska, and Pavel Zemcik. Real-
time pose estimation piggybacked on object detection. In Proceedings of
the IEEE International Conference on Computer Vision, pages 2381–2389,
2015.
[13] Zdenek Kalal, Jiri Matas, and Krystian Mikolajczyk. Weighted sampling
for large-scale boosting. In BMVC, pages 1–10, 2008.

[14] Constantine Papageorgiou and Tomaso Poggio. A trainable system for
object detection. International Journal of Computer Vision, 38(1):15–33,
2000.

[15] K-K Sung and Tomaso Poggio. Example-based learning for view-based
human face detection. IEEE Transactions on pattern analysis and machine
intelligence, 20(1):39–51, 1998.
[16] Paul Viola and Michael J Jones. Robust real-time face detection. Interna-
tional journal of computer vision, 57(2):137–154, 2004.

Master’s Theses in Mathematical Sciences 2016:E52
ISSN 1404-6342
LUTFMA-3308-2016
Mathematics
Centre for Mathematical Sciences
Lund University
Box 118, SE-221 00 Lund, Sweden
http://www.maths.lth.se/
