Car Detection Motion
Kajsa Sundbeck
Master’s thesis
2016:E52
Object detection is a large research area and has been investigated for many different purposes and objects, such as faces, pedestrians and vehicles. Depending on the application there are different limitations to adjust to, but also possibilities to take advantage of.
The purpose of this thesis is to investigate whether an existing car detector can be improved when the detections are performed in video sequences. The original detector uses only individual frames, and the new one utilizes the video format by adding motion features to the detection.
Motion features are features which give information about motion in the image. In this work the motion features come from background images which are calculated from previous frames in the video. The input to the detector is both the image and the background image, and there is movement in the image where the two differ.
When evaluation was performed on the training data the detectors performed better with motion features than without, which indicates that there might be additional information in the motion features.
On the validation data it was observed that a detector with motion features only detects moving objects. However, even though the evaluation was restricted to an area of the image where the vehicles are always moving, no significant improvement from the use of motion features could be found.
Acknowledgements
I would like to thank my supervisors Håkan Ardö and Mikael Nilsson for sharing their knowledge and passion for computer vision and image analysis. I often did not know that I needed help, but when I got it my work always took giant leaps ahead. I am also very grateful to them for making me finish now, because I could have continued for another few months if no one had stopped me.
Thank you to my examiner Kalle Åström for being very supportive.
I would also like to thank Roman Juránek who was in the team that wrote the
detector I continued to develop and always took time to answer my questions.
Contents
1 Introduction
  1.1 Machine learning
  1.2 InDeV
  1.3 Related work
2 Features
  2.1 Image channels
  2.2 Aggregate Channel Features
  2.3 Motion features
3 Machine learning
  3.1 Decision trees
    3.1.1 Random forest
    3.1.2 Training decision trees
  3.2 Cascade of classifiers
  3.3 The AdaBoost algorithm
  3.4 Sub-sampling
    3.4.1 Weighted sampling
4 Object detection
  4.1 Sliding window detector
  4.2 Feature pyramids
  4.3 Non-maximal suppression
  4.4 Evaluation of a detector
5 The data
  5.1 Ground truth
  5.2 Negative training data
6 Detector implementation
  6.1 Settings
  6.2 Performing detections
  6.3 Training the detector
    6.3.1 Constructing one decision tree
    6.3.2 Training the cascade of classifiers
  6.4 Adding motion features to the detector
7 Results
  7.1 Verification of implementation
  7.2 Validation
    7.2.1 Detectors B and BMF
    7.2.2 Detector CMF
    7.2.3 Detector DMF
    7.2.4 Detector EMF
    7.2.5 Summary
    7.2.6 Time consumption
  7.3 Evaluation on the manually recorded video
  7.4 Detection examples
8 Discussion
  8.1 Overfitting
  8.2 Time consumption
  8.3 The data
  8.4 Evaluation on the manually recorded video
9 Conclusions
Bibliography
List of abbreviations
MF Motion Features
NMS Non-Maximal Suppression
PRC Precision Recall Curve
Chapter 1
Introduction
Road traffic safety is today evaluated manually, which is time consuming and inexact. Being able to automatically track the trajectories of road users would facilitate the work of discovering and assessing dangerous situations in, for example, intersections. This knowledge could be an important part in developing improvements which increase safety.
The first task in tracking is object detection. The objects of interest are detected in each frame of a video, and then the trajectory can be extracted by identifying an object in consecutive frames and translating the image coordinates to real world coordinates. This thesis will focus on the object detection problem.
By taking advantage of the fact that the detections are made in video sequences and not only individual frames, something called motion features can be used. To perform the detections both the image and a background image, calculated from previous frames in the video, are used, and the detector can take advantage of the fact that where the image and the background differ there is (probably) motion. The aim of this thesis is to find out if motion features can improve the detection of cars.
1.2 InDeV
This thesis is a part of the project InDeV, “In-depth Understanding of Accident Causation for Vulnerable Road Users” [1], a European research project about road safety that works towards a better understanding of accident causation. In order to fully understand the dangers in traffic it might not be sufficient to look at accident statistics, since there is not enough data to evaluate. One part of InDeV is therefore to analyse incidents which do not lead to accidents but are still similar situations. These incidents are more frequent and could therefore yield sufficient data, and even make it possible to take appropriate measures before any real accidents have happened.
If road users can be tracked in a critical traffic situation, such as an intersection, the trajectories can be collected and analysed to find dangerous situations automatically.
Chapter 2
Features
Figure 2.1: The top left image shows the grey scale channel of an image, the
top middle image is the gradient magnitude channel and the other images are
the gradient histogram channels in six directions.
If the detector analyses windows of size 20 × 20 and uses the three RGB colour channels, a gradient magnitude channel and six gradient histogram channels, that results in 10 channels of 20 × 20 pixels. The total number of features is then 10 · 20 · 20 = 4000.
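As a rough illustration of how such a channel stack can be assembled for one window, consider the sketch below. The greyscale conversion, gradient operator and orientation binning are illustrative choices and not necessarily the ones used by the detector in this thesis.

import numpy as np

def compute_channels(window, n_orient=6):
    """Sketch: three colour channels, one gradient magnitude channel and
    n_orient oriented gradient histogram channels for a 20 x 20 RGB window."""
    grey = window.mean(axis=2)                  # grey scale image used for the gradients
    gy, gx = np.gradient(grey)                  # image derivatives
    mag = np.hypot(gx, gy)                      # gradient magnitude channel
    ang = np.mod(np.arctan2(gy, gx), np.pi)     # orientation folded into [0, pi)
    bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)
    hist = np.stack([mag * (bins == o) for o in range(n_orient)])
    channels = np.concatenate([window.transpose(2, 0, 1), mag[None], hist])
    return channels.reshape(-1)                 # 10 channels of 20 x 20 pixels = 4000 features

features = compute_channels(np.random.rand(20, 20, 3))
assert features.size == 4000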
Figure 2.2: Original image and background image. When the detector receives both images together as input it is said to be using motion features.
Chapter 3
Machine learning
Machine learning is used to solve problems where the amount of data is too big or the problem is too complex for humans to program directly, for example face detection or product recommendations. When machine learning is used for object detection, instead of having a person decide which characteristics the computer should use to recognize a pattern, the computer is given many examples of what it should find, and also many examples of what is not the object, and is left to find the pattern on its own. This is called supervised learning since the computer gets the examples together with the correct answer or label (“object” or “not the object”). There is also unsupervised learning, where the computer only gets examples and should divide them into groups by finding the differences and similarities between the samples. Unsupervised learning is used for finding structures or groups in data, for example whether it is possible to form groups of music tastes from data on what different people listen to. In this thesis only supervised learning will be used, and from here on machine learning will imply supervised machine learning.
(Figure: a small decision tree in which each internal node tests a condition such as x1 > 3, x2 > 50 or x3 = 1, with true and false branches.)
making one big tree is that it gets very sensitive to overfitting. A sequence of small trees on the other hand, where the splits in the nodes are created with some randomness, can grow very large without suffering from overfitting [4]. The result from testing a sample in all the trees is then added up or averaged in the end. An illustration of a forest of decision trees, where the result from each tree is summed to give the classification response, is shown in Figure 3.2.
Figure 3.3: The weak classifiers in a cascade are grouped in layers. When a sub-window is tested it can be discarded by any of the filters between the layers; it is then not processed further and is considered a “non-object”. A window which passes through the entire cascade gets a detection score and is processed further together with the other sub-windows of the image which also made it through the cascade.
training samples such that previously misclassified samples have more impact
on the future training.
One version, called Real AdaBoost, is presented below and in Algorithm 1, but many others exist [11].
The training set consists of N samples x belonging to either of two classes, y = 1 or y = −1. Initially all samples x_i get equal weights w_i = 1/N. For each weak classifier the following procedure is repeated: classifier number m is trained using the weights, for example following the Random forest framework described in Section 3.1.2, and letting the error function I(X) depend on the weights. A class probability estimate p_m(x), the probability of x being of class y = 1 based on the new weak classifier, is obtained. The probability estimate is determined by the weak classifier's output for the samples and the weights they have. A classification response f_m(x) is calculated from the probabilities. The classification response and the old weights are used to compute new weights, such that the less probable a sample is estimated to be of its actual class, the more its weight is increased. The new weights are normalized and the next classifier is trained. The final output of the classifier is the sign of the sum of all classification responses.
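As a concrete illustration of these steps, the sketch below runs Real AdaBoost with a randomized decision stump (a depth-one tree) as the weak classifier. The stump learner, the smoothing constant eps and all names are illustrative assumptions and not the thesis implementation.

import numpy as np

def real_adaboost(X, y, n_rounds=64, eps=1e-6, seed=0):
    """Real AdaBoost sketch: y contains labels +1/-1, X is (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                       # step (1): equal initial weights
    model = []
    for _ in range(n_rounds):
        j = rng.integers(X.shape[1])              # step (2a): fit a weak classifier
        t = np.median(X[:, j])                    # (here a random-feature stump)
        left = X[:, j] < t
        h = np.empty(2)
        for k, mask in enumerate((left, ~left)):  # class probability estimate per leaf
            wp = w[mask & (y == 1)].sum() + eps
            wn = w[mask & (y == -1)].sum() + eps
            h[k] = 0.5 * np.log(wp / wn)          # step (2b): classification response
        f = np.where(left, h[0], h[1])
        w *= np.exp(-y * f)                       # step (2c): raise the weights of samples
        w /= w.sum()                              # classified with low probability, normalize
        model.append((j, t, h[0], h[1]))
    return model

def predict(model, X):
    """Final classifier: sign of the sum of all classification responses."""
    F = sum(np.where(X[:, j] < t, h0, h1) for j, t, h0, h1 in model)
    return np.sign(F)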
Algorithm 1 Real AdaBoost
3.4 Sub-sampling
A way to increase stability and also to speed up the training process is to only
use smaller active sets, instead of the entire training set, at some parts of the
training.
One way of training a sequence of classifiers using smaller active sets is Bagging [3]. Each weak classifier is trained using a subset of the training samples. After training, the weak classifier is evaluated using the entire training set to decide the output score. Bagging stands for Bootstrap aggregating, because Bootstrapping is used for the re-sampling and the final result is then aggregated, for example by voting or averaging, over all the weak classifiers. Bootstrapping means in this case that the sampling is performed uniformly with replacement; in other words, all samples have the same probability of being selected and the subset may contain repetitions.
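A minimal sketch of Bagging is given below; train_weak and predict_weak are placeholders for an arbitrary weak learner and are not part of the thesis detector.

import numpy as np

def bootstrap_sample(n_samples, subset_size, rng):
    """Uniform sampling with replacement: all samples are equally likely
    to be selected and the subset may contain repetitions."""
    return rng.integers(0, n_samples, size=subset_size)

def bagging(X, y, train_weak, predict_weak, n_classifiers=10, subset_size=None, seed=0):
    """Each weak classifier is trained on its own bootstrap subset; the final
    result is aggregated, here by averaging, over all weak classifiers."""
    rng = np.random.default_rng(seed)
    size = subset_size or len(y)
    models = []
    for _ in range(n_classifiers):
        idx = bootstrap_sample(len(y), size, rng)
        models.append(train_weak(X[idx], y[idx]))
    return lambda X_new: np.mean([predict_weak(m, X_new) for m in models], axis=0)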
The sub-sampling method used in this project is a mix between Trimming
and Quasi-random sampling. Some samples are chosen using Trimming and
Quasi-random sampling is performed among the rest of the samples [13].
Chapter 4
Object detection
This thesis is about object detection, which should not be confused with object
recognition. In recognition we have an object of a certain class and want to
identify to which subclass the object belongs, for example identifying who a
person in a photo is. An object detection system should on the other hand be
able to distinguish objects of a certain class from “everything else”, meaning
finding out if and where there is for example a person in a photo. The framework
of detecting objects in an image and how the result can be evaluated is described
in the following sections of this chapter.
4.3 Non-maximal suppression
At the location of an object of interest, many overlapping windows at many scales will be analysed and probably yield positive detections. The edges of the window which contains a positive detection are called the bounding box of the detection. Figure 4.1 shows an image and 100 randomly chosen detections of the over 1000 produced. The next step for the detector is to use a Non-Maximal Suppression (NMS) algorithm in order to get only one bounding box per object. The resulting detections are those with the highest scores such that no bounding boxes overlap by more than some amount a. The area of overlap between two bounding boxes B_1 and B_2 is measured by taking the intersection of the pixels within the boxes divided by the union of the pixels of the boxes:
A_{\mathrm{overlap}} = \frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_1 \cup B_2)}.   (4.1)
The NMS algorithm matches all detections against each other; if two detections overlap, the one with the lowest score is suppressed. The non-maximal suppression algorithm used in this project has time complexity O(n^2) [5].
Finally, the output of the detector consists of the detections which are not suppressed by NMS and have a score above some pre-set threshold θ_final.
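The sketch below illustrates the overlap measure (4.1) and a greedy O(n^2) suppression of the kind described above. The box representation (x1, y1, x2, y2) and the default values of a and theta_final are illustrative assumptions.

import numpy as np

def overlap(b1, b2):
    """Area of overlap (4.1): intersection over union of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
             + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, a=0.5, theta_final=0.0):
    """Greedy NMS: keep the highest-scoring box, suppress boxes overlapping it
    by more than a, repeat, and finally apply the score threshold theta_final."""
    order = np.argsort(scores)[::-1]              # highest score first
    keep = []
    for i in order:
        if all(overlap(boxes[i], boxes[j]) <= a for j in keep):
            keep.append(i)
    return [i for i in keep if scores[i] > theta_final]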
Figure 4.1: Of all detections found in this image before NMS, 100 were chosen randomly. The detections are all centred around the cars but are of different sizes or at slightly different locations. Note that the parked cars are not detected; this is because the detector is using motion features and only detects moving cars.
4.4 Evaluation of a detector
Evaluation of a detector is performed by detecting objects in images where all objects, or all objects within a certain area, have correct bounding boxes around them; that data is called the ground truth. The bounding boxes produced by the detector are compared to the ground truth. A detected bounding box is considered correct if the area of overlap (4.1) with a ground truth bounding box is greater than 50 %, according to the Pascal Visual Objects Challenge [8]. Correct detections are called true positives, tp, incorrect detections are false positives, fp, and a ground truth bounding box that does not have a matching detection is called a false negative, fn.
The detections in the output depend on the threshold θfinal mentioned in the
previous section. With a low θfinal many detections will be let through. This
will hopefully yield many true positives and few false negatives but there is a
risk that the output will also contain many false positive detections. When the
threshold is high on the other hand, only the best detections are let through to
the output and the number of false positives will decrease, but the drawback is
that also the true positives will decrease and the false negatives increase.
The evaluation of the detector does not depend on how the threshold θ_final is chosen; rather, all possible thresholds are tested. For every threshold θ the numbers of tp_θ, fp_θ and fn_θ are counted, and the precision, a measure of how many of the detections in the output are relevant, and the recall, the fraction of the present objects which are detected, are calculated:

\mathrm{precision}_\theta = \frac{tp_\theta}{tp_\theta + fp_\theta},   (4.2)

\mathrm{recall}_\theta = \frac{tp_\theta}{tp_\theta + fn_\theta}.   (4.3)
One way of presenting the result of detections is to plot precision against recall in a Precision Recall Curve (PRC) for the different thresholds. Another option is the F1 score, which is an overall grade of the performance of the detector. The F1 score is a number between 0 and 1 which combines the precision and the recall:

F_1 = 2 \cdot \max_\theta \frac{\mathrm{precision}_\theta \cdot \mathrm{recall}_\theta}{\mathrm{precision}_\theta + \mathrm{recall}_\theta},   (4.4)

where F_1 = 1 means that there exists a threshold θ which perfectly separates the correct and incorrect detections and that the correct detections consist of all the objects in the image, and F_1 = 0 means that there are no correct detections at all.
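A compact sketch of this evaluation is given below. It assumes the detections have already been matched against the ground truth; names and the handling of empty denominators are illustrative.

def precision_recall_f1(detections, n_ground_truth, thresholds):
    """detections is a list of (score, is_true_positive) pairs; every ground
    truth box without a matching detection counts as a false negative."""
    best_f1, curve = 0.0, []
    for theta in thresholds:
        kept = [tp for score, tp in detections if score > theta]
        tp = sum(kept)                            # correct detections
        fp = len(kept) - tp                       # incorrect detections
        fn = n_ground_truth - tp                  # undetected ground truth objects
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        curve.append((recall, precision))         # one point on the PRC
        if precision + recall:
            best_f1 = max(best_f1, 2 * precision * recall / (precision + recall))
    return curve, best_f1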
Chapter 5
The data
The data used for the training and most of the testing comes from the PDTV dataset [2], consisting of images from an intersection in Minsk taken by four different cameras during the summer of 2010. The views of the cameras are shown in Figure 5.1. The images used for training were filmed on eight different occasions; with all four cameras, that results in 32 video sequences. The test images consist of one longer sequence filmed by all cameras.
Figure 5.1: Images from the four different cameras that were used in this work.
The yellow lines are the zones where the evaluation is performed.
The detector was also tested on additional, manually recorded video material. These images did not have ground truth, so evaluation was only done manually by counting how many cars were detected. This is not part of the evaluation of the detectors and was only made to see how the detector worked on images from a different place than where the training images came from. The additional videos have a much higher resolution and in these videos it is possible to identify persons and cars, so the result cannot be published due to ethical aspects.
• The corners of the bounding box are projected into each of the cameras.
• For every camera where the road user is projected within the image, which means that the road user is visible in the camera, a 2D bounding box is calculated from the projection. This is illustrated in Figure 5.2, which shows the projected 3D box around the road user in an image and the resulting 2D bounding box.
• This bounding box is added to the list of annotated vehicles of that image.
Each of the eight video sequences that were used for training had ground truth annotations of two vehicles. From all four cameras, and using all the mirrored images as well, this results in 4134 annotations of vehicles in 3112 images. The PDTV dataset includes one video sequence where all road users in the inner part of the intersection have ground truth data; this sequence was used for testing. An area where all cars in the entire sequence are annotated was extracted and only detections within that area were evaluated. To be able to evaluate the motion features better, the area was limited to the parts of the intersection where the vehicles are always moving. The outlines of the test zones are marked in Figure 5.1. In total there were 3004 annotations of vehicles to be found within the test zone.
An overview of the number of annotations in the different sets is shown in Table 5.1.
Table 5.1: Number of annotated cars used for training and testing for each
camera.
The input samples to the detector training have the form X = {x, v, y}, where x ∈ R^{M×N×C} is an M × N window with C channels, v ∈ R^3 is an object viewpoint vector and y ∈ {−1, 1} is the class label, where y = 1 means that the window contains a car and y = −1 means that it does not. The set of windows with y = 1 is the positive data and the windows without cars are the negative data. An object viewpoint vector is the forward direction of the vehicle in relation to the camera.
Figure 5.2: To get the 2D bounding box of a road user in an image the 3D
box around it is projected into the camera. The smallest enclosing box is the
resulting bounding box.
(a) Negative image (b) Negative background
Figure 5.3: The left figure shows an example of an image used as negative training data, and its background, which is needed if the detector has motion features, is shown to the right. The negative image and its background image are very similar, but notice that there are people in 5.3a but not in 5.3b.
Chapter 6
Detector implementation
The detector used and further developed in this project was written by Juránek et al. [12] and is a boosted cascade detector with a random forest of decision trees using aggregate channel features. Juránek et al. used the same eight channels as in [6], that is the image in grey scale, the gradient magnitude and the gradient histogram in six directions. The detector also performs pose estimation, but that is not used in this thesis, neither for training nor testing, and it will therefore not be described in this report.
6.1 Settings
The detector uses windows of size 20 × 20 and performs the detection on image sub-windows from size 40 × 40 up to 123 × 123, with eight scales per octave.
The channels used are the same as in [12], calculated from the image but
also from the background image when motion features are used. The channels
and their numbering are presented in Table 6.1.
bigger the absolute value is, the more certain the result. Depending on which leaf l in a tree t_m a sample x ends up in, it gets a classification response h_m(x) = h_l^m. The total response H^m after tree number m is the accumulation of the partial responses from the previous trees:

H^m(x) = \sum_{k=1}^{m} h_k(x).   (6.1)
Each stage m in the cascade has an assigned threshold θ^m which terminates the detection if H^m(x) < θ^m. The values of the thresholds are

\theta^m = \begin{cases} -1, & \text{for } m \in \{4, 8, 16, 64, 256\}, \\ -\infty, & \text{otherwise}. \end{cases}   (6.2)

After most stages all samples get to pass (θ = −∞), but after some of the stages, more frequently in the beginning, the threshold is θ = −1, so that samples which get large negative responses are rejected. The thresholds with θ = −1 are the filters described in Section 3.2, and the sets of consecutive trees between the filters are the layers. An overview of how classification of one sub-window is performed in the cascade is shown in Figure 3.3.
The windows which pass through the entire cascade are recognized as objects of interest, and the accumulated response H^T after the last tree is the detection score. Among overlapping detections, all detections except the one with the highest score are suppressed by a Non-Maximal Suppression algorithm. The time consumption of NMS increases rapidly when the output of the cascade becomes larger because the time complexity is O(n^2). For practical reasons, if the output after the cascade is bigger than 3000 bounding boxes, the set is simply reduced to 3000 by drawing samples at random before performing NMS.
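The sketch below shows how one sub-window could be scored by the cascade according to (6.1) and (6.2). Each tree is represented as a callable returning its leaf response, which is a simplification and not the actual implementation.

import numpy as np

def cascade_score(x, trees, filter_stages=(4, 8, 16, 64, 256), theta_filter=-1.0):
    """Accumulate the tree responses h_m(x); at the filter stages the window is
    rejected if the running total falls below -1, all other stages effectively
    use a threshold of minus infinity."""
    H = 0.0
    for m, tree in enumerate(trees, start=1):
        H += tree(x)                                      # H^m(x) = sum of h_k(x), k <= m
        theta_m = theta_filter if m in filter_stages else -np.inf
        if H < theta_m:
            return None                                   # rejected "non-object", not processed further
    return H                                              # detection score of a surviving window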
All the samples have weights w_i which are either set to 1/N, where N is the number of samples, if it is the first tree, step (1) in the Algorithm, or have the values given in the previous stage.
The fitting of the classifier in step (2a) is made according to the Random Forest framework described in Section 3.1.2. A subset of 1000 positive and 1000 negative samples is used as the active set, see Section 3.4, to fit the classifier. The subset is obtained using quasi-random sampling and repeated samples may occur. Each node in the tree is trained recursively from the set X of samples which reached the node. A number n of random splits are generated from binary splitting functions. A splitting function is created from the values at two pixel locations in any of the channels, a, b ∈ Z^3, and a threshold δ. The pixel locations a and b are chosen randomly and the difference between the pixel values at those locations is calculated for all samples x_i in the set X:

\delta_i = x_i(a) - x_i(b).   (6.3)
The threshold δ for splitting the set into two subsets, X^L and X^R, is obtained by taking a random value between the lowest and highest values of all the differences δ_i. Which subset a sample will belong to depends on the result of the inequality

x_i(a) - x_i(b) < \delta.   (6.4)
If either of the new subsets is too small the split is discarded. New splits are generated until n approved splits are obtained or enough attempts have been performed. The split with the lowest error I(X) is used in the node. For this project the error function used is

I(X) = \sum_{s \in \{L,R\}} \frac{|X^s|}{|X|} E(X^s),   (6.5)
where E(X) is the classification error which will be defined shortly. The error function I(X) should, as mentioned in Section 3.3, depend on the weights from AdaBoost. Those come in since the error E(X) depends on W^+ and W^-, which are the sums of the weights of the samples from each class in the node:

E(X) = 2\sqrt{W^+ W^-}.   (6.6)

Ideally the positive and negative samples end up in separate branches; in that way W^+ = 0 in one and W^- = 0 in the other, the product in (6.6) is zero in both branches, and therefore the error function (6.5) is also zero.
It is possible to limit the space of the positions a = (a_1, a_2, a_3) and b = (b_1, b_2, b_3) by a matrix Q,
where q_{i,j} = 0 means that no pair a, b can be such that a_3 = i and b_3 = j. This results in that the pixel value differences will only be taken between or within certain channels.
For example, if a detector uses features from three channels and

Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix},   (6.8)

then the binary split function can only be created from differences of features within channel one, within channel two, or between channels two and three.
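As an illustration of the split generation described in this section, including the Q-matrix constraint, a sketch could look like the one below. The number of candidate splits, the minimum subset size and all names are illustrative assumptions.

import numpy as np

def generate_split(X_node, weights, labels, Q, n_candidates=20, min_size=5, seed=0):
    """X_node has shape (n_samples, M, N, C). Candidate splits compare the values
    at two random pixel locations a and b against a random threshold delta; pairs
    whose channels are disallowed by Q are skipped, and the candidate with the
    lowest error I(X) from (6.5)-(6.6) is returned."""
    rng = np.random.default_rng(seed)
    n, M, N, C = X_node.shape
    best = None
    for _ in range(n_candidates):
        a = (rng.integers(M), rng.integers(N), rng.integers(C))
        b = (rng.integers(M), rng.integers(N), rng.integers(C))
        if not Q[a[2], b[2]]:                          # channel pair disallowed by the Q-matrix
            continue
        diffs = X_node[:, a[0], a[1], a[2]] - X_node[:, b[0], b[1], b[2]]
        delta = rng.uniform(diffs.min(), diffs.max())  # random threshold between the extremes
        left = diffs < delta                           # the inequality (6.4)
        if left.sum() < min_size or (~left).sum() < min_size:
            continue                                   # discard splits with too small subsets
        err = 0.0
        for mask in (left, ~left):                     # error (6.5) with E(X) = 2*sqrt(W+ W-)
            wp = weights[mask & (labels == 1)].sum()
            wn = weights[mask & (labels == -1)].sum()
            err += (mask.sum() / n) * 2.0 * np.sqrt(wp * wn)
        if best is None or err < best[0]:
            best = (err, a, b, delta)
    return best                                        # (error, a, b, delta) or None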
In step (2a) in Algorithm 1 the entire training set is used to calculate the class probability estimate p_m(x). That is the estimated probability that a sample x has label y = 1 (car) given the leaf the sample ended up in, using the weights of the training samples. In this application the probability is given by

p_m(x) = \hat{P}_w(y = 1 \mid x) = \frac{W^+}{W^+ + W^-}.   (6.9)
The next step of the AdaBoost algorithm, step (2b), is dedicated to calculating the classification responses f_m(x), but here the notation h_m(x) will be used instead. In a decision tree the number of possible response values is the number of leaves in the tree. The classification response h_m(x) is given by the leaf l where x ends up. If W_l^+ and W_l^- are the sums of the weights of the training samples of each class which were directed to the leaf l in tree number m, then the classification response becomes, according to Algorithm 1 and (6.9):

h^m(x) \leftarrow \frac{1}{2}\log\frac{p_m(x)}{1 - p_m(x)} = \frac{1}{2}\log\frac{\frac{W_l^+}{W_l^+ + W_l^-}}{1 - \frac{W_l^+}{W_l^+ + W_l^-}} = \frac{1}{2}\log\frac{W_l^+}{W_l^-}.   (6.10)
The classification responses of the leaves, which are used when detections are performed, coincide with the classification responses of the training samples. The classification response h of leaf l in tree m is:

h_l^m = \frac{1}{2}\log\frac{W_l^+}{W_l^-}.   (6.11)
Lastly the weights w_i of the training samples x_i are updated. Rewriting step (2c) in AdaBoost with the notation used above, the weights become

w_i \leftarrow w_i \, e^{-y_i h^m(x_i)}, \quad i = 1, \ldots, N,   (6.12)

where N is the number of training samples. The weights are then normalized before the training of the next tree starts.
Figure 6.1: During training the negative dataset is updated to get successively
more difficult images for the detector to work with. The negative samples which
are easy for the detector to identify as “non-objects” are filtered out between
each layer and replaced by the same number of new samples which come from
detections made in images from the negative training data.
M_t = \begin{cases} I_t, & \text{if } t = 0, \\ M_{t-1} + \frac{100}{t}(M_{t-1} < I_t) - \frac{100}{t}(M_{t-1} > I_t), & \text{if } 0 < t \le 10000, \\ M_{t-1} + 0.01(M_{t-1} < I_t) - 0.01(M_{t-1} > I_t), & \text{if } t > 10000. \end{cases}   (6.13)
The inequalities and arithmetic between the images are computed pixel-wise. In the PDTV dataset the video sequences were too short and the background image did not have time to tune in. To solve this problem an initial background image B was calculated as the median image over the entire sequence. Then it was assumed that the time was t > 10000 and the following images were calculated using formula (6.13).
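A sketch of this update and initialization scheme is given below; the images are assumed to be float-valued numpy arrays and the function names are illustrative.

import numpy as np

def update_background(M_prev, I_t, t):
    """Running background model (6.13): every pixel of M is nudged towards the
    current frame I_t, with a step of 100/t that settles at 0.01 after 10000 frames.
    Comparisons and arithmetic are pixel-wise."""
    if t == 0:
        return I_t.astype(float)
    step = 100.0 / t if t <= 10000 else 0.01
    return M_prev + step * (M_prev < I_t) - step * (M_prev > I_t)

def init_background(frames):
    """For short sequences, start from the median image of the whole sequence
    and then update as if t > 10000 (a fixed 0.01 step)."""
    return np.median(np.stack(frames).astype(float), axis=0)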
The background channels used were either the same as for the original images, which resulted in a total of 16 channels, or only the grey scale image and the gradient magnitude, in which case the detector used 10 channels in total.
Chapter 7
Results
Figure 7.1: The PRC for detectors consisting of 8 trees each which were trained
and tested on the same data. To see the variation between different runs three
detectors with motion features and three without were trained and tested.
7.2 Validation
The evaluation was performed by comparing precision recall curves and F1 scores
for detectors with and without motion features.
The first detectors to be evaluated are called A and AMF . Both were trained
on all images in the training set and tested on all images in the test set, see
Table 5.1. Detector A uses channels 1-8 and AMF uses additionally the channels
9-16 which come from the background image. The channel numbers are given
by Table 6.1. Figure 7.2 shows the precision recall curves for the four cameras
and overall. The F1 scores can be found in Table 7.1.
Table 7.1: The F1 scores of the detectors A and AMF when performing detections
in images from different cameras and the overall score.
The detector with motion features performs better when the detection is
performed in images from Camera 1 and Camera 4, the ones which are from
above, both according to the F1 score and the PRC. But the result on the images
from the other cameras and the overall result is better for the detector which
does not have motion features.
To simplify and delimit the problem, detectors only handling images filmed from above were tested further. In total, four different settings for motion features detectors were tested by training and evaluating on images from Cameras 1 and 4. The result was compared to one detector without motion features. All settings of the detectors are summarized in Table 7.2 and will also be explained in more detail in their respective sections.
Settings
Detector   # train   # test   views   Q-matrix        MF    channels
A          4134      3004     all     1               no    1-8
AMF        4134      3004     all     1               yes   1-16
B          1756      1492     above   1               no    1-8
BMF        1756      1492     above   1               yes   1-16
CMF        1756      1492     above   [1 I; I 0]      yes   1-16
DMF        1756      1492     above   1               yes   1-10
EMF        1756      1492     above   [1 U; U^T V]    yes   1-10
Table 7.2: The settings of the detectors which are presented in the results. The first columns present the number of samples in the training and testing sets, and the next from which perspective the cameras are filming. The Q matrix is the one defined by Equation (6.7). The channels used can be all or some of the 16 channels presented in Table 6.1. Channels 9-16 are produced from the background image; if some or all of those are used then it is a motion features detector. The matrices U and V are sparse matrices which are presented in Section 7.2.4.
Figure 7.2: The precision recall curves for the four cameras and the overall result of the detectors A and AMF. The result is clearly better for the detector with motion features for Cameras 1 and 4, the ones from above, but the detector without MF works better for the cameras from the side and also has a better overall result.
Figure 7.3: The PRC of the detector BMF (with MF) is plotted against the
PRC of the detector B (without MF).
matrix P become upper triangular. The matrix is calculated using the formula
(Figure 7.4 panels: (a) channel-pair usage counts for detector B and (b) channel-pair usage counts for detector BMF, with axes labelled by the image and background channels and the number of splits; (c) histogram of split counts for BMF by channel-pair type: (I,I) 1126, (I,BG) 1533, (BG,BG) 536.)
Figure 7.4: An overview of the channels which are used by the detectors B and
BMF . Each pixel in (a) and (b) represents how many times a channel pair was
used in the binary split functions in the trees. The histogram (c) shows the
number of times the channels of the image and the background were used in
different combinations in the tree splits of BMF .
Figure 7.5: The PRC of the detector CMF (with MF) is plotted against the PRC
of the detector B (without MF). The splits in the trees of the motion features
detector are limited by a matrix QC .
Figure 7.6: The PRC of the detector DMF (with MF) is plotted against the
PRC of the detector B (without MF). The motion features detector uses only
the first 2 channels from the background.
channels of the background are therefore not used. The matrix becomes:

Q_E = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}.   (7.3)
The PRC of the one EMF detector out of 10 which got the highest F1 score is
plotted together with the PRC of the non-motion features detector B in Figure
7.7. The result of EMF is marginally better than B.
Figure 7.8 displays the number of times different channel combinations are
used in the tree splits of EMF . The two pairs which contain background channels
were the most frequently used.
7.2.5 Summary
The detector settings B, BMF , CMF , DMF and EMF were used to train 10
detectors each. The precision recall curves in the previous sections only show the
detector which got the highest F1 score but the variation between the realisations
was big. In Figure 7.9 the F1 scores for all trained detectors are displayed as
box plots. The detectors B and EMF got the highest top scores and the detector
CMF got the highest median and also the highest lower whisker but the variation
Figure 7.7: The PRC of the detector EMF (with MF) is plotted against the
PRC of the detector B (without MF). The EMF detector uses only two channels
from the background image and the tree splits are limited by a Q-matrix. The
detectors are performing almost equally and the F1 score of EMF is slightly
higher.
Figure 7.8: The number of times different feature channel pairs occur in the
tree splits of the detector EMF .
Figure 7.9: Ten detectors were trained with B, BMF , CMF , DMF and EMF
settings and their F1 scores are visualized as box plots.
is big, the ranges of the box plots are all overlapping and no detector stands
out as significantly better.
NMS's time consumption:

y = 6.05 \cdot 10^{-7} x^2,   (7.4)

which is also plotted in the same figure.
Figure 7.10: The left graph shows the mean number of windows processed during the stages of the cascade when the equally performing detectors B and EMF are making detections in the images in the test set. In total 257500 windows inside an image are analysed, but after every layer more and more windows are filtered out. The right graph shows the number of detections that are output by the cascade and passed on to the non-maximal suppression algorithm. Note the non-linear scale on the x-axis and that the last layer is three times longer than the first five layers together. It is also important to note that the y-axis is logarithmic; for example at layers three and four the number of windows is approximately double for the detector without motion features compared to the detector with motion features.
A way to cut time consumption is to take advantage of the fact that the background image changes very slowly, so the computed image pyramids can be reused for some number of consecutive frames. This increased efficiency did speed up the detection by approximately 0.02 seconds/image, but since the time consumption is dominated by the non-maximal suppression it was not a significant improvement of the total time.
Figure 7.11: The blue line with circles shows the experimentally measured time taken by non-maximal suppression to process different numbers of bounding boxes. The red line is the quadratic function fitted to the measurements using ordinary least squares estimation.
Figure 7.12: Detections performed by B and EMF. The solid lines are the ground truth annotations and the dashed lines are the detections. Green markings mean a correct detection or a detected ground truth, yellow detections and ground truth are ignored because they are outside the zone, and red markings mean incorrect detections or undetected ground truth. The white numbers are the detection scores. Note that parked cars are detected by B but not usually by EMF. One parked car is detected by EMF, but it got a low score.
Chapter 8
Discussion
In the implementations of motion features detectors in this work all image channels were kept, and experiments were made by trying different settings in the application of the background channels. This means that no information was removed, only new information was added. If the machine learning algorithm works properly, then when training and testing on the same data, adding more information should only keep the result the same or improve it. The result of the verification test, see Figure 7.1, confirmed this, and therefore it was assumed that the implementation of the motion features was correct. Since the verification result was significantly better for the motion features detector and the background channel features were frequently used, there was additional information in the motion features in the training set. But the question is whether that information is general and can be applied to the test set.
There was a slight improvement in some aspects for a few motion features detectors, but no significant difference could be found when testing on the validation data. For some detectors the use of motion features even seemed to make the detection worse. Some possible reasons for this and ideas for further work are discussed in this chapter.
8.1 Overfitting
The background channels were frequently used by the motion features detectors, see Figures 7.4 and 7.8. It seems as though the detectors find a lot of information in the background channels during training, since they are used a lot, but the extracted pattern is not general enough. One problem when using a larger number of features is that the risk of overfitting increases. The solution to that is increasing the training set. Lack of data is a very common problem when working with machine learning, and in that aspect this project was no exception. The training set from all cameras consisted of 3112 images, which can be compared to Juránek et al. who trained their detector with 20000 images [12]. A larger amount of data would probably reduce overfitting and also produce a result with less variation between different realisations, which would make investigation and evaluation easier.
Throughout this work all the image channels (channels 1-8) were used for all settings. Further experiments could be made varying this parameter.
Using fewer of the image channels could be a solution to overfitting since that
would decrease the number of features.
Maybe motion features could be more of an asset in the detection in a more
difficult setting.
Chapter 9
Conclusions
Bibliography
[14] Constantine Papageorgiou and Tomaso Poggio. A trainable system for object detection. International Journal of Computer Vision, 38(1):15–33, 2000.
[15] K.-K. Sung and Tomaso Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51, 1998.
[16] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
Master’s Theses in Mathematical Sciences 2016:E52
ISSN 1404-6342
LUTFMA-3308-2016
Mathematics
Centre for Mathematical Sciences
Lund University
Box 118, SE-221 00 Lund, Sweden
http://www.maths.lth.se/