DeepDish: Multi-Object Tracking with an Off-the-Shelf Raspberry Pi
Matthew Danish, Justas Brazauskas, Rob Bricheno
Cambridge University
[email protected], [email protected], [email protected]
[Figure: inference latency in milliseconds (roughly 70 to 150 ms) plotted against Raspberry Pi clock speed (1500 MHz, 1750 MHz, 1900 MHz, 2000 MHz).]
4.1 Power usage
Table 1 shows the power draw and CPU temperature of the Raspberry Pi under different loads. Run mode is the normal operation, seeking maximum performance. In Sleep mode the program does not invoke inference but instead periodically checks if anything has changed in the input, waking back up if necessary. Both are compared against a baseline of an idle Raspberry Pi. There are significant opportunities for power-saving during quiet periods when there is no motion in front of the camera and it is looking at a fixed scene.
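The paper does not detail how the Sleep-mode check is implemented; the following is a minimal sketch of one plausible approach, assuming OpenCV-style grayscale frame differencing and a camera object whose read() returns a frame. The names (camera, PIXEL_DELTA, CHANGED_FRACTION) and thresholds are ours, not the authors'.

```python
import time

import cv2
import numpy as np

PIXEL_DELTA = 25         # per-pixel grey-level change counted as "motion"
CHANGED_FRACTION = 0.01  # wake once more than 1% of pixels have changed

def sleep_until_motion(camera, poll_seconds=1.0):
    """Poll the camera cheaply until the scene changes, then return."""
    reference = cv2.cvtColor(camera.read(), cv2.COLOR_BGR2GRAY)
    while True:
        time.sleep(poll_seconds)  # idle between checks to save power
        frame = cv2.cvtColor(camera.read(), cv2.COLOR_BGR2GRAY)
        delta = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
        if (delta > PIXEL_DELTA).mean() > CHANGED_FRACTION:
            return                # something moved: resume inference
        reference = frame         # track slow lighting changes
```

A check of this kind costs a single frame grab and a subtraction per poll, which is far cheaper than running the detector and is what makes the Sleep-mode power savings in Table 1 possible.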
5 PRIVACY
Putting up cameras raises privacy concerns, even in public spaces. One assurance we can offer is that since our data transmission will ultimately be carried by LoRaWAN, there is simply not enough bandwidth for it to be practical to transmit images at all. The only data transmitted are counting events and the current state of the counters. An attacker could learn the number of people within the public part of the building, but this is not considered sensitive information. For debugging purposes, the WiFi device in the Raspberry Pi has been configured with a private network protected by a pre-shared key. It is not connected to the Internet and can only be accessed by a person with the password and in close proximity to the device.
As the algorithm reads and processes each camera frame, it finds the coordinates of boxes around each object it detects, then distils the contents of each box down to a short vector of 128 numbers, and finally discards the frame. The vectors may stay in memory as long as the object is within sight, but they are a very sparse encoding of pixel colours and basic shapes, with no personally identifying information associated with them.
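As a sketch of this distillation step (assuming a DeepSORT-style appearance encoder; detector and encoder here are stand-ins for illustration, not the system's actual API):

```python
def frame_to_features(frame, detector, encoder):
    """Reduce one frame to per-person boxes and 128-number vectors."""
    boxes = detector(frame)          # [(x, y, w, h), ...], one box per person
    crops = [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]
    features = encoder(crops)        # one 128-dimensional vector per crop
    # The frame and crops go out of scope here: only the box coordinates
    # and the appearance vectors survive past this point.
    return boxes, features
```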
6 ACCURACY
We set up a test-bed to accept pre-recorded video in place of the live camera feed into the same algorithm used by the live tracker. The test video named 'Office' comprises two minutes of filming using the Raspberry Pi camera in an office environment similar in character to the pilot project location, with seven volunteers instructed to enter and exit the office repeatedly and without any particular patterns. The test video named 'Plaza' comprises 50 seconds of video from the MOT17 challenge [10], taken from a fixed camera overlooking a pedestrian plaza with a fairly intense flow of people moving about, as seen in Figure 6.

[Figure 6: Screenshot from the 'Plaza' test video.]

The code was instrumented so that it wrote the current value of several counter variables into a file every ten seconds: persons crossing the counting line in each direction (as determined by the sign of the cross-product), the overall number of crossings, and the number of tracking identities that have expired.

Ground truth for test videos was established using the following method: in each ten-second interval we counted the number of people crossing the counting line in each direction. Therefore, for each ten-second interval of test video, we know the number of people walking to the 'negative' side of the line (negcount) and the number of people walking to the 'positive' side of the line (poscount). These values were subtracted and then compiled into a vector covering the whole video. This allowed us to see how the algorithm varied from the ground truth over time, and penalised offsetting mistakes to some extent.

Vectors drawn from the ground truth and from test runs of the algorithm are compared using cosine distance, subtracted from 1. The ideal score is 1, the worst score is 0. We ran the tests with a number of configurations of the test-bed, tweaking parameters to DeepSORT, and trying much slower (but more accurate) YOLO-based object detectors in addition to MobileNet. The test-bed was also able to simulate different frame-rates, for example by dropping five frames out of every six to simulate a 200 ms inference latency. Parameters adjustable for testing include the resolution of the MARS-trained feature encoder (64×32, 128×64 or 256×128), the simulated FPS value (from 5–30), the maximum cosine distance (max-cos-distance) threshold for two feature vectors to be considered part of the same 'track', and the 'non-maximum suppression' (nms-max-overlap) threshold that eliminates spurious overlapping object-detection boxes (at 1.0 the boxes must overlap completely for one to be pruned, and at 0 it would eliminate even non-overlapping boxes). Over 450 configurations were tested. A selection of scores is shown in Table 2.
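A compact sketch of the two calculations just described follows; the function and variable names are ours. side_of_line is the cross-product side test used to classify crossings, and score compares the per-interval net-count vectors as 1 minus the cosine distance:

```python
import numpy as np

def side_of_line(a, b, p):
    """Sign of the 2-D cross product (b - a) x (p - a): which side of the
    counting line from a to b the point p lies on (+1, -1 or 0)."""
    return np.sign((b[0] - a[0]) * (p[1] - a[1])
                   - (b[1] - a[1]) * (p[0] - a[0]))

def score(truth_pos, truth_neg, run_pos, run_neg):
    """Compare per-interval net counts (poscount - negcount) from a test
    run against ground truth: 1 - cosine distance, i.e. cosine similarity.
    1 is a perfect score, 0 the worst."""
    t = np.asarray(truth_pos) - np.asarray(truth_neg)
    r = np.asarray(run_pos) - np.asarray(run_neg)
    return float(t @ r / (np.linalg.norm(t) * np.linalg.norm(r)))
```

A crossing event is recorded when side_of_line changes sign between a track's successive positions; using net counts per interval is what penalises offsetting mistakes only to an extent, since a missed crossing in each direction within one interval cancels out.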
7 DISCUSSION
The most surprising finding from our experiments is that increased frame-rate was not helpful and in fact could make things worse. As a live video counter, we achieved a score of 0.948 on the 'Office' test with MobileNet v1 running at 5 FPS. This rivalled the top test score of 0.968 that was obtained using a powerful GPU running YOLO v3 (tests at both 5 and 30 FPS achieved this mark). Figure 7 shows the overall average scores at each frame-rate: the worst cases are considerably worse at 15 and 30 FPS compared to 5 FPS.

[Figure 7: The effect of frame-rate on score. Score (0 to 1) plotted against frames per second (5, 15, 30).]

The cosine distances were often not that large between vectors encoded from the views of different people, especially if they were wearing similar-colour clothing. This could result in unwanted identity swaps, but that would only affect the score in a small way if the overall count was correct. Generally, setting max-cos-distance to very low values tended to reduce the performance of the tracker, effectively removing any assistance from the feature encoder. This effect can be seen in Figure 8.
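To make the role of max-cos-distance concrete, here is a simplified stand-in for the appearance gate, not DeepSORT's actual matching cascade; all names are illustrative:

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def may_associate(track_features, detection_feature, max_cos_distance=0.3):
    """Admit a detection-to-track match only if the closest stored
    appearance vector of the track is within the threshold. Very low
    thresholds reject nearly all matches, which in effect disables the
    feature encoder, consistent with the drop-off seen in Figure 8."""
    nearest = min(cosine_distance(f, detection_feature)
                  for f in track_features)
    return nearest <= max_cos_distance
```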
Lower 'non-maximum suppression' was slightly more important with MobileNet than with YOLO because the latter tends to generate higher-quality object-detection boxes and gets less confused when multiple people are standing together in a group. With MobileNet, it helped to suppress some spurious boxes that could be generated by clusters of people, and lower values of nms-max-overlap led to slightly improved performance, as can be seen in Figure 9.
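A sketch of the suppression step itself, in a plain greedy IoU-based form (the actual DeepSORT implementation may measure overlap slightly differently, so this is illustrative only):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / max(union, 1e-9)

def non_max_suppression(boxes, scores, nms_max_overlap):
    """Greedily keep the highest-scoring boxes, pruning any box whose
    overlap with an already-kept box exceeds nms-max-overlap."""
    kept = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) <= nms_max_overlap for j in kept):
            kept.append(i)
    return kept
```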
Another unexpected finding is that lowering the resolution of the input to the feature encoder in order to gain performance did not affect scores overall. Figure 10 shows that the 64×32 feature encoder gave approximately the same scores as the slower default resolution of 128×64. We also tried experiments with an even higher-resolution feature encoder, at 256×128, and that backfired, producing worse results.

Finally, we compare models in Figure 11. This chart focuses on the generally good configurations: feature encoder resolution 64×32, processing 5 FPS, max-cos-distance ≥ 0.3, and nms-max-overlap < 1.0. Both MobileNet versions score about the same (just a slight advantage for v2), but in our judgement that is outweighed by the worse running time.

8 CONCLUSION
We show that practical performance can be achieved on a Raspberry Pi 4B in this application without requiring special hardware acceleration. Although our pilot project was meant to take place inside a building, we anticipate future deployments in places where network access is limited, space is at a premium and the power supply may be more circumscribed. We are continuing to refine the models and parameters so this system may be used in more extensive experiments for our larger Digital Twin project.

ACKNOWLEDGMENTS
This research forms part of the Centre for Digital Built Britain's work within the Construction Innovation Hub. The funding was provided through the Government's modern industrial strategy by Innovate UK, part of UK Research and Innovation.
[Figure 8: The effect of maximum cosine distance on score. Score (0 to 1) plotted against maximum cosine distance (0.01, 0.3, 0.6, 0.9).]

[Figure 11: The effect of model type on score. Score (0.75 to 1) plotted for MobileNet v1, MobileNet v2 and YOLO v3 object detection models.]
[Figure 10: The effect of feature encoder resolution on score. Score (0 to 1) plotted against feature encoder resolution (64x32, 128x64, 256x128).]

REFERENCES
[1] Ragaad Al-Tarawneh, Christina Strong, Luis Remis, Pablo Munoz, Addicam Sanjay, and Srikanth Kambhatla. 2019. Navigating the visual fog: analyzing and managing visual data from edge to cloud. In 2nd USENIX Workshop on Hot Topics in Edge Computing (HotEdge 19). USENIX Association, Renton, WA. https://www.usenix.org/conference/hotedge19/presentation/altarawneh
[2] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. 2016. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 3464–3468.
[3] Alejandro Cartas, Martin Kocour, Aravindh Raman, Ilias Leontiadis, Jordi Luque, Nishanth Sastry, Jose Nuñez-Martinez, Diego Perino, and Carlos Segura. 2019. A reality check on inference at mobile networks edge. In Proceedings of the 2nd International Workshop on Edge Systems, Analytics and Networking. 54–59.
[4] Gioele Ciaparrone, Francisco Luque Sánchez, Siham Tabik, Luigi Troiano, Roberto Tagliaferri, and Francisco Herrera. 2019. Deep learning in video multi-object tracking: A survey.
[…] framework for real-time intelligent video analytics. In Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking. 1–6.
[13] Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better, Faster, Stronger. CoRR abs/1612.08242 (2016). arXiv:1612.08242 http://arxiv.org/abs/1612.08242
[14] Ju Ren, Yundi Guo, Deyu Zhang, Qingqing Liu, and Yaoxue Zhang. 2018. Distributed and efficient object detection in edge computing: challenges and solutions. IEEE Network 32, 6 (2018), 137–143.
[15] Springer 2016. MARS: a video benchmark for large-scale person re-identification. Springer.
[16] Nicolai Wojke and Alex Bewley. 2018. Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 748–756. https://doi.org/10.1109/WACV.2018.00087
[17] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 3645–3649.
[18] Xu Zhang, Xiangyang Hao, Songlin Liu, Junqiang Wang, Jiwei Xu, and Jun Hu. 2019. Multi-target tracking of surveillance video with differential YOLO and DeepSORT. In Eleventh International Conference on Digital Image Processing (ICDIP 2019), Jenq-Neng Hwang and Xudong Jiang (Eds.), Vol. 11179. International Society for Optics and Photonics, SPIE, 701–710. https://doi.org/10.1117/12.2540269