
DeepDish: Multi-Object Tracking with an Off-the-Shelf Raspberry Pi

Matthew Danish, Justas Brazauskas, Rob Bricheno, Ian Lewis, Richard Mortier
Cambridge University
ABSTRACT
When looking at in-building or urban settings, information about the number of people present and the way they move through the space is useful for helping designers to understand what they have created, fire marshals to identify potential safety hazards, planners to speculate about what is needed in the future, and the public to have real data on which to base opinions about communal choices. We propose a network of edge devices based on Raspberry Pi and TensorFlow, which will ultimately push data via LoRaWAN to a real-time data server. This network is being integrated into a Digital Twin of a local site which includes several dozen buildings spread over approximately 500,000 square metres. We share and discuss issues regarding privacy, accuracy and performance.

CCS CONCEPTS
• Computer systems organization → Sensor networks; • Hardware → Sensor applications and deployments; Sensor devices and platforms.

KEYWORDS
object detection, object tracking, edge computing

ACM Reference Format:
Matthew Danish, Justas Brazauskas, Rob Bricheno, Ian Lewis, and Richard Mortier. 2020. DeepDish: Multi-Object Tracking with an Off-the-Shelf Raspberry Pi. In 3rd International Workshop on Edge Systems, Analytics and Networking (EdgeSys '20), April 27, 2020, Heraklion, Greece. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3378679.3394535

1 INTRODUCTION
We propose a system of edge devices based on a low-power computing board, the Raspberry Pi, that analyses the movement of people using a standard camera peripheral and can publish real-time events and anonymous statistics to a low-bandwidth and secure network. The system runs several machine-learning-based algorithms to perform multi-object tracking (MOT) on sequential image data and distils the information down to a few numbers that can be transmitted over a LoRaWAN network from sensors in the field.

The MOT problem takes as input an image sequence and a set of objects of interest. Solutions must trace their movement throughout the image sequence while maintaining the distinct identity of each object. We identify objects of interest by category (e.g., 'person') and then apply category-specific object-detection methods to automatically draw bounding boxes around all objects of interest within each frame. Solving the MOT problem then requires finding the corresponding bounding boxes in each successive frame, as well as determining whether a bounding box has been drawn around a newly introduced object, and whether a previously-known object has disappeared. This is known as the tracking-by-detection approach [4].

The object-detection problem has been studied intensively and solutions have advanced rapidly in recent years thanks to the success of supervised deep learning methods [5]. We focus in particular on the MobileNet model architecture [7] because it is optimised for use in the mobile processor context, and we use a model trained on the COCO data-set [11] that offers 91 possible object categories.

For tracking-by-detection, we employ DeepSORT [17], which extends SORT [2] with a 'feature encoder': a pre-trained convolutional neural network (CNN) that extracts a vector from the image data within each bounding box. SORT is designed to be fast because it relies only on simple association techniques, based on scores computed using Kalman filtering and the Mahalanobis distance metric and then solved using the Hungarian method, but SORT can easily be fooled by object occlusion. By adding the feature encoder vector to the mix, DeepSORT helps avoid accidental identity switches between overlapping objects, while maintaining much of the performance required for online tracking of real-time video.
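To make the association step concrete, the following is a minimal sketch of one DeepSORT-style matching round. It assumes numpy and scipy are available; the track and detection attributes (feature, kf_mean, kf_cov, xyah) and the threshold constants are illustrative assumptions rather than the actual DeepSORT or DeepDish implementation, which additionally uses a matching cascade and measurement-space gating.

```python
# Illustrative sketch of DeepSORT-style association (not the authors' code).
# Cost combines appearance (cosine distance between encoder features) with a
# Mahalanobis gate from each track's Kalman-filter prediction, then is
# solved globally with the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 9.4877            # chi-square 95% quantile, 4 degrees of freedom
MAX_COS_DISTANCE = 0.6   # cf. the max-cos-distance parameter in Section 6
INFEASIBLE = 1e5         # cost assigned to gated-out pairs

def associate(tracks, detections):
    """Match detections to tracks; return (matches, unmatched detections)."""
    cost = np.full((len(tracks), len(detections)), INFEASIBLE)
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            # Appearance term: cosine distance between feature vectors.
            cos_d = 1.0 - np.dot(trk.feature, det.feature) / (
                np.linalg.norm(trk.feature) * np.linalg.norm(det.feature))
            # Motion gate: squared Mahalanobis distance of the detection
            # under the track's Kalman-predicted distribution.
            diff = det.xyah - trk.kf_mean
            maha = diff @ np.linalg.inv(trk.kf_cov) @ diff
            if maha <= GATE and cos_d <= MAX_COS_DISTANCE:
                cost[i, j] = cos_d
    rows, cols = linear_sum_assignment(cost)   # Hungarian method
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < INFEASIBLE]
    matched = {c for _, c in matches}
    unmatched = [j for j in range(len(detections)) if j not in matched]
    return matches, unmatched
```

Unmatched detections then spawn fresh tracking identities, and tracks left unmatched for too many frames are deleted, as described in §3.3.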
video.
ing board, the Raspberry Pi, that analyses the movement of people
We show that this can be achieved with a low cost, stock Rasp-
using a standard camera peripheral and can publish real-time events
berry Pi 4B thanks to some unexpected findings we discuss in §7.
and anonymous statistics to a low-bandwidth and secure network.
The system runs several machine-learning-based algorithms to per-
form multi-object tracking (MOT) on sequential image data and
2 RELATED WORK
Past work by Ren et al. [14] relied on a network of edge-based servers with high-performance GPUs that could be placed close enough to gather image data from a local network of cameras. Cartas et al. [3] performed object-detection by sending video frames from mobile devices to nearby small servers backed up by high-performance but more distant servers; they were only able to achieve 150ms inference time by equipping the small servers with GPUs. EdgeEye [12] similarly depends on having a GPU. Hochstetler et al. [6] bench-marked a Raspberry Pi 3B processor both with and without an Intel Movidius™ Neural Compute Stick on data from a visual recognition challenge.


When setting the model input image size to 224×224, they found that the CPU alone performed at about 2.1 frames-per-second (FPS) on the object-detection task, while adding the Neural Compute Stick boosted that to 17.2 FPS. The DeepX project [9] looked at ways of distributing inference tasks across heterogeneous resources, such as GPUs and lower-power processing units present on certain mobile platforms, with which they were able to achieve sub-500ms object-detection inference times at significant energy savings.

DeepSORT is a popular tracking algorithm; one recent and notable work using it is by Zhang et al. [18], who considered the case of fixed-view cameras: they computed a differential filter to isolate only the portions of the view that were changing, then used the YOLOv2 [13] object-detection system with DeepSORT to perform online MOT on a high-performance server and GPU. Al-Tarawneh et al. [1] used a different style of feature vectors, computed on high-performance servers, to re-identify customers over the course of a day as they browsed a shop, in order to produce retail analytics of their behaviour.

3 EXPERIMENTAL SETUP
Our pilot project intended¹ to count the number of people entering and exiting a particular building for fire safety purposes. While person-counting is a venerable field, this is only intended as a test case for a much larger experiment that will study the movement of people in public space as part of a project to create a Digital Twin [8] of a local site, which includes several dozen buildings and covers approximately 500,000 square metres of land.

With a large enough field-of-view, we hypothesise that the counting task can be performed even if the software can only process a low number of FPS, because people would remain in view for at least several seconds. Careful placement of the camera is critical to ensure that people are within view for sufficient time on either side of the pre-determined 'counting line', while not being so far away as to compromise the object detector.

¹Late note: the pilot project has been cancelled due to the global COVID-19 pandemic.

3.1 Hardware
Our edge node is a Raspberry Pi 4B, which is a Broadcom BCM2711 system-on-chip with a quad-core Cortex-A72 (ARM v8) 64-bit processor running at 1.5GHz and 4GB of RAM. We used a night-vision camera module recording video at 640×480 in full colour; however, we also tested with the standard Pi camera, which provides a similar quality of video, albeit without the infrared lighting support. We added the Fan SHIM for temperature control, and a PiJuice battery 'HAT' mounted on the GPIO interface to provide a power supply backup. The device is packaged in a 3D-printed case that allows the camera to be rotated into the desired position, as shown in Figure 1. Future work will include a LoRaWAN HAT as well.

Figure 1: The counting device.

3.2 Operating software
We use TensorFlow 1.15.0, including the TensorFlow Lite engine, and the controlling software is written in Python 3.7 running under Hypriot, a Debian-based operating system customised for Raspberry Pi.

3.3 Algorithm
Following the tracking-by-detection concept used by DeepSORT, we break the algorithm down into three steps, shown in Figure 2: object-detection, feature-extraction and tracking/association.

Figure 2: Tracking-by-detection pipeline.

The object-detection model is a pre-trained quantised version of SSD MobileNet v1 [7] compiled for TensorFlow Lite, made available by Google. We feed it images of size 300×300 and, from the output, use only the detections labelled with the category 'person'. The feature-extraction step uses the DeepFLOW CNN model trained on the Motion Analysis and Re-identification Set (MARS) data-set [15] using the cosine metric learning technique [16]. Feature-extraction must be run separately on each person detected, so the algorithm scales linearly with the number of people that need tracking. Association of tracks with the known history of objects is performed by the DeepSORT method of combining Mahalanobis distance, computed on Kalman filter distributions, with cosine metric distance, computed on extracted feature vectors. People who are new to the tracking history (based on a feature threshold) are assigned a fresh identification number, and people who fail to be found for over 30 frames are considered to have left the scene.
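As an illustration of the detection step, here is a minimal sketch using the TensorFlow Lite interpreter bundled with TensorFlow 1.15. The model file name, the uint8 input layout and the 'person' class index are assumptions based on the commonly distributed COCO-trained quantised SSD MobileNet v1 model, not necessarily our exact configuration.

```python
# Sketch of the TensorFlow Lite object-detection step (illustrative).
import numpy as np
import tensorflow as tf

PERSON_CLASS = 0  # 'person' in the COCO label map shipped with this model

interpreter = tf.lite.Interpreter(model_path='detect.tflite')  # assumed name
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()

def detect_people(frame, threshold=0.5):
    """Run one 300x300 RGB uint8 frame; return 'person' bounding boxes."""
    interpreter.set_tensor(inp['index'], np.expand_dims(frame, axis=0))
    interpreter.invoke()
    boxes = interpreter.get_tensor(out[0]['index'])[0]    # [N,4] ymin..xmax
    classes = interpreter.get_tensor(out[1]['index'])[0]  # [N] class ids
    scores = interpreter.get_tensor(out[2]['index'])[0]   # [N] confidences
    return [box for box, cls, score in zip(boxes, classes, scores)
            if int(cls) == PERSON_CLASS and score >= threshold]
```

The returned boxes are in normalised coordinates; each one is cropped from the frame and passed to the feature encoder before association.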
At each step, the most recent track vector of each known person is compared against the pre-determined counting line by solving for line intersection, with the cross-product determining the direction of movement. When an intersection is found, a 'count' event is generated and the intersecting segment is highlighted in red, as shown in Figure 3. Internally, we maintain a running total of counts going in each direction, as well as the number of tracks that have been deleted after the person was not identified for 30 consecutive frames. When an HDMI monitor is hooked up to the device, we display this information as an overlay on the current camera view for debugging purposes, as well as drawing the count line, rectangles around detected people and their tracked vectors. A web interface showing the same debugging information can also be enabled.
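The crossing test itself reduces to 2-D geometry. The following is a minimal sketch assuming pixel coordinates; the function names are hypothetical and collinear edge cases are ignored for brevity.

```python
# Sketch of the counting-line test: does the track segment prev->curr cross
# the line (a, b), and in which direction (sign of the cross-product)?
def cross(o, a, b):
    """2-D cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def count_event(a, b, prev_pt, curr_pt):
    """Return +1 or -1 on a crossing (by which side curr_pt lands on),
    or 0 if the segment does not cross the counting line."""
    d1 = cross(a, b, prev_pt)   # side of the line the person was on
    d2 = cross(a, b, curr_pt)   # side of the line the person is on now
    d3 = cross(prev_pt, curr_pt, a)
    d4 = cross(prev_pt, curr_pt, b)
    if (d1 > 0) != (d2 > 0) and (d3 > 0) != (d4 > 0):  # proper intersection
        return 1 if d2 > 0 else -1
    return 0
```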
4 PERFORMANCE
We used SSD MobileNet v1 because, in our testing, we found it to give the best response times by far, as can be seen in Figure 4. The accuracy scores are also quite good under the circumstances, as discussed in §6. A further speed-up can be obtained by overclocking the Raspberry Pi, as seen in Figure 5.

Figure 3: Screenshot from the 'Office' test video, showing two intersection events. The numbers at the bottom corners of the screen record the counts of people who have crossed the red counting line in either direction. The central number is the difference. The boxes around people are results from object-detection, and the numbers in the corners of the boxes are tracking identities. The purple lines that follow people are the tracking vectors; the intersecting segments are coloured red momentarily when an intersection event is detected.

Figure 4: Inference time for object detection using different versions of SSD MobileNet.

Figure 5: Object detection inference times for SSD MobileNet v1 with overclocking.

We found that the platform remained stable up to 1,900MHz, speeding up inference by about 16.5%, with core temperatures fluctuating somewhat under 60°C with active cooling from the Fan SHIM. However, we learned that the PiJuice did not work reliably when the Pi was over-volted, and the HAT had to be removed before conducting these tests.

The DeepSORT feature-encoder CNN model comes trained on MARS data with an image input size of 128×64. However, we found that running it took 95ms per person per frame: too slow for real-time processing of scenes with multiple people. We therefore resized the image input to 64×32 and retrained the CNN model. After doing that, we achieved a 35ms running time per person per frame, 63% faster than before. With overclocking, the typical time for feature-encoding was reduced further to 32ms per person per frame. This is considerably more practical in our expected use case.

When overclocking to 1,900MHz, the running time can be estimated using the following numbers (a sketch of the resulting cost model follows below):
• Fixed costs per frame: approximately 130ms, composed of:
  – Object-detection: about 96ms
  – Processing (e.g., input, resizing and output): about 34ms
• Cost per person tracked: approximately 36ms, composed of:
  – Feature-encoding: about 32ms
  – Association: about 4ms
Therefore, the overall inference time per frame when tracking n people at a time averages approximately T(n) = 130 + 36n milliseconds. We can track up to ten people at a time while still maintaining two FPS, five people at three FPS, and two people at about five FPS.
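The constants below are the measurements quoted above; the helper name is illustrative.

```python
# Sketch of the per-frame cost model at 1,900MHz: T(n) = 130 + 36n ms.
FIXED_MS = 130       # object-detection (~96ms) + processing (~34ms)
PER_PERSON_MS = 36   # feature-encoding (~32ms) + association (~4ms)

def fps_when_tracking(n_people):
    """Approximate achievable frame-rate when tracking n people."""
    return 1000.0 / (FIXED_MS + PER_PERSON_MS * n_people)

for n in (2, 5, 10):
    print(f'{n:2d} people -> {fps_when_tracking(n):.1f} FPS')
# prints: 2 people -> 5.0 FPS, 5 people -> 3.2 FPS, 10 people -> 2.0 FPS
```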


Table 1: Power draw and CPU temperature at different CPU frequencies and in different modes.

CPU Clock (MHz)   Mode      Power (W)   CPU Temp. (°C)
1,500             Run       6.0         49
1,500             Sleep     3.6         35
1,500             Idle Pi   3.4         33
1,750             Run       7.0         51
1,750             Sleep     3.9         35
1,750             Idle Pi   3.6         34
1,900             Run       8.5         57
1,900             Sleep     4.2         36
1,900             Idle Pi   3.7         35

4.1 Power usage
Table 1 shows the power draw and CPU temperature of the Raspberry Pi under different loads. Run mode is normal operation, seeking maximum performance. In Sleep mode the program does not invoke inference but instead periodically checks if anything has changed in the input, waking back up if necessary. Both are compared against a baseline of an idle Raspberry Pi. There are significant opportunities for power-saving during quiet periods when there is no motion in front of the camera and it is looking at a fixed scene.
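The specific change-detection method is not detailed here; as a minimal sketch, Sleep mode could be implemented with plain frame differencing, where grab_frame and the wake threshold are illustrative assumptions:

```python
# Sketch of a Sleep-mode loop (illustrative; the actual check may differ).
import time
import numpy as np

WAKE_THRESHOLD = 8.0   # mean absolute pixel difference; assumed value

def sleep_until_motion(grab_frame, poll_seconds=1.0):
    """Poll the camera cheaply; return once the scene changes."""
    baseline = grab_frame().astype(np.float32)
    while True:
        time.sleep(poll_seconds)
        frame = grab_frame().astype(np.float32)
        if np.mean(np.abs(frame - baseline)) > WAKE_THRESHOLD:
            return   # wake up and resume inference
        baseline = frame  # absorb slow changes such as lighting drift
```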
5 PRIVACY
Putting up cameras raises privacy concerns, even in public spaces. One assurance we can offer is that, since our data transmission will ultimately be carried by LoRaWAN, there is simply not enough bandwidth for it to be practical to transmit images at all. The only data transmitted are counting events and the current state of the counters. An attacker could learn the number of people within the public part of the building, but this is not considered sensitive information. For debugging purposes, the WiFi device in the Raspberry Pi has been configured with a private network protected by a pre-shared key. It is not connected to the Internet and can only be accessed by a person with the password and in close proximity to the device.

As the algorithm reads and processes each camera frame, it finds the coordinates of boxes around each object it detects, then distils the contents of each box down to a short vector of 128 numbers, and finally discards the frame. The vectors may stay in memory as long as the object is within sight, but they are a very sparse encoding of pixel colours and basic shapes, with no personally identifying information associated with them.

6 ACCURACY
We set up a test-bed that accepts pre-recorded video in place of the live camera feed into the same algorithm used by the live tracker. The test video named 'Office' comprises two minutes of filming using the Raspberry Pi camera in an office environment similar in character to the pilot project location, with seven volunteers instructed to enter and exit the office repeatedly and without any particular pattern. The test video named 'Plaza' comprises 50 seconds of video from the MOT17 challenge [10], taken from a fixed camera overlooking a pedestrian plaza with a fairly intense flow of people moving about, as seen in Figure 6.

Figure 6: Screenshot from the 'Plaza' test video.

The code was instrumented to write the current value of several counter variables into a file every ten seconds: persons crossing the counting line in each direction (as determined by the sign of the cross-product), the overall number of crossings, and the number of tracking identities that have expired.

Ground truth for the test videos was established as follows: in each ten-second interval, we counted the number of people crossing the counting line in each direction. Therefore, for each ten-second interval of test video, we know the number of people walking to the 'negative' side of the line (negcount) and the number of people walking to the 'positive' side of the line (poscount). These values were subtracted and then compiled into a vector covering the whole video. This allowed us to see how the algorithm varied from the ground truth over time, and penalised offsetting mistakes to some extent.

Vectors drawn from the ground truth and from test runs of the algorithm are compared using cosine-distance, subtracted from 1. The ideal score is 1; the worst score is 0. We ran the tests with a number of configurations of the test-bed, tweaking parameters to DeepSORT and trying much slower (but more accurate) YOLO-based object detectors in addition to MobileNet. The test-bed was also able to simulate different frame-rates, for example by dropping five frames out of every six to simulate a 200ms inference latency.

Parameters adjustable for testing include the resolution of the MARS-trained feature encoder (64×32, 128×64 or 256×128), the simulated FPS value (from 5–30), the maximum cosine distance (max-cos-distance) threshold for two feature vectors to be considered part of the same 'track', and the 'non-maximum suppression' (nms-max-overlap) threshold that eliminates spurious overlapping object-detection boxes (at 1.0 the boxes must overlap completely for one to be pruned, and at 0 it would eliminate even non-overlapping boxes). Over 450 configurations were tested. A selection of scores is shown in Table 2.
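As a sketch, the score for one test run is the cosine similarity between the per-interval net-count vectors (equivalently, 1 minus the cosine distance); the example vectors below are invented for illustration.

```python
# Sketch of the accuracy score: compare per-ten-second net counts
# (poscount - negcount) from ground truth and from the algorithm.
import numpy as np

def score(ground_truth_net, measured_net):
    """1 - cosine distance between net-count vectors; 1 ideal, 0 worst."""
    g = np.asarray(ground_truth_net, dtype=float)
    m = np.asarray(measured_net, dtype=float)
    return float(np.dot(g, m) / (np.linalg.norm(g) * np.linalg.norm(m)))

# Hypothetical net counts for six ten-second intervals:
print(score([1, 0, 2, -1, 0, 1], [1, 0, 1, -1, 0, 1]))  # approx 0.94
```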


Table 2: Accuracy of counting: a selection of test results.

Score   Test name   Object-detector   Feature encoder   FPS   max-cos-distance   nms-max-overlap
0.968   Office      YOLO v3           128×64            5     0.3                0.6
0.948   Office      MobileNet v1      64×32             5     0.6                0.6
0.938   Office      MobileNet v2      64×32             30    0.9                0.6
0.906   Office      MobileNet v1      128×64            5     0.3                0.3
0.878   Office      MobileNet v2      64×32             5     0.9                1.0
0.794   Office      MobileNet v1      128×64            30    0.9                0.6
0.656   Office      MobileNet v2      256×128           30    0.6                0.6
0.424   Office      MobileNet v2      256×128           15    0.01               1.0
0.986   Plaza       YOLO v3           64×32             5     0.6                0.6
0.903   Plaza       MobileNet v1      128×64            15    0.9                0.6
0.880   Plaza       MobileNet v2      128×64            5     0.3                0.6
0.843   Plaza       MobileNet v1      64×32             5     0.6                0.8
0.839   Plaza       MobileNet v2      256×128           5     0.9                0.6
0.815   Plaza       MobileNet v2      64×32             5     0.6                0.3
0.713   Plaza       MobileNet v1      64×32             15    0.01               1.0
0.596   Plaza       MobileNet v2      128×64            30    0.9                1.0

7 DISCUSSION
The most surprising finding from our experiments is that increased frame-rate was not helpful and could in fact make things worse. As seen in Table 2, using a configuration with conditions similar to our live video counter, we achieved a score of 0.948 on the 'Office' test with MobileNet v1 running at 5 FPS. This rivalled the top test score of 0.968, which was obtained using a powerful GPU running YOLO v3 (tests at both 5 and 30 FPS achieved this mark). Figure 7 shows the overall average scores at each frame-rate: the worst cases are considerably worse at 15 and 30 FPS than at 5 FPS.

Figure 7: The effect of frame-rate on score.

The cosine distances between vectors encoded from the views of different people were often not that large, especially if they were wearing similar-colour clothing. This could result in unwanted identity swaps, but that would only affect the score in a small way if the overall count was correct. Generally, setting max-cos-distance to very low values tended to reduce the performance of the tracker, effectively removing any assistance from the feature encoder. This effect can be seen in Figure 8.

Lower 'non-maximum suppression' was slightly more important with MobileNet than with YOLO because the latter tends to generate higher-quality object-detection boxes and to get less confused when multiple people are standing together in a group. With MobileNet, it helped to suppress some spurious boxes that could be generated by clusters of people, and lower values of nms-max-overlap led to slightly improved performance, as can be seen in Figure 9.

Another unexpected finding is that lowering the resolution of the input to the feature encoder in order to gain performance did not affect scores overall. Figure 10 shows that the 64×32 feature encoder gave approximately the same scores as the slower default resolution of 128×64. We also experimented with an even higher-resolution feature encoder, at 256×128, and that backfired, producing worse results.

Finally, we compare models in Figure 11. This chart focuses on the generally good configurations: feature encoder resolution 64×32, processing 5 FPS, max-cos-distance ≥ 0.3, and nms-max-overlap < 1.0. Both MobileNet versions score about the same (with just a slight advantage for v2), but in our judgement that is outweighed by the worse running time.

8 CONCLUSION
We show that practical performance can be achieved on a Raspberry Pi 4B in this application without requiring special hardware acceleration. Although our pilot project was meant to take place inside a building, we anticipate future deployments in places where network access is limited, space is at a premium and the power supply may be more circumscribed. We are continuing to refine the models and parameters so that this system may be used in more extensive experiments for our larger Digital Twin project.

ACKNOWLEDGMENTS
This research forms part of the Centre for Digital Built Britain's work within the Construction Innovation Hub. The funding was provided through the Government's modern industrial strategy by Innovate UK, part of UK Research and Innovation.


Figure 8: The effect of maximum cosine distance on score.

Figure 9: The effect of non-maximum suppression on score for MobileNet.

Figure 10: The effect of feature encoder resolution on score.

Figure 11: The effect of model type on score.

REFERENCES
[1] Ragaad Al-Tarawneh, Christina Strong, Luis Remis, Pablo Munoz, Addicam Sanjay, and Srikanth Kambhatla. 2019. Navigating the visual fog: analyzing and managing visual data from edge to cloud. In 2nd USENIX Workshop on Hot Topics in Edge Computing (HotEdge 19). USENIX Association, Renton, WA. https://www.usenix.org/conference/hotedge19/presentation/altarawneh
[2] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. 2016. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 3464–3468.
[3] Alejandro Cartas, Martin Kocour, Aravindh Raman, Ilias Leontiadis, Jordi Luque, Nishanth Sastry, Jose Nuñez-Martinez, Diego Perino, and Carlos Segura. 2019. A reality check on inference at mobile networks edge. In Proceedings of the 2nd International Workshop on Edge Systems, Analytics and Networking. 54–59.
[4] Gioele Ciaparrone, Francisco Luque Sánchez, Siham Tabik, Luigi Troiano, Roberto Tagliaferri, and Francisco Herrera. 2019. Deep learning in video multi-object tracking: a survey. Neurocomputing (2019).
[5] Junwei Han, Dingwen Zhang, Gong Cheng, Nian Liu, and Dong Xu. 2018. Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine 35, 1 (2018), 84–100.
[6] Jacob Hochstetler, Rahul Padidela, Qi Chen, Qing Yang, and Song Fu. 2018. Embedded deep learning for vehicular edge computing. In 2018 IEEE/ACM Symposium on Edge Computing (SEC). IEEE, 341–343.
[7] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017). arXiv:1704.04861 http://arxiv.org/abs/1704.04861
[8] Kirsten Lamb. 2019. Principle-based digital twins: a scoping review. (2019).
[9] Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. 2016. DeepX: a software accelerator for low-power deep learning inference on mobile devices. In 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). IEEE, 1–12.
[10] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. 2015. MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942 [cs] (April 2015). http://arxiv.org/abs/1504.01942
[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. Microsoft COCO: common objects in context. arXiv:1405.0312 [cs.CV]
[12] Peng Liu, Bozhao Qi, and Suman Banerjee. 2018. EdgeEye: an edge service framework for real-time intelligent video analytics. In Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking. 1–6.
[13] Joseph Redmon and Ali Farhadi. 2016. YOLO9000: better, faster, stronger. CoRR abs/1612.08242 (2016). arXiv:1612.08242 http://arxiv.org/abs/1612.08242
[14] Ju Ren, Yundi Guo, Deyu Zhang, Qingqing Liu, and Yaoxue Zhang. 2018. Distributed and efficient object detection in edge computing: challenges and solutions. IEEE Network 32, 6 (2018), 137–143.
[15] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. 2016. MARS: a video benchmark for large-scale person re-identification. In European Conference on Computer Vision (ECCV). Springer.
[16] Nicolai Wojke and Alex Bewley. 2018. Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 748–756. https://doi.org/10.1109/WACV.2018.00087
[17] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 3645–3649.
[18] Xu Zhang, Xiangyang Hao, Songlin Liu, Junqiang Wang, Jiwei Xu, and Jun Hu. 2019. Multi-target tracking of surveillance video with differential YOLO and DeepSORT. In Eleventh International Conference on Digital Image Processing (ICDIP 2019), Jenq-Neng Hwang and Xudong Jiang (Eds.), Vol. 11179. International Society for Optics and Photonics, SPIE, 701–710. https://doi.org/10.1117/12.2540269
