
Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Information Technology

2021 | LIU-IDA/LITH-EX-A--2021/027--SE

Detection of safety equipment in the manufacturing industry using image recognition
Detektering av säkerhetsutrustning i tillverkningsindustrin med
hjälp av bildigenkänning

Mimmi Cromsjö
Linn Hallonqvist

Supervisor: Johan Blomkvist
Examiner: Eva Blomqvist

Linköpings universitet
SE–581 83 Linköping
+46 13 28 10 00, www.liu.se

Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a
period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to down-
load, or to print out single copies for his/hers own use and to use it unchanged for non-commercial
research and educational purpose. Subsequent transfers of copyright cannot revoke this permission.
All other uses of the document are conditional upon the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is
accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures
for publication and for assurance of document integrity, please refer to its www home page:
http://www.ep.liu.se/.

© Mimmi Cromsjö, Linn Hallonqvist
Abstract

Safety is an essential part of the paper industry, as the industry can be very hazardous
and accidents can lead to serious injuries for the people involved. In order to mitigate and
prevent accidents, it has been shown that proactive measures are of great value. One type
of proactive measure is the use of Personal Protective Equipment (PPE), such as gloves,
hard hats, safety glasses and reflective vests. Although wearing PPE is often required in a workplace, compliance is not guaranteed, and non-usage can compromise the safety of workers.
To detect unsafe conditions, such as non-usage of PPE, automated video monitoring with
image recognition can be useful. The intention of this work is to investigate whether an
image recognition model can be created using the cloud service Azure and used in a system
that can detect PPE, which in this work is limited to reflective vests. The work results in an artifact built around an image recognition model. Additionally, this work examines how
the training data can affect the model’s performance. It is found that the model can be
improved by training the model on images with varying backgrounds, angles, distances,
and occlusions.

While there are many advantages with automated monitoring, the use of it can raise ques-
tions regarding the privacy of the people being monitored and how it can be perceived in a
workplace. Therefore, this thesis examines the privacy concerns and attitudes regarding an
image recognition system for monitoring. This is accomplished by performing a literature
study and interviews with employees at a paper mill. The results reveal challenges with
systems for automated monitoring as well as factors that can affect how employees feel
about them.
Acknowledgments

We would like to thank the people at Sogeti for the help and encouragement during the
project. Especially Jonas Jaani, Jörgen Nybrolin and Anders Persson who have been support-
ing us through the process. We would also like to thank Richard Haag and Andreas Olofsson
at Holmen AB for the collaboration and the employees at Holmen participating in the in-
terviews. At Linköping University we want to thank our supervisor Johan Blomkvist and
examiner Eva Blomqvist for their guidance. Also, we want to thank Nicklas Silversved and
Madeleine Bäckström for the opposition and feedback on the thesis.

Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vi

List of Tables vii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Theory 4
2.1 Image recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Introducing an image recognition system in a workplace . . . . . . . . . . . . . 14
2.3 Interview techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Method 19
3.1 Implementation of PPE detection system . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Improving the object detection model . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Examining the attitudes and effects of image recognition . . . . . . . . . . . . . 26

4 Results 29
4.1 Performance of different features in the training sets . . . . . . . . . . . . . . . . 29
4.2 Outcome of the interviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5 Discussion 39
5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 The work in a wider context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6 Conclusion 46

Bibliography 48

A Interview questions 52

List of Figures

2.1 Objects, classified as cars and pedestrians respectively, detected on a street in New York. 5
2.2 An example of object instance recognition. Big Ben with a surrounding bounding
box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 An example of object class detection. A cat with a surrounding bounding box. . . 6
2.4 Confusion matrix of actual and predicted values. . . . . . . . . . . . . . . . . . . . . 8
2.5 An example of a ground truth and a predicted bounding box overlapping. . . . . . 9
2.6 Graphical representation of Intersection over Union (IoU). . . . . . . . . . . . . . . 10
2.7 Definition of TP, FP and FN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Precision-recall curve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1 Architecture of the system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21


3.2 Labeling an object in Custom Vision. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Example images from the data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Iteration loop for training and evaluating the model for features. . . . . . . . . . . . 25
3.5 Text file with ground truths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6 Text file with predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.7 Likert scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1 Precision-Recall curves, baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


4.2 Precision-Recall curves, lighting conditions. . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Precision-Recall curves, background. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Precision-Recall curves, angle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Precision-Recall curves, distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.6 Precision-Recall curves, occlusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.7 Precision-Recall curves, all features combined. . . . . . . . . . . . . . . . . . . . . . 33

List of Tables

3.1 Training sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


3.2 The test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Participants of the interviews. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.1 Metrics for each model with different features included in the training set. . . . . . 29
4.2 Participants A-E’s answers to the quantitative survey. . . . . . . . . . . . . . . . . . 34

1 Introduction

The paper industry as a workplace can be very hazardous and has a high number of acci-
dents. The high accident rate is linked to the nature of the industry with massive weights
and dangerous machines. Safety is taken seriously in the paper industry and has long been a high priority, but as accidents keep happening there is always a need for improvement.

As technology evolves, new methods are used to increase the safety in manufacturing indus-
tries. Industry 4.0 is a current trend in the manufacturing industry; automating and simplifying repetitive processes with Artificial Intelligence and Machine-to-Machine communication to reduce costs is a key part of it. Furthermore, finding
ways to increase safety with the use of modern technology is also an area of research. The
use of Virtual Reality (VR) for training and education is one example of how technology can
be used to safely train workers and reduce risks.

In order to mitigate and prevent accidents, it has been shown that proactive measures are of
great value. Proactive measures can be for example safety training for workers to learn and
repeat the safety principles and standard emergency procedure of the workplace. Another
type of proactive measure is the use of Personal Protective Equipment (PPE), such as gloves,
hard hats, safety glasses and vests. Researchers have explored automated ways of monitoring
and detecting unsafe conditions, such as non-usage of PPE in manufacturing and on construc-
tion sites [21, 10]. Because of major advances in the area of machine learning and computer
vision, this is a common approach for such automated monitoring. Cloud services such as
Microsoft Azure, Amazon Web Service, and Google Cloud Platform offer a wide range of cog-
nitive services, one of which is computer vision. These services provide advanced algorithms behind an intuitive interface, making it possible to utilize image recognition without extensive experience in machine learning. Although this could decrease the time and effort
for building computer vision models, further research is needed to explore the performance
and eventual limitations of such models.

While there are many advantages with automated monitoring, the use of it can raise questions
regarding the privacy of the people being monitored. It is therefore important to consider


ethical aspects when implementing such technology in a workplace. Previous research has
been done studying the acceptance and attitudes of workers towards electronic monitoring
[40, 43, 14]. The findings in the works show how awareness of such technology can affect
the way workers feel about it. There have also been suggestions of different approaches to
increase the level of privacy in monitoring, such as blurring regions like faces that can identify
a person [4].

This work is made in collaboration with the Swedish forest company Holmen. Holmen is
one of Sweden’s largest forest owners, and manufactures wood products, paperboard and
paper. Holmen also produces renewable energy. The collaboration was mainly with the IT
department at Holmen (Holmen IT) and one of their paper mills. Holmen has set the goal
of being the best paper mill in the world by 2025. One focus area towards this goal is that Holmen wants their employees to be free from work-related injuries and to feel safe going to work every day. Therefore, Holmen is interested in gaining knowledge of how accessible
technology can contribute to the occupational safety at their paper mills.

1.1 Motivation
In manufacturing industry environments, such as paper mills, there are often vehicles used
for transportation of goods and humans moving close to each other. A collision between a
human and a vehicle can lead to serious injuries and can even be fatal. To avoid collisions
there are safety instructions stating that all people inside the area must wear PPE for protection and/or to be more visible. However, these rules are sometimes missed, overlooked or forgotten, which can immediately expose the person and others to risk. An image recognition system could help decrease these risks by identifying people in the scene who are not following the safety rules.

When implementing an image recognition system for monitoring in a workplace, it is impor-


tant that it is accepted and well received by the employees. People can have different attitudes
towards monitoring systems, and as the system processes images, it can raise questions re-
garding the privacy of the employees. It is therefore necessary to investigate employees’
feelings and what they believe is important regarding the implementation of such a system.

1.2 Aim
The aim of this thesis is to explore the possibilities of increasing the usage of safety equipment by means of image recognition, in order to improve personal safety in a paper mill. This is done
by providing a starting point for Holmen in the implementation of a system where image
recognition is used to detect safety equipment. Furthermore, the aim is to examine how such
a system can be introduced without intruding on the privacy of the employees.

1.3 Research questions


The main question is: How can an image recognition system be implemented that makes more em-
ployees wear safety equipment? This is a broad and complex question where many components need to be considered. In this thesis we have identified two research questions that can help
answer the main question.

1. How and to what extent can machine learning models be trained to recognize safety
equipment?


2. What are the privacy concerns and attitudes regarding the use of image recognition for
detecting safety equipment?

1.4 Delimitations
Today, the main digital services that Holmen is dependent on are Microsoft services, and it is
also within the range of these services that Holmen wishes to investigate the possibilities of
developing a system for improving their work environment. Furthermore, Holmen wishes
to investigate if a solution can be implemented using cloud services. Therefore, this thesis fo-
cuses on exploring how to detect safety equipment using cloud services within the Microsoft
domain.

2 Theory

In this chapter, theory and background information on concepts related to this work are presented.

2.1 Image recognition


Image recognition is a branch of computer vision, which is a broad research area including
techniques such as 3D modelling and object tracking. The history of computer vision originates in the early '70s when, unlike previous work in the area of image processing, the goal was to enable full scene understanding by retrieving the 3D structure of the world from
images [39, p.12]. Today, computer vision is used in different real-world scenarios such as
Optical Character Recognition, which distinguishes text in images and converts it into a text file. It is also used in medical imaging for registering pre-operative and intra-operative
imagery, and in fingerprint recognition for authentication [39, p.5].

Image recognition focuses on interpreting images and identifying different variables, such as
locations, people, and animals. For humans, the task of interpreting what we see in an image
is easy because we know that objects can appear in different shapes, angles and lighting conditions.
The human eye has no trouble distinguishing between for example a cat and a dog, and
understanding that different breeds of dogs still are dogs.

This thesis involves Image Classification and mainly Object Detection. These techniques are
related to image recognition and are described in more detail in the following sections.

2.1.1 Image classification


Image classification is the task of comprehending an image as a whole and determining its class from a predetermined set of classes. Typically, when using image classification, each image includes only one object; for example, an image of a cat is of class "cat". Image
classification has become easier with deep learning algorithms. Deep learning is a type of
machine learning and a subset of Artificial Intelligence that makes it possible for machines
to learn from observed data. Neural networks are made of layers of nodes and deep neural


networks have multiple layers. Each layer of nodes trains on a set of features; the further you advance into the network, the more complex the features that can be recognized by the nodes.

A Convolutional Neural Network (CNN) is a type of neural network that is common when
analyzing images, as CNNs are designed on the assumption that the inputs are images [30]. Unlike regular neural networks, the nodes in a CNN are arranged in three dimensions: height, width and depth.
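The layered feature extraction described above rests on the convolution operation, where a small kernel slides over the image and produces a feature map. As an illustrative sketch (our own, not code from the thesis), a single 2D convolution can be written in pure Python:

```python
def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1): the core operation a CNN
    layer applies repeatedly to detect local features such as edges."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            # Element-wise product of the kernel and the image patch, summed.
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out
```

In a real CNN, many such kernels are learned per layer and stacked along the depth dimension; deep learning frameworks implement the same operation far more efficiently.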

2.1.2 Object detection


Object detection is a technique of detecting already known objects in an image. It is based on
similar techniques as image classification but object detection includes the task of localizing
the objects. The goal of object detection is to scan an image and determine which objects occur in it, chosen from a predefined list of categories of interest, for example human,
bicycle, house, and dog. If an object is present in the image, the location and dimensions of
the object are given by a bounding box tightly bounding the object [23]. The bounding box is
defined by a point, a width and a height. In Figure 2.1, eight objects have been detected in a
street view and each detection bounding box has been drawn onto the image.

Figure 2.1: Objects, classified as cars and pedestrians respectively, detected on a street in New York.

The area of object detection can be divided into two subgroups: object instance recognition
and object class detection [47]. The first group, object instance recognition, aims to
identify previously seen object instances such as Big Ben, seen in Figure 2.2. Object instance
recognition can be seen as a matching problem where the object to be re-identified in a new
image is the same as the previously seen object, apart from some changed surrounding conditions such as location, angle, and lighting. The second group, object class detection,
on the other hand focuses on detecting instances of predefined categories such as humans,
cars, bicycles, and cats, seen in Figure 2.3. The latter subgroup of object detection is a more
challenging task because of the variation that can occur within a category [23]. A car can have


different sizes, colours, shapes etc., meaning that the appearance of a car can vary a lot while it still belongs to the same category. When research in the field of object detection began, it centered on a few categories such as faces, whereas the range of categories is now broader in order to support more general detection systems [23].

Figure 2.2: An example of object instance recognition. Big Ben with a surrounding bounding
box.

Figure 2.3: An example of object class detection. A cat with a surrounding bounding box.

A lot of research has been done in the field of detecting faces [31], a technique that for instance
is built into most digital cameras. Another focus area of object detection is the detection of


pedestrians that can be used in autonomous cars. Histogram of Oriented Gradients (HOG) is
a well known algorithm used for detecting pedestrians [27, 7].

When creating a custom-made object detection model, an algorithm or neural network is needed as well as a labeled training data set. The performance of the model is highly dependent on its robustness, which allows the model to predict correctly despite shifting conditions of light, pose or the characteristics of the camera [16].

2.1.3 Training an object detection model


Training an object detection model is accomplished by providing the model with images
where the objects of interest are annotated. Each object is tagged with its correct class and
bounding box; this is called the "ground truth". The training data should contain a good variation of images in order for the model to be robust. A robust model is able to detect objects in images under difficult conditions such as varying weather, angles or lighting.
Object detection is a difficult task and there are robustness challenges with for example occlu-
sion, scale variation and rotations of objects. Having a rich and varying data set can prevent
these problems.

Occlusion is when an object is partly hidden by another object in the image. This is a problem
within object detection, and in [11] it was shown that in street scenes, about 70% of pedestrians are occluded in at least one frame of the video. The task of detecting objects despite occlusion
can be of great importance for computer vision applications. Similar to street scenes, activities in a paper mill environment often involve numerous workers appearing in pairs or groups. Additionally, workers in a paper mill can be amongst vehicles, equipment etc.,
which can generate a level of occlusion. A possible solution for solving occlusion problems
is to include more data where occlusion appears [41]. In that way, the model can learn the
invariance that can occur due to occlusion. However, occlusion can appear both in common patterns and in rare cases that almost never happen, which means that it is almost impossible to provide images that cover all cases of occlusion.

Scale variation refers to the challenge of detecting objects in various sizes, in particular very
small objects. According to Li et al. [20], small-size objects are difficult to detect for two
reasons; small instances often occur with "blurred boundaries and obscure appearance" [20,
p. 985], making it difficult to distinguish from the background. Secondly, large-size objects
usually have big visual differences from small-size instances. Different solutions to this prob-
lem have been proposed, such as using feature pyramids [22, 23] or using different sub-
networks that are trained on small and big instances respectively [20]. One way of improving
the performance is to build the training sets with sufficient variations [9].

Another difficulty is to detect objects with varying shapes due to the angle or viewpoint from
where the object is seen. An object can appear very different when seen from the same height as the object compared to when it is seen from above. To address this problem, adding images from different viewpoints to the training set has been shown to generate a more robust model [36].

2.1.4 Evaluating an object detection model


When evaluating an object detection model, the performance of the model is measured to
examine if the algorithm performs as desired. Evaluating an object detection model is a more
complicated task than evaluating a classifier. Not only must the model detect that there is an object in an image and correctly classify it, but it must also localize the object by predicting
the bounding box coordinates of the object. Therefore, when evaluating an object detection


model, it is necessary to evaluate the performance of classifying and localizing the object. This
makes it a two-part problem [45]. An image classification model can only classify an image
correctly or incorrectly while an object detection model can have detected an object at the
right location, but classified it incorrectly and vice versa.

When working with machine learning algorithms, cross validation is a commonly used tech-
nique in which the model is trained using a subset of the data set and then evaluated using
the complementary subset of the data set. Traditionally, these algorithms are evaluated us-
ing measures such as accuracy, recall, and precision. These metrics are based on the number
of correct and incorrect predictions that a model does on a test set. For each prediction, the
predicted value and the actual value can be represented in a confusion matrix, seen in Figure
2.4. If an algorithm correctly predicts a positive class, it is a True Positive (TP). If an algorithm
correctly predicts a negative class, it is a True Negative (TN). If an algorithm incorrectly pre-
dicts a positive class, it is a False Positive (FP). If an algorithm incorrectly predicts a negative
class, it is a False Negative (FN). In object detection, a true negative corresponds to every part of the image where there is no object; this is not useful and is therefore ignored.

Figure 2.4: Confusion matrix of actual and predicted values.

Accuracy describes how often an algorithm classifies a data point correctly. Accuracy is
defined as the number of true positives and true negatives divided by the number of true
positives, true negatives, false positives, and false negatives. A model that achieves 100 %
accuracy classifies every sample correctly. Accuracy is calculated as follows:

A = (TP + TN) / (TP + TN + FP + FN)

Since accuracy is calculated using true negatives, and true negatives have no useful meaning
in object detection, accuracy is not used to evaluate an object detection model. Instead, the
measures recall and precision are used.

Recall is a measure of how many actual positives that are correctly predicted as positive. The
importance of the recall measure varies depending on the use case. In the context of medical


applications, recall can be seen as the primary measure because the aim is to find all actual positives [33]. Recall is calculated as follows:

R = TP / (TP + FN)

Precision is similar to accuracy but focuses only on the predictions of the positive class. Pre-
cision denotes the proportion of predicted positives that are correct predictions. Precision is
calculated as follows:

P = TP / (TP + FP)
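The two formulas above translate directly into code. As a minimal sketch (our own illustration, not the evaluation tooling used in the thesis), with TN omitted since it is ignored in object detection:

```python
def detection_metrics(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall from detection counts.

    True negatives are not passed in, as they have no useful
    meaning in object detection and are ignored.
    """
    # Guard against division by zero when there are no predictions
    # or no positives at all.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, a model with 8 true positives, 2 false positives and 2 missed objects has precision 0.8 and recall 0.8.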

When evaluating an object detection algorithm’s performance, these measures mentioned


above are not applicable directly since the decision whether an object has been detected or
not is not a binary one [45]. If an object is detected, a bounding box is drawn describing the
position of the object. For the detected object to be counted as a TP, the bounding box needs to overlap to a certain extent with the ground truth bounding box. In Figure 2.5, the
ground truth box is drawn in blue and the predicted box in red. For the dog to be counted as
detected by the model, the overlap of the two boxes needs to be considered. To calculate this,
the metric Intersection over Union (IoU) is often used [23].

Figure 2.5: An example of a ground truth and a predicted bounding box overlapping.

IoU is a measure of the extent to which a detected bounding box and its ground truth box overlap. An IoU threshold is often predefined and is used to determine if a detection is
classified as true positive. The IoU is calculated as follows:

IoU = Area of intersection / Area of union

This formula can be visualised as in Figure 2.6.
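For axis-aligned boxes given by their corner coordinates, the IoU formula can be sketched as follows (an illustrative implementation of the general metric; the (x1, y1, x2, y2) box format is our own assumption):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Identical boxes give an IoU of 1.0 and disjoint boxes give 0.0; partially overlapping boxes fall in between.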


Figure 2.6: Graphical representation of Intersection over Union (IoU).

If the IoU threshold is set to 0.5, and the IoU of a prediction is ≥ 0.5, it is counted as a true
positive. A true positive means that an object has been found and that it has been correctly
classified, see Figure 2.7 a. If the IoU is < 0.5, it is a false positive, which means that an object
has been incorrectly detected. Another type of false positive is when an object is detected but
labeled incorrectly, see Figure 2.7 b. If the model does not detect an object when it should,
meaning that it does not draw a bounding box around the object, it is a false negative, like
the dog in Figure 2.7 c.

(a) Both objects are detected and classified correctly; therefore this image has two true positives.
(b) The dog is predicted as a cat and is therefore a false positive. The cat is correctly predicted and is therefore a true positive.
(c) The cat is detected as a true positive; the dog is not detected and is therefore a false negative.

Figure 2.7: Definition of TP, FP and FN. Source: Adapted from [34].
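The decision rules above can be sketched as a small matching routine. This is an illustrative simplification (a greedy one-to-one matching of our own devising, not the evaluation code used in the thesis), assuming boxes of a single class in (x1, y1, x2, y2) format:

```python
def match_detections(preds, gts, iou_thr=0.5):
    """Match predicted boxes to ground-truth boxes of one class.

    Each prediction is matched greedily to the unmatched ground truth
    with the highest IoU; returns (tp, fp, fn) counts.
    """
    def iou(a, b):
        # Intersection over Union of two axis-aligned boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    unmatched = list(gts)
    tp = fp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_thr:
            tp += 1            # found and sufficiently overlapping
            unmatched.remove(best)
        else:
            fp += 1            # spurious or badly localized detection
    return tp, fp, len(unmatched)  # remaining ground truths are FNs
```

Full evaluation protocols additionally sort predictions by confidence and handle multiple classes, but the TP/FP/FN bookkeeping follows this pattern.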


With the TP, FN and FP, the recall and precision of an object detection model can be calculated
as above. From the calculated recall and precision, a precision-recall curve can be computed,
see Figure 2.8. This curve describes the trade-off between precision and recall. If the recall is
high and the precision is low, it means that the model has made many detections, but most of them are incorrectly labeled. If the model has high precision and low recall, few detections are made but most of them are correctly labeled. The precision-recall
curve should ideally have high precision as recall increases for the model to perform well [1].

Figure 2.8: Precision-recall curve.

For evaluating object detection models, it is common to use the metric Average Precision
(AP) or mean Average Precision (mAP). The AP can be calculated from the area under the
precision-recall curve. mAP is the average of the AP calculated over all classes and/or IoU
thresholds. AP and mAP are often used interchangeably [1].
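As an illustrative sketch of the area-under-the-curve computation (our own, using all-point interpolation; the function name and input format are assumptions for illustration, not the tooling used in the thesis):

```python
def average_precision(detections, num_gt):
    """AP as the area under the interpolated precision-recall curve.

    detections: list of (confidence, is_true_positive) pairs, one per
    predicted box; num_gt: total number of ground-truth objects.
    """
    # Rank detections by descending confidence and accumulate TP/FP counts,
    # recording precision and recall after each detection.
    ranked = sorted(detections, key=lambda d: -d[0])
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        tp += 1 if is_tp else 0
        fp += 0 if is_tp else 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # All-point interpolation: precision at recall r is the maximum
    # precision observed at any recall >= r.
    ap, prev_recall = 0.0, 0.0
    for i, r in enumerate(recalls):
        ap += (r - prev_recall) * max(precisions[i:])
        prev_recall = r
    return ap
```

A perfect detector (all ground truths found before any false positive) yields an AP of 1.0; interleaved false positives lower the curve and thus the area.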

2.1.5 Object detection for Personal Protective Equipment


The task of detecting safety equipment such as helmets and reflective vests has been studied before. In a paper by Waranusast et al. [42], the authors detect motorcycle safety helmets worn by people riding motorcycles. Their methodology is to first distinguish a motorcycle from other objects in motion. Since the helmet is usually in the top part of the object, the authors define this as a Region of Interest (ROI). The ROI is then extracted and classified as either "wearing a helmet", "not wearing a helmet" or "undefined".
The authors study the accuracy in different input conditions with respect to in which lanes
the motorcycle appears:

• Condition 1. Input with moving objects in the near lane

• Condition 2. Input with moving objects in the far lane

• Condition 3. Input with moving objects in both near and far lanes

The accuracy for these conditions was 84 %, 68 % and 74 %, respectively. It was found that most errors occurred for Condition 2, when the objects were further away. According to the authors, this indicates that the low resolution of the input images plays an important role for the accuracy of the system. Another error found in the study was that when objects overlapped with each other, the system treated them as one single object. These findings are of interest when evaluating the prototype in this study.

In the article written by Li et al. [21], the authors propose a method for detecting safety hel-
mets on humans. The background to the work is to ensure the wearing of a helmet at power
substations as the number of accidents has increased. In the power substation, a camera is
installed at a fixed location, which means that the background never changes. The method
for detecting if people wear helmets is divided into three phases. At first, objects in motion
are recognized by the ViBe background modelling algorithm. Then HOG feature extraction and an SVM (Support Vector Machine) classifier are used to locate humans. Finally, by focusing on the top region of each detected human, a colour feature is used to recognize safety helmets. The authors evaluate the system using real-time surveillance videos from the location and show that it manages to detect not only one person but also several in the same image. Furthermore, they collected ten video streams and estimated the accuracy by counting the detection results of people with and without helmets. The accuracy was calculated to be 80.7 %. The authors conclude that some improvement is needed under varying light conditions in the scene. When developing our system, we will keep in mind that varying light conditions must be taken into account when training the algorithm.

In [10] by Delhi et al., a similar use case as the one presented in this thesis is studied. The
authors present a framework which predicts if a person is wearing a hard hat and safety
jacket. By implementing a model using a CNN called YOLOv3 the framework outputs 4
classes: NOT SAFE, SAFE, NoHardHat and NoJacket. An important part of the development
of the model is the data collection needed to train and evaluate the model. In total, a data set of 2500 images was used, gathered both through manual collection and by scraping images online using web crawlers. The model was evaluated on a test data set and achieved an average precision of 97 %. Additionally, the system was developed to generate alarms and reports if and when a safety risk is detected.

In a study by Nath et al. [28], the authors present three deep learning models to verify PPE
compliance of workers in real time from a video stream. The models verify whether workers on a construction site are wearing hard hats, vests or both. The study compares three different approaches which make up the models. The first approach is to detect workers, vests and
hard hats, after which a machine learning model verifies if each worker is wearing the equipment properly. The second approach is to detect a worker and simultaneously verify PPE
compliance. The third approach is to first only detect the worker and then crop the image in
the area of interest, after which the cropped image is classified by a CNN-based classifier. This
classifier verifies if the worker is wearing a vest, a hard hat or both. All models are trained
on the same data set containing approximately 1500 annotated images with 4700 occurrences
of workers. The data set is produced using crowd-sourcing and web-mining images. The
crowd-sourcing images are collected from three different construction sites while the web-
mined images are gathered from the web using search engines like Google. The authors
recommend the second approach as it performed best and achieved the most accurate detection
of PPE with a mAP of 72.3 %. This thesis investigates a similar problem, and the results by
Nath et al. can be of value when selecting an approach for detection of safety equipment.

2.1.6 Platforms for image recognition


Today, several cloud applications on the market provide image recognition. Often, the image recognition is an out-of-the-box service, meaning that the service is accessible immediately without any specific installation or configuration. Amazon Web Services, Google Cloud Platform and Microsoft Azure are examples of cloud providers of image recognition. Da Costa and Pisa [8] studied Amazon Web Services and made a case study with both Amazon's out-of-the-box service and the framework service. While the out-of-the-box service is very simple and requires no knowledge in machine learning, the framework service is aimed at data scientists who train and test their own models. The case study showed that the out-of-the-box service had problems identifying multiple objects in an image. When the framework service was used instead, with a custom model, the performance improved. The out-of-the-box service can be a good starting point, but for more specific problems the framework service is more appropriate.

In this study, the Computer Vision platform provided by Microsoft Azure will be used to train a model for object detection. The architecture and the underlying algorithms are hidden, and it is therefore necessary to treat our model as a black box. The algorithms' structure cannot be changed or affected; the model can only be influenced by training it with your own data. This means that the control is limited to the inputs and outputs of the black box.

Azure Computer Vision


Azure provides a variety of services related to computer vision [25]. The Azure Computer Vision API provides advanced algorithms for processing images. The user uploads an image or gives an image URL, and a pretrained model analyzes the image and quickly provides a response. The service can, for example, determine whether an image contains any objects, find specific brands, or find human faces. Although this service is a good choice for predicting images without the need to train the underlying model, it is not sufficient for this project.

Azure Custom Vision [26] is a platform that, in contrast to the Computer Vision API, lets you train your own image detection model. The service can be managed from the Azure portal on the web. Azure Custom Vision has two features: image classification and object detection. Image classification gives one or more labels describing what is in the image, while object detection in addition provides the coordinates of where each object can be found in the image. When creating an image classifier, the user uploads two groups of images, one with the features of interest and the other without, and then labels every image. When creating an object detection model, only images with the object(s) of interest are needed. Custom Vision uses machine learning algorithms to analyze the labeled images and trains a model. The model's performance is validated in the portal by evaluating the model's accuracy on the same images that are used for training it. Evaluating the model on the training data set does not conform to the cross-validation approach described in Section 2.1.4 and will therefore not be used in this thesis. When the model is trained, the user can test, re-train and, when satisfied, use the model to classify new images. When creating a model in Azure Custom Vision, it is recommended to have a data set containing at least 50 annotations per label. As an alternative to deploying the model online at Azure, the model can be exported from Azure Custom Vision in the formats ONNX, CoreML, TensorFlow, Docker image and Vision AI Development Kit.
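Since validating on the training images themselves gives an overly optimistic estimate, a held-out test set is preferable. A minimal holdout-split sketch (our own illustration, not a feature of Custom Vision):

```python
import random

def holdout_split(images, test_fraction=0.2, seed=42):
    """Shuffle and split a list of image identifiers into a training set
    and a held-out test set that the model never sees during training."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split([f"img_{i:03d}.jpg" for i in range(100)])
print(len(train), len(test))  # 80 20
```

Here the test images would be annotated in the same way as the training images but only used for the final evaluation.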

Previous research has used Azure Custom Vision for classification problems with good results. Batugal et al. [5] investigate if Azure Custom Vision can be used for detecting rice spoilage before consumption. In the study, two data sets were gathered: spoiled rice and not spoiled rice. For training the model, 500 images with each label were used. The performance of the model was measured to 85 % accuracy. Pejčinović [32] gives an overview of Custom Vision and researches its features and possibilities in the field of traffic sign classification. In that work, a model in Custom Vision was trained in three iterations, with the training data set improved in between. The model's performance was evaluated with the metrics precision, recall and F1-score. The first iteration of the model was trained on a poorly balanced training set: one class had 113 images while another contained only 19. This led to low accuracy for the classes with a low number of images in the training set. To improve the performance, more iterations were done. In the third iteration, training images with different backgrounds, lighting and camera angles were used. This iteration was more balanced and included 100 images for every class. The result was a precision of 81.6 %, recall of 84.7 % and F1-score of 83.1 %.
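The reported numbers are consistent with the usual definition of the F1-score as the harmonic mean of precision and recall; a quick check (our own illustration):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Pejčinović's third iteration: precision 81.6 %, recall 84.7 %.
print(round(f1_score(0.816, 0.847), 3))  # 0.831
```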

2.2 Introducing an image recognition system in a workplace


In a workplace, new technology can lead to improved safety, mobility and efficiency. While
technology can provide many positive outcomes, it needs to be examined from a legal and
ethical point of view before being introduced. In this section, we therefore present how the technology may conflict with legislation. This is followed by a discussion of how the system in this thesis can be used ethically, covering aspects that affect people's attitudes and how privacy can be protected.

2.2.1 Laws and regulations


The regulations relevant when performing camera surveillance in Sweden are found in both
the Camera Surveillance Act (Kamerabevakningslagen) [17] and the General Data Protection
Regulation (GDPR) [19], as surveillance can be considered sensitive from an integrity per-
spective. An assessment of a company’s own interest in surveillance must be considered in
relation to individuals’ interest in privacy. The Camera Surveillance Act states that it is not
required to have a permit for camera surveillance of a place to which the public does not
have access. The same law also states that information about camera surveillance must be
provided through clear signs or in some other effective way.

Camera surveillance where people can be identified is considered processing of personal data. Anyone who wants to perform camera surveillance must therefore ensure that it is done in compliance with GDPR. The purpose of GDPR is to protect the fundamental rights of individuals, in particular their right to protection of personal data. GDPR applies to almost all types of activities, regardless of who performs the processing of personal data: companies, organizations and authorities alike. Any data that can be linked to an individual or can identify an individual is referred to as personal data. Photos of an individual are personal data regardless of whether a name or other identification is attached. Data is considered personal even if the person is not directly identifiable but can be identified indirectly through analysis or in conjunction with other similar data.

GDPR states that processing of personal data is not legal without a legal justification. Legal justification can be consent, where the subject has agreed to the data processing, or legitimate interest. Data processing can be done without consent if the interest for which the data is processed outweighs the subject's interest in privacy, and if the processing is necessary for the given purpose. The purpose sets the boundaries for what is allowed, such as what information can be processed and for how long it can be saved. The purpose of the surveillance may, for instance, be monitoring for safety or technical reasons.

GDPR states that the subject of data processing must always be informed about the legal justification on which the personal data is collected. Personal data is only allowed to be collected for the specifically stated and justified purpose. Therefore, the purpose of the data processing has to be clarified and communicated before the collection takes place. It is not allowed to process more personal data than is necessary to fulfill the purpose. In the case of surveillance, this could be interpreted as the cameras covering as small an area as possible. GDPR also states that personal data should be deleted when the data is no longer needed.


According to GDPR, the collector of personal data must ensure that the data is protected and that unauthorized persons cannot access it. Therefore, technical and organizational security measures such as firewalls, encryption and anti-virus protection should be used.

In GDPR, anonymized data is defined as data altered in such a way that the subject can no longer be identified. If the data is anonymous, it is no longer within the scope of GDPR. To anonymize data in an application for facial recognition, identification and person detection, Burileanu et al. [4] present a solution where a blur filter is applied to faces, making the people in the images unidentifiable.
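The idea of blurring a detected face region can be illustrated with a simple box blur. The following is a minimal pure-Python sketch of the concept (our own illustration, not the method from [4]), operating on a grayscale image represented as a 2D list:

```python
def blur_region(image, top, left, height, width, k=1):
    """Replace each pixel inside the given rectangle with the mean of its
    (2k+1)x(2k+1) neighbourhood, a simple box blur. `image` is a 2D list
    of grayscale values; returns a new, partially blurred copy."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(top, min(top + height, rows)):
        for c in range(left, min(left + width, cols)):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - k), min(rows, r + k + 1))
                for cc in range(max(0, c - k), min(cols, c + k + 1))
            ]
            out[r][c] = sum(neighbours) // len(neighbours)
    return out

# A sharp edge inside the "face" rectangle is smoothed out.
img = [[0, 0, 255, 255] for _ in range(4)]
blurred = blur_region(img, top=0, left=0, height=4, width=4)
print(blurred[1])  # [0, 85, 170, 255]
```

In a real system the rectangle would come from a face or person detector, and a stronger blur (larger `k`, applied repeatedly) would be used.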

When performing surveillance at a workplace in Sweden, the above mentioned laws have
to be followed because people can be identified in the images. If faces cannot be seen in the
image, some argue that GDPR does not apply. However, in a workplace where a limited number of people, namely the employees, appear, anonymization through blurring faces may not be enough. If people can still be identified based on their clothes, the occasion, etc., larger interventions are needed, such as blurring the whole person. In addition,
according to the Co-determination act [18] (Förhandlingsskyldighet), there is an obligation to
negotiate when performing surveillance at a workplace. This means that the employer must
negotiate with the affected workers organization regarding the decision to introduce camera
surveillance at the workplace.

2.2.2 Aspects that affect attitudes and protect privacy


Because the system in question monitors employees to see if the right equipment is worn, it can be seen as a video surveillance system. In general, surveillance in workplaces can be done using different courses of action, such as electronic surveillance, including email monitoring and web surfing activity, or video surveillance, which is becoming more common as camera equipment has become more affordable and the technology easier to use. The motivation for monitoring employees in a workplace can be to increase efficiency, measure productivity and decrease risks, where the latter is the case when controlling safety equipment. Monitoring for measuring efficiency amongst workers is often called performance
surveillance. While surveillance can help improve a business, the implementation of surveil-
lance systems can raise ethical questions regarding the privacy of employees. There are many
different opinions and aspects to consider when discussing ethics in surveillance. Alder argues that electronic monitoring is not "inherently ethical or unethical" [2, p. 741] and that
the focus should be on how monitoring can be used ethically. Marx argues that "the ethics of
a surveillance system needs to be judged according to the means, the context and conditions
of data collection, and the uses/goals" [24, p. 171].

The concept of privacy can be defined as the ability to control what information others can
access about you [15]. People may not want others to get information that is considered
personal or sensitive. Privacy is considered a fundamental human right in the majority of
countries in the world. It is obvious that privacy is highly valued, and invasion of privacy can therefore be considered unethical. As previously discussed, laws like GDPR exist in order to protect people's private information. But it should be kept in mind that what is considered an invasion of privacy can vary a lot between different people. In the context of video surveillance, recording and saving images of employees where people can be identified can be considered an invasion of privacy. The location of a camera and the area covered by the camera view can also vary, and people may feel differently about being monitored depending on where the camera is placed. It might for instance be different to be filmed in a passage than in the lunch area at a workplace. A large part of these issues
are covered by GDPR, but even if a surveillance system is approved by law, people may still
feel that their privacy is intruded on.


In order to understand how monitoring and surveillance systems affect employees, studies have been carried out where employees have been interviewed or participated in surveys. These studies have produced several findings that point to important aspects of implementing a surveillance system.

Previous research has studied whether the perceived goal of a system, how it is going to be used and what purpose it fulfills influence how employees feel about it. In an article by Stanton and Weiss [37], the authors explore experiences and ideas regarding electronic monitoring, such as email and website monitoring. The study has a qualitative, exploratory and generative approach, with the goal of eliciting thoughts and new perspectives on the impacts of electronic monitoring. To examine these aspects, a survey was conducted. The results suggest that some employees are not disturbed by monitoring systems, and some even appreciate them. Participants that were not bothered by the monitoring believed that nothing was being done with the information, while the ones that were bothered expressed that they were unsure about how the information was used or how important the management considered
it to be. Of the participants that were monitored mainly for security reasons, where the monitoring did not cover their work performance, 8 out of 9 reported not being disturbed by the monitoring. This suggests that attitudes about monitoring may depend on how the
monitoring information is being used. The article is from 2000, and even though electronic
monitoring has changed in 20 years, we argue that the findings are relevant to consider as
the monitoring techniques examined are still relevant and used today. The same motivation
applies for an article by Wells et al. [44] from 2007 which shows similar results. In this work,
relationships between the perceived purpose of Electronic Performance Monitoring (EPM)
and job attitudes are studied. From studied literature, the authors state hypotheses regarding
how different perceptions of the purpose relate to different job attitudes. These job attitudes are: perceived system fairness, job satisfaction, organizational commitment and felt obligation to reciprocate. The authors test two different perceived purposes: 1. to develop appropriate behaviour and 2. to deter negative behaviour. The hypotheses are that the first perceived purpose will be positively and significantly related to the job attitudes, while the second will be negatively and significantly related to them. An example of two hypotheses related to the same job attitude is:

HYPOTHESIS 1. Perceptions that the purpose of the EPM system is to develop appropriate behavior will be positively and significantly related to the perceived fairness of the system.

HYPOTHESIS 2. Perceptions that the purpose of the EPM system is to deter inappropriate behavior will be negatively and significantly related to the perceived fairness of the system [44, p. 126].

The authors conducted a survey which confirmed all hypotheses and indicated that the perceived purpose has an impact on the different job attitudes. The study concluded that the employees were more positive towards an electronic performance monitoring system if the perceived purpose was their own development rather than deterrence of negative behaviours. For the system in this thesis, the presented results indicate that it is important to communicate what the information is used for and that the purpose of the system is to increase safety, not to measure work performance. Furthermore, they suggest that the purpose should be communicated as developing appropriate behaviour, to encourage and remind employees to use the correct safety equipment, rather than deterring failure to wear it.

While video surveillance can be beneficial for organizations, it can lead to uncertainty among the employees, which affects their attitudes towards the system. Uncertainty can emerge if the employees do not trust their employer's use of personal information or if they are uncertain about the cause of the monitoring. Botan [6] claims that one panoptic effect of surveillance is increased uncertainty amongst employees when they do not understand why they are being monitored and do not see the logic behind it. In addition, Botan says that uncertainty increases when the actors are unsure of when they are monitored and how the gathered information is used. Tabak and Smith [40] state that as many as a fifth of all major U.S. companies carry out electronic monitoring of their employees in secret. Tabak and Smith propose that secret monitoring will lead to uncertainty amongst the employees, who may think that the monitoring is not carried out the right way and may show dissatisfaction with the management.

As stated earlier, the perceived purpose can affect the attitudes of employees, which will be more negative if the employees lack knowledge of or trust in the purpose of a system. Therefore, transparency and communication can be of great importance for obtaining positive attitudes. Schnackenberg and Tomlinson define transparency as "the perceived quality of intentionally shared information from a sender" [35, p. 5]. In the context of video surveillance in a workplace, the definition can be interpreted as the perceived amount and quality of information that is communicated to employees regarding the monitoring practices. According to Hagen et al. [14], transparency when performing surveillance occurs in four stages. First, the visibility of a deployed camera can be handled in different ways: the camera can be placed visibly or not, and it can be accompanied by signs making it clear that cameras are present. Second, the extent to which the cameras will be recording can be communicated or visible; the cameras can be recording continuously or only under special circumstances. Third, there can be transparency and openness about the use, storage and processing of the footage, i.e. information regarding whether the footage is going to be kept internally in the organization or shared with others. Fourth, there can be transparency about the circumstances under which information from the footage can be used. If the organization is transparent in all four stages, it is more likely that employees will understand the purpose, which can reduce the risk of employees feeling uncertainty.

Researchers have explored how transparency affects attitudes. Alder [2] argues that communication in both system design and implementation is the key to an ethical electronic performance surveillance system. Involving the people who are going to be monitored in the system design, and informing employees about monitoring practices, affects their attitudes. While Alder reasons about performance surveillance, and the article is from 1998, we argue that this can also be applied to monitoring systems used today for safety reasons, like the system in this thesis. Similarly, in a paper by Workman [46] it was found that transparency has a large impact on attitudes towards surveillance. In the article, the author summarizes the results of a previous field study of the mitigating influence of justice perceptions. The study was performed at a multinational company based in the USA and included 387 participants. The author states that even if surveillance can help improve security, it may come with some risk of employees acting in protest. Workman found that employees seemed to be more accepting when they felt that there was transparency in how the monitoring was performed. They also seemed more accepting when the purpose of the monitoring was communicated and the personal information handled was protected.

Additionally, Alge [3] states that employees seem to be more likely to accept being monitored in the workplace when they are also given the opportunity to participate and give input into the design and implementation of the monitoring system. Participation enhances the feeling of control over how personal data is used and therefore reduces the sense of privacy invasion.


2.3 Interview techniques


In this thesis, employees' feelings regarding an image recognition system for monitoring safety equipment are going to be examined. To elicit information, opinions, values, etc. from people, interviews can be a suitable approach. In this part, we present different interview techniques that could be used.

When arranging interviews, the form and style are determined by the purpose. The main attribute when forming an interview is how structured it is. The most structured forms of interviews can be used when the subject is known. The questions in a structured interview are straightforward and specific because the interviewer knows what answers are needed to fulfill the purpose of the interview. This interview form is common in market research, where the interviewer wants to find out about buying habits. A structured interview requires a lot of preparation by the interviewer in order to elicit the needed information. When performing a structured interview, no unexpected discoveries are made, and a risk with the structured interview approach is missing important information. [12]

A semi-structured interview has the important attribute of being flexible. It is partly structured and needs a high level of preparation. A semi-structured interview means that all interviewees get the same initial questions. The questions can then spontaneously be followed by supplementary questions to ensure coverage if some area was not fully covered. The questions are open and invite the interviewee to think freely. [13]

An unstructured interview leaves the responsibility of leading the way to the person being interviewed. Unstructured interviews can be used in an initial phase, when the researcher wants to find out what needs to be investigated. When using unstructured interviews, it is common that the researcher does not know what to find and therefore does not know what to ask. [13]

3 Method

This chapter describes how the work in this thesis was performed. This includes a sample system implementing image recognition for detecting protective equipment, an investigation of the impacts of different training sets on the object detection model, and interviews with employees at Holmen to explore feelings and attitudes about image recognition systems for monitoring.

3.1 Implementation of PPE detection system


In this part, the technical aspects of how a system for detection of PPE with computer vision can be implemented to work in practice are explored. A sample system is developed with the aim of being usable in a real environment. The system should have the ability to detect people with and without the correct safety equipment. As the training data is important for the robustness of a model, as discussed in Section 2.1.3, the system should also be able to add training images to the training set continuously for it to work well in a real environment. In the context of this thesis, the implementation is also used for demonstrating the system in the interviews.

Wearing the correct safety equipment can mean wearing a helmet, a reflective vest and safety glasses; in other cases it can mean wearing only a helmet. Since this creates many combinations and variations that the system would need to handle, we have limited the correct safety equipment to mean wearing a reflective vest. It is therefore only required that the system be able to detect a person with and without a reflective vest.

3.1.1 Prerequisites
The hardware used to run the system, including the real-time processing of the model, consists of an Intel Core i7-7600U CPU @ 2.80 GHz with 8 GB of RAM. The model is created in Microsoft Azure's service Custom Vision, where an account is required. The basic environment consists of .NET Core, C# and OpenCV on a Windows 10 Operating System (OS). Visual Studio 2019 is used for development. For video capturing, a web camera with a USB connection is used. The mentioned setup is based on the documentation available for utilizing Azure Custom Vision.

The prerequisites are as follows:

• An account in Microsoft Azure


• Custom Vision training and prediction resource in Azure
• .NET Core
• Windows 10
• Visual Studio 2015 or later
• Web camera with USB connection
• Event Hub resource in Azure

3.1.2 Description of artifact


The architecture of the system is described in Figure 3.1. The central part of the system includes two applications, the training application and the scoring application, built in .NET Core and located on the same device running Windows 10. A camera is connected to the device running the applications and provides an input video stream to both applications. The training application is connected to a Custom Vision resource in Azure for training a model. The trained model can be exported and used by the scoring application for predictions. Based on the predictions, the application outputs messages including label and location information of detected objects. If the application is running on a device with a screen, the predicted image is displayed with the detected objects surrounded by bounding boxes. The system can also output information about detected objects to the Event Hub and write it to a log file.
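As a sketch of what such an output message could look like, the following shows one possible structure; the field names are our own illustration, not the exact format used by the implemented C# application:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Detection:
    """One detected object: class label, confidence and a bounding box
    given as relative coordinates (fractions of image width/height)."""
    label: str
    probability: float
    left: float
    top: float
    width: float
    height: float

# Hypothetical message for one detected person wearing a vest.
msg = Detection("PersonWithVest", 0.91, left=0.12, top=0.05, width=0.30, height=0.85)
print(json.dumps(asdict(msg)))
```

A serialized message of this kind could be sent to the Event Hub or appended to the log file.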

In the following parts, the sub-parts of the system are described in more detail.

Training application
The training application takes a video stream from the connected camera as input. Images are retrieved by extracting frames from the video stream at an adjustable rate. The images are uploaded to the target Custom Vision resource in Azure. The app repeats this procedure until it reaches the target number of images, which by default is set to 30. While the app is capturing and uploading images, you can expose the camera to the types of objects that you desire to train your model on. When the app has finished uploading, the images can be found in the Custom Vision resource.

An alternative to using the training application is to upload images to Custom Vision directly
from a local folder through the Custom Vision web portal.
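The frame-sampling idea behind the training application can be sketched as follows (our own illustration of the sampling logic only; the real application is written in C# with OpenCV):

```python
def frames_to_capture(stream_fps, capture_rate_hz, target_count):
    """Indices of the frames to grab from a video stream so that images
    are captured at roughly `capture_rate_hz` images per second until
    `target_count` images have been collected."""
    step = max(1, round(stream_fps / capture_rate_hz))
    return [i * step for i in range(target_count)]

# A 30 fps stream sampled at 2 images/second, 30 images by default.
indices = frames_to_capture(stream_fps=30, capture_rate_hz=2, target_count=30)
print(indices[:4])  # [0, 15, 30, 45]
```

Spacing the captures out like this gives a more varied training set than grabbing consecutive frames, which would be nearly identical.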

Training the model in Custom Vision


In Custom Vision, a model is built based on training images. When creating a model, you determine which objects in the images you wish the model to detect. This is done by defining class labels. When building a model to detect a person with and without a reflective vest, the class labels defined are "PersonWithVest" and "PersonWithoutVest". The approach of detecting the person and the vest simultaneously has been shown to result in the best performance [28].

3.1. Implementation of PPE detection system

Figure 3.1: Architecture of the system.

The images that the model is going to be trained on are uploaded to Azure Custom Vision through the training application. As mentioned, it is not necessary to use the training application; images can also be uploaded from the local file system. For each image, objects are manually tagged by drawing a box around the object, the ground truth, as in Figure 3.2. Only objects of interest are tagged and labeled with the class they belong to. When drawing the boxes, we are meticulous about drawing the box closely around the object to avoid background noise, which can affect the model negatively. Images that do not contain any of the objects of interest are not used to train the model. When the labeling is done, the model is trained in Custom Vision, after which it is ready to be used for predicting. The model can be exported as an ONNX file to be used locally.

Figure 3.2: Labeling an object in Custom Vision.


Scoring Application
When the ONNX model has been downloaded, it can be used for predicting. In this state, the application captures images from the video stream in the same manner as described for the training application. The model evaluates each image and outputs the prediction, including all found objects in the image and the location of each object's bounding box. For predicting, a probability threshold can be set. This defines the lowest probability that a found object can have to be counted as detected. The result of the prediction of each image can be shown on a display by drawing the bounding boxes onto the images with the name of the label and the probability attached. Because of the speed of the system, the result shown on the screen is close to a real-time video with detection of objects. The application also sends the result as a message to the Event Hub in Azure. If the image contains any objects, the class label is included in the message.
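The thresholding step described above can be illustrated with a small sketch. The `Detection` structure and the function name are our own; the artifact itself is written in .Net Core, so this Python fragment only mirrors the logic, not the actual implementation.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    label: str          # e.g. "PersonWithVest"
    probability: float  # model confidence in the range 0..1
    box: tuple          # (left, top, width, height)

def filter_detections(predictions, threshold=0.5):
    """Keep only objects whose probability reaches the threshold,
    mirroring the probability threshold of the scoring application."""
    return [d for d in predictions if d.probability >= threshold]

preds = [
    Detection("PersonWithVest", 0.91, (0.1, 0.2, 0.3, 0.6)),
    Detection("PersonWithoutVest", 0.34, (0.5, 0.1, 0.2, 0.7)),
]
kept = filter_detections(preds, threshold=0.5)
# only the 0.91 detection passes a 0.5 threshold
```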

Azure Event Hub


Azure Event Hub is a service for big data streaming. It is a platform that can receive millions of events per second from a variety of sources. The Event Hub can be used for processing events sent from separate applications and systems. These events can be messages with information about an application, and can be stored, processed and analyzed. In this thesis, we make use of this service when the system discovers a person that does not comply with the required safety regulations (not wearing the right equipment). The system can send a message to the Event Hub, where it can be used to, for instance, send a notification to a mobile phone or be stored for statistical use.
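The exact message schema of the artifact is not documented in this chapter; purely as an illustration, a message carrying label and location information could be assembled as below. The field layout and the camera identifier are assumptions of ours, not the artifact's actual format.

```python
import json
from datetime import datetime, timezone

def build_event(detections):
    """Assemble a JSON message body of the kind that could be sent to
    the Event Hub: one entry per detected object, with its class label
    and bounding box. The schema here is illustrative only."""
    return json.dumps({
        "camera": "entrance-1",  # hypothetical camera identifier
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "detections": [
            {"label": label, "box": list(box)} for label, box in detections
        ],
    })

msg = build_event([("PersonWithoutVest", (0.4, 0.1, 0.2, 0.7))])
```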

3.2 Improving the object detection model


As mentioned in Section 2.1.6, the algorithms of a model built in Custom Vision cannot be accessed, and control over the performance is limited to the training data of the model. Therefore, this part focuses on investigating if, and to what extent, using different training sets impacts the performance of the model. The chosen features represented in the training sets are based on previous research and are all conditions that often occur at a paper mill. We also add features that we believe can improve the model based on conditions that could occur at Holmen, such as background.

3.2.1 Data Set Preparation


Collecting data and preparing data sets for training machine learning algorithms is a time-consuming part of the work. As mentioned, adding images for training can be done with the training application of the artifact. In this case, however, we wanted to be able to edit and control the data sets as much as possible before uploading to Custom Vision, and therefore the training application was not used. Instead, data was collected manually by recording videos using an iPhone. The videos were recorded in an outdoor environment. Four people walked in front of the camera and were captured in the video. The collection of images with the different features was performed in a controlled environment that should represent and be similar to the real environment. Working in a controlled environment facilitates the ability to control and adjust the presence of the features. From the videos, images were extracted and used to create data sets. In total, six data sets were prepared and later split into training sets, see Table 3.1, and a test set, see Table 3.2. In the tables, the number of occurrences of each class label included in the data sets is presented. One data set provides a baseline, which functions as a fixed point of reference and is used for comparison. The remaining data sets include images that take one feature into account in isolation. The data sets are described in more detail later in this chapter.


Training sets
Features #PersonWithVest #PersonWithoutVest
Baseline 50 50
Lighting conditions 25 25
Background 25 25
Angle 25 25
Distance 25 25
Occlusion 25 25

Table 3.1: Training sets.

A consideration made when collecting the data was to vary the appearance of the people in the images. This was done by changing clothes, accessories etc., to avoid the model associating, for example, a specific color of pants with wearing safety equipment. It is important that the model identifies the vest specifically and not other features.

When creating data sets for training a model, it is important that the occurrences of each class label in the training set are balanced. In this project, that means having a similar number of people with and without reflective vests in the training sets.

Baseline
The images collected for the baseline were simple, with the intention of being easy for the model to predict. People in the scene were between 4 and 6 meters from the camera, which means that the people in the images are always in full-body perspective and clearly visible. The camera was placed at eye level with the people. The background was static with few disturbing objects. The images were taken in daylight. The people in the pictures did not overlap each other, so no occlusion occurred. An example image from the baseline data set can be seen in Figure 3.3a.

Lighting conditions
When creating the data set for different lighting conditions, images were collected in the same way as described for the baseline. The images were post-processed, where the light was adjusted using a photo editing program. The light in the images was lowered to resemble a dark environment, which can be seen in Figure 3.3b.

Background
The images collected with varying backgrounds were taken at three different locations. The locations differed from each other and were more cluttered than the background in the baseline. An example image from one of the three locations can be seen in Figure 3.3c. When varying the background, the placement of the camera relative to the people in the scene was the same as in the baseline.

Angle
To capture images of people from a different angle, the images were taken from a height of 4 meters. In Figure 3.3d, an image from the data set can be seen. The location was otherwise the same as for the baseline.


Distance
Images were collected of people placed 60 to 80 meters from the camera. In the images the
people covered approximately 5 % of the image height. In Figure 3.3e an image from the data
set can be seen where a person with vest is located 80 meters from the camera.

Occlusion
For the data set with occlusion, the setup was the same as for the baseline. When extracting the images from the video, only images where occlusion occurred were chosen for the data set. Only occlusion where a person covered another person was used. Occlusion appeared in different forms, one of which can be seen in Figure 3.3f, and occurrences where more than 50 % of the person was covered were dismissed.

(a) Baseline (b) Light

(c) Background (d) Angle

(e) Distance (f) Occlusion

Figure 3.3: Example images from the data sets.

Test set
From the collected images, some were picked out for a test set. The test set contained images with varying lighting conditions, backgrounds, angles, distances and occlusions, as well as images with characteristics like the baseline data set. The images in the test set were not included in any of the training sets. The composition of the test set can be seen in Table 3.2.


Test set
Features #PersonWithVest #PersonWithoutVest
Baseline 16 14
Lighting conditions 10 10
Background 10 10
Angle 10 10
Distance 10 10
Occlusion 13 7
Total 69 61

Table 3.2: The test set.

In total there were 69 tags for the PersonWithVest class, 61 tags for the PersonWithoutVest
class and 81 images in the test set.

3.2.2 Training
The training of the model was done in iterations, visualized in Figure 3.4. In the first iteration, the training set was the baseline, and for each following iteration, the baseline was combined with one of the training sets taking a different feature into account, as described in Table 3.1. In each iteration, the model was evaluated. By adding the feature training sets one by one and evaluating the performance of the model against the baseline, the effects of the different features could be measured separately.
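The iteration scheme can be sketched as follows; the file names and the dictionary structure are illustrative placeholders of ours, not artifacts of the actual pipeline.

```python
# Each model is trained on the baseline plus exactly one feature set,
# so the effect of each feature can be measured in isolation.
baseline = ["base_001.jpg", "base_002.jpg"]       # placeholder file names
feature_sets = {
    "lighting":   ["light_001.jpg", "light_002.jpg"],
    "background": ["bg_001.jpg", "bg_002.jpg"],
    "angle":      ["angle_001.jpg"],
}

iterations = {"baseline": list(baseline)}
for feature, images in feature_sets.items():
    # baseline combined with one feature training set
    iterations[feature] = baseline + images

# each entry in `iterations` would be uploaded to Custom Vision,
# trained, and evaluated against the same test set
```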

Figure 3.4: Iteration loop for training and evaluating the model for features.

3.2.3 Evaluation
To evaluate whether the model improves, the performance needs to be measured. In object detection, this requires one or several test sets along with evaluation metrics. It is common to calculate the mAP of a model to measure its performance. In this part, the details of how the mAP was computed are described.

As mentioned, the model was evaluated each time a new set of training images was combined
with the baseline. When evaluating the models, the same test set was used for all iterations.
The evaluation was done using an open source project which computes evaluation metrics for the object detection problem. To evaluate the model, the human interpretations (ground truths) were compared with the model predictions on the test set. The ground truths for each image in the test set were annotated in a text file in the format <class> <left> <top> <width> <height>, see Figure 3.5. The predictions were annotated in a text file in the format <class> <confidence> <left> <top> <width> <height>, see Figure 3.6. The ground truth and prediction text files were taken as input to the program. The output of the program is a precision-recall curve and AP for each class, and finally the mAP. The IoU threshold was set to 50 %, a threshold that is often used [23].
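The IoU criterion that decides whether a prediction matches a ground truth can be computed directly from the annotation format above. The following sketch, with a function name of our own, shows the computation for boxes given as (left, top, width, height):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as
    (left, top, width, height), matching the annotation format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# a prediction counts as a true positive when IoU >= 0.5;
# two boxes overlapping by half their width give IoU = 1/3
score = iou((0, 0, 10, 10), (5, 0, 10, 10))
```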

The evaluation metrics used in the open source project were precision and recall. From these, a precision-recall curve was computed and displayed. Finally, the AP for the two classes and the mAP were calculated from the precision-recall curves.

Figure 3.5: Text file with ground truths.

Figure 3.6: Text file with predictions.
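The internals of the evaluation program are not reproduced here; as an illustration, AP with all-point interpolation of the precision-recall curve can be computed roughly as follows. The function name and input format are our own, and detections are assumed to already be matched against the ground truths using the IoU criterion.

```python
def average_precision(detections, total_gt):
    """AP from (confidence, is_true_positive) pairs and the number of
    ground-truth objects, using all-point interpolation of the
    precision-recall curve."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each ranked detection
    for _, is_tp in ranked:
        tp, fp = (tp + 1, fp) if is_tp else (tp, fp + 1)
        points.append((tp / total_gt, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # interpolated precision: the best precision at this recall or higher
        interp = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * interp
        prev_recall = recall
    return ap

# three ground truths, two correct detections ranked first and third
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], total_gt=3)
# the mAP is then the mean of the per-class APs
```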

3.2.4 Combined training set


As a final step, we wanted to test the performance of a model where all the presented features were represented in the training data. Hence, a training set including all the data in the presented training sets was constructed and used to train a model. The model trained on the combined data was evaluated according to Section 3.2.3.

3.3 Examining the attitudes and effects of image recognition


In this study, we aim to investigate employees’ attitudes and feelings regarding monitoring and the use of image recognition. Furthermore, we are interested in gaining insight into what is important to take into consideration when using an object detection system at a workplace. In order to do this, interviews were held with employees at Holmen.

3.3.1 Interviews
Interviews are a common technique for explicitly reporting people’s perspectives, as described in Section 2.3. As part of exploring the attitudes and privacy concerns, we have therefore chosen to perform interviews. The interviews were designed as semi-structured, which means that questions are prepared but leave room for supplementary questions and discussion.


Procedure
Each interview was held with one participant at a time over a video call, with one interviewer and one person taking notes. The interviews were recorded if the participant agreed to it. Before each interview, the participants gave oral consent to participate. In this stage, general information about the interview was presented to the participant. The information included a presentation of the interviewers, the intention of the interview and a clarification that the project was only for research. The participant was also informed about the data that was going to be collected and how the data gathered from the interview would be used and stored. Lastly, the participants were informed that they could terminate the interview at any time and that the consent was always revocable. The interviews were held in Swedish and lasted approximately 45 minutes each.

Outline and content


The interview consisted of two parts. The first part had a quantitative approach, where the intention was to capture the participants’ general thoughts about privacy and monitoring. This part included questions that were answered on a five-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree), see Figure 3.7. The use of quantifiable questions makes it easy and fast for the participant to answer many questions, while the answers are easy to analyze when the interview is over. The quantitative part consisted of 13 items inspired by Stone [38]. Two examples of the items used are "Organizations should be allowed to collect information about people without their permission" and "I feel somewhat uncomfortable about some of the ways the organization is collecting personal information about me". Before the participants answered the questions, they were informed about what we meant by a number of terms in the questions, also inspired by Stone. We defined personal information as any information that can be linked to an individual, such as social security numbers, bank details or social activities. With organization, it was explained that it did not refer to Holmen, but could mean organizations of all types, such as companies, associations or authorities.

Figure 3.7: Likert scale.

Before the second part of the interview, we wanted to give the participant an idea of the image
recognition system. Therefore the participant was shown a short video of the artifact running
the scoring application.

The second part of the interview consisted of questions that opened up for more discussion,
and the participants were encouraged to elaborate on their answers. The intention was to
gather thoughts, feelings and ideas of the image recognition system. Examples of questions
were "14. How do you feel about an image recognition system like this?" and "18. Do you
believe that the image recognition system can increase the safety at the paper mill?". We
were also interested in the employees’ feelings about the existing surveillance cameras at the
paper mill and how they thought the image recognition system would differ. Therefore, we
asked questions like "19. How do you feel about the existing surveillance cameras at the
workplace?" and "21. Do you think there would be a difference between how you feel about
the existing surveillance cameras and this image recognition system?". The intention was also
to explore employees’ opinions regarding collection, storage and usage of data. An example
of a question was "22. What type of data do you think is okay for the organization to collect?".

Participant   Age   Gender   Years at Holmen   Area of work
A             57    Male     32                Factory worker
B             51    Male     33                Engineer
C             53    Male     28                Team leader
D             39    Male     18                Human resources
E             49    Male     31                Laboratory technician

Table 3.3: Participants of the interviews.

For some of the questions in the second part, we gave examples, for instance for the question "26. How and who should be notified about someone not wearing the correct safety equipment?". Here we gave the examples "Direct alarm to the person concerned", "Direct alarm to the manager", "Direct alarm that is visible to all", and "Data collection for statistics". This was to make it easier for the participant and to give some ideas, but the participant was not limited to the examples.

Participants
A total of five people were recruited to participate in the study. When recruiting the participants, it was important that all were employees at Holmen working at the same paper mill. The participants should belong to different work areas at the mill, and all should on a daily basis move in areas where there are surveillance cameras and where protective equipment needs to be worn. Among employees who met these requirements, participants were selected at random. The chosen employees were contacted through email and participation was voluntary. After each interview, demographic data were collected about the participant’s age, gender, number of years at Holmen, and area of work. The data of the five participants can be seen in Table 3.3. All participants were men and the average age was 49.8 years, where the youngest was 39 and the oldest 57. The participants had worked at Holmen between 18 and 33 years, and for an average of 28.4 years.

Data analysis
From each recorded interview, careful notes were taken, without writing down every single word said during the interview. Notes were taken for each interview by both the person holding the interview and the assistant separately. By doing that, there is a bigger chance of capturing the "essence", since we could have interpreted things differently. After that, the answers from all the participants were summarized in a document, question by question, to be able to compare the answers and find similarities and differences between them.

4 Results

In this part, the results from the models trained with different features and from the interviews are presented.

4.1 Performance of different features in the training sets


In this part, the results from the evaluation of the object detection models with respect to different features are presented. First, a summary of the performance of the different models is presented, see Table 4.1. For every model including a feature, the average precision (AP), along with the precision-recall curve for the two classes, PersonWithVest and PersonWithoutVest, is then presented. Furthermore, the mAP for every feature is presented.

Feature                 AP PersonWithVest   AP PersonWithoutVest   mAP
Baseline                83.25%              71.08%                 77.17%
Lighting                83.50%              70.83%                 77.16%
Background              85.74%              74.11%                 79.92%
Angle                   84.33%              76.07%                 80.20%
Distance                94.03%              82.28%                 88.15%
Occlusion               85.26%              70.85%                 78.05%
All features combined   94.34%              90.32%                 92.33%

Table 4.1: Metrics for each model with different features included in the training set.


Baseline

(a) PersonWithVest (b) PersonWithoutVest

Figure 4.1: Precision-Recall curves, baseline.

The precision-recall curves for the baseline are shown in Figure 4.1. The AP of the class
PersonWithVest was 83.25 % and the AP of PersonWithoutVest was 71.08 %. The mAP of the
baseline was 77.17 %. In 4.1 (a), we see that the precision is high for a recall under 0.8, and
that the precision decreases for recalls above 0.8. In 4.1 (b), there is a similar pattern, although
the precision decreases somewhat around recall of 0.5.

Lighting conditions

(a) PersonWithVest (b) PersonWithoutVest

Figure 4.2: Precision-Recall curves, lighting conditions.

In Figure 4.2, the results from adding images with darker lighting conditions to the baseline are shown. The AP for PersonWithVest was 83.50 % and the AP for PersonWithoutVest was 70.83 %. This results in a mAP of 77.16 %, which is a small decrease compared to the baseline. Since there was an increase in the AP of the class PersonWithVest, the decrease in mAP is due to the decrease in AP for PersonWithoutVest. The precision-recall curves are similar to the baseline.


Background

(a) PersonWithVest (b) PersonWithoutVest

Figure 4.3: Precision-Recall curves, background.

The results of the model including training images with varying backgrounds are shown in Figure 4.3. The AP was 85.74 % for PersonWithVest and 74.11 % for PersonWithoutVest. For PersonWithVest, the precision is high for recall under 0.8 and decreases drastically for recall higher than 0.8. The mAP was 79.92 %, which is an improvement of 2.75 percentage points compared to the baseline.

Angle

(a) PersonWithVest (b) PersonWithoutVest

Figure 4.4: Precision-Recall curves, angle.

The results of the model including training images with varying angles are shown in Figure
4.4. The AP for the two classes was 84.33 % for PersonWithVest and 76.07 % for PersonWith-
outVest. The mAP was 80.20 %.


Distance

(a) PersonWithVest (b) PersonWithoutVest

Figure 4.5: Precision-Recall curves, distance.

In Figure 4.5, the precision-recall curves for the model including different distances are shown. Here we can see a significant improvement for both classes. The AP for PersonWithVest was 94.03 % and the AP for PersonWithoutVest was 82.28 %. This is an increase of 10.78 and 11.20 percentage points from the baseline, respectively. The mAP was 88.15 %, 10.98 percentage points better than the baseline. The precision-recall curves show that the precision stays high until a recall of around 0.9 for PersonWithVest and around 0.8 for PersonWithoutVest. We can see some decrease in precision in (b) at a recall of around 0.6.

Occlusion

(a) PersonWithVest (b) PersonWithoutVest

Figure 4.6: Precision-Recall curves, occlusion.

The results of including images with occurrences of occlusion are shown in Figure 4.6. The AP was 85.26 % for PersonWithVest and 70.85 % for PersonWithoutVest. The AP for PersonWithoutVest decreased compared to the baseline. The mAP was 78.05 %.


All features combined

(a) PersonWithVest (b) PersonWithoutVest

Figure 4.7: Precision-Recall curves, all features combined.

The results when combining the previous training sets, including all features, are shown in
Figure 4.7. The AP for PersonWithVest was 94.34 % and for PersonWithoutVest it was 90.32
%. The mAP was 92.33 %, which is an improvement of 15.16 percentage points compared to
the baseline.

4.2 Outcome of the interviews


In this part, the results from the interviews are presented. The results from the initial part of the interviews, which is of a quantitative design, are presented in Table 4.2. The participants are anonymous and therefore named A-E according to Table 3.3. Each question in the table is answered by the participant with a number ranging from 1 (strongly disagree) to 5 (strongly agree) according to the Likert scale. Then follows the outcome of the interview questions that are of a qualitative nature. The participants’ answers have been gathered and grouped by themes.


Question A B C D E
1. Organizations should be allowed to 1 3 3 2 2
collect information about people with-
out their permission.
2. Organizations should be allowed to 1 1 1 1 2
collect information about people with-
out their awareness.
3. Employees should be involved in 5 5 5 4 5
the implementation process of moni-
toring systems.
4. The amounts and types of per- 5 5 4 5 5
sonal information stored by organiza-
tions should be strictly limited.
5. The uses organizations make of 1 1 1 1 1
personal information should have few
controls.
6. Organizations that collect and store 3 5 3 3 5
personal information should not have
the right to release this information to
other organizations.
7. I’m not at all happy about the 2 5 4 1 3
fact that many organizations are stor-
ing personal information digitally.
8. I feel quite upset by the uses that or- 4 3 3 2 3
ganizations make of personal informa-
tion about me.
9. I feel somewhat uncomfortable 3 3 5 3 3
about some of the ways the organiza-
tion is collecting personal information
about me.
10. I would feel somewhat uncomfort- 3 5 4 2 4
able if the organization would evalu-
ate my actions by collecting personal
information about me.
11. I feel quite upset by the fact that 5 2 2 3 2
organizations save images of me.
12. I feel that electronic monitoring at 3 3 3 5 4
the workplace is fair.
13. The purpose of a monitoring sys- 5 5 3 5 5
tem has impact on how I feel about it.

Table 4.2: Participants A-E’s answers to the quantitative survey.

Initial thoughts of the system


Participant A felt at first glance that the system could be quite good. What participant A
feels depends a lot on the purpose for which the system is implemented. If the purpose of
the system is to help and improve safety, then it could be positive. Participant D felt that the
system can be good in the purpose of seeing if the incentives introduced at the workplace
work or not. It could illustrate if people follow the rules or not which makes it possible
to work with the problems. Participant E also thought that the system could be positive
depending on the purpose, and that a system like this should not be forbidden but not taken
for granted. Both participant B and D felt positive about the system because it makes it
possible to check whether people wear safety equipment or not. Participant B added that it
becomes more problematic when people can be identified in the images. Participant A stated
that cameras can be privacy intrusive. Participant C thought that it is not easy to say to what
extent it feels okay or not that the system is used and gave some examples:

For example in London with crime and such things, it’s really good with image
recognition. Like if the police use it in subways to look for criminals, it’s really
good. But whether they belong in a workplace, I don’t know.

The participants were all unanimous about what they thought the intention of using the system is. They stated that the system is meant to increase the usage of personal protective equipment. Both participant A and C mentioned that it is easy to become skeptical. Participant A said that if you are conspiratorial, suspicions about the purpose of the system can arise, which the participant believes may be the case for several employees.

Consequences of the system


Participant D stated that the safety at the paper mill can be improved by using the image recognition system. When people are aware that someone is watching them, they feel pressured to follow the rules and are therefore more careful. Participant D wished that the system would contribute to more people wearing safety equipment even when the system is not running, and that the system would contribute to better safety thinking amongst people in general. Participant C also thought that the safety would improve and said that, for him as a supervisor, the system could make it easier to demonstrate that a problem exists. Participant B said that many already have reflective material on their work clothes and that all people at the paper mill already wear safety equipment. Participant A reasoned the same as participant B but thought that the system could be used as a complement and as a basis for discussing safety. Participant E also thought the usage of safety equipment was already high at the paper mill, but that the system could be more comprehensive if there would be a remark when someone lacked the right equipment.

Participant A, C and D said that the safety at the paper mill could increase when using the
system. Participant E said it could increase slightly. Participant B said that it would not make
a big difference.

Participant A, B, C and D all thought that the system could potentially affect the behaviour of people who are exposed to the system. They stated that some people may feel uncomfortable and stiff, and that some people may take another way to avoid the system if possible. Participant A stated that the system could contribute to a greater amount of stress. Participant D thought that people may feel that the company has reduced trust in its employees and reasoned:

I can imagine that people would think that "okay, now they’re monitoring me,
don’t they believe I would have worn safety equipment anyway, without this
camera?"
... I think that the recognition factor is something people can get annoyed at.

Also, the system could raise a fear of being caught for not doing the right thing. Participant
C added that even if the system can make people uncomfortable, it is something you get
used to. Participant E was doubtful that the system would have any behavioural effects. This
participant reasoned that there were already many cameras at the workplace today and that
people stop thinking about the cameras after a while.

Existing surveillance cameras


Regarding the surveillance cameras that exist at the paper mill today, the participants were generally positive. Participant A expressed that the fact that the purpose was known, and that he knew whether the cameras were recording or not, was important for how he felt about it. Participant B reasoned similarly and thought that the benefit outweighed the inconvenience. Participant C said that he did not care that they were there, that he did not think anyone thought about them, and that it was for a good cause. Participant D said that he did not have any specific feelings about the cameras, was positive about the ability to view a location without having to be there, and thought cameras were something you get used to. Participant E similarly did not think about the existing cameras a lot and said they were helpful for the operators.

There were differences in how the participants described how they were informed about the
existing surveillance cameras at the paper mill. Participant A and C were informed through
negotiations or at union meetings. Participant B and E said they did not get any specific in-
formation but instead discovered the cameras themselves by noticing the cameras or screens
displaying camera views. Participant E additionally said that he later had received informa-
tion through the union. Participant C said that the information also could be found on the
company’s intranet. Participant D described that he received this type of information on the
intranet, at workplace meetings or through a manager.

Differences between image recognition and surveillance


When the participants were asked whether they thought that there would be a difference be-
tween how they would feel about the image recognition system and the existing surveillance
cameras, their answers were varying. Participants A said he could not say yes or no. Par-
ticipant B and D thought there would be a difference. Participant B reasoned that this type
of system would likely be present in a passage where people would pass and get an OK be-
fore entering an area, rather than monitoring a larger area where people worked throughout
the day. This would mean that people would relax after passing the camera with the image
recognition system. Participant D thought there would be a difference in how he would feel
about such a system since he would be aware that the camera would gather information
in a different way and that the information would be used differently compared to traditional
surveillance. The participant expressed:

I would be unsure of where the information ends up and would pay more atten-
tion when passing the camera.

Participant C did not believe there would be a difference between the two systems.

Data gathering, storage, usage and sharing


In general, the participants did not feel differently about which data would be okay for the
workplace to gather, store, use or share. Participants A, B and C all felt that monitoring phone
calls and email of employees did not feel okay. Participant A said it would intrude on
personal integrity. Participant B reasoned that email surveillance could possibly be justified
since it is done on behalf of the company. Participant C claimed that as long as he managed his job,
the company did not have anything to do with what he did in his spare time. However, if
there is reason to believe that an employee has done something illegal, going through the
email should be allowed.

4.2. Outcome of the interviews

Participant E gave an example and said it would not be okay to record employees during
an entire work day from their laptop web camera. He thought there was a difference if the
laptop would record sound, which would feel worse and cross the line. If the purpose is
not to increase safety, he said he did not understand the reason to record. Participant E said
there was a difference between collecting and storing data and believed that an organization should not
store data for a long time. The participant thought data could be okay to save to be able to
show examples of good and bad behaviour, situations or similar.

How detections from the system should be used


Regarding how the system should use the information and who should be informed about
the absence of protective equipment on workers, the participants all agreed that gathering
statistics would be a good approach. Participant A said that gathering information could be
used to see general behaviours at the workplace. This could be used for group discussion
to identify how large the problem is. The participant claimed that this type of information
would be best suited to give to the site manager, as that is who is responsible for the working
environment. Participants B and D also expressed that statistics about how many wear the
right equipment would be good; Participant B thought this information could be given to a
group leader. Participant C also said that gathering statistics should be the starting point for
this type of system, to examine how large the problem is. According to participant C, this
task could be assigned to the safety manager.

Another possibility that the participants expressed was to send an alarm directly to the person
missing required safety equipment. Participant C could see difficulties with having to identify the
person, and would prefer to remain unidentifiable in the images. This participant added that
he thought this was something that could change and would feel more acceptable in a few
years.

Participant E thought that if someone was missing required equipment and entering an area
with dangerous machines, it would be of most use to send an alarm to the people running
the machines. This way, the machines could be turned off and the person missing equipment
could be contacted and reminded to put on equipment. The participant did not see a reason
for the alarm to go to a guard. The participant further reasoned that the alarm could be
direct and visible for all, which would make everyone pay attention.

Participant A said that the alarm could be sent directly to the closest manager. This way, the
manager could talk to the person of concern and discuss that he/she was missing protective
equipment. Participant B, D and E did not think a direct alarm to the closest manager would
be a good idea. Participant B said:

If I would have walked through an area where I could barely see there was a
camera, and then my manager would get a message about me not wearing safety
glasses for example, and I wouldn’t even know it had happened, it would have
felt like an infraction.

Participant B would prefer that the system would give an alert that is visible for all when a
person for example goes through a passage. Participant D said that it would be better to create
a system that contributed to employees voluntarily wearing equipment more often rather than
getting a notice from their manager about it. Participant E thought an alarm to a manager
seemed unnecessary.


How data from the system should be managed


Participants A, B and D all stated that the data gathered by the system should be managed
by the IT department at Holmen. Participant A claimed that if the data was to be handled
externally, outside of Holmen, it would not feel good. The participant thought it felt like
a greater risk and that the data could be misused and sold to make money. Participants C
and D similarly said that the data should be kept internally. Participant E thought the
information should primarily be handled by the guards and whether it would be okay to
handle information externally would depend on the cause.

How to introduce the system


When introducing the system at the workplace, Participant B said that it is important to "get
all on board". The people who are affected must be given the chance to come to terms with
the idea of being monitored. It is important that the purpose of the system is presented and
clear to those concerned. Participant D said similarly that the purpose and aim of the system
needs to be anchored within the organization. Those concerned must understand on what
grounds the system is introduced. Participant C stated that information and communication
are important and above all, the purpose of the system must be explained. Participant E also
considered information and communicating the purpose to be important, and to regularly
report what the information is used for and how.

Important findings
Below is a list of important findings of the interviews.

• The purpose of the system is important to how the participants would feel about it.

• The system can be helpful and lead to an increase in the wearing of safety equipment amongst
the employees.

• Possible consequences of the system include an increase of stress, employees worrying
about getting caught for not following the rules, and a feeling that the company has low
trust in employees managing their own safety.

• The participants felt comfortable with the existing cameras and were more unsure of
how the data gathered would be used with the image recognition system.

• The participants do not want the images to be linked to individuals.

• The participants were positive about using the system to gather statistics, mainly to be
able to see how big the problem is.

• The participants would rather have a general alarm that would be visible for all than to
be pointed out in person.

• When introducing a system like this, it is important to give information to employees
and to communicate the purpose of the system.

5 Discussion

In this chapter the results are discussed, followed by a discussion of the chosen method-
ology for this thesis and of the work in a wider context.

5.1 Results
In this section the results will be discussed, both the performance of the models trained with
different features and the interview answers.

5.1.1 Object detection model performance


The precision-recall curves for the models have similar shapes as the ideal appearance de-
scribed in Section 2.1.4. All precision-recall curves in the results show that precision
is high at low recall. This result indicates that there were a number of objects in the test data
that the models easily detected as the right class and also the right location in the image. In all
curves the precision decreases when recall increases, which can be explained by the model
making many predictions but failing to label the class correctly. Since no confidence
threshold is set when calculating the performance, all predictions are included (even
the ones with very low confidence) which contributes to almost all objects being found, even
the ones that the model finds hard to classify, but for some predictions the class label is often
wrong. The precision-recall curves show that increasing the confidence threshold
decreases recall and increases precision. Depending on the area of use, precision
and recall can be prioritized differently. In the case of detecting absence of safety
equipment one can argue that high recall is more important than high precision because it is
better to find all occurrences of a person and predict some wrong. The opposite, prioritizing
precision, means finding fewer occurrences of a person but predicting most of them correctly.
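As a rough illustration of this trade-off (not the thesis implementation; the predictions and ground-truth count below are invented), precision and recall can be computed at different confidence thresholds:

```python
# Toy illustration of the precision/recall trade-off at different confidence
# thresholds. Each prediction is a (confidence, is_true_positive) pair; the
# values are invented for the example.

def precision_recall_at(predictions, num_ground_truth, threshold):
    """Precision and recall counting only predictions at or above threshold."""
    kept = [tp for conf, tp in predictions if conf >= threshold]
    if not kept:
        return 1.0, 0.0  # no predictions kept: vacuous precision, zero recall
    true_positives = sum(kept)
    return true_positives / len(kept), true_positives / num_ground_truth

# Hypothetical detector output, sorted by confidence (True = correct detection).
preds = [(0.95, True), (0.90, True), (0.80, True), (0.60, False),
         (0.50, True), (0.30, False), (0.20, True), (0.10, False)]

for t in (0.0, 0.5, 0.9):
    p, r = precision_recall_at(preds, num_ground_truth=6, threshold=t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold here moves precision from 0.62 toward 1.0 while recall falls from 0.83 to 0.33, which is exactly the movement along the precision-recall curves described above.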

The performance of the object detection model differed between the two classes PersonWith-
Vest and PersonWithoutVest. The AP for PersonWithVest was higher than the AP for Person-
WithoutVest throughout the experiment, with all training sets. The AP between the classes
differed by at least 8.26 and at most 14.41 percentage points. In the precision-
recall curves for the two classes we can see a clear pattern that the recall is higher for Per-

39
5.1. Results

sonWithVest than PersonWithoutVest regardless of the training set the model was trained on.
This indicates that the model is not as good at finding all occurrences of a person without vest
as for a person with vest. This result can be due to different causes. The people without vest
in the training set can have varying characteristics such as varying clothing colours. Since the
vest is very similar (identical), it can be easy for the model to classify it and associate it with
the class PersonWithVest. Since the number of occurrences of the different classes was
balanced in both the training and test sets, the result can be of value when creating the training
set. It indicates that the model could need a larger number of occurrences of persons without
vests in the training set.

When evaluating the impact of including different features in the training images for a model,
the results show that in general the tested features contribute to a better performance and
a more robust model. The only exception where the model did not perform better than the
baseline was the model with different lighting conditions. The result showed a higher AP for
PersonWithVest but a lower AP for PersonWithoutVest, and the mAP turned out to be slightly
lower compared to the baseline. This means that the model performs worse when dark
images are added to the training set, and it seems like the different lighting conditions confuse the
model in terms of what a person without a vest looks like.

The model trained on the background data set shows good improvements for all metrics. The
mAP increased by 2.75 percentage points, which shows that including images in the training
data from the same location as used when testing improves the model’s performance. The
result demonstrates that the background in which the objects occur when training the model
affects the model’s ability to detect correctly. The same goes for the model with different
angles, which received similar results as the background model.

The greatest difference compared to the baseline was given by the model trained on the data
set including varying distance where the mAP increased by 10.98 percentage points. It was
not surprising since the scale difference between the distance images and the baseline was
quite drastic. We assumed that the model would find it much easier to detect the small-scale
persons in the test set if it had been trained on images of persons at the same scale. In the
precision-recall curves we see that the precision is high for a recall up to around 0.9 for Per-
sonWithVest, which means the model manages to detect almost all objects and classify them
correctly. For PersonWithoutVest we can see a small decrease in precision at recall above 0.5.
It is difficult to say what caused this; we think the distance training images may
confuse the model about what a person without a vest looks like, as when adding images of varying
lighting conditions. Despite the small decrease in precision, the recall reaches close to 1
for PersonWithoutVest, which is better than for all the other models for this class. This results
in a higher AP for this class.

When comparing the model trained on the data set including occlusions to the baseline a
small increase in performance can be seen. The result was somewhat expected because the
ways in which the objects cover each other differ very much from time to time. Because the
occlusion varies so much, it will take many occurrences of occlusion in the training
set before the model has been trained on the kinds of cases that then occur when using the model.

Overall, in this study the best performance was achieved by the model trained on the com-
bined data set including all features, where a mAP of 92.33 % was reached. This result can
be compared with the result from a similar study [32] where a model was built using Azure
Custom Vision. In the study a model trained on about 100 images per class achieved a mAP
of 81.6 %. This shows that our model performs better; however, comparing the results can be
misleading because of the differences in how the test sets were shaped. In another study [10] where
a YOLOv3 model was used, the achieved AP was 97 %, measured in the same way as
mAP in this study. The result when using the YOLOv3 model is better, which may be due to
the algorithm or to the training data, which was significantly larger than the training set used in
this study. Although this study did not focus on creating the best performing model, these
comparisons can give an indication that a model for detecting safety equipment created in
Azure Custom Vision seems to perform well with limited resources invested, both in terms
of time and data.

5.1.2 Interviews
From both the quantitative and qualitative parts of the interview, it was clear that the partic-
ipants thought that the purpose of the system was important for how they would feel about
it. The answers to question 13 showed that four out of five participants thought this was important, and
one was neutral. This is not surprising and is consistent with previous research [37, 44]. If
the purpose is to improve safety, they would feel more positive about it. From the interviews
it was clear that the participants were positive that the system could help increase the usage
of safety equipment at their workplace, although some claimed the usage was already high
and that the increase would be small. From a more general perspective, we think it is positive
that the participants saw a potential increase. When people feel that the system has a helping
function, it can contribute to more people having a positive attitude towards the use and see
it as a benefit.

The employees felt comfortable with the existing cameras but were more unsure of how the data
would be used. As previous research has indicated that not knowing what the monitoring
information is used for is related to how bothered people feel about it [37], we believe this
is important to consider. It is also in line with what the participants believe is important
when introducing a system like the one in this thesis: communicating the purpose and pro-
viding information to the employees. In question 2, the participants’ answers show that they do
not think organizations should be allowed to gather information without their knowledge.
The answers to question 3 show that they believe that employees should be involved in the
introduction of monitoring systems. These results indicate that being transparent about mon-
itoring systems and their introduction is of great importance. Nor is this a surprising
finding; previous work has come to similar conclusions [14, 46].

Several participants mentioned consequences that may come with using the system. They
mentioned the system could increase the amount of stress for the employees and it could
lead to employees worrying about getting caught for not doing the right thing. These conse-
quences, amongst others not discovered in this study, are important to take into consideration
as an employer, and if the consequences are severe, one needs to think about whether it is
worth using the system. Furthermore, one participant thought that people would get a feel-
ing that the company had low trust in employees managing their own safety. We found
this interesting, since it is not anything we have found in the literature. Even though it is
difficult to draw conclusions when only one person claims it, we believe this potential conse-
quence should be taken into consideration and that it is a possible outcome of implementing
a surveillance system like the one in this thesis.

All the consequences brought up are, we argue, in line with dissatisfaction with man-
agement. Dissatisfaction with management is a potential risk when performing monitoring
in secret, as shown in [40], and is another indication that transparency is important.

When handling monitoring systems, it is important to consider what data to store and for
how long. As described in Section 2.2.1, there are laws to comply with. Apart from laws,
people have their own perceptions of where the line should be drawn regarding information
gathering. The interviews showed that the participants did not want the images to be linked
to individuals, which we believe could mean they felt it was sensitive information. Questions
4 and 5 concern how limited and controlled the gathering and usage of personal information
should be. The participants’ answers show that they think the amount and type of informa-
tion gathered should be strictly limited and that the usages of personal information should
have many controls. We can from this see that privacy is important for the participants and
that personal information should be handled carefully.

We saw somewhat of a clash of opinions since many thought direct alarm would be a good
idea, but did not want to be identifiable in the images. It is not possible to be personally
alerted if it is not known who the person in the pictures is. There is a possibility that people
are okay to be identified if it can be ensured that the information is only used to alert the
person itself without the information getting into someone else’s hands. This would require
a thorough examination of how information is gathered, stored and used.

Some of the participants would rather have the system give a general alarm that is visible
for all and where the detected person is not pointed out personally. An idea was to have
something like a light or a sound to warn people in the area where the person without safety
equipment was detected. Another possibility would be to have an alarm in the vehicles no-
tifying the drivers to pay extra attention. This is another indication that a personal alarm, or
a detection that can be traced back to an individual, would not be a suitable approach. The
participants were positive about using the system to gather information to use for statistics.
We believe this is the least sensitive way to use the system, since the objects in the
images do not need to be identified.

Based on the opinions that emerge from the employees, the system can be implemented using
different approaches and in this way, employees’ attitudes towards the use of the system can
be affected to varying degrees. To avoid storing more data than necessary, the system can be
designed so that images taken from the video stream are never stored, meaning the system
would analyze the images in real time. This could reduce worry and stress for the employees
that the system would store a lot of images of them that could possibly be used for other
purposes. An alternative would be for the system to store an image when missing
equipment has been detected, in order to be able to check afterwards that the detection
was performed correctly. This means that more data is stored but there is evidence of what
happened if, for example, someone were to claim that the result from the system is incorrect.
If the system is implemented in such a way that recall is valued higher than precision, as
discussed earlier, it can lead to more detections being made and thus the system will detect
more cases of lack of safety equipment incorrectly. For the employees, this can mean more
alarms, which can be annoying. Another consequence if the system alarms even though no
one lacks equipment is that the employees lose confidence in the system and distrust the
system even when it alarms correctly. On the other hand, if the system were to value precision
higher than recall and fail to detect a person without safety equipment, it could
lead to dissatisfaction.
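The design choices above, real-time analysis, persisting a frame only when a violation is detected, and a tunable threshold trading recall against precision, could be wired together roughly as follows. This is a hypothetical Python sketch with a stubbed detector; in the thesis system the detector role is played by Azure Custom Vision.

```python
# Hypothetical sketch of the frame-handling policy discussed above. Frames are
# analyzed in real time and discarded unless a violation triggers the optional
# evidence store; the threshold controls the recall/precision trade-off.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Detection:
    tag: str           # e.g. "PersonWithVest" or "PersonWithoutVest"
    confidence: float

def handle_frame(frame: bytes,
                 detect: Callable[[bytes], List[Detection]],
                 threshold: float = 0.3,  # a low threshold favours recall
                 store_evidence: Optional[Callable[[bytes], None]] = None) -> bool:
    """Return True if a violation was detected; optionally store the frame."""
    violation = any(d.tag == "PersonWithoutVest" and d.confidence >= threshold
                    for d in detect(frame))
    if violation and store_evidence is not None:
        store_evidence(frame)  # keep only frames that triggered an alarm
    return violation           # all other frames are simply discarded

# Toy usage with a stubbed detector:
stub = lambda frame: [Detection("PersonWithoutVest", 0.4)]
saved = []
print(handle_frame(b"frame-bytes", stub, store_evidence=saved.append), len(saved))
```

Leaving `store_evidence` unset corresponds to the first alternative in the paragraph above (nothing is ever stored); passing a callback corresponds to the evidence-keeping alternative.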

5.2 Method
In this part, the method of the study is discussed. The examination of the impact of the features
and the interviews are presented separately and discussed with respect to replicability, relia-
bility and validity. Other limitations of the method and what would have been interesting
to do differently are also presented. Finally, source criticism and the work in a wider context
are presented.


5.2.1 Examine the impact of the features


To ensure high replicability, the procedure of how the models were trained and evaluated is
described in detail. The reliability is high as there are good chances of getting similar results when
using a publicly available service like Azure Custom Vision. The underlying algorithms do
not need to be reproduced as they are provided by Custom Vision. However, since this is a
service provided by Microsoft, the algorithms behind it can change due to updates, which would
lower the reliability. Using different images also affects the performance of the model, and
can lead to different results if the experiments were to be repeated. The fact that we included
example images from the different training sets and the test set could facilitate the recreation
of similar data sets. Furthermore, the labeling of the images in Azure Custom Vision is done
manually and can differ, since not everyone will mark the bounding boxes identically. In the
same way as including example images from the training sets, we believe it helps that
an example image of how the images were labeled is included. When
evaluating the performance of the model, evaluation metrics that are commonly used for
object detection models were chosen. By using measurements that are commonly used, the
certainty of the results is increased, which in turn increases the validity of the thesis.

When the data set of images taken at long distance was collected, the people were at a set
distance. The distance set was about 80 meters, which meant that people were visible in the
pictures but no details could be discerned. Since images taken from other distances were not
used to train the models nor being evaluated, we can only draw conclusions about perfor-
mance at the tested distance. It would have been interesting to test more distances to see if
there is any breaking point where the performance of the model trained on the baseline starts
to decrease. In addition, it would have been interesting to investigate whether there is a limit
to how far away people can be in order for a model to be able to detect them at all. It could
have given information about how large an area a camera making these detections can
cover.

The approach taken in this study when creating the data set including other lighting condi-
tions was to remake images taken in daylight by adding a dark filter. All images in the data
set had about the same darkness. The study did not include other lighting conditions, which
means that conclusions cannot be drawn about the impact of conditions like
strong sunlight and shadows. The reason for limiting the study to darkness was the diffi-
culty of controlling the weather. Additionally, the difference between the lighting condition
in the baseline and stronger sunlight was assumed to be barely noticeable.
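The darkening step could, for example, look like the following sketch; the scaling factor 0.3 is our assumption here, not necessarily the filter strength actually used.

```python
# Illustrative darkening filter: uniformly scale 8-bit pixel intensities.
# The factor 0.3 is an assumed value, not the one used in the thesis.

import numpy as np

def darken(image: np.ndarray, factor: float = 0.3) -> np.ndarray:
    """Uniformly darken an 8-bit image; factor should lie in (0, 1]."""
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

daylight = np.full((2, 2, 3), 200, dtype=np.uint8)  # toy uniform grey image
night = darken(daylight)
print(night[0, 0])  # each channel scaled from 200 down to 60
```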

When collecting the data set with different backgrounds to compare to the baseline, images
from three separate locations were included. When evaluating the model trained on those
images, the test data included images from those same locations. The result indicated that
training the model on the same background that is used when scoring increased the perfor-
mance. From this finding, conclusions cannot be drawn about whether training on different back-
grounds contributes to a more robust model. To evaluate whether the model becomes more robust
and performs better in entirely new locations, images with backgrounds not included in
the training set should be in the test set.

Throughout the experiments, the data collected was originally extracted from videos. The
extracted images were split into training data and test data. Since the images were from the
same videos, the training data and the test data were very similar. As described earlier, this
means the different aspects were not varied a lot, and the training images of, for example,
distance were very similar to the images taken from a longer distance included in the test
data. This can represent a real scenario where the system used at a paper mill utilizes a
model which is trained on data collected from the same location where the model is then to
be used. It would have been interesting to include images from an entirely new location, and
with more varying distances and viewpoints in the test set. The people in the images were
also the same; it would have been interesting to test the model on entirely new people in the
images. This could indicate the model’s performance if it were to be used in another paper mill
or if the employees’ work clothes were changed.

5.2.2 Interviews
In order for the replicability of the interviews to be high, the procedure has been described
in detail. The interview outline is presented thoroughly and the interview questions are included
in the appendix. However, the fact that the interviews are partly unstructured contributes to
a lower replicability. The questions used in the interviews were based on questions used in
previous research examining the same area. However, the questions used in the interviews
were adjusted from their origin, which can decrease the validity.

The interviews in this study were held in Swedish, and from the notes the participants’ an-
swers were translated into English to be used in Section 4.2. When translating, there is a risk
of incorrect translation and word choice, which can affect the meaning of the partic-
ipants’ answers. Therefore, the outcome of the interviews can be considered to have reduced
reliability.

The reliability of the outcomes of the interviews can also be questioned due to the small
number of participants. The survey contains too few participants; more people’s opinions
would be required to provide a more general picture of what a group of people
think and to increase reliability. Still, the results from the five interviews can be
used to give an indication and identify early problems that need to be addressed.

When conducting the interviews, the variation among the participants was small. Only men
aged around 40-60 who had worked at the company for at least 18 years were interviewed. It
is possible that attitudes and feelings differ between age groups and genders. Therefore, the
validity of the outcomes of the interviews can be argued to be low because the people included in
the study are not representative of all the employees at Holmen.

When the interviews were held and the participants answered the quantitative part, we no-
ticed that some participants perceived the statements as difficult to understand or take a
position on. We believe that this may have contributed to the participants in some cases not
responding in agreement with their true opinions and feelings. We believe it could have been
avoided through some reformulations and by giving the participant more time to respond to
the statement.

5.2.3 Source criticism


The scientific sources in this thesis are mainly collected through the search engine Google
Scholar. The selected sources are for the most part peer-reviewed, published works with many
citations. It was difficult to find related work within Azure Custom Vision, as most scientific
articles develop their own neural networks for image recognition. As we still found it rele-
vant to include studies that utilize Azure Custom Vision, we selected some studies with fewer
citations. We use a few internet sources where we found it difficult to gather information
from scientific research. We had difficulties finding information about machine learning eval-
uation metrics in scientific articles, regarding for example precision-recall curves and how to
calculate mAP and AP. However, we argue that this is common knowledge within machine
learning and therefore does not need to be justified by scientific papers. In Section 2.2.1, we
have based the information about GDPR and the Camera Surveillance Act on the Swedish
Government Offices’ legal databases.

Several of the presented studies in Section 2.2.2 were conducted over 10-15 years ago and a
fair question would be whether the results would be the same today. The area of surveillance
has changed a lot and it is possible that people would have different opinions today compared
to 15 years ago. From the responses of the interviews we found several similarities with
the studied literature. This, we argue, indicates that the findings on feelings and attitudes
towards surveillance are still relevant today.

5.3 The work in a wider context


In this thesis, the ethical aspects have been examined for those who are exposed to an image
recognition system. By examining how the system may affect the employees at the paper
mill, it can help Holmen to introduce and use the system without affecting those exposed in
a negative way and to avoid unwanted consequences.

Since technology has advanced very rapidly in the field of image recognition, it has happened
that it is used in ways that many consider unethical. Several studies have been conducted
where image recognition, in particular facial recognition, is used to distinguish eth-
nicity, predict whether people are criminals, and infer personal characteristics such as gender and age [29].
As an engineer of a system like the one in this thesis, one has a responsibility to consider what
the possible consequences can be and to take proactive measures to prevent gender and eth-
nic biases. In practice this would mean, for example, including a broad representation of
humans in the training set, making the model equally good at detecting irrespective of appear-
ance. Similarly, people from all hierarchical layers of the organization should be represented
in the training set so that everyone is treated equally regardless of position.

In the progress of Industry 4.0, automation makes it possible to replace manual tasks per-
formed by humans with machines. Some therefore believe that automation will decrease the
number of full-time staff and that the desired competencies will shift towards engineers with higher
education. This is an ethical dilemma that companies need to consider. However, we are
convinced that the development of new technology will only continue, since there is financial
gain in replacing human labour with machines. Regarding the system in this thesis, how-
ever, the purpose it fulfills does not entail replacing any staff. It is mainly meant as a tool and
aid for when there is a lack of safety compliance.

We believe the use of image recognition in industry will only increase, and that the
technique can be of great value if used responsibly. If an image recognition system for detecting
unsafe conditions related to missing safety equipment can be implemented, we see possi-
bilities for other uses of the technique. A similar solution could be used for detecting a person
in a critical area, for example too close to a dangerous machine. Another use could be in jobs
where working alone is frequent, where a similar system could raise an alarm if an
accident happens and no one else is present.

6 Conclusion

The aim of this thesis was to examine how an image recognition system can be implemented
that makes employees wear safety equipment more often. The study shows that it is possible
to build a system that can detect a person with and without safety equipment, more
specifically reflective vests. The system, which was built in C# .NET Core and utilizes Azure Custom
Vision for detection and Azure Event Hubs for notifications, takes a video stream as input
and outputs the detections.
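To make the pipeline concrete, a minimal sketch of the flow is given below in Python (the thesis system itself is written in C# .NET Core, so this is an illustration rather than the actual implementation). The tag names, the confidence threshold, and the `detect` stub are assumptions; in the real system, `detect` would call the Azure Custom Vision prediction endpoint and the alarm would be published to Azure Event Hubs.

```python
# Minimal sketch of the detection pipeline: frame in, filtered detections out,
# notification decision last. Tag names and the 0.5 threshold are illustrative
# assumptions, not the thesis's actual configuration.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.5  # assumed cut-off

@dataclass
class Detection:
    tag: str            # e.g. "person" (no vest) or "person_with_vest"
    probability: float  # model confidence in [0, 1]

def detect(frame):
    """Stub for the Azure Custom Vision prediction call. The real system
    would send the frame to the project's prediction endpoint and parse
    the returned bounding boxes and tags."""
    raise NotImplementedError

def filter_detections(detections, threshold=CONFIDENCE_THRESHOLD):
    """Discard low-confidence predictions."""
    return [d for d in detections if d.probability >= threshold]

def missing_vest(detections):
    """True if any confident detection is a person without a vest;
    this is the condition that would trigger a notification."""
    return any(d.tag == "person" for d in filter_detections(detections))
```

The point of the sketch is the separation of concerns: detection, confidence filtering, and the alarm condition are independent steps, so the alarm policy can be changed without retraining the model.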

In addition to implementing a system, the goal was to examine how and to what extent a
machine learning model can be trained to detect safety equipment. Since the algorithms
that Azure Custom Vision is built upon are hidden, as engineers we are only responsible
for providing the training data. By training the model on different data sets, we obtained
interesting results showing that including varying backgrounds, angles, distances and occlu-
sions in the training set has a positive impact on the model's performance. When creating a
model for detection of safety equipment, we argue that these aspects should be considered
in the data collection phase in order for the system to perform well in different environments
and circumstances. When adding images with varying lighting conditions to the training
set, we saw a slight deterioration in performance. However, we recommend further research
due to limitations in the method.

This thesis focused on examining the impact of training data on an Azure Custom Vision
model. The best performance measured gave a mAP of 92.33 %, which we believe is promis-
ing. This result was achieved with a minimal number of training images, and we see
potential in improving the model further by increasing the amount of training data.
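Since the conclusions rest on the mAP figure, a brief illustration of the metric may help. The sketch below computes non-interpolated average precision for one class from a confidence-ranked list of predictions; mAP is the mean over classes. Custom Vision's built-in evaluation may differ in details (it applies its own overlap and probability thresholds), so this illustrates the metric rather than the exact computation behind the 92.33 %.

```python
def average_precision(ranked_hits, total_positives):
    """Average precision for one class.
    ranked_hits: booleans sorted by prediction confidence, True where the
    prediction matched a ground-truth box (e.g. IoU above the threshold).
    AP is the sum of precision values at each true positive, divided by
    the number of ground-truth positives (so missed objects lower the AP)."""
    tp = 0
    precisions = []
    for i, hit in enumerate(ranked_hits, start=1):
        if hit:
            tp += 1
            precisions.append(tp / i)  # precision at this recall point
    if not precisions:
        return 0.0
    return sum(precisions) / total_positives

def mean_average_precision(per_class_aps):
    """mAP is simply the mean of the per-class AP values."""
    return sum(per_class_aps) / len(per_class_aps)
```

For example, three ranked predictions of which the first and third are correct, with two ground-truth objects, give AP = (1/1 + 2/3) / 2 ≈ 0.83.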

Finally, the thesis aimed to investigate privacy concerns and employees' attitudes regard-
ing the system. From the literature, we identified variables that affect people's attitudes
towards electronic monitoring. The literature showed that the perceived purpose, transparency
and communication can affect how people feel about monitoring systems. It also showed that electronic
monitoring can lead to reduced trust and privacy concerns if implemented irresponsibly or
without communicating the objectives of the monitoring. These findings were further strength-
ened by interviews held with employees at Holmen, which showed that employees would be
more positive if the cause of the system was justified. Information and communication were
furthermore highly important in the introduction of a monitoring system. Privacy
was highly valued, and not being identifiable in images was preferred. We argue that the
system should alarm on a general level, without identifying the concerned person, to avoid
privacy intrusion. As a starting point for Holmen, we suggest utilizing the system for statis-
tics to gain an understanding of the extent of the problem, as this was an important takeaway
from the interviews. To implement a monitoring system for detection of safety equipment
at Holmen, collected data and monitoring practices must comply with the GDPR
and the Camera Surveillance Act.

Bibliography

[1] Manal El Aidouni. Evaluating Object Detection Models: Guide to Performance Metrics. 2019.
URL: https://manalelaidouni.github.io/manalelaidouni.github.io/Evaluating-Object-Detection-Models-Guide-to-Performance-Metrics.html (visited on 03/22/2021).
[2] G Stoney Alder. “Ethical issues in electronic performance monitoring: A consideration
of deontological and teleological perspectives”. In: Journal of Business Ethics 17.7 (1998),
pp. 729–743.
[3] Bradley J Alge. “Effects of computer surveillance on perceptions of privacy and proce-
dural justice.” In: Journal of Applied Psychology 86.4 (2001), p. 797.
[4] Eduard Barnoviciu, Veta Ghenescu, Serban-Vasile Carata, Marian Ghenescu, Roxana
Mihaescu, and Mihai Chindea. “GDPR compliance in Video Surveillance and Video
Processing Application”. In: 2019 International Conference on Speech Technology and
Human-Computer Dialogue (SpeD). IEEE, pp. 1–6.
[5] Christian Luzter Batugal, Jewel Mark Perry Gupo, Kasandra Kimm Mendoza, Adonis S
Santos, Francis A Malabanan, Jay Nickson T Tabing, and Christopher B Escarez. “Eye-
Smell: Rice Spoilage Detection using Azure Custom Vision in Raspberry Pi 3”. In: 2020
IEEE REGION 10 CONFERENCE (TENCON), pp. 738–743.
[6] Carl Botan. “Communication work and electronic surveillance: A model for predicting
panoptic effects”. In: Communications Monographs 63.4 (1996), pp. 293–313.
[7] Marco Javier Flores Calero, Miltón Aldás, José Lázaro, Alfredo Gardel, Noemi Onofa,
and Brayan Quinga. Pedestrian detection under partial occlusion by using logic inference,
HOG and SVM. 09. 2019, pp. 1552–1559.
[8] Bernardo Botelho Antunes da Costa and Pedro Silveira Pisa. “Cloud Strategies for Im-
age Recognition”. In: 2020 4th Conference on Cloud and Internet of Things (CIoT). IEEE,
pp. 57–58.
[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei.
“Deformable convolutional networks”. In: Proceedings of the IEEE international conference
on computer vision. 2017, pp. 764–773.


[10] Venkata Santosh Kumar Delhi, Albert Thomas, et al. “Detection of Personal Protective
Equipment (PPE) compliance on construction site using Computer Vision based Deep
Learning techniques”. In: Frontiers in Built Environment 6 (2020), p. 136.
[11] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. “Pedestrian detection:
An evaluation of the state of the art”. In: IEEE transactions on pattern analysis and machine
intelligence 34.4 (2011), pp. 743–761.
[12] Bill Gillham. Research interview. A&C Black, 2000.
[13] Bill Gillham. Research Interviewing: The range of techniques: A practical guide. McGraw-Hill
Education (UK), 2005.
[14] Christina S Hagen, Leila Bighash, Andrea B Hollingshead, Sonia Jawaid Shaikh, and
Kristen S Alexander. “Why are you watching? Video surveillance in organizations”. In:
Corporate Communications: An International Journal (2018).
[15] Laura P Hartman. Technology and ethics: Privacy in the workplace. 1. 2001, pp. 1–27.
[16] Xiaoyue Jiang, Abdenour Hadid, Yanwei Pang, Eric Granger, and Xiaoyi Feng. Deep
Learning in object detection and recognition. Springer, 2019.
[17] Kamerabevakningslag (2018:1200). 2018. URL: http://rkrattsbaser.gov.se/sfst?bet=2018:1200 (visited on 05/11/2021).
[18] Lag (1976:580) om medbestämmande i arbetslivet. 1976. URL: http://rkrattsbaser.gov.se/sfst?bet=1976:580 (visited on 05/11/2021).
[19] Lag (2018:218) med kompletterande bestämmelser till EU:s dataskyddsförordning. 2018. URL: http://rkrattsbaser.gov.se/sfst?bet=2018:218 (visited on 05/11/2021).
[20] Jianan Li, Xiaodan Liang, ShengMei Shen, Tingfa Xu, Jiashi Feng, and Shuicheng Yan.
“Scale-aware fast R-CNN for pedestrian detection”. In: IEEE transactions on Multimedia
20.4 (2017), pp. 985–996.
[21] Jie Li, Huanming Liu, Tianzheng Wang, Min Jiang, Shuai Wang, Kang Li, and Xi-
aoguang Zhao. “Safety helmet wearing detection based on image processing and ma-
chine learning”. In: 2017 Ninth International Conference on Advanced Computational Intel-
ligence (ICACI). IEEE, pp. 201–205.
[22] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge
Belongie. “Feature pyramid networks for object detection”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2017, pp. 2117–2125.
[23] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti
Pietikäinen. “Deep learning for generic object detection: A survey”. In: International
journal of computer vision 128.2 (2020), pp. 261–318.
[24] Gary T Marx. “Ethics for the new surveillance”. In: The Information Society 14.3 (1998),
pp. 171–185.
[25] Microsoft. What is Computer Vision? 2020. URL: https://docs.microsoft.com/en-ca/azure/cognitive-services/computer-vision/overview (visited on 03/17/2021).
[26] Microsoft. What is Custom Vision? 2020. URL: https://docs.microsoft.com/sv-se/azure/cognitive-services/custom-vision-service/overview (visited on 03/17/2021).
[27] Ma Nan, Chen Li, Hu JianCheng, Shang QiuNa, Li JiaHong, and Zhang GuoPing.
“Pedestrian Detection Based on HOG Features and SVM Realizes Vehicle-Human-
Environment Interaction”. In: 2019 15th International Conference on Computational Intelli-
gence and Security (CIS). IEEE, pp. 287–291.


[28] Nipun D Nath, Amir H Behzadan, and Stephanie G Paal. “Deep learning for site safety:
Real-time detection of personal protective equipment”. In: Automation in Construction
112 (2020), p. 103085.
[29] Richard Van Noorden. The ethical questions that haunt facial-recognition research. 2020.
URL: https://www.nature.com/articles/d41586-020-03187-3 (visited on 05/11/2021).
[30] Keiron O’Shea and Ryan Nash. “An introduction to convolutional neural networks”.
In: arXiv preprint arXiv:1511.08458 (2015).
[31] OM Parkhi, A Vedaldi, and A Zisserman. “Deep face recognition”. In: British Machine
Vision Association, 2015, pp. 1–12.
[32] Matea Pejčinović. “A Review of Custom Vision Service for Facilitating an Image Clas-
sification”. In: Central European Conference on Information and Intelligent Systems. Faculty
of Organization and Informatics Varazdin. 2019, pp. 197–208.
[33] David MW Powers. “Evaluation: from precision, recall and F-measure to ROC, in-
formedness, markedness and correlation”. In: arXiv preprint arXiv:2010.16061 (2020).
[34] Sovit Ranjan Rath. Evaluation Metrics for Object Detection. URL: https://debuggercafe.com/evaluation-metrics-for-object-detection (visited on 05/06/2021).
[35] Andrew K Schnackenberg and Edward C Tomlinson. “Organizational transparency:
A new perspective on managing trust in organization-stakeholder relationships”. In:
Journal of Management 42.7 (2016), pp. 1784–1810.
[36] JoonOh Seo, SangUk Han, SangHyun Lee, and Hyoungkwan Kim. “Computer vision
techniques for construction safety and health monitoring”. In: Advanced Engineering
Informatics 29.2 (2015), pp. 239–251.
[37] Jeffrey M Stanton and Elizabeth M Weiss. “Electronic monitoring in their own words:
an exploratory study of employees’ experiences with new types of surveillance”. In:
Computers in Human Behavior 16.4 (2000), pp. 423–440.
[38] Eugene F Stone, Hal G Gueutal, Donald G Gardner, and Stephen McClure. “A field
experiment comparing information-privacy values, beliefs, and attitudes across several
types of organizations.” In: Journal of applied psychology 68.3 (1983), p. 459.
[39] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Busi-
ness Media, 2010.
[40] Filiz Tabak and William P Smith. “Privacy and electronic monitoring in the workplace:
A model of managerial cognition and relational trust development”. In: Employee Re-
sponsibilities and Rights Journal 17.3 (2005), pp. 173–189.
[41] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. “A-fast-rcnn: Hard positive
generation via adversary for object detection”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017, pp. 2606–2615.
[42] Rattapoom Waranusast, Nannaphat Bundon, Vasan Timtong, Chainarong Tangnoi, and
Pattanawadee Pattanathaburt. “Machine vision techniques for motorcycle safety hel-
met detection”. In: 2013 28th International Conference on Image and Vision Computing New
Zealand (IVCNZ 2013). IEEE, pp. 35–40.
[43] Myria Watkins Allen, Stephanie J Coopman, Joy L Hart, and Kasey L Walker. “Work-
place surveillance and managing privacy boundaries”. In: Management Communication
Quarterly 21.2 (2007), pp. 172–200.
[44] Deborah L Wells, Robert H Moorman, and Jon M Werner. “The impact of the perceived
purpose of electronic performance monitoring on an array of attitudinal variables”. In:
Human Resource Development Quarterly 18.1 (2007), pp. 121–138.


[45] Christian Wolf and Jean-Michel Jolion. “Object count/area graphs for the evaluation
of object detection and segmentation algorithms”. In: International Journal of Document
Analysis and Recognition (IJDAR) 8.4 (2006), pp. 280–296.
[46] Michael Workman. “How perceptions of justice affect security attitudes: suggestions for
practitioners and researchers”. In: Information Management & Computer Security (2009).
[47] Xin Zhang, Yee-Hong Yang, Zhiguang Han, Hui Wang, and Chao Gao. “Object class
detection: A survey”. In: ACM Computing Surveys (CSUR) 46.1 (2013), pp. 1–53.

A Interview questions

Interview questions

Quantitative questions:

1. Organizations should be allowed to collect information about people without their permission.

2. Organizations should be allowed to collect information about people without their knowledge.

3. Employees should be involved in the introduction of monitoring systems.

4. The amount and type of personal information stored by organizations should be strictly limited.

5. What organizations use personal information for should be subject to few controls.

6. Organizations that collect and store personal information should not have the right to give out the information to other organizations.

7. I am not at all satisfied that many organizations store personal information digitally.

8. I feel quite upset about how organizations use personal information about me.

9. I feel somewhat uncomfortable with some of the ways in which the organization collects personal information about me.

10. I would feel somewhat uncomfortable if the organization were to evaluate my actions by collecting personal information about me.

11. I feel quite upset that the organization stores images of me.

12. I feel that electronic monitoring in the workplace is reasonable.

13. The purpose of a monitoring system affects how I feel about it.

Qualitative questions:

14. How do you feel about an image recognition system like this one?

15. What do you think is the intention of the image recognition system?

16. Do you think this image recognition system would make you or others (visitors, suppliers, etc.) wear safety equipment more often?

17. Do you think the image recognition system would change your behaviour? Others'?

18. Do you think the image recognition system can increase safety at the mill?

19. How do you feel about the existing surveillance cameras in the workplace?

20. How did you find out about/how were you informed of the existing surveillance cameras?

21. Do you think there would be any difference between how you feel about the existing surveillance cameras and this image recognition system?

22. What type of data do you think it is okay for the organization to collect?

23. What type of data do you think it is okay for the organization to store?

24. What type of data do you think it is okay for the organization to use?

25. What type of data do you think it is okay for the organization to share?

26. How, and who, should be notified when someone is not wearing the correct safety equipment?
Examples:
● A direct alarm to the person concerned
● A direct alarm to the manager
● A direct alarm that is visible to everyone
● Data collection for statistics

27. Who should have access to the data?
Examples:
● the security guard
● an external company
● the IT department

28. Do you have ideas for further possibilities and areas of use for IT to improve safety in the workplace?

29. What do you consider important when introducing similar systems?