Detection of Safety Equipment in The Manufacturing Industry Using Image Recognition
2021 | LIU-IDA/LITH-EX-A--2021/027--SE
Mimmi Cromsjö
Linn Hallonqvist
Linköpings universitet
SE–581 83 Linköping
+46 13 28 10 00, www.liu.se
Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a
period of 25 years starting from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for anyone to read, to down-
load, or to print out single copies for his/hers own use and to use it unchanged for non-commercial
research and educational purpose. Subsequent transfers of copyright cannot revoke this permission.
All other uses of the document are conditional upon the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work is
accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures
for publication and for assurance of document integrity, please refer to its www home page:
http://www.ep.liu.se/.
© Mimmi Cromsjö, Linn Hallonqvist
Abstract
Safety is an essential part of the paper industry, as the industry can be very hazardous
and accidents can lead to serious injuries for the people involved. In order to mitigate and
prevent accidents, it has been shown that proactive measures are of great value. One type
of proactive measure is the use of Personal Protective Equipment (PPE), such as gloves,
hard hats, safety glasses and reflective vests. Although wearing PPE is often required in a
workplace, compliance is not always guaranteed, and non-usage can affect the safety of workers.
To detect unsafe conditions, such as non-usage of PPE, automated video monitoring with
image recognition can be useful. The intention of this work is to investigate whether an
image recognition model can be created using the cloud service Azure and used in a system
that can detect PPE, which in this work is limited to reflective vests. The work results
in an artifact built around an image recognition model. Additionally, this work examines how
the training data can affect the model’s performance. It is found that the model can be
improved by training the model on images with varying backgrounds, angles, distances,
and occlusions.
While there are many advantages to automated monitoring, its use can raise questions
regarding the privacy of the people being monitored and how it can be perceived in a
workplace. Therefore, this thesis examines the privacy concerns and attitudes regarding an
image recognition system for monitoring. This is accomplished by performing a literature
study and interviews with employees at a paper mill. The results reveal challenges with
systems for automated monitoring as well as factors that can affect how employees feel
about them.
Acknowledgments
We would like to thank the people at Sogeti for their help and encouragement during the
project, especially Jonas Jaani, Jörgen Nybrolin and Anders Persson, who have supported
us through the process. We would also like to thank Richard Haag and Andreas Olofsson
at Holmen AB for the collaboration and the employees at Holmen participating in the in-
terviews. At Linköping University we want to thank our supervisor Johan Blomkvist and
examiner Eva Blomqvist for their guidance. Also, we want to thank Nicklas Silversved and
Madeleine Bäckström for the opposition and feedback on the thesis.
Contents
Abstract iii
Acknowledgments iv
Contents v
List of Figures vi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Theory 4
2.1 Image recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Introducing an image recognition system in a workplace . . . . . . . . . . . . . 14
2.3 Interview techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Method 19
3.1 Implementation of PPE detection system . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Improving the object detection model . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Examining the attitudes and effects of image recognition . . . . . . . . . . . . . 26
4 Results 29
4.1 Performance of different features in the training sets . . . . . . . . . . . . . . . . 29
4.2 Outcome of the interviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Discussion 39
5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 The work in a wider context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Conclusion 46
Bibliography 48
A Interview questions 52
List of Figures
2.1 Objects detected on a street in New York, classified as cars and pedestrians. 5
2.2 An example of object instance recognition. Big Ben with a surrounding bounding
box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 An example of object class detection. A cat with a surrounding bounding box. . . 6
2.4 Confusion matrix of actual and predicted values. . . . . . . . . . . . . . . . . . . . . 8
2.5 An example of a ground truth and a predicted bounding box overlapping. . . . . . 9
2.6 Graphical representation of Intersection over Union (IoU). . . . . . . . . . . . . . . 10
2.7 Definition of TP, FP and FN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Precision-recall curve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
List of Tables
4.1 Metrics for each model with different features included in the training set. . . . . . 29
4.2 Participants A-E’s answers to the quantitative survey. . . . . . . . . . . . . . . . . . 34
1 Introduction
The paper industry as a workplace can be very hazardous and has a high number of acci-
dents. The high accident rate is linked to the nature of the industry with massive weights
and dangerous machines. The safety issues are taken seriously in the paper industry and
have been of high priority throughout time. But as accidents keep happening, there is always
a need of improvement.
As technology evolves, new methods are used to increase the safety in manufacturing indus-
tries. Industry 4.0 is a current trend in the manufacturing industry, and automating and
simplifying repetitive processes with Artificial Intelligence and Machine-to-Machine
communication to reduce costs is a key part of it. Furthermore, finding
ways to increase safety with the use of modern technology is also an area of research. The
use of Virtual Reality (VR) for training and education is one example of how technology can
be used to safely train workers and reduce risks.
In order to mitigate and prevent accidents, it has been shown that proactive measures are of
great value. Proactive measures can be for example safety training for workers to learn and
repeat the safety principles and standard emergency procedure of the workplace. Another
type of proactive measure is the use of Personal Protective Equipment (PPE), such as gloves,
hard hats, safety glasses and vests. Researchers have explored automated ways of monitoring
and detecting unsafe conditions, such as non-usage of PPE in manufacturing and on construc-
tion sites [21, 10]. Because of major advances in the area of machine learning and computer
vision, this is a common approach for such automated monitoring. Cloud services such as
Microsoft Azure, Amazon Web Services, and Google Cloud Platform offer a wide range of
cognitive services, one of which is computer vision. These services provide advanced algorithms
behind an intuitive interface, making it possible to utilize image recognition without
extensive experience in machine learning. Although this could decrease the time and effort
for building computer vision models, further research is needed to explore the performance
and eventual limitations of such models.
While there are many advantages to automated monitoring, its use can raise questions
regarding the privacy of the people being monitored. It is therefore important to consider
ethical aspects when implementing such technology in a workplace. Previous research has
been done studying the acceptance and attitudes of workers towards electronic monitoring
[40, 43, 14]. The findings in the works show how awareness of such technology can affect
the way workers feel about it. There have also been suggestions of different approaches to
increase the level of privacy in monitoring, such as blurring regions like faces that can identify
a person [4].
This work is made in collaboration with the Swedish forest company Holmen. Holmen is
one of Sweden’s largest forest owners, and manufactures wood products, paperboard and
paper. Holmen also produces renewable energy. The collaboration was mainly with the IT
department at Holmen (Holmen IT) and one of their paper mills. Holmen has set the goal
of being the best paper mill in the world by 2025. To achieve this, one focus area is that
Holmen wants their employees to be free from work-related injuries and to feel safe going
to work every day. Therefore, Holmen is interested in gaining knowledge of how accessible
technology can contribute to the occupational safety at their paper mills.
1.1 Motivation
In manufacturing industry environments, such as paper mills, there are often vehicles used
for transportation of goods and humans moving close to each other. A collision between a
human and a vehicle can lead to serious injuries and can even be fatal. To avoid collisions
there are safety instructions stating that everyone inside the area must wear PPE for
protection and/or to be more visible. However, these rules are sometimes missed, overlooked or
forgotten, which can immediately expose the person and others to risk. An image
recognition system could help decrease these risks by identifying people in the scene who are not following
the safety rules.
1.2 Aim
The aim of this thesis is to explore the possibilities of increasing the usage of safety equipment
by using image recognition in order to increase personal safety in a paper mill. This is done
by providing a starting point for Holmen in the implementation of a system where image
recognition is used to detect safety equipment. Furthermore, the aim is to examine how such
a system can be introduced without intruding on the privacy of the employees.
1.3 Research questions
1. How and to what extent can machine learning models be trained to recognize safety
equipment?
2. What are the privacy concerns and attitudes regarding the use of image recognition for
detecting safety equipment?
1.4 Delimitations
Today, the main digital services that Holmen is dependent on are Microsoft services, and it is
also within the range of these services that Holmen wishes to investigate the possibilities of
developing a system for improving their work environment. Furthermore, Holmen wishes
to investigate if a solution can be implemented using cloud services. Therefore, this thesis fo-
cuses on exploring how to detect safety equipment using cloud services within the Microsoft
domain.
2 Theory
In this chapter, theory and background information is presented for the concepts related to
this work.
2.1 Image recognition
Image recognition focuses on interpreting images and identifying different variables, such as
locations, people, and animals. For humans, the task of interpreting what we see in an image
is easy because we know that objects can appear in different shapes, angles and lighting conditions.
The human eye has no trouble distinguishing between for example a cat and a dog, and
understanding that different breeds of dogs are still dogs.
This thesis involves Image Classification and mainly Object Detection. These techniques are
related to image recognition and are described in more detail in the following sections.
networks have multiple layers. Each layer of nodes trains on a set of features; the further you
advance into the network, the more complex the features that can be recognized by the nodes.
A Convolutional Neural Network (CNN) is a type of neural network that is common when
analyzing images, as CNNs assume that the inputs are images [30]. Unlike regular neural
networks, the nodes in a CNN are arranged in three dimensions: height, width and depth.
Figure 2.1: Objects detected on a street in New York, classified as cars and pedestrians.
The area of object detection can be divided into two subgroups: object instance recognition
and object class detection [47]. The first group called object instance recognition aims to
identify previously seen object instances such as Big Ben, seen in Figure 2.2. Object instance
recognition can be seen as a matching problem where the object to be re-identified in a new
image is the same as the previously seen object, apart from changed surrounding conditions
such as location, angle, and lighting. The second group, object class detection,
on the other hand focuses on detecting instances of predefined categories such as humans,
cars, bicycles, and cats, seen in Figure 2.3. The latter subgroup of object detection is a more
challenging task because of the variation that can occur within a category [23]. A car can have
different sizes, colours, and shapes, which means that the look of a car can vary a lot while
still belonging to the same category. When research in the field of object detection began, the
detection of a few categories such as faces was the focus, whereas now the range of categories
is broader in order to support more general detection systems [23].
Figure 2.2: An example of object instance recognition. Big Ben with a surrounding bounding
box.
Figure 2.3: An example of object class detection. A cat with a surrounding bounding box.
A lot of research has been done in the field of detecting faces [31], a technique that for instance
is built into most digital cameras. Another focus area of object detection is the detection of
pedestrians that can be used in autonomous cars. Histogram of Oriented Gradients (HOG) is
a well-known algorithm used for detecting pedestrians [27, 7].
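As a rough illustration of the idea behind HOG (not the full descriptor, which also involves cell grids and block normalization), the orientation histogram of a single cell can be sketched with NumPy; the function name and cell size here are our own choices:

```python
import numpy as np

def cell_histogram(cell: np.ndarray, bins: int = 9) -> np.ndarray:
    """Orientation histogram of one HOG cell, weighted by gradient
    magnitude. Orientations are folded into [0, 180) degrees, as in HOG."""
    gy, gx = np.gradient(cell.astype(float))   # image gradients
    magnitude = np.hypot(gx, gy)               # gradient strength
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=bins, range=(0.0, 180.0),
                           weights=magnitude)
    return hist
```

A vertical edge yields horizontal gradients (orientation near 0°), so its mass lands in the first bin. The full HOG descriptor tiles the detection window into such cells, normalizes the histograms over blocks, and feeds the concatenated vector to a classifier such as a linear SVM.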
When creating a custom-made object detection model, an algorithm or neural network is
needed as well as a labeled training data set. The performance of the model is highly dependent
on the model's robustness, i.e. its ability to predict correctly despite shifting conditions
of light, pose, or the characteristics of the camera [16].
Occlusion is when an object is partly hidden by another object in the image. This is a problem
within object detection, and in [11] it was shown that in street scenes, about 70 % of pedestrians
are occluded in at least one frame of the video. The task of detecting objects despite occlusion
can be of great importance for computer vision applications. Similar to street scenes, activities
in the environment of a paper mill often involve numerous workers appearing in pairs
or groups. Additionally, workers in a paper mill can be surrounded by vehicles and equipment,
which can generate a level of occlusion. A possible solution to occlusion problems
is to include more training data where occlusion appears [41]. In that way, the model can learn the
invariance caused by occlusion. However, some occlusions are common while others are rare
and almost never happen, which makes it nearly impossible to provide images that cover all
cases of occlusion.
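One way to add occluded examples without collecting new images is synthetic occlusion augmentation, in the spirit of "random erasing": a random rectangle of the training image is overwritten with noise. The sketch below is illustrative and not taken from the cited work:

```python
import numpy as np

def random_erase(image: np.ndarray, rng: np.random.Generator,
                 min_frac: float = 0.1, max_frac: float = 0.3) -> np.ndarray:
    """Return a copy of `image` with one random rectangle replaced by
    noise, simulating a partial occlusion during training."""
    h, w = image.shape[:2]
    eh = int(h * rng.uniform(min_frac, max_frac))   # erased height
    ew = int(w * rng.uniform(min_frac, max_frac))   # erased width
    y = int(rng.integers(0, h - eh + 1))
    x = int(rng.integers(0, w - ew + 1))
    out = image.copy()
    out[y:y + eh, x:x + ew] = rng.uniform(0, 255,
                                          size=out[y:y + eh, x:x + ew].shape)
    return out
```

Applied with some probability to each training image, this forces the model to rely on more than a single visible part of the object.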
Scale variation refers to the challenge of detecting objects in various sizes, in particular very
small objects. According to Li et al. [20], small-size objects are difficult to detect for two
reasons: first, small instances often occur with "blurred boundaries and obscure appearance" [20,
p. 985], making them difficult to distinguish from the background; second, large-size objects
usually differ greatly in appearance from small-size instances. Different solutions to this
problem have been proposed, such as using feature pyramids [22, 23] or using different
sub-networks trained on small and big instances respectively [20]. One way of improving
the performance is to build the training sets with sufficient variations [9].
Another difficulty is to detect objects with varying shapes due to the angle or viewpoint from
where the object is seen. An object can look different when seen at the same height as the
object compared to when it is seen from above. To address this problem, adding images from
different viewpoints to the training set has been shown to generate a more robust model [36].
model, it is necessary to evaluate the performance of classifying and localizing the object,
which makes it a two-part problem [45]. An image classification model can only classify an image
correctly or incorrectly while an object detection model can have detected an object at the
right location, but classified it incorrectly and vice versa.
When working with machine learning algorithms, cross validation is a commonly used tech-
nique in which the model is trained using a subset of the data set and then evaluated using
the complementary subset of the data set. Traditionally, these algorithms are evaluated us-
ing measures such as accuracy, recall, and precision. These metrics are based on the number
of correct and incorrect predictions that a model does on a test set. For each prediction, the
predicted value and the actual value can be represented in a confusion matrix, seen in Figure
2.4. If an algorithm correctly predicts a positive class, it is a True Positive (TP). If an algorithm
correctly predicts a negative class, it is a True Negative (TN). If an algorithm incorrectly pre-
dicts a positive class, it is a False Positive (FP). If an algorithm incorrectly predicts a negative
class, it is a False Negative (FN). In object detection, a true negative corresponds to any part
of the image where there is no object; this is not useful and is therefore ignored.
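The cross-validation idea mentioned above can be sketched as a plain k-fold index partition, where each fold serves once as the held-out evaluation set (a minimal illustration, independent of any particular library):

```python
def k_fold_splits(n_samples: int, k: int = 5):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs the remainder so every sample is tested once.
        end = n_samples if fold == k - 1 else start + fold_size
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test
```

In practice the data is shuffled before splitting, and the evaluation metrics are averaged over the k held-out folds.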
Accuracy describes how often an algorithm classifies a data point correctly. Accuracy is
defined as the number of true positives and true negatives divided by the number of true
positives, true negatives, false positives, and false negatives. A model that achieves 100 %
accuracy classifies every sample correctly. Accuracy is calculated as follows:
A = (TP + TN) / (TP + TN + FP + FN)
Since accuracy is calculated using true negatives, and true negatives have no useful meaning
in object detection, accuracy is not used to evaluate an object detection model. Instead, the
measures recall and precision are used.
Recall is a measure of how many actual positives that are correctly predicted as positive. The
importance of the recall measure varies depending on the use case. In the context of medical
applications recall can be seen as the primary measure because the aim is to find all actual
positives [33]. Recall is calculated as follows:
R = TP / (TP + FN)
Precision is similar to accuracy but focuses only on the predictions of the positive class. Pre-
cision denotes the proportion of predicted positives that are correct predictions. Precision is
calculated as follows:
P = TP / (TP + FP)
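The two formulas translate directly into code; the counts in the example are hypothetical, chosen only to illustrate the measures:

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical vest detector: 8 of 10 vests found (8 TP, 2 FN),
# plus 4 false alarms (4 FP).
print(recall(8, 2))     # 0.8
print(precision(8, 4))  # 0.6666666666666666
```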
Figure 2.5: An example of a ground truth and a predicted bounding box overlapping.
IoU is a measure of the extent to which a detected bounding box and its ground-truth box
overlap. An IoU threshold is often predefined and used to determine if a detection is
classified as a true positive. The IoU is calculated as follows:
IoU = Area of intersection / Area of union
If the IoU threshold is set to 0.5, and the IoU of a prediction is ≥ 0.5, it is counted as a true
positive. A true positive means that an object has been found and that it has been correctly
classified, see Figure 2.7 a. If the IoU is < 0.5, it is a false positive, which means that an object
has been incorrectly detected. Another type of false positive is when an object is detected but
labeled incorrectly, see Figure 2.7 b. If the model does not detect an object when it should,
meaning that it does not draw a bounding box around the object, it is a false negative, like
the dog in Figure 2.7 c.
(a) Both objects are detected and classified correctly, therefore this image has two true positives.
(b) The dog is predicted as a cat and is therefore a false positive. The cat is correctly predicted and is therefore a true positive.
Figure 2.7: Definition of TP, FP and FN. Source: Adapted from [34].
With the TP, FN and FP, the recall and precision of an object detection model can be calculated
as above. From the calculated recall and precision, a precision-recall curve can be computed,
see Figure 2.8. This curve describes the trade-off between precision and recall. If the recall is
high and the precision is low, it means that the model has done many detections, but most
of the detections are incorrectly labeled. If the model has high precision and low recall, not
many detections are done but most of the detections are correctly labeled. The precision-recall
curve should ideally have high precision as recall increases for the model to perform well [1].
For evaluating object detection models, it is common to use the metric Average Precision
(AP) or mean Average Precision (mAP). The AP can be calculated from the area under the
precision-recall curve. mAP is the average of the AP calculated over all classes and/or IoU
thresholds. AP and mAP are often used interchangeably [1].
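To make the relationship between the curve and AP concrete, the area can be approximated numerically with the interpolated-precision "envelope" used in common benchmark evaluations; this sketch assumes the recall values are already sorted in increasing order:

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve, using the
    interpolated-precision convention: precision at each recall level is
    the maximum precision at any equal or higher recall."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Precision envelope: make the precision curve monotonically decreasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum precision * recall step wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then simply the mean of such AP values over all classes (and, in some conventions, over several IoU thresholds).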
The accuracy for these different conditions was 84 %, 68 % and 74 % respectively. It was
found that most errors occurred for Condition 2, when the objects were further away. This
indicates, according to the authors, that the low resolution of the input images plays an
important role in the accuracy of the system. Another error found in the study was that when
objects overlapped with each other, the system treated them as a single object. These
findings can be of interest when evaluating the prototype in this study.
In the article written by Li et al. [21], the authors propose a method for detecting safety hel-
mets on humans. The background to the work is to ensure the wearing of a helmet at power
substations as the number of accidents has increased. In the power substation, a camera is
installed at a fixed location, which means that the background never changes. The method
for detecting whether people wear helmets is divided into three phases. First, objects in motion
are recognized by the ViBe background modelling algorithm. Then, HOG feature extraction
and an SVM (Support Vector Machine) classifier are used to locate humans. Finally, by
focusing on the top region of each detected human, a colour feature is used to recognize safety
helmets. The authors evaluate the system by using real-time surveillance videos from the
location. In the analysis, they show that the system manages to detect not only one person
but several in the same image. Furthermore, they collected ten video streams and estimated
the accuracy by counting the detection results of people with and without helmets.
The accuracy was calculated to be 80.7 %. The authors conclude that some improvement is
needed when having varying light conditions in the scene. When developing our system we
will keep in mind that varying light conditions must be taken into account when training the
algorithm.
In [10] by Delhi et al., a similar use case as the one presented in this thesis is studied. The
authors present a framework which predicts if a person is wearing a hard hat and safety
jacket. By implementing a model using a CNN called YOLOv3 the framework outputs 4
classes: NOT SAFE, SAFE, NoHardHat and NoJacket. An important part of the development
of the model is the data collection in order to be able to train and evaluate the model. In total,
a data set of 2500 images was used, gathered both by manual collection and by scraping
images online using web crawlers. The model was evaluated on a test data set and achieved
an average precision of 97 %. Additionally, the system was developed to generate alarms
and reports when a safety risk is detected.
In a study by Nath et al. [28], the authors present three deep learning models to verify PPE
compliance of workers in real time from a video stream. The models verify whether the workers
on a construction site are wearing hard hats, vests or both. The study compares three differ-
ent approaches which make up the models. The first approach is to detect workers, vests and
hard hats and then a machine learning model verifies if each worker is wearing the equip-
ment properly. The second approach is to detect a worker and simultaneously verify PPE
compliance. The third approach is to first only detect the worker and then crop the image in
the area of interest, after which the cropped image is classified by CNN-based classifier. This
classifier verifies if the worker is wearing a vest, a hard hat or both. All models are trained
on the same data set containing approximately 1500 annotated images with 4700 occurrences
of workers. The data set is produced using crowd-sourcing and web-mining images. The
crowd-sourcing images are collected from three different construction sites while the web-
mined images are gathered from the web using search engines like Google. The authors
recommend the second approach, as it performed best and achieved the most accurate detection
of PPE, with an mAP of 72.3 %. This thesis investigates a similar problem, and the results by
Nath et al. can be of value when selecting an approach for detection of safety equipment.
In this study, the Computer Vision platform provided by Microsoft Azure will be used to train
a model for object detection. The architecture and the underlying algorithms are hidden, and
it is therefore necessary to treat our model as a black box. The algorithms' structure cannot
be changed or affected; the model can only be influenced by training it with your own
data. This means that control is limited to the inputs and outputs of the black box.
Azure Custom Vision [26] is a platform that, in contrast to the Computer Vision API, lets you
train your own image detection model. The service can be managed from the Azure portal
on the web. Azure Custom Vision has two features: image classification and object detection.
Image classification gives one or more labels of what is in the image while object detection
in addition provides the coordinates of where the object can be found in the image. When
creating an image classifier, the user uploads two groups of images, one with the features of
interest and the other without. The user then labels every image. When creating an object
detection model, only images with the object(s) of interest are needed. Custom Vision uses
machine learning algorithms to analyze the labeled images and train a model. The model's
performance is validated in the portal by evaluating its accuracy on the same images
that are used for training the model. Evaluating the model on the training data set does not
follow the cross-validation practice described in Section 2.1.4 and will therefore not be used in this
thesis. When the model is trained, the user can test, re-train and, when satisfied, use the model
to classify new images. When creating a model in Azure Custom Vision, it is recommended to
have a data set containing at least 50 annotations per label. As an alternative to deploying the
model online at Azure, the model can be exported from Azure Custom Vision in the formats
ONNX, CoreML, TensorFlow, Docker image and Vision AI Development Kit.
Previous research has used Azure Custom Vision for classification problems with good re-
sults. Batugal et al. [5] investigate if Azure Custom Vision can be used for detecting rice
spoilage before consumption. In the study, two data sets were gathered: spoiled rice and
non-spoiled rice. For training the model, 500 images per label were used. The performance
of the model was measured at 85 % accuracy. Pejčinović [32] gives an overview of Custom
Vision and researches its features and possibilities in the field of traffic sign classification. In
the research a model in Custom Vision was trained in three iterations with the training data
set improved in between. The model’s performance was evaluated with the metrics accuracy,
recall and F1-score. The first iteration of the model was trained on a poorly balanced training
set: one class had 113 images while another had only 19. This led
to low accuracy for the classes with a low number of images in the training set. To improve
the performance, more iterations were done. In the third training iteration, images with
different backgrounds, lighting and camera angles were used. This iteration was more balanced
and included 100 images for every class. The result was a precision of 81.6 %, recall of 84.7 %
and F1-score of 83.1 %.
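As a small worked check of those figures: the F1-score is the harmonic mean of precision and recall, and plugging in the reported precision and recall indeed reproduces the reported F1:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported values from the traffic-sign study: P = 81.6 %, R = 84.7 %.
print(round(f1_score(0.816, 0.847), 3))  # 0.831, i.e. the reported 83.1 %
```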
2.2 Introducing an image recognition system in a workplace
In GDPR it is stated that processing of personal data is not legal without a legal justification.
Legal justification can be consent from the subject, who has agreed to the data processing, or
it can be legitimate interest. The data processing can be done without consent if the
interest for which the data is processed outweighs that of the subject and if the processing is
necessary for the given purpose. The purpose sets the boundaries for what is allowed, such as what
information can be processed and how long it can be saved. The purpose of the inspection
may be monitoring for safety or technical reasons.
GDPR states that the subject of data processing must always be informed about the legal jus-
tification on which the personal data is collected. Personal data is only allowed to be collected
for the specifically stated and justified purpose. Therefore, the purpose of the data processing
has to be clarified and informed before the collection takes place. It is not allowed to process
more personal data than is necessary to fulfill the purpose. In the case of surveillance, this
could be interpreted to mean that the cameras should cover as small an area as possible. GDPR
also states that personal data should be deleted when the data is no longer needed.
According to GDPR, the collector of personal data must ensure that the data is protected and
that unauthorized persons cannot access it. Therefore, technical and organizational security
measures should be used such as firewalls, encryption and anti-virus protection.
In GDPR, anonymization is defined as data that is anonymous in such a way that the sub-
ject cannot be identified. If the data is anonymous, it is no longer in the scope of GDPR. To
anonymize data in an application for facial recognition, identification and person detection, Burileanu et al. [4] present a solution where a blur filter is applied to faces. This makes
the people in the images unidentifiable.
When performing surveillance at a workplace in Sweden, the above mentioned laws have
to be followed because people can be identified in the images. If faces cannot be seen in the
image, some argue that GDPR does not apply. However, in a workplace where only a limited number of people, namely the employees, appear, anonymization through blurring faces may not be enough. If people can be identified based on their clothes, the occasion or similar cues, larger interventions are needed, such as blurring the whole person. In addition,
according to the Co-determination act [18] (Förhandlingsskyldighet), there is an obligation to
negotiate when performing surveillance at a workplace. This means that the employer must
negotiate with the affected workers' organization regarding the decision to introduce camera
surveillance at the workplace.
The concept of privacy can be defined as the ability to control what information others can
access about you [15]. People may not want others to get information that is considered
personal or sensitive. Privacy is considered a fundamental human right in the majority of
countries in the world. It is obvious that privacy is highly valued and invasion of privacy can
therefore be considered unethical. As previously discussed, laws like GDPR exist
in order to protect people’s private information. But something that should be kept in mind
is that what is considered an invasion of privacy can vary a lot between different people. In
the context of video surveillance, recording and saving images of employees where people
can be identified can be considered an invasion of privacy. The location of a camera and
what area is covered by the camera view can also vary, and people may feel differently about being monitored depending on the location. It might for instance be different
to be filmed in a passage than in the lunch area at a workplace. A large part of these issues
are covered by GDPR, but even if a surveillance system is approved by law, people may still
feel that their privacy is intruded on.
In order to understand how monitoring and surveillance systems affect employees, studies
have been carried out where employees have been interviewed or participated in surveys.
From these studies, there have been several important findings that point to important as-
pects when implementing a surveillance system.
Previous research has studied whether the perceived goal of a system, how it is going to
be used and what purpose it fulfills influence how employees feel about it. In an article by
Stanton and Weiss [37], the authors explore experiences and ideas regarding electronic mon-
itoring such as email and website monitoring. The study has a qualitative, exploratory and
generative approach, with the goal to elicit thoughts and new perspectives on the impacts
of electronic monitoring. To examine these aspects, a survey is conducted where the results
suggest that some were not disturbed by monitoring systems, and some even appreciated the systems. Participants that were not bothered by the monitoring believed that nothing was
being done with the information while the ones that were bothered expressed that they were
unsure about how the information was used or how important the management considered
it to be. Among participants that were monitored mainly for security reasons, where the monitoring did not track their work performance, 8 out of 9 reported not being disturbed by
the monitoring. This suggests that the attitudes about monitoring may depend on how the
monitoring information is being used. The article is from 2000, and even though electronic
monitoring has changed in 20 years, we argue that the findings are relevant to consider as
the monitoring techniques examined are still relevant and used today. The same motivation
applies for an article by Wells et al. [44] from 2007 which shows similar results. In this work,
relationships between the perceived purpose of Electronic Performance Monitoring (EPM)
and job attitudes are studied. From the studied literature, the authors state hypotheses regarding how different perceptions of the purpose relate to different job attitudes. These job attitudes
are: the perceived system fairness, job satisfaction, organizational commitment and felt obli-
gation to reciprocate. The authors test two different perceived purposes: 1. to develop appropriate behaviour and 2. to deter negative behaviour. The hypotheses are that the
first perception of the purpose will be positively and significantly related to the job attitudes.
Similarly, the second perception of the purpose will be negatively and significantly related to
the job attitudes.
The authors conducted a survey which confirmed all hypotheses and indicated that the perceived purpose has an impact on the different job attitudes. The study concluded that the em-
ployees were more positive towards an electronic performance monitoring system if the per-
ceived purpose was for their own development rather than deterring negative behaviour.
For the system in this thesis, the presented results indicate that it is important to communicate
what the information is used for and that the purpose of the system is to increase safety and
not to monitor work performance. Furthermore, they suggest that the purpose should be communicated
as development of appropriate behaviour, encouraging and reminding employees to use the correct safety equipment, rather than deterrence of failing to wear it.
While video surveillance can be beneficial for organizations, it can lead to uncertainty among the employees, which affects their attitude towards the system. The uncertainty can emerge if
the employees do not trust their employer's use of personal information or if they are uncertain about the cause of the monitoring. Botan [6] claims that one panoptic effect of surveillance is increased uncertainty amongst employees when they do not understand why they are being monitored and do not see the logic behind it. In addition, Botan says that uncertainty increases when the actor is unsure of when they are monitored and how the
gathered information is used. Tabak and Smith [40] state that as many as a fifth of all major
U.S. companies secretly carry out electronic monitoring of their employees. Tabak and Smith
propose that secret monitoring will lead to uncertainty amongst the employees, who may think that the monitoring is not carried out in the right way and may show dissatisfaction with the management.
As stated earlier, the perceived purpose can affect the attitudes of employees, which will be
more negative if the employees lack knowledge or trust regarding the purpose of a system.
Therefore, to obtain positive attitudes, transparency and communication can be of great im-
portance. Schnackenberg and Tomlinson define transparency as "the perceived quality of
intentionally shared information from a sender" [35, p. 5]. In the context of video surveil-
lance in a workplace, the definition can be interpreted as the perceived amount and quality
of information that is communicated to employees regarding the monitoring practices. Ac-
cording to Hagen et al. [14], transparency when performing surveillance occurs in four stages. First, when deploying a camera, its visibility can be handled in different ways: the camera can be placed visibly or not, and it can be accompanied by signs making it clear that cameras are present. Second, the extent to which the cameras will be recording can be communicated or made visible; the cameras can record continuously or only under special circumstances. Third, there can be transparency and openness about the use, storage and processing of the footage, including whether the footage is kept internally in the organization or shared with others. Fourth, there can be transparency about the circumstances under which information from the footage can be used. If the organization is transparent in all four stages, it is more likely that employees will understand the purpose, and the risk of employees feeling uncertainty can be reduced.
Researchers have explored how transparency affects attitudes. Alder [2] argues that communication in both system design and implementation is the key to an ethical electronic performance surveillance system. Involving the people who are going to be monitored in the system design and informing employees about the monitoring practices will affect their attitudes. While Alder reasons about performance surveillance, and the article is from 1998, we argue that this can also be applied to today's monitoring systems for safety purposes, like the
system in this thesis. Similarly, a paper by Workman [46] found that transparency has a large impact on attitudes when performing surveillance. In the article, the author
summarizes the results of a previous field study on the mitigating influence of justice perceptions. The study was performed at a multinational company based in the USA and
included 387 participants. The author states that even if surveillance can help improve security, it may come with a risk of employees acting in protest. Workman found that
employees seemed to be more accepting when they felt that there was transparency in how
the monitoring was performed, when the purpose of the monitoring was communicated, and when the personal information handled was protected.
Additionally, Alge [3] states that employees seem more likely to accept being monitored in the workplace when they are also given the opportunity to participate and give input into the design and implementation of a monitoring system. Participation enhances the feeling of control over how personal data is used and therefore reduces the perceived invasion of privacy.
2.3 Interview techniques
When arranging an interview, its form and style are determined by its purpose. The main attribute when forming an interview is how structured it is. The most structured forms of
interviews can be used when the subject is known. The questions in a structured interview
are straightforward and specific because the interviewer knows what answers are needed to
fulfill the purpose of the interview. This interview form is common in market research where
the interviewer wants to find out about buying habits. A structured interview requires a lot
of preparation by the interviewer in order to elicit the needed information. When performing a structured interview, no unexpected discoveries are made, and a risk with the structured
interview approach is to miss important information. [12]
A semi-structured interview has the important attribute of being flexible. It is partly structured and needs a high level of preparation. A semi-structured interview means that all interviewees get the same initial questions. The questions can then spontaneously be followed by supplementary questions to ensure coverage if some area was not fully addressed. The questions are open and invite the interviewee to think freely. [13]
An unstructured interview leaves the responsibility of leading the way to the person being
interviewed. Unstructured interviews can be used in an initial phase when the researcher
wants to find out what needs to be investigated. When using unstructured interviews, it is common that the researcher does not know what they will find and therefore does not know what to ask. [13]
3 Method
This chapter provides information regarding how the work in this thesis was performed.
This includes a sample system which implements image recognition for detecting protective
equipment, investigating the impacts of different training sets to improve the object detection
model, and interviews with employees at Holmen to explore feelings and attitudes about
image recognition systems for monitoring.
3.1 Implementation of PPE detection system

Wearing the correct safety equipment can mean wearing a helmet, a reflective vest and safety glasses; in other cases it can mean wearing only a helmet. Since this creates many combinations and variations that the system would need to handle, we have limited the correct safety equipment to mean wearing a reflective vest. The system is therefore only required to detect a person with and without a reflective vest.
3.1.1 Prerequisites
The hardware utilized to run the system, including the real-time processing of the model, consists of an Intel Core i7-7600U CPU @ 2.80 GHz with 8 GB of RAM. The model
is created in Microsoft Azure’s service Custom Vision, where an account is required. The
basic environment consists of .NET Core, C# and OpenCV on a Windows 10 Operating Sys-
tem (OS). Visual Studio 2019 is used for development. For video capturing, a web camera
with USB connection is used. The mentioned setup is based on documentation available for
utilizing Azure Custom Vision.
In the following parts, the subparts of the system are described in more detail.
Training application
The training application takes a video stream from the connected camera as an input. Images
are then retrieved by extracting frames from the video stream. The app captures images from the stream at an adjustable rate. The images are uploaded to the target Custom Vision resource in Azure. This procedure is repeated until the target number of images is reached, which by default is set to 30. While the app is capturing and uploading images, the camera can be exposed to the types of objects that the model should be trained on.
When the app has finished uploading images to Custom Vision, the images can be found in
the Custom Vision resource.
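The thesis does not reproduce the capture loop itself; as an illustration only, the rate-based frame sampling could be sketched as follows (the function name and parameters are our own, not taken from the actual application):

```python
def frames_to_capture(stream_fps: float, capture_rate: float, target: int = 30):
    """Return the indices of frames to keep when sampling a video stream.

    stream_fps:   frames per second of the camera stream (assumed value)
    capture_rate: desired number of captured images per second
    target:       stop after this many images (the app defaults to 30)
    """
    step = max(1, round(stream_fps / capture_rate))
    return [i * step for i in range(target)]

# Sampling 1 image/s from a 30 fps stream keeps every 30th frame.
print(frames_to_capture(30, 1, target=3))  # → [0, 30, 60]
```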
An alternative to using the training application is to upload images to Custom Vision directly
from a local folder through the Custom Vision web portal.
The images that the model is going to be trained on are uploaded to Azure Custom Vision
through the training application. As mentioned, it is not necessary to use the training application to upload images to Custom Vision; they can also be uploaded directly from the local file system. For each image, objects are manually tagged by drawing a box around the
object, the ground truth, as in Figure 3.2. Only objects of interest are tagged and labeled with
the class they belong to. When drawing the boxes, we are meticulous about drawing each box closely around the object to avoid background noise, which can affect the model negatively. Images that do
not contain any of the objects of interest are not used to train the model. When the labeling is done, the model is trained in Custom Vision, after which it is ready to be used for prediction.
The model can be exported as an ONNX file to be used locally.
Scoring Application
When the ONNX model has been downloaded, it can be used for predicting. In this state, the
application captures images from the video stream in the same manner as described for the
training application. The model evaluates each image and outputs the prediction including
all found objects in the image and the location of each object's bounding box. For prediction,
a probability threshold can be set. This defines the lowest probability that a found object can
have to be counted as detected. The result of the prediction of each image can be shown on a
display by drawing the bounding boxes onto the images with the name of the label and the
probability attached. Because of the speed of the system, the result shown on the screen is
close to a real-time video with detection of objects. The application also sends the result as a
message to the Event Hub in Azure. If the image contains any objects, the class label will be
included in the message.
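As an illustration, the thresholding and message-building step might look like the following sketch (the prediction fields and message layout are our assumptions; the actual Event Hub message format is not specified here):

```python
import json

def build_message(predictions, threshold=0.5):
    """Keep detections at or above the probability threshold and
    serialize their class labels for the outgoing message."""
    kept = [p for p in predictions if p["probability"] >= threshold]
    return json.dumps({"labels": [p["label"] for p in kept]})

# Hypothetical model output: one confident and one low-probability detection.
preds = [
    {"label": "PersonWithVest", "probability": 0.91,
     "box": {"left": 0.1, "top": 0.2, "width": 0.3, "height": 0.6}},
    {"label": "PersonWithoutVest", "probability": 0.32,
     "box": {"left": 0.5, "top": 0.2, "width": 0.3, "height": 0.6}},
]
print(build_message(preds))  # → {"labels": ["PersonWithVest"]}
```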
3.2 Improving the object detection model
Table 3.1: Composition of the training sets.

Features             #PersonWithVest  #PersonWithoutVest
Baseline                          50                  50
Lighting conditions               25                  25
Background                        25                  25
Angle                             25                  25
Distance                          25                  25
Occlusion                         25                  25
A consideration made when collecting the data is to vary the appearance of the people in the images. This is done by changing clothes, accessories etc., to avoid the model associating, for example, a specific color of pants with wearing safety equipment. It is important
that the model identifies the vest specifically and not other features.
When creating data sets for training a model it is important that the occurrences of each class
label in the training set are balanced. In this project, that means having a similar number of people with and without reflective vests included in the training sets.
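Such a balance check can be done mechanically; a minimal sketch (a hypothetical helper, not part of the thesis tooling):

```python
from collections import Counter

def class_balance(tags):
    """Count how many times each class label occurs in a training set."""
    return Counter(tags)

# The baseline set: 50 tagged people with vests and 50 without.
tags = ["PersonWithVest"] * 50 + ["PersonWithoutVest"] * 50
counts = class_balance(tags)
print(counts["PersonWithVest"], counts["PersonWithoutVest"])  # → 50 50
```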
Baseline
The images collected for the baseline were simple, with the intention of being easy for the model to predict. People in the scene were between 4 and 6 meters from the camera, which means that the people in the images are always in full-body perspective and clearly visible. The camera was placed at eye level with the people. The background was static, with few disturbing objects. The images were taken in daylight. The people in the pictures did not overlap each other, so no occlusion occurred. An example image from the baseline data set can be seen in Figure 3.3a.
Lighting conditions
When creating the data set for different lighting conditions, images were collected in the same way as described for the baseline. The images were then post-processed: the light was adjusted using a photo-editing program and lowered to resemble a dark environment, which can be seen in Figure 3.3b.
Background
The images collected with varying backgrounds were taken in three different locations. The locations differed from each other and were more cluttered than the background in the baseline. An example image from one of the three locations can be seen in Figure 3.3c. When varying the background, the placement of the camera relative to the people in the scene was the same as in the baseline.
Angle
To capture images of people from a different angle, the images were taken from a height of 4
meters. In Figure 3.3d an image from the data set can be seen. The location was otherwise
the same as the baseline.
Distance
Images were collected of people placed 60 to 80 meters from the camera. In the images the
people covered approximately 5 % of the image height. In Figure 3.3e an image from the data
set can be seen where a person with vest is located 80 meters from the camera.
Occlusion
For the data set with occlusion, the setup was the same as the baseline. When extracting the
images from the video, only images where occlusion occurred were chosen for the data set.
Only occlusion where a person covered another person was used. Occlusion appeared in different shapes, one of which can be seen in Figure 3.3f; occurrences where more than 50 % of the person was covered were dismissed.
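The 50 % criterion can be made precise as the intersection area between the occluder's box and the person's box, divided by the area of the person's box. A sketch, assuming axis-aligned boxes given as (left, top, width, height):

```python
def covered_fraction(person, occluder):
    """Fraction of `person`'s bounding box area covered by `occluder`.
    Boxes are (left, top, width, height) tuples."""
    l1, t1, w1, h1 = person
    l2, t2, w2, h2 = occluder
    iw = max(0, min(l1 + w1, l2 + w2) - max(l1, l2))  # intersection width
    ih = max(0, min(t1 + h1, t2 + h2) - max(t1, t2))  # intersection height
    return (iw * ih) / (w1 * h1)

# An occluder covering the right half of the person's box gives 0.5,
# which would be the limit for keeping the image in the data set.
print(covered_fraction((0, 0, 10, 20), (5, 0, 10, 20)))  # → 0.5
```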
Test set
From the collected images, some were picked out for a test set. The test set contained images with varying lighting conditions, backgrounds, angles, distances and occlusions, as well as images with characteristics like the baseline data set. The images in the test set were not
included in any of the training sets. The composition of the test set can be seen in Table 3.2.
Table 3.2: Composition of the test set.

Features             #PersonWithVest  #PersonWithoutVest
Baseline                          16                  14
Lighting conditions               10                  10
Background                        10                  10
Angle                             10                  10
Distance                          10                  10
Occlusion                         13                   7
Total                             69                  61
In total there were 69 tags for the PersonWithVest class, 61 tags for the PersonWithoutVest
class and 81 images in the test set.
3.2.2 Training
The training of the model was done in iterations, visualized in Figure 3.4. In the first iteration,
the training set was the baseline and for each following iteration, the baseline was combined
with one of the training sets with different features taken into account, as described in Table
3.1. In each iteration, the model was evaluated. By adding the feature training sets one by one
and evaluating the performance of the model against the baseline, the effects of the different features could be measured separately.
Figure 3.4: Iteration loop for training and evaluating the model for features.
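The iteration scheme above can be expressed as a sketch: every candidate training set is the baseline plus exactly one feature set, so the effect of each feature can be attributed separately (set names are shorthand for the sets in Table 3.1):

```python
# Feature training sets from Table 3.1 (names are shorthand).
baseline = ["baseline"]
feature_sets = ["lighting", "background", "angle", "distance", "occlusion"]

# One training configuration per iteration: baseline plus a single
# feature set, so each feature's effect can be measured in isolation.
iterations = [baseline + [f] for f in feature_sets]

# The final model combines the baseline with all feature sets.
combined = baseline + feature_sets

print(iterations[0])  # → ['baseline', 'lighting']
print(len(combined))  # → 6
```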
3.2.3 Evaluation
To evaluate if the model improves, the performance needs to be measured. In object detection,
this requires having one or several test sets along with evaluation metrics. It is common to
calculate the mAP of a model to measure its performance. In this part, details of how the
mAP was computed are described.
As mentioned, the model was evaluated each time a new set of training images was combined
with the baseline. When evaluating the models, the same test set was used for all iterations.
The evaluation was done using an open source project which computes evaluation metrics for the object detection problem. To evaluate the model, the human interpretations (ground truths) were compared with the model predictions on the test set. The ground truths for each
image in the test set were annotated in a text file in the format <class> <left> <top> <width>
<height>, see Figure 3.5. The predictions were also annotated in a text file in the format
<class> <confidence> <left> <top> <width> <height>, see Figure 3.6. The ground truth and
the prediction text files were taken as input to the program. The output of the program is a precision-recall curve and AP for each class, and finally the mAP. The IoU threshold was set to 50 %, a commonly used value [23].
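IoU is the area of the intersection of a predicted box and a ground-truth box divided by the area of their union. A minimal sketch, assuming boxes given as (left, top, width, height):

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (left, top, width, height)."""
    la, ta, wa, ha = box_a
    lb, tb, wb, hb = box_b
    iw = max(0.0, min(la + wa, lb + wb) - max(la, lb))  # intersection width
    ih = max(0.0, min(ta + ha, tb + hb) - max(ta, tb))  # intersection height
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union else 0.0

# A detection shifted by half a box width overlaps its ground truth
# with IoU = 1/3, below the 0.5 threshold, so it would not count as correct.
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
```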
The evaluation metrics used in the open source project for each evaluation were precision
and recall. From this, a precision-recall curve was computed and displayed. Finally, the AP
for the two classes and the mAP were calculated from the precision-recall curves.
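AP summarizes the precision-recall curve as the area under its interpolated form, and mAP is the mean of the per-class APs. A sketch of one common variant (all-point interpolation); the exact interpolation used by the referenced open source project may differ:

```python
def average_precision(recalls, precisions):
    """All-point-interpolated AP from a precision-recall curve.
    `recalls` must be sorted ascending; both lists have equal length.
    This is one common variant, not necessarily the one used by the
    evaluation project referenced in the thesis."""
    # Pad the curve at recall 0 and 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas under the interpolated curve.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# A detector with precision 1.0 at every recall level has AP = 1.0.
print(average_precision([0.5, 1.0], [1.0, 1.0]))  # → 1.0
```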
3.3 Examining the attitudes and effects of image recognition

3.3.1 Interviews
Interviews are a common technique for explicitly reporting people’s perspectives, as described in Section 2.3. As part of exploring the attitudes and privacy concerns, we have therefore chosen to perform interviews. The interviews were designed as semi-structured, which
means that questions are prepared but leave room for supplementary questions and discus-
sion.
Procedure
Each interview was held with one participant at a time over a video call with one interviewer
and one person taking notes. The interviews were recorded if the participant agreed to it. Before each
interview, the participants gave an oral consent to participate in the interview. In this stage,
general information about the interview was presented to the participant. The information
included a presentation of the interviewers, the intention of the interview and a clarification
that the project was only for research. The participant was also informed about the data that
was going to be collected and how the data gathered from the interview would be used and
stored. Lastly the participants were informed that they could terminate the interview at any
time and that the consent was always revocable. The interviews were held in Swedish and lasted approximately 45 minutes each.
Before the second part of the interview, we wanted to give the participant an idea of the image
recognition system. Therefore the participant was shown a short video of the artifact running
the scoring application.
The second part of the interview consisted of questions that opened up for more discussion,
and the participants were encouraged to elaborate on their answers. The intention was to
gather thoughts, feelings and ideas of the image recognition system. Examples of questions
were "14. How do you feel about an image recognition system like this?" and "18. Do you
believe that the image recognition system can increase the safety at the paper mill?". We
were also interested in the employees’ feelings about the existing surveillance cameras at the
paper mill and how they thought the image recognition system would differ. Therefore we
asked questions like "19. How do you feel about the existing surveillance cameras at the
workplace?" and "21. Do you think there would be a difference between how you feel about
the existing surveillance cameras and this image recognition system?". The intention was also
to explore employees’ opinions regarding collection, storage and usage of data. An example
of a question was "22. What type of data do you think is okay for the organization to collect?".
For some of the questions in the second part, we gave examples. For example, for the question
"26. How and who should be notified about someone not wearing the correct safety equip-
ment?". Here we gave the examples "Direct alarm to the person concerned", "Direct alarm to
the manager", "Direct alarm that is visible for all", and "Data collection for statistics". This was
to help the participant and give some ideas, but the participant was not limited to
the examples.
Participants
A total of five people were recruited to participate in the study. When recruiting the partic-
ipants it was important that all were employees at Holmen and working at the same paper
mill. The participants should belong to different work areas at the mill, and all should on a
daily basis move in areas where there are surveillance cameras and where protective equip-
ment needs to be worn. Among employees who meet these requirements, participants were
selected at random. The chosen employees were contacted through email and participation
was voluntary. After each interview, demographic data were collected about the participant's age, gender, number of years they had worked at Holmen, and area of work. The data of the
five participants can be seen in Table 3.3. All participants were men and the average age was
49.8 years, with the youngest being 39 and the oldest 57. The participants had worked at Holmen for between 18 and 33 years, with an average of 28.4 years.
Data analysis
From each recorded interview, careful notes were taken without writing down every single
word said during the interview. Notes were taken of each interview by both the person holding the interview and the assistant separately. By doing that, there is a bigger chance of capturing the "essence", since we could have interpreted things differently. After
that, the answers from all the participants were summarized in a document, question by
question, to be able to compare the answers and find similarities and differences between them.
4 Results
In this part, the results from the models trained with different features and from the interviews are presented.
Table 4.1: Metrics for each model with different features included in the training set.
4.1 Performance of different features in the training sets
Baseline
The precision-recall curves for the baseline are shown in Figure 4.1. The AP of the class
PersonWithVest was 83.25 % and the AP of PersonWithoutVest was 71.08 %. The mAP of the
baseline was 77.17 %. In 4.1 (a), we see that the precision is high for a recall under 0.8, and
that the precision decreases for recalls above 0.8. In 4.1 (b), there is a similar pattern, although
the precision decreases somewhat around recall of 0.5.
Lighting conditions
In Figure 4.2, the results from adding images with darker lighting conditions to the baseline are shown. The AP for PersonWithVest was 83.50 % and the AP for PersonWithoutVest was 70.83 %. This results in a mAP of 77.16 %, a marginal decrease compared to the baseline.
Since there was an increase for the AP of the class PersonWithVest, the decrease of the mAP is
due to the decrease of the AP for PersonWithoutVest. The precision-recall curves are similar
to the baseline.
Background
The results of the model including training images with varying backgrounds are shown in
Figure 4.3. The AP was 85.74 % for PersonWithVest and 74.11 % for PersonWithoutVest. The
precision is high for a recall under 0.8 and decreases drastically for a recall higher than 0.8 for
PersonWithVest. The mAP was 79.92 %, which is an improvement of 2.75 percentage points
compared to the baseline.
Angle
The results of the model including training images with varying angles are shown in Figure
4.4. The AP for the two classes was 84.33 % for PersonWithVest and 76.07 % for PersonWith-
outVest. The mAP was 80.20 %.
Distance
In Figure 4.5, the precision-recall curves for the model including different distances are shown. Here we can see a significant improvement for both classes. The AP for PersonWithVest was 94.03 % and the AP for PersonWithoutVest was 82.28 %. This is an increase of 10.78 and 11.2 percentage points from the baseline respectively. The mAP was 88.15 %, 10.98 percentage
points better than the baseline. In the precision-recall curves it is shown that the precision
stays high until around 0.9 for PersonWithVest and around 0.8 for PersonWithoutVest. We
can see some decrease of precision in (b) at recall around 0.6.
Occlusion
The results of including images with occurrences of occlusions are shown in Figure 4.6. The
APs for the classes were 85.26 % for PersonWithVest and 70.85 % for PersonWithoutVest. The
AP for PersonWithoutVest has decreased compared to the baseline. The mAP was 78.05 %.
The results when combining the previous training sets, including all features, are shown in
Figure 4.7. The AP for PersonWithVest was 94.34 % and for PersonWithoutVest it was 90.32
%. The mAP was 92.33 %, which is an improvement of 15.16 percentage points compared to
the baseline.
4.2 Outcome of the interviews
Question                                                             A  B  C  D  E
1.  Organizations should be allowed to collect information
    about people without their permission.                           1  3  3  2  2
2.  Organizations should be allowed to collect information
    about people without their awareness.                            1  1  1  1  2
3.  Employees should be involved in the implementation
    process of monitoring systems.                                   5  5  5  4  5
4.  The amounts and types of personal information stored by
    organizations should be strictly limited.                        5  5  4  5  5
5.  The uses organizations make of personal information
    should have few controls.                                        1  1  1  1  1
6.  Organizations that collect and store personal information
    should not have the right to release this information to
    other organizations.                                             3  5  3  3  5
7.  I’m not at all happy about the fact that many organizations
    are storing personal information digitally.                      2  5  4  1  3
8.  I feel quite upset by the uses that organizations make of
    personal information about me.                                   4  3  3  2  3
9.  I feel somewhat uncomfortable about some of the ways the
    organization is collecting personal information about me.        3  3  5  3  3
10. I would feel somewhat uncomfortable if the organization
    would evaluate my actions by collecting personal
    information about me.                                            3  5  4  2  4
11. I feel quite upset by the fact that organizations save
    images of me.                                                    5  2  2  3  2
for granted. Both participants B and D felt positive about the system because it makes it possible to check whether people wear safety equipment or not. Participant B added that it becomes more problematic when people can be identified in the images. Participant A stated that cameras can be privacy intrusive. Participant C thought it is not easy to say to what extent the use of the system feels okay and gave some examples:

For example in London with crime and such things, it's really good with image recognition. Like if the police use it in subways to look for criminals, it's really good. But whether they belong in a workplace, I don't know.
The participants were unanimous about what they thought the intention of using the system is. They stated that the system is meant to increase the usage of personal protective equipment. Both participants A and C mentioned that it is easy to become skeptical. Participant A said that if you are conspiratorially minded, suspicions about the purpose of the system can arise, which the participant believes may be the case for several employees.
Participants A, C and D said that the safety at the paper mill could increase when using the system. Participant E said it could increase slightly. Participant B said that it would not make a big difference.
Participants A, B, C and D all thought that the system could potentially affect the behaviour of people who are exposed to it. They stated that some people may feel uncomfortable and stiff, and that some people may take another route to avoid the system if possible. Participant A stated that the system could contribute to a greater amount of stress. Participant D thought that people may feel that the company has reduced trust in its employees and reasoned:

I can imagine that people would think that "okay, now they're monitoring me, don't they believe I would have worn safety equipment anyway, without this camera?"
... I think that the recognition factor is something people can get annoyed at.
Also, the system could raise a fear of being caught not doing the right thing. Participant C added that even if the system can make people uncomfortable, it is something you get used to. Participant E was doubtful that the system would have any behavioural effects. This participant reasoned that there were already many cameras at the workplace today and that people stop thinking about them after a while.
There were differences in how the participants described how they were informed about the existing surveillance cameras at the paper mill. Participants A and C were informed through negotiations or at union meetings. Participants B and E said they did not get any specific information but instead discovered the cameras themselves by noticing the cameras or screens displaying camera views. Participant E additionally said that he had later received information through the union. Participant C said that the information could also be found on the company's intranet. Participant D described that he received this type of information on the intranet, at workplace meetings or through a manager.
I would be unsure of where the information ends up and would pay more atten-
tion when passing the camera.
Participant C did not believe there would be a difference between the two systems.
Participant E gave an example and said it would not be okay to record employees during an entire work day from their laptop web camera. He thought there was a difference if the laptop would record sound, which would feel worse and would cross the line. If the purpose is not to increase safety, he said he did not understand the reason to record. Participant E said there was a difference between collecting and storing data, and believed that an organization should not store data for a long time. The participant thought it could be okay to save data in order to show examples of good and bad behaviour, situations or similar.
Another possibility that the participants expressed was to send an alarm directly to the person missing required safety equipment. Participant C could see difficulties with having to identify the person, and would prefer to remain unidentifiable in the images. This participant added that he thought this was something that could change and might feel more acceptable in a few years.
Participant E thought that if someone missing required equipment entered an area with dangerous machines, it would be of most use to send an alarm to the people running the machines. This way, the machines could be turned off and the person missing equipment could be contacted and reminded to put it on. The participant did not see a reason for the alarm to go to a guard, and further reasoned that the alarm could be direct and visible to all, which would make everyone pay attention.
Participant A said that the alarm could be sent directly to the closest manager. This way, the manager could talk to the person of concern and point out that he/she was missing protective equipment. Participants B, D and E did not think a direct alarm to the closest manager would be a good idea. Participant B said:

If I would have walked through an area where I could barely see there was a camera, and then my manager would get a message about me not wearing safety glasses for example, and I wouldn't even know it had happened, it would have felt like an infraction.

Participant B would prefer that the system give an alert visible to all when a person, for example, goes through a passage. Participant D said it would be better to create a system that encouraged employees to wear equipment voluntarily more often, rather than their getting a notice from a manager about it. Participant E thought an alarm to a manager seemed unnecessary.
Important findings
Below is a list of important findings of the interviews.
• The purpose of the system is important to how the participants would feel about it.
• The system can be helpful and lead to increased use of safety equipment among the employees.
• The participants feel comfortable with the existing cameras but were more unsure of how the data gathered by the image recognition system would be used.
• The participants were positive about using the system to gather statistics, mainly to be able to see how big the problem is.
• The participants would rather have a general alarm visible to all than be pointed out in person.
5 Discussion
In this section the results will be discussed, followed by a discussion of the chosen methodology for this thesis and the work in a wider context.
5.1 Results
In this section the results will be discussed, both the performance of the models trained with
different features and the interview answers.
The performance of the object detection model differed between the two classes PersonWithVest and PersonWithoutVest. The AP for PersonWithVest was higher than the AP for PersonWithoutVest throughout the experiment, with all training sets. The AP between the classes differed by at least 8.26 percentage points and at most 14.41 percentage points. In the precision-recall curves for the two classes we can see a clear pattern: the recall is higher for PersonWithVest than for PersonWithoutVest regardless of the training set the model was trained on. This indicates that the model is not as good at finding all occurrences of a person without a vest as it is for a person with a vest. This result can have several causes. The people without vests in the training set can have varying characteristics, such as varying clothing colours, whereas the vests are nearly identical, making it easy for the model to recognize a vest and associate it with the class PersonWithVest. Since the number of occurrences of the different classes was balanced in both the training and test sets, this result can be of value when creating the training set: it indicates that the model could need a larger number of occurrences of persons without vests in the training set.
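For readers less familiar with the metrics, AP and mAP can be sketched as follows. Azure Custom Vision does not document its exact evaluation code, so this is an illustrative implementation of the standard definitions (AP as the area under the precision-recall curve of ranked detections; mAP as the mean over classes), not the service's own implementation:

```python
def average_precision(detections, num_positives):
    """Compute AP for one class from ranked detections.

    detections: list of (confidence, is_true_positive) pairs.
    num_positives: total number of ground-truth objects of this class.
    AP is accumulated as the area under the precision-recall curve,
    adding one rectangle per detection processed in confidence order.
    """
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / num_positives
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # rectangle under the curve
        prev_recall = recall
    return ap


def mean_average_precision(per_class_aps):
    """mAP is simply the mean of the per-class APs."""
    return sum(per_class_aps) / len(per_class_aps)
```

With the per-class APs reported above for the combined model, `mean_average_precision([0.9434, 0.9032])` reproduces the mAP of 92.33 %.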
When evaluating the impact of including different features in the training images, the results show that in general the tested features contribute to a better-performing and more robust model. The only exception, where the model did not perform better than the baseline, was the model trained with different lighting conditions. That result showed a higher AP for PersonWithVest but a lower AP for PersonWithoutVest, and the mAP was slightly lower than the baseline. This means that the model performs worse when dark images are added to the training set; it seems that the different lighting conditions confuse the model about what a person without a vest looks like.
The model trained on the background data set shows good improvements for all metrics. The mAP increased by 2.75 percentage points, showing that including training images from the same location as used in testing improves the model's performance. The result demonstrates that the background in which the objects occur during training affects the model's ability to detect correctly. The same goes for the model trained with different angles, which achieved results similar to the background model.
The greatest difference compared to the baseline was given by the model trained on the data set including varying distance, where the mAP increased by 10.98 percentage points. This was not surprising, since the scale difference between the distance images and the baseline was quite drastic. We assumed that the model would find it much easier to detect the small-scale persons in the test set if it had been trained on images of persons at the same scale. In the precision-recall curves we see that the precision is high for recall up to around 0.9 for PersonWithVest, which means the model manages to detect almost all objects and classify them correctly. For PersonWithoutVest we can see a small decrease in precision at recall above 0.5. It is difficult to say what caused this; we think the distance training images may confuse the model about what a person without a vest looks like, as when adding images with varying lighting conditions. Despite the small decrease in precision, the recall reaches close to 1 for PersonWithoutVest, which is better than for all the other models for this class and results in a higher AP for this class.
When comparing the model trained on the data set including occlusions to the baseline, a small increase in performance can be seen. This result was somewhat expected, because the ways in which the objects cover each other differ greatly from case to case. Since the occlusion varies so much, many occurrences of occlusion are needed in the training set before the model has been trained on the kinds of cases that then occur when using the model.
Overall, in this study the best performance was achieved by the model trained on the combined data set including all features, where a mAP of 92.33 % was achieved. This result can be compared with the result from a similar study [32] where a model was built using Azure Custom Vision. In that study a model trained on about 100 images per class achieved a mAP of 81.6 %. This suggests that our model performs better; however, it can be misleading to compare the results because of differences in how the test sets were shaped. In another study [10] where a YOLOv3 model was used, the achieved AP was 97 %, measured the same way as mAP in this study. The result when using the YOLOv3 model is better, which may be due to the algorithm or to the training data, which was significantly larger than the training set used in this study. Although this study did not focus on creating the best-performing model, these comparisons give an indication that a model for detecting safety equipment created in Azure Custom Vision seems to perform well with limited resources invested, both in terms of time and data.
5.1.2 Interviews
From both the quantitative and qualitative parts of the interview, it was clear that the participants thought that the purpose of the system was important for how they would feel about it. The answers to question 13 showed that four out of five participants thought this was important, and one was neutral. This is not surprising and is consistent with previous research [37, 44]. If the purpose is to improve safety, they would feel more positive about it. From the interviews it was clear that the participants were positive that the system could help increase the usage of safety equipment at their workplace, although some claimed the usage was already high and that the increase would be small. In a more general perspective, we think it is positive that the participants saw a potential increase. When people feel that the system has a helping function, it can contribute to more people having a positive attitude towards its use and seeing it as a benefit.
The employees feel comfortable with the existing cameras but are more unsure of how the data would be used. As previous research has indicated that not knowing what monitoring information is used for is related to how bothered people feel about it [37], we believe this is important to consider. It is also in line with what the participants believe is important when introducing a system like the one in this thesis: communicate the purpose and provide information to the employees. The participants' answers to question 2 show that they do not think organizations should be allowed to gather information without people's knowledge. The answers to question 3 show that they believe employees should be involved in the introduction of monitoring systems. These results indicate that being transparent about monitoring systems and their introduction is of great importance. Nor is this a surprising finding; previous work has come to similar conclusions [14, 46].
Several participants mentioned consequences that may come with using the system. They mentioned that the system could increase the amount of stress for the employees and could lead to employees worrying about getting caught not doing the right thing. These consequences, among others not discovered in this study, are important for an employer to take into consideration, and if the consequences are severe, one needs to consider whether using the system is worth it. Furthermore, one participant thought that people would get a feeling that the company had low trust in employees managing their own safety. We found this interesting, since it is not anything we have found in the literature. Even though it is difficult to draw conclusions when only one person claims it, we believe this potential consequence should be taken into consideration as a possible outcome of implementing a surveillance system like the one in this thesis.
We argue that all the consequences brought up are in line with dissatisfaction with management. Dissatisfaction with management is a potential risk when performing monitoring in secret, as shown in [40], and is another indication that transparency is important.
When handling monitoring systems, it is important to consider what data to store and for how long. As described in Section 2.2.1, there are laws to comply with. Apart from laws, people have their own perceptions of where the line should be drawn regarding information gathering. The interviews showed that the participants did not want the images to be linked to individuals, which we believe could mean they felt this was sensitive information. Questions 4 and 5 concern how limited and controlled the gathering and usage of personal information should be. The participants' answers show that they think the amount and type of information gathered should be strictly limited and that the usage of personal information should have many controls. From this we can see that privacy is important to the participants and that personal information should be handled carefully.
We saw somewhat of a clash of opinions, since many thought a direct alarm would be a good idea but did not want to be identifiable in the images. It is not possible to alert a person directly if it is not known who the person in the pictures is. There is a possibility that people are okay with being identified if it can be ensured that the information is only used to alert the person themselves, without the information getting into someone else's hands. This would require a thorough examination of how information is gathered, stored and used.
Some of the participants would rather have the system give a general alarm that is visible to all and where the detected person is not pointed out personally. One idea was to have something like a light or a sound to warn people in the area where the person without safety equipment was detected. Another possibility would be to have an alarm in the vehicles notifying the drivers to pay extra attention. This is another indication that a personal alarm, or a detection that can be traced back to an individual, would not be a suitable approach. The participants were positive about using the system to gather information for statistics. We believe this is the least sensitive way to use the system, since the objects in the images do not need to be identified.
Based on the opinions expressed by the employees, the system can be implemented using different approaches, and the choice of approach can affect employees' attitudes towards its use to varying degrees. To avoid storing more data than necessary, the system can be designed so that images taken from the video stream are never stored, meaning the system would analyze the images in real time. This could reduce employees' worry and stress that the system would store a lot of images of them that could possibly be used for other purposes. An alternative would be for the system to store an image only when missing or incorrect equipment has been detected, in order to be able to verify afterwards that the detection was performed correctly. This means that more data is stored, but there is evidence of what happened if, for example, someone were to claim that the result from the system is incorrect.
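The two storage policies above can be sketched in a few lines. This is a hypothetical sketch, not the thesis's C# implementation: the `detect` callable stands in for the Custom Vision model, and the frame and file handling are simplified assumptions:

```python
import os


def process_stream(frames, detect, evidence_dir=None):
    """Analyze frames in real time, optionally keeping evidence.

    frames: iterable of frames (here, any bytes-like object).
    detect: callable returning True when a person without the required
            equipment is found (stand-in for the real detector).
    evidence_dir: if None, no frame is ever written to disk, which is
            the privacy-preserving mode; otherwise only flagged frames
            are stored for later verification of the detection.
    Returns the number of alarms raised.
    """
    alarms = 0
    for i, frame in enumerate(frames):
        if detect(frame):
            alarms += 1
            if evidence_dir is not None:
                # Store only the flagged frame, never the whole stream.
                path = os.path.join(evidence_dir, f"detection_{i}.bin")
                with open(path, "wb") as f:
                    f.write(bytes(frame))
    return alarms
```

The design choice is contained in the single `evidence_dir` parameter: leaving it unset gives the store-nothing policy, setting it gives the store-on-detection policy with its accompanying audit trail.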
If the system is implemented in such a way that recall is valued higher than precision, as discussed earlier, more detections will be made, and thus the system will also incorrectly flag more cases as lacking safety equipment. For the employees, this can mean more alarms, which can be annoying. Another consequence of the system alarming even though no one lacks equipment is that the employees lose confidence in the system and distrust it even when it alarms correctly. On the other hand, if the system values precision higher than recall and fails to detect a person without safety equipment, it can lead to dissatisfaction.
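In practice this trade-off is usually made by choosing the confidence threshold above which a detection triggers an alarm. The following sketch (illustrative; the detection tuples are made-up values, not results from this study) shows how lowering the threshold raises recall at the cost of precision:

```python
def precision_recall_at(detections, num_positives, threshold):
    """Precision and recall when only detections with confidence
    >= threshold are kept.

    detections: list of (confidence, is_true_positive) pairs.
    num_positives: total number of ground-truth objects.
    """
    kept = [d for d in detections if d[0] >= threshold]
    tp = sum(1 for _, is_tp in kept if is_tp)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / num_positives
    return precision, recall


# Toy example: 3 real persons without equipment, 5 detections.
dets = [(0.95, True), (0.9, True), (0.6, False), (0.5, True), (0.4, False)]
# High threshold: no false alarms, but one person is missed.
# Low threshold: everyone is found, but 2 of 5 alarms are false.
```

At threshold 0.8 the toy example gives precision 1.0 and recall 2/3 (a missed person, risking dissatisfaction); at threshold 0.3 it gives recall 1.0 but precision 0.6 (false alarms, risking annoyance and distrust), mirroring the discussion above.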
5.2 Method
In this part, the method of the study is discussed. The examination of the impact of the features and the interviews are presented separately and discussed with respect to replicability, reliability and validity. Other limitations of the method, and what would have been interesting to do differently, are also presented. Finally, source criticism and the work in a wider context are presented.
When the data set of images taken at long distance was collected, the people were at a set distance of about 80 meters, which meant that people were visible in the pictures but no details could be discerned. Since images taken from other distances were neither used to train the models nor evaluated, we can only draw conclusions about performance at the tested distance. It would have been interesting to test more distances to see if there is any breaking point where the performance of the model trained on the baseline starts to decrease. In addition, it would have been interesting to investigate whether there is a limit to how far away people can be for a model to detect them at all. This could have given information about how large an area a single camera can cover while still making these detections.
The approach taken in this study when creating the data set with other lighting conditions was to remake images taken in daylight by adding a dark filter. All images in the data set had about the same darkness. The study did not include other lighting conditions, which means that conclusions cannot be drawn about the impact of conditions like strong sunlight and shadows. The reason for limiting the data to darkness was the difficulty of controlling the weather. Additionally, the difference between the lighting conditions in the baseline and stronger sunlight was assumed to be barely noticeable.
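A uniform dark filter of this kind amounts to scaling every pixel's channel values toward black. The sketch below illustrates the idea on raw RGB values; it is an assumption about what such a filter does, not the actual tool used to prepare the data set:

```python
def darken(pixels, factor=0.3):
    """Simulate a low-light version of a daylight image.

    pixels: image as a nested list of (r, g, b) tuples, one per pixel.
    factor: fraction of the original brightness to keep; 0.3 keeps
            30 % of each channel, giving a uniformly dark image
            roughly like applying a dark filter.
    Returns a new image; the input is left unchanged.
    """
    return [
        [(int(r * factor), int(g * factor), int(b * factor))
         for (r, g, b) in row]
        for row in pixels
    ]
```

Because the scaling is uniform, every augmented image ends up with about the same darkness, which matches the limitation noted above: one factor gives one lighting condition, so sweeping over several factors would be needed to cover a range of conditions.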
When collecting the data set with different backgrounds to compare to the baseline, images from three separate locations were included. When evaluating the model trained on those images, the test data included images from those same locations. The result indicated that training the model on the same backgrounds that are used when scoring increases performance. From this finding, conclusions cannot be drawn about whether training on different backgrounds contributes to a more robust model. To evaluate whether the model becomes more robust and performs better in entirely new locations, the test set should include images with backgrounds not present in the training set.
Throughout the experiments, the data collected was originally extracted from videos. The extracted images were split into training data and test data. Since the images came from the same videos, the training data and the test data were very similar. As described earlier, this means the different aspects were not varied a lot; the training images of, for example, distance were very similar to the images taken from a longer distance included in the test data. This can represent a real scenario where the system used at a paper mill utilizes a model trained on data collected from the same location where the model is then to be used. It would have been interesting to include images from an entirely new location, and with more varying distances and viewpoints, in the test set. The people in the images were also the same; it would have been interesting to test the model on entirely new people. This could indicate the model's performance if it were to be used in another paper mill or if the employees' work clothes were to change.
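One common way to reduce this kind of leakage is to split at the video level rather than the frame level, so that near-identical frames from the same video never end up on both sides. The sketch below shows such a group-aware split; it is an alternative to the frame-level split used in the study, not a description of it:

```python
import random


def split_by_video(frames, test_fraction=0.3, seed=0):
    """Split extracted frames so all frames from one source video
    land on the same side of the train/test split.

    frames: list of (video_id, frame) pairs.
    test_fraction: approximate fraction of videos held out for testing.
    seed: fixed seed so the split is reproducible.
    Returns (train, test) lists of (video_id, frame) pairs.
    """
    videos = sorted({vid for vid, _ in frames})
    rng = random.Random(seed)
    rng.shuffle(videos)
    n_test = max(1, round(len(videos) * test_fraction))
    test_videos = set(videos[:n_test])
    train = [(v, f) for v, f in frames if v not in test_videos]
    test = [(v, f) for v, f in frames if v in test_videos]
    return train, test
```

A model evaluated on a split like this is tested on footage it has never seen any frame of, which gives a more honest estimate of performance in a new location, at the cost of needing footage from more videos.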
5.2.2 Interviews
In order for the replicability of the interviews to be high, the procedure has been described in detail. The interview outline is described in detail and the interview questions are included in the appendix. However, the fact that the interviews were partly unstructured contributes to a lower replicability. The questions used in the interviews were based on questions used in previous research examining the same area. However, the questions were adjusted from their original form, which can decrease the validity.
The interviews in this study were held in Swedish, and the participants' answers were translated from the notes into English for use in Section 4.2. When translating, there is a risk of incorrect translation and word choices that can affect the meaning of the participants' answers. Therefore, the outcome of the interviews can be considered to have reduced reliability.
The reliability of the outcomes of the interviews can also be questioned due to the small number of participants. The survey contains too few participants; more people's opinions would be required to provide a more general picture of what a group of people think and to increase reliability. Still, the results from the five interviews can be used to give an indication and to identify early problems that need to be addressed.
When conducting the interviews, the variation among the participants was small. Only men aged around 40-60 who had worked at the company for at least 18 years were interviewed. It is possible that attitudes and feelings differ between age groups and genders. Therefore, the validity of the outcomes of the interviews can be argued to be low, because the people included in the study are not representative of all the employees at Holmen.
When the interviews were held and the participants answered the quantitative part, we noticed that some participants perceived the statements as difficult to understand or take a position on. We believe that this may have contributed to the participants in some cases not responding in agreement with their true opinions and feelings. We believe this could have been avoided through some reformulations and by giving the participants more time to respond to each statement.
5.3 The work in a wider context
We have based the information about GDPR and the Camera Surveillance Act on the Swedish Government Offices' legal databases.
Several of the studies presented in Section 2.2.2 were conducted over 10-15 years ago, and a fair question is whether the results would be the same today. The area of surveillance has changed a lot, and it is possible that people would have different opinions today compared to 15 years ago. In the responses from the interviews we found several similarities with the studied literature. This, we argue, indicates that the findings on feelings and attitudes towards surveillance are still relevant today.
Since technology has advanced very rapidly in the field of image recognition, it has at times been used in ways that many consider unethical. Several studies have been conducted where image recognition, in particular facial recognition, is used to distinguish ethnicity, predict whether people are criminals, and infer personal characteristics such as gender and age [29]. As an engineer of a system like the one in this thesis, one has a responsibility to consider the possible consequences and to take proactive measures to prevent gender and ethnic biases. In practice this would mean, for example, including a broad representation of people in the training set, making the model equally good at detection irrespective of appearance. Similarly, people from all hierarchical layers of the organization should be represented in the training set so that everyone is treated equally regardless of position.
In the progress of Industry 4.0, automation makes it possible to replace manual tasks performed by humans with machines. Some therefore believe that automation will decrease the number of full-time staff and that the desired competencies will shift towards engineers with higher education. This is an ethical dilemma that companies need to consider. However, we are convinced that the development of new technology will only increase, since there is financial gain in replacing human labour with machines. Regarding the system in this thesis, however, the purpose it fulfills does not entail replacing any staff. It is mainly meant as a tool and aid for when there is a lack of safety compliance.
We believe the use of image recognition in industry will only increase, and that the technique can be of great value if used right. If an image recognition system for detecting unsafe conditions related to missing safety equipment can be implemented, we see possibilities for other uses of the technique. A similar solution could be used for detecting a person in a critical area, such as too close to a dangerous machine. Another use could be in jobs where working alone is frequent, where a similar system could raise an alarm if an accident happens and no one else is present.
6 Conclusion
The aim of this thesis was to examine how an image recognition system can be implemented that makes employees wear safety equipment more often. The study shows that it is possible to build a system that is able to detect a person with and without safety equipment, more specifically reflective vests. The system, built in C# .NET Core, utilizes Azure Custom Vision for detection and Azure Event Hubs for notifications; it takes a video stream as input and outputs the detections.
In addition to implementing a system, the goal was to examine how and to what extent a machine learning model can be trained to detect safety equipment. Since the algorithms that Azure Custom Vision is built upon are hidden, as engineers we are only responsible for providing the training data. By training the model on different data sets, we received interesting results showing that including varying backgrounds, angles, distances and occlusions in the training set has a positive impact on the model's performance. When creating a model for detection of safety equipment, we argue that these aspects should be considered in the data collection phase in order for the system to perform well in different environments and circumstances. When including images with varying lighting conditions in the training set, we saw a slight deterioration in performance. However, we recommend further research due to limitations in the method.
This thesis focused on examining the impact of training data on an Azure Custom Vision model. The best performance measured gave a mAP of 92.33 %, which we believe is promising. To achieve this result, a small number of images was used for training, and we see potential in improving the model further by increasing the number of training images.
Finally, the thesis aimed to investigate the privacy concerns and employees' attitudes regarding the system. From the literature we identified variables that have an impact on people's attitudes towards electronic monitoring. The literature showed that the perceived purpose, transparency and communication can affect how people feel about monitoring systems. It also showed that electronic monitoring can lead to reduced trust and privacy concerns if implemented irresponsibly or without informing about the objectives of the monitoring. These findings were further strengthened by interviews held with employees at Holmen, which showed that employees would be more positive if the cause of the system was justified. Information and communication in the introduction of a monitoring system were furthermore highly important. Privacy was highly valued, and not being identifiable in images would be preferable. We argue that the system should alarm on a general level, without identifying the concerned person, to avoid privacy intrusion. As a starting point for Holmen we suggest utilizing the system for statistics, to gain an understanding of the extent of the problem, as this was an important takeaway from the interviews. To implement a monitoring system for detection of safety equipment at Holmen, the collected data and monitoring practices must comply with GDPR and the Camera Surveillance Act.
Bibliography
[1] Manal El Aidouni. Evaluating Object Detection Models: Guide to Performance Metrics. 2019. URL: https://manalelaidouni.github.io/manalelaidouni.github.io/Evaluating-Object-Detection-Models-Guide-to-Performance-Metrics.html (visited on 03/22/2021).
[2] G Stoney Alder. “Ethical issues in electronic performance monitoring: A consideration
of deontological and teleological perspectives”. In: Journal of Business Ethics 17.7 (1998),
pp. 729–743.
[3] Bradley J Alge. “Effects of computer surveillance on perceptions of privacy and proce-
dural justice.” In: Journal of Applied Psychology 86.4 (2001), p. 797.
[4] Eduard Barnoviciu, Veta Ghenescu, Serban-Vasile Carata, Marian Ghenescu, Roxana
Mihaescu, and Mihai Chindea. “GDPR compliance in Video Surveillance and Video
Processing Application”. In: 2019 International Conference on Speech Technology and
Human-Computer Dialogue (SpeD). IEEE, pp. 1–6.
[5] Christian Luzter Batugal, Jewel Mark Perry Gupo, Kasandra Kimm Mendoza, Adonis S
Santos, Francis A Malabanan, Jay Nickson T Tabing, and Christopher B Escarez. “Eye-
Smell: Rice Spoilage Detection using Azure Custom Vision in Raspberry Pi 3”. In: 2020
IEEE REGION 10 CONFERENCE (TENCON), pp. 738–743.
[6] Carl Botan. “Communication work and electronic surveillance: A model for predicting
panoptic effects”. In: Communications Monographs 63.4 (1996), pp. 293–313.
[7] Marco Javier Flores Calero, Miltón Aldás, José Lázaro, Alfredo Gardel, Noemi Onofa,
and Brayan Quinga. Pedestrian detection under partial occlusion by using logic inference,
HOG and SVM. 09. 2019, pp. 1552–1559.
[8] Bernardo Botelho Antunes da Costa and Pedro Silveira Pisa. “Cloud Strategies for Im-
age Recognition”. In: 2020 4th Conference on Cloud and Internet of Things (CIoT). IEEE,
pp. 57–58.
[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei.
“Deformable convolutional networks”. In: Proceedings of the IEEE international conference
on computer vision. 2017, pp. 764–773.
[10] Venkata Santosh Kumar Delhi, Albert Thomas, et al. “Detection of Personal Protective
Equipment (PPE) compliance on construction site using Computer Vision based Deep
Learning techniques”. In: Frontiers in Built Environment 6 (2020), p. 136.
[11] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. “Pedestrian detection:
An evaluation of the state of the art”. In: IEEE transactions on pattern analysis and machine
intelligence 34.4 (2011), pp. 743–761.
[12] Bill Gillham. Research interview. A&C Black, 2000.
[13] Bill Gillham. Research Interviewing: The range of techniques: A practical guide. McGraw-Hill
Education (UK), 2005.
[14] Christina S Hagen, Leila Bighash, Andrea B Hollingshead, Sonia Jawaid Shaikh, and
Kristen S Alexander. “Why are you watching? Video surveillance in organizations”. In:
Corporate Communications: An International Journal (2018).
[15] Laura P Hartman. Technology and ethics: Privacy in the workplace. 1. 2001, pp. 1–27.
[16] Xiaoyue Jiang, Abdenour Hadid, Yanwei Pang, Eric Granger, and Xiaoyi Feng. Deep
Learning in object detection and recognition. Springer, 2019.
[17] Kamerabevakningslag (2018:1200). 2018. URL: http://rkrattsbaser.gov.se/sfst?bet=2018:1200 (visited on 05/11/2021).
[18] Lag (1976:580) om medbestämmande i arbetslivet. 1976. URL: http://rkrattsbaser.gov.se/sfst?bet=1976:580 (visited on 05/11/2021).
[19] Lag (2018:218) med kompletterande bestämmelser till EU:s dataskyddsförordning. 2018. URL:
http://rkrattsbaser.gov.se/sfst?bet=2018:218 (visited on 05/11/2021).
[20] Jianan Li, Xiaodan Liang, ShengMei Shen, Tingfa Xu, Jiashi Feng, and Shuicheng Yan.
“Scale-aware fast R-CNN for pedestrian detection”. In: IEEE transactions on Multimedia
20.4 (2017), pp. 985–996.
[21] Jie Li, Huanming Liu, Tianzheng Wang, Min Jiang, Shuai Wang, Kang Li, and Xi-
aoguang Zhao. “Safety helmet wearing detection based on image processing and ma-
chine learning”. In: 2017 Ninth International Conference on Advanced Computational Intel-
ligence (ICACI). IEEE, pp. 201–205.
[22] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge
Belongie. “Feature pyramid networks for object detection”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2017, pp. 2117–2125.
[23] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti
Pietikäinen. “Deep learning for generic object detection: A survey”. In: International
journal of computer vision 128.2 (2020), pp. 261–318.
[24] Gary T Marx. “Ethics for the new surveillance”. In: The Information Society 14.3 (1998),
pp. 171–185.
[25] Microsoft. What is Computer Vision? 2020. URL: https://docs.microsoft.com/en-ca/azure/cognitive-services/computer-vision/overview (visited on 03/17/2021).
[26] Microsoft. What is Custom Vision? 2020. URL: https://docs.microsoft.com/sv-se/azure/cognitive-services/custom-vision-service/overview (visited on 03/17/2021).
[27] Ma Nan, Chen Li, Hu JianCheng, Shang QiuNa, Li JiaHong, and Zhang GuoPing.
“Pedestrian Detection Based on HOG Features and SVM Realizes Vehicle-Human-
Environment Interaction”. In: 2019 15th International Conference on Computational Intelli-
gence and Security (CIS). IEEE, pp. 287–291.
[28] Nipun D Nath, Amir H Behzadan, and Stephanie G Paal. “Deep learning for site safety:
Real-time detection of personal protective equipment”. In: Automation in Construction
112 (2020), p. 103085.
[29] Richard Van Noorden. The ethical questions that haunt facial-recognition research. 2020.
URL: https://www.nature.com/articles/d41586-020-03187-3 (visited on 05/11/2021).
[30] Keiron O’Shea and Ryan Nash. “An introduction to convolutional neural networks”.
In: arXiv preprint arXiv:1511.08458 (2015).
[31] OM Parkhi, A Vedaldi, and A Zisserman. “Deep face recognition”. In: British Machine
Vision Association, 2015, pp. 1–12.
[32] Matea Pejčinović. “A Review of Custom Vision Service for Facilitating an Image Clas-
sification”. In: Central European Conference on Information and Intelligent Systems. Faculty
of Organization and Informatics Varazdin. 2019, pp. 197–208.
[33] David MW Powers. “Evaluation: from precision, recall and F-measure to ROC, in-
formedness, markedness and correlation”. In: arXiv preprint arXiv:2010.16061 (2020).
[34] Sovit Ranjan Rath. Evaluation Metrics for Object Detection. URL: https://debuggercafe.com/evaluation-metrics-for-object-detection (visited on 05/06/2021).
[35] Andrew K Schnackenberg and Edward C Tomlinson. “Organizational transparency:
A new perspective on managing trust in organization-stakeholder relationships”. In:
Journal of Management 42.7 (2016), pp. 1784–1810.
[36] JoonOh Seo, SangUk Han, SangHyun Lee, and Hyoungkwan Kim. “Computer vision
techniques for construction safety and health monitoring”. In: Advanced Engineering
Informatics 29.2 (2015), pp. 239–251.
[37] Jeffrey M Stanton and Elizabeth M Weiss. “Electronic monitoring in their own words:
an exploratory study of employees’ experiences with new types of surveillance”. In:
Computers in Human Behavior 16.4 (2000), pp. 423–440.
[38] Eugene F Stone, Hal G Gueutal, Donald G Gardner, and Stephen McClure. “A field
experiment comparing information-privacy values, beliefs, and attitudes across several
types of organizations.” In: Journal of applied psychology 68.3 (1983), p. 459.
[39] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Busi-
ness Media, 2010.
[40] Filiz Tabak and William P Smith. “Privacy and electronic monitoring in the workplace:
A model of managerial cognition and relational trust development”. In: Employee Re-
sponsibilities and Rights Journal 17.3 (2005), pp. 173–189.
[41] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. “A-fast-rcnn: Hard positive
generation via adversary for object detection”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017, pp. 2606–2615.
[42] Rattapoom Waranusast, Nannaphat Bundon, Vasan Timtong, Chainarong Tangnoi, and
Pattanawadee Pattanathaburt. “Machine vision techniques for motorcycle safety hel-
met detection”. In: 2013 28th International Conference on Image and Vision Computing New
Zealand (IVCNZ 2013). IEEE, pp. 35–40.
[43] Myria Watkins Allen, Stephanie J Coopman, Joy L Hart, and Kasey L Walker. “Work-
place surveillance and managing privacy boundaries”. In: Management Communication
Quarterly 21.2 (2007), pp. 172–200.
[44] Deborah L Wells, Robert H Moorman, and Jon M Werner. “The impact of the perceived
purpose of electronic performance monitoring on an array of attitudinal variables”. In:
Human Resource Development Quarterly 18.1 (2007), pp. 121–138.
[45] Christian Wolf and Jean-Michel Jolion. “Object count/area graphs for the evaluation
of object detection and segmentation algorithms”. In: International Journal of Document
Analysis and Recognition (IJDAR) 8.4 (2006), pp. 280–296.
[46] Michael Workman. “How perceptions of justice affect security attitudes: suggestions for
practitioners and researchers”. In: Information Management & Computer Security (2009).
[47] Xin Zhang, Yee-Hong Yang, Zhiguang Han, Hui Wang, and Chao Gao. “Object class
detection: A survey”. In: ACM Computing Surveys (CSUR) 46.1 (2013), pp. 1–53.
A Interview questions
Interview questions
Quantitative questions:
1. Organizations should be allowed to collect information about people without their permission.
6. Organizations that collect and store personal information should not have the right
to release that information to other organizations.
7. I am not at all comfortable with many organizations storing personal information
digitally.
9. I feel somewhat uncomfortable with some of the ways in which the organization collects
personal information about me.
10. I would feel somewhat uncomfortable if the organization were to evaluate my
actions by collecting personal information about me.
11. I feel quite upset that the organization stores images of me.
13. The purpose of a monitoring system affects how I feel about it.
Qualitative questions:
14. How do you feel about an image recognition system like this?
16. Do you think this image recognition system would make you or others
(visitors, suppliers, etc.) wear safety equipment more often?
21. Do you think there would be any difference between how you feel about the existing
surveillance cameras and this image recognition system?
22. What types of data do you think it is acceptable for the organization to collect?
26. How and to whom should it be reported that someone is not wearing the correct safety equipment?
Examples:
● Direct alarm to the person concerned
● Direct alarm to the manager
● Direct alarm visible to everyone
● Data collection for statistics
28. Do you have ideas for further possibilities and uses of IT to
improve safety in the workplace?
29. What do you think is important when introducing a system like this?