Computer Vision for Document Image Analysis and Text Extraction
OMAR BENCHEKROUN

Degree Programme in Computer Science and Engineering
Second cycle, 30 credits
Date: June 9, 2022
Supervisor: Mats Nordahl
Examiner: Aristides Gionis
School of Electrical Engineering and Computer Science
Host company: La Javaness
Swedish title: Datorseende för analys av dokumentbilder och textutvinning
Stockholm, Sweden 2022
© 2022 Omar Benchekroun
Abstract
Automatic document processing has been a subject of interest in the industry
for the past few years, especially with the recent technological advances
in Machine Learning and Computer Vision. This project investigates in depth
a major component of Document Image Processing known as Optical Character
Recognition (OCR). First, an improvement upon an existing shallow CNN+LSTM
model is proposed, using domain-specific data synthesis. We demonstrate that
this model can achieve an accuracy of up to 97% on non-handwritten text, with
an accuracy improvement of 24% when using synthetic data. Furthermore, we
deal with handwritten text, which presents additional challenges including
variance in writing style, slanting, and character ambiguity. A CNN+Transformer
architecture is validated to recognize handwriting extracted from real-world
insurance statement data. This model achieves a maximal accuracy of 92% on
real-world data. Moreover, we demonstrate how a data pipeline relying on
synthetic data can be a scalable and affordable solution for modern OCR needs.
Keywords
Optical Character Recognition, Document Analysis, Text Extraction,
Transformers, Convolutional Neural Networks
Sammanfattning
Automatic document processing has been a subject of interest in the industry
in recent years, especially with the latest technological advances in machine
learning and computer vision. In this project, a major component used in
document image processing, known as Optical Character Recognition (OCR), is
investigated in depth. First, an improvement upon an existing shallow
CNN+LSTM is proposed, using domain-specific data synthesis. We show that this
model can achieve an accuracy of up to 97% on non-handwritten text, with an
accuracy improvement of 24% when synthetic data is used. Furthermore, we deal
with handwritten text, which presents additional challenges such as variations
in writing style, slanting, and ambiguous characters. A CNN+Transformer
architecture is validated to recognize handwriting from real-world insurance
statement data. This model achieves a maximal accuracy of 92% on real-world
data. Moreover, we show how a data pipeline based on synthetic data is a
scalable and affordable solution for modern OCR needs.
Nyckelord
Optical Character Recognition, document analysis, text extraction,
Transformers, Convolutional Neural Networks
Acknowledgments
Foremost, I would like to express my gratitude to my technical supervisor
and manager Phi Hung LE for his continuous support during this project and
his interesting insights that helped me choose the right research directions. I would
also like to thank Dr. Van Tuan DANG for his additional supervision and help
with the technical aspects of the project, as well as for moral support in difficult
times.
My sincere thanks also go to Yevhenii SIELSKYI and Nicolas HENRY
of the Optical Character Recognition dream team as well as the rest of La
Javaness’s Data team, for being welcoming, supportive, and available for both
questions and directions.
I would also like to thank my KTH supervisor Mats Nordahl, for his
continuous insights on how to improve my methodology and writing, as well
as my examiner Aristides Gionis for his academic guidance.
Last but not least, I would like to thank my family, especially my parents, for
being supportive during my entire education. I wouldn’t have made it through
the hurdles I’ve encountered without their love and help.
Contents

1 Introduction
  1.1 Research Question
  1.2 Purpose
  1.3 Delimitations
  1.4 Outline

2 Background
  2.1 Machine Learning and Computer Vision
  2.2 Domain Adaptation and Transfer Learning
  2.3 Neural Networks: The building blocks and novel architectures
  2.4 Sequence Modeling
  2.5 Optical Character Recognition
  2.6 Related Works

3 Method
  3.1 Data
  3.2 Preprocessing for Document Analysis
  3.3 Architectures
  3.4 Morphological Operations for Domain Adaptation
  3.5 Evaluation Metrics for Optical Character Recognition

4 Results
  4.1 Predictive Performance
  4.2 Method-related metrics

5 Environmental and societal impact
  5.1 Personal Environmental Impact
  5.2 Global Environmental Impact
  5.3 Global Societal Impact

6 Conclusion
List of acronyms and abbreviations
ANN Artificial Neural Network
CNN Convolutional Neural Network
OCR Optical Character Recognition
ReLU Rectified Linear Unit
LSTM Long Short Term Memory
GAN Generative Adversarial Networks
Chapter 1
Introduction
Document Analysis and subsequent information extraction is a subject of great
industrial interest. In the process of digitizing documents, clients such as
insurance companies want to:
1. Convert physical documents into numerical ones for storage and
processing purposes.
2. Input relevant information extracted from these documents in an internal
search engine.
3. Ensure quality control, customer service, fraud detection, and any
other future applications through a semi-automated or fully automated
document analysis process.
Today, this digitization process is most often done manually by data entry
employees. However, this process is tedious and does not scale, even for
large organizations. Ideally, we would like the process to be fully automated
using Machine Learning-based algorithms, with manual entry required only for
a handful of outliers. More realistically, if the Machine Learning-based
solution is not sufficiently accurate, we can develop a semi-automatic process
in which some of the easier fields in the documents are filled automatically.
Optical Character Recognition systems are therefore used to infer the
documents' contents from the scanned input.
1.1 Research Question
This study aims to examine the effect of domain adaptation using synthetic
data and morphological operations when applied to the document analysis
and text extraction problem. We compare various deep learning architectures
to answer the following: How does the use of synthetic data and subsequent
domain adaptation affect the performance of a neural network-based Optical
Character Recognition system?
Furthermore, we specifically answer these three subquestions:
• How well does the use of synthetic data improve the accuracy of Optical
Character Recognition? Having a highly accurate text extraction system
is the main business motivation behind OCR-based systems. Achieving
near-human accuracy has a wide range of applications, and although
stability and generalization ability are also important in such a system,
overall accuracy should not suffer significantly as a result of improving
them.
• How stable are the methods used in terms of generalization power?
Generalization is a common issue in Machine Learning, and ideally, an
OCR engine should work just as well across different types of text. To
that end, studying domain adaptation is paramount to avoid overoptimistic
results that do not scale properly.
• How do the methods compare in terms of algorithmic complexity and
execution time? Deployment in a real-world system is often subject
to multiple constraints: hardware, time sensitivity, and scalability,
for example with regard to adding classes mid-deployment.
1.2 Purpose
The purpose of developing high-performing Optical Character Recognition
systems is mainly to scale document processing that most commonly relies on
manual labor. Taking accident statement documents as an example, insurance
companies have data entry workers transfer the physical documents into the
company's internal database. This process could be fully automated given a
highly accurate and reliable OCR solution. Alternatively, a semi-automated
pipeline can be proposed to accelerate data entry.
1.3 Delimitations
Within the scope of this project, we note the following limitations:
• Real data used in this project comes almost exclusively from accident
statement documents provided by insurance companies. We try to
remedy this bias with the use of synthetic data and different blinding
approaches in neural network training.
• Additional noise originating from improper document formatting is not
discussed in this project. We make the assumption that the processed
documents are properly classified and tagged beforehand.
1.4 Outline
First, we begin by introducing the technical background relevant to our
study. The background introduces modern Computer Vision, Artificial Neural
Networks, and models relevant to our study including Long Short Term
Memory (LSTM) and Transformers. We also discuss the use of synthetic data,
which is central to the study, in common machine learning problems as well as
in Optical Character Recognition. The chapter ends with a brief state of the
art that lists the important contributions that made the results of our study possible.
Chapter 3 dives into the methods and their respective evaluation metrics.
We discuss the choices regarding data synthesis, model design, and training.
Chapter 4 presents the results using the relevant metrics such as Character
Error Rate, Word Error Rate, and Student T-tests. Finally, Chapters 5 and
6 discuss the results of the study, as well as the societal and environmental
impacts such as energy consumption and effects on the global job market.
Chapter 2
Background
In this chapter, we present the background study necessary to understand the
contents of this thesis. We begin by introducing modern problems related
to Machine Learning and Computer Vision, including Object Detection and
Domain Adaptation. We will also define the Artificial Neural Network
architectures used in this study: Convolutional Neural Networks, Long
Short Term Memory (LSTM) Networks, and Transformers. Furthermore, we
precisely define the problems related to Optical Character Recognition and
Domain Adaptation. Finally, we present a literature study that covers some
of the state-of-the-art OCR solutions.
2.1 Machine Learning and Computer Vision
This project falls under the Machine Learning (ML) and Computer Vision
(CV) research areas.
Machine Learning is the study of data-related algorithms that automate the
learning of a given objective without the need to give explicit instructions.
The learning process can be supervised or unsupervised. Machine Learning
algorithms are often based on statistical analysis and inference to detect
patterns in past data. Machine learning algorithms usually perform better
than static programming in tasks where learning the relation between input
and output from examples is easier than coding that relation explicitly.
Several objectives are usually present in a Machine Learning project:
improving the accuracy of the ML model, improving the inference speed of
the model and ensuring its stability for real-world deployment, and finally
minimizing the computational resources needed to train the ML model, both to
lower costs and, more importantly, for environmental reasons.
Current research developments in Machine Learning include the use of
Transformers. A Transformer is a deep learning neural network model that
uses the mechanism of attention [1] to process sequential data. Transformers
are the latest improvement upon Recurrent Neural Networks (RNNs) and
LSTM networks that were used to process sequential data. Transformers have
been particularly effective in the field of Natural Language Processing (NLP)
with the use of BERT [2]. However, the applications of Transformers in
Computer Vision are also very promising. The paper published by Alexey
Dosovitskiy et al. [3] shows the use of Vision Transformers that achieve state-
of-the-art performance with fewer computational resources.
Computer Vision is the study of algorithms that extract information from
image and video data, thus giving the computer a high-level understanding
of visual content in a way similar to human vision. As a matter of fact,
multiple computer vision algorithms were inspired by the way the human
neurological system works. Common computer vision tasks include Image
Classification, Image Segmentation, and Object Detection, and they are studied
in different contexts, the most prominent of which is the area of Autonomous
Vehicles, or self-driving cars.
We are mostly interested in two tasks in particular:
• Object detection is the task of locating one or several objects in an image.
The location of the recognized object is expressed with the coordinates
of the pixels in which the object is estimated to be contained, referred
to as the bounding box. In object detection, it is usually necessary
to provide the degree of confidence that the object detector has in the
prediction. In our case study, object detection is for example used to
locate lines of text in a given document. If the aforementioned object
detection is successful, we get as output the coordinates of bounding
boxes containing lines of text, and we can later process the cropped
images in a different step of our analysis pipeline.
• Image Classification is the task of predicting the class of an image, given
a dataset of images and their respective classes (labels). For instance,
a widely used Image Classification dataset is the MNIST Dataset [4]
which consists of handwritten digits and their transcriptions. In our case
study, traditional Optical Character Recognition consisted of individual
character detection (Object Detection) followed by Image Classification with
approximately 40 classes (26 letters, 10 digits, etc.).
2.2 Domain Adaptation and Transfer Learning
Domain Adaptation is a rising challenge in Machine Learning, especially
when we deploy Machine Learning Models on real-world data. Domain
Adaptation refers to the problem of applying a model trained on one (or
multiple) source domain(s) on inputs from a target domain. The target domain
must be different from the source domain(s) for the Domain Adaptation to
make sense, but not too different for the inference to be feasible without entire
retraining [5].
An example of Domain Adaptation in Autonomous Driving [6] would be
to adapt a model used in an autonomous vehicle in New York, to work in
the city of Paris, in which case the source and target domains would be the
images and inputs fed to the model from New York and Paris respectively.
The main idea behind Domain Adaptation or Transfer Learning, is to make
use as much as possible of the model trained on the source domain(s) given
its similarities with the target domain. Domain Adaptation has numerous
benefits, including the use of fewer resources for training (see section 4.1
for more details), a smaller overall carbon footprint (see section 5 for more
details), and applications in model compression.
Figure 2.1 – Illustration of a domain shift causing misclassification issues. Domain
Adaptation can be seen as a compromise between the source and target domains to
achieve high accuracy in both.
2.3 Neural Networks: The building blocks and novel
architectures
2.3.1 Artificial Neural Networks
Artificial neural networks (ANNs) are computing systems that are inspired
by the biological neural networks in animal brains. They take inputs
and produce outputs. The use of Artificial Neural Networks is mainly to
approximate complex non-linear functions using an automatic process known
as Backpropagation, and they have had recent success with novel architectures
in Computer Vision, Natural Language Processing, and other sub-fields in
Artificial Intelligence.
Mathematically, a single neuron $j$ is a function $f : \mathbb{R}^N \rightarrow \mathbb{R}$, defined as:

$$f(\mathbf{x}, \mathbf{w}_j) := \sigma\Big(w_{0j} + \sum_{i=1}^{N} w_{ij} x_i\Big)$$

where $\sigma : \mathbb{R} \rightarrow \mathbb{R}$ is called the activation function, and $\mathbf{w}_j \in \mathbb{R}^{N+1}$ is a vector
of trainable weights including the bias term $w_{0j}$. With the convention $x_0 = 1$, we
finally have, for a single neuron:

$$f(\mathbf{x}, \mathbf{w}_j) := \sigma(\mathbf{w}_j^\top \mathbf{x})$$
An Artificial Neural Network (ANN) is a network of the previously defined
neurons, with the inputs of some neurons being the outputs of others. The
traditional ANN architecture is a Feed-Forward Network (no cycles in the
connections between the nodes), composed of the input layer where $\mathbf{x} \in \mathbb{R}^N$,
the hidden layer(s), and the output layer where $\mathbf{y} \in \mathbb{R}^M$. A hidden
layer $l$ is defined as a function $F^{(l)} : \mathbb{R}^{N_l} \rightarrow \mathbb{R}^{M_l}$, with:

$$F^{(l)}(\mathbf{x}; \mathbf{w}_1, \cdots, \mathbf{w}_{M_l}) = \begin{pmatrix} f^{(l)}(\mathbf{x}; \mathbf{w}_1) \\ f^{(l)}(\mathbf{x}; \mathbf{w}_2) \\ \vdots \\ f^{(l)}(\mathbf{x}; \mathbf{w}_{M_l}) \end{pmatrix}$$

Simply put, for each layer we have:

$$F^{(l)}(\mathbf{x}; \mathbf{W}) = \sigma(\mathbf{W}\mathbf{x}), \qquad \mathbf{W} = \begin{pmatrix} \mathbf{w}_1^\top \\ \mathbf{w}_2^\top \\ \vdots \\ \mathbf{w}_{M_l}^\top \end{pmatrix}$$

Finally, for an ANN with $L$ layers and given an input $\mathbf{x}$:

$$\mathbf{y} = \sigma^{(L)}\Big(\mathbf{W}^{(L)} \, \sigma^{(L-1)}\big(\cdots \sigma^{(1)}(\mathbf{W}^{(1)} \mathbf{x}) \cdots\big)\Big) \qquad (2.1)$$
2.3.2 Backpropagation
In a nutshell, the learning procedure in ANNs is equivalent to finding weights
that minimize a given loss function ξ. The learning process is iterative and
involves the following steps:
1. Weight Initialization: weights are initialized randomly if training from
scratch, or set to given initial values otherwise.
2. Forward Pass: Using inputs from the training set, we compute the forward
pass to get the output of the network; these operations are simply the ones
described in Eq. 2.1, and they can be accelerated using GPUs for a more
scalable system. The output is then compared with the ground truth using a
loss function ξ such as Binary Cross-entropy for classification or Mean
Squared Error for regression.
3. Backward Pass: Using the backpropagation algorithm and the loss value,
weights are corrected beginning from the last layers and moving toward the
first ones in a backward fashion.
We perform the forward pass and then the backward pass iteratively until
the loss seems to converge to a certain value.
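As a concrete, hypothetical illustration of this loop, the sketch below trains a tiny feed-forward network in PyTorch following the steps above; the layer sizes, optimizer, learning rate, and data are placeholders, not values used in the thesis.

import torch
import torch.nn as nn

# Placeholder data: 64 samples with 10 features and binary labels.
x = torch.randn(64, 10)
y = torch.randint(0, 2, (64,)).float()

# Step 1: weight initialization happens inside the layer constructors.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()                       # binary cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()
    output = model(x).squeeze(1)                       # step 2: forward pass (Eq. 2.1)
    loss = loss_fn(output, y)                          # compare with the ground truth
    loss.backward()                                    # step 3: backward pass (backpropagation)
    optimizer.step()                                   # weight update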
Note about the activation function:
For networks that have more than one hidden layer, the activation function
has to be non-linear, otherwise, a network of multiple layers would be
equivalent to a network with only one, and will simply represent a linear
function. For deep neural networks (networks with several hidden layers),
the choice of the activation function is even more sensitive, and the use of the
sigmoid function can sometimes lead to a phenomenon known as the vanishing
gradient problem.
2.3.3 Convolutional Neural Networks
Although Feed-Forward Networks have had success with inputs with low
dimensionality in applications such as energy consumption prediction, they
were not very appropriate for handling images in a scalable way, and that is
because of two main reasons:
• Images make for high dimensional inputs. An image as small as 64x64
with 3 color channels is a 12,288-dimensional input. So even a shallow
neural network will have a very high amount of weights to train.
• Processing an image as a flat sequence of pixels takes into account a lot of
noisy input and makes the model prone to severe overfitting. As an
example, the exact values of the pixels in an image of the digit '2' are
not as important as the overall pattern of the digit. We might want to
extract the most important features of the image before classification, or
before whatever output we want from our neural network.
Convolutional Layers [7], presented in 1998 by Yann LeCun et al., show a
lot of potential for image-related tasks. Instead of neurons receiving the pixels
as a single sequence, convolutional layers take into account the 2D nature of
the input. Each neuron in a convolutional layer receives a small patch of the
input, applies the convolution operation (Fig. 2.2a) between the inputs and a
weight matrix known as the kernel, and ultimately produces the output through
an activation function.
Max-pooling is the second building block in Convolutional Neural
Networks. The idea of pooling is to reduce the dimensionality from one layer
to the next. It divides the input into a set of equally sized, non-overlapping
regions and, for each region, keeps the maximum value while dropping the
other values. Similarly to the convolution operation, it moves along the input
with strides in both directions.
(a) Illustration of the convolution operation using an input I and a kernel K. Using the
kernel that ’moves’ in both directions, the convoluted values are computed using the
neighbors of the pixel. The kernel is trainable to extract features at a given level in the CNN
and is coupled with a Max-Pooling layer.
(b) Example of a CNN Architecture for Handwritten Digits Classification using the MNIST
Dataset [4]. The CNN is composed of multiple Convolutional and Max-pooling Layers. At
the end of the network, a Fully Connected Neural Network is used to do the classification
(10 classes) using the flattened feature maps.
Figure 2.2 – Illustration of the Convolutional Neural Networks, a more suited
architecture for Computer Vision related tasks than traditional Feed-Forward
Networks. A CNN can be a deep network without having an unreasonable amount
of weights to train.
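In the spirit of the MNIST architecture of Fig. 2.2b, the following PyTorch sketch stacks convolutional and max-pooling layers before a fully connected classifier; the layer sizes here are illustrative assumptions, not the exact architecture used in this study.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        # Fully connected classifier applied to the flattened feature maps.
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = SmallCNN()(torch.randn(8, 1, 28, 28))  # batch of 8 grayscale 28x28 images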
Commonly used CNN Architectures:
Although the idea of stacking convolutional and max-pooling layers remains
the same in almost every CNN, the number of kernels, their size, the depth of
the network, and other parameters are unique to a given CNN architecture.
Some commonly used CNN architectures are:
• LeNet-5: one of the earliest CNN architectures, created by LeCun et al. [7].
It supports gray-scale images exclusively and has only two convolutional
layers.
• VGG-16 [8]: a deep CNN with 13 convolutional layers and three fully
connected layers, with over 138M parameters in total. It uses the ReLU
activation function, small 3x3 convolution kernels and 2x2 max pooling. It
has a deeper variant called VGG-19.
• ResNet-50 [9]: also a deep CNN, which had recent success in the
ILSVRC 2015 challenge. It is made of 50 layers and takes 224x224 RGB
images as inputs. It has deeper variants with 101 and 152 layers.
The full architectures of VGG-16 and ResNet-50 are available in Appendix C.
2.4 Sequence Modeling
We now take a look at another type of Artificial Neural Network, specifically
designed for sequence modeling. Given a sequence of inputs in which the order
matters greatly, we want an appropriate architecture that can perform tasks
such as sequence classification or sequence-to-sequence translation.
Feed-Forward Networks are not very appropriate for handling sequences,
since they do not capture the connection between the elements of the
sequence. Recurrent Neural Networks (RNNs) are distinguished by having
a "memory" and can process sequences of variable length. RNNs use the
previously defined backpropagation learning rule through the time variable
t. However, even RNNs have trouble processing long sequences because
of the vanishing/exploding gradient problem. Three novel architectures for
sequential data are used in this project:
2.4.1 Long Short Term Memory (LSTM)
LSTM is a variant of Recurrent Neural Networks that introduces a novel forget
gate as well as an input gate and output gate. Inspired by the way humans tend
to ignore certain elements of a sentence to understand it, the forget gate allows
for an LSTM-based network to deal with very long sequences and capture the
relationship between elements that are far away from each other. The LSTM
architecture has had a lot of success in machine translation tasks [10].
The LSTM cell contains the following components:
• Forget gate "f" (neural network with sigmoid)
• Candidate layer "C̄" (neural network with tanh)
• Input gate "I" (neural network with sigmoid)
• Output gate "O" (neural network with sigmoid)
• Hidden state "H" (a vector)
• Memory state "C" (a vector)
Figure 2.3 – Illustration of an LSTM cell. The output of this cell is redirected as an
input with an increased step time, which makes it a recurrent neural network.
The following equations describe the way an LSTM cell processes input
data sequentially. At a given time step $t$, $X_t$ is the data input, $H_{t-1}$ and $C_{t-1}$
are the values of the hidden and memory states at $t-1$, and $f_t$, $\bar{C}_t$, $I_t$ and $O_t$ are
intermediate values used to compute $C_t$ and $H_t$ at the current time step $t$:

$$f_t = \sigma(X_t U_f + H_{t-1} W_f)$$
$$\bar{C}_t = \tanh(X_t U_c + H_{t-1} W_c)$$
$$I_t = \sigma(X_t U_i + H_{t-1} W_i)$$
$$O_t = \sigma(X_t U_o + H_{t-1} W_o)$$
$$C_t = f_t \odot C_{t-1} + I_t \odot \bar{C}_t$$
$$H_t = O_t \odot \tanh(C_t)$$

where $\odot$ denotes element-wise multiplication.
Note that although the computations involved at each step are fairly simple,
there is no way to parallelize computation when training an LSTM network,
which can cause a bottleneck when dealing with massive amounts of data.
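A direct transcription of these gate equations into PyTorch could look like the sketch below; it implements a single cell step, omits bias terms as the equations above do, and assumes weight matrices are supplied in dictionaries (in practice one would simply use torch.nn.LSTM).

import torch

def lstm_step(x_t, h_prev, c_prev, U, W):
    """One LSTM time step following the equations above.
    U and W are dicts of weight matrices for the f, c, i, o gates
    (assumed shapes: U[k]: (input_dim, hidden_dim), W[k]: (hidden_dim, hidden_dim))."""
    f_t = torch.sigmoid(x_t @ U["f"] + h_prev @ W["f"])    # forget gate
    c_bar = torch.tanh(x_t @ U["c"] + h_prev @ W["c"])     # candidate state
    i_t = torch.sigmoid(x_t @ U["i"] + h_prev @ W["i"])    # input gate
    o_t = torch.sigmoid(x_t @ U["o"] + h_prev @ W["o"])    # output gate
    c_t = f_t * c_prev + i_t * c_bar                        # memory state update
    h_t = o_t * torch.tanh(c_t)                             # hidden state update
    return h_t, c_t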
2.4.2 Transformers
Transformers are a novel architecture for sequence analysis, originally
presented by Vaswani et al. [1], built entirely on the attention mechanism.
One novelty of this architecture is that it takes all elements of the sequence
at once, thus allowing for parallelization (contrary to LSTM networks). It is
able to process a sequence all at once thanks to Positional Encoding, where
the position of each element is encoded as part of the input vector itself.
A Transformer block, illustrated in Fig. 2.4, is made of an Input Embedding
block that converts the input (usually a sentence) into vectors for neural
network processing, followed by Positional Encoding that encodes the positions
of the items in the sequence. To compute the Positional Encoding, the authors
used the sine and cosine functions:

$$PE_{(pos,\,2i)} = \sin\big(pos / 10000^{2i/d_{model}}\big)$$
$$PE_{(pos,\,2i+1)} = \cos\big(pos / 10000^{2i/d_{model}}\big)$$

where $i$ is the dimension, $pos$ is the position in the sequence and $d_{model}$ is the
embedding output dimension, meaning that each dimension of the positional
encoding corresponds to a sinusoid.
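The sinusoidal encoding can be precomputed as a (max_len, d_model) matrix; the following sketch follows the two formulas directly and assumes an even d_model (max_len is a hypothetical parameter, not a value from the thesis).

import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Returns a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions 2i
    angle = pos / (10000.0 ** (i / d_model))                        # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)   # PE(pos, 2i+1)
    return pe

# The encoding is added to the input embeddings before the encoder:
# inputs = token_embeddings + positional_encoding(seq_len, d_model)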
(a) The Transformer - model architecture. Positional Encoding is used to encode the position
information, which is crucial to process the input as a sequence. The Transformer is made of
an Encoder and Decoder. Multi-Head Attention module is used to compute self attention
scores in both of them.
(b) Architecture of seq2seq model used for Neural Machine Translation. The variable-length
sentence in French is processed through the encoder and decoder to output a variable-length
sentence in German.
Figure 2.4 – The Transformer and seq2seq models are widely used Sequence
Modeling architectures. They take as input and produce as output a sequence of variable length.
Following that, we have an Encoder and a Decoder block. The encoder
is made of six identical layers; each layer has a) a multi-head attention
mechanism and b) a fully connected feed-forward network. The Decoder block
is similar to the Encoder block except that it has an additional multi-head
attention sub-layer in each of its six layers. On the output side, Output
Embedding and Positional Encoding are applied to the (shifted) target sequence,
and a final linear layer followed by a softmax produces the output probabilities.
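PyTorch exposes this encoder-decoder stack directly; a minimal usage sketch with the dimensions of the original paper (d_model = 512, 8 attention heads, 6 encoder and 6 decoder layers) is shown below, with random tensors standing in for embedded, positionally encoded sequences.

import torch
import torch.nn as nn

# Six encoder layers and six decoder layers, as in the original Transformer.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(20, 2, 512)   # (source length, batch, d_model)
tgt = torch.randn(15, 2, 512)   # (target length, batch, d_model)
out = model(src, tgt)           # (15, 2, 512); fed to a linear + softmax layer in practice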
2.5 Optical Character Recognition
Optical Character Recognition (OCR) is the process of extracting searchable
and analyzable text from images containing text (cf. Fig.2.5).
The OCR pipeline involves two main steps:
1. Object Detection phase: locating bounding boxes in the image that
contain the text (characters, words, lines) we want to extract. This step
is application-dependent:
• In the case of an unstructured document, we might want to extract
as much text as possible in an unordered fashion and therefore
simply detect all the words or lines in the input image.
• In the case of a structured document with key-value pairs, the
object detection process involves extracting segments for both the
keys and values and establishing a mapping between the keys and
values.
2. OCR phase: The bounding boxes from the object detection phase are
fed to the OCR engine for text recognition. The OCR Engine takes as an
input an image containing nothing but text (characters, words, or lines)
and extracts that text.
N.B. Often in literature, the second step and the entire pipeline are both
referred to as OCR. The distinction, however, is very important in this project.
Figure 2.5 – Simple illustration of OCR Pipeline. Object Detection is performed on
the initial document to obtain crops containing nothing but text (characters, words,
lines). These crops are later fed to the actual OCR engine for Text Recognition.
2.6 Related Works
In this section, we present relevant works that were foundational to this
study, either by introducing novel neural network architectures specifically
designed for Optical Character Recognition and Sequence Modeling, or by
introducing data synthesis techniques and addressing related Domain Adaptation issues.
2.6.1 Domain Adaptation
Varol et al. [11] used data synthesis to estimate human pose, shape, and
motion, and introduced the Synthetic humans foR REAL tasks (SURREAL)
dataset. The authors were able to generate more than six
million frames along with ground truth poses, depth maps, and segmentation
masks. When trained using Convolutional Neural Networks, the models were
shown to demonstrate high accuracy in human depth estimation and human
part segmentation in real RGB images.
Tremblay, Prakash et al. [12] introduce the concept of domain
randomization to tackle domain adaptation issues. Domain randomization
is a technique in which the parameters of the data synthesizer (or simulator)
such as lighting, pose, and object textures are randomized to force the neural
network to learn the features of the object of interest, and avoid overfitting
and poorly generalizable results. Moreover, the study showed that the use
of realistic synthetic data, and subsequent fine-tuning on real data, yielded
more accurate results than using real data alone. The results of this study
were evaluated on bounding box detection of cars on the KITTI dataset [13].
This study inspired the data synthesis methodology used to generate realistic
license plate images for neural network training.
Hinterstoisser, Lepetit, Wohlhart and Konolige [14] from Google
attempted to train object detectors on synthetic images using Transfer
Learning. The idea is, given a Convolutional Neural Network, to freeze the
early feature-extraction layers, which are generic and pre-trained on real
images, and to train the later layers with synthetic data generated with
OpenGL. The study yielded excellent results for the object recognition
models Faster-RCNN, R-FCN, Mask-RCNN and image feature extractors
InceptionResnet and Resnet. The idea of freezing certain layers of a deep
Convolutional Neural Network was mainly inspired by this paper.
Rajpura et al. [15] showed concrete results when using synthetic data
to detect packaged food products clustered in refrigerator scenes. The
authors used 3D rendering to produce the synthetic images and obtained
a mean average precision (mAP) of 24 on a dataset of 4,000 synthetic images;
this metric improved further when using both synthetic and real data.
In this study, Generative Adversarial Networks were used to generate
handwriting. This was inspired by the paper by Peng and Saenko [16],
who used a more holistic approach to tackle domain adaptation issues
when moving from the synthetic data domain to the real data domain
for object recognition. The authors introduced the Deep Generative
Correlation Alignment Network (DGCAN), which generates synthetic images
that remain statistically similar to real data. The model was evaluated
on the PASCAL VOC 2007 [17] benchmark and the Office dataset [18].
Gopalan, Li et al. [19] introduce one of the first unsupervised approaches
to domain adaptation. Inspired by incremental learning, the authors build
intermediate representations of data between the two domains by viewing the
generative subspaces created from these domains as points on the Grassmann
manifold, and by sampling points along the geodesic between them to obtain
subspaces that give a meaningful description of the underlying domain shift.
The authors then obtain the projections of labeled source-domain data onto
these subspaces, from which a discriminative classifier is learned to classify
projected data from the target domain.
Xu and Vazquez [20] focus on Domain Adaptation for the deformable
part-based model (DPM) for pedestrian detection by introducing the adaptive
structural SVM (A-SSVM) and the content-aware A-SSVM architectures.
The novelty of these models is that neither needs to revisit source-domain
training data to perform the adaptation. The study showed a 15-point
improvement in accuracy when using the previously described domain
adaptation.
Sun, Feng and Saenko [21] propose another unsupervised method of
domain adaptation called CORrelation ALignment (CORAL). It works by
aligning the second-order statistics of source and target distributions, without
the need for target labels. This method has the advantage of being
relatively simple to implement, and it performs surprisingly well on standard
datasets such as the Office dataset [18].
Chen, Li, Sakaridis et al. studied domain adaptation for object detection
in autonomous driving data. The authors strategically divide the domain
adaptation problem into two distinct categories: image-level shift, such as
illumination, and instance-level shift, such as object size. In both
cases, the adaptation relies on H-divergence theory, as the domain classifier
is trained adversarially, similarly to a Generative Adversarial Network. The
results were evaluated on multiple datasets including KITTI [13].
Finally, Yosinski, Clune et al. [22] dived into more detail concerning
domain adaptation using deep neural networks. The main takeaways from this
study are that Transfer Learning is affected by the specialization of
higher-layer neurons to the source domain at the expense of the target domain,
and by optimization difficulties when splitting networks between co-adapted
neurons. These two issues were demonstrated on ImageNet. Another surprising
result was that transferring features from distant tasks can be better than
using random features. Additionally, initializing network weights with
transferred features can boost the generalization power of the model significantly.
2.6.2 Optical Character Recognition (OCR)
Concerning Optical Character Recognition, the literature shows a wide
variety of solutions in both Object Detection and Image Classification to
tackle these issues.
Hassan et al. [23] introduced a handwriting OCR system (for Bangla
numerals) using Local Binary Patterns. The authors first use Gaussian filtering
for preprocessing, as well as the KSC algorithm for slant correction (especially
problematic in handwriting). To recognize the characters, basic LBP, uniform
LBP and simplified LBP were benchmarked on the CMATERdb databases,
containing over 6000 Bangla characters. The basic LBP, uniform LBP and
simplified LBP achieve maximal accuracies of 96.7%, 96.6% and 96.5%,
respectively.
Bedruz, Sybingco et al. [24] introduced a novel method for optical
character recognition for license plates, using fuzzy logic and the
scale-invariant feature transform. The Scale-Invariant Feature Transform (SIFT)
is an algorithm that detects and extracts local image features that are
invariant to scaling and rotation. SIFT is used to compute features, which are
then used in the fuzzy logic algorithm. One fuzzy logic system is created per
character, which makes this system quite heavy for a large character set.
W. Khan et al. [25] used point feature matching to recognize Urdu characters.
The algorithm simply pre-processes the input (noise reduction and normalization)
and compares it with a set of ground truth template characters.
Similar to our study, Alfi, Barrault and Schwenk [25] proposed a novel
statistical machine translation system to correct the output of an Optical
Character Recognition system. The authors work on both the word and
character levels to support a translation from OCR system output to correct
French text. The results were evaluated on data provided by the National
Library of France, and show reductions of 60% and 54% in the Character Error
Rate and the Word Error Rate, respectively.
Al Azawi and Breuel [26] introduced a novel technique for post-OCR
correction as well. This method works by means of weighted finite-state
transducers (WFST) with confusion rules. The model translates the OCR
confusions which appear in the recognition outputs using the Levenshtein
distance. The edit operations are extracted in the form of rules with respect
to the context of the incorrect string, to build an error model using weighted
finite-state transducers. This error model makes the language model able to
correct words by using context-dependent confusion rules. The approach has
the advantage of being language-independent. The results were evaluated on
the UWIII dataset [27], where the new model outperforms previous approaches.
This paper shows the great importance of post-OCR processing, which can also
make use of Machine Learning algorithms.
Chapter 3
Method
In this chapter, we present the methods and algorithms used in this project. We
start by presenting the two types of data: printed data and handwritten data,
and we make use of both synthetic and real-world data in both categories.
Next, we present a novel method for document preprocessing that corrects
the orientation of a document using Canny’s Edge Detection. Furthermore,
we present the neural network architectures used as well as the experimental
design. Finally, we discuss the evaluation metrics specific to the Optical
Character Recognition context.
3.1 Data
In order to perform neural network training, we need training data, i.e. images
containing text with the corresponding transcription, which will henceforth be
referred to as the ground truth. The source, nature and variability of data are
important to achieve concrete results, e.g. previous experiments in OCR for
handwritten digits show that a slight domain change from the MNIST dataset
to SVHN dataset decreases the accuracy by up to 20%. We use both real-world
and synthetic data for training and rely solely on real data for model validation.
3.1.1 Real-world data
We will be working with two types of real-world documents:
1. Technical documents provided by the French company Cartier: these
documents contain printed text exclusively. The document quality is
usually quite unsatisfactory and we would like to improve upon existing
solutions that use object detection and image classification. At the time
of development of this project, there is no segmentation strategy for
these documents and we would simply like to extract as much text as
possible.
2. Accident statements provided by French insurance companies Thélem
and Verlingue: these documents contain both printed and handwritten
text as well as specific patterns to be extracted. We would also like the
data to be extracted in a structured manner, preferably in the form of a
dictionary mapping each key (e.g. "NOM") to a handwritten value (e.g.
"Allianz").
3.1.2 Synthetic data
Synthetic data is used in training to enrich real-world data, introduce
realistic variance and eliminate unwanted bias. Generating synthetic data is
application-dependent, as there is no generic method to it, and it consists of
modeling the statistical variance of real data and using the corresponding
generative algorithm.
To generate the synthetic data used in this project, the following steps were
followed for each image:
1. Select a word or line, i.e. the ground truth, randomly from an
input dictionary. For example, this dictionary can contain the most
common words in the English language for a general-purpose dataset, or
artificially generated license plate numbers for a more specific purpose.
2. Generate an image from the ground truth. For printed text, this consists
of using a random but commonly used font. For handwritten text, it is
slightly trickier to imitate handwriting: we make use of Generative
Adversarial Networks (GANs) to generate a handwriting baseline.
3. Apply geometric transformations on both a word level and a character
level. These transformations include skewing, rotation, and translation.
4. Apply application-specific transformations such as ink smudges,
bounding box edges, and dotted lines for added realism.
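Steps 1 to 3 of this procedure could be sketched with Pillow as follows; the word list, font files, canvas size, and parameter ranges below are placeholders, and the GAN-based handwriting generation is not shown.

import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

WORDS = ["assurance", "conducteur", "AB-123-CD"]   # placeholder input dictionary (step 1)
FONTS = ["DejaVuSans.ttf", "DejaVuSerif.ttf"]       # placeholder font files

def synth_sample():
    text = random.choice(WORDS)                     # step 1: pick the ground truth
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(24, 36))
    img = Image.new("L", (400, 64), color=255)      # white grayscale canvas
    ImageDraw.Draw(img).text((10, 10), text, font=font, fill=0)   # step 2: render the text
    # Step 3: geometric transformations and mild noise.
    img = img.rotate(random.uniform(-3, 3), expand=True, fillcolor=255)
    img = img.filter(ImageFilter.GaussianBlur(random.uniform(0, 1)))
    return img, text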
(a) Sample from the technical documents provided by Cartier.
(b) Sample from an accident statement provided by Thélem.
Figure 3.1 – Samples from the real-world data (full document examples are available
in Appendix A and B). The technical documents contain printed text exclusively,
which usually suffers from poor quality, pixelization and character ambiguity. The
accident statement documents contain both printed and handwritten text, and suffer
from poor quality, slanting and ambiguity.
3.1.3 Datasets
In this subsection, we present the datasets used in this study:
• datasetC: contains 1043 annotated crops extracted from Cartier’s
technical documents dataset (see section 3.1 for more details). These
crops contain either a word or a line, and have a corresponding transcript
that was annotated manually.
• datasetE: contains 10,000 computer-generated printed words produced with a
synthetic data generator for text recognition as previously described.
The generated images had random fonts, blurring and background noise,
and were made quite challenging to test the limits of our model.
• datasetF: contains 3000 computer-generated handwritten words using a
GAN model.
• datasetLPR: contains 528 real-world license plate images, obtained
through object detection on Thelem’s accident statements documents.
These images were annotated manually.
• datasetSynth and datasetLPSynth: provided by the authors of [8]. This is
a dataset of two million generated images of handwriting covering 10
types of fields in an accident statement; it includes more than 150,000
images in the license plate category. Special care was taken to make
these images as realistic as possible: different handwriting styles,
different types of noise added depending on the field type, as well as
additional noise in the corners due to the imperfection of the object
detection performed before the OCR step.
(a) Samples from datasetC.
Samples show the usual noise in printed text: ink smudging, uneven blurriness and
font-dependent character ambiguity.
(b) Samples from datasetE.
Samples were generated using random fonts, backgrounds and Gaussian blur.
(c) Samples from datasetF.
Samples were generated using a GAN network that imitates human handwriting and feature
slanting and cursive writing.
(d) Samples from datasetLPR.
These real samples suffer from pixelization, noisy backgrounds and occasionally unreadable
characters.
(e) Samples from datasetSynth and datasetLPSynth.
Samples were generated to imitate datasetLPR as closely as possible, with domain-specific
noise such as dotted lines and bounding boxes.
Figure 3.2 – Samples from the six datasets used in this study.
3.2 Preprocessing for Document Analysis
In any given Data Science pipeline, data preparation is the first step. Given
the provided data, the scanned accident statements documents are not always
normalized. One of the prerequisites of having good results from using CNN-
based models is having a correct document orientation, and a slight skewing
can decrease the performance drastically. We would like to perform, as part
of the pre-processing step, an orientation correction of the documents.
The algorithm we use is a part of the classical Computer Vision family of
algorithms. Since we are dealing with a document, the idea is to detect lines
in the document to deduce the amount of orientation correction we need to
apply. Given a slightly skewed document, we:
1. Perform Canny Edge Detection to detect edges in the image. The Canny
Edge Detector works by applying a Gaussian filter to remove noise, finding
the intensity gradients in the image, applying non-maximum suppression based
on the gradient magnitude, applying double thresholding to determine strong
and weak edges, and finally removing 'weak' edges that are not connected to
strong ones. It is important to note that at the end of this process, we have
a binarized image that emphasizes the potential edges of the image.
2. Line Detection is done using the Hough Transform [28]. Roughly
speaking, the Hough Transform can detect lines or circles or shapes
with a very low number of parameters, by creating an accumulator space
(two-dimensional in the case of line detection) and then computing local
maxima in this space, which refer to the parameters of the detected lines.
Once the line detection is successful, finding the amount of orientation to
correct can simply be obtained by averaging the angle that each line has with
the x-axis. An illustration of the orientation correction process is available in
the Results section.
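A compact OpenCV sketch of this procedure is given below; the Canny thresholds, Hough vote threshold, and the near-horizontal filter are assumptions for illustration, not the exact values used in the project.

import cv2
import numpy as np

def estimate_skew_angle(gray: np.ndarray, max_lines: int = 10) -> float:
    """Estimates the skew angle (in degrees) of a document from its near-horizontal lines."""
    edges = cv2.Canny(gray, 50, 150)                     # step 1: binarized edge map
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)   # step 2: detected (rho, theta) pairs
    if lines is None:
        return 0.0
    angles = []
    for rho, theta in lines[:max_lines, 0]:              # keep at most the first max_lines lines
        deviation = np.degrees(theta) - 90.0             # theta = 90 deg corresponds to a horizontal line
        if abs(deviation) < 10:                          # keep near-horizontal lines only
            angles.append(deviation)
    return float(np.mean(angles)) if angles else 0.0

# gray = cv2.imread("statement.png", cv2.IMREAD_GRAYSCALE)
# The document is then rotated by the negative of the estimated angle.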
3.3 Architectures
3.3.1 Shallow CNN+LSTM Network
As a baseline, we use the following shallow neural net, based on the
design of Tesseract OCR v4 [29]. This architecture makes use of only one
Convolutional Layer for feature extraction, followed by a Max Pooling layer to
reduce the output, and finally a series of LSTM layers to produce a sequence
of characters from the input image (also considered as a sequence). We note,
however, that even though this architecture is shallow, it takes the image as a
whole sequence, instead of breaking it up into characters as previously done
in traditional OCR systems. This presents a major improvement, as the model
can detect slanted text and achieves higher accuracy by relying on nearby
characters.
Layer          Characteristics
Input          36 x L
Conv2D         3 x 3 window, 16 outputs
Max Pooling    3 x 3
LSTM           dimension-summarizing LSTM, summarizing the y-dimension with 48 outputs
LSTM           forward only in x, 96 outputs
LSTM           reverse only in x, 96 outputs
LSTM           forward only, 192 outputs
Output layer   produces a 1-d sequence, trained with CTC
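A simplified PyTorch approximation of this baseline is sketched below. It follows the same conv → pool → stacked LSTM → CTC structure, but the Tesseract-style dimension-summarizing LSTM is replaced by a plain reshape of the feature maps, so it should be read as an illustration rather than the exact model used in this project.

import torch
import torch.nn as nn

class ShallowCRNN(nn.Module):
    """Conv + max-pooling front-end, stacked LSTMs, and a CTC-trained output layer."""
    def __init__(self, num_classes: int, height: int = 36):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)    # 3x3 window, 16 outputs
        self.pool = nn.MaxPool2d(3)                               # 36 x L -> 12 x L/3
        feat = 16 * (height // 3)                                  # features per time step
        self.rnn1 = nn.LSTM(feat, 96, bidirectional=True, batch_first=True)  # forward + reverse in x
        self.rnn2 = nn.LSTM(2 * 96, 192, batch_first=True)        # forward only, 192 outputs
        self.fc = nn.Linear(192, num_classes)                     # per-time-step character logits

    def forward(self, x):                          # x: (batch, 1, 36, L)
        f = self.pool(torch.relu(self.conv(x)))    # (batch, 16, 12, L // 3)
        f = f.permute(0, 3, 1, 2).flatten(2)       # (batch, L // 3, 192): width becomes the time axis
        f, _ = self.rnn1(f)
        f, _ = self.rnn2(f)
        return self.fc(f).log_softmax(-1)          # 1-d sequence of log-probabilities, for nn.CTCLoss

out = ShallowCRNN(num_classes=100)(torch.randn(2, 1, 36, 120))    # (2, 40, 100)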
3.3.2 Deep CNN + Transformer
In order to improve the previous architecture, we make the following major changes:
1. Replace the shallow CNN with a deeper one. Deeper models were
shown to extract features on different scales of the image. We use two
deep architectures that have been known to succeed in computer vision
tasks: ResNet50 and VGG19. The following is a description of the
architecture of a VGG-19 network:
Layer             Characteristics
Conv2D x 2        3x3, 64
MaxPool
Conv2D x 2        3x3, 128
MaxPool
Conv2D x 4        3x3, 256
MaxPool
Conv2D x 4        3x3, 512
MaxPool
Conv2D x 4        3x3, 512
MaxPool
Fully Connected   4096
Fully Connected   4096
Fully Connected   1000
SoftMax
2. Use the Transformer architecture instead of LSTM. The use of
Transformers has been shown to yield more accurate results and, more
importantly, to result in computationally efficient models that support GPU
parallelization. This helps solve the training bottleneck when dealing
with deep networks and large datasets.
3. Use a seq2seq architecture as a second alternative to LSTM, with a
supposedly even lower training time.
3.4 Morphological Operations for Domain Adaptation
Given the lack of real-world data in this project, synthetic data is used in
training, and the model is evaluated purely on real-world data for an accurate
assessment. Although the synthetic data is fairly realistic (see section 3.1
for more details), the domain change between the synthetic and real data can
explain the difference in Character Error Rate (CER) between the training set
and the test set.
To make the synthetic training data look more like the real-world data,
morphological operations are applied to the training data images:
• Dilation: is an operation used to add pixels in an image, to fill holes.
• Erosion: is the opposite operation of dilation, used to increase the size
of holes.
• Opening: is the dilation of the erosion of an image. Opening smooths
the contours of objects, breaks narrow joints and eliminates thin
protrusions [30].
• Closing: is the erosion of a dilation of an image. Closing produces the
smoothing of sections of contours but it fuses narrow breaks and fills
gaps in the contour.
We proceed to apply the previously described morphological operations, as
well as commonly used data augmentation techniques such as random rotation,
scaling and noise, to the images before training; a sketch is given below. This
should prevent the model from overfitting the training data and producing
unreliable results.
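With OpenCV, these four operations plus a light augmentation pass can be applied in a few lines; the kernel size, the choice of a single random operation per image, and the rotation range below are assumptions rather than the project's exact settings.

import cv2
import numpy as np

kernel = np.ones((2, 2), np.uint8)                 # small structuring element (assumed size)

def morph_augment(img: np.ndarray) -> np.ndarray:
    """Randomly applies one morphological operation, then a small rotation, to a grayscale crop."""
    op = np.random.choice(["dilate", "erode", "open", "close"])
    if op == "dilate":
        img = cv2.dilate(img, kernel)
    elif op == "erode":
        img = cv2.erode(img, kernel)
    elif op == "open":
        img = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
    else:
        img = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)
    # Common augmentation: small random rotation around the image center.
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-2, 2), 1.0)
    return cv2.warpAffine(img, m, (w, h), borderValue=255)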
3.5 Evaluation Metrics for Optical Character
Recognition
To evaluate the output quality of the OCR model, we want to compare the
OCR output text with the ground truth text. Instead of using the accuracy,
which would be equal to 1 if the two texts match exactly and 0 if not, we
introduce two error rates that provide a better assessment of the OCR model.
We would like to measure the number of misspelled, added or missing
characters in the OCR output text. For that purpose, we define:
• Insertion error: added character in the OCR output text.
• Deletion error: missing character in the OCR output text.
• Substitution error: misspelled character in the OCR output text.
We might notice that this doesn’t directly solve the problem. For example,
when comparing "STEAM" and "STEAL", we can substitute one letter (1
substitution error) or delete a letter and add another (1 deletion error and 1
insertion error).
To that end, we define the Levenshtein distance between two strings as the
minimum number of edits (insertion, deletion, or substitution) to go from one
string to another. All the edits are considered equally important in this context.
However, in different contexts such as DNA sequence analysis, the substitution
error is usually more important as even more complex models are used to
compare two sequences.
Finally, we define the two metrics that are mostly used in this project:
1. Character Error Rate (CER): the Levenshtein distance divided by the
number of characters in the ground truth text. We want a CER as low as
possible (0 being the score for a perfect match). Expressed as a percentage,
a CER of 20% roughly means that we incorrectly predict 1 character out of 5.
2. Word Error Rate (WER): defined very similarly to the aforementioned
CER, except that it operates on the word level instead of the character
level.
Example: Ground Truth = "Machine Learning is fun", OCR Output =
"Mackine Leanning is fun", we have CER = 8.69% and WER = 50%.
Conclusion: Both CER and WER represent the same error for a different
granularity level, and both should be reported for a given OCR model. If
we are trying to extract text containing a license plate, WER is perhaps more
relevant than CER. However, if we are extracting free text from a paragraph,
CER might be more relevant since we can use post-OCR spell-checking.
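Both metrics reduce to a Levenshtein distance computed at a different granularity; the following minimal sketch reproduces the example above (it is an illustrative implementation, not the evaluation code used in the project).

def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (cost 0 if the symbols match)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cer(ref, hyp):
    return levenshtein(ref, hyp) / len(ref)          # character level

def wer(ref, hyp):
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())   # word level

gt, ocr = "Machine Learning is fun", "Mackine Leanning is fun"
print(f"CER = {cer(gt, ocr):.2%}, WER = {wer(gt, ocr):.2%}")   # prints CER ≈ 8.7%, WER = 50.00%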
Chapter 4
Results
The following chapter presents the experimental design, results, and
discussion. It attempts to answer the research questions defined in Chapter
1 by discussing preprocessing, OCR accuracy, the use of synthetic data,
the ability of an OCR model to generalize well (section 4.1), and finally
method-related metrics like training and inference time (section 4.2).
The experiments were done using PyTorch under Python, in a Debian 10
environment, on a system equipped with a Tesla K80 GPU for neural network
training acceleration.
4.1 Predictive Performance
4.1.1 Document Preprocessing
Introduction:
Before we can use raw real-world images in neural network training,
they must be preprocessed to achieve maximal performance. Preprocessing
includes manually checking for invalid or very low quality scans, unifying
the image format, and most importantly orientation correction. Orientation
correction is done using image processing algorithms, since a slight skew is
present in most of the images we use as input. The goal is simply to correct
the slight orientation error (around 5°) that comes from scanning documents.
Results:
We perform orientation correction using Canny’s Edge Detector and then
Line Detection using the Hough Transform (see Fig. 4.1). This process
has one important parameter, which is the maximum number of lines to
detect in the Hough Accumulator space. In our case with accident statement
documents, it seems that detecting the best 10 lines in the document is good
enough. In almost all cases, a few horizontal and barely skewed lines are
detected, which makes the correction possible.
Once horizontal lines are detected, all that remains is to compute the
orientation error angle, which is done by averaging the angles that the lines
make with the x-axis.
Figure 4.1 – Orientation correction in accident statement documents. The input
image is passed through Canny's Edge Detector to highlight the edges of the image.
The edges are then used in the Hough Transform to obtain the parameters of the most
likely lines in the document. These lines are filtered to keep the horizontal ones, and
the correction amount is computed as the average of the angle that each horizontal
line makes with the x-axis. The final image was corrected by 2°.
Discussion:
Given the straightforward nature of the algorithms used, and the fact that
manual data labeling can be labor-intensive, the results were evaluated
qualitatively on real-world images. As shown in Fig. 4.1, and given
that these documents necessarily contain horizontal and vertical lines, line
detection allows us to accurately correct the orientation.
4.1.2 Model no1: Shallow CNN+LSTM architecture
Experimental design:
We start our first series of experiments using the previously described
shallow CNN+LSTM architecture. The datasets used are datasetC, datasetE,
datasetF, and datasetLPR. A 70-30 random train/test split is performed on
each of them to obtain more reliable results.
For a thorough examination of the model, we perform a cross-examination
of the model trained on each of the four datasets, and evaluate it on a separate
test set of each of these datasets. As a baseline, we consider the model pre-
trained on individual characters of the alphabet. For each case, we perform
the same experiment 10 times, to get a sense of the accuracy distribution by
computing confidence intervals.
Results:
We report both the Character Error Rate and Word Error Rate, with the
corresponding confidence intervals, in the tables below. For example, the
cell in the third line and third column in Table 4.1 contains the CER (and
corresponding confidence interval) of our model trained on the training set of
datasetE, and tested on the test set of datasetF.
Training (down) \ Test (right)   datasetC        datasetE        datasetF        datasetLPR
baseline                         27.14 ± 2.21    10.74 ± 2.04    61.54 ± 3.61    76.30 ± 4.13
datasetC                          2.62 ± 0.14     8.51 ± 0.74    68.74 ± 5.16    92.51 ± 1.15
datasetE                          9.40 ± 2.64     3.20 ± 0.26    67.13 ± 4.59    93.17 ± 1.08
datasetF                         13.36 ± 2.21    12.22 ± 4.21    22.63 ± 3.51    62.31 ± 2.18
datasetLPR                       11.60 ± 2.81    13.19 ± 2.23    36.44 ± 7.02    33.60 ± 4.21

Table 4.1 – Character Error Rate (CER, %) for the shallow CNN+LSTM model, reported
with a confidence interval over 10 experiments.
Training (down)
/ Test (right) datasetC datasetE datasetF datasetLPR
baseline 76.00 ± 3.00 21.4 ± 1.92 97.44 ±1.34 100.00 ± 0.00
datasetC 18.31 ± 2.36 11.31 ± 2.15 99.03 ± 0.53 100.00 ± 0.00
datasetE 21.6 ± 3.22 8.25 ± 0.85 97.02 ± 6.42 100.00 ± 0.00
datasetF 23.61 ± 5.89 25.17 ± 5.36 53.74 ± 4.61 95.12 ± 3.31
datasetLPR 24.11 ± 3.18 24.88 ± 4.30 69.47 ± 6.97 63.46 ± 3.19
Table 4.2 – Word Error Rate (WER) for the shallow CNN+LSTM model, reported
with a confidence interval over 10 experiments.
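The two metrics reported above can be computed along the following lines (a minimal sketch; the `editdistance` package and the exact normalization are assumptions rather than the implementation used in this work):

```python
import editdistance  # generic edit-distance package (assumption)

def cer(predictions, references):
    """Character Error Rate: character-level edit distance / total reference characters."""
    edits = sum(editdistance.eval(p, r) for p, r in zip(predictions, references))
    return 100.0 * edits / sum(len(r) for r in references)

def wer(predictions, references):
    """Word Error Rate: word-level edit distance / total reference words."""
    edits = sum(editdistance.eval(p.split(), r.split()) for p, r in zip(predictions, references))
    return 100.0 * edits / sum(len(r.split()) for r in references)

# Hypothetical example: one character substitution in a seven-character field.
print(cer(["4B123CD"], ["AB123CD"]), wer(["4B123CD"], ["AB123CD"]))
```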
Furthermore, given that each experiment (i.e. each cell in Table 4.1 and Table
4.2) consists of 10 measurements, we can compare our models using Analysis of
Variance (ANOVA) and Student t-tests. Both tests use the mean and variance of
the n = 10 runs to assess more rigorously whether one model is better than
another. We assume that the error is normally distributed.
The ANOVA test defines the null hypothesis as "All group means are equal"
and the alternative hypothesis as "Not all group means are equal".
evaluated on p-value for CER p-value for WER
datasetC <0.001 <0.001
datasetE <0.001 <0.001
datasetF <0.001 <0.001
datasetLPR <0.001 <0.001
Table 4.3 – Results of ANOVA test indicating that the group means are indeed not
equal.
A Student t-test defines the null hypothesis as "The means of the pair are
equal" and the alternative hypothesis as "The means of the pair are not equal".
We perform Student t-tests for the pairs of interest, evaluated on all datasets.
Pair evaluated on p-value for CER p-value for WER
baseline/datasetC datasetC <0.001 <0.001
baseline/datasetE datasetE <0.001 <0.001
baseline/datasetF datasetF <0.001 0.027
baseline/datasetLPR datasetLPR <0.001 0.052
datasetC/datasetE datasetC 0.003 0.004
datasetC/datasetE datasetE <0.001 0.002
datasetF/datasetLPR datasetF 0.062 0.098
datasetF/datasetLPR datasetLPR 0.064 0.087
Table 4.4 – Results of pairwise Student t-tests.
p-value interpretation: For example, the first two p-values of the table above
can be read as follows: if the model trained on datasetC and the baseline had
equal mean errors (in terms of CER and WER) on datasetC, the probability of
observing a difference as large as the one measured would be smaller than 0.001.
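Both tests can be reproduced with SciPy along these lines (an illustrative sketch; the per-run values below are hypothetical placeholders, not the measured errors):

```python
from scipy import stats

# Hypothetical CER values over 10 runs, per training dataset, on a fixed test set.
cer_runs = {
    "baseline": [27.1, 25.3, 28.9, 26.4, 27.8, 29.0, 24.9, 27.5, 26.8, 27.7],
    "datasetC": [2.6, 2.5, 2.8, 2.7, 2.4, 2.6, 2.7, 2.5, 2.6, 2.8],
    "datasetE": [9.4, 8.7, 10.2, 9.9, 9.1, 9.6, 8.9, 9.8, 9.3, 9.5],
}

# ANOVA: H0 = "all group means are equal"
f_stat, p_anova = stats.f_oneway(*cer_runs.values())

# Pairwise Student t-test: H0 = "the means of the pair are equal"
t_stat, p_pair = stats.ttest_ind(cer_runs["baseline"], cer_runs["datasetC"])

print(f"ANOVA p-value: {p_anova:.4g}, baseline vs datasetC p-value: {p_pair:.4g}")
```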
Discussion:
Model no1 achieves the lowest error, in both CER and WER, when trained
and evaluated on the same dataset (with the train/test split applied), as shown
in Tables 4.1 and 4.2. However, training a model on a given dataset and
evaluating it on another does not always yield a result better than the baseline;
for instance, the model trained on datasetC performs worse than the baseline on
datasetF and datasetLPR. This cross-dataset behaviour is what interests us the
most in terms of generalization ability: in the datasetC/datasetE case, even
though the model was trained purely on datasetC, it seems to have learned
information, probably specific to the French language, and generalizes well
enough on datasetE to outperform the baseline.
Moreover, Model no1 generally performs well on datasetC and datasetE,
which contain printed text exclusively. A minimal CER of 2.62% and 3.20% was
achieved on these two datasets respectively, which makes the model accurate
enough for practical use. However, performance is much worse on the handwritten
datasets, datasetF and datasetLPR: a CER higher than 20% is almost always
unusable in a real-world application.
ANOVA and Student t-tests were performed to measure the statistical
significance of our experiments. The ANOVA p-values indicate that there is
sufficient statistical evidence to say that the evaluated errors (CER and WER)
differ from one group to another. Pairwise t-tests were performed on pairs of
models; most comparisons yielded a p-value smaller than 0.05, although the
datasetF/datasetLPR comparisons, as well as the baseline/datasetLPR comparison
for WER, did not reach significance at the 5% level.
4.1.3 Model no2: Deep CNN+Transformer/seq2seq
Introduction:
Given the unsatisfactory performance of the previous model on handwritten
data, we experiment with the deep CNN+Transformer architecture. This
model uses a VGG-19 backbone coupled with a Transformer, as described in
Chapter 3.
Hyperparameter optimization:
Since this model is quite large, GPU parallelization introduces an additional
hyperparameter on top of the learning rate: the batch size. A larger batch size
means faster training per epoch (more parallelism), but its effect on accuracy
is not known in advance. To determine the best learning rate and batch size, we
perform a grid search, illustrated in the heatmap below, and conclude empirically
that lr=0.0005 and a batch size of 32 yield the best results.
Figure 4.2 – Heatmap illustrating the grid search to find the optimal hyperparameters:
learning rate and batch size. The model was trained on datasetLPSynth and evaluated
on the high_quality dataset. Results were obtained by repeating each experiment 10
times and retaining the mean.
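The grid search itself can be sketched as follows (illustrative only; `train_and_evaluate` is a hypothetical placeholder for a full training run, stubbed with random values so the sketch executes):

```python
import itertools
import random

def train_and_evaluate(lr: float, batch_size: int) -> float:
    # Placeholder: in practice, train the CNN+Transformer with these settings
    # and return the CER on a validation set. Stubbed here with random values.
    return random.uniform(10, 40)

learning_rates = [1e-4, 5e-4, 1e-3, 5e-3]
batch_sizes = [16, 32, 64, 128]

mean_cer = {}
for lr, bs in itertools.product(learning_rates, batch_sizes):
    runs = [train_and_evaluate(lr, bs) for _ in range(10)]  # repeat each configuration 10 times
    mean_cer[(lr, bs)] = sum(runs) / len(runs)

best_lr, best_bs = min(mean_cer, key=mean_cer.get)
print(f"best configuration: lr={best_lr}, batch size={best_bs}")
```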
Experimental design:
We use this model on handwritten data only, since our previous experiments
showed that model no1 was sufficiently accurate for printed text. For each
dataset, a 70-30 random split is performed, and each experiment is repeated
10 times to get a sense of the error distribution.
Moreover, we have noticed that our real-world license plate dataset varies
greatly in quality; consequently, the accuracy of any given model would be
hard to track when evaluated on it. To remedy that, we split the dataset into
three categories: high quality, medium quality, and low quality. We evaluate
all models on real-world data exclusively. Given that our real data is limited
in size, we divide these experiments into two parts:
1. Training on synthetic data only, so as to have the whole real-world
dataset as a test set.
2. Training on both synthetic data and the training set of the real-world
data, and evaluating solely on the test set of the real-world data.
Results:
1. Training on synthetic data only: Below are the reported Character Error
Rate and Word Error Rate, with respective confidence intervals over 10
runs.
Training (down) / Test (right)   high_quality (186)   medium_quality (224)   low_quality (118)
datasetLPSynth   16.91 ± 2.41   32.61 ± 3.74   37.39 ± 5.11
datasetSynth   12.05 ± 2.31   22.49 ± 4.14   33.23 ± 5.54
datasetSynth + datasetLPSynth   11.03 ± 2.45   20.20 ± 3.11   30.29 ± 4.41
Table 4.5 – Character Error Rate (with confidence interval) of three models trained
exclusively on synthetic data: datasetLPSynth, which consists of synthetic license
plates; datasetSynth, which consists of general handwritten text from accident
statements (including license plates); and a model trained on datasetSynth and
fine-tuned on datasetLPSynth.
Training (down) / Test (right)   high_quality (186)   medium_quality (224)   low_quality (118)
datasetLPSynth   62.90 ± 3.21   76.33 ± 4.55   90.67 ± 1.51
datasetSynth   56.45 ± 4.21   75.44 ± 3.74   81.35 ± 1.21
datasetSynth + datasetLPSynth   51.07 ± 3.69   70.98 ± 2.21   83.89 ± 1.12
Table 4.6 – Word Error Rate (with confidence interval) of three models trained on
synthetic data only, similarly to Table 4.5
Using a series of t-tests (assuming the error is normally distributed), we
compare pairs of models and report the p-values:
Pair   evaluated on   p-value for CER   p-value for WER
datasetLPSynth / datasetSynth   high_quality   <0.05   <0.05
datasetLPSynth / datasetSynth   medium_quality   <0.05   <0.05
datasetLPSynth / datasetSynth   low_quality   0.062   0.07
datasetLPSynth / datasetSynth + datasetLPSynth   high_quality   <0.05   <0.05
datasetLPSynth / datasetSynth + datasetLPSynth   medium_quality   <0.05   <0.05
datasetLPSynth / datasetSynth + datasetLPSynth   low_quality   <0.05   <0.05
datasetSynth / datasetSynth + datasetLPSynth   high_quality   0.067   0.062
datasetSynth / datasetSynth + datasetLPSynth   medium_quality   0.081   0.084
datasetSynth / datasetSynth + datasetLPSynth   low_quality   0.122   0.168
Table 4.7 – Student t-tests for pairs of models, evaluated on a given dataset, with
the respective p-values for the CER and WER comparisons. The t-tests were
performed after repeating each experiment 10 times with a different random split.
2. Training on synthetic and real data: Similarly, we report the CER and
WER for models trained on a combination of synthetic and real data.
The goal is to investigate the domain shift from synthetic to real data,
as well as the quality of the synthetic data and its ability to imitate real data.
Training (down) / Test (right)   high_quality (55)   medium_quality (66)
datasetLPSynth   15.64   32.92
datasetSynth   10.55   20.32
datasetSynth + datasetLPSynth   8.99   17.49
datasetSynth + datasetLPR   8.21   13.30
Table 4.8 – Character Error Rate of four models trained on a combination of
synthetic and real data, evaluated on a separate test set drawn from the real data.
Training (down) / Test (right)   high_quality (55)   medium_quality (66)
datasetLPSynth   61.81   69.69
datasetSynth   56.36   71.21
datasetSynth + datasetLPSynth   47.27   66.67
datasetSynth + datasetLPR   32.72   50.00
Table 4.9 – Word Error Rate of four models trained on a combination of synthetic
and real data, evaluated on a separate test set drawn from the real data.
Discussion:
In this round of experiments, we used a deep CNN+Transformer
architecture on handwritten data. The hypothesis is that handwritten text is
quite complex, given the ambiguity, slanting, and variance in handwriting
style in general, and therefore requires a deeper model than model no1 to
achieve reasonable accuracy.
From Tables 4.5 and 4.6, we notice:
• Accuracy is highly correlated with image quality, which validates the
importance of separating the test set by quality. We also notice that the
second model performs, perhaps surprisingly, better than the first, probably
because it was trained on text of varying size and formatting and therefore
generalizes better to unseen real data. Additionally, we can conclude that it is
almost hopeless to extract text accurately from the low-quality category, no
matter the model used.
• The model reaches a satisfactory error on the high- and medium-quality
subsets of the real-world data, even when trained solely on synthetic data, as
is the case in Table 4.5. This means that not only was model no2 able to achieve
good accuracy, unlike model no1, but also that the domain shift from synthetic
data to real data does not come at a great cost in accuracy. This is due both to
training measures such as data augmentation to avoid overfitting, and to the
realism achieved in the synthetic data.
• Student t-tests were once again performed to assess the statistical
significance of our experiments. We notice that the p-value exceeds the 5%
threshold in some cases when testing on low-quality data, but it is generally
satisfactory and does not raise any particular red flags.
4.1.4 Morphological Operations for Domain Adaptation
In order to test the effect of the morphological operation known as opening
on our domain adaptation problem, we apply the opening operation 10 times
to each sample of the real-world dataset. An example of this transformation
can be found in Fig. 4.3a, and the quantitative results in Fig. 4.3b. The model
is identical in both cases; the sole difference lies in the pre-processing applied
to the real-world data on which the model is evaluated. The hypothesis to be
tested is that this operation brings the target domain closer to the source domain,
thus increasing the accuracy and demonstrating the usefulness of synthetic data
as a scalable way to train large OCR models.
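The pre-processing step can be sketched with OpenCV as follows (a minimal sketch; the 3×3 structuring element is an assumption):

```python
import cv2
import numpy as np

def apply_opening(image, times=10, kernel_size=3):
    """Apply the morphological opening (erosion followed by dilation) `times`
    times, to reduce noise and smooth character contours in real-world crops."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    out = image
    for _ in range(times):
        out = cv2.morphologyEx(out, cv2.MORPH_OPEN, kernel)
    return out
```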
(a) Sample of real-world data, before and after applying the opening operation 10 times. We
can notice that there is slightly less noise after the operation, as well as added smoothness in
the contours of the characters.
Model (down) / Dataset (right)   high_quality (186)   medium_quality (224)   low_quality (118)
baseline   16.91 ± 1.41   32.61 ± 2.74   37.39 ± 3.11
baseline + opening   15.66 ± 1.33   29.45 ± 2.76   33.45 ± 4.57
(b) Comparison between evaluating the same model on real-world data, and real-world data
with the opening operation.
Figure 4.3 – The effect of opening (morphological operation) for Domain Adaptation,
to make up for the differences between real-world and synthetic data such as noise and
character smoothness.
4.2 Method-related metrics
During this study, we mainly used two architectures: a shallow
CNN+LSTM architecture and a deep CNN+Transformer architecture. The
first turned out to be sufficient for printed text, while the second proved
necessary for handwritten text. Below, we examine why the first architecture,
which is considerably lighter, can present an advantage in terms of training
time and, more importantly, inference time.
Model   Baseline (shallow CNN+LSTM)   Deep CNN+Transformer
Size   4 MB   209 MB
Training time (min)   ∼11.94   ∼85.23
Inference time (s)   ∼0.87   ∼1.96
Table 4.10 – Size, training time and inference time of the two architectures used in
this study (shallow CNN+LSTM and deep CNN+Transformer).
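For reference, the per-image inference time can be measured along these lines (a hedged sketch; `model` and `samples` are placeholders, and warm-up runs are excluded to avoid counting initialization overhead):

```python
import time

def mean_inference_time(model, samples, warmup=3):
    """Return the average time (in seconds) the model takes to process one sample."""
    for x in samples[:warmup]:      # warm-up runs, excluded from timing
        model(x)
    start = time.perf_counter()
    for x in samples:
        model(x)
    return (time.perf_counter() - start) / len(samples)
```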
As expected, the lighter model is faster in both training and inference.
Its smaller size can also be an advantage in terms of portability, for example
for deployment in a mobile application. However, it performs poorly in
complicated cases, which is often the situation with real-world handwriting,
and should then be replaced by the second model.
Chapter 5
Environmental and societal impact
In this chapter, we present the environmental and societal impact of this
project. Artificial Intelligence and Machine Learning are currently evolving
at a remarkable pace, and taking the impact of these solutions into account is
paramount. We demonstrate how the use of GPUs for neural network
acceleration can constitute an environmental danger in the future, and how
Domain Adaptation, for example, can help diminish that impact. We also discuss
the societal impact of this study, mainly related to how automation affects the
job market.
5.1 Personal Environmental Impact
This project was carried out at the host organization La Javaness, for a
duration of five months. Given the COVID-19 pandemic, the organization’s
policy regarding remote work was to let the staff freely choose the number of
days per week to work remotely. Below are the different elements of the direct
environmental impact of my project:
• Transport-related impact: On average, I went to the host organization’s
HQ twice a week. The project lasted 25 weeks in total, which means 50
return journeys. A single return journey takes approximately an hour and a
half, and I used public transportation exclusively. According to ENGIE’s
carbon footprint estimations, the use of the subway (métro) in France produces
3.26 g of CO2 per kilometer per passenger. This estimation is quite low compared
to the most used transportation method in France, the personal car, which emits
100 g to 150 g of CO2 per kilometer per passenger. For the entire duration of the
project, the transport-related carbon footprint is estimated at 50 × 1.5 × 3.26 =
244.5 g.
• Equipment-related impact: This impact is slightly more difficult to
estimate correctly. Two major elements contributed during this project:
1. The use of the company’s laptop. On average, the laptop was turned on
for 11 hours a day, every work day, which means 5 × 25 × 11 = 1375 hours
of use. The laptop is a MacBook Pro 2019 with an estimated wattage of 96 W,
which amounts to 1375 h × 96 W = 132 kWh. According to Ademe, the carbon
footprint of a single kWh in France amounts to 59.9 g of CO2 emissions. The
use of the company laptop during this project therefore has a carbon footprint
of 59.9 × 132 = 7906.8 g.
2. The use of GPU-powered servers provided by the company for deep
learning experiments. Using the CodeCarbon Python module (see the short
sketch at the end of this section), the estimated carbon footprint of a 52-minute
experiment using a GPU amounts to 0.0852 kg of CO2 emissions. During the
project, an estimated 25 experiments were run on GPU with similar parameters,
so the carbon footprint of GPU usage is approximately 0.0852 × 25 = 2130 g.
The overall carbon footprint as a direct impact of this project is therefore
estimated at 244.5 + 7906.8 + 2130 ≈ 10.28 kg. Surprisingly, the power-consuming
laptop is the most polluting resource in this project, orders of magnitude above
the transport-related footprint.
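For reference, the per-experiment estimate above can be reproduced with the CodeCarbon tracker along these lines (a hedged sketch; the project name is illustrative and the training step is a placeholder):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="ocr-handwriting")  # illustrative project name
tracker.start()
# ... run one GPU-powered training experiment here ...
emissions_kg = tracker.stop()  # estimated emissions in kg of CO2-equivalent
print(f"Estimated emissions for this run: {emissions_kg:.4f} kg CO2-eq")
```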
5.2 Global Environmental Impact
Introduction:
Recent advances in Deep Learning have been accompanied by hardware and
methodology able to perform the abundant amount of computation required for
training. Recent neural network models include the Generative Pre-trained
Transformer 3 (GPT-3), which has over 175 billion trainable parameters. And
although models like GPT-3 are very powerful in Natural Language Processing
applications, the cost of training such a large network and the resulting carbon
footprint of the tensor processing hardware cannot be ignored.
The need for GPU resources in Deep Learning:
As previously described in section 2.3, training a neural network is an
iterative process of a Forward Pass followed by a Backward Pass. And although
the model can become quite complex and handle large tasks, the core operations
needed to train a neural network remain the same. These computations, such as
matrix multiplications, can be parallelized using Graphics Processing Units
(GPUs). This enormous boost in training speed is perhaps what allowed Deep
Learning to achieve such important milestones. Deep Learning has evolved in
the last few years from requiring a simple laptop or server to experiment with
architectures, to requiring specialized hardware with multiple GPU or TPU
instances.
The environmental impact of Deep Learning and Computer Vision:
Even if companies can afford the cost of using expensive computational
resources for training deep neural networks, we should also consider the
substantial environmental cost of common practices in this field. The energy
powering this hardware still largely comes from carbon-based sources. It is
estimated that carbon emissions need to be reduced by half over the next decade
to avoid a major ecological disaster in the years to come [31].
According to ENGIE’s carbon footprint estimations and CodeCarbon’s
measurements:
• The average carbon footprint of using a personal car is between 100 g and
150 g of CO2 per kilometer per passenger. Considering a person who uses the
car two hours a day, every work day, this amounts to a carbon emission
equivalent of 125 × 2 × 20 = 5 kg of CO2 per month per passenger.
• The estimated carbon footprint of an average one-hour experiment involving
single-GPU training is 0.0852 kg, which amounts to 0.0852 × 48 = 4.09 kg of
CO2 emissions for a two-day training. Recent papers in Computer Vision and
Deep Learning usually train multiple models for several days, and use multiple
GPU or TPU instances to keep the training duration reasonable, especially for
large models such as GPT-3 or BERT. Given these estimates, a single Data
Scientist can be responsible for more than 100 kg of CO2 emissions per month.
We can conclude from these estimates that the CO2 emissions resulting
from GPU usage for Deep Learning can be substantial. For a detailed analysis
of a given project, one should take into account the number of experiments run
during the Proof of Concept development stage and the cost of deploying models
on user platforms. This means it is crucial to determine not only the training
cost, but also the inference cost, which should be multiplied by the number of
users of the model.
Solutions to GPU-related carbon emissions:
We suggest several approaches to account for and reduce the environmental
impact during the development of a deep learning project:
• Accurate reporting in scientific papers: Authors should provide detailed
reports of training experiments, preferably in a standardized format that allows
for meta-analysis studies in the future. These reports should include training
and inference times that can preferably be peer-reviewed, as well as a study of
how sensitive the model is to hyperparameters: for a model that is very sensitive
to its hyperparameters, it can take tens or hundreds of experiments to find the
right values. Additionally, the report should state what type of GPU resources
were used.
• Transfer Learning: Prioritizing Transfer Learning can be good practice
for a Data Scientist, saving time and resources and reducing the carbon
footprint. Transfer Learning is the practice of reusing a model that was
originally trained for a different purpose (see section 2.2 for more details on
Domain Adaptation and Transfer Learning). A common practice in Data Science
is to train a model from scratch for a very specific objective, which can also
create scalability problems. To that end, more research on Transfer Learning
would make Deep Learning projects both scalable and environmentally
responsible.
• Efficient algorithms and methodology: Data Scientists should work on
code efficiency, even in the early development stages of a Data Science
project. Given the increasing complexity of ML models, there is often a lot
of room for improvement in every step of training. One example is the use of
alternatives to brute-force hyperparameter search, such as Bayesian
hyperparameter search techniques. As stated earlier, including these practices
in the data pipeline can save both time and resources. Concerning methodology,
Data Scientists should try to run as few GPU-powered experiments as possible.
Unfortunately, a common practice in Data Science is intensive hyperparameter
search to squeeze out as much accuracy as possible, with no regard to the
scalability of the resulting model, which can render all these experiments
useless.
The global environmental impact of this project:
This project is a Proof of Concept regarding Optical Character Recognition
(OCR) in documents with difficult writing. OCR is used by companies such
as Thélém in an effort to entirely digitize their accident statement database.
In fact, new accident statements are now entered directly through a web
platform, removing the need for manual data entry or automated document
analysis. Digitizing databases is advantageous for companies, as it allows for
scalability when searching for information, as well as dematerialization and
easier, more secure access to information.
Incidentally, the use of technologies such as OCR encourages the
dematerialization of documents, which is advantageous environmentally.
Global paper use has increased by 400% in the last 40 years. The average
European consumes up to 125 kg of paper, second only to the average American
at 215 kg. And although paper recycling is a very common practice, with over
30% of papermaking materials originating from recycled paper, global paper
consumption has a substantial environmental impact. It is estimated that in the
US, paper use results in 3,000 tons of waste daily, equivalent to 51,000 trees cut
down. The paper industry is also responsible for approximately 10% of the
release of fine particles, which are linked to serious respiratory health issues.
Concerning accident statements, the French National Interministerial
Road Safety Observatory (ONISR) reports that more than 50,000 road
accidents are recorded yearly. This translates to a substantial number of
accident statements that can be dematerialized using the solutions proposed
in this project.
Moreover, although this project was evaluated on specific data for
business purposes, it is easily generalizable to other fields of application.
In a nutshell, perfecting Optical Character Recognition and Document
Analysis would allow for large-scale document dematerialization globally,
which could be a significant win for the environment. It is however important
to note that the technology used in this project is still in the development phase,
and future studies focusing on stability, scalability and reliability are needed
before considering a large-scale operation.
5.3 Global Societal Impact
Introduction:
Modern Artificial Intelligence is shaping up to be one of the most important
technological advancements of the century. With recent CNN-based models
matching or surpassing human ability in specific tasks, Artificial Intelligence
can have a direct impact on the job market and the global economy, as well as
on individuals. In this section, we examine the different societal aspects of
this project and its future applications.
Data Privacy concerns:
As of 2019, according to the French "Tech Marketers Club" (CMIT), the
digital market, including IT, digital and telecommunications manufacturers
and service companies, was valued at around 150 billion euros, ahead of, for
example, the aerospace industry, valued at around 65 billion euros. This
growing tendency is accompanied by advancements in Data Science, Robotic
Process Automation and Artificial Intelligence, as well as Cybersecurity.
As more and more users store sensitive information using web services,
data privacy becomes paramount. Data privacy is the ability of a person to
determine when, how and to what extent their personal data should be shared
or used. In France, as in most countries, data privacy is a fundamental
human right. However, recent transgressions on social media platforms such
as Facebook, notably the Facebook–Cambridge Analytica data scandal, call
into question the efficiency of privacy protection laws as well as the use of
certain technologies.
This project is centered around automating document analysis and text
extraction with deep learning solutions. Accident statement documents can
contain information such as the patient’s name, date of birth, email and/or
phone number, license plate and license number. In this sense, the process of
entering this information into a central database can be subject to security
breaches. These breaches can occur:
1. In the Proof of Concept (POC) stage, where Data Scientists such as
myself handle real-world data to experiment with models and increase
accuracy. Handling sensitive data should be subject to multiple security
protocols, from encrypting the laptop’s hard drive to using secure protocols
for storing, labeling and using data, such as Amazon Web Services (AWS) S3
buckets. In this project, the full name of the client is blurred in every document,
to protect the client’s identity in case of a data breach. This data blinding should
preferably be discussed between the company providing the data science solution
and the company providing the data (the client) before the start of the project,
to ensure maximum transparency.
2. In the deployment of a production-ready solution. In our case, this
Proof of Concept could develop in the future into a production-ready solution
used by thousands or millions of users. Deploying an Optical Character
Recognition system that extracts text from millions of scanned documents to
be stored in a central database should be done with security as a primary
concern.
Job market-related concerns:
In this section, we discuss the impact that Optical Character Recognition
and Automated Document Analysis, and automation fueled by Artificial
Intelligence in general, have on the job market and therefore on the individuals
in our society.
In a recent report by the McKinsey Global Institute, the authors estimate
the changes in the job market by 2030 due to automation. They found that
in about 60% of occupations, at least one third of the activities could be
automated, while very few occupations could be fully automated. Given
that the technical side of automation will probably take years of research and
development, this change will happen gradually. We should also take into
account that automation will simply be too expensive for some businesses,
which will keep their business models labor-oriented.
The activities most susceptible to automation are those that are
predictable. In the case of this project, and upon the complete success of our
Optical Character Recognition solution at the production level, this solution
will either assist data entry workers to make them more efficient, or eliminate
the need for them altogether at the insurance company. In general, activities in
accounting, paralegal work and office transactions might be displaced by
automation powered by Artificial Intelligence. On the other hand, largely
physical activities are going to be less affected, since research in Robotics may
be advancing more slowly than Artificial Intelligence software. Jobs that are
unpredictable, or that involve leadership, are going to be even less affected by
automation. Finally, the authors estimate that, by 2030, 400 to 800 million
workers around the world may need to switch occupations and learn new skills.
Automation does not necessarily mean fewer job opportunities or a lower
economic status for the affected population. Historically, fear of automation
has for the most part been unfounded, as technological advances such as the
invention of the printing press created more jobs than they eliminated. Gradual
insertion of automation can change the job market in a way that is healthy for
the economy. The authors expect that "8 to 9 percent of 2030 labor demand will
be in new types of occupations that have not existed before". Automation is
expected to be accompanied by higher wages, as jobs become more creatively
focused. Researchers are even considering the idea of a Universal Basic Income,
to cover the wages lost to fully automatable jobs. Countries must also invest in
updating laborers’ skill sets to avoid rising unemployment or low wages.
Chapter 6
Conclusion
This study compares different neural network architectures for an Optical
Character Recognition problem applied to both handwritten and non-
handwritten French text.
The results indicate that a light, shallow model (CNN+LSTM) is sufficient
for most applications involving printed text, as it can achieve an accuracy of
up to 97%, depending on the amount of noise and the quality of the training
data. However, the complexity and ambiguity of handwritten text require a
deeper model (VGG-19+Transformer, for example) to achieve acceptable
results; we demonstrated this on a real-world license plate dataset and achieved
an accuracy of up to 87%. Furthermore, data synthesis was shown to bring
substantial benefits to neural network training for OCR. However, realistic
data synthesis can be as challenging as the Optical Character Recognition task
itself, and the use of different image processing algorithms such as filtering or
morphological operations can lead to interesting ad-hoc results. The exclusive
use of realistic synthetic data in neural network training also yielded impressive
results, as the domain shift between synthetic and real data affected the accuracy
by no more than 5%. Additionally, the study attempted to estimate the quality
of the aforementioned algorithms in real-world use, and metrics such as training
time were shown to vary considerably between the shallow CNN+LSTM model
and the deep CNN+Transformer model.
Finally, the environmental study shows that Domain Adaptation can be
useful to reduce the overall footprint of Machine Learning algorithms, which
is predicted to rise significantly in the upcoming years.
Bibliography
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in Neural Information Processing Systems (I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
eds.), vol. 30, p. 6000–6010, Curran Associates, Inc., 2017.
[2] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training
of deep bidirectional transformers for language understanding,” in
Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-
7, 2019, Volume 1 (Long and Short Papers) (J. Burstein, C. Doran,
and T. Solorio, eds.), pp. 4171–4186, Association for Computational
Linguistics, 2019.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
Transformers for image recognition at scale,” in International
Conference on Learning Representations, 2021.
[4] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
[5] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,”
Neurocomputing, vol. 312, pp. 135–153, 2018.
[6] F. Munir, S. Azam, and M. Jeon, “Sstn: Self-supervised domain
adaptation thermal object detection for autonomous driving,” in 2021
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp. 206–213, 2021.
[7] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with
gradient-based learning,” in Shape, Contour and Grouping in Computer
Vision, (Berlin, Heidelberg), p. 319, Springer-Verlag, 1999.
[8] S. Liu and W. Deng, “Very deep convolutional neural network based
image classification using small training sample size,” in 2015 3rd IAPR
Asian Conference on Pattern Recognition (ACPR), pp. 730–734, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778, 2016.
[10] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural
Computation, vol. 9, pp. 1735–1780, 11 1997.
[11] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black,
I. Laptev, and C. Schmid, “Learning from synthetic humans,” in CVPR,
vol. abs/1701.01370, 2017.
[12] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil,
T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep
networks with synthetic data: Bridging the reality gap by domain
randomization,” CoRR, vol. abs/1804.06516, 2018.
[13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics:
The kitti dataset,” The International Journal of Robotics Research,
vol. 32, no. 11, pp. 1231–1237, 2013.
[14] S. Hinterstoisser, V. Lepetit, P. Wohlhart, and K. Konolige, “On pre-
trained image features and synthetic images for deep learning,” Computer
Vision ECCV 2018, pp. 682–697, 10 2017.
[15] P. S. Rajpura, R. S. Hegde, and H. Bojinov, “Object detection using deep
cnns trained on synthetic images,” CoRR, vol. abs/1706.06782, 2017.
[16] X. Peng and K. Saenko, “Synthetic to real adaptation with generative
correlation alignment networks,” in 2018 IEEE Winter Conference on
Applications of Computer Vision (WACV), pp. 1982–1991, 2018.
[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
and A. Zisserman, “The PASCAL Visual Object Classes
Challenge 2007 (VOC2007) Results.” http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[18] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual
category models to new domains,” in Computer Vision – ECCV 2010
(K. Daniilidis, P. Maragos, and N. Paragios, eds.), (Berlin, Heidelberg),
pp. 213–226, Springer Berlin Heidelberg, 2010.
[19] R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object
recognition: An unsupervised approach,” pp. 999–1006, 11 2011.
[20] J. Xu, S. Ramos, D. Vázquez, and A. M. López, “Domain adaptation of
deformable part-based models,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 36, no. 12, pp. 2367–2380, 2014.
[21] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain
adaptation,” in Proceedings of the Thirtieth AAAI Conference on
Artificial Intelligence, AAAI’16, p. 2058–2065, AAAI Press, 2016.
[22] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable
are features in deep neural networks?,” in Proceedings of the 27th
International Conference on Neural Information Processing Systems -
Volume 2, NIPS’14, (Cambridge, MA, USA), p. 3320–3328, MIT Press,
2014.
[23] T. Hassan and H. A. Khan, “Handwritten bangla numeral recognition
using local binary pattern,” in 2015 International Conference on
Electrical Engineering and Information Communication Technology
(ICEEICT), pp. 1–4, 2015.
[24] W. Q. Khan and R. Q. Khan, “Urdu optical character recognition
technique using point feature matching; a generic approach,” in
2015 International Conference on Information and Communication
Technologies (ICICT), pp. 1–7, 2015.
[25] H. Afli, L. Barrault, and H. Schwenk, “Ocr error correction using
statistical machine translation,” Int. J. Comput. Linguistics Appl., vol. 7,
pp. 175–191, 2016.
[26] M. A. Azawi and T. M. Breuel, “Context-dependent confusions rules for
building error model using weighted finite state transducers for ocr post-
processing,” in 2014 11th IAPR International Workshop on Document
Analysis Systems, pp. 116–120, 2014.
[27] M. A. Azawi, Statistical Language Modeling for Historical Documents
using Weighted Finite-State Transducers and Long Short-Term Memory.
Doctoral thesis, Technische Universität Kaiserslautern, 2015.
[28] A. Goldenshluger and A. Zeevi, “The Hough transform estimator,” The
Annals of Statistics, vol. 32, no. 5, pp. 1908 – 1932, 2004.
[29] R. Smith, “An overview of the tesseract ocr engine,” in Ninth
International Conference on Document Analysis and Recognition
(ICDAR 2007), vol. 2, pp. 629–633, Sep. 2007.
[30] N. Jamil, T. M. T. Sembok, and Z. A. Bakar, “Noise removal and
enhancement of binary images using morphological operations,” in 2008
International Symposium on Information Technology, vol. 4, pp. 1–6,
2008.
[31] E. Strubell, A. Ganesh, and A. Mccallum, “Energy and policy
considerations for deep learning in nlp,” pp. 3645–3650, 01 2019.
[32] C. Tomoiaga, P. Feng, M. Salzmann, and P. Jayet, “Field typing
for improved recognition on heterogeneous handwritten forms,” in
2019 International Conference on Document Analysis and Recognition
(ICDAR), (Los Alamitos, CA, USA), pp. 487–493, IEEE Computer
Society, sep 2019.
[33] J. Memon, M. Sami, R. A. Khan, and M. Uddin, “Handwritten optical
character recognition (ocr): A comprehensive systematic literature
review (slr),” IEEE Access, vol. 8, pp. 142642–142668, 2020.
[34] N. O’Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V.
Hernandez, L. Krpalkova, D. Riordan, and J. Walsh, “Deep learning vs.
traditional computer vision,” in Advances in Computer Vision (K. Arai
and S. Kapoor, eds.), (Cham), pp. 128–144, Springer International
Publishing, 2020.
[35] F. C. Tsai, “Geometric hashing with line features,” Pattern Recognition,
vol. 27, no. 3, pp. 377–389, 1994.
[36] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
2016. http://www.deeplearningbook.org.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems (F. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, eds.), vol. 25, pp. 1097–1105, Curran Associates,
Inc., 2012.
[38] T. J. Brinker, A. Hekler, A. H. Enk, C. Berking, S. Haferkamp,
A. Hauschild, M. Weichenthal, J. Klode, D. Schadendorf, T. Holland-
Letz, C. von Kalle, S. Fröhling, B. Schilling, and J. S. Utikal, “Deep
neural networks are superior to dermatologists in melanoma image
classification,” European Journal of Cancer, vol. 119, pp. 11–17, 2019.
[39] V. Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank,
J. Wilson, S. Friedler, and S. Luccioni, “CodeCarbon: Estimate and
Track Carbon Emissions from Machine Learning Computing,” 2021.
[40] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in 2017
IEEE International Conference on Computer Vision (ICCV), pp. 2980–
2988, 2017.
[41] J. Benitez, J. Castro, and I. Requena, “Are artificial neural networks
black boxes?,” IEEE Transactions on Neural Networks, vol. 8, no. 5,
p. 1156–1164, 1997.
[42] Belval, “Text recognition data generator.” https://github.com/Belval/TextRecognitionDataGenerator.
[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in 2009 IEEE Conference on
Computer Vision and Pattern Recognition, pp. 248–255, 2009.
[44] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,” in
2014 IEEE Conference on Computer Vision and Pattern Recognition,
pp. 580–587, 2014.
[45] M. D. Zeiler and R. Fergus, “Visualizing and understanding
convolutional networks,” in Computer Vision – ECCV 2014 (D. Fleet,
T. Pajdla, B. Schiele, and T. Tuytelaars, eds.), (Cham), pp. 818–833,
Springer International Publishing, 2014.
[46] B. Sun and K. Saenko, “From virtual to reality : Fast adaptation of virtual
object detectors to real domains,” in Proceedings of the British Machine
Vision Conference, BMVA Press, 2014.
[47] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain
adaptive faster r-cnn for object detection in the wild,” in 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 3339–
3348, 2018.
[48] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions
on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359,
2010.
[49] X. Li, W. Zhang, Q. Ding, and J.-Q. Sun, “Multi-layer domain adaptation
method for rolling bearing fault diagnosis,” Signal Processing, vol. 157,
pp. 180–197, 2019.
Appendix A
Appendix B
Appendix C
Figure 6.1 – The full architectures of VGG-19 (up) and ResNet (down).
TRITA-EECS-EX-2022:755
www.kth.se