Contents
1. Introduction
1.1. Background and Purpose of the Thesis
1.2. Overview of the International Olympiad in Artificial Intelligence
1.2.1 Format of the Olympiad
1.2.2 Scientific round
1.2.3 Practical round
1.3. Structure of the Thesis
7. Conclusion
7.1. Summary of Key Points
7.2. The Importance of AI Competitions in Advancing Knowledge
7.3. Future Directions for AI and Personal Development
8. References
9. Appendices
9.1. Additional Diagrams, Code, or Data
9.2. Olympiad-related Documents
Being a frequent participant in science olympiads, the author of this thesis became very
interested in this olympiad after finding out about it. After doing some research, the author
learned that the IOAI is a team competition, with teams of at most four people. Most
international olympiads have been held for many years, but the IOAI took place for the first
time this year: the first ever IOAI was held on 9–15 August in Burgas, Bulgaria.
1.2.1 Format of the Olympiad
The Olympiad is divided into two parts: a scientific round and a practical round. In both
rounds, the aim of the solutions is not necessarily to reach “the correct answer”, as there may not be one.
The scientific round is metric-oriented, as solutions are scored based on performance on a
predefined task-specific metric. The practical round is conclusion-oriented, as participants
have to design and perform experiments and to draw conclusions about the capabilities and
limitations of AI.
1.2.2 Scientific round
In this round, the participating teams are given problems that mimic real-world scientific
research and the process of identifying and addressing limitations in an existing approach.
Good performance in this round depends on basic coding skills, familiarity with common
deep learning Python libraries, and an understanding of the fundamentals of machine
learning.
The teams receive 3 problems based on recent AI research 6 weeks in advance of the IOAI,
and work on them on their own schedule. At the end of the allotted time, the teams submit
their solutions to all 3 problems in the form of working code and model outputs.
At the IOAI, the teams receive a set of 3 new problems to solve, which build on the 3
problems they worked on at home, i.e. the general setting remains the same in terms of AI
task, data type, and model architecture, but the teams have to solve a new challenge within
this setting.
The problems in the scientific round will be distributed as Google Colab notebooks and
solutions will be submitted as a modified version of the same notebook. Participants are
required to use Python for their solutions and to ensure that upon submission their
notebooks are fully executable within the Colab environment. Further instructions will be
given for specific problems regarding the maximum time a notebook should take to execute,
the restrictions on the use of pre-trained models and external data, etc.
The deliverables for each problem will be clearly stated in the problem description and may
include: a score measured on a specific data split, a short written answer or methodological
report, a plot visualising some statistics or results, and others. Each problem will specify how
points are distributed between the different deliverables.
The final scores for this round are based in small part on the performance of the solutions
developed at home, and in large part on the performance of the solutions developed on site.
Exact scoring details will be provided upon distribution of the first set of problems.
1.2.3 Practical round
This round happens entirely on site at the IOAI and is intended to acquaint students with the
workings of widely used AI software like ChatGPT, DALL·E 2 and others. The problems require
teams to inspect, analyse, and explain scientific questions pertaining to the behaviour of
working AI software.
Teams are given several problems to work on in a time window of 2 to 4 hours, with access
to one computer connected to the internet per team and no other devices. They interact with
the AI software through a GUI, the way a regular user would, therefore coding is not required
in this round.
The answers submitted at the end of the allotted time are evaluated by the Jury according to
problem-specific criteria that may be based on the metric score a team achieved, on the
number of valid solutions they found, on the ingenuity of their solutions, on the robustness of
their solution, etc. The way points are allocated will be specified in each problem's
description.
1.3 Purpose
The author of this research chose to write about this topic because it is both familiar and of
great interest to the author.
2.1 Definition
Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines,
particularly computer systems. It is a field of research in computer science that develops and
studies methods and software that enable machines to perceive their environment and use
learning and intelligence to take actions that maximise their chances of achieving defined
goals. Such machines may be called AIs.
Some high-profile applications of AI include advanced web search engines (e.g., Google
Search); recommendation systems (used by YouTube, Amazon, and Netflix); interacting via
human speech (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo);
generative and creative tools (e.g., ChatGPT, Apple Intelligence, and AI art); and
superhuman play and analysis in strategy games (e.g., chess and Go).
(Wikipedia)
2.2 History
The study of mechanical or "formal" reasoning began with philosophers and mathematicians
in antiquity. The study of logic led directly to Alan Turing's theory of computation, which
suggested that a machine, by shuffling symbols as simple as "0" and "1", could simulate any
conceivable form of mathematical reasoning. This, along with concurrent discoveries in
cybernetics, information theory and neurobiology, led researchers to consider the possibility
of building an "electronic brain". They developed several areas of research that would
become part of AI, such as McCulloch and Pitts' design for "artificial neurons" in 1943, and
Turing's influential 1950 paper 'Computing Machinery and Intelligence', which introduced the
Turing test and showed that "machine intelligence" was plausible.
The field of AI research was founded at a workshop at Dartmouth College in 1956. The
attendees became the leaders of AI research in the 1960s. They and their students
produced programs that the press described as "astonishing": computers were learning
checkers strategies, solving word problems in algebra, proving logical theorems and
speaking English. Artificial intelligence laboratories were set up at a number of British and
U.S. universities in the latter 1950s and early 1960s.
Researchers in the 1960s and the 1970s were convinced that their methods would
eventually succeed in creating a machine with general intelligence and considered this the
goal of their field. In 1965 Herbert Simon predicted, "machines will be capable, within twenty
years, of doing any work a man can do". In 1967 Marvin Minsky agreed, writing, "within a
generation ... the problem of creating 'artificial intelligence' will substantially be solved". They
had, however, underestimated the difficulty of the problem. In 1974, both the U.S. and British
governments cut off exploratory research in response to the criticism of Sir James Lighthill
and ongoing pressure from the U.S. Congress to fund more productive projects. Minsky's
and Papert's book Perceptrons was understood as proving that artificial neural networks
would never be useful for solving real-world tasks, thus discrediting the approach altogether.
The "AI winter", a period when obtaining funding for AI projects was difficult, followed.
In the early 1980s, AI research was revived by the commercial success of expert systems, a
form of AI program that simulated the knowledge and analytical skills of human experts. By
1985, the market for AI had reached over a billion dollars. At the same time, Japan's fifth
generation computer project inspired the U.S. and British governments to restore funding for
academic research. However, beginning with the collapse of the Lisp Machine market in
1987, AI once again fell into disrepute, and a second, longer-lasting winter began.
Up to this point, most of AI's funding had gone to projects that used high-level symbols to
represent mental objects like plans, goals, beliefs, and known facts. In the 1980s, some
researchers began to doubt that this approach would be able to imitate all the processes of
human cognition, especially perception, robotics, learning and pattern recognition, and
began to look into "sub-symbolic" approaches. Rodney Brooks rejected "representation" in
general and focussed directly on engineering machines that move and survive. Judea Pearl,
Lotfi Zadeh and others developed methods that handled incomplete and uncertain
information by making reasonable guesses rather than precise logic. But the most important
development was the revival of "connectionism", including neural network research, by
Geoffrey Hinton and others. In 1990, Yann LeCun successfully showed that convolutional
neural networks can recognize handwritten digits, the first of many successful applications of
neural networks.
AI gradually restored its reputation in the late 1990s and early 21st century by exploiting
formal mathematical methods and by finding specific solutions to specific problems. This
"narrow" and "formal" focus allowed researchers to produce verifiable results and collaborate
with other fields (such as statistics, economics and mathematics). By 2000, solutions
developed by AI researchers were being widely used, although in the 1990s they were rarely
described as "artificial intelligence". However, several academic researchers became
concerned that AI was no longer pursuing its original goal of creating versatile, fully
intelligent machines. Beginning around 2002, they founded the subfield of artificial general
intelligence (or "AGI"), which had several well-funded institutions by the 2010s.
Deep learning began to dominate industry benchmarks in 2012 and was adopted throughout
the field. For many specific tasks, other methods were abandoned. Deep learning's success
was based on both hardware improvements (faster computers, graphics processing units,
cloud computing) and access to large amounts of data (including curated datasets, such as
ImageNet). Deep learning's success led to an enormous increase in interest and funding in
AI. The amount of machine learning research (measured by total publications) increased by
50% in the years 2015–2019.
In 2016, issues of fairness and the misuse of technology were catapulted into centre stage at
machine learning conferences, publications vastly increased, funding became available, and
many researchers re-focussed their careers on these issues. The alignment problem
became a serious field of academic study.
In the late 2010s and early 2020s, AGI companies began to deliver programs that created
enormous interest. In 2015, AlphaGo, developed by DeepMind, became the first program to
beat a professional Go player, and in 2016 it defeated Lee Sedol, one of the world's top
players. Its successor, AlphaGo Zero, was taught only the rules of the game and developed
its strategy entirely by itself.
GPT-3 is a large language model that was released in 2020 by OpenAI and is capable of
generating high-quality human-like text. These programs, and others, inspired an aggressive
AI boom, where large companies began investing billions in AI research. According to AI
Impacts, about $50 billion annually was invested in "AI" around 2022 in the U.S. alone and
about 20% of the new U.S. Computer Science PhD graduates have specialised in "AI".
About 800,000 "AI"-related U.S. job openings existed in 2022.
2.3. Core Areas of AI: Machine Learning, Natural Language Processing, and Computer
Vision
Machine Learning
Machine learning is the study of programs that can improve their performance on a given
task automatically. It has been a part of AI from the beginning.
There are several kinds of machine learning. Unsupervised learning analyses a stream of
data and finds patterns and makes predictions without any other guidance. Supervised
learning requires a human to label the input data first, and comes in two main varieties:
classification (where the program must learn to predict what category the input belongs in)
and regression (where the program must deduce a numeric function based on numeric
input).
In reinforcement learning, the agent is rewarded for good responses and punished for bad
ones. The agent learns to choose responses that are classified as "good". Transfer learning
is when the knowledge gained from one problem is applied to a new problem. Deep learning
is a type of machine learning that runs inputs through biologically inspired artificial neural
networks for all of these types of learning.
Natural Language Processing
Natural language processing (NLP) allows programs to read, write and communicate in
human languages such as English. Specific problems include speech recognition, speech
synthesis, machine translation, information extraction, information retrieval and question
answering.
Early work, based on Noam Chomsky's generative grammar and semantic networks, had
difficulty with word-sense disambiguation unless restricted to small domains called
"micro-worlds" (due to the common sense knowledge problem). Margaret Masterman
believed that it was meaning and not grammar that was the key to understanding languages,
and that thesauri and not dictionaries should be the basis of computational language
structure.
Modern deep learning techniques for NLP include word embedding (representing words,
typically as vectors encoding their meaning), transformers (a deep learning architecture
using an attention mechanism), and others. In 2019, generative pre-trained transformer (or
"GPT") language models began to generate coherent text, and by 2023 these models were
able to achieve human-level scores on the bar exam, the SAT, the GRE, and many other
real-world tests.
Computer Vision
Computer vision is a field of artificial intelligence (AI) and computer science that focuses on
enabling machines to interpret and understand visual information from the world, similar to
how humans use their eyes and brains to make sense of their surroundings. The goal of
computer vision is to develop techniques that allow computers to recognize and process
objects, people, scenes, and activities in images or videos.
3. Foundational Concepts in AI
3.1 ML basics
3.1.1 What is ML?
Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on building
systems that can learn from and make decisions based on data. Unlike traditional
programming, where a programmer explicitly writes rules and logic, machine learning
systems are trained using data to find patterns and make predictions.
Data is the foundation of machine learning: ML models are trained on datasets, which consist
of examples or instances, each described by a set of features or attributes. The task of a
model is, given an input x, to return an output y by performing some operation on x. How
exactly that is done depends on the type of ML task being solved. ML tasks can be divided
into three main categories: supervised learning, unsupervised learning and reinforcement learning.
In supervised learning, the model is trained on labelled data, which means that each training
example is paired with an output label. Furthermore, supervised learning is partitioned into
classification tasks and regression tasks.
In unsupervised learning, algorithms are trained on unlabeled data, where the desired output
is unknown. Unsupervised learning is often divided into clustering and association.
Supervised learning (SL) is a paradigm in machine learning where input objects (for
example, a vector of predictor variables) and a desired output value (also known as a
human-labelled supervisory signal) train a model. The training data is processed, building a
function that maps new data to expected output values. An optimal scenario will allow for the
algorithm to correctly determine output values for unseen instances. This requires the
learning algorithm to generalise from the training data to unseen situations in a "reasonable"
way (see inductive bias). This statistical quality of an algorithm is measured through the
so-called generalisation error.
In statistical modelling, regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable (often called the outcome or response variable,
or a label in machine learning parlance) and one or more error-free independent variables
(often called regressors, predictors, covariates, explanatory variables or features). The most
common form of regression analysis is linear regression, in which one finds the line (or a
more complex linear combination) that most closely fits the data according to a specific
mathematical criterion.
In linear regression, the model is given an n-dimensional input x and an output y. It first
randomly chooses parameters θ1, θ2, ..., θn, b and updates them during training so that its
prediction is as close as possible to y for any given input.
The proximity of the predicted output and the real output is determined by a loss function. A
common loss function is the least squared error, which tries to minimise the sum of
[(θ1x1 + θ2x2 + ... + θnxn + b) − y]^2 over all training examples.
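To make this concrete, below is a minimal sketch of linear regression trained with the least-squares loss described above, written in Python with NumPy (assumed available); the synthetic data and hyperparameters are purely illustrative.

# Minimal linear regression with gradient descent on the least-squares loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 examples, n = 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 4.0 + rng.normal(scale=0.1, size=100)

theta = np.zeros(3)                      # parameters θ1..θn, initialised at zero
b = 0.0
lr = 0.1

for _ in range(500):
    pred = X @ theta + b                 # θ1*x1 + ... + θn*xn + b
    error = pred - y
    # gradients of the mean squared error with respect to θ and b
    theta -= lr * (X.T @ error) / len(y)
    b -= lr * error.mean()

print(theta, b)                          # should approach [2, -1, 0.5] and 4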
Examples of regression tasks include predicting house prices or forecasting temperatures.
In a classification task, when a model gets a new input (the question mark in pic. 1), it
attempts to classify the input, i.e. determine which class the input belongs to. In pic. 1,
there are 3 classes (crosses, circles and triangles). The model assigns a probability to each
class c_1, c_2 and c_3, then it chooses the class ĉ with the highest probability, i.e.
ĉ = argmax_c P(y = c | x).
Examples include:
● Spam detection
● Image recognition (identifying objects in images)
● Sentiment analysis (positive, negative, neutral)
The example above demonstrates multiclass classification, where there are three or more classes.
The model is given data where every input comes paired with an output, and the goal of the
model is, given a new input without a label, to assign the label that fits the data best. So in the
illustration above, there are 3 different clusters (noughts, crosses and triangles) and the
model is given a new input “?”, but without the label. The model assigns probabilities to each
class and chooses the class with the highest probability. In this case, it should be pretty clear
that the new input belongs to the triangle class, so the model is able to correctly predict the
class of the input.
Another type of classification is binary classification, where there are 2 classes. For instance,
the model can be trained on images with a cat and images without a cat and is then asked to
classify whether the input image belongs to the first class (there is a cat) or the second class
(there is no cat). Since there are only two classes, the decision can be calculated using the
formula:
d(x) = 1 if P(y=1|x) > 0.5, and 0 otherwise.
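As an illustration of this decision rule, the sketch below trains a logistic regression classifier with scikit-learn (assumed available) on toy one-dimensional data and applies the 0.5 threshold to the predicted probability; the feature and data are hypothetical.

# Binary classification: threshold the predicted probability P(y=1|x) at 0.5.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.4], [0.35], [0.8], [0.9], [0.75]])
y = np.array([0, 0, 0, 1, 1, 1])         # 1 = "there is a cat", 0 = "no cat"

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba([[0.6]])[0, 1]     # P(y = 1 | x)
d = 1 if p > 0.5 else 0                  # the decision rule d(x) from the text
print(p, d)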
3.1.5 Decision trees
A common classifier is called “Decision tree”. Decision trees are a very common way to
represent and visualise possible outcomes of a decision or an action based on probabilities.
In a nutshell, a decision tree is a hierarchical representation of the outcome space where
every node represents a choice or an action, with leaves representing states of the outcome.
Decision trees are highly human-readable since they generally follow a simple if-then
structure.
Now, we can visualise decision trees as a treelike structure where each leaf represents a
state or an outcome, and the internal branches represent the various paths that could lead to
that leaf.
In our example above, consider whether or not the weather conditions are OK to play a
football game. The first step would be to start at the root node labelled “Weather”,
representing current weather conditions. We then move onto a branch, and the next nodes,
depending on whether it’s sunny, overcast, or rainy. Finally, we continue down the decision
tree until we have an outcome: could play or couldn’t play.
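A small sketch of the weather example, assuming scikit-learn is available; the integer encoding of the weather and the tiny dataset are invented for illustration only.

# Toy decision tree: weather (0=sunny, 1=overcast, 2=rainy) -> play (1) or not (0).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0], [0], [1], [1], [2], [2]]
y = [1, 1, 1, 1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["weather"]))  # the learned if-then rules
print(tree.predict([[2]]))                           # rainy -> couldn't play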
However, this simplicity is deceptive because decision trees can be used across a broad
range of applications. For example, they are widely used in:
● Medical diagnosis: A patient’s symptoms and medical history give rise to a number of
possible diagnoses and treatments.
● Product or service selection: Different brands of cars, types of insurance products,
etc.
● Conducting market research: Evaluating different solutions for a problem you’re trying
to solve.
Decision trees are simpler than their random forest counterpart, which combines many
decision trees into one model and may be better suited for multi-class classification and
other complex artificial intelligence tasks.
While decision trees and neural networks each represent a different way of classifying (or
grouping) data into clusters that share common characteristics (or features), there are some
key differences between them.
Essentially, decision trees work best for simple cases with few variables, while neural
networks perform better when the data has more complex relationships between features or
values (i.e., it’s “dense”).
As such, decision trees are often used as the first line classification method in simple data
science projects. However, they may not scale well when faced with large amounts of
high-dimensional data, i.e., it’s difficult to interpret meaningful results from their analysis.
https://www.evidentlyai.com/classification-metrics/explain-roc-curve
During training, the model's results are stored for evaluation. A common evaluation metric is
called the ROC AUC metric.
The ROC AUC score is the area under the ROC curve. It sums up how well a model can
produce relative scores to discriminate between positive or negative instances across all
classification thresholds. The ROC AUC score ranges from 0 to 1, where 0.5 indicates
random guessing, and 1 indicates perfect performance.
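The sketch below shows how the ROC AUC score could be computed with scikit-learn (assumed available) from hypothetical true labels and model scores.

# ROC AUC: how well the model's scores separate positives from negatives.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.65, 0.9, 0.3]   # model's predicted probabilities

print(roc_auc_score(y_true, y_score))       # 1.0 here: every positive outscores every negative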
3.1.8 Clustering
In this type of task, the model is trained on unlabelled data. The number of clusters,
commonly denoted by k, is provided to the model with the goal of splitting the data into k
clusters. Informally, a cluster is a set of points that all lie close to one another.
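A minimal clustering sketch, assuming scikit-learn is available: k-means is one common algorithm for splitting unlabelled points into k clusters, and the toy points below are illustrative.

# k-means clustering of unlabelled 2D points into k = 2 clusters.
from sklearn.cluster import KMeans

points = [[1, 1], [1.2, 0.8], [0.9, 1.1],    # one group of nearby points
          [8, 8], [8.1, 7.9], [7.8, 8.2]]    # another group far away

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # e.g. [0 0 0 1 1 1]: each point's cluster
print(kmeans.cluster_centers_)   # the centre of each cluster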
The goal of association is to find rules that describe large portions of the data. For instance,
it is often used for market basket analysis (identifying items frequently bought together) and
recommendation systems. In any given transaction with a variety of items, association rules
are meant to discover the rules that determine how or why certain items are connected.
Association rules are found by searching the data for frequent if-then patterns and by using
the criteria of Support and Confidence to identify the most important relationships. Support
measures how frequently an itemset appears in the given data, while Confidence measures
how often the if-then statement is found to be true. A third criterion, called Lift, compares
the actual Confidence with the Confidence that would be expected if the items were
independent, showing how much more often than expected the if-then statement holds.
Association rules are computed from itemsets, which consist of two or more items. If rules
were built by analysing all possible itemsets in the data, there would be so many rules that
they would carry little meaning. That is why association rules are typically built only from
itemsets that are well represented in the data.
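The small sketch below illustrates Support, Confidence and Lift for a single hypothetical rule {bread} → {butter}, computed directly from toy transactions in plain Python.

# Support, Confidence and Lift for the rule {bread} -> {butter} on toy data.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

support_bread = sum("bread" in t for t in transactions) / n
support_butter = sum("butter" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_both / support_bread    # how often butter appears when bread does
lift = confidence / support_butter           # observed vs expected confidence

print(support_both, confidence, lift)        # 0.6, 0.75, 0.9375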
3.2 Natural Language Processing
1. Speech Recognition: The task of converting spoken language into text. This
involves capturing the nuances of human speech, such as accents, intonations, and
variations in speed. Technologies like voice assistants (e.g., Siri, Google Assistant)
are built on advanced speech recognition systems.
2. Text Classification: This involves categorising text into predefined labels. Text
classification can be applied in spam detection, sentiment analysis, or document
categorization. For instance, an email service can classify incoming emails as spam
or not spam, while social media platforms may classify posts as positive, negative, or
neutral in sentiment.
3. Natural-Language Understanding (NLU): This is one of the core goals of
NLP—enabling machines to understand and interpret human language in a
meaningful way. NLU goes beyond merely processing words; it involves grasping the
context, intent, and nuances behind them. For example, an NLU system must
comprehend the difference between "I can’t wait for the weekend!" and "I can't stand
waiting in line."
4. Natural-Language Generation (NLG): The process of generating coherent,
human-like text from structured data. This is used in applications like chatbots,
automated reporting systems, or even text summarization. The objective is to
produce language that is grammatically correct, contextually relevant, and
natural-sounding.
NLP powers a wide range of real-world applications:
1. Search Engines: Search engines like Google use NLP to understand user queries
and provide relevant results. Modern search engines don’t just match keywords; they
try to grasp the intent behind a query, thanks to advanced NLP models.
2. Machine Translation: Tools like Google Translate rely on NLP to convert text from
one language to another. These systems have greatly improved with neural models,
providing more accurate and contextually relevant translations.
3. Chatbots and Virtual Assistants: NLP enables chatbots and voice-activated
assistants to interact with users in a conversational manner. They can answer
questions, provide recommendations, and even hold basic conversations, simulating
human-like dialogue.
4. Sentiment Analysis: Businesses use sentiment analysis to gauge public opinion
from social media, customer reviews, and other sources of unstructured text. By
analysing the emotional tone behind words, companies can better understand
customer satisfaction or market trends.
5. Text Summarization: Automatic summarization tools can condense long documents
or articles into concise summaries. These tools are useful in industries like
journalism, legal, and research, where large volumes of information need to be
processed quickly.
https://neptune.ai/blog/tokenization-in-nlp
The first step in any NLP project is text preprocessing. Preprocessing input text simply
means putting the data into a predictable and analyzable form. It’s a crucial step for building
an amazing NLP application.
Among these, the most important step is tokenization. It’s the process of breaking a stream
of textual data into words, terms, sentences, symbols, or some other meaningful elements
called tokens. A lot of open-source tools are available to perform the tokenization process.
Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of
your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of
information that can be considered as discrete elements. The token occurrences in a
document can be used directly as a vector representing that document.
This immediately turns an unstructured string (text document) into a numerical data structure
suitable for machine learning. They can also be used directly by a computer to trigger useful
actions and responses. Or they might be used in a machine learning pipeline as features
that trigger more complex decisions or behaviour.
Tokenization can separate sentences, words, characters, or subwords. When the text is split
into sentences, it’s called sentence tokenization. For words, it’s called word tokenization.
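The sketch below illustrates sentence, word and subword tokenization. The subword part assumes the Hugging Face transformers library and its bert-base-uncased tokenizer are available; these are illustrative choices, not the only tools for the job.

# Sentence, word and subword tokenization of a short text.
from transformers import AutoTokenizer

text = "Natural Language Processing is fun. Tokenizers split text into tokens."

# Sentence and word tokenization with plain Python string operations.
sentences = [s.strip() for s in text.split(".") if s.strip()]
words = text.split()
print(sentences)
print(words)

# Subword tokenization with a pre-trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization"))   # e.g. ['token', '##ization']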
Text encoders and decoders are an essential part of bridging the gap between
human-readable text and machine-understandable representations. Text encoders process
tokens, while text decoders generate them. These mechanisms enable various NLP tasks
such as translation, summarization, and text generation.
Decoders
Decoders take the vector representation produced by the encoder and transform it back into
human-readable text. The decoder is crucial in tasks like machine translation and text
generation, where you need to generate meaningful, grammatically correct sentences from
encoded data.
Encoders
An encoder transforms input text into a numerical or vector representation. This is essential
because computers process numbers, not words. The encoder captures the semantics
(meaning) of the text in a way that allows the machine to understand its structure and
context.
1. Tokenization: The first step in encoding text is breaking it into smaller pieces
(tokens), such as words or subwords. For example, "Natural Language Processing"
might be tokenized into ["Natural", "Language", "Processing"] or into smaller units like
["Nat", "ur", "al"] depending on the tokenizer used.
2. Embedding: Each token is then converted into a high-dimensional vector
(embedding). This embedding captures semantic relationships between words. For
example, "cat" and "dog" might have vectors that are close to each other because
they are semantically similar.
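A minimal sketch of these two encoding steps, assuming PyTorch and the Hugging Face transformers library with the bert-base-uncased model (illustrative choices): the text is tokenized and the encoder returns one embedding vector per token.

# Tokenize a sentence and run it through a pre-trained encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Natural Language Processing", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding vector per token (including special tokens like [CLS] and [SEP]).
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 5, 768])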
Transformers
https://huggingface.co/learn/nlp-course/en/chapter1/4
https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
Transformers are a type of neural network that underlies the current boom in AI. They
were first introduced in a 2017 paper called “Attention Is All You Need.”
Most transformers are trained as language models. This means they have been trained on
large amounts of raw text in a self-supervised fashion, so humans are not needed to label
the data.
The goal of a transformer is to take in a piece of text and predict the next word.
● The input text is broken up into tokens. Each token is associated with a
high-dimensional vector called an embedding.
● Directions in the embedding space correspond to semantic meanings. For example,
one direction might correspond to gender, where adding a certain step in this space
can take you from the embedding of a masculine noun to the embedding of the
corresponding feminine noun.
● The transformer progressively adjusts the embeddings so they encode contextual
meaning rather than just the meaning of an individual word.
● The key piece in a transformer is the attention mechanism. One way to think about
how attention works is to imagine each noun asking if there are any adjectives sitting
in front of it, and the adjectives answering that they are an adjective and giving their
position. For example, in the phrase “a fluffy blue creature,” the word “creature” would
ask if there are any adjectives in front of it, and the words “fluffy” and “blue” would
respond.
● Each attention block consists of multiple attention heads that run in parallel. For
example, GPT-3 uses 96 attention heads inside each block. This allows the model to
learn many different ways that context changes meaning.
● The attention mechanism is extremely parallelizable, which means it can be run very
efficiently on GPUs. This is a major factor in the success of transformers, as deep
learning has shown that scale leads to significant improvements in model
performance.
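To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention; the token count, vector size and random inputs are illustrative, and real transformers add learned projection matrices, multiple heads and many layers around this core operation.

# Scaled dot-product attention: each token's output is a weighted mix of value vectors.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                             # relevance of each token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8                      # e.g. the 4 tokens of "a fluffy blue creature"
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8): one updated vector per token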
3.2.4 MLMs
Masked Language Models (MLMs) are a type of language model used in NLP tasks. They
are primarily associated with models like BERT (Bidirectional Encoder Representations from
Transformers) and are designed to predict missing or "masked" words in a sentence based
on the context provided by the other words in the sentence.
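A minimal sketch of masked-word prediction, assuming the Hugging Face transformers library is installed; the model name and example sentence are illustrative.

# Predict the masked word with a pre-trained masked language model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))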
https://arxiv.org/abs/1810.04805
In addition to the MLM, BERT also uses a next sentence prediction (NSP) task to jointly
pre-train text-pair representations. These pre-training tasks allow BERT to achieve
state-of-the-art performance on a variety of natural language processing tasks.
BERT can be fine-tuned for text classification by simply adding a classification layer on top
of the pre-trained model. The [CLS] token, a special symbol added in front of every input
example, is used as the aggregate sequence representation for classification tasks. The final
hidden state corresponding to the [CLS] token is fed into an output layer for classification.
For example, in a sentiment analysis task, the output layer would have two neurons, one for
positive sentiment and one for negative sentiment.
BERT has been shown to be very effective for text classification, achieving state-of-the-art
results on a number of benchmark datasets. Overall, BERT's novel pre-training approach,
using the MLM and NSP tasks, and its ability to leverage large model sizes contribute to its
superior performance in text classification and other NLP tasks.
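The sketch below shows how a classification head could be placed on top of a pre-trained BERT encoder, assuming the Hugging Face transformers library and PyTorch; the head here is freshly initialised, so in practice it would still need fine-tuning on labelled data, and the sentiment example is illustrative.

# BERT with a 2-class classification head reading the [CLS] representation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # e.g. positive vs negative sentiment

inputs = tokenizer("I really enjoyed this film!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))     # probabilities for the two classes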
In order to measure the model’s performance, various performance metrics are used. The
model is tested on unseen data in order to ensure its overall improvement in general. In
classification tasks, several metrics help measure performance, including accuracy,
precision, recall, and the F1 score.
1. Accuracy:
○ The ratio of correctly predicted observations to the total observations.
○ Formula: Accuracy = (True Positives + True Negatives) / Total Observations
○ Limitations: Accuracy alone can be misleading, especially in imbalanced
datasets (e.g., a dataset with 95% of one class and 5% of another).
2. Precision:
○ Measures how many of the predicted positive results are truly positive.
○ Formula: Precision = True Positives / (True Positives + False Positives)
○ High precision indicates a low false positive rate.
3. Recall (Sensitivity or True Positive Rate):
○ Measures how many of the actual positive cases were correctly identified by
the model.
○ Formula: Recall = True Positives / (True Positives + False Negatives)
○ High recall indicates a low false negative rate.
The F1 score is a metric that combines both precision and recall into a single value,
balancing them in a harmonic mean. It’s particularly useful in situations where we want to
balance the false positives and false negatives, and it’s often used when there’s class
imbalance.
F1 score = 2*precision*recall/(precision+recall)
Since the F1 score uses the harmonic mean, it gives more weight to lower values, so if
either precision or recall is low, the F1 score will be low. This makes it a better measure than
accuracy when the class distribution is imbalanced.
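A minimal sketch of these four metrics computed with scikit-learn (assumed available) on hypothetical predictions.

# Accuracy, precision, recall and F1 for a small set of binary predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall = 0.75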
3.3. Computer Vision Basics
3.3.1. Introduction to Computer Vision
Computer Vision is a rapidly advancing field of artificial intelligence (AI) that enables
machines to interpret and understand the visual world in a manner similar to humans. By
leveraging techniques from machine learning, deep learning, and image processing,
computer vision seeks to automate the extraction of meaningful information from digital
images or videos. This technology is at the core of many modern applications, from facial
recognition systems and autonomous vehicles to medical imaging and augmented reality.
At its core, computer vision aims to replicate the human visual system by teaching machines
to:
1. Recognize objects: Identify and classify objects within an image (e.g., detecting a
car in a street scene).
2. Understand context: Interpret the relationships between objects and the overall
scene (e.g., determining that a person is walking on a sidewalk rather than in the
road).
3. Reconstruct environments: Create 3D models from 2D images (e.g., mapping an
indoor space using a set of photographs).
4. Extract features: Analyse the structural and compositional aspects of an image,
such as shapes, colours, textures, and patterns.
The applications of computer vision span various industries and have a significant impact on
daily life.
Computer vision is transforming the way machines interact with the world, unlocking new
possibilities for automation, safety, and efficiency. As the field continues to evolve, it is
expected to play a critical role in a wide range of future technologies.
Image generation is a subfield of artificial intelligence (AI) and computer vision focused on
creating new, realistic, or stylized images from input data, often guided by machine learning
models. The technology behind image generation has rapidly advanced, especially with the
rise of deep learning and Generative Adversarial Networks (GANs). These advances
enable machines to create everything from hyper-realistic photos to abstract artwork, with
applications ranging from entertainment and design to medical imaging and content creation.
Generative Models:
Generative models are a class of machine learning models that can create new data
instances that resemble the training data. These models can generate new images that look
similar to real ones but are entirely novel.
Generative Adversarial Networks (GANs):
GANs are arguably the most well-known approach to image generation. They consist of two
neural networks:
● Generator: This network tries to create fake images from random noise that look as
realistic as possible.
● Discriminator: This network evaluates images and classifies them as either "real"
(from the training set) or "fake" (generated by the generator).
The two networks are trained in an adversarial process: the generator improves by trying to
fool the discriminator, while the discriminator improves by becoming better at detecting fake
images. Over time, the generator learns to create highly realistic images.
GANs have been used to generate images of human faces, landscapes, artworks, and more.
They are also used in super-resolution (enhancing image resolution) and style transfer
(combining the content of one image with the style of another).
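Below is a highly simplified PyTorch sketch of this adversarial training loop (PyTorch assumed available). The networks, data and hyperparameters are illustrative stand-ins; real image GANs use convolutional architectures and far larger datasets.

# Tiny GAN: a generator maps noise to fake samples, a discriminator scores real vs fake.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
discriminator = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(32, 2) + 3.0          # stand-in for real training samples

for step in range(100):
    # Discriminator step: label real samples 1 and generated samples 0.
    fake = generator(torch.randn(32, 16)).detach()
    d_loss = (bce(discriminator(real_data), torch.ones(32, 1))
              + bce(discriminator(fake), torch.zeros(32, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator call fakes "real".
    fake = generator(torch.randn(32, 16))
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()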
Latent Space:
The latent space is a key concept in generative models like GANs and VAEs. It refers to a
high-dimensional space where each point represents a potential image. The idea is that
images are represented as compressed vectors in this space, and by navigating or
manipulating this space, we can generate new images.
For example, you can generate a new image by selecting a point in the latent space or by
interpolating between two points, creating a blend between two images (e.g., transitioning
between a picture of a cat and a dog).
In conditional image generation, the generation process is guided by additional input, such
as labels, textual descriptions, or even other images. Conditional models generate images
that match the given conditions.
Examples include:
● Text-to-Image Generation: Models like DALL·E and Stable Diffusion can generate
images from text descriptions. For example, given the prompt "a dog riding a
skateboard," the model can generate images that match that description.
● Image-to-Image Translation: This task involves converting one type of image into
another. For example, turning sketches into photorealistic images or black-and-white
photos into coloured ones (e.g., pix2pix).
Neural style transfer is a method for applying the artistic style of one image to the content
of another. For example, you could take a photograph and a famous painting and use neural
style transfer to blend the two, so the photograph looks like it was painted in the style of Van
Gogh.
This works by using a pre-trained convolutional neural network (CNN) to separate the
content and style of images and then recompose them into a new image.
Diffusion Models:
Diffusion models are a newer approach to generative modeling that creates images by
gradually adding noise to an image and then learning to reverse this noise. These models
can generate images of high fidelity and are used in models like DALL·E 2 and Stable
Diffusion.
The process involves:
● Forward process: Gradually adding noise to a clean image until it becomes pure
noise.
● Reverse process: Training a model to remove the noise step by step, eventually
producing a clear, realistic image from random noise.
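A minimal NumPy sketch of the forward process on a one-dimensional stand-in for an image; the noise schedule is an illustrative choice, and the reverse process would be carried out by a trained neural network rather than shown here.

# Forward diffusion: mix more and more noise into a clean signal over many steps.
import numpy as np

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 2 * np.pi, 64))        # stand-in for a clean image
betas = np.linspace(1e-4, 0.02, 1000)            # noise schedule (illustrative)
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    # Sample the noised version x_t directly from x_0 at step t.
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

print(np.abs(add_noise(x, 10) - x).mean())       # early step: still close to x
print(np.abs(add_noise(x, 999) - x).mean())      # late step: essentially pure noise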
Autoencoders:
An autoencoder consists of two main parts: an encoder, which compresses the input into a
lower-dimensional latent representation, and a decoder, which reconstructs the original input
from that representation.
Variational Autoencoders:
VAEs modify this basic framework by making the latent space probabilistic rather than
deterministic. Instead of mapping input data to a single point in the latent space, VAEs map
the data to a distribution over the latent space. This probabilistic nature allows VAEs to
sample different points from this distribution and decode them into new, unseen data points,
enabling the generation of novel data.
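A minimal PyTorch sketch of an autoencoder (PyTorch assumed available); layer sizes are illustrative. A VAE would replace the single latent vector with a predicted mean and variance and sample the latent vector from that distribution.

# Autoencoder: encoder compresses the input, decoder reconstructs it from the latent vector.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(8, 784)                     # stand-in for 8 flattened 28x28 images
z = encoder(x)                             # 16-dimensional latent representation
x_hat = decoder(z)                         # reconstruction of the input
loss = nn.functional.mse_loss(x_hat, x)    # reconstruction loss used for training
print(z.shape, loss.item())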
3.3.4 U-Net
https://en.wikipedia.org/wiki/U-Net
U-Net is a convolutional neural network that was developed for image segmentation. The
network is based on a fully convolutional neural network whose architecture was modified
and extended to work with fewer training images and to yield more precise segmentation.
U-Net takes an input image and learns to label each pixel by determining which part of the
image belongs to a specific object or class. For example, it can be used to separate objects
like cars, trees, or buildings in a photograph or identify tumours in medical images.
3.3.5 Stable Diffusion
Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion
techniques. It is primarily used to generate detailed images conditioned on text descriptions,
though it can also be applied to other tasks such as inpainting, outpainting, and generating
image-to-image translations guided by a text prompt. Its development involved researchers
from the CompVis Group at Ludwig Maximilian University of Munich and Runway, with a
computational donation from Stability AI and training data from non-profit organisations.
Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural
network. Its code and model weights have been released publicly, and it can run on most
consumer hardware equipped with a modest GPU with at least 4 GB VRAM.
Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an
optional text encoder. The VAE encoder compresses the image from pixel space to a smaller
dimensional latent space, capturing a more fundamental semantic meaning of the image.
Gaussian noise is iteratively applied to the compressed latent representation during forward
diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from
forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder
generates the final image by converting the representation back into pixel space.
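As the weights are publicly available, an image can be generated in a few lines, for example with the Hugging Face diffusers library (assumed installed); the checkpoint name, prompt and use of a CUDA GPU below are illustrative assumptions.

# Generate an image from a text prompt with a released Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe("a dog riding a skateboard").images[0]
image.save("dog_skateboard.png")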