DEEP LEARNING UNIT 5 Transfer Learning

UNIT V
Transfer Learning
1. What is Transfer Learning
2. Types
3. Methodologies
4. Diving into Transfer Learning
5. Challenges

Introduction

• Learning is a process: a very involved process that helps in acquiring new knowledge and skills.
• While learning has historically been associated with sentient, evolved life forms, since the advent of computers and subsequent research in artificial intelligence, we now associate the process of learning with machines as well.
• Virtual assistants, AI-powered game players and even autonomous vehicles are a reality now.
• It is no longer surprising to include machines in the list of entities that can learn.

✓ As part of evolution, humans have refined the way we learn and adapt to different scenarios over the years.
✓ We have acquired an innate ability to use skills and knowledge across different tasks without always starting from zero.
✓ For instance, common examples of this ability to transfer knowledge across tasks are "knowing how to boil water → one can learn to make tea", "knowing how to ride a bicycle → one can learn to ride a motorbike easily", or even "knowing how to do subtraction → one can learn to perform division".

• There are many such things in real life where we leverage what we learnt for solving
one specific problem and apply it to solve a number of other related problems.
• We do this consciously and sub-consciously all the time.
• These examples point to our ability to transfer knowledge across different yet related
tasks. This ability is proportional to how similar or dissimilar the tasks are.

➢ As data scientists, we leverage machine learning and deep learning algorithms to model real-world processes and develop systems that learn from data.

➢ For instance, in the case of a classification model to differentiate between cats and dogs, we train a model on a dataset of images and their corresponding labels.
➢ The classification model picks up patterns in the input dataset of images and tries to map them to the output labels.
➢ Thus, we end up with a system which we can use to identify whether a new image is of a cat or a dog.
➢ This is one of the numerous ways we have developed over the years to make machines learn.
➢ As a next step, what if we need a system which can classify whether an input image is that of a human or a dog?
➢ The answer, more often than not, is that we train another classifier from scratch, which needs a dataset with labelled images of humans and dogs.
➢ Though this is the usual way of solving such problems, it does feel a bit different from the human way of learning.

1. What is Transfer Learning


• Artificial intelligence as a field has grown by leaps and bounds in the past decade or so.
• We not only have immense investment, interest and hype powering this data-driven fourth industrial revolution, but also ever-increasing yet affordable compute and data volumes.
• In this section, we will formally introduce the concept of transfer learning and understand why Andrew Ng termed it the "next driver of commercial success" [1].

✓ It might seem from Andrew Ng's quote that transfer learning is a new concept.
✓ On the contrary, it did not emerge as recently as 2016.
✓ In fact, it is believed to have been first introduced at a workshop titled "Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems", held at the NIPS conference in 1995.

• Learning algorithms are designed to solve problems such as classification, regression, clustering and so on.


• As previously mentioned, when presented with a problem to solve, we leverage one of the known algorithms/concepts to build a model specific to the given task.

✓ We train a specific model for each task in isolation, irrespective of the similarity in tasks, datasets or target variables.
✓ This is different from how we, as humans, are able to leverage knowledge from one task to solve a related one.
✓ Transfer learning is a concept which enables trained models to share their knowledge and thereby improve outcomes.
✓ Transfer learning, thus, is a framework for extending the capabilities of known systems to understand and share knowledge among tasks, as illustrated at a high level in figure 1.1.
✓ The figure highlights the siloed traditional approach, where we train a separate model for each task.
✓ We will get into the details shortly; as the figure shows, the transfer learning setup (as compared to the traditional approach) leverages knowledge from source tasks to improve learning on the target task.

• Let us understand the core idea of transfer learning with the help of a simple example.
• Assume we have been asked to develop a classifier to identify different breeds of dogs.


• For the purpose of this example, we denote this dog breed dataset as our target dataset.
• Now, an experienced machine learning practitioner such as yourself would try to tackle this classification task using computer vision tools such as convolutional neural networks (CNNs).
• However, as is the case with most real-world scenarios, there is a twist in the problem statement: the dataset we are provided is not just small in size, say a few thousand samples, but also has an unbalanced class distribution.
• Building a CNN-based classifier would have worked perfectly if we had enough data samples to accurately represent each dog breed (we know how data-hungry deep learning models can be).
• But with only a small, unbalanced dataset, the classifier will achieve dismal performance.

✓ Training a CNN-based classifier from scratch for our dog breed dataset is depicted in figure 1.2.
✓ It uses the typical setup consisting of training samples and a CNN model.
✓ As our dataset is small and imbalanced, the resulting performance is not good.
✓ Transfer learning is our rescue card for precisely such scenarios (there are others, which we will come to in subsequent sections).
✓ Instead of building a CNN from scratch, we can make use of one of the various state-of-the-art models trained on a similar dataset, such as ImageNet.
✓ However, some obvious questions arise: why ImageNet? And why a pretrained model?


• ImageNet is a massive dataset comprising millions of images spanning thousands of categories/classes.
• This provides models with enough variation to learn both generic and class-specific features (having dogs as one of the classes in ImageNet is a nice-to-have condition, not a necessary one).
• The reasons for selecting a pretrained model (such as VGG-16, Resnet, Inception, etc.) are more intuitive and in line with the idea of sharing knowledge.
• These pretrained models have been optimized to perform well on ImageNet as their source domain.
• This in turn implies that these models have filters which are well trained to identify different aspects of an image.
• It is easier to leverage such pretrained models to adapt to a different yet related domain, and to train a model which performs better despite having a smaller target dataset with a skewed distribution of classes.
• Thus, we leverage a pretrained model, such as VGG-16 trained on ImageNet as the source dataset, to transfer learn on the dog breed dataset as the target dataset.
• This enables us to achieve a performance boost as compared to a CNN trained from scratch (see figure 1.2).


✓ Let's now formally introduce the concept of transfer learning.
✓ We will leverage the work of Pan and Yang titled "A Survey on Transfer Learning" [2] to present the formal definition here.
✓ Let's define a domain, D, as a tuple {χ, P(X)}, where χ represents the feature space and P(X) is the marginal probability distribution of the data. Therefore,

D = {χ, P(X)}     (Equation 1.1)

where X = {x1, x2, ..., xn} and each xi ∈ χ.

• Similarly, we define a task T as a tuple {γ, θ}, where γ is the label space and θ is the objective function.
• The objective function θ can be denoted as P(Y|X), i.e. the conditional probability. Therefore,

T = {γ, θ} ⟹ T = {γ, P(Y|X)}     (Equation 1.2)

where Y = {y1, y2, ..., yn} and each yi ∈ γ.

• Using equations 1.1 and 1.2, we define transfer learning as a process to improve the outcome of the target objective function θT for the target task TT in the target domain DT, using knowledge from the source task TS in the source domain DS.
• For the transfer to be non-trivial, either DS ≠ DT or TS ≠ TT.

✓ This framework leads to the following mismatch scenarios for transfer learning to deal with.
✓ When source and target domains are different (DS ≠ DT):

o χS ≠ χT

• The feature spaces of the source and target domains are different.
• For instance, if our task is to perform spam vs. non-spam classification of emails, the source might consist of emails in English while the target domain might consist of emails in Spanish.
• This is referred to as feature space mismatch.


o P(XS) ≠ P(XT)

✓ The marginal probability distributions of the source and target domains are different (while the feature space is the same).
✓ Continuing with the same example of email classification, in this case even though the emails are in the same language, the source might be an official/corporate email dataset (which typically has shorter emails with documents and presentations as attachments) while the target may be a personal email dataset (which has a very different distribution).
✓ This is referred to as marginal probability mismatch.

• When source and target tasks are different (TS ≠ TT):

o (γS ≠ γT)

✓ The label spaces of the source and target tasks are different.
✓ Continuing with the same example of email classification, our source label space might be binary (spam vs. non-spam) while the target might be multi-class (primary, finance, social, promotions, etc.).
✓ This is referred to as label space mismatch.

o P(YS|XS) ≠ P(YT|XT)

• The conditional probability distributions of the source and target domains are different.
• Continuing with the same example of email classification, the source might be an official/corporate email dataset (which typically has a lot less spam) while the target may be a personal email dataset (which has a much larger proportion of spam emails, and hence a different distribution).
• This is referred to as conditional probability mismatch.

✓ To summarize, transfer learning is a powerful concept for leveraging knowledge of a source domain to improve outcomes in a target domain.
✓ This concept can broadly be categorized into four different mismatch scenarios.
✓ These scenarios assist in understanding the areas where transfer learning can be applied, along with developing different methods and techniques to apply it successfully.


2. Transfer Learning Types

• We now understand the core idea behind transfer learning.
• We also went through a few possible scenarios where transfer learning needs to handle different conditions (the mismatch conditions).
• In this section, we will focus on three broad types of transfer learning scenarios.
• This categorization is based on the different relationship settings between source and target domains.
• We will also touch upon a few related topics such as zero-shot/few-shot learning, multi-task learning and even domain adaptation.
• We discuss how these are different yet related topics to transfer learning.
• Figure 1.3 depicts the overall categorization of the different types of transfer learning.

Inductive Transfer:

• In this scenario, the source and target tasks are different, with the domains being similar or related.
• This scenario is characterized by the presence of labels in both source and target domains.


• As the name suggests, the target labels are required to induce the knowledge from the source domain for use in the target.
• Typically, the source domain has a larger labelled dataset while the target domain has limited labelled samples.
• Thus, inductive transfer helps in improving performance in the target domain by inducing the target objective function with knowledge from the source domain.
• This is the most common form of transfer learning used in real-world settings.
• It is closely related to aspects of multi-task learning (we will cover this shortly).

Unsupervised Transfer:

• This scenario is similar to inductive transfer, the only difference being the absence of labels in both source and target domains.
• The focus of this category is to handle transfer learning in unsupervised scenarios such as dimensionality reduction, clustering, density estimation, etc.

Transductive Transfer:

• In this scenario, the source and target tasks are the same while the corresponding domains are different.
• This scenario is characterized by the absence of labelled samples in the target domain, and closely matches the feature-space and marginal-probability mismatch scenarios explained earlier.
• It is also similar to the related but slightly different concept of domain adaptation.

✓ Transfer learning is intuitive yet powerful.
✓ Transfer learning as a concept has been studied independently by different research groups over different periods of time.
✓ This has led to different definitions comprising related yet different concepts.
✓ The definitions and categorization presented so far follow the general and widely accepted definition of the concept (see figure 1.3).


✓ We next present a few additional concepts and terms which are often either used interchangeably with transfer learning or treated as related concepts:
i. Domain Adaptation
ii. Multi-task Learning
iii. Zero-shot and One-shot Learning
iv. Bayesian Methods and Inductive Transfer

2.1 Domain Adaptation

• A domain, as described in equation 1.1, is a tuple of feature space and marginal probability distribution, denoted as D = {χ, P(X)}.
• Domain adaptation is a specific scenario where the label space remains the same, yet the marginal probabilities of the source and target domains differ, i.e.

P(XS) ≠ P(XT)

• This is similar to the marginal-probability mismatch scenario we discussed in the previous section.

✓ Let us consider the example of sentiment analysis.
✓ Here, the source domain might have a dataset of movie reviews while the target domain consists of book reviews.
✓ We cannot directly apply the model trained on the source dataset to the target, but we certainly can adapt it with the help of transfer learning methods/techniques.
✓ Domain adaptation is closely related to transductive transfer as well.

• Works such as "An Introduction to Domain Adaptation and Transfer Learning" [3] by Kouw et al. and "A Primer on Domain Adaptation" [4] by Lemberger et al. provide a detailed account of domain adaptation and related concepts.

2.2 Multi-task Learning

✓ The typical way of preparing models for specific tasks is to do so in silos.
✓ We train and fine-tune for a given objective to achieve the best possible outcome.


✓ Similar to transfer learning, multi-task learning (MTL) is also a paradigm where we leverage knowledge across tasks and domains.
✓ Yet, unlike transfer learning, MTL trains on all related tasks together instead of training on the source and then transferring to the target (see figure 1.4).

• The motivation behind multi-task learning comes from the fact that related tasks often carry additional information that can help us learn the main objective better.
• For example, let us assume we are given the task of preparing a sentiment classifier.
• Though we can train one model to perform exceptionally well, it helps even more to train for a task such as Named-Entity Recognition (NER) in parallel.
• If we compare it with the human way of learning, this intuitively makes sense as well.
• Such a model would understand language constructs and semantics along with sentiment classification.
• A person who understands language constructs is likely to be better at identifying the sentiment of a given statement than one who has only been taught sentiment classification.
• Caruana (1998) [5] summarizes this as a method for improving generalization by leveraging domain-specific information contained in the training signals (features) of related tasks.

✓ Multi-task learning is most similar to the inductive transfer scenario we discussed earlier.


✓ Inductive transfer is also one of the reasons why MTL works.
✓ Caruana discusses how MTL acts as a form of regularization to assist in improved learning.
✓ MTL is related to transfer learning but is a field of study in itself.
✓ The advantage of improved learning on complex tasks by leveraging related tasks has made MTL a popular method in a number of domains.
✓ "An Overview of Multi-Task Learning in Deep Neural Networks" [6] by Ruder explains this in good detail, particularly in the NLP space, where most recent works focus on training for multiple tasks in parallel as opposed to focusing on just a single one.
✓ For example, transformer models such as BERT, T5, XLNet, etc. train on NER, sentiment analysis, question answering and a number of other language tasks in parallel. A minimal multi-task setup is sketched below.
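Below is a minimal, illustrative sketch of a multi-task setup using the Keras functional API: a shared encoder feeds two task-specific heads that are trained jointly. The feature dimension, layer sizes and the two tasks (sentiment plus an auxiliary topic label) are assumptions made purely for illustration, not the setup used by BERT or T5.

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(128,), name="text_features")    # assumed pre-extracted text features
shared = layers.Dense(64, activation="relu")(inputs)           # shared representation learnt by both tasks
shared = layers.Dense(32, activation="relu")(shared)

sentiment = layers.Dense(2, activation="softmax", name="sentiment")(shared)   # main task head
topic = layers.Dense(5, activation="softmax", name="topic")(shared)           # related auxiliary task head

mtl_model = Model(inputs=inputs, outputs=[sentiment, topic])
mtl_model.compile(optimizer="adam",
                  loss={"sentiment": "sparse_categorical_crossentropy",
                        "topic": "sparse_categorical_crossentropy"},
                  loss_weights={"sentiment": 1.0, "topic": 0.5})   # auxiliary task weighted lower

Training such a model on both label sets simultaneously lets the gradients from the auxiliary task act as the additional training signal described above.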

2.3 Zero-shot and One-shot Learning

• As we know, machine learning algorithms are data-hungry systems.
• Deep learning architectures are even more so.
• This is counter-intuitive to the human way of learning, where we can generalize from one or a few examples of a task.
• For example, once a child is shown what a flower looks like, he or she can easily identify a different type of flower. Such is not the case with ML/DL algorithms.

✓ Zero-shot and one-shot learning are thus extreme variants of transfer learning which help in learning a task using no training data and one (or a few) training samples, respectively.
✓ The term one-shot learning seems to have been coined by Fei-Fei Li and her team in their seminal work titled "One-Shot Learning of Object Categories".
✓ They presented a new and improved method for representation learning which helped the model transfer knowledge using only a few training examples.
✓ At the other extreme, zero-shot learning is described by Ian Goodfellow and co-authors in their book on deep learning as a scenario where, instead of just learning from the input and output random variables, the model is supposed to learn a random variable describing the task itself.


✓ Thus, the model is trained for the conditional probability P(y | x, T), where x and y are the input and target variables and T represents the task itself.

• Zero-shot and one-shot (or few-shot) paradigms come in handy in real-world scenarios where there typically aren't many training/labelled examples available.
• Domains such as machine translation benefit the most from these advances, as they enable us to prepare translation models for target languages with virtually no labelled examples.
• Since their introduction, these methods have been researched, applied and improved quite a bit.

2.4 Bayesian Methods and Inductive Transfer

✓ Bayesian learning methods require a special mention in this context due to the way they are formulated.
✓ Bayesian methods, such as Naïve Bayes, model probability distributions and impose the conditional independence assumption to simplify the models. Another important aspect of Bayesian methods is the use of prior distributions: a prior distribution helps us describe the domain before seeing the training dataset.
✓ A strong prior can positively impact the posterior distribution and the overall results.
✓ This use of prior distributions is similar to the transfer learning setup, where we leverage knowledge from a source domain to improve the target task.
✓ In simple words, we can leverage knowledge from the source domain as a strong prior for the target task when using Bayesian learning methods, as sketched below.
✓ This concept is discussed in detail in the works of Marx et al. [7] and Dai et al. [8].
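A minimal, hypothetical sketch of this idea with scikit-learn's MultinomialNB: class frequencies estimated on a large source dataset are injected as the class prior for a Naive Bayes model fitted on a small target dataset. The datasets below are synthetic stand-ins created only so the snippet runs; they are not from any referenced work.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
y_source = rng.integers(0, 2, size=5000)               # labels from a large, related source dataset (synthetic)
X_target = rng.integers(0, 5, size=(50, 20))           # small labelled target dataset (synthetic counts)
y_target = np.array([0, 1] * 25)                       # target labels (synthetic)

source_prior = np.bincount(y_source) / len(y_source)   # source-domain class frequencies act as the prior
clf = MultinomialNB(class_prior=source_prior)          # inject the source knowledge as a strong prior
clf.fit(X_target, y_target)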

3. Transfer Learning Methodologies

• Equipped with an understanding of the different types of transfer learning scenarios, we can now proceed towards understanding how to transfer.
• As mentioned earlier, transfer learning is not limited to deep learning, yet the success and complexity of deep learning architectures make them the more favorable topic for this discussion.
• We will discuss the three major methodologies for performing transfer learning in the coming sub-sections.


• It is important to understand that the choice of method depends upon the source and target domains, the availability of labels and the task at hand, amongst a number of other parameters discussed in the section on types of transfer learning.

3.1 Feature Extraction

✓ As you probably know, deep learning architectures can be thought of as towers made out of Lego blocks.
✓ Each layer captures different features. It is often the case that such layered architectures capture simpler features in the initial layers and complex ones in the deeper layers.
✓ For example, a typical CNN trained to identify human faces would capture simpler features like straight edges and diagonals in the initial layers, while the deeper layers capture shapes and textures.
✓ This non-linear learning setup captures a hierarchical representation of the objects of interest.

• Any deep learning architecture can thus be viewed as a layered architecture where all layers except the last one extract features.
• The last layer transforms these features into the objective at hand, i.e. classification, regression, etc.
• Thus, it is an interesting proposition to utilize deep learning architectures (sans the final layer) as feature extractors.
• This is a typical transfer learning setup where we utilize deep learning architectures as feature extractors.
• The same is depicted in figure 1.5.


✓ Let us understand this using an example where we are supposed to develop a dog breed classifier.
✓ Assume we have a pre-trained CNN model, such as VGG-16 (built by the Visual Geometry Group at Oxford), trained on ImageNet.
✓ Since VGG-16 achieves strong performance on a thousand-plus image classes, it seems like a good starting point.
✓ The fact that the target task is closely associated with the source domain helps even further.

• The first step towards using this pre-trained model for transfer learning is to remove its final classification layer.
• This leaves a CNN model, sans its final layer, which acts as a feature extractor.
• An important point to note is that we freeze the rest of the layers.
• The 4096-dimensional output vector from the penultimate layer is representative of the feature space this model is able to extract from the input image.
• The final step is to connect this penultimate-layer output to a new classification layer whose objective is to classify dog breeds.
• Training proceeds as usual based on backpropagation of gradients.
• The training phase only updates the weights of the newly added layers, while the frozen layers stay as-is, i.e. their weights are not updated.


✓ This method of using deep learning models as feature extractors has been shown to outperform handcrafted and off-the-shelf features across various tasks and domains.
✓ A minimal sketch of this feature-extraction setup is shown below, and we will observe the effectiveness of this approach in hands-on experiments later.
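The following is a minimal Keras sketch of this feature-extraction setup, assuming a VGG-16 backbone from tf.keras.applications and a hypothetical 120-class dog breed dataset. For simplicity it pools the convolutional feature maps instead of re-using the original 4096-dimensional fully-connected layer described above.

import tensorflow as tf
from tensorflow.keras import layers, Model

base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,          # drop the original classification layers
                                   input_shape=(224, 224, 3))
base.trainable = False                                         # freeze the pre-trained layers

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)               # frozen network acts as a feature extractor
x = layers.GlobalAveragePooling2D()(x)         # pool feature maps into a single feature vector
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(120, activation="softmax")(x)           # new head for the assumed 120 dog breeds

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

Only the newly added dense layers receive weight updates during training; the frozen VGG-16 layers stay as-is.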

3.2 Fine-Tuning

✓ The AI field has benefitted immensely from sharing knowledge and ideas (quite a fitting note for a discussion on transfer learning).
✓ Particularly in the computer vision space (and lately in the NLP space), researchers and teams across the world have been sharing their latest work, both in the form of research papers as well as the final trained weights/checkpoints of their models.
✓ The availability of these pre-trained models is one of the biggest drivers of the transfer learning paradigm.
✓ While in the previous section we discussed using pre-trained models as feature extractors, in this sub-section we will focus on fine-tuning them.

• For instance, a pre-trained model, again taking the example of VGG-16, is a container of knowledge associated with those thousand classes from ImageNet.
• The scope of ImageNet is quite wide and provides an amazing amount of variation for the models to learn from.
• In cases where we are presented with tasks whose domain is similar or related to that of ImageNet, fine-tuning is a popular method for transfer learning, as illustrated in figure 1.6.


✓ Fine-tuning, in simple words, refers to the method of using a pre-trained model as our
starting point. Similar to the feature-extraction method, we remove the final
classification/regression layer and add a set of new layers depending upon our target
task’s objective.
✓ Unlike the previous scenario where we froze the layers of the pre-trained model, in this
case we allow some of the re-used layers to be trained/updated along with the newly added
ones.
✓ This method is illustrated in figure 1.6 for reference.

TIP:

We often remove the fully-connected dense layers at the end of the pre-trained model and add our own dense layers for fine-tuning. The general methodology is to freeze the (shallow) layers closer to the input, which learn more generic features, and fine-tune the (deeper) layers closer to the output, which learn features more specific to the input data. However, we can also open up the entire network, with all of its layers, for fine-tuning, as long as the learning rate is not too high. This ensures that we don't end up destroying the previously learnt knowledge (weights) by making huge gradient updates as we fine-tune the model layers on our new dataset.

• This method is useful in scenarios where the target task has enough training/labelled samples to train a deep, complex network with a large number of trainable weights.
• Fine-tuning a pre-trained network provides significant performance improvements over training a network from scratch.
• This improved performance is attributed, firstly, to a well-studied and proven architecture and, secondly, to the fact that the weights of a pre-trained network provide a far better starting point than random weights. A minimal fine-tuning sketch follows.
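A minimal sketch of fine-tuning, continuing from the feature-extraction sketch in the previous sub-section (re-using its base and model variables): the deepest convolutional block of the assumed VGG-16 backbone is unfrozen and trained together with the new head, using a deliberately small learning rate. The chosen block and learning rate are illustrative assumptions, not a prescribed recipe.

# Continue from the feature-extraction sketch above: unfreeze only the deepest block.
base.trainable = True
for layer in base.layers:
    # VGG-16 layer names run from block1_conv1 to block5_conv3; fine-tune only block5
    layer.trainable = layer.name.startswith("block5")

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),   # small LR avoids destroying learnt weights
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)                 # assumed tf.data datasets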

3.3 Pre-trained Models

✓ Pretrained models present certain advantages, some of which we discussed in the previous section.
✓ Not only do they provide a better starting point, they also assist in knowledge transfer by virtue of utilizing proven models for target-domain tasks.
✓ Building upon the success of fine-tuning based transfer learning methods, using the whole pretrained model is the generalized form: instead of retraining only some of the layers (while the rest are fixed), we retrain the whole network for the target domain.
✓ This method is useful in scenarios where we have enough training samples in the target domain, as well as the required amount of compute to handle retraining of complete networks.

• Most deep learning frameworks, such as TensorFlow [9], PyTorch [10], Caffe [11], MXNet [12] and so on, provide a long list of state-of-the-art models in different domains (image, video, audio, text) for quick usage, adaptation and fine-tuning.
• Model Zoo is another growing repository of pretrained models, maintained by the Berkeley Artificial Intelligence Research (BAIR) group.
• In the sections that follow, we will make use of such pretrained models for different use cases on real-world data.

4. Diving into Transfer Learning

• Understanding the essentials of transfer learning would not be complete without trying it out on some actual examples!
• Thanks to organizations like Google and Facebook, we are able to access, download and use the latest state-of-the-art pre-trained models with just a few lines of code.
• In this section we will discuss the major repositories of pre-trained models and leverage some of these models in an image classification scenario, trying to predict the category of a few sample images.

4.1 Accessing Pre-trained Models

✓ Considering we have various sources of data like images, audio, video and text, there exists a diverse set of pre-trained models which have been trained on humongous amounts of data for a wide variety of tasks like classification, representation learning, detection and so on.
✓ The core idea in transfer learning, as you already know by now, is to leverage a pre-trained model and adapt it to solve your own problem at hand, instead of training a model from scratch.


• One might ask at this point: how do we actually access these pre-trained models?
• One way of accessing these models is directly from the module APIs of the specific deep learning library you might be using, as sketched below.

For example:

✓ TensorFlow has a nice applications module, thanks to the Keras API, which you can access via tf.keras.applications in your code.

• PyTorch has various packages like torchvision, torchtext and torchaudio which offer pre-trained models specific to computer vision, text and audio data respectively.
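For example, two hedged one-liners for loading pre-trained models directly from these library APIs (model availability and exact arguments depend on the installed library versions):

import tensorflow as tf
import torchvision.models as tv_models

resnet_tf = tf.keras.applications.ResNet50(weights="imagenet")   # Keras applications module
resnet_pt = tv_models.resnet50(pretrained=True)                  # torchvision (newer versions use the weights= argument)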

✓ A better way to access models is often by referring to a central repository of pre-trained models, often referred to as a model hub or a model zoo.
✓ The most popular model hubs are TensorFlow Hub, PyTorch Hub and HuggingFace Transformers.

• TensorFlow Hub and PyTorch Hub are maintained by Google and Facebook respectively, and consist of a wide variety of pre-trained models which have been open-sourced for the community.
• A lot of models are also developed and maintained by independent organizations and individual contributors who are primarily into machine learning and deep learning research.
• The other popular model hub is the Transformers model hub, which focuses on the latest state-of-the-art (SOTA) models for NLP.
• It is maintained by HuggingFace and includes a significant number of pre-trained models contributed by individual researchers as well as organizations like Google and Facebook, as shown in the brief example below.
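As a brief, hedged example, the transformers library lets you pull a pre-trained NLP model from the hub in a couple of lines (the exact default model downloaded by the pipeline may vary with the library version):

from transformers import pipeline

sentiment = pipeline("sentiment-analysis")    # downloads a default pre-trained sentiment model from the hub
print(sentiment("Transfer learning saves a lot of training time."))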

4.2 Image Classification with Transfer Learning

✓ Let's now dive into a hands-on example showcasing transfer learning in the context of image classification or categorization.
✓ The objective here will be to take a few sample images of animals and see how some canned, pre-trained models fare in classifying them.
✓ We will pick a couple of pre-trained state-of-the-art models of different complexity, to compare and contrast how they interpret the true category of the input images.

METHODOLOGY

• The key objective here is to take a pre-trained model off the shelf and use it directly to predict the class of an input image.
• We focus on inference here to keep things simple, without diving into how to train or fine-tune these models.
• Building on the visual from figure 1.2, our methodology for image classification focuses on taking an input image, loading a pre-trained model from TensorFlow Hub in Python and predicting the top-5 probable classes of the input image.
• This workflow is depicted in figure 1.8.

PRE-TRAINED MODEL ARCHITECTURES

For our experiment, we will be leveraging two state-of-the-art pre-trained convolutional neural network (CNN) models, namely:

Resnet-50:

✓ This is a residual deep convolutional neural network (CNN) with a total of 50 layers, built from the standard convolution and pooling layers a typical CNN has, along with Batch Normalization layers for regularization.
✓ The novelty of these models is the use of residual or skip connections.
✓ This model was trained on the standard ImageNet-1K dataset, which has a total of 1000 distinct classes.

BiT MultiClass Resnet-152 4x:

• This comes from Google's state-of-the-art (SOTA) work in computer vision called Big Transfer, published in May 2020.
• Here, they built their flagship model architecture on a pre-trained Resnet-152 model (152 layers), but four times wider than the original model.
• This model uses Group Normalization layers instead of Batch Normalization for regularization.
• The model was trained on the ImageNet-21k dataset [14], which has a total of 21,843 classes.

The foundational architecture behind both models is a convolutional neural network (CNN)
which works on the principle of leveraging a multi-layered hierarchical architecture of several
convolution and pooling layers with non-linear activation functions.

CONVOLUTIONAL NEURAL NETWORKS

✓ Typically, a convolutional neural network, more popularly known as a CNN, consists of a layered architecture of convolution, pooling and dense layers, besides the input and output layers.
✓ A typical architecture is depicted in figure 1.9.


• The CNN model leverages convolution and pooling layers to automatically extract different hierarchies of features, from very generic features like edges and corners to very specific features like the facial structure, whiskers and ears of the tiger depicted as an input image in figure 1.10.
• The feature maps are usually flattened using a flatten or global pooling operator to obtain a 1-dimensional feature vector.
• This vector is then sent as input through a few fully-connected dense layers, and the output class is finally predicted using a softmax output layer.

✓ The key objective of this multi-stage hierarchical architecture is to learn spatial hierarchies of patterns which are also translation-invariant.
✓ This is possible through two main layer types in the CNN architecture: the convolution and pooling layers.

Convolution Layers:

• The secret sauce of a CNN is its convolution layers!
• These layers are created by convolving multiple filters or kernels with patches of the input image, which helps in extracting specific features from the input image automatically.
• Using a layered architecture of stacked convolution layers helps in learning spatial features with a certain hierarchy, as depicted in figure 1.10.


• While figure 1.10 provides a simplistic view of a CNN, the core methodology holds: coarse and generic features like edges and corners are extracted in the initial convolution layers (to give feature maps).
• A combination of these feature maps in deeper convolution layers helps the CNN learn more complex visual features like the mane, eyes, cheeks and nose.
• Finally, the overall visual representation and concept of what a tiger looks like is built using a combination of these features.

Pooling Layers:

✓ In the pooling layers, we typically downsample the feature maps from the convolution layers using an aggregation operation like max, min or mean.
✓ Usually max-pooling is preferred, which means we take patches of image pixels (e.g. a 2x2 patch) and reduce each to its maximum value (giving one pixel with the max value).
✓ Max-pooling is preferred because of its lower computation time as well as its ability to encode the enhanced aspects of the feature maps (by taking the maximal pixel values of image patches rather than the average).
✓ Pooling also helps in reducing overfitting, decreases computation time and enables the CNN to learn translation-invariant features. A minimal CNN sketch combining these layers is shown below.
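A minimal Keras sketch of the convolution-pooling-dense pattern described above; the filter counts, kernel sizes, input shape and number of classes are illustrative assumptions, not the configuration of any specific pre-trained model.

import tensorflow as tf
from tensorflow.keras import layers

cnn = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(224, 224, 3)),     # initial layers extract generic features (edges, corners)
    layers.MaxPooling2D((2, 2)),                  # downsample the feature maps
    layers.Conv2D(64, (3, 3), activation="relu"), # deeper layers combine features into shapes and textures
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                             # 1-dimensional feature vector
    layers.Dense(128, activation="relu"),         # fully-connected dense layer
    layers.Dense(10, activation="softmax")        # softmax output layer for the assumed 10 classes
])
cnn.summary()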

THE RESNET ARCHITECTURE

✓ Both of the pre-trained models we mentioned earlier are variants of the Resnet CNN architecture.
✓ Resnet stands for Residual Network, an architecture which introduced the novel concept of using residual or skip connections to build deeper neural network models without running into problems of vanishing gradients and poor generalization.
✓ The typical architecture of a Resnet-50 has been simplified and depicted in figure 1.11.


• The Resnet-50 architecture consists of several stacked convolution and pooling layers, followed by a final global average pooling and a fully connected layer with 1000 units to make the final class prediction.
• This model also uses batch-normalization layers interspersed between the other layers to help with regularization. The stacked conv and identity blocks are novel concepts introduced in the Resnet architecture which make use of residual or skip connections, as seen in the detailed block diagrams in figure 1.11.

✓ The whole idea of a skip connection (also known as a residual or shortcut connection) is to not just stack layers but also directly connect the original input to the output of a few stacked layers, as seen in figure 1.12, where the original input is added to the output of the conv or identity block.
✓ The purpose of using skip connections is to enable building deeper networks without facing problems like vanishing gradients and saturation of performance, by allowing alternate paths for gradients to flow through the network.
✓ We see different variants of the Resnet architecture in figure 1.12; a minimal sketch of an identity block with a skip connection follows.
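A minimal Keras sketch of an identity block with a skip connection; the filter count and input shape are illustrative assumptions and do not reproduce the exact Resnet-50 block configuration.

import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters):
    shortcut = x                                              # keep the original input
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                           # skip connection: add input to stacked-layer output
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = identity_block(inputs, filters=64)                  # filters must match the shortcut's channels
block = tf.keras.Model(inputs, outputs)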


• For our first pre-trained model we will use a Resnet-50 model which has been trained
on the ImageNet-1k dataset with a multi-class classification task.
• Our second pre-trained model uses Google’s pre-trained Big Transfer Model for
multi-label classification (BiTM) which has variants based on Resnet 50, 101 and 152.
• The model we use is based on a variant of the Resnet-152 architecture which is 4
times wider.

BIG TRANSFER (BIT) PRE-TRAINED MODELS

✓ The Big Transfer (BiT) models were trained and published by Google in May 2020 as a part of their research paper [15].
✓ These pre-trained models are built on top of the basic Resnet architecture we discussed in the previous section, with a few tricks and enhancements.

The key focus areas of the Big Transfer models include the following:

i. Upstream Training
ii. Downstream Fine-tuning

Upstream Training:

• Here we train large model architectures (e.g. Resnet) on large datasets (e.g.
ImageNet-21k) with a long pre-training time and using concepts like Group
Normalization with Weight Standardization, instead of Batch Normalization.


• The general observation has been that GroupNorm with Weight Standardization
scales well to larger batch sizes as compared to BatchNorm.

Downstream Fine-tuning:

• Once the model is pre-trained, it can be fine-tuned and 'adapted' to any new dataset with a relatively small number of samples.
• Google uses a hyperparameter heuristic called BiT-HyperRule, where stochastic gradient descent (SGD) is used with an initial learning rate of 0.003 and a decay factor of 10 at 30%, 60% and 90% of the training steps, as sketched below.
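A hedged Keras sketch of such a schedule: SGD with an initial learning rate of 0.003, divided by 10 at 30%, 60% and 90% of the training steps. The total number of fine-tuning steps is an assumption made only for illustration.

import tensorflow as tf

total_steps = 10_000                                      # assumed length of the fine-tuning run
boundaries = [int(total_steps * f) for f in (0.3, 0.6, 0.9)]
values = [3e-3, 3e-4, 3e-5, 3e-6]                         # learning rate divided by 10 at each boundary
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)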

In the following experiments, we will be using the BiTM-R152x4 model: a pre-trained Big Transfer model based on Google's flagship Resnet-152 CNN architecture, four times wider than the original, and trained to perform multi-label classification on the ImageNet-21k dataset.

IMPLEMENTATION

Let’s now use these pre-trained models to solve our objective of predicting the Top-5 classes
of input images.

We start by loading up the specific dependencies for image processing, modeling and
inference.

import tensorflow as tf
import tensorflow_hub as tf_hub
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
print('TF Version:', tf.__version__)
print('TF Hub Version:', tf_hub.__version__)
TF Version: 2.3.0
TF Hub Version: 0.8.0

Do note that we use TensorFlow 2.x here. Since we will be directly using the pre-trained models for inference, we will need to know the class labels of the original ImageNet-1K and ImageNet-21k datasets for the Resnet-50 and BiTM-R152x4 models respectively, as depicted in listing 1.1.

Listing 1.1 Loading ImageNet Class Labels

!wget https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt
!wget https://storage.googleapis.com/bit_models/imagenet21k_wordnet_lemmas.txt

data1k = []
with open('ImageNetLabels.txt', 'r') as f:
    data1k = f.readlines()

data21k = []
with open('imagenet21k_wordnet_lemmas.txt', 'r') as f:
    data21k = f.readlines()

ImageNet1k_mapping = {i: value.strip('\n')
                      for i, value in enumerate(data1k)}
ImageNet21k_mapping = {i: value.strip('\n')
                       for i, value in enumerate(data21k)}

print('ImageNet 1K (Resnet-50) Total Classes:',
      len(list(ImageNet1k_mapping.items())))
print('Sample:', list(ImageNet1k_mapping.items())[:5])
print('\nImageNet 21K (BiT Resnet-152 4x) Total Classes:',
      len(list(ImageNet21k_mapping.items())))
print('Sample:', list(ImageNet21k_mapping.items())[:5])

ImageNet 1K (Resnet-50) Total Classes: 1001
Sample: [(0, 'background'), (1, 'tench'), (2, 'goldfish'),
 (3, 'great white shark'), (4, 'tiger shark')]

ImageNet 21K (BiT Resnet-152 4x) Total Classes: 21843
Sample: [(0, 'organism, being'), (1, 'benthos'),
 (2, 'heterotroph'), (3, 'cell'),
 (4, 'person, individual, someone, somebody, mortal, soul')]

The next step would be to load up the two pre-trained models we discussed earlier from
TensorFlow Hub.

Resnet_model_url = "https://tfhub.dev/tensorflow/resnet_50/classification/1"
Resnet_50 = tf_hub.KerasLayer(Resnet_model_url)

BiT_model_url = "https://tfhub.dev/google/bit/m-r152x4/imagenet21k_classification/1"
BiT_r152x4 = tf_hub.KerasLayer(BiT_model_url)

Once we have our pre-trained models ready, the next step is to build some specific utility functions, which you can access from the notebook for this chapter in the accompanying GitHub repository. Just to get some perspective:

The preprocess_image(...) function helps us in pre-processing and shaping the input image and scaling its pixel values to the range 0-1; a possible implementation is sketched below.
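A possible implementation consistent with that description (the notebook's actual helper may differ, for example in how it resizes the image); it assumes numpy has been imported as np, as in the imports above.

def preprocess_image(image):
    image = np.array(image)                     # PIL image -> uint8 array of shape (H, W, 3)
    image = image.astype("float32") / 255.0     # scale pixel values to the 0-1 range
    image = np.expand_dims(image, axis=0)       # add a batch dimension: (1, H, W, 3)
    return image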

The visualize_predictions(…) function takes in the pre-trained model, the class label
mappings, the model type and the input image as inputs to visualize the top-5 predictions as
a bar chart.

The Resnet-50 model directly gives class probabilities as outputs, but the BiTM-R152x4 model gives class logits as outputs, which need to be converted to class probabilities. Listing 1.2 shows the section of the visualize_predictions(...) function which achieves this.

Listing 1.2 Getting the class probabilities from the model predictions

def visualize_predictions(model, image, ImageNet_mapping_dict,
                          model_type='Resnet'):
    if model_type == 'Resnet':
        probs = model(image)
        probs = tf.reshape(probs, [-1])
    else:
        logits = model(image)
        logits = tf.reshape(logits, [-1])
        probs = tf.nn.softmax(logits)
    top5_ImageNet_idxs = np.argsort(probs)[:-6:-1]
    top5_probs = np.sort(probs)[:-6:-1]
    pred_labels = [ImageNet_mapping_dict[i]
                   for i in top5_ImageNet_idxs]
    ...
    ...


Remember that logits are basically the log-odds, or unnormalized class scores, and hence you need to compute the softmax of these logits to get the normalized class probabilities, which sum up to 1, as depicted in figure 1.13. The figure shows a sample neural network architecture with the logits and the class probabilities for a hypothetical 3-class classification problem.
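As a small numeric illustration for a hypothetical 3-class problem:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))   # softmax: exponentiate and normalize
print(probs.round(3), probs.sum().round(3))       # [0.659 0.242 0.099] 1.0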

The softmax function basically squashes the logits using the transform depicted in figure 1.13 to give us the normalized class probabilities.

Let's now put our code into action! You can apply these functions to any downloaded image using the sequence of steps depicted in listing 1.3 to visualize the top-5 predictions of our two pre-trained models.

Listing 1.3 Visualizing Top-5 predictions of our pre-trained models on a sample image

img = Image.open('snow_leo.png').convert("RGB")
pre_img = preprocess_image(img)

plt.figure(figsize=(12, 3))

plt.subplot(1, 3, 1)
visualize_predictions(model=BiT_r152x4, image=pre_img,
                      ImageNet_mapping_dict=ImageNet21k_mapping,
                      model_type='BiT-multiclass')

plt.subplot(1, 3, 2)
Resnet_img = tf.image.resize(pre_img, (224, 224))
visualize_predictions(model=Resnet_50, image=Resnet_img,
                      ImageNet_mapping_dict=ImageNet1k_mapping,
                      model_type='Resnet')

plt.subplot(1, 3, 3)
plt.imshow(pre_img[0])
plt.tight_layout()

Voila! We have the top-5 predictions from our two pre-trained models depicted in a nice visualization in figure 1.14.

It looks like both our models performed well and, as expected, the BiTM model is very specific and more accurate, given that it has been trained on over 21K classes including very specific animal species and breeds.

The Resnet-50 model shows more inconsistencies than the BiTM model when predicting animals of a similar genus but slightly different species, like tigers and lions, as depicted in figure 1.15.

Another aspect to keep in mind is that these models are not exhaustive. They don't cover each and every entity on this planet; that would be impossible, considering data collection for such a task would take centuries, if not forever! An example is showcased in figure 1.16, where our models try to predict a very specific dog breed, the Afghan Hound, from a given image.

Based on the top-5 predictions in figure 1.16, you can see that while our BiTM model actually gets the right prediction, the prediction probability is very low, indicating that the model is not too confident (it probably hasn't seen many examples of this dog breed in its training data during the pre-training phase). This is where we can fine-tune and adapt our models to make them more attuned to our specific datasets, output labels and outcomes.

5. Transfer Learning Challenges

• We have covered quite a bit of ground so far in terms of understanding the what, how and why of transfer learning.
• We even tried our hand at using off-the-shelf utilities to understand how useful and powerful transfer learning can be.
• In this section we will touch upon a few challenges associated with transfer learning before we close the unit.
• Understanding the challenges and open questions provides a complete picture and also helps in making the right decisions.

The following are the major challenges associated with transfer learning:

i. Negative Transfer
ii. Transfer Bounds or Knowledge Gain

Negative Transfer:

• In the context of transfer learning, the improvements in target task learning can be categorized into better initial performance, improved training time or enhanced final performance.
• Scenarios where some or all of these improvements are observed are termed positive transfer scenarios.
• Yet in real-world settings, this is not always the case.
• There are cases when transfer learning can lead to a drop in overall performance.
• This is termed negative transfer.
• Negative transfer can occur due to a number of reasons.
• It could be due to the source task not being sufficiently related to the target task/domain.
• It could also be due to an incorrect choice of transfer method, or the transfer method being unable to leverage the relationship between source and target.
• Avoiding negative transfer is important, yet its abstract list of causes is difficult (if not impossible) to narrow down.

Transfer Bounds or Knowledge Gain:

• Transfer learning is a powerful concept which provides impressive improvements in learning target tasks.
• Yet it is quite difficult to quantify the amount of knowledge transferred.
• It is important to quantify the transfer in transfer learning to understand the quality of the transfer and its viability.
• It is also important from the point of view of comparing transfers from different sources (apart from train/test evaluation metrics) to understand the generalizability/robustness of transfer-learnt models.
• Works such as Glorot et al. [16] have tried to address this by coming up with methods to quantify transfer-learning-related gains to a certain extent.
