Unit 5 Notes
UNIT V
Transfer Learning
1. Types
2. Methodologies
3. Diving into Transfer Learning
4. Challenges
Introduction
• Learning is a process.
• It is a very involved process that helps in acquiring new knowledge and skills.
• While learning has been historically associated with sentient and evolved life forms,
since the advent of computers and subsequent research in the field of artificial
intelligence, we now associate the process of learning with machines as well.
• Virtual assistants, AI-powered game players and even autonomous vehicles are a
reality now.
• It is not surprising any more to have machines included in the list of entities that can
learn.
✓ As part of evolution, humans have refined the way we learn and adapt to different
scenarios over the years.
✓ We have acquired this innate ability to use skills and knowledge across different tasks
without always starting from zero.
✓ For instance, some common examples of this ability to transfer knowledge across
different tasks are “knowing how to boil water → one can learn to make tea”, “knowing
how to ride a bicycle → can help us learn to ride a motorbike easily” or even “knowing
how to do subtraction → helps us learn to perform division”.
• There are many such things in real life where we leverage what we learn from solving
one specific problem and apply it to solve a number of other related problems.
• We do this consciously and subconsciously all the time.
• These examples point to our ability to transfer knowledge across different yet related
tasks. This ability is proportional to how similar or dissimilar the tasks are.
➢ For instance, in the case of a classification model to differentiate between cats versus
dogs, we train a model which takes a dataset of images and their corresponding labels.
➢ The classification model picks up patterns in the input dataset of images and tries to
map them with output labels.
➢ Thus, we end up with a system which we can use to identify whether a new image is
of a cat or a dog.
➢ This is one of the numerous ways we have developed over the years to make machines
learn.
➢ As a next step, what if we need a system which can classify whether an input image is
that of a human or a dog?
➢ The answer, often, is that we will train another classifier from scratch, which needs a
dataset with labelled images of humans and dogs.
➢ Though this is the usual way of solving such problems, it does feel a bit different from
the human way of learning.
✓ It might seem from Andrew Ng’s quote that transfer learning is a new concept.
✓ On the contrary, it is not something that emerged only as recently as 2016.
✓ In fact, it is believed to have been first mentioned and introduced at a workshop
titled “Learning to Learn: Knowledge Consolidation and Transfer in Inductive
Systems”, held at the NIPS (now NeurIPS) conference in 1995.
✓ We train a specific model for each task in isolation, irrespective of the similarity in tasks,
datasets or target variables.
✓ This is different from how we, as humans, are able to leverage knowledge from one task
to solve a related one.
✓ Transfer Learning is a concept which enables trained models to share their knowledge
and help improve the outcomes.
✓ Transfer Learning, thus, is a framework of extending the capabilities of known systems
to understand and share knowledge among tasks, as illustrated at a high-level in figure
1.1.
✓ The figure highlights the siloed traditional approach where we train a separate model for
each task.
✓ We will get into the details of the figure later; the transfer learning setup (as compared
to the traditional approach) leverages knowledge from source tasks to improve
learning on the target task.
• Let us understand the core idea of transfer learning with the help of a simple
example.
• Assume we have been asked to develop a classifier to identify different breeds of
dogs.
• For the purpose of this example, we denote this dog breed dataset as our target
dataset.
• Now, an experienced machine learning practitioner such as yourself would try to
tackle this classification task using computer vision tools such as convolutional
neural networks (CNNs).
• However, as is the case with most real-world scenarios, there is a twist in the problem
statement. The dataset we are provided is not just small in size, say a few thousand
samples, but also has unbalanced class distribution.
• Building a CNN-based classifier would have worked perfectly if we’d had enough data
samples to accurately represent each dog breed (we know how data-hungry deep
learning models can be).
• But with only a small, unbalanced dataset, the classifier will achieve dismal
performance.
✓ Training a CNN-based classifier from scratch for our dog breed dataset is depicted
in figure 1.2.
✓ It uses the typical setup consisting of training samples and a CNN model.
✓ As our dataset is small and imbalanced, the resulting performance is not so
good.
✓ Transfer Learning is our rescue card for precisely such scenarios (there are others,
which we will come to in subsequent sections).
✓ Instead of building a CNN from scratch, we can make use of one of the various state-
of-the-art models trained on a similar dataset, such as ImageNet.
✓ However, some obvious questions would be, why ImageNet? And why a pretrained
model?
✓ Formally, a domain D is defined as a tuple consisting of a feature space χ and a marginal
probability distribution P(X), i.e.
D = {χ, P(X)}     (equation 1.1)
✓ A task T is defined as a tuple consisting of a label space γ and an objective predictive
function θ, which can equivalently be expressed in terms of the conditional probability:
T = {γ, θ} ⟹ T = {γ, P(Y|X)}
✓ This framework leads to the following mismatch scenarios for transfer learning to deal
with:
✓ When the source and target domains are different (DS ≠ DT), either of the following holds:
o χS ≠ χT: the feature spaces of the source and target domains are different. In an email
classification example, this could mean the source and target emails are written in
different languages. This is referred to as feature-space mismatch.
o P(XS) ≠ P(XT): the marginal probabilities of the source and target domains are different
(while the feature space is the same).
✓ Continuing with the same example of email classification, in this case even though the
emails are in the same language, the source might be an official/corporate email dataset
(which typically has shorter emails with documents and presentations as attachments) while
the target may be a personal email dataset (which has a very different distribution).
✓ This is referred to as marginal-probability mismatch.
✓ Similarly, when the source and target tasks are different (TS ≠ TT), either of the following
holds:
o γS ≠ γT: the label spaces of the source and target tasks are different.
o P(YS|XS) ≠ P(YT|XT): the conditional probability distributions of the source and target
tasks are different.
Inductive Transfer:
• In this scenario, the source and target tasks are different, with the domains being
similar or related.
• This scenario is characterized by the presence of labels in both source and target
domains.
• As the name suggests, the target labels are required to induce the knowledge from
the source domain for use in the target.
• Typically, the source domain has a larger labelled dataset while the target domain
has limited labelled samples.
• Thus, inductive transfer helps in improving performance in the target domain by
leveraging/inducing the objective function with knowledge from the source domain.
• This is the most common form of transfer learning we typically use in real-world
settings.
• It is closely related to aspects of multi-task learning (we will cover this shortly).
Unsupervised Transfer:
• This scenario is similar to inductive transfer, with the only difference being the absence
of labels in both source and target domains.
• The focus of this category is to handle transfer learning in unsupervised scenarios
such as dimensionality reduction, clustering, density estimation, etc.
Transductive Transfer:
• In this scenario, the source and target tasks are the same while the corresponding domains
are different.
• This scenario is characterized by the absence of labelled samples in the target domain. It
closely matches the feature-space and marginal-probability mismatch
scenarios explained in section 1.1.
• This scenario is also similar to the related but slightly different concept of domain
adaptation.
✓ We next present a few additional concepts and terms which are often either used
interchangeably or as related concepts.
i. Domain Adaptation
ii. Multi-task Learning
iii. Zero-shot and One-Shot Learning
iv. Bayesian Methods and Inductive Transfer
• Domain, as we have described in equation 1.1, is a tuple of a feature space and a marginal
probability, denoted as D = {χ, P(X)}.
• Domain adaptation is a specific scenario where the label space remains the same, yet the
marginal probabilities between the source and target domains change, i.e.
P(XS) ≠ P(XT)
Multi-task Learning
• The motivation behind multi-task learning comes from the fact that often related
tasks carry additional information that can help us learn the main objective better.
• For example, let us assume we are given the task of preparing a sentiment classifier.
• Though we can train one model to perform exceptionally well, it helps more to train
for a task such as Named Entity Recognition (NER) in parallel.
• If we compare it with the human way of learning, this intuitively also makes sense.
• Such a model would understand the language constructs and semantics along with
sentiment classification.
• A child/human who understands the language constructs might be better at identifying
the sentiment of a given statement versus a child/human who only understands
sentiment classification.
• Caruana et al. (1998) [5] summarize this as a method for improving generalization by
leveraging domain-specific information contained in the training signals (features) of
related tasks.
✓ Zero-shot and one-shot learning are thus extreme variants of transfer
learning which help in learning a task by using no training data, or only one (or a few)
training samples, respectively.
✓ The term one-shot learning seems to have been coined by Fei-Fei Li and her team in their
seminal work titled “One-Shot Learning of Object Categories”.
✓ They presented a new and improved method for representation learning which helped
the model transfer knowledge using only a few training examples.
✓ On the other extreme, zero-shot learning is described by Ian Goodfellow and team in
their book on deep learning as a scenario where, instead of just learning the input
and output random variables, the model is supposed to learn a random variable
describing the task itself.
✓ Thus, the model is trained for the conditional probability P(y|x, T), where x and y are the
input and target variables and T represents the task itself.
• Zero-shot and one-shot (or few-shot) paradigms come in handy in real-world scenarios
where typically there aren’t many training/labeled examples available.
• Domains such as machine translation benefit the most from these advances, as they
enable us to prepare translation models for target languages with virtually no labelled
examples.
• Since their introduction, these methods have been researched, applied and improved
quite a bit.
✓ Bayesian learning methods require a special mention in this context due to the way
they are formulated.
✓ Bayesian methods, such as Naïve Bayes, model probability distributions and impose
the conditional independence assumption to simplify the models. Another important
aspect of Bayesian methods is the use of prior distributions. A prior distribution helps us
describe the domain without looking at the training dataset.
✓ A strong prior can positively impact the posterior distribution and overall results.
✓ This use of prior distributions is similar to the transfer learning setup, where we leverage
knowledge from the source domain to improve the target task.
✓ Simply put, when using Bayesian learning methods we can leverage knowledge from the
source domain as a strong prior for the target task.
✓ This concept is discussed in detail in the works of Marx et al. [7] and Dai et al. [8].
• It is important to understand that the choice of method depends upon the source and
target domains, the availability of labels, and the task at hand, amongst a number of other
parameters discussed in the section on types of transfer learning.
✓ As you probably know, deep learning architectures can be considered like towers
made out of Lego blocks.
✓ Each layer captures different features. It is often the case that such layered
architectures capture simpler features in the initial layers and more complex ones in the
deeper layers.
✓ For example, a typical CNN trained to identify human faces would capture simpler
features like straight edges and diagonals in the initial layers while the deeper layers
capture shapes and textures.
✓ This non-linear learning setup captures a hierarchical representation of the objects of
importance.
• Any deep learning architecture can thus be viewed as a layered architecture where
all layers extract certain features except the last one.
• The last layer transforms these features into the objective at hand, i.e. classification,
regression, etc.
• Thus, it is an interesting proposition to utilize deep learning architectures (sans the
final layer) as feature extractors.
• This is a typical transfer learning setup where we utilize deep learning
architectures as feature extractors.
• The same is depicted in figure 1.5.
✓ Let us understand this using an example where we are supposed to develop a dog breed
classifier.
✓ Assume we have a pre-trained CNN model, such as VGG-16 (built by the Visual
Geometry Group at Oxford), trained on ImageNet.
✓ Since VGG-16 has state-of-the-art performance on a thousand-plus image classes, it
does seem like a good starting point.
✓ The fact that our target task is closely associated with the source domain
helps even further.
• The first step towards using this pre-trained model for transfer learning is to remove
its final classification layer.
• This leads to a CNN model, sans its final layer, which acts as a feature extractor.
• An important point to note is that we freeze the rest of the layers.
• The 4096-dimensional output vector from the penultimate layer is representative of the
feature space this model is able to understand from the input image.
• The final step is to connect this penultimate layer output to a new classification layer
whose objective is to classify dog breeds.
• The training proceeds as usual, based on backpropagation of gradients.
• The training phase only updates the weights of the newly added layers while the frozen
layers stay as-is, i.e. the weights of the frozen layers are not updated (a minimal code
sketch of this setup follows).
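To make these steps concrete, here is a minimal sketch in TensorFlow/Keras of a frozen
VGG-16 feature extractor with a new classification head. This is an illustrative sketch rather
than the book's exact code; the number of dog breeds (num_breeds = 120) is an assumption,
and 'fc2' refers to the 4096-dimensional penultimate layer of Keras' VGG-16.

import tensorflow as tf

# Load the full VGG-16 pre-trained on ImageNet
vgg = tf.keras.applications.VGG16(weights='imagenet', include_top=True)

# 'fc2' is the 4096-dimensional penultimate layer; drop the final 1000-class layer
feature_extractor = tf.keras.Model(inputs=vgg.input,
                                   outputs=vgg.get_layer('fc2').output)
feature_extractor.trainable = False   # freeze all re-used layers

# Attach a new classification head for the target task (120 breeds assumed)
num_breeds = 120
inputs = tf.keras.Input(shape=(224, 224, 3))
features = feature_extractor(inputs)                  # 4096-d feature vector
outputs = tf.keras.layers.Dense(num_breeds, activation='softmax')(features)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(...) then updates only the weights of the newly added Dense layer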
✓ This method of using deep learning models as feature extractors has been shown to
outperform handcrafted and off-the-shelf features across various tasks and domains.
✓ We will observe the effectiveness of this approach in the next few chapters where we will
do hands-on experiments.
3.2 Fine-Tuning
✓ The AI field has immensely benefitted from the sharing of knowledge and ideas (quite
a quote for a book on transfer learning).
✓ Particularly in the computer vision space (and lately in the NLP space), researchers
and teams across the world have been sharing their latest work, both in the form of
research papers as well as the final trained weights/checkpoints of their models.
✓ The availability of these pre-trained models is one of the biggest drivers of the
transfer learning paradigm.
✓ While in the previous section we discussed using pre-trained models as feature
extractors, in this subsection we will focus on fine-tuning them.
✓ Fine-tuning, in simple words, refers to the method of using a pre-trained model as our
starting point. Similar to the feature-extraction method, we remove the final
classification/regression layer and add a set of new layers depending upon our target
task’s objective.
✓ Unlike the previous scenario where we froze the layers of the pre-trained model, in this
case we allow some of the re-used layers to be trained/updated along with the newly added
ones.
✓ This method is illustrated in figure 1.6 for reference.
TIP:
We often remove the fully-connected dense layers at the end of the pre-trained model to add in
our own dense layers for fine-tuning. The general methodology is to freeze the (shallow) layers
closer to the input, which learn more generic features, and fine-tune the (deeper) layers closer
to the output, which learn more specific features from the input data. However, we can also
open up the entire network with all of its layers for fine-tuning by making sure the learning
rate is not too high. This ensures that we don’t end up destroying the previously learnt knowledge
(weights) by making huge gradient updates as we fine-tune the model layers on our new
dataset.
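As an illustration of this tip, here is a minimal TensorFlow/Keras sketch (an assumed example,
not the book's code) that unfreezes only the deepest convolutional block of VGG-16 and uses
a small learning rate; num_classes is an assumed value.

import tensorflow as tf

base_model = tf.keras.applications.VGG16(weights='imagenet',
                                         include_top=False,
                                         input_shape=(224, 224, 3),
                                         pooling='avg')

# Freeze everything except the deepest convolutional block (block5)
for layer in base_model.layers:
    layer.trainable = layer.name.startswith('block5')

num_classes = 120   # assumed target task size for this sketch
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

# A small learning rate avoids destroying the pre-trained weights with large gradient updates
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])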
• This method is useful in scenarios where the target task has enough training/labeled
samples to train a deep, complex network with large number of trainable weights.
• Fine-tuning a pre-trained network provides significant performance improvements
over training a network from scratch.
• The main reasons for this improved performance are, firstly, a well-studied
and proven architecture and, secondly, the fact that the weights of a pre-trained
network provide a far better starting point than random weights.
Retraining the complete network is the generalized form of fine-tuning. In this case, in place
of only retraining some of the layers (while the rest are kept fixed), we retrain the whole
network for the target domain.
✓ This method is useful in scenarios where we have enough training samples in the
target domain as well as the required amount of compute to handle retraining of
complete networks.
✓ Considering we have various sources of data like images, audio, video and text, there exists
a diverse set of pre-trained models which have been trained on humongous amounts of
data for a wide variety of tasks like classification, representation learning, detection and
so on.
✓ The core idea in transfer learning, as you already know by now, is to leverage a pre-
trained model and adapt it to solve your own problem at hand, instead of training a
model from scratch.
• One might ask at this point: how do we actually access these pre-trained
models?
• One way of accessing these models is directly from the module APIs in the specific deep
learning libraries you might be using.
For example:
✓ TensorFlow has a nice applications module, thanks to the Keras API, which you can
access from the tf.keras.applications module in your code (see the short example after
this list).
• PyTorch has various packages like torchvision, torchtext and torchaudio which do offer
some of the pre-trained models specific to computer vision, text and audio data
respectively.
• TensorFlow Hub and PyTorch Hub are maintained by Google and Facebook respectively
and consist of a wide variety of pre-trained models which have been open-sourced for the
community.
• A lot of models are also developed and maintained by independent organizations and
individual contributors who are primarily into machine learning and deep learning
research.
• The other popular model hub is the Transformers model hub, which focuses on the latest
state-of-the-art (SOTA) models for NLP.
• It’s maintained by Hugging Face and includes a significant number of pre-trained models
which are contributed by individual researchers as well as organizations like Google and
Facebook.
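For example, here is a one-line sketch of loading a pre-trained model from the Keras
applications module mentioned above; the analogous torchvision call is shown as a comment
(exact arguments may differ across library versions).

import tensorflow as tf

# Load a ResNet-50 pre-trained on ImageNet via tf.keras.applications
model = tf.keras.applications.ResNet50(weights='imagenet')
model.summary()

# The rough PyTorch equivalent via torchvision would be:
# from torchvision import models
# resnet = models.resnet50(pretrained=True)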
✓ Let’s now dive into a hands-on example showcasing transfer learning in the context
of image classification or categorization.
✓ The objective here will be to take a few sample images of animals and see how some
canned, pre-trained models fare in classifying these images.
✓ We will be picking a couple of pre-trained state-of-the-art models of varying
complexity, to compare and contrast how they interpret the true category of the input
images.
METHODOLOGY
• The key objective here is to take a pre-trained model off-the-shelf and use it directly
to predict the class of an input image.
• We focus on inference here to keep things simple without diving into how to train or
fine-tune these models.
• Building on the visual from figure 1.2, our methodology for tackling the key objective of
image classification focuses on taking an input image, loading up a pre-trained
model from TensorFlow Hub in Python and predicting the top-5 probable classes of
the input image.
• This workflow is depicted in figure 1.8.
For our experiment, we will be leveraging two state-of-the-art pre-trained convolutional neural
network (CNN) models, namely:
i. ResNet-50
ii. BiTM-R152x4 (Big Transfer)
The foundational architecture behind both models is a convolutional neural network (CNN),
which works on the principle of leveraging a multi-layered hierarchical architecture of several
convolution and pooling layers with non-linear activation functions.
✓ Typically, a convolutional neural network, more popularly known as a CNN, consists
of a layered architecture of several layers, which include convolution, pooling and dense
layers besides the input and output layers.
✓ A typical architecture is depicted in figure 1.9.
• The CNN model leverages convolution and pooling layers to automatically extract
different hierarchies of features, from very generic features like edges and corners to
very specific features like the facial structure, whiskers and ears of the tiger depicted as
an input image in figure 1.10.
• The feature maps are usually flattened using a flatten or global pooling operator to
obtain a 1-dimensional feature vector.
• This vector is then sent as an input through a few fully-connected dense layers, and the
output class is finally predicted using a softmax output layer.
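To illustrate this layered structure, here is a minimal toy CNN sketch in TensorFlow/Keras;
the layer sizes and the 10-class output are arbitrary assumptions, not one of the pre-trained
models discussed here.

import tensorflow as tf

# A toy CNN: stacked convolution + pooling blocks, then a dense classifier head
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                           input_shape=(224, 224, 3)),       # generic features (edges, corners)
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),   # more specific features
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.GlobalAveragePooling2D(),                # flatten feature maps to a 1-D vector
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')          # assumed 10-class output
])
model.summary()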
Convolution Layers:
• While figure 1.10 provides a simplistic view of a CNN, the core methodology is true in
the sense that coarse and generic features like edges and corners are extracted in initial
convolution layers (to give feature maps).
• A combination of these feature maps in deeper convolutional layers helps the CNN
to learn more complex visual features like the mane, eyes, cheeks and nose.
• Finally, the overall visual representation and concept of what a tiger looks like is built
using a combination of these features.
Pooling Layers:
✓ We typically downsample the feature maps from the convolutional layers in the
pooling layers using an aggregation operation like max, min or mean.
✓ Usually max-pooling is preferred, which means we take in patches of image pixels (e.g.
a 2x2 patch) and reduce it to its maximum value (giving one pixel with the max value).
✓ Max-pooling is preferred because of its lower computation time as well as its ability to
encode the enhanced aspects of the feature maps (by taking the maximal pixel values
of image patches rather than the average).
✓ Pooling also helps in reducing overfitting, decreasing computation time and enabling the
CNN to learn translation-invariant features.
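For instance, here is a small worked example of 2x2 max-pooling on a toy 4x4 feature map
(the pixel values are arbitrary, chosen only for illustration):

import tensorflow as tf

# A toy 4x4 single-channel feature map (batch of 1)
feature_map = tf.constant([[1., 3., 2., 0.],
                           [5., 6., 1., 2.],
                           [0., 2., 4., 8.],
                           [3., 1., 7., 5.]])
feature_map = tf.reshape(feature_map, (1, 4, 4, 1))

# 2x2 max-pooling with stride 2 keeps the maximum pixel of each 2x2 patch
pooled = tf.nn.max_pool2d(feature_map, ksize=2, strides=2, padding='VALID')
print(tf.squeeze(pooled).numpy())
# [[6. 2.]
#  [3. 8.]]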
✓ Both of the pre-trained models we mentioned earlier are different variants of the
ResNet CNN architecture.
✓ ResNet stands for Residual Networks, which introduced the novel concept of using
residual or skip connections to build deeper neural network models without facing
problems like vanishing gradients and degraded generalization ability.
✓ The typical architecture of a ResNet-50 has been simplified and depicted in figure
1.11.
• It is pretty clear that the ResNet-50 architecture consists of several stacked convolutional
and pooling layers followed by a final global average pooling and a fully connected layer
with 1000 units to make the final class prediction.
• This model also introduces the concept of batch-normalization layers interspersed between
layers to help with regularization. The stacked conv and identity blocks are novel concepts
introduced in the ResNet architecture which make use of residual or skip connections, as
seen in the detailed block diagrams in figure 1.11.
✓ The whole idea of a skip connection (also known as residual or shortcut connections)
is to not just stack layers but also directly connect the original input to the output of a
few stacked layers as seen in figure 1.12 where the original input is added to the output
from the conv or identity block.
✓ The purpose of using skip connections is to enable the capability to build deeper
networks without facing problems like vanishing gradients and saturation of
performance by allowing alternate paths for gradients to flow through the network.
✓ We see different variants of the ResNet architecture in figure 1.12.
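To make the idea of a skip connection concrete, here is a minimal sketch of an identity
(skip-connection) block in TensorFlow/Keras; it is a simplified assumption for illustration,
not the exact ResNet-50 block.

import tensorflow as tf

def identity_block(x, filters):
    # A simplified residual block: two conv layers plus a skip connection
    shortcut = x                                              # keep the original input
    x = tf.keras.layers.Conv2D(filters, (3, 3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(filters, (3, 3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Add()([x, shortcut])                  # add the input back (skip connection)
    return tf.keras.layers.Activation('relu')(x)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = identity_block(inputs, filters=64)
block = tf.keras.Model(inputs, outputs)
block.summary()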
• For our first pre-trained model we will use a ResNet-50 model which has been trained
on the ImageNet-1k dataset with a multi-class classification task.
• Our second pre-trained model uses Google’s pre-trained Big Transfer Model for
multi-label classification (BiTM), which has variants based on ResNet-50, 101 and 152.
• The model we use is based on a variant of the ResNet-152 architecture which is 4
times wider.
✓ The Big Transfer Models (BiT) were trained and published by Google in May 2020
as part of their seminal research paper [15].
✓ These pre-trained models are built on top of the basic ResNet architecture we
discussed in the previous section, with a few tricks and enhancements.
i. Upstream Training
ii. Downstream Fine-tuning
Upstream Training:
• Here we train large model architectures (e.g. ResNet) on large datasets (e.g.
ImageNet-21k) with a long pre-training time, using concepts like Group
Normalization with Weight Standardization instead of Batch Normalization.
• The general observation has been that GroupNorm with Weight Standardization
scales well to larger batch sizes as compared to BatchNorm.
Downstream Fine-tuning:
• Once the model is pre-trained, it can be fine-tuned and ‘adapted’ to any new dataset
with a relatively smaller number of samples.
• Google uses a hyperparameter heuristic called BiT-HyperRule where stochastic
gradient descent (SGD) is used with an initial learning rate of 0.003 and a decay
factor of 10 at 30%, 60% and 90% of the training steps (a minimal sketch of this schedule
follows).
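A minimal sketch of such a schedule in TensorFlow/Keras, using PiecewiseConstantDecay to
approximate the decay-by-10 rule; the total number of steps is an assumption, and the full
BiT-HyperRule also adjusts the schedule and resolution to the dataset size.

import tensorflow as tf

total_steps = 10000          # assumed number of fine-tuning steps for this sketch
base_lr = 0.003

# Decay the learning rate by a factor of 10 at 30%, 60% and 90% of the training steps
boundaries = [int(total_steps * f) for f in (0.3, 0.6, 0.9)]
values = [base_lr, base_lr / 10, base_lr / 100, base_lr / 1000]
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)

optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)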
In our following experiments, we will be using the BiTM-R152x4 model, which is a pre-
trained Big Transfer model based on Google’s flagship CNN architecture, a ResNet-152 that
is four times wider, trained to perform multi-label classification on the ImageNet-21k dataset.
IMPLEMENTATION
Let’s now use these pre-trained models to solve our objective of predicting the Top-5 classes
of input images.
We start by loading up the specific dependencies for image processing, modeling and
inference.
import tensorflow as tf
import tensorflow_hub as tf_hub
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
print('TF Version:', tf.__version__)
print('TF Hub Version:', tf_hub.__version__)
TF Version: 2.3.0
TF Hub Version: 0.8.0
Do note that we use TensorFlow 2.x here, which is the latest version at the time of writing this
book. Since we will be directly using the pre-trained models for inference, we will need to
know the class labels of the original ImageNet-1k and ImageNet-21k datasets for the
ResNet-50 and BiTM-R152x4 models respectively, as depicted in listing 1.1.
!wget https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt
!wget https://storage.googleapis.com/bit_models/imagenet21k_wordnet_lemmas.txt

data1k = []
with open('ImageNetLabels.txt', 'r') as f:
    data1k = f.readlines()

data21k = []
with open('imagenet21k_wordnet_lemmas.txt', 'r') as f:
    data21k = f.readlines()

imagenet1k_mapping = {i: value.strip('\n')
                      for i, value in enumerate(data1k)}
imagenet21k_mapping = {i: value.strip('\n')
                       for i, value in enumerate(data21k)}
The next step would be to load up the two pre-trained models we discussed earlier from
TensorFlow Hub.
resnet_model_url = "https://tfhub.dev/tensorflow/resnet_50/classification/1"
resnet_50 = tf_hub.KerasLayer(resnet_model_url)

bit_model_url = "https://tfhub.dev/google/bit/m-r152x4/imagenet21k_classification/1"
bit_r152x4 = tf_hub.KerasLayer(bit_model_url)
Once we have our pre-trained models ready, the next step is to build some specific utility
functions, which you can access from the notebook for this chapter in our GitHub repository
mentioned earlier. Just to get some perspective:
The preprocess_image(…) function helps us in pre-processing, shaping and scaling the input
image pixel values to the range of 0-1.
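A rough sketch of what preprocess_image(…) might look like (the actual utility lives in the
chapter's notebook; this illustrative version simply scales the pixels and adds a batch
dimension):

import numpy as np

def preprocess_image(image):
    # Scale pixel values to the 0-1 range and add a batch dimension
    img_array = np.array(image, dtype=np.float32) / 255.0
    return np.expand_dims(img_array, axis=0)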
The visualize_predictions(…) function takes in the pre-trained model, the class label
mappings, the model type and the input image as inputs to visualize the top-5 predictions as
a bar chart.
The ResNet-50 model directly gives the class probabilities as outputs, but the BiTM-R152x4
model gives class logits as outputs, which need to be converted to class probabilities. We can
look at listing 1.2, which shows a section of the visualize_predictions(…) function which helps
us achieve this.
Listing 1.2 Getting the class probabilities from the model predictions
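The full listing is available in the chapter's notebook; a minimal sketch of the conversion step,
assuming the BiTM model returns a batch of logits (the variable names here are illustrative):

import tensorflow as tf

# 'model' and 'image' stand in for the loaded BiTM model and a preprocessed image batch
logits = model(image)                           # raw, unnormalized scores (logits)
probabilities = tf.nn.softmax(logits)           # normalized class probabilities summing to 1
top5_idx = tf.argsort(probabilities, direction='DESCENDING')[0][:5]   # top-5 class indices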
Remember that logits are basically the log-odds, or unnormalized class probabilities, and hence
you need to compute the softmax of these logits to get the normalized class probabilities,
which sum up to 1, as depicted in figure 1.13, which shows a sample neural network
architecture with the logits and the class probabilities for a hypothetical 3-class classification
problem.
The softmax function basically squashes the logits z using the transform
softmax(z_i) = exp(z_i) / Σ_j exp(z_j) to give us the normalized class probabilities. Let’s now
put our code into action! You can leverage these functions on any downloaded image using
the sequence of steps depicted in listing 1.3 to visualize the top-5 predictions of our two
pre-trained models.
Listing 1.3 Visualizing top-5 predictions of our pre-trained models on a sample image
img = Image.open('snow_leo.png').convert("RGB")
pre_img = preprocess_image(img)
plt.figure(figsize=(12, 3))
plt.subplot(1, 3, 1)
visualize_predictions(model=bit_r152x4, image=pre_img,
                      imagenet_mapping_dict=imagenet21k_mapping,
                      model_type='bit-multiclass')
plt.subplot(1, 3, 2)
resnet_img = tf.image.resize(pre_img, (224, 224))
visualize_predictions(model=resnet_50, image=resnet_img,
                      imagenet_mapping_dict=imagenet1k_mapping,
                      model_type='resnet')
plt.subplot(1, 3, 3)
plt.imshow(pre_img[0])
plt.tight_layout()
Voila! We have the Top-5 predictions from our two pre-trained models depicted in a nice
visualization in figure 1.14.
It looks like both our models performed well, and as expected the BiTM model is very specific
and more accurate, given it has been trained on over 21K classes covering very specific animal
species and breeds.
The ResNet-50 model has more inconsistencies as compared to the BiTM model with regard
to predictions on animals of similar genus but slightly different species, like tigers and lions,
as depicted in figure 1.15.
Another aspect to keep in mind is that these models are not exhaustive. They don’t cover each
and every entity on this planet. This would be impossible to do considering data collection for
this task itself would take centuries, if not forever! An example is showcased in figure 1.16
where our models try to predict a very specific dog breed, the Afghan Hound, from a given
image.
Based on the top-5 predictions in figure 1.16 you can see that while our BiTM model actually
gets the right prediction, the prediction probability is very low, indicating the model is not too
confident (given that it probably hasn’t seen too many examples of this dog breed in its
training data during the pre-training phase). This is where we can fine-tune and adapt our
models to make them more attuned to our specific datasets and output labels.
• We have covered quite a bit of ground so far in terms of understanding the what, how
and why of transfer learning.
• We even tried our hand at using off-the-shelf utilities to understand how transfer
learning proves itself useful and powerful.
• In this section we will touch upon a few challenges associated with transfer learning
before we close the chapter.
• Understanding the challenges and open questions provides a complete picture and also
helps in making the right decisions.
i. Negative Transfer
ii. Transfer Bounds or Knowledge Gain:
Negative Transfer:
• Transfer learning aims at improving performance on the target task; scenarios where some
or all of these improvements are observed are termed positive transfer scenarios.
• Yet in real-world settings, this is not always the case.
• There are cases when transfer learning can lead to a drop in overall performance.
• This is termed negative transfer.
• Negative transfer can occur due to a number of reasons.
• It could be due to the source task not being sufficiently related to the target task/domain.
• It could also be due to an incorrect choice of transfer method, or the transfer method being
unable to leverage the relationship between source and target.
• Avoiding negative transfer is important, yet the causes are abstract and difficult (if
not impossible) to narrow down.