
Unit – 4 DEEP FEEDFORWARD NETWORKS

History of Deep Learning - A Probabilistic Theory of Deep Learning - Gradient Learning - Chain Rule and
Backpropagation - Regularization: Dataset Augmentation - Noise Robustness - Early Stopping, Bagging
and Dropout - Batch Normalization - VC Dimension and Neural Nets.

History of Deep Learning:


1940s – The Beginning
 In 1943, Walter Pitts and Warren McCulloch built the first computer model
of a neuron using mathematics and logic.
 They introduced "threshold logic", an early attempt to mimic how the brain
thinks.
1960s – Backpropagation and Early Models
 In 1960, Henry J. Kelley proposed an early form of backpropagation (a
training method for neural networks).
 Stuart Dreyfus simplified this with the chain rule in 1962.
 In 1965, Alexey Ivakhnenko and Valentin Lapa developed a system where
data passed through multiple layers, laying the groundwork for deep
learning.
1970s – First AI Winter and New Ideas
 Funding for AI dropped in the 70s (called the AI winter) because early
promises couldn't be delivered.
 Still, progress continued. In 1979, Kunihiko Fukushima developed the
Neocognitron, an early convolutional neural network (CNN) for pattern
recognition.
 Seppo Linnainmaa created code for backpropagation in 1970, but it wasn't
applied to neural nets until 1985.
1980s-1990s – Second AI Winter and Key Advances
 In 1989, Yann LeCun combined CNNs and backpropagation to read
handwritten digits, a system used for reading checks.
 Despite another AI winter, key work continued:
o Support Vector Machines (SVMs) were introduced in 1995 by Cortes
and Vapnik.
o LSTM (Long Short-Term Memory) networks, used in language models,
were developed in 1997 by Hochreiter and Schmidhuber.
 By 1999, GPUs became common, making deep learning training 1000 times
faster over 10 years.
2000–2010 – Challenges and Big Data
 Vanishing Gradient Problem: Deep layers struggled to learn because the
learning signal became too weak.
 Solutions included:
o Layer-by-layer pre-training
o Use of LSTM
 In 2001, Big Data began gaining attention.
 In 2009, Fei-Fei Li launched ImageNet, a massive dataset of labeled
images, which became critical for training vision-based deep learning
models.
2011–2020 – Deep Learning Revolution
 GPU speed boosted progress, removing the need for pre-training.
 AlexNet (2012) won image recognition contests using CNNs, ReLU
activation, and dropout.
 Google Brain's "Cat Experiment" (2012):
o Trained a network on unlabeled YouTube images.
o Found a neuron that recognized cats without being told what a cat is.
 GANs (Generative Adversarial Networks) were invented in 2014 by Ian
Goodfellow.
o GANs involve two networks competing: one generates fake images,
the other tries to spot the fake.

Probabilistic Theory of Deep Learning

Probabilistic Theory of Deep Learning is an approach that helps us understand and improve deep neural
networks (DNNs) by using probability and statistics.

Why Uncertainty Is Important

In many real-life situations, data is noisy or incomplete. A good model should say "I'm not sure" when
the data is unclear. Probabilistic models do exactly that. They give not just predictions, but also a
measure of confidence (or uncertainty) in those predictions.

1. Bayesian Neural Networks (BNNs):

 In regular neural networks, weights are fixed numbers.

 In BNNs, weights are random variables with probability distributions.

 The model learns the distribution of weights (not just one value), which allows it to make
predictions with uncertainty estimates.

2. Variational Inference:

 Calculating exact probabilities is hard.

 Variational inference approximates complex probability distributions using simpler ones.

 It's often used with BNNs to estimate the distribution of weights efficiently.

3. Dropout as Bayesian Approximation:

 Normally, dropout is used during training to prevent overfitting.

 But it turns out, if we keep using dropout during testing, it's like doing Bayesian inference.

 This trick helps estimate uncertainty without big changes to the model.

4. Gaussian Processes (GPs):

 GPs are models that can predict a distribution over functions (not just values).

 They're very good at telling you how uncertain a prediction is.

 When combined with deep learning (as Deep Gaussian Processes), you get both flexibility and
uncertainty estimates.

5. Monte Carlo Dropout:

 It extends the idea of dropout.

 At test time, you run the model multiple times with dropout turned on.

 This gives different results each time, and the variation between them tells you how uncertain
the prediction is (see the sketch after this list).
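
A minimal NumPy sketch of the idea (the tiny two-layer network and its random placeholder weights are illustrative assumptions, not from the notes):

import numpy as np

# Monte Carlo dropout: run the same input through the network many times
# with dropout left ON, then read the spread of outputs as uncertainty.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 4)), np.zeros(10)   # placeholder weights
W2, b2 = rng.normal(size=(1, 10)), np.zeros(1)

def forward(x, p_drop=0.5):
    h = np.maximum(0.0, W1 @ x + b1)              # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop           # dropout stays on at test time
    h = h * mask / (1.0 - p_drop)                 # inverted-dropout scaling
    return (W2 @ h + b2)[0]

x = np.array([0.5, -1.0, 0.3, 2.0])
samples = np.array([forward(x) for _ in range(100)])  # 100 stochastic passes
print(samples.mean(), samples.std())  # mean prediction and an uncertainty estimate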

6. Ensemble Methods:

 You train multiple neural networks, each a bit different.

 You average their predictions.

 If the predictions vary a lot between models, that means the model is less certain.

 It's simple and often very effective.

GRADIENT LEARNING
What is Gradient Learning?

Gradient learning is the process of training machine learning models (especially neural networks) by
optimizing their parameters (weights and biases). This is done using an algorithm called gradient
descent.

What is Gradient Descent?

Gradient descent is a method that helps the model improve by:

 Looking at the loss function (which tells how wrong the model is).

 Calculating the gradient (the direction and steepness of the slope of the loss).

 Taking small steps in the direction that reduces the loss (like walking downhill to reach the
bottom). A minimal sketch follows this list.
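
As a concrete illustration, here is a minimal sketch of gradient descent on a one-parameter quadratic loss (the loss, starting point, and learning rate are arbitrary choices for illustration):

# Gradient descent on L(w) = (w - 3)^2; the minimum is at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)   # dL/dw points uphill, so we step the other way

w = 0.0     # starting guess
lr = 0.1    # learning rate (step size)
for step in range(50):
    w -= lr * grad(w)        # small step in the direction that reduces the loss

print(w, loss(w))            # w approaches 3, the bottom of the "hill"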

Use of Gradient-Based Methods

 They work well for smooth and continuous functions (like those in neural networks).

 It's much easier to minimize these functions than discrete or irregular ones.

 By estimating how small changes in the parameters affect the loss, we can improve the model
gradually.

Types of Gradient Descent Variants

There are multiple versions to make training faster and more stable (a mini-batch example follows this list):

1. Stochastic Gradient Descent (SGD):

o Updates the model using one data point at a time.

o Noisy but can escape local minima.

2. Mini-batch Gradient Descent:

o Updates using a small batch of data (e.g., 32 or 64 examples).

o More stable than SGD and faster than using the whole dataset.

3. Adaptive Methods (like Adam):

o Adjust the learning rate automatically during training.

o Often converges faster and is more stable.
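
A sketch of mini-batch gradient descent on a synthetic linear-regression problem (data, seed, and hyperparameters are made up for illustration). Setting batch_size = 1 gives plain SGD; setting it to len(X) gives full-batch gradient descent:

import numpy as np

# Mini-batch SGD for linear regression y ≈ X @ w on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))              # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]      # indices of one mini-batch
        err = X[b] @ w - y[b]
        g = X[b].T @ err / len(b)              # gradient of MSE on this batch
        w -= lr * g                            # update from this batch alone
print(w)  # close to true_w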

Convergence and Initialization

 Convex optimization: If the loss function is convex, gradient descent is guaranteed to find the
global minimum no matter where you start.

 Non-convex functions (common in deep learning): No guarantee of reaching the best solution.
Starting values matter.

 For feedforward neural networks:

o Weights should be initialized to small random values.

o Biases can be zero or small positive numbers.

Cost Function
The mean squared error (MSE) loss function arises from maximum likelihood estimation (MLE) when
assuming a Gaussian distribution for the outputs.

1. Maximum Likelihood Estimation: assume the model defines p(y | x) = N(y; f(x; θ), I). Maximizing
the likelihood of the training data is then equivalent to minimizing the negative log-likelihood.
2. The Cost Function from MLE

Putting this into the form of an expected value over the data distribution p_data:

J(θ) = (1/2) E_(x,y)~p_data [ ||y − f(x; θ)||² ] + const

This is the mean squared error (MSE) cost function with a constant offset. The constant doesn't affect
optimization because it doesn't depend on θ.
3. Gradient: Desirable Properties

 The gradient of the cost function (how fast the cost is changing) should be:

o Large enough to give clear direction.

o Predictable, so optimization can proceed steadily.

4. Desirable Property of Gradient

"Gradient must be large and predictable enough to serve as a good guide to the learning algorithm."

 The gradient tells the model how to update its parameters during training.

 If the gradient is:

o Too small → learning is very slow or stops (vanishing gradient).

o Unpredictable → training becomes unstable.

 We want gradients that are:

o Informative (accurately point toward reducing the loss).

o Stable and large enough to drive learning effectively.

5. Cross-Entropy and Regularization

"Cross-entropy cost used for MLE does not have a minimum value..."

For Discrete Outputs (e.g., classification):

 Models like logistic regression use cross-entropy loss.

 The model predicts probabilities (e.g., softmax).

 Cross-entropy penalizes wrong predictions heavily.

 However, perfect certainty (probability = 0 or 1) is impossible because:

o log(0) is undefined (→ loss goes to infinity).

o The model approaches perfect certainty, but can never reach it exactly.

o So the loss doesn't have a clear minimum: it can keep improving.

For Real-Valued Outputs (e.g., regression):

 If we model the output with a Gaussian distribution, cross-entropy involves the variance.

 If the model learns a tiny variance, it can assign extremely high density to the correct output.

 This causes the log-likelihood to diverge to negative infinity: again, no well-defined minimum.

 That's why regularization is needed, to prevent the model from becoming overly confident. A
small numeric illustration follows this list.
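
A tiny numeric illustration of the missing minimum (the probability values are chosen arbitrarily): as the predicted probability of the correct class approaches 1, the cross-entropy keeps shrinking but never bottoms out.

import numpy as np

# Cross-entropy for one correctly-labeled example is -log(p).
# The loss keeps improving as p -> 1 but only reaches 0 in the limit,
# and it diverges to infinity as p -> 0 (a confidently wrong prediction).
for p in [0.9, 0.99, 0.999, 0.999999]:
    print(f"p = {p}: loss = {-np.log(p):.6f}")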

6. Learning Conditional Statistics

"We often want to learn just one conditional statistic of y given x."

 Instead of learning the whole distribution p(y | x), we may only care about:

o The mean E[y | x]

o Or the mode, median, etc.

 This simplifies the learning problem.

 Example: In MSE regression, we're learning the expected value of y given x.

7. Learning a Function (Functional View)

"The cost function is a functional rather than a function..."

This means:

 In deep learning, we're not just adjusting parameters; we're learning a function f(x).

 The cost is better seen as a functional: a function of a function.

o Example: It takes the entire function f and gives a single number (the total loss).

 So instead of thinking about minimizing cost by tuning parameters, we can think of:

o Choosing the best function from a space of all possible functions.
o The cost functional is designed so that its minimum lies at the function we want (e.g.,
the one mapping x to E[y | x]).

Chain Rule and Backpropagation

1. Chain Rule and Backpropagation: What are they?

 The chain rule is a rule from calculus used to compute the derivative of a function composed of
other functions.

o In neural networks, we use it to compute how a change in weights affects the final
output error, even across multiple layers.

 Backpropagation uses the chain rule to calculate gradients of the loss (error) with respect to the
weights in the network.

o It is essential for training deep networks using gradient descent.

2. What is Backpropagation?

 Backpropagation is a training algorithm for multi-layer neural networks (also called deep neural
networks).

 It's also called the generalized delta rule, an extension of simpler learning rules like the Widrow-
Hoff rule.

 It systematically updates weights to minimize the error between predicted and actual output by
using gradient descent.
How It Works (step-by-step idea):

1. Forward Pass:

o Inputs go through the network layer by layer.

o The final output is computed.

o The error (difference between predicted and actual output) is calculated.

2. Backward Pass:

o This is where the chain rule is used.

o Gradients (slopes of error with respect to weights) are calculated layer by layer, starting
from the output layer and going backward.

o These gradients show how much each weight contributed to the error.

3. Update Weights:

o Using gradient descent, we update the weights to reduce the error.

Nonlinearity

 Each neuron applies an activation function (like sigmoid, tanh, etc.).

 Non-linear activation functions let the network learn complex patterns.

 Without non-linearity, no matter how many layers we have, the entire network acts like a single
linear function.

Use of Hidden Layers

 Two-layer networks (input and output) can only learn simple relationships (e.g., linearly
separable data).

 Hidden layers allow the network to learn non-linear and complex mappings.

 This enables the network to solve real-world problems like image recognition, language
processing, etc.

Connectivity and Learning

 Neurons in one layer are only connected to the next layer.

 The output of each neuron is scaled by the weight and passed forward.

 The network learns by adjusting these weights using backpropagation so that the output gets
closer to the desired result.

Training Procedure:
Training a neural network means adjusting the weights so that it can produce correct outputs for a given
set of inputs. This is done through repeated exposure to input-output pairs.

Training Algorithm Steps:

1. Initialize Weights Randomly:

o All weights in the network are set to small random values (both positive and negative).

o This prevents neurons from becoming saturated (e.g., stuck with outputs too close to 0
or 1 in sigmoid).

2. Pick a Training Pair:

o Select one input-output pair from the dataset. This is called supervised learning
because we provide the desired output.

3. Apply the Input:

o Feed the input vector to the input layer of the network.

4. Calculate the Output (Forward Pass):

o Data flows through the network from input → hidden layer(s) → output.

o Each neuron calculates its output using a weighted sum and activation function.

5. Calculate the Error:

o Compare the network's output to the target (desired) output.

o Compute the error using a loss function (e.g., Mean Squared Error).

6. Adjust the Weights (Backward Pass):

o Use backpropagation and gradient descent to update weights.

o The goal is to reduce the error by shifting the weights in the direction that lowers the
loss.

7. Repeat for All Training Pairs:

o Go through all pairs in the dataset.

o Repeat the process for multiple epochs (full passes through the dataset) until the total
error is low enough. A miniature end-to-end example follows this list.
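
Putting all seven steps together, here is a miniature from-scratch example in NumPy that learns XOR with one hidden layer (the layer sizes, seed, and learning rate are arbitrary illustrative choices):

import numpy as np

# Forward pass, error, backward pass (chain rule), gradient-descent update.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)  # small random initialization
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(10000):
    # forward pass: input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: squared-error gradient pushed through each sigmoid
    d_out = (out - Y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)     # error inferred for the hidden layer
    # update weights in the direction that lowers the loss
    W2 -= lr * (h.T @ d_out);  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))  # approaches [[0], [1], [1], [0]]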

Forward Pass vs Backward Pass:

Forward Pass:

 Data flows from input to output.

 Output is computed layer-by-layer.

 Used to evaluate the current performance of the network.

Backward Pass:

 The error is propagated backward through the network.

 Gradients are calculated for each weight using the chain rule.

 Weights are updated to reduce the output error.

Weight Adjustment Strategy:

1. Output Layer:

o Adjusted first because we know the target values.

o Use the delta rule (a part of gradient descent) to update weights.

2. Hidden Layers:

o More challenging because they don't have direct target values.

o Instead, their errors are inferred from the layers above using the chain rule.

o These inferred errors guide how their weights should be updated.


What is the Chain Rule?

In calculus, the chain rule is used to compute the derivative of a composite function. If you have
two functions:

y = f(g(x))

Then the derivative of y with respect to x is:

dy/dx = f'(g(x)) · g'(x)

Chain Rule in Deep Learning (Backpropagation)

Neural networks are composed of layers where each layer applies a function to the previous
layer's output:

x → z = Wx + b → a = σ(z)

During training, we want to compute the gradient of the loss function with respect to each
parameter (e.g., the weights W) to update them using gradient descent.

Using the chain rule, we compute:

∂L/∂W = (∂L/∂a) · (∂a/∂z) · (∂z/∂W)

Where:

 L is the loss function,

 a is the activation,
 z = Wx + b is the pre-activation value.

This is done layer by layer from the output to the input (backward), hence the name
backpropagation.
Example: Single Neuron

Suppose a neuron computes:

z = w·x + b,

a = σ(z),

L = Loss(a, y)

Then

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)

Each of these partial derivatives is easy to compute, and the chain rule lets us link them together to find
the gradient. A numeric check follows.
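
Here is that check in code, assuming the squared error Loss(a, y) = (a − y)² for concreteness (the notes leave the loss unspecified). The hand-derived gradient matches a finite-difference estimate:

import numpy as np

# z = w*x + b, a = sigmoid(z), L = (a - y)^2, so by the chain rule
# dL/dw = dL/da * da/dz * dz/dw = 2*(a - y) * a*(1 - a) * x.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w, b, x, y = 0.5, 0.1, 2.0, 1.0

z = w * x + b
a = sigmoid(z)
dL_dw = 2 * (a - y) * a * (1 - a) * x    # the three chain-rule factors linked

# Compare against a finite-difference approximation of dL/dw:
eps = 1e-6
L = lambda w_: (sigmoid(w_ * x + b) - y) ** 2
print(dL_dw, (L(w + eps) - L(w - eps)) / (2 * eps))  # the two values agree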

1. The chain rule allows us to propagate error gradients backward through the layers.
2. It enables gradient-based optimization methods like SGD, Adam, etc.
3. It’s the core idea behind backpropagation, which powers training of deep neural
networks.

Regularization: Dataset Augmentation

Regularization techniques are essential for preventing overfitting in machine learning models, including
neural networks. Dataset augmentation is one such technique used to enhance the generalization ability
of models by artificially increasing the size and diversity of the training dataset.
Heuristic data augmentation schemes often rely on the composition of a set of simple transformation
functions (TFs) such as rotations and flips (see Figure). When chosen carefully, data augmentation
schemes tuned by human experts can improve model performance. However, such heuristic strategies in
practice can cause large variances in end model performance and may not produce the augmentations
needed for state-of-the-art models.

Data augmentation can be defined as the technique used to improve the diversity of the data by slightly
modifying copies of already existing data or by creating new synthetic data from the existing data. It is
used to regularize the data and it also helps to reduce overfitting. Some of the techniques used for data
augmentation are:

1. Rotation (range 0-360 degrees)
2. Flipping (true or false for horizontal flip and vertical flip)
3. Shear range (image is shifted along x-axis or y-axis)
4. Brightness or contrast range (image is made lighter or darker)
5. Cropping (resize the image)
6. Scale (image is scaled outward or inward)
7. Saturation (depth or intensity of the image)

Here's how dataset augmentation works within the context of regularization:

What is Dataset Augmentation?

Dataset augmentation is a technique used in machine learning (especially in computer vision)
to artificially increase the size and diversity of a training dataset. Instead of collecting more real
data, you take existing data and apply transformations to create new, slightly changed
versions of it.

These changes make the model more robust (able to generalize better) by teaching it to handle
variations it might see in real-world data, without changing the essential meaning of the data.

1. Prevents overfitting: Helps the model avoid memorizing the training data.
2. Improves generalization: Makes the model better at handling unseen data.
3. Expands small datasets: Useful when real-world data is limited or hard to collect.

Common Types of Transformations

1. Geometric Transformations
Change the position or shape of the image:
o Rotation: Turn the image slightly.
o Translation: Shift the image up/down or left/right.
o Scaling: Zoom in or out.
o Cropping: Cut out a part of the image.
o Flipping: Mirror the image horizontally or vertically.
2. Color Transformations
Change how the image looks visually:
o Brightness: Make the image lighter or darker.
o Contrast: Change the difference between dark and light areas.
o Saturation: Make colors more or less intense.
o Hue: Shift the overall color tone.
3. Noise Injection
Add small random changes (noise) to simulate imperfections:
o Helps the model learn to ignore irrelevant variations.
4. Random Cropping and Padding
o Random cropping: Take a random part of the image.
o Padding: Add extra borders with a certain color or pattern.

A minimal sketch of a few of these transformations is shown below.
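
The sketch uses a placeholder image in NumPy (everything here is illustrative; real pipelines typically use library transforms, e.g. those in torchvision):

import numpy as np

# Random flip, small translation, and brightness jitter on one image.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                 # placeholder H x W x RGB in [0, 1]

def augment(img):
    out = img
    if rng.random() < 0.5:
        out = np.fliplr(out)                  # horizontal flip (mirror)
    shift = rng.integers(-4, 5)
    out = np.roll(out, shift, axis=1)         # shift left/right a few pixels
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 1)  # lighter or darker
    return out

batch = np.stack([augment(img) for _ in range(8)])   # 8 new variants of 1 image
print(batch.shape)                            # (8, 32, 32, 3)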

What is the Regularization Effect?

When we say "dataset augmentation acts as a form of regularization", we mean:

It helps the model avoid overfitting by making learning harder in a good way — so the model
doesn't just memorize but actually learns patterns that work on new, unseen data.

When you apply random changes (like rotations, brightness shifts, or noise) to your training
data, you're:

 Making the data less perfect and more like the real world, where data isn’t always clean
or consistent.
 Forcing the model to adapt to this variation, instead of just memorizing specific
examples.

This process "regularizes" the model — meaning it makes the learning more stable and
general.

Without augmentation:

 A model might memorize training images — like "I know this cat because of the exact
position of its ears and background."
 This leads to overfitting, where the model performs well on training data but poorly on
new data.

With augmentation:

 The model sees many variations of the same image.


 It learns core features that matter — like "a cat usually has pointed ears and whiskers,"
no matter the angle, brightness, or background.
 This leads to better generalization.
 Dataset augmentation acts like a regularizer (just like dropout or weight decay).
 It helps the model focus on important, general features.
 It reduces overfitting and boosts performance on real, unseen data.

Example:

Let’s say you’re training a model to recognize cars.

If you:

 Rotate the car images,


 Move them slightly in the frame,
 Adjust brightness like daytime or night...

Then the model learns:

“Ah, that’s still a car, even if it’s turned, shifted, or in different lighting.”

Early Stopping, Bagging and Dropout

Early Stopping
Early Stopping is a technique used in training machine learning models (especially neural networks) to
prevent overfitting, which is when the model learns the training data too well, including its noise or
errors, and performs poorly on new, unseen data.

Here's how it works:

1. Use a Validation Set:

While training, you split off a small part of your data (called the validation set) that the model
doesn't learn from, but you use it to check how well the model is doing.

2. Monitor Validation Loss:

After each round of training (called an epoch), you check how much error the model is making
on the validation set. This is called the validation loss.

3. Stop When Performance Gets Worse:

At first, as training progresses, both training loss and validation loss usually decrease. But at
some point, the model starts to "memorize" the training data and forget how to generalize. This
shows up as the validation loss increasing.
When the validation loss doesn't improve for a while (e.g., 5 or 10 epochs), we stop training
early.

4. Best Model is Saved:

The model from the epoch with the lowest validation loss is usually saved and used for
predictions.

Use:
 It saves time by not training unnecessarily.

 It prevents overfitting and helps the model generalize better to new data. (A sketch of the
stopping logic appears below.)
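
In this sketch, train_one_epoch and validation_loss are hypothetical stand-ins for your own training and evaluation code:

import copy, math

def fit(model, patience=5, max_epochs=200):
    best_loss, best_model, bad_epochs = math.inf, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical: one pass over training data
        val_loss = validation_loss(model)      # hypothetical: error on the validation set
        if val_loss < best_loss:               # validation loss improved
            best_loss, bad_epochs = val_loss, 0
            best_model = copy.deepcopy(model)  # save the best model so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # no improvement for `patience` epochs
                break                          # stop training early
    return best_model                          # model from the best epoch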

Bagging
Bagging (short for Bootstrap Aggregating) is an ensemble learning technique in machine learning
designed to improve the accuracy and stability of models, particularly those that are high-variance (e.g.,
decision trees).

How Bagging Works:

1. Bootstrapping (Data Sampling):

o From the original dataset, multiple subsets are created by random sampling with
replacement.

o Each subset is the same size as the original dataset (or slightly smaller).

2. Training Multiple Models:

o A separate model is trained on each bootstrapped subset.

o Commonly used with decision trees (e.g., in Random Forests).

3. Aggregation:

o For classification tasks: uses majority voting across all models.

o For regression tasks: uses averaging of model outputs.

Benefits:

 Reduces variance: Helps to prevent overfitting by averaging out fluctuations.

 Improves accuracy: Especially effective for unstable learners like decision trees.

 Parallelizable: Since each model is trained independently, it's easy to parallelize.

Example:
The most well-known application of bagging is the Random Forest algorithm, which builds multiple
decision trees using bagged samples and random feature selection.

Pseudocode
Bagging (Bootstrap Aggregating) – Pseudocode

Input:
  Training data D with m examples
  Base learning algorithm L
  Number of models T

Algorithm:
  For t = 1 to T:
    a. Generate bootstrap sample S_t by randomly sampling m examples from D with
       replacement.
    b. Train base learner h_t = L(S_t)

Output: Combined classifier
  H(x) = MajorityVote(h₁(x), h₂(x), ..., h_T(x))

For regression, replace MajorityVote with Average:
  H(x) = (1/T) · Σ_t h_t(x)
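
The same algorithm as a runnable sketch, using scikit-learn decision trees as the base learner (the synthetic dataset and T = 25 are arbitrary choices; in practice sklearn.ensemble.BaggingClassifier packages this up):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

models = []
for t in range(25):                                # T = 25 base learners
    idx = rng.integers(0, len(X), size=len(X))     # bootstrap: sample with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in models])   # each tree votes on each example
pred = (votes.mean(axis=0) >= 0.5).astype(int)     # majority vote
print((pred == y).mean())                          # accuracy of the ensemble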


Dropout:
Dropout is a regularization technique specifically designed for training neural networks to
prevent overfitting. It involves randomly "dropping out" (i.e., deactivating) a fraction of neurons
during training. The key aspects of dropout are:

1. Random Deactivation:
o In each training iteration, a fraction of neurons is set to zero with a probability p
(usually between 0.2 and 0.5).
2. Training and Inference:
o Dropout is only applied during training.
o During inference, all neurons are active.
o To keep expected activations consistent, outputs are scaled by the keep probability
(1 − p) at inference; the common "inverted dropout" variant instead scales by
1/(1 − p) during training (see the sketch below).
3. Ensemble Effect:
o Dropout simulates training many different subnetworks.
o This ensemble behavior helps in learning more generalizable features and reduces
reliance on specific neurons.
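
A sketch of a single dropout layer in the inverted-dropout form (scaling by 1/(1 − p) during training so that inference needs no rescaling):

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    if not training:
        return h                        # inference: all neurons active, no scaling
    mask = rng.random(h.shape) > p      # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)         # rescale so expected activation is unchanged

h = np.ones((2, 8))                     # a small batch of activations
print(dropout(h))                       # roughly half zeros, survivors scaled to 2.0
print(dropout(h, training=False))       # unchanged at inference time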

Batch Normalization
 Batch Normalization is a technique used to make training deep neural networks faster
and more stable.
 When a neural network is training, the parameters (like weights) in each layer keep
changing. This causes the distribution (e.g., the range and mean of values) of inputs to
the next layers to also change. This is called internal covariate shift.
 Imagine you're trying to learn something new, but the rules keep changing slightly every
time: it's harder to learn. That's what internal covariate shift does to neural networks.

Use of Batch Normalization

Batch Normalization reduces this shifting effect by:

1. Normalizing the inputs of each layer (subtracting the mean and dividing by the standard
deviation).
2. Then, it scales and shifts the normalized values using learned parameters (so the network still
has flexibility).

This helps the network:

 Train faster,
 Use higher learning rates,
 Be less sensitive to weight initialization,
 And often generalize better (perform well on unseen data)

Normalization

 For every mini-batch (a small subset of your dataset used during training), batch normalization
standardizes the inputs to a layer.
 It does this by subtracting the mini-batch mean μ_B and dividing by the mini-batch standard
deviation:

x̂ = (x − μ_B) / √(σ_B² + ε)

where σ_B² is the mini-batch variance and ε is a small constant for numerical stability.

Purpose: This ensures the inputs to each layer have zero mean and unit variance, which helps
the network learn more efficiently.

Scaling and Shifting

 After normalization, we don't just pass the standardized values as-is. We apply two trainable
parameters, a scale γ and a shift β:

y = γ · x̂ + β

This allows the network to undo the normalization if needed and still learn the best
representation for the task.

Training and Inference

 During training: The mean and variance are calculated from each mini-batch.
 During inference (when making predictions): We use running averages of the mean and
variance computed over the whole training process, not batch-wise stats. (A forward-pass
sketch follows.)
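
A sketch of the batch-norm forward pass for one layer (gamma, beta, and the toy mini-batch are illustrative; the running-average bookkeeping used at inference is omitted):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize: zero mean, unit variance
    return gamma * x_hat + beta             # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))      # mini-batch: 32 examples, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(6))  # ≈ 0 and ≈ 1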

The diagram shows:

 Forward Propagation: Data moves through input → hidden layers → output.

 Backpropagation: Errors are sent backward to update weights.
 Batch normalization corrects each layer ("Oops! I'll correct my layer") by normalizing
activations at each layer.
Benefits of Batch Normalization
 Batch normalization offers several benefits to the training process of deep neural networks.
 Batch normalization makes training faster, more stable, and more reliable, while also helping
generalization and supporting deeper architectures.

1. Improved Optimization

 Batch normalization allows the model to use higher learning rates safely.
 Normally, high learning rates can make training unstable, but batch norm helps keep
activations in a predictable range.
 This speeds up training and reduces the need for careful manual tuning of learning rates
or other hyperparameters.

2. Regularization

 During training, batch norm uses the mini-batch statistics (mean and variance), which
introduces a bit of randomness into each forward pass.
 This acts like a regularizer by slightly disturbing the activations each time, much like
dropout.
 As a result, it helps reduce overfitting: the model becomes less likely to just memorize
the training data.

3. Reduced Sensitivity to Initialization

 Neural networks are often sensitive to their initial weights: bad initializations can slow
down or ruin training.
 Batch normalization lessens this sensitivity, because it keeps activations well-behaved
even if the initial weights aren't ideal.
 This means the network is more robust and more likely to converge to a good solution.

4. Allows Deeper Networks

 One of the main challenges in training very deep networks is the internal covariate shift,
where the input distribution to layers changes constantly.
 Batch normalization reduces this shift, which makes it easier to train deeper
architectures.
 That's why modern deep models like ResNet, VGG, and Transformers often use batch
norm.
VC Dimension and Neural Nets
The VC dimension is a theoretical concept that measures the capacity or expressiveness of a
learning algorithm, specifically the size of the largest set of points that can be shattered by the
model.

 To shatter a set of points means that for every possible way of labeling those points
(e.g., as + or −), there exists some classifier in the model's hypothesis space that can
perfectly separate them.

 A linear classifier (like a straight line) can shatter 3 points in 2D space. That is, for any
labeling of 3 points in general position, there's always a line that separates the + and −
correctly.

 However, 4 points cannot always be shattered by a linear classifier. There's at least one
labeling of 4 points (e.g., the XOR-style pattern) for which no single straight line can separate
+ and − labels perfectly.

Relevance to Neural Networks:

 The VC dimension tells us how complex a neural network is in terms of the variety of patterns it
can learn.

 A higher VC dimension usually means a model can fit more complex data, but it also means a
higher risk of overfitting.

 It's a crucial concept for understanding generalization: whether a model just memorizes data or
truly learns patterns.
Shattering a set of examples:

Assume a binary classification problem with N examples in R^D and consider the set of 2^N possible
dichotomies. For instance, with N = 3 examples, the set of all possible dichotomies is {(000), (001), (010),
(011), (100), (101), (110), (111)}. A class of functions is said to shatter the dataset if, for every possible
dichotomy, there is a function f(α) that models it. Consider as an example a finite concept class C =
{c1, ..., c4} applied to three instance vectors, with the results below.

Step-by-Step Breakdown Using the Table:
We are working with a concept class C = {c1, c2, c3, c4}, and each concept function gives output labels on
three input instances: x1, x2, x3.

Each concept corresponds to one row in this table (reconstructed from the outputs used in the analysis
below):

Concept | x1 | x2 | x3
c1      | 1  | 1  | 1
c2      | 0  | 1  | 1
c3      | 1  | 0  | 0
c4      | 0  | 0  | 0

To shatter a set of input points means that for every way you could assign 0s and 1s (labels) to those
points, there's some concept function in C that gives exactly those labels.

If you're trying to shatter:

 1 point → there are 2 possible labelings (0 or 1)

 2 points → 4 labelings: (0,0), (0,1), (1,0), (1,1)

 3 points → 8 labelings: (000), (001), ..., (111)

We now ask: Can our concept class produce all those combinations?

Detailed Analysis

1 Point (say x1):

Look at the outputs of each concept on x1:

 c1(x1) = 1

 c2(x1) = 0

 c3(x1) = 1

 c4(x1) = 0

So we can generate both outputs: 0 and 1 → All labelings possible → 1 point is shattered

2 Points (say x1, x3):

Check all 4 concepts and their outputs on (x1, x3):

 c1 → (1, 1)

 c2 → (0, 1)

 c3 → (1, 0)

 c4 → (0, 0)

We have all 4 possible binary labelings:

 (0, 0), (0, 1), (1, 0), (1, 1) → All dichotomies present

→ 2 points can be shattered

3 Points (x1, x2, x3):

Now we need all 8 possible labelings for 3 bits, i.e.:

(0,0,0)

(0,0,1)

(0,1,0)

(0,1,1)

(1,0,0)

(1,0,1)

(1,1,0)

(1,1,1)

From the table, the concepts produce only the following outputs:

Only 4 patterns are covered: (111), (011), (100), (000)

The remaining 4 are missing.

→ Not all possible labelings can be achieved
→ 3 points cannot be shattered

Final Result:

 1 point → shattered

 2 points → shattered

 3 points → not shattered

So, VC dimension = 2. (A brute-force check of this result appears below.)
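
The check uses the concept table reconstructed above: a set of points is shattered exactly when every possible labeling appears among the concepts' outputs.

from itertools import product

concepts = {                     # outputs on (x1, x2, x3) from the table above
    "c1": (1, 1, 1),
    "c2": (0, 1, 1),
    "c3": (1, 0, 0),
    "c4": (0, 0, 0),
}

def shattered(point_indices):
    achieved = {tuple(c[i] for i in point_indices) for c in concepts.values()}
    needed = set(product([0, 1], repeat=len(point_indices)))
    return achieved == needed    # every dichotomy must be realized

print(shattered((0,)))           # True:  1 point is shattered
print(shattered((0, 2)))         # True:  x1, x3 give all 4 labelings
print(shattered((0, 1, 2)))      # False: only 4 of the 8 labelings appear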
