
Batch 1 (Marketing Ops)

Q1. What is ReLU Function? What role does it play in Deep Learning?

ReLU (Rectified Linear Unit) is an activation function defined mathematically as: f(x) = max(0, x)

This means it outputs the input directly if positive, and 0 if the input is negative.

Role in Deep Learning:

1. Alleviates the vanishing gradient problem - Unlike sigmoid or tanh functions, ReLU
doesn't saturate in the positive region, allowing for faster learning.

2. Introduces non-linearity - Without non-linear activation functions like ReLU, neural networks would simply be linear regressors.

3. Computational efficiency - ReLU is easy to compute (just a max operation).

4. Sparsity - By outputting exact zeros for negative inputs, ReLU creates sparse
activations, which can be beneficial for model representation.
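
As a quick illustration of the definition above, ReLU can be written in a couple of lines of NumPy (a minimal sketch; the sample values are made up for the example):

python

import numpy as np

def relu(x):
    # Element-wise max(0, x): passes positives through, zeroes out negatives
    return np.maximum(0, x)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])   # example pre-activations (made up)
print(relu(z))                               # [0.  0.  0.  1.5 3. ]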

Q2. Explain why backpropagation is better than brute force?

Backpropagation is vastly superior to brute force methods for neural network training:

A brute-force approach would involve:

• Testing every possible combination of weights

• Requiring exponential computational resources (O(m^n), where m is the number of possible weight values and n is the number of weights)

• Being completely infeasible even for small networks (and modern networks have millions of parameters)

Backpropagation advantages:

• Uses the chain rule of calculus to efficiently calculate gradients

• Computational complexity scales linearly with network size

• Efficiently reuses intermediate calculations (error signals)

• Takes advantage of the network structure to determine the impact of each weight on the
overall error

• Enables practical training of large, complex networks
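
To make the contrast concrete, here is a minimal NumPy sketch of backpropagation on a tiny one-hidden-layer network (illustrative only; the layer sizes, data, and learning rate are arbitrary assumptions, not part of the original answer):

python

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # 8 samples, 3 features (made-up data)
y = rng.normal(size=(8, 1))            # regression targets

W1 = rng.normal(size=(3, 4)) * 0.1     # hidden layer weights
W2 = rng.normal(size=(4, 1)) * 0.1     # output layer weights
lr = 0.1

for step in range(100):
    # Forward pass (intermediate values are kept for reuse in the backward pass)
    h = np.maximum(0, X @ W1)          # ReLU hidden activations
    y_hat = h @ W2                     # linear output
    loss = np.mean((y_hat - y) ** 2)   # mean squared error

    # Backward pass: chain rule, reusing the stored forward values
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = h.T @ d_yhat
    dh = d_yhat @ W2.T
    dW1 = X.T @ (dh * (h > 0))         # ReLU derivative is 1 where h > 0, else 0

    # Gradient descent update - one cheap pass, no exhaustive search over weights
    W1 -= lr * dW1
    W2 -= lr * dW2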

Q3. Give examples of 2 activation functions & draw them.

Activation Functions: Sigmoid and Tanh


1. Sigmoid Function

• Formula: σ(x) = 1/(1 + e^(-x))

• Output range: (0, 1)

• Historically popular but less used now due to vanishing gradient problems

• Still used in output layers for binary classification problems

2. Tanh Function (Hyperbolic Tangent)

• Formula: tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))

• Output range: (-1, 1)

• Zero-centered, which makes optimization easier in some cases

• Stronger gradients than sigmoid but still suffers from vanishing gradient issues
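
Since the question asks for a drawing, here is a small matplotlib sketch that plots both curves (a minimal example; any plotting setup would do):

python

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
sigmoid = 1 / (1 + np.exp(-x))   # output range (0, 1)
tanh = np.tanh(x)                # output range (-1, 1), zero-centered

plt.plot(x, sigmoid, label='sigmoid')
plt.plot(x, tanh, label='tanh')
plt.axhline(0, color='gray', linewidth=0.5)
plt.legend()
plt.title('Sigmoid vs Tanh')
plt.show()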

Q4. What is dropout regularization? Explain with Diagram

Dropout Regularization
Dropout Regularization is a technique to prevent overfitting in neural networks by randomly
"dropping out" (temporarily removing) neurons during training.

How it works:

1. During each training iteration, neurons are randomly deactivated with probability p
(typically 0.2-0.5)

2. Forward and backward passes occur only with the remaining neurons

3. At test time, all neurons are used but their outputs are scaled by the keep probability (1 - p) to compensate (with inverted dropout, activations are instead scaled by 1/(1 - p) during training)

Benefits:

• Prevents co-adaptation of neurons (neurons becoming too dependent on each other)

• Forces the network to learn more robust features

• Acts like an ensemble of many different network architectures

• Significantly reduces overfitting

Implementation:

• During training: output = activation(input) * mask, where mask is a randomly generated vector of 0s and 1s

• During testing: output = activation(input) * (1-p) to scale appropriately
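
The two bullets above can be written out directly; the following NumPy sketch assumes p is the drop probability (a simplified illustration, not a library implementation):

python

import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                   # drop probability
a = np.array([0.2, 1.3, 0.7, 2.1, 0.9])   # example activations (made up)

# Training: random binary mask drops each neuron with probability p
mask = (rng.random(a.shape) >= p).astype(float)
train_out = a * mask

# Testing: use all neurons, scale by the keep probability (1 - p)
test_out = a * (1 - p)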


Q5. Draw a typical Neural Network Diagram for handwritten digit recognition

Neural Network for Handwritten Digit Recognition (MNIST)

This neural network for handwritten digit recognition (commonly using the MNIST dataset)
features:

1. Input layer: 28×28 pixels = 784 neurons (one for each pixel in the image)

2. Hidden layers:

o First hidden layer: 256 neurons with ReLU activation

o Second hidden layer: 128 neurons with ReLU activation

3. Output layer: 10 neurons (one for each digit 0-9) with softmax activation

The architecture shows how input images are processed through multiple layers to recognize
handwritten digits. Key components include:

• Image flattening at the beginning

• Dense connections between layers

• ReLU activation in hidden layers

• Softmax function in the output layer for probability distribution across the 10 classes
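
As a sketch of the architecture described above (784 → 256 → 128 → 10), a Keras definition might look like the following (assuming the MNIST images arrive as 28×28 arrays):

python

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense

model = Sequential([
    Input(shape=(28, 28)),            # 28x28 grayscale image
    Flatten(),                        # 784-element input vector
    Dense(256, activation='relu'),    # first hidden layer
    Dense(128, activation='relu'),    # second hidden layer
    Dense(10, activation='softmax')   # one probability per digit 0-9
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])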

Q6. Explain why and how image flattening is done in Deep Learning?

Why Image Flattening is Necessary:

1. Input Format Requirement: Traditional neural network layers (fully connected/dense layers) require 1D vector inputs.

2. Dimensional Compatibility: Convolutional layers output 3D tensors, but fully connected layers need 1D inputs.

3. Transition Between Architectures: Serves as a bridge between the convolutional and fully connected parts of a network.

How Image Flattening Works:

1. Mathematical Operation: Conversion of multi-dimensional data (typically 2D or 3D) into a 1D vector.

2. Implementation:

o For a 2D image of size H×W, the flattened vector will have H×W elements.

o For a 3D volume (H×W×C), the flattened vector will have H×W×C elements.

o Elements are arranged sequentially row by row (or channel by channel).

Example: A 28×28 grayscale image (like in MNIST):

• Original format: 28×28 matrix

• After flattening: 784-element vector (28×28 = 784)

Code Implementation (in Python/Keras):

python

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense

# In a sequential model
model = Sequential([
    # Input layer - image is 28x28 pixels
    Input(shape=(28, 28)),
    # Flattening layer - converts to a 784-element vector
    Flatten(),
    # Hidden dense layer
    Dense(128, activation='relu'),
    # Output layer - one neuron per digit class
    Dense(10, activation='softmax')
])

Note: While flattening is common, it loses spatial information. For image tasks, CNNs preserve
spatial relationships by using convolutional layers before flattening.
Batch 2

Q1. Where and how Gradient descent is used in Deep learning?

Gradient descent is the fundamental optimization algorithm used in deep learning for training
neural networks.

Where it's used:

• In the training phase of nearly all deep learning models

• To minimize the loss/cost function that measures prediction error

• Across all major architectures: CNNs, RNNs, Transformers, etc.

How it works:

1. Calculate the gradient (direction of steepest increase) of the loss function with respect
to each model parameter

2. Update parameters in the opposite direction of the gradient (to decrease loss)

3. Apply learning rate to control step size

4. Repeat until convergence (minimal improvement in loss)

Types of gradient descent:

• Batch gradient descent: Uses entire dataset per update

• Stochastic gradient descent (SGD): Uses one sample per update

• Mini-batch gradient descent: Uses small batches (most common approach)

Gradient Descent Optimization

A typical gradient-descent diagram shows a loss-function landscape where:


• Red dots represent parameter values during training

• Blue arrows show the direction of updates

• Green dot represents the global minimum (optimal parameters)

• Each step brings the parameters closer to the minimum value

• Parameters are updated using the formula: θ = θ - α∇J(θ)
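
A minimal sketch of the update rule θ = θ - α∇J(θ) on a simple quadratic loss (the loss function and learning rate here are arbitrary choices for illustration):

python

import numpy as np

def loss(theta):
    # Simple convex loss with minimum at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):
    # Analytical gradient of the loss above
    return 2 * (theta - 3.0)

theta = 0.0      # initial parameter
alpha = 0.1      # learning rate

for step in range(50):
    theta = theta - alpha * grad(theta)   # theta = theta - alpha * dJ/dtheta

print(theta)     # approaches 3.0, the minimum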

Q2. Where is the softmax function required in Deep learning?

The softmax function is primarily used in output layers of neural networks for multi-class
classification problems.

Mathematical definition: softmax(z)ᵢ = e^zᵢ / Σ(e^zⱼ) for j=1 to K

Where z is the input vector and K is the number of classes.

Where it's used:

1. Multi-class classification output layers:

o Image recognition (classifying among multiple categories)

o Natural language processing (part-of-speech tagging, named entity recognition)

o Speech recognition (identifying phonemes or words)

2. Attention mechanisms in transformers:

o Used to compute attention weights in transformer architectures

Key properties:

• Converts raw scores (logits) into probabilities (values between 0 and 1)

• Ensures all outputs sum to 1 (proper probability distribution)

• Accentuates the largest input value while suppressing smaller values

• Differentiable, allowing for gradient-based learning

Example use case: In an image classifier with 10 classes (like MNIST digits), the last layer has 10
neurons. Softmax converts these 10 values into probabilities that sum to 1, allowing the model
to predict the most likely digit class.
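
A minimal NumPy sketch of the formula above, with the usual max-subtraction trick for numerical stability (the logits are made-up values):

python

import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores from the last layer (made up)
probs = softmax(logits)
print(probs)          # approximately [0.659 0.242 0.099]
print(probs.sum())    # 1.0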

Q3. What is Activation function and why is it required?

An activation function is a mathematical function applied to the output of each neuron in a neural network.

What it does:

• Applies a non-linear transformation to the weighted sum of inputs

• Determines whether and to what extent a neuron should "fire" or activate

Why activation functions are required:


1. Introduce non-linearity:

o Without activation functions, neural networks would just be linear regression models

o Non-linearity allows modeling of complex patterns and relationships

o Enables the network to learn more complex functions

2. Enable backpropagation:

o Most activation functions are differentiable, allowing gradient-based optimization

o Their derivatives are used to compute gradients during training

3. Control neuron output:

o Normalize outputs to specific ranges

o Prevent numerical issues like exploding values

Common activation functions:

1. ReLU (Rectified Linear Unit): f(x) = max(0, x)

o Most popular in hidden layers

o Computationally efficient

o Helps mitigate vanishing gradient problem

2. Sigmoid: f(x) = 1/(1+e^(-x))

o Outputs between 0 and 1

o Used in binary classification output layers

3. Tanh: f(x) = (e^x - e^(-x))/(e^x + e^(-x))

o Outputs between -1 and 1

o Zero-centered

4. Softmax: Mentioned in previous question

o Used in multi-class classification output layers

5. Leaky ReLU: f(x) = max(αx, x) where α is a small constant

o Addresses "dying ReLU" problem

Without activation functions, deep neural networks could not model complex, non-linear
relationships in data.
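
The last point can be checked numerically: stacking layers without an activation collapses to a single linear map (a small NumPy sketch with arbitrary random weights):

python

import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 3))
x = rng.normal(size=(1, 4))

# Two "layers" with no activation function in between...
two_layer = (x @ W1) @ W2
# ...are exactly equivalent to one linear layer with weights W1 @ W2
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layer, one_layer))   # True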

Q4. What is F1 score and where is it used?

The F1 score is a popular evaluation metric for classification models that balances precision
and recall.
Definition: F1 = 2 * (Precision * Recall) / (Precision + Recall)

Where:

• Precision = True Positives / (True Positives + False Positives)

• Recall = True Positives / (True Positives + False Negatives)

Where it's used:

1. Imbalanced dataset evaluation:

o When classes are not equally represented

o When simple accuracy is misleading

2. Binary classification problems:

o Medical diagnosis (disease detection)

o Spam detection

o Fraud detection

o Anomaly detection

3. Multi-class classification:

o Can be calculated per class (one-vs-rest) or averaged across classes

4. Information retrieval:

o Document classification

o Search result relevance

Why F1 score is important:

• Single metric that balances false positives and false negatives

• Ranges from 0 (worst) to 1 (best)

• Particularly useful when the cost of false positives and false negatives is similar

• More informative than accuracy when classes are imbalanced

Types of F1 score for multi-class problems:

• Macro F1: Simple average of F1 scores for each class (treats all classes equally)

• Weighted F1: Weighted average based on number of samples in each class

• Micro F1: Calculated by counting global true positives, false positives, and false
negatives

F1 score is especially valuable in deep learning when working with unbalanced datasets where
positive examples are rare, such as in medical imaging or fraud detection.
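
A short scikit-learn example of the averaging variants listed above (the labels are made up for illustration):

python

from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1, 2, 2, 0, 1, 2]   # made-up ground truth
y_pred = [0, 1, 0, 0, 1, 2, 1, 0, 1, 2]   # made-up predictions

print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class support
print(f1_score(y_true, y_pred, average='micro'))     # from global TP/FP/FN counts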

Q5. CNN with block diagram


Convolutional Neural Networks (CNNs) are specialized neural networks designed primarily for
processing grid-like data, especially images.

CNN Architecture Block Diagram

Key Components of a CNN:

1. Convolutional Layers:

o Apply filters (kernels) to input data

o Extract features like edges, textures, and patterns

o Parameters: filter size, stride, padding, number of filters

o Each filter produces a feature map

o Output dimensions: (input_size - filter_size + 2*padding)/stride + 1

2. Activation Function:

o Usually ReLU after each convolutional layer

o Introduces non-linearity

3. Pooling Layers:

o Reduce spatial dimensions (downsampling)

o Common types: Max Pooling, Average Pooling

o Helps achieve spatial invariance

o No learnable parameters

4. Flatten Layer:

o Converts 3D feature maps to 1D vector

o Prepares data for fully connected layers


5. Fully Connected Layers:

o Traditional neural network layers

o Process high-level features for final classification

o Usually has dropout for regularization

6. Output Layer:

o Contains neurons equal to number of classes

o Uses softmax activation for multi-class classification

Advantages of CNNs:

1. Parameter sharing: Same filter applied across the entire image

2. Translation invariance: Can detect features regardless of their position

3. Spatial hierarchy: Captures features at multiple levels of abstraction

4. Reduced parameters: Compared to fully connected networks of similar depth

CNNs are widely used in computer vision tasks like image classification, object detection, facial
recognition, and medical image analysis. They're also applied to non-image data like audio
spectrograms, time series, and even natural language processing.
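
As a sketch of the block diagram described above, a small Keras CNN for 28×28 grayscale images might look like this (the filter counts and sizes are illustrative choices, not prescribed by the answer):

python

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Input(shape=(28, 28, 1)),                       # H x W x C input
    Conv2D(32, kernel_size=3, activation='relu'),   # convolution + ReLU -> 26x26x32
    MaxPooling2D(pool_size=2),                      # downsample -> 13x13x32
    Conv2D(64, kernel_size=3, activation='relu'),   # -> 11x11x64
    MaxPooling2D(pool_size=2),                      # -> 5x5x64
    Flatten(),                                      # 3D feature maps -> 1D vector
    Dense(128, activation='relu'),                  # fully connected layer
    Dropout(0.5),                                   # regularization
    Dense(10, activation='softmax')                 # output probabilities
])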
