MMC102 - Module 4 - Notes
DATA ANALYTICS USING PYTHON
[MMC201]
(2024-26)
Sl. No — Experiments
1. Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples. Read the training data from a .CSV file.
2. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
4. Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
5. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris data set. Print both correct and wrong predictions.
6. Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same using appropriate data sets.
7. Write a program to demonstrate regression analysis with residual plots on a given data set.
8. Write a program to compute summary statistics such as mean, median, mode, standard deviation and variance for the given different types of data.
9. Write a program to implement the k-Means clustering algorithm to cluster the set of data stored in a .CSV file.
MODULE 4
ADVANCED MACHINE LEARNING TECHNIQUES
INTRODUCTION
As data becomes increasingly complex and high-dimensional, traditional machine learning algorithms often
struggle with performance and accuracy. Advanced Machine Learning Techniques are designed to handle
such challenges by introducing sophisticated modeling strategies, optimization methods, and the ability to
learn from limited or unstructured data.
ENSEMBLE METHODS
Ensemble learning is a method where we use many small models instead of just one. Each of these models
may not be very strong on its own, but when we put their results together, we get a better and more accurate
answer. It's like asking a group of people for advice instead of just one person—each one might be a little
wrong, but together, they usually give a better answer.
Bagging Algorithm:
Step 1: Bootstrap Sampling - Create 'N' subsets of the original training data by randomly sampling it with replacement.
Step 2: Base Model Training - For each bootstrapped sample we train a base model independently
on that subset of data.
Step 3: Prediction Aggregation - To make a prediction on testing data combine the predictions of all base
models. For classification tasks it can include majority voting and for regression it involves
averaging the predictions.
Step 4: Out-of-Bag (OOB) Evaluation - The "out-of-bag" samples (training examples not included in a model's bootstrap sample) can be used to estimate the model's performance without the need for cross-validation.
Step 5: Final Prediction - After aggregating the predictions from all the base models, Bagging produces a
final prediction for each instance.
Goal: The primary goal of bagging is to reduce variance and prevent overfitting. By averaging or voting
across multiple models trained on slightly different data, the ensemble becomes more stable and less
sensitive to noise or specific patterns in the training data. It's particularly effective with "unstable" models
like deep decision trees that are prone to high variance.
Bagging starts with an original dataset containing multiple data points (represented in the figure by colored circles). The original dataset is randomly sampled with replacement multiple times, which means that in each sample a data point can be selected more than once or not at all. These samples create multiple subsets of the original data.
• For each of the bootstrapped subsets, a separate classifier (e.g., decision tree, logistic regression) is
trained.
• The predictions from all the individual classifiers are combined to form a final prediction. This is
often done through a majority vote (for classification) or averaging (for regression).
Key Characteristics:
• Parallel: Base models are trained independently and in parallel.
• Homogeneous: Typically uses the same type of base learner (e.g., all decision trees).
• Reduces Variance: Focuses on reducing the error component due to variance.
Examples: The most well-known bagging algorithm is Random Forest, which builds multiple decision
trees on bootstrapped samples and also introduces further randomness by considering only a random subset
of features at each split.
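A minimal scikit-learn sketch of bagging with decision trees (an illustrative example, not part of the original notes; the Iris dataset and parameter values are assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 base learners (decision trees by default), each trained on a bootstrap
# sample; oob_score=True enables the out-of-bag evaluation from Step 4.
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=42)
bag.fit(X_train, y_train)
print("OOB score:", bag.oob_score_)
print("Test accuracy:", bag.score(X_test, y_test))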
Advantages of Bagging:
• Reduces Overfitting: Significantly lowers the risk of overfitting by averaging predictions and
reducing variance.
• Improved Robustness: The ensemble is less sensitive to noisy data or outliers because individual
model errors tend to cancel out.
• Easy to Parallelize: Due to the independent training of base models, bagging is computationally
efficient and scalable, especially for large datasets and distributed computing environments.
• Handles High-Variance Models Well: Particularly effective for models like decision trees that
tend to have high variance.
Disadvantages of Bagging:
• Less Effective in Reducing Bias: While it excels at reducing variance, it doesn't directly address
the underlying bias of the base model. If the base models are inherently biased, the ensemble might
still retain that bias.
• Reduced Interpretability: Combining multiple models makes the overall ensemble more complex
and harder to interpret compared to a single model.
• Computational Cost: Training multiple models can be computationally expensive, although
parallelization helps mitigate this.
• Can still overfit: While it generally reduces overfitting, if the base models are too complex or the
dataset is very noisy, there's still a possibility of the ensemble overfitting.
Boosting
Concept: Boosting is a sequential ensemble method where base models are trained iteratively. Each
subsequent model focuses on correcting the errors (misclassifications for classification, residuals for
regression) made by the previous models in the sequence. It "boosts" the performance of weak learners by
combining them.
Boosting Algorithm:
Step 1: Initialize Model Weights - Begin with a single weak learner and assign equal weights to all training
examples.
Step 2: Train Weak Learner - Train weak learners on this dataset.
Step 3: Sequential Learning - Boosting works by training models sequentially where each model focuses
on correcting the errors of its predecessor.
Step 4: Weight Adjustment - Boosting assigns weights to training data points. Misclassified examples receive higher weights in the next iteration so that the next models pay more attention to them.
Goal: The primary goal of boosting is to reduce bias and transform weak learners into a strong learner. By
iteratively focusing on difficult cases, boosting algorithms can achieve very high accuracy.
The figure illustrates how Boosting works. It starts with training on the original data. After each round, more weight is given to misclassified points so the next model focuses on them. This process repeats, and in the end all models are combined to make a final, more accurate prediction.
Key Characteristics:
• Sequential: Base models are trained one after another, with dependencies between them.
• Homogeneous (typically): Often uses the same type of base learner, though they can vary.
• Reduces Bias (and often variance): Primarily targets the bias component of the error, but can also
reduce variance.
Examples:
• AdaBoost (Adaptive Boosting): One of the earliest and most influential boosting
algorithms. It adjusts instance weights after each iteration.
• Gradient Boosting Machines (GBM): Builds models to minimize the "residuals" or errors
of the previous model. It uses gradient descent to optimize the loss function.
• XGBoost, LightGBM, CatBoost: Highly optimized and widely used implementations of
gradient boosting that offer significant performance improvements and additional features.
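A minimal AdaBoost sketch with scikit-learn (illustrative only; the dataset and parameter values are assumptions, not from the notes):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each boosting round re-weights misclassified samples (Step 4 above);
# learning_rate scales each weak learner's contribution.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))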
Advantages of Boosting:
• High Accuracy: Often achieves state-of-the-art predictive performance, especially on complex
datasets.
• Reduces Bias: Excellent at reducing bias by iteratively focusing on errors.
• Handles Complex Relationships: Can model complex non-linear relationships in data.
• Built-in Feature Selection: Some boosting algorithms implicitly perform feature selection by
giving more importance to relevant features.
Disadvantages of Boosting:
• Prone to Overfitting (if not tuned): While powerful, boosting can be susceptible to overfitting if
the number of iterations is too high, the base learners are too complex, or the learning rate is not
properly controlled.
• Sensitive to Outliers: Because it focuses on misclassified instances, outliers can disproportionately
influence the training process, potentially leading to poor performance.
• Slow to Train: The sequential nature of boosting makes it less parallelizable than bagging, leading
to longer training times for large datasets.
• Complex Hyperparameter Tuning: Boosting algorithms often have many hyperparameters that
need careful tuning to achieve optimal performance and prevent overfitting.
• Reduced Interpretability: Similar to bagging, the ensemble of multiple models can be difficult to
interpret.
Bagging vs. Boosting - Computational Efficiency: Bagging is highly parallelizable and often faster; Boosting is sequential and generally slower.
Gradient Boosting Machines (GBM)
In Gradient Boosting, the final prediction is built by adding the learning-rate-scaled corrections of each successive tree to the initial prediction:

Y_pred = y1 + η·r1 + η·r2 + ⋯ + η·rN

where r1, r2, …, rN are the errors (residuals) predicted by each tree and η is the learning rate.
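A minimal gradient boosting sketch with scikit-learn (the synthetic data and hyperparameter values are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new tree fits the residuals of the current ensemble; learning_rate
# plays the role of η in the formula above.
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print("R^2 on test data:", gbm.score(X_test, y_test))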
Advantages of GBMs
• High Predictive Accuracy: Often delivers state-of-the-art performance on tabular data, frequently
outperforming other algorithms.
• Handles Complex Relationships: Can capture complex non-linear relationships and interactions
between features.
• Flexibility: Can be used for both regression and classification tasks, and is compatible with various
loss functions.
• Robust to Feature Scaling: Tree-based models are generally not sensitive to the scaling of features.
• Implicit Feature Importance: Provides a measure of feature importance, indicating which features
contribute most to the model's predictions.
Disadvantages of GBMs
• Prone to Overfitting: If not carefully tuned (especially n_estimators, learning_rate, max_depth),
GBMs can easily overfit the training data.
• Computationally Intensive: Due to their sequential nature, training can be time-consuming,
especially with a large number of trees or deep trees, and is less parallelizable than bagging.
• Sensitivity to Outliers: Since each tree tries to correct the errors of previous trees, outliers can
significantly influence the model and potentially lead to poor performance if not handled.
• Less Interpretability: Like other ensemble methods, understanding the exact decision-making
process of a complex GBM can be challenging compared to a single decision tree.
• Hyperparameter Tuning: Requires careful tuning of multiple hyperparameters to achieve optimal
performance and prevent overfitting.
Applications
GBMs are widely applied in various fields due to their high performance:
• Fraud Detection: Identifying fraudulent transactions in financial data.
• Credit Scoring: Assessing creditworthiness of loan applicants.
• Customer Churn Prediction: Predicting which customers are likely to leave a service.
• Recommendation Systems: Predicting user preferences for products or content.
• Ranking (e.g., Search Engines): Ranking relevant search results.
• Disease Diagnosis and Prognosis: In healthcare, predicting disease presence or progression.
• Demand Forecasting: Predicting future sales or resource needs.
XGBoost (Extreme Gradient Boosting)
XGBoost is an optimized implementation of Gradient Boosting and is a type of ensemble learning method that combines multiple weak models to form a stronger model.
• XGBoost uses decision trees as its base learners and combines them sequentially to improve the
model’s performance. Each new tree is trained to correct the errors made by the previous tree and
this process is called boosting.
• It has built-in parallel processing to train models on large datasets quickly. XGBoost also supports
customizations allowing users to adjust model parameters to optimize performance based on the
specific problem.
ŷᵢ = Σₖ₌₁ᴷ fₖ(xᵢ)

where,
• ŷᵢ is the final predicted value for the ith data point,
• K is the number of trees in the ensemble,
• fₖ(xᵢ) represents the prediction of the kth tree for the ith data point.
Advantages of XGboost
• Scalable and efficient for large datasets with millions of records
• Supports parallel processing and GPU acceleration for faster training
• Offers customizable parameters and regularization for fine-tuning
Disadvantages of XGBoost
• XGBoost can be computationally intensive, making it less ideal for resource-constrained systems.
• It may be sensitive to noise or outliers, requiring careful data preprocessing.
• Prone to overfitting, especially on small datasets or with too many trees.
• Offers feature importance, but overall model interpretability is limited compared to simpler methods, which is an issue in fields like healthcare or finance.
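A minimal XGBoost sketch, assuming the xgboost package is installed (pip install xgboost); the dataset and parameter values are illustrative choices:

from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Trees are added sequentially, each one correcting the previous ensemble's errors.
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))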
SUPPORT VECTOR MACHINES (SVM)
Introduction
Support Vector Machines (SVM) is a powerful supervised machine learning algorithm used primarily for
classification, but also for regression tasks. It is particularly effective when there is a clear margin of
separation between classes.
The main goal of SVM is to maximize the margin between the two classes. The larger the margin the better
the model performs on new and unseen data.
• Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of different
classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This
means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide
the data points into their respective classes. A hyperplane that maximizes the margin between the
classes is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be separated into
two classes by a straight line (in the case of 2D). By using kernel functions, nonlinear SVMs can
handle nonlinearly separable data. The original input data is transformed by these kernel functions
into a higher-dimensional feature space where the data points can be linearly separated. A linear
SVM is used to locate a nonlinear decision boundary in this modified space.
Linear SVM
The Core Idea: Finding the Optimal Separating Hyperplane
Imagine you have a dataset with two different classes of data points (e.g., red dots and blue dots). The
fundamental goal of an SVM is to find a "decision boundary" that best separates these classes. In a 2D
space, this boundary is a line. In 3D, it's a plane, and in higher dimensions, it's called a hyperplane
wᵀx + b = 0
Where:
• w is the normal vector to the hyperplane (the direction perpendicular to it).
• b is the offset or bias term representing the distance of the hyperplane from the origin along the
normal vector w.
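A minimal linear SVM sketch with scikit-learn (the blob data is an illustrative assumption); after fitting, coef_ corresponds to the normal vector w and intercept_ to the bias b:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=42)

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print("w =", clf.coef_, " b =", clf.intercept_)
print("Support vectors per class:", clf.n_support_)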
Nonlinear SVM
Why Non-Linear SVM is Required
In many situations data cannot be separated with a straight line. For example, one group of points might
surround another group in a circle. A simple Support Vector Machines (SVM) won't work well here because
it only draws straight lines. Non-linear SVM is needed because it can draw curved lines to separate such
data properly. This helps the model make better predictions when the data has complex shapes or patterns.
SVM uses a technique called the kernel trick.
Non-Linear SVM uses kernels to work in higher dimensions where data can become linearly separable.
Kernel trick
In Support Vector Machines (SVMs), a kernel is a crucial mathematical function that plays a pivotal role
in handling complex, non-linear data.
Here's a breakdown of what kernels are and their importance in SVMs:
The Core Problem: Non-Linear Separability
SVMs primarily aim to find an optimal hyperplane that separates different classes of data points with the
largest possible margin. While a linear SVM works well when data is linearly separable (meaning you can
draw a straight line or plane to divide the classes), real-world data is often much more complex and not
linearly separable in its original form.
Linear Kernel
• A linear kernel is the simplest form of kernel used in SVM. It is suitable when the data is linearly
separable meaning that a straight line (or hyperplane in higher dimensions) can effectively separate
the classes.
RBF (Radial Basis Function) Kernel:
• Parameter: γ (gamma) controls the influence of each training example, defining the "width" of the
kernel.
• Use Case: The most popular and versatile kernel. It can handle very complex and non-linear
relationships, especially when there's no prior knowledge about the data's distribution. It effectively
maps data into an infinite-dimensional space.
Sigmoid Kernel:
• Parameters: γ and r.
• Use Case: Inspired by neural networks, it can be used for non-linear classification and sometimes
acts as a proxy for neural networks.
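A minimal non-linear SVM sketch (illustrative; make_circles generates exactly the situation described above, one class surrounding another):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

# The RBF kernel implicitly maps the data to a higher-dimensional space
# where a linear separator exists (the kernel trick).
clf = SVC(kernel='rbf', gamma='scale', C=1.0)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))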
Linear vs. Non-Linear SVM - Computational Cost: linear SVM is generally faster and less complex; non-linear (kernel) SVM is more computationally intensive.
Applications
1. Image Classification: They are widely used for image recognition tasks such as handwritten digit
recognition like MNIST dataset, where the data classes are not linearly separable.
2. Bioinformatics: Used in gene expression analysis and protein classification where the relationships
between variables are complex and non-linear.
3. Natural Language Processing (NLP): Used for text classification tasks like spam filtering or
sentiment analysis where non-linear relationships exist between words and sentiments.
4. Medical Diagnosis: Effective for classifying diseases based on patient data such as tumour
classification where data have non-linear patterns.
5. Fraud Detection: They can identify fraudulent activities by detecting unusual patterns in
transactional data.
6. Voice and Speech Recognition: Useful for separating different voice signals or identifying speech
patterns where non-linear decision boundaries are needed.
Other strengths of SVM:
• Binary and Multiclass Support: SVM is effective for both binary classification and multiclass classification, making it suitable for applications such as text classification.
• Memory Efficiency: It focuses only on the support vectors, making it memory efficient compared to other algorithms.
Mean Absolute Error (MAE)
Mathematical Formula
The formula to calculate MAE for data with "n" data points is:

MAE = (1/n) Σ |yi − ŷi|

Where:
• yi is the actual value for the ith data point,
• ŷi is the predicted value for the ith data point,
• n is the number of data points.

The graph of an employee's salary vs. experience in years shows the actual value on the regression line and the predicted value marked with X; the absolute distance between them is the absolute error, and its average over all points is the mean absolute error.

The Root Mean Squared Error (RMSE) is the square root of the Mean Squared Error (MSE):

RMSE = √MSE = √((1/n) Σ (yi − ŷi)²)
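A short NumPy sketch of both metrics (the salary values are made-up examples):

import numpy as np

y_true = np.array([30000, 35000, 40000, 45000])   # actual values (example)
y_pred = np.array([32000, 34000, 41000, 43000])   # predicted values (example)

mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("MAE:", mae, " RMSE:", rmse)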
Errors
In a test process there can be four possible situations, two of which lead to the two types of errors (Type I and Type II); these appear as false positives and false negatives in the confusion matrix below. In order to minimize both these types of errors, we need to increase the sample size.
Confusion Matrix:
A table that summarizes the performance of a classification model. It shows the counts of:
• True Positives (TP): Correctly predicted positive instances.
• True Negatives (TN): Correctly predicted negative instances.
• False Positives (FP): Incorrectly predicted positive instances (Type I error).
• False Negatives (FN): Incorrectly predicted negative instances (Type II error).
Accuracy
Accuracy is the proportion of correctly classified instances (predictions) out of the total number of instances
in the dataset.
Mathematically, it is defined as:
Accuracy = Number of Correct Predictions / Total Number of Predictions
Alternatively, using terms from a confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
• TP - True Positives: The model correctly predicted the positive class.
• TN - True Negatives: The model correctly predicted the negative class.
• FP - False Positives: The model incorrectly predicted the positive class (Type I error).
• FN - False Negatives: The model incorrectly predicted the negative class (Type II error).
Precision
Precision measures the proportion of true positive predictions among all instances that the model predicted
as positive. In simpler terms, it tells you "of all the times the model said 'yes,' how many of those times was
it actually correct?"
Mathematically, it is defined as:
Precision = TP / (TP + FP)
Recall
Recall measures the proportion of actual positive instances that were correctly identified by the model. In
simpler terms, it tells you "of all the times it should have said 'yes,' how many times did it actually say
'yes'?"
Mathematically, it is defined as:
Recall = TP / (TP + FN)
F1-Score
The F1-score is the harmonic mean of precision and recall. It's designed to give a more balanced view of
performance than either precision or recall alone.
The formula for the F1-score is:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Where:
Precision = TP / (TP + FP)    Recall = TP / (TP + FN)
The F1-score ranges from 0 to 1, where 1 represents perfect precision and recall, and 0 represents the worst
possible performance.
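All four metrics can be computed directly with scikit-learn; a short sketch with made-up labels and predictions:

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (example)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (example)

print(confusion_matrix(y_true, y_pred))    # rows: actual, columns: predicted
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))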
1. C (Regularization Parameter):
• Role: Controls the trade-off between achieving a low training error (correctly classifying training
examples) and a large margin (simplicity of the decision boundary).
• Effect:
• Small C: Leads to a wider margin but may allow more misclassifications (underfitting).
This promotes a smoother decision boundary.
• Large C: Leads to a narrower margin but tries to classify all training examples correctly
(potential overfitting). This results in a more complex decision boundary.
2. Kernel:
• Role: Specifies the type of kernel function to be used to transform the input data into a higher-
dimensional space where it might become linearly separable.
• Common Kernels:
• linear: For linearly separable data. Fastest for large datasets.
• rbf (Radial Basis Function / Gaussian): Most common choice for non-linear data. It maps
samples into a higher-dimensional space.
• poly (Polynomial): For non-linear data, specified by degree.
• sigmoid: Another option for non-linear data.
These hyperparameters are usually tuned automatically with search strategies:
1. Grid Search (GridSearchCV in scikit-learn):
• How it works: You define a grid of hyperparameter values to explore. Grid Search
exhaustively tries every possible combination of these values.
• Pros: Guarantees finding the best combination within the defined search space.
• Cons: Computationally expensive, especially with many hyperparameters or large ranges,
as the number of combinations grows exponentially.
2. Randomized Search (RandomizedSearchCV in scikit-learn):
• How it works: You define a distribution or range for each hyperparameter, and
Randomized Search samples a fixed number of random combinations from these
distributions.
• Pros: More computationally efficient than Grid Search, especially for large search spaces,
as it can often find good combinations without testing all of them.
• Cons: Does not guarantee finding the absolute best combination.
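A minimal Grid Search sketch for the SVM hyperparameters above (the grid values and dataset are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {'C': [0.1, 1, 10],
              'kernel': ['linear', 'rbf'],
              'gamma': ['scale', 0.1, 1]}

search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)

Swapping GridSearchCV for RandomizedSearchCV (passing the grid as param_distributions and setting n_iter) gives the randomized variant.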
A typical biological neuron has four parts: dendrites, soma, axon and synapse. The body of the neuron is called the soma. Dendrites accept the input information and pass it to the cell body (soma), where it is processed. A single neuron is connected by axons to around 10,000 neurons, and through these axons the processed information is passed from one neuron to another. A neuron fires if the input information crosses a threshold value, and it transmits signals to other neurons through synapses. A synapse fires with electrical impulses called spikes, which are transmitted to the next neuron. A single neuron can receive synaptic inputs from one or multiple neurons. These neurons form a network structure which processes input information and gives out a response. The simple structure of a biological neuron is shown in Figure.
The given figure illustrates the typical diagram of a Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell nucleus
represents Nodes, synapse represents Weights, and Axon represents Output.
Relationship between Biological neural network and artificial neural network:
Dendrites → Inputs
Cell nucleus → Nodes
Synapse → Weights
Axon → Output
An Artificial Neural Network (ANN) is a model in the field of Artificial Intelligence that attempts to mimic the network of neurons that makes up the human brain, so that computers can understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.
There are around 100 billion neurons in the human brain. Each neuron has somewhere in the range of 1,000 to 100,000 connection points. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data, when necessary, from our memory in parallel. We can say that the human brain is made up of incredibly powerful parallel processors.
1. Input Layer: This is where the network receives its input data. Each input neuron in the layer
corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy lifting. A neural network
can have one or multiple hidden layers. Each layer consists of units (neurons) that transform the
inputs into something that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The format of these outputs varies
depending on the specific task like classification, regression.
In summary:
Data enters the input layer, where it's represented by individual features. This information then flows
through one or more hidden layers, where weighted sums and activation functions transform the data,
allowing the network to learn intricate patterns. Finally, the processed information reaches the output layer,
which provides the network's prediction or classification. The learning process involves adjusting the
weights and biases throughout the network to minimize the difference between the network's output and
the desired output, typically using algorithms like backpropagation.
Backpropagation Algorithm (Simple Definition):
Backpropagation is a training algorithm for neural networks that adjusts the weights by comparing the
predicted output with the actual output and minimizing the error. It works by moving the error backward
from the output layer to the input layer.
Key Steps (Simplified):
1. Forward Pass: Input passes through the network to get the output.
2. Error Calculation: Compare predicted output with actual output.
3. Backward Pass: Propagate the error back through the network.
4. Weight Update: Adjust weights to reduce the error (using gradient descent).
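A compact NumPy sketch of these four steps on the XOR problem (the network size, learning rate and iteration count are illustrative assumptions):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

rng = np.random.default_rng(42)
W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for _ in range(10000):
    # 1. Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. Error calculation
    err = out - y
    # 3. Backward pass (chain rule through the sigmoid derivatives)
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # 4. Weight update (gradient descent)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # predictions should approach [0, 1, 1, 0]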
Activation Functions
1. Sigmoid: Maps any input to a value between 0 and 1; commonly used in output layers for binary classification.
2. ReLU (Rectified Linear Unit): A popular choice for hidden layers, it returns the input if positive and zero otherwise. It helps to solve the vanishing gradient problem.
3. Tanh (Hyperbolic Tangent): Similar to sigmoid but outputs values between -1 and 1. It is used in
hidden layers when a broader range of outputs is needed.
4. Softmax: Converts raw outputs into probabilities used in the final layer of a network for multi-class
classification tasks.
5. Leaky ReLU: A variant of ReLU that allows small negative values for negative inputs, helping to prevent "dead neurons" during training.
These functions help the network decide whether to activate a neuron, helping it to recognize patterns and make predictions.
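The same functions written out with NumPy (a small illustrative sketch):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))           # squashes any input to (0, 1)

def relu(z):
    return np.maximum(0, z)               # zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope for negative inputs

def softmax(z):
    e = np.exp(z - np.max(z))             # subtract max for numerical stability
    return e / e.sum()                    # outputs sum to 1 (probabilities)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), np.tanh(z), relu(z), softmax(z))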
3. Healthcare: ANNs are used in medical imaging for detecting diseases like cancer and they assist
in diagnosing conditions with accuracy similar to doctors. Additionally, they predict health risks
and recommend personalized treatment plans.
4. Personal Assistants: Virtual assistants like Siri and Alexa use ANNs to process natural language, understand voice commands and respond accordingly. They help manage tasks like setting reminders, making calls and answering queries.
5. Customer Support: ANNs power chatbots and automated customer service systems that analyze customer queries and provide accurate responses, improving efficiency in handling customer inquiries.
6. Finance: In the financial industry, they are used for fraud detection, credit scoring and predicting
market trends by analyzing large sets of transaction data and spotting anomalies.
Install Tensorflow
Tensorflow is a library/platform created and open-sourced by Google. It is the most widely used library for deep learning applications. Creating a neural network might not be the primary function of the TensorFlow library, but it is used quite frequently for this purpose. So before going ahead, let's install and import the TensorFlow module.
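Installation is typically done with pip:

pip install tensorflow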
Building and training neural networks using TensorFlow and Keras involves a structured process, typically
following these steps:
• Import Libraries and Prepare Data:
• Import tensorflow and keras along with any necessary modules for data handling and
preprocessing (e.g., numpy, sklearn).
• Load and prepare your dataset. This often involves splitting the data into training and testing
sets, and potentially normalizing or scaling feature values.
• Define the Model Architecture:
• Use Keras's Sequential API for simple, layer-by-layer models, or the Functional API for
more complex architectures with multiple inputs/outputs or shared layers.
• Add layers to your model, such as Dense (fully connected), Conv2D (convolutional for
images), LSTM (for sequences), etc.
• Specify the number of neurons, activation functions (e.g., 'relu', 'sigmoid', 'softmax'), and
input shape for the first layer.
• Compile the Model:
• Call the compile() method on your model.
• Specify an optimizer to control how the model updates its weights during training.
• Define a loss function appropriate for your task (e.g., 'sparse_categorical_crossentropy' for
multi-class classification, 'mse' for regression).
• Optionally, specify metrics to monitor during training and evaluation (e.g., 'accuracy').
• Train the Model:
• Use the fit() method to train your model on the training data.
• Provide the training features (x_train) and labels (y_train).
• Specify the epochs (number of times the model iterates over the entire dataset)
and batch_size (number of samples per gradient update).
import tensorflow as tf
from tensorflow import keras
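A minimal end-to-end sketch of the steps above (the synthetic data, layer sizes, optimizer and epoch count are illustrative assumptions, not from the notes):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Synthetic data: 100 samples, 4 features, 3 classes (illustrative only).
x_train = np.random.rand(100, 4).astype("float32")
y_train = np.random.randint(0, 3, size=(100,))

# Define the architecture with the Sequential API.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(3, activation='softmax'),
])

# Compile: optimizer, loss function and metrics.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train.
model.fit(x_train, y_train, epochs=10, batch_size=16)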
RECURRENT NEURAL NETWORKS (RNN)
1. Recurrent Neuron
A recurrent neuron is the basic unit of an RNN: unlike a standard neuron, it feeds its output back into itself at the next time step, giving the network a memory of previous inputs.
2. RNN Unfolding
RNN unfolding or unrolling is the process of expanding the recurrent structure over time steps. During
unfolding each step of the sequence is represented as a separate layer in a series illustrating how information
flows across each time step.
This unrolling enables backpropagation through time (BPTT) a learning process where errors are
propagated across time steps to adjust the network’s weights enhancing the RNN’s ability to learn
dependencies within sequential data.
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is useful
when the overall context of the input sequence is needed to make one prediction. In sentiment analysis the
model receives a sequence of words (like a sentence) and produces a single output like positive, negative
or neutral.
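A minimal many-to-one RNN sketch in Keras (the vocabulary size, sequence length and random data are illustrative assumptions):

import numpy as np
from tensorflow import keras

x = np.random.randint(0, 1000, size=(64, 20))    # 64 sequences of 20 token ids
y = np.random.randint(0, 2, size=(64,))          # one label per sequence

model = keras.Sequential([
    keras.layers.Embedding(input_dim=1000, output_dim=32),
    keras.layers.SimpleRNN(16),                  # returns only the final state
    keras.layers.Dense(1, activation='sigmoid'), # single output (e.g., sentiment)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x, y, epochs=3, batch_size=16)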
While RNNs excel at handling sequential data, they face two main training challenges: the vanishing gradient and the exploding gradient problem.
1. Vanishing Gradient: During backpropagation gradients diminish as they pass through each time
step leading to minimal weight updates. This limits the RNN’s ability to learn long-term
dependencies which is crucial for tasks like language translation.
2. Exploding Gradient: Sometimes gradients grow uncontrollably, causing excessively large weight updates that destabilize training.
These challenges can hinder the performance of standard RNNs on complex, long-sequence tasks.
Applications of Recurrent Neural Networks
RNNs are used in various applications where data is sequential or time-based:
• Time-Series Prediction: RNNs excel in forecasting tasks, such as stock market predictions and
weather forecasting.
• Natural Language Processing (NLP): RNNs are fundamental in NLP tasks like language
modeling, sentiment analysis and machine translation.
• Speech Recognition: RNNs capture temporal patterns in speech data, aiding in speech-to-text and
other audio-related applications.
• Image and Video Processing: When combined with convolutional layers, RNNs help analyze video
sequences, facial expressions and gesture recognition.