
MACHINE LEARNING & DATA ANALYTICS USING PYTHON
[MMC201]
(2024-26)

Dr. Srinivasa Rao K


Prof. & Head, Department of MCA
Sri Venkateshwara College of Engineering
Bengaluru.

Machine learning and Data analytics using Python


Course Code                       | MMC201 | CIE Marks   | 50
Teaching Hours/Week (L:P:SDA/T/T) | 2:2:0  | SEE Marks   | 50
Total Hours of Pedagogy           | 50     | Total Marks | 100
Credits                           | 04     | Exam Hours  | 03
Course Learning Objectives:
1. Understand foundational concepts in machine learning and data analytics.
2. Gain proficiency in Python for data analysis and machine learning tasks.
3. Learn and apply various machine learning algorithms and techniques.
4. Develop skills in data preprocessing, visualization, and model evaluation.
5. Prepare students for industry roles involving data-driven decision making and
predictive modeling.
Module-4 08 Hours
Advanced Machine Learning Techniques:
Ensemble Methods: Bagging and Boosting, Gradient Boosting Machines (GBM),
Extreme Gradient Boosting (XGBoost).
Support Vector Machines (SVM): Linear and non-linear SVM, Kernel trick,
Model evaluation and tuning.
Neural Networks and Deep Learning: Introduction to neural networks, Building
and training neural networks using TensorFlow and Keras, Convolutional Neural
Networks (CNN) and Recurrent Neural Networks (RNN).
Teaching Learning Process:
Practical sessions on advanced machine learning techniques, Interactive coding exercises
to implement neural networks, Group projects on applying advanced techniques to
complex data problems, Continuous assessment through quizzes and practical tests.

Experiments
1. Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples. Read the training data from a .CSV file.
2. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
4. Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
5. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris data set. Print both correct and wrong predictions.
6. Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same using appropriate data sets.
7. Write a program to demonstrate regression analysis with residual plots on a given data set.
8. Write a program to compute summary statistics such as mean, median, mode, standard deviation and variance for the given different types of data.
9. Write a program to implement the k-Means clustering algorithm to cluster the set of data stored in a .CSV file.


MODULE 4
ADVANCED MACHINE LEARNING TECHNIQUES

INTRODUCTION
As data becomes increasingly complex and high-dimensional, traditional machine learning algorithms often
struggle with performance and accuracy. Advanced Machine Learning Techniques are designed to handle
such challenges by introducing sophisticated modeling strategies, optimization methods, and the ability to
learn from limited or unstructured data.

Why Advanced ML?


• To capture complex patterns in data
• To enhance prediction accuracy
• To handle high-dimensional and unstructured data
• To enable scalability and automation in real-world systems

Categories of Advanced ML Techniques


1. Ensemble Methods
Combine multiple models to improve performance.
• Bagging & Boosting: Techniques to reduce variance and bias respectively.
• Gradient Boosting Machines (GBM): Builds models sequentially to fix previous errors.
• Extreme Gradient Boosting (XGBoost)
2. Support Vector Machines (SVM)
• Linear and non-linear SVM
• Kernel trick
3. Deep Learning & Neural networks
Neural networks with multiple layers (ideal for big data and unstructured data like images and text).
• Convolutional Neural Networks (CNNs) – for image processing
• Recurrent Neural Networks (RNNs) – for sequential data
4. Reinforcement Learning
5. Dimensionality Reduction
Used when datasets have too many features.
• Principal Component Analysis (PCA)
• t-SNE & UMAP (for visualization)
• Autoencoders (neural network-based)


ENSEMBLE METHODS

Ensemble learning is a method where we use many small models instead of just one. Each of these models
may not be very strong on its own, but when we put their results together, we get a better and more accurate
answer. It's like asking a group of people for advice instead of just one person—each one might be a little
wrong, but together, they usually give a better answer.

How Ensemble Learning Works


The principle behind ensemble learning is to leverage the "wisdom of the crowd." By combining the
predictions of several models, the ensemble can reduce errors due to bias, variance, or noise in the data.
Each base model might have its own strengths and weaknesses, but when their predictions are aggregated,
their individual errors tend to cancel each other out, leading to a more generalized and reliable overall
prediction.

Types of Ensemble Learning in Machine Learning


There are three main types of ensemble methods:

• Bagging (Bootstrap Aggregating):


Models are trained independently on different random subsets of the training data. Their results are
then combined—usually by averaging (for regression) or voting (for classification). This helps
reduce variance and prevents overfitting.
• Boosting:
Models are trained one after another. Each new model focuses on fixing the errors made by the
previous ones. The final prediction is a weighted combination of all models, which helps reduce
bias and improve accuracy.

• Stacking (Stacked Generalization):


Multiple different models (often of different types) are trained, and their predictions are used as
inputs to a final model, called a meta-model. The meta-model learns how to best combine the
predictions of the base models, aiming for better performance than any individual model.
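
As a concrete illustration, here is a minimal stacking sketch using scikit-learn's StackingClassifier (the dataset, base models and settings are illustrative choices, assuming scikit-learn 0.22 or later):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models of different types; a logistic regression meta-model combines them.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))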


Bagging and Boosting


Bagging and Boosting are two of the most popular and effective ensemble learning techniques in machine
learning. While both aim to improve model performance by combining multiple "weak" or "base" learners,
they achieve this in fundamentally different ways.

Bagging (Bootstrap Aggregating)


Concept: Bagging, short for Bootstrap Aggregating, works by training multiple instances of the same base
learning algorithm independently on different, randomly sampled subsets of the training data. These subsets
are created using bootstrapping, which involves sampling with replacement.

Bagging Algorithm:
Step 1: Bootstrap Sampling - Create ‘N’ bootstrap samples of the original training data by randomly sampling it with replacement.
Step 2: Base Model Training - For each bootstrapped sample we train a base model independently
on that subset of data.
Step 3: Prediction Aggregation - To make a prediction on testing data combine the predictions of all base
models. For classification tasks it can include majority voting and for regression it involves
averaging the predictions.
Step 4: Out-of-Bag (OOB) Evaluation - The “out-of-bag” samples (training examples not included in a given model’s bootstrap sample) can be used to estimate the model’s performance without the need for cross-validation.
Step 5: Final Prediction - After aggregating the predictions from all the base models, Bagging produces a
final prediction for each instance.

Goal: The primary goal of bagging is to reduce variance and prevent overfitting. By averaging or voting
across multiple models trained on slightly different data, the ensemble becomes more stable and less
sensitive to noise or specific patterns in the training data. It's particularly effective with "unstable" models
like deep decision trees that are prone to high variance.


Bagging starts with an original dataset containing multiple data points. The original dataset is randomly sampled with replacement multiple times, which means that in each sample a data point can be selected more than once or not at all. These samples create multiple subsets of the original data.
• For each of the bootstrapped subsets, a separate classifier (e.g., decision tree, logistic regression) is
trained.
• The predictions from all the individual classifiers are combined to form a final prediction. This is
often done through a majority vote (for classification) or averaging (for regression).

Key Characteristics:
• Parallel: Base models are trained independently and in parallel.
• Homogeneous: Typically uses the same type of base learner (e.g., all decision trees).
• Reduces Variance: Focuses on reducing the error component due to variance.
Examples: The most well-known bagging algorithm is Random Forest, which builds multiple decision
trees on bootstrapped samples and also introduces further randomness by considering only a random subset
of features at each split.
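
The following is a minimal sketch of bagging with scikit-learn, shown for illustration (the iris dataset and hyperparameters are arbitrary choices; the `estimator` keyword applies to scikit-learn 1.2+, older versions use `base_estimator`):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: 50 decision trees, each trained on a bootstrap sample.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                        oob_score=True, random_state=42).fit(X_train, y_train)
print("Bagging test accuracy:", bag.score(X_test, y_test))
print("OOB estimate:", bag.oob_score_)   # out-of-bag performance estimate

# Random Forest: bagging plus random feature subsets at each split.
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
print("Random Forest test accuracy:", rf.score(X_test, y_test))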

Advantages of Bagging:
• Reduces Overfitting: Significantly lowers the risk of overfitting by averaging predictions and
reducing variance.
• Improved Robustness: The ensemble is less sensitive to noisy data or outliers because individual
model errors tend to cancel out.
• Easy to Parallelize: Due to the independent training of base models, bagging is computationally
efficient and scalable, especially for large datasets and distributed computing environments.
• Handles High-Variance Models Well: Particularly effective for models like decision trees that
tend to have high variance.

Disadvantages of Bagging:
• Less Effective in Reducing Bias: While it excels at reducing variance, it doesn't directly address
the underlying bias of the base model. If the base models are inherently biased, the ensemble might
still retain that bias.
• Reduced Interpretability: Combining multiple models makes the overall ensemble more complex
and harder to interpret compared to a single model.
• Computational Cost: Training multiple models can be computationally expensive, although
parallelization helps mitigate this.
• Can still overfit: While it generally reduces overfitting, if the base models are too complex or the
dataset is very noisy, there's still a possibility of the ensemble overfitting.


Boosting
Concept: Boosting is a sequential ensemble method where base models are trained iteratively. Each
subsequent model focuses on correcting the errors (misclassifications for classification, residuals for
regression) made by the previous models in the sequence. It "boosts" the performance of weak learners by
combining them.

Boosting Algorithm:
Step 1: Initialize Model Weights - Begin with a single weak learner and assign equal weights to all training
examples.
Step 2: Train Weak Learner - Train weak learners on this dataset.
Step 3: Sequential Learning - Boosting works by training models sequentially where each model focuses
on correcting the errors of its predecessor.
Step 4: Weight Adjustment - Boosting assigns weights to training datapoints. Misclassified examples
receive higher weights in the next iteration so that next models pay more attention to them.
Goal: The primary goal of boosting is to reduce bias and transform weak learners into a strong learner. By
iteratively focusing on difficult cases, boosting algorithms can achieve very high accuracy.

Boosting starts with training on the original data. After each round, more weight is given to misclassified points so that the next model focuses on them. This process repeats, and in the end all models are combined to make a final, more accurate prediction.

Key Characteristics:
• Sequential: Base models are trained one after another, with dependencies between them.
• Homogeneous (typically): Often uses the same type of base learner, though they can vary.
• Reduces Bias (and often variance): Primarily targets the bias component of the error, but can also
reduce variance.


Examples:
• AdaBoost (Adaptive Boosting): One of the earliest and most influential boosting
algorithms. It adjusts instance weights after each iteration.
• Gradient Boosting Machines (GBM): Builds models to minimize the "residuals" or errors
of the previous model. It uses gradient descent to optimize the loss function.
• XGBoost, LightGBM, CatBoost: Highly optimized and widely used implementations of
gradient boosting that offer significant performance improvements and additional features.
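
For illustration, a minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak learners (synthetic data and hyperparameters are arbitrary; `estimator` is the keyword in scikit-learn 1.2+):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stumps (depth-1 trees) boosted sequentially; misclassified
# samples receive higher weights in each round.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))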

Advantages of Boosting:
• High Accuracy: Often achieves state-of-the-art predictive performance, especially on complex
datasets.
• Reduces Bias: Excellent at reducing bias by iteratively focusing on errors.
• Handles Complex Relationships: Can model complex non-linear relationships in data.
• Built-in Feature Selection: Some boosting algorithms implicitly perform feature selection by
giving more importance to relevant features.

Disadvantages of Boosting:
• Prone to Overfitting (if not tuned): While powerful, boosting can be susceptible to overfitting if
the number of iterations is too high, the base learners are too complex, or the learning rate is not
properly controlled.
• Sensitive to Outliers: Because it focuses on misclassified instances, outliers can disproportionately
influence the training process, potentially leading to poor performance.
• Slow to Train: The sequential nature of boosting makes it less parallelizable than bagging, leading
to longer training times for large datasets.
• Complex Hyperparameter Tuning: Boosting algorithms often have many hyperparameters that
need careful tuning to achieve optimal performance and prevent overfitting.
• Reduced Interpretability: Similar to bagging, the ensemble of multiple models can be difficult to
interpret.

Bagging vs. Boosting: A Summary of Key Differences

Feature                  | Bagging                                                 | Boosting
Training                 | Parallel, independent                                   | Sequential, dependent (learns from errors)
Data Sampling            | Bootstrap samples (with replacement)                    | Weights instances based on previous errors
Focus                    | Reduces variance, prevents overfitting                  | Reduces bias, converts weak to strong learners
Base Learners            | Typically homogeneous, "strong" (unstable)              | Typically homogeneous, "weak"
Final Prediction         | Averaging (regression), majority vote (classification)  | Weighted sum of base model predictions
Examples                 | Random Forest                                           | AdaBoost, Gradient Boosting, XGBoost
Computational Efficiency | Highly parallelizable, often faster                     | Sequential, generally slower
Sensitivity to Outliers  | Less sensitive                                          | More sensitive

Gradient Boosting Machines (GBM)


Gradient Boosting is a powerful ensemble learning method for classification and regression that builds a
strong predictive model by sequentially combining multiple weak learners. Unlike traditional boosting
methods, Gradient Boosting trains each new model to minimize the loss function (e.g., mean squared error
or cross-entropy) of the previous model using gradient descent.
In each iteration, the algorithm computes the negative gradient of the loss function with respect to the
current predictions and then trains a new weak model (typically a decision tree) to fit these negative
gradients (also known as "pseudo-residuals"). The predictions of this new model are then added to the
ensemble, and the process continues until a stopping criterion is met, effectively correcting the errors made
by its predecessors

How Gradient Boosting Machines Works


The core idea behind GBMs is to iteratively combine multiple "weak learners" (typically shallow decision
trees, often called "decision stumps" if they have only one split) to create a single, strong predictive model.
What makes it "gradient" boosting is its approach to correcting errors: it uses a gradient descent
optimization strategy to minimize a predefined loss function.

1. Sequential Learning Process


The ensemble consists of multiple trees each trained to correct the errors of the previous one. In the first
iteration Tree 1 is trained on the original data x and the true labels y. It makes predictions which are used
to compute the errors.
2. Residuals Calculation
In the second iteration Tree 2 is trained using the feature matrix x and the errors from Tree 1 as labels.
This means Tree 2 is trained to predict the errors of Tree 1. This process continues for all the trees in the
ensemble. Each subsequent tree is trained to predict the errors of the previous tree.


Gradient Boosted Trees


3. Shrinkage
After each tree is trained, its predictions are shrunk by multiplying them with the learning rate η (eta), which ranges from 0 to 1. This prevents overfitting by ensuring each tree has a smaller impact on the final model.
Once all trees are trained predictions are made by summing the contributions of all the trees. The final
prediction is given by the formula:

$Y_{pred} = y_1 + \eta \cdot r_1 + \eta \cdot r_2 + \cdots + \eta \cdot r_N$

where $r_1, r_2, \dots, r_N$ are the errors predicted by each tree.
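
A minimal sketch of this training loop using scikit-learn's GradientBoostingRegressor, where learning_rate plays the role of the shrinkage factor η (synthetic data and settings are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate is the shrinkage factor eta; each of the n_estimators trees
# fits the residual errors of the ensemble built so far.
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gbm.fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, gbm.predict(X_test)))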

Why "Gradient" Boosting?


The "gradient" part comes from the fact that each new weak learner is trained to predict the gradient of the
loss function. This allows GBMs to be very flexible, as they can be used with any differentiable loss
function (e.g., Mean Squared Error for regression, Log Loss for classification). By moving in the direction
of the negative gradient, the algorithm effectively "descends" the error surface to find the optimal set of
predictions.

Advantages of GBMs
• High Predictive Accuracy: Often delivers state-of-the-art performance on tabular data, frequently
outperforming other algorithms.
• Handles Complex Relationships: Can capture complex non-linear relationships and interactions
between features.
• Flexibility: Can be used for both regression and classification tasks, and is compatible with various
loss functions.
• Robust to Feature Scaling: Tree-based models are generally not sensitive to the scaling of features.


• Implicit Feature Importance: Provides a measure of feature importance, indicating which features
contribute most to the model's predictions.

Disadvantages of GBMs
• Prone to Overfitting: If not carefully tuned (especially n_estimators, learning_rate, max_depth),
GBMs can easily overfit the training data.
• Computationally Intensive: Due to their sequential nature, training can be time-consuming,
especially with a large number of trees or deep trees, and is less parallelizable than bagging.
• Sensitivity to Outliers: Since each tree tries to correct the errors of previous trees, outliers can
significantly influence the model and potentially lead to poor performance if not handled.
• Less Interpretability: Like other ensemble methods, understanding the exact decision-making
process of a complex GBM can be challenging compared to a single decision tree.
• Hyperparameter Tuning: Requires careful tuning of multiple hyperparameters to achieve optimal
performance and prevent overfitting.

Applications
GBMs are widely applied in various fields due to their high performance:
• Fraud Detection: Identifying fraudulent transactions in financial data.
• Credit Scoring: Assessing creditworthiness of loan applicants.
• Customer Churn Prediction: Predicting which customers are likely to leave a service.
• Recommendation Systems: Predicting user preferences for products or content.
• Ranking (e.g., Search Engines): Ranking relevant search results.
• Disease Diagnosis and Prognosis: In healthcare, predicting disease presence or progression.
• Demand Forecasting: Predicting future sales or resource needs.

Extreme Gradient Boosting (XGBoost)


Traditional machine learning models like decision trees and random forests are easy to interpret but often struggle with accuracy on complex datasets. XGBoost, short for eXtreme Gradient Boosting, is an advanced machine learning algorithm designed for efficiency, speed and high performance.


It is an optimized implementation of Gradient Boosting and is a type of ensemble learning method that
combines multiple weak models to form a stronger model.
• XGBoost uses decision trees as its base learners and combines them sequentially to improve the
model’s performance. Each new tree is trained to correct the errors made by the previous tree and
this process is called boosting.
• It has built-in parallel processing to train models on large datasets quickly. XGBoost also supports
customizations allowing users to adjust model parameters to optimize performance based on the
specific problem.

How XGBoost Works?


It builds decision trees sequentially with each tree attempting to correct the mistakes made by the previous
one. The process can be broken down as follows:
1. Start with a base learner: The first model, a decision tree, is trained on the data. In regression tasks this base model simply predicts the average of the target variable.
2. Calculate the errors: After training the first tree the errors between the predicted and actual values
are calculated.
3. Train the next tree: The next tree is trained on the errors of the previous tree. This step attempts to
correct the errors made by the first tree.
4. Repeat the process: This process continues with each new tree trying to correct the errors of the
previous trees until a stopping criterion is met.
5. Combine the predictions: The final prediction is the sum of the predictions from all the trees.

Mathematics Behind XGBoost Algorithm


It can be viewed as an iterative process where we start with an initial prediction, often set to zero, after which each tree is added to reduce the errors. Mathematically the model can be represented as:
$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$

where:
• $\hat{y}_i$ is the final predicted value for the i-th data point,
• $K$ is the number of trees in the ensemble,
• $f_k(x_i)$ represents the prediction of the k-th tree for the i-th data point.
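
A minimal sketch using the xgboost package (assumed installed via pip install xgboost; the dataset and hyperparameters are illustrative):

from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = K trees; each f_k is fit to reduce the remaining error.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                    eval_metric="logloss", random_state=0)
xgb.fit(X_train, y_train)
print("XGBoost accuracy:", xgb.score(X_test, y_test))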

Advantages of XGBoost
• Scalable and efficient for large datasets with millions of records
• Supports parallel processing and GPU acceleration for faster training
• Offers customizable parameters and regularization for fine-tuning


• Includes feature importance analysis for better insights and selection


• Trusted by data scientists across multiple programming languages

Disadvantages of XGBoost
• XGBoost can be computationally intensive, making it less ideal for resource-constrained systems.
• It may be sensitive to noise or outliers, requiring careful data preprocessing.
• Prone to overfitting, especially on small datasets or with too many trees.
• Offers feature importance, but overall model interpretability is limited compared to simpler methods, which is an issue in fields like healthcare or finance.

SUPPORT VECTOR MACHINES (SVM)

Introduction
Support Vector Machines (SVM) is a powerful supervised machine learning algorithm used primarily for
classification, but also for regression tasks. It is particularly effective when there is a clear margin of
separation between classes.
The main goal of SVM is to maximize the margin between the two classes. The larger the margin, the better the model performs on new and unseen data.

Types of Support Vector Machine


Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided into two
main parts:


• Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of different
classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This
means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide
the data points into their respective classes. A hyperplane that maximizes the margin between the
classes is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be separated into
two classes by a straight line (in the case of 2D). By using kernel functions, nonlinear SVMs can
handle nonlinearly separable data. The original input data is transformed by these kernel functions
into a higher-dimensional feature space where the data points can be linearly separated. A linear
SVM is used to locate a nonlinear decision boundary in this modified space.

Linear SVM
The Core Idea: Finding the Optimal Separating Hyperplane
Imagine you have a dataset with two different classes of data points (e.g., red dots and blue dots). The
fundamental goal of an SVM is to find a "decision boundary" that best separates these classes. In a 2D
space, this boundary is a line. In 3D, it's a plane, and in higher dimensions, it's called a hyperplane


The Margin and Support Vectors


• Margin: The margin is the distance between the hyperplane and the closest data points from each class; SVM chooses the hyperplane that maximizes this distance. A larger margin generally leads to better generalization on unseen data, as it creates a more robust separation.
• Support Vectors: These are the data points that lie closest to the decision boundary (the
hyperplane). They are the critical elements of the training set because they directly influence the
position and orientation of the hyperplane. If you were to remove a support vector, the decision
boundary would likely change. In essence, they "support" the separating hyperplane.

Mathematical Computation of SVM


Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training dataset
consisting of input feature vectors X and their corresponding class labels Y. The equation for the linear
hyperplane can be written as:

$w^T x + b = 0$
Where:
• w is the normal vector to the hyperplane (the direction perpendicular to it).
• b is the offset or bias term representing the distance of the hyperplane from the origin along the
normal vector w.

Linear SVM Classifier


Distance from a Data Point to the Hyperplane:
The perpendicular distance from a point $x$ to the hyperplane is $d = \frac{|w^T x + b|}{\|w\|}$.

The prediction rule for the classifier is:

$\hat{y} = \begin{cases} 1, & \text{if } w^T x + b \geq 0 \\ 0, & \text{if } w^T x + b < 0 \end{cases}$

where $\hat{y}$ is the predicted label of the data point.
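
A minimal linear SVM sketch with scikit-learn, printing the learned w and b of the hyperplane w^T x + b = 0 (two iris classes are kept so the problem is binary; the choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]  # keep two classes for a binary example

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

print("w:", clf.coef_[0])        # normal vector of the hyperplane
print("b:", clf.intercept_[0])   # bias term
print("Accuracy:", clf.score(X_test, y_test))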


Nonlinear SVM
Why Non-Linear SVM is Required
In many situations data cannot be separated with a straight line. For example, one group of points might
surround another group in a circle. A simple linear SVM won't work well here because it only draws straight lines. A non-linear SVM is needed because it can draw curved boundaries to separate such
data properly. This helps the model make better predictions when the data has complex shapes or patterns.
SVM uses a technique called the kernel trick.
Non-Linear SVM uses kernels to work in higher dimensions where data can become linearly separable.

What to do if data are not linearly separable?


When data is not linearly separable i.e., it can't be divided by a straight line, SVM uses a technique
called kernels to map the data into a higher-dimensional space where it becomes separable. This
transformation helps SVM find a decision boundary even for non-linear data.

Kernel trick
In Support Vector Machines (SVMs), a kernel is a crucial mathematical function that plays a pivotal role
in handling complex, non-linear data.
Here's a breakdown of what kernels are and their importance in SVMs:
The Core Problem: Non-Linear Separability
SVMs primarily aim to find an optimal hyperplane that separates different classes of data points with the
largest possible margin. While a linear SVM works well when data is linearly separable (meaning you can
draw a straight line or plane to divide the classes), real-world data is often much more complex and not
linearly separable in its original form.

The Solution: The Kernel Trick


This is where kernels come into play. The "kernel trick" allows SVMs to deal with non-linearly separable
data without explicitly transforming the data into a higher-dimensional space. Instead, the kernel function
implicitly calculates the similarity (dot product) between data points as if they were already in that higher-
dimensional space.


How it Works (The "Trick"):


Imagine you have data points that are intertwined in 2D, making it impossible to draw a straight line to
separate them. A kernel function can implicitly map these 2D points to, say, a 3D space, where they might
become linearly separable. In this higher dimension, a hyperplane can then be found to separate the classes.
The "trick" is that the kernel function avoids the computationally expensive step of actually calculating
the coordinates of the data points in this higher-dimensional space. It just calculates the dot product
(similarity) in that space.

Why are Kernels Important?


1. Handling Non-Linearity: This is their primary function. Kernels enable SVMs to learn complex,
non-linear decision boundaries, making them applicable to a wide range of real-world problems.
2. Computational Efficiency: By using the "kernel trick," SVMs can work with high (even infinite)
dimensional feature spaces without incurring the computational cost of explicitly transforming the
data.
3. Flexibility: Different types of kernel functions are suitable for different data structures and problem
types, providing great flexibility to the SVM algorithm.
4. Implicit Feature Engineering: Kernels can be seen as implicitly creating new, more complex
features from the original ones, which helps in better separation.

Common Types of Kernel Functions:

Linear Kernel
• A linear kernel is the simplest form of kernel used in SVM. It is suitable when the data is linearly
separable meaning that a straight line (or hyperplane in higher dimensions) can effectively separate
the classes.

• It is represented as: 𝐾(𝑥, 𝑦) = 𝑥 ∙ 𝑦 (Dot Product)

• It is used for text classification problems such as spam detection.

Polynomial Kernel:

• Formula: 𝐾(𝑥𝑖 , 𝑥𝑗 ) = (γ𝑥𝑖 ∙ 𝑥𝑗 + 𝑟)𝑑

• Parameters: d (degree of the polynomial), γ (scale factor), r (constant term).


• Use Case: Effective for non-linear data where a curved or polynomial decision boundary might be
appropriate (e.g., image processing).


Radial Basis Function (RBF) Kernel (also known as Gaussian Kernel):


• Formula: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

• Parameter: γ (gamma) controls the influence of each training example, defining the "width" of the
kernel.
• Use Case: The most popular and versatile kernel. It can handle very complex and non-linear
relationships, especially when there's no prior knowledge about the data's distribution. It effectively
maps data into an infinite-dimensional space.

Sigmoid Kernel:

• Formula: 𝐾(𝑥𝑖 , 𝑥𝑗 ) = 𝑡𝑎𝑛ℎ(γ𝑥𝑖 ∙ 𝑥𝑗 + 𝑟)

• Parameters: γ and r.
• Use Case: Inspired by neural networks, it can be used for non-linear classification and sometimes
acts as a proxy for neural networks.

Choosing the Right Kernel:


The choice of kernel is crucial and depends on several factors:
• Nature of the Data: Is it linearly separable or highly non-linear?
• Computational Complexity: More complex kernels (like RBF or polynomial with high degree)
can be computationally more intensive.
• Model Performance: The best way is to try different kernels and use cross-validation to see which
works best for your problem.
• Domain Knowledge: Sometimes, understanding the underlying data and problem can guide the
kernel selection.
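
The cross-validation comparison suggested above can be sketched as follows (make_moons is an illustrative non-linear dataset; the noise level and fold count are arbitrary):

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # non-linear data

# Try each kernel with 5-fold cross-validation; scaling first, since SVMs
# are sensitive to feature scale.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:8s} mean CV accuracy: {scores.mean():.3f}")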

Figure: Mapping 1D data to 2D so that the two classes become linearly separable.


Linear SVM vs Non-Linear SVM

Feature            | Linear SVM                                  | Non-Linear SVM
Decision Boundary  | Straight line or hyperplane                 | Curved or complex boundaries
Data Separation    | Works well when data is linearly separable  | Suitable for non-linearly separable data
Kernel Usage       | No kernel or uses a linear kernel           | Uses non-linear kernels (e.g., RBF, polynomial)
Computational Cost | Generally faster and less complex           | More computationally intensive
Example Use Case   | Spam detection with simple features         | Image classification or handwriting recognition

Applications
1. Image Classification: They are widely used for image recognition tasks such as handwritten digit
recognition like MNIST dataset, where the data classes are not linearly separable.
2. Bioinformatics: Used in gene expression analysis and protein classification where the relationships
between variables are complex and non-linear.
3. Natural Language Processing (NLP): Used for text classification tasks like spam filtering or
sentiment analysis where non-linear relationships exist between words and sentiments.
4. Medical Diagnosis: Effective for classifying diseases based on patient data such as tumour
classification where data have non-linear patterns.
5. Fraud Detection: They can identify fraudulent activities by detecting unusual patterns in
transactional data.
6. Voice and Speech Recognition: Useful for separating different voice signals or identifying speech
patterns where non-linear decision boundaries are needed.

Advantages of Support Vector Machine (SVM)


1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making it suitable for
image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial SVM effectively
handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers, enhancing robustness
in spam detection and anomaly detection.


4. Binary and Multiclass Support: SVM is effective for both binary classification and multiclass
classification suitable for applications in text classification.
5. Memory Efficiency: It focuses on support vectors making it memory efficient compared to other
algorithms.

Disadvantages of Support Vector Machine (SVM)


1. Slow Training: SVM can be slow for large datasets, affecting performance in data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters like C requires
careful tuning, impacting SVM algorithms.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes, limiting
effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions makes SVM less
interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential, otherwise SVM models may
perform poorly.

MODEL EVALUATION AND TUNING


Evaluating an SVM model involves assessing how well it generalizes to unseen data. This is typically done
after training the model on a portion of your data and testing it on a separate, unseen test set.

Evaluation Metrics of Regression Model


Regression metrics are quantitative measures used to evaluate the quality of a regression model. Scikit-learn provides several metrics, each with its own strengths and limitations, to assess how well a model fits the data.
Types of Regression Metrics
Some common regression metrics in scikit-learn with examples
• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)

Mean Absolute Error (MAE)


In the fields of statistics and machine learning, the Mean Absolute Error (MAE) is the average absolute difference between predicted and actual target values. By calculating MAE we can get an idea of how wrong the model's predictions were.


Mathematical Formula
The formula to calculate MAE for a dataset with n data points is:

$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Where:

• $y_i$ represents the actual target value.
• $\hat{y}_i$ represents the predicted target value.

As an illustration, consider a graph of employee salary versus experience in years, with the actual values on a line and the predicted values marked with an X; the average of the absolute distances between them is the mean absolute error.

Mean Squared Error (MSE)


The Mean Squared Error (MSE) is similar to the Mean Absolute Error: both summarize the average magnitude of the differences between predicted and actual values. MSE, however, squares each difference before averaging, so it penalizes large errors more heavily. It is the mean of the squared residuals; the value is always non-negative, and values closer to 0 indicate a better fit.
Mathematical Formula
The formula to calculate MSE for a dataset with n data points is:

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Root Mean Squared Error (RMSE)


To obtain the RMSE, simply take the square root of the MSE
Mathematical Formula
The formula to calculate RMSE for a dataset with n data points is:

$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
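
A minimal sketch computing these three metrics with scikit-learn and NumPy (the toy values are illustrative):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual targets (toy values)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions (toy values)

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is the square root of MSE
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")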

Model Evaluation Metrics of Classification Models


When evaluating machine learning models, particularly classification models, the following are the most straightforward and commonly used metrics.
• Accuracy
• Precision
• Recall
• F1- Score
• ROC-AUC


Errors
In a test process there are four possible situations, two of which lead to the two types of errors, as presented below.

                 | Accepting the Hypothesis        | Rejecting the Hypothesis
Hypothesis True  | Correct decision                | Wrong decision (Type I Error)
Hypothesis False | Wrong decision (Type II Error)  | Correct decision

In order to minimize both these types of errors we need to increase the sample size.

Confusion Matrix:
A table that summarizes the performance of a classification model. It shows the counts of:
• True Positives (TP): Correctly predicted positive instances.
• True Negatives (TN): Correctly predicted negative instances.
• False Positives (FP): Incorrectly predicted positive instances (Type I error).
• False Negatives (FN): Incorrectly predicted negative instances (Type II error).

Accuracy
Accuracy is the proportion of correctly classified instances (predictions) out of the total number of instances
in the dataset.
Mathematically, it is defined as:

$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$

Alternatively, using terms from a confusion matrix:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Where:
• TP - True Positives: The model correctly predicted the positive class.
• TN - True Negatives: The model correctly predicted the negative class.
• FP - False Positives: The model incorrectly predicted the positive class (Type I error).
• FN - False Negatives: The model incorrectly predicted the negative class (Type II error).

Precision
Precision measures the proportion of true positive predictions among all instances that the model predicted
as positive. In simpler terms, it tells you "of all the times the model said 'yes,' how many of those times was
it actually correct?"
Mathematically, it is defined as:


$\text{Precision} = \frac{TP}{TP + FP}$
Recall
Recall measures the proportion of actual positive instances that were correctly identified by the model. In
simpler terms, it tells you "of all the times it should have said 'yes,' how many times did it actually say
'yes'?"
Mathematically, it is defined as:
$\text{Recall} = \frac{TP}{TP + FN}$
F1-Score
The F1-score is the harmonic mean of precision and recall. It's designed to give a more balanced view of
performance than either precision or recall alone.
The formula for the F1-score is:
$F1\text{-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

where $\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$.
The F1-score ranges from 0 to 1, where 1 represents perfect precision and recall, and 0 represents the worst
possible performance.

ROC-AUC (Receiver Operating Characteristic - Area Under the Curve)


The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its
discrimination threshold is varied. It plots two parameters:
• True Positive Rate (TPR): Also known as Recall or Sensitivity, it's the proportion of actual
positive instances that are correctly identified.
$TPR = \frac{TP}{TP + FN}$
• False Positive Rate (FPR): It's the proportion of actual negative instances that are incorrectly
identified as positive.
$FPR = \frac{FP}{FP + TN}$
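
A minimal sketch computing these classification metrics with scikit-learn (the toy labels and scores are illustrative):

from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                    # actual classes
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # predicted classes
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]    # predicted probabilities

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))  # [[TN FP],[FN TP]]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # uses scores, not labels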

Model Tuning of SVM (Hyperparameter Tuning)


SVMs have several hyperparameters that significantly influence their performance. Tuning these
hyperparameters involves finding the optimal combination that yields the best model performance on
unseen data.


1. C (Regularization Parameter):
• Role: Controls the trade-off between achieving a low training error (correctly classifying training
examples) and a large margin (simplicity of the decision boundary).
• Effect:
• Small C: Leads to a wider margin but may allow more misclassifications (underfitting).
This promotes a smoother decision boundary.
• Large C: Leads to a narrower margin but tries to classify all training examples correctly
(potential overfitting). This results in a more complex decision boundary.
2. Kernel:
• Role: Specifies the type of kernel function to be used to transform the input data into a higher-
dimensional space where it might become linearly separable.
• Common Kernels:
• linear: For linearly separable data. Fastest for large datasets.
• rbf (Radial Basis Function / Gaussian): Most common choice for non-linear data. It maps
samples into a higher-dimensional space.
• poly (Polynomial): For non-linear data, specified by degree.
• sigmoid: Another option for non-linear data.

3. gamma (Kernel Coefficient for rbf, poly, and sigmoid kernels):


• Role: Defines how far the influence of a single training example reaches. It dictates the "reach" of
the kernel function.
• Effect (for rbf kernel):
• Small gamma: A larger influence, leading to a smoother decision boundary (potentially
underfitting).
• Large gamma: A smaller influence, leading to a more complex, wiggly decision boundary
that can overfit the training data.
4. degree (for poly kernel):
• Role: The degree of the polynomial kernel function.
• Effect: Higher degrees lead to more complex decision boundaries.

General Steps for SVM Model Evaluation and Tuning:


1. Data Preprocessing: Scale your features! SVMs are sensitive to the scale of the data because they
calculate distances between data points. Standardization (mean 0, variance 1) or Min-Max scaling
(0 to 1) are common choices.
2. Split Data: Divide your dataset into training and testing sets (e.g., 70-80% for training, 20-30% for
testing). A validation set can also be created for initial tuning if not using cross-validation.

Techniques for Hyperparameter Tuning:


1. Grid Search (GridSearchCV in scikit-learn):


• How it works: You define a grid of hyperparameter values to explore. Grid Search
exhaustively tries every possible combination of these values.
• Pros: Guarantees finding the best combination within the defined search space.
• Cons: Computationally expensive, especially with many hyperparameters or large ranges,
as the number of combinations grows exponentially.
2. Randomized Search (RandomizedSearchCV in scikit-learn):
• How it works: You define a distribution or range for each hyperparameter, and
Randomized Search samples a fixed number of random combinations from these
distributions.
• Pros: More computationally efficient than Grid Search, especially for large search spaces,
as it can often find good combinations without testing all of them.
• Cons: Does not guarantee finding the absolute best combination.
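
A minimal sketch of SVM tuning with GridSearchCV, including the feature scaling step recommended above (the dataset and grid values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline so that scaling is refit inside each cross-validation fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__C": [0.1, 1, 10, 100],          # regularization strength
    "svc__gamma": ["scale", 0.01, 0.1],   # kernel coefficient
    "svc__kernel": ["rbf", "linear"],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))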

NEURAL NETWORKS AND DEEP LEARNING

Introduction to neural networks


An Artificial Neural Network (ANN), often simply called a neural network, is a computational model
inspired by the structure and functioning of the human brain. It's a core component of machine learning,
particularly in the field of deep learning. ANNs are designed to learn complex patterns and relationships
within data, enabling them to perform tasks like image recognition, natural language processing, financial
forecasting, and more.

A typical biological neuron has four parts: dendrites, soma, axon and synapse. The body of the neuron is called the soma. Dendrites accept input information, which is processed in the cell body (soma). A
single neuron is connected by axons to around 10,000 neurons and through these axons the processed
information is passed from one neuron to another neuron. A neuron gets fired if the input information
crosses a threshold value and transmits signals to another neuron through a synapse. A synapse gets fired
with an electrical impulse called spikes which are transmitted to another neuron. A single neuron can receive
synaptic inputs from one neuron or multiple neurons. These neurons form a network structure which
processes input information and gives out a response. (Figure: the structure of a biological neuron and a biological neural network.)


What is Artificial Neural Network?


The term "Artificial Neural Network" is derived from Biological neural networks that develop the structure
of a human brain. Similar to the human brain that has neurons interconnected to one another, artificial neural
networks also have neurons that are interconnected to one another in various layers of the networks. These
neurons are known as nodes.

(Figure: a typical Artificial Neural Network.)
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell nucleus
represents Nodes, synapse represents Weights, and Axon represents Output.
Relationship between Biological neural network and artificial neural network:

Biological Neural Network | Artificial Neural Network
Dendrites                 | Inputs
Cell nucleus              | Nodes
Synapse                   | Weights
Axon                      | Output

An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network of neurons that makes up a human brain, so that computers can understand things and make decisions in a human-like manner. An artificial neural network is designed by programming computers to behave like interconnected brain cells.

There are around 100 billion neurons in the human brain, and each neuron is connected to somewhere between 1,000 and 100,000 others. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly powerful parallel processors.


Understanding Neural Networks in Deep Learning


Neural networks are capable of learning and identifying patterns directly from data without pre-defined
rules. These networks are built from several key components:
1. Neurons: The basic units that receive inputs, each neuron is governed by a threshold and an
activation function.
2. Connections: Links between neurons that carry information, regulated by weights and biases.
3. Weights and Biases: These parameters determine the strength and influence of connections.
4. Propagation Functions: Mechanisms that help process and transfer data across layers of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve accuracy.
Learning in neural networks follows a structured, three-stage process:
1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network generates an output.
3. Iterative Refinement: The network refines its output by adjusting weights and biases, gradually
improving its performance on diverse tasks.
In an adaptive learning environment:
• The neural network is exposed to a simulated scenario or dataset.
• Parameters such as weights and biases are updated in response to new data or conditions.
• With each adjustment, the network’s response evolves allowing it to adapt effectively to different
tasks or environments.

Structure of a Neural Network: Input, Hidden, and Output Layers


A typical feedforward Artificial Neural Network is organized into layers of interconnected nodes.
Information flows in one direction, from the input layer through one or more hidden layers to the output
layer.


1. Input Layer: This is where the network receives its input data. Each input neuron in the layer
corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy lifting. A neural network
can have one or multiple hidden layers. Each layer consists of units (neurons) that transform the
inputs into something that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The format of these outputs varies
depending on the specific task like classification, regression.

In summary:
Data enters the input layer, where it's represented by individual features. This information then flows
through one or more hidden layers, where weighted sums and activation functions transform the data,
allowing the network to learn intricate patterns. Finally, the processed information reaches the output layer,
which provides the network's prediction or classification. The learning process involves adjusting the
weights and biases throughout the network to minimize the difference between the network's output and
the desired output, typically using algorithms like backpropagation.
Backpropagation Algorithm (Simple Definition):
Backpropagation is a training algorithm for neural networks that adjusts the weights by comparing the
predicted output with the actual output and minimizing the error. It works by moving the error backward
from the output layer to the input layer.
Key Steps (Simplified):
1. Forward Pass: Input passes through the network to get the output.
2. Error Calculation: Compare predicted output with actual output.
3. Backward Pass: Propagate the error back through the network.
4. Weight Update: Adjust weights to reduce the error (using gradient descent).
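
These four steps can be sketched from scratch with NumPy on the classic XOR problem (the layer size, learning rate and epoch count are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 units; random initial weights, zero biases.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # 1. Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. Error calculation (squared-error gradient at the output)
    d_out = (out - y) * out * (1 - out)
    # 3. Backward pass: propagate the error to the hidden layer
    d_h = (d_out @ W2.T) * h * (1 - h)
    # 4. Weight update (gradient descent)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(3).ravel())  # should approach [0, 1, 1, 0]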


Activation Functions in ANNs


Activation functions are important in neural networks because they introduce non-linearity and help the network learn complex patterns. Let’s see some common activation functions used in ANNs:
1. Sigmoid Function: Outputs values between 0 and 1. It is used in binary classification tasks like
deciding if an image is a cat or not.
$f(x) = \frac{1}{1 + e^{-x}}$

2. ReLU (Rectified Linear Unit): A popular choice for hidden layers, it returns the input if positive
and zero otherwise. It helps to solve the vanishing gradient problem.
3. Tanh (Hyperbolic Tangent): Similar to sigmoid but outputs values between -1 and 1. It is used in
hidden layers when a broader range of outputs is needed.
4. Softmax: Converts raw outputs into probabilities used in the final layer of a network for multi-class
classification tasks.
5. Leaky ReLU: A variant of ReLU that allows small negative values for inputs helps in preventing
“dead neurons” during training.
These functions help the network decide whether to activate a neuron, enabling it to recognize patterns and make predictions.
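
These functions are simple to write directly in NumPy, as the following sketch shows (the input values are illustrative):

import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def relu(x):     return np.maximum(0.0, x)
def tanh(x):     return np.tanh(x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))
print(relu(z))
print(leaky_relu(z))
print(softmax(z))  # outputs sum to 1, usable as class probabilities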

Applications of Artificial Neural Networks


1. Social media: ANNs help social media platforms suggest friends and relevant content by analyzing
user profiles, interests and interactions. They also assist in targeted advertising, ensuring users see ads tailored to their preferences.
2. Marketing and Sales: E-commerce sites like Amazon use ANNs to recommend products based on
browsing history. They also personalize offers, predict customer behavior and segment customers
for more effective marketing campaigns.


3. Healthcare: ANNs are used in medical imaging for detecting diseases like cancer and they assist
in diagnosing conditions with accuracy similar to doctors. Additionally, they predict health risks
and recommend personalized treatment plans.
4. Personal Assistants: Virtual assistants like Siri and Alexa use ANNs to process natural language,
understand voice commands and respond accordingly. They help manage tasks like setting reminders, making calls and answering queries.
5. Customer Support: ANNs power chatbots and automated customer service systems that analyze
customer queries and provide accurate responses, improving efficiency in handling customer inquiries.
6. Finance: In the financial industry, they are used for fraud detection, credit scoring and predicting
market trends by analyzing large sets of transaction data and spotting anomalies.

Challenges in Artificial Neural Networks


1. Data Dependency: ANNs require large amounts of high-quality data to train effectively. Gathering
and cleaning sufficient data can be time-consuming, expensive and often impractical especially in
industries with limited access to quality data.
2. Computational Power: Training deep neural networks with many layers, demands significant
computational resources. High-performance hardware (e.g., GPUs) is often required, which makes it
expensive and resource-intensive.
3. Overfitting: It can easily overfit to the training data which means they perform well on the training
set but poorly on new, unseen data. This challenge arises when the model learns to memorize rather
than generalize, reducing its real-world applicability.
4. Interpretability: They are often referred to as "black boxes." It is difficult to understand how they
make decisions which is a problem in fields like healthcare and finance where explainability and
transparency are important.
5. Training Time: Training ANNs can take a long time, especially for deep learning models with
many layers and vast datasets. This lengthy training process can delay the deployment of models
and hinder their use in time-sensitive applications.


Building and training neural networks using TensorFlow and Keras


A neural network architecture comprises a number of neurons, or activation units, and this circuit of
units serves to find underlying relationships in data. It has been mathematically proven (the universal
approximation theorem) that neural networks can approximate virtually any function, regardless of its
complexity, provided the network is deep and wide enough.
Now let's learn to implement a neural network using TensorFlow

Install TensorFlow
TensorFlow is an open-source library/platform created by Google. It is the most widely used library for deep
learning applications. Creating neural networks is not the only purpose of the TensorFlow library, but it is
used very frequently for this purpose. So, before going ahead, let's install and import the TensorFlow module.
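A minimal setup sketch (the install command assumes a standard Python environment with pip available):

pip install tensorflow

Then verify the installation in Python:

import tensorflow as tf
print(tf.__version__)   # prints the installed TensorFlow version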
Building and training neural networks using TensorFlow and Keras involves a structured process, typically
following these steps:
• Import Libraries and Prepare Data:
• Import tensorflow and keras along with any necessary modules for data handling and
preprocessing (e.g., numpy, sklearn).
• Load and prepare your dataset. This often involves splitting the data into training and testing
sets, and potentially normalizing or scaling feature values.
• Define the Model Architecture:
• Use Keras's Sequential API for simple, layer-by-layer models, or the Functional API for
more complex architectures with multiple inputs/outputs or shared layers.
• Add layers to your model, such as Dense (fully connected), Conv2D (convolutional for
images), LSTM (for sequences), etc.
• Specify the number of neurons, activation functions (e.g., 'relu', 'sigmoid', 'softmax'), and
input shape for the first layer.
• Compile the Model:
• Call the compile() method on your model.
• Specify an optimizer to control how the model updates its weights during training.
• Define a loss function appropriate for your task (e.g., 'sparse_categorical_crossentropy' for
multi-class classification, 'mse' for regression).
• Optionally, specify metrics to monitor during training and evaluation (e.g., 'accuracy').
• Train the Model:
• Use the fit() method to train your model on the training data.
• Provide the training features (x_train) and labels (y_train).
• Specify the epochs (number of times the model iterates over the entire dataset)
and batch_size (number of samples per gradient update).


• Optionally, include validation_data to monitor performance on a separate validation set during training.
• Evaluate the Model:
• After training, use the evaluate() method on the test data to assess the model's performance
on unseen examples.
• This provides insights into the model's generalization ability.
• Make Predictions:
• Use the predict() method to generate predictions on new, unseen data using your trained
model.

Putting these steps together, here is a complete example using the MNIST dataset:

import tensorflow as tf
from tensorflow import keras

# 1. Prepare Data (example with MNIST)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # Normalize pixel values to [0, 1]

# 2. Define Model Architecture
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),     # Input layer: 28x28 image -> 784 values
    keras.layers.Dense(128, activation='relu'),     # Hidden layer
    keras.layers.Dropout(0.2),                      # Dropout for regularization
    keras.layers.Dense(10, activation='softmax')    # Output layer: 10 class probabilities
])

# 3. Compile the Model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# 4. Train the Model
model.fit(x_train, y_train, epochs=5)

# 5. Evaluate the Model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f'\nTest accuracy: {test_acc}')
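To complete step 6, predictions on new data can be generated with predict(); a short sketch continuing the example above:

# 6. Make Predictions
predictions = model.predict(x_test[:5])        # shape (5, 10): class probabilities per image
predicted_labels = predictions.argmax(axis=1)  # pick the most probable class
print('Predicted labels:', predicted_labels)
print('Actual labels:   ', y_test[:5])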


Convolutional Neural Networks


Convolutional Neural Networks (CNNs) are a specialized type of deep learning algorithm primarily
designed for processing data with a grid-like topology, such as images. They have revolutionized the field
of computer vision and are the de facto standard for tasks like image recognition, object detection, and
image segmentation.
Key Components of a Convolutional Neural Network
1. Convolutional Layers: These layers apply convolutional operations to input images using filters
or kernels to detect features such as edges, textures and more complex patterns. Convolutional
operations help preserve the spatial relationships between pixels.
2. Pooling Layers: They downsample the spatial dimensions of the input, reducing the computational
complexity and the number of parameters in the network. Max pooling is a common pooling
operation, where the maximum value is selected from a group of neighbouring pixels.
3. Activation Functions: They introduce non-linearity to the model, allowing it to learn more
complex relationships in the data.
4. Fully Connected Layers: These layers are responsible for making predictions based on the high-
level features learned by the previous layers. They connect every neuron in one layer to every neuron
in the next layer.

How CNNs Work?


1. Input Image: The CNN receives an input image, which is pre-processed to ensure uniformity in size
and format.
2. Convolutional Layers: Filters are applied to the input image to extract features like edges, textures
and shapes.
3. Pooling Layers: The feature maps generated by the convolutional layers are downsampled to
reduce dimensionality.
4. Fully Connected Layers: The downsampled feature maps are passed through fully connected
layers to produce the final output, such as a classification label.
5. Output: The CNN outputs a prediction, such as the class of the image.
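A minimal Keras sketch of this pipeline (the layer sizes and the 28x28 grayscale input shape are illustrative assumptions):

from tensorflow import keras

cnn = keras.Sequential([
    # Convolutional layer: 32 filters of size 3x3 extract local features
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    # Pooling layer: keep the max of each 2x2 block to downsample
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    # Flatten the feature maps and classify with fully connected layers
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')   # output: class probabilities
])
cnn.summary()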


How to Train a Convolutional Neural Network?


CNNs are trained using a supervised learning approach. This means that the CNN is given a set of labeled
training images. The CNN learns to map the input images to their correct labels.
The training process for a CNN involves the following steps:
1. Data Preparation: The training images are pre-processed to ensure that they are all in the same
format and size.
2. Loss Function: A loss function is used to measure how well the CNN is performing on the training
data. It is typically computed by comparing the predicted labels with the actual labels of the
training images.
3. Optimizer: An optimizer is used to update the weights of the CNN in order to minimize the loss
function.
4. Backpropagation: Backpropagation is a technique used to calculate the gradients of the loss
function with respect to the weights of the CNN. The gradients are then used to update the weights
of the CNN using the optimizer.
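In Keras, the loss function and optimizer are set in compile(), while fit() performs backpropagation and the weight updates automatically. A sketch continuing the cnn model above (reusing the MNIST arrays from the earlier example is an assumption):

# MNIST images need an explicit channel dimension for Conv2D: (28, 28) -> (28, 28, 1)
x_train_c = x_train.reshape(-1, 28, 28, 1)
x_test_c = x_test.reshape(-1, 28, 28, 1)

cnn.compile(optimizer='adam',                        # optimizer: how weights are updated
            loss='sparse_categorical_crossentropy',  # loss: measures prediction error
            metrics=['accuracy'])

# Each epoch runs forward passes, backpropagation and weight updates over the data
cnn.fit(x_train_c, y_train, epochs=3, batch_size=64,
        validation_data=(x_test_c, y_test))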

How to Evaluate CNN Models


The performance of a CNN can be evaluated using a variety of criteria. Among the most popular metrics are:
• Accuracy: The percentage of test images that the CNN classifies correctly.
• Precision: Of the images the CNN predicts as a particular class, the percentage that actually
belong to that class: Precision = TP / (TP + FP).
• Recall: Of the images that actually belong to a particular class, the percentage that the CNN
predicts as that class: Recall = TP / (TP + FN).
• F1 Score: The harmonic mean of precision and recall, F1 = 2 × (Precision × Recall) / (Precision + Recall).
It is a good metric for evaluating the performance of a CNN on classes that are imbalanced.
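As a sketch, these metrics can be computed with scikit-learn, given arrays of actual and predicted labels (the values below are hypothetical):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # actual classes
y_pred = [0, 2, 2, 2, 1, 0]   # classes predicted by the CNN

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred, average='macro'))
print('Recall   :', recall_score(y_true, y_pred, average='macro'))
print('F1 score :', f1_score(y_true, y_pred, average='macro'))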


Case Study of CNN for Diabetic retinopathy


Diabetic retinopathy is a severe eye condition caused by damage to the retina's blood vessels due to
prolonged diabetes. It is a leading cause of blindness among adults aged 20 to 64. CNNs have been
successfully used to detect diabetic retinopathy by analyzing retinal images. By training on labeled datasets
of healthy and affected retina images, CNNs can accurately identify signs of the disease, helping in early
diagnosis and treatment.
Applications of CNN
• Image classification: CNNs are the state-of-the-art models for image classification. They can be
used to classify images into different categories such as cats and dogs.
• Object detection: CNNs can be used to detect objects in images, such as people, cars and
buildings. They can also be used to localize objects, which means they can identify
the location of an object in an image.
• Image segmentation: CNNs can be used to segment images, which means they can identify and
label different objects in an image. This is useful for applications such as medical imaging and
robotics.
• Video analysis: CNNs can be used to analyze videos, such as tracking objects in a video or detecting
events in a video. This is useful for applications such as video surveillance and traffic monitoring.
Advantages of CNN
• High Accuracy: They can achieve high accuracy in various image recognition tasks.
• Efficiency: They are efficient, especially when implemented on GPUs.
• Robustness: They are robust to noise and variations in input data.
• Adaptability: They can be adapted to different tasks by modifying their architecture.
Disadvantages of CNN
• Complexity: They can be complex and difficult to train, especially for large datasets.
• Resource-Intensive: They require significant computational resources for training and deployment.
• Data Requirements: They need large amounts of labeled data for training.
• Interpretability: They can be difficult to interpret, making it challenging to understand their
predictions.


Recurrent Neural Networks (RNN)


Recurrent Neural Networks (RNNs) are a class of artificial neural networks specifically designed to process
sequential data, where the order and context of elements are crucial. Unlike traditional feedforward neural
networks that assume independence between inputs, RNNs have "memory" – they can consider information
from previous inputs to influence the current output. This makes them exceptionally well-suited for tasks
involving sequences like text, speech, and time series data.
Let’s understand RNN with an example:
Imagine reading a sentence and trying to predict the next word: you don't rely only on the current word
but also remember the words that came before. RNNs work similarly, "remembering" past information
and passing the output from one step as input to the next, i.e. considering all the earlier words to choose
the most likely next word. This memory of previous steps helps the network understand context and make
better predictions.

Key Components of RNNs


There are mainly two components of RNNs that we will discuss.
1. Recurrent Neurons
The fundamental processing unit in an RNN is the recurrent unit. It holds a hidden state that maintains
information about previous inputs in a sequence. Recurrent units can "remember" information from prior
steps by feeding their hidden state back, allowing them to capture dependencies across time.

2. RNN Unfolding
RNN unfolding, or unrolling, is the process of expanding the recurrent structure over time steps. During
unfolding, each step of the sequence is represented as a separate layer in a series, illustrating how
information flows across each time step.
This unrolling enables backpropagation through time (BPTT), a learning process in which errors are
propagated across time steps to adjust the network's weights, enhancing the RNN's ability to learn
dependencies within sequential data.
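As a minimal sketch (the shapes are illustrative assumptions), a recurrent layer in Keras maintains its hidden state across the time steps of each input sequence:

import numpy as np
from tensorflow import keras

# Each input is a sequence of 10 time steps with 8 features per step
model = keras.Sequential([
    keras.layers.SimpleRNN(32, input_shape=(10, 8)),  # hidden state of size 32, fed back at every step
    keras.layers.Dense(1, activation='sigmoid')       # prediction from the final hidden state
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Dummy data just to show the expected shapes
x = np.random.rand(100, 10, 8)
y = np.random.randint(0, 2, size=(100, 1))
model.fit(x, y, epochs=1, verbose=0)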



Types of Recurrent Neural Networks


There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single input and a single output. It
is used for straightforward classification tasks such as binary classification where no sequential data is
involved.



2. One-to-Many RNN
In a One-to-Many RNN, the network processes a single input to produce multiple outputs over time. This is
useful in tasks where one input triggers a sequence of predictions (outputs). For example, in image
captioning a single image is used as input to generate a sequence of words as a caption.



3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is useful
when the overall context of the input sequence is needed to make one prediction. In sentiment analysis, for
example, the model receives a sequence of words (like a sentence) and produces a single output such as
positive, negative or neutral.



4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of outputs. In a
language translation task, a sequence of words in one language is given as input and a corresponding
sequence in another language is generated as output.

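In Keras, the Many-to-One and Many-to-Many patterns are, as a sketch, selected with the return_sequences flag of a recurrent layer:

from tensorflow import keras

# Many-to-One: only the hidden state after the final time step is returned
# (e.g., one sentiment label for a whole sentence)
many_to_one = keras.layers.SimpleRNN(32, return_sequences=False)

# Many-to-Many: the hidden state at every time step is returned
# (e.g., one output per input word)
many_to_many = keras.layers.SimpleRNN(32, return_sequences=True)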

Advantages of Recurrent Neural Networks


• Sequential Memory: RNNs retain information from previous inputs, making them ideal for time-
series predictions where past data is crucial.
• Enhanced Pixel Neighborhoods: RNNs can be combined with convolutional layers to capture
extended pixel neighborhoods, improving performance in image and video data processing.
Limitations of Recurrent Neural Networks (RNNs)


While RNNs excel at handling sequential data, they face two main training challenges: the vanishing
gradient and the exploding gradient problem.
1. Vanishing Gradient: During backpropagation, gradients diminish as they pass through each time
step, leading to minimal weight updates. This limits the RNN's ability to learn long-term
dependencies, which is crucial for tasks like language translation.
2. Exploding Gradient: Sometimes gradients grow uncontrollably, causing excessively large weight
updates that destabilize training.
These challenges can hinder the performance of standard RNNs on complex, long-sequence tasks; variants
such as LSTM and GRU were designed to mitigate them.
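One common remedy for exploding gradients is gradient clipping; as a sketch, Keras optimizers support this through the clipnorm argument (the model variable refers to the RNN sketch above):

from tensorflow import keras

# Clip the global norm of the gradients to 1.0 before each weight update
optimizer = keras.optimizers.Adam(clipnorm=1.0)
model.compile(optimizer=optimizer, loss='binary_crossentropy')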
Applications of Recurrent Neural Networks
RNNs are used in various applications where data is sequential or time-based:
• Time-Series Prediction: RNNs excel in forecasting tasks, such as stock market predictions and
weather forecasting.
• Natural Language Processing (NLP): RNNs are fundamental in NLP tasks like language
modeling, sentiment analysis and machine translation.
• Speech Recognition: RNNs capture temporal patterns in speech data, aiding in speech-to-text and
other audio-related applications.
• Image and Video Processing: When combined with convolutional layers, RNNs help analyze video
sequences, facial expressions and gesture recognition.

