AIMLCZG567: AI & ML Techniques for Cyber Security
Jagdish Prasad
BITS Pilani, Pilani Campus (WILP)
Session: 04
Title: Basics for Machine Learning - II
Agenda
• Converting Categorical Data into Numerical Data
• Feature Encoding, Vectorization, Normalization
• Issues: Overfitting, Underfitting, Class Imbalance
• Evaluation Metrics: Precision, Recall, F1-score
• Overview of Machine learning algorithms
• Support Vector Machine (SVM)
• Bayesian Networks
• Decision Trees
• Random Forests
• Artificial Neural Networks (ANN)
• Hierarchical Algorithms
• Genetic Algorithms
• Similarity Algorithms
Categorical to Numerical
Conversion
Why Categorical to Numerical Conversion?
• Machines only understand numerical data.
• Text data ‘as-is’ cannot be used as input to a Machine Learning
algorithm.
• Process of converting text data into numbers is called Feature
Extraction (also called text vectorization).
• In NLP, Feature Extraction is an important step for a better
understanding of the context of what we are dealing with.
Conversion Techniques
• One Hot Encoding
• Bag of Words (BoW)
• N-grams
• TF-IDF
• Word2Vec (Word Embedding)
One Hot Encoding
• One Hot Encoding converts the words of a document into a Vocabulary
dimension vector.
• Example:
• Documents (Single record): “We are learning Natural Language Processing”, “We are learning
Data Science”, and “Natural Language Processing comes under Data Science”.
• Corpus (Total words): We are learning Natural Language Processing, We are learning Data
Science, Natural Language Processing comes under Data Science
• Vocabulary (Unique words): We are learning Natural Language Processing Data Science
comes under
Example: One Hot Encoding
Advantage
• It is intuitive.
• Easy to implement.
Disadvantage
• It creates sparsity.
• Size of each document after one-hot encoding may differ.
• Out of Vocabulary (OOV) problem.
• No capturing of semantic meaning.
Simplest but not very effective technique - Not much in use
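A minimal Python sketch of the idea (illustrative only; it builds the vocabulary for the three example documents above and one-hot encodes each word):

# One-hot encoding sketch: each word becomes a vector of Vocabulary size.
docs = ["We are learning Natural Language Processing",
        "We are learning Data Science",
        "Natural Language Processing comes under Data Science"]

vocab = []                          # unique words, in order of first appearance
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

def one_hot(word):
    # Vocabulary-dimension vector with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# A document becomes a (num_words x vocab_size) matrix, so document
# sizes differ -- one of the disadvantages listed above.
doc_matrix = [one_hot(w) for w in docs[0].split()]
print(vocab)
print(doc_matrix)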
Bag of Words
• Representation of text that describes the Frequency of words
within a document.
• Specially used in the Text Classification task.
• Can directly use the CountVectorizer class from Scikit-learn.
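A minimal sketch using scikit-learn's CountVectorizer on the three example documents from the One Hot Encoding slide (toy data, for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["We are learning Natural Language Processing",
        "We are learning Data Science",
        "Natural Language Processing comes under Data Science"]

vectorizer = CountVectorizer()             # counts word frequencies per document
bow = vectorizer.fit_transform(docs)       # sparse (3 x vocab_size) matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(bow.toarray())                       # every document now has the same length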
Bag of Words
• Advantage
• Simple and intuitive.
• Size of each document after conversion is the same.
• Disadvantage
• Creates Sparsity (scattered and lacks denseness)
• Does not consider sentence ordering issues.
One of the most used text vectorization techniques.
TF-IDF: Term Frequency and Inverse Document
Frequency
• TF-IDF is a statistical measure that evaluates how relevant a word is to a
document in a collection of documents.
• Term Frequency (TF):
• Number of times a word appears in a document divided by the total number
of words in that document; 0 ≤ TF ≤ 1.
TF_ij = (Number of times term i appears in document j) / (Total number of terms in document j)
• Inverse Document Frequency (IDF):
• Logarithm of the number of documents in the corpus divided by the number of
documents where the specific term appears: IDF_i = log(N / n_i).
• Scikit-learn uses the formula log(N / n_i) + 1 (with a smoothed variant by default).
Example: TF-IDF: Term Frequency and Inverse
Document Frequency
• Document 1: It is going to rain today
• Document 2: Today I am not going outside
• Document 3: I am going to watch a season premiere
Word:    going   to      today   i       am      it      is      rain
Doc 1    0       0.07    0.07    0       0       0.17    0.17    0.17
Doc 2    0       0       0.07    0.07    0.07    0       0       0
Doc 3    0       0.05    0       0.05    0.05    0       0       0
• It is evident that words like ‘it’, ‘is’ and ‘rain’ are important for Doc 1 but not for
Doc 2 and Doc 3.
• This means Doc 1 differs from Docs 2 and 3 with respect to talking about rain.
• Notice that Doc 1 and Doc 2 talk about something happening ‘today’, and Doc 2 and Doc 3
say something about the writer because of the word ‘I’.
• The table helps find similarities and dissimilarities between documents and
words much better than BOW.
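A minimal sketch using scikit-learn's TfidfVectorizer on the three documents above (the library's IDF formula and tokenizer differ slightly from the hand-computed table, so the numbers will not match exactly):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["It is going to rain today",
        "Today I am not going outside",
        "I am going to watch a season premiere"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)       # sparse (3 x vocab_size) TF-IDF matrix

print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))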
TF-IDF: Term Frequency and Inverse Document
Frequency
Advantage
• Widely used technique for Information retrieval like a search
engine.
Disadvantage
• Sparsity
• Dimensionality increases with a large dataset, slowing down
algorithm.
• Does not capture Semantic meaning.
Word Embedding
• Representation of words for text analysis:
• In the form of a real-valued vector that encodes the meaning of the word
• Words that are closer in the vector space are similar in meaning.
• Example:
• Boy : Man
• Boy : Table
• First pair has more similar words to each other
• Easier for human to understand the associations between words
in a language.
• Helps machines understand this kind of relation automatically in
language.
Word Embedding Types
• Frequency-based – Count frequency of word
• Bag of Words (BOW)
• TF-IDF
• GloVe (based on Matrix Factorization)
• Prediction based
• Word2Vec
Word2Vec
• A Deep learning-based word embedding technique that converts
a given word into a vector as a collection of numbers.
• Why word2vec?
• captures semantic meaning like happiness and joy have the same
meaning.
• creates low dimension vector
• creates a Dense vector (non-zeros)
• Two approaches for using Word2Vec:
• Use a pre-trained model
• Self-Trained model
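A minimal self-trained sketch using the gensim library (assumed installed; the toy corpus is made up, and a real model needs far more text or a pre-trained download):

from gensim.models import Word2Vec

# Toy corpus: in practice Word2Vec needs a large corpus or a pre-trained model.
sentences = [["network", "traffic", "looks", "normal"],
             ["malware", "traffic", "looks", "suspicious"],
             ["normal", "user", "login"],
             ["suspicious", "user", "login", "attempt"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["traffic"][:5])                       # dense, low-dimensional vector
print(model.wv.most_similar("suspicious", topn=2))   # semantically closest words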
Feature Encoding, Vectorization
and Normalization
What is Feature Encoding?
• Process of transformation of categorical values of the relevant
features into numerical value is called feature encoding.
• Data frame analytics automatically performs feature encoding.
• Input data is pre-processed with the following encoding
techniques:
• One-Hot encoding: Assigns vectors to each category. The vector represents
whether the corresponding feature is present (1) or not (0).
• Target-Mean encoding: Replaces categorical values with the mean value of
the target variable.
• Frequency encoding: Takes into account how many times a given
categorical value occurs within a feature.
Label/Ordinal Encoder
• Label Encoder and Ordinal
Encoder encode categories into
numerical values directly.
• Label Encoder is used for nominal categorical variables (categories without
order, e.g. red, green, blue).
• Ordinal Encoder is used for ordinal categorical variables (categories with
order, e.g. small, medium, large).
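A minimal scikit-learn sketch of both encoders (toy values; note that LabelEncoder is primarily intended for target labels):

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Nominal variable (no order): LabelEncoder assigns an integer per category.
colors = ["red", "green", "blue", "green"]
print(LabelEncoder().fit_transform(colors))           # e.g. [2 1 0 1]

# Ordinal variable (has an order): OrdinalEncoder with an explicit ordering.
sizes = [["small"], ["large"], ["medium"], ["small"]]
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(enc.fit_transform(sizes))                       # small=0, medium=1, large=2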
One Hot Encoding
• In One-Hot Encoding and Dummy Encoding, the categorical column is
split into multiple columns consisting of ones and zeros.
• Addresses the drawback of Label and Ordinal Encoding: no artificial ordering is
imposed, because the encoded data is represented as multiple Boolean columns
rather than as a single integer column.
Original data          One-hot encoded
User   City            User   Rome   Madrid   Istanbul
1      Rome            1      1      0        0
2      Madrid          2      0      1        0
1      Madrid          1      0      1        0
3      Istanbul        3      0      0        1
2      Istanbul        2      0      0        1
1      Istanbul        1      0      0        1
1      Rome            1      1      0        0
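A minimal pandas sketch reproducing the table above with get_dummies (one-hot and dummy variants):

import pandas as pd

df = pd.DataFrame({"User": [1, 2, 1, 3, 2, 1, 1],
                   "City": ["Rome", "Madrid", "Madrid", "Istanbul",
                            "Istanbul", "Istanbul", "Rome"]})

one_hot = pd.get_dummies(df, columns=["City"])                    # one column per city
dummy   = pd.get_dummies(df, columns=["City"], drop_first=True)   # dummy encoding

print(one_hot)
print(dummy)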
Count/Frequency Encoding
• Count/Frequency Encoding
encodes categorical variables
to the count of occurrences
and frequency of occurrences
respectively.
• Utilizes the frequency of the
categories as labels.
• When the frequency is correlated with the target variable, it helps the model
understand and assign weights in direct or inverse proportion, depending on
the nature of the data.
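A minimal pandas sketch of count and frequency encoding (toy city values, for illustration):

import pandas as pd

df = pd.DataFrame({"City": ["Rome", "Madrid", "Madrid", "Istanbul",
                            "Istanbul", "Istanbul", "Rome"]})

counts = df["City"].value_counts()                  # occurrences of each category
freqs  = df["City"].value_counts(normalize=True)    # relative frequency

df["City_count"] = df["City"].map(counts)
df["City_freq"]  = df["City"].map(freqs)
print(df)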
Binary/BaseN Encoding
• Binary Encoding encodes
categorical variables into
integers, then converts them to
binary code.
• Output is similar to One-Hot Encoding, but fewer columns are created.
• Addresses the drawback of One-Hot Encoding: a cardinality of n results in
roughly log2(n) columns instead of n columns.
• BaseN Encoding follows the same idea but uses other base values instead
of 2, resulting in logN(n) columns.
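A minimal sketch assuming the third-party category_encoders package is installed (toy city values):

import pandas as pd
import category_encoders as ce   # third-party package: pip install category_encoders

df = pd.DataFrame({"City": ["Rome", "Madrid", "Istanbul", "Paris",
                            "Berlin", "Rome", "Madrid"]})

binary = ce.BinaryEncoder(cols=["City"]).fit_transform(df)     # ~log2(n) columns
base4  = ce.BaseNEncoder(cols=["City"], base=4).fit_transform(df)

print(binary)
print(base4)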
Target/Mean Encoding
• Target encoding is similar to label encoding, except labels are
correlated directly with the target.
• Each category of the feature is replaced with the mean
value of the target variable computed on the training data.
• Advantages of the Target encoding are that it does not affect
the volume of the data and helps in faster learning.
• Target Encoding or Mean Encoding is a popular encoding
approach.
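A minimal pandas sketch of target/mean encoding (the 'city' and 'fraud' columns are hypothetical):

import pandas as pd

# Hypothetical data: 'city' feature and a binary 'fraud' target.
df = pd.DataFrame({"city":  ["Rome", "Madrid", "Rome", "Istanbul", "Madrid", "Rome"],
                   "fraud": [0, 1, 1, 0, 1, 0]})

# Replace each category with the mean of the target computed on the training data.
target_means = df.groupby("city")["fraud"].mean()
df["city_encoded"] = df["city"].map(target_means)
print(df)
# In practice, compute the means on the training split only (and consider
# smoothing) to avoid target leakage.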
Feature Vector
• Feature vector is an ordered list of numerical properties of observed data set.
• Represents input features to a machine learning model that makes a
prediction.
• Humans can analyze qualitative data to make a decision.
• Example: we see the cloudy sky, feel the damp breeze, and decide to take an
umbrella when going outside.
• Machine learning models can only deal with quantitative data.
• Must always convert features of observed phenomena into numerical values and
feed them into a machine learning model in the same order.
• Must represent features in feature vectors.
Feature Scaling
• Feature scaling is a data pre-processing technique that transforms the values
of features or variables in a dataset to a similar scale.
• Done to ensure that all features contribute equally to the model and to
prevent features with larger values from dominating the model.
• Transforms the data to a more consistent scale, making it easier to build
accurate and effective machine learning models.
• Essential when working with datasets where the features have different
ranges, units of measurement, or orders of magnitude.
• Feature scaling techniques:
• Normalization (Min-Max scaling)
• Standardization
Feature Normalization
• Normalization: Values are shifted and rescaled so that they end
up ranging between 0 and 1 (also known as Min-Max scaling).
• Normalization adjusts the values of features in a dataset to a
common scale.
• Reduces the impact of different scales on the accuracy of
machine learning models.
• Formula for normalization: X_norm = (X − X_min) / (X_max − X_min)
Feature Standardization
• Standardization: Values are centered around the Mean with a unit
Standard Deviation.
• Under standardization, Mean of the attribute becomes zero, and
the resultant distribution has a unit Standard Deviation.
• Formula for standardization: X_std = (X − μ) / σ, where μ is the mean and σ is the
standard deviation of the attribute.
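A minimal scikit-learn sketch contrasting the two scalers (the two feature columns are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales (e.g. packet size, duration).
X = np.array([[1500.0, 0.2],
              [  60.0, 3.5],
              [ 900.0, 1.1]])

print(MinMaxScaler().fit_transform(X))     # normalization: values in [0, 1]
print(StandardScaler().fit_transform(X))   # standardization: mean 0, std 1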
Under / Over fit Models
Under Fitting
• A Machine Learning algorithm is said to underfit when it cannot capture the underlying
trend of the data, i.e., it performs poorly even on the training data (and therefore also
on testing data).
• Underfitting reduces the accuracy of the machine learning model.
• Usually happens when there is too little data to build an accurate model, or when a
linear model is fitted to non-linear data.
• Underfitting can be avoided by using more data and increasing model complexity
(e.g., adding relevant features).
• Reasons for Under fitting:
– High bias and low variance.
– Size of the training dataset used is not enough.
– Model is too simple.
– Training data is not cleaned and also contains noise in it.
Over Fitting
• A Machine Learning model is said to be overfitted when it fits the training data too
closely and does not make accurate predictions on testing data.
• When a model gets trained with large data, it starts learning from the noise and
inaccurate data from the data set.
• During testing the model does not categorize the data correctly, because of too
many details and noise.
• Overfitting is caused by non-parametric and non-linear methods:
• these algorithms have more freedom in building the model based on the dataset
• they can build unrealistic models.
• Using a linear algorithm for linear data, or constraining parameters such as the maximal
depth of decision trees, helps avoid overfitting.
• Reasons for Overfitting:
– High variance and low bias.
– Model is too complex.
– Size of the training data is too small relative to the model's complexity.
Feature Extraction
• Feature Extraction aims to reduce the number of features in a
dataset by:
• Creating new features from the existing ones (and then discarding the
original features)
• Discarding some features altogether
• Reduced set of features should be able to summarize most of the
information contained in the original set of features.
Imbalanced Classes
• Examples of Imbalanced classes:
• Fraudulent transactions are significantly lower than normal healthy transactions i.e.
around 1-2 % of the total number of observations.
• Malicious cyber incidents are far fewer than benign events.
• Electricity theft transactions are much less compared to normal transactions.
• Advanced Analytics and Machine Learning algorithms try to identify patterns in such
transactions to indicate abnormal behaviour.
• Machine Learning algorithms tend to produce unsatisfactory classifiers when
faced with imbalanced datasets.
• Challenge is to improve identification of the rare minority class as opposed to
achieving higher overall accuracy.
• For an imbalanced data set, if the event to be predicted belongs to the minority
class and the event rate is less than 5%, it is referred to as a rare event.
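A minimal scikit-learn sketch of one common mitigation, class weighting (the 98:2 labels and random features are made up):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 1 = attack (rare), 0 = benign.
y = np.array([0] * 98 + [1] * 2)
X = np.random.rand(100, 5)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))          # the minority class gets a much larger weight

clf = LogisticRegression(class_weight="balanced").fit(X, y)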
Evaluation Metrics
• Evaluation metrics are quantitative measures used to assess the performance
and effectiveness of a Machine Learning model.
• Metrics indicate how well a model is performing and help in comparing different
models or algorithms.
• Evaluation metrics provide objective criteria to evaluate a Machine Learning
model for its:
• Predictive ability
• Generalization capability
• Overall quality
• Choice of evaluation metrics depends on the specific problem domain, the type
of data, and the desired outcome.
Evaluation Metrics: Terms Definitions
Term                                      Definition
True Positives                            Predicted Value = Yes, Real Value = Yes
True Negatives                            Predicted Value = No,  Real Value = No
False Positives                           Predicted Value = Yes, Real Value = No
False Negatives                           Predicted Value = No,  Real Value = Yes
Accuracy                                  Ratio of the number of correct predictions to the total number of predictions.
Positive Predictive Value or Precision    Ratio of predicted positive cases that were correctly identified.
Negative Predictive Value                 Ratio of predicted negative cases that were correctly identified.
Sensitivity or Recall                     Ratio of actual positive cases which are correctly identified.
Specificity                               Ratio of actual negative cases which are correctly identified.
Precision & Recall
Precision
• Defined as ratio of correctly predicted Attacks to all the samples predicted as
Attacks.
• Calculated as the number of true positive predictions divided by the sum of
number of true positive and false positive predictions.
Precision = TP / (TP + FP)
Recall
• Defined as ratio of all samples correctly classified as Attacks to all the samples
that are actually Attacks. It is also called a Detection Rate.
• Calculated as the number of true positive predictions divided by sum of number
of true positive and false negative predictions:
Recall = TP / (TP + FN)
F1-Score
• Defined as the harmonic mean of the Precision and Recall. It is a statistical
technique for examining the accuracy of a system by considering both precision
and recall of the system.
• F1 score ranges between 0 and 1 and is calculated as:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• F1-Score tells how precise (number of instances correctly classified) and robust
(does not miss any significant number of instances) the classifier is.
• Harmonic Mean punishes extreme values more.
• Example: Assume a binary classification model with the following results:
• Precision: 0, Recall: 1
• If we take the arithmetic mean, we get 0.5. It indicates that the above result comes
from a classifier that ignores the input and predicts one of the classes as output.
• If we were to take HM, we would get 0 which is accurate as this model is useless for
all purposes.
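A minimal scikit-learn sketch computing these metrics on hypothetical attack/benign labels:

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Hypothetical labels: 1 = attack, 0 = benign.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))                  # [[TN FP] [FN TP]]
print("Precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))            # harmonic mean of the two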
F1-Score
• F1 score gives the same importance to both Recall and Precision.
• If we want to give more weight to one of them, an Fβ score can be calculated by
attaching a weight to either Recall or Precision depending on how important it is.
• In the equation below, β is the weight (β > 1 favours Recall, β < 1 favours Precision):
Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
Feature Pipeline
• Machine Learning process combines a series of transformers on raw data,
transforming the dataset each step of the way until it is passed to the fit
method of a final estimator.
• If we don’t vectorize our documents in the same exact manner, we will end
up with wrong or, at the very least, unintelligible results.
• Pipeline objects enable us to integrate a series of transformers that combine
normalization, vectorization, and feature analysis into a single, well-defined
mechanism.
• Pipeline objects move data from a loader into feature extraction mechanisms
to finally an estimator object that implements our predictive models.
• Pipelines are Directed Acyclic Graphs (DAGs) that can be simple linear chains
of transformers to arbitrarily complex branching and joining paths.
• Ref: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html
Feature Pipeline
• Feature Pipeline chains together multiple estimators representing a fixed
sequence of steps into a single unit.
• All estimators in the pipeline, except the last one, must be transformers—that
is, implement the transform method, while the last estimator can be of any
type, including predictive estimators.
• Pipelines provide convenience - fit and transform can be called for single inputs
across multiple objects at once.
• Pipelines provide a single interface for grid search of multiple estimators at
once.
• Pipelines provide operationalization of text models by coupling a vectorization
methodology with a predictive model.
• Pipelines are constructed by describing a list of (key, value) pairs where
the key is a string that names the step and the value is the estimator object.
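A minimal scikit-learn Pipeline sketch in that (key, value) style (the toy messages and labels are made up):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# (key, value) pairs: the key names the step, the value is the estimator.
model = Pipeline([
    ("vectorize", TfidfVectorizer()),      # transformer
    ("classify",  LogisticRegression()),   # final estimator
])

# Hypothetical toy data: messages labelled 1 = malicious, 0 = benign.
docs   = ["click this link now", "meeting at noon",
          "reset your password here", "lunch tomorrow"]
labels = [1, 0, 1, 0]

model.fit(docs, labels)
print(model.predict(["please reset your password"]))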
Enriching Feature Extraction with Feature Unions
• Pipelines do not have to be simple linear sequences of steps; in fact, they can
be arbitrarily complex through the implementation of feature unions.
• Pipelines fit and transform data in sequence through each transformer.
• A Feature Union object combines several transformer objects into a new, single
transformer, similar to the Pipeline object.
• Feature Union transformers are evaluated independently and their results are
concatenated into a composite vector.
Enriching Feature Extraction with Feature Unions
• Consider the example of an HTML parser transformer that uses BeautifulSoup or an XML
library to parse the HTML and return the body of each document.
• We then perform a feature engineering step, where entities and keyphrases are each
extracted from the documents and the results passed into the feature union.
• Using frequency encoding on the entities is more sensible since they are relatively small,
but TF–IDF makes more sense for the keyphrases.
• Feature Union then concatenates the two resulting vectors such that our decision space
ahead of the logistic regression separates word dimensions in the title from word
dimensions in the body.
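A minimal scikit-learn sketch of a FeatureUnion; two generic vectorizers stand in for the entity and keyphrase extractors described above (toy data):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# The two transformers run independently; their outputs are concatenated
# into one composite feature vector.
features = FeatureUnion([
    ("counts", CountVectorizer()),   # frequency-style encoding
    ("tfidf",  TfidfVectorizer()),   # TF-IDF encoding
])

model = Pipeline([("features", features),
                  ("classify", LogisticRegression())])

docs   = ["invoice attached open now", "weekly status report",
          "your account is locked click here", "project plan for review"]
labels = [1, 0, 1, 0]
model.fit(docs, labels)
print(model.predict(["open the attached invoice now"]))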
ML Algorithms
Data Nomenclature
           x1             x2             x3             x4           x5              x6               Y
           GDP            Per Capita     Human          Life         Poverty Index   Mean Household   Dev /
Country    (Trillion USD) GDP ('1000 USD) Development   Expectancy   (Gini as %age)  Income ('1000 USD) UnderDev
                                          Index
Canada     1.577          39.17          0.908          80.7         32.6            67.293           D
China      5.878          7.54           0.687          73           46.9            10.22            U
India      1.632          3.41           0.547          64.7         36.8            0.735            U
Russia     1.48           19.84          0.755          65.5         39.9            0.72             U
Singapore  0.223          56.69          0.866          80           42.5            67.1             D
USA        14.527         46.86          0.91           78.3         40.8            84.3             D
…          …              …              …              …            …               …                …
Each row is an Instance; x1–x6 are the Attributes and Y is the Output.
[Ref: en.wikipedia.org]
Example Problem
• Given input x, compute output y.
• Example: Compute the price of a house from attributes of the house:
• Size of house
• Number of bedrooms
• Construction age
• Locality
• Segment – affordable, premium, luxury
• …
• A straight-line model: y = w0 + w1·x (w0 is the intercept, w1 the slope).
• A line through the origin: y = w·x.
• In general, y = f(x), or y = h(x), called the Hypothesis.
[Figure: straight-line fits y = w0 + w1·x (red line) and y = w·x (violet line) on the x–y plane]
Example Problem
• Generalized linear regression model is a linear combination of the
input variables and parameters:
h(x) = w0 + w1x1 + w2x2 + … + wnxn
• Key components of this model are:
• parameters w0, w1, …, wn (called Weights or Parameters)
• input variables x1, x2, x3, …, xn (called Attributes or Features)
• The equation can be simplified in vector form:
h(x) = w0x0 + w1x1 + w2x2 + … + wnxn, where x0 = 1
If W = [w0 w1 w2 … wn] and X = [x0 x1 x2 … xn],
then h(x) = Wᵀ · X
This is an Example of Linear Regression algorithm
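A minimal NumPy sketch of the hypothesis h(x) = Wᵀ·X (the weights and instance values are made up):

import numpy as np

# Hypothetical weights and one instance, with x0 = 1 prepended for the intercept.
W = np.array([50.0, 0.8, 10.0])     # w0 (intercept), w1, w2
X = np.array([1.0, 1200.0, 3.0])    # x0 = 1, area, number of bedrooms

h = np.dot(W, X)                    # h(x) = W^T . X
print(h)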
Linear Regression: Normal Regression
[Figure 4-2: Linear Regression model predictions]
Gradient Descent
• A generic optimization algorithm capable of finding optimal solutions to a problem.
• Gradient Descent tweaks parameters iteratively in order to minimize the error or cost
function, using the update rule θ = θ − η · ∇θ J(θ), where η is the learning rate.
• It measures the local gradient of the cost function with regard to the parameter vector θ
and goes in the direction of descending gradient; once the gradient is zero, a minimum
has been reached.
• Concretely, start by filling θ with random values (random initialization), then improve it
gradually, taking one small step at a time, each step attempting to decrease the cost
function (e.g., the MSE), until the algorithm converges to a minimum.
• The Learning Rate hyperparameter determines the size of the steps:
– If the learning rate is too small, the algorithm has to go through many iterations to
converge, which takes a long time.
– If the learning rate is too high, it might jump across the valley and end up on the other
side, possibly even higher than before; the algorithm may diverge and fail to find a
good solution.
• Not all cost functions look like nice regular bowls; there may be holes, ridges, plateaus
and all sorts of irregular terrain, making convergence to the minimum difficult.
[Figures 4-3 to 4-5: Gradient Descent; learning rate too small; learning rate too large]
Gradient Descent: Example
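A minimal NumPy sketch of batch gradient descent for linear regression (toy data generated around y = 4 + 3x; the learning rate and iteration count are illustrative):

import numpy as np

# Toy data roughly following y = 4 + 3x, with noise.
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]     # prepend x0 = 1 for the intercept term
eta = 0.1                             # learning rate
theta = rng.random((2, 1))            # random initialization

for _ in range(1000):
    gradients = 2 / 100 * X_b.T @ (X_b @ theta - y)    # gradient of the MSE
    theta = theta - eta * gradients                    # step in the descending direction

print(theta.ravel())                  # should approach [4, 3]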
Naïve Bayes Algorithm
• Naïve Bayes classifier is a supervised machine learning algorithm used for
classification tasks such as text classification.
• Belongs to generative learning algorithm family - it models the distribution of
inputs for a given class or category.
• Assumes that the features of the input data are conditionally independent given
the class.
• Naive Bayes classifiers are among the simplest Bayesian network models with
high accuracy levels.
• Naïve Bayes model is easy to build and particularly useful for very large data
sets.
• Naive Bayes can outperform highly sophisticated classification methods.
Naïve Bayes Algorithm
• Example:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/9 = 0.33
P(Sunny) = 5/14 = 0.36
P(Yes) = 9/14 = 0.64
P(Yes|Sunny) = 0.33 * 0.64 / 0.36 = 0.59
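A minimal scikit-learn sketch in the spirit of the example above (a small made-up Outlook/Play table, not the full 14-row dataset):

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Simplified weather/play data for illustration only.
outlook = [["Sunny"], ["Sunny"], ["Overcast"], ["Rain"], ["Rain"],
           ["Overcast"], ["Sunny"], ["Rain"], ["Overcast"], ["Sunny"]]
play    = ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No"]

enc = OrdinalEncoder()
X = enc.fit_transform(outlook)            # categories encoded as integer codes

clf = CategoricalNB().fit(X, play)
print(clf.classes_)                                     # ['No' 'Yes']
print(clf.predict_proba(enc.transform([["Sunny"]])))    # P(No|Sunny), P(Yes|Sunny)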
Decision Tree
• Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for
classification.
• A tree-structured classifier, where
• Internal nodes represent the features of a dataset,
• Branches represent the decision rules
• Each leaf node represents the outcome.
• A Decision tree has two type of nodes: Decision Node and Leaf Node.
• Decision nodes are used to make any decision and have multiple branches
• Leaf nodes are the output of those decisions and do not contain any further
branches.
• The decisions or the test are performed on the basis of features of the given
dataset.
• A decision tree simply asks a question and, based on the answer (Yes/No), further
splits into subtrees.
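A minimal scikit-learn sketch of a decision tree classifier (using the built-in Iris dataset for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth limits how far the tree can split, which also helps against overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(tree.score(X_test, y_test))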
Random Forests
• Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both Classification and
Regression problems in ML.
• It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance
of the model.
• As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset."
• Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of predictions, predicts
the final output.
• A greater number of trees in the forest generally leads to higher accuracy and helps
prevent overfitting.
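A minimal scikit-learn sketch of a random forest (again on the built-in Iris dataset for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of decision trees; the forest votes on the final class.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(forest.score(X_test, y_test))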
K-Means Clustering
• K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabelled dataset into different clusters.
• K defines the number of pre-defined clusters that need to be created in the
process.
• Allows clustering the data into different groups, discovering the categories of
groups in the unlabelled dataset on its own without the need for any training.
• A centroid-based algorithm, where each cluster is associated with a centroid.
• Main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.
• K-Means clustering algorithm mainly performs two tasks:
• Determines the best value for K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
• Each cluster contains data points with some commonalities and is far away from the
other clusters.
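A minimal scikit-learn sketch of K-Means on synthetic unlabelled data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)     # the learned centroids
print(kmeans.labels_[:10])         # cluster assignment of the first data points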
Support Vector Machine (SVM)
• SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification (primarily) as well as Regression problems.
• SVM algorithm creates the best line or decision boundary (hyperplane) that can
segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future.
• SVM chooses the extreme points/vectors called support vectors, to create the
hyperplane.
• Linear SVM: Used for linearly separable data; if a dataset can be classified into two
classes using a single straight line, it is termed linearly separable data, and the
classifier used is called a Linear SVM classifier.
• Non-linear SVM: Used for non-linearly separable data; if a dataset cannot be classified
using a straight line, it is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
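A minimal scikit-learn sketch of linear and non-linear SVM classifiers on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear_svm    = SVC(kernel="linear").fit(X_train, y_train)   # linearly separable case
nonlinear_svm = SVC(kernel="rbf").fit(X_train, y_train)      # non-linear case
print(linear_svm.score(X_test, y_test), nonlinear_svm.score(X_test, y_test))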
Genetic Algorithm
• Genetic algorithm is an adaptive heuristic search algorithm inspired by
"Darwin's theory of evolution in Nature."
• Used to solve complex and time-consuming optimization problems in machine
learning.
• Genetic Algorithms are used in real-world applications, for example, Designing
electronic circuits, code-breaking, image processing, and artificial creativity.
• Genetic algorithm works on the evolutionary generational cycle to generate
high-quality solutions.
• Uses different operations that either enhance or replace the population to give
an improved fit solution.
• Five phases to solve the complex optimization problems:
• Initialization
• Fitness Assignment
• Selection
• Reproduction
• Termination
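A minimal sketch of the five phases on a toy "one-max" problem (all constants are illustrative):

import random

# Toy problem: evolve a bit-string with as many 1s as possible.
GENES, POP, GENERATIONS, MUTATION = 20, 30, 50, 0.02

def fitness(ind):                        # Fitness Assignment
    return sum(ind)

def select(pop):                         # Selection (tournament of 3)
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):                   # Reproduction: single-point crossover
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(ind):                         # Reproduction: mutation
    return [1 - g if random.random() < MUTATION else g for g in ind]

# Initialization: random population
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]

for _ in range(GENERATIONS):             # Termination after a fixed number of generations
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(best, fitness(best))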
Artificial Neural Network (ANN)
• "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain.
• ANN is usually a computational network based on biological neural networks
that construct the structure of the human brain.
• Like human brain, ANN has neurons that are linked to each other in various
layers of the networks known as nodes.
• An Artificial Neural Network attempts to mimic the network of neurons makes
up a human brain so that computers will have an option to understand things
and make decisions in a human-like manner.
Artificial Neural Network
Biological Neural Network     Artificial Neural Network
Dendrites                     Inputs
Cell nucleus                  Nodes
Synapse                       Weights
Axon                          Output
Artificial Neural Network
• There are on the order of 100 billion neurons in the human brain.
• Each neuron is connected to somewhere between 1,000 and 100,000 others.
• In the human brain, data is stored in a distributed manner, and we can extract more
than one piece of this data from memory in parallel when necessary.
• We can say that the human brain is a massively parallel processor.
• We can understand the artificial neural network with an example:
• Consider a digital logic gate that takes inputs and gives an output.
• An "OR" gate takes two inputs: if one or both inputs are "On," the output is "On";
if both inputs are "Off," the output is "Off."
• In the brain, by contrast, the output-to-input relationship keeps changing because the
neurons are "learning."
Artificial Neural Network
Input Layer:
• Accepts inputs in several different formats provided by the programmer.
Hidden Layer:
• Sits between the input and output layers.
• Performs all the calculations to find hidden features and patterns.
Output Layer:
• The input goes through a series of transformations using the hidden layer(s), and the
final result is conveyed by the output layer.
• The Artificial Neural Network takes the inputs, computes the weighted sum of the
inputs, and includes a bias.
• This computation is represented in the form of a transfer function.
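A minimal NumPy sketch of a single artificial neuron: weighted sum of inputs plus bias, passed through a transfer (activation) function (the inputs and weights are made up):

import numpy as np

def neuron(x, w, b):
    # Weighted sum of inputs plus bias, passed through a sigmoid transfer function.
    z = np.dot(w, x) + b
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, 0.8, 0.1])     # inputs from the input layer
w = np.array([0.4, -0.6, 0.9])    # learned weights
b = 0.1                           # bias
print(neuron(x, w, b))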
Other Algorithms
• Logistic Regression
• Hierarchical Networks
Thank You