MLT Unit 4 Notes

The document discusses probabilistic methods for learning in machine learning, focusing on Naïve Bayes algorithms, maximum likelihood estimation, and the Apriori algorithm. It explains the principles behind Naïve Bayes, including its assumptions and types, as well as its advantages and disadvantages in various applications. Additionally, it covers Bayesian belief networks and their role in representing probabilistic relationships among variables.

PROBABILISTIC METHODS FOR LEARNING

INTRODUCTION
 Machine learning algorithms today rely heavily on probabilistic models,
which take into consideration the uncertainty inherent in real-world data.
These models make predictions based on probability distributions,
rather than absolute values, allowing for a more nuanced and accurate
understanding of complex systems.

NAÏVE BAYES ALGORITHMS


 The Naïve Bayes algorithm is used for classification problems. It is widely used in text
classification, where the data is high dimensional (each word represents one feature).
It is used in spam filtering, sentiment detection, rating classification etc. The advantage
of using Naïve Bayes is its speed: it is fast, and making predictions is easy even with
high-dimensional data.
 Why is it called Naïve Bayes?
The “Naive” part of the name indicates the simplifying assumption made by the
Naïve Bayes classifier. The classifier assumes that the features used to describe
an observation are conditionally independent, given the class label. The
“Bayes” part of the name refers to Reverend Thomas Bayes, an 18th-century
statistician and theologian who formulated Bayes’ theorem.

Assumption of Naive Bayes


The fundamental Naive Bayes assumption is that each feature makes an independent
and equal contribution to the outcome:

 Feature independence: The features of the data are conditionally independent
of each other, given the class label.
 Continuous features are normally distributed: If a feature is continuous, then
it is assumed to be normally distributed within each class.
 Discrete features have multinomial distributions: If a feature is discrete, then
it is assumed to have a multinomial distribution within each class.
 Features are equally important: All features are assumed to contribute equally
to the prediction of the class label.
 No missing data: The data should not contain any missing values.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability
of another event that has already occurred.
Bayes’ theorem is stated mathematically as the following equation:

P(A∣B) = P(B∣A)P(A)/P(B)

where A and B are events and P(B) ≠ 0


P(A) is the prior probability of A.
P(B) is the marginal probability, i.e. the probability of the evidence.
P(A|B) is the posterior probability of A given B.
P(B|A) is the likelihood, the probability of B given A.
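As a quick worked example, the theorem can be applied directly in a few lines of
Python. The spam-filtering numbers below are illustrative assumptions, not values
taken from these notes.

# Worked Bayes' theorem example with assumed numbers:
# probability that an email is spam given that it contains the word "offer".
p_spam = 0.30              # P(A): prior probability of spam
p_word_given_spam = 0.60   # P(B|A): likelihood of "offer" appearing in spam
p_word = 0.25              # P(B): marginal probability of "offer" in any email

p_spam_given_word = p_word_given_spam * p_spam / p_word   # P(A|B) by Bayes' theorem
print(round(p_spam_given_word, 2))                        # 0.72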

Types of Naive Bayes Model

There are three types of Naive Bayes Model:

Gaussian Naive Bayes classifier


In Gaussian Naive Bayes, continuous values associated with each feature are
assumed to be distributed according to a Gaussian distribution. A Gaussian
distribution is also called a Normal distribution. When plotted, it gives a bell-shaped
curve which is symmetric about the mean of the feature values.

Multinomial Naive Bayes

Feature vectors represent the frequencies with which certain events have been
generated by a multinomial distribution. This is the event model typically used for
document classification.

Bernoulli Naive Bayes

In the multivariate Bernoulli event model, features are independent booleans (binary
variables) describing inputs. Like the multinomial model, this model is popular for
document classification tasks, where binary term-occurrence features (i.e. whether a
word occurs in a document or not) are used rather than term frequencies (i.e. the
frequency of a word in the document).
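As a minimal sketch of these three variants, scikit-learn provides ready-made
implementations; the tiny arrays below are assumed toy data for illustration only.

# Gaussian, Multinomial and Bernoulli Naive Bayes on toy data
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])                                    # class labels

# Gaussian NB: continuous features
X_cont = np.array([[1.2, 3.1], [0.9, 2.8], [3.5, 0.4], [3.9, 0.2]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# Multinomial NB: count features (e.g. word frequencies)
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 1, 0]]))

# Bernoulli NB: binary term-occurrence features
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 1, 0]]))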

Advantages of Naive Bayes


1. Easy to implement and computationally efficient.
2. Effective in cases with a large number of features.
3. Performs well even with limited training data.
4. It performs well in the presence of categorical features.

Disadvantages of Naive Bayes


1. Assumes that features are independent, which may not always hold in real-
world data.
2. Can be influenced by irrelevant attributes.
3. May assign zero probability to unseen events, leading to poor generalization.

Applications of Naive Bayes Classifier


 Spam Email Filtering: Classifies emails as spam or non-spam based on features.
 Text Classification: Used in sentiment analysis, document categorization, and
topic classification.
 Medical Diagnosis: Helps in predicting the likelihood of a disease based on
symptoms.
 Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
 Weather Prediction: Classifies weather conditions based on various factors.
MAXIMUM LIKELIHOOD
 It is the process of estimating the parameters of a distribution that maximize
the likelihood of the observed data belonging to that distribution.
 Maximum likelihood estimation (MLE) is a very prominent frequentist technique.
Many conventional machine learning algorithms work with the principles of
MLE.
For example, the best-fit line in linear regression calculated using least squares
is identical to the result of MLE (assuming Gaussian noise).

The likelihood function

Before we move forward, we need to understand the likelihood function. The likelihood
function helps us find the best parameters for our distribution. It can be defined as:

L(θ | x_1, x_2, …, x_n) = f(x_1, x_2, …, x_n | θ)

where θ is the parameter to maximize, x_1, x_2, …, x_n are observations
for n random variables from a distribution and f is the joint density function of our
distribution with the parameter θ.

The pipe (“ | ”) is often replaced by a semicolon since θ isn’t a random
variable, but an unknown parameter. Of course, θ could also be a set of
parameters.

For example, in the case of a normal distribution, we would have
θ = (μ, σ), with μ and σ representing the two parameters of our distribution.

Intuition

Likelihood is often used interchangeably with probability, but they are not the same.
Likelihood is not a probability density function, meaning that integrating it over a
specific interval would not result in a “probability” over that interval. Rather, it tells
us how well a distribution with certain values of its parameters fits our data.

θ_MLE is the value that maximizes the likelihood of our data x.

Looking at it this way, we can say that likelihood measures how well the distribution
fits the given data for varying values of its parameters. So, if L(θ_1|x) is greater
than L(θ_2|x), the distribution with parameter value θ_1 fits our data better than
the one with parameter value θ_2.

An Example: To understand the math behind MLE, let’s try a simple example. We’ll
derive the value of the exponential distribution’s parameter corresponding to the
maximum likelihood value.

The Exponential Distribution


The exponential distribution is a continuous probability distribution used to measure
inter-event time.
It has a single parameter, called λ by convention. λ is called the rate.
Its mean and variance are 1/λ and 1/λ², respectively.
The probability density function of the exponential distribution is

f(x; λ) = λ e^(−λx), for x ≥ 0.

[PDF plots with variable λ]

There’s a single parameter λ. Let’s calculate its value, given n random points x_1 to x_n.
As discussed earlier, the likelihood for a given point x_i is f(x_i; λ) = λ e^(−λ x_i).

We calculate the likelihood for each of our n points.

The combined likelihood for all n points is just the product of their individual
likelihoods, since we are considering independent and identically distributed points.
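Maximizing that product (or, more conveniently, its logarithm) gives the closed-form
estimate λ̂ = n / Σ x_i, i.e. the reciprocal of the sample mean. The sketch below checks
this numerically; the simulated data and the true rate are assumptions for illustration.

# MLE for the exponential distribution's rate parameter λ
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1 / 2.0, size=1000)   # simulated sample with true λ = 2

lambda_hat = 1.0 / np.mean(x)    # closed-form MLE: λ̂ = n / Σ x_i
print(round(lambda_hat, 3))      # should be close to 2.0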
MAXIMUM APRIORI
The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are
connected. This algorithm uses a breadth-first search and a Hash Tree to calculate
the itemset associations efficiently. It is an iterative process for finding the frequent
itemsets in a large dataset.

What is a Frequent Itemset?

Frequent itemsets are those itemsets whose support is greater than the threshold value
or user-specified minimum support. It also means that if {A, B} is a frequent itemset,
then A and B must individually be frequent itemsets as well.

Steps for Apriori Algorithm


Below are the steps for the Apriori algorithm:

Step-1: Determine the support of the itemsets in the transactional database, and select
the minimum support and confidence.
Step-2: Keep all itemsets in the transactions whose support value is higher than the
minimum (selected) support value.
Step-3: Find all the rules of these subsets that have a higher confidence value than the
threshold or minimum confidence.
Step-4: Sort the rules in decreasing order of lift.
Apriori Algorithm Working
We will understand the Apriori algorithm using an example and mathematical
calculation:

Example: Suppose we have the following dataset that has various transactions, and
from this dataset we need to find the frequent itemsets and generate the association
rules using the Apriori algorithm:

Solution:
Step-1: Calculating C1 and L1:
In the first step, we will create a table that contains the support count (the frequency of
each itemset individually in the dataset) of each itemset in the given dataset. This
table is called the Candidate set or C1.
Now, we will take out all the itemsets that have a support count greater than the
Minimum Support (2). This will give us the table for the frequent itemset L1.
Since all the itemsets have a support count greater than or equal to the minimum support,
except E, the E itemset will be removed.

Step-2: Candidate Generation C2, and L2:


o In this step, we will generate C2 with the help of L1. In C2, we will create
pairs of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support count from the main
transaction table of the dataset, i.e., how many times these pairs have occurred
together in the given dataset. So, we will get the below table for C2:

Again, we need to compare the C2 support count with the minimum support count,
and after comparing, the itemsets with a lower support count will be eliminated from
table C2. This will give us the below table for L2.
Step-3: Candidate generation C3, and L3:
 For C3, we will repeat the same two processes, but now we will form the C3
table with subsets of three itemsets together, and will calculate the support
count from the dataset. It will give the below table:

 Now we will create the L3 table. As we can see from the above C3 table, there
is only one combination of itemsets that has a support count equal to the
minimum support count. So, L3 will have only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:

To generate the association rules, we first create a new table with the possible
rules from the frequent combination {A, B, C}. For each rule, we calculate the
confidence using the formula sup(A ^ B)/sup(A). After calculating the confidence value
for all rules, we will exclude the rules that have a confidence lower than the minimum
threshold (50%).

Consider the below table:

Rules         Support   Confidence

A ^ B → C     2         Sup{(A ^ B) ^ C}/sup(A ^ B) = 2/4 = 0.5 = 50%

B ^ C → A     2         Sup{(B ^ C) ^ A}/sup(B ^ C) = 2/4 = 0.5 = 50%

A ^ C → B     2         Sup{(A ^ C) ^ B}/sup(A ^ C) = 2/4 = 0.5 = 50%

C → A ^ B     2         Sup{C ^ (A ^ B)}/sup(C) = 2/5 = 0.4 = 40%

A → B ^ C     2         Sup{A ^ (B ^ C)}/sup(A) = 2/6 = 0.33 = 33.33%

B → A ^ C     2         Sup{B ^ (A ^ C)}/sup(B) = 2/7 = 0.2857 = 28.57%


As the given threshold or minimum confidence is 50%, the first three rules A ^ B → C,
B ^ C → A, and A ^ C → B can be considered strong association rules for
the given problem.
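The support and confidence arithmetic above can be reproduced with a short,
plain-Python sketch; the transaction list below is an illustrative assumption, not the
dataset used in the worked example.

# Support counts and rule confidence for a toy transaction database
from itertools import combinations

transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
    {"A", "B", "C"}, {"B"}, {"A"},
]

def support(itemset):
    # Support count: number of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions)

min_support = 2
items = sorted({i for t in transactions for i in t})

# Frequent 1-itemsets (L1) and the candidate pairs that survive pruning (L2)
L1 = [i for i in items if support({i}) >= min_support]
L2 = [set(p) for p in combinations(L1, 2) if support(set(p)) >= min_support]

# Confidence of the rule {A, B} -> {C}: sup(A ^ B ^ C) / sup(A ^ B)
confidence = support({"A", "B", "C"}) / support({"A", "B"})
print(L1, L2, round(confidence, 2))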

Advantages of Apriori Algorithm

 The algorithm is easy to understand.


 The join and prune steps of the algorithm can be easily implemented on large
datasets.

Disadvantages of Apriori Algorithm

 The Apriori algorithm is slow compared to other algorithms.


 The overall performance can be reduced because it scans the database multiple
times.
 The time complexity and space complexity of the Apriori algorithm is O(2^D),
which is very high. Here D represents the horizontal width (number of distinct
items) present in the database.

BAYESIAN BELIEF NETWORKS


A Bayesian Belief Network is a graphical representation of the probabilistic
relationships among the random variables in a particular set. In such a network, each
variable is conditionally independent of its non-descendants given its parents.
Due to its use of joint probability, the probability in a Bayesian Belief Network is
derived based on a condition, P(attribute | parent),
i.e. the probability of an attribute given its parent attribute(s).

Consider this example:


 In this example, we have an alarm ‘A’ – a node, say installed in the house of a
person ‘gfg’, which rings upon two events, burglary ‘B’ and fire ‘F’,
which are the parent nodes of the alarm node. The alarm is the parent node of
two person nodes, ‘P1’ calls (‘P1’) and ‘P2’ calls (‘P2’).
 Upon an instance of burglary or fire, ‘P1’ and ‘P2’ call the person ‘gfg’,
respectively. But there are a few drawbacks in this case: sometimes ‘P1’ may
forget to call the person ‘gfg’, even after hearing the alarm, as he has a tendency
to forget things quickly. Similarly, ‘P2’ sometimes fails to call the person ‘gfg’, as
he is only able to hear the alarm from a certain distance.

Q) Find the probability that ‘P1’ is true (P1 has called ‘gfg’) and ‘P2’ is true (P2 has called
‘gfg’) when the alarm ‘A’ rang, but no burglary ‘B’ and no fire ‘F’ has occurred.

=> P(P1, P2, A, ~B, ~F) [where P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’
events]

[Note: The values mentioned below are not calculated or computed; they are
observed values.]

Burglary ‘B’ –

P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)


P (B=F) = 0.999 (‘B’ is false i.e burglary has not occurred)
Fire ‘F’ –

P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)


P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)

Alarm ‘A’ –
B F P (A=T) P (A=F)

T T 0.95 0.05

T F 0.94 0.06

F T 0.29 0.71

F F 0.001 0.999
The alarm ‘A’ node can be ‘true’ or ‘false’. It has two parent nodes, burglary ‘B’ and fire
‘F’, which can be ‘true’ or ‘false’ depending upon different conditions.

Person ‘P1’ –
A P (P1=T) P (P1=F)

T 0.95 0.05

F 0.05 0.95

The person ‘P1’ node can be ‘true’ or ‘false’. It has a parent node, the alarm ‘A’, which
can be ‘true’ or ‘false’.

Person ‘P2’ –
A P (P2=T) P (P2=F)

T 0.80 0.20

F 0.01 0.99

The person ‘P2’ node can be ‘true’ or ‘false’ (i.e. may or may not have called the person
‘gfg’). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e. may or may
not have rung, upon burglary ‘B’ or fire ‘F’).

Solution: Using the observed probabilities above:

With respect to the question, P(P1, P2, A, ~B, ~F), we need to get the probability
of ‘P1’. We find it with regard to its parent node, the alarm ‘A’. To get the probability of
‘P2’, we also find it with regard to its parent node, the alarm ‘A’.

We find the probability of the alarm ‘A’ node with regard to ‘~B’ & ‘~F’, since burglary ‘B’
and fire ‘F’ are the parent nodes of alarm ‘A’.
From the observed probabilities, we can deduce:

P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)


= 0.95 * 0.80 * 0.001 * 0.999 * 0.998

≈ 0.00076
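The same chain-rule product can be written as a short Python sketch using the CPT
values from the tables above (the variable names are mine, not from the notes).

# Joint probability P(P1, P2, A, ~B, ~F) from the Bayesian belief network
p_not_b = 0.999                 # P(~B)
p_not_f = 0.998                 # P(~F)
p_a_given_not_b_not_f = 0.001   # P(A | ~B, ~F)
p_p1_given_a = 0.95             # P(P1 | A)
p_p2_given_a = 0.80             # P(P2 | A)

joint = (p_p1_given_a * p_p2_given_a *
         p_a_given_not_b_not_f * p_not_b * p_not_f)
print(round(joint, 5))          # ≈ 0.00076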

PROBABILISTIC MODELING OF PROBLEMS

What are Probabilistic Models?


Probabilistic models are an essential component of machine learning, which aims to
learn patterns from data and make predictions on new, unseen data. They are
statistical models that capture the inherent uncertainty in data and incorporate it into
their predictions. Probabilistic models are used in various applications such as image
and speech recognition, natural language processing, and recommendation systems.
In recent years, significant progress has been made in developing probabilistic models
that can handle large datasets efficiently.

Categories Of Probabilistic Models


These models can be classified into the following categories:

1. Generative models
2. Discriminative models
3. Graphical models

Generative models:
Generative models aim to model the joint distribution of the input and output
variables. These models generate new data based on the probability distribution of
the original dataset. Generative models are powerful because they can generate new
data that resembles the training data. They can be used for tasks such as image and
speech synthesis, language translation, and text generation.

Discriminative models:
Discriminative models aim to model the conditional distribution of the output
variable given the input variable. They learn a decision boundary that separates the
different classes of the output variable. Discriminative models are useful when the
focus is on making accurate predictions rather than generating new data. They can be
used for tasks such as image recognition, speech recognition, and sentiment analysis.

Graphical models:
These models use graphical representations to show the conditional dependence
between variables. They are commonly used for tasks such as image recognition,
natural language processing, and causal inference.

Naive Bayes Algorithm in Probabilistic Models

The Naive Bayes algorithm is a widely used approach in probabilistic models,
demonstrating remarkable efficiency and effectiveness in solving classification
problems. By leveraging the power of the Bayes theorem and making simplifying
assumptions about feature independence, the algorithm calculates the probability of
the target class given the feature set. This method has found diverse applications
across various industries, ranging from spam filtering to medical diagnosis. Despite its
simplicity, the Naive Bayes algorithm has proven to be highly robust, providing rapid
results in a multitude of real-world problems.

Naive Bayes is a probabilistic algorithm that is used for classification problems. It is
based on the Bayes theorem of probability and assumes that the features are
conditionally independent of each other given the class. The Naive Bayes algorithm is
used to calculate the probability of a given sample belonging to a particular class. This
is done by calculating the posterior probability of each class given the sample and
then selecting the class with the highest posterior probability as the predicted class.

The algorithm works as follows:


1. Collect a labeled dataset of samples, where each sample has a set of features
and a class label.
2. For each feature in the dataset, calculate the conditional probability of the
feature given the class.
3. This is done by counting the number of times the feature occurs in samples of
the class and dividing by the total number of samples in the class.
4. Calculate the prior probability of each class by counting the number of samples
in each class and dividing by the total number of samples in the dataset.
5. Given a new sample with a set of features, calculate the posterior probability
of each class using the Bayes theorem with the conditional probabilities and
prior probabilities calculated in the previous steps.
6. Select the class with the highest posterior probability as the predicted class for
the new sample.
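The steps above can be sketched from scratch for categorical features. The tiny
weather-style dataset is an assumed toy example, and no smoothing is applied, so the
zero-probability issue mentioned earlier can occur.

# From-scratch Naive Bayes for categorical features (toy data, no smoothing)
from collections import Counter, defaultdict

data = [  # (features, class label)
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "no"},  "play"),
]

# Step 4: prior probabilities P(class)
class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Steps 2-3: conditional probabilities P(feature = value | class) by counting
cond_counts = defaultdict(Counter)
for features, label in data:
    for name, value in features.items():
        cond_counts[(label, name)][value] += 1

def likelihood(label, name, value):
    return cond_counts[(label, name)][value] / class_counts[label]

# Steps 5-6: posterior score (numerator of Bayes' theorem) and argmax
sample = {"outlook": "sunny", "windy": "yes"}
scores = {
    c: priors[c] * likelihood(c, "outlook", sample["outlook"])
                 * likelihood(c, "windy", sample["windy"])
    for c in priors
}
print(max(scores, key=scores.get), scores)
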
Advantages Of Probabilistic Models

 Probabilistic models are an increasingly popular method in many fields,
including artificial intelligence, finance, and healthcare.
 The main advantage of these models is their ability to take into account
uncertainty and variability in data. This allows for more accurate
predictions and decision-making, particularly in complex and
unpredictable situations.
 Probabilistic models can also provide insights into how different factors
influence outcomes and can help identify patterns and relationships
within data.

Disadvantages Of Probabilistic Models

There are also some disadvantages to using probabilistic models.

 One of the disadvantages is the potential for overfitting, where the
model is too specific to the training data and doesn’t perform well on
new data.
 Not all data fits well into a probabilistic framework, which can limit the
usefulness of these models in certain applications.
 Another challenge is that probabilistic models can be computationally
intensive and require significant resources to develop and implement.

PROBABILITY DENSITY ESTIMATION


Probability Density: Assume a random variable x that has a probability distribution
p(x). The relationship between the outcomes of a random variable and its probability
is referred to as the probability density.

The problem is that we don’t always know the full probability distribution for a
random variable, because we usually have access to only a small subset of
observations. This problem is referred to as Probability Density Estimation: we use
only a random sample of observations to find the general density of the whole sample
space.
Probability Density Function (PDF)
A PDF is a function that tells the probability of the random variable from a sub-sample
space falling within a particular range of values and not just one value. It tells the
likelihood of the range of values in the random variable sub-space being the same as
that of the whole sample.

By definition, if X is any continuous random variable, then the function f(x) is called a
probability density function if:

P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b

where,
a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function
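As a small numeric check of this definition, the sketch below (assuming scipy is
available) integrates the standard normal PDF from a to b and compares the result
with the difference of CDF values.

# P(a <= X <= b) for a standard normal X, two equivalent ways
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)                   # numerically integrate the PDF
print(round(area, 4), round(norm.cdf(b) - norm.cdf(a), 4))   # both ≈ 0.6827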

Steps Involved:

Step 1 - Create a histogram for the random set of observations to understand the
density of the random sample.

Step 2 - Create the probability density function and fit it on the random sample.
Observe how it fits the histogram plot.

Step 3 - Now iterate steps 1 and 2 in the following manner:


3.1 - Calculate the distribution parameters.
3.2 - Calculate the PDF for the random sample distribution.
3.3 - Observe the resulting PDF against the data.
3.4 - Transform the data until it best fits the distribution.

Density Estimation
It is the process of finding out the density of the whole population by examining a
random sample of data from that population. One of the best ways to achieve a
density estimate is by using a histogram plot.

Parametric Density Estimation
A normal distribution has two parameters: the mean and the standard deviation. We
calculate the sample mean and standard deviation of the random sample taken from
this population to estimate the density of the random sample. It is termed
'parametric' because the relation between the observations and their probability can
differ based on the values of these two parameters.
Now, it is important to understand that the mean and standard deviation of this
random sample are not going to be the same as those of the whole population, due to
its small size. A sketch of parametric density estimation is shown below.
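A minimal parametric sketch, assuming numpy/scipy and a toy normal sample:
estimate the mean and standard deviation from the sample, then use the fitted
normal PDF as the density estimate.

# Parametric density estimation: fit a normal distribution to a sample
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)   # random sample from the population

mu_hat = sample.mean()              # estimated mean
sigma_hat = sample.std(ddof=1)      # estimated standard deviation

x = np.linspace(sample.min(), sample.max(), 100)
pdf = norm.pdf(x, loc=mu_hat, scale=sigma_hat)      # fitted (parametric) density
print(round(mu_hat, 2), round(sigma_hat, 2))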

Nonparametric Density Estimation

In some cases, the PDF may not fit the random sample, as it doesn’t follow a normal
distribution (i.e. instead of one peak there are multiple peaks in the graph). Here,
instead of using distribution parameters like the mean and standard deviation, a
particular algorithm is used to estimate the probability distribution. Thus, it is known
as 'nonparametric density estimation'.

One of the most common nonparametric approaches is Kernel Density
Estimation (KDE). In this, the objective is to calculate the unknown density f̂(x) using the
equation given below:

f̂(x) = (1 / (n·h)) Σ K((x − xᵢ) / h), summing over i = 1, …, n

where,
K -> kernel (a non-negative function)
h -> bandwidth (smoothing parameter, h > 0)
Kh -> scaled kernel
f̂(x) -> density (to calculate)
n -> no. of samples in the random sample
A sample plot for nonparametric density estimation is given below.

PDF plot over sample histogram plot based on KDE
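A minimal KDE sketch using scipy's gaussian_kde; the bimodal toy sample is an
assumption, chosen so that a single normal PDF would fit it poorly.

# Kernel Density Estimation over a bimodal sample
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])  # two peaks

kde = gaussian_kde(sample)           # bandwidth h chosen automatically
x = np.linspace(sample.min(), sample.max(), 200)
density = kde(x)                     # estimated density f̂(x) on a grid
print(round(float(density.max()), 3))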

SEQUENCE MODELS
A sequence-to-sequence (seq2seq) model is a machine learning architecture designed
for tasks involving sequential data. It takes an input sequence, processes it, and
generates an output sequence. The architecture consists of two fundamental
components: an encoder and a decoder.
The advancement in neural network architectures led to the development of a more
capable seq2seq model named the transformer. “Attention is all you need!” was the
research paper that first introduced the transformer model in the era of Deep
Learning, after which language-related models have taken a huge leap. The main idea
behind the transformer model was that of attention layers and different encoder and
decoder stacks, which were highly efficient at performing language-related tasks.

What are the Encoder and Decoder in a Seq2Seq model?


In the seq2seq model, the Encoder and Decoder architectures play a vital role in
converting input sequences into output sequences. Let’s explore each block:

Encoder and Decoder Stack in seq2seq model


Encoder Block
The main purpose of the encoder block is to process the input sequence and capture
its information in a fixed-size context vector.

Architecture:
The input sequence is put into the encoder.
The encoder processes each element of the input sequence using neural networks (or
a transformer architecture).
Throughout this process, the encoder keeps an internal state, and the ultimate hidden
state functions as the context vector that encapsulates a compressed representation
of the entire input sequence. This context vector captures the semantic meaning and
important information of the input sequence.
The final hidden state of the encoder is then passed as the context vector to the
decoder.

Decoder Block
The decoder block is similar to the encoder block. The decoder processes the context
vector from the encoder to generate the output sequence incrementally.

Architecture:
In the training phase, the decoder receives both the context vector and the desired
target output sequence (ground truth).
During inference, the decoder relies on its own previously generated outputs as inputs
for subsequent steps.
The decoder uses the context vector to comprehend the input sequence and create
the corresponding output sequence. It engages in autoregressive generation,
producing individual elements sequentially. At each time step, the decoder uses the
current hidden state, the context vector, and the previous output token to generate a
probability distribution over the possible next tokens. The token with the highest
probability is then chosen as the output, and the process continues until the end of
the output sequence is reached.
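A minimal sketch of this encoder-decoder idea, written in PyTorch (assumed
framework). The GRU layers, dimensions, token ids and greedy decoding loop are
illustrative choices, not an architecture taken from these notes or a specific paper.

# Minimal seq2seq: GRU encoder, GRU decoder, greedy autoregressive decoding
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) of token ids; final hidden state = context vector
        _, hidden = self.rnn(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden):
        # token: (batch, 1) previously generated token; hidden: context/state
        output, hidden = self.rnn(self.embed(token), hidden)
        return self.out(output), hidden          # logits over the next token

def greedy_decode(encoder, decoder, src, sos_id, eos_id, max_len=20):
    # Autoregressive generation: feed back the most probable token each step
    hidden = encoder(src)
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1)            # highest-probability next token
        outputs.append(token)
        if (token == eos_id).all():
            break
    return torch.cat(outputs, dim=1)

# Example usage with an assumed vocabulary size and a dummy source batch
enc, dec = Encoder(vocab_size=100), Decoder(vocab_size=100)
src = torch.randint(0, 100, (2, 7))              # two source sequences of length 7
print(greedy_decode(enc, dec, src, sos_id=1, eos_id=2).shape)
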

Advantages
 Flexibility: Seq2Seq models can handle a wide range of tasks such as machine
translation, text summarization, and image captioning, as well as variable-
length input and output sequences.
 Handling Sequential Data: Seq2Seq models are well-suited for tasks that involve
sequential data such as natural language, speech, and time series data.
 Handling Context: The encoder-decoder architecture of Seq2Seq models allows
the model to capture the context of the input sequence and use it to generate
the output sequence.
 Attention Mechanism: Using attention mechanisms allows the model to focus
on specific parts of the input sequence when generating the output, which can
improve performance for long input sequences.

Disadvantages
 Computationally Expensive: Seq2Seq models require significant computational
resources to train and can be difficult to optimize.
 Limited Interpretability: The internal workings of Seq2Seq models can be
difficult to interpret, which can make it challenging to understand why the
model is making certain decisions.
 Overfitting: Seq2Seq models can overfit the training data if they are not
properly regularized, which can lead to poor performance on new data.
 Handling Rare Words: Seq2Seq models can have difficulty handling rare words
that are not present in the training data.
 Handling Long Input Sequences: Seq2Seq models can have difficulty handling
input sequences that are very long, as the context vector may not be able to
capture all the information in the input sequence.

Applications
 Machine Translation: The classic real-world application of the seq2seq model.
 Text Summarization: The seq2seq model effectively understands the input text,
which makes it suitable for news and document summarization.
 Speech Recognition: Seq2Seq models, especially those with attention
mechanisms, excel at processing audio waveforms for ASR. They are able to
capture spoken language patterns effectively.
 Image Captioning: The seq2seq model integrates image features from CNNs with
textual generation capabilities for image captioning. It is capable of describing
images in a human-readable format.
MARKOV MODELS
A Markov model is a stochastic method for randomly changing systems that possess
the Markov property. This means that, at any given time, the next state depends only
on the current state and is independent of anything in the past.

There are two types of Markov models, namely:


1. Markov chains
2. Hidden Markov models

These two types of Markov model are used when the system being represented is
autonomous -- that is, when the system isn't influenced by an external agent. These
are as follows:

1. Markov chains:
These are the simplest type of Markov model and are used to represent systems
where all states are observable. Markov chains show all possible states, and
between states, they show the transition rate, which is the probability of
moving from one state to another per unit of time. Applications of this type of
model include prediction of market crashes, speech recognition and search
engine algorithms.
2. Hidden Markov models:
These are used to represent systems with some unobservable states. In
addition to showing states and transition rates, hidden Markov models also
represent observations and observation likelihoods for each state. Hidden
Markov models are used for a range of applications, including thermodynamics,
finance and pattern recognition.

Another two commonly applied types of Markov model are used when the
system being represented is controlled -- that is, when the system is influenced
by a decision-making agent. These are as follows:

1. Markov decision processes:


These are used to model decision-making in discrete, stochastic, sequential
environments. In these processes, an agent makes decisions based on reliable
information. These models are applied to problems in artificial intelligence (AI),
economics and behavioural sciences.

2. Partially observable Markov decision processes:


These are used in cases like Markov decision processes but with the assumption
that the agent doesn't always have reliable information. Applications of these
models include robotics, where it isn't always possible to know the location.
Another application is machine maintenance.

How is Markov analysis applied?


Markov analysis is a probabilistic technique that uses Markov models to predict
the future behaviour of some variable based on the current state. Markov
analysis is used in many domains, including the following:

Markov chains are used for several business applications, including predicting
customer brand switching for marketing, predicting how long people will
remain in their jobs for human resources, predicting time to failure of a
machine in manufacturing, and forecasting the future price of a stock in
finance.
Markov analysis is also used in natural language processing (NLP) and in
machine learning. For NLP, a Markov chain can be used to generate a sequence
of words that form a complete sentence, or a hidden Markov model can be
used for named-entity recognition and tagging parts of speech. For machine
learning, Markov decision processes are used to represent reward in
reinforcement learning.

How are Markov models represented?


The simplest Markov model is a Markov chain, which can be expressed in
equations, as a transition matrix or as a graph. A transition matrix is used to
indicate the probability of moving from each state to each other state.
Generally, the current states are listed in rows, and the next states are
represented as columns. Each cell then contains the probability of moving from
the current state to the next state. For any given row, all the cell values must
then add up to one.

A graph consists of circles, each of which represents a state, and directional
arrows to indicate possible transitions between states. The directional arrows
are labeled with the transition probability. The transition probabilities on the
directional arrows coming out of any given circle must add up to one.

Other Markov models are based on the chain representations but with added
information, such as observations and observation likelihoods.

Example:
Consider the toss of a coin. Two states are possible: heads and
tails. The transition from heads to heads or heads to tails is equally probable (0.5) and
is independent of all preceding coin tosses.

In the state-transition diagram, the circles represent the two possible states -- heads or
tails -- and the arrows show the possible states the system could transition to in the
next step. The number 0.5 represents the probability of that transition occurring.
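A minimal sketch of the transition matrix and a simulated chain for this coin example
(numpy assumed; the simulation loop is illustrative).

# Two-state Markov chain for a fair coin: rows sum to 1
import numpy as np

states = ["heads", "tails"]
P = np.array([[0.5, 0.5],      # from heads -> heads/tails
              [0.5, 0.5]])     # from tails -> heads/tails

rng = np.random.default_rng(0)
state = 0                       # start at "heads"
chain = [states[state]]
for _ in range(10):
    state = rng.choice(len(states), p=P[state])   # sample the next state
    chain.append(states[state])
print(chain)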

HIDDEN MARKOV MODELS


The Hidden Markov Model (HMM) is a statistical model that is used to describe the
probabilistic relationship between a sequence of observations and a sequence of
hidden states. It is often used in situations where the underlying system or process
that generates the observations is unknown or hidden, hence the name “Hidden
Markov Model.”

It is used to predict future observations or classify sequences, based on the underlying
hidden process that generates the data.

An HMM consists of two types of variables:

1. hidden states and


2. observations.

1. Hidden states
These are the underlying variables that generate the observed data, but they are not
directly observable.
2. Observations
These are the variables that are measured and observed.
The relationship between the hidden states and the observations is modeled using a
probability distribution. The Hidden Markov Model (HMM) describes the relationship
between the hidden states and the observations using two sets of probabilities: the
transition probabilities and the emission probabilities.

The transition probabilities describe the probability of transitioning from one hidden
state to another.
The emission probabilities describe the probability of observing an output given a
hidden state.

Hidden Markov Model Algorithm


Step 1: Define the state space and observation space
The state space is the set of all possible hidden states, and the observation space is
the set of all possible observations.

Step 2: Define the initial state distribution
This is the probability distribution over the initial state.

Step 3: Define the state transition probabilities
These are the probabilities of transitioning from one state to another. They form the
transition matrix, which describes the probability of moving from one state to another.

Step 4: Define the observation likelihoods


These are the probabilities of generating each observation from each state. They form
the emission matrix, which describes the probability of generating each observation
from each state.

Step 5: Train the model


The parameters of the state transition probabilities and the observation likelihoods
are estimated using the Baum-Welch algorithm, which relies on the forward-backward
algorithm. This is done by iteratively updating the parameters until convergence.

Step 6: Decode the most likely sequence of hidden states


Given the observed data, the Viterbi algorithm is used to compute the most likely
sequence of hidden states. This can be used to predict future observations, classify
sequences, or detect patterns in sequential data.

Step 7: Evaluate the model
The performance of the HMM can be evaluated using various metrics, such as
accuracy, precision, recall, or F1 score.
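As a minimal sketch of Step 6, Viterbi decoding can be written directly with numpy.
The states, observations and probability values below are assumed toy values, not
parameters from these notes.

# Viterbi decoding for a two-state HMM with three possible observations
import numpy as np

states = ["Rainy", "Sunny"]
obs_space = ["walk", "shop", "clean"]
start_p = np.array([0.6, 0.4])           # initial state distribution
trans_p = np.array([[0.7, 0.3],          # transition matrix
                    [0.4, 0.6]])
emit_p = np.array([[0.1, 0.4, 0.5],      # emission matrix
                   [0.6, 0.3, 0.1]])

def viterbi(obs):
    n_states, T = len(states), len(obs)
    delta = np.zeros((T, n_states))               # best path probabilities
    psi = np.zeros((T, n_states), dtype=int)      # back-pointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * trans_p[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores.max() * emit_p[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]            # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t, path[0]])
    return [states[i] for i in path]

print(viterbi([0, 1, 2]))   # observed sequence: walk, shop, clean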
