MLT Unit 4 Notes

The document discusses probabilistic methods for learning in machine learning, focusing on Naïve Bayes algorithms, maximum likelihood estimation, and the Apriori algorithm. It explains the principles behind Naïve Bayes, including its assumptions and types, as well as its advantages and disadvantages in various applications. Additionally, it covers Bayesian belief networks and their role in representing probabilistic relationships among variables.

PROBABILISTIC METHODS FOR LEARNING

INTRODUCTION
 Machine learning algorithms today rely heavily on probabilistic models,
which take into consideration the uncertainty inherent in real-world data.
These models make predictions based on probability distributions,
rather than absolute values, allowing for a more nuanced and accurate
understanding of complex systems.

NAÏVE BAYES ALGORITHMS


 The Naïve Bayes algorithm is used for classification problems. It is widely used in text
classification, where the data is high dimensional (each word represents one feature).
It is used in spam filtering, sentiment detection, rating classification etc. The advantage
of using Naïve Bayes is its speed: it is fast, and making predictions is easy even with
high-dimensional data.
 Why is it called Naïve Bayes?
The “Naive” part of the name indicates the simplifying assumption made by the
Naïve Bayes classifier. The classifier assumes that the features used to describe
an observation are conditionally independent, given the class label. The
“Bayes” part of the name refers to Reverend Thomas Bayes, an 18th-century
statistician and theologian who formulated Bayes’ theorem.

Assumption of Naive Bayes


The fundamental Naive Bayes assumption is that each feature makes an independent
and equal contribution to the outcome:

 Feature independence: The features of the data are conditionally independent
of each other, given the class label.
 Continuous features are normally distributed: If a feature is continuous, then
it is assumed to be normally distributed within each class.
 Discrete features have multinomial distributions: If a feature is discrete, then
it is assumed to have a multinomial distribution within each class.
 Features are equally important: All features are assumed to contribute equally
to the prediction of the class label.
 No missing data: The data should not contain any missing values.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability
of another event that has already occurred.
Bayes’ theorem is stated mathematically as the following equation:

P(A∣B) = P(B∣A)P(A)/P(B)

where A and B are events and P(B) ≠ 0


P(A) is the prior probability of A.
P(B) is the marginal probability, i.e. the probability of the evidence.
P(A|B) is the posterior probability of A given B.
P(B|A) is the likelihood, the probability of B given A.
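As a quick worked example, the theorem can be applied directly in a few lines of
Python. The spam-filtering numbers below are illustrative assumptions, not values
taken from these notes.

# Worked Bayes' theorem example with assumed numbers:
# probability that an email is spam given that it contains the word "offer".
p_spam = 0.30              # P(A): prior probability of spam
p_word_given_spam = 0.60   # P(B|A): likelihood of "offer" appearing in spam
p_word = 0.25              # P(B): marginal probability of "offer" in any email

p_spam_given_word = p_word_given_spam * p_spam / p_word   # P(A|B) by Bayes' theorem
print(round(p_spam_given_word, 2))                        # 0.72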

Types of Naive Bayes Model

There are three types of Naive Bayes Model:

Gaussian Naive Bayes classifier


In Gaussian Naive Bayes, continuous values associated with each feature are
assumed to be distributed according to a Gaussian distribution. A Gaussian
distribution is also called a Normal distribution. When plotted, it gives a bell-shaped
curve which is symmetric about the mean of the feature values.

Multinomial Naive Bayes

Feature vectors represent the frequencies with which certain events have been
generated by a multinomial distribution. This is the event model typically used for
document classification.

Bernoulli Naive Bayes

In the multivariate Bernoulli event model, features are independent booleans (binary
variables) describing inputs. Like the multinomial model, this model is popular for
document classification tasks, where binary term-occurrence features (i.e. whether a
word occurs in a document or not) are used rather than term frequencies (i.e. the
frequency of a word in the document).
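As a minimal sketch of these three variants, scikit-learn provides ready-made
implementations; the tiny arrays below are assumed toy data for illustration only.

# Gaussian, Multinomial and Bernoulli Naive Bayes on toy data
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])                                    # class labels

# Gaussian NB: continuous features
X_cont = np.array([[1.2, 3.1], [0.9, 2.8], [3.5, 0.4], [3.9, 0.2]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# Multinomial NB: count features (e.g. word frequencies)
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 1, 0]]))

# Bernoulli NB: binary term-occurrence features
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 1, 0]]))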

Advantages of Naive Bayes


1. Easy to implement and computationally efficient.
2. Effective in cases with a large number of features.
3. Performs well even with limited training data.
4. It performs well in the presence of categorical features.

Disadvantages of Naive Bayes


1. Assumes that features are independent, which may not always hold in real-
world data.
2. Can be influenced by irrelevant attributes.
3. May assign zero probability to unseen events, leading to poor generalization.

Applications of Naive Bayes Classifier


 Spam Email Filtering: Classifies emails as spam or non-spam based on features.
 Text Classification: Used in sentiment analysis, document categorization, and
topic classification.
 Medical Diagnosis: Helps in predicting the likelihood of a disease based on
symptoms.
 Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
 Weather Prediction: Classifies weather conditions based on various factors.
MAXIMUM LIKELIHOOD
 It is the process of estimating the parameters of a distribution that maximize
the likelihood of the observed data belonging to that distribution.
 Maximum likelihood estimation (MLE) is a very prominent frequentist technique.
Many conventional machine learning algorithms work with the principles of
MLE.
For example, the best-fit line in linear regression calculated using least squares
is identical to the result of MLE (assuming Gaussian noise).

The likelihood function

Before we move forward, we need to understand the likelihood function. The likelihood
function helps us find the best parameters for our distribution. It can be defined as:

L(θ | x_1, x_2, …, x_n) = f(x_1, x_2, …, x_n | θ)

where θ is the parameter to maximize, x_1, x_2, …, x_n are observations
for n random variables from a distribution and f is the joint density function of our
distribution with the parameter θ.

The pipe (“ | ”) is often replaced by a semicolon since θ isn’t a random
variable, but an unknown parameter. Of course, θ could also be a set of
parameters.

For example, in the case of a normal distribution, we would have
θ = (μ, σ), with μ and σ representing the two parameters of our distribution.

Intuition

Likelihood is often used interchangeably with probability, but they are not the same.
Likelihood is not a probability density function, meaning that integrating it over a
specific interval would not result in a “probability” over that interval. Rather, it tells
us how well a distribution with certain values of its parameters fits our data.

θ_MLE is the value that maximizes the likelihood of our data x.

Looking at it this way, we can say that likelihood measures how well the distribution
fits the given data for varying values of its parameters. So, if L(θ_1|x) is greater
than L(θ_2|x), the distribution with parameter value θ_1 fits our data better than
the one with parameter value θ_2.

An Example: To understand the math behind MLE, let’s try a simple example. We’ll
derive the value of the exponential distribution’s parameter corresponding to the
maximum likelihood value.

The Exponential Distribution


The exponential distribution is a continuous probability distribution used to measure
inter-event time.
It has a single parameter, called λ by convention. λ is called the rate.
Its mean and variance are 1/λ and 1/λ², respectively.
The probability density function of the exponential distribution is

f(x; λ) = λ e^(−λx), for x ≥ 0.

[PDF plots with variable λ]

There’s a single parameter λ. Let’s calculate its value, given n random points x_1 to x_n.
As discussed earlier, the likelihood for a given point x_i is f(x_i; λ) = λ e^(−λ x_i).

We calculate the likelihood for each of our n points.

The combined likelihood for all n points is just the product of their individual
likelihoods, since we are considering independent and identically distributed points.
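Maximizing that product (or, more conveniently, its logarithm) gives the closed-form
estimate λ̂ = n / Σ x_i, i.e. the reciprocal of the sample mean. The sketch below checks
this numerically; the simulated data and the true rate are assumptions for illustration.

# MLE for the exponential distribution's rate parameter λ
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1 / 2.0, size=1000)   # simulated sample with true λ = 2

lambda_hat = 1.0 / np.mean(x)    # closed-form MLE: λ̂ = n / Σ x_i
print(round(lambda_hat, 3))      # should be close to 2.0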
MAXIMUM APRIORI
The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are
connected. This algorithm uses a breadth-first search and a Hash Tree to calculate
the itemset associations efficiently. It is an iterative process for finding the frequent
itemsets in a large dataset.

What is a Frequent Itemset?

Frequent itemsets are those itemsets whose support is greater than the threshold value
or user-specified minimum support. It also means that if {A, B} is a frequent itemset,
then A and B must individually be frequent itemsets as well.

Steps for Apriori Algorithm


Below are the steps for the Apriori algorithm:

Step-1: Determine the support of the itemsets in the transactional database, and select
the minimum support and confidence.
Step-2: Keep all itemsets in the transactions whose support value is higher than the
minimum (selected) support value.
Step-3: Find all the rules of these subsets that have a higher confidence value than the
threshold or minimum confidence.
Step-4: Sort the rules in decreasing order of lift.
Apriori Algorithm Working
We will understand the Apriori algorithm using an example and mathematical
calculation:

Example: Suppose we have the following dataset that has various transactions, and
from this dataset we need to find the frequent itemsets and generate the association
rules using the Apriori algorithm:

Solution:
Step-1: Calculating C1 and L1:
In the first step, we will create a table that contains the support count (the frequency of
each itemset individually in the dataset) of each itemset in the given dataset. This
table is called the Candidate set or C1.
Now, we will take out all the itemsets that have a support count greater than the
Minimum Support (2). This will give us the table for the frequent itemset L1.
Since all the itemsets have a support count greater than or equal to the minimum support,
except E, the E itemset will be removed.

Step-2: Candidate Generation C2, and L2:


o In this step, we will generate C2 with the help of L1. In C2, we will create
pairs of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support count from the main
transaction table of the dataset, i.e., how many times these pairs have occurred
together in the given dataset. So, we will get the below table for C2:

Again, we need to compare the C2 support count with the minimum support count,
and after comparing, the itemsets with a lower support count will be eliminated from
table C2. This will give us the below table for L2.
Step-3: Candidate generation C3, and L3:
 For C3, we will repeat the same two processes, but now we will form the C3
table with subsets of three itemsets together, and will calculate the support
count from the dataset. It will give the below table:

 Now we will create the L3 table. As we can see from the above C3 table, there
is only one combination of itemsets that has a support count equal to the
minimum support count. So, L3 will have only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:

To generate the association rules, we first create a new table with the possible
rules from the frequent combination {A, B, C}. For each rule, we calculate the
confidence using the formula sup(A ^ B)/sup(A). After calculating the confidence value
for all rules, we will exclude the rules that have a confidence lower than the minimum
threshold (50%).

Consider the below table:

Rules         Support   Confidence

A ^ B → C     2         Sup{(A ^ B) ^ C}/sup(A ^ B) = 2/4 = 0.5 = 50%

B ^ C → A     2         Sup{(B ^ C) ^ A}/sup(B ^ C) = 2/4 = 0.5 = 50%

A ^ C → B     2         Sup{(A ^ C) ^ B}/sup(A ^ C) = 2/4 = 0.5 = 50%

C → A ^ B     2         Sup{C ^ (A ^ B)}/sup(C) = 2/5 = 0.4 = 40%

A → B ^ C     2         Sup{A ^ (B ^ C)}/sup(A) = 2/6 = 0.33 = 33.33%

B → A ^ C     2         Sup{B ^ (A ^ C)}/sup(B) = 2/7 = 0.2857 = 28.57%


As the given threshold or minimum confidence is 50%, the first three rules A ^ B → C,
B ^ C → A, and A ^ C → B can be considered strong association rules for
the given problem.
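The support and confidence arithmetic above can be reproduced with a short,
plain-Python sketch; the transaction list below is an illustrative assumption, not the
dataset used in the worked example.

# Support counts and rule confidence for a toy transaction database
from itertools import combinations

transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
    {"A", "B", "C"}, {"B"}, {"A"},
]

def support(itemset):
    # Support count: number of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions)

min_support = 2
items = sorted({i for t in transactions for i in t})

# Frequent 1-itemsets (L1) and the candidate pairs that survive pruning (L2)
L1 = [i for i in items if support({i}) >= min_support]
L2 = [set(p) for p in combinations(L1, 2) if support(set(p)) >= min_support]

# Confidence of the rule {A, B} -> {C}: sup(A ^ B ^ C) / sup(A ^ B)
confidence = support({"A", "B", "C"}) / support({"A", "B"})
print(L1, L2, round(confidence, 2))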

Advantages of Apriori Algorithm

 The algorithm is easy to understand.


 The join and prune steps of the algorithm can be easily implemented on large
datasets.

Disadvantages of Apriori Algorithm

 The Apriori algorithm is slow compared to other algorithms.


 The overall performance can be reduced because it scans the database multiple
times.
 The time complexity and space complexity of the Apriori algorithm is O(2^D),
which is very high. Here D represents the horizontal width (number of distinct
items) present in the database.

BAYESIAN BELIEF NETWORKS


A Bayesian Belief Network is a graphical representation of the probabilistic
relationships among the random variables in a particular set. In such a network, each
variable is conditionally independent of its non-descendants given its parents.
Due to its use of joint probability, the probability in a Bayesian Belief Network is
derived based on a condition, P(attribute | parent),
i.e. the probability of an attribute given its parent attribute(s).

Consider this example:


 In this example, we have an alarm ‘A’ – a node, say installed in the house of a
person ‘gfg’, which rings upon two events, burglary ‘B’ and fire ‘F’,
which are the parent nodes of the alarm node. The alarm is the parent node of
two person nodes, ‘P1’ calls (‘P1’) and ‘P2’ calls (‘P2’).
 Upon an instance of burglary or fire, ‘P1’ and ‘P2’ call the person ‘gfg’,
respectively. But there are a few drawbacks in this case: sometimes ‘P1’ may
forget to call the person ‘gfg’, even after hearing the alarm, as he has a tendency
to forget things quickly. Similarly, ‘P2’ sometimes fails to call the person ‘gfg’, as
he is only able to hear the alarm from a certain distance.

Q) Find the probability that ‘P1’ is true (P1 has called ‘gfg’) and ‘P2’ is true (P2 has called
‘gfg’) when the alarm ‘A’ rang, but no burglary ‘B’ and no fire ‘F’ has occurred.

=> P(P1, P2, A, ~B, ~F) [where P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’
events]

[Note: The values mentioned below are not calculated or computed; they are
observed values.]

Burglary ‘B’ –

P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)


P (B=F) = 0.999 (‘B’ is false i.e burglary has not occurred)
Fire ‘F’ –

P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)


P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)

Alarm ‘A’ –
B F P (A=T) P (A=F)

T T 0.95 0.05

T F 0.94 0.06

F T 0.29 0.71

F F 0.001 0.999
The alarm ‘A’ node can be ‘true’ or ‘false’. It has two parent nodes, burglary ‘B’ and fire
‘F’, which can be ‘true’ or ‘false’ depending upon different conditions.

Person ‘P1’ –
A P (P1=T) P (P1=F)

T 0.95 0.05

F 0.05 0.95

The person ‘P1’ node can be ‘true’ or ‘false’. It has a parent node, the alarm ‘A’, which
can be ‘true’ or ‘false’.

Person ‘P2’ –
A P (P2=T) P (P2=F)

T 0.80 0.20

F 0.01 0.99

The person ‘P2’ node can be ‘true’ or ‘false’ (i.e. may or may not have called the person
‘gfg’). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e. may or may
not have rung, upon burglary ‘B’ or fire ‘F’).

Solution: Using the observed probabilities above:

With respect to the question, P(P1, P2, A, ~B, ~F), we need to get the probability
of ‘P1’. We find it with regard to its parent node, the alarm ‘A’. To get the probability of
‘P2’, we also find it with regard to its parent node, the alarm ‘A’.

We find the probability of the alarm ‘A’ node with regard to ‘~B’ & ‘~F’, since burglary ‘B’
and fire ‘F’ are the parent nodes of alarm ‘A’.
From the observed probabilities, we can deduce:

P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)


= 0.95 * 0.80 * 0.001 * 0.999 * 0.998

≈ 0.00076
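The same chain-rule product can be written as a short Python sketch using the CPT
values from the tables above (the variable names are mine, not from the notes).

# Joint probability P(P1, P2, A, ~B, ~F) from the Bayesian belief network
p_not_b = 0.999                 # P(~B)
p_not_f = 0.998                 # P(~F)
p_a_given_not_b_not_f = 0.001   # P(A | ~B, ~F)
p_p1_given_a = 0.95             # P(P1 | A)
p_p2_given_a = 0.80             # P(P2 | A)

joint = (p_p1_given_a * p_p2_given_a *
         p_a_given_not_b_not_f * p_not_b * p_not_f)
print(round(joint, 5))          # ≈ 0.00076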

PROBABILISTIC MODELING OF PROBLEMS

What are Probabilistic Models?


Probabilistic models are an essential component of machine learning, which aims to
learn patterns from data and make predictions on new, unseen data. They are
statistical models that capture the inherent uncertainty in data and incorporate it into
their predictions. Probabilistic models are used in various applications such as image
and speech recognition, natural language processing, and recommendation systems.
In recent years, significant progress has been made in developing probabilistic models
that can handle large datasets efficiently.

Categories Of Probabilistic Models


These models can be classified into the following categories:

1. Generative models
2. Discriminative models
3. Graphical models

Generative models:
Generative models aim to model the joint distribution of the input and output
variables. These models generate new data based on the probability distribution of
the original dataset. Generative models are powerful because they can generate new
data that resembles the training data. They can be used for tasks such as image and
speech synthesis, language translation, and text generation.

Discriminative models:
Discriminative models aim to model the conditional distribution of the output
variable given the input variable. They learn a decision boundary that separates the
different classes of the output variable. Discriminative models are useful when the
focus is on making accurate predictions rather than generating new data. They can be
used for tasks such as image recognition, speech recognition, and sentiment analysis.

Graphical models:
These models use graphical representations to show the conditional dependence
between variables. They are commonly used for tasks such as image recognition,
natural language processing, and causal inference.

Naive Bayes Algorithm in Probabilistic Models

The Naive Bayes algorithm is a widely used approach in probabilistic models,
demonstrating remarkable efficiency and effectiveness in solving classification
problems. By leveraging the power of the Bayes theorem and making simplifying
assumptions about feature independence, the algorithm calculates the probability of
the target class given the feature set. This method has found diverse applications
across various industries, ranging from spam filtering to medical diagnosis. Despite its
simplicity, the Naive Bayes algorithm has proven to be highly robust, providing rapid
results in a multitude of real-world problems.

Naive Bayes is a probabilistic algorithm that is used for classification problems. It is
based on the Bayes theorem of probability and assumes that the features are
conditionally independent of each other given the class. The Naive Bayes algorithm is
used to calculate the probability of a given sample belonging to a particular class. This
is done by calculating the posterior probability of each class given the sample and
then selecting the class with the highest posterior probability as the predicted class.

The algorithm works as follows:


1. Collect a labeled dataset of samples, where each sample has a set of features
and a class label.
2. For each feature in the dataset, calculate the conditional probability of the
feature given the class.
3. This is done by counting the number of times the feature occurs in samples of
the class and dividing by the total number of samples in the class.
4. Calculate the prior probability of each class by counting the number of samples
in each class and dividing by the total number of samples in the dataset.
5. Given a new sample with a set of features, calculate the posterior probability
of each class using the Bayes theorem with the conditional probabilities and
prior probabilities calculated in the previous steps.
6. Select the class with the highest posterior probability as the predicted class for
the new sample.
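The steps above can be sketched from scratch for categorical features. The tiny
weather-style dataset is an assumed toy example, and no smoothing is applied, so the
zero-probability issue mentioned earlier can occur.

# From-scratch Naive Bayes for categorical features (toy data, no smoothing)
from collections import Counter, defaultdict

data = [  # (features, class label)
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "no"},  "play"),
]

# Step 4: prior probabilities P(class)
class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Steps 2-3: conditional probabilities P(feature = value | class) by counting
cond_counts = defaultdict(Counter)
for features, label in data:
    for name, value in features.items():
        cond_counts[(label, name)][value] += 1

def likelihood(label, name, value):
    return cond_counts[(label, name)][value] / class_counts[label]

# Steps 5-6: posterior score (numerator of Bayes' theorem) and argmax
sample = {"outlook": "sunny", "windy": "yes"}
scores = {
    c: priors[c] * likelihood(c, "outlook", sample["outlook"])
                 * likelihood(c, "windy", sample["windy"])
    for c in priors
}
print(max(scores, key=scores.get), scores)
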
Advantages Of Probabilistic Models

 Probabilistic models are an increasingly popular method in many fields,
including artificial intelligence, finance, and healthcare.
 The main advantage of these models is their ability to take into account
uncertainty and variability in data. This allows for more accurate
predictions and decision-making, particularly in complex and
unpredictable situations.
 Probabilistic models can also provide insights into how different factors
influence outcomes and can help identify patterns and relationships
within data.

Disadvantages Of Probabilistic Models

There are also some disadvantages to using probabilistic models.

 One of the disadvantages is the potential for overfitting, where the
model is too specific to the training data and doesn’t perform well on
new data.
 Not all data fits well into a probabilistic framework, which can limit the
usefulness of these models in certain applications.
 Another challenge is that probabilistic models can be computationally
intensive and require significant resources to develop and implement.

PROBABILITY DENSITY ESTIMATION


Probability Density: Assume a random variable x that has a probability distribution
p(x). The relationship between the outcomes of a random variable and its probability
is referred to as the probability density.

The problem is that we don’t always know the full probability distribution for a
random variable, because we usually have access to only a small subset of
observations. This problem is referred to as Probability Density Estimation: we use
only a random sample of observations to find the general density of the whole sample
space.
Probability Density Function (PDF)
A PDF is a function that tells the probability of the random variable from a sub-sample
space falling within a particular range of values and not just one value. It tells the
likelihood of the range of values in the random variable sub-space being the same as
that of the whole sample.

By definition, if X is any continuous random variable, then the function f(x) is called a
probability density function if:

P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b

where,
a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function
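As a small numeric check of this definition, the sketch below (assuming scipy is
available) integrates the standard normal PDF from a to b and compares the result
with the difference of CDF values.

# P(a <= X <= b) for a standard normal X, two equivalent ways
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)                   # numerically integrate the PDF
print(round(area, 4), round(norm.cdf(b) - norm.cdf(a), 4))   # both ≈ 0.6827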

Steps Involved:

Step 1 - Create a histogram for the random set of observations to understand the
density of the random sample.

Step 2 - Create the probability density function and fit it on the random sample.
Observe how it fits the histogram plot.

Step 3 - Now iterate steps 1 and 2 in the following manner:


3.1 - Calculate the distribution parameters.
3.2 - Calculate the PDF for the random sample distribution.
3.3 - Observe the resulting PDF against the data.
3.4 - Transform the data until it best fits the distribution.

Density Estimation
It is the process of finding out the density of the whole population by examining a
random sample of data from that population. One of the best ways to achieve a
density estimate is by using a histogram plot.

Parametric Density Estimation
A normal distribution has two parameters: the mean and the standard deviation. We
calculate the sample mean and standard deviation of the random sample taken from
this population to estimate the density of the random sample. It is termed
'parametric' because the relation between the observations and their probability can
differ based on the values of these two parameters.
Now, it is important to understand that the mean and standard deviation of this
random sample are not going to be the same as those of the whole population, due to
its small size. A sketch of parametric density estimation is shown below.
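A minimal parametric sketch, assuming numpy/scipy and a toy normal sample:
estimate the mean and standard deviation from the sample, then use the fitted
normal PDF as the density estimate.

# Parametric density estimation: fit a normal distribution to a sample
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)   # random sample from the population

mu_hat = sample.mean()              # estimated mean
sigma_hat = sample.std(ddof=1)      # estimated standard deviation

x = np.linspace(sample.min(), sample.max(), 100)
pdf = norm.pdf(x, loc=mu_hat, scale=sigma_hat)      # fitted (parametric) density
print(round(mu_hat, 2), round(sigma_hat, 2))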

Nonparametric Density Estimation

In some cases, the PDF may not fit the random sample, as it doesn’t follow a normal
distribution (i.e. instead of one peak there are multiple peaks in the graph). Here,
instead of using distribution parameters like the mean and standard deviation, a
particular algorithm is used to estimate the probability distribution. Thus, it is known
as 'nonparametric density estimation'.

One of the most common nonparametric approaches is Kernel Density
Estimation (KDE). In this, the objective is to calculate the unknown density f̂(x) using the
equation given below:

f̂(x) = (1 / (n·h)) Σ K((x − xᵢ) / h), summing over i = 1, …, n

where,
K -> kernel (a non-negative function)
h -> bandwidth (smoothing parameter, h > 0)
Kh -> scaled kernel
f̂(x) -> density (to calculate)
n -> no. of samples in the random sample
A sample plot for nonparametric density estimation is given below.

PDF plot over sample histogram plot based on KDE
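A minimal KDE sketch using scipy's gaussian_kde; the bimodal toy sample is an
assumption, chosen so that a single normal PDF would fit it poorly.

# Kernel Density Estimation over a bimodal sample
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])  # two peaks

kde = gaussian_kde(sample)           # bandwidth h chosen automatically
x = np.linspace(sample.min(), sample.max(), 200)
density = kde(x)                     # estimated density f̂(x) on a grid
print(round(float(density.max()), 3))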

SEQUENCE MODELS
A sequence-to-sequence (seq2seq) model is a machine learning architecture designed
for tasks involving sequential data. It takes an input sequence, processes it, and
generates an output sequence. The architecture consists of two fundamental
components: an encoder and a decoder.
The advancement in neural network architectures led to the development of a more
capable seq2seq model named the transformer. “Attention is all you need!” was the
research paper that first introduced the transformer model in the era of Deep
Learning, after which language-related models have taken a huge leap. The main idea
behind the transformer model was that of attention layers and different encoder and
decoder stacks, which were highly efficient at performing language-related tasks.

What are the Encoder and Decoder in a Seq2Seq model?


In the seq2seq model, the Encoder and Decoder architectures play a vital role in
converting input sequences into output sequences. Let’s explore each block:

Encoder and Decoder Stack in seq2seq model


Encoder Block
The main purpose of the encoder block is to process the input sequence and capture
its information in a fixed-size context vector.

Architecture:
The input sequence is put into the encoder.
The encoder processes each element of the input sequence using neural networks (or
a transformer architecture).
Throughout this process, the encoder keeps an internal state, and the ultimate hidden
state functions as the context vector that encapsulates a compressed representation
of the entire input sequence. This context vector captures the semantic meaning and
important information of the input sequence.
The final hidden state of the encoder is then passed as the context vector to the
decoder.

Decoder Block
The decoder block is similar to the encoder block. The decoder processes the context
vector from the encoder to generate the output sequence incrementally.

Architecture:
In the training phase, the decoder receives both the context vector and the desired
target output sequence (ground truth).
During inference, the decoder relies on its own previously generated outputs as inputs
for subsequent steps.
The decoder uses the context vector to comprehend the input sequence and create
the corresponding output sequence. It engages in autoregressive generation,
producing individual elements sequentially. At each time step, the decoder uses the
current hidden state, the context vector, and the previous output token to generate a
probability distribution over the possible next tokens. The token with the highest
probability is then chosen as the output, and the process continues until the end of
the output sequence is reached.
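A minimal sketch of this encoder-decoder idea, written in PyTorch (assumed
framework). The GRU layers, dimensions, token ids and greedy decoding loop are
illustrative choices, not an architecture taken from these notes or a specific paper.

# Minimal seq2seq: GRU encoder, GRU decoder, greedy autoregressive decoding
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) of token ids; final hidden state = context vector
        _, hidden = self.rnn(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden):
        # token: (batch, 1) previously generated token; hidden: context/state
        output, hidden = self.rnn(self.embed(token), hidden)
        return self.out(output), hidden          # logits over the next token

def greedy_decode(encoder, decoder, src, sos_id, eos_id, max_len=20):
    # Autoregressive generation: feed back the most probable token each step
    hidden = encoder(src)
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1)            # highest-probability next token
        outputs.append(token)
        if (token == eos_id).all():
            break
    return torch.cat(outputs, dim=1)

# Example usage with an assumed vocabulary size and a dummy source batch
enc, dec = Encoder(vocab_size=100), Decoder(vocab_size=100)
src = torch.randint(0, 100, (2, 7))              # two source sequences of length 7
print(greedy_decode(enc, dec, src, sos_id=1, eos_id=2).shape)
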

Advantages
 Flexibility: Seq2Seq models can handle a wide range of tasks such as machine
translation, text summarization, and image captioning, as well as variable-
length input and output sequences.
 Handling Sequential Data: Seq2Seq models are well-suited for tasks that involve
sequential data such as natural language, speech, and time series data.
 Handling Context: The encoder-decoder architecture of Seq2Seq models allows
the model to capture the context of the input sequence and use it to generate
the output sequence.
 Attention Mechanism: Using attention mechanisms allows the model to focus
on specific parts of the input sequence when generating the output, which can
improve performance for long input sequences.

Disadvantages
 Computationally Expensive: Seq2Seq models require significant computational
resources to train and can be difficult to optimize.
 Limited Interpretability: The internal workings of Seq2Seq models can be
difficult to interpret, which can make it challenging to understand why the
model is making certain decisions.
 Overfitting: Seq2Seq models can overfit the training data if they are not
properly regularized, which can lead to poor performance on new data.
 Handling Rare Words: Seq2Seq models can have difficulty handling rare words
that are not present in the training data.
 Handling Long Input Sequences: Seq2Seq models can have difficulty handling
input sequences that are very long, as the context vector may not be able to
capture all the information in the input sequence.

Applications
 Machine Translation: The classic real-world application of the seq2seq model.
 Text Summarization: The seq2seq model effectively understands the input text,
which makes it suitable for news and document summarization.
 Speech Recognition: Seq2Seq models, especially those with attention
mechanisms, excel at processing audio waveforms for ASR. They are able to
capture spoken language patterns effectively.
 Image Captioning: The seq2seq model integrates image features from CNNs with
textual generation capabilities for image captioning. It is capable of describing
images in a human-readable format.
MARKOV MODELS
A Markov model is a stochastic method for randomly changing systems that possess
the Markov property. This means that, at any given time, the next state depends only
on the current state and is independent of anything in the past.

There are two types of Markov models, namely:


1. Markov chains
2. Hidden Markov models

These two types of Markov model are used when the system being represented is
autonomous -- that is, when the system isn't influenced by an external agent. These
are as follows:

1. Markov chains:
These are the simplest type of Markov model and are used to represent systems
where all states are observable. Markov chains show all possible states, and
between states, they show the transition rate, which is the probability of
moving from one state to another per unit of time. Applications of this type of
model include prediction of market crashes, speech recognition and search
engine algorithms.
2. Hidden Markov models:
These are used to represent systems with some unobservable states. In
addition to showing states and transition rates, hidden Markov models also
represent observations and observation likelihoods for each state. Hidden
Markov models are used for a range of applications, including thermodynamics,
finance and pattern recognition.

Another two commonly applied types of Markov model are used when the
system being represented is controlled -- that is, when the system is influenced
by a decision-making agent. These are as follows:

1. Markov decision processes:


These are used to model decision-making in discrete, stochastic, sequential
environments. In these processes, an agent makes decisions based on reliable
information. These models are applied to problems in artificial intelligence (AI),
economics and behavioural sciences.

2. Partially observable Markov decision processes:


These are used in cases like Markov decision processes but with the assumption
that the agent doesn't always have reliable information. Applications of these
models include robotics, where it isn't always possible to know the location.
Another application is machine maintenance.

How is Markov analysis applied?


Markov analysis is a probabilistic technique that uses Markov models to predict
the future behaviour of some variable based on the current state. Markov
analysis is used in many domains, including the following:

Markov chains are used for several business applications, including predicting
customer brand switching for marketing, predicting how long people will
remain in their jobs for human resources, predicting time to failure of a
machine in manufacturing, and forecasting the future price of a stock in
finance.
Markov analysis is also used in natural language processing (NLP) and in
machine learning. For NLP, a Markov chain can be used to generate a sequence
of words that form a complete sentence, or a hidden Markov model can be
used for named-entity recognition and tagging parts of speech. For machine
learning, Markov decision processes are used to represent reward in
reinforcement learning.

How are Markov models represented?


The simplest Markov model is a Markov chain, which can be expressed in
equations, as a transition matrix or as a graph. A transition matrix is used to
indicate the probability of moving from each state to each other state.
Generally, the current states are listed in rows, and the next states are
represented as columns. Each cell then contains the probability of moving from
the current state to the next state. For any given row, all the cell values must
then add up to one.

A graph consists of circles, each of which represents a state, and directional
arrows to indicate possible transitions between states. The directional arrows
are labeled with the transition probability. The transition probabilities on the
directional arrows coming out of any given circle must add up to one.

Other Markov models are based on the chain representations but with added
information, such as observations and observation likelihoods.

Example:
Consider the toss of a coin. Two states are possible: heads and
tails. The transition from heads to heads or heads to tails is equally probable (0.5) and
is independent of all preceding coin tosses.

In the state-transition diagram, the circles represent the two possible states -- heads or
tails -- and the arrows show the possible states the system could transition to in the
next step. The number 0.5 represents the probability of that transition occurring.
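A minimal sketch of the transition matrix and a simulated chain for this coin example
(numpy assumed; the simulation loop is illustrative).

# Two-state Markov chain for a fair coin: rows sum to 1
import numpy as np

states = ["heads", "tails"]
P = np.array([[0.5, 0.5],      # from heads -> heads/tails
              [0.5, 0.5]])     # from tails -> heads/tails

rng = np.random.default_rng(0)
state = 0                       # start at "heads"
chain = [states[state]]
for _ in range(10):
    state = rng.choice(len(states), p=P[state])   # sample the next state
    chain.append(states[state])
print(chain)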

HIDDEN MARKOV MODELS


The Hidden Markov Model (HMM) is a statistical model that is used to describe the
probabilistic relationship between a sequence of observations and a sequence of
hidden states. It is often used in situations where the underlying system or process
that generates the observations is unknown or hidden, hence the name “Hidden
Markov Model.”

It is used to predict future observations or classify sequences, based on the underlying
hidden process that generates the data.

An HMM consists of two types of variables:

1. hidden states and


2. observations.

1. Hidden states
These are the underlying variables that generate the observed data, but they are not
directly observable.
2. Observations
These are the variables that are measured and observed.
The relationship between the hidden states and the observations is modeled using a
probability distribution. The Hidden Markov Model (HMM) describes the relationship
between the hidden states and the observations using two sets of probabilities: the
transition probabilities and the emission probabilities.

The transition probabilities describe the probability of transitioning from one hidden
state to another.
The emission probabilities describe the probability of observing an output given a
hidden state.

Hidden Markov Model Algorithm


Step 1: Define the state space and observation space
The state space is the set of all possible hidden states, and the observation space is
the set of all possible observations.

Step 2: Define the initial state distribution
This is the probability distribution over the initial state.

Step 3: Define the state transition probabilities
These are the probabilities of transitioning from one state to another. They form the
transition matrix, which describes the probability of moving from one state to another.

Step 4: Define the observation likelihoods


These are the probabilities of generating each observation from each state. They form
the emission matrix, which describes the probability of generating each observation
from each state.

Step 5: Train the model


The parameters of the state transition probabilities and the observation likelihoods
are estimated using the Baum-Welch algorithm, which relies on the forward-backward
algorithm. This is done by iteratively updating the parameters until convergence.

Step 6: Decode the most likely sequence of hidden states


Given the observed data, the Viterbi algorithm is used to compute the most likely
sequence of hidden states. This can be used to predict future observations, classify
sequences, or detect patterns in sequential data.

Step 7: Evaluate the model
The performance of the HMM can be evaluated using various metrics, such as
accuracy, precision, recall, or F1 score.
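As a minimal sketch of Step 6, Viterbi decoding can be written directly with numpy.
The states, observations and probability values below are assumed toy values, not
parameters from these notes.

# Viterbi decoding for a two-state HMM with three possible observations
import numpy as np

states = ["Rainy", "Sunny"]
obs_space = ["walk", "shop", "clean"]
start_p = np.array([0.6, 0.4])           # initial state distribution
trans_p = np.array([[0.7, 0.3],          # transition matrix
                    [0.4, 0.6]])
emit_p = np.array([[0.1, 0.4, 0.5],      # emission matrix
                   [0.6, 0.3, 0.1]])

def viterbi(obs):
    n_states, T = len(states), len(obs)
    delta = np.zeros((T, n_states))               # best path probabilities
    psi = np.zeros((T, n_states), dtype=int)      # back-pointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * trans_p[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores.max() * emit_p[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]            # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t, path[0]])
    return [states[i] for i in path]

print(viterbi([0, 1, 2]))   # observed sequence: walk, shop, clean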
