
MODULE 4

BAYESIAN LEARNING

Bayesian reasoning provides a probabilistic approach to inference. It is based on the
assumption that the quantities of interest are governed by probability distributions and that
optimal decisions can be made by reasoning about these probabilities together with observed
data.

INTRODUCTION

Bayesian learning methods are relevant to the study of machine learning for two different reasons.
1. First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to certain
types of learning problems.
2. The second reason is that they provide a useful perspective for understanding many
learning algorithms that do not explicitly manipulate probabilities.

Features of Bayesian Learning Methods

 Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example.
 Prior knowledge can be combined with observed data to determine the final probability
of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a
prior probability for each candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
 Bayesian methods can accommodate hypotheses that make probabilistic predictions.
 New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
 Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.

Practical Difficulties in Applying Bayesian Methods

1. One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known in
advance they are often estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to determine
the Bayes optimal hypothesis in the general case. In certain specialized situations, this
computational cost can be significantly reduced.

BAYES THEOREM

Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior
probability, the probabilities of observing various data given the hypothesis, and the observed
data itself.
Notations
 P(h): prior probability of h; reflects any background knowledge about the chance that h
is correct.
 P(D): prior probability of D; the probability that D will be observed.
 P(D|h): probability of observing D given a world in which h holds.
 P(h|D): posterior probability of h; reflects confidence that h holds after D has been
observed.

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to
calculate the posterior probability P(h|D), from the prior probability P(h), together with P(D)
and P(D|h).
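In its standard form, Bayes theorem states

P(h|D) = P(D|h) P(h) / P(D)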

P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.
Maximum a Posteriori (MAP) Hypothesis

In many learning scenarios, the learner considers some set of candidate hypotheses H
and is interested in finding the most probable hypothesis h ∈ H given the observed data
D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP)
hypothesis.
Using Bayes theorem to calculate the posterior probability of each candidate hypothesis, hMAP
is a MAP hypothesis provided

hMAP = argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)

P(D) can be dropped, because it is a constant independent of h

Maximum Likelihood (ML) Hypothesis

In some cases, it is assumed that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H).
In this case the equation above can be further simplified, and we need only consider the term
P(D|h) to find the most probable hypothesis:

hML = argmax h∈H P(D|h)

P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes
P(D|h) is called a maximum likelihood (ML) hypothesis

Example
Consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has particular form of cancer, and (2) that the patient does not. The
available data is from a particular laboratory test with two possible outcomes: +
(positive) and - (negative).
We have prior knowledge that over the entire population of people only .008 have this
disease. Furthermore, the lab test is only an imperfect indicator of the disease.
The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the
disease is not present. In other cases, the test returns the opposite result.
The above situation can be summarized by the following probabilities:
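P(cancer) = 0.008          P(¬cancer) = 0.992
P(+ | cancer) = 0.98       P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03      P(− | ¬cancer) = 0.97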

Suppose a new patient is observed for whom the lab test returns a positive (+) result.
Should we diagnose the patient as having cancer or not? Applying the MAP rule:

P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

so hMAP = ¬cancer. The exact posterior probabilities can also be determined by normalizing the
above quantities so that they sum to 1, giving P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21.
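The same calculation can be written in a few lines of Python (a minimal sketch; the variable
names are illustrative):

```python
# MAP diagnosis for the cancer-test example above, given a positive test result.
priors = {"cancer": 0.008, "no_cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "no_cancer": 0.03}             # P(+ | h)

unnormalized = {h: likelihood_pos[h] * priors[h] for h in priors}
total = sum(unnormalized.values())                               # P(+) by total probability
posterior = {h: p / total for h, p in unnormalized.items()}

print(posterior)                                 # {'cancer': ~0.21, 'no_cancer': ~0.79}
h_map = max(posterior, key=posterior.get)        # -> 'no_cancer'
```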

Basic formulas for calculating probabilities: the product rule P(A ∧ B) = P(A | B) P(B), the sum rule
P(A ∨ B) = P(A) + P(B) − P(A ∧ B), Bayes theorem P(h | D) = P(D | h) P(h) / P(D), and the theorem of
total probability P(B) = Σi P(B | Ai) P(Ai), for mutually exclusive events A1, …, An with Σi P(Ai) = 1.


REINFORCEMENT LEARNING

Reinforcement Learning (RL) is a subset of machine learning where an agent learns to make
decisions by interacting with an environment to maximize a reward signal. Unlike supervised
learning, where the model learns from labelled examples, RL focuses on learning from the
consequences of actions and exploring an environment to discover optimal behaviours.

Key Components of Reinforcement Learning

1. Agent: The learner or decision-maker.


2. Environment: Everything the agent interacts with.
3. State (S): A representation of the current situation the agent is in.
4. Action (A): A set of all possible moves the agent can make.
5. Reward (R): Feedback from the environment, used to evaluate the success of an
action.
6. Policy (π): A strategy that defines the actions the agent takes based on its current
state.
7. Value Function (V): Estimates the long-term rewards that can be achieved from a
given state.
8. Q-Function (Q): Evaluates the expected utility of taking a specific action in a
specific state.

How RL Works

The agent interacts with the environment in discrete time steps:

1. Observe the current state (St).


2. Take an action (At) based on the policy.
3. Receive a reward (Rt) and observe the new state (St+1).
4. Update the policy to improve decision-making.
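This interaction loop can be sketched as tabular Q-learning (one of the model-free methods
listed below). The sketch is illustrative only: the environment interface (reset, step, actions)
and the hyperparameter values are assumptions, not part of these notes.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch of the observe -> act -> reward -> update loop.
    Assumes env offers reset() -> state, step(action) -> (next_state, reward, done),
    and a list of discrete actions in env.actions."""
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy policy: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)   # receive reward, observe new state
            # Q-learning update: move Q(s, a) toward reward + discounted best future value.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```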

Types of Reinforcement Learning Algorithms

1. Model-Free RL:
o Focuses on learning directly from the interaction without modeling the
environment.
o Examples:
 Q-Learning (off-policy)
 SARSA (on-policy)
2. Model-Based RL:
o Attempts to build a model of the environment for planning.
3. Policy Gradient Methods:
o Directly optimize the policy using gradient ascent.
o Examples:
 REINFORCE
 Proximal Policy Optimization (PPO)
4. Deep Reinforcement Learning:
o Combines RL with deep neural networks to handle high-dimensional state and
action spaces.
o Examples:
 Deep Q-Networks (DQN)
 Actor-Critic methods (A3C, DDPG)

Applications of RL

1. Robotics: Training robots to perform tasks like walking, grasping, and assembling.
2. Game Playing: Achieving superhuman performance in games like Go, Chess, and
StarCraft (e.g., AlphaGo, AlphaStar).
3. Autonomous Vehicles: Learning to navigate and make driving decisions.
4. Healthcare: Personalized treatment planning and drug discovery.
5. Finance: Portfolio management and algorithmic trading.

Challenges in Reinforcement Learning

1. Exploration vs. Exploitation: Balancing trying new actions (exploration) and using
known strategies (exploitation).
2. Sparse Rewards: Rewards might be infrequent, making learning difficult.
3. Computational Complexity: Requires significant computational resources,
especially for deep RL.
4. Stability: Training RL models can be unstable and sensitive to hyperparameters.
Scenario: Getting Lost in a Forest

Imagine you are hiking in a forest and realize you’ve lost your way. Here’s how the problem
maps to reinforcement learning:

1. State (S):
o Your current location in the forest, possibly represented by landmarks or a
rough map in your mind.
2. Actions (A):
o Possible moves, such as walking in different directions (north, south, east,
west) or staying put.
3. Reward (R):
o Positive reward for moving closer to your starting point or destination.
o Negative reward (or penalty) for moving farther away or encountering
obstacles (e.g., a swamp).
4. Policy (π):
o Your strategy for deciding which direction to take based on your current state.

Two Key Behaviors

1. Exploration:
o Trying new directions, even if they are uncertain, to learn more about the
environment.
o Example: You decide to walk west, even though you’re unsure where it leads,
in case it brings you closer to a landmark.
2. Exploitation:
o Using the knowledge you’ve gained to take actions that have previously
worked well.
o Example: You remember that walking east led you closer to the trail earlier, so
you choose to go east.
MARKOV CHAIN MONTE CARLO METHODS
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms used for sampling from
probability distributions, particularly when direct sampling is challenging. They are widely applied in
Bayesian statistics, machine learning, and computational physics.

Key Concepts

1. Markov Chain:
o A sequence of states where the probability of transitioning to the next state
depends only on the current state (the Markov Property).
o The chain is used to explore the state space of the distribution.
2. Monte Carlo:
o Refers to the use of random sampling to perform numerical computations,
often to approximate integrals or expectations.
3. Purpose:
o To approximate a target distribution P(x), which might be computationally
expensive or intractable to evaluate directly.

How MCMC Works

MCMC generates samples by constructing a Markov Chain whose stationary distribution is


the target distribution P(x). Once the chain converges (reaches stationarity), the samples from
the chain can be treated as samples from P(x).

1. Start from an initial state.


2. Use a proposal mechanism to generate candidate states.
3. Accept or reject the candidate based on a criterion that ensures the chain converges to
the target distribution.

Common MCMC Algorithms

1. Metropolis-Hastings Algorithm:
o Proposes a new state x′ given the current state x.
o Accepts x′ with probability
α = min(1, [P(x′) · q(x | x′)] / [P(x) · q(x′ | x)])
where q(x′ | x) is the proposal distribution.
o If x′ is rejected, the chain stays at x.
o A short Python sketch of this algorithm is given after this list.
2. Gibbs Sampling:

 A special case of MCMC where each variable is sampled conditionally, one at a time.
 For a joint distribution P(x1, x2, …, xn), sample xi ~ P(xi | x−i), where x−i represents all
variables except xi. Repeat for all variables to form a complete iteration.
 Characteristics: efficient for distributions where the conditional probabilities are easy to
compute; common in Bayesian networks.

3. Hamiltonian Monte Carlo (HMC):


 Combines gradients of the log-probability with physical simulation (using Hamiltonian
dynamics) to propose new states.

 Particularly efficient for high-dimensional distributions.

4. Slice Sampling

 Introduces an auxiliary variable to define a region under the probability curve.


 Samples uniformly from this region to ensure detailed balance.
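The Metropolis-Hastings step above can be written in a few lines of Python. This is a minimal
sketch, assuming a symmetric Gaussian random-walk proposal (so the q terms cancel in the
acceptance ratio); the function names, step size, and example target are illustrative, not part
of these notes.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=10000, step=1.0):
    """Minimal Metropolis-Hastings sketch with a symmetric Gaussian random-walk proposal."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + np.random.normal(scale=step)           # propose x' given x
        # alpha = min(1, P(x') / P(x)); evaluated in log space for numerical stability
        if np.log(np.random.rand()) < log_p(x_prop) - log_p(x):
            x = x_prop                                      # accept the proposal
        samples.append(x)                                   # if rejected, the chain stays at x
    return np.array(samples)

# Example target: a standard normal, for which log P(x) = -x^2 / 2 up to a constant.
samples = metropolis_hastings(lambda x: -0.5 * x ** 2, x0=0.0)
```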

SAMPLING

MCMC Sampling (Markov Chain Monte Carlo Sampling) is a method used to draw samples from a
probability distribution, especially when the distribution is complex or high-dimensional, making
direct sampling challenging. The key idea is to create a Markov Chain that has the desired
distribution as its equilibrium distribution and then use the chain to generate dependent samples.

Why Use MCMC Sampling?

MCMC is particularly helpful in cases like:

 Bayesian inference, where the posterior distribution is difficult to compute or sample
from directly.
 Complex models with high-dimensional state spaces.
 Approximating integrals (expectations) of the form E[f(x)] = ∫ f(x) P(x) dx ≈ (1/N) Σi f(xi),
where the xi are samples drawn from P(x).

How MCMC Sampling Works

1. Start at an Initial State: Begin with a random or heuristic starting point x0.
2. Generate a Markov Chain: Use a proposal mechanism to generate a sequence of
samples x1,x2,x3…, where the next state depends only on the current state (Markov
Property).
3. Ensure Convergence: Construct the chain so that its stationary (long-term)
distribution matches the target distribution P(x).
4. Use the Samples: After a burn-in period (discarding early samples), the chain
provides samples that approximate the target distribution.

Steps in MCMC Sampling

1. Initialization: Choose an initial state x0.


2. Proposal Step: Generate a new state x′ using a proposal mechanism (e.g., random
walk).
3. Acceptance Step: Calculate the acceptance probability and decide whether to accept
or reject the proposed state.
4. Repeat: Continue the proposal and acceptance steps to create a chain of samples.

MCMC PROPOSAL DISTRIBUTION

A Markov Chain Monte Carlo (MCMC) proposal distribution is a function used within MCMC
algorithms to generate candidate samples from a probability distribution. The idea behind MCMC is
to explore the distribution of a target variable through a sequence of states (or samples) generated
by transitioning between states according to a proposal distribution.

Key Concepts:

1. Proposal Distribution:
o Definition: It is a conditional distribution that defines how the current state of
the Markov chain moves to a new state. It proposes a new sample based on the
current sample.
o Purpose: The proposal distribution is essential for guiding the exploration of
the sample space. It helps in balancing exploration and exploitation—meaning
it allows the chain to explore new regions of the sample space but also aids in
quickly converging to the target distribution.
o Examples: In a simple random walk MCMC, the proposal distribution might
be a normal distribution centered around the current state with a certain
variance. In more complex methods like the Metropolis-Hastings algorithm, it
can involve more sophisticated proposals, such as multivariate normal
distributions or mixtures of distributions.
2. Metropolis-Hastings Algorithm:
o In the context of the Metropolis-Hastings algorithm, the proposal distribution
is used to suggest new points in the state space from which the algorithm can
either accept or reject based on the acceptance probability, which depends on
the target distribution.
3. Types of Proposal Distributions:
o Symmetric Proposal: The proposal distribution is symmetric if the
probability of proposing a move from state A to state B is the same as from B
to A. This can make the MCMC more efficient.
o Asymmetric Proposal: Sometimes asymmetric proposals (e.g., based on
gradient information or more flexible forms) are used to better adapt to the
target distribution and improve mixing.
4. Tuning Proposal Distributions:
o Tuning the proposal distribution is crucial in MCMC to ensure the Markov
chain converges to the target distribution efficiently. This might involve
adjusting the width of the proposal distribution (like the standard deviation in
a normal proposal) or using more complex, adaptive strategies.
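As a small sketch of what such tuning can look like in practice (the target acceptance rate and
the update rule here are illustrative heuristics, not a standard prescription):

```python
import numpy as np

# Symmetric random-walk proposal: q(x' | x) = Normal(x, scale^2).
# Because q(x' | x) = q(x | x'), the q terms cancel in the Metropolis-Hastings ratio.
def random_walk_proposal(x, scale=0.5):
    return x + np.random.normal(scale=scale)

# Crude adaptive tuning, applied only during burn-in: widen the proposal when too many
# moves are accepted (steps too small) and narrow it when too many are rejected.
def adapt_scale(scale, acceptance_rate, target=0.4):
    return scale * np.exp(acceptance_rate - target)
```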

GRAPHICAL MODELS

Markov Chain Monte Carlo (MCMC) graphical models refer to the use of MCMC
methods to perform inference on graphical models, such as Bayesian networks or Markov
random fields. These models represent the probabilistic dependencies among variables
using a graph, where nodes represent variables and edges represent conditional
dependencies. MCMC techniques are powerful tools for exploring the posterior
distribution of the parameters or variables in these models.

Key Concepts:

1. Graphical Models:
o Definition: A graphical model is a way to describe the joint probability
distribution of a set of random variables. It uses a graph to represent
probabilistic relationships: nodes are variables, and edges denote direct
dependencies between them.
o Types:
 Bayesian Networks: Directed acyclic graphs where edges indicate
causal relationships. They capture dependencies among variables
through conditional probabilities.
 Markov Random Fields (MRFs): Undirected graphs capturing
dependencies without explicit directionality. Variables are
conditionally dependent if they share an edge.
 Factor Graphs: A bipartite graph that combines aspects of Bayesian
networks and MRFs.
2. MCMC in Graphical Models:
o Purpose: MCMC methods are used to sample from the posterior distribution
of parameters in graphical models. They are particularly useful when the
posterior distribution is complex, multimodal, or not available in closed form.
o Algorithms:
 Metropolis-Hastings Algorithm: A popular MCMC method that can
be applied to graphical models by iteratively proposing new states and
accepting them based on a specified criterion.
 Gibbs Sampling: A special case of Metropolis-Hastings that can be
particularly effective when each variable's conditional distribution given
the others is easy to sample from (a minimal sketch appears at the end of
this section).
 Hybrid Methods: Combining different MCMC strategies to handle
more complex dependencies and improve efficiency.
3. Inference with MCMC in Graphical Models:
o Posterior Inference: MCMC methods allow for posterior inference, which is
the estimation of parameter values and hidden variable states that are most
consistent with observed data.
o Marginalization: MCMC can be used to estimate marginal distributions of
variables in the model, which are often needed in Bayesian analysis.
o Sensitivity Analysis: MCMC can be employed to study the sensitivity of
inference results to different assumptions or structural changes in the model.
4. Challenges and Solutions:
o Mixing Time: Ensuring the chain adequately explores the state space without
getting stuck or converging too slowly. Adaptive MCMC techniques can
adjust proposal distributions to improve mixing.
o Burn-in: A period where the chain is not yet in equilibrium. It's crucial to run
MCMC for enough iterations to allow the chain to “burn-in” before collecting
samples.
o Multimodal Distributions: Some models have multiple peaks in their
posterior distribution, and a chain can become stuck near a single peak;
careful proposal design or running multiple chains helps the sampler move
between the peaks.
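As a concrete illustration of Gibbs sampling in this setting, here is a minimal sketch for a toy
two-variable model. The bivariate-normal target, correlation value, and burn-in length are
illustrative assumptions, not something prescribed by these notes.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, burn_in=500):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each variable is resampled from its conditional distribution given the other."""
    x, y = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)                 # conditional standard deviation
    samples = []
    for _ in range(n_samples):
        x = np.random.normal(rho * y, sd)        # x | y ~ N(rho * y, 1 - rho^2)
        y = np.random.normal(rho * x, sd)        # y | x ~ N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return np.array(samples[burn_in:])           # discard the burn-in samples

samples = gibbs_bivariate_normal(rho=0.8)
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])   # means near 0, correlation near 0.8
```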

BAYESIAN NETWORKS

Bayesian Networks (also known as Bayesian Belief Networks or Bayes Nets) are a
graphical model that represents probabilistic dependencies among variables using a
directed acyclic graph. They are widely used in various fields, including artificial
intelligence, machine learning, and statistics, for tasks such as reasoning under
uncertainty, decision analysis, and predictive modeling.

Key Concepts:

1. Graphical Structure:
o Nodes: Represent random variables in the model.
o Edges: Directed arrows between nodes indicate conditional dependencies. For
instance, if there is an arrow from node A to node B, B is conditionally
dependent on A.
o Acyclic: The graph does not contain directed cycles, meaning there are no directed paths
that start and end at the same node. This ensures that the dependencies are well-defined.

Probabilistic Relationships:

 Each node has a probability distribution that captures how it depends on its parents
(the nodes connected to it). This is often represented using a conditional probability
table (CPT).
 For example, if A is a parent of B, then P(B∣A) specifies the probability of B given
the state of A.
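Together, these conditional distributions define the joint distribution over all the variables:

P(X1, X2, …, Xn) = Πi P(Xi | Parents(Xi))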

Inference in Bayesian Networks:

 Querying: When we are interested in finding the probability of a variable given some
evidence, we can use Bayes’ Theorem. For instance, P(A∣E) asks for the probability
of A given evidence E.
 Conditional Independence: Bayesian networks exploit conditional independence: each
variable is conditionally independent of its non-descendants given its parents, which is
what allows the joint distribution to be factored into the local conditional probability tables.

Learning Bayesian Networks:


 Parameter Learning: Determining the conditional probability distributions based on
data. This can be done via methods like maximum likelihood estimation or Bayesian
methods.
 Structure Learning: Determining the structure of the network, i.e., which variables
are parents of which others. This can involve search algorithms such as greedy search or
hill climbing, typically guided by a score such as the Bayesian Dirichlet equivalent (BDe).

MARKOV RANDOM FIELDS

Markov Random Fields (MRFs) are a class of probabilistic graphical models that describe the
joint distribution of a set of random variables where each variable is conditionally dependent on
its neighbors. Unlike Bayesian Networks which use directed edges, MRFs use undirected edges
to represent dependencies. They are widely used in fields such as image processing, spatial
statistics, and computer vision for modeling complex interactions and spatial dependencies
among variables.

Key Concepts:

1. Graph Structure:
o Nodes: Represent random variables.
o Edges: Undirected edges indicate that the variables are conditionally
dependent on each other. There is no direction, meaning the relationship is
bidirectional.
o Clustering: Variables that share an edge are in the same cluster. They are
dependent on each other but independent of variables in other clusters given
their neighbors.

Probabilistic Relationships:

 Potential Functions: For a set of variables Xc that are connected in the graph (a cluster,
or clique), a potential function fc(Xc) assigns a non-negative value to each configuration of
those variables. The probability distribution of an MRF is defined using these potential
functions:

P(X) = (1/Z) Πc fc(Xc)

where Z is the normalizing constant (partition function), c represents clusters (sets of connected
variables), and Xc denotes variables in the cluster.

Inference in MRFs:

 Inference: Inference in MRFs is about computing marginal or conditional distributions.
This can be done using algorithms such as Belief Propagation, Gibbs Sampling, or Mean
Field Approximations.
 Belief Propagation: A popular method for computing marginal distributions efficiently by
passing messages between neighboring nodes. It is exact on tree-structured graphs and is
used as an approximation (loopy belief propagation) on graphs that contain cycles.
 Gibbs Sampling: Used to sample from the joint distribution of variables by
iteratively sampling each variable given its neighbors.

HIDDEN MARKOV MODELS

Hidden Markov Models (HMMs) are a class of statistical models used to represent systems that
are assumed to be Markov processes with hidden states. They are widely used in fields such as
speech recognition, bioinformatics, and finance, where the goal is to infer the hidden state
sequence that generates a sequence of observable events or observations.

Key Concepts:

1. Markov Property:
o States: In an HMM, the system has a set of states that are not directly
observed. Instead, we only see the observations that are dependent on these
states.
o Transitions: There are transition probabilities between states, indicating the
probability of moving from one state to another.
o Observations: At each time step, an observable output (observation) is
generated from one of the states according to a probability distribution
specific to that state.

Model Components:

 States: A finite set of hidden states S={s1,s2,...,sN}.


 Transitions: A matrix A=[aij] where aij is the probability of transitioning from state si
to state sj.
 Observation: Each state si is associated with a probability distribution over
observations. This is often represented using an emission matrix B = [bij], where bij is
the probability of emitting observation Oj from state si.
 Initial State Distribution: A vector π=[π1,π2,...,πN] representing the probability of
starting in each state.

Inference in HMMs:

 Decoding: Finding the most likely sequence of hidden states that results in a given
sequence of observations. This is typically done using the Viterbi algorithm, a dynamic
programming method for finding the most probable state sequence (see the sketch after this list).
 Filtering: Calculating the probability distribution of the hidden states given the
observed data up to the current time step.
 Smoothing: Extending filtering to calculate the probability distribution of hidden
states at any time step in the sequence.
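A compact sketch of Viterbi decoding is given below (see the Decoding item above). It follows the
matrix conventions from the Model Components section; the NumPy implementation details are
illustrative.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden-state sequence for a sequence of observation indices `obs`.
    A[i, j] = P(s_j | s_i), B[i, k] = P(o_k | s_i), pi[i] = P(start in s_i)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))                     # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)            # back-pointers to the previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A        # trans[i, j]: best probability of reaching j via i
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # most probable final state
    for t in range(T - 1, 0, -1):                # backtrack through the stored pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```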

Training (Learning):

 Parameter Estimation: The process of determining the transition and emission


probabilities from observed data.
 Algorithms:
o Baum-Welch Algorithm: A common method used for parameter estimation
in HMMs. It iteratively adjusts the model parameters to maximize the
likelihood of the observed data given the hidden states.
o Forward-Backward Algorithm: Used to compute the probabilities required
for the Baum-Welch algorithm.
 Supervised Learning: Typically requires a set of training sequences (input-output
pairs) to learn the model parameters.
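As one building block of the Forward-Backward computation (and of filtering), here is a minimal
sketch of the forward pass, using the same A, B, pi conventions as above; the implementation
itself is illustrative.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward pass: alpha[t, i] = P(o_1, ..., o_t, state_t = s_i)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # propagate one step, then weight by emission
    return alpha

# The likelihood of the whole observation sequence is the sum over the final states:
# likelihood = forward(A, B, pi, obs)[-1].sum()
```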

TRACKING MODELS

Markov Chain Monte Carlo (MCMC) tracking models refer to a class of algorithms used to
estimate or approximate complex distributions through the iterative simulation of Markov
chains. These models are widely used in Bayesian statistics and machine learning for tasks such
as parameter estimation, state tracking, and prediction in high-dimensional spaces.

Key Concepts:

1. Markov Chains:
o State: Each step in an MCMC is a transition from one state to another in a
Markov chain, governed by transition probabilities.
o Stationary Distribution: MCMC algorithms converge to a stationary
distribution that reflects the distribution of interest.
o Transition: At each step, the state moves according to a predefined transition
mechanism. This could be random, but it maintains certain properties that
allow the chain to explore the space effectively.

Types of MCMC Tracking Models:

 Metropolis-Hastings Algorithm: One of the most commonly used MCMC


algorithms. It allows the chain to explore the parameter space by proposing new states
and accepting or rejecting them based on a probability determined by the likelihood
and a proposal distribution.
 Gibbs Sampling: A special case of Metropolis-Hastings, where the proposal
distribution is always the conditional distribution of each variable given the others. It
is particularly useful when these conditional distributions are easy to sample from.
 Hamiltonian Monte Carlo: Uses techniques from Hamiltonian mechanics to explore
the parameter space more efficiently, especially in high-dimensional spaces.

Steps in MCMC Tracking Models:

 Initialization: Start with an initial guess of the parameters or state.


 Proposal: Propose a new state based on the current state. The choice of the proposal
distribution can affect the convergence rate.
 Acceptance/Rejection: Decide whether to accept the proposed state based on a
Metropolis criterion or an acceptance probability. This step ensures that the chain
converges to the target distribution.
 Iteration: Repeat the process, typically thousands or millions of times, to allow the
chain to explore the state space thoroughly.
