Mlnotes 2 Srija
BAYESIAN LEARNING
INTRODUCTION
Bayesian learning methods are relevant to the study of machine learning for two different reasons.
1. First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to certain
types of learning problems.
2. The second reason is that they provide a useful perspective for understanding many
learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian learning methods include the following:
Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example.
Prior knowledge can be combined with observed data to determine the final probability
of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a
prior probability for each candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
Bayesian methods can accommodate hypotheses that make probabilistic predictions.
New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.
Practical difficulties in applying Bayesian methods
1. One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known in
advance they are often estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to determine
the Bayes optimal hypothesis in the general case. In certain specialized situations, this
computational cost can be significantly reduced.
BAYES THEOREM
Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior
probability, the probabilities of observing various data given the hypothesis, and the observed
data itself.
Notations
P(h) prior probability of h, reflects any background knowledge about the chance that h
is correct
P(D) prior probability of D, probability that D will be observed
P(D|h) probability of observing D given a world in which h holds
P(h|D) posterior probability of h, reflects confidence that h holds after D has been
observed
Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to
calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D)
and P(D|h):
P(h|D) = P(D|h) P(h) / P(D)
P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.
Maximum a Posteriori (MAP) Hypothesis
In many learning scenarios, the learner considers some set of candidate hypotheses H
and is interested in finding the most probable hypothesis h ∈ H given the observed data
D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP)
hypothesis.
Using Bayes theorem to calculate the posterior probability of each candidate hypothesis, hMAP
is a MAP hypothesis provided
hMAP = argmax over h ∈ H of P(h|D)
     = argmax over h ∈ H of P(D|h) P(h) / P(D)
     = argmax over h ∈ H of P(D|h) P(h)
(P(D) can be dropped in the final step because it is a constant that does not depend on h).
P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes
P(D|h) is called a maximum likelihood (ML) hypothesis:
hML = argmax over h ∈ H of P(D|h)
Example
Consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has a particular form of cancer, and (2) that the patient does not. The
available data is from a particular laboratory test with two possible outcomes: +
(positive) and − (negative).
We have prior knowledge that over the entire population of people only .008 have this
disease. Furthermore, the lab test is only an imperfect indicator of the disease.
The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the
disease is not present. In other cases, the test returns the opposite result.
The above situation can be summarized by the following probabilities:
P(cancer) = .008        P(¬cancer) = .992
P(+ | cancer) = .98     P(− | cancer) = .02
P(+ | ¬cancer) = .03    P(− | ¬cancer) = .97
Suppose a new patient is observed for whom the lab test returns a positive (+) result.
Should we diagnose the patient as having cancer or not? Applying the MAP rule:
P(+ | cancer) P(cancer) = .98 × .008 = .0078
P(+ | ¬cancer) P(¬cancer) = .03 × .992 = .0298
so hMAP = ¬cancer. The exact posterior probabilities can also be determined by normalizing
the above quantities so that they sum to 1; for example,
P(cancer | +) = .0078 / (.0078 + .0298) ≈ 0.21.
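A short Python sketch of this calculation (the probabilities come from the example above; the
variable names are illustrative):

# MAP diagnosis for the cancer/lab-test example above.
p_cancer = 0.008            # prior P(cancer)
p_not_cancer = 0.992        # prior P(~cancer)
p_pos_given_cancer = 0.98   # P(+ | cancer)
p_pos_given_not = 0.03      # P(+ | ~cancer)

# Unnormalized posteriors for a positive test result.
score_cancer = p_pos_given_cancer * p_cancer   # 0.0078
score_not = p_pos_given_not * p_not_cancer     # 0.0298

# MAP hypothesis: whichever unnormalized posterior is larger.
h_map = "cancer" if score_cancer > score_not else "not cancer"

# Exact posterior by normalizing so the two quantities sum to 1.
p_cancer_given_pos = score_cancer / (score_cancer + score_not)  # about 0.21

print(h_map, round(p_cancer_given_pos, 3))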
REINFORCEMENT LEARNING
Reinforcement Learning (RL) is a subset of machine learning in which an agent learns to make
decisions by interacting with an environment to maximize a reward signal. Unlike supervised
learning, where the model learns from labelled examples, RL focuses on learning from the
consequences of actions and on exploring an environment to discover optimal behaviours.
Main Approaches to RL
1. Model-Free RL:
o Focuses on learning directly from the interaction without modeling the
environment.
o Examples:
Q-Learning (off-policy)
SARSA (on-policy)
2. Model-Based RL:
o Attempts to build a model of the environment for planning.
3. Policy Gradient Methods:
o Directly optimize the policy using gradient ascent.
o Examples:
REINFORCE
Proximal Policy Optimization (PPO)
4. Deep Reinforcement Learning:
o Combines RL with deep neural networks to handle high-dimensional state and
action spaces.
o Examples:
Deep Q-Networks (DQN)
Actor-Critic methods (A3C, DDPG)
Applications of RL
1. Robotics: Training robots to perform tasks like walking, grasping, and assembling.
2. Game Playing: Achieving superhuman performance in games like Go, Chess, and
StarCraft (e.g., AlphaGo, AlphaStar).
3. Autonomous Vehicles: Learning to navigate and make driving decisions.
4. Healthcare: Personalized treatment planning and drug discovery.
5. Finance: Portfolio management and algorithmic trading.
Challenges of RL
1. Exploration vs. Exploitation: Balancing trying new actions (exploration) and using
known strategies (exploitation).
2. Sparse Rewards: Rewards might be infrequent, making learning difficult.
3. Computational Complexity: Requires significant computational resources,
especially for deep RL.
4. Stability: Training RL models can be unstable and sensitive to hyperparameters.
Scenario: Getting Lost in a Forest
Imagine you are hiking in a forest and realize you’ve lost your way. Here’s how the problem
maps to reinforcement learning:
1. State (S):
o Your current location in the forest, possibly represented by landmarks or a
rough map in your mind.
2. Actions (A):
o Possible moves, such as walking in different directions (north, south, east,
west) or staying put.
3. Reward (R):
o Positive reward for moving closer to your starting point or destination.
o Negative reward (or penalty) for moving farther away or encountering
obstacles (e.g., a swamp).
4. Policy (π):
o Your strategy for deciding which direction to take based on your current state.
Exploration vs. exploitation in this scenario (a minimal code sketch follows this list):
1. Exploration:
o Trying new directions, even if they are uncertain, to learn more about the
environment.
o Example: You decide to walk west, even though you’re unsure where it leads,
in case it brings you closer to a landmark.
2. Exploitation:
o Using the knowledge you’ve gained to take actions that have previously
worked well.
o Example: You remember that walking east led you closer to the trail earlier, so
you choose to go east.
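A minimal Python sketch of this trade-off using an epsilon-greedy, Q-learning-style update; the
single "lost" state, the four actions, and the reward values are illustrative assumptions, not part
of the notes above:

import random

# Illustrative toy problem: four compass directions from a single "lost" state.
actions = ["north", "south", "east", "west"]
Q = {a: 0.0 for a in actions}       # estimated value of each action
alpha, epsilon = 0.1, 0.2           # learning rate and exploration rate

def reward(action):
    # Hypothetical reward signal: east moves us closer to the trail.
    return 1.0 if action == "east" else -0.1

for step in range(500):
    if random.random() < epsilon:
        a = random.choice(actions)  # exploration: try something new
    else:
        a = max(Q, key=Q.get)       # exploitation: use what has worked before
    # One-step Q-learning-style update for this single-state setting.
    Q[a] += alpha * (reward(a) - Q[a])

print(Q)  # "east" should end up with the highest estimated value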
MARKOV CHAIN MONTE CARLO METHODS
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms used for sampling from
probability distributions, particularly when direct sampling is challenging. They are widely applied in
Bayesian statistics, machine learning, and computational physics.
Key Concepts
1. Markov Chain:
o A sequence of states where the probability of transitioning to the next state
depends only on the current state (the Markov Property).
o The chain is used to explore the state space of the distribution.
2. Monte Carlo:
o Refers to the use of random sampling to perform numerical computations,
often to approximate integrals or expectations.
3. Purpose:
o To approximate a target distribution P(x), which might be computationally
expensive or intractable to evaluate directly.
Common MCMC Algorithms
1. Metropolis-Hastings Algorithm:
o Proposes a new state x′ given the current state x from a proposal distribution q.
o Accepts x′ with probability:
α = min(1, [P(x′) ⋅ q(x | x′)] / [P(x) ⋅ q(x′ | x)])
2. Gibbs Sampling:
o A special case of MCMC where each variable is sampled conditionally, one at a time
(a small sketch follows this list).
o For a joint distribution P(x1, x2, …, xn), sample
xi ∼ P(xi | x−i)
where x−i represents all variables except xi.
3. Slice Sampling
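A minimal Python sketch of Gibbs sampling (item 2 above); the bivariate normal target with
correlation rho is an assumed example, chosen because its full conditionals are easy to write
down:

import random, math

# Gibbs sampling from a standard bivariate normal with correlation rho.
# Conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
rho = 0.8
x1, x2 = 0.0, 0.0
samples = []

for i in range(10000):
    x1 = random.gauss(rho * x2, math.sqrt(1 - rho**2))  # sample x1 ~ P(x1 | x2)
    x2 = random.gauss(rho * x1, math.sqrt(1 - rho**2))  # sample x2 ~ P(x2 | x1)
    if i >= 1000:                                       # discard burn-in samples
        samples.append((x1, x2))

# The empirical covariance of the retained samples should be close to rho.
mean1 = sum(s[0] for s in samples) / len(samples)
mean2 = sum(s[1] for s in samples) / len(samples)
cov = sum((s[0] - mean1) * (s[1] - mean2) for s in samples) / len(samples)
print(round(cov, 2))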
SAMPLING
MCMC Sampling (Markov Chain Monte Carlo Sampling) is a method used to draw samples from a
probability distribution, especially when the distribution is complex or high-dimensional, making
direct sampling challenging. The key idea is to create a Markov Chain that has the desired
distribution as its equilibrium distribution and then use the chain to generate dependent samples.
1. Start at an Initial State: Begin with a random or heuristic starting point x0.
2. Generate a Markov Chain: Use a proposal mechanism to generate a sequence of
samples x1,x2,x3…, where the next state depends only on the current state (Markov
Property).
3. Ensure Convergence: Construct the chain so that its stationary (long-term)
distribution matches the target distribution P(x).
4. Use the Samples: After a burn-in period (discarding early samples), the chain
provides samples that approximate the target distribution.
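A minimal Python sketch of these four steps using a random-walk Metropolis-Hastings sampler;
the one-dimensional target (an unnormalized standard normal) and the proposal width are
assumptions made for illustration:

import random, math

def target(x):
    # Unnormalized target density P(x); a standard normal is assumed here.
    return math.exp(-0.5 * x * x)

x = 0.0                     # step 1: initial state x0
proposal_width = 1.0
chain = []

for i in range(20000):
    # Step 2: propose a new state from a symmetric Gaussian random walk.
    x_new = random.gauss(x, proposal_width)
    # Step 3: accept with probability min(1, P(x_new) / P(x)); this acceptance
    # rule makes the target the chain's stationary distribution.
    if random.random() < min(1.0, target(x_new) / target(x)):
        x = x_new
    chain.append(x)

# Step 4: discard the burn-in portion and use the rest as (dependent) samples.
samples = chain[2000:]
print(sum(samples) / len(samples))  # should be close to 0 for this target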
A Markov Chain Monte Carlo (MCMC) proposal distribution is a function used within MCMC
algorithms to generate candidate samples from a probability distribution. The idea behind MCMC is
to explore the distribution of a target variable through a sequence of states (or samples) generated
by transitioning between states according to a proposal distribution.
Key Concepts:
1. Proposal Distribution:
o Definition: It is a conditional distribution that defines how the current state of
the Markov chain moves to a new state. It proposes a new sample based on the
current sample.
o Purpose: The proposal distribution is essential for guiding the exploration of
the sample space. It helps in balancing exploration and exploitation—meaning
it allows the chain to explore new regions of the sample space but also aids in
quickly converging to the target distribution.
o Examples: In a simple random walk MCMC, the proposal distribution might
be a normal distribution centered around the current state with a certain
variance. In more complex methods like the Metropolis-Hastings algorithm, it
can involve more sophisticated proposals, such as multivariate normal
distributions or mixtures of distributions.
2. Metropolis-Hastings Algorithm:
o In the context of the Metropolis-Hastings algorithm, the proposal distribution
is used to suggest new points in the state space from which the algorithm can
either accept or reject based on the acceptance probability, which depends on
the target distribution.
3. Types of Proposal Distributions:
o Symmetric Proposal: The proposal distribution is symmetric if the
probability of proposing a move from state A to state B is the same as from B
to A. In this case the proposal terms cancel in the acceptance ratio, giving the
simpler Metropolis acceptance rule.
o Asymmetric Proposal: Sometimes asymmetric proposals (e.g., based on
gradient information or more flexible forms) are used to better adapt to the
target distribution and improve mixing.
4. Tuning Proposal Distributions:
o Tuning the proposal distribution is crucial in MCMC to ensure the Markov
chain converges to the target distribution efficiently. This might involve
adjusting the width of the proposal distribution (like the standard deviation in
a normal proposal) or using more complex, adaptive strategies.
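As a small illustration of the tuning point above, the sketch below runs a random-walk
Metropolis chain with several proposal widths and reports the acceptance rate; the target
density and the widths are illustrative assumptions. Very narrow proposals accept almost every
move but explore slowly, while very wide proposals are rejected most of the time.

import random, math

def target(x):
    # Assumed unnormalized target density: a standard normal.
    return math.exp(-0.5 * x * x)

def acceptance_rate(width, steps=20000):
    # Run a random-walk Metropolis chain and report how often proposals are accepted.
    x, accepted = 0.0, 0
    for i in range(steps):
        x_new = random.gauss(x, width)
        if random.random() < min(1.0, target(x_new) / target(x)):
            x, accepted = x_new, accepted + 1
    return accepted / steps

for width in (0.1, 1.0, 10.0):
    # A moderate width usually gives the best trade-off between acceptance and movement.
    print(width, round(acceptance_rate(width), 2))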
GRAPHICAL MODELS
Markov Chain Monte Carlo (MCMC) graphical models refer to the use of MCMC
methods to perform inference on graphical models, such as Bayesian networks or Markov
random fields. These models represent the probabilistic dependencies among variables
using a graph, where nodes represent variables and edges represent conditional
dependencies. MCMC techniques are powerful tools for exploring the posterior
distribution of the parameters or variables in these models.
Key Concepts:
1. Graphical Models:
o Definition: A graphical model is a way to describe the joint probability
distribution of a set of random variables. It uses a graph to represent
probabilistic relationships: nodes are variables, and edges denote direct
dependencies between them.
o Types:
Bayesian Networks: Directed acyclic graphs where edges indicate
causal relationships. They capture dependencies among variables
through conditional probabilities.
Markov Random Fields (MRFs): Undirected graphs capturing
dependencies without explicit directionality. Variables are
conditionally dependent if they share an edge.
Factor Graphs: A bipartite graph that combines aspects of Bayesian
networks and MRFs.
2. MCMC in Graphical Models:
o Purpose: MCMC methods are used to sample from the posterior distribution
of parameters in graphical models. They are particularly useful when the
posterior distribution is complex, multimodal, or not available in a closed-
form.
o Algorithms:
Metropolis-Hastings Algorithm: A popular MCMC method that can
be applied to graphical models by iteratively proposing new states and
accepting them based on a specified criterion.
Gibbs Sampling: A special case of Metropolis-Hastings that is
particularly effective for models in which each variable's conditional
distribution given all the other variables is easy to sample from.
Hybrid Methods: Combining different MCMC strategies to handle
more complex dependencies and improve efficiency.
3. Inference with MCMC in Graphical Models:
o Posterior Inference: MCMC methods allow for posterior inference, which is
the estimation of parameter values and hidden variable states that are most
consistent with observed data.
o Marginalization: MCMC can be used to estimate marginal distributions of
variables in the model, which are often needed in Bayesian analysis.
o Sensitivity Analysis: MCMC can be employed to study the sensitivity of
inference results to different assumptions or structural changes in the model.
4. Challenges and Solutions:
o Mixing Time: Ensuring the chain adequately explores the state space without
getting stuck or converging too slowly. Adaptive MCMC techniques can
adjust proposal distributions to improve mixing.
o Burn-in: A period where the chain is not yet in equilibrium. It's crucial to run
MCMC for enough iterations to allow the chain to “burn-in” before collecting
samples.
o Multimodal Distributions: Some models have multiple peaks in their
posterior distribution. A single chain can get stuck in one peak, so techniques
such as running several chains from different starting points or tempering are
used to help the sampler move between the peaks.
BAYESIAN NETWORKS
Bayesian Networks (also known as Bayesian Belief Networks or Bayes Nets) are a
graphical model that represents probabilistic dependencies among variables using a
directed acyclic graph. They are widely used in various fields, including artificial
intelligence, machine learning, and statistics, for tasks such as reasoning under
uncertainty, decision analysis, and predictive modeling.
Key Concepts:
1. Graphical Structure:
o Nodes: Represent random variables in the model.
o Edges: Directed arrows between nodes indicate conditional dependencies. For
instance, if there is an arrow from node A to node B, B is conditionally
dependent on A.
o Acyclic: The graph does not contain cycles, meaning there are no paths that
start and end at the same node. This ensures that dependencies are well-
defined.
Probabilistic Relationships:
Each node has a probability distribution that captures how it depends on its parents
(the nodes connected to it). This is often represented using a conditional probability
table (CPT).
For example, if A is a parent of B, then P(B∣A) specifies the probability of B given
the state of A.
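Together, the CPTs define the joint distribution: a Bayesian network factorizes the joint
probability as the product of each variable's probability given its parents, e.g.
P(A, B, C) = P(A) ⋅ P(B | A) ⋅ P(C | B) for a chain A → B → C. A tiny Python sketch with
made-up CPT values (the numbers are illustrative, not from these notes):

# Joint probability for a hypothetical chain A -> B -> C with made-up CPTs.
P_A = {True: 0.3, False: 0.7}            # P(A)
P_B_given_A = {True: 0.9, False: 0.2}    # P(B=True | A)
P_C_given_B = {True: 0.6, False: 0.1}    # P(C=True | B)

def joint(a, b, c):
    # P(A, B, C) = P(A) * P(B | A) * P(C | B)
    pb = P_B_given_A[a] if b else 1 - P_B_given_A[a]
    pc = P_C_given_B[b] if c else 1 - P_C_given_B[b]
    return P_A[a] * pb * pc

print(joint(True, True, False))  # 0.3 * 0.9 * 0.4 = 0.108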
Querying: When we are interested in finding the probability of a variable given some
evidence, we can use Bayes’ Theorem. For instance, P(A∣E) asks for the probability
of A given evidence E.
Conditional Independence: Bayesian networks exploit conditional independence:
each variable is conditionally independent of its non-descendants given its parents.
This is what allows the joint distribution to factor into the local conditional
probabilities described above.
MARKOV RANDOM FIELDS
Markov Random Fields (MRFs) are a class of probabilistic graphical models that describe the
joint distribution of a set of random variables in which each variable is conditionally
independent of all other variables given its neighbors. Unlike Bayesian Networks, which use
directed edges, MRFs use undirected edges to represent dependencies. They are widely used in
fields such as image processing, spatial statistics, and computer vision for modeling complex
interactions and spatial dependencies among variables.
Key Concepts:
1. Graph Structure:
o Nodes: Represent random variables.
o Edges: Undirected edges indicate that the variables are conditionally
dependent on each other. There is no direction, meaning the relationship is
bidirectional.
o Clustering: Variables that share an edge are in the same cluster. They are
dependent on each other but independent of variables in other clusters given
their neighbors.
Probabilistic Relationships:
The joint distribution factorizes over clusters (cliques) of the graph:
P(X) = (1/Z) ∏c ψc(Xc)
where Z is the normalizing constant (partition function), c ranges over clusters (sets of connected
variables), ψc is the potential function for cluster c, and Xc denotes the variables in the cluster.
Inference in MRFs:
Exact inference is generally intractable for large MRFs, so approximate methods such as the
MCMC techniques described above (for example, Gibbs sampling) are commonly used to estimate
marginals and posterior quantities.
HIDDEN MARKOV MODELS
Hidden Markov Models (HMMs) are a class of statistical models used to represent systems that
are assumed to be Markov processes with hidden (unobserved) states. They are widely used in
fields such as speech recognition, bioinformatics, and finance, where the goal is to infer the
hidden state sequence that generated a sequence of observable events or observations.
Key Concepts:
1. Markov Property:
o States: In an HMM, the system has a set of states that are not directly
observed. Instead, we only see the observations that are dependent on these
states.
o Transitions: There are transition probabilities between states, indicating the
probability of moving from one state to another.
o Observations: At each time step, an observable output (observation) is
generated from one of the states according to a probability distribution
specific to that state.
Model Components:
An HMM is specified by a set of hidden states, the state transition probability matrix A, the
emission (observation) probability distribution B for each state, and the initial state
distribution π.
Inference in HMMs:
Decoding: Finding the most likely sequence of hidden states that result in a given
sequence of observations. This is typically done using the Viterbi algorithm, which
is a dynamic programming method to find the most probable state sequence (a
minimal sketch follows this list).
Filtering: Calculating the probability distribution of the hidden states given the
observed data up to the current time step.
Smoothing: Extending filtering to calculate the probability distribution of hidden
states at any time step in the sequence.
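A minimal Python sketch of Viterbi decoding; the two-state model, the observation symbols, and
all probabilities below are illustrative assumptions rather than part of the notes:

# Viterbi decoding for a tiny hypothetical HMM with two hidden states.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}                        # initial distribution pi
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},             # transition matrix A
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},  # emission probabilities B
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # V[t][s] = probability of the best state path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda pair: pair[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace the backpointers from the most probable final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["walk", "shop", "clean"]))  # most likely hidden state sequence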
Training (Learning):
Estimating the model parameters (the transition, emission, and initial-state probabilities) from
observation sequences, typically using the Baum-Welch algorithm, an instance of
Expectation-Maximization.
TRACKING MODELS
Markov Chain Monte Carlo (MCMC) tracking models refer to a class of algorithms used to
estimate or approximate complex distributions through the iterative simulation of Markov
chains. These models are widely used in Bayesian statistics and machine learning for tasks such
as parameter estimation, state tracking, and prediction in high-dimensional spaces.
Key Concepts:
1. Markov Chains:
o State: Each step in an MCMC is a transition from one state to another in a
Markov chain, governed by transition probabilities.
o Stationary Distribution: MCMC algorithms converge to a stationary
distribution that reflects the distribution of interest.
o Transition: At each step, the state moves according to a predefined transition
mechanism. This could be random, but it maintains certain properties that
allow the chain to explore the space effectively.