MODULE 3
ACTING UNDER UNCERTAINTY
• Agents may need to handle uncertainty, whether due to partial observability, nondeterminism,
or a combination of the two.
• An agent may never know for certain what state it’s in or where it will end up after a sequence
of actions.
• Problem-solving and logical agents try to keep track of all possible situations (belief states) and
plan for every possible state it might be in.
• But this approach has significant drawbacks:
• A logical agent must consider every logically possible explanation for the observations, no matter how
unlikely, leading to impossibly large and complex belief-state representations.
• The plans can become huge because they must cover every possible outcome.
• Sometimes there is no plan that is guaranteed to achieve the goal, yet the agent must still act.
Example
• Imagine a self-driving taxi needs to get a passenger to the airport. It plans to leave 90 minutes before the
flight (Plan A90).
• This sounds reasonable, but the taxi can’t be 100% sure the plan will work, because many things might go
wrong: traffic, car trouble, or unexpected delays. This is called the qualification problem: you can
never list or predict every possible issue.
• The qualification problem is a challenge in AI and logic where it's impossible to list all the conditions that
could prevent an action from succeeding.
• Still, plan A90 is likely the best plan based on what the agent knows. It balances:
• Getting to the airport on time
• Not arriving too early
• Avoiding speeding tickets
• Even if the plan isn’t guaranteed to work, it’s rational because it maximizes the chance of success based on
what the agent knows and values.
Summarizing uncertainty
• Let’s consider an example of uncertain reasoning: diagnosing a dental patient’s toothache.
• We might write a rule like:
Toothache ⇒ Cavity
But this is not always true—a toothache could also be caused by gum problems, abscesses, or other issues.
• So we try:
Toothache ⇒ Cavity ∨ GumProblem ∨ Abscess...
But this becomes a long and endless list.
• What if we reverse it?
Cavity ⇒ Toothache
That’s also not always true—not all cavities cause pain.
• To make these rules fully correct, we’d have to add every little detail and condition, which is too complicated and
unrealistic.
Why Logic Fails in Medical Diagnosis:
• Laziness – It's too hard to write perfect rules for every situation.
• Theoretical ignorance – Science doesn’t know everything yet.
• Practical ignorance – Even if we know all the rules, we might be
uncertain about a particular patient because not all the necessary tests
have been or can be run.
• An agent can’t always know for sure what’s true—it can only have a degree of belief
about things.
• To handle this, we use probability theory.
• Logic says something is true, false, or unknown.
• Probability lets the agent say how likely something is—using numbers from 0 to 1:
• 0 = definitely false
• 1 = definitely true
• Values in between show uncertainty
• So, probability helps agents reason better when they aren’t 100% sure by summarizing
the uncertainty that comes from our laziness and ignorance, thereby solving the
qualification problem.
Uncertainty and rational decisions
• Consider again the A90 plan for getting to the airport. Suppose it gives us a 97% chance of catching our flight.
• Is it the best plan? Maybe not—another plan like A180 (leave earlier) might have an even higher chance.
• And what about Plan A1440 (leave 24 hours early)?
It almost guarantees being on time, but comes with a very long, unpleasant wait—so it’s not a good choice
either.
• To decide between plans, the agent needs to consider:
• Preferences (like being on time vs. waiting too long)
• Outcomes (what actually happens in each plan)
• We use utility theory to help with this. It assigns a number (utility) to each outcome based on how useful or
desirable it is.
• The agent chooses the plan that gives the highest overall utility.
• Preferences, as expressed by utilities, are combined with probabilities in the general
theory of rational decisions called decision theory
• Decision Theory = Probability + Utility
• Decision theory helps agents make smart choices by combining:
• Probability theory (how likely outcomes are)
• Utility theory (how good the outcomes are)
• An agent is rational if it chooses the action with the highest expected utility.
This is called the principle of maximum expected utility (MEU).
• Expected utility means the average usefulness of all possible outcomes, based on
how likely each one is.
• So, the best choice is the one that gives the best result on average.
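The MEU computation above can be sketched in a few lines. The success probabilities and utility values below are hypothetical, chosen only to illustrate the trade-off between the three airport plans:

```python
# Sketch of the maximum-expected-utility (MEU) principle for the
# airport plans. All numbers here are illustrative assumptions.

plans = {
    # plan: (P(catch flight), utility if caught, utility if missed)
    "A90":   (0.97,   100, -500),
    "A180":  (0.99,    80, -500),   # earlier departure: longer wait
    "A1440": (0.9999, -200, -500),  # the 24-hour wait is itself unpleasant
}

def expected_utility(p_success, u_success, u_failure):
    """Average utility of the outcomes, weighted by their probabilities."""
    return p_success * u_success + (1 - p_success) * u_failure

best = max(plans, key=lambda name: expected_utility(*plans[name]))
for name, args in plans.items():
    print(name, expected_utility(*args))
print("MEU choice:", best)   # A90 under these assumed utilities
```

With these (assumed) numbers, A1440 almost guarantees catching the flight but its long wait gives it very low utility, so A90 maximizes expected utility.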
Figure: a decision-theoretic agent receives percepts as input, based on which an action is performed.
Review of Basic Probability
• For our agent to represent and use probabilistic information, we need a formal language.
• The language of probability theory has traditionally been informal, written by human
mathematicians to other human mathematicians.
• This section discusses:
• What probabilities are about
• The language of propositions in probability assertions
• Probability axioms and their reasonableness
What probabilities are about
• Like logic, probability talks about possible worlds (different outcomes).
• Logical statements say what’s definitely true or false.
• Probabilistic statements say how likely each possible world is.
Sample Space (Ω): The set of all possible outcomes is called the sample space, written as
Ω.
• Each possible world (like a dice roll result) is called ω.
• E.g., rolling two dice has 36 possible outcomes (like (1,1), (1,2), ..., (6,6)).
Probability Model: Each possible outcome ω has a probability between 0 and 1.
• The total of all probabilities adds up to 1.
• E.g., with fair dice, each outcome has a 1/36 chance.
Events: Sets of outcomes (not just single outcomes) are called events.
• E.g., “the dice add up to 11” is an event that includes (5,6) and (6,5).
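A minimal sketch of this sample space, probability model, and event, enumerating the 36 possible worlds in Python:

```python
from itertools import product

# Sample space Ω for two fair dice: 36 equally likely possible worlds.
omega = list(product(range(1, 7), repeat=2))
p = {w: 1 / 36 for w in omega}          # probability model: P(ω) = 1/36

# The event "the dice add up to 11" is a set of outcomes.
event = [w for w in omega if sum(w) == 11]
print(event)                            # [(5, 6), (6, 5)]
print(sum(p[w] for w in event))         # 2/36 ≈ 0.0556
```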
Types of Probability:
• Prior (unconditional) probability, P(a): the degree of belief in proposition a before any evidence is seen.
• Conditional (posterior) probability, P(a | b): the degree of belief in a given that evidence b is known.
Product Rule:
• The product rule says:
P(a ∧ b) = P(a | b) P(b)
• To find the probability of both a and b happening, you multiply:
• The chance of b happening, and
• The chance of a happening if b has already happened.
• Example: P(doubles ∧ Die1 = 5) = P(doubles | Die1 = 5) P(Die1 = 5).
The language of propositions in probability assertions
• Factored Representation (in AI): Instead of thinking about the world as one big
thing, a factored representation breaks it down into smaller pieces (variables),
each with its own value.
Example:
• Imagine you're describing the state of a taxi:
• Location = Downtown
• Passenger = Yes
• Fuel Level = Full
These variable = value pairs together represent one possible state (or world).
• Variables in probability theory are called random variables and their names begin with
an uppercase letter.
• A random variable represents something that can have different outcomes.
Example: Die1, Total, Weather, Age
• Names of random variables start with capital letters (e.g., Die1, Total, Doubles).
• Each random variable has a domain – the set of values it can take.
Die1: {1, 2, 3, 4, 5, 6}
Weather: {sunny, rain, cloudy, snow}
Boolean Variables (True/False)
• If a variable is Boolean, its domain is {true, false}.
• A probability distribution is a list of all possible values of the variable
along with their associated probabilities
• A P statement such as P(Weather) = ⟨0.6, 0.1, 0.29, 0.01⟩ defines a probability distribution for the random variable
Weather (one probability for each value in its domain ⟨sunny, rain, cloudy, snow⟩).
• A Probability Density Function (PDF) describes the relative likelihood of a continuous random variable taking values near
a specific point. Probabilities come from integrating the PDF over an interval; the probability of any single exact
value is zero.
• A joint probability distribution function gives the probability that two or more random variables take on specific values at
the same time.
Example:
• Suppose you toss two fair coins.
• Let: X = result of the first coin (1 for Heads, 0 for Tails)
Y = result of the second coin (1 for Heads, 0 for Tails)
• Each of the four (X, Y) combinations is equally likely:
P(X=1, Y=1) = P(X=1, Y=0) = P(X=0, Y=1) = P(X=0, Y=0) = 1/4
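As a sketch, the same joint distribution can be built and checked programmatically, assuming the two fair coin tosses are independent:

```python
from itertools import product
from fractions import Fraction

# Joint distribution of two independent fair coin tosses.
# X and Y are 1 for Heads, 0 for Tails; each joint outcome
# has probability 1/2 * 1/2 = 1/4.
joint = {(x, y): Fraction(1, 2) * Fraction(1, 2)
         for x, y in product([0, 1], repeat=2)}

print(joint[(1, 1)])           # 1/4: both coins Heads
print(sum(joint.values()))     # 1: a valid distribution

# Marginalizing Y out recovers the distribution of X alone.
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)
print(p_x1)                    # 1/2
```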
Probability axioms and their reasonableness
• The probability axioms are the basic rules that all probabilities must follow.
• Inclusion–exclusion principle: It is a rule used to find the probability (or count) of either
of two events happening.
• It states that the probability of A or B is equal to the sum of the probabilities of A and B,
minus the probability of both A and B happening together
• Formula : P(A∨B)= P(A) + P(B) − P(A∧B)
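As a quick sanity check, the inclusion–exclusion formula can be verified by enumeration on the two-dice sample space; the events A and B below are illustrative choices:

```python
from itertools import product

# Check P(A ∨ B) = P(A) + P(B) - P(A ∧ B) on two fair dice.
# A = "first die shows 6", B = "total is 11" (illustrative events).
omega = list(product(range(1, 7), repeat=2))
p = 1 / len(omega)                 # each world has probability 1/36

A = {w for w in omega if w[0] == 6}
B = {w for w in omega if sum(w) == 11}

lhs = len(A | B) * p                               # P(A ∨ B)
rhs = len(A) * p + len(B) * p - len(A & B) * p     # inclusion–exclusion
print(lhs, rhs)                                    # both 7/36
```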
Bayes Theorem
• Bayes' theorem is also known as Bayes' rule or Bayes' law, which determines the
probability of an event with uncertain knowledge.
•Bayes' theorem was named after the British mathematician Thomas Bayes.
•It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
•Bayes' theorem allows updating the probability prediction of an event by observing
new information about the real world:
P(B | A) = P(A | B) P(B) / P(A)
• This equation is known as Bayes’ rule (also Bayes’ law or Bayes’ theorem).
Applying Bayes’ rule: The simple case
• What is the chance a patient has meningitis given they have a stiff neck?
• This is written as: P(meningitis ∣ stiff neck)
The Known Facts are:
1.P(stiff neck | meningitis) = 0.7
→ If someone has meningitis, there’s a 70% chance they will have a stiff neck. This is the likelihood
2. P(meningitis) = 1/50,000 = 0.00002
→ Very few people actually have meningitis. This is the prior probability.
3. P(stiff neck) = 0.01
→ About 1% of people in general have a stiff neck. This is the overall (marginal) probability of the
symptom.
What We Want to Find?
• We want to find the probability of meningitis given a stiff neck, P(meningitis ∣ stiff neck)
• This is the posterior probability — what we want to update using Bayes' Theorem.
• Bayes’ Theorem says:
P(meningitis | stiff neck) = P(stiff neck | meningitis) × P(meningitis) / P(stiff neck)
• Final answer:
P(meningitis | stiff neck) = (0.7 × 0.00002) / 0.01 = 0.0014
• Even if a patient has a stiff neck, there’s only a 0.14% chance (or 1 in 700) that they
actually have meningitis.
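The whole calculation is a one-liner; a sketch using the numbers given above:

```python
# Bayes' rule on the meningitis example, with the numbers above.
p_stiff_given_m = 0.7        # likelihood: P(stiff neck | meningitis)
p_m = 1 / 50_000             # prior: P(meningitis)
p_stiff = 0.01               # marginal probability of the symptom

# P(meningitis | stiff neck) = P(s | m) * P(m) / P(s)
posterior = p_stiff_given_m * p_m / p_stiff
print(posterior)             # 0.0014, i.e. 0.14%, about 1 in 700
```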
Using Bayes’ rule: Combining evidence
• Bayes' Rule helps us figure out the probability of a cause, given we’ve seen some evidence.
• Example: What’s the chance a patient has a cavity if they have a toothache?
• Bayes' Rule works well with one piece of evidence, like the stiff neck in the meningitis example.
What if we have more than one symptom or clue?
• Let’s say a dentist has two pieces of evidence:
• The patient has a toothache
• The dentist's probe catches in the tooth
• Now the dentist wants to know:
“What’s the chance the patient has a cavity, given both the toothache and the probe catching?”
This is written as: P(Cavity ∣ toothache ∧ catch)
• A joint distribution is a table that contains the probability of every possible combination of all
variables — like whether there's a cavity, toothache, probe catch, etc.
• Let’s say that from this table we find:
• P(Cavity ∣ toothache ∧ catch) = α⟨0.108, 0.016⟩ ≈ ⟨0.871, 0.129⟩
This means:
• There's an 87.1% chance that the person has a cavity
• And a 12.9% chance they don't, given both symptoms
Using Bayes’ Rule P(Cavity ∣ toothache ∧ catch) = α⋅P(toothache ∧ catch ∣ Cavity) ⋅ P(Cavity)
P(Cavity ∣ toothache ∧ catch): What we want to find — the probability of a cavity, given the symptoms.
P(toothache ∧ catch ∣ Cavity): The likelihood of both symptoms occurring if a cavity is actually present.
• P(Cavity): The general probability of having a cavity (before knowing any
symptoms).
• α : A normalizing constant to make sure the probabilities add up to 1.
• For this reformulation to work, we need to know the conditional probabilities for each combination of
evidence, which is feasible for just two evidence variables.
• If there are n possible evidence variables (X-rays, diet, oral hygiene, etc.), then
there are 2ⁿ possible combinations of observed values for which we would need to
know conditional probabilities.
REPRESENTING KNOWLEDGE IN AN UNCERTAIN DOMAIN
• A Bayesian Network is a type of graphical model that represents the
probabilistic relationships among a set of variables. It includes:
• Nodes – Each one represents a random variable (can be discrete or continuous).
• Arrows (directed links) – Show relationships between variables. If there’s an
arrow from X to Y, then X is a parent of Y. The graph has no loops (it’s a Directed
Acyclic Graph – DAG).
• Probability Tables – Each node has a conditional probability table that tells how
likely it is, depending on the values of its parent nodes.
This network includes four variables:
Weather
Cavity
Toothache
Catch
• The conditional independence of Toothache and Catch, given Cavity, is
indicated by the absence of a link between Toothache and Catch.
• Intuitively, the network represents the fact that Cavity is a direct cause of
Toothache and Catch, whereas no direct causal relationship exists
between Toothache and Catch.
• Now consider another example, Burglar Alarm Bayesian Network.
Burglar Alarm Bayesian Network
• You have a new burglar alarm installed at home.
• It is fairly reliable at detecting burglary, but also sometimes responds to minor
earthquakes.
• You have two neighbors, John and Mary, who promised to call you at work when
they hear the alarm.
• John always calls when he hears the alarm, but sometimes confuses telephone
ringing with the alarm and calls too.
• Mary likes loud music and sometimes misses the alarm.
• Given the evidence of who has or has not called, we would like to estimate the
probability of a burglary.
1. Burglary
• This is a root node.
• Only one value is given:
• P(B = true) = 0.001 → Only a 0.1% chance of burglary.
• P(B = false) = 1 - 0.001 = 0.999
2. Earthquake
• Also a root node.
• P(E = true) = 0.002 → 0.2% chance of an earthquake.
• P(E = false) = 0.998
3. Alarm
• Alarm is influenced by Burglary (B) and Earthquake (E).
• So we need to look at all 4 possible combinations of those inputs.
• This is the Conditional Probability Table (CPT) for Alarm:
B E P(A = true)
t t 0.95
t f 0.94
f t 0.29
f f 0.001
• If both burglary and earthquake happen → 95% chance alarm goes off.
• If only burglary happens → 94% chance alarm goes off.
• If only earthquake happens → 29% chance alarm goes off.
• If neither happens → only 0.1% chance alarm goes off (false alarm).
4. John Calls
• Depends only on whether the alarm went off.
A P(J = true)
t 0.90
f 0.05
5. Mary Calls
• Also depends only on alarm.
A P(M = true)
t 0.70
f 0.01
• The network structure shows that burglary and earthquakes directly affect the
probability of the alarm’s going off, but whether John and Mary call depends only on the
alarm.
• The network thus represents our assumptions that they do not perceive burglaries
directly, they do not notice minor earthquakes, and they do not confer before calling.
• The conditional distributions are shown as a conditional probability table, or CPT.
• Each row in a CPT contains the conditional probability of each node value for a
conditioning case.
THE SEMANTICS OF BAYESIAN NETWORKS
What is a Bayesian Network?
• A Bayesian Network is:
• A graph made of nodes (variables) and arrows (dependencies).
• It doesn’t have any loops (that’s what "acyclic" means).
• Each node has a table of numbers (called a CPT) that gives probabilities
depending on the values of its parent nodes.
• Syntax means: how the network is structured — the nodes, arrows, and numbers.
• Semantics means: what those numbers mean in terms of real-world probability.
The Joint Distribution:
• It represents a full probability distribution over all variables — i.e it tells you the probability of any complete scenario
happening.
WHEN DO Θ BECOME REAL PROBABILITIES?
• Once we define the joint probability using the multiplication rule
P(x₁, ..., xₙ) = θ(x₁ | parents(X₁)) × ... × θ(xₙ | parents(Xₙ)),
we can actually derive the real conditional probabilities.
• In Bayesian networks, θ(b | a) is just another name for P(b | a). So the θ-values are conditional
probabilities once we define the joint distribution using the product rule.
And they are connected like this (Bayesian Network):
We want to compute the joint probability of this full situation:
P(j, m, a, ¬b, ¬e)
• Use the Bayesian network formula:
P(j, m, a, ¬b, ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
Given:
P(j | a) = 0.90
P(m | a) = 0.70
P(a | ¬b, ¬e) = 0.001
P(¬b) = 0.999
P(¬e) = 0.998
Multiply them:
0.90 × 0.70 × 0.001 × 0.999 × 0.998 = 0.000628
So, the probability of that exact scenario is 0.000628, or about 0.063%.
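A sketch of the same multiplication in code, using the CPT entries given earlier:

```python
# Computing P(j, m, a, ¬b, ¬e) by multiplying one CPT entry per node.
p_not_b = 0.999                  # P(¬b)
p_not_e = 0.998                  # P(¬e)
p_a_given_not_b_not_e = 0.001    # alarm with neither cause (false alarm)
p_j_given_a = 0.90               # John calls when the alarm sounds
p_m_given_a = 0.70               # Mary calls when the alarm sounds

joint = (p_j_given_a * p_m_given_a * p_a_given_not_b_not_e
         * p_not_b * p_not_e)
print(round(joint, 6))           # 0.000628
```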
• The joint probability of a full scenario (all variables X₁, X₂, ..., Xₙ) can be written using the chain rule of probability.
• When you want to find the probability of a whole scenario with many events or variables,
such as X₁, X₂, ..., Xₙ, you need the joint probability. This is the probability that all the
variables happen at once.
• For example, if you're calculating the probability that it rains tomorrow and that you will
get a promotion and that you will buy a new car, you need the joint probability of all
three events happening together.
• The chain rule is a way of breaking down this joint probability into smaller, more
manageable parts. Instead of calculating the joint probability all at once, you can “chain”
conditional probabilities together:
P(X₁, ..., Xₙ) = P(X₁) P(X₂ | X₁) P(X₃ | X₁, X₂) ... P(Xₙ | X₁, ..., Xₙ₋₁)
How to Build a Bayesian Network from Scratch
Step 1: Choose Variables and Order Them
• Choose the set of variables you need to model your domain (e.g., Burglary, Earthquake, Alarm,
etc.) and Order them.
• For example: [Burglary, Earthquake, Alarm, JohnCalls, MaryCalls]
Step 2: For Each Variable, Choose Minimal Parents
• For each variable Xᵢ, look at the variables that come before it in the order.
• Now we want to decide what should be the parents of MaryCalls.
• We might think:
• “Mary’s decision might depend on Burglary, Earthquake, Alarm, JohnCalls…”
• But with domain knowledge, we realize:
• Mary hears the alarm — that’s why she calls.
• She doesn’t directly know if there’s a burglary or earthquake.
• She doesn’t know or care if John called.
• So we say:
• Given the Alarm’s state, MaryCalls is conditionally independent of everything else.
• In notation:
P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)
So we only add Alarm → MaryCalls, and stop there.
COMPACTNESS AND NODE ORDERING
Key Takeaways About Bayesian Network Compactness and Construction
• Bayesian Networks Are Compact
• They can represent a full joint distribution using far fewer numbers if each variable only depends on a few others.
• This compactness comes from local structure — each node has a small, fixed number of parents.
• Exponential vs. Linear Growth
• A full joint distribution for n Boolean variables needs 2ⁿ entries.
• A Bayesian network with at most k parents per node only needs around n · 2ᵏ entries, which is much smaller if k ≪ n.
• Example:
• For n = 30, k = 5:
• Full joint: over 1 billion entries (2³⁰ ≈ 1.07 × 10⁹)
• Bayesian network: only 960 entries (30 × 2⁵)
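The two counts can be checked directly; this sketch simply evaluates the formulas 2ⁿ and n · 2ᵏ:

```python
# Compactness comparison for n Boolean variables, each with
# at most k parents in the Bayesian network.
n, k = 30, 5

full_joint_entries = 2 ** n     # one entry per complete assignment
bayes_net_entries = n * 2 ** k  # roughly one CPT column per parent combination

print(full_joint_entries)       # 1073741824 (over a billion)
print(bayes_net_entries)        # 960
```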
• Causal Simplicity Beats Diagnostic Complexity
• Causal order (e.g., Burglary → Alarm → MaryCalls) leads to simpler networks.
• Diagnostic order (e.g., MaryCalls → Alarm → Burglary) leads to extra, unnatural dependencies and more parameters.
• Node Ordering Matters
• A bad variable order (e.g., symptoms before causes) increases the number of required links and CPT entries.
• The wrong order might make the network as complex as the full joint distribution.
• Tenuous Links Add Complexity
• Adding weak or indirect dependencies (e.g., Earthquake → MaryCalls) may make the network unnecessarily complicated.
• Only include links if they significantly improve accuracy.
• All Orders Represent the Same Joint
• Even if a network looks more complex, it still represents the same joint probability — it’s just less efficient.
• Experts Prefer Causal Thinking
• People (including doctors) are more comfortable and accurate specifying causal probabilities than diagnostic ones.
EXACT INFERENCE IN BAYESIAN NETWORKS
• Inference by enumeration: “What’s the probability of something (X) given some known evidence (e)?”
• We use the full joint distribution (the big table of all possible
combinations of variables and their probabilities) to sum over the
possibilities and compute:
P(X | e) = α Σ_y P(X, e, y)
where:
• X is the variable you're asking about (the query).
• e is the known information (the evidence).
• y represents hidden variables (things you don’t know and aren’t asking about).
• P(X, e, y) is the joint probability (the chance that X, e, and y all happen).
• α (alpha) is a normalization constant so the probabilities add up to 1.
Why is it called enumeration?
• Because you’re literally enumerating (listing and summing) all
possible cases — which becomes slow as the number of variables
grows.
Using the Bayesian Network:
• The joint probability P(X, e, y) can be written as a product of CPTs (Conditional
Probability Tables) from the network.
• Example:
• “What’s the probability that a burglary happened, given that both John and Mary called?”
• A Bayesian Network models the dependencies between events using conditional probability
tables (CPTs).
• Using CPTs, the probability becomes:
P(b | j, m) = α P(b) Σₑ P(e) Σₐ P(a | b, e) P(j | a) P(m | a)
• Constants like P(b) and P(e) can be moved outside the summations, reducing computations.
• Normalization (α) makes sure the final answers form a valid probability distribution
(values between 0 and 1 that sum to 1).
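Putting the pieces together, here is a sketch of inference by enumeration on the burglar-alarm network, using the CPTs given earlier. It sums the joint over the hidden variables Earthquake and Alarm, then normalizes:

```python
from itertools import product

# CPTs from the burglar-alarm network above.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)

def joint(b, e, a, j=True, m=True):
    """P(b, e, a, j, m) as a product of one CPT entry per node."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Unnormalized P(B, j, m): sum out the hidden variables E and A.
unnorm = {b: sum(joint(b, e, a)
                 for e, a in product([True, False], repeat=2))
          for b in [True, False]}
alpha = 1 / sum(unnorm.values())
posterior = {b: alpha * p for b, p in unnorm.items()}
print(posterior[True])    # ≈ 0.284: a burglary is still unlikely
```

Even with both John and Mary calling, the posterior probability of burglary is only about 28%, because false alarms and the tiny prior dominate.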
THE VARIABLE ELIMINATION ALGORITHM
• Variable elimination is used to compute the probability of a query variable given some evidence. It
works by systematically eliminating hidden variables (not in the query or evidence) by summing them
out.
Sum out A
Combine factors that mention A: f3, f4, f5.
Multiply them pointwise, then sum over A.
Result: new factor f6(B, E).
Sum out E
Combine f2(E) and f6(B, E).
Multiply pointwise, then sum over E.
Result: new factor f7(B).
Final calculation
Multiply f1(B) and f7(B).
Normalize the result to get final posterior probabilities.
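The “sum out A” step above can be sketched as a small factor computation; the CPT entries come from the alarm network, and f6 matches the factor named in the steps:

```python
from itertools import product

# Summing Alarm out of the factors that mention it.
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # f4(A) = P(j | A)
P_M = {True: 0.70, False: 0.01}   # f5(A) = P(m | A)

# f6(B, E) = Σ_a  P(a | B, E) * P(j | a) * P(m | a)
f6 = {}
for b, e in product([True, False], repeat=2):
    f6[(b, e)] = sum(
        (P_A[(b, e)] if a else 1 - P_A[(b, e)]) * P_J[a] * P_M[a]
        for a in [True, False])

print(f6[(True, True)])   # pointwise product, summed over Alarm
```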
REMOVE IRRELEVANT VARIABLES:
• Some variables don’t affect the query and can be removed before starting:
• If a leaf node (no children) is not part of the query or evidence, it’s irrelevant.
• Example: In P(JohnCalls | Burglary = true), MaryCalls doesn’t matter.
• Remove such nodes recursively.
The algorithm, in outline:
1. Initialize an empty list of factors.
2. Iterate over each variable in the Bayesian network in a specific order.
3. Construct factors based on the CPT and evidence, and add them to the list of current factors.
4. Eliminate the hidden variables by summing them out.
THE COMPLEXITY OF EXACT INFERENCE
1. Singly Connected Networks (Polytrees)
• There is at most one undirected path between any two nodes.
• Called polytrees.
Inference is efficient:
Time and space complexity is linear in the number of nodes (if parents per node are limited).
2. Multiply Connected Networks:
• There can be multiple undirected paths between nodes.
Inference is harder:
Time and space complexity can be exponential, even with few parents per node.
3. Inference is NP-Hard
Bayesian network inference includes propositional logic inference as a special case.
This makes it NP-hard and even #P-hard (counting satisfying assignments).
#P-hard problems are even harder than NP-complete problems.
4. Link to CSPs (Constraint Satisfaction Problems)
The difficulty of solving a CSP depends on how tree-like (low tree width) the structure is.
Tree width applies to Bayesian networks too.
Variable elimination can be used to solve both CSPs and Bayesian networks.
CLUSTERING ALGORITHMS
Limitation of Variable Elimination
• Variable Elimination is good when you only want the probability of one
variable (e.g., "What is the chance it rained?").
• But if you want to know the probabilities of many variables, you have to
repeat the process for each one.
• In a polytree (a simple, loop-free network), each query takes about O(n)
time.
• So, for n variables, the total time becomes O(n²) — which can be slow for
large networks.
Clustering (Join Tree) Solution
• Instead of doing many separate queries, we reorganize the network.
• We combine related nodes (like Sprinkler + Rain) into "cluster nodes" (called meganodes).
• These clusters form a new simpler tree structure (a join tree, or cluster tree) with no
loops.
• In this structure, you can calculate everything in one pass — bringing the time down to
O(n).
• When we group nodes into clusters (meganodes), some clusters share variables.
• You can’t use the usual inference methods directly — because the clusters are now
interconnected in a new way.
• So we use a technique called constraint propagation that ensures agreement on shared
variables between neighboring clusters.
TIME & UNCERTAINTY
1. Static vs. Dynamic Worlds
• Static world: Each variable has a fixed value. E.g., diagnosing a car
problem – the parts are either broken or not.
• Dynamic world: Variables change over time.
• Example: Diabetic patient — blood sugar, insulin levels, etc., vary with time.
2. Real-World Examples of Dynamic Systems
• Medical monitoring: Patient vitals change constantly.
• Robot tracking: Position updates over time.
• Economic forecasting: Trends evolve.
• Speech understanding: Word sequences change as someone talks.
Need for Temporal Modeling?
• In real life, things change over time — like health, weather, or movement.
So we need to model:
• How things change from one moment to the next.
• How actions (like giving insulin) affect future conditions.
• How past info helps us guess what’s happening now.
The Solution: Dynamic Models
• We use special models to deal with time-based changes:
• Add a time tag to variables:
Example: BloodSugarₜ (blood sugar at time t)
• Use tools like:
• Dynamic Bayesian Networks (DBNs)
• Hidden Markov Models (HMMs)
What These Models Do:
• Track state changes over time.
• Update beliefs as new data arrives.
• Predict future states based on current and past information.
STATES AND OBSERVATIONS
• In dynamic models, time is split into steps (like t = 0, 1, 2…).
At each step:
• Hidden variables (Xt) represent the true state (like if it’s raining or actual blood
sugar).
• Evidence variables (Et) are what we can observe (like someone carrying an
umbrella or a glucose reading).
• We can't see the hidden state directly, but we can guess it using evidence over
time.
This helps us track changes and make predictions as time goes on.
TRANSITION AND SENSOR MODELS
1. Modeling Time with State and Evidence variables
• Time is broken into slices (t = 1, 2, 3...).
• Each slice has:
• State variables (Xt): hidden true state (e.g., "Is it raining?")
• Evidence variables (Et): observable data (e.g., "Is someone carrying an umbrella?")
2. Transition Model:
• Shows how states change: P(Xt | Xt−1)
• Markov Assumption: Future depends only on the present, not the full past.
• Stationary Assumption: Transition rules stay the same over time.
3. Sensor Model
• Defines how observations relate to the current state:
P(Et | Xt).
• This is called the Sensor Markov Assumption.
• The state causes the evidence — e.g., rain causes umbrellas, not the other
way around.
4. Initial State
• We also need a starting point:
P(X0), the distribution over the state at time 0.
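The transition and sensor models can be exercised with a single filtering step. This sketch uses the standard umbrella-world numbers (a common textbook example): a transition model P(Rainₜ | Rainₜ₋₁), a sensor model P(Umbrellaₜ | Rainₜ), and a uniform initial distribution P(Rain₀):

```python
# One filtering step in the umbrella world (textbook example values).
p_rain_given_rain = 0.7        # transition model: P(Rain_t | Rain_{t-1}=true)
p_rain_given_no_rain = 0.3     # P(Rain_t | Rain_{t-1}=false)
p_umb_given_rain = 0.9         # sensor model: P(Umbrella | Rain=true)
p_umb_given_no_rain = 0.2      # P(Umbrella | Rain=false)
belief = {True: 0.5, False: 0.5}   # initial state: P(Rain_0)

# Predict: push the current belief through the transition model.
predicted = {True: belief[True] * p_rain_given_rain
                   + belief[False] * p_rain_given_no_rain}
predicted[False] = 1 - predicted[True]

# Update: weight by the evidence (umbrella observed), then normalize.
unnorm = {True: predicted[True] * p_umb_given_rain,
          False: predicted[False] * p_umb_given_no_rain}
alpha = 1 / sum(unnorm.values())
belief = {r: alpha * p for r, p in unnorm.items()}
print(round(belief[True], 3))  # ≈ 0.818: rain is now the likely state
```

Seeing the umbrella shifts the belief in rain from 0.5 to about 0.818, illustrating how evidence updates the hidden state estimate at each time step.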
5. Example: Robot Tracking
• State: The robot’s position and velocity.
• But velocity might depend on battery level. So:
• Add Battery as another state variable.
• To track it well, add a battery sensor too.
• This makes the model smarter by including all the important factors.