ARTIFICIAL INTELLIGENCE

UNIT - V
Uncertain knowledge and Learning Uncertainty: Acting under Uncertainty, Basic
Probability Notation, Inference Using Full Joint Distributions, Independence, Bayes’
Rule and Its Use,
Probabilistic Reasoning: Representing Knowledge in an Uncertain Domain, The
Semantics of Bayesian Networks, Efficient Representation of Conditional Distributions,
Approximate Inference in Bayesian Networks, Relational and First-Order Probability,
Other Approaches to Uncertain Reasoning; Dempster-Shafer theory.
Learning: Forms of Learning, Supervised Learning, Learning Decision Trees.
Knowledge in Learning: Logical Formulation of Learning, Knowledge in Learning,
Explanation-Based Learning, Learning Using Relevance Information, Inductive Logic
Programming.
ACTING UNDER UNCERTAINTY
• Agents may need to handle uncertainty
• Partial observability, which state the agent reached?
• Nondeterminism, where the agent will end after a sequence of
actions?
• Combination of the two
• They are designed to handle uncertainty
• Keeping track of a belief state
• Generating contingency plan and reach the goal.

(a) Example of uncertain reasoning

• Consider rules for dental diagnosis written in propositional logic, and see how the logical approach breaks down:
Toothache ⇒ Cavity
• This rule is wrong: not all patients with a toothache have a cavity; they may have gum disease, an abscess, or some other problem:
Toothache ⇒ Cavity ∨ GumProblem ∨ Abscess ∨ ...
• We would need to add an almost unlimited list of possibilities to make the rule true.
• We can invert the rule:
Cavity ⇒ Toothache
• This rule isn't right either: not all cavities cause pain.
• The fix is to work with degrees of belief. Probability can be used as a way to summarize the data and make predictions in the face of uncertainty.
• How logic and probability theory are the same: the world is composed of facts that hold or do not hold in any particular case.
• How they are different: a logical agent believes each sentence is true or false or has no opinion, whereas a probabilistic agent has a numerical degree of belief ranging from 0 (certainly false) to 1 (certainly true).

• We might not know for sure what is wrong with a particular patient
• We have a belief: we believe with an 80% chance (probability 0.8) that a patient with
a toothache has a cavity, so belief can be derived from statistical data
• 80% of toothache patients also had a cavity
• Probability statements are made with respect to a knowledge state, not with respect to the real
world
• The probability that a patient has a cavity, given that she has a toothache, is 0.8
• The probability that a patient has a cavity, given that she has a toothache and a history
of gum disease, is 0.4
(b) Uncertainty and Rational Decisions:
• Which of these plans to go to the airport is a rational choice?
• \(A_{90}\) (90 minutes before), \(A_{180}\) (180 minutes before), or \(A_{1440}\) (1
day before)
• It depends on our preferences; an agent must have preferences
• We want to arrive on time but without waiting too long
• Is an intolerable wait (and airport food) acceptable?
• We can use Utility theory, Decision theory or Rational agent
• Utility Theory:
• To reason and represent preferences
• In utility theory, every state has a degree of usefulness or utility
• An agent prefers states with high utility
• Decision theory
• Combines preferences (expressed with utilities) with probabilities in the theory of
rational decisions
• Decision theory = probability theory + utility theory
• Rational agent in decision theory
• An agent is rational iff it chooses the action that yields the highest expected utility,
averaged over all the possible outcomes of the action
• This is the maximum expected utility (MEU) principle
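As a rough sketch of the MEU principle in code (the plan names echo the airport example above, but every probability and utility below is a made-up assumption for illustration):

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one plan."""
    return sum(p * u for p, u in outcomes)

# Hypothetical outcome models for the three airport plans.
plans = {
    "A_90":   [(0.95, 100), (0.05, -1000)],       # on time vs. missed flight
    "A_180":  [(0.99, 80),  (0.01, -1000)],       # safer, but a longer wait
    "A_1440": [(0.9999, -500), (0.0001, -1000)],  # arrives, but intolerable wait
}
# MEU: choose the plan whose expected utility is highest.
best = max(plans, key=lambda name: expected_utility(plans[name]))
print(best)  # -> A_180 under these made-up numbers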

BASIC PROBABILITY NOTATION:


• In probability theory, the set of all possible worlds is called the sample space. The possible
worlds are mutually exclusive and exhaustive: two possible worlds cannot both be the case, and
at least one must be the case.
• For example, if we are about to roll two (distinguishable) dice, there are 36 possible worlds to
consider: (1,1), (1,2), . . ., (6,6).
• The Greek letter Ω (uppercase omega) is used to refer to the sample space, and ω (lowercase
omega) refers to elements of the space, that is, particular possible worlds.
• A fully specified probability model associates a numerical probability P(ω) with each possible
world. The basic axioms of probability theory say that every possible world has a probability
between 0 and 1 and that the total probability of the set of possible worlds is 1:

\(0 \le P(\omega) \le 1\) for every ω, and \(\sum_{\omega \in \Omega} P(\omega) = 1\)

Random variables

• We describe the (uncertain) state of the world using random


variables
• Random variable: a variable in probability theory with a domain of
possible values it can take on
Denoted by capital letters
– R: Is it raining?
– W: What's the weather?
– D: What is the outcome of rolling two dice?
– S: What is the speed of my car (in MPH)?
• Just like variables in CSPs, random variables take on values in
a domain
Domain values must be mutually exclusive and exhaustive
– R in {True, False}
– W in {Sunny, Cloudy, Rainy, Snow}
– D in {(1,1), (1,2), …, (6,6)}
– S in [0, 200]

Events

• Probabilistic statements are defined over events, or sets of


world states
“It is raining”
“The weather is either cloudy or snowy”
“The sum of the two dice rolls is 11”
“My car speed is between 30 and 50 miles per hour”
• Events are described using propositions about random
variables:
R = True
W = “Cloudy” ∨ W = “Snowy”
D ∈ {(5,6), (6,5)}
30 ≤ S ≤ 50
• Notation: P(A) is the probability of the set of world states in
which proposition A holds
Joint Probability Distribution
• In addition to distributions on single variables, we need notation for distributions on multiple
variables. Commas are used for this.

• For example, P(Weather , Cavity) denotes the probabilities of all combinations of the values of
Weather and Cavity.

• This is a 4×2 table of probabilities called the joint probability distribution of Weather and
Cavity.

We can also mix variables with and without values; P(sunny, Cavity) would be a two-element vector
giving the probabilities of a sunny day with a cavity and a sunny day with no cavity. The P notation
makes certain expressions much more concise than they might otherwise be. For example, the product
rule for all possible values of Weather and Cavity can be written as a single equation:

P(Weather, Cavity) = P(Weather | Cavity) P(Cavity)

Inference Using Full Joint Distributions


We use the full joint distribution as the “knowledge base” from which answers to all questions may be
derived. Along the way we also introduce several useful techniques for manipulating equations
involving probabilities.
• Domain with 3 variables, booleans:
• \(Toothache\)
• \(Cavity\)
• \(Catch\): the dentist's steel probe catches in the tooth
• Full joint distribution: \(2 \times 2 \times 2\) table
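The table itself is not reproduced here; for reference, the entries of the standard full joint distribution for this domain (Figure 13.3), from which all the sums below are taken, are:

            toothache            ¬toothache
         catch    ¬catch      catch    ¬catch
cavity   0.108    0.012       0.072    0.008
¬cavity  0.016    0.064       0.144    0.576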

Marginal probability

• For example, there are six possible worlds in which cavity ∨ toothache holds:

P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28 .

• One particularly common task is to extract the distribution over some subset of variables or a
single variable. For example, adding the entries in the first row gives the unconditional or
marginal probability of cavity:

P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2 .

• This process is called marginalization, or summing out—because we sum up the probabilities


for each possible value of the other variables, thereby taking them out of the equation.
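Written generally, for any sets of variables Y and Z, the marginalization rule is:

\( \mathbf{P}(Y) = \sum_{z} \mathbf{P}(Y, z) \)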

Conditional probability

• The product rule states that P(a ∧ b) = P(a | b) P(b). A variant of the marginalization rule uses
conditional probabilities instead of joint probabilities, via the product rule: P(Y) = Σz P(Y | z) P(z).
This rule is called conditioning.

• For example, we can compute the probability of a cavity, given evidence of a toothache, as
follows:
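Using the entries of the full joint table above, the calculation is:

\( P(cavity \mid toothache) = \frac{P(cavity \land toothache)}{P(toothache)} = \frac{0.108 + 0.012}{0.108 + 0.012 + 0.016 + 0.064} = \frac{0.12}{0.2} = 0.6 \)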

Independence:
• Let us expand the full joint distribution in Figure 13.3 by adding a fourth variable, Weather .
• The full joint distribution then becomes P(Toothache, Catch, Cavity,Weather ), which has
2 × 2 × 2 × 4 = 32 entries.
• For example, how are P(toothache, catch, cavity, cloudy) and P(toothache, catch, cavity)
related?
• We can use the product rule:
P(toothache, catch, cavity, cloudy) = P(cloudy | toothache, catch, cavity)P(toothache, catch, cavity) .
• Now, unless one is in the deity business, one should not imagine that one's dental problems
influence the weather. Therefore, the following assertion seems reasonable:
P(cloudy | toothache, catch, cavity) = P(cloudy) .
• From this, we can deduce
P(toothache, catch, cavity, cloudy) = P(cloudy)P(toothache, catch, cavity) .
• A similar equation exists for every entry in P(Toothache, Catch, Cavity,Weather ). In fact, we can
write the general equation
P(Toothache, Catch, Cavity,Weather) = P(Toothache, Catch, Cavity)P(Weather ) .
• Thus, the 32-element table for four variables can be constructed from one 8-element table and one
4-element table. This decomposition is illustrated schematically in Figure 13.4(a).

In particular, the weather is independent of one's dental problems. Independence between propositions a
and b can be written as

P(a | b)=P(a) or P(b | a)=P(b) or P(a ∧ b)=P(a)P(b)

BAYES' RULE AND ITS USE

• Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.

• In probability theory, it relates the conditional probability and marginal probabilities of two
random events.

• Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.

• It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).

• Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.

• Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine the
probability of cancer more accurately with the help of age.

• Bayes' theorem states: P(A|B) = P(B|A) P(A) / P(B). Here P(A|B) is known as the posterior, which
we need to calculate; it is read as the probability of hypothesis A given that evidence B has been observed.

• P(B|A) is called the likelihood, in which we consider that hypothesis is true, then we calculate
the probability of evidence.

• P(A) is called the prior probability, probability of hypothesis before considering the evidence

• P(B) is called marginal probability, pure probability of an evidence.

• In the equation above, in general, we can write P(B) = Σi P(Ai) P(B|Ai); hence Bayes' rule can be
written as:

P(Ai|B) = P(B|Ai) P(Ai) / Σj P(Aj) P(B|Aj)

• where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.

Applying Bayes' Theorem

• It allows us to compute the single term P(b | a) in terms of three terms: P(a | b), P(b), and P(a).

• Example :

• A doctor is aware that the disease meningitis causes a patient to have a stiff neck 70%
of the time. He is also aware of some further facts, which are given as follows:

(a) The known prior probability that a patient has meningitis is 1/50,000.

(b) The known prior probability that a patient has a stiff neck is 0.01.
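Putting these numbers into Bayes' rule, with m for meningitis and s for stiff neck:

\( P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s)} = \frac{0.7 \times 1/50000}{0.01} = 0.0014 \)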
• That is, we expect less than 1 in 700 patients with a stiff neck to have meningitis.

• Notice that even though a stiff neck is quite strongly indicated by meningitis (with probability
0.7), the probability of meningitis in the patient remains small. This is because the prior
probability of stiff necks is much higher than that of meningitis

Bayesian Belief Network in artificial intelligence


A Bayesian belief network is a key technology for dealing with probabilistic events
and for solving problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model. Bayesian
networks are probabilistic because they are built from a probability distribution and use probability
theory for prediction and anomaly detection. Real-world applications are probabilistic in nature, and to
represent the relationships among multiple events, we need a Bayesian network. Bayesian networks can
be used for various tasks, including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time-series prediction, and decision making under uncertainty.

A Bayesian network can be used for building models from data and expert opinions, and it
consists of two parts:
• Directed Acyclic Graph
• Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain
knowledge is known as an influence diagram. A Bayesian network graph is made up of nodes and arcs
(directed links), where:

o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs (directed arrows) represent causal relationships or conditional dependencies between random
variables. These directed links connect pairs of nodes in the graph. A link means that one node
directly influences the other; if there is no directed link, the nodes are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the network
graph.

o If we are considering node B, which is connected with node A by a directed arrow, then node A is
called the parent of Node B.

o Node C is independent of node A.

Joint probability distribution:


If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1,
x2, x3, ..., xn are known as the joint probability distribution.

By the chain rule, P[x1, x2, x3, ..., xn] can be written in terms of conditional probabilities:

P[x1, x2, x3, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]

= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]

In a Bayesian network, for each variable Xi we can write the equation as: P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

Explanation of Bayesian network:

Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds
reliably to a burglary but also responds to minor earthquakes. Harry has two neighbors,
David and Sophia, who have taken responsibility for informing Harry at work when they hear the
alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the
phone ringing with the alarm and calls then too. On the other hand, Sophia likes to listen to loud music, so
sometimes she misses the alarm. Here we would like to compute the probability of the Burglary
Alarm:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake
has occurred, and both David and Sophia called Harry.
Solution:

• The Bayesian network for the above problem is given below. The network structure shows
that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability
of the alarm going off, whereas David's and Sophia's calls depend only on the alarm.

• The network thus represents our assumptions: the neighbors do not perceive the burglary directly, do
not notice minor earthquakes, and do not confer before calling.

• The conditional distribution for each node is given as a conditional probabilities table, or CPT.

• Each row in a CPT must sum to 1 because the entries in the row represent an exhaustive
set of cases for the variable.

• In a CPT, a Boolean variable with k Boolean parents requires 2^k rows. Hence, if there are
two parents, the CPT will contain 4 probability values.
The solution for the given example is as follows:
P(S, D, A, ¬B, ¬E) = P(S|A) × P(D|A) × P(A|¬B ∧ ¬E) × P(¬B) × P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
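A minimal Python sketch of this product (the variable names are ours; the five numbers are the CPT entries quoted in the solution above):

# Joint probability P(S, D, A, ¬B, ¬E) as a product of CPT entries.
p_s_given_a = 0.75          # P(Sophia calls | Alarm)
p_d_given_a = 0.91          # P(David calls | Alarm)
p_a_given_nb_ne = 0.001     # P(Alarm | ¬Burglary, ¬Earthquake)
p_not_b = 0.998             # P(¬Burglary)
p_not_e = 0.999             # P(¬Earthquake)

joint = p_s_given_a * p_d_given_a * p_a_given_nb_ne * p_not_b * p_not_e
print(joint)                # -> about 0.00068045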

THE SEMANTICS OF BAYESIAN NETWORK:


There are two ways to understand the semantics of a Bayesian network, as given below:

1. To understand the network as the representation of the Joint probability distribution.

It is helpful to understand how to construct the network

2. To understand the network as an encoding of a collection of conditional independence statements.

It is helpful in designing inference procedures



(1)Joint probability distribution

• A generic entry in the joint distribution is the probability of a conjunction of particular


assignments to each variable, such as P(X1 = x1 ∧ ... ∧ Xn = xn). We use the notation P(x1, ..., xn)
as an abbreviation for this. The value of this entry is given by the formula

P(x1, ..., xn) = ∏i θ(xi | parents(Xi))     (14.1)

• where parents(Xi) denotes the values of Parents(Xi) that appear in x1, ..., xn. Thus, each entry in
the joint distribution is represented by the product of the appropriate elements of the conditional
probability tables (CPTs) in the Bayesian network.

• From this definition, it is easy to prove that the parameters θ(Xi | Parents(Xi)) are exactly the
conditional probabilities P(Xi | Parents(Xi)) implied by the joint distribution (see Exercise 14.2).
Hence, we can rewrite Equation (14.1) as

P(x1, ..., xn) = ∏i P(xi | parents(Xi))     (14.2)

• To illustrate this, we can calculate the probability that the alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John and Mary call. We multiply entries
from the joint distribution (using single-letter names for the variables):

P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b ∧ ¬e) P(¬b) P(¬e) = 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.000628

(Note: this applies the decomposition of Equation (14.2): P(j|a) conditions JohnCalls only on its parent
node Alarm, and P(a|¬b ∧ ¬e) conditions Alarm on its parent nodes, Burglary and Earthquake.)

(2) Conditional independence statements


• We have provided a “numerical” semantics for Bayesian networks in terms of the representation
of the full joint distribution, as in Equation (14.2).
• We can start from a “topological” semantics that specifies the conditional independence
relationships encoded by the graph structure.
• The topological semantics specifies that each variable is conditionally independent of its non-
descendants, given its parents. For example, in Figure 14.2, JohnCalls is independent of
Burglary, Earthquake , and MaryCalls given the value of Alarm. The definition is illustrated in
Figure 14.4(a).
• Another important independence property is implied by the topological semantics: a node is
conditionally independent of all other nodes in the network, given its parents, children, and
children’s parents—that is, given its Markov blanket. For example, Burglary is independent of
JohnCalls and MaryCalls , given Alarm and Earthquake . This property is illustrated in Figure
14.4(b).

EFFICIENT REPRESENTATION OF CONDITIONAL DISTRIBUTIONS

• Even if the maximum number of parents k is smallish, filling in the CPT for a node requires up
to O(2^k) numbers and perhaps a great deal of experience with all the possible conditioning cases.

• In fact, this is a worst-case scenario in which the relationship between the parents and the child
is completely arbitrary. Usually, such relationships are describable by a canonical distribution
that fits some standard pattern.

• Canonical Distributions: In many real-world scenarios, the relationship between a node and its
parents can be described by a standard or canonical distribution, such as noisy-OR, noisy-AND,
or logistic regression models. These canonical forms reduce the complexity of specifying the
CPT.

• Noisy-OR Model: This is commonly used for situations where multiple causes independently
contribute to an effect. For example, if any of several conditions can independently cause a
disease, the noisy-OR model can describe the probability of the disease given the presence or
absence of each condition with a small number of parameters.

Example : we might say that Fever is true if and only if Cold , Flu, or Malaria is true.

• The model makes two assumptions.


• First, it assumes that all the possible causes are listed.
• Second, it assumes that inhibition of each parent is independent of inhibition of any other
parents: for example, whatever inhibits Malaria from causing a fever is independent of whatever
inhibits Flu from causing a fever.
• Given these assumptions, Fever is false if and only if all its true parents are inhibited, and the
probability of this is the product of the inhibition probabilities q for each parent.
The full CPT for Fever given Cold, Flu, and Malaria contains 2³ = 8 rows, one for each combination of
parent values. The table follows.

The previous table shows the 8 combinations for the problem; now we apply the noisy-OR assumptions,
which are as follows:
• Let us suppose the individual inhibition probabilities are:
q_cold = P(¬fever | cold, ¬flu, ¬malaria) = 0.6 ,
q_flu = P(¬fever | ¬cold, flu, ¬malaria) = 0.2 ,
q_malaria = P(¬fever | ¬cold, ¬flu, malaria) = 0.1 .

From these, every row of the CPT follows, because Fever is false if and only if all its true parents are inhibited.
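A short Python sketch of how all 8 CPT rows follow from just the three inhibition parameters, under the noisy-OR assumptions above:

from itertools import product

q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}  # inhibition probabilities

for values in product([False, True], repeat=3):
    causes = dict(zip(["cold", "flu", "malaria"], values))
    # Fever is false iff every true parent is inhibited, so P(¬fever)
    # is the product of q_j over the parents that are true.
    p_not_fever = 1.0
    for name, present in causes.items():
        if present:
            p_not_fever *= q[name]
    print(causes, "P(fever) =", round(1 - p_not_fever, 3))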
• In general, noisy logical relationships in which a variable depends on k parents can be described
using O(k) parameters instead of O(2^k) for the full conditional probability table.
• This makes assessment and learning much easier. For example, the CPCS network for internal
medicine, with 448 nodes and 906 links, requires only 8,254 values instead of 133,931,430 for a
network with full CPTs.

Exact Inference in Bayesian Networks


• The basic task for any probabilistic inference system is to compute the posterior probability
distribution for a set of query variables, given some observed event.
• We will use the following notation: X denotes the query variable; E denotes the set of evidence variables
E1, ..., Em, and e is a particular observed event; Y denotes the nonevidence, nonquery
variables Y1, ..., Yl (called the hidden variables). Thus, the complete set of variables is X =
{X} ∪ E ∪ Y.
• Let's consider a simplified medical diagnosis scenario where we are interested in the probability
of a patient having a disease D (the query variable) given some observed symptoms S (the
evidence variables). The complete set of variables also includes some hidden variables H
which might influence both the disease and the symptoms.
Query Variable (X): D (Disease)
Evidence Variables (E): S(Symptoms)
Hidden Variables (Y): H (other factors such as age, genetic predisposition, etc.)
Inference by enumeration
Any conditional probability can be computed by summing terms from the full joint distribution. More
specifically, a query P(X | e) can be answered using the following equation, which we repeat here for
convenience:

P(X | e) = α P(X, e) = α Σy P(X, e, y)

In the above, Y are the hidden variables, X is the query variable, and e is the observed event. Therefore, a
query can be answered using a Bayesian network by computing sums of products of conditional
probabilities from the network.
Consider the query P(Burglary | JohnCalls = true, MaryCalls = true). The hidden variables for this
query are Earthquake and Alarm. Using initial letters for the variables to shorten the expressions, we
have:

P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)

The structure of this computation is shown in Figure 14.8.

This gives P(b | j, m) = α × 0.00059224; the corresponding computation for ¬b yields α × 0.0014919.
Hence P(B | j, m) = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩; that is, the chance of a burglary, given
calls from both neighbors, is about 28%. (Note: see the accompanying slides for the full calculation.)
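A compact sketch of this enumeration in Python. The CPT numbers below are the standard textbook values for this network (the figure containing them is not reproduced here, so treat them as assumptions):

from itertools import product

P_B = {True: 0.001, False: 0.999}   # P(Burglary)
P_E = {True: 0.002, False: 0.998}   # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,   # P(Alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}     # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}     # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """Product of CPT entries for one complete assignment."""
    pa = P_A[(b, e)]
    return (P_B[b] * P_E[e] * (pa if a else 1 - pa)
            * (P_J[a] if j else 1 - P_J[a])
            * (P_M[a] if m else 1 - P_M[a]))

# P(Burglary | j, m): sum out the hidden variables E and A, then normalize.
unnormalized = {b: sum(joint(b, e, a, True, True)
                       for e, a in product([True, False], repeat=2))
                for b in [True, False]}
alpha = 1 / sum(unnormalized.values())
print({b: alpha * p for b, p in unnormalized.items()})
# -> roughly {True: 0.284, False: 0.716}, the "about 28%" from the text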

APPROXIMATE INFERENCE IN BAYESIAN NETWORKS


• Given the intractability of exact inference in large, multiply connected networks, it is essential
to consider approximate inference methods.


• This section describes randomized sampling algorithms, also called Monte Carlo algorithms,
that provide approximate answers whose accuracy depends on the number of samples generated.
• Monte Carlo algorithms, of which simulated annealing (page 126) is an example, are used in
many branches of science to estimate quantities that are difficult to calculate exactly.
We will discuss two families of algorithms:
(a) Direct sampling and
(b) Markov chain sampling.
(1) Direct sampling methods
• The primitive element in any sampling algorithm is the generation of samples from a known
probability distribution.
• For example, an unbiased coin can be thought of as a random variable Coin with values
⟨heads, tails⟩ and a prior distribution P(Coin) = ⟨0.5, 0.5⟩.
• Sampling from this distribution is exactly like flipping the coin: with probability 0.5 it will
return heads , and with probability 0.5 it will return tails .
• The simplest kind of random sampling process for Bayesian networks generates events from a
network that has no evidence associated with it. The idea is to sample each variable in turn, in
topological order. This algorithm is shown in Figure 14.13.

We can illustrate its operation on the network in Figure 14.12(a), assuming an ordering [Cloudy,
Sprinkler , Rain, WetGrass ]:
1. Sample from P(Cloudy) = ⟨0.5, 0.5⟩; suppose this returns true.
2. Sample from P(Sprinkler | Cloudy = true) = ⟨0.1, 0.9⟩; suppose this returns false.
3. Sample from P(Rain | Cloudy = true) = ⟨0.8, 0.2⟩; suppose this returns true.
4. Sample from P(WetGrass | Sprinkler = false, Rain = true) = ⟨0.9, 0.1⟩; suppose this returns true.
In this case, PRIOR-SAMPLE returns the event [true, false, true, true].
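A minimal Python sketch of PRIOR-SAMPLE for this network. The walk-through above quotes only some CPT rows; the remaining rows (e.g., P(Sprinkler | Cloudy = false)) use the standard textbook values and should be treated as assumptions:

import random

def prior_sample():
    """Sample each variable in topological order from its CPT."""
    cloudy = random.random() < 0.5
    sprinkler = random.random() < (0.1 if cloudy else 0.5)
    rain = random.random() < (0.8 if cloudy else 0.2)
    p_wet = {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.0}[(sprinkler, rain)]
    wet_grass = random.random() < p_wet
    return [cloudy, sprinkler, rain, wet_grass]

print(prior_sample())   # e.g. [True, False, True, True]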
(a) Rejection Sampling

(b) Likelihood weighting


Note: Refer to the CPT for WetGrass given Sprinkler and Rain to see why the value 0.9 is used here.

(2) Markov Chain Sampling



RELATIONAL AND FIRST-ORDER PROBABILITY MODELS


OTHER APPROACHES TO UNCERTAIN REASONING

FORMS OF LEARNING
Any component of an agent can be improved by learning from data. The improvements, and the
techniques used to make them, depend on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
Consider, for example, an agent training to become a taxi driver.
(1) Every time the instructor shouts “Brake!” the agent might learn a condition–action rule for when
to brake.
(2) By seeing many camera images that it is told contain buses, it can learn to recognize them .
(3) By trying actions and observing the results—for example, braking hard on a wet road—it can
learn the effects of its actions.
(4) Then, when it receives no tip from passengers who have been thoroughly shaken up during the
trip, it can learn a useful component of its overall utility function.
Representation and prior knowledge
• We have seen several examples of representations for agent components like propositional
logic, first-order logical sentences, Bayesian networks using factored representation for
decision-theoretic agent and so on.
• There is another way to look at the various types of learning. Learning a general function or
rule from specific input–output pairs is called inductive learning. The other type, analytical or
deductive learning, goes from a known general rule to a new rule that is logically entailed; this is
useful for more efficient processing.
Agents can learn from different kinds of feedback.
There are three main types of learning:
• In unsupervised learning the agent learns patterns in the input even though no explicit feedback
is supplied. The most common unsupervised learning task is clustering: detecting useful
clusters from input examples.
For example, a taxi agent might gradually develop a concept of “good traffic days”
and “bad traffic days” without ever being given labelled examples of each by a teacher.
• In reinforcement learning the agent learns from a series of reinforcements—rewards or
punishments.
For example, the lack of a tip at the end of the journey gives the taxi agent an indication
that it did something wrong. The two points for a win at the end of a chess game tells the
agent it did something right. It is up to the agent to decide which of the actions prior
to the reinforcement were most responsible for it.
• In supervised learning the agent observes some example input–output pairs and learns a
function that maps from input to output.
Example: In component (1) above, the inputs are percepts and the outputs are provided by a teacher
who says “Brake!” or “Turn left.”
In component (2), the inputs are camera images and the outputs again come from a teacher who says
“that's a bus.”
In component (3), the theory of braking is a function from states and braking actions to stopping
distance in feet. In this case the output value is available directly from the agent's percepts (after
the fact); the environment is the teacher.
SUPERVISED LEARNING
• In supervised learning, the machine is trained on a set of labeled data, which means that the input
data is paired with the desired output. The machine then learns to predict the output for new
input data. Supervised learning, as the name indicates, has the presence of a supervisor as a
teacher.
• Supervised learning is when we teach or train the machine using data that is well-labelled,
which means the data is already tagged with the correct answer.
• After that, the machine is provided with a new set of examples (data) so that the supervised
learning algorithm analyses the training data (the set of training examples) and produces a correct
outcome from labeled data.
The task of supervised learning is this:
Given a training set of N example input–output pairs (x1, y1), (x2, y2), ..., (xN, yN), where each
yj was generated by an unknown function y = f(x), discover a function h that approximates the true
function f. Here x and y can be any value; they need not be numbers. The function h is a hypothesis.
Example 1:
When the output y is one of a finite set of values (such as sunny, cloudy, or rainy), the learning problem
is called classification; it is called Boolean or binary classification if there are only two values. When
y is a number (such as tomorrow's temperature), the learning problem is called regression.
Example 2:
A labeled dataset of images of elephants, camels, and cows would have each image tagged with either
“Elephant”, “Camel”, or “Cow”.

Key Points:
• Supervised learning involves training a machine from labeled data.
• Labeled data consists of examples with the correct answer or classification.
• The machine learns the relationship between inputs and outputs .
• The trained machine can then make predictions on new, unlabeled data.
Types of Supervised Learning Algorithm
• Supervised learning is typically divided into two main categories: regression and classification.
Regression:
• Regression is a supervised learning technique used to predict continuous numerical values based
on input features. It aims to establish a functional relationship between independent variables
and a dependent variable, such as predicting house prices based on features like size, bedrooms,
and location.
• The goal is to minimize the difference between predicted and actual values using algorithms like
Linear Regression, Decision Trees, or Neural Networks, ensuring the model captures underlying
patterns in the data. The following example shows
• Diagram : It is a Meteorological dataset that serves the purpose of predicting wind speed based
on different parameters.
• Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
• Output: Wind Speed

Classification
Classification is a type of supervised learning that categorizes input data into predefined labels. It
involves training a model on labeled examples to learn patterns between input features and output
classes. In classification, the target variable is a categorical value.
For example, classifying emails as spam or not.
The model's goal is to generalize this learning to make accurate predictions on new, unseen data.
Algorithms like Decision Trees, Support Vector Machines, and Neural Networks are commonly used for
classification tasks. The following example shows:
Diagram : It is a dataset of a shopping store that is useful in predicting whether a customer will
purchase a particular product under consideration or not based on his/ her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased, i.e., 0 or 1; 1 means the customer will purchase and 0 means the customer
won't purchase it.

Decision Trees
A decision tree is a flowchart-like structure used to make decisions or predictions. It represents a
function that takes as input a vector of attribute values and returns a “decision”—a single output value.
The input and output values can be discrete or continuous. For now we will concentrate on problems
where the inputs have discrete values and the output has exactly two possible values; this is Boolean
classification, where each example input will be classified as true (a positive example) or false (a
negative example).
Structure of a Decision Tree
Root Node: Represents the entire dataset and the initial decision to be made.
Internal Nodes: Represent decisions or tests on attributes. Each internal node has one or more branches.
Branches: Represent the outcome of a decision or test, leading to another node.
Leaf Nodes: Represent the final decision or prediction. No further splits occur at these nodes.
Example, we will build a decision tree to decide whether to wait for a table at restaurant. The aim here
is to reach the goal predicate WillWait . First we list the attributes that we will consider as part of the
input:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar : whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full ).
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.

8. Reservation: whether we made a reservation.


9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0–10 minutes, 10–30, 30–60, or >60). Note that every
variable has a small set of possible values; the value of WaitEstimate, for example, is not an integer but
one of the four discrete values 0–10, 10–30, 30–60, or >60. The decision tree is shown in
Figure 18.2.

Expressiveness of decision trees


A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true
if and only if the input attributes satisfy one of the paths leading to a leaf with value true.
Writing this out in propositional logic, we have
Goal ⇔ (Path1 ∨ Path2 ∨ ...) ,
where each Path is a conjunction of attribute-value tests required to follow that path. Thus,
the whole expression is equivalent to disjunctive normal form which means
that any function in propositional logic can be expressed as a decision tree. As an example,
the rightmost path in Figure 18.2 is
Path = (Patrons =Full ∧ WaitEstimate =0–10) .

Inducing decision trees from examples


An example for a Boolean decision tree consists of an (x, y) pair, where x is a vector of values for the
input attributes, and y is a single Boolean output value. A training set of 12 examples is shown in Figure
18.3. The positive examples are the ones in which the goal WillWait is true (x1, x3, . . .); the negative
examples are the ones in which it is false (x2, x5, . . .).
• We want a tree that is consistent with the examples and is as small as possible. With some
simple heuristics, however, we can find a good approximate solution: a small consistent tree.
• The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer strategy:
always test the most important attribute first; this test divides the problem into smaller
subproblems that can then be solved recursively.
• The DECISION-TREE-LEARNING algorithm is shown in Figure 18.5.
• The details of the IMPORTANCE function (i.e., how to pick attributes) are discussed in the next topic.
• The output of the learning algorithm on our sample training set is shown in Figure 18.6.
• The tree is clearly different from the original tree shown in Figure 18.2. One might conclude that
the learning algorithm is not doing a very good job of learning the correct function. This would
be the wrong conclusion to draw, however.
• The learning algorithm looks at the examples, not at the correct function, and in fact, its
hypothesis (see Figure 18.6) not only is consistent with all the examples, but is considerably
simpler than the original tree! In that case it says not to wait when Hungry is false, but I (SR)
would certainly wait. With more training examples the learning program could correct this
mistake.

Choosing attribute tests for Decision Tree Algorithm:


• The greedy search used in decision tree learning is designed to approximately minimize the
depth of the final tree. The idea is to pick the attribute that goes as far as possible toward
providing an exact classification of the examples.
• A perfect attribute divides the examples into sets, each of which is all positive or all negative,
and thus will be leaves of the tree. The Patrons attribute is not perfect, but it is fairly good. All
we need, then, is a formal measure of “fairly good” and “really useless”, and we can implement
the IMPORTANCE function of Figure 18.5.
• We will use the notion of information gain, which is defined in terms of entropy, the
fundamental quantity in information theory.
• A flip of a fair coin is equally likely to come up heads or tails, 0 or 1, and we will soon show that
this counts as “1 bit” of entropy.
Calculating the entropy of the restaurant training set in Figure 18.3:
• It has p = n = 6, so the corresponding entropy is B(0.5), or exactly 1 bit.
• The information gain is a key concept in decision tree learning, used to decide which attribute to
split the data on at each step of building the tree. It measures the reduction in entropy (or
uncertainty) achieved by partitioning the data according to a given attribute. The formula for
information gain is:

Gain(A) = B(p/(p + n)) − Remainder(A), where Remainder(A) = Σk ((pk + nk)/(p + n)) B(pk/(pk + nk))

where:
• B(q) = −(q log2 q + (1 − q) log2(1 − q)) is the entropy of a Boolean variable that is true with probability q;
• B(p/(p + n)) is the entropy of the entire set before the split;
• Remainder(A) is the weighted sum of the entropies of the subsets after the split on attribute A.

Let's break down the calculation for the attributes "Patrons" and "Type". From
these calculations, "Patrons" has an information gain of approximately 0.541 bits, while "Type" has an
information gain of 0 bits. Therefore, "Patrons" is the better attribute to split on, as it yields the greater
reduction in entropy. This aligns with the intuition that "Patrons" is the more informative attribute for the
decision tree learning algorithm.
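A small Python sketch of these two calculations; the per-value positive/negative counts below are read off the restaurant training set of Figure 18.3:

import math

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q == 0 or q == 1:
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gain(p, n, splits):
    """splits: one (pk, nk) pair of positive/negative counts per value of A."""
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk))
                    for pk, nk in splits if pk + nk > 0)
    return B(p / (p + n)) - remainder

# 12 examples in total: 6 positive, 6 negative.
print(gain(6, 6, [(0, 2), (4, 0), (2, 4)]))          # Patrons -> ~0.541 bits
print(gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))  # Type    -> 0.0 bits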
A LOGICAL FORMULATION OF LEARNING

• Pure inductive learning is the process of finding a hypothesis that agrees with the observed
examples.
• Here, we specialize this definition to the case where the hypothesis is represented by a set of
logical sentences.
• Example descriptions and classifications will also be logical sentences, and a new example can
be classified by inferring a classification sentence from the hypothesis and the example
description.
• This approach allows for incremental construction of hypotheses, one sentence at a time.
• It also allows for prior knowledge, because sentences that are already known can assist in the
classification of new examples.
• The logical formulation of learning may seem like a lot of extra work at first, but it turns out to
clarify many of the issues in learning.

Examples and hypotheses

• Recall the restaurant learning problem: learning a rule for deciding whether to wait for a table.
• Examples were described by attributes such as Alternate, Bar , Fri/Sat ,and so on.
• In a logical setting, an example is described by a logical sentence; the attributes become unary
predicates. Let us generically call the ith example Xi. For instance, the first example from is
described by the sentences
• Alternate (X1) ∧ ¬Bar (X1) ∧ ¬Fri/Sat (X1) ∧ Hungry(X1) ∧ ...
• We will use the notation Di(Xi) to refer to the description of Xi, where Di can be any logical
expression taking a single argument. The classification of the example is given by a literal using
the goal predicate, in this case
• WillWait(X1) or ¬WillWait(X1) .
• The complete training set can thus be expressed as the conjunction of all the example
descriptions and goal literals.
• The aim of inductive learning in general is to find a hypothesis that classifies the examples well
and generalizes well to new examples. Here we are concerned with hypotheses expressed in
logic; each hypothesis hj will have the form
• ∀ x Goal (x) ⇔ Cj(x) ,
• where Cj(x) is a candidate definition—some expression involving the attribute predicates.
• For example, a decision tree can be interpreted as a logical expression of this form. Thus, the tree
expresses the following logical definition (which we will call hr for future reference):
∀r WillWait(r) ⇔ Patrons(r, Some)
∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, French))
∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, Thai) ∧ Fri/Sat(r))
∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, Burger)) .
Current-best-hypothesis search
• The idea behind current-best-hypothesis search is to maintain a single hypothesis, and to adjust it
as new examples arrive in order to maintain consistency.

We can now specify the CURRENT-BEST-LEARNING algorithm, shown in Figure 19.2. Notice that
each time we consider generalizing or specializing the hypothesis, we must check for consistency with
the other examples, because an arbitrary increase/decrease in the extension might include/exclude
previously seen negative/positive examples.

Knowledge in learning
This simple knowledge-free picture of inductive learning persisted until the early 1980s.
The modern approach is to design agents that already know something and are trying to learn some more.
This may not sound like a terrifically deep insight, but it makes quite a difference to the way we design
agents. It might also have some relevance to our theories about how science itself works. The general
idea is shown schematically in the figure.
Logical formulation of learning
Pure inductive learning is the process of finding a hypothesis that agrees with the observed examples.
Here, we specialize this definition to the case where the hypothesis is represented by a set of logical
sentences.
Examples and hypotheses
The restaurant learning problem: learning a rule for deciding whether to wait for a table. Examples were
described by attributes such as Alternate, Bar, Fri/Sat, and so on.
For example, a decision tree asserts that the goal predicate is true of an object if and only if one of the
branches leading to true is satisfied.
Each hypothesis predicts that a certain set of examples—namely, those that satisfy its candidate definition—
will be examples of the goal predicate. This set is called the extension of the predicate. Two hypotheses
with different extensions are therefore logically inconsistent with each other, because they disagree on
their predictions for at least one example.
Logically, this is exactly analogous to the resolution rule of inference. We therefore can characterize
inductive learning in a logical setting as a process of gradually eliminating hypotheses that are
inconsistent with the examples, narrowing down the possibilities.
Explanation-based learning
Explanation-based learning (EBL) is a technique used to derive general rules from specific examples.
Let's break it down using an example from algebra:
Example: Differentiation and Simplification
1. Specific Case: Differentiating X² with respect to X results in 2X.
• Manual Process: An experienced mathematician can do this quickly, but a novice or a
program with no experience might struggle. Applying the standard differentiation rules
leads to the expression 1 × (2 × X^(2−1)), which simplifies to 2X. However, a
program might take many steps (e.g., 136 proof steps) to reach this result, many of
them dead ends (99 of the steps).
2. Memoization vs. EBL:
• Memoization: This technique stores the results of computations to speed up future
calculations. It would remember that the derivative of X² with respect to X is 2X, but it
wouldn't help with Z² unless that specific calculation had been stored.
• EBL: Goes further by extracting a general rule from the example. It learns that for any
variable u, the derivative of u² with respect to u is 2u. This is expressed logically as:
ArithmeticUnknown(u) ⇒ Derivative(u², u) = 2u
With this rule, any similar problem can be solved instantly without redoing the entire differentiation
process.
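An illustrative Python sketch of the contrast (the representation of expressions as strings is our own simplification):

# Memoization stores one answer per exact input ...
memo = {("X**2", "X"): "2*X"}
print(memo.get(("Z**2", "Z")))    # None: the stored result does not transfer

# ... while the EBL-derived rule covers the whole class of problems.
def derivative_of_square(u):
    """General rule: Derivative(u**2, u) = 2*u for any unknown u."""
    return "2*" + u

print(derivative_of_square("Z"))  # 2*Z: solved instantly by the general rule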
Generalization Process
EBL generalizes by creating rules that can apply to an entire class of problems:

1. Given an example, construct a proof that the goal predicate applies to the example using the available
background knowledge
2. In parallel, construct a generalized proof tree for the variabilized goal using the same inference steps
as in the original proof.
3. Construct a new rule whose left-hand side consists of the leaves of the proof tree and whose right-
hand side is the variabilized goal (after applying the necessary bindings from the generalized proof).
4. Drop any conditions from the left-hand side that are true regardless of the values of the variables in
the goal.
Improving efficiency
There are three factors involved in the analysis of efficiency gains from EBL:
1. Adding large numbers of rules can slow down the reasoning process, because the inference
mechanism must still check those rules even in cases where they do not yield a solution. In other words,
it increases the branching factor in the search space.
2. To compensate for the slowdown in reasoning, the derived rules must offer significant increases in
speed for the cases that they do cover. These increases come about mainly because the derived rules
avoid dead ends that would otherwise be taken, but also because they shorten the proof itself.
3. Derived rules should be as general as possible, so that they apply to the largest possible set of cases.
By generalizing from past example problems, EBL makes the knowledge base more efficient for the
kind of problems that it is reasonable to expect.

LEARNING USING RELEVANT INFORMATION


• Some sentences express a strict form of relevance: given nationality, language is fully determined. (Put
another way: language is a function of nationality.) Such sentences are called functional
dependencies or determinations. They occur so commonly in certain kinds of applications
(e.g., defining database designs) that a special syntax is used to write them. We adopt the
notation of Davies (1985):
Nationality(x, n) ≻ Language(x, l)
• Determining the hypothesis space
• Determinations specify a sufficient basis vocabulary from which to construct hypotheses
concerning the target predicate.
• This statement can be proven by showing that a given determination is logically equivalent to a
statement that the correct definition of the target predicate is one of the set of all definitions
expressible using the predicates on the left-hand side of the determination.
INDUCTIVE LOGIC PROGRAMMING

• Inductive logic programming (ILP) combines inductive methods with the power of first-order
logic, representing hypotheses as logic programs. ILP is popular because:
• Rigorous Approach: It provides a rigorous method for knowledge-based inductive learning.
• Complete Algorithms: ILP can induce general, first-order theories from examples, making it
effective in domains where attribute-based algorithms struggle. For instance, learning how
proteins fold involves complex relationships that are well-suited to first-order logic.
• Readable Hypotheses: The hypotheses produced by ILP are relatively easy for humans to
understand, allowing for scientific scrutiny and participation in the scientific process.
• Example: Learning Family Relationships. The goal of ILP can be understood through the problem
of learning family relationships. Given a family tree with relations like "Mother," "Father,"
and "Married," and properties like "Male" and "Female," an ILP system can induce hypotheses
about concepts like "Grandparent."
For example:
• Background Knowledge: Relationships like Father(Philip, Charles) and Mother(Mum,
Elizabeth).
• Descriptions: The extended family tree.
• Classifications: Instances of the target concept, such as Grandparent(Elizabeth, Beatrice).
• An ILP system would generate hypotheses to satisfy these entailments.
• For instance:
• Hypothesis: Grandparent(x, y) ⇔ ∃z (Parent(x, z) ∧ Parent(z, y))
• This hypothesis states that for someone to be a grandparent, there must be an intermediary
parent. ILP algorithms can create new predicates like "Parent" to simplify hypotheses, a process
known as constructive induction.
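A tiny Python sketch of checking this induced hypothesis against background facts (the specific Parent facts below are illustrative assumptions):

# Background knowledge: Parent(x, y) facts, assumed for illustration.
parent = {("Philip", "Charles"), ("Elizabeth", "Charles"),
          ("Elizabeth", "Andrew"), ("Andrew", "Beatrice")}

def grandparent(x, y):
    """Grandparent(x, y) <=> exists z: Parent(x, z) and Parent(z, y)."""
    middles = {child for _, child in parent}
    return any((x, z) in parent and (z, y) in parent for z in middles)

print(grandparent("Elizabeth", "Beatrice"))  # True, via z = Andrew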

Top-down inductive learning methods


• The first approach to ILP works by starting with a very general rule and gradually specializing it
so that it fits the data.
• This is essentially what happens in decision-tree learning, where a decision tree is gradually
grown until it is consistent with the observations.
• To do ILP we use first-order literals instead of attributes, and the hypothesis is a set of clauses
instead of a decision tree.
• Three kinds of literals
• 1. Literals using predicates
• 2. Equality and inequality literals
• 3. Arithmetic comparisons

Inductive learning with inverse deduction


The second major approach to ILP involves inverting the normal deductive proof process.
Inverse resolution is based on the observation that if the example classifications follow from
Background ∧ Hypothesis ∧ Descriptions, then one must be able to prove this fact by resolution; by
running the proof backward, we can find a hypothesis that makes the proof go through.
Recall that an ordinary resolution step takes two clauses C1 and C2 and resolves them to produce the
resolvent C.
An inverse resolution step takes a resolvent C and produces two clauses C1 and C2, such that C is the
result of resolving C1 and C2.
Alternatively, it may take a resolvent C and clause C1 and produce a clause C2 such that C is the result
of resolving C1 and C2.
A number of approaches to taming the search have been implemented in ILP systems:
1. Redundant choices can be eliminated
2. The proof strategy can be restricted
3. The representation language can be restricted
4. Inference can be done with model checking rather than theorem proving
5. Inference can be done with ground propositional clauses rather than in first-order logic.
