Unit 5
ARTIFICIAL INTELLIGENCE
UNIT - V
Uncertain knowledge and Learning Uncertainty: Acting under Uncertainty, Basic
Probability Notation, Inference Using Full Joint Distributions, Independence, Bayes’
Rule and Its Use,
Probabilistic Reasoning: Representing Knowledge in an Uncertain Domain, The
Semantics of Bayesian Networks, Efficient Representation of Conditional Distributions,
Approximate Inference in Bayesian Networks, Relational and First-Order Probability,
Other Approaches to Uncertain Reasoning; Dempster-Shafer theory.
Learning: Forms of Learning, Supervised Learning, Learning Decision Trees.
Knowledge in Learning: Logical Formulation of Learning, Knowledge in Learning,
Explanation-Based Learning, Learning Using Relevance Information, Inductive Logic
Programming.
ACTING UNDER UNCERTAINTY
• Agents may need to handle uncertainty arising from:
• Partial observability: which state has the agent reached?
• Nondeterminism: where will the agent end up after a sequence of
actions?
• A combination of the two
• Such agents are designed to handle uncertainty by:
• Keeping track of a belief state
• Generating a contingency plan to reach the goal.
• Example of uncertain reasoning: diagnosing a toothache. The rule Toothache ⇒ Cavity is wrong,
since not all toothaches are caused by cavities. We can invert the rule:
Cavity ⇒ Toothache
This rule isn't right either; not all cavities cause pain.
• The tool for dealing with degrees of belief is probability theory.
• Probability can be used as a way to summarize the data and make predictions in
the face of uncertainty.
• How logic and probability theory are the same: the world is composed of facts
that hold or do not hold in any particular case.
• How they are different:
A logical agent believes each sentence is true or false, or has no opinion.
A probabilistic agent has a numerical degree of belief, ranging from 0 (certainly false) to 1 (certainly true).
• We might not know for sure what is wrong with a particular patient
• We have a belief: we believe with an 80% chance (probability of 0.8) that a patient with
a toothache has a cavity, so belief can be derived from statistical data
• 80% of toothache patients also had a cavity
• Probability statements are made with respect to a knowledge state, not with respect to the real
world
• The probability that a patient has a cavity, given that she has a toothache, is 0.8
• The probability that a patient has a cavity, given that she has a toothache and a history
of gum disease, is 0.4
(b) Uncertainty and Rational Decisions:
• Which of these plans to go to the airport is a rational choice?
• \(A_{90}\) (90 minutes before), \(A_{180}\) (180 minutes before), or \(A_{1440}\) (1
day before)
• It depends on our preferences; an agent must have preferences
• We want to arrive on time without waiting too long
• Trade-off: an intolerably long wait at the airport (and airport food) versus the risk of missing the flight
• We can use utility theory and decision theory to design a rational agent
• Utility Theory:
• To reason and represent preferences
• In utility theory, every state has a degree of usefulness or utility
• An agent prefers states with high utility
• Decision theory
• Combines preferences (expressed with utilities) with probabilities in the theory of
rational decisions
• Decision theory = probability theory + utility theory
• Rational agent in decision theory
• An agent is rational iff it chooses the action that yields the highest expected utility,
averaged over all the possible outcomes of the action
• This is the maximum expected utility (MEU) principle
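To make the MEU principle concrete, here is a minimal Python sketch for the airport example. All probabilities and utilities below are hypothetical numbers invented purely for illustration.

```python
# Minimal sketch of the maximum expected utility (MEU) principle for the
# airport plans. Probabilities and utilities are hypothetical.

# For each plan: list of (P(outcome | plan), U(outcome)) pairs.
plans = {
    "A_90":   [(0.95, 100), (0.05, -1000)],       # on time vs. miss the flight
    "A_180":  [(0.99,  70), (0.01, -1000)],       # safer, but a longer wait
    "A_1440": [(0.9999, -200), (0.0001, -1000)],  # almost certain, intolerable wait
}

def expected_utility(outcomes):
    """Expected utility = sum over outcomes of P(outcome) * U(outcome)."""
    return sum(p * u for p, u in outcomes)

# MEU principle: choose the plan with the highest expected utility.
best = max(plans, key=lambda a: expected_utility(plans[a]))
for a, outcomes in plans.items():
    print(a, expected_utility(outcomes))
print("MEU choice:", best)  # A_180 under these made-up numbers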
Random variables
Events
• For example, P(Weather , Cavity) denotes the probabilities of all combinations of the values of
Weather and Cavity.
• This is a 4×2 table of probabilities called the joint probability
distribution of Weather and Cavity.
We can also mix variables with and without values; P(sunny, Cavity) would be a two-element vector
giving the probabilities of a sunny day with a cavity and a sunny day with no cavity. The P notation
makes certain expressions much more concise than they might otherwise be. For example, the product
rules for all possible values of Weather and Cavity can be written as a single equation:
P(Weather, Cavity) = P(Weather | Cavity) P(Cavity) .
For example, there are six possible worlds in which cavity ∨ toothache holds:
P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Marginal probability
• One particularly common task is to extract the distribution over some subset of variables or a
single variable. For example, adding the entries in the first row gives the unconditional or
marginal probability of cavity:
P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
Conditional probability
• Conditional probabilities are defined in terms of joint probabilities:
P(a | b) = P(a ∧ b) / P(b), which can be rewritten as the product rule: P(a ∧ b) = P(a | b) P(b).
• A variant of the marginalization rule involves conditional probabilities instead of joint
probabilities, using the product rule: P(Y) = Σz P(Y | z) P(z). This rule is called conditioning.
• For example, we can compute the probability of a cavity, given evidence of a toothache, as
follows:
P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache) = (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6
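The following short Python sketch reproduces these full-joint-distribution computations, using the textbook's table for Toothache, Catch, and Cavity:

```python
# Minimal sketch of inference by enumeration over the full joint
# distribution P(Toothache, Catch, Cavity), keyed by (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(pred):
    """Sum the probabilities of all worlds satisfying the predicate."""
    return sum(p for world, p in joint.items() if pred(*world))

# Marginal: P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
print(prob(lambda t, c, cav: cav))

# P(cavity or toothache) = 0.28
print(prob(lambda t, c, cav: cav or t))

# Conditional: P(cavity | toothache) = P(cavity and toothache) / P(toothache) = 0.6
print(prob(lambda t, c, cav: cav and t) / prob(lambda t, c, cav: t))
```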
Independence:
• Let us expand the full joint distribution in Figure 13.3 by adding a fourth variable, Weather .
• The full joint distribution then becomes P(Toothache, Catch, Cavity,Weather ), which has
2 × 2 × 2 × 4 = 32 entries.
• For example, how are P(toothache, catch, cavity, cloudy) and P(toothache, catch, cavity)
related?
• We can use the product rule:
P(toothache, catch, cavity, cloudy) = P(cloudy | toothache, catch, cavity)P(toothache, catch, cavity) .
• Now, unless one is in the deity business, one should not imagine that one's dental problems
influence the weather. Therefore, the following assertion seems reasonable:
P(cloudy | toothache, catch, cavity) = P(cloudy) .
• From this, we can deduce
P(toothache, catch, cavity, cloudy) = P(cloudy)P(toothache, catch, cavity) .
• A similar equation exists for every entry in P(Toothache, Catch, Cavity,Weather ). In fact, we can
write the general equation
P(Toothache, Catch, Cavity,Weather) = P(Toothache, Catch, Cavity)P(Weather ) .
• Thus, the 32-element table for four variables can be constructed from one 8-element table and one
4-element table. This decomposition is illustrated schematically in Figure 13.4(a).
In particular, the weather is independent of one's dental problems. Independence between propositions a
and b can be written as:
P(a | b) = P(a)   or   P(b | a) = P(b)   or   P(a ∧ b) = P(a) P(b) .
BAY ’ R L AND IT
• Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
• In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
• Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.
• Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
• Example: If the probability of cancer depends on a person's age, then by using Bayes' theorem we can
determine the probability of cancer more accurately with the help of age.
Bayes' rule is stated as:
P(A|B) = P(B|A) P(A) / P(B)    ... (a)
• P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of
hypothesis A given that evidence B has occurred.
• P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate
the probability of the evidence.
• P(A) is called the prior probability: the probability of the hypothesis before considering the evidence.
• In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai), hence Bayes' rule can be
written as:
P(Ai|B) = P(B|Ai) P(Ai) / Σk P(Ak) P(B|Ak)
• where A1, A2, A3, ......, An is a set of mutually exclusive and exhaustive events.
A B ’ T
• It allows us to compute the single term P(b | a) in terms of three terms: P(a | b), P(b), and P(a).
• Example :
• A doctor is aware that the disease meningitis causes a patient to have a stiff neck 70%
of the time. He is also aware of some more facts, which are given as follows:
(a) The known prior probability that any patient has meningitis is 1/50,000.
(b) The known probability that a patient has a stiff neck is 0.01.
• Then P(m | s) = P(s | m) P(m) / P(s) = (0.7 × 1/50000) / 0.01 = 0.0014.
• That is, we expect less than 1 in 700 patients with a stiff neck to have meningitis.
• Notice that even though a stiff neck is quite strongly indicated by meningitis (with probability
0.7), the probability of meningitis in the patient remains small. This is because the prior
probability of stiff necks is much higher than that of meningitis
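A one-line check of this computation as a Python sketch, using only the figures given above:

```python
# Sketch of the meningitis computation via Bayes' rule.
p_s_given_m = 0.7        # P(stiff neck | meningitis)
p_m = 1 / 50000          # prior P(meningitis)
p_s = 0.01               # P(stiff neck)

# Bayes' rule: P(m | s) = P(s | m) * P(m) / P(s)
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0014, i.e. about 1 in 714 patients
```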
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model. Bayesian
networks are probabilistic, because these networks are built from a probability distribution, and they also
use probability theory for prediction and anomaly detection. Real-world applications are probabilistic in
nature, and to represent the relationship between multiple events, we need a Bayesian network. It can
also be used in various tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it
consists of two parts:
• Directed Acyclic Graph
• Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain
knowledge is known as an influence diagram. A Bayesian network graph is made up of nodes and arcs
(directed links), where:
o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs (directed arrows) represent causal relationships or conditional dependencies between random
variables. These directed links connect pairs of nodes in the graph; a link means that one node
directly influences the other, and the absence of a directed link means the nodes are independent
of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the network
graph.
o If we are considering node B, which is connected with node A by a directed arrow, then node A is
called the parent of Node B.
The joint probability distribution over the variables x1, x2, x3, ..., xn can be written as a product
of conditional probabilities, one per node:
P(x1, x2, ..., xn) = Π i P(xi | parents(Xi))
In general, for each variable Xi we can write the equation as: P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds
reliably to a burglary but also responds to minor earthquakes. Harry has two neighbors,
David and Sophia, who have taken responsibility for informing Harry at work when they hear the
alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the
telephone ringing with the alarm and calls then too. On the other hand, Sophia likes to listen to loud
music, so sometimes she fails to hear the alarm. Here we would like to compute the probability of the
burglary alarm event:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has
occurred, and both David and Sophia called Harry.
Solution:
• The Bayesian network for the above problem is given below. The network structure shows
that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability
of the alarm going off, whereas David's and Sophia's calls depend only on the alarm.
• The network thus represents our assumptions that the neighbors do not perceive the burglary
directly, do not notice minor earthquakes, and do not confer before calling.
• The conditional distribution for each node is given as a conditional probability table, or CPT.
• Each row in a CPT must sum to 1, because the entries in a row represent an exhaustive
set of cases for the variable.
• In a CPT, a Boolean variable with k Boolean parents requires 2^k rows. Hence, if there are
two parents, the CPT will contain 4 rows of probability values.
The solution for the given example is as follows:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint distribution.
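The same calculation as a short Python sketch, reading each factor from the CPT values used above:

```python
# Sketch: evaluating one entry of the joint distribution from the CPTs of
# the burglary network, P(S, D, A, not B, not E).
p_not_b = 0.998           # P(no burglary)
p_not_e = 0.999           # P(no earthquake)
p_a = 0.001               # P(alarm | no burglary, no earthquake)
p_d_given_a = 0.91        # P(David calls | alarm)
p_s_given_a = 0.75        # P(Sophia calls | alarm)

# Chain rule for Bayesian networks: each factor conditions only on the parents.
p = p_s_given_a * p_d_given_a * p_a * p_not_b * p_not_e
print(p)                  # about 0.00068045
```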
• where parents(Xi) denotes the values of Parents(Xi) that appear in x1, ..., xn. Thus, each entry in
the joint distribution is represented by the product of the appropriate elements of the conditional
probability tables (CPTs) in the Bayesian network.
• From this definition, it is easy to prove that the parameters θ(Xi | Parents(Xi)) are exactly the
conditional probabilities P(Xi | Parents(Xi)) implied by the joint distribution (see Exercise 14.2).
Hence, we can rewrite Equation (14.1) as:
P(x1, ..., xn) = Π i P(xi | parents(Xi))
• To illustrate this, we can calculate the probability that the alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John and Mary call (this is the textbook's
version of the example, with John and Mary in place of David and Sophia). We multiply entries
from the joint distribution (using single-letter names for the variables):
P(j, m, a, ¬b, ¬e) = P(j | a) P(m | a) P(a | ¬b ∧ ¬e) P(¬b) P(¬e)
(Note: this applies the chain rule for Bayesian networks, in which each factor conditions only on the
variable's parents: P(j | a) conditions only on Alarm, which is John's parent, and P(a | ¬b ∧ ¬e)
conditions on Burglary and Earthquake, which are the Alarm's parents.)
• Even if the maximum number of parents k is smallish, filling in the CPT for a node requires up
to O(2^k) numbers and perhaps a great deal of experience with all the possible conditioning cases.
• In fact, this is a worst-case scenario in which the relationship between the parents and the child
is completely arbitrary. Usually, such relationships are describable by a canonical distribution
that fits some standard pattern.
• Canonical Distributions: In many real-world scenarios, the relationship between a node and its
parents can be described by a standard or canonical distribution, such as noisy-OR, noisy-AND,
or logistic regression models. These canonical forms reduce the complexity of specifying the
CPT.
• Noisy-OR Model: This is commonly used for situations where multiple causes independently
contribute to an effect. For example, if any of several conditions can independently cause a
disease, the noisy-OR model can describe the probability of the disease given the presence or
absence of each condition with a small number of parameters.
Example: we might say that Fever is true if and only if Cold, Flu, or Malaria is true.
• The noisy-OR model assumes that whatever inhibits each parent from causing the effect is
independent of the inhibitors of the other parents: for example, whatever inhibits Malaria from
causing a fever is independent of whatever inhibits Flu from causing a fever.
• Given these assumptions, Fever is false if and only if all its true parents are inhibited, and the
probability of this is the product of the inhibition probabilities q for each parent.
The CPT for Fever given Malaria, Flu, and Cold has 2^3 = 8 conditioning combinations. To fill in
all 8 combinations, we apply the noisy-OR assumptions, which are as follows:
• Let us suppose these individual inhibition probabilities are as follows:
qcold = P (¬fever | cold , ¬flu, ¬malaria ) = 0.6 ,
qflu = P (¬fever | ¬cold , flu, ¬malaria ) = 0.2 ,
qmalaria = P (¬fever | ¬cold , ¬flu, malaria ) = 0.1 .
For the problem as posed, we need only the fact that Fever is false if and only if all its true parents are inhibited.
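A short sketch of how the noisy-OR assumptions generate the full 8-row CPT from just the three q parameters given above:

```python
# Sketch: reconstructing P(fever | Cold, Flu, Malaria) from the three
# noisy-OR inhibition parameters.
from itertools import product

q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}   # inhibition probabilities

for cold, flu, malaria in product([False, True], repeat=3):
    # Fever is false iff every true parent is inhibited, so
    # P(not fever) is the product of the q's of the true parents.
    p_not_fever = 1.0
    if cold:    p_not_fever *= q["cold"]
    if flu:     p_not_fever *= q["flu"]
    if malaria: p_not_fever *= q["malaria"]
    print(cold, flu, malaria, 1 - p_not_fever)  # P(fever | parents)
```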
• In general, noisy logical relationships in which a variable depends on k parents can be described
using O(k) parameters instead of O(2^k) for the full conditional probability table.
• This makes assessment and learning much easier. For example, a medical-diagnosis network with
448 nodes and 906 links requires only 8,254 values instead of 133,931,430 for a network with full CPTs.
We can illustrate its operation on the network in Figure 14.12(a), assuming an ordering [Cloudy,
Sprinkler, Rain, WetGrass]:
1. Sample from P(Cloudy) = ⟨0.5, 0.5⟩; value is true.
2. Sample from P(Sprinkler | Cloudy = true) = ⟨0.1, 0.9⟩; value is false.
3. Sample from P(Rain | Cloudy = true) = ⟨0.8, 0.2⟩; value is true.
4. Sample from P(WetGrass | Sprinkler = false, Rain = true) = ⟨0.9, 0.1⟩; value is true.
In this case, PRIOR-SAMPLE returns the event [true, false, true, true].
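A minimal Python sketch of PRIOR-SAMPLE for this network. The CPT entries not quoted in the walkthrough (the ¬cloudy cases and the remaining WetGrass rows) are the standard sprinkler-network values and should be treated as assumptions here:

```python
# Sketch of PRIOR-SAMPLE: sample each variable in topological order,
# conditioned on the values already drawn for its parents.
import random

def prior_sample():
    cloudy = random.random() < 0.5                           # P(Cloudy) = <0.5, 0.5>
    sprinkler = random.random() < (0.1 if cloudy else 0.5)   # P(Sprinkler | Cloudy)
    rain = random.random() < (0.8 if cloudy else 0.2)        # P(Rain | Cloudy)
    # P(WetGrass | Sprinkler, Rain); the (False, True) entry is the 0.9 from step 4.
    p_wet = {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.0}[(sprinkler, rain)]
    wet = random.random() < p_wet
    return cloudy, sprinkler, rain, wet

print(prior_sample())   # e.g. (True, False, True, True)
```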
Note: refer to the CPT for WetGrass given Sprinkler and Rain to see why the value 0.9 is used in
step 4.
(a) Rejection Sampling
FORMS OF LEARNING
Any component of an agent can be improved by learning from data. The improvements, and the
techniques used to make them, depend on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
Consider, for example, an agent training to become a taxi driver.
(1) Every time the instructor shouts "Brake!" the agent might learn a condition–action rule for when
to brake.
(2) By seeing many camera images that it is told contain buses, it can learn to recognize them.
(3) By trying actions and observing the results—for example, braking hard on a wet road—it can
learn the effects of its actions.
(4) Then, when it receives no tip from passengers who have been thoroughly shaken up during the
trip, it can learn a useful component of its overall utility function.
Representation and prior knowledge
• We have seen several examples of representations for agent components: propositional
logic, first-order logical sentences, Bayesian networks (a factored representation for
decision-theoretic agents), and so on.
• There is another way to look at the various types of learning. Learning a general function or
rule from specific input–output pairs is called inductive learning. Another type,
analytical or deductive learning, goes from a known general rule to a new rule that is
logically entailed; it is useful for more efficient processing.
There are three main types of learning:
• In unsupervised learning the agent learns patterns in the input even though no explicit feedback
is supplied. The most common unsupervised learning task is clustering: detecting useful
clusters from input examples.
For example, a taxi agent might gradually develop a concept of "good traffic days"
and "bad traffic days" without ever being given labeled examples of each by a teacher.
• In reinforcement learning the agent learns from a series of reinforcements—rewards or
punishments.
For example, the lack of a tip at the end of the journey gives the taxi agent an indication
that it did something wrong. The two points for a win at the end of a chess game tells the
agent it did something right. It is up to the agent to decide which of the actions prior
to the reinforcement were most responsible for it.
• In supervised learning the agent observes some example input–output pairs and learns a
function that maps from input to output.
Example: In component (1) above, the inputs are percepts and the outputs are provided by a teacher
who says "Brake!" or "Turn left."
In component (2), the inputs are camera images and the outputs again come from a teacher who says
"that's a bus."
In component (3), the theory of braking is a function from states and braking actions to stopping
distance in feet. In this case the output value is available directly from the agent's percepts (after
the fact); the environment is the teacher.
SUPERVISED LEARNING
• In supervised learning, the machine is trained on a set of labeled data, which means that the input
data is paired with the desired output. The machine then learns to predict the output for new
input data. Supervised learning, as the name indicates, has the presence of a supervisor as a
teacher.
• Supervised learning is when we teach or train the machine using data that is well-labelled,
which means the data is already tagged with the correct answer.
• After that, the machine is provided with a new set of examples (data) so that the supervised
learning algorithm analyses the training data (the set of training examples) and produces a correct
outcome from labeled data.
The task of supervised learning is this:
Given a training set of N example input–output pairs (x1, y1), (x2, y2), . . . , (xN, yN), where each
yj was generated by an unknown function y = f(x), discover a function h that approximates the true
function f. Here x and y can be any values; they need not be numbers. The function h is called a hypothesis.
Example 1:
When the output y is one of a finite set of values (such as sunny, cloudy, or rainy), the learning problem
is called classification, and is called Boolean or binary classification if there are only two values. When
y is a number (such as tomorrow's temperature), the learning problem is called regression.
Example 2:
A labeled dataset of images of Elephant, Camel and Cow would have each image tagged with either
"Elephant", "Camel", or "Cow".
Key Points:
• Supervised learning involves training a machine from labeled data.
• Labeled data consists of examples with the correct answer or classification.
• The machine learns the relationship between inputs and outputs .
• The trained machine can then make predictions on new, unlabeled data.
Types of Supervised Learning Algorithm
• Supervised learning is typically divided into two main categories: regression and classification.
Regression:
• Regression is a supervised learning technique used to predict continuous numerical values based
on input features. It aims to establish a functional relationship between independent variables
and a dependent variable, such as predicting house prices based on features like size, bedrooms,
and location.
• The goal is to minimize the difference between predicted and actual values using algorithms like
Linear Regression, Decision Trees, or Neural Networks, ensuring the model captures underlying
patterns in the data. The following example shows this:
• Diagram : It is a Meteorological dataset that serves the purpose of predicting wind speed based
on different parameters.
• Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
• Output: Wind Speed
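As a minimal sketch of this kind of regression: the dataset values below are invented for illustration, and a NumPy least-squares fit stands in for the algorithms named above.

```python
# Sketch: fit a linear model y ~ Xw for the wind-speed example.
# All data values are hypothetical.
import numpy as np

# Columns: dew point, temperature, pressure, relative humidity, wind direction
X = np.array([[12.0, 25.0, 1012.0, 60.0, 180.0],
              [ 8.0, 18.0, 1018.0, 70.0,  90.0],
              [15.0, 30.0, 1008.0, 55.0, 270.0],
              [ 5.0, 10.0, 1022.0, 80.0,  45.0]])
y = np.array([14.0, 8.0, 18.0, 5.0])   # observed wind speed

# Least-squares fit (with a bias column) minimizes ||Xw - y||^2.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

new = np.array([10.0, 22.0, 1015.0, 65.0, 135.0, 1.0])
print(new @ w)    # predicted wind speed for a new day
```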
Classification
Classification is a type of supervised learning that categorizes input data into predefined labels. It
involves training a model on labeled examples to learn patterns between input features and output
classes. In classification, the target variable is a categorical value.
For example, classifying emails as spam or not.
The model's goal is to generalize this learning to make accurate predictions on new, unseen data.
Algorithms like Decision Trees, Support Vector Machines, and Neural Networks are commonly used for
classification tasks. The following example shows:
Diagram : It is a dataset of a shopping store that is useful in predicting whether a customer will
purchase a particular product under consideration or not based on his/ her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased, i.e., 0 or 1; 1 means the customer will purchase and 0 means the customer
won't purchase it.
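A minimal classification sketch for this shopping example. The training rows are invented for illustration, and scikit-learn is assumed to be available:

```python
# Sketch: predict Purchased (0/1) from gender, age, and salary
# with a small decision tree. Training data is hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Features: [gender (0=female, 1=male), age, salary]
X = [[1, 25, 30000], [0, 47, 90000], [1, 35, 60000],
     [0, 22, 20000], [1, 52, 110000], [0, 30, 40000]]
y = [0, 1, 1, 0, 1, 0]    # 1 = will purchase, 0 = won't

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(model.predict([[0, 40, 75000]]))   # predicted class for a new customer
```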
Decision Trees
A decision tree is a flowchart-like structure used to make decisions or predictions. It represents a
function that takes as input a vector of attribute values and returns a "decision"—a single output value.
The input and output values can be discrete or continuous. For now we will concentrate on problems
where the inputs have discrete values and the output has exactly two possible values; this is Boolean
classification, where each example input will be classified as true (a positive example) or false (a
negative example).
Structure of a Decision Tree
Root Node: Represents the entire dataset and the initial decision to be made.
Internal Nodes: Represent decisions or tests on attributes. Each internal node has one or more branches.
Branches: Represent the outcome of a decision or test, leading to another node.
Leaf Nodes: Represent the final decision or prediction. No further splits occur at these nodes.
Example: we will build a decision tree to decide whether to wait for a table at a restaurant. The aim here
is to reach the goal predicate WillWait . First we list the attributes that we will consider as part of the
input:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar : whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full ).
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.
The information gain from a test on attribute A is:
Gain(A) = B(p/(p+n)) − Remainder(A)
where:
• B(p/(p+n)) is the entropy of the entire set before the split.
• Remainder(A) is the weighted sum of the entropies after the split on attribute A:
Remainder(A) = Σk (pk + nk)/(p + n) · B(pk/(pk + nk))
Let's break down the calculation for the attributes "Patrons" and "Type". From
these calculations, "Patrons" has an information gain of approximately 0.541 bits, while "Type" has an
information gain of 0 bits. Therefore, "Patrons" is a better attribute to split on, as it results in a greater
reduction in entropy. This aligns with the intuition that "Patrons" is a more informative attribute for the
decision tree learning algorithm.
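The calculation can be checked with a short Python sketch using the textbook's 12 restaurant examples (6 positive, 6 negative); the per-value positive/negative counts below are the standard ones for this example:

```python
# Sketch: the information-gain computation behind "Patrons vs. Type".
from math import log2

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def gain(splits, p=6, n=6):
    """Gain(A) = B(p/(p+n)) - sum_k (pk+nk)/(p+n) * B(pk/(pk+nk))."""
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder

# Patrons: None -> (0+, 2-), Some -> (4+, 0-), Full -> (2+, 4-)
print(gain([(0, 2), (4, 0), (2, 4)]))   # about 0.541 bits

# Type: French (1+,1-), Italian (1+,1-), Thai (2+,2-), Burger (2+,2-)
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)]))   # 0 bits
```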
A LOGICAL FORMULATION OF LEARNING
• Pure inductive learning is the process of finding a hypothesis that agrees with the observed
examples.
• Here, we specialize this definition to the case where the hypothesis is represented by a set of
logical sentences.
• Example descriptions and classifications will also be logical sentences, and a new example can
be classified by inferring a classification sentence from the hypothesis and the example
description.
• This approach allows for incremental construction of hypotheses, one sentence at a time.
• It also allows for prior knowledge, because sentences that are already known can assist in the
classification of new examples.
• The logical formulation of learning may seem like a lot of extra work at first, but it turns out to
clarify many of the issues in learning.
• Recall the restaurant learning problem: learning a rule for deciding whether to wait for a table.
• Examples were described by attributes such as Alternate, Bar , Fri/Sat ,and so on.
• In a logical setting, an example is described by a logical sentence; the attributes become unary
predicates. Let us generically call the ith example Xi. For instance, the first example from is
described by the sentences
• Alternate (X1) ∧ ¬Bar (X1) ∧ ¬Fri/Sat (X1) ∧ Hungry(X1) ∧ ...
• We will use the notation Di(Xi) to refer to the description of Xi, where Di can be any logical
expression taking a single argument. The classification of the example is given by a literal using
the goal predicate, in this case
• WillWait(X1) or ¬WillWait(X1) .
• The complete training set can thus be expressed as the conjunction of all the example
descriptions and goal literals.
• The aim of inductive learning in general is to find a hypothesis that classifies the examples well
and generalizes well to new examples. Here we are concerned with hypotheses expressed in
logic; each hypothesis hj will have the form
• ∀ x Goal (x) ⇔ Cj(x) ,
• where Cj(x) is a candidate definition—some expression involving the attribute predicates.
• For example, a decision tree can be interpreted as a logical expression of this form. Thus, the tree
expresses the following logical definition (which we will call hr for future reference):
∀ r WillWait(r) ⇔ Patrons(r, Some)
∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, French))
∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, Thai) ∧ Fri/Sat(r))
∨ (Patrons(r, Full) ∧ Hungry(r) ∧ Type(r, Burger)) .
Current-best-hypothesis search
• The idea behind current-best-hypothesis search is to maintain a single hypothesis, and to adjust it
as new examples arrive in order to maintain consistency.
We can now specify the CURRENT-BEST-LEARNING algorithm, shown in Figure 19.2. Notice that
each time we consider generalizing or specializing the hypothesis, we must check for consistency with
the other examples, because an arbitrary increase/decrease in the extension might include/exclude
previously seen negative/positive examples.
Knowledge in learning
This simple knowledge-free picture of inductive learning persisted until the early 1980s.
The modern approach is to design agents that already know something and are trying to learn some more.
This may not sound like a terrifically deep insight, but it makes quite a difference to the way we design
agents.
It might also have some relevance to our theories about how science itself works. The general idea is
shown schematically in Figure .
Logical formulation of learning
Pure inductive learning is the process of finding a hypothesis that agrees with the observed examples.
Here, we specialize this definition to the case where the hypothesis is represented by a set of logical
sentences.
Examples and hypotheses
The restaurant learning problem: learning a rule for deciding whether to wait for a table. Examples were
described by attributes such as Alternate, Bar, Fri/Sat, and so on.
For example, a decision tree asserts that the goal predicate is true of an object if and only if one of the
branches leading to true is satisfied.
Each hypothesis predicts that a certain set of examples, namely those that satisfy its candidate definition,
will be examples of the goal predicate. This set is called the extension of the predicate. Two hypotheses
with different extensions are therefore logically inconsistent with each other, because they disagree on
their predictions for at least one example.
Logically, this is exactly analogous to the resolution rule of inference. We therefore can characterize
inductive learning in a logical setting as a process of gradually eliminating hypotheses that are
inconsistent with the examples, narrowing down the possibilities.
Explanation-based learning
Explanation-based learning (EBL) is a technique used to derive general rules from specific examples.
Let's break it down using an example from algebra:
Example: Differentiation and Simplification
1. Specific Case: Differentiating X^2 with respect to X results in 2X.
1. Manual Process: An experienced mathematician can do this quickly, but a novice or a
program with no experience might struggle. The standard differentiation rules can be
applied, leading to the expression 1 × (2 × (X^(2−1))), which simplifies to 2X. However, a
program might take many steps (e.g., 136 proof steps) to reach this result, with many
dead ends (99 steps).
2. Memoization vs. EBL:
1. Memoization: This technique stores results of computations to speed up future
calculations. It would remember that the derivative of X^2 with respect to X is 2X, but wouldn't
help with Z^2 unless that specific calculation was stored.
2. EBL: Goes further by extracting a general rule from the example. It learns that for any
arithmetic unknown u, the derivative of u^2 with respect to u is 2u. This is expressed logically as:
ArithmeticUnknown(u) ⇒ Derivative(u^2, u) = 2u
With this rule, any similar problem can be solved instantly without redoing the entire differentiation
process.
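A toy Python sketch of the contrast: the memo answers only the exact case it has seen, while the EBL-style rule covers every instance of the pattern (the string representation of expressions is an illustrative simplification):

```python
# Sketch contrasting memoization with an EBL-style general rule.
memo = {("X**2", "X"): "2*X"}       # memoization: stores one past result

def derivative_memoized(expr, var):
    return memo.get((expr, var))    # None unless this exact case was seen

def derivative_ebl(expr, var):
    # General rule extracted from the example: for any unknown u, d(u^2)/du = 2u.
    if expr == f"{var}**2":
        return f"2*{var}"
    return None

print(derivative_memoized("Z**2", "Z"))  # None -- the memo does not transfer
print(derivative_ebl("Z**2", "Z"))       # 2*Z -- the general rule applies
```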
Generalization Process
EBL generalizes by creating rules that can apply to an entire class of problems:
1. Given an example, construct a proof that the goal predicate applies to the example using the available
background knowledge
2. In parallel, construct a generalized proof tree for the variabilized goal using the same inference steps
as in the original proof.
3. Construct a new rule whose left-hand side consists of the leaves of the proof tree and whose right-
hand side is the variabilized goal (after applying the necessary bindings from the generalized proof).
4. Drop any conditions from the left-hand side that are true regardless of the values of the variables in
the goal.
Improving efficiency
There are three factors involved in the analysis of efficiency gains from EBL:
1. Adding large numbers of rules can slow down the reasoning process, because the inference
mechanism must still check those rules even in cases where they do not yield a solution. In other words,
it increases the branching factor in the search space.
2. To compensate for the slowdown in reasoning, the derived rules must offer significant increases in
speed for the cases that they do cover. These increases come about mainly because the derived rules
avoid dead ends that would otherwise be taken, but also because they shorten the proof itself.
3. Derived rules should be as general as possible, so that they apply to the largest possible set of cases.
By generalizing from past example problems, EBL makes the knowledge base more efficient for the
kind of problems that it is reasonable to expect.
• Inductive logic programming (ILP) combines inductive methods with the power of first-order
logic, representing hypotheses as logic programs. ILP is popular because:
• Rigorous Approach: It provides a rigorous method for knowledge-based inductive learning.
• Complete Algorithms: ILP can induce general, first-order theories from examples, making it
effective in domains where attribute-based algorithms struggle. For instance, learning how
proteins fold involves complex relationships that are well-suited to first-order logic.
• Readable Hypotheses: The hypotheses produced by ILP are relatively easy for humans to
understand, allowing for scientific scrutiny and participation in the scientific process.
• Example: Learning Family Relationships. The goal of ILP can be understood through the problem
of learning family relationships. Given a family tree with relations like "Mother," "Father,"
"Married," and properties like "Male" and "Female," the ILP system can induce hypotheses
about concepts like "Grandparent."
For example:
• Background Knowledge: Relationships like Father(Philip, Charles) and Mother(Mum,
Elizabeth).
• Descriptions: The extended family tree.
• Classifications: Instances of the target concept, such as Grandparent(Elizabeth, Beatrice).
• An ILP system would generate hypotheses to satisfy these entailments.
• For instance:
• Hypothesis: Grandparent(x, y) ⇔ ∃z (Parent(x, z) ∧ Parent(z, y))
• This hypothesis states that for someone to be a grandparent, there must be an intermediary
parent. ILP algorithms can create new predicates like "Parent" to simplify hypotheses, a process
known as constructive induction.
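A small Python sketch of what this hypothesis means operationally, checking the candidate definition against a handful of the background facts (only a subset of the family tree is included here):

```python
# Sketch: checking Grandparent(x, y) <=> exists z. Parent(x, z) and Parent(z, y)
# against a few background facts from the example.
parent = {("Elizabeth", "Charles"), ("Philip", "Charles"),
          ("Charles", "William"), ("Andrew", "Beatrice"),
          ("Elizabeth", "Andrew")}

def grandparent(x, y):
    """The candidate definition: some z is a child of x and a parent of y."""
    return any((x, z) in parent and (z, y) in parent
               for z in {c for _, c in parent})

print(grandparent("Elizabeth", "William"))   # True, via Charles
print(grandparent("Elizabeth", "Beatrice"))  # True, via Andrew
print(grandparent("Philip", "Beatrice"))     # False with these partial facts
```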