Discovering Models / Theories
cs365 2015 mukerjee
Domain Theories
Agent:
given percept history p ∈ P,
select a decision from the set of choices a ∈ A
so as to meet a goal g (performance) –
maximize a utility function U()
Requires knowledge of how actions under different percepts
affect the goal
Model or Theory
Task domains: a) 8-puzzle [deterministic], b) Soccer [stochastic]
8-puzzle
• Percept = state
• Actions = moves
• Goal : T/F (goal state reached?)
• Utility : number of moves
8-puzzle
• State = [7,2,4,5,B,6,8,3,1] (B = blank)
• Actions = L, R, U, D
• State + Action → new State
• Decision: based on Search [Informed / Uninformed]
Breadth-first search
• Expand shallowest unexpanded node
• Fringe: FIFO queue; new successors go at the end
O(b^{d+1})
Properties of breadth-first search
• Complete? Yes (if b is finite)
• Time? 1 + b + b^2 + b^3 + … + b^d + b(b^d − 1) = O(b^{d+1})
• Space? O(b^{d+1}) (keeps every node in memory)
• Optimal? Yes (if cost = 1 per step)
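A minimal Python sketch of breadth-first search with a FIFO fringe; the `successors` interface and state encoding here are assumptions for illustration:

```python
from collections import deque

def breadth_first_search(start, goal_test, successors):
    """BFS sketch: expand the shallowest unexpanded node.
    `successors(state)` is assumed to yield (action, next_state) pairs;
    states must be hashable."""
    fringe = deque([(start, [])])       # FIFO queue: new successors go at the end
    visited = {start}
    while fringe:
        state, path = fringe.popleft()  # shallowest node first
        if goal_test(state):
            return path
        for action, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                fringe.append((nxt, path + [action]))
    return None                         # no solution found
```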
Iterative-Deepening search
Cost-based search
• edges don't have equal cost
• like breadth-first, but expand the node
with the lowest path cost from START first
• Fringe: priority queue ordered by path cost
O(b^{1 + ⌊C*/ε⌋}), where C* is the optimal cost and ε the minimum step cost
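A corresponding sketch for cost-based (uniform-cost) search; the fringe is a priority queue keyed on path cost, with a counter to break ties, and `successors` is again an assumed interface:

```python
import heapq
import itertools

def uniform_cost_search(start, goal_test, successors):
    """Expand the node with the lowest path cost g(n) first.
    `successors(state)` is assumed to yield (action, next_state, step_cost)."""
    counter = itertools.count()             # tie-breaker: states are never compared
    fringe = [(0, next(counter), start, [])]
    best = {start: 0}                       # cheapest known cost to each state
    while fringe:
        g, _, state, path = heapq.heappop(fringe)
        if goal_test(state):
            return g, path
        for action, nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best.get(nxt, float("inf")):
                best[nxt] = g2
                heapq.heappush(fringe, (g2, next(counter), nxt, path + [action]))
    return None
```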
Soccer
• Percept = goalie, self, ball, angle θ,
+ wind, opponents, teammates…
• Actions = kick (angle, speed, swing)
• Utility : goal probability
Discrete-Deterministic Spaces:
Search
Uninformed search strategies
• Uninformed search strategies use only the
information available in the problem definition
• Breadth-first search
• Uniform-cost search
• Depth-first search
• Depth-limited search
• Iterative deepening search
Representing the state space
1. States: e.g. tile configurations of the 8-puzzle
2. Actions: e.g. moves L, R, U, D of the blank
3. Goal test: is the goal configuration reached? (T/F)
4. Cost: e.g. 1 per move
8-puzzle heuristics
Admissible:
• h1 : number of misplaced tiles = 6 (for the start and goal states shown)
• h2 : sum of Manhattan distances of the tiles
from their goal positions
= 0+0+1+1+2+3+1+3 = 11
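A small Python sketch of both heuristics; the row-major list encoding with 'B' for the blank and the goal ordering below are assumptions for illustration, so the example values need not match the figure:

```python
def h1(state, goal):
    """h1: number of misplaced tiles (the blank 'B' is not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 'B' and s != g)

def h2(state, goal):
    """h2: sum of Manhattan distances of the tiles from their goal
    positions, on a 3x3 board stored row-major as a list of 9 entries."""
    total = 0
    for i, tile in enumerate(state):
        if tile == 'B':
            continue
        j = goal.index(tile)
        total += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
    return total

start = [7, 2, 4, 5, 'B', 6, 8, 3, 1]
goal = [1, 2, 3, 4, 5, 6, 7, 8, 'B']   # assumed goal layout
print(h1(start, goal), h2(start, goal))
```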
8-puzzle heuristics
Nilsson’s Sequence Score
Score(n) = P(n) + 3 S(n)
P(n) : sum of Manhattan distances of each tile from
its proper position
S(n), the sequence score : check around the non-central
squares:
+2 for every tile not followed by its successor
0 for every other tile
a piece in the center scores +1
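A sketch of the sequence score, reusing h2 from above; the clockwise perimeter order and the 8-wraps-to-1 successor rule are assumptions, since conventions for S(n) vary between sources:

```python
# Clockwise order of the 8 non-central cells on a 3x3 board (row-major indices)
PERIMETER = [0, 1, 2, 5, 8, 7, 6, 3]

def nilsson_score(state, goal):
    """Nilsson's sequence score sketch: Score(n) = P(n) + 3*S(n).
    P(n): sum of Manhattan distances (h2 above).
    S(n): +2 for every perimeter tile whose clockwise neighbour is not
    its numerical successor; +1 if the centre square is occupied."""
    s = 0
    for k, idx in enumerate(PERIMETER):
        tile = state[idx]
        if tile == 'B':
            continue
        succ = 1 if tile == 8 else tile + 1
        nxt = state[PERIMETER[(k + 1) % 8]]
        if nxt != succ:
            s += 2
    if state[4] != 'B':   # a piece in the centre scores +1
        s += 1
    return h2(state, goal) + 3 * s
```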
Stochastic Spaces
Soccer
Soccer : Shooting at goal
[acharya mukerjee 01]
Soccer : Shoot, Pass, dribble, or … ?
Handwritten digits - MNIST
Confusion matrix
Discovering theories
Continuous Data
Discrete Attribute data
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait at a restaurant:
• Classification of examples is positive (T) or negative (F)
Discrete Features
• Parse the sentence: “Time flies like an arrow”
It may have many parses.
How do we rank the choices?
Regression
Modelling as Regression
Given a set of decisions yi based on observations xi,
- derived from an unknown function y = f(x)
- corrupted by noise
Try to find a model or theory:
y = h(x) ≈ f(x)
where h() is drawn from the hypothesis space – e.g. the space of
radial basis functions, or polynomials, etc.
Polynomial Curve Fitting
[Bishop 06] ch.1
Linear Regression
y = f(x) = Σi wi φi(x)
φi(x) : basis functions
wi : weights
Linear : the function is linear in the weights
Quadratic error function → its derivative is linear in w
Sum-of-Squares Error Function
E(w) = ½ Σn { y(xn, w) − tn }²
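A minimal sketch of minimising this error over a polynomial basis; the toy data imitates the noisy sin(2πx) example of [Bishop 06], and all names are illustrative:

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of an order-M polynomial: minimises
    E(w) = 1/2 * sum_n (y(x_n, w) - t_n)**2 over the weights w."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: Phi[n, i] = x_n**i
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # solves min ||Phi w - t||^2
    return w

# toy data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)
w3 = fit_polynomial(x, t, M=3)
```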
[Plots: polynomial fits of order M = 0, 1, 3, 9 to the same data]
Over-fitting
Root-Mean-Square (RMS) Error: E_RMS = √( 2 E(w*) / N )
Polynomial Coefficients
[Table: fitted coefficients grow rapidly with polynomial order M]
9th Order Polynomial, varying Data Set Size:
[Plots: with more data, the 9th-order fit over-fits less]
Regularization
Penalize large coefficient values:
E~(w) = ½ Σn { y(xn, w) − tn }² + (λ/2) ‖w‖²
[Plots: 9th-order fits for increasing λ; E_RMS vs. ln λ]
[Table: regularization shrinks the fitted coefficients]
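A sketch of the regularized (ridge) fit via its closed form, continuing the names from the previous sketch; λ = e^(−18) is just an illustrative value taken from the [Bishop 06] plots:

```python
import numpy as np

def fit_polynomial_regularized(x, t, M, lam):
    """Minimise E~(w) = 1/2 sum_n (y(x_n, w) - t_n)**2 + (lam/2)*||w||^2.
    Closed form: w = (Phi^T Phi + lam*I)^(-1) Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# e.g. a heavily regularized 9th-order fit on the toy data above:
# w9 = fit_polynomial_regularized(x, t, M=9, lam=np.exp(-18))
```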
Probability Theory
Learning = discovering regularities
- Regularity : in repeated experiments,
the outcome is not fully predictable
outcome = “possible world”
set of all possible worlds = Ω
Probability Theory
Apples and Oranges
Sample Space
Sample ω = Pick two fruits,
e.g. Apple, then Orange
Sample Space Ω = {(A,A), (A,O),
(O,A),(O,O)}
= all possible worlds
Event e = set of possible worlds, e ⊆ Ω
• e.g. second one picked is an apple
Learning = discovering regularities
- Regularity : in repeated experiments,
the outcome is not fully predictable
- Probability p(e) : "the fraction of possible worlds in
which e is true", i.e. the outcome is in event e
- Frequentist view : p(e) = the limit of this fraction as the
number of trials N → ∞
- Belief view : in a wager, fair odds of
(1-p):p that the outcome is in e, or vice versa
Axioms of Probability
- non-negative : p(e) ≥ 0
- unit sum p(Ω) = 1
i.e. no outcomes outside sample space
- additive : if e1, e2 are disjoint events (no common
outcome):
p(e1) + p(e2) = p(e1 ∪ e2)
ALT (for events that may overlap):
p(e1 ∨ e2) = p(e1) + p(e2) - p(e1 ∧ e2)
Why probability theory?
Different methodologies have been attempted for uncertainty:
– Fuzzy logic
– Multi-valued logic
– Non-monotonic reasoning
But a unique property of probability theory:
if you gamble using probabilities, you have the best
chance in a wager. [de Finetti 1931]
=> if an opponent uses some other system, he is
more likely to lose
Ramsey-de Finetti theorem (1931)
If agent X’s degrees of belief are rational, then X’s
degrees-of-belief function defined by fair betting
rates is (formally) a probability function
Fair betting rates: the opponent decides which side one
bets on
Proof: fair odds result in a function pr() that satisfies
the Kolmogorov axioms:
Normality : pr(S) >= 0
Certainty : pr(T) = 1
Additivity : pr(S1 ∨ S2 ∨ …) = Σi pr(Si), for mutually exclusive Si
Joint vs. conditional probability
Marginal probability : p(X)
Joint probability : p(X, Y)
Conditional probability : p(Y | X)
Rules of Probability
Sum Rule : p(X) = ΣY p(X, Y)
Product Rule : p(X, Y) = p(Y | X) p(X)
Example
A disease d occurs in 0.05% of the population. A test is
99% effective in detecting the disease, but 5% of
the cases test positive in the absence of d.
10000 people are tested. How many are expected to
test positive?
p(d) = 0.0005 ; p(t|d) = 0.99 ; p(t|~d) = 0.05
p(t) = p(t,d) + p(t,~d) [Sum Rule]
= p(t|d)p(d) + p(t|~d)p(~d) [Product Rule]
= 0.99*0.0005 + 0.05*0.9995 ≈ 0.0505, i.e. about 505 test +ve
Bayes’ Theorem
posterior ∝ likelihood × prior
Bayes’ Theorem
Thomas Bayes (c.1750):
how can we infer causes from effects?
How can one learn the probability of a future event if one knew only
how many times it had (or had not) occurred in the past?
As new evidence comes in → probabilistic knowledge improves.
e.g. throw a die: the guess is poor (1/6)
throw the die again: is it > or < the previous throw? The guess improves.
throw the die repeatedly: the guess can improve quite a lot.
Hence: initial estimate (prior belief P(h), not well formulated)
+ new evidence (support) – compute the likelihood P(data|h)
→ improved estimate (the posterior P(h|data))
Example
A disease d occurs in 0.05% of the population. A test is
99% effective in detecting the disease, but 5% of
the cases test positive in the absence of d.
If you test +ve, what is the probability that you have
the disease?
p(d|t) = p(d) p(t|d) / p(t) ; p(t) = 0.0505
p(d|t) = 0.0005 * 0.99 / 0.0505 ≈ 0.0098 (about 1%)
if 10K people take the test, expected number with d = 5
FPs ≈ 0.05 * 9995 ≈ 500
TPs ≈ 0.99 * 5 ≈ 5 ; only about 5 of the 505 positives have d
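A quick numeric check of both computations above:

```python
# Disease-test example: sum rule, product rule, then Bayes' theorem.
p_d = 0.0005            # prevalence p(d)
p_t_d = 0.99            # sensitivity p(t|d)
p_t_nd = 0.05           # false-positive rate p(t|~d)

p_t = p_t_d * p_d + p_t_nd * (1 - p_d)   # p(t) = 0.0505 (approx.)
p_d_t = p_t_d * p_d / p_t                # p(d|t) = 0.0098 (approx.)

print(f"p(t) = {p_t:.4f}; expected positives in 10000: {10000 * p_t:.0f}")
print(f"p(d|t) = {p_d_t:.4f}")
```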
Bayesian Inference
Testing for hypothesis H given evidence E
- Evidence : based on new observation E
- Prior : Earlier evaluation about the probability of H
- Likelihood : probability of evidence given hypothesis
P(E|H)
- Normalization : P(E), the marginal likelihood
Bayesian inference:
P(H|E) = P(E|H) P(H) / P(E)
P(H|E) : the posterior probability
Bayesian Inference
The fruit picked is an orange (o).
What is the probability that it came
from the blue box (B)?
P(B|o) = P(o|B) p(B) / P(o)
Given: the red box is picked 40% of
the time: p(r) = 0.4, p(B) = 0.6 ;
and p(o|B) = ¾, p(o|r) = ¼
P(o) = ¾*0.6 + ¼*0.4 = 11/20
P(B|o) = ¾ * 0.6 * 20/11 = 9/11
Continuous variables:
Probability Densities
p(x ∈ (a, b)) = ∫_a^b p(x) dx ; p(x) ≥ 0 and ∫ p(x) dx = 1
cumulative distribution : P(z) = ∫_{-∞}^z p(x) dx
Expectations
E[f] = Σx p(x) f(x) (discrete x) ; E[f] = ∫ p(x) f(x) dx (continuous x)
Frequentist approximation with an unbiased sample
(both discrete / continuous) : E[f] ≈ (1/N) Σn f(xn)
The Gaussian Distribution
N(x | μ, σ²) = (2πσ²)^(-1/2) exp{ −(x−μ)² / 2σ² }
Gaussian Mean and Variance
E[x] = μ ; var[x] = σ²
Central Limit Theorem
Distribution of sum of N i.i.d. random variables
becomes increasingly Gaussian for larger N.
Example: N uniform [0,1] random variables.
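A short numpy demonstration: the mean of N uniform [0,1] variables has mean 0.5 and variance 1/(12N), and its distribution looks increasingly Gaussian as N grows (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1, 2, 10):
    # 100,000 experiments, each averaging N uniform [0,1] draws
    means = rng.uniform(0, 1, size=(100_000, N)).mean(axis=1)
    # compare empirical mean/variance with the theory: 0.5 and 1/(12N)
    print(N, means.mean(), means.var(), 1 / (12 * N))
```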
Gaussian Parameter Estimation
Observations assumed to be independently
drawn from the same distribution (i.i.d.)
Likelihood function :
p(x | μ, σ²) = Πn N(xn | μ, σ²)
Maximum (Log) Likelihood :
μ_ML = (1/N) Σn xn ; σ²_ML = (1/N) Σn (xn − μ_ML)²
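A sketch of the two maximum-likelihood estimates on synthetic data; the true parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # i.i.d. Gaussian sample

mu_ml = x.mean()                    # mu_ML = (1/N) sum_n x_n
sigma2_ml = ((x - mu_ml) ** 2).mean()
# sigma2_ML uses 1/N (the biased ML estimate), not 1/(N-1)
print(mu_ml, sigma2_ml)
```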
Distributions over
Multi-dimensional spaces
The Multivariate Gaussian
[Plot: contour lines of equal probability density]
Multivariate distribution
joint distribution P(x,y) varies considerably
though marginals P(x), P(y) are identical
estimating the joint distribution requires a
much larger sample: O(n^k) cells vs. n·k for the marginals
Marginals and Conditionals
marginals P(x), P(y) are Gaussian
conditional P(x|y) is also Gaussian
Non-intuitive in high dimensions
As dimensionality
increases, bulk of
data moves away
from center
Gaussian in polar coordinates;
p(r)δr : prob. mass inside annulus δr at r.
Change of variable x = g(y) :
p_y(y) = p_x(g(y)) |g′(y)|
Bernoulli Process
Successive Trials – e.g. Toss a coin three times:
HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
Probability of k Heads:
k    :  0    1    2    3
P(k) : 1/8  3/8  3/8  1/8
Probability of success: p, of failure: q = 1 − p ; then
P(k successes in n trials) = C(n, k) p^k q^(n−k)
Model Selection
Cross-Validation
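A minimal S-fold cross-validation sketch for choosing the polynomial order M, reusing the design-matrix idea from the fitting sketches above; S = 4 and the random fold split are illustrative choices:

```python
import numpy as np

def s_fold_cv_rms(x, t, M, S=4):
    """Train an order-M polynomial on S-1 folds, measure RMS error on the
    held-out fold, and average over the S folds."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), S)
    errs = []
    for held in folds:
        train = np.setdiff1d(np.arange(len(x)), held)
        Phi = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[train], rcond=None)
        pred = np.vander(x[held], M + 1, increasing=True) @ w
        errs.append(np.sqrt(np.mean((pred - t[held]) ** 2)))
    return np.mean(errs)   # pick the M with the lowest held-out error
```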
Quantized-Cell Classification
[Plot: flow data; red: ‘homogeneous’, green: ‘annular’, blue: ‘laminar’]
Curse of Dimensionality
general cubic polynomial for D dimensions : O(D^3) parameters
Curse of Dimensionality
The unit hypercube and its inscribed sphere in high dimensions:
as the dimension D grows, vol(sphere) / vol(hypercube) → 0
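A short check of this claim: the sphere inscribed in the unit cube has radius 1/2, so the volume ratio is π^(D/2) (1/2)^D / Γ(D/2 + 1), which vanishes as D grows:

```python
import math

# Ratio of the inscribed sphere's volume to the unit cube's volume
for D in (1, 2, 3, 10, 20):
    ratio = math.pi ** (D / 2) * 0.5 ** D / math.gamma(D / 2 + 1)
    print(D, ratio)   # tends to 0: e.g. ~0.52 at D=3, ~2.5e-3 at D=10
```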
Curse of Dimensionality
Polynomial curve fitting, M = 3
Gaussian Densities in
higher dimensions
Regression with Polynomials
Curve Fitting Re-visited
Bayesian Inference
Testing for hypothesis H given evidence E
Bayesian inference:
P(H|E) = P(E|H) P(H) / P(E)
posterior ∝ likelihood × prior
Maximum Likelihood
Evidence = t ; Hypothesis = poly(x, w)
Likelihood : p(t | x, w, β) = Πn N(tn | y(xn, w), β⁻¹)
Determine w_ML by minimizing the sum-of-squares error
Predictive Distribution
p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML⁻¹)
MAP: A Step towards Bayes
Determine w_MAP by minimizing the regularized sum-of-squares error
MAP = Maximum A Posteriori
Bayesian Curve Fitting
Bayesian Predictive Distribution
Information Theory
Twenty Questions
Knower: thinks of an object (a point in a probability space)
Guesser: asks the knower to evaluate random variables
Stupid approach:
Guesser: Is it my left big toe?
Knower: No.
Guesser: Is it Valmiki?
Knower: No.
Guesser: Is it Aunt Lakshmi?
...
Expectations & Surprisal
Turn the key → expectation: the lock will open
Exam results being shown: could be 100, could be zero
random variable: a function from the set of marks
to the real interval [0,1]
Interestingness ∝ unpredictability
surprisal (r.v. = x) = - log2 p(x)
= 0 when p(x) = 1
= 1 when p(x) = ½
= ∞ when p(x) = 0
Expectations in data
A: 00010001000100010001. . . 0001000100010001000100010001
B: 01110100110100100110. . . 1010111010111011000101100010
C: 00011000001010100000. . . 0010001000010000001000110000
Structure in data is easy to remember
Entropy
Used in
• coding theory
• statistical physics
• machine learning
Entropy
In how many ways can N identical objects be allocated to M
bins? W = N! / Πi ni!
Entropy H = (1/N) ln W is maximized when the objects are
spread evenly across the bins: ni = N/M
Entropy in Coding theory
x discrete with 8 possible states; how many bits to
transmit the state of x?
All states equally likely → need log2 8 = 3 bits
Coding theory
Entropy in Twenty Questions
Intuitively : try to ask a question whose answer is 50-50, e.g.:
Is the first letter between A and M?
question entropy = − p(Y) log2 p(Y) − p(N) log2 p(N)
For both answers equiprobable:
entropy = − ½ log2(½) − ½ log2(½) = 1.0 bit
For P(Y) = 1/1024:
entropy ≈ (1/1024) · 10 + (small term for N) ≈ 0.01 bit
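A small helper that reproduces both numbers (the binary entropy of a yes/no question):

```python
import math

def question_entropy(p_yes):
    """Entropy in bits of a yes/no question answered 'yes' with prob p_yes."""
    h = 0.0
    for pr in (p_yes, 1 - p_yes):
        if pr > 0:                 # by convention, 0 * log 0 = 0
            h -= pr * math.log2(pr)
    return h

print(question_entropy(0.5))       # 1.0 bit: the ideal 50-50 question
print(question_entropy(1 / 1024))  # about 0.011 bits: nearly useless
```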