PAC LEARNING
PAC learning framework
• The learner receives a sample S = (x1, …, xm) whose points are drawn
independently and identically distributed (i.i.d.) according to some
fixed but unknown distribution D, as well as the labels
(c(x1), …, c(xm)), which are based on a specific target concept c ∈ C
to learn.
• The task is then to use the labeled sample S to select a hypothesis
h ∈ H that has a small generalization error with respect to the
concept c.
Generalization error
• Given a hypothesis h ∈ H, a target concept c ∈ C, and an underlying
distribution D, the generalization error or risk of h is defined by
R(h) = P_{x~D}[h(x) ≠ c(x)] = E_{x~D}[1_{h(x) ≠ c(x)}]
• The generalization error of a hypothesis is not directly
accessible to the learner since both the distribution D and
the target concept c are unknown.
Empirical error
• Given a hypothesis h ∈ H, a target concept c ∈ C, and a sample
S = (x1, …, xm), the empirical error or empirical risk of h is defined by
R̂_S(h) = (1/m) Σ_{i=1}^{m} 1_{h(xi) ≠ c(xi)}
• Thus, the empirical error of h ∈ H is its average error over the
sample S, while the generalization error is its expected error based
on the distribution D.
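To make the distinction concrete, here is a minimal Python sketch that
computes the empirical error of a hypothesis on a sample (the specific
hypothesis, target concept, and sample below are illustrative
assumptions, not taken from the slides):

def empirical_error(h, c, sample):
    """Fraction of sample points on which the hypothesis h disagrees with the target c."""
    return sum(1 for x in sample if h(x) != c(x)) / len(sample)

# Illustrative assumption: X is the real line, the target concept c is the
# indicator of [0, 1], and the hypothesis h is the indicator of [0, 2].
c = lambda x: 1 if 0.0 <= x <= 1.0 else 0
h = lambda x: 1 if 0.0 <= x <= 2.0 else 0

S = [-0.5, 0.2, 0.7, 1.5, 3.0]       # a fixed sample, used here only for illustration
print(empirical_error(h, c, S))      # h errs only on x = 1.5 -> 0.2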
PAC learning
• PAC stands for Probably Approximately Correct.
• A concept class C is PAC-learnable if there exists an algorithm that,
for any ε > 0 and δ > 0, any distribution D, and any target concept
c ∈ C, returns with probability at least 1 − δ a hypothesis with error
at most ε after observing a number of samples polynomial in 1/ε, 1/δ,
n, and size(c) (see the formal statement below), where n is a number
such that the computational cost of representing any element x ∈ X is
at most O(n), and size(c) denotes the maximal cost of the
computational representation of c ∈ C.
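For reference, the usual formal statement of PAC-learnability (a
standard textbook formulation, written here in LaTeX):

\[
\exists\, \mathcal{A},\ \exists\, \mathrm{poly}(\cdot,\cdot,\cdot,\cdot)\ \text{such that}\
\forall \varepsilon > 0,\ \forall \delta > 0,\ \forall D,\ \forall c \in C:
\]
\[
m \ge \mathrm{poly}\!\left(\tfrac{1}{\varepsilon}, \tfrac{1}{\delta}, n, \mathrm{size}(c)\right)
\;\Longrightarrow\;
\Pr_{S \sim D^m}\!\left[\, R(h_S) \le \varepsilon \,\right] \ge 1 - \delta .
\]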
PAC learning
• A concept class C is thus PAC-learnable if the hypothesis returned by
the algorithm after observing a number of points polynomial in 1/ε and
1/δ is approximately correct (error at most ε) with high probability
(at least 1 − δ).
• ε is the upper bound on the error of the returned hypothesis, i.e.,
the hypothesis has error at most ε
• Therefore, accuracy is 1 − ε
• δ gives the probability of failure in achieving error at most ε,
i.e., the hypothesis generated is approximately correct with
probability at least 1 − δ
• Therefore, confidence is 1 − δ
PAC example
• Learning axis-aligned rectangle
• R represents a target axis-aligned rectangle and R’ a
hypothesis.
• Error regions
• the error regions are formed by the area within the rectangle R but
outside the rectangle R’ (false negatives)
• and the area within R’ but outside the rectangle R (false positives)
PAC example
• Learning tightest axis-aligned rectangle
• Given a labeled sample S, the algorithm consists of returning the
tightest axis-aligned rectangle R’ = R_S containing the points labeled
with 1, as in the sketch below.
• R_S does not produce any false positives, since its points must be
included in the target concept R. Thus, the error region of R_S is
included in R; it is the set difference R − R_S.
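A minimal sketch of this tightest-rectangle learner in Python (the data
layout, with examples given as ((x, y), label) pairs, is an
illustrative assumption):

def tightest_rectangle(sample):
    """Return (xmin, xmax, ymin, ymax) of the tightest axis-aligned rectangle
    enclosing all positively labeled points, or None if there are no positives."""
    positives = [p for p, label in sample if label == 1]
    if not positives:
        return None  # conventionally, predict everything negative
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    """Label 1 iff the point falls inside the learned rectangle R_S."""
    if rect is None:
        return 0
    xmin, xmax, ymin, ymax = rect
    x, y = point
    return 1 if (xmin <= x <= xmax and ymin <= y <= ymax) else 0

S = [((1.0, 1.0), 1), ((2.0, 3.0), 1), ((4.0, 0.5), 0), ((1.5, 2.0), 1)]
R_S = tightest_rectangle(S)          # (1.0, 2.0, 1.0, 3.0)
print(R_S, predict(R_S, (1.2, 2.5))) # the test point lies inside R_S -> 1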
Error Region
• The error region R − R_S is contained in the union of four
rectangular strips along the sides of R; if each strip has probability
mass at most ε/4, the total error is at most ε
• So each strip is constructed to have probability mass ε/4 under D
• Probability that a randomly drawn sample point falls in any one given
strip (error region) = ε/4
• Probability that a randomly drawn sample point misses a given
strip = 1 − ε/4
• P(m instances miss a given strip) = (1 − ε/4)^m
• P(m instances miss at least one of the four strips) ≤ 4(1 − ε/4)^m
(union bound)
• 4(1 − ε/4)^m ≤ 4 exp(−mε/4),
using the general inequality 1 − x ≤ e^(−x)
Error Region
This yields that with probability at least 1 − δ, the error of the
algorithm is bounded as follows: R(R_S) ≤ (4/m) ln(4/δ), obtained by
setting the failure probability 4 exp(−mε/4) to δ and solving for ε.
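Spelling out the algebra behind that bound (a short derivation
consistent with the inequalities above):

\[
4\, e^{-m\varepsilon/4} \le \delta
\;\Longleftrightarrow\;
m \ge \frac{4}{\varepsilon} \ln \frac{4}{\delta}
\;\Longleftrightarrow\;
\varepsilon \ge \frac{4}{m} \ln \frac{4}{\delta},
\]
\[
\text{so with probability at least } 1 - \delta:\qquad
R(R_S) \le \frac{4}{m} \ln \frac{4}{\delta}.
\]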
Learning bound — finite H, consistent case
Proof
• Assume that H contains some k “bad” hypotheses
• H_bad = {h1, h2, …, hk} with R(hi) ≥ ε for each i
• Let us consider one such hypothesis hi
• Prob. that hi is consistent with the first training example is ≤ 1 − ε
• Prob. that hi is consistent with the first m training examples is
≤ (1 − ε)^m
• Prob. that at least one hi ∈ H_bad is consistent with the first m
training examples is ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m (union bound)
• Calculate the value of m so that |H|(1 − ε)^m ≤ δ
• Using the general inequality 1 − x ≤ e^(−x), it suffices that
|H|e^(−mε) ≤ δ
• Setting this to δ and solving for m, we get
m ≥ (1/ε)(ln|H| + ln(1/δ))
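A small Python helper that evaluates this sample-size bound (the
function name and the example numbers are illustrative assumptions, not
from the original slides):

import math

# Minimal sketch: sample-size bound for the consistent, finite-H case,
#   m >= (1/eps) * (ln|H| + ln(1/delta)).
# Function name and example numbers are illustrative assumptions.

def consistent_sample_bound(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln(h_size) + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: a finite class with |H| = 1000 hypotheses, eps = 0.1, delta = 0.05.
print(consistent_sample_bound(1000, 0.1, 0.05))   # -> 100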
Example: Conjunction of Boolean literals
• Consider learning the concept class Cn of conjunctions of at most n
Boolean literals x1, …, xn.
• A Boolean literal is either a variable xi, i ∈ [n], or its
negation ¬xi.
• For n = 4, an example is the conjunction x1 ∧ ¬x2 ∧ x4.
• (1, 0, 0, 1) is a positive example for this concept while
(1, 0, 0, 0) is a negative example.
• Since each literal can be included positively, included with
negation, or not included at all, we have |Cn| = 3^n.
Example: Conjunction of Boolean literals
• For n = 6, the figure shows an example training sample and a
consistent hypothesis (the learning algorithm itself is sketched below).
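A minimal sketch of the standard consistent learner used here: start
from the conjunction of all 2n literals and delete every literal
contradicted by a positive example (the bit-vector encoding of examples
below is an illustrative assumption):

def learn_conjunction(sample, n):
    """sample: list of (bits, label) with bits a tuple of n values in {0, 1}.
    Returns the set of surviving literals as pairs (index, is_positive)."""
    literals = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for bits, label in sample:
        if label == 1:
            # A positive example rules out every literal it falsifies.
            literals -= {(i, bits[i] == 0) for i in range(n)}
    return literals

def evaluate(literals, bits):
    """1 iff every surviving literal is satisfied by the assignment."""
    return int(all((bits[i] == 1) == positive for i, positive in literals))

S = [((1, 0, 0, 1), 1), ((1, 1, 0, 1), 1), ((1, 0, 0, 0), 0)]
h = learn_conjunction(S, 4)              # surviving literals: x1, not-x3, x4
print(sorted(h), evaluate(h, (1, 0, 0, 1)))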
Example: Conjunction of Boolean literals
• Plugging this into the sample complexity bound for consistent
hypotheses yields the following sample complexity bound for any ε > 0
and δ > 0: m ≥ (1/ε)((ln 3) n + ln(1/δ))
• For δ = 0.02, ε = 0.1 and n = 10, the bound becomes m ≥ 149
(see the numerical check below).
• Thus, for a labeled sample of at least 149 examples, the
bound guarantees 90% accuracy with a confidence of at
least 98%.
• Here, the number of training samples required grows only linearly in
n, the cost of the representation of a point in X. Thus, the theorem
guarantees that the class of conjunctions of at most n Boolean literals
is PAC-learnable.
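A quick, self-contained numerical check of the m ≥ 149 figure quoted
above (plain Python, using only the bound itself):

import math

# Check of the bound m >= (1/eps) * (n * ln 3 + ln(1/delta))
# with eps = 0.1, delta = 0.02, n = 10.
eps, delta, n = 0.1, 0.02, 10
m = math.ceil((n * math.log(3) + math.log(1.0 / delta)) / eps)
print(m)   # prints 149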
Learning bound — finite H, inconsistent case
• In the inconsistent case, Hoeffding’s inequality bounds, for any one
hypothesis h ∈ H, the probability that its generalization error exceeds
its empirical error by more than ε: P[R(h) − R̂_S(h) > ε] ≤ exp(−2mε^2)
• So the probability that any one hypothesis in H has such a large gap,
combined over all of H by the union bound, is at most |H|exp(−2mε^2)
• Setting this to δ and solving for m gives the bound for the number of
samples: m ≥ (1/(2ε^2)) [ln|H| + ln(1/δ)]
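A companion helper for the inconsistent-case bound just stated (again,
the function name and example values are illustrative assumptions):

import math

# Minimal sketch: sample-size bound for the inconsistent, finite-H case,
#   m >= (1 / (2 * eps^2)) * (ln|H| + ln(1/delta)),
# matching the bound above. Names and numbers are illustrative assumptions.

def inconsistent_sample_bound(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * eps ** 2))

# Same |H|, eps, delta as in the consistent-case example; the 1/eps^2
# dependence makes the required sample size much larger.
print(inconsistent_sample_bound(1000, 0.1, 0.05))   # -> 496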
Stochastic scenario
• This more general scenario, in which the distribution D is defined
over X × Y and the label of an input point need not be a deterministic
function of the input, is referred to as the stochastic scenario.
• It captures many real-world problems where the label of an input
point is not unique.
• For example, if we seek to predict gender based on input pairs formed
by the height and weight of a person, then the label will typically not
be unique.
Deterministic Scenario
• When the label of a point can be uniquely determined by some
measurable function f : X -> Y (with probability one), the scenario is
deterministic.
• In that case, consider a distribution D over the input space. The
training sample is obtained by drawing x1, …, xm according to D, and
the labels are obtained via f: yi = f(xi) for all i ∈ [m].
VC Dimension
• VC dimension - Vapnik-Chervonenkis dimension
• Provides a measure of hypothesis-space complexity in the case where
the hypothesis space is infinite
• The VC dimension measures the complexity of the
hypothesis space H, not by the number of distinct
hypotheses |H|, but instead by the number of distinct
instances from X that can be completely discriminated
using H.
Shattering
• Consider a hypothesis for the 2-class problem.
• Let a subset of instances be S ⊆ X and let N = 2; then the possible
labelings of S are (+,+), (+,−), (−,+), and (−,−).
• Each hypothesis h from H imposes some dichotomy on S ⊆ X; that is,
h partitions S into the two subsets
• {x ∊ S | h(x) = 1} and
• {x ∊ S | h(x) = 0}.
Shattering
• A set of N instances can be labeled as + or − in 2^N ways
• We say that H shatters S if every possible dichotomy of S can be
represented by some hypothesis from H.
• Definition: A set of instances S is shattered by hypothesis space H
if and only if for every dichotomy of S there exists some hypothesis in
H consistent with this dichotomy.
• Consider 2 instances described using a single real-valued feature:
they can be shattered by the hypothesis class of single intervals
• But 3 instances cannot be shattered by a single interval
VC dimension
• Definition: The Vapnik-Chervonenkis dimension, VC(H),
of hypothesis space H defined over instance space X is
the size of the largest finite subset of X shattered by H.
• If arbitrarily large finite sets of X can be shattered by H,
then VC(H) =∞.
• For a single interval on the real line, any set of 2 instances can be
shattered, but no set of 3 instances can. Hence, VC(H) = 2.
• The definition implies that if we find any set of instances of size d
that can be shattered, then VC(H) ≥ d.
VC dimension
• An unbiased hypothesis space shatters the entire instance space X.
• What if H cannot shatter X, but can shatter some large subset S of X?
Intuitively, it seems reasonable to say that the larger the subset of X
that can be shattered, the more expressive H is.
• The VC dimension of the set of oriented lines in 2-d is 3.
• Since there are 2^m possible dichotomies of m instances, in order for
H to shatter m instances we need |H| ≥ 2^m.
• Hence VC(H) ≤ log2(|H|) for any finite H.
Illustrative Example
• Suppose the instance space X is the set of real numbers, X = R
(e.g., describing the height of people), and H is the set of intervals
on the real number line.
• In other words, H is the set of hypotheses of the form a < x
< b, where a and b may be any real constants. What is
VC(H)?
• S = {3.1,5.7}. Can S be shattered by H? Yes.
• For example, the four hypotheses (1 < x < 2), (1 < x < 4), (4 < x < 7),
and (1 < x < 7) will do.
• They represent each of the four dichotomies over S, covering neither
instance, either one of the instances, and both of the instances,
respectively.
• Since we have found a set of size two that can be shattered
by H, we know the VC dimension of H is at least two.
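As a sanity check of this example, a small Python sketch that
enumerates all dichotomies of a point set and tests whether some open
interval (a, b) realizes each one (the brute-force grid of candidate
endpoints is an illustrative assumption):

from itertools import product

def interval_label(a, b, x):
    """Hypothesis of the form a < x < b."""
    return 1 if a < x < b else 0

def shattered_by_intervals(points):
    """True iff every dichotomy of the points is realized by some interval,
    searching endpoints over a finite grid around the points (an assumption
    that suffices for this finite check)."""
    candidates = sorted(points) + [min(points) - 1, max(points) + 1]
    endpoints = [c + d for c in candidates for d in (-0.5, 0.5)]
    for target in product([0, 1], repeat=len(points)):      # every dichotomy
        if not any(all(interval_label(a, b, x) == t for x, t in zip(points, target))
                   for a in endpoints for b in endpoints):
            return False                                     # some dichotomy unrealizable
    return True

print(shattered_by_intervals([3.1, 5.7]))        # True: VC dimension is at least 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))   # False: labeling (1, 0, 1) is impossible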