Instance Based Learning
• k-Nearest Neighbor
• Locally weighted regression
• Radial basis functions
• Case-based reasoning
• Lazy and eager learning
Instance-Based Learning
Key idea: just store all training examples ⟨xi, f(xi)⟩
Nearest neighbor (1-Nearest Neighbor):
• Given query instance xq, locate the nearest training example xn, then estimate f̂(xq) ← f(xn)
k-Nearest Neighbor:
• Given xq, take a vote among its k nearest neighbors (if discrete-valued target function)
• Take the mean of the f values of the k nearest neighbors (if real-valued):

$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$$
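A minimal sketch of both estimators in Python, assuming instances are numeric vectors under Euclidean distance and the training data is a list of (x, f(x)) pairs; the helper names (`euclidean`, `k_nearest`, etc.) are illustrative, not from the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    # standard Euclidean distance between two numeric vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(examples, xq, k):
    # examples: list of (x, f(x)) pairs; return the k closest to the query xq
    return sorted(examples, key=lambda ex: euclidean(ex[0], xq))[:k]

def knn_classify(examples, xq, k):
    # discrete-valued target: majority vote among the k nearest neighbors
    votes = Counter(fx for _, fx in k_nearest(examples, xq, k))
    return votes.most_common(1)[0][0]

def knn_regress(examples, xq, k):
    # real-valued target: mean of the k nearest neighbors' f values
    return sum(fx for _, fx in k_nearest(examples, xq, k)) / k
```

For 1-Nearest Neighbor, call either function with k = 1.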
When to Consider Nearest Neighbor
• Instances map to points in Rn
• Less than 20 attributes per instance
• Lots of training data
Advantages
• Training is very fast
• Learn complex target functions
• Do not lose information
Disadvantages
• Slow at query time
• Easily fooled by irrelevant attributes
k-NN Classification
[Figure: 5-Nearest Neighbor classification of a query point xq, and the 1-NN decision surface]
Behavior in the Limit
Define p(x) as probability that instance x will be
labeled 1 (positive) versus 0 (negative)
Nearest Neighbor
• As number of training examples approaches infinity,
approaches Gibbs Algorithm
Gibbs: with probability p(x) predict 1, else 0
k-Nearest Neighbor:
• As number of training examples approaches infinity and k
gets large, approaches Bayes optimal
Bayes optimal: if p(x) > 0.5 then predict 1, else 0
• Note Gibbs has at most twice the expected error of Bayes
optimal
Distance-Weighted k-NN
Might want to weight nearer neighbors more heavily ...
$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i}
\qquad\text{where}\qquad
w_i \equiv \frac{1}{d(x_q, x_i)^2}$$

and d(xq, xi) is the distance between xq and xi
Note: now it makes sense to use all training examples instead of just the k nearest
→ Shepard's method
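A minimal sketch of the distance-weighted estimate, reusing the `euclidean` and `k_nearest` helpers sketched earlier; returning the stored value when d = 0 is a common convention to avoid division by zero, not something stated on the slide.

```python
def dw_knn_regress(examples, xq, k):
    # distance-weighted k-NN: weight each neighbor's f value by 1 / d(xq, xi)^2
    num = den = 0.0
    for xi, fx in k_nearest(examples, xq, k):
        d = euclidean(xq, xi)
        if d == 0.0:
            return fx            # xq coincides with a training example
        w = 1.0 / d ** 2
        num += w * fx
        den += w
    return num / den
```

Passing k = len(examples) weights all training examples by distance, i.e. Shepard's method.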
Curse of Dimensionality
Imagine instances described by 20 attributes, but
only 2 are relevant to target function
Curse of dimensionality: nearest neighbor is easily
misled when X is high-dimensional
One approach:
• Stretch jth axis by weight zj, where z1,z2,…,zn chosen to
minimize prediction error
• Use cross-validation to automatically choose weights
z1,z2,…,zn
• Note setting zj to zero eliminates dimension j altogether
see (Moore and Lee, 1994)
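A minimal sketch of evaluating one candidate weight vector z by leave-one-out cross-validation of 1-NN classification, reusing `knn_classify` from the earlier sketch; choosing z would wrap this in an outer search loop (grid search, gradient-free optimization, etc.). The names are illustrative.

```python
def stretch(x, z):
    # scale the j-th axis by weight z[j]; z[j] = 0 drops dimension j entirely
    return [zj * xj for zj, xj in zip(z, x)]

def loo_error(examples, z, k=1):
    # leave-one-out error of k-NN classification after stretching each axis by z
    scaled = [(stretch(x, z), fx) for x, fx in examples]
    mistakes = 0
    for i, (xi, fi) in enumerate(scaled):
        rest = scaled[:i] + scaled[i + 1:]
        if knn_classify(rest, xi, k) != fi:
            mistakes += 1
    return mistakes / len(scaled)
```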
Locally Weighted Regression
k-NN forms a local approximation to f for each query point xq
Why not form an explicit approximation f̂(x) for the region around xq?
• Fit a linear function to the k nearest neighbors
• Or fit a quadratic, etc.
• Produces a "piecewise approximation" to f
Several choices of error to minimize :
• Squared error over k nearest neighbors
$$E_1(x_q) \equiv \frac{1}{2} \sum_{x \,\in\, k\ \text{nearest neighbors of}\ x_q} \bigl(f(x) - \hat{f}(x)\bigr)^2$$

• Distance-weighted squared error over all training examples D:

$$E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \bigl(f(x) - \hat{f}(x)\bigr)^2 \, K\bigl(d(x_q, x)\bigr)$$
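A minimal sketch of fitting and evaluating a local linear approximation at one query point, using NumPy and a Gaussian kernel on distance as the weighting (one plausible choice of K; the slide leaves K unspecified). The weighted least-squares fit minimizes a criterion of the second form above.

```python
import numpy as np

def lwr_predict(X, y, xq, tau=1.0):
    # Locally weighted linear regression: fit a linear f_hat around xq by
    # minimizing the distance-weighted squared error, then evaluate at xq.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xq = np.asarray(xq, dtype=float)
    Xb = np.hstack([np.ones((len(X), 1)), X])        # add an intercept column
    d2 = np.sum((X - xq) ** 2, axis=1)               # squared distances d^2(xq, x)
    k = np.exp(-d2 / (2.0 * tau ** 2))               # Gaussian kernel K(d(xq, x))
    sw = np.sqrt(k)
    # weighted least squares via row scaling: argmin_w || sw * (Xb @ w - y) ||^2
    w, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
    return float(np.concatenate(([1.0], xq)) @ w)
```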
Radial Basis Function Networks
• Global approximation to target function, in terms
of linear combination of local approximations
• Used, for example, in image classification
• A different kind of neural network
• Closely related to distance-weighted regression,
but “eager” instead of “lazy”
Radial Basis Function Networks
[Figure: RBF network — a layer of kernel units over the inputs a1(x), a2(x), ..., an(x), combined linearly with weights w0, w1, ..., wk to produce the output f(x)]

where ai(x) are the attributes describing instance x, and

$$f(x) = w_0 + \sum_{u=1}^{k} w_u \, K_u\bigl(d(x_u, x)\bigr)$$

One common choice for Ku(d(xu, x)) is

$$K_u\bigl(d(x_u, x)\bigr) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$$
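A minimal NumPy sketch of this forward computation, assuming Gaussian kernels with given centers xu and widths σu; the weight vector w holds [w0, w1, ..., wk].

```python
import numpy as np

def rbf_output(x, centers, sigmas, w):
    # f(x) = w0 + sum_u  w_u * exp( -d^2(x_u, x) / (2 * sigma_u^2) )
    x = np.asarray(x, dtype=float)
    d2 = np.sum((np.asarray(centers, dtype=float) - x) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * np.asarray(sigmas, dtype=float) ** 2))
    return w[0] + float(np.dot(w[1:], k))
```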
Training RBF Networks
Q1: What xu to use for kernel function Ku(d(xu,x))?
• Scatter uniformly through instance space
• Or use training instances (reflects instance distribution)
Q2: How to train the weights (assume here Gaussian Ku)?
• First choose variance (and perhaps mean) for each Ku
– e.g., use EM
• Then hold Ku fixed, and train linear output layer
– efficient methods to fit linear function
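A minimal sketch of that second step under these assumptions: kernel centers already chosen, a single shared width σ, and the linear output layer fit in closed form by least squares. It pairs with the `rbf_output` sketch above; the function name is illustrative.

```python
import numpy as np

def train_rbf_weights(X, y, centers, sigma):
    # Hold the Gaussian kernels fixed and fit the linear output layer by least squares.
    X = np.asarray(X, dtype=float)
    C = np.asarray(centers, dtype=float)
    d2 = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)   # (n, k) squared distances
    K = np.exp(-d2 / (2.0 * sigma ** 2))                        # kernel activations
    design = np.hstack([np.ones((len(X), 1)), K])               # prepend the bias unit
    w, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return w            # w[0] is w0; w[1:] are the k output weights
```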
Case-Based Reasoning
Can apply instance-based learning even when X ≠ Rn
→ need a different "distance" metric
Case-Based Reasoning is instance-based learning applied to
instances with symbolic logic descriptions:
((user-complaint error53-on-shutdown)
(cpu-model PowerPC)
(operating-system Windows)
(network-connection PCIA)
(memory 48meg)
(installed-applications Excel Netscape
VirusScan)
(disk 1Gig)
(likely-cause ???))
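A minimal sketch of one plausible "distance" for such symbolic descriptions: the fraction of shared attributes whose values match exactly. This is an illustrative stand-in; CADET's actual metric (next slides) matches qualitative function descriptions.

```python
def symbolic_similarity(case_a, case_b):
    # Fraction of shared attributes with exactly matching values, where each
    # case is a dict such as {"cpu-model": "PowerPC", "memory": "48meg", ...}.
    shared = set(case_a) & set(case_b)
    if not shared:
        return 0.0
    return sum(case_a[attr] == case_b[attr] for attr in shared) / len(shared)
```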
Case-Based Reasoning in CADET
CADET: 75 stored examples of mechanical devices
• each training example:
<qualitative function, mechanical structure>
• new query: desired function
• target value: mechanical structure for this function
Distance metric: match qualitative function
descriptions
Case-Based Reasoning in CADET
A stored case: T-junction pipe
[Figure: Structure — inflows (Q1, T1) and (Q2, T2) joining into outflow (Q3, T3), where Q = waterflow and T = temperature. Function — a qualitative graph in which Q1 and Q2 each influence Q3 positively, and T1 and T2 each influence T3 positively.]
A problem specification: Water faucet
[Figure: Structure — unknown (?). Function — a qualitative graph relating the control signals Cc, Ch and the inputs Qc, Qh, Tc, Th to the mixed output flow Qm and temperature Tm.]
Case-Based Reasoning in CADET
• Instances represented by rich structural
descriptions
• Multiple cases retrieved (and combined) to form
solution to new problem
• Tight coupling between case retrieval and problem
solving
Bottom line:
• Simple matching of cases useful for tasks such as
answering help-desk queries
• Area of ongoing research
Lazy and Eager Learning
Lazy: wait for query before generalizing
• k-Nearest Neighbor, Case-Based Reasoning
Eager: generalize before seeing query
• Radial basis function networks, ID3, Backpropagation, etc.
Does it matter?
• Eager learner must create global approximation
• Lazy learner can create many local approximations
• If they use the same H, the lazy learner can effectively represent more complex
functions (e.g., consider H = linear functions)
kd-trees (Moore)
• Eager version of k-Nearest Neighbor
• Idea: decrease time to find neighbors
– train by constructing a lookup (kd) tree
– recursively subdivide space
• ignore class of points
• lots of possible mechanisms: grid, maximum variance, etc.
– when looking for the nearest neighbor, search the tree
– the nearest neighbor can typically be found in O(log n) steps
– the k nearest neighbors can be found by generalizing the
process (still O(log n) steps if k is constant)
• Slower training but faster classification
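A minimal sketch of a kd-tree for 1-nearest-neighbor search, choosing the split axis round-robin and splitting at the median (just one of the possible mechanisms the slide lists); the names are illustrative.

```python
import math

def build_kd(points, depth=0):
    # points: list of (x, f(x)) pairs; split on axes round-robin at the median
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kd(points[:mid], depth + 1),
            "right": build_kd(points[mid + 1:], depth + 1)}

def nearest(node, xq, best=None):
    # depth-first search, pruning subtrees that cannot beat the current best
    if node is None:
        return best
    x, _ = node["point"]
    d = math.dist(xq, x)
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = xq[axis] - x[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, xq, best)
    if abs(diff) < best[0]:          # far side could still contain a closer point
        best = nearest(far, xq, best)
    return best
```

Build once with `tree = build_kd(examples)` (the slower training step), then answer each query with `nearest(tree, xq)` (the faster classification step).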
kd Tree
Instance Based Learning Summary
• Lazy versus Eager learning
– lazy: work done at query (testing) time
– eager: work done at training time
– instance-based learning is sometimes lazy
• k-Nearest Neighbor (k-NN) is lazy
– classify based on k nearest neighbors
– key: determining neighbors
– variations:
• distance weighted combination
• locally weighted regression
– limitation: curse of dimensionality
• “stretching” dimensions
Instance Based Learning Summary
• k-d trees (eager version of k-nn)
– structure built at train time to quickly find neighbors
• Radial Basis Function (RBF) networks (eager)
– units active in region (sphere) of space
– key: picking/training kernel functions
• Case-Based Reasoning (CBR) is generally lazy
– like nearest neighbor, but without continuous features
– may have other types of features:
• structural (graphs in CADET)