Plug-in Methods & under/over-fitting
Vianney Perchet
February 5th 2024
Lecture 2/12
Last Lecture Take Home Message
• “Attributes/Features” space X ⊂ R^d & “label” space Y ⊂ R
• Training data-set: D_n = {(X_1, Y_1), …, (X_n, Y_n)}
• Risk w.r.t. a loss ℓ : Y × Y → R+
  Risk: R(f) = E_{(X,Y)∼P}[ℓ(f(X), Y)]
• Optimal risk and Bayes predictor
  f∗ = arg min_f R(f) and R∗ = R(f∗)
• Binary Classification
  • 0/1-loss: ℓ(y, y′) = 1{y ≠ y′}
  • Bayes classifier: f∗(x) = 1{η(x) ≥ 1/2} (numerical check after this slide)
• Linear Regression
  • quad-loss: ℓ(y, y′) = ∥y − y′∥²
  • Bayes regressor: f∗(x) = η(x)
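As a quick sanity check on the recap above (not from the original slides), here is a short Monte Carlo experiment: on a toy distribution of my own choosing (uniform X, logistic η), the Bayes classifier f∗(x) = 1{η(x) ≥ 1/2} has a smaller estimated 0/1 risk than the same rule with a different threshold.

```python
# Monte Carlo comparison of 0/1 risks; the distribution below (uniform X,
# logistic eta) is an illustrative assumption, not taken from the lecture.
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Regression function eta(x) = P(Y = 1 | X = x) of the toy model."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))

# Simulate (X, Y) ~ P with X uniform on [-1, 1] and Y | X ~ Bernoulli(eta(X)).
n = 200_000
X = rng.uniform(-1.0, 1.0, size=n)
Y = (rng.uniform(size=n) < eta(X)).astype(int)

def risk_01(f):
    """Monte Carlo estimate of R(f) = E[1{f(X) != Y}]."""
    return np.mean(f(X) != Y)

bayes = lambda x: (eta(x) >= 0.5).astype(int)    # f*(x) = 1{eta(x) >= 1/2}
shifted = lambda x: (eta(x) >= 0.8).astype(int)  # same eta, wrong threshold

print("estimated R(f*)          :", risk_01(bayes))
print("estimated R(threshold .8):", risk_01(shifted))  # strictly larger
```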
Focus on Binary Classification
f∗(x) = 1{η(x) ≥ 1/2}
• Simple strategy (sketched in code after this slide)
  • Estimate η(x) by η̂(x) (using D_n)
  • Plug it into the formula, i.e., f̂(x) = 1{η̂(x) ≥ 1/2}
  • Pray for the best
• ✓ It works!
  • Simple methods, already implemented, intuitive and interpretable
  • Many variants (k-NN, regressograms, kernels)
  • ✗ but only if the correct parameters are chosen!
• ✗ Convergence can be slow
  • Even arbitrarily slow: “No-Free-Lunch Theorem”
  • E[R(f̂)] − R∗ ≥ 1/(log log log(n)), which is “constant”
• ✓ But these are pathological counter-examples!
  • In practice, data are “regular” (Lipschitz, Hölder, …)
  • Explicit rates can then be computed
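The three-step strategy above fits in a few lines. A minimal sketch (names such as `plug_in_classifier` are mine, not from the lecture): given any estimator η̂ of η, threshold it at 1/2.

```python
# Plug-in classification: f_hat(x) = 1{eta_hat(x) >= 1/2} for any estimator
# eta_hat of eta(x) = P(Y = 1 | X = x).
import numpy as np

def plug_in_classifier(eta_hat):
    """Turn an estimated regression function into a binary classifier."""
    def f_hat(x):
        return (eta_hat(x) >= 0.5).astype(int)
    return f_hat

# Usage with a (hypothetical) constant estimator eta_hat(x) = 0.3:
f_hat = plug_in_classifier(lambda x: np.full(len(x), 0.3))
print(f_hat(np.zeros((5, 2))))   # predicts label 0 everywhere
```

Any of the estimators of the next slides (regressogram, k-NN, Nadaraya-Watson) can be passed as `eta_hat`.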
Regressograms. The model
• Partition of X = R^d into “bins” (hypercubes) of size h_n
  • Volume of one bin: h_n^d
• Independently on each bin B (sketch below)
  • η̂(x) = ♯{i : X_i ∈ B and Y_i = 1} / ♯{i : X_i ∈ B}   [piece-wise constant]
Th. If h_n → 0 and n·h_n^d → ∞, “consistency”, i.e., R(f̂_n) → R∗ in probability
• Proof ideas:
  1. Lemma: R(f̂) − R∗ ≤ 2 E[ |η̂(X) − η(X)| | D_n ]
  2. Approximation error: h_n → 0
  3. Estimation error: n·h_n^d → ∞
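One way to implement the regressogram estimate η̂, as a rough sketch under assumptions of my own (data in [0, 1)^d, bins indexed by ⌊x/h⌋, empty bins predicting 0); this is not the lecture's code.

```python
# Regressogram: eta_hat is constant on each hypercube bin of side h and equals
# the fraction of labels 1 among the training points falling in that bin.
import numpy as np

def regressogram(X_train, Y_train, h):
    bins_train = np.floor(X_train / h).astype(int)

    def eta_hat(X_query):
        bins_query = np.floor(X_query / h).astype(int)
        out = np.zeros(len(X_query))
        for q, b in enumerate(bins_query):
            in_bin = np.all(bins_train == b, axis=1)   # points in the same bin
            if in_bin.any():                           # empty bins keep 0.0
                out[q] = Y_train[in_bin].mean()
        return out

    return eta_hat

# Usage in d = 2 with h = 0.25 (a 4 x 4 grid of bins on [0, 1)^2):
rng = np.random.default_rng(1)
X_train = rng.uniform(size=(500, 2))
Y_train = (X_train[:, 0] + rng.normal(0, 0.1, size=500) > 0.5).astype(int)
eta_hat = regressogram(X_train, Y_train, h=0.25)
f_hat = lambda x: (eta_hat(x) >= 0.5).astype(int)      # plug-in classifier
print(f_hat(np.array([[0.1, 0.5], [0.9, 0.5]])))
```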
Regressograms. Pros/cons
✓ Pros
• Simple, intuitive & interpretable
• Low computational complexity
✗ Cons
• Finding the correct value of h
• Partition not data-dependent (why bins?)
• Lots of empty bins
• Space complexity: the number of bins is huge
K-Nearest Neighbors. The model
• Adaptive partition of X = R^d
• One parameter: k_n
• Neighborhood N_{k_n}(x) = {the k_n closest X_j to x}
• η̂(x) = ♯{i : X_i ∈ N_{k_n}(x) and Y_i = 1} / ♯{i : X_i ∈ N_{k_n}(x)}   (sketch below)
• Piece-wise (polytopial) constant
Th. If k_n/n → 0 and k_n → ∞, “consistency”
• Proof ideas:
  1. Approximation error: k_n/n → 0
  2. Estimation error: k_n → ∞
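A numpy sketch of the k-NN estimate η̂ (Euclidean distances, ties broken arbitrarily; the helper names are mine). The plug-in classifier is obtained, as before, by thresholding at 1/2.

```python
# k-NN estimate: eta_hat(x) is the fraction of labels 1 among the k nearest
# training points of x.
import numpy as np

def knn_eta_hat(X_train, Y_train, k):
    def eta_hat(X_query):
        # Pairwise distances, shape (n_query, n_train).
        d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
        # Indices of the k nearest neighbours of each query point.
        nn = np.argpartition(d, kth=k - 1, axis=1)[:, :k]
        return Y_train[nn].mean(axis=1)
    return eta_hat

# Usage: plug-in classifier with k = 15 on toy data.
rng = np.random.default_rng(2)
X_train = rng.uniform(size=(500, 2))
Y_train = (X_train[:, 0] > 0.5).astype(int)
eta_hat = knn_eta_hat(X_train, Y_train, k=15)
print((eta_hat(np.array([[0.2, 0.7], [0.8, 0.3]])) >= 0.5).astype(int))
```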
K-Nearest Neighbors. Pros/cons
✓ Pros
• Intuitive & (somewhat) interpretable
• Data-dependent partition
• No empty bins (& no arbitrary choice)
• Low space complexity
✗ Cons
• Finding the correct value of k
• Weirdly shaped partition
• Computational complexity: finding the partition is costly
Kernel-Methods (Nadaraya-Watson)
• Adaptive partition of X = R^d
• 2 parameters: a kernel K_n(·) : X → R+ and a window h ∈ R+
• η̂(x) = Σ_i K_n((x − X_i)/h) Y_i / Σ_j K_n((x − X_j)/h)   (sketch below)
Th. If h_n → 0 and n·h_n^d → ∞, “consistency”
• Proof ideas:
  1. Approximation error: h_n → 0
  2. Estimation error: n·h_n^d → ∞
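A sketch of the Nadaraya-Watson estimate; the Gaussian kernel and the window h = 0.2 used in the example are illustrative assumptions, and any non-negative kernel can be passed instead.

```python
# Nadaraya-Watson: eta_hat(x) = sum_i K((x - X_i)/h) Y_i / sum_j K((x - X_j)/h).
import numpy as np

def nadaraya_watson(X_train, Y_train, h, K=None):
    if K is None:
        # Unnormalised Gaussian kernel; the constant cancels in the ratio.
        K = lambda u: np.exp(-0.5 * np.sum(u ** 2, axis=-1))

    def eta_hat(X_query):
        # Kernel weights K((x - X_i)/h), shape (n_query, n_train).
        W = K((X_query[:, None, :] - X_train[None, :, :]) / h)
        return (W @ Y_train) / W.sum(axis=1)
    return eta_hat

# Usage with window h = 0.2 on toy data:
rng = np.random.default_rng(3)
X_train = rng.uniform(size=(500, 2))
Y_train = (X_train[:, 0] > 0.5).astype(int)
eta_hat = nadaraya_watson(X_train, Y_train, h=0.2)
print((eta_hat(np.array([[0.2, 0.7], [0.8, 0.3]])) >= 0.5).astype(int))
```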
Typical Kernel
• Usual properties
  • Normalized: ∫_X K(u) du = 1
  • Symmetric: K(−u) = K(u)
  • Bounded variance: ∫_X ∥u∥² K(u) du < ∞ & ∫_X K²(u) du < ∞
• Typical kernels (written out in code below)
  • uniform: K(x) = (1/2) · 1{x ∈ [−1, 1]}
  • triangular: K(x) = (1 − |x|) · 1{x ∈ [−1, 1]}
  • Gaussian: K(x) = (1/√(2π)) · exp(−x²/2)
  • sigmoid: K(x) = (2/π) · 1/(eˣ + e⁻ˣ)
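The kernels above written out as numpy one-liners, with a crude check of the normalization ∫ K(u) du = 1 on a grid; the grid range is an arbitrary choice of mine, and the 2/π constant in the sigmoid kernel follows the reconstruction on this slide.

```python
# The four univariate kernels of the slide, plus a Riemann-sum check that each
# integrates to (approximately) 1 over the real line.
import numpy as np

kernels = {
    "uniform":    lambda x: 0.5 * (np.abs(x) <= 1),
    "triangular": lambda x: np.maximum(1 - np.abs(x), 0.0),
    "gaussian":   lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi),
    "sigmoid":    lambda x: (2 / np.pi) / (np.exp(x) + np.exp(-x)),
}

u = np.linspace(-10, 10, 200_001)   # grid wide enough for the tails
du = u[1] - u[0]
for name, K in kernels.items():
    print(f"{name:10s} integral ≈ {np.sum(K(u)) * du:.4f}")   # close to 1.0
```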
Kernels. Pros/cons
✓ Pros
• Intuitive & (somewhat) interpretable
• Uses all/many points to estimate
• Data-dependent
• No empty bins (& no arbitrary choice)
• Smooth/regular approximation
✗ Cons
• Finding the correct kernel K(·) and window h
Over/Under-fitting
All learning algorithms have data-fitting parameter(s)
• Choose it too small = under-fitting
  ✗ Big empirical error on the training set: (1/n) Σ_{i=1}^n ℓ(f(X_i), Y_i)
  ✗ Medium (generalization) error: E[ℓ(f(X), Y)]
• Choose it too big = over-fitting
  ✓ Small empirical error (even 0): (1/n) Σ_{i=1}^n ℓ(f(X_i), Y_i)
  ✗ Huge (generalization) error: E[ℓ(f(X), Y)]
• How to choose it? (numerical illustration below)
  • Do not focus too much on the empirical error (around 1/√n?)
  • Find several candidates & pick the smallest one (Occam’s razor)
  • Cross-validate (following lecture!)
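A small numerical illustration of the trade-off, using a k-NN plug-in classifier on one-dimensional toy data of my own choosing: k = 1 fits the training set perfectly but generalizes worse, k = n predicts a single class everywhere, and an intermediate k does best on held-out data.

```python
# Train vs held-out 0/1 error of the k-NN plug-in classifier for three values
# of k, illustrating over-fitting (k = 1) and under-fitting (k = n).
import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    X = rng.uniform(-1, 1, size=(n, 1))
    Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-4 * X[:, 0]))).astype(int)
    return X, Y

def knn_predict(X_train, Y_train, X_query, k):
    d = np.abs(X_query[:, None, 0] - X_train[None, :, 0])
    nn = np.argpartition(d, kth=k - 1, axis=1)[:, :k]
    return (Y_train[nn].mean(axis=1) >= 0.5).astype(int)

X_tr, Y_tr = sample(300)
X_te, Y_te = sample(10_000)
for k in (1, 15, 300):            # over-fit / in-between / under-fit
    train_err = np.mean(knn_predict(X_tr, Y_tr, X_tr, k) != Y_tr)
    test_err = np.mean(knn_predict(X_tr, Y_tr, X_te, k) != Y_te)
    print(f"k = {k:3d}   train error = {train_err:.3f}   test error = {test_err:.3f}")
```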
Take home message - Local/Plug-in Methods
Lemma: R(f̂) − R∗ ≤ 2 E[ |η̂(X) − η(X)| | D_n ]
• Estimate η(·) by η̂(·).
• Plug it into the formula f∗(x) = 1{η(x) ≥ 1/2}, i.e., f̂(x) = 1{η̂(x) ≥ 1/2}
• Local methods (general form sketched in code below)
  • General form: η̂(x) = Σ_{i=1}^n ω(x, X_i; (X_1, X_2, …, X_n)) Y_i
    with convex weights for all x (in [0, 1] and summing to 1)
• Typical examples
  • Regressogram
  • k-Nearest neighbors
  • Kernel methods
• Avoid under/over-fitting
  • Many points around x with positive weight → avoids over-fitting
  • Points far from x with small/zero weight → avoids under-fitting
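To make the general local form concrete, here is a sketch (names and constants are mine) in which k-NN and Nadaraya-Watson weights both appear as convex weight functions ω(x, X_i; X_1, …, X_n): each row of weights lies in [0, 1] and sums to 1.

```python
# General local form eta_hat(x) = sum_i w(x, X_i; X_1..X_n) Y_i with convex
# weights; two weight functions recovering k-NN and Nadaraya-Watson.
import numpy as np

def knn_weights(X_query, X_train, k=15):
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argpartition(d, kth=k - 1, axis=1)[:, :k]
    W = np.zeros_like(d)
    np.put_along_axis(W, nn, 1.0 / k, axis=1)    # weight 1/k on each neighbour
    return W

def kernel_weights(X_query, X_train, h=0.2):
    d2 = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    W = np.exp(-0.5 * d2 / h ** 2)               # Gaussian kernel weights
    return W / W.sum(axis=1, keepdims=True)      # normalise rows to sum to 1

def local_eta_hat(weights, X_train, Y_train):
    return lambda X_query: weights(X_query, X_train) @ Y_train

# Both estimators share the same convex-weight structure:
rng = np.random.default_rng(5)
X_tr = rng.uniform(size=(200, 2))
Y_tr = (X_tr[:, 0] > 0.5).astype(int)
x0 = np.array([[0.7, 0.4]])
print(local_eta_hat(knn_weights, X_tr, Y_tr)(x0))
print(local_eta_hat(kernel_weights, X_tr, Y_tr)(x0))
```

The regressogram fits the same template, with ω(x, X_i) = 1{X_i ∈ bin(x)} / ♯{j : X_j ∈ bin(x)} whenever the bin of x is non-empty.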