CH 5 Classification and regression
Supervised learning
Data: inputs (features) and outputs (labels), $\{\boldsymbol{x}_i, y_i\}_{i=1}^{N}$
Model: a function $f(\boldsymbol{x}; \boldsymbol{w})$ with parameters $\boldsymbol{w}$ that maps input to output
Cost function: a dissimilarity measure $d(y, f(\boldsymbol{x}; \boldsymbol{w}))$ between observation and prediction, used to determine whether a model is good or bad
Types of supervised learning: Regression (continuous $y$), Classification (discrete $y$)
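A minimal sketch of these three components on a toy regression problem (NumPy; the names `f` and `d` are only illustrative, not from the notes):

```python
import numpy as np

# Data: features x_i and labels y_i
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs (features)
y = np.array([2.1, 3.9, 6.2, 8.1])           # outputs (labels); continuous -> regression

# Model: f(x; w) maps input to output using parameters w
def f(x, w):
    return x @ w

# Cost function: d(y, f(x; w)) measures dissimilarity between observation and prediction
def d(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)   # e.g. mean squared error

w = np.array([2.0])
print(d(y, f(X, w)))   # a small cost means this w describes the data well
```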
Classification
Logistic regression
$\boldsymbol{\pi} = \dfrac{\exp(X\boldsymbol{w})}{1 + \exp(X\boldsymbol{w})} = \dfrac{1}{1 + \exp(-X\boldsymbol{w})}$ (elementwise), where $\pi_i = P(y_i = 1 \mid \boldsymbol{x}_i)$
$\boldsymbol{w}$ can be found using the MLE approach: maximize the likelihood function $\prod_i \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$, or equivalently its logarithm
$\max_{\boldsymbol{w}} \sum_i y_i \ln \pi_i + (1 - y_i)\ln(1 - \pi_i)$
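The maximization has no closed form; a minimal sketch of finding $\boldsymbol{w}$ by gradient ascent on the log-likelihood (NumPy assumed; the design matrix's first column is all ones for the intercept, and `fit_logistic` is an illustrative name):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximize sum_i y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i) by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X @ w)          # pi_i = P(y_i = 1 | x_i)
        grad = X.T @ (y - pi)        # gradient of the log-likelihood w.r.t. w
        w += lr * grad / len(y)
    return w

# Toy data: one feature plus an intercept column
X = np.column_stack([np.ones(6), [1, 2, 3, 4, 5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
print(sigmoid(X @ w))                # predicted probabilities P(y_i = 1 | x_i)
```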
Classification trees
If the stopping criterion is met (e.g. the node contains only one type of element): add a leaf node that assigns every observation to the most prevalent class.
If the criterion is not met: partition the data into subsets. Ask a number of questions, partition the data accordingly, and select the question with the greatest purity gain.
Purity gain
A binary split creates 3 partitions: the root $r$ and the left and right branches $v_1$, $v_2$.
For each partition, the impurity $I(r)$, $I(v_1)$, $I(v_2)$ is computed. The impurity measure can be one of the following:
$\mathrm{Entropy}(v) = -\sum_{c=1}^{C} p(c|v)\log_2 p(c|v)$
$\mathrm{Gini}(v) = 1 - \sum_{c=1}^{C} p(c|v)^2$
$\mathrm{ClassError}(v) = 1 - \max_c\, p(c|v)$
where $p(c|v) = \dfrac{\text{no. of observations of class } c \text{ in branch } v}{N(v)}$
Purity gain is the weighted reduction in impurity
$\Delta = I(r) - \sum_k \dfrac{N(v_k)}{N(r)} I(v_k)$
Example
                 v1       v2       Root
P(Mammal)        0.2      0.6      0.333
P(Non-mammal)    0.8      0.4      0.667

                 v1       v2       Root
Entropy          0.7219   0.9710   0.9183
Gini             0.3200   0.4800   0.4444
Class Error      0.2000   0.4000   0.3333

The root class distribution implies $N(v_1) = 2N(v_2)$, so the branch weights are $N(v_1)/N(r) = 2/3$ and $N(v_2)/N(r) = 1/3$.

                 Entropy  Gini     Class Error
Purity gain Δ    0.1134   0.0711   0.0667
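A short numerical check of these tables (a minimal sketch using NumPy; the branch weights 2/3 and 1/3 follow from the root distribution as noted above):

```python
import numpy as np

p_v1, p_v2, p_root = [0.2, 0.8], [0.6, 0.4], [1/3, 2/3]
w1, w2 = 2/3, 1/3                      # N(v1)/N(r) and N(v2)/N(r)

def entropy(p): return -sum(q * np.log2(q) for q in p if q > 0)
def gini(p):    return 1 - sum(q ** 2 for q in p)
def cerr(p):    return 1 - max(p)

for name, I in [("Entropy", entropy), ("Gini", gini), ("Class error", cerr)]:
    delta = I(p_root) - w1 * I(p_v1) - w2 * I(p_v2)   # purity gain
    print(f"{name}: I(v1)={I(p_v1):.4f}  I(v2)={I(p_v2):.4f}  "
          f"I(r)={I(p_root):.4f}  gain={delta:.4f}")
```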
Controlling tree complexity
Stop splitting when a branch contains fewer than a specified number of observations.
Stop splitting if a certain depth of the tree is reached.
Stop splitting if purity gain ∆ for the best split is below a certain value.
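A minimal sketch of how tree growing and these stopping criteria fit together (assuming NumPy, integer class labels, Gini impurity, and single-feature threshold splits as the candidate questions; `grow_tree` and its parameter names are illustrative, not a library API):

```python
import numpy as np
from collections import Counter

def gini(y):
    """Gini impurity of a vector of integer class labels."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def grow_tree(X, y, depth=0, max_depth=3, min_samples=5, min_gain=1e-3):
    # Stopping criteria: pure node, too few observations, or maximum depth reached
    if len(np.unique(y)) == 1 or len(y) < min_samples or depth >= max_depth:
        return {"leaf": True, "label": Counter(y.tolist()).most_common(1)[0][0]}

    # Try every "question" (feature j, threshold t) and keep the best purity gain
    best, I_r = None, gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            gain = I_r - (left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left]))
            if best is None or gain > best[0]:
                best = (gain, j, t, left)

    # Stopping criterion: best split's purity gain is below min_gain
    if best is None or best[0] < min_gain:
        return {"leaf": True, "label": Counter(y.tolist()).most_common(1)[0][0]}

    gain, j, t, left = best
    return {"leaf": False, "feature": j, "threshold": t,
            "left":  grow_tree(X[left],  y[left],  depth + 1, max_depth, min_samples, min_gain),
            "right": grow_tree(X[~left], y[~left], depth + 1, max_depth, min_samples, min_gain)}

# Toy usage: one feature, two classes
X = np.array([[2.0], [3.0], [10.0], [11.0], [12.0], [1.0]])
y = np.array([0, 0, 1, 1, 1, 0])
print(grow_tree(X, y, min_samples=2))
```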
Model evaluation
Confusion matrix
                    Predicted positive      Predicted negative
Actually positive   True positive (TP)      False negative (FN)
Actually negative   False positive (FP)     True negative (TN)
$\text{Accuracy} = \dfrac{TP + TN}{N}$, $\text{error rate} = \dfrac{FP + FN}{N} = 1 - \text{Accuracy}$
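A minimal sketch of filling the confusion matrix and computing accuracy from predicted and actual labels (NumPy; binary 0/1 labels and toy values assumed):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # actual labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # predicted labels

tp = np.sum((y_true == 1) & (y_pred == 1))    # true positives
fn = np.sum((y_true == 1) & (y_pred == 0))    # false negatives
fp = np.sum((y_true == 0) & (y_pred == 1))    # false positives
tn = np.sum((y_true == 0) & (y_pred == 0))    # true negatives

accuracy = (tp + tn) / len(y_true)
error_rate = (fp + fn) / len(y_true)          # equals 1 - accuracy
print(tp, fn, fp, tn, accuracy, error_rate)
```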
K-nearest neighbor
Choose the number of neighbors K and a measure of distance.
When performing classification: 1) compute the distance to all other data objects → 2) find the K nearest data objects → 3) classify according to the majority class of the neighbors.
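A minimal sketch of these steps (NumPy, with Euclidean distance and a majority vote assumed; `knn_classify` is an illustrative name):

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    # 1) Compute the distance from x_new to all other data objects
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2) Find the K nearest data objects
    nearest = np.argsort(dists)[:k]
    # 3) Classify according to the majority of the neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_classify(np.array([0.8, 0.8]), X_train, y_train, k=3))   # -> 1
```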
Nearest neighbor decision surface
Regression
Linear model for regression
$\vec{Y} = X\vec{w} + \vec{\varepsilon}$, where the least-squares estimate is $\hat{\vec{w}} = (X^T X)^{-1} X^T \vec{y}$
Figures: regression line in a 1-dimensional feature space; regression plane in a 2-dimensional feature space.
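A minimal sketch of fitting the linear model with the closed-form estimate $\hat{\vec{w}} = (X^T X)^{-1} X^T \vec{y}$ (NumPy; a column of ones is prepended for the intercept, and `np.linalg.solve` is used instead of forming the inverse explicitly):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept column
w_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solves (X^T X) w = X^T y
print(w_hat)                                     # [intercept, slope]
print(X @ w_hat)                                 # fitted values Y = X w
```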
Linear model after feature transformation
The features can be transformed into different forms to provide a more accurate output without affecting the linearity of the model in the parameters.
$\vec{Y} = \phi(\boldsymbol{x})^T \vec{w} + \vec{\varepsilon}$, where $\phi(\boldsymbol{x})$ is a vector of functions of the features.
Consider a model $y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$. If $x_1^2$, $\cos(x_2)$ or $\ln x_3$ are used instead, the model becomes $y = w_0 + w_1 x_1^2 + w_2 \cos(x_2) + w_3 \ln(x_3)$, i.e. $\vec{Y} = \phi(\boldsymbol{x})^T \vec{w} + \vec{\varepsilon}$ with $\phi(\boldsymbol{x}) = [1, x_1^2, \cos(x_2), \ln(x_3)]$.
Figures: regression with $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$; regression with $y = w_0 + w_1 \cos(x) + w_2 \sin(2x)$.
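A minimal sketch of the same least-squares fit after a feature transformation, e.g. the polynomial model $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ from the left-hand figure (NumPy; `phi` is an illustrative basis-function map and the data are toy values, not from the notes):

```python
import numpy as np

def phi(x):
    """Map a 1-D feature to the basis [1, x, x^2, x^3]; the model stays linear in w."""
    return np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

np.random.seed(0)
x = np.linspace(-2, 2, 30)
y = 0.5 - 1.0 * x + 0.3 * x ** 2 + 0.8 * x ** 3 + 0.1 * np.random.randn(30)

Phi = phi(x)                                     # transformed design matrix
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # same normal equations as before
print(w_hat)                                     # estimates of w_0 .. w_3
```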