Linear Classification
Introduction to Classification using Linear Classifiers
Last modified 1/1/19
Why Start with Linear Classifiers?
• Linear classifiers are the simplest classifiers
• Simpler than decision trees
• The textbook starts with decision trees; we will use decision trees to introduce some of the more advanced concepts
• The learning method is linear regression
• We will use linear classifiers to introduce some concepts in classification
• A linear classifier also provides yet one more classification algorithm
• It also helps demonstrate how different algorithms form different types of decision boundaries
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes and a class attribute
• Model the class attribute as a function of the other attributes
• Goal: previously unseen records should be assigned a class as accurately as possible (predictive accuracy)
• A test set is used to determine the accuracy of the model
• Usually the given labeled data is divided into training and test sets
• The training set is used to build the model and the test set to evaluate it (a holdout split is sketched below)
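Here is a minimal sketch of such a holdout split, assuming the labeled data is a Python list of (attributes, class) records; the 70/30 ratio is a common but illustrative choice, not one mandated by the slides:

    import random

    def train_test_split(records, test_fraction=0.3, seed=0):
        # Shuffle a copy so the split is random but reproducible
        shuffled = records[:]
        random.Random(seed).shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        # Return (training set, test set)
        return shuffled[n_test:], shuffled[:n_test]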
Classification Examples
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying physical activities based on smartphone sensor data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Memory based reasoning (Nearest Neighbor)
• Neural Networks
• Naïve Bayes
• Support Vector Machines
• Linear Regression (we start with this)
The Classification Problem
Given a collection of five instances of Katydids and five Grasshoppers, decide what type of insect the unlabeled example corresponds to. Katydid or Grasshopper?
[Figure: photographs of five Katydids, five Grasshoppers, and one unlabeled insect]
For any domain of interest, we can measure features
[Figure: insect diagram annotated with measurable features]
• Color {Green, Brown, Gray, Other}
• Has Wings?
• Abdomen Length
• Thorax Length
• Antennae Length
• Mandible Size
• Spiracle Diameter
• Leg Length
My_Collection
We can store features in a database.

Insect ID | Abdomen Length | Antennae Length | Insect Class
1         | 2.7            | 5.5             | Grasshopper
2         | 8.0            | 9.1             | Katydid
3         | 0.9            | 4.7             | Grasshopper
4         | 1.1            | 3.1             | Grasshopper
5         | 5.4            | 8.5             | Katydid
6         | 2.9            | 1.9             | Grasshopper
7         | 6.1            | 6.6             | Katydid
8         | 0.5            | 1.0             | Grasshopper
9         | 8.3            | 6.6             | Katydid
10        | 8.1            | 4.7             | Katydid

previously unseen instance = 11 | 5.1 | 7.0 | ???????

The classification problem can now be expressed as:
• Given a training database, predict the class label of a previously unseen instance
[Scatter plot: Antenna Length (y-axis, 1–10) vs. Abdomen Length (x-axis, 1–10), with Grasshoppers and Katydids plotted as separate groups]
[Same scatter plot of Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]
These data objects are called…
• exemplars
• (training) examples
• instances
• tuples
We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game.
Problem 1
[Figure: each example is a pair of bars; values give (left bar, right bar) heights]
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
Problem 1
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
What class is this object? (8, 1.5)
What about this one, A or B? (4.5, 7)
Problem 2
Oh! This one's hard!
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
What class is this object? (8, 1.5)
Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
This one is really hard! What is this, A or B? (6, 6)
Why did we spend so much time with this game?
Because we wanted to show that almost all classification problems have a geometric interpretation; check out the next 3 slides…
Problem 1
[Scatter plot: Left Bar (y-axis, 1–10) vs. Right Bar (x-axis, 1–10); class A and class B examples fall on opposite sides of the diagonal]
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
Here is the rule again: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.
Problem 2
[Scatter plot: Left Bar (y-axis, 1–10) vs. Right Bar (x-axis, 1–10); class A examples lie on the diagonal]
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Let me look it up… here it is: the rule is, if the two bars are equal sizes, it is an A. Otherwise it is a B.
Problem 3
[Scatter plot: Left Bar (y-axis) vs. Right Bar (x-axis), axes 10–100]
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
Problem 3
• An alternative rule that works on the original training data is X + Y ≤ 10 → Class A; else B
• Since bar lengths are non-negative, (X + Y)² ≤ 100 is equivalent to X + Y ≤ 10, so both rules agree on the training data (see the check below)
• Which is better?
• Ultimately, is one right and one wrong?
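A quick way to convince yourself is to test both rules on the eight training pairs from the Problem 3 slide; a minimal sketch in Python:

    # Problem 3 training data: (left bar, right bar)
    class_a = [(4, 4), (1, 5), (6, 3), (3, 7)]
    class_b = [(5, 6), (7, 5), (4, 8), (7, 7)]

    def rule_squared(x, y):
        # Original rule: (X + Y)^2 <= 100 -> class A
        return "A" if (x + y) ** 2 <= 100 else "B"

    def rule_linear(x, y):
        # Alternative rule: X + Y <= 10 -> class A
        return "A" if x + y <= 10 else "B"

    # Both rules label every training example identically
    labeled = [(p, "A") for p in class_a] + [(p, "B") for p in class_b]
    for (x, y), label in labeled:
        assert rule_squared(x, y) == rule_linear(x, y) == label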
Grasshoppers and Katydids
[Scatter plot: Antenna Length (y-axis, 1–10) vs. Abdomen Length (x-axis, 1–10) for Grasshoppers and Katydids]
previously unseen instance = 11 | 5.1 | 7.0 | ???????
• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
[Scatter plot: Antenna Length vs. Abdomen Length with the unseen instance plotted among the Katydids and Grasshoppers]
Simple Linear Classifier
R.A. Fisher (1890–1962)
[Scatter plot: a line separating Katydids (above) from Grasshoppers (below)]
If the previously unseen instance is above the line
  then class is Katydid
  else class is Grasshopper
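The rule above is easy to state in code. A minimal sketch, assuming the separating line has already been fit and is written as antenna = m × abdomen + b; the slope m and intercept b here are illustrative placeholders, not values from the slide:

    # Hypothetical line parameters; a real fit would learn these from the data
    m, b = 1.0, 0.5

    def classify(abdomen_length, antenna_length):
        # Above the line antenna = m * abdomen + b -> Katydid, else Grasshopper
        if antenna_length > m * abdomen_length + b:
            return "Katydid"
        return "Grasshopper"

    print(classify(5.1, 7.0))  # the previously unseen instance (ID 11)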
Fitting a Model to Data
• One way to build a predictive model is to specify the structure of the model with some parameters left missing
• This is called parameter learning or parametric modeling
• Common in statistics, but it also covers data mining methods since the fields overlap
• Examples: linear regression, logistic regression, support vector machines
Linear Discriminant Functions
• The equation of a line is y = mx + b
• A classification function may look like:
  • Class +: if 1.0 × age − 1.5 × balance + 60 > 0
  • Class −: if 1.0 × age − 1.5 × balance + 60 ≤ 0
• The general form is f(x) = w₀ + w₁x₁ + w₂x₂ + …
• This is a parameterized model where the weights for each feature are the parameters
• The larger the magnitude of a weight, the more important the feature
• The separator is a line in 2D, a plane in 3D, and a hyperplane in more than 3D (a sketch of this function appears below)
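Here is a minimal sketch of such a discriminant function in Python, using the age/balance weights from the bullets above; the feature values in the example calls are made up for illustration:

    def f(age, balance):
        # f(x) = w0 + w1*x1 + w2*x2 with w0 = 60, w1 = 1.0, w2 = -1.5
        return 60 + 1.0 * age - 1.5 * balance

    def classify(age, balance):
        # Class + if f(x) > 0, class - otherwise
        return "+" if f(age, balance) > 0 else "-"

    print(classify(40, 80))  # f = 60 + 40 - 120 = -20 -> "-"
    print(classify(40, 50))  # f = 60 + 40 - 75  = 25  -> "+"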
What is the Best Separator?
[Scatter plot: several candidate separating lines drawn through the two classes]
Each separator has a different margin, which is the distance to the closest point. The orange line has the largest margin.
For support vector machines, the line/plane with the largest margin is best.
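The margin can be computed directly: the distance from a point (x, y) to the line w₁x + w₂y + w₀ = 0 is |w₁x + w₂y + w₀| / √(w₁² + w₂²). A minimal sketch; the separator coefficients are hypothetical, and the points are borrowed from the insect table for illustration:

    import math

    def margin(points, w0, w1, w2):
        # Margin = distance from the separator to the closest point
        norm = math.hypot(w1, w2)
        return min(abs(w1 * x + w2 * y + w0) / norm for x, y in points)

    # Hypothetical separator x - y + 0.5 = 0, checked against a few instances
    pts = [(2.7, 5.5), (8.0, 9.1), (0.9, 4.7), (5.4, 8.5)]
    print(margin(pts, 0.5, 1.0, -1.0))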
Scoring and Ranking Instances
• Sometimes we want to know which examples are most likely to belong to a class
• Linear discriminant functions can give us this
• Closer to the separator is less confident; further away is more confident
• In fact, the magnitude of f(x) gives us this, where larger values are more confident/likely (see the ranking sketch below)
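A minimal sketch of ranking by score, reusing the hypothetical f(age, balance) from the previous sketch; instances with the largest f(x) are ranked as most likely to be class +:

    def f(age, balance):
        return 60 + 1.0 * age - 1.5 * balance

    # Hypothetical (age, balance) instances, for illustration
    instances = [(40, 80), (40, 50), (25, 30), (60, 90)]
    # Sort by score, most confidently positive first
    ranked = sorted(instances, key=lambda inst: f(*inst), reverse=True)
    print(ranked)  # [(25, 30), (40, 50), (60, 90), (40, 80)]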
Class Probability Estimation
• Class probability estimation is also something you often want
• Often free with methods like decision trees
• More complicated with linear discriminant functions, since the distance from the separator is not a probability
• Logistic regression solves this
  • We will not go into the details in this class
  • Logistic regression determines a class probability estimate (the core idea is sketched below)
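The core idea of logistic regression, with the details omitted as in the slides, is to squash the unbounded score f(x) through the logistic (sigmoid) function so it lands in (0, 1); a minimal sketch:

    import math

    def sigmoid(z):
        # Maps any real score to (0, 1)
        return 1.0 / (1.0 + math.exp(-z))

    def prob_positive(score):
        # Interpret sigmoid(f(x)) as the estimated probability of class +
        return sigmoid(score)

    print(prob_positive(-20))  # far on the negative side -> near 0
    print(prob_positive(0))    # on the separator -> 0.5
    print(prob_positive(25))   # far on the positive side -> near 1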
Classification Accuracy
Confusion Matrix:

                               Predicted class
                        Katydid (1)    Grasshopper (0)
Actual  Katydid (1)        f11              f10
class   Grasshopper (0)    f01              f00

Accuracy = (number of correct predictions) / (total number of predictions)
         = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = (number of wrong predictions) / (total number of predictions)
           = (f10 + f01) / (f11 + f10 + f01 + f00)
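A minimal sketch of these two formulas in Python, with the four confusion-matrix counts passed in directly; the example counts are made up:

    def accuracy(f11, f10, f01, f00):
        # Fraction of predictions that were correct
        return (f11 + f00) / (f11 + f10 + f01 + f00)

    def error_rate(f11, f10, f01, f00):
        # Fraction of predictions that were wrong; equals 1 - accuracy
        return (f10 + f01) / (f11 + f10 + f01 + f00)

    print(accuracy(40, 10, 5, 45))    # 85 / 100 = 0.85
    print(error_rate(40, 10, 5, 45))  # 15 / 100 = 0.15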
Confusion Matrix
• In a binary decision problem, a classifier labels examples as either positive or negative.
• Classifiers produce a confusion/contingency matrix, which shows four entries: TP (true positive), TN (true negative), FP (false positive), FN (false negative)

Confusion Matrix:

                      Predicted Positive (+)   Predicted Negative (−)
Actual Positive (Y)            TP                       FN
Actual Negative (N)            FP                       TN

For now you are responsible for knowing Recall and Precision (their definitions are sketched below).
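The slides do not spell the formulas out, but the standard definitions are Precision = TP / (TP + FP) and Recall = TP / (TP + FN); a minimal sketch:

    def precision(tp, fp):
        # Of everything predicted positive, what fraction was actually positive?
        return tp / (tp + fp)

    def recall(tp, fn):
        # Of everything actually positive, what fraction did we find?
        return tp / (tp + fn)

    print(precision(40, 5))  # 40 / 45 ≈ 0.889
    print(recall(40, 10))    # 40 / 50 = 0.8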
The simple linear classifier is defined for higher dimensional spaces… we can visualize it as being an n-dimensional hyperplane.
Which of the “Problems” can be solved by the Simple Linear Classifier?
[Figure: the three problem scatter plots, each with a linear separator attempted]
1) Perfect
2) Useless
3) Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.
A Famous Problem
R. A. Fisher's Iris Dataset.
• 3 classes
• 50 of each class
The task is to classify Iris plants into one of 3 varieties using Petal Length and Petal Width.
[Images: Iris Setosa, Iris Versicolor, Iris Virginica]
Data: https://archive.ics.uci.edu/ml/datasets/iris
Iris Setosa, Iris Versicolor, Iris Virginica
We can generalize to N classes by fitting N−1 lines. In this case we first learn the line to discriminate between Setosa and Virginica/Versicolor, then we learn to approximately discriminate between Virginica and Versicolor.
[Scatter plot: Petal Width vs. Petal Length with the two learned lines and the Setosa, Versicolor, and Virginica clusters]
If petal width > 3.272 − (0.325 × petal length) then class = Virginica
Elseif petal width…
(a sketch of this rule appears below)
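A minimal sketch of the first learned rule in Python; the second branch is truncated on the slide ("Elseif petal width…"), so it is left as a placeholder here rather than invented:

    def classify_iris(petal_length, petal_width):
        # First learned line, taken from the slide
        if petal_width > 3.272 - 0.325 * petal_length:
            return "Virginica"
        # The slide's second rule ("Elseif petal width…") is truncated;
        # a second line would separate the remaining two classes here
        raise NotImplementedError("second discriminant line not given on the slide")

    print(classify_iris(6.0, 2.0))  # 2.0 > 3.272 - 1.95 = 1.322 -> Virginica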
How to Compare Classification Algorithms?
• What criteria do we care about? What matters?
• Performance: predictive accuracy, etc.
• Speed and scalability
  • time to construct the model
  • time to use/apply the model
• Expressive power
  • how flexible is the decision boundary
• Interpretability
  • understanding and insight provided by the model
  • ability to explain/justify the results
• Robustness
  • handling noise, missing values, irrelevant features, and streaming data