CS 476 Introduction to Machine Learning, Module 2
Module 2 – Syllabus
2.1 Probably Approximately Correct (PAC) Learning, 2.2 Noise, 2.3 Learning Multiple Classes, 2.4 Model Selection and Generalization, 2.5 Dimensionality Reduction, 2.6 Subset Selection, 2.7 Principal Component Analysis
2.1 Probably Approximately Correct (PAC) Learning
➢ Explain PAC learning with example.
➢ Interpret how we can find the right number of training examples so that the
hypothesis has a low error rate.
In Probably Approximately Correct (PAC) learning, given a class, C, and examples drawn
from some unknown but fixed probability distribution, p(x), we want to find the number of
examples, N, such that with probability at least 1 − δ, the hypothesis h has error at most ε, for arbitrary δ ≤ 1/2 and ε > 0.
In short, the goal of a PAC learner is to build, with high probability (at least 1 − δ), a hypothesis that is approximately correct (i.e., has error at most ε):

P{C Δ h ≤ ε} ≥ 1 − δ
where CΔh is the region of difference between C and h. We would like to make sure that the
probability of a positive example falling in here (and causing an error) is at most ε, with a
confidence 1- δ.
Assuming S is the tightest possible rectangle, the error region between C and h is the sum of four rectangular strips (see figure).
For any of these strips, if we can guarantee that the
probability is upper bounded by ε/4, the error is at
most 4(ε/4) = ε.
We count the overlaps in the corners twice, and the
total actual error in this case is less than 4(ε/4).
The probability that a randomly drawn example
misses this strip is 1 − ε/4.
The probability that all N independent draws miss the strip is (1 − ε/4)^N, and the probability that all N independent draws miss any of the four strips is at most 4(1 − ε/4)^N, which we would like to be at most δ.
We have the inequality (1 − x) ≤ exp(−x), so 4(1 − ε/4)^N ≤ 4 exp(−εN/4). Hence, if we choose N and δ such that 4 exp(−εN/4) ≤ δ, we also have 4(1 − ε/4)^N ≤ δ. Dividing both sides by 4, taking the (natural) log, and rearranging terms, we get

N ≥ (4/ε) log(4/δ)
Therefore, provided that we take at least (4/ε)log(4/δ) independent examples from C
and use the tightest rectangle as our hypothesis h, with confidence probability at least 1 −
δ, a given point will be misclassified with error probability at most ε.
We can have arbitrarily large confidence by decreasing δ and arbitrarily small error by decreasing ε. The number of examples is a slowly growing function of 1/ε and 1/δ, linear and logarithmic, respectively.
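As a quick numerical illustration of this bound, here is a minimal Python sketch (the function name and example values are ours, not from the text) that computes the smallest integer N satisfying N ≥ (4/ε) log(4/δ):

```python
import math

def pac_sample_bound(epsilon, delta):
    """Smallest N with N >= (4/epsilon) * ln(4/delta) for the
    tightest-rectangle learner (illustrative helper, not from the text)."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

# e.g., error at most 5% with confidence 95% (epsilon = delta = 0.05):
print(pac_sample_bound(0.05, 0.05))  # -> 351 examples suffice
```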
2.2 Noise
➢ Explain Noise in data with example
➢ What are the reasons for noise in Training data
➢ Justify why we should prefer a simple model over a complex model.
➢ Explain the concept of Occam’s razor.
➢ Explain the terms bias and variance with example.
➢ Explain the trade-off between bias and variance in selecting a model.
Noise is any unwanted anomaly in the data. Due to noise, the class may be more difficult to learn, and zero error may be infeasible with a simple hypothesis class.
There are several reasons/interpretations of noise:
● There may be imprecision in recording the input attributes, which may shift the data
points in the input space.
● There may be errors in labeling the data points, which may relabel positive instances as negative and vice versa. This is sometimes called teacher noise.
● There may be additional attributes, which we have not taken into account, that affect the label of an instance.
When there is noise, there is not a simple boundary between the positive and negative
instances, and zero misclassification error may not be possible with a simple hypothesis. A
rectangle is a simple hypothesis with four parameters defining the corners. But to define a
more complicated shape one needs a more complex model with a much larger number of
parameters. With a complex model, one can make a perfect fit to the data and attain zero error; see the wiggly shape in the figure. Another possibility is to keep the model simple and allow some error; see the rectangle.
Using the simple rectangle (the simpler model) makes more sense, unless its training error is much bigger, for the following reasons:
● It is a simple model to use. It is easy to check
whether a point is inside or outside a rectangle
and we can easily check, for a future data instance, whether it is a positive or a
negative instance.
● It is a simple model to train and has fewer parameters. It is easier to find the
corner values of a rectangle than the control points of an arbitrary shape.
● With a small training set, when the training instances differ a little bit, we expect the simpler model to change less than a complex model: a simple model is thus said to have less variance. On the other hand, a too-simple model assumes more and is more rigid, and may fail if the underlying class is indeed not that simple: a simpler model has more bias. Finding the optimal model corresponds to minimizing both the bias and the variance.
● It is a simple model to explain. A rectangle simply corresponds to defining intervals
on the two attributes. By learning a simple model, we can extract information from
the raw data given in the training set.
● If indeed there is mislabeling or noise in the input and the actual class is really a simple model like the rectangle, then the simple rectangle, because it has less variance and is less affected by single instances, will be a better discriminator than the wiggly shape, even though it may make slightly more errors on the training set.
Given comparable empirical error, we say that a simple (but not too simple) model would
generalize better than a complex model. This principle is known as Occam’s razor,
which states that simpler explanations are more plausible and any unnecessary
complexity should be avoided.
2.3 Learning Multiple Classes
➢ Explain K class classification with example, how is it different from two class
classification.
➢ How is empirical error calculated in K-class classification compared to two-class classification?
➢ When does a classifier reject an input instance?
In a two-class learning problem (e.g., family cars), we have positive examples belonging to the class family car and negative examples belonging to all other cars.
In general, we can have K classes denoted as Ci, i = 1, ..., K, and an input instance belongs to one and exactly one of them. The training set is now of the form

X = {x^t, r^t}, t = 1, ..., N

where r^t has K dimensions and

r_i^t = 1 if x^t ∈ Ci, and r_i^t = 0 if x^t ∈ Cj, j ≠ i
Example: here there are three classes: family
car, sports car, and luxury sedan. There are
three hypotheses induced, each one covering
the instances of one class and leaving outside
the instances of the other two classes. ‘?’ marks the reject regions, where no class, or more than one class, is chosen.
The aim of K-class classification is to learn the boundary separating the instances of each class from the instances of all other classes. We view a K-class classification problem as K two-class problems.
The training examples belonging to class Ci are the positive instances of hypothesis hi, and the examples of all other classes are the negative instances of hi. Thus in a K-class problem, we have K hypotheses to learn such that

hi(x^t) = 1 if x^t ∈ Ci, and hi(x^t) = 0 if x^t ∈ Cj, j ≠ i
The total empirical error takes a sum over the predictions for all classes over all instances:

E({hi} | X) = Σ_{t=1..N} Σ_{i=1..K} 1( hi(x^t) ≠ r_i^t )
For a given x, ideally only one of hi(x),i = 1,...,K is 1 and we can choose a class. But when no,
or two or more, hi(x) is 1, we cannot choose a class, and this is the case of doubt and the
classifier rejects such cases.
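The decision rule with reject, and the empirical error above, can be sketched in a few lines of Python; the hypotheses are assumed to be caller-supplied functions returning 0 or 1 (an illustrative interface, not prescribed by the text):

```python
def classify_with_reject(x, hypotheses):
    """hypotheses: list of K functions h_i(x) -> 0/1, one per class.
    Return the class index if exactly one h_i fires; else None (reject)."""
    fired = [i for i, h in enumerate(hypotheses) if h(x) == 1]
    return fired[0] if len(fired) == 1 else None  # doubt -> reject

def empirical_error(X, R, hypotheses):
    """Sum of 0/1 losses over all K classes and all N instances:
    E = sum over t, i of 1(h_i(x^t) != r_i^t)."""
    return sum(h(x) != r[i]
               for x, r in zip(X, R)
               for i, h in enumerate(hypotheses))
```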
Extra Note – Can we have the same hypothesis class for all classes?
If, in a dataset, we expect all classes to have similar distributions (similar shapes in the input space), then the same hypothesis class can be used for all classes. For example, in a handwritten digit recognition dataset, we would expect all digits to have similar distributions.
But in a medical diagnosis dataset, for example, where we have two classes for sick and
healthy people, we may have completely different distributions for the two classes; there may
be multiple ways for a person to be sick, reflected differently in the inputs: All healthy people
are alike; each sick person is sick in his or her own way.
2.4 Model Selection and Generalization
➢ Explain the concepts/steps behind model selection and generalization (cover all
sub topics)
➢ How many hypotheses/functions are possible for d inputs? Illustrate with an example.
➢ Why is learning said to be Ill Posed / Explain Ill Posed Problem with an example
➢ Illustrate Inductive Bias in algorithms with example.
➢ What factors are to be considered when selecting a model (its ability to
generalize)
➢ Explain/Illustrate/Differentiate overfitting and underfitting with an example. How do they affect generalization?
➢ Explain the triple trade-off. How are the complexity of the model, the amount of training data, and generalization interdependent?
➢ Explain concepts of Training, Validation, Cross validation and Test set
Let's consider a Boolean function where all inputs and the output are binary. There are 2^d possible ways to write d binary values; therefore, with d inputs, the training set has at most 2^d examples. As shown in the table, each of these can be labeled 0 or 1, and therefore,
there are 2^(2^d) possible Boolean functions/hypotheses of d inputs.
For example, if there are 2 inputs (x1 and x2), there are 4 training examples and 16 possible hypotheses for the 2 inputs, as shown in the table below.
One way to interpret learning is this: we start with all possible hypotheses, and as we see more training examples, we remove those hypotheses that are not consistent with the training data. Each distinct training example removes half the hypotheses, namely, those whose guesses are wrong. For example, let us say we have x1 = 0, x2 = 1 and the output is 0; this removes h5, h6, h7, h8, h13, h14, h15, h16.
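This elimination process can be simulated directly for d = 2. The sketch below (plain Python; the representation is ours) encodes each hypothesis as a tuple of outputs, one per possible input, and removes those inconsistent with the example x1 = 0, x2 = 1, output 0. Which specific h's are removed depends on the table's enumeration order, but exactly half are eliminated:

```python
from itertools import product

d = 2
inputs = list(product([0, 1], repeat=d))            # the 2^d possible inputs
# each hypothesis is one output per input, so 2^(2^d) hypotheses in total
hypotheses = list(product([0, 1], repeat=len(inputs)))
print(len(hypotheses))                              # 16 for d = 2

example, label = (0, 1), 0                          # the example seen
idx = inputs.index(example)
hypotheses = [h for h in hypotheses if h[idx] == label]
print(len(hypotheses))                              # 8 hypotheses remain
```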
THE ILL-POSED PROBLEM - In the case of a Boolean function, to end up with a single hypothesis we need to see all 2^d training examples. If the training set we are given contains only a small subset of all possible instances, that is, if we know what the output should be for only a small percentage of the cases, the solution is not unique. After seeing N example cases, there remain 2^(2^d − N) possible functions. This is an example of an ill-posed problem, where the data by itself is not sufficient to find a unique solution. As we see more training examples, we know more about the underlying function, and we remove more of the hypotheses that are inconsistent with it from the hypothesis class.
INDUCTIVE BIAS – since learning is ill-posed, and data by itself is not sufficient to find
the solution, we should make some extra assumptions to have a unique solution with the data
we have. The set of assumptions we make to have learning possible is called the inductive
bias of the learning algorithm. One way we introduce inductive bias is when we assume a
hypothesis class H . In learning the class of family car, there are infinitely many ways of
separating the positive examples from the negative examples. Assuming the shape of a
rectangle is one inductive bias, and then choosing the rectangle with the largest margin, for example, is another inductive bias. In linear regression, assuming a linear function is an inductive bias, and among all lines, choosing the one that minimizes squared error is another inductive bias.
Each hypothesis class has a certain capacity and can learn only certain functions. The class of
functions that can be learned can be extended by using a hypothesis class with larger
capacity, containing more complex hypotheses. For example, the hypothesis class that is a
union of two rectangles has higher capacity, but its hypotheses are more complex. Similarly
in regression, as we increase the order of the polynomial, the capacity and complexity
increase. The question now is to decide where to stop.
Model Selection and Generalization - learning is not possible without inductive bias, and
now the question is how to choose the right bias. This is called model selection, which is
choosing between possible H. The aim of machine learning is not to replicate the training data but to make the right prediction for new cases. That is, we would like to be able to generate the right output for an input instance outside the training set, one for which the correct output is not given in the training set. How well a model trained on the training set predicts the right output for new instances is called generalization.
OVERFITTING AND UNDERFITTING - For best generalization, we should match the
complexity of the hypothesis class H with the complexity of the function underlying the data.
If H is less complex than the function, we have underfitting; for example, when trying to fit a line to data sampled from a second-order polynomial (Figure 1). In such a case, as we increase the complexity (to second order), the training error decreases (Figure 2). But if we have an H that is too complex, the data is not enough to constrain it and we may end up with a bad hypothesis, h ∈ H.
If there is noise, an overcomplex hypothesis may learn not only the underlying function but
also the noise in the data and may make a bad fit, for example, when fitting a fourth-order
polynomial to noisy data sampled from a second-order polynomial (Figure 3). This is called
overfitting. In such a case, having more training data helps but only up to a certain point.
Given a training set and H , we can find h ∈ H that has the minimum training error but if H
is not chosen well, no matter which h ∈ H we pick, we will not have good generalization.
The problem with overfitting (using a complex model) is that the model fits the training data well, but for a new instance the prediction error will be higher than with a simpler model. Thus overfitting results in poor generalization.
Example of overfitting and underfitting (regression example below; a numerical sketch also follows).
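A minimal numpy sketch of this regression example (our own synthetic data: a noisy second-order polynomial, so the seed and constants are illustrative): training error keeps dropping as the polynomial order grows, while the error on new instances is typically lowest near the true order and rises again for the overcomplex fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 2 * x**2 - x + 0.2 * rng.standard_normal(x.size)  # noisy quadratic

x_new = np.linspace(-1, 1, 200)                        # "future" instances
y_new = 2 * x_new**2 - x_new                           # true underlying function
for order in (1, 2, 4):                                # under / right / over
    coeffs = np.polyfit(x, y, order)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(order, round(train_err, 4), round(test_err, 4))
```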
Example of overfitting and underfitting (classification example below).
TRIPLE TRADE-OFF - In all learning algorithms that are trained from example data, there is a trade-off between three factors:
1. The complexity of the hypothesis we fit to data, namely, the capacity of the
hypothesis class,
2. the amount of training data, and
3. the generalization error on new examples.
● As the amount of training data increases, the generalization error decreases.
● As the complexity of the model class H increases, the generalization error decreases
first and then starts to increase.
● The generalization error of an overcomplex H can be kept in check by increasing the amount of training data, but only up to a point.
VALIDATION SET, CROSS-VALIDATION AND TEST SET - We can measure the generalization ability of a hypothesis, i.e., the quality of its inductive bias, if we have access to data outside the training set.
● We simulate this by dividing the training set we have into two parts. We use one
part for training (i.e., to fit a hypothesis), and the remaining part is called the
validation set and is used to test the generalization ability.
● That is, given a set of possible hypothesis classes Hi, for each we fit the best hi ∈ Hi
on the training set. Then, assuming large enough training and validation sets, the
hypothesis that is the most accurate on the validation set is the best one (the one that
has the best inductive bias). This process is called cross-validation.
● For example, to find the right order in polynomial regression, given a number of candidate polynomials of different orders (where polynomials of different orders correspond to the Hi), for each order we find the coefficients on the training set, calculate their errors on the validation set, and take the one that has the least validation error as the best polynomial (a small code sketch follows this list).
● In order to report the error, we should not use the validation error: we have used the validation set to choose the best model, and it has effectively become a part of the training set.
● We need a third set, a test set, sometimes also called the publication set, containing
examples not used in training or validation.
● We cannot keep on using the same training/validation split either, because after
having been used once, the validation set effectively becomes part of training data.
● If we have a fixed set which we divide for training, validation, and test, we will have different errors depending on how we do the division. These slight differences in error allow us to estimate how large a difference should be considered significant. That is, in choosing between two hypothesis classes Hi and Hj, we use them both multiple times on a number of training and validation sets and check whether the difference between the average errors of hi and hj is larger than the average difference between multiple runs of hi.
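Here is the polynomial-order example as a minimal numpy sketch (the data-generating function, seed, and split sizes are our own, purely illustrative): each candidate order is fit on the training split, and the order with the least validation error is chosen.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 2 * x**2 - x + 0.2 * rng.standard_normal(x.size)   # noisy quadratic

x_tr, y_tr = x[:40], y[:40]                    # training set
x_val, y_val = x[40:], y[40:]                  # validation set

def val_error(order):
    coeffs = np.polyfit(x_tr, y_tr, order)     # fit on the training set
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

best = min(range(1, 8), key=val_error)         # least validation error
print("chosen order:", best)                   # typically 2 here
```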
2.5 Dimensionality Reduction
➢ Explain the reasons to opt for Dimensionality reduction of input data as a
pre-processing step.
i. In most learning algorithms, the complexity depends on the number of input dimensions (d) as well as on the size of the data sample (N). In order to reduce memory and computation, we reduce the dimensionality of the problem. Decreasing d also decreases the complexity of the inference algorithm during testing.
ii. When data can be represented in a few dimensions without loss of information, it can be plotted and analyzed visually for structure and outliers.
iii. When an input is decided to be unnecessary, we save the cost of extracting it.
iv. Simpler models are more robust on small datasets. Simpler models have less variance,
that is, they vary less depending on the particulars of a sample, including noise,
outliers, and so forth.
v. When data can be explained with fewer features, we get a better idea about the
process that underlies the data and this allows knowledge extraction.
➢ Explain/Compare the two methods for reducing dimensionality: (refer 2.6 &
2.7 for detailed explanation). Ie – compare Feature selection and Feature
Extraction.
i. Feature Selection – Subset selection method
ii. Feature Extraction – Principal Component Analysis (PCA) method
In feature selection, we are interested in finding k of the d dimensions that give us the most information, and we discard the other (d − k) dimensions. Subset selection is a feature selection method.
In feature extraction, we are interested in finding a new set of k dimensions that are
combinations of the original d dimensions. These methods may be supervised or
unsupervised depending on whether or not they use the output information. The best known
and most widely used feature extraction methods are Principal Components Analysis (PCA)
and Linear Discriminant Analysis (LDA), which are both linear projection methods,
unsupervised and supervised respectively.
2.6 Subset Selection
➢ Explain Subset selection, and the 2 methods for subset selection
➢ Explain/Compare forward and backward subset selection methods
➢ What are the disadvantages of the forward subset selection method and how can we overcome them?
➢ Analyse the complexity of forward/backward subset selection procedure
● In subset selection, we are interested in finding the best subset of the set of features.
The best subset contains the least number of dimensions that most contribute to
accuracy. We discard the remaining, unimportant dimensions.
● Using a suitable error function (mean square error or misclassification error), subset selection can be used in both regression and classification problems.
● There are 2^d possible subsets of d variables, but we cannot test all of them unless d is small; instead we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
● There are two approaches to Subset Selection
i. Forward Selection
ii. Backward selection
Let us denote by F a feature set of input dimensions xi, i = 1, ..., d.
E(F) denotes the error incurred on the validation sample when only the inputs in F are used.
Depending on the application, the error is either the mean square error or misclassification
error.
FORWARD SELECTION - we start with no input variables and add them one by one, at
each step adding the one that decreases the error the most, until any further addition does not
decrease the error (or decreases it only slightly)
Checking of the error is done on a validation set distinct from the training set because we
want to test the generalization accuracy. With more features, generally we have lower
training error, but not necessarily lower validation error.
In sequential forward selection, the steps are
i. We start with no features: F = ∅.
ii. At each step, for all possible xi, we train our model on the training set and calculate
E(F ∪ xi ) on the validation set.
iii. Then, we choose the input xj that causes the least error,

j = argmin_i E(F ∪ xi)

and we add xj to F if E(F ∪ xj) < E(F).
iv. We stop when
● adding any feature does not decrease E.
● We may even decide to stop earlier, if the decrease in error is too small, using a user-defined threshold that depends on the application constraints.
Adding a new feature introduces the cost of observing the feature, as well as making the
classifier/regressor more complex.
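A sketch of sequential forward selection in Python. The interface is assumed, not prescribed by the text: error(F) should train the model using only the features in F and return its validation error (it must also handle the empty set, e.g., with a baseline predictor).

```python
def forward_selection(d, error, threshold=0.0):
    """Greedy forward selection over feature indices 0..d-1.
    `error(F)` is an assumed user-supplied function: train with the
    features in set F, return the validation error."""
    F = set()
    best_err = error(F)                   # error of the baseline (no features)
    while len(F) < d:
        # try adding each remaining feature; keep the best single addition
        err, j = min((error(F | {i}), i) for i in range(d) if i not in F)
        if best_err - err <= threshold:   # no (or too small) improvement: stop
            break
        F.add(j)
        best_err = err
    return F
```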
Complexity analysis of forward search - This process may be costly because to decrease the dimensions from d to k, we need to train and test the system d + (d−1) + (d−2) + ··· + (d−k) times, which is O(d^2).
Disadvantage with Forward Selection - Forward selection is a local search procedure and
does not guarantee finding the optimal subset, namely, the minimal subset causing the
smallest error. For example, xi and xj by themselves may not be good but together may
decrease the error a lot, but because this algorithm is greedy and adds attributes one by one, it
may not be able to detect this. It is possible to generalize and add multiple features at a time,
instead of a single one, at the expense of more computation.
Solution - We can backtrack and check which previously added feature can be removed after a current addition, thereby increasing the search space, but this increases complexity. In floating search methods, the number of added features and removed features can change at each step.
BACKWARD SELECTION - we start with all variables and remove them one by one, at
each step removing the one that decreases the error the most (or increases it only slightly),
until any further removal increases the error significantly.
In sequential backward selection, the steps are
i. We start with F containing all features.
ii. At each step, for all possible xi, we train our model on the training set with one attribute removed at a time and calculate E(F − xi) on the validation set.
iii. Then, we choose the input xj whose removal causes the least error,

j = argmin_i E(F − xi)

and we remove xj from F if E(F − xj) < E(F).
iv. We stop
● if removing a feature does not decrease the error. To decrease complexity, we may decide to remove a feature even if its removal causes only a slight increase in error.
All the variants possible for forward search are also possible for backward search.
Backward search has the same order of complexity as forward search, except that training a system with more features is more costly than training a system with fewer features; forward search may therefore be preferable, especially if we expect many useless features.
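Under the same assumed error(F) interface as the forward-selection sketch above, the backward variant starts from the full feature set and drops the feature whose removal hurts the least, as long as the increase in error stays within a small slack:

```python
def backward_selection(d, error, slack=0.0):
    """Greedy backward selection over feature indices 0..d-1, using the
    same assumed `error(F)` interface as forward_selection above."""
    F = set(range(d))
    current_err = error(F)
    while len(F) > 1:
        # find the feature whose removal gives the smallest error
        err, j = min((error(F - {i}), i) for i in F)
        if err > current_err + slack:  # every removal hurts too much: stop
            break
        F.remove(j)
        current_err = err
    return F
```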
Subset selection is a supervised method because the outputs are used by the regressor or classifier to calculate the error.
In an application like face recognition, feature selection is not a good method for
dimensionality reduction because individual pixels by themselves do not carry much
discriminative information; it is the combination of values of several pixels together that
carry information about the face identity. This is done by a feature extraction method like Principal Component Analysis.
2.7 Principal Component Analysis (PCA)
➢ Illustrate and Explain steps in PCA with example and its application
➢ How can we identify the number of principal components in a dataset
➢ Explain how the spectral decomposition property is utilized in PCA
➢ What do you mean by eigenfaces and eigendigits
In contrast to subset selection, which selects a subset of the original features (feature selection), PCA is a feature extraction method.
● PCA is a projection method – we are interested in finding a mapping from
the inputs in the original d-dimensional space to a new (k < d)-dimensional
space, with minimum loss of information
● Principal components analysis (PCA) is an unsupervised method because it
does not use the output information; the criterion to be maximized is the
variance.
● PCA can be done by eigenvalue decomposition of the covariance matrix of the feature dataset.
● We choose the direction along which the variance is maximum: the principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue.
● Each succeeding principal component has the highest variance under the constraint that it is orthogonal to the preceding components.
● PCA provides a mechanism to recognize the geometric similarity in data through algebraic means.
Steps in PCA
(Steps 1-3, worked out in the original figures: center the data by subtracting the mean, compute the covariance matrix S of the centered data, and find the eigenvalues and eigenvectors of S.)

The covariance matrix S is a symmetric matrix, and according to the Spectral Theorem (spectral decomposition), it has d real eigenvalues with orthogonal eigenvectors, satisfying

A v⃗i = λi v⃗i,  i = 1, ..., d

Here we call v⃗i an eigenvector, λi the corresponding eigenvalue, and A the covariance matrix (A = S).
Step 4: Inferring the principal components from the eigenvalues of the covariance matrix

From the Spectral Theorem we infer that the variance of the data along an eigenvector direction is given by the corresponding eigenvalue. The most significant principal component is the eigenvector corresponding to the largest eigenvalue; the next is the eigenvector with the second largest eigenvalue, and so on.

We know the proportion of variance explained by the first k components is

(λ1 + λ2 + ··· + λk) / (λ1 + λ2 + ··· + λd)

which helps decide how many components to keep.
Step 5: Projecting the data using the principal components

The projection matrix W is formed by taking the k (k < d) selected eigenvectors as its columns. The original dataset is transformed via the projection matrix to obtain its representation in a reduced k-dimensional subspace:

z = Wᵀ(x − m)

where m is the sample mean used to center the data.
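Putting the steps together, here is a minimal numpy sketch of PCA (centering, covariance matrix, eigendecomposition, projection onto the top-k eigenvectors); the data is random and purely illustrative:

```python
import numpy as np

def pca(X, k):
    """Project the N x d data matrix X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                # step 1: center the data
    S = np.cov(Xc, rowvar=False)           # step 2: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # step 3: eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]      # step 4: sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]              # projection matrix (d x k)
    return Xc @ W, W, eigvals[order]       # step 5: z = W^T (x - m)

X = np.random.default_rng(2).normal(size=(100, 5))
Z, W, lam = pca(X, 2)
print(Z.shape)                 # (100, 2): data in the reduced subspace
print(lam / lam.sum())         # proportion of variance per component
```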
Example: Eigenfaces and Eigendigits
Eigenfaces and eigendigits are the names given to sets of eigenvectors when they are used in the computer vision problems of human face recognition and handwritten digit recognition. The eigenfaces and eigendigits themselves form a basis set for the images used to construct the covariance matrix. This produces dimensionality reduction by allowing the smaller set of basis images to represent the original training images.
PCA additional references:
http://www.math.union.edu/~jaureguj/PCA.pdf
http://alexhwoods.com/eigenvalues/
http://www.inf.ed.ac.uk/teaching/courses/iaml/2011/slides/pca.pdf
Prepared by Abin Philip, Asst. Prof., Toc H.
Reference: Introduction to Machine Learning, 2nd edition, Ethem Alpaydin.