CS 476 Introduction to Machine Learning, Module 2
Module 2 – Syllabus
2.1 Probably Approximately Correct (PAC) Learning, 2.2 Noise, 2.3 Learning Multiple Classes, 2.4 Model Selection and Generalization, 2.5 Dimensionality Reduction, 2.6 Subset Selection, 2.7 Principal Component Analysis
2.1 Probably Approximately Correct (PAC) Learning
➢ Explain PAC learning with example.
➢ Interpret how we can find the right number of training examples so that the
hypothesis has a low error rate.
In Probably Approximately Correct (PAC) learning, given a class, C, and examples drawn
from some unknown but fixed probability distribution, p(x), we want to find the number of
examples, N, such that with probability at least 1 − δ, the hypothesis h has error at most ε, for arbitrary δ ≤ 1/2 and ε > 0.
In short, the goal of a PAC learner is to build, with high probability (at least 1 − δ), a hypothesis that is approximately correct (i.e., has error at most ε):

P{C Δ h ≤ ε} ≥ 1 − δ
where CΔh is the region of difference between C and h. We would like to make sure that the
probability of a positive example falling in here (and causing an error) is at most ε, with a
confidence 1- δ.
Assuming S is the tightest possible rectangle, the error region between C and h is the sum of four rectangular strips (see figure).
For any of these strips, if we can guarantee that the
probability is upper bounded by ε/4, the error is at
most 4(ε/4) = ε.
We count the overlaps in the corners twice, and the
total actual error in this case is less than 4(ε/4).
The probability that a randomly drawn example
misses this strip is 1 − ε/4.
The probability that all N independent draws miss the strip is (1 − ε/4)^N, and the probability that all N independent draws miss any of the four strips is at most 4(1 − ε/4)^N, which we would like to be at most δ.
We have the inequality (1 − x) ≤ exp(−x), so 4(1 − ε/4)^N ≤ 4 exp(−εN/4). Hence, if we choose N and δ such that 4 exp(−εN/4) ≤ δ, we also have 4(1 − ε/4)^N ≤ δ. Dividing both sides by 4, taking the (natural) log, and rearranging terms, we get

N ≥ (4/ε) log(4/δ)
Therefore, provided that we take at least (4/ε)log(4/δ) independent examples from C
and use the tightest rectangle as our hypothesis h, with confidence probability at least 1 −
δ, a given point will be misclassified with error probability at most ε.
We can have arbitrarily large confidence by decreasing δ and arbitrarily small error by decreasing ε. The number of examples is a slowly growing function of 1/ε and 1/δ, linear and logarithmic, respectively.
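As a quick numerical illustration of this bound, here is a minimal Python sketch (the function name and example values are ours, not from the text) that computes the smallest integer N satisfying N ≥ (4/ε) log(4/δ):

```python
import math

def pac_sample_bound(epsilon, delta):
    """Smallest N with N >= (4/epsilon) * ln(4/delta) for the
    tightest-rectangle learner (illustrative helper, not from the text)."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

# e.g., error at most 5% with confidence 95% (epsilon = delta = 0.05):
print(pac_sample_bound(0.05, 0.05))  # -> 351 examples suffice
```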
2.2 Noise
➢ Explain Noise in data with example
➢ What are the reasons for noise in Training data
➢ Justify why we should prefer a simple model over a complex model.
➢ Explain the concept of Occam’s razor.
➢ Explain the terms bias and variance with example.
➢ Explain the trade-off between bias and variance in selecting a model.
Noise is any unwanted anomaly in the data. Due to noise, the class may be more difficult to learn, and zero error may be infeasible with a simple hypothesis class.
There are several reasons/interpretations of noise:
● There may be imprecision in recording the input attributes, which may shift the data
points in the input space.
● There may be errors in labeling the data points, which may relabel positive instances as negative and vice versa. This is sometimes called teacher noise.
● There may be additional attributes, which we have not taken into account, that affect the label of an instance.
When there is noise, there is not a simple boundary between the positive and negative
instances, and zero misclassification error may not be possible with a simple hypothesis. A
rectangle is a simple hypothesis with four parameters defining the corners. But to define a
more complicated shape one needs a more complex model with a much larger number of
parameters. With a complex model, one can make a perfect fit to the data and attain zero error; see the wiggly shape in the figure. Another possibility is to keep the model simple and allow some error; see the rectangle.
Using the simple rectangle (the simpler model) makes more sense, unless its training error is much bigger, for the following reasons:
● It is a simple model to use. It is easy to check
whether a point is inside or outside a rectangle
and we can easily check, for a future data instance, whether it is a positive or a
negative instance.
● It is a simple model to train and has fewer parameters. It is easier to find the
corner values of a rectangle than the control points of an arbitrary shape.
● With a small training set, when the training instances differ a little bit, we expect the simpler model to change less than a complex model: a simple model is thus said to have less variance. On the other hand, a too-simple model assumes more and is more rigid, and may fail if the underlying class is indeed not that simple: a simpler model has more bias. Finding the optimal model corresponds to minimizing both the bias and the variance.
● It is a simple model to explain. A rectangle simply corresponds to defining intervals
on the two attributes. By learning a simple model, we can extract information from
the raw data given in the training set.
● If indeed there is mislabeling or noise in the input and the actual class is really a simple model like the rectangle, then the simple rectangle, because it has less variance and is less affected by single instances, will be a better discriminator than the wiggly shape, even though it may make slightly more errors on the training set.
Given comparable empirical error, we say that a simple (but not too simple) model would
generalize better than a complex model. This principle is known as Occam’s razor,
which states that simpler explanations are more plausible and any unnecessary
complexity should be avoided.
2.3 Learning Multiple Classes
➢ Explain K class classification with example, how is it different from two class
classification.
➢ How is empirical error calculated in K-class classification compared to two-class classification?
➢ When does a classifier reject an input instance?
In a two-class learning problem (e.g., family cars), we have positive examples belonging to the class family car and negative examples belonging to all other cars.
In general, we can have K classes denoted as Ci, i = 1, ..., K, and an input instance belongs to one and exactly one of them. The training set is now of the form

X = {x^t, r^t}, t = 1, ..., N

where r^t has K dimensions and

r_i^t = 1 if x^t ∈ Ci, and r_i^t = 0 if x^t ∈ Cj, j ≠ i
Example: here there are three classes: family
car, sports car, and luxury sedan. There are
three hypotheses induced, each one covering
the instances of one class and leaving outside
the instances of the other two classes. ‘?’ marks the reject regions, where no class, or more than one class, is chosen.
The aim of K-class classification is to learn the boundary separating the instances of each class from the instances of all other classes. We view a K-class classification problem as K two-class problems.
The training examples belonging to class Ci are the positive instances of hypothesis hi, and the examples of all other classes are the negative instances of hi. Thus in a K-class problem, we have K hypotheses to learn such that

hi(x^t) = 1 if x^t ∈ Ci, and hi(x^t) = 0 if x^t ∈ Cj, j ≠ i
The total empirical error takes a sum over the predictions for all classes over all instances:

E({hi} | X) = Σ_{t=1..N} Σ_{i=1..K} 1( hi(x^t) ≠ r_i^t )
For a given x, ideally only one of hi(x),i = 1,...,K is 1 and we can choose a class. But when no,
or two or more, hi(x) is 1, we cannot choose a class, and this is the case of doubt and the
classifier rejects such cases.
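The decision rule with reject, and the empirical error above, can be sketched in a few lines of Python; the hypotheses are assumed to be caller-supplied functions returning 0 or 1 (an illustrative interface, not prescribed by the text):

```python
def classify_with_reject(x, hypotheses):
    """hypotheses: list of K functions h_i(x) -> 0/1, one per class.
    Return the class index if exactly one h_i fires; else None (reject)."""
    fired = [i for i, h in enumerate(hypotheses) if h(x) == 1]
    return fired[0] if len(fired) == 1 else None  # doubt -> reject

def empirical_error(X, R, hypotheses):
    """Sum of 0/1 losses over all K classes and all N instances:
    E = sum over t, i of 1(h_i(x^t) != r_i^t)."""
    return sum(h(x) != r[i]
               for x, r in zip(X, R)
               for i, h in enumerate(hypotheses))
```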
Extra Note – Can we have the same hypothesis class for all classes?
If, in a dataset, we expect all classes to have similar distributions (similar shapes in the input space), then the same hypothesis class can be used for all classes. For example, in a handwritten digit recognition dataset, we would expect all digits to have similar distributions.
But in a medical diagnosis dataset, for example, where we have two classes for sick and
healthy people, we may have completely different distributions for the two classes; there may
be multiple ways for a person to be sick, reflected differently in the inputs: All healthy people
are alike; each sick person is sick in his or her own way.
2.4 Model Selection and Generalization
➢ Explain the concepts/steps behind model selection and generalization (cover all
sub topics)
➢ How many hypotheses/functions are possible for d inputs? Illustrate with an example.
➢ Why is learning said to be Ill Posed / Explain Ill Posed Problem with an example
➢ Illustrate Inductive Bias in algorithms with example.
➢ What factors are to be considered when selecting a model (its ability to
generalize)
➢ Explain/Illustrate/Differentiate overfitting and underfitting with an example. How do they affect generalization?
➢ Explain the triple trade-off. How are the complexity of the model, the amount of training data, and generalization interdependent?
➢ Explain concepts of Training, Validation, Cross validation and Test set
Let's consider a Boolean function where all inputs and the output are binary. There are 2^d possible ways to write d binary values; therefore, with d inputs, the training set has at most 2^d examples. As shown in the table, each of these can be labeled 0 or 1, and therefore,
there are 2^(2^d) possible Boolean functions/hypotheses of d inputs.
For example, if there are 2 inputs (x1 and x2), there are 4 training examples and 16 possible hypotheses for the 2 inputs, as shown in the table below.
One way to interpret learning is this: we start with all possible hypotheses, and as we see more training examples, we remove those hypotheses that are not consistent with the training data. Each distinct training example removes half the hypotheses, namely, those whose guesses are wrong. For example, let us say we have x1 = 0, x2 = 1 and the output is 0; this removes h5, h6, h7, h8, h13, h14, h15, h16.
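This elimination process can be simulated directly for d = 2. The sketch below (plain Python; the representation is ours) encodes each hypothesis as a tuple of outputs, one per possible input, and removes those inconsistent with the example x1 = 0, x2 = 1, output 0. Which specific h's are removed depends on the table's enumeration order, but exactly half are eliminated:

```python
from itertools import product

d = 2
inputs = list(product([0, 1], repeat=d))            # the 2^d possible inputs
# each hypothesis is one output per input, so 2^(2^d) hypotheses in total
hypotheses = list(product([0, 1], repeat=len(inputs)))
print(len(hypotheses))                              # 16 for d = 2

example, label = (0, 1), 0                          # the example seen
idx = inputs.index(example)
hypotheses = [h for h in hypotheses if h[idx] == label]
print(len(hypotheses))                              # 8 hypotheses remain
```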
THE ILL-POSED PROBLEM - In the case of a Boolean function, to end up with a single hypothesis we need to see all 2^d training examples. If the training set we are given contains only a small subset of all possible instances, that is, if we know what the output should be for only a small percentage of the cases, the solution is not unique. After seeing N example cases, there remain 2^(2^d − N) possible functions. This is an example of an ill-posed problem, where the data by itself is not sufficient to find a unique solution. As we see more training examples, we know more about the underlying function, and we remove more of the hypotheses that are inconsistent with it from the hypothesis class.
INDUCTIVE BIAS – since learning is ill-posed, and data by itself is not sufficient to find
the solution, we should make some extra assumptions to have a unique solution with the data
we have. The set of assumptions we make to have learning possible is called the inductive
bias of the learning algorithm. One way we introduce inductive bias is when we assume a
hypothesis class H . In learning the class of family car, there are infinitely many ways of
separating the positive examples from the negative examples. Assuming the shape of a
rectangle is one inductive bias, and then choosing the rectangle with the largest margin, for example, is another inductive bias. In linear regression, assuming a linear function is an inductive bias, and among all lines, choosing the one that minimizes squared error is another inductive bias.
Each hypothesis class has a certain capacity and can learn only certain functions. The class of
functions that can be learned can be extended by using a hypothesis class with larger
capacity, containing more complex hypotheses. For example, the hypothesis class that is a
union of two rectangles has higher capacity, but its hypotheses are more complex. Similarly
in regression, as we increase the order of the polynomial, the capacity and complexity
increase. The question now is to decide where to stop.
Model Selection and Generalization - learning is not possible without inductive bias, and
now the question is how to choose the right bias. This is called model selection, which is
choosing between possible H. The aim of machine learning is not to replicate the training data but to make the right prediction for new cases. That is, we would like to be able to generate the right output for an input instance outside the training set, one for which the correct output is not given in the training set. How well a model trained on the training set predicts the right output for new instances is called generalization.
OVERFITTING AND UNDERFITTING - For best generalization, we should match the
complexity of the hypothesis class H with the complexity of the function underlying the data.
If H is less complex than the function, we have underfitting; for example, when trying to fit a line to data sampled from a second-order polynomial (Figure 1). In such a case, as we increase the complexity (to second order), the training error decreases (Figure 2). But if we have an H that is too complex, the data is not enough to constrain it and we may end up with a bad hypothesis, h ∈ H.
If there is noise, an overcomplex hypothesis may learn not only the underlying function but
also the noise in the data and may make a bad fit, for example, when fitting a fourth-order
polynomial to noisy data sampled from a second-order polynomial (Figure 3). This is called
overfitting. In such a case, having more training data helps but only up to a certain point.
Given a training set and H , we can find h ∈ H that has the minimum training error but if H
is not chosen well, no matter which h ∈ H we pick, we will not have good generalization.
The problem with overfitting (using a complex model) is that the model fits the training data well, but for a new instance the prediction error will be higher than with a simpler model. Thus overfitting results in poor generalization.
Example of overfitting and underfitting (regression example below; a numerical sketch also follows).
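A minimal numpy sketch of this regression example (our own synthetic data: a noisy second-order polynomial, so the seed and constants are illustrative): training error keeps dropping as the polynomial order grows, while the error on new instances is typically lowest near the true order and rises again for the overcomplex fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 2 * x**2 - x + 0.2 * rng.standard_normal(x.size)  # noisy quadratic

x_new = np.linspace(-1, 1, 200)                        # "future" instances
y_new = 2 * x_new**2 - x_new                           # true underlying function
for order in (1, 2, 4):                                # under / right / over
    coeffs = np.polyfit(x, y, order)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(order, round(train_err, 4), round(test_err, 4))
```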
Example of overfitting and underfitting (classification example below).
TRIPLE TRADE-OFF - In all learning algorithms that are trained from example data, there is a trade-off between three factors:
1. The complexity of the hypothesis we fit to data, namely, the capacity of the
hypothesis class,
2. the amount of training data, and
3. the generalization error on new examples.
● As the amount of training data increases, the generalization error decreases.
● As the complexity of the model class H increases, the generalization error decreases
first and then starts to increase.
● The generalization error of an overcomplex H can be kept in check by increasing the amount of training data, but only up to a point.
VALIDATION SET, CROSS-VALIDATION AND TEST SET - We can measure the generalization ability of a hypothesis, i.e., the quality of its inductive bias, if we have access to data outside the training set.
● We simulate this by dividing the training set we have into two parts. We use one
part for training (i.e., to fit a hypothesis), and the remaining part is called the
validation set and is used to test the generalization ability.
● That is, given a set of possible hypothesis classes Hi, for each we fit the best hi ∈ Hi
on the training set. Then, assuming large enough training and validation sets, the
hypothesis that is the most accurate on the validation set is the best one (the one that
has the best inductive bias). This process is called cross-validation.
● For example, to find the right order in polynomial regression, given a number of candidate polynomials of different orders (where polynomials of different orders correspond to the Hi), for each order we find the coefficients on the training set, calculate their errors on the validation set, and take the one that has the least validation error as the best polynomial (a small code sketch follows this list).
● In order to report the error, we should not use the validation error: we have used the validation set to choose the best model, and it has effectively become a part of the training set.
● We need a third set, a test set, sometimes also called the publication set, containing
examples not used in training or validation.
● We cannot keep on using the same training/validation split either, because after
having been used once, the validation set effectively becomes part of training data.
● If we have a fixed set which we divide for training, validation, and test, we will have different errors depending on how we do the division. These slight differences in error allow us to estimate how large a difference should be considered significant. That is, in choosing between two hypothesis classes Hi and Hj, we use them both multiple times on a number of training and validation sets and check whether the difference between the average errors of hi and hj is larger than the average difference between multiple runs of hi.
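Here is the polynomial-order example as a minimal numpy sketch (the data-generating function, seed, and split sizes are our own, purely illustrative): each candidate order is fit on the training split, and the order with the least validation error is chosen.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 2 * x**2 - x + 0.2 * rng.standard_normal(x.size)   # noisy quadratic

x_tr, y_tr = x[:40], y[:40]                    # training set
x_val, y_val = x[40:], y[40:]                  # validation set

def val_error(order):
    coeffs = np.polyfit(x_tr, y_tr, order)     # fit on the training set
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

best = min(range(1, 8), key=val_error)         # least validation error
print("chosen order:", best)                   # typically 2 here
```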
2.5 Dimensionality Reduction
➢ Explain the reasons to opt for Dimensionality reduction of input data as a
pre-processing step.
i. In most learning algorithms, the complexity depends on the number of input dimensions (d) as well as on the size of the data sample (N). In order to reduce memory and computation, we reduce the dimensionality of the problem. Decreasing d also decreases the complexity of the inference algorithm during testing.
ii. When data can be represented in a few dimensions without loss of information, it can be plotted and analyzed visually for structure and outliers.
iii. When an input is decided to be unnecessary, we save the cost of extracting it.
iv. Simpler models are more robust on small datasets. Simpler models have less variance,
that is, they vary less depending on the particulars of a sample, including noise,
outliers, and so forth.
v. When data can be explained with fewer features, we get a better idea about the
process that underlies the data and this allows knowledge extraction.
➢ Explain/Compare the two methods for reducing dimensionality: (refer 2.6 &
2.7 for detailed explanation). Ie – compare Feature selection and Feature
Extraction.
i. Feature Selection – Subset selection method
ii. Feature Extraction – Principal Component Analysis (PCA) method
In feature selection, we are interested in finding k of the d dimensions that give us the most information, and we discard the other (d − k) dimensions. Subset selection is a feature selection method.
In feature extraction, we are interested in finding a new set of k dimensions that are
combinations of the original d dimensions. These methods may be supervised or
unsupervised depending on whether or not they use the output information. The best known
and most widely used feature extraction methods are Principal Components Analysis (PCA)
and Linear Discriminant Analysis (LDA), which are both linear projection methods,
unsupervised and supervised respectively.
2.6 Subset Selection
➢ Explain Subset selection, and the 2 methods for subset selection
➢ Explain/Compare forward and backward subset selection methods
➢ What are the disadvantages of the forward subset selection method and how can we overcome them?
➢ Analyse the complexity of forward/backward subset selection procedure
● In subset selection, we are interested in finding the best subset of the set of features.
The best subset contains the least number of dimensions that most contribute to
accuracy. We discard the remaining, unimportant dimensions.
● Using a suitable error function (mean square error or misclassification error), subset selection can be used in both regression and classification problems.
● There are 2^d possible subsets of d variables, but we cannot test all of them unless d is small; instead we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
● There are two approaches to Subset Selection
i. Forward Selection
ii. Backward selection
Let us denote by F a feature set of input dimensions xi, i = 1, ..., d.
E(F) denotes the error incurred on the validation sample when only the inputs in F are used.
Depending on the application, the error is either the mean square error or misclassification
error.
FORWARD SELECTION - we start with no input variables and add them one by one, at
each step adding the one that decreases the error the most, until any further addition does not
decrease the error (or decreases it only slightly)
Checking of the error is done on a validation set distinct from the training set because we
want to test the generalization accuracy. With more features, generally we have lower
training error, but not necessarily lower validation error.
In sequential forward selection, the steps are
i. We start with no features: F = ∅.
ii. At each step, for all possible xi, we train our model on the training set and calculate
E(F ∪ xi ) on the validation set.
iii. Then, we choose the input xj that causes the least error,

j = argmin_i E(F ∪ xi)

and we add xj to F if E(F ∪ xj) < E(F).
iv. We stop when
● adding any feature does not decrease E.
● We may even decide to stop earlier, if the decrease in error is too small, using a user-defined threshold that depends on the application constraints.
Adding a new feature introduces the cost of observing the feature, as well as making the
classifier/regressor more complex.
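A sketch of sequential forward selection in Python. The interface is assumed, not prescribed by the text: error(F) should train the model using only the features in F and return its validation error (it must also handle the empty set, e.g., with a baseline predictor).

```python
def forward_selection(d, error, threshold=0.0):
    """Greedy forward selection over feature indices 0..d-1.
    `error(F)` is an assumed user-supplied function: train with the
    features in set F, return the validation error."""
    F = set()
    best_err = error(F)                   # error of the baseline (no features)
    while len(F) < d:
        # try adding each remaining feature; keep the best single addition
        err, j = min((error(F | {i}), i) for i in range(d) if i not in F)
        if best_err - err <= threshold:   # no (or too small) improvement: stop
            break
        F.add(j)
        best_err = err
    return F
```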
Complexity analysis of forward search - This process may be costly because to decrease the dimensions from d to k, we need to train and test the system d + (d−1) + (d−2) + ··· + (d−k) times, which is O(d^2).
Disadvantage with Forward Selection - Forward selection is a local search procedure and
does not guarantee finding the optimal subset, namely, the minimal subset causing the
smallest error. For example, xi and xj by themselves may not be good but together may
decrease the error a lot, but because this algorithm is greedy and adds attributes one by one, it
may not be able to detect this. It is possible to generalize and add multiple features at a time,
instead of a single one, at the expense of more computation.
Solution - We can backtrack and check which previously added feature can be removed after a current addition, thereby increasing the search space, but this increases complexity. In floating search methods, the number of added features and removed features can change at each step.
BACKWARD SELECTION - we start with all variables and remove them one by one, at
each step removing the one that decreases the error the most (or increases it only slightly),
until any further removal increases the error significantly.
In sequential backward selection, the steps are
i. We start with F containing all features.
ii. At each step, for all possible xi, we train our model on the training set with one attribute removed at a time and calculate E(F − xi) on the validation set.
iii. Then, we choose the input xj whose removal causes the least error,

j = argmin_i E(F − xi)

and we remove xj from F if E(F − xj) < E(F).
iv. We stop
● if removing a feature does not decrease the error. To decrease complexity, we may decide to remove a feature even if its removal causes only a slight increase in error.
All the variants possible for forward search are also possible for backward search.
Backward search has the same order of complexity as forward search, except that training a system with more features is more costly than training a system with fewer features; forward search may therefore be preferable, especially if we expect many useless features.
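Under the same assumed error(F) interface as the forward-selection sketch above, the backward variant starts from the full feature set and drops the feature whose removal hurts the least, as long as the increase in error stays within a small slack:

```python
def backward_selection(d, error, slack=0.0):
    """Greedy backward selection over feature indices 0..d-1, using the
    same assumed `error(F)` interface as forward_selection above."""
    F = set(range(d))
    current_err = error(F)
    while len(F) > 1:
        # find the feature whose removal gives the smallest error
        err, j = min((error(F - {i}), i) for i in F)
        if err > current_err + slack:  # every removal hurts too much: stop
            break
        F.remove(j)
        current_err = err
    return F
```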
Subset selection is a supervised method because the outputs are used by the regressor or classifier to calculate the error.
In an application like face recognition, feature selection is not a good method for
dimensionality reduction because individual pixels by themselves do not carry much
discriminative information; it is the combination of values of several pixels together that
carry information about the face identity. This is done by a feature extraction method like Principal Component Analysis.
2.7 Principal Component Analysis (PCA)
➢ Illustrate and Explain steps in PCA with example and its application
➢ How can we identify the number of principal components in a dataset
➢ Explain how the spectral decomposition property is utilized in PCA
➢ What do you mean by eigenfaces and eigendigits
In contrast to subset selection, which selects a subset of the original features (feature selection), PCA is a feature extraction method.
● PCA is a projection method – we are interested in finding a mapping from
the inputs in the original d-dimensional space to a new (k < d)-dimensional
space, with minimum loss of information
● Principal components analysis (PCA) is an unsupervised method because it
does not use the output information; the criterion to be maximized is the
variance.
● PCA can be done by eigenvalue decomposition of the covariance matrix of the feature dataset.
● We choose the direction along which the variance is maximum: the principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue.
● Each succeeding principal component has the highest variance under the constraint that it is orthogonal to the preceding components.
● PCA provides a mechanism to recognize the geometric similarity in data through algebraic means.
Steps in PCA
(Steps 1-3, worked out in the original figures: center the data by subtracting the mean, compute the covariance matrix S of the centered data, and find the eigenvalues and eigenvectors of S.)

The covariance matrix S is a symmetric matrix, and according to the Spectral Theorem (spectral decomposition), it has d real eigenvalues with orthogonal eigenvectors, satisfying

A v⃗i = λi v⃗i,  i = 1, ..., d

Here we call v⃗i an eigenvector, λi the corresponding eigenvalue, and A the covariance matrix (A = S).
Step 4: Inferring the principal components from the eigenvalues of the covariance matrix

From the Spectral Theorem we infer that the variance of the data along an eigenvector direction is given by the corresponding eigenvalue. The most significant principal component is the eigenvector corresponding to the largest eigenvalue; the next is the eigenvector with the second largest eigenvalue, and so on.

We know the proportion of variance explained by the first k components is

(λ1 + λ2 + ··· + λk) / (λ1 + λ2 + ··· + λd)

which helps decide how many components to keep.
Step 5: Projecting the data using the principal components

The projection matrix W is formed by taking the k (k < d) selected eigenvectors as its columns. The original dataset is transformed via the projection matrix to obtain its representation in a reduced k-dimensional subspace:

z = Wᵀ(x − m)

where m is the sample mean used to center the data.
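Putting the steps together, here is a minimal numpy sketch of PCA (centering, covariance matrix, eigendecomposition, projection onto the top-k eigenvectors); the data is random and purely illustrative:

```python
import numpy as np

def pca(X, k):
    """Project the N x d data matrix X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                # step 1: center the data
    S = np.cov(Xc, rowvar=False)           # step 2: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # step 3: eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]      # step 4: sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]              # projection matrix (d x k)
    return Xc @ W, W, eigvals[order]       # step 5: z = W^T (x - m)

X = np.random.default_rng(2).normal(size=(100, 5))
Z, W, lam = pca(X, 2)
print(Z.shape)                 # (100, 2): data in the reduced subspace
print(lam / lam.sum())         # proportion of variance per component
```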
Example: Eigenfaces and Eigendigits
Eigenfaces and eigendigits are the names given to sets of eigenvectors when they are used in the computer vision problems of human face recognition and handwritten digit recognition. The eigenfaces and eigendigits themselves form a basis set for the images used to construct the covariance matrix. This produces dimensionality reduction by allowing the smaller set of basis images to represent the original training images.
PCA additional references:
http://www.math.union.edu/~jaureguj/PCA.pdf
http://alexhwoods.com/eigenvalues/
http://www.inf.ed.ac.uk/teaching/courses/iaml/2011/slides/pca.pdf
Prepared by Abin Philip, Asst. Prof., Toc H.
Reference: Introduction to Machine Learning, 2nd edition, Ethem Alpaydin.