COMPUTER SCIENCE AND ENGINEERING
CST413 MACHINE LEARNING
CATEGORY: PEC    L: 2    T: 1    P: 0    CREDIT: 3    YEAR OF INTRODUCTION: 2019
Preamble: This course enables the learners to understand the advanced concepts and algorithms
in machine learning. The course covers the standard and most popular supervised learning
algorithms such as linear regression, logistic regression, decision trees, Bayesian learning and the
Naive Bayes algorithm, along with basic clustering algorithms and classifier performance measures.
This course helps the students to provide machine learning-based solutions to real-world problems.
Prerequisite: Basic understanding of probability theory, linear algebra and Python Programming
Course Outcomes: After the completion of the course the student will be able to
CO1 Illustrate Machine Learning concepts and basic parameter estimation methods.
(Cognitive Knowledge Level: Apply)
CO2 Demonstrate supervised learning concepts (regression, linear classification).
(Cognitive Knowledge Level: Apply)
CO3 Illustrate the concepts of Multilayer neural network and Support Vector Machine.
(Cognitive Knowledge Level: Apply)
CO4 Describe unsupervised learning concepts and dimensionality reduction techniques.
(Cognitive Knowledge Level: Apply)
CO5 Solve real life problems using appropriate machine learning models and evaluate the
performance measures (Cognitive Knowledge Level: Apply)
Mapping of course outcomes with program outcomes
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1
CO2
CO3
CO4
CO5
Abstract POs defined by National Board of Accreditation
PO# Broad PO PO# Broad PO
PO1 Engineering Knowledge PO7 Environment and Sustainability
PO2 Problem Analysis PO8 Ethics
PO3 Design/Development of solutions PO9 Individual and team work
PO4 Conduct investigations of complex problems PO10 Communication
PO5 Modern tool usage PO11 Project Management and Finance
PO6 The Engineer and Society PO12 Life long learning
Assessment Pattern
Bloom's Category      Continuous Assessment Tests           End Semester Examination
                      Test 1 (%)        Test 2 (%)          Marks (%)
Remember              30                30                  30
Understand            30                30                  30
Apply                 40                40                  40
Analyze
Evaluate
Create
Mark Distribution
Total Marks: 150    CIE Marks: 50    ESE Marks: 100    ESE Duration: 3 hours
Continuous Internal Evaluation Pattern:
Attendance: 10 marks
Continuous Assessment Tests (Average of Internal Tests 1 & 2): 25 marks
Continuous Assessment Assignment: 15 marks
Internal Examination Pattern
Each of the two internal examinations has to be conducted out of 50 marks. The first series test shall
preferably be conducted after completing the first half of the syllabus, and the second series test
shall preferably be conducted after completing the remaining part of the syllabus. There will be two
parts: Part A and Part B. Part A contains 5 questions (preferably, 2 questions each from the
completed modules and 1 question from the partly completed module), each carrying 3 marks and
adding up to 15 marks for Part A. Students should answer all questions from Part A. Part B contains
7 questions (preferably, 3 questions each from the completed modules and 1 question from the
partly completed module), each carrying 7 marks. Out of the 7 questions, a student should answer
any 5.
End Semester Examination Pattern:
There will be two parts: Part A and Part B. Part A contains 10 questions with 2 questions from
each module, each carrying 3 marks. Students should answer all questions. Part B contains 2 full
questions from each module, of which the student should answer any one. Each question can have
a maximum of 2 sub-divisions and carries 14 marks.
Syllabus
Module-1 (Overview of machine learning)
Machine learning paradigms - supervised, semi-supervised, unsupervised and reinforcement learning.
Basics of parameter estimation - maximum likelihood estimation (MLE) and maximum a posteriori
estimation (MAP). Introduction to Bayesian formulation.
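To make the parameter-estimation ideas concrete, the following Python fragment is an illustrative sketch only (not part of the prescribed syllabus text): it contrasts the MLE and MAP estimates of a Bernoulli parameter, with the coin-toss counts and the Beta(2, 2) prior chosen purely as assumptions for this example.

# Illustrative sketch: MLE vs. MAP for a Bernoulli parameter theta.
# The counts (62 heads in 100 tosses) and the Beta(2, 2) prior are assumed
# purely for illustration.
heads, tosses = 62, 100
alpha, beta = 2.0, 2.0          # Beta prior hyperparameters (assumed)

# MLE: theta that maximises the likelihood = sample proportion of heads.
theta_mle = heads / tosses

# MAP with a Beta(alpha, beta) prior: mode of the Beta posterior.
theta_map = (heads + alpha - 1) / (tosses + alpha + beta - 2)

print(f"MLE estimate: {theta_mle:.3f}")   # 0.620
print(f"MAP estimate: {theta_map:.3f}")   # 0.618, pulled slightly toward the prior mean 0.5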
Module-2 (Supervised Learning)
Regression - Linear regression with one variable, Linear regression with multiple variables,
solution using gradient descent algorithm and matrix method, basic idea of overfitting in
regression. Linear Methods for Classification- Logistic regression, Naive Bayes, Decision tree
algorithm ID3.
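As an illustrative sketch of the two solution methods named above (gradient descent and the matrix/normal-equation method), the following Python fragment fits a one-variable linear regression; the synthetic data, learning rate and iteration count are assumptions made only for this example.

# Illustrative sketch: linear regression solved by gradient descent and by the
# normal-equation (matrix) method. The toy data are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))              # one input variable
y = 3.0 * X[:, 0] + 4.0 + rng.normal(0, 1, 50)    # true line y = 3x + 4 plus noise

Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend a bias column

# Matrix (normal equation) solution: theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y

# Batch gradient descent on the mean squared error cost
theta = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = Xb.T @ (Xb @ theta - y) / len(y)
    theta -= lr * grad

print("normal equation :", theta_closed)   # approximately [4, 3]
print("gradient descent:", theta)          # approximately [4, 3]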
Module-3 (Neural Networks (NN) and Support Vector Machines (SVM))
Perceptron, Neural Network - Multilayer feed forward network, Activation functions (Sigmoid,
ReLU, Tanh), Backpropagation algorithm.
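The following Python fragment is a minimal, illustrative sketch of one forward pass and one backpropagation update for a tiny 2-3-1 feed-forward network with sigmoid activations; the weights, input, target and learning rate are assumed for the example and are not from the syllabus.

# Illustrative sketch: one forward and one backpropagation step for a tiny
# 2-3-1 feed-forward network with sigmoid activations (assumed values).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])                               # single training input
t = 1.0                                                 # target output
W1 = np.array([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]])   # hidden-layer weights (3x2)
b1 = np.zeros(3)
W2 = np.array([0.2, -0.3, 0.1])                         # output-layer weights
b2 = 0.0
lr = 0.5

# Forward pass
h = sigmoid(W1 @ x + b1)            # hidden activations
y = sigmoid(W2 @ h + b2)            # network output

# Backward pass for squared error E = 0.5 * (y - t)^2
delta_out = (y - t) * y * (1 - y)           # output-layer error term
delta_hid = delta_out * W2 * h * (1 - h)    # hidden-layer error terms

# Gradient-descent weight updates
W2 -= lr * delta_out * h
b2 -= lr * delta_out
W1 -= lr * np.outer(delta_hid, x)
b1 -= lr * delta_hid

print("output before update:", y)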
SVM - Introduction, Maximum Margin Classification, Mathematics behind Maximum Margin
Classification, Maximum Margin linear separators, soft margin SVM classifier, non-linear SVM,
Kernels for learning non-linear functions, polynomial kernel, Radial Basis Function (RBF) kernel.
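As an illustrative sketch of soft-margin and kernel SVMs, the following Python fragment trains linear, polynomial and RBF kernel classifiers on a toy non-linearly-separable dataset; the use of scikit-learn and the particular dataset and parameter values are assumptions made only for this example.

# Illustrative sketch: soft-margin SVMs with linear, polynomial and RBF kernels
# on a toy non-linearly-separable dataset (scikit-learn assumed).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}), ("poly", {"degree": 3}), ("rbf", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, C=1.0, **params)     # C controls the soft margin
    clf.fit(X_tr, y_tr)
    print(f"{kernel:6s} kernel  test accuracy = {clf.score(X_te, y_te):.2f}")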
Module-4 (Unsupervised Learning)
Clustering - Similarity measures, Hierarchical Agglomerative Clustering, K-means partitional
clustering, Expectation maximization (EM) for soft clustering. Dimensionality reduction –
Principal Component Analysis.
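The following Python fragment is an illustrative sketch of k-means clustering followed by PCA-based dimensionality reduction; scikit-learn and the synthetic data are assumptions made only for this example.

# Illustrative sketch: k-means clustering and PCA on synthetic data (assumed).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])

pca = PCA(n_components=2).fit(X)
print("variance explained by first two principal components:",
      pca.explained_variance_ratio_.sum())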
Module-5 (Classification Assessment)
Classification Performance measures - Precision, Recall, Accuracy, F-Measure, Receiver
Operating Characteristic (ROC) curve, Area Under Curve (AUC). Bootstrapping, Cross Validation,
Ensemble methods, Bias-Variance decomposition. Case Study: Develop a classifier for face
detection.
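As an illustrative sketch of the performance measures listed above, the following Python fragment evaluates a simple binary classifier; scikit-learn, the toy dataset and the choice of logistic regression are assumptions made only for this example.

# Illustrative sketch: common classification performance measures (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
score = clf.predict_proba(X_te)[:, 1]      # scores needed for the ROC/AUC

print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("F-measure:", f1_score(y_te, pred))
print("ROC AUC  :", roc_auc_score(y_te, score))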
Text Books
1. Ethem Alpaydin, Introduction to Machine Learning, 2nd edition, MIT Press 2010.
2. Mohammed J. Zaki and Wagner Meira, Data Mining and Analysis: Fundamental Concepts
and Algorithms, Cambridge University Press, First South Asia edition, 2016.
3. Jake VanderPlas, Python Data Science Handbook, O'Reilly Media, 2016
4. Tom Mitchell, Machine Learning, McGraw-Hill, 1997.
Reference Books
1. Christopher Bishop, Neural Networks for Pattern Recognition, Oxford University Press,
1995.
2. Kevin P. Murphy. Machine Learning: A Probabilistic Perspective, MIT Press 2012.
3. Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning,
Second edition, Springer, 2007.
4. P. Langley, Elements of Machine Learning, Morgan Kaufmann, 1995.
5. Richert and Coelho, Building Machine Learning Systems with Python.
6. Davy Cielen, Arno DB Meysman and Mohamed Ali, Introducing Data Science: Big Data,
Machine Learning, and More, Using Python Tools, Dreamtech Press, 2016.
Course Level Assessment Questions
Course Outcome1 (CO1):
1. A coin is tossed 100 times and lands heads 62 times. What is the maximum
likelihood estimate for θ, the probability of heads? (A numerical check of this
estimate is sketched after this list.)
2. Suppose data x1, ..., xn are independent and identically distributed samples drawn
from an exponential distribution Exp(λ). Find the maximum likelihood estimate for λ.
3. Suppose x1, ..., xn are independent and identically distributed (iid) samples from a
distribution with a given density. Find the maximum likelihood estimate (MLE) for θ.
4. Find the maximum likelihood estimator (MLE) and maximum a posteriori (MAP)
estimator for the mean of a univariate normal distribution. Assume that we have N
samples, x1, ..., xN, independently drawn from a normal distribution with known
variance σ² and unknown mean µ, and that the prior distribution for the mean is itself
a normal distribution with mean ν and variance β². What happens to the MLE and
MAP estimators as the number of samples goes to infinity?
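The following Python fragment is the numerical check referenced in Question 1 above; it is a minimal illustrative sketch, not part of the syllabus, and simply verifies that the Bernoulli log-likelihood of 62 heads in 100 tosses peaks at θ = 62/100.

# Illustrative numerical check for Question 1: the Bernoulli log-likelihood of
# 62 heads in 100 tosses is maximised at theta = 62/100.
import numpy as np

heads, tosses = 62, 100
thetas = np.linspace(0.001, 0.999, 999)
log_lik = heads * np.log(thetas) + (tosses - heads) * np.log(1 - thetas)

print("grid-search MLE:", thetas[np.argmax(log_lik)])   # approximately 0.62
print("analytic MLE   :", heads / tosses)               # 0.62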
Course Outcome 2(CO2):
1. Suppose that you are asked to perform linear regression to learn the function that
outputs y, given the D-dimensional input x. You are given N independent data
points, and all D attributes are linearly independent. Assuming that D is
around 100, would you prefer the closed-form solution or gradient descent to
estimate the regressor?
2. Suppose you have a three-class problem where the class label y ∈ {0, 1, 2} and each
training example X has 3 binary attributes X1, X2, X3 ∈ {0, 1}. How many parameters
(probability distribution) do you need to know to classify an example using the
Naive Bayes classifier?
Course Outcome 3(CO3):
1. What are support vectors? List any three properties of the support vector
classifier solution.
2. Why do you use kernels to model a projection from attributes into a feature space,
instead of simply projecting the dataset directly?
3. Describe how Support Vector Machines can be extended to make use of kernels.
Illustrate with reference to the Gaussian kernel K(x, y) = exp(−z), where z = (x − y)².
4. Briefly explain one way in which using tanh instead of logistic activations makes
optimization easier.
5. ReLU activation functions are more commonly used in neural networks than the tanh
activation function. Draw both activation functions and give a) an advantage of the
ReLU function compared to the tanh function, and b) a disadvantage of the ReLU
function compared to the tanh function.
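As an illustrative aid for Question 5, the following Python fragment plots tanh and ReLU so their shapes (the saturation of tanh, and the zero gradient of ReLU for negative inputs) can be compared; matplotlib usage and the plotting range are assumptions for this example.

# Illustrative sketch for Question 5: plot tanh and ReLU for comparison.
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-5, 5, 200)
plt.plot(z, np.tanh(z), label="tanh(z)")
plt.plot(z, np.maximum(0, z), label="ReLU(z)")
plt.axhline(0, color="grey", linewidth=0.5)
plt.xlabel("z")
plt.ylabel("activation")
plt.legend()
plt.show()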
Course Outcome 4 (CO4):
1. Which similarity measure could be used to compare feature vectors of two images? Justify
your answer.
2. Illustrate the strength and weakness of k-means algorithm.
3. Suppose you want to cluster eight given points using k-means.
Assume that k = 3 and that initially the points are assigned to clusters as follows:
C1 = {x1, x2, x3}, C2 = {x4, x5, x6}, C3 = {x7, x8}. Apply the k-means algorithm until
convergence, using the Manhattan distance.
4. Cluster the following eight points representing locations into three clusters: A1(2, 10),
A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as D(a, b)
= |x2 – x1| + |y2 – y1|
Use the k-means algorithm to find the three cluster centers after the second iteration.
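The following Python fragment is an illustrative sketch of the computation asked for in Question 4: it runs two k-means iterations, using the Manhattan distance for the assignment step and the coordinate-wise mean for the centre update (the usual textbook convention is assumed here).

# Illustrative sketch for Question 4: two k-means iterations with Manhattan
# distance for assignment and the coordinate-wise mean for the centre update.
import numpy as np

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)],
                  dtype=float)                                 # A1 ... A8
centres = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)     # initial centres A1, A4, A7

for it in range(1, 3):
    # assign each point to the centre with the smallest Manhattan distance
    dists = np.abs(points[:, None, :] - centres[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # recompute each centre as the mean of its assigned points
    centres = np.array([points[labels == c].mean(axis=0) for c in range(3)])
    print(f"centres after iteration {it}:\n{centres}")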
Course Outcome 5(CO5):
1. What is ensemble learning? Can ensemble learning using linear classifiers learn
classification of linearly non-separable sets?
2. Describe boosting. What is the relation between boosting and ensemble learning?
3. Classifier A attains 100% accuracy on the training set and 70% accuracy on the test set.
Classifier B attains 70% accuracy on the training set and 75% accuracy on the test set.
Which one is the better classifier? Justify your answer.
4. What are ROC space and ROC curve in machine learning? In ROC space, which points
correspond to perfect prediction, always positive prediction and always negative
prediction? Why?
5. Suppose there are three classifiers A, B and C. The (FPR, TPR) measures of the three
classifiers are as follows: A (0, 1), B (1, 1), C (1, 0.5). Which of these can be considered a
perfect classifier? Justify your answer.
Model Question Paper
QP CODE:
Reg No: _______________
Name: _________________ PAGES : 4
APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY
SEVENTH SEMESTER B.TECH DEGREE EXAMINATION, MONTH & YEAR
Course Code: CST413
Course Name: Machine Learning
Max. Marks : 100 Duration: 3 Hours
PART A
Answer All Questions. Each Question Carries 3 Marks
1. Define supervised learning. Name special cases of supervised learning depending
on whether the inputs/outputs are categorical or continuous.
2. Differentiate between Maximum Likelihood estimation (MLE) and Maximum a
Posteriori (MAP) estimation.
3. What is overfitting and why is it a problem?
4. Specify the basic principle of gradient descent algorithm.
5. Suppose that you have a linear support vector machine(SVM) binary classifier.
Consider a point that is currently classified correctly, and is far away from the
decision boundary. If you remove the point from the training set, and re-train the
classifier, will the decision boundary change or stay the same? Justify your answer.
6. What is the primary motivation for using the kernel trick in machine learning
algorithms?
7. Expectation maximization (EM) is designed to find a maximum likelihood setting
of the parameters of a model when some of the data is missing. Does the algorithm
converge? If so, do you obtain a locally or globally optimal set of parameters?
8. Illustrate the strength and weakness of k-means algorithm.
9. Classifier A attains 100% accuracy on the training set and 70% accuracy on the
test set. Classifier B attains 70% accuracy on the training set and 75% accuracy on
the test set. Which one is the better classifier? Justify your answer.
10. How does the bias-variance trade-off affect machine learning algorithms?
(10x3=30)
Part B
(Answer any one question from each module. Each question carries 14 Marks)
11. (a) Suppose that X is a discrete random variable with the given probability mass
function, where 0 ≤ θ ≤ 1 is a parameter. The following 10 independent observations
were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the
maximum likelihood estimate of θ? (7)
(b) Suppose you have a three-class problem where the class label y ∈ {0, 1, 2}
and each training example X has 3 binary attributes X1, X2, X3 ∈ {0, 1}.
How many parameters (probability distribution) do you need to know
to classify an example using the Naive Bayes classifier? (7)
OR
12. (a) Consider the geometric distribution, which has p.m.f. P(X = k) = (1 − θ)^(k−1) θ.
Assume that n i.i.d. data points are drawn from that distribution. (7)
i. Write an expression for the log-likelihood of the data as a function of the
parameter θ.
ii. Find the maximum likelihood estimate for θ.
iii. Suppose θ has a Beta prior distribution. What is the posterior distribution of θ?
(b) Find the maximum a posteriori (MAP) estimator for the mean of a univariate
normal distribution. Assume that we have N samples, x1, ..., xN, independently
drawn from a normal distribution with known variance σ² and unknown mean
µ, and that the prior distribution for the mean is itself a normal distribution with
mean ν and variance β². (7)
13. (a) Consider the hypothesis for linear regression hθ(x) = θ0 + θ1x, and the cost
function J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} (hθ(x(i)) − y(i))², where m is the number of
training examples. Given the following set of training examples, answer the
following questions: (7)
1) Find the value of hθ(2) if θ0 = 0 and θ1 = 1.5.
2) Find the value of J(0, 1).
3) Suppose the value of J(θ0, θ1) = 0. What can be inferred from this?
(b) Assume we have a classification problem involving 3 classes: professors,
students, and staff members. There are 750 students, 150 staff members
and 100 professors. All professors have blond hair, 50 staff members have
blond hair, and 250 students have blond hair. Compute the information
gain of the test "hair color = blond" that returns true or false. (3)
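The following Python fragment is an illustrative sketch of the computation asked for in part (b): it evaluates the class entropy before and after the "hair color = blond" split, using the class counts given in the question.

# Illustrative sketch for part (b): information gain of the test "hair color = blond".
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

H_parent = entropy([100, 750, 150])       # professors, students, staff (1000 people)
H_blond = entropy([100, 250, 50])         # the 400 blond people
H_not_blond = entropy([0, 500, 100])      # the 600 non-blond people

gain = H_parent - (400 / 1000) * H_blond - (600 / 1000) * H_not_blond
print(f"information gain = {gain:.3f} bits")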
(c) Explain the significance of regularization. How does Ridge regularization differ
from Lasso regularization? (4)
OR
14. (a) The following dataset can be used to train a classifier that determines whether
a given person is likely to own a car or not. There are three features: education
level (primary, secondary, or university); residence (city or country); gender
(female, male). Use the ID3 algorithm to find the best attribute at the root level
of the tree. (7)
(b) Consider a linear regression problem y = w1x + w0, with a training set having
m examples (x1, y1), ..., (xm, ym). Suppose that we wish to minimize the mean
5th-degree error (loss function) given by (1/m) Σ_{i=1}^{m} (yi − w1xi − w0)⁵. (7)
1. Calculate the gradient with respect to the parameter w1.
2. Write down pseudo-code for on-line gradient descent on w1.
3. Give one reason in favor of on-line gradient descent compared to batch
gradient descent, and one reason in favor of batch over on-line.
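As an illustrative check for part 1 above, the following Python fragment compares the analytic gradient of the 5th-degree loss with a finite-difference approximation; the toy data and parameter values are assumptions for this example. The same per-example gradient, applied one example at a time, gives the on-line update asked for in part 2.

# Illustrative check for part 1: the gradient of (1/m) * sum_i (y_i - w1*x_i - w0)^5
# with respect to w1 is -(5/m) * sum_i x_i * (y_i - w1*x_i - w0)^4.
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0])    # assumed toy data
y = np.array([1.2, 2.9, 5.1, 7.0])
w0, w1 = 1.0, 1.5                     # assumed parameter values

def loss(w1_val):
    return np.mean((y - w1_val * x - w0) ** 5)

analytic = -5.0 / len(x) * np.sum(x * (y - w1 * x - w0) ** 4)
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)

print("analytic gradient:", analytic)
print("numeric gradient :", numeric)   # the two values should agree closely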
15. (a) Consider a support vector machine whose input space is 2-D, and in which the
inner products are computed by means of the kernel K(x, y) = (x · y + 1)² − 1, where
x · y denotes the ordinary inner product. Show that the mapping to feature
space that is implicitly defined by this kernel is the 5-D mapping given in the question. (8)
(b) Consider a neuron with four inputs, where the weights of the edges connecting
the inputs are 1, 2, 3 and 4. Let the bias of the node be zero and the inputs be
2, 3, 1 and 4. If the activation function is linear, f(x) = 2x, compute the output
of the neuron. (3)
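The following Python fragment is an illustrative sketch of the computation asked for in part (b), using the weights, inputs, bias and linear activation f(x) = 2x given in the question.

# Illustrative sketch for part (b): output of a single neuron.
import numpy as np

weights = np.array([1, 2, 3, 4])
inputs = np.array([2, 3, 1, 4])
bias = 0

net = weights @ inputs + bias      # 1*2 + 2*3 + 3*1 + 4*4 = 27
output = 2 * net                   # linear activation f(x) = 2x  ->  54
print("neuron output:", output)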
(c) Compare the ReLU and Sigmoid activation functions. (3)
OR
16. (a) State the mathematical formulation that expresses the soft-margin SVM as a
constrained optimization problem. (10)
(b) What is the basic idea of the backpropagation algorithm? (4)
17. (a) Suppose that we have the following data (one variable). Use single-linkage
agglomerative clustering to identify the clusters. (8)
Data: (2, 5, 9, 15, 16, 18, 25, 33, 33, 45).
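The following Python fragment is an illustrative sketch for part (a): it applies single-linkage agglomerative clustering to the given one-dimensional data using SciPy; cutting the hierarchy into three clusters is an assumption made only for the printout.

# Illustrative sketch for part (a): single-linkage agglomerative clustering (SciPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([2, 5, 9, 15, 16, 18, 25, 33, 33, 45], dtype=float).reshape(-1, 1)
Z = linkage(data, method="single")                      # single-linkage merge history
labels = fcluster(Z, t=3, criterion="maxclust")         # cut into three clusters (assumed)
print("cluster labels:", labels)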
(b) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): (6)
(i) Compute the Euclidean distance between the two objects.
(ii) Compute the Manhattan distance between the two objects.
(iii) Compute the Minkowski distance between the two objects, using p = 3.
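The following Python fragment is an illustrative sketch of the three distance computations asked for in part (b).

# Illustrative sketch for part (b): Euclidean, Manhattan and Minkowski (p = 3) distances.
import numpy as np

a = np.array([22, 1, 42, 10], dtype=float)
b = np.array([20, 0, 36, 8], dtype=float)

euclidean = np.sqrt(np.sum((a - b) ** 2))           # sqrt(4 + 1 + 36 + 4)
manhattan = np.sum(np.abs(a - b))                   # 2 + 1 + 6 + 2
minkowski3 = np.sum(np.abs(a - b) ** 3) ** (1 / 3)

print(f"Euclidean  : {euclidean:.3f}")    # approximately 6.708
print(f"Manhattan  : {manhattan:.0f}")    # 11
print(f"Minkowski 3: {minkowski3:.3f}")   # approximately 6.153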
OR
18. (a) Suppose that we have the following data: (8)
(2, 0), (1, 2), (2, 2), (3, 2), (2, 3), (3, 3), (2, 4), (3, 4), (4, 4), (3, 5)
Identify the clusters by applying the k-means algorithm with k = 2. Try using
initial cluster centers that are as far apart as possible.
(b) Describe the EM algorithm for Gaussian mixtures. (8)
19. (a) Suppose a dataset of 10000 images contains 9700 cancer-free images and 300
images from cancer patients, and a classifier produces the confusion matrix
below. Find the precision, recall and accuracy. Is it a good classifier? Justify
your answer. (7)

Actual class \ Predicted class    cancer = yes    cancer = no    Total
cancer = yes                      90              210            300
cancer = no                       140             9560           9700
Total                             230             9770           10000
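The following Python fragment is an illustrative sketch of the computation asked for in part (a), reading the counts directly from the confusion matrix above (positive class: cancer = yes).

# Illustrative sketch for part (a): precision, recall and accuracy from the confusion matrix.
TP, FN = 90, 210       # actual cancer = yes
FP, TN = 140, 9560     # actual cancer = no

precision = TP / (TP + FP)                   # 90 / 230
recall = TP / (TP + FN)                      # 90 / 300
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 9650 / 10000

print(f"precision = {precision:.3f}")        # 0.391
print(f"recall    = {recall:.3f}")           # 0.300
print(f"accuracy  = {accuracy:.3f}")         # 0.965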
(b) What is Principal Component Analysis (PCA)? Which eigenvalue indicates
the direction of largest variance? (7)
OR
20. (a) Assume you have a model with a high bias and a low variance. What are the
characteristics of such a model? (6)
(b) What are ROC space and the ROC curve in machine learning? In ROC space,
which points correspond to perfect prediction, always-positive prediction and
always-negative prediction? Why? (8)
Teaching Plan
No.    Contents    No. of Lecture Hours (37 hrs)
Module -1 (Overview of machine learning) (7 hours)
1.1 Supervised, semi-supervised, unsupervised learning, reinforcement learning (Text Book (TB) 1: Chapter 1)    1 hour
1.2 Maximum likelihood estimation(MLE) (TB 1: Section 4.2) 1 hour
1.3 Maximum likelihood estimation (MLE)- example (TB 1: Section 4.2) 1 hour
1.4 Maximum a posteriori estimation(MAP) (TB 4: Section 6.2) 1 hour
1.5 Maximum a posteriori estimation(MAP)-example (TB 4: Section 6.2) 1 hour
1.6 Bayesian formulation (TB 1: Section 14.1, 14.2) 1 hour
1.7 Bayesian formulation -example (TB 1: Section 14.1, 14.2) 1 hour
Module-2 (Supervised Learning) (7 hours)
2.1 Linear regression with one variable (TB 1: Section 2.6) 1 hour
2.2 Multiple variables, solution using gradient descent algorithm and matrix method (No derivation required) (TB 1: Section 5.8)    1 hour
2.3 Overfitting in regression, Lasso and Ridge regularization 1 hour
2.4 Logistic regression 1 hour
2.5 Naive Bayes (TB 2: Section 18.2) 1 hour
2.6 Decision trees (TB 2: Chapter 19) 1 hour
2.7 Decision trees- ID3 algorithm (TB 2: Chapter 19) 1 hour
Module-3 (Neural Networks and Support Vector Machines) (9 hours)
3.1 Perceptron, Perceptron Learning 1 hour
3.2 Multilayer Feed forward Network, Activation Functions (Sigmoid, ReLU, Tanh)    1 hour
3.3 Back Propagation Algorithm 1 hour
3.4 Illustrative Example for Back Propagation 1 hour
3.5 Introduction, Maximum Margin Hyperplane, 1 hour
3.6 Mathematics behind Maximum Margin Classification 1 hour
3.7 Formulation of maximum margin hyperplane and solution 1 hour
3.8 Soft margin SVM, Solution of Soft margin SVM 1 hour
3.9 Non-linear SVM, Kernels for learning non-linear functions, Examples - Linear, RBF, Polynomial    1 hour
Module-4 (Unsupervised Learning) (7 hours)
4.1 Similarity measures - Minkowski distance measures (Manhattan, Euclidean), Cosine Similarity    1 hour
4.2 Clustering - Hierarchical Clustering (TB 2: Chapter 14) 1 hour
4.3 K-means partitional clustering (TB 2: Chapter 13) 1 hour
4.4 Expectation maximization (EM) for soft clustering (TB 2: Chapter 13) 1 hour
4.5 Expectation maximization (EM) for soft clustering (TB 2: Chapter 13) 1 hour
4.6 Dimensionality reduction - Principal Component Analysis (TB 1: Section 6.3)    1 hour
4.7 Dimensionality reduction - Principal Component Analysis (TB 1: Section 6.3)    1 hour
Module-5 (Classification Assessment) (7 hours)
5.1 Performance measures - Precision, Recall, Accuracy, F-Measure, ROC, AUC (TB 2: Chapter 22.1)    1 hour
5.2 Bootstrapping, Cross validation    1 hour
5.3 Ensemble methods- bagging, boosting 1 hour
5.4 Bias-Variance decomposition (TB 2: Chapter 22.3) 1 hour
5.5 Bias-Variance decomposition (TB 2: Chapter 22.3) 1 hour
5.6 Face detection (TB 3: Chapter 5, Section "Application: A Face Detection Pipeline")    1 hour
5.7 Face detection (TB 3: Chapter 5, Section "Application: A Face Detection Pipeline")    1 hour