DR.NNCE II & III YR / II & IV SEM AIML QB
UNIT III SUPERVISED LEARNING
SYLLABUS:
Introduction to machine learning – Linear Regression Models: Least
squares, single & multiple variables, Bayesian linear regression, gradient
descent, Linear Classification Models: Discriminant function –
Probabilistic discriminative model - Logistic regression, Probabilistic
generative model – Naive Bayes, Maximum margin classifier – Support
vector machine, Decision Tree, Random forests.
2 – MARKS
1. Define Machine Learning & mention the various classifications of Machine
Learning [N/D-23] [A/M-23]
2. How are algorithms categorized based on the required output?
3. State the logic behind Gaussian processes. [N/D-23]
4. What is Least Squares Regression Line & Narrate its Equation
5. What is error or residual?
6. Define Bayesian Regression. List the Advantages and Disadvantages
7. What are the two types of Classifications problem?
8. List the different Types of ML Classification Algorithms.
9. How can overfitting be avoided? [A/M-24]
10. What is Discriminant function?
11. Define Probabilistic discriminative models.
12. Define Probabilistic Generative model.
13. Define Discriminative model & mention the algorithms in Discriminative models.
14. Define SVM Kernel.
15. Define random forests [A/M-23]
16 – MARKS
1. Define Machine Learning. Give an introduction to Machine Learning.
2. Explain in detail about Linear Regression Models. Or Explain Linear Regression
Models: Least squares, single & multiple regression, Bayesian linear regression,
gradient descent. [A/M-24]
3. Explain in detail about Linear Classification Models – Discriminant function.
4. Explain in detail about Linear Discriminant Functions and its types. Also elaborate
about logistic regression in detail. [A/M-23]
5. Elaborate in detail about Probabilistic Generative model and Naïve Bayes.
6. Elaborate in detail about Support Vector Machine (SVM). [A/M-24]
7. Elaborate in detail about Decision Tree in Supervised Learning.
8. Elaborate in detail about Random Forest in Supervised Learning. [N/D-23] [A/M-24]
UNIT III SUPERVISED LEARNING
2 – MARKS
1. Define Machine Learning & mention the various classifications of Machine
Learning [N/D-23] [A/M-23]
o Machine learning is defined (by Arthur Samuel) as “the field of study that gives computers
the ability to learn without being explicitly programmed”.
o Machine learning is programming computers to optimize a performance criterion using
example data or past experience. The model may be predictive to make predictions in
the future, or descriptive to gain knowledge from data.
o Depending on the nature of the learning “signal” or “response” available to a learning
system, machine learning is classified into the following categories:
▪ Supervised learning
▪ Unsupervised learning
▪ Reinforcement learning
▪ Semi-supervised learning
2. How are algorithms categorized based on the required output?
a. Classification
b. Regression
c. Clustering
3. State the logic behind Gaussian processes. [N/D-23]
o A Gaussian Process defines a distribution over functions, meaning instead of predicting a
single output, it predicts a range of possible outputs with associated uncertainty.
o Gaussian Processes model not just a function, but a whole space of possible functions,
allowing them to make predictions with uncertainty—especially useful when data is sparse or
noisy.
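A minimal sketch of this idea, assuming scikit-learn's GaussianProcessRegressor (the toy data below are illustrative, not from the syllabus):
# Gaussian Process sketch: predicts a mean AND an uncertainty
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few noisy 1-D observations of sin(x) (hypothetical data)
X = np.array([[1.0], [3.0], [5.0], [6.0]])
y = np.sin(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X, y)

# return_std=True yields the standard deviation (uncertainty) of each prediction
mean, std = gp.predict(np.array([[2.0], [4.0]]), return_std=True)
print(mean, std)   # uncertainty grows away from the training points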
4. What is Least Squares Regression Line & Narrate its Equation
o Least squares are a commonly used method in regression analysis for estimating the
unknown parameters by creating a model which will minimize the sum of squared
errors between the observed data and the predicted data.
o The equation that minimizes the total of all squared prediction errors for known Y
scores in the original correlation analysis is
Y′ = bX + a
where Y′ represents the predicted value, X represents the known value, and b and a
represent numbers calculated from the original correlation analysis.
5. What is error or residual?
o Error is the difference between the actual value and Predicted value and the goal is
to reduce this difference.
• The vertical distance between the data point and the regression line is known as error or
residual.
o Each data point has one residual and the sum of all the differences is known as the
Sum of Residuals/Errors.
6. Define Bayesian Regression. List the Advantages and Disadvantages
o Bayesian Regression is used when the data is insufficient in the dataset or the data
is poorly distributed.
o The output of a Bayesian Regression model is obtained from a probability
distribution.
o The aim of Bayesian Linear Regression is to find the ‘posterior’ distribution for the
model parameters.
o The expression for the posterior is:
Posterior = (Likelihood × Prior) / Normalization, i.e. P(H | E) = P(E | H) · P(H) / P(E)
where
i. Posterior: the probability of an event H occurring, given that another event E has
already occurred, i.e., P(H | E).
ii. Prior: the probability of event H before observing E, i.e., P(H).
iii. Likelihood: the likelihood function, in which some parameter variable is marginalized.
Advantages of Bayesian Regression:
o Very effective when the size of the dataset is small.
o The Bayesian approach is a tried and tested approach and is very robust,
mathematically.
Disadvantages of Bayesian Regression:
o The inference of the model can be time-consuming.
o If there is a large amount of data available for our dataset, the Bayesian approach is
not worth it.
7. What are the two types of Classifications problem?
Binary representation or Binary Classifier:
o If the classification problem has only two possible outcomes, then it is called as
Binary Classifier.
o There is a single target variable t ∈ {0, 1}
1. t = 1 represents class C1
2. t = 0 represents class C2
o Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Problems:
o If a classification problem has more than two outcomes, then it is called as Multi-
class Classifier.
o Example: Classifications of types of crops, Classification of types of music.
8. List the different Types of ML Classification Algorithms.
o Logistic Regression
o K-Nearest Neighbors
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
9. How can overfitting be avoided? [A/M-24]
• Cross-Validation
• Regularization
• Simpler Models
• Pruning (for trees)
• Early Stopping
• More Training Data
• Data Augmentation
• Dropout (for neural networks)
10. What is Discriminant function?
o A function of a set of variables that is evaluated for samples of events or objects and
used as an aid in discriminating between or classifying them.
o A discriminant function (DF) maps independent (discriminating) variables into a
latent variable D.
o DF is usually postulated to be a linear function: D = a0 + a1x1 + a2x2 + ... + aNxN
11. Define Probabilistic discriminative models.
o Discriminative models are a class of supervised machine learning models which
make predictions by estimating conditional probability P(y|x).
o For the two-class classification problem, the posterior probability of class C1 can be
written as a logistic sigmoid acting on a linear function of x:
p(C1|x) = σ(wTx + w0), where σ(a) = 1/(1 + e^(−a)).
o For the multi-class case, the posterior probability of class Ck is given by a softmax
transformation of a linear function of x:
p(Ck|x) = exp(ak) / Σj exp(aj), where ak = wkTx + wk0.
12. Define Probabilistic Generative model.
o A generative model is a statistical model of the joint probability distribution P(X, Y) on a
given observable variable X and target variable Y.
o Given a generative model of the conditional probability P(X|Y), together with estimated
probability distributions P(X) and P(Y), the conditional probability of the target can be
estimated using Bayes' rule:
P(Y|X) = P(X|Y) · P(Y) / P(X)
13. Define Discriminative model & mention the algorithms in Discriminative models.
o A discriminative model is a model of the conditional probability of the target Y given an
observation x; given a discriminative model for P(Y|X), the class label can be estimated
directly.
o A classifier based on a generative model is a generative classifier, while a classifier based
on a discriminative model is a discriminative classifier.
Algorithms in Discriminative models.
a. Logistic regression
b. Support Vector Machines
c. Decision Tree Learning
d. Random Forest
14. Define SVM Kernel.
The SVM kernel is a function that takes a low-dimensional input space and transforms
it into a higher-dimensional space, i.e., it converts a non-separable problem into a
separable problem. It is mostly useful in non-linear separation problems.
15. Define random forests [A/M-23]
A Random Forest is an ensemble learning method used for classification and regression tasks. It works
by building multiple decision trees during training and combining their outputs to improve accuracy and
control overfitting.
• For classification, it takes the majority vote from all trees.
• For regression, it takes the average of all tree outputs.
16 – MARKS
1. Define Machine Learning. Give an introduction to Machine Learning.
a. Machine Learning:
i. Definition of Machine Learning:
1. Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning ” in 1959 while at IBM.
2. He defined machine learning as “the field of study that gives computers the ability to learn
without being explicitly programmed “.
3. Machine learning is programming computers to optimize a performance criterion using
example data or past experience. The model may be predictive to make predictions in
the future, or descriptive to gain knowledge from data.
ii. Definition of learning:
A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P , if its performance at tasks T, as measured by P ,
improves with experience E.
1.1.3 Examples
1.1.3.1 Handwriting recognition learning problem
• Task T :Recognizing and classifying handwritten words within images
• Performance P : Percent of words correctly classified
• Training experience E : A dataset of handwritten words with given classifications
1.1.3.2 A robot driving learning problem
• Task T : Driving on highways using vision sensors
• Performance P : Average distance traveled before an error
• Training experience E : A sequence of images and steering commands recorded while
observing a human driver
1.2 Classification of Machine Learning
• Machine learning implementations are classified into four major categories, depending
on the nature of the learning “signal” or “response” available to a learning system,
which are as follows:
1.2.1 Supervised learning:
• Supervised learning is the machine learning task of learning a function that maps an
input to an output based on example input-output pairs.
• The given data is labeled.
• Both classification and regression problems are supervised learning problems.
• For example, the inputs could be camera images, each one accompanied by an
output saying “bus” or “pedestrian,” etc.
• An output like this is called a label.
• The agent learns a function that, when given a new image, predicts the appropriate
label.
1.2.2 Unsupervised learning:
• Unsupervised learning is a type of machine learning algorithm used to draw inferences
from datasets consisting of input data without labeled responses.
• In unsupervised learning algorithms, classification or categorization is not included in the
observations.
• In unsupervised learning the agent learns patterns in the input without any explicit
feedback.
• The most common unsupervised learning task is clustering: detecting potentially
useful clusters of input examples.
• For example, when shown millions of images taken from the Internet, a computer
vision system can identify a large cluster of similar images which an English speaker
would call “cats.”
1.2.3 Reinforcement learning:
▪ In reinforcement learning the agent learns from a series of reinforcements: rewards and
punishments.
▪ Reinforcement learning is the problem of getting an agent to act in the world so as to
maximize its rewards.
▪ A learner is not told what actions to take as in most forms of machine learning but
instead must discover which actions yield the most reward by trying them.
▪ For example — Consider teaching a dog a new trick: we cannot tell him what to do,
what not to do, but we can reward/punish it if it does the right/wrong thing.
1.2.4 Semi-supervised learning:
• Semi-Supervised learning is a type of Machine Learning algorithm that represents the
intermediate ground between Supervised and Unsupervised learning algorithms.
• It uses the combination of labeled and unlabeled datasets during the training period,
where an incomplete training signal is given: a training set with some of the target
outputs missing.
1.3 Categorizing based on required Output
1.3.1 Classification:
• The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then
classifies new observation into a number of classes or groups.
• Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called
as targets/labels or categories.
1.3.2 Regression:
Regression is a supervised learning technique which helps in finding the correlation
between variables and enables us to predict the continuous output variable based on
one or more predictor variables.
• It is mainly used for prediction, forecasting, time series modeling, and determining the
cause-and-effect relationship between variables.
1.3.3 Clustering:
• Clustering or cluster analysis is a machine learning technique, which groups the
unlabeled dataset.
• It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points.
• The objects with the possible similarities remain in a group that has less or no
similarities with another group."
2. Explain in detail about Linear Regression Models. Or Explain Linear
Regression Models: Least squares, single & multiple regression, Bayesian linear
regression, gradient descent. [A/M-24]
a. Linear Regression
i. In statistics, linear regression is a linear approach to modeling the relationship between
a dependent variable and one or more independent variables.
ii. Let X be the independent variable and Y be the dependent variable.
iii. A linear relationship between these two variables is expressed as:
Y = mX + c
where m is the slope and c is the y-intercept.
• A linear regression algorithm shows a linear relationship between a dependent (y) and
one or more independent (x) variables, hence it is called linear regression.
• Linear regression finds how the value of the dependent variable changes according to
the value of the independent variable.
• The linear regression model provides a sloped straight line representing the relationship
between the variables.
• Consider Figure 3.1 below, which represents the relationship between the independent
and dependent variables.
Figure 3.1 – Relationship between independent and dependent variables
2.2 Least Squares Regression Line
• Least squares are a commonly used method in regression analysis for estimating the
unknown parameters by creating a model which will minimize the sum of squared errors
between the observed data and the predicted data.
2.2.1 Least Squares Regression Equation
▪ The equation that minimizes the total of all squared prediction errors for known Y scores
in the original correlation analysis is
Y′ = bX + a
where Y′ represents the predicted value, X represents the known value, and b and a
represent numbers calculated from the original correlation analysis.
Least Squares Regression in Python
Scenario: A rocket motor is manufactured by combining an igniter propellant and a sustainer
propellant inside a strong metal housing. It was noticed that the shear strength of the
bond between the two propellants is strongly dependent on the age of the sustainer propellant.
Problem Statement: Implement a simple linear regression algorithm using Python to build
a machine learning model that studies the relationship between the shear strength of the bond
between the two propellants and the age of the sustainer propellant.
Step 1: Import the required Python libraries.
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Next step is to read and load the dataset.
# Loading dataset
data = pd.read_csv('PropallantAge.csv')
data.head()
data.info()
Step 3: Create a scatter plot just to check the relationship between these two variables.
# Plotting the data
plt.scatter(data['Age of Propellant'],data['Shear Strength'])
Step 4: Next step is to assign X and Y as independent and dependent variables
respectively.
# Computing X and Y
X = data['Age of Propellant'].values
Y = data['Shear Strength'].values
Step 5: Compute the mean of variables X and Y to determine the values of slope (m)
and y-intercept.
Also, let n be the total number of data points.
# Mean of variables X and Y
mean_x = np.mean(X)
mean_y = np.mean(Y)
# Total number of data values
n = len(X)
Step 6: Calculate the slope and the y-intercept using the formulas:
# Calculating 'm' and 'c'
num = 0
denom = 0
for i in range(n):
    num += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = num / denom
c = mean_y - (m * mean_x)
# Printing coefficients
print("Coefficients")
print(m, c)
The above step has given the values of m and c. Substituting them:
Shear Strength = 2627.822359001296 + (-37.15359094490524) * Age of Propellant
Step 7: The above equation represents the linear regression model.
Let’s plot this graphically. Refer fig 3.2
# Plotting values and regression line
maxx_x = np.max(X) + 10
minn_x = np.min(X) - 10
# Line values for x and y
x = np.linspace(minn_x, maxx_x, 1000)
y = c + m * x
# Plotting regression line
plt.plot(x, y, color='#58b970', label='Regression Line')
# Plotting scatter points
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('Age of Propellant (in years)')
plt.ylabel('Shear Strength')
plt.legend()
plt.show()
Output:
Figure 3.2 – Example for Regression Line
Types of Linear Regression
It is of two types: Simple and Multiple.
o Simple Linear Regression is where only one independent variable is present and the model has
to find the linear relationship of it with the dependent variable
Equation of Simple Linear Regression: y = b0 + b1x, where b0 is the intercept, b1 is the
coefficient or slope, x is the independent variable and y is the dependent variable.
o In Multiple Linear Regression there is more than one independent variable for the model
to find the relationship.
Equation of Multiple Linear Regression: y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn, where
b0 is the intercept; b1, b2, b3, ..., bn are the coefficients or slopes of the independent
variables x1, x2, x3, ..., xn; and y is the dependent variable.
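A minimal sketch of fitting such a multiple-variable model by ordinary least squares with NumPy (the data below are hypothetical, used only to illustrate the equation above):
# Multiple linear regression y = b0 + b1*x1 + b2*x2 via least squares
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([6.1, 6.9, 12.2, 13.1, 17.8])

# Prepend a column of ones so the intercept b0 is estimated as well
A = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq minimizes the sum of squared errors ||A @ b - y||^2
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept b0:", b[0], " coefficients b1, b2:", b[1:])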
2.4 Linear Regression Model
• A Linear Regression model’s main aim is to find the best fit linear line and the optimal values
of intercept and coefficients such that the error is minimized.
• Error is the difference between the actual value and the predicted value, and the goal is
to reduce this difference.
Figure 3.3 – Example for Linear Regression Model
• In the above figure 3.3, x is the independent variable plotted on the x-axis and y is
the dependent variable plotted on the y-axis.
• Black dots are the data points, i.e. the actual values.
• b0 is the intercept, which is 10, and b1 is the slope of the x variable.
• The blue line is the best fit line predicted by the model, i.e. the predicted values lie on
the blue line.
• The vertical distance between the data point and the regression line is known as
error or residual.
• Each data point has one residual and the sum of all the differences is known as the
Sum of Residuals/Errors.
Mathematical Approach:
Residual/Error = Actual values – Predicted values
Sum of Residuals/Errors = Sum(Actual – Predicted values)
Square of Sum of Residuals/Errors = (Sum(Actual – Predicted values))²
2.5 Bayesian Regression
• Bayesian Regression is used when the data is insufficient in the dataset or the data is
poorly distributed.
• The output of a Bayesian Regression model is obtained from a probability distribution.
• The aim of Bayesian Linear Regression is to find the ‘posterior’ distribution for the
model parameters.
• The expression for the posterior is:
Posterior = (Likelihood × Prior) / Normalization, i.e. P(H | E) = P(E | H) · P(H) / P(E)
where
o Posterior: the probability of an event H occurring, given that another event E has
already occurred, i.e., P(H | E).
o Prior: the probability of event H before observing E, i.e., P(H).
o Likelihood: the likelihood function, in which some parameter variable is marginalized.
• The prior used in Bayesian Ridge Regression is as follows:
p(w|λ) = N(w | 0, λ⁻¹Ip)
where
▪ the vector w, made up of the elements w0, w1, ..., holds the regression weights,
▪ λ (lambda) is the shape (precision) parameter of the prior distribution over w.
2.5.1 Implementation Of Bayesian Regression Using Python
• Boston Housing dataset, which includes details on the average price of homes in
various Boston neighborhoods.
• The r2 score will be used for evaluation.
• The crucial components of a Bayesian Ridge Regression model:
Program
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import BayesianRidge

# Loading the dataset (note: load_boston was removed in scikit-learn 1.2,
# so this example requires an older scikit-learn version)
dataset = load_boston()
X, y = dataset.data, dataset.target

# Splitting the dataset into testing and training sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Creating and training the model
model = BayesianRidge()
model.fit(X_train, y_train)

# Model predicting the test data
prediction = model.predict(X_test)

# Evaluation of r2 score of the model against the test dataset
print(f"Test Set r2 score : {r2_score(y_test, prediction)}")
Output
Test Set r2 score : 0.7943355984883815
Advantages of Bayesian Regression:
• Very effective when the size of the dataset is small.
• Particularly well-suited for on-line based learning (data is received in real-time), as
compared to batch based learning, where we have the entire dataset on our hands before
we start training the model. This is because Bayesian Regression doesn’t need to store
data.
• The Bayesian approach is a tried and tested approach and is very robust,
mathematically. So, one can use this without having any extra prior knowledge about
the dataset.
Disadvantages of Bayesian Regression:
• The inference of the model can be time-consuming.
• If there is a large amount of data available for our dataset, the Bayesian approach is
not worth it.
2.6 Gradient descent
2.6.1 Cost Function
• The cost is the error in our predicted value.
• It is calculated using the Mean Squared Error function, MSE = (1/n) Σi (yi − ŷi)², as
shown in figure 3.4.
Figure 3.4 – Example for Cost function
• The goal is to minimize the cost as much as possible in order to find the best fit line.
2.6.2 Gradient Descent Algorithm.
• Gradient descent is an optimization algorithm that finds the best-fit line for a given
training dataset in a smaller number of iterations.
• If m and c are plotted against MSE, it will acquire a bowl shape as shown in figure
3.4a and figure 3.4b
Figure 3.4b – Gradient Descent Shape
Figure 3.4a – Process of gradient descent algorithm
Learning Rate
• A learning rate is used for each pair of input and output values. It is a scalar factor and
coefficients are updated in direction towards minimizing error.
• The process is repeated until a minimum sum squared error is achieved or no further
improvement is possible.
Step by Step Algorithm:
1. Initially, let m = 0 and c = 0. Let L be the learning rate, controlling how much the value
of “m” changes with each step. The smaller the L, the greater the accuracy; L = 0.001
gives good accuracy.
2. Calculate the partial derivative of the loss function with respect to m to get the derivative
Dm, and similarly the partial derivative with respect to c, Dc:
Dm = (−2/n) Σ xi(yi − ŷi)
Dc = (−2/n) Σ (yi − ŷi)
3. Update the current values of m and c using the following equations:
m = m − L · Dm
c = c − L · Dc
4. Repeat this process until the cost function is very small (ideally 0).
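The loop below is a runnable sketch of these steps (the toy data and iteration count are illustrative, not from the QB):
# Gradient descent for y = m*x + c on toy data (true line: y = 2x + 1)
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([3, 5, 7, 9, 11], dtype=float)

m, c = 0.0, 0.0       # Step 1: initial slope and intercept
L = 0.001             # learning rate
n = len(X)

for _ in range(100000):
    Y_pred = m * X + c
    Dm = (-2 / n) * np.sum(X * (Y - Y_pred))   # Step 2: dMSE/dm
    Dc = (-2 / n) * np.sum(Y - Y_pred)         #         dMSE/dc
    m -= L * Dm                                # Step 3: move downhill
    c -= L * Dc

print(m, c)   # approaches m = 2, c = 1 as the cost shrinks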
3. Explain in detail about Linear Classification Models – Discriminant function.
a. Linear Classification Models
i. The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.
In Classification, a program learns from the given dataset or observations and then
classifies new observation into a number of classes or groups. Such as, Yes or No, 0
or 1, Spam or Not Spam, cat or dog, etc.
• Classes can be called as targets/labels or categories.
• The output variable of Classification is a category, not a value, such as "Green or
Blue", "fruit or animal", etc.
• Since the Classification algorithm is a supervised learning technique, hence it takes
labeled input data, which means it contains input with the corresponding output.
• In the classification algorithm, a discrete output function (y) is mapped to the input
variable (x): y = f(x), where y = categorical output
• The best example of an ML classification algorithm is Email Spam Detector.
• The goal of the classification algorithm is
o Take a D-dimensional input vector x
o Assign it to one of K discrete classes Ck , k = 1, . . . , K
• In the most common scenario, the classes are taken to be disjoint and each input is
assigned to one and only one class
• The input space is divided into decision regions
• The boundaries of the decision regions
o decision boundaries
o decision surfaces
• With linear models for classification, the decision surfaces are linear functions of x.
Classes that can be separated well by linear surfaces are linearly separable.
• In the figure 3.5, there are two classes, class A and Class B.
• These classes have features that are similar to each other and dissimilar to other classes.
Figure 3.5 – Example of Classification
• The algorithm which implements the classification on a dataset is known as a
classifier.
• There are two types of Classifications:
Two-class problems :
Binary representation or Binary Classifier:
o If the classification problem has only two possible outcomes, then it is called as
Binary Classifier.
o There is a single target variable t ∈ {0, 1}
o t = 1 represents class C1
o t = 0 represents class C2
o Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or
DOG, etc.
Multi-class Problems:
o If a classification problem has more than two outcomes, then it is called as Multi-
class Classifier.
o Example: Classifications of types of crops, Classification of types of music.
o 1-of-K coding scheme:
o There is a K-long target vector t such that, if the class is Cj, all elements tk of t are
zero for k ≠ j and one for k = j; tk can be interpreted as the probability that the class
is Ck. For example, if K = 6 and the class is C4, then t = (0, 0, 0, 1, 0, 0)T.
• The simplest approach to classification problems is through construction of a
discriminant function that directly assigns each vector x to a specific class
3.2 Types of ML Classification Algorithms:
• Logistic Regression
• K-Nearest Neighbors
• Support Vector Machines
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
3.3 Discriminant function
A function of a set of variables that is evaluated for samples of events or objects and used
as an aid in discriminating between or classifying them.
• A discriminant function (DF) maps independent (discriminating) variables into a latent
variable D.
• DF is usually postulated to be a linear function:
D = a0 + a1x1 + a2x2 + ... + aNxN
• The goal of discriminant analysis is to find such values of the coefficients {ai, i=0,...,N} that
the distance between the mean values of DF is maximal for the two groups.
• Whenever there is a requirement to separate two or more classes having multiple features
efficiently, the Linear Discriminant Analysis model is considered the most common
technique to solve such classification problems.
• For example, if there are classes with multiple features that need to be separated
efficiently, classifying them using a single feature may show overlapping, as shown in
figure 3.6.
Figure 3.6 – Example for Classification using single feature
• To overcome the overlapping issue in the classification process, the number of features
must be increased.
4. Explain in detail about Linear Discriminant Functions and its types. Also
elaborate about logistic regression in detail. [A/M-23]
a. Linear Discriminant Functions
A discriminant function that is a linear combination of the components of x can be written as
g(x) = wTx + w0    (3.1)
where w is the weight vector and w0 the bias or threshold weight.
b. The Two-Category Case
i. For a discriminant function of the form of eq.3.1, a two-
category classifier implements the following decision rule:
ii. Decide w1 if g(x)>0 and w2 if g(x)<0.
iii. Thus, x is assigned to w1 if the inner product wTx exceeds
the threshold – w0 and to w2 otherwise.
iv. If g(x)=0, x can ordinarily be assigned to either class, or can be
left undefined.
v. The equation g(x)=0 defines the decision surface that
separates points assigned to w1 from points assigned to w2.
vi. When g(x) is linear, this decision surface is a hyperplane.
vii. If x1 and x2 are both on the decision surface, then
wTx1 + w0 = wTx2 + w0    (3.2)
or
wT(x1 − x2) = 0,    (3.3)
so w is normal to any vector lying in the hyperplane.
Figure 3.7: The linear decision boundary H separates
the feature space into two half-spaces.
• In figure 3.7, the hyperplane H divides the feature space into two half-spaces:
o decision region R1 for w1
o decision region R2 for w2.
• The discriminant function g(x) gives an algebraic measure of the distance from x to
the hyperplane.
• One way to express x is
x = xp + r · (w/||w||)    (3.4)
where xp is the normal projection of x onto H, and r is the desired algebraic distance, which is
positive if x is on the positive side and negative if x is on the negative side. Then, because
g(xp) = 0,
g(x) = wTx + w0 = r||w||    (3.5)
or
r = g(x)/||w||.    (3.6)
o The distance from the origin to H is given by w0/||w||.
o If w0 > 0, the origin is on the positive side of H, and if w0 < 0, it is on the negative side.
o If w0 = 0, then g(x) has the homogeneous form wTx, and the hyperplane passes
through the origin.
4.3 The Multi-category Case
• To devise multi-category classifiers employing linear discriminant functions, reduce the
problem to c two-class problems.
• Define c linear discriminant functions
gi(x) = wiTx + wi0,  i = 1, . . . , c    (3.7)
and assign x to wi if gi(x) > gj(x) for all j ≠ i; in case of ties, the classification is left
undefined.
• The resulting classifier is called a linear machine.
Figure 3.8: Decision boundaries defined by linear machines
• A linear machine divides the feature space into c decision regions as shown in figure 3.8,
with gj(x) being the largest discriminant if x is in region Ri.
• If Ri and Rj are contiguous, the boundary between them is a portion of the hyperplane
Hij defined by
gi(x) = gj(x), or (wi − wj)Tx + (wi0 − wj0) = 0.
• wi − wj is normal to Hij, and the signed distance from x to Hij is given by
(gi(x) − gj(x)) / ||wi − wj||.
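A small numerical sketch of a linear machine (the weights below are hypothetical, chosen only to illustrate the argmax decision rule):
# Linear machine: assign x to the class with the largest g_i(x) = w_i^T x + w_i0
import numpy as np

W = np.array([[1.0, -0.5],    # w_1
              [-0.3, 0.8],    # w_2
              [0.2, 0.2]])    # w_3
w0 = np.array([0.1, -0.2, 0.05])   # bias terms w_i0

x = np.array([0.7, 1.4])
g = W @ x + w0                     # one discriminant value per class
print("assign x to class", np.argmax(g) + 1)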
4.4 Generalized Linear Discriminant Functions
• The linear discriminant function g(x) can be written as
g(x) = w0 + Σi wi xi    (3.8)
where the coefficients wi are the components of the weight vector w.
Quadratic Discriminant Function
• Adding terms involving products of pairs of components of x gives the quadratic
discriminant function:
g(x) = w0 + Σi wi xi + Σi Σj wij xi xj    (3.9)
4.5 Probabilistic discriminative models
• Discriminative models are a class of supervised machine learning
models which make predictions by estimating conditional
probability P(y|x).
• For the two-class classification problem, the posterior probability of class C1 can be
written as a logistic sigmoid acting on a linear function of x:
p(C1|x) = σ(wTx + w0), where σ(a) = 1/(1 + e^(−a)).
• For the multi-class case, the posterior probability of class Ck is given by a softmax
transformation of a linear function of x:
p(Ck|x) = exp(ak) / Σj exp(aj), where ak = wkTx + wk0.
4.6 Logistics Regression
o Logistic regression is a Machine Learning algorithm that comes under the
classification algorithms of the Supervised Learning technique.
o Logistic regression is used to describe data and the relationship between
one dependent variable and one or more independent variables.
o The independent variables can be nominal, ordinal, or of interval type.
o Logistic regression predicts the output of a categorical dependent variable.
o Therefore the outcome must be a categorical or discrete value.
o It can be either Yes or No, 0 or 1, True or False, etc.; it gives the
probabilistic values which lie between 0 and 1.
o Linear Regression is used for solving Regression problems, whereas
Logistic regression is used for solving the classification problems.
The figure 3.9 predicts the logistic function
Figure 3.9 – Logistic Function or Sigmoid Function
4.6.1 Logistic Function (Sigmoid Function):
o The logistic function is also known as the sigmoid function.
o The sigmoid function is a mathematical function used to
map the predicted values to probabilities.
o The value of the logistic regression must be between 0 and 1, so it
forms a curve like the "S" form.
o The S-form curve is called the Sigmoid function or the logistic function.
4.6.2 Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.
4.6.3 Logistic Regression Equation:
• The Logistic regression equation can be obtained from the
Linear Regression equation.
• The mathematical steps are given below:
• The equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so divide the above equation
by (1 − y):
y / (1 − y)
o To allow the range from −infinity to +infinity, take the logarithm of the equation:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
4.6.4 Type of Logistic Regression:
• Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two
possible types of the dependent variables, such as 0 or 1, Pass or
Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3
or more possible unordered types of the dependent variable, such
as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more
possible ordered types of dependent variables, such as "low",
"Medium", or "High".
4.6.5 Steps in Logistic Regression:
• To implement the Logistic Regression using Python, the steps are
given below:
• Data Pre-processing step
• Fitting Logistic Regression to the Training set
• Predicting the test result
• Test accuracy of the result
• Visualizing the test set result.
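A compact sketch of these steps, assuming scikit-learn and a synthetic dataset (both illustrative, not part of the syllabus example):
# Logistic Regression workflow: pre-process, fit, predict, score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data pre-processing step: a synthetic two-class dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fitting Logistic Regression to the training set
clf = LogisticRegression().fit(X_train, y_train)

# Predicting the test result and testing the accuracy of the result
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))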
4.6.6 Advantages of Logistic Regression Algorithm
• Logistic regression performs better when the data is linearly separable
• It does not require too many computational resources
• There is no problem scaling the input features
• It is easy to implement and train a model using logistic regression
5. Elaborate in detail about Probabilistic Generative model and Naïve Bayes.
Probabilistic Generative model
o Given a model of one conditional probability, and estimated probability distributions
for the variables X and Y, denoted P(X) and P(Y), the other conditional probability can
be estimated using Bayes' rule:
P(Y|X) = P(X|Y) · P(Y) / P(X)
• A generative model is a statistical model of the joint probability distribution P(X, Y) on a
given observable variable X and target variable Y.
• Given a generative model for P(X|Y), one can estimate the posterior P(Y|X) via Bayes'
rule, as above.
• A discriminative model is a model of the conditional probability of the target Y given
an observation x; given a discriminative model for P(Y|X), the class can be estimated
directly.
• A classifier based on a generative model is a generative classifier, while a classifier based
on a discriminative model is a discriminative classifier.
5.2 Generative models
Types of generative models are:
• Naive Bayes classifier or Bayesian network
• Linear discriminant analysis
5.3 Discriminative models
• Logistic regression
• Support Vector Machines
• Decision Tree Learning
• Random Forest
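Since the question also covers Naive Bayes, here is a minimal sketch of it as a generative classifier, assuming scikit-learn (the Iris dataset is purely illustrative):
# Gaussian Naive Bayes: models P(x|y) per class, applies Bayes' rule for P(y|x)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))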
6. Elaborate in detail about Support Vector Machine (SVM). [A/M-24]
Support Vector Machine (SVM)
▪ Support Vector Machine (SVM) is a supervised machine learning algorithm used
for both classification and regression.
▪ The objective of the SVM algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points.
▪ Hyperplanes are decision boundaries that help classify the data points.
▪ The dimension of the hyperplane depends upon the number of features.
▪ If the number of input features is 2, then the hyperplane is just a line.
▪ If the number of input features is 3, then the hyperplane becomes a two-
dimensional plane.
▪ It becomes difficult to imagine when the number of features exceeds 3.
▪ The objective is to find a plane that has the maximum margin, i.e the maximum
distance between data points of both classes.
▪ Support vectors are data points that are closer to the hyperplane and influence
the position and orientation of the hyperplane.
▪ Using these support vectors, we can maximize the margin of the classifier.
▪ Deleting the support vectors will change the position of the hyperplane.
o Example: refer to Figure 3.10.
Figure 3.10 – Example for Support Vectors
• Let’s consider two independent variables x1, x2 and one dependent variable which is
either a blue circle or a red box as shown in figure 3.11.
Figure 3.11 - Linearly Separable Data points
6.2 Cost Function and Gradient Updates
• In the SVM algorithm, to maximize the margin between the data points and the
hyperplane, the loss function that helps to maximize the margin is called hinge loss.
6.2.1 Hinge loss function
• The hinge loss is c(x, y, f(x)) = max(0, 1 − y · f(x)): the cost is 0 if the predicted value
and the actual value are of the same sign; if they are not, the loss value is calculated.
• The objective of the regularization parameter is to balance margin maximization and
loss.
• After adding the regularization parameter, the cost function looks as below:
min  λ||w||² + Σi max(0, 1 − yi · wTxi)
6.3 SVM Kernel:
• The SVM kernel is a function that takes a low-dimensional input space and transforms
it into a higher-dimensional space, i.e., it converts a non-separable problem into a
separable problem. It is mostly useful in non-linear separation problems.
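A minimal sketch of this kernel idea, assuming scikit-learn's SVC on a toy dataset of concentric circles (illustrative only):
# An RBF-kernel SVM separating data that no straight line can separate in 2-D
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the points to a higher-dimensional space
clf = SVC(kernel='rbf').fit(X, y)
print("Training accuracy:", clf.score(X, y))   # close to 1.0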
6.4 Types of SVMs
• There are two different types of SVMs, each used for different things:
o Simple SVM: Typically used for linear regression and classification problems.
o Kernel SVM: Has more flexibility for non-linear data.
6.5 Advantages of SVM:
• Effective on datasets with multiple features, like financial or medical data.
• Effective in cases where number of features is greater than the number of data
points.
• It is memory efficient, as it uses a subset of training points in the decision function,
called support vectors.
• Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
6.6 Disadvantages
• If the number of features is a lot bigger than the number of data points, choosing kernel
functions and regularization term is crucial.
• SVMs don't directly provide probability estimates. Those are calculated using an
expensive five-fold cross-validation.
• Works best on small sample sets because of its high training time.
6.7 Applications
SVMs can be used to solve various real-world problems:
• SVMs are helpful in text and hypertext categorization.
• Classification of images can also be performed using SVMs.
• Classification of satellite data like SAR data using supervised SVM.
• Hand-written characters can be recognized using SVM.
• The SVM algorithm has been widely applied in the biological and other sciences. They
have been used to classify proteins with up to 90% of the compounds classified correctly.
7. Elaborate in detail about Decision Tree in Supervised Learning.
Decision Tree
• Decision Tree is a supervised learning technique that can be used for both classification
and Regression problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
• As shown in figure 3.12, Decision nodes are used to make any decision and have
multiple branches, whereas Leaf nodes are the output of those decisions and do not
contain any further branches.
• For an example of a Decision Tree, refer to Figure 3.13.
• The goal of using a Decision Tree is to create a training model that can be used to predict
the class or value of the target variable by learning simple decision rules inferred from prior
data (training data).
• In order to build a tree, use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
Figure 3.12 – Decision Tree Structure
Figure 3.13 – Decision Tree Example
7.3 Reason for using Decision Trees
• Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
• The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
7.4 Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the
tree.
• Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.
7.5 Working of Decision Tree algorithm
• In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree.
• This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps
to the next node.
• For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further.
• It continues the process until it reaches the leaf node of the tree.
• The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; the final node is called a leaf node.
Example:
Suppose there is a candidate who has a job offer and wants to decide whether he should accept
the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary attribute
by ASM). The root node splits further into the next decision node (distance from the office) and one
leaf node based on the corresponding labels. The next decision node further gets split into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offer and Declined offer). Consider the below diagram: refer to Figure 3.14.
Figure 3.14 – Decision Tree Algorithm Example
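A minimal sketch of building such a tree with scikit-learn's CART implementation (the Iris dataset is illustrative; the job-offer data above is not machine-readable):
# Train a decision tree and print its decision rules
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion='entropy' selects splits by information gain; 'gini' is the default
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X, y)

# Shows the root node, decision nodes and leaf nodes as text
print(export_text(tree))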
7.6 Algorithms used to construct Decision Trees:
• ID3 → (Iterative Dichotomiser 3)
• C4.5 → (successor of ID3)
• CART → (Classification And Regression Tree)
• CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits
when computing classification trees)
• MARS → (multivariate adaptive regression splines)
7.7 Attribute Selection Measures
• While implementing a Decision tree, an Attribute Selection Measure or ASM is
used to select the best attribute for the nodes of the tree. The popular measures are:
1. Entropy,
2. Information gain,
3. Gini index,
4. Gain Ratio,
5. Reduction in Variance
6. Chi-Square
Entropy:
• Entropy is a metric to measure the impurity in a given attribute.
• Entropy is a measure of the randomness in the information being
processed.
• The higher the entropy, the harder it is to draw any conclusions from that
information.
• Flipping a coin is an example of an action that provides information
that is random.
• Entropy can be calculated as:
Entropy(S) = -P(yes)log2 P(yes)- P(no) log2 P(no)
Where,
• S = the set of samples
• P(yes) = probability of yes
• P(no) = probability of no
7.7.2. Information Gain:
• Information gain or IG is a statistical property that measures how well a given
attribute separates the training examples according to their target classification.
Example: refer to Figure 3.15.
Figure 3.15 – Information Gain Example
• Constructing a decision tree is all about finding an attribute that returns the highest
information gain and the smallest entropy.
• Information gain is a decrease in entropy.
• It computes the difference between entropy before split and average entropy after split of the
dataset based on given attribute values.
It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
7.7.3. Gini Index:
• Gini index is a cost function used to evaluate splits in the dataset.
• It is calculated by subtracting the sum of the squared probabilities of each class from one.
• It favors larger partitions and is easy to implement, whereas information gain favors
smaller partitions with distinct values.
• Gini index can be calculated using the below formula:
Gini Index = 1 − Σj pj²
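A small worked sketch of these measures on a hypothetical split (the counts below are made up purely for illustration):
# Entropy, Information Gain and Gini on a toy two-class split
import numpy as np

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)log2 P(yes) - P(no)log2 P(no); 0*log2(0) treated as 0
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

def gini(p_yes, p_no):
    # Gini Index = 1 - sum of squared class probabilities
    return 1 - (p_yes**2 + p_no**2)

# Parent: 10 samples (6 yes, 4 no); children: (4 yes, 0 no) and (2 yes, 4 no)
parent = entropy(0.6, 0.4)
left = entropy(1.0, 0.0)        # 4 samples
right = entropy(2/6, 4/6)       # 6 samples

# Information Gain = Entropy(S) - weighted average of child entropies
gain = parent - (4/10 * left + 6/10 * right)
print("Entropy:", parent, " Information Gain:", gain, " Gini:", gini(0.6, 0.4))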
7.7.4 Gain Ratio
• Information gain is biased towards choosing attributes with a large number of values
as root nodes.
• Gain ratio overcomes the problem with information gain by taking the intrinsic
information of a split into account.
7.7.5 Reduction in variance
• Reduction in variance is an algorithm that uses the standard formula of variance,
Variance = Σ(X − X̄)² / n, to choose the best split.
• The split with lower variance is selected as the criterion to split the population.
7.7.6 Chi-Square
• The acronym CHAID stands for Chi-squared Automatic Interaction Detector.
• It finds out the statistical significance between the differences between
sub-nodes and parent node.
• The higher the value of Chi-Square, the higher the statistical significance of the
differences between the sub-node and the parent node.
• It generates a tree called CHAID (Chi-square Automatic Interaction
Detector).
• Mathematically, Chi-squared is represented as:
Chi-Square = Σ (Actual − Expected)² / Expected
7.8. Avoid/counter Overfitting in Decision Trees
o Two ways to remove overfitting:
o Pruning Decision Trees.
o Random Forest
7.8.1 Pruning Decision Trees
• Pruning is a process of deleting the unnecessary nodes from a tree in
order to get the optimal decision tree.
• A too-large tree increases the risk of overfitting, and a small tree may
not capture all the important features of the dataset.
• Therefore, a technique that decreases the size of the learning tree
without reducing accuracy is known as Pruning.
• There are mainly two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.
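A minimal sketch of cost-complexity pruning, assuming scikit-learn's ccp_alpha parameter (the dataset and alpha values are illustrative):
# Larger ccp_alpha prunes more nodes, trading training fit for simplicity
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in (0.0, 0.01, 0.05):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    print(alpha, tree.get_n_leaves(), tree.score(X_test, y_test))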
7.8.2 Random Forest
▪ Random Forest is an example of ensemble learning, in which we combine
multiple machine learning algorithms to obtain better predictive
performance.
▪ The name random means:
▪ A random sampling of the training data set when building trees.
▪ Random subsets of features considered when splitting nodes.
7.9 Advantages of the Decision Tree
• It is simple to understand, as it follows the same process which a human follows while
making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.
7.10 Disadvantages of the Decision Tree
• The decision tree contains lots of layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
• For more class labels, the computational complexity of the decision tree may increase.
8. Elaborate in detail about Random Forest in Supervised Learning. [N/D-23]
[A/M-24]
Random Forest
▪ Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset.
▪ Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique.
▪ It can be used for both Classification and Regression problems in ML.
▪ It is based on the concept of ensemble learning, which is a process of
combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
▪ A greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
Steps in the working process of Random Forest
▪ The Working process can be explained in the below steps and diagram:
Step 1: In Random forest n number of random records are taken from the data set
having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or Averaging for
Classification and regression respectively.
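A minimal sketch of this workflow, assuming scikit-learn's RandomForestClassifier (the dataset is illustrative):
# Each of the n_estimators trees trains on a bootstrap sample (Steps 1-2);
# predict() aggregates the per-tree outputs by majority vote (Steps 3-4)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))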
Need for Random Forest
▪ It takes less training time as compared to other algorithms.
▪ It predicts output with high accuracy, even for the large dataset it runs efficiently.
▪ It can also maintain accuracy when a large proportion of data is missing.
Example:
▪ Suppose there is a dataset that contains multiple fruit images. So, this dataset is given
to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result,
and when a new data point occurs, then based on the majority of results, the Random
Forest classifier predicts the final decision.
Figure 3.16 – Example for Random Forest
• In the above figure 3.16, the majority of the decision trees give their output as an apple
rather than a banana, so the final output is taken as an apple.
8.5 Important Features of Random Forest
1. Diversity-
Not all attributes/variables/features are considered while making an individual tree;
each tree is different.
2. Immune to the curse of dimensionality-
Since each tree does not consider all the features, the feature space is reduced.
3. Parallelization-
Each tree is created independently out of different data and attributes. This means that we
can make full use of the CPU to build random forests.
4. Train-Test split-
In a random forest we don’t have to segregate the data for train and test, as roughly
one-third of the data (the out-of-bag samples) is never seen by a given decision tree.
5. Stability-
Stability arises because the result is based on majority voting/ averaging.
8.6 Applications of Random Forest
1. Banking: Banking sector mostly uses this algorithm for the
identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
8.7 Advantages of Random Forest
• Random Forest is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.
8.8 Disadvantages of Random Forest
Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.