
21CSC305P

MACHINE LEARNING
UNIT 2
Maximum likelihood estimation - Least squares, Robust linear
regression, Ridge regression, Bayesian linear regression, Linear models
for classification: Discriminant functions, Probabilistic generative
models, Probabilistic discriminative models, Laplace approximation,
Bayesian logistic regression, Kernel functions, Using kernels in GLMs,
Kernel trick, SVMs.
Introduction to Linear
Regression
Linear Regression
In Machine Learning,
 Linear Regression is a supervised machine learning algorithm.
 It tries to find out the best linear relationship that describes the data you have.
 It assumes that there exists a linear relationship between a dependent variable and independent variable(s).
 The value of the dependent variable in a linear regression model is continuous, i.e. a real number.

Representing a Linear Regression Model-

A linear regression model represents the linear relationship between a dependent variable and independent variable(s) via a sloped straight line.

The sloped straight line that fits the given data best is called the regression line.
Types of Linear Regression
Based on the number of independent variables, there are two types of linear regression-

1. Simple Linear Regression-


In simple linear regression, the dependent variable depends only on a single independent variable.
For simple linear regression, the form of the model is-
Y = β0 + β1X

Here,
 Y is a dependent variable.
 X is an independent variable.
 β0 and β1 are the regression coefficients.
 β0 is the intercept (bias) that fixes the offset of the line.
 β1 is the slope (weight) that specifies the factor by which X has an impact on Y.
Types of Linear Regression

2. Multiple Linear Regression-

In multiple linear regression, the dependent variable depends on more than one independent variable.
For multiple linear regression, the form of the model is-
Y = β0 + β1X1 + β2X2 + β3X3 + …… + βnXn

Here,
 Y is a dependent variable.
 X1, X2, …., Xn are independent variables.
 β0, β1,…, βn are the regression coefficients.
 βj (1 ≤ j ≤ n) is the slope or weight that specifies the factor by which Xj has an impact on Y.
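To make the least-squares estimation of these coefficients concrete, here is a minimal NumPy sketch (not from the original slides): it stacks a column of ones onto the inputs so that β0 acts as the intercept and solves the least-squares problem. The data values are made up for illustration.

import numpy as np

# Illustrative data: 5 observations, 2 independent variables (X1, X2)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.1, 6.9, 12.2, 13.1, 17.0])

# Add a column of ones so that beta[0] plays the role of the intercept beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: solve min ||X_design @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients [beta0, beta1, beta2]:", beta)
print("Predictions:", X_design @ beta)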
Assumptions of Linear Regression
The standard assumptions behind a linear regression model are:
 Linearity: the relationship between the independent variables and the dependent variable is linear.
 Independence: the errors (residuals) are independent of one another.
 Homoscedasticity: the errors have constant variance across all values of the independent variables.
 Normality: the errors are normally distributed.
 No multicollinearity: the independent variables are not highly correlated with each other.
Maximum likelihood estimation- Least squares
What is Maximum Likelihood Estimation (MLE)?
Maximum Likelihood Estimation (MLE) is a method to estimate the parameters (like
weights w or θ) of a model such that the likelihood of observing the given data is
maximized.
In simple terms:
Find the parameters that make the data “most likely.”

Key idea:
We find the parameters θ (e.g., weights w) such that the likelihood of the data D given θ is maximized:

θ̂ = argmaxθ p(D | θ)

For linear regression, this becomes maximizing the likelihood of observing the target values y given the inputs X and weights w.
What is Least Squares?

Least squares is a method to estimate parameters (often in linear regression) by minimizing the sum of squared errors between predicted and actual values.
For linear models, Least Squares = MLE when the errors are assumed to be normally distributed.
Geometric interpretation & convexity
Linear regression can be illustrated geometrically as an orthogonal projection (of the target vector onto the space spanned by the inputs) in 3D space.
Example : Perform Linear Regression using MLE on a small
dataset
Here is the visual plot showing:
•Blue dots: Actual data points from the sample.
•Green line: Linear regression line computed
using the Least Squares Method (which is
equivalent to Maximum Likelihood
Estimation under Gaussian noise).
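As a hedged illustration of the "Least Squares = MLE under Gaussian noise" point, the sketch below (with synthetic data, not the slide's dataset) fits a line by least squares and then checks that the Gaussian log-likelihood is higher at the fitted weights than at perturbed ones.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)   # y = 2.5x + 1 + Gaussian noise

# Least-squares fit (equivalent to MLE when the noise is Gaussian)
A = np.column_stack([np.ones_like(x), x])
w_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

def gaussian_log_likelihood(w, sigma=1.0):
    resid = y - A @ w
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

print("log-likelihood at least-squares weights :", gaussian_log_likelihood(w_ls))
print("log-likelihood at perturbed weights     :", gaussian_log_likelihood(w_ls + 0.3))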
Note: for understanding only.
Robust Linear Regression
A robust linear regression is like regular linear regression, but it’s designed to handle
outliers better.

Why Robust Regression?


•Ordinary Least Squares (OLS) works well if errors are normally distributed.
•But outliers can heavily influence the regression line, pulling it away from the main data
pattern.
•Robust regression methods reduce the influence of outliers, giving a more “stable” fit.
The Problem with Ordinary Least Squares (OLS)
Motivation for Robust Regression
We want:
•A line that fits most of the data points well.
•Minimal influence from extreme points (outliers).
Goal: Reduce the penalty for large errors so they don’t dominate the solution.

Core idea about Robust Regression


Common Loss Functions in Regression
Loss functions measure how far predictions are from actual values.
In robust regression, we choose loss functions that penalize outliers less severely than squared error.
Approaches to Robust Regression
Here’s the plot comparing OLS and Robust
Regression (Huber):
•Green Line (OLS): Gets pulled strongly toward
the outliers, so the fit for normal points worsens.
•Orange Line (Robust): Less sensitive to outliers,
so it stays close to the majority of the data.
This is why robust regression is preferred when you
expect extreme or erroneous points in your
dataset.
Example: Robust linear regression (Huber)
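One possible way to reproduce this kind of example is scikit-learn's HuberRegressor; the sketch below (with made-up data and a few planted outliers) compares its slope with the OLS slope. It is an illustrative sketch, not the exact example from the slides.

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1.0, 30)
y[-3:] += 40.0                      # plant a few large outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)  # Huber loss: quadratic for small errors, linear for large ones

print("OLS slope   :", ols.coef_[0])     # pulled toward the outliers
print("Huber slope :", huber.coef_[0])   # stays close to the true slope of 3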
Ridge Regression
What is Ridge Regression?
Ridge regression, also known as L2 regularization, is a technique used in linear
regression to address the problem of multicollinearity among predictor variables.
Multicollinearity occurs when independent variables in a regression model are highly
correlated, which can lead to unreliable and unstable estimates of regression coefficients.

Ridge regression mitigates this issue by adding a regularization term to the ordinary least
squares (OLS) objective function, which penalizes large coefficients and thus reduces
their variance.
Why Do We Need Ridge Regression?
The ridge estimate of the weights is

w = (XᵀX + λI)⁻¹ Xᵀy

where λ is the ridge parameter and I is the identity matrix.
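A minimal sketch of the closed-form ridge solution above, assuming an illustrative λ = 1 and synthetic data with two nearly identical (collinear) predictors:

import numpy as np

# Two highly correlated predictors (multicollinearity) plus noise
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)        # almost identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

lam = 1.0                                        # ridge parameter (lambda), illustrative value
I = np.eye(X.shape[1])

w_ols   = np.linalg.solve(X.T @ X, X.T @ y)              # ill-conditioned under multicollinearity
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)    # w = (X^T X + lambda*I)^-1 X^T y

print("OLS weights  :", w_ols)
print("Ridge weights:", w_ridge)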
Bayesian Linear Regression
Bayesian Linear Regression
Bayesian linear regression is a statistical method that combines prior knowledge with
observed data to estimate the parameters of a linear model. It uses probability
distributions to represent uncertainty in the parameters and predictions.

Bayesian Linear Regression is an extension of simple linear regression where:


•The parameters (coefficients) are treated as random variables.
•We use probability distributions to represent what we know about them before and after
seeing data.
•It produces predictions with uncertainty instead of a single “best guess.”

Bayesian Linear Regression learns a linear relationship just like normal regression but also
gives uncertainty and confidence by combining prior belief and observed data.
Why Bayesian Regression?
•In ordinary linear regression, we find one best line that fits the data.
•But in reality, we are often uncertain (because of noise, small data, or randomness).
•Bayesian regression says: instead of finding one line, let’s find a distribution of possible
lines (with probabilities).
This way, we quantify uncertainty.
Why do we need Bayesian Linear Regression?
The Bayesian Approach

The starting point is a Gaussian noise model for the targets:

y = wᵀx + ε, with ε ~ N(0, σ²),   or equivalently   p(y | x, w) = N(y | wᵀx, σ²)

This is a clear, testable assumption that connects data to parameters; everything else (likelihood, posterior, predictions) builds on it.
This assumption leads to the formulation of a likelihood, which tells us how well the parameters explain the data. Combined with a prior over w, it gives us a regularization effect and incorporates prior knowledge.
From the noise model, the likelihood can be expressed mathematically as

p(y | X, w, σ²) = ∏i N(yi | wᵀxi, σ²)

and this serves as the expression that links the parameters to the observed data.
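A hedged sketch of the standard conjugate-Gaussian treatment: assuming a zero-mean Gaussian prior with precision α and a known noise precision β (both values below are illustrative), the posterior over the weights is Gaussian with mean m_N and covariance S_N, and predictions come with a variance.

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 15)
t = 0.5 + 2.0 * x + rng.normal(0, 0.2, size=x.shape)   # noisy line, made up for illustration

Phi = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
alpha, beta = 2.0, 25.0                       # prior precision, noise precision (assumed known)

# Posterior over weights: p(w | data) = N(w | m_N, S_N)
S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

print("Posterior mean of [intercept, slope]:", m_N)
print("Posterior covariance:\n", S_N)

# Predictive distribution at a new point x* = 0.5 (mean and variance)
phi_star = np.array([1.0, 0.5])
pred_mean = phi_star @ m_N
pred_var = 1.0 / beta + phi_star @ S_N @ phi_star
print("Predictive mean, variance at x*=0.5:", pred_mean, pred_var)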
Linear Model for Classification :
Discriminant Functions
1.What is Classification?
•Definition: Classification is the process of assigning an input data point into one of a
set of discrete categories (classes).
•Example:
• Email → {Spam, Not Spam}
• Medical test results → {Healthy, Diseased}
• Handwritten digit → {0,1,2,…,9}

The Goal:
•Given an input vector x (with D features), assign it to one of K classes.
•Classes are usually disjoint → each input belongs to only one class.

Input Space:
•The entire feature space is divided into regions → decision regions.
•Boundaries between these regions are called:
• Decision boundaries (lines/curves in 2D)
• Decision surfaces (planes/hyperplanes in higher dimensions)
Two-Class Discriminant Functions
Two-Class Linear Discriminant Function
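The figure for this slide is not reproduced here, but the two-class linear discriminant it refers to is usually written y(x) = wᵀx + w0, with x assigned to class C1 if y(x) ≥ 0 and to C2 otherwise. A minimal sketch with illustrative weights:

import numpy as np

w = np.array([1.0, -2.0])   # illustrative weight vector
w0 = 0.5                    # illustrative bias

def classify(x):
    """Two-class linear discriminant: y(x) = w.x + w0, the sign decides the class."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

print(classify(np.array([3.0, 1.0])))   # C1 (y = 1.5)
print(classify(np.array([0.0, 2.0])))   # C2 (y = -3.5)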
Multiclass Discriminant Functions
Why Multiclass Discriminant Functions?

Solution: Multiclass Discriminant Functions

Three Approaches
 One-vs-Rest (OvR)
 One-vs-One (OvO)
 Direct Multiclass Discriminant Function
One-vs-Rest (OvR)
•Build K classifiers, one for each class vs. "all
the rest".
•Example: To classify into {C1, C2, C3}, train:
• C1 vs (C2+C3)
• C2 vs (C1+C3)
• C3 vs (C1+C2)
•Problem: Sometimes two classifiers both "vote yes", giving ambiguous regions (shown in green in the figure).

OvR: "Each class builds a wall around itself, but walls can overlap."
One-vs-One (OvO)
•Build classifiers for every pair of classes:
• (C1 vs C2), (C1 vs C3), (C2 vs C3), etc.
•A new point is classified by majority vote.
•There can still be ambiguous regions, but fewer than with OvR.

OvO: "Classes fight pairwise, and the majority wins."
Example
Direct multiclass: “Everyone builds their own wall, and you pick the class with
the strongest wall.”
Summary:
• In OvR / OvO, sometimes we got ambiguous regions where two classifiers give
conflicting votes.
• In Direct Multiclass, each point is assigned only one maximum value → clear, convex
decision regions → no ambiguity.
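A minimal sketch of the direct multiclass approach: one linear function y_k(x) = w_kᵀx + w_k0 per class, and a point is assigned to the class whose discriminant is largest. The weight values are illustrative.

import numpy as np

# One (w_k, w_k0) pair per class; values are illustrative
W = np.array([[ 1.0,  0.0],    # class C1
              [ 0.0,  1.0],    # class C2
              [-1.0, -1.0]])   # class C3
b = np.array([0.0, 0.0, 0.5])

def predict(x):
    scores = W @ x + b              # y_k(x) for k = 1..K
    return int(np.argmax(scores))   # pick the class with the largest discriminant

print(predict(np.array([2.0, 0.5])))    # 0  (class C1 wins)
print(predict(np.array([-2.0, -2.0])))  # 2  (class C3 wins)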
Least Squares for Classification
Discriminant Functions
Least Squares Classification (boundary found by a regression idea)
Fisher’s Linear Discriminant Functions
Fisher’s Linear Discriminant (best separating line found by maximizing class separation)
Perceptron Algorithm
Perceptron Algorithm (boundary adjusted iteratively based on errors).
Probabilistic Generative Models
Probabilistic Generative Models
When the features are continuous, prediction is based on Gaussian class-conditional probability densities combined via Bayes' rule.
Example: Probabilistic Generative Model (Naïve Bayes with Gaussian
Assumption)
We want to classify whether a student will pass or fail an exam based on study
hours.
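A minimal sketch of this generative pass/fail example, assuming made-up study-hours data: class priors and a Gaussian density per class are estimated from the data, and Bayes' rule turns them into P(pass | hours).

import numpy as np

# Hypothetical study hours (feature) and outcomes (1 = pass, 0 = fail)
hours = np.array([1.0, 2.0, 3.0, 4.5, 5.0, 6.0, 7.5, 8.0])
passed = np.array([0,   0,   0,   1,   1,   1,   1,   1 ])

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Class priors and Gaussian parameters estimated from the data (generative model)
params = {}
for c in (0, 1):
    xc = hours[passed == c]
    params[c] = (len(xc) / len(hours), xc.mean(), xc.var())

def predict_pass_probability(x):
    # Bayes' rule: p(c | x) is proportional to p(x | c) * p(c)
    joint = {c: prior * gaussian_pdf(x, mu, var) for c, (prior, mu, var) in params.items()}
    return joint[1] / (joint[0] + joint[1])

print("P(pass | 3.5 hours) =", round(predict_pass_probability(3.5), 3))
print("P(pass | 6.5 hours) =", round(predict_pass_probability(6.5), 3))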
Probabilistic Discriminative Models
Discriminative model
A discriminative model in machine learning is an algorithm
designed to directly learn the decision boundary between different
classes or categories within a dataset.
Probabilistic Discriminative model
Example Problem
(Binary Classification)
We want to predict whether a student passes (1) or fails (0) an exam based on
marks.
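A hedged sketch of the discriminative counterpart using scikit-learn's LogisticRegression; the marks and labels are made up, and the model directly learns p(pass | marks) = sigmoid(w·marks + b) without modelling how the marks themselves are distributed.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical marks and pass/fail labels
marks = np.array([[20], [32], [45], [50], [58], [66], [74], [85]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Discriminative model: directly learns the decision boundary p(pass | marks)
clf = LogisticRegression().fit(marks, passed)

print("P(pass | marks=55) =", clf.predict_proba([[55]])[0, 1])
print("Predicted class for marks=40:", clf.predict([[40]])[0])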
Laplace Approximation
What is Laplace Approximation?
Laplace Approximation is a method to approximate a complicated probability
distribution (usually a posterior in Bayesian inference) with a Gaussian distribution.
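A small numeric sketch of the idea, on an illustrative Gamma-shaped target rather than a real posterior: find the mode of the log-density, take the negative second derivative there as the precision, and read off the approximating Gaussian.

import numpy as np
from scipy.optimize import minimize_scalar

# Unnormalized log-density of a Gamma(shape=3, rate=1) distribution (illustrative target)
def log_p(z):
    return 2.0 * np.log(z) - z          # defined for z > 0

# Step 1: find the mode z0 by maximizing log p(z)
res = minimize_scalar(lambda z: -log_p(z), bounds=(1e-6, 20.0), method="bounded")
z0 = res.x                              # analytically the mode is z0 = 2

# Step 2: curvature at the mode via a numeric second derivative
h = 1e-4
second_deriv = (log_p(z0 + h) - 2 * log_p(z0) + log_p(z0 - h)) / h**2
A = -second_deriv                       # precision of the Gaussian approximation

print("Laplace approximation: N(mean=%.3f, variance=%.3f)" % (z0, 1.0 / A))
# Expected roughly N(2, 2): the mode is 2 and the negative curvature there is 1/2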
Bayesian Logistic Regression
The distribution of the decision boundary in Bayesian Logistic Regression is useful
because it captures uncertainty in predictions, prevents overconfidence, and
supports better risk-aware decisions.
Difference between Linear Regression and Logistic Regression

Linear Regression | Logistic Regression
Used to solve regression problems | Used to solve classification problems
The response variable is continuous in nature | The response variable is categorical in nature
It helps estimate the dependent variable when there is a change in the independent variable | It helps calculate the probability of a particular event taking place
The fitted model is a straight line | The fitted model is an S-curve (S = Sigmoid)

Kernel Functions
Kernel Functions – Overview
The kernels covered in this overview are:
1. RBF (Radial Basis Function) kernels
2. Kernels for comparing documents
3. Mercer (positive definite) kernels
4. Linear kernels
5. Matern kernels
6. String kernels
7. Pyramid match kernels
8. Kernels derived from probabilistic generative models

A Mercer kernel is a "valid similarity function" that machine learning algorithms can use. It guarantees that even if we map data into some hidden higher-dimensional space, the math will still work properly. The condition says: for a function to be a Mercer kernel, the kernel (Gram) matrix it produces must always be positive semi-definite (PSD).

Pyramid Match Kernel (PMK) score interpretation (for the example shown):
0 --- no overlap, the sets are entirely different
< 10 --- moderate similarity
> 10 --- high similarity

For kernels derived from probabilistic generative models, a simplified formula is κ(x, x′) = g(x)ᵀ g(x′), where g(x) = ∇θ log p(x | θ) is the Fisher score of x.
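As an illustration of the Mercer condition, the sketch below builds an RBF kernel (Gram) matrix on a few made-up points and checks empirically that its eigenvalues are non-negative, i.e. that the matrix is positive semi-definite.

import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])   # illustrative points

# Build the kernel (Gram) matrix K with K[i, j] = k(x_i, x_j)
n = len(X)
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

eigvals = np.linalg.eigvalsh(K)
print("Eigenvalues of K:", np.round(eigvals, 4))
print("All non-negative (positive semi-definite)?", bool(np.all(eigvals >= -1e-10)))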
Using Kernels in GLMs
Example:

Normal logistic regression (GLM form): p(y = 1 | x) = σ(wᵀx)
Kernel logistic regression: p(y = 1 | x) = σ( Σi αi κ(x, xi) )

So GLMs + kernels = non-linear decision boundaries.
This simple graph explains why we use kernels in GLMs.
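A hedged sketch of one common way to kernelize a GLM: replace the raw inputs with RBF similarities to the training points and fit ordinary logistic regression on those kernel features. The concentric-rings dataset and the gamma value are illustrative.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# Plain GLM: linear logistic regression on the raw 2-D features (cannot separate the rings)
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# Kernelized GLM: the features are RBF similarities to every training point
K = rbf_kernel(X, X, gamma=2.0)
kernel_acc = LogisticRegression(max_iter=1000).fit(K, y).score(K, y)

print("Linear logistic regression accuracy :", linear_acc)   # roughly chance level
print("Kernel logistic regression accuracy :", kernel_acc)   # near 1.0 on this data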
Kernel Trick
Kernel Trick
The kernel trick is the idea that:
Instead of mapping data into high-dimensional space (which is
costly), we directly compute the inner product in that space using a
kernel function.
Consider the Example
•Both methods give the same result.
•Kernel Trick avoids the hard work of computing explicit higher-dimensional vectors.
•Instead, we compute directly in the original space using a kernel function.
In simple words: Kernel Trick = Shortcut to high dimensions without ever going there!
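A minimal sketch of the shortcut: for 2-D inputs, the explicit degree-2 feature map φ(x) = (x1², x2², √2·x1·x2) and the polynomial kernel k(x, z) = (x·z)² give exactly the same inner product, but the kernel never leaves the original space. The two vectors are made up.

import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel k(x, z) = (x . z)^2, computed in the original 2-D space."""
    return (x @ z) ** 2

x = np.array([2.0, 3.0])
z = np.array([1.0, 4.0])

print("Explicit mapping then dot product:", phi(x) @ phi(z))   # 196.0
print("Kernel trick (no mapping needed) :", poly_kernel(x, z)) # 196.0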
Support Vector Machine (SVM)
What is a Support Vector Machine (SVM)?
A Support Vector Machine (SVM) is a machine learning algorithm used for classification and
regression. This finds the best line (or hyperplane) to separate data into groups, maximizing
the distance between the closest points (support vectors) of each group. It can handle complex
data using kernels to transform it into higher dimensions. In short, SVM helps classify data
effectively.

Types of Support Vector Machine (SVM)


Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier. In such cases, we use advanced techniques like the kernel trick to classify the data.
BASIC TERMS OF
SUPPORT VECTOR MACHINE
(SVM)
What is Hyper Plane
Hyperplanes are decision boundaries that help classify the data
points. Data points falling on either side of the hyperplane can
be attributed to different classes.
There can be multiple lines/decision boundaries to segregate the classes, but we need to
find out the best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.

The dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
What is Support Vector

The data points or vectors that are the closest to the hyperplane, and which affect the position of the hyperplane, are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.

Support Vectors are simply the coordinates of individual observations.
What is Margin

The width that the boundary could be increased by before hitting a data point.

A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes.
Hard Margin & Soft Margin
When the data is linearly separable, and we don’t
want to have any misclassifications, we use SVM
with a Hard margin.

When a linear boundary is not feasible, or we want to allow some misclassifications in the hope of achieving better generalization, we can opt for a soft margin for our classifier.

Sometimes, the data is linearly separable, but the margin is so small that the
model becomes prone to overfitting or being too sensitive to outliers. Also, in
this case, we can opt for a larger margin by using soft margin SVM in order
to help the model generalize better.
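A hedged sketch of the hard-vs-soft distinction using scikit-learn's SVC, where the C parameter plays this role: a very large C approximates a hard margin, while a small C gives a softer margin that tolerates some misclassification. The blob data and C values are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Two roughly separable blobs
X = np.vstack([rng.normal([2, 2], 0.8, size=(40, 2)),
               rng.normal([-2, -2], 0.8, size=(40, 2))])
y = np.array([1] * 40 + [-1] * 40)

hardish = SVC(kernel="linear", C=1e6).fit(X, y)   # large C: (almost) hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)      # small C: soft margin, wider tolerance

print("Support vectors (C=1e6):", len(hardish.support_vectors_))
print("Support vectors (C=0.1):", len(soft.support_vectors_))  # usually more with a softer margin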
How does SVM Work?
At first approximation, SVM finds a separating line (or hyperplane) between
data of two classes. SVM is an algorithm that takes the data as an input and
outputs a line that separates those classes if possible.

Suppose we have a dataset as shown and we need to classify the red rectangles from the blue ellipses. So our task is to find an ideal line that separates this dataset in two classes (say red and blue).
How does SVM Work (Contd.)

Not a big task, right…?

But, as we notice, there isn’t a unique line that does the job. In fact, there are infinitely many lines that can separate these two classes. So how does SVM find the ideal one?
Let’s take some probable candidates and figure it out ourselves.
How does SVM Work (Contd.)
We have two candidates here, the green colored line
and the yellow colored line. Which line according to
you best separates the data?

The green line in the image above is quite close to the red class. Though it classifies the current dataset, it is not a generalized line.

If we select the yellow line, then it is visually quite intuitive in this case that the yellow line classifies better. But we need something concrete to fix our line.
How does SVM Work (Contd.)

According to the SVM algorithm, we find the points closest to the line from both the classes. These points are called support vectors. Now, we compute the distance between the line and the support vectors. This distance is called the margin. Our goal is to maximize the margin. The hyperplane for which the margin is maximum is the optimal hyperplane.
How does SVM Work (Contd.)

Thumb Rule to identify the Right Hyperplane
• Select the hyper-plane which segregates the two classes better.
• Maximize the distance between the nearest data point (of either class) and the hyper-plane. This distance is called the Margin.
How does SVM Work (Contd.)

Identify the right hyper-plane (Scenario-1):


• Here, we have three hyper-planes (A, B, and C). Now, identify
the right hyper-plane to classify stars and circles.

• You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane which segregates the two classes better”. In this scenario, hyper-plane “B” has performed this job excellently.
How does SVM Work (Contd.)

Identify the right hyper-plane (Scenario-2):
Here, we have three hyper-planes (A, B, and C) and all are segregating the classes well. Now, how can we identify the right hyper-plane?

• You need to remember a thumb rule to identify the right hyper-plane: maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us to decide the right hyper-plane. This distance is called the Margin.
How does SVM Work (Contd.)

You can see that the margin for hyper-plane C is high as compared to both A and B. Hence, we name the right hyper-plane as C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of misclassification.
How does SVM Work (Contd.)

Identify the right hyper-plane (Scenario-3):
• Hint: Use the rules as discussed in the previous section to identify the right hyper-plane.

• Some of you may have selected hyper-plane B as it has a higher margin compared to A. But, here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin. Here, hyper-plane B has a classification error and A has classified all points correctly. Therefore, the right hyper-plane is A.
How does SVM Work (Contd.)

Can we classify two classes (Scenario-4)?

• Here, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
How does SVM Work (Contd.)
Find the hyper-plane to segregate two classes (Scenario-5):

• In this scenario, we can’t have a linear hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have only looked at linear hyper-planes.
How does SVM Work (Contd.)
Let’s consider a slightly more complex dataset, which is not linearly separable.

This data is clearly not linearly separable: we cannot draw a straight line that can classify this data. But this data can be converted to linearly separable data in a higher dimension. Let’s add one more dimension and call it the z-axis, and let the coordinates on the z-axis be governed by the constraint
z = x² + y²
Basically, the z coordinate is the square of the distance of the point from the origin. Let’s plot the data against the z-axis.
How does SVM Work (Contd.)
Now the data is clearly linearly separable. Let the purple line separating the data in the higher dimension be z = k, where k is a constant. Since z = x² + y², we get x² + y² = k, which is the equation of a circle. So we can project this linear separator in the higher dimension back into the original dimensions using this transformation.

Remember: this feature is not added manually. This is done by the kernel trick, as the sketch below illustrates.
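A hedged sketch of exactly this situation: concentric-ring data that no straight line can separate, handled by an RBF-kernel SVM without manually adding the z = x² + y² feature. The dataset comes from scikit-learn's make_circles and the comparison is illustrative.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # a straight line cannot separate the rings
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel trick: implicit higher-dimensional mapping

print("Linear kernel training accuracy:", linear_svm.score(X, y))  # close to 0.5
print("RBF kernel training accuracy   :", rbf_svm.score(X, y))     # close to 1.0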
Kernel Method
Kernel methods represent the techniques that are used to deal with linearly inseparable data or non-linear
data set shown figure below. The idea is to create nonlinear combinations of the original features to
project them onto a higher-dimensional space via a mapping function, where the data becomes linearly
separable. In the diagram given below, the two-dimensional dataset (X1, X2) is projected into a new three-
dimensional feature space (Z1, Z2, Z3) where the classes become separable.
Hinge Loss Function
Hinge Loss is a specific type of loss function primarily used for classification tasks, especially in Support Vector
Machines (SVMs). It measures how well a model’s predictions align with the actual labels and encourages
predictions that are not only correct but confidently separated by a margin.
Hinge loss penalizes predictions that are:
1.Incorrectly classified.
2.Correctly classified but too close to the decision boundary (within a “margin”).
It is designed to create a “margin” around the decision boundary to improve the robustness of the classifier.
Formula
The hinge loss for a single data point is given by:

L(y, f(x)) = max(0, 1 − y·f(x))

where
y – the actual class (−1 or 1)
f(x) – the output of the classifier for the data point
Hinge Loss Function

Case 1: Correct and confident classification, y·f(x) ≥ 1 (blue) → zero loss.
Case 2: Correct but not confident classification, 0 < y·f(x) < 1 (faded blue) → small positive loss.
Case 3: Incorrect classification, y·f(x) ≤ 0 (red) → loss of at least 1.
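A minimal sketch that evaluates the hinge loss for the three cases above; the y and f(x) values are made up.

import numpy as np

def hinge_loss(y, fx):
    """Hinge loss: max(0, 1 - y * f(x)), with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

# Case 1: correct and confident (y*f(x) >= 1)            -> loss 0
print(hinge_loss(+1, 2.5))    # 0.0
# Case 2: correct but inside the margin (0 < y*f(x) < 1) -> small positive loss
print(hinge_loss(+1, 0.4))    # 0.6
# Case 3: incorrect classification (y*f(x) <= 0)         -> loss >= 1
print(hinge_loss(-1, 1.2))    # 2.2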
Advantages of SVM
• Training of the model is relatively easy
• The model scales relatively well to high dimensional data
• SVM is a useful alternative to neural networks
• Trade-off amongst classifier complexity and error can be
controlled explicitly
• It is useful for both Linearly Separable and Non-linearly
Separable data
• Assured Optimality: The solution is guaranteed to be the
global minimum due to the nature of Convex Optimization
Disadvantages of SVM
• Picking the right kernel and its parameters can be computationally intensive
• In Natural Language Processing (NLP), a structured representation of text yields better performance. However, standard SVMs cannot accommodate such structures (e.g., word embeddings)
Applications of SVM
Geostatistics: It is a branch of statistics concentrating on spatial or spatiotemporal datasets. It was originally created to
predict the probability distributions of ore grading at mining operations. Now it is applied in diverse disciplines
including petroleum geology, hydrogeology, hydrology, meteorology, oceanography, geochemistry, geometallurgy,
geography, forestry, environmental control, landscape ecology, soil science, and agriculture (specifically in precision
farming).
Inverse distance weighting: Type of deterministic method for multivariate interpolation with a known scattered set of
points. The values assigned to unknown points are calculated with a weighted average of the values existing at the
known points.
3D Reconstruction: Process of capturing the shape and appearance of real objects.
Bioinformatics: An interdisciplinary field that involves molecular biology, genetics, computer science, mathematics and
statistics. Software tools and methods are developed to understand biological data better.
Chemoinformatics: Application of computational and informational techniques over the field of chemistry to solve a
wide range of problems.
Information Extraction: Abbreviated as IE, it is a method of automated extraction or retrieval of structured information from unstructured and semi-structured text documents, databases and websites.
Handwriting Recognition: Abbreviated as HWR, it is the ability of a computer system to receive and interpret handwritten input comprehensibly from different sources such as paper documents, photographs, touch-screens and other devices.
