
Assignment: 2

Subject: Machine Learning

Submitted by: Bidya Sagar Lekhi (Roll: 10)
Submitted to: Er. Pradip Sharma
1. Differentiate between regression and classification with suitable real-world examples.
Ans:- Regression and classification are both types of supervised
learning, but they differ in the type of output they predict.

Feature | Regression | Classification
Objective | Predict continuous numeric values | Predict discrete class labels
Output Type | Real numbers (e.g., 25.3, 89.7) | Categories (e.g., "Spam", "Not Spam")
Type of Problem | Quantitative prediction | Qualitative (categorical) prediction
Examples | Predicting house prices; estimating temperature | Email spam detection; handwritten digit recognition
Algorithms Used | Linear Regression; Decision Tree Regressor; SVR | Logistic Regression; Decision Tree Classifier; SVM
Evaluation Metrics | Mean Squared Error (MSE); R² Score; MAE | Accuracy; Precision/Recall; F1 Score

Real-World Example of Regression:

• Problem: Predicting the price of a house
• Input Features: Size, number of bedrooms, location, age
• Output: Price (e.g., $250,000)
• Model: Linear Regression
Real-World Example of Classification:
• Problem: Classifying an email as spam or not spam
• Input Features: Frequency of certain keywords, sender address,
time of sending
• Output: "Spam" or "Not Spam"
• Model: Naive Bayes or Logistic Regression

2. Explain the workflow of a supervised learning model. What are the main components?

Ans:- Workflow of a Supervised Learning Model & Its Main Components

Supervised learning is a machine learning technique where a model is trained on a labeled dataset — that is, each input comes with a known output (target). The goal is for the model to learn the mapping from inputs to outputs and generalize well to unseen data.

Step-by-Step Workflow of a Supervised Learning Model

1. Problem Definition

Clearly identify the task:

• Is it classification (e.g., spam vs. not spam)?


• Or regression (e.g., predicting house prices)?

2. Data Collection

Gather relevant labeled data from sensors, databases, APIs, etc.

• Each data point should have both input features and output label.

3. Data Preprocessing

Prepare the data for model training:

• Handle missing values, outliers


• Normalize or scale features
• Encode categorical variables (e.g., one-hot encoding)
• Split the data: typically into training, validation, and test sets

4. Feature Engineering

Extract or select the most meaningful inputs:

• Create new features (e.g., combining or transforming existing ones)
• Reduce dimensionality if necessary (e.g., PCA)

5. Model Selection

Choose an appropriate ML algorithm:

• Classification: Logistic Regression, Decision Tree, SVM, Random Forest
• Regression: Linear Regression, Ridge/Lasso, SVR, XGBoost

6. Model Training

Train the model on the training set:

• The model learns patterns by minimizing a loss function
• Optimization methods like Gradient Descent adjust weights to improve performance

7. Model Evaluation

Evaluate performance using:

• A validation set (for tuning)


• A test set (for final performance check)
• Use metrics such as:
o Classification: Accuracy, Precision, Recall, F1-Score
o Regression: Mean Squared Error (MSE), R² Score

8. Hyperparameter Tuning

Fine-tune parameters like learning rate, tree depth, etc., using:

• Grid Search
• Random Search
• Cross-validation

9. Model Deployment
Deploy the model into a production system or application, e.g.:

• Web app
• Mobile device
• Embedded system (e.g., IoT device)

10. Monitoring & Maintenance

Track model performance over time:

• Detect data drift or concept drift


• Periodically retrain or update the model

Main Components of Supervised Learning

Component | Description
Labeled Dataset | Input features and corresponding target labels
Algorithm/Model | Method used to learn the mapping (e.g., SVM, neural network)
Loss Function | Measures how far predictions are from true labels
Optimizer | Updates model parameters to reduce loss (e.g., SGD, Adam)
Evaluation Metrics | Assess how well the model performs (e.g., Accuracy, MSE)

Example: Predicting House Prices

Step | Example
Input Features | Area, number of rooms, location, age
Output (Label) | House price (e.g., $250,000)
Model | Linear Regression
Loss Function | Mean Squared Error (MSE)
Evaluation Metric | R² Score
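
To make the workflow concrete, here is a minimal sketch of these steps in Python with scikit-learn. The synthetic house-price data, the scaling step, and the split ratio are illustrative assumptions, not part of the assignment.

```python
# Minimal supervised-learning workflow: data -> split -> scale -> train -> evaluate
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(500, 3500, size=(200, 1))              # house area (sq. ft), made-up data
y = 50 * X[:, 0] + 20000 + rng.normal(0, 10000, 200)   # price with noise

# Split (step 3), scale features, train (step 6), evaluate (step 7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```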

3. Describe the working of simple linear regression. Derive the formula for the regression line using least squares.

Ans:- Simple Linear Regression: Working and Derivation Using Least Squares

What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the linear relationship between a dependent variable (Y) and a single independent variable (X).

The goal is to find the best-fitting straight line (called the regression
line) that minimizes the error between predicted and actual values.

Regression Line Equation:

y = mx + c

Or, more generally:

\hat{y} = \beta_0 + \beta_1 x

Where:

• ŷ: Predicted value
• x: Independent variable
• β₀: Intercept (value of y when x = 0)
• β₁: Slope of the line (change in y for a unit change in x)

Objective: Least Squares Method

We want to find values of β₀ and β₁ such that the sum of squared errors (residuals) between the actual and predicted values is minimized.

Loss Function (Cost Function):

\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Where:

• yᵢ: Actual output
• ŷᵢ = β₀ + β₁xᵢ: Predicted output

Derivation of the Least Squares Estimates

To minimize the SSE, we take partial derivatives with respect to β₀ and β₁, and set them to 0.

Step 1: Compute Means

Let:

\bar{x} = \frac{1}{n} \sum x_i, \quad \bar{y} = \frac{1}{n} \sum y_i

Step 2: Solve for Slope β₁

\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

Or equivalently:

\beta_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}

Step 3: Solve for Intercept β₀

\beta_0 = \bar{y} - \beta_1 \bar{x}

Final Regression Line:

\hat{y} = \beta_0 + \beta_1 x

This line minimizes the total squared vertical distance between the observed points and the line.

Example:

x | y
1 | 2
2 | 3
3 | 5
4 | 4
5 | 6

You can compute:

• x̄ = 3, ȳ = 4
• Then find β₁ and β₀
• Construct the best-fit line (see the sketch below)
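
A minimal NumPy sketch of this worked example, applying the formulas above directly to the table:

```python
# Slope and intercept for the 5-point table via the least-squares formulas
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 5, 4, 6], dtype=float)

x_bar, y_bar = x.mean(), y.mean()                              # x̄ = 3, ȳ = 4
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

print(beta_1, beta_0)   # 0.9 and 1.3, so the best-fit line is ŷ = 1.3 + 0.9x
```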

Applications:

• Predicting house prices based on area


• Estimating sales based on advertising spend
• Forecasting temperature based on time

4. Explain multiple linear regression. How is it different from simple linear regression?

Ans:- Multiple Linear Regression (MLR) is an extension of simple linear regression. It models the relationship between a dependent variable (Y) and two or more independent variables (X₁, X₂, ..., Xₙ).

Equation of MLR:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n

Where:

• ŷ: Predicted value (dependent variable)
• x₁, x₂, ..., xₙ: Independent (input) variables
• β₀: Intercept
• β₁, β₂, ..., βₙ: Coefficients showing the effect of each feature on y

How It Works:

• The model learns coefficients β₁ to βₙ that best fit the data by minimizing the sum of squared errors (SSE) between actual and predicted values.
• This is solved using matrix algebra or optimization methods like gradient descent.

Matrix Form of MLR:

Let:

• X: Feature matrix (including a column of 1s for the intercept)
• y: Output vector
• β: Coefficient vector

Then the model is:

\hat{\mathbf{y}} = \mathbf{X} \boldsymbol{\beta}

Using least squares, the best-fit solution is:

\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
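
As a minimal NumPy sketch of this normal equation, the snippet below solves for β on made-up house data (the feature values and prices are illustrative assumptions). Note that solving the linear system is preferred over forming the inverse explicitly.

```python
# Least-squares coefficients via the normal equation X^T X beta = X^T y
import numpy as np

features = np.array([[1500, 3, 10],        # area, bedrooms, age (made-up values)
                     [2000, 4, 5],
                     [1200, 2, 20],
                     [1800, 3, 8],
                     [1600, 3, 15]], dtype=float)
y = np.array([300_000, 420_000, 220_000, 360_000, 310_000], dtype=float)

X = np.column_stack([np.ones(len(features)), features])   # prepend column of 1s for beta_0
beta = np.linalg.solve(X.T @ X, X.T @ y)                   # solves (X^T X) beta = X^T y

print("Coefficients:", beta)
print("Fitted prices:", X @ beta)
```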

Example:

Goal: Predict a house price based on:

• x₁: Area (sq. ft)
• x₂: Number of bedrooms
• x₃: Age of the house

Model:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3

Difference Between Simple and Multiple Linear Regression

Feature | Simple Linear Regression | Multiple Linear Regression
Number of independent variables | 1 | 2 or more
Equation form | ŷ = β₀ + β₁x | ŷ = β₀ + β₁x₁ + β₂x₂ + …
Model complexity | Simple, easy to visualize | More complex, harder to visualize
Use case | When one factor is enough to predict the output | When multiple factors influence the output
Visualization | 2D plot (line on a plane) | 3D or higher-dimensional space

5. What is polynomial regression? How does it handle non-linear data? Give an example.

Ans:- Polynomial Regression is a type of regression that models the non-linear relationship between the independent variable x and the dependent variable y by introducing polynomial terms of the input.

It is still considered a linear model in terms of parameters, even though it fits a non-linear curve to the data.

Equation of Polynomial Regression:

For a degree-n polynomial:

\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + \beta_n x^n

• x², x³, …, xⁿ are non-linear transformations of the input
• β₀, β₁, ..., βₙ are the model coefficients

How It Handles Non-Linear Data:

• In simple linear regression, the model tries to fit a straight line, which may not be adequate if the data shows curvature.
• Polynomial regression introduces powers of x, allowing the model to bend and curve, adapting to more complex relationships.

Example:

Suppose you're modeling the growth of a plant over time:

Days (x) | Height (cm) (y)
1 | 1.5
2 | 3.0
3 | 4.5
4 | 9.0
5 | 15.0

• A straight line will underfit this growth pattern.
• A quadratic polynomial (e.g., y = β₀ + β₁x + β₂x²) may fit well.

Visual Comparison:

Model | Fit on Non-Linear Data
Linear Regression | Poor fit
Polynomial Regression | Good fit

Polynomial regression draws a curved line that follows the pattern of the data.

Implementation (Conceptually):

Convert the input feature x into a feature vector:

x \rightarrow [x, x^2, x^3, ..., x^n]

Then apply linear regression on these transformed features.
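
A minimal scikit-learn sketch of exactly this idea on the plant-growth table above; the degree-2 choice and pipeline are illustrative assumptions.

```python
# Polynomial regression = polynomial feature expansion + ordinary linear regression
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

days = np.array([[1], [2], [3], [4], [5]], dtype=float)
height = np.array([1.5, 3.0, 4.5, 9.0, 15.0])

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(days, height)

print(model.predict(np.array([[6.0]])))   # extrapolated height on day 6
```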

Caution: Overfitting

• Using a high-degree polynomial can lead to overfitting (the model fits noise, not the pattern).
• Choose the degree wisely (using validation or cross-validation).

6. Explain the concept of regularization in regression. Why is it needed?

Ans:- Regularization is a technique used in regression (and other machine learning models) to prevent overfitting by adding a penalty term to the loss function.

It helps ensure the model generalizes well to unseen data by discouraging overly complex models.

Why Is Regularization Needed?

• In regression models (especially multiple or polynomial regression), the model may fit the training data too closely.
• This leads to overfitting, where the model performs well on training data but poorly on test data.
• Regularization controls model complexity by penalizing large coefficients.

How It Works:

The original loss function in linear regression is the Sum of Squared Errors (SSE):

\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Regularization adds a penalty term to this:

\text{Regularized Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \cdot \text{Penalty}

Where:

• λ is the regularization parameter (controls the strength of the penalty)
• Penalty depends on the type of regularization

Types of Regularization

1. L1 Regularization (Lasso Regression)

• Penalty: \sum |\beta_j|
• Encourages sparse models (some coefficients become zero)
• Useful for feature selection

2. L2 Regularization (Ridge Regression)

• Penalty: \sum \beta_j^2
• Encourages small coefficients, but none are exactly zero
• Useful when all features are important but should be shrunk

3. Elastic Net

• Combines L1 and L2 penalties: \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2
• Balances between Lasso and Ridge

Effect of Regularization

Without Regularization | With Regularization
Large coefficients | Smaller, controlled coefficients
High risk of overfitting | Better generalization
High model complexity | Simpler model

7. Differentiate between Ridge and Lasso regression. When would you prefer one over the other?

Ans:- Both Ridge and Lasso regression are regularization techniques used to prevent overfitting in linear regression models by adding a penalty term to the loss function. They differ in how they penalize model coefficients.

Key Differences:

Feature | Ridge Regression (L2) | Lasso Regression (L1)
Penalty Term | λ Σ βⱼ² (squares of coefficients) | λ Σ |βⱼ| (absolute values of coefficients)
Effect on Coefficients | Shrinks them toward zero | Shrinks some to zero (can eliminate features)
Feature Selection | No — all features kept | Yes — automatic feature selection
Best Use Case | When all features are important, collinear | When only a few features are important (sparse)
Handling Multicollinearity | Effective (distributes weights among correlated features) | Selects one among many correlated features
Solution | Always has a unique solution | May not have a unique solution (if p > n)

Loss Functions:

Ridge:

\text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2

Lasso:

\text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j|

When to Use Which?


Use Ridge When:

• You expect all the features contribute to the output.


• There’s multicollinearity (high correlation among predictors).
• Your goal is to shrink coefficients, not eliminate them.

Use Lasso When:

• You believe only a subset of features are important.


• You want a sparse model (some features excluded).
• You’re doing automatic feature selection.

Elastic Net: Hybrid Option

If you're unsure which to use:

• Elastic Net combines Lasso and Ridge:

\text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2
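
A minimal scikit-learn sketch contrasting the two penalties on synthetic data; the dataset, alpha values, and feature count are illustrative assumptions. Lasso drives the uninformative coefficients exactly to zero, while Ridge only shrinks them.

```python
# Ridge vs. Lasso on data where only 2 of 10 features matter
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)   # 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))   # all shrunk, but non-zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))   # most are exactly zero
```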

8. Explain the bias-variance tradeoff with the help of diagrams and examples.

Ans:- The bias-variance tradeoff is a fundamental concept in machine learning that describes the balance between underfitting and overfitting.

Definitions:

Term | Meaning
Bias | Error due to wrong assumptions in the learning algorithm (underfitting)
Variance | Error due to too much sensitivity to the training data (overfitting)
Noise | Irreducible error due to random variations or unmeasured variables

Tradeoff Concept:
• High bias: Model is too simple — misses patterns →
Underfitting
• High variance: Model is too complex — captures noise as
patterns → Overfitting
• The goal is to find a sweet spot with low bias and low variance

Graphical Representation:

Imagine a graph of model complexity vs. error:


|\

| \

| \ ← Total Error

| \ (Bias² + Variance + Noise)

Error | \_

| \__

| / ← Variance

| /

| /

|__/ ← Bias²

+-----------------

Model Complexity →

• At low complexity: High bias, low variance


• At high complexity: Low bias, high variance

Examples:

Example 1: Polynomial Regression

Degree of Polynomial | Bias | Variance | Fit Type
1 (Linear) | High | Low | Underfitting
10+ | Low | High | Overfitting
2–4 | Balanced | Balanced | Good Fit

Example 2: House Price Prediction

• High bias: Predicts all houses have the same price — doesn't
learn location or size impact.
• High variance: Memorizes training data — fails on new
neighborhoods.

Bias-Variance Illustrated (Target Analogy):

Imagine trying to hit the bullseye with arrows:

Case | Description
High Bias | Arrows far from center, but close together (wrong aim)
High Variance | Arrows scattered, some near center, some far (inconsistent aim)
Balanced | Arrows clustered around the center (accurate and consistent)

Real Use in Model Selection:

• Linear models → Low variance, high bias
• Complex models (deep trees, neural nets) → High variance, low bias
• Use cross-validation to detect and reduce overfitting

How to Control Bias and Variance:

Method | Effect
Increase model complexity | ↓ Bias, ↑ Variance
Regularization (Ridge/Lasso) | ↑ Bias, ↓ Variance
More training data | ↓ Variance
Feature engineering | ↓ Bias and/or ↓ Variance
Cross-validation | Helps find the best bias-variance balance

9. Describe the working of Support Vector Regression (SVR). How is it different from traditional linear regression?

Ans:- Support Vector Regression (SVR) is a regression technique based on the principles of Support Vector Machines (SVM). It aims to find a function that approximates the data points within a certain margin (ε), rather than minimizing just the squared error as in traditional linear regression.

Goal of SVR:

To find a line (or curve) such that:

• Most data points lie within an ε margin from the predicted line.
• The model is as flat (simple) as possible.

Mathematical Form:

Given training data (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), SVR tries to find a function:

f(x) = w^T x + b

That minimizes:

\frac{1}{2} \|w\|^2

Subject to:

|y_i - f(x_i)| \leq \varepsilon

For those data points outside the ε-margin, slack variables ξᵢ, ξᵢ* are introduced and a penalty term is added:

\text{Objective:} \quad \min \frac{1}{2} \|w\|^2 + C \sum (\xi_i + \xi_i^*)

Where:

• C is a regularization parameter controlling the tradeoff between flatness and tolerance to outliers.
• ε defines the width of the epsilon-insensitive tube.

Key Concepts:

• Epsilon-insensitive loss: No penalty is given for errors within the ε margin.
• Support vectors: Only data points outside the ε tube influence the model.
• Kernel trick: SVR can be extended to non-linear regression using kernel functions (like RBF, polynomial).

SVR vs. Traditional Linear Regression

Feature | Linear Regression | Support Vector Regression (SVR)
Loss Function | Minimizes squared error | Minimizes ε-insensitive loss
Robustness to outliers | Sensitive to outliers | More robust due to margin tolerance
Focus | Minimizing overall error | Fitting within a margin, ignoring small deviations
Model complexity | Single straight line | Flexible: can be linear or non-linear (via kernels)
Support vectors | All data influence the model | Only points outside the margin affect the model
Generalization | May overfit if complex | Often better generalization with proper tuning

Visual Example:

• In linear regression, the line fits all data points by minimizing the total squared error.
• In SVR, the model allows a "tube" (ε margin) and fits a flat function through it, ignoring errors within the margin and penalizing only large deviations (see the sketch below).
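
A minimal scikit-learn sketch of this contrast; the sine-shaped data, C, and ε values are illustrative assumptions.

```python
# Epsilon-SVR vs. ordinary linear regression on noisy non-linear data
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)   # fits within an epsilon tube
lin = LinearRegression().fit(X, y)                        # minimizes total squared error

print("Support vectors used by SVR:", len(svr.support_))  # only points outside the tube
print("SVR prediction at x=2.5:   ", svr.predict([[2.5]]))
print("Linear prediction at x=2.5:", lin.predict([[2.5]]))
```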

10. Explain logistic regression. Derive the sigmoid function and describe its significance.

Ans:- Logistic Regression is a classification algorithm used to predict the probability that a given input belongs to a certain class — typically binary (0 or 1, True or False).

Unlike linear regression (which predicts continuous values), logistic regression predicts probabilities and maps them to binary outcomes.

Use Cases:

• Email spam detection (spam vs. not spam)


• Tumor classification (malignant vs. benign)
• Customer churn prediction (churn vs. stay)

Model Equation:

Logistic regression models the probability that the output y = 1 given input x, using the sigmoid (logistic) function:

P(y = 1 \mid x) = \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}

Where:

z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n = \mathbf{w}^T \mathbf{x}

• w: weights
• x: input features
• σ(z): sigmoid function that maps any real number to (0, 1)

Sigmoid Function: Derivation

The sigmoid function is defined as:

\sigma(z) = \frac{1}{1 + e^{-z}}

This arises from modeling the log-odds (logit) of the probability:

\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = z

Solving for p:

\log\left(\frac{p}{1 - p}\right) = z \;\Rightarrow\; \frac{p}{1 - p} = e^z \;\Rightarrow\; p = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} = \sigma(z)

So, the logistic function is the inverse of the logit.

Sigmoid Function Properties:

Feature | Value
Range | 0 < σ(z) < 1
σ(0) | 0.5
z → +∞ | σ(z) → 1
z → −∞ | σ(z) → 0
S-shaped curve | Useful for probabilistic outputs

Classification Decision Rule:

\hat{y} = \begin{cases} 1 & \text{if } \sigma(z) \geq 0.5 \\ 0 & \text{if } \sigma(z) < 0.5 \end{cases}

You can shift the threshold (e.g., to 0.7 or 0.3) based on application needs.
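
A minimal NumPy sketch of the sigmoid and the 0.5-threshold rule; the sample z values are illustrative.

```python
# Sigmoid maps any real score z to a probability in (0, 1); threshold at 0.5 to classify
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
y_hat = (p >= 0.5).astype(int)      # decision rule

print(np.round(p, 3))   # [0.018 0.269 0.5 0.731 0.982]
print(y_hat)            # [0 0 1 1 1]
```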

Loss Function in Logistic Regression:

Logistic regression uses cross-entropy loss instead of squared error:

\text{Loss} = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]

This penalizes wrong confident predictions more heavily.

Significance of the Sigmoid Function:

• Converts the linear output into a probability
• Ensures the prediction lies between 0 and 1
• Smooth and differentiable → suitable for gradient descent
• Centered at 0: σ(0) = 0.5

11. Differentiate between binary and multi-class classification in logistic regression. How is multi-class handled?

Ans:- Overview: Logistic Regression Types

Type | Description
Binary Classification | Predicts between two classes (e.g., spam vs. not spam)
Multi-Class Classification | Predicts among three or more classes (e.g., cat, dog, bird)

1. Binary Classification

• Target classes: 0 or 1
• Model:

P(y = 1 \mid x) = \frac{1}{1 + e^{-z}} = \sigma(z), \quad \text{where } z = \mathbf{w}^T \mathbf{x}

• Decision: Predict 1 if ŷ ≥ 0.5, else 0

Standard logistic regression handles binary classification directly.

2. Multi-Class Classification

• Target classes: more than 2 (e.g., 0, 1, 2, ..., k)
• Logistic regression doesn't support this natively but is extended using one of two strategies:

A. One-vs-Rest (OvR) / One-vs-All

• Train k separate binary classifiers, each for one class vs. the rest.
• For each class i, the model learns:

P(y = i \mid x) = \frac{1}{1 + e^{-z_i}}

• The model selects the class with the highest confidence:

\hat{y} = \arg\max_i P(y = i \mid x)

Simple and interpretable; works well when classes are imbalanced or separable.

B. SoftMax Regression (Multinomial Logistic Regression)

• Generalizes logistic regression by using the SoftMax function for k classes:

P(y = i \mid x) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} \quad \text{for } i = 1, ..., k

• Here, zᵢ = wᵢᵀx is the score for class i
• The cross-entropy loss is used for training.

Preferred when all classes are mutually exclusive and equally important.
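
A minimal NumPy sketch of the SoftMax computation for a single example; the three class scores are illustrative assumptions.

```python
# SoftMax turns a vector of class scores z_i into probabilities that sum to 1
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical z_i = w_i^T x for three classes
probs = softmax(scores)

print(np.round(probs, 3))    # ≈ [0.659 0.242 0.099]
print(np.argmax(probs))      # predicted class: 0
```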

Comparison: Binary vs. Multi-Class in Logistic Regression

Feature | Binary | Multi-Class
Classes | 2 (0 or 1) | 3 or more (e.g., 0, 1, 2, ...)
Output | Single probability (0–1) | Probabilities for each class
Function used | Sigmoid | SoftMax
Loss Function | Binary Cross-Entropy | Categorical Cross-Entropy
Model Extension Needed | No | Yes (OvR or SoftMax)

12. Discuss the K-Nearest Neighbors algorithm. What are its advantages and limitations?

Ans:- K-NN is a non-parametric, instance-based learning algorithm used for classification and regression.

It makes predictions based on the K most similar instances (neighbors) in the training data.

How K-NN Works:

1. Choose a value for K (number of neighbors).
2. Measure the distance (commonly Euclidean) between the query point and all training points.
3. Identify the K nearest neighbors.
4. For:
   o Classification: Predict the majority class among the neighbors.
   o Regression: Predict the average (or weighted average) of the neighbors' values.

Distance Metrics Used:

• Euclidean Distance:

d(x, x') = \sqrt{\sum_i (x_i - x_i')^2}

• Manhattan, Minkowski, or cosine similarity can also be used.

Example (Classification):

Suppose you have data on whether people buy a product based on age and income. To predict whether a new person will buy (see the sketch below):

• Find the K people most similar to them.
• Count how many of those did buy vs. did not buy.
• Pick the majority.
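
A minimal scikit-learn sketch of this example; the ages, incomes, labels, and K = 3 are illustrative assumptions. Note the feature scaling, since income would otherwise dominate the distance.

```python
# K-NN classification of "buys product" from age and income (made-up data)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = [[25, 30000], [30, 42000], [45, 80000], [50, 95000], [23, 25000], [40, 70000]]
y = [0, 0, 1, 1, 0, 1]          # 1 = bought, 0 = did not buy

scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)

new_person = [[35, 60000]]
print(knn.predict(scaler.transform(new_person)))   # majority class of the 3 nearest neighbors
```
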
Advantages of K-NN:

Benefit | Description
Simple to understand | Easy to implement, no assumptions about data distribution
No training phase | Training = storing data; computation is deferred to prediction
Flexible | Works for both classification and regression tasks
Naturally adapts to data | Local decisions make it responsive to the structure of the data

Limitations of K-NN:

Limitation | Description
Slow at prediction | Must compute the distance to all training data each time
Sensitive to irrelevant features | Irrelevant or redundant features can distort distances
Curse of dimensionality | Distance becomes less meaningful in high dimensions
Needs scaling | Features must be normalized to avoid bias toward large-scale features
Memory intensive | Stores the entire training dataset

Choosing K:

• Too small (e.g., K = 1) → High variance (overfitting)


• Too large (e.g., K = 100) → High bias (underfitting)
• Best K is found using cross-validation

K-NN Use Cases:

• Handwritten digit recognition (e.g., MNIST)


• Recommender systems
• Medical diagnosis
• Customer behavior prediction
13. How does the choice of 'k' affect the performance of the KNN algorithm?

Ans:- What is 'K' in K-NN?

• K is the number of nearest neighbors considered when making a prediction.
• It plays a critical role in determining the model's behavior, especially regarding bias and variance.

How 'K' Affects Performance

Small K (e.g., K = 1 or 3):

• Model behavior: Very sensitive to noise in the training data.


• Leads to:
o Low bias (flexible model)
o High variance (can overfit training data)
• Risk: Model may classify based on outliers or noise.

Large K (e.g., K = 15 or 50):

• Model behavior: Smoother decision boundary, less sensitive to individual data points.
• Leads to:
o High bias (over-simplifies patterns)
o Low variance (more stable predictions)
• Risk: Might misclassify if distant neighbors dominate majority
vote.

Graphical Understanding (Conceptual):

K Value | Decision Boundary | Performance
Small (1–3) | Very complex, wavy | Overfits (low bias, high variance)
Moderate (5–10) | Balanced | Best generalization
Large (20+) | Very smooth | Underfits (high bias, low variance)
Trade-Off Summary:

K Value | Bias | Variance | Model Fit
Small (e.g., 1) | Low | High | Overfit (too specific)
Large (e.g., 30) | High | Low | Underfit (too generic)
Moderate (e.g., 5–10) | Balanced | Balanced | Good generalization

How to Choose Optimal K:

1. Use cross-validation (e.g., 5-fold or 10-fold), as in the sketch below
2. Plot accuracy vs. K and select the value with the lowest validation error
3. Prefer odd K values for binary classification (avoids tie votes)
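
A minimal scikit-learn sketch of step 1, assuming the built-in Iris dataset and a handful of candidate K values as illustrative choices.

```python
# Pick K by 5-fold cross-validation: score each candidate and keep the best
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

candidate_k = [1, 3, 5, 7, 9, 11, 15, 21]
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidate_k]

best_k = candidate_k[int(np.argmax(scores))]
print(dict(zip(candidate_k, np.round(scores, 3))))
print("Best K:", best_k)
```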

14. What is a hyperplane in SVM? Explain the role of support vectors in classification.

Ans:- Support Vector Machine (SVM) is a powerful supervised learning algorithm used for binary and multi-class classification (and also regression). It works by finding the optimal hyperplane that separates the different classes in a high-dimensional space.

A hyperplane is a decision boundary that separates the data points of different classes.

• In 2D space, it's a line.
• In 3D space, it's a plane.
• In n-dimensional space, it's called a hyperplane.

Mathematical Form of a Hyperplane:

w^T x + b = 0

Where:

• w: Weight vector (perpendicular to the hyperplane)
• x: Input vector
• b: Bias (offset from the origin)

Goal of SVM:
• Find the hyperplane that maximizes the margin (distance between
the hyperplane and the closest points from each class).
• This is called the maximum margin hyperplane.

What Are Support Vectors?

• Support Vectors are the data points closest to the hyperplane.


• These points "support" the position and orientation of the optimal
hyperplane.
• The margin is measured using these points.

If you remove all other data points except the support vectors, the
hyperplane would not change!

Illustration:


Class 1: o o o o

\ <-- hyperplane

Class 2: x x x x

Support Vectors: o and x closest to the line

• The margin is the distance between the support vectors of the two
classes.
• The model tries to maximize this distance.

Why Are Support Vectors Important?

Role | Explanation
Define the margin | The hyperplane is built only from the support vectors
Model robustness | Ignoring non-support points doesn't affect the decision boundary
Efficiency | Model size depends only on the number of support vectors, not all data
Generalization power | Maximizing the margin improves performance on unseen data
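
A minimal scikit-learn sketch showing that a fitted linear SVC exposes exactly these quantities; the six 2D points are made-up illustrative values.

```python
# A linear SVM exposes its support vectors and the hyperplane parameters w, b
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3],      # class 0
              [6, 5], [7, 7], [8, 6]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("Support vectors:\n", clf.support_vectors_)   # only the points nearest the boundary
print("w:", clf.coef_, " b:", clf.intercept_)       # hyperplane w^T x + b = 0
```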

15. Describe the use of kernel tricks in SVM. Compare linear, polynomial, and RBF kernels.

Ans:- Kernel Trick Overview

In Support Vector Machines (SVM), the kernel trick is a powerful mathematical technique that allows us to perform non-linear classification efficiently without explicitly mapping the input data into a higher-dimensional feature space. The key idea is to use a kernel function K(x, x') to compute the inner product between the images of the data points in the high-dimensional feature space:

K(x, x') = \langle \phi(x), \phi(x') \rangle,

where φ(x) is the (possibly non-linear) mapping function.

• Why is it useful?
o Computational Efficiency: It avoids the explicit
transformation of input data, saving time and memory.
o Non-linear Boundaries: It enables SVM to discover
complex, non-linear decision boundaries by implicitly
considering higher-dimensional features.

Common Kernel Functions

Below are three popular kernels used in SVM along with a comparison
of their properties:

1. Linear Kernel

• Definition: K(x, x') = x^T x'

• Interpretation:
o No transformation is performed; the data are used as-is.
• Advantages:
o Simplicity: Ideal when the dataset is linearly separable.
o Fast: Computationally efficient because it performs only
the dot product.
o Few Hyperparameters: Minimal risk of overfitting due to
less complexity.
• Limitations:
o Does not work well if the relationship between the classes
is non-linear.

2. Polynomial Kernel

• Definition:

K(x, x') = (\gamma \, x^T x' + r)^d

o γ is a scaling factor.
o r is a constant that trades off the influence of higher-order versus lower-order terms.
o d is the degree of the polynomial.
• Interpretation:
o Transforms the feature space into a higher-order
polynomial space.
• Advantages:
o Captures non-linear relationships by increasing the degree d.
o Provides a good balance between linear and more complex kernels when tuned appropriately.
• Limitations:
o Risk of overfitting if the degree is set too high.
o More computationally intensive than the linear kernel for
larger datasets.

3. Radial Basis Function (RBF) Kernel / Gaussian Kernel

• Definition:

K(x, x') = \exp\left(-\gamma \, \| x - x' \|^2\right)

o γ controls the width of the Gaussian; a smaller γ means a broader kernel.
• Interpretation:
o Maps data into an infinite-dimensional feature space.
o Measures similarity based on the distance between data
points.
• Advantages:
o Highly flexible: Excellent for handling complex, non-linear relationships.
o Works well in situations where the decision boundary is highly non-linear.
• Limitations:
o Requires careful tuning of γ; inappropriate values can lead to overfitting or underfitting.
o Computationally more expensive than linear kernels for
very large datasets.
o Often considered a "black box" due to the implicit and
high-dimensional mapping.

Comparison Table

Aspect | Linear Kernel | Polynomial Kernel | RBF (Gaussian) Kernel
Transformation | None (original feature space) | Explicit polynomial mapping | Implicit mapping to infinite-dimensional space
Complexity | Low, fast computation | Moderate; increases with degree | High flexibility, but potentially high computational cost
Suitability | Linearly separable data | Moderately non-linear relationships | Highly non-linear, complex boundaries
Hyperparameters | None (or just C, the penalty term) | Degree d, γ, and r | γ
Risk of Overfitting | Low | Moderate to High (if d is large) | Moderate to High (sensitive to γ)
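
A minimal scikit-learn sketch comparing the three kernels on the same non-linear dataset with cross-validation; the dataset (make_moons) and the hyperparameter values are illustrative assumptions.

```python
# Same SVM, three kernels: the RBF kernel typically wins on curved class boundaries
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)   # non-linear two-class data

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3, "gamma": "scale", "coef0": 1}),
                       ("rbf", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:6s} kernel accuracy: {score:.3f}")
```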

16. How does SVM handle linear and non-linear classification problems? Illustrate with examples.

Ans:- Overview of SVM Classification

Support Vector Machine (SVM) is a powerful supervised learning algorithm that works for both linear and non-linear classification problems by finding the optimal decision boundary (hyperplane) that separates the classes.

1. SVM for Linear Classification

Case: Linearly Separable Data

If the data points from different classes can be separated by a straight line (2D) or hyperplane (nD), SVM finds the hyperplane that maximizes the margin between the two classes.

Mathematical Form:

w^T x + b = 0

Where:

• w: weight vector (normal to the hyperplane)
• b: bias term

Decision:

• Class 1 if w^T x + b > 0
• Class 2 if w^T x + b < 0

Example:

Suppose you have data:

• Class A: (1,2), (2,3), (3,3)
• Class B: (6,5), (7,7), (8,6)

These can be clearly separated by a straight line.

SVM Output:

• Draws a line that separates the two classes with the largest
margin
• Only a few boundary points (support vectors) define the margin

2. SVM for Non-Linear Classification

Case: Data Not Linearly Separable

When data is not linearly separable, SVM applies the kernel trick to
map data to a higher-dimensional space where it becomes linearly
separable.

How It Works:

• Input data x is transformed using a function φ(x)
• Instead of computing φ(x) directly (which is computationally expensive), SVM uses a kernel function K(x, x') = ⟨φ(x), φ(x')⟩

Common Kernels:

Kernel Type | Purpose
Linear | For linearly separable data
Polynomial | For moderately complex boundaries
RBF (Gaussian) | For highly non-linear data

Example: XOR Problem

Input (x₁, x₂) | Class
(0, 0) | 0
(1, 1) | 0
(0, 1) | 1
(1, 0) | 1

This is a classic non-linearly separable dataset. Using an RBF kernel, SVM can correctly classify the data by transforming it into a space where a linear boundary can be drawn.
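
A minimal scikit-learn sketch of exactly this XOR table: the RBF-kernel SVM fits all four points, while no straight line can; the gamma and C values are illustrative assumptions.

```python
# RBF kernel separates XOR; a linear boundary cannot classify all four points correctly
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 0, 1, 1])

rbf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print(rbf.predict(X))      # [0 0 1 1] — matches the XOR labels

linear = SVC(kernel="linear", C=10.0).fit(X, y)
print(linear.predict(X))   # a straight line cannot get all four labels right
```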

Visual Comparison:

Linear SVM:

• Straight line (or hyperplane)
• Works only if the data can be separated without curving the boundary

Non-Linear SVM (with RBF kernel):

• Curved or complex decision boundary
• Data can be tangled, and SVM can still separate the classes correctly

17. Explain the process of constructing a decision tree. How is information gain used?

Ans:- A Decision Tree is a supervised learning model used for classification and regression. It works by recursively splitting the dataset into smaller subsets based on the feature that provides the best separation of the target classes.

1. Decision Tree Construction Process

Here’s how a decision tree is built step-by-step:

Step 1: Select the Best Attribute to Split

• At the root node, evaluate all features.
• Choose the feature that best separates the data based on a criterion (e.g., Information Gain, Gini Index).
• This feature becomes the decision node.

Step 2: Split the Dataset


• Divide the dataset into subsets according to the selected feature’s
values.
• Each branch represents a decision outcome.

Step 3: Repeat Recursively

• For each subset, repeat steps 1–2:


o Select the best feature
o Split further
• Stop when:
o All data in the node belongs to one class (pure node)
o A maximum depth is reached
o A minimum number of samples condition is met

Step 4: Create Leaf Nodes

• Leaf nodes contain the final output label (class or value).


• The path from root to leaf represents a classification rule.

2. Role of Information Gain

What is Information Gain?

Information Gain (IG) measures how much uncertainty (entropy) is reduced by splitting the dataset on a particular feature.

Entropy Formula:

\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)

Where:

• S is the dataset
• pᵢ is the proportion of class i in S
• c is the number of classes

Information Gain Formula:

IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \cdot \text{Entropy}(S_v)

Where:

• A is an attribute
• Values(A) are the possible values of A
• Sᵥ is the subset of S where attribute A = v

The attribute with the highest Information Gain is chosen for splitting.
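
A minimal NumPy sketch of these two formulas on a small made-up split; the label vector and the grouping into subsets are illustrative assumptions.

```python
# Entropy and information gain for a binary target split into two subsets
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum p_i * log2(p_i) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, subsets):
    """IG = Entropy(parent) - weighted average of the subset entropies."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

play = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 1])   # 6 yes, 4 no (hypothetical labels)
split = [np.array([1, 1, 1, 1]),                   # e.g., Outlook = Overcast
         np.array([1, 0, 0, 0, 0, 1])]             # e.g., Outlook = Sunny/Rain

print("Parent entropy:  ", round(entropy(play), 3))            # ≈ 0.971
print("Information gain:", round(information_gain(play, split), 3))   # ≈ 0.420
```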

Example:

Suppose you're building a tree to decide whether someone will play tennis based on weather features like "Outlook", "Humidity", etc.

If splitting on "Outlook" reduces entropy the most, then "Outlook" becomes the first split.

Other Splitting Criteria (Alternatives to Information Gain):

Criterion | Description
Gini Index | Measures impurity; used in CART trees
Gain Ratio | Adjusted version of Information Gain (used in C4.5)
Chi-square | Statistical significance of splits

18. What is pruning in decision trees? Why is it important?

Ans:- Pruning is the process of removing parts of a decision tree that are
not critical for predicting target variables — in other words, cutting off
branches that may reflect noise or overfitting.

Why is Pruning Important?

Without pruning, decision trees tend to overfit the training data — meaning:

• They capture noise and outliers


• They become too complex
• They perform poorly on new (test) data

Pruning improves generalization by simplifying the tree.

Types of Pruning
1. Pre-Pruning (Early Stopping):

• Stop the tree from growing once a certain condition is met:


o Maximum depth
o Minimum number of samples in a node
o Minimum Information Gain
• Advantage: Faster training, smaller tree
• Risk: Might underfit the data (if stopped too early)

2. Post-Pruning (Reduced Error Pruning):

• First, grow the full tree.
• Then prune branches from the bottom up by checking whether removing a node:
o Increases accuracy on a validation set
o Reduces complexity without hurting performance

Example:

If removing a subtree and replacing it with a leaf node doesn't hurt validation accuracy, we prune it.

How Does It Work (Conceptually)?

Before pruning:


Outlook?

├── Sunny → Humidity?

│ ├── High → No

│ └── Normal → Yes

└── ...

After pruning:


Outlook?

├── Sunny → No

└── ...

The model still performs well but is simpler.
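
scikit-learn implements one form of post-pruning, cost-complexity pruning, through the ccp_alpha parameter: larger values prune more aggressively. A minimal sketch, with the dataset and alpha value as illustrative assumptions:

```python
# Unpruned vs. pruned decision tree: fewer nodes, often similar or better test accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("Full tree:  ", full_tree.tree_.node_count, "nodes, test acc =",
      round(full_tree.score(X_test, y_test), 3))
print("Pruned tree:", pruned_tree.tree_.node_count, "nodes, test acc =",
      round(pruned_tree.score(X_test, y_test), 3))
```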

Benefits of Pruning

Benefit | Description
Reduces overfitting | The tree doesn't memorize noise or anomalies
Improves accuracy | Especially on unseen (test) data
Simplifies model | Easier to interpret and explain
Reduces time | Less computation during prediction

19. Describe the ensemble method of Bagging with an example. How does it improve model performance?

Ans:- Bagging (short for Bootstrap Aggregating) is an ensemble learning method that combines the predictions of multiple independent models (typically of the same type) to improve accuracy and stability.

It works best with high-variance models like decision trees (e.g., CART).

How Bagging Works (Step-by-Step):

1. Bootstrap Sampling:
o Create multiple random datasets by sampling with
replacement from the original training data.
o Each sample is the same size as the original but may
contain duplicate entries.
2. Model Training:
o Train a separate model on each of the bootstrapped
datasets.
3. Aggregation:
o For classification: use majority voting from all models.
o For regression: use the average of all model outputs.
Example: Bagging with Decision Trees

Suppose you have a dataset to classify whether a loan applicant is risky or not.

Steps:

• Generate, say, 10 bootstrapped datasets from the original training set.
• Train 10 different decision trees (each may look different due to different data).
• To predict for a new applicant:
o All 10 trees make a prediction.
o Final prediction = class with the majority vote.

This is the basis of the Random Forest algorithm (Bagging + feature randomness).
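
A minimal scikit-learn sketch of bagging decision trees versus a single tree; the synthetic dataset and the choice of 10 estimators are illustrative assumptions (BaggingClassifier uses a decision tree as its default base model).

```python
# Bagging: train many trees on bootstrap samples and vote; compare with one tree
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(n_estimators=10, random_state=0)   # default base = decision tree

print("Single tree CV accuracy: ", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```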

How Bagging Improves Model Performance

Benefit | Explanation
Reduces Variance | Averaging the results of many models smooths out noise in the data; helps especially with unstable models like decision trees.
Increases Robustness | Less sensitive to outliers or small fluctuations in the training data.
Improves Accuracy | Combining models often yields better performance than any single model.
Avoids Overfitting | The ensemble prevents individual overfitted models from dominating the result.

Limitations of Bagging

Limitation | Explanation
Doesn't reduce bias | If the base model is biased (e.g., underfitting), bagging won't fix it.
Requires more resources | More models = more training time and memory.
20. Explain Random Forests. How do they address overfitting in decision trees?

Ans:- A Random Forest is an ensemble learning method based on Bagging (Bootstrap Aggregating) that builds a collection of decision trees and aggregates their outputs to make a final prediction.

It combines many decision trees to create a more accurate and stable model than a single tree.

How Random Forest Works (Step-by-Step)

1. Bootstrap Sampling:
o Create multiple datasets by sampling with replacement
from the original data.
2. Train Multiple Decision Trees:
o For each tree, a random subset of features is selected at
every split — this is key to introducing variation.
o Each tree is trained independently on a different bootstrap
sample.
3. Aggregate Predictions:
o Classification: Majority vote across all trees.
o Regression: Average of all tree predictions.

Why Random Forest Reduces Overfitting

Overfitting is a common problem with single decision trees, especially deep ones that perfectly learn the training data, including noise.

Random Forest addresses overfitting through:

Technique | How it Helps
Bagging | Averages many trees trained on different samples, reducing variance.
Random feature selection | Ensures trees are decorrelated (not all trees use the same strong predictors).
Voting/Averaging | The final prediction is a consensus, not a single tree's potentially overfit result.
No pruning needed | Random Forest grows trees fully, but the ensemble effect controls overfitting.
Example Scenario (Classification)

Let's say you are classifying emails as spam or not spam.

• One deep decision tree may overfit and memorize specific patterns.
• A Random Forest with 100 trees:
o Each tree sees slightly different data and features.
o Some trees might consider "word count", others "frequency of 'offer'".
o Final result = majority vote → much more robust to noise (see the sketch below).
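
A minimal scikit-learn sketch of this comparison on a synthetic spam-like dataset; the dataset and the 100-tree setting are illustrative assumptions. The single deep tree typically fits the training data perfectly but scores lower on the test set than the forest.

```python
# One deep tree vs. a 100-tree Random Forest: the ensemble generalizes better
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree   — train:", tree.score(X_train, y_train),
      " test:", round(tree.score(X_test, y_test), 3))
print("Random forest — train:", forest.score(X_train, y_train),
      " test:", round(forest.score(X_test, y_test), 3))
```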

Key Advantages of Random Forest

Feature | Benefit
High accuracy | Better than individual decision trees
Handles missing data | Can maintain accuracy even with gaps
Versatile and flexible | Works for classification and regression
Resistant to overfitting | The ensemble approach smooths out anomalies

Limitations

Limitation | Explanation
Less interpretable | Hard to explain decisions compared to a single tree
Slower than one tree | Especially for real-time applications
Large memory usage | Due to multiple full-grown trees
Thank You!
