MODULE #2
Linear Regression
1. What is Linear Regression?
Linear Regression is one of the simplest and most widely used supervised machine learning
algorithms for predicting a continuous target variable based on one or more predictor
variables (features). It models the relationship between the dependent variable y and one or more
independent variables X by fitting a linear equation to observed data.
2. The Linear Regression Model
The model assumes the relationship between the dependent variable y and the independent
variables x1,x2,...,xn can be described as:
y = β0 + β1x1 + β2x2 + … + βnxn + ϵ
where:
β0 is the intercept (the value of y when all xi are 0),
βi are the coefficients (weights) for each feature,
ϵ is the error term capturing noise/unexplained variance.
3. Objective
The goal of linear regression is to find the best-fitting line that minimizes the difference
between predicted values and actual values of y. The common approach is to minimize the Sum
of Squared Residuals (Ordinary Least Squares - OLS).
4. Assumptions of Linear Regression
1. Linearity: The relationship between the predictors and the target is linear.
2. Independence: Observations are independent.
3. Homoscedasticity: Constant variance of residual errors.
4. Normality of Errors: Residuals are normally distributed.
5. No multicollinearity: Predictors are not highly correlated with each other.
5. How Linear Regression Works
Given training data, the algorithm estimates the coefficients βi that minimize the sum of squared errors:
SSE = Σ (yi − ŷi)^2, where ŷi is the model's prediction for observation i.
Once fitted, the model can predict y for new X values.
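To make the estimation step concrete, here is a minimal NumPy sketch (an illustration added to these notes, with made-up values) that computes the OLS coefficients directly via the normal equation β = (XᵀX)⁻¹Xᵀy. scikit-learn's LinearRegression, used in the next section, solves the same least-squares problem internally.
import numpy as np
# Toy data: one feature, five observations (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])
# Add a column of ones so the intercept beta_0 is estimated as well
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
# Normal equation: beta = (X^T X)^(-1) X^T y minimizes the sum of squared residuals
beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print("Intercept (beta_0):", beta[0])
print("Slope (beta_1):", beta[1])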
6. Example with Python Code
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Sample dataset: Years of Experience vs Salary
data = {
'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [35000, 37000, 40000, 43000, 47000, 52000, 56000, 60000, 64000,
68000]
}
df = pd.DataFrame(data)
# Features and Target
X = df[['Experience']] # 2D array for sklearn
y = df['Salary']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Create and fit the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
# Print coefficients
print("Intercept (β0):", model.intercept_)
print("Coefficient (β1):", model.coef_[0])
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
# Example prediction
years_exp = 6.5
predicted_salary = model.predict([[years_exp]])
print(f"Predicted salary for {years_exp} years of experience:
${predicted_salary[0]:.2f}")
7. Explanation of the Code
Data Preparation: Simple dataset of years of experience and salary.
Splitting Data: Train/test split for model evaluation.
Model Training: LinearRegression() fits the data to find coefficients.
Prediction: Predict salary for test data.
Evaluation: Calculated MAE, MSE, and RMSE to measure error.
New Prediction: Predict salary for 6.5 years experience.
8. When to Use Linear Regression?
When the target variable is continuous.
When the relationship between independent and dependent variables is approximately
linear.
When the assumptions listed above are met or closely approximated.
Polynomial Regression
1. What is Polynomial Regression?
Polynomial Regression is an extension of Linear Regression that models the relationship
between the independent variable x and the dependent variable y as an nth-degree polynomial.
It’s used when the relationship between variables is non-linear but can be approximated by a
polynomial curve.
2. The Polynomial Regression Model
The model equation for polynomial regression of degree d is:
y = β0 + β1x + β2x^2 + β3x^3 + … + βdx^d + ϵ
βi are the coefficients,
ϵ is the error term,
d is the degree of the polynomial.
3. Why Use Polynomial Regression?
When the data shows a curved trend that a simple linear model cannot capture.
It allows fitting a non-linear relationship using a linear model by transforming features.
Useful in domains like physics, finance, biology where relationships are inherently
nonlinear.
4. How Polynomial Regression Works
1. Transform input features to include polynomial terms (e.g., x, x^2, x^3, etc.).
2. Use linear regression on this expanded feature set.
3. Estimate the coefficients βi to fit the polynomial curve.
5. Assumptions
The relationship between the transformed features and the target is linear.
The usual linear regression assumptions hold on the transformed data.
Avoid very high degree polynomials to prevent overfitting.
6. Python Code Example: Polynomial Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Sample dataset with a nonlinear relationship
np.random.seed(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 - 2 * X + X**2 + np.random.randn(50, 1) * 3  # Quadratic relation with noise
# Plot original data
plt.scatter(X, y, color='blue', label='Data Points')
plt.title("Original Data")
plt.show()
# Polynomial feature transformation (degree=2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X) # Transforms X to [1, x, x^2]
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2,
random_state=42)
# Create and train the linear regression model on polynomial features
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
# Plot the polynomial regression curve
X_fit = np.linspace(0, 10, 100).reshape(-1, 1)
X_fit_poly = poly.transform(X_fit)
y_fit = model.predict(X_fit_poly)
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_fit, y_fit, color='red', label='Polynomial Regression Curve')
plt.title("Polynomial Regression Fit (Degree 2)")
plt.legend()
plt.show()
7. Explanation of the Code
Created a dataset with a quadratic relationship plus noise.
Used PolynomialFeatures to create polynomial terms up to degree 2.
Applied Linear Regression on the transformed features.
Evaluated model performance using Mean Squared Error.
Visualized original data points and the fitted polynomial regression curve.
8. When to Use Polynomial Regression?
When your data shows a non-linear relationship.
When linear regression underfits and you need a more flexible model.
When the degree of the polynomial is low to moderate to avoid overfitting.
9. Advantages & Disadvantages
Advantages | Disadvantages
Captures non-linear relationships | Can overfit if the degree is too high
Still a linear model in the transformed space | Sensitive to outliers
Easy to implement and interpret | Computational cost increases with the degree
Model Evaluation Metrics for Regression
1. Overview
When predicting continuous values (regression tasks), evaluating the accuracy of predictions
requires metrics that quantify how close the predicted values are to the actual values. The most
common metrics are:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
All measure the average error between predicted values and actual values but differ in how they
penalize errors.
2. Mean Absolute Error (MAE)
Definition: The average of the absolute differences between predicted values and actual
values.
Interpretation: MAE gives the average magnitude of errors in predictions, without
considering their direction (positive or negative).
Advantages:
o Easy to understand and interpret.
o Linear score — all individual differences are weighted equally.
Disadvantages:
o Does not penalize large errors more than small errors.
3. Mean Squared Error (MSE)
Definition: The average of the squared differences between predicted values and actual
values.
Interpretation: MSE measures the average squared difference between actual and
predicted values. Squaring emphasizes larger errors more than smaller ones.
Advantages:
o Penalizes larger errors more severely.
o Differentiable, useful for optimization in many algorithms.
Disadvantages:
o Less interpretable than MAE because errors are squared (units are squared).
o Sensitive to outliers.
4. Root Mean Squared Error (RMSE)
Definition: The square root of the mean squared error.
Interpretation: RMSE gives error magnitude in the same units as the target variable,
making it more interpretable than MSE.
Advantages:
o Combines benefits of MAE and MSE.
o Penalizes larger errors more (like MSE).
o Units match the predicted values, easier to interpret.
Disadvantages:
o Also sensitive to outliers.
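To make the three definitions concrete, the following short NumPy sketch (added for illustration; the values are the same toy numbers used in the sklearn example of Section 6 below) computes MAE, MSE, and RMSE directly from their definitions:
import numpy as np
# Illustrative actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean of absolute differences
mse = np.mean(errors ** 2)      # mean of squared differences
rmse = np.sqrt(mse)             # square root of MSE, same units as y
print(mae, mse, rmse)           # 0.5  0.375  0.612...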
5. When to Use Which?
Metric | Best Used When | Notes
MAE | You want a simple average error | Less sensitive to outliers
MSE | You want to heavily penalize large errors | Useful in many ML algorithms for optimization
RMSE | You want an interpretable metric with penalization of large errors | Most common choice for regression
6. Python Code Example to Calculate MAE, MSE, RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
# Example actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
# Calculate MAE
mae = mean_absolute_error(y_true, y_pred)
# Calculate MSE
mse = mean_squared_error(y_true, y_pred)
# Calculate RMSE
rmse = np.sqrt(mse)
print(f"Mean Absolute Error (MAE): {mae:.3f}")
print(f"Mean Squared Error (MSE): {mse:.3f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")
7. Interpretation of Results (Example)
Given the example:
MAE = 0.5 means on average, predictions are off by 0.5 units.
MSE = 0.375 squares errors so it penalizes larger errors.
RMSE = 0.612 is in the same unit as the predicted variable, showing average error
magnitude.
Logistic Regression
1. What is Logistic Regression?
Logistic Regression is a classification algorithm used to predict the probability of a binary
outcome (0 or 1, True or False, Yes or No). Despite the name "regression", it's used for
classification tasks, not regression.
It models the log-odds of the probability of the outcome as a linear combination of the input
features.
2. Logistic Function (Sigmoid Function)
At the core of logistic regression is the sigmoid function, which maps any real number to a
value between 0 and 1:
σ(z) = 1 / (1 + e^(−z))
Where:
z = β0 + β1x1 + β2x2 + … + βnxn
σ(z) is the predicted probability that the output is 1.
3. Decision Boundary
If σ(z)≥0.5: predict class 1
If σ(z)<0.5: predict class 0
The threshold (default is 0.5) can be adjusted for different sensitivity levels.
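A small Python sketch (illustrative, not part of the module's main example) of the sigmoid mapping and the default 0.5 threshold:
import numpy as np
def sigmoid(z):
    # Maps any real number to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))
z_values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])  # illustrative linear scores
probs = sigmoid(z_values)
predictions = (probs >= 0.5).astype(int)  # apply the default 0.5 threshold
print("Probabilities:", np.round(probs, 3))
print("Predicted classes:", predictions)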
4. Use Cases
Email Spam Detection (Spam or Not Spam)
Credit Scoring (Default or Not)
Medical Diagnosis (Disease or No Disease)
Marketing (Buy or Not Buy)
5. Assumptions
The dependent variable is binary.
There is a linear relationship between the logit of the outcome and the predictors.
Observations are independent.
Large sample size improves performance.
6. Python Code Example (Binary Classification)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Example dataset: Predicting if someone buys a product based on age and income
data = {
'Age': [22, 25, 47, 52, 46, 56, 55, 60],
'Income': [20000, 25000, 48000, 60000, 52000, 75000, 70000, 80000],
'Purchased': [0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Features and target
X = df[['Age', 'Income']]
y = df['Purchased']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=0)
# Model training
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
7. Output Explanation
Confusion Matrix:
o True Positive (TP): Correctly predicted 1s
o True Negative (TN): Correctly predicted 0s
o False Positive (FP): Predicted 1, but actual 0
o False Negative (FN): Predicted 0, but actual 1
Classification Report:
o Precision: TP/(TP+FP)
o Recall: TP/(TP+FN)
o F1-Score: Harmonic mean of precision and recall: 2PR / (P + R)
8. Probability Prediction
You can also get the predicted probability instead of class labels:
y_prob = model.predict_proba(X_test)
print("Predicted Probabilities:\n", y_prob)
The result is a 2-column output:
Column 0: probability of class 0
Column 1: probability of class 1
9. Limitations
Only works for binary/multiclass classification, not regression.
Assumes a linear decision boundary in feature space.
Sensitive to outliers and irrelevant features.
10. Extensions
Multinomial Logistic Regression for multiclass classification.
Regularized Logistic Regression (with L1 or L2) to handle overfitting.
Logistic Regression with Interaction Terms to model feature interactions.
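As a brief illustration of the regularized variant, the sketch below fits L2- and L1-penalized logistic regression with scikit-learn; the dataset (load_breast_cancer) and the C value are arbitrary choices for demonstration:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Illustrative binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale features so the regularization penalty treats them comparably
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# L2-regularized logistic regression; smaller C means stronger regularization
model_l2 = LogisticRegression(penalty='l2', C=0.1, max_iter=1000)
model_l2.fit(X_train_s, y_train)
print("L2 test accuracy:", model_l2.score(X_test_s, y_test))
# L1 regularization drives some coefficients exactly to zero (implicit feature selection)
model_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
model_l1.fit(X_train_s, y_train)
print("L1 non-zero coefficients:", (model_l1.coef_ != 0).sum())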
K-Nearest Neighbors (KNN)
1. What is KNN?
K-Nearest Neighbors is a supervised learning algorithm used for both classification and
regression. It assumes that similar data points are located close to each other in feature space. It
classifies a data point based on how its neighbors are classified.
KNN is an instance-based learning or lazy learning algorithm:
It does not learn a model during training.
Instead, it memorizes the training dataset and classifies new data points using a distance
metric.
2. How KNN Works (Classification)
1. Choose the number of neighbors K.
2. Calculate the distance between the new data point and all points in the training data.
3. Select the K nearest neighbors based on distance.
4. Perform a majority vote: the most common class among neighbors is assigned to the
new point.
3. Distance Metrics
Commonly used distance functions:
Euclidean Distance (most common): d(p, q) = sqrt(Σ (pi − qi)^2)
Manhattan Distance: d(p, q) = Σ |pi − qi|
Minkowski Distance: d(p, q) = (Σ |pi − qi|^m)^(1/m), a generalization of both (m = 2 gives Euclidean, m = 1 gives Manhattan)
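The sketch below (an added illustration using made-up points similar to the example in Section 5) computes Euclidean and Manhattan distances by hand and performs the majority vote for K = 3:
import numpy as np
from collections import Counter
# Illustrative training points and labels
X_train = np.array([[1, 1], [2, 1], [3, 2], [6, 5], [7, 6], [8, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])
new_point = np.array([5, 4])
# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
# Manhattan distance: sum of absolute differences
manhattan = np.abs(X_train - new_point).sum(axis=1)
# Majority vote among the K = 3 nearest neighbors (by Euclidean distance)
k = 3
nearest = np.argsort(euclidean)[:k]
predicted_class = Counter(y_train[nearest]).most_common(1)[0][0]
print("Euclidean distances:", np.round(euclidean, 2))
print("Manhattan distances:", manhattan)
print("Predicted class:", predicted_class)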
4. Example Use Cases
Handwritten digit recognition (e.g., MNIST dataset)
Recommendation systems
Image recognition
Credit scoring
5. Python Code Example (KNN Classification)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
# Sample dataset: Feature1 and Feature2 used to predict class (0 or 1)
data = {'Feature1': [1, 2, 3, 6, 7, 8],'Feature2': [1, 1, 2, 5, 6, 7],
'Label': [0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)
# Define features and target
X = df[['Feature1', 'Feature2']]
y = df['Label']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)
# Create the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict
y_pred = knn.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
6. Choosing the Best Value of K
Small K (e.g., 1 or 3): Model is sensitive to noise (high variance).
Large K: Model becomes too generalized (high bias).
Common approach: Use cross-validation to find the optimal K.
# Finding optimal K using error rate plot
import matplotlib.pyplot as plt
error_rate = []
# Note: n_neighbors cannot exceed the number of training samples (only 4 in this tiny example)
k_values = range(1, len(X_train) + 1)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    error_rate.append((pred_k != y_test).mean())
plt.plot(k_values, error_rate, marker='o')
plt.title("Error Rate vs. K")
plt.xlabel("K")
plt.ylabel("Error Rate")
plt.show()
7. Advantages of KNN
Simple and easy to implement.
No assumptions about data distribution.
Naturally handles multi-class classification.
Works well with small to medium-sized datasets.
8. Disadvantages of KNN
Problem | Description
Slow Prediction | KNN stores all training data and computes distances for every prediction
Sensitive to Irrelevant Features | All features are treated equally unless scaled or weighted
Curse of Dimensionality | High-dimensional data reduces accuracy
Imbalanced Data | Can bias toward the majority class
9. Feature Scaling in KNN
KNN is distance-based, so features should be scaled:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Proceed with train/test split and KNN model
10. KNN for Regression
KNN can also predict continuous values by averaging the target values of the nearest neighbors:
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
Decision Trees
1. What is a Decision Tree?
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It models decisions and their possible consequences as a tree structure:
Nodes represent tests on features.
Branches represent outcomes of the test.
Leaves represent final class labels or continuous values.
2. How Does a Decision Tree Work?
The tree is built by splitting the dataset on features that best separate the data into classes
(for classification) or predict continuous values (for regression).
The process starts at the root node and recursively splits until a stopping condition is met
(e.g., max depth, minimum samples).
The goal is to create pure nodes where most samples belong to the same class or have
minimal variance.
3. Splitting Criteria
For Classification:
Gini Impurity: Measures the probability of misclassifying a random sample.
Entropy: Measures the disorder or uncertainty.
The algorithm chooses splits that maximize information gain, which is the reduction in
impurity.
For Regression:
Mean Squared Error (MSE) or Mean Absolute Error (MAE) are used to choose
splits.
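To illustrate the classification criteria, here is a small sketch (added to these notes) with hand-written gini and entropy helper functions applied to the class labels of a single hypothetical node; scikit-learn computes these same quantities internally when you set the criterion parameter.
import numpy as np
def gini(labels):
    # Probability of misclassifying a randomly chosen sample from this node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)
def entropy(labels):
    # Disorder/uncertainty of the class distribution in this node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
node = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # illustrative class labels in one node
print("Gini impurity:", gini(node))   # 1 - (3/8)^2 - (5/8)^2 = 0.46875
print("Entropy:", entropy(node))      # about 0.954 bits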
4. Advantages of Decision Trees
Easy to interpret and visualize.
Requires little data preprocessing.
Can handle both numerical and categorical data.
Non-parametric, so no assumptions about data distribution.
Can capture non-linear relationships.
5. Disadvantages
Drawback | Description
Overfitting | Trees can grow too complex and fit noise
Instability | Small changes in data can result in very different trees
Bias towards features with more levels | Features with many categories may dominate splits
6. Python Code Example: Decision Tree Classifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
# Sample data (Iris dataset for classification)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Initialize and train Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Visualize the tree
plt.figure(figsize=(12,8))
plot_tree(clf, feature_names=iris.feature_names,
class_names=iris.target_names, filled=True)
plt.show()
7. Explanation
The dataset is split into training and testing sets.
The decision tree is trained with a maximum depth of 3 to avoid overfitting.
Predictions and evaluation metrics are printed.
The tree structure is visualized using plot_tree.
8. Parameters to Tune
max_depth: Maximum depth of the tree (controls overfitting).
min_samples_split: Minimum number of samples required to split a node.
min_samples_leaf: Minimum number of samples required at a leaf node.
criterion: Splitting criteria, “gini” for Gini impurity, “entropy” for information gain.
9. Use Cases
Medical Diagnosis (e.g., cancer classification)
Customer Churn Prediction
Fraud Detection
Credit Scoring
Regression Trees
Regression Trees are a type of decision tree used when the target variable is continuous (rather
than categorical, as in classification problems). They work by splitting the data into smaller and
smaller subsets, forming a tree structure that predicts numeric outcomes.
Key Concepts
1. Structure of a Regression Tree
Root Node: The entire dataset.
Internal Nodes: Conditions based on features (e.g., X1 <= 5).
Leaves: Final predictions, which are the mean of target values in that subset.
2. How It Works
Basic Steps:
1. Split the data at a feature and threshold that minimizes the error.
2. Repeat recursively on each child node until a stopping criterion is met (e.g., minimum
samples per leaf, maximum depth).
3. Predict the average value in each leaf.
3. Splitting Criteria
Regression trees aim to minimize mean squared error (MSE) at each split.
MSE Formula:
MSE = (1/n) Σ (yi − ȳ)^2, where ȳ is the mean of the target values in the node.
At each step, the algorithm chooses the feature and threshold that gives the largest reduction in
MSE.
4. Example
Let’s say we want to predict house prices based on square footage and number of bedrooms.
First Split:
If square_footage <= 1500, go left. Else, go right.
Second Split (Left branch):
If bedrooms <= 2, go left. Else, go right.
At each terminal leaf, the predicted house price is the average of all training prices in that
subset.
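A minimal scikit-learn sketch of this house-price example follows; the square footage, bedroom, and price values are invented purely for illustration. export_text prints the learned splits so you can see the thresholds the tree chose.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text
# Illustrative data for the house-price example above
df = pd.DataFrame({
    'square_footage': [900, 1200, 1400, 1600, 1800, 2200, 2600, 3000],
    'bedrooms':       [2, 2, 3, 3, 3, 4, 4, 5],
    'price':          [150000, 180000, 200000, 240000, 260000, 320000, 370000, 430000]
})
X = df[['square_footage', 'bedrooms']]
y = df['price']
# Shallow tree so the splits stay readable; each leaf predicts the mean price in that subset
tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X, y)
print(export_text(tree, feature_names=['square_footage', 'bedrooms']))
print("Predicted price for 1500 sq ft, 2 bedrooms:",
      tree.predict(pd.DataFrame({'square_footage': [1500], 'bedrooms': [2]}))[0])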
5. Advantages
Easy to interpret and visualize.
Can handle non-linear relationships.
Requires little data preprocessing (no need to scale variables).
6. Disadvantages
Prone to overfitting (can memorize noise in data).
Unstable: Small changes in data can produce different trees.
Can be biased toward features with more levels.
7. Regularization Techniques
To control overfitting:
Max Depth: Limit depth of the tree.
Min Samples Split: Minimum number of samples required to split a node.
Min Samples Leaf: Minimum number of samples in a leaf node.
Max Features: Limit number of features to consider at each split.
8. Ensembles of Regression Trees
Random Forest: Builds multiple regression trees and averages their predictions.
Gradient Boosting: Builds trees sequentially, each one correcting errors of the previous.
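A short sketch (added for illustration, using synthetic data from make_regression) comparing the two ensemble approaches on a regression task:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Synthetic regression data for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Random Forest: averages many trees built on bootstrap samples
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest MSE:", mean_squared_error(y_test, rf.predict(X_test)))
# Gradient Boosting: builds trees sequentially, each correcting the previous errors
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
print("Gradient Boosting MSE:", mean_squared_error(y_test, gb.predict(X_test)))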
Use Cases
Predicting house prices
Forecasting demand
Estimating insurance costs
Any regression task with potentially non-linear relationships
Random Forests
1. What is a Random Forest?
Random Forest is an ensemble machine learning algorithm used for both classification and
regression tasks. It builds multiple decision trees and merges their results to improve accuracy
and control overfitting.
It combines the idea of bagging (Bootstrap Aggregating) and random feature selection.
Each tree is built on a random subset of data and random subset of features.
The final prediction is made by majority voting (classification) or averaging
(regression).
2. How Does Random Forest Work?
1. From the original dataset, generate multiple bootstrap samples (random samples with
replacement).
2. For each bootstrap sample, grow a decision tree but at each split, only consider a random
subset of features (not all features).
3. Each tree is grown to the largest extent possible without pruning.
4. For classification, aggregate predictions by majority vote.
5. For regression, average the predictions from all trees.
3. Key Concepts
Bagging: Reduces variance by training multiple models on different subsets of data.
Random Feature Selection: Adds diversity among trees by only considering a random
subset of features when splitting.
4. Advantages of Random Forests
Handles large datasets with higher dimensionality.
Reduces overfitting compared to a single decision tree.
Works well with missing values and maintains accuracy.
Can handle both categorical and continuous variables.
Provides feature importance scores.
5. Disadvantages
Disadvantage | Description
Less interpretable | Ensemble nature makes it harder to interpret than a single tree
Slower prediction | More computational resources needed for many trees
Can overfit with noisy data | Though less prone than single trees, overfitting is still possible
6. Python Code Example: Random Forest Classifier
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Initialize and train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Feature importance
import matplotlib.pyplot as plt
import numpy as np
feat_importances = rf.feature_importances_
indices = np.argsort(feat_importances)[::-1]
features = iris.feature_names
plt.title("Feature Importances")
plt.bar(range(len(features)), feat_importances[indices], align='center')
plt.xticks(range(len(features)), [features[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()
7. Explanation of the Code
Load the Iris dataset and split into train and test sets.
Train a random forest classifier with 100 trees (n_estimators=100).
Predict and evaluate using confusion matrix and classification report.
Plot feature importance to understand which features contribute most to predictions.
8. Hyperparameters to Tune
Parameter | Description
n_estimators | Number of trees in the forest
max_depth | Maximum depth of each tree
min_samples_split | Minimum samples required to split a node
min_samples_leaf | Minimum samples required at a leaf node
max_features | Number of features to consider at each split
bootstrap | Whether bootstrap samples are used (default True)
9. When to Use Random Forests?
When you want high accuracy and robust predictions.
When you have large datasets with many features.
When you want to reduce the risk of overfitting from a single decision tree.
When feature importance is needed.
Model Evaluation Metrics for Classification
1. Confusion Matrix (Foundation)
All these metrics are derived from the confusion matrix, which summarizes the performance of
a classification model:
 | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
True Positive (TP): Correctly predicted positive class.
False Positive (FP): Incorrectly predicted positive class.
True Negative (TN): Correctly predicted negative class.
False Negative (FN): Incorrectly predicted negative class.
2. Accuracy
Definition: Ratio of correctly predicted observations to total observations.
Interpretation: Overall effectiveness of the model.
Limitation: Not reliable when classes are imbalanced.
3. Precision
Definition: Ratio of correctly predicted positive observations to total predicted positives.
Interpretation: Of all instances predicted as positive, how many are actually positive?
When to use: Important when the cost of false positives is high (e.g., spam detection,
fraud detection).
4. Recall (Sensitivity or True Positive Rate)
Definition: Ratio of correctly predicted positive observations to all actual positives.
Interpretation: Of all actual positive cases, how many did the model identify correctly?
When to use: Important when the cost of false negatives is high (e.g., disease diagnosis).
5. F1-Score
Definition: Harmonic mean of Precision and Recall.
Interpretation: Balance between precision and recall. Useful when you need a balance
between false positives and false negatives.
6. ROC Curve and AUC
ROC Curve (Receiver Operating Characteristic)
Plots True Positive Rate (Recall) against False Positive Rate (FPR) for different
classification thresholds.
Shows trade-offs between sensitivity and specificity.
AUC (Area Under Curve)
Scalar value between 0 and 1 summarizing the ROC curve.
Interpretation:
o 1: Perfect classifier
o 0.5: No discrimination (random guessing)
o Closer to 1 = better model.
7. Python Code Example to Calculate Metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix, roc_curve)
import matplotlib.pyplot as plt
# Sample true and predicted labels
y_true, y_pred = [0,0,1,1,0,1,0,1,1,0],[0,1,1,1,0,0,0,1,0,0]
y_scores = [0.1, 0.4, 0.8, 0.75, 0.3, 0.2, 0.1, 0.85, 0.5, 0.05]
# predicted probabilities for class 1
# Calculate metrics
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_scores))
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
plt.plot(fpr, tpr, label="ROC Curve (AUC =
{:.2f})".format(roc_auc_score(y_true, y_scores)))
plt.plot([0, 1], [0, 1], 'k--') # Random guess line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.show()
8. Summary Table
Metric | Formula | Purpose | Use Case Example
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness | Balanced classes
Precision | TP / (TP + FP) | Correctness of positive predictions | Spam detection (avoid false alarms)
Recall | TP / (TP + FN) | Ability to find all positive cases | Disease diagnosis (avoid misses)
F1-Score | Harmonic mean of precision & recall | Balance of precision & recall | When classes are imbalanced
ROC-AUC | Area under ROC curve | Discrimination capability | Model comparison, threshold tuning
Train-Test Split and Cross Validation
1. Train-Test Split
What is it?
Dividing your dataset into two parts:
o Training set: Used to train the model.
o Testing set: Used to evaluate how well the model generalizes to new, unseen
data.
Why?
To avoid overfitting: A model might perform very well on the training data but poorly
on new data.
To get an unbiased estimate of model performance.
How?
A common split is 70-80% for training and 20-30% for testing.
Done randomly to ensure representative samples.
Python Example with sklearn:
from sklearn.model_selection import train_test_split
X = ... # Features
y = ... # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
print(f"Train size: {len(X_train)}")
print(f"Test size: {len(X_test)}")
test_size=0.2: 20% test data
random_state=42: Ensures reproducibility
2. Cross Validation (CV)
What is it?
A method to evaluate model performance more reliably by splitting data into multiple
training/testing sets.
Instead of one train-test split, data is split multiple times.
Common Types:
a) K-Fold Cross Validation
Data is split into k equal folds.
For each iteration:
o Train on k-1 folds.
o Test on the remaining fold.
Repeat k times, each fold used once as test.
Performance is averaged over the k runs.
Example with 5 folds:
Fold 1: Train on folds 2,3,4,5; test on fold 1
Fold 2: Train on folds 1,3,4,5; test on fold 2
...
Fold 5: Train on folds 1,2,3,4; test on fold 5
b) Stratified K-Fold
Like K-Fold but preserves the percentage of samples for each class.
Useful for classification with imbalanced classes.
c) Leave-One-Out Cross Validation (LOOCV)
Extreme case where k = number of samples.
Train on all data except one sample, test on that single sample.
Repeated for all samples.
Very computationally expensive.
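The sketch below (an added illustration on the Iris dataset) shows how the Stratified K-Fold and LOOCV strategies above can be plugged into cross_val_score via the cv argument:
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# Stratified K-Fold: each fold keeps the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print("Stratified 5-fold accuracies:", skf_scores)
# LOOCV: one sample held out per iteration (150 fits for the 150-sample Iris dataset)
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print("LOOCV mean accuracy:", loo_scores.mean())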
Why Cross Validation?
Provides more stable and less biased estimate of model performance.
Helps tune hyperparameters with better confidence.
Reduces variance due to data splitting randomness.
Python Example using K-Fold CV:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
X = ... # feature matrix
y = ... # labels
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression()
accuracies = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
print("Cross-validated accuracies:", accuracies)
print("Mean accuracy:", np.mean(accuracies))
Using cross_val_score (simpler):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("CV accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
Hyperparameter Tuning using Grid Search
1. What are Hyperparameters?
Hyperparameters are parameters set before training a model.
Unlike model parameters (like weights in linear regression) which are learned during
training, hyperparameters control the learning process.
Examples:
o Number of neighbors in KNN (n_neighbors)
o Depth of a decision tree (max_depth)
o Learning rate in gradient boosting
o Regularization strength (C) in logistic regression or SVM
2. Why Tune Hyperparameters?
Hyperparameters significantly affect model performance.
Poor choice can lead to underfitting or overfitting.
Tuning helps find the best combination of hyperparameters to maximize model
accuracy, precision, recall, or other metrics.
3. What is Grid Search?
Grid Search is an exhaustive search technique to find the optimal hyperparameters.
You specify a grid (set) of hyperparameter values.
The algorithm trains and evaluates the model on every possible combination of these
parameters.
The best parameter set is chosen based on cross-validation performance.
4. How Grid Search Works
1. Define model and hyperparameter grid (dictionary of hyperparameters and lists of
values).
2. For each combination in the grid:
o Train the model with that combination on training folds.
o Validate it on validation folds.
3. Calculate the average validation score.
4. Select the combination with the best average score.
5. Using Grid Search in Python with scikit-learn
Step 1: Import libraries
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
Step 2: Prepare data
data = load_iris()
X, y = data.data, data.target
Step 3: Define model and parameter grid
model = RandomForestClassifier(random_state=42)
param_grid = {
'n_estimators': [10, 50, 100],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5, 10]
}
Step 4: Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
cv=5, scoring='accuracy', n_jobs=-1)
cv=5: 5-fold cross-validation
scoring='accuracy': Metric to evaluate
n_jobs=-1: Use all CPU cores to speed up
Step 5: Fit Grid Search to data
grid_search.fit(X, y)
Step 6: Get best parameters and score
print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)
6. Interpreting Results
grid_search.best_params_: Shows the hyperparameter combination with highest
mean CV score.
grid_search.best_score_: The corresponding average CV accuracy.
You can use grid_search.best_estimator_ to get the model trained with best
hyperparameters.
7. Advanced Tips
RandomizedSearchCV: Instead of exhaustive search, samples hyperparameters
randomly (faster for large grids).
Custom Scoring: Use metrics like precision, recall, F1-score by setting scoring
parameter.
Refit: GridSearchCV refits the best model on the entire dataset by default.
Pipeline Integration: Combine preprocessing and model tuning in a single pipeline with
GridSearch.
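As an illustration of the RandomizedSearchCV alternative mentioned above, here is a short sketch; the parameter ranges and the n_iter value are arbitrary choices for demonstration:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from scipy.stats import randint
data = load_iris()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=42)
# Distributions / lists to sample from (illustrative ranges)
param_distributions = {
    'n_estimators': randint(10, 200),
    'max_depth': [None, 3, 5, 10],
    'min_samples_split': randint(2, 11)
}
# Evaluates n_iter random combinations instead of the full grid
random_search = RandomizedSearchCV(model, param_distributions, n_iter=20,
                                   cv=5, scoring='accuracy', random_state=42, n_jobs=-1)
random_search.fit(X, y)
print("Best parameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)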
8. Example: Full Code
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
data = load_iris()
X, y = data.data, data.target
# Define model and grid
model = RandomForestClassifier(random_state=42)
param_grid = {
'n_estimators': [10, 50, 100],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5, 10]
}
# Grid Search with 5-fold CV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy',
n_jobs=-1)
grid_search.fit(X, y)
print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)