MODULE #2
Linear Regression
1. What is Linear Regression?
Linear Regression is one of the simplest and most widely used supervised machine learning
algorithms for predicting a continuous target variable based on one or more predictor
variables (features). It models the relationship between the dependent variable y and one or more
independent variables X by fitting a linear equation to observed data.
2. The Linear Regression Model
The model assumes the relationship between the dependent variable y and the independent
variables x1,x2,...,xn can be described as:
y = β0 + β1x1 + β2x2 + … + βnxn + ϵ
where:
β0 is the intercept (the value of y when all xi are 0),
βi are the coefficients (weights) for each feature,
ϵ is the error term capturing noise/unexplained variance.
3. Objective
The goal of linear regression is to find the best-fitting line that minimizes the difference
between predicted values and actual values of y. The common approach is to minimize the Sum
of Squared Residuals (Ordinary Least Squares - OLS).
4. Assumptions of Linear Regression
1. Linearity: The relationship between the predictors and the target is linear.
2. Independence: Observations are independent.
3. Homoscedasticity: Constant variance of residual errors.
4. Normality of Errors: Residuals are normally distributed.
5. No multicollinearity: Predictors are not highly correlated with each other.
5. How Linear Regression Works
Given training data, the algorithm estimates the coefficients βi that minimize the sum of squared errors:
SSE = Σ (yi − ŷi)^2, where ŷi is the model's prediction for observation i.
Once fitted, the model can predict y for new X values.
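To make the estimation step concrete, here is a minimal NumPy sketch (an illustration added to these notes, with made-up values) that computes the OLS coefficients directly via the normal equation β = (XᵀX)⁻¹Xᵀy. scikit-learn's LinearRegression, used in the next section, solves the same least-squares problem internally.
import numpy as np
# Toy data: one feature, five observations (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])
# Add a column of ones so the intercept beta_0 is estimated as well
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
# Normal equation: beta = (X^T X)^(-1) X^T y minimizes the sum of squared residuals
beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print("Intercept (beta_0):", beta[0])
print("Slope (beta_1):", beta[1])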
6. Example with Python Code
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Sample dataset: Years of Experience vs Salary
data = {
'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [35000, 37000, 40000, 43000, 47000, 52000, 56000, 60000, 64000,
68000]
}
df = pd.DataFrame(data)
# Features and Target
X = df[['Experience']] # 2D array for sklearn
y = df['Salary']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Create and fit the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
# Print coefficients
print("Intercept (β0):", model.intercept_)
print("Coefficient (β1):", model.coef_[0])
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
# Example prediction
years_exp = 6.5
predicted_salary = model.predict([[years_exp]])
print(f"Predicted salary for {years_exp} years of experience:
${predicted_salary[0]:.2f}")
7. Explanation of the Code
Data Preparation: Simple dataset of years of experience and salary.
Splitting Data: Train/test split for model evaluation.
Model Training: LinearRegression() fits the data to find coefficients.
Prediction: Predict salary for test data.
Evaluation: Calculated MAE, MSE, and RMSE to measure error.
New Prediction: Predict salary for 6.5 years experience.
8. When to Use Linear Regression?
When the target variable is continuous.
When the relationship between independent and dependent variables is approximately
linear.
When the assumptions listed above are met or closely approximated.
Polynomial Regression
1. What is Polynomial Regression?
Polynomial Regression is an extension of Linear Regression that models the relationship
between the independent variable x and the dependent variable y as an nth-degree polynomial.
It’s used when the relationship between variables is non-linear but can be approximated by a
polynomial curve.
2. The Polynomial Regression Model
The model equation for polynomial regression of degree d is:
y = β0 + β1x + β2x^2 + β3x^3 + … + βdx^d + ϵ
βi are the coefficients,
ϵ is the error term,
d is the degree of the polynomial.
3. Why Use Polynomial Regression?
When the data shows a curved trend that a simple linear model cannot capture.
It allows fitting a non-linear relationship using a linear model by transforming features.
Useful in domains like physics, finance, biology where relationships are inherently
nonlinear.
4. How Polynomial Regression Works
1. Transform input features to include polynomial terms (e.g., x, x^2, x^3, etc.).
2. Use linear regression on this expanded feature set.
3. Estimate the coefficients βi to fit the polynomial curve.
5. Assumptions
The relationship between the transformed features and the target is linear.
The usual linear regression assumptions hold on the transformed data.
Avoid very high degree polynomials to prevent overfitting.
6. Python Code Example: Polynomial Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Sample dataset with a nonlinear relationship
np.random.seed(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 - 2 * X + X**2 + np.random.randn(50, 1) * 3  # Quadratic relation with noise
# Plot original data
plt.scatter(X, y, color='blue', label='Data Points')
plt.title("Original Data")
plt.show()
# Polynomial feature transformation (degree=2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X) # Transforms X to [1, x, x^2]
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2,
random_state=42)
# Create and train the linear regression model on polynomial features
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
# Plot the polynomial regression curve
X_fit = np.linspace(0, 10, 100).reshape(-1, 1)
X_fit_poly = poly.transform(X_fit)
y_fit = model.predict(X_fit_poly)
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X_fit, y_fit, color='red', label='Polynomial Regression Curve')
plt.title("Polynomial Regression Fit (Degree 2)")
plt.legend()
plt.show()
7. Explanation of the Code
Created a dataset with a quadratic relationship plus noise.
Used PolynomialFeatures to create polynomial terms up to degree 2.
Applied Linear Regression on the transformed features.
Evaluated model performance using Mean Squared Error.
Visualized original data points and the fitted polynomial regression curve.
8. When to Use Polynomial Regression?
When your data shows a non-linear relationship.
When linear regression underfits and you need a more flexible model.
When the degree of the polynomial is low to moderate to avoid overfitting.
9. Advantages & Disadvantages
Advantages | Disadvantages
Captures non-linear relationships | Can overfit if the degree is too high
Still a linear model in the transformed space | Sensitive to outliers
Easy to implement and interpret | Computational cost increases with the degree
Model Evaluation Metrics for Regression
1. Overview
When predicting continuous values (regression tasks), evaluating the accuracy of predictions
requires metrics that quantify how close the predicted values are to the actual values. The most
common metrics are:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
All measure the average error between predicted values and actual values but differ in how they
penalize errors.
2. Mean Absolute Error (MAE)
Definition: The average of the absolute differences between predicted values and actual
values.
Interpretation: MAE gives the average magnitude of errors in predictions, without
considering their direction (positive or negative).
Advantages:
o Easy to understand and interpret.
o Linear score — all individual differences are weighted equally.
Disadvantages:
o Does not penalize large errors more than small errors.
3. Mean Squared Error (MSE)
Definition: The average of the squared differences between predicted values and actual
values.
Interpretation: MSE measures the average squared difference between actual and
predicted values. Squaring emphasizes larger errors more than smaller ones.
Advantages:
o Penalizes larger errors more severely.
o Differentiable, useful for optimization in many algorithms.
Disadvantages:
o Less interpretable than MAE because errors are squared (units are squared).
o Sensitive to outliers.
4. Root Mean Squared Error (RMSE)
Definition: The square root of the mean squared error.
Interpretation: RMSE gives error magnitude in the same units as the target variable,
making it more interpretable than MSE.
Advantages:
o Combines benefits of MAE and MSE.
o Penalizes larger errors more (like MSE).
o Units match the predicted values, easier to interpret.
Disadvantages:
o Also sensitive to outliers.
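To make the three definitions concrete, the following short NumPy sketch (added for illustration; the values are the same toy numbers used in the sklearn example of Section 6 below) computes MAE, MSE, and RMSE directly from their definitions:
import numpy as np
# Illustrative actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean of absolute differences
mse = np.mean(errors ** 2)      # mean of squared differences
rmse = np.sqrt(mse)             # square root of MSE, same units as y
print(mae, mse, rmse)           # 0.5  0.375  0.612...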
5. When to Use Which?
Metric | Best Used When | Notes
MAE | You want a simple average error | Less sensitive to outliers
MSE | You want to heavily penalize large errors | Useful in many ML algorithms for optimization
RMSE | You want an interpretable metric with penalization of large errors | Most common choice for regression
6. Python Code Example to Calculate MAE, MSE, RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
# Example actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
# Calculate MAE
mae = mean_absolute_error(y_true, y_pred)
# Calculate MSE
mse = mean_squared_error(y_true, y_pred)
# Calculate RMSE
rmse = np.sqrt(mse)
print(f"Mean Absolute Error (MAE): {mae:.3f}")
print(f"Mean Squared Error (MSE): {mse:.3f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")
7. Interpretation of Results (Example)
Given the example:
MAE = 0.5 means on average, predictions are off by 0.5 units.
MSE = 0.375 squares errors so it penalizes larger errors.
RMSE = 0.612 is in the same unit as the predicted variable, showing average error
magnitude.
Logistic Regression
1. What is Logistic Regression?
Logistic Regression is a classification algorithm used to predict the probability of a binary
outcome (0 or 1, True or False, Yes or No). Despite the name "regression", it's used for
classification tasks, not regression.
It models the log-odds of the probability of the outcome as a linear combination of the input
features.
2. Logistic Function (Sigmoid Function)
At the core of logistic regression is the sigmoid function, which maps any real number to a
value between 0 and 1:
σ(z) = 1 / (1 + e^(−z))
Where:
z = β0 + β1x1 + β2x2 + … + βnxn
σ(z) is the predicted probability that the output is 1.
3. Decision Boundary
If σ(z)≥0.5: predict class 1
If σ(z)<0.5: predict class 0
The threshold (default is 0.5) can be adjusted for different sensitivity levels.
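A small Python sketch (illustrative, not part of the module's main example) of the sigmoid mapping and the default 0.5 threshold:
import numpy as np
def sigmoid(z):
    # Maps any real number to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))
z_values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])  # illustrative linear scores
probs = sigmoid(z_values)
predictions = (probs >= 0.5).astype(int)  # apply the default 0.5 threshold
print("Probabilities:", np.round(probs, 3))
print("Predicted classes:", predictions)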
4. Use Cases
Email Spam Detection (Spam or Not Spam)
Credit Scoring (Default or Not)
Medical Diagnosis (Disease or No Disease)
Marketing (Buy or Not Buy)
5. Assumptions
The dependent variable is binary.
There is a linear relationship between the logit of the outcome and the predictors.
Observations are independent.
Large sample size improves performance.
6. Python Code Example (Binary Classification)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Example dataset: Predicting if someone buys a product based on age and income
data = {
'Age': [22, 25, 47, 52, 46, 56, 55, 60],
'Income': [20000, 25000, 48000, 60000, 52000, 75000, 70000, 80000],
'Purchased': [0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Features and target
X = df[['Age', 'Income']]
y = df['Purchased']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=0)
# Model training
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
7. Output Explanation
Confusion Matrix:
o True Positive (TP): Correctly predicted 1s
o True Negative (TN): Correctly predicted 0s
o False Positive (FP): Predicted 1, but actual 0
o False Negative (FN): Predicted 0, but actual 1
Classification Report:
o Precision: TP/(TP+FP)
o Recall: TP/(TP+FN)
o F1-Score: Harmonic mean of precision and recall: 2PR / (P + R)
8. Probability Prediction
You can also get the predicted probability instead of class labels:
y_prob = model.predict_proba(X_test)
print("Predicted Probabilities:\n", y_prob)
The result is a 2-column output:
Column 0: probability of class 0
Column 1: probability of class 1
9. Limitations
Only works for binary/multiclass classification, not regression.
Assumes a linear decision boundary in feature space.
Sensitive to outliers and irrelevant features.
10. Extensions
Multinomial Logistic Regression for multiclass classification.
Regularized Logistic Regression (with L1 or L2) to handle overfitting.
Logistic Regression with Interaction Terms to model feature interactions.
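As a brief illustration of the regularized variant, the sketch below fits L2- and L1-penalized logistic regression with scikit-learn; the dataset (load_breast_cancer) and the C value are arbitrary choices for demonstration:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Illustrative binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale features so the regularization penalty treats them comparably
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# L2-regularized logistic regression; smaller C means stronger regularization
model_l2 = LogisticRegression(penalty='l2', C=0.1, max_iter=1000)
model_l2.fit(X_train_s, y_train)
print("L2 test accuracy:", model_l2.score(X_test_s, y_test))
# L1 regularization drives some coefficients exactly to zero (implicit feature selection)
model_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
model_l1.fit(X_train_s, y_train)
print("L1 non-zero coefficients:", (model_l1.coef_ != 0).sum())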
K-Nearest Neighbors (KNN)
1. What is KNN?
K-Nearest Neighbors is a supervised learning algorithm used for both classification and
regression. It assumes that similar data points are located close to each other in feature space. It
classifies a data point based on how its neighbors are classified.
KNN is an instance-based learning or lazy learning algorithm:
It does not learn a model during training.
Instead, it memorizes the training dataset and classifies new data points using a distance
metric.
2. How KNN Works (Classification)
1. Choose the number of neighbors K.
2. Calculate the distance between the new data point and all points in the training data.
3. Select the K nearest neighbors based on distance.
4. Perform a majority vote: the most common class among neighbors is assigned to the
new point.
3. Distance Metrics
Commonly used distance functions:
Euclidean Distance (most common): d(p, q) = sqrt(Σ (pi − qi)^2)
Manhattan Distance: d(p, q) = Σ |pi − qi|
Minkowski Distance: d(p, q) = (Σ |pi − qi|^m)^(1/m), a generalization of both (m = 2 gives Euclidean, m = 1 gives Manhattan)
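The sketch below (an added illustration using made-up points similar to the example in Section 5) computes Euclidean and Manhattan distances by hand and performs the majority vote for K = 3:
import numpy as np
from collections import Counter
# Illustrative training points and labels
X_train = np.array([[1, 1], [2, 1], [3, 2], [6, 5], [7, 6], [8, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])
new_point = np.array([5, 4])
# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
# Manhattan distance: sum of absolute differences
manhattan = np.abs(X_train - new_point).sum(axis=1)
# Majority vote among the K = 3 nearest neighbors (by Euclidean distance)
k = 3
nearest = np.argsort(euclidean)[:k]
predicted_class = Counter(y_train[nearest]).most_common(1)[0][0]
print("Euclidean distances:", np.round(euclidean, 2))
print("Manhattan distances:", manhattan)
print("Predicted class:", predicted_class)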
4. Example Use Cases
Handwritten digit recognition (e.g., MNIST dataset)
Recommendation systems
Image recognition
Credit scoring
5. Python Code Example (KNN Classification)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
# Sample dataset: Feature1 and Feature2 used to predict class (0 or 1)
data = {'Feature1': [1, 2, 3, 6, 7, 8],'Feature2': [1, 1, 2, 5, 6, 7],
'Label': [0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)
# Define features and target
X = df[['Feature1', 'Feature2']]
y = df['Label']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)
# Create the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict
y_pred = knn.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
6. Choosing the Best Value of K
Small K (e.g., 1 or 3): Model is sensitive to noise (high variance).
Large K: Model becomes too generalized (high bias).
Common approach: Use cross-validation to find the optimal K.
# Finding optimal K using error rate plot
import matplotlib.pyplot as plt
error_rate = []
# Note: n_neighbors cannot exceed the number of training samples (only 4 in this tiny example)
k_values = range(1, len(X_train) + 1)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    error_rate.append((pred_k != y_test).mean())
plt.plot(k_values, error_rate, marker='o')
plt.title("Error Rate vs. K")
plt.xlabel("K")
plt.ylabel("Error Rate")
plt.show()
7. Advantages of KNN
Simple and easy to implement.
No assumptions about data distribution.
Naturally handles multi-class classification.
Works well with small to medium-sized datasets.
8. Disadvantages of KNN
Problem | Description
Slow Prediction | KNN stores all training data and computes distances for every prediction
Sensitive to Irrelevant Features | All features are treated equally unless scaled or weighted
Curse of Dimensionality | High-dimensional data reduces accuracy
Imbalanced Data | Can bias toward the majority class
9. Feature Scaling in KNN
KNN is distance-based, so features should be scaled:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Proceed with train/test split and KNN model
10. KNN for Regression
KNN can also predict continuous values by averaging the target values of the nearest neighbors:
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
Decision Trees
1. What is a Decision Tree?
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It models decisions and their possible consequences as a tree structure:
Nodes represent tests on features.
Branches represent outcomes of the test.
Leaves represent final class labels or continuous values.
2. How Does a Decision Tree Work?
The tree is built by splitting the dataset on features that best separate the data into classes
(for classification) or predict continuous values (for regression).
The process starts at the root node and recursively splits until a stopping condition is met
(e.g., max depth, minimum samples).
The goal is to create pure nodes where most samples belong to the same class or have
minimal variance.
3. Splitting Criteria
For Classification:
Gini Impurity: Measures the probability of misclassifying a random sample.
Entropy: Measures the disorder or uncertainty.
The algorithm chooses splits that maximize information gain, which is the reduction in
impurity.
For Regression:
Mean Squared Error (MSE) or Mean Absolute Error (MAE) are used to choose
splits.
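To illustrate the classification criteria, here is a small sketch (added to these notes) with hand-written gini and entropy helper functions applied to the class labels of a single hypothetical node; scikit-learn computes these same quantities internally when you set the criterion parameter.
import numpy as np
def gini(labels):
    # Probability of misclassifying a randomly chosen sample from this node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)
def entropy(labels):
    # Disorder/uncertainty of the class distribution in this node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
node = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # illustrative class labels in one node
print("Gini impurity:", gini(node))   # 1 - (3/8)^2 - (5/8)^2 = 0.46875
print("Entropy:", entropy(node))      # about 0.954 bits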
4. Advantages of Decision Trees
Easy to interpret and visualize.
Requires little data preprocessing.
Can handle both numerical and categorical data.
Non-parametric, so no assumptions about data distribution.
Can capture non-linear relationships.
5. Disadvantages
Drawback | Description
Overfitting | Trees can grow too complex and fit noise
Instability | Small changes in data can result in very different trees
Bias towards features with more levels | Features with many categories may dominate splits
6. Python Code Example: Decision Tree Classifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
# Sample data (Iris dataset for classification)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Initialize and train Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Visualize the tree
plt.figure(figsize=(12,8))
plot_tree(clf, feature_names=iris.feature_names,
class_names=iris.target_names, filled=True)
plt.show()
7. Explanation
The dataset is split into training and testing sets.
The decision tree is trained with a maximum depth of 3 to avoid overfitting.
Predictions and evaluation metrics are printed.
The tree structure is visualized using plot_tree.
8. Parameters to Tune
max_depth: Maximum depth of the tree (controls overfitting).
min_samples_split: Minimum number of samples required to split a node.
min_samples_leaf: Minimum number of samples required at a leaf node.
criterion: Splitting criteria, “gini” for Gini impurity, “entropy” for information gain.
9. Use Cases
Medical Diagnosis (e.g., cancer classification)
Customer Churn Prediction
Fraud Detection
Credit Scoring
Regression Trees
Regression Trees are a type of decision tree used when the target variable is continuous (rather
than categorical, as in classification problems). They work by splitting the data into smaller and
smaller subsets, forming a tree structure that predicts numeric outcomes.
Key Concepts
1. Structure of a Regression Tree
Root Node: The entire dataset.
Internal Nodes: Conditions based on features (e.g., X1 <= 5).
Leaves: Final predictions, which are the mean of target values in that subset.
2. How It Works
Basic Steps:
1. Split the data at a feature and threshold that minimizes the error.
2. Repeat recursively on each child node until a stopping criterion is met (e.g., minimum
samples per leaf, maximum depth).
3. Predict the average value in each leaf.
3. Splitting Criteria
Regression trees aim to minimize mean squared error (MSE) at each split.
MSE Formula:
MSE = (1/n) Σ (yi − ȳ)^2, where ȳ is the mean of the target values in the node.
At each step, the algorithm chooses the feature and threshold that gives the largest reduction in
MSE.
4. Example
Let’s say we want to predict house prices based on square footage and number of bedrooms.
First Split:
If square_footage <= 1500, go left. Else, go right.
Second Split (Left branch):
If bedrooms <= 2, go left. Else, go right.
At each terminal leaf, the predicted house price is the average of all training prices in that
subset.
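A minimal scikit-learn sketch of this house-price example follows; the square footage, bedroom, and price values are invented purely for illustration. export_text prints the learned splits so you can see the thresholds the tree chose.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text
# Illustrative data for the house-price example above
df = pd.DataFrame({
    'square_footage': [900, 1200, 1400, 1600, 1800, 2200, 2600, 3000],
    'bedrooms':       [2, 2, 3, 3, 3, 4, 4, 5],
    'price':          [150000, 180000, 200000, 240000, 260000, 320000, 370000, 430000]
})
X = df[['square_footage', 'bedrooms']]
y = df['price']
# Shallow tree so the splits stay readable; each leaf predicts the mean price in that subset
tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X, y)
print(export_text(tree, feature_names=['square_footage', 'bedrooms']))
print("Predicted price for 1500 sq ft, 2 bedrooms:",
      tree.predict(pd.DataFrame({'square_footage': [1500], 'bedrooms': [2]}))[0])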
5. Advantages
Easy to interpret and visualize.
Can handle non-linear relationships.
Requires little data preprocessing (no need to scale variables).
6. Disadvantages
Prone to overfitting (can memorize noise in data).
Unstable: Small changes in data can produce different trees.
Can be biased toward features with more levels.
7. Regularization Techniques
To control overfitting:
Max Depth: Limit depth of the tree.
Min Samples Split: Minimum number of samples required to split a node.
Min Samples Leaf: Minimum number of samples in a leaf node.
Max Features: Limit number of features to consider at each split.
8. Ensembles of Regression Trees
Random Forest: Builds multiple regression trees and averages their predictions.
Gradient Boosting: Builds trees sequentially, each one correcting errors of the previous.
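A short sketch (added for illustration, using synthetic data from make_regression) comparing the two ensemble approaches on a regression task:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Synthetic regression data for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Random Forest: averages many trees built on bootstrap samples
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest MSE:", mean_squared_error(y_test, rf.predict(X_test)))
# Gradient Boosting: builds trees sequentially, each correcting the previous errors
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
print("Gradient Boosting MSE:", mean_squared_error(y_test, gb.predict(X_test)))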
Use Cases
Predicting house prices
Forecasting demand
Estimating insurance costs
Any regression task with potentially non-linear relationships
Random Forests
1. What is a Random Forest?
Random Forest is an ensemble machine learning algorithm used for both classification and
regression tasks. It builds multiple decision trees and merges their results to improve accuracy
and control overfitting.
It combines the idea of bagging (Bootstrap Aggregating) and random feature selection.
Each tree is built on a random subset of data and random subset of features.
The final prediction is made by majority voting (classification) or averaging
(regression).
2. How Does Random Forest Work?
1. From the original dataset, generate multiple bootstrap samples (random samples with
replacement).
2. For each bootstrap sample, grow a decision tree but at each split, only consider a random
subset of features (not all features).
3. Each tree is grown to the largest extent possible without pruning.
4. For classification, aggregate predictions by majority vote.
5. For regression, average the predictions from all trees.
3. Key Concepts
Bagging: Reduces variance by training multiple models on different subsets of data.
Random Feature Selection: Adds diversity among trees by only considering a random
subset of features when splitting.
4. Advantages of Random Forests
Handles large datasets with higher dimensionality.
Reduces overfitting compared to a single decision tree.
Works well with missing values and maintains accuracy.
Can handle both categorical and continuous variables.
Provides feature importance scores.
5. Disadvantages
Disadvantage | Description
Less interpretable | Ensemble nature makes it harder to interpret than a single tree
Slower prediction | More computational resources needed for many trees
Can overfit with noisy data | Though less prone than single trees, overfitting is still possible
6. Python Code Example: Random Forest Classifier
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Initialize and train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Feature importance
import matplotlib.pyplot as plt
import numpy as np
feat_importances = rf.feature_importances_
indices = np.argsort(feat_importances)[::-1]
features = iris.feature_names
plt.title("Feature Importances")
plt.bar(range(len(features)), feat_importances[indices], align='center')
plt.xticks(range(len(features)), [features[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()
7. Explanation of the Code
Load the Iris dataset and split into train and test sets.
Train a random forest classifier with 100 trees (n_estimators=100).
Predict and evaluate using confusion matrix and classification report.
Plot feature importance to understand which features contribute most to predictions.
8. Hyperparameters to Tune
Parameter | Description
n_estimators | Number of trees in the forest
max_depth | Maximum depth of each tree
min_samples_split | Minimum samples required to split a node
min_samples_leaf | Minimum samples required at a leaf node
max_features | Number of features to consider at each split
bootstrap | Whether bootstrap samples are used (default True)
9. When to Use Random Forests?
When you want high accuracy and robust predictions.
When you have large datasets with many features.
When you want to reduce the risk of overfitting from a single decision tree.
When feature importance is needed.
Model Evaluation Metrics for Classification
1. Confusion Matrix (Foundation)
All these metrics are derived from the confusion matrix, which summarizes the performance of
a classification model:
 | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
True Positive (TP): Correctly predicted positive class.
False Positive (FP): Incorrectly predicted positive class.
True Negative (TN): Correctly predicted negative class.
False Negative (FN): Incorrectly predicted negative class.
2. Accuracy
Definition: Ratio of correctly predicted observations to total observations.
Interpretation: Overall effectiveness of the model.
Limitation: Not reliable when classes are imbalanced.
3. Precision
Definition: Ratio of correctly predicted positive observations to total predicted positives.
Interpretation: Of all instances predicted as positive, how many are actually positive?
When to use: Important when the cost of false positives is high (e.g., spam detection,
fraud detection).
4. Recall (Sensitivity or True Positive Rate)
Definition: Ratio of correctly predicted positive observations to all actual positives.
Interpretation: Of all actual positive cases, how many did the model identify correctly?
When to use: Important when the cost of false negatives is high (e.g., disease diagnosis).
5. F1-Score
Definition: Harmonic mean of Precision and Recall.
Interpretation: Balance between precision and recall. Useful when you need a balance
between false positives and false negatives.
6. ROC Curve and AUC
ROC Curve (Receiver Operating Characteristic)
Plots True Positive Rate (Recall) against False Positive Rate (FPR) for different
classification thresholds.
Shows trade-offs between sensitivity and specificity.
AUC (Area Under Curve)
Scalar value between 0 and 1 summarizing the ROC curve.
Interpretation:
o 1: Perfect classifier
o 0.5: No discrimination (random guessing)
o Closer to 1 = better model.
7. Python Code Example to Calculate Metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix, roc_curve)
import matplotlib.pyplot as plt
# Sample true and predicted labels
y_true, y_pred = [0,0,1,1,0,1,0,1,1,0],[0,1,1,1,0,0,0,1,0,0]
y_scores = [0.1, 0.4, 0.8, 0.75, 0.3, 0.2, 0.1, 0.85, 0.5, 0.05]
# predicted probabilities for class 1
# Calculate metrics
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_scores))
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
plt.plot(fpr, tpr, label="ROC Curve (AUC =
{:.2f})".format(roc_auc_score(y_true, y_scores)))
plt.plot([0, 1], [0, 1], 'k--') # Random guess line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.show()
8. Summary Table
Metric | Formula | Purpose | Use Case Example
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness | Balanced classes
Precision | TP / (TP + FP) | Correctness of positive predictions | Spam detection (avoid false alarms)
Recall | TP / (TP + FN) | Ability to find all positive cases | Disease diagnosis (avoid misses)
F1-Score | Harmonic mean of precision & recall | Balance of precision & recall | When classes are imbalanced
ROC-AUC | Area under ROC curve | Discrimination capability | Model comparison, threshold tuning
Train-Test Split and Cross Validation
1. Train-Test Split
What is it?
Dividing your dataset into two parts:
o Training set: Used to train the model.
o Testing set: Used to evaluate how well the model generalizes to new, unseen
data.
Why?
To avoid overfitting: A model might perform very well on the training data but poorly
on new data.
To get an unbiased estimate of model performance.
How?
A common split is 70-80% for training and 20-30% for testing.
Done randomly to ensure representative samples.
Python Example with sklearn:
from sklearn.model_selection import train_test_split
X = ... # Features
y = ... # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
print(f"Train size: {len(X_train)}")
print(f"Test size: {len(X_test)}")
test_size=0.2: 20% test data
random_state=42: Ensures reproducibility
2. Cross Validation (CV)
What is it?
A method to evaluate model performance more reliably by splitting data into multiple
training/testing sets.
Instead of one train-test split, data is split multiple times.
Common Types:
a) K-Fold Cross Validation
Data is split into k equal folds.
For each iteration:
o Train on k-1 folds.
o Test on the remaining fold.
Repeat k times, each fold used once as test.
Performance is averaged over the k runs.
Example with 5 folds:
Fold 1: Train on folds 2,3,4,5; test on fold 1
Fold 2: Train on folds 1,3,4,5; test on fold 2
...
Fold 5: Train on folds 1,2,3,4; test on fold 5
b) Stratified K-Fold
Like K-Fold but preserves the percentage of samples for each class.
Useful for classification with imbalanced classes.
c) Leave-One-Out Cross Validation (LOOCV)
Extreme case where k = number of samples.
Train on all data except one sample, test on that single sample.
Repeated for all samples.
Very computationally expensive.
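The sketch below (an added illustration on the Iris dataset) shows how the Stratified K-Fold and LOOCV strategies above can be plugged into cross_val_score via the cv argument:
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# Stratified K-Fold: each fold keeps the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print("Stratified 5-fold accuracies:", skf_scores)
# LOOCV: one sample held out per iteration (150 fits for the 150-sample Iris dataset)
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print("LOOCV mean accuracy:", loo_scores.mean())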
Why Cross Validation?
Provides more stable and less biased estimate of model performance.
Helps tune hyperparameters with better confidence.
Reduces variance due to data splitting randomness.
Python Example using K-Fold CV:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
X = ... # feature matrix
y = ... # labels
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression()
accuracies = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
print("Cross-validated accuracies:", accuracies)
print("Mean accuracy:", np.mean(accuracies))
Using cross_val_score (simpler):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("CV accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
Hyperparameter Tuning using Grid Search
1. What are Hyperparameters?
Hyperparameters are parameters set before training a model.
Unlike model parameters (like weights in linear regression) which are learned during
training, hyperparameters control the learning process.
Examples:
o Number of neighbors in KNN (n_neighbors)
o Depth of a decision tree (max_depth)
o Learning rate in gradient boosting
o Regularization strength (C) in logistic regression or SVM
2. Why Tune Hyperparameters?
Hyperparameters significantly affect model performance.
Poor choice can lead to underfitting or overfitting.
Tuning helps find the best combination of hyperparameters to maximize model
accuracy, precision, recall, or other metrics.
3. What is Grid Search?
Grid Search is an exhaustive search technique to find the optimal hyperparameters.
You specify a grid (set) of hyperparameter values.
The algorithm trains and evaluates the model on every possible combination of these
parameters.
The best parameter set is chosen based on cross-validation performance.
4. How Grid Search Works
1. Define model and hyperparameter grid (dictionary of hyperparameters and lists of
values).
2. For each combination in the grid:
o Train the model with that combination on training folds.
o Validate it on validation folds.
3. Calculate the average validation score.
4. Select the combination with the best average score.
5. Using Grid Search in Python with scikit-learn
Step 1: Import libraries
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
Step 2: Prepare data
data = load_iris()
X, y = data.data, data.target
Step 3: Define model and parameter grid
model = RandomForestClassifier(random_state=42)
param_grid = {
'n_estimators': [10, 50, 100],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5, 10]
}
Step 4: Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
cv=5, scoring='accuracy', n_jobs=-1)
cv=5: 5-fold cross-validation
scoring='accuracy': Metric to evaluate
n_jobs=-1: Use all CPU cores to speed up
Step 5: Fit Grid Search to data
grid_search.fit(X, y)
Step 6: Get best parameters and score
print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)
6. Interpreting Results
grid_search.best_params_: Shows the hyperparameter combination with highest
mean CV score.
grid_search.best_score_: The corresponding average CV accuracy.
You can use grid_search.best_estimator_ to get the model trained with best
hyperparameters.
7. Advanced Tips
RandomizedSearchCV: Instead of exhaustive search, samples hyperparameters
randomly (faster for large grids).
Custom Scoring: Use metrics like precision, recall, F1-score by setting scoring
parameter.
Refit: GridSearchCV refits the best model on the entire dataset by default.
Pipeline Integration: Combine preprocessing and model tuning in a single pipeline with
GridSearch.
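As an illustration of the RandomizedSearchCV alternative mentioned above, here is a short sketch; the parameter ranges and the n_iter value are arbitrary choices for demonstration:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from scipy.stats import randint
data = load_iris()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=42)
# Distributions / lists to sample from (illustrative ranges)
param_distributions = {
    'n_estimators': randint(10, 200),
    'max_depth': [None, 3, 5, 10],
    'min_samples_split': randint(2, 11)
}
# Evaluates n_iter random combinations instead of the full grid
random_search = RandomizedSearchCV(model, param_distributions, n_iter=20,
                                   cv=5, scoring='accuracy', random_state=42, n_jobs=-1)
random_search.fit(X, y)
print("Best parameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)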
8. Example: Full Code
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
data = load_iris()
X, y = data.data, data.target
# Define model and grid
model = RandomForestClassifier(random_state=42)
param_grid = {
'n_estimators': [10, 50, 100],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5, 10]
}
# Grid Search with 5-fold CV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy',
n_jobs=-1)
grid_search.fit(X, y)
print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)