MODULE 3
Decision Tree
• Decision trees are a popular machine learning
algorithm used for both classification and
regression tasks. They work by:
1. Splitting data based on feature values
2. Starting from a root node and branching down to
leaf nodes
3. Making decisions at each internal node
4. Providing a final prediction or classification at leaf
nodes
(Figure: a simple decision tree to decide how to spend the evening.)
ENTROPY
• Entropy is the measure of information content, which describes the amount of impurity in a set of examples.
• For a set of probabilities pᵢ, the entropy is H(S) = − Σᵢ pᵢ log₂(pᵢ).
• Suppose we have a dataset with two classes: positive (P) and negative (N). We can calculate
the entropy of the dataset as follows:
• All Positive Examples:
• p(P)=1
• p(N)=0
• Entropy: H(S) = - (1 log2(1) + 0 log2(0)) = 0
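As a quick illustration (a minimal Python sketch, not from the slides, using the convention 0·log₂0 = 0), the entropy of a list of class labels can be computed as follows:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / total
        if p > 0:                          # convention: 0 * log2(0) = 0
            h -= p * math.log2(p)
    return h

print(entropy(["P", "P", "P", "P"]))       # all positive  -> 0.0
print(entropy(["P", "N", "N", "N"]))       # 1/4 vs 3/4    -> ~0.811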
Information Gain
• Information Gain is defined as the entropy of the whole set minus the weighted entropy of the subsets obtained when a particular feature is chosen:
• Gain(S, F) = H(S) − Σ_{f ∈ values(F)} (|S_f| / |S|) · H(S_f)
• where S is the set of examples
• F is a possible feature out of the set of all possible ones
• |S_f| is the count of the number of members of S that have value f for feature F.
Example of IG
We have a dataset S = {s1, s2, s3, s4} with the following outcomes:
• s1 = Spam (True)
• s2 = Not Spam (False)
• s3 = Not Spam (False)
• s4 = Not Spam (False)
We also have one feature F, the "presence of a specific word" in the email, with possible values {f1, f2, f3}. The feature values for each example are:
• s1 = f2
• s2 = f2
• s3 = f3
• s4 = f1
Calculating the entropy:
1. Determine the proportions of each class:
• Number of Spam (⊕): 1 (s1)
• Number of Not Spam (⊖): 3 (s2, s3, s4)
• Total examples: 4
2. Calculate the proportions:
p(spam) = 1/4 = 0.25
p(not spam) = 3/4 = 0.75
3. Calculate the entropy of S:
H(S) = −( p(spam) log₂(p(spam)) + p(not spam) log₂(p(not spam)) )
H(S) = −( 0.25 log₂(0.25) + 0.75 log₂(0.75) )
H(S) = −( 0.25·(−2) + 0.75·(−0.415) )
H(S) = −( −0.5 − 0.311 )
H(S) = 0.811
Example of IG (continued)
• S = {s1, s2, s3, s4}
• The feature F can take values {f1, f2, f3}
• s1: F=f2
• s2: F=f2
• s3: F=f3
• s4: F=f1
• Entropy of each subset:
H(S_f1) = 0
H(S_f2) = −( (1/2) log₂(1/2) + (1/2) log₂(1/2) ) = 1
H(S_f3) = 0
• Weighted sum of entropy (|S_f| is the number of examples with value f for feature F):
(|S_f1|/|S|) H(S_f1) + (|S_f2|/|S|) H(S_f2) + (|S_f3|/|S|) H(S_f3) = (1/4)·0 + (2/4)·1 + (1/4)·0 = 0 + 1/2 + 0 = 0.5
• Information Gain:
Gain(S, F) = H(S) − [ (|S_f1|/|S|) H(S_f1) + (|S_f2|/|S|) H(S_f2) + (|S_f3|/|S|) H(S_f3) ]
= 0.811 − 0.5
= 0.311
(Figure: the decision tree for feature F.)
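The same Information Gain calculation can be scripted; this sketch reuses the entropy helper above and reproduces Gain(S, F) ≈ 0.311 for the s1…s4 spam example:

def information_gain(labels, feature_values):
    """Gain(S, F) = H(S) - sum_f (|S_f|/|S|) * H(S_f)."""
    total = len(labels)
    weighted = 0.0
    for v in set(feature_values):
        subset = [lab for lab, fv in zip(labels, feature_values) if fv == v]
        weighted += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted

labels  = ["spam", "not_spam", "not_spam", "not_spam"]   # s1..s4
feature = ["f2", "f2", "f3", "f1"]                        # values of F
print(information_gain(labels, feature))                  # ~0.311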
ID3 Algorithm
• It is a decision tree algorithm used in machine learning and data mining for creating a
decision tree from a dataset.
• The ID3 algorithm is used to generate a decision tree by employing a top-down, greedy
approach that selects the attribute with the highest information gain (or equivalently, the
lowest entropy) to split the data at each node.
• The output of the algorithm is the tree, i.e., a list of nodes, edges, and leaves.
Algorithm
•Base cases:
•If all examples have the same label, return that label.
•If no features are left, return the most common label.
•Recursive case:
•Select the best feature F that maximizes information gain.
•For each possible value of F:
•Create a new branch.
•Remove F from the feature set.
•Recursively build a subtree using the remaining features and examples.
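A compact recursive sketch of these steps, assuming the entropy and information_gain helpers defined earlier and a dataset stored as a list of dicts with a 'label' key (an illustrative implementation, not the lecture's exact code):

from collections import Counter

def id3(rows, features):
    """Recursive ID3: rows is a list of dicts, features is a list of feature keys."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:                  # base case 1: all examples share a label
        return labels[0]
    if not features:                           # base case 2: no features left
        return Counter(labels).most_common(1)[0][0]
    # recursive case: pick the feature with the highest information gain
    best = max(features,
               key=lambda f: information_gain(labels, [r[f] for r in rows]))
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # one branch per feature value
        subset = [r for r in rows if r[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset, remaining)
    return tree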
Example
• The dataset examples:
• "Hey, check this out, meet singles in your area!" - SPAM
• "Team meeting minutes" - NOT SPAM
• "Exclusive offer inside" - SPAM
• "Re: Project proposal" - NOT SPAM
• "You're a winner!" - SPAM
• "Lunch tomorrow?" - NOT SPAM
• "Urgent: Account verify" - SPAM
• "Hot deal, act now!" - SPAM
• "Quick question" - NOT SPAM
• "Free trial ending" - SPAM
• Features used:
1. Contains suspicious keywords
2. Sender is in contacts
3. Email length
4. Time of day sent
5. Contains attachments
• If all examples have the same label, return that label.
• If no features are left, return the most common label.
• At each recursive step, the algorithm recalculates the Information Gain for all remaining features.
• It always chooses the feature with the maximum IG for the next split.
• The IG value changes because we are now working with a subset of the data: after the first split on "Suspicious Keywords", the branch contains only those emails that don't contain suspicious keywords, and this subset may have different characteristics than the full dataset.
Construction of Decision Tree based on IG
Construction of Decision Tree
Decision Tree for Continuous Variables or Features
• The simplest solution is to discretise the continuous variable.
• The continuous variables x1 and x2 are discretised with thresholds v and w.
Final Tree
Types of Trees
• Univariate Tree
• Each node splits on a single variable
(either Age or Income).
• The splits are simple threshold
comparisons (e.g., Age > 30).
• The tree is easy to interpret.
Types of Trees
• Multivariate Tree
• The decision node uses a
combination of variables (Age and
Income).
• The split condition is more complex,
involving a linear combination of
features. This tree is more compact
but potentially harder to interpret.
CART – Classification and Regression Trees
• CART stands for Classification And
Regression Trees
• Can be used for both classification and
regression tasks
• Specifically constructs binary trees
• Binary Tree Advantage: While it might seem
limiting at first, binary trees are actually
advantageous. There are computational
efficiency reasons for preferring binary trees
• Supports the discussion of computational costs.
• Binary Tree Conversion: Multi-way decisions
can be transformed into a series of binary
decisions
• Example: a three-way decision about assignment deadlines ('urgent', 'near', 'none') can be converted into two binary questions: "Is the deadline urgent?" (Yes/No); if No, "Is the deadline near?" (Yes/No).
How does the CART algorithm work?
• Step 1: Find the best split point for each feature/input variable in the dataset
• Evaluate all possible split points for numerical variables
• Calculate the impurity measure (Gini index for classification, MSE for regression)
• Select the split point that maximises information gain or, equivalently, reduces impurity
• Step 2: Select the best feature by comparing the best split points found for each feature
• Choose the feature and split point combination that gives the maximum improvement
• For classification: minimises Gini impurity or entropy
• For regression: minimises Mean Squared Error (MSE)
• Step 3: Split the node
• Divide the data into two child nodes based on the chosen split
• The left node contains the samples meeting the split condition; the right node contains the remaining samples
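For reference, a minimal usage sketch assuming scikit-learn is available (its DecisionTreeClassifier is based on the CART approach of binary, Gini-driven splits); the ages/incomes below are made-up toy values:

from sklearn.tree import DecisionTreeClassifier

# toy data: [age, income] -> bought (1) / did not buy (0); values are illustrative only
X = [[22, 20000], [35, 60000], [48, 82000], [29, 40000], [52, 30000], [41, 95000]]
y = [0, 1, 1, 0, 0, 1]

# Gini impurity drives the binary splits; for regression, DecisionTreeRegressor
# with a squared-error criterion would be used instead
tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree.fit(X, y)
print(tree.predict([[30, 70000]]))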
Gini Impurity
• Gini = 1 - ∑ᵢᵏ p(classᵢ)²
• N(i) = number of samples in class i
• p(i) = N(i) / total_samples (proportion)
• Pure vs impure nodes:
• Pure node: all samples belong to the same class (Gini = 0)
• Impure node: samples belong to different classes (Gini > 0)
• For example, in a binary classification problem (k = 2):
• Gini = 1 - (p(class₁)² + p(class₂)²)
• Binary classification examples:
• Perfect split (50-50): Gini = 1 - (0.5² + 0.5²) = 0.5
• Pure node (100-0): Gini = 1 - (1² + 0²) = 0
• 70-30 split: Gini = 1 - (0.7² + 0.3²) = 0.42
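A small Python sketch that reproduces the three worked Gini values:

def gini(proportions):
    """Gini impurity = 1 - sum(p_i^2) over the class proportions."""
    return 1.0 - sum(p * p for p in proportions)

print(gini([0.5, 0.5]))   # perfect 50-50 split -> 0.5
print(gini([1.0, 0.0]))   # pure node           -> 0.0
print(gini([0.7, 0.3]))   # 70-30 split         -> 0.42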
Regression in CART
• Main difference from classification: uses Sum of Squares Error (SSE) instead of Gini/entropy as the impurity measure
• ȳ is the mean of the segment
• Leaf nodes predict a mean value instead of a class
• Continuous output values instead of discrete classes
• Prediction process: each leaf node stores the mean value of its training samples
• Predictions are constant within each leaf region
• New samples get a prediction based on which leaf they fall into
Ensemble learning
Ensemble learning helps improve
machine learning results by combining
several models.
This approach allows the production of
better predictive performance compared
to a single model.
The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Why do ensembles work?
• Ensembles work because they combine multiple "good but imperfect" solutions into one better solution, similar to how group decision-making often outperforms individual decisions.
• Statistical problem - like asking many doctors instead of just one. Even if each doctor can make mistakes, their collective opinion is usually more reliable.
• Computational problem - like having multiple people search for your lost keys starting from different places. They're more likely to find them than one person searching alone.
• Representational problem - like combining different tools (hammer, screwdriver, wrench) when no single tool can fix everything. Different models can capture different aspects of the problem.
Types of Ensemble Classifier
• Bagging
• Bagging (Bootstrap Aggregating) creates multiple models on random subsets of the data
• Models run in parallel
• Final prediction by voting/averaging
• Example: Random Forest
• Reduces variance; good for high-variance models like decision trees
• Boosting
• Boosting builds models sequentially
• Each new model focuses on the errors made by previous models
• Models are weighted by accuracy
• Examples: AdaBoost, XGBoost, LightGBM
• Reduces bias and variance
Boosting
• The core idea of boosting is that we can
combine many "weak learners" (classifiers
that perform just slightly better than random
guessing) into a strong ensemble classifier.
• AdaBoost (Adaptive Boosting) works by:
Starting with equal weights for all training
samples
• Iteratively training weak learners.
• Updating sample weights after each iteration
to focus more on misclassified samples
• Combining the weak learners with weights based on their performance.
(Figure: the cross is misclassified, so its weight increases in boosting, shown by the datapoint being drawn larger; this increases the importance of those datapoints, making the classifiers pay more attention to them.)
AdaBoost
• Start with equal weights for all training examples (1/N each).
• For each round:
• Train a simple classifier using the current weights.
• Calculate how many mistakes it makes (the weighted error). The error is computed as the sum of the weights of the misclassified points.
• Give more weight to the examples it got wrong: the weights of the incorrect examples are updated by multiplying them by a factor that depends on the classifier's weighted error.
• Reduce the weight of the examples it got right.
• Make sure all weights still add up to 1.
• An indicator function is defined as: return 1 if the target and the output are not equal, 0 otherwise.
• Stop if:
• We reach the maximum number of rounds, or
• All examples are correctly classified.
Adaboost Algorithm
• Initialisation: the weights are set to 1/N for the N datapoints. S is the training set with weights w:
S = {(x₁,y₁), (x₂,y₂), ..., (xₙ,yₙ)}
• hₜ(xₙ) is the prediction that the t-th weak classifier makes for the datapoint xₙ.
• Main loop: train a weak classifier on the weighted data and update the weights; Zₜ is a normalisation constant that keeps the weights summing to 1.
• Output: the weighted combination of the weak classifiers.
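A minimal numpy/scikit-learn sketch of these weight updates for labels in {-1, +1}, using depth-1 decision trees (stumps) as the weak learners; this is an illustrative implementation under those assumptions, not the lecture's exact notation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    """y must be in {-1, +1}. Returns (weak learners, alphas)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                     # initial weights 1/N
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)       # weak learner
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])                        # weighted error
        if err == 0:                                      # everything correct: keep and stop
            learners.append(stump); alphas.append(1.0)
            break
        if err >= 0.5:                                    # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)             # classifier weight
        w *= np.exp(-alpha * y * pred)                    # up-weight mistakes, down-weight hits
        w /= w.sum()                                      # renormalise (the Z_t step)
        learners.append(stump); alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)                                 # weighted vote of the weak learners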
Stumping
• There is a very extreme form of boosting that is applied to trees.
• Stumping consists of simply taking the root of the tree and using that as the decision maker.
• So for each classifier, use the very first question that makes up the root of the tree, and that is it.
• The overall output of stumping can be very successful.
• This is why it's called "stumping":
• Like a tree stump (just the base remains)
• One simple decision
• Combined with AdaBoost's weighting
(Figures: H1(x) with +1 = green and -1 = red; the three classifiers; the combined classifier output. h1 classifies the point x1 as -1, h2 as +1, h3 as +1.)
Bagging
• The simplest method of combining classifiers is
known as bagging, which stands for bootstrap
aggregating.
• The Bootstrap Part: You have a bag of 100 marbles. You grab a marble, write down what it is, and put it back. You do this 100 times (some marbles you might pick multiple times, others not at all). You repeat this whole process many times (50+ times) to create multiple different samples.
• The Aggregating Part: You train a separate model on
each of these samples. When you want to make a
prediction, you ask all your models. Take a vote -
whatever most models predict becomes your final
answer
• Bootstrap sampling helps reduce
model variance by:
• Creating multiple diverse training sets
from the same data
• Each model learns slightly different
patterns
• Averaging multiple models reduces
overfitting
• Final model is more robust and
generalizes better
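A short sketch of bootstrap aggregating with decision trees, assuming numpy and scikit-learn are available (the index sampling with replacement is the "marble" step):

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=50, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap: sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]      # ask every model
    return Counter(votes).most_common(1)[0][0]       # majority vote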
Random Forest
• Random Forest algorithm is a powerful
tree learning technique in Machine
Learning.
• It works by creating a number of Decision
Trees during the training phase.
• Each tree is constructed from a random subset of the dataset and considers only a random subset of the features at each partition.
• This randomness introduces variability
among individual trees, reducing the risk
of overfitting and improving overall
prediction performance
How the Random Forest Algorithm Works
• For each of N trees:
• – create a new bootstrap sample of the
training set
• – use this bootstrap sample to train a decision
tree
• – at each node of the decision tree, randomly select m features, compute the information gain (or Gini impurity) only on that set of features, and select the optimal one
• – repeat until the tree is complete
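In practice this procedure is available directly; a minimal usage sketch assuming scikit-learn is available (n_estimators plays the role of N, max_features the role of m; make_classification just generates synthetic data for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# N = 100 bootstrapped trees, m = sqrt(n_features) candidate features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.score(X, y))   # accuracy of the ensemble on the training data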
Bagging vs Boosting
DIFFERENT WAYS TO COMBINE CLASSIFIERS
• How do the ensemble methods combine the outputs of the different classifiers?
• Both boosting and bagging take a vote from amongst the classifiers, although they do it in different ways:
• boosting takes a weighted vote
• bagging simply takes the majority vote
• Different voting schemes work in ensemble classifiers:
1. Simple majority voting:
• Takes the most common prediction across all classifiers
• Requires an odd number of classifiers to avoid ties
• Outputs a prediction even if the agreement is low
2. Threshold-based voting:
• Only produces an output when a certain threshold of agreement is met
• Common thresholds include:
• Unanimous agreement (100%)
• Strong majority (e.g., 75%)
• Simple majority (>50%)
• Binomial distribution formula: for an ensemble with n classifiers, each with accuracy p, the probability of getting the correct answer with majority voting follows the binomial distribution:
P(majority correct) = Σ_{k = n/2+1}^{n} C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ
• n is the total number of classifiers
• k is the number of classifiers that need to be correct
• C(n, k) calculates how many different ways k classifiers can be correct out of n total classifiers
• p is the individual classifier's accuracy rate
• The sum runs from n/2+1 to n because that is the range needed for majority voting
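A short sketch that evaluates this sum, assuming the n classifiers make independent errors:

from math import comb

def majority_vote_accuracy(n, p):
    """P(majority correct) = sum_{k > n/2} C(n,k) p^k (1-p)^(n-k)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(1, 0.6))    # 0.6   (a single classifier)
print(majority_vote_accuracy(11, 0.6))   # ~0.75 (an ensemble of 11 such classifiers)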
Mixture of networks
• In regression problems, rather than taking the majority vote, it is common to take the mean of the outputs.
• To combine classifiers, there is an algorithm that does precisely this, known as the mixture of experts.
• Process flow: Input → { Classifiers → Individual Assessments; Gating Network → Weights } → Weighted Combination → Final Output
• The system works by:
1. All experts and gates receive the same input features
2. Each expert produces its prediction
3. Gates learn to dynamically weight these predictions based on the input
4. The final output is a weighted combination of the expert predictions
(Figure: the Hierarchical Mixture of Networks, consisting of a set of classifiers (experts) with gating systems that also use the inputs to decide which classifiers to trust.)
The Mixture of Expert Algorithm
•Input Processing:
•Receive input vector x
•Pass x to all experts and all gating networks
•Expert Predictions (Bottom Layer):
•Each expert i computes its output:
•oi = 1 / (1 + exp(-wi·x))
•This gives probability estimates for each expert
•Gating Weights (Level 1):
•Lower-level gates compute their weights:
•gi = exp(vi·x) / Σ exp(vj·x)
•Gate 1,1 computes weights for Experts 1&2
•Gate 1,2 computes weights for Experts 3&4
•Combine Expert Outputs (Level 1):
•For Gate 1,1: output = o1·g1 + o2·g2
•For Gate 1,2: output = o3·g3 + o4·g4
Top-Level Gating (Level 2):
•Gate 2 computes final weights
using same softmax formula
•Weights how much to trust each
level 1 gate
•Final Output:
•Combine level 1 outputs using top-
level gate weights
•Final = (Gate1,1_output ·
Gate2_weight1) + (Gate1,2_output ·
Gate2_weight2)
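A numpy sketch of the prediction path described above for four experts and three gates; the weight matrices here are random placeholders rather than trained values, and only the forward pass (no training) is shown:

import numpy as np

rng = np.random.default_rng(0)
d = 5                                    # input dimension (illustrative)
W = rng.normal(size=(4, d))              # expert weights w1..w4
V1 = rng.normal(size=(2, d))             # gate 1,1 weights (over experts 1 and 2)
V2 = rng.normal(size=(2, d))             # gate 1,2 weights (over experts 3 and 4)
V_top = rng.normal(size=(2, d))          # gate 2 weights (over the two level-1 gates)

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def softmax(z):  e = np.exp(z - z.max()); return e / e.sum()

def mixture_of_experts(x):
    o = sigmoid(W @ x)                   # each expert's probability estimate o_i
    g1 = softmax(V1 @ x)                 # gate 1,1: weights for experts 1,2
    g2 = softmax(V2 @ x)                 # gate 1,2: weights for experts 3,4
    level1 = np.array([o[:2] @ g1, o[2:] @ g2])   # level-1 combined outputs
    g_top = softmax(V_top @ x)           # gate 2: weights over the level-1 outputs
    return level1 @ g_top                # final weighted combination

print(mixture_of_experts(rng.normal(size=d)))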
Basic Statistics
• Mean
• Median
• Mode
• Conditional probability
• Bayes theorem
Bayes Theorem
• Bayes’ Theorem is used to determine the conditional probability of an event:
P(A|B) = P(B|A) P(A) / P(B)
• Posterior probability: P(A|B)
• Prior probability: P(A)
• Likelihood: P(B|A)
• Marginal likelihood: P(B)
• P(A) and P(B) are the probabilities of events A and B; P(B) is never equal to zero.
• P(A|B) is the probability of event A when event B happens
• P(B|A) is the probability of event B when A happens
Example
• Suppose P(Cat) = P(Dog) = 0.5, P(Quiet|Cat) = 0.8 and P(Quiet|Dog) = 0.3.
• P(Quiet) = P(Quiet|Cat)×P(Cat) + P(Quiet|Dog)×P(Dog)
= 0.8 × 0.5 + 0.3 × 0.5
= 0.4 + 0.15 = 0.55
• Bayes' Theorem: P(Cat|Quiet) = [P(Quiet|Cat) × P(Cat)] / P(Quiet)
= (0.8 × 0.5) / 0.55
≈ 0.727
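The same cat-vs-dog calculation written out in a few lines of Python:

p_cat, p_dog = 0.5, 0.5                  # priors
p_quiet_cat, p_quiet_dog = 0.8, 0.3      # likelihoods P(Quiet | pet)

p_quiet = p_quiet_cat * p_cat + p_quiet_dog * p_dog   # total probability = 0.55
p_cat_quiet = (p_quiet_cat * p_cat) / p_quiet          # Bayes' theorem
print(round(p_cat_quiet, 3))             # 0.727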
Applications of Bayes' Theorem in
machine learning.
1. Naive Bayes classification, used in:
• Spam detection (emails)
• Sentiment analysis
• Document classification
• Disease diagnosis
2. Computer vision:
• Object detection and classification
• Face recognition systems
• Scene understanding
• Medical image analysis
3. Natural language processing:
• Topic modeling
• Language model adaptation
• Machine translation
• Speech recognition
4. Reinforcement learning:
• Bayesian exploration strategies
• Probabilistic policy updates
• Uncertainty-aware decision making
Gaussian Mixture model
• Have you ever wondered how machine learning algorithms can effortlessly categorize complex data into distinct groups?
• GMMs excel at estimating density and clustering data.
• A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data points are generated from a mixture of several Gaussian distributions with unknown parameters.
• The probability density function (pdf) of a GMM for data x is given by:
p(x) = Σ_{k=1}^{K} πₖ N(x | μₖ, Σₖ)
• πₖ are the mixture weights
• N(x | μₖ, Σₖ) is the Gaussian density function with mean μₖ and covariance Σₖ
• K is the number of Gaussian components
Expectation – Maximization Algorithm
• The parameters of the GMM that need to be estimated are:
• Mixture weights: πₖ for k = 1,…,K
• Means: μₖ for k = 1,…,K
• Covariances: Σₖ for k = 1,…,K
• The EM algorithm is used to find the maximum likelihood estimates of the parameters in a GMM. It consists of two steps.
• E step (Expectation): compute the responsibilities
γᵢₖ = P(zₖ = 1 | xᵢ) = πₖ N(xᵢ | μₖ, Σₖ) / Σ_{j=1}^{K} πⱼ N(xᵢ | μⱼ, Σⱼ)
• γᵢₖ is the posterior probability that the k-th component generated the data point xᵢ.
• M step (Maximization): update the parameters
• New mean: μₖ^new = Σ_{i=1}^{N} γᵢₖ xᵢ / Σ_{i=1}^{N} γᵢₖ
• New covariance: Σₖ^new = Σ_{i=1}^{N} γᵢₖ (xᵢ − μₖ)(xᵢ − μₖ)ᵀ / Σ_{i=1}^{N} γᵢₖ
• New weight: πₖ^new = (Σ_{i=1}^{N} γᵢₖ) / N
• The log-likelihood to maximize is:
L = Σ_{i=1}^{N} log Σ_{k=1}^{K} πₖ N(xᵢ | μₖ, Σₖ)
• Convergence: the algorithm iterates until the change in log-likelihood is less than a threshold:
|L^new − L^old| < threshold
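A compact numpy sketch of these E and M updates for a one-dimensional GMM (scalar variances), assuming numpy is available; scikit-learn's GaussianMixture provides a full implementation for the general case:

import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, iters=100, tol=1e-6, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    N = len(x)
    pi = np.full(K, 1.0 / K)                      # mixture weights
    mu = rng.choice(x, K, replace=False)          # initial means: random datapoints
    var = np.full(K, np.var(x))                   # initial variances
    prev_ll = -np.inf
    for _ in range(iters):
        # E step: responsibilities gamma_ik (K x N)
        dens = np.array([pi[k] * gaussian(x, mu[k], var[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=0)
        # M step: update means, variances, weights
        Nk = gamma.sum(axis=1)
        mu = (gamma @ x) / Nk
        var = np.array([(gamma[k] * (x - mu[k])**2).sum() / Nk[k] for k in range(K)])
        pi = Nk / N
        # convergence check on the change in log-likelihood (computed in the E step)
        ll = np.log(dens.sum(axis=0)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, var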
K-Nearest Neighbor Algorithm
• K-Nearest Neighbors (KNN) is a supervised machine learning method employed to tackle classification and regression problems.
• It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining, and intrusion detection.
• It does not make any underlying assumptions about the distribution of the data.
• It is a non-parametric method that makes predictions based on the similarity of data points in a given dataset.
• K-NN is less sensitive to outliers compared to other algorithms.
• The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance metric, such as Euclidean distance.
• The class or value of the data point is then determined by the majority vote or average of the K neighbors.
• Euclidean distance: d(p, q) = √( Σ_{i=1}^{n} (pᵢ − qᵢ)² )
• d is the Euclidean distance
• p = (p₁, p₂, ..., pₙ) and q = (q₁, q₂, ..., qₙ) are two points in n-dimensional space
• n is the number of dimensions
• In KNN, Euclidean distance is used to:
1. Calculate distances from a new data point to all training points
2. Find the K nearest neighbors (smallest distances)
3. Use those neighbors' labels to predict the new point's label (typically by majority vote)
Example
KNN Algorithm
• Step 1: Select the optimal value of K
• K represents the number of nearest neighbors that need to be considered when making a prediction.
• Step 2: Calculate the distances
• To measure the similarity between the target and the training data points, Euclidean distance is used. The distance is calculated between each of the data points in the dataset and the target point.
• Step 3: Find the nearest neighbors
• The K data points with the smallest distances to the target point are the nearest neighbors.
• Step 4: Vote for classification or take the average for regression
• In a classification problem, the class label is determined by performing majority voting over the K nearest neighbors.
• In a regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors.
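A compact numpy sketch of these steps for classification; the tiny training set is made up for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the k neighbours
    votes = [y_train[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

X_train = [[1, 1], [1, 2], [6, 5], [7, 6]]
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, [2, 1], k=3))   # -> "A"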
Unsupervised Learning
• Unlike supervised learning, there are
no labeled targets or scoring systems
• The algorithm must find patterns and
similarities in data independently
• Common use case is clustering -
grouping similar inputs together
• Uses internal error criteria, such as the Euclidean distance between points and centroids
• Relies on distance measures (typically Euclidean) to determine similarity.
K-means algorithm
• Initialisation
• – choose a value for k
• – choose k random positions in the input space
• – assign the cluster centres μⱼ to those positions
• Learning
• – repeat until the cluster centres stop moving:
• ∗ for each datapoint xᵢ:
• · compute the distance to each cluster centre
• · assign the datapoint to the nearest cluster centre, with distance dᵢ = minⱼ d(xᵢ, μⱼ)
• ∗ for each cluster centre:
• · move the position of the centre to the mean of the points in that cluster (Nⱼ is the number of points in cluster j): μⱼ = (1/Nⱼ) Σ_{i=1}^{Nⱼ} xᵢ
• Usage
• – for each test point:
• ∗ compute the distance to each cluster centre
• ∗ assign the datapoint to the nearest cluster centre, with distance dᵢ = minⱼ d(xᵢ, μⱼ)
• Here μⱼ is the cluster mean and xᵢ is a datapoint; dᵢ is the distance between xᵢ and its nearest centre μⱼ.
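A numpy sketch of this batch procedure, with the initial centres drawn at random from the data:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]     # initialisation
    for _ in range(iters):
        # assign each datapoint to the nearest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of the points assigned to it
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):             # centres stopped moving
            break
        centres = new_centres
    return centres, labels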
Example
On-line K-means Algorithm
• Initialisation
• – choose a value for k, which corresponds to the number of output nodes
• – initialise the weights to have small random values
• Learning
• – normalise the data so that all the points lie on the unit sphere
• – repeat:
• ∗ for each datapoint:
• · compute the activations of all the nodes
• · pick the winner as the node with the highest activation
• · update the weights
• ∗ until the number of iterations is above a threshold
• Usage
• – for each test point:
• ∗ compute the activations of all the nodes
• ∗ pick the winner as the node with the highest activation
• Network structure:
• Input layer: raw data points (x₁, x₂, x₃, etc.)
• Single competitive layer: cluster centres (C₁, C₂, C₃, etc.); each weight vector is a cluster centre
• No bias nodes
• Direct connections from the inputs to the competitive layer
• Winner-takes-all mechanism: each neuron computes an activation (h) based on the input
• Only the highest-activated neuron "wins"
• The winner represents the closest cluster centre
• Only the winner's weights get updated
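A short sketch of a single competitive (winner-takes-all) update under the assumptions above (normalised weights and inputs, dot product as the activation); the learning rate eta is an illustrative choice:

import numpy as np

def online_kmeans_step(weights, x, eta=0.1):
    """weights: (k, d) array of unit-length cluster centres; x: unit-length datapoint."""
    activations = weights @ x                              # activation of every output node
    winner = int(np.argmax(activations))                   # winner-takes-all
    weights[winner] += eta * (x - weights[winner])         # move only the winner towards x
    weights[winner] /= np.linalg.norm(weights[winner])     # keep it on the unit sphere
    return winner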