MACHINE LEARNING
Decision Trees
We use decision trees in machine learning because they offer a simple yet powerful way to
make predictions and decisions. Here’s why they are needed:
1. Easy to understand and interpret:
They look like a flowchart — you can actually "see" how a decision is made.
2. Handle both types of data:
Decision trees can work with both categorical (like "yes" or "no") and numerical (like
prices or age) data.
3. Minimal data preparation:
They don't need data to be scaled or normalized, and missing values can often be
handled naturally.
CONSTRUCTION OF TREES
The important idea is to work out how much the entropy of the whole training set would decrease if we chose each particular feature for the next classification step. This is known as the information gain, and it is defined as

Gain(S, F) = Entropy(S) − Σ_{f ∈ values(F)} (|S_f| / |S|) × Entropy(S_f),

where S is the training set, F is the feature being considered, and S_f is the subset of S for which feature F takes the value f. The entropy itself is Entropy(S) = −Σ_c p_c log₂(p_c), where p_c is the proportion of examples in S belonging to class c.
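A small Python sketch of this calculation; the toy labels and feature values are made up purely for illustration:

```python
# Entropy and information gain for one categorical feature (toy data is assumed).
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()                     # class proportions
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    total = entropy(labels)                       # entropy of the whole set
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):                 # entropy left after splitting on the feature
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

labels  = ["party", "party", "study", "tv", "party"]   # toy class labels
feature = ["yes",   "yes",   "no",    "no", "yes"]     # a candidate feature's values
print(information_gain(labels, feature))
```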
CLASSIFICATION AND REGRESSION TREES (CART)
Computing the information gain for each feature shows that the party feature gives the largest gain. Therefore, the root node will be the party feature, which has two feature values (‘yes’ and
‘no’), so it will have two branches coming out of it (see Figure 12.6). When we look at the
‘yes’ branch, we see that in all five cases where there was a party we went to it, so we just
put a leaf node there, saying ‘party’. For the ‘no’ branch, out of the five cases there are three
different outcomes, so now we need to choose another feature by repeating the information-gain
calculation on just these five cases (the ones where the answer to the party feature was ‘no’).
Ensemble Learning
Imagine you have a tough decision to make.
Instead of deciding alone, you ask a group of people (a committee) and combine their
opinions.
Ensemble learning in machine learning is based on this same idea:
Instead of using one model (learner), we use many models, and combine their outputs to
get a better, more accurate result.
Why use many models?
Each model may learn something different from the data.
Some models might be good at catching one type of pattern, others at different
patterns.
When we combine them smartly, the final answer is usually better than any single
model alone.
Simple Real-Life Example:
When you visit a doctor with a complicated illness:
If one test is not enough, the doctor orders multiple tests (blood test, scan, expert
opinions).
Then, she combines the information from all tests to make a correct diagnosis.
Similarly, ensemble learning gathers opinions from multiple machine learning models to
make a better decision.
How it works (Step-by-Step):
1. Use many simple models (like decision trees, or others).
2. Train each model differently, so that they see or focus on different parts of the data.
o Example: Split the data into parts and give different parts to different models.
3. Combine their outputs.
o A simple way: Majority voting (whichever class gets the most votes wins).
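A minimal sketch of these steps, assuming scikit-learn decision trees as the simple models and a synthetic dataset:

```python
# Train several simple models on different random parts of the data,
# then combine their predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(5):
    part = rng.choice(len(X), size=len(X) // 2, replace=False)   # each model sees a different part
    models.append(DecisionTreeClassifier(max_depth=3).fit(X[part], y[part]))

votes = np.array([m.predict(X) for m in models])      # one row of predictions per model
majority = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote for 0/1 labels
print("ensemble accuracy:", (majority == y).mean())
```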
Important points:
If the models are diverse and strong individually, the ensemble becomes very
powerful.
Ensemble learning works well even with little data (because we reuse small pieces
of data cleverly).
Majority voting: If most models are correct, the final answer will also be correct.
Popular ensemble methods include Bagging and Boosting.
Boosting is a way of combining many weak learners (bad models that are only slightly better than
guessing) to build one strong learner (a model that makes very good predictions).
Even if each small model is bad, together they become strong and accurate!
🌟 Main Boosting Idea:
1. Train a model (say, a simple decision tree).
2. Check where it makes mistakes.
3. Give more importance (weight) to the wrong answers (hard examples).
4. Train a new model, focusing more on these difficult examples.
5. Repeat this process many times.
6. Finally, combine all the models smartly to make the final decision.
AdaBoost
AdaBoost stands for Adaptive Boosting.
It is the most popular boosting algorithm.
Here's how AdaBoost works, step-by-step:
1. Start by giving equal importance to all data points.
2. Train the first model.
3. Find out which points the model got wrong.
4. Increase the weight (importance) of those wrong points.
5. Train a second model — now it pays more attention to the difficult points.
6. Keep repeating this: each time focusing more on the points that were misclassified before.
7. Finally, combine all models — the better a model did, the more power (vote) it gets in the
final decision.
1. The weights are initially all set to the same value, 1/N, where N is the number of datapoints
in the training set.
2. Then, at each iteration, the error (ϵ) is computed as the sum of the weights of the
misclassified points, and the weights for incorrect examples are updated by being multiplied
by α = (1 − ϵ)/ϵ.
3. Weights for correct examples are left alone, and then the whole set is normalised so that it
sums to 1 (which is effectively a reduction in the importance of the correctly classified
datapoints).
4. Training terminates after a set number of iterations, or when either all of the datapoints are
classified correctly, or one point contains more than half of the available weight.
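A minimal NumPy/scikit-learn sketch of exactly this weighting scheme; the use of decision stumps as the weak learners and labels in {−1, +1} are assumptions, not part of the original text:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    # y is assumed to be in {-1, +1}
    N = len(y)
    w = np.full(N, 1.0 / N)               # 1. equal initial weights, 1/N each
    models, votes = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        eps = max(w[wrong].sum(), 1e-10)  # 2. error = total weight of the mistakes
        if eps >= 0.5:                    # weaker than chance: stop boosting
            break
        alpha = (1 - eps) / eps
        w[wrong] *= alpha                 # boost the weights of the misclassified points
        w /= w.sum()                      # 3. renormalise so the weights sum to 1
        models.append(stump)
        votes.append(np.log(alpha))       # better stumps get a bigger say in the final vote
    return models, votes

def adaboost_predict(models, votes, X):
    # 4. weighted vote over all of the weak learners
    scores = sum(v * m.predict(X) for v, m in zip(votes, models))
    return np.sign(scores)
```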
Example:
Imagine you're trying to teach a group of students to solve math problems.
Some students get easy questions right but fail on hard ones.
Next time, you spend more time teaching the harder questions.
Then again, you focus even more on the parts they still struggle with.
After a few rounds, the students get much better!
Similarly, AdaBoost focuses more and more on mistakes, so the overall learning keeps improving.
Stumping:
A stump is a tiny decision tree — it just asks one question and stops.
Each stump is very weak (bad alone).
But when we boost many stumps, they together become very strong!
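In practice a library call does this for you; a hedged example assuming scikit-learn, whose AdaBoostClassifier uses a depth-1 decision tree (a stump) as its default weak learner, and a synthetic dataset:

```python
# Boosting 100 stumps with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)   # 100 boosted stumps
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```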
Bagging
Bagging is a way to improve the performance and stability of machine learning models by combining
multiple models trained on different random samples of the same dataset.
The Core Idea:
You take your original dataset.
You create many random samples from it — with replacement (this is called a bootstrap
sample).
You train a separate model on each sample (often decision trees).
When you want to predict something, you combine the predictions of all these models —
usually by voting (for classification) or averaging (for regression).
Why does this work?
Each model sees a slightly different version of the data.
That makes them behave differently (less correlated).
When you combine all these different models, you get a more stable and accurate
prediction.
What is a Bootstrap Sample?
It is created by randomly picking examples from the dataset — with replacement.
Some data points may appear multiple times, others not at all.
Each sample is the same size as the original dataset, but its composition differs because of the resampling.
Example :
Imagine you are predicting what a group of people want to do (like “Party”, “Study”, “TV”, etc.).
You:
1. Make 20 random samples of this data (with replacement).
2. Train 20 small decision trees (called stumps — they only ask one question).
3. Each stump gives its own prediction.
4. You take the majority vote of all 20 trees to get your final prediction.
Despite each stump being weak on its own, together they give very accurate results!
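A minimal sketch of this example, with a synthetic dataset standing in for the real one (an assumption): 20 bootstrap samples, 20 stumps, and a majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
rng = np.random.default_rng(1)
n = len(X)

stumps = []
for _ in range(20):
    idx = rng.integers(0, n, size=n)          # bootstrap sample: with replacement, same size
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

# Majority vote: each stump predicts, and the most common class wins.
all_preds = np.array([s.predict(X) for s in stumps])     # shape (20, n)
majority = (all_preds.mean(axis=0) > 0.5).astype(int)    # works for 0/1 labels
print("ensemble training accuracy:", (majority == y).mean())
```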
Subagging
Subagging = Subsampling + Bagging
The idea is almost the same as bagging, but samples are smaller than the original dataset.
Instead of taking full-size samples, you take half-size (or any fraction).
Sampling is usually done without replacement (like shuffling and picking the top few).
It’s faster and can still give very good results.
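A tiny sketch of the sampling difference, assuming NumPy and a half-size fraction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
bootstrap = rng.choice(n, size=n, replace=True)        # bagging: full size, with replacement
subsample = rng.choice(n, size=n // 2, replace=False)  # subagging: half size, without replacement
```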
Different Ways to Combine Classifiers
In ensemble learning, we combine multiple classifiers to make better predictions.
But how we combine them matters a lot!
There are several strategies:
1. Bagging (Bootstrap Aggregating)
Each classifier sees different random samples of the training data.
Then, each classifier votes.
Final output: Class with majority vote wins.
Important: In bagging, each classifier has the same weight (no classifier is treated as more
important).
2. Boosting
All classifiers see the same data.
But, the importance (weight) of each data point changes.
Hard-to-classify points get higher weights for the next classifier.
Final output: A weighted vote — better classifiers have more say.
3. Majority Voting Variations
Normally, majority voting = the class chosen by most classifiers.
Variations:
o Only output if more than half of classifiers agree (otherwise, no prediction — useful
to avoid wrong answers for tough cases).
o Simple majority: just pick the most common output even if it’s not >50%.
📢 Important: If each classifier has a success rate p > 0.5 and the classifiers make their errors independently, then as you use more and more classifiers, the ensemble's success probability approaches 100%!
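A quick check of this claim under the independence assumption, using the binomial distribution:

```python
# Probability that a majority of M independent classifiers, each correct with
# probability p, gets the right answer (binomial tail).
from math import comb

def majority_correct(p, M):
    return sum(comb(M, k) * p**k * (1 - p)**(M - k) for k in range(M // 2 + 1, M + 1))

for M in (1, 11, 101):
    print(M, round(majority_correct(0.6, M), 3))   # grows towards 1 as M increases
```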
4. Regression Tasks (Numerical output, not categories)
Instead of voting, average the outputs.
Problem: Mean (average) is sensitive to outliers.
Solution: Use Median instead of Mean.
Using Median instead of Mean leads to the Bragging Algorithm ("robust bagging").
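A tiny illustration of the difference, with hypothetical ensemble outputs:

```python
# One outlier drags the mean far from the typical prediction; the median barely moves.
import numpy as np

predictions = np.array([2.1, 1.9, 2.0, 2.2, 25.0])   # hypothetical outputs; the last is an outlier
print("mean:  ", predictions.mean())                 # pulled towards the outlier
print("median:", np.median(predictions))             # stays near 2
```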
5. Mixture of Experts
Smart, learned way to combine classifiers.
Instead of just voting or averaging, the system learns:
o Which classifier to trust for each input.
Idea: Some classifiers are better for certain types of data.
🔹 How Mixture of Experts Works:
1. Each expert (classifier) gives an output (like probability).
2. A gating network decides how much to trust each expert (gives a weight).
3. Outputs are combined using those trust weights.
4. Final prediction is made.
Training: Use EM algorithm (Expectation Maximization) or gradient descent.
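A minimal NumPy sketch of the combination step; the random gating and expert parameters are placeholders for what would actually be learned (with EM or gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_features = 3, 4
W_gate = rng.normal(size=(n_experts, n_features))    # gating network parameters (placeholder)
W_expert = rng.normal(size=(n_experts, n_features))  # one linear expert per row (placeholder)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_predict(x):
    trust = softmax(W_gate @ x)        # how much to trust each expert for this input
    outputs = W_expert @ x             # each expert's (linear) output
    return trust @ outputs             # trust-weighted combination

x = rng.normal(size=n_features)
print(moe_predict(x))
```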
6. Other Views of Mixture of Experts
Like soft decision trees: not hard splits but probabilistic splits.
Similar to Radial Basis Function (RBF) networks, except that instead of constant outputs the nodes give linear approximations.
📋 In short:
Bagging: majority vote with equal weights; each classifier sees a different sample of the data.
Boosting: weighted vote; focuses on hard-to-classify data points.
Simple majority voting: the most common output wins; may produce no output if the classifiers disagree.
Averaging (regression): mean of the outputs; use the median for robustness.
Mixture of experts: learned weighted combination; uses a gating network for a smarter combination.
🎯 Key Takeaway:
Even if each individual classifier is weak, by combining them wisely, you can get very strong and
accurate predictions!
Gaussian Mixture Models (GMM)
In machine learning, Gaussian Mixture Models (GMM) are used for modelling a set of data that is
assumed to be generated from multiple Gaussian distributions (also known as normal distributions).
These models are a powerful way of handling unsupervised learning problems, especially when the
data can be described by multiple different distributions or "modes."
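A minimal sketch using scikit-learn's GaussianMixture on synthetic one-dimensional data drawn from two normal distributions (the data and the choice of two components are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (two "modes")
data = np.concatenate([rng.normal(0.0, 1.0, 300),
                       rng.normal(5.0, 0.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("means:  ", gmm.means_.ravel())        # roughly 0 and 5
print("weights:", gmm.weights_)              # mixing proportions of the two components
print("responsibilities of the first point:", gmm.predict_proba(data[:1]))
```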
7.2 Nearest Neighbour Methods
Nearest Neighbour (NN) methods are non-parametric techniques used for classification and
regression. They are based on the idea that similar data points exist near each other in feature
space.
1. Intuition (Nightclub Analogy)
Imagine you're in a nightclub, unsure how to dance. You look at nearby people:
1-NN: You copy the closest person.
k-NN: You observe k closest people and follow the majority.
This is how k-NN works: look at the closest data points and make your prediction based on them.
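A minimal sketch of k-NN with Euclidean distance and a majority vote; the toy "dance" data is hypothetical:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)              # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority label among them

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["dance_A", "dance_A", "dance_B", "dance_B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # predicts "dance_A"

# scikit-learn equivalent:
# from sklearn.neighbors import KNeighborsClassifier
# KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict([[1.1, 0.9]])
```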
Unsupervised Learning
No labels or correct outputs are provided.
Goal: Find patterns, similarities, or clusters within the data.
You can't perform regression directly; instead, you group similar inputs.
2. Challenge
Unlike supervised learning (which minimizes a target-based error like sum-of-squares), in
unsupervised learning, you can’t use external targets for error measurement.
Instead, the algorithm must internally define an objective, like minimizing distances between
points and their cluster centers.
3. k-Means Clustering Algorithm
You pick a value for k = number of clusters you expect.
Steps:
1. Randomly place k cluster centers.
2. Assign each data point to the nearest cluster center.
3. Reposition each center to the mean of the points assigned to it.
4. Repeat until centers stop moving (convergence).
Distance measure: Typically Euclidean distance.
Problems:
o Local Minima: The result depends heavily on the initial center placement.
o Choosing k: If k is too small or too big, you underfit or overfit.
Solutions:
o Run multiple times with different initializations and pick the best solution.
o Try different values of k and evaluate carefully (but watch out for overfitting).
4. Dealing with Noise
Outliers can mislead the mean (because the mean is sensitive to extreme values).
Alternative: Use the median instead of the mean for cluster centers, but this is
computationally heavier.
5. k-Means as a Neural Network
Think of cluster centers as neurons.
Each neuron has weights corresponding to its location in space.
Activation of each neuron = similarity to input (higher is better).
Use a winner-takes-all strategy: the neuron closest to the input "fires."
Update Rule (for the winner only):
o Move the weights slightly toward the input:
Δw_ij = η x_j
(η is the learning rate.)
6. Importance of Normalization
Problem: If neurons have very different weight magnitudes, activations become
incomparable.
Solution: Normalize weights so that they all have the same overall magnitude (often unit
length).
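A small NumPy sketch of the winner-takes-all update with normalisation, as described above; the number of neurons, the learning rate, and the random data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, eta = 3, 2, 0.1
W = rng.normal(size=(k, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # start with unit-length weights

def train_step(x):
    activations = W @ x                            # similarity of each neuron to the input
    winner = np.argmax(activations)                # winner-takes-all: only this neuron fires
    W[winner] += eta * x                           # move the winner's weights toward the input
    W[winner] /= np.linalg.norm(W[winner])         # renormalise so magnitudes stay comparable

for x in rng.normal(size=(100, d)):
    train_step(x / np.linalg.norm(x))              # inputs also normalised to unit length
```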
Visual Summary
k-Means clustering is like:
Randomly throwing down some flags (cluster centers) in a crowd (data points).
Each person (data point) walks to the closest flag.
Each flag moves to the average position of the people around it.
Repeat until flags stop moving.
When adapted to neural networks, neurons play the role of cluster centers, and they compete to
recognize patterns.
K-Means Algorithm: Step-by-Step
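A minimal NumPy sketch that follows the steps listed earlier (random centres, nearest-centre assignment, move each centre to the mean of its points, repeat until the centres stop moving); Euclidean distance and the toy data are assumptions:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initial centres
    for _ in range(max_iters):
        # 2. assign each point to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each centre to the mean of the points assigned to it
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):                # 4. stop when centres stop moving
            break
        centres = new_centres
    return centres, labels

# Usage on toy 2-D data with two obvious clusters:
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(5, 0.5, (50, 2))])
centres, labels = kmeans(X, k=2)
print(centres)     # roughly [0, 0] and [5, 5]
```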