MODULE 3
Decision Tree
• Decision trees are a popular machine learning
algorithm used for both classification and
regression tasks. They work by:
1. Splitting data based on feature values
2. Starting from a root node and branching down to
leaf nodes
3. Making decisions at each internal node
4. Providing a final prediction or classification at leaf
nodes
(Figure: a simple decision tree to decide how to spend the evening.)
ENTROPY
• Entropy is the measure of information content, which describes the amount of impurity in a set of examples.
• For a set of probabilities pᵢ, the entropy is H(S) = − Σᵢ pᵢ log₂(pᵢ).
• Suppose we have a dataset with two classes: positive (P) and negative (N). We can calculate
the entropy of the dataset as follows:
• All Positive Examples:
• p(P)=1
• p(N)=0
• Entropy: H(S) = - (1 log2(1) + 0 log2(0)) = 0
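As a quick illustration (a minimal Python sketch, not from the slides, using the convention 0·log₂0 = 0), the entropy of a list of class labels can be computed as follows:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / total
        if p > 0:                          # convention: 0 * log2(0) = 0
            h -= p * math.log2(p)
    return h

print(entropy(["P", "P", "P", "P"]))       # all positive  -> 0.0
print(entropy(["P", "N", "N", "N"]))       # 1/4 vs 3/4    -> ~0.811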
Information Gain
• Information Gain is defined as the entropy of the whole set minus the weighted entropy of the subsets obtained when a particular feature is chosen:
• Gain(S, F) = H(S) − Σ_{f ∈ values(F)} (|S_f| / |S|) · H(S_f)
• where S is the set of examples
• F is a possible feature out of the set of all possible ones
• |S_f| is the count of the number of members of S that have value f for feature F.
Example of IG
We have a dataset S = {s1, s2, s3, s4} with the following outcomes:
• s1 = Spam (True)
• s2 = Not Spam (False)
• s3 = Not Spam (False)
• s4 = Not Spam (False)
We also have one feature F, the "presence of a specific word" in the email, with possible values {f1, f2, f3}. The feature values for each example are:
• s1 = f2
• s2 = f2
• s3 = f3
• s4 = f1
Calculating the entropy:
1. Determine the proportions of each class:
• Number of Spam (⊕): 1 (s1)
• Number of Not Spam (⊖): 3 (s2, s3, s4)
• Total examples: 4
2. Calculate the proportions:
p(spam) = 1/4 = 0.25
p(not spam) = 3/4 = 0.75
3. Calculate the entropy of S:
H(S) = −( p(spam) log₂(p(spam)) + p(not spam) log₂(p(not spam)) )
H(S) = −( 0.25 log₂(0.25) + 0.75 log₂(0.75) )
H(S) = −( 0.25·(−2) + 0.75·(−0.415) )
H(S) = −( −0.5 − 0.311 )
H(S) = 0.811
Example of IG (continued)
• S = {s1, s2, s3, s4}
• The feature F can take values {f1, f2, f3}
• s1: F=f2
• s2: F=f2
• s3: F=f3
• s4: F=f1
• Entropy of each subset:
H(S_f1) = 0
H(S_f2) = −( (1/2) log₂(1/2) + (1/2) log₂(1/2) ) = 1
H(S_f3) = 0
• Weighted sum of entropy (|S_f| is the number of examples with value f for feature F):
(|S_f1|/|S|) H(S_f1) + (|S_f2|/|S|) H(S_f2) + (|S_f3|/|S|) H(S_f3) = (1/4)·0 + (2/4)·1 + (1/4)·0 = 0 + 1/2 + 0 = 0.5
• Information Gain:
Gain(S, F) = H(S) − [ (|S_f1|/|S|) H(S_f1) + (|S_f2|/|S|) H(S_f2) + (|S_f3|/|S|) H(S_f3) ]
= 0.811 − 0.5
= 0.311
(Figure: the decision tree for feature F.)
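The same Information Gain calculation can be scripted; this sketch reuses the entropy helper above and reproduces Gain(S, F) ≈ 0.311 for the s1…s4 spam example:

def information_gain(labels, feature_values):
    """Gain(S, F) = H(S) - sum_f (|S_f|/|S|) * H(S_f)."""
    total = len(labels)
    weighted = 0.0
    for v in set(feature_values):
        subset = [lab for lab, fv in zip(labels, feature_values) if fv == v]
        weighted += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted

labels  = ["spam", "not_spam", "not_spam", "not_spam"]   # s1..s4
feature = ["f2", "f2", "f3", "f1"]                        # values of F
print(information_gain(labels, feature))                  # ~0.311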
ID3 Algorithm
• It is a decision tree algorithm used in machine learning and data mining for creating a
decision tree from a dataset.
• The ID3 algorithm is used to generate a decision tree by employing a top-down, greedy
approach that selects the attribute with the highest information gain (or equivalently, the
lowest entropy) to split the data at each node.
• The output of the algorithm is the tree, i.e., a list of nodes, edges, and leaves.
Algorithm
•Base cases:
•If all examples have the same label, return that label.
•If no features are left, return the most common label.
•Recursive case:
•Select the best feature F that maximizes information gain.
•For each possible value of F:
•Create a new branch.
•Remove F from the feature set.
•Recursively build a subtree using the remaining features and examples.
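A compact recursive sketch of these steps, assuming the entropy and information_gain helpers defined earlier and a dataset stored as a list of dicts with a 'label' key (an illustrative implementation, not the lecture's exact code):

from collections import Counter

def id3(rows, features):
    """Recursive ID3: rows is a list of dicts, features is a list of feature keys."""
    labels = [r["label"] for r in rows]
    if len(set(labels)) == 1:                  # base case 1: all examples share a label
        return labels[0]
    if not features:                           # base case 2: no features left
        return Counter(labels).most_common(1)[0][0]
    # recursive case: pick the feature with the highest information gain
    best = max(features,
               key=lambda f: information_gain(labels, [r[f] for r in rows]))
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # one branch per feature value
        subset = [r for r in rows if r[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset, remaining)
    return tree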
Example
• The dataset examples:
• "Hey, check this out, meet singles in your area!" - SPAM
• "Team meeting minutes" - NOT SPAM
• "Exclusive offer inside" - SPAM
• "Re: Project proposal" - NOT SPAM
• "You're a winner!" - SPAM
• "Lunch tomorrow?" - NOT SPAM
• "Urgent: Account verify" - SPAM
• "Hot deal, act now!" - SPAM
• "Quick question" - NOT SPAM
• "Free trial ending" - SPAM
• Features used:
1. Contains suspicious keywords
2. Sender is in contacts
3. Email length
4. Time of day sent
5. Contains attachments
• If all examples have the same label, return that label.
• If no features are left, return the most common label.
• At each recursive step, the algorithm recalculates the Information Gain for all remaining features.
• It always chooses the feature with the maximum IG for the next split.
• The IG value changes because we are now working with a subset of the data: after the first split on "Suspicious Keywords", the branch contains only those emails that don't contain suspicious keywords, and this subset may have different characteristics than the full dataset.
Construction of Decision Tree based on IG
Construction of Decision Tree
Decision Tree for Continuous Variables or Features
• The simplest solution is to discretise the continuous variable.
• The continuous variables x1 and x2 are discretised with thresholds v and w.
Final Tree
Types of Trees
• Univariate Tree
• Each node splits on a single variable
(either Age or Income).
• The splits are simple threshold
comparisons (e.g., Age > 30).
• The tree is easy to interpret.
Types of Trees
• Multivariate Tree
• The decision node uses a
combination of variables (Age and
Income).
• The split condition is more complex,
involving a linear combination of
features. This tree is more compact
but potentially harder to interpret.
CART – Classification and Regression Trees
• CART stands for Classification And
Regression Trees
• Can be used for both classification and
regression tasks
• Specifically constructs binary trees
• Binary Tree Advantage: While it might seem
limiting at first, binary trees are actually
advantageous. There are computational
efficiency reasons for preferring binary trees
• Supports the discussion of computational costs.
• Binary Tree Conversion: Multi-way decisions
can be transformed into a series of binary
decisions
• Example: a three-way decision about assignment deadlines ('urgent', 'near', 'none') can be converted into two binary questions: "Is the deadline urgent?" (Yes/No); if No, "Is the deadline near?" (Yes/No).
How does the CART algorithm work?
• Step 1: Find the best split point for each feature/input variable in the dataset
• Evaluate all possible split points for numerical variables
• Calculate the impurity measure (Gini index for classification, MSE for regression)
• Select the split point that maximises information gain or, equivalently, reduces impurity
• Step 2: Select the best feature by comparing the best split points found for each feature
• Choose the feature and split point combination that gives the maximum improvement
• For classification: minimises Gini impurity or entropy
• For regression: minimises Mean Squared Error (MSE)
• Step 3: Split the node
• Divide the data into two child nodes based on the chosen split
• The left node contains the samples meeting the split condition; the right node contains the remaining samples
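For reference, a minimal usage sketch assuming scikit-learn is available (its DecisionTreeClassifier is based on the CART approach of binary, Gini-driven splits); the ages/incomes below are made-up toy values:

from sklearn.tree import DecisionTreeClassifier

# toy data: [age, income] -> bought (1) / did not buy (0); values are illustrative only
X = [[22, 20000], [35, 60000], [48, 82000], [29, 40000], [52, 30000], [41, 95000]]
y = [0, 1, 1, 0, 0, 1]

# Gini impurity drives the binary splits; for regression, DecisionTreeRegressor
# with a squared-error criterion would be used instead
tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree.fit(X, y)
print(tree.predict([[30, 70000]]))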
Gini Impurity
• Gini = 1 - ∑ᵢᵏ p(classᵢ)²
• N(i) = number of samples in class i
• p(i) = N(i) / total_samples (proportion)
• Pure vs impure nodes:
• Pure node: all samples belong to the same class (Gini = 0)
• Impure node: samples belong to different classes (Gini > 0)
• For example, in a binary classification problem (k = 2):
• Gini = 1 - (p(class₁)² + p(class₂)²)
• Binary classification examples:
• Perfect split (50-50): Gini = 1 - (0.5² + 0.5²) = 0.5
• Pure node (100-0): Gini = 1 - (1² + 0²) = 0
• 70-30 split: Gini = 1 - (0.7² + 0.3²) = 0.42
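A small Python sketch that reproduces the three worked Gini values:

def gini(proportions):
    """Gini impurity = 1 - sum(p_i^2) over the class proportions."""
    return 1.0 - sum(p * p for p in proportions)

print(gini([0.5, 0.5]))   # perfect 50-50 split -> 0.5
print(gini([1.0, 0.0]))   # pure node           -> 0.0
print(gini([0.7, 0.3]))   # 70-30 split         -> 0.42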
Regression in CART
• Main difference from classification: uses Sum of Squares Error (SSE) instead of Gini/entropy as the impurity measure
• ȳ is the mean of the segment
• Leaf nodes predict a mean value instead of a class
• Continuous output values instead of discrete classes
• Prediction process: each leaf node stores the mean value of its training samples
• Predictions are constant within each leaf region
• New samples get a prediction based on which leaf they fall into
Ensemble learning
Ensemble learning helps improve
machine learning results by combining
several models.
This approach allows the production of
better predictive performance compared
to a single model.
The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Why do ensembles work?
• Ensembles work because they combine multiple "good but imperfect" solutions into one better solution, similar to how group decision-making often outperforms individual decisions.
• Statistical problem - like asking many doctors instead of just one. Even if each doctor can make mistakes, their collective opinion is usually more reliable.
• Computational problem - like having multiple people search for your lost keys starting from different places. They're more likely to find them than one person searching alone.
• Representational problem - like combining different tools (hammer, screwdriver, wrench) when no single tool can fix everything. Different models can capture different aspects of the problem.
Types of Ensemble Classifier
• Bagging
• Bagging (Bootstrap Aggregating) creates multiple models on random subsets of the data
• Models run in parallel
• Final prediction by voting/averaging
• Example: Random Forest
• Reduces variance; good for high-variance models like decision trees
• Boosting
• Boosting builds models sequentially
• Each new model focuses on the errors made by previous models
• Models are weighted by accuracy
• Examples: AdaBoost, XGBoost, LightGBM
• Reduces bias and variance
Boosting
• The core idea of boosting is that we can
combine many "weak learners" (classifiers
that perform just slightly better than random
guessing) into a strong ensemble classifier.
• AdaBoost (Adaptive Boosting) works by:
Starting with equal weights for all training
samples
• Iteratively training weak learners.
• Updating sample weights after each iteration
to focus more on misclassified samples
• Combining the weak learners with weights based on their performance.
(Figure: the cross is misclassified, so its weight increases in boosting, shown by the datapoint being drawn larger; this increases the importance of those datapoints, making the classifiers pay more attention to them.)
AdaBoost
• Start with equal weights for all training examples (1/N each).
• For each round:
• Train a simple classifier using the current weights.
• Calculate how many mistakes it makes (the weighted error). The error is computed as the sum of the weights of the misclassified points.
• Give more weight to the examples it got wrong: the weights of the incorrect examples are updated by multiplying them by a factor that depends on the classifier's weighted error.
• Reduce the weight of the examples it got right.
• Make sure all weights still add up to 1.
• An indicator function is defined as: return 1 if the target and the output are not equal, 0 otherwise.
• Stop if:
• We reach the maximum number of rounds, or
• All examples are correctly classified.
Adaboost Algorithm
• Initialisation: the weights are set to 1/N for the N datapoints. S is the training set with weights w:
S = {(x₁,y₁), (x₂,y₂), ..., (xₙ,yₙ)}
• hₜ(xₙ) is the prediction that the t-th weak classifier makes for the datapoint xₙ.
• Main loop: train a weak classifier on the weighted data and update the weights; Zₜ is a normalisation constant that keeps the weights summing to 1.
• Output: the weighted combination of the weak classifiers.
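A minimal numpy/scikit-learn sketch of these weight updates for labels in {-1, +1}, using depth-1 decision trees (stumps) as the weak learners; this is an illustrative implementation under those assumptions, not the lecture's exact notation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    """y must be in {-1, +1}. Returns (weak learners, alphas)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                     # initial weights 1/N
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)       # weak learner
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])                        # weighted error
        if err == 0:                                      # everything correct: keep and stop
            learners.append(stump); alphas.append(1.0)
            break
        if err >= 0.5:                                    # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)             # classifier weight
        w *= np.exp(-alpha * y * pred)                    # up-weight mistakes, down-weight hits
        w /= w.sum()                                      # renormalise (the Z_t step)
        learners.append(stump); alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)                                 # weighted vote of the weak learners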
Stumping
• There is a very extreme form of boosting that is applied to trees.
• Stumping consists of simply taking the root of the tree and using that as the decision maker.
• So for each classifier, use the very first question that makes up the root of the tree, and that is it.
• The overall output of stumping can be very successful.
• This is why it's called "stumping":
• Like a tree stump (just the base remains)
• One simple decision
• Combined with AdaBoost's weighting
(Figures: H1(x) with +1 = green and -1 = red; the three classifiers; the combined classifier output. h1 classifies the point x1 as -1, h2 as +1, h3 as +1.)
Bagging
• The simplest method of combining classifiers is
known as bagging, which stands for bootstrap
aggregating.
• The Bootstrap Part: You have a bag of 100 marbles. You grab a marble, write down what it is, and put it back. You do this 100 times (some marbles you might pick multiple times, others not at all). You repeat this whole process many times (50+ times) to create multiple different samples.
• The Aggregating Part: You train a separate model on
each of these samples. When you want to make a
prediction, you ask all your models. Take a vote -
whatever most models predict becomes your final
answer
• Bootstrap sampling helps reduce
model variance by:
• Creating multiple diverse training sets
from the same data
• Each model learns slightly different
patterns
• Averaging multiple models reduces
overfitting
• Final model is more robust and
generalizes better
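A short sketch of bootstrap aggregating with decision trees, assuming numpy and scikit-learn are available (the index sampling with replacement is the "marble" step):

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=50, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap: sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]      # ask every model
    return Counter(votes).most_common(1)[0][0]       # majority vote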
Random Forest
• Random Forest algorithm is a powerful
tree learning technique in Machine
Learning.
• It works by creating a number of Decision
Trees during the training phase.
• Each tree is constructed from a random subset of the dataset and considers only a random subset of the features at each partition.
• This randomness introduces variability
among individual trees, reducing the risk
of overfitting and improving overall
prediction performance
How the Random Forest Algorithm Works
• For each of N trees:
• – create a new bootstrap sample of the
training set
• – use this bootstrap sample to train a decision
tree
• – at each node of the decision tree, randomly select m features, compute the information gain (or Gini impurity) only on that set of features, and select the optimal one
• – repeat until the tree is complete
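In practice this procedure is available directly; a minimal usage sketch assuming scikit-learn is available (n_estimators plays the role of N, max_features the role of m; make_classification just generates synthetic data for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# N = 100 bootstrapped trees, m = sqrt(n_features) candidate features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.score(X, y))   # accuracy of the ensemble on the training data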
Bagging vs Boosting
DIFFERENT WAYS TO COMBINE CLASSIFIERS
• How do the ensemble methods combine the outputs of the different classifiers?
• Both boosting and bagging take a vote from amongst the classifiers, although they do it in different ways:
• boosting takes a weighted vote
• bagging simply takes the majority vote
• Different voting schemes work in ensemble classifiers:
1. Simple majority voting:
• Takes the most common prediction across all classifiers
• Requires an odd number of classifiers to avoid ties
• Outputs a prediction even if the agreement is low
2. Threshold-based voting:
• Only produces an output when a certain threshold of agreement is met
• Common thresholds include:
• Unanimous agreement (100%)
• Strong majority (e.g., 75%)
• Simple majority (>50%)
• Binomial distribution formula: for an ensemble with n classifiers, each with accuracy p, the probability of getting the correct answer with majority voting follows the binomial distribution:
P(majority correct) = Σ_{k = n/2+1}^{n} C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ
• n is the total number of classifiers
• k is the number of classifiers that need to be correct
• C(n, k) calculates how many different ways k classifiers can be correct out of n total classifiers
• p is the individual classifier's accuracy rate
• The sum runs from n/2+1 to n because that is the range needed for majority voting
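A short sketch that evaluates this sum, assuming the n classifiers make independent errors:

from math import comb

def majority_vote_accuracy(n, p):
    """P(majority correct) = sum_{k > n/2} C(n,k) p^k (1-p)^(n-k)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(1, 0.6))    # 0.6   (a single classifier)
print(majority_vote_accuracy(11, 0.6))   # ~0.75 (an ensemble of 11 such classifiers)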
Mixture of networks
• In regression problems, rather than taking the majority vote, it is common to take the mean of the outputs.
• To combine classifiers, there is an algorithm that does precisely this, known as the mixture of experts.
• Process flow: Input → { Classifiers → Individual Assessments; Gating Network → Weights } → Weighted Combination → Final Output
• The system works by:
1. All experts and gates receive the same input features
2. Each expert produces its prediction
3. Gates learn to dynamically weight these predictions based on the input
4. The final output is a weighted combination of the expert predictions
(Figure: the Hierarchical Mixture of Networks, consisting of a set of classifiers (experts) with gating systems that also use the inputs to decide which classifiers to trust.)
The Mixture of Expert Algorithm
•Input Processing:
•Receive input vector x
•Pass x to all experts and all gating networks
•Expert Predictions (Bottom Layer):
•Each expert i computes its output:
•oi = 1 / (1 + exp(-wi·x))
•This gives probability estimates for each expert
•Gating Weights (Level 1):
•Lower-level gates compute their weights:
•gi = exp(vi·x) / Σ exp(vj·x)
•Gate 1,1 computes weights for Experts 1&2
•Gate 1,2 computes weights for Experts 3&4
•Combine Expert Outputs (Level 1):
•For Gate 1,1: output = o1·g1 + o2·g2
•For Gate 1,2: output = o3·g3 + o4·g4
Top-Level Gating (Level 2):
•Gate 2 computes final weights
using same softmax formula
•Weights how much to trust each
level 1 gate
•Final Output:
•Combine level 1 outputs using top-
level gate weights
•Final = (Gate1,1_output ·
Gate2_weight1) + (Gate1,2_output ·
Gate2_weight2)
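A numpy sketch of the prediction path described above for four experts and three gates; the weight matrices here are random placeholders rather than trained values, and only the forward pass (no training) is shown:

import numpy as np

rng = np.random.default_rng(0)
d = 5                                    # input dimension (illustrative)
W = rng.normal(size=(4, d))              # expert weights w1..w4
V1 = rng.normal(size=(2, d))             # gate 1,1 weights (over experts 1 and 2)
V2 = rng.normal(size=(2, d))             # gate 1,2 weights (over experts 3 and 4)
V_top = rng.normal(size=(2, d))          # gate 2 weights (over the two level-1 gates)

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def softmax(z):  e = np.exp(z - z.max()); return e / e.sum()

def mixture_of_experts(x):
    o = sigmoid(W @ x)                   # each expert's probability estimate o_i
    g1 = softmax(V1 @ x)                 # gate 1,1: weights for experts 1,2
    g2 = softmax(V2 @ x)                 # gate 1,2: weights for experts 3,4
    level1 = np.array([o[:2] @ g1, o[2:] @ g2])   # level-1 combined outputs
    g_top = softmax(V_top @ x)           # gate 2: weights over the level-1 outputs
    return level1 @ g_top                # final weighted combination

print(mixture_of_experts(rng.normal(size=d)))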
Basic Statistics
• Mean
• Median
• Mode
• Conditional probability
• Bayes theorem
Bayes Theorem
• Bayes’ Theorem is used to determine the conditional probability of an event:
P(A|B) = P(B|A) P(A) / P(B)
• Posterior probability: P(A|B)
• Prior probability: P(A)
• Likelihood: P(B|A)
• Marginal likelihood: P(B)
• P(A) and P(B) are the probabilities of events A and B; P(B) is never equal to zero.
• P(A|B) is the probability of event A when event B happens
• P(B|A) is the probability of event B when A happens
Example
• Suppose P(Cat) = P(Dog) = 0.5, P(Quiet|Cat) = 0.8 and P(Quiet|Dog) = 0.3.
• P(Quiet) = P(Quiet|Cat)×P(Cat) + P(Quiet|Dog)×P(Dog)
= 0.8 × 0.5 + 0.3 × 0.5
= 0.4 + 0.15 = 0.55
• Bayes' Theorem: P(Cat|Quiet) = [P(Quiet|Cat) × P(Cat)] / P(Quiet)
= (0.8 × 0.5) / 0.55
≈ 0.727
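The same cat-vs-dog calculation written out in a few lines of Python:

p_cat, p_dog = 0.5, 0.5                  # priors
p_quiet_cat, p_quiet_dog = 0.8, 0.3      # likelihoods P(Quiet | pet)

p_quiet = p_quiet_cat * p_cat + p_quiet_dog * p_dog   # total probability = 0.55
p_cat_quiet = (p_quiet_cat * p_cat) / p_quiet          # Bayes' theorem
print(round(p_cat_quiet, 3))             # 0.727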
Applications of Bayes' Theorem in
machine learning.
1. Naive Bayes classification, used in:
• Spam detection (emails)
• Sentiment analysis
• Document classification
• Disease diagnosis
2. Computer vision:
• Object detection and classification
• Face recognition systems
• Scene understanding
• Medical image analysis
3. Natural language processing:
• Topic modeling
• Language model adaptation
• Machine translation
• Speech recognition
4. Reinforcement learning:
• Bayesian exploration strategies
• Probabilistic policy updates
• Uncertainty-aware decision making
Gaussian Mixture model
• Have you ever wondered how machine learning algorithms can effortlessly categorize complex data into distinct groups?
• GMMs excel at estimating density and clustering data.
• A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data points are generated from a mixture of several Gaussian distributions with unknown parameters.
• The probability density function (pdf) of a GMM for data x is given by:
p(x) = Σ_{k=1}^{K} πₖ N(x | μₖ, Σₖ)
• πₖ are the mixture weights
• N(x | μₖ, Σₖ) is the Gaussian density function with mean μₖ and covariance Σₖ
• K is the number of Gaussian components
Expectation – Maximization Algorithm
• The parameters of the GMM that need to be estimated are:
• Mixture weights: πₖ for k = 1,…,K
• Means: μₖ for k = 1,…,K
• Covariances: Σₖ for k = 1,…,K
• The EM algorithm is used to find the maximum likelihood estimates of the parameters in a GMM. It consists of two steps.
• E step (Expectation): compute the responsibilities
γᵢₖ = P(zₖ = 1 | xᵢ) = πₖ N(xᵢ | μₖ, Σₖ) / Σ_{j=1}^{K} πⱼ N(xᵢ | μⱼ, Σⱼ)
• γᵢₖ is the posterior probability that the k-th component generated the data point xᵢ.
• M step (Maximization): update the parameters
• New mean: μₖ^new = Σ_{i=1}^{N} γᵢₖ xᵢ / Σ_{i=1}^{N} γᵢₖ
• New covariance: Σₖ^new = Σ_{i=1}^{N} γᵢₖ (xᵢ − μₖ)(xᵢ − μₖ)ᵀ / Σ_{i=1}^{N} γᵢₖ
• New weight: πₖ^new = (Σ_{i=1}^{N} γᵢₖ) / N
• The log-likelihood to maximize is:
L = Σ_{i=1}^{N} log Σ_{k=1}^{K} πₖ N(xᵢ | μₖ, Σₖ)
• Convergence: the algorithm iterates until the change in log-likelihood is less than a threshold:
|L^new − L^old| < threshold
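A compact numpy sketch of these E and M updates for a one-dimensional GMM (scalar variances), assuming numpy is available; scikit-learn's GaussianMixture provides a full implementation for the general case:

import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, iters=100, tol=1e-6, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    N = len(x)
    pi = np.full(K, 1.0 / K)                      # mixture weights
    mu = rng.choice(x, K, replace=False)          # initial means: random datapoints
    var = np.full(K, np.var(x))                   # initial variances
    prev_ll = -np.inf
    for _ in range(iters):
        # E step: responsibilities gamma_ik (K x N)
        dens = np.array([pi[k] * gaussian(x, mu[k], var[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=0)
        # M step: update means, variances, weights
        Nk = gamma.sum(axis=1)
        mu = (gamma @ x) / Nk
        var = np.array([(gamma[k] * (x - mu[k])**2).sum() / Nk[k] for k in range(K)])
        pi = Nk / N
        # convergence check on the change in log-likelihood (computed in the E step)
        ll = np.log(dens.sum(axis=0)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, var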
K-Nearest Neighbor Algorithm
• K-Nearest Neighbors (KNN) is a supervised machine learning method employed to tackle classification and regression problems.
• It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining, and intrusion detection.
• It does not make any underlying assumptions about the distribution of the data.
• It is a non-parametric method that makes predictions based on the similarity of data points in a given dataset.
• K-NN is less sensitive to outliers compared to other algorithms.
• The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance metric, such as Euclidean distance.
• The class or value of the data point is then determined by the majority vote or average of the K neighbors.
• Euclidean distance: d(p, q) = √( Σ_{i=1}^{n} (pᵢ − qᵢ)² )
• d is the Euclidean distance
• p = (p₁, p₂, ..., pₙ) and q = (q₁, q₂, ..., qₙ) are two points in n-dimensional space
• n is the number of dimensions
• In KNN, Euclidean distance is used to:
1. Calculate distances from a new data point to all training points
2. Find the K nearest neighbors (smallest distances)
3. Use those neighbors' labels to predict the new point's label (typically by majority vote)
Example
KNN Algorithm
• Step 1: Select the optimal value of K
• K represents the number of nearest neighbors that need to be considered when making a prediction.
• Step 2: Calculate the distances
• To measure the similarity between the target and the training data points, Euclidean distance is used. The distance is calculated between each of the data points in the dataset and the target point.
• Step 3: Find the nearest neighbors
• The K data points with the smallest distances to the target point are the nearest neighbors.
• Step 4: Vote for classification or take the average for regression
• In a classification problem, the class label is determined by performing majority voting over the K nearest neighbors.
• In a regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors.
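A compact numpy sketch of these steps for classification; the tiny training set is made up for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the k neighbours
    votes = [y_train[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

X_train = [[1, 1], [1, 2], [6, 5], [7, 6]]
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, [2, 1], k=3))   # -> "A"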
Unsupervised Learning
• Unlike supervised learning, there are
no labeled targets or scoring systems
• The algorithm must find patterns and
similarities in data independently
• Common use case is clustering -
grouping similar inputs together
• Uses internal error criteria, such as the Euclidean distance between points and centroids
• Relies on distance measures (typically Euclidean) to determine similarity.
K-means algorithm
• Initialisation
• – choose a value for k
• – choose k random positions in the input space
• – assign the cluster centres μⱼ to those positions
• Learning
• – repeat until the cluster centres stop moving:
• ∗ for each datapoint xᵢ:
• · compute the distance to each cluster centre
• · assign the datapoint to the nearest cluster centre, with distance dᵢ = minⱼ d(xᵢ, μⱼ)
• ∗ for each cluster centre:
• · move the position of the centre to the mean of the points in that cluster (Nⱼ is the number of points in cluster j): μⱼ = (1/Nⱼ) Σ_{i=1}^{Nⱼ} xᵢ
• Usage
• – for each test point:
• ∗ compute the distance to each cluster centre
• ∗ assign the datapoint to the nearest cluster centre, with distance dᵢ = minⱼ d(xᵢ, μⱼ)
• Here μⱼ is the cluster mean and xᵢ is a datapoint; dᵢ is the distance between xᵢ and its nearest centre μⱼ.
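A numpy sketch of this batch procedure, with the initial centres drawn at random from the data:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]     # initialisation
    for _ in range(iters):
        # assign each datapoint to the nearest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of the points assigned to it
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):             # centres stopped moving
            break
        centres = new_centres
    return centres, labels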
Example
On-line K-means Algorithm
• Initialisation
• – choose a value for k, which corresponds to the number of output nodes
• – initialise the weights to have small random values
• Learning
• – normalise the data so that all the points lie on the unit sphere
• – repeat:
• ∗ for each datapoint:
• · compute the activations of all the nodes
• · pick the winner as the node with the highest activation
• · update the weights
• ∗ until the number of iterations is above a threshold
• Usage
• – for each test point:
• ∗ compute the activations of all the nodes
• ∗ pick the winner as the node with the highest activation
• Network structure:
• Input layer: raw data points (x₁, x₂, x₃, etc.)
• Single competitive layer: cluster centres (C₁, C₂, C₃, etc.); each weight vector is a cluster centre
• No bias nodes
• Direct connections from the inputs to the competitive layer
• Winner-takes-all mechanism: each neuron computes an activation (h) based on the input
• Only the highest-activated neuron "wins"
• The winner represents the closest cluster centre
• Only the winner's weights get updated
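A short sketch of a single competitive (winner-takes-all) update under the assumptions above (normalised weights and inputs, dot product as the activation); the learning rate eta is an illustrative choice:

import numpy as np

def online_kmeans_step(weights, x, eta=0.1):
    """weights: (k, d) array of unit-length cluster centres; x: unit-length datapoint."""
    activations = weights @ x                              # activation of every output node
    winner = int(np.argmax(activations))                   # winner-takes-all
    weights[winner] += eta * (x - weights[winner])         # move only the winner towards x
    weights[winner] /= np.linalg.norm(weights[winner])     # keep it on the unit sphere
    return winner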