Decision Trees: Classification and Regression Trees (CART)

Ayush Singhal & Gaurav Shukla

National Institute of Science Education and Research (NISER)


Bhubaneswar

February 17, 2023


CART: Classification and Regression Trees

1 Introduction
  1.1 Decision Trees
  1.2 CART
  1.3 How CART works
  1.4 CART Algorithm
2 Gini Impurity
  2.1 Introduction
  2.2 Example
  2.3 Example
  2.4 Example
3 CART
  3.1 Example
  3.2 Example
  3.3 Example
  3.4 Code Sample
  3.5 Benefits
  3.6 Advantages and Disadvantages

Introduction
Decision Trees

▶ A decision tree is a tree-like model that is used for making decisions. It consists of nodes that
represent decision points, and branches that represent the outcomes of those decisions. The
decision points are based on the values of the input variables, and the outcomes are the
possible classifications or predictions.
▶ A decision tree is constructed by recursively partitioning the input data into subsets based on
the values of the input variables. Each partition corresponds to a node in the tree, and the
partitions are chosen so as to minimize the impurity of the resulting subsets.

Introduction
CART

▶ The CART (Classification and Regression Trees) algorithm is a decision tree-based algorithm
that can be used for both classification and regression problems in machine learning. It works
by recursively partitioning the training data into smaller subsets using binary splits.
▶ CART is a powerful and popular algorithm due to its ability to handle both categorical and
continuous features, its interpretability, and its ability to capture non-linear relationships
between features and the target variable.
▶ The Classification and Regression Trees (CART) algorithm always creates a binary tree, which
means that each non-terminal node has two child nodes. This is in contrast with other
tree-based methods, which may allow a node to have more than two child nodes.

Introduction
How CART works

▶ The algorithm works by recursively partitioning the training data into smaller subsets using
binary splits. The tree starts at the root node, which contains all the training data, and
recursively splits the data into smaller subsets until a stopping criterion is met.
▶ At each node of the tree, the algorithm selects a feature and a threshold that best separate the
training data into two groups, based on the values of that feature. This is done by choosing the
feature and threshold that maximize the information gain or, equivalently, minimize the weighted
Gini impurity of the resulting subsets; both are measures of how well a split separates the data.
▶ The process continues recursively, with each node in the tree splitting the data into two smaller
subsets, until a stopping criterion is met. The stopping criterion could be a maximum depth for
the tree, a minimum number of instances in each leaf node, or other criteria.
▶ Once the tree is built, it can be used to make predictions by traversing the tree from the root
node to a leaf node that corresponds to the input data. For regression problems, the prediction
is the average of the target values in the leaf node. For classification problems, the prediction is
the majority class in the leaf node.
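▶ A minimal sketch of this prediction step in Python, assuming a node object like the hypothetical
TreeNode sketched earlier (the names are illustrative):

def predict_one(node, x):
    # Walk from the root down to a leaf by following the binary splits.
    while not node.is_leaf():
        if x[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    # The leaf stores the majority class (classification) or the mean target (regression).
    return node.prediction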

Introduction
CART Algorithm

1. Calculate the Gini impurity for the entire dataset. This is the impurity of the root node.
2. For each input variable, calculate the Gini impurity for all possible split points. The split point
   that results in the minimum Gini impurity is chosen.
3. The data is split into two subsets based on the chosen split point, and a new node is created for
   each subset.
4. Steps 2 and 3 are repeated for each new node, until a stopping criterion is met. This stopping
   criterion could be a maximum tree depth, a minimum number of data points in a leaf node, or a
   minimum reduction in impurity.
5. The resulting tree is the decision tree.
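
▶ A compact sketch of steps 2 and 3 in Python (the helper names are hypothetical and not from any
library; the stopping criteria from step 4 are omitted for brevity):

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 minus the sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Scan every feature and every observed threshold; keep the split with the
    # lowest weighted Gini impurity of the two resulting subsets.
    best_feature, best_threshold, best_impurity = None, None, float("inf")
    n = len(y)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits
            impurity = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            if impurity < best_impurity:
                best_feature, best_threshold, best_impurity = feature, threshold, impurity
    return best_feature, best_threshold, best_impurity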

Gini Impurity
Introduction

▶ Gini impurity is a measure of the quality of a split in a decision tree algorithm. It is the
probability that a randomly chosen element from the set would be incorrectly labeled if it were
labeled at random according to the distribution of labels in the subset.
▶ The Gini impurity of a node can be calculated as follows:

G = 1 − \sum_{i=1}^{c} p_i^2

where c is the number of classes, and p_i is the probability of a randomly chosen element in the
node being labeled as class i.
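▶ For example, a node containing only one class has G = 1 − 1^2 = 0 (perfectly pure), while a node
split evenly between two classes has G = 1 − (1/2)^2 − (1/2)^2 = 0.5, the largest possible impurity
for a two-class problem.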

Gini Impurity
Example

▶ To illustrate how the Gini impurity is calculated, consider the following example. Suppose we
have a binary classification problem, where we want to predict whether a person will buy a
particular product based on their age and income.

Age    Income    Buy Product?
30     20000     Yes
40     50000     Yes
20     30000     No
50     60000     No
60     80000     Yes
▶ We want to build a decision tree to predict whether a person will buy the product based on
their age and income. Suppose we want to split the data based on age, with a threshold of 35.
The left child node will contain the data where age is less than or equal to 35, and the right
child node will contain the data where age is greater than 35.

Gini Impurity
Example

▶ The Gini impurity for the left child node can be calculated as follows:

G_left = 1 − (1/2)^2 − (1/2)^2 = 0.5
There are two data points in the left child node, one of which buys the product, and one of
which does not.
▶ Therefore, the probability of a randomly chosen element being labeled as "Yes" is 0.5, and the
probability of it being labeled as "No" is also 0.5.
▶ The Gini impurity for the right child node can be calculated as follows:

G_right = 1 − (2/3)^2 − (1/3)^2 ≈ 0.444
There are three data points in the right child node, two of which buy the product, and one of
which does not.
▶ Therefore, the probability of a randomly chosen element being labeled as "Yes" is 2/3, and the
probability of it being labeled as "No" is 1/3.

Gini Impurity
Example

▶ The overall Gini impurity for the split can be calculated as a weighted average of the Gini
impurities for the child nodes, where the weights are proportional to the number of data points
in each node.
G_split = (2/5) G_left + (3/5) G_right ≈ 0.467
▶ The decision tree algorithm will try different thresholds and different features to find the split
that minimizes the Gini impurity.
▶ Here is example Python code that calculates the Gini impurity for a binary classification
problem:

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    labels = np.asarray(labels)
    n = len(labels)
    classes = np.unique(labels)
    gini = 0.0
    for c in classes:
        p = np.sum(labels == c) / n   # proportion of samples belonging to class c
        gini += p ** 2
    return 1.0 - gini
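
▶ As a quick check, the function above reproduces the values from the age ≤ 35 split:

left = ["Yes", "No"]                  # ages 30 and 20
right = ["Yes", "No", "Yes"]          # ages 40, 50 and 60
g_left = gini_impurity(left)          # 0.5
g_right = gini_impurity(right)        # about 0.444
g_split = (2 / 5) * g_left + (3 / 5) * g_right
print(round(g_split, 3))              # about 0.467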
CART
Example

Let’s consider an example of using the CART algorithm for a binary classification problem.
▶ Suppose we have a dataset of patients with information about their age, gender, blood
pressure, and cholesterol level, and whether or not they have heart disease. We want to build a
decision tree to predict whether a new patient will have heart disease based on their age,
gender, blood pressure, and cholesterol level.
▶ To start, we calculate the Gini impurity of the entire dataset. Suppose there are 500 patients in
the dataset, 200 of whom have heart disease and 300 of whom do not. The Gini impurity is:

G = 1 − (200/500)^2 − (300/500)^2 = 0.48
▶ Next, we consider each input variable and each possible split point. Suppose we choose age as
the first split variable. We consider all possible split points and calculate the Gini impurity for
each split. Suppose the split point that results in the minimum Gini impurity is 50 years.

CART
Example

▶ We split the data into two subsets: patients who are 50 years old or younger, and patients who
are older than 50. We create two new nodes for these subsets and calculate the Gini impurity
for each node.
▶ Suppose the first node contains 300 patients, of which 100 have heart disease and 200 do not.
The Gini impurity of this node is:

G_1 = 1 − (100/300)^2 − (200/300)^2 ≈ 0.44
▶ Suppose the second node contains 200 patients, of which 100 have heart disease and 100 do not.
The Gini impurity of this node is:

G_2 = 1 − (100/200)^2 − (100/200)^2 = 0.5
▶ We choose the split that results in the minimum Gini impurity, which is the split on age at 50.
▶ We create two new nodes for the subsets and continue the process until we meet a stopping
criterion.
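▶ The weighted impurity of this split is

G_split = (300/500) G_1 + (200/500) G_2 = 0.6 × (4/9) + 0.4 × 0.5 ≈ 0.467

which is lower than the impurity of the root node (0.48), so the split reduces impurity.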

CART
Example

▶ Suppose we set the stopping criterion to be a maximum tree depth of 2. The resulting decision
tree would look like this:

[Figure: decision tree of depth 2, with the root split on age at 50 years]
▶ This decision tree can be used to predict whether a new patient will have heart disease based
on their age, blood pressure, and cholesterol level.
CART
Code Sample

▶ In this example, we’re using the first two features of the iris dataset (sepal length and sepal
width) for simplicity. We’re building a decision tree classifier with a maximum depth of 2 and
fitting it to the data. We’re then making predictions for new data and printing the predictions.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
# Use the first two features for simplicity
X = iris.data[:, :2]
y = iris.target
# Create a decision tree classifier with a maximum depth of 2
clf = DecisionTreeClassifier(max_depth=2)
# Fit the classifier to the data
clf.fit(X, y)
# Make predictions for new data
new_data = [[5.1, 3.5], [7.2, 3.6], [6.0, 2.2]]
predictions = clf.predict(new_data)
# Print the predictions
print(predictions)
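
▶ As a small follow-up sketch (assuming the same clf and iris objects from the example above), the
fitted tree can also be printed in text form with scikit-learn's export_text, which shows the
learned splits and leaf classes:

from sklearn.tree import export_text
print(export_text(clf, feature_names=iris.feature_names[:2]))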
CART
Benefits

▶ One of the significant benefits of the tree-based method, especially CART, is that the
decision-making process closely resembles how humans make decisions. This makes it easy to
understand and accept the results obtained from the tree-style decision-making process. This
intuitive explanatory power is a crucial reason why the tree-based method is likely to remain
relevant.
▶ Another advantage of CART is that it allows diverse types of input data, unlike many linear
combination methods like logistic regression or support vector machines. CART can handle
mixed input data, including numerical variables like price or area and categorical variables like
house type or location. This flexibility makes CART a tool of choice in a variety of
applications. CART’s ability to handle a range of input data types, coupled with its ease of
interpretability, makes it a powerful algorithm for many machine learning tasks.
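
▶ As an illustrative sketch of feeding mixed data to a tree in scikit-learn (which expects numeric
input, so a categorical column such as house type is one-hot encoded first), assuming pandas and
scikit-learn are available; the column names and values below are invented for this example:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical housing data mixing numerical and categorical columns.
houses = pd.DataFrame({
    "area": [50, 80, 120, 65],
    "house_type": ["flat", "detached", "detached", "flat"],
    "price": [150000, 320000, 450000, 180000],
})

# One-hot encode the categorical column so the tree sees only numeric features.
X = pd.get_dummies(houses[["area", "house_type"]])
y = houses["price"]

# Fit a small regression tree on the encoded features and predict on the same data.
reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict(X))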

CART
Advantages and Disadvantages

Advantages
▶ It is a simple and intuitive algorithm that is easy to understand and interpret.
▶ It can handle both numerical and categorical data.
▶ It can handle missing values using surrogate splits.
▶ It handles multi-class classification problems directly, since the Gini impurity is defined for
any number of classes.

Disadvantages
▶ It tends to overfit the data, especially if the tree is allowed to grow too deep.
▶ It is a greedy algorithm that may not find the optimal tree.
▶ It may be biased towards predictors with many categories or high cardinality.
▶ It may produce unstable results: small changes or noise in the data can lead to a very different tree.

References

▶ Classification and Regression Trees by Leo Breiman, Jerome Friedman, Richard A. Olshen and
Charles J. Stone.
▶ COS 402 Lecture 2: Classification and Decision Trees, Sanjeev Arora and Elad Hazan (Fall 2016).
▶ SNU Lectures on ML, Lecture 9: Classification and Regression Tree (CART), Hyeong In Choi
(Fall 2017).

