A Decision Tree Model for Car Safety Classification

1st Melpakkam Pradeep (CH20B064)


Department of Chemical Engineering
IIT Madras
Chennai, India
[email protected]

Abstract—In this paper, we present a comprehensive analysis of the application of decision trees to classify cars based on their safety using the Car Evaluation Database. Our study aims to demonstrate the effectiveness of decision trees as a predictive model and explores the impact of various hyperparameters on the classifier's performance. We achieved remarkable results with an accuracy and F1 score of 96%, indicating the robustness of this approach in accurately categorizing cars according to their safety features. Through a systematic examination of hyperparameters such as tree depth, impurity criteria, and minimum sample split, we offer valuable insights into optimizing decision tree models for car safety classification. This research contributes to the understanding of decision tree models in the automotive safety domain, providing a foundation for improving vehicle safety assessment systems. Section IV has been changed.

Index Terms—Decision Trees, bootstrapping, Car Evaluation, Classification, cross-validation
I. INTRODUCTION

This paper delves into the realm of automotive safety assessment, particularly the classification of cars based on their safety attributes. The primary tool employed in this endeavor is the decision tree, which functions as a systematic algorithm for decision-making. These decision trees assist in categorizing cars according to their safety, utilizing the Car Evaluation Database as the source of data.

The significance of this research lies in the profound implications for safety within the automotive industry. Understanding how decision trees can effectively evaluate car safety is instrumental in aiding prospective car buyers to make well-informed decisions, and it also serves as a guide for policymakers in establishing safety standards. This improved decision-making process holds the potential to enhance safety and, ultimately, save lives.

The noteworthy aspect of this study lies in the exceptional accuracy achieved through the application of decision trees. This methodology successfully identifies safe and unsafe cars with an impressive accuracy rate of 96%. In addition to this substantial achievement, we further explore the intricate domain of hyperparameters, which are the adjustable settings governing the behavior of decision trees. This exploration allows us to fine-tune the performance of decision trees in car safety classification.

The research expands beyond numerical results and data; it is fundamentally oriented toward enhancing the safety landscape for all individuals utilizing automobiles. This paper invites you to embark on an enlightening journey as we investigate the utilization of decision trees for car safety classification. Through our findings, we aim to contribute to the larger mission of making the roads safer and more secure for all.

II. DATA AND CLEANING

A. The Dataset

One dataset (Car evaluation.xlsx) was provided to train the Decision Tree model. This dataset contained around 1728 training samples. The target label was a four-class 'Target', with a car's safety level being either "Unacceptable", "Acceptable", "Good" or "Very Good". The dataset contained only categorical variables. The features in the dataset are summarized in Table I.

TABLE I
TABLE OF THE FEATURES IN THE GIVEN DATASET ALONG WITH THEIR DESCRIPTIONS. WE OBSERVE THAT ALL VARIABLES ARE CATEGORICAL.

Feature    Description                    Type
buying     Buying Price                   Categorical (4)
maint      Price of Maintenance           Categorical (4)
doors      Number of doors                Categorical (5)
persons    Carrying Capacity (Persons)    Categorical (3)
lug-boot   Size of Luggage Boot           Categorical (3)
safety     Estimated Safety               Categorical (3)
Target     Safety Rating                  Categorical (4)

B. Data Cleaning

A pipeline is coded to take a dataset of the above format and a flag ('train' or 'test') and clean it. We now move on to imputing missing values.

A Simple Imputer based on the most frequent value is used on the dataset to impute missing values. This largely preserves the variable distributions. Finally, the variables are converted to their appropriate types and the cleaned dataset is returned. No confounding symbols are present in the train or test data; we only find missing values.

There are multiple imputation techniques available. One can impute missing values by 0, by the mean or median, based on the k-NN of the data point, or by randomly sampling from the distribution of the variable. The Expectation Imputers distort the distribution of the imputed data about the expectation estimator used, when compared to the Random Sampling Imputer (RSI) and the KNN Imputer.

Unfortunately, the RSI is a slow imputation technique. Either a prior distribution must be assumed and its parameters estimated from data, or a non-parametric method such as a Kernel Density Estimate (KDE) can be used.

However, given that we are dealing with multiple categorical variables, we choose to use the most frequent value for imputation, given the KNN's difficulty with handling categorical variables.
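A minimal sketch of such a cleaning step, assuming pandas and scikit-learn and using illustrative names (clean_dataset, the column list from Table I) rather than the authors' actual code:

import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative column names taken from Table I; the actual pipeline may differ.
FEATURES = ["buying", "maint", "doors", "persons", "lug-boot", "safety"]
TARGET = "Target"

def clean_dataset(df: pd.DataFrame, flag: str = "train") -> pd.DataFrame:
    """Impute missing entries with the most frequent category and fix dtypes."""
    df = df.copy()
    cols = FEATURES + ([TARGET] if flag == "train" else [])
    # Mode imputation largely preserves each categorical variable's distribution.
    imputer = SimpleImputer(strategy="most_frequent")
    df[cols] = imputer.fit_transform(df[cols])
    df[cols] = df[cols].astype("category")  # all variables are categorical
    return df

# Hypothetical usage with the provided spreadsheet:
# train_df = clean_dataset(pd.read_excel("Car evaluation.xlsx"), flag="train")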
We find that all values are present in all columns. So, data imputation is not required. To visualize the given dataset better, in Figs. 1-6, we present the Count Plots of some categorical variables in the dataset, grouped by the "Target" feature.

Fig. 1. The count plot of the various classes of Buying Price for the various target classes is shown. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 2. The count plot of the various classes of Maintenance Price for the various target classes is shown. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 3. The count plot of the various classes of Number of Doors for the various target classes is shown. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 4. The count plot of the various classes of Capacity (Persons) for the various target classes is shown. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 5. The count plot of the various classes of Luggage Capacity (Boot) for the various target classes is shown. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 6. The count plot of the various classes of Safety Rating for the various target classes is shown. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.
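The count plots in Figs. 1-6 could be produced along these lines (a sketch assuming seaborn and a cleaned DataFrame named train_df; the paper does not state its plotting library):

import matplotlib.pyplot as plt
import seaborn as sns

# train_df: the cleaned training DataFrame (assumed to exist).
for col in ["buying", "maint", "doors", "persons", "lug-boot", "safety"]:
    sns.countplot(data=train_df, x=col, hue="Target")
    plt.title(f"Count plot of {col} grouped by Target")
    plt.show()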
III. METHODS

A. Decision Tree Classifier

A decision tree is a hierarchical structure used for decision-making by recursively splitting the dataset into subsets based on the values of different attributes. Each internal node of the tree represents a test on an attribute, and each leaf node represents a class label or a decision. The goal of a decision tree is to create a model that predicts the class label of an instance based on its attributes.

Entropy is a key concept in decision tree construction. It measures the impurity or randomness of a dataset. The entropy of a dataset D with k classes is defined as:

H(D) = -\sum_{i=1}^{k} p_i \log_2(p_i)

where p_i is the proportion of instances in class i in the dataset. High entropy indicates high impurity, while low entropy means the dataset is more homogeneous.

Information gain quantifies the reduction in entropy achieved by splitting the dataset on a particular attribute. It is calculated as:

IG(D, A) = H(D) - \sum_{v \in \text{values}(A)} \frac{|D_v|}{|D|} H(D_v)

Here, A is an attribute, values(A) represents the set of values of attribute A, D_v is the subset of instances for which attribute A has value v, and |D| and |D_v| represent the sizes of the datasets.

The process of constructing a decision tree involves selecting the best attribute to split on at each node based on information gain. This process continues recursively until one of the stopping criteria is met, such as a maximum tree depth or a minimum number of instances in a leaf node. The result is a tree structure that can be used for classification.

Decision trees are susceptible to overfitting, where the tree becomes overly complex and captures noise in the data. Pruning is a technique used to reduce the complexity of the tree by removing branches that do not significantly improve predictive accuracy. Pruned trees tend to be more robust and generalize better to unseen data.

Decision trees are a fundamental tool in classification tasks, providing a clear and interpretable way to make decisions based on data. A solid understanding of the underlying theory, including concepts like entropy, information gain, tree construction, and pruning, is essential for effectively applying this method to real-world problems.
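To make the split criterion concrete, here is a small illustrative sketch (not the authors' code) that computes the entropy of a label column and the information gain of a candidate categorical attribute, using the formulas above:

import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """H(D) = -sum_i p_i * log2(p_i) over the class proportions."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, attribute: str, target: str = "Target") -> float:
    """IG(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    total = entropy(df[target])
    weighted = 0.0
    for _, subset in df.groupby(attribute, observed=True):
        weighted += (len(subset) / len(df)) * entropy(subset[target])
    return total - weighted

# Hypothetical usage on the cleaned training frame:
# ig_safety = information_gain(train_df, "safety")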

B. Classification Metrics
There are various metrics that can evaluate the goodness-
of-fit of a given classifier. Some of these metrics are presented
in this section. In classification tasks, it is essential to choose
appropriate evaluation metrics based on the problem’s context
and objectives.
1) Accuracy: Accuracy is one of the most straightforward classification metrics and is defined as:

\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{1}
It measures the proportion of correct predictions made by the
model. While accuracy provides an overall sense of model
performance, it may not be suitable for imbalanced datasets, where one class dominates the other.

2) Recall: Recall, also known as sensitivity or true positive rate, quantifies a model's ability to correctly identify positive instances:

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{2}

Recall is essential when the cost of missing positive cases (false negatives) is high, such as in medical diagnoses.

3) Precision: Precision measures the accuracy of positive predictions made by the model:

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{3}

Precision is valuable when minimizing false positive predictions is critical, like in spam email detection.

4) F1-score: The F1 score is the harmonic mean of precision and recall, providing a balance between the two:

\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}

It is particularly useful when there is an uneven class distribution or when both precision and recall need to be considered simultaneously.

5) Receiver Operating Characteristic Curve (ROC Curve): The ROC curve is a graphical representation of a model's performance across different classification thresholds. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold values. The area under the ROC curve (AUC-ROC) quantifies the model's overall performance. A higher AUC-ROC indicates that the model is better at distinguishing between positive and negative instances.
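A short sketch of computing these metrics with scikit-learn for the four-class problem (macro averaging and the variable names are assumptions; the paper does not state its averaging scheme):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# y_val, y_pred: true and predicted labels on the validation split (assumed to exist).
def summarize_metrics(y_val, y_pred):
    """Report multi-class metrics; macro averaging weights all four classes equally."""
    return {
        "accuracy": accuracy_score(y_val, y_pred),
        "precision": precision_score(y_val, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_val, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_val, y_pred, average="macro", zero_division=0),
        "confusion_matrix": confusion_matrix(y_val, y_pred),
    }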
Fig. 7. A sample ROC curve from a classifier. Note the trade-off between sensitivity and specificity. Based on the problem, we may be required to optimize for only one.

IV. RESULTS

A. Existence of Linear Relationships among the Independent Variables

The correlation heatmap of the independent variables is presented in Fig. 8. It is evident that there are no significant correlations among the variables. No linear associations between the independent variables are identified, allowing us to move forward with our classifier.

Fig. 8. The correlation heatmap between all independent variables. This was obtained by finding the pairwise correlation coefficient between each pair of independent variables. The color gradient indicates the magnitude of the correlation between the variables.
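Because all features are categorical, producing such a heatmap requires an encoding step first. A plausible sketch, assuming an ordinal encoding (the paper does not specify how the variables were encoded):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder

# train_df: cleaned training DataFrame (assumed to exist).
features = ["buying", "maint", "doors", "persons", "lug-boot", "safety"]
encoded = train_df.copy()
encoded[features] = OrdinalEncoder().fit_transform(encoded[features])

# Pairwise correlation coefficients between the encoded independent variables.
sns.heatmap(encoded[features].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap of the independent variables")
plt.show()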

B. Decision Trees are highly interpretable

Although Decision Trees exhibit high variance as classifiers, they boast a remarkable level of interpretability. The decision boundary at each split can be discerned, aiding in comprehending the model's classification criteria and emphasizing key features.

In a way, Decision Trees mimic human decision-making, providing clear decision paths and transparent fits. While they may be less potent in predictive tasks, their appeal lies in their interpretability. In Figs. 9-10, we visually represent Decision Trees with depths of 2 and 3. Notably, consistent decision boundaries are evident across both trees.

The paramount feature is the Safety Rating, succeeded by Passenger Capacity and Buying Price. The prominence of Safety Rating is self-evident. Passenger Capacity serves as a reliable indicator, as vehicles with higher capacities generally necessitate stricter safety standards. Finally, a higher Buying Price implies superior material quality and elevated safety standards.

Fig. 9. The Decision Tree of Depth 2 trained on a part of the Car Evaluation Dataset is visualized. We find that the Safety Rating is the most important feature, followed by the Passenger Capacity. This is expected since vehicles with higher capacities require higher safety standards.
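The shallow trees in Figs. 9-10 and their feature importances could be generated roughly as follows (a sketch assuming scikit-learn; X_train, y_train and the plotting details are placeholders, not the authors' exact code):

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# X_train, y_train: ordinally encoded features and labels (assumed to exist).
for depth in (2, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    plot_tree(tree, feature_names=list(X_train.columns),
              class_names=tree.classes_.astype(str), filled=True)
    plt.title(f"Decision Tree of depth {depth}")
    plt.show()
    # Impurity-based importances highlight which features drive the splits.
    print(dict(zip(X_train.columns, tree.feature_importances_)))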

C. Decision Trees are fast and accurate classifiers

To train and assess our Decision Tree model, we partitioned our train data into training and validation sets. This split was executed using a fixed random seed for reproducibility, with 20% of the provided data allocated to the validation set.

The Decision Tree model was initially trained on the training split. Subsequently, we performed bootstrapping on the validation set, generating 1000 bootstrap samples. We computed the evaluation metrics outlined in Section III-B. The 95% confidence intervals for our evaluation metrics are detailed in Table II. In Fig. 11, we illustrate the Confusion Matrix for multiple classes on the Validation Set. Probability distributions and empirical cumulative distribution functions (ECDFs) of our evaluation metrics are depicted in Figs. 12-15.
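A condensed sketch of this split-fit-bootstrap procedure (the seed value, encoding, and variable names are placeholders; the text only specifies a fixed seed, a 20% validation split, and 1000 bootstrap samples):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# X, y: encoded features and labels from the cleaned dataset (assumed to exist).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_val)

# Bootstrap the validation set to get a 95% CI for accuracy (analogous for other metrics).
rng = np.random.default_rng(42)
n = len(y_val)
scores = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)  # resample with replacement
    scores.append(accuracy_score(np.asarray(y_val)[idx], y_pred[idx]))
ci_low, ci_high = np.percentile(scores, [2.5, 97.5])
print(f"Accuracy: {accuracy_score(y_val, y_pred):.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")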
Fig. 12. The left plot contains the histogram of the accuracy obtained for
each bootstrap sample from the validation split. The right plot contains the
ECDF of the accuracy obtained for each bootstrap sample from the validation
split. We find that the metric is high and its variance is acceptable

Fig. 10. The Decision Tree of Depth 3 trained on a part of the Car Evaluation
Dataset is visualized. We find that the Safety Rating is the most important
feature, followed by the Passenger Capacity and then the Buying Price. This is
expected since vehicles with higher capacities require higher safety standards.
Also a higher Buying Price implies better quality of materials and higher
safety standards

TABLE II
EVALUATION METRICS OF THE DECISION TREE CLASSIFIER. WE FIND THAT ACCURACY AND PRECISION ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.

Metric      Value   95% CI
Accuracy    0.96    (0.93, 0.99)
Precision   0.97    (0.95, 0.99)
Recall      0.96    (0.94, 0.98)
F1 Score    0.96    (0.94, 0.98)

Fig. 13. The left plot contains the histogram of the recall obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the recall obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 11. Confusion Matrix for our Decision Tree Classifier, evaluated on the validation set. We find that only 12 out of 346 instances are misclassified.

Fig. 14. The left plot contains the histogram of the precision obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the precision obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 15. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

D. Decision Trees match the performance of other classifiers

In this section, we conduct a comparative analysis of Decision Trees with several other classification methods. Specifically, we assess the performance of Decision Trees against Logistic Regression, Support Vector Classifiers, Random Forests, Boosted Trees, 5-Nearest Neighbours, and Naive Bayes.

We execute 100 random test-train splits and train each classification model. We evaluate the accuracy and the F1 score of the various classifiers. The resulting metrics from bootstrapped test sets are utilized to construct 95% confidence intervals for the diverse classification models. Plots depicting the performance of the various classification methods can be found in Figs. 16-17. Our findings indicate that Boosted Trees emerge as the top-performing classifier, demonstrating superior performance. Conversely, Naive Bayes exhibits the poorest performance, marked by high variance. Notably, Decision Trees showcase comparable accuracy and F1 scores, along with variance levels similar to Boosted Trees. Furthermore, Decision Trees offer the advantage of interpretability, rendering them a strong contender as the "best" classifier for our model.
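A rough sketch of this comparison protocol; the specific model configurations are assumptions (for instance, CategoricalNB standing in for Naive Bayes on integer-encoded features), and for brevity the confidence intervals below come from the spread over the 100 splits rather than additional bootstrapping of each test set:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

MODELS = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(),
    "Boosted Trees": GradientBoostingClassifier(),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": CategoricalNB(),
    "Decision Tree": DecisionTreeClassifier(),
}

# X, y: integer-encoded features and labels (assumed to exist).
results = {name: [] for name in MODELS}
for seed in range(100):  # 100 random test-train splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    for name, model in MODELS.items():
        model.fit(X_tr, y_tr)
        results[name].append(f1_score(y_te, model.predict(X_te), average="macro"))

for name, scores in results.items():
    lo, hi = np.percentile(scores, [2.5, 97.5])
    print(f"{name}: mean F1 = {np.mean(scores):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")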
Fig. 16. Plot of the Accuracy of various classification methods on predicting Car Safety Rating. We find that Boosted Trees are the best performing classifier, with Naive Bayes having the worst performance with high variance. Decision Trees have comparable accuracy and variance with the Boosted Trees, in addition to providing interpretability to our model, making them the "best" classifier.

Fig. 17. Plot of the F1 Score of various classification methods on predicting Car Safety Rating. We find that Boosted Trees are the best performing classifier, with Naive Bayes having the worst performance with high variance. Decision Trees have comparable F1 Score and variance with the Boosted Trees, in addition to providing interpretability to our model, making them the "best" classifier.

E. Decision Trees are very sensitive to hyperparameters

Decision Trees are renowned for possessing multiple hyperparameters, distinguishing them from methods like Logistic Regression and Naive Bayes. The performance of Decision Trees can vary significantly based on the tuning of these hyperparameters, with the tree depth being particularly crucial, as deeper trees tend to overfit.

Our classification model achieves a train accuracy of 100% and a validation accuracy of 96%. While the risk of overfitting seems low, we aim to explore the performance of different decision trees by investigating how accuracy and the F1 score fluctuate with various hyperparameters.

In Figs. 18-21, we depict the variation of accuracy and the F1 score across different hyperparameters. To provide a comprehensive perspective, we conduct 100 bootstraps of the validation set, evaluating accuracy and the F1 score. Subsequently, we plot the mean metrics, accompanied by error bars representing 2 standard deviations, assuming a t-distribution.
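For a single hyperparameter, this sweep could look like the sketch below (the depth grid and the use of 2-standard-deviation error bars are assumptions consistent with the text; the other hyperparameters follow the same pattern):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# X_tr, y_tr, X_val, y_val: the train/validation split from Section IV-C (assumed to exist).
rng = np.random.default_rng(0)
n = len(y_val)

for depth in range(2, 15):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    y_pred = clf.predict(X_val)
    scores = []
    for _ in range(100):  # 100 bootstraps of the validation set
        idx = rng.integers(0, n, size=n)
        scores.append(f1_score(np.asarray(y_val)[idx], y_pred[idx], average="macro"))
    mean, half_width = np.mean(scores), 2 * np.std(scores, ddof=1)
    print(f"max_depth={depth}: F1 = {mean:.3f} +/- {half_width:.3f}")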

Fig. 18. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Max. Depth" of the Decision Tree. We find that deeper trees lead to better performance on the validation set. However, it is possible that the performance may decrease if trees deeper than 14 are used.

Fig. 19. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Min. Samples for Split" of the Decision Tree. We find that fewer samples required for a split lead to better performance on the validation set.

Fig. 20. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Min. Samples per Leaf" of the Decision Tree. We find that this does not largely affect the performance on the validation set.

Fig. 21. Variation of F1 Score (Left) and Accuracy (Right) for various values of "Min. Impurity Decrease". We find an initial drastic decrease in performance, after which this does not affect the decision tree's classification performance.

V. DISCUSSION

Our analysis indicates that decision trees are powerful classifiers for our dataset. In addition to great performance, they also provide excellent interpretability, making them an ideal classifier. We find that the Safety Rating is the best predictor of the Target class, followed by the Passenger Capacity and the Buying Price. The Safety Rating being the most important feature is self-explanatory. The Passenger Capacity is a good indicator, since vehicles with higher capacities require higher safety standards. Finally, a higher Buying Price implies better quality of materials and higher safety standards.

The sensitivity of decision trees to hyperparameters may be a concern. However, our analysis indicates that only the Max. Depth of the tree and the Min. Impurity Decrease for splitting affect the performance appreciably. The variance of the metrics across the other hyperparameters is also low, showing that these two hyperparameters have the largest effect on the tree construction.

The confusion matrix indicates that the decision tree is an excellent classifier even in the presence of skewed class data. With minimal data preprocessing required, high performance, and high interpretability, the decision tree is the "best" classifier for our task in some sense. However, the high variance of the constructed trees may be a concern. While it does not seem to be a problem in this dataset (observe the tight confidence intervals of our metrics), it can easily be remedied through ensemble methods.

VI. CONCLUSIONS AND FUTURE WORK

In conclusion, our study underscores the exceptional utility of decision trees as classifiers for our dataset. Their consistent high performance, coupled with their inherent interpretability, establishes them as the optimal choice for our classification task. The decision tree not only offers accurate predictions but also provides valuable insights into the classification process, making it well-suited for applications where transparency and trust are crucial.

Our feature importance analysis revealed that Safety Rating, Passenger Capacity, and Buying Price are pivotal predictors, shedding light on the relationships between these factors and vehicle safety. These findings hold significance for various stakeholders, including car manufacturers, regulators, and consumers, in making informed decisions related to safety standards and vehicle choices.

Although decision trees exhibit sensitivity to specific hyperparameters, our research indicates that only a limited subset significantly influences model performance. The low variance in metrics across other hyperparameters reinforces the reliability of our model.

Our study lays the foundation for further exploration and improvement. Future work could delve into the development of ensemble methods, such as Random Forests or Gradient Boosting, to mitigate potential variance concerns associated with decision trees. These approaches may enhance overall model stability.

