DAL Assignment 4 Endsem
Abstract—In this paper, we present a comprehensive analysis of the application of decision trees to classify cars based on their safety using the Car Evaluation Database. Our study aims to demonstrate the effectiveness of decision trees as a predictive model and explores the impact of various hyperparameters on the classifier's performance. We achieved an accuracy and F1 score of 96%, indicating the robustness of this approach in accurately categorizing cars according to their safety features. Through a systematic examination of hyperparameters such as tree depth, impurity criteria, and minimum sample split, we offer insights into optimizing decision tree models for car safety classification. This research contributes to the understanding of decision tree models in the automotive safety domain, providing a foundation for improving vehicle safety assessment systems.

Index Terms—Decision Trees, bootstrapping, Car Evaluation, Classification, cross-validation

I. INTRODUCTION

This paper delves into the realm of automotive safety assessment, particularly the classification of cars based on their safety attributes. The primary tool employed in this endeavor is the decision tree, a systematic algorithm for decision-making. Decision trees assist in categorizing cars according to their safety, utilizing the Car Evaluation Database as the source of data.

The significance of this research lies in its implications for safety within the automotive industry. Understanding how decision trees can effectively evaluate car safety is instrumental in aiding prospective car buyers to make well-informed decisions, and it also serves as a guide for policymakers in establishing safety standards. This improved decision-making process holds the potential to enhance safety and, ultimately, save lives.

The noteworthy aspect of this study is the accuracy achieved through the application of decision trees: the methodology identifies safe and unsafe cars with an accuracy of 96%. In addition to this result, we explore the hyperparameters that govern the behavior of decision trees, which allows us to fine-tune their performance in car safety classification.

The research expands beyond numerical results and data; it is fundamentally oriented toward enhancing the safety landscape for all individuals utilizing automobiles. We investigate the utilization of decision trees for car safety classification and, through our findings, aim to contribute to the larger mission of making the roads safer and more secure for all.

II. DATA AND CLEANING

A. The Dataset

One dataset (Car evaluation.xlsx) was provided to train the Decision Tree model. This dataset contained 1728 training samples. The target label was a four-class variable, 'Target', giving a car's safety level as "Unacceptable", "Acceptable", "Good" or "Very Good". The dataset contained only categorical variables. The features in the dataset are summarized in Table I.

TABLE I
FEATURES IN THE GIVEN DATASET ALONG WITH THEIR DESCRIPTIONS. WE OBSERVE THAT ALL VARIABLES ARE CATEGORICAL.

Feature     Description                    Type
buying      Buying Price                   Categorical (4)
maint       Price of Maintenance           Categorical (4)
doors       Number of doors                Categorical (5)
persons     Carrying Capacity (Persons)    Categorical (3)
lug-boot    Size of Luggage Boot           Categorical (3)
safety      Estimated Safety               Categorical (3)
Target      Safety Rating                  Categorical (4)

B. Data Cleaning

A pipeline is coded to take a dataset of the above format and a flag ('train' or 'test') and clean it. We now move on to imputing missing values.

A Simple Imputer based on the most frequent value is used on the dataset to impute missing values. This largely preserves each variable's distribution. Finally, the variables are converted to their appropriate types and the cleaned dataset is returned. No confounding symbols are present in the train or test data; we only find missing values.

There are multiple imputation techniques available. One can impute missing values with 0, with the mean or median, based on the k-NN of the data point, or by randomly sampling from the distribution of the variable. The expectation-based imputers distort the distribution of the imputed data about the expectation estimator used, when compared to the Random Sampling Imputer (RSI) and the KNN Imputer.
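As a concrete illustration of this step, the sketch below implements most-frequent imputation followed by type conversion. It is a minimal sketch assuming pandas and scikit-learn; the function name, column list, and per-call fitting are illustrative rather than the exact pipeline used in this work (in practice, an imputer fitted on the training split would be reused for the test split).

import pandas as pd
from sklearn.impute import SimpleImputer

CATEGORICAL_COLS = ["buying", "maint", "doors", "persons", "lug-boot", "safety"]

def clean_dataset(df: pd.DataFrame, flag: str = "train") -> pd.DataFrame:
    # Clean a raw dataframe: impute missing values with the most frequent
    # class of each column, then cast every variable to a categorical dtype.
    assert flag in ("train", "test")
    df = df.copy()
    cols = CATEGORICAL_COLS + (["Target"] if flag == "train" else [])
    imputer = SimpleImputer(strategy="most_frequent")
    df[cols] = imputer.fit_transform(df[cols])
    for col in cols:
        df[col] = df[col].astype("category")
    return df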
Fig. 1. Count plot of the Buying Price classes for each target class. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 2. Count plot of the Maintenance Price classes for each target class. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 4. Count plot of the Carrying Capacity (Persons) classes for each target class. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 5. Count plot of the Luggage Boot Size classes for each target class. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.

Fig. 6. Count plot of the Safety Rating classes for each target class. Unlike numerical variables, categorical variables are not visualized well using density plots, so we prefer count plots instead.
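For reference, count plots of this kind can be generated as below; a minimal sketch assuming seaborn and matplotlib, with the cleaned dataframe and column names from Section II (illustrative, not the exact plotting code used here).

import matplotlib.pyplot as plt
import seaborn as sns

def plot_feature_counts(df, feature: str, target: str = "Target") -> None:
    # Count plot of one categorical feature, with bars split by target class.
    sns.countplot(data=df, x=feature, hue=target)
    plt.title(f"{feature} by {target}")
    plt.tight_layout()
    plt.show()

# Example: plot_feature_counts(clean_df, "buying")  # cf. Fig. 1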
III. METHODS

A. Decision Tree Classifier

A decision tree is a hierarchical structure used for decision-making by recursively splitting the dataset into subsets based on the values of different attributes. Each internal node of the tree represents a test on an attribute, and each leaf node represents a class label or a decision. The goal of a decision tree is to create a model that predicts the class label of an instance based on its attributes.

Entropy is a key concept in decision tree construction. It measures the impurity or randomness of a dataset. The entropy of a dataset D with k classes is defined as

H(D) = -\sum_{i=1}^{k} p_i \log_2(p_i)

where p_i is the proportion of instances in class i in the dataset. High entropy indicates high impurity, while low entropy means the dataset is more homogeneous.

Information gain quantifies the reduction in entropy achieved by splitting the dataset on a particular attribute. It is calculated as

IG(D, A) = H(D) - \sum_{v \in \text{values}(A)} \frac{|D_v|}{|D|} H(D_v)

Here, A is an attribute, values(A) represents the set of values of attribute A, D_v is the subset of instances for which attribute A has value v, and |D| and |D_v| represent the sizes of the datasets.

The process of constructing a decision tree involves selecting the best attribute to split on at each node based on information gain. This process continues recursively until one of the stopping criteria is met, such as a maximum tree depth or a minimum number of instances in a leaf node. The result is a tree structure that can be used for classification.

Decision trees are susceptible to overfitting, where the tree becomes overly complex and captures noise in the data. Pruning is a technique used to reduce the complexity of the tree by removing branches that do not significantly improve predictive accuracy. Pruned trees tend to be more robust and generalize better to unseen data.

Decision trees are a fundamental tool in classification tasks, providing a clear and interpretable way to make decisions based on data. A solid understanding of the underlying theory, including concepts like entropy, information gain, tree construction, and pruning, is essential for effectively applying this method to real-world problems.

B. Classification Metrics

There are various metrics that can evaluate the goodness-of-fit of a given classifier. Some of these metrics are presented in this section. In classification tasks, it is essential to choose appropriate evaluation metrics based on the problem's context and objectives.

1) Accuracy: Accuracy is one of the most straightforward classification metrics and is defined as

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)    (1)

It measures the proportion of correct predictions made by the model. While accuracy provides an overall sense of model performance, it may not be suitable for imbalanced datasets, where one class dominates the other.

2) Recall: Recall, also known as sensitivity or true positive rate, quantifies a model's ability to correctly identify positive instances:

Recall = True Positives / (True Positives + False Negatives)    (2)

Recall is essential when the cost of missing positive cases (false negatives) is high, such as in medical diagnoses.

3) Precision: Precision measures the accuracy of positive predictions made by the model:

Precision = True Positives / (True Positives + False Positives)    (3)

Precision is valuable when minimizing false positive predictions is critical, like in spam email detection.

4) F1-score: The F1 score is the harmonic mean of precision and recall, providing a balance between the two:

F1 Score = 2 · (Precision · Recall) / (Precision + Recall)    (4)

It is particularly useful when there is an uneven class distribution or when both precision and recall need to be considered simultaneously.

5) Receiver Operating Characteristic Curve (ROC Curve): The ROC curve is a graphical representation of a model's performance across different classification thresholds. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold values. The area under the ROC curve (AUC-ROC) quantifies the model's overall performance. A higher AUC-ROC indicates a better model at distinguishing between positive and negative instances.
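To make the entropy and information gain formulas concrete, both quantities can be computed directly from class counts, as in the sketch below (the function names are ours and illustrative; scikit-learn computes an equivalent quantity internally when criterion="entropy").

import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    # H(D) = -sum_i p_i * log2(p_i) over the class proportions of D.
    p = labels.value_counts(normalize=True).to_numpy()
    p = p[p > 0]  # drop empty classes to avoid log2(0)
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, attribute: str, target: str = "Target") -> float:
    # IG(D, A) = H(D) - sum_v (|D_v| / |D|) * H(D_v).
    weighted_child_entropy = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute, observed=True)
    )
    return entropy(df[target]) - weighted_child_entropy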
Fig. 7. A sample ROC curve from a classifier. Note the trade-off between sensitivity and specificity. Depending on the problem, we may be required to optimize for only one.

IV. RESULTS

A. Existence of Linear Relationships among the Features

The correlation heatmap of the independent variables is presented in Fig. 8. It is evident that there are no significant correlations among the variables. No linear associations between the independent variables are identified, allowing us to move forward with our classifier.

B. Decision Trees are highly interpretable

Although Decision Trees exhibit high variance as classifiers, they boast a remarkable level of interpretability. The decision boundary at each split can be discerned, aiding in comprehending the model's classification criteria and emphasizing key features.

In a way, Decision Trees mimic human decision-making, providing clear decision paths and transparent fits. While they may be less potent in predictive tasks, their appeal lies in their interpretability. In Figs. 9-10, we visually represent Decision Trees with depths of 2 and 3. Notably, consistent decision boundaries are evident across both trees.

The paramount feature is the Safety Rating, succeeded by Passenger Capacity and Buying Price. The prominence of Safety Rating is self-evident. Passenger Capacity serves as a reliable indicator, as vehicles with higher capacities generally necessitate stricter safety standards. Finally, a higher Buying Price implies superior material quality and elevated safety standards.
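The shallow trees of Figs. 9-10 and the importance ordering above can be reproduced along these lines; a sketch assuming scikit-learn, where df is the cleaned dataframe from Section II and the ordinal encoding is an illustrative choice.

import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree

features = ["buying", "maint", "doors", "persons", "lug-boot", "safety"]
X_enc = OrdinalEncoder().fit_transform(df[features])  # categories -> integers
y = df["Target"]

for depth in (2, 3):
    tree = DecisionTreeClassifier(max_depth=depth, criterion="entropy",
                                  random_state=0).fit(X_enc, y)
    plot_tree(tree, feature_names=features, class_names=list(tree.classes_),
              filled=True)  # cf. Figs. 9-10
    plt.show()
    # Importances reflect the ordering discussed above (safety first).
    print(dict(zip(features, tree.feature_importances_)))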
Fig. 9. The Decision Tree of depth 2 trained on a part of the Car Evaluation Dataset is visualized. We find that the Safety Rating is the most important feature, followed by the Passenger Capacity. This is expected, since vehicles with higher capacities require higher safety standards.

Fig. 10. The Decision Tree of depth 3 trained on a part of the Car Evaluation Dataset is visualized. We find that the Safety Rating is the most important feature, followed by the Passenger Capacity and then the Buying Price. This is expected, since vehicles with higher capacities require higher safety standards, and a higher Buying Price implies better quality of materials and higher safety standards.
TABLE II
EVALUATION METRICS OF THE DECISION TREE CLASSIFIER. WE FIND THAT ACCURACY AND PRECISION ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.
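The spread of the estimates in Table II can be obtained by bootstrapping the fit-and-evaluate loop, as sketched below; the number of resamples and the split ratio are illustrative assumptions, and X_enc and y are as in the previous sketch.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X_tr, X_te, y_tr, y_te = train_test_split(X_enc, y, test_size=0.2,
                                          stratify=y, random_state=0)
accs, f1s = [], []
for b in range(200):  # 200 bootstrap resamples of the training set
    Xb, yb = resample(X_tr, y_tr, random_state=b)
    clf = DecisionTreeClassifier(criterion="entropy", random_state=b).fit(Xb, yb)
    pred = clf.predict(X_te)
    accs.append(accuracy_score(y_te, pred))
    f1s.append(f1_score(y_te, pred, average="weighted"))
print(f"Accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
print(f"F1 score: {np.mean(f1s):.3f} +/- {np.std(f1s):.3f}")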
[...] ideal classifier. We find that the Safety Rating is the best predictor of the Target class, followed by the Passenger Capacity and Buying Price. The Safety Rating being the most important feature is self-explanatory. The Passenger Capacity is a good indicator, since vehicles with higher capacities require higher safety standards. Finally, a higher Buying Price implies better quality of materials and higher safety standards.

The sensitivity of decision trees to hyperparameters may be a concern. However, our analysis indicates that only the Max. Depth of the tree and the Min. Impurity Decrease for splitting affect the performance appreciably. The variance of the metrics across the other hyperparameters is low across the board, showing that these two hyperparameters have the largest effect on tree construction.

The confusion matrix indicates that the decision tree is an excellent classifier even in the presence of skewed class data. With minimal data preprocessing required, high performance, and high interpretability, the decision tree is the "best" classifier for our task in some sense. However, the high variance of the constructed trees may be a concern. While it does not seem to be a problem in this dataset (observe the tight confidence intervals of our metrics), it can easily be remedied through ensemble methods.

VI. CONCLUSIONS AND FUTURE WORK

In conclusion, our study underscores the exceptional utility of decision trees as classifiers for our dataset. Their consistently high performance, coupled with their inherent interpretability, establishes them as the optimal choice for our classification task. The decision tree not only offers accurate predictions but also provides valuable insights into the classification process, making it well-suited for applications where transparency and trust are crucial.

Our feature importance analysis revealed that Safety Rating, Passenger Capacity, and Buying Price are pivotal predictors, shedding light on the relationships between these factors and vehicle safety. These findings hold significance for various stakeholders, including car manufacturers, regulators, and consumers, in making informed decisions related to vehicle safety.
Fig. 21. Variation of F1 Score (left) and Accuracy (right) for various values of "Min. Impurity Decrease". We find an initial drastic decrease in performance, after which this parameter does not affect the decision tree's classification performance.
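A sweep like the one summarized in Fig. 21 can be produced with cross-validated scoring over a grid of values, as in this sketch (assuming scikit-learn; the grid is an illustrative choice, and X_enc and y are as before).

import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

for mid_val in np.linspace(0.0, 0.05, 11):  # illustrative grid of values
    clf = DecisionTreeClassifier(criterion="entropy",
                                 min_impurity_decrease=mid_val,
                                 random_state=0)
    scores = cross_validate(clf, X_enc, y, cv=5,
                            scoring=("accuracy", "f1_weighted"))
    print(f"min_impurity_decrease={mid_val:.3f}  "
          f"acc={scores['test_accuracy'].mean():.3f}  "
          f"f1={scores['test_f1_weighted'].mean():.3f}")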