Experiment 8
Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer dataset to build the decision tree, and apply this knowledge to classify a new sample.
Introduction to Decision Trees
What is a Decision Tree?
A Decision Tree is a supervised machine learning algorithm used for classification and
regression tasks. It models decisions using a tree-like structure where:
Nodes represent decision points based on feature values.
Edges represent possible outcomes (branches).
Leaves represent the final decision or classification.
Decision trees work by recursively splitting data into subsets based on the most significant
feature, ensuring maximum information gain at each step.
Working of the Decision Tree Algorithm
1. Selecting the Best Feature for Splitting
At each step, the algorithm selects the feature that best separates the data. Common
methods for choosing the best feature include:
Gini Impurity
Gini = 1 - ∑ pᵢ²
Measures how often a randomly chosen element would be incorrectly classified.
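To make the formula concrete, here is a minimal sketch (a hypothetical helper, not part of the experiment's code) that computes Gini impurity from a list of class labels:

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))  # 0.5 -- maximally mixed two-class node
print(gini_impurity([0, 0, 0, 0]))  # 0.0 -- pure node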
Entropy (Information Gain)
Entropy = -∑ p(x) log₂ p(x)
Measures the uncertainty in a dataset and selects splits that maximize information gain.
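A matching sketch for entropy (again a hypothetical helper; log base 2 gives the result in bits):

import numpy as np

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 1, 1]))  # 1.0 bit -- maximum uncertainty for two balanced classes
print(entropy([0, 0, 0, 1]))  # ~0.81 bits -- less uncertain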
Chi-Square Test
Evaluates the statistical significance of the feature split.
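As an illustration, scipy's chi2_contingency can test whether a candidate split separates the classes better than chance (the counts below are made up, not taken from the breast cancer data):

from scipy.stats import chi2_contingency

# Contingency table for a candidate split: rows = branches, columns = class counts
table = [[30, 10],   # left branch:  30 benign, 10 malignant
         [5,  25]]   # right branch:  5 benign, 25 malignant
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # a small p-value suggests a significant split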
2. Splitting the Data
The dataset is divided into subsets based on the selected feature. The process continues recursively until a stopping condition is met:
The node is pure (all samples belong to one class).
The tree reaches a predefined maximum depth, or other limits apply (e.g., minimum samples per split).
3. Making Predictions
For a new sample, traverse the tree from the root to a leaf node; the leaf node contains the predicted class label.
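The traversal can be sketched with a hand-built tree of nested dictionaries (an illustrative toy structure, not sklearn's internal representation):

# A hypothetical tree: internal nodes hold a feature index and threshold,
# leaves hold a class label.
tree = {
    'feature': 0, 'threshold': 15.0,
    'left':  {'leaf': 'Benign'},
    'right': {'feature': 1, 'threshold': 20.0,
              'left': {'leaf': 'Benign'}, 'right': {'leaf': 'Malignant'}},
}

def predict(node, sample):
    # Walk from the root: go left if the feature value is <= threshold, else right.
    while 'leaf' not in node:
        branch = 'left' if sample[node['feature']] <= node['threshold'] else 'right'
        node = node[branch]
    return node['leaf']

print(predict(tree, [17.5, 22.0]))  # Malignant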
Advantages of Decision Trees
✔ Easy to interpret – Mimics human decision-making.
✔ Handles both numerical & categorical data.
✔ Requires little data preprocessing – No need for feature scaling.
✔ Can tolerate missing values – Some implementations (e.g., C4.5) handle them natively.
Challenges of Decision Trees
❌ Overfitting – Deep trees may memorize noise instead of patterns.
❌ Bias towards dominant features – Features with more categories can lead to
biased splits.
❌ Instability – Small data variations can lead to different trees.
Optimizing Decision Trees
1. Pruning
Pre-Pruning: Stop the tree early using conditions (e.g., min samples per split).
Post-Pruning: Remove unnecessary branches after the tree is built.
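A sketch of both styles with scikit-learn, using its built-in copy of the breast cancer dataset; the ccp_alpha value here is picked arbitrarily for illustration and would normally be tuned by cross-validation:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: constrain growth up front
pre = DecisionTreeClassifier(min_samples_split=20, min_samples_leaf=10, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit with a chosen alpha
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=42)
post.fit(X_train, y_train)

print("pre-pruned test accuracy: ", pre.score(X_test, y_test))
print("post-pruned test accuracy:", post.score(X_test, y_test))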
2. Setting Tree Depth
Limiting maximum depth prevents overfitting.
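For example, sweeping max_depth with cross-validation shows how accuracy changes as the tree is allowed to grow (a sketch on scikit-learn's built-in breast cancer data):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for depth in [2, 4, 8, None]:  # None = grow without a depth limit
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    print(f"max_depth={depth}: mean CV accuracy = {cross_val_score(clf, X, y, cv=5).mean():.3f}")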
3. Using Ensemble Methods
Random Forest: Combines multiple trees for better generalization.
Gradient Boosting: Sequentially improves predictions.
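Both ensembles are available in scikit-learn and can be compared directly against a single tree (again a sketch on the built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "Single tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, clf in models.items():
    print(f"{name}: mean CV accuracy = {cross_val_score(clf, X, y, cv=5).mean():.3f}")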
Applications of Decision Trees
Medical Diagnosis – Classifying diseases based on symptoms.
Fraud Detection – Identifying fraudulent transactions.
Customer Segmentation – Categorizing users based on behavior.
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
import warnings
warnings.filterwarnings('ignore')
# Load the dataset (placeholder path; point this at your local copy of the
# Breast Cancer Wisconsin CSV)
data = pd.read_csv("breast_cancer.csv")
data.head()              # preview the first rows
data.shape               # (number of rows, number of columns)
data.info()              # column names, types, and non-null counts
data.diagnosis.unique()  # target classes: 'M' (malignant), 'B' (benign)
data.isnull().sum()      # count missing values per column
# Drop the non-predictive 'id' column (also drop 'Unnamed: 32' if your CSV includes it)
df = data.drop(['id'], axis=1)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})  # Malignant: 1, Benign: 0
# Model Building
X = df.drop('diagnosis', axis=1)  # feature matrix: all 30 measurement columns
y = df['diagnosis']               # target vector
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)
# Fit the decision tree model (criterion can be 'gini' or 'entropy')
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X_train, y_train)
model
y_pred = model.predict(X_test)  # predict labels for the test set
y_pred
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred) * 100
classification_rep = classification_report(y_test, y_pred)
# Print the results
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)
# Classify a hypothetical new sample (30 feature values in the same order as X's columns);
# wrapping it in a DataFrame keeps feature names consistent with training
new = pd.DataFrame([[12.5, 19.2, 80.0, 500.0, 0.085, 0.1, 0.05, 0.02, 0.17, 0.06,
                     0.4, 1.0, 2.5, 40.0, 0.006, 0.02, 0.03, 0.01, 0.02, 0.003,
                     16.0, 25.0, 105.0, 900.0, 0.13, 0.25, 0.28, 0.12, 0.29, 0.08]],
                   columns=X.columns)
y_pred = model.predict(new)
# Output the prediction (0 = Benign, 1 = Malignant)
if y_pred[0] == 0:
    print("Prediction: Benign")
else:
    print("Prediction: Malignant")
# Visualize the Decision Tree (optional)
plt.figure(figsize=(12, 8))
plot_tree(model, filled=True, feature_names=X.columns, class_names=['Benign', 'Malignant'])
plt.show()
# Export the tree to DOT format
dot_data = export_graphviz(model, out_file=None,
feature_names=X_train.columns,
rounded=True, proportion=False,
precision=2, filled=True)
# Convert DOT data to a graph
graph = pydotplus.graph_from_dot_data(dot_data)
# Display the graph
Image(graph.create_png())