DELHI TECHNOLOGICAL UNIVERSITY
(Formerly Delhi College of Engineering)
Shahabad Daulatpur, Bawana Road, Delhi-110042
DEPARTMENT OF SOFTWARE ENGINEERING
SE-206
MACHINE LEARNING
LAB FILE
SUBMITTED BY: Arjun Yadav (23/SE/032)
SUBMITTED TO: Ms. Jyoti, Ph.D. Scholar
INDEX
S.NO EXPERIMENT DATE SIGN
1. Exploring and demonstrating Python. 10/1/25
2. Perform data preprocessing such as outlier detection, handling missing values, analyzing redundancy, and normalization on different datasets. 10/1/25
3. Write a program to exhibit the working of the decision tree based ID3 algorithm. With the help of an appropriate data set, build the decision tree and classify a new sample. 17/1/25
4. Write a program to demonstrate the working of the decision tree based C4.5 algorithm. With the help of the data set used in the above experiment, build the decision tree and classify a new sample. 31/1/25
5. Write a program to demonstrate the working of the decision tree based CART algorithm. Build the decision tree and classify a new sample using a suitable dataset. Compare the performance of CART with that of ID3 and C4.5 in terms of accuracy, recall, precision, and sensitivity. 7/2/25
6. Build an Artificial Neural Network by implementing the Backpropagation algorithm and test it using appropriate data sets. 14/2/25
7. Write a program to implement the Naïve Bayesian classifier for an appropriate dataset and compute the performance measures of the model. 14/2/25
8. Write a program to implement the k-Nearest Neighbour algorithm to classify any dataset of your choice. Print both correct and wrong predictions. 28/3/25
9. Apply the k-Means clustering algorithm on suitable datasets and comment on the quality of clustering. 4/4/25
10. Write a program to implement Linear Regression using any appropriate dataset. 11/4/25
EXPERIMENT – 1
AIM
To explore basic Python programming concepts and demonstrate their use with the Bank Marketing Dataset.
DATASET
Bank Marketing Dataset from the UCI Machine Learning Repository.
THEORY
Python is a versatile language widely used in data science and machine learning. In this
experiment, we demonstrate:
Array handling using NumPy
Data manipulation using pandas
Data visualization using matplotlib
Loading and inspecting the Bank Marketing dataset
CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the Bank dataset using the correct separator
# (this copy of the file is comma-separated; some distributions of the UCI file use semicolons)
df = pd.read_csv("bank.csv", sep=",")
# Display first few rows
print("Bank Dataset (First 5 rows):")
print(df.head())
# Sample NumPy array operations
sample_balances = np.array([2343, 45, 1270])
print("\nSample Balance Array:", sample_balances)
# Plotting Age vs Balance
plt.figure(figsize=(6, 4))
plt.scatter(df['age'], df['balance'], alpha=0.5)
plt.title('Age vs Balance')
plt.xlabel('Age')
plt.ylabel('Balance')
plt.grid(True)
plt.show()
OUTPUT
EXPERIMENT – 2
AIM
To perform data preprocessing operations such as handling missing values, detecting outliers,
and applying normalization on the Bank Marketing Dataset.
THEORY
Data preprocessing is a vital step in machine learning. It ensures the data is clean, consistent,
and suitable for training models. Key steps include:
Handling Missing Values: Fill in missing values using mean, median, or mode.
Outlier Detection: Identify data points significantly different from others (e.g., using
Z-score).
Normalization: Scale data between 0 and 1 or to a standard range so that all
features contribute on a comparable scale.
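As a small worked illustration of the Z-score rule (the numbers below are made up, not taken from the dataset):
import numpy as np

# Toy Z-score outlier detection: the 9000 stands far from the other balances
balances = np.array([2343, 45, 1270, 890, 1500, 9000])
z = (balances - balances.mean()) / balances.std()
print("Z-scores:", np.round(z, 2))
# With only six points a looser threshold is needed; |z| > 3 is the usual rule on full datasets
print("Outliers:", balances[np.abs(z) > 2])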
CODE
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv("bank.csv", sep=",")
# Display initial rows
print("Initial Data Overview:")
print(df.head())
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Handle missing values — in this dataset, 'unknown' is often used as a placeholder
# Replace 'unknown' with np.nan for easier handling
df.replace("unknown", np.nan, inplace=True)
# Count missing values again after replacement
print("\nMissing (NaN) values after conversion:")
print(df.isnull().sum())
# Fill NaN values in categorical columns with mode
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])
# Fill NaN in numeric columns with mean (if any)
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].mean())
# Detect outliers using Z-score for numerical columns
numeric_cols = df.select_dtypes(include='number').columns
z_scores = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
outliers = (np.abs(z_scores) > 3).any(axis=1)
print("\nNumber of outliers detected:", outliers.sum())
# Normalize numeric features
scaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[numeric_cols] = scaler.fit_transform(df[numeric_cols])
print("\nNormalized Data Overview:")
print(df_normalized[numeric_cols].head())
# Correlation heatmap
plt.figure(figsize=(12, 8))
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of Bank Dataset")
plt.show()
OUTPUT
EXPERIMENT – 3
AIM: Write a program to exhibit the working of the decision tree based ID3 algorithm. With
the help of an appropriate data set, build the decision tree and classify a new sample.
THEORY
The ID3 algorithm builds a decision tree by selecting the attribute that yields the highest
information gain at each node. It uses entropy as a metric to measure the impurity of a
dataset. ID3 works best for categorical features, but it can also be applied to continuous
datasets after discretization or by using libraries that handle it internally.
Entropy: Measures the amount of disorder or impurity in the dataset.
Information Gain: The reduction in entropy from a dataset split.
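As a small illustration of these two measures, the sketch below computes entropy and the information gain of one binary split by hand on a toy label array (the data and helper functions are illustrative, not part of the experiment):
import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, split_mask):
    # Gain = H(S) minus the weighted average entropy of the two subsets
    left, right = labels[split_mask], labels[~split_mask]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))

# Toy example: 'yes'/'no' outcomes split on an arbitrary boolean feature
y_toy = np.array(['yes', 'yes', 'no', 'no', 'yes', 'no'])
mask = np.array([True, True, True, False, False, False])
print("Entropy:", entropy(y_toy))                      # 1.0 for a 50/50 class mix
print("Information gain:", information_gain(y_toy, mask))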
CODE
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv("bank.csv", sep=",")
# Treat 'unknown' as missing, then fill categorical columns with their mode
df.replace("unknown", pd.NA, inplace=True)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])
# Encode categorical variables
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
# Features and target
X = df.drop('deposit', axis=1)
y = df['deposit']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ID3 using entropy
id3_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
id3_tree.fit(X_train, y_train)
# Predict a new sample (random row from test set for demonstration)
sample = X_test.iloc[[0]]
prediction = id3_tree.predict(sample)
# Tree rules and performance
rules = export_text(id3_tree, feature_names=list(X.columns))
print("Decision Tree Rules:\n", rules)
print("\nPredicted class for new sample:", prediction[0])
# Visualization
plt.figure(figsize=(18, 10))
plot_tree(id3_tree, feature_names=X.columns, class_names=['No', 'Yes'], filled=True)
plt.title("ID3 Decision Tree (Bank Dataset)")
plt.show()
OUTPUT
EXPERIMENT – 4
AIM: Write a program to demonstrate the working of the decision tree based C4.5
algorithm. With the help of the data set used in the above experiment, build the decision tree and
classify a new sample.
THEORY
C4.5 is an improved version of ID3. While both use entropy and information gain, C4.5
overcomes several limitations of ID3:
Uses Gain Ratio instead of pure Information Gain to avoid bias towards features with
many levels.
Handles continuous attributes more efficiently.
Supports pruning to reduce overfitting.
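scikit-learn does not implement gain ratio directly, so a hand computation on toy arrays (illustrative values of our own) can show how it normalizes information gain by the split information:
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(labels, feature_values):
    # Information gain of splitting on every distinct value of the feature,
    # divided by the split information, which penalizes many-valued features
    total_h = entropy(labels)
    n = len(labels)
    gain, split_info = total_h, 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        w = len(subset) / n
        gain -= w * entropy(subset)
        split_info -= w * np.log2(w)
    return gain / split_info if split_info > 0 else 0.0

y_toy = np.array(['yes', 'yes', 'no', 'no', 'yes', 'no'])
f_toy = np.array(['a', 'a', 'b', 'b', 'c', 'c'])
print("Gain ratio:", gain_ratio(y_toy, f_toy))   # about 0.42 for this toy split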
CODE
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv("bank.csv", sep=",")
# Handle unknowns
df.replace("unknown", pd.NA, inplace=True)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])
# Encode categorical variables
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
# Split into features and label
X = df.drop('deposit', axis=1)
y = df['deposit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Decision Tree (C4.5 approximation: sklearn has no gain ratio, so the entropy criterion is the closest match)
c45_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
c45_tree.fit(X_train, y_train)
# Predict a new sample
sample = X_test.iloc[[0]]
prediction = c45_tree.predict(sample)
# Export rules
print(" FULL DECISION TREE RULES (C4.5 / Gini):\n")
tree_rules = export_text(c45_tree, feature_names=list(X.columns))
print(tree_rules)
print("\n New Sample Prediction:", prediction[0])
# Plot the tree
plt.figure(figsize=(20, 10))
plot_tree(c45_tree, feature_names=X.columns, class_names=['No', 'Yes'], filled=True)
plt.title("C4.5 Decision Tree (Bank Dataset)")
plt.show()
OUTPUT
EXPERIMENT – 5
AIM: To implement a Decision Tree using the CART (Classification and Regression Trees)
algorithm on the Bank Marketing Dataset, classify a new sample, and compare the
performance of CART with ID3 and C4.5 in terms of accuracy, precision, recall, and
sensitivity.
THEORY: Decision Trees are powerful supervised learning algorithms used for both
classification and regression tasks. They work by splitting the dataset into subsets based on
the value of input features. The goal is to create pure subsets where most data points belong
to a single class.
CART (Classification and Regression Trees) is a binary decision tree algorithm developed by
Breiman. It constructs decision trees by using the Gini Index as a measure of impurity to
decide the best feature and threshold to split the data. CART produces only binary trees (i.e.,
every node has 0 or 2 children).
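As a quick illustration of the Gini criterion, the sketch below (toy labels, not from the dataset) computes the impurity of a node and the weighted impurity of a binary split, which CART minimizes:
import numpy as np

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2); 0 for a pure node, 0.5 for a 50/50 binary node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def split_gini(labels, mask):
    # Weighted Gini impurity of the two children of a binary split
    left, right = labels[mask], labels[~mask]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

y_toy = np.array([1, 1, 0, 0, 1, 0])
mask = np.array([True, True, False, False, True, False])
print("Node Gini :", gini(y_toy))             # 0.5
print("Split Gini:", split_gini(y_toy, mask))  # 0.0, a perfect split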
ID3 (Iterative Dichotomiser 3), developed by Quinlan, uses Information Gain (based on
Entropy) to split nodes. It tends to overfit if not properly pruned and works only for
categorical features.
C4.5, an improvement over ID3, uses Gain Ratio (a normalized version of information gain)
to overcome the bias toward attributes with many values. It can handle both discrete and
continuous attributes and supports pruning.
Performance Metrics:
To evaluate and compare the algorithms, we use the following metrics:
Accuracy: Percentage of correctly classified instances.
Precision: Ratio of correctly predicted positive observations to total predicted
positives.
Recall (Sensitivity): Ratio of correctly predicted positives to all actual positives.
F1-score: Harmonic mean of precision and recall.
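All four metrics follow directly from the confusion matrix; the short sketch below uses hypothetical counts to show the arithmetic that classification_report performs:
# Hypothetical confusion-matrix counts for the positive class ('yes')
tp, fp, fn, tn = 620, 180, 210, 1220

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # identical to sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")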
CODE:
# ID3 algorithm (entropy criterion)
# Note: X_train, X_test, y_train, y_test come from the preprocessing and split in Experiment 3
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
id3_model = DecisionTreeClassifier(criterion='entropy', random_state=42)
id3_model.fit(X_train, y_train)
y_pred_id3 = id3_model.predict(X_test)
print("ID3 Decision Tree Performance:")
print(classification_report(y_test, y_pred_id3))
print("Accuracy:", accuracy_score(y_test, y_pred_id3))
# C4.5 approximation: entropy criterion with max_depth pruning (sklearn has no gain ratio)
c45_model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
c45_model.fit(X_train, y_train)
y_pred_c45 = c45_model.predict(X_test)
print("C4.5 Decision Tree Performance:")
print(classification_report(y_test, y_pred_c45))
print("Accuracy:", accuracy_score(y_test, y_pred_c45))
# CART algorithm (Gini criterion)
cart_model = DecisionTreeClassifier(criterion='gini', random_state=42)
cart_model.fit(X_train, y_train)
y_pred_cart = cart_model.predict(X_test)
print("CART Decision Tree Performance:")
print(classification_report(y_test, y_pred_cart))
print("Accuracy:", accuracy_score(y_test, y_pred_cart))
# Comparison between the ID3, C4.5, and CART models
from sklearn.metrics import precision_score, recall_score

def evaluate(name, y_true, y_pred):
    print(f"\n{name} Metrics:")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    # Sensitivity is another name for recall
    print(f"Sensitivity: {recall_score(y_true, y_pred):.4f}")
evaluate("ID3", y_test, y_pred_id3)
evaluate("C4.5", y_test, y_pred_c45)
evaluate("CART", y_test, y_pred_cart)
OUTPUT:
EXPERIMENT – 6
AIM: To build and evaluate an Artificial Neural Network (ANN) using the Backpropagation
algorithm on the Bank Marketing Dataset.
THEORY:
An Artificial Neural Network is a computational model inspired by the structure of the human
brain. It consists of interconnected nodes (neurons) organized in layers—input, hidden, and
output layers. The Backpropagation algorithm is used for training neural networks by
minimizing the error between actual and predicted output. It adjusts weights based on the
error gradient through gradient descent. ANN is capable of capturing complex, non-linear
relationships in data.
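Keras hides the backward pass, so a minimal NumPy sketch of backpropagation on toy data (the network size, learning rate, and data are illustrative assumptions, not the experiment's model) shows how weights are updated from the error gradient:
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((100, 4))                                   # toy inputs
y = (X.sum(axis=1) > 2).astype(float).reshape(-1, 1)       # toy binary target

# One hidden layer with sigmoid activations
W1, b1 = rng.normal(0, 0.5, (4, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for epoch in range(2000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of mean squared error through each layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

print("Training accuracy:", ((out > 0.5) == y).mean())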
CODE:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score

# Build a small feed-forward network; Keras trains it with backpropagation
# Note: X_train, X_test, y_train, y_test come from the preprocessing and split in Experiment 3
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=30, batch_size=32, validation_split=0.2,
verbose=1)
y_pred_ann_prob = model.predict(X_test)
y_pred_ann = (y_pred_ann_prob > 0.5).astype(int)
acc = accuracy_score(y_test, y_pred_ann)
print(f"accuracy of ann model: {acc * 100:.2f}%")
OUTPUT:
EXPERIMENT – 7
AIM: To implement the Naïve Bayes Classifier on the Bank Marketing Dataset and evaluate
the model using performance metrics.
THEORY:
Naïve Bayes is a probabilistic classification algorithm based on Bayes’ Theorem. It assumes
independence among features and calculates the probability of each class given the input
features. Despite its simplicity, Naïve Bayes performs surprisingly well for a wide range of
classification tasks, especially when dealing with high-dimensional data.
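A tiny hand computation of Bayes' theorem (with made-up priors and likelihoods) illustrates the posterior the classifier evaluates for each class:
# P(class | features) is proportional to P(class) * product of P(feature_i | class),
# under the naive independence assumption. Hypothetical numbers:
p_yes, p_no = 0.12, 0.88          # class priors
lik_yes = 0.40 * 0.25             # P(f1|yes) * P(f2|yes)
lik_no = 0.10 * 0.30              # P(f1|no)  * P(f2|no)

post_yes = p_yes * lik_yes
post_no = p_no * lik_no
total = post_yes + post_no        # normalizing constant P(features)
print("P(yes | features) =", post_yes / total)
print("P(no  | features) =", post_no / total)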
CODE:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Note: X_train, X_test, y_train, y_test come from the preprocessing and split in Experiment 3
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)
print("Naïve Bayes Classification Report:")
print(classification_report(y_test, y_pred_nb))
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
OUTPUT:
EXPERIMENT – 8
AIM: To implement the k-Nearest Neighbour (k-NN) algorithm to classify data from the
Bank Marketing Dataset and analyse both correct and incorrect predictions.
THEORY:
k-NN is a non-parametric, instance-based learning algorithm that classifies data based on the
majority class among its k nearest neighbours. The proximity is typically measured using
distance metrics like Euclidean distance. It is simple yet effective for many classification
problems but may be computationally intensive for large datasets.
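The core of k-NN is just a distance computation and a majority vote; a minimal NumPy sketch on toy points (illustrative data, k = 3) is shown below:
import numpy as np
from collections import Counter

# Toy training points and labels (illustrative only)
X_toy = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
query, k = np.array([2, 2]), 3

# Euclidean distances from the query to every training point
dists = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(dists)[:k]               # indices of the k closest points
vote = Counter(y_toy[nearest]).most_common(1)[0][0]
print("Predicted class:", vote)               # 0: the query sits in the first cluster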
CODE:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

k = 5
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)
print("k-NN Classification Report:")
print(classification_report(y_test, y_pred_knn))
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
correct = np.where(y_pred_knn == y_test)[0]
incorrect = np.where(y_pred_knn != y_test)[0]
print(f"Correct Predictions: {len(correct)}")
print(f"Wrong Predictions: {len(incorrect)}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
OUTPUT:
EXPERIMENT – 9
AIM: To apply the k-Means clustering algorithm on the Bank Marketing Dataset and
evaluate the quality of the clustering using metrics like silhouette score.
THEORY:
k-Means is an unsupervised learning algorithm used to group data into k distinct clusters. It
works by initializing k centroids, assigning each data point to the nearest centroid, and then
updating centroids based on the mean of assigned points. This process continues iteratively
until convergence. The quality of clustering can be evaluated using metrics like silhouette
score or visualizations using PCA.
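The assign-and-update loop can be written out in plain NumPy; the sketch below (toy two-dimensional blobs, k = 2, all data made up) shows the iteration that KMeans runs until the centroids stop moving:
import numpy as np

rng = np.random.default_rng(0)
# Two toy blobs around (0, 0) and (5, 5)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids = X_toy[rng.choice(len(X_toy), 2, replace=False)]   # random init

for _ in range(20):
    # Assignment step: each point goes to its nearest centroid
    d = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([X_toy[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Final centroids:\n", centroids)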
CODE:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Note: X is the encoded feature matrix from Experiment 3 (all columns except 'deposit')
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)
sil_score = silhouette_score(X_scaled, clusters)
print(f"Silhouette Score: {sil_score:.4f}")
df["Cluster"] = clusters
print(df.groupby("Cluster")["y"].value_counts())
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', alpha=0.5)
plt.title("k-Means Clustering (2 Clusters)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label="Cluster")
plt.show()
OUTPUT:
EXPERIMENT – 10
AIM: To implement Linear Regression on the Bank Marketing Dataset and evaluate the
performance using regression metrics.
THEORY:
Linear Regression is a supervised learning algorithm used to model the relationship between
a dependent variable and one or more independent variables. It fits a linear equation to
observed data by minimizing the sum of squared errors. It is widely used for predictive
analysis, though it assumes a linear relationship between variables and is sensitive to outliers
and multicollinearity.
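The least-squares fit also has a closed form; the NumPy sketch below (toy data with a known slope and intercept) recovers the coefficients from the normal equation, which is essentially what LinearRegression solves:
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((200, 1)) * 10
y = 3.5 * x[:, 0] + 2.0 + rng.normal(0, 1, 200)    # true slope 3.5, intercept 2.0

# Normal equation: beta = (X^T X)^(-1) X^T y, with a column of ones for the intercept
X_design = np.hstack([np.ones((len(x), 1)), x])
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print("Intercept:", beta[0], "Slope:", beta[1])    # close to 2.0 and 3.5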
CODE:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Predict account balance from the remaining (already label-encoded) features
y_reg = df['balance']
X_reg = df.drop(columns=['balance'])
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg,
test_size=0.3, random_state=42)
lr_model = LinearRegression()
lr_model.fit(X_train_reg, y_train_reg)
y_pred_lr = lr_model.predict(X_test_reg)
print("Linear Regression Performance:")
print(f"MAE: {mean_absolute_error(y_test_reg, y_pred_lr):.2f}")
print(f"MSE: {mean_squared_error(y_test_reg, y_pred_lr):.2f}")
print(f"R² Score: {r2_score(y_test_reg, y_pred_lr):.4f}")
plt.scatter(y_test_reg, y_pred_lr, alpha=0.5)
plt.xlabel("Actual Balance")
plt.ylabel("Predicted Balance")
plt.title("Linear Regression: Actual vs Predicted")
plt.grid()
plt.show()
OUTPUT: