Machine Learning File

DELHI TECHNOLOGICAL UNIVERSITY

(Formerly Delhi College of Engineering)

Shahabad Daulatpur, Bawana Road, Delhi-110042

DEPARTMENT OF SOFTWARE ENGINEERING

SE-206

MACHINE LEARNING
LAB FILE

SUBMITTED BY:
Arjun Yadav
23/SE/032

SUBMITTED TO:
Ms. Jyoti
Ph.D. scholar

INDEX

S.NO | EXPERIMENT | DATE | SIGN
1 | Exploring and demonstrating Python. | 10/1/25 |
2 | Perform data preprocessing like outlier detection, handling missing values, analyzing redundancy and normalization on different datasets. | 10/1/25 |
3 | Write a program to exhibit the working of the decision tree based ID3 algorithm. With the help of an appropriate data set, build the decision tree and classify a new sample. | 17/1/25 |
4 | Write a program to demonstrate the working of the decision tree based C4.5 algorithm. With the help of the data set used in the above experiment, build the decision tree and classify a new sample. | 31/1/25 |
5 | Write a program to demonstrate the working of the decision tree based CART algorithm. Build the decision tree and classify a new sample using a suitable dataset. Compare the performance of ID3, C4.5, and CART in terms of accuracy, recall, precision and sensitivity. | 7/2/25 |
6 | Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same using appropriate data sets. | 14/2/25 |
7 | Write a program to implement the Naïve Bayesian classifier for an appropriate dataset and compute the performance measures of the model. | 14/2/25 |
8 | Write a program to implement the k-Nearest Neighbour algorithm to classify any dataset of your choice. Print both correct and wrong predictions. | 28/3/25 |
9 | Apply the k-Means clustering algorithm on suitable datasets and comment on the quality of clustering. | 4/4/25 |
10 | Write a program to implement Linear Regression using any appropriate dataset. | 11/4/25 |

EXPERIMENT – 1
AIM

To explore basic Python programming concepts and demonstrate their use with the Bank
Marketing Dataset.

DATASET

Bank Marketing Dataset from the UCI Machine Learning Repository.

THEORY

Python is a versatile language widely used in data science and machine learning. In this
experiment, we demonstrate:

• Array handling using NumPy
• Data manipulation using pandas
• Data visualization using matplotlib
• Loading and inspecting the Bank Marketing dataset

CODE

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the Bank dataset (this copy of bank.csv is comma-separated, not semicolon-separated)
df = pd.read_csv("bank.csv", sep=",")

# Display first few rows
print("Bank Dataset (First 5 rows):")
print(df.head())

# Sample NumPy array operations
sample_balances = np.array([2343, 45, 1270])
print("\nSample Balance Array:", sample_balances)
print("Mean of sample balances:", sample_balances.mean())

# Plotting Age vs Balance
plt.figure(figsize=(6, 4))
plt.scatter(df['age'], df['balance'], alpha=0.5)
plt.title('Age vs Balance')
plt.xlabel('Age')
plt.ylabel('Balance')
plt.grid(True)
plt.show()

OUTPUT
EXPERIMENT – 2
AIM

To perform data preprocessing operations such as handling missing values, detecting outliers,
analyzing redundancy, and applying normalization on the Bank Marketing Dataset.

THEORY

Data preprocessing is a vital step in machine learning. It ensures the data is clean, consistent,
and suitable for training models. Key steps include:

• Handling Missing Values: Fill in missing values using mean, median, or mode.
• Outlier Detection: Identify data points significantly different from others (e.g., using
Z-score).
• Normalization: Scale data between 0 and 1 or to a standard range to ensure all
features are on a comparable scale.
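
The Z-score and min-max steps listed above can be sketched by hand before relying on library calls. The block below is a minimal illustration using only the standard library. The balance values and the helper names `z_scores` and `min_max_scale` are hypothetical and are not taken from bank.csv or from the lab code.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (x - mean) / sample standard deviation for each value."""
    mu = mean(values)
    sigma = stdev(values)  # sample std (ddof=1), the same default pandas uses
    return [(x - mu) / sigma for x in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical balances: 19 ordinary values and one extreme value
balances = [100.0] * 19 + [10_000.0]

z = z_scores(balances)
# Flag values whose absolute Z-score exceeds 3 (the threshold used in this experiment)
outliers = [x for x, score in zip(balances, z) if abs(score) > 3]
scaled = min_max_scale(balances)

print("Outliers:", outliers)
print("Scaled min/max:", min(scaled), max(scaled))
```

Note that a single extreme value also compresses every other value towards 0 after min-max scaling, which is one reason outliers are usually inspected before normalization.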

CODE

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("bank.csv", sep=",")

# Display initial rows
print("Initial Data Overview:")
print(df.head())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# In this dataset 'unknown' is often used as a placeholder for missing values,
# so replace it with np.nan for easier handling
df.replace("unknown", np.nan, inplace=True)

# Count missing values again after replacement
print("\nMissing (NaN) values after conversion:")
print(df.isnull().sum())

# Fill NaN values in categorical columns with the mode
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Fill NaN in numeric columns with the mean (if any)
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].mean())

# Detect outliers using Z-score for numerical columns
numeric_cols = df.select_dtypes(include='number').columns
z_scores = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
# A row is an outlier if any numeric feature lies more than 3 standard deviations from its mean
outliers = (np.abs(z_scores) > 3).any(axis=1)
print("\nNumber of outliers detected:", outliers.sum())

# Normalize numeric features to the range [0, 1]
scaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[numeric_cols] = scaler.fit_transform(df[numeric_cols])
print("\nNormalized Data Overview:")
print(df_normalized[numeric_cols].head())

# Correlation heatmap (highly correlated features indicate redundancy)
plt.figure(figsize=(12, 8))
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of Bank Dataset")
plt.show()

OUTPUT
EXPERIMENT – 3

AIM: Write a program to exhibit the working of the decision tree based ID3 algorithm. With
the help of an appropriate data set, build the decision tree and classify a new sample.

THEORY

The ID3 algorithm builds a decision tree by selecting the attribute that yields the highest
information gain at each node. It uses entropy as a metric to measure the impurity of a
dataset. ID3 works best for categorical features, but it can also be applied to continuous
datasets after discretization or by using libraries that handle it internally.

• Entropy: Measures the amount of disorder or impurity in the dataset.
• Information Gain: The reduction in entropy from a dataset split.
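
The two quantities above can be computed directly from class counts. The following is a minimal sketch using only the standard library. The four-row toy table and the function names `entropy` and `information_gain` are illustrative assumptions, not part of the lab code or of bank.csv.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 'housing' separates the classes perfectly, 'loan' not at all
rows = [
    {"housing": "yes", "loan": "no"},
    {"housing": "yes", "loan": "yes"},
    {"housing": "no", "loan": "no"},
    {"housing": "no", "loan": "yes"},
]
labels = ["no", "no", "yes", "yes"]

for attribute in ("housing", "loan"):
    print(attribute, information_gain(rows, labels, attribute))
```

ID3 would choose `housing` as the root here, because it has the larger information gain.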

CODE

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("bank.csv", sep=",")

# Replace 'unknown' with the mode for categorical columns
df.replace("unknown", pd.NA, inplace=True)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode categorical variables as integers
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Features and target
X = df.drop('deposit', axis=1)
y = df['deposit']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ID3-style tree: scikit-learn builds binary CART trees, so criterion='entropy'
# reproduces ID3's information-gain splitting rather than the exact algorithm
id3_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
id3_tree.fit(X_train, y_train)

# Predict a new sample (first row of the test set, for demonstration)
sample = X_test.iloc[[0]]
prediction = id3_tree.predict(sample)

# Tree rules and prediction
rules = export_text(id3_tree, feature_names=list(X.columns))
print("Decision Tree Rules:\n", rules)
print("\nPredicted class for new sample:", prediction[0])

# Visualization
plt.figure(figsize=(18, 10))
plot_tree(id3_tree, feature_names=list(X.columns), class_names=['No', 'Yes'], filled=True)
plt.title("ID3 Decision Tree (Bank Dataset)")
plt.show()

OUTPUT
EXPERIMENT – 4
AIM: Write a program to demonstrate the working of the decision tree based C4.5
algorithm. With the help of the data set used in the above experiment, build the decision tree and
classify a new sample.

THEORY

C4.5 is an improved version of ID3. While both use entropy and information gain, C4.5
overcomes several limitations of ID3:

• Uses Gain Ratio instead of pure Information Gain to avoid bias towards features with
many levels.
• Handles continuous attributes directly by choosing threshold splits, whereas ID3
requires them to be discretized first.
• Supports pruning to reduce overfitting.
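
To make the first point concrete, the sketch below computes Gain Ratio as information gain divided by split information (the entropy of the attribute's own value distribution). It uses only the standard library. The toy columns `customer_id` and `housing` are hypothetical and chosen to show how an attribute with many levels is penalized.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(attr_values, labels):
    """Information gain of the split divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attr_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attr_values)
    return gain / split_info if split_info > 0 else 0.0


labels = ["no", "no", "yes", "yes"]
customer_id = ["a", "b", "c", "d"]      # hypothetical: unique value per row
housing = ["yes", "yes", "no", "no"]    # hypothetical: two levels

print("customer_id:", gain_ratio(customer_id, labels))
print("housing    :", gain_ratio(housing, labels))
```

Both attributes have the same information gain on this data, but the identifier-like column gets a lower gain ratio because its split information is larger.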

CODE

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("bank.csv", sep=",")

# Handle unknowns: treat them as missing and fill with the column mode
df.replace("unknown", pd.NA, inplace=True)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode categorical variables as integers
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Split into features and label
X = df.drop('deposit', axis=1)
y = df['deposit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree (C4.5 approximation): scikit-learn does not implement Gain Ratio,
# so an entropy-based tree with a depth limit (standing in for pruning) is used here,
# matching the C4.5 model in Experiment 5. The Gini criterion belongs to CART, not C4.5.
c45_tree = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
c45_tree.fit(X_train, y_train)

# Predict a new sample (first row of the test set)
sample = X_test.iloc[[0]]
prediction = c45_tree.predict(sample)

# Export rules
print("FULL DECISION TREE RULES (C4.5 approximation / entropy):\n")
tree_rules = export_text(c45_tree, feature_names=list(X.columns))
print(tree_rules)
print("\nNew Sample Prediction:", prediction[0])

# Plot the tree
plt.figure(figsize=(20, 10))
plot_tree(c45_tree, feature_names=list(X.columns), class_names=['No', 'Yes'], filled=True)
plt.title("C4.5 Decision Tree (Bank Dataset)")
plt.show()

OUTPUT
EXPERIMENT – 5

AIM: To implement a Decision Tree using the CART (Classification and Regression Trees)
algorithm on the Bank Marketing Dataset, classify a new sample, and compare the
performance of CART with ID3 and C4.5 in terms of accuracy, precision, recall, and
sensitivity.
THEORY: Decision Trees are powerful supervised learning algorithms used for both
classification and regression tasks. They work by splitting the dataset into subsets based on
the value of input features. The goal is to create pure subsets where most data points belong
to a single class.
CART (Classification and Regression Trees) is a binary decision tree algorithm developed by
Breiman. It constructs decision trees by using the Gini Index as a measure of impurity to
decide the best feature and threshold to split the data. CART produces only binary trees (i.e.,
every node has 0 or 2 children).
ID3 (Iterative Dichotomiser 3), developed by Quinlan, uses Information Gain (based on
Entropy) to split nodes. It tends to overfit because it has no pruning step, and in its
original form it works only for categorical features.
C4.5, an improvement over ID3, uses Gain Ratio (a normalized version of information gain)
to overcome the bias toward attributes with many values. It can handle both discrete and
continuous attributes and supports pruning.
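
The Gini-based threshold search that CART performs on a numeric feature can be sketched as follows. This is a simplified illustration using only the standard library, not the scikit-learn implementation. The ages, labels, and the helper name `best_threshold` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split, weighted by the size of each side."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, weighted Gini)."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy feature and target
ages = [22, 25, 30, 45, 50, 60]
labels = ["no", "no", "no", "yes", "yes", "yes"]

print("Gini before split:", gini(labels))
print("Best (threshold, weighted Gini):", best_threshold(ages, labels))
```

Each candidate produces exactly two child nodes, which is why CART trees are always binary.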
Performance Metrics:
To evaluate and compare the algorithms, we use the following metrics:
• Accuracy: Percentage of correctly classified instances.
• Precision: Ratio of correctly predicted positive observations to total predicted
positives.
• Recall (Sensitivity): Ratio of correctly predicted positives to all actual positives.
• F1-score: Harmonic mean of precision and recall.
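
These definitions can be checked by hand from the four confusion-matrix counts. The sketch below uses hypothetical label lists (not model output from this lab) and assumes class 1 is the positive class. The helper name `binary_metrics` is illustrative.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

results = binary_metrics(y_true, y_pred)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Because recall and sensitivity are the same quantity, the comparison in this experiment reports one value under both names.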

CODE:
# This experiment reuses X_train, X_test, y_train and y_test from Experiment 3.
# Note: scikit-learn's DecisionTreeClassifier builds CART-style binary trees, so the
# ID3 and C4.5 models below are approximations that differ in criterion and depth.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# ID3 (approximated with the entropy criterion)
id3_model = DecisionTreeClassifier(criterion='entropy', random_state=42)
id3_model.fit(X_train, y_train)
y_pred_id3 = id3_model.predict(X_test)
print("ID3 Decision Tree Performance:")
print(classification_report(y_test, y_pred_id3))
print("Accuracy:", accuracy_score(y_test, y_pred_id3))

# C4.5 (approximated with entropy plus a depth limit in place of pruning)
c45_model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
c45_model.fit(X_train, y_train)
y_pred_c45 = c45_model.predict(X_test)
print("C4.5 Decision Tree Performance:")
print(classification_report(y_test, y_pred_c45))
print("Accuracy:", accuracy_score(y_test, y_pred_c45))

# CART (Gini index)
cart_model = DecisionTreeClassifier(criterion='gini', random_state=42)
cart_model.fit(X_train, y_train)
y_pred_cart = cart_model.predict(X_test)
print("CART Decision Tree Performance:")
print(classification_report(y_test, y_pred_cart))
print("Accuracy:", accuracy_score(y_test, y_pred_cart))

# Comparison between the ID3, C4.5 and CART models
def evaluate(name, y_true, y_pred):
    recall = recall_score(y_true, y_pred)  # recall and sensitivity are the same metric
    print(f"\n{name} Metrics:")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall : {recall:.4f}")
    print(f"Sensitivity (Recall) = {recall:.4f}")

evaluate("ID3", y_test, y_pred_id3)
evaluate("C4.5", y_test, y_pred_c45)
evaluate("CART", y_test, y_pred_cart)

OUTPUT:
EXPERIMENT – 6
AIM: To build and evaluate an Artificial Neural Network (ANN) using the Backpropagation
algorithm on the Bank Marketing Dataset.
THEORY:
An Artificial Neural Network is a computational model inspired by the structure of the human
brain. It consists of interconnected nodes (neurons) organized in layers—input, hidden, and
output layers. The Backpropagation algorithm is used for training neural networks by
minimizing the error between actual and predicted output. It adjusts weights based on the
error gradient through gradient descent. ANN is capable of capturing complex, non-linear
relationships in data.
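
The weight-update rule described above can be written out without a deep learning library. The following is a minimal NumPy sketch of backpropagation for a network with one hidden layer, trained on the XOR truth table. The layer sizes, learning rate, epoch count, and the use of mean squared error are assumptions chosen for illustration and are not the settings of the Keras model in this experiment.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# XOR inputs and targets (toy data, not the bank dataset)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 units (assumed size)
W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))

lr = 0.5  # assumed learning rate
losses = []

for epoch in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(np.mean((out - y) ** 2))

    # Backward pass: apply the chain rule layer by layer
    d_out = 2 * (out - y) / len(X) * out * (1 - out)  # gradient at output pre-activation
    d_W2 = h.T @ d_out
    d_b2 = d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * h * (1 - h)                # gradient at hidden pre-activation
    d_W1 = X.T @ d_h
    d_b1 = d_h.sum(axis=0, keepdims=True)

    # Gradient descent step
    W2 -= lr * d_W2
    b2 -= lr * d_b2
    W1 -= lr * d_W1
    b1 -= lr * d_b1

print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss  : {losses[-1]:.4f}")
```

Keras performs the same forward and backward passes automatically when `model.fit` is called; the Adam optimizer used there replaces the plain gradient descent step shown here.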

CODE:
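# Reuses X_train, X_test, y_train and y_test from Experiment 3.
# Imports assume the Keras API bundled with TensorFlow.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score

# Two hidden layers with ReLU, and a sigmoid output for binary classification
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=30, batch_size=32,
                    validation_split=0.2, verbose=1)

# Convert predicted probabilities to class labels using a 0.5 threshold
y_pred_ann_prob = model.predict(X_test)
y_pred_ann = (y_pred_ann_prob > 0.5).astype(int).ravel()
acc = accuracy_score(y_test, y_pred_ann)
print(f"accuracy of ann model: {acc * 100:.2f}%")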

OUTPUT:

EXPERIMENT – 7
AIM: To implement the Naïve Bayes Classifier on the Bank Marketing Dataset and evaluate
the model using performance metrics.
THEORY:
Naïve Bayes is a probabilistic classification algorithm based on Bayes’ Theorem. It assumes
independence among features and calculates the probability of each class given the input
features. Despite its simplicity, Naïve Bayes performs surprisingly well for a wide range of
classification tasks, especially when dealing with high-dimensional data.
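
As an illustration of the independence assumption, the sketch below multiplies a class prior by one likelihood per feature for categorical data. It is a simplified categorical variant with Laplace smoothing, not the Gaussian variant used in this experiment's code. The toy rows, labels, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def predict_proba(row, class_counts, feature_counts, n_values):
    """Posterior over classes: prior times the product of smoothed likelihoods."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace smoothing avoids zero probabilities for unseen values
            score *= (feature_counts[(label, i)][value] + 1) / (count + n_values[i])
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical toy data: each row is (housing, loan); the label is the deposit outcome
rows = [("yes", "no"), ("yes", "yes"), ("yes", "no"),
        ("no", "no"), ("no", "no"), ("no", "yes")]
labels = ["no", "no", "no", "yes", "yes", "no"]

class_counts, feature_counts = train(rows, labels)
posterior = predict_proba(("no", "no"), class_counts, feature_counts, n_values=[2, 2])
print(posterior)
print("Predicted class:", max(posterior, key=posterior.get))
```

GaussianNB follows the same structure but replaces the count-based likelihoods with a normal density fitted to each feature per class.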

CODE:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

y_pred_nb = nb_model.predict(X_test)

print("Naïve Bayes Classification Report:")


print(classification_report(y_test, y_pred_nb))
print("Accuracy:", accuracy_score(y_test, y_pred_nb))

OUTPUT:
EXPERIMENT – 8
AIM: To implement the k-Nearest Neighbour (k-NN) algorithm to classify data from the
Bank Marketing Dataset and analyse both correct and incorrect predictions.
THEORY:
k-NN is a non-parametric, instance-based learning algorithm that classifies data based on the
majority class among its k nearest neighbours. The proximity is typically measured using
distance metrics like Euclidean distance. It is simple yet effective for many classification
problems but may be computationally intensive for large datasets, because each prediction
requires computing the distance to every training point.
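
The classification rule described above fits in a few lines of standard-library Python. This is a brute-force sketch on hypothetical 2-D points. The function name `knn_predict` and the data are illustrative and are not part of the lab code.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # Sort all training points by Euclidean distance to the query and keep the k nearest
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (7, 9)))
```

Sorting the full training set for every query is what makes the method expensive on large datasets; library implementations can use tree-based indexes to reduce that cost.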

CODE:
# Reuses X_train, X_test, y_train and y_test from Experiment 3.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

k = 5
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)

print("k-NN Classification Report:")
print(classification_report(y_test, y_pred_knn))
print("Accuracy:", accuracy_score(y_test, y_pred_knn))

# Positions in the test set where the prediction matches / differs from the true label
correct = np.where(y_pred_knn == y_test)[0]
incorrect = np.where(y_pred_knn != y_test)[0]
print(f"Correct Predictions: {len(correct)}")
print(f"Wrong Predictions: {len(incorrect)}")

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))



OUTPUT:
EXPERIMENT – 9
AIM: To apply the k-Means clustering algorithm on the Bank Marketing Dataset and
evaluate the quality of the clustering using metrics like silhouette score.
THEORY:
k-Means is an unsupervised learning algorithm used to group data into k distinct clusters. It
works by initializing k centroids, assigning each data point to the nearest centroid, and then
updating centroids based on the mean of assigned points. This process continues iteratively
until convergence. The quality of clustering can be evaluated using metrics like silhouette
score or visualizations using PCA.
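
The assign-and-update loop described above can be sketched in NumPy as follows. This is a minimal version with random initial centroids drawn from the data. The toy points, the seed, and the function name `k_means` are assumptions, and scikit-learn's KMeans adds refinements such as k-means++ initialization and multiple restarts.

```python
import numpy as np


def k_means(X, k, n_iters=100, seed=0):
    """Plain k-means: returns (cluster labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

labels, centroids = k_means(X, k=2)
print("Labels   :", labels)
print("Centroids:\n", centroids)
```

The cluster numbers themselves are arbitrary; only the grouping of points is meaningful, which is why clustering quality is judged with measures such as the silhouette score.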

CODE:
# Reuses df and the feature matrix X from Experiment 3.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize features so that no single feature dominates the distance computation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

sil_score = silhouette_score(X_scaled, clusters)
print(f"Silhouette Score: {sil_score:.4f}")

# Compare clusters with the target column ('deposit' in this copy of the dataset)
df["Cluster"] = clusters
print(df.groupby("Cluster")["deposit"].value_counts())

# Project to two principal components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', alpha=0.5)
plt.title("k-Means Clustering (2 Clusters)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label="Cluster")
plt.show()

OUTPUT:
EXPERIMENT – 10
AIM: To implement Linear Regression on the Bank Marketing Dataset and evaluate the
performance using regression metrics.
THEORY:
Linear Regression is a supervised learning algorithm used to model the relationship between
a dependent variable and one or more independent variables. It fits a linear equation to
observed data by minimizing the sum of squared errors. It is widely used for predictive
analysis, though it assumes a linear relationship between variables and is sensitive to outliers
and multicollinearity.
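
For a single independent variable, the least-squares line has a closed-form solution. The sketch below computes it with the standard library on hypothetical age and balance values (not rows from bank.csv). The helper names `fit_line` and `r_squared` are illustrative.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical toy data
ages = [25, 30, 35, 40, 45]
balances = [600, 800, 900, 1150, 1300]

slope, intercept = fit_line(ages, balances)
r2 = r_squared(ages, balances, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.4f}")
```

With several independent variables, as in this experiment, the same criterion is minimized over all coefficients at once, which is what `LinearRegression` does.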

CODE:
# Reuses the label-encoded df from Experiment 3.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Predict 'balance' from the remaining columns; drop the 'Cluster' column
# if it was added in Experiment 9
y_reg = df['balance']
X_reg = df.drop(columns=['balance', 'Cluster'], errors='ignore')

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42)

lr_model = LinearRegression()
lr_model.fit(X_train_reg, y_train_reg)
y_pred_lr = lr_model.predict(X_test_reg)

print("Linear Regression Performance:")
print(f"MAE: {mean_absolute_error(y_test_reg, y_pred_lr):.2f}")
print(f"MSE: {mean_squared_error(y_test_reg, y_pred_lr):.2f}")
print(f"R² Score: {r2_score(y_test_reg, y_pred_lr):.4f}")

# Points close to the diagonal indicate accurate predictions
plt.scatter(y_test_reg, y_pred_lr, alpha=0.5)
plt.xlabel("Actual Balance")
plt.ylabel("Predicted Balance")
plt.title("Linear Regression: Actual vs Predicted")
plt.grid()
plt.show()

OUTPUT:
