Machine Learning
Assignment 3: Classification
Analysis
Submitted to: Sir Atif Mehmood
Group Members:
Abeeha Abid (F2021065300)
Esha Tur Razia (F2021065348)
Eman Azam (F2021065253)
2024-2025
Task 1
1.1 Support Vector Machines (SVM)
Working Principles: Classifies data by finding the hyperplane that maximizes the
margin between classes, optionally in a high-dimensional feature space induced by a kernel.
Assumptions: Data is linearly separable or can be made separable using kernel
functions.
Strengths: Works well with small datasets, effective for high-dimensional spaces, can
handle non-linear separability with kernels.
Weaknesses: Computationally intensive for large datasets, sensitive to noise.
Training Efficiency: Slow for large datasets.
Computational Efficiency: Moderate; increases with the number of support vectors.
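As a minimal sketch, using scikit-learn's SVC on a synthetic dataset (the data and
parameters here are illustrative assumptions, not part of the assignment):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The RBF kernel maps inputs into a higher-dimensional space for non-linear separation
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X_train, y_train)
print('SVM test accuracy:', clf.score(X_test, y_test))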
1.2 Naïve Bayes
Working Principles: Based on Bayes’ theorem; assumes features are conditionally
independent given the class.
Assumptions: Feature independence, which is often unrealistic.
Strengths: Simple, fast, works well with categorical data.
Weaknesses: Poor performance if feature independence assumption is violated.
Training Efficiency: Extremely fast; suitable for large datasets.
Computational Efficiency: Very high; prediction only combines precomputed probabilities.
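A minimal sketch with scikit-learn's GaussianNB, which assumes continuous features
(MultinomialNB would better suit count or categorical data); the dataset is synthetic
and illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training only estimates per-class feature means and variances, hence very fast
nb = GaussianNB()
nb.fit(X_train, y_train)
print('Naive Bayes test accuracy:', nb.score(X_test, y_test))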
1.3 k-Nearest Neighbors (KNN)
Working Principles: Predicts class by majority vote among the k-nearest neighbors
in the feature space.
Assumptions: Nearby points have similar labels; requires good distance metrics.
Strengths: Easy to implement; no explicit training phase (a lazy learner).
Weaknesses: Sensitive to irrelevant features and feature scaling, slow for large
datasets.
Training Efficiency: Essentially none; the model simply stores the training data.
Computational Efficiency: Low at prediction time, since distances to all stored
points must be computed; this worsens as the dataset grows.
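A minimal sketch with scikit-learn's KNeighborsClassifier; feature scaling is included
because KNN's distance computations are sensitive to feature scales, and k=5 is an
illustrative choice:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features first; "fitting" KNN merely stores the training data
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print('KNN test accuracy:', knn.score(X_test, y_test))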
1.4 Neural Networks
Working Principles: Loosely modeled on the brain's interconnected layers of neurons;
learns features and patterns via backpropagation.
Assumptions: Large amounts of labeled data and sufficient computational resources
are available.
Strengths: Very powerful for complex, non-linear problems.
Weaknesses: Requires significant computational power and training time, prone to
overfitting.
Training Efficiency: Low for large networks.
Computational Efficiency: Relatively low; depends on the size of the network.
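A minimal sketch with scikit-learn's MLPClassifier (the layer sizes and iteration
limit are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Two hidden layers trained with backpropagation (Adam optimizer by default)
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print('Neural network test accuracy:', mlp.score(X_test, y_test))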
Task 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the dataset and inspect the available columns
data = pd.read_csv('data.csv')
print(data.columns)
# Select predictor features and the target variable
X = data[['Age', 'Education_Level', 'Occupation', 'Number_of_Dependents',
          'Location', 'Work_Experience', 'Marital_Status',
          'Employment_Status', 'Household_Size', 'Homeownership_Status',
          'Type_of_Housing', 'Gender', 'Primary_Mode_of_Transportation']]
y = data['Income']
# Convert categorical variables into dummy/indicator variables
X = pd.get_dummies(X, drop_first=True)
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Fit an ordinary least squares linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate predictions with standard regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'R² Score: {r2}')
# Visualize predicted against actual values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Income')
plt.ylabel('Predicted Income')
plt.title('Actual vs Predicted Income')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
         color='red')  # Diagonal reference line (perfect predictions)
plt.show()
Task 3
Analysis and Interpretation of Results
MAE, MSE, and RMSE quantify prediction error: MAE is the average absolute
deviation, MSE penalizes large errors more heavily, and RMSE expresses that
penalty in the same units as income.
The R² score indicates the proportion of variance in income that the features explain.
A good model will have low error metrics (MAE, MSE, RMSE) and a high R² score;
the sketch below shows how each metric is computed.
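For reference, a minimal NumPy sketch of how these metrics are computed from
predictions; it mirrors the scikit-learn calls used in Task 2 and assumes y_true
and y_pred are NumPy arrays:

import numpy as np

def regression_metrics(y_true, y_pred):
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))        # average absolute error
    mse = np.mean(errors ** 2)           # average squared error
    rmse = np.sqrt(mse)                  # same units as the target
    ss_res = np.sum(errors ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot             # proportion of variance explained
    return mae, mse, rmse, r2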
References: Sources used for this assignment:
https://www.geeksforgeeks.org/machine-learning/
https://chatgpt.com/