Machine Learning Lab (UGCS 213)
Experiment: 01
1. Introduction to Scikit-learn
Scikit-learn is a popular machine learning library in Python. It provides a wide range of tools
for building machine learning models and data preprocessing. It is built on top of NumPy,
SciPy, and matplotlib, making it efficient and easy to integrate with other Python data tools.
Key Features of Scikit-learn
Classification – Identifying which category an object belongs to.
Example: Email spam detection.
Regression – Predicting a continuous-valued attribute.
Example: Predicting house prices.
Clustering – Grouping similar items.
Example: Customer segmentation.
Dimensionality Reduction – Reducing number of features.
Example: PCA (Principal Component Analysis).
Model Selection – Comparing, validating, and selecting models.
Preprocessing – Feature extraction and normalization.
Installation:
pip install scikit-learn
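All of these tools share the same basic estimator API: construct a model object, call fit() on training data, then predict() or transform() on new data. Below is a minimal sketch of that pattern (the toy data and classifier choice are arbitrary, purely to illustrate the interface):
from sklearn.neighbors import KNeighborsClassifier
# Toy data: four samples with one feature each, and binary labels
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=3)  # 1. construct the estimator
model.fit(X, y)                              # 2. learn from the training data
print(model.predict([[1.5]]))                # 3. predict for an unseen input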
2. Classification
Classification is a supervised learning technique where the output is a label or category.
Example: K-Nearest Neighbors (KNN)
Theory:
KNN classifies a data point based on how its neighbors are classified. It stores all available
cases and classifies new ones based on a similarity measure (e.g., distance functions).
Functions Used:
KNeighborsClassifier() - Initialize the model
fit() - Train the model
predict() - Predict the classes
score() or accuracy_score() - Evaluate the model
Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load data
X, y = load_iris(return_X_y=True)
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
3. Regression
Regression is used when the target variable is continuous.
Example: Linear Regression
Theory:
Linear regression fits a line (y = mx + b) to predict a continuous output based on input
features.
Functions Used:
LinearRegression() - Initialize the model
fit() - Train model
predict() - Predict values
mean_squared_error() - Evaluate performance
Code:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Sample data
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 6, 9, 12])
# Model
model = LinearRegression()
model.fit(X, y)
# Predict
y_pred = model.predict([[5]])
print("Predicted value:", y_pred)
4. Clustering
Clustering is an unsupervised learning method that groups data based on similarity.
Example: K-Means Clustering
Theory:
K-Means groups data into k clusters by minimizing the variance within each cluster.
Functions Used:
KMeans() - Initialize with number of clusters
fit() - Compute clusters
predict() or labels_ - Access cluster assignments
cluster_centers_ - Get coordinates of cluster centers
Code:
from sklearn.cluster import KMeans
import numpy as np
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
# Model
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
# Outputs
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
5. Dimensionality Reduction
Used to reduce the number of input variables in a dataset.
Example: PCA (Principal Component Analysis)
Theory:
PCA transforms the data to a new coordinate system, keeping only the components (axes)
that contribute most to variance.
Functions Used:
PCA(n_components=2) - Specify number of reduced features
fit_transform() - Fit PCA and apply transformation
Code:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
# Load data
X, _ = load_iris(return_X_y=True)
# PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)
6. Preprocessing
Example: Standardization using StandardScaler
Theory:
StandardScaler standardizes features by removing the mean and scaling to unit variance.
Functions Used:
StandardScaler() - Initialize
fit_transform() - Compute and apply standardization
Code:
from sklearn.preprocessing import StandardScaler
# Data
data = [[1, 2], [3, 4], [5, 6]]
# Standardize
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Scaled Data:", scaled_data)
7. Model Selection and Evaluation
Example: Cross-validation
Theory:
Cross-validation is used to evaluate the model’s ability to generalize. It splits data into
training and validation sets multiple times.
Functions Used:
cross_val_score() - Performs k-fold cross-validation
Code:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_digits
# Load data
X, y = load_digits(return_X_y=True)
# Model
svc = SVC(kernel='linear')
# Cross-validation
scores = cross_val_score(svc, X, y, cv=5)
print("CV Scores:", scores)
print("Average Accuracy:", scores.mean())
Advantages of Scikit-learn
Simple API for beginners
Integrates well with NumPy and pandas
Strong community support
Wide variety of algorithms
Cross-validation and hyperparameter tuning built-in
Limitations
Not suitable for deep learning (use TensorFlow or PyTorch instead)
Not ideal for very large datasets
Limited support for GPU acceleration
Experiment: 02
K-Means clustering with code and examples.
1. Introduction to K-Means Clustering
K-Means is an unsupervised machine learning algorithm used to partition data into K
clusters, where each data point belongs to the cluster with the nearest mean (centroid). It is
widely used for tasks like customer segmentation, image compression, and anomaly
detection.
How K-Means Works
1. Choose the number of clusters K.
2. Randomly initialize K centroids.
3. Assign each point to the nearest centroid (cluster assignment).
4. Compute new centroids (mean of the points in each cluster).
5. Repeat steps 3–4 until centroids do not change or reach max iterations.
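Steps 3 and 4 can be written directly in NumPy. The sketch below performs a single assignment-and-update iteration on toy data; it is an illustration of the algorithm, not scikit-learn's actual implementation:
import numpy as np
# Toy data and two centroids assumed already initialized (K = 2)
X = np.array([[1.0, 2.0], [1.0, 4.0], [10.0, 2.0], [10.0, 4.0]])
centroids = np.array([[2.0, 3.0], [9.0, 3.0]])
# Step 3: assign each point to its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)
# Step 4: recompute each centroid as the mean of its assigned points
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print("Labels:", labels)                    # [0 0 1 1]
print("Updated centroids:", new_centroids)  # [[ 1. 3.] [10. 3.]]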
Important Parameters of KMeans()
n_clusters - Number of clusters (K) to form.
init - Method for initialization ('k-means++' by default, improves performance).
n_init - Number of times the algorithm will be run with different centroid seeds.
max_iter - Maximum number of iterations for a single run.
random_state - Ensures reproducibility of results.
tol - Relative tolerance with regards to inertia to declare convergence.
algorithm - Algorithm to use: 'auto', 'full', or 'elkan'.
Common Methods in KMeans
.fit(X) - Fits the KMeans model to the data X.
.predict(X) - Assigns each sample in X to a cluster.
.fit_predict(X) - Combines fit and predict in one step.
.transform(X) - Returns the distance of each point to each centroid.
.fit_transform(X) - Fits and returns the distances to centroids.
Common Attributes in KMeans
.cluster_centers_ - Coordinates of cluster centers.
.labels_ - Labels of each point.
.inertia_ - Sum of squared distances of samples to their closest cluster center.
.n_iter_ - Number of iterations run.
Implementation of code:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Load the dataset
dataset = pd.read_csv("Mall_Customers.csv")
# Select features: Annual Income and Spending Score
x = dataset.iloc[:, [3, 4]].values  # or dataset[['Annual Income (k$)', 'Spending Score (1-100)']].values
# Elbow Method to find optimal number of clusters
wcss_list = [] # Within-cluster sum of squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)  # inertia_ = WCSS
# Plot the Elbow graph
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss_list, marker='o')
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()
# Apply KMeans with the optimal number of clusters (k = 5)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(x)
# Visualize the clusters
plt.figure(figsize=(8, 6))
# Cluster 1
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1],
s=100, c='red', label='Cluster 1')
# Cluster 2
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1],
s=100, c='blue', label='Cluster 2')
# Cluster 3
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1],
s=100, c='green', label='Cluster 3')
# Cluster 4
plt.scatter(x[y_kmeans == 3, 0], x[y_kmeans == 3, 1],
s=100, c='cyan', label='Cluster 4')
# Cluster 5
plt.scatter(x[y_kmeans == 4, 0], x[y_kmeans == 4, 1],
s=100, c='magenta', label='Cluster 5')
# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
s=300, c='yellow', label='Centroids', edgecolor='black')
# Labels
plt.title('Customer Segments (KMeans Clustering)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid(True)
plt.show()
Experiment: 03
What is Linear Regression?
Linear Regression is a supervised learning algorithm used for predicting a continuous
dependent variable based on one or more independent variables.
🔸 Simple Linear Regression:
Model: y=mx+c
Where:
y = predicted value (dependent variable)
x = independent variable
m = slope (coefficient)
c = intercept
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# 1. Load the dataset
# dataset = pd.read_csv('Salary_Data.csv')
url = "https://raw.githubusercontent.com/AdiPersonalWorks/Random/master/
Salary_Data.csv"
dataset = pd.read_csv(url)
print(dataset.head())
X = dataset[['YearsExperience']] # Independent variable
y = dataset['Salary'] # Dependent variable
# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Create the Linear Regression model
model = LinearRegression()
# 4. Train the model
model.fit(X_train, y_train)
# 5. Predict the test set results
y_pred = model.predict(X_test)
# 6. Evaluate the model
print("Coefficient (slope):", model.coef_[0])
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score (accuracy):", r2_score(y_test, y_pred))
# 7. Plotting the training data
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(X_train, model.predict(X_train), color='red', label='Regression line')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()
# 8. Plotting the test data
plt.scatter(X_test, y_test, color='green', label='Test data')
plt.plot(X_train, model.predict(X_train), color='red', label='Regression line')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()
Functions and Methods Used:
LinearRegression() - Creates a linear regression model.
fit(X, y) - Trains the model using features X and target y.
predict(X) - Predicts the output for new input values.
mean_squared_error(y_true, y_pred) - Evaluates the average squared difference between actual and predicted values.
r2_score(y_true, y_pred) - Gives the R^2 goodness-of-fit score (closer to 1 is better).
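To make the two metrics concrete, here is a hedged NumPy sketch reproducing them by hand, reusing y_test and y_pred from the code above:
import numpy as np
y_true = np.array(y_test)
mse = np.mean((y_true - y_pred) ** 2)           # average squared error
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print("Manual MSE:", mse)
print("Manual R^2:", r2)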
Experiment: 04
Binary classification with logistic regression
What is Binary Classification?
Binary classification is used when the output has two classes (e.g., Yes/No, 0/1, Spam/Not
Spam).
Logistic Regression Overview
It's a classification algorithm, not regression, despite the name.
It predicts the probability that a given input belongs to a certain class.
Output is between 0 and 1, using the sigmoid function.
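The sigmoid squashes any real-valued score z = w·x + b into the range (0, 1): sigmoid(z) = 1 / (1 + e^(-z)). A minimal sketch of this mapping (the weights below are made-up numbers for illustration, not learned values):
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Hypothetical linear score for one sample: z = w.x + b
w, b = np.array([0.8, -0.4]), 0.1  # illustrative weights, not learned
x = np.array([2.0, 1.0])
z = w @ x + b
print("Probability of class 1:", sigmoid(z))  # strictly between 0 and 1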
Logistic Regression Key Functions Explained:
LogisticRegression() - Initializes the model.
fit(X_train, y_train) - Trains the model.
predict(X_test) - Predicts 0 or 1.
predict_proba(X) - Gives probabilities for class 0 and class 1.
accuracy_score(y_test, y_pred) - Calculates accuracy.
confusion_matrix() - Displays TP, TN, FP, FN.
classification_report() - Shows precision, recall, and F1-score.
Implementation of code:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the dataset
data = pd.read_csv("Titanic-Dataset.csv")
# Select relevant features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
data = data[features + ['Survived']]
# Handle missing values (assign the filled values back instead of using inplace)
data['Age'] = data['Age'].fillna(data['Age'].median())
# Encode categorical feature 'Sex'
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex']) # male=1, female=0
# Define input and target
X = data[features]
y = data['Survived']
# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Accuracy and report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion Matrix Visualization
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Did not Survive', 'Survived'],
yticklabels=['Did not Survive', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
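The table above lists predict_proba(), which the code does not exercise; as a short follow-up, it returns a probability for each class rather than a hard 0/1 label:
# Probabilities [P(did not survive), P(survived)] for the first five test rows
print(model.predict_proba(X_test[:5]))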
Experiment: 05
Explain Decision Tree and implement with code.
What is a Decision Tree?
A Decision Tree is a supervised learning algorithm used for classification and regression
problems. It splits the dataset into branches based on feature values, helping to make
predictions by learning simple decision rules inferred from the data.
How It Works:
1. The algorithm selects the best feature to split on, using criteria like Gini impurity or Information Gain (Entropy).
2. It splits the dataset recursively into subsets.
3. It continues until a stopping criterion is met (such as maximum depth, pure nodes, or too few samples).
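To make the splitting criteria concrete, here is a hedged sketch computing Gini impurity and entropy for a candidate node by hand (the class counts are made-up numbers):
import numpy as np
def gini(counts):
    p = np.array(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)
def entropy(counts):
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]  # avoid log(0)
    return -np.sum(p * np.log2(p))
# Hypothetical node with 30 'Survived' and 10 'Did not Survive' samples
print("Gini:", gini([30, 10]))        # 1 - (0.75^2 + 0.25^2) = 0.375
print("Entropy:", entropy([30, 10]))  # about 0.811 bits
At each split, the tree picks the feature and threshold whose child nodes have the lowest weighted impurity.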
Functions/Methods used in this algorithm:
DecisionTreeClassifier() - Creates the Decision Tree model.
.fit() - Trains the model on the dataset.
.predict() - Predicts using the trained model.
train_test_split() - Splits data into train and test sets.
accuracy_score() - Calculates prediction accuracy.
classification_report() - Shows precision, recall, and F1-score.
confusion_matrix() - Matrix showing TP, TN, FP, FN.
sns.heatmap() - Visualizes the confusion matrix.
plot_tree() - Visual representation of the decision rules.
LabelEncoder() - Converts categorical values to numeric (e.g., male → 1).
Implementation of code:
# import library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
data = pd.read_csv("Titanic-Dataset.csv")
print(data.head())  # quick look at the raw data

# Select features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
data = data[features + ['Survived']]
print(data.head())  # confirm the selected columns
# Handle missing values (assign the filled values back instead of using inplace)
data['Age'] = data['Age'].fillna(data['Age'].median())
# Encode categorical variables
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex']) # male = 1, female = 0
# Split data into input and output
X = data[features]
y = data['Survived']
# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Decision Tree Model
model = DecisionTreeClassifier(criterion='entropy', random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
xticklabels=['Did not Survive', 'Survived'],
yticklabels=['Did not Survive', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Decision Tree')
plt.show()
# Plot the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=features, class_names=['Not Survived', 'Survived'],
filled=True, rounded=True, fontsize=12)
plt.title('Decision Tree Visualization')
plt.show()
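As a brief follow-up, a fitted tree also exposes feature_importances_, showing how much each feature contributed to the splits; continuing the code above:
# Relative contribution of each feature to the tree's splits (sums to 1)
for name, importance in zip(features, model.feature_importances_):
    print(f"{name}: {importance:.3f}")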