
UNIT 3 UNSUPERVISED LEARNING

Date: 11/02/2025

CLUSTERING
Clustering is a method used to group similar items together based on certain characteristics.

Imagine you have a mixed collection of fruits, and you want to organize them into groups like
apples, bananas, and oranges. Clustering helps in automatically finding these groups based on
features like color, size, or shape.

K-Means Algorithm
One popular clustering technique is the K-Means algorithm. The "K" represents the number of
clusters you want to create.

The algorithm works as follows:

Choose the Number of Clusters (K): Decide how many groups you want to divide your data into.

Initialize Centroids: Randomly select K points in the data space to serve as the initial centers
(centroids) of the clusters.

Assign Data Points to Clusters: For each data point, calculate its distance to each centroid and
assign it to the nearest one.

Update Centroids: After assigning all data points, recalculate the centroids as the average
position of all points in each cluster.

Repeat: Repeat steps 3 and 4 until the centroids no longer change significantly, indicating that
the clusters are stable.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate synthetic data: 100 points around (1,1) and 100 points around (5,5)
np.random.seed(42)
data1 = np.random.randn(100, 2) + np.array([1, 1])
data2 = np.random.randn(100, 2) + np.array([5, 5])
data = np.vstack((data1, data2))

# Apply K-Means clustering with K=2
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Plot the data points and centroids
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis', marker='o')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=100, label='Centroids')
plt.title('K-Means Clustering Example')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
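
The sklearn call above performs the assignment and update steps internally. As a minimal sketch of steps 2 to 5 written out by hand (reusing the same data array generated above and assuming Euclidean distance), the loop looks like this:

K = 2
rng = np.random.default_rng(42)
centroids = data[rng.choice(len(data), K, replace=False)]  # Step 2: random initial centroids

for _ in range(100):  # Step 5: repeat until the centroids stop moving
    # Step 3: assign each point to its nearest centroid
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 4: recompute each centroid as the mean of its assigned points
    # (this sketch assumes no cluster becomes empty)
    new_centroids = np.array([data[labels == k].mean(axis=0) for k in range(K)])

    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Final centroids:\n", centroids)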

Hierarchical Clustering
Hierarchical clustering is a method used to group similar items into clusters, forming a tree-like
structure called a dendrogram. This technique is particularly useful when you want to
understand the relationships between different groups within your data.
Types of Hierarchical Clustering
There are two main approaches to hierarchical clustering:

Agglomerative Clustering (Bottom-Up Approach):
Start with each data point as its own cluster. Iteratively merge the closest clusters until all data points are grouped into a single cluster or until a stopping criterion is met.

Divisive Clustering (Top-Down Approach):
Begin with all data points in one large cluster. Recursively split clusters into smaller ones until each data point is in its own cluster or until a stopping criterion is met.

Agglomerative Clustering Example with Python

The following example performs agglomerative clustering in Python, using the scipy and matplotlib libraries to compute the linkage and visualize the result as a dendrogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate synthetic data
np.random.seed(42)
data = np.random.randn(20, 2)

# Perform hierarchical/agglomerative clustering
linked = linkage(data, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Dendrogram for Agglomerative Clustering')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distance')
plt.show()
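
The dendrogram shows the full merge hierarchy rather than a fixed set of groups. If flat cluster labels are needed, the tree can be cut at a chosen number of clusters; a short sketch using scipy's fcluster on the linked matrix computed above:

from scipy.cluster.hierarchy import fcluster

# Cut the hierarchy so that at most 3 flat clusters remain
flat_labels = fcluster(linked, t=3, criterion='maxclust')
print(flat_labels)  # one cluster label (1, 2 or 3) for each of the 20 points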
Date: 24/02/2025

Divisive clustering
Divisive clustering is a hierarchical clustering method that takes a top-down approach to
grouping data points. Unlike agglomerative clustering, which starts with individual data points
and merges them into clusters, divisive clustering begins with all data points in a single cluster
and recursively splits them into smaller clusters until each data point stands alone or a specified
number of clusters is achieved.

Steps in Divisive Clustering:

Start with All Data Points: Begin by considering the entire dataset as one large cluster.

Identify the Least Similar Data Points: Determine which data points within the cluster are least similar to each other.

Split the Cluster: Divide the cluster into two smaller clusters based on the identified dissimilarities.

Repeat Recursively: Apply the same process to each resulting cluster, continuing to split them until each cluster contains only one data point or until the desired number of clusters is reached.

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
data = np.random.randn(100, 2)

def bisecting_kmeans(data, k):
    clusters = [data]
    while len(clusters) < k:
        # Choose the largest cluster to split (pop it by index to avoid
        # elementwise equality comparisons between NumPy arrays)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        largest_cluster = clusters.pop(idx)

        # Perform K-Means with k=2 on the selected cluster
        kmeans = KMeans(n_clusters=2, random_state=42).fit(largest_cluster)
        labels = kmeans.labels_

        # Split the cluster into two clusters
        cluster1 = largest_cluster[labels == 0]
        cluster2 = largest_cluster[labels == 1]

        clusters.append(cluster1)
        clusters.append(cluster2)

    return clusters

# Perform bisecting K-Means to get 3 clusters
clusters = bisecting_kmeans(data, 3)

# Plot the clusters
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']
plt.figure(figsize=(8, 6))
for i, cluster in enumerate(clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1],
                c=colors[i % len(colors)], label=f'Cluster {i+1}')
plt.title('Divisive Clustering using Bisecting K-Means')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used in statistics
and machine learning to simplify large datasets while preserving as much variability as possible.

It transforms correlated variables into a smaller set of uncorrelated variables called principal
components.
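
Mechanically, the principal components are the directions of largest variance in the data. A minimal NumPy sketch (using a small made-up data matrix, separate from the student-scores example further below) of how they can be computed from the covariance matrix:

import numpy as np

# Toy data: 5 samples, 3 correlated features (illustrative values only)
X = np.array([[2.0, 4.1, 1.0],
              [3.0, 6.2, 0.9],
              [4.0, 7.9, 1.1],
              [5.0, 10.1, 1.0],
              [6.0, 12.0, 0.8]])

X_centered = X - X.mean(axis=0)          # center each feature
cov = np.cov(X_centered, rowvar=False)   # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors are the principal directions

# Sort by decreasing eigenvalue and keep the top 2 components
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
X_reduced = X_centered @ components      # project the data onto the components
print(X_reduced.shape)                   # (5, 2): fewer, uncorrelated variables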

Applications of PCA:
Finance: Identifying key factors influencing stock prices.

Image Processing: Reducing the number of pixels while preserving essential details.

Genetics: Analyzing gene expression data.

Machine Learning: Preprocessing high-dimensional data before applying classification or clustering models.
Example:
Suppose a teacher wants to group students based on their skills, but she has 10 different test
scores for each student.

PCA helps her combine similar scores into a few key categories like "Overall Academic Strength"
and "Sports Ability" instead of looking at all 10 scores separately.

This makes it easier to compare students and identify patterns.

Advantages of PCA: It reduces complexity (makes data easier to understand). It removes unnecessary details while keeping important information. It helps in visualization (makes it possible to draw graphs even if the original data had too many variables).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample student test scores (10 subjects for 6 students)
data = np.array([
    [85, 78, 92, 88, 76, 95, 89, 91, 85, 87],  # Student 1
    [70, 75, 80, 78, 85, 90, 79, 77, 88, 86],  # Student 2
    [60, 65, 55, 50, 70, 72, 68, 60, 58, 63],  # Student 3
    [90, 85, 88, 92, 93, 97, 95, 91, 89, 90],  # Student 4
    [50, 55, 48, 45, 52, 58, 54, 50, 53, 55],  # Student 5
    [80, 82, 85, 87, 88, 92, 89, 85, 83, 81]   # Student 6
])

# Convert to DataFrame
df = pd.DataFrame(data, columns=[f'Subject {i+1}' for i in range(10)])
print("Original Student Scores:")
print(df)

# Step 1: Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Step 2: Applying PCA (reduce from 10 subjects to 2 main components)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)

# Step 3: Creating a new DataFrame with the PCA results
pca_df = pd.DataFrame(principal_components,
                      columns=['Overall Academic Strength', 'Sports Ability'])
print("\nTransformed Data after PCA:")
print(pca_df)

# Step 4: Visualizing the transformed data
plt.figure(figsize=(8, 6))
plt.scatter(pca_df['Overall Academic Strength'], pca_df['Sports Ability'],
            color='b', s=100)
for i, txt in enumerate(["Student 1", "Student 2", "Student 3",
                         "Student 4", "Student 5", "Student 6"]):
    plt.annotate(txt, (pca_df['Overall Academic Strength'][i],
                       pca_df['Sports Ability'][i]), fontsize=12)
plt.axhline(0, color='gray', linestyle='--')
plt.axvline(0, color='gray', linestyle='--')
plt.xlabel('Overall Academic Strength')
plt.ylabel('Sports Ability')
plt.title('Student Grouping using PCA')
plt.show()

Original Student Scores:

   Subject 1  Subject 2  Subject 3  Subject 4  Subject 5  Subject 6  Subject 7  Subject 8  Subject 9  Subject 10
0         85         78         92         88         76         95         89         91         85          87
1         70         75         80         78         85         90         79         77         88          86
2         60         65         55         50         70         72         68         60         58          63
3         90         85         88         92         93         97         95         91         89          90
4         50         55         48         45         52         58         54         50         53          55
5         80         82         85         87         88         92         89         85         83          81

Transformed Data after PCA:

   Overall Academic Strength  Sports Ability
0                   2.191978       -0.920909
1                   0.978242        0.349057
2                  -3.026049        0.459299
3                   3.222156        0.228173
4                  -5.305649       -0.352800
5                   1.939321        0.237180
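
The two components are not equally informative. As a quick follow-up (reusing the fitted pca object from the code above), sklearn's explained_variance_ratio_ attribute reports the fraction of the original variability each component captures:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())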
