
UNIT 3 UNSUPERVISED LEARNING

Date: 11/02/2025

CLUSTERING
Clustering is a method used to group similar items together based on certain characteristics.

Imagine you have a mixed collection of fruits, and you want to organize them into groups like
apples, bananas, and oranges. Clustering helps in automatically finding these groups based on
features like color, size, or shape.

K-Means Algorithm
One popular clustering technique is the K-Means algorithm. The "K" represents the number of
clusters you want to create.

The algorithm works as follows:

Choose the Number of Clusters (K): Decide how many groups you want to divide your data into.

Initialize Centroids: Randomly select K points in the data space to serve as the initial centers
(centroids) of the clusters.

Assign Data Points to Clusters: For each data point, calculate its distance to each centroid and
assign it to the nearest one.

Update Centroids: After assigning all data points, recalculate the centroids as the average
position of all points in each cluster.

Repeat: Repeat steps 3 and 4 until the centroids no longer change significantly, indicating that
the clusters are stable.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate synthetic data: 100 points around (1,1) and 100 points around (5,5)
np.random.seed(42)
data1 = np.random.randn(100, 2) + np.array([1, 1])
data2 = np.random.randn(100, 2) + np.array([5, 5])
data = np.vstack((data1, data2))

# Apply K-Means clustering with K=2
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Plot the data points and centroids
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis', marker='o')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=100, label='Centroids')
plt.title('K-Means Clustering Example')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
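
The sklearn call above performs the assignment and update steps internally. As a minimal sketch of steps 2 to 5 written out by hand (reusing the same data array generated above and assuming Euclidean distance), the loop looks like this:

K = 2
rng = np.random.default_rng(42)
centroids = data[rng.choice(len(data), K, replace=False)]  # Step 2: random initial centroids

for _ in range(100):  # Step 5: repeat until the centroids stop moving
    # Step 3: assign each point to its nearest centroid
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 4: recompute each centroid as the mean of its assigned points
    # (this sketch assumes no cluster becomes empty)
    new_centroids = np.array([data[labels == k].mean(axis=0) for k in range(K)])

    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Final centroids:\n", centroids)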

Hierarchical Clustering
Hierarchical clustering is a method used to group similar items into clusters, forming a tree-like
structure called a dendrogram. This technique is particularly useful when you want to
understand the relationships between different groups within your data.
Types of Hierarchical Clustering
There are two main approaches to hierarchical clustering:

Agglomerative Clustering (Bottom-Up Approach):
Start with each data point as its own cluster. Iteratively merge the closest clusters until all data points are grouped into a single cluster or until a stopping criterion is met.

Divisive Clustering (Top-Down Approach):
Begin with all data points in one large cluster. Recursively split clusters into smaller ones until each data point is in its own cluster or until a stopping criterion is met.

Agglomerative Clustering Example with Python

The following example performs agglomerative clustering in Python, using the scipy and matplotlib libraries to compute the linkage and visualize the result as a dendrogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate synthetic data
np.random.seed(42)
data = np.random.randn(20, 2)

# Perform hierarchical/agglomerative clustering
linked = linkage(data, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Dendrogram for Agglomerative Clustering')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distance')
plt.show()
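
The dendrogram shows the full merge hierarchy rather than a fixed set of groups. If flat cluster labels are needed, the tree can be cut at a chosen number of clusters; a short sketch using scipy's fcluster on the linked matrix computed above:

from scipy.cluster.hierarchy import fcluster

# Cut the hierarchy so that at most 3 flat clusters remain
flat_labels = fcluster(linked, t=3, criterion='maxclust')
print(flat_labels)  # one cluster label (1, 2 or 3) for each of the 20 points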
Date: 24/02/2025

Divisive clustering
Divisive clustering is a hierarchical clustering method that takes a top-down approach to
grouping data points. Unlike agglomerative clustering, which starts with individual data points
and merges them into clusters, divisive clustering begins with all data points in a single cluster
and recursively splits them into smaller clusters until each data point stands alone or a specified
number of clusters is achieved.

Steps in Divisive Clustering:

Start with All Data Points: Begin by considering the entire dataset as one large cluster.

Identify the Least Similar Data Points: Determine which data points within the cluster are least similar to each other.

Split the Cluster: Divide the cluster into two smaller clusters based on the identified dissimilarities.

Repeat Recursively: Apply the same process to each resulting cluster, continuing to split them until each cluster contains only one data point or until the desired number of clusters is reached.

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
data = np.random.randn(100, 2)

def bisecting_kmeans(data, k):
    clusters = [data]
    while len(clusters) < k:
        # Choose the largest cluster to split (pop it by index to avoid
        # elementwise equality comparisons between NumPy arrays)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        largest_cluster = clusters.pop(idx)

        # Perform K-Means with k=2 on the selected cluster
        kmeans = KMeans(n_clusters=2, random_state=42).fit(largest_cluster)
        labels = kmeans.labels_

        # Split the cluster into two clusters
        cluster1 = largest_cluster[labels == 0]
        cluster2 = largest_cluster[labels == 1]

        clusters.append(cluster1)
        clusters.append(cluster2)

    return clusters

# Perform bisecting K-Means to get 3 clusters
clusters = bisecting_kmeans(data, 3)

# Plot the clusters
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']
plt.figure(figsize=(8, 6))
for i, cluster in enumerate(clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1],
                c=colors[i % len(colors)], label=f'Cluster {i+1}')
plt.title('Divisive Clustering using Bisecting K-Means')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used in statistics
and machine learning to simplify large datasets while preserving as much variability as possible.

It transforms correlated variables into a smaller set of uncorrelated variables called principal
components.
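
Mechanically, the principal components are the directions of largest variance in the data. A minimal NumPy sketch (using a small made-up data matrix, separate from the student-scores example further below) of how they can be computed from the covariance matrix:

import numpy as np

# Toy data: 5 samples, 3 correlated features (illustrative values only)
X = np.array([[2.0, 4.1, 1.0],
              [3.0, 6.2, 0.9],
              [4.0, 7.9, 1.1],
              [5.0, 10.1, 1.0],
              [6.0, 12.0, 0.8]])

X_centered = X - X.mean(axis=0)          # center each feature
cov = np.cov(X_centered, rowvar=False)   # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors are the principal directions

# Sort by decreasing eigenvalue and keep the top 2 components
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
X_reduced = X_centered @ components      # project the data onto the components
print(X_reduced.shape)                   # (5, 2): fewer, uncorrelated variables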

Applications of PCA:
Finance: Identifying key factors influencing stock prices.

Image Processing: Reducing the number of pixels while preserving essential details.

Genetics: Analyzing gene expression data.

Machine Learning: Preprocessing high-dimensional data before applying classification or clustering models.
Example:
Suppose a teacher wants to group students based on their skills, but she has 10 different test
scores for each student.

PCA helps her combine similar scores into a few key categories like "Overall Academic Strength"
and "Sports Ability" instead of looking at all 10 scores separately.

This makes it easier to compare students and identify patterns.

Advantages of PCA: It reduces complexity (makes data easier to understand). It removes unnecessary details while keeping important information. It helps in visualization (makes it possible to draw graphs even if the original data had too many variables).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample student test scores (10 subjects for 6 students)
data = np.array([
    [85, 78, 92, 88, 76, 95, 89, 91, 85, 87],  # Student 1
    [70, 75, 80, 78, 85, 90, 79, 77, 88, 86],  # Student 2
    [60, 65, 55, 50, 70, 72, 68, 60, 58, 63],  # Student 3
    [90, 85, 88, 92, 93, 97, 95, 91, 89, 90],  # Student 4
    [50, 55, 48, 45, 52, 58, 54, 50, 53, 55],  # Student 5
    [80, 82, 85, 87, 88, 92, 89, 85, 83, 81]   # Student 6
])

# Convert to DataFrame
df = pd.DataFrame(data, columns=[f'Subject {i+1}' for i in range(10)])
print("Original Student Scores:")
print(df)

# Step 1: Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Step 2: Applying PCA (reduce from 10 subjects to 2 main components)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)

# Step 3: Creating a new DataFrame with the PCA results
pca_df = pd.DataFrame(principal_components,
                      columns=['Overall Academic Strength', 'Sports Ability'])
print("\nTransformed Data after PCA:")
print(pca_df)

# Step 4: Visualizing the transformed data
plt.figure(figsize=(8, 6))
plt.scatter(pca_df['Overall Academic Strength'], pca_df['Sports Ability'],
            color='b', s=100)
for i, txt in enumerate(["Student 1", "Student 2", "Student 3",
                         "Student 4", "Student 5", "Student 6"]):
    plt.annotate(txt, (pca_df['Overall Academic Strength'][i],
                       pca_df['Sports Ability'][i]), fontsize=12)
plt.axhline(0, color='gray', linestyle='--')
plt.axvline(0, color='gray', linestyle='--')
plt.xlabel('Overall Academic Strength')
plt.ylabel('Sports Ability')
plt.title('Student Grouping using PCA')
plt.show()

Original Student Scores:

   Subject 1  Subject 2  Subject 3  Subject 4  Subject 5  Subject 6  Subject 7  Subject 8  Subject 9  Subject 10
0         85         78         92         88         76         95         89         91         85          87
1         70         75         80         78         85         90         79         77         88          86
2         60         65         55         50         70         72         68         60         58          63
3         90         85         88         92         93         97         95         91         89          90
4         50         55         48         45         52         58         54         50         53          55
5         80         82         85         87         88         92         89         85         83          81

Transformed Data after PCA:

   Overall Academic Strength  Sports Ability
0                   2.191978       -0.920909
1                   0.978242        0.349057
2                  -3.026049        0.459299
3                   3.222156        0.228173
4                  -5.305649       -0.352800
5                   1.939321        0.237180
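
The two components are not equally informative. As a quick follow-up (reusing the fitted pca object from the code above), sklearn's explained_variance_ratio_ attribute reports the fraction of the original variability each component captures:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())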
