Unit 1 - KMeans
Introduction to Unsupervised Learning
Unsupervised learning is a machine learning (ML) approach that finds patterns in
unlabeled data.
It contrasts with supervised learning, which uses labeled data. It is useful for pattern
detection, clustering, association rule mining, and dimensionality reduction.
Why Unsupervised Learning is Important
- Saves time by automating grouping of data
- Used in network traffic analysis (NTA)
- Helps in threat detection and dimensionality reduction
- Simplifies datasets by removing irrelevant features
Clustering Analysis
Clustering groups similar data points into clusters.
It enables better data profiling, customer segmentation, and dimensionality
reduction.
Common types include K-Means, hierarchical clustering, DBSCAN, and Gaussian mixture models (GMM).
K-Means Clustering
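K-Means groups data points by similarity: each point is assigned to the cluster whose centroid (mean) is nearest.
K = number of clusters.
Example: K=5 creates 5 clusters.
Key concepts: Squared Euclidean Distance (the measure of how far a point is from a centroid) & Cluster
Inertia (the sum of squared distances from each point to its assigned centroid; lower means tighter clusters).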
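The two key concepts can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the toy data are assumptions made for the example, not part of the course material.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_euclidean(p, centroids[k]) for p, k in zip(points, labels))


# Toy data (made up for illustration): two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]                      # cluster index of each point
centroids = [(0.0, 1.0), (10.0, 1.0)]      # mean of each group

total = inertia(points, labels, centroids)
print(total)  # every point is 1 unit from its centroid, so 4 * 1**2 = 4.0
```

Lower inertia means points sit closer to their centroids, which is the quantity K-Means tries to reduce.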
K-Means Algorithm Steps
1. Choose the number of clusters K (e.g., using the elbow method)
2. Randomly select K initial centroids
3. Assign each data point to the nearest centroid
4. Recalculate each centroid as the mean of its assigned points
5. Repeat steps 3 and 4 until convergence
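The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes K is already chosen (step 1), picks the initial centroids by sampling K data points (one common choice among several), and treats "convergence" as "no point changed cluster". The helper names and toy data are hypothetical.

```python
import random


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # step 2: random initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:             # step 5: assignments stable, stop
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                      # assumption: keep old centroid if empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data (made up): two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary (which group is called 0 depends on the initial sample); only the grouping is meaningful.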
K-Means: Pros and Cons
Advantages:
- Efficient computation
- Easy to implement
Disadvantages:
- Poor for non-spherical clusters
- Sensitive to initial centroids
Hierarchical Clustering
Builds a hierarchy of clusters by progressively merging the closest ones (agglomerative approach)
No need to specify K in advance
Uses dendrograms to visualize the merge hierarchy
Advantages:
- No preset K
- Good for hierarchy
Disadvantages:
- Sensitive to outliers
- Computationally expensive
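A bottom-up merge loop can be sketched as follows. This is a simplified illustration that assumes single linkage (distance between clusters = distance between their closest members); other linkage rules exist and the notes do not say which one is intended. The function names are hypothetical. The recorded merge history is the information a dendrogram draws.

```python
def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def single_linkage(a, b):
    """Squared distance between the closest pair of members (assumed linkage)."""
    return min(squared_euclidean(p, q) for p in a for q in b)


def agglomerative(points, n_clusters=1):
    clusters = [[p] for p in points]         # start: every point is its own cluster
    merges = []                              # (cluster_a, cluster_b, squared distance)
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters by checking every pair.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Toy 1-D data (made up): two close pairs and one distant point.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
for a, b, d in merges:
    print(a, "+", b, "at squared distance", d)
```

Running to `n_clusters=1` records the full hierarchy; stopping earlier corresponds to cutting the dendrogram at a chosen level. The nested search over all cluster pairs on every merge hints at why the method is computationally expensive.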
Stopping Criteria for K-Means
- No change in centroids
- Points stay in the same cluster
- Max iterations reached
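The three criteria can be expressed as one check that runs at the end of each iteration. This is a hedged sketch: the function name, its arguments, and the `tol` tolerance (useful because floating-point centroids rarely match exactly) are assumptions for illustration.

```python
def should_stop(old_centroids, new_centroids, old_labels, new_labels,
                iteration, max_iter=100, tol=0.0):
    """Return the reason to stop, or None to keep iterating."""
    # Criterion 1: no centroid moved by more than tol (squared distance).
    centroid_shift = max(
        sum((a - b) ** 2 for a, b in zip(old, new))
        for old, new in zip(old_centroids, new_centroids)
    )
    if centroid_shift <= tol:
        return "centroids unchanged"
    # Criterion 2: every point stayed in the same cluster.
    if old_labels == new_labels:
        return "assignments unchanged"
    # Criterion 3: iteration budget exhausted.
    if iteration >= max_iter:
        return "max iterations reached"
    return None


# Example call with made-up values: the centroid moved and one point switched.
reason = should_stop(
    old_centroids=[(0.0, 0.0), (5.0, 5.0)],
    new_centroids=[(0.5, 0.0), (5.0, 5.0)],
    old_labels=[0, 0, 1],
    new_labels=[0, 1, 1],
    iteration=3,
)
print(reason)  # None: nothing has stabilised yet, so keep iterating
```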
Bisecting K-Means
Improves on K-Means:
- Can handle clusters of varying shape and size better than standard K-Means
- More efficient for large K
Uses a hybrid of partitional and hierarchical (divisive) clustering
Bisecting K-Means Algorithm
1. Start with all points as one cluster
2. Bisect the cluster with the largest SSE (sum of squared errors) into two, using K-Means with K=2
3. Repeat step 2 until K clusters are formed
The cluster to split next can be chosen by SSE or by size
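The procedure can be sketched in plain Python as below. This is a minimal illustration that selects the cluster to split by SSE and assumes the input points are distinct and K does not exceed their number; the helper names and toy data are hypothetical.

```python
import random


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(cluster):
    """Sum of squared errors of a cluster around its own mean."""
    c = mean_point(cluster)
    return sum(squared_euclidean(p, c) for p in cluster)


def two_means(cluster, rng, max_iter=50):
    """Split one cluster into two halves with K-Means, K=2."""
    centroids = rng.sample(cluster, 2)       # assumes at least two distinct points
    halves = None
    for _ in range(max_iter):
        new_halves = ([], [])
        for p in cluster:
            d0 = squared_euclidean(p, centroids[0])
            d1 = squared_euclidean(p, centroids[1])
            new_halves[0 if d0 <= d1 else 1].append(p)
        if new_halves == halves:             # assignments stable, stop
            break
        halves = new_halves
        centroids = [mean_point(h) for h in halves]
    return halves


def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]                # step 1: one cluster with all points
    while len(clusters) < k:                 # step 3: repeat until K clusters
        # Step 2: pick the cluster with the largest SSE and bisect it.
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        left, right = two_means(clusters.pop(worst), rng)
        clusters += [left, right]
    return clusters


# Toy 1-D data (made up): three well-separated pairs.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,), (51.0,)]
clusters = bisecting_kmeans(points, k=3)
print(clusters)
```

Swapping the `key` in the `max` call for `len(clusters[i])` would select by size instead of SSE.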
Map Clustering with R
Use R packages such as 'factoextra' to visualize K-Means results on city location data.
Plot clusters with and without predefined centers.
Code uses:
kmeans() (from base R 'stats'), fviz_cluster() (from 'factoextra'), and coord_flip() (from 'ggplot2')