5/23/25, 7:49 AM about:blank
Cheat Sheet: Building Unsupervised Learning Models
Unsupervised learning models
Model Name Brief Description Code Syntax
UMAP
Description: UMAP (Uniform Manifold Approximation and Projection) is used for nonlinear dimensionality reduction.
Pros: Fast; preserves local structure and tends to preserve more global structure than t-SNE.
Cons: Sensitive to hyperparameter choices.
Applications: Data visualization, feature extraction.
Key hyperparameters:
  n_neighbors: Controls the local neighborhood size (default = 15).
  min_dist: Controls the minimum distance between points in the embedded space (default = 0.1).
  n_components: The dimensionality of the embedding (default = 2).
Code syntax:
    from umap import UMAP
    reducer = UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
t-SNE
Description: t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique.
Pros: Good for visualizing high-dimensional data.
Cons: Computationally expensive; sensitive to perplexity and can show apparent clusters that are not present in the data.
Applications: Data visualization, anomaly detection.
Key hyperparameters:
  n_components: The number of dimensions for the output (default = 2).
  perplexity: Balances attention between local and global aspects of the data (default = 30). Must be smaller than the number of samples.
  learning_rate: Controls the step size during optimization (default = 'auto' in scikit-learn 1.2 and later; 200 in earlier versions).
Code syntax:
    from sklearn.manifold import TSNE
    tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
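As a minimal sketch of how these hyperparameters are used in practice, the snippet below embeds a small synthetic dataset with t-SNE. The dataset, its size, and the perplexity of 10 are illustrative assumptions, not values from this cheat sheet. learning_rate is left at the library default so the snippet behaves sensibly across scikit-learn versions.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Assumed toy data: 60 points in 10 dimensions, grouped into 3 blobs.
X, y = make_blobs(n_samples=60, n_features=10, centers=3, random_state=42)

# perplexity must be smaller than the number of samples (60 here).
tsne = TSNE(n_components=2, perplexity=10, random_state=42)

# t-SNE learns the embedding and returns it in a single step.
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (60, 2)
```

TSNE provides fit_transform but no separate transform method, so new points cannot be projected into an existing embedding; the model has to be refit on the combined data.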
PCA
Description: PCA (principal component analysis) is used for linear dimensionality reduction.
Pros: Easy to interpret, reduces noise.
Cons: Linear, may lose information in nonlinear data.
Applications: Feature extraction, compression.
Key hyperparameters:
  n_components: Number of principal components to retain (default = None, which keeps all components).
  whiten: Whether to rescale the components to unit variance (default = False).
  svd_solver: The algorithm used to compute the components (default = 'auto').
Code syntax:
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
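The effect of the n_components default and of whiten can be checked with a short sketch. The synthetic data and the variable names below are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed toy data: 200 samples, 5 features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# With n_components left at its default (None), every component is kept.
pca_full = PCA().fit(X)
print(pca_full.n_components_)  # 5

# whiten=True rescales each retained component to unit variance.
pca_white = PCA(n_components=2, whiten=True)
Z = pca_white.fit_transform(X)
print(Z.shape)  # (200, 2)
print(Z.std(axis=0, ddof=1))  # both values are approximately 1.0
```

Whitening discards the relative scale of the components, which can help downstream models that assume comparable feature variances but removes information about how much variance each component carried.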
DBSCAN
Description: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
Pros: Identifies outliers, does not require the number of clusters to be specified.
Cons: Struggles when clusters have varying densities.
Applications: Anomaly detection, spatial data clustering.
Key hyperparameters:
  eps: The maximum distance between two points for them to be considered neighbors (default = 0.5).
  min_samples: Minimum number of samples in a neighborhood, including the point itself, for a point to count as a core point (default = 5).
Code syntax:
    from sklearn.cluster import DBSCAN
    dbscan = DBSCAN(eps=0.5, min_samples=5)
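To illustrate how eps and min_samples lead to noise labels, here is a minimal sketch on synthetic data. The two blobs and the single far-away point are assumptions chosen so the outcome is easy to read.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed toy data: two tight blobs plus one isolated point far from both.
X, _ = make_blobs(n_samples=100, centers=[[0, 0], [5, 5]],
                  cluster_std=0.2, random_state=42)
X = np.vstack([X, [[20.0, 20.0]]])  # the last row is the isolated point

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Noise points receive the label -1; clusters are numbered from 0.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
print(labels[-1])  # -1, the isolated point is treated as noise
```

The number of clusters is never passed in; it falls out of the density settings, which is why eps usually needs tuning to the scale of the data.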
HDBSCAN
Description: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) improves on DBSCAN by handling clusters of varying density.
Pros: Better handling of varying densities.
Cons: Can be slower than DBSCAN.
Applications: Large datasets, complex clustering problems.
Key hyperparameters:
  min_cluster_size: The minimum size of clusters (default = 5).
  min_samples: Number of samples in a neighborhood for a point to be considered a core point; larger values make the clustering more conservative (default = None, which falls back to min_cluster_size).
Code syntax:
    import hdbscan
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
K-Means clustering
Description: K-Means is a centroid-based clustering algorithm that groups data into k clusters.
Pros: Efficient, simple to implement.
Cons: Sensitive to initial cluster centroids.
Applications: Customer segmentation, pattern recognition.
Key hyperparameters:
  n_clusters: Number of clusters (default = 8).
  init: Method for initializing the centroids ('k-means++' or 'random', default = 'k-means++').
  n_init: Number of times the algorithm will run with different centroid seeds (default = 'auto' in scikit-learn 1.4 and later, which runs once for 'k-means++' and 10 times for 'random'; 10 in earlier versions).
Code syntax:
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=3)
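A minimal sketch of fitting K-Means follows. It sets n_init explicitly so the number of restarts does not depend on the installed scikit-learn version. The well-separated synthetic blobs and the use of the adjusted Rand index as a check are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Assumed toy data: three well-separated blobs with known labels.
X, y_true = make_blobs(n_samples=150,
                       centers=[[-10, -10], [0, 0], [10, 10]],
                       cluster_std=1.0, random_state=42)

# n_init is given explicitly so the run count is the same on every version.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # (3, 2)

# Agreement with the known labels, ignoring how the clusters are numbered.
ari = adjusted_rand_score(y_true, labels)
print(ari)
```

On real data the true labels are usually unavailable, so kmeans.inertia_ (the within-cluster sum of squared distances) is a common quantity to compare across candidate values of n_clusters.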
Associated functions used
Method Brief Description Code Syntax
make_blobs
Description: Generates isotropic Gaussian blobs for clustering.
Code syntax:
    from sklearn.datasets import make_blobs
    X, y = make_blobs(n_samples=100, centers=2, random_state=42)
multivariate_normal
Description: Generates samples from a multivariate normal distribution.
Code syntax:
    from numpy.random import multivariate_normal
    samples = multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=100)
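The role of the cov argument is easier to see with correlated features. The sketch below uses NumPy's Generator interface (numpy.random.default_rng) rather than the module-level function so the draw is reproducible; the mean, covariance, seed, and sample size are illustrative assumptions.

```python
import numpy as np

# A seeded generator makes the draw reproducible.
rng = np.random.default_rng(42)

mean = [0.0, 0.0]
cov = [[2.0, 0.8],
       [0.8, 1.0]]  # symmetric, positive definite; off-diagonal = covariance

samples = rng.multivariate_normal(mean=mean, cov=cov, size=5000)
print(samples.shape)  # (5000, 2)

# The sample covariance should be close to the requested cov matrix.
sample_cov = np.cov(samples, rowvar=False)
print(np.round(sample_cov, 2))
```

Each row of samples is one draw, and the nonzero off-diagonal entries of cov are what tilt the resulting point cloud away from the axis-aligned shape produced by the identity matrix.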
plotly.express.scatter_3d
Description: Creates a 3D scatter plot using Plotly Express.
Code syntax:
    import plotly.express as px
    fig = px.scatter_3d(df, x='x', y='y', z='z')
    fig.show()
geopandas.GeoDataFrame
Description: Creates a GeoDataFrame from a pandas DataFrame.
Code syntax:
    import geopandas as gpd
    gdf = gpd.GeoDataFrame(df, geometry='geometry')
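geopandas.GeoDataFrame.to_crs
Description: Transforms the coordinate reference system of a GeoDataFrame. The GeoDataFrame must already have a coordinate reference system set.
Code syntax:
    gdf = gdf.to_crs(epsg=3857)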
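contextily.add_basemap
Description: Adds a basemap to a GeoDataFrame plot for context. The data is assumed to be in Web Mercator (EPSG:3857) unless the crs argument is passed.
Code syntax:
    import contextily as ctx
    ax = gdf.plot(figsize=(10, 10))
    ctx.add_basemap(ax)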
pca.explained_variance_ratio_
Description: Attribute of a fitted PCA object that holds the proportion of variance explained by each principal component.
Code syntax:
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(X)
    variance_ratio = pca.explained_variance_ratio_
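One common use of this attribute is choosing how many components to keep. The sketch below accumulates the ratios and picks the smallest number of components that reaches a target share of the variance. The synthetic data (3 latent factors mixed into 10 features) and the 95% target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed toy data: 3 latent factors mixed into 10 features, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit with all components so the ratios cover the full variance.
pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_  # sorted from largest to smallest
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 95%.
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k)
print(np.round(cumulative[:k], 3))
```

The same selection can be requested directly by passing a float between 0 and 1 as n_components, but computing the cumulative curve by hand makes it easy to plot and inspect.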
Authors
Jeff Grossman
Abhishek Gagneja