Description
Describe the bug
When switching from Sklearn HDBSCAN implementation to original one from hdbscan
library, I've notice that Sklearn's implementation has much worse implementation. I've tried investigating different parameters but it doesn't seem to have an effect on the performance.
I've created synthetic benchmark using make_blobs
function. And those are my results:
CPU: Ryzen 5 1600, 12 [email protected]*
RAM: 32GB DDR4
# dataset
X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)
# hdbscan params
og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)
- Tested out on Google Collab with similar results
Steps/Code to Reproduce
I am starting both algorithms with n_jobs=-1
to rule out the difference that may occure because of default setting of core_dist_n_jobs=4
in hdbscan
from hdbscan import HDBSCAN as OGHDBSCAN
from sklearn.cluster import HDBSCAN as SKHDBSCAN
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)
og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)
RUNS = 10
def time_hdbscan(hdbscan, X, runs):
times = []
for _ in range(runs):
start = time.time()
hdbscan.fit(X)
end = time.time()
times.append(end - start)
return times
times_og = time_hdbscan(og_hdbscan, X, RUNS)
times_sk = time_hdbscan(sk_hdbscan, X, RUNS)
print("Mean time OGHDBSCAN: ", np.mean(times_og))
print("Mean time SKHDBSCAN: ", np.mean(times_sk))
plt.plot(range(RUNS), times_og, label='OGHDBSCAN', marker='o')
plt.plot(range(RUNS), times_sk, label='SKHDBSCAN', marker='x')
plt.xlabel('Run')
plt.ylabel('Time (seconds)')
plt.title('HDBSCAN Timing Comparison')
plt.legend()
plt.show()
Expected Results
Similar performance between algorithms from Sklearn and hdbscan
library
Actual Results
Sklearn implementation of HDBSCAN
gets much worse performance than original library. For example when testing much bigger dataset, i.e.
X, y = make_blobs(n_samples=100000, centers=5, cluster_std=0.60, random_state=0, n_features=20)
hdbscan
library performs fit
in 25s on my hardware, while Sklearn needs 5 minutes to perform clustering.
Versions
System:
python: 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
executable: d:\Documents\Projects\Machine-Learning-Basics\.venv\Scripts\python.exe
machine: Windows-10-10.0.19045-SP0
Python dependencies:
sklearn: 1.6.1
pip: None
setuptools: 80.9.0
numpy: 1.26.4
scipy: 1.15.3
Cython: None
pandas: 2.3.0
matplotlib: 3.10.3
joblib: 1.5.1
threadpoolctl: 3.6.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libopenblas
filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll
version: 0.3.23.dev
threading_layer: pthreads
architecture: Zen
user_api: openmp
internal_api: openmp
num_threads: 12
prefix: vcomp
filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\sklearn\.libs\vcomp140.dll
version: None
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libscipy_openblas
filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\scipy.libs\libscipy_openblas-f07f5a5d207a3a47104dca54d6d0c86a.dll
version: 0.3.28
threading_layer: pthreads
architecture: Haswell