Thanks to visit codestin.com
Credit goes to github.com

Skip to content

HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

Open
@lkwinta

Description

@lkwinta

Describe the bug

When switching from Sklearn HDBSCAN implementation to original one from hdbscan library, I've notice that Sklearn's implementation has much worse implementation. I've tried investigating different parameters but it doesn't seem to have an effect on the performance.

I've created synthetic benchmark using make_blobs function. And those are my results:

CPU: Ryzen 5 1600, 12 [email protected]*
RAM: 32GB DDR4

# dataset
X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)

# hdbscan params 
og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)

Image

  • Tested out on Google Collab with similar results

Steps/Code to Reproduce

I am starting both algorithms with n_jobs=-1 to rule out the difference that may occure because of default setting of core_dist_n_jobs=4 in hdbscan

from hdbscan import HDBSCAN as OGHDBSCAN
from sklearn.cluster import HDBSCAN as SKHDBSCAN

import matplotlib.pyplot as plt
import numpy as np
import time

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)

og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)

RUNS = 10

def time_hdbscan(hdbscan, X, runs):
    times = []
    for _ in range(runs):
        start = time.time()
        hdbscan.fit(X)
        end = time.time()
        times.append(end - start)
    return times

times_og = time_hdbscan(og_hdbscan, X, RUNS)
times_sk = time_hdbscan(sk_hdbscan, X, RUNS)

print("Mean time OGHDBSCAN: ", np.mean(times_og))
print("Mean time SKHDBSCAN: ", np.mean(times_sk))

plt.plot(range(RUNS), times_og, label='OGHDBSCAN', marker='o')
plt.plot(range(RUNS), times_sk, label='SKHDBSCAN', marker='x')
plt.xlabel('Run')
plt.ylabel('Time (seconds)')
plt.title('HDBSCAN Timing Comparison')
plt.legend()
plt.show()

Expected Results

Similar performance between algorithms from Sklearn and hdbscan library

Actual Results

Sklearn implementation of HDBSCAN gets much worse performance than original library. For example when testing much bigger dataset, i.e.

X, y = make_blobs(n_samples=100000, centers=5, cluster_std=0.60, random_state=0, n_features=20)

hdbscan library performs fit in 25s on my hardware, while Sklearn needs 5 minutes to perform clustering.

Versions

System:
    python: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
executable: d:\Documents\Projects\Machine-Learning-Basics\.venv\Scripts\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.6.1
          pip: None
   setuptools: 80.9.0
        numpy: 1.26.4
        scipy: 1.15.3
       Cython: None
       pandas: 2.3.0
   matplotlib: 3.10.3
       joblib: 1.5.1
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Zen

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: vcomp
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\scipy.libs\libscipy_openblas-f07f5a5d207a3a47104dca54d6d0c86a.dll
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions