ENH Introduces mst_algorithm keyword for HDBSCAN, alongside two new Boruvka MST algorithms #27572

Status: Open. 196 commits to be merged into base: main.

Commits
1c61429
Initial addition of hdbscan
Micky774 Feb 25, 2022
c5240b7
Added wraparound wrappers where needed
Micky774 Feb 26, 2022
74bd0b3
Updated documentation
Micky774 Feb 27, 2022
15793b2
Merge branch 'main' into hdbscan
Micky774 Mar 4, 2022
faa06b5
Added a new batch of doc updates for passing docstring tests
Micky774 Mar 5, 2022
266c958
Parameter and attribute revisions
Micky774 Mar 6, 2022
2a7cc22
Improved `metric_params` handling
Micky774 Mar 6, 2022
97f036f
Propagated `metric_params` change to tests and other functions
Micky774 Mar 6, 2022
8aa297a
Removed plotting, `to_pandas`, `to_networkx` infrastructure
Micky774 Mar 6, 2022
fe362b5
Removed plotting, `to_pandas`, `to_networkx` infrastructure
Micky774 Mar 6, 2022
dd44dbc
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 Mar 6, 2022
fda9350
Renamed `plots.py`-->`_trees.py`
Micky774 Mar 6, 2022
7478586
Fixed package namespace in `cluster/__init__.py`
Micky774 Mar 6, 2022
cd1edc4
Drop-in replaced private `dist_metrics` with `metrics.dist_metrics`
Micky774 Mar 6, 2022
0802504
Revert "Drop-in replaced private `dist_metrics` with `metrics.dist_me…
Micky774 Mar 6, 2022
543c35c
Improved hdbscan metric handling and testing
Micky774 Mar 7, 2022
ce94591
Docstring compliance for `flat.py`
Micky774 Mar 7, 2022
e93bfe1
Renamed `flat.py` --> `_flat.py`
Micky774 Mar 7, 2022
028e98f
Renamed `flat.py`-->`_flat.py`
Micky774 Mar 7, 2022
a1ac99a
Renamed `validity.py`-->`_validity.py`
Micky774 Mar 7, 2022
788d4bc
Renamed `robust_single_linkage_.py`
Micky774 Mar 7, 2022
5fba5e0
Merge branch 'main' into hdbscan
Micky774 Mar 7, 2022
cf4f239
Removed `_flat_.py` and associated tests
Micky774 Mar 9, 2022
1ceac43
Made memview readonly constant
Micky774 Mar 10, 2022
6f20a08
Removed experimental/extra API -- may reenable in future PRs
Micky774 Mar 10, 2022
6705fa7
Merge branch 'main' into hdbscan
Micky774 Mar 13, 2022
9e9be81
WIP docstring improvements for RSL
Micky774 Mar 13, 2022
0cd08f3
Trimmed and removed unnecessary RSL estimator
Micky774 Mar 13, 2022
7b73dd8
Updated sqrt2 default in robust_single_linkage
Micky774 Mar 13, 2022
62cf09e
Updated `alpha` arg for rsl functions
Micky774 Mar 13, 2022
f48e148
Added WIP section for HDBSCAN in User Guide
Micky774 Mar 14, 2022
87071a4
Replaced custom `dist_metrics` w/ `metric._dist_metrics`
Micky774 Mar 14, 2022
30b652a
Removed unnecessary arg
Micky774 Mar 14, 2022
42b26e1
Removed vestigial `robust_single_linkage` functionality
Micky774 Mar 16, 2022
e46a418
Removed cython flags
Micky774 Mar 16, 2022
7887e11
Merge branch 'main' into hdbscan
Micky774 Mar 18, 2022
c7699c1
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 Mar 18, 2022
5ae1d03
Initial addition of HDBSCAN User Guide [doc quick]
Micky774 Mar 19, 2022
d2fbd47
Merge branch 'main' into hdbscan
Micky774 Mar 19, 2022
30f38ea
Add reference for HDBSCAN User Guide entry
Micky774 Mar 19, 2022
38f7019
Added authorship/license info
Micky774 Mar 19, 2022
236c219
Fixed lists in `hdbscan` and improved user guide documentation
Micky774 Mar 21, 2022
59138a7
Merge branch 'main' into hdbscan
Micky774 Mar 21, 2022
7ba96ea
Added name mapping for hdbscan function autosummary
Micky774 Mar 21, 2022
ad92a9b
Merge branch 'main' into hdbscan
Micky774 Mar 25, 2022
d71883d
Merge branch 'main' into hdbscan
Micky774 Mar 25, 2022
b5dcdca
Added hdbscan to `plot_cluster_comparison`
Micky774 Mar 25, 2022
d7734d4
Fixed sphinx lists
Micky774 Mar 25, 2022
9f83f6e
Added initial hdbscan plot file
Micky774 Mar 25, 2022
f98b6bf
Modified clustering rst for image inclusion
Micky774 Mar 25, 2022
5365c3a
Corrected plotting for HDBSCAN
Micky774 Mar 25, 2022
b25d2ad
Fixed image display in user guide entry and fixed hdbscan doc
Micky774 Mar 26, 2022
103642d
Added entry to algorithm comparison table
Micky774 Mar 26, 2022
ba3302d
Added link to original hdbscan repository
Micky774 Mar 26, 2022
0f46e6c
Updated tests and improved caching code
Micky774 Mar 26, 2022
d87855b
Merge branch 'main' into hdbscan
Micky774 Mar 26, 2022
8556478
Merge branch 'main' into hdbscan
Micky774 Mar 27, 2022
e7165ae
Removed extra properties/attributes
Micky774 Mar 27, 2022
7834bf1
Cleaned up function signatures
Micky774 Mar 27, 2022
4a4e3eb
Trimmed docstring, renamed param, removed extra parameters/attrs
Micky774 Mar 27, 2022
df65fb9
Moved single-use functions in-line
Micky774 Mar 27, 2022
4bd72e5
Trim cython file by removing functionality for old `prediction`
Micky774 Mar 27, 2022
dbd6ca5
Merge branch 'main' into hdbscan
Micky774 Apr 1, 2022
a25224f
Apply suggestions from code review
Micky774 Apr 1, 2022
95d95a1
Removed unnecessary `_prediction_utils` files
Micky774 Apr 1, 2022
cd83805
Renamed most `kwargs`-->`metric_params` for consistency
Micky774 Apr 1, 2022
dee1c46
Added clarifying comment in `_validity.py`
Micky774 Apr 1, 2022
2a81824
Added random state objects, and used `tmp_path` fixture
Micky774 Apr 1, 2022
add3617
Improved `badargs` test
Micky774 Apr 1, 2022
0bf1491
Minor wording change
Micky774 Apr 1, 2022
1f31960
Made docstrings more uniform and set default metric to `euclidean`
Micky774 Apr 1, 2022
e7291a8
Improved plotting w/ perturbation examples
Micky774 Apr 1, 2022
32f4d6e
Merge branch 'main' into hdbscan
Micky774 Apr 1, 2022
5d0489d
Merge branch 'main' into hdbscan
Micky774 Apr 2, 2022
b7aca9e
Updated clustering plots for gallery page rendering
Micky774 Apr 2, 2022
42fd546
Merge branch 'main' into hdbscan
Micky774 Apr 17, 2022
3d719d9
Improved plotting example
Micky774 Apr 17, 2022
daf1b2f
Updated User-Guide entry for new plotting example
Micky774 Apr 17, 2022
4ddaddf
Typo fix
Micky774 Apr 17, 2022
9e56fc0
Merge branch 'main' into hdbscan
Micky774 Apr 22, 2022
407a7bf
Merge branch 'main' into hdbscan
Micky774 Apr 23, 2022
fa1d30f
Applied plotting demo review feedback
Micky774 May 8, 2022
f593973
Merge branch 'main' into hdbscan
Micky774 May 8, 2022
8f7f60b
Streamlined and improved plotting demo per review feedback
Micky774 May 8, 2022
6c5f936
Removed default arg for labels
Micky774 May 8, 2022
e0daeb7
Removed `match_reference_implementation` arg
Micky774 May 8, 2022
a095bb9
Improved doc for `algorithm` and changed option `"best"`-->`"auto"`
Micky774 May 8, 2022
ca7e87f
Updated DOI reference and user guide images
Micky774 May 9, 2022
ffb7601
Merge branch 'main' into hdbscan
Micky774 May 30, 2022
bb0f768
Merge branch 'main' into hdbscan
Micky774 May 30, 2022
3b38777
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 May 30, 2022
57ec680
Refactored parameter validation to use new API
Micky774 May 30, 2022
132c146
Adopted optics-like core_dist backend using `NearestNeighbors`
Micky774 Jun 15, 2022
cfaf597
Refactor of main hdbscan function
Micky774 Jun 15, 2022
400fcf1
Removed `approx_min_span_tree` -- defaulted to `True`
Micky774 Jun 15, 2022
44bb176
Removed unnecessary metric option
Micky774 Jun 15, 2022
6bd6146
Merge remote-tracking branch 'origin' into hdbscan
Micky774 Jun 22, 2022
ef4481e
Removed validity index, replaced w/ fowlkes-mallows score
Micky774 Jun 27, 2022
d7c449a
Minor cosmetic changes to tests
Micky774 Jun 27, 2022
3710209
Refactored boruvka cython
Micky774 Jun 27, 2022
bf571d9
Trimmed unnecessary mutual-reachability functions
Micky774 Jun 28, 2022
997b4cb
Comments and minor cosmetics
Micky774 Jun 28, 2022
6a7095c
Simplified tests wrt new validation mechanism
Micky774 Jun 28, 2022
54d71eb
Update doc/modules/clustering.rst
Micky774 Jun 29, 2022
9162f62
Improved user guide entry wording per review feedback
Micky774 Jun 29, 2022
f104ec9
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 Jun 29, 2022
4f5f5d6
Improved testing coverage
Micky774 Jun 29, 2022
9abc237
Added initial changelog entry
Micky774 Jun 30, 2022
2d7c4c9
Added pr details in changelog entry
Micky774 Jun 30, 2022
4e37527
Merge branch 'hdbscan' of https://github.com/scikit-learn/scikit-lear…
Micky774 Jun 30, 2022
5e0bc41
Trimmed extra function and modified comments
Micky774 Jul 4, 2022
0847be5
Apply suggestions from code review
Micky774 Jul 26, 2022
1c9a76a
Applied isort (with black on top)
Micky774 Jul 26, 2022
c01a609
Stylistic improvements
Micky774 Aug 26, 2022
b7736ef
Removed boruvka algorithm
Micky774 Aug 26, 2022
84484ea
Refactored file names and setup file
Micky774 Aug 26, 2022
9ebc643
Reintroduced boruvka algorithm
Micky774 Aug 26, 2022
24c5b98
Updated test file for boruvka removal
Micky774 Aug 26, 2022
585d7bb
Added dtype specification to input array validation
Micky774 Aug 28, 2022
507f0da
Apply suggestions from code review
Micky774 Aug 31, 2022
ed6d17d
Further review feedback
Micky774 Aug 31, 2022
232e2b1
Merge branch 'hdbscan' into hdbscan_boruvka
Micky774 Aug 31, 2022
38f71c7
Updated tests and improved `n_jobs` handling
Micky774 Aug 31, 2022
eefbacc
Refactored to remove `hdbscan` function -- use estimator instead
Micky774 Sep 6, 2022
3f89574
minor cleanup
Micky774 Sep 6, 2022
33f950b
Parameter simplification, and cluster_center refactor
Micky774 Sep 6, 2022
d29cc02
Minor typo corrections and reordering of user-guide entry
Micky774 Sep 6, 2022
3b86f1d
streamlined test
Micky774 Sep 6, 2022
cada149
Documentation update per review feedback
Micky774 Sep 6, 2022
45aab3c
Removed unnecessary function and made minor tweak to test
Micky774 Sep 6, 2022
67cab1a
Simplified plotting demo single-axis plots
Micky774 Sep 6, 2022
7edfd55
Refactored weighted centers
Micky774 Sep 14, 2022
d173707
Apply suggestions from code review
Micky774 Sep 14, 2022
1056cb0
Further review feedback implemented
Micky774 Sep 15, 2022
39b3e5a
Updated tests with review feedback
Micky774 Sep 15, 2022
5c42b0d
Apply suggestions from code review
Micky774 Sep 15, 2022
aa999f5
Renamed mst functions
Micky774 Sep 15, 2022
cf2c83d
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 Sep 15, 2022
7c6c89d
Merge branch 'hdbscan' of https://github.com/scikit-learn/scikit-lear…
Micky774 Sep 15, 2022
4860a7f
Refactored _reachability.pyx
Micky774 Sep 15, 2022
2eff9cc
Adjusted documentation
Micky774 Sep 15, 2022
7a9b365
Cython cleanup for _reachability.pyx
Micky774 Sep 15, 2022
da44c83
Improved docs
Micky774 Sep 15, 2022
26dad21
Update sklearn/cluster/_hdbscan/hdbscan.py
Micky774 Sep 15, 2022
15595be
Minor cleanup
Micky774 Sep 15, 2022
a9a3c22
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 Sep 15, 2022
0b0fa0e
Minor refactor for propagating missing data
Micky774 Sep 15, 2022
f96e8d6
Updated docs
Micky774 Sep 15, 2022
8f5c22b
Updated authorships
Micky774 Sep 15, 2022
23185f0
Updated `n_cluster` calc in `_weighted_cluster_center`
Micky774 Sep 15, 2022
8ed0869
Refactored brute algorithm and added `copy` parameter
Micky774 Sep 16, 2022
d44ac66
Merge branch 'hdbscan' into hdbscan_boruvka
Micky774 Sep 16, 2022
e6b9c2d
Updated tests a bit
Micky774 Sep 16, 2022
2b7e6d9
Merge branch 'hdbscan' into hdbscan_boruvka
Micky774 Sep 16, 2022
259ca79
Merge branch 'main' into hdbscan_boruvka
Micky774 Aug 9, 2023
886bab0
Removed outdated test file
Micky774 Aug 9, 2023
8feccf1
Removed old setup.py
Micky774 Aug 9, 2023
969b7c5
Removed submodule setup.py
Micky774 Aug 9, 2023
88ff7e2
Iter on styling
Micky774 Aug 9, 2023
5a82cb0
Merge branch 'main' into hdbscan_boruvka
Micky774 Sep 26, 2023
44b1463
Included boruvka in build
Micky774 Sep 26, 2023
177227b
Added int8_t type for boruvka
Micky774 Sep 26, 2023
2835e91
Iter on boruvka, imported cnp
Micky774 Sep 26, 2023
bb43054
Formatting and declaration grouping
Micky774 Sep 26, 2023
3fa0a9b
Ndarray->memview refactor
Micky774 Sep 26, 2023
d5eba10
Updated distancemetric typing
Micky774 Sep 26, 2023
05276bd
Added prototype test
Micky774 Sep 26, 2023
b3ac0d1
Corrected algo key-word
Micky774 Sep 26, 2023
27d593c
Added partial dispatch for boruvka
Micky774 Sep 26, 2023
6a72efa
Updated boruvka formatting
Micky774 Sep 26, 2023
be2b4e9
Refactored NodeData_t and formatted code
Micky774 Sep 29, 2023
11e91f9
Formatting and new Numpy API
Micky774 Sep 30, 2023
014c168
Corrected indexing error
Micky774 Sep 30, 2023
baa6a02
Added greater nogil support and started boruvka bug fix
Micky774 Oct 2, 2023
ffc6b77
Removed debug statements and improved test
Micky774 Oct 8, 2023
974673e
Improved tests and hdbscan dispatch logic
Micky774 Oct 10, 2023
61aef63
Cleaned up cython file
Micky774 Oct 10, 2023
2634228
Removed unnecessary public attributes
Micky774 Oct 10, 2023
53a7ec6
Updated formatting and removed parallel-query schema
Micky774 Oct 11, 2023
37e5d5c
Remove attribute used in debugging
Micky774 Oct 11, 2023
0788281
Improved tests
Micky774 Oct 11, 2023
57330e5
Merge branch 'main' into hdbscan_boruvka
Micky774 Oct 11, 2023
3dd7c5a
Updated changelog
Micky774 Oct 11, 2023
5a9aebe
Changed default to preserve backwards compatibility
Micky774 Oct 11, 2023
90d9403
Merge branch 'main' into hdbscan_boruvka
Micky774 Oct 12, 2023
3b40f8a
Improved tests, and adjusted auto option for backwards compatibility
Micky774 Oct 12, 2023
04e4007
Corrected changelog entry
Micky774 Oct 12, 2023
e06188f
Removed extraneous function
Micky774 Oct 12, 2023
6ef1668
Stabilized tests by using sorted lists
Micky774 Oct 12, 2023
76713ff
Updated to include deprecation for auto heuristic
Micky774 Oct 12, 2023
de5d041
Updated example in docstring
Micky774 Oct 12, 2023
8ba71e8
Updated centers test to use less adversarial data
Micky774 Oct 12, 2023
6a592f4
Corrected test by making hdb model more noise-tolerant
Micky774 Oct 13, 2023
68d4fd1
Avoid FutureWarning in tests
Micky774 Oct 13, 2023
39aa992
Fixed remaining FutureWarning
Micky774 Oct 14, 2023
fedeb90
Merge branch 'main' into hdbscan_boruvka
Micky774 Oct 16, 2023
13 changes: 13 additions & 0 deletions doc/whats_new/v1.4.rst
@@ -217,6 +217,19 @@ Changelog
`kdtree` and `balltree` values will be removed in 1.6.
:pr:`26744` by :user:`Shreesha Kumar Bhat <Shreesha3112>`.

- |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to
Suggested change
- |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to
- |Enhancement| :class:`cluster.HDBSCAN` has a new `mst_algorithm` argument, allowing for the user to

select between `{"auto", "brute", "prims", "boruvka_exact", "boruvka_approx"}`.
Note that setting `mst_algorithm="prims"` recovers the same functionality as
before this change, except when setting `algorithm="brute"` in which case
both `"auto", "brute"` options for `mst_algorithm` recover current behavior.
This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
MST building algorithms than the current `"prims"`.
Comment on lines +225 to +226
@jjerphan jjerphan Oct 28, 2023
Suggested change
This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
MST building algorithms than the current `"prims"`.
This instead introduces `"boruvka_exact"` and `"boruvka_approx"` which are both
faster MST building algorithms than the current `"prims"`.

:pr:`27572` by :user:`Meekail Zain <micky774>`.

This implementation is an adaptation from the original implementation of HDBSCAN in
`scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_,
by :user:`Leland McInnes <lmcinnes>` et al.

:mod:`sklearn.compose`
......................

1 change: 1 addition & 0 deletions setup.py
@@ -209,6 +209,7 @@ def check_package_status(package, min_version):
{"sources": ["_k_means_minibatch.pyx"], "include_np": True},
],
"cluster._hdbscan": [
{"sources": ["_boruvka.pyx"], "include_np": True},
{"sources": ["_linkage.pyx"], "include_np": True},
{"sources": ["_reachability.pyx"], "include_np": True},
{"sources": ["_tree.pyx"], "include_np": True},