ENH Introduces `mst_algorithm` keyword for HDBSCAN, alongside two new Boruvka MST algorithms #27572
Open
Micky774 wants to merge 196 commits into scikit-learn:main from Micky774:hdbscan_boruvka

196 commits
1c61429
Initial addition of hdbscan
Micky774 c5240b7
Added wraparound wrappers where needed
Micky774 74bd0b3
Updated documentation
Micky774 15793b2
Merge branch 'main' into hdbscan
Micky774 faa06b5
Added a new batch of doc updates for passing docstring tests
Micky774 266c958
Parameter and attribute revisions
Micky774 2a7cc22
Improved `metric_params` handling
Micky774 97f036f
Propagated `metric_params` change to tests and other functions
Micky774 8aa297a
Removed plotting, `to_pandas`, `to_networkx` infrastructure
Micky774 fe362b5
Removed plotting, `to_pandas`, `to_networkx` infrastructure
Micky774 dd44dbc
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 fda9350
Renamed `plots.py`-->`_trees.py`
Micky774 7478586
Fixed package namespace in `cluster/__init__.py`
Micky774 cd1edc4
Drop-in replaced private `dist_metrics` with `metrics.dist_metrics`
Micky774 0802504
Revert "Drop-in replaced private `dist_metrics` with `metrics.dist_me…
Micky774 543c35c
Improved hdbscan metric handling and testing
Micky774 ce94591
Docstring compliance for `flat.py`
Micky774 e93bfe1
Renamed `flat.py` --> `_flat.py`
Micky774 028e98f
Renamed `flat.py`-->`_flat.py`
Micky774 a1ac99a
Renamed `validity.py`-->`_validity.py`
Micky774 788d4bc
Renamed `robust_single_linkage_.py`
Micky774 5fba5e0
Merge branch 'main' into hdbscan
Micky774 cf4f239
Removed `_flat_.py` and associated tests
Micky774 1ceac43
Made memview readonly constant
Micky774 6f20a08
Removed experimental/extra API -- may reenable in future PRs
Micky774 6705fa7
Merge branch 'main' into hdbscan
Micky774 9e9be81
WIP docstring improvements for RSL
Micky774 0cd08f3
Trimmed and removed unnecessary RSL estimator
Micky774 7b73dd8
Updated sqrt2 default in robust_single_linkage
Micky774 62cf09e
Updated `alpha` arg for rsl functions
Micky774 f48e148
Added WIP section for HDBSCAN in User Guide
Micky774 87071a4
Replaced custom `dist_metrics` w/ `metric._dist_metrics`
Micky774 30b652a
Removed unnecessary arg
Micky774 42b26e1
Removed vestigial `robust_single_linkage` functionality
Micky774 e46a418
Removed cython flags
Micky774 7887e11
Merge branch 'main' into hdbscan
Micky774 c7699c1
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 5ae1d03
Initial addition of HDBSCAN User Guide [doc quick]
Micky774 d2fbd47
Merge branch 'main' into hdbscan
Micky774 30f38ea
Add reference for HDBSCAN User Guide entry
Micky774 38f7019
Added authorship/license info
Micky774 236c219
Fixed lists in `hdbscan` and improved user guide documentation
Micky774 59138a7
Merge branch 'main' into hdbscan
Micky774 7ba96ea
Added name mapping for hdbscan function autosummary
Micky774 ad92a9b
Merge branch 'main' into hdbscan
Micky774 d71883d
Merge branch 'main' into hdbscan
Micky774 b5dcdca
Added hdbscan to `plot_cluster_comparison`
Micky774 d7734d4
Fixed sphinx lists
Micky774 9f83f6e
Added initial hdbscan plot file
Micky774 f98b6bf
Modified clustering rst for image inclusion
Micky774 5365c3a
Corrected plotting for HDBSCAN
Micky774 b25d2ad
Fixed image display in user guide entry and fixed hdbscan doc
Micky774 103642d
Added entry to algorithm comparison table
Micky774 ba3302d
Added link to original hdbscan repository
Micky774 0f46e6c
Updated tests and improved caching code
Micky774 d87855b
Merge branch 'main' into hdbscan
Micky774 8556478
Merge branch 'main' into hdbscan
Micky774 e7165ae
Removed extra properties/attributes
Micky774 7834bf1
Cleaned up function signatures
Micky774 4a4e3eb
Trimmed docstring, renamed param, removed extra parameters/attrs
Micky774 df65fb9
Moved single-use functions in-line
Micky774 4bd72e5
Trim cython file by removing functionality for old `prediction`
Micky774 dbd6ca5
Merge branch 'main' into hdbscan
Micky774 a25224f
Apply suggestions from code review
Micky774 95d95a1
Removed unnecessary `_prediction_utils` files
Micky774 cd83805
Renamed most `kwargs`-->`metric_params` for consistency
Micky774 dee1c46
Added clarifying comment in `_validity.py`
Micky774 2a81824
Added random state objects, and used `tmp_path` fixture
Micky774 add3617
Improved `badargs` test
Micky774 0bf1491
Minor wording change
Micky774 1f31960
Made docstrings more uniform and set default metric to `euclidean`
Micky774 e7291a8
Improved plotting w/ perturbation examples
Micky774 32f4d6e
Merge branch 'main' into hdbscan
Micky774 5d0489d
Merge branch 'main' into hdbscan
Micky774 b7aca9e
Updated clustering plots for gallery page rendering
Micky774 42fd546
Merge branch 'main' into hdbscan
Micky774 3d719d9
Improved plotting example
Micky774 daf1b2f
Updated User-Guide entry for new plotting example
Micky774 4ddaddf
Typo fix
Micky774 9e56fc0
Merge branch 'main' into hdbscan
Micky774 407a7bf
Merge branch 'main' into hdbscan
Micky774 fa1d30f
Applied plotting demo review feedback
Micky774 f593973
Merge branch 'main' into hdbscan
Micky774 8f7f60b
Streamlined and improved plotting demo per review feedback
Micky774 6c5f936
Removed default arg for labels
Micky774 e0daeb7
Removed `match_reference_implementation` arg
Micky774 a095bb9
Improved doc for `algorithm` and changed option `"best"`-->`"auto"`
Micky774 ca7e87f
Updated DOI reference and user guide images
Micky774 ffb7601
Merge branch 'main' into hdbscan
Micky774 bb0f768
Merge branch 'main' into hdbscan
Micky774 3b38777
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 57ec680
Refactored parameter validation to use new API
Micky774 132c146
Adopted optics-like core_dist backend using `NearestNeighbors`
Micky774 cfaf597
Refactor of main hdbscan function
Micky774 400fcf1
Removed `approx_min_span_tree` -- defaulted to `True`
Micky774 44bb176
Removed unnecessary metric option
Micky774 6bd6146
Merge remote-tracking branch 'origin' into hdbscan
Micky774 ef4481e
Removed validity index, replaced w/ fowlkes-mallows score
Micky774 d7c449a
Minor cosmetic changes to tests
Micky774 3710209
Refactored boruvka cython
Micky774 bf571d9
Trimmed unnecessary mutual-reachability functions
Micky774 997b4cb
Comments and minor cosmetics
Micky774 6a7095c
Simplified tests wrt new validation mechanism
Micky774 54d71eb
Update doc/modules/clustering.rst
Micky774 9162f62
Improved user guide entry wording per review feedback
Micky774 f104ec9
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 4f5f5d6
Improved testing coverage
Micky774 9abc237
Added initial changelog entry
Micky774 2d7c4c9
Added pr details in changelog entry
Micky774 4e37527
Merge branch 'hdbscan' of https://github.com/scikit-learn/scikit-lear…
Micky774 5e0bc41
Trimmed extra function and modified comments
Micky774 0847be5
Apply suggestions from code review
Micky774 1c9a76a
Applied isort (with black on top)
Micky774 c01a609
Stylistic improvements
Micky774 b7736ef
Removed boruvka algorithm
Micky774 84484ea
Refactored file names and setup file
Micky774 9ebc643
Reintroduced boruvka algorithm
Micky774 24c5b98
Updated test file for boruvka removal
Micky774 585d7bb
Added dtype specification to input array validation
Micky774 507f0da
Apply suggestions from code review
Micky774 ed6d17d
Further review feedback
Micky774 232e2b1
Merge branch 'hdbscan' into hdbscan_boruvka
Micky774 38f71c7
Updated tests and improved `n_jobs` handling
Micky774 eefbacc
Refactored to remove `hdbscan` function -- use estimator instead
Micky774 3f89574
minor cleanup
Micky774 33f950b
Parameter simplification, and cluster_center refactor
Micky774 d29cc02
Minor typo corrections and reordering of user-guide entry
Micky774 3b86f1d
streamlined test
Micky774 cada149
Documentation update per review feedback
Micky774 45aab3c
Removed unnecessary function and made minor tweak to test
Micky774 67cab1a
Simplified plotting demo single-axis plots
Micky774 7edfd55
Refactored weighted centers
Micky774 d173707
Apply suggestions from code review
Micky774 1056cb0
Further review feedback implemented
Micky774 39b3e5a
Updated tests with review feedback
Micky774 5c42b0d
Apply suggestions from code review
Micky774 aa999f5
Renamed mst functions
Micky774 cf2c83d
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 7c6c89d
Merge branch 'hdbscan' of https://github.com/scikit-learn/scikit-lear…
Micky774 4860a7f
Refactored _reachability.pyx
Micky774 2eff9cc
Adjusted documentation
Micky774 7a9b365
Cython cleanup for _reachability.pyx
Micky774 da44c83
Improved docs
Micky774 26dad21
Update sklearn/cluster/_hdbscan/hdbscan.py
Micky774 15595be
Minor cleanup
Micky774 a9a3c22
Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…
Micky774 0b0fa0e
Minor refactor for propagating missing data
Micky774 f96e8d6
Updated docs
Micky774 8f5c22b
Updated authorships
Micky774 23185f0
Updated `n_cluster` calc in `_weighted_cluster_center`
Micky774 8ed0869
Refactored brute algorithm and added `copy` parameter
Micky774 d44ac66
Merge branch 'hdbscan' into hdbscan_boruvka
Micky774 e6b9c2d
Updated tests a bit
Micky774 2b7e6d9
Merge branch 'hdbscan' into hdbscan_boruvka
Micky774 259ca79
Merge branch 'main' into hdbscan_boruvka
Micky774 886bab0
Removed outdated test file
Micky774 8feccf1
Removed old setup.py
Micky774 969b7c5
Removed submodule setup.py
Micky774 88ff7e2
Iter on styling
Micky774 5a82cb0
Merge branch 'main' into hdbscan_boruvka
Micky774 44b1463
Included boruvka in build
Micky774 177227b
Added int8_t type for boruvka
Micky774 2835e91
Iter on boruvka, imported cnp
Micky774 bb43054
Formatting and declaration grouping
Micky774 3fa0a9b
Ndarray->memview refactor
Micky774 d5eba10
Updated distancemetric typing
Micky774 05276bd
Added prototype test
Micky774 b3ac0d1
Corrected algo key-word
Micky774 27d593c
Added partial dispatch for boruvka
Micky774 6a72efa
Updated boruvka formatting
Micky774 be2b4e9
Refactored NodeData_t and formatted code
Micky774 11e91f9
Formatting and new Numpy API
Micky774 014c168
Corrected indexing error
Micky774 baa6a02
Added greater nogil support and started boruvka bug fix
Micky774 ffc6b77
Removed debug statements and improved test
Micky774 974673e
Improved tests and hdbscan dispatch logic
Micky774 61aef63
Cleaned up cython file
Micky774 2634228
Removed unnecessary public attributes
Micky774 53a7ec6
Updated formatting and removed parallel-query schema
Micky774 37e5d5c
Remove attribute used in debugging
Micky774 0788281
Improved tests
Micky774 57330e5
Merge branch 'main' into hdbscan_boruvka
Micky774 3dd7c5a
Updated changelog
Micky774 5a9aebe
Changed default to preserve backwards compatibility
Micky774 90d9403
Merge branch 'main' into hdbscan_boruvka
Micky774 3b40f8a
Improved tests, and adjusted auto option for backwards compatibility
Micky774 04e4007
Corrected changelog entry
Micky774 e06188f
Removed extraneous function
Micky774 6ef1668
Stabilized tests by using sorted lists
Micky774 76713ff
Updated to include deprecation for auto heuristic
Micky774 de5d041
Updated example in docstring
Micky774 8ba71e8
Updated centers test to use less adversarial data
Micky774 6a592f4
Corrected test by making hdb model more noise-tolerant
Micky774 68d4fd1
Avoid FutureWarning in tests
Micky774 39aa992
Fixed remaining FutureWarning
Micky774 fedeb90
Merge branch 'main' into hdbscan_boruvka
Micky774
@@ -217,6 +217,19 @@ Changelog
  `kdtree` and `balltree` values will be removed in 1.6.
  :pr:`26744` by :user:`Shreesha Kumar Bhat <Shreesha3112>`.

- |Enhancement| : The `mst_algorithm` argument is introduced, allowing the user to
  select between `{"auto", "brute", "prims", "boruvka_exact", "boruvka_approx"}`.
  Setting `mst_algorithm="prims"` recovers the behavior from before this change,
  except when `algorithm="brute"`, in which case both the `"auto"` and `"brute"`
  options for `mst_algorithm` recover the previous behavior. The two new options,
  `"boruvka_exact"` and `"boruvka_approx"`, are both faster MST-building
  algorithms than the existing `"prims"`.
  :pr:`27572` by :user:`Meekail Zain <micky774>`.

This implementation is an adaptation from the original implementation of HDBSCAN in
`scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_,
by :user:`Leland McInnes <lmcinnes>` et al.

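For readers unfamiliar with the algorithm behind the two new `"boruvka_*"` options, here is a minimal pure-Python sketch of textbook Boruvka MST construction. It is not the Cython implementation from this PR, which is not shown in this excerpt. The function name `boruvka_mst` and the explicit edge-list input are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight); hypothetical representation


def boruvka_mst(n_nodes: int, edges: List[Edge]) -> List[Edge]:
    """Return the MST edges of a connected, undirected, weighted graph."""
    parent = list(range(n_nodes))  # union-find forest over the nodes

    def find(x: int) -> int:
        # Find the component root, halving paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst: List[Edge] = []
    n_components = n_nodes
    while n_components > 1:
        # For every component, record the lightest edge leaving it.
        cheapest: Dict[int, int] = {}
        for idx, (u, v, w) in enumerate(edges):
            cu, cv = find(u), find(v)
            if cu == cv:
                continue  # edge is internal to a component
            for c in (cu, cv):
                best = cheapest.get(c)
                # Break weight ties by edge index so the choice is consistent.
                if best is None or (w, idx) < (edges[best][2], best):
                    cheapest[c] = idx
        if not cheapest:
            raise ValueError("graph is not connected")
        # Merge components along the selected edges.
        for idx in set(cheapest.values()):
            u, v, w = edges[idx]
            cu, cv = find(u), find(v)
            if cu != cv:
                parent[cu] = cv
                mst.append((u, v, w))
                n_components -= 1
    return mst
```

Each round adds at least one edge per remaining component, so the number of components at least halves and the loop runs O(log n) rounds. Prim's algorithm, by contrast, grows a single tree one edge at a time.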
:mod:`sklearn.compose`
......................