Support sample weight in clusterers #3998
Comments
I'd like to help with this issue. Is this needed individually for all clusterers (as suggested by the DBSCAN proposal referenced in the OP), or is this more in the direction of a "global" way to handle weighted datasets for clusterers? If individual, I'd suggest creating a checklist similar to the one in #3450 to mark progress.
I think there can be some global tests, but unlikely global solutions. Yes, I'm happy to add a TODO list.
List added.
Is this a feature that is actively being built upon in scikit-learn? I have an application for Mean Shift specifically.
I don't think anyone is working on mean shift right now. |
As far as I know it's not being worked on.
Thanks for the update. Is there any role for an interested party to get the conversation started or potentially contribute? Currently the workaround I am looking at involves itertools.repeat or np.repeat to change the data going into the clusterer, but I am sure this is not an efficient solution (the idea was a hack mentioned in a separate thread).
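For reference, the repeat-based workaround mentioned above can be sketched as follows. This is a minimal illustration with made-up toy data, not code from the thread: each row is duplicated according to its integer weight, and the expanded array can then be passed to any clusterer that lacks sample_weight support.

```python
import numpy as np

# Toy data: three points with integer sample weights (illustrative values).
X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0]])
weights = np.array([3, 2, 1])

# Workaround: materialise the weighted dataset by repeating each row
# weights[i] times; the result can be fed to any clusterer's fit().
X_repeated = np.repeat(X, weights, axis=0)
print(X_repeated.shape)  # (6, 2): 3 + 2 + 1 rows
```

The memory and runtime cost grows with the total weight rather than the number of distinct points, and it only works for integer weights, which is why it is a stop-gap rather than a real solution.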
I think we'd be happy to accept a PR, but runtime without sample_weights should not be affected too much.
I think that is manageable. So I understand, you might accept an intermediate solution that was not 100% efficient, if it did not impact runtime without sample_weights?
Looking at the mean shift code, it doesn't look impossible to implement.
I believe the same issue (support for sample weights) also applies to sklearn.mixture (Gaussian mixtures), essentially for the same reason (more efficiently taking repeated samples into account).
Currently no clusterers (or clustering metrics) support weighted datasets (although support for DBSCAN is proposed in #3994).
Weighting can be a compact way of representing repeated samples, and may affect cluster means and
variance, average link between clusters, etc.
Ideally BIRCH's global clustering stage should be provided a weighted dataset; its current use of unweighted representatives may make its parametrisation more brittle.
This could be subject to an invariance test along the lines of:
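A sketch of such an invariance test, assuming a clusterer whose fit accepts sample_weight (DBSCAN gained this via #3994); the data and the eps/min_samples values are illustrative, and the helper name is invented for this sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def check_sample_weight_invariance(clusterer, X, weights):
    """Clustering with integer sample weights should match clustering
    the same data with each row repeated weights[i] times."""
    labels_weighted = clusterer.fit(X, sample_weight=weights).labels_

    X_rep = np.repeat(X, weights, axis=0)
    labels_repeated = clusterer.fit(X_rep).labels_

    # Compare labels on the first occurrence of each repeated row.
    first = np.concatenate(([0], np.cumsum(weights)[:-1]))
    assert np.array_equal(labels_weighted, labels_repeated[first])

rng = np.random.RandomState(0)
X = rng.rand(20, 2)
weights = rng.randint(1, 4, size=20)
check_sample_weight_invariance(DBSCAN(eps=0.3, min_samples=3), X, weights)
```

This only covers integer weights; a full test suite would also want to check that fractional weights behave sensibly (e.g. that uniform weights of 1.0 reproduce the unweighted result).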
(There is also a minor question of whether sample_weight should be universally accepted by ClusterMixin, or whether a WeightedClusterMixin should be created, etc.)

Sample weight support for clusterers: