Support sample weight in clusterers #3998


Open
4 of 10 tasks
jnothman opened this issue Dec 24, 2014 · 11 comments
Labels: Enhancement, help wanted, Moderate (anything that requires some knowledge of conventions and best practices), module:cluster

Comments

@jnothman (Member) commented Dec 24, 2014

Currently no clusterers (or clustering metrics) support weighted datasets (although support for DBSCAN is proposed in #3994).

Weighting can be a compact way of representing repeated samples, and may affect cluster means and variances, the average link between clusters, etc.
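
For instance, the weighted mean of a dataset is exactly the mean of the correspondingly repeated dataset; a minimal numpy illustration:

import numpy as np

X = np.array([[0.0], [1.0], [2.0]])
w = np.array([1, 1, 4])  # the last sample counts four times

# The weighted mean equals the plain mean of the explicitly repeated data.
assert np.allclose(np.average(X, axis=0, weights=w),
                   np.mean(np.repeat(X, w, axis=0), axis=0))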

Ideally, BIRCH's global clustering stage should be provided a weighted dataset; its current use of unweighted representatives may make its parametrisation more brittle.

This could be subject to an invariance test along the lines of:

import numpy as np
from sklearn.metrics import adjusted_rand_score

sample_weight = np.random.randint(0, 10, size=X.shape[0])
weighted_y = clusterer.fit_predict(X, sample_weight=sample_weight)
repeated_y = clusterer.fit_predict(np.repeat(X, sample_weight, axis=0))
assert adjusted_rand_score(np.repeat(weighted_y, sample_weight), repeated_y) == 1
# NB: this is only a useful test if weighted_y differs from clusterer.fit_predict(X)
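
As a concrete instance: KMeans in recent scikit-learn releases accepts sample_weight, so the check can be run end to end. A sketch, assuming a fixed initialisation and strictly positive integer weights so both fits perform identical updates:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
sample_weight = rng.randint(1, 5, size=X.shape[0])

# Fix the initial centres so the weighted and repeated fits start identically.
km = KMeans(n_clusters=3, init=X[:3], n_init=1)

weighted_y = km.fit_predict(X, sample_weight=sample_weight)
repeated_y = km.fit_predict(np.repeat(X, sample_weight, axis=0))

# ARI of 1 means the weighted labels, expanded per sample, match the repeated fit.
assert adjusted_rand_score(np.repeat(weighted_y, sample_weight), repeated_y) == 1.0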

(There is also a minor question of whether sample_weight should be universally accepted by ClusterMixin or whether a WeightedClusterMixin should be created, etc.)
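
To make that API question concrete, a hypothetical WeightedClusterMixin (neither the class nor the signature exists in scikit-learn) might look like:

class WeightedClusterMixin:
    # Hypothetical: ClusterMixin.fit_predict plus a sample_weight pass-through.
    def fit_predict(self, X, y=None, sample_weight=None):
        self.fit(X, sample_weight=sample_weight)
        return self.labels_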

Sample weight support for clusterers:

  • Affinity propagation (I don't know this well enough to know the applicability)
  • BIRCH
  • DBSCAN
  • Hierarchical -> Ward link
  • Hierarchical -> Complete link (N/A, as far as I can tell)
  • Hierarchical -> Average link
  • K Means
  • Minibatch K Means
  • Mean shift
  • Spectral
@jnothman added the Enhancement and Moderate (anything that requires some knowledge of conventions and best practices) labels on Dec 24, 2014
@DanielSidhion

I'd like to help with this issue. Is this needed individually for all clusterers (as suggested by the DBSCAN proposal referenced in the OP), or is this more in the direction of a "global" way to handle weighted datasets for clusterers? If individual, I'd suggest creating a checklist similar to the one in #3450 to mark progress.

@jnothman (Member, Author) commented Jan 5, 2015

I think there can be some global tests, but global solutions are unlikely. Yes, I'm happy to add a TODO list.

@jnothman (Member, Author) commented Jan 5, 2015

List added.

@d-wasserman

Is this feature actively being worked on in scikit-learn? I have an application for Mean Shift specifically.

@amueller (Member)

I don't think anyone is working on mean shift right now.

@jnothman (Member, Author)

As far as I know it's not being worked on.


@d-wasserman commented Oct 27, 2016

Thanks for the update. Is there any role for an interested party to get the conversation started or potentially contribute? Currently the workaround I am looking at involves itertools.repeat or np.repeat to change the data going into the clusterer, but I am sure this is not an efficient solution (the idea was a hack mentioned in a separate thread).
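
For reference, that np.repeat workaround can be wrapped as a small helper (hypothetical code, assuming strictly positive integer weights; memory grows with the total weight):

import numpy as np
from sklearn.cluster import MeanShift

def fit_predict_weighted(clusterer, X, sample_weight):
    # Materialise each row sample_weight[i] times, then cluster as usual.
    labels_rep = clusterer.fit_predict(np.repeat(X, sample_weight, axis=0))
    # Map labels back to the original rows via the first copy of each row.
    starts = np.cumsum(sample_weight) - sample_weight
    return labels_rep[starts]

# e.g. labels = fit_predict_weighted(MeanShift(), X, w)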

@amueller (Member)

I think we'd be happy to accept a PR, but runtime without sample_weights should not be affected too much.
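
One common way to satisfy that constraint is to branch on None so the weighted code only runs when weights are supplied; a sketch of the idea (not scikit-learn's actual code):

import numpy as np

def _cluster_means(X, labels, n_clusters, sample_weight=None):
    means = np.empty((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        mask = labels == k
        if sample_weight is None:
            means[k] = X[mask].mean(axis=0)  # unweighted fast path, unchanged cost
        else:
            means[k] = np.average(X[mask], axis=0, weights=sample_weight[mask])
    return means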

@d-wasserman

I think that is manageable. Just so I understand: you might accept an intermediate solution that was not 100% efficient, provided it did not impact runtime without sample_weights?

@jnothman (Member, Author)

Looking at the mean shift code, it doesn't look impossible to implement sample_weight support, but I can't see how itertools.repeat is readily going to solve it. If you think you have such a solution, submit a PR.


@anntzer (Contributor) commented Aug 22, 2022

I believe the same issue (support for sample weights) also applies to sklearn.mixture (Gaussian mixtures), essentially for the same reason (more efficiently taking repeated samples into account).
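
As of recent scikit-learn versions, GaussianMixture.fit takes no sample_weight parameter, so the same repeat-the-rows emulation (with the same memory caveat as above) is the available workaround:

import numpy as np
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=2, random_state=0)
gm.fit(np.repeat(X, sample_weight, axis=0))  # weights as explicit repetitions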
