Support sample weight in clusterers #3998


Open
4 of 10 tasks
jnothman opened this issue Dec 24, 2014 · 11 comments
Labels: Enhancement, help wanted, Moderate (anything that requires some knowledge of conventions and best practices), module:cluster

Comments

@jnothman (Member) commented Dec 24, 2014

Currently no clusterers (or clustering metrics) support weighted datasets (although support for DBSCAN is proposed in #3994).

Weighting can be a compact way of representing repeated samples, and may affect cluster means and variances, the average link between clusters, etc.
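
For instance, the weighted mean of a dataset is exactly the mean of the correspondingly repeated dataset; a minimal numpy illustration:

import numpy as np

X = np.array([[0.0], [1.0], [2.0]])
w = np.array([1, 1, 4])  # the last sample counts four times

# The weighted mean equals the plain mean of the explicitly repeated data.
assert np.allclose(np.average(X, axis=0, weights=w),
                   np.mean(np.repeat(X, w, axis=0), axis=0))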

Ideally, BIRCH's global clustering stage should be provided a weighted dataset; its current use of unweighted representatives may make its parametrisation more brittle.

This could be subject to an invariance test along the lines of:

import numpy as np
from sklearn.metrics import adjusted_rand_score

sample_weight = np.random.randint(0, 10, size=X.shape[0])
weighted_y = clusterer.fit_predict(X, sample_weight=sample_weight)
repeated_y = clusterer.fit_predict(np.repeat(X, sample_weight, axis=0))
assert adjusted_rand_score(np.repeat(weighted_y, sample_weight), repeated_y) == 1
# NB: this is only a useful test if weighted_y differs from clusterer.fit_predict(X)
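
As a concrete instance: KMeans in recent scikit-learn releases accepts sample_weight, so the check can be run end to end. A sketch, assuming a fixed initialisation and strictly positive integer weights so both fits perform identical updates:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
sample_weight = rng.randint(1, 5, size=X.shape[0])

# Fix the initial centres so the weighted and repeated fits start identically.
km = KMeans(n_clusters=3, init=X[:3], n_init=1)

weighted_y = km.fit_predict(X, sample_weight=sample_weight)
repeated_y = km.fit_predict(np.repeat(X, sample_weight, axis=0))

# ARI of 1 means the weighted labels, expanded per sample, match the repeated fit.
assert adjusted_rand_score(np.repeat(weighted_y, sample_weight), repeated_y) == 1.0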

(There is also a minor question of whether sample_weight should be universally accepted by ClusterMixin or whether a WeightedClusterMixin should be created, etc.)
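
To make that API question concrete, a hypothetical WeightedClusterMixin (neither the class nor the signature exists in scikit-learn) might look like:

class WeightedClusterMixin:
    # Hypothetical: ClusterMixin.fit_predict plus a sample_weight pass-through.
    def fit_predict(self, X, y=None, sample_weight=None):
        self.fit(X, sample_weight=sample_weight)
        return self.labels_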

Sample weight support for clusterers:

  • Affinity propagation (I don't know this well enough to know the applicability)
  • BIRCH
  • DBSCAN
  • Hierarchical -> Ward link
  • Hierarchical -> Complete link (N/A, as far as I can tell)
  • Hierarchical -> Average link
  • K Means
  • Minibatch K Means
  • Mean shift
  • Spectral
@jnothman added the Enhancement and Moderate (anything that requires some knowledge of conventions and best practices) labels on Dec 24, 2014
@DanielSidhion

I'd like to help with this issue. Is this needed individually for all clusterers (as suggested by the DBSCAN proposal referenced in the OP), or is this more in the direction of a "global" way to handle weighted datasets for clusterers? If individual, I'd suggest creating a checklist similar to the one in #3450 to mark progress.

@jnothman (Member, Author) commented Jan 5, 2015

I think there can be some global tests, but global solutions are unlikely. Yes, I'm happy to add a TODO list.

@jnothman (Member, Author) commented Jan 5, 2015

List added.

@d-wasserman

Is this feature actively being worked on in scikit-learn? I have an application for Mean Shift specifically.

@amueller (Member)

I don't think anyone is working on mean shift right now.

@jnothman (Member, Author)

As far as I know it's not being worked on.


@d-wasserman commented Oct 27, 2016

Thanks for the update. Is there any role for an interested party to get the conversation started or potentially contribute? Currently the workaround I am looking at involves itertools.repeat or np.repeat to change the data going into the clusterer, but I am sure this is not an efficient solution (the idea was a hack mentioned in a separate thread).
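
For reference, that np.repeat workaround can be wrapped as a small helper (hypothetical code, assuming strictly positive integer weights; memory grows with the total weight):

import numpy as np
from sklearn.cluster import MeanShift

def fit_predict_weighted(clusterer, X, sample_weight):
    # Materialise each row sample_weight[i] times, then cluster as usual.
    labels_rep = clusterer.fit_predict(np.repeat(X, sample_weight, axis=0))
    # Map labels back to the original rows via the first copy of each row.
    starts = np.cumsum(sample_weight) - sample_weight
    return labels_rep[starts]

# e.g. labels = fit_predict_weighted(MeanShift(), X, w)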

@amueller (Member)

I think we'd be happy to accept a PR, but runtime without sample_weights should not be affected too much.
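
One common way to satisfy that constraint is to branch on None so the weighted code only runs when weights are supplied; a sketch of the idea (not scikit-learn's actual code):

import numpy as np

def _cluster_means(X, labels, n_clusters, sample_weight=None):
    means = np.empty((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        mask = labels == k
        if sample_weight is None:
            means[k] = X[mask].mean(axis=0)  # unweighted fast path, unchanged cost
        else:
            means[k] = np.average(X[mask], axis=0, weights=sample_weight[mask])
    return means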

@d-wasserman

I think that is manageable. Just so I understand: you might accept an intermediate solution that was not 100% efficient, provided it did not impact runtime without sample_weights?

@jnothman (Member, Author)

Looking at the mean shift code, it doesn't look impossible to implement sample_weight support, but I can't see how itertools.repeat is readily going to solve it. If you think you have such a solution, submit a PR.


@anntzer (Contributor) commented Aug 22, 2022

I believe the same issue (support for sample weights) also applies to sklearn.mixture (Gaussian mixtures), essentially for the same reason (more efficiently taking repeated samples into account).
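
As of recent scikit-learn versions, GaussianMixture.fit takes no sample_weight parameter, so the same repeat-the-rows emulation (with the same memory caveat as above) is the available workaround:

import numpy as np
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=2, random_state=0)
gm.fit(np.repeat(X, sample_weight, axis=0))  # weights as explicit repetitions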
