
List of estimators with known incorrect handling of sample_weight #16298


Open
17 of 54 tasks
rth opened this issue Jan 29, 2020 · 9 comments
Labels
Meta-issue General issue associated to an identified list of tasks

Comments

@rth
Member

rth commented Jan 29, 2020

An issue with an associated common check originally discussed in #15015

This is a pretty simple sample_weight test asserting that a weight of 0 is equivalent to removing the corresponding samples.
I think every failure here should be considered a bug. The failing estimators are:

The following estimators have a stochastic fit, so correct handling of sample weights cannot be verified with check_sample_weight_equivalence and instead requires a statistical test:

(some might have been fixed since, need to check).

The required sample weight invariance properties (including the behavior of sw=0) were also discussed in #15657
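To make the sw=0 property concrete, here is a minimal sketch of the invariance being checked, using Ridge purely as an illustrative estimator (the actual common check iterates over all estimators accepting sample_weight):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)

# Zero out the weight of the last 5 samples...
sw = np.ones(20)
sw[15:] = 0.0
est_weighted = Ridge(alpha=1.0).fit(X, y, sample_weight=sw)

# ...which should be equivalent to dropping those samples entirely.
est_trimmed = Ridge(alpha=1.0).fit(X[:15], y[:15])

np.testing.assert_allclose(est_weighted.coef_, est_trimmed.coef_)
```

Ridge passes this check; for a failing estimator the assertion above would raise.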

EDIT: expected sample_weight semantics have since been specified more generally in the refactoring of check_sample_weight_invariance into check_sample_weight_equivalence, which checks that fitting with integer sample weights is equivalent to fitting with repeated data points, each repeated a number of times matching its original weight.
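A minimal sketch of this weighting/repetition equivalence, using LinearRegression as an illustrative estimator with an exact (closed-form) fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(30, 4)
y = rng.randn(30)
sw = rng.randint(0, 4, size=30)  # integer weights, some zero

# Fit with integer sample weights...
reg_weighted = LinearRegression().fit(X, y, sample_weight=sw)

# ...and with each sample repeated according to its weight
# (a weight of 0 drops the sample).
X_rep = np.repeat(X, sw, axis=0)
y_rep = np.repeat(y, sw)
reg_repeated = LinearRegression().fit(X_rep, y_rep)

np.testing.assert_allclose(reg_weighted.coef_, reg_repeated.coef_, rtol=1e-6)
```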

@jnothman
Member

jnothman commented Jan 30, 2020 via email

@rth
Member Author

rth commented Jan 30, 2020

I think it's unclear what needs to happen in this issue, and I doubt they
can all be fixed straightforwardly.

The goal of this issue is to move the list of failing estimators from PR #15015 to here, and to merge that PR with an XFAIL flag once that is supported via #16306.

Finally, we keep this issue open for discussion and to allow the failing estimators to be fixed in separate PRs. I agree that the fixes will likely not be straightforward.

@ogrisel ogrisel added the Meta-issue General issue associated to an identified list of tasks label Sep 5, 2024
@ogrisel
Member

ogrisel commented Sep 5, 2024

@snath-xoc started to investigate more into the current state of sample_weight handling in scikit-learn estimators with the following gist.

https://gist.github.com/snath-xoc/fb28feab39403a1e66b00b5b28f1dcbf

This gist is focused on estimators with a stochastic fit method (which may involve resampling, e.g. BaggingClassifier or RANSACRegressor). For those cases we cannot expect the assert_allclose assertions of our current check_sample_weights_invariance to hold exactly, but only in expectation. This is tested by fitting with many different values of random_state passed as a constructor argument. Also note that sometimes the problem only appears for non-default values of the other constructor parameters.
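The idea behind the statistical approach can be sketched as follows. This is a simplified illustration, not the actual audit code from the gist: for a stochastic estimator such as BaggingRegressor, predictions from weighted fits and from repeated-data fits are averaged over many random_state values and compared in expectation (a real audit would use a proper two-sample statistical test rather than just inspecting the gap):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randn(50)
sw = rng.randint(0, 3, size=50)
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)

preds_weighted, preds_repeated = [], []
for seed in range(100):
    est_w = BaggingRegressor(random_state=seed).fit(X, y, sample_weight=sw)
    preds_weighted.append(est_w.predict(X))
    est_r = BaggingRegressor(random_state=seed).fit(X_rep, y_rep)
    preds_repeated.append(est_r.predict(X))

# If weighting were handled exactly like repetition, the two mean
# predictions would agree up to sampling noise; a systematic gap
# that does not shrink with more seeds flags a bug.
gap = np.abs(np.mean(preds_weighted, axis=0) - np.mean(preds_repeated, axis=0))
print(f"max absolute gap between mean predictions: {gap.max():.3f}")
```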

@ogrisel ogrisel changed the title Check that zero sample weight means samples are ignored List of estimators with known incorrect handling of sample_weight Sep 5, 2024
@ogrisel
Member

ogrisel commented Sep 5, 2024

The current implementation of check_sample_weights_invariance(kind="zeros") is expected to fail for estimators with a cv argument. It should be adapted so that the CV boundaries for the non-dropped training points do not move. Special handling for this test has been implemented in an ad hoc manner in #29419 and #29442, but it should be generalized in the common test itself.
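To illustrate why the zero-weight check breaks for cv-based estimators: dropping zero-weight rows shifts the KFold boundaries for the surviving samples, so the two fits see genuinely different CV splits. A small sketch (indices chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

n = 10
dropped = np.array([2, 3])  # samples that would get sample_weight=0
kept = np.setdiff1d(np.arange(n), dropped)

# CV folds over the full data vs. over the trimmed data,
# mapped back to the original sample indices:
folds_full = [test for _, test in KFold(n_splits=5).split(np.zeros(n))]
folds_trimmed = [kept[test] for _, test in KFold(n_splits=5).split(np.zeros(len(kept)))]

print([f.tolist() for f in folds_full])     # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print([f.tolist() for f in folds_trimmed])  # [[0, 1], [4, 5], [6, 7], [8], [9]]
```

Note how samples 8 and 9 end up in the same fold in the full split but in different folds in the trimmed split, so even a perfectly weight-aware estimator would produce different results.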

@jeremiedbb
Member

I updated the list with the results from the notebook https://gist.github.com/snath-xoc/fb28feab39403a1e66b00b5b28f1dcbf and from the extended common test in #29818.

Note that the notebook only deals with classifiers/regressors for now. It could be extended to other estimator types; for transformers, for instance, we could check the result of transform.
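A transformer version of the zero-weight check could look like the following sketch, using StandardScaler (whose fit accepts sample_weight) as the example; this is an illustration of the idea, not the notebook's actual code:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(12, 2)
sw = np.ones(12)
sw[8:] = 0.0  # ignore the last 4 samples via zero weights

scaler_weighted = StandardScaler().fit(X, sample_weight=sw)
scaler_trimmed = StandardScaler().fit(X[:8])

# The fitted statistics, and hence the transform output, should match.
np.testing.assert_allclose(scaler_weighted.transform(X),
                           scaler_trimmed.transform(X), atol=1e-12)
```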

@marenwestermann
Member

Just leaving a comment here so I get updates about the status of this issue. (I'm interested in using sample_weight in HistGradientBoostingRegressor).

@ogrisel
Member

ogrisel commented Jan 2, 2025

I opened #30564 to better document and motivate the ongoing work on improving sample weight testing and implementations throughout scikit-learn.

@ogrisel
Member

ogrisel commented Feb 3, 2025

Status update: @snath-xoc and I refactored the notebooks with statistical tests into a single notebook plus a helper library with some tests. Here is a summary of the current state:

✅ 19 passed the deterministic test
❌ 4 failed the deterministic test
✅ 14 passed the statistical test
❌ 17 failed the statistical test
❌ 5 other errors
⚠ 112 estimators lack sample_weight support

Full notebook with details: https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb

This notebook runs both the deterministic check_sample_weight_equivalence_on_dense_data check and the statistical test for estimators with a stochastic fit.

We still need proper support for stochastic clustering models (e.g. KMeans and co) and a simpler/more robust way to deal with transformers.

@ogrisel
Member

ogrisel commented Mar 27, 2025

The tool to detect violations of the weighting/repetition equivalence semantics of stochastic estimators can now run on all kinds of estimators:

https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb

and the results seem meaningful.

✅ 45 passed the statistical test
❌ 14 failed the statistical test
❌ 3 other errors
⚠ 112 estimators lack sample_weight support
