List of estimators with known incorrect handling of sample_weight
#16298
Comments
I think it's unclear what needs to happen in this issue, and I doubt they
all can be fixed straightforwardly.
The goal of this issue is to move the list of failing estimators from PR #15015 to here, and to merge that PR with an XFAIL flag once that is supported via #16306. Finally, this issue should stay open for discussion so that the failing estimators can be fixed in separate PRs. I agree that fixing them will likely not be straightforward.
@snath-xoc started to investigate more into the current state of https://gist.github.com/snath-xoc/fb28feab39403a1e66b00b5b28f1dcbf This gist focuses on estimators with a stochastic fit method (that might involve resampling, e.g.
sample_weight
The current implementation of |
I updated the list with the results from the notebook https://gist.github.com/snath-xoc/fb28feab39403a1e66b00b5b28f1dcbf and from the extended common test in #29818. Note that the notebook only deals with classifiers and regressors for now. It could be extended to other estimators. We can check the result of
Just leaving a comment here so I get updates about the status of this issue. (I'm interested in using sample_weight in HistGradientBoostingRegressor).
I opened #30564 to better document and motivate the ongoing work on improving sample weight testing and implementations throughout scikit-learn.
Status update: @snath-xoc and I refactored the notebooks with statistical tests into a single notebook plus a helper library with some tests. Here is the summary of the current state:
Full notebook with details: https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb This notebook runs both the deterministic We still need proper support for stochastic clustering models (e.g. |
The tool to detect violations of the weighting/repetition equivalence semantics of stochastic estimators can now run on all kinds of estimators, and the results seem meaningful.
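To illustrate the idea of a statistical check for stochastic estimators, here is a minimal sketch. It is not the audit tool's actual procedure: `bootstrap_mean` is a hypothetical toy estimator, `audit` is a hypothetical helper, and the choice of a two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) is an assumption made for this illustration. The estimator is fitted many times on weighted data and on repeated data, and the two distributions of outputs are compared.

```python
import numpy as np
from scipy.stats import ks_2samp


def bootstrap_mean(y, rng, sample_weight=None):
    """Hypothetical stochastic estimator: mean of one weighted bootstrap resample."""
    if sample_weight is None:
        w = np.ones(len(y))
    else:
        w = np.asarray(sample_weight, dtype=float)
    n_draws = int(round(w.sum()))  # resample size follows the total weight
    idx = rng.choice(len(y), size=n_draws, p=w / w.sum())
    return y[idx].mean()


def audit(fit, y, sample_weight, n_fits=200, seed=0):
    """Return the KS p-value comparing weighted fits against repeated-data fits."""
    rng = np.random.default_rng(seed)
    y_repeated = np.repeat(y, sample_weight)
    weighted = [fit(y, rng, sample_weight) for _ in range(n_fits)]
    repeated = [fit(y_repeated, rng) for _ in range(n_fits)]
    return ks_2samp(weighted, repeated).pvalue


y = np.array([0.0, 10.0])
sw = np.array([1, 9])

# Weight-aware estimator: both output distributions should match.
p_ok = audit(bootstrap_mean, y, sw)
# Broken variant that silently ignores the weights it receives.
p_broken = audit(lambda y_, rng, sample_weight=None: bootstrap_mean(y_, rng), y, sw)
print(f"weight-aware p-value: {p_ok:.3g}, weight-ignoring p-value: {p_broken:.3g}")
```

A very small p-value suggests that weighting and repetition are not equivalent for that estimator. A large p-value is only an absence of evidence for a violation, not a proof of correctness.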
An issue with an associated common check originally discussed in #15015
fit is deterministic or not with the config used in check_sample_weight_equivalence: to be investigated.
check_sample_weight_equivalence fails even when subsampling for binning is not enabled.
sample_weight aware but can then be only properly tested with a statistical test instead of check_sample_weight_equivalence
LinearRegression's numerical stability on rank deficient data by setting the cond parameter in the call to scipy.linalg.lstsq #30040
liblinear
liblinear
lbfgs causes check_sample_weight_equivalence to fail (slightly)
liblinear with C=0.01 causes check_sample_weight_equivalence to fail (slightly)
check_sample_weight_equivalence now passes for this estimator after lowering the tol value for lsqr and sparse-cg in the per-check params.
check_sample_weight_equivalence now passes for this estimator after lowering the tol value for lsqr and sparse-cg in the per-check params.
cv (which is the case in check_sample_weight_equivalence) or scoring params.
check_sample_weight_equivalence fails with probability=False: to be investigated
probability=True as the weights are not propagated to the internal CV implemented in libsvm
sample_weight to their scorer by default FIX Forward sample weight to the scorer in grid search #30743.
sample_weight in general: SLEP006: default routing #26179
The following estimators have a stochastic fit, so correct handling of sample weights cannot be tested with check_sample_weight_equivalence but instead requires a statistical test:
KBinsDiscretizer #29907
n_samples and uniform or quantile strategies, the fit is deterministic.
(some might have been fixed since, need to check).
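The LinearRegression item above mentions the cond parameter of scipy.linalg.lstsq. As a rough illustration of what that parameter does on rank deficient data (this is not the change made in #30040; the data and the cutoff value 1e-10 are assumptions chosen for the example):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
x0 = rng.normal(size=20)
x2 = rng.normal(size=20)
# Column 1 duplicates column 0, so the design matrix has rank 2, not 3.
X = np.column_stack([x0, x0, x2])
y = 2.0 * x0 + 3.0 * x2

# cond is a cutoff relative to the largest singular value: singular values
# below cond * s_max are treated as zero when solving. The value 1e-10 is an
# arbitrary choice for this illustration.
coef, _, rank, singular_values = lstsq(X, y, cond=1e-10)

print("effective rank:", rank)
print("minimum-norm coefficients:", coef)
```

With an explicit cutoff, the near-zero singular value is discarded and the minimum-norm solution splits the coefficient 2.0 evenly across the two duplicated columns, instead of depending on rounding noise.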
The required sample weight invariance properties (including the behavior of sw=0) were also discussed in #15657
EDIT: expected sample_weight semantics have since been defined more generally in the refactoring of check_sample_weight_invariance into check_sample_weight_equivalence, which checks that fitting with integer sample weights is equivalent to fitting with repeated data points, with a number of repetitions that matches the original weights.
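For a deterministic estimator, the weighting/repetition equivalence can be sketched as follows. This is a simplified illustration and not the code of check_sample_weight_equivalence itself; the synthetic data, the weight range and the choice of LinearRegression are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)
# Integer weights in {0, 1, 2, 3}; a weight of 0 should act like dropping the row.
sample_weight = rng.integers(0, 4, size=30)

weighted = LinearRegression().fit(X, y, sample_weight=sample_weight)

# np.repeat duplicates each row according to its weight and drops weight-0 rows.
X_repeated = np.repeat(X, sample_weight, axis=0)
y_repeated = np.repeat(y, sample_weight)
repeated = LinearRegression().fit(X_repeated, y_repeated)

coef_gap = np.max(np.abs(weighted.coef_ - repeated.coef_))
intercept_gap = abs(weighted.intercept_ - repeated.intercept_)
print(f"max coef difference: {coef_gap:.2e}, intercept difference: {intercept_gap:.2e}")
```

Both fits should agree up to floating point noise. The same comparison is not meaningful for estimators with a stochastic fit, which is why those need a statistical test instead.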