
List of estimators with known incorrect handling of sample_weight #16298


Open
17 of 54 tasks
rth opened this issue Jan 29, 2020 · 9 comments
Labels
Meta-issue General issue associated to an identified list of tasks

Comments

@rth
Member

rth commented Jan 29, 2020

An issue with an associated common check originally discussed in #15015

This is a pretty simple sample_weight test asserting that a weight of 0 is equivalent to removing the corresponding samples.
I think every failure here should be considered a bug. The failing estimators are:

The following estimators have a stochastic fit, so correct handling of sample weights cannot be verified with check_sample_weight_equivalence and instead requires a statistical test:

(some might have been fixed since, need to check).

The required sample weight invariance properties (including the behavior of sw=0) were also discussed in #15657
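To make the sw=0 property concrete, here is a minimal sketch of the invariance being checked, using Ridge purely as an illustrative estimator (the actual common check iterates over all estimators accepting sample_weight):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)

# Zero out the weight of the last 5 samples...
sw = np.ones(20)
sw[15:] = 0.0
est_weighted = Ridge(alpha=1.0).fit(X, y, sample_weight=sw)

# ...which should be equivalent to dropping those samples entirely.
est_trimmed = Ridge(alpha=1.0).fit(X[:15], y[:15])

np.testing.assert_allclose(est_weighted.coef_, est_trimmed.coef_)
```

Ridge passes this check; for a failing estimator the assertion above would raise.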

EDIT: expected sample_weight semantics have since been specified more generally in the refactoring of check_sample_weight_invariance into check_sample_weight_equivalence, which checks that fitting with integer sample weights is equivalent to fitting with repeated data points, each repeated a number of times matching its original weight.
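A minimal sketch of this weighting/repetition equivalence, using LinearRegression as an illustrative estimator with an exact (closed-form) fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(30, 4)
y = rng.randn(30)
sw = rng.randint(0, 4, size=30)  # integer weights, some zero

# Fit with integer sample weights...
reg_weighted = LinearRegression().fit(X, y, sample_weight=sw)

# ...and with each sample repeated according to its weight
# (a weight of 0 drops the sample).
X_rep = np.repeat(X, sw, axis=0)
y_rep = np.repeat(y, sw)
reg_repeated = LinearRegression().fit(X_rep, y_rep)

np.testing.assert_allclose(reg_weighted.coef_, reg_repeated.coef_, rtol=1e-6)
```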

@jnothman
Member

jnothman commented Jan 30, 2020 via email

@rth
Member Author

rth commented Jan 30, 2020

I think it's unclear what needs to happen in this issue, and I doubt they
can all be fixed straightforwardly.

The goal of this issue is to move the list of failing estimators from PR #15015 to here, and to merge that PR with an XFAIL flag once that is supported via #16306.

Finally, we keep this issue open for discussion and to allow the failing estimators to be fixed in separate PRs. I agree that the fixes will likely not be straightforward.

@ogrisel ogrisel added the Meta-issue General issue associated to an identified list of tasks label Sep 5, 2024
@ogrisel
Member

ogrisel commented Sep 5, 2024

@snath-xoc started to investigate more into the current state of sample_weight handling in scikit-learn estimators with the following gist.

https://gist.github.com/snath-xoc/fb28feab39403a1e66b00b5b28f1dcbf

This gist is focused on estimators with a stochastic fit method (which may involve resampling, e.g. BaggingClassifier or RANSACRegressor). For those cases we cannot expect the assert_allclose assertions of our current check_sample_weights_invariance to hold exactly, but only in expectation. This is tested by fitting with many different values of random_state passed as a constructor argument. Also note that sometimes the problem only appears for non-default values of the other constructor parameters.
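The idea behind the statistical approach can be sketched as follows. This is a simplified illustration, not the actual audit code from the gist: for a stochastic estimator such as BaggingRegressor, predictions from weighted fits and from repeated-data fits are averaged over many random_state values and compared in expectation (a real audit would use a proper two-sample statistical test rather than just inspecting the gap):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randn(50)
sw = rng.randint(0, 3, size=50)
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)

preds_weighted, preds_repeated = [], []
for seed in range(100):
    est_w = BaggingRegressor(random_state=seed).fit(X, y, sample_weight=sw)
    preds_weighted.append(est_w.predict(X))
    est_r = BaggingRegressor(random_state=seed).fit(X_rep, y_rep)
    preds_repeated.append(est_r.predict(X))

# If weighting were handled exactly like repetition, the two mean
# predictions would agree up to sampling noise; a systematic gap
# that does not shrink with more seeds flags a bug.
gap = np.abs(np.mean(preds_weighted, axis=0) - np.mean(preds_repeated, axis=0))
print(f"max absolute gap between mean predictions: {gap.max():.3f}")
```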

@ogrisel ogrisel changed the title Check that zero sample weight means samples are ignored List of estimators with known incorrect handling of sample_weight Sep 5, 2024
@ogrisel
Member

ogrisel commented Sep 5, 2024

The current implementation of check_sample_weights_invariance(kind="zeros") is expected to fail for estimators with a cv argument. It should be adapted so that the CV boundaries for the non-dropped training points do not move. Special handling for this test has been implemented in an ad hoc manner in #29419 and #29442, but it should be generalized in the common test itself.
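To illustrate why the zero-weight check breaks for cv-based estimators: dropping zero-weight rows shifts the KFold boundaries for the surviving samples, so the two fits see genuinely different CV splits. A small sketch (indices chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

n = 10
dropped = np.array([2, 3])  # samples that would get sample_weight=0
kept = np.setdiff1d(np.arange(n), dropped)

# CV folds over the full data vs. over the trimmed data,
# mapped back to the original sample indices:
folds_full = [test for _, test in KFold(n_splits=5).split(np.zeros(n))]
folds_trimmed = [kept[test] for _, test in KFold(n_splits=5).split(np.zeros(len(kept)))]

print([f.tolist() for f in folds_full])     # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print([f.tolist() for f in folds_trimmed])  # [[0, 1], [4, 5], [6, 7], [8], [9]]
```

Note how samples 8 and 9 end up in the same fold in the full split but in different folds in the trimmed split, so even a perfectly weight-aware estimator would produce different results.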

@jeremiedbb
Member

I updated the list with the results from the notebook https://gist.github.com/snath-xoc/fb28feab39403a1e66b00b5b28f1dcbf and from the extended common test in #29818.

Note that the notebook only deals with classifiers/regressors for now. It could be extended to other estimator types; for transformers, for instance, we could check the result of transform.
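A transformer version of the zero-weight check could look like the following sketch, using StandardScaler (whose fit accepts sample_weight) as the example; this is an illustration of the idea, not the notebook's actual code:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(12, 2)
sw = np.ones(12)
sw[8:] = 0.0  # ignore the last 4 samples via zero weights

scaler_weighted = StandardScaler().fit(X, sample_weight=sw)
scaler_trimmed = StandardScaler().fit(X[:8])

# The fitted statistics, and hence the transform output, should match.
np.testing.assert_allclose(scaler_weighted.transform(X),
                           scaler_trimmed.transform(X), atol=1e-12)
```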

@marenwestermann
Member

Just leaving a comment here so I get updates about the status of this issue. (I'm interested in using sample_weight in HistGradientBoostingRegressor).

@ogrisel
Member

ogrisel commented Jan 2, 2025

I opened #30564 to better document and motivate the ongoing work on improving sample weight testing and implementations throughout scikit-learn.

@ogrisel
Member

ogrisel commented Feb 3, 2025

Status update: @snath-xoc and I refactored the notebooks with statistical tests into a single notebook plus a helper library with some tests. Here is a summary of the current state:

✅ 19 passed the deterministic test
❌ 4 failed the deterministic test
✅ 14 passed the statistical test
❌ 17 failed the statistical test
❌ 5 other errors
⚠ 112 estimators lack sample_weight support

Full notebook with details: https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb

This notebook runs both the deterministic check_sample_weight_equivalence_on_dense_data check and the statistical test for estimators with a stochastic fit.

We still need proper support for stochastic clustering models (e.g. KMeans and co) and a simpler/more robust way to deal with transformers.

@ogrisel
Member

ogrisel commented Mar 27, 2025

The tool to detect violations of the weighting/repetition equivalence semantics of stochastic estimators can now run on all kinds of estimators:

https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb

and the results seem meaningful.

✅ 45 passed the statistical test
❌ 14 failed the statistical test
❌ 3 other errors
⚠ 112 estimators lack sample_weight support
