Description
An issue with an associated common check, originally discussed in #15015.
This is a pretty simple `sample_weight` test that says that a weight of 0 is equivalent to not having the samples.
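A minimal sketch of the property being checked (the estimator, dataset and tolerance below are arbitrary illustrative choices, not the actual common check):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.RandomState(0)
# Give roughly half of the samples a weight of 0, the rest a weight of 1.
sw = rng.randint(0, 2, size=y.shape[0]).astype(float)

est_weighted = LogisticRegression().fit(X, y, sample_weight=sw)
est_trimmed = LogisticRegression().fit(X[sw > 0], y[sw > 0])

# The two models should be numerically equivalent; for the estimators listed
# below, this kind of assertion currently fails, which is what this issue tracks.
np.testing.assert_allclose(
    est_weighted.predict_proba(X), est_trimmed.predict_proba(X), atol=1e-7
)
```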
I think every failure here should be considered a bug. The affected estimators are:
- AdaBoostClassifier/Regressor
- BayesianRidge FIX Sample weight in BayesianRidge #30644
- CalibratedClassifierCV Make check_sample_weights_invariance cv-aware #29796
- CategoricalNB Fix set `CategoricalNB().__sklearn_tags__.input_tags.categorical` to `True` #31556
- GradientBoostingClassifier/Regressor
  - not sure if `fit` is deterministic or not with the config used in `check_sample_weight_equivalence`: to be investigated.
  - rounding errors make it non-deterministic and break equivalence because of a systematic bias in handling near-tied splits: [Meta] consistently break ties randomly in scikit-learn estimators (with random_state) in an unbiased way #23728 (comment)
- HistGradientBoostingClassifier/Regressor
  - `check_sample_weight_equivalence` fails even when subsampling for binning is not enabled.
  - subsampling for binning needs to be `sample_weight` aware but can then only be properly tested with a statistical test instead of `check_sample_weight_equivalence`
  - subsampling also happens (without respecting weights) to compute non-default scoring on a subsampled training set when early stopping is enabled without a validation set
  - Added sample weight handling to BinMapper under HGBT #29641
- HuberRegressor
- LinearRegression
- LinearSVC
  - Fix linear svc handling sample weights under class_weight="balanced" #30057
  - apparently related to the use of `liblinear`
- LinearSVR
  - apparently related to the use of `liblinear`
- LogisticRegression
  - `lbfgs` causes `check_sample_weight_equivalence` to fail (slightly)
  - `liblinear` with `C=0.01` causes `check_sample_weight_equivalence` to fail (slightly)
- LogisticRegressionCV
- ElasticNetCV / LassoCV Fix elasticnect cv sample weight #29442 and Make check_sample_weights_invariance cv-aware #29796 (both are needed)
- Ridge: `check_sample_weight_equivalence` now passes for this estimator after lowering the `tol` value for `lsqr` and `sparse_cg` in the per-check params (a hand-rolled version of this comparison is sketched after this list).
- RidgeClassifier: `check_sample_weight_equivalence` now passes for this estimator after lowering the `tol` value for `lsqr` and `sparse_cg` in the per-check params.
- RidgeClassifierCV. Make check_sample_weights_invariance cv-aware #29796
- DecisionTreeRegressor
  - biased handling of near-tied splits: [Meta] consistently break ties randomly in scikit-learn estimators (with random_state) in an unbiased way #23728 (comment)
- RidgeCV and RidgeClassifierCV can delegate to GridSearchCV when using a non-default `cv` (which is the case in `check_sample_weight_equivalence`) or `scoring` params.
- OneClassSVM
- NuSVC (same as SVC)
- NuSVR
- SVC
  - `check_sample_weight_equivalence` fails with `probability=False`: to be investigated
  - this is expected with `probability=True` as the weights are not propagated to the internal CV implemented in libsvm
- SVR
- GridSearchCV / RandomizedSearchCV / ...
  - did not forward `sample_weight` to their scorer by default: FIX Forward sample weight to the scorer in grid search #30743
  - do this even when metadata routing is disabled
  - implement a default routing policy for `sample_weight` in general: SLEP006: default routing #26179
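As a concrete illustration of the Ridge/RidgeClassifier entries above, here is a hand-rolled version of the weighted-vs-repeated comparison with the solver `tol` lowered, as done in the per-check params (the dataset, weights and tolerances are illustrative assumptions, not the exact configuration used by the common check):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=1.0, random_state=0)
rng = np.random.RandomState(0)
sw = rng.randint(0, 5, size=y.shape[0])  # integer weights, including zeros

# Fitting with integer weights should match fitting on the data with each
# row repeated as many times as its weight.
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)

# A loose solver tolerance can hide (or cause) small discrepancies, hence the
# lowered `tol` mentioned above.
ridge_w = Ridge(solver="lsqr", tol=1e-12).fit(X, y, sample_weight=sw)
ridge_r = Ridge(solver="lsqr", tol=1e-12).fit(X_rep, y_rep)

np.testing.assert_allclose(ridge_w.coef_, ridge_r.coef_, atol=1e-6)
np.testing.assert_allclose(ridge_w.intercept_, ridge_r.intercept_, atol=1e-6)
```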
The following estimators have a stochastic fit, so correct handling of sample weights cannot be tested with `check_sample_weight_equivalence` but instead requires a statistical test (a possible shape for such a test is sketched after this list):
- BaggingClassifier/Regressor
- BisectingKMeans
- KBinsDiscretizer
  - Fix sample weight passing in `KBinsDiscretizer` #29907
  - note that with a small `n_samples` and uniform or quantile strategies, the fit is deterministic.
- SGDClassifier/Regressor
- Perceptron (likely same bug as SGD)
- RANSACRegressor
- KMeans
- MiniBatchKMeans
- IsolationForest
- RandomForestClassifier/Regressor
- ExtraTreeRegressor
- ExtraTreesRegressor
- RandomTreesEmbedding
(some might have been fixed since, need to check).
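A possible shape for the statistical test mentioned above (the estimator, number of seeds and the use of a two-sample Kolmogorov-Smirnov test are illustrative assumptions, not an agreed-upon protocol): instead of requiring bit-for-bit equality, refit both variants under many random seeds and compare the distributions of the resulting predictions.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
rng = np.random.RandomState(0)
sw = rng.randint(0, 3, size=y.shape[0])
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)
x_query = X[:1]  # a single query point whose prediction we track across seeds

pred_weighted, pred_repeated = [], []
for seed in range(100):
    # Use disjoint seeds for the two groups so they are independent draws.
    est_w = RandomForestRegressor(n_estimators=10, random_state=seed)
    est_r = RandomForestRegressor(n_estimators=10, random_state=seed + 10_000)
    pred_weighted.append(est_w.fit(X, y, sample_weight=sw).predict(x_query)[0])
    pred_repeated.append(est_r.fit(X_rep, y_rep).predict(x_query)[0])

# If sample weights are handled correctly, the two prediction distributions
# should be statistically indistinguishable.
stat, p_value = ks_2samp(pred_weighted, pred_repeated)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}")
```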
The required sample weight invariance properties (including the behavior of sw=0) were also discussed in #15657
EDIT: the expected `sample_weight` semantics have since been defined more generally in the refactoring of `check_sample_weight_invariance` into `check_sample_weight_equivalence`, which checks that fitting with integer sample weights is equivalent to fitting with repeated data points, with a number of repetitions that matches the original weights.