Add new default max_samples=None in Bagging estimators #32825
Conversation
cakedev0 left a comment:
LGTM. A few comments/questions, but nothing blocking.
> - In :class:`ensemble.BaggingClassifier` and :class:`ensemble.BaggingRegressor`
>   the new default `max_samples=None` draws `X.shape[0]` samples, irrespective of
>   `sample_weight`.
>   By :user:`Antoine Baker <antoinebaker>`.
Is it ok to have two "conflicting" change logs in the same release?
Or would it be better to edit the previous change log?
For me the two changelogs are not in conflict, but complement each other.
`31414.fix.rst` states that:

> Furthermore, `max_samples` is now interpreted as a fraction of
> `sample_weight.sum()` instead of `X.shape[0]` when passed as a float.

It remains true in this PR (for float `max_samples`). This PR just adds the `max_samples=None` option.
At the same time you can decide to merge the two changelog entries if that is simpler to understand from a user perspective. IIRC changelog entries are merged by PR numbers (not 100% sure) and depending on how much activity has gone into this module, they may not be close to each other.
One way to do this is something like I did in #32831. Basically you use a single changelog entry and you mention one PR with :pr: in the text and the other one is added automatically based on the filename.
@lesteve good catch on the missing extension! I tried your hack by merging the two changelogs into the new one (and removing the old one). The output of towncrier seems okay:
:mod:`sklearn.ensemble`
-----------------------
- |Fix| :class:`ensemble.BaggingClassifier`, :class:`ensemble.BaggingRegressor` and
:class:`ensemble.IsolationForest` now use `sample_weight` to draw the samples
instead of forwarding them multiplied by a uniformly sampled mask to the
underlying estimators. Furthermore, `max_samples` is now interpreted as a
fraction of `sample_weight.sum()` instead of `X.shape[0]` when passed as a
float. The new default `max_samples=None` draws `X.shape[0]` samples,
irrespective of `sample_weight`.
By :user:`Antoine Baker <antoinebaker>`. :pr:`31414` and :pr:`32825`
ogrisel left a comment:
I agree, the behavior might be less surprising for `sample_weight` users. But we do lose the strict weight/repetition equivalence (frequency semantics) when using the default `max_samples=None`.
But it's easy to restore the frequency semantics by explicitly choosing `max_samples=1.0` (or alternatively `max_samples=some_integer_of_choice`) after making a conscious decision, so maybe this is not a big deal.
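To make the opt-in concrete, here is a minimal sketch (my illustration, not code from the PR) of the three choices discussed here; it assumes a scikit-learn version where `max_samples=None` is the new Bagging default:

```python
# Minimal sketch, assuming the scikit-learn >= 1.8 behavior described in this PR.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# New default: draws X.shape[0] samples, irrespective of sample_weight.
clf_default = BaggingClassifier(random_state=0).fit(X, y)

# Explicit opt-in to frequency semantics: bootstrap size tracks sample_weight.sum().
clf_frequency = BaggingClassifier(max_samples=1.0, random_state=0).fit(X, y)

# Or pin an absolute bootstrap size of one's choice.
clf_fixed = BaggingClassifier(max_samples=100, random_state=0).fit(X, y)
```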
@cakedev0 @antoinebaker see my review above.
Co-authored-by: Arthur Lacote <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Sorry for breaking the linter ;)
The changelog CI is 😱. I set the "No Changelog Needed" label to ignore it, and we need to check the changelog in the rendered doc to make sure that it works as expected. This is one of those edge cases that is not taken into account by the action we are using, and it may not be worth fixing, to be honest. If you are really curious: we are "changing" two fragment files (removing the …).
> set to a float or integer value. When keeping the `max_samples=None` default
> value, the equivalence between fitting with integer weighted data points or
> integer repeated data points is no longer guaranteed because the effective
> bootstrap size is no longer guaranteed to be equivalent.
For reviewers, I updated the notebooks of our sample weight analyzer; in particular I want to highlight the fact that `max_samples=None` breaks frequency weights semantics:
The p-value is very small: the null hypothesis that fitting with integer weights is equivalent to fitting with repeated data points is rejected.
Editing the cell to select either `max_samples=1.0` or `max_samples=100` restores large p-values, even when increasing `n_stochastic_fits`, meaning that we cannot reject the null hypothesis that frequency weights semantics are respected.
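For context, a simplified stand-in for that kind of check could look like the sketch below. This is my own reconstruction, not the analyzer notebook; the two-sample KS test on predictions across seeds is just one plausible way to compare weighted and repeated fits:

```python
# Hedged sketch of a weighted-vs-repeated equivalence check (not the actual tool).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=50, n_features=3, random_state=0)
w = rng.randint(1, 4, size=50)                        # integer frequency weights
X_rep, y_rep = np.repeat(X, w, axis=0), np.repeat(y, w)

def stochastic_predictions(X_fit, y_fit, sw, n_fits=50):
    # Refit with many seeds and record the prediction at one fixed query point.
    preds = []
    for seed in range(n_fits):
        est = BaggingRegressor(max_samples=1.0, random_state=seed)
        est.fit(X_fit, y_fit, sample_weight=sw)
        preds.append(est.predict(X[:1])[0])
    return np.asarray(preds)

p_weighted = stochastic_predictions(X, y, sw=w)
p_repeated = stochastic_predictions(X_rep, y_rep, sw=None)
# Under frequency semantics the two distributions should match (large p-value);
# swapping in the default max_samples=None should make the p-value tiny instead.
print(ks_2samp(p_weighted, p_repeated).pvalue)
```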
Yes, we could recommend (see the sketch after this list for the rescaled-weights hazard):

- `max_samples=some_integer` as the safest choice for `sample_weight` semantics (repeated/weighted equivalence for integer weights, and safe with rescaled weights)
- `max_samples=some_float` as the second best (repeated/weighted equivalence for integer weights, but unsafe with rescaled weights)
- the default `max_samples=None` does not respect the repeated/weighted equivalence for integer weights
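To make "unsafe with rescaled weights" concrete, here is a back-of-the-envelope sketch (my illustration, mirroring the `n_samples_bootstrap` computation quoted later in this thread):

```python
# Why a float max_samples interacts badly with rescaled weights under the
# #31414 semantics: the draw count is a fraction of sample_weight.sum(),
# not of X.shape[0].
import numpy as np

n_samples = 1_000
w = np.ones(n_samples)        # unit weights
w_rescaled = w / w.sum()      # same relative weights, normalized to sum to 1

max_samples = 0.5
print(max(int(max_samples * w.sum()), 1))           # 500 draws
print(max(int(max_samples * w_rescaled.sum()), 1))  # 1 draw: degenerate bootstrap
```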
Is it because I deleted the previous PR changelog (`31414.fix.rst`)?
Yep, you did the right thing, but the CI doesn't cover edge cases like this. I edited my previous comment #32825 (comment) with a more detailed explanation.
> fraction of `sample_weight.sum()` instead of `X.shape[0]` when passed as a
> float. The new default `max_samples=None` draws `X.shape[0]` samples,
> irrespective of `sample_weight`.
> By :user:`Antoine Baker <antoinebaker>`. :pr:`31414` and …
Alternatively, you could keep both `doc/whats_new/upcoming_changes/sklearn.ensemble/31414.fix.rst` and `doc/whats_new/upcoming_changes/sklearn.ensemble/32825.fix.rst` and make sure that they have the same text and no PR number in them. As a result, the towncrier tool will merge them.
Yep, but personally, as I said in #32831, I find this towncrier functionality a bit too magical and prefer the `:pr:` way, which feels simpler and more understandable... I actually saw Christian use the `:pr:` way and quite liked it.
ogrisel left a comment:
Assuming that we are all ok with explicitly not implementing frequency weights by default anymore for this estimator, I am fine with the contents of this PR.
At least the docstring should be explicit enough.
cc @cakedev0 @lucyleeow.
I'm ok with explicitly not implementing frequency weights by default; I think it outweighs the risk of using a very small/big number of samples unexpectedly.
The current docstring of the `fit` method states:

> Note that the expected frequency semantics for the `sample_weight` parameter are only fulfilled when sampling with replacement.

Let's add "and …".
sklearn/ensemble/_bagging.py (outdated diff):
```python
# max_samples Real fractional value relative to weighted_n_samples
n_samples_bootstrap = max(int(max_samples * weighted_n_samples), 1)
# raise warning if n_samples_bootstrap small (< 10)
```
Looks like you may want to update the comment here and mention that this is a heuristic to avoid bootstraps with too few samples. Bonus points if you add a few words about the heuristic 😉
Here is a proposal that you should not follow 😜

```python
# magic cubic root heuristic: you are not supposed to understand this, don't even try!
# Please don't touch this line or the universe may collapse
```
It could be:

```python
# raise a warning if n_samples_bootstrap is suspiciously small
# the heuristic for "suspiciously small" might be adapted if found unsuitable in practice
```
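A hedged sketch of what such a warning could look like; the helper name `_warn_if_bootstrap_too_small` and the threshold of 10 are my assumptions (based on the `< 10` hint in the quoted snippet), not the actual scikit-learn code:

```python
import warnings

def _warn_if_bootstrap_too_small(n_samples_bootstrap, threshold=10):
    # raise a warning if n_samples_bootstrap is suspiciously small;
    # the heuristic for "suspiciously small" might be adapted if found
    # unsuitable in practice (hypothetical helper, threshold assumed)
    if n_samples_bootstrap < threshold:
        warnings.warn(
            f"The effective bootstrap size is only {n_samples_bootstrap}. "
            "This can happen when max_samples is a float and the sample "
            "weights are very small (e.g. normalized to sum to one)."
        )

_warn_if_bootstrap_too_small(max(int(0.5 * 1.0), 1))  # warns: size is 1
```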
What's the plan with `max_samples=None`?
Just when I was about to say I enabled auto-merge 😅. I kind of agree and had the same question, but thought the decision could be made after 1.8 is out. Do you see a big downside to leaving this for later? My feeling is that we are a bit in an edge case.
If we plan to remove it directly in 1.10, it'd be better to document it in the changelog. But maybe leaving it for later just rules option 1 out, and in 1.10 we'll just be left with deprecate or keep :)
IMO, it's here to stay.
The downside being that this default doesn't guarantee the weighted/repeated equivalence. But I trust that people who need this will read the docstring. (Note that in v<=1.7 this equivalence is not guaranteed.)
I was trying to do something that seemed simple, i.e. covering an uncovered line in codecov, but it looks like I will finish it tomorrow instead, since I may as well add a test for the helper function while I am at it ...
Force-pushed from a055e3b to 81652cb.
OK, so I pushed. Summary of my changes: …
I feel like the decision on whether to keep `max_samples=None` can be left for later.
I'm okay to leave the future of `max_samples=None` for later, so as not to delay the release.
Thanks for the final rush edits @lesteve! Sorry, I was off yesterday. For the future of `max_samples=None`, I am happy with the "here to stay" option for now. I feel that if we change our mind and want to get rid of this option in the future, we should first incorporate the changes to random forest (#31529), and do a dedicated PR to deprecate `max_samples=None` and harmonize the APIs of forest/bagging (for example, besides …).
Let's merge this one, thanks a lot everyone!
For the record, I think we can keep `max_samples=None` for now (…). Once this is done, we can clarify in the user guide the common expectations (and known justified exceptions) for sample weight semantics in scikit-learn, and maybe enrich the common tests, and maybe revisit the decision on the default behavior of bagging models based on the accumulated experience.
Co-authored-by: Arthur Lacote <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>
Reference Issues/PRs
Fixes #32805. Follow-up to #31414.
What does this implement/fix? Explain your changes.
We add a new option `max_samples=None` in Bagging estimators, with the meaning of "draw exactly `n_samples`", and we make it the new default. This ensures that the number of bootstrap samples used (a crucial hyperparameter in bagging) will be the same across the 1.7 -> 1.8 transition, and makes the changes introduced in #31414 less disruptive when fitting with `sample_weight`:

- In 1.7, the bagging estimators always drew `n_samples` (discarding `sample_weight`).
- After #31414, the number of drawn samples follows `sample_weight.sum()`. This can potentially lead to catastrophic fits if for some reason the `sample_weight` are very small or normalized to one, for example. We do raise an error in that case, but the changes may still be drastic and surprising to users in less extreme cases.
- Hence the new default `max_samples=None`, which always draws `n_samples` as in 1.7.

cc @cakedev0 @ogrisel @lesteve
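One way to sanity-check the new default from user code could be the sketch below; this is my illustration, assuming the behavior this PR describes and that the public `estimators_samples_` attribute still reports the drawn indices:

```python
# Hedged check: with the default max_samples=None, each bootstrap should have
# X.shape[0] draws even when the weights are tiny (normalized to sum to one).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=0)
tiny_w = np.full(300, 1.0 / 300)  # weights normalized to sum to 1

clf = BaggingClassifier(n_estimators=5, random_state=0)
clf.fit(X, y, sample_weight=tiny_w)
print([len(s) for s in clf.estimators_samples_])  # expected: [300, 300, 300, 300, 300]
```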