
Conversation

@antoinebaker
Contributor

Reference Issues/PRs

Fixes #32805. Follow-up to #31414.

What does this implement/fix? Explain your changes.

We add a new option max_samples=None in Bagging estimators, with the meaning of "draw exactly n_samples", and we make it the new default. This ensures that the number of bootstrap samples used (a crucial hyperparameter in bagging) will be the same across the 1.7 -> 1.8 transition, and makes the changes introduced in #31414 less disruptive when fitting with sample_weight:

  • in 1.7 and before, the default max_samples=1.0 was drawing n_samples (discarding sample_weight)
  • in #31414 (FIX Draw indices using sample_weight in Bagging), the default max_samples=1.0 draws sample_weight.sum() samples. This can lead to catastrophically bad fits if the sample_weight values are very small, for example when normalized to sum to one. We do raise an error in that extreme case, but the changes may still be drastic and surprising to users in less extreme cases.
  • here we add the new default max_samples=None, which always draws n_samples as in 1.7 (see the sketch below).
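
A minimal sketch of the three behaviors, assuming a scikit-learn build with this PR merged (the data, the normalized weights, and the chosen max_samples values are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
# Weights normalized to sum to one: the problematic case when a float
# max_samples is interpreted as a fraction of sample_weight.sum().
sample_weight = np.full(X.shape[0], 1.0 / X.shape[0])

# New default: each bootstrap draws exactly X.shape[0] samples,
# matching the sample counts of 1.7 and earlier.
BaggingClassifier(max_samples=None).fit(X, y, sample_weight=sample_weight)

# Float: fraction of sample_weight.sum() (here 1.0), so the bootstrap
# would collapse to a single sample; this is the case that now errors out.
# BaggingClassifier(max_samples=1.0).fit(X, y, sample_weight=sample_weight)

# Integer: always draws exactly that many samples, whatever the weights.
BaggingClassifier(max_samples=500).fit(X, y, sample_weight=sample_weight)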

cc @cakedev0 @ogrisel @lesteve

@github-actions

github-actions bot commented Dec 1, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 762fee9. Link to the linter CI: here

Contributor

@cakedev0 cakedev0 left a comment


LGTM. A few comments/questions, but nothing blocking.

- In :class:`ensemble.BaggingClassifier` and :class:`ensemble.BaggingRegressor`
the new default `max_samples=None` draws `X.shape[0]` samples, irrespective of
`sample_weight`.
By :user:`Antoine Baker <antoinebaker>`.
Contributor


Is it ok to have two "conflicting" change logs in the same release?

Or would it be better to edit the previous change log?

Contributor Author

@antoinebaker antoinebaker Dec 2, 2025


For me the two changelogs are not in conflict, but complement each other.

31414.fix.rst states that:

Furthermore, max_samples is now
interpreted as a fraction of sample_weight.sum() instead of X.shape[0]
when passed as a float.

That remains true in this PR (for float max_samples); this PR just adds the max_samples=None option.

Member

@lesteve lesteve Dec 3, 2025


At the same time you can decide to merge the two changelog entries if that is simpler to understand from a user perspective. IIRC changelog entries are merged by PR numbers (not 100% sure) and depending on how much activity has gone into this module, they may not be close to each other.

One way to do this is something like what I did in #32831: use a single changelog entry, mention one PR with :pr: in the text, and the other is added automatically based on the filename.

Contributor Author


@lesteve good catch on the missing extension! I tried your hack by merging the two changelogs into the new one (and removing the old one). The output of towncrier seems okay:

:mod:`sklearn.ensemble`
-----------------------

- |Fix| :class:`ensemble.BaggingClassifier`, :class:`ensemble.BaggingRegressor` and
  :class:`ensemble.IsolationForest` now use `sample_weight` to draw the samples
  instead of forwarding them multiplied by a uniformly sampled mask to the
  underlying estimators. Furthermore, `max_samples` is now interpreted as a
  fraction of `sample_weight.sum()` instead of `X.shape[0]` when passed as a
  float. The new default `max_samples=None` draws `X.shape[0]` samples,
  irrespective of `sample_weight`. 
  By :user:`Antoine Baker <antoinebaker>`. :pr:`31414` and :pr:`32825`

@ogrisel ogrisel self-requested a review December 2, 2025 13:54
@lesteve lesteve added this to the 1.8 milestone Dec 2, 2025
Member

@ogrisel ogrisel left a comment


I agree, the behavior might be less surprising for sample_weight users. But we do lose the strict weight/repetition equivalence (frequency semantics) when using the default max_samples=None.

But it's easy to restore the frequency semantics by explicitly choosing max_samples=1.0 (or alternatively max_samples=some_integer_of_choice) after making a conscious decision, so maybe this is not a big deal.
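
For example, a minimal sketch of that conscious opt-in (the integer budget below is illustrative):

from sklearn.ensemble import BaggingClassifier

# Opt back into frequency semantics: the bootstrap size follows the
# total sample weight instead of X.shape[0].
clf = BaggingClassifier(max_samples=1.0)
# Or pin an explicit integer budget per estimator:
clf = BaggingClassifier(max_samples=256)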

@ogrisel
Member

ogrisel commented Dec 3, 2025

@cakedev0 @antoinebaker see my review above.

@github-actions github-actions bot added the CI:Linter failure label Dec 3, 2025
@ogrisel
Member

ogrisel commented Dec 3, 2025

Sorry for breaking the linter ;)

@github-actions github-actions bot removed the CI:Linter failure label Dec 3, 2025
@lesteve
Member

lesteve commented Dec 3, 2025

The changelog CI is 😱. I set the No Changelog Needed label to ignore it; we need to check the changelog in the rendered doc to make sure that it works as expected.

This is one of those edge cases that is not taken into account by the action we are using, and may not be worth fixing to be honest.

If you are really curious: we are "changing" two fragment files (removing the 31414 one and adding the 32825 one), and one of the changed fragments does not match the PR number, so the action complains ...

set to a float or integer value. When keeping the `max_samples=None` default
value, the equivalence between fitting with integer weighted data points or
integer repeated data points is no longer guaranteed because the effective
bootstrap size is no longer guaranteed to be equivalent.
Member


For reviewers, I updated the notebooks of our sample weight analyzer; in particular I want to highlight the fact that max_samples=None breaks frequency weights semantics:

https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_investigative_plots.ipynb

The p-value is very small: the null hypothesis that fitting with integer weights is equivalent to fitting with repeated data points is rejected.

Editing the cell to select either max_samples=1.0 or max_samples=100 restores large p-values, even when increasing n_stochastic_fits, meaning that we cannot reject the null hypothesis that frequency weights semantics are respected.
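
A simplified stand-in for what the audit checks, assuming the post-#31414 semantics (the statistical machinery over many stochastic fits is omitted here):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=50, random_state=0)
w = np.random.default_rng(0).integers(1, 4, size=50)  # integer frequency weights

# Weighted fit: with max_samples=1.0 the bootstrap size is w.sum().
weighted = BaggingClassifier(max_samples=1.0).fit(X, y, sample_weight=w)

# Repeated fit: the bootstrap size is the number of repeated rows,
# which also equals w.sum(), so the two fits are exchangeable.
X_rep, y_rep = np.repeat(X, w, axis=0), np.repeat(y, w)
repeated = BaggingClassifier(max_samples=1.0).fit(X_rep, y_rep)

# With max_samples=None the weighted fit would draw only 50 samples while
# the repeated fit draws w.sum() > 50; that mismatch is the equivalence
# the audit rejects when comparing prediction distributions over many seeds.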

Contributor Author

@antoinebaker antoinebaker Dec 3, 2025


Yes, we could recommend the following (effective bootstrap sizes are illustrated in the sketch after this list):

  1. max_samples = some_integer as the safest choice for sample_weight semantics (repeated/weighted equivalence for integer weights and safe with rescaled weights)
  2. max_samples = some_float as the second best (repeated/weighted equivalence for integer weights but unsafe with rescaled weights)
  3. default max_samples = None does not respect the repeated/weighted equivalence for integer weights
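
To make the size differences concrete, a small arithmetic sketch using the bootstrap-size formula from this PR (the numbers are illustrative):

# Effective bootstrap size per estimator, assuming n_samples = 200 and
# weights normalized so that sample_weight.sum() == 1.0.
n_samples, weighted_n_samples = 200, 1.0

size_int = 100                                      # 1. max_samples=100
size_float = max(int(0.5 * weighted_n_samples), 1)  # 2. max_samples=0.5 -> 1 (!)
size_none = n_samples                               # 3. max_samples=None -> 200

print(size_int, size_float, size_none)  # 100 1 200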

@antoinebaker
Contributor Author

antoinebaker commented Dec 3, 2025

The changelog CI is 😱.

Is it because I deleted the previous PR changelog (31414.fix.rst)?

@lesteve
Member

lesteve commented Dec 3, 2025

The changelog CI is 😱.

Is it because I deleted the previous PR changelog (31414.fix.rst)?

Yep you did the right thing, but the CI doesn't cover edge cases like this. I edited my previous comment #32825 (comment) with a more detailed explanation.

fraction of `sample_weight.sum()` instead of `X.shape[0]` when passed as a
float. The new default `max_samples=None` draws `X.shape[0]` samples,
irrespective of `sample_weight`.
By :user:`Antoine Baker <antoinebaker>`. :pr:`31414` and
Member


Alternatively, you could keep both doc/whats_new/upcoming_changes/sklearn.ensemble/31414.fix.rst and doc/whats_new/upcoming_changes/sklearn.ensemble/32825.fix.rst and make sure that they have the same text and no PR number in them. As a result, the towncrier tool will merge them.

Member

@lesteve lesteve Dec 3, 2025


Yep, but personally, as I said in #32831, I find this towncrier functionality a bit too magical and prefer the :pr: way, which feels simpler and more understandable ... I actually saw Christian use the :pr: way and quite liked it.

Member

@ogrisel ogrisel left a comment


Assuming that we are all ok with explicitly not implementing frequency weights by default anymore for this estimator, I am fine with the contents of this PR.

At least the docstring should be explicit enough.

cc @cakedev0 @lucyleeow.

@cakedev0
Contributor

cakedev0 commented Dec 3, 2025

I'm ok with explicitly not implementing frequency weights by default; I think the benefit outweighs the risk of unexpectedly using a very small/big number of samples.

At least the docstring should be explicit enough.

The current docstring of the fit method states: Note that the expected frequency semantics for the sample_weight parameter are only fulfilled when sampling with replacement (bootstrap=True).

Let's add "and max_samples is a float" at the end and it should be very clear.


# max_samples Real fractional value relative to weighted_n_samples
n_samples_bootstrap = max(int(max_samples * weighted_n_samples), 1)
# raise warning if n_samples_bootstrap small (< 10)
Member

@lesteve lesteve Dec 4, 2025


Looks like you may want to update the comment here and mention that this is a heuristic to avoid bootstraps with too few samples. Bonus points if you add a few words about the heuristic 😉

Here is a proposal that you should not follow 😜

# magic cubic root heuristic: you are not supposed to understand this, don't even try!
# Please don't touch this line or the universe may collapse

Contributor


It could be:

# raise a warning if n_samples_bootstrap is suspiciously small
# the heuristic for "suspiciously small" might be adapted if found unsuitable in practice
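
Putting the pieces together, a rough sketch of the dispatch being discussed; the helper name, the branch structure, and the threshold of 10 are assumptions, not the actual private function:

import warnings
from numbers import Integral, Real

def _n_samples_bootstrap_sketch(max_samples, n_samples, weighted_n_samples):
    # Hypothetical stand-in for the private helper under review.
    if max_samples is None:
        # New default: draw exactly n_samples, as in scikit-learn 1.7.
        return n_samples
    if isinstance(max_samples, Integral):
        # Integer budget: used verbatim, robust to weight rescaling.
        return int(max_samples)
    if isinstance(max_samples, Real):
        # max_samples Real fractional value relative to weighted_n_samples
        n = max(int(max_samples * weighted_n_samples), 1)
        if n < 10:
            # raise a warning if n_samples_bootstrap is suspiciously small;
            # the heuristic might be adapted if found unsuitable in practice
            warnings.warn(
                f"The effective bootstrap size is only {n}; check the scale "
                "of sample_weight or set max_samples explicitly."
            )
        return n
    # Illustrative validation (the merged PR dropped such a check).
    raise TypeError("max_samples must be None, an int or a float")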

@jeremiedbb
Member

What's the plan with max_samples=None in the future? It's not clear to me from reading the discussion here:

  • Is it to be removed in 1.10? In that case it should be mentioned in the changelog, and a comment to remind us to remove it should be added.
  • Is it to be deprecated in 1.10 and removed in 1.12? Not sure that's necessary; in that case I'd just directly remove it in 1.10.
  • Is it to stay?

@lesteve
Member

lesteve commented Dec 4, 2025

What's the plan with max_samples=None in the future? It's not clear to me from reading the discussion here

Just when I was about to say I enabled auto-merge 😅.

I kind of agree and had the same question, but thought the decision could be made after 1.8 is out.

Do you see a big downside to leaving this for later? My feeling is that we are a bit in an edge case: Bagging{Classifier,Regressor} is not used that much, I would guess, and Bagging{Classifier,Regressor} with sample_weight probably even less ...

@jeremiedbb
Member

jeremiedbb commented Dec 4, 2025

Do you see a big downside leaving this for later?

If we plan to remove it directly in 1.10, it'd be better to document it in the changelog. But maybe leaving it for later just rules option 1 out, and in 1.10 we'll just be left with deprecate or keep :)

@cakedev0
Contributor

cakedev0 commented Dec 4, 2025

IMO, it's here to stay:

  • to match the API of random forests,
  • to avoid a backward-incompatible change to how weights affect training in the default case,
  • and to avoid having the default case work in a way that could lead to very poor learning.

The downside being that this default doesn't guarantee the weighted/repeated equivalence. But I trust that people who need this will read the docstring for sample_weight and hence be aware of it.

(Note that in v<=1.7 this equivalence is not guaranteed)

@lesteve
Member

lesteve commented Dec 4, 2025

I was trying to do something that seemed simple, i.e. covering an uncovered line in codecov, but it looks like I will finish it tomorrow instead, since I may as well add a test for the helper function while I am at it ...

@lesteve lesteve force-pushed the bagging_max_samples branch from a055e3b to 81652cb on December 5, 2025 08:25
@lesteve
Member

lesteve commented Dec 5, 2025

OK, so I pushed 81652cb (#32825). I will merge this PR in an hour or so unless I hear someone shouting strongly!

Summary of my changes:

  • I added some tests for the helper function (see the sketch after this comment)
  • I removed the raise ValueError if max_samples is not an int, not a float, and not None. This is a private function, we can change our mind, and it was the only place where parameters were checked, which felt slightly weird. Ironically, I started adding a test because this line was not covered, oh well 🤷

I feel like the decision whether max_samples=None is here to stay can be left for later (it looks like it is here to stay for the moment).
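
For illustration, a hypothetical test along those lines, exercising the _n_samples_bootstrap_sketch helper sketched above (names and the threshold remain assumptions):

import pytest

def test_n_samples_bootstrap_sketch():
    # None: exactly n_samples, ignoring the weighted total.
    assert _n_samples_bootstrap_sketch(None, 100, 3.0) == 100
    # int: used verbatim.
    assert _n_samples_bootstrap_sketch(42, 100, 3.0) == 42
    # float: fraction of weighted_n_samples.
    assert _n_samples_bootstrap_sketch(0.5, 100, 200.0) == 100
    # Suspiciously small bootstrap sizes trigger the warning.
    with pytest.warns(UserWarning):
        assert _n_samples_bootstrap_sketch(1.0, 100, 1.0) == 1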

@jeremiedbb
Member

I'm okay with leaving the future of max_samples=None for later so as not to delay the release.
That said, I'm not sure that we'll want to keep it as the default, because I find it weird that the default doesn't properly handle sample weights (according to our expectations for sample weights). Let's keep the discussion open for the next release.

@antoinebaker
Contributor Author

Thanks for the final rush edits @lesteve! Sorry I was off yesterday. For the future of max_samples=None, I am happy with the "here to stay" option for now.

I feel that if we change our mind and want to get rid of this option in the future, we should first incorporate the changes to random forests (#31529) and do a dedicated PR to deprecate max_samples=None and harmonize the forest/bagging APIs (for example, besides max_samples, the bootstrap attribute does not have the same meaning in the two estimators).

@lesteve lesteve merged commit ec4f93c into scikit-learn:main Dec 5, 2025
39 checks passed
@lesteve
Member

lesteve commented Dec 5, 2025

Let's merge this one, thanks a lot everyone!

@ogrisel
Member

ogrisel commented Dec 5, 2025

For the record, I think we can keep max_samples=None indefinitely for now, and focus our efforts on making it possible to guarantee frequency weight semantics (at least for some documented combinations of parameters) in the most important scikit-learn estimators and metrics.

Once this is done, we can clarify in the user guide the common expectations (and known justified exceptions) for sample weight semantics in scikit-learn, maybe enrich the common tests, and maybe revisit the decision for the default behavior of bagging models based on the accumulated experience.

lesteve added a commit to lesteve/scikit-learn that referenced this pull request Dec 9, 2025
…arn#32825)

Co-authored-by: Arthur Lacote <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>
@lesteve lesteve mentioned this pull request Dec 9, 2025
lesteve added a commit that referenced this pull request Dec 9, 2025
Co-authored-by: Arthur Lacote <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>

Successfully merging this pull request may close these issues.

RFC: Bagging estimators: avoid changing max_samples default behavior in 1.8
