Add new default max_samples=None in Bagging estimators #32825
Conversation
cakedev0 left a comment:
LGTM. A few comments/questions, but nothing blocking.
> - In :class:`ensemble.BaggingClassifier` and :class:`ensemble.BaggingRegressor`
>   the new default `max_samples=None` draws `X.shape[0]` samples, irrespective of
>   `sample_weight`.
>   By :user:`Antoine Baker <antoinebaker>`.
Is it ok to have two "conflicting" change logs in the same release?
Or would it be better to edit the previous change log?
For me the two changelogs are not in conflict, but complement each other.
`31414.fix.rst` states that:

> Furthermore, `max_samples` is now interpreted as a fraction of
> `sample_weight.sum()` instead of `X.shape[0]` when passed as a float.

It remains true in this PR (for float `max_samples`). This PR just adds the `max_samples=None` option.
At the same time you can decide to merge the two changelog entries if that is simpler to understand from a user perspective. IIRC changelog entries are merged by PR numbers (not 100% sure) and depending on how much activity has gone into this module, they may not be close to each other.
One way to do this is something like I did in #32831. Basically you use a single changelog entry and you mention one PR with :pr: in the text and the other one is added automatically based on the filename.
@lesteve good catch on the missing extension! I tried your hack by merging the two changelogs into the new one (and removing the old one). The output of towncrier seems okay:
:mod:`sklearn.ensemble`
-----------------------
- |Fix| :class:`ensemble.BaggingClassifier`, :class:`ensemble.BaggingRegressor` and
:class:`ensemble.IsolationForest` now use `sample_weight` to draw the samples
instead of forwarding them multiplied by a uniformly sampled mask to the
underlying estimators. Furthermore, `max_samples` is now interpreted as a
fraction of `sample_weight.sum()` instead of `X.shape[0]` when passed as a
float. The new default `max_samples=None` draws `X.shape[0]` samples,
irrespective of `sample_weight`.
By :user:`Antoine Baker <antoinebaker>`. :pr:`31414` and :pr:`32825`
ogrisel left a comment:
I agree, the behavior might be less surprising for `sample_weight` users. But we do lose the strict weight/repetition equivalence (frequency semantics) when using the default `max_samples=None`.
But it's easy to restore the frequency semantics by explicitly choosing `max_samples=1.0` (or alternatively `max_samples=some_integer_of_choice`) after making a conscious decision, so maybe this is not a big deal.
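To make the opt-in concrete, here is a minimal sketch (my illustration, not code from the PR) of the three choices discussed here; it assumes a scikit-learn version where `max_samples=None` is the new Bagging default:

```python
# Minimal sketch, assuming the scikit-learn >= 1.8 behavior described in this PR.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# New default: draws X.shape[0] samples, irrespective of sample_weight.
clf_default = BaggingClassifier(random_state=0).fit(X, y)

# Explicit opt-in to frequency semantics: bootstrap size tracks sample_weight.sum().
clf_frequency = BaggingClassifier(max_samples=1.0, random_state=0).fit(X, y)

# Or pin an absolute bootstrap size of one's choice.
clf_fixed = BaggingClassifier(max_samples=100, random_state=0).fit(X, y)
```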
@cakedev0 @antoinebaker see my review above.
Co-authored-by: Arthur Lacote <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Sorry for breaking the linter ;)
The changelog CI is 😱. I set the "No Changelog Needed" label to ignore it, and we need to check the changelog in the rendered doc to make sure that it works as expected. This is one of those edge cases that is not taken into account by the action we are using, and it may not be worth fixing, to be honest. If you are really curious: we are "changing" two fragment files (removing the …).
> set to a float or integer value. When keeping the `max_samples=None` default
> value, the equivalence between fitting with integer weighted data points or
> integer repeated data points is no longer guaranteed because the effective
> bootstrap size is no longer guaranteed to be equivalent.
For reviewers, I updated the notebooks of our sample weight analyzer; in particular I want to highlight the fact that `max_samples=None` breaks frequency weights semantics:
The p-value is very small: the null hypothesis that fitting with integer weights is equivalent to fitting with repeated data points is rejected.
Editing the cell to select either `max_samples=1.0` or `max_samples=100` restores large p-values, even when increasing `n_stochastic_fits`, meaning that we cannot reject the null hypothesis that frequency weights semantics are respected.
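For context, a simplified stand-in for that kind of check could look like the sketch below. This is my own reconstruction, not the analyzer notebook; the two-sample KS test on predictions across seeds is just one plausible way to compare weighted and repeated fits:

```python
# Hedged sketch of a weighted-vs-repeated equivalence check (not the actual tool).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=50, n_features=3, random_state=0)
w = rng.randint(1, 4, size=50)                        # integer frequency weights
X_rep, y_rep = np.repeat(X, w, axis=0), np.repeat(y, w)

def stochastic_predictions(X_fit, y_fit, sw, n_fits=50):
    # Refit with many seeds and record the prediction at one fixed query point.
    preds = []
    for seed in range(n_fits):
        est = BaggingRegressor(max_samples=1.0, random_state=seed)
        est.fit(X_fit, y_fit, sample_weight=sw)
        preds.append(est.predict(X[:1])[0])
    return np.asarray(preds)

p_weighted = stochastic_predictions(X, y, sw=w)
p_repeated = stochastic_predictions(X_rep, y_rep, sw=None)
# Under frequency semantics the two distributions should match (large p-value);
# swapping in the default max_samples=None should make the p-value tiny instead.
print(ks_2samp(p_weighted, p_repeated).pvalue)
```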
Yes, we could recommend (see the sketch after this list for the rescaled-weights hazard):

- `max_samples=some_integer` as the safest choice for `sample_weight` semantics (repeated/weighted equivalence for integer weights, and safe with rescaled weights)
- `max_samples=some_float` as the second best (repeated/weighted equivalence for integer weights, but unsafe with rescaled weights)
- the default `max_samples=None` does not respect the repeated/weighted equivalence for integer weights
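To make "unsafe with rescaled weights" concrete, here is a back-of-the-envelope sketch (my illustration, mirroring the `n_samples_bootstrap` computation quoted later in this thread):

```python
# Why a float max_samples interacts badly with rescaled weights under the
# #31414 semantics: the draw count is a fraction of sample_weight.sum(),
# not of X.shape[0].
import numpy as np

n_samples = 1_000
w = np.ones(n_samples)        # unit weights
w_rescaled = w / w.sum()      # same relative weights, normalized to sum to 1

max_samples = 0.5
print(max(int(max_samples * w.sum()), 1))           # 500 draws
print(max(int(max_samples * w_rescaled.sum()), 1))  # 1 draw: degenerate bootstrap
```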
Is it because I deleted the previous PR changelog (`31414.fix.rst`)?
Yep, you did the right thing, but the CI doesn't cover edge cases like this. I edited my previous comment #32825 (comment) with a more detailed explanation.
> fraction of `sample_weight.sum()` instead of `X.shape[0]` when passed as a
> float. The new default `max_samples=None` draws `X.shape[0]` samples,
> irrespective of `sample_weight`.
> By :user:`Antoine Baker <antoinebaker>`. :pr:`31414` and …
Alternatively, you could keep both `doc/whats_new/upcoming_changes/sklearn.ensemble/31414.fix.rst` and `doc/whats_new/upcoming_changes/sklearn.ensemble/32825.fix.rst` and make sure that they have the same text and no PR number in them. As a result, the towncrier tool will merge them.
Yep, but personally, as I said in #32831, I find this towncrier functionality a bit too magical and prefer the `:pr:` way, which feels simpler and more understandable... I actually saw Christian use the `:pr:` way and quite liked it.
ogrisel left a comment:
Assuming that we are all ok with explicitly not implementing frequency weights by default anymore for this estimator, I am fine with the contents of this PR.
At least the docstring should be explicit enough.
cc @cakedev0 @lucyleeow.
I'm ok with explicitly not implementing frequency weights by default; I think it outweighs the risk of using a very small/big number of samples unexpectedly.
The current docstring of the `fit` method states:

> Note that the expected frequency semantics for the `sample_weight` parameter are only fulfilled when sampling with replacement.

Let's add "and …".
sklearn/ensemble/_bagging.py (outdated diff):
```python
# max_samples Real fractional value relative to weighted_n_samples
n_samples_bootstrap = max(int(max_samples * weighted_n_samples), 1)
# raise warning if n_samples_bootstrap small (< 10)
```
Looks like you may want to update the comment here and mention that this is a heuristic to avoid bootstraps with too few samples. Bonus points if you add a few words about the heuristic 😉
Here is a proposal that you should not follow 😜

```python
# magic cubic root heuristic: you are not supposed to understand this, don't even try!
# Please don't touch this line or the universe may collapse
```
It could be:

```python
# raise a warning if n_samples_bootstrap is suspiciously small
# the heuristic for "suspiciously small" might be adapted if found unsuitable in practice
```
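A hedged sketch of what such a warning could look like; the helper name `_warn_if_bootstrap_too_small` and the threshold of 10 are my assumptions (based on the `< 10` hint in the quoted snippet), not the actual scikit-learn code:

```python
import warnings

def _warn_if_bootstrap_too_small(n_samples_bootstrap, threshold=10):
    # raise a warning if n_samples_bootstrap is suspiciously small;
    # the heuristic for "suspiciously small" might be adapted if found
    # unsuitable in practice (hypothetical helper, threshold assumed)
    if n_samples_bootstrap < threshold:
        warnings.warn(
            f"The effective bootstrap size is only {n_samples_bootstrap}. "
            "This can happen when max_samples is a float and the sample "
            "weights are very small (e.g. normalized to sum to one)."
        )

_warn_if_bootstrap_too_small(max(int(0.5 * 1.0), 1))  # warns: size is 1
```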
What's the plan with `max_samples=None`?
Just when I was about to say I enabled auto-merge 😅. I kind of agree and had the same question, but thought the decision could be made after 1.8 is out. Do you see a big downside to leaving this for later? My feeling is that we are a bit in an edge case.
If we plan to remove it directly in 1.10, it'd be better to document it in the changelog. But maybe leaving it for later just rules option 1 out, and in 1.10 we'll just be left with deprecate or keep :)
IMO, it's here to stay.
The downside being that this default doesn't guarantee the weighted/repeated equivalence. But I trust that people who need this will read the docstring. (Note that in v<=1.7 this equivalence is not guaranteed.)
I was trying to do something that seemed simple, i.e. covering an uncovered line in codecov, but it looks like I will finish it tomorrow instead, since I may as well add a test for the helper function while I am at it ...
Force-pushed from a055e3b to 81652cb.
OK, so I pushed. Summary of my changes: …
I feel like the decision on whether to keep `max_samples=None` can be left for later.
I'm okay to leave the future of `max_samples=None` for later, so as not to delay the release.
Thanks for the final rush edits @lesteve! Sorry, I was off yesterday. For the future of `max_samples=None`, I am happy with the "here to stay" option for now. I feel that if we change our mind and want to get rid of this option in the future, we should first incorporate the changes to random forest (#31529), and do a dedicated PR to deprecate `max_samples=None` and harmonize the APIs of forest/bagging (for example, besides …).
Let's merge this one, thanks a lot everyone!
For the record, I think we can keep `max_samples=None` for now (…). Once this is done, we can clarify in the user guide the common expectations (and known justified exceptions) for sample weight semantics in scikit-learn, and maybe enrich the common tests, and maybe revisit the decision on the default behavior of bagging models based on the accumulated experience.
Co-authored-by: Arthur Lacote <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>
Reference Issues/PRs
Fixes #32805. Follow-up to #31414.
What does this implement/fix? Explain your changes.
We add a new option `max_samples=None` in Bagging estimators, with the meaning of "draw exactly `n_samples`", and we make it the new default. This ensures that the number of bootstrap samples used (a crucial hyperparameter in bagging) will be the same across the 1.7 -> 1.8 transition, and makes the changes introduced in #31414 less disruptive when fitting with `sample_weight`:

- In 1.7, the bagging estimators always drew `n_samples` (discarding `sample_weight`).
- After #31414, the number of drawn samples follows `sample_weight.sum()`. This can potentially lead to catastrophic fits if for some reason the `sample_weight` are very small or normalized to one, for example. We do raise an error in that case, but the changes may still be drastic and surprising to users in less extreme cases.
- Hence the new default `max_samples=None`, which always draws `n_samples` as in 1.7.

cc @cakedev0 @ogrisel @lesteve
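One way to sanity-check the new default from user code could be the sketch below; this is my illustration, assuming the behavior this PR describes and that the public `estimators_samples_` attribute still reports the drawn indices:

```python
# Hedged check: with the default max_samples=None, each bootstrap should have
# X.shape[0] draws even when the weights are tiny (normalized to sum to one).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=0)
tiny_w = np.full(300, 1.0 / 300)  # weights normalized to sum to 1

clf = BaggingClassifier(n_estimators=5, random_state=0)
clf.fit(X, y, sample_weight=tiny_w)
print([len(s) for s in clf.estimators_samples_])  # expected: [300, 300, 300, 300, 300]
```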