FIX stratification in StratifiedGroupKFold when shuffle=True #32540
ogrisel
left a comment
LGTM besides the following nitpicks:
doc/whats_new/upcoming_changes/sklearn.model_selection/32540.fix.rst
Thanks a lot for the quick review and helpful suggestions, @ogrisel.
lucyleeow
left a comment
Thanks @pfolch ! LGTM!
Small doc nits amended only.
doc/whats_new/upcoming_changes/sklearn.model_selection/32540.fix.rst
…learn#32540) Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Lucy Liu <[email protected]>
Reference Issues/PRs
Fixes #32478
cc: @ogrisel
What does this implement/fix? Explain your changes.
When `StratifiedGroupKFold` is instantiated with `shuffle=True`, the implementation shuffles `y_counts_per_group` in place, but does not update the internal `groups_inv` mapping that encodes each sample’s group index. This desynchronizes the “row i ↔ group i” invariant. The greedy assignment then operates on permuted rows, while the final test indices are gathered using the unpermuted `groups_inv`, yielding test folds that correspond to incorrect (effectively random) groups. In practice, this makes the behavior equivalent to `GroupKFold` with a random group assignment rather than a stratified grouping.

Change: when `shuffle=True`, apply a single permutation to `y_counts_per_group` and remap `groups_inv` by the inverse permutation. This keeps the internal encoding aligned so that each row in `y_counts_per_group` still refers to the same group index used to build the test folds. The stable variance-based sort continues to use the shuffled order only as a tie-breaker for groups with identical class-variance.

No API changes; negligible performance impact. Reproducibility with an integer `random_state` is preserved.

Any other comments?
Test added: `test_stratified_group_kfold_shuffle_preserves_stratification`, a non-regression test that runs its checks across many random seeds. This reproduces the scenario from the issue and fails on main prior to this fix.

Changelog: a short entry was added to document the bug fix in `StratifiedGroupKFold` when `shuffle=True`.
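
For reviewers who want to see the "row i ↔ group i" invariant in isolation, here is a minimal sketch of the remapping described under "What does this implement/fix?". It is not the code from this PR: the helper names `counts_from_samples` and `shuffle_groups_consistently` are hypothetical, plain Python lists stand in for the NumPy arrays, and `random.Random` stands in for the estimator's random state. Only the names `y_counts_per_group` and `groups_inv` come from the description above.

```python
import random


def counts_from_samples(y_inv, groups_inv, n_groups, n_classes):
    """Build per-group class counts: row g holds the class counts of group g."""
    counts = [[0] * n_classes for _ in range(n_groups)]
    for class_idx, group_idx in zip(y_inv, groups_inv):
        counts[group_idx][class_idx] += 1
    return counts


def shuffle_groups_consistently(y_counts_per_group, groups_inv, seed):
    """Permute the rows of the count table and remap groups_inv to match.

    Hypothetical helper: after the call, row j of the returned table still
    describes exactly the samples whose remapped group index is j.
    """
    n_groups = len(y_counts_per_group)
    perm = list(range(n_groups))
    random.Random(seed).shuffle(perm)

    # New row j holds the counts of old group perm[j].
    shuffled_counts = [y_counts_per_group[old] for old in perm]

    # Inverse permutation: old group index -> new row index.
    inverse = [0] * n_groups
    for new, old in enumerate(perm):
        inverse[old] = new

    # Each sample follows its group to the group's new row.
    remapped_inv = [inverse[g] for g in groups_inv]
    return shuffled_counts, remapped_inv


# Toy data (assumed for illustration): 8 samples, 2 classes, 4 groups.
y_inv = [0, 0, 1, 1, 1, 0, 1, 0]
groups_inv = [0, 0, 1, 1, 2, 2, 3, 3]
counts = counts_from_samples(y_inv, groups_inv, n_groups=4, n_classes=2)

shuffled_counts, remapped_inv = shuffle_groups_consistently(
    counts, groups_inv, seed=0
)

# Invariant: recomputing the table from the remapped indices reproduces
# the shuffled table, so rows and group indices stay aligned.
assert shuffled_counts == counts_from_samples(y_inv, remapped_inv, 4, 2)
```

Shuffling the rows while leaving `groups_inv` untouched would break the final assertion whenever the permutation moves at least one group whose counts differ from those of the group it replaces, which is the desynchronization the description attributes to the previous behavior.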