Conversation

Contributor

@pfolch pfolch commented Oct 19, 2025

Reference Issues/PRs

Fixes #32478
cc: @ogrisel

What does this implement/fix? Explain your changes.

When StratifiedGroupKFold is instantiated with shuffle=True, the implementation shuffles y_counts_per_group in place, but does not update the internal groups_inv mapping that encodes each sample’s group index. This desynchronizes the “row i ↔ group i” invariant. The greedy assignment then operates on permuted rows, while the final test indices are gathered using the unpermuted groups_inv, yielding test folds that correspond to incorrect (effectively random) groups. In practice, this makes the behavior equivalent to GroupKFold with a random group assignment rather than a stratified grouping.

Change: when shuffle=True, apply a single permutation to y_counts_per_group and remap groups_inv by the inverse permutation. This keeps the internal encoding aligned so that each row in y_counts_per_group still refers to the same group index used to build the test folds. The stable variance-based sort continues to use the shuffled order only as a tie-breaker for groups with identical class-variance.
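The remapping described above can be sketched with NumPy as follows. This is a minimal illustration, not the code merged in scikit-learn: the helper names `shuffle_groups_aligned` and `count_classes_per_group` and the toy arrays are assumptions made for the example, and only `y_counts_per_group` and `groups_inv` come from the description.

```python
import numpy as np


def shuffle_groups_aligned(y_counts_per_group, groups_inv, rng):
    """Shuffle per-group rows and remap groups_inv to match (hypothetical helper)."""
    n_groups = y_counts_per_group.shape[0]
    perm = rng.permutation(n_groups)
    # Row j of the shuffled table holds the counts of old group perm[j].
    shuffled_counts = y_counts_per_group[perm]
    # Inverse permutation: old group g now lives at row inv_perm[g].
    inv_perm = np.empty_like(perm)
    inv_perm[perm] = np.arange(n_groups)
    return shuffled_counts, inv_perm[groups_inv]


def count_classes_per_group(y_inv, groups_inv, n_groups, n_classes):
    """Build the (n_groups, n_classes) table of class counts (hypothetical helper)."""
    counts = np.zeros((n_groups, n_classes), dtype=int)
    np.add.at(counts, (groups_inv, y_inv), 1)
    return counts


# Toy data: 8 samples, 4 groups, 2 classes (values chosen only for illustration).
y_inv = np.array([0, 0, 1, 1, 0, 1, 1, 1])
groups_inv = np.array([0, 0, 1, 1, 2, 2, 3, 3])
counts = count_classes_per_group(y_inv, groups_inv, n_groups=4, n_classes=2)

shuffled_counts, remapped_inv = shuffle_groups_aligned(
    counts, groups_inv, np.random.RandomState(0)
)

# Recounting from the remapped indices reproduces the shuffled table,
# so each row still describes the group that the sample indices point to.
aligned = np.array_equal(
    count_classes_per_group(y_inv, remapped_inv, n_groups=4, n_classes=2),
    shuffled_counts,
)
print(aligned)  # True
```

Shuffling the rows without remapping `groups_inv` would break this equality for any permutation that moves a group with a different class profile, which is the desynchronization reported in the issue.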

No API changes; negligible performance impact. Reproducibility with an integer random_state is preserved.

Any other comments?

Test added: test_stratified_group_kfold_shuffle_preserves_stratification, a non-regression test that checks, across many random seeds, that:

  • no group appears on both sides of any split,
  • train/test class distributions match the global dataset distribution,
  • fold sizes remain balanced (peak-to-peak ≤ 1).

This reproduces the scenario from the issue and fails on main prior to this fix.
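As a rough sketch of how these three checks might be expressed, the snippet below summarizes a list of (train, test) index pairs. It is not the test added in this PR: the helper name `summarize_splits` and the hand-built toy splits are assumptions made for illustration.

```python
import numpy as np


def summarize_splits(splits, y, groups):
    """Return (groups_disjoint, max_ratio_gap, test_size_spread) for the given splits.

    Hypothetical helper: max_ratio_gap is the largest absolute difference between
    any per-split class ratio (train or test) and the global class ratio.
    """
    y = np.asarray(y)
    groups = np.asarray(groups)
    classes = np.unique(y)
    global_ratios = np.array([(y == c).mean() for c in classes])

    groups_disjoint = True
    max_ratio_gap = 0.0
    test_sizes = []
    for train_idx, test_idx in splits:
        # Check 1: no group on both sides of the split.
        if np.intersect1d(groups[train_idx], groups[test_idx]).size > 0:
            groups_disjoint = False
        # Check 2: class ratios on each side versus the whole dataset.
        for idx in (train_idx, test_idx):
            ratios = np.array([(y[idx] == c).mean() for c in classes])
            gap = float(np.abs(ratios - global_ratios).max())
            max_ratio_gap = max(max_ratio_gap, gap)
        test_sizes.append(len(test_idx))
    # Check 3: peak-to-peak spread of the test fold sizes.
    test_size_spread = max(test_sizes) - min(test_sizes)
    return groups_disjoint, max_ratio_gap, test_size_spread


# Hand-built two-fold split over 4 groups of 2 samples each (illustrative data).
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
splits = [
    (np.array([4, 5, 6, 7]), np.array([0, 1, 2, 3])),
    (np.array([0, 1, 2, 3]), np.array([4, 5, 6, 7])),
]
summary = summarize_splits(splits, y, groups)
print(summary)  # (True, 0.0, 0)
```

In practice the splits would come from `StratifiedGroupKFold(n_splits=..., shuffle=True, random_state=seed).split(X, y, groups)`, repeated over several seeds, with a tolerance on the ratio gap chosen to suit the dataset.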

Changelog: a short entry was added to document the bug fix in StratifiedGroupKFold when shuffle=True.

github-actions bot commented Oct 19, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ede06d1.

Member

@ogrisel ogrisel left a comment

LGTM besides the following nitpicks:

Contributor Author

pfolch commented Oct 20, 2025

Thanks a lot for the quick review and helpful suggestions, @ogrisel.
I’ve renamed the variable to expected_class_ratios and accepted the changelog edit. I also updated the PR, and CI is green.

@ogrisel added the "Quick Review" and "Waiting for Second Reviewer" labels Oct 21, 2025
Member

@lucyleeow lucyleeow left a comment

Thanks @pfolch ! LGTM!

Small doc nits amended only.

@lucyleeow lucyleeow enabled auto-merge (squash) October 29, 2025 00:50
@lucyleeow lucyleeow merged commit 70be2e2 into scikit-learn:main Oct 29, 2025
38 checks passed
rouk1 pushed a commit to rouk1/scikit-learn that referenced this pull request Nov 3, 2025
@pfolch pfolch deleted the fix-StratifiedGroupKFold-shuffle-indices branch November 5, 2025 08:45

Labels

module:model_selection, Quick Review, Waiting for Second Reviewer

Development

Successfully merging this pull request may close these issues.

StratifiedGroupKFold(shuffle=True) breaks group-index mapping, leading to incorrect (effectively random) group assignments per fold

3 participants