-
-
Notifications
You must be signed in to change notification settings - Fork 26.3k
ENH Adds groups parameter to TargetEncoder with validation for non-overlapping CV splits #32239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ENH Adds groups parameter to TargetEncoder with validation for non-overlapping CV splits #32239
Conversation
…erlapping CV splits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be using metadata routing to route groups
, and isn't at the moment.
We also decided if the user wants to user groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.
Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.
val_set = set(val_idx) | ||
|
||
# Check for overlap with previous validation sets | ||
if all_val_indices.intersection(val_set): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can avoid using iterative intersection
operations by simply creating a union, and counting the number of indices returned at each iteration. In the end the set's size needs to be exactly n_samples
, and the sum of counts should be the same number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hii, I’ve updated the overlap validation logic as suggested, replacing the iterative intersection check with a union-based approach.
I understand that the preferred direction is to handle groups through metadata routing and require users to explicitly pass a grouped CV object, rather than switching automatically inside TargetEncoder. I’d be very interested in contributing to this area, but I’ll need to study the metadata routing system more deeply before I can contribute meaningfully here. In the meantime, others are of course welcome to continue building on this PR. Thanks again to @adrinjalali, and @thomasjpfan for the guidance! |
Summary
Enhancement: Add groups parameter to TargetEncoder
This PR extends TargetEncoder with an optional groups argument:
If groups is provided → use GroupKFold.
Otherwise → fall back to existing behavior (KFold for continuous targets, StratifiedKFold for classification).
Also adds _validate_cv_no_overlap utility to raise if custom CV splitters produce overlapping validation sets.
Added tests for:
Using groups with binary, continuous, and multiclass targets.
Behavior difference with/without groups.
Error when using a custom CV splitter that overlaps.
Motivation
Currently, TargetEncoder always uses KFold or StratifiedKFold for internal cross-fitting.
This can cause data leakage when samples are not independent (e.g., repeated measures, clustered patients, or time-series grouped by entity).
By adding a groups parameter, users can ensure that samples from the same group always appear in the same fold, preventing leakage and producing more reliable encodings.
Remaining work
Update the documentation (user guide + TargetEncoder docstring).
Related Issue
Closes: #32076
Thanks to @MatthiasLoefflerQC for opening the issue and @adrinjalali & @thomasjpfan for guiding the implementation.