samruddhibaviskar11

Summary

Enhancement: Add groups parameter to TargetEncoder
This PR extends TargetEncoder with an optional groups argument:
If groups is provided → use GroupKFold.
Otherwise → fall back to existing behavior (KFold for continuous targets, StratifiedKFold for classification).
Also adds _validate_cv_no_overlap utility to raise if custom CV splitters produce overlapping validation sets.
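
For illustration, the selection rule described above could be sketched as follows. This is a minimal sketch, not the code in this PR: `choose_cv` is a hypothetical helper name, and the `n_splits`, `shuffle`, and `random_state` defaults are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold
from sklearn.utils.multiclass import type_of_target


def choose_cv(y, groups=None, n_splits=5, shuffle=True, random_state=None):
    """Hypothetical helper: pick a splitter following the rule in the summary."""
    if groups is not None:
        # Groups given: keep every group inside a single fold.
        return GroupKFold(n_splits=n_splits)
    if type_of_target(y) == "continuous":
        # Regression target: plain K-fold.
        return KFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    # Binary or multiclass target: preserve class proportions per fold.
    return StratifiedKFold(
        n_splits=n_splits, shuffle=shuffle, random_state=random_state
    )


y_continuous = np.array([0.1, 0.7, 1.3, 2.9, 3.5, 4.1])
y_binary = np.array([0, 1, 0, 1, 0, 1])
sample_groups = np.array([0, 0, 1, 1, 2, 2])

print(type(choose_cv(y_continuous)).__name__)
print(type(choose_cv(y_binary)).__name__)
print(type(choose_cv(y_binary, groups=sample_groups)).__name__)
```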

Added tests for:

Using groups with binary, continuous, and multiclass targets.
Behavior difference with/without groups.
Error when using a custom CV splitter that overlaps.

Motivation

Currently, TargetEncoder always uses KFold or StratifiedKFold for internal cross-fitting.
This can cause data leakage when samples are not independent (e.g., repeated measures, clustered patients, or time-series grouped by entity).
By adding a groups parameter, users can ensure that samples from the same group always appear in the same fold, preventing leakage and producing more reliable encodings.
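
A small, self-contained illustration of the concern, using toy data invented for this example (12 samples from 4 interleaved groups): with plain KFold a group can land in both the training and the validation part of a fold, while GroupKFold keeps each group on one side. `leaky_folds` is a hypothetical helper written only for this sketch.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy data: group labels are interleaved, e.g. repeated measures per entity.
groups = np.tile(np.arange(4), 3)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
X = np.zeros((groups.shape[0], 1))


def leaky_folds(splits, groups):
    """Count folds in which some group is in both train and validation."""
    return sum(
        bool(set(groups[train_idx]) & set(groups[val_idx]))
        for train_idx, val_idx in splits
    )


kfold_leaks = leaky_folds(KFold(n_splits=4).split(X), groups)
group_leaks = leaky_folds(GroupKFold(n_splits=4).split(X, groups=groups), groups)

print("KFold folds sharing a group:", kfold_leaks)
print("GroupKFold folds sharing a group:", group_leaks)
```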

Remaining work

Update the documentation (user guide + TargetEncoder docstring).

Related Issue
Closes: #32076

Thanks to @MatthiasLoefflerQC for opening the issue and @adrinjalali & @thomasjpfan for guiding the implementation.


github-actions bot commented Sep 21, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8dfe314.

Member

@adrinjalali left a comment

This needs to be using metadata routing to route groups, and isn't at the moment.

We also decided that if the user wants to use groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.

Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.

val_set = set(val_idx)

# Check for overlap with previous validation sets
if all_val_indices.intersection(val_set):
Member

You can avoid the iterative intersection operations by building a union of the validation indices and counting the number of indices returned at each iteration. At the end, the set's size needs to be exactly n_samples, and the sum of the counts should be the same number.
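
One way to read this suggestion, as a minimal sketch rather than the code actually committed in the PR: the function name `validate_cv_no_overlap`, its signature, and its error messages are assumptions made for illustration.

```python
def validate_cv_no_overlap(splits, n_samples):
    """Hypothetical sketch of a union-and-count check on validation folds.

    `splits` is an iterable of (train_indices, validation_indices) pairs.
    """
    seen = set()  # union of all validation indices
    total = 0     # sum of the per-fold validation set sizes
    for _, val_idx in splits:
        val_idx = [int(i) for i in val_idx]
        seen.update(val_idx)
        total += len(val_idx)

    # If any index was counted twice, the sum exceeds the size of the union.
    if total != len(seen):
        raise ValueError("Validation sets overlap across folds.")
    # Every sample must be used for validation exactly once.
    if len(seen) != n_samples:
        raise ValueError("Validation sets do not cover every sample.")


# Two disjoint validation folds that cover all four samples: no error.
validate_cv_no_overlap([([2, 3], [0, 1]), ([0, 1], [2, 3])], n_samples=4)
```

A single pass over the folds is enough here, since the comparison happens once at the end instead of intersecting against the accumulated set on every iteration.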

Hi, I’ve updated the overlap validation logic as suggested, replacing the iterative intersection check with a union-based approach.

@adrinjalali added the "Closing candidate" label (indicating we might close this one soon / at a later review) on Sep 22, 2025
@samruddhibaviskar11
Author

This needs to be using metadata routing to route groups, and isn't at the moment.

We also decided that if the user wants to use groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.

Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.

I understand that the preferred direction is to handle groups through metadata routing and require users to explicitly pass a grouped CV object, rather than switching automatically inside TargetEncoder.

I’d be very interested in contributing to this area, but I’ll need to study the metadata routing system more deeply before I can contribute meaningfully here. In the meantime, others are of course welcome to continue building on this PR.

Thanks again to @adrinjalali and @thomasjpfan for the guidance!

Labels
Closing candidate (indicating we might close this one soon / at a later review), module:preprocessing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TargetEncoder should take groups as an argument
2 participants