ENH Adds groups parameter to TargetEncoder with validation for non-overlapping CV splits #32239

samruddhibaviskar11 · 2025-09-21T20:45:51Z

Summary

Enhancement: Add groups parameter to TargetEncoder
This PR extends TargetEncoder with an optional groups argument:
If groups is provided → use GroupKFold.
Otherwise → fall back to existing behavior (KFold for continuous targets, StratifiedKFold for classification).
Also adds _validate_cv_no_overlap utility to raise if custom CV splitters produce overlapping validation sets.
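As a rough illustration of the selection rule described above, here is a minimal sketch. The helper name `choose_splitter` and its parameters are hypothetical and are not taken from the PR diff; only the scikit-learn splitter classes are real.

```python
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold


def choose_splitter(target_type, groups=None, n_splits=5, shuffle=True,
                    random_state=None):
    """Hypothetical helper mirroring the rule described in this PR."""
    if groups is not None:
        # GroupKFold keeps every sample of a group in the same fold.
        return GroupKFold(n_splits=n_splits)
    if target_type == "continuous":
        # Existing behavior for continuous targets.
        return KFold(n_splits=n_splits, shuffle=shuffle,
                     random_state=random_state)
    # Existing behavior for classification targets.
    return StratifiedKFold(n_splits=n_splits, shuffle=shuffle,
                           random_state=random_state)
```

Note that the review further down asks for a different design, in which the user passes a grouped CV object explicitly instead of having the encoder switch splitters.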

Added tests for:

Using groups with binary, continuous, and multiclass targets.
Behavior difference with/without groups.
Error when using a custom CV splitter that overlaps.

Motivation

Currently, TargetEncoder always uses KFold or StratifiedKFold for internal cross-fitting.
This can cause data leakage when samples are not independent (e.g., repeated measures, clustered patients, or time-series grouped by entity).
By adding a groups parameter, users can ensure that samples from the same group always appear in the same fold, preventing leakage and producing more reliable encodings.

Remaining work

Update the documentation (user guide + TargetEncoder docstring).

Related Issue
Closes: #32076

Thanks to @MatthiasLoefflerQC for opening the issue and @adrinjalali & @thomasjpfan for guiding the implementation.


github-actions · 2025-09-21T20:46:57Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8dfe314. Link to the linter CI: here

adrinjalali

This needs to be using metadata routing to route groups, and isn't at the moment.

We also decided that if the user wants to use groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.

Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.

adrinjalali · 2025-09-22T09:42:58Z

sklearn/preprocessing/_target_encoder.py

+        val_set = set(val_idx)
+
+        # Check for overlap with previous validation sets
+        if all_val_indices.intersection(val_set):


You can avoid the iterative intersection checks by building a union of the validation indices and counting how many indices each iteration returns. At the end, the size of the union must be exactly n_samples, and the sum of the counts must equal the same number.
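A minimal sketch of the union-and-count idea suggested here. The function name, signature, and error messages are hypothetical and are not the code from this PR; `splits` is assumed to be an iterable of `(train_idx, val_idx)` pairs such as the output of a splitter's `split` method.

```python
def validate_cv_no_overlap(splits, n_samples):
    """Hypothetical helper: raise if validation folds overlap or miss samples."""
    seen = set()  # union of all validation indices
    total = 0     # sum of per-fold validation set sizes
    for _, val_idx in splits:
        val_idx = [int(i) for i in val_idx]
        seen.update(val_idx)
        total += len(val_idx)
    # If any index was counted twice, the sum exceeds the size of the union.
    if total != len(seen):
        raise ValueError("Validation sets overlap across CV splits.")
    # Every sample must land in exactly one validation set.
    if len(seen) != n_samples:
        raise ValueError("Validation sets do not cover every sample.")
```

Compared with intersecting each new fold against the accumulated set, this does a single pass and defers both checks to the end.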

Hi, I’ve updated the overlap validation logic as suggested, replacing the iterative intersection check with a union-based approach.

samruddhibaviskar11 · 2025-09-24T20:22:27Z

This needs to be using metadata routing to route groups, and isn't at the moment.

We also decided that if the user wants to use groups, they need to explicitly pass a grouped cv object, and we won't do that for the user.

Note that handling metadata routing is not necessarily easy and it's okay if you don't want to continue the work.

I understand that the preferred direction is to handle groups through metadata routing and require users to explicitly pass a grouped CV object, rather than switching automatically inside TargetEncoder.

I’d be very interested in contributing to this area, but I’ll need to study the metadata routing system more deeply before I can contribute meaningfully here. In the meantime, others are of course welcome to continue building on this PR.

Thanks again to @adrinjalali and @thomasjpfan for the guidance!

ENH Adds groups parameter to TargetEncoder with validation for non-ov…

2542ba4

…erlapping CV splits

github-actions bot added the module:preprocessing label Sep 21, 2025

adrinjalali requested changes Sep 22, 2025


adrinjalali added the Closing candidate indicating we might close this one soon / at a later review label Sep 22, 2025

Optimize CV validation and fix linting issues

8dfe314
