feat: Add support for sample_weights in TargetEncoder #31324
Conversation
Force-pushed from b8d97a5 to 3c4c057.
I ran the common test against this PR and it passes:

```
$ pytest -vl -k "TargetEncoder and check_sample_weight_equivalence" sklearn/tests/test_common.py
...
sklearn/tests/test_common.py::test_estimators[TargetEncoder(cv=3)-check_sample_weight_equivalence_on_dense_data] I: Seeding RNGs with 238664506
PASSED
====================== 1 passed, 12762 deselected, 26 warnings in 4.12s ======================
```
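For context, here is a minimal sketch of the equivalence property that check enforces, assuming this PR's branch where `fit` accepts `sample_weight` (the data and `smooth` value are made up for illustration; the actual common check uses a different setup):

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

# Property checked by check_sample_weight_equivalence: fitting with integer
# sample weights should match fitting on the same rows repeated that many times.
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(20, 1)).astype(object)  # one categorical feature
y = rng.rand(20)
w = rng.randint(1, 4, size=20)

# sample_weight in fit is what this PR adds; not available in released versions.
enc_weighted = TargetEncoder(smooth=5.0).fit(X, y, sample_weight=w)
enc_repeated = TargetEncoder(smooth=5.0).fit(np.repeat(X, w, axis=0), np.repeat(y, w))

# Should pass on a correct weighted implementation.
assert np.allclose(enc_weighted.encodings_[0], enc_repeated.encodings_[0])
```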
I will update our github.com/snath-xoc/sample-weight-audit-nondet/ tool to be able to test transformers with categorical inputs and report back.
Meanwhile, here is a first pass of feedback.
(Resolved review thread on doc/whats_new/upcoming_changes/sklearn.preprocessing/31324.enhancement.rst, now outdated.)
```diff
@@ -1505,7 +1505,8 @@ def _check_sample_weight_equivalence(name, estimator_orig, sparse_container):
     rng = np.random.RandomState(42)
     n_samples = 15
-    X = rng.rand(n_samples, n_samples * 2)
+    X = rng.rand(n_samples, n_samples * 2) * 5
     X = _enforce_estimator_tags_X(estimator_orig, X)
```
Note: I have pushed this change in the PR to make sure that this check is meaningful enough for estimators that accept categorical features as inputs.
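To see why the scaling matters, here is a hypothetical illustration, assuming `_enforce_estimator_tags_X` casts `X` to integer-valued categories for estimators that only accept categorical inputs:

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(5, 3)
# Without the scaling, integer-casting collapses everything to category 0,
# making the sample-weight equivalence check trivially pass.
print(X.astype(np.int32))        # all zeros: a single category per column
print((X * 5).astype(np.int32))  # values in {0, ..., 4}: several categories
```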
```diff
-                sums[X_int_tmp] += y[sample_idx]
-                counts[X_int_tmp] += 1.0
+                sums[X_int_tmp] += y[sample_idx] * sample_weight[sample_idx]
+                counts[X_int_tmp] += sample_weight[sample_idx]
```
Maybe we could rename this to `weighted_counts` for the sake of consistency.
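As a standalone illustration of what the changed lines compute (a minimal sketch; the arrays and sizes are made up, and a vectorized `np.add.at` replaces the original per-sample loop):

```python
import numpy as np

X_int = np.array([0, 1, 0, 2, 1])            # ordinal-encoded category per sample
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])      # target values
sample_weight = np.array([1.0, 2.0, 1.0, 0.5, 1.0])

n_categories = 3
sums = np.zeros(n_categories)
weighted_counts = np.zeros(n_categories)      # the rename suggested above

# np.add.at accumulates correctly for repeated indices, unlike fancy-indexed +=.
np.add.at(sums, X_int, y * sample_weight)
np.add.at(weighted_counts, X_int, sample_weight)

category_means = sums / weighted_counts       # weighted per-category target mean
print(category_means)
```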
…ancement.rst Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
For reference, I ran the statistical equivalence test on this PR with the following patch applied:

```diff
diff --git a/sklearn/preprocessing/_target_encoder.py b/sklearn/preprocessing/_target_encoder.py
index 93ecb9cf4e..00236d6469 100644
--- a/sklearn/preprocessing/_target_encoder.py
+++ b/sklearn/preprocessing/_target_encoder.py
@@ -5,6 +5,8 @@ from numbers import Real

 import numpy as np

+from sklearn.model_selection._split import check_cv
+
 from ..base import OneToOneFeatureMixin, _fit_context
 from ..utils._param_validation import Interval, StrOptions
 from ..utils.multiclass import type_of_target
@@ -281,16 +283,6 @@ class TargetEncoder(OneToOneFeatureMixin, _BaseEncoder):
             X, y, sample_weight
         )

-        # The cv splitter is voluntarily restricted to *KFold to enforce non
-        # overlapping validation folds, otherwise the fit_transform output will
-        # not be well-specified.
-        if self.target_type_ == "continuous":
-            cv = KFold(self.cv, shuffle=self.shuffle, random_state=self.random_state)
-        else:
-            cv = StratifiedKFold(
-                self.cv, shuffle=self.shuffle, random_state=self.random_state
-            )
-
         # If 'multiclass' multiply axis=1 by num classes else keep shape the same
         if self.target_type_ == "multiclass":
             X_out = np.empty(
@@ -301,7 +293,7 @@ class TargetEncoder(OneToOneFeatureMixin, _BaseEncoder):
             X_out = np.empty_like(X_ordinal, dtype=np.float64)

         sample_weight = _check_sample_weight(sample_weight, X)
-        for train_idx, test_idx in cv.split(X, y):
+        for train_idx, test_idx in check_cv(self.cv, y).split(X, y):
             X_train, y_train = X_ordinal[train_idx, :], y_encoded[train_idx]
             sample_weight_train = sample_weight[train_idx]
             y_train_mean = np.average(y_train, weights=sample_weight_train, axis=0)
```

and the statistical test could not reveal any bug in this PR.
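For readers unfamiliar with `check_cv`: it normalizes an `int`, a splitter object, or an iterable of splits into a CV splitter object. A quick illustration, independent of this PR:

```python
import numpy as np
from sklearn.model_selection import check_cv

y = np.array([0, 1, 0, 1, 0, 1])

# With classifier=True and a binary/multiclass y, stratification is used.
print(type(check_cv(3, y, classifier=True)).__name__)   # StratifiedKFold
print(type(check_cv(3, y, classifier=False)).__name__)  # KFold
```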
Besides the unaddressed comment from my first pass and the comments below, LGTM. Feel free to ping me once those comments are addressed.
```diff
@@ -131,21 +166,31 @@ def test_encoding(categories, unknown_value, global_random_seed, smooth, target_
         random_state=global_random_seed,
     )

-    X_fit_transform = target_encoder.fit_transform(X_train, y_train)
+    X_fit_transform = target_encoder.fit_transform(X_train, y_train, sample_weight)
```
Nitpick: let's not pass optional arguments as positional arguments.
```diff
-    X_fit_transform = target_encoder.fit_transform(X_train, y_train, sample_weight)
+    X_fit_transform = target_encoder.fit_transform(
+        X_train, y_train, sample_weight=sample_weight
+    )
```
```diff
-    cv : int, default=5
+    cv : int, cross-validation generator or an iterable, default=None
         Determines the number of folds in the :term:`cross fitting` strategy used in
         :meth:`fit_transform`. For classification targets, `StratifiedKFold` is used
         and for continuous targets, `KFold` is used.
```
Please revert this change. At this point, `cv` can only be an `int` because the code of the `TargetEncoder.fit_transform` method explicitly constructs `KFold` splitters in `fit` to ensure non-overlapping splits with complete coverage.
I think this is an annoying restriction, but I would rather lift it in a dedicated PR, unrelated to the support of `sample_weight`.
To lift it, we would need to track which data points are never part of any validation fold at training time and use the learned encoding to encode them at the end of `fit_transform`. Furthermore, we would also need to accumulate (average) encoded values for data points that are part of several validation sets, as sketched below.
Both of those changes would require refactoring the way the CV iteration loop and the `_transform_X_ordinal` method interact, so let's do that in a follow-up PR.
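For the record, a rough sketch of the accumulation scheme just described (hypothetical code; `encode_fold` and `full_data_encoding` stand in for pieces that would only exist after the refactoring):

```python
import numpy as np

def cross_fit_encode(n_samples, splits, encode_fold, full_data_encoding):
    """Average out-of-fold encodings over overlapping validation sets and
    fall back to the full-data encoding for never-validated samples."""
    X_out = np.zeros(n_samples, dtype=np.float64)
    n_hits = np.zeros(n_samples, dtype=np.int64)
    for train_idx, test_idx in splits:
        # encode_fold learns on train_idx and encodes the test_idx samples.
        X_out[test_idx] += encode_fold(train_idx, test_idx)
        n_hits[test_idx] += 1
    covered = n_hits > 0
    X_out[covered] /= n_hits[covered]               # average overlapping folds
    X_out[~covered] = full_data_encoding[~covered]  # samples never held out
    return X_out
```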
This PR introduces the ability for `TargetEncoder` to respect `sample_weight` during fitting, addressing #28881.
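With this PR applied, usage would look like the following sketch (illustrative only; released scikit-learn versions do not accept `sample_weight` here):

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["a"], ["b"], ["a"], ["b"], ["a"], ["b"]], dtype=object)
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
w = np.array([2.0, 1.0, 1.0, 2.0, 1.0, 1.0])

enc = TargetEncoder(cv=2, random_state=0)
# Weighted cross-fitted encoding of the single categorical column.
X_enc = enc.fit_transform(X, y, sample_weight=w)
print(X_enc.shape)  # (6, 1)
```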