feat: Add support for sample_weights in TargetEncoder #31324


Open · DuarteSJ wants to merge 6 commits into main from fea-target-encoder-sample-weights

Conversation

@DuarteSJ (Contributor) commented May 6, 2025

This PR introduces the ability for TargetEncoder to respect sample_weight during fitting, addressing #28881.
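
In practice, the new keyword would be used as follows. This is a minimal sketch that assumes the sample_weight parameter added by this PR, which is not in released scikit-learn:

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.RandomState(0)
X = rng.choice(["a", "b", "c"], size=(20, 1)).astype(object)  # one categorical feature
y = rng.rand(20)                                              # continuous target
sample_weight = rng.uniform(0.5, 1.5, size=20)                # per-sample weights

enc = TargetEncoder(cv=3, random_state=0)
# `sample_weight` is the keyword this PR adds; not available in released scikit-learn.
X_encoded = enc.fit_transform(X, y, sample_weight=sample_weight)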

github-actions bot commented May 6, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: d6c8f06.

@DuarteSJ force-pushed the fea-target-encoder-sample-weights branch from b8d97a5 to 3c4c057 on May 6, 2025 at 20:29
@DuarteSJ (Contributor, Author) commented:

@ogrisel I believe this respects the definition of sample weights as we discussed in #28881.

@ogrisel self-requested a review on June 13, 2025 at 13:47
@ogrisel (Member) left a comment:

I ran the common test against this PR and it passes:

$ pytest -vl -k "TargetEncoder and check_sample_weight_equivalence" sklearn/tests/test_common.py
...
sklearn/tests/test_common.py::test_estimators[TargetEncoder(cv=3)-check_sample_weight_equivalence_on_dense_data] I: Seeding RNGs with 238664506
PASSED
============================================================================== 1 passed, 12762 deselected, 26 warnings in 4.12s ==============================================================================

I will update our github.com/snath-xoc/sample-weight-audit-nondet/ tool to be able to test transformers with categorical inputs and report back.
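
For context, the core property such equivalence checks exercise is that an integer weight of k should behave like repeating a sample k times. A minimal sketch of that property, assuming this PR's sample_weight parameter in fit:

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.RandomState(0)
X = rng.choice(["a", "b"], size=(12, 1)).astype(object)
y = rng.rand(12)
w = rng.randint(1, 4, size=12)

# Repeating each sample w[i] times should match fitting with weights w.
X_rep = np.repeat(X, w, axis=0)
y_rep = np.repeat(y, w)

enc_w = TargetEncoder(shuffle=False).fit(X, y, sample_weight=w.astype(float))
enc_r = TargetEncoder(shuffle=False).fit(X_rep, y_rep)
for e_w, e_r in zip(enc_w.encodings_, enc_r.encodings_):
    np.testing.assert_allclose(e_w, e_r)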

Meanwhile, here is a first pass of feedback.

cc @antoinebaker @snath-xoc.

@@ -1505,7 +1505,8 @@ def _check_sample_weight_equivalence(name, estimator_orig, sparse_container):
     rng = np.random.RandomState(42)
     n_samples = 15
-    X = rng.rand(n_samples, n_samples * 2)
+    X = rng.rand(n_samples, n_samples * 2) * 5
     X = _enforce_estimator_tags_X(estimator_orig, X)
@ogrisel (Member) commented:
Note: I have pushed this change in the PR to make sure that this check is meaningful enough for estimators that accept categorical features as inputs.

-sums[X_int_tmp] += y[sample_idx]
-counts[X_int_tmp] += 1.0
+sums[X_int_tmp] += y[sample_idx] * sample_weight[sample_idx]
+counts[X_int_tmp] += sample_weight[sample_idx]
@ogrisel (Member) commented:
Maybe we could rename this to weighted_counts for the sake of consistency.
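
For illustration, here is a small standalone sketch of the weighted per-category accumulation this diff implements (the data and the weighted_counts name are hypothetical):

import numpy as np

X_int = np.array([0, 0, 1, 1, 1])                # ordinal-encoded categories
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
sample_weight = np.array([2.0, 1.0, 1.0, 1.0, 1.0])

sums = np.zeros(2)
weighted_counts = np.zeros(2)
for i in range(len(y)):
    # Unit counts become sample weights, so per-category means are weighted means.
    sums[X_int[i]] += y[i] * sample_weight[i]
    weighted_counts[X_int[i]] += sample_weight[i]

category_means = sums / weighted_counts
# Category 0: (2 * 1.0 + 1 * 0.0) / 3.0 = 0.667
# Category 1: (1.0 + 1.0 + 0.0) / 3.0 = 0.667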

@ogrisel (Member) commented Jun 17, 2025

For reference, I ran the statistical equivalence test of sample-weight-audit-nondet with the modifications implemented in snath-xoc/sample-weight-audit-nondet#34 and the following change to allow for precomputed cv folds:

diff --git a/sklearn/preprocessing/_target_encoder.py b/sklearn/preprocessing/_target_encoder.py
index 93ecb9cf4e..00236d6469 100644
--- a/sklearn/preprocessing/_target_encoder.py
+++ b/sklearn/preprocessing/_target_encoder.py
@@ -5,6 +5,8 @@ from numbers import Real
 
 import numpy as np
 
+from sklearn.model_selection._split import check_cv
+
 from ..base import OneToOneFeatureMixin, _fit_context
 from ..utils._param_validation import Interval, StrOptions
 from ..utils.multiclass import type_of_target
@@ -281,16 +283,6 @@ class TargetEncoder(OneToOneFeatureMixin, _BaseEncoder):
             X, y, sample_weight
         )
 
-        # The cv splitter is voluntarily restricted to *KFold to enforce non
-        # overlapping validation folds, otherwise the fit_transform output will
-        # not be well-specified.
-        if self.target_type_ == "continuous":
-            cv = KFold(self.cv, shuffle=self.shuffle, random_state=self.random_state)
-        else:
-            cv = StratifiedKFold(
-                self.cv, shuffle=self.shuffle, random_state=self.random_state
-            )
-
         # If 'multiclass' multiply axis=1 by num classes else keep shape the same
         if self.target_type_ == "multiclass":
             X_out = np.empty(
@@ -301,7 +293,7 @@ class TargetEncoder(OneToOneFeatureMixin, _BaseEncoder):
             X_out = np.empty_like(X_ordinal, dtype=np.float64)
 
         sample_weight = _check_sample_weight(sample_weight, X)
-        for train_idx, test_idx in cv.split(X, y):
+        for train_idx, test_idx in check_cv(self.cv, y).split(X, y):
             X_train, y_train = X_ordinal[train_idx, :], y_encoded[train_idx]
             sample_weight_train = sample_weight[train_idx]
             y_train_mean = np.average(y_train, weights=sample_weight_train, axis=0)

and the statistical test could not reveal any bug in this PR.
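
With that test-only patch applied, precomputed folds could be passed along these lines (a sketch only; it also assumes the cv parameter validation, which currently only accepts an int, is relaxed accordingly):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import TargetEncoder

rng = np.random.RandomState(0)
X = rng.choice(["a", "b", "c"], size=(30, 1)).astype(object)
y = rng.rand(30)
sample_weight = rng.uniform(0.5, 1.5, size=30)

# Non-overlapping precomputed folds, passed as an iterable of index pairs.
folds = list(KFold(n_splits=3, shuffle=True, random_state=0).split(X, y))
enc = TargetEncoder(cv=folds)  # only works with the patch above applied
X_encoded = enc.fit_transform(X, y, sample_weight=sample_weight)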

@ogrisel (Member) left a comment:

Besides the unaddressed comment of my first pass and the comments below, LGTM. Feel free to ping me once those comments are addressed.

@@ -131,21 +166,31 @@ def test_encoding(categories, unknown_value, global_random_seed, smooth, target_
         random_state=global_random_seed,
     )
-    X_fit_transform = target_encoder.fit_transform(X_train, y_train)
+    X_fit_transform = target_encoder.fit_transform(X_train, y_train, sample_weight)
@ogrisel (Member) commented:
Nitpick: let's not pass optional arguments as positional arguments.

Suggested change:
-    X_fit_transform = target_encoder.fit_transform(X_train, y_train, sample_weight)
+    X_fit_transform = target_encoder.fit_transform(
+        X_train, y_train, sample_weight=sample_weight
+    )

-    Determines the number of folds in the :term:`cross fitting` strategy used in
-    :meth:`fit_transform`. For classification targets, `StratifiedKFold` is used
-    and for continuous targets, `KFold` is used.
+cv : int, cross-validation generator or an iterable, default=None
@ogrisel (Member) commented Jun 17, 2025:

Please revert this change. At this point, cv can only be an int, because the code of the TargetEncoder.fit_transform method explicitly constructs KFold splitters in fit to ensure non-overlapping splits with complete coverage.

I think this is an annoying restriction, but I would rather lift it in a dedicated PR, unrelated to the support of sample_weight.

To lift it, we would need to track which data points are never part of any CV validation fold at training time and use the learned encoding to encode them at the end of fit_transform.

Furthermore, we would also need to accumulate (average) encoded values for data points that are part of several validation sets.

Both those changes would require refactoring the way the CV iteration loop and the _transform_X_ordinal method interact. So let's do that in a follow-up PR.
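
A rough sketch of the accumulation scheme described above, with a hypothetical encode_fold callable standing in for the refactored fold-encoding logic (not part of this PR):

import numpy as np

def cross_fit_encode(y, cv_splits, encode_fold):
    # encode_fold(train_idx, test_idx) is assumed to learn encodings on the
    # train fold and return encoded values for the test fold (hypothetical).
    n_samples = len(y)
    acc = np.zeros(n_samples)
    n_hits = np.zeros(n_samples)
    for train_idx, test_idx in cv_splits:
        acc[test_idx] += encode_fold(train_idx, test_idx)
        n_hits[test_idx] += 1
    covered = n_hits > 0
    out = np.empty(n_samples)
    # Average the encodings of samples that appear in several validation sets.
    out[covered] = acc[covered] / n_hits[covered]
    # Samples never part of any validation fold get the full-data encoding.
    all_idx = np.arange(n_samples)
    out[~covered] = encode_fold(all_idx, all_idx[~covered])
    return out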
