feat: Add support for sample_weights in TargetEncoder #31324
Conversation
Force-pushed from b8d97a5 to 3c4c057.
I ran the common test against this PR and it passes:

```
$ pytest -vl -k "TargetEncoder and check_sample_weight_equivalence" sklearn/tests/test_common.py
...
sklearn/tests/test_common.py::test_estimators[TargetEncoder(cv=3)-check_sample_weight_equivalence_on_dense_data] I: Seeding RNGs with 238664506
PASSED
====================== 1 passed, 12762 deselected, 26 warnings in 4.12s ======================
```
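For context, here is a minimal sketch of the equivalence property that check enforces, assuming this PR's branch where `fit` accepts `sample_weight` (the data and `smooth` value are made up for illustration; the actual common check uses a different setup):

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

# Property checked by check_sample_weight_equivalence: fitting with integer
# sample weights should match fitting on the same rows repeated that many times.
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(20, 1)).astype(object)  # one categorical feature
y = rng.rand(20)
w = rng.randint(1, 4, size=20)

# sample_weight in fit is what this PR adds; not available in released versions.
enc_weighted = TargetEncoder(smooth=5.0).fit(X, y, sample_weight=w)
enc_repeated = TargetEncoder(smooth=5.0).fit(np.repeat(X, w, axis=0), np.repeat(y, w))

# Should pass on a correct weighted implementation.
assert np.allclose(enc_weighted.encodings_[0], enc_repeated.encodings_[0])
```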
I will update our github.com/snath-xoc/sample-weight-audit-nondet/ tool to be able to test transformers with categorical inputs and report back.
Meanwhile, here is a first pass of feedback.
(Resolved review thread on doc/whats_new/upcoming_changes/sklearn.preprocessing/31324.enhancement.rst, now outdated.)
```diff
@@ -1505,7 +1505,8 @@ def _check_sample_weight_equivalence(name, estimator_orig, sparse_container):
     rng = np.random.RandomState(42)
     n_samples = 15
-    X = rng.rand(n_samples, n_samples * 2)
+    X = rng.rand(n_samples, n_samples * 2) * 5
     X = _enforce_estimator_tags_X(estimator_orig, X)
```
Note: I have pushed this change in the PR to make sure that this check is meaningful enough for estimators that accept categorical features as inputs.
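To see why the scaling matters, here is a hypothetical illustration, assuming `_enforce_estimator_tags_X` casts `X` to integer-valued categories for estimators that only accept categorical inputs:

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(5, 3)
# Without the scaling, integer-casting collapses everything to category 0,
# making the sample-weight equivalence check trivially pass.
print(X.astype(np.int32))        # all zeros: a single category per column
print((X * 5).astype(np.int32))  # values in {0, ..., 4}: several categories
```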
```diff
-                sums[X_int_tmp] += y[sample_idx]
-                counts[X_int_tmp] += 1.0
+                sums[X_int_tmp] += y[sample_idx] * sample_weight[sample_idx]
+                counts[X_int_tmp] += sample_weight[sample_idx]
```
Maybe we could rename this to `weighted_counts` for the sake of consistency.
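As a standalone illustration of what the changed lines compute (a minimal sketch; the arrays and sizes are made up, and a vectorized `np.add.at` replaces the original per-sample loop):

```python
import numpy as np

X_int = np.array([0, 1, 0, 2, 1])            # ordinal-encoded category per sample
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])      # target values
sample_weight = np.array([1.0, 2.0, 1.0, 0.5, 1.0])

n_categories = 3
sums = np.zeros(n_categories)
weighted_counts = np.zeros(n_categories)      # the rename suggested above

# np.add.at accumulates correctly for repeated indices, unlike fancy-indexed +=.
np.add.at(sums, X_int, y * sample_weight)
np.add.at(weighted_counts, X_int, sample_weight)

category_means = sums / weighted_counts       # weighted per-category target mean
print(category_means)
```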
…ancement.rst Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
For reference, I ran the statistical equivalence test on this PR with the following patch applied:

```diff
diff --git a/sklearn/preprocessing/_target_encoder.py b/sklearn/preprocessing/_target_encoder.py
index 93ecb9cf4e..00236d6469 100644
--- a/sklearn/preprocessing/_target_encoder.py
+++ b/sklearn/preprocessing/_target_encoder.py
@@ -5,6 +5,8 @@ from numbers import Real

 import numpy as np

+from sklearn.model_selection._split import check_cv
+
 from ..base import OneToOneFeatureMixin, _fit_context
 from ..utils._param_validation import Interval, StrOptions
 from ..utils.multiclass import type_of_target
@@ -281,16 +283,6 @@ class TargetEncoder(OneToOneFeatureMixin, _BaseEncoder):
             X, y, sample_weight
         )

-        # The cv splitter is voluntarily restricted to *KFold to enforce non
-        # overlapping validation folds, otherwise the fit_transform output will
-        # not be well-specified.
-        if self.target_type_ == "continuous":
-            cv = KFold(self.cv, shuffle=self.shuffle, random_state=self.random_state)
-        else:
-            cv = StratifiedKFold(
-                self.cv, shuffle=self.shuffle, random_state=self.random_state
-            )
-
         # If 'multiclass' multiply axis=1 by num classes else keep shape the same
         if self.target_type_ == "multiclass":
             X_out = np.empty(
@@ -301,7 +293,7 @@ class TargetEncoder(OneToOneFeatureMixin, _BaseEncoder):
             X_out = np.empty_like(X_ordinal, dtype=np.float64)

         sample_weight = _check_sample_weight(sample_weight, X)
-        for train_idx, test_idx in cv.split(X, y):
+        for train_idx, test_idx in check_cv(self.cv, y).split(X, y):
             X_train, y_train = X_ordinal[train_idx, :], y_encoded[train_idx]
             sample_weight_train = sample_weight[train_idx]
             y_train_mean = np.average(y_train, weights=sample_weight_train, axis=0)
```

and the statistical test could not reveal any bug in this PR.
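For readers unfamiliar with `check_cv`: it normalizes an `int`, a splitter object, or an iterable of splits into a CV splitter object. A quick illustration, independent of this PR:

```python
import numpy as np
from sklearn.model_selection import check_cv

y = np.array([0, 1, 0, 1, 0, 1])

# With classifier=True and a binary/multiclass y, stratification is used.
print(type(check_cv(3, y, classifier=True)).__name__)   # StratifiedKFold
print(type(check_cv(3, y, classifier=False)).__name__)  # KFold
```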
Besides the unaddressed comment from my first pass and the comments below, LGTM. Feel free to ping me once those comments are addressed.
```diff
@@ -131,21 +166,31 @@ def test_encoding(categories, unknown_value, global_random_seed, smooth, target_
         random_state=global_random_seed,
     )

-    X_fit_transform = target_encoder.fit_transform(X_train, y_train)
+    X_fit_transform = target_encoder.fit_transform(X_train, y_train, sample_weight)
```
Nitpick: let's not pass optional arguments as positional arguments.
```diff
-    X_fit_transform = target_encoder.fit_transform(X_train, y_train, sample_weight)
+    X_fit_transform = target_encoder.fit_transform(
+        X_train, y_train, sample_weight=sample_weight
+    )
```
```diff
-    cv : int, default=5
+    cv : int, cross-validation generator or an iterable, default=None
         Determines the number of folds in the :term:`cross fitting` strategy used in
         :meth:`fit_transform`. For classification targets, `StratifiedKFold` is used
         and for continuous targets, `KFold` is used.
```
Please revert this change. At this point, `cv` can only be an `int` because the code of the `TargetEncoder.fit_transform` method explicitly constructs `KFold` splitters in `fit` to ensure non-overlapping splits with complete coverage.
I think this is an annoying restriction, but I would rather lift it in a dedicated PR, unrelated to the support of `sample_weight`.
To lift it, we would need to track which data points are never part of any validation fold at training time and use the learned encoding to encode them at the end of `fit_transform`. Furthermore, we would also need to accumulate (average) encoded values for data points that are part of several validation sets, as sketched below.
Both of those changes would require refactoring the way the CV iteration loop and the `_transform_X_ordinal` method interact, so let's do that in a follow-up PR.
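For the record, a rough sketch of the accumulation scheme just described (hypothetical code; `encode_fold` and `full_data_encoding` stand in for pieces that would only exist after the refactoring):

```python
import numpy as np

def cross_fit_encode(n_samples, splits, encode_fold, full_data_encoding):
    """Average out-of-fold encodings over overlapping validation sets and
    fall back to the full-data encoding for never-validated samples."""
    X_out = np.zeros(n_samples, dtype=np.float64)
    n_hits = np.zeros(n_samples, dtype=np.int64)
    for train_idx, test_idx in splits:
        # encode_fold learns on train_idx and encodes the test_idx samples.
        X_out[test_idx] += encode_fold(train_idx, test_idx)
        n_hits[test_idx] += 1
    covered = n_hits > 0
    X_out[covered] /= n_hits[covered]               # average overlapping folds
    X_out[~covered] = full_data_encoding[~covered]  # samples never held out
    return X_out
```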
This PR introduces the ability for `TargetEncoder` to respect `sample_weight` during fitting, addressing #28881.
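With this PR applied, usage would look like the following sketch (illustrative only; released scikit-learn versions do not accept `sample_weight` here):

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["a"], ["b"], ["a"], ["b"], ["a"], ["b"]], dtype=object)
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
w = np.array([2.0, 1.0, 1.0, 2.0, 1.0, 1.0])

enc = TargetEncoder(cv=2, random_state=0)
# Weighted cross-fitted encoding of the single categorical column.
X_enc = enc.fit_transform(X, y, sample_weight=w)
print(X_enc.shape)  # (6, 1)
```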