Add sample_weight support to QuantileTransformer #30707

antipisa · 2025-01-22T23:07:51Z

Describe the workflow you want to enable

It would be good to have sample_weight support in QuantileTransformer for dealing with sparse or imbalanced data, à la #15601.

from sklearn.preprocessing import QuantileTransformer

scaler = QuantileTransformer(output_distribution="normal")

scaler.fit(X, sample_weight=w)

Describe your proposed solution

As far as I know, it would only require passing a weights argument to the np.nanpercentile call that computes quantiles_.

KBinsDiscretizer supports sample_weight, and with strategy='quantile' and encode='ordinal' it can achieve this behavior, but it is much, much slower.
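To make the request concrete, here is a minimal NumPy sketch of what a weight-aware quantile computation could look like. It assumes the inverted-CDF definition of a weighted quantile (the smallest sample whose cumulative weight fraction reaches the requested probability); `weighted_quantiles` is a hypothetical helper written for illustration, not a scikit-learn or NumPy function, and QuantileTransformer may well end up using a different definition.

```python
import numpy as np


def weighted_quantiles(x, sample_weight, probs):
    """Inverted-CDF quantiles of 1D `x` under non-negative `sample_weight`.

    Hypothetical helper for illustration; not part of scikit-learn or NumPy.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(sample_weight, dtype=float)
    # Ignore NaNs (as np.nanpercentile does) and samples with zero weight.
    keep = ~np.isnan(x) & (w > 0)
    x, w = x[keep], w[keep]
    order = np.argsort(x)
    x, w = x[order], w[order]
    # Weighted empirical CDF evaluated at each sorted sample.
    cdf = np.cumsum(w) / w.sum()
    # Smallest sample whose cumulative weight fraction reaches each probability.
    idx = np.searchsorted(cdf, probs, side="left")
    return x[np.clip(idx, 0, len(x) - 1)]


x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1, 3, 1, 1])
probs = np.array([0.25, 0.5, 0.9])

weighted = weighted_quantiles(x, w, probs)
# Integer weights should behave like repeating each sample that many times.
repeated = weighted_quantiles(np.repeat(x, w), np.ones(w.sum()), probs)
```

With integer weights, `weighted` and `repeated` agree, which is the property one would expect a weighted fit to satisfy: weighting a sample by k is equivalent to including it k times.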

Describe alternatives you've considered, if relevant

No response

Additional context

No response

lesteve · 2025-01-23T12:56:13Z

Searching the issue tracker, it looks like this feature is in scope, if I understand #20522 (comment) correctly.

cc @snath-xoc and @jeremiedbb for an informed opinion, since they have been working on sample weights support recently.

ogrisel · 2025-02-03T16:26:33Z

@antipisa would you be interested in contributing a PR? There is already a common test named check_sample_weight_equivalence_on_dense_data that should be triggered as soon as you add the sample_weight kwarg to fit. Once this is done, the following command should pick it up:

pytest -k "check_sample_weight_equivalence and QuantileTransformer" sklearn/tests/test_common.py -v -s

You might want to adjust some non-default parameter values for that check in PER_ESTIMATOR_CHECK_PARAMS in sklearn/utils/_test_common/instance_generator.py.

Note that when subsampling is enabled, we need to follow a strategy similar to the one implemented in KBinsDiscretizer. At the time of writing, the code for KBinsDiscretizer in main is still half broken, but there is a fix that is almost ready to be merged: #29907. You can take a look at this PR for inspiration.
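As a rough illustration of how subsampling and sample weights can interact, the sketch below draws the subsample with replacement, with probabilities proportional to the weights, and then computes plain (unweighted) quantiles on the result. This is an assumption about one workable strategy, not a description of what #29907 actually does; `weighted_subsample` and its parameters are hypothetical names.

```python
import numpy as np


def weighted_subsample(X, sample_weight, n_subsample, random_state=0):
    """Draw rows of X with replacement, proportionally to their weights.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    rng = np.random.default_rng(random_state)
    w = np.asarray(sample_weight, dtype=float)
    idx = rng.choice(X.shape[0], size=n_subsample, replace=True, p=w / w.sum())
    return X[idx]


rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
# Only the first 100 rows carry weight in this toy example.
w = np.zeros(1000)
w[:100] = 1.0

X_sub = weighted_subsample(X, w, n_subsample=50)
# The weights were consumed by the resampling step, so the quantiles of the
# subsample are computed without weights to avoid counting them twice.
quantiles = np.percentile(X_sub, [25, 50, 75], axis=0)
```

Because the result depends on the random draw, a fit implemented this way is only equivalent to a fit on repeated samples in distribution, not exactly, which is why the stochastic case needs dedicated test tooling.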

We are working on tooling to help test the case where the estimator's fit is stochastic (that is, it depends on random_state, which is the case when subsampling is enabled). You can track the latest progress here: #16298 (comment)

kaekkr · 2025-04-03T08:48:11Z

I will take this!

antipisa added the Needs Triage (Issue requires triage) and New Feature labels Jan 22, 2025

lesteve removed the Needs Triage (Issue requires triage) label Jan 23, 2025

antipisa changed the title ~~Add sample_weight support to QuantileTransformer?~~ Add sample_weight support to QuantileTransformer Jan 23, 2025

ogrisel added the Moderate (Anything that requires some knowledge of conventions and best practices) label Feb 3, 2025

ogrisel added this to Losses and solvers Feb 3, 2025

ogrisel moved this to Todo in Losses and solvers Feb 3, 2025

kaekkr linked a pull request Apr 4, 2025 that will close this issue

Add sample_weight support for QuantileTransformer when fit on dense data #31147

Open
