ENH Improve the efficiency of QuantileTransformer #27344


Merged: 12 commits merged into scikit-learn:main on Mar 28, 2024

Conversation

@xuefeng-xu (Contributor) commented Sep 12, 2023

Reference Issues/PRs

See #27263

What does this implement/fix? Explain your changes.

The original subsampling in QuantileTransformer was done column by column; I think the subsampling can be done just once to improve efficiency. I also removed the for-loop and the transpose when computing the quantiles.
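
Roughly, the fit path goes from a per-column loop to one vectorized percentile call. Below is a minimal sketch of the two patterns (illustrative variable names, subsampling omitted; not the actual scikit-learn code):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)                    # (n_samples, n_features)
references = np.linspace(0, 1, 1000) * 100  # percentile grid, roughly n_quantiles=1000

# Before (schematically): compute the quantiles one column at a time, then transpose.
quantiles_loop = [np.nanpercentile(col, references) for col in X.T]
quantiles_loop = np.transpose(quantiles_loop)  # (n_quantiles, n_features)

# After (schematically): a single vectorized call over axis 0.
quantiles_vec = np.nanpercentile(X, references, axis=0)

# Both forms compute the same values; the vectorized call avoids the Python-level loop.
np.testing.assert_allclose(quantiles_loop, quantiles_vec)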

Any other comments?

@github-actions (bot) commented Sep 12, 2023

✔️ Linting Passed

All linting checks passed. Generated for commit 49d2dcf.

@xuefeng-xu (Contributor, Author)

Hi @glemaitre, would you like to take a look?

@glemaitre (Member)

I am not convinced it brings any performance gain, since this is not the bottleneck of the transformer. However, it will be a regression in terms of memory consumption, since we are allocating an (n_subsample, n_features) matrix while we were previously computing the quantiles iteratively with an (n_subsample,) buffer.

Do you have any benchmark that shows that the current subsampling is problematic?
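
For a rough sense of the scale in question, a back-of-the-envelope sketch (illustrative float64 shapes, not a measurement of the transformer itself):

# Approximate sizes of the buffers discussed above (float64, 8 bytes per value).
n_subsample, n_features = 100_000, 100

full_matrix_mib = n_subsample * n_features * 8 / 2**20  # one (n_subsample, n_features) copy
one_column_mib = n_subsample * 8 / 2**20                # one (n_subsample,) per-column buffer

print(f"(n_subsample, n_features) copy: {full_matrix_mib:.1f} MiB")  # ~76.3 MiB
print(f"(n_subsample,) column buffer: {one_column_mib:.2f} MiB")     # ~0.76 MiB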

@xuefeng-xu (Contributor, Author)

I agree; only modifying the subsampling might not be an improvement on its own. I further removed the for-loop and the transpose when computing the quantiles, and this does improve performance.

import numpy as np
from sklearn.preprocessing import QuantileTransformer
X = np.random.rand(10**5, 100)

%timeit QuantileTransformer().fit(X)

# before: 297 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# after: 135 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@xuefeng-xu changed the title from "ENH Improve the efficiency of subsampling in QuantileTransformer" to "ENH Improve the efficiency of QuantileTransformer" on Oct 11, 2023
@betatim (Member) commented Oct 24, 2023

It looks like the memory use is similar on main and this PR. I also see faster runtimes (100ms vs 240ms).

main:

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5  184.344 MiB  184.344 MiB           1   @profile
     6                                         def do_it():
     7  198.094 MiB   13.750 MiB           1       QuantileTransformer().fit(X)

This PR:

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5  185.922 MiB  185.922 MiB           1   @profile
     6                                         def do_it():
     7  195.844 MiB    9.922 MiB           1       QuantileTransformer().fit(X)
quantile.py:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.rand(10**5, 100)

@profile
def do_it():
    QuantileTransformer().fit(X)

if __name__ == "__main__":
    do_it()

@xuefeng-xu (Contributor, Author)

@betatim Thanks!

@glemaitre (Member)

@betatim Do you think this is worth breaking backward compatibility?

If we go down this road, I would advocate adding a new subsample function to the utilities (next to resample and shuffle) and sharing the redundant code.

@betatim (Member) commented Oct 30, 2023

What makes this a backwards incompatible change?

The speedup seems worth some effort.

@glemaitre (Member)

> What makes this a backwards incompatible change?

This is more a change of behaviour: with the same random_state, you don't get the same results.
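
A minimal sketch of why the results change (hypothetical shapes, not the actual scikit-learn code): drawing one subsample per column consumes the random stream differently from drawing a single subsample for the whole matrix, so the same random_state ends up selecting different rows for all but the first column.

import numpy as np

n_samples, n_features, subsample = 1000, 3, 100

# Old behaviour (schematically): one draw of row indices per column.
rng = np.random.RandomState(0)
per_column_idx = [rng.choice(n_samples, size=subsample, replace=False) for _ in range(n_features)]

# New behaviour (schematically): a single draw of row indices shared by all columns.
rng = np.random.RandomState(0)
shared_idx = rng.choice(n_samples, size=subsample, replace=False)

print(np.array_equal(per_column_idx[0], shared_idx))  # True: same stream state for the first draw
print(np.array_equal(per_column_idx[1], shared_idx))  # False (with overwhelming probability)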

@betatim (Member) commented Oct 31, 2023

@xuefeng-xu do you have some numbers showing the runtime of this preprocessor for a real-world use case? On a relative scale, a roughly factor-of-two speedup is impressive. On an absolute scale, a change from 250 ms to 100 ms takes something that is already very fast and makes it very, very fast. But if real-world use cases saw a runtime improvement from 20 s to 10 s, that would be cool.

I don't have a feeling for how annoying it is for people if results change with the same random_state :-/

@glemaitre (Member)

> I don't have a feeling for how annoying it is for people if results change with the same random_state :-/

OK so we need to acknowledge it in the "Changed model" section of the changelog.

@xuefeng-xu (Contributor, Author)

@betatim I used this dataset for testing; it has about 2.1M examples and is about 600 MB in size.

It seems that the time is reduced, with similar memory usage.

Time

import pandas as pd
from sklearn.preprocessing import QuantileTransformer

df = pd.read_csv("amz_ca_total_products_data_processed.csv")
df = df[["stars","reviews","price","listPrice","boughtInLastMonth"]] # 5 numeric columns

%timeit QuantileTransformer().fit(df)

before

138 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

after

39 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Memory

import pandas as pd
from sklearn.preprocessing import QuantileTransformer

df = pd.read_csv("amz_ca_total_products_data_processed.csv")
df = df[["stars","reviews","price","listPrice","boughtInLastMonth"]] # 5 numeric columns

@profile
def my_func():
    QuantileTransformer().fit(df)

if __name__ == '__main__':
    my_func()

before

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7  731.312 MiB  731.312 MiB           1   @profile
     8                                         def my_func():
     9  731.672 MiB    0.359 MiB           1       QuantileTransformer().fit(df)

after

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7  736.406 MiB  736.406 MiB           1   @profile
     8                                         def my_func():
     9  737.438 MiB    1.031 MiB           1       QuantileTransformer().fit(df)

@xuefeng-xu (Contributor, Author)

Hi @glemaitre, I have added this to the changed models section.

Comment on lines 2598 to 2604
if self.subsample < n_samples:
    subsample_idx = random_state.choice(
        n_samples, size=self.subsample, replace=False
    )
    X = _safe_indexing(X, subsample_idx)

self.quantiles_ = np.nanpercentile(X, references, axis=0)
(Member)

Could you also factor out the part that is in common with KBinsDiscretizer by adding a new utility function called subsample?

We would need to add it to classes.rst and have a small test to check the behaviour as well.

@xuefeng-xu (Contributor, Author)

OK, but I'm not sure which option is better. Could you give me some suggestions?

Option 1

# QuantileTransformer and KBinsDiscretizer
if self.subsample is not None and n_samples > self.subsample:
    X = subsample(X, n_samples=self.subsample, random_state=random_state)

# subsample method in utility
def subsample(*arrays, n_samples, random_state):
    return resample(*arrays, replace=False, n_samples=n_samples, random_state=random_state)

Option 2

# QuantileTransformer and KBinsDiscretizer
X = subsample(X, n_samples=self.subsample, random_state=random_state)

# subsample method in utility
def subsample(*arrays, n_samples, random_state):
    subsample = n_samples
    first = arrays[0]
    n_samples = first.shape[0] if hasattr(first, "shape") else len(first)
    if subsample is not None and n_samples > subsample:
        return resample(*arrays, replace=False, n_samples=subsample, random_state=random_state)
    else:
        return arrays

(Member)

Oh, actually, just call the resample function directly. I did not know that it exposed all the necessary parameters:

if self.subsample is not None and n_samples > self.subsample:
    # Take a subsample of `X`
    X = resample(X, replace=False, n_samples=self.subsample, random_state=random_state)

@xuefeng-xu (Contributor, Author)

Done.

@glemaitre (Member) left a comment

LGTM otherwise.

@@ -35,6 +35,11 @@ random sampling procedures.
solvers (when fit on the same data again). The amount of change depends on the
specified `tol`, for small values you will get more precise results.

- |Efficiency| :class:`preprocessing.QuantileTransformer` now uses `resample` function to
(Member)

We need to check that we don't change the output of the KBinsDiscretizer as well. We might be calling another NumPy API.

@xuefeng-xu (Contributor, Author)

I think this won't affect KBinsDiscretizer because resample also uses _safe_indexing.

I also tested with the following code to make sure they are equal.

import numpy as np
from sklearn.utils import resample, check_random_state, _safe_indexing

random_state = 0
num_example = 1000
subsample = 100

X = np.random.randint(low=0, high=10**2, size=num_example)

# before
rng = check_random_state(random_state)
subsample_idx = rng.choice(num_example, size=subsample, replace=False)
result1 = _safe_indexing(X, subsample_idx)

# this PR
result2 = resample(
    X, replace=False, n_samples=subsample, random_state=random_state
)

np.testing.assert_array_equal(result1, result2)

(Member)

It indeed looks like

np.random.RandomState(0).choice(n_samples, size=subsample, replace=False)

gives the same sampling as

a = np.arange(n_samples)
np.random.RandomState(0).shuffle(a)
a[:subsample]

which is probably what is used underneath. I'm okay with this change, although here it has no impact on efficiency.

@xuefeng-xu (Contributor, Author)

Moved the entry to the v1.5 changelog.

@glemaitre added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one!) on Mar 11, 2024
@jeremiedbb (Member) left a comment

LGTM. Thanks @xuefeng-xu. I resolved the conflicts and rephrased the changelog entry a bit.

I just have one remark: before, the subsampling was different for each column; now it's the same for all columns. I don't think we care, since it's just to speed up the computation of the quantiles, but I want to make sure we're not missing something. @glemaitre @betatim?

@betatim (Member) commented Mar 21, 2024

I think using the same rows for each column is fine. It is different from before, but if you woke me at 4am and asked me to implement this, I'd probably have gone with selecting the same rows for each feature anyway.

@glemaitre (Member)

I think that was my original remark, and after speaking with @ogrisel I settled on this being good enough. So still LGTM; I'll merge then. Thanks @jeremiedbb for updating this PR, and thanks @xuefeng-xu for the original work.

@glemaitre merged commit c63b21e into scikit-learn:main on Mar 28, 2024
@xuefeng-xu deleted the subsample branch on March 28, 2024
Labels: module:preprocessing, Waiting for Second Reviewer