ENH Improve the efficiency of QuantileTransformer #27344


Merged: 12 commits merged into scikit-learn:main on Mar 28, 2024

Conversation

@xuefeng-xu (Contributor) commented Sep 12, 2023

Reference Issues/PRs

See #27263

What does this implement/fix? Explain your changes.

The original subsampling in QuantileTransformer was done column by column; I think the subsampling can be done just once to improve efficiency. I also removed the for-loop and the transpose when computing the quantiles.
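
Roughly, the fit path goes from a per-column loop to one vectorized percentile call. Below is a minimal sketch of the two patterns (illustrative variable names, subsampling omitted; not the actual scikit-learn code):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)                    # (n_samples, n_features)
references = np.linspace(0, 1, 1000) * 100  # percentile grid, roughly n_quantiles=1000

# Before (schematically): compute the quantiles one column at a time, then transpose.
quantiles_loop = [np.nanpercentile(col, references) for col in X.T]
quantiles_loop = np.transpose(quantiles_loop)  # (n_quantiles, n_features)

# After (schematically): a single vectorized call over axis 0.
quantiles_vec = np.nanpercentile(X, references, axis=0)

# Both forms compute the same values; the vectorized call avoids the Python-level loop.
np.testing.assert_allclose(quantiles_loop, quantiles_vec)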

Any other comments?

@github-actions (bot) commented Sep 12, 2023

✔️ Linting Passed

All linting checks passed. Generated for commit 49d2dcf.

@xuefeng-xu (Contributor, Author)

Hi @glemaitre, would you like to take a look?

@glemaitre (Member)

I am not convinced it brings any performance gain, since this is not the bottleneck of the transformer. However, it will be a regression in terms of memory consumption, since we are allocating an (n_subsample, n_features) matrix while we were previously computing the quantiles iteratively with an (n_subsample,) buffer.

Do you have any benchmark that shows that the current subsampling is problematic?
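
For a rough sense of the scale in question, a back-of-the-envelope sketch (illustrative float64 shapes, not a measurement of the transformer itself):

# Approximate sizes of the buffers discussed above (float64, 8 bytes per value).
n_subsample, n_features = 100_000, 100

full_matrix_mib = n_subsample * n_features * 8 / 2**20  # one (n_subsample, n_features) copy
one_column_mib = n_subsample * 8 / 2**20                # one (n_subsample,) per-column buffer

print(f"(n_subsample, n_features) copy: {full_matrix_mib:.1f} MiB")  # ~76.3 MiB
print(f"(n_subsample,) column buffer: {one_column_mib:.2f} MiB")     # ~0.76 MiB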

@xuefeng-xu (Contributor, Author)

I agree; only modifying the subsampling might not be an improvement on its own. I further removed the for-loop and the transpose when computing the quantiles, and this does improve performance.

import numpy as np
from sklearn.preprocessing import QuantileTransformer
X = np.random.rand(10**5, 100)

%timeit QuantileTransformer().fit(X)

# before: 297 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# after: 135 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@xuefeng-xu changed the title from "ENH Improve the efficiency of subsampling in QuantileTransformer" to "ENH Improve the efficiency of QuantileTransformer" on Oct 11, 2023
@betatim (Member) commented Oct 24, 2023

It looks like the memory use is similar on main and this PR. I also see faster runtimes (100ms vs 240ms).

main:

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5  184.344 MiB  184.344 MiB           1   @profile
     6                                         def do_it():
     7  198.094 MiB   13.750 MiB           1       QuantileTransformer().fit(X)

This PR:

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5  185.922 MiB  185.922 MiB           1   @profile
     6                                         def do_it():
     7  195.844 MiB    9.922 MiB           1       QuantileTransformer().fit(X)
quantile.py:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.rand(10**5, 100)

@profile
def do_it():
    QuantileTransformer().fit(X)

if __name__ == "__main__":
    do_it()

@xuefeng-xu (Contributor, Author)

@betatim Thanks!

@glemaitre (Member)

@betatim Do you think this is worth breaking backward compatibility?

If we go down this road, I would advocate adding a new subsample function to the utilities (next to resample and shuffle) and sharing the redundant code.

@betatim (Member) commented Oct 30, 2023

What makes this a backwards incompatible change?

The speedup seems worth some effort.

@glemaitre (Member)

> What makes this a backwards incompatible change?

This is more a change of behaviour: with the same random_state, you don't get the same results.
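
A minimal sketch of why the results change (hypothetical shapes, not the actual scikit-learn code): drawing one subsample per column consumes the random stream differently from drawing a single subsample for the whole matrix, so the same random_state ends up selecting different rows for all but the first column.

import numpy as np

n_samples, n_features, subsample = 1000, 3, 100

# Old behaviour (schematically): one draw of row indices per column.
rng = np.random.RandomState(0)
per_column_idx = [rng.choice(n_samples, size=subsample, replace=False) for _ in range(n_features)]

# New behaviour (schematically): a single draw of row indices shared by all columns.
rng = np.random.RandomState(0)
shared_idx = rng.choice(n_samples, size=subsample, replace=False)

print(np.array_equal(per_column_idx[0], shared_idx))  # True: same stream state for the first draw
print(np.array_equal(per_column_idx[1], shared_idx))  # False (with overwhelming probability)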

@betatim (Member) commented Oct 31, 2023

@xuefeng-xu do you have some numbers showing the runtime of this preprocessor for a real-world use case? On a relative scale, a roughly factor-of-two speedup is impressive. On an absolute scale, a change from 250 ms to 100 ms takes something that is already very fast and makes it very, very fast. But if real-world use cases saw a runtime improvement from 20 s to 10 s, that would be cool.

I don't have a feeling for how annoying it is for people if results change with the same random_state :-/

@glemaitre (Member)

> I don't have a feeling for how annoying it is for people if results change with the same random_state :-/

OK so we need to acknowledge it in the "Changed model" section of the changelog.

@xuefeng-xu (Contributor, Author)

@betatim I used this dataset for testing; it has about 2.1M examples and is about 600 MB in size.

It seems that the time is reduced, with similar memory usage.

Time

import pandas as pd
from sklearn.preprocessing import QuantileTransformer

df = pd.read_csv("amz_ca_total_products_data_processed.csv")
df = df[["stars","reviews","price","listPrice","boughtInLastMonth"]] # 5 numeric columns

%timeit QuantileTransformer().fit(df)

before

138 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

after

39 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Memory

import pandas as pd
from sklearn.preprocessing import QuantileTransformer

df = pd.read_csv("amz_ca_total_products_data_processed.csv")
df = df[["stars","reviews","price","listPrice","boughtInLastMonth"]] # 5 numeric columns

@profile
def my_func():
    QuantileTransformer().fit(df)

if __name__ == '__main__':
    my_func()

before

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7  731.312 MiB  731.312 MiB           1   @profile
     8                                         def my_func():
     9  731.672 MiB    0.359 MiB           1       QuantileTransformer().fit(df)

after

$ python -m memory_profiler quantile.py
Filename: quantile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7  736.406 MiB  736.406 MiB           1   @profile
     8                                         def my_func():
     9  737.438 MiB    1.031 MiB           1       QuantileTransformer().fit(df)

@xuefeng-xu (Contributor, Author)

Hi @glemaitre, I have added this to the changed models section.

Comment on lines 2598 to 2604
if self.subsample < n_samples:
    subsample_idx = random_state.choice(
        n_samples, size=self.subsample, replace=False
    )
    X = _safe_indexing(X, subsample_idx)

self.quantiles_ = np.nanpercentile(X, references, axis=0)
(Member)

Could you also factor out the part that is in common with KBinsDiscretizer by adding a new utility function called subsample?

We would need to add it to classes.rst and have a small test to check the behaviour as well.

@xuefeng-xu (Contributor, Author)

OK, but I'm not sure which option is better. Could you give me some suggestions?

Option 1

# QuantileTransformer and KBinsDiscretizer
if self.subsample is not None and n_samples > self.subsample:
    X = subsample(X, n_samples=self.subsample, random_state=random_state)

# subsample method in utility
def subsample(*arrays, n_samples, random_state):
    return resample(*arrays, replace=False, n_samples=n_samples, random_state=random_state)

Option 2

# QuantileTransformer and KBinsDiscretizer
X = subsample(X, n_samples=self.subsample, random_state=random_state)

# subsample method in utility
def subsample(*arrays, n_samples, random_state):
    subsample = n_samples
    first = arrays[0]
    n_samples = first.shape[0] if hasattr(first, "shape") else len(first)
    if subsample is not None and n_samples > subsample:
        return resample(*arrays, replace=False, n_samples=subsample, random_state=random_state)
    else:
        return arrays

(Member)

Oh, actually, just call the resample function directly. I did not know that it exposed all the necessary parameters:

if self.subsample is not None and n_samples > self.subsample:
    # Take a subsample of `X`
    X = resample(X, replace=False, n_samples=self.subsample, random_state=random_state)

@xuefeng-xu (Contributor, Author)

Done.

@glemaitre (Member) left a comment

LGTM otherwise.

@@ -35,6 +35,11 @@ random sampling procedures.
solvers (when fit on the same data again). The amount of change depends on the
specified `tol`, for small values you will get more precise results.

- |Efficiency| :class:`preprocessing.QuantileTransformer` now uses `resample` function to
(Member)

We need to check that we don't change the output of the KBinsDiscretizer as well. We might be calling another NumPy API.

@xuefeng-xu (Contributor, Author)

I think this won't affect KBinsDiscretizer because resample also uses _safe_indexing.

I also tested with the following code to make sure they are equal.

import numpy as np
from sklearn.utils import resample, check_random_state, _safe_indexing

random_state = 0
num_example = 1000
subsample = 100

X = np.random.randint(low=0, high=10**2, size=num_example)

# before
rng = check_random_state(random_state)
subsample_idx = rng.choice(num_example, size=subsample, replace=False)
result1 = _safe_indexing(X, subsample_idx)

# this PR
result2 = resample(
    X, replace=False, n_samples=subsample, random_state=random_state
)

np.testing.assert_array_equal(result1, result2)

(Member)

It indeed looks like

np.random.RandomState(0).choice(n_samples, size=subsample, replace=False)

gives the same sampling as

a = np.arange(n_samples)
np.random.RandomState(0).shuffle(a)
a[:subsample]

which is probably what is used underneath. I'm okay with this change, although here it has no impact on efficiency.

@xuefeng-xu (Contributor, Author)

Moved the entry to the v1.5 changelog.

@glemaitre added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one!) on Mar 11, 2024
@jeremiedbb (Member) left a comment

LGTM. Thanks @xuefeng-xu. I resolved the conflicts and rephrased the changelog entry a bit.

I just have one remark: before, the subsampling was different for each column; now it's the same for all columns. I don't think we care, since it's just to speed up the computation of the quantiles, but I want to make sure we're not missing something. @glemaitre @betatim?

@betatim (Member) commented Mar 21, 2024

I think using the same rows for each column is fine. It is different from before, but if you woke me at 4am and asked me to implement this, I'd probably have gone with selecting the same rows for each feature anyway.

@glemaitre (Member)

I think that was my original remark, and after speaking with @ogrisel I settled on this being good enough. So still LGTM; I'll merge then. Thanks @jeremiedbb for updating this PR, and thanks @xuefeng-xu for the original work.

@glemaitre merged commit c63b21e into scikit-learn:main on Mar 28, 2024
@xuefeng-xu deleted the subsample branch on March 28, 2024
Labels: module:preprocessing, Waiting for Second Reviewer