ENH Improve the efficiency of QuantileTransformer #27344
Conversation
Hi @glemaitre, would you like to take a look?
I am not convinced it brings any performance gain, since this is not the bottleneck of the transformer. However, it will be a regression in terms of memory consumption, since we are allocating a matrix. Do you have any benchmark that shows that the current subsampling is problematic?
I agree, only modifying the subsampling might not be an improvement. I further removed the for-loop and transpose while computing the quantiles, and this does improve performance.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.rand(10**5, 100)
%timeit QuantileTransformer().fit(X)
# before: 297 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# after: 135 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
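The vectorized quantile computation behind this speedup can be sketched as follows (a minimal illustration, not the PR's actual diff): a per-column loop over `np.nanpercentile` is replaced by a single call with `axis=0`, which returns the same values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
references = np.linspace(0, 100, 11)  # percentile grid, in percent

# loop over columns (roughly the old pattern)
quantiles_loop = np.array(
    [np.nanpercentile(col, references) for col in X.T]
).T

# one vectorized call over all columns (the new pattern)
quantiles_vec = np.nanpercentile(X, references, axis=0)

np.testing.assert_allclose(quantiles_loop, quantiles_vec)
```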
It looks like the memory use is similar on this PR.

`quantile.py`:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.rand(10**5, 100)

@profile  # memory_profiler decorator
def main():
    QuantileTransformer().fit(X)

if __name__ == "__main__":
    main()
```
@betatim Thanks!
@betatim Do you think this is worth breaking backward compatibility? If we go down this road, I would advocate adding a new …
What makes this a backwards-incompatible change? The speedup seems worthwhile.
> What makes this a backwards incompatible change?

This is more a change of behaviour: with the same `random_state`, the results will differ.
@xuefeng-xu do you have some numbers showing the runtime of this preprocessor for a real world use case? On a relative scale, a roughly factor-of-two speedup is impressive. On an absolute scale, a change from 250ms to 100ms takes something that is already very fast to something that is very, very fast. But if real world use-cases see a runtime improvement from 20s to 10s, that would be cool. I don't have a feeling for how annoying it is for people if results change with the same `random_state`.
OK so we need to acknowledge it in the "Changed model" section of the changelog. |
@betatim I used this dataset for the test; it has about 2.1M examples and is around 600MB in size. The time is reduced, with similar memory usage.

Time:

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

df = pd.read_csv("amz_ca_total_products_data_processed.csv")
df = df[["stars", "reviews", "price", "listPrice", "boughtInLastMonth"]]  # 5 numeric columns
%timeit QuantileTransformer().fit(df)
```

before
after
Memory:

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

df = pd.read_csv("amz_ca_total_products_data_processed.csv")
df = df[["stars", "reviews", "price", "listPrice", "boughtInLastMonth"]]  # 5 numeric columns

@profile
def my_func():
    QuantileTransformer().fit(df)

if __name__ == "__main__":
    my_func()
```

before
after
Hi @glemaitre, I have added this in the changed models section.
sklearn/preprocessing/_data.py (outdated)

```python
if self.subsample < n_samples:
    subsample_idx = random_state.choice(
        n_samples, size=self.subsample, replace=False
    )
    X = _safe_indexing(X, subsample_idx)

self.quantiles_ = np.nanpercentile(X, references, axis=0)
```
Could you also factorize this part that is in common with `KBinsDiscretizer`, by adding a new utility function called `subsample`? We would need to add it in `classes.rst` and have a small test to check the behaviour as well.
Ok, but I'm not sure which is better. Could you give me some suggestions?

Option 1

```python
# QuantileTransformer and KBinsDiscretizer
if self.subsample is not None and n_samples > self.subsample:
    X = subsample(X, n_samples=self.subsample, random_state=random_state)

# subsample method in utility
def subsample(*arrays, n_samples, random_state):
    return resample(*arrays, replace=False, n_samples=n_samples, random_state=random_state)
```

Option 2

```python
# QuantileTransformer and KBinsDiscretizer
X = subsample(X, n_samples=self.subsample, random_state=random_state)

# subsample method in utility
def subsample(*arrays, n_samples, random_state):
    subsample = n_samples
    first = arrays[0]
    n_samples = first.shape[0] if hasattr(first, "shape") else len(first)
    if subsample is not None and n_samples > subsample:
        # note: pass the requested subsample size, not the full n_samples
        return resample(*arrays, replace=False, n_samples=subsample, random_state=random_state)
    else:
        return arrays
```
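For what it's worth, Option 1 can be exercised end-to-end with a tiny script. The `subsample` helper here is the hypothetical utility being discussed, not an existing scikit-learn function:

```python
import numpy as np
from sklearn.utils import resample

def subsample(*arrays, n_samples, random_state):
    # hypothetical utility: thin wrapper over resample, as in Option 1
    return resample(*arrays, replace=False, n_samples=n_samples, random_state=random_state)

X = np.arange(20).reshape(10, 2)
X_sub = subsample(X, n_samples=4, random_state=0)
print(X_sub.shape)  # (4, 2): four distinct rows drawn without replacement
```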
Oh, actually, call the `resample` function directly. I did not know that it was exposing all the necessary parameters:

```python
if self.subsample is not None and n_samples > self.subsample:
    # Take a subsample of `X`
    X = resample(X, replace=False, n_samples=self.subsample, random_state=random_state)
```
done
LGTM otherwise.
doc/whats_new/v1.4.rst (outdated)

```diff
@@ -35,6 +35,11 @@ random sampling procedures.
 solvers (when fit on the same data again). The amount of change depends on the
 specified `tol`, for small values you will get more precise results.

+- |Efficiency| :class:`preprocessing.QuantileTransformer` now uses the `resample` function to
+  subsample the data.
```
We need to check that we don't change the output of the KBinsDiscretizer
as well. We might be calling another NumPy API.
I think this won't affect `KBinsDiscretizer`, because `resample` also uses `_safe_indexing`. I also tested with the following code to make sure they are equal:

```python
import numpy as np
from sklearn.utils import resample, check_random_state, _safe_indexing

random_state = 0
num_example = 1000
subsample = 100
X = np.random.randint(low=0, high=10**2, size=num_example)

# before
rng = check_random_state(random_state)
subsample_idx = rng.choice(num_example, size=subsample, replace=False)
result1 = _safe_indexing(X, subsample_idx)

# this PR
result2 = resample(
    X, replace=False, n_samples=subsample, random_state=random_state
)

np.testing.assert_array_equal(result1, result2)
```
It indeed looks like

```python
np.random.RandomState(0).choice(n_samples, size=subsample, replace=False)
```

gives the same sampling as

```python
a = np.arange(n_samples)
np.random.RandomState(0).shuffle(a)
a[:subsample]
```

which is probably what's used underneath. I'm okay with this change, although here it has no impact on efficiency.
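This equivalence is easy to check directly; a small sanity script, assuming legacy `RandomState` semantics (where `choice(..., replace=False)` is implemented via a permutation of the population):

```python
import numpy as np

n_samples, subsample = 1000, 100

# draw without replacement via choice
picked = np.random.RandomState(0).choice(n_samples, size=subsample, replace=False)

# shuffle the full index range with the same seed, then take a prefix
a = np.arange(n_samples)
np.random.RandomState(0).shuffle(a)

np.testing.assert_array_equal(picked, a[:subsample])
```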
Move to changelog v1.5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @xuefeng-xu. I resolved the conflicts and rephrased the changelog entry a bit.
I just have 1 remark: before the subsampling was different for each column and now it's the same for all columns. I don't think we care since it's just to speed up the computation of the quantiles, but I just want to make sure we're not missing something. @glemaitre @betatim ?
I think using the same rows for each column is fine. It is different from before, but if you woke me at 4am and asked me to implement this, I'd probably have gone with the approach of selecting the same rows for each feature.
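The behavioural difference being discussed can be made concrete with a toy sketch (hypothetical data; the old per-column draw and the new shared draw start from the same seed):

```python
import numpy as np

n, k, size = 6, 2, 3
X = np.arange(n * k).reshape(n, k)

# old behaviour (sketch): one draw per column, so each column can keep different rows
rng = np.random.RandomState(0)
per_column = [X[rng.choice(n, size=size, replace=False), j] for j in range(k)]

# new behaviour (sketch): a single row draw shared by every column
idx = np.random.RandomState(0).choice(n, size=size, replace=False)
shared = X[idx]

# the first column agrees, because the first per-column draw sees the same RNG state
np.testing.assert_array_equal(per_column[0], shared[:, 0])
```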
I think that was my original remark, and after speaking with @ogrisel I settled on thinking this would be good enough. So still LGTM. I'll merge, then. Thanks @jeremiedbb for updating this PR and thanks @xuefeng-xu for the original work.
Reference Issues/PRs
See #27263
What does this implement/fix? Explain your changes.
The original subsampling in QuantileTransformer was done column by column; I think the subsampling can be done just once to improve efficiency. Also, I removed the for-loop and transpose while computing the quantiles.
Any other comments?