ENH Add clip parameter to MaxAbsScaler #31790
jeremiedbb
left a comment
Thanks for the PR @glevv. Here are some comments
doc/whats_new/upcoming_changes/sklearn.preprocessing/31790.enhancement.rst
X_test = (
    sparse_container([np.r_[X_min.data[:2] - 10, X_max.data[2:] + 10]])
    if sparse_container
    else [np.r_[X_min[:2] - 10, X_max[2:] + 10]]
)
I'd rather use numpy equivalents that have an explicit name
- X_test = (
-     sparse_container([np.r_[X_min.data[:2] - 10, X_max.data[2:] + 10]])
-     if sparse_container
-     else [np.r_[X_min[:2] - 10, X_max[2:] + 10]]
- )
+ X_test = np.hstack((X_min[:2] - 10, X_max[2:] + 10)).reshape(1, -1)
+ if sparse_container:
+     X_test = sparse_container(X_test)
I switched to np.hstack, but sparse arrays cannot be indexed, so this code snippet won't work.
Right, we need to do something like hstack = sp.hstack if sp.issparse(X) else np.hstack.
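The dispatch suggested above can be sketched as follows. This is only an illustration, not the code that ended up in the PR: `hstack_any` is a hypothetical helper name, and the inputs are toy single-row blocks rather than the `X_min` / `X_max` values from the test.

```python
import numpy as np
import scipy.sparse as sp


def hstack_any(blocks):
    """Hypothetical helper: stack blocks horizontally, sparse or dense."""
    # Pick the SciPy stacker as soon as one block is sparse, else NumPy's.
    hstack = sp.hstack if any(sp.issparse(b) for b in blocks) else np.hstack
    return hstack(blocks)


# Dense blocks go through np.hstack and stay a NumPy array.
dense = hstack_any([np.array([[1.0, 2.0]]), np.array([[3.0]])])

# Sparse blocks go through sp.hstack and stay sparse.
sparse = hstack_any([sp.csr_matrix([[1.0, 2.0]]), sp.csr_matrix([[3.0]])])
```

Both calls produce a single row with three columns; only the container type differs.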
I decided to refactor the test function by splitting it into two. It should be cleaner this way.
The only thing I missed is that it should be:
np.max(np.abs(X))
instead of just max. With these test datasets it does not matter, since both contain only zero or positive numbers, but I will change it.
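To make the difference concrete, here is a small sketch with a toy array that does contain negative values (the data is made up for illustration and is not taken from the PR's tests). Dividing by the plain column maximum leaves values outside [-1, 1], while dividing by the maximum absolute value does not.

```python
import numpy as np

# Toy data with negative entries, unlike the datasets used in the tests.
X = np.array([[-4.0, 1.0],
              [2.0, -0.5]])

col_max = np.max(X, axis=0)              # plain per-column maximum
col_max_abs = np.max(np.abs(X), axis=0)  # per-column maximum absolute value

scaled_by_max = X / col_max
scaled_by_max_abs = X / col_max_abs

# Largest magnitude after each scaling: above 1 for the plain maximum,
# exactly 1 for the maximum absolute value.
largest_with_max = np.abs(scaled_by_max).max()
largest_with_max_abs = np.abs(scaled_by_max_abs).max()
```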
Thanks for the PR, @glevv!
From the np.clip docs, it seems that we can pass min=-1.0, max=1.0 as kwargs (into _modify_in_place_if_numpy) and it is then array API compatible (which is a bit surprising to me), but the array API spec seems to allow positional passing. Maybe it's fine as it is, but it would be better to have a test.
Oh sorry, I now see you have added MaxAbsScaler(clip=True) to test_preprocessing_array_api_compliance.
I had clicked "send review" by accident and too early. Give me a sec to check everything.
Edit:
That looks all fine to me. :)
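For readers following along, the clipping step under discussion amounts to the following. This is a plain NumPy sketch with made-up data, not scikit-learn's implementation (it does not go through `_modify_in_place_if_numpy`), and it passes the bounds positionally, which is the form the comment above notes the array API spec allows.

```python
import numpy as np

# Fit: per-column maximum absolute value of the training data.
X_train = np.array([[1.0, -2.0],
                    [0.5, 4.0]])
scale = np.max(np.abs(X_train), axis=0)

# Transform: held-out data may exceed the training range after scaling.
X_test = np.array([[3.0, -8.0],
                   [0.5, 2.0]])
scaled = X_test / scale

# Clip to [-1, 1], passing the bounds positionally.
clipped = np.clip(scaled, -1.0, 1.0)
```

Values that were already inside [-1, 1] are left unchanged; only the out-of-range entries in the first row are pulled back to the bounds.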
else:
    X /= self.scale_
if self.clip:
    device_ = device(X)
device_ = device(X)
I think it would be more consistent with the rest of the codebase to use
xp, _, device_ = get_namespace_and_device(X)
at the beginning of transform instead.
I believe that was done to be consistent with the handling of clip in MinMaxScaler which does it that way
I was repeating the clip behavior of MinMaxScaler. If it's not correct, I can change it
It is correct, but there is a small risk that we later accidentally introduce a change in the device of X between the beginning of the transform method and this point. In that case, though, the array API tests would fail. I think it's fine and safe.
StefanieSenger
left a comment
Thanks for your further work and for restructuring the tests, @glevv. They read much more intuitively to me now.
I now only have some typo nits.
… into maxabs-scaler-clip
StefanieSenger
left a comment
Thanks @glevv! I checked through everything again and it looks good to me.
)
X = sparse_container(X)
scaler = MaxAbsScaler(clip=True).fit(X)
X_max = np.max(np.abs(X), axis=0)
Sorry for the nit, but I found this confusing. Using scipy directly is more straightforward, I think. Do you think it's valid?
- X_max = np.max(np.abs(X), axis=0)
+ X_max = X.max(axis=0)
You can look at this discussion #31790 (comment)
I decided to go for max(abs(X)): even though it is unnecessary in a mathematical sense, it is better in terms of readability and versatility of the inputs, in my opinion.
As for np.max() versus .max(), I am almost sure NumPy will use SciPy's internal method for the calculation, so there should be no difference, but I could be wrong.
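A quick way to compare the two suggestions on sparse input is sketched below. The data is a made-up example with negative entries (the PR's test data has none), and the sketch uses the built-in `abs` and the sparse `.max` method directly, so it does not settle how `np.max` dispatches internally.

```python
import numpy as np
import scipy.sparse as sp

X_dense = np.array([[0.0, -3.0, 1.0],
                    [2.0, 0.0, -5.0]])
X_sparse = sp.csr_matrix(X_dense)

# Reference value computed on the dense array.
dense_max_abs = np.max(np.abs(X_dense), axis=0)

# Sparse route: abs() keeps the matrix sparse, .max(axis=0) reduces per column.
sparse_max_abs = abs(X_sparse).max(axis=0).toarray().ravel()

# Without abs(), negative entries are ignored in favour of larger values,
# including the implicit zeros.
sparse_max = X_sparse.max(axis=0).toarray().ravel()
```

With non-negative data the last two results coincide, which is why the choice makes no difference for the datasets in the test.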
jeremiedbb
left a comment
I pushed a commit to simplify the test. I made it more similar to the MinMaxScaler one and to the first version you wrote. There was no real reason to split the test into dense and sparse variants and create a new toy dataset. Instead, I added a comment explaining how and for what purpose we construct the test sample. I also added a similar comment to the MinMaxScaler test.
LGTM. Thanks @glevv !
Co-authored-by: Jérémie du Boisberranger <[email protected]>
Reference Issues/PRs
Closes #31672
What does this implement/fix? Explain your changes.
clip parameter to MaxAbsScaler class; clip parameter to maxabs_scale function; clip parameter in MaxAbsScaler class for sparse and dense arrays; clip parameter in MaxAbsScaler class.
Any other comments?