[WIP] Refactor scaler code #3639
Conversation
(force-pushed from 53e2a0c to 468e0a3)
Changes Unknown when pulling 468e0a3 on untom:refactor_scaling into * on scikit-learn:master*.
Sorry for the slow reply. I tend to think like @larsmans: it's too late to change the existing attribute names.
To me this is a YAGNI. We should not introduce a new feature and extend the public API when there is no obvious use case.
On the other hand, here are interesting features / improvements for the preprocessing module:
Maybe @jnothman would like to comment on this thread as well.
And @amueller as well.
I'm for adding more scalers; I would have to have a closer look to see how much duplication there is and what the best way to refactor would be. The main reason to unify the attributes is to improve code sharing, right?
Yes. Have a look at the original PR: #2514. For …
Sorry for taking such a long time to reply.
The reason to implement this was that … The advantage of adding the … But I can't think of any real use case besides that, so maybe it would be more appropriate to have this as an undocumented feature? (Or remove it completely, and have the …)
+1
So the idea of this PR is to do two things, right?
Shouldn't that result in less code, not more?
(force-pushed from 3937060 to a89c91c)
@amueller: Yes, although this PR is mainly meant to pave the way to adding more scalers (concretely, RobustScaler and MaxAbsScaler). As far as "making the attributes consistent" goes, the consensus seems to be that it's too late to change this, so I undid those changes.

FYI, I don't know why it says "This pull request contains merge conflicts that must be resolved."... at least on my local branch I can cleanly merge this into master.
@ogrisel: I've changed the code to do it this way.
(force-pushed from a89c91c to 4c05f6b)
Do a rebase onto the current master.
(force-pushed from 4c05f6b to 6570389)
I tried that, but it doesn't seem to work:
Your master may not be up-to-date. In case you need it, first update your local master from upstream, then rebase again.
(force-pushed from 6570389 to 827ebf9)
Thanks, this did the trick! Not sure why it didn't work before (could've sworn I'd pulled from upstream before my first try).
        self.feature_range = feature_range
        self.copy = copy

    -   def fit(self, X, y=None):
    +   def fit(self, X, y=None, copy=None):
Why did you add this fit parameter?
        data_range = self._handle_zeros_in_scale(data_range)
        self.scalefactor_ = data_range / (feature_range[1] - feature_range[0])
        self.center_ = data_min - feature_range[0] * self.scalefactor_

        self.scale_ = (feature_range[1] - feature_range[0]) / data_range
If these are here but not used, I guess maybe they should be read-only attributes, to make explicit that they are not used?
I'm not sure what you are referring to. Could you elaborate?
`scale_` and `min_` are attributes that are set but not used in `transform`, right? So an unsuspecting user might try to change `min_` to change the behavior of the estimator, but nothing will happen. Also I'm wondering whether these names should be deprecated then.
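To make this concrete, here is a minimal toy sketch (hypothetical code, not this PR's implementation; the formulas mirror the diff above) of what "set but not used in `transform`" means in practice:

```python
import numpy as np

# Toy sketch, not scikit-learn code: fit() still sets the old public
# attributes scale_/min_, but transform() only reads center_/scalefactor_,
# so editing scale_ or min_ silently has no effect.
class ToyMinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range

    def fit(self, X, y=None):
        lo, hi = self.feature_range
        data_min = X.min(axis=0)
        data_range = X.max(axis=0) - data_min
        # attributes actually used by transform()
        self.scalefactor_ = data_range / (hi - lo)
        self.center_ = data_min - lo * self.scalefactor_
        # exposed for backward compatibility, but never read again
        self.scale_ = (hi - lo) / data_range
        self.min_ = lo - data_min * self.scale_
        return self

    def transform(self, X):
        return (X - self.center_) / self.scalefactor_

X = np.array([[0.0], [5.0], [10.0]])
scaler = ToyMinMaxScaler().fit(X)
before = scaler.transform(X)
scaler.min_ = 123.0                   # an unsuspecting user tweaks min_ ...
after = scaler.transform(X)
print(np.array_equal(before, after))  # True: the change had no effect
```

Note that in released scikit-learn versions `MinMaxScaler.transform` does read `scale_` and `min_`; the concern is specific to the refactored code in this PR.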
> Also I'm wondering whether these names should be deprecated then.
My original proposal did deprecate these names, but I think the consensus was that it is too late to fundamentally change the existing API.
(force-pushed from 8b7bb9f to 4cba946)
(force-pushed from 4cba946 to 042bb96)
Sorry for the lack of feedback. What is the status here? I lost track a bit.
So there is a …
Ok, quick overview of this: My original PR for this was #2514, which added two new scalers (RobustScaler and MaxAbsScaler). … Thus I decided to submit the new scalers first, figuring the code could always be refactored later (when the usefulness of the refactoring became more apparent). In #4125 I tried submitting one of the new scalers on its own.

At this point I am quite frustrated with the whole process, as I have wasted a lot of time on this throughout the 1.5 years since the first submission of the code. I don't know what more I should have done to get any of these PRs merged, but I'd be happy for any feedback. If there's anything I can do to make this easier, I'd love to hear it for future PRs.

With that said, if there is true interest in …
I am very sorry for the lack of feedback, and I totally understand your frustration. We are really starved for devs / reviewers at the moment. I think both RobustScaler and MaxAbsScaler would be great additions.
Sounds good! Thanks for taking the time to review #4125 :)
Now let's get #4828 in next :)
I think most (all?) of the patches are in by now, so I'm going to close this PR.
Well, you also introduced a base-class that got rid of some code duplication...
Thank you for your patience and all your work btw :)
True, but introducing that base-class would've required renaming the existing attributes.
Fair enough :) Thanks again!
This PR refactors the scalers in `preprocessing` to use a common baseclass (`BaseScaler`). This is done to share some common functionality / avoid code duplication. This PR is part of a series of PRs that split up #2514, and other PRs in this series intend to introduce new scalers that will make use of `BaseScaler` (e.g. a scaler that uses robust statistics).

There is one issue with this PR that needs to be discussed:
All scalers (both the existing ones and the ones I'm going to introduce) center the data and then scale it to fall into some range. So in essence, all scalers do

    x_scaled = (x - center) / scale

for each feature-column x. The existing scalers already have attributes that expose their centering/scaling statistics, however those are not uniformly named. I propose that the `StandardScaler.mean_` / `StandardScaler.std_` / `MinMaxScaler.scale_` and `MinMaxScaler.min_` attributes be deprecated and replaced with `*.center_` / `*.scale_`, and that the `with_mean` / `with_std` constructor arguments be renamed to `with_centering` / `with_scaling`.
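For illustration, a minimal sketch (hypothetical code, not the actual `BaseScaler` in this PR) of the shared pattern under the proposed `center_` / `scale_` naming:

```python
import numpy as np

# Hypothetical sketch of the common "center then scale" pattern;
# the real BaseScaler in this PR keeps the old public attributes
# and uses an internal scalefactor_ instead.
class BaseScalerSketch:
    def transform(self, X):
        # every scaler applies the same per-column affine map;
        # only the fitted statistics differ between subclasses
        return (X - self.center_) / self.scale_

class StandardScalerSketch(BaseScalerSketch):
    def fit(self, X, y=None):
        self.center_ = X.mean(axis=0)   # mean_ in the released API
        self.scale_ = X.std(axis=0)     # std_ in the released API
        return self

class MinMaxScalerSketch(BaseScalerSketch):
    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range

    def fit(self, X, y=None):
        lo, hi = self.feature_range
        data_min = X.min(axis=0)
        data_range = X.max(axis=0) - data_min
        self.scale_ = data_range / (hi - lo)
        self.center_ = data_min - lo * self.scale_
        return self
```

With this split, `transform` (and symmetrically an inverse transformation) would only need to be implemented once in the base class.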
@larsmans already said back in #2514 that he is against renaming the `mean_` / `std_` attributes, since the scaler API should be stable. An additional problem is that my proposal would change the meaning of `MinMaxScaler.scale_` (as the previous implementation used slightly different math to arrive at the same scaling result). This will of course break user code that uses `MinMaxScaler.scale_` and relies on its current implementation. (NOTE: I had overlooked this when proposing #2514 and just noticed it when preparing this PR.)
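A small numeric check with made-up values, to spell out the "different math, same result" point: the released `MinMaxScaler` multiplies by its `scale_`, while the unified `(x - center_) / scale_` form divides by the reciprocal value:

```python
# Made-up example values.  Released MinMaxScaler computes
#   x * scale_ + min_       with  scale_ = (hi - lo) / data_range,
# whereas the unified form computes
#   (x - center_) / scale_  with  scale_ = data_range / (hi - lo).
data_min, data_max = 2.0, 12.0
lo, hi = 0.0, 1.0
data_range = data_max - data_min

old_scale = (hi - lo) / data_range        # 0.1  (a multiplier)
old_min = lo - data_min * old_scale       # -0.2
new_scale = data_range / (hi - lo)        # 10.0 (a divisor)
new_center = data_min - lo * new_scale    # 2.0

x = 7.0
print(x * old_scale + old_min)            # 0.5
print((x - new_center) / new_scale)       # 0.5 -- same result, different scale_
```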
There are several ways to deal with this:

1. Don't introduce the new `scale_` / `center_` attributes in StandardScaler/MinMaxScaler, and just keep the old `mean_` / `std_` / `scale_` attributes.
2. Deprecate `mean_` / `std_` and remove them after a few releases. Print a warning whenever a user uses `scale_` on `MinMaxScaler` that the value has changed.

I would really appreciate any input on what the right thing to do here is!
NOTE: As a side-effect of this PR, `StandardScaler` and `MinMaxScaler` gain the ability to scale rows/samples instead of just columns/features (i.e., it is possible to use `axis=1`). However, no tests for this new functionality are included in this PR. This is because I intend to send another PR tomorrow that refactors the tests in this module (and adds the tests for `axis=1`). I just thought that splitting up the tests makes it easier to review the changes (but I can add that other PR to this one if you wish).
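For reference, a plain-NumPy illustration (not this PR's API) of the difference between scaling per feature column (`axis=0`) and per sample row (`axis=1`):

```python
import numpy as np

# Standardize along each axis; keepdims keeps the shapes broadcastable.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 8.0, 12.0]])

per_feature = (X - X.mean(axis=0)) / X.std(axis=0)  # axis=0: each column
per_sample = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)  # axis=1: each row
print(per_feature)
print(per_sample)
```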