[WIP] Refactor scaler code #3639


Closed · wants to merge 2 commits

Conversation

@untom (Contributor) commented Sep 5, 2014

This PR refactors the scalers in preprocessing to use a common base class (BaseScaler). This is done to share common functionality and avoid code duplication. This PR is part of a series of PRs that split up #2514; other PRs in this series will introduce new scalers that make use of BaseScaler (e.g. a scaler that uses robust statistics).

There is one issue with this PR that needs to be discussed:

All scalers (both the existing ones and the ones I'm going to introduce) center the data and then scale it to fall into some range. So in essence, all scalers do

(x - some_centering_statistic) / some_scaling_statistic

for each feature column x. The existing scalers already have attributes that expose their centering/scaling statistics, but those are not uniformly named. I propose that the StandardScaler.mean_/StandardScaler.std_/MinMaxScaler.scale_ and MinMaxScaler.min_ attributes be deprecated and replaced with *.center_/*.scale_, and that the with_mean/with_std constructor arguments be renamed to with_centering/with_scaling.
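For concreteness, a minimal sketch of the shared pattern under the proposed naming (illustrative only; the actual BaseScaler in this PR may differ):

import numpy as np

class BaseScaler:
    """Sketch: subclasses only compute center_ and scale_ in fit()."""

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # The shared per-column operation that every scaler reduces to:
        return (X - self.center_) / self.scale_

    def inverse_transform(self, X):
        X = np.asarray(X, dtype=float)
        return X * self.scale_ + self.center_

A StandardScaler subclass would then set center_ to the per-column means and scale_ to the per-column standard deviations in fit(); MinMaxScaler would set the analogous min/range statistics.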

@larsmans already said back in #2514 that he is against renaming the mean_/std_ attributes, since the scaler API should be stable. An additional problem is that my proposal would change the meaning of MinMaxScaler.scale_ (the previous implementation uses slightly different math to arrive at the same scaling result). This will of course break user code that uses MinMaxScaler.scale_ and relies on its current implementation. (NOTE: I had overlooked this when proposing #2514 and only noticed it when preparing this PR.)

There are several ways to deal with this:

  • Don't expose the new scale_/center_ attributes in StandardScaler/MinMaxScaler, and just keep the old mean_/std_/scale_ attributes.
  • Deprecate mean_/std_ and remove them after a few releases, and print a warning whenever a user accesses scale_ on MinMaxScaler saying that its value has changed.
  • ???

I would really appreciate any input on what the right thing to do here is!

NOTE: As a side effect of this PR, StandardScaler and MinMaxScaler gain the ability to scale along rows/samples instead of just columns/features (i.e., it is possible to use axis=1). However, no tests for this new functionality are included in this PR, because I intend to send another PR tomorrow that refactors the tests in this module (and adds the tests for axis=1). I just thought that splitting up the tests makes it easier to review the changes (but I can add that other PR to this one if you wish).

@untom changed the title from "[WIP] ENH Refactor scaler code" to "[WIP] Refactor scaler code" on Sep 5, 2014
@coveralls commented

Coverage Status: Changes Unknown when pulling 468e0a3 on untom:refactor_scaling into scikit-learn:master.

@untom (Contributor, Author) commented Oct 6, 2014

Pinging @larsmans @arjoly @ogrisel

I'd like some input on how to proceed here: is it okay to change the meaning of the scale_ attribute of MinMaxScaler in future releases?

@ogrisel (Member) commented Nov 25, 2014

Sorry for the slow reply. I tend to think like @larsmans: it's too late to change the mean_ and std_ attributes and the names of the StandardScaler parameters. I am not even sure it's useful to introduce a base class. If there is too much common code we can factorize it in private helper functions when necessary.

> NOTE: As a side effect of this PR, StandardScaler and MinMaxScaler gain the ability to scale along rows/samples instead of just columns/features (i.e., it is possible to use axis=1). However, no tests for this new functionality are included in this PR, because I intend to send another PR tomorrow that refactors the tests in this module (and adds the tests for axis=1). I just thought that splitting up the tests makes it easier to review the changes (but I can add that other PR to this one if you wish).

To me this is a YAGNI. We should not introduce a new feature and extend the public API when there is no obvious use case.

@ogrisel (Member) commented Nov 25, 2014

On the other hand, here are some interesting features/improvements for the preprocessing module:

  • MaxAbsScaler (it's easy to implement, it is the default strategy used by Vowpal Wabbit, it works similarly for dense and sparse data, and can trivially support online fitting with a partial_fit method as well)
  • partial_fit for StandardScaler with online estimation of the mean and the variance of each feature (see the sketch after this list). IncrementalPCA has some code to do that internally; it should be factorized.
  • RobustScaler
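To illustrate the second item, a minimal sketch (hypothetical helper, not existing scikit-learn code) of how batch-wise mean/variance estimates could be combined:

import numpy as np

def update_mean_var(batch, mean, var, count):
    """Combine running per-feature mean/variance with a new batch of samples."""
    n_new = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)

    n_total = count + n_new
    delta = batch_mean - mean
    # Count-weighted update of the mean.
    new_mean = mean + delta * n_new / n_total
    # Combine the sums of squared deviations, plus a cross-term correction.
    ssd = var * count + batch_var * n_new + delta ** 2 * count * n_new / n_total
    return new_mean, ssd / n_total, n_total

A partial_fit on StandardScaler could call such a helper once per batch and keep mean_ and std_ (the square root of the running variance) up to date.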

@ogrisel (Member) commented Nov 25, 2014

Maybe @jnothman would like to comment on this thread as well.

@ogrisel (Member) commented Nov 25, 2014

And @amueller as well.

@amueller (Member) commented
I'm for adding more scalers. I would have to have a closer look to see how much duplication there is and what the best way to refactor would be. The main reason to unify the attributes is to improve code sharing, right?

@ogrisel (Member) commented Nov 27, 2014

> The main reason to unify the attributes is to improve code sharing, right?

Yes. Have a look at the original PR: #2514. For MaxAbsScaler at least there is not so much code to share with other scalers.

@untom (Contributor, Author) commented Dec 23, 2014

Sorry for taking such a long time to reply.

> > NOTE: As a side effect of this PR, StandardScaler and MinMaxScaler gain the ability to scale along rows/samples instead of just columns/features (i.e., it is possible to use axis=1). However, no tests for this new functionality are included in this PR, because I intend to send another PR tomorrow that refactors the tests in this module (and adds the tests for axis=1). I just thought that splitting up the tests makes it easier to review the changes (but I can add that other PR to this one if you wish).
>
> To me this is a YAGNI. We should not introduce a new feature and extend the public API when there is no obvious use case.

The reason to implement this was that sklearn.preprocessing.scale already offered this functionality (as does sklearn.preprocessing.normalize, BTW). So likewise, I'd assume minmax_scale and other new scaling functions should have it.

The advantage of also adding the axis parameter to the *Scaler classes is that it allows the xxx_scale functions to have a trivial implementation that simply calls the corresponding XxxScaler internally (without losing efficiency by transposing), and it keeps a certain consistency between the function and the corresponding class.

But I can't think of any real use case besides that, so maybe it would be more appropriate to have this as an undocumented feature? (Or remove it completely, and have the xxx_scale functions transpose their arguments if needed.)
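For the latter option, a minimal sketch of what such a function wrapper could look like (the minmax_scale name and signature here are illustrative, not settled API):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def minmax_scale(X, feature_range=(0, 1), axis=0, copy=True):
    """Scale features (axis=0) or samples (axis=1) by transposing around a column-wise scaler."""
    X = np.asarray(X)
    if axis == 1:
        X = X.T
    X_scaled = MinMaxScaler(feature_range=feature_range, copy=copy).fit_transform(X)
    return X_scaled.T if axis == 1 else X_scaled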

@ogrisel (Member) commented Dec 24, 2014

> Or remove it completely, and have the xxx_scale functions transpose their arguments if needed.

+1

@amueller (Member) commented Jan 9, 2015

So the idea of this PR is to do two things, right?

  1. Make the attributes of the scalers consistent.
  2. Refactor their code.

Shouldn't that result in less code, not more?

@untom force-pushed the refactor_scaling branch 5 times, most recently from 3937060 to a89c91c on January 11, 2015 at 22:49
@untom (Contributor, Author) commented Jan 11, 2015

@amueller: Yes, although this PR is mainly meant to pave the way for adding more scalers (concretely, MaxAbsScaler and RobustScaler). While the refactorings IMHO already make the code simpler, they are only going to really pay off once those are added to the codebase.

As far as "making the attributes consistent" goes, the consensus seems to be that it's too late to change this, so I undid those changes.

FYI, I don't know why it says "This pull request contains merge conflicts that must be resolved."... at least on my local branch I can cleanly merge this into master.

@untom (Contributor, Author) commented Jan 11, 2015

> Or remove it completely, and have the xxx_scale functions transpose their arguments if needed.
>
> +1

@ogrisel: I've changed the code to do it this way.

@jnothman (Member) commented

> FYI, I don't know why it says "This pull request contains merge conflicts that must be resolved."... at least on my local branch I can cleanly merge this into master.

Do a git rebase master, then git push -f

@untom (Contributor, Author) commented Jan 12, 2015

I tried that, but it doesn't seem to work:

untom@comp:scikit-learn$ git rebase master
Current branch refactor_scaling is up to date.
untom@comp:scikit-learn$ git push -f
Username for 'https://github.com': untom
Password for 'https://[email protected]': 
Counting objects: 8, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 2.25 KiB | 0 bytes/s, done.
Total 8 (delta 7), reused 0 (delta 0)
To https://github.com/untom/scikit-learn.git
 + 4c05f6b...6570389 refactor_scaling -> refactor_scaling (forced update)

@jnothman (Member) commented

Your master may not be up-to-date:

In case you need it, first git remote add upstream https://github.com/scikit-learn/scikit-learn

Then

git pull upstream master:master
git checkout refactor_scaling
git rebase master
git push -f

@untom (Contributor, Author) commented Jan 12, 2015

Thanks, this did the trick! Not sure why it didn't work before (could've sworn I'd pulled from upstream before my first try).

self.feature_range = feature_range
self.copy = copy

- def fit(self, X, y=None):
+ def fit(self, X, y=None, copy=None):
Member

Why did you add this fit parameter?

data_range = self._handle_zeros_in_scale(data_range)
self.scalefactor_ = data_range / (feature_range[1] - feature_range[0])
self.center_ = data_min - feature_range[0] * self.scalefactor_

self.scale_ = (feature_range[1] - feature_range[0]) / data_range
Member

If these are here but not used, I guess they should maybe be read-only attributes, to make explicit that they are not used?

Contributor Author

I'm not sure what you are referring to. Could you elaborate?

Member

scale_ and min_ are attributes that are set but not used in transform, right?
So an unsuspecting user might try to change min_ to change the behavior of the estimator, but nothing will happen. Also I'm wondering whether these names should be deprecated then.
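For illustration, a minimal sketch (hypothetical class, assuming the scalefactor_/center_ attributes from the diff above) of how the legacy names could be exposed as read-only properties:

class MinMaxScalerSketch:
    def __init__(self, scalefactor, center):
        self.scalefactor_ = scalefactor  # data_range / (feature_range[1] - feature_range[0])
        self.center_ = center            # data_min - feature_range[0] * scalefactor_

    @property
    def scale_(self):
        # Read-only: derived on the fly, so assigning to it raises AttributeError
        # instead of silently having no effect on transform().
        return 1.0 / self.scalefactor_

    @property
    def min_(self):
        # Read-only equivalent of the legacy min_ attribute.
        return -self.center_ / self.scalefactor_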

Contributor Author

> Also I'm wondering whether these names should be deprecated then.

My original proposal did deprecate these names, but I think the consensus was that it is too late to fundamentally change the existing API.

@amueller (Member) commented
Sorry for the lack of feedback. What is the status here? I lost track a bit.
I want to add a MaxAbsScaler so we can scale sparse data, and I'm wondering if I should wait for this refactoring.

@amueller (Member) commented
So there is a MaxAbsScaler in #2514 ... hum... Are you still working on this?

@amueller mentioned this pull request on May 15, 2015
@untom (Contributor, Author) commented May 15, 2015

OK, quick overview of this: My original PR was #2514, which added two new scalers (RobustScaler and MaxAbsScaler) and refactored the code to avoid code duplication (plus a few other niceties for the scalers). However, according to reviewers that was too big a change to get accepted/merged in one piece, so I split it up into several PRs. This one (#3639) was meant to be the first: it refactors the existing scalers to make it easier to add new ones. However, I never heard back from any reviewers after all of their initial concerns had been addressed.

Thus I decided to submit new scalers first, figuring the code could always be refactored later (when the usefulness of the refactoring became more apparent). In #4125 I tried submitting the RobustScaler without making larger changes to the internal code layout. When that PR also pretty much died the same way as #3639, I tried pinging reviewers and fixed merge conflicts with the master branch as they came up. But after a while I pretty much gave up on the whole thing, and thus never submitted a separate MaxAbsScaler PR.

At this point I am quite frustrated with the whole process, as I have wasted a lot of time on this throughout the 1.5 years since the first submission of the code. I don't know what more I should have done to get any of these PRs merged, but I'd be happy for any feedback. If there's anything I can do to make this easier, I'd love to hear it for future PRs.

With that said, if there is genuine interest in MaxAbsScaler, I can send it as a separate PR this weekend. I still think that MaxAbsScaler and RobustScaler would be nice additions to sklearn and I'm happy to put in the work, but I'd like to avoid wasting any more time on these PRs if they are going to slowly die from lack of reviewer interest, as they have in the past.

@amueller (Member) commented
I am very sorry for the lack of feedback, and I totally understand your frustration. We are really starved for devs/reviewers at the moment. I think both RobustScaler and MaxAbsScaler would be great additions.
I am afraid that if you open another PR, it might not get the attention it deserves and create even more frustration. Let's get #4125 merged first, and then go for the MaxAbsScaler? I just saw that you pinged me a month ago. Sorry for my lack of responsiveness.

@untom (Contributor, Author) commented May 15, 2015

Sounds good! Thanks for taking the time to review #4125 :)

@jnothman (Member) commented Jun 8, 2015

I must say I am also a bit frustrated after going through all that work on #2514 with you. Hence I'm very glad to see #4125 finally merged. Congrats!

@amueller (Member) commented Jun 8, 2015

Now let's get #4828 in next :)

@untom (Contributor, Author) commented Jul 29, 2015

I think most (all?) of the patches are in by now, so I'm going to close this PR.

@untom closed this on Jul 29, 2015
@amueller (Member) commented
Well, you also introduced a base class that got rid of some code duplication...

@amueller (Member) commented
Thank you for your patience and all your work, btw :)

@untom (Contributor, Author) commented Jul 30, 2015

True, but introducing that base class would've required renaming the mean_ and std_ fields of StandardScaler to center_ and scale_, and changing the meaning of the scale_ field of MinMaxScaler, and there was quite some pushback against that. While those restrictions could've been worked around, I felt that it was not worth it.

@amueller (Member) commented
Fair enough :) Thanks again!
