[WIP] Refactor scaler code #3639


Closed · wants to merge 2 commits

Conversation

@untom (Contributor) commented Sep 5, 2014

This PR refactors the scalers in preprocessing to use a common base class (BaseScaler). This is done to share common functionality and avoid code duplication. This PR is part of a series of PRs that split up #2514; other PRs in this series will introduce new scalers that make use of BaseScaler (e.g. a scaler that uses robust statistics).

There is one issue with this PR that needs to be discussed:

All scalers (both the existing ones and the ones I'm going to introduce) center the data and then scale it to fall into some range. So in essence, all scalers do

(x - some_centering_statistic) / some_scaling_statistic

for each feature column x. The existing scalers already have attributes that expose their centering/scaling statistics, but those are not uniformly named. I propose that the StandardScaler.mean_/StandardScaler.std_/MinMaxScaler.scale_ and MinMaxScaler.min_ attributes be deprecated and replaced with *.center_/*.scale_, and that the with_mean/with_std constructor arguments be renamed to with_centering/with_scaling.
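For concreteness, a minimal sketch of the shared pattern under the proposed naming (illustrative only; the actual BaseScaler in this PR may differ):

import numpy as np

class BaseScaler:
    """Sketch: subclasses only compute center_ and scale_ in fit()."""

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # The shared per-column operation that every scaler reduces to:
        return (X - self.center_) / self.scale_

    def inverse_transform(self, X):
        X = np.asarray(X, dtype=float)
        return X * self.scale_ + self.center_

A StandardScaler subclass would then set center_ to the per-column means and scale_ to the per-column standard deviations in fit(); MinMaxScaler would set the analogous min/range statistics.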

@larsmans already said back in #2514 that he is against renaming the mean_/std_ attributes, since the scaler API should be stable. An additional problem is that my proposal would change the meaning of MinMaxScaler.scale_ (the previous implementation uses slightly different math to arrive at the same scaling result). This will of course break user code that uses MinMaxScaler.scale_ and relies on its current implementation. (NOTE: I had overlooked this when proposing #2514 and only noticed it when preparing this PR.)

There are several ways to deal with this:

  • Don't expose the new scale_/center_ attributes in StandardScaler/MinMaxScaler, and just keep the old mean_/std_/scale_ attributes.
  • Deprecate mean_/std_ and remove them after a few releases, and print a warning whenever a user accesses scale_ on MinMaxScaler saying that its value has changed.
  • ???

I would really appreciate any input on what the right thing to do here is!

NOTE: As a side effect of this PR, StandardScaler and MinMaxScaler gain the ability to scale along rows/samples instead of just columns/features (i.e., it is possible to use axis=1). However, no tests for this new functionality are included in this PR, because I intend to send another PR tomorrow that refactors the tests in this module (and adds the tests for axis=1). I just thought that splitting up the tests makes it easier to review the changes (but I can add that other PR to this one if you wish).

@untom changed the title from "[WIP] ENH Refactor scaler code" to "[WIP] Refactor scaler code" on Sep 5, 2014
@coveralls commented

Coverage Status: Changes Unknown when pulling 468e0a3 on untom:refactor_scaling into scikit-learn:master.

@untom (Contributor, Author) commented Oct 6, 2014

Pinging @larsmans @arjoly @ogrisel

I'd like some input on how to proceed here: is it okay to change the meaning of the scale_ attribute of MinMaxScaler in future releases?

@ogrisel (Member) commented Nov 25, 2014

Sorry for the slow reply. I tend to think like @larsmans: it's too late to change the mean_ and std_ attributes and the names of the StandardScaler parameters. I am not even sure it's useful to introduce a base class. If there is too much common code we can factorize it in private helper functions when necessary.

> NOTE: As a side effect of this PR, StandardScaler and MinMaxScaler gain the ability to scale along rows/samples instead of just columns/features (i.e., it is possible to use axis=1). However, no tests for this new functionality are included in this PR, because I intend to send another PR tomorrow that refactors the tests in this module (and adds the tests for axis=1). I just thought that splitting up the tests makes it easier to review the changes (but I can add that other PR to this one if you wish).

To me this is a YAGNI. We should not introduce a new feature and extend the public API when there is no obvious use case.

@ogrisel (Member) commented Nov 25, 2014

On the other hand, here are some interesting features/improvements for the preprocessing module:

  • MaxAbsScaler (it's easy to implement, it is the default strategy used by Vowpal Wabbit, it works similarly for dense and sparse data, and can trivially support online fitting with a partial_fit method as well)
  • partial_fit for StandardScaler with online estimation of the mean and the variance of each feature (see the sketch after this list). IncrementalPCA has some code to do that internally; it should be factorized.
  • RobustScaler
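To illustrate the second item, a minimal sketch (hypothetical helper, not existing scikit-learn code) of how batch-wise mean/variance estimates could be combined:

import numpy as np

def update_mean_var(batch, mean, var, count):
    """Combine running per-feature mean/variance with a new batch of samples."""
    n_new = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)

    n_total = count + n_new
    delta = batch_mean - mean
    # Count-weighted update of the mean.
    new_mean = mean + delta * n_new / n_total
    # Combine the sums of squared deviations, plus a cross-term correction.
    ssd = var * count + batch_var * n_new + delta ** 2 * count * n_new / n_total
    return new_mean, ssd / n_total, n_total

A partial_fit on StandardScaler could call such a helper once per batch and keep mean_ and std_ (the square root of the running variance) up to date.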

@ogrisel (Member) commented Nov 25, 2014

Maybe @jnothman would like to comment on this thread as well.

@ogrisel (Member) commented Nov 25, 2014

And @amueller as well.

@amueller (Member) commented
I'm for adding more scalers. I would have to have a closer look to see how much duplication there is and what the best way to refactor would be. The main reason to unify the attributes is to improve code sharing, right?

@ogrisel (Member) commented Nov 27, 2014

> The main reason to unify the attributes is to improve code sharing, right?

Yes. Have a look at the original PR: #2514. For MaxAbsScaler at least there is not so much code to share with other scalers.

@untom (Contributor, Author) commented Dec 23, 2014

Sorry for taking such a long time to reply.

> > NOTE: As a side effect of this PR, StandardScaler and MinMaxScaler gain the ability to scale along rows/samples instead of just columns/features (i.e., it is possible to use axis=1). However, no tests for this new functionality are included in this PR, because I intend to send another PR tomorrow that refactors the tests in this module (and adds the tests for axis=1). I just thought that splitting up the tests makes it easier to review the changes (but I can add that other PR to this one if you wish).
>
> To me this is a YAGNI. We should not introduce a new feature and extend the public API when there is no obvious use case.

The reason to implement this was that sklearn.preprocessing.scale already offered this functionality (as does sklearn.preprocessing.normalize, BTW). So likewise, I'd assume minmax_scale and other new scaling functions should have it.

The advantage of also adding the axis parameter to the *Scaler classes is that it allows the xxx_scale functions to have a trivial implementation that simply calls the corresponding XxxScaler internally (without losing efficiency by transposing), and it keeps a certain consistency between the function and the corresponding class.

But I can't think of any real use case besides that, so maybe it would be more appropriate to have this as an undocumented feature? (Or remove it completely, and have the xxx_scale functions transpose their arguments if needed.)
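For the latter option, a minimal sketch of what such a function wrapper could look like (the minmax_scale name and signature here are illustrative, not settled API):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def minmax_scale(X, feature_range=(0, 1), axis=0, copy=True):
    """Scale features (axis=0) or samples (axis=1) by transposing around a column-wise scaler."""
    X = np.asarray(X)
    if axis == 1:
        X = X.T
    X_scaled = MinMaxScaler(feature_range=feature_range, copy=copy).fit_transform(X)
    return X_scaled.T if axis == 1 else X_scaled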

@ogrisel (Member) commented Dec 24, 2014

> Or remove it completely, and have the xxx_scale functions transpose their arguments if needed.

+1

@amueller (Member) commented Jan 9, 2015

So the idea of this PR is to do two things, right?

  1. Make the attributes of the scalers consistent.
  2. Refactor their code.

Shouldn't that result in less code, not more?

@untom force-pushed the refactor_scaling branch 5 times, most recently from 3937060 to a89c91c on January 11, 2015 at 22:49
@untom (Contributor, Author) commented Jan 11, 2015

@amueller: Yes, although this PR is mainly meant to pave the way for adding more scalers (concretely, MaxAbsScaler and RobustScaler). While the refactorings IMHO already make the code simpler, they are only going to really pay off once those are added to the codebase.

As far as "making the attributes consistent" goes, the consensus seems to be that it's too late to change this, so I undid those changes.

FYI, I don't know why it says "This pull request contains merge conflicts that must be resolved."... at least on my local branch I can cleanly merge this into master.

@untom (Contributor, Author) commented Jan 11, 2015

> Or remove it completely, and have the xxx_scale functions transpose their arguments if needed.
>
> +1

@ogrisel: I've changed the code to do it this way.

@jnothman (Member) commented

> FYI, I don't know why it says "This pull request contains merge conflicts that must be resolved."... at least on my local branch I can cleanly merge this into master.

Do a git rebase master, then git push -f

@untom (Contributor, Author) commented Jan 12, 2015

I tried that, but it doesn't seem to work:

untom@comp:scikit-learn$ git rebase master
Current branch refactor_scaling is up to date.
untom@comp:scikit-learn$ git push -f
Username for 'https://github.com': untom
Password for 'https://[email protected]': 
Counting objects: 8, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 2.25 KiB | 0 bytes/s, done.
Total 8 (delta 7), reused 0 (delta 0)
To https://github.com/untom/scikit-learn.git
 + 4c05f6b...6570389 refactor_scaling -> refactor_scaling (forced update)

@jnothman (Member) commented

Your master may not be up-to-date:

In case you need it, first git remote add upstream https://github.com/scikit-learn/scikit-learn

Then

git pull upstream master:master
git checkout refactor_scaling
git rebase master
git push -f

@untom (Contributor, Author) commented Jan 12, 2015

Thanks, this did the trick! Not sure why it didn't work before (could've sworn I'd pulled from upstream before my first try).

self.feature_range = feature_range
self.copy = copy

- def fit(self, X, y=None):
+ def fit(self, X, y=None, copy=None):
Member

Why did you add this fit parameter?

data_range = self._handle_zeros_in_scale(data_range)
self.scalefactor_ = data_range / (feature_range[1] - feature_range[0])
self.center_ = data_min - feature_range[0] * self.scalefactor_

self.scale_ = (feature_range[1] - feature_range[0]) / data_range
Member

If these are here but not used, I guess they should maybe be read-only attributes, to make explicit that they are not used?

Contributor Author

I'm not sure what you are referring to. Could you elaborate?

Member

scale_ and min_ are attributes that are set but not used in transform, right?
So an unsuspecting user might try to change min_ to change the behavior of the estimator, but nothing will happen. Also I'm wondering whether these names should be deprecated then.
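For illustration, a minimal sketch (hypothetical class, assuming the scalefactor_/center_ attributes from the diff above) of how the legacy names could be exposed as read-only properties:

class MinMaxScalerSketch:
    def __init__(self, scalefactor, center):
        self.scalefactor_ = scalefactor  # data_range / (feature_range[1] - feature_range[0])
        self.center_ = center            # data_min - feature_range[0] * scalefactor_

    @property
    def scale_(self):
        # Read-only: derived on the fly, so assigning to it raises AttributeError
        # instead of silently having no effect on transform().
        return 1.0 / self.scalefactor_

    @property
    def min_(self):
        # Read-only equivalent of the legacy min_ attribute.
        return -self.center_ / self.scalefactor_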

Contributor Author

> Also I'm wondering whether these names should be deprecated then.

My original proposal did deprecate these names, but I think the consensus was that it is too late to fundamentally change the existing API.

@amueller (Member) commented
Sorry for the lack of feedback. What is the status here? I lost track a bit.
I want to add a MaxAbsScaler so we can scale sparse data, and I'm wondering if I should wait for this refactoring.

@amueller (Member) commented
So there is a MaxAbsScaler in #2514 ... hum... Are you still working on this?

@amueller mentioned this pull request on May 15, 2015
@untom (Contributor, Author) commented May 15, 2015

OK, quick overview of this: My original PR was #2514, which added two new scalers (RobustScaler and MaxAbsScaler) and refactored the code to avoid code duplication (plus a few other niceties for the scalers). However, according to reviewers that was too big a change to get accepted/merged in one piece, so I split it up into several PRs. This one (#3639) was meant to be the first: it refactors the existing scalers to make it easier to add new ones. However, I never heard back from any reviewers after all of their initial concerns had been addressed.

Thus I decided to submit new scalers first, figuring the code could always be refactored later (when the usefulness of the refactoring became more apparent). In #4125 I tried submitting the RobustScaler without making larger changes to the internal code layout. When that PR also pretty much died the same way as #3639, I tried pinging reviewers and fixed merge conflicts with the master branch as they came up. But after a while I pretty much gave up on the whole thing, and thus never submitted a separate MaxAbsScaler PR.

At this point I am quite frustrated with the whole process, as I have wasted a lot of time on this throughout the 1.5 years since the first submission of the code. I don't know what more I should have done to get any of these PRs merged, but I'd be happy for any feedback. If there's anything I can do to make this easier, I'd love to hear it for future PRs.

With that said, if there is genuine interest in MaxAbsScaler, I can send it as a separate PR this weekend. I still think that MaxAbsScaler and RobustScaler would be nice additions to sklearn and I'm happy to put in the work, but I'd like to avoid wasting any more time on these PRs if they are going to slowly die from lack of reviewer interest, as they have in the past.

@amueller (Member) commented
I am very sorry for the lack of feedback, and I totally understand your frustration. We are really starved for devs/reviewers at the moment. I think both RobustScaler and MaxAbsScaler would be great additions.
I am afraid that if you open another PR, it might not get the attention it deserves and create even more frustration. Let's get #4125 merged first, and then go for the MaxAbsScaler? I just saw that you pinged me a month ago. Sorry for my lack of responsiveness.

@untom (Contributor, Author) commented May 15, 2015

Sounds good! Thanks for taking the time to review #4125 :)

@jnothman (Member) commented Jun 8, 2015

I must say I am also a bit frustrated after going through all that work on #2514 with you. Hence I'm very glad to see #4125 finally merged. Congrats!

@amueller (Member) commented Jun 8, 2015

Now let's get #4828 in next :)

@untom (Contributor, Author) commented Jul 29, 2015

I think most (all?) of the patches are in by now, so I'm going to close this PR.

@untom closed this on Jul 29, 2015
@amueller (Member) commented
Well, you also introduced a base class that got rid of some code duplication...

@amueller (Member) commented
Thank you for your patience and all your work, btw :)

@untom (Contributor, Author) commented Jul 30, 2015

True, but introducing that base class would've required renaming the mean_ and std_ fields of StandardScaler to center_ and scale_, and changing the meaning of the scale_ field of MinMaxScaler, and there was quite some pushback against that. While those restrictions could've been worked around, I felt that it was not worth it.

@amueller (Member) commented
Fair enough :) Thanks again!
