[MRG+1-1] Refactoring and expanding sklearn.preprocessing scaling #2514


Closed
untom wants to merge 7 commits into scikit-learn:master from untom:robust_scaling

Conversation

@untom (Contributor) commented Oct 11, 2013

Hi there!

This PR refactors the data-scaling code from sklearn.preprocessing to remove some duplicated code, and adds some new features. More specifically:

  • Adds RobustScaler and robust_scale functions that use robust estimates of the data's center and scale (median & interquartile range), which should work better in the presence of outliers (a small sketch of the idea follows this list).
  • Adds the possibility to scale by sample instead of by feature, via an axis=1 parameter (the scale function could already do this, now the *Scaler classes can, too!)
  • Adds a minmax_scale function (requested by @ogrisel).
  • Adds MaxAbsScaler, similar in functionality to MinMaxScaler, but it also works on sparse matrices, as proposed in the discussion of #1799 (Global normalization and sparse matrix support for MinMaxScaler).
  • Reuses the code common to StandardScaler, RobustScaler, MinMaxScaler and MaxAbsScaler by putting it in an abstract base class. Essentially, the *Scaler classes are only responsible for estimating the necessary statistics in fit; the rest of the Transformer API (transform/inverse_transform, handling sparseness and the different axis parameters) is implemented in a BaseScaler, as this code is common to all of the scalers.
    This required renaming some parameters and attributes: with_centering and with_scaling are the parameters that control whether centering/scaling is performed, and the center_ and scale_ attributes store the centering/scaling values.
  • The *_scale functions now simply reuse the *Scaler classes internally to avoid code duplication.
  • Adds a lot of new tests for all the new functionality.
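
As a rough illustration only (not the PR's actual implementation), robust scaling boils down to centering by the median and dividing by the interquartile range; a minimal NumPy sketch:

    import numpy as np

    def robust_scale_sketch(X, axis=0):
        """Center by the median and scale by the interquartile range (IQR).

        Illustrative only; the RobustScaler proposed here follows the usual
        fit/transform estimator API instead of a single function.
        """
        X = np.asarray(X, dtype=float)
        center = np.median(X, axis=axis, keepdims=True)
        q75 = np.percentile(X, 75, axis=axis, keepdims=True)
        q25 = np.percentile(X, 25, axis=axis, keepdims=True)
        iqr = q75 - q25
        iqr[iqr == 0] = 1.0  # guard against constant features
        return (X - center) / iqr

    # A single outlier barely affects the median/IQR estimates:
    X = np.array([[1., 2.], [2., 4.], [3., 6.], [100., 8.]])
    print(robust_scale_sketch(X, axis=0))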

Notes and Caveats

  • StandardScaler had parameters with_mean and with_std, which are renamed to with_centering and with_scaling to fall in line with the other scalers. I wasn't sure how to handle deprecating the old parameter names in __init__ -- what's the protocol here?

  • RobustScaler cannot be fitted on sparse matrices:

    • Centering doesn't make sense because it risks destroying sparsity (the same restriction StandardScaler has).
    • Scaling doesn't work because there is no decent code available in scipy to calculate the IQR of sparse matrices.
      As an alternative, we could advise people to use the MinMaxScaler instead to scale features to the same range with::

    scaler = MinMaxScaler()
    scaler.with_centering = False
    scaler.fit_transform(X)

(MinMaxScaler doesn't support the with_centering parameter directly because I wasn't sure whether that would lead to confusion.)

@ogrisel (Member) commented Oct 11, 2013

Please "git grep @deprecated" to find examples of estimators with deprecated init parameters. The old names should be preserved and the default value to None (or some other non-ambiguous default marker instance) and if the value is not None an informative deprecation warning should be raised and the new parameter value should be set to the the non-None old parameter value to preserve backward compat.

@ogrisel (Member) commented Oct 11, 2013

For the sparse case I agree we should raise an informative exception that advises the user to try scaler = MinMaxScaler(with_centering=False) instead of using RobustScaler.
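
For instance, RobustScaler.fit could start with something along these lines (a hypothetical sketch of the suggested error message, not the merged code):

    import scipy.sparse as sp

    class RobustScaler(object):
        def fit(self, X, y=None):
            if sp.issparse(X):
                raise TypeError(
                    "RobustScaler cannot be fitted on sparse input. "
                    "Consider using MinMaxScaler(with_centering=False) "
                    "instead.")
            # ... estimate the median and interquartile range on dense X ...
            return self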

@amueller (Member) commented:

It would be awesome if we could also somehow include this: #1799

Quoted documentation excerpt under review:

    of the data does sometimes not work very well. In these cases, you can use
    :func:`robust_scale` and :class:`RobustScaler` as drop-in replacements
    instead, which use more robust estimates for the center and range of your
    data.

Review comment (Member):

There is one more character at the beginning of the two previous lines.

Review comment (Member):

This comment still needs to be addressed.

Reply (@untom, PR author):

d'oh =)

@untom (Contributor, author) commented Oct 14, 2013

Thanks for all the feedback. I will try to address this in a new commit in the upcoming days.

@amueller: If I understood it correctly, #1799 is actually something my implementation can already handle; it's just a matter of making the with_centering=False parameter explicit in MinMaxScaler. So including this feature should be no problem, and it's actually something I was already thinking about anyhow.

(Interestingly, #1799 also includes a per_feature parameter which serves the same function as the axis parameter in my submission.)

@ogrisel (Member) commented Oct 21, 2013

@untom any news on this?

@untom (Contributor, author) commented Oct 21, 2013

Integrating #1799 wasn't as trivial as I previously thought, and I didn't have enough time to do it last week. I should have some time to get this done within the next few days.

@ogrisel (Member) commented Oct 21, 2013

Thanks, no problem. I just wanted to make sure that the PR was not dying :)

@amueller (Member) commented:

@untom what are the problems you found in integrating #1799? The problems I discussed with @temporaer on that one were mostly about the semantics. I think my last opinion was to disallow sparse input for min-max scaling if there was any offset. (maybe bail?)

@untom (Contributor, author) commented Oct 27, 2013

I've added a MaxAbsScaler as discussed in #1799. If this is to everyone's liking, I can start writing documentation for it as well. Note that my implementations do not include a "global scaling" mode, simply because I'm not sure there's a use case for it, but it can easily be added to all the scalers if required.
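
To illustrate why MaxAbsScaler can handle sparse input (an assumed sketch of the idea, not this PR's code): it only divides by the per-feature maximum absolute value and never shifts the data, so zeros stay zero:

    import numpy as np
    import scipy.sparse as sp

    def max_abs_scale_sketch(X):
        """Scale each column by its maximum absolute value; no centering,
        so sparsity is preserved."""
        if sp.issparse(X):
            X = sp.csc_matrix(X, dtype=float, copy=True)
            max_abs = np.abs(X).max(axis=0).toarray().ravel()
            max_abs[max_abs == 0] = 1.0
            # CSC stores nonzeros column by column, so repeat each column's
            # scale once per nonzero in that column.
            X.data /= np.repeat(max_abs, np.diff(X.indptr))
            return X
        X = np.asarray(X, dtype=float)
        max_abs = np.abs(X).max(axis=0)
        max_abs[max_abs == 0] = 1.0
        return X / max_abs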

@ogrisel (Member) commented Oct 28, 2013

Thanks @untom, please feel free to go forward with the doc, it's interesting. Also, please have a look at the Travis failures.

@untom (Contributor, author) commented Oct 29, 2013

Thanks for looking over the code.... Your proposed test did indeed unearth a bug in my implementation. I'll write some documentation in the upcoming days, then :)

@ogrisel (Member) commented Oct 30, 2013

Thanks @untom. Could you also please have a look at the broken tests reported by travis?

https://travis-ci.org/scikit-learn/scikit-learn/builds/13211844

@coveralls

Coverage Status

Coverage remained the same when pulling 99f5391 on untom:robust_scaling into d82cf06 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 5ac05a9 on untom:robust_scaling into d82cf06 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 8dfd0bf on untom:robust_scaling into d82cf06 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 0a47b76 on untom:robust_scaling into d82cf06 on scikit-learn:master.

@untom (Contributor, author) commented Nov 7, 2013

Sorry for all the commits lately, but I stumbled over a late bug today. The good news is that, thanks to the newly added tests, test coverage for all the scaling code is now at 100% (except for 3 lines related to printing deprecation warnings).

@coveralls

Coverage Status

Coverage remained the same when pulling 098a4ee on untom:robust_scaling into d82cf06 on scikit-learn:master.

@untom (Contributor, author) commented Nov 21, 2013

@ogrisel Any chance of this getting merged within the 0.15 window?

@coveralls

Coverage Status

Coverage remained the same when pulling 4df8cbb on untom:robust_scaling into d82cf06 on scikit-learn:master.

@ghost assigned ogrisel Nov 21, 2013

@ogrisel (Member) commented Nov 21, 2013

@untom this looks good. I'll try to have a deeper look soon; it should probably make it into 0.15.

@ogrisel (Member) commented Nov 21, 2013

About the center_ name, I think renaming to shift_ would make more sense. I wonder what other people think.

Quoted docstring excerpt under review:

    copy : boolean, optional, default is True
        Set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array).

Review comment (Member):

I would not call that "row normalization" but rather just "scaling".

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling b918e66 on untom:robust_scaling into 0a7bef6 on scikit-learn:master.

@untom (Contributor, author) commented Sep 1, 2014

This PR has been sitting idle for quite a while now. I still think it's a worthwhile addition to sklearn. If there's anything I can do to speed up the acceptance/review process, let me know.

@arjoly (Member) commented Sep 2, 2014

@untom Thanks for your work and patience.

Something that could help to attract reviewers would be to break this pull request into small and independent ones.

@untom (Contributor, author) commented Sep 2, 2014

Thanks for the advice, I will do that :)

@jnothman (Member) commented Sep 2, 2014

@arjoly one problem with that is that this PR puts a lot of effort into consistency and invariance tests.


@untom (Contributor, author) commented Sep 2, 2014

@jnothman: My current plan for splitting this is as follows:

  1. sparsefuncs improvements (#3622: [MRG+1] Add 'axis' argument to sparsefuncs.mean_variance_axis)
  2. BaseScaler abstraction (this PR will also include all the invariance tests)
  3. RobustScaler
  4. MaxAbsScaler

PR number 2 will include all the consistency/invariance tests that are present here, but will only run them on StandardScaler/MinMaxScaler. I will make sure that no tests are lost. (This might also give me an incentive to review the test cases to make sure they make sense and are thorough and non-redundant.)

PR 3/4 will then add the other scalers to the list of tested scalers (and of course include any tests specific to Robust- /MaxAbsScaler).

This way the changes are more modular, and e.g. RobustScaler can be included even if MaxAbsScaler is deemed not worthy of inclusion.

So all in all I think arjoly's proposal makes sense, and it should be doable without too much effort on either my side or the reviewers' side.

@jnothman (Member) commented Sep 2, 2014

Okay. @arjoly, @untom, what do you think of the alternative construction of invariance tests in jnothman@0e4d04c? It uses test inheritance to make clear the common features and differences between the different scalers. (Just to be annoying, that reconstruction isn't quite complete, but it does do some cleaning up of the tests' content without documenting exactly what it fixes.)
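
As a rough illustration of the inheritance idea (an assumed sketch, not the actual contents of jnothman@0e4d04c): the invariance checks live on a base test class and each scaler only overrides a factory method:

    import numpy as np
    from numpy.testing import assert_array_almost_equal

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    class ScalerInvarianceTests(object):
        """Checks shared by all scalers; subclasses provide make_scaler()."""

        def make_scaler(self):
            raise NotImplementedError

        def test_inverse_transform_roundtrip(self):
            rng = np.random.RandomState(0)
            X = rng.randn(20, 5)
            scaler = self.make_scaler()
            X_scaled = scaler.fit_transform(X)
            assert_array_almost_equal(scaler.inverse_transform(X_scaled), X)

    class TestStandardScaler(ScalerInvarianceTests):
        def make_scaler(self):
            return StandardScaler()

    class TestMinMaxScaler(ScalerInvarianceTests):
        def make_scaler(self):
            return MinMaxScaler()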

@untom (Contributor, author) commented Sep 2, 2014

I like it, looks elegant and a bit "cleaner" than iterating over lists.

@jnothman (Member) commented Sep 2, 2014

The issue then is whether it's acceptable in a project where unittest classes are avoided...


@untom (Contributor, author) commented Sep 4, 2014

Are they being actively avoided, or was there just no use case for them until now?

@larsmans (Member) commented Sep 4, 2014

The SGD code uses them, but they're a bit of a pain with nosetests.

@untom (Contributor, author) commented Sep 4, 2014

How so?

@larsmans (Member) commented Sep 4, 2014

Because running a single test takes a lot of typing if it's in a class :)

And there's seldom a need for these classes. We usually just loop over things or call common functions.

@amueller (Member) commented:

Closing as merged in #4828 and #4125.
