[MRG+1-1] Refactoring and expanding sklearn.preprocessing scaling #2514


Closed
untom wants to merge 7 commits into scikit-learn:master from untom:robust_scaling

Conversation

@untom (Contributor) commented Oct 11, 2013

Hi there!

This PR refactors the data-scaling code from sklearn.preprocessing to remove some duplicated code, and adds some new features. More specifically:

  • Adds RobustScaler and robust_scale functions that use robust estimates of the data's center and scale (median & interquartile range), which should work better in the presence of outliers (a small sketch of the idea follows this list).
  • Adds the possibility to scale by sample instead of by feature, via an axis=1 parameter (the scale function could already do this, now the *Scaler classes can, too!)
  • Adds a minmax_scale function (requested by @ogrisel).
  • Adds MaxAbsScaler, similar in functionality to MinMaxScaler, but it also works on sparse matrices, as proposed in the discussion of #1799 (Global normalization and sparse matrix support for MinMaxScaler).
  • Reuses the code common to StandardScaler, RobustScaler, MinMaxScaler and MaxAbsScaler by putting it in an abstract base class. Essentially, the *Scaler classes are only responsible for estimating the necessary statistics in fit; the rest of the Transformer API (transform/inverse_transform, handling sparseness and the different axis parameters) is implemented in a BaseScaler, as this code is common to all of the scalers.
    This required renaming some parameters and attributes: with_centering and with_scaling are the parameters that control whether centering/scaling is performed, and the center_ and scale_ attributes store the centering/scaling values.
  • The *_scale functions now simply reuse the *Scaler classes internally to avoid code duplication.
  • Adds a lot of new tests for all the new functionality.
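
As a rough illustration only (not the PR's actual implementation), robust scaling boils down to centering by the median and dividing by the interquartile range; a minimal NumPy sketch:

    import numpy as np

    def robust_scale_sketch(X, axis=0):
        """Center by the median and scale by the interquartile range (IQR).

        Illustrative only; the RobustScaler proposed here follows the usual
        fit/transform estimator API instead of a single function.
        """
        X = np.asarray(X, dtype=float)
        center = np.median(X, axis=axis, keepdims=True)
        q75 = np.percentile(X, 75, axis=axis, keepdims=True)
        q25 = np.percentile(X, 25, axis=axis, keepdims=True)
        iqr = q75 - q25
        iqr[iqr == 0] = 1.0  # guard against constant features
        return (X - center) / iqr

    # A single outlier barely affects the median/IQR estimates:
    X = np.array([[1., 2.], [2., 4.], [3., 6.], [100., 8.]])
    print(robust_scale_sketch(X, axis=0))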

Notes and Caveats

  • StandardScaler had parameters with_mean and with_std, which are renamed to with_centering and with_scaling to fall in line with the other scalers. I wasn't sure how to handle deprecating the old parameter names in __init__ -- what's the protocol here?

  • RobustScaler cannot be fitted on sparse matrices:

    • Centering doesn't make sense because it risks destroying sparsity (the same restriction StandardScaler has).
    • Scaling doesn't work because there is no decent code available in scipy to calculate the IQR of sparse matrices.
      As an alternative, we could advise people to use the MinMaxScaler instead to scale features to the same range with::

    scaler = MinMaxScaler()
    scaler.with_centering = False
    scaler.fit_transform(X)

(MinMaxScaler doesn't support the with_centering parameter directly because I wasn't sure whether that would lead to confusion.)

@ogrisel (Member) commented Oct 11, 2013

Please "git grep @deprecated" to find examples of estimators with deprecated init parameters. The old names should be preserved and the default value to None (or some other non-ambiguous default marker instance) and if the value is not None an informative deprecation warning should be raised and the new parameter value should be set to the the non-None old parameter value to preserve backward compat.

@ogrisel (Member) commented Oct 11, 2013

For the sparse case I agree we should raise an informative exception that advises the user to try scaler = MinMaxScaler(with_centering=False) instead of using RobustScaler.
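
For instance, RobustScaler.fit could start with something along these lines (a hypothetical sketch of the suggested error message, not the merged code):

    import scipy.sparse as sp

    class RobustScaler(object):
        def fit(self, X, y=None):
            if sp.issparse(X):
                raise TypeError(
                    "RobustScaler cannot be fitted on sparse input. "
                    "Consider using MinMaxScaler(with_centering=False) "
                    "instead.")
            # ... estimate the median and interquartile range on dense X ...
            return self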

@amueller (Member) commented:

It would be awesome if we could also somehow include this: #1799

Quoted documentation excerpt under review:

    of the data does sometimes not work very well. In these cases, you can use
    :func:`robust_scale` and :class:`RobustScaler` as drop-in replacements
    instead, which use more robust estimates for the center and range of your
    data.

Review comment (Member):

There is one more character at the beginning of the two previous lines.

Review comment (Member):

This comment still needs to be addressed.

Reply (@untom, PR author):

d'oh =)

@untom (Contributor, author) commented Oct 14, 2013

Thanks for all the feedback. I will try to address this in a new commit in the upcoming days.

@amueller: If I understood it correctly, #1799 is actually something my implementation can already handle; it's just a matter of making the with_centering=False parameter explicit in MinMaxScaler. So including this feature should be no problem, and it's actually something I was already thinking about anyhow.

(Interestingly, #1799 also includes a per_feature parameter which serves the same function as the axis parameter in my submission.)

@ogrisel (Member) commented Oct 21, 2013

@untom any news on this?

@untom (Contributor, author) commented Oct 21, 2013

Integrating #1799 wasn't as trivial as I previously thought, and I didn't have enough time to do it last week. I should have some time to get this done within the next few days.

@ogrisel (Member) commented Oct 21, 2013

Thanks, no problem. I just wanted to make sure that the PR was not dying :)

@amueller (Member) commented:

@untom what are the problems you found in integrating #1799? The problems I discussed with @temporaer on that one were mostly about the semantics. I think my last opinion was to disallow sparse input for min-max scaling if there was any offset. (maybe bail?)

@untom (Contributor, author) commented Oct 27, 2013

I've added a MaxAbsScaler as discussed in #1799. If this is to everyone's liking, I can start writing documentation for it as well. Note that my implementations do not include a "global scaling" mode, simply because I'm not sure there's a use case for it, but it can easily be added to all the scalers if required.
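
To illustrate why MaxAbsScaler can handle sparse input (an assumed sketch of the idea, not this PR's code): it only divides by the per-feature maximum absolute value and never shifts the data, so zeros stay zero:

    import numpy as np
    import scipy.sparse as sp

    def max_abs_scale_sketch(X):
        """Scale each column by its maximum absolute value; no centering,
        so sparsity is preserved."""
        if sp.issparse(X):
            X = sp.csc_matrix(X, dtype=float, copy=True)
            max_abs = np.abs(X).max(axis=0).toarray().ravel()
            max_abs[max_abs == 0] = 1.0
            # CSC stores nonzeros column by column, so repeat each column's
            # scale once per nonzero in that column.
            X.data /= np.repeat(max_abs, np.diff(X.indptr))
            return X
        X = np.asarray(X, dtype=float)
        max_abs = np.abs(X).max(axis=0)
        max_abs[max_abs == 0] = 1.0
        return X / max_abs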

@ogrisel (Member) commented Oct 28, 2013

Thanks @untom, please feel free to go forward with the doc, it's interesting. Also, please have a look at the Travis failures.

@untom (Contributor, author) commented Oct 29, 2013

Thanks for looking over the code.... Your proposed test did indeed unearth a bug in my implementation. I'll write some documentation in the upcoming days, then :)

@ogrisel (Member) commented Oct 30, 2013

Thanks @untom. Could you also please have a look at the broken tests reported by travis?

https://travis-ci.org/scikit-learn/scikit-learn/builds/13211844

@coveralls

Coverage Status

Coverage remained the same when pulling 99f5391 on untom:robust_scaling into d82cf06 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 5ac05a9 on untom:robust_scaling into d82cf06 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 8dfd0bf on untom:robust_scaling into d82cf06 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 0a47b76 on untom:robust_scaling into d82cf06 on scikit-learn:master.

@untom (Contributor, author) commented Nov 7, 2013

Sorry for all the commits lately, but I stumbled over a late bug today. The good news is that, thanks to the newly added tests, test coverage for all the scaling code is now at 100% (except for 3 lines related to printing deprecation warnings).

@coveralls

Coverage Status

Coverage remained the same when pulling 098a4ee on untom:robust_scaling into d82cf06 on scikit-learn:master.

@untom (Contributor, author) commented Nov 21, 2013

@ogrisel Any chance of this getting merged within the 0.15 window?

@coveralls

Coverage Status

Coverage remained the same when pulling 4df8cbb on untom:robust_scaling into d82cf06 on scikit-learn:master.

@ghost assigned ogrisel Nov 21, 2013

@ogrisel (Member) commented Nov 21, 2013

@untom this looks good. I'll try to have a deeper look soon; it should probably make it into 0.15.

@ogrisel (Member) commented Nov 21, 2013

About the center_ name, I think renaming to shift_ would make more sense. I wonder what other people think.

Quoted docstring excerpt under review:

    copy : boolean, optional, default is True
        Set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array).

Review comment (Member):

I would not call that "row normalization" but rather just "scaling".

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling b918e66 on untom:robust_scaling into 0a7bef6 on scikit-learn:master.

@untom (Contributor, author) commented Sep 1, 2014

This PR has been sitting idle for quite a while now. I still think it's a worthwhile addition to sklearn. If there's anything I can do to speed up the acceptance/review process, let me know.

@arjoly (Member) commented Sep 2, 2014

@untom Thanks for your work and patience.

Something that could help to attract reviewers would be to break this pull request into small and independent ones.

@untom (Contributor, author) commented Sep 2, 2014

Thanks for the advice, I will do that :)

@jnothman (Member) commented Sep 2, 2014

@arjoly one problem with that is that this PR puts a lot of effort into consistency and invariance tests.


@untom (Contributor, author) commented Sep 2, 2014

@jnothman: My current plan for splitting this is as follows:

  1. sparsefuncs improvements (#3622: [MRG+1] Add 'axis' argument to sparsefuncs.mean_variance_axis)
  2. BaseScaler abstraction (this PR will also include all the invariance tests)
  3. RobustScaler
  4. MaxAbsScaler

PR number 2 will include all the consistency/invariance tests that are present here, but will only run them on StandardScaler/MinMaxScaler. I will make sure that no tests are lost. (This might also give me an incentive to review the test cases to make sure they make sense and are thorough and non-redundant.)

PR 3/4 will then add the other scalers to the list of tested scalers (and of course include any tests specific to Robust- /MaxAbsScaler).

This way the changes are more modular, and e.g. RobustScaler can be included even if MaxAbsScaler is deemed not worthy of inclusion.

So all in all I think arjoly's proposal makes sense, and it should be doable without too much effort on either my side or the reviewers' side.

@jnothman (Member) commented Sep 2, 2014

Okay. @arjoly, @untom, what do you think of the alternative construction of invariance tests in jnothman@0e4d04c? It uses test inheritance to make clear the common features and differences between the different scalers. (Just to be annoying, that reconstruction isn't quite complete, but it does do some cleaning up of the tests' content without documenting exactly what it fixes.)
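
As a rough illustration of the inheritance idea (an assumed sketch, not the actual contents of jnothman@0e4d04c): the invariance checks live on a base test class and each scaler only overrides a factory method:

    import numpy as np
    from numpy.testing import assert_array_almost_equal

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    class ScalerInvarianceTests(object):
        """Checks shared by all scalers; subclasses provide make_scaler()."""

        def make_scaler(self):
            raise NotImplementedError

        def test_inverse_transform_roundtrip(self):
            rng = np.random.RandomState(0)
            X = rng.randn(20, 5)
            scaler = self.make_scaler()
            X_scaled = scaler.fit_transform(X)
            assert_array_almost_equal(scaler.inverse_transform(X_scaled), X)

    class TestStandardScaler(ScalerInvarianceTests):
        def make_scaler(self):
            return StandardScaler()

    class TestMinMaxScaler(ScalerInvarianceTests):
        def make_scaler(self):
            return MinMaxScaler()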

@untom (Contributor, author) commented Sep 2, 2014

I like it, looks elegant and a bit "cleaner" than iterating over lists.

@jnothman (Member) commented Sep 2, 2014

The issue then is whether it's acceptable in a project where unittest classes are avoided...


@untom (Contributor, author) commented Sep 4, 2014

Are they being actively avoided, or was there just no use case for them until now?

@larsmans (Member) commented Sep 4, 2014

The SGD code uses them, but they're a bit of a pain with nosetests.

@untom (Contributor, author) commented Sep 4, 2014

How so?

@larsmans (Member) commented Sep 4, 2014

Because running a single test takes a lot of typing if it's in a class :)

And there's seldom a need for these classes. We usually just loop over things or call common functions.

@amueller (Member) commented:

Closing as merged in #4828 and #4125.
