[MRG+1-1] Refactoring and expanding sklearn.preprocessing scaling #2514
Conversation
Please "git grep @deprecated" to find examples of estimators with deprecated init parameters. The old names should be preserved and the default value to |
For the sparse case I agree we should raise an informative exception that advises the user to try `scaler = …`
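The suggested replacement is cut off above; a minimal sketch of such a check, assuming `MaxAbsScaler` (introduced later in this thread) is the intended suggestion:

```python
import scipy.sparse as sp

def _check_dense_input(X):
    # Fail fast with an actionable message instead of a cryptic error
    # deep inside the median computation. The wording is illustrative.
    if sp.issparse(X):
        raise TypeError("RobustScaler does not support sparse input; "
                        "consider `scaler = MaxAbsScaler()` instead.")
    return X
```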
It would be awesome if we could also somehow include this: #1799
of the data sometimes does not work very well. In these cases, you can use
:func:`robust_scale` and :class:`RobustScaler` as drop-in replacements
instead, which use more robust estimates for the center and range of your
data.
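As a quick illustration of the "drop-in replacement" claim, using the API as it eventually landed in scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# A single extreme outlier inflates the mean and standard deviation,
# squashing the inliers; the median/IQR used by RobustScaler are
# barely affected by it.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
print(StandardScaler().fit_transform(X).ravel())
print(RobustScaler().fit_transform(X).ravel())
```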
There is one more character at the beginning of the two previous lines.
This comment still needs to be addressed.
d'oh =)
Thanks for all the feedback. I will try to address this in a new commit in the upcoming days. @amueller: If I understood it correctly, #1799 is actually something my implementation can already handle; it's just a matter of making the parameter … (Interestingly, #1799 also includes a …)
@untom any news on this?
Integrating #1799 wasn't as trivial as I previously thought, and I didn't have enough time to do it last week. I should have some time to get this done within the next few days.
Thanks, no problem. I just wanted to make sure that the PR was not dying :)
@untom what are the problems that you found in integrating with #1799? The problems I discussed with @temporaer on that one were mostly about the semantics. I think my last opinion was to disallow sparse input for min-max scaling if there was any offset. (maybe bail?)
I've added a MaxAbsScaler as discussed in #1799. If this is to everyone's liking, I can start writing documentation for it as well. Note that my implementations do not include a "global scaling" mode, simply because I'm not sure there's a use case for it, but it could easily be added to all the scalers if required.
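For reference, the behaviour being described, shown with the API as it later shipped:

```python
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler

# Each column is divided by its maximum absolute value, mapping the
# data into [-1, 1] while leaving zero entries (and thus the sparsity
# structure) untouched.
X = sp.csr_matrix([[1.0, -2.0], [0.0, 4.0], [3.0, 0.0]])
print(MaxAbsScaler().fit_transform(X).toarray())
```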
Thanks @untom, please feel free to go forward with the doc, it's interesting. Also please have a look at the travis failures.
Thanks for looking over the code! Your proposed test did indeed unearth a bug in my implementation. I'll write some documentation in the upcoming days, then :)
Thanks @untom. Could you also please have a look at the broken tests reported by travis? https://travis-ci.org/scikit-learn/scikit-learn/builds/13211844
Sorry for all the commits lately, but I stumbled over a late bug today. The good news is that, thanks to the newly added tests, the test coverage for all the scaling code is now at 100% (except for 3 lines related to printing deprecation warnings).
@ogrisel Any chance of this getting merged within the 0.15 window?
@untom this looks good. I'll try to have a deeper look soon; it should probably make it into 0.15.
About the …
----------
copy : boolean, optional, default is True
    Set to False to perform inplace row normalization and avoid a
    copy (if the input is already a numpy array).
I would not call that "row normalization" but rather just "scaling".
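What the `copy` flag does in practice, as a small sketch against the public `scale` function:

```python
import numpy as np
from sklearn.preprocessing import scale

X = np.array([[1.0, 2.0], [3.0, 4.0]])
# With copy=False the operation is performed in place when the input
# is already a float numpy array, so X itself ends up scaled.
scale(X, copy=False)
print(X)  # columns now have zero mean and unit variance
```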
This PR has been sitting idle for quite a while now. I still think it's a worthwhile addition to sklearn. If there's anything I can do to speed up the acceptance/review process, let me know.
@untom Thanks for your work and patience. Something that could help to attract reviewers would be to break this pull request into small and independent ones.
Thanks for the advice, I will do that :)
@arjoly one problem with that is that this PR puts a lot of effort into …
@jnothman: My current plan for splitting this is as follows: …
PR number 2 will include all the consistency/invariance tests that are present here, but will only test them on StandardScaler/MinMaxScaler. But I will make sure that no tests will be lost. (This might also give me an incentive to review the test cases to make sure they make sense / are thorough and non-redundant.) PR 3/4 will then add the other scalers to the list of tested scalers (and of course include any tests specific to Robust-/MaxAbsScaler). This way the changes are more modular, and e.g. RobustScaler can be included even if MaxAbsScaler is deemed unworthy of inclusion. So all in all I think @arjoly's proposal makes sense, and it should be doable without too much effort on either my side or the reviewers' side.
Okay. @arjoly, @untom, what do you think of the alternative construction of invariance tests in jnothman@0e4d04c? It uses test inheritance to make clear the common features and differences between different scalers. (Just to be annoying, that reconstruction isn't quite complete, but does do some cleaning up of the tests' content without documenting exactly what it fixes.)
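The shape of that inheritance-based construction, sketched from the description above rather than copied from jnothman@0e4d04c:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

class ScalerInvarianceTests(object):
    # Shared checks; each subclass only says how to build its scaler.
    def make_scaler(self):
        raise NotImplementedError

    def test_fit_transform_matches_fit_then_transform(self):
        X = np.random.RandomState(0).rand(20, 3)
        a = self.make_scaler().fit_transform(X)
        b = self.make_scaler().fit(X).transform(X)
        np.testing.assert_allclose(a, b)

class TestStandardScaler(ScalerInvarianceTests):
    def make_scaler(self):
        return StandardScaler()

class TestMinMaxScaler(ScalerInvarianceTests):
    def make_scaler(self):
        return MinMaxScaler()
```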
I like it, looks elegant and a bit "cleaner" than iterating over lists.
The issue then is whether it's acceptable in a project where unittest classes are avoided …
Are they being actively avoided, or was there just no use case for them until now?
The SGD code uses them, but they're a bit of a pain with nosetests.
How so?
Because running a single test takes a lot of typing if it's in a class :) And there's seldom a need for these classes. We usually just loop over things or call common functions.
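The loop/common-function convention being referred to, as a nose-style generator test (a sketch, not code from this PR):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def check_fit_transform_invariance(Scaler):
    X = np.random.RandomState(0).rand(20, 3)
    np.testing.assert_allclose(Scaler().fit_transform(X),
                               Scaler().fit(X).transform(X))

def test_all_scalers():
    # Yielding one case per scaler keeps failures attributable to a
    # specific scaler while avoiding test classes entirely.
    for Scaler in [StandardScaler, MinMaxScaler]:
        yield check_fit_transform_invariance, Scaler
```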
Hi there!

This PR refactors the data-scaling code from `sklearn.preprocessing` to remove some duplicated code, and adds some new features. More specifically:

- `RobustScaler` and `robust_scale` functions that use robust estimates of data center/scale (median & interquartile range), which should work better for outliers.
- An `axis=1` parameter for all scalers (the `scale` function could already do this, now the `*Scaler` classes can, too!)
- A `minmax_scale` function (requested by @ogrisel).
- `MaxAbsScaler`, similar in functionality to `MinMaxScaler`, but also works on sparse matrices, as proposed in the discussions of "Global normalization and sparse matrix support for MinMaxScaler" (#1799).
- Removed code duplication between `StandardScaler`, `RobustScaler`, `MinMaxScaler` and `MaxAbsScaler` by putting it in an abstract base class. Essentially the `*Scaler` classes are only responsible for estimating the necessary statistics in `fit`; the rest of the `Transformer` API (transform/inverse_transform, handling sparseness/different axis parameters) is implemented in a `BaseScaler`, as this code is common to all of the scalers. This caused some parameters and attributes to be renamed: `with_centering` and `with_scaling` are the parameters that control whether centering/scaling is performed, and the `center_` and `scale_` attributes are used to store the centering/scaling values.
- The `*_scale` functions now simply reuse the `*Scaler` classes internally to avoid code duplication.
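To make the new surface area concrete, here is how the pieces fit together, assuming the names roughly as they later landed in scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, minmax_scale

X = np.array([[1.0, -2.0],
              [2.0,  0.0],
              [100.0, 4.0]])

# with_centering/with_scaling toggle the two steps; after fitting,
# the per-feature median and IQR live in center_ and scale_.
scaler = RobustScaler(with_centering=True, with_scaling=True).fit(X)
print(scaler.center_, scaler.scale_)

# The *_scale functions are thin wrappers around the classes.
print(minmax_scale(X))
```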
Notes and Caveats

- `StandardScaler` had parameters `with_mean` and `with_std`, which are renamed to `with_centering` and `with_scaling` to fall in line with the other scalers. I wasn't sure how to handle deprecating the old parameter names in `__init__` -- what's the protocol here?
- `RobustScaler` cannot be fitted on sparse matrices. As an alternative, we could advise people to use `MinMaxScaler` instead to scale features to the same range with:: … (`MinMaxScaler` doesn't support the `with_centering` parameter directly because I wasn't sure if this would lead to confusion.)
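A sketch of the sparse caveat and a workaround, assuming the behaviour as it later shipped (where `MaxAbsScaler`, rather than `MinMaxScaler`, ended up being the sparse-friendly option):

```python
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler, RobustScaler

X = sp.csr_matrix([[1.0, -2.0], [0.0, 4.0]])

try:
    RobustScaler().fit(X)  # centering needs the median, i.e. dense data
except (TypeError, ValueError) as exc:
    print(exc)

# Sparse-friendly alternative that keeps zeros at zero:
print(MaxAbsScaler().fit_transform(X).toarray())
```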