
[MRG + 2] ENH RobustScaler #4125


Merged
merged 21 commits into from
May 26, 2015

Conversation

untom
Contributor

@untom untom commented Jan 19, 2015

This PR adds RobustScaler and robust_scale as alternatives to StandardScaler and scale. They use robust estimates of the data's center and scale (median and interquartile range), which work better for data with outliers.

Most of this was discussed before in #2514 (and even older commits). I separated this out so it can be merged as-is.

I originally wanted to submit this only after #3639 was merged, but sending the PR now allows it to be discussed concurrently (if either this or #3639 get merged, I'll of course update the other commit).
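For readers skimming the thread, the transform this PR implements can be sketched in a few lines of numpy. This is illustrative only; the merged RobustScaler/robust_scale API may differ in parameter names, defaults, and options:

```python
import numpy as np

def robust_scale_sketch(X):
    """Center each feature by its median and scale by its interquartile range."""
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    return (X - center) / (q75 - q25)

# A single gross outlier barely shifts the median or the IQR,
# whereas it would inflate the mean and standard deviation.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
Xt = robust_scale_sketch(X)
```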

----------
X : array-like or CSR matrix.
The data used to scale along the specified axis.
"""
Member

Could you add a line check_is_fitted(self, 'scale_') here too?

Contributor Author

done

@jnothman
Member

Test failure:

AttributeError: 'RobustScaler' object has no attribute 'center_'

@untom
Contributor Author

untom commented Jan 20, 2015

Fixed now; sorry that it even happened, I had only run the sklearn.preprocessing-specific tests before.

@untom untom changed the title ENH RobustScaler [MRG] ENH RobustScaler Jan 21, 2015
@agramfort
Member

is this ready to review?

@untom
Contributor Author

untom commented Feb 20, 2015

Yes it is

@untom
Contributor Author

untom commented Feb 23, 2015

Pinging @amueller @ogrisel Maybe you want to have a look?

@amueller
Member

Yes, but probably only after the beta release is cut on Friday.

@untom
Contributor Author

untom commented Apr 7, 2015

ping @amueller is now a good time to remind you about this PR?

@amueller
Member

amueller commented Apr 7, 2015

It is always a good time to remind me ;) I'll try to have a look.


Parameters
----------
interquartile_scale: float or string in ["normal" (default), ],
Member

please add a space before " :"

@amueller
Member

If you like you could add a plot_ example that shows samples from a 2d gaussian with outliers, how it gets transformed using StandardScaler (into something weird) and how using RobustScaler (a ball + outliers). Not required but I think this would be very illustrative.

Apart from my minor nitpicks this looks great. Sorry for the lack of feedback previously.
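A rough sketch of what such an example could look like, using plain numpy stand-ins for the two scalers so it stays self-contained (the real plot_ example would call sklearn.preprocessing's StandardScaler and RobustScaler and render with matplotlib):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=100)
X = np.vstack([X, [[10, 10], [12, -9], [-11, 8]]])  # add three gross outliers

# mean/std scaling (what StandardScaler does): the outliers inflate the std,
# so the bulk of the data gets squeezed into "something weird"
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# median/IQR scaling (what RobustScaler does): the bulk stays "a ball",
# with the outliers still visibly far away
q25, q75 = np.percentile(X, [25, 75], axis=0)
X_rob = (X - np.median(X, axis=0)) / (q75 - q25)
```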

@untom
Contributor Author

untom commented May 15, 2015

I am not sure I understand the parametrization here. So we are always using the 1/4 and 3/4 quantiles and then scaling by them? Is that standard practice? [sorry, I rarely do robust statistics]

Long answer outside of the affected code lines so the answer stays visible for future reviewers: Yes, using the 0.25-0.75 range (the "interquartile range") is the most common option, although others might make sense on occasion (e.g., Wikipedia also mentions the interdecile range, 0.1-0.9, as relevant). We could make the quantiles a user-definable parameter that defaults to 0.25-0.75, to cover the more general case. The only question is whether it's worth the added complexity for the user. I've personally never needed anything more than the IQR, but maybe having the option would be nice?
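The generalization discussed here can be sketched with a hypothetical quantile_range parameter defaulting to the IQR (the parameter name and default are illustrative, not this PR's final API):

```python
import numpy as np

def scale_by_quantile_range(X, quantile_range=(25.0, 75.0)):
    """Center by the median, scale by the given quantile range (in percent)."""
    X = np.asarray(X, dtype=float)
    q_min, q_max = quantile_range
    lo, hi = np.percentile(X, [q_min, q_max], axis=0)
    return (X - np.median(X, axis=0)) / (hi - lo)

X = np.arange(11, dtype=float).reshape(-1, 1)      # 0, 1, ..., 10
Xt_iqr = scale_by_quantile_range(X)                # default interquartile range
Xt_idr = scale_by_quantile_range(X, (10.0, 90.0))  # interdecile range
```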

# Create training and test data
np.random.seed(42)
n_datapoints = 100
C=[[0.9, 0.0],[0.0, 20.0]]
Member

pep8

@untom
Contributor Author

untom commented May 21, 2015

@agramfort Thanks for taking the time to look at this. I've fixed the mistakes you pointed out, and went over the docstrings in data.py again and removed the repeated words/phrases. If you could look it over once more now, that'd be great.

Attributes
----------
`center_` : array of floats
The median value for each feature in the training set, unless axis=1,
Member

I don't see any axis parameter

Contributor Author

You are right, that is a left-over from a feature that I later removed because another reviewer didn't like it. Thanks for catching it!

@agramfort
Member

ok LGTM

one last review?

@untom untom changed the title [MRG + 1] ENH RobustScaler [MRG + 2] ENH RobustScaler May 22, 2015
@untom
Contributor Author

untom commented May 22, 2015

Thanks for looking it over. Do PRs need 3 reviews before being accepted now?

@agramfort
Member

agramfort commented May 22, 2015 via email

@GaelVaroquaux
Member

Ok, I find that the issues raised by @agramfort are not major ones. I think that this can be merged as such.

However :), the median and the interquartile range are very robust estimators, but also very noisy. Using a trimmed mean, both on the raw data to estimate the center location and on the squared distances to the center to estimate the spread, is in general a much better strategy, with a continuous parameter to interpolate between mean and median.

I would really love a PR adding such a behavior to the robust scaler. It worries me that, right now, anybody wanting to do robust scaling will also get very noisy scaling.
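A hedged numpy sketch of the trimmed-mean estimators described above. The helper names and the proportion parameter are illustrative, not an existing scikit-learn API; proportion=0 recovers the plain mean, and larger values approach the median:

```python
import numpy as np

def trimmed_mean(x, proportion=0.2):
    """Mean of x after discarding the `proportion` smallest and largest values."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(proportion * len(x))
    return x[k:len(x) - k].mean() if k > 0 else x.mean()

def trimmed_center_and_spread(x, proportion=0.2):
    """Robust center and spread via trimmed means, as suggested above."""
    center = trimmed_mean(x, proportion)
    # spread from a trimmed mean of the squared distances to the center
    spread = np.sqrt(trimmed_mean((x - center) ** 2, proportion))
    return center, spread

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one gross outlier
center, spread = trimmed_center_and_spread(x, proportion=0.2)
```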

@untom
Contributor Author

untom commented May 25, 2015

I have never heard of the trimmed mean being more robust. Could you point me to any references I could read up on?

amueller added a commit that referenced this pull request May 26, 2015
@amueller amueller merged commit 0b07536 into scikit-learn:master May 26, 2015
@amueller
Member

Merged. Thanks @untom.
Would you like to do the MaxAbsScaler?

@untom
Contributor Author

untom commented May 26, 2015

Nice! Thanks to the reviewers for their comments.

@amueller: I am fairly busy until the 5th, so most likely I will only find time to work on this afterwards.

@amueller
Member

I smell nips ;) Good luck!

@untom untom deleted the RobustScaler branch May 26, 2015 22:09
@mblondel
Member

In such cases, the median and the interquartile range often give better results.

If there are papers to support this claim, it would be nice to add them to the references.

@ogrisel
Member

ogrisel commented May 27, 2015

Thanks @untom for this contribution. I added a what's new entry here: 549ecae.

Please feel free to open a new PR to update it to use your real name instead of your github nickname. You might also want to update the authors list in the header of the sklearn/preprocessing/data.py file.
