
[MRG + 2] ENH RobustScaler #4125


Merged
merged 21 commits into from
May 26, 2015

Conversation

untom
Contributor

@untom untom commented Jan 19, 2015

This PR adds RobustScaler and robust_scale as alternatives to StandardScaler and scale. They use robust estimates of the data's center and scale (median and interquartile range), which work better for data with outliers.

Most of this was discussed before in #2514 (and even older commits). I separated this out so it can be merged as-is.

I originally wanted to submit this only after #3639 was merged, but sending the PR now allows it to be discussed concurrently (if either this or #3639 get merged, I'll of course update the other commit).
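For readers skimming the thread, the transform this PR implements can be sketched in a few lines of numpy. This is illustrative only; the merged RobustScaler/robust_scale API may differ in parameter names, defaults, and options:

```python
import numpy as np

def robust_scale_sketch(X):
    """Center each feature by its median and scale by its interquartile range."""
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    return (X - center) / (q75 - q25)

# A single gross outlier barely shifts the median or the IQR,
# whereas it would inflate the mean and standard deviation.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
Xt = robust_scale_sketch(X)
```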

----------
X : array-like or CSR matrix.
The data used to scale along the specified axis.
"""
Member

Could you add a line check_is_fitted(self, 'scale_') here too?

Contributor Author

done

@jnothman
Member

Test failure:

AttributeError: 'RobustScaler' object has no attribute 'center_'

@untom
Contributor Author

untom commented Jan 20, 2015

Fixed now; sorry that it even happened, I had only run the sklearn.preprocessing-specific tests before.

@untom untom changed the title ENH RobustScaler [MRG] ENH RobustScaler Jan 21, 2015
@agramfort
Member

is this ready to review?

@untom
Contributor Author

untom commented Feb 20, 2015

Yes it is

@untom
Contributor Author

untom commented Feb 23, 2015

Pinging @amueller @ogrisel Maybe you want to have a look?

@amueller
Member

Yes, but probably only after the beta release is cut on Friday.

@untom
Contributor Author

untom commented Apr 7, 2015

ping @amueller is now a good time to remind you about this PR?

@amueller
Member

amueller commented Apr 7, 2015

It is always a good time to remind me ;) I'll try to have a look.


Parameters
----------
interquartile_scale: float or string in ["normal" (default), ],
Member

please add a space before " :"

@amueller
Member

If you like you could add a plot_ example that shows samples from a 2d gaussian with outliers, how it gets transformed using StandardScaler (into something weird) and how using RobustScaler (a ball + outliers). Not required but I think this would be very illustrative.

Apart from my minor nitpicks this looks great. Sorry for the lack of feedback previously.
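A rough sketch of what such an example could look like, using plain numpy stand-ins for the two scalers so it stays self-contained (the real plot_ example would call sklearn.preprocessing's StandardScaler and RobustScaler and render with matplotlib):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=100)
X = np.vstack([X, [[10, 10], [12, -9], [-11, 8]]])  # add three gross outliers

# mean/std scaling (what StandardScaler does): the outliers inflate the std,
# so the bulk of the data gets squeezed into "something weird"
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# median/IQR scaling (what RobustScaler does): the bulk stays "a ball",
# with the outliers still visibly far away
q25, q75 = np.percentile(X, [25, 75], axis=0)
X_rob = (X - np.median(X, axis=0)) / (q75 - q25)
```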

@untom
Contributor Author

untom commented May 15, 2015

I am not sure I understand the parametrization here. So we are always using the 1/4 and 3/4 quantiles and then scaling by them? Is that standard practice? [sorry, I rarely do robust statistics]

Long answer outside of the affected code lines so the answer stays visible for future reviewers: Yes, using the 0.25-0.75 range (the "interquartile range") is the most common option, although others might make sense on occasion (e.g., Wikipedia also mentions the interdecile range, 0.1-0.9, as relevant). We could make the quantiles a user-definable parameter that defaults to 0.25-0.75, to cover the more general case. The only question is whether it's worth the added complexity for the user. I've personally never needed anything more than the IQR, but maybe having the option would be nice?
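The generalization discussed here can be sketched with a hypothetical quantile_range parameter defaulting to the IQR (the parameter name and default are illustrative, not this PR's final API):

```python
import numpy as np

def scale_by_quantile_range(X, quantile_range=(25.0, 75.0)):
    """Center by the median, scale by the given quantile range (in percent)."""
    X = np.asarray(X, dtype=float)
    q_min, q_max = quantile_range
    lo, hi = np.percentile(X, [q_min, q_max], axis=0)
    return (X - np.median(X, axis=0)) / (hi - lo)

X = np.arange(11, dtype=float).reshape(-1, 1)      # 0, 1, ..., 10
Xt_iqr = scale_by_quantile_range(X)                # default interquartile range
Xt_idr = scale_by_quantile_range(X, (10.0, 90.0))  # interdecile range
```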

# Create training and test data
np.random.seed(42)
n_datapoints = 100
C=[[0.9, 0.0],[0.0, 20.0]]
Member

pep8

@untom
Contributor Author

untom commented May 21, 2015

@agramfort Thanks for taking the time to look at this. I've fixed the mistakes you pointed out, and went over the docstrings in data.py again and removed the repeated words/phrases. If you could look it over once more now, that'd be great.

Attributes
----------
`center_` : array of floats
The median value for each feature in the training set, unless axis=1,
Member

I don't see any axis parameter

Contributor Author

You are right, that is a left-over from a feature that I later removed because another reviewer didn't like it. Thanks for catching it!

@agramfort
Member

ok LGTM

one last review?

@untom untom changed the title [MRG + 1] ENH RobustScaler [MRG + 2] ENH RobustScaler May 22, 2015
@untom
Contributor Author

untom commented May 22, 2015

Thanks for looking it over. Do PRs need 3 reviews before being accepted now?

@agramfort
Member

agramfort commented May 22, 2015 via email

@GaelVaroquaux
Member

Ok, I find that the issues raised by @agramfort are not major ones. I think that this can be merged as such.

However :), the median and the interquartile range are very robust estimators, but also very noisy. Using a trimmed mean, both on the raw data to estimate the center location and on the squared distances to the center to estimate the spread, is in general a much better strategy, with a continuous parameter to interpolate between mean and median.

I would really love a PR adding such a behavior to the robust scaler. It worries me that, right now, anybody wanting to do robust scaling will also get very noisy scaling.
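A hedged numpy sketch of the trimmed-mean estimators described above. The helper names and the proportion parameter are illustrative, not an existing scikit-learn API; proportion=0 recovers the plain mean, and larger values approach the median:

```python
import numpy as np

def trimmed_mean(x, proportion=0.2):
    """Mean of x after discarding the `proportion` smallest and largest values."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(proportion * len(x))
    return x[k:len(x) - k].mean() if k > 0 else x.mean()

def trimmed_center_and_spread(x, proportion=0.2):
    """Robust center and spread via trimmed means, as suggested above."""
    center = trimmed_mean(x, proportion)
    # spread from a trimmed mean of the squared distances to the center
    spread = np.sqrt(trimmed_mean((x - center) ** 2, proportion))
    return center, spread

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one gross outlier
center, spread = trimmed_center_and_spread(x, proportion=0.2)
```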

@untom
Contributor Author

untom commented May 25, 2015

I have never heard of the trimmed mean being more robust. Could you point me to any references I could read up on?

amueller added a commit that referenced this pull request May 26, 2015
@amueller amueller merged commit 0b07536 into scikit-learn:master May 26, 2015
@amueller
Member

Merged. Thanks @untom.
Would you like to do the MaxAbsScaler?

@untom
Contributor Author

untom commented May 26, 2015

Nice! Thanks to the reviewers for their comments.

@amueller: I am fairly busy until the 5th, so most likely I will only find time to work on this afterwards.

@amueller
Member

I smell nips ;) Good luck!

@untom untom deleted the RobustScaler branch May 26, 2015 22:09
@mblondel
Member

In such cases, the median and the interquartile range often give better results.

If there are papers to support this claim, it would be nice to add them to the references.

@ogrisel
Member

ogrisel commented May 27, 2015

Thanks @untom for this contribution. I added a what's new entry here: 549ecae.

Please feel free to open a new PR to update it to use your real name instead of your github nickname. You might also want to update the authors list in the header of the sklearn/preprocessing/data.py file.
