-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG + 2] ENH RobustScaler #4125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
---------- | ||
X : array-like or CSR matrix. | ||
The data used to scale along the specified axis. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a line check_is_fitted(self, 'scale_')
here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Test failure: AttributeError: 'RobustScaler' object has no attribute 'center_' |
Fixed now, sorry that it even happened, I only ran the |
is this ready to review? |
Yes it is |
Yes, but probably only after he beta release is cut on Friday. |
ping @amueller is now a good time to remind you about this PR? |
It is always a good time to remind me ;) I'll try to have a look. |
|
||
Parameters | ||
---------- | ||
interquartile_scale: float or string in ["normal" (default), ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a space before " :"
If you like you could add a plot_ example that shows samples from a 2d gaussian with outliers, how it gets transformed using StandardScaler (into something weird) and how using RobustScaler (a ball + outliers). Not required but I think this would be very illustrative. Apart from my minor nitpicks this looks great. Sorry for the lack of feedback previously. |
Long answer outside of the affected code lines so the answer stays visible for future reviewers: Yes, using the 0.25-0.75 (the "interquartile range") is the most common option, although others might make sense in some occasions (e.g. Wikipedia also meantions the interdecale range 0.1-0.9 as relevant). We could make the quantiles a user-definable parameter that just defaults to 0.25-0.75, to cover the more general case. The only question is if it's worth the added complexity for the user. I've personally never needed anything more than IQR, but maybe having the option might be nice? |
# Create training and test data | ||
np.random.seed(42) | ||
n_datapoints = 100 | ||
C=[[0.9, 0.0],[0.0, 20.0]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pep8
@agramfort Thanks for taking the time to look at this. I've fixed the mistakes you pointed out, and went over the docstrings in |
Attributes | ||
---------- | ||
`center_` : array of floats | ||
The median value for each feature in the training set, unless axis=1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any axis parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, that is a left-over from a feature that I later removed because an other reviewer didn't like it. Thanks for catching it!
ok LGTM one last review? |
Thanks for looking it over. Do PRs need 3 reviews before being accepted now? |
no but I'd like to have one more review to see if my last comments would be
nitpicks for another core dev
|
Ok, I find that the issues raised by @agramfort are not major ones. I think that this can be merged as such. However :), the median and the inter quartile range are very robust estimators but also very noisy. Using a trimmed mean, both on the raw data to estimate center location, and on the squared distances to the center, to estimate spread, is in general a much better strategy with a continuous parameter to interpolate between mean and median. I would really love a PR adding such a behavior to the robust scaler. It worries me that anybody wanting to do robust scaling will right now have also very noisy scaling. |
I never heard of the trimmed mean being more robust, could you name any references I could read up on? |
Merged. Thanks @untom. |
Nice! Thanks to the reviewers for their comments. @amueller: I am fairly busy until the 5th, so most likely I will only find time to work on this afterwards. |
I smell nips ;) Good luck! |
If there are papers to support this claim, it would be nice to add them to the references. |
This PR adds
RobustScaler
androbust_scale
as alternative toStandardScaler
andscale
. They use robust estimates of data center/scale (median & interquartile range), which will work better for data with outliers.Most of this was discussed before in #2514 (and even older commits). I separated this out so it can be merged as-is.
I originally wanted to submit this only after #3639 was merged, but sending the PR now allows it to be discussed concurrently (if either this or #3639 get merged, I'll of course update the other commit).