[MRG] ENH: Add sample_weight to median_absolute_error #6217
Conversation
I would say let us make the `_weighted_percentile` helper more general. The idea is something like this, for instance:
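A minimal sketch of what such a helper could look like, assuming the searchsorted-on-the-cumulative-weights approach shown in the review snippet below (the name, signature, and defaults are illustrative, not the PR's actual code):

```python
import numpy as np

def _weighted_percentile(array, sample_weight, percentile=50):
    """Lower weighted percentile: smallest value whose cumulative weight
    reaches percentile/100 of the total weight (illustrative sketch)."""
    array = np.asarray(array)
    sample_weight = np.asarray(sample_weight)
    sorted_idx = np.argsort(array)
    weight_cdf = np.cumsum(sample_weight[sorted_idx])
    threshold = (percentile / 100.0) * weight_cdf[-1]
    percentile_idx = np.searchsorted(weight_cdf, threshold)
    return array[sorted_idx[percentile_idx]]

# With unit weights and percentile=50 this picks the lower of the two
# middle values for an even number of samples, which is exactly what the
# rest of the thread tries to improve on.
print(_weighted_percentile([1., 2., 3., 4.], [1., 1., 1., 1.]))  # 2.0
```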
Thanks for the detailed explanation. But the link mentioned above had this thread which discussed this implementation. Could you look into that? It seems to be working fine.
And is the formula
Ah right, I edited it.
The reason I would keep both implementations independent is that there is some stuff in gradient boosting which depends on
Sorry, I didn't respond to that comment. I realised it when I saw the failing tests.
sklearn/utils/stats.py (outdated)
# Find index of median prediction for each sample
weight_cdf = sample_weight[sorted_idx].cumsum()
percentile_idx = np.searchsorted(
    weight_cdf, (percentile / 100.) * weight_cdf[-1])
if weight_cdf[percentile_idx] == midpoint:
    return np.mean(array[sorted_idx[percentile_idx]:sorted_idx[percentile_idx + 1] + 1])
In any case, this approach might work only if the percentile is 50, so we are better off refactoring it. This is the same thing as done in the Wikipedia article, but I think using np.interp directly will make the code look cleaner and easier to follow.
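For concreteness, a rough sketch of the np.interp-based refactor being suggested here (the function name and exact conventions are illustrative, not the PR's):

```python
import numpy as np

def weighted_percentile_interp(array, sample_weight, percentile=50):
    """Weighted percentile with linear interpolation between the
    cumulative-weight midpoints (illustrative sketch)."""
    array = np.asarray(array, dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    sorted_idx = np.argsort(array)
    sorted_weights = sample_weight[sorted_idx]
    weight_cdf = np.cumsum(sorted_weights)
    # Midpoint positions (C_i - w_i / 2) / C_n; for unit weights this is (i - 0.5) / n.
    positions = (weight_cdf - 0.5 * sorted_weights) / weight_cdf[-1]
    return np.interp(percentile / 100.0, positions, array[sorted_idx])

# With equal weights the interpolation reproduces the ordinary median,
# including averaging the two middle values when the sample size is even.
print(weighted_percentile_interp([1., 2., 3., 4.], [1., 1., 1., 1.]))  # 2.5
print(np.median([1., 2., 3., 4.]))                                     # 2.5
```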
Oh okay, got the point. Will do the changes and ping you back. Thanks!
Sorry, but I am a bit confused about the implementation. This is the code. Please do have a look at it and let me know the modifications. Hope I didn't do it completely wrong.
This is the output this produces for the simple test cases.
Thanks, and sorry for disturbing you repeatedly.
In terms of the definition of a weighted median, it seems to me that the right definition is the one based on https://en.wikipedia.org/wiki/Weighted_median#Properties
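As a toy check of that property: take the values 1, 2, 3, 4 with equal weights of 0.25 each. The value 2 has total weight 0.25 strictly below it and 0.5 strictly above it, both at most 1/2, so it is a lower weighted median; by symmetry 3 is an upper weighted median, and taking their midpoint gives 2.5, which matches the ordinary median of the four values.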
Yes, the implementation appears to be correct to me as well. Do make the changes.
Thanks for the comments. Will do the changes and push it now.
Travis is failing. The failing test is test_scorer_sample_weight. It is getting identical values with and without sample_weights.
Hmm. Can you try printing out the values that get passed in? It might be possible that the median_absolute_error remains the same, because even zeroing out the initial few sample weights does not change the median value of the cumulative sum of the sample weights by much.
@MechCoder Sorry for the delay. I'm not completely sure this is what was expected, but the printed values are as follows.
Please let me know if something else was expected. Thanks.
In general, the two rules that are tested in the test that fails on Travis need not hold true for median absolute error. You can play around with toy data and verify that. I would ignore those checks using an `if name == "median_absolute_error"` guard and add a comment. It would be great to add tests for this separately.
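For instance, here is one toy case where the weighted and unweighted scores coincide even though the weights are not uniform. It is a sketch that assumes a scikit-learn release in which median_absolute_error already accepts sample_weight (which, as noted at the end of this thread, eventually landed via #17225):

```python
import numpy as np
from sklearn.metrics import median_absolute_error

y_true = np.zeros(5)
y_pred = np.array([1., 2., 3., 4., 5.])   # absolute errors are 1..5

unweighted = median_absolute_error(y_true, y_pred)
# Upweighting only the middle sample does not move the weighted median,
# so the generic "weighted score != unweighted score" check cannot hold here.
weighted = median_absolute_error(y_true, y_pred,
                                 sample_weight=[1., 1., 3., 1., 1.])
print(unweighted, weighted)  # both expected to be 3.0 under the weighted-median definitions discussed here
```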
sklearn/metrics/regression.py (outdated)
sample_weight = np.array(sample_weight)
weight_cdf = sample_weight[sorted_idx].cumsum()
weighted_percentile = (weight_cdf - sample_weight / 2.0) / weight_cdf[-1]
sorted_array = np.sort(array)
nitpick: array[sorted_idx]
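As a quick standalone check of the midpoint rule in the diff above (not code from the PR): with unit weights the positions reduce to (i - 0.5) / n, so interpolating at 0.5 recovers the ordinary median.

```python
import numpy as np

x = np.array([7., 1., 3., 5.])
w = np.ones_like(x)
sorted_idx = np.argsort(x)
weight_cdf = w[sorted_idx].cumsum()
positions = (weight_cdf - w[sorted_idx] / 2.0) / weight_cdf[-1]   # [0.125, 0.375, 0.625, 0.875]
print(np.interp(0.5, positions, x[sorted_idx]), np.median(x))     # 4.0 4.0
```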
Force-pushed a12e7ad to 8d8ac0c
Hi, I have done the changes by skipping the checks for median_absolute_error as suggested.
Hi, please review this whenever it is possible and let me know if anything else needs to be done. Thanks :)
If you think that this is ready to be merged, you should change '[WIP]' in the title to '[MRG]'.
Thanks Gael. I have done that. It seems the title change is not sent in the email notification :)
The trick is to do it before you post your comment: the email is sent with the current title, hence if you change it before a comment, everybody sees the '[MRG]' in their mailbox.
Oh, my apologies. I will remember that next time.
Hi everyone, a small ping again. Hope you don't mind. Thanks.
Will review tomorrow.
@MechCoder I suppose I have done all the changes except for the last comment (sorry, I still didn't get its meaning). Please have a look whenever possible. Thanks.
@MechCoder could you also please explain your comment if you have time? Thanks!
Force-pushed 0693994 to 5093dd3
Also make the _weighted_percentile more strong in utils
Use linear interpolation to calculate weighted median
Force-pushed 5093dd3 to 4f61f59
@maniteja123 was your last comment really meant for this PR? I'm a bit confused.
Hi @amueller, I am really sorry for the confusion. I commented it because I mistook this for the bicluster PR I was working on simultaneously. Please ignore it.
@maniteja123 Do you still intend to finish this PR?
@maniteja123 @glemaitre This can be closed as #17225 is merged.
Good catch.
Enable support for `sample_weight` in `median_absolute_error` as suggested in #3450. Also make `_weighted_percentile` more robust, as discussed in #6189. The idea of the midpoint of weights was originally given here. Please let me know if something else is to be done and if there is a need to add new tests. Thanks.
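For reference, a minimal usage sketch of the feature this PR proposed, assuming a scikit-learn release that includes #17225:

```python
import numpy as np
from sklearn.metrics import median_absolute_error

y_true = np.array([3., -0.5, 2., 7.])
y_pred = np.array([2.5, 0.0, 2., 8.])

print(median_absolute_error(y_true, y_pred))   # 0.5 (unweighted)
# Putting most of the weight on the last sample pulls the weighted
# median towards that sample's absolute error of 1.0.
print(median_absolute_error(y_true, y_pred,
                            sample_weight=[1., 1., 1., 10.]))
```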
strong as discussed in #6189. The idea of the midpoint of weights was originally given in here. Please let me know if something else is to be done and if there is a need to add new tests. Thanks.This change is