
[MRG] ENH: Add sample_weight to median_absolute_error #6217


Closed

Conversation

maniteja123
Contributor

@maniteja123 commented Jan 23, 2016

Enable support for sample_weight in median_absolute_error as suggested in #3450. Also make _weighted_percentile more robust, as discussed in #6189. The idea of using the midpoint of the weights was originally given here. Please let me know if anything else needs to be done and whether new tests are needed. Thanks.



@MechCoder changed the title from "ENH: Add sample_weight to median_absolute_error" to "[WIP] ENH: Add sample_weight to median_absolute_error" on Jan 24, 2016
@MechCoder
Member

I would say let us make the _weighted_percentile issue separate from this one, because other modules depend on it.
As for the weighted median, let us follow the approach described here (https://en.wikipedia.org/wiki/Percentile#Definition_of_the_Weighted_Percentile_method), which is more intelligent.

The idea is something like this; for instance, take a = np.array([1, 4, 8, 9, 10]) and w = np.array([1.2, 4.5, 7.8, 2.3, 4.5]):

  1. Sort a.
  2. Find the cumulative sum of w, which is array([ 1.2, 5.7, 13.5, 15.8, 20.3]).
  3. Compute weighted_percentages[i] = (cum_sum[i] - w[i] / 2) / cum_sum[-1].
  4. Use np.interp to find what you need, using a and weighted_percentages (see the sketch below).
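
A minimal sketch of those four steps with the example values above (the variable names are just illustrative):

import numpy as np

a = np.array([1, 4, 8, 9, 10])           # step 1: already sorted
w = np.array([1.2, 4.5, 7.8, 2.3, 4.5])

cum_w = np.cumsum(w)                     # step 2: [1.2, 5.7, 13.5, 15.8, 20.3]
pct = (cum_w - w / 2.0) / cum_w[-1]      # step 3: midpoint percentile of each weight
median = np.interp(0.5, pct, a)          # step 4: weighted median, ~8.11 here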

@maniteja123
Contributor Author

Thanks for the detailed explanation. But the link mentioned above had this thread, which discussed this implementation. Could you look into that? It seems to be working fine.

@maniteja123
Contributor Author

And is the formula (cum_sum[i] - w[i]/2)? It seemed so from the Wikipedia link. Please correct me if I am wrong.

@MechCoder
Member

ah right, I edited it

@MechCoder
Member

The reason I would keep both implementations independent is that there is some stuff in gradient boosting which depends on _weighted_percentile.

@maniteja123
Contributor Author

Sorry, I didn't respond to that comment. I realised it when I saw the failing tests.

# Find index of median prediction for each sample
weight_cdf = sample_weight[sorted_idx].cumsum()
percentile_idx = np.searchsorted(
    weight_cdf, (percentile / 100.) * weight_cdf[-1])
if weight_cdf[percentile_idx] == midpoint:
    return np.mean(array[sorted_idx[percentile_idx]:sorted_idx[percentile_idx + 1] + 1])
Member

In any case, this approach might work only if the percentile is 50, so we are better off refactoring it.
This is the same thing as is done in the Wikipedia article, but I think using np.interp directly will make the code cleaner and easier to follow.

Contributor Author

Oh okay, got the point. Will do the changes and ping you back. Thanks!

@maniteja123
Contributor Author

Sorry, but I am a bit confused about the implementation. This is the code. Please have a look at it and let me know what needs modifying. Hope I didn't do it completely wrong.

import numpy as np

def _weighted_percentile(array, sample_weight, percentile=50):
    sorted_idx = np.argsort(array)
    sample_weight = np.asarray(sample_weight)
    # cumulative sum of the weights, taken in sorted order
    weight_cdf = sample_weight[sorted_idx].cumsum()
    # midpoint percentile of each weight; the weights must be in sorted order too
    weighted_percentile = (weight_cdf - sample_weight[sorted_idx] / 2.0) / weight_cdf[-1]
    sorted_array = np.sort(array)
    weighted_median = np.interp(percentile / 100., weighted_percentile, sorted_array)
    return weighted_median
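
(A side note: np.interp expects its xp argument to be increasing, and the midpoint percentiles satisfy this by construction, since consecutive differences equal (w[i] + w[i+1]) / (2 * sum(w)) for the sorted weights.)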

This is the output this produces for the simple test cases.

>>> _weighted_percentile([0,1],[1,1])
0.5
>>> _weighted_percentile([0,1],[1,2])
0.6666666666666667
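
A quick hand check of the second case: the cumulative weights are [1, 3], so the midpoint percentiles are (1 - 0.5)/3 = 1/6 and (3 - 1)/3 = 2/3, and interpolating 0.5 between them gives 0 + (0.5 - 1/6) / (2/3 - 1/6) = 2/3, which matches the output.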

Thanks, and sorry for repeatedly disturbing you.

@GaelVaroquaux
Member

In terms of the definition of a weighted median, it seems to me that the right definition is the one that starts from the median defined as a Fréchet mean of the l1 distance, and adds weights to that definition.

Based on https://en.wikipedia.org/wiki/Weighted_median#Properties it seems that the definition that you suggest using has the right properties.

@MechCoder
Member

Yes, the implementation appears correct to me as well. Do make the changes.

@maniteja123
Contributor Author

Thanks for the comments. Will do the changes and push it now.

@maniteja123
Contributor Author

Travis is failing. The failing test is test_scorer_sample_weight: it gets identical values with and without sample_weight.

@MechCoder
Member

Hmm. Can you try printing out the values that get passed into np.interp?

It might be possible that the median_absolute_error remains the same because even zeroing out the initial few sample weights does not change the median value of the cumulative sum of the sample weights by much.
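
A toy illustration of that point, using the _weighted_percentile sketched earlier in this thread (the numbers are hypothetical, not from the failing test):

errors = [1, 2, 3, 4, 5]

# uniform weights: the weighted median is 3
_weighted_percentile(errors, [1, 1, 1, 1, 1])   # -> 3.0

# zeroing the two extreme weights leaves the weighted median unchanged,
# so the weighted and unweighted scores coincide
_weighted_percentile(errors, [0, 1, 1, 1, 0])   # -> 3.0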

@maniteja123
Contributor Author

@MechCoder Sorry for the delay. I'm not entirely sure this is what was expected, but the printed values are the sample_weight and the weighted_percentile passed to np.interp, which seem to be almost the same. The two cases are for binary and multiclass problems. And the values for sample_weight as None and as np.ones(len(y_true)) evaluate to be equal.

I: Seeding RNGs with 1285603334
...................None
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 0.01  0.03  0.05  0.07  0.09  0.11  0.13  0.15  0.17  0.19  0.21  0.23
  0.25  0.27  0.29  0.31  0.33  0.35  0.37  0.39  0.41  0.43  0.45  0.47
  0.49  0.51  0.53  0.55  0.57  0.59  0.61  0.63  0.65  0.67  0.69  0.71
  0.73  0.75  0.77  0.79  0.81  0.83  0.85  0.87  0.89  0.91  0.93  0.95
  0.97  0.99]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[ 0.0122449   0.02653061  0.04489796  0.04897959  0.05714286  0.10204082
  0.13061224  0.15714286  0.15714286  0.17142857  0.20612245  0.21428571
  0.24693878  0.26530612  0.27959184  0.28163265  0.28979592  0.30408163
  0.35102041  0.37959184  0.39795918  0.43469388  0.47346939  0.5122449
  0.53877551  0.55102041  0.59387755  0.59387755  0.60816327  0.61428571
  0.63265306  0.64489796  0.66530612  0.67755102  0.68163265  0.70408163
  0.71836735  0.73673469  0.74489796  0.74693878  0.76734694  0.80612245
  0.80408163  0.83877551  0.87142857  0.8877551   0.91632653  0.93877551
  0.95918367  0.98571429]
F...................................None
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 0.01  0.03  0.05  0.07  0.09  0.11  0.13  0.15  0.17  0.19  0.21  0.23
  0.25  0.27  0.29  0.31  0.33  0.35  0.37  0.39  0.41  0.43  0.45  0.47
  0.49  0.51  0.53  0.55  0.57  0.59  0.61  0.63  0.65  0.67  0.69  0.71
  0.73  0.75  0.77  0.79  0.81  0.83  0.85  0.87  0.89  0.91  0.93  0.95
  0.97  0.99]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[ 0.02040816  0.05510204  0.06530612  0.08571429  0.08163265  0.10612245
  0.10612245  0.12857143  0.15714286  0.15510204  0.17346939  0.20612245
  0.23061224  0.26938776  0.29591837  0.32653061  0.33469388  0.36122449
  0.40816327  0.42040816  0.44693878  0.49183673  0.50612245  0.53265306
  0.53877551  0.55102041  0.57755102  0.58979592  0.6         0.60612245
  0.64489796  0.64489796  0.66122449  0.68979592  0.68571429  0.70816327
  0.73877551  0.74897959  0.77755102  0.79591837  0.82040816  0.82653061
  0.82857143  0.84693878  0.88367347  0.90816327  0.92857143  0.96326531
  0.98367347  0.98571429]
F..................................................................
======================================================================
FAIL: sklearn.metrics.tests.test_common.test_sample_weight_invariance('median_absolute_error', <function median_absolute_error at 0xb2a2fd84>, array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/utils/testing.py", line 319, in wrapper
    return fn(*args, **kwargs)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/metrics/tests/test_common.py", line 924, in check_sample_weight_invariance
    "equal (%f) for %s" % (weighted_score, name))
AssertionError: Unweighted and weighted scores are unexpectedly equal (0.000000) for median_absolute_error
>>  raise self.failureException('Unweighted and weighted scores are unexpectedly equal (0.000000) for median_absolute_error')


======================================================================
FAIL: sklearn.metrics.tests.test_common.test_sample_weight_invariance('median_absolute_error', <function median_absolute_error at 0xb2a2fd84>, array([4, 0, 3, 3, 3, 1, 3, 2, 4, 0, 0, 4, 2, 1, 0, 1, 1, 0, 1, 4, 3, 0, 3,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/utils/testing.py", line 319, in wrapper
    return fn(*args, **kwargs)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/metrics/tests/test_common.py", line 924, in check_sample_weight_invariance
    "equal (%f) for %s" % (weighted_score, name))
AssertionError: Unweighted and weighted scores are unexpectedly equal (1.000000) for median_absolute_error
>>  raise self.failureException('Unweighted and weighted scores are unexpectedly equal (1.000000) for median_absolute_error')


----------------------------------------------------------------------
Ran 122 tests in 7.671s

FAILED (failures=2)

Please let me know if something else was expected. Thanks.

@MechCoder
Member

In general, the two invariance rules exercised by the test that fails on Travis need not hold true for median absolute error. You can play around with toy data and verify that. I would skip the tests using an `if name == "median_absolute_error"` check and add a comment. It would be great to add a test for this separately.
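
A sketch of what that skip could look like (the placement inside check_sample_weight_invariance is assumed, not quoted from this PR):

# hypothetical sketch, inside check_sample_weight_invariance in test_common.py,
# placed before the invariance assertions
if name == "median_absolute_error":
    # the invariance rules need not hold for medians: reweighting samples
    # can leave the weighted median, and hence the score, unchanged
    return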

sample_weight = np.asarray(sample_weight)
weight_cdf = sample_weight[sorted_idx].cumsum()
weighted_percentile = (weight_cdf - sample_weight[sorted_idx] / 2.0) / weight_cdf[-1]
sorted_array = np.sort(array)
Member

nitpick: array[sorted_idx] (reuse sorted_idx instead of sorting again)

@maniteja123 force-pushed the median_absolute_error branch from a12e7ad to 8d8ac0c on February 9, 2016 14:06
@maniteja123
Contributor Author

Hi, I have done the changes by skipping the test_sample_weight_invariance check for median_absolute_error and added a separate test for it in test_common.py in the metrics module. Please let me know if I am doing anything wrong here. Thanks.

@maniteja123
Contributor Author

Hi, please review this whenever possible and let me know if anything else needs to be done. Thanks :)

@maniteja123 changed the title from "[WIP] ENH: Add sample_weight to median_absolute_error" to "[MRG] ENH: Add sample_weight to median_absolute_error" on Feb 22, 2016
@GaelVaroquaux
Member

GaelVaroquaux commented Feb 22, 2016 via email

@maniteja123
Contributor Author

Thanks Gael. I have done that. It seems the title change is not sent in the email notification :)

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 22, 2016 via email

@maniteja123
Contributor Author

Oh, my apologies. Will remember it next time.

On 22 Feb 2016 8:01 pm, "Gael Varoquaux" [email protected] wrote:

>> Thanks Gael. I have done that. It seems that is not sent in the
>> notification to the email :)
>
> The trick is to do it before you post your comment: the email is sent
> with the current title, hence if you change it before a comment,
> everybody sees the '[MRG]' in their mailbox.

@maniteja123
Contributor Author

Hi everyone, a small ping again. Hope you don't mind. Thanks.

@MechCoder
Member

will review tomorrow

@maniteja123
Contributor Author

@MechCoder I suppose I have done all the changes except the last comment (sorry, I still didn't get its meaning). Please have a look whenever possible. Thanks.

@maniteja123
Contributor Author

@MechCoder could you also please explain your comment when you have time? Thanks!

@maniteja123
Contributor Author

maniteja123 commented Sep 15, 2016

1. It was failing mostly in cases of either 1 feature or 1 sample. I couldn't come up with a better solution than specifically checking for dimensions.
2. In the sparse case, the `SpectralBiclustering` class was failing due to `nan` and `inf` in the data, and there is a finite check. I skipped it for now, but I suppose sparse should be handled.

I have added the tests for `median_absolute_error` and `weighted_percentile`
Please do have a look and let me know if something needs to be addressed. Thanks.

@amueller
Member

@maniteja123 was your last comment really meant for this PR? I'm a bit confused.

@maniteja123
Contributor Author

Hi @amueller, I am really sorry for the confusion. I commented here because I mistook this for the bicluster PR I was working on simultaneously. Please ignore it.

@lorentzenchr
Member

@maniteja123 Do you still intend to finish this PR?
PR #3779 added sample_weight to sklearn.utils.stats._weighted_percentile. Using that instead of the sklearn.utils.extmath._weighted_percentile implemented in this PR might make this PR smaller and more likely to be merged.

@lorentzenchr
Member

@maniteja123 @glemaitre This can be closed as #17225 is merged.

@glemaitre
Member

Good catch
