
[MRG] ENH: Add sample_weight to median_absolute_error #6217


Closed

Conversation

maniteja123
Contributor

@maniteja123 commented Jan 23, 2016

Enable support for sample_weight in median_absolute_error as suggested in #3450. Also make _weighted_percentile more robust, as discussed in #6189. The idea of using the midpoint of the weights was originally given here. Please let me know if anything else needs to be done and whether new tests are needed. Thanks.



@MechCoder changed the title from "ENH: Add sample_weight to median_absolute_error" to "[WIP] ENH: Add sample_weight to median_absolute_error" on Jan 24, 2016
@MechCoder
Member

I would say let us make the _weighted_percentile issue separate from this one, because other modules depend on it.
As for the weighted median, let us follow the approach described here (https://en.wikipedia.org/wiki/Percentile#Definition_of_the_Weighted_Percentile_method), which is more intelligent.

The idea is something like this; for instance, take a = np.array([1, 4, 8, 9, 10]) and w = np.array([1.2, 4.5, 7.8, 2.3, 4.5]):

  1. Sort a.
  2. Find the cumulative sum of w, which is array([ 1.2, 5.7, 13.5, 15.8, 20.3]).
  3. Compute weighted_percentages[i] = (cum_sum[i] - w[i] / 2) / cum_sum[-1].
  4. Use np.interp to find what you need, using a and weighted_percentages (see the sketch below).
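
A minimal sketch of those four steps with the example values above (the variable names are just illustrative):

import numpy as np

a = np.array([1, 4, 8, 9, 10])           # step 1: already sorted
w = np.array([1.2, 4.5, 7.8, 2.3, 4.5])

cum_w = np.cumsum(w)                     # step 2: [1.2, 5.7, 13.5, 15.8, 20.3]
pct = (cum_w - w / 2.0) / cum_w[-1]      # step 3: midpoint percentile of each weight
median = np.interp(0.5, pct, a)          # step 4: weighted median, ~8.11 here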

@maniteja123
Contributor Author

Thanks for the detailed explanation. But the link mentioned above had this thread, which discussed this implementation. Could you look into that? It seems to be working fine.

@maniteja123
Contributor Author

And is the formula (cum_sum[i] - w[i]/2)? It seemed so from the Wikipedia link. Please correct me if I am wrong.

@MechCoder
Member

ah right, I edited it

@MechCoder
Member

The reason I would keep both implementations independent is that there is some stuff in gradient boosting which depends on _weighted_percentile.

@maniteja123
Contributor Author

Sorry, I didn't respond to that comment. I realised it when I saw the failing tests.

# Find index of median prediction for each sample
weight_cdf = sample_weight[sorted_idx].cumsum()
percentile_idx = np.searchsorted(
    weight_cdf, (percentile / 100.) * weight_cdf[-1])
if weight_cdf[percentile_idx] == midpoint:
    return np.mean(array[sorted_idx[percentile_idx]:sorted_idx[percentile_idx + 1] + 1])
Member

In any case, this approach might work only if the percentile is 50, so we are better off refactoring it.
This is the same thing as is done in the Wikipedia article, but I think using np.interp directly will make the code cleaner and easier to follow.

Contributor Author

Oh okay, got the point. Will do the changes and ping you back. Thanks!

@maniteja123
Contributor Author

Sorry, but I am a bit confused about the implementation. This is the code. Please have a look at it and let me know what needs modifying. Hope I didn't do it completely wrong.

import numpy as np

def _weighted_percentile(array, sample_weight, percentile=50):
    sorted_idx = np.argsort(array)
    sample_weight = np.asarray(sample_weight)
    # cumulative sum of the weights, taken in sorted order
    weight_cdf = sample_weight[sorted_idx].cumsum()
    # midpoint percentile of each weight; the weights must be in sorted order too
    weighted_percentile = (weight_cdf - sample_weight[sorted_idx] / 2.0) / weight_cdf[-1]
    sorted_array = np.sort(array)
    weighted_median = np.interp(percentile / 100., weighted_percentile, sorted_array)
    return weighted_median
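
(A side note: np.interp expects its xp argument to be increasing, and the midpoint percentiles satisfy this by construction, since consecutive differences equal (w[i] + w[i+1]) / (2 * sum(w)) for the sorted weights.)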

This is the output this produces for the simple test cases.

>>> _weighted_percentile([0,1],[1,1])
0.5
>>> _weighted_percentile([0,1],[1,2])
0.6666666666666667
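
A quick hand check of the second case: the cumulative weights are [1, 3], so the midpoint percentiles are (1 - 0.5)/3 = 1/6 and (3 - 1)/3 = 2/3, and interpolating 0.5 between them gives 0 + (0.5 - 1/6) / (2/3 - 1/6) = 2/3, which matches the output.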

Thanks, and sorry for repeatedly disturbing you.

@GaelVaroquaux
Member

In terms of the definition of a weighted median, it seems to me that the right definition is the one that starts from the median defined as a Fréchet mean of the l1 distance, and adds weights to that definition.

Based on https://en.wikipedia.org/wiki/Weighted_median#Properties it seems that the definition that you suggest using has the right properties.

@MechCoder
Member

Yes, the implementation appears correct to me as well. Do make the changes.

@maniteja123
Contributor Author

Thanks for the comments. Will do the changes and push it now.

@maniteja123
Contributor Author

Travis is failing. The failing test is test_scorer_sample_weight: it gets identical values with and without sample_weight.

@MechCoder
Member

Hmm. Can you try printing out the values that get passed into np.interp?

It might be possible that the median_absolute_error remains the same because even zeroing out the initial few sample weights does not change the median value of the cumulative sum of the sample weights by much.
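
A toy illustration of that point, using the _weighted_percentile sketched earlier in this thread (the numbers are hypothetical, not from the failing test):

errors = [1, 2, 3, 4, 5]

# uniform weights: the weighted median is 3
_weighted_percentile(errors, [1, 1, 1, 1, 1])   # -> 3.0

# zeroing the two extreme weights leaves the weighted median unchanged,
# so the weighted and unweighted scores coincide
_weighted_percentile(errors, [0, 1, 1, 1, 0])   # -> 3.0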

@maniteja123
Contributor Author

@MechCoder Sorry for the delay. I'm not entirely sure this is what was expected, but the printed values are the sample_weight and the weighted_percentile passed to np.interp, which seem to be almost the same. The two cases are for binary and multiclass problems. And the values for sample_weight as None and as np.ones(len(y_true)) evaluate to be equal.

I: Seeding RNGs with 1285603334
...................None
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 0.01  0.03  0.05  0.07  0.09  0.11  0.13  0.15  0.17  0.19  0.21  0.23
  0.25  0.27  0.29  0.31  0.33  0.35  0.37  0.39  0.41  0.43  0.45  0.47
  0.49  0.51  0.53  0.55  0.57  0.59  0.61  0.63  0.65  0.67  0.69  0.71
  0.73  0.75  0.77  0.79  0.81  0.83  0.85  0.87  0.89  0.91  0.93  0.95
  0.97  0.99]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[ 0.0122449   0.02653061  0.04489796  0.04897959  0.05714286  0.10204082
  0.13061224  0.15714286  0.15714286  0.17142857  0.20612245  0.21428571
  0.24693878  0.26530612  0.27959184  0.28163265  0.28979592  0.30408163
  0.35102041  0.37959184  0.39795918  0.43469388  0.47346939  0.5122449
  0.53877551  0.55102041  0.59387755  0.59387755  0.60816327  0.61428571
  0.63265306  0.64489796  0.66530612  0.67755102  0.68163265  0.70408163
  0.71836735  0.73673469  0.74489796  0.74693878  0.76734694  0.80612245
  0.80408163  0.83877551  0.87142857  0.8877551   0.91632653  0.93877551
  0.95918367  0.98571429]
F...................................None
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 0.01  0.03  0.05  0.07  0.09  0.11  0.13  0.15  0.17  0.19  0.21  0.23
  0.25  0.27  0.29  0.31  0.33  0.35  0.37  0.39  0.41  0.43  0.45  0.47
  0.49  0.51  0.53  0.55  0.57  0.59  0.61  0.63  0.65  0.67  0.69  0.71
  0.73  0.75  0.77  0.79  0.81  0.83  0.85  0.87  0.89  0.91  0.93  0.95
  0.97  0.99]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[6 1 4 4 8 4 6 3 5 8 7 9 9 2 7 8 8 9 2 6 9 5 4 1 4 6 1 3 4 9 2 4 4 4 8 1 2
 1 5 8 4 3 8 3 1 1 5 6 6 7]
[ 0.02040816  0.05510204  0.06530612  0.08571429  0.08163265  0.10612245
  0.10612245  0.12857143  0.15714286  0.15510204  0.17346939  0.20612245
  0.23061224  0.26938776  0.29591837  0.32653061  0.33469388  0.36122449
  0.40816327  0.42040816  0.44693878  0.49183673  0.50612245  0.53265306
  0.53877551  0.55102041  0.57755102  0.58979592  0.6         0.60612245
  0.64489796  0.64489796  0.66122449  0.68979592  0.68571429  0.70816327
  0.73877551  0.74897959  0.77755102  0.79591837  0.82040816  0.82653061
  0.82857143  0.84693878  0.88367347  0.90816327  0.92857143  0.96326531
  0.98367347  0.98571429]
F..................................................................
======================================================================
FAIL: sklearn.metrics.tests.test_common.test_sample_weight_invariance('median_absolute_error', <function median_absolute_error at 0xb2a2fd84>, array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/utils/testing.py", line 319, in wrapper
    return fn(*args, **kwargs)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/metrics/tests/test_common.py", line 924, in check_sample_weight_invariance
    "equal (%f) for %s" % (weighted_score, name))
AssertionError: Unweighted and weighted scores are unexpectedly equal (0.000000) for median_absolute_error
>>  raise self.failureException('Unweighted and weighted scores are unexpectedly equal (0.000000) for median_absolute_error')


======================================================================
FAIL: sklearn.metrics.tests.test_common.test_sample_weight_invariance('median_absolute_error', <function median_absolute_error at 0xb2a2fd84>, array([4, 0, 3, 3, 3, 1, 3, 2, 4, 0, 0, 4, 2, 1, 0, 1, 1, 0, 1, 4, 3, 0, 3,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/utils/testing.py", line 319, in wrapper
    return fn(*args, **kwargs)
  File "/home/maniteja/FOSS/scikit-learn/sklearn/metrics/tests/test_common.py", line 924, in check_sample_weight_invariance
    "equal (%f) for %s" % (weighted_score, name))
AssertionError: Unweighted and weighted scores are unexpectedly equal (1.000000) for median_absolute_error
>>  raise self.failureException('Unweighted and weighted scores are unexpectedly equal (1.000000) for median_absolute_error')


----------------------------------------------------------------------
Ran 122 tests in 7.671s

FAILED (failures=2)

Please let me know if something else was expected. Thanks.

@MechCoder
Member

In general, the two invariance rules exercised by the test that fails on Travis need not hold true for median absolute error. You can play around with toy data and verify that. I would skip the tests using an `if name == "median_absolute_error"` check and add a comment. It would be great to add a test for this separately.
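
A sketch of what that skip could look like (the placement inside check_sample_weight_invariance is assumed, not quoted from this PR):

# hypothetical sketch, inside check_sample_weight_invariance in test_common.py,
# placed before the invariance assertions
if name == "median_absolute_error":
    # the invariance rules need not hold for medians: reweighting samples
    # can leave the weighted median, and hence the score, unchanged
    return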

sample_weight = np.asarray(sample_weight)
weight_cdf = sample_weight[sorted_idx].cumsum()
weighted_percentile = (weight_cdf - sample_weight[sorted_idx] / 2.0) / weight_cdf[-1]
sorted_array = np.sort(array)
Member

nitpick: array[sorted_idx] (reuse sorted_idx instead of sorting again)

@maniteja123 force-pushed the median_absolute_error branch from a12e7ad to 8d8ac0c on February 9, 2016 14:06
@maniteja123
Contributor Author

Hi, I have done the changes by skipping the test_sample_weight_invariance check for median_absolute_error and added a separate test for it in test_common.py in the metrics module. Please let me know if I am doing anything wrong here. Thanks.

@maniteja123
Contributor Author

Hi, please review this whenever possible and let me know if anything else needs to be done. Thanks :)

@maniteja123 changed the title from "[WIP] ENH: Add sample_weight to median_absolute_error" to "[MRG] ENH: Add sample_weight to median_absolute_error" on Feb 22, 2016
@GaelVaroquaux
Member

GaelVaroquaux commented Feb 22, 2016 via email

@maniteja123
Contributor Author

Thanks Gael. I have done that. It seems the title change is not sent in the email notification :)

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 22, 2016 via email

@maniteja123
Contributor Author

Oh, my apologies. Will remember it next time.

On 22 Feb 2016 8:01 pm, "Gael Varoquaux" [email protected] wrote:

>> Thanks Gael. I have done that. It seems that is not sent in the
>> notification to the email :)
>
> The trick is to do it before you post your comment: the email is sent
> with the current title, hence if you change it before a comment,
> everybody sees the '[MRG]' in their mailbox.

@maniteja123
Contributor Author

Hi everyone, a small ping again. Hope you don't mind. Thanks.

@MechCoder
Member

will review tomorrow

@maniteja123
Contributor Author

@MechCoder I suppose I have done all the changes except the last comment (sorry, I still didn't get its meaning). Please have a look whenever possible. Thanks.

@maniteja123
Contributor Author

@MechCoder could you also please explain your comment when you have time? Thanks!

@maniteja123
Contributor Author

maniteja123 commented Sep 15, 2016

1. It was failing mostly in cases of either 1 feature or 1 sample. I couldn't come up with a better solution than specifically checking for dimensions.
2. In the sparse case, the `SpectralBiclustering` class was failing due to `nan` and `inf` in the data, and there is a finite check. I skipped it for now, but I suppose sparse should be handled.

I have added the tests for `median_absolute_error` and `weighted_percentile`
Please do have a look and let me know if something needs to be addressed. Thanks.

@amueller
Member

@maniteja123 was your last comment really meant for this PR? I'm a bit confused.

@maniteja123
Contributor Author

Hi @amueller, I am really sorry for the confusion. I commented here because I mistook this for the bicluster PR I was working on simultaneously. Please ignore it.

@lorentzenchr
Member

@maniteja123 Do you still intend to finish this PR?
PR #3779 added sample_weight to sklearn.utils.stats._weighted_percentile. Using that instead of the sklearn.utils.extmath._weighted_percentile implemented in this PR might make this PR smaller and more likely to be merged.

@lorentzenchr
Member

@maniteja123 @glemaitre This can be closed as #17225 is merged.

@glemaitre
Member

Good catch
