multiclass jaccard_similarity_score should not be equal to accuracy_score #7332

untom · 2016-09-02T13:45:26Z

The documentation for sklearn.metrics.jaccard_similarity_score currently (version 0.17.1) states that:

In binary and multiclass classification, this function is equivalent to the accuracy_score. It differs in the multilabel classification problem.

However, I do not think that this is the right thing to do for multiclass-problems. As far as I can tell, within the machine learning community a more common usage of the Jaccard index for multi-class is to
use the mean Jaccard-Index calculated for each class indivually. i.e., first calculate the jaccard index for class 0, class 1 and class 2, and then average them. This is what is very commonly done in the image segmentation community (where this is referred to as the "mean Intersection over Union" score (see e.g.[1]), but as far as I can tell by skimming it, this is also what the original publication of the jaccard index did in multiclass scenarios [2]. Note that this is NOT the same as the accuracy_score. Consider this example:

y_true = [0, 1, 2]
y_pred = [0, 0, 0]

The accuracy is clearly 1/3, and this is also what the jaccard_score in sklearn currently returns. The class-specific jaccard_scores would be:

J0 = 1 /3
J1 = 0 / 1
J2 = 0 / 1

Thus IMO the jaccard_score should be (J0 + J1 + J2) / 3 = 1/9 in this case

[1] e.g. Long et al, "The Pascal Visual Object Classes Challenge – a Retrospective", https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf , but see any other paper on Semantic Segmentation

[2] Jaccard, "THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE", http://onlinelibrary.wiley.com/doi/10.1111/j.1469-8137.1912.tb05611.x/abstract (Note that I have only skimmed the paper, but it seems to me that the author always reports the average of the "efficient of community" calculated over pairs whenever the author compares more than just 2 groups)

The text was updated successfully, but these errors were encountered:

amueller · 2016-09-14T18:29:52Z

@untom The Pascal VOC is multi-class multi-label, right? Pixels are not evaluated individually, but the whole image is. And then people usually look at per-class measures.

But you're right, it looks like the original definition is different from ours, see https://encrypted.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=0ahUKEwjT1-XGvY_PAhXGeD4KHSFLBUAQFggwMAI&url=http%3A%2F%2Fwww.informatica.si%2Findex.php%2Finformatica%2Farticle%2Fdownload%2F753%2F608&usg=AFQjCNFd6KF03h8j6Yfk_6hQbx6oeBZP8g&sig2=1OOs_6BFsyaPcxGvPbEtLQ
Where is it in the original paper?

hccheng · 2016-10-27T18:47:29Z

According to the evaluation code (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCdevkit_18-May-2011.tar), the multi-class VOC score is the average of per-class Jaccard scores.
The relevant part is from line 74 to line 90 in VOCevalseg.m

shiba24 · 2017-02-21T02:12:55Z

@untom hi, i think you are right.

Currently jaccard_similarity_score just counts the samples in intersection of pred and true, which is not correct for the definition of Jaccard similarity coefficient. We need to calculate union.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/classification.py#L174

I am not sure what it looks like for multi label.

jnothman · 2017-02-21T03:13:18Z

I agree that this seems to be strange even for the binary case. I would have thought Jaccard is an alternative to precision or recall or F1 (= Dice coefficient) in evaluating performance, in the binary case, on a single positive class, i.e. "true positives / (true positives + false positives + false negatives)". In particular, the binary implementation in our case does not seem to equate to the multilabel implementation run over a single class.

Regarding @untom's initial contention that multiclass implementation is incorrect, I agree that the multiclass implementation is useless. I don't think that the macro averaging he suggests is the only way to go about it, and as with P/R/F, micro-averaging excluding a majority negative class is still meaningful; weighted macro-average may also be feasible.

So yes, multiple strange things in our jaccard implementation IMO, and at a glance I don't see how the reference given in #1795 tells us about the multiclass case.

Labelling this a bug.

ghost · 2017-10-20T14:22:32Z

Any update on this issue? I believe this issue should be assigned as a high priority bug. Incorrect evaluation metrics could mislead many people in interpreting and reporting their experimental results. I may consider submitting a patch if no one else is going to work on it.

jnothman · 2017-10-21T12:05:31Z

Yes, probably. PR welcome

gxyd · 2017-10-26T06:34:11Z

@jacobdang polite ping. Are you still working? It is very much fine if you are still working but if you are not, then this is looks like a really good issue I would like to work on.

ghost · 2017-10-26T07:05:02Z

@gxyd I am catching a deadline so may not be able to work on it immediately. So if you could help that will be highly appreciated. Thanks. 👍

Fixes scikit-learn#7332

dimimal · 2018-03-06T11:35:48Z

Any progress about this issue yet?

jnothman · 2018-03-06T11:47:49Z

Waiting for a second review at #10083

jnothman · 2018-03-06T11:48:45Z

Unfortunately the code there is quite complicated, and perhaps could be simplified if something like #10628 existed, but I haven't had time to make that work let alone bring it up to mergeable standard.

agamemnonc · 2018-07-20T14:41:02Z

So, is the Jaccard similarity score a valid metric to use for multilabel classification problems?

jnothman · 2018-07-22T06:20:05Z

Yes, the current implementation should be okay for multilabel if you want the average set similarity across samples.

dimimal · 2018-07-26T10:39:29Z

@agamemnonc Only if you want the average score across the samples. It is totally different than intersection over union.

jnothman · 2018-07-26T10:40:40Z

Yes, it is different to the micro-average... Again, we have a pull request to fix this, but limited reviewer time :\

TSchattschneider · 2019-02-04T13:55:13Z

This problem in scikit-learn has just recently caused some big headache for me in my research.
I work with segmentation and was surprised to see how scikit-learn interprets the jaccard index metric.

dimimal · 2019-02-04T15:01:55Z

@TSchattschneider I feel you. I remember how frustrated I was. I had the same problem. This bug should be flagged in the documentation, unless is fixed.

jnothman · 2019-02-04T21:00:44Z

The feature was obviously proposed by someone interested in the multilabel case and should have been reviewed with more diligence. Sorry the fix has not been prioritised. #10083 is ready for review, but core developer review time is often hard to get.

See this issue: scikit-learn#7332

jnothman added the Bug label Feb 21, 2017

jnothman added Easy Well-defined and straightforward way to resolve help wanted labels Oct 21, 2017

gxyd added a commit to gxyd/scikit-learn that referenced this issue Nov 7, 2017

multiclass jaccard similarity not equal to accurary_score

64e30d6

Fixes scikit-learn#7332

gxyd mentioned this issue Nov 7, 2017

[MRG] average parameter for jaccard_similarity_score #10083

Closed

agamemnonc mentioned this issue Aug 3, 2018

Classification metrics incosinstencies #11743

Closed

jnothman mentioned this issue Feb 4, 2019

FIX binary/multiclass jaccard_similarity_score and extend to handle averaging #13092

Closed

jnothman mentioned this issue Feb 13, 2019

ENH/FIX Replace jaccard_similarity_score by sane jaccard_score #13151

Merged

qinhanmin2014 closed this as completed in #13151 Mar 13, 2019

hafnerfe added a commit to hafnerfe/scikit-learn that referenced this issue Feb 4, 2021

Fix _classification.py documentation

303faf4

See this issue: scikit-learn#7332

hafnerfe mentioned this issue Feb 4, 2021

DOC fix note in accuracy docstring regading relation with jaccard score #19347

Merged

Uh oh!

multiclass jaccard_similarity_score should not be equal to accuracy_score #7332

multiclass jaccard_similarity_score should not be equal to accuracy_score #7332

Comments

untom commented Sep 2, 2016

amueller commented Sep 14, 2016

Uh oh!

hccheng commented Oct 27, 2016

Uh oh!

shiba24 commented Feb 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnothman commented Feb 21, 2017

Uh oh!

ghost commented Oct 20, 2017

Uh oh!

jnothman commented Oct 21, 2017

Uh oh!

gxyd commented Oct 26, 2017

Uh oh!

ghost commented Oct 26, 2017

Uh oh!

dimimal commented Mar 6, 2018

Uh oh!

jnothman commented Mar 6, 2018

Uh oh!

jnothman commented Mar 6, 2018

Uh oh!

agamemnonc commented Jul 20, 2018

Uh oh!

jnothman commented Jul 22, 2018 via email

Uh oh!

dimimal commented Jul 26, 2018

Uh oh!

jnothman commented Jul 26, 2018 via email

Uh oh!

TSchattschneider commented Feb 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dimimal commented Feb 4, 2019

Uh oh!

jnothman commented Feb 4, 2019 via email

Uh oh!

shiba24 commented Feb 21, 2017 •

edited

Loading

TSchattschneider commented Feb 4, 2019 •

edited

Loading