Balanced accuracy doc - 2 #10040
Conversation
doc/modules/model_evaluation.rst (outdated)
@@ -473,6 +473,12 @@ given binary ``y_true`` and ``y_pred``:
  for C-class classification problem).
* Macro-average recall as described in [Mosley2013]_ and [Kelleher2015]_: the recall
  for each class is computed independently and the average is taken over all classes.
* Class balance accuracy as described in [Mosley2013]_: for each class, the number
balance -> balanced
doc/modules/model_evaluation.rst (outdated)
@@ -473,6 +473,12 @@ given binary ``y_true`` and ``y_pred``:
  for C-class classification problem).
* Macro-average recall as described in [Mosley2013]_ and [Kelleher2015]_: the recall
  for each class is computed independently and the average is taken over all classes.
* Class balance accuracy as described in [Mosley2013]_: for each class, the number
  of correctly predicted samples (diagonal element in the confusion matrix) is normalized
  by the maximum value of either the total number of observations predicted to the class
i.e. it's min(precision, recall)?
yes, that's how it is defined in the paper.
I think you should just state that here... Precision and recall can be assumed knowledge here.
Or you can reference below.
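For concreteness, a minimal sketch of that per-class normalization (assuming scikit-learn's ``confusion_matrix`` convention of rows = true classes, columns = predictions; the helper name and the final averaging over classes are assumptions on my part):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def class_balance_accuracy(y_true, y_pred):
    # Confusion matrix C: C[i, j] = number of samples of true class i predicted as class j.
    C = confusion_matrix(y_true, y_pred)
    # Diagonal / max(column sum, row sum) == min(precision, recall) for each class.
    per_class = np.diag(C) / np.maximum(C.sum(axis=0), C.sum(axis=1))
    # Averaged over classes, as in the other multiclass definitions discussed here.
    return per_class.mean()
```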
doc/modules/model_evaluation.rst (outdated)
  by the maximum value of either the total number of observations predicted to the class
  (sum of the class' column in the confusion matrix) or the actual number
  of observations in that class (sum of the class' row in the confusion matrix).
  In other words, we take the minimum between the precision and the recall for each class.
Do you think stating the definition from confusion matrix marginals here but not in the macro recall case is helpful?
You're right, either I should put it in both definitions or in none. It does not make sense to put it in just one of them. I'll delete it then.
Is that ok for you?
Yes, that's my preference.
Do you know what the benefit of accounting for false positives (and hence precision) is? This seems to make it a different metric that no longer reduces to balanced accuracy in the binary case.
I also generally think that having a denominator which depends on the predictions is against the whole philosophy of ROC and BalAcc-style metrics.
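A toy check of that concern, using hypothetical binary labels (macro-average recall here plays the role of balanced accuracy):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Balanced accuracy as macro-average recall: (2/4 + 2/2) / 2 = 0.75
macro_recall = recall_score(y_true, y_pred, average='macro')

# Averaging min(precision, recall) per class instead:
# class 0 -> min(1.0, 0.5) = 0.5, class 1 -> min(0.5, 1.0) = 0.5, mean = 0.5
prec = precision_score(y_true, y_pred, average=None)
rec = recall_score(y_true, y_pred, average=None)
min_pr = np.minimum(prec, rec).mean()

print(macro_recall, min_pr)  # 0.75 vs 0.5 -- the two definitions disagree
```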
More scrupulous this time!!
doc/modules/model_evaluation.rst (outdated)
@@ -473,6 +473,13 @@ given binary ``y_true`` and ``y_pred``:
  for C-class classification problem).
I'm now quite sorry for merging the previous PR, because I think this first "Normalized class-wise accuracy average" is very verbose:
- "each sample is assigned the class with maximum prediction value" is not a property of the metric, and serves only as a red herring here. It only states that the metric operates over class predictions rather than scores.
- Then are we calculating rand accuracy (as in ``accuracy_score``) for each class, or balanced accuracy?
- Then is "normalized by the expected value of balanced accuracy for random predictions" something beyond "averaging the individual accuracies over all classes"? It would seem to me that "averaging the individual accuracies" would already multiply by 1/C for C-class classification.
In some way, all of the metrics here are based on a binarization of the problem for each class. (Although in this case the binarization is more subtle, because it takes account of the overall sample size, whereas the other two methods only care about the number of true positives, false positives and false negatives [but not true negatives] for each class.)
Ah, looking at Guyon2015, I now see that "normalize" is not a multiplicative factor, but a chance correction. I would avoid "normalize" as this usually means scaling to ensure that a value falls in an expected range.
Looking at their code, I now see that their per-class accuracy is recall, not rand or balanced accuracy.
All in all, a lot of this description is related to the input format setup in that challenge. All they are doing is computing average recall across classes, so this should be merged into the definition below. They then correct for chance, I think, to ensure that "at chance" scores 0 regardless of the number of classes in the task. This can be noted with a citation to Guyon2015.
Again, sorry I wasn't reviewing more critically in the first place.
maybe "adjusted" instead of normalized would be better, but in either case, describing what is meant would be good.
Do you think giving formulas would be helpful? You just need to define recall and precision for each class and then writing down the three different formulas is pretty straight forward.
So the code you quote is not what they write in the paper http://www.causality.inf.ethz.ch/AutoML/automl_ijcnn15.pdf, which I think was mentioned in one of the other threads. I'll send an email to the authors, given that this has been discussed for a while now....
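For what it's worth, a sketch of how those three formulas could be written (assumptions: K classes, confusion matrix C with rows as true classes; the chance-adjusted form follows the Guyon2015-style correction discussed above):

```latex
% Per-class recall and precision from the confusion matrix:
%   R_c = C_{cc} / \sum_j C_{cj}   (row sum: true members of class c)
%   P_c = C_{cc} / \sum_i C_{ic}   (column sum: predictions of class c)
\text{macro-average recall:}\quad
  \bar{R} = \frac{1}{K} \sum_{c=1}^{K} R_c
\qquad
\text{class balance accuracy:}\quad
  \frac{1}{K} \sum_{c=1}^{K} \min(P_c, R_c)
\qquad
\text{chance-adjusted macro recall:}\quad
  \frac{\bar{R} - 1/K}{1 - 1/K}
```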
So this summary has led me to conclude that maybe we do have sufficient consensus to use macro-averaged recall, with an optional correction for chance so that scores ranging [0 .. 1] represent [random .. perfect]! Unless there's a really good reason to take account of precision and not name the metric something else.
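A minimal sketch of that optional chance correction, assuming macro-average recall as the base score (the helper name is invented here):

```python
import numpy as np
from sklearn.metrics import recall_score

def chance_adjusted_macro_recall(y_true, y_pred):
    # Macro-average recall: mean of the per-class recalls.
    score = recall_score(y_true, y_pred, average='macro')
    # Expected macro recall for random (or constant) predictions is 1 / n_classes,
    # so rescale such that 0 means "at chance" and 1 means perfect.
    chance = 1.0 / len(np.unique(y_true))
    return (score - chance) / (1.0 - chance)
```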
@jnothman I think one issue with that is that the AutoML paper has a mismatch between implementation and paper, and @rhiever, who was the main proponent of introducing the metric, uses a different definition ;) I just didn't really see a big benefit in adding an alias to ``recall_score``.
Can you maybe add https://link.springer.com/article/10.1007/s12065-015-0128-8 as the reference for averaging accuracy? It only has 15 citations, though...
None of the other papers that define balanced accuracy as macro-average recall seem to have a lot of citations either :-/ Or is there one that I'm overlooking? So does that mean no support for either version?
Metrics are rarely cited, for good or bad. Sometimes an implementation will be cited, but more often it is just imported or copied.
The support for macro recall is mostly intuitive. It naturally extends the binary case and has similar properties: constant score under random predictions and a denominator determined only by the ground truth distribution.
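A quick empirical check of that property (simulated labels; the seed, sizes and class proportions here are arbitrary):

```python
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
# Imbalanced ground truth over 3 classes, predictions drawn uniformly at random.
y_true = rng.choice([0, 1, 2], size=10000, p=[0.7, 0.2, 0.1])
y_pred = rng.choice([0, 1, 2], size=10000)

# Macro-average recall stays near 1/3 regardless of the class imbalance,
# because its denominators depend only on the ground-truth counts.
print(recall_score(y_true, y_pred, average='macro'))
```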
@jnothman
So to me, this method does not give the averaged accuracy but a conservative estimate of it (always smaller).
Yes, I get that it's more conservative, but that doesn't mean it's well founded. In multiclass, a low precision in one class is reflected in a low recall in another. Measuring precision means you care about which class the system predicted in error. But it also breaks nice properties like a fixed score under random predictions.
What is left to do with this PR?
I'm good with this. I'd consider implementing the macro recall in a separate PR. I think it is the only definition that is easily justified, whether adjusted or not.
But I'd appreciate @amueller double-checking before merge...
I would like to have a brief remark on how to compute macro-average recall, just mentioning ``recall_score(y_true, y_pred, average='macro')``.
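To illustrate, a small usage sketch (toy labels made up here):

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Macro-average recall: per-class recalls (1/2, 1, 1/2) averaged -> 2/3
print(recall_score(y_true, y_pred, average='macro'))
```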
But other than that this seems like a nice summary imho.
Maybe add one sentence that Mosley2013 discusses all of these with pros and cons in relative detail? @jnothman what did you think of that paper?
But happy to merge with the addition of that remark.
That "paper" is a PhD thesis... can't say I've read it!!!
But I'll try look for highlights soon...
|
> But I'd appreciate @amueller double checking before merge...

thanks :)
* Add references for multiclass balanced-accuracy
* Add precision not implemented
* Add another reference
* Adjust note
* Add succinct definitions
* Add macro-average recall implementation
* Add class balance accuracy reference and definition
* Fix typo
* Add precision/recall comparison
* Fix typo
* Make descriptions less verbose and merge definitions
* Add reference for averaging accuracy
* Move macro-average recall example to corresponding section
Some more notes on this:
I think we should just bite the bullet and implement a macro-averaged recall approach.
Reference Issues/PRs
In addition to #8066
What does this implement/fix? Explain your changes.
I added some references from the literature for the multiclass balanced accuracy. The threads #6747 and #8066 cited some papers and their different implementations of the metric. There was no real consensus about which definition to use, so the user might want to implement the one of their choice.