
Balanced accuracy doc - 2 #10040


Merged: 16 commits into scikit-learn:master on Nov 15, 2017

Conversation

@maskani-moh (Contributor):

Reference Issues/PRs

In addition to #8066

What does this implement/fix? Explain your changes.

I added some references from the literature for multiclass balanced accuracy. The threads #6747 and #8066 cited several papers with differing implementations of the metric. There was no real consensus on which definition to use, so users may want to implement the one of their choice.

@@ -473,6 +473,12 @@ given binary ``y_true`` and ``y_pred``:
for C-class classification problem).
* Macro-average recall as described in [Mosley2013]_ and [Kelleher2015]_: the recall
for each class is computed independently and the average is taken over all classes.
* Class balance accuracy as described in [Mosley2013]_: for each class, the number
Member:

balance -> balanced

@@ -473,6 +473,12 @@ given binary ``y_true`` and ``y_pred``:
for C-class classification problem).
* Macro-average recall as described in [Mosley2013]_ and [Kelleher2015]_: the recall
for each class is computed independently and the average is taken over all classes.
* Class balance accuracy as described in [Mosley2013]_: for each class, the number
of correctly predicted samples (diagonal element in the confusion matrix) is normalized
by the maximum value of either the total number of observations predicted to the class
Member:

i.e. it's min(precision, recall)?

@maskani-moh (Contributor, Author):

yes, that's how it is defined in the paper.
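
To make that reading concrete, here is a minimal sketch (mine, not code from the paper or from this PR; the function name is illustrative) of class balance accuracy computed from the confusion matrix, under the min(precision, recall) interpretation discussed above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def class_balance_accuracy(y_true, y_pred):
    # Rows of the confusion matrix are true classes, columns are predictions.
    C = confusion_matrix(y_true, y_pred)
    # Per class: diagonal entry divided by max(row sum, column sum),
    # which is the same as min(precision_c, recall_c).
    per_class = np.diag(C) / np.maximum(C.sum(axis=1), C.sum(axis=0))
    return per_class.mean()
```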

Member:

I think you should just state that here... Precision and recall can be assumed knowledge here

Member:

Or you can reference below.

by the maximum value of either the total number of observations predicted to the class
(sum of the class' column in the confusion matrix) or the actual number
of observations in that class (sum of the class' row in the confusion matrix).
In other words, we take the minimum between the precision and the recall for each class.
Member:

Do you think stating the definition from confusion matrix marginals here but not in the macro recall case is helpful?

@maskani-moh (Contributor, Author), Oct 30, 2017:

You're right: either I should put it in both definitions or in none; it doesn't make sense to put it in just one of them. I'll delete it then 👍
Is that ok with you?

@jnothman (Member), Oct 30, 2017:

Yes, that's my preference.

Do you know what the benefit of accounting for false positives (and hence precision) is? This seems to make it a different metric that no longer reduces to balanced accuracy in the binary case..?

Member:

I also generally think that having a denominator which depends on the predictions is against the whole philosophy of ROC and BalAcc-style metrics.

@jnothman (Member) left a comment:

More scrupulous this time!!

@@ -473,6 +473,13 @@ given binary ``y_true`` and ``y_pred``:
for C-class classification problem).
Member:

I'm now quite sorry for merging the previous PR, because I think this first "Normalized class-wise accuracy average" is very verbose:

  • "each sample is assigned the class with maximum prediction value" is not a property of the metric, and serves only as a red herring here. It only states that the metric operates over class predictions rather than scores.
  • Then are we calculating rand accuracy (as in accuracy_score) for each class, or balanced accuracy?
  • Then is "normalized by the expected value of balanced accuracy for random predictions" something beyond "averaging the individual accuracies over all classes"? It would seem to me that "averaging the individual accuracies" would already multiply by 1/C for C-class classification.

Member:

In some way, all of the metrics here are based on a binarization of the problem for each class. (Although in this case the binarization is more subtle, because it takes into account the overall sample size, whereas the other two methods only care about the number of true positives, false positives and false negatives [but not true negatives] for each class.)

Member:

Ah, looking at Guyon2015, I now see that "normalize" is not a multiplicative factor, but a chance correction. I would avoid "normalize" as this usually means scaling to ensure that a value falls in an expected range.

Looking at their code, I now see that their per-class accuracy is recall, not rand or balanced accuracy.

All in all lots of this description is related to the input format setup in that challenge. All they are doing is computing average recall across classes, so this should be merged into the below definition. They then correct for chance, I think, to ensure that "at chance" scores 0 regardless of the number of classes in the task. This can be noted with citation to Guyon2015.

Again sorry I wasn't reviewing more critically in the first place.
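
As a rough sketch of that reading (an illustration of my understanding of the discussion, not the challenge code; the function name and the 1/C chance baseline are assumptions), the chance correction simply rescales macro-averaged recall so that a random guesser scores 0 and a perfect classifier scores 1:

```python
import numpy as np
from sklearn.metrics import recall_score

def adjusted_macro_recall(y_true, y_pred):
    # Macro-averaged recall: per-class recall, averaged over classes.
    macro_recall = recall_score(y_true, y_pred, average="macro")
    # A uniform random guesser is expected to score 1 / n_classes.
    chance = 1.0 / np.unique(y_true).size
    # Rescale so that chance-level performance maps to 0 and perfect to 1.
    return (macro_recall - chance) / (1.0 - chance)
```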

Member:

Maybe "adjusted" instead of "normalized" would be better, but in either case, describing what is meant would be good.
Do you think giving formulas would be helpful? You just need to define recall and precision for each class, and then writing down the three different formulas is pretty straightforward.
So the code you quote is not what they write in the paper http://www.causality.inf.ethz.ch/AutoML/automl_ijcnn15.pdf, which I think was mentioned in one of the other threads. I'll send an email to the authors, given that this has been discussed for a while now...
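
For what it's worth, the three definitions under discussion could be written out roughly as follows (my summary of the thread, not wording taken from any of the papers), with r_c and p_c the recall and precision of class c and C the number of classes:

```latex
% Macro-average recall ([Mosley2013]_, [Kelleher2015]_)
\text{macro-recall} = \frac{1}{C} \sum_{c=1}^{C} r_c

% Class balance accuracy ([Mosley2013]_): per-class min of precision and recall
\text{CBA} = \frac{1}{C} \sum_{c=1}^{C} \min(p_c, r_c)

% Chance-corrected macro recall, in the spirit of [Guyon2015]_
\text{adjusted} = \frac{\text{macro-recall} - 1/C}{1 - 1/C}
```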


@jnothman (Member):

So this summary has led me to conclude that maybe we do have sufficient consensus to use macro-averaged recall, with an optional correction for chance so that scores ranging [0 .. 1] represent [random .. perfect]!

Unless there's a really good reason to take account of precision and not name the metric something else.

@amueller (Member):

@jnothman I think one issue with that is that the AutoML paper has a mismatch between implementation and paper, and @rhiever who was the main proponent of introducing the metric uses a different definition ;)

I just didn't really see a big benefit in adding an alias to macro_average_recall that could cause potential confusion. Someone can grep the docs and find this paragraph and use macro average recall.
Though I don't have a strong opinion on this.

@amueller (Member):

Can you maybe add https://link.springer.com/article/10.1007/s12065-015-0128-8 for the reference for averaging accuracy? It only has 15 citations, though....

@amueller (Member):

None of the other papers that define balanced accuracy as macro-average recall seem to have a lot of citations either :-/ Or is there one that I'm overlooking? So does that mean there is no strong support for either version?

@jnothman (Member):

jnothman commented Oct 31, 2017 via email

@codecov (bot) commented Nov 1, 2017:

Codecov Report

Merging #10040 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #10040   +/-   ##
=======================================
  Coverage   96.19%   96.19%           
=======================================
  Files         336      336           
  Lines       62725    62725           
=======================================
  Hits        60336    60336           
  Misses       2389     2389


@maskani-moh (Contributor, Author):

@jnothman
The paper for class balance accuracy states:

At the bottom of each per class ratio, the maximum of the row or column sum is chosen resulting in either the Recall or Precision to be the estimate of class accuracy. As a consequence, selecting the larger of the two as the denominator provides the most conservative estimate of accuracy that can be achieved.

So to me, this method does not give the averaged accuracy but a conservative estimate of it (always smaller).

@jnothman (Member):

jnothman commented Nov 1, 2017 via email

@maskani-moh (Contributor, Author):

@jnothman

What is left to do with this PR?
Should we mention that despite the numerous definitions found in the literature, there seems to be a common consensus over the macro-average recall definition?

@jnothman (Member):

I'm good with this. I'd consider implementing the macro recall in a separate PR. I think it is the only definition that is easily justified, whether adjusted or not.

@jnothman (Member):

But I'd appreciate @amueller double checking before merge...

@amueller (Member):

I would like to have a brief remark on how to compute macro average recall, just mentioning recall_score(average="macro")
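
For example, something along these lines could serve as the remark (a usage sketch with made-up toy labels):

```python
from sklearn.metrics import recall_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 2, 0, 0, 1]

# Macro-averaged recall: recall is computed for each class independently
# and then averaged, which is the multiclass reading discussed above.
print(recall_score(y_true, y_pred, average="macro"))
```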

@amueller (Member):

But other than that this seems like a nice summary imho.

@amueller (Member):

Maybe add one sentence that Mosley2013 discusses all of these with pros and cons in relative detail? @jnothman what did you think of that paper?

@amueller (Member):

But happy to merge with addition of recall(average="macro")

@jnothman (Member):

jnothman commented Nov 14, 2017 via email


@amueller merged commit 653de6c into scikit-learn:master on Nov 15, 2017
@amueller (Member):

thanks :)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
* Add references for multiclass balanced-accuracy

* Add precision not implemented

* Add another reference

* Adjust note

* Add succint definitions

* Add macro-average recall implementation

* Add class balance accuracy reference and definition

* Fix typo

* Add precision/recall comparison

* Fix typo

* Make descriptions less verbose and merge definitions

* Add reference for averaging accuracy

* Move macro-average recall example to corresponding section
@jnothman (Member):

jnothman commented Feb 5, 2018

Some more notes on this:

  • balanced_accuracy_score(y_true, y_pred) == accuracy_score(y_true, y_pred, sample_weight=compute_sample_weight('balanced', y_true)) holds for binary and should probably hold for multiclass classification. We should probably implement it as such, rather than using the probably more expensive recall_score (see the sketch below).
  • balanced accuracy * 2 - 1 in the binary case has been called Youden's J statistic, deltap' and informedness. So in a discussion of multiclass extensions of balanced accuracy, we should also consider multiclass extensions of those measures.

I think we should just bite the bullet and implement a macro-averaged recall approach.
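
A minimal sketch of the equivalence in the first bullet, using a made-up multiclass example; the balanced_accuracy_score call assumes the macro-recall implementation that scikit-learn eventually adopted, which was still a proposal at the time of this comment:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.utils.class_weight import compute_sample_weight

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 0, 1, 2, 2])

# Class-balanced sample weights make the weighted accuracy equal to
# the macro-averaged recall, which is the identity described above.
weights = compute_sample_weight("balanced", y_true)
weighted_acc = accuracy_score(y_true, y_pred, sample_weight=weights)

bal_acc = balanced_accuracy_score(y_true, y_pred)
print(np.isclose(weighted_acc, bal_acc))  # True on this example
```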
