Balanced accuracy doc - 2 #10040
Conversation
doc/modules/model_evaluation.rst (outdated)
@@ -473,6 +473,12 @@ given binary ``y_true`` and ``y_pred``:
  for C-class classification problem).
* Macro-average recall as described in [Mosley2013]_ and [Kelleher2015]_: the recall
  for each class is computed independently and the average is taken over all classes.
* Class balance accuracy as described in [Mosley2013]_: for each class, the number
balance -> balanced
doc/modules/model_evaluation.rst (outdated)
@@ -473,6 +473,12 @@ given binary ``y_true`` and ``y_pred``:
  for C-class classification problem).
* Macro-average recall as described in [Mosley2013]_ and [Kelleher2015]_: the recall
  for each class is computed independently and the average is taken over all classes.
* Class balance accuracy as described in [Mosley2013]_: for each class, the number
  of correctly predicted samples (diagonal element in the confusion matrix) is normalized
  by the maximum value of either the total number of observations predicted to the class
i.e. it's min(precision, recall)?
yes, that's how it is defined in the paper.
I think you should just state that here... Precision and recall can be assumed knowledge here.
Or you can reference below.
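For concreteness, a minimal sketch of that per-class normalization (assuming scikit-learn's ``confusion_matrix`` convention of rows = true classes, columns = predictions; the helper name and the final averaging over classes are assumptions on my part):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def class_balance_accuracy(y_true, y_pred):
    # Confusion matrix C: C[i, j] = number of samples of true class i predicted as class j.
    C = confusion_matrix(y_true, y_pred)
    # Diagonal / max(column sum, row sum) == min(precision, recall) for each class.
    per_class = np.diag(C) / np.maximum(C.sum(axis=0), C.sum(axis=1))
    # Averaged over classes, as in the other multiclass definitions discussed here.
    return per_class.mean()
```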
doc/modules/model_evaluation.rst (outdated)
  by the maximum value of either the total number of observations predicted to the class
  (sum of the class' column in the confusion matrix) or the actual number
  of observations in that class (sum of the class' row in the confusion matrix).
  In other words, we take the minimum between the precision and the recall for each class.
Do you think stating the definition from confusion matrix marginals here but not in the macro recall case is helpful?
You're right, either I should put it in both definitions or in none. It does not make sense to put it in just one of them. I'll delete it then.
Is that ok for you?
Yes, that's my preference.
Do you know what the benefit of accounting for false positives (and hence precision) is? This seems to make it a different metric that no longer reduces to balanced accuracy in the binary case.
I also generally think that having a denominator which depends on the predictions is against the whole philosophy of ROC and BalAcc-style metrics.
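A toy check of that concern, using hypothetical binary labels (macro-average recall here plays the role of balanced accuracy):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Balanced accuracy as macro-average recall: (2/4 + 2/2) / 2 = 0.75
macro_recall = recall_score(y_true, y_pred, average='macro')

# Averaging min(precision, recall) per class instead:
# class 0 -> min(1.0, 0.5) = 0.5, class 1 -> min(0.5, 1.0) = 0.5, mean = 0.5
prec = precision_score(y_true, y_pred, average=None)
rec = recall_score(y_true, y_pred, average=None)
min_pr = np.minimum(prec, rec).mean()

print(macro_recall, min_pr)  # 0.75 vs 0.5 -- the two definitions disagree
```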
More scrupulous this time!!
doc/modules/model_evaluation.rst (outdated)
@@ -473,6 +473,13 @@ given binary ``y_true`` and ``y_pred``:
  for C-class classification problem).
I'm now quite sorry for merging the previous PR, because I think this first "Normalized class-wise accuracy average" is very verbose:
- "each sample is assigned the class with maximum prediction value" is not a property of the metric, and serves only as a red herring here. It only states that the metric operates over class predictions rather than scores.
- Then are we calculating rand accuracy (as in ``accuracy_score``) for each class, or balanced accuracy?
- Then is "normalized by the expected value of balanced accuracy for random predictions" something beyond "averaging the individual accuracies over all classes"? It would seem to me that "averaging the individual accuracies" would already multiply by 1/C for C-class classification.
In some way, all of the metrics here are based on a binarization of the problem for each class. (Although in this case the binarization is more subtle, because it takes account of the overall sample size, whereas the other two methods only care about the number of true positives, false positives and false negatives [but not true negatives] for each class.)
Ah, looking at Guyon2015, I now see that "normalize" is not a multiplicative factor, but a chance correction. I would avoid "normalize" as this usually means scaling to ensure that a value falls in an expected range.
Looking at their code, I now see that their per-class accuracy is recall, not rand or balanced accuracy.
All in all, a lot of this description is related to the input format setup in that challenge. All they are doing is computing average recall across classes, so this should be merged into the definition below. They then correct for chance, I think, to ensure that "at chance" scores 0 regardless of the number of classes in the task. This can be noted with a citation to Guyon2015.
Again, sorry I wasn't reviewing more critically in the first place.
maybe "adjusted" instead of normalized would be better, but in either case, describing what is meant would be good.
Do you think giving formulas would be helpful? You just need to define recall and precision for each class and then writing down the three different formulas is pretty straight forward.
So the code you quote is not what they write in the paper http://www.causality.inf.ethz.ch/AutoML/automl_ijcnn15.pdf, which I think was mentioned in one of the other threads. I'll send an email to the authors, given that this has been discussed for a while now....
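For what it's worth, a sketch of how those three formulas could be written (assumptions: K classes, confusion matrix C with rows as true classes; the chance-adjusted form follows the Guyon2015-style correction discussed above):

```latex
% Per-class recall and precision from the confusion matrix:
%   R_c = C_{cc} / \sum_j C_{cj}   (row sum: true members of class c)
%   P_c = C_{cc} / \sum_i C_{ic}   (column sum: predictions of class c)
\text{macro-average recall:}\quad
  \bar{R} = \frac{1}{K} \sum_{c=1}^{K} R_c
\qquad
\text{class balance accuracy:}\quad
  \frac{1}{K} \sum_{c=1}^{K} \min(P_c, R_c)
\qquad
\text{chance-adjusted macro recall:}\quad
  \frac{\bar{R} - 1/K}{1 - 1/K}
```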
So this summary has led me to conclude that maybe we do have sufficient consensus to use macro-averaged recall, with an optional correction for chance so that scores ranging [0 .. 1] represent [random .. perfect]! Unless there's a really good reason to take account of precision and not name the metric something else.
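A minimal sketch of that optional chance correction, assuming macro-average recall as the base score (the helper name is invented here):

```python
import numpy as np
from sklearn.metrics import recall_score

def chance_adjusted_macro_recall(y_true, y_pred):
    # Macro-average recall: mean of the per-class recalls.
    score = recall_score(y_true, y_pred, average='macro')
    # Expected macro recall for random (or constant) predictions is 1 / n_classes,
    # so rescale such that 0 means "at chance" and 1 means perfect.
    chance = 1.0 / len(np.unique(y_true))
    return (score - chance) / (1.0 - chance)
```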
@jnothman I think one issue with that is that the AutoML paper has a mismatch between implementation and paper, and @rhiever, who was the main proponent of introducing the metric, uses a different definition ;) I just didn't really see a big benefit in adding an alias to ``recall_score``.
Can you maybe add https://link.springer.com/article/10.1007/s12065-015-0128-8 as the reference for averaging accuracy? It only has 15 citations, though...
None of the other papers that define balanced accuracy as macro-average recall seem to have a lot of citations either :-/ Or is there one that I'm overlooking? So does that mean no support for either version?
Metrics are rarely cited, for good or bad. Sometimes an implementation will be cited, but more often it is just imported or copied.
The support for macro recall is mostly intuitive. It naturally extends the binary case and has similar properties: constant score under random predictions and a denominator determined only by the ground truth distribution.
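A quick empirical check of that property (simulated labels; the seed, sizes and class proportions here are arbitrary):

```python
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
# Imbalanced ground truth over 3 classes, predictions drawn uniformly at random.
y_true = rng.choice([0, 1, 2], size=10000, p=[0.7, 0.2, 0.1])
y_pred = rng.choice([0, 1, 2], size=10000)

# Macro-average recall stays near 1/3 regardless of the class imbalance,
# because its denominators depend only on the ground-truth counts.
print(recall_score(y_true, y_pred, average='macro'))
```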
@jnothman
So to me, this method does not give the averaged accuracy but a conservative estimate of it (always smaller).
Yes, I get that it's more conservative, but that doesn't mean it's well founded. In multiclass, a low precision in one class is reflected in a low recall in another. Measuring precision means you care about which class the system predicted in error. But it also breaks nice properties like a fixed score under random predictions.
What is left to do with this PR?
I'm good with this. I'd consider implementing the macro recall in a separate PR. I think it is the only definition that is easily justified, whether adjusted or not.
But I'd appreciate @amueller double-checking before merge...
I would like to have a brief remark on how to compute macro-average recall, just mentioning ``recall_score(y_true, y_pred, average='macro')``.
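To illustrate, a small usage sketch (toy labels made up here):

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Macro-average recall: per-class recalls (1/2, 1, 1/2) averaged -> 2/3
print(recall_score(y_true, y_pred, average='macro'))
```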
But other than that this seems like a nice summary imho.
Maybe add one sentence that Mosley2013 discusses all of these with pros and cons in relative detail? @jnothman what did you think of that paper?
But happy to merge with the addition of that remark.
That "paper" is a PhD thesis... can't say I've read it!!!
But I'll try look for highlights soon...
|
> But I'd appreciate @amueller double checking before merge...

thanks :)
* Add references for multiclass balanced-accuracy
* Add precision not implemented
* Add another reference
* Adjust note
* Add succinct definitions
* Add macro-average recall implementation
* Add class balance accuracy reference and definition
* Fix typo
* Add precision/recall comparison
* Fix typo
* Make descriptions less verbose and merge definitions
* Add reference for averaging accuracy
* Move macro-average recall example to corresponding section
Some more notes on this:
I think we should just bite the bullet and implement a macro-averaged recall approach.
Reference Issues/PRs
In addition to #8066
What does this implement/fix? Explain your changes.
I added some references from the literature for the multiclass balanced accuracy. The threads #6747 and #8066 cited some papers and their different implementations of the metric. There was no real consensus about which definition to use, so the user might want to implement the one of their choice.