add balanced accuracy metric #6747
Of course it is. Why didn't I think of that? I think creating an alias (and a scorer) is a good idea, with the constraint that it applies to binary problems. It could also be calculated per-label for multilabel problems (and then potentially macro-averaged...).
I think this is moderate, seeing as it involves data format checking and narrative docs.
Following from EpistasisLab/tpot#108: balanced accuracy is where you calculate accuracy on a per-class basis, then average all of those accuracies. Here is a paper that introduces it: http://onlinelibrary.wiley.com/doi/10.1002/gepi.20211/abstract
But by "accuracy" on a per-class basis, you must mean "recall"; and we're still only considering the binary classification case.
Here's the definition that we use: https://github.com/rhiever/tpot/blob/master/tpot/tpot.py#L1207
In the multiclass case, we simply consider the current class we're calculating accuracy for to be the "positive" class and all other classes to be the "negative" class. Indeed most of the papers that discuss balanced accuracy do so only in the context of binary classification, but it seems reasonable to expand it to the multiclass case in this manner.
At a first glance, I don't think that multiclass definition is really appropriate. But I'll think about it a little. I suspect macro-averaged recall better reflects the intentions of balanced accuracy.
I believe the same procedure is used with macro-averaged AUC. From my understanding, macro-averaged recall != balanced accuracy in the multiclass case, only in the binary classification case. Thus I don't think we should label macro-averaged recall as balanced accuracy. Balanced accuracy is a separate metric that places more importance on TNR (relative to macro-averaged recall) in multiclass classification problems.
macro-averaged AUC is explicitly for multilabel. Problem is that I'm not so …
Section 2.1.18 in the attached paper describes the mathematical formulation of balanced accuracy. The key part is that you calculate balanced accuracy using a one-vs-all configuration. So you start with the first class …

Urbanowicz 2015 ExSTraCS 2.0 description and evaluation of a scalable learning.pdf
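If I follow, a minimal sketch of that one-vs-all formulation would be something like this (my own function and variable names, not the paper's or tpot's code):

```python
import numpy as np

def one_vs_all_balanced_accuracy(y_true, y_pred):
    # For each class, binarize one-vs-rest, take (sensitivity + specificity) / 2,
    # then average the per-class values.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for k in np.unique(y_true):
        pos = y_true == k
        sensitivity = np.mean(y_pred[pos] == k)
        specificity = np.mean(y_pred[~pos] != k)
        per_class.append((sensitivity + specificity) / 2)
    return np.mean(per_class)
```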
I'm not persuaded that this is the right thing to do, but I am beginning to be persuaded that this is a logical extension that diverse people are assuming is legitimate.
We've worked through the math and logic behind it several times and it checks out for us, but that doesn't mean we're right. I'm very curious to hear why it may not be the right thing to do.
The redundancy of information inherent in including both one class's true positives and another's true negatives makes me a little uncomfortable. However, the multiclass case has some niceties: such a macro-average over a binary problem actually results in the same formula as the non-multiclass treatment; and empirically it seems that random class assignment (from a fixed distribution) in the multiclass case will still yield a score of 0.5, which is pretty neat. I'm coming to appreciate that this may be an appropriate extension.
Yes exactly. As with all metrics, balanced accuracy is just an indirect method of capturing what is "good" performance for our models. As a metric in the multiclass case, balanced accuracy puts a stronger emphasis on TN than TP, at least when compared to macro-averaged recall. And as you point out, balanced accuracy has the nice feature that 0.5 will consistently be "as good as random," with plenty of room for models to perform better (>0.5) or worse (<0.5) than random. It'd be great if we could get balanced accuracy added as a new sklearn metric for measuring a model's multiclass performance.
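To see both of those properties numerically, here is a rough check (a sketch; the helper name `ova_balanced_accuracy` is mine, and it relies on `roc_auc_score` over hard 0/1 predictions being exactly (sensitivity + specificity) / 2 per class):

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

def ova_balanced_accuracy(y_true, y_pred):
    # Average over classes of the one-vs-rest (sensitivity + specificity) / 2;
    # on hard predictions this is exactly roc_auc_score per binarized class.
    classes = np.unique(np.concatenate([y_true, y_pred]))
    return np.mean([roc_auc_score(y_true == k, y_pred == k) for k in classes])

rng = np.random.RandomState(0)

# Property 1: in the binary case it coincides with macro-averaged recall.
y_true = rng.choice(2, size=50000, p=[0.8, 0.2])
y_pred = rng.randint(2, size=50000)
print(ova_balanced_accuracy(y_true, y_pred),
      recall_score(y_true, y_pred, average='macro'))

# Property 2: uniformly random predictions on an imbalanced multiclass
# problem still score about 0.5.
y_true = rng.choice(4, size=100000, p=[0.7, 0.2, 0.07, 0.03])
y_pred = rng.randint(4, size=100000)
print(ova_balanced_accuracy(y_true, y_pred))
```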
If this is the only paper using this definition, I'm not sure we should include it. Where did you get it from? That paper?
So this paper says …
The paper that @rhiever claims introduces the metric does so only for binary, right? As long as we can't come up with what the standard definition is (if any), I don't think we should add it under this name. We can add … You also first used a different definition of average precision in your code...
That was a bug that we fixed. :-) The definition of balanced accuracy for the multiclass case is in the Urbanowicz paper. The original paper I linked to was only for the binary case, yes. For the Master's thesis that you linked, that definition of balanced accuracy is under a section describing "Two-Class Evaluation Measures," i.e., binary or multilabel classification. I don't think that thesis discusses balanced accuracy in the multiclass case. It's valid to say that balanced accuracy is the macro-averaged recall in the binary case. In the binary case, it works out the same mathematically as calculating accuracy on a per-class basis then averaging those two accuracies. We're simply proposing an extension to the definition of balanced accuracy to also cover the multiclass case.
I'm happy for you to veto this @amueller, after my change of heart. I was persuaded by the features discussed above: that such a macro-average over a binary problem reduces to the usual binary formula, and that random multiclass predictions still score 0.5.
These properties are much more persuasively meaningful than any properties of macro-averaged P/R/F in the multiclass case! This extension has also been reinvented in a few places, suggesting it is sought-after and reasonable.
Here's another paper from the AutoML challenge that defines balanced accuracy for the multiclass case. They use a similar definition, with the only difference being the normalization procedure that they apply at the end (where they correct for the fact that "as good as random" accuracy is 1/N, with N the number of classes).
Thanks for the reference, though I think it muddies the water a bit (pending a look at their implementation). It's far from clear to me that the accuracies they are averaging class-wise in the multiclass case incorporate sensitivity and specificity. By default I assume they mean standard Rand accuracy over each binarization, although this seems a strange choice given that then the binary problem needs to be, as they say, a "special case". That correction for chance (under a uniform prior) allows for an "everything incorrect" response to score zero (assuming I'm correct about their use of Rand accuracy). I don't think your score allows 0, except in the case where the only predicted classes are not in the gold standard. Then classification at random in their measure does not yield a nice score that is invariant of the number of classes, nor one invariant to the distribution of those classes in the gold standard:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

N = 1000000
for K in range(3, 10):
    x = np.random.rand(N)
    y_true = (x[:, None] < np.random.rand(K - 1)).sum(axis=1)
    y_pred = np.random.randint(K, size=N)
    R = 1 / K
    classes = np.unique(np.concatenate([y_true, y_pred]))
    bac = np.mean([roc_auc_score(y_true == k, y_pred == k) for k in classes])
    chalearn_bac = np.mean([accuracy_score(y_true == k, y_pred == k) for k in classes])
    print('{:.2f}\t{:.2f}\t{:.2f}'.format(chalearn_bac, (chalearn_bac - R) / (1 - R), bac))
```

Same results if …
To make things murkier, that metric description is repeated here and hyperlinked to here, but I don't see the relevance of the latter! Ah. Now I see the relevance. They've actually implemented macro-averaged recall. Which means that, indeed, the chance correction they propose results in a score of 0 for random predictions. It also means that binary classification isn't actually a special case, despite what they say.

But they also have other nonsense in that paper, such as "We also normalize F1 with F1 := (F1-R)/(1-R), where R is the expected value of F1 for random predictions (i.e. R=0.5 for binary classification and R=(1/C) for C-class classification problems)." The expected value of F1 for random predictions is the prevalence of the positive class in the binary case, not 0.5. So while they attempt to throw a principled kitchen sink of evaluation metrics at the task, I'm not sure they are coming from a place of critical expertise, at least in their description of the measures. Still, the fact that they describe something different to your metric with the same name makes it a bit uncomfortable...
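A quick way to sanity-check the F1 claim above (assuming "random predictions" means predicting labels with the same class distribution as the truth; the snippet is only a sketch):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
for prevalence in (0.1, 0.3, 0.5):
    y_true = (rng.rand(200000) < prevalence).astype(int)
    y_pred = rng.permutation(y_true)   # random predictions with the same label distribution
    print(prevalence, round(f1_score(y_true, y_pred), 3))
# F1 for random predictions tracks the prevalence of the positive class, not 0.5.
```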
we can always call it macro_average_accuracy (that's what we're talking about, right?) and say that "balanced accuracy" can mean "macro average accuracy" or "macro average recall" depending on who you ask.
Haven't followed this and I'm kinda busy, but this seems like a potential blocker, right?
@ledell I think @jnothman is concerned with what's a good metric because people use what's in sklearn. People use R^2 for regression because it's the default in sklearn. People use 10 trees in a random forest b/c it's the default in sklearn (we are changing the latter; it's hard to change the former). Honestly my conclusion from that would be that we force the user to pick, though. Maybe having an option as @adrinjalali suggests, and for the scorers/strings only have …

For the record, I think that log-loss is a terrible metric for multi-class classification: it can prefer one set of predicted probabilities over another even where the argmax gives a correct classification in the second case, but not the first.
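The concrete probability vectors from this comment did not survive, but a hypothetical pair with the property described (the numbers are illustrative only) would be:

```python
from sklearn.metrics import log_loss

y_true = [0]                      # one sample whose true class is 0 (of classes 0, 1, 2)
first = [[0.45, 0.50, 0.05]]      # argmax -> class 1 (wrong), p(true class) = 0.45
second = [[0.40, 0.30, 0.30]]     # argmax -> class 0 (correct), p(true class) = 0.40

print(log_loss(y_true, first, labels=[0, 1, 2]))   # ~0.80, preferred by log-loss
print(log_loss(y_true, second, labels=[0, 1, 2]))  # ~0.92
```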
Wasn't it the case that chalearn implemented something different in the code than what they said in the paper? At least one of the ml competitions used weighted macro-average recall.
Agreed... to clarify, I don't have a preference on what the default metric should be for multi-class problems -- my only concern is the use of a polysemous method name like `balanced_accuracy`.
It does not look like it: https://github.com/ch-imad/AutoMl_Challenge/blob/master/Starting_kit/scoring_program/libscores.py#L203
I have a preference there: in addition to the points raised by @jnothman in #6747 (comment), the macro-average recall falls back to the …
So you are interested in the accuracy and do not want to correct for the class imbalance. Actually, I was wondering why there is no …
I would find this option confusing. We all agree that the literature lacks clarity regarding the definition of the metric, and this option replicates the same fuzziness in the implementation. So, I am fine with the current behavior and naming of `balanced_accuracy_score`.
FWIW, an alternative metric used in the imbalanced classification literature is the geometric mean of the per-class recall.
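A minimal sketch of that metric (the helper name is mine; imbalanced-learn ships a fuller version as `geometric_mean_score`, if I recall correctly):

```python
import numpy as np
from sklearn.metrics import recall_score

def gmean_recall(y_true, y_pred):
    # Geometric mean of the per-class recalls: a single class with zero
    # recall drives the whole score to zero.
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1 / len(recalls)))
```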
Why would you find this confusing? Indeed this would mean the implementation reflects the state of the literature and the understanding of the community. There's no fuzziness if there are two definitions for the same name. And there's lots of literature on multi-class metrics and we can go into that at some point. I think @ledell makes a good point in being clear about what we implement and allowing alternatives.
"Confusion" might not be the right term but returning completely different statistics would surprise me and I am not sure that we can advise to choose either implementation. In short, I am scared that users switch methods because the score obtained is higher. I am also concerned for the string style for the metric. Having Regarding the metric itself, alternative definitions which do not guarantee to obtain the same result than
I completely agree with this. I am sure that we can make the documentation better and we should be open to alternative methods, even if I have my concerns this time with the alternative.
That's what we do for the different averaging methods, right?
That's true.
@amueller is right here. You've referenced the binary case, @glemaitre. Chalearn AutoML indeed implements macro-average recall, adjusted so that random performance is 0: https://github.com/ch-imad/AutoMl_Challenge/blob/2353ec0/Starting_kit/scoring_program/libscores.py#L206-L208. This is equivalent to our `balanced_accuracy_score` with `adjusted=True`. I think we can safely eliminate Chalearn as a counter-example to our implementation preference, @ledell. But we could explicitly note in our docs that adjusted=True equates to Chalearn's. I think we may have indeed contacted the authors at some point (@amueller obviously has a better memory of all this history than I do). If we can presume that Chalearn's description was in error, and that some of the subsequent references to "averaged accuracy" are copying Chalearn's in error, can we let this go? Can we please lead the community and define the standard meaning of balanced accuracy, because we have identified many arguments for this definition (and several against alternatives), and put the discrepancy in the literature to rest?
We already do say in our documentation that adjusted=True equates to the Chalearn implementation. We do not note that their description is in error. Should we?
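That equivalence is easy to demonstrate numerically (a sketch with made-up data; R follows the Chalearn paper's notation for the chance level):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

rng = np.random.RandomState(0)
y_true = rng.choice(3, size=10000, p=[0.6, 0.3, 0.1])
y_pred = rng.randint(3, size=10000)

macro_recall = recall_score(y_true, y_pred, average='macro')
R = 1 / len(np.unique(y_true))                      # chance level, as in the Chalearn paper
print((macro_recall - R) / (1 - R))                 # Chalearn-style normalization
print(balanced_accuracy_score(y_true, y_pred, adjusted=True))  # same value
```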
Yes, I think we can implement multiple definitions here.
+1. It's not so good, but maybe we need to do so if we keep the name `balanced_accuracy`. What's the definition of …? Also, I'm starting to wonder whether it's good to regard class balanced accuracy as a multiclass definition of balanced accuracy.
I'm too tired (in several ways ;) to make a decision on this but I think it's the last remaining blocker?
I guess a new option will not block the new release. What we need to consider now is whether we need to change the name of the current scorer. Also, if we decide to implement macro_average_accuracy as another option, we might need to provide some references in the user guide.
I'm, FWIW, -1 on a new option. I don't want to perpetuate the misreading of the Guyon et al (Chalearn AutoML) paper where they have inaccurately described their implementation.
I'm okay with having macro-average accuracy available, only I don't know what use it is.
So the only reference we have for the so-called macro-averaged accuracy is the Guyon et al. paper?
I'll vote +0 (maybe -1) on including it unless we can find some references which clearly define it.
The references to variant multiclass balanced accuracy are discussed in model_evaluation.rst.
@jnothman Which entry? Seems that class balanced accuracy and balanced accuracy from Urbanowicz et al. 2015 are not the so-called …
Sorry. This conversation has been thoroughly confused, partially because of the time passed since we solved it and wrote it up. No one refers to macro-averaged accuracy as balanced accuracy. That is only a misunderstanding due to Guyon et al.

We could provide an Urbanowicz-style implementation, but I think it's a really poor reuse of the name "balanced accuracy" for something that has nothing to do with it. Binary balanced accuracy does not incorporate precision, it incorporates specificity. They account for the same kind of error but are fundamentally different quantifications of that error. Notably, the denominator of specificity depends only on y_true, while the denominator of precision depends on y_pred. It behaves very differently depending on the biases of the classifier to particular classes.
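To make the denominator point concrete (toy labels, purely illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
conservative = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])   # rarely predicts positive
liberal      = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])   # often predicts positive

for y_pred in (conservative, liberal):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("specificity denominator (tn + fp):", tn + fp,   # 7 both times: fixed by y_true
          "| precision denominator (tp + fp):", tp + fp)   # 1 vs 7: moves with y_pred
```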
Agree. I don't think the definition from Urbanowicz et al. 2015 is widely accepted, unless provided with more references. @jnothman Close the issue?
I'm happy to have it closed.
No, @ledell cited macro-averaged accuracy as what Guyon et al call balanced accuracy, and as something offered in H2O, but as mean zero-one loss, not under the name "balanced accuracy". As far as I can glean from above, @rhiever has used the Urbanowicz definition, which I was incorrect above to say incorporates precision (that was Mosley et al.; I need this mess like I need a hole in the head!); rather, it is the average of binary balanced accuracies for each class. (Need I argue against this again?)
I've recently seen more people using "balanced accuracy" for imbalanced binary and multi-class problems. I think it is the same as macro average recall. If so, I think we might want to create an alias, because it is not super obvious, and maybe add a scorer.
Also see EpistasisLab/tpot#108
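A rough sketch of what such an alias and scorer could look like under that reading (the names are illustrative, not a final API):

```python
from sklearn.metrics import make_scorer, recall_score

def balanced_accuracy(y_true, y_pred):
    # Under this reading, balanced accuracy is just macro-averaged recall.
    return recall_score(y_true, y_pred, average='macro')

balanced_accuracy_scorer = make_scorer(balanced_accuracy)
```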