add balanced accuracy metric #6747

Closed · amueller opened this issue May 2, 2016 · 70 comments · Fixed by #8066

@amueller
Member

amueller commented May 2, 2016

I've recently seen more people using "balanced accuracy" for imbalanced binary and multi-class problems. I think it is the same as macro-averaged recall. If so, I think we might want to create an alias, because the equivalence is not super obvious, and maybe add a scorer.
Also see EpistasisLab/tpot#108
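
For illustration, a minimal sketch (toy labels made up here) of the suggested binary equivalence: averaging sensitivity and specificity gives the same number as recall_score with average='macro'.

import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0])

# "balanced accuracy" in the binary sense: mean of sensitivity and specificity
sensitivity = recall_score(y_true, y_pred, pos_label=1)
specificity = recall_score(y_true, y_pred, pos_label=0)
print((sensitivity + specificity) / 2)

# macro-averaged recall gives the same value
print(recall_score(y_true, y_pred, average='macro'))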

@amueller amueller added the Easy Well-defined and straightforward way to resolve label May 2, 2016
@jnothman
Member

jnothman commented May 3, 2016

Of course it is. Why didn't I think of that.

I think creating an alias (and a scorer) is a good idea, with the constraint that it applies to binary problems. It could also be calculated per-label for multilabel problems (and then potentially macro-averaged...).

@jnothman
Member

jnothman commented May 3, 2016

I think this is moderate seeing as it involves data format checking and narrative docs.

@jnothman jnothman added Moderate Anything that requires some knowledge of conventions and best practices and removed Easy Well-defined and straightforward way to resolve labels May 3, 2016
@rhiever

rhiever commented May 6, 2016

Following from EpistasisLab/tpot#108

Balanced accuracy is where you calculate accuracy on a per-class basis, then average all of those accuracies.

Here is a paper that introduces it: http://onlinelibrary.wiley.com/doi/10.1002/gepi.20211/abstract

@jnothman
Member

jnothman commented May 7, 2016

But by "accuracy" on a per-class basis, you must mean "recall"; and we're still only considering the binary classification case.

@rhiever

rhiever commented May 7, 2016

Here's the definition that we use: https://github.com/rhiever/tpot/blob/master/tpot/tpot.py#L1207

In the multiclass case, we simply consider the current class we're calculating accuracy for to be 1 and the other classes to be 0, i.e., a one-vs-all configuration. That allows us to calculate accuracy normally for each class.

Indeed most of the papers that discuss balanced accuracy do so only in the context of binary classification, but it seems reasonable to expand it to the multiclass case in this manner.

@jnothman
Member

jnothman commented May 7, 2016

At a first glance, I don't think that multiclass definition is really appropriate. But I'll think about it a little. I suspect macro-averaged recall better reflects the intentions of balanced accuracy.

@rhiever

rhiever commented Jun 10, 2016

I believe the same procedure is used with macro-averaged AUC.

From my understanding, macro-averaged recall != balanced accuracy in the multiclass case; they coincide only in the binary classification case. Thus I don't think we should label macro-averaged recall as balanced accuracy. Balanced accuracy is a separate metric that places more importance on TNR (relative to macro-averaged recall) in multiclass classification problems.

@jnothman
Member

Macro-averaged AUC is explicitly for multilabel. Problem is that I'm not so sure what TNR means in a multiclass context, or why the OvR transform makes sense for that. I've had a go at simplifying the overall score for a 3-class classification problem, by hand, but haven't got far enough to see whether its formula is interesting or valuable.


@rhiever

rhiever commented Jun 14, 2016

Section 2.1.18 in the attached paper describes the mathematical formulation of balanced accuracy.

The key part is that you calculate balanced accuracy using a one-vs-all configuration. So you start with the first class a, where you treat the data as a binary classification task such that all records labeled a are 1 and all other classes are 0. Calculate accuracy as you normally would. Then for the next class b, repeat the same process except all instances labeled b are 1 and the rest are 0. And so on. Once you have the per-class accuracy for every class, average them and that is balanced accuracy.

Urbanowicz 2015 ExSTraCS 2.0 description and evaluation of a scalable learning.pdf
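
A rough sketch of that procedure on toy labels (written here just to illustrate the description above; this is not a reference implementation):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# for each class k, binarize (k vs. rest), compute plain accuracy, then average
classes = np.unique(y_true)
per_class_acc = [accuracy_score(y_true == k, y_pred == k) for k in classes]
print(np.mean(per_class_acc))

# macro-averaged recall on the same data, for comparison
print(recall_score(y_true, y_pred, average='macro'))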

@jnothman
Member

I'm not persuaded that this is the right thing to do, but I am beginning to be persuaded that this is a logical extension that diverse people are assuming is legitimate.

@rhiever

rhiever commented Jun 15, 2016

We've worked through the math and logic behind it several times and it checks out for us, but that doesn't mean we're right. I'm very curious to hear why it may not be the right thing to do.

@jnothman
Member

The redundancy of information inherent in including both one class's true positives and another's true negatives makes me a little uncomfortable.

However, the multiclass case has some niceties: such a macro-average over a binary problem actually results in the same formula as the non-multiclass treatment; and empirically it seems that random class assignment (from a fixed distribution) in the multiclass case will still yield a score of 0.5, which is pretty neat.

I'm coming to appreciate that this may be an appropriate extension.

@rhiever

rhiever commented Jun 16, 2016

Yes exactly. As with all metrics, balanced accuracy is just an indirect method of capturing what is "good" performance for our models. As a metric in the multiclass case, balanced accuracy puts a stronger emphasis on TN than TP, at least when compared to macro-averaged recall. And as you point out, balanced accuracy has the nice feature that 0.5 will consistently be "as good as random," with plenty of room for models to perform better (>0.5) or worse (<0.5) than random.

It'd be great if we could get balanced accuracy added as a new sklearn metric for measuring a model's multiclass performance.

@amueller
Member Author

If this is the only paper using this definition, I'm not sure we should include it. Where did you get it from? That paper?

@amueller
Member Author

amueller commented Jun 17, 2016

So this paper says

Balanced Accuracy is the Recall for each class, averaged over the number of classes.

The paper that @rhiever claims introduces the metric does so only for binary, right?
The second paper indeed uses the average of (sensitivity + specificity) / 2 over classes. But I think that is not the standard definition.

As long as we can't come up with what the standard definition is (if any), I don't think we should add it under this name. We can add macro_average_accuracy or something... This is somewhat similar to macro average f1, right?

You also first used a different definition of average precision in your code...

@rhiever

rhiever commented Jun 17, 2016

You also first used a different definition of average precision in your code...

That was a bug that we fixed. :-)


The definition of balanced accuracy for the multiclass case is in the Urbanowicz paper. The original paper I linked to was only for the binary case, yes.

For the Master's thesis that you linked, that definition of balanced accuracy is under a section describing "Two-Class Evaluation Measures," i.e., binary or multilabel classification. I don't think that thesis discusses balanced accuracy in the multiclass case.

It's valid to say that balanced accuracy is the macro-averaged recall in the binary case. In the binary case, it works out the same mathematically as calculating accuracy on a per-class basis then averaging those two accuracies.

We're simply proposing an extension to the definition of balanced accuracy to also cover the multiclass case.

@jnothman
Member

jnothman commented Jun 18, 2016

I'm happy for you to veto this @amueller, after my change of heart. I was persuaded by the following features:

  • treating a binary problem as multiclass gives identical results, i.e. bal_acc(y_true, y_pred) == bal_acc(~y_true, ~y_pred) == 1/2 * (bal_acc(y_true, y_pred) + bal_acc(~y_true, ~y_pred)); a quick numerical check appears below
  • the valuable property of ROC AUC (that, regardless of class prevalence, a random prediction will produce a fixed score in the limit) holds true in the multiclass transformation
  • perfect classification performance is still 1.0, so introducing random error decreases the score towards 0.5, etc.

These properties are much more persuasively meaningful than any properties of macro-averaged P/R/F in the multiclass case!

This extension has also been reinvented in a few places, suggesting it is sought-after and reasonable.
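
A quick numerical check of the first property on made-up binary labels, using macro-averaged recall as the working definition of bal_acc:

import numpy as np
from sklearn.metrics import recall_score

def bal_acc(y_true, y_pred):
    # macro-averaged recall as the working definition of balanced accuracy
    return recall_score(y_true, y_pred, average='macro')

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0], dtype=bool)
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 1], dtype=bool)

print(bal_acc(y_true, y_pred))    # treating the two classes symmetrically...
print(bal_acc(~y_true, ~y_pred))  # ...so flipping the labels gives the same score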

@rhiever

rhiever commented Jun 21, 2016

Here's another paper from the AutoML challenge that defines balanced accuracy for the multiclass case. They use a similar definition, with the only difference being the normalization procedure that they apply at the end (they rescale relative to the chance level of 1/N, where N is the number of classes).

@jnothman
Member

jnothman commented Jun 21, 2016

Thanks for the reference, though I think it muddies the water a bit (pending a look at their implementation). It's far from clear to me that the accuracies they are averaging class-wise in the multiclass case incorporate sensitivity and specificity. By default I assume they mean standard Rand accuracy over each binarization, although this seems a strange choice given that then the binary problem needs to be, as they say, a "special case".

That correction for chance (under a uniform prior) allows for an "everything incorrect" response to score zero (assuming I'm correct about their use of Rand accuracy). I don't think your score allows 0, except in the case where the only predicted classes are not in the gold standard.

Then classification at random in their measure does not yield a nice score that is invariant of the number of classes, nor one invariant to the distribution of those classes in the gold standard.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

N = 1000000
for K in range(3, 10):
    # sample y_true from a random (non-uniform) class distribution; predict uniformly at random
    x = np.random.rand(N)
    y_true = (x[:, None] < np.random.rand(K - 1)).sum(axis=1)
    y_pred = np.random.randint(K, size=N)
    R = 1. / K
    classes = np.unique(np.concatenate([y_true, y_pred]))
    bac = np.mean([roc_auc_score(y_true == k, y_pred == k) for k in classes])
    chalearn_bac = np.mean([accuracy_score(y_true == k, y_pred == k) for k in classes])
    print('{:.2f}\t{:.2f}\t{:.2f}'.format(chalearn_bac, (chalearn_bac - R) / (1 - R), bac))

produces

unnorm norm yours
0.56    0.33    0.50
0.62    0.50    0.50
0.68    0.60    0.50
0.72    0.67    0.50
0.76    0.71    0.50
0.78    0.75    0.50
0.80    0.78    0.50

Same results if y_true is uniformly sampled, whereas the sampling above is weighted.

@jnothman
Member

jnothman commented Jun 23, 2016

To make things murkier, that metric description is repeated here and hyperlinked to here but I don't see the relevance of the latter!

Ah. Now I see the relevance. They've actually implemented macro-averaged recall. Which means that, indeed, the chance correction they propose results in a score of 0 for random predictions. It also means that binary classification isn't actually a special case, despite what they say.

But they also have other nonsense in that paper such as "We also normalize F1 with F1 := (F1-R)/(1-R), where R is the expected value of F1 for random predictions (i.e. R=0.5 for binary classification and R=(1/C) for C-class classification problems)." Expected value of F1 for random predictions is the prevalence of the positive class in the binary case, not 0.5. So while they attempt to throw a principled kitchen sink of evaluation metrics at the task, I'm not sure they are coming from a place of critical expertise, at least in their description of the measures.

Still, the fact that they describe something different to your metric with the same name makes it a bit uncomfortable...

@amueller amueller added Sprint and removed Sprint labels Jul 15, 2016
@amueller
Member Author

we can always call it macro_average_accuracy (that's what we're talking about, right?) and say that "balanced accuracy" can mean "macro average accuracy" or "macro average recall" depending on who you ask.

@amueller
Member Author

Haven't followed this and I'm kinda busy, but this seems like a potential blocker, right?

@amueller
Member Author

amueller commented Sep 10, 2018

@ledell I think @jnothman is concerned with what's a good metric because people use what's in sklearn. People use R^2 for regression because it's the default in sklearn. People use 10 trees in a random forest b/c it's default in sklearn (we are changing the latter, it's hard to change the former).
Basically sklearn is prescriptive just because of its wide use, for better or worse.

Honestly my conclusion from that would be that we force the user to pick, though. Maybe having an option as @adrinjalali suggests, and for the scorers/strings only have macro_average_recall and macro_average_accuracy.

For the record, I think that log-loss is a terrible metric for multi-class classification.
If the true class is 0, these two have the same log loss:

(.4, .6, 0)
(.4, .3, .3)

where the argmax gives a correct classification in the second case, but not the first.
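
A quick check of that example with sklearn's log_loss (the probability rows are the ones above):

from sklearn.metrics import log_loss

# true class is 0; both rows give log loss -log(0.4) ~= 0.916,
# because log loss only looks at the probability assigned to the true class
print(log_loss([0], [[0.4, 0.6, 0.0]], labels=[0, 1, 2]))  # argmax -> class 1 (wrong)
print(log_loss([0], [[0.4, 0.3, 0.3]], labels=[0, 1, 2]))  # argmax -> class 0 (right)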

@amueller
Member Author

Wasn't it the case that chalearn implemented something different in the code than what they said in the paper? at least one of the ml competitions used weighted macro-average recall.

@ledell

ledell commented Sep 11, 2018

@amueller

People use R^2 for regression because it's the default in sklearn.

Agreed... to clarify, I don't have a preference on what the default metric should be for multi-class problems -- my only concern is the use of a polysemous method name like balanced_accuracy_score() to refer to only one of the two things (this is the current status of the code in the release candidate). If you have an option that allows switching between the two definitions, that seems fine to me. Are there any other scoring methods that have a switch like this?

@glemaitre
Member

Wasn't it the case that chalearn implemented something different in the code than what they said in the paper? at least one of the ml competitions used weighted macro-average recall.

It does not look like it: https://github.com/ch-imad/AutoMl_Challenge/blob/master/Starting_kit/scoring_program/libscores.py#L203

I don't have a preference on what the default metric should be for multi-class problems

I have a preference there: in addition to the points raised by @jnothman in #6747 (comment), the macro-average recall falls back to the accuracy_score on a balanced dataset, while this is not the case for the macro-average accuracy.

If you care about the accuracy of each class equally (regardless of it's presence in the training set), then it's an appropriate metric to use.

So you are interested in the accuracy and do not want to correct for the class imbalance. Actually, I was wondering why there is no average parameter in accuracy_score?

If you have an option that allows switching between the two definitions, that seems fine to me

I would find this option confusing. We all agree that the literature lacks clarity regarding the definition of the metric, and this option replicates the same fuzziness in the implementation.

So, I am fine with the current behavior and naming of balanced_accuracy_score. I don't see a problem with choosing one of the proposed definitions in the literature, documenting it properly, and warning about the controversy. IMO, selecting the macro-average recall seems the most appropriate choice (equivalence with accuracy in the balanced setting). I would be inclined to also add an average option to accuracy_score. This behavior corresponds to the other definition and could be documented as well.
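
As a quick illustration of that equivalence (made-up, perfectly balanced labels): when every class appears equally often in y_true, macro-averaged recall and plain accuracy are the same number.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.RandomState(0)
y_true = np.repeat([0, 1, 2], 100)          # perfectly balanced ground truth
y_pred = rng.randint(3, size=y_true.shape)  # arbitrary predictions

print(accuracy_score(y_true, y_pred))
print(recall_score(y_true, y_pred, average='macro'))  # identical on balanced data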

@glemaitre
Member

FWIW, an alternative metric used in the imbalanced classification literature is the geometric mean of the per-class recall.
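
For reference, a small sketch of that metric on toy labels (computed by hand here; imbalanced-learn ships something similar as geometric_mean_score, if I recall correctly):

import numpy as np
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2, 0]

# one recall per class, then their geometric mean
per_class_recall = recall_score(y_true, y_pred, average=None)
print(per_class_recall.prod() ** (1.0 / len(per_class_recall)))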

@amueller
Member Author

I would find this option confusing. We all agree that the literature lacks clarity regarding the definition of the metric, and this option replicates the same fuzziness in the implementation.

Why would you find this confusing? This would mean the implementation reflects the state of the literature and the understanding of the community. There's no fuzziness if there are two definitions for the same name.

And there's lots of literature on multi-class metrics that we can go into at some point. I think @ledell makes a good point about being clear about what we implement and allowing alternatives.
Similarly, we decided not to pick an averaging strategy in any of the multi-class metrics, and instead require users to explicitly pass an averaging strategy to clarify what they want.

@glemaitre
Member

Why would you find this confusing? This would mean the implementation reflects the state of the literature and the understanding of the community. There's no fuzziness if there are two definitions for the same name.

"Confusion" might not be the right term, but returning completely different statistics would surprise me, and I am not sure that we can advise choosing either implementation. In short, I am scared that users will switch methods because the score obtained is higher. I am also concerned about the string names for the metric. Having balanced_accuracy_score and different strings will be difficult to document well.

Regarding the metric itself, alternative definitions which do not guarantee the same result as accuracy_score in a balanced setting seem weird to me.

And there's lots of literature on multi-class metrics that we can go into at some point. I think @ledell makes a good point about being clear about what we implement and allowing alternatives.

I completely agree with this. I am sure that we can make the documentation better, and we should be open to alternative methods, even if I have my concerns this time about the alternative balanced_accuracy_score.

@amueller
Member Author

I am also concerned about the string names for the metric. Having balanced_accuracy_score and different strings will be difficult to document well.

That's what we do for the different averaging methods, right?

@glemaitre
Member

That's what we do for the different averaging methods, right?

That's true

@jnothman
Member

jnothman commented Sep 12, 2018

Wasn't it the case that chalearn implemented something different in the code than what they said in the paper? at least one of the ml competitions used weighted macro-average recall.

It does not look like it: ch-imad/AutoMl_Challenge:Starting_kit/scoring_program/libscores.py@master#L203

@amueller is right here. You've referenced the binary case, @glemaitre. Chalearn AutoML indeed implements macro-average recall, adjusted so that random performance is 0: https://github.com/ch-imad/AutoMl_Challenge/blob/2353ec0/Starting_kit/scoring_program/libscores.py#L206-L208. This is equivalent to our balanced_accuracy with adjusted=True.

I think we can safely eliminate Chalearn as a counter-example to our implementation preference, @ledell. But we could explicitly note in our docs that adjusted=True equates to Chalearn's. I think we may have indeed contacted the authors at some point (@amueller obviously has a better memory of all this history than I do).

If we can presume that Chalearn's description was in error, and that some of the subsequent references to "averaged accuracy" are copying Chalearn's in error, can we let this go?

Can we please lead the community, and define the standard meaning of balanced accuracy because we have identified many arguments for this definition (and several against alternatives), and put the discrepancy in the literature to rest??
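
For reference, a sketch of the adjustment being discussed, assuming the 0.20 release candidate's balanced_accuracy_score(y_true, y_pred, adjusted=...) signature and some made-up labels:

from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 2, 2, 0, 2]

raw = balanced_accuracy_score(y_true, y_pred)  # macro-averaged recall
chance = 1.0 / 3                               # score of random guessing with 3 classes

# rescale so that random performance maps to 0 and perfect performance stays at 1
print((raw - chance) / (1 - chance))
print(balanced_accuracy_score(y_true, y_pred, adjusted=True))  # same value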

@jnothman
Member

We already do say in our documentation that adjusted=True equates to the Chalearn implementation. We do not note that their description is in error. Should we??

@qinhanmin2014
Member

qinhanmin2014 commented Sep 13, 2018

Yes, I think we can implement multiple definitions here.

Maybe having an option as @adrinjalali suggests, and for the scorers/strings only have macro_average_recall and macro_average_accuracy.

+1. It's not ideal, but maybe we need to do so if we keep the name balanced_accuracy_score.

What's the definition of macro_average_accuracy here? I don't think we've provided users with references about macro_average_accuracy, and I can't find any references in the PR.

Also, I'm starting to wonder whether it's good to regard class balanced accuracy as a multiclass definition of balanced_accuracy_score. It seems that they are two different metrics (see p. 46 of Mosley 2013, where the authors compare balanced accuracy and class balanced accuracy; what's more, on p. 25 the authors clearly state that Balanced Accuracy is the Recall for each class, averaged over the number of classes). Am I wrong?

@ogrisel ogrisel added this to the 0.20 milestone Sep 18, 2018
@amueller
Member Author

I'm too tired (in several ways ;) to make a decision on this but I think it's the last remaining blocker?

@qinhanmin2014
Member

I guess a new option will not block the new release. The thing we need to consider now is whether we need to change the name of the current scorer. Also, if we decide to implement macro_average_accuracy as another option, we might need to provide some references in the user guide.

@jnothman
Member

I'm, FWIW, -1 on a new option. I don't want to perpetuate the misreading of the Guyon et al (Chalearn AutoML) paper where they have inaccurately described their implementation.

@jnothman
Member

I'm okay with having macro-average accuracy available, only I don't know what use it is.

@qinhanmin2014
Member

qinhanmin2014 commented Sep 19, 2018

I'm, FWIW, -1 on a new option. I don't want to perpetuate the misreading of the Guyon et al (Chalearn AutoML) paper where they have inaccurately described their implementation.

So the only reference we have for the so-called macro_average_accuracy is Guyon et al. 2015? If so, I might prefer to close the issue and leave balanced_accuracy_score as it is. I think by saying the average of class-wise accuracy, they actually mean macro_average_recall.

I'm okay with having macro-average accuracy available, only I don't know what use it is.

I'll vote +0 (maybe -1) on including it unless we can find some references which clearly define macro_average_accuracy. At least I can't find any in the PR.

@jnothman
Member

jnothman commented Sep 19, 2018 via email

@qinhanmin2014
Member

The references to variant multiclass balanced accuracy are discussed in model_evaluation.rst

@jnothman Which entry? It seems that class balanced accuracy and the balanced accuracy from Urbanowicz et al. 2015 are not the so-called macro_average_accuracy discussed here?

@jnothman
Member

jnothman commented Sep 20, 2018 via email

@qinhanmin2014
Member

We could provide an Urbanowicz-style implementation but I think it's a really poor reuse of the name "balanced accuracy" for something that has nothing to do with it.

Agreed. I don't think the definition from Urbanowicz et al. 2015 is widely accepted, unless more references are provided.

@jnothman Close the issue?

@jnothman
Member

jnothman commented Sep 20, 2018 via email

@amueller
Member Author

macro averaged accuracy as balanced accuracy

I think @rhiever and @ledell do, right? But OK to leave this closed.

@qinhanmin2014
Member

I think @rhiever and @ledell do, right?

I'm happy to reopen if someone provides some references (other than Guyon et al. 2015).

@jnothman
Member

No one refers to macro averaged accuracy as balanced accuracy.

I think @rhiever and @ledell do, right?

No, @ledell cited macro-averaged accuracy as what Guyon et al call balanced accuracy, and as something offered in H2O, but as mean zero-one loss, not under the name "balanced accuracy". As far as I can glean from above, @rhiever has used the Urbanowicz definition, which I was incorrect above to say incorporates precision (that was Mosley et al.; I need this mess like I need a hole in the head!); rather, it is the average of binary balanced accuracies for each class. (Need I argue against this again?)
