WIP new, simpler scorer API #2123
Conversation

---

No, this is still the old proposal. Sorry, I'll change it.

---
I'm definitely +1 for getting rid of
Was there a reason for not using
And as to my PRs: I only require that there be a way to get more information than a single objective score (multiple class scores; P/R/F) back from the scorer. E.g. the scorer could optionally return a tuple:

```python
score = scorer(estimator, X, y)
try:
    score, score_data = score
except TypeError:
    score_data = {'score': score}
```

---

Indeed. Funny that `make test` still passes.

---

It shouldn't pass since #2033... Might want to check that out.

---
I know it's still WIP, but I thought I'd just mention this assert failure from Travis in the meantime. That aside, looks good to me so far. 👍

---
Strange,

---
Yeah, it passed on mine too when I checked this branch. Travis is annoying like that sometimes, haha.

---
@jnothman the reason not to use

---
I guess I was asking whether there was more than that: I actually find it quite clear to report negative loss being climbed positively towards 0. At least, loss is strictly non-negative (as opposed to some ranking metrics), so it should be unambiguous. The confusion is that it's the opposite of what you get when you use the metric function directly.

---
@larsmans, despite your fix for CV over MSE, there's a more substantial deprecation issue: If someone passes a
Or we could have scorers returning tuples so this doesn't happen.

---
@jnothman You mean an
Compare the situation to that in

---
No, I mean a
(And I didn't mean to imply that the deprecation was excessively imposing,

---
Yes, I agree with the tuple return value. For PRF I'd say we return values of a namedtuple type. That's simpler than a `(scalar, dict)` pair.
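
A namedtuple return type, as floated here, could look like the following sketch (the type name `PRFScore` is my illustration, not a name from the PR):

```python
from collections import namedtuple

# Hypothetical PRF return type; the name PRFScore is assumed for
# illustration and does not come from the PR.
PRFScore = namedtuple('PRFScore', ['precision', 'recall', 'fscore'])

score = PRFScore(precision=0.8, recall=0.5, fscore=8 / 13)
# Fields are accessible by name or by position, like a plain tuple.
print(score.precision, score[0])
```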

---
Yes, I see some advantages in the namedtuple. I had thought at one point of a dict where the field 'score' was required, and if a scalar is returned it's interpreted as such. If it returns a namedtuple, we'll probably land up calling
(One further annoyance with the namedtuple for PRF is that the

---
We could name the scorer differently, like

---
I had a quick look, and my general feeling is that I like this approach and think it is an improvement over the current code. The good news is that the current code wasn't in a released version, right? So we won't have to go through deprecation and users hating us.

---

I think that the name `greater_is_better` is more explicit than `sign`.

---

It's actually going to be a sign ∈ {-1, 1}:

```python
sign = 1 if greater_is_better else -1
```

(Line 160.) This is a private API, btw.

---

No scorer-related stuff was ever released, so we're safe.

---
Do we want to rename `Scorer` to `Evaluator`? I find `Scorer` rather ambiguous.

---
Rebased on master. I can haz reviews?

---
I don't particularly mind
Would you advocate changing the
Remember also that the default
Sounds very French to me (Lars, are we having a bad influence on you?). I actually like `Scorer` as it relates well to `cross_val_score`, and the

---

Very well :)

---

Does this mean that the behaviour of f1 has changed? If I am not wrong, it was previously returning single values, while it now returns tuples. Can this break user code in any way?

---

(I am not sure this was shipped in 0.13.1, so my question might not be relevant.)

---

Actually, this also seems related to the errors reported by Travis in the tests.

---

Nothing was ever shipped. And yes, this is causing the test failure; will look into that.

---

Yes, this requires more changes to score averaging, and I think it'll be better in a separate PR...

---

This looks like a nice improvement! Once my comments above regarding the changes in F1 scores are addressed, I am +1 for merge.

---
One thing I liked with the previous API that made
For instance, I would like to have grid search
The new public API hides too many implementation details to be able to implement this efficiently from a third-party tool like hyperopt, or even to implement this as options in the

---
You can do all those things with custom scorers. They're allowed to return tuples holding arbitrary data, and they're still arbitrary Python callables. Partial example:
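
The stripped example could be along these lines: a hand-rolled custom scorer (hypothetical, not taken from the PR) that returns the objective score plus a dict of extra data:

```python
# Hypothetical custom scorer for a binary task. Any callable with the
# signature (estimator, X, y) is accepted; this one returns a
# (score, extra_data) tuple instead of a bare scalar.
def f1_with_details(estimator, X, y):
    y_pred = estimator.predict(X)
    tp = sum(1 for t, p in zip(y, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, {'precision': precision, 'recall': recall}
```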

---

I'm not convinced by the format of this. Do we really need a report function that's little different from `pprint(search.cv_scores_)`?

---

Or, indeed, which is identical to `print(*search.cv_scores_, file=file, sep='\n')`?

---

I think it would be much more useful to output something like a CSV, but that requires interpreting the data more.

---

It's a proof of concept. I wanted to make clear in some way that just `print(cv_scores_)` doesn't give all the information. If you know a better solution (e.g. document the pprint trick?), I'm up for suggestions.

---

> I'm not convinced by the format of this. Do we really need a report function that's little different from `pprint(search.cv_scores_)`?

In the long run, we might want such features, but in the short run, I'd rather avoid.

---

> e.g. document the pprint trick?

I think that teaching people to use pprint is a good idea.

---

I don't think pprint is wonderful either. Afaik it only knows about the basic standard collections (list, tuple, dict) and reprs everything else, including namedtuples, defaultdicts, arrays, etc., on the basis that its output should be eval-able (except that most repr implementations don't support that).
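
For instance (stdlib behaviour as of current CPython): a plain dict gets wrapped across lines at a narrow width, while a namedtuple hits pprint's repr fallback and stays on one line:

```python
from collections import namedtuple
from pprint import pformat

Score = namedtuple('Score', ['score', 'params'])

# A plain dict is wrapped across lines at a narrow width...
print(pformat({'score': 0.9, 'params': {'C': 1, 'gamma': 0.1}}, width=20))
# ...but a namedtuple just gets its one-line repr, however narrow.
print(pformat(Score(0.9, {'C': 1, 'gamma': 0.1}), width=20))
```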

---

Also, I forgot one very important use case: compute the same scores on the training split of the data, to be able to detect underfitting and overfitting at the same time. But the meta custom scorer would have to know (hardcoding) whether the subscorers "need thresholds", whether greater is better, and so on. Ideally I would like to have the default
I think it would be neat if it were possible to have:
Also, for the multi-class classification problem, one would like to collect the confusion matrix (and/or the detailed per-class P/R/F1) prior to multiclass averaging.

---

Also, making the scorer responsible for dumping the raw predicted probabilities to the hard drive sounds like a design error (separation of concerns).

---

I feel like having two boolean parameters is a bit confusing. What is the difference between `needs_proba=True, needs_threshold=False` and `needs_proba=True, needs_threshold=True`?

---

Perhaps more explicit would be something like: default `input_type='labels'` vs. `input_type='binary_threshold'` and `input_type='class_proba'`.

---

Good point, except that "labels" doesn't cover the regression use case. Also, whether the decision function must give 1-d results is not this function's responsibility; the metric will just have to raise an exception. I suggest `prediction_method`.

---

No wait, that doesn't really replace `needs_threshold`, because that backs off to `predict_proba` when there's no `decision_function`.

---

Also, `ThresholdScorer` slices `[:, 1]` in that case, where `ProbaScorer` does not.

---

The other option is to keep it this way and raise an exception when both are true...

---

I'm just going to raise the exception. Rebase in progress.
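
A sketch of the check settled on here (parameter names come from the discussion; the dispatch shown is my reading of it, not the actual implementation):

```python
# Sketch: validate the two flags and pick the estimator method a
# scorer would use. Raising on the ambiguous combination is the
# resolution chosen in the discussion above.
def choose_prediction_method(needs_proba=False, needs_threshold=False):
    if needs_proba and needs_threshold:
        raise ValueError("Set needs_proba or needs_threshold, not both.")
    if needs_proba:
        return 'predict_proba'
    if needs_threshold:
        # would fall back to predict_proba when decision_function is absent
        return 'decision_function'
    return 'predict'
```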

---

I'm not sure if I'm happy with the scope of the PR.

---
I just realized the issue message actually is pretty descriptive, but it is not entirely clear which changes are for the last point and which are for the rest. |

---

I have suggested the handling of multiple metrics should be a separate PR (as much as I am anxious to have the feature, I want it to be given some thought), and @ogrisel seemed to concur, as we need to consider its interaction with other extensibility in grid search output.

---
I agree. We should move forward with the scorer refactoring. I plan to work on the multiple-metrics feature in the next couple of days. We want to release soon, and we should get stuff into a state that is future-ready but doesn't make changes that we might regret afterwards (like using named tuples ;)

---

Not capitalized maybe? Or maybe `scoring`?

---
@amueller @ogrisel @jnothman The scope of the PR is to simplify the scorers so that they are just functions. There are currently only two commits: the first establishes a simple API, the second exemplifies how it can be extended. If we want something that works right now, we can just pull the first commit and leave the structured score stuff for later.

As for @ogrisel's requirement of passing lots of information around, and the complaint about separation of concerns: that's independent of the exact API. When you adorn scorers with all kinds of attributes to store generic information about the CV procedure, you're not doing proper separation of concerns.

---

I think `make_scorer` should be public API, so it should be in classes.rst.

---

It was intended to be; will add it.

---
Small remark: when in a rush, I prioritize PRs that are 'MRG' rather than 'WIP', so I missed this one.

---

+1 for merging the first commit.

---

I changed the scoring from accuracy (the default) to F1 score to demo and test the structured return values from `f_scorer`; and F1 score ≤ accuracy, which is also why the best parameter set changed.

---

The function is still missing from the references.

---

I was about to point it out :)

---
Yes, and the description of the scorer protocol mentioned tuples while it shouldn't, for now. Will force-push a new version when I'm confident the tests pass on current master.

---
OK, pushed the changes. The test failures are in the second commit.

- A `Scorer` is now a function that returns a score that should be maximized.
- Added a `report` method to `GridSearchCV` to use it.

---
OK, pushed the first commit to master. I'm closing this PR because we'll have to rethink, or at least discuss, the structured score stuff.

---

Thanks a lot @larsmans :)

---
Thank you for setting this up, Andy. This API turns out to be exactly what I needed to do better optimization :)

---

Here's a simpler scorer API. Sorry for being so late to the game; I was too busy to bother with the discussions. I hope no one takes offense (esp. @amueller, who worked hard on this). Ping @jnothman. Anyway, the highlights:

- Scorers are plain callables taking `estimator, X, y, **kwargs`.
- No more `greater_is_better`. A scorer made from a loss function returns minus the loss.
- No more `Scorer` class. There's a factory function `make_scorer`, but it produces objects of an internal class hierarchy that do their best to hide as callables.

TODO:

- Should `make_scorer` be public at all?