OneSE: the One-Standard-Error Rule #14820
Conversation
jnothman left a comment
Thanks. Yes this might be a good thing to have, but it's not clear why we should support 1se rather than a formal hypothesis test.
This would need tests and user guide / examples
sklearn/model_selection/_search.py (outdated)
def OneSE(cv_results, **kwargs):
We avoid **kwargs. Please define parameters explicitly; this is explained in our developers' guide.
Yes, the kwargs were unfinished since I wasn't sure yet whether there were other parameters folks might want to add. I'm working on the partial function today, and then will squash/rebase!
Squashed, rebased, and force-pushed. The kwargs are gone. I added an additional pass-through function that allows arguments to be attached via a partial directly to the callable. Let me know if this general structure will work here.
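For readers skimming the thread, here is a minimal sketch of the pattern being described: an explicitly parameterized refit helper whose extra arguments are bound with `functools.partial`, so that `GridSearchCV` receives a callable of `cv_results_` alone. The function name, signature, and the way complexity is read off the parameter grid are illustrative assumptions, not the PR's exact code.

```python
from functools import partial

import numpy as np


def one_standard_error(cv_results, param, greater_is_complex=True,
                       scoring="score"):
    """Illustrative refit callable: index of the least complex candidate whose
    mean CV score lies within one standard error of the best candidate's."""
    means = np.asarray(cv_results["mean_test_%s" % scoring])
    stds = np.asarray(cv_results["std_test_%s" % scoring])
    # Infer the number of CV splits from the per-split score keys.
    n_splits = sum(k.startswith("split") and k.endswith("_test_%s" % scoring)
                   for k in cv_results)
    best = np.argmax(means)
    threshold = means[best] - stds[best] / np.sqrt(n_splits)
    candidates = np.flatnonzero(means >= threshold)
    values = np.asarray(cv_results["param_%s" % param], dtype=float)
    # "Least complex" = smallest parameter value when complexity grows with
    # the parameter, largest value otherwise.
    sub = values[candidates]
    picked = np.argmin(sub) if greater_is_complex else np.argmax(sub)
    return int(candidates[picked])


# Extra arguments are bound with functools.partial so that GridSearchCV
# receives a callable taking only cv_results_, as the refit API expects,
# and returning the integer best_index_.
refit_callable = partial(one_standard_error, param="reduce_dim__n_components",
                         greater_is_complex=True, scoring="score")
```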
Whether complexity increases as `param` increases. Default is True.
refit_scoring : str
    Scoring metric.
tol : float
Doesn't this effectively make it not 1se?
Correct. But tol here is coded as an alternative to 1 SE in case users wish to override that threshold with a percentile.
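To make the distinction concrete, here is a hedged sketch of the cutoff logic being described; the exact semantics of `tol` in the PR may differ, and treating it as a fraction-of-the-best threshold is an assumption.

```python
import numpy as np


def score_cutoff(mean_scores, std_scores, n_splits, tol=None):
    """Cutoff below which candidates are no longer 'close enough' to the best.

    By default the cutoff is one standard error below the best mean score; if
    `tol` is supplied (e.g. tol=0.95), a fraction-of-the-best cutoff replaces
    it, which is exactly what makes the rule "not 1 SE" any more.
    """
    best = np.argmax(mean_scores)
    if tol is not None:
        return mean_scores[best] * tol
    return mean_scores[best] - std_scores[best] / np.sqrt(n_splits)
```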
Thanks @dPys, I like the idea of providing a pre-built callable if that's such a pervasive use-case. From your example, though, I don't understand how the parameters that you define eventually affect the refit. As Joel mentioned, please provide some basic tests so we can review more easily ;) Thanks!
Thanks @NicolasHug and @jnothman for reviewing. Since it sounds like this could indeed be useful to have as a pre-built callable, I can clean it up further and add some basic tests (in particular, I think it'd be valuable to show that this works across different kinds of models and parameter types). Can anyone think of a scenario where this would be contraindicated?
I'm still not sure why we should prefer 1se over a hypothesis test like Wilcoxon.
Of course this isn't super flexible in terms of how it measures model complexity. I think that's okay.
I see no reason why we couldn't also include hypothesis testing as another method beyond 1 SE and percentile tolerance. This might also make scikit-learn's OneSE distinct from similar implementations offered by other packages. So that I can get a better sense of what you're thinking, perhaps you could provide a bit more detail (or ideally an example) as to what using Wilcoxon might look like in this context and why this would be superior to 1 SE?
As far as I understand, 1se is a rule-of-thumb approach to approximate insignificant difference. But there are more precise statistical, nonparametric approaches to determining whether one system is significantly better than another.
This is true, but the trouble that I see with Wilcoxon in particular is that the normal approximation is used for the calculations, which means that the samples used would probably need to be pretty large to see the benefit over 1 SE. However, the number of samples in this context equals the length of the vector of a parameter's values (e.g. PCA n_components might be [2, 4, 6, 8]), which in many if not most cases (like in the example above) is going to be n<20. Now, if in creating an optional method for hypothesis testing here we also include a prerequisite that the length of the target parameter's grid is >20, then I think your suggestion would work. In such cases (i.e. where there is a more exhaustive grid search over the target parameter's values), the risk of overfitting will probably be higher anyway, and would therefore benefit more from smoothing via Wilcoxon or some other nonparametric test. Curious to hear your thoughts.
No, I think you're looking at samples in the wrong direction. I understand we are looking for "the least complex model whose performance is insignificantly different from the best". So you compare each model only to the best, where the number of samples is the number of test evaluations, i.e. the number of splits.
I'm confused -- if we were to consider only the number of splits (e.g. 10), wouldn't that still yield too small an N? Do you mean that rather than using the mean (i.e. across CV folds) of the refit scorer of interest (as is currently done in the 1 SE and percentile-threshold cases), we should consider all score samples across all folds? If so, it seems that doing that would yield more than enough N :-)
The point is that the scores are paired. You want to see whether there is a consistent improvement rather than an improvement in the average.
Yes, N is small. That's true for calculating the standard deviation too. Wilcoxon should be possible for small N, though it will not reject the null hypothesis easily... which is sort of the point. Feel free to empirically compare the two with different data sets, or to look for literature to substantiate one approach over the other.
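For concreteness, here is a sketch of what that paired comparison might look like against a `cv_results_` dict; the function name, the `alpha` default, and the `zero_method` handling are illustrative assumptions rather than anything in the PR.

```python
import numpy as np
from scipy.stats import wilcoxon


def insignificant_vs_best(cv_results, scoring="score", alpha=0.05):
    """Indices of candidates whose per-split scores are not significantly
    different from the best candidate's (paired Wilcoxon signed-rank test)."""
    split_keys = [k for k in cv_results
                  if k.startswith("split") and k.endswith("_test_%s" % scoring)]
    # Rows: candidates; columns: CV splits (these are the paired samples).
    split_scores = np.vstack([cv_results[k] for k in split_keys]).T
    best = int(np.argmax(cv_results["mean_test_%s" % scoring]))
    keep = [best]
    for i, scores in enumerate(split_scores):
        if i == best:
            continue
        # Paired test on per-split differences against the best candidate;
        # zero_method="zsplit" keeps exact ties from breaking the test.
        _, p_value = wilcoxon(scores, split_scores[best], zero_method="zsplit")
        if p_value >= alpha:  # cannot reject "no difference from the best"
            keep.append(i)
    return np.array(sorted(keep))
```

A refit callable could then pick the least complex candidate among the returned indices, just as with the 1 SE cutoff.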
Rather than using a function with the user applying partial, why not define a class whose instance can be called as refit?
I like that idea @jnothman! Will see what I can do. This would also make it easier/saner to implement other methods such as Wilcoxon.
Another thing to consider is that 1 SE was created in the context of decision trees, where 1 SE was found to be optimal for balancing tree size with error (hence 1-SE vs. 1-SD). Outside of that context, 1 SD may not be empirically optimal as a hard-and-fast rule. Evaluating 1-SE vs. a hypothesis test across datasets may yield totally different results depending on the type of learner and target parameter used. It seems to me that the best workaround is to simply include each of these methods as options and let the user decide what is most appropriate for the problem at hand?
Also essential are the metric and the test set variability.
Yeah, metric is essential. So is metric-model pairing, and classification vs. regression. I'd have to test what would happen in the case of unbalanced classes. But what's cool is that all of these possibilities become exposed to the user for exploration on a case-by-case basis with the kind of 'meta'-callable that's materializing here. In all honesty, it sounds like this might be more appropriately construed as a generic gridsearch smoothing class for overfit-prevention, rather than '1-SE' in particular (though I do like the name as a throwback).
I don't see how metric-model pairing affects things... or class imbalance. The minimum assumption we can rely on is that a greater metric is better for a given test set (regardless of model, dataset, CV or metric choice). In many specific cases we might be able to assume a roughly normal distribution of scores given a model class.
We'd have to test these assumptions to know for sure. Bear in mind, too, that p<0.05 is also more or less a rule of thumb (like 1-SE). Thus, I am still not convinced that a formal hypothesis test would be 'better' across the board. I think the way to compare these different criteria is to pick a dataset from one of sklearn's examples (e.g. digits), add a varying number of random noise predictors, and then see how effective these different methods are at preventing the noise terms from influencing the fit. That being said, this would just be useful for providing some helpful info for a user guide. From a development standpoint, I don't think those kinds of tests should necessarily hold up integration of the code being proposed here, since in the end all of these methods (1 SE, hypothesis test, percentile tolerance) would ideally be made available for the user to decide what to do.
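For anyone who wants to try that comparison, a rough harness along those lines might look like the following; the pipeline, grid, and number of noise features are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def run_comparison(refit, n_noise=32, seed=0):
    """Fit a grid search on digits with appended noise features and report
    which n_components the given refit strategy selects."""
    X, y = load_digits(return_X_y=True)
    rng = np.random.RandomState(seed)
    X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])
    pipe = Pipeline([("reduce_dim", PCA(random_state=seed)),
                     ("clf", LogisticRegression(max_iter=1000))])
    grid = {"reduce_dim__n_components": [8, 16, 32, 64]}
    search = GridSearchCV(pipe, grid, cv=5, refit=refit)
    search.fit(X_noisy, y)
    return search.best_params_


# refit=True reproduces the default "best mean score" behaviour; swapping in
# the 1 SE, percentile-tolerance, or Wilcoxon callables sketched earlier lets
# you compare how aggressively each one trims away the noise dimensions.
print(run_comparison(refit=True))
```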
@jnothman -- I've updated the PR extensively. These routines have now been renamed to SmoothCV, which reflects a generalization of the OneSE approach for overfit prevention. I've added support for the Wilcoxon method of hypothesis testing across folds, and reorganized the callable partial to query the new SmoothCV class, within which all of these routines are now included. This is still very much a WIP, but for now, please let me know if the Wilcoxon approach was along the lines of what you were thinking. Please also feel free to revise with any of your own edits. Tests to come once we've finished working out any remaining kinks... Cheers,
And usage looks something like this:
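(Illustrative reconstruction of the usage pattern only; `within_one_se_lowest` below is a condensed, hypothetical stand-in for the PR's SmoothCV-based callable, so its name and parameters should not be read as the PR's actual API.)

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def within_one_se_lowest(cv_results, param):
    """Condensed, illustrative stand-in for the PR's SmoothCV-based callable."""
    means = cv_results["mean_test_score"]
    stds = cv_results["std_test_score"]
    n_splits = sum(k.startswith("split") and k.endswith("_test_score")
                   for k in cv_results)
    best = np.argmax(means)
    ok = np.flatnonzero(means >= means[best] - stds[best] / np.sqrt(n_splits))
    values = np.asarray(cv_results["param_%s" % param], dtype=float)
    # Lowest parameter value among candidates within one standard error.
    return int(ok[np.argmin(values[ok])])


pipe = Pipeline([("reduce_dim", PCA(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(
    pipe,
    {"reduce_dim__n_components": [2, 4, 8, 16, 32]},
    cv=5,
    refit=partial(within_one_se_lowest, param="reduce_dim__n_components"),
)
X, y = load_digits(return_X_y=True)
search.fit(X, y)
print(search.best_params_, search.best_index_)
```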
I don't think smooth_cv is a very clear name. What we are trying to say is "refit the simplest model that is insignificantly different from the best, where simplicity is determined from a numeric hyperparameter".
For a name I might rather use simplest, or simplest_by_param, or simplest_best, or razor.
I'm not convinced by combining multiple methods into one name either. I would be happy if the user had to say:
refit=razors.standard_error(param, scoring='roc_auc')
refit=razors.rank_sum_test(param, scoring='roc_auc', alpha=...)
I think this is much more readable than your current proposal.
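A sketch of how such a factory might be structured: a function that captures `param` and `scoring` and returns the refit callable. The `razors` spelling is only the proposal above, not existing scikit-learn API, and the body here is illustrative.

```python
import numpy as np


def standard_error(param, scoring="score", greater_is_complex=True):
    """Return a refit callable applying the 1 SE razor to `param`."""

    def refit(cv_results):
        means = np.asarray(cv_results["mean_test_%s" % scoring])
        stds = np.asarray(cv_results["std_test_%s" % scoring])
        n_splits = sum(k.startswith("split") and k.endswith("_test_%s" % scoring)
                       for k in cv_results)
        best = np.argmax(means)
        ok = np.flatnonzero(means >= means[best] - stds[best] / np.sqrt(n_splits))
        values = np.asarray(cv_results["param_%s" % param], dtype=float)
        pick = np.argmin(values[ok]) if greater_is_complex else np.argmax(values[ok])
        return int(ok[pick])

    return refit


# The user-facing spelling then matches the proposal, e.g.:
#   GridSearchCV(..., refit=standard_error("reduce_dim__n_components"))
# and rank_sum_test(param, scoring=..., alpha=...) could follow the same
# shape, using the Wilcoxon filter sketched earlier in the thread.
```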
I'm referring to Ockham's razor btw
Commits:
OneSE
OneSE - fix typo in docstring
OneSE - add function partial for adding explicit user arguments to callable
Change name to SmoothCV, add support for Wilcoxon method, reorganize callable partial to query the new SmoothCV class.
rename to RazorCV, create routine-independent callables
Squashed/rebased changes. Improvements: The current solution works, but there may be other ways to implement it even more cleanly -- e.g. splitting things up into two classes (one containing the universal routines, and the other containing each callable method, drawing from the first class). Before going down that road, however, I wanted to get your thoughts, since there are many ways to go about this. Either way, I think we've got a nice WIP here.
Updated working example:
Please avoid squashing and force-pushing. We can squash upon merge, but in the meantime it makes a mess of the commit history.
Oh alright, sorry about that. I see now that the push caught #14771 by mistake. |
Assuming we are at least mostly happy with the current design, it seems like a good next step would be to fuzz-test functionality with more types of learners, scoring methods, and feature-selecting parameters (i.e. beyond PCA)? ...maybe even create a parameterized test across various combinations of available options currently offered with scikit-learn? Do you know of any potential volunteers who'd be interested in assisting with this? Thanks again for the ongoing help and encouragement :-)
Reference Issues/PRs
https://github.com/scikit-learn/scikit-learn/blob/master/examples/model_selection/plot_grid_search_refit_callable.py
See also #11269, #11354, #12865, and #9499.
What does this implement/fix? Explain your changes.
As discussed briefly with @amueller and @NicolasHug last week, this is an enhancement to the latest implementation of sklearn's refit callable functionality. The aim is to provide a more generic set of methods for balancing model complexity with CV performance via model selection 'smoothing'. In the context of a highly versatile package such as sklearn, where it becomes possible to fit the vast majority of model types using an extensive set of parameters and scorers, the risk of overfitting may be more pressing. To ameliorate this, OneSE might be especially valuable for final model estimation in CV.
Any other comments?
The challenge with implementing such a tool has always been generalizing its functionality to all types of models, parameters, and scoring methods. In particular, defining model 'complexity' is itself relatively context-dependent. This PR lays out a prototype that allows these definitions to be user-determined, rather than hard-coded defaults. Currently, both _error and _score type scorers are supported, along with multi-metric scoring. Either 1 SD bounds can be used, or a percentile tolerance can be specified. Feel free to revise, rewrite, and discuss!
Expanding upon the excellent example recently created by @jiaowoshabi:
Let me know what you think
@dPys