OneSE: the One-Standard-Error Rule #14820
Conversation
jnothman left a comment
Thanks. Yes this might be a good thing to have, but it's not clear why we should support 1se rather than a formal hypothesis test.
This would need tests and user guide / examples
sklearn/model_selection/_search.py (outdated)
def OneSE(cv_results, **kwargs):
We avoid **kwargs. Please define parameters explicitly; this is explained in our developers' guide.
Yes, the kwargs were unfinished since I wasn't sure yet whether there were other parameters folks might want to add. I'm working on the partial function today, and then will squash/rebase!
Squashed, rebased, and force-pushed. The kwargs are gone. I added an additional pass-through function that allows arguments to be attached via a partial directly to the callable. Let me know if this general structure will work here.
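For readers skimming the thread, here is a minimal sketch of the pattern being described: an explicitly parameterized refit helper whose extra arguments are bound with `functools.partial`, so that `GridSearchCV` receives a callable of `cv_results_` alone. The function name, signature, and the way complexity is read off the parameter grid are illustrative assumptions, not the PR's exact code.

```python
from functools import partial

import numpy as np


def one_standard_error(cv_results, param, greater_is_complex=True,
                       scoring="score"):
    """Illustrative refit callable: index of the least complex candidate whose
    mean CV score lies within one standard error of the best candidate's."""
    means = np.asarray(cv_results["mean_test_%s" % scoring])
    stds = np.asarray(cv_results["std_test_%s" % scoring])
    # Infer the number of CV splits from the per-split score keys.
    n_splits = sum(k.startswith("split") and k.endswith("_test_%s" % scoring)
                   for k in cv_results)
    best = np.argmax(means)
    threshold = means[best] - stds[best] / np.sqrt(n_splits)
    candidates = np.flatnonzero(means >= threshold)
    values = np.asarray(cv_results["param_%s" % param], dtype=float)
    # "Least complex" = smallest parameter value when complexity grows with
    # the parameter, largest value otherwise.
    sub = values[candidates]
    picked = np.argmin(sub) if greater_is_complex else np.argmax(sub)
    return int(candidates[picked])


# Extra arguments are bound with functools.partial so that GridSearchCV
# receives a callable taking only cv_results_, as the refit API expects,
# and returning the integer best_index_.
refit_callable = partial(one_standard_error, param="reduce_dim__n_components",
                         greater_is_complex=True, scoring="score")
```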
Whether complexity increases as `param` increases. Default is True.
refit_scoring : str
    Scoring metric.
tol : float
Doesn't this effectively make it not 1se?
Correct. But tol here is coded as an alternative to 1 SE in case users wish to override that threshold with a percentile.
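To make the distinction concrete, here is a hedged sketch of the cutoff logic being described; the exact semantics of `tol` in the PR may differ, and treating it as a fraction-of-the-best threshold is an assumption.

```python
import numpy as np


def score_cutoff(mean_scores, std_scores, n_splits, tol=None):
    """Cutoff below which candidates are no longer 'close enough' to the best.

    By default the cutoff is one standard error below the best mean score; if
    `tol` is supplied (e.g. tol=0.95), a fraction-of-the-best cutoff replaces
    it, which is exactly what makes the rule "not 1 SE" any more.
    """
    best = np.argmax(mean_scores)
    if tol is not None:
        return mean_scores[best] * tol
    return mean_scores[best] - std_scores[best] / np.sqrt(n_splits)
```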
Thanks @dPys, I like the idea of providing a pre-built callable if that's such a pervasive use-case. From your example, though, I don't understand how the parameters that you define eventually affect the refit. As Joel mentioned, please provide some basic tests so we can review more easily ;) Thanks!
Thanks @NicolasHug and @jnothman for reviewing. Since it sounds like this could indeed be useful to have as a pre-built callable, I can clean it up further and add some basic tests (in particular, I think it'd be valuable to show that this works across different kinds of models and parameter types). Can anyone think of a scenario where this would be contraindicated?
I'm still not sure why we should prefer 1se over a hypothesis test like Wilcoxon.
Of course this isn't super flexible in terms of how it measures model complexity. I think that's okay.
I see no reason why we couldn't also include hypothesis testing as another method beyond 1 SE and percentile tolerance. This might also make scikit-learn's OneSE distinct from similar implementations offered by other packages. So that I can get a better sense of what you're thinking, perhaps you could provide a bit more detail (or ideally an example) as to what using Wilcoxon might look like in this context and why this would be superior to 1 SE?
As far as I understand, 1se is a rule-of-thumb approach to approximate insignificant difference. But there are more precise statistical, nonparametric approaches to determining whether one system is significantly better than another.
This is true, but the trouble that I see with Wilcoxon in particular is that the normal approximation is used for the calculations, which means that the samples used would probably need to be pretty large to see the benefit over 1 SE. However, the number of samples in this context equals the length of the vector of a parameter's values (e.g. PCA n_components might be [2, 4, 6, 8]), which in many if not most cases (like in the example above) is going to be n<20. Now, if in creating an optional method for hypothesis testing here we also include a prerequisite that the length of the target parameter's grid is >20, then I think your suggestion would work. In such cases (i.e. where there is a more exhaustive grid search over the target parameter's values), the risk of overfitting will probably be higher anyway, and would therefore benefit more from smoothing via Wilcoxon or some other nonparametric test. Curious to hear your thoughts.
No, I think you're looking at samples in the wrong direction. I understand we are looking for "the least complex model whose performance is insignificantly different from the best". So you compare each model only to the best, where the number of samples is the number of test evaluations, i.e. the number of splits.
I'm confused -- if we were to consider only the number of splits (e.g. 10), wouldn't that still yield too small an N? Do you mean that rather than using the mean (i.e. across CV folds) of the refit scorer of interest (as is currently done in the 1 SE and percentile-threshold cases), we should consider all score samples across all folds? If so, it seems that doing that would yield more than enough N :-)
The point is that the scores are paired. You want to see whether there is a consistent improvement rather than an improvement in the average.
Yes, N is small. That's true for calculating the standard deviation too. Wilcoxon should be possible for small N, though it will not reject the null hypothesis easily... which is sort of the point. Feel free to empirically compare the two with different data sets, or to look for literature to substantiate one approach over the other.
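For concreteness, here is a sketch of what that paired comparison might look like against a `cv_results_` dict; the function name, the `alpha` default, and the `zero_method` handling are illustrative assumptions rather than anything in the PR.

```python
import numpy as np
from scipy.stats import wilcoxon


def insignificant_vs_best(cv_results, scoring="score", alpha=0.05):
    """Indices of candidates whose per-split scores are not significantly
    different from the best candidate's (paired Wilcoxon signed-rank test)."""
    split_keys = [k for k in cv_results
                  if k.startswith("split") and k.endswith("_test_%s" % scoring)]
    # Rows: candidates; columns: CV splits (these are the paired samples).
    split_scores = np.vstack([cv_results[k] for k in split_keys]).T
    best = int(np.argmax(cv_results["mean_test_%s" % scoring]))
    keep = [best]
    for i, scores in enumerate(split_scores):
        if i == best:
            continue
        # Paired test on per-split differences against the best candidate;
        # zero_method="zsplit" keeps exact ties from breaking the test.
        _, p_value = wilcoxon(scores, split_scores[best], zero_method="zsplit")
        if p_value >= alpha:  # cannot reject "no difference from the best"
            keep.append(i)
    return np.array(sorted(keep))
```

A refit callable could then pick the least complex candidate among the returned indices, just as with the 1 SE cutoff.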
Rather than using a function with the user applying partial, why not define a class whose instance can be called as refit?
I like that idea @jnothman! Will see what I can do. This would also make it easier/saner to implement other methods such as Wilcoxon.
Another thing to consider is that 1 SE was created in the context of decision trees, where 1 SE was found to be optimal for balancing tree size with error (hence 1-SE vs. 1-SD). Outside of that context, 1 SD may not be empirically optimal as a hard-and-fast rule. Evaluating 1-SE vs. a hypothesis test across datasets may yield totally different results depending on the type of learner and target parameter used. It seems to me that the best workaround is to simply include each of these methods as options and let the user decide what is most appropriate for the problem at hand?
Also essential are the metric and the test set variability.
Yeah, metric is essential. So is metric-model pairing, and classification vs. regression. I'd have to test what would happen in the case of unbalanced classes. But what's cool is that all of these possibilities become exposed to the user for exploration on a case-by-case basis with the kind of 'meta'-callable that's materializing here. In all honesty, it sounds like this might be more appropriately construed as a generic gridsearch smoothing class for overfit-prevention, rather than '1-SE' in particular (though I do like the name as a throwback).
I don't see how metric-model pairing affects things... or class imbalance. The minimum assumption we can rely on is that a greater metric is better for a given test set (regardless of model, dataset, CV or metric choice). In many specific cases we might be able to assume a roughly normal distribution of scores given a model class.
We'd have to test these assumptions to know for sure. Bear in mind, too, that p<0.05 is also more or less a rule of thumb (like 1-SE). Thus, I am still not convinced that a formal hypothesis test would be 'better' across the board. I think the way to compare these different criteria is to pick a dataset from one of sklearn's examples (e.g. digits), add a varying number of random noise predictors, and then see how effective these different methods are at preventing the noise terms from influencing the fit. That being said, this would just be useful for providing some helpful info for a user guide. From a development standpoint, I don't think those kinds of tests should necessarily hold up integration of the code being proposed here, since in the end all of these methods (1 SE, hypothesis test, percentile tolerance) would ideally be made available for the user to decide what to do.
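For anyone who wants to try that comparison, a rough harness along those lines might look like the following; the pipeline, grid, and number of noise features are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def run_comparison(refit, n_noise=32, seed=0):
    """Fit a grid search on digits with appended noise features and report
    which n_components the given refit strategy selects."""
    X, y = load_digits(return_X_y=True)
    rng = np.random.RandomState(seed)
    X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])
    pipe = Pipeline([("reduce_dim", PCA(random_state=seed)),
                     ("clf", LogisticRegression(max_iter=1000))])
    grid = {"reduce_dim__n_components": [8, 16, 32, 64]}
    search = GridSearchCV(pipe, grid, cv=5, refit=refit)
    search.fit(X_noisy, y)
    return search.best_params_


# refit=True reproduces the default "best mean score" behaviour; swapping in
# the 1 SE, percentile-tolerance, or Wilcoxon callables sketched earlier lets
# you compare how aggressively each one trims away the noise dimensions.
print(run_comparison(refit=True))
```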
@jnothman -- I've updated the PR extensively. These routines have now been renamed to SmoothCV, which reflects a generalization of the OneSE approach for overfit prevention. I've added support for the Wilcoxon method of hypothesis testing across folds, and reorganized the callable partial to query the new SmoothCV class, within which all of these routines are now included. This is still very much a WIP, but for now, please let me know if the Wilcoxon approach was along the lines of what you were thinking. Please also feel free to revise with any of your own edits. Tests to come once we've finished working out any remaining kinks... Cheers,
And usage looks something like this:
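(Illustrative reconstruction of the usage pattern only; `within_one_se_lowest` below is a condensed, hypothetical stand-in for the PR's SmoothCV-based callable, so its name and parameters should not be read as the PR's actual API.)

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def within_one_se_lowest(cv_results, param):
    """Condensed, illustrative stand-in for the PR's SmoothCV-based callable."""
    means = cv_results["mean_test_score"]
    stds = cv_results["std_test_score"]
    n_splits = sum(k.startswith("split") and k.endswith("_test_score")
                   for k in cv_results)
    best = np.argmax(means)
    ok = np.flatnonzero(means >= means[best] - stds[best] / np.sqrt(n_splits))
    values = np.asarray(cv_results["param_%s" % param], dtype=float)
    # Lowest parameter value among candidates within one standard error.
    return int(ok[np.argmin(values[ok])])


pipe = Pipeline([("reduce_dim", PCA(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(
    pipe,
    {"reduce_dim__n_components": [2, 4, 8, 16, 32]},
    cv=5,
    refit=partial(within_one_se_lowest, param="reduce_dim__n_components"),
)
X, y = load_digits(return_X_y=True)
search.fit(X, y)
print(search.best_params_, search.best_index_)
```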
I don't think smooth_cv is a very clear name. What we are trying to say is "refit the simplest model that is insignificantly different from the best, where simplicity is determined from a numeric hyperparameter".
For a name I might rather use simplest, or simplest_by_param, or simplest_best, or razor.
I'm not convinced by combining multiple methods into one name either. I would be happy if the user had to say:
refit=razors.standard_error(param, scoring='roc_auc')
refit=razors.rank_sum_test(param, scoring='roc_auc', alpha=...)
I think this is much more readable than your current proposal.
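A sketch of how such a factory might be structured: a function that captures `param` and `scoring` and returns the refit callable. The `razors` spelling is only the proposal above, not existing scikit-learn API, and the body here is illustrative.

```python
import numpy as np


def standard_error(param, scoring="score", greater_is_complex=True):
    """Return a refit callable applying the 1 SE razor to `param`."""

    def refit(cv_results):
        means = np.asarray(cv_results["mean_test_%s" % scoring])
        stds = np.asarray(cv_results["std_test_%s" % scoring])
        n_splits = sum(k.startswith("split") and k.endswith("_test_%s" % scoring)
                       for k in cv_results)
        best = np.argmax(means)
        ok = np.flatnonzero(means >= means[best] - stds[best] / np.sqrt(n_splits))
        values = np.asarray(cv_results["param_%s" % param], dtype=float)
        pick = np.argmin(values[ok]) if greater_is_complex else np.argmax(values[ok])
        return int(ok[pick])

    return refit


# The user-facing spelling then matches the proposal, e.g.:
#   GridSearchCV(..., refit=standard_error("reduce_dim__n_components"))
# and rank_sum_test(param, scoring=..., alpha=...) could follow the same
# shape, using the Wilcoxon filter sketched earlier in the thread.
```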
I'm referring to Ockham's razor btw
Commits:
OneSE
OneSE - fix typo in docstring
OneSE - add function partial for adding explicit user arguments to callable
Change name to SmoothCV, add support for Wilcoxon method, reorganize callable partial to query the new SmoothCV class.
rename to RazorCV, create routine-independent callables
Squashed/rebased changes. Improvements: The current solution works, but there may be other ways to implement it even more cleanly -- e.g. splitting things up into two classes (one containing the universal routines, and the other containing each callable method, drawing from the first class). Before going down that road, however, I wanted to get your thoughts, since there are many ways to go about this. Either way, I think we've got a nice WIP here.
Updated working example:
Please avoid squashing and force-pushing. We can squash upon merge, but in the meantime it makes a mess of the commit history.
Oh alright, sorry about that. I see now that the push caught #14771 by mistake. |
Assuming we are at least mostly happy with the current design, it seems like a good next step would be to fuzz-test functionality with more types of learners, scoring methods, and feature-selecting parameters (i.e. beyond PCA)? ...maybe even create a parameterized test across various combinations of available options currently offered with scikit-learn? Do you know of any potential volunteers who'd be interested in assisting with this? Thanks again for the ongoing help and encouragement :-)
Reference Issues/PRs
https://github.com/scikit-learn/scikit-learn/blob/master/examples/model_selection/plot_grid_search_refit_callable.py
See also #11269, #11354, #12865, and #9499.
What does this implement/fix? Explain your changes.
As discussed briefly with @amueller and @NicolasHug last week, this is an enhancement to the latest implementation of sklearn's refit callable functionality. The aim is to provide a more generic set of methods for balancing model complexity with CV performance via model selection 'smoothing'. In the context of a highly versatile package such as sklearn, where it becomes possible to fit the vast majority of model types using an extensive set of parameters and scorers, the risk of overfitting may be more pressing. To ameliorate this, OneSE might be especially valuable for final model estimation in CV.
Any other comments?
The challenge with implementing such a tool has always been generalizing its functionality to all types of models, parameters, and scoring methods. In particular, defining model 'complexity' is itself relatively context-dependent. This PR lays out a prototype that allows these definitions to be user-determined, rather than hard-coded defaults. Currently, both _error and _score type scorers are supported, along with multi-metric scoring. Either 1 SD bounds can be used, or a percentile tolerance can be specified. Feel free to revise, rewrite, and discuss!
Expanding upon the excellent example recently created by @jiaowoshabi:
Let me know what you think
@dPys