
Conversation

@dPys commented Aug 27, 2019

Reference Issues/PRs

https://github.com/scikit-learn/scikit-learn/blob/master/examples/model_selection/plot_grid_search_refit_callable.py
See also #11269. See also #11354. See also #12865. See also #9499.

What does this implement/fix? Explain your changes.

As discussed briefly with @amueller and @NicolasHug last week, this is an enhancement to the latest implementation of sklearn's refit callable functionality. The aim is to provide a more generic set of methods for balancing model complexity against CV performance via model-selection 'smoothing'. In a package as versatile as sklearn, where the vast majority of model types can be fit with an extensive set of parameters and scorers, the risk of overfitting during model selection is correspondingly higher. To mitigate this, a one-standard-error (OneSE) rule could be especially valuable for selecting the final model to refit after CV.

Any other comments?

The challenge with implementing such a tool has always been generalizing it to all types of models, parameters, and scoring methods; in particular, the definition of model 'complexity' is context-dependent. This PR lays out a prototype that leaves these definitions user-determined rather than hard-coded. Currently, both _error- and _score-type scorers are supported, along with multi-metric scoring, and the selection threshold can be either a 1-SE bound or a user-specified percentile tolerance. Feel free to revise, rewrite, and discuss!
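For reviewers less familiar with the one-standard-error rule, here is a minimal, illustrative sketch of the selection logic over cv_results_ (the helper name and signature are hypothetical, not the implementation in this PR):

import numpy as np

def one_se_best_index(cv_results, param='reduce_dim__n_components',
                      scoring='accuracy', greater_is_complex=True):
    # Hypothetical helper: pick the least complex candidate whose mean test
    # score falls within one standard error of the best mean test score.
    means = np.asarray(cv_results['mean_test_%s' % scoring])
    stds = np.asarray(cv_results['std_test_%s' % scoring])
    n_splits = sum(1 for k in cv_results
                   if k.startswith('split') and k.endswith('test_%s' % scoring))
    best = int(np.argmax(means))
    threshold = means[best] - stds[best] / np.sqrt(n_splits)
    candidates = np.flatnonzero(means >= threshold)
    complexity = np.asarray(cv_results['param_%s' % param], dtype=float)
    # Among the candidates that clear the threshold, prefer the simplest model.
    if greater_is_complex:
        return int(candidates[np.argmin(complexity[candidates])])
    return int(candidates[np.argmax(complexity[candidates])])

GridSearchCV calls the refit callable with cv_results_ and uses the returned integer as best_index_, which is the contract any of these helpers has to satisfy.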

Expanding upon the excellent example recently created by @jiaowoshabi:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection._search import OneSE
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
        ('reduce_dim', PCA(random_state=42)),
        ('classify', LinearSVC(random_state=42)),
])

param_grid = {
    'reduce_dim__n_components': [2, 4, 6, 8]
}

# Here we set our OneSE parameters; ideally these would be explicit
# arguments to the OneSE callable passed to `refit` below:
param = 'n_components'
greater_is_complex = True
refit_scoring = 'accuracy'
tol = None

grid = GridSearchCV(pipe, cv=10, n_jobs=1, param_grid=param_grid,
                    scoring=['accuracy', 'neg_mean_squared_log_error'], refit=OneSE)
digits = load_digits()
grid.fit(digits.data, digits.target)

n_components = grid.cv_results_['param_reduce_dim__n_components']
test_scores = grid.cv_results_['mean_test_accuracy']

plt.figure()
plt.bar(n_components, test_scores, width=1.3, color='b')

plt.axhline(np.max(test_scores), linestyle='--', color='y', label='Best score')

plt.title("Balance model complexity and cross-validated score")
plt.xlabel('Number of PCA components used')
plt.ylabel('Digit classification accuracy')
plt.xticks(n_components.tolist())
plt.ylim((0, 1.0))
plt.legend(loc='upper left')

best_index_ = grid.best_index_

print("The best_index_ is %d" % best_index_)
print("The n_components selected is %d" % n_components[best_index_])
print("The corresponding accuracy score is %.2f"
      % grid.cv_results_['mean_test_accuracy'][best_index_])
plt.show()

Let me know what you think
@dPys

@jnothman (Member) left a comment

Thanks. Yes this might be a good thing to have, but it's not clear why we should support 1se rather than a formal hypothesis test.

This would need tests and user guide / examples




def OneSE(cv_results, **kwargs):
Member

We avoid **kw. Please define parameters explicitly. This is explained in our developers' guide

Author

Yes, the **kwargs was a placeholder since I wasn't yet sure which other parameters folks might want to add. Working on the partial function today, and then I will squash/rebase!

Author

Squashed, rebased, and force-pushed. The **kwargs is gone. I added an additional pass-through function that allows arguments to be attached via a partial directly to the callable. Let me know if this general structure will work here.
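For illustration, the binding looks roughly like this (reusing the hypothetical one_se_best_index helper sketched in the PR description above; names are placeholders, not the exact code in this PR):

from functools import partial

# Bind the user-facing arguments up front; GridSearchCV then only ever sees
# a one-argument callable of cv_results_ whose return value is best_index_.
one_se_refit = partial(one_se_best_index,
                       param='reduce_dim__n_components',
                       scoring='accuracy',
                       greater_is_complex=True)
# e.g. GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy',
#                   refit=one_se_refit, cv=10)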

Whether complexity increases as `param` increases. Default is True.
refit_scoring : str
Scoring metric.
tol : float
Member

Doesn't this effectively make it not 1se?

@dPys (Author), Aug 27, 2019

Correct. But tol here is coded as an alternative to 1 SE in case users wish to override that threshold with a percentile.
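Purely as an illustration of that trade-off (the exact semantics of tol in this PR may differ), a percentile tolerance could translate into a score threshold like this:

import numpy as np

def tolerance_threshold(mean_scores, tol):
    # Hypothetical: accept any candidate within `tol` percent of the best
    # mean score, instead of within one standard error of it.
    mean_scores = np.asarray(mean_scores, dtype=float)
    best = mean_scores.max()
    return best - (tol / 100.0) * abs(best)

# tolerance_threshold([0.90, 0.92, 0.93], tol=2) -> 0.9114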

@NicolasHug (Member)

Thanks @dPys, I like the idea of providing a pre-built callable if that's such a pervasive use-case.

@dPys, from your example I don't understand how the parameters that you define eventually affect OneSE, since you don't pass them around. Should there be a partial somewhere?

As Joel mentioned, please provide some basic tests so we can review more easily ;)

Thanks!

@dPys (Author) commented Aug 27, 2019

Thanks @NicolasHug and @jnothman for reviewing. Since it sounds like this could indeed be useful to have as a pre-built callable, I can clean it up further and add some basic tests (in particular, I think it'd be valuable to show that this works across different kinds of models and parameter types). Can anyone think of a scenario where this would be contraindicated?

@jnothman (Member) commented Aug 27, 2019 via email

@dPys (Author) commented Aug 27, 2019

@jnothman ,

I see no reason why we couldn't also include hypothesis testing as another method beyond 1 SE and percentile tolerance. This might also make scikit-learn's OneSE distinct from similar implementations offered by other packages.

So that I can get a better sense of what you're thinking, perhaps you could provide a bit more detail (or ideally an example) as to what using Wilcoxon might look like in this context and why this would be superior to 1 SE?

@dPys

@jnothman (Member) commented Aug 27, 2019 via email

@dPys (Author) commented Aug 27, 2019

This is true, but the trouble I see with Wilcoxon in particular is that a normal approximation is used for the test statistic, which means the sample would probably need to be fairly large to see any benefit over 1 SE. However, the number of samples in this context equals the length of the parameter's grid of values (e.g. PCA n_components might be [2, 4, 6, 8]), which in many if not most cases (as in the example above) will be n < 20.

Now, if in adding an optional hypothesis-testing method we also require that the target parameter's grid contain more than 20 values, then I think your suggestion would work. In such cases (i.e. a more exhaustive grid search over the target parameter), the risk of overfitting will probably be higher anyway, and selection would therefore benefit more from smoothing via Wilcoxon or some other nonparametric test.

Curious to hear your thoughts
@dPys

@jnothman (Member) commented Aug 27, 2019 via email

@dPys (Author) commented Aug 27, 2019

I'm confused -- if we were to consider only the number of splits (e.g. 10), wouldn't that still yield too small an N?

Do you mean that, rather than using the mean across CV folds of the refit scorer of interest (as is currently done in the 1 SE and percentile-threshold cases), we should consider the individual score samples across all folds? If so, that would yield more than enough N :-)
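For concreteness, here is a minimal sketch (helper name and defaults are assumed, not this PR's code) of what a paired Wilcoxon test between the per-fold test scores of two candidate parameter settings could look like:

import numpy as np
from scipy.stats import wilcoxon

def folds_are_tied(cv_results, idx_a, idx_b, scoring='accuracy', alpha=0.05):
    # Gather the per-split test scores for two candidate parameter settings.
    split_keys = sorted(k for k in cv_results
                        if k.startswith('split') and k.endswith('test_%s' % scoring))
    scores_a = np.array([cv_results[k][idx_a] for k in split_keys])
    scores_b = np.array([cv_results[k][idx_b] for k in split_keys])
    if np.allclose(scores_a, scores_b):
        return True  # identical fold-wise scores: treat as tied
    # Paired, nonparametric test on the fold-wise differences; here N is the
    # number of CV folds, not the number of points in the parameter grid.
    _, p_value = wilcoxon(scores_a, scores_b)
    return p_value > alpha  # fail to reject: no consistent difference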

@jnothman (Member) commented Aug 27, 2019 via email

@jnothman (Member) commented Aug 27, 2019 via email

@dPys (Author) commented Aug 27, 2019

I like that idea, @jnothman! I'll see what I can do.

This would also make it easier/saner to implement other methods such as Wilcoxon.

@dPys (Author) commented Aug 28, 2019

The point is that the scores are paired. You want to see whether there is a consistent improvement rather than an improvement in the average. Yes, N is small. That's true for calculating the standard deviation too. Wilcoxon should be possible for small N though it will not reject the null hypothesis easily... Which is sort of the point. Feel free to empirically compare the two with different data sets, or to look for literature to substantiate one approach over the other.

Another thing to consider is that 1 SE was developed in the context of decision trees, where it was found to be optimal for balancing tree size against error (hence 1-SE rather than 1-SD). Outside of that context, one standard deviation may not be empirically optimal as a hard-and-fast rule. Evaluating 1-SE against a hypothesis test across datasets may yield entirely different results depending on the type of learner and target parameter used. It seems to me that the best workaround is simply to include each of these methods as options and let the user decide what is most appropriate for the problem at hand.

@jnothman (Member) commented Aug 28, 2019 via email

@dPys (Author) commented Aug 28, 2019

Yeah, metric is essential. So is metric-model pairing, and classification vs. regression. I'd have to test what would happen in the case of unbalanced classes.

But what's cool is that all of these possibilities become exposed to the user for exploration on a case-by-case basis with the kind of 'meta'-callable that's materializing here. In all honesty, it sounds like this might be more appropriately construed as a generic gridsearch smoothing class for overfit-prevention, rather than '1-SE' in particular (though I do like the name as a throwback).

@jnothman (Member) commented Aug 28, 2019 via email

@dPys (Author) commented Aug 28, 2019

We'd have to test these assumptions to know for sure. Bear in mind, too, that p < 0.05 is also more or less a rule of thumb (like 1-SE), so I am still not convinced that a formal hypothesis test would be 'better' across the board.

I think the way to compare these different criteria is to pick a dataset from one of sklearn's examples (e.g. digits), add a varying number of random noise predictors, and then see how well each method prevents the noise terms from influencing the fit.
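A sketch of how that comparison could be set up (illustrative only, not part of this PR; the number of noise columns is arbitrary):

import numpy as np
from sklearn.datasets import load_digits

rng = np.random.RandomState(42)
digits = load_digits()
n_noise = 20  # arbitrary number of pure-noise predictors
X_noisy = np.hstack([digits.data,
                     rng.normal(size=(digits.data.shape[0], n_noise))])
y = digits.target
# X_noisy / y can then be run through the same pipeline + GridSearchCV as in
# the example above, once per refit criterion (1 SE, percentile tolerance,
# Wilcoxon), to see which criterion best resists the noise terms.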

That said, this comparison would mainly be useful for providing helpful guidance in the user guide. From a development standpoint, I don't think those kinds of experiments should necessarily hold up integration of the code proposed here, since in the end all of these methods (1 SE, hypothesis test, percentile tolerance) would ideally be made available for the user to choose from.

@dPys (Author) commented Sep 3, 2019

@jnothman -- I've updated the PR extensively. These routines have now been renamed to SmoothCV, which reflects a generalization of the OneSE approach for overfit prevention. I've added support for the Wilcoxon method of hypothesis testing across folds, and reorganized the callable partial to query the new SmoothCV class, within which all of these routines are now included. This is still very much a WIP, but for now, please let me know whether the Wilcoxon approach is along the lines of what you were thinking. Please also feel free to revise with any of your own edits.

Tests to come once we've finished working out any remaining kinks...

Cheers,
@dPys

@dPys (Author) commented Sep 5, 2019

And usage looks something like this:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection._search import smooth_cv
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
        ('reduce_dim', PCA(random_state=42)),
        ('classify', LinearSVC(random_state=42)),
])

param_grid = {
    'reduce_dim__n_components': [2, 4, 6, 8]
}

param = 'n_components'
greater_is_complex = True
refit_scoring = 'accuracy'
tol = None
method = 'wilcoxon'
alpha = 0.01

grid = GridSearchCV(pipe, cv=10, n_jobs=1, param_grid=param_grid,
                    scoring=['accuracy', 'neg_mean_squared_log_error'],
                    refit=smooth_cv(param, greater_is_complex, refit_scoring,
                                    method, tol, alpha))
digits = load_digits()
grid.fit(digits.data, digits.target)

@jnothman (Member) commented Sep 5, 2019 via email

@jnothman (Member) commented Sep 5, 2019 via email

@dPys (Author) commented Sep 5, 2019

@jnothman -- I agree completely and will make those changes asap.

Stay tuned,
@dPys

Commits pushed:
* OneSE
* OneSE - fix typo in docstring
* OneSE - add function partial for adding explicit user arguments to callable
* Change name to SmoothCV, add support for Wilcoxon method, reorganize callable partial to query the new SmoothCV class.
* rename to RazorCV, create routine-independent callables
@dPys (Author) commented Sep 9, 2019

@jnothman

Squashed/rebased changes.

Improvements:
* All of the routines are now encapsulated in a class called RazorCV.
* Several hours were devoted to rethinking how to allow for more readable callables, as suggested -- this is actually a rather difficult problem to solve cleanly, given the use of class-embedded function partials as callables in this context.

The current solution works, but there may be other ways to implement it even more cleanly-- e.g. splitting things up into two classes (one containing the universal routines, and the other that contains each callable method, drawing from the first class). Before going down that road, however, I wanted to get your thoughts since there are many ways to go about this.
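To make the design question concrete, here is a stripped-down sketch of the pattern being described (hypothetical names and thresholds, not the actual RazorCV code): shared routines live on the class, and each public method returns the one-argument callable over cv_results_ that GridSearchCV expects for refit.

import numpy as np

class RazorCVSketch:
    # Simplified, illustrative stand-in for the RazorCV class in this PR.
    def __init__(self, cv_results, param, scoring):
        self.cv_results = cv_results
        self.param = param
        self.scoring = scoring

    def _simplest_within(self, threshold):
        # Shared routine: index of the simplest candidate at or above `threshold`.
        means = np.asarray(self.cv_results['mean_test_%s' % self.scoring])
        complexity = np.asarray(self.cv_results['param_%s' % self.param],
                                dtype=float)
        candidates = np.flatnonzero(means >= threshold)
        return int(candidates[np.argmin(complexity[candidates])])

    @staticmethod
    def one_se(param, scoring):
        # Returns the callable GridSearchCV will invoke with cv_results_.
        def _call(cv_results):
            razor = RazorCVSketch(cv_results, param, scoring)
            means = np.asarray(cv_results['mean_test_%s' % scoring])
            stds = np.asarray(cv_results['std_test_%s' % scoring])
            best = int(np.argmax(means))
            # One standard deviation here for brevity; a standard error would
            # divide by sqrt(n_splits).
            return razor._simplest_within(means[best] - stds[best])
        return _call

# Usage: GridSearchCV(..., refit=RazorCVSketch.one_se('reduce_dim__n_components',
#                                                     'accuracy'))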

Either way, I think we've got a nice WIP here.
@dPys

@dPys (Author) commented Sep 9, 2019

Updated working example:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
# Assumed import path for RazorCV, matching the earlier smooth_cv/OneSE
# examples in this PR:
from sklearn.model_selection._search import RazorCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
        ('reduce_dim', PCA(random_state=42)),
        ('classify', LinearSVC(random_state=42)),
])

param_grid = {
    'reduce_dim__n_components': [2, 4, 6, 8]
}

param = 'n_components'
greater_is_complex = True
scoring = 'neg_mean_squared_log_error'
alpha = 0.05

grid = GridSearchCV(pipe, cv=10, n_jobs=1, param_grid=param_grid,
                    scoring=['accuracy', 'neg_mean_squared_log_error'],
                    refit=RazorCV.ranksum(param, greater_is_complex, scoring,
                                          alpha=alpha))
digits = load_digits()
grid.fit(digits.data, digits.target)

@jnothman (Member) commented Sep 9, 2019

Please avoid squashing and force pushing. We can squash upon merge but in the meantime it makes a mess of the commit history.

@dPys (Author) commented Sep 9, 2019

Please avoid squashing and force pushing. We can squash upon merge but in the meantime it makes a mess of the commit history.

Oh alright, sorry about that. I see now that the push caught #14771 by mistake.

@dPys (Author) commented Sep 9, 2019

Assuming we are at least mostly happy with the current design, it seems like a good next step would be to fuzz-test the functionality with more types of learners, scoring methods, and feature-selecting parameters (i.e. beyond PCA), and maybe even to create a parameterized test across various combinations of the options scikit-learn currently offers. Do you know of any potential volunteers who'd be interested in assisting with this?
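One way such a parameterized test could look (a sketch only; the RazorCV import path follows the examples above, and the exact call signature and assertions are assumptions):

import pytest
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Assumed import path for RazorCV, matching the examples above:
from sklearn.model_selection._search import RazorCV


@pytest.mark.parametrize("clf", [LinearSVC(random_state=42),
                                 LogisticRegression(max_iter=1000)])
@pytest.mark.parametrize("scoring", ["accuracy", "balanced_accuracy"])
def test_razorcv_selects_valid_index(clf, scoring):
    # Smoke test: the refit callable must return a valid index into the grid.
    X, y = load_digits(return_X_y=True)
    pipe = Pipeline([("reduce_dim", PCA(random_state=42)), ("classify", clf)])
    param_grid = {"reduce_dim__n_components": [2, 4, 6, 8]}
    grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring=[scoring],
                        refit=RazorCV.ranksum("n_components", True, scoring,
                                              alpha=0.05))
    grid.fit(X, y)
    assert 0 <= grid.best_index_ < len(param_grid["reduce_dim__n_components"])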

Thanks again for the ongoing help and encouragement :-)
@dPys

Base automatically changed from master to main January 22, 2021 10:51
This pull request was closed.