allow GridSearchCV to work with params={} or cv=1 #2048


Closed
eyaler opened this issue Jun 9, 2013 · 24 comments · Fixed by #2082

Comments

@eyaler

eyaler commented Jun 9, 2013

these degenerate cases are sometimes wanted for comparison

@GaelVaroquaux
Member

Fair enough. As these are corner cases, I suggest that you submit a pull
request, as all the core developers are drowning under load, and I don't
suspect that they will come to this soon. It shouldn't be too hard to get
it working, I hope.

@jnothman
Member

jnothman commented Jun 9, 2013

CV = 1 sounds strange... In terms of "k-fold cv" it doesn't really make sense. The implementation should validate against it, unless you really think it needs to be handled as a special case:

>>> from sklearn.cross_validation import check_cv
>>> list(check_cv(1, [1,2,3], [4,5,6]))
[(array([], dtype=int32), array([0, 1, 2]))]

(Note: #1742 introduces the ability to return training scores in cross-validation.)

I have wondered about params={}. You can't derive params={} from param_dict={...}. Perhaps we should support some special semantics for params={}, but perhaps you just want something like cross_val_score extended?

@eyaler
Author

eyaler commented Jun 10, 2013

imho, cv=1 should just do a fit with the first parameter.

@jnothman
Member

imho, cv=1 should just do a fit with the first parameter.

Could you please give me an example? I'm not sure what you mean.

@ogrisel
Member

ogrisel commented Jun 10, 2013

cv means cross-validation: it determines how many train/test splits will be used to evaluate each parameter combination from the grid.

For k-fold cross validation, k=1 is meaningless. We should just raise a ValueError with an informative exception message.

Right now we have:

>>> from sklearn.cross_validation import KFold
>>> for train, test in  KFold(100, 1): print train, test
[] [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]

An empty training set is likely to give results or errors that are confusing for the user.

@eyaler
Author

eyaler commented Jun 10, 2013

Although these are degenerate cases, I find them useful as references and for testing.
Both should just fit the classifier without iterating over parameters (taking the first parameter where applicable).

@ogrisel
Member

ogrisel commented Jun 10, 2013

cv=1 is meaningless: many classifiers need at least 2 training samples from 2 different classes just to know the kind of problem they are addressing, and cannot be fitted on an empty training set.

But I agree we at least need better error messages. For instance now we have:

>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()

>>> from sklearn.datasets import load_digits
>>> digits = load_digits()

>>> from sklearn.grid_search import GridSearchCV
>>> GridSearchCV(clf, {'C': [1, 10]}, cv=1).fit(digits.data, digits.target)
Traceback (most recent call last):
  File "<ipython-input-54-9e81f5317334>", line 1, in <module>
    GridSearchCV(clf, {'C': [1, 10]}, cv=1).fit(digits.data, digits.target)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 700, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid), **params)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 483, in _fit
    for parameters in parameter_iterable
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 514, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 311, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 135, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 286, in fit_grid_point
    clf.fit(X_train, y_train, **fit_params)
  File "/Users/ogrisel/code/scikit-learn/sklearn/svm/base.py", line 668, in fit
    raise ValueError("The number of classes has to be greater than"
ValueError: The number of classes has to be greater than one.

>>> GridSearchCV(clf, {}, cv=2).fit(digits.data, digits.target)
Traceback (most recent call last):
  File "<ipython-input-55-0fd69621ac4a>", line 1, in <module>
    GridSearchCV(clf, {}, cv=2).fit(digits.data, digits.target)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 700, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid), **params)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 483, in _fit
    for parameters in parameter_iterable
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 513, in __call__
    for function, args, kwargs in iterable:
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 480, in <genexpr>
    delayed(fit_grid_point)(
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 84, in __iter__
    keys, values = zip(*items)
ValueError: need more than 0 values to unpack

@eyaler
Author

eyaler commented Jun 10, 2013

It's not an empty training set. cv=1 would mean train on all the data.

@jnothman
Member

cv doesn't affect iterating over parameters. I presumed what you meant was something like not doing cross-validation, but testing different parameters by fitting and scoring each model on the same data. This is bad practice, so although it can be achieved with cv=[(slice(None), slice(None))] it's probably best if scikit-learn doesn't support it directly. I agree with @ogrisel, what we need here is better validation.
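For illustration, a minimal sketch of that trick (the estimator, grid and data are assumptions, and the import path is the modern sklearn.model_selection one): a single "split" whose train and test parts are both the whole dataset, i.e. no actual cross-validation, so the reported scores are really training scores.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# one "split" that trains and scores on the full dataset
no_cv = [(slice(None), slice(None))]

gs = GridSearchCV(LinearSVC(), {'C': [1, 10]}, cv=no_cv)
gs.fit(X, y)
print(gs.best_params_)  # chosen on training-set scores, so beware of overfitting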

I think what you mean by params={} is to perform cross validation on the existing parameter setting of an estimator. This is done by sklearn.cross_validation.cross_val_score, not GridSearchCV. I have use-cases for doing so with GridSearchCV, but I only think they are meaningful given #1769, or in order to compare the results from *SearchCV across different estimators, some of which may not have parameters at all (in which case you could use RandomizedSearchCV with 1 iteration and an empty param_distributions).
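For illustration, a minimal sketch of that alternative (the estimator, data and cv value are assumptions; sklearn.cross_validation was the module of that era and has since moved to sklearn.model_selection):

>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.datasets import load_digits
>>> from sklearn.svm import LinearSVC
>>> digits = load_digits()
>>> # scores the estimator with its current parameters; no grid is searched
>>> scores = cross_val_score(LinearSVC(C=1), digits.data, digits.target, cv=5)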

@jnothman
Member

It's not an empty training set. cv=1 would mean train on all the data.

and test on nothing. Or alternatively, train on nothing and test on all the data. Which meaning do you think is more sensible?!

@eyaler
Author

eyaler commented Jun 10, 2013

I agree. I wanted training on all and test on nothing. However, as k gets smaller the training set also gets smaller, so this would not be consistent.

@jnothman
Member

I wanted training on all and test on nothing.

test on nothing isn't very useful when what's returned from GridSearchCV is a score...

@ogrisel
Member

ogrisel commented Jun 10, 2013

@eyaler currently, as demonstrated in my previous comment, KFold cross-validation with cv=1 means train on nothing and test on everything. But anyway this is useless and probably too confusing for a naive user not familiar with the concept of cross-validation. In my opinion it would make more sense to raise an explicit exception such as ValueError("Cross-validation needs at least one train/test split; set cv=2 or more.") in the KFold.__init__ method (and StratifiedKFold.__init__ as well).
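For illustration, a minimal sketch of such a check (assumed code, not the actual fix that was merged; it mirrors the KFold(n, n_folds) signature of that era):

class KFold(object):
    def __init__(self, n, n_folds=3):
        # reject degenerate values up front with an informative message
        if n_folds < 2:
            raise ValueError(
                "k-fold cross-validation requires at least one "
                "train/test split by setting n_folds=2 or more, "
                "got n_folds=%d" % n_folds)
        self.n = n
        self.n_folds = n_folds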

@ogrisel
Member

ogrisel commented Jun 21, 2013

The error message for n_folds < 2 is now much more explicit. The handling of an empty grid of parameters could still be improved.

larsmans added a commit to larsmans/scikit-learn that referenced this issue Jun 22, 2013
@Eriz11

Eriz11 commented Dec 17, 2018

cv doesn't affect iterating over parameters. I presumed what you meant was something like not doing cross-validation, but testing different parameters by fitting and scoring each model on the same data. This is bad practice, so although it can be achieved with cv=[(slice(None), slice(None))] it's probably best if scikit-learn doesn't support it directly. I agree with @ogrisel, what we need here is better validation.

I think what you mean by params={} is to perform cross validation on the existing parameter setting of an estimator. This is done by sklearn.cross_validation.cross_val_score, not GridSearchCV. I have use-cases for doing so with GridSearchCV, but I only think they are meaningful given #1769, or in order to compare the results from *SearchCV across different estimators, some of which may not have parameters at all (in which case you could use RandomizedSearchCV with 1 iteration and an empty param_distributions).

My two cents, so that another "special case" that is growing in use (at least in my surroundings) is documented and discussed, which may add some clarity to the above comment of @jnothman. Hope it helps (it's the first time I propose something here, so any advice or comment is very much appreciated).

TimeSeriesSplit only implements anchored Walk Forward Analysis (WFA), which is interesting for some purposes. Another attractive approach is a non-anchored WFA, which I implemented (very simply) and shared here. This second case covers situations in which you want to cross-validate on recent data only, without going back to the beginning of the dataset.
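For illustration, a hedged sketch of what such a non-anchored splitter could look like (the class name and parameters are made up here, not the implementation shared above): both the train and the test window slide forward, so older samples are dropped instead of accumulating, and the object can be passed as cv= because it exposes split and get_n_splits.

import numpy as np

class NonAnchoredWalkForward(object):
    def __init__(self, n_splits=5, train_size=100, test_size=20):
        self.n_splits = n_splits
        self.train_size = train_size
        self.test_size = test_size

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        indices = np.arange(n_samples)
        # place the windows so that the last test window ends at the last sample
        start = n_samples - self.n_splits * self.test_size - self.train_size
        if start < 0:
            raise ValueError("not enough samples for the requested windows")
        for i in range(self.n_splits):
            train_end = start + self.train_size + i * self.test_size
            yield (indices[train_end - self.train_size:train_end],
                   indices[train_end:train_end + self.test_size])

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits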

To be more concrete, TimeSeriesSplit can be used with GridSearchCV, which seems fair and reasonable. However, continuing with the comment of @jnothman above, if I wanted to grid search a specific classifier using my non-anchored approach (i.e. I get my train/test window splits and grid search on each train set I get, then take the best model and use it to make predictions on the test set), I would need to use cv=[(slice(None), slice(None))] in GridSearchCV, if that is the best practice available right now.

Would it be reasonable to add some extra parameter or argument to GridSearchCV so that it can be used without cross-validation, for alternative cross-validation techniques that are not supported in sklearn? My opinion is that this could be more practical than adding a non-anchored WFA approach like mine to the package (others may come up with other techniques as well). However, I believe others could benefit from the non-anchored approach.

Whatever I can help with, please tell me and I will do my best.

@jnothman
Member

jnothman commented Dec 18, 2018 via email

@Eriz11

Eriz11 commented Dec 18, 2018

Thanks for the reply @jnothman. So your suggestion is to use the above trick to run GridSearchCV without cross-validation, am I right? It is true that it is not a very common practice, so I agree with you here.

Regarding the second part, which I find more interesting and rewarding for the community, would it be reasonable to add some construct, similar to TimeSeriesSplit but for non-anchored Walk Forward Analysis, so that it can be passed directly as cv= in cross-validation approaches? At the end of the day, it is another common technique for cross-validation on time series (I think it is gaining traction, since it only includes the most recent data for optimization purposes).

Looking forward to your views,

@jnothman
Member

jnothman commented Dec 18, 2018 via email

@Eriz11

Eriz11 commented Dec 19, 2018

Thanks for the advice @jnothman. This is my first time in this situation (and I am not a software engineer), so I don't have much experience making a working pull request to a package such as scikit-learn. How could I continue down this path if I want to help develop this feature? I have the working code, but it is far from ready for user interaction. I would be extremely glad to help in any phase if anybody with more experience could join me.

@kitaev-chen

kitaev-chen commented Jan 23, 2020

Here is a use case for why cv=1 is needed:

A proper cross-validation should have a shifted (train set, valid set, test set) in each fold. The train set is for training the model; the valid set is for hyper-parameter tuning; the test set is for computing the test accuracy.

If we want to compute 5 test accuracies, we need to use 5-fold CV. In every fold, we have two disjoint data sets A and B. B is for testing; A should be split again into a train set A_t and a valid set A_v. Then we do hyper-parameter tuning on A_v. In this part, we need GridSearchCV(cv=1).

So the pseudocode should look like this:

cv = StratifiedShuffleSplit(n_splits=5, ...)
scores = []

for idx_train_valid, idx_test in cv.split(X, y):

    X_train_valid, X_test = ...
    y_train_valid, y_test = ...

    # tune hyper-parameters on the train/valid part only
    gs = GridSearchCV(clf, ..., refit=False, cv=1, ...)
    gs.fit(X_train_valid, y_train_valid)

    # refit with the best parameters and score on the held-out test part
    clf.set_params(**gs.best_params_)
    clf.fit(X_train_valid, y_train_valid)
    y_pred = clf.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))

Of course, we could also set cv=2 in GridSearchCV, which would likewise split off a train set and a valid set, but then n_splits cannot be taken directly from the configuration file.

@jnothman
Member

jnothman commented Jan 24, 2020 via email

@kitaev-chen

You can't ever need cv=1 since cv=int is only a shorthand. I find cv=1 ambiguous to read, and hence am strongly against permitting it

So is there any alternative way to implement my use case?

@kitaev-chen

OK, I see: just set cv to any splitter (e.g. StratifiedShuffleSplit) with n_splits=1.
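Applied to the pseudocode above, a minimal runnable sketch could look like this (LinearSVC, the digits data and the parameter grid are assumptions for illustration):

from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
outer_cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
inner_cv = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

scores = []
for idx_train_valid, idx_test in outer_cv.split(X, y):
    X_tv, y_tv = X[idx_train_valid], y[idx_train_valid]
    X_test, y_test = X[idx_test], y[idx_test]

    # tune on a single train/valid split instead of the unsupported cv=1
    gs = GridSearchCV(LinearSVC(), {'C': [0.1, 1, 10]}, cv=inner_cv, refit=False)
    gs.fit(X_tv, y_tv)

    # refit on the full train+valid data with the best parameters, score on the test part
    clf = LinearSVC(**gs.best_params_).fit(X_tv, y_tv)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))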

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 25, 2020 via email
