allow GridSearchCV to work with params={} or cv=1 #2048


Closed
eyaler opened this issue Jun 9, 2013 · 24 comments · Fixed by #2082

Comments

@eyaler

eyaler commented Jun 9, 2013

these degenerate cases are sometimes wanted for comparison

@GaelVaroquaux
Member

Fair enough. As these are corner cases, I suggest that you submit a pull
request, as all the core developers are drowning under load, and I don't
suspect that they will come to this soon. It shouldn't be too hard to get
it working, I hope.

@jnothman
Member

jnothman commented Jun 9, 2013

CV = 1 sounds strange... In terms of "k-fold cv" it doesn't really make sense. The implementation should validate against it, unless you really think it needs to be handled as a special case:

>>> from sklearn.cross_validation import check_cv
>>> list(check_cv(1, [1,2,3], [4,5,6]))
[(array([], dtype=int32), array([0, 1, 2]))]

(Note: #1742 introduces the ability to return training scores in cross-validation.)

I have wondered about params={}. You can't derive params={} from param_dict={...}. Perhaps we should support some special semantics for params={}, but perhaps you just want something like cross_val_score extended?

@eyaler
Author

eyaler commented Jun 10, 2013

imho, cv=1 should just do a fit with the first parameter.

@jnothman
Member

imho, cv=1 should just do a fit with the first parameter.

Could you please give me an example? I'm not sure what you mean.

@ogrisel
Member

ogrisel commented Jun 10, 2013

cv means cross-validation: it determines how many train/test splits will be used to evaluate each parameter combination from the grid.

For k-fold cross validation, k=1 is meaningless. We should just raise a ValueError with an informative exception message.

Right now we have:

>>> from sklearn.cross_validation import KFold
>>> for train, test in  KFold(100, 1): print train, test
[] [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]

An empty training set is likely to give results or errors that are confusing for the user.

@eyaler
Author

eyaler commented Jun 10, 2013

Although these are degenerate cases, I find them useful as references and for testing.
Both should just fit the classifier without iterating over parameters (taking the first parameter where applicable).

@ogrisel
Member

ogrisel commented Jun 10, 2013

cv=1 is meaningless: many classifiers need at least 2 training samples from 2 different classes just to know the kind of problem they are addressing, and cannot be fitted on an empty training set.

But I agree we at least need better error messages. For instance now we have:

>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()

>>> from sklearn.datasets import load_digits
>>> digits = load_digits()

>>> from sklearn.grid_search import GridSearchCV
>>> GridSearchCV(clf, {'C': [1, 10]}, cv=1).fit(digits.data, digits.target)
Traceback (most recent call last):
  File "<ipython-input-54-9e81f5317334>", line 1, in <module>
    GridSearchCV(clf, {'C': [1, 10]}, cv=1).fit(digits.data, digits.target)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 700, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid), **params)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 483, in _fit
    for parameters in parameter_iterable
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 514, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 311, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 135, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 286, in fit_grid_point
    clf.fit(X_train, y_train, **fit_params)
  File "/Users/ogrisel/code/scikit-learn/sklearn/svm/base.py", line 668, in fit
    raise ValueError("The number of classes has to be greater than"
ValueError: The number of classes has to be greater than one.

>>> GridSearchCV(clf, {}, cv=2).fit(digits.data, digits.target)
Traceback (most recent call last):
  File "<ipython-input-55-0fd69621ac4a>", line 1, in <module>
    GridSearchCV(clf, {}, cv=2).fit(digits.data, digits.target)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 700, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid), **params)
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 483, in _fit
    for parameters in parameter_iterable
  File "/Users/ogrisel/code/scikit-learn/sklearn/externals/joblib/parallel.py", line 513, in __call__
    for function, args, kwargs in iterable:
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 480, in <genexpr>
    delayed(fit_grid_point)(
  File "/Users/ogrisel/code/scikit-learn/sklearn/grid_search.py", line 84, in __iter__
    keys, values = zip(*items)
ValueError: need more than 0 values to unpack

@eyaler
Author

eyaler commented Jun 10, 2013

It's not an empty training set. cv=1 would mean train on all the data.

@jnothman
Member

cv doesn't affect iterating over parameters. I presumed what you meant was something like not doing cross-validation, but testing different parameters by fitting and scoring each model on the same data. This is bad practice, so although it can be achieved with cv=[(slice(None), slice(None))] it's probably best if scikit-learn doesn't support it directly. I agree with @ogrisel, what we need here is better validation.
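For illustration, a minimal sketch of that trick (the estimator, grid and data are assumptions, and the import path is the modern sklearn.model_selection one): a single "split" whose train and test parts are both the whole dataset, i.e. no actual cross-validation, so the reported scores are really training scores.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# one "split" that trains and scores on the full dataset
no_cv = [(slice(None), slice(None))]

gs = GridSearchCV(LinearSVC(), {'C': [1, 10]}, cv=no_cv)
gs.fit(X, y)
print(gs.best_params_)  # chosen on training-set scores, so beware of overfitting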

I think what you mean by params={} is to perform cross validation on the existing parameter setting of an estimator. This is done by sklearn.cross_validation.cross_val_score, not GridSearchCV. I have use-cases for doing so with GridSearchCV, but I only think they are meaningful given #1769, or in order to compare the results from *SearchCV across different estimators, some of which may not have parameters at all (in which case you could use RandomizedSearchCV with 1 iteration and an empty param_distributions).
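For illustration, a minimal sketch of that alternative (the estimator, data and cv value are assumptions; sklearn.cross_validation was the module of that era and has since moved to sklearn.model_selection):

>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.datasets import load_digits
>>> from sklearn.svm import LinearSVC
>>> digits = load_digits()
>>> # scores the estimator with its current parameters; no grid is searched
>>> scores = cross_val_score(LinearSVC(C=1), digits.data, digits.target, cv=5)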

@jnothman
Member

It's not an empty training set. cv=1 would mean train on all the data.

and test on nothing. Or alternatively, train on nothing and test on all the data. Which meaning do you think is more sensible?!

@eyaler
Author

eyaler commented Jun 10, 2013

I agree. I wanted training on all and test on nothing. However, as k gets smaller the training set also gets smaller, so this would not be consistent.

@jnothman
Member

I wanted training on all and test on nothing.

test on nothing isn't very useful when what's returned from GridSearchCV is a score...

@ogrisel
Member

ogrisel commented Jun 10, 2013

@eyaler currently, as demonstrated in my previous comment, KFold cross-validation with cv=1 means train on nothing and test on everything. But anyway this is useless and probably too confusing for a naive user not familiar with the concept of cross-validation. In my opinion it would make more sense to raise an explicit exception such as ValueError("Cross-validation needs at least one train/test split; set cv=2 or more.") in the KFold.__init__ method (and StratifiedKFold.__init__ as well).
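For illustration, a minimal sketch of such a check (assumed code, not the actual fix that was merged; it mirrors the KFold(n, n_folds) signature of that era):

class KFold(object):
    def __init__(self, n, n_folds=3):
        # reject degenerate values up front with an informative message
        if n_folds < 2:
            raise ValueError(
                "k-fold cross-validation requires at least one "
                "train/test split by setting n_folds=2 or more, "
                "got n_folds=%d" % n_folds)
        self.n = n
        self.n_folds = n_folds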

@ogrisel
Member

ogrisel commented Jun 21, 2013

The error message for n_folds < 2 is now much more explicit. The handling of an empty grid of parameters could still be improved.

larsmans added a commit to larsmans/scikit-learn that referenced this issue Jun 22, 2013
@Eriz11

Eriz11 commented Dec 17, 2018

cv doesn't affect iterating over parameters. I presumed what you meant was something like not doing cross-validation, but testing different parameters by fitting and scoring each model on the same data. This is bad practice, so although it can be achieved with cv=[(slice(None), slice(None))] it's probably best if scikit-learn doesn't support it directly. I agree with @ogrisel, what we need here is better validation.

I think what you mean by params={} is to perform cross validation on the existing parameter setting of an estimator. This is done by sklearn.cross_validation.cross_val_score, not GridSearchCV. I have use-cases for doing so with GridSearchCV, but I only think they are meaningful given #1769, or in order to compare the results from *SearchCV across different estimators, some of which may not have parameters at all (in which case you could use RandomizedSearchCV with 1 iteration and an empty param_distributions).

My two cents, so that another "special case" that is growing in use (at least in my surroundings) is documented and discussed, which may add some clarity to the above comment of @jnothman. Hope it helps (it's the first time I propose something here, so any advice or comment is very much appreciated).

TimeSeriesSplit only implements anchored Walk Forward Analysis (WFA), which is interesting for some purposes. Another attractive approach is a non-anchored WFA, which I implemented (very simply) and shared here. This second case covers situations in which you want to cross-validate on recent data only, without going back to the beginning of the dataset.
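For illustration, a hedged sketch of what such a non-anchored splitter could look like (the class name and parameters are made up here, not the implementation shared above): both the train and the test window slide forward, so older samples are dropped instead of accumulating, and the object can be passed as cv= because it exposes split and get_n_splits.

import numpy as np

class NonAnchoredWalkForward(object):
    def __init__(self, n_splits=5, train_size=100, test_size=20):
        self.n_splits = n_splits
        self.train_size = train_size
        self.test_size = test_size

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        indices = np.arange(n_samples)
        # place the windows so that the last test window ends at the last sample
        start = n_samples - self.n_splits * self.test_size - self.train_size
        if start < 0:
            raise ValueError("not enough samples for the requested windows")
        for i in range(self.n_splits):
            train_end = start + self.train_size + i * self.test_size
            yield (indices[train_end - self.train_size:train_end],
                   indices[train_end:train_end + self.test_size])

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits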

To be more concrete, TimeSeriesSplit can be used with GridSearchCV, which seems fair and reasonable. However, continuing with the comment of @jnothman above, if I wanted to grid search a specific classifier using my non-anchored approach (i.e. I get my train/test window splits and grid search on each train set I get, then take the best model and use it to make predictions on the test set), I would need to use cv=[(slice(None), slice(None))] in GridSearchCV, if that is the best practice available right now.

Would it be reasonable to add some extra parameter or argument to GridSearchCV so that it can be used without cross-validation, for alternative cross-validation techniques that are not supported in sklearn? My opinion is that this could be more practical than adding a non-anchored WFA approach like mine to the package (others may come up with other techniques as well). However, I believe others could benefit from the non-anchored approach.

Whatever I can help with, please tell me and I will do my best.

@jnothman
Member

jnothman commented Dec 18, 2018 via email

@Eriz11

Eriz11 commented Dec 18, 2018

Thanks for the reply @jnothman. So your suggestion is to use the above trick to run GridSearchCV without cross-validation, am I right? It is true that it is not a very common practice, so I agree with you here.

Regarding the second part, which I find more interesting and rewarding for the community, would it be reasonable to add some construct, similar to TimeSeriesSplit but for non-anchored Walk Forward Analysis, so that it can be passed directly as cv= in cross-validation approaches? At the end of the day, it is another common technique for cross-validation on time series (I think it is gaining traction, since it only includes the most recent data for optimization purposes).

Looking forward to your views,

@jnothman
Member

jnothman commented Dec 18, 2018 via email

@Eriz11

Eriz11 commented Dec 19, 2018

Thanks for the advice @jnothman. This is my first time in this situation (and I am not a software engineer), so I don't have much experience making a working pull request to a package such as scikit-learn. How could I continue down this path if I want to help develop this feature? I have the working code, but it is far from ready for user interaction. I would be extremely glad to help in any phase if anybody with more experience could join me.

@kitaev-chen

kitaev-chen commented Jan 23, 2020

Here is a use case for why cv=1 is needed:

A proper cross-validation should have a shifted (train set, valid set, test set) in each fold. The train set is for training the model; the valid set is for hyper-parameter tuning; the test set is for computing the test accuracy.

If we want to compute 5 test accuracies, we need to use 5-fold CV. In every fold, we have two disjoint data sets A and B. B is for testing; A should be split again into a train set A_t and a valid set A_v. Then we do hyper-parameter tuning on A_v. In this part, we need GridSearchCV(cv=1).

So the pseudocode should look like this:

cv = StratifiedShuffleSplit(n_splits=5, ...)
scores = []

for idx_train_valid, idx_test in cv.split(X, y):

    X_train_valid, X_test = ...
    y_train_valid, y_test = ...

    # tune hyper-parameters on the train/valid part only
    gs = GridSearchCV(clf, ..., refit=False, cv=1, ...)
    gs.fit(X_train_valid, y_train_valid)

    # refit with the best parameters and score on the held-out test part
    clf.set_params(**gs.best_params_)
    clf.fit(X_train_valid, y_train_valid)
    y_pred = clf.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))

Of course, we could also set cv=2 in GridSearchCV, which would likewise split off a train set and a valid set, but then n_splits cannot be taken directly from the configuration file.

@jnothman
Member

jnothman commented Jan 24, 2020 via email

@kitaev-chen

You can't ever need cv=1 since cv=int is only a shorthand. I find cv=1 ambiguous to read, and hence am strongly against permitting it

So is there any alternative way to implement my use case?

@kitaev-chen

OK, I see: just set cv to any splitter (e.g. StratifiedShuffleSplit) with n_splits=1.
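Applied to the pseudocode above, a minimal runnable sketch could look like this (LinearSVC, the digits data and the parameter grid are assumptions for illustration):

from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
outer_cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
inner_cv = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

scores = []
for idx_train_valid, idx_test in outer_cv.split(X, y):
    X_tv, y_tv = X[idx_train_valid], y[idx_train_valid]
    X_test, y_test = X[idx_test], y[idx_test]

    # tune on a single train/valid split instead of the unsupported cv=1
    gs = GridSearchCV(LinearSVC(), {'C': [0.1, 1, 10]}, cv=inner_cv, refit=False)
    gs.fit(X_tv, y_tv)

    # refit on the full train+valid data with the best parameters, score on the test part
    clf = LinearSVC(**gs.best_params_).fit(X_tv, y_tv)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))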

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 25, 2020 via email
