allow GridSearchCV to work with params={} or cv=1 #2048
Fair enough. As these are corner cases, I suggest that you submit a pull request.
CV = 1 sounds strange... In terms of "k-fold cv" it doesn't really make sense. The implementation should validate against it, unless you really think it needs to be handled as a special case:

>>> list(check_cv(1, [1, 2, 3], [4, 5, 6]))
[(array([], dtype=int32), array([0, 1, 2]))]

(Note: #1742 introduces the ability to return training scores in cross-validation.) I have wondered about …
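As an aside on that note: in current scikit-learn, training scores in cross-validation are exposed via the return_train_score option. A minimal illustration, not part of the original comment:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, random_state=0)
# return_train_score=True adds per-fold training scores to the result dict
res = cross_validate(SVC(), X, y, cv=3, return_train_score=True)
print(res["train_score"], res["test_score"])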
imho, cv=1 should just do a fit with the first parameter.
Could you please give me an example? I'm not sure what you mean.
cv means cross-validation: it determines how many train/test splits will be used to evaluate each parameter combination from the grid. For k-fold cross-validation, k=1 is meaningless. We should just raise an explicit exception instead: right now cv=1 yields an empty training set, which is likely to give results or errors that are confusing for the user.
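For concreteness, a small illustration of how k controls the number of train/test splits, using the modern model_selection API (not part of the original comment):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(6, 1)
# k=3 yields three splits; each third of the data is held out exactly once
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print(train_idx, test_idx)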
Although these are degenerate cases, I find them useful as references and for testing.
cv=1 is meaningless: many classifiers need at least 2 training samples from 2 different classes just to know what kind of problem they are addressing, and cannot be fitted on an empty training set. But I agree we at least need better error messages than the ones we currently get.
It's not an empty training set. cv=1 would mean train on all the data.

I think what you mean by cv=1 is: train on all the data and test on nothing. Or alternatively, train on nothing and test on all the data. Which meaning do you think is more sensible?!
I agree. I wanted training on all and test on nothing. However, as k gets smaller the training set also gets smaller, so this would not be consistent.
test on nothing isn't very useful when what's returned from …
@eyaler currently, as demonstrated in my previous comment, KFold cross-validation with cv=1 means train on nothing and test on everything. But anyway this is useless and probably too confusing for a naive user not familiar with the concept of cross-validation. In my opinion it would just make more sense to raise an explicit exception such as a ValueError.
The error message for n_folds < 2 is now much more explicit. The handling of the empty grid of parameters could still be improved.
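For reference, a minimal check of that behaviour in current scikit-learn (the exact message wording varies by version):

from sklearn.model_selection import KFold

# Constructing a k-fold splitter with fewer than 2 splits raises a
# ValueError stating that at least one train/test split is required
KFold(n_splits=1)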
My two cents, so that another "special case" that is growing in use (at least in my surroundings) gets documented and argued, which may give some clarity to the above comment of @jnothman. Hope it helps (it's the first time I propose something here, so any advice or comment is very much appreciated). Would it be reasonable to add some extra parameter or argument to GridSearchCV so that it may be used without cross-validation, for alternative validation techniques that are not supported in sklearn? My opinion is that this could be more practical than adding a non-anchored WFA (walk-forward analysis) approach like mine to the package (others may come up with other techniques as well). However, I believe others could benefit from the non-anchored approach. Anything I can help with, please tell me and I will do my best.
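A minimal sketch of what such a non-anchored walk-forward split might look like as a custom cv iterable; the function name and window parameters are illustrative, not from the thread:

import numpy as np

def rolling_walk_forward_splits(n_samples, train_size, test_size):
    # Non-anchored: the training window slides forward at each step
    # instead of growing from the start of the series.
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += test_size

Since GridSearchCV accepts any iterable of (train, test) index pairs as cv, a generator like this can be passed as, e.g., cv=list(rolling_walk_forward_splits(len(X), 50, 10)) without any changes to the library.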
I don't understand what stops these cases being handled at the moment without making any changes to the library.
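Presumably the point is that cv already accepts an explicit list of (train, test) index pairs. A minimal sketch of a "train and score on everything" split (estimator, data, and parameter grid are placeholders, not from the thread):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=50, random_state=0)
idx = np.arange(len(X))
# A single split that trains and scores on the same data emulates "no CV"
gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=[(idx, idx)])
gs.fit(X, y)
print(gs.best_params_)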
Thanks for the reply @jnothman. So, your suggestion is about using the above trick to handle my case. Regarding the second part, which I find more interesting and rewarding to the community, would it be reasonable to add some construct, similar to TimeSeriesSplit, for this? Looking forward to your views.
Alternatives and extensions to TimeSeriesSplit are welcome as pull requests. Yes, defining custom cv and custom scoring are reasonable hacks given the current interface.
Thanks for the advice @jnothman. It's my first time in this situation (together with not being a software engineer) and I don't have that much experience to make a working pull request to a package such as scikit-learn. How could I continue down this path if I want to help in the development of this feature? I have the working code, but it is far from being ready for user interaction. I would be extremely glad to help in any phase if anybody with more experience could join me.
I'll give you a use case where cv=1 is needed: the real cross-validation should have shifted (train set, valid set, test set) triples in each fold. The train set is for training the model; the valid set is for hyper-parameter tuning; the test set is for computing test accuracy. If we want to compute 5 test accuracies, we need 5-fold CV. In every fold, we have two divided data sets A and B. B is for testing; A should be split again into a train set A_t and a valid set A_v. Then we do hyper-parameter tuning on A_v. In this part, we need GridSearchCV(cv=1). So the pseudocode should look like this:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=5, ...)
for i, (idx_train_valid, idx_test) in enumerate(cv.split(X, y)):
    X_train_valid, X_test = X[idx_train_valid], X[idx_test]
    y_train_valid, y_test = y[idx_train_valid], y[idx_test]
    gs = GridSearchCV(clf, ..., refit=False, cv=1, ...)  # cv=1 is the wish here
    gs.fit(X_train_valid, y_train_valid)
    clf.set_params(**gs.best_params_)  # apply the selected hyper-parameters
    clf.fit(X_train_valid, y_train_valid)
    y_pred = clf.predict(X_test)
    scores = accuracy_score(y_test, y_pred)

Of course, we could also set cv=2 in GridSearchCV, which likewise shifts the train set and valid set, but then n_splits cannot be taken directly from the configuration file.
You can't ever *need* cv=1 since cv=int is only a shorthand. I find cv=1 ambiguous to read, and hence am strongly against permitting it.
So is there any alternative way to implement my use case?
OK, I know: just set cv to any splitter (e.g. StratifiedShuffleSplit) with n_splits=1.
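Spelled out as a runnable sketch; the estimator, data, and parameter grid are placeholders, not from the thread:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
# n_splits=1 gives a single explicit train/validation split,
# which stands in for the disallowed cv=1
inner_cv = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv, refit=False)
gs.fit(X, y)
print(gs.best_params_)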
Yes, I'm with that.
These degenerate cases are sometimes wanted for comparison.