
GroupKFold fails in nested cross-validation (similar to #2879) #7646


Closed
davidslater opened this issue Oct 11, 2016 · 17 comments · Fixed by #26896

Comments

@davidslater

Description

The groups parameter in model_selection.cross_val_score() is not propagated into the RandomizedSearchCV.fit() call. This is similar to #2879 and probably best addressed in #4497.

Steps/Code to Reproduce

import numpy as np
from sklearn.utils.validation import indexable
from sklearn import linear_model
from sklearn import model_selection

# generate data with simple decision boundary, with 2 labels and 2 groups per label
X = np.array(range(20)).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)
groups = np.array([0] * 5 + [1] * 5 + [2] * 5 + [3] * 5)

# run nested cross-validation (works with StratifiedKFold, but not GroupKFold)
clf = linear_model.LogisticRegression()
#cv = model_selection.StratifiedKFold(n_splits=2)
cv = model_selection.GroupKFold(n_splits=2)
param_dist = {'penalty': ['l1', 'l2'], 'C': np.logspace(-3, 3, 13)}
random_search = model_selection.RandomizedSearchCV(clf, cv=cv, param_distributions=param_dist, n_iter=20)
print(model_selection.cross_val_score(random_search, X, y=y, groups=groups, cv=cv))

Expected Results

When StratifiedKFold is used, the output is [ 0.8 0.7]. In general, it should be an array of 2 floats.

Actual Results

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
    for train, test in cv.split(X, y, groups))
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 322, in __init__
    self.results = batch()
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 1185, in fit
    return self._fit(X, y, groups, sampled_params)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 562, in _fit
    for parameters in parameter_iterable
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 603, in dispatch_one_batch
    tasks = BatchedCalls(itertools.islice(iterator, batch_size))
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 127, in __init__
    self.items = list(iterator_slice)
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 563, in <genexpr>
    for train, test in cv.split(X, y, groups))
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 321, in split
    for train, test in super(_BaseKFold, self).split(X, y, groups):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 90, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 102, in _iter_test_masks
    for test_index in self._iter_test_indices(X, y, groups):
  File "/Users/davidslater/.virtualenvs/davidslater/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 474, in _iter_test_indices
    raise ValueError("The groups parameter should not be None")
ValueError: The groups parameter should not be None

Versions

Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.11 (default, Jan 22 2016, 08:29:18) \n[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]')
('NumPy', '1.11.2')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18')

@davidslater
Author

In particular, the estimator.fit(X_train, y_train, **fit_params) call in model_selection._validation.py does not include "groups" in fit_params, so it defaults to None.
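The same failure can be seen in isolation by fitting the inner search without groups, which is effectively what cross_val_score ends up doing (a minimal illustration reusing the objects from the reproduction script above):

# Fitting the search directly without groups reproduces the inner error,
# because its GroupKFold.split() then receives groups=None.
random_search.fit(X, y)
# ValueError: The groups parameter should not be None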

@jnothman jnothman added the Bug label Oct 13, 2016
@jnothman jnothman added this to the 0.18.1 milestone Oct 13, 2016
@jnothman
Member

Thanks for the report @davidslater. I'm not sure if this is to be fixed for 0.18.1 (it only applies to 0.18, but AFAIK this kind of nested CV wasn't possible before, so it's not exactly a regression), but I've labelled as such.

@amueller
Member

I agree we need #4497 to fix this, and I don't see how we could do it for 0.18.1 - except special-casing the groups parameter to be passed to the cross-validation but not the estimator.

@amueller amueller added the API label Oct 14, 2016
@amueller
Member

"need contributor"? I don't think we agree on a fix, do we?

@jnothman
Member

Or just some kind of routing parameter specific to CV, seeing as we already
accept a groups param to GridSearchCV.fit?
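For comparison, the non-nested case already works because the search accepts groups in fit() directly; a minimal sketch reusing X, y, groups and clf from the reproduction script above:

# Non-nested: groups passed straight to the search's fit() reaches GroupKFold.split().
grid = model_selection.GridSearchCV(clf, {'C': [0.1, 1.0, 10.0]},
                                    cv=model_selection.GroupKFold(n_splits=2))
grid.fit(X, y, groups=groups)  # no error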


@GaelVaroquaux
Member

I really do think that we need #4497 to address this. It's an issue that is very dear to my heart, but I don't think that rushing it is a good idea. This is probably material for a sprint discussion.

@raghavrv
Member

Yes, I don't think this should be tagged 0.18.1...

@aryamccarthy
Contributor

Bump. Is there an agreement on how to approach this?

@jnothman
Member

jnothman commented Mar 26, 2017 via email

@s96lam

s96lam commented May 9, 2020

I encountered what seems to be another form of the same error when running the example code for nested CV with GroupKFold as the CV technique:
ValueError: The 'groups' parameter should not be None.

This is not a big problem in itself, as there are other ways to implement nested CV without using cross_val_score (for example, an explicit outer loop, sketched below), but the example provided specifically suggests GroupKFold as a possible method. If a fix is not planned for upcoming versions, could the nested CV example and the cross_val_score documentation (this seems to be where it fails?) be updated to reflect that this is currently not possible?

Sorry if this is the wrong place, but I'm pretty new to this and didn't want to open a new Issue.
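A minimal sketch of such an explicit outer loop, reusing X, y, groups, cv and random_search from the reproduction script above (variable names are illustrative):

outer_scores = []
for train_idx, test_idx in cv.split(X, y, groups):
    # Fit the inner search on the outer training portion, passing only the
    # matching slice of groups so the inner GroupKFold can split on it.
    random_search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    outer_scores.append(random_search.score(X[test_idx], y[test_idx]))
print(outer_scores)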

@staticdev

Any update on this? It has had a WIP PR since 2017 =/

@jnothman
Member

jnothman commented May 21, 2020 via email

@lkev

lkev commented May 14, 2021

Would it be possible to raise a NotImplementedError or similar for this? It fails silently as-is and just outputs NaN values for the scores. It took a lot of searching to find this issue.
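As a side note, passing error_score='raise' to cross_val_score (available in recent releases) should at least surface the underlying exception instead of silently returning NaN scores; a sketch reusing the reproduction script above:

model_selection.cross_val_score(random_search, X, y=y, groups=groups, cv=cv,
                                error_score='raise')
# re-raises the inner ValueError about the missing groups parameter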

@collinb9

In particular, the estimator.fit(X_train, y_train, **fit_params) call in model_selection._validation.py does not include "groups" in fit_params, so it defaults to None.

So unless I'm mistaken, it looks like passing fit_params={"groups": groups} to cross_validate() works for now?
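Applied to the reproduction script above, the suggested workaround would look like the sketch below. This assumes a scikit-learn version recent enough that array-like fit_params are sliced down to each outer training fold, which is not confirmed here:

results = model_selection.cross_validate(random_search, X, y, groups=groups,
                                         cv=cv, fit_params={'groups': groups})
print(results['test_score'])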

@staticdev

staticdev commented May 28, 2021

This is a serious issue for this kind of splitting. I don't understand how, after so many years, it is still not fixed.

@jhn-nt

jhn-nt commented Oct 5, 2021

Hi all, is there any fix or workaround for this one?

@ivezakis

So unless I'm mistaken, it looks like passing fit_params={"groups": groups} to cross_validate() works for now?

Curiously, this does seem to work. I say curiously because normally I would expect an error due to a shape mismatch: we'd be passing a subset of the data to the nested CV, but with the full groups array rather than its corresponding subset.
However, there are no errors. I am too unfamiliar with the source code to reliably check this myself. Can somebody confirm this is a valid workaround?
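One way to check without digging through the source is a toy estimator that reports how long the groups array it receives is (GroupLengthChecker below is purely illustrative). If the reported length matches the training-fold size rather than the full dataset, the fit_params array is being sliced alongside X; if it reports the full length, the workaround would silently be using the wrong groups:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GroupKFold, cross_validate

class GroupLengthChecker(BaseEstimator, ClassifierMixin):
    """Toy classifier that prints how many samples and groups fit() receives."""
    def fit(self, X, y, groups=None):
        n_groups = 'None' if groups is None else len(groups)
        print('fit: %d samples, groups of length %s' % (len(X), n_groups))
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.classes_[0])

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)
groups = np.array([0] * 5 + [1] * 5 + [2] * 5 + [3] * 5)

cross_validate(GroupLengthChecker(), X, y, groups=groups,
               cv=GroupKFold(n_splits=2), fit_params={'groups': groups})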
