
StratifiedShuffleSplit still buggy #6471


Closed
amueller opened this issue Mar 1, 2016 · 4 comments

amueller (Member) commented Mar 1, 2016

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from numpy.testing import assert_array_equal
from sklearn.utils.testing import assert_equal

y = [0, 1, 2, 3] * 3 + [4, 5] * 5
X = np.ones_like(y)

splits = StratifiedShuffleSplit(n_iter=1, train_size=11, test_size=11, random_state=0)

train, test = next(iter(splits.split(X=X, y=y)))

assert_array_equal(np.intersect1d(train, test), [])
assert_equal(len(train), 11)
assert_equal(len(test), 11)  # fails: raise self.failureException('12 != 11')

This is a follow-up to #6379 and a sign that the logic is too complex. I'm doing a rewrite to fix this bug, but it is not as clean as I'd like.

mattilyra (Contributor) commented May 30, 2016

I can add a few more test cases to this. With the following data

import sklearn.cross_validation
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1000, n_features=2, random_state=42, cluster_std=5.0)

All manner of crazy things happen

split = sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=0.09, train_size=0.1)
train_idx, test_idx = next(iter(split))
assert_equal(train_idx.shape[0], split.n_train) # pass
assert_equal(test_idx.shape[0], split.n_test) # 901 != 90

AssertionError:
Items are not equal:
ACTUAL: 901
DESIRED: 90

split = sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=0.11, train_size=0.11)
train_idx, test_idx = next(iter(split))
assert_equal(train_idx.shape[0], split.n_train) # 900 != 110
assert_equal(test_idx.shape[0], split.n_test) # 111 != 110

AssertionError:
Items are not equal:
ACTUAL: 900
DESIRED: 110

AssertionError:
Items are not equal:
ACTUAL: 111
DESIRED: 110

I'm not that bothered about getting 1 extra instance in the test or training set, but getting 10x the amount of data that was requested is not optimal.

mattilyra (Contributor) commented:

Right, so the problem is this line

test.extend(missing_idx[-(split.n_test - len(test)):])

https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/cross_validation.py#L1005

If split.n_test - len(test) is 0, meaning that we already have the correct number of test indices, we end up adding all of the missing indices to the test set, including those that went to train. The simple fix is pretty ugly, so I'm not sure if you're still wanting to rewrite the whole thing.
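The underlying slicing pitfall can be demonstrated in isolation. This is a minimal sketch independent of scikit-learn; the list contents and the `take_last` helper are illustrative, not part of the library:

```python
# The buggy pattern: taking "the last n items" with a negative slice.
missing_idx = [10, 11, 12, 13, 14]

n_needed = 2
assert missing_idx[-n_needed:] == [13, 14]  # last two, as intended

n_needed = 0
# -0 is just 0, so the slice returns the WHOLE list instead of nothing:
assert missing_idx[-n_needed:] == [10, 11, 12, 13, 14]

# Guarding the zero case explicitly avoids the surprise:
def take_last(seq, n):
    return seq[len(seq) - n:] if n > 0 else []

assert take_last(missing_idx, 2) == [13, 14]
assert take_last(missing_idx, 0) == []
```

This is why the test set only blows up when the rounding of train_size/test_size leaves zero missing test indices: in every other case the negative slice does the right thing.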

This seems to have already been fixed on master though.

lesteve (Member) commented May 31, 2016

This seems to have already been fixed on master though.

Sorry, I may have missed something here. Have you found a bug that has not already been fixed in master or that was not the one for which the issue was created?

mattilyra (Contributor) commented:

@lesteve no, after adding the comments I noticed that on the upstream master the issue has been fixed, so this should probably be closed.

The link above is to the 0.17 sources (linked to from the API docs). Sorry for the confusion.
