-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
Closed
Description
I have some data X and y that I want to subsample (i.e. only keep a subset of the samples) in a stratified way.
Using something like
_, X_sub, _, y_sub = train_test_split(
X, y, stratify=stratify, train_size=None, test_size=n_samples_sub)is almost what I need. But that will error if:
- I happen to want exactly
X.shape[0]samples (ValueError: test_size=60 should be either positive and smaller than the number of samples) - I want something close to
X.shape[0]which is not enough to have a stratified test or train set (ValueError: The train_size = 1 should be greater or equal to the number of classes = 2)
Would it make sense to add a subsample() util?
Another option would be to add a bypass_checks to train_test_split
Basically what I need is a stratify option to utils.resample, that's probably the most appropriate place to introduce this.
Metadata
Metadata
Assignees
Labels
No labels