
check_fit2d_1sample and check_fit2d_1feature expect very specific error message #12734


Closed
NicolasHug opened this issue Dec 7, 2018 · 6 comments


@NicolasHug
Member

NicolasHug commented Dec 7, 2018

check_fit2d_1sample and check_fit2d_1feature are part of the check_estimator test suite.

To pass those checks, the estimator either has to run gracefully or raise a ValueError whose message contains one of a few predefined substrings (the matching is sketched below):

msgs = ["1 sample", "n_samples = 1", "n_samples=1", "one sample", "1 class", "one class"]

and

msgs = ["1 feature(s)", "n_features = 1", "n_features=1"]

This is quite restrictive. For example:

I have a custom estimator that does early stopping: X is split into train and validation sets with train_test_split.

Passing only 1 sample (as in check_fit2d_1sample) to train_test_split leaves the train set empty. An exception should of course be raised, but the appropriate message here is something along the lines of:
"Not enough training data to perform early stopping. Use more training data or adjust 'test_size'".

There are many ways to get an empty train set out of train_test_split; passing a single sample is just one of them. Another is setting test_size to e.g. 0.99 when n_samples < 100.
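
A small reproduction of both paths (depending on the scikit-learn version, train_test_split either returns an empty train set or raises its own ValueError about the train set being empty; either way, none of the required substrings appear):

import numpy as np
from sklearn.model_selection import train_test_split

# One sample, as in check_fit2d_1sample:
X, y = np.zeros((1, 10)), np.zeros(1)
train_test_split(X, y, test_size=0.5)   # train set is empty

# Same outcome with plenty of samples:
X, y = np.zeros((50, 10)), np.zeros(50)
train_test_split(X, y, test_size=0.99)  # n_test = ceil(49.5) = 50, so n_train = 0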

Using one of the required substrings would not make sense here.

So if I want to pass the estimator checks, I'm bound to add a very specific check in my code for n_samples == 1 that raises a message containing one of the appropriate substrings. In my case I could probably append something like "got n_samples={n_samples}" to the message, but I'm still not sure forcing a given substring makes sense in all situations.
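
A hypothetical excerpt of such a fit method (the variable names and message are illustrative, not actual library code). Because of the f-string, a single-sample input yields the literal substring "n_samples=1", which the check accepts:

n_samples = X.shape[0]
if X_train.shape[0] == 0:
    raise ValueError(
        "Not enough training data to perform early stopping. Use more "
        "training data or adjust 'test_size' "
        f"(got n_samples={n_samples})."
    )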


TL;DR: the reason the estimator fails in the context of these checks may be much more general than the fact that it was passed 1 sample (or 1 feature). To pass the checks, though, the error message must point to a very small subset of the possible causes.

Happy to submit a PR.

@albertcthomas
Contributor

Is it meaningful to pass only 1 sample to train_test_split?

@NicolasHug
Member Author

No.

But my point is that, in the context of my estimator, passing one sample is just as (non-)meaningful as passing, e.g., test_size=0.9999999, or any other input combination that results in an empty train set.

In my situation it makes more sense to have one unique error message that handles all those cases than one error message per case.

@albertcthomas
Contributor

Is it meaningful to pass only 1 sample to train_test_split?

Just saw issue #11028 about passing one sample to train_test_split.

@albertcthomas
Contributor

I see... FWIW if you are using check_array you can set ensure_min_samples to 2.
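
A minimal sketch of that approach inside the estimator's fit (check_array is scikit-learn's input validation helper):

from sklearn.utils import check_array

# With a single row this raises a ValueError along the lines of
# "Found array with 1 sample(s) ... while a minimum of 2 is required",
# and "1 sample" is one of the substrings check_fit2d_1sample accepts.
X = check_array(X, ensure_min_samples=2)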

@jnothman
Member

jnothman commented Dec 9, 2018 via email

@NicolasHug
Member Author

Closing early since this will be addressed by #18582.
