FIX Add NaN handling to selection of best parameters for HalvingGridSearchCV #24539


Merged
@glemaitre merged 13 commits into scikit-learn:main from nan-score-selected on Nov 3, 2022

Conversation

@betatim (Member) commented Sep 29, 2022

Reference Issues/PRs

Fix #20678

What does this implement/fix? Explain your changes.

Parameter combinations that have a score of NaN should not be ranked higher than solutions with an actual score.

This switches to np.nanargmax() so that the highest non-NaN score is selected. In addition, when selecting the top-k parameter combinations, combinations with a NaN score are now ranked lower than any combination with an actual score.
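For illustration only (a minimal sketch, not the exact code in this PR), here is how np.argmax() and np.nanargmax() differ on scores containing NaN, and one way to rank NaN-scored candidates last when taking the top k:

import numpy as np

scores = np.array([0.2, np.nan, 0.8, 0.5])

np.argmax(scores)     # -> 1, the NaN "wins" because NaN propagates through max
np.nanargmax(scores)  # -> 2, the index of the best non-NaN score

# Rank candidates from best to worst with NaNs last: argsort() always
# places NaNs at the end, so sorting the negated scores gives a
# descending order in which NaN-scored candidates come last.
order = np.argsort(-scores)                    # -> [2, 3, 0, 1]
n_valid = np.count_nonzero(~np.isnan(scores))  # number of real scores
top_k = order[: min(2, n_valid)]               # top 2 among non-NaN scores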

Any other comments?

I am not sure exactly how to provoke the failure. I've tested this with a fake estimator like the following:

import random

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class RandomlyFailingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_estimators=10, some_param=2):
        self.some_param = some_param
        self.n_estimators = n_estimators

    def fit(self, X, y):
        # Fail at random for some parameter combinations.
        if random.random() < 0.5 and self.n_estimators != 10:
            raise Exception("waaahhh")
        return self

    def predict(self, X):
        # Fail at random, independently of fit().
        if random.random() < 0.3:
            raise Exception("muuuuuuuh")
        # Otherwise predict a random class for each sample.
        y = random.choices([0, 1], k=X.shape[0])
        return np.array(y)

which allows fit() and/or predict() to fail at random. I think for a non-regression test we need something better. Does someone have an idea?

Another thing I've noticed is that np.nanargmax raises an exception if all the elements passed to it are NaN. Not quite sure what we should do in that case. Pick a random order? Raise?
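For reference, the behavior in question (NumPy raises on an all-NaN input):

import numpy as np

np.nanargmax(np.array([np.nan, np.nan]))
# ValueError: All-NaN slice encountered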

Commit: Parameter combinations that have a score of NaN should not be ranked higher than solutions with an actual score.
@thomasjpfan (Member) left a comment

For a deterministic non-regression test, subclass:

class FastClassifier(DummyClassifier):

and add a testing_fit_param that makes fit() fail when testing_fit_param=1 and pass for all other values. Then grid search over testing_fit_param with 1 included. One can do the same for predict. A sketch of this idea follows below.
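A minimal sketch of that suggestion; the class body and search setup here are one reading of the idea, not the final test code. Since grid search varies constructor parameters, testing_fit_param is modelled as one:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV


class FastClassifier(DummyClassifier):
    """Fails deterministically when testing_fit_param == 1."""

    def __init__(self, strategy="stratified", random_state=None, testing_fit_param=0):
        self.testing_fit_param = testing_fit_param
        super().__init__(strategy=strategy, random_state=random_state)

    def fit(self, X, y):
        if self.testing_fit_param == 1:
            raise ValueError("deterministic failure for testing_fit_param=1")
        return super().fit(X, y)


X, y = make_classification(n_samples=100, random_state=0)
search = HalvingGridSearchCV(
    FastClassifier(),
    param_grid={"testing_fit_param": [0, 1, 2]},
    error_score=float("nan"),  # failed fits get a NaN score instead of raising
).fit(X, y)
# With this PR's fix, the NaN-scored candidate is never selected as best.
assert search.best_params_["testing_fit_param"] != 1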

If all results are NaN, then I would raise. We have this behavior when all results fail in GridSearchCV:

all_fits_failed_message = (
    f"\nAll the {num_fits} fits failed.\n"
    "It is very likely that your model is misconfigured.\n"
    "You can try to debug the error by setting error_score='raise'.\n\n"
    f"Below are more details about the failures:\n{fit_errors_summary}"
)
raise ValueError(all_fits_failed_message)

For successive halving, if one of the iterations completely fails, there is no "top k" to select from, and I think we should raise.

Commit: This checks that candidates that have a NaN score are always ranked lowest.
@betatim (Member, Author) commented Oct 4, 2022

I added some tests based on a classifier that fails depending on the value of a hyper-parameter. What would be nice is a classifier that fails on some of the CV splits but not all. I can't work out how to do that, though, and I think it would be over the top anyway: mean_test_score will contain a NaN if any of the splits is NaN.
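For context on why a single failing split would be enough: averaging the split scores propagates the NaN, so mean_test_score ends up NaN as well. A quick check:

import numpy as np

split_scores = np.array([0.8, np.nan, 0.9])  # one CV split failed
split_scores.mean()  # -> nan, so mean_test_score would be NaN too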

@thomasjpfan (Member) left a comment

Thank you for the update!

I think this needs a what's new entry in v1.2 and an update to the docstring to explain the behavior.

@thomasjpfan (Member) left a comment

Otherwise LGTM

@thomasjpfan thomasjpfan changed the title Add NaN handling to selection of best parameters for HalvingGridSearchCV FIX Add NaN handling to selection of best parameters for HalvingGridSearchCV Oct 11, 2022
betatim and others added 3 commits October 12, 2022 09:52
@cmarmo cmarmo added the Waiting for Second Reviewer First reviewer is done, need a second one! label Oct 22, 2022
@glemaitre glemaitre self-requested a review November 3, 2022 10:38
@glemaitre (Member) commented Nov 3, 2022

I just merged main into the branch before making a review.

@glemaitre (Member) left a comment

LGTM. Only some nitpicks.

@glemaitre glemaitre removed the Waiting for Second Reviewer First reviewer is done, need a second one! label Nov 3, 2022
betatim and others added 2 commits November 3, 2022 16:10
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
@glemaitre glemaitre merged commit 3c8e0a2 into scikit-learn:main Nov 3, 2022
@glemaitre (Member) commented
Safely merging since some of the CIs are green and we only made cosmetic changes.

@betatim betatim deleted the nan-score-selected branch November 4, 2022 10:25
andportnoy pushed a commit to andportnoy/scikit-learn that referenced this pull request Nov 5, 2022
Successfully merging this pull request may close these issues.

Halving*SearchCV selects estimator with nan score as best estimator
5 participants