FIX Add NaN handling to selection of best parameters for HalvingGridSearchCV #24539


Merged
@glemaitre merged 13 commits into scikit-learn:main from nan-score-selected on Nov 3, 2022

Conversation

@betatim (Member) commented Sep 29, 2022

Reference Issues/PRs

Fix #20678

What does this implement/fix? Explain your changes.

Parameter combinations that have a score of NaN should not be ranked higher than solutions with an actual score.

This switches to np.nanargmax() so that the highest non-NaN score is selected. In addition, when selecting the top-k parameter combinations, combinations with a NaN score are now ranked lower than any combination with an actual score.
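For illustration only (a minimal sketch, not the exact code in this PR), here is how np.argmax() and np.nanargmax() differ on scores containing NaN, and one way to rank NaN-scored candidates last when taking the top k:

import numpy as np

scores = np.array([0.2, np.nan, 0.8, 0.5])

np.argmax(scores)     # -> 1, the NaN "wins" because NaN propagates through max
np.nanargmax(scores)  # -> 2, the index of the best non-NaN score

# Rank candidates from best to worst with NaNs last: argsort() always
# places NaNs at the end, so sorting the negated scores gives a
# descending order in which NaN-scored candidates come last.
order = np.argsort(-scores)                    # -> [2, 3, 0, 1]
n_valid = np.count_nonzero(~np.isnan(scores))  # number of real scores
top_k = order[: min(2, n_valid)]               # top 2 among non-NaN scores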

Any other comments?

I am not sure exactly how to provoke the failure. I've tested this with a fake estimator like the following:

import random

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class RandomlyFailingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_estimators=10, some_param=2):
        self.some_param = some_param
        self.n_estimators = n_estimators

    def fit(self, X, y):
        # Fail at random for some parameter combinations.
        if random.random() < 0.5 and self.n_estimators != 10:
            raise Exception("waaahhh")
        return self

    def predict(self, X):
        # Fail at random, independently of fit().
        if random.random() < 0.3:
            raise Exception("muuuuuuuh")
        # Otherwise predict a random class for each sample.
        y = random.choices([0, 1], k=X.shape[0])
        return np.array(y)

which allows fit() and/or predict() to fail at random. I think for a non-regression test we need something better. Does someone have an idea?

Another thing I've noticed is that np.nanargmax raises an exception if all the elements passed to it are NaN. Not quite sure what we should do in that case. Pick a random order? Raise?
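For reference, the behavior in question (NumPy raises on an all-NaN input):

import numpy as np

np.nanargmax(np.array([np.nan, np.nan]))
# ValueError: All-NaN slice encountered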

Commit: Parameter combinations that have a score of NaN should not be ranked higher than solutions with an actual score.
@thomasjpfan (Member) left a comment

For a deterministic non-regression test, subclass:

class FastClassifier(DummyClassifier):

and add a testing_fit_param that makes fit() fail when testing_fit_param=1 and pass for all other values. Then grid search over testing_fit_param with 1 included. One can do the same for predict. A sketch of this idea follows below.
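A minimal sketch of that suggestion; the class body and search setup here are one reading of the idea, not the final test code. Since grid search varies constructor parameters, testing_fit_param is modelled as one:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV


class FastClassifier(DummyClassifier):
    """Fails deterministically when testing_fit_param == 1."""

    def __init__(self, strategy="stratified", random_state=None, testing_fit_param=0):
        self.testing_fit_param = testing_fit_param
        super().__init__(strategy=strategy, random_state=random_state)

    def fit(self, X, y):
        if self.testing_fit_param == 1:
            raise ValueError("deterministic failure for testing_fit_param=1")
        return super().fit(X, y)


X, y = make_classification(n_samples=100, random_state=0)
search = HalvingGridSearchCV(
    FastClassifier(),
    param_grid={"testing_fit_param": [0, 1, 2]},
    error_score=float("nan"),  # failed fits get a NaN score instead of raising
).fit(X, y)
# With this PR's fix, the NaN-scored candidate is never selected as best.
assert search.best_params_["testing_fit_param"] != 1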

If all results are NaN, then I would raise. We have this behavior when all results fail in GridSearchCV:

all_fits_failed_message = (
    f"\nAll the {num_fits} fits failed.\n"
    "It is very likely that your model is misconfigured.\n"
    "You can try to debug the error by setting error_score='raise'.\n\n"
    f"Below are more details about the failures:\n{fit_errors_summary}"
)
raise ValueError(all_fits_failed_message)

For successive halving, if one of the iterations completely fails, there is no "top k" to select from, and I think we should raise.

Commit: This checks that candidates that have a NaN score are always ranked lowest.
@betatim (Member, Author) commented Oct 4, 2022

I added some tests based on a classifier that fails depending on the value of a hyper-parameter. What would be nice is a classifier that fails on some of the CV splits but not all. I can't work out how to do that, though, and I think it would be over the top anyway: mean_test_score will contain a NaN if any of the splits is NaN.
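For context on why a single failing split would be enough: averaging the split scores propagates the NaN, so mean_test_score ends up NaN as well. A quick check:

import numpy as np

split_scores = np.array([0.8, np.nan, 0.9])  # one CV split failed
split_scores.mean()  # -> nan, so mean_test_score would be NaN too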

@thomasjpfan (Member) left a comment

Thank you for the update!

I think this needs a what's new entry in v1.2 and an update to the docstring to explain the behavior.

@thomasjpfan (Member) left a comment

Otherwise LGTM

@thomasjpfan thomasjpfan changed the title Add NaN handling to selection of best parameters for HalvingGridSearchCV FIX Add NaN handling to selection of best parameters for HalvingGridSearchCV Oct 11, 2022
betatim and others added 3 commits October 12, 2022 09:52
@cmarmo cmarmo added the Waiting for Second Reviewer First reviewer is done, need a second one! label Oct 22, 2022
@glemaitre glemaitre self-requested a review November 3, 2022 10:38
@glemaitre (Member) commented Nov 3, 2022

I just merged main into the branch before making a review.

@glemaitre (Member) left a comment

LGTM. Only some nitpicks.

@glemaitre glemaitre removed the Waiting for Second Reviewer First reviewer is done, need a second one! label Nov 3, 2022
betatim and others added 2 commits November 3, 2022 16:10
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
@glemaitre glemaitre merged commit 3c8e0a2 into scikit-learn:main Nov 3, 2022
@glemaitre (Member) commented
Safely merging since some of the CIs are green and we only made cosmetic changes.

@betatim betatim deleted the nan-score-selected branch November 4, 2022 10:25
andportnoy pushed a commit to andportnoy/scikit-learn that referenced this pull request Nov 5, 2022
Successfully merging this pull request may close these issues.

Halving*SearchCV selects estimator with nan score as best estimator
5 participants