Describe the bug
Halving*SearchCV selects nan scores as the best scores. With the default error_score=np.nan, that means that the best selected model fails when fitted on the training set. Doing a git bisect, this seems to have been introduced by #20203. I have not yet understood why exactly.
Result of the git bisect:
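As an illustration of how a nan score can end up being reported as the best one, here is a minimal sketch; it is not HalvingGridSearchCV's actual selection code, and the scores array below is made up.

import numpy as np

# Hypothetical mean test scores for three candidates; the first candidate's
# fit failed and was scored with error_score=np.nan.
scores = np.array([np.nan, 0.93, 0.91])

# np.argmax propagates nan: the failed candidate wins the selection.
print(np.argmax(scores))     # 0, the nan candidate
# np.nanargmax ignores nan entries and picks the best valid candidate.
print(np.nanargmax(scores))  # 1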
Steps/Code to Reproduce
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)

# max_depth=-10 fails during fit
param_grid = {"max_depth": [-10, 3, 10]}
search = HalvingGridSearchCV(
    clf,
    param_grid,
    resource="n_estimators",
    max_resources=10,
    random_state=0,
    refit=False,
)
search.fit(X, y)
print("best score:", search.best_score_)
print("best params:", search.best_params_)
Expected Results
Halving*SearchCV discards models with nan scores. Output in 0.24.2 and 6484c4f is (at the very bottom, after plenty of warnings):
best score: 0.9333333333333332
best params: {'max_depth': 3, 'n_estimators': 1}
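A minimal sketch of this expected behaviour, again not the library's actual code and with made-up scores mirroring the output above: nan-scored candidates are discarded before picking the best one.

import numpy as np

# Made-up candidate scores and parameters mirroring the expected output.
scores = np.array([np.nan, 0.9333333333333332, 0.92])
params = [
    {"max_depth": -10, "n_estimators": 1},
    {"max_depth": 3, "n_estimators": 1},
    {"max_depth": 10, "n_estimators": 1},
]

# Keep only candidates with a valid score, then select the best of those.
valid = np.flatnonzero(~np.isnan(scores))
best = valid[np.argmax(scores[valid])]
print("best score:", scores[best])
print("best params:", params[best])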
Actual Results
Halving*SearchCV thinks the best model is the one with a nan score. Output of the previous snippet with 6484c4f:
best score: nan
best params: {'max_depth': -10, 'n_estimators': 1}
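A possible mitigation sketch, not verified here and only assuming that error_score accepts a numeric value as documented: with a finite error_score, the failed fits get a very low score instead of nan, so they should rank last rather than being reported as best.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0)
param_grid = {"max_depth": [-10, 3, 10]}

# Assumed mitigation: failed fits are scored -1 instead of nan.
search = HalvingGridSearchCV(
    clf,
    param_grid,
    resource="n_estimators",
    max_resources=10,
    random_state=0,
    refit=False,
    error_score=-1,
)
search.fit(X, y)
print("best score:", search.best_score_)
print("best params:", search.best_params_)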
Versions
System:
python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32) [GCC 9.3.0]
executable: /home/lesteve/miniconda3/bin/python
machine: Linux-5.4.0-77-generic-x86_64-with-glibc2.31
Python dependencies:
pip: 21.1.3
setuptools: 49.6.0.post20210108
sklearn: 1.0.dev0
numpy: 1.21.0
scipy: 1.7.0
Cython: 0.29.23
pandas: 1.3.0
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True