FIX fix regression in gridsearchcv when parameter grids have estimators as values #29179

MarcoGorelli · 2024-06-04T14:29:46Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Fixes regression. Constructs array, and gets the dtype from there, as suggested here, but sets 'U' kinds to object in keeping with this comment

Per discussion in #29157, alternatives to creating an array may not be acceptable

Any other comments?

github-actions · 2024-06-04T14:30:59Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 58689a0. Link to the linter CI: here

MarcoGorelli · 2024-06-04T14:30:20Z

sklearn/model_selection/tests/test_search.py

-        assert_array_equal(
-            grid_search.cv_results_["param_random_state"], [0, float("nan")]
-        )
+        assert_array_equal(grid_search.cv_results_["param_random_state"], [0, None])


random_state is documented to accept an integer or None, but not float - so I think the new output looks more correct?

Yup. random_state should not be a float.

adrinjalali · 2024-06-05T10:00:32Z

sklearn/model_selection/tests/test_search.py

+    "ignore:in the future the `.dtype` attribute of a given datatype object must "
+    "be a valid dtype instance:DeprecationWarning"


who's raising this? As in, are the users gonna see this now?

NumPy raises it in the line np.result_type(*param_list)

It's a DeprecationWarning, so it wouldn't ordinarily be visible to end users, which is why running the example in the linked issue (#29157) doesn't show any warning.

Still, doesn't hurt to silence it, I've gone with that 👍

adrinjalali

LGTM

cc @thomasjpfan @lesteve

thomasjpfan

Thank you for the fix @MarcoGorelli !

thomasjpfan · 2024-06-05T15:07:03Z

sklearn/model_selection/_search.py

+                with warnings.catch_warnings():
+                    warnings.filterwarnings(
+                        "ignore",
+                        message="in the future the `.dtype` attribute",


Is NumPy raising this warning? If so, we can add a comment here?
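For readers unfamiliar with scoped warning filters, here is a minimal, standard-library-only sketch of the pattern in the diff above. The `noisy` function is a hypothetical stand-in for the NumPy call and is not part of scikit-learn or NumPy; only the filter message prefix is taken from the diff.

```python
import warnings

# Full text used by the hypothetical stand-in; the filter below only needs
# to match the beginning of it.
MESSAGE = (
    "in the future the `.dtype` attribute of a given datatype object must "
    "be a valid dtype instance"
)


def noisy():
    """Hypothetical stand-in for a call that emits the deprecation warning."""
    warnings.warn(MESSAGE, DeprecationWarning, stacklevel=2)
    return "result"


# The outer context records every warning that is not filtered out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # The inner context adds an "ignore" filter that only lives until the
    # block exits; `message` is a regex matched against the start of the text.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="in the future the `.dtype` attribute",
            category=DeprecationWarning,
        )
        silenced = noisy()  # filtered, nothing recorded

    loud = noisy()  # inner filter is gone, so this one is recorded
```

Because the filter is restored when the inner block exits, the same warning raised anywhere else still surfaces.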

thomasjpfan · 2024-06-05T15:11:06Z

sklearn/model_selection/tests/test_search.py

+def test_search_with_estimators_issue_29157():
+    pd = pytest.importorskip("pandas")


So we have a short description in the code itself:

Suggested change

-def test_search_with_estimators_issue_29157():
-    pd = pytest.importorskip("pandas")
+def test_search_with_estimators_issue_29157():
+    """Check cv_results_ for estimators with a `dtype` parameter such as OneHotEncoder."""
+    pd = pytest.importorskip("pandas")

lesteve · 2024-06-06T05:19:23Z

Thanks for the fix @MarcoGorelli!

It kind of feels like this is getting more and more complicated though 😅 ... see below for some issues I can imagine.

I was wondering why, after all, the strategy of creating an array and using the automatic dtype was dropped?

I guess one of the reasons was that, @thomasjpfan, in #28352 (comment) you said:

(Scikit-learn does not really like using fixed length string dtypes "<U4", so using object here keeps the original behavior.)

Is there anything else? If that's the only reason, maybe we can do .astype(object) if the automatically inferred dtype is a string dtype? I guess that makes a copy, not sure how crucial this is ...
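A minimal sketch of the `.astype(object)` idea, assuming illustrative parameter values that do not come from the PR:

```python
import numpy as np

# Illustrative parameter values; any list of Python strings behaves the same.
values = ["l1", "elasticnet"]

# Automatic inference yields a fixed-length unicode dtype sized to the
# longest string in the list.
inferred = np.array(values)

# Fall back to object when the inferred kind is "U". Note that astype
# returns a new array, which is the copy mentioned above.
if inferred.dtype.kind == "U":
    as_object = inferred.astype(object)
else:
    as_object = inferred
```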

I think the underlying issue is that np.result_type is a bit ambiguous: it can take both dtype-like objects (OrdinalEncoder(), 'float64', np.int32) and values (3.2, np.array([1, 2], dtype=np.float64), ...) see numpy/numpy#26612 (comment) for a proposed way to make this more explicit.

Here are some possible issues I can imagine with the code as it is in this PR:

at one point, the warning will turn into an error:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

np.result_type(OrdinalEncoder(), OrdinalEncoder())

For completeness the warning is:

<ipython-input-1-5c4a0f627be2>:4: DeprecationWarning: in the future the `.dtype` attribute of a given datatype object must be a valid dtype instance. `data_type.dtype` may need to be coerced using `np.dtype(data_type.dtype)`. (Deprecated NumPy 1.20)

The reason for the warning is that OrdinalEncoder().dtype is np.float64 and not np.dtype(np.float64)

you could imagine some other edge cases, e.g. you do a grid-search on the OrdinalEncoder dtype, so the values could be things like ['float64', 'float32']. You would expect the dtype to be object, except that:
```
np.result_type('float64', 'float32') # dtype('float64') not object
```
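
To make the ambiguity concrete, the short sketch below contrasts how `np.result_type` reads strings (as dtype names) with how `np.array` reads them (as values). The variable names are illustrative only.

```python
import numpy as np

# Strings passed to result_type are interpreted as dtype names,
# so the two names are promoted like the dtypes they spell.
as_dtype_names = np.result_type("float64", "float32")

# The same strings passed to np.array are treated as plain values,
# which gives a fixed-length unicode dtype instead.
as_values = np.array(["float64", "float32"]).dtype

# result_type also accepts values: a Python float and a float64 array.
mixed = np.result_type(3.2, np.array([1, 2], dtype=np.float64))
```

Neither reading gives `object` for a list of dtype-name strings, which is the mismatch described above.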

MarcoGorelli · 2024-06-06T07:49:48Z

thanks!

I was wondering why after all the strategy of creating an array and use the automatic dtype was dropped?

another issue is that then, a list of tuples would be detected as a 2D array instead of an object 1D array of tuples
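
A small sketch of the tuple problem, using made-up parameter values. Filling a preallocated object array element by element is one way to keep the tuples intact; it is shown here as an illustration, not as the code in this PR.

```python
import numpy as np

# Made-up parameter values: equal-length tuples.
param_values = [(1, 2), (3, 4)]

# Letting NumPy infer the shape turns the tuples into rows of a 2D array.
as_2d = np.array(param_values)

# Preallocating a 1D object array and assigning one element at a time
# stores each tuple as a single Python object.
as_1d = np.empty(len(param_values), dtype=object)
for index, value in enumerate(param_values):
    as_1d[index] = value
```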

at one point, the warning will turn into an error:

true, but TypeError, ValueError are already caught - hopefully that'll catch whatever error this turns into too? 🤞
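
The fallback being discussed can be sketched roughly as follows. The helper name `infer_param_dtype` is hypothetical and this is not the exact code merged in the PR; it only combines the scoped warning filter from the diff with the `TypeError`/`ValueError` fallback mentioned here.

```python
import warnings

import numpy as np


def infer_param_dtype(values):
    """Hypothetical helper: promote values with result_type, else use object."""
    try:
        with warnings.catch_warnings():
            # Silence only the NumPy deprecation discussed in this thread.
            warnings.filterwarnings(
                "ignore",
                message="in the future the `.dtype` attribute",
                category=DeprecationWarning,
            )
            return np.result_type(*values)
    except (TypeError, ValueError):
        # Anything result_type cannot interpret falls back to object.
        return np.dtype(object)


numeric = infer_param_dtype([1, 2.5])
not_dtype_names = infer_param_dtype(["hello", "world"])
```

Strings that are not valid dtype names make `np.result_type` raise `TypeError`, so they land in the `object` branch. An exception of any other type would propagate, which is the brittleness acknowledged in the replies.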

lesteve · 2024-06-06T09:52:57Z

Good points indeed, oh well, I guess I don't have a better solution, so let's say it is OK enough for now.

If there is another bug found in this slightly tricky code we can at least think about moving the code to a function that can be more easily tested with edge cases.

another issue is that then, a list of tuples would be detected as a 2D array instead of an object 1D array of tuples

Indeed, I have seen you added a test for this in #28571 so 👍.

About the warning that may one day turn into an error in NumPy, I guess our scipy-dev CI (which tests the development versions of our dependencies) will detect it if the error is neither a TypeError nor a ValueError, and then we can ask NumPy to consider choosing an exception that does not break our (slightly brittle) code.

fix regression in gridsearchcv when parameter grids have estimators a…

06cc76a

…s values

github-actions bot added the module:model_selection label Jun 4, 2024

MarcoGorelli commented Jun 4, 2024

adrinjalali reviewed Jun 4, 2024

MarcoGorelli added 8 commits June 4, 2024 15:56

preserve list of tuples

4905ef3

consistency, link to gh issue

176ead1

dont allocate array

c318da7

catch deprecation warning from numpy

8e73ce4

fixup

81b6da7

importorskip pandas

1975302

filter the warning

77462ae

it gets simpler!

70ed083

MarcoGorelli marked this pull request as ready for review June 5, 2024 06:21

MarcoGorelli commented Jun 5, 2024

Update sklearn/model_selection/tests/test_search.py

8a65fff

adrinjalali reviewed Jun 5, 2024

MarcoGorelli added 2 commits June 5, 2024 13:25

silence warning

b5f944f

Merge branch 'fix-regression' of github.com:MarcoGorelli/scikit-learn…

02a3937

… into fix-regression

adrinjalali approved these changes Jun 5, 2024

ogrisel added the Regression label Jun 5, 2024

ogrisel added this to the 1.5.1 milestone Jun 5, 2024

ogrisel added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jun 5, 2024

thomasjpfan reviewed Jun 5, 2024

add comment + docstring

58689a0

thomasjpfan approved these changes Jun 5, 2024

thomasjpfan merged commit b375b7b into scikit-learn:main Jun 5, 2024
30 checks passed

lesteve mentioned this pull request Jun 24, 2024

Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values #29314

Merged

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024

FIX fix regression in gridsearchcv when parameter grids have estimato…

81a30a7

…rs as values (scikit-learn#29179)

jeremiedbb mentioned this pull request Jul 2, 2024

Release 1.5.1 #29382

Merged

11 tasks

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024

FIX fix regression in gridsearchcv when parameter grids have estimato…

d39f065

…rs as values (scikit-learn#29179)

jeremiedbb pushed a commit that referenced this pull request Jul 2, 2024

FIX fix regression in gridsearchcv when parameter grids have estimato…

6428b98

…rs as values (#29179)
