BUG: use appropriate dtype in cv_results as opposed to always using object #28352

MarcoGorelli · 2024-02-02T10:54:38Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Instead of always using dtype object, use a more appropriate dtype (the one detected by numpy)

Any other comments?

I noticed this when trying to use Polars, which is pickier about object dtype than pandas, for #28345

The existing tests already cover this functionality, so I've just updated them rather than increasing the test suite's running time. I can add a new test if desired though
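
For illustration, a minimal sketch of the difference described above, using NumPy directly rather than a full grid search. The values in `c_values` are hypothetical; the only point is that letting NumPy detect the dtype yields a numeric column, whereas forcing `dtype=object` does not.

```python
import numpy as np
from numpy.ma import MaskedArray

# Hypothetical values for a numeric hyperparameter, one per candidate.
c_values = [0.1, 1.0, 10.0]

# Previous behavior: the column is always stored with dtype object.
object_column = MaskedArray(c_values, mask=False, dtype=object)

# Behavior proposed here: use the dtype NumPy detects for the values.
detected = np.array(c_values)
typed_column = MaskedArray(detected, mask=False)

print(object_column.dtype)  # object
print(typed_column.dtype)   # float64
```

A dataframe library that is strict about object columns can then treat the second column as plain floats.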

github-actions · 2024-02-02T10:55:57Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 248486f. Link to the linter CI: here


adrinjalali · 2024-02-28T13:25:48Z

@MarcoGorelli haven't forgotten about this. But this is touching VERY OLD code, so I need to spend some time to get into it 😉

MarcoGorelli · 2024-02-28T13:29:42Z

no hurry at all, I understand this is low priority!

adrinjalali

LGTM. Thanks!

adrinjalali · 2024-02-29T12:13:52Z

sklearn/model_selection/_search.py

+                for index, value in param_results[key].items():
+                    # Setting the value at an index unmasks that index
+                    ma[index] = value


I don't love the nested for loop here, but I can't think of a better way.
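
A small sketch of what the loop in this hunk does, assuming a hypothetical `param_result` dict that maps candidate index to value for the candidates where the parameter applies. Every slot starts masked, and each assignment unmasks only the index it touches.

```python
import numpy as np
from numpy.ma import MaskedArray

n_candidates = 4
# Hypothetical: the parameter only applies to candidates 0 and 2.
param_result = {0: 0.5, 2: 2.0}

# Start fully masked; the underlying data is uninitialized until assigned.
ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=float)
for index, value in param_result.items():
    # Setting the value at an index unmasks that index
    ma[index] = value

print(ma.mask.tolist())  # [False, True, False, True]
```

The inner loop only visits the candidates that actually carry the parameter, so its cost is bounded by the number of stored values rather than by `n_candidates`.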

adrinjalali · 2024-02-29T12:15:53Z

@thomasjpfan could you maybe have a look?

thomasjpfan

Thank you for the PR!

thomasjpfan · 2024-03-01T22:40:26Z

sklearn/model_selection/_search.py

+                for index, value in param_results[key].items():
+                    # Setting the value at an index unmasks that index
+                    ma[index] = value
+                param_results[key] = ma


Instead of overwriting the param_results in the loop, can we directly add the new results into the results dict?

Suggested change

-                param_results[key] = ma
+                results[key] = ma

A few lines down, we can remove the results.update(param_results).

thomasjpfan · 2024-03-01T22:45:11Z

sklearn/model_selection/_search.py

+                # Use one MaskedArray and mask all the places where the param is not
+                # applicable for that candidate (which may not contain all the params).
+                ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr.dtype)
+                for index, value in param_results[key].items():


Suggested change

-                for index, value in param_results[key].items():
+                for index, value in param_result.items():

thomasjpfan · 2024-03-01T22:49:56Z

sklearn/model_selection/_search.py

+        for key in param_results:
+            arr = np.array(list(param_results[key].values()))
+            if len(arr) == n_candidates:
+                param_results[key] = MaskedArray(arr, mask=False)
+            else:
+                # Use one MaskedArray and mask all the places where the param is not
+                # applicable for that candidate (which may not contain all the params).
+                ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr.dtype)


We can avoid creating a new NumPy array:

Suggested change

-        for key in param_results:
-            arr = np.array(list(param_results[key].values()))
-            if len(arr) == n_candidates:
-                param_results[key] = MaskedArray(arr, mask=False)
-            else:
-                # Use one MaskedArray and mask all the places where the param is not
-                # applicable for that candidate (which may not contain all the params).
-                ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr.dtype)
+        for key, param_result in param_results.items():
+            param_list = list(param_results[key].values())
+            try:
+                arr_dtype = np.result_type(*param_list)
+            except TypeError:
+                arr_dtype = object
+            if len(arr) == n_candidates:
+                results[key] = MaskedArray(arr, mask=False, dtype=arr_dtype)
+            else:
+                # Use one MaskedArray and mask all the places where the param is not
+                # applicable for that candidate (which may not contain all the params).
+                ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr_dtype)

(Scikit-learn does not really like using fixed length string dtypes "<U4", so using object here keeps the original behavior.)
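
A self-contained sketch of the idea in this suggestion, not the merged code. `build_param_column` is a hypothetical helper name, and the inputs are made up. Note that the suggestion as quoted still refers to `arr`, which is no longer defined once the first lines are replaced; the sketch uses `param_list` instead and always takes the masked-fill path for simplicity.

```python
import numpy as np
from numpy.ma import MaskedArray


def build_param_column(param_result, n_candidates):
    """Hypothetical helper: param_result maps candidate index -> value."""
    param_list = list(param_result.values())
    try:
        # Promote the Python/NumPy scalars to a common dtype.
        arr_dtype = np.result_type(*param_list)
    except TypeError:
        # Strings such as "linear" are parsed as dtype specifiers and
        # rejected, so non-numeric parameters fall back to object.
        arr_dtype = object

    # Start fully masked, then unmask the indices that have a value.
    ma = MaskedArray(np.empty(n_candidates, dtype=arr_dtype), mask=True)
    for index, value in param_result.items():
        ma[index] = value
    return ma


kernel = build_param_column({0: "linear", 1: "rbf"}, n_candidates=3)
c = build_param_column({1: 0.1, 2: 1.0}, n_candidates=3)
print(kernel.dtype, kernel.mask.tolist())  # object [False, False, True]
print(c.dtype.kind, c.mask.tolist())       # f [True, False, False]
```

Falling back to object for strings matches the remark above about avoiding fixed-length string dtypes such as "<U4".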

MarcoGorelli · 2024-03-02T11:13:55Z

Thanks both for your reviews!

That looks better, thanks Thomas, have updated

(unrelated, but pre-commit run -a changes a lot of files...FWIW I'd suggest just running pre-commit run -a in CI rather than repeating it all in linting.sh)

thomasjpfan

LGTM

lesteve · 2024-03-04T08:13:56Z

It seems like this broke the doc build on main, see build log

The error is:

Unexpected failing examples:

    ../examples/model_selection/plot_grid_search_text_feature_extraction.py failed leaving traceback:

    Traceback (most recent call last):
      File "/home/circleci/project/examples/model_selection/plot_grid_search_text_feature_extraction.py", line 161, in <module>
        cv_results = pd.DataFrame(random_search.cv_results_)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/frame.py", line 767, in __init__
        mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
        return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
        index = _extract_index(arrays)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 664, in _extract_index
        raise ValueError("Per-column arrays must each be 1-dimensional")
    ValueError: Per-column arrays must each be 1-dimensional
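
One plausible reading of this failure, in line with the title of the follow-up #28571: when parameter values are tuples, NumPy's dtype detection turns the list of tuples into a 2-D array, which pandas then rejects as a column. The sketch below shows that effect in isolation. `ngram_ranges` is a hypothetical stand-in for a tuple-valued parameter, and the 1-D object array is just one way to keep one tuple per candidate, not necessarily what the fix does.

```python
import numpy as np

# Hypothetical tuple-valued parameter, one tuple per candidate.
ngram_ranges = [(1, 1), (1, 2)]

# Letting NumPy detect the dtype treats each tuple as a row.
as_array = np.array(ngram_ranges)
print(as_array.shape)  # (2, 2)

# Keeping one tuple per candidate requires a 1-D object array,
# filled element by element so the tuples are not unpacked.
column = np.empty(len(ngram_ranges), dtype=object)
for i, value in enumerate(ngram_ranges):
    column[i] = value
print(column.shape)  # (2,)
```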

BLD: Fix broken versions

Arviz uses scipy.signal.gaussian, which was removed in 1.13. Most recent arviz uses scipy.signal.windows.gaussian.

scikit-learn 1.5.0 contained a regression (scikit-learn/scikit-learn#28352) that has been fixed in scikit-learn/scikit-learn#29078.

BLD: Chasing errors, limit scipy/arviz in SBR

To test the notebooks, need to install SBR extras, which included a version of arviz that isn't available on 3.9. Earlier version works, but restricts scipy version.

github-actions bot added the module:model_selection label Feb 2, 2024

MarcoGorelli force-pushed the cv_results-dtypes branch from a9af32e to 41658f3 Compare February 2, 2024 11:01

MarcoGorelli marked this pull request as ready for review February 2, 2024 15:20

adrinjalali self-requested a review February 7, 2024 14:08

MarcoGorelli added 5 commits February 15, 2024 16:39

BUG: use appropriate dtype in cv_results as opposed to always using object

519758c

fixup for different platforms 🤞

d4514e7

only check kinds?

7d9aaa3

avoid branching, handle higher-dim case

44af268

Revert "avoid branching, handle higher-dim case"

06e35b6

This reverts commit 44af268.

MarcoGorelli force-pushed the cv_results-dtypes branch from 38a2498 to 06e35b6 Compare February 15, 2024 16:39

adrinjalali approved these changes Feb 29, 2024


thomasjpfan reviewed Mar 1, 2024


MarcoGorelli added 3 commits March 2, 2024 10:28

Merge remote-tracking branch 'upstream/main' into cv_results-dtypes

5636c28

simplify, dont allocate unnecessary array

401a413

format

50254c7

MarcoGorelli added 2 commits March 2, 2024 12:21

Merge remote-tracking branch 'upstream/main' into cv_results-dtypes

8aba81e

it gets simpler

248486f

MarcoGorelli requested a review from thomasjpfan March 2, 2024 12:34

thomasjpfan approved these changes Mar 3, 2024


thomasjpfan merged commit fd28ffd into scikit-learn:main Mar 3, 2024

lesteve mentioned this pull request Mar 4, 2024

🔒 🤖 CI Update lock files for main CI build(s) 🔒 🤖 #28569

Closed

This was referenced Mar 4, 2024

BUG: ensure list of tuples results in 1d masked array in cv_results, as opposed to 2d array #28571

Merged

DOC use polars in plot_digits_pipe example #28576

Merged

adrinjalali mentioned this pull request May 22, 2024

GridSearchCV with custom estimator and nested Parameter Grids raises ValueError in scikit-learn 1.5.0 #29074

Closed

Jacob-Stevens-Haas mentioned this pull request May 29, 2024

BLD: Fix broken versions dynamicslab/pysindy#512

Merged

lesteve mentioned this pull request Jun 4, 2024

TypeError when fitting GridSearchCV or RandomizedSearchCV with OrdinalEncoder and OneHotEncoder in parameters grid #29157

Closed

MarcoGorelli mentioned this pull request Jun 4, 2024

FIX fix regression in gridsearchcv when parameter grids have estimators as values #29179

Merged

jeremiedbb mentioned this pull request Jun 20, 2024

GridSearchCV fails when parameters are arrays with different sizes #29277

Closed
