BUG: use appropriate dtype in cv_results as opposed to always using object #28352

Merged: 10 commits, Mar 3, 2024

Conversation

MarcoGorelli
Contributor

Reference Issues/PRs

closes #28350

What does this implement/fix? Explain your changes.

Instead of always using dtype object, use a more appropriate dtype (the one detected by numpy)
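
As a rough illustration of the difference (a sketch with made-up parameter values, not the code from this PR): NumPy infers a numeric dtype from a list of numbers, while forcing `dtype=object` stores each entry as a generic Python object.

```python
import numpy as np

# Hypothetical values of one hyperparameter across three candidates.
values = [0.1, 1.0, 10.0]

as_object = np.array(values, dtype=object)  # behaviour described as "always object"
as_detected = np.array(values)              # dtype detected by NumPy

print(as_object.dtype)    # object
print(as_detected.dtype)  # float64
```

Libraries that are strict about object columns can then consume the numeric array directly.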

Any other comments?

I noticed this when trying to use Polars, which is pickier about object dtype than pandas, for #28345

The existing tests already cover this functionality, so I've just updated them rather than increasing the test suite's running time. I can add a new test if desired though

github-actions bot commented Feb 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 248486f. Link to the linter CI: here

@MarcoGorelli MarcoGorelli marked this pull request as ready for review February 2, 2024 15:20
@adrinjalali adrinjalali self-requested a review February 7, 2024 14:08
@adrinjalali
Member

@MarcoGorelli haven't forgotten about this. But this is touching VERY OLD code, so I need to spend some time to get into it 😉

@MarcoGorelli
Contributor Author

no hurry at all, I understand this is low priority!

Member

@adrinjalali adrinjalali left a comment

LGTM. Thanks!

Comment on lines 1088 to 1090
    for index, value in param_results[key].items():
        # Setting the value at an index unmasks that index
        ma[index] = value
Member

I don't love the nested for loop here, but I can't think of a better way.
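
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of how assigning into a fully masked array unmasks only the assigned positions. The names `n_candidates` and `sparse_values` and their contents are assumptions chosen for illustration.

```python
import numpy as np
from numpy.ma import MaskedArray

n_candidates = 4
# Hypothetical mapping: candidate index -> value, for the candidates
# where this parameter applies (indices 1 and 3 have no value).
sparse_values = {0: 0.1, 2: 10.0}

# Start fully masked, then fill in the known entries one by one.
ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=float)
for index, value in sparse_values.items():
    # Setting the value at an index unmasks that index
    ma[index] = value

print(ma.mask.tolist())        # [False, True, False, True]
print(ma.compressed().tolist())  # [0.1, 10.0]
```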

@adrinjalali
Member

@thomasjpfan could you maybe have a look?

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR!

    for index, value in param_results[key].items():
        # Setting the value at an index unmasks that index
        ma[index] = value
    param_results[key] = ma
Member

Instead of overwriting the param_results in the loop, can we directly add the new results into the results dict?

Suggested change
- param_results[key] = ma
+ results[key] = ma

A few lines down, we can remove the results.update(param_results).

    # Use one MaskedArray and mask all the places where the param is not
    # applicable for that candidate (which may not contain all the params).
    ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr.dtype)
    for index, value in param_results[key].items():
Member

Suggested change
- for index, value in param_results[key].items():
+ for index, value in param_result.items():

Comment on lines 1080 to 1087
    for key in param_results:
        arr = np.array(list(param_results[key].values()))
        if len(arr) == n_candidates:
            param_results[key] = MaskedArray(arr, mask=False)
        else:
            # Use one MaskedArray and mask all the places where the param is not
            # applicable for that candidate (which may not contain all the params).
            ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr.dtype)
Member

We can avoid creating a new NumPy array:

Suggested change
- for key in param_results:
-     arr = np.array(list(param_results[key].values()))
-     if len(arr) == n_candidates:
-         param_results[key] = MaskedArray(arr, mask=False)
-     else:
-         # Use one MaskedArray and mask all the places where the param is not
-         # applicable for that candidate (which may not contain all the params).
-         ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr.dtype)
+ for key, param_result in param_results.items():
+     param_list = list(param_result.values())
+     try:
+         arr_dtype = np.result_type(*param_list)
+     except TypeError:
+         arr_dtype = object
+     if len(param_list) == n_candidates:
+         results[key] = MaskedArray(param_list, mask=False, dtype=arr_dtype)
+     else:
+         # Use one MaskedArray and mask all the places where the param is not
+         # applicable for that candidate (which may not contain all the params).
+         ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr_dtype)

(Scikit-learn does not really like using fixed length string dtypes "<U4", so using object here keeps the original behavior.)
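
A small sketch of the fallback described here, using a hypothetical helper `infer_param_dtype` (not a scikit-learn function) and made-up parameter values. It shows that plain `np.array` gives strings a fixed-length unicode dtype, whereas the `np.result_type` call with a `TypeError` fallback ends up at `object` for such values.

```python
import numpy as np


def infer_param_dtype(param_list):
    """Hypothetical helper mirroring the try/except in the suggestion."""
    try:
        return np.result_type(*param_list)
    except TypeError:
        # Raised e.g. for strings NumPy cannot interpret as a dtype name.
        return np.dtype(object)


numeric_dtype = infer_param_dtype([1, 2.5])
string_dtype = infer_param_dtype(["linear", "rbf"])
fixed_length = np.array(["rbf", "poly"]).dtype  # fixed-length unicode dtype

print(numeric_dtype)      # float64
print(string_dtype)       # object
print(fixed_length.kind)  # U
```

One caveat worth keeping in mind: `np.result_type` treats string arguments as dtype specifiers, so a value that happens to be a valid dtype name (such as `"f8"`) would be read as a dtype rather than as a string value.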

@MarcoGorelli
Contributor Author

Thanks both for your reviews!

That looks better, thanks Thomas, have updated

(unrelated, but pre-commit run -a changes a lot of files...FWIW I'd suggest just running pre-commit run -a in CI rather than repeating it all in linting.sh)

@MarcoGorelli MarcoGorelli requested a review from thomasjpfan March 2, 2024 12:34
Member

@thomasjpfan thomasjpfan left a comment

LGTM

@thomasjpfan thomasjpfan merged commit fd28ffd into scikit-learn:main Mar 3, 2024
@lesteve
Member

lesteve commented Mar 4, 2024

It seems like this broke the doc build on main, see build log

The error is:

Unexpected failing examples:

    ../examples/model_selection/plot_grid_search_text_feature_extraction.py failed leaving traceback:

    Traceback (most recent call last):
      File "/home/circleci/project/examples/model_selection/plot_grid_search_text_feature_extraction.py", line 161, in <module>
        cv_results = pd.DataFrame(random_search.cv_results_)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/frame.py", line 767, in __init__
        mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
        return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
        index = _extract_index(arrays)
      File "/home/circleci/mambaforge/envs/testenv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 664, in _extract_index
        raise ValueError("Per-column arrays must each be 1-dimensional")
    ValueError: Per-column arrays must each be 1-dimensional
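
The traceback does not say which parameter triggered the failure. One way an error like this can arise, assuming a parameter whose values are equal-length tuples (the values below are hypothetical), is that NumPy treats the tuples as rows and builds a 2-dimensional array instead of a 1-dimensional array of tuples:

```python
import numpy as np
from numpy.ma import MaskedArray

# Hypothetical tuple-valued parameter, one entry per candidate.
param_list = [(1, 1), (1, 2), (1, 1)]

# With dtype inference, the tuples become rows of a 2-D array.
inferred = MaskedArray(param_list, mask=False)

# Filling an object array element by element keeps one tuple per candidate.
as_object = np.empty(len(param_list), dtype=object)
for i, value in enumerate(param_list):
    as_object[i] = value
one_dim = MaskedArray(as_object, mask=False)

print(inferred.shape)  # (3, 2)
print(one_dim.shape)   # (3,)
```

A dict of columns handed to `pd.DataFrame` needs each column to be 1-dimensional, which is consistent with the message in the traceback above.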

Jacob-Stevens-Haas added a commit to dynamicslab/pysindy that referenced this pull request May 29, 2024
Arviz uses scipy.signal.gaussian, which was removed in 1.13.  Most recent
arviz uses scipy.signal.windows.gaussian

scikit-learn 1.5.0 contained a regression
(scikit-learn/scikit-learn#28352)
that has been fixed in
scikit-learn/scikit-learn#29078
Jacob-Stevens-Haas added a commit to dynamicslab/pysindy that referenced this pull request May 29, 2024
* BLD: Fix broken versions

Arviz uses scipy.signal.gaussian, which was removed in 1.13.  Most recent
arviz uses scipy.signal.windows.gaussian

scikit-learn 1.5.0 contained a regression
(scikit-learn/scikit-learn#28352)
that has been fixed in
scikit-learn/scikit-learn#29078

* BLD: Chasing errors, limit scipy/arviz in SBR

To test the notebooks, need to install SBR extras, which included a version
of arviz that isn't available on 3.9.  Earlier version works, but restricts
scipy version
twhsu-stanley pushed a commit to twhsu-stanley/pysindy that referenced this pull request Oct 29, 2024
Successfully merging this pull request may close these issues.

GridSearchCV with PCA returns object masked array
4 participants