BUG: ensure find-like ufuncs convert arguments to common dtypes #26198

Closed
wants to merge 1 commit into from

ngoldbaum
Member

This cleans up how np.strings functions handle multiple string arguments.

Currently, some functions call np.asanyarray and some don't. This makes it so all functions in this namespace that take multiple string arguments do that.

This came up while working with pandas, where I want to use np.dtypes.StringDType(coerce=False, na_object=pd.NA) as the "default" numpy string dtype. If I don't make these changes, then I have to manually coerce all the arguments to np.strings functions to the same common StringDType instance. This is particularly annoying when dealing with Python strings as arguments, since those get coerced to the default StringDType by the promoters, which then leads to an error when trying to use a ufunc with two non-equal StringDType instances.

IMO it will be easier for everyone if numpy just deals with this issue in the np.strings wrappers. You can also bypass this coercion by explicitly passing ndarray arguments due to the use of e.g. dtype=getattr(arg, "dtype", a.dtype) in all the wrappers.
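A minimal sketch of the coercion pattern described above, not the actual NumPy wrapper code. `coerce_like` is a hypothetical helper name, and fixed-width unicode arrays stand in for StringDType here so the snippet does not depend on NumPy 2.0:

```python
import numpy as np

def coerce_like(a, arg):
    """Hypothetical helper: convert ``arg`` to an array, reusing ``a``'s
    dtype unless ``arg`` already carries a dtype of its own."""
    a = np.asanyarray(a)
    # ndarray arguments keep their own dtype; Python scalars adopt a.dtype.
    return np.asanyarray(arg, dtype=getattr(arg, "dtype", a.dtype))

haystack = np.array(["hello", "world"])             # dtype '<U5'
needle = coerce_like(haystack, "wor")               # Python str adopts '<U5'
explicit = coerce_like(haystack, np.array("wor"))   # ndarray keeps '<U3'
```

Passing an ndarray leaves its dtype untouched, which is the bypass mentioned above; only arguments without a `dtype` attribute are cast to the first argument's dtype.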

There are also a couple docstring cleanups I noticed.

May require #26147 to be merged for the tests to all pass.

@ngoldbaum ngoldbaum requested a review from lysnikolaou April 2, 2024 23:13
@ngoldbaum ngoldbaum force-pushed the strings-arguments-cleanup branch from ae4e820 to 793fff5 Compare April 2, 2024 23:19
@ngoldbaum ngoldbaum changed the title from "ENH: ensure find-like ufuncs convert arguments to common dtypes" to "BUG: ensure find-like ufuncs convert arguments to common dtypes" Apr 2, 2024
@ngoldbaum ngoldbaum added the "09 - Backport-Candidate" label (PRs tagged should be backported) Apr 2, 2024
Contributor

@mhvk mhvk left a comment

This looks good in principle, but I wonder whether a better approach would be to use the promoter; it sort of feels like it was made for dealing with these kinds of dtype mismatches. That said, doing so might remove failures for mismatched dtypes that you would like to preserve. So perhaps this is the simplest way, at least for now.

@ngoldbaum ngoldbaum force-pushed the strings-arguments-cleanup branch from 793fff5 to c5520b3 Compare April 3, 2024 01:44
@ngoldbaum
Member Author

The problem with using the promoter is I'm not handed dtype instances, only the dtypemetas. That means in resolve_descriptors where we originally had a python string we get handed the default stringdtype instance.

Maybe this is a use case for resolve_descriptors_with_scalars?

@seberg
Member

seberg commented Apr 3, 2024

> That means in resolve_descriptors where we originally had a python string we get handed the default stringdtype instance.

If you just add a promoter you should be getting a new StringDType not an old one, unless you register the loop with the old one, since the promoter will force using the new-style StringDType on all arguments.

@ngoldbaum
Member Author

> If you just add a promoter you should be getting a new StringDType not an old one, unless you register the loop with the old one, since the promoter will force using the new-style StringDType on all arguments.

Sorry if I was unclear, I was talking about distinct StringDType instances where equality comparisons return false, like in the test I added in this PR.

Right now on the main branch this is an error:

In [1]: import numpy as np

In [2]: arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType(na_object=None))

In [3]: np.strings.find(arr, 'wor')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 np.strings.find(arr, 'wor')

File ~/Documents/numpy/build-install/usr/lib/python3.11/site-packages/numpy/_core/strings.py:238, in find(a, sub, start, end)
    206 """
    207 For each element, return the lowest index in the string where
    208 substring ``sub`` is found, such that ``sub`` is contained in the
   (...)
    235
    236 """
    237 end = end if end is not None else MAX
--> 238 return _find_ufunc(a, sub, start, end)

TypeError: Can only do binary operations with equal StringDType instances.

This PR avoids the error by casting the needle string to arr's dtype.

It's not clear to me how to fix this using a promoter only, because the error path that's generating the TypeError above is in resolve_descriptors, after the promoter has already run.

@seberg
Member

seberg commented Apr 3, 2024

I suppose you don't have a "cannot be a NULL object" (since that can be used with another one in a binary operation just fine)? But in that case yes, you need to either:

  • Register the loop with the legacy string (even if it ends up using your string!)
  • Use the other resolver function, since the cast is not done for you there.

Although, I think it would make sense to see if the str->string cast cannot return a StringDType instance which works here.

@mhvk
Contributor

mhvk commented Apr 3, 2024

> Although, I think it would make sense to see if the str->string cast cannot return a StringDType instance which works here

Yes, I wonder if one cannot have a (possibly internal only) StringDType that is equal to all others and takes on their na_object.

@ngoldbaum
Member Author

OK, fair enough. I'm going to come back to this with a different approach.


4 participants