ENH: introduce a notion of "compatible" stringdtype instances #26261 #26351
Backport of #26261.
This is a followup for #26198; it accomplishes the same thing I wanted in that PR.
See the change to the NEP for the details and explanation.
In short, this relaxes the error checking in stringdtype ufuncs and changes the `common_instance` logic, allowing operations between distinct stringdtype instances as long as the result isn't ambiguous. This makes it much simpler to work with non-default stringdtype instances, since users don't need to zealously convert all ufunc arguments to the same dtype before passing them to numpy in the most common cases, like passing a python string as an argument.

For all operations that take more than one string argument, we now only raise an error if the inputs have distinct `na_object` settings. We allow distinct `coerce` settings and just choose `coerce=False` for string outputs if any input dtype had `coerce=False` set.

Also added a test. There was one spot in the existing tests where we were doing equality comparisons between arrays with distinct `na_object` settings, so I updated that test to account for the behavior change.
Ideally we could get this merged in time to be included with 2.0 RC2. I'd like to have this in NumPy 2.0 because it will eliminate a lot of boilerplate argument sanitizing in pandas when it calls numpy ufuncs. I totally understand if this is coming too late in the game though.