NEP: NEP 55 revision - dedicated scalar type #28842

mtsokol
Member

@mtsokol mtsokol commented Apr 28, 2025

Issue: #28165

Hi!
As requested in the comment, this PR updates NEP 55 to describe the StringDType scalar feature.

Let's first decide on the right approach for implementing it. The initial idea is to have a scalar type that owns a UTF-8 encoded string without interacting with StringDType's allocation mechanisms. The type participates in NumPy's type hierarchy and fills the gaps in typing capabilities left by the "Python str as scalar" approach.

@mtsokol mtsokol self-assigned this Apr 28, 2025
@jorenham
Member

Will na_object also be covered by vstr?

@mtsokol
Member Author

mtsokol commented Apr 28, 2025

> Will na_object also be covered by vstr?

Do you mean support for missing data? That's a good question. The first thing that comes to my mind is "it doesn't need to"?
When accessing a single element of a StringDType array, the result is either a vstr scalar or the missing value itself. So I don't think the vstr scalar would also need to represent na_object:

In [1]: arr = np.array(["hello", "world", np.nan], dtype=np.dtypes.StringDType(na_object=np.nan))

In [2]: arr[1]
Out[2]: np.vstr('world')

In [3]: arr[2]
Out[3]: nan

And an empty scalar can be represented by an empty string:

In [1]: np.vstr()
Out[1]: np.vstr('')

@jorenham
Member

> > Will na_object also be covered by vstr?
>
> Do you mean support for missing data? That's a good question. The first thing that comes to my mind is "it doesn't need to"? When accessing a single element of a StringDType array, the result is either a vstr scalar or the missing value itself. So I don't think the vstr scalar would also need to represent na_object:
>
> In [1]: arr = np.array(["hello", "world", np.nan], dtype=np.dtypes.StringDType(na_object=np.nan))
>
> In [2]: arr[1]
> Out[2]: np.vstr('world')
>
> In [3]: arr[2]
> Out[3]: nan

Hmm, for static typing that would be kinda problematic. For example, if you have some x: npt.NDArray[np.vstr], then it's not possible to infer the type of e.g. x.ravel()[0], as it could be either an np.vstr or the na_object, whose type isn't known statically.

The datetime64 and timedelta64 scalars include their NaT types as a None value. Perhaps something like that could also be done here?

@ngoldbaum
Member

ngoldbaum commented Apr 28, 2025

I think this might need a little bit more careful thought. It might even need its own (smaller than NEP-55) NEP.

In particular, this update doesn't engage with @seberg's main criticism of your implementation PR: you're not proposing a PyUnicode_Type subtype. At least as far as I can see?

IMO this would be a lot stronger if you could make it so vstr is a subtype of both np.generic and str. I have no idea how hard that is.

If you don't think vstr needs to be a str subclass then that needs to be justified. It's also probably a breaking change if we make it so that indexing into a StringDType array produces an object that doesn't both duck type as a string and pass isinstance checks, although changing the latter is less breaking than changing the former.

> The datetime64 and timedelta64 scalars include their NaT types as a None value. Perhaps something like that could also be done here?

The problem for StringDType is there isn't a single na_object value and we want to support all of e.g. na_object=None, na_object='missing', and na_object=np.nan. So the na_object could be NoneType, a str, or np.float64, or even an arbitrary Python object.

Maybe the StringDType type annotation should be parametrizable with an na_object type? I don't know if that's painful in the Python static type system.
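
A rough sketch of what "parametrizable with an na_object type" could look like in the static type system, using only the standard library. Everything here is hypothetical: `VStr` stands in for the proposed np.vstr, and `StringArray` is a toy container, not NumPy's ndarray or StringDType and not the approach taken in any NumPy stubs.

```python
from typing import Generic, TypeVar, Union

NA = TypeVar("NA")  # type of the missing-data sentinel


class VStr(str):
    """Hypothetical stand-in for the proposed np.vstr scalar."""


class StringArray(Generic[NA]):
    """Hypothetical toy container parametrized by the na_object type."""

    def __init__(self, values: list[Union[str, NA]], na_object: NA) -> None:
        self._values = values
        self._na = na_object

    def __getitem__(self, i: int) -> Union[VStr, NA]:
        item = self._values[i]
        # Return the sentinel untouched, compared by identity.
        if item is self._na:
            return self._na
        # Anything else must be a real string; wrap it in the scalar type.
        assert isinstance(item, str)
        return VStr(item)


# A type checker can infer StringArray[None] here, so indexing
# is typed as Union[VStr, None] rather than an unknown object.
arr = StringArray(["hello", None], na_object=None)
first = arr[0]
second = arr[1]
```

The point of the type parameter is that the element type becomes a union of the scalar type and one known sentinel type, which is the information that is otherwise missing for static analysis.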

@jorenham
Member

> IMO this would be a lot stronger if you could make it so vstr is a subtype of both np.generic and str. I have no idea how hard that is.

numpy.str_ does this too, so it's at least possible
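
A quick way to check this for the existing fixed-width scalar, assuming NumPy is installed (this says nothing about how vstr itself would be implemented):

```python
import numpy as np

s = np.str_("world")

# np.str_ inherits from both the builtin str and NumPy's scalar base class.
print(np.str_.__mro__)
print(isinstance(s, str), isinstance(s, np.generic))

# str methods are available because of the str base class.
upper = s.upper()
print(upper)
```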

@jorenham
Member

jorenham commented Apr 28, 2025

> Maybe the StringDType type annotation should be parametrizable with an na_object type? I don't know if that's painful in the Python static type system.

Haha that's exactly what I've done in numpy/numtype#335, and it's on my "port-to-numpy" TODO list.


update: #28856
