TYP: Add type stubs for stringdtype in np.char and np.strings #27470

Merged: 12 commits into numpy:main on Oct 14, 2024

Conversation

ngoldbaum
Member

@ngoldbaum ngoldbaum commented Sep 28, 2024

This is definitely an improvement but there are still some issues.

@jorenham pointed out (unfortunately after I pushed everything) that we can substantially simplify some of these overloads using a TypeVar.

I also am not able to make expressions like np.split(AR_S, ' ') type check properly, since from the type-checker's perspective ' ' is convertible to a unicode array, but in reality, given the way we set up the promoters for StringDType, this does return a StringDType array. I had to cheat for functions like that, where people normally pass scalars but the parameter is typed to also accept an array: in the reveal tests I pass in an AR_S where the unicode tests pass in a scalar.

There were also a couple of functions that were missing tests and I've added them.

@ngoldbaum ngoldbaum requested a review from jorenham September 28, 2024 20:28
@ngoldbaum ngoldbaum changed the title Add type stubs for stringdtype in np.char and np.strings TYP: Add type stubs for stringdtype in np.char and np.strings Sep 28, 2024
@ngoldbaum ngoldbaum added the labels "41 - Static typing", "component: numpy.strings" (String dtypes and functions), and "56 - Needs Release Note." (Needs an entry in doc/release/upcoming_changes) Sep 28, 2024
@ngoldbaum ngoldbaum added this to the 2.2.0 release milestone Sep 28, 2024
@jorenham
Member

jorenham commented Sep 28, 2024

Since StringDType has char code "T", you could use T_co instead of S_co for it.
That way, the one for bytes doesn't have to be renamed to B_co, additionally avoiding it being confused with uint8 (which has char code B).
It'll help reduce the number of changes as well :)

@jorenham
Member

jorenham commented Sep 28, 2024

I also am not able to make expressions like np.split(AR_S, ' ') type check properly

Maybe I'm missing something, but the second argument of numpy.split should be an int or an array-like of ints according to the docs 🤔

And numpy.strings.split doesn't exist (at least not in the main branch)?

@jorenham
Member

I also am not able to make expressions like np.split(AR_S, ' ') type check properly, since from the type-checker's perspective ' ' is convertible to a unicode array, but in reality, given the way we set up the promoters for StringDType, this does return a StringDType array. I had to cheat for functions like that, where people normally pass scalars but the parameter is typed to also accept an array: in the reveal tests I pass in an AR_S where the unicode tests pass in a scalar.

Ah, after applying my mental autocorrect function, I realized that you're talking about numpy.strings.strip 👍🏻 . I can kinda relate to this, as I once made the mistake of asking for a "stripper" instead of a "splitter" in a hardware store 😅.

Anyway, I don't think that the issue here has to do with the fact that str_ array-likes also accept raw strs, because there's no overlap between U_co and S_co (the T / StringDType one).

@overload
def strip(a: S_co, chars: None | S_co = ...) -> np.ndarray[_Shape, np.dtypes.StringDType]: ...

The chars parameter is typed as None | S_co, which resolves to _SupportsArray[StringDType] | None.

When you call numpy.strings.strip(AR_S, " "), a is inferred as an ndarray of StringDType, which matches the a: S_co parameter of this one overload.
So the typechecker won't look at any other overloads at this point (so your pattern matching mindset is indeed the right way to look at it).
Now the 2nd argument " " is inferred as chars: str. But str can't be assigned to None, and also not to _SupportsArray[StringDType] (because str doesn't implement __array__).
At this point, there are no overloads left, and we're left stranded in the "land of undefined behavior".

So to make a short story long: the chars in the StringDType overload should probably also accept a str.
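
The mismatch described above can be reproduced without NumPy. The following is a minimal sketch under stated assumptions: SupportsArray and FakeStringArray are hypothetical stand-ins for _SupportsArray[StringDType] and a StringDType ndarray, and the strip function here is illustrative only, not the actual stub or implementation.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsArray(Protocol):
    """Hypothetical stand-in for numpy's _SupportsArray[StringDType]."""

    def __array__(self) -> object: ...


class FakeStringArray:
    """Hypothetical stand-in for an ndarray with StringDType."""

    def __init__(self, items: list[str]) -> None:
        self.items = items

    def __array__(self) -> object:
        return self.items


# A plain str has no __array__ method, so it cannot satisfy the protocol,
# while the array stand-in can.
str_matches_protocol = isinstance(" ", SupportsArray)
array_matches_protocol = isinstance(FakeStringArray(["a"]), SupportsArray)


# Sketch of the suggested fix: let `chars` accept a plain str as well.
def strip(
    a: FakeStringArray,
    chars: None | str | FakeStringArray = None,
) -> FakeStringArray:
    if isinstance(chars, FakeStringArray):
        # One set of strip characters per element.
        return FakeStringArray(
            [s.strip(c) for s, c in zip(a.items, chars.items)]
        )
    # None or a single str applies to every element.
    return FakeStringArray([s.strip(chars) for s in a.items])
```

With `chars` widened to include str, a call such as strip(FakeStringArray(["  a "]), " ") matches the signature directly instead of falling through every overload.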

@ngoldbaum
Member Author

Argh, so it looks like the mypy tests are failing here and they didn't fail locally for me because this PR branch is based on main before #27419.

After #27419, this is no longer an error and I have no idea why:

FAILED numpy/typing/tests/test_typing.py::test_fail[strings] - AssertionError: Error mismatch at line 7
E           AssertionError: Error mismatch at line 7
E
E           Expression: np.char.equal(AR_U, AR_S)
E           Expected error: incompatible type
E           Observed error: ''

(Note that this is after fixing a logic error in test_fail which flipped the expected and observed error printouts)

@ngoldbaum
Member Author

I think I responded to everything with the latest push.

@ngoldbaum
Member Author

I'll finish this off if I can get some help with the test failures and the question I just posted above.

@jorenham
Member

jorenham commented Oct 2, 2024

I'll finish this off if I can get some help with the test failures and the question I just posted above.

The test failure seems to be a result of an import issue with StringDType in numpy._typing._array_like:

[two screenshots omitted]

So when mypy doesn't know what something is, it just treats it as if it's Any, which explains why np.char.equal(AR_U, AR_S) doesn't fail.
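
A minimal sketch of that effect, using hypothetical stand-in classes rather than the NumPy stubs: once a name's static type is Any (the fallback a checker uses for an import it cannot resolve), an argument of that type is accepted for any parameter, so an expected "incompatible type" error silently disappears.

```python
from typing import Any


class UnicodeArray:
    """Hypothetical stand-in for a str_ array."""


class StringArray:
    """Hypothetical stand-in for a StringDType array."""


# What the checker effectively sees when the import of a type is unresolved.
UnresolvedStringArray = Any


def equal(a: UnicodeArray, b: UnicodeArray) -> bool:
    # Only same-kind arrays are meant to be comparable in this sketch.
    return type(a) is type(b)


def make_unresolved() -> UnresolvedStringArray:
    return StringArray()


# A checker rejects this call, because StringArray is not a UnicodeArray:
#   equal(UnicodeArray(), StringArray())
# It accepts this one, because the second argument's static type is Any.
result = equal(UnicodeArray(), make_unresolved())
```

At runtime the mismatch is still there (the two objects are of different kinds); only the static check has been lost.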

@ngoldbaum
Member Author

OK, only remaining failure I see is unrelated.

Thanks so much for your help with this @jorenham!

Member

@jorenham jorenham left a comment

Mypy is still reporting some overload-overlap errors in defchararray.pyi, but I believe those are false positives (i.e. one of those oh-so-familiar mypy bugs), and Pyright seems to be happy with them.

There are some minor opportunities for improvement left, but as far as I'm concerned, those shouldn't be the deal-breakers for merging.
Apart from the review comment I left, I'm referring here to how most overloads could be "merged" by using a TypeVar with constraints, but it's probably cleaner to address this in a separate PR now that I think about it 🤔.
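
As an illustration of that idea (not the actual stub code), here is a minimal sketch with hypothetical stand-in classes. The names ArrayBase, UnicodeArray, StringArray, upper and upper_merged are assumptions made for this example; the real stubs deal with array-like protocols and ndarray return types.

```python
from __future__ import annotations

from typing import TypeVar, overload


class ArrayBase:
    def __init__(self, items: list[str]) -> None:
        self.items = items


class UnicodeArray(ArrayBase):
    """Hypothetical stand-in for a str_ array."""


class StringArray(ArrayBase):
    """Hypothetical stand-in for a StringDType array."""


# Before: one overload per array kind.
@overload
def upper(a: UnicodeArray) -> UnicodeArray: ...
@overload
def upper(a: StringArray) -> StringArray: ...
def upper(a):
    return type(a)([s.upper() for s in a.items])


# After: a single signature, using a TypeVar constrained to the same kinds.
ArrayT = TypeVar("ArrayT", UnicodeArray, StringArray)


def upper_merged(a: ArrayT) -> ArrayT:
    return type(a)([s.upper() for s in a.items])
```

Both versions tell a checker that the output kind follows the input kind; the constrained TypeVar states it once instead of once per kind.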

Anyway, it's very nice to see that _core/strings.pyi is completely error free (which unfortunately is a rare sight in the numpy stubs at the moment).
It's also worth noting that pyright appears to agree with mypy on all relevant type-tests (and that's far from trivial), which is a very good sign.

So thanks for your effort and your patience; typing in Python is tricky, but you seem to have a knack for it 👌🏻.


tl;dr: I'm happy to merge this if you want, so let me know if you plan on changing anything (e.g. the tags suggest you want to add some release notes?)

@charris
Member

charris commented Oct 7, 2024

Needs rebase, some of the reveal tests have been removed.

@ngoldbaum
Member Author

I'll try to get this updated soon, I just finished a short vacation today and hopefully I'll have some time to work on this later this week.

@ngoldbaum
Member Author

I think this is good to merge now.

@jorenham
Member

I think this is good to merge now.

I agree 👌🏻

@jorenham jorenham merged commit e1cc10a into numpy:main Oct 14, 2024
65 of 67 checks passed
@jorenham
Member

Thanks @ngoldbaum!

Labels
41 - Static typing; 56 - Needs Release Note. (Needs an entry in doc/release/upcoming_changes); component: numpy.strings (String dtypes and functions)
3 participants