Add pandas type-completeness blog post #2548
MarcoGorelli wants to merge 6 commits into facebook:main
| In order to improve the developer experience for pandas' users across the ecosystem, we decided to focus on improving pandas' typing. Why? Because better type hints mean: | ||
| - More accurate and useful auto-completions from VSCode / PyCharm / NeoVIM / Positron / other IDEs. | ||
| - More robust pipelines, as some categories of bugs can be caught without even needing to execute your code. |
Maybe also mention the (alleged) LLM benefits?
sure, thanks - did you have a reference in mind for this?
The closest I was able to find is https://www.se.cs.uni-saarland.de/conferences/ASE/ase2023/details/ase-2023/ase-2023-papers/12/Generative-Type-Inference-for-Python.html, but I don't think there's anything yet that tests this for modern LLMs.
There's also https://llm-guidelines.org/study-types/, which suggests that structured outputs (which, if I understand correctly, also include static typing) are indeed helpful.
thanks - as far as I can tell, that paper's about using llms to do type inference? if so, not sure if we should cite it for the alleged llm benefits of having typed code
Do we even need a citation? I mean, I'm all for being accurate, but in this case I doubt that anyone would question that static typing helps LLMs write better code, seeing as it also helps humans write better code 🤷‍♂️
But I'm assuming here that the types are correct. If they aren't, I wouldn't be surprised if LLMs performed worse than with no types at all. The same holds for humans, after all.
tbh it's not obvious to me that they would perform better, they hallucinate method names all the time and i find that they often suggest code which doesn't satisfy type-checkers even in codebases that are fully typed
i'd prefer to leave this out unless we have a reference, if that's ok
| pandas is one of the most widely used Python libraries. At time of writing, it is [downloaded about half-a-billion times per month from PyPI](https://pypistats.org/packages/pandas), is supported by nearly all Python data science packages, and is generally required learning in data science curriculums. Despite modern alternatives existing, pandas' impact cannot be minimised or understated. | ||
| In order to improve the developer experience for pandas' users across the ecosystem, we decided to focus on improving pandas' typing. Why? Because better type hints mean: |
I think we should still be more explicit here about who "we" is at the beginning. Could you add a clarification, even if it's just briefly in brackets? Something like "the team at Quansight" or "the Quansight team with support from the Pyrefly team", whatever you feel is appropriate. My main concern is that people coming to the blog on the Pyrefly website will assume "we" means just the Pyrefly team.
| ## Beyond Pyright - what about "Pyrefly report"? | ||
| Pyright's verifytypes feature takes about 2 and a half minutes to run in pandas-stubs. There's room of improvement here - so much so, that the Pyrefly team is working on a [`pyrefly report`](https://pyrefly.org/en/docs/report/) which would work similarly. The `pyrefly report` API is not yet considered stable, so for now pandas-stubs uses Pyright's `--verifytypes` command, but hopefully a faster is on the horizon! |
formatting: should it be verifytypes? or verify types?
typo: "hopefully a faster is on the horizon!" a faster tool?
> formatting: should it be verifytypes? or verify types?

`--verifytypes` is correct; `pyright --help` shows:
Usage: pyright [options] files...
Options:
[..]
--verifytypes <PACKAGE> Verify type completeness of a py.typed package
[..]
I meant the instance in the first sentence ("Pyright's verifytypes feature..."), not the `--verifytypes` one :)
| @@ -0,0 +1,76 @@ | |||
| --- | |||
| title: pandas' public API is now type-complete! | |||
| title: pandas' public API is now type-complete! | |
| title: Pandas' Public API Is Now Type-Complete! |
Please use title case for titles :)
they ask that it be used in lowercase even at the beginning of a sentence: https://pandas.pydata.org/about/citing.html#brand-and-logo
> When using the project name pandas, please use it in lower case, even at the beginning of a sentence.
if we're ok going against that in titles, then sure, will do
oh! Thanks for flagging, let's follow their citation guidelines, but the rest of the title should still be title case imho
thanks for your reviews! 🙏
javabster left a comment
LGTM 🚀
but let's wait to merge this until early next week, we already published a blog post earlier this week
| - `DataFrame` is reported as "partially unknown" because its method `.index` returns `Index`, which is partially unknown. | ||
| - `Index` is reported as "partially unknown" because its method `to_series` returns `Series`, which is partially unknown. | ||
| - `Series` is reported as "partially unknown" because its method `to_frame` returns `DataFrame`, which is partially unknown. |
This is surprising to me / doesn't make a ton of sense to me. DataFrame is unknown because DataFrame is unknown? I'd expect there to be some "Unknown" or "Any" typed attribute or similar.
i've reworked the example so it's clearer, thanks for commenting!
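For reference, a minimal toy sketch of why all three classes in the quoted bullets can end up "partially unknown". It assumes there is one genuinely unannotated member somewhere in the cycle; the `tolist` entry below is hypothetical and chosen only for illustration. This is a simplified model of how unknown-ness could propagate, not Pyright's actual algorithm.

```python
# Toy model of the public API: class name -> {member name: return type name}.
# A return type of None stands for a missing annotation (hypothetical example).
API = {
    "DataFrame": {"index": "Index"},
    "Index": {"to_series": "Series", "tolist": None},
    "Series": {"to_frame": "DataFrame"},
}


def partially_unknown(api):
    """Return the classes that are unknown directly or through a member's return type."""
    # Seed: classes with at least one unannotated member.
    unknown = {
        cls for cls, members in api.items()
        if any(ret is None for ret in members.values())
    }
    # Propagate until nothing changes: a class becomes partially unknown
    # if any member returns a class that is already partially unknown.
    changed = True
    while changed:
        changed = False
        for cls, members in api.items():
            if cls not in unknown and any(ret in unknown for ret in members.values()):
                unknown.add(cls)
                changed = True
    return unknown


# One missing annotation on Index taints the whole DataFrame/Index/Series cycle.
tainted = partially_unknown(API)

# Annotating the single offending member clears all three classes at once.
FIXED_API = {**API, "Index": {"to_series": "Series", "tolist": "list"}}
clean = partially_unknown(FIXED_API)
```

In this model the cycle itself never causes the problem; it only spreads a single root cause, which matches the expectation above that there should be some "Unknown" or "Any" typed attribute at the source.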
Summary
As discussed, following on from the hackmd document (thanks @javabster for helpful comments!)
Fixes #XXXX
Test Plan