feat: data and pyi files in the venv #2936
This adds the necessary `.dist-info` files into the mix and should get us reasonably close to handling 99% of the cases.

The expected differences between this and a `venv` built by `uv` would be:
* Shared libraries are stored in `<package>.libs` in `uv` venvs. This can be achieved in `rules_python` by changing the `installer` settings in the `wheel_installer/wheel.py#unzip` function.
* The `RECORD` files are excluded from the `venv`s for better cache hit rate in `bazel`; however, I am not sure if we should do that for actual wheels that are downloaded from the internet.

Tested:
- [x] Building the `//docs` and manually checking the symlinks.
- [ ] Unit tests

Work towards bazel-contrib#2156
I am thinking I should also link other things passed as I think excluding anything outside Tested:
I'll push some of the changes later, but I have a hard time writing a test that would fulfill some of the requirements in the PR todo list.
re: "src" field name: Let's pick a better name. "src" is an established name in bazel/starlark, so I don't think it should be overloaded.
- package
- dist
- link_group
- group_name
- group
- group_key1 (e.g. "simple"), group_key2 (e.g. "1.0.0")
- dist, version (simple, 1.0)
?
The basic change in logic here is associating a set of paths with a dist-info name, right? And only one of those sets of paths will be used? The idea being, given:
("foo", <foo v1 paths>)
("foo", <foo v2 paths>)
Only one of those tuples of info is used.
tests/modules/other/simple_v2/site-packages/simple.libs/data.txt
…xternal data inclusion.
python/private/py_executable.bzl
@@ -682,15 +682,30 @@ def _create_venv_symlinks(ctx, venv_dir_map):
def _build_link_map(entries):
    # dict[str kind, dict[str rel_path, str link_to_path]]
    link_map = {}

    # Here we store venv paths by package
The type description makes it clear they're keyed by package, so no need to restate it in prose.
python/private/py_library.bzl
venv_path = dist_info_dir,
))

for src in ctx.files.srcs + ctx.files.data:
Something seems wrong here, but I can't quite put my finger on it.
Files in `srcs` have to have the `__init__.py` logic applied to detect namespace package boundaries and auto-detect the proper paths to create.
Files in `data` shouldn't be part of that logic. If the file is within a directory covered by something in `srcs` (i.e. `*.py`), then there's no need to look at the data file. If the data file isn't in a directory covered by something in `srcs`, then a .so/.pyi/.pyc file shouldn't be treated as some sort of boundary marker.
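To make the `data` rule concrete, here is a minimal sketch in plain Python (not the Starlark in `py_library.bzl`). The function name `uncovered_data_files` and the example paths are hypothetical; it only shows the idea that a data file already under a directory linked on behalf of `srcs` needs no further handling.

```python
import posixpath

def uncovered_data_files(linked_dirs, data_files):
    """Return the data files that are NOT under a directory linked for srcs.

    linked_dirs: set of site-packages-relative directories (hypothetical input).
    data_files: list of site-packages-relative file paths.
    """
    def is_covered(path):
        # Walk up the parent directories until one is already linked.
        parent = posixpath.dirname(path)
        while parent and parent != "/":
            if parent in linked_dirs:
                return True
            parent = posixpath.dirname(parent)
        return False

    return [f for f in data_files if not is_covered(f)]

remaining = uncovered_data_files(
    {"airflow"},
    ["airflow/templates/index.html", "airflow_data/config.yaml"],
)
```

Only `airflow_data/config.yaml` remains in this example, and it would be linked as a plain file or directory without acting as a package boundary marker.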
Talking about special treatment, I don't think I understand why we need it. We could just treat everything as just files and interleave the directories? There is nothing special in my mind here.
Beforehand we had the auto-generated namespace pkg `__init__.py` files, but if we get rid of them everything is just files, right?
> treat everything as files and interleave the directories

You mean, for every file, create a venv_symlink entry? That won't scale because it doubles the number of files a binary has to materialize, when all that's really needed is e.g. one symlink to a directory.
However, because of namespace packages, we can't simply use the directories directly under `site-packages/`.
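As a rough illustration of the trade-off, the sketch below (plain Python, not the actual `rules_python` logic) picks the shallowest directories that contain an `__init__.py` and emits one link per such directory. The function name `directory_links` and the example layout are assumptions made for illustration only.

```python
import posixpath

def directory_links(files):
    """Pick one link per regular package, leaving namespace dirs unlinked.

    files: site-packages-relative paths. A directory counts as a regular
    package only if it directly contains an __init__.py.
    """
    package_dirs = {
        posixpath.dirname(f)
        for f in files
        if posixpath.basename(f) == "__init__.py"
    }
    links = []
    # Visit shallow directories first so that subpackages are skipped
    # once their parent package has been linked.
    for d in sorted(package_dirs, key=lambda p: (p.count("/"), p)):
        if not d:
            continue  # an __init__.py directly in site-packages; ignore
        if not any(d == l or d.startswith(l + "/") for l in links):
            links.append(d)
    return links

files = [
    "nspkg/alpha/__init__.py",
    "nspkg/alpha/mod.py",
    "nspkg/alpha/sub/__init__.py",
    "nspkg/beta/__init__.py",
    "simple/__init__.py",
    "simple/data.txt",
]
links = directory_links(files)
```

Here six files collapse into three directory links (`simple`, `nspkg/alpha`, `nspkg/beta`). The top-level `nspkg` directory is not linked because it has no `__init__.py`, so another distribution could still contribute to it.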
What if we treat everything as namespace packages? The way I understand this loop is:
- For each file in the sources, symlink their directories.
- If we have common things across `data` and `srcs` then we double the number of directories that we need to process, but I don't see a way to work around the case where we have a big package (e.g. airflow) that has data files under `site-packages/airflow/data-files-without-python-files` that would need to be symlinked in addition to other namespace packages.
I am wondering how we can do this?
In my test I get the following
This means that it is depth-first topological rather than breadth-first, if that makes sense.
"b",
"c1",
"c2",
],
OK, so I had an idea that instead of discussing in markdown about theory and findings from `print` debugs, we can have a Starlark test checking the `depset` behaviour.
So as I say, it seems that the topological sorting is DFS.
OK, PTAL, have updated docs.
I revisited what topological ordering means. You're right, essentially. It's not about breadth vs. depth like I originally thought. It's about ensuring that the nodes that point to a node are listed before that node in the linearized ordering. Note that "nodes" means the values in the depset, not depset objects themselves. Stated another way: there's no guarantee a value will occur before (or after) another when there is no path between them.
This means, in order to get the semantics we want (via depset ordering) -- that simple_v2 can be overridden by another target -- the overriding target must depend on simple_v2, which kinda defeats the purpose of overriding (I shouldn't have to depend on what I want to remove in order to remove it).
So yeah, topological ordering doesn't give the "closest target wins" behavior that I was originally thinking. Reading https://bazel.build/rules/lib/builtins/depset, preorder might be closer to what we want (direct elements first)? Though, topological ordering could maybe reflect an install order? I'll take the thinking-out-loud to slack.
I'm -1 on trying to classify something as 3P vs 1P. There just isn't a clear definition of what is or isn't. Whether code is made available via pip_parse, or I write an equivalent target in my main repo, or I get it from a custom repository rule, etc. doesn't tell if a particular target is "mine" or "not mine".
I would like to make progress on this PR, though. I'm thinking we just drop the mention of an explicit ordering to allow overriding. Just leave it as undefined for now, i.e., if there are multiple entries for the same path, it won't give an error, but behavior is undefined. This is still experimental, after all. We can try and figure out a better conflict resolution strategy separately.
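The "no path, no guarantee" point can be checked with a small plain-Python sketch. It does not model Bazel's depset implementation; it only tests whether a given linearization satisfies the topological property described above. The node names `app` and `override` are hypothetical; `simple_v2` is the target from this PR's tests.

```python
def is_topological(order, edges):
    """True if every node appears before the nodes it points to.

    edges: (src, dst) pairs meaning "src depends on dst".
    """
    position = {node: i for i, node in enumerate(order)}
    return all(position[src] < position[dst] for src, dst in edges)

# "app" depends on both; there is no path between override and simple_v2.
edges = [("app", "override"), ("app", "simple_v2")]
both_valid = (
    is_topological(["app", "override", "simple_v2"], edges),
    is_topological(["app", "simple_v2", "override"], edges),
)

# Only once override depends on simple_v2 is their relative order fixed.
edges_with_dep = edges + [("override", "simple_v2")]
forced = (
    is_topological(["app", "override", "simple_v2"], edges_with_dep),
    is_topological(["app", "simple_v2", "override"], edges_with_dep),
)
```

Both linearizations are valid for the first graph, so "first wins" could pick either target. Adding the `override` to `simple_v2` edge leaves only one valid order, which is exactly the dependency the overriding target should not need.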
Hmm, I am +1 on strictly mentioning that it is undefined then. We can leave this discussion about how to override targets separately. I'll cleanup the docs as part of this PR then.
OK, PTAL, I think the majority of the stuff should be ready to merge.
I don't think it matters? The RECORD file is just a list of files and their hashes. Bazel is already tracking all the files and their hashes. If any change, then RECORD changes, but also Bazel will see them change directly.
+1, the
@arrdem, I think we need to make an additional change in the
This adds the remaining files into the venv and should get us reasonably close to handling 99% of the cases.

The expected differences between this and a `venv` built by `uv` would be:
* `RECORD` files are excluded from the `venv`s for better cache hit rate in `bazel`.

Topological ordering is removed because topo ordering doesn't provide the "closer target first" guarantees desired. For now, just use default ordering and document conflicts as undefined behavior. Internally, it continues to use first-wins (i.e. first in `depset.to_list()` order) semantics.

Work towards #2156
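
For readers unfamiliar with the first-wins behavior, the following plain-Python sketch mirrors the `dict[str kind, dict[str rel_path, str link_to_path]]` shape mentioned in the review of `_build_link_map`. It is an illustration, not the Starlark implementation: the tuple layout of `entries`, the `"lib"` kind value, and the example paths are assumptions.

```python
def build_link_map(entries):
    """Build {kind: {rel_path: link_to_path}}, keeping the first entry seen.

    entries: iterable of (kind, rel_path, link_to_path) tuples, assumed to
    already be in depset.to_list() order.
    """
    link_map = {}
    for kind, rel_path, link_to_path in entries:
        by_path = link_map.setdefault(kind, {})
        # Later entries for the same path are silently ignored (no error).
        if rel_path not in by_path:
            by_path[rel_path] = link_to_path
    return link_map

link_map = build_link_map([
    ("lib", "simple", "simple_v1/site-packages/simple"),
    ("lib", "simple", "simple_v2/site-packages/simple"),
    ("lib", "simple.libs", "simple_v2/site-packages/simple.libs"),
])
```

In this example `simple` resolves to the `simple_v1` path because it came first. Since the order of the input is documented as undefined, callers should not rely on which entry wins.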