feat: data and pyi files in the venv #2936

Merged
merged 49 commits into from
Jun 10, 2025

@aignas aignas commented May 27, 2025

This adds the remaining files to the venv and should get us
reasonably close to handling 99% of the cases.

The expected differences between this and a venv built by uv would be:

  • The RECORD files are excluded from the venvs for better cache hit
    rate in bazel.

Topological ordering is removed because topo ordering doesn't provide
the "closer target first" guarantees desired. For now, just use default
ordering and document conflicts as undefined behavior. Internally, it
continues to use first-wins (i.e. first in depset.to_list() order) semantics.
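
As a rough illustration of the first-wins behaviour mentioned above, here is a minimal Python sketch. It is not the Starlark code in rules_python; the function name `build_link_map`, the `(venv_path, link_to_path)` entry shape, and the example paths are all assumptions made for illustration.

```python
def build_link_map(entries):
    """Map each venv-relative path to a link target; the first entry seen wins.

    `entries` is a hypothetical list of (venv_path, link_to_path) pairs that
    stands in for the order produced by depset.to_list().
    """
    link_map = {}
    for venv_path, link_to_path in entries:
        # setdefault keeps an existing value, so later duplicates are ignored.
        link_map.setdefault(venv_path, link_to_path)
    return link_map


# Hypothetical entries: two targets both provide the "simple" path.
entries = [
    ("simple", "external/simple_v1/site-packages/simple"),
    ("simple", "external/simple_v2/site-packages/simple"),
    ("simple.libs", "external/simple_v2/site-packages/simple.libs"),
]
link_map = build_link_map(entries)
```

Because the winner depends only on iteration order, any change in how the depset is linearized changes which target is linked, which is why conflicts are documented as undefined.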

Work towards #2156

This adds the necessary `.dist-info` files into the mix
and should get us reasonably close to handling 99% of the cases.

The expected differences between this and a `venv` built by `uv` would be:
* Shared libraries are stored in `<package>.libs` in `uv` venvs. This
  can be achieved in `rules_python` by changing the `installer` settings
  in the `wheel_installer/wheel.py#unzip` function.
* The `RECORD` files are excluded from the `venv`s for better cache hit
  rate in `bazel`, however I am not sure if we should do that for actual
  wheels that are downloaded from the internet.

Tested:
- [x] Building the `//docs` and manually checking the symlinks.
- [ ] Unit tests

Work towards bazel-contrib#2156
@aignas aignas force-pushed the exp/distinfo-venv branch from ebed371 to 9dd7589 Compare May 28, 2025 14:16

aignas commented May 28, 2025

I am thinking I should also link other things passed as data for completeness.

I think excluding anything outside site-packages directory for now would be wise.


Tested:

  • Building the //docs and manually checking the symlinks.
  • Ensure that the .pyi files get included.
  • Ensure that the libs get included.
  • Tests that .dist-info gets included.
  • Tests that ensure that there are files only from one version of the package.
  • Remove the topological order usage and document the undefined behaviour.
  • Document PyInfo changes.

@aignas aignas changed the title wip: dist-info folders in the venv feat: dist-info folders in the venv May 28, 2025
@aignas aignas changed the title feat: dist-info folders in the venv feat: data files in the venv May 28, 2025

aignas commented May 30, 2025

I'll push some of the changes later, but I am having a hard time writing a test that would fulfill some of the requirements in the PR todo list.

@aignas aignas marked this pull request as ready for review May 30, 2025 09:00
@aignas aignas requested a review from rickeylev as a code owner May 30, 2025 09:00
@aignas aignas changed the title feat: data files in the venv feat: data and pyi files in the venv May 30, 2025
@rickeylev rickeylev left a comment

re: "src" field name: Let's pick a better name. "src" is an established name in bazel/starlark, so I don't think it should be overloaded.

  • package
  • dist
  • link_group
  • group_name
  • group
  • group_key1 (e.g. "simple"), group_key2 (e.g. "1.0.0")
  • dist, version (simple, 1.0)

?

The basic change in logic here is associating a set of paths with a dist-info name, right? And only one of those sets of paths will be used? The idea being, given:

("foo", <foo v1 paths>)
("foo", <foo v2 paths>)

Only one of those tuples of info is used.
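
A minimal Python sketch of that grouping idea, assuming a hypothetical `(package, version, paths)` tuple shape (the real field names were still being discussed in this review). Selecting per package rather than per path avoids mixing files from two versions of the same distribution.

```python
def select_one_per_package(groups):
    """Keep the whole path set of the first group seen for each package.

    `groups` is a hypothetical list of (package, version, paths) tuples.
    """
    chosen = {}
    for package, version, paths in groups:
        if package not in chosen:
            # Later groups for the same package are dropped as a unit.
            chosen[package] = (version, list(paths))
    return chosen


groups = [
    ("simple", "1.0.0", ["simple", "simple-1.0.0.dist-info", "simple_v1_extras"]),
    ("simple", "2.0.0", ["simple", "simple-2.0.0.dist-info", "simple.libs"]),
]
chosen = select_one_per_package(groups)
```

With per-path first-wins alone, `simple.libs` from 2.0.0 would still be linked next to the 1.0.0 files; grouping by package drops it together with the rest of that version.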

@aignas aignas requested review from rickeylev and groodt May 31, 2025 10:49
@@ -682,15 +682,30 @@ def _create_venv_symlinks(ctx, venv_dir_map):
def _build_link_map(entries):
    # dict[str kind, dict[str rel_path, str link_to_path]]
    link_map = {}

    # Here we store venv paths by package
Collaborator

The type description makes it clear they're keyed by package, so no need to restate it in prose.

Suggested change
# Here we store venv paths by package

venv_path = dist_info_dir,
))

for src in ctx.files.srcs + ctx.files.data:
Collaborator

Something seems wrong here, but I can't quite put my finger on it.

Files in srcs have to have the __init__.py logic applied to detect namespace package boundaries and auto-detect the proper paths to create.

Files in data shouldn't be part of that logic. If the file is within a directory covered by something in srcs (i.e. *.py), then there's no need to look at the data file. If the data file isn't in a directory covered by something in srcs, then a .so/.pyi/.pyc file shouldn't be treated as some sort of boundary marker.

Collaborator Author

Talking about special treatment, I don't think I understand why we need it. We could just treat everything as just files and interleave the directories? There is nothing special in my mind here.

Beforehand we had the auto-generated namespace pkg __init__.py files, but if we get rid of them everything is just files, right?

Collaborator

> treat everything as files and interleave the directories

You mean, for every file, create a venv_symlink entry? That won't scale because it doubles the number of files a binary has to materialize, when all that's really needed is e.g. one symlink to a directory.

However, because of namespace packages, we can't simply use the directories directly under site-packages/.
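
To make the trade-off concrete, here is a hedged Python sketch of one way to pick directory-level symlinks. It is not how rules_python does it: instead of the __init__.py detection discussed above, it assumes we know which distribution owns each file and links the topmost directory that a single distribution owns. `plan_symlinks` and the `files_by_dist` mapping are hypothetical names.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_symlinks(files_by_dist):
    """Return the venv paths to symlink, preferring whole directories.

    `files_by_dist` maps a hypothetical distribution name to its files,
    given relative to site-packages.
    """
    # Record which distributions contribute files under each directory.
    owners = defaultdict(set)
    for dist, files in files_by_dist.items():
        for f in files:
            for parent in PurePosixPath(f).parents:
                if str(parent) != ".":
                    owners[str(parent)].add(dist)

    links = set()
    for dist, files in files_by_dist.items():
        for f in files:
            # Try the topmost directory first, then descend towards the file.
            parents = [str(p) for p in PurePosixPath(f).parents if str(p) != "."]
            candidates = parents[::-1]
            # Fall back to linking the file itself if every directory is shared.
            target = next((d for d in candidates if owners[d] == {dist}), f)
            links.add(target)
    return sorted(links)


files_by_dist = {
    "alpha": ["nspkg/subnspkg/alpha/__init__.py"],
    "beta": ["nspkg/subnspkg/beta/__init__.py"],
    "simple": ["simple/__init__.py", "simple/data/table.csv", "single_file.py"],
}
plan = plan_symlinks(files_by_dist)
```

In this example the shared `nspkg` and `nspkg/subnspkg` directories are never linked as a whole, while `simple` needs only one link even though it contains a data-only subdirectory.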

Collaborator Author

@aignas aignas Jun 4, 2025

What if we treat everything as namespace packages? The way I understand this loop is:

  • For each file in the sources, symlink their directories.
  • If we have common things across data and srcs then we double the number of directories that we need to process, but I don't see a way to work around the case where we have a big package (e.g. airflow) that has data files under site-packages/airflow/data-files-without-python-files that would need to be symlinked in addition to other namespace packages.

I am wondering how we can do this?


aignas commented Jun 5, 2025

In my test I get the following venv_paths when the topological order deps are converted to list (if I remove the patch I have to sort the deps before merging the depsets):

simple
simple-1.0.0.dist-info
simple_v1_extras
simple
simple-2.0.0.dist-info
simple.libs
nspkg/subnspkg/alpha
nspkg/subnspkg/beta
nspkg/subnspkg/delta
nspkg/subnspkg/gamma
__init__.py
single_file.py
with_external_data.py
external_data

This means that it is depth-first topological rather than breadth-first, if that makes sense.

@aignas aignas marked this pull request as ready for review June 6, 2025 14:07
"b",
"c1",
"c2",
],
Collaborator Author

OK, so I had an idea that instead of discussing in markdown about theory and findings from print debugs, we can have a Starlark test checking the depset behaviour.

So as I say, it seems that the topological sorting is DFS.


aignas commented Jun 7, 2025

OK, PTAL, have updated docs.

@rickeylev
Collaborator

I revisited what topological ordering means. You're right, essentially. It's not breadth-depth like I originally thought. It's about ensuring that the nodes that point to a node are listed before that node in the linearized ordering. Note that "nodes" means the values in the depset, not depset objects themselves. i.e. depset(["a"], transitive=["b", "a"]) results in [b, a].

Stated another way: there's no guarantee a value will occur before (or after) another when there is no path between them. This means, in order to get the semantics we want (via depset ordering) -- that simple_v2 can be overridden by another target -- the overriding target must depend on simple_v2, which kinda defeats the purpose of overriding (I shouldn't have to depend on what I want to remove in order to remove it).
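
The "no path, no guarantee" point can be checked with a small Python sketch. It does not model Bazel's actual depset traversal; it only enumerates every ordering allowed by the constraint stated above (a node appears before the nodes it points to). The target names `app` and `my_override` are hypothetical.

```python
from itertools import permutations


def valid_topological_orders(nodes, edges):
    """Return every ordering in which each node precedes the nodes it points to."""
    orders = []
    for order in permutations(nodes):
        position = {node: i for i, node in enumerate(order)}
        if all(position[src] < position[dst] for src, dst in edges):
            orders.append(list(order))
    return orders


# "app" depends on both targets; the two targets have no path between them.
nodes = ["app", "my_override", "simple_v2"]
edges = [("app", "my_override"), ("app", "simple_v2")]
orders = valid_topological_orders(nodes, edges)
```

Both `my_override` before `simple_v2` and the reverse satisfy the constraint, so first-wins on such an ordering cannot be relied on to pick the override. Adding an edge from `my_override` to `simple_v2` would pin the order, which is exactly the dependency the comment above objects to.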

So yeah, topological ordering doesn't give the "closest target wins" behavior that I was originally thinking.

Reading https://bazel.build/rules/lib/builtins/depset, preorder might be closer to what we want (direct elements first)? Though, topological ordering could maybe reflect an install order? I'll take the thinking-out-loud to slack.


I'm -1 on trying to classify something as 3P vs 1P. There just isn't a clear definition of what is or isn't. Whether code is made available via pip_parse, or I write an equivalent target in my main repo, or I get it from a custom repository rule, or etc doesn't tell if a particular target is "mine" or "not mine".


I would like to make progress on this PR, though. I'm thinking we just drop the mention of an explicit ordering to allow overriding. Just leave it as undefined for now, i.e., if there are multiple entries for the same path, it won't give an error, but behavior is undefined. This is still experimental, after all. We can try and figure out a better conflict resolution strategy separately.


aignas commented Jun 8, 2025

Hmm, I am +1 on strictly mentioning that it is undefined then. We can leave this discussion about how to override targets separately.

I'll clean up the docs as part of this PR then.


aignas commented Jun 8, 2025

OK, PTAL, I think the majority of the stuff should be ready to merge.

@rickeylev
Collaborator

> should RECORD files be excluded

I don't think it matters? The RECORD file is just a list of files and their hashes. Bazel is already tracking all the files and their hashes. If any change, then RECORD changes, but also Bazel will see them change directly.
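
For readers unfamiliar with the file, here is a small Python sketch of how a single RECORD row is formed under the wheel format: the path, then `sha256=` followed by the URL-safe base64 digest with padding stripped, then the size in bytes. The helper name `record_line` is an assumption for illustration, and real RECORD files are written as CSV, so paths containing commas would need quoting that this sketch omits.

```python
import base64
import hashlib


def record_line(path, content):
    """Build one RECORD row for `content` (bytes) stored at `path`."""
    digest = hashlib.sha256(content).digest()
    # The wheel format uses URL-safe base64 with the trailing "=" removed.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(content)}"


# Row for an empty __init__.py file.
line = record_line("simple/__init__.py", b"")
```

Since each row is a pure function of a file's path and bytes, the RECORD content changes only when one of the files Bazel already tracks changes.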


arrdem commented Jun 10, 2025

+1, the RECORD file is stable, is part of the distribution, would be expected in a venv built with any other tool and should be preserved here.


aignas commented Jun 10, 2025

> +1, the RECORD file is stable, is part of the distribution, would be expected in a venv built with any other tool and should be preserved here.

@arrdem, I think we need to make an additional change in the whl_library_targets for that, so I'll leave this out of scope for this PR. Maybe we can do this only for whl_library instances that are created from whls and not from sdists, because I can see how there could be non-reproducibility for sdists based on some environment.

@aignas aignas enabled auto-merge June 10, 2025 08:42
@aignas aignas added this pull request to the merge queue Jun 10, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 10, 2025
@aignas aignas added this pull request to the merge queue Jun 10, 2025
Merged via the queue into bazel-contrib:main with commit 013acd9 Jun 10, 2025
3 checks passed
@aignas aignas deleted the exp/distinfo-venv branch June 11, 2025 05:35

4 participants