
experimental_index_url slows down downloads by 25-50% #2849


Open
keith opened this issue May 1, 2025 · 3 comments

Comments

@keith
Member

keith commented May 1, 2025

🐞 bug report

Affected Rule

pip.parse

Is this a regression?

No

Description

Through various issues it has been recommended that I use experimental_index_url to get improved behavior from pip.parse. After enabling it in our project, I found that fetching all dependencies for the first time slowed down significantly. Here are some of the stats I pulled in our project:
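For context, this is roughly how the flag is wired up in MODULE.bazel; the hub name, Python version, and requirements path here are illustrative, not taken from the project in question:

```starlark
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pip",
    python_version = "3.11",
    requirements_lock = "//:requirements.txt",
    # Enabling this switches resolution to fetch wheel metadata from the
    # package index directly, which is what the timings below compare against.
    experimental_index_url = "https://pypi.org/simple",
)
use_repo(pip, "pip")
```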

without experimental_index_url:

bazel query 'deps(...)' > /dev/null  1.02s user 0.42s system 0% cpu 4:42.53 total
bazel query 'deps(...)' > /dev/null  1.05s user 0.42s system 0% cpu 5:31.24 total

with experimental index

bazel query 'deps(...)' > /dev/null  1.28s user 0.45s system 0% cpu 6:56.47 total
bazel query 'deps(...)' > /dev/null  1.16s user 0.51s system 0% cpu 7:15.13 total
bazel query 'deps(...)' > /dev/null  1.22s user 0.54s system 0% cpu 7:52.11 total

This seems to be consistent; I tested about 10 times before concluding this was the cause. I imagine it depends heavily on what other repo activity exists in the project. I tried adjusting the value of --http_max_parallel_downloads without any luck.

To test this I did a rm -rf ~/.cache/pip ~/.cache/bazel between runs to make sure I was starting completely clean.
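The timing procedure above can be sketched as a small script; time_fetch is an illustrative helper (not part of any tool mentioned here), and the measured command mirrors the one in the stats:

```python
import subprocess
import sys
import time

def time_fetch(cmd: list[str]) -> float:
    """Run a command once and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start

# In the issue the measured sequence was, per run:
#   rm -rf ~/.cache/pip ~/.cache/bazel     (cold start)
#   bazel query 'deps(...)' > /dev/null
# A no-op stand-in keeps this sketch runnable anywhere:
elapsed = time_fetch([sys.executable, "-c", "pass"])
```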

πŸ”¬ Minimal Reproduction

I tried reproducing this in a rules_python example, but the difference was not as severe as in our project, which makes me think it depends heavily on the other http_archives etc. that you have.

🌍 Your Environment

Operating System:

linux x86_64

Output of bazel version:

https://github.com/bazelbuild/bazel/commit/2780393d35ad0607cf5e344ae082b00a5569a964

Rules_python version:

1.4.0-rc
Anything else relevant?

https://bazelbuild.slack.com/archives/CA306CEV6/p1745365100854859

@rickeylev
Collaborator

I recall there was a change to avoid log spam when alternative indexes were available. I wonder if that is contributing? I think part of that fix was to change it to fetch an entire index, figure out what was left, and then fetch again. The behavior previously was to immediately fetch the next "step" as soon as any given package finished. Don't quote me on this exactly, though -- Ignas would remember better.

Relevant snippet from slack:

Ignas says:
I think it could be that it is fetching more wheels, because indexes may provide multiple wheels that are compatible with target platforms. This could be improved by providing at most one wheel -- the most specialized one for the given platform -- but then it makes things like musl harder to support.
So bazel query in your case will fetch the musl and manylinux variants of the wheel because you are using query instead of cquery.

This sounds vaguely familiar -- didn't we have a similar problem with our doc building using iblaze? bazel query was trying to fetch windows wheels and failing, but we don't care about windows. I forget what the fix was.

In any case, I think a reasonable functionality request is to provide some basic restriction capability for how it traverses the indexes. e.g. if you simply don't care about musl (or windows, or w/e), then it shouldn't try to follow any edges that are musl-specific, try to download any metadata for musl-specific wheels, etc.
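That restriction could amount to a filter over wheel filenames applied before any metadata is fetched. A rough sketch of the idea -- the helper and the excluded tag prefixes are hypothetical, not rules_python API:

```python
# Sketch of platform-based index filtering (illustrative, not rules_python API).
# A wheel filename ends in ...-{platform_tag}.whl; if every tag in that field
# belongs to a family the user opted out of (e.g. musl or windows), skip the
# wheel entirely instead of fetching its metadata.

UNWANTED_TAG_PREFIXES = ("musllinux", "win")  # assumed user preference

def keep_wheel(filename: str) -> bool:
    """Return True unless the wheel's platform tags are all excluded."""
    if not filename.endswith(".whl"):
        return True  # sdists etc. are left alone in this sketch
    # Strip ".whl" and take the last dash-separated field: the platform tag.
    platform_tag = filename[:-len(".whl")].rsplit("-", 1)[-1]
    # Compressed tag sets join alternatives with ".", e.g.
    # "manylinux_2_17_x86_64.manylinux2014_x86_64".
    tags = platform_tag.split(".")
    return not all(t.startswith(UNWANTED_TAG_PREFIXES) for t in tags)

wheels = [
    "MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "MarkupSafe-3.0.2-cp311-cp311-musllinux_1_2_x86_64.whl",
    "MarkupSafe-3.0.2-cp311-cp311-win_amd64.whl",
]
kept = [w for w in wheels if keep_wheel(w)]
```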

(as an aside, doesn't the uv.lock format and pylock.toml have capabilities that would prevent this problem?)

@keith
Member Author

keith commented May 2, 2025

(as an aside, doesn't the uv.lock format and pylock.toml have capabilities that would prevent this problem?)

At the very least there is no index fetching, because uv lock encodes all possible URLs in the lockfile:

[[package]]
name = "markupsafe"
version = "3.0.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/b2/97/5d42485e71dfc078108a86d6de8fa46db44a1a9295e89c5d6d4a06e23a62/markupsafe-3.0.2.tar.gz", hash = "sha256:ee55d3edf80167e48ea11a923c7386f4669df67d7994554387f84e7d8b0a2bf0", size = 20537 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/6b/28/bbf83e3f76936960b850435576dd5e67034e200469571be53f69174a2dfd/MarkupSafe-3.0.2-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:9025b4018f3a1314059769c7bf15441064b2207cb3f065e6ea1e7359cb46db9d", size = 14353 },
    { url = "https://files.pythonhosted.org/packages/6c/30/316d194b093cde57d448a4c3209f22e3046c5bb2fb0820b118292b334be7/MarkupSafe-3.0.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:93335ca3812df2f366e80509ae119189886b0f3c2b81325d39efdb84a1e2ae93", size = 12392 },

@aignas
Collaborator

aignas commented May 2, 2025

Could you please provide the following numbers for each of the cases:

  • How many packages do you have? bazel query "@pip//..." | rg :dist_info | wc -l
  • How many wheels do you have? bazel query "kind(py_library, deps(@pip//...))" | rg :pkg | wc -l
  • If you first do bazel build @pip//:BUILD.bazel, do the numbers change? This should remove the step of querying the SimpleAPI from the equation.
  • What happens if you do cquery instead of query? How do the numbers differ then?
