Support python -Xcpu_count=<n> feature for container environment. #109595
I would prefer to first fix os.cpu_count() to take sched_getaffinity() into account. But we should provide both values: the count taking affinity into account, and the "total" number of CPUs. It might be tricky to change the default :-(
We need to understand the actual user's use case. AFAIK, k8s users or container users never use something like I think that's why the JDK still provides
os.cpu_count() is used in use cases other than containers, where CPU affinity is used. If we modify cpu_count() to take affinity into account and/or if your -X option is implemented, we should add an option to get the "total number of CPUs".
I have used taskset in Kubernetes. It is useful to pin processes to specific CPU cores without using something more advanced/modern like the CPU Manager feature. One of the things I used it for was testing the behavior of a JVM when passing

There's also the scenario where you have a container running multiple processes and you don't want one process to use all available CPUs. In this scenario, in the absence of [something like] taskset, you need something like
I created issue #109649: os.cpu_count(): add "affinity" parameter to get the number of "usable" CPUs. UPDATE: I renamed the usable parameter to affinity. |
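The affinity-based count discussed in #109649 can already be approximated today. A minimal sketch, assuming Linux (where os.sched_getaffinity() is available) with a fallback elsewhere:

```python
import os

def total_cpu_count():
    # Total number of logical CPUs the OS knows about.
    # os.cpu_count() may return None on exotic platforms, hence the fallback.
    return os.cpu_count() or 1

def usable_cpu_count():
    # CPUs this process is allowed to run on.  On Linux the affinity
    # mask can be narrower than the machine total, e.g. under taskset
    # or a container runtime's cpuset; elsewhere fall back to the total.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return total_cpu_count()
```

On an unrestricted machine both functions agree; under `taskset -c 0,1` the second one would report 2.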
Yeah, this is exactly what I wanted to say :) Thank you for emphasizing it!
I don't support cgroups directly for the following reasons.
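To make the avoided complexity concrete: under cgroup v2, a CPU quota shows up in a single `cpu.max` interface file containing `"<quota> <period>"` (microseconds) or `"max <period>"` when unlimited. A parsing sketch, assuming the cgroup v2 file format only (cgroup v1 spreads the same data across several files):

```python
import math

def cpus_from_cpu_max(text):
    # Parse a cgroup v2 cpu.max line, e.g. "150000 100000" (quota per
    # period) or "max 100000" (no limit).  Returns None when the CPU
    # quota is unlimited; otherwise rounds up with a minimum of 1.
    quota, period = text.split()
    if quota == "max":
        return None
    return max(1, math.ceil(int(quota) / int(period)))
```

A real implementation would also have to locate the right cgroup for the process and handle v1 vs v2, which is the fallback complexity mentioned in this thread.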
Is there an appetite for allowing non-integer / special values to
Obviously we would need to bikeshed the names and behavior a bit. But I see this as a potentially elegant solution that not only allows forceful count overrides but also allows dynamic derivation using well-defined semantics. It also gives CPython more flexibility to introduce new variants/behavior without breaking backwards compatibility. It seemingly placates all parties.
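One way such special values could behave; the spellings "affinity" and "default" here are illustrative assumptions, not a settled CPython syntax:

```python
import os

def interpret_cpu_count_option(value):
    # Hypothetical semantics for special -X cpu_count values:
    #   "<n>"      -> forced integer override
    #   "affinity" -> derive from the scheduler affinity mask
    #   "default"  -> no override (Python 3.12 behavior)
    if value == "default":
        return os.cpu_count() or 1
    if value == "affinity":
        if hasattr(os, "sched_getaffinity"):
            return len(os.sched_getaffinity(0))
        return os.cpu_count() or 1
    n = int(value)  # raises ValueError for unknown strings
    if n < 1:
        raise ValueError("cpu_count override must be >= 1")
    return n
```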
As the author of PR #109649, I'm interested in this mode 😁
It seems like there is no good default behavior fitting all use cases. Giving more choices should please more people.
Since you added the PYTHONCPUCOUNT env var, its effect is wider than a single process: child processes are affected as well. Can you please add a -X cpu_count value to ignore PYTHONCPUCOUNT: in short, to get the Python 3.12 behavior (total number of CPUs) when that is really what I want? Maybe it can just be:

See my example for a concrete use case: #109652 (comment)

In the UTF-8 Mode, I did something similar: PYTHONUTF8=1 python -X utf8=0 ignores the PYTHONUTF8 env var and ensures that the UTF-8 Mode is not enabled for this command.
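The precedence described here (command line beats environment variable, with some value restoring the 3.12 default) can be sketched as a pure function; the `'default'` spelling is a hypothetical placeholder, and the function takes the option dicts as parameters so it is easy to test:

```python
def resolve_cpu_count_override(xoptions, environ):
    # Command-line -X cpu_count takes priority over PYTHONCPUCOUNT;
    # a hypothetical "default" value cancels the env var and means
    # "no override" (Python 3.12 behavior).
    xopt = xoptions.get('cpu_count')
    if xopt is not None:
        if xopt == 'default':
            return None        # ignore PYTHONCPUCOUNT entirely
        return int(xopt)
    env = environ.get('PYTHONCPUCOUNT')
    if env:
        return int(env)
    return None                # no override requested
```

For example, `resolve_cpu_count_override({'cpu_count': 'default'}, {'PYTHONCPUCOUNT': '2'})` returns None, mirroring how `python -X utf8=0` cancels `PYTHONUTF8=1`.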
It may solve my main worry about parsing cgroups: it avoids hardcoding the AWS 1024 constant, which is not a standard but an arbitrary value (no?).

By the way, how do we round the number of CPUs, knowing that cpu_count() returns an int? I suppose that it should be rounded up (towards infinity). For example, 500/1024 should return 1, but should 1200/1024 return 1 or 2? Maybe round to the nearest but always return a minimum of 1? That's called ROUND_HALF_EVEN.
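The round-up option can be sketched in a few lines, assuming the conventional (not kernel-mandated) 1024 "one CPU" weight for cgroup v1 cpu.shares. Note that the round-to-nearest alternative would map 1200/1024 to 1 instead of 2:

```python
import math

def cpus_from_shares(shares, per_cpu=1024):
    # Round up (towards infinity) so small shares still yield 1 CPU:
    # 500/1024 -> 1, 1200/1024 -> 2, 2048/1024 -> 2.
    return max(1, math.ceil(shares / per_cpu))
```

The `max(1, ...)` guard enforces the "minimum of 1" behavior regardless of the rounding mode chosen.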
Well, Donghee's initial feature request (provide an integer; ignore affinity, cgroups and anything else) also makes sense. Sometimes, the sysadmin wants full control over what's going on. I don't think that having different choices is worse: it gives users the choice to select what best fits their needs.
That's a nice suggestion, see: 89d8bb2
Use cases

This issue is complicated since there are multiple use cases. Let me try to list some of them.
I think that use case (A) should be elaborated:
CPUs

A CPU core is a physical core, but these days it's more convenient to count "logical CPUs" because Hyper-Threading is really close to 2x faster when you run 2 threads per CPU core. When you consider a virtual machine, we don't talk about physical CPUs anymore, but "virtual CPUs" aka "vCPUs". It's also possible to add or remove CPUs at runtime. On Linux, a CPU can be "on" or "off".

Limit CPUs

A system administrator has different ways to limit CPU usage:
Number of CPUs

Ok, now to come back to the number of CPUs, there are:
@corona10 wrote that Java did its best to read cgroups but they decided to give up on it, and instead provide a command line option so sysadmins can tune the Java service using their knowledge of the machine.

Read-only container

For use cases (C) and (D), we should discuss Python versions. There are two cases:
Let's take the example of a read-only container image.
For Python <= 3.12, it's possible to implement the feature without modifying Python, just by injecting code like the following at startup:

```python
import os, sys

def parse_cmdline():
    env_ncpu = None
    cmdline_ncpu = None

    # Parse the PYTHONCPUCOUNT environment variable
    env_opt = os.environ.get('PYTHONCPUCOUNT', None)
    if env_opt:
        try:
            ncpu = int(env_opt)
        except ValueError:
            print(f"WARNING: invalid PYTHONCPUCOUNT value: {env_opt!r}")
        else:
            env_ncpu = ncpu

    # Parse the -X cpu_count command line option
    if 'cpu_count' in sys._xoptions:
        xopt = sys._xoptions['cpu_count']
        try:
            ncpu = int(xopt)
        except ValueError:
            print(f"WARNING: invalid -X cpu_count value: {xopt!r}")
        else:
            cmdline_ncpu = ncpu

    # The command line option takes priority over the env var
    ncpu = env_ncpu
    if cmdline_ncpu:
        ncpu = cmdline_ncpu

    if ncpu:
        # Override os.cpu_count()
        def cpu_count():
            return ncpu
        cpu_count.__doc__ = os.cpu_count.__doc__
        os.cpu_count = cpu_count

parse_cmdline()
```

Example:
I propose to:
Later, we would consider:
Technical nit, and the reason why we shouldn't try to solve the larger "what is a core" problem within the stdlib itself:
This is not accurate. Hyperthreading is rarely, if ever, that meaningful for many workloads.
In my sample set: AMD Zens can have meaningful HT at the moment, and eke out a <~25% gain by doing so. Modern Intels with HT are no different. A workload with a wide mix of very different instruction usage, such as integer-heavy code (ex: our interpreter) scheduled physically alongside float/AVX-heavy code, might be able to get a little more HT parallelism, but that is not the common case, and arranging for that is non-trivial (the OS generally won't detect and arrange this for you).

All that HT-specific deep dive aside, what a "CPU" core is is changing. big.LITTLE designs are becoming mainstream, not just for mobile devices anymore. For example, Intel 12th gen and later can have P cores and E cores. So that 10-physical-core laptop chip may have 12-14 threads, with some threads being 50-100% higher performance than others. Mac M-series have a similar mix of performance and efficiency cores. To the OS, each one of those presents as a "core". But the total compute throughput from each varies a lot, as does the range of compute latencies possible on each.

(end technical detail comment, leaving my thoughts on what we should do for a followup)
I like these proposals; they are along the lines of what we can actually accomplish and provide clear, concrete results. We're exposing raw

Answering deeper questions, such as things pertaining to efficiency vs performance, throughput, latency, Linux cgroup shares, other container configs, CPU hotplug, or what VM cores actually are vs what an underlying hypervisor may be configured to actually schedule "core" processes on, is IMNSHO better left to continually evolving PyPI libraries, because those are pretty fluid concepts. For example, for practical reasons, software may choose to adopt higher-level concept libraries such as https://pypi.org/project/threadpoolctl/ which @mdboom mentioned on Discord. Application processes and libraries are often not isolated, and work best if they coordinate which fractions of the available compute resources they each use with the other cross-language libraries and processes all being used at once for their common goal. A raw number of available logical cores alone can't accomplish that.

I like most of your "Later, we would consider" proposals (as followup work), except for the cgroups one: per the above, I view that as too ill-defined a concept for us to dictate any mapping of it to "cores" as a stdlib API.
I have mixed feelings about reading cgroups in the Python stdlib. What I care about the most here is to make sure that, with the proposed design, tomorrow we can still change our mind and read cgroups.

I'm saying that because my first proposition was to add an affinity parameter to os.cpu_count(). It's simple. It's easy. Why not? Well, if tomorrow we read cgroups, what does it mean? We have to add a second cgroups parameter? Now what? The correct code would look like:

```python
kwargs = {}
if sys.version_info >= (3, 13):
    kwargs['affinity'] = True
if sys.version_info >= (3, 14):
    kwargs['cgroups'] = True
cpu_count = os.cpu_count(**kwargs)
```

Hu! That's not convenient. I prefer a separate function which has no parameter:
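The separate-function direction eventually landed as os.process_cpu_count() in Python 3.13 (see the follow-up issue below in this thread). A version-tolerant sketch for code that must also run on older interpreters:

```python
import os

def process_cpus():
    # Prefer the dedicated 3.13+ API; fall back to the Linux affinity
    # mask, then to the machine total (os.cpu_count() may return None).
    if hasattr(os, "process_cpu_count"):
        return os.process_cpu_count()
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1
```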
Yeah, I like the proposal in #109595 (comment); it will make the situation better for Python users in exceptional environments :)
FYI, I am preparing the -Xcpu_count=process option in a separate PR.
Would you mind creating a separate PR for -Xcpu_count=process?
Follow-up issue: #110649: Add -Xcpu_count=process cmdline mode to redirect os.cpu_count as os.process_cpu_count. |
Did the
It was added to Python 3.13 (you can test beta1) and it's documented at: https://docs.python.org/dev/using/cmdline.html#envvar-PYTHON_CPU_COUNT |
Feature or enhancement
As in #80235, there are requests for isolating the CPU count in k8s or container environments, and this is a very important feature these days. (In practice, at my company, a lot of workloads run under container environments, and controlling the CPU count is very important to resolve noisy neighbor issues.)
There were a lot of discussions, and following the cgroup spec introduces a lot of complexity and performance issues (due to fallback).
JDK 21 chose not to depend on CPU Shares to compute the active processor count; instead they chose to use -XX:ActiveProcessorCount=<n>. See: https://bugs.openjdk.org/browse/JDK-8281571
I think that this strategy will be worth using from the CPython side too.
So if the user executes Python with the -Xcpu_count=3 option, os.cpu_count will return 3, instead of the actual CPU count calculated by os.cpu_count.

cc @vstinner @indygreg
Linked PRs