RFC: How should we control/expose the number of threads for our OpenMP-based parallel Cython code? #14265
My preference is (2). I opened #14196 in that direction.
Thanks for opening the issue. I have a preference towards (4). I don't like (2) because … That being said, I wouldn't vote -1.
I'm leaning towards 3 or 4. I'm not sure which one's better, but as @NicolasHug says, I too would rather have it different from n_jobs.
> I too would rather have it different from n_jobs.

I think that naming OpenMP parallelism differently from Python-level parallelism is a very technical nuance that will be lost on most of our users. We need to be able to give a simple message in terms of how to control parallelism.

With regards to the default, I would also love to have the right default that prevents oversubscription. Do we have the technical tools needed for that? I know that Jeremie and Olivier have been struggling with oversubscription.
I have reviewed what is done in some other ML libraries that use OpenMP. I don't think there is a perfect solution here, but I would also be more in favor of (2).
I don't think that should be documented. It's an implementation detail that can change at any moment; most people only care that increasing n_jobs makes their code faster (and are disappointed when it doesn't).
To be more precise, I think we should have a documentation section / glossary discussing the different possible levels of parallelism, but it shouldn't be specified for each estimator.
I know they can be changed and we don't give users a guarantee that they'll stay the same, but in terms of distributing jobs on a cluster, or sometimes using dask's backend for instance, it's much easier for users and developers to know which parts run on OpenMP threads and which parts on processes.
We already have BLAS parallelism happening a bit everywhere, so when one adds OpenMP and processes on top, in such complex systems the easiest approach is just to benchmark and see what works best in terms of number of threads / processes. Say we add OpenMP somewhere: whether that is noteworthy may depend on the amount of linalg operations happening beforehand in BLAS (and that ratio may also scale with some unknown power of the data size).
The idea is to call … It works fine for all sklearn use cases :)
I would like to understand the oversubscription more. @jeremiedbb why does it work fine and what does that mean?
Grid search was not a good example because joblib should already handle this case. It does so by setting the number of OpenMP threads to 1.
@jeremiedbb Why is that not a good example? And if the user sets …?
Assuming that we can ensure good protection against oversubscription in joblib (which I believe will soon be the case), I would be in favor of setting the defaults to let our OpenMP loops use the current maximum number of threads (as OpenBLAS and MKL do for numpy / scipy).

+1 for letting the user pass an explicit number of threads using the … parameter.

For process-based parallelism (e.g. with the loky backend of joblib) I think keeping the default … makes sense.

For joblib with the threading backend I would be in favor of keeping the current behavior but re-exploring this choice later if necessary.
For reference, the over-subscription protection in joblib will be tracked by joblib/joblib#880.
My pocket sent this before I finished :D It was not a good example for threadpoolctl, but it is for joblib. Although we found that it doesn't work in previous joblib versions, it should be fixed in the next release. Threadpoolctl would be used when doing nested OpenMP/BLAS parallelism: it allows preventing oversubscription by limiting the number of threads for BLAS. This situation only happens for KMeans right now. Overall, we prevent oversubscription by disabling parallelism for anything that is not top-level parallelism. In summary, any default is fine for n_jobs since it would be forced to 1 when already in a parallel call.
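To make that last point concrete, here is a toy sketch (not scikit-learn code, just an illustration of joblib's nesting behavior): an inner Parallel call issued from inside a worker is demoted so that parallelism does not multiply across levels.

```python
from joblib import Parallel, delayed

def inner(x):
    # This nested Parallel call runs inside a worker; joblib detects the
    # nesting and demotes it to a thread-based or sequential backend, so
    # we never end up with n_jobs * n_jobs processes.
    return Parallel(n_jobs=2)(delayed(lambda v: v * v)(i) for i in range(x))

# Outer, process-based parallelism: only this level actually fans out.
results = Parallel(n_jobs=2)(delayed(inner)(3) for _ in range(4))
print(results)  # [[0, 1, 4], [0, 1, 4], [0, 1, 4], [0, 1, 4]]
```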
Just so I get this straight: if I'm running a grid search over HistGradientBoosting or something else that uses OpenMP, we would probably want it to default to using all available cores. Now the user sets … Separate scenario: …
That's right. In that case, they should set n_jobs=-1 for the grid search to benefit from all cores. In many situations (personal observation) it's better to enable parallelism in the outermost loop. Note that this is the case with BLAS: if you set n_jobs=2 for an estimator that joblib parallelizes, BLAS will be limited to using only 1 core. I think that behavior is OK if we document it properly.
OK, but if they go from n_jobs=1 to n_jobs=2, they will get worse results, which may be a bit unexpected.
Isn't that set via …? Note that slight oversubscription may not be bad (at least the last time I looked into it in HPC). E.g. if you have 4 CPU cores, 2 processes with 4 threads each is not necessarily worse than 2 processes with 2 threads each if the task uses heterogeneous compute units; the problem is really a 40-core CPU with 40² threads.
So basically, it's about priority between …
I agree. That's a drawback of having the default None mean -1. But I think it's also a matter of documentation, because if I know my estimator already uses all cores, why would I increase the number of workers for my grid search? Or maybe I'd like to disable parallelism of my estimator and use all cores for the grid search (which is your next point).
Yes.
I guess it's possible.
Indeed, but it's not good either (at least I've never experienced getting better performance with oversubscription).
In the implementation I propose, n_jobs/num_threads has priority over environment variables. Only the default (None) would be influenced by the environment variables. If you have a grid search with n_jobs=2 and you set n_jobs/n_threads=(n_cores // 2), you'll get full saturation of your cores. I'll add that having good defaults is important, but it shouldn't prevent users from thinking about what those defaults are, why they are good, and when they are good.
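A minimal sketch of that priority order (the helper name `_effective_n_threads` is hypothetical, not an actual scikit-learn function): an explicit parameter wins, None defers to the environment, and failing that we fall back to all cores.

```python
import os

def _effective_n_threads(n_threads=None):
    # Hypothetical resolution logic for the proposal above:
    # explicit parameter > environment variable > all cores.
    if n_threads is not None:
        return n_threads  # an explicit value always wins
    env = os.environ.get("OMP_NUM_THREADS")
    if env is not None:
        return int(env)  # None defers to the environment
    return os.cpu_count()  # default: use all available cores

print(_effective_n_threads(4))     # 4, regardless of OMP_NUM_THREADS
print(_effective_n_threads(None))  # OMP_NUM_THREADS if set, else cpu_count()
```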
The case of multiprocessing where each process starts some number of threads makes me think of hybrid MPI/OpenMP programming in HPC (cf. e.g. this presentation). The analogy is maybe only partial, as we don't use MPI nor run on multiple nodes, but the data serialization cost when starting new processes does (for now) have a somewhat similar effect to communication cost in MPI. So my point is that it might be useful to be able to control both the number of parallel processes and the number of nested threads separately, but to control the latter it's preferable to have some global mechanism such as …
> That's a drawback of having the default None mean -1.

I don't think that "None" should mean "-1" or "1". It should mean "best guess to be efficient given the global constraints", and the semantics of that can evolve and depend on the context. What I am saying is that when users specify "None", we should, as time goes, add dynamic scheduling logic to be more efficient, and we should tell the user explicitly that the implementation details of the job scheduling will evolve.

> Note that slight oversubscription may not be bad (at least the last time I looked into it in HPC). E.g. if you have 4 CPU cores, 2 processes with 4 threads each is not necessarily worse than 2 processes with 2 threads each if the task uses heterogeneous compute units; the problem is really a 40-core CPU with 40² threads.

Yes, it matches my experience, as long as we don't blow the memory.
I was not aware of that! That is... an interesting side effect. Was that always the case? Is that documented anywhere?
It was introduced in joblib 0.12, see the changelog. However, it was buggy until now and should be fixed in the next joblib release.
So that was a change introduced in scikit-learn 0.20 but not documented in the changelog? Did we communicate this change to scikit-learn users in some way? Anyway, it sounds like we want a different strategy by default, like …
Never mind: as discussed above, this feature was added but actually had no effect due to a bug, so there is nothing to worry about.
Change proposed in joblib/joblib#913. Also, I think we should still merge #14196: it's a private function to be able to use …
It does not work because you can have nested … Besides, I don't think it's such bad behavior. Currently …
The discussion has drifted toward oversubscription questions in sklearn, which is not exactly the initial purpose and which I think can be treated separately, assuming that we have the right tools to deal with oversubscription independently of the defaults. Thus I propose to refocus the discussion on the initial question. Let me try to summarize the discussion. One good thing is that we seem to agree on the default :)

Comments about the choice of the defaults: …

Comments about the name of the parameter: …
I propose we discuss a little bit more and see if one side manages to convince the other :)
Again, you looked at this way more than I did, so I might be missing something. But OpenBLAS (optionally?) uses OpenMP under the hood, right?
The OpenBLAS shipped with numpy and scipy does not (but you can build it to use OpenMP and link numpy against that). MKL uses OpenMP.
I totally agree with that. For small bits of code using OpenMP it's what makes the most sense. My motivation is for estimators like KMeans, for which the parallelism happens at the outermost loop.
Ok. So that's why I find the phrase "OpenMP based multithreading" quite confusing, because what you really mean is "OpenMP based multithreading where we wrote the call into OpenMP in Cython". And maybe there's a qualitative difference in how you use OpenMP in KMeans vs how it might be used in Nystroem, but at least to me that is not obvious (sorry if that has been discussed above).
Can you maybe say a bit more about that difference?
I can't, as I don't know how it would be used in Nystroem :/ The idea is: if it's a small bit of code that is parallel (with OpenMP in our Cython code), we'd like it to behave like BLAS, i.e. use as many cores as possible. On the other hand, if the whole algorithm is parallel at the outermost loop (still OpenMP in our Cython code), maybe we'd like to provide some control.
Nystroem is just a call to SVD, which I assume is handled entirely by BLAS ;)
I also think that …
As a side note about using all threads by default, e.g. in BLAS: that logic probably originated 10-20 years ago, when computers had 2-4 cores. Now you can get a desktop with 10-20 CPU cores and servers with up to 100. In that context, using all cores by default kind of assumes that our library is alone in the world and can use all resources at no cost. There are a lot of cases where using all CPU cores is not ideal (shared servers, or even desktops/laptops with other resource-intensive applications running, e.g. rendering all that JS in a browser). On a server under load, spawning all threads will actually slow down both the running application and other applications due to oversubscription. Besides, not that many applications scale well beyond 10 CPU cores (or use big enough data for that to make sense). Yes, using threads is almost always faster, but at the cost of significantly higher CPU time and electricity consumption. Maybe we don't care so much for the scikit-learn use cases, but it could still be something worth considering. I'm not saying we should change anything in the proposed approach now, but we should leave things flexible enough so we can change this default later if needed.
The decorator seems reasonable.
> As a side note about using all threads by default, e.g. in BLAS: that logic probably originated 10-20 years ago, when computers had 2-4 cores [...] not that many applications scale well beyond 10 CPU cores
I agree. In our experience, defaulting to scaling across all CPUs is a bad idea on large computers, for multiple reasons:

* When multiple processes run like this (which often happens in Python), it leads to huge oversubscription, which can even freeze the box. One example is n_jobs=-1 + OpenMP using all the threads: on a box with n CPUs, that spawns n**2 threads, which leads to disasters if n is largish.
* Those boxes are often multi-tenant.

I would be in favor of the inner number of threads not exceeding 10 by default.

The big-picture problem is a hard one: in a real user codebase, these days, we end up with multiple nested parallel-computing systems: Python's multiprocessing, Python's threads, OpenMP's parallel computing (and several OpenMP runtimes can coexist if e.g. scikit-learn was compiled with GCC and MKL with ICC).

What we would really need is dynamic scheduling of resources. Intel's TBB offers that, by the way (though I'm not suggesting that we use it :D). It's a hard problem, and we won't tackle it here and now. However, it would be good to think about possible evolutions in this direction in the contract that we give to the user in terms of parallel computing.
With the next release of joblib, the idea is to write:

```python
from joblib import parallel_backend

with parallel_backend('loky', inner_max_num_threads=2):
    ...  # do stuff; the BLAS in child processes will use 2 threads
```
That will also apply to BLAS. Currently we plan to set MKL_NUM_THREADS, BLIS_NUM_THREADS, OPENBLAS_NUM_THREADS and OMP_NUM_THREADS.
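For readers who want the same effect outside of joblib, the variables listed above can be set manually before the native libraries load; this is just an illustration, not scikit-learn or joblib code.

```python
import os

# Must happen before numpy/scipy (and their BLAS) are imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "BLIS_NUM_THREADS"):
    os.environ.setdefault(var, "2")

import numpy as np  # the BLAS threadpools are now capped at 2 threads
```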
@ogrisel sorry, I'm not sure I understand your statement. And so will the default of, say, 10 be enforced in loky or in sklearn?
@amueller @NicolasHug We just merged this over-subscription mitigation feature in … The currently implemented behavior gives: …
@tomMoral Looks great! Just a question: say I want to restrict the number of BLAS threads without using any other parallelism. Could something like … Also, I guess if you explicitly create thread-based parallelism, say with joblib, those are not going to be constrained? Or will they be?
If you want to restrict the number of threads in a process, you need to either set the env vars … And indeed, for thread-based parallelism we don't have a solution to set the number of inner threads in each thread, except a global limit with env variables.
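For the single-process question above, a runtime limit along these lines should work; a small sketch using threadpoolctl (discussed elsewhere in this thread), where `user_api="blas"` targets only the BLAS pools:

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)

# Temporarily cap only the BLAS threadpools (OpenBLAS/MKL/BLIS);
# OpenMP pools outside of BLAS are left untouched.
with threadpool_limits(limits=2, user_api="blas"):
    a @ a  # this matmul uses at most 2 threads
```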
To be more specific, it seems that introspecting or changing the size of the threadpools dynamically can be problematic in some cases for programs linked against multiple OpenMP runtimes at the same time. For instance there is a case under investigation here: joblib/threadpoolctl#40. So for now I would rather use threadpoolctl as little as possible until we better understand the cause of the aforementioned deadlock. It could very well be that we have found thread-safety bugs in those OpenMP runtimes, and if that's the case we will report them so that they can be fixed upstream. In the meantime, the safe ways to control the number of threads are: …
FYI: joblib 0.14.0 (with the fixed version of oversubscription protection for the case of Python processes vs native threadpools) is out: https://pypi.org/project/joblib/0.14.0/
OK, so my current understanding is the following: …

Is that correct? If it is, I'm all in for option 5 (not exposing anything), and of course properly documenting this somewhere in the user guide.
There's one limitation: it does not cover the threading backend. This is why we are working on threadpoolctl. Based on the last meeting, option 5 seems to be the preferred option for most places where we want to introduce OpenMP. So I think I can close this discussion and we can open new ones for specific estimators like hgbt or kmeans. Feel free to re-open if you disagree.
I'll open a PR soon for documentation.
Before adding OpenMP based parallelism we need to decide how to control the number of threads and how to expose it in the public API.
I've seen several propositions from different people:

(1) Use the existing `n_jobs` public parameter, with `None` meaning 1 (same as for joblib parallelism).

(2) Use the existing `n_jobs` public parameter, with `None` meaning -1 (like numpy lets BLAS use as many threads as possible).

(3) Add a new public parameter `n_omp_threads` when the underlying parallelism is handled by OpenMP, with `None` meaning 1.

(4) Add a new public parameter `n_omp_threads` when the underlying parallelism is handled by OpenMP, with `None` meaning -1.

(5) Do not expose it in the public API; use as many threads as possible. The user can still have some control with `OMP_NUM_THREADS` before runtime or using threadpoolctl at runtime.

(1) or (2) will require improving the documentation of `n_jobs` for each estimator: what's the default, what kind of parallelism, what is done in parallel... (see #14228)

@scikit-learn/core-devs, which solution do you prefer? If it's none of the previous ones, what's your solution?