RFC: How should we control/expose the number of threads for our OpenMP-based parallel Cython code? #14265
My preference is (2). I opened #14196 in that direction.
Thanks for opening the issue. I have a preference towards (4). I don't like (2) because … That being said, I wouldn't vote -1.
I'm leaning towards 3 or 4. I'm not sure which one's better, but as @NicolasHug says, I too would rather have it different from n_jobs.
> I too would rather have it different from n_jobs.

I think that naming OpenMP parallelism differently from Python-level parallelism is a very technical nuance that will be lost on most of our users. We need to be able to give a simple message in terms of how to control parallelism.

With regards to the default, I would also love to have the right default that prevents oversubscription. Do we have the technical tools needed for that? I know that Jeremie and Olivier have been struggling with oversubscription.
I have reviewed what is done in some other ML libraries that use OpenMP. I don't think there is a perfect solution here, but I would also be more in favor of (2).
I don't think that should be documented. It's an implementation detail that can change at any moment; most people only care that increasing n_jobs makes their code faster (and are disappointed when it doesn't).
To be more precise, I think we should have a documentation section / glossary discussing the different possible levels of parallelism, but it shouldn't be specified for each estimator.
I know they can be changed and we don't give users a guarantee that they'll stay the same, but in terms of distributing jobs on a cluster, or sometimes using dask's backend for instance, it's much easier for users and developers to know which parts run on OpenMP threads and which parts on processes.
We already have BLAS parallelism happening a bit everywhere, so when one adds OpenMP and processes on top, in such complex systems the easiest approach is just to benchmark and see what works best in terms of number of threads / processes. Say we add OpenMP somewhere: whether that is noteworthy may depend on the amount of linalg operations happening beforehand in BLAS (and that ratio may also scale with some unknown power of the data size).
The idea is to call … It works fine for all sklearn use cases :)
I would like to understand the oversubscription more. @jeremiedbb why does it work fine and what does that mean?
Grid search was not a good example because joblib should already handle this case. It does so by setting the number of OpenMP threads to 1.
@jeremiedbb Why is that not a good example? And if the user sets …?
Assuming that we can ensure good protection against oversubscription in joblib (which I believe will soon be the case), I would be in favor of setting the defaults to let our OpenMP loops use the current maximum number of threads (as OpenBLAS and MKL do for numpy / scipy).

+1 for letting the user pass an explicit number of threads using the … parameter.

For process-based parallelism (e.g. with the loky backend of joblib) I think keeping the default … makes sense.

For joblib with the threading backend I would be in favor of keeping the current behavior but re-exploring this choice later if necessary.
For reference, the over-subscription protection in joblib will be tracked by joblib/joblib#880.
My pocket sent this before I finished :D It was not a good example for threadpoolctl, but it is for joblib. Although we found that it doesn't work in previous joblib versions, it should be fixed in the next release. Threadpoolctl would be used when doing nested OpenMP/BLAS parallelism: it allows preventing oversubscription by limiting the number of threads for BLAS. This situation only happens for KMeans right now. Overall, we prevent oversubscription by disabling parallelism for anything that is not top-level parallelism. In summary, any default is fine for n_jobs since it would be forced to 1 when already in a parallel call.
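To make that last point concrete, here is a toy sketch (not scikit-learn code, just an illustration of joblib's nesting behavior): an inner Parallel call issued from inside a worker is demoted so that parallelism does not multiply across levels.

```python
from joblib import Parallel, delayed

def inner(x):
    # This nested Parallel call runs inside a worker; joblib detects the
    # nesting and demotes it to a thread-based or sequential backend, so
    # we never end up with n_jobs * n_jobs processes.
    return Parallel(n_jobs=2)(delayed(lambda v: v * v)(i) for i in range(x))

# Outer, process-based parallelism: only this level actually fans out.
results = Parallel(n_jobs=2)(delayed(inner)(3) for _ in range(4))
print(results)  # [[0, 1, 4], [0, 1, 4], [0, 1, 4], [0, 1, 4]]
```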
Just so I get this straight: if I'm running a grid search over HistGradientBoosting or something else that uses OpenMP, we would probably want it to default to using all available cores. Now the user sets … Separate scenario: …
That's right. In that case, they should set n_jobs=-1 for the grid search to benefit from all cores. In many situations (personal observation) it's better to enable parallelism in the outermost loop. Note that this is the case with BLAS: if you set n_jobs=2 for an estimator that joblib parallelizes, BLAS will be limited to using only 1 core. I think that behavior is OK if we document it properly.
OK, but if they go from n_jobs=1 to n_jobs=2, they will get worse results, which may be a bit unexpected.
Isn't that set via …? Note that slight oversubscription may not be bad (at least the last time I looked into it in HPC). E.g. if you have 4 CPU cores, 2 processes with 4 threads each is not necessarily worse than 2 processes with 2 threads each if the task uses heterogeneous compute units; the problem is really a 40-core CPU with 40² threads.
So basically, it's about priority between …
I agree. That's a drawback of having the default None mean -1. But I think it's also a matter of documentation, because if I know my estimator already uses all cores, why would I increase the number of workers for my grid search? Or maybe I'd like to disable parallelism of my estimator and use all cores for the grid search (which is your next point).
Yes.
I guess it's possible.
Indeed, but it's not good either (at least I've never experienced getting better performance with oversubscription).
In the implementation I propose, n_jobs/num_threads has priority over environment variables. Only the default (None) would be influenced by the environment variables. If you have a grid search with n_jobs=2 and you set n_jobs/n_threads=(n_cores // 2), you'll get full saturation of your cores. I'll add that having good defaults is important, but it shouldn't prevent users from thinking about what those defaults are, why they are good, and when they are good.
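A minimal sketch of that priority order (the helper name `_effective_n_threads` is hypothetical, not an actual scikit-learn function): an explicit parameter wins, None defers to the environment, and failing that we fall back to all cores.

```python
import os

def _effective_n_threads(n_threads=None):
    # Hypothetical resolution logic for the proposal above:
    # explicit parameter > environment variable > all cores.
    if n_threads is not None:
        return n_threads  # an explicit value always wins
    env = os.environ.get("OMP_NUM_THREADS")
    if env is not None:
        return int(env)  # None defers to the environment
    return os.cpu_count()  # default: use all available cores

print(_effective_n_threads(4))     # 4, regardless of OMP_NUM_THREADS
print(_effective_n_threads(None))  # OMP_NUM_THREADS if set, else cpu_count()
```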
The case of multiprocessing where each process starts some number of threads makes me think of hybrid MPI/OpenMP programming in HPC (cf. e.g. this presentation). The analogy is maybe only partial, as we don't use MPI nor run on multiple nodes, but the data serialization cost when starting new processes does (for now) have a somewhat similar effect to communication cost in MPI. So my point is that it might be useful to be able to control both the number of parallel processes and the number of nested threads separately, but to control the latter it's preferable to have some global mechanism such as …
> That's a drawback of having the default None mean -1.

I don't think that "None" should mean "-1" or "1". It should mean "best guess to be efficient given the global constraints", and the semantics of that can evolve and depend on the context. What I am saying is that when users specify "None", we should, as time goes, add dynamic scheduling logic to be more efficient, and we should tell the user explicitly that the implementation details of the job scheduling will evolve.

> Note that slight oversubscription may not be bad (at least the last time I looked into it in HPC). E.g. if you have 4 CPU cores, 2 processes with 4 threads each is not necessarily worse than 2 processes with 2 threads each if the task uses heterogeneous compute units; the problem is really a 40-core CPU with 40² threads.

Yes, it matches my experience, as long as we don't blow the memory.
I was not aware of that! That is... an interesting side effect. Was that always the case? Is that documented anywhere?
It was introduced in joblib 0.12, see the changelog. However, it was buggy until now and should be fixed in the next joblib release.
So that was a change introduced in scikit-learn 0.20 but not documented in the changelog? Did we communicate this change to scikit-learn users in some way? Anyway, it sounds like we want a different strategy by default, like …
Never mind: as discussed above, this feature was added but actually had no effect due to a bug, so there is nothing to worry about.
Change proposed in joblib/joblib#913. Also, I think we should still merge #14196: it's a private function to be able to use …
It does not work because you can have nested … Besides, I don't think it's such bad behavior. Currently …
The discussion has drifted toward oversubscription questions in sklearn, which is not exactly the initial purpose and which I think can be treated separately, assuming that we have the right tools to deal with oversubscription independently of the defaults. Thus I propose to refocus the discussion on the initial question. Let me try to summarize the discussion. One good thing is that we seem to agree on the default :)

Comments about the choice of the defaults: …

Comments about the name of the parameter: …
I propose we discuss a little bit more and see if one side manages to convince the other :)
Again, you looked at this way more than I did, so I might be missing something. But OpenBLAS (optionally?) uses OpenMP under the hood, right?
The OpenBLAS shipped with numpy and scipy does not (but you can build it to use OpenMP and link numpy against that). MKL uses OpenMP.
I totally agree with that. For small bits of code using OpenMP it's what makes the most sense. My motivation is for estimators like KMeans, for which the parallelism happens at the outermost loop.
Ok. So that's why I find the phrase "OpenMP based multithreading" quite confusing, because what you really mean is "OpenMP based multithreading where we wrote the call into OpenMP in Cython". And maybe there's a qualitative difference in how you use OpenMP in KMeans vs how it might be used in Nystroem, but at least to me that is not obvious (sorry if that has been discussed above).
Can you maybe say a bit more about that difference?
I can't, as I don't know how it would be used in Nystroem :/ The idea is: if it's a small bit of code that is parallel (with OpenMP in our Cython code), we'd like it to behave like BLAS, i.e. use as many cores as possible. On the other hand, if the whole algorithm is parallel at the outermost loop (still OpenMP in our Cython code), maybe we'd like to provide some control.
Nystroem is just a call to SVD, which I assume is handled entirely by BLAS ;)
I also think that …
As a side note about using all threads by default, e.g. in BLAS: that logic probably originated 10-20 years ago, when computers had 2-4 cores. Now you can get a desktop with 10-20 CPU cores and servers with up to 100. In that context, using all cores by default kind of assumes that our library is alone in the world and can use all resources at no cost. There are a lot of cases where using all CPU cores is not ideal (shared servers, or even desktops/laptops with other resource-intensive applications running, e.g. rendering all that JS in a browser). On a server under load, spawning all threads will actually slow down both the running application and other applications due to oversubscription. Besides, not that many applications scale well beyond 10 CPU cores (or use big enough data for that to make sense). Yes, using threads is almost always faster, but at the cost of significantly higher CPU time and electricity consumption. Maybe we don't care so much for the scikit-learn use cases, but it could still be something worth considering. I'm not saying we should change anything in the proposed approach now, but we should leave things flexible enough so we can change this default later if needed.
The decorator seems reasonable.
> As a side note about using all threads by default, e.g. in BLAS: that logic probably originated 10-20 years ago, when computers had 2-4 cores [...] not that many applications scale well beyond 10 CPU cores
I agree. In our experience, defaulting to scaling across all CPUs is a bad idea on large computers, for multiple reasons:

* When multiple processes run like this (which often happens in Python), it leads to huge oversubscription, which can even freeze the box. One example is n_jobs=-1 + OpenMP using all the threads: on a box with n CPUs, that spawns n**2 threads, which leads to disasters if n is largish.
* Those boxes are often multi-tenant.

I would be in favor of the inner number of threads not exceeding 10 by default.

The big-picture problem is a hard one: in a real user codebase, these days, we end up with multiple nested parallel-computing systems: Python's multiprocessing, Python's threads, OpenMP's parallel computing (and several OpenMP runtimes can coexist if e.g. scikit-learn was compiled with GCC and MKL with ICC).

What we would really need is dynamic scheduling of resources. Intel's TBB offers that, by the way (though I'm not suggesting that we use it :D). It's a hard problem, and we won't tackle it here and now. However, it would be good to think about possible evolutions in this direction in the contract that we give to the user in terms of parallel computing.
With the next release of joblib, the idea is to write:

```python
from joblib import parallel_backend

with parallel_backend('loky', inner_max_num_threads=2):
    ...  # do stuff; the BLAS in child processes will use 2 threads
```
That will also apply to BLAS. Currently we plan to set MKL_NUM_THREADS, BLIS_NUM_THREADS, OPENBLAS_NUM_THREADS and OMP_NUM_THREADS.
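For readers who want the same effect outside of joblib, the variables listed above can be set manually before the native libraries load; this is just an illustration, not scikit-learn or joblib code.

```python
import os

# Must happen before numpy/scipy (and their BLAS) are imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "BLIS_NUM_THREADS"):
    os.environ.setdefault(var, "2")

import numpy as np  # the BLAS threadpools are now capped at 2 threads
```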
@ogrisel sorry, I'm not sure I understand your statement. And so will the default of, say, 10 be enforced in loky or in sklearn?
@amueller @NicolasHug We just merged this over-subscription mitigation feature in … The currently implemented behavior gives: …
@tomMoral Looks great! Just a question: say I want to restrict the number of BLAS threads without using any other parallelism. Could something like … Also, I guess if you explicitly create thread-based parallelism, say with joblib, those are not going to be constrained? Or will they be?
If you want to restrict the number of threads in a process, you need to either set the env vars … And indeed, for thread-based parallelism we don't have a solution to set the number of inner threads in each thread, except a global limit with env variables.
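For the single-process question above, a runtime limit along these lines should work; a small sketch using threadpoolctl (discussed elsewhere in this thread), where `user_api="blas"` targets only the BLAS pools:

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)

# Temporarily cap only the BLAS threadpools (OpenBLAS/MKL/BLIS);
# OpenMP pools outside of BLAS are left untouched.
with threadpool_limits(limits=2, user_api="blas"):
    a @ a  # this matmul uses at most 2 threads
```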
To be more specific, it seems that introspecting or changing the size of the threadpools dynamically can be problematic in some cases for programs linked against multiple OpenMP runtimes at the same time. For instance there is a case under investigation here: joblib/threadpoolctl#40. So for now I would rather use threadpoolctl as little as possible until we better understand the cause of the aforementioned deadlock. It could very well be that we have found thread-safety bugs in those OpenMP runtimes, and if that's the case we will report them so that they can be fixed upstream. In the meantime, the safe ways to control the number of threads are: …
FYI: joblib 0.14.0 (with the fixed version of oversubscription protection for the case of Python processes vs native threadpools) is out: https://pypi.org/project/joblib/0.14.0/
OK, so my current understanding is the following: …

Is that correct? If it is, I'm all in for option 5 (not exposing anything), and of course properly documenting this somewhere in the user guide.
There's one limitation: it does not cover the threading backend. This is why we are working on threadpoolctl. Based on the last meeting, option 5 seems to be the preferred option for most places where we want to introduce OpenMP. So I think I can close this discussion and we can open new ones for specific estimators like hgbt or kmeans. Feel free to re-open if you disagree.
I'll open a PR soon for documentation.
Before adding OpenMP based parallelism we need to decide how to control the number of threads and how to expose it in the public API.
I've seen several propositions from different people:

(1) Use the existing `n_jobs` public parameter, with `None` meaning 1 (same as for joblib parallelism).

(2) Use the existing `n_jobs` public parameter, with `None` meaning -1 (like numpy lets BLAS use as many threads as possible).

(3) Add a new public parameter `n_omp_threads` when the underlying parallelism is handled by OpenMP, with `None` meaning 1.

(4) Add a new public parameter `n_omp_threads` when the underlying parallelism is handled by OpenMP, with `None` meaning -1.

(5) Do not expose it in the public API; use as many threads as possible. The user can still have some control with `OMP_NUM_THREADS` before runtime or using threadpoolctl at runtime.

(1) or (2) will require improving the documentation of `n_jobs` for each estimator: what's the default, what kind of parallelism, what is done in parallel... (see #14228)

@scikit-learn/core-devs, which solution do you prefer? If it's none of the previous ones, what's your solution?