[MRG+1] Releasing the GIL in the inner loop of coordinate descent #3102
Conversation
MechCoder commented on Apr 23, 2014
- release the GIL in the dual gap check (need to replace numpy calls by BLAS equivalents)
- release the GIL in the sparse input variant
- release the GIL for the precomputed Gram matrix variant
- release the GIL for multi-task variant.
- benchmark the scaling for instance by hacking the cross_val_score function to make it possible to use the threading backend instead of multiprocessing
@ogrisel I tried cherry-picking your commit, but there are nasty merge conflicts in the cd_fast.c file.
Ah well. Fixed them. Sorry for the noise.
Merge conflicts in generated C files are a matter of calling Cython again (but I guess you figured that out).
Can we close Olivier's old PR? Which one was it?
# return if we reached desired tolerance
break
with nogil:
    for n_iter in range(max_iter):
@larsmans A noob doubt: as far as I've read, you can only use nogil when there are no Python objects involved. Then how come this works?
The Cython compiler expands `for i in range(n)`, and a few variant constructs, to a C `for` loop when `i` and `n` are properly typed. Check the generated C code to see this at work (it has the input Cython code in comments), or the HTML output from `cython -a` (click any line to see the corresponding C code).
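For illustration, here is a minimal Cython sketch (not the actual cd_fast.pyx code) of a typed loop running inside a `nogil` block; the function name and the raw-pointer access are just for the example:

```cython
import numpy as np
cimport numpy as np

def sum_of_squares(np.ndarray[np.float64_t, ndim=1] a):
    cdef int i
    cdef int n = a.shape[0]
    cdef double total = 0.0
    cdef double* data = <double*> a.data
    with nogil:
        # i and n are plain C ints, so Cython compiles this range() loop
        # to a C for-loop; nothing inside the block touches Python objects.
        for i in range(n):
            total += data[i] * data[i]
    return total
```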
@larsmans I'm having trouble testing the function outside the scikit-learn directory, in particular using the cblas.h file. I have a few questions that I don't have any idea about (this is the first time I'm writing a setup.py file).
Contents of setup.py: https://gist.github.com/MechCoder/11349640. I might not need it right now, but I might need it later, and I should not be as clueless as I am right now. Thanks.
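The gist is not reproduced here, but a minimal setup.py for building a standalone Cython extension against CBLAS might look roughly like the sketch below; the module name, source file, and include/library paths are assumptions, not the actual gist contents:

```python
# setup.py -- a sketch; adjust include_dirs and libraries to wherever
# cblas.h and the CBLAS library actually live on your system.
import numpy
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    "my_cd_fast",                 # hypothetical module name
    sources=["my_cd_fast.pyx"],   # hypothetical source file
    include_dirs=[numpy.get_include(), "/usr/include"],
    libraries=["cblas"],
)

setup(ext_modules=cythonize([ext]))

# Build in place with:  python setup.py build_ext --inplace
```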
(TODO for me: document this. I learned it by hacking on scikit-learn for several years.)
@larsmans
cdef double max(int n, double* a) nogil:
    """np.max(np.abs(a))"""
I suppose you mean `np.max` only here.
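For reference, a straightforward GIL-free body for such a helper could look like the sketch below; this is an illustration, not necessarily the exact code in cd_fast.pyx:

```cython
cdef double max(int n, double* a) nogil:
    """np.max(a) over a C array of length n (n is assumed >= 1)."""
    cdef int i
    cdef double m = a[0]
    for i in range(1, n):
        if a[i] > m:
            m = a[i]
    return m
```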
I plotted some benchmarks, and it does seem that releasing the GIL offers a considerable advantage generally. Code: https://gist.github.com/MechCoder/b18c519ce018dc47b8b3 @agramfort @larsmans @ogrisel WDYT? EDIT: the first plot is with the GIL and the second one is after releasing it.
Would it be good to add these benchmarks to sklearn/benchmarks?
Sorry but I don't understand these plots. What are the colors, the y label, a common ylim? What is with or without the GIL?
The y label is the time. Basically I have tested for different values of n_alphas and n_l1_ratios.
I don't see the figure names on GitHub so I just have to guess which is which. You should fix the random_state in make_regression to make the results reproducible; if you do so then it should always be faster with a higher number of alphas.
@agramfort Umm, surprisingly it does speed up for a higher number of alphas. Please run this small bit of code.
I'm guessing that this might be due to the fact that Parallel has a considerable overhead when run for less computationally expensive operations?
Yes, it's the Parallel overhead. You should bench on mid-size problems too.
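For example, a mid-size benchmark with a fixed random_state (as suggested above) could look like the following sketch; the dataset size and parameter values are illustrative, not the ones from the gist:

```python
import time
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Fixing random_state makes repeated runs of the benchmark comparable.
X, y = make_regression(n_samples=2000, n_features=500, noise=0.1,
                       random_state=0)

for n_alphas in (10, 100, 1000):
    model = ElasticNetCV(n_alphas=n_alphas, l1_ratio=[0.1, 0.5, 0.9],
                         cv=5, n_jobs=2)
    tic = time.time()
    model.fit(X, y)
    print("n_alphas=%d: %.2f s" % (n_alphas, time.time() - tic))
```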
Please ignore the previous comment. In master: l1_ratio = 5. In this branch: l1_ratio = 5. Is the time gain enough? It also seems to slow down in certain cases.
There is no gain to expect just by releasing the GIL, and your benchmarks seem to show this, although it would need to be repeated several times and displayed as bar plots with error bars (for instance with the standard deviation) to confirm it. Releasing the GIL just makes it possible to use the threading backend of joblib without having concurrent threads lock one another. To use the threading backend instead of the multiprocessing backend for the ElasticNetCV class, you can add
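What exactly to add is not quoted in the thread; as a rough sketch (not the actual ElasticNetCV internals, and using the current sklearn.model_selection API), the per-fold path computations could be run through joblib's threading backend like this:

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=5000, n_features=200, random_state=0)

def fit_one_fold(train):
    # The heavy lifting happens inside the Cython coordinate descent
    # solver, which (with this PR) runs without holding the GIL.
    return enet_path(X[train], y[train], l1_ratio=0.5, n_alphas=100)

# The threading backend has much lower overhead than multiprocessing,
# and with the GIL released the folds can genuinely run in parallel.
folds = KFold(n_splits=10).split(X)
paths = Parallel(n_jobs=4, backend="threading")(
    delayed(fit_one_fold)(train) for train, _ in folds)
```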
@ogrisel Is that likely to give a speed gain? The documentation of Parallel says that "threading" is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.
Well, this is precisely why we want to release the GIL in the speed-critical part of the Cython code of the coordinate descent solver.
You also need to perform enough iterations of cv to be able to leverage the inner parallelization, e.g. cv=10.
+1 on @ogrisel's comments. You have a chance to see a benefit in the CV case.
@ogrisel @agramfort I benched with cv = 10, for n_cores = [1, 2, 4]. It seems that threading (after releasing the GIL) does have an advantage.
- Multiprocessing in this branch vs master (after releasing the GIL)
- Threading in this branch vs master (after releasing the GIL)
- Threading in this branch vs multiprocessing in master

Do I get a go-ahead to continue with the rest of this PR?
Is there a difference in memory consumption? Use a large dataset to see this, and maybe the memory_profiler. I would expect a difference in memory consumption more than in speed.
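A rough way to measure this with memory_profiler could look like the sketch below; it assumes memory_profiler is installed, and the dataset size and parameters are illustrative:

```python
from memory_profiler import memory_usage
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=20000, n_features=500, random_state=0)

def fit():
    ElasticNetCV(n_alphas=100, cv=10, n_jobs=4).fit(X, y)

# memory_usage samples resident memory while fit() runs; include_children
# matters for the multiprocessing backend, which does its work in workers.
samples = memory_usage((fit, (), {}), interval=0.1, include_children=True)
print("peak memory: %.1f MiB" % max(samples))
```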
@ogrisel I pushed a commit that releases the GIL for the multi-task variant. Do you mind having a quick look before I bench?
@ogrisel For the Gram variant, after the minor changes :) Timings in this branch vs. in master.
@GaelVaroquaux @ogrisel @agramfort Some benchmarks for the multi-task variant, with X = 10000 x 1000 and y = 10000 x 3: timings in this branch vs. in master, plus the memory benefits of this branch over master. I've updated the PR from [WIP] to [MRG].
# norm_cols_X = (np.asarray(X) ** 2).sum(axis=0)
for ii in range(n_features):
    for jj in range(n_samples):
        norm_cols_X[ii] += X[jj][ii]**2
If you run `cython -a sklearn/linear_model/cd_fast.pyx` you will see that this line causes some Python wrapping overhead. I suppose this is caused by the double indexing `X[jj][ii]`, which is probably not automatically optimized by Cython. `X[jj, ii]` makes it possible to remove that overhead. The same problem also appears at lines 640, 702 and 705. Also: PEP8 convention would favor the `X[jj, ii] ** 2` notation (with whitespace around `**`).
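With the suggested tuple indexing and PEP8 spacing, the loop from the diff above would read:

```cython
# norm_cols_X = (np.asarray(X) ** 2).sum(axis=0)
for ii in range(n_features):
    for jj in range(n_samples):
        norm_cols_X[ii] += X[jj, ii] ** 2
```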
Once my last comment has been taken into account, +1 for merging. Thanks for the contrib and the extensive benchmarks @MechCoder!
@ogrisel Can I add my name at the top of the cd_fast.pyx file?
@MechCoder sure.
Can you please re-run the multitask threading benchmark to check the impact of this last optim?
Yes, running it.
@ogrisel It doesn't look like there is much difference; a slight improvement (timings in this branch vs. in master).
Ping @agramfort, please have a look :)
LGTM! +1 for merge.
@ogrisel shall I merge?
I just did :)
@ogrisel @agramfort Thanks a lot for your help. Learnt a lot from this.
@ogrisel @agramfort One of my GSoC goals is to benchmark cyclic vs. random coordinate descent, and to test whether it converges faster or not. For cyclic descent, I need to permute the indices of the features once before every outer iteration, along the lines of this.
Sorry to be a noob, but since we have released the GIL and a numpy call might cause overhead, is there anything that would help me replace it with a C call? Thanks.
No need to reinit a new RNG each time. Now, to answer your question about calling
with gil:
    f_shuffle = rng.permutation(n_features)
The first solution may introduce some significant lock contention on very small problems, so ideally we should try to implement the GIL-free solution. That would probably involve refactoring the Cython source of the project to factor the pure-Cython RNG out of the tree code to make it more reusable.
@ogrisel I had a quick look at the Fisher-Yates algorithm and the tree code. Is the algorithm used a special adaptation of Fisher-Yates? Because I see comments like,
In that case, we might need to change it a bit for the linear models.
In the tree code, there are specific changes to take into account constant features in a split. But I don't think you need that sort of thing for linear models. I think what @ogrisel suggests is to factorize
Yes, exactly. And then implement your own Fisher-Yates loop in the stochastic CD loop using that refactored helper.
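For reference, a GIL-free in-place Fisher-Yates shuffle along those lines could look like the sketch below; the rand_int helper is assumed to be a pure-C RNG routine factored out of the tree code, and its name and signature here are hypothetical:

```cython
cdef inline void shuffle_features(int* f_shuffle, int n_features,
                                  unsigned int* seed) nogil:
    """Shuffle the feature indices in place without touching Python objects."""
    cdef int i, j, tmp
    for i in range(n_features - 1, 0, -1):
        # rand_int(low, high, seed) is assumed to return a random integer
        # in [low, high) using a pure-C generator (e.g. the tree code's RNG).
        j = rand_int(0, i + 1, seed)
        tmp = f_shuffle[i]
        f_shuffle[i] = f_shuffle[j]
        f_shuffle[j] = tmp
```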