[MRG] FIX Avoid accumulating forest predictions in non-threadsafe manner #9830
Conversation
The problem with this is that it makes RF prediction more memory intensive by a factor of [...]. I could do this summation over an iterator instead, which with joblib we will only use in the n_jobs=1 case... WDYT?
```python
# ForestClassifier or ForestRegressor, because joblib complains that it cannot
# pickle it when placed there.

def accumulate_prediction(predict, X, out):
```
The problem is that "out" is shared among threads, right?
If this is the case, I don't understand the memory increase. Shouldn't out be n_samples x n_classes, and the memory increase therefore n_samples x n_classes x n_jobs?
At master, up to n_jobs prediction matrices are in memory at once; the accumulation is done, albeit badly, in each thread. In this PR, all prediction matrices are returned from parallel, i.e. n_estimators separate matrices, and the accumulation is done afterwards. A mutex lock on the accumulation would be sufficient to solve this, but we've tended to avoid them. Alternatively, a map that returns an iterator rather than a list (or a list of promises) would suffice to give roughly the current memory consumption with correctness assured.
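A minimal sketch of that iterator idea, with illustrative names (`iter_predictions`, `predict_forest` are hypothetical, not the PR's code): if per-tree predictions are consumed lazily, only one extra prediction matrix is alive while a single-threaded sum runs, so there is no shared-write race.

```python
import numpy as np


def iter_predictions(trees, X):
    # Hypothetical helper: yield one tree's prediction at a time rather than
    # materialising all n_estimators matrices in a list.
    for tree in trees:
        yield tree.predict(X)


def predict_forest(trees, X):
    preds = iter_predictions(trees, X)
    out = next(preds).astype(np.float64)
    for pred in preds:   # consumed sequentially, so no concurrent writes
        out += pred
    return out / len(trees)
```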
Before I forget, I am guessing this is partly reverting #8672. We should take a look at that PR to try to understand the motivations behind it.
Yes, it is reverting that. That PR claimed the GIL would protect us; it wouldn't. Should I implement a mutex instead?
Chatting with @ogrisel about this one, he thinks adding a threading.Lock is probably both the best and the simplest option. The summation of probabilities should not be a bottleneck, so the lock will not impact performance. Just curious, can we actually reproduce the failure outside the Debian testing framework (mips is the failing architecture, I think)?
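A sketch of the lock-based approach being discussed (names and shapes are illustrative, not the exact diff): each worker computes its tree's prediction in parallel and takes the lock only for the cheap in-place sum into the shared array.

```python
import threading

import numpy as np
from joblib import Parallel, delayed


def accumulate_prediction(predict, X, out, lock):
    prediction = predict(X)   # the expensive tree traversal runs unlocked
    with lock:                # serialize only the cheap in-place sum
        out += prediction


def predict_forest(trees, X, n_jobs=4):
    out = np.zeros(X.shape[0])
    lock = threading.Lock()
    Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(accumulate_prediction)(tree.predict, X, out, lock)
        for tree in trees
    )
    return out / len(trees)
```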
I tried to reproduce it the other day and failed. Our best chance? Lots of random features, max_features=1, very few samples, many estimators, large n_jobs. Let's use [...]
Force-pushed from 9e1b199 to f55fd89
Done
The changes look fine, but I am a bit uncomfortable merging this kind of fix blind, without having managed to reproduce the failure either in the scikit-learn tests or in a simpler snippet where you update a single numpy array with parallel summations. Also, maybe it would be a good idea to run a quick and dirty benchmark to make sure that the lock is not impacting performance?
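A quick-and-dirty benchmark along those lines might look like this (sizes are arbitrary assumptions; the idea would be to run it once on master and once on this branch and compare the timings):

```python
import timeit

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(1000, 20)
y = np.random.rand(1000)
rfr = RandomForestRegressor(n_estimators=100, n_jobs=4).fit(X, y)

# Average prediction time; the locked accumulation should be lost in the noise.
print(timeit.timeit(lambda: rfr.predict(X), number=50) / 50, "s per predict")
```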
FYI, I tried a little while ago having [...]
My understanding is that the issue arises when multiple trees are trying to add their individual predictions to the single [...]
ogrisel left a comment
It seems like a reasonable fix. GIL contention should not be too visible, as adding arrays is probably orders of magnitude faster than computing the predictions themselves.
@jmschrei I don't know how the GIL protects the [...]. I.e. numpy releases the GIL for lots of operations.
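A hedged illustration of why the GIL alone is not enough (this is not the forest code, and like the original failure it may or may not trigger on a given machine): `shared += value` is a read-modify-write, so concurrent updates can be lost even in pure Python.

```python
import sys
import threading

sys.setswitchinterval(1e-6)  # switch threads aggressively to coax the race out

counter = 0

def add_many(n):
    global counter
    for _ in range(n):
        counter += 1         # LOAD / ADD / STORE: three steps, not atomic

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Often prints less than 400000 when increments are lost between threads.
print(counter, "expected", 4 * 100_000)
```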
This reproduces the contention, showing that at master, n_jobs=4 is inconsistent and n_jobs=1 is consistent. Unfortunately, it seems to show that the current PR is also inconsistent, and I cannot fathom why:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import sklearn.ensemble
print(sklearn.ensemble.__path__)

X = np.random.rand(10, 100)
y = np.random.rand(10) * 100
rfr = RandomForestRegressor(n_estimators=1000, max_features=1, n_jobs=4).fit(X, y)
ys = []
for i in range(100):
    if i % 10 == 0:
        print(i)
    ys.append(rfr.set_params(n_jobs=4).predict(X))
n_failures = sum(np.any(np.diff(ys, axis=0), axis=1))
if n_failures:
    print('Broke up to %d times!' % n_failures)
else:
    print('Consistent!')
```
The answer is that the test is finding instability due to summation order, not threading contention. The differences are minuscule, both at master and in this PR. So the issue remains without a reliable test. What I can say is that the effect of locking on performance is negligible.
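A small demonstration of that summation-order instability (illustrative, not from the PR): floating-point addition is not associative, so summing the same values in a different order can change the result by a few ULPs, which is exactly what the snippet above detects.

```python
import numpy as np

rng = np.random.RandomState(0)
preds = rng.rand(1000)        # stand-ins for per-tree predictions

sequential = sum(preds)       # left-to-right Python summation
pairwise = np.sum(preds)      # numpy's pairwise summation: different order

print(sequential == pairwise)       # typically False
print(abs(sequential - pairwise))   # but the difference is tiny
```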
Thanks a lot @jnothman, let's merge this one!
Resolved #9393. See #9734.
Ping @ogrisel.