[MRG] FIX Avoid accumulating forest predictions in non-threadsafe manner #9830
Conversation
The problem with this is that it makes RF prediction more memory intensive by a factor of [...]. I could do this summation over an iterator instead, which with joblib we will only use in the n_jobs=1 case... WDYT?
```python
# ForestClassifier or ForestRegressor, because joblib complains that it cannot
# pickle it when placed there.

def accumulate_prediction(predict, X, out):
```
The problem is that "out" is shared among threads, right?
If this is the case, I don't understand the memory increase. Shouldn't out be n_samples x n_classes, and the memory increase therefore n_samples x n_classes x n_jobs?
At master, up to n_jobs prediction matrices are in memory at once; the accumulation is done, albeit badly, in each thread. In this PR, all prediction matrices are returned from parallel, i.e. n_estimators separate matrices, and the accumulation is done afterwards. A mutex lock on the accumulation would be sufficient to solve this, but we've tended to avoid them. Alternatively, a map that returns an iterator rather than a list (or a list of promises) would suffice to give roughly the current memory consumption with correctness assured.
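A minimal sketch of that iterator idea, with illustrative names (`iter_predictions`, `predict_forest` are hypothetical, not the PR's code): if per-tree predictions are consumed lazily, only one extra prediction matrix is alive while a single-threaded sum runs, so there is no shared-write race.

```python
import numpy as np


def iter_predictions(trees, X):
    # Hypothetical helper: yield one tree's prediction at a time rather than
    # materialising all n_estimators matrices in a list.
    for tree in trees:
        yield tree.predict(X)


def predict_forest(trees, X):
    preds = iter_predictions(trees, X)
    out = next(preds).astype(np.float64)
    for pred in preds:   # consumed sequentially, so no concurrent writes
        out += pred
    return out / len(trees)
```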
Before I forget, I am guessing this is partly reverting #8672. We should take a look at that PR to try to understand the motivations behind it.
Yes, it is reverting that. That PR claimed the GIL would protect us; it wouldn't. Should I implement a mutex instead?
Chatting with @ogrisel about this one, he thinks adding a threading.Lock is probably both the best and the simplest option. The summation of probabilities should not be a bottleneck, so the lock will not impact performance. Just curious, can we actually reproduce the failure outside the Debian testing framework (mips is the failing architecture, I think)?
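A sketch of the lock-based approach being discussed (names and shapes are illustrative, not the exact diff): each worker computes its tree's prediction in parallel and takes the lock only for the cheap in-place sum into the shared array.

```python
import threading

import numpy as np
from joblib import Parallel, delayed


def accumulate_prediction(predict, X, out, lock):
    prediction = predict(X)   # the expensive tree traversal runs unlocked
    with lock:                # serialize only the cheap in-place sum
        out += prediction


def predict_forest(trees, X, n_jobs=4):
    out = np.zeros(X.shape[0])
    lock = threading.Lock()
    Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(accumulate_prediction)(tree.predict, X, out, lock)
        for tree in trees
    )
    return out / len(trees)
```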
I tried to reproduce it the other day and failed. Our best chance? Lots of random features, max_features=1, very few samples, many estimators, large n_jobs. Let's use [...]
Force-pushed from 9e1b199 to f55fd89
Done
The changes look fine, but I am a bit uncomfortable merging this kind of fix blind, without having managed to reproduce the failure either in the scikit-learn tests or in a simpler snippet where you update a single numpy array with parallel summations. Also, maybe it would be a good idea to run a quick and dirty benchmark to make sure that the lock is not impacting performance?
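A quick-and-dirty benchmark along those lines might look like this (sizes are arbitrary assumptions; the idea would be to run it once on master and once on this branch and compare the timings):

```python
import timeit

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(1000, 20)
y = np.random.rand(1000)
rfr = RandomForestRegressor(n_estimators=100, n_jobs=4).fit(X, y)

# Average prediction time; the locked accumulation should be lost in the noise.
print(timeit.timeit(lambda: rfr.predict(X), number=50) / 50, "s per predict")
```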
FYI, I tried a little while ago having [...]
My understanding is that the issue arises when multiple trees are trying to add their individual predictions to the single [...]
ogrisel left a comment
It seems like a reasonable fix. GIL contention should not be too visible, as adding arrays is probably orders of magnitude faster than computing the predictions themselves.
@jmschrei I don't know how the GIL protects the [...]. I.e. numpy releases the GIL for lots of operations.
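A hedged illustration of why the GIL alone is not enough (this is not the forest code, and like the original failure it may or may not trigger on a given machine): `shared += value` is a read-modify-write, so concurrent updates can be lost even in pure Python.

```python
import sys
import threading

sys.setswitchinterval(1e-6)  # switch threads aggressively to coax the race out

counter = 0

def add_many(n):
    global counter
    for _ in range(n):
        counter += 1         # LOAD / ADD / STORE: three steps, not atomic

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Often prints less than 400000 when increments are lost between threads.
print(counter, "expected", 4 * 100_000)
```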
This reproduces the contention, showing that at master, n_jobs=4 is inconsistent and n_jobs=1 is consistent. Unfortunately, it seems to show that the current PR is also inconsistent, and I cannot fathom why:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import sklearn.ensemble
print(sklearn.ensemble.__path__)

X = np.random.rand(10, 100)
y = np.random.rand(10) * 100
rfr = RandomForestRegressor(n_estimators=1000, max_features=1, n_jobs=4).fit(X, y)
ys = []
for i in range(100):
    if i % 10 == 0:
        print(i)
    ys.append(rfr.set_params(n_jobs=4).predict(X))
n_failures = sum(np.any(np.diff(ys, axis=0), axis=1))
if n_failures:
    print('Broke up to %d times!' % n_failures)
else:
    print('Consistent!')
```
The answer is that the test is finding instability due to summation order, not threading contention. The differences are minuscule, both at master and in this PR. So the issue remains without a reliable test. What I can say is that the effect of locking on performance is negligible.
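A small demonstration of that summation-order instability (illustrative, not from the PR): floating-point addition is not associative, so summing the same values in a different order can change the result by a few ULPs, which is exactly what the snippet above detects.

```python
import numpy as np

rng = np.random.RandomState(0)
preds = rng.rand(1000)        # stand-ins for per-tree predictions

sequential = sum(preds)       # left-to-right Python summation
pairwise = np.sum(preds)      # numpy's pairwise summation: different order

print(sequential == pairwise)       # typically False
print(abs(sequential - pairwise))   # but the difference is tiny
```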
Thanks a lot @jnothman, let's merge this one!
Resolved #9393. See #9734.
Ping @ogrisel.