deadlock in multioutput #8543

Closed

yupbank opened this issue Mar 6, 2017 · 6 comments

Comments

@yupbank
Contributor

yupbank commented Mar 6, 2017

Description

MultiOutputClassifier.fit never ends when the base classifier supports n_jobs.

Steps/Code to Reproduce

Example:

import numpy as np
import sklearn.datasets as datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# 100000 samples, 200 features, and 1000 (identical) output columns
x, y = datasets.make_classification(n_samples=100000, n_features=200)
multi_y = np.hstack([y[:, np.newaxis] for i in xrange(1000)])

# both the base estimator and the multi-output wrapper request all cores
base_old = LogisticRegression(solver='lbfgs', n_jobs=-1)
multi_clf_never_end = MultiOutputClassifier(base_old, n_jobs=-1)
multi_clf_never_end.fit(x, multi_y)

Expected Results

multi_clf_never_end fits as expected.

Actual Results

It blocks; fit never returns.

Versions

Darwin-16.4.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Dec 18 2016, 07:03:39) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
('NumPy', '1.11.2')
('SciPy', '0.16.1')
('Scikit-Learn', '0.19.dev0')

@rth
Member

rth commented Mar 6, 2017

@yupbank Doesn't sound like a deadlock. You are training LogisticRegression 1000 times on a (100000, 200) input dataset. Doing that on a (1000, 200) dataset already takes ~1 min on a 4-core CPU. I'm not sure of the exact scaling of LogisticRegression with n_samples when using lbfgs, but this would probably take hours to compute.

MultiOutputClassifier.fit never ends when the base classifier supports n_jobs.

How long have you waited?
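
For scale, here is a minimal sketch of how to check that estimate (it assumes the same data and solver as the snippet above): time one LogisticRegression fit on the full dataset and multiply by the number of output columns, since MultiOutputClassifier fits one estimator per column.

import time

import sklearn.datasets as datasets
from sklearn.linear_model import LogisticRegression

# same data shape as in the report: 100000 samples, 200 features
x, y = datasets.make_classification(n_samples=100000, n_features=200)

clf = LogisticRegression(solver='lbfgs')
start = time.time()
clf.fit(x, y)  # one of the 1000 fits MultiOutputClassifier would run
single_fit = time.time() - start

# MultiOutputClassifier fits one clone per output column, so the serial total
# is roughly 1000 * single_fit; n_jobs divides that at best by the core count
print('one fit: %.1fs, ~%.2f CPU-hours for 1000 outputs'
      % (single_fit, 1000 * single_fit / 3600.0))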

@yupbank
Contributor Author

yupbank commented Mar 6, 2017

~40 minutes.

@rth
Member

rth commented Mar 6, 2017

Using how many CPUs? This could take ≳ 1.3 CPU-hours by my estimate...

@rth
Member

rth commented Mar 6, 2017

Also, using n_jobs > 1 actually slows things down here.

Using a 100-column output (instead of 1000 as in your example) on a 4-core CPU:

  • MultiOutputClassifier(base_old, n_jobs=1):
    $ time python /tmp/test.py
    
    real    0m42.301s
    user    2m42.168s
    sys     0m2.268s
    
  • MultiOutputClassifier(base_old, n_jobs=4):
    $ time python /tmp/test.py
    
    real    2m49.912s
    user    9m4.964s
    sys     2m8.304s
    

So using n_jobs=4 slows the computation down almost exactly 4x here, probably for the same reason as #8216: joblib.Parallel pickling / memmapping overhead...
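
A minimal, self-contained sketch of the comparison above (timing from within Python rather than with the shell's time, and assuming the same data as before with a 100-column output):

import time

import numpy as np
import sklearn.datasets as datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

x, y = datasets.make_classification(n_samples=100000, n_features=200)
multi_y = np.hstack([y[:, np.newaxis] for i in range(100)])  # 100 output columns

base_old = LogisticRegression(solver='lbfgs', n_jobs=-1)
for n_jobs in (1, 4):
    clf = MultiOutputClassifier(base_old, n_jobs=n_jobs)
    start = time.time()
    clf.fit(x, multi_y)
    print('MultiOutputClassifier n_jobs=%d: %.1fs' % (n_jobs, time.time() - start))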

@yupbank
Contributor Author

yupbank commented Mar 6, 2017

Interesting... well, I have made this finish within ~3 minutes with a 1000-column output, using an updated loss function and gradient.

@lesteve
Member

lesteve commented Mar 10, 2017

It's hard to tell whether there is an actual problem here, or whether it is just that your snippet takes a long time to run. I am going to close this one; @yupbank, feel free to reopen if you have some new information to add to this issue.

@lesteve lesteve closed this as completed Mar 10, 2017