Hi, sorry for not being of much help here, but from the traceback it looks like joblib is unable to fork new child processes, though I haven't seen this issue before. I suppose it would be better to post this on the joblib issue tracker.
The issue seems to be resolved by changing the timeout value in multiprocessing/forking.py from 0.1 to 1.0, as suggested by @ogrisel in #4016 (comment). Another workaround is to migrate from Python 2.7 to 3.5.
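To make the timeout workaround more concrete, here is a minimal sketch of the general pattern it relies on: if terminating a worker raises an OS-level error, give the process a grace period to exit on its own before re-raising. This is not the standard library's code. The helper name `terminate_with_grace` and its `grace` and `poll` parameters are hypothetical, and it is written for Python 3 against any object exposing `terminate()` and `is_alive()`, such as `multiprocessing.Process`.

```python
import time


def terminate_with_grace(proc, grace=1.0, poll=0.05):
    """Terminate `proc`, tolerating a failed terminate() if the process exits soon.

    Hypothetical helper for illustration only. `proc` is any object with
    terminate() and is_alive(), e.g. a multiprocessing.Process.
    """
    try:
        proc.terminate()
    except OSError:
        # terminate() can fail (e.g. "Access is denied") when the process is
        # already shutting down. Wait up to `grace` seconds for it to exit.
        deadline = time.monotonic() + grace
        while proc.is_alive():
            if time.monotonic() >= deadline:
                raise  # still alive after the grace period: surface the error
            time.sleep(poll)
```

A longer grace period (1.0 rather than 0.1 seconds in the comment above) simply gives a slow-to-exit worker more time before the error is propagated.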
Hi, I am running GridSearchCV on an Azure VM with 16 cores and 112 GB RAM (accessed over RDP). The same code works fine on my PC with 16 cores and 8 GB RAM, but on the VM I get the following error intermittently, roughly 8 out of 10 runs. The shape of the training data is 800000 x 20. The grid search fits all the folds, as I can see from the verbose=10 output, but it fails before computing "best_score_" and "best_estimator_". Below is the traceback. (I get the same error with n_jobs=-1 or any number less than 16, with refit=True, and for cv values ranging from 4 to 10.)
[CV] ml__C=0.5, ml__penalty=l1, ml__tol=0.01, ml__max_iter=300, ml__warm_start=False
[CV] ml__C=0.5, ml__penalty=l1, ml__tol=0.01, ml__max_iter=300, ml__warm_start=False, score=0.620114 - 47.6s
Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Preston/Documents/Python Scripts/ml_cpt_cv.py", line 111, in
model.fit(x_train, y_train)
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
for parameters in parameter_iterable
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 818, in __call__
self._terminate_pool()
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 549, in _terminate_pool
self._pool.terminate() # terminate does a join()
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 583, in terminate
super(MemmapingPool, self).terminate()
File "C:\Anaconda2\lib\multiprocessing\pool.py", line 465, in terminate
self._terminate()
File "C:\Anaconda2\lib\multiprocessing\util.py", line 207, in __call__
res = self._callback(*self._args, **self._kwargs)
File "C:\Anaconda2\lib\multiprocessing\pool.py", line 513, in _terminate_pool
p.terminate()
File "C:\Anaconda2\lib\multiprocessing\process.py", line 137, in terminate
self._popen.terminate()
File "C:\Anaconda2\lib\multiprocessing\forking.py", line 312, in terminate
_subprocess.TerminateProcess(int(self._handle), TERMINATE)
WindowsError: [Error 5] Access is denied
Also, on the very next run of the same code, I get the following joblib error.
Fitting 4 folds for each of 3 candidates, totalling 12 fits
Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Preston/Documents/Python Scripts/ml_cpt_cv.py", line 111, in
model.fit(x_train, y_train)
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
for parameters in parameter_iterable
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 766, in __call__
n_jobs = self._initialize_pool()
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 515, in _initialize_pool
raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information.
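For readers unfamiliar with the guard this message refers to, the following is a minimal sketch of the pattern. It uses the standard library's `multiprocessing.Pool` as a stand-in for joblib and is written for Python 3. The function names `square` and `main` are hypothetical and not taken from the script in this report.

```python
from multiprocessing import Pool


def square(x):
    # Worker functions live at module level so child processes can import them.
    return x * x


def main():
    # All process-spawning work happens here, never at import time.
    with Pool(processes=2) as pool:
        print(pool.map(square, range(5)))


if __name__ == "__main__":
    # On Windows, child processes re-import this module. The guard keeps
    # them from re-running main() and spawning workers recursively.
    main()
```

The point of the design is that importing the module has no side effects, which matters on platforms without fork() because each worker starts by importing the main module again.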
This is the code:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, precision_score, recall_score, precision_recall_curve, accuracy_score, confusion_matrix
import sklearn.grid_search
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import pipeline, metrics, grid_search

if __name__ == "__main__":
    train_data = pd.read_csv("train.csv", delimiter=",")
    test_data = pd.read_csv("test.csv", delimiter=",")
    n_cv = int(raw_input("Please tell us how many fold cross validation you would like to perform,please give a value between 3-10"))
    assert (n_cv >= 3) and (n_cv <= 10), "Looks like the number of folds is not in between 3 and 10"
    #scl = StandardScaler()
    y_train = train_data["class"]
    y_test = test_data["class"]
    x_train = train_data
    x_test = test_data
    #x_test=list(x_test)
    # Convert the boolean labels to 0/1 integers.
    y_train = list(y_train)
    y_train = [1 if i == True else 0 for i in y_train]
    y_test = list(y_test)
    y_test = [1 if i == True else 0 for i in y_test]
    type_of_ml_model = raw_input("Please input rf for random forests, lr for logistic regression, svm for support vector machines and gbdt for gradient boosting")
    # Build a single-step pipeline and parameter grid for the chosen model.
    if type_of_ml_model == "rf":
        ml_model = RandomForestClassifier(random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {
            'ml__n_estimators': [100, 500],
            'ml__max_features': ["auto", None, "log2"],
            'ml__max_depth': [5, 4, 3],
            'ml__min_samples_split': [1, 2],
            'ml__oob_score': [True, False],
            'ml__class_weight': ["balanced", "balanced_subsample"]
        }
    elif type_of_ml_model == "lr":
        ml_model = LogisticRegression(class_weight="balanced", random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {#'ml__fit_intercept':[True],'ml__intercept_scaling':[1,2],
            'ml__tol': [0.01], 'ml__max_iter': [300], 'ml__warm_start': [False],
            'ml__C': [0.5, 0.2, 0.6],
            'ml__penalty': ["l1"]}
    elif type_of_ml_model == "svm":
        ml_model = LinearSVC(class_weight="balanced", random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {'ml__loss': ["squared_hinge"], 'ml__dual': [False], 'ml__fit_intercept': [True],
            'ml__intercept_scaling': [2, 3],
            'ml__tol': [0.001, 0.0001], 'ml__max_iter': [200],
            'ml__C': [1, 0.8],
            'ml__penalty': ["l1"]}
    elif type_of_ml_model == "gbdt":
        ml_model = GradientBoostingClassifier(random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {'ml__loss': ["deviance"],
            'ml__n_estimators': [100, 200],
            'ml__max_features': ["auto", None],
            'ml__max_depth': [4, 3],
            'ml__min_samples_split': [1, 2],
            'ml__subsample': [1.0], 'ml__warm_start': [False],
            'ml__min_samples_leaf': [1, 2]
        }
    # Score candidates by precision and run the search on all available cores.
    precision_scorer = metrics.make_scorer(precision_score, greater_is_better=True)
    model = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid, scoring=precision_scorer,
                                     verbose=10, n_jobs=-1, iid=True, refit=True, cv=n_cv)
    model.fit(x_train, y_train)
Is there something in the parameters that we need to change when running on a VM?