[MRG+2] ENH Loop over candidates as outer loop in search #8322
Conversation
Fixes #8830
As suggested in #7990 (review).

This encourages concurrent fits to be over *different datasets*, so that fits over the same data subset are more likely to run in serial and hence generate cache hits where memoisation is used.

This requires storing all CV splits in memory, consuming n_samples * n_splits ints worth of memory; the previous version generated the splits dynamically.
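As a rough, minimal sketch of that memory cost (the dataset shape and splitter below are arbitrary stand-ins, not taken from the PR):

import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((10000, 5))
cv = KFold(n_splits=5)

# Materialising the generator keeps every train/test index array alive at
# once; the previous code consumed one split at a time.
splits = list(cv.split(X))

# Each sample index appears exactly once per split (in train or in test),
# so the total held in memory is about n_samples * n_splits integers.
n_ints = sum(train.size + test.size for train, test in splits)
print(n_ints)  # 50000 = 10000 samples * 5 splits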
Codecov Report

@@            Coverage Diff             @@
##           master    #8322      +/-   ##
==========================================
+ Coverage   94.73%   94.75%    +0.01%
==========================================
  Files         342      342
  Lines       60674    60801      +127
==========================================
+ Hits        57482    57609      +127
  Misses       3192     3192
@@ -577,8 +577,8 @@ def fit(self, X, y=None, groups=None):
                                   return_n_test_samples=True,
                                   return_times=True, return_parameters=False,
                                   error_score=self.error_score)
-          for train, test in cv.split(X, y, groups)
-          for parameters in candidate_params)
+          for parameters, (train, test) in product(candidate_params,
+                                                   cv.split(X, y, groups)))
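To see why this ordering helps, here is a minimal sketch (candidate and split labels are hypothetical) of which tasks end up adjacent in the dispatch order, and hence likely to run concurrently, under each loop order:

from itertools import product

candidate_params = ['C=0.01', 'C=1', 'C=100']  # hypothetical grid
splits = ['split0', 'split1', 'split2']        # stand-ins for (train, test) pairs

# Old order: splits in the outer loop. Adjacent tasks share a split, so
# parallel workers fit on the same data at the same time and all miss the
# memoised transformer cache together.
old = [(p, s) for s in splits for p in candidate_params]

# New order: candidates in the outer loop. Adjacent tasks use different
# splits, so fits on the same split tend to run in serial and hit the cache.
new = list(product(candidate_params, splits))

print(old[:3])  # [('C=0.01', 'split0'), ('C=1', 'split0'), ('C=100', 'split0')]
print(new[:3])  # [('C=0.01', 'split0'), ('C=0.01', 'split1'), ('C=0.01', 'split2')]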
Nice. That reverts back to what it was in 0.18 (or pre 0.18?)
LGTM. Maybe quick benchmark?

LGTM as well. I'd also like to see how strong this effect is.
Benchmarked with:

import time, os
import numpy as np
from sklearn import datasets, model_selection, linear_model, pipeline, decomposition

bunch = datasets.fetch_20newsgroups_vectorized()
X = bunch.data
y = bunch.target

for cv in [3, 5]:
    for grid_size in [2, 4]:
        for n_jobs in [1, 3, 5, 7]:
            os.system('rm -rf /tmp/mem')  # clear the memoisation cache between runs
            pipe = pipeline.Pipeline(
                [('decomp', decomposition.LatentDirichletAllocation(
                    random_state=0, learning_method='batch')),
                 ('clf', linear_model.LogisticRegression(random_state=0))],
                memory='/tmp/mem')  # cache fitted transformers on disk
            search = model_selection.GridSearchCV(
                pipe, {'clf__C': np.logspace(-2, 2, grid_size)},
                cv=cv, n_jobs=n_jobs)
            start = time.time()
            search.fit(X, y)
            print(cv, grid_size, n_jobs, time.time() - start)
[Benchmark results table not preserved in this extract; its columns were cv, grid_size, n_jobs, wall time on master, and wall time on this branch (cv_inner).]
Quite convincing! Thanks for the benchmarks.
LGTM, merging this one, thanks a lot @jnothman!