[MRG+2] ENH Loop over candidates as outer loop in search #8322
Conversation
Fixes #8830
As suggested in #7990 (review).

This encourages concurrent fits to be over *different datasets*, so that fits over the same data subset are more likely to run in serial and hence generate cache hits where memoisation is used.

This requires storing all CV splits in memory, consuming n_samples * n_splits ints worth of memory; the previous version generated the splits dynamically.
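As a rough, minimal sketch of that memory cost (the dataset shape and splitter below are arbitrary stand-ins, not taken from the PR):

import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((10000, 5))
cv = KFold(n_splits=5)

# Materialising the generator keeps every train/test index array alive at
# once; the previous code consumed one split at a time.
splits = list(cv.split(X))

# Each sample index appears exactly once per split (in train or in test),
# so the total held in memory is about n_samples * n_splits integers.
n_ints = sum(train.size + test.size for train, test in splits)
print(n_ints)  # 50000 = 10000 samples * 5 splits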
Codecov Report

@@            Coverage Diff             @@
##           master    #8322      +/-   ##
==========================================
+ Coverage   94.73%   94.75%    +0.01%
==========================================
  Files         342      342
  Lines       60674    60801      +127
==========================================
+ Hits        57482    57609      +127
  Misses       3192     3192
@@ -577,8 +577,8 @@ def fit(self, X, y=None, groups=None):
                                   return_n_test_samples=True,
                                   return_times=True, return_parameters=False,
                                   error_score=self.error_score)
-          for train, test in cv.split(X, y, groups)
-          for parameters in candidate_params)
+          for parameters, (train, test) in product(candidate_params,
+                                                   cv.split(X, y, groups)))
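To see why this ordering helps, here is a minimal sketch (candidate and split labels are hypothetical) of which tasks end up adjacent in the dispatch order, and hence likely to run concurrently, under each loop order:

from itertools import product

candidate_params = ['C=0.01', 'C=1', 'C=100']  # hypothetical grid
splits = ['split0', 'split1', 'split2']        # stand-ins for (train, test) pairs

# Old order: splits in the outer loop. Adjacent tasks share a split, so
# parallel workers fit on the same data at the same time and all miss the
# memoised transformer cache together.
old = [(p, s) for s in splits for p in candidate_params]

# New order: candidates in the outer loop. Adjacent tasks use different
# splits, so fits on the same split tend to run in serial and hit the cache.
new = list(product(candidate_params, splits))

print(old[:3])  # [('C=0.01', 'split0'), ('C=1', 'split0'), ('C=100', 'split0')]
print(new[:3])  # [('C=0.01', 'split0'), ('C=0.01', 'split1'), ('C=0.01', 'split2')]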
Nice. That reverts back to what it was in 0.18 (or pre 0.18?)
LGTM. Maybe quick benchmark?

LGTM as well. I'd also like to see how strong this effect is.
Benchmarked with:

import time, os
import numpy as np
from sklearn import datasets, model_selection, linear_model, pipeline, decomposition

bunch = datasets.fetch_20newsgroups_vectorized()
X = bunch.data
y = bunch.target

for cv in [3, 5]:
    for grid_size in [2, 4]:
        for n_jobs in [1, 3, 5, 7]:
            os.system('rm -rf /tmp/mem')  # clear the memoisation cache between runs
            pipe = pipeline.Pipeline(
                [('decomp', decomposition.LatentDirichletAllocation(
                    random_state=0, learning_method='batch')),
                 ('clf', linear_model.LogisticRegression(random_state=0))],
                memory='/tmp/mem')  # cache fitted transformers on disk
            search = model_selection.GridSearchCV(
                pipe, {'clf__C': np.logspace(-2, 2, grid_size)},
                cv=cv, n_jobs=n_jobs)
            start = time.time()
            search.fit(X, y)
            print(cv, grid_size, n_jobs, time.time() - start)
[Benchmark results table not preserved in this extract; its columns were cv, grid_size, n_jobs, wall time on master, and wall time on this branch (cv_inner).]
Quite convincing! Thanks for the benchmarks.
LGTM, merging this one, thanks a lot @jnothman!