
[MRG+2] ENH Loop over candidates as outer loop in search #8322


Merged: 2 commits into scikit-learn:master on Jun 1, 2017

Conversation

@jnothman (Member) commented Feb 9, 2017:

Fixes #8830
As suggested in #7990 (review).

This encourages concurrent fits to be over different datasets so that fits over the same data subset are more likely to run in serial and hence generate cache hits where memoisation is used.

This requires storing all CV splits in memory, consuming n_samples * n_splits ints' worth of memory (roughly 40 MB for 10^6 samples with 5 splits, at 8 bytes per int); the previous version generated the splits dynamically.
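As a minimal sketch of the ordering change (placeholder strings stand in for real parameter settings and (train, test) splits; this is not the actual _search.py code): with candidates as the outer loop, adjacent tasks, which a parallel backend is most likely to run concurrently, touch different splits, while fits that share a split are n_splits positions apart and so tend to run serially.

from itertools import product

candidate_params = ['C=0.01', 'C=1', 'C=100']  # hypothetical grid
cv_splits = ['split0', 'split1']               # stand-ins for (train, test) pairs

# Old order: splits outer, candidates inner. Adjacent tasks share a split,
# so parallel workers race to fit on the same data before any of them has
# written the cache entry.
old_order = [(p, s) for s in cv_splits for p in candidate_params]

# New order: candidates outer, splits inner. Adjacent tasks use different
# splits; fits on the same split are spaced n_splits apart, so the later
# one can hit the cache written by the earlier one.
new_order = list(product(candidate_params, cv_splits))

print(old_order[:3])  # [('C=0.01', 'split0'), ('C=1', 'split0'), ('C=100', 'split0')]
print(new_order[:3])  # [('C=0.01', 'split0'), ('C=0.01', 'split1'), ('C=1', 'split0')]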

codecov bot commented Feb 17, 2017:

Codecov Report

Merging #8322 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8322      +/-   ##
==========================================
+ Coverage   94.73%   94.75%   +0.01%     
==========================================
  Files         342      342              
  Lines       60674    60801     +127     
==========================================
+ Hits        57482    57609     +127     
  Misses       3192     3192
Impacted Files Coverage Δ
sklearn/model_selection/_search.py 97.88% <100%> (+0.07%)
sklearn/pipeline.py 99.26% <ø> (-0.36%)
sklearn/linear_model/ridge.py 93.88% <ø> (-0.02%)
sklearn/linear_model/tests/test_ridge.py 100% <ø> (ø)
sklearn/metrics/classification.py 97.77% <ø> (ø)
sklearn/datasets/__init__.py 100% <ø> (ø)
sklearn/utils/mocking.py 100% <ø> (ø)
sklearn/metrics/tests/test_common.py 99.5% <ø> (ø)
sklearn/model_selection/tests/test_validation.py 98.23% <ø> (+0.01%)
sklearn/datasets/base.py 92.3% <ø> (+0.03%)
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ddd886b...f64f15f.

@@ -577,8 +577,8 @@ def fit(self, X, y=None, groups=None):
                                 return_n_test_samples=True,
                                 return_times=True, return_parameters=False,
                                 error_score=self.error_score)
-      for train, test in cv.split(X, y, groups)
-      for parameters in candidate_params)
+      for parameters, (train, test) in product(candidate_params,
+                                               cv.split(X, y, groups)))
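A side note on the memory point in the description (standard itertools behaviour, not shown in the diff): product() converts each of its input iterables to a tuple as soon as it is called, so every CV split is materialized before the first fit is dispatched. A toy demonstration with a hypothetical generator:

from itertools import product

def splits():
    # Stand-in generator for cv.split(X, y, groups).
    for i in range(2):
        print('materializing split', i)
        yield ('train_idx_%d' % i, 'test_idx_%d' % i)

# product() consumes its inputs up front, so both splits are generated
# here, before any task runs.
tasks = product(['C=0.01', 'C=1'], splits())
# prints: materializing split 0
#         materializing split 1
print(next(tasks))  # ('C=0.01', ('train_idx_0', 'test_idx_0'))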
Review comment (Member): Nice. That reverts back to what it was in 0.18 (or pre 0.18?)

@amueller (Member): LGTM. Maybe quick benchmark?

@jmschrei (Member): LGTM as well. I'd also like to see how strong this effect is.

@jnothman (Member, Author) commented May 24, 2017:

(times in seconds; cv_inner is this branch, which makes the CV splits the inner loop; % = cv_inner / master, lower is better)

cv  grid_size  n_jobs  master (s)  cv_inner (s)    %
 3          2       1      131.78        128.58   97
 3          2       3       99.32         71.42   71
 3          2       5       72.73         71.94   98
 3          2       7       69.49         71.53  102
 3          4       1      131.20        134.79  102
 3          4       3      125.16         71.48   57
 3          4       5       98.05         70.57   71
 3          4       7       99.46         70.65   71
 5          2       1      202.53        198.94   98
 5          2       3      135.37        108.42   80
 5          2       5      107.36        103.04   95
 5          2       7      106.24        107.32  101
 5          4       1      217.78        215.57   98
 5          4       3      205.75        130.63   63
 5          4       5      142.35         79.49   55
 5          4       7      148.33         80.33   54

with:

import time, os
import numpy as np
from sklearn import datasets, model_selection, linear_model, pipeline, decomposition

bunch = datasets.fetch_20newsgroups_vectorized()
X = bunch.data
y = bunch.target

for cv in [3, 5]:
    for grid_size in [2, 4]:
        for n_jobs in [1, 3, 5, 7]:
            # Clear the memoisation cache so each configuration starts cold.
            os.system('rm -rf /tmp/mem')
            # The expensive LDA step is cached via the pipeline's memory
            # parameter; only clf__C varies across candidates, so fits on
            # the same train split can reuse the cached LDA result.
            pipe = pipeline.Pipeline([('decomp', decomposition.LatentDirichletAllocation(random_state=0, learning_method='batch')),
                                      ('clf', linear_model.LogisticRegression(random_state=0))],
                                     memory='/tmp/mem')
            search = model_selection.GridSearchCV(pipe,
                                                  {'clf__C': np.logspace(-2, 2, grid_size)},
                                                  cv=cv, n_jobs=n_jobs)
            start = time.time()
            search.fit(X, y)
            print(cv, grid_size, n_jobs, time.time() - start)
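For context on the caching this benchmark exercises: Pipeline's memory parameter memoises transformer fitting with joblib's Memory, so two candidates sharing a train split refit the expensive LDA step only once. A minimal, self-contained sketch of that memoisation (the function and sleep are hypothetical stand-ins, not scikit-learn internals):

import time
from joblib import Memory

mem = Memory('/tmp/mem_demo', verbose=0)

@mem.cache
def slow_fit_transform(split_id):
    # Stands in for fitting LatentDirichletAllocation on one train split.
    time.sleep(2)
    return 'topics for %s' % split_id

t0 = time.time()
slow_fit_transform('split0')  # computed, ~2 s
print('first call:  %.2fs' % (time.time() - t0))

t0 = time.time()
slow_fit_transform('split0')  # cache hit, near-instant
print('second call: %.2fs' % (time.time() - t0))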

@jnothman changed the title from "[MRG] ENH Loop over candidates as outer loop in search" to "[MRG+2] ENH Loop over candidates as outer loop in search" on May 24, 2017
@GaelVaroquaux (Member) commented May 24, 2017 via email

@lesteve (Member) commented Jun 1, 2017: LGTM, merging this one, thanks a lot @jnothman!

@lesteve merged commit 7ce7134 into scikit-learn:master on Jun 1, 2017
Commits referencing this pull request were later pushed to forks: Sundrique (Jun 14, 2017), dmohns (Aug 7, 2017, twice), NelleV (Aug 11, 2017), paulha (Aug 19, 2017), AishwaryaRK (Aug 29, 2017), maskani-moh (Nov 15, 2017), and jwjohnson314 (Dec 18, 2017), each carrying the same commit message as the description above.