"""
The :mod:`sklearn.grid_search` module includes utilities to fine-tune the
parameters of an estimator.
"""
from __future__ import print_function
# Author: Alexandre Gramfort <[email protected]>,
# Gael Varoquaux <[email protected]>
# License: BSD 3 clause
from abc import ABCMeta, abstractmethod
from collections import Mapping, namedtuple
from functools import partial, reduce
from itertools import product
import numbers
import operator
import time
import warnings
import numpy as np
import numpy.ma.mrecords as mrecords
from .base import BaseEstimator, is_classifier, clone
from .base import MetaEstimatorMixin
from .cross_validation import check_cv
from .externals.joblib import Parallel, delayed, logger
from .externals import six
from .externals.six import iteritems, iterkeys
from .externals.six.moves import zip
from .utils import safe_mask, check_random_state
from .utils.validation import _num_samples, check_arrays
from .metrics import SCORERS, Scorer
__all__ = ['GridSearchCV', 'ParameterGrid', 'fit_grid_point',
'ParameterSampler', 'RandomizedSearchCV']
class ParameterGrid(object):
"""Grid of parameters with a discrete number of values for each.
Can be used to iterate over parameter value combinations with the
Python built-in function iter.
Parameters
----------
param_grid : dict of string to sequence
The parameter grid to explore, as a dictionary mapping estimator
parameters to sequences of allowed values.
Examples
--------
>>> from sklearn.grid_search import ParameterGrid
>>> param_grid = {'a':[1, 2], 'b':[True, False]}
>>> list(ParameterGrid(param_grid)) #doctest: +NORMALIZE_WHITESPACE
[{'a': 1, 'b': True}, {'a': 1, 'b': False},
{'a': 2, 'b': True}, {'a': 2, 'b': False}]
See also
--------
:class:`GridSearchCV`:
uses ``ParameterGrid`` to perform a full parallelized parameter search.
"""
def __init__(self, param_grid):
if isinstance(param_grid, Mapping):
# wrap dictionary in a singleton list
# XXX Why? The behavior when passing a list is undocumented,
# but not doing this breaks one of the tests.
param_grid = [param_grid]
self.param_grid = param_grid
def __iter__(self):
"""Iterate over the points in the grid.
Returns
-------
params : iterator over dict of string to any
Yields dictionaries mapping each estimator parameter to one of its
allowed values.
"""
for p in self.param_grid:
# Always sort the keys of a dictionary, for reproducibility
items = sorted(p.items())
keys, values = zip(*items)
for v in product(*values):
params = dict(zip(keys, v))
yield params
def __len__(self):
"""Number of points on the grid."""
# Product function that can handle iterables (np.product can't).
product = partial(reduce, operator.mul)
return sum(product(len(v) for v in p.values())
for p in self.param_grid)
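# A hedged sketch of the list-of-dicts behaviour noted in ``__init__`` above:
# each dict in the list spans its own sub-grid, and iteration chains them
# together (the parameter names here are illustrative assumptions):
#
#     >>> grid = ParameterGrid([{'kernel': ['linear']},
#     ...                       {'kernel': ['rbf'], 'gamma': [1, 10]}])
#     >>> len(grid)
#     3
#     >>> sorted(p['kernel'] for p in grid)
#     ['linear', 'rbf', 'rbf']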
class IterGrid(ParameterGrid):
"""Generators on the combination of the various parameter lists given.
This class is DEPRECATED. It was renamed to ``ParameterGrid``. The name
``IterGrid`` will be removed in 0.15.
Parameters
----------
param_grid: dict of string to sequence
The parameter grid to explore, as a dictionary mapping estimator
parameters to sequences of allowed values.
Returns
-------
params: dict of string to any
**Yields** dictionaries mapping each estimator parameter to one of its
allowed values.
Examples
--------
>>> from sklearn.grid_search import IterGrid
>>> param_grid = {'a':[1, 2], 'b':[True, False]}
>>> list(IterGrid(param_grid)) #doctest: +NORMALIZE_WHITESPACE
[{'a': 1, 'b': True}, {'a': 1, 'b': False},
{'a': 2, 'b': True}, {'a': 2, 'b': False}]
See also
--------
:class:`GridSearchCV`:
uses ``IterGrid`` to perform a full parallelized parameter search.
"""
def __init__(self, param_grid):
warnings.warn("IterGrid was renamed to ParameterGrid and will be"
" removed in 0.15.", DeprecationWarning)
super(IterGrid, self).__init__(param_grid)
class ParameterSampler(object):
"""Generator on parameters sampled from given distributions.
Parameters
----------
param_distributions : dict
Dictionary where the keys are parameters and values
are distributions from which a parameter is to be sampled.
Distributions either have to provide a ``rvs`` function
to sample from them, or can be given as a list of values,
where a uniform distribution is assumed.
n_iter : integer
Number of parameter settings that are produced.
random_state : int or RandomState
        Pseudo random number generator state used for random sampling.
Returns
-------
    params : dict of string to any
        **Yields** dictionaries mapping each estimator parameter to
        a sampled value.
Examples
--------
>>> from sklearn.grid_search import ParameterSampler
>>> from scipy.stats.distributions import expon
>>> import numpy as np
>>> np.random.seed(0)
>>> param_grid = {'a':[1, 2], 'b': expon()}
>>> list(ParameterSampler(param_grid, n_iter=4))
... #doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
[{'a': 1, 'b': 0.89...}, {'a': 1, 'b': 0.92...},
{'a': 2, 'b': 1.87...}, {'a': 2, 'b': 1.03...}]
"""
def __init__(self, param_distributions, n_iter, random_state=None):
self.param_distributions = param_distributions
self.n_iter = n_iter
self.random_state = random_state
def __iter__(self):
rnd = check_random_state(self.random_state)
# Always sort the keys of a dictionary, for reproducibility
items = sorted(self.param_distributions.items())
for _ in range(self.n_iter):
params = dict()
for k, v in items:
if hasattr(v, "rvs"):
params[k] = v.rvs()
else:
params[k] = v[rnd.randint(len(v))]
yield params
def __len__(self):
"""Number of points that will be sampled."""
return self.n_iter
def fit_grid_point(X, y, base_clf, clf_params, train, test, scorer,
verbose, loss_func=None, **fit_params):
"""Run fit on one set of parameters.
Parameters
----------
X : array-like, sparse matrix or list
Input data.
y : array-like or None
Targets for input data.
base_clf : estimator object
This estimator will be cloned and then fitted.
clf_params : dict
Parameters to be set on base_estimator clone for this grid point.
train : ndarray, dtype int or bool
Boolean mask or indices for training set.
test : ndarray, dtype int or bool
Boolean mask or indices for test set.
    scorer : callable or None
If provided must be a scoring object / function with signature
``scorer(estimator, X, y)``.
verbose : int
Verbosity level.
    loss_func : callable, optional
        Deprecated; this argument is not used by this function.
    **fit_params : kwargs
        Additional parameters passed to the fit function of the estimator.
Returns
-------
score : float
Score of this parameter setting on given training / test split.
    parameters : dict
        The parameters (``clf_params``) that were used to configure the
        fitted estimator for this grid point.
n_samples_test : int
Number of test samples in this split.
"""
if verbose > 1:
start_time = time.time()
msg = '%s' % (', '.join('%s=%s' % (k, v)
for k, v in clf_params.items()))
print("[GridSearchCV] %s %s" % (msg, (64 - len(msg)) * '.'))
# update parameters of the classifier after a copy of its base structure
clf = clone(base_clf)
clf.set_params(**clf_params)
if hasattr(base_clf, 'kernel') and callable(base_clf.kernel):
# cannot compute the kernel values with custom function
raise ValueError("Cannot use a custom kernel function. "
"Precompute the kernel matrix instead.")
if not hasattr(X, "shape"):
if getattr(base_clf, "_pairwise", False):
raise ValueError("Precomputed kernels or affinity matrices have "
"to be passed as arrays or sparse matrices.")
X_train = [X[idx] for idx in train]
X_test = [X[idx] for idx in test]
else:
if getattr(base_clf, "_pairwise", False):
# X is a precomputed square kernel matrix
if X.shape[0] != X.shape[1]:
raise ValueError("X should be a square kernel matrix")
X_train = X[np.ix_(train, train)]
X_test = X[np.ix_(test, train)]
else:
X_train = X[safe_mask(X, train)]
X_test = X[safe_mask(X, test)]
if y is not None:
y_test = y[safe_mask(y, test)]
y_train = y[safe_mask(y, train)]
clf.fit(X_train, y_train, **fit_params)
if scorer is not None:
this_score = scorer(clf, X_test, y_test)
else:
this_score = clf.score(X_test, y_test)
else:
clf.fit(X_train, **fit_params)
if scorer is not None:
this_score = scorer(clf, X_test)
else:
this_score = clf.score(X_test)
if not isinstance(this_score, numbers.Number):
raise ValueError("scoring must return a number, got %s (%s)"
" instead." % (str(this_score), type(this_score)))
if verbose > 2:
msg += ", score=%f" % this_score
if verbose > 1:
end_msg = "%s -%s" % (msg,
logger.short_format_time(time.time() -
start_time))
print("[GridSearchCV] %s %s" % ((64 - len(end_msg)) * '.', end_msg))
return this_score, clf_params, _num_samples(X_test)
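# A minimal usage sketch for ``fit_grid_point``; the estimator, data and the
# interleaved split below are illustrative assumptions, not part of this
# module:
#
#     from sklearn.datasets import load_iris
#     from sklearn.svm import SVC
#     import numpy as np
#
#     iris = load_iris()
#     train, test = np.arange(0, 150, 2), np.arange(1, 150, 2)
#     score, params, n_test = fit_grid_point(
#         iris.data, iris.target, SVC(), {'C': 1.0}, train, test,
#         scorer=None, verbose=0)
#     # score is the mean accuracy of the fitted clone on the test half,
#     # params echoes {'C': 1.0}, and n_test == 75.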
def _check_param_grid(param_grid):
if hasattr(param_grid, 'items'):
param_grid = [param_grid]
for p in param_grid:
for v in p.values():
if isinstance(v, np.ndarray) and v.ndim > 1:
raise ValueError("Parameter array should be one-dimensional.")
            if not isinstance(v, (list, tuple, np.ndarray)):
                raise ValueError("Parameter values should be a list.")
if len(v) == 0:
raise ValueError("Parameter values should be a non-empty "
"list.")
class SearchResult(object):
"""
>>> from __future__ import print_function
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import SVC
>>> iris = load_iris()
>>> grid = {'C': [0.01, 0.1, 1], 'degree': [1, 2, 3]}
>>> search = GridSearchCV(SVC(kernel='poly'), param_grid=grid)
>>> search = search.fit(iris.data, iris.target)
>>> res = search.results_
>>> res.best().mean_test_score # doctest: +ELLIPSIS
0.973...
>>> res # doctest: +ELLIPSIS
<9 candidates. Best results:
<0.973 for {'C': 0.1..., 'degree': 3}>,
<0.967 for {'C': 1.0, 'degree': 3}>,
<0.967 for {'C': 1.0, 'degree': 2}>, ...>
>>> res[res.param_degree == 2] # doctest: +ELLIPSIS
<3 candidates. Best results:
<0.967 for {'C': 1.0, 'degree': 2}>,
<0.967 for {'C': 0.1..., 'degree': 2}>,
<0.927 for {'C': 0.01, 'degree': 2}>>
>>> res.group_best(['degree']) # doctest: +ELLIPSIS
<3 candidates. Best results:
<0.973 for {'C': 0.1..., 'degree': 3}>,
<0.967 for {'C': 1.0, 'degree': 2}>,
<0.967 for {'C': 1.0, 'degree': 1}>>
>>> for tup in res.zipped('parameters', 'mean_test_score',
... 'std_test_score'):
... print(*tup)
... # doctest: +ELLIPSIS
{'C': 0.01, 'degree': 1} 0.67... 0.03...
{'C': 0.01, 'degree': 2} 0.92... 0.00...
{'C': 0.01, 'degree': 3} 0.96... 0.01...
{'C': 0.10..., 'degree': 1} 0.94 0.01...
{'C': 0.10..., 'degree': 2} 0.96... 0.01...
{'C': 0.10..., 'degree': 3} 0.97... 0.00...
{'C': 1.0, 'degree': 1} 0.96... 0.02...
{'C': 1.0, 'degree': 2} 0.96... 0.00...
{'C': 1.0, 'degree': 3} 0.96... 0.01...
"""
__slots__ = ('_param_arrays', '_data_arrays', '_fold_weight',
'_score_field', '_greater_is_better')
def __init__(self, param_arrays, data_arrays, fold_weight=None,
score_field='test_score', greater_is_better=True):
self._param_arrays = param_arrays
self._data_arrays = data_arrays
self._fold_weight = fold_weight
self._score_field = score_field
self._greater_is_better = greater_is_better
def __getattr__(self, attr):
try:
prefix, field = attr.split('_', 1)
except ValueError:
raise AttributeError('%r has no attribute %r'
% (self.__class__.__name__, attr))
if prefix == 'param':
try:
return self._param_arrays[field]
except (KeyError, ValueError):
raise AttributeError('%r has no attribute %r'
% (self.__class__.__name__, attr))
try:
data = self._data_arrays[field]
except (KeyError, ValueError):
raise AttributeError('%r has no attribute %r'
% (self.__class__.__name__, attr))
if prefix == 'fold':
return data
elif prefix == 'mean':
return np.average(data, axis=-1, weights=self._fold_weight)
elif prefix == 'std':
weight = self._fold_weight
if weight is not None:
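                # Weighted standard deviation over the fold axis:
                #   sqrt(sum_i w_i * (x_i - mean_w)**2 / sum_i w_i)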
avg = np.average(data, axis=-1, weights=weight)
avg.shape = data.shape[:-1]
squares = (data.T - avg) ** 2
return np.sqrt(np.dot(weight, squares) / np.sum(weight))
return np.std(data, axis=-1)
raise AttributeError('%r has no attribute %r'
% (self.__class__.__name__, attr))
def __getitem__(self, index):
index = np.asarray(index)
if index.dtype == np.bool:
index = np.flatnonzero(index)
# TODO: validate
return self.__class__(
{k: v[index] for k, v in iteritems(self._param_arrays)},
{k: v[index] for k, v in iteritems(self._data_arrays)},
self._fold_weight, self._score_field, self._greater_is_better
)
def __len__(self):
shape = np.shape(self._data_arrays[self._score_field])
if len(shape) == 1:
raise TypeError('Singleton results have no length')
return shape[0]
@property
def is_singleton(self):
"""True if result is for a single candidate"""
shape = np.shape(self._data_arrays[self._score_field])
return len(shape) == 1
def __iter__(self):
        for i in range(len(self)):
yield self[i]
def for_field(self, field, greater_is_better):
"""Create a new SearchResult, using a given score field."""
# TODO: validate
return self.__class__(self._param_arrays, self._data_arrays,
self._fold_weight, field, greater_is_better)
@property
def score(self):
"""Mean score for each candidate"""
return getattr(self, 'mean_' + self._score_field)
def best(self, k=None):
"""Return a ``SearchResult`` with only the best ``k`` candidates.
        Results will be ordered in decreasing performance order.
For k = None, the best result will be returned, with means given as
single values rather than arrays.
"""
order = np.argsort(self.score)
if self._greater_is_better:
order = order[::-1]
if k is not None:
return self[order[:k]]
return self[order[0]]
    def best_in_margin(self, margin=0.001):
        """Return candidates whose score is within ``margin`` of the best."""
scores = self.score
if self._greater_is_better:
return self[scores >= scores.max() - margin]
else:
return self[scores <= scores.min() + margin]
    def zipped(self, *attrs):
        """Iterate over tuples of the given attributes, one per candidate."""
return zip(*[getattr(self, attr) for attr in attrs])
def group(self, fields=None, negate=False):
"""Index candidates by distinct settings of `fields`.
Requires all parameter values for grouping fields to be hashable and
comparable.
"""
items = [(k, v) for k, v in iteritems(self._param_arrays)
if (k in fields) ^ negate]
fields, values = zip(*items)
values = list(zip(*values))
values_arr = np.zeros(len(values), dtype=object)
values_arr[:] = values
distinct, inverse = np.unique(values_arr, return_inverse=True)
return inverse, [dict(zip(fields, values)) for values in distinct]
def group_best(self, fields=None, negate=False):
"""Select the best scoring candidate for each setting of ``fields``.
"""
if self._greater_is_better:
scores = self.score
else:
scores = -self.score
# Sort with major key groups, minor key score:
groups, group_values = self.group(fields, negate)
order = np.lexsort((scores, groups))
groups = groups[order]
# Index marks change from one group to next, i.e. within-group max
index = np.empty(len(groups), 'bool')
index[-1] = True
index[:-1] = groups[1:] != groups[:-1]
return self[order[index]]
@property
def parameters(self):
masked = np.ma.masked
names, values = zip(*list(iteritems(self._param_arrays)))
if self.is_singleton:
return {name: val for name, val in zip(names, values)
if val is not masked}
out = []
for candidate in zip(*values):
out.append({name: val for name, val in zip(names, candidate)
if val is not masked})
return out
def __repr__(self, show_top=3):
try:
n = len(self)
except TypeError:
return '<%0.3f for %r>' % (self.score, self.parameters)
if show_top < n:
suff = ', ...'
else:
suff = ''
return ('<%d candidates. Best results:\n %s%s>'
% (n, ',\n '.join(repr(sr) for sr in self.best(show_top)),
suff))
def __array__(self):
arrays = [('param_' + k, v) for k, v in iteritems(self._param_arrays)]
for field in iterkeys(self._data_arrays):
try:
arrays.append(('mean_' + field,
getattr(self, 'mean_' + field)))
arrays.append(('std_' + field,
getattr(self, 'std_' + field)))
except TypeError:
continue
fields, arrays = zip(*arrays)
return mrecords.fromarrays(arrays, names=fields)
def _params_to_arrays(parameter_dicts):
fields = {}
for params in parameter_dicts:
for name, value in iteritems(params):
fields[name] = value # take an example for masking
field_names = sorted(iterkeys(fields))
data = []
mask = []
for params in parameter_dicts:
row = [(params[name], False) if name in params
else (fields[name], True)
for name in field_names]
rdata, rmask = zip(*row)
data.append(rdata)
mask.append(rmask)
recs = mrecords.fromrecords(data, mask=mask, names=field_names)
return {field: recs[field] for field in field_names}
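# A hedged sketch of ``_params_to_arrays`` on an uneven parameter list; the
# parameter names below are illustrative assumptions:
#
#     arrs = _params_to_arrays([{'C': 1.0}, {'C': 10.0, 'gamma': 0.1}])
#     # arrs['C']     -> masked record field [1.0, 10.0], nothing masked
#     # arrs['gamma'] -> [--, 0.1]; the first entry is masked because the
#     #                  first candidate did not set 'gamma'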
_CVScoreTuple = namedtuple('_CVScoreTuple',
('parameters', 'mean_validation_score',
'cv_validation_scores'))
class BaseSearchCV(six.with_metaclass(ABCMeta, BaseEstimator,
MetaEstimatorMixin)):
"""Base class for hyper parameter search with cross-validation.
"""
@abstractmethod
def __init__(self, estimator, scoring=None, loss_func=None,
score_func=None, fit_params=None, n_jobs=1, iid=True,
refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs'):
self.scoring = scoring
self.estimator = estimator
self.loss_func = loss_func
self.score_func = score_func
self.n_jobs = n_jobs
self.fit_params = fit_params if fit_params is not None else {}
self.iid = iid
self.refit = refit
self.cv = cv
self.verbose = verbose
self.pre_dispatch = pre_dispatch
self._check_estimator()
def score(self, X, y=None):
"""Returns the score on the given test data and labels, if the search
estimator has been refit. The ``score`` function of the best estimator
is used, or the ``scoring`` parameter where unavailable.
Parameters
----------
X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples], optional
Labels for X.
Returns
-------
score : float
"""
if hasattr(self.best_estimator_, 'score'):
return self.best_estimator_.score(X, y)
if self.scorer_ is None:
raise ValueError("No score function explicitly defined, "
"and the estimator doesn't provide one %s"
% self.best_estimator_)
return self.scorer_(self.best_estimator_, X, y)
@property
def predict(self):
return self.best_estimator_.predict
@property
def predict_proba(self):
return self.best_estimator_.predict_proba
@property
def decision_function(self):
return self.best_estimator_.decision_function
@property
def transform(self):
return self.best_estimator_.transform
def _check_estimator(self):
"""Check that estimator can be fitted and score can be computed."""
if (not hasattr(self.estimator, 'fit') or
not (hasattr(self.estimator, 'predict')
or hasattr(self.estimator, 'score'))):
raise TypeError("estimator should a be an estimator implementing"
" 'fit' and 'predict' or 'score' methods,"
" %s (type %s) was passed" %
(self.estimator, type(self.estimator)))
if (self.scoring is None and self.loss_func is None and self.score_func
is None):
if not hasattr(self.estimator, 'score'):
raise TypeError(
"If no scoring is specified, the estimator passed "
"should have a 'score' method. The estimator %s "
"does not." % self.estimator)
def _fit(self, X, y, parameter_iterator, **params):
"""Actual fitting, performing the search over parameters."""
if params:
warnings.warn("Passing additional parameters to GridSearchCV "
"is ignored! The option will be removed in 0.15.")
estimator = self.estimator
cv = self.cv
n_samples = _num_samples(X)
X, y = check_arrays(X, y, allow_lists=True, sparse_format='csr')
if self.loss_func is not None:
warnings.warn("Passing a loss function is "
"deprecated and will be removed in 0.15. "
"Either use strings or score objects."
"The relevant new parameter is called ''scoring''. ")
scorer = Scorer(self.loss_func, greater_is_better=False)
elif self.score_func is not None:
warnings.warn("Passing function as ``score_func`` is "
"deprecated and will be removed in 0.15. "
"Either use strings or score objects."
"The relevant new parameter is called ''scoring''.")
scorer = Scorer(self.score_func)
elif isinstance(self.scoring, six.string_types):
scorer = SCORERS[self.scoring]
else:
scorer = self.scoring
self.scorer_ = scorer
if y is not None:
if len(y) != n_samples:
raise ValueError('Target variable (y) has a different number '
'of samples (%i) than data (X: %i samples)'
% (len(y), n_samples))
y = np.asarray(y)
cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
base_clf = clone(self.estimator)
pre_dispatch = self.pre_dispatch
out = Parallel(
n_jobs=self.n_jobs, verbose=self.verbose,
pre_dispatch=pre_dispatch)(
delayed(fit_grid_point)(
X, y, base_clf, clf_params, train, test, scorer,
self.verbose, **self.fit_params) for clf_params in
parameter_iterator for train, test in cv)
        # out is a list of triples: (score, parameters, n_test_samples)
n_param_points = len(list(parameter_iterator))
n_fits = len(out)
n_folds = n_fits // n_param_points
scores = list()
cv_scores = list()
for grid_start in range(0, n_fits, n_folds):
n_test_samples = 0
score = 0
these_points = list()
for this_score, clf_params, this_n_test_samples in \
out[grid_start:grid_start + n_folds]:
these_points.append(this_score)
if self.iid:
this_score *= this_n_test_samples
n_test_samples += this_n_test_samples
score += this_score
if self.iid:
score /= float(n_test_samples)
else:
score /= float(n_folds)
scores.append((score, clf_params))
cv_scores.append(these_points)
cv_scores = np.asarray(cv_scores)
        self.results_ = SearchResult(
            _params_to_arrays(list(parameter_iterator)),
            {'test_score': cv_scores},
            # Per-fold test-set sizes, taken from the first parameter
            # point's results; avoids indexing y, which may be None.
            [n_test for _, _, n_test in out[:n_folds]] if self.iid else None,
            'test_score', getattr(scorer, 'greater_is_better', True))
# Note: we do not use max(out) to make ties deterministic even if
# comparison on estimator instances is not deterministic
if scorer is not None:
greater_is_better = scorer.greater_is_better
else:
greater_is_better = True
if greater_is_better:
best_score = -np.inf
else:
best_score = np.inf
for score, params in scores:
if ((score > best_score and greater_is_better)
or (score < best_score and not greater_is_better)):
best_score = score
best_params = params
self.best_params_ = best_params
self.best_score_ = best_score
if self.refit:
# fit the best estimator using the entire dataset
# clone first to work around broken estimators
best_estimator = clone(base_clf).set_params(**best_params)
if y is not None:
best_estimator.fit(X, y, **self.fit_params)
else:
best_estimator.fit(X, **self.fit_params)
self.best_estimator_ = best_estimator
# Store the computed scores
self.cv_scores_ = [
_CVScoreTuple(clf_params, score, all_scores)
for clf_params, (score, _), all_scores
in zip(parameter_iterator, scores, cv_scores)]
return self
class GridSearchCV(BaseSearchCV):
"""Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a "fit" method and a "predict" method like
any classifier except that the parameters of the classifier
    used to predict are optimized by cross-validation.
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
        An object of that type is instantiated for each grid point.
param_grid : dict or list of dictionaries
Dictionary with parameters names (string) as keys and lists of
parameter settings to try as values, or a list of such
dictionaries, in which case the grids spanned by each dictionary
in the list are explored. This enables searching over any sequence
of parameter settings.
scoring : string or callable, optional
        Either a string ("zero_one", "f1", "roc_auc", ... for
        classification; "mse", "r2", ... for regression) or a callable.
See 'Scoring objects' in the model evaluation section of the user guide
for details.
fit_params : dict, optional
Parameters to pass to the fit method.
n_jobs : int, optional
Number of jobs to run in parallel (default 1).
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel
execution. Reducing this number can be useful to avoid an
explosion of memory consumption when more jobs get dispatched
than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately
created and spawned. Use this for lightweight and
fast-running jobs, to avoid delays due to on-demand
spawning of the jobs
- An int, giving the exact number of total jobs that are
spawned
- A string, giving an expression as a function of n_jobs,
as in '2*n_jobs'
iid : boolean, optional
If True, the data is assumed to be identically distributed across
the folds, and the loss minimized is the total loss per sample,
and not the mean loss across the folds.
cv : integer or cross-validation generator, optional
If an integer is passed, it is the number of folds (default 3).
Specific cross-validation objects can be passed, see
sklearn.cross_validation module for the list of possible objects
refit : boolean
Refit the best estimator with the entire dataset.
If "False", it is impossible to make predictions using
this GridSearchCV instance after fitting.
verbose : integer
Controls the verbosity: the higher, the more messages.
Examples
--------
>>> from sklearn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svr = svm.SVC()
>>> clf = grid_search.GridSearchCV(svr, parameters)
>>> clf.fit(iris.data, iris.target)
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
GridSearchCV(cv=None,
estimator=SVC(C=1.0, cache_size=..., coef0=..., degree=...,
gamma=..., kernel='rbf', max_iter=-1, probability=False,
shrinking=True, tol=...),
fit_params={}, iid=True, loss_func=None, n_jobs=1,
param_grid=...,
...)
Attributes
----------
`cv_scores_` : list of named tuples
Contains scores for all parameter combinations in param_grid.
Each entry corresponds to one parameter setting.
Each named tuple has the attributes:
* ``parameters``, a dict of parameter settings
* ``mean_validation_score``, the mean score over the
cross-validation folds
* ``cv_validation_scores``, the list of scores for each fold
`best_estimator_` : estimator
Estimator that was chosen by the search, i.e. estimator
which gave highest score (or smallest loss if specified)
on the left out data.
`best_score_` : float
Score of best_estimator on the left out data.
`best_params_` : dict
Parameter setting that gave the best results on the hold out data.
Notes
    -----
The parameters selected are those that maximize the score of the left out
data, unless an explicit score is passed in which case it is used instead.
If `n_jobs` was set to a value higher than one, the data is copied for each
point in the grid (and not `n_jobs` times). This is done for efficiency
reasons if individual jobs take very little time, but may raise errors if
the dataset is large and not enough memory is available. A workaround in
this case is to set `pre_dispatch`. Then, the memory is copied only
`pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 *
n_jobs`.
See Also
    --------
:class:`ParameterGrid`:
        generates all the combinations of a hyperparameter grid.
:func:`sklearn.cross_validation.train_test_split`:
utility function to split the data into a development set usable
for fitting a GridSearchCV instance and an evaluation set for
its final evaluation.
"""
def __init__(self, estimator, param_grid, scoring=None, loss_func=None,
score_func=None, fit_params=None, n_jobs=1, iid=True,
refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs'):
super(GridSearchCV, self).__init__(
estimator, scoring, loss_func, score_func, fit_params, n_jobs, iid,
refit, cv, verbose, pre_dispatch)
self.param_grid = param_grid
_check_param_grid(param_grid)
@property
def grid_scores_(self):
warnings.warn("grid_scores_ is deprecated and will be removed in 0.15."
" Use cv_scores_ instead.", DeprecationWarning)
return self.cv_scores_
def fit(self, X, y=None, **params):
"""Run fit with all sets of parameters.
Parameters
----------
        X : array-like, shape = [n_samples, n_features]
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples], optional
Target vector relative to X for classification;
None for unsupervised learning.
"""
return self._fit(X, y, ParameterGrid(self.param_grid), **params)
class RandomizedSearchCV(BaseSearchCV):
"""Randomized search on hyper parameters.
RandomizedSearchCV implements a "fit" method and a "predict" method like
any classifier except that the parameters of the classifier
    used to predict are optimized by cross-validation.
    In contrast to GridSearchCV, not all parameter values are tried out, but
rather a fixed number of parameter settings is sampled from the specified
distributions. The number of parameter settings that are tried is
given by n_iter.
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
        An object of that type is instantiated for each parameter setting.
param_distributions : dict
Dictionary with parameters names (string) as keys and distributions
or lists of parameters to try. Distributions must provide a ``rvs``
method for sampling (such as those from scipy.stats.distributions).
If a list is given, it is sampled uniformly.
n_iter : int, default=10
        Number of parameter settings that are sampled. n_iter trades
        off runtime vs. quality of the solution.
scoring : string or callable, optional
        Either a string ("zero_one", "f1", "roc_auc", ... for
        classification; "mse", "r2", ... for regression) or a callable.
See 'Scoring objects' in the model evaluation section of the user guide
for details.
fit_params : dict, optional
Parameters to pass to the fit method.
n_jobs : int, optional
Number of jobs to run in parallel (default 1).
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel
execution. Reducing this number can be useful to avoid an
explosion of memory consumption when more jobs get dispatched
than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately
created and spawned. Use this for lightweight and
fast-running jobs, to avoid delays due to on-demand
spawning of the jobs
- An int, giving the exact number of total jobs that are
spawned
- A string, giving an expression as a function of n_jobs,
as in '2*n_jobs'
iid : boolean, optional
If True, the data is assumed to be identically distributed across
the folds, and the loss minimized is the total loss per sample,
and not the mean loss across the folds.
cv : integer or cross-validation generator, optional
If an integer is passed, it is the number of folds (default 3).
Specific cross-validation objects can be passed, see