[MRG + 2] ENH Allow cross_val_score, GridSearchCV et al. to evaluate on multiple metrics #7388


Merged
150 commits, merged on Jul 7, 2017

Commits
d6d1000
ENH cross_val_score now supports multiple metrics
raghavrv Sep 28, 2016
2e52d9b
DOCFIX permutation_test_score
raghavrv Sep 29, 2016
4e7845a
ENH validate multiple metric scorers
raghavrv Sep 29, 2016
47e282f
ENH Move validation of multimetric scoring param out
raghavrv Sep 29, 2016
823a079
ENH GridSearchCV and RandomizedSearchCV now support multiple metrics
raghavrv Sep 30, 2016
55c0743
EXA Add an example demonstrating the multiple metric in GridSearchCV
raghavrv Sep 30, 2016
8e4dd35
ENH Let check_multimetric_scoring tell if its multimetric or not
raghavrv Sep 30, 2016
a4ce716
FIX For single metric name of scorer should remain 'score'
raghavrv Oct 1, 2016
dfcd15e
ENH validation_curve and learning_curve now support multiple metrics
raghavrv Oct 1, 2016
729c262
MNT move _aggregate_score_dicts helper into _validation.py
raghavrv Oct 1, 2016
9b71cfe
TST More testing/ Fixing scores to the correct values
raghavrv Oct 1, 2016
41c1b02
EXA Add cross_val_score to multimetric example
raghavrv Oct 1, 2016
92e031b
Rename to multiple_metric_evaluation.py
raghavrv Oct 1, 2016
10043a7
MNT Remove scaffolding
raghavrv Oct 1, 2016
752adb6
FIX doctest imports
raghavrv Oct 2, 2016
929158f
FIX wrap the scorer and unwrap the score when using _score() in rfe
raghavrv Oct 2, 2016
ac54beb
TST Cleanup the tests. Test for is_multimetric too
raghavrv Oct 2, 2016
137c788
TST Make sure it registers as single metric when scoring is of that type
raghavrv Oct 2, 2016
dcae56d
PEP8
raghavrv Oct 2, 2016
9b0b5ef
Don't use dict comprehension to make it work in python2.6
raghavrv Oct 2, 2016
20fe3ad
ENH/FIX/TST grid_scores_ should not be available for multimetric eval…
raghavrv Oct 2, 2016
5960b40
FIX+TST delegated methods NA when multimetric is enabled...
raghavrv Oct 3, 2016
dc3ee2c
ENH add option to disable delegation on multimetric scoring
raghavrv Oct 3, 2016
88004b9
Remove old function from __all__
raghavrv Oct 3, 2016
590b49b
flake8
raghavrv Oct 3, 2016
ef0fe7d
FIX revert disable_on_multimetric
raghavrv Nov 18, 2016
5951e94
stash
raghavrv Dec 6, 2016
8815b80
Fix incorrect rebase
raghavrv Dec 8, 2016
117de6c
[ci skip]
raghavrv Dec 9, 2016
728d004
Make sure refit works as expected and remove irrelevant tests
raghavrv Dec 9, 2016
3bac166
Allow passing standard scorers by name in multimetric scorers
raghavrv Dec 9, 2016
8d0c1e5
Fix example
raghavrv Dec 9, 2016
91053cd
flake8
raghavrv Dec 9, 2016
5611a08
Address reviews
raghavrv Dec 9, 2016
e797c08
Fix indentation
raghavrv Dec 26, 2016
9501be1
Ensure {'acc': 'accuracy'} and ['precision'] are valid inputs
raghavrv Dec 26, 2016
1e84296
Test that for single metric, 'score' is a key
raghavrv Dec 26, 2016
2551155
Fix incorrect rebase
raghavrv Dec 27, 2016
fd36391
Typos
raghavrv Dec 26, 2016
24ef398
Compare multimetric grid search with multiple single metric searches
raghavrv Dec 26, 2016
c16ac88
Test X, y list and pandas input; Test multimetric for unsupervised gr…
raghavrv Dec 27, 2016
c7094c4
Fix tests; Unsupervised multimetric gs will not pass until #8117 is m…
raghavrv Dec 27, 2016
51cab0d
Make a plot of Precision vs ROC AUC for RandomForest varying the n_es…
raghavrv Jan 5, 2017
21a6304
Add example to grid_search.rst
raghavrv Jan 5, 2017
7143ad5
Use the classic tuning of C param in SVM instead of estimators in RF
raghavrv Jan 16, 2017
44659a1
FIX Remove scoring arg in deafult scorer test
raghavrv Jan 16, 2017
a6d0060
flake8
raghavrv Jan 16, 2017
e64041c
Search for min_samples_split in DTC; Also show f-score
raghavrv Jan 17, 2017
21b6fcb
REVIEW Make check_multimetric_scoring private
raghavrv Jan 17, 2017
9e57644
FIX Add more samples to see if 3% mismatch on 32 bit systems gets fixed
raghavrv Jan 17, 2017
fbf0527
REVIEW Plot best score; Shorten legends
raghavrv Jan 18, 2017
2b666bd
REVIEW/COSMIT multimetric --> multi-metric
raghavrv Jan 18, 2017
c74d5f3
REVIEW Mark the best scores of P/R scores too
raghavrv Jan 18, 2017
c521b63
Revert "FIX Add more samples to see if 3% mismatch on 32 bit systems …
raghavrv Jan 18, 2017
a6727aa
ENH Use looping for iid testing
raghavrv Jan 18, 2017
feed649
FIX use param grid as scipy's stats dist in 0.12 do not accept seed
raghavrv Jan 18, 2017
528d36c
ENH more looping less code; Use small non-noisy dataset
raghavrv Jan 18, 2017
cf3faa4
FIX Use named arg after expanded args
raghavrv Jan 18, 2017
27f025f
TST More testing of the refit parameter
raghavrv Jan 19, 2017
095f3cf
COSMIT multimetric --> multi-metric
raghavrv Jan 24, 2017
074cbcf
REV Correct example doc
raghavrv Jan 24, 2017
e853017
COSMIT
raghavrv Jan 24, 2017
faa1fd0
REVIEW Make tests stronger; Fix bugs in _check_multimetric_scorer
raghavrv Jan 26, 2017
946f41c
REVIEW refit param: Raise for empty strings
raghavrv Jan 26, 2017
5696627
TST Invalid refit params
raghavrv Jan 26, 2017
4819968
REVIEW Use <scorer_name> alone; recall --> Recall
raghavrv Jan 26, 2017
7627e34
REV specify when we expect scorers to not be None
raghavrv Jan 26, 2017
a52f9cc
FLAKE8
raghavrv Jan 26, 2017
afb2a34
REVERT multimetrics in learning_curve and validation_curve
raghavrv Jan 27, 2017
36dfd1a
REVIEW Simpler coding style
raghavrv Jan 27, 2017
afe2bf8
COSMIT
raghavrv Jan 27, 2017
ef88554
REV Compress example a bit. Move comment to top
raghavrv Jan 27, 2017
5986e18
COSMIT
raghavrv Jan 27, 2017
aa6fa8c
FIX fit_grid_point's previous API must be preserved
raghavrv Jan 27, 2017
5e424e7
Flake8
raghavrv Jan 28, 2017
c431ce8
TST Use loop; Compare with single-metric
raghavrv Jan 30, 2017
6f0396e
REVIEW Use dict-comprehension instead of helper
raghavrv Jan 30, 2017
e9c71bf
REVIEW Remove redundant test
raghavrv Jan 30, 2017
ee23970
Fix tests incorrect braces
raghavrv Jan 30, 2017
e457bdb
COSMIT
raghavrv Jan 30, 2017
b4e0213
REVIEW Use regexp
raghavrv Jan 30, 2017
c9a7da4
REV Simplify aggregation of score dicts
raghavrv Jan 30, 2017
a7d865f
FIX precision and accuracy test
raghavrv Jan 30, 2017
277f0a0
FIX doctest and flake8
raghavrv Jan 30, 2017
9eeacfc
TST the best_* attributes multimetric with single metric
raghavrv Jan 30, 2017
428121c
Address @jnothman's review
raghavrv Feb 13, 2017
dd4ac3a
Address more comments \o/
raghavrv Feb 13, 2017
6bc8726
DOCFIXES
raghavrv Feb 13, 2017
0134338
Fix use the validated fit_param from fit's arguments
raghavrv Feb 13, 2017
5db9cea
Revert alpha to a lower value as before
raghavrv Feb 19, 2017
1d44f9e
Using def instead of lambda
raghavrv Feb 24, 2017
2372fbb
Address @jnothman's review batch 1: Fix tests / Doc fixes
raghavrv Mar 7, 2017
8c02a66
Remove superfluous tests
raghavrv Mar 7, 2017
9eae2d6
Remove more superfluous testing
raghavrv Mar 7, 2017
da932ec
TST/FIX loop over refit and check found n_clusters
raghavrv Mar 7, 2017
5c695bb
Cosmetic touches
raghavrv Mar 7, 2017
e40d6e5
Use zip instead of manually listing the keys
raghavrv Mar 16, 2017
2c89a25
Fix inverse_transform
raghavrv May 17, 2017
b5c8b46
MRG update master and fix merge conflicts
raghavrv Jun 6, 2017
ff88ace
FIX bug in fit_grid_point; Allow only single score
raghavrv Jun 6, 2017
6f40803
ENH Use only ROC-AUC and F1-score
raghavrv Jun 6, 2017
674882b
Fix typos and flake8; Address Andy's reviews
raghavrv Jun 6, 2017
09fd482
ENH Better error messages for incorrect multimetric scoring values +...
raghavrv Jun 6, 2017
fd9e82c
Dict keys must be of string type only
raghavrv Jun 6, 2017
9896333
1. Better error message for invalid scoring 2...
raghavrv Jun 6, 2017
d136889
Fix test failures and shuffle tests
raghavrv Jun 6, 2017
13f6a44
Avoid wrapping scorer as dict in learning_curve
raghavrv Jun 6, 2017
d5ab0f1
Remove doc example as asked for
raghavrv Jun 6, 2017
bf408a8
Some leftover ones
raghavrv Jun 6, 2017
bc8c815
Don't wrap scorer in validation_curve either
raghavrv Jun 6, 2017
d63c770
Add a doc example and skip it as dict order fails doctest
raghavrv Jun 6, 2017
309c33b
Import zip from six for python2.7 compat
raghavrv Jun 7, 2017
2631ffe
Make cross_val_score return a cv_results-like dict
raghavrv Jun 7, 2017
afe4837
Add relevant sections to userguide
raghavrv Jun 7, 2017
0da53b2
Flake8 fixes
raghavrv Jun 8, 2017
c63d6e3
Add whatsnew and fix broken links
raghavrv Jun 8, 2017
2a59b12
Merge branch 'master' into multimetric_cross_val_score
raghavrv Jun 8, 2017
7b204d8
Use AUC and accuracy instead of f1
raghavrv Jun 9, 2017
5dbe2a1
Fix failing doctests cross_validation.rst
raghavrv Jun 10, 2017
b6de448
DOC add the wrapper example for metrics that return multiple return v…
raghavrv Jun 10, 2017
2364e39
Address andy's comments
raghavrv Jun 10, 2017
0984907
Merge branch 'master' into multimetric_cross_val_score
raghavrv Jun 10, 2017
58dac26
Be less weird
raghavrv Jun 10, 2017
0130163
Address more of andy's comments
raghavrv Jun 10, 2017
1d077a5
Make a separate cross_validate function to return dict and a cross_va…
raghavrv Jun 11, 2017
f72ab5c
Update the docs to reflect the new cross_validate function
raghavrv Jun 11, 2017
bcfe238
Add cross_validate to toc-tree
raghavrv Jun 11, 2017
ec290fb
Add more tests on type of cross_validate return and time limits
raghavrv Jun 11, 2017
ea7cae3
FIX failing doctests
raghavrv Jun 12, 2017
1c70d51
FIX ensure keys are not plural
raghavrv Jun 12, 2017
a1d386f
DOC fix
raghavrv Jun 12, 2017
bcb0051
Address some pending comments
raghavrv Jun 12, 2017
39c43c3
Remove the comment as it is irrelevant now
raghavrv Jun 12, 2017
751e10a
Remove excess blank line
raghavrv Jun 12, 2017
192d146
Fix flake8 inconsistencies
raghavrv Jun 16, 2017
e01a1c4
Allow fit_times to be 0 to conform with windows precision
raghavrv Jun 16, 2017
809db2e
Merge branch 'master' into multimetric_cross_val_score
raghavrv Jun 19, 2017
f1b4bf1
DOC specify how refit param is to be set in multiple metric case
raghavrv Jun 19, 2017
67fc1a4
TST ensure cross_validate works for string single metrics + address @…
raghavrv Jun 19, 2017
3822147
Doc fixes
raghavrv Jun 19, 2017
926f81f
Remove the shape and transform parameter of _aggregate_score_dicts
raghavrv Jun 19, 2017
b2e7833
Address Joel's doc comments
raghavrv Jun 19, 2017
d48443c
Fix broken doctest
raghavrv Jun 19, 2017
8a20053
Fix the spurious file
raghavrv Jun 20, 2017
81713b6
Address Andy's comments
raghavrv Jun 26, 2017
e06495b
Merge branch 'master' into multimetric_cross_val_score
raghavrv Jun 29, 2017
9402cbb
MNT Remove erroneous entry
raghavrv Jun 30, 2017
d03a515
Address Andy's comments
raghavrv Jun 30, 2017
8a1ebf1
FIX broken links
raghavrv Jun 30, 2017
2d51ac6
Update whats_new.rst
amueller Jul 6, 2017
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -223,6 +223,7 @@ Model validation
:toctree: generated/
:template: function.rst

model_selection.cross_validate
model_selection.cross_val_score
model_selection.cross_val_predict
model_selection.permutation_test_score
61 changes: 60 additions & 1 deletion doc/modules/cross_validation.rst
@@ -172,6 +172,65 @@ validation iterator instead, for instance::

See :ref:`combining_estimators`.


.. _multimetric_cross_validation:

The cross_validate function and multiple metric evaluation
----------------------------------------------------------

The ``cross_validate`` function differs from ``cross_val_score`` in two ways:

- It allows specifying multiple metrics for evaluation.

- It returns a dict containing training scores, fit times and score times in
  addition to the test score (a short sketch follows this list).
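
For a quick illustration of the difference in return types, here is a minimal
sketch (an editorial addition, not part of the diff; it assumes an estimator
``clf`` and the ``iris`` data as used in the examples below)::

    from sklearn.model_selection import cross_val_score, cross_validate

    # cross_val_score returns a 1-D array with one test score per CV split
    test_scores = cross_val_score(clf, iris.data, iris.target, cv=5)

    # cross_validate returns a dict of 1-D arrays, one entry per CV split,
    # keyed by 'test_score', 'train_score', 'fit_time' and 'score_time'
    # (with return_train_score left at its default of True)
    cv_results = cross_validate(clf, iris.data, iris.target, cv=5)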

For single metric evaluation, where the ``scoring`` parameter is a string,
a callable or ``None``, the keys will be
``['test_score', 'fit_time', 'score_time']``.

For multiple metric evaluation, the return value is a dict with the
following keys:
``['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']``

``return_train_score`` is set to ``True`` by default. It adds train score keys
for all the scorers. If train scores are not needed, this should be set to
``False`` explicitly.

The multiple metrics can be specified either as a list, tuple or set of
predefined scorer names::

>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
... cv=5, return_train_score=False)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
>>> scores['test_recall_macro'] # doctest: +ELLIPSIS
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])

Or as a dict mapping scorer name to a predefined or custom scoring function::

>>> from sklearn.metrics.scorer import make_scorer
>>> scoring = {'prec_macro': 'precision_macro',
...            'rec_macro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
...                         cv=5, return_train_score=True)
>>> sorted(scores.keys()) # doctest: +NORMALIZE_WHITESPACE
['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro',
 'train_prec_macro', 'train_rec_macro']
>>> scores['train_rec_macro'] # doctest: +ELLIPSIS
array([ 0.97..., 0.97..., 0.99..., 0.98..., 0.98...])

Here is an example of ``cross_validate`` using a single metric::

>>> scores = cross_validate(clf, iris.data, iris.target,
... scoring='precision_macro')
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_score', 'train_score']
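
Because every value in the returned dict is an array with one entry per CV
split, the results are easy to tabulate. A minimal sketch, assuming ``pandas``
is installed (pandas is not required by scikit-learn itself)::

    import pandas as pd

    df = pd.DataFrame(scores)  # one row per CV split, one column per key
    print(df.mean())           # mean fit time, score time and score(s)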


Obtaining predictions by cross-validation
-----------------------------------------

@@ -186,7 +245,7 @@ These predictions can then be used to evaluate the classifier::
>>> from sklearn.model_selection import cross_val_predict
>>> predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
>>> metrics.accuracy_score(iris.target, predicted) # doctest: +ELLIPSIS
0.966...

Member:
Am I missing something? What's changed?

Member Author:
The clf now is different as we modified it above.

0.973...

Note that the result of this computation may be slightly different from those
obtained using :func:`cross_val_score` as the elements are grouped in different
25 changes: 25 additions & 0 deletions doc/modules/grid_search.rst
@@ -84,6 +84,10 @@ evaluated and the best combination is retained.
dataset. This is the best practice for evaluating the performance of a
model with grid search.

- See :ref:`sphx_glr_auto_examples_model_selection_plot_multi_metric_evaluation`
for an example of :class:`GridSearchCV` being used to evaluate multiple
metrics simultaneously.

.. _randomized_parameter_search:

Randomized Parameter Optimization
@@ -161,6 +165,27 @@ scoring function can be specified via the ``scoring`` parameter to
specialized cross-validation tools described below.
See :ref:`scoring_parameter` for more details.

.. _multimetric_grid_search:

Specifying multiple metrics for evaluation
------------------------------------------

``GridSearchCV`` and ``RandomizedSearchCV`` allow specifying multiple metrics
for the ``scoring`` parameter.

Multimetric scoring can be specified either as a list of strings of predefined
score names or as a dict mapping scorer names to scorer functions and/or
predefined scorer name(s). See :ref:`multimetric_scoring` for more details.

Member:
I think it's appropriate to mention the refit behaviour, as *SearchCV must optimise over a single score.

When specifying multiple metrics, the ``refit`` parameter must be set to the
metric (string) for which the ``best_params_`` will be found and used to build
the ``best_estimator_`` on the whole dataset. If the search should not be
refit, set ``refit=False``. Leaving refit to the default value ``None`` will
result in an error when using multiple metrics.

See :ref:`sphx_glr_auto_examples_model_selection_plot_multi_metric_evaluation`
for an example usage.
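
As a minimal sketch of the behaviour described above (an editorial
illustration, not part of the diff; the parameter grid and scorers are
arbitrary)::

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    iris = load_iris()
    gs = GridSearchCV(SVC(kernel='linear'),
                      param_grid={'C': [1, 10, 100]},
                      scoring={'prec_macro': 'precision_macro',
                               'rec_macro': 'recall_macro'},
                      refit='prec_macro',  # the best_* attributes follow this scorer
                      cv=5)
    gs.fit(iris.data, iris.target)
    # cv_results_ has one set of columns per scorer, e.g.
    # 'mean_test_prec_macro' and 'rank_test_rec_macro'
    print(sorted(gs.cv_results_.keys()))
    print(gs.best_params_)  # selected according to 'prec_macro'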

Composite estimators and parameter spaces
-----------------------------------------

45 changes: 45 additions & 0 deletions doc/modules/model_evaluation.rst
@@ -210,6 +210,51 @@ the following two rules:
Again, by convention higher numbers are better, so if your scorer
returns loss, that value should be negated.

.. _multimetric_scoring:

Member:
I don't understand why this is here.

Member Author:
I've referenced it in grid_search.rst


Using multiple metric evaluation
--------------------------------

Scikit-learn also permits evaluation of multiple metrics in ``GridSearchCV``,
``RandomizedSearchCV`` and ``cross_validate``.

There are two ways to specify multiple scoring metrics for the ``scoring``
parameter:

- As an iterable of string metrics::
>>> scoring = ['accuracy', 'precision']

- As a ``dict`` mapping the scorer name to the scoring function::
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.metrics import make_scorer
>>> scoring = {'accuracy': make_scorer(accuracy_score),
... 'prec': 'precision'}

Note that the dict values can either be scorer functions or one of the
predefined metric strings.

Currently only those scorer functions that return a single score can be passed
inside the dict. Scorer functions that return multiple values are not
permitted and will require a wrapper to return a single metric::

>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import confusion_matrix
>>> # A sample toy binary classification dataset
>>> X, y = datasets.make_classification(n_classes=2, random_state=0)
>>> svm = LinearSVC(random_state=0)
>>> tp = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[1, 1]

Member:
Is it a bad idea to recommend lambda when it's not able to be pickled (dill notwithstanding)?

>>> tn = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 0]
>>> fp = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 1]
>>> fn = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[1, 0]
>>> scoring = {'tp' : make_scorer(tp), 'tn' : make_scorer(tn),
... 'fp' : make_scorer(fp), 'fn' : make_scorer(fn)}
>>> cv_results = cross_validate(svm.fit(X, y), X, y, scoring=scoring)
>>> # Getting the test set true negative scores

Member:
Next line says tp, not fp

>>> print(cv_results['test_tn']) # doctest: +NORMALIZE_WHITESPACE
[12 13 15]
>>> # Getting the test set false positive scores
>>> print(cv_results['test_fp']) # doctest: +NORMALIZE_WHITESPACE
[5 4 1]
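
As the review comment above notes, ``lambda`` scorers cannot be pickled. A
minimal sketch of the same wrappers written as module-level functions instead
(an editorial illustration, not part of the diff; for binary labels
``[0, 1]``, ``confusion_matrix`` returns counts laid out as
``[[tn, fp], [fn, tp]]``)::

    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import make_scorer

    def tn(y_true, y_pred):
        return confusion_matrix(y_true, y_pred)[0, 0]

    def fp(y_true, y_pred):
        return confusion_matrix(y_true, y_pred)[0, 1]

    def fn(y_true, y_pred):
        return confusion_matrix(y_true, y_pred)[1, 0]

    def tp(y_true, y_pred):
        return confusion_matrix(y_true, y_pred)[1, 1]

    # Unlike lambdas, module-level functions can be pickled.
    scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
               'fp': make_scorer(fp), 'fn': make_scorer(fn)}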

.. _classification_metrics:

13 changes: 13 additions & 0 deletions doc/whats_new.rst
@@ -31,6 +31,19 @@ Changelog
New features
............

- :class:`model_selection.GridSearchCV` and
:class:`model_selection.RandomizedSearchCV` now support simultaneous
evaluation of multiple metrics. Refer to the
:ref:`multimetric_grid_search` section of the user guide for more
information. :issue:`7388` by `Raghav RV`_

- Added :func:`model_selection.cross_validate`, which allows evaluation
of multiple metrics. This function returns a dict with more useful
information from cross-validation, such as the train scores, fit times and
score times.
Refer to the :ref:`multimetric_cross_validation` section of the user guide
for more information. :issue:`7388` by `Raghav RV`_

- Added :class:`multioutput.ClassifierChain` for multi-label
classification. By `Adam Kleczewski <adamklec>`_.

94 changes: 94 additions & 0 deletions examples/model_selection/plot_multi_metric_evaluation.py
@@ -0,0 +1,94 @@
"""Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV

Member:
@GaelVaroquaux is probably right in suggesting we can reduce the number of examples, and instead demonstrate features successively when needed. Can we roll this feature into existing grid search examples?

Member:
And we really don't need to illustrate each quirk of the feature, e.g. different ways to specify multiple metrics.

Member Author:
What do you think about keeping this example and removing this one instead?
The GridSearchCV documentation has a class example that illustrates a simple single metric evaluation...

Member:
Why not change the existing example? will break less links.

Member Author:
Like discussed IRL, I feel it's better to have both... (at least for now)


Multiple metric parameter search can be done by setting the ``scoring``
parameter to a list of metric scorer names or a dict mapping the scorer names
to the scorer callables.

The scores of all the scorers are available in the ``cv_results_`` dict at keys
ending in ``'_<scorer_name>'`` (``'mean_test_precision'``,
``'rank_test_precision'``, etc...)

The ``best_estimator_``, ``best_index_``, ``best_score_`` and ``best_params_``
correspond to the scorer (key) that is set to the ``refit`` attribute.

Member:
When using multiple metrics, you need to specify the refit parameter to either specify a metric to use to select the best parameter setting or specify refit="False" to not refit any estimator.

Member Author:
@amueller That is the current case, except refit=False and not the string "False". Is that fine or do you want the behavior changed to refit=False by default for multiple metrics and refit=True by default for single metric?
Also cc: @jnothman @GaelVaroquaux

Member Author:
Ah you've already answered this at #7388 (comment). I read this one before.

Member:
yes, but I just want to be very explicit about the current behavior. This is something that people will definitely run into, so just tell them exactly what they need to do.

Member:
I think @amueller is suggesting you use this kind of instructive wording in the narrative docs. Perhaps just adopt his wording?

"""

# Author: Raghav RV <[email protected]>
# License: BSD

import numpy as np
from matplotlib import pyplot as plt

from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

print(__doc__)

###############################################################################
# Running ``GridSearchCV`` using multiple evaluation metrics
# ----------------------------------------------------------
#

X, y = make_hastie_10_2(n_samples=8000, random_state=42)

# The scorers can either be one of the predefined metric strings or a scorer
# callable, like the one returned by make_scorer
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

# Setting refit='AUC' refits an estimator on the whole dataset with the
# parameter setting that has the best cross-validated AUC score.
# That estimator is made available at ``gs.best_estimator_`` along with
# attributes like ``gs.best_score_``, ``gs.best_params_`` and
# ``gs.best_index_``
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid={'min_samples_split': range(2, 403, 10)},
                  scoring=scoring, cv=5, refit='AUC')
gs.fit(X, y)
results = gs.cv_results_
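
# Editorial sketch (not part of the original example): the per-scorer
# results appear in cv_results_ under keys suffixed with the scorer names
# defined above, and the best_* attributes follow the scorer selected via
# refit='AUC'.
print(sorted(key for key in results if key.endswith('_AUC')))
print(gs.best_params_)   # parameter setting with the best mean test AUC
print(gs.best_score_)    # its mean cross-validated AUC score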

###############################################################################
# Plotting the result
# -------------------

plt.figure(figsize=(13, 13))
plt.title("GridSearchCV evaluating using multiple scorers simultaneously",
          fontsize=16)

plt.xlabel("min_samples_split")
plt.ylabel("Score")
plt.grid()

ax = plt.axes()
ax.set_xlim(0, 402)
ax.set_ylim(0.73, 1)

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_min_samples_split'].data, dtype=float)

for scorer, color in zip(sorted(scoring), ['g', 'k']):
    for sample, style in (('train', '--'), ('test', '-')):
        sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
        sample_score_std = results['std_%s_%s' % (sample, scorer)]
        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                        sample_score_mean + sample_score_std,
                        alpha=0.1 if sample == 'test' else 0, color=color)
        ax.plot(X_axis, sample_score_mean, style, color=color,
                alpha=1 if sample == 'test' else 0.7,
                label="%s (%s)" % (scorer, sample))

    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
    best_score = results['mean_test_%s' % scorer][best_index]

    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

    # Annotate the best score for that scorer
    ax.annotate("%0.2f" % best_score,
                (X_axis[best_index], best_score + 0.005))

plt.legend(loc="best")
plt.grid('off')
plt.show()