
Conversation

raghavrv
Owner

@MechCoder @amueller @vene @jnothman

I'll do this (multiple metric support) in incremental steps. Will merge the trivial PRs as soon as I get a +1.

This is a very trivial PR. Please take a look.

@raghavrv force-pushed the model_selection_enhancements1 branch from 8f7bea9 to a2c3922 on April 14, 2016 10:00
@raghavrv changed the title from "[MRG] Rename grid_scores --> search_scores" to "[MRG] Rename grid_scores --> search_scores to be consistent with RandomizedSearchCV" on Apr 14, 2016
@jnothman

The reason we've not done this before was at least in part because there was disagreement on the form of grid_scores. I for one think a namedtuple is too constraining if we want to optionally include other information in the scores struct.

@vene

vene commented Apr 14, 2016

I agree with @jnothman, the format might have to change to accommodate multiple metrics.

As a side note, I think the current _CVScoreTuple does too much magic in its repr. Users may not realize that all fold scores are available, not just their summary.
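Roughly (a reconstruction from memory, not the exact source), the repr in question summarizes the per-fold scores rather than showing them:

```python
from collections import namedtuple

import numpy as np

# Approximate sketch of the summarizing repr; not the exact source code.
class _CVScoreTuple(namedtuple("_CVScoreTuple",
                               ["parameters", "mean_validation_score",
                                "cv_validation_scores"])):
    __slots__ = ()

    def __repr__(self):
        # Shows only mean/std, so the individual fold scores are easy to miss.
        return "mean: {0:.5f}, std: {1:.5f}, params: {2!r}".format(
            self.mean_validation_score,
            np.std(self.cv_validation_scores),
            self.parameters)

t = _CVScoreTuple({"C": 1.0}, 0.90, np.array([0.89, 0.91]))
# repr(t) prints the summary only; t.cv_validation_scores holds the fold scores.
```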

@jnothman

to accommodate multiple metrics

Less so than the training scores or timing information that have previously been proposed.

@amueller

to accommodate multiple metrics

Less so than the training scores or timing information that have previously been proposed.

Both. I think it should be a dict in a form such that calling pd.DataFrame() on it gives sensible results (if at all possible). I think that should be possible.
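For illustration, a minimal sketch of such a layout (column names like param_C and mean_test_score are placeholders here, not a decided API): a dict of equal-length lists that pd.DataFrame can consume directly.

```python
# Hypothetical results structure: a dict of equal-length lists, one entry
# per parameter setting, so that pd.DataFrame(results) "just works".
import pandas as pd

results = {
    "param_C": [0.1, 1.0, 10.0],            # one column per parameter
    "mean_test_score": [0.81, 0.89, 0.84],
    "std_test_score": [0.02, 0.01, 0.03],
}

df = pd.DataFrame(results)
best = df.loc[df["mean_test_score"].idxmax()]  # row of the best setting
```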

@raghavrv
Owner Author

(Sorry if this is lame) we chose a namedtuple instead of a dict to make it memory-efficient, correct?

@vene

vene commented Apr 14, 2016

That's my understanding, based on the very good comment in the source. I wonder how relevant the memory concern is. Was there an issue prompting it?
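For what it's worth, the memory argument can be measured directly (a sketch; the field names are illustrative): a namedtuple instance carries no per-instance dict, since the field names live once on the class.

```python
import sys
from collections import namedtuple

Score = namedtuple("Score", ["parameters", "mean_score", "cv_scores"])
as_tuple = Score({"C": 1.0}, 0.90, [0.89, 0.91])
as_dict = {"parameters": {"C": 1.0},
           "mean_score": 0.90,
           "cv_scores": [0.89, 0.91]}

# The namedtuple instance is laid out like a plain tuple, so it is smaller
# than the equivalent dict with its per-instance hash table.
smaller = sys.getsizeof(as_tuple) < sys.getsizeof(as_dict)
```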


@amueller

I think I might have introduced it to give it more of a fixed structure. I don't think it was a good idea and I don't think there was a particular concern.

@raghavrv
Owner Author

OK, so now we are renaming grid_scores_ to search_results_ (in GSCV as well as RSCV) and turning _CVScoreTuple into a dict with keys fit_number, parameters, mean_scores, all_scores?

@raghavrv
Owner Author

Maybe the fit_number is superfluous...

@amueller

Where was search_results_ discussed?
We have *_scores_ in several places.

I don't think we want parameters but instead one column per parameter.
An interesting question is whether we want one row per model evaluation or one per parameter setting.
Having one per model evaluation would give more fine-grained control. Then we could have fold and settings_number or parameter_number so that .groupby("parameter_number").mean() would provide the average over folds.
The name is a bit awkward, though.
What happens if you do .groupby(param_grid.keys()).mean()? I'm no good at hierarchical indexing.
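To make the two groupby options concrete, a small sketch (the per-evaluation layout and the columns fold and settings_number are assumptions from this thread, not an implemented API):

```python
import pandas as pd

# One row per model evaluation, i.e. per (parameter setting, fold) pair.
rows = pd.DataFrame({
    "settings_number": [0, 0, 1, 1],
    "fold":            [0, 1, 0, 1],
    "C":               [0.1, 0.1, 1.0, 1.0],   # one column per parameter
    "score":           [0.80, 0.82, 0.88, 0.90],
})

# Averaging over folds via the proposed settings_number column...
by_number = rows.groupby("settings_number")["score"].mean()

# ...or directly via the parameter columns (the param_grid keys), which
# would make an explicit settings_number redundant.
by_params = rows.groupby(["C"])["score"].mean()
```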

@raghavrv changed the title from "[MRG] Rename grid_scores --> search_scores to be consistent with RandomizedSearchCV" to "[MRG] Better format for storing the search results, to support multiple metric. (grid_scores and _CVScoreTuple)" on Apr 14, 2016
@MechCoder

What happens if you do .groupby(param_grid.keys()).mean()? I'm no good at hierarchical indexing.

That should work as well, making the proposed parameter_number or settings_number redundant.

@MechCoder

+1 for having a row per fold with a column for every parameter. And a column for every metric whenever multiple metric support is added.

@jnothman

@amueller wrote:

I think it should be a dict in a form such that calling pd.DataFrame() on it gives sensible results (if at all possible).

A dict of arrays, or a list of dicts?

I suppose the namedtuple was introduced because the incumbent plain tuple (wasn't it?) was not self-documenting. But namedtuples retain some of the inflexibility of tuples, particularly with unpacking iteration (for a, b, c in data:).
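A quick sketch of that unpacking problem (field names are illustrative): code that unpacks the result type positionally breaks the moment a field is added.

```python
from collections import namedtuple

# Code written against a three-field result type...
Score3 = namedtuple("Score3", ["parameters", "mean_score", "cv_scores"])
data = [Score3({"C": 1.0}, 0.90, [0.89, 0.91])]
for params, mean, folds in data:       # fine today
    pass

# ...stops working as soon as a field (say train_score) is appended:
Score4 = namedtuple("Score4", Score3._fields + ("train_score",))
data4 = [Score4({"C": 1.0}, 0.90, [0.89, 0.91], 0.95)]
unpacking_broke = False
try:
    for params, mean, folds in data4:  # too many values to unpack
        pass
except ValueError:
    unpacking_broke = True
```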

@jnothman

jnothman commented Apr 15, 2016

+1 for having a row per fold with a column for every parameter.

Two problems here:

  • it should be easy to use set_params after inspecting the results, so we still need an easy way to get back the full set of parameters
  • not all cross-validations need to use the same set of parameter names, so we need a way of saying "unset" (and None is not a good answer).
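One possible way to represent "unset" without overloading None (a sketch only, assuming a NumPy masked array per parameter column; nothing here is decided):

```python
import numpy as np

# Three candidate settings; not all of them define max_depth.
max_depth = np.ma.masked_all(3, dtype=object)
max_depth[0] = None   # setting 0 genuinely uses max_depth=None
max_depth[1] = 5      # setting 1 uses max_depth=5
# Setting 2 never had a max_depth parameter at all: it stays masked,
# which is distinct from holding the legitimate value None.
```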

I also think it's very strange that we're not having this discussion in a scikit-learn space, but in @rvraghav93's.

But please also see scikit-learn#1787 where This Discussion Was Had Before. (Struct arrays and dataframes both have their advantages, but I agree with @amueller that we are best off giving users a more familiar and universal structure.)

@raghavrv
Owner Author

I also think it's very strange that we're not having this discussion in a scikit-learn space, but in @rvraghav93's.

I am sorry, I started this as a trivial PR which renames grid_scores to search_scores for consistency with RandomizedSearchCV... I will move this to a proper scikit-learn PR, populating it with the relevant discussions.

@raghavrv closed this Apr 15, 2016
@raghavrv deleted the search_results_ branch April 15, 2016 11:57
@amueller

@jnothman one benefit of starting the discussion here is that I saw it because it wasn't caught by my scikit-learn filter ;) I think this is a very important discussion.

@jnothman

Hahaha so now we know how to get your attention!


@raghavrv
Owner Author

I've raised an issue referencing all the relevant issues/PRs and noting the important conclusions, together with my proposed solution, at scikit-learn#6686.

Kindly take a look!
