
Conversation

raghavrv
Owner

@MechCoder @amueller @vene @jnothman

I'll do this (multiple metric support) in incremental steps. Will merge the trivial PRs as soon as I get a +1.

This is a very trivial PR. Please take a look.

@raghavrv force-pushed the model_selection_enhancements1 branch from 8f7bea9 to a2c3922 on April 14, 2016 10:00
@raghavrv changed the title from "[MRG] Rename grid_scores --> search_scores" to "[MRG] Rename grid_scores --> search_scores to be consistent with RandomizedSearchCV" on Apr 14, 2016
@jnothman

The reason we've not done this before was at least in part because there was disagreement on the form of grid_scores. I for one think a namedtuple is too constraining if we want to optionally include other information in the scores struct.

@vene

vene commented Apr 14, 2016

I agree with @jnothman, the format might have to change to accommodate multiple metrics.

As a side note, I think the current _CVScoreTuple does too much magic in its repr. Users may not realize that all fold scores are available, not just their summary.
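Roughly (a reconstruction from memory, not the exact source), the repr in question summarizes the per-fold scores rather than showing them:

```python
from collections import namedtuple

import numpy as np

# Approximate sketch of the summarizing repr; not the exact source code.
class _CVScoreTuple(namedtuple("_CVScoreTuple",
                               ["parameters", "mean_validation_score",
                                "cv_validation_scores"])):
    __slots__ = ()

    def __repr__(self):
        # Shows only mean/std, so the individual fold scores are easy to miss.
        return "mean: {0:.5f}, std: {1:.5f}, params: {2!r}".format(
            self.mean_validation_score,
            np.std(self.cv_validation_scores),
            self.parameters)

t = _CVScoreTuple({"C": 1.0}, 0.90, np.array([0.89, 0.91]))
# repr(t) prints the summary only; t.cv_validation_scores holds the fold scores.
```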

@jnothman

to accommodate multiple metrics

Less so than the training scores or timing information that have previously been proposed.

@amueller

to accommodate multiple metrics

Less so than the training scores or timing information that have previously been proposed.

Both. I think it should be a dict in a form such that calling pd.DataFrame() on it gives sensible results (if at all possible). I think that should be possible.
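For illustration, a minimal sketch of such a layout (column names like param_C and mean_test_score are placeholders here, not a decided API): a dict of equal-length lists that pd.DataFrame can consume directly.

```python
# Hypothetical results structure: a dict of equal-length lists, one entry
# per parameter setting, so that pd.DataFrame(results) "just works".
import pandas as pd

results = {
    "param_C": [0.1, 1.0, 10.0],            # one column per parameter
    "mean_test_score": [0.81, 0.89, 0.84],
    "std_test_score": [0.02, 0.01, 0.03],
}

df = pd.DataFrame(results)
best = df.loc[df["mean_test_score"].idxmax()]  # row of the best setting
```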

@raghavrv
Owner Author

(Sorry if this is lame) we chose a namedtuple instead of a dict to make it memory-efficient, correct?

@vene

vene commented Apr 14, 2016

That's my understanding, based on the very good comment in the source. I wonder how relevant the memory concern is. Was there an issue prompting it?
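For what it's worth, the memory argument can be measured directly (a sketch; the field names are illustrative): a namedtuple instance carries no per-instance dict, since the field names live once on the class.

```python
import sys
from collections import namedtuple

Score = namedtuple("Score", ["parameters", "mean_score", "cv_scores"])
as_tuple = Score({"C": 1.0}, 0.90, [0.89, 0.91])
as_dict = {"parameters": {"C": 1.0},
           "mean_score": 0.90,
           "cv_scores": [0.89, 0.91]}

# The namedtuple instance is laid out like a plain tuple, so it is smaller
# than the equivalent dict with its per-instance hash table.
smaller = sys.getsizeof(as_tuple) < sys.getsizeof(as_dict)
```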


@amueller

I think I might have introduced it to give it more of a fixed structure. I don't think it was a good idea and I don't think there was a particular concern.

@raghavrv
Owner Author

OK, so now we are renaming grid_scores_ to search_results_ (in GSCV as well as RSCV) and turning _CVScoreTuple into a dict with keys fit_number, parameters, mean_scores, all_scores?

@raghavrv
Owner Author

Maybe the fit_number is superfluous...

@amueller

Where was search_results_ discussed?
We have *_scores_ in several places.

I don't think we want parameters but instead one column per parameter.
An interesting question is whether we want one row per model evaluation or one per parameter setting.
Having one per model evaluation would give more fine-grained control. Then we could have fold and settings_number or parameter_number so that .groupby("parameter_number").mean() would provide the average over folds.
The name is a bit awkward, though.
What happens if you do .groupby(param_grid.keys()).mean()? I'm no good at hierarchical indexing.
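To make the two groupby options concrete, a small sketch (the per-evaluation layout and the columns fold and settings_number are assumptions from this thread, not an implemented API):

```python
import pandas as pd

# One row per model evaluation, i.e. per (parameter setting, fold) pair.
rows = pd.DataFrame({
    "settings_number": [0, 0, 1, 1],
    "fold":            [0, 1, 0, 1],
    "C":               [0.1, 0.1, 1.0, 1.0],   # one column per parameter
    "score":           [0.80, 0.82, 0.88, 0.90],
})

# Averaging over folds via the proposed settings_number column...
by_number = rows.groupby("settings_number")["score"].mean()

# ...or directly via the parameter columns (the param_grid keys), which
# would make an explicit settings_number redundant.
by_params = rows.groupby(["C"])["score"].mean()
```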

@raghavrv changed the title from "[MRG] Rename grid_scores --> search_scores to be consistent with RandomizedSearchCV" to "[MRG] Better format for storing the search results, to support multiple metric. (grid_scores and _CVScoreTuple)" on Apr 14, 2016
@MechCoder

What happens if you do .groupby(param_grid.keys()).mean()? I'm no good at hierarchical indexing.

That should work as well, making the proposed parameter_number or settings_number redundant.

@MechCoder

+1 for having a row per fold with a column for every parameter. And a column for every metric whenever multiple metric support is added.

@jnothman

@amueller wrote:

I think it should be a dict in a form such that calling pd.DataFrame() on it gives sensible results (if at all possible).

A dict of arrays, or a list of dicts?

I suppose the namedtuple was introduced because the incumbent plain tuple (wasn't it?) was not self-documenting. But namedtuples retain some of the inflexibility of tuples, particularly with unpacking iteration (for a, b, c in data:).
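A quick sketch of that unpacking problem (field names are illustrative): code that unpacks the result type positionally breaks the moment a field is added.

```python
from collections import namedtuple

# Code written against a three-field result type...
Score3 = namedtuple("Score3", ["parameters", "mean_score", "cv_scores"])
data = [Score3({"C": 1.0}, 0.90, [0.89, 0.91])]
for params, mean, folds in data:       # fine today
    pass

# ...stops working as soon as a field (say train_score) is appended:
Score4 = namedtuple("Score4", Score3._fields + ("train_score",))
data4 = [Score4({"C": 1.0}, 0.90, [0.89, 0.91], 0.95)]
unpacking_broke = False
try:
    for params, mean, folds in data4:  # too many values to unpack
        pass
except ValueError:
    unpacking_broke = True
```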

@jnothman

jnothman commented Apr 15, 2016

+1 for having a row per fold with a column for every parameter.

Two problems here:

  • it should be easy to use set_params after inspecting the results, so we still need an easy way to get back the full set of parameters
  • not all cross-validations need to use the same set of parameter names, so we need a way of saying "unset" (and None is not a good answer).
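One possible way to represent "unset" without overloading None (a sketch only, assuming a NumPy masked array per parameter column; nothing here is decided):

```python
import numpy as np

# Three candidate settings; not all of them define max_depth.
max_depth = np.ma.masked_all(3, dtype=object)
max_depth[0] = None   # setting 0 genuinely uses max_depth=None
max_depth[1] = 5      # setting 1 uses max_depth=5
# Setting 2 never had a max_depth parameter at all: it stays masked,
# which is distinct from holding the legitimate value None.
```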

I also think it's very strange that we're not having this discussion in a scikit-learn space, but in @rvraghav93's.

But please also see scikit-learn#1787 where This Discussion Was Had Before. (Struct arrays and dataframes both have their advantages, but I agree with @amueller that we are best off giving users a more familiar and universal structure.)

@raghavrv
Owner Author

I also think it's very strange that we're not having this discussion in a scikit-learn space, but in @rvraghav93's.

I am sorry, I started this as a trivial PR which renames grid_scores to search_scores for consistency with RandomizedSearchCV... I will move this to a proper scikit-learn PR, populating it with the relevant discussions.

@raghavrv closed this Apr 15, 2016
@raghavrv deleted the search_results_ branch April 15, 2016 11:57
@amueller

@jnothman one benefit of starting the discussion here is that I saw it because it wasn't caught by my scikit-learn filter ;) I think this is a very important discussion.

@jnothman

Hahaha so now we know how to get your attention!


@raghavrv
Owner Author

I've raised an issue referencing all the relevant issues/PRs and noting the important conclusions, together with my proposed solution, at scikit-learn#6686.

Kindly take a look!
