[RFC] Better Format for search results in model_selection module. #6686


Closed
raghavrv opened this issue Apr 20, 2016 · 15 comments

@raghavrv (Member) commented Apr 20, 2016:

An attempt to solve this - #6697

Currently we store -

grid_scores_ - For both GridSearchCV and RandomizedSearchCV. This holds a list of _CVScoreTuples.

Each _CVScoreTuple holds the parameters (the parameter-setting dict for one candidate), the aggregated score, and all_scores, a numpy array of the scores from each CV split for that setting.

What I propose -

Could we have a separate SearchResults class (similar to the one proposed by Andy at #1034)? One that (see the rough sketch after these lists)

  • will store -

search_scores_ - A 2D numpy array with one row per evaluation (one split/fold) and one column for every metric the predictions are evaluated on. Shape (n_fits, n_metrics)

search_scores_aggregated_ - A 2D numpy array with one row per parameter setting (candidate) and one column per metric. Shape (n_fits / n_splits, n_metrics)

(
search_scores_train_set_ - If needed (Ref #1742)
search_scores_aggregated_train_set_ - If needed
)

train_times_ - 1D numpy array recording the training time for each evaluation. Shape (n_fits,)

test_times_ - 1D numpy array recording the test times. Shape (n_fits,)

test_set_size_ - 1D numpy array recording the test set size for each split. Shape (n_fits,)

metric_names_ - The sorted list of all the metrics (equivalently, the column headers of the search_scores_* arrays). Shape (n_metrics,)

candidates_/parameter_settings_ - A 2D numpy object array with one row per candidate (parameter setting) and columns corresponding to the sorted list of parameters. Shape (n_fits / n_splits, n_parameters)

parameter_names_ - The sorted list of parameters (equivalently, the column headers for the candidates_/parameter_settings_ array). Shape (n_parameters,)

  • And will expose -

get_search_results()/iter_search_results()/__iter__ - Return or yield a list of dicts, as Olivier proposes here.

get_search_results_pandas() - To get a pandas dataframe of the search results. Similar to tabulate_results.

std and a few other statistics on the scores, as attempted by Andy at #1034, to help visualize the search results better.

__repr__ - That will tabulate the top 10 candidates
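As a rough sketch of how the stored attributes could hang together (the attribute names are the ones proposed above; everything else, including the constructor signature, is illustrative and assumes the per-split rows are grouped contiguously by candidate):

import numpy as np

class SearchResults(object):
    """Rough sketch of the proposed container; not final API."""

    def __init__(self, parameter_settings, parameter_names, metric_names,
                 search_scores, train_times, test_times, test_set_sizes,
                 n_splits):
        # One row per candidate, one column per (sorted) parameter name.
        self.parameter_settings_ = parameter_settings    # (n_candidates, n_parameters)
        self.parameter_names_ = sorted(parameter_names)  # (n_parameters,)
        self.metric_names_ = sorted(metric_names)        # (n_metrics,)
        # One row per fit (candidate x split), one column per metric.
        self.search_scores_ = search_scores              # (n_fits, n_metrics)
        self.train_times_ = train_times                  # (n_fits,)
        self.test_times_ = test_times                    # (n_fits,)
        self.test_set_size_ = test_set_sizes             # (n_fits,)
        # Mean over the splits of each candidate, relying on the rows of
        # search_scores being grouped contiguously by candidate.
        n_candidates = search_scores.shape[0] // n_splits
        self.search_scores_aggregated_ = search_scores.reshape(
            n_candidates, n_splits, -1).mean(axis=1)     # (n_candidates, n_metrics)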


##### Previous discussion (chronological) -

#1020 - Andy raised an issue to make the `grid_scores_` more usable.
#1034 - Andy proposes to fix #1020 with a new `ResultGrid` class which stores `scores`, `params` and `values`. Stalled as `RandomizedSearchCV` was introduced and the PR became outdated.

This is where Joel had proposed tabulate_results to return the scores in a more presentable manner.

ML thread -
#1742 - Andy adds recording training_scores_.
#1768 & #1787 - Joel adds grid_results_/search_results_ as structured arrays with one row per parameter setting (candidate), and fold_results_ as a structured array with one row per split (fold) per parameter setting.

This is where Olivier proposes the search output as a list of dict with keys (parameter_id, fold_id, parameters (as dicts), train_fold_size, test_fold_size, ...)

This is from where I got the name search_results (or SearchResults).
#2079 - Joel's update (based on comments at #1787). Stalled as the PR became outdated.


@MechCoder @jnothman @amueller @vene @agramfort @AlexanderFabisch @mblondel @GaelVaroquaux

@raghavrv (Member, Author) commented Apr 20, 2016:

I am hoping this is not a complex solution: we store all the scores, times and sizes as separate compact numpy arrays, saving time and space, but expose the list-of-dicts format that we want users to see as a property, a generator function, or via __iter__ itself.

>>> list(gs_instance.search_results_)
[{'candidate_id': 42, 'all_scores': np.array([0.4, 0.8, 0.9]), 'mean_score': 0.7,
  'parameters': {parameter_dict}},
 ...]

And we can also expose a pandas dataframe for users to further exploit the search results (gs_instance.search_results_.get_my_panda()).

And provide additional statistics in the form of gs_instance.search_results_.get_that_stat()
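A minimal sketch of how such an iterator could build the dicts lazily from the compact arrays (this assumes the SearchResults layout sketched above and a single metric in column 0; the helper name is purely illustrative):

import numpy as np

def _iter_search_results(results):
    # Yield one dict per candidate, built on the fly from the arrays.
    n_candidates = len(results.parameter_settings_)
    n_splits = results.search_scores_.shape[0] // n_candidates
    for cand_id in range(n_candidates):
        scores = results.search_scores_[cand_id * n_splits:
                                        (cand_id + 1) * n_splits, 0]
        yield {'candidate_id': cand_id,
               'all_scores': scores,
               'mean_score': scores.mean(),
               'parameters': dict(zip(results.parameter_names_,
                                      results.parameter_settings_[cand_id]))}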

@amueller (Member):

I think we should create a dict of arrays that the user can call pd.DataFrame on, as I said elsewhere.
@jnothman brought up the fact that if we have lists of grids in the grid search, not all models share parameters. (I think the list of grids is the only case that can happen, right?).

Your list of dicts doesn't have that problem. But what is the result of get_my_panda? That would have the same issue.
And how do you compute stats? If you need to iterate over the list each time you want to compute something, that might take a while. I'd rather parse everything into an efficient data structure first.

We could have a placeholder value for parameters that are not active, like "not_active".
That would turn the array into an object-dtype array, which is not ideal. But if the parameter was a string, it was already of that type; if it was an int, well, then we lost something.
If a user wants statistics, they can convert to a pandas dataframe where "not_active" is the "missing" value.

Instead of the placeholder, we might want to use masked arrays, which wouldn't need the dtype conversion for missing values. @rvraghav93 can you check how well masked arrays can be converted to dataframes?

@amueller (Member):

Looks promising: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-a-series

So why not a dict of masked arrays? I think actually having masks is a "rare" case, so most people don't have to worry about it too much. And if people actually convert to data frames, it should all be fine.
We need to make sure that pandas doesn't map np.NaN to the same missing type as masked values. If it does, we need to be explicit about that in our docs. But in the masked array we produce, they will be distinct.
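A quick check of that conversion (illustrative values; note that the masked entry and a genuine np.nan both come out as NaN, which is exactly the conflation to document):

import numpy as np
import pandas as pd

gamma = np.ma.masked_array([0.0, np.nan, 0.1, 0.2, 0.3],
                           mask=[True, False, False, False, False])
s = pd.Series(gamma)
# Position 0 (masked) and position 1 (a real np.nan) are both NaN in
# the resulting Series - indistinguishable after conversion.
print(s)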

@raghavrv (Member, Author) commented Apr 21, 2016:

Thanks heaps for your comments!

OK, if I understand your comment correctly, are you suggesting we have one axis per parameter, and whichever point (combination of parameters) doesn't make sense (esp. if it's a random search) we mask to nan?

Indeed this would make gathering statistics much easier.

@amueller (Member):

One key in the dictionary, or one column in the dataframe, for each parameter. And if the parameter is not set, this is masked (not NaN).

@raghavrv (Member, Author):

> And if the parameter is not set, this is masked (not NaN).

You mean left to the default value?

@vene (Member) commented Apr 21, 2016:

I'm guessing @amueller means "when a parameter is not part of the parameters being used in the current iteration", a case that can emerge when using lists-of-dicts.

Otherwise there is no real interface (or any point, really) for leaving a parameter at its default value.

So am I right we want something like:

kernel    gamma    degree
=========================
'poly'      -        2
'poly'      -        3
'rbf'     0.1        -
'rbf'     0.2        -
'rbf'     0.3        -

where dashes are masked values and each column is an array (e.g. {'kernel': ['poly', 'poly', 'rbf', 'rbf', 'rbf'], ...})?
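In code, that table would be a dict of masked arrays, something like the following (the values sitting under the mask are arbitrary fill values):

import numpy as np

results = {
    'kernel': np.ma.masked_array(['poly', 'poly', 'rbf', 'rbf', 'rbf'],
                                 mask=[False] * 5),
    'gamma':  np.ma.masked_array([0.0, 0.0, 0.1, 0.2, 0.3],
                                 mask=[True, True, False, False, False]),
    'degree': np.ma.masked_array([2, 3, 0, 0, 0],
                                 mask=[False, False, True, True, True]),
}
# If pandas is available, pd.DataFrame(results) renders the masked
# cells as missing values.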

@raghavrv (Member, Author) commented Apr 21, 2016:

Oh so we are trying to capture the grid/random search process? (This way we can stop early or warm start too?)

i.e. Are you suggesting that we do -

# search progresses this way ---> 
{'param_a' : ['foo', 'bar', 'MASKED'],
 'param_b' : [0, 0, 1],
 'all_scores' : [[0.8, 0.9], [0.9, 0.8], [0.8, 0.8]],
 'aggregated_scores' : [0.85, 0.85, 0.8]
}

In this case, why not simply repeat 'bar' in that ndarray (so as to preserve the information that we are not changing that parameter in the current iteration)?

@raghavrv (Member, Author):

@vene Ah your edited comment explains it clearly! Thanks!

@raghavrv (Member, Author) commented Apr 22, 2016:

One question: why do we need a masked array? Why can't we use nan in those places, since converting to pandas will convert the masked elements to nan anyway?

@raghavrv (Member, Author):

Could we also have a dict element 'candidate_rank' to rank the parameter settings (candidates)?

@MechCoder (Member):

No strong opinion, but that should be pretty straightforward to do if you have a column that holds the mean_validation_score, no?
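For instance, a rank column could be derived from such a column with a couple of lines of numpy (illustrative values):

import numpy as np

mean_validation_score = np.array([0.85, 0.85, 0.80])
# Rank 1 = best mean score; the stable mergesort breaks ties by order
# of appearance.
order = np.argsort(-mean_validation_score, kind='mergesort')
candidate_rank = np.empty_like(order)
candidate_rank[order] = np.arange(1, len(order) + 1)
print(candidate_rank)  # [1 2 3]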

@MechCoder (Member):

I haven't thought deeply about possible disadvantages, but in any case @amueller's proposal of a dict of masked arrays seems great.

As for masked arrays: if someone wants to operate on search_results_ directly rather than on the converted dataframe, then masked arrays are the friendlier representation, since we can't assume pandas is installed.

@vene (Member) commented Apr 22, 2016:

@MechCoder is right, we absolutely don't want to require pandas, and nans confuse argmin/argmax computation. That, plus the same reasons we ran into in the missing value representation discussion, make nans not a fun choice IMO.
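A two-line illustration of the argmax problem (made-up scores):

import numpy as np

scores = np.array([0.8, np.nan, 0.9])
print(np.argmax(scores))                      # 1 - the nan "wins"
print(np.ma.masked_invalid(scores).argmax())  # 2 - masked entries are ignored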

@raghavrv (Member, Author):

Thanks @MechCoder @vene for the inputs!!
