[RFC] Better Format for search results in model_selection module.

An attempt to solve this - #6697 
##### Currently we store -

`grid_scores_` - For both `GridSearchCV` and `RandomizedSearchCV`. This holds a list of `_CVScoreTuple`s.

The `_CVScoreTuple` holds the `parameters` (parameter setting dict for one candidate), `score` (aggregated) and `all_scores` for all the cv splits for that setting as a numpy array.
##### What I propose -

Could we have a separate `SearchResults` class (similar to the one proposed by Andy at #1034). One that
- will store -

`search_scores_` - A 2D numpy array with one row for one evaluation (one split/fold) and one column for every metric that we evaluate the predictions upon. Size `(n_fits, n_metrics)`

`search_scores_aggregated_` - A 2D numpy arr. with one row per parameter setting (candidate) and one column for each metric. Size `(n_fits/n_splits, n_metrics)`

(
`search_scores_train_set_` - If needed (Ref https://github.com/scikit-learn/scikit-learn/pull/1742)
`search_scores_aggregated_train_set_` - If needed
)

`train_times_` - 1D numpy array recording the training times for each evaluation. Size `(n_fits, 1)`

`test_times_` - 1D numpy array for the test times. Size `(n_fits, 1)`

`test_set_size_` - 1D numpy array recording the test set size for each split. Size `(n_fits, 1)`

`metric_names_` - The sorted list of all the metrics (equivalently the column header of `search_scores*` array). Size `(n_metrics, )`

`candidates_`/`parameter_settings_` - A 2D numpy object array with one row per candidate (parameter setting). Columns correspond to the sorted list of parameters. Size `(n_fits/n_splits, n_parameters)`

`parameters_names_` - The sorted list of parameters. (equivalently the column header for the `candidates_`/`parameter_settings_` array. Size `(n_parameters, )`
- And will expose - 

`get_search_results()`/`iter_search_results()`/`__iter__` - Return or yield a list of dict as Olivier proposes [here](https://github.com/scikit-learn/scikit-learn/pull/1787#issuecomment-19685071).

`get_search_results_pandas()` - To get a pandas dataframe of the search results. Similar to [`tabulate_results`](https://github.com/jnothman/scikit-learn/blob/a97e2d60d4563eacb1e29e33d5f8d8d23ad799ea/sklearn/grid_search.py#L538).

`std`/few other statistics on the scores as attempted by Andy at #1034, to help visualize the search results better.

`__repr__` - That will tabulate the top 10 candidates

<hr>
##### Previous discussion (chronological) -
#1020 - Andy raised an issue to make the `grid_scores_` more usable
#1034 - Andy proposes to fix #1020 by a new `ResultGrid` class which stores `scores` `params` and `values` - Got stalled as `RandomizedSearchCV` was introduced and the PR became outdated.

This is where Joel [had proposed](https://github.com/jnothman/scikit-learn/blob/a97e2d60d4563eacb1e29e33d5f8d8d23ad799ea/sklearn/grid_search.py#L538) [`tabulate_results`](https://github.com/jnothman/scikit-learn/blob/a97e2d60d4563eacb1e29e33d5f8d8d23ad799ea/sklearn/grid_search.py#L538) to return the scores in a more presentable manner.

[ML thread](https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/CAAkaFLVE+8u4sY-Ad4XZvkMfMPeT-PV+5RMXCFyrjRUj1aTekg@mail.gmail.com/) - 
#1742 - Andy adds recording `training_scores_`.
#1768 & #1787 - Joel adds `grid_results_`/`search_results_` as structured arrays with one row per parameter setting (candidate). and `fold_results_` as structured array with one row per split (fold) per parameter setting.

This is where Olivier proposes [the search output as a list of dict](https://github.com/scikit-learn/scikit-learn/pull/1787#issuecomment-19685071) with keys (`parameter_id`, `fold_id`, `parameters` (as dicts), `train_fold_size`, `test_fold_size`, ...)

[This is from where](https://github.com/scikit-learn/scikit-learn/pull/1787#issuecomment-17589137) I got the name `search_results` (or `SearchResults`).
#2079 Joel's update (based on comments at #1787). Stalled as the PR became outdated.

<hr>

@MechCoder @jnothman @amueller @vene @agramfort @AlexanderFabisch @mblondel @GaelVaroquaux


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[RFC] Better Format for search results in model_selection module. #6686

Currently we store -

What I propose -

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[RFC] Better Format for search results in model_selection module. #6686

Description

Currently we store -

What I propose -

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions