Description
An attempt to solve this - #6697
Currently we store `grid_scores_` for both `GridSearchCV` and `RandomizedSearchCV`. This holds a list of `_CVScoreTuple`s.
The `_CVScoreTuple` holds the `parameters` (parameter setting dict for one candidate), `score` (aggregated) and `all_scores` for all the cv splits for that setting as a numpy array.
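
For readers unfamiliar with the current layout, here is a minimal sketch of such a tuple. It is an illustrative stand-in, not the library's private class: the name `CVScoreTuple`, the helper `make_entry` and the field names simply follow the description above, and the scores are made up.

```python
from collections import namedtuple

import numpy as np

# Illustrative stand-in; field names follow the description above.
CVScoreTuple = namedtuple("CVScoreTuple", ["parameters", "score", "all_scores"])


def make_entry(parameters, all_scores):
    """Build one entry: the aggregated score is the mean over the cv splits."""
    all_scores = np.asarray(all_scores, dtype=float)
    return CVScoreTuple(parameters, float(all_scores.mean()), all_scores)


# One entry per candidate; each holds the per-split scores for that candidate.
grid_scores = [
    make_entry({"C": 1.0}, [0.80, 0.90, 0.70]),
    make_entry({"C": 10.0}, [0.85, 0.95, 0.90]),
]

# Picking the best candidate means scanning the list of tuples.
best = max(grid_scores, key=lambda entry: entry.score)
```

Because each candidate carries its own small array, slicing across candidates or metrics requires looping over the list, which is part of the motivation for the array-based layout proposed here.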
What I propose -
Could we have a separate `SearchResults` class (similar to the one proposed by Andy at #1034)? One that will store -
- `search_scores_` - A 2D numpy array with one row per evaluation (one split/fold) and one column for every metric that we evaluate the predictions upon. Size `(n_fits, n_metrics)`
- `search_scores_aggregated_` - A 2D numpy array with one row per parameter setting (candidate) and one column for each metric. Size `(n_fits/n_splits, n_metrics)`
- (`search_scores_train_set_` - If needed (Ref #1742))
- (`search_scores_aggregated_train_set_` - If needed)
- `train_times_` - 1D numpy array recording the training times for each evaluation. Size `(n_fits,)`
- `test_times_` - 1D numpy array for the test times. Size `(n_fits,)`
- `test_set_size_` - 1D numpy array recording the test set size for each split. Size `(n_fits,)`
- `metric_names_` - The sorted list of all the metrics (equivalently the column header of the `search_scores*` arrays). Size `(n_metrics,)`
- `candidates_`/`parameter_settings_` - A 2D numpy object array with one row per candidate (parameter setting). Columns correspond to the sorted list of parameters. Size `(n_fits/n_splits, n_parameters)`
- `parameters_names_` - The sorted list of parameters (equivalently the column header for the `candidates_`/`parameter_settings_` array). Size `(n_parameters,)`
And will expose -
- `get_search_results()`/`iter_search_results()`/`__iter__` - Return or yield a list of dict as Olivier proposes here.
- `get_search_results_pandas()` - To get a pandas dataframe of the search results. Similar to `tabulate_results`.
- `std`/few other statistics on the scores as attempted by Andy at #1034, to help visualize the search results better.
- `__repr__` - That will tabulate the top 10 candidates
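
To make the proposal concrete, here is a rough sketch of what such a container could look like. Everything in it is hypothetical: the constructor signature, the `scores_by_metric` input, the mean as the aggregation, and the assumption that rows are grouped by candidate (all splits of candidate 0 first, then candidate 1, and so on). Only the attribute names are taken from the list above.

```python
import numpy as np


class SearchResults:
    """Hypothetical sketch; attribute names follow the proposal above."""

    def __init__(self, candidates, scores_by_metric, n_splits):
        # candidates: list of parameter dicts, one per candidate.
        # scores_by_metric: dict mapping metric name -> n_fits scores,
        # assumed to be grouped by candidate (an assumption of this sketch).
        n_candidates = len(candidates)
        self.metric_names_ = sorted(scores_by_metric)
        # (n_fits, n_metrics): one row per fit, one column per metric.
        self.search_scores_ = np.column_stack(
            [np.asarray(scores_by_metric[m], dtype=float) for m in self.metric_names_]
        )
        if self.search_scores_.shape[0] != n_candidates * n_splits:
            raise ValueError("expected n_candidates * n_splits scores per metric")
        # (n_fits/n_splits, n_metrics): mean over the splits of each candidate.
        self.search_scores_aggregated_ = self.search_scores_.reshape(
            n_candidates, n_splits, -1
        ).mean(axis=1)
        self.parameters_names_ = sorted({k for p in candidates for k in p})
        # (n_fits/n_splits, n_parameters) object array; None marks a parameter
        # that a candidate did not set.
        self.candidates_ = np.empty(
            (n_candidates, len(self.parameters_names_)), dtype=object
        )
        for i, params in enumerate(candidates):
            for j, name in enumerate(self.parameters_names_):
                self.candidates_[i, j] = params.get(name)

    def iter_search_results(self):
        """Yield one dict per candidate with its parameters and mean scores."""
        for i, row in enumerate(self.candidates_):
            result = {
                "candidate_id": i,
                "parameters": dict(zip(self.parameters_names_, row)),
            }
            for j, metric in enumerate(self.metric_names_):
                result[metric] = float(self.search_scores_aggregated_[i, j])
            yield result


# Made-up scores for two candidates evaluated on two splits each.
candidates = [{"C": 1.0, "kernel": "rbf"}, {"C": 10.0, "kernel": "rbf"}]
scores = {"f1": [0.5, 0.7, 0.6, 0.8], "accuracy": [0.8, 0.9, 0.85, 0.95]}
results = SearchResults(candidates, scores, n_splits=2)
rows = list(results.iter_search_results())
```

The list of dicts in `rows` could be handed to `pandas.DataFrame` to obtain the tabular view, which is one possible way to back `get_search_results_pandas()`.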
##### Previous discussion (chronological)
- #1020 - Andy raised an issue to make the `grid_scores_` more usable
- #1034 - Andy proposes to fix #1020 by a new `ResultGrid` class which stores `scores`, `params` and `values` - Got stalled as `RandomizedSearchCV` was introduced and the PR became outdated.
  This is where Joel had proposed `tabulate_results` to return the scores in a more presentable manner.
- ML thread -
- #1742 - Andy adds recording `training_scores_`.
- #1768 & #1787 - Joel adds `grid_results_`/`search_results_` as structured arrays with one row per parameter setting (candidate), and `fold_results_` as a structured array with one row per split (fold) per parameter setting.
  This is where Olivier proposes the search output as a list of dict with keys (`parameter_id`, `fold_id`, `parameters` (as dicts), `train_fold_size`, `test_fold_size`, ...).
  This is where I got the name `search_results` (or `SearchResults`) from.
- #2079 - Joel's update (based on comments at #1787). Stalled as the PR became outdated.
@MechCoder @jnothman @amueller @vene @agramfort @AlexanderFabisch @mblondel @GaelVaroquaux