[RFC] Better Format for search results in model_selection module. #6686

@raghavrv

Description

An attempt to solve this - #6697

Currently we store -

grid_scores_ - For both GridSearchCV and RandomizedSearchCV. This holds a list of _CVScoreTuples.

Each _CVScoreTuple holds the parameters (the parameter-setting dict for one candidate), the aggregated score, and all_scores for all the CV splits of that setting as a numpy array.
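For context, the current structure can be sketched roughly like this (a minimal illustration; `_CVScoreTuple` is a private namedtuple and its exact field names may differ):

```python
from collections import namedtuple

import numpy as np

# Illustrative stand-in for the current private _CVScoreTuple
CVScoreTuple = namedtuple(
    'CVScoreTuple',
    ['parameters', 'mean_validation_score', 'cv_validation_scores'])

# grid_scores_ is a flat Python list of such tuples, one per candidate
grid_scores = [
    CVScoreTuple({'C': 1}, 0.82, np.array([0.80, 0.83, 0.83])),
    CVScoreTuple({'C': 10}, 0.85, np.array([0.84, 0.85, 0.86])),
]

# Any query (e.g. the best candidate) requires iterating over the list
best = max(grid_scores, key=lambda t: t.mean_validation_score)
print(best.parameters)  # {'C': 10}
```

The awkward part this RFC targets: the per-split scores are buried inside each tuple, so tabulating or slicing them across candidates means unpacking the whole list by hand.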

What I propose -

Could we have a separate SearchResults class (similar to the one proposed by Andy in #1034)? One that

  • will store -

search_scores_ - A 2D numpy array with one row per evaluation (one split/fold of one candidate) and one column for every metric that the predictions are evaluated on. Size (n_fits, n_metrics)

search_scores_aggregated_ - A 2D numpy array with one row per parameter setting (candidate) and one column for each metric. Size (n_fits/n_splits, n_metrics)

(
search_scores_train_set_ - If needed (Ref #1742)
search_scores_aggregated_train_set_ - If needed
)

train_times_ - A 1D numpy array recording the training time of each evaluation. Size (n_fits,)

test_times_ - A 1D numpy array recording the test times. Size (n_fits,)

test_set_size_ - A 1D numpy array recording the test set size for each split. Size (n_fits,)

metric_names_ - The sorted list of all the metrics (equivalently the column headers of the search_scores_* arrays). Size (n_metrics,)

candidates_/parameter_settings_ - A 2D numpy object array with one row per candidate (parameter setting) and columns corresponding to the sorted list of parameters. Size (n_fits/n_splits, n_parameters)

parameter_names_ - The sorted list of parameters (equivalently the column headers of the candidates_/parameter_settings_ array). Size (n_parameters,)
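To make the proposed shapes concrete, here is a rough sketch of how the arrays would relate to each other (all names, numbers and scores are illustrative, not an implementation):

```python
import numpy as np

# Hypothetical layout: 2 candidates x 3 CV splits = 6 fits, 2 metrics
n_candidates, n_splits, n_metrics = 2, 3, 2
n_fits = n_candidates * n_splits

rng = np.random.RandomState(0)
search_scores = rng.rand(n_fits, n_metrics)          # one row per evaluation

# Aggregating over the splits of each candidate gives the second array:
search_scores_aggregated = search_scores.reshape(
    n_candidates, n_splits, n_metrics).mean(axis=1)  # (n_fits/n_splits, n_metrics)

train_times = rng.rand(n_fits)                       # (n_fits,)
test_times = rng.rand(n_fits)                        # (n_fits,)
test_set_size = np.full(n_fits, 50)                  # (n_fits,)

metric_names = ['accuracy', 'f1']                    # columns of search_scores
parameter_names = ['C', 'gamma']                     # columns of candidates
candidates = np.array([[1, 0.1], [10, 0.01]],
                      dtype=object)                  # (n_candidates, n_parameters)

print(search_scores_aggregated.shape)  # (2, 2)
```

The point of the layout is that per-split and per-candidate views are just reshapes of one another, so no list unpacking is needed to slice by metric, split or candidate.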

  • And will expose -

get_search_results()/iter_search_results()/__iter__ - Return or yield a list of dicts, as Olivier proposes here.

get_search_results_pandas() - To get a pandas dataframe of the search results. Similar to tabulate_results.

The standard deviation and a few other statistics on the scores, as attempted by Andy in #1034, to help visualize the search results better.

__repr__ - Which will tabulate the top 10 candidates
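A minimal sketch of what such a class could expose (purely illustrative: the names, signatures and dict keys are assumptions based on the proposals above, not an agreed API; the pandas accessor is omitted to keep the sketch dependency-free):

```python
import numpy as np

class SearchResults:
    """Hypothetical sketch of the proposed SearchResults container."""

    def __init__(self, candidates, parameter_names, scores, metric_names):
        # scores: (n_fits, n_metrics); candidates: (n_candidates, n_parameters)
        self.candidates_ = candidates
        self.parameter_names_ = parameter_names
        self.search_scores_ = scores
        self.metric_names_ = metric_names
        self.n_splits_ = scores.shape[0] // len(candidates)

    def __iter__(self):
        # Yield one dict per fit, in the spirit of Olivier's list-of-dicts idea
        for fit_id, row in enumerate(self.search_scores_):
            cand_id = fit_id // self.n_splits_
            yield {
                'parameter_id': cand_id,
                'fold_id': fit_id % self.n_splits_,
                'parameters': dict(zip(self.parameter_names_,
                                       self.candidates_[cand_id])),
                **dict(zip(self.metric_names_, row)),
            }

    def __repr__(self):
        # Tabulate candidates ranked by the mean of the first metric
        means = self.search_scores_[:, 0].reshape(-1, self.n_splits_).mean(axis=1)
        order = np.argsort(means)[::-1][:10]
        lines = ['rank  params  mean_%s' % self.metric_names_[0]]
        for rank, i in enumerate(order, 1):
            params = dict(zip(self.parameter_names_, self.candidates_[i]))
            lines.append('%4d  %s  %.3f' % (rank, params, means[i]))
        return '\n'.join(lines)

# Tiny usage example: 2 candidates x 2 splits, 1 metric
res = SearchResults(
    candidates=np.array([[1], [10]], dtype=object),
    parameter_names=['C'],
    scores=np.array([[0.80], [0.82], [0.90], [0.88]]),
    metric_names=['accuracy'],
)
rows = list(res)
print(rows[0]['parameters'])  # {'C': 1}
```

Iterating yields flat per-fit dicts, which is exactly the shape a pandas DataFrame constructor or a CSV writer can consume directly.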


##### Previous discussion (chronological)

#1020 - Andy raised an issue to make `grid_scores_` more usable.
#1034 - Andy proposed to fix #1020 with a new `ResultGrid` class that stores `scores`, `params` and `values` - got stalled as `RandomizedSearchCV` was introduced and the PR became outdated.

This is where Joel proposed tabulate_results to return the scores in a more presentable manner.

ML thread -
#1742 - Andy adds recording of training_scores_.
#1768 & #1787 - Joel adds grid_results_/search_results_ as structured arrays with one row per parameter setting (candidate), and fold_results_ as a structured array with one row per split (fold) per parameter setting.

This is where Olivier proposes the search output as a list of dict with keys (parameter_id, fold_id, parameters (as dicts), train_fold_size, test_fold_size, ...)

This is where I got the name search_results (or SearchResults) from.
#2079 - Joel's update (based on the comments at #1787). Stalled as the PR became outdated.


@MechCoder @jnothman @amueller @vene @agramfort @AlexanderFabisch @mblondel @GaelVaroquaux
