[RFC] Better Format for search results in model_selection module. #6686

Closed
@raghavrv

Description

An attempt to solve this: #6697

Currently we store -

grid_scores_ - For both GridSearchCV and RandomizedSearchCV. This holds a list of _CVScoreTuples.

Each _CVScoreTuple holds the parameters (the parameter-setting dict for one candidate), the aggregated score, and the per-split scores for that setting as a numpy array.
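For concreteness, here is a minimal sketch of the current layout. The field names follow scikit-learn's actual `_CVScoreTuple`; the scores and parameter values are made up for illustration.

```python
from collections import namedtuple

import numpy as np

# Current storage format: one namedtuple per candidate parameter setting.
_CVScoreTuple = namedtuple(
    "_CVScoreTuple",
    ["parameters", "mean_validation_score", "cv_validation_scores"])

# Illustrative contents for a 2-candidate, 3-fold search.
grid_scores_ = [
    _CVScoreTuple({"C": 1.0}, 0.82, np.array([0.80, 0.83, 0.83])),
    _CVScoreTuple({"C": 10.0}, 0.85, np.array([0.84, 0.86, 0.85])),
]

# Getting all per-split scores as an array requires manual iteration:
all_scores = np.vstack([t.cv_validation_scores for t in grid_scores_])
print(all_scores.shape)  # (n_candidates, n_splits) -> (2, 3)
```

This is the awkwardness the proposal below addresses: the per-split data is locked inside a list of tuples rather than exposed as plain arrays.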

What I propose -

Could we have a separate SearchResults class (similar to the one proposed by Andy at #1034). One that

  • will store -

search_scores_ - A 2D numpy array with one row per evaluation (one split/fold of one candidate) and one column for each metric the predictions are evaluated on. Size (n_fits, n_metrics)

search_scores_aggregated_ - A 2D numpy array with one row per parameter setting (candidate) and one column for each metric. Size (n_fits/n_splits, n_metrics)

(
search_scores_train_set_ - If needed (Ref #1742)
search_scores_aggregated_train_set_ - If needed
)

train_times_ - 1D numpy array recording the training time for each evaluation. Size (n_fits,)

test_times_ - 1D numpy array recording the testing time for each evaluation. Size (n_fits,)

test_set_size_ - 1D numpy array recording the test set size for each split. Size (n_fits,)

metric_names_ - The sorted list of all the metrics (equivalently, the column headers of the search_scores_* arrays). Size (n_metrics,)

candidates_/parameter_settings_ - A 2D numpy object array with one row per candidate (parameter setting). Columns correspond to the sorted list of parameters. Size (n_fits/n_splits, n_parameters)

parameter_names_ - The sorted list of parameters (equivalently, the column headers for the candidates_/parameter_settings_ array). Size (n_parameters,)

  • And will expose -

get_search_results()/iter_search_results()/__iter__ - Return or yield a list of dicts, as Olivier proposes here.

get_search_results_pandas() - To get a pandas dataframe of the search results. Similar to tabulate_results.

Standard deviation and a few other statistics on the scores, as attempted by Andy at #1034, to help visualize the search results better.

__repr__ - Tabulates the top 10 candidates.
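To make the proposal concrete, here is a minimal sketch of how such a `SearchResults` container might look. Attribute names and shapes follow this RFC; everything else (the constructor signature, the candidate-major row ordering via `divmod`, the dict keys yielded by `__iter__`) is my own illustrative assumption, not an actual implementation.

```python
import numpy as np


class SearchResults:
    """Sketch of the proposed container (illustrative, not real sklearn API)."""

    def __init__(self, search_scores, metric_names, candidates,
                 parameter_names, n_splits):
        # (n_fits, n_metrics): one row per (candidate, split) evaluation.
        self.search_scores_ = np.asarray(search_scores, dtype=float)
        # Sorted names; assumes the columns already follow this order.
        self.metric_names_ = sorted(metric_names)
        # (n_candidates, n_parameters) object array of parameter values.
        self.candidates_ = np.asarray(candidates, dtype=object)
        self.parameter_names_ = sorted(parameter_names)
        self.n_splits_ = n_splits
        # Aggregate over splits: mean score per candidate.
        n_candidates = self.search_scores_.shape[0] // n_splits
        self.search_scores_aggregated_ = (
            self.search_scores_
            .reshape(n_candidates, n_splits, -1)
            .mean(axis=1))  # (n_fits/n_splits, n_metrics)

    def __iter__(self):
        # Yield one dict per evaluation, roughly in the spirit of the
        # list-of-dicts output discussed in the thread.
        for fit_id, row in enumerate(self.search_scores_):
            candidate_id, split_id = divmod(fit_id, self.n_splits_)
            params = dict(zip(self.parameter_names_,
                              self.candidates_[candidate_id]))
            yield {"candidate_id": candidate_id, "split_id": split_id,
                   "parameters": params,
                   **dict(zip(self.metric_names_, row))}


# Tiny usage example: 2 candidates x 2 splits, one metric.
res = SearchResults(
    search_scores=[[0.80], [0.83], [0.84], [0.86]],
    metric_names=["accuracy"],
    candidates=[[1.0], [10.0]],
    parameter_names=["C"],
    n_splits=2)
print(res.search_scores_aggregated_.shape)  # (2, 1)
```

A `get_search_results_pandas()` on top of this could then be little more than building a DataFrame from the dicts that `__iter__` yields.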


##### Previous discussion (chronological)

#1020 - Andy raised an issue to make `grid_scores_` more usable.
#1034 - Andy proposes to fix #1020 with a new `ResultGrid` class that stores `scores`, `params` and `values` - stalled when `RandomizedSearchCV` was introduced and the PR became outdated.

This is where Joel proposed tabulate_results to return the scores in a more presentable manner.

ML thread -
#1742 - Andy adds recording training_scores_.
#1768 & #1787 - Joel adds grid_results_/search_results_ as structured arrays with one row per parameter setting (candidate), and fold_results_ as a structured array with one row per split (fold) per parameter setting.

This is where Olivier proposes the search output as a list of dict with keys (parameter_id, fold_id, parameters (as dicts), train_fold_size, test_fold_size, ...)

This is where I got the name search_results (or SearchResults) from.
#2079 Joel's update (based on comments at #1787). Stalled as the PR became outdated.


@MechCoder @jnothman @amueller @vene @agramfort @AlexanderFabisch @mblondel @GaelVaroquaux
