[RFC] Better Format for search results in model_selection module. #6686
I am hoping this is not a complex solution: we store all the scores, times and sizes as separate compact numpy arrays, saving time and space, but expose the list-of-dicts format that we want users to see as a property or a generator function.
And we can also expose a pandas dataframe so that users can further exploit the search results (and provide additional statistics on top of it).
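A minimal sketch of that idea, assuming compact per-fit arrays internally and a lazy list-of-dicts (or dataframe) view on top. The names and fields here are illustrative assumptions, not an actual scikit-learn API:

```python
import numpy as np


def iter_search_results(parameters, test_scores, train_times):
    """Yield one dict per evaluation, built lazily from compact arrays.

    Illustrative only: names and fields are assumptions, not a real API.
    """
    for params, score, time in zip(parameters, test_scores, train_times):
        yield {'parameters': params, 'test_score': score, 'train_time': time}


# Compact storage ...
params = [{'C': 1.0}, {'C': 10.0}]   # one parameter dict per fit
scores = np.array([0.91, 0.88])      # one aggregated score per fit
times = np.array([0.12, 0.11])       # training time per fit, in seconds

# ... and the list-of-dicts (or dataframe) view is built only on request.
as_dicts = list(iter_search_results(params, scores, times))
# import pandas as pd; pd.DataFrame(as_dicts)   # optional dataframe view
```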
I think we should create a dict of arrays that the user can turn into a dataframe. Your list of dicts doesn't have that problem, but what is the result when a parameter grid is given as a list of dicts and not every parameter is set for every candidate? We could have a placeholder value for parameters that are not active, like `"not_active"`. Instead of the placeholder, we might want to use masked arrays, which wouldn't need the dtype conversion for missing values. @rvraghav93 can you check how well masked arrays can be converted to dataframes?
Looks promising: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-a-series

So why not a dict of masked arrays? I think actually having masks is a "rare" case, so most people don't have to worry about it too much. And if people actually convert to data frames, it should all be fine.
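A quick sketch of the conversion being asked about, with made-up parameter columns. pandas treats the mask of a masked array as a missing-value indicator, so masked entries show up as NaN in the frame while the original arrays keep their masks for numpy-side use:

```python
import numpy as np
import pandas as pd

# 'degree' is only active for the 'poly' candidate, so it is masked elsewhere.
results = {
    'kernel': np.array(['poly', 'rbf', 'rbf']),
    'degree': np.ma.array([2, 0, 0], mask=[False, True, True]),
    'C': np.ma.array([1.0, 1.0, 10.0]),
}

# Masked entries become NaN in the dataframe view.
df = pd.DataFrame({name: pd.Series(col) for name, col in results.items()})
print(df)
```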
Thanks heaps for your comments! OK, if I understand your comment correctly, are you suggesting we have one axis per parameter, and whichever point (combination of parameters) doesn't set that parameter is masked? Indeed this would make gathering statistics much easier.
One key in the dictionary or one column in the dataframe for each parameter. And if the parameter is not set, this is masked (not NaN).
You mean left to the default value?
I'm guessing @amueller means "when a parameter is not a part of the parameters being used in this current iteration", a case that can emerge when using lists-of-dicts. Otherwise there is no real interface (or any point, really) to leave a parameter at its default value. So am I right that we want something like the layout sketched below, where dashes are masked values and each column is an array (e.g. {'kernel': ['poly', 'poly', 'rbf', 'rbf', 'rbf'], ...})?
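To make that layout concrete, a rough sketch with invented parameter values: a list-of-dicts grid flattened into one masked-array column per parameter, masked wherever the parameter is not part of the candidate:

```python
import numpy as np

# Hypothetical grid:
#   [{'kernel': ['poly'], 'degree': [2, 3]},
#    {'kernel': ['rbf'], 'gamma': [0.1, 1.0, 10.0]}]
# which flattens to 5 candidates. '--' marks a masked (inactive) parameter.
results = {
    'kernel': np.ma.array(['poly', 'poly', 'rbf', 'rbf', 'rbf']),
    'degree': np.ma.array([2, 3, 0, 0, 0],
                          mask=[False, False, True, True, True]),
    'gamma': np.ma.array([0.0, 0.0, 0.1, 1.0, 10.0],
                         mask=[True, True, False, False, False]),
}

print(results['degree'])   # [2 3 -- -- --]  (masked, not NaN, so dtype stays int)
```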
Oh, so we are trying to capture the grid/random search process? (This way we can stop early or warm start too?) In that case, why not simply repeat 'bar' in that ndarray, so as to preserve the information that we are not changing that parameter in the current iteration?
@vene Ah, your edited comment explains it clearly! Thanks!
Could we also have a dict element
No strong opinion, but that should be pretty straightforward to do if you have a column that has the
I haven't thought deeply about any possible disadvantages, but in any case @amueller's proposal of a dict of masked arrays seems great. As for masked arrays, if someone wants to operate on the raw numpy arrays directly (without converting to pandas), masked values are friendlier than NaNs.
@MechCoder is right, we absolutely don't want to require pandas, and NaNs confuse argmin/argmax computations. That, plus the same reasons we ran into in the missing-value representation discussion, makes NaNs not a fun choice IMO.
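A small illustration (with made-up scores) of the argmax issue that masked arrays avoid:

```python
import numpy as np

scores_with_nan = np.array([0.81, np.nan, 0.93])
scores_masked = np.ma.array([0.81, 0.0, 0.93], mask=[False, True, False])

print(np.argmax(scores_with_nan))  # 1 -- the NaN "wins", which is not what we want
print(np.argmax(scores_masked))    # 2 -- the masked entry is simply ignored
```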
Thanks @MechCoder @vene for the inputs!
An attempt to solve this - #6697
Currently we store -

`grid_scores_` - for both `GridSearchCV` and `RandomizedSearchCV`. This holds a list of `_CVScoreTuple`s. The `_CVScoreTuple` holds the `parameters` (parameter setting dict for one candidate), the `score` (aggregated) and `all_scores` for all the cv splits for that setting as a numpy array.
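For illustration, a rough sketch of what that currently looks like. The field names follow the paraphrase above; the real private namedtuple in scikit-learn may name its attributes differently:

```python
from collections import namedtuple

import numpy as np

# Stand-in for the private _CVScoreTuple, using the field names described above.
CVScoreTuple = namedtuple('CVScoreTuple', ['parameters', 'score', 'all_scores'])

grid_scores = [
    CVScoreTuple({'kernel': 'rbf', 'C': 1.0}, 0.91, np.array([0.90, 0.93, 0.90])),
    CVScoreTuple({'kernel': 'rbf', 'C': 10.0}, 0.88, np.array([0.87, 0.90, 0.87])),
]

best = max(grid_scores, key=lambda t: t.score)   # picking the best candidate
```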
What I propose -

Could we have a separate `SearchResults` class (similar to the one proposed by Andy at #1034)? One that stores (a rough sketch follows the list):

- `search_scores_` - A 2D numpy array with one row for one evaluation (one split/fold) and one column for every metric that we evaluate the predictions upon. Size `(n_fits, n_metrics)`
- `search_scores_aggregated_` - A 2D numpy array with one row per parameter setting (candidate) and one column for each metric. Size `(n_fits/n_splits, n_metrics)`
- (`search_scores_train_set_` - If needed (Ref #1742); `search_scores_aggregated_train_set_` - If needed)
- `train_times_` - A 1D numpy array recording the training times for each evaluation. Size `(n_fits,)`
- `test_times_` - A 1D numpy array for the test times. Size `(n_fits,)`
- `test_set_size_` - A 1D numpy array recording the test set size for each split. Size `(n_fits,)`
- `metric_names_` - The sorted list of all the metrics (equivalently the column header of the `search_scores*` arrays). Size `(n_metrics,)`
- `candidates_` / `parameter_settings_` - A 2D numpy object array with one row per candidate (parameter setting). Columns correspond to the sorted list of parameters. Size `(n_fits/n_splits, n_parameters)`
- `parameters_names_` - The sorted list of parameters (equivalently the column header for the `candidates_` / `parameter_settings_` array). Size `(n_parameters,)`
- `get_search_results()` / `iter_search_results()` / `__iter__` - Return or yield a list of dicts as Olivier proposes here.
- `get_search_results_pandas()` - To get a pandas dataframe of the search results. Similar to `tabulate_results`.
- `std` / a few other statistics on the scores, as attempted by Andy at #1034, to help visualize the search results better.
- `__repr__` - That will tabulate the top 10 candidates.
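To make the attribute list above concrete, here is a rough, non-authoritative sketch of how such a container could hang together. The constructor signature, the candidate-major row ordering (all splits of candidate 0 first, then candidate 1, and so on), and the aggregation by simple mean are assumptions made for illustration, not part of the proposal:

```python
import numpy as np


class SearchResults(object):
    """Sketch of the proposed container; assumes ``search_scores_`` rows
    are ordered candidate-major (all splits of a candidate are contiguous)."""

    def __init__(self, search_scores, metric_names, candidates,
                 parameter_names, train_times, test_times,
                 test_set_size, n_splits):
        self.search_scores_ = np.asarray(search_scores)   # (n_fits, n_metrics)
        self.metric_names_ = list(metric_names)            # matches score columns
        self.candidates_ = candidates                      # (n_fits/n_splits, n_parameters)
        self.parameters_names_ = list(parameter_names)     # matches candidate columns
        self.train_times_ = np.asarray(train_times)        # (n_fits,)
        self.test_times_ = np.asarray(test_times)          # (n_fits,)
        self.test_set_size_ = np.asarray(test_set_size)    # (n_fits,)
        self.n_splits = n_splits

    @property
    def search_scores_aggregated_(self):
        # Mean over the cv splits of each candidate -> (n_fits/n_splits, n_metrics).
        n_candidates = self.search_scores_.shape[0] // self.n_splits
        return self.search_scores_.reshape(
            n_candidates, self.n_splits, -1).mean(axis=1)

    def __iter__(self):
        # The list-of-dicts view: one dict per (candidate, split) evaluation.
        for i in range(self.search_scores_.shape[0]):
            candidate = i // self.n_splits
            yield {
                'parameter_id': candidate,
                'fold_id': i % self.n_splits,
                'parameters': dict(zip(self.parameters_names_,
                                       self.candidates_[candidate])),
                'scores': dict(zip(self.metric_names_, self.search_scores_[i])),
                'train_time': self.train_times_[i],
                'test_time': self.test_times_[i],
                'test_set_size': self.test_set_size_[i],
            }

    def get_search_results_pandas(self):
        # Optional pandas view; pandas is only imported if this is called.
        import pandas as pd
        return pd.DataFrame(list(self))
```

With such a container, `list(results)` would give the list-of-dicts format and `results.get_search_results_pandas()` the dataframe view discussed earlier in the thread.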
##### Previous discussion (chronological) -

- #1020 - Andy raised an issue to make the `grid_scores_` more usable.
- #1034 - Andy proposes to fix #1020 by a new `ResultGrid` class which stores `scores`, `params` and `values` - got stalled as `RandomizedSearchCV` was introduced and the PR became outdated. This is where Joel had proposed `tabulate_results` to return the scores in a more presentable manner.
- ML thread -
- #1742 - Andy adds recording of `training_scores_`.
- #1768 & #1787 - Joel adds `grid_results_` / `search_results_` as structured arrays with one row per parameter setting (candidate), and `fold_results_` as a structured array with one row per split (fold) per parameter setting. This is where Olivier proposes the search output as a list of dicts with keys (`parameter_id`, `fold_id`, `parameters` (as dicts), `train_fold_size`, `test_fold_size`, ...). This is from where I got the name `search_results` (or `SearchResults`).
- #2079 - Joel's update (based on comments at #1787). Stalled as the PR became outdated.
@MechCoder @jnothman @amueller @vene @agramfort @AlexanderFabisch @mblondel @GaelVaroquaux