-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
DOC Improve User Guide for metadata routing #27282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC Improve User Guide for metadata routing #27282
Conversation
doc/metadata_routing.rst
Outdated
`fit`, | ||
`fit_predict`, | ||
`fit_transform`, | ||
`partial_fit`, | ||
`transform`, | ||
`inverse_transform`, | ||
`predict`, | ||
`predict_log_proba`, | ||
`predict_proba`, | ||
`decision_function`, | ||
`score`, | ||
and `split`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can put these in a bullet list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've deleted this list, because it's only a list of methods some objects in scikit-learn have right now. It's better to mention most common methods instead and hint that there are just examples.
doc/metadata_routing.rst
Outdated
>>> weighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True) | ||
>>> lr = LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc,) | ||
>>> lr.set_fit_request(sample_weight=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the format change? it's fine as is, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, let's keep it the way it was.
doc/metadata_routing.rst
Outdated
Setting request values for metadata are only required if the object, e.g. estimator, | ||
scorer, etc., is a consumer of that metadata Unlike | ||
Setting request values for metadata is only required if the object, e.g. estimator, | ||
scorer, etc., is a potential :term:`consumers <consumer>` of that metadata. Unlike |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not just "potential", they need to set it if they are its consumer and the metadata is passed. Otherwise an error will be raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what I added was misleading. Though it's important to somehow express the idea.
What I meant is that whatever the user sets set_{method}_request(metadata=True)
on a method, it has to know what to do with this metadata.
I think I now found a good way of expressing this by using language that make it clear, that the user defines what a consumer is. Please have a look, @adrinjalali.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed the whole document, @adrinjalali, and I kindly ask you to consider my feedback seriously and not brush it off. The current state of it is incomprehensible to users.
(Edit: that was a little too strong and also not entirely true, sorry for this.)
One more general comment:
The Bug/Feature that the set_{method}_request
for all objects which potentially can consume sample_weight
NEEDS to be set by the user if they route sample_weight
at all is explained both in li. 113-135 and in li. 225-247.
What do you think of making it into one section?
Also, how is a user supposed to know which other object in their router might consume sample_weight
as well? I propose to add to the corresponding error message and information about which method of which object is not configured (see comment for this).
doc/metadata_routing.rst
Outdated
Setting request values for metadata are only required if the object, e.g. estimator, | ||
scorer, etc., is a consumer of that metadata Unlike | ||
Setting request values for metadata is only required if the object, e.g. estimator, | ||
scorer, etc., is a potential :term:`consumers <consumer>` of that metadata. Unlike |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what I added was misleading. Though it's important to somehow express the idea.
What I meant is that whatever the user sets set_{method}_request(metadata=True)
on a method, it has to know what to do with this metadata.
I think I now found a good way of expressing this by using language that make it clear, that the user defines what a consumer is. Please have a look, @adrinjalali.
doc/metadata_routing.rst
Outdated
method that request the metadata. For instance, estimators and splitters, that use the | ||
metadata in their `fit()` method would use `set_fit_request()`, and scorers would use | ||
the `set_score_request()`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for splitters it would have been set_split_request
, but we have it the default to be requested for groups
in our splitters, so rather not mention it here to avoid more confusion.
method that request the metadata. For instance, estimators and splitters, that use the | |
metadata in their `fit()` method would use `set_fit_request()`, and scorers would use | |
the `set_score_request()`. | |
method that request the metadata. For instance, estimators that use the metadata in | |
their `fit()` method would provide a `set_fit_request()`, and scorers would use the | |
`set_score_request()`. |
With some distance, I went through the .rst file and the comments here on the issue and did some reformulating. I'm content with it right now and there are no questions or open issues from my side. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for code formatting, I really think we should leave them as black formats them, since that's how the code looks like these days.
doc/metadata_routing.rst
Outdated
via `set_{method}_request()` methods, where `{method}` is substituted by the name of the | ||
method that request the metadata. For instance, estimators and splitters, that use the | ||
metadata in their `fit()` method would use `set_fit_request()`, and scorers would use | ||
the `set_score_request()`. These methods allow us to specify which metadata to request, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the `set_score_request()`. These methods allow us to specify which metadata to request, | |
the `set_score_request()`. These methods allow you to specify which metadata to request, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use "we" then we also need to use "us".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adrinjalali please have a short look into my comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OmarManzoor would like to have a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts from reading the new and improved guide.
doc/metadata_routing.rst
Outdated
user explicitly passes it as a parameter. For instance, some methods in certain objects | ||
can take into account `sample_weight`, `classes` or `groups` if provided by the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the "For instance, ..." a concrete example? Right now it is a bit "sometimes, in some places you can pass something to someone who will do something". I think by saying "for instance" we allow ourselves to pick one example and it is clear to the reader that there are many more. So I think picking the poster child example (groups?) and only describing that would be make it easier for readers to understand what we are on about.
"For instance, when using GridSearchCV
the sample_weight
passed to fit
is not taken into account when computing the scores of each hyper-parameter combination." (I don't know if this is true, just an example of the concrete example I'm imagining)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could then pick up this example for the explanation in the paragraph that starts on L40
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and I tried to find an easy example. Please have a look. :)
For me, breaking it down from the abstract word "metadata" to understanding that it's not invented by the metadata PRs but that it's already present in the code, took several months. So it's crucial to define is very clearly here.
doc/metadata_routing.rst
Outdated
|
||
With the Metadata Routing API, we can transfer metadata to estimators, scorers, and CV | ||
splitters using :term:`meta-estimators` (such as :class:`~pipeline.Pipeline` or | ||
:class:`~model_selection.GridSearchCV`) or routing functions such as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a "routing function"? I think we need to define this metadata routing specific term before using it. Or avoid using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid it then. I meant it more as an adjective than making up a new term. This guide is already very term intense.
doc/metadata_routing.rst
Outdated
:func:`~model_selection.cross_validate`. In order to pass metadata to a method like | ||
``fit`` or ``score``, the object consuming the metadata, must *request* it. This is done | ||
via `set_{method}_request()` methods, where `{method}` is substituted by the name of the | ||
method that request the metadata. For instance, estimators that use the metadata in | ||
their `fit()` method would use `set_fit_request()`, and scorers would use the | ||
`set_score_request()`. These methods allow us to specify which metadata to request, for | ||
instance `set_fit_request(sample_weight=True)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know usage examples are coming later on, but I've read this paragraph a few times now and still find it hard to parse. I think part of that is because it introduces a lot of new things and does so in the abstract. This is why I think having the example as a concrete one ("You are optimising a LogisticRegression
with a GridearchCV
and have sample_weight
. In this case you need to call set_fit_request(...) on your
LogisticRegressioninstance for the
sample_weight` metadata to be passed to it." (again, I don't know if this is correct, so this is just an example of a concrete example). Ideally we can continue to use the example from the previous paragraph.
... ) | ||
|
||
Note that in this example, :func:`~model_selection.cross_validate` routes ``my_weights`` | ||
to both the scorer and :class:`~linear_model.LogisticRegressionCV`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a sentence or two below to explain what would happen if we didn't use set_fit_request()
and set_score_request()
in the above example?
It is "trivial" if you know it, but I think it would help to illustrate the point and highlight that GroupKFold
is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's explicitly name the UnsetMetadataPassedError
then. The next sentence about error handling was already referring to it.
It's not that trivial though. If you only forget to set_score_request(sample_weight=True)
on the scorer, you won't get an error message (since your sample_weight
has been used somewhere). There is no way to know where a user intended to use the routed metadata, except for when they forget to use the set_{method}_requests
entirely. Not routing it to the scorer could be done intentionally.
I tried to put this in words, so that it's simple enough for that introductory section and still conveys how that error works.
If :meth:`linear_model.LogisticRegressionCV.set_fit_request` had not been called, | ||
:func:`~model_selection.cross_validate` would raise an error because ``sample_weight`` | ||
is passed but :class:`~linear_model.LogisticRegressionCV` would not be explicitly | ||
configured to recognize the weights. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading this I am wondering why in the previous example we did not have to explicitly configure set_score_request(groups=False, ...)
and set_fit_request(groups=False, ...)
. The logic seems to be that if a particular metadata is passed in as part of params
I have to configure what each thing should do with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user is supposed to know what metadata their scorer (and their other objects used via cross_validate
) is able to consume and what they want to route. This scorer cannot consume groups
nor can LogisticRegressionCV
.
And only metatdata that their objects can consume need to be set as requested or not requested. If they forget to set a metadata as requested or unrequested, they get an error message that explicitly states, what and where.
I agree that there is quite a threshold to enable people to use metadata routing.
I wonder if adding 2-3 sentences right in the beginning of the usage example section stating that users need to be clear which metadata can be consumed by their objects and have a little "plan" helps with this, or if it only adds clutter. After all, most users would not have the goal "user metadata routing" in mind, but rather "how can i make sure the same sample_weight is used everywhere" kind of question, and then find metadata routing as a solution.
example, the following code raises an error, since it hasn't been explicitly | ||
specified whether ``sample_weight`` should be passed to the estimator's scorer | ||
or not:: | ||
If a metadata, e.g. ``sample_weight``, is passed by the user, the metadata request for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been wondering what word to use to refer to one of the entries in the dictionary that is passed as params=
.
I think metadata is like data in that there is no plural. "The data show ...", "The author, title and page number are contained in the metadata". This means "a metadata" reads weird. How about "If an item of metadata, e.g. sample weight
, is present in the metadata, the requests for all objects which can consume sample_weight
should be configured by the user" (I also made some other changes to the sentence, just suggestions).
"an item of metadata" is the best I could come up with, I like it because in Python one entry in a dictionary (key and value) is already known as an item. This means the whole dictionary {"sample_weight: ..., "groups": ..}
would be "the metadata" and one item of it (e.g. the weights) would be "an item of metadata".
You could argue that the weights by themselves are metadata, but then we need a term for the weights and groups together. Maybe "the set of metadata", but I like that less than my proposal above.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about "some metadata" (and then use plural)?
"set of" or "item of" both make me wonder what exactly is meant. And maybe we don't need to distinguish different quantities here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @StefanieSenger. Other than a few minor comments and the suggestions by @betatim , this LGTM.
Co-authored-by: Tim Head <[email protected]> Co-authored-by: Omar Salman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @betatim and @OmarManzoor, your perspectives lead to improvement.
I've implemented some suggestions and replied to some others.
doc/metadata_routing.rst
Outdated
user explicitly passes it as a parameter. For instance, some methods in certain objects | ||
can take into account `sample_weight`, `classes` or `groups` if provided by the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and I tried to find an easy example. Please have a look. :)
For me, breaking it down from the abstract word "metadata" to understanding that it's not invented by the metadata PRs but that it's already present in the code, took several months. So it's crucial to define is very clearly here.
doc/metadata_routing.rst
Outdated
|
||
With the Metadata Routing API, we can transfer metadata to estimators, scorers, and CV | ||
splitters using :term:`meta-estimators` (such as :class:`~pipeline.Pipeline` or | ||
:class:`~model_selection.GridSearchCV`) or routing functions such as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid it then. I meant it more as an adjective than making up a new term. This guide is already very term intense.
... ) | ||
|
||
Note that in this example, :func:`~model_selection.cross_validate` routes ``my_weights`` | ||
to both the scorer and :class:`~linear_model.LogisticRegressionCV`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's explicitly name the UnsetMetadataPassedError
then. The next sentence about error handling was already referring to it.
It's not that trivial though. If you only forget to set_score_request(sample_weight=True)
on the scorer, you won't get an error message (since your sample_weight
has been used somewhere). There is no way to know where a user intended to use the routed metadata, except for when they forget to use the set_{method}_requests
entirely. Not routing it to the scorer could be done intentionally.
I tried to put this in words, so that it's simple enough for that introductory section and still conveys how that error works.
If :meth:`linear_model.LogisticRegressionCV.set_fit_request` had not been called, | ||
:func:`~model_selection.cross_validate` would raise an error because ``sample_weight`` | ||
is passed but :class:`~linear_model.LogisticRegressionCV` would not be explicitly | ||
configured to recognize the weights. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user is supposed to know what metadata their scorer (and their other objects used via cross_validate
) is able to consume and what they want to route. This scorer cannot consume groups
nor can LogisticRegressionCV
.
And only metatdata that their objects can consume need to be set as requested or not requested. If they forget to set a metadata as requested or unrequested, they get an error message that explicitly states, what and where.
I agree that there is quite a threshold to enable people to use metadata routing.
I wonder if adding 2-3 sentences right in the beginning of the usage example section stating that users need to be clear which metadata can be consumed by their objects and have a little "plan" helps with this, or if it only adds clutter. After all, most users would not have the goal "user metadata routing" in mind, but rather "how can i make sure the same sample_weight is used everywhere" kind of question, and then find metadata routing as a solution.
example, the following code raises an error, since it hasn't been explicitly | ||
specified whether ``sample_weight`` should be passed to the estimator's scorer | ||
or not:: | ||
If a metadata, e.g. ``sample_weight``, is passed by the user, the metadata request for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about "some metadata" (and then use plural)?
"set of" or "item of" both make me wonder what exactly is meant. And maybe we don't need to distinguish different quantities here.
@betatim maybe another review? Would be nice to get this doc in. |
@betatim @OmarManzoor another ping :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @StefanieSenger
This PR aims to improve the Metadata Routing section in the User Guide for clarity and readability.
Edit:
This was a draft before, but is now ready for review.