DOC Improve User Guide for metadata routing #27282

Merged
merged 15 commits into from
Feb 14, 2024

Conversation

StefanieSenger
Contributor

@StefanieSenger StefanieSenger commented Sep 3, 2023

This PR aims to improve the Metadata Routing section in the User Guide for clarity and readability.

Edit:
This was a draft before, but is now ready for review.

github-actions bot commented Sep 3, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 492372b. Link to the linter CI: here

@StefanieSenger StefanieSenger changed the title DOC User Guide for metadata routing DOC Improve User Guide for metadata routing Sep 6, 2023
Comment on lines 30 to 41
`fit`,
`fit_predict`,
`fit_transform`,
`partial_fit`,
`transform`,
`inverse_transform`,
`predict`,
`predict_log_proba`,
`predict_proba`,
`decision_function`,
`score`,
and `split`.
Member

you can put these in a bullet list

Contributor Author

I've deleted this list, because it's only a list of methods that some objects in scikit-learn have right now. It's better to mention the most common methods instead and hint that these are just examples.

Comment on lines 91 to 93
>>> weighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True)
>>> lr = LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc,)
>>> lr.set_fit_request(sample_weight=True)
Member

why the format change? it's fine as is, isn't it?

Contributor Author

@StefanieSenger StefanieSenger Jan 2, 2024

Alright, let's keep it the way it was.

Setting request values for metadata are only required if the object, e.g. estimator,
scorer, etc., is a consumer of that metadata Unlike
Setting request values for metadata is only required if the object, e.g. estimator,
scorer, etc., is a potential :term:`consumers <consumer>` of that metadata. Unlike
Member

not just "potential", they need to set it if they are its consumer and the metadata is passed. Otherwise an error will be raised.

Contributor Author

Yes, what I added was misleading. Though it's important to somehow express the idea.

What I meant is that whatever object the user calls set_{method}_request(metadata=True) on has to know what to do with this metadata.

I think I have now found a good way of expressing this, by using language that makes it clear that the user defines what a consumer is. Please have a look, @adrinjalali.

Contributor Author

@StefanieSenger StefanieSenger left a comment

I have reviewed the whole document, @adrinjalali, and I kindly ask you to consider my feedback seriously and not brush it off. The current state of it is incomprehensible to users.
(Edit: that was a little too strong and also not entirely true, sorry for this.)

One more general comment:
The Bug/Feature that set_{method}_request NEEDS to be set by the user for all objects that can potentially consume sample_weight, if they route sample_weight at all, is explained both in lines 113-135 and in lines 225-247.

What do you think of making it into one section?

Also, how is a user supposed to know which other object in their router might consume sample_weight as well? I propose adding information to the corresponding error message about which method of which object is not configured (see comment for this).


Comment on lines 37 to 39
method that request the metadata. For instance, estimators and splitters, that use the
metadata in their `fit()` method would use `set_fit_request()`, and scorers would use
the `set_score_request()`.
Member

For splitters it would have been set_split_request, but groups is requested by default in our splitters, so I'd rather not mention it here to avoid more confusion.

Suggested change
method that request the metadata. For instance, estimators and splitters, that use the
metadata in their `fit()` method would use `set_fit_request()`, and scorers would use
the `set_score_request()`.
method that request the metadata. For instance, estimators that use the metadata in
their `fit()` method would provide a `set_fit_request()`, and scorers would use the
`set_score_request()`.

@StefanieSenger
Contributor Author

With some distance, I went through the .rst file and the comments here on the issue and did some reformulating. I'm content with it right now and there are no questions or open issues from my side.
@adrinjalali Would you review this again?

@StefanieSenger StefanieSenger marked this pull request as ready for review January 2, 2024 14:53
Member

@adrinjalali adrinjalali left a comment

As for code formatting, I really think we should leave the snippets as black formats them, since that's what code looks like these days.

via `set_{method}_request()` methods, where `{method}` is substituted by the name of the
method that request the metadata. For instance, estimators and splitters, that use the
metadata in their `fit()` method would use `set_fit_request()`, and scorers would use
the `set_score_request()`. These methods allow us to specify which metadata to request,
Member

Suggested change
the `set_score_request()`. These methods allow us to specify which metadata to request,
the `set_score_request()`. These methods allow you to specify which metadata to request,

Contributor Author

If we use "we" then we also need to use "us".

Contributor Author

@StefanieSenger StefanieSenger left a comment

@adrinjalali please have a short look into my comments.

Member

@adrinjalali adrinjalali left a comment

@OmarManzoor would you like to have a look?

@adrinjalali adrinjalali added this to the 1.5 milestone Jan 5, 2024
Member

@betatim betatim left a comment

Some thoughts from reading the new and improved guide.

Comment on lines 34 to 35
user explicitly passes it as a parameter. For instance, some methods in certain objects
can take into account `sample_weight`, `classes` or `groups` if provided by the user.
Member

Can we make the "For instance, ..." a concrete example? Right now it is a bit "sometimes, in some places you can pass something to someone who will do something". I think by saying "for instance" we allow ourselves to pick one example, and it is clear to the reader that there are many more. So I think picking the poster child example (groups?) and only describing that would make it easier for readers to understand what we are on about.

"For instance, when using GridSearchCV the sample_weight passed to fit is not taken into account when computing the scores of each hyper-parameter combination." (I don't know if this is true, just an example of the concrete example I'm imagining)

Member

We could then pick up this example for the explanation in the paragraph that starts on L40

Contributor Author

I agree, and I tried to find an easy example. Please have a look. :)

For me, breaking it down from the abstract word "metadata" to understanding that it's not invented by the metadata PRs but that it's already present in the code took several months. So it's crucial to define it very clearly here.


With the Metadata Routing API, we can transfer metadata to estimators, scorers, and CV
splitters using :term:`meta-estimators` (such as :class:`~pipeline.Pipeline` or
:class:`~model_selection.GridSearchCV`) or routing functions such as
Member

What is a "routing function"? I think we need to define this metadata routing specific term before using it. Or avoid using it.

Contributor Author

Let's avoid it then. I meant it more as an adjective than as a new term. This guide is already very term-heavy.

Comment on lines 43 to 49
:func:`~model_selection.cross_validate`. In order to pass metadata to a method like
``fit`` or ``score``, the object consuming the metadata, must *request* it. This is done
via `set_{method}_request()` methods, where `{method}` is substituted by the name of the
method that request the metadata. For instance, estimators that use the metadata in
their `fit()` method would use `set_fit_request()`, and scorers would use the
`set_score_request()`. These methods allow us to specify which metadata to request, for
instance `set_fit_request(sample_weight=True)`.
Member

I know usage examples are coming later on, but I've read this paragraph a few times now and still find it hard to parse. I think part of that is because it introduces a lot of new things and does so in the abstract. This is why I think having a concrete example would help: "You are optimising a `LogisticRegression` with a `GridSearchCV` and have `sample_weight`. In this case you need to call `set_fit_request(...)` on your `LogisticRegression` instance for the `sample_weight` metadata to be passed to it." (Again, I don't know if this is correct, so this is just an example of a concrete example.) Ideally we can continue to use the example from the previous paragraph.

... )

Note that in this example, :func:`~model_selection.cross_validate` routes ``my_weights``
to both the scorer and :class:`~linear_model.LogisticRegressionCV`.
Member

Can we add a sentence or two below to explain what would happen if we didn't use set_fit_request() and set_score_request() in the above example?

It is "trivial" if you know it, but I think it would help to illustrate the point and highlight that GroupKFold is different.

Contributor Author

Let's explicitly name the UnsetMetadataPassedError then. The next sentence about error handling was already referring to it.

It's not that trivial though. If you only forget to set_score_request(sample_weight=True) on the scorer, you won't get an error message (since your sample_weight has been used somewhere). There is no way to know where a user intended to use the routed metadata, except for when they forget to use the set_{method}_requests entirely. Not routing it to the scorer could be done intentionally.

I tried to put this in words, so that it's simple enough for that introductory section and still conveys how that error works.

If :meth:`linear_model.LogisticRegressionCV.set_fit_request` had not been called,
:func:`~model_selection.cross_validate` would raise an error because ``sample_weight``
is passed but :class:`~linear_model.LogisticRegressionCV` would not be explicitly
configured to recognize the weights.
Member

After reading this I am wondering why in the previous example we did not have to explicitly configure set_score_request(groups=False, ...) and set_fit_request(groups=False, ...). The logic seems to be that if a particular metadata is passed in as part of params I have to configure what each thing should do with it?

Contributor Author

The user is supposed to know what metadata their scorer (and their other objects used via cross_validate) is able to consume and what they want to route. This scorer cannot consume groups nor can LogisticRegressionCV.

And only metadata that their objects can consume needs to be set as requested or not requested. If they forget to set some metadata as requested or unrequested, they get an error message that explicitly states what is missing and where.

I agree that there is quite a threshold to enable people to use metadata routing.

I wonder if adding 2-3 sentences right at the beginning of the usage example section, stating that users need to be clear which metadata can be consumed by their objects and have a little "plan", helps with this, or if it only adds clutter. After all, most users would not have the goal "use metadata routing" in mind, but rather a "how can I make sure the same sample_weight is used everywhere" kind of question, and then find metadata routing as a solution.

example, the following code raises an error, since it hasn't been explicitly
specified whether ``sample_weight`` should be passed to the estimator's scorer
or not::
If a metadata, e.g. ``sample_weight``, is passed by the user, the metadata request for
Member

I've been wondering what word to use to refer to one of the entries in the dictionary that is passed as params=.

I think metadata is like data in that there is no plural. "The data show ...", "The author, title and page number are contained in the metadata". This means "a metadata" reads weird. How about "If an item of metadata, e.g. `sample_weight`, is present in the metadata, the requests for all objects which can consume `sample_weight` should be configured by the user" (I also made some other changes to the sentence, just suggestions).

"An item of metadata" is the best I could come up with. I like it because in Python one entry in a dictionary (key and value) is already known as an item. This means the whole dictionary {"sample_weight": ..., "groups": ...} would be "the metadata" and one item of it (e.g. the weights) would be "an item of metadata".

You could argue that the weights by themselves are metadata, but then we need a term for the weights and groups together. Maybe "the set of metadata", but I like that less than my proposal above.

WDYT?

Contributor Author

What about "some metadata" (and then use plural)?

"set of" or "item of" both make me wonder what exactly is meant. And maybe we don't need to distinguish different quantities here.

Contributor

@OmarManzoor OmarManzoor left a comment

Thanks @StefanieSenger. Other than a few minor comments and the suggestions by @betatim , this LGTM.

Co-authored-by: Tim Head <[email protected]>
Co-authored-by: Omar Salman <[email protected]>
Contributor Author

@StefanieSenger StefanieSenger left a comment

Thank you, @betatim and @OmarManzoor, your perspectives led to improvements.
I've implemented some suggestions and replied to some others.

@adrinjalali
Member

@betatim maybe another review? Would be nice to get this doc in.

@adrinjalali
Member

@betatim @OmarManzoor another ping :)

Contributor

@OmarManzoor OmarManzoor left a comment

LGTM. Thanks @StefanieSenger

@adrinjalali adrinjalali merged commit 206c434 into scikit-learn:main Feb 14, 2024
@StefanieSenger StefanieSenger deleted the doc_metadata_routing branch February 16, 2024 13:10
4 participants