TST check if docstring items are equal between objects (functions, classes, etc.) #28678


Merged: 37 commits into scikit-learn:main on Sep 5, 2024

Conversation

@lucyleeow (Member) commented Mar 22, 2024

Reference Issues/PRs

closes #9388
closes #10323 (supersedes)

What does this implement/fix? Explain your changes.

Adds a test that checks that items in parameters/attributes/returns sections of objects are the same. Builds on #10323

  1. Checks the type/description string in all objects and groups objects that have the same string; e.g., it will tell you: param 'a' is different between ['obj1', 'obj2'] and ['obj3'] and ['obj4'], etc. The previous PR just iterated and would only tell you whether the next item differs from the previous item. This is more complex, but I thought the extra info is useful; e.g., "4 objects are the same and 1 is different" is better than "obj 3 is different from obj 2".
  2. If an item does not exist in all objects, it is skipped but a warning is given (suggested by Joel here) - I was not sure what is best to do here, open to change.
  3. Parameter meanings - I have followed what Joel suggested here.
    • incl and excl mutually exclusive
    • incl is False by default (I thought this was better: the user has to explicitly turn it on, and it is less typing, as I think people will usually not want to check all three sections; but if you want to exclude, you need to set incl to True)
    • incl True and excl None means check all items
  4. Normalise for whitespace before/after and in between words
  5. Add a test for classification metrics (just to show its use) - added a 'versionchanged' to the labels param of precision_recall_fscore_support, as I can see they were all updated together in this commit

I've also added a skip fixture to skip tests if numpydoc is not installed. Happy to change.

One problem still to solve: we accept NumpyDocString, but AFAICT there is no way to get the name of the original object. We are just naming these 'Object 1' here, which is not ideal. Joel suggested that we could accept (name, numpydocstring) tuples in objects. This would work but is not elegant.

Another solution is to use the numpydoc subclasses ClassDoc, FunctionDoc and ObjDoc. These store the original object in a private attribute (e.g., ClassDoc._cls, FunctionDoc._f). We could instead accept only these subclasses, and we'd be able to get the object name from the private attribute. BUT there is no specific data descriptor subclass. I don't see anywhere in scikit-learn where we have a data descriptor with param sections, so I wonder how useful the data descriptor case is?
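
For illustration, a minimal sketch of the ClassDoc/FunctionDoc idea (the _doc_name helper is hypothetical and relies on numpydoc's private attributes, so this is not a final design):

from numpydoc.docscrape import ClassDoc, FunctionDoc


def _doc_name(doc):
    """Best-effort name for a parsed numpydoc object."""
    if isinstance(doc, ClassDoc):
        return doc._cls.__name__  # ClassDoc keeps the original class in _cls
    if isinstance(doc, FunctionDoc):
        return doc._f.__name__  # FunctionDoc keeps the original function in _f
    # Plain NumpyDocString: no underlying object to name, fall back to a placeholder.
    return "Object"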

Any other comments?

Still need to add a test for NumpyDocString obj type.

cc @adrinjalali @Charlie-XIAO (and @jnothman just in case, as you reviewed the original)


github-actions bot commented Mar 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 15145b8.

@lucyleeow changed the title from "Add test to check if docstring items are equal in related objects" to "Add test to check if docstring items are equal" on Mar 22, 2024
@lucyleeow (Member Author)

The incl/excl params are a bit confusing so I made this table, which hopefully helps.

              excl = list              excl = None
incl = True   check all except excl    check all
incl = False  Error                    skip section
incl = list   Error                    check only incl
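
In code, the table above corresponds roughly to the following (these are the argument names under review here; the import path and final signature are assumptions and may differ in the merged version):

from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.utils._testing import assert_docstring_consistency

objs = [f1_score, precision_score, recall_score]

# incl=True, excl=None: check every item of the Parameters section;
# raises AssertionError if any checked item differs between the objects.
assert_docstring_consistency(objs, include_params=True)

# incl=True, excl=list: check all parameters except 'average'.
assert_docstring_consistency(objs, include_params=True, exclude_params=["average"])

# incl=list, excl=None: check only the named parameters.
assert_docstring_consistency(objs, include_params=["zero_division"])

# incl=False (default): the Parameters section is skipped entirely; as reviewed here,
# setting exclude_params while include_params is not True raises a TypeError.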

@Charlie-XIAO (Contributor) left a comment

Thanks for the PR @lucyleeow! I do not have much time at this moment so I only did some brief testing, and overall this looks nice. Here are some very general suggestions/concerns at first glance:

  • It might be nice to have a diff output, especially for long docstrings. Maybe using difflib.Differ and its compare method?

  • I'm worried that the current solution is not flexible enough, but I don't have a good solution. For instance, the average parameter of the metrics is almost the same across them but differs a bit, and we would have to exclude it from the comparison. Not sure what you and other maintainers think.

@lucyleeow (Member Author) commented Mar 23, 2024

It might be nice to have a diff output, especially for long docstrings. Maybe using difflib.Differ and its compare method?

Good idea, I didn't want to implement something complex myself. I will have a play with that. I think I would want to print only the line that is different (it would be a very long output for e.g., the average param), but as I join every line of the description together (to normalise for line breaks), I need to think if there is a way to do this.

I also need to deal with the situation when there are >2 different strings. I could just use the first one as reference and compare each of the others to the first. As long as I am able to only print the line/section that is different, this should be okay.

For instance, the average parameter of the metrics are almost the same but differs only a bit, and we would have to exclude it from the comparison.

I had the same thought. average mostly differs because for recall_score we add an extra line: "Weighted recall is equal to accuracy" (the other differences can be resolved). This particular case could be something we could account for. If difflib gives the indices that are different, we could look at them and match them against an input. Then the user would need to specify that for obj x, an added line of "xxx" is allowed?

There are other scenarios that would not be so easy to deal with, e.g., using a different word in the description ('metric' instead of 'score').
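
Illustrative only (not the code in this PR): a word-level context diff between two normalised descriptions, along the lines discussed above. The two description strings are made up for the example:

from difflib import context_diff

desc_a = "Calculate the metrics for each label and find their average."
desc_b = "Calculate the scores for each label and find their average."

diff = context_diff(
    desc_a.split(),  # split into words so the diff is word-by-word, not line-by-line
    desc_b.split(),
    fromfile="precision_recall_fscore_support",
    tofile="f1_score",
    n=1,  # one word of context around each change
    lineterm="",
)
print("\n".join(diff))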

@lucyleeow (Member Author)

I've had a go at printing diffs. I've used difflib.context_diff and compared between words (not characters). Here is an example output (note I've grouped words with "+" at the start into one line to shorten the message):

E               AssertionError: The description of Parameter 'average' is inconsistent between ['precision_recall_fscore_support'] and ['f1_score', 'fbeta_score', 'precision_score'] and ['recall_score']:
E               
E               
E               
E               *** ['precision_recall_fscore_support']
E               --- ['f1_score', 'fbeta_score', 'precision_score']
E               ***************
E               
E               *** 10,12 ****
E               
E                 the
E               ! metrics
E                 for
E               --- 10,12 ----
E               
E                 the
E               ! scores
E                 for
E               
E               *** ['precision_recall_fscore_support']
E               --- ['recall_score']
E               ***************
E               
E               *** 10,12 ****
E               
E                 the
E               ! metrics
E                 for
E               --- 10,12 ----
E               
E                 the
E               ! scores
E                 for
E               ***************
E               
E               *** 122,123 ****
E               
E               --- 122,129 ----
E               
E                 recall.
E               + Weighted recall is equal to accuracy.
E                 ``'samples'``:

@lucyleeow (Member Author)

This is getting a bit old but @adrinjalali do you think we're still interested in having this test?

@adrinjalali (Member) left a comment

A bit hard to follow the code, but I like the result.

ref_str = ""
ref_group = []
for docstring, group in gd.items():
    if not ref_str and not ref_group:
Member

does it make sense at this indentation to move things to another function? kinda hard for me to follow this method.

Member Author

Yeah I think the later additions were sort of proof of concept (to address #28678 (review)), to see if it was worth the complexity and if the output was okay.

So printing a diff using context_diff works well for:

  • single word changes,
  • addition or deletion of a sentence/word

but not so good for

  • whole sentence moved to somewhere else in a paragraph,
  • lots of changes in a sentence
    (I'll work on producing examples for the above to show what it would look like).

Having a look at the difflib package, context_diff seemed to be the best solution (but it's been a while). I think this is probably acceptable, but any comments so far?

I'm worried that the current solution is not flexible enough, but I don't have a good solution. For instance, the average parameter of the metrics are almost the same but differs only a bit, and we would have to exclude it from the comparison. Not sure what you and other maintainers think.

I think this would be difficult, see: #28678 (comment). Do you think this is worth pursuing?

@glemaitre (Member) left a comment

This is a first round of comments on the code. I'll look at the tests in more detail now.

Comment on lines 760 to 774
Args = namedtuple("args", ["include", "exclude", "arg_name"])
section_dict = {
    "Parameters": Args(include_params, exclude_params, "params"),
    "Attributes": Args(include_attribs, exclude_attribs, "attribs"),
    "Returns": Args(include_returns, exclude_returns, "returns"),
}
for section in list(section_dict):
    args = section_dict[section]
    if args.exclude and args.include is not True:
        raise TypeError(
            f"The 'exclude_{args.arg_name}' argument can be set only when the "
            f"'include_{args.arg_name}' argument is True."
        )
    if args.include is False:
        del section_dict[section]
Member

I'm not a big fan of deleting a key here. I think that we could instead create the dictionary dynamically just by creating a small function:

Suggested change (replacing the block quoted above with):

Args = namedtuple("args", ["include", "exclude", "arg_name"])


def create_args(include, exclude, arg_name, section_name):
    if exclude and include is not True:
        raise TypeError(
            f"The 'exclude_{arg_name}' argument can be set only when the "
            f"'include_{arg_name}' argument is True."
        )
    if include is False:
        return {}
    return {section_name: Args(include, exclude, arg_name)}


section_dict = {
    **create_args(include_params, exclude_params, "params", "Parameters"),
    **create_args(include_attribs, exclude_attribs, "attribs", "Attributes"),
    **create_args(include_returns, exclude_returns, "returns", "Returns"),
}

@glemaitre (Member) left a comment

I think this is a good start. We should have a subsequent PR to introduce the assertion in more places.

Comment on lines 23 to 29
from sklearn.metrics import (
    f1_score,
    fbeta_score,
    precision_recall_fscore_support,
    precision_score,
    recall_score,
)
Member

It might be better to change the import here and have

from sklearn import metrics

and then call metrics.f1_score.

We will end up importing all functions of scikit-learn maybe :)

Member

Maybe we have to see if we want to isolate all those consistency checks in the future if there are too many.

@lucyleeow (Member Author) Sep 5, 2024

Yes, I originally put these with the metrics tests (see: #28678 (comment)), as I had the same thought - potentially we will check many docstrings, and it may make more sense to put them with their own tests (or somewhere else?). The other tests in test_docstring_parameters.py are more general and cover most of the public classes/functions.

But I am not familiar with a lot of the codebase, so I do not know how many more places we would want to use this test. We can always move later!

@glemaitre changed the title from "TST check if docstring items are equal" to "TST check if docstring items are equal between objects (functions, classes, etc.)" on Sep 4, 2024
@glemaitre (Member) commented Sep 4, 2024

From the tool that has been written here, I'm wondering whether we could reuse part of the machinery to check that the type of a parameter is consistent with the _parameter_constraints.

Apart from making sure that we have consistent information, we could then safely have something like https://github.com/scientific-python/docstub use the type from the documentation to automatically generate the stubs (which, up to now, we don't maintain) that are useful for IDEs.
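
As a very rough sketch of that direction (not part of this PR, and only a name-level check rather than a full type comparison against the constraints; the helper name is hypothetical):

from numpydoc.docscrape import ClassDoc

from sklearn.linear_model import LogisticRegression


def documented_vs_constrained(Estimator):
    """Return params documented but not constrained, and constrained but not documented."""
    documented = {name for name, _, _ in ClassDoc(Estimator)["Parameters"]}
    constrained = set(Estimator._parameter_constraints)
    return documented - constrained, constrained - documented


# Non-empty sets point at a docstring/constraints mismatch.
print(documented_vs_constrained(LogisticRegression))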

@lucyleeow (Member Author)

From the tool that has been written here, I'm thinking if we could reuse some part of the machinery to check that the types of a parameter is consistent with the _parameter_constraints.

Good idea, I think I could definitely make that work. We may need to get consensus on terms used (e.g., 'estimator object' vs 'estimator instance' or 'array-like' vs 'ndarray', whether to include 'or None'), but this is probably a good thing.

@glemaitre merged commit 9870b52 into scikit-learn:main on Sep 5, 2024 (30 checks passed)
@glemaitre (Member)

Looks good. Thanks @lucyleeow

@lucyleeow (Member Author)

Starting to use this function in #29831, I came across two parameters where it would be nice to compare only part of the description, e.g.:

estimators : list of (str, estimator)
    Base estimators which will be stacked together. Each element of the
    list is defined as a tuple of string (i.e. name) and an estimator
    instance. An estimator can be set to 'drop' using `set_params`.
    The type of estimator is generally expected to be a classifier.
    However, one can pass a regressor for some use case (e.g. ordinal
    regression).

and

estimators : list of (str, estimator)
    Base estimators which will be stacked together. Each element of the
    list is defined as a tuple of string (i.e. name) and an estimator
    instance. An estimator can be set to 'drop' using `set_params`.

Maybe a nice addition is to be able to specify which part of the description to compare? This may be better than setting a number of words that are allowed to differ, as you can't know in advance which words will end up being different.

Note when we use it we'd probably run this function on one specific parameter, and set the 'description_subset' for it specifically.

WDYT @glemaitre @adrinjalali

@adrinjalali (Member)

It seems like a nice variation would be to check that a specific text is present in the docstring. It would cover this case as well, WDYT?

@lucyleeow (Member Author)

Interesting, hadn't thought of that! Let me summarise potential solutions:

  • Check for specific text to be present in description
    • con: adds an extra place to amend when you change a docstring
  • Check only a subset of the description (e.g., let the user pass a list index or list of indices)
    • con: can be brittle (?), e.g., if passing a complex list of indices ([0:10, 14:20])
  • Check type only and not description (suggested in #29831 (comment))
    • con: description not checked...

Another idea: we could automatically ignore a difference that is a switch between any of the words "estimator", "regressor" and "classifier".

@adrinjalali (Member)

Another idea: we could automatically ignore a difference that is a switch between any of the words "estimator", "regressor" and "classifier".

We could check against a regex, that would support a subset, and variations in a word. Do you think that'd work?
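
For example (illustrative only; the description strings here are made up), a single pattern can cover both a subset of the description and a word swap:

import re

desc_a = "The type of estimator is generally expected to be a classifier."
desc_b = "The type of estimator is generally expected to be a regressor."

# Match only the shared part of the description and tolerate the word variation.
pattern = r"expected to be a (?:classifier|regressor|estimator)\."
assert re.search(pattern, desc_a) and re.search(pattern, desc_b)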

@lucyleeow (Member Author)

Love it, I think it would cover all use cases. I'll open a PR and see what people think?

@lucyleeow (Member Author) commented Feb 24, 2025

Following on from #30854, where we discovered we need to re-work the assert function before we can use it more widely.

With regards to matching descriptions where the whole description should not match, we can instead have a regex capture group (e.g., first 2 sentences, excluding the 4th word, etc.) that will be matched between params.

  • this is better than the current descr_regex_pattern as you will not need to write out the whole description sentence(s) to match
  • the regex can be a little more complicated, as we are dealing with groups

This parameter could also take a dict, allowing us to also specify which param/attr/return this regex matches, e.g.,:

# match only first 2 sentences
descr_regex={'parameter: test': r'^(.*?[.])\s+(.*?[.])'}

(not sure if we have to specify whether the item is a param/attr/return, as it is possible that e.g., a parameter and a return both have the same name)

This means that we don't need to use a 2nd assert when one of the items doesn't need to be matched completely.
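
A small sketch of the capture-group idea, using abridged versions of the stacking descriptions quoted earlier; comparing the captured groups ignores everything after them (the dict key format above is only a proposal):

import re

first_two = r"^(.*?[.])\s+(.*?[.])"

desc_long = (
    "Base estimators which will be stacked together. Each element of the list "
    "is defined as a tuple of string (i.e. name) and an estimator instance. "
    "The type of estimator is generally expected to be a classifier."
)
desc_short = (
    "Base estimators which will be stacked together. Each element of the list "
    "is defined as a tuple of string (i.e. name) and an estimator instance."
)

# Note: with this simple pattern the second group stops at the '.' in 'i.e.',
# but the comparison of the captured groups still behaves as intended here.
assert re.match(first_two, desc_long).groups() == re.match(first_two, desc_short).groups()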

With regards to matching types that differ slightly, I see 2 cases:

  • default value is different, type otherwise consistent
  • one of the objects has an additional type value (I think this is uncommon, found for BaggingRegressor and IsolationForest max_samples)

Possible solutions:

  • add option to ignore type (for a specific param/attr/return)
    • simple but no checking
  • add ignore_default option (a rough approach is sketched after this list)
    • would not work for the rare cases where type is also not the same
    • we may want to specify the param/attr/return for this to apply to
  • add flexible_default option, where it passes if default is different, and/or some min number of types match (not sure on this one)
    • most complicated but most flexible
    • we would want to specify the param/attr/return for this to apply to
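
A sketch of the ignore_default idea (the helper name is hypothetical, not part of the PR): strip a trailing ", default=..." from the numpydoc type string before comparing types:

import re


def _strip_default(type_str):
    """Drop a trailing ', default=...' so only the type part is compared."""
    return re.sub(r",?\s*default=.*$", "", type_str).strip()


assert _strip_default("int, default=10") == _strip_default("int, default=100")
assert _strip_default("float or int, default=1.0") == "float or int"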

ping @glemaitre @StefanieSenger (and maybe @adrinjalali ) WDYT?

@adrinjalali (Member)

I like the sound of these options, but I'd need to see a PR to have a better idea of what the implications are. And as a side note on different types: we might sometimes actually want to make them more consistent when we see those differences.

@StefanieSenger (Contributor) commented Feb 25, 2025

I have given this some thought and here are my 2 cents.

Unfortunately I don't have a solution (except not using regex at all), but I have a very big concern about maintainability: there will be many good reasons for param and attribute descriptions to differ, and thus after we have added all these new tests they will contain a lot of regex patterns, including very condensed/flexible ones. Having a lot of regex patterns means that adding documentation comes with needing to adapt the regexes of this test (not only when the PR author forgot to apply the change to related classes, but possibly also when the addition is totally valid, because we defined the regex too narrowly). This puts a very high barrier on any change to the docstrings, and I am not sure we can afford that. Also, doc PRs are often done by new contributors, and we would make it very hard for new people to add something to the docs, possibly even a typo correction.

In general, I have the impression that in scikit-learn we are building a castle, which is safer but comes with the cost of inflexibility and - at some point - stagnation.

From the more technical standpoint (and totally disregarding what I wrote before):

  1. I think that passing several regex patterns into the same test case would be very handy, and it would let us keep a better overview of what has already been covered than if we could only test one regex per test and, as a result, had the tests for the same base class sprinkled over several tests. This could be done by passing a dict into descr_regex_pattern, but also by passing a regex directly into include_params, include_attrs and include_returns, which would be a bit easier to read.

  2. If I understand correctly, then passing a very condensed and at the same time flexible regex pattern (that would allow variations on a word) is currently possible already? I didn't understand your comment on not needing the 2nd assert, @lucyleeow.

  3. I think the type checking could be part of the description check?

About making this a new contributor issue: I am not sure that, with the regex, it should be a good first issue. I feel that with the need to make a judgement on which params/attributes/returns to include, and the decision on how flexible the regex should be, it is not an issue for people who don't know the project well, and it might require us to define exactly what people should include and what we can add afterwards. What about putting the labels "moderate" and "meta-issue" on it instead of "good first issue"? Unprecedented idea: maybe people could pair with maintainers, who then also push to their branches to finish the regex. We could offer this to people who have proven in other good first issues that they know how to handle the git workflow and other skills, so to say as a follow-up.

I know that a lot of thought and effort went into this a long time before I even became aware of it; so I feel quite guilty for coming in and voicing concerns so late in the process. Please use my comments the way that suits you best, @lucyleeow, I don't want to constrain you in any way.

@lucyleeow (Member Author) commented Feb 26, 2025

There will be many good reasons for differing param and attribute descriptions and thus after we have added all these new tests, they will have a lot of regex expressions,

I may have been looking at different objects than you, but from my experience, for objects we want to test, parameters are mostly the same?

But I agree that I can see this getting out of hand. @glemaitre did mention making a "wanted list" of objects for which we want to add this test, so maybe we could only work on those to start with? We don't want/need to add tests for everything; as I said in the issue, the list is just a starting point and some (many?) items do not warrant a test being added.

I think:

  • if the descriptions differ by an additional sentence, or different word, we should add a test if we feel it is worth it for these objects
  • if the regex is more complicated than above, we probably shouldn't add a test for that param/object

I didn't envisage a lot of regex use, but we can see after working on the "wanted list".

I think the type checking could be part of the description check?

Sorry I didn't make this clear (it's only obvious if you look at the code) but the description text and the type are dealt with separately in the code, because numpydoc splits these. The regex (descr_regex_pattern) only applies to the description and NOT the type. This is why I suggested a separate parameter to allow types to be more flexible (second part of #28678 (comment))

I didn't understand your comment on not needing the 2nd assert, @lucyleeow.

It's just what you described in item 1: if we allow >1 regex per 'test' we can have fewer 'tests'. I just used the term assert because we can run assert_docstring_consistency several times within one test.

About making this new contributor issue:

Yes, it probably is a bit more complicated than our usual good first issues; even without the regex part, people have to know how to use pytest etc. Not sure if we should have this as a general open issue or use it for a sprint etc.

Let me make some draft PRs, and we can re-assess from there?

@StefanieSenger (Contributor)

Thanks for the explanations, @lucyleeow.

I had imagined a very broad use of these tests, and in Bagging* I have encountered three params (I only checked params, not attributes and returns) that needed a regex. I see you are thinking of a more selective use of the test, and I agree that (given a beginner-friendly error message) the test can be enriching. In any case, no need to directly adjust your approach to my opinions/concerns. I am not a maintainer and I just wanted to give some feedback from when I had tested this issue.

@lucyleeow (Member Author)

Your insights are just as useful and valid as a maintainer's, and I'm happy to take them on board.

I had a look at the Bagging estimators and note:

  • BaggingClassifier and BaggingRegressor are pretty similar, whereas IsolationForest differs a bit, which makes sense. Potentially for this item, we would only want to test BaggingClassifier and BaggingRegressor. Or at least test more params for these two, and test fewer params for all 3 objects.
  • warm_start - differs due to different versionadded. I wonder if we should add an option to ignore these. For some objects, the version added is the same, and this check actually helped me add one of these tags for one of the objects.
  • bootstrap - wording differs but AFAICT the meaning is the same, so we should re-word so that they are consistent. The default value differs, but this would not be a regex issue (as I explained above, the description text and types are handled differently); I'd add an option to make type checking more flexible.

@adrinjalali (Member)

As for versionadded and other directives, I think we can ignore them all.

@lucyleeow (Member Author) commented Mar 3, 2025

As for versionadded and other directives, I think we can ignore them all.

I would tend to agree. I conservatively kept it because once it helped me pick up that we missed a versionadded for an object, but overall it may be easier to just always ignore.

You can take a look at draft PR #30926; the regex option there would allow us to ignore a 'version...' directive if desired, but it is extra regex work, so it may be worth ignoring them by default there too.
