MNT Use regex capturing in assert_docstring_consistency
#30926
I've now added 2 tests to demonstrate its use. It takes a fair bit of tinkering to make the regex work, though now that we have some examples, hopefully future additions will be easier (or LLMs can help you add them) 😅 But due to the regex complexity, I am not sure this is the right approach. I do like not having to re-write any docstrings though. Thoughts welcome 🙏 Also I have no idea why mypy is complaining with:
    "exclude_returns": None,
    "descr_regex_patterns": {
        # Excludes 2nd sentence if present, matches last sentence until "."
        "estimator": r"^([^.]+\.)(?:\s+[^.]+\.)?\s+([^.]+\.)",
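To make the pattern in the diff above easier to review, here is a minimal sketch of what it captures on two made-up descriptions. The helper `captured_text` and the sample strings are hypothetical, and joining the captured groups before comparing them is an assumption about how the captures would be used, not a description of the PR's implementation.

```python
import re

# Pattern copied verbatim from the diff above.
ESTIMATOR_PATTERN = r"^([^.]+\.)(?:\s+[^.]+\.)?\s+([^.]+\.)"


def captured_text(descr, pattern):
    """Join the captured groups of `pattern` (hypothetical helper).

    Returns None when the pattern does not match at the start of `descr`.
    """
    match = re.match(pattern, descr)
    if match is None:
        return None
    return " ".join(group for group in match.groups() if group is not None)


# Made-up descriptions: the second one has an extra middle sentence.
two_sentences = "The base estimator. Fitted on the training data."
three_sentences = (
    "The base estimator. Must implement fit. Fitted on the training data."
)

# Both descriptions reduce to the same captured text.
print(captured_text(two_sentences, ESTIMATOR_PATTERN))
print(captured_text(three_sentences, ESTIMATOR_PATTERN))
```

Note that a description with only one sentence does not match at all, since the pattern requires a first sentence followed by at least one more.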
I'm worried that we are introducing too much complexity with those regexes.
I'm wondering if we could come up with a small utility to express what we want to achieve without writing a regex (but I don't have a solution).
Completely agree, the regexes turned out to be way more complicated than initially thought.
I don't have a great solution either. What about a descr_ignore that takes a string that is removed from the description? This way you enter the sentence/word that is different, and all occurrences will be removed from the description before matching.
We can use re.sub and replace with empty string.
Seems like it would be easier to use, but is it a bit weird that you are entering the portion that is to be removed?
cc @adrinjalali
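A minimal sketch of the descr_ignore idea, assuming the ignored string is treated literally (hence `re.escape`) and that whitespace is collapsed afterwards so the removal leaves no double spaces. The helper name `strip_ignored` and the sample descriptions are hypothetical.

```python
import re


def strip_ignored(descr, descr_ignore):
    """Remove every occurrence of `descr_ignore` from `descr` (hypothetical helper).

    The string is escaped so regex metacharacters such as "." are matched
    literally, then runs of whitespace are collapsed to single spaces.
    """
    removed = re.sub(re.escape(descr_ignore), "", descr)
    return " ".join(removed.split())


# Made-up descriptions that differ by one sentence.
descr_a = "The base estimator. Must implement fit. Fitted on the training data."
descr_b = "The base estimator. Fitted on the training data."

ignore = "Must implement fit."
print(strip_ignored(descr_a, ignore) == strip_ignored(descr_b, ignore))
```

The caller only writes the part that differs, which is simpler than a capturing regex, at the cost of naming the text to drop rather than the text to keep.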
@glemaitre suggested that we could simplify this by vectorizing the descriptions, calculating the cosine similarity, and allowing users to simply specify the parameters where 'close enough' is acceptable.
WDYT @adrinjalali ?
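A rough sketch of what such a similarity check could look like, using plain bag-of-words counts from the standard library. The tokenization, the helper name `cosine_similarity`, and the sample descriptions are all assumptions made for illustration; a real implementation might vectorize differently.

```python
import math
import re
from collections import Counter


def cosine_similarity(descr_a, descr_b):
    """Cosine similarity between word-count vectors (hypothetical helper)."""
    counts_a = Counter(re.findall(r"\w+", descr_a.lower()))
    counts_b = Counter(re.findall(r"\w+", descr_b.lower()))
    # Counter returns 0 for missing words, so unshared words add nothing.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up descriptions that differ by a single word.
score = cosine_similarity(
    "If True, the data is copied.",
    "If False, the data is copied.",
)
print(round(score, 2))  # 5 shared words out of 6 on each side: 5/6, about 0.83
```

Word counts ignore meaning, so any threshold would need to be checked against descriptions like the pair above, which differ by one word but say opposite things.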
O_o, that's an interesting idea. But I'm afraid in the context of docstrings, contradictory docstrings can have very close cosine similarities, so that's probably mostly unhelpful.
I'm probably missing / forgetting something, but could we start w/o regex and see how much of our docstrings we can test this way? Like exact match or "includes" kinda thing. My thinking is that we don't have to support every single scenario in our codebase. If we can support most docstrings, we're fine?
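The exact match and "includes" checks mentioned above could be sketched as follows. Both helpers (`all_exact`, `all_include`) and the whitespace normalization are hypothetical choices made for illustration, not existing utilities.

```python
def normalize(descr):
    """Collapse whitespace so line wrapping does not affect comparisons."""
    return " ".join(descr.split())


def all_exact(descrs):
    """True if every description is identical after normalization."""
    return len({normalize(descr) for descr in descrs}) <= 1


def all_include(descrs, phrase):
    """True if every description contains `phrase` after normalization."""
    phrase = normalize(phrase)
    return all(phrase in normalize(descr) for descr in descrs)


# Made-up descriptions: same first sentence, one has an extra sentence.
descrs = [
    "The base estimator.\n    Fitted on the training data.",
    "The base estimator. Must implement fit. Fitted on the training data.",
]

print(all_exact(descrs))
print(all_include(descrs, "The base estimator."))
```

An "includes" check is weaker than an exact match, but it needs no regex and still catches the case where a shared sentence drifts in one of the objects.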
Yes I think I agree with you.
"I'm probably missing / forgetting something, but could we start w/o regex and see how much of our docstrings we can test this way?"
This is a good point. I do think the current descr_regex_pattern is not a great solution - I don't like how we have to re-write the whole docstring and the regex pattern is difficult to implement.
Why don't I remove descr_regex_pattern and we add some tests, only for objects that are most pertinent. In the process of adding tests, we may even think of a decent solution...?
The question though, is do we create a meta issue and open it up to everyone or....?
cc @glemaitre
We can open a meta issue, but I personally feel like similar past ones (me being guilty of creating some of them) have distracted us quite a bit from other high priority projects in our roadmap. So I probably would be happy if we end up with a few large-ish PRs that more experienced people like us do?
Yeah I agree with you. I'm happy to tackle it as a side project, for the days when the ol' brain isn't working that well.
Closing this as I agree this solution is not suitable. I will remove the current |
Reference Issues/PRs
Follows on from #28678 (comment)
What does this implement/fix? Explain your changes.
Amend the regex parameter descr_regex_pattern to descr_regex_patterns with the following changes:
descr_regex_patterns takes a dict, key is the parameter/attr/return name, and value is the regex
I am not 100% on this solution, so happy to amend or abandon altogether.
If we do go ahead, I wonder if I should do more param checking for descr_regex_patterns (ensure it is a dict)
Any other comments?
cc @StefanieSenger @adrinjalali @glemaitre
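The dict form of descr_regex_patterns and the extra parameter check raised in the description could be sketched like this. The function name `check_descr_regex_patterns`, the error messages, and the choice to pre-compile the patterns are assumptions for illustration; none of this is taken from the PR's code.

```python
import re


def check_descr_regex_patterns(descr_regex_patterns):
    """Validate a name -> regex mapping and compile it (hypothetical helper)."""
    if descr_regex_patterns is None:
        return {}
    if not isinstance(descr_regex_patterns, dict):
        raise TypeError(
            "descr_regex_patterns must be a dict mapping a parameter, "
            f"attribute or return name to a regex, got "
            f"{type(descr_regex_patterns).__name__}."
        )
    compiled = {}
    for name, pattern in descr_regex_patterns.items():
        if not isinstance(name, str) or not isinstance(pattern, str):
            raise TypeError("Both keys and values must be strings.")
        # re.compile raises re.error early if the pattern is malformed.
        compiled[name] = re.compile(pattern)
    return compiled


patterns = check_descr_regex_patterns(
    {"estimator": r"^([^.]+\.)(?:\s+[^.]+\.)?\s+([^.]+\.)"}
)
print(sorted(patterns))
```

Compiling up front means a malformed pattern fails before any docstring is parsed, which keeps the error close to the test that supplied it.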