MNT Use regex capturing in assert_docstring_consistency #30926

Closed
wants to merge 3 commits into from

Conversation

lucyleeow
Member

Reference Issues/PRs

Follows on from #28678 (comment)

What does this implement/fix? Explain your changes.

Amend the regex parameter descr_regex_pattern to descr_regex_patterns with the following changes:

  • descr_regex_patterns takes a dict, key is the parameter/attr/return name, and value is the regex
  • The regex is now used to capture the portion of the description to be matched
    • pro: this avoids us having to re-write the description
    • con: there is a bit of mucking around with regex to get it to work correctly, but as we have example regexes in the tests that perform common manipulations, it should be reasonably easy to adapt a pattern to match other situations
    • con: if there is a situation where the description should be one of 2 sentences/words (e.g., either "classifier" or "regressor") we can't ensure one of these 2 options is written, we can only exclude that specific word from being matched
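As a rough illustration of the capturing idea described above, here is a minimal sketch. The helper `capture_description` is hypothetical and is not the PR's actual implementation; it assumes the captured groups are simply joined with a space before descriptions are compared. The pattern is the one quoted from the PR diff later in this thread, and the two example descriptions are made up.

```python
import re


def capture_description(descr, pattern):
    """Hypothetical helper: keep only the captured parts of a description.

    Assumption: captured groups are joined with a single space. If the
    pattern does not match, the description is returned unchanged.
    """
    match = re.search(pattern, descr)
    if match is None:
        return descr
    return " ".join(g for g in match.groups() if g is not None)


# Pattern quoted in the PR diff: keeps the first and last sentence and
# skips an optional sentence in between.
pattern = r"^([^.]+\.)(?:\s+[^.]+\.)?\s+([^.]+\.)"

# Made-up descriptions that differ only by an extra middle sentence.
clf_descr = "The base estimator. Must support sample weighting. Defaults to None."
reg_descr = "The base estimator. Defaults to None."

assert capture_description(clf_descr, pattern) == capture_description(
    reg_descr, pattern
)
```

With this approach the docstrings themselves stay untouched; only the pattern decides which portion takes part in the comparison.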

I am not 100% sure about this solution, so I am happy to amend it or abandon it altogether.

If we do go ahead, I wonder if I should do more parameter checking for descr_regex_patterns (e.g., ensure it is a dict).

Any other comments?

cc @StefanieSenger @adrinjalali @glemaitre

github-actions bot commented Mar 3, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bde299c. Link to the linter CI: here

@lucyleeow
Member Author

I've now added 2 tests to demonstrate its use.

There is a fair bit of tinkering to make the regex work, though now that we have some examples, hopefully future additions will be easier (and LLMs can help you add them) 😅

But due to the regex complexity, I am not sure this is the right approach. I do like not having to re-write any docstrings though.

Thoughts welcome 🙏

Also I have no idea why mypy is complaining with:

sklearn/tests/test_docstring_parameters_consistency.py:18: error: List item 0 has incompatible type "type[AdaBoostClassifier]"; expected "type[BaseWeightBoosting]"  [list-item]

@glemaitre glemaitre self-requested a review March 13, 2025 09:04
    "exclude_returns": None,
    "descr_regex_patterns": {
        # Excludes 2nd sentence if present, matches last sentence until "."
        "estimator": r"^([^.]+\.)(?:\s+[^.]+\.)?\s+([^.]+\.)",
Member

I'm worried that we are introducing too much complexity with those regex.

I'm wondering if we could come with a small utility to express what we want to achieve without writing a regex (but I don't have a solution).

Member Author

Completely agree, the regexes turned out to be way more complicated than initially thought.

I don't have a great solution either. What about a descr_ignore that takes a string that is removed from the description? This way you enter the sentence/word that is different, and all occurrences will be removed from the description before matching. We can use re.sub and replace with an empty string.

It seems like it would be easier to use, but is it a bit weird that you are entering the portion that is to be removed?
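A minimal sketch of what the descr_ignore idea could look like, assuming the ignored text is treated as a literal string (hence `re.escape`) and that whitespace is normalized afterwards. The helper name `strip_ignored` and the example descriptions are hypothetical; nothing like this exists in the PR.

```python
import re


def strip_ignored(descr, ignore):
    """Hypothetical helper: remove every occurrence of `ignore` from `descr`.

    Assumption: `ignore` is a literal string, so it is escaped before being
    handed to re.sub. Whitespace left behind is collapsed to single spaces.
    """
    stripped = re.sub(re.escape(ignore), "", descr)
    return " ".join(stripped.split())


# Made-up descriptions: one object has an extra sentence the other lacks.
descr_a = "Number of jobs. Only used for classifiers. None means 1."
descr_b = "Number of jobs. None means 1."

assert strip_ignored(descr_a, "Only used for classifiers.") == descr_b
```

Escaping the input keeps the user from having to write a regex at all, which is the main appeal over the capturing approach.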

cc @adrinjalali

Member Author

@glemaitre suggested that we could simplify this by vectorizing the descriptions and calculating their cosine similarity, letting users simply specify the parameters where 'close enough' is acceptable.

WDYT @adrinjalali ?

Member

O_o, that's an interesting idea. But I'm afraid in the context of docstrings, contradictory docstrings can have very close cosine similarities, so that's probably mostly unhelpful.
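To make the concern concrete, here is a small sketch under stated assumptions: descriptions are vectorized as plain bag-of-words counts (the thread does not say which vectorizer was intended), and the two example docstrings are made up. They state opposite behaviour yet differ by a single token, so their cosine similarity stays high.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors (an assumption;
    the thread does not specify a vectorization scheme)."""
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


# Made-up, contradictory descriptions that differ only by the word "not".
descr_a = "If True, the data is centered before scaling."
descr_b = "If True, the data is not centered before scaling."

similarity = cosine_similarity(descr_a, descr_b)
assert similarity > 0.9
```

Any threshold loose enough to accept harmless rewording would also accept this pair, which is the weakness raised here.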

I'm probably missing / forgetting something, but could we start w/o regex and see how much of our docstrings we can test this way? Like exact match or "includes" kinda thing. My thinking is that we don't have to support every single scenario in our codebase. If we can support most docstrings, we're fine?
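A regex-free check along these lines might look like the sketch below. The function `descriptions_consistent` and its `required` argument are hypothetical names used only for illustration: without `required` it demands an exact match after whitespace normalization, and with it every description only has to include the given text.

```python
def descriptions_consistent(descriptions, required=None):
    """Hypothetical regex-free check.

    - required is None: all descriptions must match exactly, ignoring
      differences in whitespace.
    - required is a string: every description must include it.
    """
    normalized = [" ".join(d.split()) for d in descriptions]
    if required is None:
        return len(set(normalized)) <= 1
    return all(required in d for d in normalized)


# Made-up descriptions: same wording, different line wrapping.
same = ["Seed for the random\n    number generator.", "Seed for the random number generator."]
# Made-up descriptions that share a sentence but are not identical.
shared = ["The base estimator. Defaults to None.", "The base estimator. Must be a regressor."]

assert descriptions_consistent(same)
assert not descriptions_consistent(shared)
assert descriptions_consistent(shared, required="The base estimator.")
```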

Member Author

Yes I think I agree with you.

I'm probably missing / forgetting something, but could we start w/o regex and see how much of our docstrings we can test this way?

This is a good point. I do think the current descr_regex_pattern is not a great solution - I don't like how we have to re-write the whole docstring and the regex pattern is difficult to implement.

Why don't I remove descr_regex_pattern and we add some tests, only for objects that are most pertinent. In the process of adding tests, we may even think of a decent solution...?

The question though, is do we create a meta issue and open it up to everyone or....?

cc @glemaitre

Member

We can open a meta issue, but I personally feel like similar past ones (me being guilty of creating some of them) have distracted us quite a bit from other high priority projects in our roadmap. So I probably would be happy if we end up with a few large-ish PRs that more experienced people like us do?

Member Author

Yeah I agree with you. I'm happy to tackle it as a side project, for the days when the ol' brain isn't working that well.

@lucyleeow
Member Author

Closing this as I agree this solution is not suitable.

I will remove the current descr_regex_pattern, if we think of a good solution later I can add it. Otherwise, let's try using this only for objects that should be perfect matches.

@lucyleeow lucyleeow closed this Apr 28, 2025
@lucyleeow lucyleeow deleted the ds_consis_regex branch April 28, 2025 11:27