Add option of matching regex to assert_docstring_consistency (#29867)
Conversation
@@ -1254,7 +1254,7 @@ def f1_score(
    average : {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, \
            default='binary'
        This parameter is required for multiclass/multilabel targets.
        If ``None``, the scores for each class are returned. Otherwise, this
Currently we use 'score' here but revert to 'metrics' (see below, in the same function). I am actually unsure whether to change all 'metrics' to 'scores' for `f1_score`, `precision_score` and `recall_score`, or leave them as 'metrics'?
good question
My own definition would be the following:
- "score": scalar where higher value is better
- "error": scalar where lower value is better
- "metric": term accounting for "score" and "error" to measure the statistical performance.
Therefore, I would be fine generalizing the term "score" when it comes to these functions.
Sorry, do you mean 'generalize "metric"' ?
Or more specifically, do you mean use 'metric' everywhere as it accounts for both types?
I like this.
I pretty much like how it looks.
sklearn/utils/_testing.py
@@ -760,6 +776,10 @@ def assert_docstring_consistency(
        List of returns to be excluded. If None, no returns are excluded.
        Can only be set if `include_returns` is True.

    description_regex : str, default=""
        Regular expression to match to all descriptions. If empty string, will
In this regard, I would rephrase it slightly: "Regular expression pattern to match all descriptions."
Maybe we should be more explicit about what we mean by "descriptions".
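To make the intended behavior concrete, here is a minimal sketch of what such a parameter could do; the helper name and the `descriptions` mapping are illustrative only, not the actual scikit-learn implementation:

```python
import re


def check_descriptions_match(descriptions, pattern):
    """Return the names whose description does not match `pattern`.

    Hypothetical helper: `descriptions` maps an object name to the
    extracted parameter description; descriptions are whitespace-
    normalized before matching, and an empty pattern disables the check.
    """
    if not pattern:  # empty string: skip the regex check entirely
        return []
    return [
        name
        for name, text in descriptions.items()
        if not re.fullmatch(pattern, " ".join(text.split()))
    ]


# Example: both descriptions must name one of the average types.
descrs = {
    "f1_score": "Determines the type of averaging: micro or macro.",
    "recall_score": "Determines the type of averaging: weighted.",
}
print(check_descriptions_match(descrs, r".*\b(micro|macro|weighted)\b.*"))
```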
description_regex = r"""This parameter is required for multiclass/multilabel targets\.
    If ``None``, the metrics for each class are returned\. Otherwise, this
    determines the type of averaging performed on the data:
    ``'binary'``:
        Only report results for the class specified by ``pos_label``\.
        This is applicable only if targets \(``y_\{true,pred\}``\) are binary\.
    ``'micro'``:
        Calculate metrics globally by counting the total true positives,
        false negatives and false positives\.
    ``'macro'``:
        Calculate metrics for each label, and find their unweighted
        mean\. This does not take label imbalance into account\.
    ``'weighted'``:
        Calculate metrics for each label, and find their average weighted
        by support \(the number of true instances for each label\)\. This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall\.[\s\w]*\.*
    ``'samples'``:
        Calculate metrics for each instance, and find their average \(only
        meaningful for multilabel classification where this differs from
        :func:`accuracy_score`\)\."""  # noqa E501
for the purpose of the test, we could maybe pass a real regex pattern instead. For instance, we could require to only match the type of average, e.g., 'binary', 'micro', etc.
In this case, we could also parametrize the test by passing (or not) the description_regex parameter.
we could maybe pass a real regex pattern instead. For instance, we could require to only match the type of average, e.g., 'binary', 'micro', etc.
Sorry I don't follow. Do you mean take only specific average types, e.g., 'binary', 'micro', and match the extracted description between objects?
That would be another way to do it, but the current regex pattern just matches descriptions from all objects against the given regex...
What I meant was to use this for instance:
description_regex = r"\b(binary|micro|macro|weighted)\b"
instead of the entire docstring.
sklearn/utils/tests/test_testing.py
    descr_regex_pattern=" ".join(regex_full.split()),
)
# Check we can just match a few alternate words
regex_words = r"(labels|average|binary)"
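The `" ".join(regex_full.split())` idiom collapses every run of whitespace (newlines, indentation) into a single space, so a regex written as an indented multiline string can be matched against a docstring normalized the same way. A small self-contained illustration (variable names assumed, pattern text taken from the 'macro' entry above):

```python
import re

# A pattern written as an indented, multiline raw string for readability.
regex_full = r"""Calculate metrics for each label, and find their unweighted
        mean\. This does not take label imbalance into account\."""

# Collapse all whitespace runs to single spaces before matching.
normalized_pattern = " ".join(regex_full.split())

# A docstring description normalized the same way.
description = " ".join(
    """Calculate metrics for each label, and find their unweighted
    mean. This does not take label imbalance into account.""".split()
)

print(bool(re.fullmatch(normalized_pattern, description)))  # → True
```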
@glemaitre I've added a test to check the new parameter, this one is in the same vein as what you suggested above.
LGTM. @adrinjalali do you want to have another look?
Otherwise LGTM.
        Calculate metrics for each label, and find their average weighted
        by support \(the number of true instances for each label\)\. This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall\.[\s\w]*\.*
can we have a comment here (and wherever we have a regex) on what it does?
Thanks @adrinjalali ! Done so in 2f981d4, hopefully looks okay?
Thanks @lucyleeow
Reference Issues/PRs
Follows from #28678, in particular: #28678 (comment)
What does this implement/fix? Explain your changes.
Adds a description_regex parameter to allow matching all descriptions against a regex.
Adds a sample test - this will probably be one of the more difficult/awkward parameter docstrings to match properly.
Any other comments?
cc @adrinjalali @glemaitre
(To add tests once we're happy with the change)