ENH Add "ensure_non_negative" option to check_array #29540

TamaraAtanasoska · 2024-07-22T11:00:05Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This adds an option to sklearn.utils.validation.check_array that checks whether an array contains only non-negative values. The option, only_non_negative, wraps the existing sklearn.utils.validation.check_non_negative functionality.

While I proposed ensure_positive as the option name in the initial issue, I found only_non_negative with exactly the same function in sklearn.utils.validation._check_sample_weight, and I used the same name for consistency.

This addition will prevent the need to use sklearn.utils.validation.check_non_negative after sklearn.utils.validation.check_array in the use case of needing only non-negative values in an array. I am keeping the proposed changes minimal for easier review, but I am happy to also change all occurrences in the scikit-learn code where this pattern exists as an added commit.
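The call pattern described above can be sketched in isolation. This is a simplified, pure-Python stand-in that works on nested lists, not scikit-learn's implementation: `validate_array` and `require_non_negative` are hypothetical names playing the roles of check_array and check_non_negative, and the error wording is only illustrative.

```python
def require_non_negative(rows, whom):
    """Stand-in for check_non_negative: reject any negative entry."""
    if any(value < 0 for row in rows for value in row):
        raise ValueError(f"Negative values in data passed to {whom}")


def validate_array(rows, ensure_non_negative=False, input_name="X"):
    """Stand-in for check_array: convert to floats, optionally check the sign."""
    converted = [[float(value) for value in row] for row in rows]
    if ensure_non_negative:
        require_non_negative(converted, input_name)
    return converted


data = [[0, 1.5], [2, 3]]

# Before: two separate calls at every call site.
checked = validate_array(data)
require_non_negative(checked, "X")

# After: a single call with the new option.
checked_once = validate_array(data, ensure_non_negative=True)

# Invalid input is rejected by the single call as well.
try:
    validate_array([[1, -2]], ensure_non_negative=True)
except ValueError as exc:
    error_message = str(exc)
```

Both paths return the same validated data; the option only removes the second call from the call site.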

Any other comments?

github-actions · 2024-07-22T11:01:21Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 702f196. Link to the linter CI: here

jeremiedbb · 2024-07-22T15:57:31Z

Thanks for the PR @TamaraAtanasoska. I'd like to keep the ensure_xxx naming pattern for consistency in check_array. Since _check_sample_weight is private, I'd rather change the name there if we really want both to have the same name.

It's possible though that ensure_positive may be confusing w.r.t whether or not zero is allowed. So ensure_non_negative is a valid option too.

TamaraAtanasoska · 2024-07-23T10:43:04Z

Thanks for the PR @TamaraAtanasoska. I'd like to keep the ensure_xxx naming pattern for consistency in check_array. Since _check_sample_weight is private, I'd rather change the name there if we really want both to have the same name.

It's possible though that ensure_positive may be confusing w.r.t whether or not zero is allowed. So ensure_non_negative is a valid option too.

Agreed, changed it everywhere to ensure_non_negative in e18e5f3.

sklearn/utils/validation.py

glemaitre

We need to add an entry in the changelog located in doc/whats_new/v1.6.rst since it touches the public API of check_array. We don't need to document anything about _check_sample_weight there.

It is a bit weird that we introduce the option but that we don't use it.

@jeremiedbb do you think we can use this feature directly in NMF? If I recall correctly, non-negative input is one requirement of that transformer. I assume that we are currently calling the check_non_negative function directly in the code.

I don't know if there is something similar for transformers that expect counts as input in the text feature extraction module (like TfidfTransformer).

glemaitre · 2024-07-23T12:55:29Z

sklearn/utils/validation.py

@@ -2002,7 +2010,7 @@ def _check_sample_weight(
    X : {ndarray, list, sparse matrix}
        Input data.

-    only_non_negative : bool, default=False,
+    ensure_non_negative : bool, default=False,
        Whether or not the weights are expected to be non-negative.

        .. versionadded:: 1.0


We should probably have a versionchanged to mention that we renamed the parameter. @jeremiedbb I'm scared of breaking code, but this should be quite limited and easy to handle for downstream packages.
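As a rough illustration of how a downstream package might cope with the rename, the sketch below passes the flag under whichever keyword name the target function accepts. Everything here is hypothetical: `call_with_renamed_kwarg`, `weights_old`, and `weights_new` are illustrative names, and the two stand-in functions only mimic the before and after signatures rather than calling scikit-learn.

```python
import inspect


def call_with_renamed_kwarg(func, *args, old_name, new_name, value, **kwargs):
    """Pass `value` under whichever keyword name `func` accepts."""
    params = inspect.signature(func).parameters
    if new_name in params:
        kwargs[new_name] = value
    elif old_name in params:
        kwargs[old_name] = value
    else:
        raise TypeError(
            f"{func.__name__} accepts neither {old_name!r} nor {new_name!r}"
        )
    return func(*args, **kwargs)


# Hypothetical stand-ins for the helper before and after the rename.
def weights_old(sample_weight, only_non_negative=False):
    return ("old", only_non_negative)


def weights_new(sample_weight, ensure_non_negative=False):
    return ("new", ensure_non_negative)


names = dict(old_name="only_non_negative", new_name="ensure_non_negative")
result_old = call_with_renamed_kwarg(weights_old, [1.0], value=True, **names)
result_new = call_with_renamed_kwarg(weights_new, [1.0], value=True, **names)
```

Inspecting the signature avoids pinning to a version number, at the cost of relying on a private helper's keyword names.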

Co-authored-by: Guillaume Lemaitre <[email protected]>

TamaraAtanasoska · 2024-07-23T13:11:25Z

doc/whats_new/v1.6.rst

Thanks for the review @glemaitre! I am looking into adding to the changelog now. Is this an enhancement or a feature? I will also add it in the places where check_non_negative follows check_array.

glemaitre · 2024-07-23T16:04:19Z

Is this an enhancement or a feature?

I would say an enhancement.

TamaraAtanasoska · 2024-07-23T16:09:10Z

@glemaitre @jeremiedbb a few notes on a6cfa3a:

I added a short changelog entry. Let me know if it needs to be changed in any way, this is my first time 😊
I had to move the check to the very bottom of check_array, which mirrors how it has been used until now. Otherwise the input array is not "ready" and some changes would need to be made to check_non_negative.
I added two example usages in the existing code. What comes to light when looking at the changes required for them is a loss of information that was previously there, packed in whom, which is now a generic "check_array" as the source of input. In what cases is this information essential?

glemaitre

I added two example usages in the existing code. What comes to light when looking at the changes required for them is a loss of information that was previously there, packed in whom, which is now a generic "check_array" as the source of input. In what cases is this information essential?

I think that we can alleviate this issue by using both the input_name and estimator parameters of check_array: estimator gives the name of the estimator, while input_name indicates which data container contains negative values. We could adapt the error message of check_non_negative a bit to make it meaningful, or craft a good whom message from these two parameters.

doc/whats_new/v1.6.rst

glemaitre · 2024-07-24T08:02:51Z

sklearn/utils/validation.py

@@ -1120,6 +1127,9 @@ def is_sparse(dtype):
            else:
                array = array.copy(**copy_params)

+    if ensure_non_negative:
+        check_non_negative(array, "check_array")


As stated in another comment, if input_name and estimator are not the defaults, I think that we can craft a better message than "check_array".

glemaitre

I think that we can also modify NMF. Right now check_non_negative is called in the _fit_transform method. However, we could remove it and use the ensure_non_negative option of the check_array calls in fit_transform and transform.

Co-authored-by: Guillaume Lemaitre <[email protected]>

TamaraAtanasoska · 2024-07-24T10:19:54Z

@glemaitre I tried to address all the comments above in 890b480. Not sure about adding a generic "estimator" in this line. Would it be better if it was empty maybe?

glemaitre · 2024-07-24T10:48:30Z

Would it be better if it was empty maybe?

I would probably not use a generic name and would simply not mention it (so yes, probably empty). It would translate to something like:

    if ensure_non_negative:
        whom = input_name
        if estimator_name:
            whom += f" in {estimator_name}"
        check_non_negative(array, whom=whom)

TamaraAtanasoska · 2024-07-24T11:09:20Z

    whom = input_name
    if estimator_name:
        whom += f" in {estimator_name}"

I love this suggestion, just pushed a commit with it. It preserves the whom that makes the usage of it obvious and adds a more natural end of sentence without an estimator name.
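The message logic adopted here can be exercised on its own. In this minimal sketch, `build_whom` is a hypothetical wrapper around the three lines suggested above, and the "Negative values in data passed to" prefix is an assumed wording used only to show how the sentence reads with and without an estimator name.

```python
def build_whom(input_name, estimator_name=None):
    """Describe the offending input, naming the estimator only if known."""
    whom = input_name
    if estimator_name:
        whom += f" in {estimator_name}"
    return whom


# With an estimator name, the message points at both the data and the estimator.
whom_with_estimator = build_whom("X", "NMF")

# Without one, the sentence still ends naturally on the input name.
whom_without_estimator = build_whom("precomputed distance matrix")

# Assumed message prefix, for illustration only.
message = f"Negative values in data passed to {whom_without_estimator}"
```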

jeremiedbb · 2024-07-25T10:03:52Z

sklearn/utils/validation.py

+        Make sure the array has only non-negative values. An array that contains
+        non-negative values will raise a ValueError.


Suggested change

-        Make sure the array has only non-negative values. An array that contains
-        non-negative values will raise a ValueError.
+        Make sure the array has only non-negative values. If True, an array that contains
+        negative values will raise a ValueError.

jeremiedbb · 2024-07-25T10:06:54Z

sklearn/utils/validation.py

+    if ensure_non_negative:
+        whom = input_name
+        if estimator_name:
+            whom += f" in {estimator_name}"
+        check_non_negative(array, whom)
+


I would put that just before the previous if force_writeable block to be able to fail fast, before a potential copy.
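The ordering argument can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the check_array code: `validate`, `copy_rows`, and the `force_copy` flag are hypothetical, and a counter records how many copies are made so the effect of checking first is visible.

```python
import copy

copies_made = 0


def copy_rows(rows):
    """Copy the data and count how often that happens."""
    global copies_made
    copies_made += 1
    return copy.deepcopy(rows)


def validate(rows, ensure_non_negative=False, force_copy=False):
    # Cheap validation first, so invalid input fails before any copy is made.
    if ensure_non_negative and any(v < 0 for row in rows for v in row):
        raise ValueError("Negative values in data passed to X")
    if force_copy:
        rows = copy_rows(rows)
    return rows


try:
    validate([[1, -1]], ensure_non_negative=True, force_copy=True)
except ValueError:
    failed_fast = True
else:
    failed_fast = False
copies_after_failure = copies_made

validated = validate([[1, 2]], ensure_non_negative=True, force_copy=True)
copies_after_success = copies_made
```

With the check placed after the copy instead, the failing call would pay for a copy that is immediately thrown away.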

jeremiedbb · 2024-07-25T10:08:48Z

sklearn/neighbors/_base.py

+    graph = check_array(
+        graph, accept_sparse="csr", ensure_non_negative=True, input_name="graph"
+    )


Why didn't you keep "precomputed distance matrix"?

glemaitre

Agreed, graph is an internal variable and people might get confused by the error message then.

TamaraAtanasoska

Thanks @jeremiedbb, makes sense! I didn't notice there was more relevant naming there.

jeremiedbb · 2024-07-25T10:12:03Z

@glemaitre, should we add it to check_X_y as well? It has pretty much the same parameters as check_array, although I don't think we ever use it in conjunction with check_non_negative.

jeremiedbb

LGTM

glemaitre · 2024-08-02T08:33:37Z

@glemaitre, should we add it to check_X_y as well? It has pretty much the same parameters as check_array, although I don't think we ever use it in conjunction with check_non_negative.

Yep we can do that but in a subsequent PR.

glemaitre

OK after failing at properly using git merge, I think this is all good ;)

glemaitre · 2024-08-02T08:56:43Z

Enabling auto-merge. Thanks @TamaraAtanasoska

Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Jérémie du Boisberranger <[email protected]>

Add only_non_negative as check_array option

bd6a309

github-actions bot added the module:utils label Jul 22, 2024

Fix docstring

ae01b8b

glemaitre self-requested a review July 22, 2024 13:08

Replace only_non_negative with ensure_non_negative

e18e5f3

glemaitre reviewed Jul 23, 2024

glemaitre reviewed Jul 23, 2024

Add @glemaitre's versioning suggestion in docstring

abe9818

Co-authored-by: Guillaume Lemaitre <[email protected]>

TamaraAtanasoska added 2 commits July 23, 2024 16:49

Relocate check within check_array out of neseted if

bbc8169

Example usage of ensure_non_negative

a6cfa3a

TamaraAtanasoska added 2 commits July 23, 2024 18:38

Fix doc error

f5ecea4

Fix references in docsstring

e1605b4

glemaitre reviewed Jul 24, 2024

glemaitre reviewed Jul 24, 2024

TamaraAtanasoska and others added 2 commits July 24, 2024 10:48

Update doc/whats_new/v1.6.rst

96343ac

Co-authored-by: Guillaume Lemaitre <[email protected]>

Add a meaningful error message

890b480

Reformat error message

527dd55

glemaitre changed the title ~~ENH Add "only_non_negative" option to check_array~~ ENH Add "ensure_non_negative" option to check_array Jul 24, 2024

jeremiedbb reviewed Jul 25, 2024

iter

e2ba67a

jeremiedbb approved these changes Jul 25, 2024

glemaitre self-requested a review August 2, 2024 08:33

glemaitre added 4 commits August 2, 2024 10:41

Merge remote-tracking branch 'origin/main' into pr/TamaraAtanasoska/2…

dbdaf49

…9540

iter

aabc06a

iter

bc4cabe

iter

702f196

glemaitre approved these changes Aug 2, 2024

glemaitre enabled auto-merge (squash) August 2, 2024 08:56

glemaitre merged commit ea6c77b into scikit-learn:main Aug 2, 2024
28 checks passed

TamaraAtanasoska deleted the ensure_positive branch August 12, 2024 07:59
