ENH Add "ensure_non_negative" option to check_array #29540

Merged
merged 16 commits into from
Aug 2, 2024

Conversation

TamaraAtanasoska
Contributor

Reference Issues/PRs

Fixes #29508.

What does this implement/fix? Explain your changes.

Adds an only_non_negative option to the sklearn.utils.validation.check_array function that checks whether an array contains only non-negative values, reusing the sklearn.utils.validation.check_non_negative functionality.

While I proposed ensure_positive as the option name in the initial issue, I found only_non_negative, which does exactly the same thing, in sklearn.utils.validation._check_sample_weight, and I used the same name for consistency.

This addition removes the need to call sklearn.utils.validation.check_non_negative after sklearn.utils.validation.check_array when an array must contain only non-negative values. I am keeping the proposed changes minimal for easier review, but I am happy to also update all occurrences of this pattern in the scikit-learn code in an additional commit.
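
The pattern described here can be sketched with plain-Python stand-ins. The functions below are hypothetical simplifications (they work on lists of numbers rather than arrays and are not scikit-learn's implementation); they only illustrate how folding the non-negativity check into the validator behind an `ensure_non_negative` flag removes the second call at each call site.

```python
def check_non_negative_sketch(values, whom):
    """Toy stand-in: raise ValueError if any value is negative."""
    if any(v < 0 for v in values):
        raise ValueError(f"Negative values in data passed to {whom}")


def check_array_sketch(values, ensure_non_negative=False, whom="check_array_sketch"):
    """Toy validator: convert to a list of floats, optionally reject negatives."""
    validated = [float(v) for v in values]
    if ensure_non_negative:
        check_non_negative_sketch(validated, whom)
    return validated


# Before: two separate calls at every call site.
data = check_array_sketch([0, 1.5, 2])
check_non_negative_sketch(data, "my_estimator")

# After: a single call with the flag enabled.
data_one_step = check_array_sketch([0, 1.5, 2], ensure_non_negative=True)
```

Both paths accept the same valid input and reject the same invalid input; the difference is only in how many calls the caller has to remember to make.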

Any other comments?

github-actions bot commented Jul 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 702f196.

@glemaitre glemaitre self-requested a review July 22, 2024 13:08
@jeremiedbb
Member

Thanks for the PR @TamaraAtanasoska. I'd like to keep the ensure_xxx naming pattern for consistency in check_array. Since _check_sample_weight is private, I'd rather change the name there if we really want both to have the same name.

It's possible though that ensure_positive may be confusing w.r.t whether or not zero is allowed. So ensure_non_negative is a valid option too.

@TamaraAtanasoska
Contributor Author

Thanks for the PR @TamaraAtanasoska. I'd like to keep the ensure_xxx naming pattern for consistency in check_array. Since _check_sample_weight is private, I'd rather change the name there if we really want both to have the same name.

It's possible though that ensure_positive may be confusing w.r.t whether or not zero is allowed. So ensure_non_negative is a valid option too.

Agreed, changed it everywhere to ensure_non_negative in e18e5f3.

Member

@glemaitre glemaitre left a comment

We need to add an entry to the changelog located in doc/whats_new/v1.6.rst since this touches the public API of check_array. We don't need to document anything about _check_sample_weight there.

It is a bit weird that we introduce the option but don't use it.

@jeremiedbb do you think that we can use this feature directly in NMF? If I recall correctly, non-negative input is one requirement of that transformer. I assume that we are currently calling the check_non_negative function directly in the code.

I don't know if there is something similar for transformers that expect counts as input in the text feature extraction (like TfidfTransformer).

@@ -2002,7 +2010,7 @@ def _check_sample_weight(
     X : {ndarray, list, sparse matrix}
         Input data.

-    only_non_negative : bool, default=False,
+    ensure_non_negative : bool, default=False,
         Whether or not the weights are expected to be non-negative.

         .. versionadded:: 1.0
Member

We should probably add a versionchanged directive to mention that the parameter was renamed. @jeremiedbb I'm wary of breaking code, but the impact should be quite limited and easy to handle for downstream packages.
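
For downstream code that calls a helper whose keyword was renamed, one possible way to stay compatible with both names is to inspect the signature before calling. This is only a sketch under stated assumptions: `old_helper` and `new_helper` are hypothetical stand-ins for the two versions of a function, not scikit-learn code, and `call_with_non_negative_flag` is an illustrative name.

```python
import inspect


def old_helper(weights, only_non_negative=False):
    """Hypothetical older version using the previous keyword name."""
    if only_non_negative and any(w < 0 for w in weights):
        raise ValueError("Negative values in data passed to `sample_weight`")
    return list(weights)


def new_helper(weights, ensure_non_negative=False):
    """Hypothetical newer version using the renamed keyword."""
    if ensure_non_negative and any(w < 0 for w in weights):
        raise ValueError("Negative values in data passed to `sample_weight`")
    return list(weights)


def call_with_non_negative_flag(helper, weights):
    """Call `helper` with whichever keyword name its signature accepts."""
    params = inspect.signature(helper).parameters
    if "ensure_non_negative" in params:
        name = "ensure_non_negative"
    else:
        name = "only_non_negative"
    return helper(weights, **{name: True})
```

The same wrapper works against either version, so a downstream package would not need to pin a specific release just because of the rename.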

@TamaraAtanasoska
Contributor Author

doc/whats_new/v1.6.rst

Thanks for the review @glemaitre! I am looking into adding to the changelog now. Is this an enhancement or a feature? I will also add it in the places where check_non_negative follows check_array.

@glemaitre
Member

Is this an enhancement or a feature?

I would say an enhancement.

@TamaraAtanasoska
Contributor Author

TamaraAtanasoska commented Jul 23, 2024

@glemaitre @jeremiedbb a few notes on a6cfa3a:

  • I added a short changelog entry. Let me know if it needs to be changed in any way, first time 😊
  • I had to move the check to the very bottom of check_array, which mirrors how it has been used until now. Otherwise the input array is not "ready" and some changes would need to be made to check_non_negative.
  • I added two example usages in the existing code. What comes to light when looking at the changes required for them is that some information that was previously there is lost: it was packed in whom, which is now a generic "check_array" as the source of the input. In what cases is this information essential?

Member

@glemaitre glemaitre left a comment

I added two example usages in the existing code. What comes to light when looking at the changes required for them is that some information that was previously there is lost: it was packed in whom, which is now a generic "check_array" as the source of the input. In what cases is this information essential?

I think that we can alleviate this issue by using both the input_name and estimator parameters of check_array: estimator will help get the name of the estimator, while input_name will indicate which data container holds negative values. We could adapt the error message of check_non_negative a bit to make it meaningful, or craft a good whom message from these two parameters.

@@ -1120,6 +1127,9 @@ def is_sparse(dtype):
             else:
                 array = array.copy(**copy_params)

+    if ensure_non_negative:
+        check_non_negative(array, "check_array")
Member

As stated in another comment, if input_name and estimator are not the default, I think that we can make a crafted message better than "check_array".

Member

@glemaitre glemaitre left a comment

I think that we can also modify NMF. Right now check_non_negative is called in the _fit_transform method. However, we could remove it and use the ensure_non_negative option of the check_array calls in fit_transform and transform.

@TamaraAtanasoska
Contributor Author

@glemaitre I tried to address all the comments above in 890b480. I'm not sure about adding a generic "estimator" in this line. Would it be better if it was empty maybe?

@glemaitre
Member

Would it be better if it was empty maybe?

I would probably not use a generic name and simply not mention it (so yes, probably empty). It would translate to something like:

    if ensure_non_negative:
        whom = input_name
        if estimator_name:
            whom += f" in {estimator_name}"
        check_non_negative(array, whom=whom)

@TamaraAtanasoska
Contributor Author

    whom = input_name
    if estimator_name:
        whom += f" in {estimator_name}"

I love this suggestion, just pushed a commit with it. It preserves the whom that makes the usage of it obvious and adds a more natural end of sentence without an estimator name.
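
To make the effect of this suggestion concrete, here is a small self-contained sketch of how `whom` might be assembled and where it ends up in the error message. `build_whom` and `negative_values_message` are hypothetical helper names, and the message format is an assumption used for illustration rather than a quote of the library.

```python
def build_whom(input_name, estimator_name=None):
    """Compose the `whom` string from the input name and optional estimator name."""
    whom = input_name
    if estimator_name:
        whom += f" in {estimator_name}"
    return whom


def negative_values_message(whom):
    # Assumed message format, for illustration only.
    return f"Negative values in data passed to {whom}."


with_estimator = negative_values_message(build_whom("X", "NMF"))
without_estimator = negative_values_message(build_whom("X"))
```

With an estimator name the message ends in "X in NMF."; without one it ends in "X.", so the sentence reads naturally in both cases and no generic placeholder is needed.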

@glemaitre glemaitre changed the title from ENH Add "only_non_negative" option to check_array to ENH Add "ensure_non_negative" option to check_array Jul 24, 2024
Comment on lines 795 to 796
Make sure the array has only non-negative values. An array that contains
non-negative values will raise a ValueError.
Member

Suggested change
- Make sure the array has only non-negative values. An array that contains
- non-negative values will raise a ValueError.
+ Make sure the array has only non-negative values. If True, an array that contains
+ negative values will raise a ValueError.

Comment on lines 1130 to 1135
    if ensure_non_negative:
        whom = input_name
        if estimator_name:
            whom += f" in {estimator_name}"
        check_non_negative(array, whom)

Member

I would put that just before the previous if force_writeable block so that we fail fast, before a potential copy.
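
The fail-fast ordering suggested here can be illustrated with a toy example. Everything below is hypothetical (`CountingList`, `validate`) and is not the scikit-learn code; the point is only that running the negativity check before the copy means an invalid input raises without paying for the copy.

```python
class CountingList(list):
    """List subclass that counts how many times it has been copied."""

    copies = 0

    def copy(self):
        CountingList.copies += 1
        return CountingList(self)


def validate(values, ensure_non_negative=False, force_copy=False):
    """Toy validator: check for negatives first, copy only afterwards."""
    if ensure_non_negative and any(v < 0 for v in values):
        raise ValueError("Negative values in data passed to validate")
    if force_copy:
        values = values.copy()
    return values


bad = CountingList([1, -2, 3])
try:
    validate(bad, ensure_non_negative=True, force_copy=True)
except ValueError:
    pass  # raised before any copy was made

copies_after_failure = CountingList.copies
good_copy = validate(CountingList([1, 2, 3]), ensure_non_negative=True, force_copy=True)
```

If the two steps in `validate` were swapped, the invalid input would be copied once before the error is raised, which is wasted work for large arrays.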

Comment on lines 180 to 182
    graph = check_array(
        graph, accept_sparse="csr", ensure_non_negative=True, input_name="graph"
    )
Member

@jeremiedbb jeremiedbb Jul 25, 2024

Why didn't you keep "precomputed distance matrix"?

Member

Agreed, graph is an internal variable and people might get confused by the error message then.

Contributor Author

Thanks @jeremiedbb, makes sense! I didn't notice there was more relevant naming there.

@jeremiedbb
Member

@glemaitre, should we add it to check_X_y as well? It has pretty much the same parameters as check_array, although I don't think we ever use it in conjunction with check_non_negative.

Member

@jeremiedbb jeremiedbb left a comment

LGTM

@glemaitre
Member

@glemaitre, should we add it to check_X_y as well? It has pretty much the same parameters as check_array, although I don't think we ever use it in conjunction with check_non_negative.

Yep we can do that but in a subsequent PR.

@glemaitre glemaitre self-requested a review August 2, 2024 08:33
Member

@glemaitre glemaitre left a comment

OK after failing at properly using git merge, I think this is all good ;)

@glemaitre
Member

Enabling auto-merge. Thanks @TamaraAtanasoska

@glemaitre glemaitre enabled auto-merge (squash) August 2, 2024 08:56
@glemaitre glemaitre merged commit ea6c77b into scikit-learn:main Aug 2, 2024
28 checks passed
@TamaraAtanasoska TamaraAtanasoska deleted the ensure_positive branch August 12, 2024 07:59
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024
Co-authored-by: Guillaume Lemaitre
Co-authored-by: Jérémie du Boisberranger
Development

Successfully merging this pull request may close these issues.

Add "ensure_positive" to check_array for non-negative value validation
3 participants