ENH Add "ensure_non_negative" option to check_array #29540
Thanks for the PR @TamaraAtanasoska. I'd like to keep the It's possible though that |
Agreed, changed it everywhere to |
We need to add an entry in the changelog located in doc/whats_new/v1.6.rst since this touches the public API of check_array. We don't need to document anything about _check_sample_weight there.
It is a bit weird that we introduce the option but don't use it.
@jeremiedbb do you think that we can use this feature directly in NMF? If I recall correctly, this is one requirement of the transformer. I assume that we are currently calling the check_non_negative function directly in the code.
I don't know if there is something similar with transformers that expect counts as input in the feature extraction for text (like TfidfTransformer).
@@ -2002,7 +2010,7 @@ def _check_sample_weight(
     X : {ndarray, list, sparse matrix}
         Input data.

-    only_non_negative : bool, default=False,
+    ensure_non_negative : bool, default=False,
         Whether or not the weights are expected to be non-negative.

         .. versionadded:: 1.0
We should probably have a versionchanged directive to mention that we renamed the parameter. @jeremiedbb I'm scared of breaking code, but the impact should be quite limited and easy to handle for downstream packages.
Co-authored-by: Guillaume Lemaitre <[email protected]>
Thanks for the review @glemaitre! I am looking into adding to the changelog now. Is this an |
I would say an enhancement.
@glemaitre @jeremiedbb a few notes on a6cfa3a:
|
I added two example usages in the existing code. What comes to light when looking at the changes required for them is that there is a loss of information that was previously there, packed in whom, which is now a generic check_array as the source of input. In what cases is this information essential?
I think that we can alleviate this issue by using both the input_name and estimator parameters of check_array: estimator will help with getting the name of the estimator, while input_name will help with indicating which data container contains negative values. We could adapt the error message of check_non_negative a bit to make it meaningful, or craft a good whom message from these two parameters.
sklearn/utils/validation.py
Outdated
@@ -1120,6 +1127,9 @@ def is_sparse(dtype):
        else:
            array = array.copy(**copy_params)

    if ensure_non_negative:
        check_non_negative(array, "check_array")
As stated in another comment, if input_name and estimator are not the default, I think that we can craft a better message than "check_array".
I think that we can also modify NMF. Right now check_non_negative is called in the _fit_transform method. However, we could remove it and use the ensure_non_negative option of the check_array call in fit_transform and transform.
Co-authored-by: Guillaume Lemaitre <[email protected]>
@glemaitre I tried to address all the comments above in 890b480. Not sure about adding a generic |
I would probably not have a generic name and not mention it (so probably the empty yes). It would translate to something like:

    if ensure_non_negative:
        whom = input_name
        if estimator_name:
            whom += f" in {estimator_name}"
        check_non_negative(array, whom=whom)
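To make the suggestion above concrete, here is a minimal, self-contained sketch of how the label could be assembled and used. Both `build_whom` and `check_non_negative_sketch` are hypothetical stand-ins written for illustration; they are not scikit-learn functions, and the error message wording is an assumption modelled on the discussion rather than a copy of the library's text.

```python
# Hypothetical stand-ins for illustration only; not scikit-learn's code.


def build_whom(input_name="", estimator_name=None):
    """Combine the input name and the estimator name into one label."""
    whom = input_name
    if estimator_name:
        whom += f" in {estimator_name}"
    return whom


def check_non_negative_sketch(values, whom):
    """Raise ValueError if any value in a flat sequence is negative.

    The message format is an assumption, not the library's exact wording.
    """
    if any(v < 0 for v in values):
        raise ValueError(f"Negative values in data passed to {whom}.")


print(build_whom("X", "NMF"))  # X in NMF
print(build_whom("X"))         # X

try:
    check_non_negative_sketch([1.0, -2.0], build_whom("X", "NMF"))
except ValueError as exc:
    print(exc)
```

With this shape, the label carries both the data container and the estimator when they are known. If `input_name` is left empty while an estimator name is given, the label starts with " in ", so callers would likely want to pass an explicit `input_name`.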
I love this suggestion, just pushed a commit with it. It preserves the |
sklearn/utils/validation.py
Outdated
Make sure the array has only non-negative values. An array that contains
non-negative values will raise a ValueError.
Suggested change:
- Make sure the array has only non-negative values. An array that contains
- non-negative values will raise a ValueError.
+ Make sure the array has only non-negative values. If True, an array that contains
+ negative values will raise a ValueError.
sklearn/utils/validation.py
Outdated
    if ensure_non_negative:
        whom = input_name
        if estimator_name:
            whom += f" in {estimator_name}"
        check_non_negative(array, whom)
I would put that just before the previous if force_writeable block to be able to fail fast, before a potential copy.
sklearn/neighbors/_base.py
Outdated
graph = check_array(
    graph, accept_sparse="csr", ensure_non_negative=True, input_name="graph"
)
Why didn't you keep "precomputed distance matrix"?
Agreed, graph is an internal variable and people might get confused by the error message then.
Thanks @jeremiedbb, makes sense! I didn't notice there was more relevant naming there.
@glemaitre, should we add it to |
LGTM
Yep, we can do that but in a subsequent PR.
OK, after failing at properly using git merge, I think this is all good ;)
Enabling auto-merge. Thanks @TamaraAtanasoska
Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Jérémie du Boisberranger <[email protected]>
Reference Issues/PRs
Fixes #29508.
What does this implement/fix? Explain your changes.
Adding an option to check if an array has only_non_negative values to the sklearn.utils.validation.check_array function, that contains the sklearn.utils.validation.check_non_negative functionality.

While in the initial issue I proposed ensure_positive as a name for the option, I found only_non_negative with exactly the same function in sklearn.utils.validation._check_sample_weight (here) and I used the same for consistency.

This addition will prevent the need to use sklearn.utils.validation.check_non_negative after sklearn.utils.validation.check_array in the use case of needing only non-negative values in an array. I am keeping the proposed changes minimal for easier review, but I am happy to also change all occurrences in the scikit-learn code where this pattern exists as an added commit.

Any other comments?