[MRG] Fix numpy.int overflow in make_classification #10811
Conversation
Sample generator `make_classification()` checks whether `2 ** n_informative < n_classes * n_clusters_per_class` and raises an error if so. If `n_informative` is given as numpy.int with a value of 64 or larger, `2 ** n_informative` overflows to 0, so the check fails with a misleading error message. Casting to Python `int()` avoids this issue.
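For illustration, a minimal sketch of the overflow (not from the PR itself; exact warning behavior depends on the numpy version):

```python
import numpy as np

n_informative = np.int64(64)
2 ** n_informative       # wraps around in 64-bit integer arithmetic -> 0
2 ** int(n_informative)  # Python ints never overflow -> 18446744073709551616
```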
This should have a non-regression test
Parameter n_informative as numpy.int
We should have a non-regression test, and perhaps a comment explaining why we have not used the simpler expression.
The tests fail because the hypercube samples generator does not create centered data down to the precision required by the tests. This happens for both numpy and Python integers. The generator behaves differently when `n_informative > 30`; these cases are currently covered by … Should we allow less precision in the assertion for centered hypercube data? At the moment, it requires precision to the 6th decimal. Otherwise, we need to take a closer look at the hypercube generator. @jnothman Please specify what test you have in mind.
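For context, `numpy.testing.assert_array_almost_equal` compares to a given number of decimals (6 by default), so lowering `decimal` relaxes the check; a small illustration with made-up values:

```python
from numpy.testing import assert_array_almost_equal

# decimal=6 (the default) tolerates absolute differences below ~1.5e-6.
assert_array_almost_equal([1.0000004], [1.0], decimal=6)  # passes

# decimal=2 tolerates differences below ~1.5e-2.
assert_array_almost_equal([1.004], [1.0], decimal=2)      # passes
```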
A non-regression test is a test that fails with master and that does not fail in your PR.
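As a sketch, such a test could look like the following (hypothetical test name and parameter values; the test actually added in this PR may differ):

```python
import numpy as np
from sklearn.datasets import make_classification

def test_make_classification_numpy_int_n_informative():
    # On master this raises a misleading ValueError because
    # 2 ** np.int64(64) overflows to 0; with this PR it succeeds.
    X, y = make_classification(n_samples=100, n_features=70,
                               n_informative=np.int64(64),
                               n_redundant=0, n_repeated=0,
                               random_state=0)
    assert X.shape == (100, 70)
```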
Quickly looking at it, the test you added looks fine. I see a shape mismatch at the moment, though, so is it really a precision problem? If you can make it pass by lowering the precision, this may be an option.
The shape mismatch comes from … Also, I just found that for large n_instances and small n_classes (e.g. 1080, 2) the cluster assertion fails. Increasing `class_sep` seems to solve this, but I will test a bit more.
From my point of view, this is ready to go.
Thanks for the reminder
From the test diff under review:

```python
assert_array_almost_equal(np.abs(centroid) / class_sep,
                          np.array([class_sep] * n_informative),
                          decimal=0)
```
Why isn't this just `np.array([n_informative])`?
`np.array([n_informative])` is an array of shape `(1,)`, but we need an array of shape `(n_informative,)`. Anyway, the mixture of Python list operations and numpy is confusing, and I guess it should really be `np.ones(n_informative)`.
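A quick illustration of the shape difference (illustrative values, not from the PR):

```python
import numpy as np

n_informative = 5
np.array([n_informative]).shape        # (1,) -- a single element with value 5
np.array([1.0] * n_informative).shape  # (5,) -- built via a Python list
np.ones(n_informative).shape           # (5,) -- the concise numpy equivalent
```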
Yes, that's it
Replaced confusing Python list operations with concise numpy statement
Consistent assertions in both hypercube branches
For consistency, and to remove the confusing Python list operations, I also changed the assertion in the other hypercube branch.
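Roughly, the change to the test assertion (a sketch with dummy values; the real test computes `centroid` from `make_classification` output):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

n_informative, class_sep = 3, 2.0
centroid = np.array([2.0, -2.0, 2.0])  # a hypercube vertex scaled by class_sep

# Before: expected vector built with Python list arithmetic.
assert_array_almost_equal(np.abs(centroid), [class_sep] * n_informative)

# After: normalize by class_sep and compare against a vector of ones.
assert_array_almost_equal(np.abs(centroid) / class_sep, np.ones(n_informative))
```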
A friendly reminder about this little PR. Any comments, @lesteve?
Thanks @VarIr!
@VarIr Please add an entry to the change log at …
@qinhanmin2014 this got merged already, sorry. I was also considering asking for a what's new entry, but in the end incorrectly decided against it (and merged); for some reason I got confused about the end effect of this PR (I thought it only addressed misleading error messages). You are right, it deserves a what's new entry, since this prevents errors for input that is valid. @VarIr If you can add a what's new in a comment, we will integrate it.
Thank you all! @rth Here's my what's new: …
@VarIr Please submit a PR, thanks.
Added the what's new, somewhat adapted, in da85815. Thanks all!
The sample generator `make_classification()` raises misleading errors on certain valid inputs.

What does this implement/fix? Explain your changes.
The sample generator `make_classification()` checks its parameters, e.g. `2 ** n_informative < n_classes * n_clusters_per_class`. If `n_informative` is given as numpy.int with a value of 64 or larger, `2 ** n_informative` evaluates to 0, and the check fails with a misleading error message. Casting to Python `int()` avoids this issue.
Reproduce the error
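A minimal sketch that should trigger the misleading error on master (parameter values are assumptions, not the original reproduction):

```python
import numpy as np
from sklearn.datasets import make_classification

# n_informative arrives as a 64-bit numpy integer: 2 ** n_informative wraps
# to 0, so the sanity check rejects this valid input with a misleading
# ValueError about n_classes * n_clusters_per_class.
X, y = make_classification(n_samples=100, n_features=70,
                           n_informative=np.int64(64),
                           n_redundant=0, n_repeated=0)
```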
Any other comments?
`2. ** n_informative` would also do the trick.
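Why the float literal works, for illustration (assuming 64-bit numpy integers):

```python
import numpy as np

n_informative = np.int64(64)
2. ** n_informative  # promotes to float64: 1.8446744073709552e+19, no overflow
```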