
[MRG] Fix numpy.int overflow in make_classification #10811


Merged
merged 7 commits into from
Oct 2, 2018

Conversation

VarIr
Contributor

@VarIr VarIr commented Mar 14, 2018

The sample generator make_classification() raises misleading errors on certain valid inputs.

What does this implement/fix? Explain your changes.

The sample generator make_classification() checks its parameters,
e.g. 2 ** n_informative < n_classes * n_clusters_per_class.

If n_informative is given as numpy.int with a value of 64 or larger,
2 ** n_informative evaluates to 0, and the check fails with a misleading error message.

Casting to Python int() avoids this issue.

Reproduce the error

>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> N_INFORMATIVE = np.arange(31, 65)
>>> for n_informative in N_INFORMATIVE:
...     print(f'n_informative = {n_informative}, '
...           f'2 ** n_informative = {2 ** n_informative}')
...     make_classification(n_features=100, n_informative=n_informative,
...                         n_classes=2, n_clusters_per_class=1)

Any other comments?

2. ** n_informative would also do the trick.
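
Both workarounds can be sketched in isolation (a minimal demonstration, independent of scikit-learn):

```python
import numpy as np

n_informative = np.int64(64)

# The power stays in int64 arithmetic and wraps around,
# yielding 0 as reported above.
overflowed = 2 ** n_informative

# Workaround 1: cast to a Python int, which has arbitrary precision.
exact = 2 ** int(n_informative)    # 18446744073709551616

# Workaround 2: use a float base; the result is approximate but in range.
approx = 2. ** n_informative       # 1.8446744073709552e+19
```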

Member

@jnothman jnothman left a comment


This should have a non-regression test

VarIr added 2 commits March 16, 2018 18:15
Parameter n_informative as numpy.int
Member

@jnothman jnothman left a comment


We should have a non-regression test, and perhaps a comment explaining why we have not used the simpler expression

@VarIr
Contributor Author

VarIr commented Mar 19, 2018

The tests fail, because the hypercube samples generator does not create centered data down to the precision required by the tests. This happens for both numpy and Python integers. The generator behaves differently when n_informative > 30. These cases are currently covered by test_make_classification(), but not by test_make_classification_informative_features().

Should we allow less precision in the assertion for centered hypercube data? At the moment, it requires precision to the 6th decimal. Otherwise, we need to take a closer look at the hypercube generator.

@jnothman Please specify what test you have in mind.

@lesteve
Member

lesteve commented Mar 19, 2018

A non-regression test is a test that fails with master and that does not fail in your PR.

@lesteve
Member

lesteve commented Mar 19, 2018

Quickly looking at it, it looks like the test you added is fine. I see a shape mismatch at the moment, though, so is it really a precision problem?

E                   AssertionError: 
E                   Arrays are not almost equal to 0 decimals
E                   Clusters are not centered on hypercube vertices
E                   (shapes (64,), (1,) mismatch)
E                    x: array([ 1000000.68194634,  1000002.57310294,   999999.46742862,
E                           1000000.84695066,   999999.7296249 ,  1000000.21288462,
E                           1000000.79945417,  1000000.3499359 ,  1000000.36114583,...
E                    y: array([ 64000000.])

If you can make it pass with lowering the precision this may be an option.

@VarIr
Contributor Author

VarIr commented Mar 19, 2018

The shape mismatch comes from np.array(64) instead of np.int(64)... After fixing that, it's still a precision issue.

Also, I just found that for large n_instances and small n_classes (e.g. 1080, 2) the cluster assertion fails. Increasing class_sep seems to solve this, but I will test a bit more.

@VarIr VarIr changed the title FIX numpy.int overflow in make_classification [MRG] Fix numpy.int overflow in make_classification Apr 3, 2018
@VarIr
Contributor Author

VarIr commented Aug 8, 2018

From my point of view, this is ready to go.

Member

@jnothman jnothman left a comment


Thanks for the reminder

assert_array_almost_equal(np.abs(centroid) / class_sep,
                          np.array([class_sep] * n_informative),
                          decimal=0,
Member


Why isn't this just np.array([n_informative])?

Contributor Author


np.array([n_informative]) is an array of shape (1,), but we need an array of shape (n_informative,).
Anyway, the mixture of Python list operations and numpy is confusing;
I guess it should really be np.ones(n_informative).
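
The shape difference is easy to see in isolation (a toy sketch with hypothetical values):

```python
import numpy as np

n_informative, class_sep = 4, 1.0

shape_single = np.array([n_informative]).shape            # (1,): one element
shape_list = np.array([class_sep] * n_informative).shape  # (4,): list repetition
shape_ones = (class_sep * np.ones(n_informative)).shape   # (4,): pure-numpy equivalent
```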

Member


Yes, that's it

Replaced confusing Python list operations with concise numpy statement

Consistent assertions in both hypercube branches
@VarIr
Contributor Author

VarIr commented Aug 8, 2018

For consistency, and to remove the confusing Python list operations, I also changed the assert_array_almost_equal in the non-hypercube branch.
In both branches, precision is thus reduced from 6 to 5 decimals.
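
To illustrate what the precision change means (a standalone sketch, not the actual test code): assert_array_almost_equal(..., decimal=d) passes when the absolute difference stays below 1.5 * 10**-d.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

actual = np.array([1.000002, 2.000002])   # off by 2e-6
desired = np.array([1.0, 2.0])

# Passes with 5 decimals: 2e-6 < 1.5e-5.
assert_array_almost_equal(actual, desired, decimal=5)

# With 6 decimals the tolerance is 1.5e-6, so the same data fails.
try:
    assert_array_almost_equal(actual, desired, decimal=6)
except AssertionError:
    pass  # expected: 2e-6 exceeds the tolerance
```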

@VarIr
Contributor Author

VarIr commented Oct 2, 2018

A friendly reminder of this little PR. Any comments, @lesteve?

@rth rth merged commit 60cf1d6 into scikit-learn:master Oct 2, 2018
@rth
Member

rth commented Oct 2, 2018

Thanks @VarIr !

@qinhanmin2014
Member

@VarIr Please add an entry to the change log at `doc/whats_new/v*.rst`. Like the other entries there, please reference this pull request with `:issue:` and credit yourself (and other contributors if applicable) with `:user:` (we have entries for other similar issues, e.g., #10414, #10844).
I guess we can hurry this one into 0.20.1.

@rth
Member

rth commented Oct 2, 2018

@qinhanmin2014 this got merged already, sorry. I was also considering asking for a what's new entry, but in the end incorrectly decided against it (and merged) -- for some reason I got confused about the end effect of this (I thought it only addressed "misleading error messages").

You are right, it would deserve a what's new, since this prevents producing errors for input that is valid.

@VarIr If you can add a what's new in a comment, we will integrate it.

@VarIr
Contributor Author

VarIr commented Oct 3, 2018

Thank you all!

@rth Here's my what's new:
Fixed a bug in :func:`datasets.make_classification` to avoid integer overflow for certain valid inputs of numpy integer dtypes. The inequality check is now performed in log space, to avoid exponentiation with a potentially very large exponent.
:issue:`10811` by :user:`Roman Feldbauer <VarIr>`
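
The log-space check can be sketched like this (a simplified illustration, not the exact scikit-learn source; `check_params` is a hypothetical helper name):

```python
import numpy as np

def check_params(n_informative, n_classes, n_clusters_per_class):
    # Equivalent to raising when 2 ** n_informative < n_classes * n_clusters_per_class,
    # but no large power is ever computed, so numpy ints cannot overflow.
    if np.log2(n_classes * n_clusters_per_class) > n_informative:
        raise ValueError("n_classes * n_clusters_per_class must be smaller "
                         "or equal 2 ** n_informative")

# A numpy int of 64 no longer triggers a spurious error.
check_params(np.int64(64), n_classes=2, n_clusters_per_class=1)
```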

@qinhanmin2014
Member

@VarIr Please submit a PR, thanks.

@rth
Member

rth commented Oct 3, 2018

Added the what's new, somewhat adapted, in da85815. Thanks all!

@VarIr VarIr deleted the VarIr-make_classification branch October 4, 2018 09:14
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 15, 2018