
[MRG] Fix numpy.int overflow in make_classification #10811


Merged
merged 7 commits into from
Oct 2, 2018

Conversation

VarIr
Contributor

@VarIr VarIr commented Mar 14, 2018

The sample generator make_classification() raises misleading errors on certain valid inputs.

What does this implement/fix? Explain your changes.

The sample generator make_classification() checks its parameters,
e.g. 2 ** n_informative < n_classes * n_clusters_per_class.

If n_informative is given as numpy.int with a value of 64 or larger,
2 ** n_informative evaluates to 0, and the check fails with a misleading error message.

Casting to Python int() avoids this issue.

Reproduce the error

>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> N_INFORMATIVE = np.arange(31, 65)
>>> for n_informative in N_INFORMATIVE:
...     print(f'n_informative = {n_informative}, '
...           f'2 ** n_informative = {2 ** n_informative}')
...     make_classification(n_features=100, n_informative=n_informative,
...                         n_classes=2, n_clusters_per_class=1)

Any other comments?

2. ** n_informative would also do the trick.
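
Both workarounds can be sketched in isolation (a minimal demonstration, independent of scikit-learn):

```python
import numpy as np

n_informative = np.int64(64)

# The power stays in int64 arithmetic and wraps around,
# yielding 0 as reported above.
overflowed = 2 ** n_informative

# Workaround 1: cast to a Python int, which has arbitrary precision.
exact = 2 ** int(n_informative)    # 18446744073709551616

# Workaround 2: use a float base; the result is approximate but in range.
approx = 2. ** n_informative       # 1.8446744073709552e+19
```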

Member

@jnothman jnothman left a comment


This should have a non-regression test

VarIr added 2 commits March 16, 2018 18:15
Parameter n_informative as numpy.int
Member

@jnothman jnothman left a comment


We should have a non-regression test, and perhaps a comment explaining why we have not used the simpler expression

@VarIr
Contributor Author

VarIr commented Mar 19, 2018

The tests fail, because the hypercube samples generator does not create centered data down to the precision required by the tests. This happens for both numpy and Python integers. The generator behaves differently when n_informative > 30. These cases are currently covered by test_make_classification(), but not by test_make_classification_informative_features().

Should we allow less precision in the assertion for centered hypercube data? At the moment, it requires precision to the 6th decimal. Otherwise, we need to take a closer look at the hypercube generator.

@jnothman Please specify what test you have in mind.

@lesteve
Member

lesteve commented Mar 19, 2018

A non-regression test is a test that fails with master and that does not fail in your PR.

@lesteve
Member

lesteve commented Mar 19, 2018

Quickly looking at it, it looks like the test you added is fine. I see a shape mismatch at the moment, though, so is it really a precision problem?

E                   AssertionError: 
E                   Arrays are not almost equal to 0 decimals
E                   Clusters are not centered on hypercube vertices
E                   (shapes (64,), (1,) mismatch)
E                    x: array([ 1000000.68194634,  1000002.57310294,   999999.46742862,
E                           1000000.84695066,   999999.7296249 ,  1000000.21288462,
E                           1000000.79945417,  1000000.3499359 ,  1000000.36114583,...
E                    y: array([ 64000000.])

If you can make it pass with lowering the precision this may be an option.

@VarIr
Contributor Author

VarIr commented Mar 19, 2018

The shape mismatch comes from np.array(64) instead of np.int(64)... After fixing that, it's still a precision issue.

Also, I just found that for large n_instances and small n_classes (e.g. 1080, 2) the cluster assertion fails. Increasing class_sep seems to solve this, but I will test a bit more.

@VarIr VarIr changed the title FIX numpy.int overflow in make_classification [MRG] Fix numpy.int overflow in make_classification Apr 3, 2018
@VarIr
Contributor Author

VarIr commented Aug 8, 2018

From my point of view, this is ready to go.

Member

@jnothman jnothman left a comment


Thanks for the reminder

assert_array_almost_equal(np.abs(centroid) / class_sep,
                          np.array([class_sep] * n_informative),
                          decimal=0,
Member


Why isn't this just np.array([n_informative])?

Contributor Author


np.array([n_informative]) is an array of shape (1,), but we need an array of shape (n_informative,).
Anyway, the mixture of Python list operations and numpy is confusing;
I guess it should really be np.ones(n_informative).
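
The shape difference is easy to see in isolation (a toy sketch with hypothetical values):

```python
import numpy as np

n_informative, class_sep = 4, 1.0

shape_single = np.array([n_informative]).shape            # (1,): one element
shape_list = np.array([class_sep] * n_informative).shape  # (4,): list repetition
shape_ones = (class_sep * np.ones(n_informative)).shape   # (4,): pure-numpy equivalent
```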

Member


Yes, that's it

Replaced confusing Python list operations with concise numpy statement

Consistent assertions in both hypercube branches
@VarIr
Contributor Author

VarIr commented Aug 8, 2018

For consistency, and to remove the confusing Python list operations, I also changed the assert_array_almost_equal in the non-hypercube branch.
In both branches, precision is thus reduced from 6 to 5 decimals.
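
To illustrate what the precision change means (a standalone sketch, not the actual test code): assert_array_almost_equal(..., decimal=d) passes when the absolute difference stays below 1.5 * 10**-d.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

actual = np.array([1.000002, 2.000002])   # off by 2e-6
desired = np.array([1.0, 2.0])

# Passes with 5 decimals: 2e-6 < 1.5e-5.
assert_array_almost_equal(actual, desired, decimal=5)

# With 6 decimals the tolerance is 1.5e-6, so the same data fails.
try:
    assert_array_almost_equal(actual, desired, decimal=6)
except AssertionError:
    pass  # expected: 2e-6 exceeds the tolerance
```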

@VarIr
Contributor Author

VarIr commented Oct 2, 2018

A friendly reminder of this little PR. Any comments, @lesteve?

@rth rth merged commit 60cf1d6 into scikit-learn:master Oct 2, 2018
@rth
Member

rth commented Oct 2, 2018

Thanks @VarIr !

@qinhanmin2014
Member

@VarIr Please add an entry to the change log at `doc/whats_new/v*.rst`. Like the other entries there, please reference this pull request with `:issue:` and credit yourself (and other contributors if applicable) with `:user:` (we have entries for other similar issues, e.g., #10414, #10844).
I guess we can hurry this one into 0.20.1.

@rth
Member

rth commented Oct 2, 2018

@qinhanmin2014 this got merged already, sorry. I was also considering asking for a what's new entry, but in the end incorrectly decided against it (and merged) -- for some reason I got confused about the end effect of this (I thought it only addressed "misleading error messages").

You are right, it would deserve a what's new, since this prevents producing errors for input that is valid.

@VarIr If you can add a what's new in a comment, we will integrate it.

@VarIr
Contributor Author

VarIr commented Oct 3, 2018

Thank you all!

@rth Here's my what's new:
Fixed a bug in :func:`datasets.make_classification` to avoid integer overflow for certain valid inputs of numpy integer dtypes. The inequality check is now performed in log space, to avoid exponentiation with a potentially very large exponent.
:issue:`10811` by :user:`Roman Feldbauer <VarIr>`
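
The log-space check can be sketched like this (a simplified illustration, not the exact scikit-learn source; `check_params` is a hypothetical helper name):

```python
import numpy as np

def check_params(n_informative, n_classes, n_clusters_per_class):
    # Equivalent to raising when 2 ** n_informative < n_classes * n_clusters_per_class,
    # but no large power is ever computed, so numpy ints cannot overflow.
    if np.log2(n_classes * n_clusters_per_class) > n_informative:
        raise ValueError("n_classes * n_clusters_per_class must be smaller "
                         "or equal 2 ** n_informative")

# A numpy int of 64 no longer triggers a spurious error.
check_params(np.int64(64), n_classes=2, n_clusters_per_class=1)
```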

@qinhanmin2014
Member

@VarIr Please submit a PR, thanks.

@rth
Member

rth commented Oct 3, 2018

Added the what's new, somewhat adapted, in da85815. Thanks all!

@VarIr VarIr deleted the VarIr-make_classification branch October 4, 2018 09:14
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 15, 2018