Float arrays' comparisons in tests #4400

Closed
artsobolev opened this issue Mar 17, 2015 · 4 comments · Fixed by #9774
Labels
Easy (Well-defined and straightforward way to resolve), Sprint

Comments

@artsobolev
Contributor

A recent issue indicated a flaw in many of sklearn's tests: there are many places where arrays are compared using assert_array_equal, which does not account for floating-point rounding error.

Sometimes, though, we might expect a tested functionality to return exactly the same value — when checking, say, predict. It seems legitimate to use strict comparison in those cases.
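
To make the distinction concrete, here is a minimal sketch (not taken from sklearn's test suite; the array values are made up for illustration) of an exact comparison tripping over float rounding while an approximate one passes, next to a case where exact comparison is the right tool:

```python
import numpy as np
from numpy.testing import assert_array_equal, assert_array_almost_equal

# 0.1 + 0.2 is not exactly 0.3 in binary floating point.
computed = np.array([0.1 + 0.2, 1.0])
expected = np.array([0.3, 1.0])

# Exact comparison fails because of the rounding error in the first element.
try:
    assert_array_equal(computed, expected)
    exact_passed = True
except AssertionError:
    exact_passed = False

# Approximate comparison (6 decimals by default) tolerates the rounding error.
assert_array_almost_equal(computed, expected)

# Integer class labels, such as those returned by a classifier's predict,
# can legitimately be compared exactly.
labels = np.array([0, 1, 1, 2])
assert_array_equal(labels, np.array([0, 1, 1, 2]))
```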

Even though this is apparently not a problem at the moment (at least no one has filed a bunch of bug reports like the one I mentioned), we might want to do something about it. Some possible fixes are:

  1. Redefine assert_array_equal to use approximate comparison for floating-point data types. This might break guarantees like "predict returns the same values that were passed in y".
  2. Replace assert_array_equal with assert_array_almost_equal when appropriate. This is a huge body of work: at least 229 tests compare float arrays using assert_array_equal.
  3. Ignore it until somebody files an issue. Tests pass right now, so we're good :-)
@amueller
Member

The "are" link is actually "almost" ;)

We should use assert_almost_equal for floats, and assert_equal for ints.
That means that for classification and clustering, we expect the exact same outcome, but for regression and embeddings we don't.
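
The rule above could be sketched as a small helper that picks the assertion from the dtype. Note that `assert_arrays_match` is a hypothetical name used only for illustration; it is not part of NumPy or sklearn, and it is not what the thread settles on (the preference stated here is to pick the right assertion at each call site).

```python
import numpy as np
from numpy.testing import assert_array_equal, assert_array_almost_equal


def assert_arrays_match(actual, desired, decimal=6):
    """Hypothetical helper: exact check for non-float data, approximate for floats."""
    actual = np.asarray(actual)
    desired = np.asarray(desired)
    if (np.issubdtype(actual.dtype, np.floating)
            or np.issubdtype(desired.dtype, np.floating)):
        # Regression outputs, embeddings, etc.: tolerate rounding error.
        assert_array_almost_equal(actual, desired, decimal=decimal)
    else:
        # Class labels, cluster assignments, etc.: require identical values.
        assert_array_equal(actual, desired)


# Float data: passes despite the rounding error in 0.1 + 0.2.
assert_arrays_match([0.1 + 0.2], [0.3])

# Integer data: must match exactly.
assert_arrays_match([0, 1, 2], [0, 1, 2])
```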

I am very certain that 2. is the way to go, and 229 lines is not that bad. I am quite sure that we are not in too bad a shape, and that most uses of assert_array_equal are actually on ints.

@artsobolev
Contributor Author

@amueller I got 229 not by grepping the source code, but by redefining assert_array_equal to raise an exception when called on float arguments (that is, when both arguments are float numpy arrays). So 229 (the number of failed tests) is a lower bound, since there could easily be more than one assert_array_equal in a test.

@amueller
Member

Ah. I grepped and got ~700.
Still, doable. I replaced input validation in all classes not so recently and had to edit basically all files.
While there may be many lines to edit, they are mostly concentrated in a few tests.

@ogrisel
Member

ogrisel commented Mar 18, 2015

I share @amueller's position on that matter.
