[MRG] Add common test and estimator tag for preserving float32 dtype in transformers #16290

Closed

Conversation

ksslng
Contributor

@ksslng commented Jan 29, 2020

Reference Issues/PRs

Part of #11000.

What does this implement/fix? Explain your changes.

This will make testing and implementing #11000 easier. It also adds a new tag, preserves_32bit_dtype; the test is skipped when the tag is false.

Any other comments?

Verifying that the values and results are close to those obtained without conversion still needs to be added.
Also, this is part of the Paris sprint.

@rth changed the title from "[MRG] Add test and tag whether Estimator keeps float32 as dtype" to "[MRG] Add common test and estimator tag for preserving float32 dtype in transformers" on Jan 29, 2020
Member

@rth left a comment

Thanks! Overall sounds reasonable to me. I'm not sure if we should keep the default of preserves_32bit_dtype as False or True, but I think there should be no expectation for transformers to pass this test by default.
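
For concreteness, a minimal sketch of how a transformer could opt in under the tag name discussed here (the tag key is the one proposed in this PR, not a released API; _more_tags is scikit-learn's standard hook for overriding estimator tags):

    from sklearn.base import BaseEstimator, TransformerMixin

    class PassthroughTransformer(TransformerMixin, BaseEstimator):
        """Toy transformer that trivially preserves the input dtype."""

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X  # output dtype equals input dtype

        def _more_tags(self):
            # Opt in to the proposed common test (tag name from this PR).
            return {'preserves_32bit_dtype': True}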

@@ -1339,6 +1343,35 @@ def check_estimators_dtypes(name, estimator_orig):
getattr(estimator, method)(X_train)


def check_estimators_preserve_dtypes(name, estimator_orig):
    if not _safe_tags(estimator_orig, 'preserves_32bit_dtype'):
Member

Usually we don't implement it this way. We avoid calling the function at all if the tag is false; see above.
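
To illustrate the convention being referenced, a sketch of collection-time skipping (function names follow the pattern in sklearn.utils.estimator_checks; illustrative, not the exact upstream code):

    # The tag is consulted where checks are yielded, so a check whose tag
    # is False is never generated, instead of returning early inside it.
    def _yield_transformer_checks(name, transformer):
        if _safe_tags(transformer, 'preserves_32bit_dtype'):
            yield check_estimators_preserve_dtypes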

Member

Here it was done inside the check, as a temporary approach, so that we would get a list of estimators that fail and need fixing.

Contributor Author

I moved it now, since one needs to change the tag's default value to True anyway to check whether an estimator actually fails the test rather than just being skipped.
And once the meta-issue is resolved, nothing else needs to change.

@@ -197,6 +198,9 @@ def _yield_transformer_checks(name, transformer):
    yield check_transformer_data_not_an_array
    # these don't actually fit the data, so don't raise errors
    yield check_transformer_general
    # it's not important to preserve types with Clustering
    if not isinstance(transformer, ClusterMixin):
Member

Suggested change
    if not isinstance(transformer, ClusterMixin):
    if (not isinstance(transformer, ClusterMixin) and
            _safe_tags(transformer, "preserves_32bit_dtype")):

    assert X_trans.dtype == dtype_out, \
        ('Estimator transform dtype: {} - original/expected dtype: {}'
         .format(X_trans.dtype, dtype_out.__name__))

Member

In the end, maybe we should add the check comparing transforms in this PR, say using

   assert_allclose(X_trans_32, X_trans_64, rtol=1e-2)

with a high enough tolerance, before we start modifying other estimators to pass this.
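
A rough sketch of what that comparison could look like inside the common check (the helper name is hypothetical; clone and assert_allclose are the usual scikit-learn/NumPy test utilities):

    import numpy as np
    from numpy.testing import assert_allclose
    from sklearn.base import clone

    def _check_transform_close(transformer, X64):
        # Fit and transform the same data in both precisions, then require
        # the float32 result to roughly match the float64 reference.
        X32 = X64.astype(np.float32)
        X_trans_64 = clone(transformer).fit_transform(X64)
        X_trans_32 = clone(transformer).fit_transform(X32)
        assert_allclose(X_trans_32, X_trans_64, rtol=1e-2)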

Member

Why such a high tolerance?

Member

OK, maybe that was too pessimistic. Maybe rtol=1e-4 then; what value do you think would be good?

Member

I really don't know :/ Ideally it would be around 1e-6, but that's not realistic.
The output precision depends on the algorithm, and we probably can't expect machine precision for all algorithms even with the best possible implementation.
I put 1e-5 in the tests of our wrappers of SciPy BLAS, but I'm fine with 1e-4.
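
For context on these numbers: float32 carries only about seven significant decimal digits, so 1e-6 is already near the noise floor of a single operation, and error accumulates across operations. A quick check:

    import numpy as np

    # Machine epsilon of float32: spacing between 1.0 and the next float32.
    print(np.finfo(np.float32).eps)  # ~1.19e-07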

@@ -1339,6 +1344,31 @@ def check_estimators_dtypes(name, estimator_orig):
getattr(estimator, method)(X_train)


def check_estimators_preserve_dtypes(name, estimator_orig):

Member

Please skip this test for all estimators that currently fail,

if name in [...]:
    raise SkipTest('Known failure to preserve dtypes')

so we can merge this.

The XFAIL support from #16306 could be used once that PR is merged.

@rth
Member

rth commented Jan 30, 2020

Also shouldn't more estimators have the preserves_32bit_dtype=True flag? Say StandardScaler should preserve dtype but currently doesn't have this flag.

@glemaitre self-assigned this Feb 13, 2020
@@ -521,6 +521,9 @@ poor_score (default=``False``)
are based on current estimators in sklearn and might be replaced by
something more systematic.

preserves_32bit_dtype (default=``False``)
Member

Actually, the common test is a bit broader than only preserving 32-bit floats. I think we should go with a tag preserves_dtype, which should be a list of dtypes that will be preserved. By default, it would contain only 64-bit floats.

so in short

Suggested change
preserves_32bit_dtype (default=``False``)
preserves_dtype (default=``['float64']``)

and one could give ['float64', 'float32', 'float16']
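
Under this proposal, a transformer that keeps both 64-bit and 32-bit floats might declare (a sketch of the suggested tag, not an API that exists at the time of this PR):

    def _more_tags(self):
        # dtypes listed here must pass through transform() unchanged;
        # anything else is expected to come out as float64.
        return {'preserves_dtype': ['float64', 'float32']}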

Contributor Author

great idea!

Member

@glemaitre left a comment

@ksslng Would you be able to finish the PR or shall I take over?


for dtype_in, dtype_out in [(np.float32, np.float32),
                            (np.float64, np.float64),
                            (np.float16, np.float64)]:
Member

StandardScaler will not pass this test because it preserves even 16-bit floats without casting, which is fine.

Basically, for each dtype in the preserves_dtype list, we should ensure that the dtype is preserved; otherwise, the output should be cast to 64 bits.
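
In other words, the expected output dtype would be derived from the tag rather than hard-coded. A sketch of that logic, assuming transformer, X, _safe_tags, and clone come from the surrounding common-test code:

    preserved = _safe_tags(transformer, 'preserves_dtype')  # e.g. ['float64']
    for dtype_in in ('float16', 'float32', 'float64'):
        # A dtype in the tag must survive; anything else upcasts to float64.
        expected = dtype_in if dtype_in in preserved else 'float64'
        X_cast = X.astype(dtype_in)
        X_trans = clone(transformer).fit_transform(X_cast)
        assert X_trans.dtype.name == expected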

# FIXME: should we check that the dtype of some attributes is the
# same as dtype, and that the values of attributes between 32-bit
# and 64-bit runs are close?
assert X_trans.dtype == dtype_out, \
Member

Suggested change
assert X_trans.dtype == dtype_out, \
assert X_trans.dtype.name == dtype_out, \

Member

This is the change needed if we use strings in the preserves_dtype list.

for dtype_in, dtype_out in [(np.float32, np.float32),
                            (np.float64, np.float64),
                            (np.float16, np.float64)]:
    X_cast = X.copy().astype(dtype_in)
Member

This will work directly with the string name of a dtype.
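
Indeed, NumPy's astype accepts dtype names as strings, so the tag entries can drive the cast directly:

    import numpy as np

    X = np.ones((3, 2))                    # float64 by default
    X_cast = X.copy().astype('float32')    # string works like np.float32
    assert X_cast.dtype.name == 'float32'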

@ksslng
Contributor Author

ksslng commented Feb 14, 2020

I'll have time this weekend, and I have a couple of changes locally that are not pushed yet.

I also added the test comparing the float32 and float64 output of transform. Should I leave it in here, or make a separate PR out of it? The underlying question is whether we actually want to guarantee this.

@glemaitre
Member

I'll have time this weekend, and I have a couple of changes locally that are not pushed yet.

great

I also added the test comparing the float32 and float64 output of transform

It can be added here, and we can check when it fails as well.

@ksslng
Contributor Author

ksslng commented Feb 17, 2020

So a couple of things to add:

  • I don't test MissingIndicator, as it returns bool.
  • I keep the cross-decomposition methods, even though they return x_scores (I'm not sure how much sense that makes, though).
  • MetaEstimatorMixin is not useful, as for some of these estimators the required parameter ('estimator') is removed (I guess it is replaced by 'base_estimator', but that is not specified either); also, I didn't find any code that uses it. Should I open another issue for that? I think it would make sense to create a specific test for the meta-estimators.

Now to the transformers that fail.
These are the ones that have a dtype mismatch:

BernoulliRBM()
CCA()
DictionaryLearning()
FactorAnalysis()
FastICA()
GaussianRandomProjection()
IncrementalPCA()
Isomap()
KBinsDiscretizer()
KNeighborsTransformer()
LatentDirichletAllocation()
LinearDiscriminantAnalysis()
LocallyLinearEmbedding()
MiniBatchDictionaryLearning()
MiniBatchSparsePCA()
NMF()
NeighborhoodComponentsAnalysis()
PLSCanonical()
PLSRegression()
PLSSVD()
RBFSampler()
RadiusNeighborsTransformer()
RandomTreesEmbedding()
RobustScaler() PASSED
SkewedChi2Sampler()
SparsePCA()
SparseRandomProjection()

These are the ones whose transformations are not close enough:

FastICA()
IncrementalPCA()
KernelPCA()
MiniBatchSparsePCA()
Nystroem()
PCA()
PowerTransformer()
SkewedChi2Sampler()

@cmarmo
Contributor

cmarmo commented Mar 28, 2020

@ksslng, it would be great if you could find some time to sync your PR with upstream and check build failures. Then, maybe reviewers will show up again... :)
Thanks for your work!

@rth added this to the 0.24 milestone Jun 23, 2020