Description
This is the issue we want to tackle during the Man AHL Hackathon.
We would like transformers not to convert float32 input to float64 whenever possible. The transformers which are currently failing are listed below (a minimal reproduction sketch follows the list):
- BernoulliRBM ([ENH] add dtype preservation to BernoulliRBM #24318)
- Birch (ENH Added `dtype` preservation to `Birch` #22968)
- CCA - inherits from PLS, so also not applicable
- DictionaryLearning (ENH Preserving dtype for np.float32 in *DictionaryLearning, SparseCoder and orthogonal_mp_gram #22002)
- FactorAnalysis (ENH add dtype preservation to FactorAnalysis #24321)
- FastICA
- FeatureAgglomeration (ENH Add dtype preservation to FeatureAgglomeration #24346)
- GaussianRandomProjection (ENH Preserving dtype for np.float32 in RandomProjection #22114)
- GenericUnivariateSelect
- Isomap (ENH Add dtype preservation for Isomap #24714)
- LatentDirichletAllocation (ENH Preserving dtype for np.float32 in LatentDirichletAllocation #22113)
- LinearDiscriminantAnalysis ([MRG+2] Add float32 support for Linear Discriminant Analysis #13273)
- LocallyLinearEmbedding (ENH Add dtype preservation to LocallyLinearEmbedding #24337)
- LogisticRegression (LogisticRegression convert to float64 (for SAG solver) #13243)
- MiniBatchDictionaryLearning
- MiniBatchSparsePCA (ENH Preserving dtype for np.float32 in SparsePCA and MiniBatchSparsePCA #22111)
- NMF
- PLSCanonical - not applicable as both `X` and `y` are used
- PLSRegression - not applicable as both `X` and `y` are used
- PLSSVD - not applicable as both `X` and `y` are used
- RBFSampler (ENH Use `X`'s dtype for the projection in `RBFSampler` #24317)
- RidgeRegression ([MRG+1] ENH Ridge with solver SAG/SAGA does not cast to float64 #13302)
- SGDClassifier/SGDRegressor ([WIP] Allow SGDClassifier to support np.float32 without upcasting to float64 #9084) (ENH: Preserve float32/64 for SGD #13346)
- SkewedChi2Sampler
- SparsePCA (ENH Preserving dtype for np.float32 in SparsePCA and MiniBatchSparsePCA #22111)
- SparseRandomProjection (ENH Preserving dtype for np.float32 in RandomProjection #22114)
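To see the problem concretely, the following minimal sketch reproduces the upcast for one of the transformers listed above (`FastICA` here; on versions where the issue is fixed, the output dtype will match the input):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
X = rng.rand(100, 5).astype(np.float32)

# On affected versions, fit_transform silently upcasts float32 to float64.
X_trans = FastICA(n_components=3, random_state=0).fit_transform(X)
print(X.dtype, X_trans.dtype)
```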
We could think about extending this to integer dtypes whenever possible and applicable.
Also, the following transformers are not included in the common tests, so we should write a specific test for each of them (a sketch is given after the list):
```python
# some strange ones
DONT_TEST = ['SparseCoder', 'DictVectorizer',
             'TfidfTransformer',
             'TfidfVectorizer',  # check 10443
             'IsotonicRegression',
             'CategoricalEncoder',
             'FeatureHasher',
             'TruncatedSVD', 'PolynomialFeatures',
             'GaussianRandomProjectionHash', 'HashingVectorizer',
             'CountVectorizer']
```
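For these, a specific test could look like the following sketch (using `TruncatedSVD` as an example; the test name and parametrization are illustrative, not existing scikit-learn tests):

```python
import numpy as np
import pytest
from sklearn.decomposition import TruncatedSVD

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_truncated_svd_preserves_dtype(dtype):
    # The transformed output should keep the input dtype.
    X = np.random.RandomState(0).rand(30, 10).astype(dtype)
    X_trans = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
    assert X_trans.dtype == dtype
```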
We could also check classifiers, regressors or clusterers (see #8769 for more context); a quick check is sketched after this list:
- AffinityPropagation -> bug in Incorrect Clusters Due To Dtype Mismatch #10832
- check SVC -> SVC: Do not enforce that input data is of type np.float64 #10713
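One quick way to inspect such estimators is to look at the dtype of their fitted array attributes, sketched here for `SVC` (illustrative; whether `support_vectors_` keeps float32 depends on the version and the linked fix):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(20, 4).astype(np.float32)
y = np.tile([0, 1], 10)  # two balanced classes

clf = SVC().fit(X, y)
# float64 here indicates the float32 input was upcast internally.
print(clf.support_vectors_.dtype)
```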
Below is the code executed to find the failures:
```python
import numpy as np
from sklearn.base import clone
from sklearn.utils._testing import set_random_state

# X, y_ and transformer_orig are provided by the common-test context.
# Let's check the 32 - 64 bits type conservation.
if isinstance(X, np.ndarray):
    for dtype in [np.float32, np.float64]:
        X_cast = X.astype(dtype)
        transformer = clone(transformer_orig)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y_)
        elif hasattr(transformer, 'fit'):
            transformer.fit(X_cast, y_)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we check that the dtypes of some attributes are the
        # same as dtype?
        assert X_trans.dtype == X_cast.dtype, (
            'transform dtype: {} - original dtype: {}'.format(
                X_trans.dtype, X_cast.dtype))
```
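A possible extension addressing the FIXME above, sketched under the assumption that fitted floating-point array attributes should share the input dtype (the loop below is illustrative, not part of the actual common tests, and reuses `transformer` and `X_cast` from the snippet above):

```python
# Hedged sketch: also check fitted floating-point array attributes.
for name, value in vars(transformer).items():
    if isinstance(value, np.ndarray) and value.dtype.kind == 'f':
        assert value.dtype == X_cast.dtype, (
            'attribute {}: {} - original dtype: {}'.format(
                name, value.dtype, X_cast.dtype))
```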
Tips to run the test for a specific transformer:
- Choose a transformer, for instance `FastICA`.
- If this class does not already have a method named `_more_tags`, add the following code snippet at the bottom of the class definition:

  ```python
  def _more_tags(self):
      return {"preserves_dtype": [np.float64, np.float32]}
  ```
- Run the common tests for this specific class:

  ```
  pytest sklearn/tests/test_common.py -k "FastICA and check_transformer_preserve_dtypes" -v
  ```
- It should fail: read the error message and try to understand why the `fit_transform` method (if it exists) or the `transform` method returns a `float64` data array when it is passed a `float32` input array.
It might be helpful to use a debugger, for instance by adding the line:

```python
import pdb; pdb.set_trace()
```

at the beginning of the `fit_transform` method and then re-running pytest with:

```
pytest sklearn/tests/test_common.py -k "FastICA and check_transformer_preserve_dtypes" --pdb
```
Then use the `l` (list), `n` (next), `s` (step into a function call), `p some_array_variable.dtype` (`p` stands for print) and `c` (continue) commands to interactively debug the execution of this `fit_transform` call.
ping @rth feel free to edit this thread.