Description
Describe the bug
For OrdinalEncoder and OneHotEncoder, the fit(X) method fails if two conditions are true: (1) X contains byte strings (dtype='S'), and (2) categories is specified in the constructor. The error:
TypeError: ufunc 'isnan' not supported for the input types
Broken as of v0.24. Seems introduced by PR #17317 (super-useful!). A fix might be as simple as changing the line below to 'OUS', but I am not sure. Maybe @thomasjpfan can weigh in.
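A minimal NumPy-only sketch of the suspected cause, for illustration: np.isnan rejects byte-string scalars, so a dtype-kind check that lists only 'O' and 'U' would let dtype 'S' arrays fall through to the NaN check. The helpers isnan_fails and skip_nan_check are hypothetical names, not scikit-learn functions, and the kind check is only an assumption about what the fix might look like.

```python
import numpy as np

# Byte-string categories, as in the report: dtype 'S1', kind 'S'.
cats = np.array([b"A", b"B"])


def isnan_fails(value):
    """Return True if np.isnan raises TypeError for this value."""
    try:
        np.isnan(value)
    except TypeError:
        return True
    return False


def skip_nan_check(arr, string_kinds="OUS"):
    """Hypothetical guard: treat object, unicode and bytes arrays as
    non-numeric so the NaN check is never run on them."""
    return arr.dtype.kind in string_kinds


print(cats.dtype.kind)               # kind code of the byte-string array
print(isnan_fails(cats[-1]))         # np.isnan cannot handle a bytes scalar
print(skip_nan_check(cats, "OU"))    # a check without 'S' lets bytes through
print(skip_nan_check(cats, "OUS"))   # adding 'S' would skip the NaN check
```

Whether adding 'S' is sufficient inside the encoder itself is not verified here; the sketch only shows that the NaN check cannot run on byte strings.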
Workaround
Convert X to unicode for any calls to fit or transform, as in encoder.fit(X.astype('U')). (It is not necessary for the categories argument to be converted to unicode for that to work.)
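The conversion could be wrapped in a small helper so that only byte-string arrays are touched. This is a sketch under assumptions: as_unicode is a hypothetical name, and the bytes are assumed to be ASCII-only.

```python
import numpy as np


def as_unicode(X):
    """Convert a byte-string array (dtype kind 'S') to unicode (kind 'U').

    Arrays of any other dtype are returned unchanged.
    Assumes the bytes are ASCII-only.
    """
    X = np.asarray(X)
    return X.astype("U") if X.dtype.kind == "S" else X


X = np.array([[b"A"], [b"B"]])  # dtype='S1'
X_u = as_unicode(X)
print(X_u.dtype.kind)
```

The result would then be passed wherever X was, for example encoder.fit(as_unicode(X)).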
Steps/Code to Reproduce
It happens even if you feed the estimated categories_ attribute back into the constructor of a new OrdinalEncoder or OneHotEncoder instance:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
X = np.array([[b'A'], [b'B']]) # dtype='S1'
enc_auto = OrdinalEncoder().fit(X) # categories='auto' can fit
enc_cats = OrdinalEncoder(categories=enc_auto.categories_)
enc_cats.fit_transform(X) # categories=[...] cannot fit
Expected Results
In v0.23 the above example produces:
array([[0.],
       [1.]])
Actual Results
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-55-2f684d052a03> in <module>
----> 5 enc_cats.fit_transform(X)
sklearn/base.py in fit_transform(self, X, y, **fit_params)
--> 699 return self.fit(X, **fit_params).transform(X)
sklearn/preprocessing/_encoders.py in fit(self, X, y)
--> 761 self._fit(X)
sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
---> 98 stop_idx = -1 if np.isnan(sorted_cats[-1]) else None
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Versions
System:
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) [GCC 9.3.0]
executable: /home/atd/ext/miniconda3/bin/python
machine: Linux-5.4.0-52-generic-x86_64-with-glibc2.31
Python dependencies:
pip: 21.0.1
setuptools: 49.6.0.post20210108
sklearn: 0.24.1
numpy: 1.20.1
scipy: 1.6.0
Cython: None
pandas: None
matplotlib: 3.3.4
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True