
OneHot/OrdinalEncoder categories broken for dtype='S' #19677

Closed
@andrewdelong

Describe the bug

For OrdinalEncoder and OneHotEncoder, the fit(X) method fails if two conditions are true: (1) X contains byte strings (dtype='S'), and (2) categories is specified in the constructor. The error:

TypeError: ufunc 'isnan' not supported for the input types

This has been broken since v0.24 and seems to have been introduced by PR #17317 (super-useful!). A fix might be as simple as changing 'OU' in the line below to 'OUS', but I am not sure. Maybe @thomasjpfan can weigh in.

if Xi.dtype.kind not in 'OU':
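
To illustrate why that check matters, here is a minimal NumPy-only sketch. It assumes the failure comes from byte strings (kind 'S') passing the `not in 'OU'` test and then reaching `np.isnan`. The helper names `passes_current_check` and `passes_proposed_check` are hypothetical and not part of scikit-learn, and the 'OUS' variant is only the tentative suggestion above, not a confirmed fix.

```python
import numpy as np

# The same two categories stored with three different dtypes.
X_bytes = np.array([b'A', b'B'])      # byte strings, dtype='S1'
X_unicode = X_bytes.astype('U')       # unicode strings, dtype='<U1'
X_object = X_bytes.astype(object)     # Python objects, dtype=object

# dtype.kind is a one-character code: 'S' bytes, 'U' unicode, 'O' object.
kinds = [X_bytes.dtype.kind, X_unicode.dtype.kind, X_object.dtype.kind]


def passes_current_check(arr):
    """Hypothetical helper mirroring the condition quoted in the issue."""
    return arr.dtype.kind not in 'OU'


def passes_proposed_check(arr):
    """Hypothetical helper mirroring the tentatively suggested 'OUS' variant."""
    return arr.dtype.kind not in 'OUS'


# np.isnan has no implementation for byte strings, so it raises TypeError.
try:
    np.isnan(X_bytes[-1])
    isnan_failed = False
except TypeError:
    isnan_failed = True

print(kinds)
print(passes_current_check(X_bytes), passes_proposed_check(X_bytes))
print(isnan_failed)
```

With the current condition, a byte-string column enters the branch that calls `np.isnan`; with 'OUS' it would be skipped in the same way as unicode and object columns.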

Workaround

Convert X to unicode for any calls to fit or transform, as in encoder.fit(X.astype('U')). (It is not necessary for the categories argument to be converted to unicode for that to work.)
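
A minimal sketch of the workaround, assuming the input is small enough to copy with `astype('U')`. Here the categories are also derived from the unicode array, which is an extra precaution not required by the note above: it avoids mixing bytes and unicode, which may not behave the same way in every scikit-learn release.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([[b'A'], [b'B']])    # byte strings, dtype='S1'
X_unicode = X.astype('U')         # unicode copy, dtype='<U1'

# Learn the categories from the unicode data, then pass them explicitly.
enc_auto = OrdinalEncoder().fit(X_unicode)
enc_cats = OrdinalEncoder(categories=enc_auto.categories_)

# Fitting with explicit categories works because X is no longer dtype='S'.
encoded = enc_cats.fit_transform(X_unicode)
print(encoded)
```

Any later call to `transform` needs the same `astype('U')` conversion on its input.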

Steps/Code to Reproduce

It happens even if you feed the estimated categories_ attribute back into the constructor of a new OrdinalEncoder or OneHotEncoder instance:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
X = np.array([[b'A'], [b'B']])             # dtype='S1'
enc_auto = OrdinalEncoder().fit(X)         # categories='auto' can fit
enc_cats = OrdinalEncoder(categories=enc_auto.categories_)
enc_cats.fit_transform(X)                  # categories=[...] cannot fit

Expected Results

In v0.23 the above example produces:

array([[0.],
       [1.]])

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-55-2f684d052a03> in <module>
----> 5 enc_cats.fit_transform(X)

sklearn/base.py in fit_transform(self, X, y, **fit_params)
--> 699             return self.fit(X, **fit_params).transform(X)

sklearn/preprocessing/_encoders.py in fit(self, X, y)
--> 761         self._fit(X)

sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
---> 98                     stop_idx = -1 if np.isnan(sorted_cats[-1]) else None

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Versions

System:
    python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)  [GCC 9.3.0]
executable: /home/atd/ext/miniconda3/bin/python
   machine: Linux-5.4.0-52-generic-x86_64-with-glibc2.31

Python dependencies:
          pip: 21.0.1
   setuptools: 49.6.0.post20210108
      sklearn: 0.24.1
        numpy: 1.20.1
        scipy: 1.6.0
       Cython: None
       pandas: None
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True
