Confusing error message in OneHotEncoder with None-encoded missing values #16702

Closed
@ogrisel

Description

Sister issue for #16703 (OrdinalEncoder)

Code to reproduce

import pandas as pd
from sklearn.preprocessing import OneHotEncoder


df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
OneHotEncoder().fit(df)

Observed result

Got: TypeError: '<' not supported between instances of 'str' and 'NoneType'

Full traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    111         try:
--> 112             res = _encode_python(values, uniques, encode)
    113         except TypeError:

~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode_python(values, uniques, encode)
     59     if uniques is None:
---> 60         uniques = sorted(set(values))
     61         uniques = np.array(uniques, dtype=values.dtype)

TypeError: '<' not supported between instances of 'str' and 'NoneType'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-31-4b176a24c3a2> in <module>
      4 
      5 df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
----> 6 OneHotEncoder().fit(df)

~/code/scikit-learn/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    375         """
    376         self._validate_keywords()
--> 377         self._fit(X, handle_unknown=self.handle_unknown)
    378         self.drop_idx_ = self._compute_drop_idx()
    379         return self

~/code/scikit-learn/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
     84             Xi = X_list[i]
     85             if self.categories == 'auto':
---> 86                 cats = _encode(Xi)
     87             else:
     88                 cats = np.array(self.categories[i], dtype=Xi.dtype)

~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    112             res = _encode_python(values, uniques, encode)
    113         except TypeError:
--> 114             raise TypeError("argument must be a string or number")
    115         return res
    116     else:

TypeError: argument must be a string or number
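The first exception in the traceback comes from `sorted(set(values))`, which has to compare `str` and `None` entries. A minimal sketch, using only plain Python and the same values as the reproduction above, shows that the failure does not depend on scikit-learn itself:

```python
# Pure-Python reproduction of the first TypeError in the traceback.
# The values mirror the "cat_feature" column from the report.
values = ["a", None, "b", "a"]

try:
    # Sorting forces '<' comparisons between str and None entries.
    sorted(set(values))
except TypeError as exc:
    message = str(exc)
    # The order of the two type names in the message can vary,
    # because set iteration order is not fixed.
    print(f"TypeError: {message}")
```

The generic "argument must be a string or number" message is then raised while handling this exception, which is why the final error no longer mentions `None` at all.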

Expected result

A more informative ValueError, for instance:

ValueError: OneHotEncoder does not accept None typed values. Missing values should be imputed first, for instance using sklearn.impute.SimpleImputer.
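One way such a message could be produced is an explicit check for `None` before any sorting happens. The sketch below is only an illustration: `check_no_none` and its `estimator_name` parameter are hypothetical names, not part of scikit-learn, and the message points at `sklearn.impute`, which is where `SimpleImputer` lives.

```python
def check_no_none(values, estimator_name="OneHotEncoder"):
    """Hypothetical pre-check (not a scikit-learn function).

    Raises an informative ValueError if any entry is None,
    otherwise returns the values unchanged.
    """
    if any(v is None for v in values):
        raise ValueError(
            f"{estimator_name} does not accept None typed values. "
            "Missing values should be imputed first, for instance using "
            "sklearn.impute.SimpleImputer."
        )
    return values


try:
    check_no_none(["a", None, "b", "a"])
except ValueError as exc:
    error_message = str(exc)
    print(f"ValueError: {error_message}")
```

Checking up front avoids relying on the `TypeError` raised by the sort, so the user sees the actual cause rather than a side effect of it.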

Maybe the message could even include the URL of an FAQ entry or example that shows how to deal with a mix of str and None typed values, for instance by applying the following imputer before OneHotEncoder:

SimpleImputer(strategy="constant", missing_values=None, fill_value="missing")
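To see why constant imputation resolves the problem, the sketch below mimics its effect in plain Python. It is not how `SimpleImputer` is implemented; `fill_none` is a hypothetical helper, and `"missing"` is the fill value suggested above.

```python
def fill_none(values, fill_value="missing"):
    """Hypothetical helper mimicking constant imputation of None entries."""
    return [fill_value if v is None else v for v in values]


column = ["a", None, "b", "a"]
imputed = fill_none(column)

# Every entry is now a str, so the categories can be sorted without error.
categories = sorted(set(imputed))
print(categories)
```

After imputation the missing entries form their own `"missing"` category, which the encoder can treat like any other string value.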
