Confusing error message in OneHotEncoder with None-encoded missing values #16702

Closed
ogrisel opened this issue Mar 16, 2020 · 4 comments · Fixed by #16713
Labels
Bug, good first issue, help wanted


ogrisel commented Mar 16, 2020

Sister issue for #16703 (OrdinalEncoder)

Code to reproduce

import pandas as pd
from sklearn.preprocessing import OneHotEncoder


df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
OneHotEncoder().fit(df)

Observed result

Got: TypeError: '<' not supported between instances of 'str' and 'NoneType'

Full traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    111         try:
--> 112             res = _encode_python(values, uniques, encode)
    113         except TypeError:

~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode_python(values, uniques, encode)
     59     if uniques is None:
---> 60         uniques = sorted(set(values))
     61         uniques = np.array(uniques, dtype=values.dtype)

TypeError: '<' not supported between instances of 'str' and 'NoneType'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-31-4b176a24c3a2> in <module>
      4 
      5 df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
----> 6 OneHotEncoder().fit(df)

~/code/scikit-learn/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    375         """
    376         self._validate_keywords()
--> 377         self._fit(X, handle_unknown=self.handle_unknown)
    378         self.drop_idx_ = self._compute_drop_idx()
    379         return self

~/code/scikit-learn/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
     84             Xi = X_list[i]
     85             if self.categories == 'auto':
---> 86                 cats = _encode(Xi)
     87             else:
     88                 cats = np.array(self.categories[i], dtype=Xi.dtype)

~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    112             res = _encode_python(values, uniques, encode)
    113         except TypeError:
--> 114             raise TypeError("argument must be a string or number")
    115         return res
    116     else:

TypeError: argument must be a string or number
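
The inner exception in the traceback comes from sorting a set that mixes str and None. A minimal sketch that reproduces it without scikit-learn, assuming only the standard library (the exact operand order in the message can vary, so it is not asserted here):

```python
# Same values as the "cat_feature" column in the reproducer above.
values = ["a", None, "b", "a"]

message = None
try:
    # This mirrors the sorted(set(values)) call shown in the traceback.
    sorted(set(values))
except TypeError as exc:
    # Python 3 refuses to order str against NoneType.
    message = str(exc)

print(message)
```

The outer "argument must be a string or number" message is raised on top of this one, which is why the actual cause is hard to spot.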

Expected result

A more informative ValueError, for instance:

ValueError: OneHotEncoder does not accept None typed values. Missing values should be imputed first, for instance using sklearn.impute.SimpleImputer.
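
A rough sketch of what such an early check could look like. The helper name `check_no_none` and its signature are hypothetical and not part of scikit-learn; the message points at `sklearn.impute`, which is where `SimpleImputer` lives.

```python
def check_no_none(values, estimator_name="OneHotEncoder"):
    """Hypothetical helper: fail early with a readable message if None is present."""
    if any(value is None for value in values):
        raise ValueError(
            f"{estimator_name} does not accept None typed values. "
            "Missing values should be imputed first, for instance using "
            "sklearn.impute.SimpleImputer."
        )
    return values


try:
    check_no_none(["a", None, "b", "a"])
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Running the check before any sorting means the user sees the ValueError instead of the comparison TypeError.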

Maybe we could even include the URL of some FAQ or example that shows how to deal with a mix of str and None typed values and use the following prior to OneHotEncoding:

SimpleImputer(strategy="constant", missing_values=None, fill_value="missing")
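
For illustration, here is how that imputation step might be chained in front of the encoder. This is a sketch, not an official recipe: it uses a NumPy object array rather than the DataFrame from the reproducer so that the None marker is preserved regardless of the pandas version, and the step name "onehotencoder" is the one `make_pipeline` generates automatically.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One categorical column with a None-encoded missing value.
X = np.array([["a"], [None], ["b"], ["a"]], dtype=object)

pipeline = make_pipeline(
    # Replace None with the explicit category "missing" before encoding.
    SimpleImputer(strategy="constant", missing_values=None, fill_value="missing"),
    OneHotEncoder(),
)
encoded = pipeline.fit_transform(X)

# The imputed marker shows up as a regular category.
categories = pipeline.named_steps["onehotencoder"].categories_[0].tolist()
print(categories)
```

With this setup the missing entries get their own one-hot column instead of triggering the TypeError.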
jnothman commented Mar 17, 2020 via email

jnothman added the good first issue and help wanted labels on Mar 17, 2020
ogrisel commented Mar 19, 2020

I observed None used as a missing value marker by the fetch_openml loader, for instance on the Ames housing dataset:

from sklearn.datasets import fetch_openml
X, y = fetch_openml("house_prices", as_frame=True, return_X_y=True)

jnothman commented Mar 19, 2020 via email

ogrisel commented Mar 20, 2020

pandas 1.0 now has proper support for missing values in categorical and integer columns, so we can expect openml or fetch_openml to move to that scheme at some point. In the meantime, we probably need scikit-learn to expect either None or np.nan as a missing value marker in object dtype columns with str objects.
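
To make the "either None or np.nan" idea concrete, a minimal sketch of a per-value test. The helper `is_missing_marker` is hypothetical and only meant to illustrate the rule, not how scikit-learn implements it. It relies on `np.nan` being a regular Python float, so no NumPy import is needed here.

```python
import math


def is_missing_marker(value):
    """Hypothetical helper: True for None or a float NaN, False otherwise."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# An object-like column mixing str values with both missing value markers.
column = ["a", None, "b", float("nan")]
mask = [is_missing_marker(value) for value in column]
print(mask)
```

The `isinstance` guard matters because `math.isnan` raises a TypeError on str inputs.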

2 participants