OneHotEncoder doesn't handle columns with mix of string and int #11379

Open
amueller opened this issue Jun 28, 2018 · 9 comments

@amueller
Member

I haven't followed the review of OneHotEncoder that closely but I figure that's a known limitation:

from sklearn.preprocessing import OneHotEncoder
X2 = [['a'], ['b'], ['c'], [1]]
OneHotEncoder(categories='auto').fit_transform(X2)

TypeError: '<' not supported between instances of 'int' and 'str'

I think we probably want to support this, right?

@jnothman
Member

jnothman commented Jun 28, 2018 via email

@alokmalik

One may just call .astype(str) on X2, so that the int is treated as a string. As for @jnothman's comment that a 'mix of string/int and NaN might be useful': NaN values can be filled with any string too.

Please correct me if I'm wrong, but I think not handling two data types in one column is more of a feature than a bug, because it tells the user when something unexpected is in the data, such as a single int outlier among values that are otherwise all strings. And if the user still wants to handle two data types in one column, that is solvable by converting them to a single data type as mentioned above.

@amueller
Member Author

@alokmalik pretty sure a list has no astype method. This would work for pandas dataframes, which might be the main use case, but not in my example.
It might be enough to support dataframes, though.
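
For a plain list of lists like the one in the report, the same conversion can be done by hand. This is only a sketch of the workaround discussed above, not something OneHotEncoder does itself; the names X2_str and categories are illustrative.

```python
# The data from the original report: one feature mixing strings and an int.
X2 = [['a'], ['b'], ['c'], [1]]

# A list has no astype method, so convert every value to str explicitly.
X2_str = [[str(value) for value in row] for row in X2]

# With a single type in the column, the values can be sorted again.
categories = sorted({row[0] for row in X2_str})
```

One caveat of this approach: the int 1 and the string '1' become the same category after conversion.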

@jnothman
Member

jnothman commented Jun 29, 2018 via email

@jorisvandenbossche
Member

In #10209 I reworked the encoding for object dtype and unsorted categories will in principle become possible (which was not the case due to the use of np.searchsorted).
But for now I still sort the inferred categories if the user did not provide the categories themselves, also for object dtype (using sorted(set(values))), so the above use case will still not be possible if we want to store the inferred categories as sorted.
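
The sorting problem can be reproduced without scikit-learn. A minimal sketch, assuming the categories are inferred with sorted(set(values)) as described above; the variable names are illustrative.

```python
# Values of a single feature, mixing strings and an int (as in the report).
values = ['a', 'b', 'c', 1]

try:
    categories = sorted(set(values))
except TypeError as exc:
    # Python 3 refuses to order str against int, so the sort fails.
    categories = None
    error_message = str(exc)

# A column with a single type sorts without trouble.
string_categories = sorted(set(['b', 'a', 'c', 'a']))
```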

I am not fully convinced that we should try to support such a mixed string/int in a single feature.
Is that really worth the effort? @amueller was it in a specific use case that you encountered this?

> Not sure about mix of string and int, but yes, mix of string and None/NaN might be useful.

That's a good point. Currently None or np.nan raise the same error (unorderable types in Python 3). But this is also something we could see as a separate problem, i.e. better handling of missing data in OneHotEncoder (currently you can already solve this with the constant imputer strategy).
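
A sketch of the constant-imputer workaround, assuming a scikit-learn version that provides sklearn.impute.SimpleImputer and an object-dtype column that uses np.nan as the missing marker. The fill value 'missing' is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# One string feature with a missing entry, stored as an object array.
X = np.array([['a'], ['b'], [np.nan], ['a']], dtype=object)

# Replace the missing entry with a constant string so that all values
# in the column are strings and can be sorted.
imputer = SimpleImputer(strategy='constant', fill_value='missing')
X_filled = imputer.fit_transform(X)

# The imputed column now encodes like any other string column.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(X_filled).toarray()
```

This effectively treats missing values as their own category, with the fill string as the category label.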

@jorisvandenbossche
Member

So with the PR mentioned above, I still get the same error as above, but explicitly specifying mixed int/string categories now works:

In [4]: from sklearn.preprocessing import OneHotEncoder
   ...: X2 = [['a'], ['b'], ['c'], [1]]
   ...: OneHotEncoder(categories=[['a', 'b', 'c', 1]]).fit_transform(X2).toarray()
Out[4]: 
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

This works because user-specified categories are not sorted.

@jnothman
Member

jnothman commented Jul 2, 2018 via email

@jorisvandenbossche
Member

> I think it would be good to support inferred categories when one is the pandas missing value placeholder

When sorting the inferred categories, we could first check if None or np.nan is in there (and then special-case those, e.g. to have them at the end of the sorted categories).
That would not be too hard to implement, but it would basically mean having "treat missing values as separate category" as the default behaviour, which then relates to #10465 (discussion of what should be the default).
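
The special-casing could look roughly like the sketch below. The helpers is_nan and infer_categories are hypothetical names used for illustration, not scikit-learn functions, and this is not the implementation from the PR.

```python
import math


def is_nan(value):
    # np.nan is a Python float, so this check covers it as well.
    return isinstance(value, float) and math.isnan(value)


def infer_categories(values):
    """Sort the non-missing values, then append None/NaN at the end."""
    present = {v for v in values if v is not None and not is_nan(v)}
    categories = sorted(present)
    # Missing markers are kept out of the sort and added afterwards,
    # so they never get compared with the other values.
    if any(v is None for v in values):
        categories.append(None)
    if any(is_nan(v) for v in values):
        categories.append(float('nan'))
    return categories


result = infer_categories(['b', None, 'a', float('nan'), 'b'])
```

A column that mixes strings and ints would still fail in sorted(present); only the missing markers are special-cased here.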

@amueller
Member Author

btw had a student come to me with an error from mixing float and string for a homework :-/
The original column was encoded as float but was actually categorical, they imputed using the string "missing" and then OHE breaks...
