OneHotEncoder doesn't handle columns with mix of string and int #11379

amueller · 2018-06-28T14:07:16Z

I haven't followed the review of OneHotEncoder that closely but I figure that's a known limitation:

from sklearn.preprocessing import OneHotEncoder
X2 = [['a'], ['b'], ['c'], [1]]
OneHotEncoder(categories='auto').fit_transform(X2)

TypeError: '<' not supported between instances of 'int' and 'str'

I think we probably want to support this, right?

jnothman · 2018-06-28T14:15:18Z

Not sure about mix of string and int, but yes, mix of string and None/NaN might be useful.

alokmalik · 2018-06-28T18:47:27Z

One may just call .astype(str) on X2, so that the int is treated as a string. As for @jnothman's comment that a mix of string and None/NaN might be useful: NaN values can also be filled with any string.

Please correct me if I'm wrong, but I think not handling two data types in one column is more of a feature than a bug, because it tells the user that something unexpected is in the data, e.g. a single outlier that is an int while all other values are strings. And if the user still wants to handle two data types in one column, that can be solved by converting them to a single data type as mentioned above.

amueller · 2018-06-28T19:27:18Z

@alokmalik pretty sure a list has no astype method. This would work for pandas dataframes, which might be the main usecase, but not in my example.
It might be enough to support dataframes, though.
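
A minimal sketch of the string-conversion workaround for a plain nested list, which has no astype method. It uses only the standard library; the names X2_str and categories are illustrative and not part of scikit-learn.

```python
# The example from the issue: one column mixing str and int.
X2 = [['a'], ['b'], ['c'], [1]]

# A nested list has no .astype, so convert element by element.
X2_str = [[str(value) for value in row] for row in X2]

# With a single type, the values can be sorted again.
categories = sorted({row[0] for row in X2_str})
print(categories)
```

Note the trade-off: after conversion the int 1 and the string '1' become the same category, which may or may not be what the user wants.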

jnothman · 2018-06-29T00:05:38Z

Another concern might be that this behaviour could differ between py2 and py3, since py3 banned several inter-type comparisons.

jorisvandenbossche · 2018-07-02T13:21:21Z

In #10209 I reworked the encoding for object dtype and unsorted categories will in principle become possible (which was not the case due to the use of np.searchsorted).
But for now I still sort the inferred categories if the user did not provide the categories themselves, also for object dtype (using sorted(set(values))), so the above use case will still not be possible if we want to store the inferred categories as sorted.
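
For illustration, a small standard-library sketch of why sorting the inferred categories fails for this column in Python 3. It only mirrors the sorted(set(values)) step mentioned above and is not the encoder's actual code path.

```python
values = ['a', 'b', 'c', 1]

# Sorting requires every pair of values to be orderable; in Python 3
# an int and a str are not, so this raises TypeError.
try:
    inferred = sorted(set(values))
except TypeError as exc:
    inferred = None
    error_message = str(exc)
    print(error_message)
```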

I am not fully convinced that we should try to support such a mixed string/int in a single feature.
Is that really worth the effort? @amueller was it in a specific use case that you encountered this?

> Not sure about mix of string and int, but yes, mix of string and None/NaN might be useful.

That's a good point. Currently None or np.nan raise the same error (unorderable types in Python 3). But this is also something we could see as a separate problem, i.e. better handling of missing data in OneHotEncoder (currently you can already solve this with the constant imputer strategy).

jorisvandenbossche · 2018-07-02T13:23:16Z

So with the PR mentioned above, I still get the same error as above, but specifying mixed int/string categories now works:

In [4]: from sklearn.preprocessing import OneHotEncoder
   ...: X2 = [['a'], ['b'], ['c'], [1]]
   ...: OneHotEncoder(categories=[['a', 'b', 'c', 1]]).fit_transform(X2).toarray()
Out[4]: 
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

because of the fact that user-specified categories are not sorted.
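
A rough sketch of why an explicit category list sidesteps the problem: a lookup by equality never compares values with <, so mixed types are fine. The helper encode_with_lookup is hypothetical and is not how scikit-learn implements the encoding.

```python
def encode_with_lookup(column, categories):
    """One-hot encode `column` using `categories` in the given order."""
    # Map each category to its output column; no sorting is involved.
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in column:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0  # raises KeyError for an unknown value
        rows.append(row)
    return rows


encoded = encode_with_lookup(['a', 'b', 'c', 1], ['a', 'b', 'c', 1])
for row in encoded:
    print(row)
```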

jnothman · 2018-07-02T22:16:19Z

I think it would be good to support inferred categories when one is the pandas missing value placeholder

jorisvandenbossche · 2018-07-03T11:11:07Z

> I think it would be good to support inferred categories when one is the pandas missing value placeholder

When sorting the inferred categories, we could first check if None or np.nan is in there (and then special-case those, e.g. to have them at the end of the sorted categories).
That would not be too hard to implement, but it would basically mean having "treat missing values as a separate category" as the default behaviour, which then relates to #10465 (the discussion of what the default should be).
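
The special-casing idea could look roughly like the sketch below. The helpers is_nan and sort_with_missing_last are hypothetical names, and the choice to put None before NaN at the end is an assumption made for the example.

```python
import math


def is_nan(value):
    # Only floats can be NaN; strings and None are left alone.
    return isinstance(value, float) and math.isnan(value)


def sort_with_missing_last(values):
    """Sort the non-missing values, then append None and/or NaN."""
    present = sorted({v for v in values if v is not None and not is_nan(v)})
    tail = []
    if any(v is None for v in values):
        tail.append(None)
    if any(is_nan(v) for v in values):
        tail.append(float('nan'))
    return present + tail


categories = sort_with_missing_last(['b', None, 'a', float('nan'), 'b'])
print(categories)
```

This only handles the missing-value placeholders; a column mixing strings and ints would still fail in the sorted call.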

amueller · 2019-02-13T16:37:09Z

btw had a student come to me with an error from mixing float and string for a homework :-/
The original column was encoded as float but was actually categorical, they imputed using the string "missing" and then OHE breaks...
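
A standard-library sketch of that situation, assuming a float-coded categorical column with one missing entry. The helper fill_missing is hypothetical and stands in for constant imputation.

```python
import math

# A categorical column that happens to be stored as floats.
raw = [1.0, 2.0, float('nan'), 1.0]


def fill_missing(column, fill_value):
    # Replace NaN entries with a constant, leaving other values untouched.
    return [
        fill_value if isinstance(v, float) and math.isnan(v) else v
        for v in column
    ]


# Imputing with a string yields a mixed float/str column ...
imputed = fill_missing(raw, 'missing')
try:
    sorted(set(imputed))
    sortable = True
except TypeError:
    sortable = False  # ... which can no longer be sorted in Python 3.

# Casting everything to str makes the categories orderable again.
categories = sorted({str(v) for v in imputed})
print(sortable, categories)
```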

This was referenced Jun 28, 2018

Handling of missing values in the CategoricalEncoder #10465

Closed

Support standard data science use-case #10603

Open

timbicker mentioned this issue Sep 8, 2018

Categorical Naive Bayes not available #10856

Closed

cmarmo added Enhancement help wanted module:preprocessing labels Jan 15, 2022
