Description
When using OneHotEncoder, it is common practice to drop one column per feature in order to prevent multi-collinearity. This is done with the drop='first' parameter.
On the other hand, when cross-validating a model, it can happen that not every category appears in the training split. In this setting, it is usual to just assign 0 to all columns for a category that was not seen during fit. This is done with handle_unknown='ignore'.
The two parameters cannot be set together, as doing so leads to:
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
It is true that using drop='first' and handle_unknown='ignore' together leads to some degree of ambiguity, since the dropped category and any unknown category are both encoded as all zeros. But I don't see any other way to deal with multi-collinearity and unknown categories at the same time.
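To make the ambiguity concrete, here is a minimal pure-Python sketch. The `one_hot` helper and its `drop_first` / `ignore_unknown` arguments are hypothetical names made up for illustration; this is not scikit-learn's implementation, only an approximation of what combining the two options would mean.

```python
def one_hot(values, categories, drop_first=False, ignore_unknown=False):
    """Encode each value as a list of 0/1 indicators (hypothetical helper)."""
    # Dropping the first category removes its indicator column.
    kept = categories[1:] if drop_first else categories
    rows = []
    for v in values:
        if v not in categories and not ignore_unknown:
            raise ValueError(f"unknown category: {v!r}")
        # An unknown value matches no kept category, so its row is all zeros.
        rows.append([1 if v == c else 0 for c in kept])
    return rows


categories = ["red", "green", "blue"]  # categories assumed seen during fit
# "red" is the dropped reference category; "purple" was never seen.
encoded = one_hot(["red", "green", "purple"], categories,
                  drop_first=True, ignore_unknown=True)
print(encoded)
```

In this sketch the rows for "red" and "purple" come out identical, so a downstream model would treat an unseen category exactly like the reference category. That may be an acceptable trade-off for many use cases, which is the point of this request.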
I believe the current behavior is inadequate. It would be better to enable setting both flags at the same time.
This post and this question have found the same thing problematic.