Allowing drop='first' and handle_unknown='ignore' in OneHotEncoding #19346

@vinicius-cleves

When using OneHotEncoder, it is common to drop one column per feature in order to prevent multicollinearity. This is done with the drop='first' flag.

On the other hand, when cross-validating a model, it can happen that not every category appears in the training split. In this setting, it is usual to just assign 0 to all columns for a category that was not seen during fit. This is done with handle_unknown='ignore'.
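For reference, here is a minimal sketch of how the two flags behave when used separately. The toy categories are made up for illustration, and `.toarray()` is called on the encoder's default sparse output only to make the result easy to read.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed for illustration): one categorical feature, three categories.
X_train = [["a"], ["b"], ["c"]]

# drop='first' removes the column of the first category ("a"),
# so "a" is encoded as all zeros.
enc_drop = OneHotEncoder(drop="first")
dropped = enc_drop.fit_transform(X_train).toarray()
print(dropped)

# handle_unknown='ignore' encodes a category unseen during fit as all zeros.
enc_ignore = OneHotEncoder(handle_unknown="ignore")
enc_ignore.fit([["a"], ["b"]])
unknown = enc_ignore.transform([["c"]]).toarray()
print(unknown)
```

In both cases a row of zeros carries meaning: with drop='first' it stands for the dropped category, and with handle_unknown='ignore' it stands for an unseen one.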

The two flags cannot be set together, as doing so raises:

ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.

It is true that using drop='first' and handle_unknown='ignore' together leads to some degree of ambiguity, since the dropped category and an unknown category would both be encoded as all zeros. But I don't see any other way to deal with multicollinearity and categories unseen during training at the same time.
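The ambiguity can be illustrated without scikit-learn. The sketch below uses a hypothetical helper, `one_hot_drop_first_ignore_unknown`, which is not part of any library; it only mimics what combining the two flags would mean for a single feature.

```python
def one_hot_drop_first_ignore_unknown(categories, value):
    """Hypothetical helper, not scikit-learn code.

    Encodes `value` against the sorted `categories`, dropping the first
    category's column and mapping unknown values to all zeros.
    """
    kept = sorted(categories)[1:]  # drop the first category's column
    return [1 if value == cat else 0 for cat in kept]


train_categories = ["a", "b", "c"]

first = one_hot_drop_first_ignore_unknown(train_categories, "a")    # dropped category
unknown = one_hot_drop_first_ignore_unknown(train_categories, "z")  # unseen category
other = one_hot_drop_first_ignore_unknown(train_categories, "b")    # kept category

print(first)    # [0, 0]
print(unknown)  # [0, 0]
print(other)    # [1, 0]
```

The dropped category "a" and the unseen category "z" produce the same row, so a downstream model would treat an unknown value exactly like the baseline category.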

I believe the current behavior is inadequate. It would be better to enable setting both flags at the same time.

This post and this question have found the same thing problematic.
