Handling of missing values in the CategoricalEncoder #10465

Closed
@jorisvandenbossche

Description

Currently the CategoricalEncoder doesn't handle missing values (you get an error about unorderable types or about nan being an unknown category for numerical types).
This came up in an issue about imputing missing values for categoricals (#2888), but independently of whether we add such abilities to Imputer, we should also discuss how CategoricalEncoder itself could handle missing values in different ways.
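
To illustrate where the "unorderable types" error comes from, here is a minimal sketch in plain Python. It does not call the CategoricalEncoder itself; it only reproduces the underlying comparison failure that sorting categories runs into. The helper name `try_sort` is made up for this illustration.

```python
def try_sort(values):
    """Return the sorted list, or the TypeError raised while sorting."""
    try:
        return sorted(values)
    except TypeError as exc:
        return exc

# Strings only: sorting works.
only_strings = try_sort(["b", "a", "c"])

# Strings mixed with NaN or None: Python 3 refuses to order
# str against float or NoneType, so sorting raises TypeError.
with_nan = try_sort(["b", float("nan"), "a"])
with_none = try_sort(["b", None, "a"])

print(only_strings)
print(type(with_nan).__name__, type(with_none).__name__)
```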

Possible ways to deal with missing values (np.nan or None):

  • Raise an error when missing values are present:
    • This is still a good default I think (we should only make sure to provide a better error message than the one currently raised)
  • Ignore missing values (treat as unknown):
    • This would give a row of all zeros for dummy encoding, and would not be implemented for ordinal encoding.
    • In this way, it is similar in behaviour to unknown categories with handle_unknown='ignore', apart from the fact that it can also occur in the training data.
  • Regard missing value as a separate category
    • For ordinal encoding this would give an additional integer, for dummy encoding an additional column.
    • Something similar is available in pd.get_dummies if you specify dummy_na=True keyword.
    • Implementation-wise, a problem is that if your categories consist of a couple of string values and a missing value (np.nan or None), they become unorderable, while in the CategoricalEncoder we normally sort the unique categories. As a possible solution, we could fall back in such a case to sorting the non-missing ones first and then adding np.nan at the end.
    • This would be similar to an indicator feature.
  • Preserve as NaN:
    • From the comment by @amueller (Improve Imputer 'most_frequent' strategy #2888 (comment)), I suppose the idea would be to first treat it as a separate category, but before returning the result replace that category again with NaN (so it can be imputed after encoding).
    • This might make sense only for ordinal encoding, unless we want a full row of NaNs for dummy case.
      This option could actually also be a way to deal with the "imputing categorical features" problem (see also next bullet), as it allows an easier and more flexible combination of encoding / imputing.
  • Impute missing values (e.g. with the 'most_frequent' option):
    • Personally I think this one should be left to Imputer itself, but adding it here instead could limit the scope of Imputer to numerical features.
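
As a point of comparison for the "ignore" and "separate category" options above, this is how pd.get_dummies behaves on a toy Series (the data is invented for illustration). Without dummy_na the missing entry becomes a row of all zeros; with dummy_na=True it gets its own column.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (made-up data).
s = pd.Series(["a", "b", np.nan, "a"])

# Default: the missing value is ignored, so its row has no active column.
ignored = pd.get_dummies(s)

# dummy_na=True: the missing value is treated as a separate category
# and gets an additional indicator column.
separate = pd.get_dummies(s, dummy_na=True)

print(ignored)
print(separate)
```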

Those options (or a subset of them) could be added as an additional keyword to the CategoricalEncoder. Possible names: handle_missing, handle_na, missing_values
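
To make the keyword idea concrete, here is a rough pure-Python sketch of ordinal encoding with a hypothetical `handle_missing` argument. The function name, the option names ('error', 'separate', 'preserve') and the return shape are assumptions for illustration only, not a proposal for the actual implementation. It also shows the fallback of sorting only the non-missing categories.

```python
import math

def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def ordinal_encode(values, handle_missing="error"):
    """Hypothetical sketch, not the CategoricalEncoder API.

    Returns (codes, categories), where categories are the sorted
    non-missing values.
    """
    if handle_missing not in ("error", "separate", "preserve"):
        raise ValueError("unknown handle_missing option: %r" % (handle_missing,))

    has_missing = any(is_missing(v) for v in values)
    if has_missing and handle_missing == "error":
        raise ValueError(
            "Input contains missing values; impute them first or "
            "choose another handle_missing option."
        )

    # Sort only the non-missing categories, so that a mix of strings
    # and NaN/None never has to be ordered.
    categories = sorted({v for v in values if not is_missing(v)})
    codes = {cat: i for i, cat in enumerate(categories)}

    if handle_missing == "separate":
        # Missing values get one additional integer after the sorted categories.
        missing_code = len(categories)
    else:
        # 'preserve': keep NaN in the output so it can be imputed afterwards.
        missing_code = float("nan")

    encoded = [missing_code if is_missing(v) else codes[v] for v in values]
    return encoded, categories

data = ["b", None, "a", float("nan")]
print(ordinal_encode(data, handle_missing="separate"))  # ([1, 2, 0, 2], ['a', 'b'])
print(ordinal_encode(data, handle_missing="preserve")[0])
```

With 'preserve', the result has to be a float-like output because NaN cannot be stored as an integer code, which is one reason that option may only make sense for ordinal encoding.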

Related to discussions in #2888 and #9012 (comment)

Example notebook on a toy dataframe showing the current problem with missing data in categorical features: http://nbviewer.jupyter.org/gist/jorisvandenbossche/736cead26ab65116ff4de18015b0b324
