Description
Currently the CategoricalEncoder doesn't handle missing values (you get an error about unorderable types or about nan being an unknown category for numerical types).
This came up in an issue about imputing missing values for categoricals (#2888), but independently of whether we add such abilities to Imputer, we should also discuss how CategoricalEncoder itself could handle missing values in different ways.
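For reference, the two errors mentioned above can be reproduced outside of the encoder. This is only a minimal sketch in plain Python (not the encoder's actual code path), assuming the failures come from sorting the unique categories and from an equality-based lookup of known categories:

```python
import math

# String categories plus a missing value.
categories = ["b", "a", math.nan]

# 1) Sorting mixed str/float values fails in Python 3: this is the
#    "unorderable types" kind of error.
try:
    sorted(categories)
    sort_error = None
except TypeError as exc:
    sort_error = exc

# 2) NaN never compares equal to itself, so an equality-based lookup
#    against the known numerical categories reports it as unknown.
known = [1.0, 2.0, math.nan]
value = float("nan")  # a NaN coming from new data
is_known = any(value == k for k in known)

print(type(sort_error).__name__)
print(is_known)
```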
Possible ways to deal with missing values (np.nan or None):
- Raise an error when missing values are present:
- This is still a good default I think (we should just make sure to provide a better error message than the one currently raised)
- Ignore missing values (treat as unknown):
- This would give a row of all zeros for dummy encoding, and would not be implemented for ordinal encoding.
- In this way, it is similar in behaviour to unknown categories with `handle_unknown='ignore'`, apart from the fact that it can also occur in the training data.
- Regard missing values as a separate category:
- For ordinal encoding this would give an additional integer, for dummy encoding an additional column.
- Something similar is available in `pd.get_dummies` if you specify the `dummy_na=True` keyword.
- Implementation-wise, a problem that would occur is that if your categories consist of a couple of string values and a missing value (np.nan or None), they become unorderable, while in the CategoricalEncoder we normally sort the unique categories (as a possible solution, we could fall back in such a case to sorting the non-missing ones first and then adding np.nan at the end).
- This would be similar to an indicator feature
- Preserve as NaN:
- From a comment by @amueller (Improve Imputer 'most_frequent' strategy #2888 (comment)): I suppose the idea would be to first treat it as a separate category, but to replace that category with NaN again before returning the result (so it can be imputed after encoding).
- This might make sense only for ordinal encoding, unless we want a full row of NaNs for the dummy case.
- This option could actually also be a way to deal with the "imputing categorical features" problem (see also next bullet), as it allows an easier and more flexible combination of encoding / imputing.
- Impute missing values (e.g. with the 'most_frequent' option):
- Personally I think this one should be left to `Imputer` itself, but adding it here instead could limit the scope of `Imputer` to numerical features.
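To make the difference between the "ignore" and "separate category" options above concrete, here is a small sketch using `pd.get_dummies`, which already offers both behaviours through its `dummy_na` keyword. The toy `colors` series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy categorical feature with one missing value (illustrative data).
colors = pd.Series(["red", "blue", np.nan, "red"])

# Default behaviour: the missing value is ignored, so its row is all zeros.
ignored = pd.get_dummies(colors)

# dummy_na=True: the missing value gets its own indicator column.
separate = pd.get_dummies(colors, dummy_na=True)

print(ignored.sum(axis=1).tolist())   # number of active columns per row
print(separate.sum(axis=1).tolist())
print(separate.shape)
```

In the first result the row for the missing value has no active column, which is what the "ignore" option would give for dummy encoding. In the second, every row has exactly one active column, at the cost of one extra column.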
Those options (or a subset of them) could be added as an additional keyword to the CategoricalEncoder. Possible names: `handle_missing`, `handle_na`, `missing_values`.
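As a rough sketch of how such a keyword could behave for ordinal encoding, the toy function below covers three of the options: raising an error, treating missing as a separate category, and preserving NaN. Everything here is hypothetical: `ordinal_encode`, `is_missing` and the option values `'error'`, `'category'` and `'preserve'` are names made up for this illustration, not an existing or agreed-upon API. It also uses the fallback of sorting the non-missing categories first and adding NaN at the end.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def ordinal_encode(values, handle_missing="error"):
    """Toy ordinal encoder; `handle_missing` and its options are hypothetical.

    Returns a (codes, categories) tuple.
    """
    if handle_missing not in ("error", "category", "preserve"):
        raise ValueError(f"Unknown handle_missing option: {handle_missing!r}")

    has_missing = any(is_missing(v) for v in values)
    if has_missing and handle_missing == "error":
        raise ValueError(
            "Input contains missing values (np.nan or None); "
            "choose another handle_missing option or impute them first."
        )

    # Sort only the non-missing categories so that strings mixed with
    # NaN/None do not trigger an "unorderable types" error.
    categories = sorted({v for v in values if not is_missing(v)})
    lookup = {cat: code for code, cat in enumerate(categories)}

    if handle_missing == "category":
        # Missing values get one additional integer after the sorted categories.
        missing_code = len(categories)
        if has_missing:
            categories = categories + [math.nan]
    else:
        # "preserve" keeps NaN in the output so it can be imputed afterwards;
        # with "error" this value is never used because no value is missing.
        missing_code = math.nan

    codes = [missing_code if is_missing(v) else lookup[v] for v in values]
    return codes, categories


data = ["b", None, "a", math.nan, "b"]
codes, cats = ordinal_encode(data, handle_missing="category")
preserved, _ = ordinal_encode(data, handle_missing="preserve")
print(codes, cats)
print(preserved)
```

With `'category'` the missing entries share one extra code at the end, while with `'preserve'` they stay NaN in the encoded output and could be handed to an imputer afterwards.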
Related to discussions in #2888 and #9012 (comment)
Example notebook on a toy dataframe showing the current problem with missing data in categorical features: http://nbviewer.jupyter.org/gist/jorisvandenbossche/736cead26ab65116ff4de18015b0b324