FEAT: Add dictionary support for categories parameter in OrdinalEncoder #32179
Reference Issues/PRs
Fixes #32152.
What does this implement/fix? Explain your changes.
This PR intends to extend `OrdinalEncoder` to allow a dictionary to be passed to `categories` when the input `X` is a pandas DataFrame.
Why is this relevant?
Currently, most users configure `OrdinalEncoder` by passing a list of array-like objects to `categories` to specify the categories of each column. This is very useful because it allows users to be explicit about the relative order of categories within a column. However, this approach requires users to rely on the order of columns in `X`, which is error-prone and less readable.
Here is a motivating example:
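A minimal sketch of the positional pattern described above, using the existing list-based `categories` API. The DataFrame, its column names, and the category values are hypothetical and chosen only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: column names and values are illustrative only.
X = pd.DataFrame(
    {
        "size": ["small", "large", "medium"],
        "quality": ["good", "bad", "good"],
    }
)

# Categories are matched to columns purely by position:
# the first list applies to X's first column, the second list
# to X's second column. Nothing in the call names the columns.
encoder = OrdinalEncoder(
    categories=[
        ["small", "medium", "large"],  # intended for "size"
        ["bad", "good"],               # intended for "quality"
    ]
)
encoded = encoder.fit_transform(X)
```

The comments are the only link between each list and its column, so reordering the columns of `X` requires reordering the lists by hand.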
Even in this simple example, there is no easy way of knowing to which columns the categories are being applied.
You either look at the DataFrame itself (which is not always possible) or you add comments in the code (which can be misleading).
This coupling between column order and categories is unintuitive and fragile, and does not align with how pandas users think about columns (i.e., by name rather than by position). In real-world pipelines, where datasets can have many columns or where column order is not guaranteed, relying on column order increases developer cognitive load and can introduce silent bugs by applying categories to the wrong columns.
How does this solve the problem?
With this approach, categories can be specified by column name rather than relying on column position, making the code more readable and reducing the risk of column misalignment in complex pipelines.
Example with improved API:
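The dictionary form is what this PR proposes, so it is not assumed to be available in any released scikit-learn version. The sketch below instead emulates the idea with a hypothetical helper, `categories_from_dict` (not part of scikit-learn), which reorders a name-keyed mapping into the positional list the existing API accepts. The data is illustrative, and the `ValueError` for missing columns is an assumption rather than documented behavior of the PR:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder


def categories_from_dict(X, categories):
    """Hypothetical helper (not part of scikit-learn): turn a
    {column name: ordered categories} mapping into the positional
    list that OrdinalEncoder currently accepts."""
    if not isinstance(X, pd.DataFrame):
        # Without column names the mapping would be ambiguous.
        raise TypeError("dict categories require a pandas DataFrame")
    missing = set(X.columns) - set(categories)
    if missing:
        # Assumption: every column must have an entry in the mapping.
        raise ValueError(f"no categories given for columns: {sorted(missing)}")
    # Follow the DataFrame's own column order, not the dict's order.
    return [list(categories[col]) for col in X.columns]


# Hypothetical data; note the columns are not in the dict's order.
X = pd.DataFrame(
    {
        "quality": ["good", "bad", "good"],
        "size": ["small", "large", "medium"],
    }
)

# Categories are keyed by column name, so column order no longer matters.
categories = {
    "size": ["small", "medium", "large"],
    "quality": ["bad", "good"],
}

encoder = OrdinalEncoder(categories=categories_from_dict(X, categories))
encoded = encoder.fit_transform(X)
```

Because the helper looks categories up by column name, the same `categories` mapping stays valid if the columns of `X` are reordered.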
Any other comments?
Passing a `dict` to `categories` is only enabled when `X` is a `pandas.DataFrame`; otherwise a `TypeError` is raised to avoid ambiguous behavior.
Draft task list
`whats_new` entry