
joaosferreira

Reference Issues/PRs

Fixes #32152.

What does this implement/fix? Explain your changes.

This PR intends to extend OrdinalEncoder to allow a dictionary to be passed to categories when the input X is a pandas DataFrame.

Why is this relevant?

Currently, most users specify the categories of each column by passing a list of array-like objects to categories. This is very useful because it allows users to be explicit about the relative order of categories within a column. However, this approach requires users to rely on the order of columns in X, which is error-prone and less readable.

Here is a motivating example:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({
    "priority": ["medium", "medium", "high"],
    "size": ["small", "large", "medium"],
})

# List approach relies on column order
enc = OrdinalEncoder(categories=[
    ["low", "medium", "high"],     # for "priority"
    ["small", "medium", "large"],  # for "size"
])

X_trans = enc.fit_transform(X)

Even in this simple example, there is no easy way of knowing which columns the categories are being applied to.
You either look at the DataFrame itself (which is not always possible) or you add comments in the code (which can be misleading).
This coupling between column order and categories is unintuitive and fragile, and does not align with how pandas users think about columns (i.e., names instead of positions). In real-world pipelines, where datasets can have many columns or where column order is not guaranteed, relying on column order increases developer cognitive load and can introduce silent bugs by applying categories to the wrong columns.

How does this solve the problem?

With this approach, categories can be specified by column name rather than relying on column position, making the code more readable and reducing the risk of column misalignment in complex pipelines.

Example with improved API:

enc = OrdinalEncoder(categories={
    "priority": ["low", "medium", "high"],
    "size": ["small", "medium", "large"],
})

# Order of columns in X does not matter
enc.fit_transform(X)
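
For comparison, the name-keyed behavior can be approximated on top of the existing list-based API. The sketch below is not the implementation in this PR: align_categories is a hypothetical helper, and the choice to raise ValueError on mismatched keys is an assumption made for illustration. It reorders a dict of categories to follow a given column order.

```python
def align_categories(columns, categories):
    """Return per-column category lists ordered like `columns`.

    Hypothetical helper for illustration; not part of scikit-learn or this PR.
    """
    columns = list(columns)
    # Assumption: keys must match the columns exactly, otherwise fail loudly.
    missing = [c for c in columns if c not in categories]
    extra = [k for k in categories if k not in columns]
    if missing or extra:
        raise ValueError(
            f"categories keys do not match columns: "
            f"missing={missing}, extra={extra}"
        )
    return [list(categories[c]) for c in columns]


cats = {
    "priority": ["low", "medium", "high"],
    "size": ["small", "medium", "large"],
}

# The column order differs from the dict order; the result follows the columns.
aligned = align_categories(["size", "priority"], cats)
print(aligned)
```

With a DataFrame X, passing align_categories(X.columns, cats) as categories to the current OrdinalEncoder should give the same name-based alignment without depending on column position.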

Any other comments?

  • This PR introduces the feature in a backward-compatible way: existing behavior with lists should remain unchanged for all input types of X.
  • Support for dict is only enabled when X is a pandas.DataFrame; otherwise a TypeError is raised to avoid ambiguous behavior.
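
The two points above can be sketched as a small validation step. This is only an illustration under assumptions: check_categories_type is a hypothetical name, and testing for a columns attribute stands in for a real pandas DataFrame check so that the snippet runs without pandas. The actual check in this PR may differ.

```python
def check_categories_type(X, categories):
    """Reject dict categories when X carries no column names.

    Hypothetical sketch; the attribute test is a stand-in for
    an actual pandas DataFrame check.
    """
    if isinstance(categories, dict) and not hasattr(X, "columns"):
        raise TypeError(
            "categories given as a dict require X to be a pandas DataFrame; "
            f"got {type(X).__name__}"
        )
    # Lists (and any other value) pass through unchanged: backward compatible.
    return categories


X_plain = [["medium", "small"], ["high", "large"]]  # no column names
list_cats = [["low", "medium", "high"], ["small", "medium", "large"]]
dict_cats = {"priority": ["low", "medium", "high"]}

# List categories keep working for any input type.
same = check_categories_type(X_plain, list_cats)

# Dict categories with unnamed input are rejected.
try:
    check_categories_type(X_plain, dict_cats)
    rejected = False
except TypeError:
    rejected = True
```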

Draft task list

  • Store categories in a way that preserves column name information
  • Update docstring with usage examples
  • Add whats_new entry
  • Add additional tests for edge cases (e.g., missing columns, unknown values)
  • Verify compatibility with pipelines
  • Keep consistency between different encoder classes


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9324667.

Development

Successfully merging this pull request may close these issues.

Allow categories parameter in OrdinalEncoder to accept a dict of column names → categories
1 participant