Pandas Output Proposal Outline #23001

Closed
@thomasjpfan

Description

With get_feature_names_out complete, I am currently reworking the SLEP for pandas output. To reduce the scope, I am thinking of covering only transformers in the SLEP. This issue describes the complete idea for pandas output, which covers all methods that return arrays: transform, predict, etc.

API Prototype

I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:

  • DataFrame output from a Single transformer
  • Column Transformer with DataFrame output
  • Feature selection based on column names with cross validation
  • Using HistGradientBoosting to select categories based on dtype
  • Text preprocessing with sparse data

Proposal

TLDR: The proposal is to add a set_output method to configure the output container. With set_output(transform="pandas"), the output of the estimator's transform is a pandas DataFrame. In #16772, I showed that returning sparse data as pandas output causes a performance regression. To work around this, I propose set_output(transform="frame_or_sparse"), which returns a DataFrame for dense data and a custom SKCSRMatrix for sparse data. SKCSRMatrix is a subclass of csr_matrix, so it will work with existing code.

The rendered notebook shows the API in various use cases.

Future Extensions

These are items that are not in the SLEP.

Future: Predictions

log_reg = LogisticRegression()
log_reg.set_output(
    predict_proba="pandas",
    predict="pandas",
    decision_function="pandas",
)
log_reg.fit(X_df, y)

# classes are the column names
X_pred = log_reg.predict_proba(X_df)

# categorical where classes are the categories
X_pred = log_reg.predict(X_df)

# binary case, series with name=classes_[1]
# multiclass case, dataframe with columns=classes_
X_pred = log_reg.decision_function(X_df)
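The containers described in the comments above can be approximated by hand with the existing array-returning methods. The sketch below is only that: a manual construction, not the proposed set_output behavior. The toy data, the variable names, and the choice to reuse the input index are assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy three-class problem; values and labels are made up.
X_df = pd.DataFrame(
    {"x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0], "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]}
)
y = pd.Series(["cat", "cat", "dog", "dog", "bird", "bird"])

log_reg = LogisticRegression().fit(X_df, y)

# predict_proba: one column per class, named after classes_.
proba_df = pd.DataFrame(
    log_reg.predict_proba(X_df), columns=log_reg.classes_, index=X_df.index
)

# predict: a categorical Series whose categories are classes_.
pred = pd.Series(
    pd.Categorical(log_reg.predict(X_df), categories=log_reg.classes_),
    index=X_df.index,
)

# decision_function (multiclass case): one column per class.
dec_df = pd.DataFrame(
    log_reg.decision_function(X_df), columns=log_reg.classes_, index=X_df.index
)
```

In the binary case decision_function returns a 1-D array, which is why the proposal describes a Series named after classes_[1] instead of a DataFrame.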

Future: Pipeline with prediction

log_reg = make_pipeline(
    StandardScaler(),  # only uses `transform="pandas"`
    LogisticRegression(),  # only uses `predict="pandas"`
).set_output(
    predict="pandas",
    transform="pandas",
)

log_reg.fit(X_df, y)

# DataFrame
y_pred = log_reg.predict(X_df)

CC @amueller @glemaitre
