Description
With get_feature_names_out complete, I am currently reworking the SLEP for pandas output. I am thinking of only covering transformers in the SLEP to reduce the scope. This issue describes the complete idea for pandas output, which covers all methods that return arrays: transform, predict, etc.
API Prototype
I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:
- DataFrame output from a Single transformer
- Column Transformer with DataFrame output
- Feature selection based on column names with cross validation
- Using HistGradientBoosting to select categories based on dtype
- Text preprocessing with sparse data
Proposal
TLDR: The proposal is to add a set_output method to configure the output container. When set_output(transform="pandas") is called, the output of the estimator is a pandas DataFrame. In #16772, I have shown that sparse data will have a performance regression. To work around this, I propose set_output(transform="frame_or_sparse"), which returns a DataFrame for dense data and a custom SKCSRMatrix for sparse data. SKCSRMatrix is a subclass of csr_matrix, so it will work with previous code.
See the rendered notebook to see the API in various use cases.
Future Extensions
These are items that are not in the SLEP.
Future: Predictions
log_reg = LogisticRegression()
log_reg.set_output(
    predict_proba="pandas",
    predict="pandas",
    decision_function="pandas",
)
log_reg.fit(X_df, y)
# classes are the column names
X_pred = log_reg.predict_proba(X_df)
# categorical where classes are the categories
X_pred = log_reg.predict(X_df)
# binary case, series with name=classes_[1]
# multiclass case, dataframe with columns=classes_
X_pred = log_reg.decision_function(X_df)
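The containers described in the comments above can be emulated by hand today. The following is only a sketch of what the proposed outputs might look like: it wraps the plain NumPy results with pandas instead of calling set_output with predict-style keywords, which this issue proposes but does not implement. The data and the variable names proba_df, pred, and scores_df are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up three-class data for illustration only.
X_df = pd.DataFrame(
    {
        "x0": [0.0, 0.1, 1.0, 1.1, 2.0, 2.1],
        "x1": [1.0, 0.9, 0.0, 0.1, 1.0, 1.1],
    }
)
y = ["a", "a", "b", "b", "c", "c"]

log_reg = LogisticRegression().fit(X_df, y)

# predict_proba: one column per class, labelled with classes_.
proba_df = pd.DataFrame(log_reg.predict_proba(X_df), columns=log_reg.classes_)

# predict: categorical Series whose categories are the classes.
pred = pd.Series(
    pd.Categorical(log_reg.predict(X_df), categories=log_reg.classes_)
)

# decision_function, multiclass case: DataFrame with columns=classes_.
scores_df = pd.DataFrame(
    log_reg.decision_function(X_df), columns=log_reg.classes_
)
```

In the binary case decision_function returns a one-dimensional array, so the equivalent wrapper would be a Series named after classes_[1] rather than a DataFrame.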
Future: Pipeline with prediction
log_reg = make_pipeline(
    StandardScaler(),  # only uses `transform="pandas"`
    LogisticRegression(),  # only uses `predict="pandas"`
).set_output(
    predict="pandas",
    transform="pandas",
)
log_reg.fit(X_df, y)
# DataFrame
y_pred = log_reg.predict(X_df)