Pandas Output Proposal Outline

With `get_feature_names_out` complete, I am reworking the SLEP for pandas output. To reduce the scope, I am thinking of covering only transformers in the SLEP. This issue lays out the complete idea for pandas output, which spans all methods that return arrays: `transform`, `predict`, etc.

## API Prototype

I put together a functional prototype of this API that you can explore in [this colab notebook](https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb). Here is a [rendered version of the demo](https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb). The demo includes the following use cases:

- DataFrame output from a single transformer
- `ColumnTransformer` with DataFrame output
- Feature selection based on column names with cross validation
- Using HistGradientBoosting to select categories based on dtype
- Text preprocessing with sparse data

## Proposal

**TLDR**: The proposal is to add a `set_output` method that configures the output container. After calling `set_output(transform="pandas")`, the output of the estimator's `transform` is a pandas DataFrame. In https://github.com/scikit-learn/scikit-learn/pull/16772, I showed that sparse data suffers a performance regression with pandas output. To work around this, I propose `set_output(transform="frame_or_sparse")`, which returns a DataFrame for dense data and a custom `SKCSRMatrix` for sparse data. `SKCSRMatrix` is a subclass of `csr_matrix`, so it will work with existing code.

See [the rendered notebook](https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb) for the API in various use cases.
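To make the sparse idea concrete, here is a rough sketch of a `csr_matrix` subclass that carries feature names. The class name `NamedCSRMatrix` and the `feature_names` attribute are hypothetical and chosen only for illustration; the prototype's `SKCSRMatrix` may be implemented quite differently.

```python
import numpy as np
from scipy.sparse import csr_matrix


class NamedCSRMatrix(csr_matrix):
    """Hypothetical stand-in for the proposed `SKCSRMatrix`."""

    def __init__(self, *args, feature_names=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Extra metadata stored alongside the sparse data (an assumption
        # of this sketch, not the prototype's actual attribute name).
        self.feature_names = feature_names


X_sparse = NamedCSRMatrix(
    np.array([[1.0, 0.0], [0.0, 2.0]]),
    feature_names=["word_a", "word_b"],
)

# Code that checks for `csr_matrix` keeps working with the subclass.
print(isinstance(X_sparse, csr_matrix))
print(X_sparse.feature_names)
```

Because the subclass is still a `csr_matrix`, downstream code that only relies on the sparse interface does not need to change. Note that in this sketch the metadata is not guaranteed to survive operations such as slicing or arithmetic, which create new matrices.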

## Future Extensions

These items are **not** in the SLEP.

### Future: Predictions

```python
from sklearn.linear_model import LogisticRegression

# Future extension, not part of the SLEP
log_reg = LogisticRegression()
log_reg.set_output(
    predict_proba="pandas",
    predict="pandas",
    decision_function="pandas",
)
log_reg.fit(X_df, y)

# classes are the column names
y_proba = log_reg.predict_proba(X_df)

# categorical where classes are the categories
y_pred = log_reg.predict(X_df)

# binary case: series with name=classes_[1]
# multiclass case: dataframe with columns=classes_
y_score = log_reg.decision_function(X_df)
```

### Future: Pipeline with prediction

```python
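from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

log_reg = make_pipeline(
    StandardScaler(),  # only uses `transform="pandas"`
    LogisticRegression(),  # only uses `predict="pandas"`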
).set_output(
    predict="pandas",
    transform="pandas",
)

log_reg.fit(X_df, y)

# DataFrame
y_pred = log_reg.predict(X_df)
```

CC @amueller @glemaitre 
