Can't provide feature indices for OneHotEncoder in pipeline #8539

amueller · 2017-03-05T21:16:16Z

Let's say I want to apply a transformation only to some features in a pipeline, such as imputation or one-hot-encoding (or scaling, which currently doesn't support this).
I could provide the indices of the columns I want to transform. But if there are any previous steps in the pipeline, they might re-arrange the features in some arbitrary way (like OneHotEncoder does).

Example

import numpy as np

# assume the second feature is categorical and the third is continuous
X = [[np.NaN, np.NaN, 5], [np.NaN, 1, 3], [np.NaN, 1, np.NaN]]

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, OneHotEncoder

pipe = make_pipeline(Imputer(strategy='most_frequent'), OneHotEncoder(categorical_features=[1], sparse=False))

pipe.fit_transform(X)

array([[ 0., 1., 1.],
[ 1., 0., 1.],
[ 1., 0., 1.]])

desired outcome:

array([[ 1., 5.],
[ 1., 3.],
[ 1., 3.]])

Even if each output feature corresponds to exactly one input feature, and we knew which that was, there would be no way to specify this in OneHotEncoder. This might look constructed but is a pretty obvious use-case in which you have per-column meta-data.

The only solution I see is to carry along a column index (or column names) and allow passing that.
Given my experience of .iloc vs .loc in pandas, I'm not entirely happy with the prospect.
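
A minimal NumPy-only sketch of why positional indices go stale, and how carrying names along avoids it. The output in the example above is consistent with the imputer dropping the all-NaN first column, so index 1 ends up pointing at the continuous feature. The feature names and the `keep` mask below are hypothetical stand-ins for illustration, not an existing scikit-learn API.

```python
import numpy as np

X = np.array([[np.nan, np.nan, 5.0],
              [np.nan, 1.0, 3.0],
              [np.nan, 1.0, np.nan]])
names = np.array(["empty", "cat", "cont"])  # hypothetical feature names

# Stand-in for a step that drops columns with no observed values
keep = ~np.isnan(X).all(axis=0)
X_kept = X[:, keep]
names_kept = names[keep]  # names travel with the data

# Positional index 1 meant "cat" before the step, but "cont" after it
before = names[1]
after = names_kept[1]

# Looking the feature up by name still finds the right column
cat_position = int(np.flatnonzero(names_kept == "cat")[0])
print(before, after, cat_position)
```

The lookup by name is the part that a later pipeline step would need; a fixed integer index has no way to follow the column.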

cc @mfeurer

Conceptually somewhat related to #8480 and scikit-learn/enhancement_proposals#5 as they deal with feature meta-data.

[and then I introduced hierarchical indices over columns into scikit-learn.... not]


jnothman · 2017-03-05T21:50:13Z

Yes, sorry I've not got back to that SLEP. But it doesn't really help for this case, where a step needs information about the previous steps of a pipeline. The best solution here is just to some names with data; but specifying categorical_features by callback would at least make this possible, though convoluted.


amueller · 2017-03-05T23:04:00Z

don't apologize, I haven't even read it yet :-/

If pipeline uses clone, you can not specify the callback at construction time, right?

amueller · 2017-03-05T23:05:17Z

> The best solution here is just to some names with data;

Sorry, can't parse

jnothman · 2017-03-05T23:40:15Z

> The best solution here is just to some names with data

some -> pass

> If pipeline uses clone, you can not specify the callback at construction time, right?

Yes, you're right. But it's not about Pipeline using clone, since the problem is already present in the context of cross-validation, where the whole pipeline is cloned, so any object references are invalidated. The best we could do without passing around feature descriptions, then, might be for pipeline to inform its transformers of their context, e.g. by calling my_transformer.inform_context(containing_pipeline, step_name) before fit. Yuck!
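
A small sketch of the cloning point, assuming a current scikit-learn install. The two scalers are arbitrary placeholders; the only thing shown is that `sklearn.base.clone` rebuilds every step, so a reference to an earlier step captured at construction time no longer points into the cloned pipeline.

```python
from sklearn.base import clone
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Placeholder steps; any transformers would do
pipe = make_pipeline(StandardScaler(), MinMaxScaler())

# A reference a later step (or a callback) might hold onto
first_step = pipe.steps[0][1]

# Cross-validation utilities clone the estimator before fitting
cloned = clone(pipe)

# The clone has the same parameters but freshly constructed step objects
same_object = cloned.steps[0][1] is first_step
same_type = type(cloned.steps[0][1]) is type(first_step)
print(same_object, same_type)
```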

amueller · 2017-03-06T22:59:15Z

@janvanrijn pointed out that it also means that pipelines have data-dependent parameters, which is a bit ugly. But I guess we can't get around that unless we detect whether input is categorical using dataframes...

amueller · 2017-03-06T23:01:56Z

I guess once we can handle strings you could cast input columns that should be one-hot-encoded to strings and we could add an option not to encode integers?

amueller mentioned this issue Mar 6, 2017

OneHotEncoder should ignore NaNs outside categorical_features #8540

Closed

janvanrijn mentioned this issue Mar 19, 2017

Imputer to maintain missing collumns #8613

Closed

jorisvandenbossche mentioned this issue Dec 12, 2017

[MRG] Add experimental.ColumnTransformer #9012

Merged

jnothman closed this as completed in #9012 May 29, 2018
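
For later readers, a hedged sketch of how the original example might be written with the ColumnTransformer referenced above. It assumes a recent scikit-learn release, where the class is importable from `sklearn.compose` and `SimpleImputer` lives in `sklearn.impute`; the step names "cat" and "cont" are arbitrary. Each transformer selects its columns from the original input, so no step has to know how an earlier step rearranged the features.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Second feature is categorical, third is continuous, first is all missing
X = np.array([[np.nan, np.nan, 5.0],
              [np.nan, 1.0, 3.0],
              [np.nan, 1.0, np.nan]])

ct = ColumnTransformer(
    [
        # Impute, then one-hot encode, the categorical column only
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                              OneHotEncoder()), [1]),
        # Impute the continuous column only
        ("cont", SimpleImputer(strategy="most_frequent"), [2]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Column 0 is not listed, so the default remainder="drop" discards it
Xt = ct.fit_transform(X)
print(Xt)
```

The indices here always refer to columns of the input passed to the ColumnTransformer, which is what makes them stable.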
