
Potential new subpackage for pipeline and featureunion-like tools #10215


Closed
jnothman opened this issue Nov 28, 2017 · 50 comments

@jnothman
Member

In #9012 and #9041 we have found that it is hard to find a good module to serve as home for ColumnTransformer and TransformedTargetRegressor. We also have the potential to hinge the updating of Pipeline's interface, as per #8350, upon the creation of a new module (deprecating sklearn.pipeline very slowly!).

Firstly, we should discuss whether this is the right thing to do. Secondly, I and others can post some name proposals below, and contributors and spectators who have an opinion can give those comments a thumbs up! (Not that we intend to be democratic about it, but we can get some idea of consensus.)

@jnothman
Member Author

sklearn.workflow

@jnothman
Member Author

sklearn.glue

@jnothman
Member Author

sklearn.scaffolding

@jnothman
Member Author

sklearn.compose

@jnothman
Member Author

sklearn.composition

@amueller
Member

sklearn.flow

@jnothman
Member Author

my main problem with flow is the existence of tensorflow

@jnothman
Member Author

sklearn.stack

@jnothman
Member Author

sklearn.connectors

@jnothman
Member Author

sklearn.connect

@amueller
Member

Hm, so I think putting Pipeline and FeatureUnion in it for the deprecation is a good thing, and possibly the least controversial part of this. Then there is the question of whether we want ColumnTransformer and TargetTransformer in it. If it's basically a rename of pipeline, I would expect @GaelVaroquaux to argue against ColumnTransformer. I think it would be good to have it there.

I guess TargetTransformer also makes sense. Given its meta-estimator nature, it seems to belong here more than in preprocessing.

@amueller
Member

Oh and as I said in the other thread, I think a slow deprecation would be good, but I would not use 1.0 as a target.

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 28, 2017 via email

@jnothman
Member Author

The thing about ColumnTransformer is that it doesn't do any interesting work itself. And I think users will realise that, and become familiar with the idea that they need glue to differentiate feature extraction and preprocessing by column, rather than a processor itself.

@amueller
Member

@GaelVaroquaux I think you should consider that people working with columnar data don't consider themselves people working on columnar data, and have probably never heard that term. Having homogeneous data is the special case for most people that I ever talk to, and people (in particular coming from R or weka) are very surprised that estimators don't handle "columnar data". I think our API choices there make a lot of sense, but I feel that casting this kind of preprocessing as feature extraction will seem very alien to them.

@amueller
Member

@GaelVaroquaux @agramfort do you wanna vote? Should I unleash hell and tweet the issue? ;) And maybe most importantly, do we want to stall #9012 for this (I don't think so)

@agramfort
Member

I like the idea of starting a small contrib package for 'advanced' pipelines (I also have in mind transforming X, y with a fit that changes the sample size), feature unions, etc. Given how long we have talked about this, I think that can be a good way to move forward without the fear of full-project overhead.

@amueller
Member

amueller commented Dec 12, 2017

Well, fit that changes the sample size already exists in imblearn, and I haven't really seen another application of changing y. The ColumnTransformer is ready and waiting only for a module to live in.
I really don't want to teach again without being able to do proper cross-validation on titanic or adult.
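For reference, a minimal sketch of the kind of sample-size-changing step imblearn already supports (assuming imbalanced-learn's make_pipeline and RandomUnderSampler, and pre-loaded X_train / y_train / X_test):

# sketch: imblearn's Pipeline allows steps that resample (X, y) during fit;
# the sampler is skipped at predict time, so test data is left untouched
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

model = make_pipeline(RandomUnderSampler(), LogisticRegression())
model.fit(X_train, y_train)     # X_train, y_train are resampled before the classifier is fit
y_pred = model.predict(X_test)  # no resampling here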

@jnothman
Member Author

I think we must have ColumnTransformer in sklearn, not some other package. But yes, perhaps trialling a more conformant Pipeline can happen outside sklearn. I also think that imblearn's Pipeline model is basically good enough and we should adopt it here, but that's sort of off topic.

@agramfort
Member

agramfort commented Dec 12, 2017 via email

@amueller
Member

amueller commented Dec 12, 2017

Imputing missing values on categorical and continuous variables without custom code. And/or not scaling one-hot-encoded variables. And ideally one-hot-encoding variables within the cross-validation, but that's not 100% necessary.

It's possible, but you need a custom column selector. Basically, most reasonable workflows I can think of need to treat categorical and continuous variables differently during preprocessing. How do you do that right now?

@amueller
Member

amueller commented Dec 12, 2017

With ColumnTransformer I can do something like (untested)

# X is assumed to be a pandas DataFrame; select columns by dtype
# (imports omitted: where make_column_transformer would live is what this issue is about)
categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(Imputer(strategy='median'), StandardScaler()), ~categorical),
    (make_pipeline(Imputer(strategy='most_frequent'), CategoricalEncoder()), categorical))
model = make_pipeline(preprocess, LogisticRegression())

That's the simplest workflow I can think of that works with "columnar" data, i.e. most of the things my students ever see.

@amueller
Member

amueller commented Dec 12, 2017

The other common case that's annoying right now is when training and test data come from different sources (like different CSV files): you need to convert to pandas Categorical and store the categories (because OneHotEncoder can't deal with strings, and CategoricalEncoder requires ColumnTransformer).
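Roughly, that workaround looks like the sketch below (with a hypothetical 'city' column; the point is that the categories seen in the training CSV have to be stored and reapplied to the test CSV so the encoded columns line up):

import pandas as pd

train = pd.read_csv('train.csv')   # hypothetical file names
test = pd.read_csv('test.csv')

# remember the categories seen during training ...
categories = train['city'].astype('category').cat.categories
# ... and force both frames onto the same category set before encoding,
# so the dummy columns of train and test are aligned
train['city'] = pd.Categorical(train['city'], categories=categories)
test['city'] = pd.Categorical(test['city'], categories=categories)
X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)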

@amueller
Member

If you allow the imputation of categoricals to be a continuous relaxation and are fine with applying a StandardScaler to them, you can simplify my pipeline. But that doesn't seem very natural or didactic. Is that what you're doing?

@amueller
Member

Also see #8540

@agramfort
Member

Thanks @amueller for clarifying the use cases. I've never had the use case of train/test data from two different CSVs. For teaching on titanic I tend to bypass the problem of column-specific scaling, as I make them use linear models with float columns and then I make them use trees...

I would put ColumnTransformer / make_column_transformer in preprocessing for now.

@jnothman
Member Author

It just dawned on me that the thing all these estimators have in common is that they're about transformers, and that they're meta-estimators. sklearn.transform_ensemble would be perfect were it not so verbose.

@jnothman
Member Author

sklearn.transform

@jnothman
Member Author

sklearn.combine

@jorisvandenbossche
Member

> the thing all these estimators have in common is that they're about transformers, and that they're meta-estimators

Or you could also phrase it as: they're all estimators composed of different transformers / estimators -> sklearn.compose

@jorisvandenbossche
Member

But 'combine' is maybe an easier way to say the same as 'compose'

@jnothman
Member Author

(combine could be confused with ensemble, whereas compose less so)

@amueller
Member

I like compose.

@jorisvandenbossche
Member

@amueller if you vote for that above, then it is leading in the votes :-)

@amueller
Member

hm I guess on the first read I didn't like it? I didn't remember it being up there lol.

@amueller
Member

Maybe @ogrisel wants to vote?

@jnothman
Member Author

jnothman commented Dec 13, 2017 via email

@amueller
Member

I'm willing to do that... but you take the responsibility ;)

@jorisvandenbossche
Member

Any more comments, or somebody who will take a decision?
This is a blocker for the ColumnTransformer (#9012), unless we decide to put it in one of the existing modules for now and move it later if we decide on this one.

@jnothman
Member Author

Well, I get the impression people like sklearn.compose, conditioned on the assumption that we're doing this.
@GaelVaroquaux remains rightly concerned that users won't know where to find what they need, but I think that's already true for something like FeatureUnion, and search engines / Stack Overflow help a lot.

How do we define sklearn.compose? Home to meta-estimators for composing transformers together and with other estimators, including being the future home of Pipeline and FeatureUnion versions that do not alter their input parameters (in the fabled version 1.0).
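For concreteness, a sketch of how such a module could be used, assuming the ColumnTransformer and TransformedTargetRegressor APIs proposed in #9012 / #9041 (column names are hypothetical):

import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# scale only the numeric columns, pass the rest through untouched
preprocess = ColumnTransformer(
    [('num', StandardScaler(), ['age', 'fare'])],
    remainder='passthrough')
# fit the regressor on log-transformed targets, predict back on the original scale
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, Ridge()),
    func=np.log1p, inverse_func=np.expm1)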

@jnothman
Member Author

I'd like to see this go ahead, if only to break an impasse.

@GaelVaroquaux
Member

I hope that I didn't give the impression that I was opposed to "compose". It's my favorite name.

I am a bit unhappy to have the impression that we are changing things for cosmetic reasons. That's terrible for long-term users (right now, for a review, I am revamping code that I wrote 5 years ago). It's the error that Python made in the transition from 2 to 3.

But if there is a general idea that pipeline must die, let's call the replacement compose, and move forward.

@jorisvandenbossche
Member

> But if there is a general idea that pipeline must die, let's call the replacement compose, and move forward.

I think it only makes sense if there is agreement that the current Pipeline API should change (cloning its steps) and if we think that changing it in place in pipeline.Pipeline is too disruptive a change.

> including being the future home of Pipeline and FeatureUnion versions that do not alter their input parameters (in the fabled version 1.0)

I don't think we need to wait to put Pipeline and FeatureUnion in compose? If we add this module now, we can add an adapted version of Pipeline/FeatureUnion there directly as well?

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 16, 2018 via email

@jnothman
Member Author

jnothman commented Jan 16, 2018 via email

@jnothman
Member Author

Is it worth organising some kind of live meeting to resolve the issues here?

@amueller
Member

amueller commented Feb 6, 2018

Sorry, I'm not monitoring the issue tracker. If you want to meet, I can be reached by email.
It sounds to me like there is agreement on what to do, with the only thing that's left open being:

> But issues remain about how to construct a Pipeline from pre-fitted components

That's the freezing estimators question, right?

@jnothman
Member Author

jnothman commented Feb 6, 2018 via email

@amueller
Member

should FunctionTransformer go there as well?

@jnothman
Member Author

jnothman commented Feb 20, 2018 via email
