Potential new subpackage for pipeline and featureunion-like tools #10215
my main problem with 'flow' is the existence of TensorFlow
Hm, so I think putting Pipeline and FeatureUnion in it for the deprecation is a good thing, and possibly the least controversial part of it. Then there is the question of whether we want, I guess,
Oh, and as I said in the other thread, I think a slow deprecation would be good, but I would not use 1.0 as a target.
The big question with regards to ColumnTransformer is how we make it natural for people who have tabular data to look there. I think that the logic to put ColumnTransformer there is one of developers, but not one of users.
The thing about ColumnTransformer is that it doesn't do any interesting work itself. And I think users will realise that, and become familiar with the concept that they need glue to differentiate feature extraction and preprocessing by column, rather than a processor itself.
@GaelVaroquaux I think you should consider that people working with columnar data don't think of themselves as working on "columnar data", and have probably never heard that term. Having homogeneous data is the special case for most people that I ever talk to, and people (in particular those coming from R or Weka) are very surprised that estimators don't handle "columnar data". I think our API choices there make a lot of sense, but I feel that casting this kind of preprocessing as feature extraction will seem very alien to them.
@GaelVaroquaux @agramfort do you wanna vote? Should I unleash hell and tweet the issue? ;) And maybe most importantly, do we want to stall #9012 for this? (I don't think so.)
I like the idea of starting a small contrib package for 'advanced' pipelines (I also have in mind the transform of X, y with a fit that changes sample size), feature unions, etc. Given how long we have talked about this, I think that can be a good way to move forward without fear of full-project overhead.
Well, fit that changes sample size already exists in imblearn, and I haven't really seen another application of changing y. The …
I think we must have …
@amueller can you clarify why you say that you cannot "do proper cross-validation on titanic"? Do you have a gist of how easy you would like things to be? Disclaimer: I use titanic for teaching all the time.
Imputing missing values on categorical and continuous variables without custom code. And/or not scaling one-hot-encoded variables. And ideally one-hot-encoding variables within the cross-validation, but that's not 100% necessary. It's possible, but you need a custom column selector. Basically, most reasonable workflows I can think of need to treat categorical and continuous variables differently during preprocessing. How do you do that right now?
With ColumnTransformer I can do something like (untested):

```python
categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(Imputer(strategy='median'), StandardScaler()), ~categorical),
    (make_pipeline(Imputer(strategy='most_frequent'), CategoricalEncoder()), categorical))
model = make_pipeline(preprocess, LogisticRegression())
```

That's the simplest workflow I can think of that works with "columnar" data, i.e. most of the things my students ever see.
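For readers trying this on a released scikit-learn: the names in the sketch above predate the 0.20 release, where Imputer became SimpleImputer and CategoricalEncoder's functionality was folded into OneHotEncoder. A runnable version of the same workflow, with an invented toy DataFrame standing in for titanic-like data, might look like:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy columnar data: one numeric and one categorical column, both with missing values.
X = pd.DataFrame({"age": [22.0, np.nan, 35.0, 58.0],
                  "port": ["S", "C", np.nan, "S"]})
y = [0, 1, 1, 0]

categorical = X.dtypes == object
preprocess = make_column_transformer(
    # Numeric columns: impute with the median, then scale.
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     list(X.columns[~categorical])),
    # Categorical columns: impute with the mode, then one-hot encode.
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     list(X.columns[categorical])))
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)
```

Because the imputers and encoder sit inside the pipeline, they are re-fit on each training fold during cross-validation, which is exactly the leakage-free behaviour the comment is asking for.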
The other common case that's annoying right now is when training and test data come from different sources (like different CSV files): you need to convert to pandas categorical dtype and store the categories (because …
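One common workaround for the two-CSV situation is to let OneHotEncoder (scikit-learn 0.20+) learn the category list from the training frame only and ignore unseen categories at transform time, which sidesteps storing pandas categories by hand. A sketch with invented toy frames:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Pretend these two frames were read from two different CSV files.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})  # "green" never appears in train

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)                            # categories are learned from train only
encoded = enc.transform(test).toarray()   # unseen "green" becomes an all-zero row
```

The fitted `categories_` attribute plays the role of the stored pandas categories: the encoding is consistent across files regardless of which values each file happens to contain.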
If you allow the imputation of categoricals to be a continuous relaxation and are fine with doing a …
Also see #8540
Thanks @amueller for clarifying the use cases. I've never had the use case of train/test data from two different CSVs. For teaching on titanic I tend to bypass the problem of column-specific scaling: I make them use linear models with float columns, and then I make them use trees. I would put ColumnTransformer / make_column_transformer in preprocessing for now.
It just dawned on me that the thing all these estimators have in common is that they're about transformers, and that they're meta-estimators.
Or you could also phrase it as: they're all estimators composed of different transformers / estimators -> …
But 'combine' is maybe an easier way to say the same thing as 'compose'.
( …

I like …
@amueller if you vote for that above, then it is leading in the votes :-)

hm I guess on the first read I didn't like it? I didn't remember it being up there lol.

Maybe @ogrisel wants to vote?
I'm curious to see what happens if you tweet it :P
I'm willing to do that... but you take the responsibility ;)
Any more comments, or will somebody take a decision?
Well, I get the impression people like sklearn.compose, conditioned on the assumption that we're doing this. How do we define sklearn.compose? Home to meta-estimators for composing transformers together and with other estimators, including being the future home of Pipeline and FeatureUnion versions that do not alter their input parameters (in the fabled version 1.0).
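For context, this definition is close to what eventually shipped: scikit-learn 0.20 introduced sklearn.compose, housing ColumnTransformer, make_column_transformer, and TransformedTargetRegressor. A minimal illustration of the latter, on toy data invented here (y is exponential in X, so fitting a linear model on log(y) is exact):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(1.0, 11.0).reshape(-1, 1)
y = np.exp(X.ravel() / 10.0)

# Fits LinearRegression on (X, log(y)); predictions pass back through exp,
# so the user always sees y on its original scale.
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
```

Like the other candidates discussed here, it does no interesting numerical work itself: it is pure glue composing a transformer (of y, in this case) with another estimator.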
I'd like to see this go ahead, if only to break an impasse.

I hope that I didn't give the impression that I was opposed to "compose". It's my favorite name. I am a bit unhappy that I have the impression we are changing things for cosmetic reasons. That's terrible for long-term users (right now, for a review, I am revamping code that I wrote 5 years ago). It's the error that Python made in the transition from 2 to 3. But if there is a general idea that pipeline must die, let's call the replacement compose and move forward.
I think it only makes sense if there is agreement that the current Pipeline API should change (cloning its steps), and if we think that changing it in place in pipeline.Pipeline is a too disruptive change. I don't think we need to wait with putting Pipeline and FeatureUnion in compose? If we add this module now, we can add an adapted version of Pipeline/FeatureUnion there directly as well?

> I think it only makes sense if there is agreement that the current Pipeline API should change (cloning its steps)

I feel like that.

> and if we think that changing it in place in pipeline.Pipeline is a too disruptive change.

That's probably true.

> I don't think we need to wait with putting Pipeline and FeatureUnion in compose? If we add this module now, we can add an adapted version of Pipeline/FeatureUnion there directly as well?

Yes: optimistic merge!
I'm fine with reviewing @glemaitre's cloning pipeline for the new module. But issues remain about how to construct a Pipeline from pre-fitted components, or as a subsequence of a previous Pipeline. This needs to be done before we can expect or encourage wide adoption.

@GaelVaroquaux, would it be better to put all the new stuff in sklearn.pipeline? You're right, the reasons not to are mostly cosmetic. But at least we're not moving things without breaking changes.
Is it worth organising some kind of live meeting to resolve the issues here?
Sorry I'm not monitoring the issue tracker. If you want to meet, I can be reached by email.
That's the freezing estimators question, right?
It's not the freezing estimators thing necessarily. In some cases you just need an interface to get a subset of an existing pipeline. Freezing estimators would handle this case in a convoluted way, and would also work in a setting where the pipeline is cloned.
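The "subset of an existing pipeline" interface discussed here later materialized as Pipeline slicing (scikit-learn 0.21+). A sketch on invented toy data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Slicing returns a new Pipeline over a subsequence of the steps; the steps
# themselves are the same (already fitted) estimator objects, not clones.
preprocessing_only = pipe[:-1]
Xt = preprocessing_only.transform(X)
```

Because the slice shares the fitted step objects, it gives a usable sub-pipeline without any freezing machinery, which is the simple case the comment describes.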
should FunctionTransformer go there as well?
I think FunctionTransformer would be quite natural there, but I'm not sure if it's an unhelpful cosmetic change.

On 21 February 2018, Andreas Mueller wrote:
> should FunctionTransformer go there as well?
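FunctionTransformer ultimately stayed in sklearn.preprocessing rather than moving. A minimal sketch of why it composes naturally with the tools discussed here: it wraps plain functions as a stateless transformer that can sit inside a Pipeline or ColumnTransformer.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap log1p/expm1 as a transformer with a proper inverse_transform.
log_transform = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X = np.array([[0.0, 9.0], [99.0, 0.0]])
Xt = log_transform.fit_transform(X)            # elementwise log(1 + x)
X_back = log_transform.inverse_transform(Xt)   # recovers the original values
```

Since fit learns nothing, the only thing FunctionTransformer contributes is API compatibility, which is precisely the "glue, not a processor" quality this thread attributes to the compose candidates.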
In #9012 and #9041 we have found it hard to find a good module home for ColumnTransformer and TransformedTargetRegressor. We also have the potential to hinge the updating of Pipeline's interface as per #8350 on the creation of a new module (deprecating sklearn.pipeline very slowly!).

Firstly, we should discuss whether this is the right thing to do. Secondly, I and others can create some proposals for names below, and the comments can get thumbs-ups from contributors and spectators who have an opinion! (Not that we intend to be democratic about it, but we can get some idea of consensus.)