Potential new subpackage for pipeline and featureunion-like tools #10215
my main problem with 'flow' is the existence of TensorFlow
Hm, so I think putting Pipeline and FeatureUnion in it for the deprecation is a good thing, and possibly the least controversial part of it. Then there is the question of whether we want, I guess,
Oh, and as I said in the other thread, I think a slow deprecation would be good, but I would not use 1.0 as a target.
The big question with regards to ColumnTransformer is how we make it natural for people who have tabular data to look there. I think that the logic to put ColumnTransformer there is one of developers, but not one of users.
The thing about ColumnTransformer is that it doesn't do any interesting work itself. And I think users will realise that, and become familiar with the concept that they need glue to differentiate feature extraction and preprocessing by column, rather than a processor itself.
@GaelVaroquaux I think you should consider that people working with columnar data don't think of themselves as working on "columnar data", and have probably never heard that term. Having homogeneous data is the special case for most people that I ever talk to, and people (in particular those coming from R or Weka) are very surprised that estimators don't handle "columnar data". I think our API choices there make a lot of sense, but I feel that casting this kind of preprocessing as feature extraction will seem very alien to them.
@GaelVaroquaux @agramfort do you wanna vote? Should I unleash hell and tweet the issue? ;) And maybe most importantly, do we want to stall #9012 for this? (I don't think so.)
I like the idea of starting a small contrib package for 'advanced' pipelines (I also have in mind the transform of X, y with a fit that changes sample size), feature unions, etc. Given how long we have talked about this, I think that can be a good way to move forward without fear of full-project overhead.
Well, fit that changes sample size already exists in imblearn, and I haven't really seen another application of changing y. The …
I think we must have …
@amueller can you clarify why you say that you cannot "do proper cross-validation on titanic"? Do you have a gist of how easy you would like things to be? Disclaimer: I use titanic for teaching all the time.
Imputing missing values on categorical and continuous variables without custom code. And/or not scaling one-hot-encoded variables. And ideally one-hot-encoding variables within the cross-validation, but that's not 100% necessary. It's possible, but you need a custom column selector. Basically, most reasonable workflows I can think of need to treat categorical and continuous variables differently during preprocessing. How do you do that right now?
With ColumnTransformer I can do something like (untested):

```python
categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(Imputer(strategy='median'), StandardScaler()), ~categorical),
    (make_pipeline(Imputer(strategy='most_frequent'), CategoricalEncoder()), categorical))
model = make_pipeline(preprocess, LogisticRegression())
```

That's the simplest workflow I can think of that works with "columnar" data, i.e. most of the things my students ever see.
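For readers trying this on a released scikit-learn: the names in the sketch above predate the 0.20 release, where Imputer became SimpleImputer and CategoricalEncoder's functionality was folded into OneHotEncoder. A runnable version of the same workflow, with an invented toy DataFrame standing in for titanic-like data, might look like:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy columnar data: one numeric and one categorical column, both with missing values.
X = pd.DataFrame({"age": [22.0, np.nan, 35.0, 58.0],
                  "port": ["S", "C", np.nan, "S"]})
y = [0, 1, 1, 0]

categorical = X.dtypes == object
preprocess = make_column_transformer(
    # Numeric columns: impute with the median, then scale.
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     list(X.columns[~categorical])),
    # Categorical columns: impute with the mode, then one-hot encode.
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     list(X.columns[categorical])))
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)
```

Because the imputers and encoder sit inside the pipeline, they are re-fit on each training fold during cross-validation, which is exactly the leakage-free behaviour the comment is asking for.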
The other common case that's annoying right now is when training and test data come from different sources (like different CSV files): you need to convert to pandas categorical dtype and store the categories (because …
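One common workaround for the two-CSV situation is to let OneHotEncoder (scikit-learn 0.20+) learn the category list from the training frame only and ignore unseen categories at transform time, which sidesteps storing pandas categories by hand. A sketch with invented toy frames:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Pretend these two frames were read from two different CSV files.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})  # "green" never appears in train

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)                            # categories are learned from train only
encoded = enc.transform(test).toarray()   # unseen "green" becomes an all-zero row
```

The fitted `categories_` attribute plays the role of the stored pandas categories: the encoding is consistent across files regardless of which values each file happens to contain.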
If you allow the imputation of categoricals to be a continuous relaxation and are fine with doing a …
Also see #8540
Thanks @amueller for clarifying the use cases. I've never had the use case of train/test data from two different CSVs. For teaching on titanic I tend to bypass the problem of column-specific scaling: I make them use linear models with float columns, and then I make them use trees. I would put ColumnTransformer / make_column_transformer in preprocessing for now.
It just dawned on me that the thing all these estimators have in common is that they're about transformers, and that they're meta-estimators.
Or you could also phrase it as: they're all estimators composed of different transformers / estimators -> …
But 'combine' is maybe an easier way to say the same thing as 'compose'.
( …

I like …
@amueller if you vote for that above, then it is leading in the votes :-)

hm I guess on the first read I didn't like it? I didn't remember it being up there lol.

Maybe @ogrisel wants to vote?
I'm curious to see what happens if you tweet it :P
I'm willing to do that... but you take the responsibility ;)
Any more comments, or will somebody take a decision?
Well, I get the impression people like sklearn.compose, conditioned on the assumption that we're doing this. How do we define sklearn.compose? Home to meta-estimators for composing transformers together and with other estimators, including being the future home of Pipeline and FeatureUnion versions that do not alter their input parameters (in the fabled version 1.0).
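For context, this definition is close to what eventually shipped: scikit-learn 0.20 introduced sklearn.compose, housing ColumnTransformer, make_column_transformer, and TransformedTargetRegressor. A minimal illustration of the latter, on toy data invented here (y is exponential in X, so fitting a linear model on log(y) is exact):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(1.0, 11.0).reshape(-1, 1)
y = np.exp(X.ravel() / 10.0)

# Fits LinearRegression on (X, log(y)); predictions pass back through exp,
# so the user always sees y on its original scale.
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
```

Like the other candidates discussed here, it does no interesting numerical work itself: it is pure glue composing a transformer (of y, in this case) with another estimator.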
I'd like to see this go ahead, if only to break an impasse.

I hope that I didn't give the impression that I was opposed to "compose". It's my favorite name. I am a bit unhappy that I have the impression we are changing things for cosmetic reasons. That's terrible for long-term users (right now, for a review, I am revamping code that I wrote 5 years ago). It's the error that Python made in the transition from 2 to 3. But if there is a general idea that pipeline must die, let's call the replacement compose and move forward.
I think it only makes sense if there is agreement that the current Pipeline API should change (cloning its steps), and if we think that changing it in place in pipeline.Pipeline is a too disruptive change. I don't think we need to wait with putting Pipeline and FeatureUnion in compose? If we add this module now, we can add an adapted version of Pipeline/FeatureUnion there directly as well?

> I think it only makes sense if there is agreement that the current Pipeline API should change (cloning its steps)

I feel like that.

> and if we think that changing it in place in pipeline.Pipeline is a too disruptive change.

That's probably true.

> I don't think we need to wait with putting Pipeline and FeatureUnion in compose? If we add this module now, we can add an adapted version of Pipeline/FeatureUnion there directly as well?

Yes: optimistic merge!
I'm fine with reviewing @glemaitre's cloning pipeline for the new module. But issues remain about how to construct a Pipeline from pre-fitted components, or as a subsequence of a previous Pipeline. This needs to be done before we can expect or encourage wide adoption.

@GaelVaroquaux, would it be better to put all the new stuff in sklearn.pipeline? You're right, the reasons not to are mostly cosmetic. But at least we're not moving things without breaking changes.
Is it worth organising some kind of live meeting to resolve the issues here?
Sorry I'm not monitoring the issue tracker. If you want to meet, I can be reached by email.
That's the freezing estimators question, right?
It's not the freezing estimators thing necessarily. In some cases you just need an interface to get a subset of an existing pipeline. Freezing estimators would handle this case in a convoluted way, and would also work in a setting where the pipeline is cloned.
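The "subset of an existing pipeline" interface discussed here later materialized as Pipeline slicing (scikit-learn 0.21+). A sketch on invented toy data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Slicing returns a new Pipeline over a subsequence of the steps; the steps
# themselves are the same (already fitted) estimator objects, not clones.
preprocessing_only = pipe[:-1]
Xt = preprocessing_only.transform(X)
```

Because the slice shares the fitted step objects, it gives a usable sub-pipeline without any freezing machinery, which is the simple case the comment describes.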
should FunctionTransformer go there as well?
I think FunctionTransformer would be quite natural there, but I'm not sure if it's an unhelpful cosmetic change.

On 21 February 2018, Andreas Mueller wrote:
> should FunctionTransformer go there as well?
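FunctionTransformer ultimately stayed in sklearn.preprocessing rather than moving. A minimal sketch of why it composes naturally with the tools discussed here: it wraps plain functions as a stateless transformer that can sit inside a Pipeline or ColumnTransformer.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap log1p/expm1 as a transformer with a proper inverse_transform.
log_transform = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X = np.array([[0.0, 9.0], [99.0, 0.0]])
Xt = log_transform.fit_transform(X)            # elementwise log(1 + x)
X_back = log_transform.inverse_transform(Xt)   # recovers the original values
```

Since fit learns nothing, the only thing FunctionTransformer contributes is API compatibility, which is precisely the "glue, not a processor" quality this thread attributes to the compose candidates.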
In #9012 and #9041 we have found it hard to find a good module home for ColumnTransformer and TransformedTargetRegressor. We also have the potential to hinge the updating of Pipeline's interface as per #8350 on the creation of a new module (deprecating sklearn.pipeline very slowly!).

Firstly, we should discuss whether this is the right thing to do. Secondly, I and others can create some proposals for names below, and the comments can get thumbs-ups from contributors and spectators who have an opinion! (Not that we intend to be democratic about it, but we can get some idea of consensus.)