fit_params in conjunction with FeatureUnion #7136

Closed
BenjaminBossan opened this issue Aug 4, 2016 · 21 comments

@BenjaminBossan
Contributor

Description

Using fit_params in conjunction with FeatureUnion may not work as expected. It would be helpful if fit_params names were resolved to match the estimators in transformer_list.

Steps/Code to Reproduce

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.datasets import make_classification

X, y = make_classification()

class MyTransformer(FunctionTransformer):
    def fit(self, X, y=None, **fit_params):
        print("Fit params are: ", fit_params)
        return super().fit(X, y)

pipe = Pipeline([
    ('step0', FunctionTransformer()),
    ('step1', FeatureUnion([
        ('feature0', MyTransformer()),
        ('feature1', MyTransformer()),
    ])),
    ('step2', FunctionTransformer()),
])

pipe.fit(X, y, step1__feature1__someparam=123)

Expected Results

# prints
Fit params are:  {'someparam': 123}

Actual Results

# prints
Fit params are:  {'feature1__someparam': 123}  # by feature0
Fit params are:  {'feature1__someparam': 123}  # by feature1

Comment

Maybe the actual outcome is what it is supposed to be, but my expectation would be that FeatureUnion resolves the estimator name and only passes the fit_params to the corresponding estimator. At least it would be very useful if it did.

Versions

Python 3.4.5
NumPy 1.10.4
SciPy 0.17.0
Scikit-Learn 0.17.1

@amueller
Member

amueller commented Aug 4, 2016

What exactly are you trying to do?
Usually fit_params is only sample_weight, which is usually the same across transformers.
Though one might argue that the current version is not very useful, as it will raise errors in most cases (not all transformers support the same fit_params).
It's a bit tricky to change this now from a backward-compatibility perspective, though we could check whether there is a __ in the parameter name.

It looks like the pipeline acts the way you expect, and I think they should act the same way...

@amueller amueller added the API label Aug 4, 2016
@BenjaminBossan
Contributor Author

Thanks for the reply. Pipelines work as expected, but intuitively, I would have thought that FeatureUnions would work the same way.

My goal would be to elicit specific behavior in the transformer "feature1" but not "feature0" for some custom transformers I built. However, since the proposed change could indeed break existing code, I will probably just subclass FeatureUnion.

I brought this up because I was not sure whether the current behavior was intended. If you'd like to change it, I could add my implementation as a PR once it's finished; otherwise I'll close this issue.

@amueller
Member

amueller commented Aug 4, 2016

I think we should change it, but in the most backward-compatible way. So if someone passes parameters that don't have the proper estimator names in them, they should be passed to all models. It might be a bit tricky.

Generally, you should only use fit parameters for things that have shape n_samples. Anything that changes behavior should usually be an __init__ parameter.
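
For illustration, a minimal sketch of what that typically looks like in practice, assuming scikit-learn's usual sample_weight support (my example, not from this issue): a per-sample quantity forwarded to a named Pipeline step with the step__param syntax.

# A minimal sketch of the intended use of fit_params: per-sample data such as
# sample_weight, routed to a named step via the step__param syntax that
# Pipeline already supports.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
weights = np.ones(len(X))  # shape (n_samples,), i.e. one value per sample

pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
# Pipeline strips the 'clf__' prefix and forwards sample_weight to the final step.
pipe.fit(X, y, clf__sample_weight=weights)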

@BenjaminBossan
Contributor Author

Okay, with that functionality, it will be a little tricky. If you pass step1__feature1__someparam=123, it should only be forwarded to feature1, but if you pass step1__feature2__someparam=123 (where feature2 does not exist), it should be passed to both feature0 and feature1. Did I get that right?

Thanks for the explanation regarding n_samples. I had not actually found any documentation on the scope of fit_params.

@amueller
Member

amueller commented Aug 4, 2016

@BenjaminBossan do you want to add something to the dev docs or the roll your own estimator docs about when to use fit_params?

So let's leave out the step1 prefix, as that is handled by Pipeline and FeatureUnion never sees it.

If you pass feature1__someparam=123 I'd pass it to feature1 but if you pass someparam=123 I'd pass it to all.
The more tricky case is that if you pass feature3__someparam=123 and feature3 doesn't exist in the FeatureUnion, we should probably pass it to all (as the feature union might contain another pipeline).
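
A rough sketch of that routing rule (hypothetical helper, not existing scikit-learn code; split_fit_params and transformer_names are made-up names):

# Hypothetical sketch of the routing rule described above; `transformer_names`
# would come from FeatureUnion.transformer_list. Not actual scikit-learn code.
def split_fit_params(transformer_names, fit_params):
    routed = {name: {} for name in transformer_names}
    for key, value in fit_params.items():
        prefix, sep, rest = key.partition('__')
        if sep and prefix in transformer_names:
            # e.g. 'feature1__someparam' -> only feature1 receives 'someparam'
            routed[prefix][rest] = value
        else:
            # a plain 'someparam', or an unknown prefix like 'feature3__someparam',
            # is broadcast to every transformer (it might be a nested Pipeline)
            for params in routed.values():
                params[key] = value
    return routed

# split_fit_params(['feature0', 'feature1'], {'feature1__someparam': 123})
# -> {'feature0': {}, 'feature1': {'someparam': 123}}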

@BenjaminBossan
Contributor Author

@BenjaminBossan do you want to add something to the dev docs or the roll your own estimator docs about when to use fit_params?

I could, but a few hours ago I did not even know this about fit_params, so maybe someone more knowledgeable should do it ;)

The more tricky case is that if you pass feature3__someparam=123 and feature3 doesn't exist in the FeatureUnion, we should probably pass it to all (as the feature union might contain another pipeline).

Correct me if I'm wrong, but would the same scenario, with Pipeline, not raise a KeyError because it would split on __ and then fail to look up the name in fit_params_steps?
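
For reference, a rough paraphrase (from memory, not verbatim scikit-learn source) of how Pipeline splits fit_params by step name, which is why an unknown step name fails:

# Rough paraphrase of Pipeline's fit_params splitting at the time (not verbatim
# scikit-learn source); an unknown step name triggers a KeyError.
def split_by_step(step_names, fit_params):
    fit_params_steps = {name: {} for name in step_names}
    for pname, pval in fit_params.items():
        step, param = pname.split('__', 1)     # assumes the key contains '__'
        fit_params_steps[step][param] = pval   # KeyError for an unknown step name
    return fit_params_steps

# split_by_step(['step0', 'step1'], {'step3__someparam': 123})  # KeyError: 'step3'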

@amueller
Member

amueller commented Aug 4, 2016

Sorry, maybe I was a bit too terse (or did not think it through).
I was thinking about

from sklearn.pipeline import Pipeline, make_union
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA, NMF

thing = make_union(Pipeline([('scaler', StandardScaler()), ('factorization', PCA())]),
                   Pipeline([('scaler', MinMaxScaler()), ('factorization', NMF())]))
X = [[1, 0], [0, 1]]
thing.fit(X, factorization__y=X)

Turns out that this doesn't even work, because FeatureUnion.fit doesn't have fit_params. I think that's a related but separate bug.

@amueller
Member

amueller commented Aug 4, 2016

Let's try again. So the following code doesn't make sense, but someone could have used it like this, and I don't want to break it:

from sklearn.pipeline import Pipeline, make_union
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

class MyPCA(PCA):
    def fit(self, X, y=None, factor=1):
        PCA.fit(self, X * factor, y)
        return self
    def fit_transform(self, X, y=None, factor=1):
        return self.fit(X, y=y, factor=factor).transform(X)

thing = make_union(Pipeline([('scaler', StandardScaler()), ('factorization', MyPCA())]),
                   Pipeline([('scaler', MinMaxScaler()), ('factorization', MyPCA())]))
X = [[1, 0], [0, 1]]
thing.fit_transform(X, factorization__factor=3)

@BenjaminBossan
Contributor Author

Turns out that this doesn't even work, because FeatureUnion.fit doesn't have fit_params.

I did not even notice that. Neither does it use transformer_weights, btw.

For your examples, I agree that they should work.

For the case of feature3__someparam=123, when feature3 does not exist, though, I could imagine that this might be a source of error when it is silently passed and later ignored. As I said, Pipeline would raise a KeyError in the same situation.

@amueller
Member

Do you see a way to fix this without breaking the code example? (Sorry, this discussion paged out of my brain in the meantime.)

@amueller amueller added the Bug label Sep 12, 2016
@BenjaminBossan
Contributor Author

A proposal: We split this issue into

  • fixing the bugs in FeatureUnion
  • dealing with the fit_params.

For the latter, a possibility could be to implement it in a way similar to Pipeline, i.e. raising a KeyError when there is a dunder prefix without a matching name. To me this sounds like the cleaner solution, and my guess is that few people's code would be affected.

Before that change, though, we would just check for the presence of a non-matching name and raise a DeprecationWarning saying that such calls will raise an error in the future. How about that?
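
A sketch of what that deprecation check could look like (hypothetical helper and wording, not actual scikit-learn code):

import warnings

# Hypothetical sketch of the proposed transition: warn about prefixes that match
# no transformer name today, and turn the warning into a KeyError later.
def check_fit_param_names(transformer_names, fit_params):
    for key in fit_params:
        prefix, sep, _ = key.partition('__')
        if sep and prefix not in transformer_names:
            warnings.warn(
                "fit_params key %r does not match any transformer in this "
                "FeatureUnion; this will raise a KeyError in a future release."
                % key,
                DeprecationWarning,
            )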

@amueller
Member

So deprecate current behavior and then do the same as pipeline? Sounds good :)

@jnothman
Member

Before we jump in the deep end, I think we need to clarify something. The API policy is that fit() and other methods should take only parameters that are data-dependent, i.e. have the same shape[0] as X. I suspect that Pipeline's fit_params design hails from an era when this policy was not established or was unclear.

Currently neither Pipeline nor FeatureUnion explicitly supports sample_weight, though they should. Perhaps they should be consistent with fit_params, but I think the right solution is still unclear and needs motivating examples.

@amueller
Member

I think fit_params is more or less just a more general way to implement sample_weight, with a slightly different API. I tried to implement sample_props once, and it was mostly renaming fit_params (and sometimes moving it from __init__ to fit).

@amueller
Member

So you don't think fit_params should have parameter delegation? It seems to have it in Pipeline and not in FeatureUnion, which seems pretty inconsistent.

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 13, 2017
@owlas

owlas commented Jun 8, 2018

Is there currently anybody working on a solution to ensure that either sample_weights or fit_params are being passed correctly through Pipeline and FeatureUnion objects?

@jnothman
Member

jnothman commented Jun 9, 2018 via email

@owlas

owlas commented Jun 9, 2018

@jnothman that looks quite interesting. Why did you decide to move away from fit_params? This seems to be a good solution for now and just needs to be implemented consistently across the API.

@jnothman
Member

jnothman commented Jun 9, 2018 via email

@glemaitre glemaitre modified the milestones: 0.20, 0.21 Jun 13, 2018
@amueller
Member

amueller commented Mar 7, 2019

This issue is basically sample props (#4497). It is part of the general roadmap, but retagging this issue for every release does not seem helpful, so I'm untagging it.

@adrinjalali
Member

Now fixed with metadata routing.
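
For completeness, a minimal sketch of the metadata-routing solution, assuming a recent scikit-learn (1.5 or later) with metadata routing enabled; only estimators that request the parameter receive it.

# Sketch assuming scikit-learn >= 1.5 with metadata routing enabled; 'someparam'
# is delivered only to the transformer that explicitly requests it.
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.pipeline import FeatureUnion

sklearn.set_config(enable_metadata_routing=True)

class MyTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None, someparam=None):
        print("someparam:", someparam)
        return self

    def transform(self, X):
        return X

X, y = make_classification()
union = FeatureUnion([
    ('feature0', MyTransformer().set_fit_request(someparam=False)),  # opts out
    ('feature1', MyTransformer().set_fit_request(someparam=True)),   # opts in
])
union.fit(X, y, someparam=123)  # prints None for feature0 and 123 for feature1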
