[MRG+1] ENH: Adds FunctionTransformer #4798

Closed

llllllllll
Contributor

CallableTransformer allows a user to convert a standard python callable
into a transformer for use in a Pipeline.

Addresses: #3560

llllllllll force-pushed the callable-transformer branch 2 times, most recently from 9d37fe2 to 17df2a7 (June 1, 2015 21:20)
Custom Transformers
===================

Often, you will want to convert an existing python function into transformer to
Member

"into a transformer"

llllllllll force-pushed the callable-transformer branch from 17df2a7 to e0db0a7 (June 1, 2015 21:32)
CallableTransformer allows a user to convert a standard python callable
into a transformer for use in a Pipeline.
llllllllll force-pushed the callable-transformer branch from e0db0a7 to 190caaf (June 1, 2015 21:36)
amueller changed the title from "ENH: Adds CallableTransformer" to "[MRG + 1] ENH: Adds CallableTransformer" on Jun 1, 2015
@amueller
Member

amueller commented Jun 1, 2015

lgtm

@llllllllll
Contributor Author

@GaelVaroquaux Could you take a look please?

be passed after X and y.
kwargs : dict, optional
A dictionary of keyword arguments to be passed to func.

Member

Could you add a simple / short example here?

Member

The same as in the user guide is fine.

Member

+1 for including a simple example as a doctest in the docstring of the class.

Member

Suggestion for example: func=partial(getattr, 'data'), and feed the transformer a dict {'data': X, other stuff...}
Sorry, I can't think of better names.

I often end up doing this when I do e.g. text classification on conversational data and I have messages in both directions. I store my samples as {'from': from, 'to': to} and use a FeatureUnion of two pipelines, each grabbing the respective field and then doing a CountVectorizer.

Member

except that partial(getattr, 'data') is more-or-less operator.attrgetter
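
A small sketch of the field-grabbing idea from the comments above, using only the standard library. The sample layout (`from` / `to` keys) follows the description; the helper name `grab_field` is hypothetical. For dict samples the matching lookup is `operator.itemgetter`, while `operator.attrgetter` is the counterpart for object attributes.

```python
from operator import attrgetter, itemgetter
from types import SimpleNamespace


def grab_field(name):
    """Return a function that pulls `name` out of every dict sample (hypothetical helper)."""
    get = itemgetter(name)

    def transform(samples):
        return [get(sample) for sample in samples]

    return transform


# Conversational samples stored as dicts, one message per direction.
samples = [
    {"from": "hi there", "to": "hello"},
    {"from": "how are you", "to": "fine thanks"},
]

from_messages = grab_field("from")(samples)
to_messages = grab_field("to")(samples)

# attrgetter does the same job for attributes instead of dict keys.
bunch = SimpleNamespace(data=[[1, 2], [3, 4]], target=[0, 1])
data = attrgetter("data")(bunch)
```

Each of the two extracted lists could then feed its own vectorizer, which is the FeatureUnion layout described above.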

@GaelVaroquaux
Member

Sorry, @llllllllll, I am too tired to review this tonight. The jetlag is still killing me.

@llllllllll
Contributor Author

@GaelVaroquaux No worries

func : callable, optional default=None
The callable to use for the transformation. This will be passed
the same arguments as transform, with args and kwargs forwarded.
If func is None, then func will be the identity function.
Member

Please insert one blank line before the documentation of the next parameter.

@vene
Member

vene commented Jun 2, 2015

I'm concerned about what happens when a y is passed, but the func doesn't take/use it.

We fundamentally want this to work with np.log, np.sqrt, etc. But it looks to me like with the current implementation, we can't do CallableTransformer(np.log).transform(X, y). Wouldn't this impact the use in supervised pipelines?

@llllllllll
Contributor Author

This is a good point, do you think that this should accept an optional function to act on y?

@vene
Member

vene commented Jun 2, 2015

I think it needs to be the same function, as it might need to use the value of y in deciding what to do to each row of X (e.g. class weights)

The very simplest solution is to just not allow this, and just ignore the y.

The next simplest solution is probably to add an attribute (uses_y=True|False), but that would make it cumbersome to grid search over callables where some use y and others don't.

I don't know what other solutions there are: introspecting the callable, or catching the exception? In which case maybe the call should be func(X, y=y)? This can get messy.

@vene
Member

vene commented Jun 2, 2015

If we agree that a supervised pipeline with func=np.log should work by default, maybe we should have a test for this use case?

@llllllllll
Contributor Author

Maybe we can have it accept a tuple for the func slot, where the tuple reads: (func, pass_y) and if it is a non-tuple, just treat it like: (func, False)

@llllllllll
Contributor Author

After thinking about this a bit more, I think that a good thing might just be to make the call:

try:
    return func(X, y)
except TypeError:
    return func(X)

I would say that using a pass_y flag would be better; however, the search use case seems to make that not work.

Also, to address the partial vs args, kwargs, this was chosen to make searching params to the func easier.

@vene
Member

vene commented Jun 3, 2015

I understand your argument. But this also adds more of a maintenance burden.

Also, it might not be that simple to do try: return func(X, y). np.log has a call signature (x[, out]) so it could actually succeed, or fail with a different exception than expected.

The absence of a TypeError when the function is called with two arguments cannot unambiguously signal the correct signature here. Something like func(X, y=y) might be a bit better, but still, when the user runs into that one 3rd party library function that has a parameter named y by coincidence, she will have to write a clumsy lambda wrapper.
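
The ambiguity can be reproduced with the standard library alone. This is only a sketch: `math.log(x, base)` is used here as a stand-in for `np.log`'s `(x[, out])` signature, since both accept an optional second positional argument, and the helper name `call_with_optional_y` is hypothetical.

```python
import math


def call_with_optional_y(func, X, y):
    """Try func(X, y) first and fall back to func(X) on TypeError."""
    try:
        return func(X, y)
    except TypeError:
        return func(X)


def square(x):
    return x * x


# One-argument function: the two-argument call raises TypeError,
# so the fallback runs as intended.
fallback_result = call_with_optional_y(square, 3.0, 2.0)

# math.log accepts an optional second positional argument, so the first
# call succeeds and y is silently interpreted as the logarithm base.
surprising_result = call_with_optional_y(math.log, 8.0, 2.0)
intended_result = math.log(8.0)
```

No exception is raised in the second case, yet the result is the base-2 logarithm rather than the natural logarithm the caller wanted.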

What's the most common use case here? I think it's when y isn't used at all. I also guess that grid searching over funcs that use y and that don't use y isn't that common. I'm afraid we're rushing to generalize to scenarios that nobody actually needs.

So my preferences would be, in order (and with a pretty big gap between 2 and 3):

  1. simply don't support using y at all, just use X
  2. have a parameter indicating whether y is expected or not, and fail accordingly if it's not passed
  3. the (func, flag) tuple input
  4. do heuristics to detect whether y should be used at all.

As for kwargs vs partial, I still vote for partial. Grid search would still look alright with:

grid = dict(functransformer__func=(
    partial(select_columns, cols)
    for cols in [
        (0, 1),  # sepal width & petal width
        (2, 3)  # sepal length & petal length
    ]))

And we wouldn't have to reimplement stdlib functionality.
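
A runnable sketch of the `partial` approach, under stated assumptions: `select_columns` is a hypothetical helper whose signature is not shown in this thread, and plain lists of rows stand in for arrays. Binding `cols` by keyword keeps `X` as the first positional argument, and building the candidates as a list lets them be iterated more than once.

```python
from functools import partial


def select_columns(X, cols):
    """Keep only the given column indices from each row (hypothetical helper)."""
    return [[row[c] for c in cols] for row in X]


# Candidate callables that a parameter search could iterate over.
candidates = [
    partial(select_columns, cols=cols)
    for cols in [(0, 1), (2, 3)]
]

X = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
]

# Each candidate takes X alone, since cols is already bound.
results = [func(X) for func in candidates]
```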

@jnothman
Member

jnothman commented Jun 3, 2015

  • I prefer having a parameter for "use y". I think the case of a search that mixes callables that require or do not require y means the user needs to compensate by making all their callables behave the same way, etc.
  • seeing as this is one area where users are likely to use anonymous functions, their incompatibility with pickling should probably be noted in the documentation.
  • partial is better than having kwargs as a constructor parameter. It ties the kwargs to the callable itself, while modifying the parameters through a kwargs param is just as cumbersome as modifying the callable itself; apart from which, I think the case where the kwargs are to be varied in a search is necessarily beyond the scope of a quick helper like this.
  • perhaps the estimator should allow inverse_transform to also be provided, or we could decide that that is beyond its purpose.
  • I think FunctionTransformer might be more understandable to a breadth of users.
  • I'm not sure if this belongs in preprocessing; I think it is closely thematically coupled with Pipeline as a model composition utility.
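
The pickling point in the list above can be checked with a short standard-library sketch. The helper `is_picklable` is hypothetical; `round` is used only because it is an importable builtin that accepts a keyword argument.

```python
import pickle
from functools import partial


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# An anonymous function has no importable name for pickle to reference.
lambda_ok = is_picklable(lambda x: round(x, 2))

# A partial over an importable function pickles, bound arguments included.
round_2 = partial(round, ndigits=2)
partial_ok = is_picklable(round_2)
restored = pickle.loads(pickle.dumps(round_2))
```

This is one more argument for `partial` over a lambda when a fitted pipeline has to be saved to disk or sent to worker processes.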

@amueller
Member

amueller commented Jun 3, 2015

+1 on what @jnothman said. inverse_transform would be great, and an argument to pass y would be, too. I don't think the "searching over func" breaks that.

I am not entirely certain about preprocessing vs pipeline module. This only really makes sense when using a pipeline, which I think is a good argument.

@amueller
Member

amueller commented Jun 3, 2015

@llllllllll Sorry if this is taking more of your time than you anticipated.

@glouppe
Contributor

glouppe commented Jun 3, 2015

Jumping into the conversation, as I have written such a class myself several times for personal use. However, I often find myself needing to apply a function element-wise rather than on the full X. In this setting, I don't know if proposing a shortcut for vectorizing a user-defined function would be something to consider, e.g. as a flag? (Using numpy.vectorize internally, this is easy.) I fear this kind of use case might pop up sooner rather than later once such a transformer is shipped in scikit-learn.

Just my 2 cents. I don't want to make this longer than it should be :)
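
A sketch of what such an element-wise shortcut would do, written with plain lists so that it runs without NumPy. The names `elementwise` and `clip_negative` are hypothetical; `numpy.vectorize` plays the role of `elementwise` for arrays.

```python
def elementwise(func):
    """Lift a scalar function to one applied to every entry of a list of rows."""

    def apply(X):
        return [[func(value) for value in row] for row in X]

    return apply


def clip_negative(value):
    """Scalar function written with ordinary Python control flow."""
    return value if value > 0 else 0.0


clip_all = elementwise(clip_negative)
clipped = clip_all([[1.5, -2.0], [-0.5, 3.0]])
```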

@jnothman
Member

jnothman commented Jun 3, 2015

Can you give examples of element-wise functions that aren't compositions of
numpy functions?

@mblondel
Member

mblondel commented Jun 4, 2015

@amueller I prefer FunctionTransformer too.

@llllllllll
Contributor Author

Just so everyone knows, I have not forgotten about this PR; however, I have been busy with work. I will address the comments made sometime this weekend so that they can go under another round of review. Thank you all for the feedback.

@amueller
Member

amueller commented Jun 5, 2015

Thanks @llllllllll, your contribution is much appreciated :)

Joe Jevnik added 2 commits June 8, 2015 12:26
Makes `pass_y` an argument to FunctionTransformer to indicate that the
labels should be passed to the wrapped function.
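
A minimal sketch of how a `pass_y` flag could behave, assuming only what the commit message and the discussion above state. `SketchFunctionTransformer` is an illustrative stand-in, not the scikit-learn class, and `scale_by_label` is a hypothetical function that needs the labels.

```python
class SketchFunctionTransformer:
    """Illustrative stand-in for the transformer discussed in this PR."""

    def __init__(self, func=None, pass_y=False):
        self.func = func
        self.pass_y = pass_y

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X, y=None):
        # A missing func means the identity function.
        func = self.func if self.func is not None else (lambda data: data)
        # Forward y only when the caller opted in via pass_y.
        if self.pass_y:
            return func(X, y)
        return func(X)


def double(X):
    return [[2 * value for value in row] for row in X]


def scale_by_label(X, y):
    # Uses y to decide what to do with each row of X.
    return [[value * (label + 1) for value in row] for row, label in zip(X, y)]


X = [[1, 2], [3, 4]]
y = [0, 1]

# y is supplied but ignored, so one-argument functions keep working.
doubled = SketchFunctionTransformer(double).fit(X, y).transform(X, y)
# With pass_y=True the same y reaches the wrapped function.
scaled = SketchFunctionTransformer(scale_by_label, pass_y=True).transform(X, y)
unchanged = SketchFunctionTransformer().transform(X)
```

With the flag off by default, a one-argument function is never handed `y`, which is the supervised-pipeline case raised earlier in the thread.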

validate : bool, optional default=True
Indicate that the input X array should be checked before calling
func. If validate is false, there will be no input validation.
Member

Note that this will ensure the input is a non-empty, 2-dimensional array (or sparse matrix) of finite numbers.

Member

@jnothman Solved in ec7ddcb.
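
For readers unfamiliar with what that validation implies, here is a rough standard-library sketch of the three conditions named in the review note (non-empty, 2-dimensional, finite). `check_2d_finite` is a hypothetical function over lists of rows and is not the library's own validation code.

```python
import math


def check_2d_finite(X):
    """Raise ValueError unless X is a non-empty, rectangular list of finite rows."""
    if len(X) == 0:
        raise ValueError("input is empty")
    if not all(isinstance(row, (list, tuple)) for row in X):
        raise ValueError("input is not 2-dimensional")
    width = len(X[0])
    if width == 0 or any(len(row) != width for row in X):
        raise ValueError("rows must be non-empty and of equal length")
    if not all(math.isfinite(value) for row in X for value in row):
        raise ValueError("input contains NaN or infinity")
    return X


validated = check_2d_finite([[1.0, 2.0], [3.0, 4.0]])
```

Switching validation off is what allows inputs such as lists of dicts or strings to reach the wrapped function untouched.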

larsmans changed the title from "[MRG + 1] ENH: Adds CallableTransformer" to "[MRG+1] ENH: Adds FunctionTransformer" on Jul 3, 2015
DOC expand FunctionTransform docstring
@larsmans
Member

larsmans commented Jul 9, 2015

I think we can keep inverses out of this PR. I'm ok with this living in sklearn.pipeline or preprocessing.

Merge conflict, please rebase or merge in master.

@amueller
Member

amueller commented Aug 3, 2015

Merged via #5059. Thanks everybody, in particular @llllllllll for his contribution :)

amueller closed this on Aug 3, 2015
@larsmans
Member

larsmans commented Aug 3, 2015

As Olivier would say, 🍻

@jnothman
Member

jnothman commented Aug 4, 2015

HurraH!

10 participants