[MRG+1] ENH: Adds FunctionTransformer #4798

llllllllll · 2015-06-01T20:09:41Z

CallableTransformer allows a user to convert a standard python callable
into a transformer for use in a Pipeline.

Addresses: #3560

amueller · 2015-06-01T21:28:01Z

doc/modules/preprocessing.rst

+Custom Transformers
+===================
+
+Often, you will want to convert an existing python function into transformer to


"into a transformer"

amueller · 2015-06-01T22:37:58Z

lgtm

llllllllll · 2015-06-01T22:48:14Z

@GaelVaroquaux Could you take a look please?

mblondel · 2015-06-02T02:07:58Z

sklearn/preprocessing/callable_transformer.py

+        be passed after X and y.
+    kwargs : dict, optional
+        A dictionary of keyword arguments to be passed to func.
+


Could you add a simple / short example here?

The same as in the user guide is fine.

+1 for including a simple example as a doctest in the docstring of the class.

Suggestion for example: func=partial(getattr, 'data'), and feed the transformer a dict {'data': X, other stuff...}
Sorry, I can't think of better names.

I often end up doing this when I do e.g. text classification on conversational data and I have messages in both directions. I store my samples as {'from': from, 'to': to} and use a FeatureUnion of two pipelines, each grabbing the respective field and then doing a CountVectorizer.

except that partial(getattr, 'data') is more-or-less operator.attrgetter
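A small standard-library sketch of the field-grabbing idea from this thread. The `sample` dict and the `Sample` class are made up for illustration. Note that `partial(getattr, 'data')` binds `'data'` as the object rather than the attribute name, so for a dict of fields `operator.itemgetter` is the closer fit, and `operator.attrgetter` covers the attribute case.

```python
from functools import partial
from operator import attrgetter, itemgetter

# Hypothetical sample: one field holds the feature matrix, the rest is metadata.
sample = {"data": [[1.0, 2.0], [3.0, 4.0]], "source": "hypothetical"}

# partial(getattr, "data") fixes "data" as the *first* argument of getattr,
# so the call becomes getattr("data", sample), not a lookup on the sample.
broken = partial(getattr, "data")
try:
    broken(sample)
    partial_failed = False
except TypeError:
    # getattr requires the attribute name (second argument) to be a string.
    partial_failed = True

# For dict-like samples, itemgetter performs the lookup by key.
X = itemgetter("data")(sample)


class Sample:
    """Hypothetical object exposing the same field as an attribute."""

    def __init__(self, data):
        self.data = data


# For attribute access, attrgetter is the analogue.
X_attr = attrgetter("data")(Sample([[5.0, 6.0]]))
```

Either getter is a plain callable, so it could be handed to the transformer as `func`, provided input validation does not reject the dict input.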

GaelVaroquaux · 2015-06-02T02:53:14Z

Sorry, @llllllllll, I am too tired to review this tonight. The jetlag is still killing me.

llllllllll · 2015-06-02T03:08:34Z

@GaelVaroquaux No worries

ogrisel · 2015-06-02T06:54:47Z

sklearn/preprocessing/callable_transformer.py

+    func : callable, optional default=None
+        The callable to use for the transformation. This will be passed
+        the same arguments as transform, with args and kwargs forwarded.
+        If func is None, then func will be the identity function.


Please insert one blankline before the documentation of the next parameter.

vene · 2015-06-02T18:56:13Z

I'm concerned about what happens when a y is passed, but the func doesn't take/use it.

We fundamentally want this to work with np.log, np.sqrt, etc. But it looks to me like with the current implementation, we can't do CallableTransformer(np.log).transform(X, y). Wouldn't this impact the use in supervised pipelines?

llllllllll · 2015-06-02T19:00:59Z

This is a good point, do you think that this should accept an optional function to act on y?

vene · 2015-06-02T19:02:59Z

I think it needs to be the same function, as it might need to use the value of y in deciding what to do to each row of X (e.g. class weights)

The very simplest solution is to just not allow this, and just ignore the y.

The next simplest solution is probably to add an attribute (uses_y=True|False), but that would make it cumbersome to grid search over callables where some use y and others don't.

I don't know what the other solutions are: introspecting the callable, or catching the exception? In which case maybe the call should be func(X, y=y)? This can get messy.

vene · 2015-06-02T19:04:53Z

If we agree that a supervised pipeline with func=np.log should work by default, maybe we should have a test for this use case?

llllllllll · 2015-06-02T19:06:55Z

Maybe we can have it accept a tuple for the func slot, where the tuple reads: (func, pass_y) and if it is a non-tuple, just treat it like: (func, False)

llllllllll · 2015-06-03T04:28:48Z

After thinking about this a bit more, I think that a good thing might just be to make the call:

try:
    return func(X, y)
except TypeError:
    return func(X)

I would say that using a pass_y flag would be better; however, the search use case seems to make that not work.

Also, to address partial vs. args/kwargs: args and kwargs were chosen to make searching over the func's parameters easier.

vene · 2015-06-03T05:28:41Z

I understand your argument. But this also adds more of a maintenance burden.

Also, it might not be that simple to do try: return func(X, y). np.log has a call signature (x[, out]) so it could actually succeed, or fail with a different exception than expected.

Simply not raising TypeError when called with two arguments cannot unambiguously signal the correct signature here. Something like func(X, y=y) might be a bit better, but still, when the user runs into that one third-party library function that happens to have a parameter named y, she will have to write a clumsy lambda wrapper.
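To make the ambiguity concrete, here is a dependency-free sketch. `log_like` is a stand-in that only mimics an `(x[, out])` signature (it is not `np.log`), and `call_with_fallback` and `uses_y_but_buggy` are hypothetical names. It shows two ways the try/except fallback can mislead: the two-argument call succeeds and clobbers `y`, or a genuine `TypeError` inside the function gets replaced by a complaint about a missing argument.

```python
def call_with_fallback(func, X, y):
    """The fallback strategy under discussion: try (X, y), then (X)."""
    try:
        return func(X, y)
    except TypeError:
        return func(X)


def log_like(x, out=None):
    """Stand-in with an (x[, out]) signature; writes into `out` if given."""
    result = [value * 2 for value in x]  # placeholder arithmetic, not a real log
    if out is not None:
        out[:] = result  # overwrites whatever was passed as second argument
    return result


def uses_y_but_buggy(X, y):
    """Accepts (X, y), but raises TypeError from its own body."""
    return [value + "oops" for value in X]  # int + str raises TypeError


X = [1, 2, 3]
y = [0, 1, 0]

# Case 1: no TypeError is raised, so the labels are silently overwritten.
call_with_fallback(log_like, X, y)
labels_after = list(y)

# Case 2: the real bug triggers the fallback, and the error that propagates
# is about the one-argument call instead.
try:
    call_with_fallback(uses_y_but_buggy, [1, 2, 3], [0, 1, 0])
    message = ""
except TypeError as exc:
    message = str(exc)
```

In the first case `labels_after` no longer matches the original labels; in the second, `message` talks about a missing positional argument rather than the bad addition.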

What's the most common use case here? I think it's when y isn't used at all. I also guess that grid searching over funcs that use y and that don't use y isn't that common. I'm afraid we're rushing to generalize to scenarios that nobody actually needs.

So my preferences would be, in order (and with a pretty big gap between 2 and 3):

1. simply don't support using y at all, just use X
2. have a parameter indicating whether y is expected or not, and fail accordingly if it's not passed
3. the (func, flag) tuple input
4. do heuristics to detect whether y should be used at all.

As for kwargs vs partial, I still vote for partial. Grid search would still look alright with:

grid = dict(functransformer__func=[
    partial(select_columns, cols)
    for cols in [
        (0, 1),  # sepal length & sepal width
        (2, 3),  # petal length & petal width
    ]])

And we wouldn't have to reimplement stdlib functionality.

jnothman · 2015-06-03T13:38:15Z

I prefer having a parameter for "use y". I think the case of a search that mixes callables that require or do not require y means the user needs to compensate by making all their callables behave the same way, etc.
Seeing as this is one area where users are likely to use anonymous functions, their incompatibility with pickling should probably be noted in the documentation.
partial is better than having kwargs as a constructor parameter. It ties the kwargs to the callable itself, while modifying the parameters through a kwargs param is just as cumbersome as modifying the callable itself; apart from which, I think the case where the kwargs are to be varied in a search is necessarily beyond the scope of a quick helper like this.
Perhaps the estimator should allow inverse_transform to also be provided, or we could decide that that is beyond its purpose.
I think FunctionTransformer might be more understandable to a breadth of users.
I'm not sure if this belongs in preprocessing; I think it is closely thematically coupled with Pipeline as a model composition utility.
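The pickling point can be checked with the standard library alone. This is a minimal sketch: `partial(mul, 2)` is just an arbitrary example of a partial over an importable function, and the exact exception type raised for the lambda is treated as an assumption, since it can differ between Python versions.

```python
import pickle
from functools import partial
from operator import mul

# A partial over an importable function round-trips through pickle.
double = partial(mul, 2)
restored = pickle.loads(pickle.dumps(double))
restored_result = restored(21)

# An anonymous function has no importable name, so pickling it fails.
anonymous = lambda x: 2 * x
try:
    pickle.dumps(anonymous)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False
```

This matters for anything that serialises the estimator, such as saving a fitted pipeline to disk or sending it to worker processes.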

amueller · 2015-06-03T19:07:19Z

+1 on what @jnothman said. inverse_transform would be great, and an argument to pass y would be, too. I don't think the "searching over func" breaks that.

I am not entirely certain about preprocessing vs pipeline module. This only really makes sense when using a pipeline, which I think is a good argument.

amueller · 2015-06-03T19:08:40Z

@llllllllll Sorry if this is taking more of your time than you anticipated.

glouppe · 2015-06-03T19:11:54Z

Jumping into the conversation, as I have written such a class myself several times for personal use. However, I often find myself needing to apply a function element-wise rather than on the full X. In this setting, I don't know if proposing a shortcut for vectorizing a user-defined function would be something to consider, e.g. as a flag? (Using numpy.vectorize internally, this is easy.) I fear this kind of use case might pop up sooner rather than later once such a transformer is shipped in scikit-learn.

Just my 2 cents -- I don't want to make this longer than it should be :)
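As a rough illustration of the element-wise case, the sketch below lifts a scalar function so that it maps over every cell of a 2-D list of lists. It is a dependency-free stand-in for what a flag backed by `numpy.vectorize` might do; `elementwise` and `clipped_log` are hypothetical names, not part of this PR.

```python
import math


def elementwise(func):
    """Lift a scalar function to act on every cell of a 2-D list of lists."""

    def apply(X):
        return [[func(value) for value in row] for row in X]

    return apply


def clipped_log(value):
    """Hypothetical scalar function: log of the value, floored at zero."""
    return math.log(value) if value > 1.0 else 0.0


transform = elementwise(clipped_log)
X = [[1.0, math.e], [0.5, math.e ** 2]]
result = transform(X)
```

The lifted `transform` takes the whole X, so it has the shape the transformer expects from `func`.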

jnothman · 2015-06-03T22:56:29Z

Can you give examples of element-wise functions that aren't compositions of
numpy functions?


mblondel · 2015-06-04T01:19:36Z

@amueller I prefer FunctionTransformer too.

llllllllll · 2015-06-04T23:50:50Z

Just so everyone knows, I have not forgotten about this PR; however, I have been busy with work. I will address the comments sometime this weekend so that the PR can undergo another round of review. Thank you all for the feedback.

amueller · 2015-06-05T17:04:57Z

Thanks @llllllllll, your contribution is much appreciated :)

Makes `pass_y` an argument to FunctionTransformer to indicate that the labels should be passed to the wrapped function.
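A minimal sketch of how a `pass_y` flag could behave, assuming the semantics described in this thread: `func` receives only X by default, receives `(X, y)` when the flag is set, and a missing `func` means identity. `SketchFunctionTransformer` is an illustrative stand-in working on plain lists, not the class merged into scikit-learn, and `log_rows` and `weight_rows` are hypothetical helpers.

```python
import math


class SketchFunctionTransformer:
    """Illustrative stand-in, not the merged scikit-learn class."""

    def __init__(self, func=None, pass_y=False):
        self.func = func
        self.pass_y = pass_y

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X, y=None):
        if self.func is None:
            return X  # identity when no callable is given
        if self.pass_y:
            return self.func(X, y)
        return self.func(X)  # y is ignored unless explicitly requested


def log_rows(X):
    """Hypothetical unsupervised function: takes X only."""
    return [[math.log(value) for value in row] for row in X]


def weight_rows(X, y):
    """Hypothetical supervised function: scales each row by its label."""
    return [[value * label for value in row] for row, label in zip(X, y)]


X = [[1.0, math.e], [1.0, 1.0]]
y = [2, 3]

# A one-argument function works even though y is supplied to transform.
unsupervised = SketchFunctionTransformer(log_rows).fit(X, y).transform(X, y)

# With pass_y=True the labels are forwarded to the wrapped function.
supervised = SketchFunctionTransformer(weight_rows, pass_y=True).fit(X, y).transform(X, y)

identity = SketchFunctionTransformer().transform(X, y)
```

With an explicit flag there is no guessing about the callable's signature, which is the trade-off argued for above.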

jnothman · 2015-06-18T05:00:02Z

sklearn/preprocessing/function_transformer.py

+
+    validate : bool, optional default=True
+        Indicate that the input X array should be checked before calling
+        func. If validate is false, there will be no input validation.


Note that this will ensure the input is a non-empty, 2-dimensional array (or sparse matrix) of finite numbers.
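To spell out what such validation implies, here is a dependency-free sketch of the three conditions named in the note (non-empty, 2-dimensional, finite numbers) applied to a list of lists. `check_2d_finite` is a hypothetical helper; scikit-learn's own input checking handles arrays and sparse matrices and does considerably more.

```python
import math


def check_2d_finite(X):
    """Hypothetical checker: non-empty, rectangular 2-D, finite numbers only."""
    if len(X) == 0:
        raise ValueError("expected a non-empty 2-D input")
    width = None
    for row in X:
        if not isinstance(row, (list, tuple)):
            raise ValueError("expected a 2-D input (a sequence of rows)")
        if width is None:
            width = len(row)
            if width == 0:
                raise ValueError("expected at least one feature per row")
        elif len(row) != width:
            raise ValueError("rows differ in length")
        for value in row:
            if not math.isfinite(value):
                raise ValueError("input contains NaN or infinity")
    return X


def is_rejected(X):
    """Return True if the checker raises ValueError for X."""
    try:
        check_2d_finite(X)
    except ValueError:
        return True
    return False


accepted = check_2d_finite([[1.0, 2.0], [3.0, 4.0]])
```

A function that legitimately consumes other input, such as a dict of fields or a 1-D sequence, would need `validate=False`.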

@jnothman Solved in ec7ddcb.

DOC expand FunctionTransform docstring

larsmans · 2015-07-09T14:53:55Z

I think we can keep inverses out of this PR. I'm ok with this living in sklearn.pipeline or preprocessing.

Merge conflict, please rebase or merge in master.

amueller · 2015-08-03T20:09:51Z

Merged via #5059. Thanks everybody, in particular @llllllllll for his contribution :)

larsmans · 2015-08-03T20:13:09Z

As Olivier would say, 🍻

jnothman · 2015-08-04T00:59:55Z

HurraH!


llllllllll force-pushed the callable-transformer branch 2 times, most recently from 9d37fe2 to 17df2a7 Compare June 1, 2015 21:20

amueller reviewed Jun 1, 2015

llllllllll force-pushed the callable-transformer branch from 17df2a7 to e0db0a7 Compare June 1, 2015 21:32

ENH: Adds CallableTransformer

190caaf

CallableTransformer allows a user to convert a standard python callable into a transformer for use in a Pipeline.

llllllllll force-pushed the callable-transformer branch from e0db0a7 to 190caaf Compare June 1, 2015 21:36

amueller changed the title ~~ENH: Adds CallableTransformer~~ [MRG + 1] ENH: Adds CallableTransformer Jun 1, 2015

mblondel reviewed Jun 2, 2015

ogrisel reviewed Jun 2, 2015

vene mentioned this pull request Jun 3, 2015

[WIP] Add feature_extraction.ColumnTransformer #3886

Closed

8 tasks

Joe Jevnik added 2 commits June 8, 2015 12:26

ENH: Renames CallableTransformer -> FunctionTransformer.

69f5d66

Makes `pass_y` an argument to FunctionTransformer to indicate that the labels should be passed to the wrapped function.

COMPAT: Makes test_function_transformer py2 compatible.

67ddf96

jnothman reviewed Jun 18, 2015

DOC expand FunctionTransform docstring

ec7ddcb

larsmans changed the title ~~[MRG + 1] ENH: Adds CallableTransformer~~ [MRG+1] ENH: Adds FunctionTransformer Jul 3, 2015

larsmans mentioned this pull request Jul 3, 2015

DOC expand FunctionTransform docstring llllllllll/scikit-learn#1

Merged

Merge pull request #1 from larsmans/functransf

0b4b880

DOC expand FunctionTransform docstring

amueller mentioned this pull request Jul 30, 2015

[MRG + 1] Function transformer rebase #5059

Merged

amueller closed this Aug 3, 2015

amueller mentioned this pull request Aug 4, 2015

FeatureSelector for Pipeline (New Feature) #3560

Closed

glemaitre mentioned this pull request Aug 21, 2017

Ensure that the output of FunctionTransformer is 2D when validate=True #9595

Closed
