@arnaudsj commented May 8, 2013

  • Modified check_arrays function to leave Pandas data frames alone
  • Modified is_multilabel function to support Pandas Series
  • Added new DataFrameMapper class (inspired by @benhamner & mostly copied from https://github.com/paulgb/sklearn-pandas)
  • Added new Titanic dataset to sklearn.datasets
  • Added Titanic example to demonstrate DataFrameMapper use case

Looking for feedback since it is my first contribution to sklearn.

Thank you in advance!

A member commented in a code review:

Generally, scikit-learn avoids transforming input parameters to estimators in __init__, so that they may be changed using set_params. Instead, it would be usual to do this validation and transformation in fit().
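
A minimal sketch of that convention (SelectColumns and its columns parameter are hypothetical, for illustration only):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SelectColumns(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        # __init__ only stores parameters, unchanged, so that
        # get_params / set_params and cloning keep working.
        self.columns = columns

    def fit(self, X, y=None):
        # Validation and any derived state belong here instead.
        X = np.asarray(X)
        if self.columns is None:
            self.columns_ = list(range(X.shape[1]))
        else:
            self.columns_ = list(self.columns)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.columns_]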

@jnothman (Member) commented May 8, 2013

Thanks for the PR!

I don't see why in this form a pandas DataFrame is more appropriate than a numpy structured array (except that the former is stored by field and the latter by record).

Is the general idea that features available in a dataset might be selected among and then transformed with things like LabelBinarizer?

I think interfacing with external tools such as pandas should be outside the scope of scikit-learn, though it could perhaps be referenced (and perhaps maintained) from the main project.

@amueller (Member) commented May 8, 2013

I agree that it is out of scope.
Everything would need to work, or give sensible errors, with all possible dataframe inputs.
And since most algorithms don't preserve "column names" for their outputs / transformations, I don't see the big gain.

If it were not much code to provide a stable interface, I might be convinced otherwise.
Basically, we would need an additional common test that every estimator can be used with dataframes and that the outcomes are the same as with numpy arrays; see the sketch below.
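
Roughly, such a check might look like this (check_dataframe_equivalence is a hypothetical name; it assumes a predictor and treats pandas as an optional test dependency):

import numpy as np
from sklearn.base import clone

def check_dataframe_equivalence(estimator, X, y):
    import pandas as pd  # optional dependency, imported lazily
    # Fitting on a DataFrame should give the same result as fitting
    # on the underlying numpy array.
    est_np = clone(estimator).fit(X, y)
    est_df = clone(estimator).fit(pd.DataFrame(X), y)
    np.testing.assert_array_almost_equal(
        est_np.predict(X), est_df.predict(pd.DataFrame(X)))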

@arnaudsj (Author) commented May 8, 2013

Thank you for the feedback; I definitely have a few things to clean up. I will work on a revised pull request.

As for the dataframe support, really all I intended was to get the DataFrameMapper added, since that is probably how most people will use it; raw dataframes on their own are of limited interest, as they can easily be cast to numpy arrays anyway. As time allows, I will try to write a new common test that runs every estimator against a dataframe and compares the output to using raw numpy arrays.

@jnothman, my goal was not to add an external dependency but just to make sure sklearn treats dataframe objects nicely enough that either labels (Series) or input matrices (DataFrames) can be used as inputs to sklearn estimators. The advantage of dataframes lies mostly in their clean and powerful interface for data transformations (grouping, missing-value handling, joining, feature derivation, etc.), which are common tasks in real-life machine learning problems. The DataFrameMapper class then lets you easily transform each feature column into properly transformed categorical/scaled features for input into classical estimators. Another typical use would be to run the TfidfVectorizer against a text-based column. The output is a clean numpy array.
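
For illustration, intended usage might look like the following, modelled on sklearn-pandas; the exact constructor arguments of the DataFrameMapper proposed here may differ:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

df = pd.DataFrame({'pclass': [1, 3, 2],
                   'fare': [71.3, 7.9, 26.0]})

# DataFrameMapper is the class proposed in this PR.
mapper = DataFrameMapper([
    ('pclass', LabelBinarizer()),   # categorical column -> one-hot
    ('fare', StandardScaler()),     # numeric column -> scaled
])
X = mapper.fit_transform(df)        # plain numpy feature matrix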

Also, I did see TravisCI complain about pandas. What's the most elegant way to deal with this, given that some of my tests require the pandas library to be installed?

Thanks again for taking the time to review this pull request!

@amueller (Member) commented May 8, 2013

I'm not entirely clear on what you want to do: do you want the estimators to accept dataframes, or do you want to provide a tool that converts them to numpy arrays more easily?

@jnothman (Member) commented May 8, 2013

@arnaudsj, you've basically submitted two PRs in one:

  1. validate Pandas objects as arrays
  2. special preprocessing for DataFrame that hstacks the features extracted by transformers for specific columns

Re 1.: There's no reason scikit-learn shouldn't more generally support objects with __array__ (which is a numpy invention, not a pandas one). However, the code you've added to check_arrays happens to be redundant, because np.asarray will call __array__ if it's available. Moreover, we happen to be lucky that Pandas objects also have shape, because otherwise they would never make it to that part of the code. Finally, the way to ensure scikit-learn supports such structures is by adding them to tests. And by "them", I don't particularly mean Pandas objects; I mean dummy objects of the following class:

class DummyArrayLike(object):
    # Minimal array-like: numpy consumes it via the __array__ protocol.
    def __init__(self, ret_array):
        self.ret_array = ret_array
    def __array__(self, dtype=None):
        # np.asarray(obj) ends up here; cast if a dtype was requested.
        if dtype:
            return self.ret_array.astype(dtype)
        return self.ret_array
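
For example (numpy only), np.asarray consumes such an object transparently:

import numpy as np

arr = np.arange(6, dtype=np.int64).reshape(3, 2)
wrapped = DummyArrayLike(arr)
# np.asarray falls back to the __array__ protocol for non-ndarray input.
assert np.array_equal(np.asarray(wrapped), arr)
assert np.asarray(wrapped, dtype=np.float64).dtype == np.float64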

Re 2.: I don't think this is a bad idea, but it can be proposed without being specific to pandas. Generalise it to transforming a sequence of mappings -- or a mapping of sequences -- into a feature matrix, according to a transformer for each selected key. But you'll find similar things that go part of the way in DictVectorizer and FeatureUnion, so you'll need to work out a way to share code with these implementations as much as possible, while justifying the new transformer as filling a need of many users.

Good luck!

@jnothman (Member) commented May 8, 2013

Come to think of it, your wrapper is a trivial extension of pipeline.FeatureUnion such that it doesn't pass the whole of X to each constituent transformer, but pulls out the column of X corresponding to each.
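
Something like this sketch (ColumnSelector is a hypothetical helper, not an existing scikit-learn class):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    # Pull one named column out of X (a mapping of sequences,
    # a structured array, or a DataFrame).
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Return a 2-d column so downstream transformers see a matrix.
        return np.asarray(X[self.key]).reshape(-1, 1)

# Each branch selects its own column instead of receiving all of X.
union = FeatureUnion([
    ('fare', Pipeline([('select', ColumnSelector('fare')),
                       ('scale', StandardScaler())])),
    ('age', Pipeline([('select', ColumnSelector('age')),
                      ('scale', StandardScaler())])),
])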

@GaelVaroquaux (Member) commented May 8, 2013

> The DataFrameMapper class then lets you easily transform each feature
> column into properly transformed categorical/scaled features for input
> into classical estimators. Another typical use would be to run the
> TfidfVectorizer against a text-based column. The output is a clean
> numpy array.

I don't understand why such a helper to turn dataframe objects into the
industry standard, numpy arrays, should not live in pandas. You realise
that each time pandas changes, we will have to update this function, so it
is not viable in the long run.

> Also, I did see TravisCI complain about pandas. What's the most elegant
> way to deal with this, given that some of my tests require the pandas
> library to be installed?

As stated, we cannot have a dependence on pandas. Pandas requires a
recent numpy (1.6 last time I checked). Updating numpy on a system is a
non-trivial task, as it forces recompiling some packages. For all these
reasons we cannot have pandas-specific code in scikit-learn.

@arnaudsj (Author) commented May 8, 2013

@jnothman, thank you for the very detailed feedback. Let me reshape my pull request to be more generic, with no dependency on pandas per se. I will also break it into two separate PRs, as you suggested, for better clarity.

Come to think of it, I will also create a third one to add the Titanic dataset, unless there is no interest. I will provide a URL linking to the equivalent version using pandas, but not include it in the pull request.

Thanks again for all the great feedback! This is my first attempt to contribute to sklearn, and I am still learning a lot of the internals and code base, so thank you for all the pointers!

@jnothman (Member) commented May 8, 2013

You're welcome. And thank you for your contributions! I'm closing this issue now.

(And a note: if you want to make FeatureUnion into something like DataFrameWrapper, almost all it should need is another parameter called get_field or similar, which for the dataframe case would be operator.itemgetter, and for the current case would be the identity function lambda X: X. For a transformer named name, it would use get_field(X, name) in place of X. The tricky part is, as always, clear documentation [perhaps a tutorial section on "Importing heterogeneous data" is warranted], and rigorous testing.)
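
As a rough sketch of that idea (field_union_fit_transform is an illustrative stand-in for the FeatureUnion internals, not a real API):

import numpy as np

def field_union_fit_transform(transformer_list, X, y=None,
                              get_field=lambda X, name: X):
    # The default get_field (identity) reproduces today's FeatureUnion;
    # get_field=lambda X, name: X[name] gives each named transformer
    # only its own column, i.e. the DataFrameMapper behaviour.
    Xs = [trans.fit_transform(get_field(X, name), y)
          for name, trans in transformer_list]
    return np.hstack(Xs)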

@jnothman (Member) commented Jun 5, 2013

This suggestion to extend FeatureUnion is now an issue at #2034.
