Tentative support for Pandas Dataframe with example using Titanic dataset #1949
* Modified check_arrays function to leave Pandas data frames alone
* Modified is_multilabel function to support Pandas Series
* Added new DataFrameMapper class (inspired by @benhamner & https://github.com/paulgb/sklearn-pandas)
* Added new Titanic dataset
* Added Titanic example to demonstrate DataFrameMapper use case
Generally, scikit-learn avoids transforming input parameters to estimators in `__init__`, so that they may be changed using `set_params`. Instead, it would be usual to do this validation and transformation in `fit()`.
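To make the convention in the comment above concrete, here is a minimal sketch in plain Python. `ScaleByFactor` and its `factor` parameter are hypothetical names invented for illustration, and the hand-written `get_params`/`set_params` only mimic the idea; this is not scikit-learn's own implementation. The constructor stores the argument untouched, and all validation and conversion happens in `fit()`, so a value changed later through `set_params` goes through the same checks.

```python
class ScaleByFactor:
    """Hypothetical estimator, used only to illustrate the convention."""

    def __init__(self, factor=1.0):
        # Store the argument exactly as given: no validation, no conversion.
        self.factor = factor

    def get_params(self):
        return {"factor": self.factor}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"unknown parameter: {name}")
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # Validate and convert here, so values set via set_params are checked too.
        factor = float(self.factor)
        if factor <= 0:
            raise ValueError("factor must be positive")
        # Derived state is kept separate from the constructor parameter.
        self.factor_ = factor
        return self

    def transform(self, X):
        return [[value * self.factor_ for value in row] for row in X]


est = ScaleByFactor(factor="2")   # a string is accepted here, unchecked
est.set_params(factor=3)          # the parameter can still be replaced
scaled = est.fit([[1, 2]]).transform([[1, 2]])
```

Because `__init__` never rewrites `factor`, `get_params()` always returns what the caller passed in, which is what makes parameter changes via `set_params` predictable.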
Thanks for the PR! I don't see why in this form a Is the general idea that features available in a dataset might be selected among and then transformed with things like I think interfacing with external tools such as |
I agree that it is out of scope. If it was not much code to give a stable interface, I might be convinced otherwise.
Thank you for the feedback. I definitely have a few things to clean up, and I will work on a revised pull request.

As for the dataframe support, all I really intended was to get the DataFrameMapper added, since that is probably how most people will use it. Raw dataframes are not very interesting on their own, as they can easily be cast to numpy arrays anyway. As time allows, I will try to write a new common test that runs all estimators against a dataframe and compares the output to that obtained with raw numpy arrays.

@jnothman, my goal was not to add an external dependency, but just to make sure sklearn treats dataframe objects nicely enough that either labels (series) or the input matrix (dataframes) can be used as inputs for sklearn estimators. The main advantage of dataframes is their clean and powerful interface for data transformations (grouping, missing value handling, joining, feature derivation, etc.), which are common tasks in real-life machine learning problems. The DataFrameMapper class just lets you easily transform each feature column into properly encoded categorical or scaled features for input into classical estimators. Another typical use would be to run the TfidfVectorizer against a text-based column. The output is a clean numpy array.

Also, I did see TravisCI complain about pandas. What is the most elegant way to deal with this if some of my tests require the pandas library to be installed?

Thanks again for taking the time to review this pull request!
I'm not entirely clear on what you want to do: do you want the estimators to accept dataframes or do you want to give a tool to convert them to numpy arrays more easily?
@arnaudsj, you've basically submitted two PRs in one:
Re 1.: There's no reason scikit learn shouldn't more generally support objects with
Re 2.: I don't think this is a bad idea, but it can be proposed without being specific to pandas. Generalise it to transforming a sequence of mappings -- or a mapping of sequences -- into a feature matrix, according to a transformer for each selected key. But you'll find similar things that go part of the way in Good luck! |
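The pandas-free generalisation suggested in "Re 2." could look roughly like the sketch below: a mapping of sequences is turned into a feature matrix, with one transformer per selected key. Everything here is an assumption made for illustration. `map_features`, `one_hot` and `min_max` are hypothetical helpers, not scikit-learn or sklearn-pandas API, and the passenger values are invented toy data rather than rows from the Titanic dataset.

```python
def map_features(columns, transformers):
    """Build a row-major feature matrix from a mapping of sequences.

    columns: dict mapping a key to a sequence of raw values, one per sample.
    transformers: list of (key, function) pairs; each function receives the
        whole column and returns one list of floats per sample.
    Keys that are not listed in transformers are ignored.
    """
    blocks = [transform(columns[key]) for key, transform in transformers]
    n_samples = len(blocks[0]) if blocks else 0
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all transformed columns must have the same length")
    # Concatenate the per-key feature blocks side by side for each sample.
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_samples)
    ]


def one_hot(values):
    """Encode a categorical column; categories are ordered alphabetically."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def min_max(values):
    """Scale a numeric column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
    return [[(v - lo) / span] for v in values]


# Invented toy data: the "name" column is present but not selected.
passengers = {
    "sex": ["male", "female", "female"],
    "age": [20, 30, 40],
    "name": ["A", "B", "C"],
}
matrix = map_features(passengers, [("sex", one_hot), ("age", min_max)])
# matrix == [[0.0, 1.0, 0.0], [1.0, 0.0, 0.5], [1.0, 0.0, 1.0]]
```

Since the helper only needs key lookup and iteration, a dataframe-like object that supports `columns[key]` could be passed in without the library depending on pandas.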
Come to think of it, your wrapper is a trivial extension to |
I don't understand why such a helper to turn dataframe objects into the
As stated, we cannot have a dependence on pandas. Pandas requires a |
@jnothman thank you for the very detailed feedback. Let me reshape my pull request to be more generic and not include any dependency on pandas per se. I will also break it into two separate pull requests, as you suggested, for better clarity.

Come to think of it, I will also create a third one to add the Titanic dataset, unless nobody is interested. I will provide a URL linking to the equivalent version using pandas, but will not include it in the pull request.

Thanks again for all the great feedback! It is my first attempt to contribute to sklearn and I am still learning a lot about the internals and code base. So thank you for all the pointers!
You're welcome. And thank you for your contributions! I'm closing this issue now. (And a note: if you want to make |
This suggestion to extend |
Looking for feedback since it is my first contribution to sklearn.
Thank you in advance!