[MRG] DOC add mixed categorical / continuous example with ColumnTransformer #11197
Conversation
Thanks. It would be nice not to require a download, or at least not custom parsing code. But I'm not sure that's realistic. I'm not sure if we want to allow pandas in the examples, but we could also just do a …

@amueller I searched in …. Regarding the use of pandas, I found numpy not very elegant for holding string and float data; furthermore, not using pandas blocks us from demonstrating the true power of using …. I will proceed then to refactor the example using ….

I think that's a good idea, and yes, there's no dataset loader for a dataset with mixed features right now.

I've changed the example. To go straight to the point with the usage of …
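The pandas-based approach under discussion can be sketched roughly like this (a minimal, hypothetical stand-in for the Titanic data, using today's OneHotEncoder in place of the CategoricalEncoder that was being proposed at the time):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy frame standing in for the Titanic data
X = pd.DataFrame({
    'age': [29.0, 35.0, 54.0],
    'fare': [211.34, 8.05, 51.86],
    'sex': ['female', 'male', 'male'],
})

# With a DataFrame, ColumnTransformer can select columns by name,
# which is the main convenience pandas buys us here
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(), ['sex']),
])
Xt = preprocessor.fit_transform(X)
print(Xt.shape)  # 2 scaled numeric columns + 2 one-hot columns -> (3, 4)
```

With a plain numpy array the column specifications would have to be positional indices instead of names.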
I am surprised ….

@TomDLT I have checked the imputers in …. Setting ….

Yes, the inability to handle missing values with categorical data (both in SimpleImputer and CategoricalEncoder) is a known problem, which we hope to fix at least with a basic solution before the release. We were discussing this yesterday, and were also thinking of adding a 'constant'/'fixed' strategy for SimpleImputer.

@jorisvandenbossche I agree it could distract a bit from the core matter of the example, but I thought it was interesting to include it in order to use a …. The alternative solution would have been to use pandas' …. Also, I am tempted to extend the example a bit more and include a hyperparameter optimization with grid search, for instance, in order to demonstrate how to access the parameters within a …. Any further thoughts?
I don't know how 'far' we want to go for this example, but it could also be nice to show that you can do gridsearch on the parameters of the preprocessing steps.
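Grid-searching over the preprocessing steps nested inside a ColumnTransformer could look roughly like this (a sketch on hypothetical toy data; all names and values here are illustrative, and OneHotEncoder stands in for the CategoricalEncoder under discussion):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data with some missing numeric values
X = pd.DataFrame({'age': [20.0, np.nan, 40.0, 50.0, 30.0, np.nan] * 5,
                  'sex': ['f', 'm', 'f', 'm', 'f', 'm'] * 5})
y = [0, 1, 0, 1, 0, 1] * 5

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer()),
                      ('scaler', StandardScaler())]), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])
clf = Pipeline([('preprocessor', preprocessor),
                ('classifier', LogisticRegression())])

# The double-underscore syntax reaches into the preprocessing steps,
# so the imputation strategy is tuned alongside the classifier's C
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0],
}
grid_search = GridSearchCV(clf, param_grid, cv=2).fit(X, y)
print(grid_search.best_params_)
```

The long `preprocessor__num__imputer__strategy` key is the "barrage of underscores" mentioned later in this thread.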
# We will train our classifier with the following features
num_feats = ['age', 'fare']
cat_feats = ['embarked', 'sex']
I think you can keep 'pclass' as well. It is already integer, but it then shows that that works as well for one-hot encoding.
Sure, no problem
CategoricalEncoder('onehot-dense')
)

preprocessing_pl = ColumnTransformer(
Should we use make_column_transformer for the example?
I decided to instantiate the object directly because I already saw an example that uses make_column_transformer, and the latter doesn't have the remainder param, which I thought was worth showing. I wouldn't have any problem using make_column_transformer if that's the use we want to encourage.
There is a PR to add the remainder keyword to make_column_transformer (it was an oversight that it is not there), so that should at least be fixed soon.
Ah! Great stuff. Will switch to make_column_transformer then.
PR is merged: #11183
An alternative pandas solution would also be to ….

I commented just the same, so yes, I think that's a good idea :-)

One other suggestion, certainly if you would make the example longer, is to use the typical sphinx-gallery syntax to alternate code and text blocks; see e.g. https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_compare_reduction.py

True. Should we go for this then?

Will go for it! Thanks for the link!!

Great! I'll use this. I was starting to worry about all the # comment lines I would have to use :) Many thanks for the feedback @jorisvandenbossche! I'll incorporate these changes.

Either at the top or in a comment, you should describe the nature of the features: are they numeric representations of a category, or strings with how many values?

Why can't we do mode imputation in a pipeline on the categorical features after an ordinal encoding? It's a bit ugly, but it's not so bad...
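The suggested workaround could be sketched as follows (a hypothetical illustration using pandas category codes for the ordinal step, since the encoders of that era could not pass missing values through):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing value
s = pd.Series(['S', 'C', None, 'S', 'Q', 'S'])

# Ordinal-encode first: pandas assigns integer codes ('C'=0, 'Q'=1, 'S'=2)
# and uses -1 for missing, which we map back to NaN
codes = s.astype('category').cat.codes.astype(float)
codes[codes == -1] = np.nan

# Mode ("most_frequent") imputation then works on the numeric codes
imputed = SimpleImputer(strategy='most_frequent').fit_transform(
    codes.to_numpy().reshape(-1, 1))
print(imputed.ravel())  # the NaN becomes 2.0, the code for 'S'
```

As the comment says, it is a bit ugly (the codes lose their string labels), but it does make mode imputation possible on categorical data.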
# defined in our ``ColumnTransformer``, together with the classifier's
# hyperparameters as part of a ``Pipeline``.
# ``ColumnTransformer`` integrates well with the rest of scikit-learn,
# in particular with ``GridSearchCV`.`
backticks not in correct places
My bad. Also, should I use Sphinx syntax to reference other sklearn classes rather than simply double backticks?
This example should be referenced in place of hetero_feature_union at the end of doc/faq.rst
Will do.
You mean to use the following?

import numpy as np
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['a', 'b', np.nan, 'c'])
le.transform([np.nan, 'b', 'c', 'a', np.nan])
>>> array([3, 1, 2, 0, 3])

The output from this transformer can directly jump into the ….

Would this mean just substituting :ref: …?
No, @jnothman meant to use …. Personally I would just leave it like it is for now. The intent is that ….

You can also add this example to the list of examples in the section on ColumnTransformer in compose.rst
# We can finally fit the model using the best hyperparameters found
# and assess model performance on holdout set
print(("best logistic regression from grid search: %f"
       % grid_search.best_estimator_.score(X_test, y_test)))
I consider this an antipattern. Use grid_search.score(X_test, y_test) unless you want to explicitly overwrite the grid search's scoring (in this case accuracy) with the estimator's scoring (also accuracy). Also, the comment is misaligned; the fitting happens above it. Maybe just remove the comment, as the code is pretty self-explanatory imho.
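The distinction can be sketched on synthetic data (the two calls coincide here only because both default to accuracy for a classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid_search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0]}, cv=3)
grid_search.fit(X_train, y_train)

# Preferred: GridSearchCV delegates to the refitted best estimator,
# scoring with whatever the search itself was configured to use
score = grid_search.score(X_test, y_test)

# Reaching through best_estimator_ bypasses the search's scorer; it
# happens to give the same number here, but it is the antipattern above
assert score == grid_search.best_estimator_.score(X_test, y_test)
```

If the search had been given, say, `scoring='roc_auc'`, the two calls would diverge: `grid_search.score` would report AUC, while `best_estimator_.score` would still report accuracy.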
Looks good apart from nitpicks (and sorry about the individual comments). I think the barrage of underscores in the grid search is telling. But I'm happy we can finally actually do this!

@amueller thank you for the feedback!! I've changed the example based on your comments. Additionally, I've tried to make a better intro; I wasn't fully happy with it either.
Looks good as well. Here are a few more comments to improve the readability even further.
# - sex: categories encoded as strings {'female', 'male'}.
# - plcass: categories encoded as ints {1, 2, 3}.
num_feats = ['age', 'fare']
cat_feats = ['embarked', 'sex', 'pclass']
Better be explicit and call those variables numerical_features and categorical_features. Readability is extra important for examples.
I agree.
# We create the preprocessing pipelines for both numeric and categorical data.
num_pl = make_pipeline(SimpleImputer(), StandardScaler())
cat_pl = CategoricalEncoder('onehot-dense', handle_unknown='ignore')
Similar remark here; may I suggest numerical_transformer and categorical_transformer.
I agree.
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf_pl = make_pipeline(preprocessing_pl, LogisticRegression())
Either clf or pipeline.
I agree.
# Provisionally, use pd.fillna() to impute missing values for categorical
# features; SimpleImputer will eventually support strategy="constant".
data.loc[:, cat_feats] = data.loc[:, cat_feats].fillna(value='missing')
Do we really need the .loc[:, columns] pattern here? @jorisvandenbossche, what is your opinion on using data[categorical_features] = data[categorical_features].fillna(value='missing')?
Yes, that would work the same. (I personally like that more, and would do it like that in my code, but it's a bit subjective.)
I will change it to make it more readable; it does work the same. I've had traumas with SettingWithCopy warnings from pandas, which is why I sometimes use .loc unconsciously :-)
There were conflicts with a recent merge to master: I have moved the example (and its references) under the compose folder.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - plcass: categories encoded as ints {1, 2, 3}.
*pclass. You might want to actually describe this as "ordinal" or "ordered categories", rather than "categories".
Will do.
# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv')
Maybe it's better practice to replace master with a fixed commit hash, i.e. e7d90ee.
Will do.
Also some minor remarks; for the rest, looks good!
In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
endcoded after imputing missing values with a new category
typo: endcoded -> encoded
In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
endcoded after imputing missing values with a new category
(:code:`'missing'`).
What is the :code: role doing? (I haven't seen this Sphinx role before.) I think it can simply be double backticks.
Same as double backticks, yep. Will use the latter.
Updated with your comments :)
Thanks @partmor
Thank you all for your guidance and patience @jnothman @jorisvandenbossche @amueller @ogrisel !!!
Sorry Joris, I didn't see your comment. I don't see anything wrong with updating those PRs (though one isn't a PR yet) instead of waiting on them.

And congratulations Pedro; we look forward to more contributions from you here or elsewhere in the community.
No problem, you already did exactly what I was suggesting in the meantime :-)

Thanks @partmor!

Question here: would it be better to put the data under scikit-learn/examples-data?

I think the idea is that the (future) OpenML loader will provide such datasets (instead of continually adding new single datasets to sklearn for each use case), although a heterogeneous dataset with missing values is of course an important use case.

That's fine. Thanks a lot for the instant feedback :)

Yeah, it's weird to use my repo, I totally agree. Hopefully we can get rid of that soon. We could just ship the csv without a loader to use in examples (because CSV is such a common format), but that seems slightly strange?
Reference Issues/PRs
Fixes #11185
What does this implement/fix?
I include in the example gallery an example for ColumnTransformer that uses a dataset with categorical / continuous features.
Any other comments?
I have built the example using a dataset with mixed types stored in a numpy array. I think this is not the most correct way to treat this type of dataset (the array's dtype must be object), but I found it necessary to avoid using pandas.