[MRG] DOC add mixed categorical / continuous example with ColumnTransformer #11197
Conversation
Thanks. It would be nice not to require a download, or at least not custom parsing code. But I'm not sure that's realistic. I'm not sure if we want to allow pandas in the examples, but we could also just do a …

@amueller I searched in …. Regarding the use of pandas, I found numpy not very elegant for holding string and float data; furthermore, not using pandas blocks us from demonstrating the true power of using …. I will proceed then to refactor the example using ….

I think that's a good idea, and yes, there's no dataset loader for a dataset with mixed features right now.

I've changed the example. To go straight to the point with the usage of …
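The pandas-based approach under discussion can be sketched roughly like this (a minimal, hypothetical stand-in for the Titanic data, using today's OneHotEncoder in place of the CategoricalEncoder that was being proposed at the time):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy frame standing in for the Titanic data
X = pd.DataFrame({
    'age': [29.0, 35.0, 54.0],
    'fare': [211.34, 8.05, 51.86],
    'sex': ['female', 'male', 'male'],
})

# With a DataFrame, ColumnTransformer can select columns by name,
# which is the main convenience pandas buys us here
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(), ['sex']),
])
Xt = preprocessor.fit_transform(X)
print(Xt.shape)  # 2 scaled numeric columns + 2 one-hot columns -> (3, 4)
```

With a plain numpy array the column specifications would have to be positional indices instead of names.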
I am surprised ….

@TomDLT I have checked the imputers in …. Setting ….

Yes, the inability to handle missing values with categorical data (both in SimpleImputer and CategoricalEncoder) is a known problem, which we hope to fix at least with a basic solution before the release. We were discussing this yesterday, and were also thinking of adding a 'constant'/'fixed' strategy for SimpleImputer.

@jorisvandenbossche I agree it could distract a bit from the core matter of the example, but I thought it was interesting to include it in order to use a …. The alternative solution would have been to use pandas' …. Also, I am tempted to extend the example a bit more and include a hyperparameter optimization with grid search, for instance, in order to demonstrate how to access the parameters within a …. Any further thoughts?
I don't know how 'far' we want to go for this example, but it could also be nice to show that you can do gridsearch on the parameters of the preprocessing steps.
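Grid-searching over the preprocessing steps nested inside a ColumnTransformer could look roughly like this (a sketch on hypothetical toy data; all names and values here are illustrative, and OneHotEncoder stands in for the CategoricalEncoder under discussion):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data with some missing numeric values
X = pd.DataFrame({'age': [20.0, np.nan, 40.0, 50.0, 30.0, np.nan] * 5,
                  'sex': ['f', 'm', 'f', 'm', 'f', 'm'] * 5})
y = [0, 1, 0, 1, 0, 1] * 5

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer()),
                      ('scaler', StandardScaler())]), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])
clf = Pipeline([('preprocessor', preprocessor),
                ('classifier', LogisticRegression())])

# The double-underscore syntax reaches into the preprocessing steps,
# so the imputation strategy is tuned alongside the classifier's C
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0],
}
grid_search = GridSearchCV(clf, param_grid, cv=2).fit(X, y)
print(grid_search.best_params_)
```

The long `preprocessor__num__imputer__strategy` key is the "barrage of underscores" mentioned later in this thread.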
# We will train our classifier with the following features
num_feats = ['age', 'fare']
cat_feats = ['embarked', 'sex']
I think you can keep 'pclass' as well. It is already integer, but it then shows that that works as well for one-hot encoding.
Sure, no problem
CategoricalEncoder('onehot-dense')
)

preprocessing_pl = ColumnTransformer(
Should we use make_column_transformer for the example?
I decided to instantiate the object directly because I already saw an example that uses make_column_transformer, and the latter doesn't have the remainder param, which I thought was worth showing. I wouldn't have any problem using make_column_transformer if that's the use we want to encourage.
There is a PR to add the remainder keyword to make_column_transformer (it was an oversight that it is not there), so that should at least be fixed soon.
Ah! Great stuff. Will switch to make_column_transformer then.
PR is merged: #11183
An alternative pandas solution would also be to ….

I commented just the same, so yes, I think that's a good idea :-)

One other suggestion, certainly if you would make the example longer, is to use the typical sphinx-gallery syntax to alternate code and text blocks; see e.g. https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_compare_reduction.py

True. Should we go for this then?

Will go for it! Thanks for the link!!

Great! I'll use this. I was starting to worry about all the # comment lines I would have to use :) Many thanks for the feedback @jorisvandenbossche! I'll incorporate these changes.

Either at the top or in a comment, you should describe the nature of the features: are they numeric representations of a category, or strings with how many values?

Why can't we do mode imputation in a pipeline on the categorical features after an ordinal encoding? It's a bit ugly, but it's not so bad...
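The suggested workaround could be sketched as follows (a hypothetical illustration using pandas category codes for the ordinal step, since the encoders of that era could not pass missing values through):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing value
s = pd.Series(['S', 'C', None, 'S', 'Q', 'S'])

# Ordinal-encode first: pandas assigns integer codes ('C'=0, 'Q'=1, 'S'=2)
# and uses -1 for missing, which we map back to NaN
codes = s.astype('category').cat.codes.astype(float)
codes[codes == -1] = np.nan

# Mode ("most_frequent") imputation then works on the numeric codes
imputed = SimpleImputer(strategy='most_frequent').fit_transform(
    codes.to_numpy().reshape(-1, 1))
print(imputed.ravel())  # the NaN becomes 2.0, the code for 'S'
```

As the comment says, it is a bit ugly (the codes lose their string labels), but it does make mode imputation possible on categorical data.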
# defined in our ``ColumnTransformer``, together with the classifier's
# hyperparameters as part of a ``Pipeline``.
# ``ColumnTransformer`` integrates well with the rest of scikit-learn,
# in particular with ``GridSearchCV`.`
backticks not in correct places
My bad. Also, should I use Sphinx syntax to reference other sklearn classes rather than simply double backticks?
This example should be referenced in place of hetero_feature_union at the end of doc/faq.rst
Will do.
You mean to use the following?

import numpy as np
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['a', 'b', np.nan, 'c'])
le.transform([np.nan, 'b', 'c', 'a', np.nan])
>>> array([3, 1, 2, 0, 3])

The output from this transformer can directly jump into the ….

Would this mean just substituting :ref: …?
No, @jnothman meant to use …. Personally I would just leave it like it is for now. The intent is that ….

You can also add this example to the list of examples in the section on ColumnTransformer in compose.rst
# We can finally fit the model using the best hyperparameters found
# and assess model performance on holdout set
print(("best logistic regression from grid search: %f"
       % grid_search.best_estimator_.score(X_test, y_test)))
I consider this an antipattern. Use grid_search.score(X_test, y_test) unless you want to explicitly overwrite the grid search's scoring (in this case accuracy) with the estimator's scoring (also accuracy). Also, the comment is misaligned; the fitting happens above it. Maybe just remove the comment, as the code is pretty self-explanatory imho.
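The distinction can be sketched on synthetic data (the two calls coincide here only because both default to accuracy for a classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid_search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0]}, cv=3)
grid_search.fit(X_train, y_train)

# Preferred: GridSearchCV delegates to the refitted best estimator,
# scoring with whatever the search itself was configured to use
score = grid_search.score(X_test, y_test)

# Reaching through best_estimator_ bypasses the search's scorer; it
# happens to give the same number here, but it is the antipattern above
assert score == grid_search.best_estimator_.score(X_test, y_test)
```

If the search had been given, say, `scoring='roc_auc'`, the two calls would diverge: `grid_search.score` would report AUC, while `best_estimator_.score` would still report accuracy.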
Looks good apart from nitpicks (and sorry about the individual comments). I think the barrage of underscores in the grid search is telling. But I'm happy we can finally actually do this!

@amueller thank you for the feedback!! I've changed the example based on your comments. Additionally, I've tried to make a better intro; I wasn't fully happy with it either.
Looks good as well. Here are a few more comments to improve the readability even further.
# - sex: categories encoded as strings {'female', 'male'}.
# - plcass: categories encoded as ints {1, 2, 3}.
num_feats = ['age', 'fare']
cat_feats = ['embarked', 'sex', 'pclass']
Better be explicit and call those variables numerical_features and categorical_features. Readability is extra important for examples.
I agree.
# We create the preprocessing pipelines for both numeric and categorical data.
num_pl = make_pipeline(SimpleImputer(), StandardScaler())
cat_pl = CategoricalEncoder('onehot-dense', handle_unknown='ignore')
Similar remark here; may I suggest numerical_transformer and categorical_transformer.
I agree.
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf_pl = make_pipeline(preprocessing_pl, LogisticRegression())
Either clf or pipeline.
I agree.
# Provisionally, use pd.fillna() to impute missing values for categorical
# features; SimpleImputer will eventually support strategy="constant".
data.loc[:, cat_feats] = data.loc[:, cat_feats].fillna(value='missing')
Do we really need the .loc[:, columns] pattern here? @jorisvandenbossche, what is your opinion on using data[categorical_features] = data[categorical_features].fillna(value='missing')?
Yes, that would work the same. (I personally like that more, and would do it like that in my code, but it's a bit subjective.)
I will change it to make it more readable; it does work the same. I've had traumas with SettingWithCopy warnings from pandas, which is why I sometimes use .loc unconsciously :-)
There were conflicts with a recent merge to master: I have moved the example (and its references) under the compose folder.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - plcass: categories encoded as ints {1, 2, 3}.
*pclass. You might want to actually describe this as "ordinal" or "ordered categories", rather than "categories".
Will do.
# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv')
Maybe it's better practice to replace master with a fixed commit hash, i.e. e7d90ee.
Will do.
Also some minor remarks; for the rest, looks good!
In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
endcoded after imputing missing values with a new category
typo: endcoded -> encoded
In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
endcoded after imputing missing values with a new category
(:code:`'missing'`).
What is the :code: role doing? (I haven't seen this Sphinx role before.) I think it can simply be double backticks.
Same as double backticks, yep. Will use the latter.
Updated with your comments :)
Thanks @partmor
Thank you all for your guidance and patience @jnothman @jorisvandenbossche @amueller @ogrisel !!!
Sorry Joris, I didn't see your comment. I don't see anything wrong with updating those PRs (though one isn't a PR yet) instead of waiting on them.

And congratulations Pedro; we look forward to more contributions from you here or elsewhere in the community.
No problem, you already did exactly what I was suggesting in the meantime :-)

Thanks @partmor!

Question here: would it be better to put the data under scikit-learn/examples-data?

I think the idea is that the (future) OpenML loader will provide such datasets (instead of continually adding new single datasets to sklearn for each use case), although a heterogeneous dataset with missing values is of course an important use case.

That's fine. Thanks a lot for the instant feedback :)

Yeah, it's weird to use my repo, I totally agree. Hopefully we can get rid of that soon. We could just ship the csv without a loader to use in examples (because CSV is such a common format), but that seems slightly strange?
Reference Issues/PRs
Fixes #11185
What does this implement/fix?
I include in the example gallery an example for ColumnTransformer that uses a dataset with categorical / continuous features.
Any other comments?
I have built the example using a dataset with mixed types stored in a numpy array. I think this is not the most correct way to treat this type of dataset (the array's dtype must be object), but I found it necessary to avoid using pandas.