Support explain_weights(pipeline)#158
Introduces new eli5.transform.transform_feature_names singledispatch
Codecov Report
@@ Coverage Diff @@
## master #158 +/- ##
==========================================
- Coverage 97.25% 94.01% -3.24%
==========================================
Files 39 41 +2
Lines 2405 2575 +170
Branches 452 492 +40
==========================================
+ Hits 2339 2421 +82
- Misses 34 113 +79
- Partials 32 41 +9
|
|
(Note: @kmike's wish that we only calculate feature names when we've determined that they are important requires knowing the dependencies between input and output features for each transformer, i.e. something like scikit-learn/enhancement_proposals#5)
def explain_weights_pipeline(estimator, feature_names=None, **kwargs):
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    feature_names = transform_feature_names(transform_pipeline, feature_names)
In other cases the feature_names argument of explain_weights has higher priority than feature names extracted from the estimator, i.e. it allows overriding feature names completely. The way feature_names are used here looks fine though. But we should document this gotcha.
The passed feature names (sort of) take priority here and are merely transformed, except where there is a vectoriser in the pipeline. But I think I should be using eli5's get_feature_names here to allow for a vectoriser. Is that correct?
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    feature_names = transform_feature_names(transform_pipeline, feature_names)
    out = explain_weights(last_estimator, feature_names=feature_names)
It makes sense to pass **kwargs here.
Yes, I'd intended to :)
Fixed
def _pipeline_names(est, in_names=None):
    names = in_names
    for name, trans in est.steps:
        if trans is not None:
Changes released in scikit-learn 0.18 (scikit-learn/scikit-learn#1769) allow parts of a pipeline or union to be replaced or disabled in a grid search.
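To make the `trans is not None` check concrete, here is a minimal sketch of propagating feature names through a pipeline in which one step has been disabled. It uses toy stand-ins rather than scikit-learn: `ToyPipeline`, `Upper`, `KeepFirst` and the `transform_names` method are all hypothetical names invented for illustration, not part of eli5 or scikit-learn.

```python
class Upper:
    """Toy transformer (hypothetical): one output name per input name."""

    def transform_names(self, names):
        return [name.upper() for name in names]


class KeepFirst:
    """Toy selector (hypothetical): keeps only the first k features."""

    def __init__(self, k):
        self.k = k

    def transform_names(self, names):
        return names[:self.k]


class ToyPipeline:
    """Stand-in for a pipeline: holds a list of (step_name, transformer) pairs."""

    def __init__(self, steps):
        self.steps = steps


def pipeline_names(pipeline, in_names=None):
    """Pass names through each step in order, skipping disabled (None) steps."""
    names = in_names
    for _step_name, trans in pipeline.steps:
        if trans is not None:  # a disabled step leaves the names unchanged
            names = trans.transform_names(names)
    return names


pipe = ToyPipeline([("up", Upper()), ("disabled", None), ("sel", KeepFirst(2))])
result = pipeline_names(pipe, ["a", "b", "c"])
print(result)
```

Without the `None` check, the disabled middle step would raise an `AttributeError` as soon as the loop tried to call a method on it.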
@singledispatch
def transform_feature_names(transformer, in_names=None):
    if hasattr(transformer, 'get_feature_names'):
        return transformer.get_feature_names()
|
Thanks @jnothman! If you don't have time we can pick up the development. I've left some comments in order to not forget about them. |
|
How much do you think we should prefabricate feature name designs, and how much should we leave to the user? As I said, we could import implementations of transform_feature_names for Pipeline, FeatureUnion and SelectorMixin by default and leave others unimplemented, or provide default implementations in some optional import like eli5.sklearn.experimental. The issue is that sometimes a user will want to see "topic1" for brevity, but a list of the top features on the topic would be more helpful!
|
It may similarly be useful to show "scale(featurename)" but some users would rather the scale transform elided. There are many ways to name features. Which of course leads to the possibility of feature descriptions, but I think that can wait for later |
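The trade-off between "scale(featurename)" and an elided transform can be sketched with a small factory for one-to-one name mappings. This is only an illustration of the naming choice: `make_featurewise_namer` is a hypothetical helper written for this example and is not the `make_tfn_featurewise` function from the patch, whose signature is not shown in this thread.

```python
def make_featurewise_namer(func_name=None):
    """Build a one-to-one name mapper (hypothetical helper).

    With a func_name, each feature is wrapped as "func_name(feature)".
    With func_name=None, the transform is elided and names pass through.
    """
    def namer(in_names):
        if func_name is None:
            return list(in_names)
        return [f"{func_name}({name})" for name in in_names]

    return namer


verbose = make_featurewise_namer("scale")  # shows the transform in the name
terse = make_featurewise_namer(None)       # hides the transform

verbose_names = verbose(["age", "income"])
terse_names = terse(["age", "income"])
print(verbose_names)
print(terse_names)
```

Either style carries the same information about which input feature a weight belongs to; the difference is only in how much of the preprocessing is visible to the reader of the explanation.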
…ntations Make some helpers public
|
More untested code pushed :) |
|
I now see that it should be trivial to support the same feature in |
I think providing as many default feature names as possible is a good thing. As for topics, what do you think about showing only a few top features? |
Good points. |
# TODO: OneHotEncoder. scikit-learn#6441 doesn't appear complete

transform_feature_names.register(TfidfTransformer)(
    make_tfn_featurewise('tfidf', lambda est: len(est.idf_)))
make_tfn_featurewise has only one argument.
This + 724c423 should make mypy env pass. I haven't figured out how to make a pull request to your pull request :)
You are able to directly edit my branch, I think. What trouble do you have in making a PR to my branch?
if abs:
    weights = np.abs(weights)
order = np.argsort(weights, axis=1)[::-1]
top_names = np.take(in_names, order[:, :max_features])
Probably not too important, but it seems this could be made faster using eli5.utils.argsort_k_largest.
Curious. Have you benchmarked argsort_k_largest? Perhaps I'm just used to supporting old versions of numpy where argpartition requires argsort anyway.
(and if you want speed, use a.take(b, mode='clip') rather than a[b])
Mm, I believe there was a use case where it mattered, but now I can't recall it; for small-to-medium examples argsort works just fine.
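For reference, the two approaches under discussion can be compared side by side. The sketch below is an assumption-laden illustration using plain NumPy: `top_k_by_argsort` and `top_k_by_argpartition` are hypothetical names, and this is not the implementation of eli5.utils.argsort_k_largest. It assumes a 2-D weights array, k >= 1, and no ties among the values (with ties the two functions may order equal entries differently).

```python
import numpy as np


def top_k_by_argsort(weights, k):
    """Column indices of the k largest entries per row, largest first (full sort)."""
    order = np.argsort(weights, axis=1)[:, ::-1]  # reverse along the column axis
    return order[:, :k]


def top_k_by_argpartition(weights, k):
    """Same result for distinct values, sorting only the k selected entries per row."""
    k = min(k, weights.shape[1])
    # argpartition places the k largest entries of each row in the last k slots,
    # in no particular order
    part = np.argpartition(weights, -k, axis=1)[:, -k:]
    rows = np.arange(weights.shape[0])[:, None]
    # order just those k candidates, largest first
    inner = np.argsort(weights[rows, part], axis=1)[:, ::-1]
    return part[rows, inner]


weights = np.array([[0.1, 0.7, 0.2, 0.9],
                    [0.5, 0.3, 0.8, 0.0]])
in_names = np.array(["w", "x", "y", "z"])

idx = top_k_by_argpartition(weights, 2)
top_names = in_names[idx]
print(top_names)
```

As noted in the thread, the difference is unlikely to matter for small or medium inputs; the partial-selection version only pays off when the number of features is large relative to k.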
# XXX Should these generic things be in eli5.transform?
# I've left them here to make use of eli5.sklearn.utils.get_feature_names
What do you think of this comment, @kmike? Is eli5.sklearn.utils.get_feature_names in the right place?
There are two kinds of "scikit-learningness": things which work with objects from scikit-learn and things which work with objects which follow scikit-learn API ideas. get_feature_names (which is used in xgboost as well) is currently in eli5.sklearn because of the second meaning.
|
@kmike what do you propose to move this forward? (I've noticed not a great deal of eli5 work lately in general.) Are we happy with the API and transformer coverage, and I should find time to write tests? |
|
@jnothman I'd love to have it moved forward :) To merge it we need tests. I like the API; transformer coverage is great, and we can always improve coverage in the future. I'd also register everything by default, without marking it as experimental and requiring the user to call a register function. This way explain_weights works in more cases, and if users want to override the way features are formatted it is the same amount of work (registering their own function). Other than that it looks fine.
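The "register by default, let users override" idea relies on how `functools.singledispatch` treats repeated registration: registering a second function for the same class replaces the first. A minimal sketch with toy classes follows; `Scaler` and `Selector` are hypothetical stand-ins, and the fallback that raises `NotImplementedError` is an assumption of this example rather than the behaviour of the patch.

```python
from functools import singledispatch


class Scaler:
    """Toy transformer (hypothetical): one output feature per input feature."""


class Selector:
    """Toy selector (hypothetical): keeps the features where mask is True."""

    def __init__(self, mask):
        self.mask = mask


@singledispatch
def transform_feature_names(transformer, in_names=None):
    # Assumed fallback for this sketch: unknown transformers are an error.
    raise NotImplementedError(f"no feature naming for {type(transformer).__name__}")


# "Default" implementations, as a library might ship them.
@transform_feature_names.register(Scaler)
def _scaler_names(transformer, in_names=None):
    return [f"scale({name})" for name in in_names]


@transform_feature_names.register(Selector)
def _selector_names(transformer, in_names=None):
    return [name for name, keep in zip(in_names, transformer.mask) if keep]


default_names = transform_feature_names(Scaler(), ["age", "income"])
selected_names = transform_feature_names(Selector([True, False]), ["age", "income"])


# A user who prefers the scaling elided registers their own handler;
# this replaces the default registered for Scaler above.
@transform_feature_names.register(Scaler)
def _scaler_names_terse(transformer, in_names=None):
    return list(in_names)


overridden_names = transform_feature_names(Scaler(), ["age", "income"])
print(default_names, selected_names, overridden_names)
```

Because the override uses the same `register` call as the defaults, shipping defaults costs users nothing extra when they want different names.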
def explain_weights_pipeline(estimator, feature_names=None, **kwargs):
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    if hasattr(feature_names, 'get_feature_names'):
What is this condition for? I'd expect feature_names to be a list or FeatureNames instance.
This is to reflect some of the behaviour of eli5.sklearn.utils.get_feature_names. If you don't care for it, I'll remove it.
Never mind. Cleaned up.
|
I think that might be for an old conception. I'm slowly working through tests and docs, but there's a lot to test. Starting bottom up.
|
|
@jnothman thanks for the update! What do you think about merging it partially, e.g. only the API part, Pipeline, FeatureUnion, and maybe a couple of basic transformers, to make it easier to test & document? |
|
Might be sensible. Either way, I don't think I'll have anything for the next week.
|
|
I'll pull out a PR with just the basic ( |
|
Where does further documentation belong? |
|
@jnothman what further documentation are you thinking of? I think an example of how to register your own transform_feature_names handler could go in the sklearn library docs. |
|
Yes, fair enough.
|
Towards fixing #15, allowing explain_weights for Pipelines. Does not handle explain_predictions; tracking the provenance of features back to parts of text may be tricky in a convoluted pipeline/union structure.
No tests yet.
Introduces new eli5.transform.transform_feature_names singledispatch, which could do with more implementations.
I think the exact forms of the names should remain experimental... or indeed we could refuse to provide default implementations for anything but vectorizers, selectors, Pipeline and FeatureUnion, letting users register their own.
Rubbish example:
I'm not sure I'll be able to complete this patch any time soon, though I am keen to have something like this merged! Help is welcome.