Support explain_weights(pipeline)#158

Closed
jnothman wants to merge 12 commits into TeamHG-Memex:master from jnothman:pipeline

Conversation

@jnothman
Contributor

@jnothman jnothman commented Jan 31, 2017

Towards fixing #15, allowing explain_weights for Pipelines. Does not handle explain_predictions; tracking the provenance of features back to parts of text may be tricky in a convoluted pipeline/union structure.

No tests yet.

Introduces new eli5.transform.transform_feature_names singledispatch, which could do with more implementations.

I think the exact forms of the names should remain experimental... or indeed we could refuse to provide default implementations for anything but vectorizers, selectors, Pipeline and FeatureUnion, letting users register their own.
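The registration mechanism described above can be exercised without any eli5 internals. A minimal sketch of how a user-registered handler would plug into the singledispatch hook (MaskSelector is a hypothetical stand-in class, not part of the PR):

```python
from functools import singledispatch

class MaskSelector:
    """Hypothetical stand-in for a feature selector (e.g. a SelectorMixin)."""
    def __init__(self, mask):
        self.mask = mask

@singledispatch
def transform_feature_names(transformer, in_names=None):
    # Default: fall back to the transformer's own feature names, as in the PR.
    if hasattr(transformer, 'get_feature_names'):
        return transformer.get_feature_names()
    return in_names

@transform_feature_names.register(MaskSelector)
def _selector_names(transformer, in_names=None):
    # A selector keeps only a subset of its input names.
    return [name for name, kept in zip(in_names, transformer.mask) if kept]

print(transform_feature_names(MaskSelector([True, False, True]),
                              ['age', 'height', 'income']))
# -> ['age', 'income']
```

Dispatch is on the transformer's type, so third-party transformers get names without eli5 needing to know about them.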

Rubbish example:

from sklearn import datasets, feature_extraction, feature_selection, linear_model, pipeline
from eli5 import explain_weights
from eli5.sklearn.transform import register_experimental_feature_names
from IPython.display import display

register_experimental_feature_names()
est = pipeline.make_pipeline(
    pipeline.FeatureUnion([
        ('words', pipeline.make_pipeline(
            feature_extraction.text.CountVectorizer(),
            feature_extraction.text.TfidfTransformer())),
        ('chars', feature_extraction.text.CountVectorizer(analyzer='char'))]),
    feature_selection.SelectKBest(feature_selection.chi2, k=10),
    linear_model.LogisticRegression())
bunch = datasets.fetch_20newsgroups(
    categories=['alt.atheism', 'talk.religion.misc', 'rec.autos'])
est.fit(bunch.data, bunch.target)  # fit on the document text, not the filenames

display(explain_weights(est, target_names=bunch.target_names))

I'm not sure I'll be able to complete this patch any time soon, though I am keen to have something like this merged! Help is welcome.

Introduces new eli5.transform.transform_feature_names singledispatch
@codecov-io

codecov-io commented Jan 31, 2017

Codecov Report

Merging #158 into master will decrease coverage by 3.23%.
The diff coverage is 48.53%.

@@            Coverage Diff             @@
##           master     #158      +/-   ##
==========================================
- Coverage   97.25%   94.01%   -3.24%     
==========================================
  Files          39       41       +2     
  Lines        2405     2575     +170     
  Branches      452      492      +40     
==========================================
+ Hits         2339     2421      +82     
- Misses         34      113      +79     
- Partials       32       41       +9
Impacted Files Coverage Δ
eli5/sklearn/__init__.py 100% <100%> (ø) ⬆️
eli5/__init__.py 83.33% <100%> (+0.57%) ⬆️
eli5/sklearn/explain_weights.py 100% <100%> (ø) ⬆️
eli5/sklearn/transform.py 43.33% <43.33%> (ø)
eli5/transform.py 66.66% <66.66%> (ø)
eli5/_feature_names.py 96.29% <66.66%> (-0.88%) ⬇️

@jnothman
Contributor Author

jnothman commented Jan 31, 2017

(Note: @kmike's wish that we only calculate feature names when we've determined that they are important relies on knowing the dependencies between input and output features for each transformer, i.e. something like scikit-learn/enhancement_proposals#5.)

def explain_weights_pipeline(estimator, feature_names=None, **kwargs):
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    feature_names = transform_feature_names(transform_pipeline, feature_names)
Contributor

In other cases the feature_names argument of explain_weights has higher priority than feature names extracted from the estimator, i.e. it allows overriding the feature names completely. The way feature_names is used here looks fine though, but we should document this gotcha.

Contributor Author

The passed feature names (sort of) take priority here and are merely transformed, except where there is a vectoriser in the pipeline. But I think I should be using eli5's get_feature_names here to allow a vectoriser. Is that correct?

Comment thread eli5/sklearn/explain_weights.py Outdated
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    feature_names = transform_feature_names(transform_pipeline, feature_names)
    out = explain_weights(last_estimator, feature_names=feature_names)
Contributor

it makes sense to pass **kwargs here

Contributor Author

Yes, I'd intended to :)
Fixed

Comment thread eli5/sklearn/transform.py
def _pipeline_names(est, in_names=None):
    names = in_names
    for name, trans in est.steps:
        if trans is not None:
Contributor

when can trans be None?

Contributor Author

Changes released in scikit-learn 0.18 (scikit-learn/scikit-learn#1769) allow parts of a pipeline or union to be replaced or disabled in a grid search.

Contributor

Thanks for the explanation!
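The None-step case discussed above can be reproduced without scikit-learn. A toy sketch of why the name-propagation loop must skip disabled steps (Upper, its feature_names_out method and propagate_names are illustrative, not part of the PR):

```python
class Upper:
    # Toy transformer stand-in that just rewrites feature names.
    def feature_names_out(self, names):  # hypothetical method
        return [n.upper() for n in names]

def propagate_names(steps, in_names):
    # Mirrors the PR's loop: a step set to None (as a grid search can do
    # since scikit-learn 0.18) is a no-op, so names pass through untouched.
    names = in_names
    for step_name, trans in steps:
        if trans is not None:
            names = trans.feature_names_out(names)
    return names

print(propagate_names([('up', Upper()), ('disabled', None)], ['a', 'b']))
# -> ['A', 'B']
```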

Comment thread eli5/transform.py
@singledispatch
def transform_feature_names(transformer, in_names=None):
    if hasattr(transformer, 'get_feature_names'):
        return transformer.get_feature_names()
Contributor

❤️

@kmike
Contributor

kmike commented Jan 31, 2017

Thanks @jnothman! If you don't have time, we can pick up the development. I've left some comments so as not to forget about them.

@jnothman
Contributor Author

How much do you think we should prefabricate feature name designs, and how much should we leave to the user? As I said, we could import implementations of transform_feature_names for Pipeline, FeatureUnion and SelectorMixin by default and leave others unimplemented, or provide default implementations in some optional import like eli5.sklearn.experimental. The issue is that sometimes a user will want to see "topic1" for brevity, but a list of the top features in the topic would be more helpful!

@jnothman
Contributor Author

It may similarly be useful to show "scale(featurename)", but some users would rather have the scale transform elided. There are many ways to name features.

Which of course leads to the possibility of feature descriptions, but I think that can wait for later.

@jnothman jnothman closed this Jan 31, 2017
@jnothman jnothman reopened this Feb 1, 2017
@jnothman
Contributor Author

jnothman commented Feb 1, 2017

More untested code pushed :)

@jnothman
Contributor Author

jnothman commented Feb 8, 2017

I now see that it should be trivial to support the same feature in explain_prediction

@kmike
Contributor

kmike commented Feb 8, 2017

How much do you think we should prefabricate feature name designs, and how much should we leave to the user? As I said, we could import implementations of transform_feature_names for Pipeline, FeatureUnion and SelectorMixin by default and leave others unimplemented, or provide default implementations in some optional import like eli5.sklearn.experimental. The issue is that sometimes a user will want to see "topic1" for brevity, but a list of the top features in the topic would be more helpful!

I think providing as many default feature names as possible is a good thing. As for topics, what do you think about showing only a few top features? eli5.formatters.html could allow expanding topics on the client, or just show more words in the title attribute.

@kmike
Contributor

kmike commented Feb 8, 2017

It may similarly be useful to show "scale(featurename)", but some users would rather have the scale transform elided. There are many ways to name features.
Which of course leads to the possibility of feature descriptions, but I think that can wait for later.

Good points.
I think providing defaults like scale(featurename) is fine; customizability can be added / documented later if there is demand.
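Eliding versus decorating could then be a one-line re-registration on the user's side. A sketch, with Scaler a hypothetical stand-in class rather than anything in eli5 or scikit-learn:

```python
from functools import singledispatch

class Scaler:
    """Hypothetical stand-in for a scaling transformer."""

@singledispatch
def transform_feature_names(transformer, in_names=None):
    return in_names

@transform_feature_names.register(Scaler)
def _scale_names(transformer, in_names=None):
    # Default: make the transform visible in reports.
    return ['scale(%s)' % name for name in in_names]

print(transform_feature_names(Scaler(), ['age']))  # -> ['scale(age)']

# A user who prefers the transform elided just re-registers the handler:
transform_feature_names.register(Scaler, lambda trans, in_names=None: in_names)
print(transform_feature_names(Scaler(), ['age']))  # -> ['age']
```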

Comment thread eli5/sklearn/transform.py Outdated
# TODO: OneHotEncoder. scikit-learn#6441 doesn't appear complete

transform_feature_names.register(TfidfTransformer)(
    make_tfn_featurewise('tfidf', lambda est: len(est.idf_)))
Contributor

make_tfn_featurewise has only 1 argument

Contributor

This + 724c423 should make mypy env pass. I haven't figured out how to make a pull request to your pull request :)

Contributor Author

You are able to directly edit my branch, I think. What trouble do you have in making a PR to my branch?

Comment thread eli5/sklearn/transform.py
if abs:
    weights = np.abs(weights)
order = np.argsort(weights, axis=1)[::-1]
top_names = np.take(in_names, order[:, :max_features])
Contributor

probably not too important, but it seems this could be made faster using eli5.utils.argsort_k_largest

Contributor Author

Curious. Have you benchmarked argsort_k_largest? Perhaps I'm just used to supporting old versions of numpy where argpartition requires argsort anyway.

Contributor Author

(and if you want speed, use a.take(b, mode='clip') rather than a[b])

Contributor

mm, I believe there was a use case where it mattered, but now I can't recall it; for small-to-medium examples argsort works just fine.
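For reference, the argpartition-based top-k that argsort_k_largest exploits can be sketched with plain numpy (top_k_names here is an illustrative helper, not eli5 code):

```python
import numpy as np

def top_k_names(weights, in_names, k, use_abs=True):
    # Per-row top-k feature names. np.argpartition is O(n) per row versus
    # O(n log n) for a full argsort; only the k winners are then sorted.
    w = np.abs(weights) if use_abs else weights
    part = np.argpartition(-w, k - 1, axis=1)[:, :k]   # unordered top-k
    rows = np.arange(w.shape[0])[:, None]
    order = part[rows, np.argsort(-w[rows, part], axis=1)]
    # take(..., mode='clip') is the faster fancy-indexing variant noted above.
    return np.take(np.asarray(in_names), order, mode='clip')

print(top_k_names(np.array([[0.1, -0.9, 0.5]]), ['a', 'b', 'c'], 2))
# -> [['b' 'c']]
```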

Comment thread eli5/sklearn/transform.py


# XXX Should these generic things be in eli5.transform?
# I've left them here to make use of eli5.sklearn.utils.get_feature_names
Contributor Author

What do you think of this comment, @kmike? Is eli5.sklearn.utils.get_feature_names in the right place?

Contributor

There are two kinds of "scikit-learningness": things which work with objects from scikit-learn and things which work with objects which follow scikit-learn API ideas. get_feature_names (which is used in xgboost as well) is currently in eli5.sklearn because of the second meaning.

@jnothman
Contributor Author

jnothman commented Mar 1, 2017

@kmike what do you propose to move this forward? (I've noticed not a great deal of eli5 work lately in general.) Are we happy with the API and transformer coverage, and I should find time to write tests?

@kmike
Contributor

kmike commented Mar 1, 2017

@jnothman I'd love to have it moved forward :) To merge it we need tests. I like the API; transformer coverage is great, and we can always improve it in the future.

I'd also register everything by default, without marking it as experimental and requiring the user to call a register function. This way explain_weights works in more cases, and if users want to override how features are formatted it is the same amount of work (registering their own function).

Other than that it looks fine.

Comment thread eli5/sklearn/explain_weights.py Outdated
def explain_weights_pipeline(estimator, feature_names=None, **kwargs):
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    if hasattr(feature_names, 'get_feature_names'):
Contributor

What is this condition for? I'd expect feature_names to be a list or FeatureNames instance.

Contributor Author

This is to reflect some of the behaviour of eli5.sklearn.utils.get_feature_names. If you don't care for it, I'll remove it.

Contributor Author

Never mind. Cleaned up.

@jnothman
Contributor Author

jnothman commented Apr 6, 2017 via email

@kmike
Contributor

kmike commented Apr 6, 2017

@jnothman thanks for the update! What do you think about merging it partially, e.g. only the API part, Pipeline, FeatureUnion, and maybe a couple of basic transformers, to make it easier to test & document?

@jnothman
Contributor Author

jnothman commented Apr 6, 2017 via email

@jnothman
Contributor Author

jnothman commented May 1, 2017

I'll pull out a PR with just the basics: explain_weights_pipeline plus transform_feature_names defined for feature selection and Pipeline.

@jnothman
Contributor Author

jnothman commented May 1, 2017

Where does further documentation belong?

@kmike
Contributor

kmike commented May 3, 2017

@jnothman what further documentation are you thinking of? I think an example of how to register your own transform_feature_names handler could go in the sklearn library docs.

@jnothman
Contributor Author

jnothman commented May 3, 2017 via email

@jnothman jnothman closed this May 17, 2017