Support explain_weights(pipeline)#158

Closed
jnothman wants to merge 12 commits into TeamHG-Memex:master from jnothman:pipeline

Conversation

@jnothman
Contributor

@jnothman jnothman commented Jan 31, 2017

Towards fixing #15, allowing explain_weights for Pipelines. Does not handle explain_predictions; tracking the provenance of features back to parts of text may be tricky in a convoluted pipeline/union structure.

No tests yet.

Introduces new eli5.transform.transform_feature_names singledispatch, which could do with more implementations.

I think the exact forms of the names should remain experimental... or indeed we could refuse to provide default implementations for anything but vectorizers, selectors, Pipeline and FeatureUnion, letting users register their own.
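The registration mechanism described above can be exercised without any eli5 internals. A minimal sketch of how a user-registered handler would plug into the singledispatch hook (MaskSelector is a hypothetical stand-in class, not part of the PR):

```python
from functools import singledispatch

class MaskSelector:
    """Hypothetical stand-in for a feature selector (e.g. a SelectorMixin)."""
    def __init__(self, mask):
        self.mask = mask

@singledispatch
def transform_feature_names(transformer, in_names=None):
    # Default: fall back to the transformer's own feature names, as in the PR.
    if hasattr(transformer, 'get_feature_names'):
        return transformer.get_feature_names()
    return in_names

@transform_feature_names.register(MaskSelector)
def _selector_names(transformer, in_names=None):
    # A selector keeps only a subset of its input names.
    return [name for name, kept in zip(in_names, transformer.mask) if kept]

print(transform_feature_names(MaskSelector([True, False, True]),
                              ['age', 'height', 'income']))
# -> ['age', 'income']
```

Dispatch is on the transformer's type, so third-party transformers get names without eli5 needing to know about them.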

Rubbish example:

from sklearn import datasets, feature_extraction, feature_selection, linear_model, pipeline
from eli5 import explain_weights
from eli5.sklearn.transform import register_experimental_feature_names
from IPython.display import display

register_experimental_feature_names()
est = pipeline.make_pipeline(
    pipeline.FeatureUnion([
        ('words', pipeline.make_pipeline(
            feature_extraction.text.CountVectorizer(),
            feature_extraction.text.TfidfTransformer())),
        ('chars', feature_extraction.text.CountVectorizer(analyzer='char'))]),
    feature_selection.SelectKBest(feature_selection.chi2, k=10),
    linear_model.LogisticRegression())
bunch = datasets.fetch_20newsgroups(
    categories=['alt.atheism', 'talk.religion.misc', 'rec.autos'])
est.fit(bunch.data, bunch.target)  # fit on the document text, not the filenames

display(explain_weights(est, target_names=bunch.target_names))

I'm not sure I'll be able to complete this patch any time soon, though I am keen to have something like this merged! Help is welcome.

Introduces new eli5.transform.transform_feature_names singledispatch
@codecov-io

codecov-io commented Jan 31, 2017

Codecov Report

Merging #158 into master will decrease coverage by 3.23%.
The diff coverage is 48.53%.

@@            Coverage Diff             @@
##           master     #158      +/-   ##
==========================================
- Coverage   97.25%   94.01%   -3.24%     
==========================================
  Files          39       41       +2     
  Lines        2405     2575     +170     
  Branches      452      492      +40     
==========================================
+ Hits         2339     2421      +82     
- Misses         34      113      +79     
- Partials       32       41       +9
Impacted Files Coverage Δ
eli5/sklearn/__init__.py 100% <100%> (ø) ⬆️
eli5/__init__.py 83.33% <100%> (+0.57%) ⬆️
eli5/sklearn/explain_weights.py 100% <100%> (ø) ⬆️
eli5/sklearn/transform.py 43.33% <43.33%> (ø)
eli5/transform.py 66.66% <66.66%> (ø)
eli5/_feature_names.py 96.29% <66.66%> (-0.88%) ⬇️

@jnothman
Contributor Author

jnothman commented Jan 31, 2017

(Note: @kmike's wish that we only calculate feature names when we've determined that they are important relies on knowing the dependencies between input and output features for each transformer, i.e. something like scikit-learn/enhancement_proposals#5.)

def explain_weights_pipeline(estimator, feature_names=None, **kwargs):
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    feature_names = transform_feature_names(transform_pipeline, feature_names)
Contributor

In other cases the feature_names argument of explain_weights has higher priority than feature names extracted from the estimator, i.e. it allows overriding the feature names completely. The way feature_names is used here looks fine though, but we should document this gotcha.

Contributor Author

The passed feature names (sort of) take priority here and are merely transformed, except where there is a vectoriser in the pipeline. But I think I should be using eli5's get_feature_names here to allow a vectoriser. Is that correct?

Comment thread eli5/sklearn/explain_weights.py Outdated
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    feature_names = transform_feature_names(transform_pipeline, feature_names)
    out = explain_weights(last_estimator, feature_names=feature_names)
Contributor

it makes sense to pass **kwargs here

Contributor Author

Yes, I'd intended to :)
Fixed

Comment thread eli5/sklearn/transform.py
def _pipeline_names(est, in_names=None):
    names = in_names
    for name, trans in est.steps:
        if trans is not None:
Contributor

when can trans be None?

Contributor Author

Changes released in scikit-learn 0.18 (scikit-learn/scikit-learn#1769) allow parts of a pipeline or union to be replaced or disabled in a grid search.

Contributor

Thanks for the explanation!
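The None-step case discussed above can be reproduced without scikit-learn. A toy sketch of why the name-propagation loop must skip disabled steps (Upper, its feature_names_out method and propagate_names are illustrative, not part of the PR):

```python
class Upper:
    # Toy transformer stand-in that just rewrites feature names.
    def feature_names_out(self, names):  # hypothetical method
        return [n.upper() for n in names]

def propagate_names(steps, in_names):
    # Mirrors the PR's loop: a step set to None (as a grid search can do
    # since scikit-learn 0.18) is a no-op, so names pass through untouched.
    names = in_names
    for step_name, trans in steps:
        if trans is not None:
            names = trans.feature_names_out(names)
    return names

print(propagate_names([('up', Upper()), ('disabled', None)], ['a', 'b']))
# -> ['A', 'B']
```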

Comment thread eli5/transform.py
@singledispatch
def transform_feature_names(transformer, in_names=None):
    if hasattr(transformer, 'get_feature_names'):
        return transformer.get_feature_names()
Contributor

❤️

@kmike
Contributor

kmike commented Jan 31, 2017

Thanks @jnothman! If you don't have time, we can pick up the development. I've left some comments so as not to forget about them.

@jnothman
Contributor Author

How much do you think we should prefabricate feature name designs, and how much should we leave to the user? As I said, we could import implementations of transform_feature_names for Pipeline, FeatureUnion and SelectorMixin by default and leave others unimplemented, or provide default implementations in some optional import like eli5.sklearn.experimental. The issue is that sometimes a user will want to see "topic1" for brevity, but a list of the top features in the topic would be more helpful!

@jnothman
Contributor Author

It may similarly be useful to show "scale(featurename)", but some users would rather have the scale transform elided. There are many ways to name features.

Which of course leads to the possibility of feature descriptions, but I think that can wait for later.

@jnothman jnothman closed this Jan 31, 2017
@jnothman jnothman reopened this Feb 1, 2017
@jnothman
Contributor Author

jnothman commented Feb 1, 2017

More untested code pushed :)

@jnothman
Contributor Author

jnothman commented Feb 8, 2017

I now see that it should be trivial to support the same feature in explain_prediction

@kmike
Contributor

kmike commented Feb 8, 2017

How much do you think we should prefabricate feature name designs, and how much should we leave to the user? As I said, we could import implementations of transform_feature_names for Pipeline, FeatureUnion and SelectorMixin by default and leave others unimplemented, or provide default implementations in some optional import like eli5.sklearn.experimental. The issue is that sometimes a user will want to see "topic1" for brevity, but a list of the top features in the topic would be more helpful!

I think providing as many default feature names as possible is a good thing. As for topics, what do you think about showing only a few top features? eli5.formatters.html could allow expanding topics on the client, or just show more words in the title attribute.

@kmike
Contributor

kmike commented Feb 8, 2017

It may similarly be useful to show "scale(featurename)", but some users would rather have the scale transform elided. There are many ways to name features.
Which of course leads to the possibility of feature descriptions, but I think that can wait for later.

Good points.
I think providing defaults like scale(featurename) is fine; customizability can be added / documented later if there is demand.
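Eliding versus decorating could then be a one-line re-registration on the user's side. A sketch, with Scaler a hypothetical stand-in class rather than anything in eli5 or scikit-learn:

```python
from functools import singledispatch

class Scaler:
    """Hypothetical stand-in for a scaling transformer."""

@singledispatch
def transform_feature_names(transformer, in_names=None):
    return in_names

@transform_feature_names.register(Scaler)
def _scale_names(transformer, in_names=None):
    # Default: make the transform visible in reports.
    return ['scale(%s)' % name for name in in_names]

print(transform_feature_names(Scaler(), ['age']))  # -> ['scale(age)']

# A user who prefers the transform elided just re-registers the handler:
transform_feature_names.register(Scaler, lambda trans, in_names=None: in_names)
print(transform_feature_names(Scaler(), ['age']))  # -> ['age']
```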

Comment thread eli5/sklearn/transform.py Outdated
# TODO: OneHotEncoder. scikit-learn#6441 doesn't appear complete

transform_feature_names.register(TfidfTransformer)(
    make_tfn_featurewise('tfidf', lambda est: len(est.idf_)))
Contributor

make_tfn_featurewise has only 1 argument

Contributor

This + 724c423 should make mypy env pass. I haven't figured out how to make a pull request to your pull request :)

Contributor Author

You are able to directly edit my branch, I think. What trouble do you have in making a PR to my branch?

Comment thread eli5/sklearn/transform.py
if abs:
    weights = np.abs(weights)
order = np.argsort(weights, axis=1)[::-1]
top_names = np.take(in_names, order[:, :max_features])
Contributor

probably not too important, but it seems this could be made faster using eli5.utils.argsort_k_largest

Contributor Author

Curious. Have you benchmarked argsort_k_largest? Perhaps I'm just used to supporting old versions of numpy where argpartition requires argsort anyway.

Contributor Author

(and if you want speed, use a.take(b, mode='clip') rather than a[b])

Contributor

mm, I believe there was a use case where it mattered, but now I can't recall it; for small-to-medium examples argsort works just fine.
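For reference, the argpartition-based top-k that argsort_k_largest exploits can be sketched with plain numpy (top_k_names here is an illustrative helper, not eli5 code):

```python
import numpy as np

def top_k_names(weights, in_names, k, use_abs=True):
    # Per-row top-k feature names. np.argpartition is O(n) per row versus
    # O(n log n) for a full argsort; only the k winners are then sorted.
    w = np.abs(weights) if use_abs else weights
    part = np.argpartition(-w, k - 1, axis=1)[:, :k]   # unordered top-k
    rows = np.arange(w.shape[0])[:, None]
    order = part[rows, np.argsort(-w[rows, part], axis=1)]
    # take(..., mode='clip') is the faster fancy-indexing variant noted above.
    return np.take(np.asarray(in_names), order, mode='clip')

print(top_k_names(np.array([[0.1, -0.9, 0.5]]), ['a', 'b', 'c'], 2))
# -> [['b' 'c']]
```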

Comment thread eli5/sklearn/transform.py


# XXX Should these generic things be in eli5.transform?
# I've left them here to make use of eli5.sklearn.utils.get_feature_names
Contributor Author

What do you think of this comment, @kmike? Is eli5.sklearn.utils.get_feature_names in the right place?

Contributor

There are two kinds of "scikit-learningness": things which work with objects from scikit-learn and things which work with objects which follow scikit-learn API ideas. get_feature_names (which is used in xgboost as well) is currently in eli5.sklearn because of the second meaning.

@jnothman
Contributor Author

jnothman commented Mar 1, 2017

@kmike what do you propose to move this forward? (I've noticed not a great deal of eli5 work lately in general.) Are we happy with the API and transformer coverage, and I should find time to write tests?

@kmike
Contributor

kmike commented Mar 1, 2017

@jnothman I'd love to have it moved forward :) To merge it we need tests. I like the API; transformer coverage is great, and we can always improve it in the future.

I'd also register everything by default, without marking it as experimental and requiring the user to call a register function. This way explain_weights works in more cases, and if users want to override how features are formatted it is the same amount of work (registering their own function).

Other than that it looks fine.

Comment thread eli5/sklearn/explain_weights.py Outdated
def explain_weights_pipeline(estimator, feature_names=None, **kwargs):
    last_estimator = estimator.steps[-1][1]
    transform_pipeline = Pipeline(estimator.steps[:-1])
    if hasattr(feature_names, 'get_feature_names'):
Contributor

What is this condition for? I'd expect feature_names to be a list or FeatureNames instance.

Contributor Author

This is to reflect some of the behaviour of eli5.sklearn.utils.get_feature_names. If you don't care for it, I'll remove it.

Contributor Author

Never mind. Cleaned up.

@jnothman
Contributor Author

jnothman commented Apr 6, 2017 via email

@kmike
Contributor

kmike commented Apr 6, 2017

@jnothman thanks for the update! What do you think about merging it partially, e.g. only the API part, Pipeline, FeatureUnion, and maybe a couple of basic transformers, to make it easier to test & document?

@jnothman
Contributor Author

jnothman commented Apr 6, 2017 via email

@jnothman
Contributor Author

jnothman commented May 1, 2017

I'll pull out a PR with just the basics: explain_weights_pipeline plus transform_feature_names defined for feature selection and Pipeline.

@jnothman
Contributor Author

jnothman commented May 1, 2017

Where does further documentation belong?

@kmike
Contributor

kmike commented May 3, 2017

@jnothman what further documentation are you thinking of? I think an example of how to register your own transform_feature_names handler could go in the sklearn library docs.

@jnothman
Contributor Author

jnothman commented May 3, 2017 via email

@jnothman jnothman closed this May 17, 2017