
Tree-based models are very hard to introspect #7613


Closed
amueller opened this issue Oct 8, 2016 · 26 comments
Labels
Easy Well-defined and straightforward way to resolve Enhancement help wanted

Comments

@amueller
Member

amueller commented Oct 8, 2016

Currently it's really hard to understand tree-based models, imho.
If I train a random forest with default parameters, I might want to know how deep it is or how many leaves it has so that I know how to prune it. That's pretty tricky to do, given that these are very fundamental properties of the model. Do you think it would be helpful to make these more accessible?
Ping @glouppe @arjoly @jmschrei?
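For reference, the numbers in question can already be dug out of each fitted tree's `tree_` attribute; they are just not surfaced anywhere obvious. A minimal sketch using attributes that fitted trees already store (not a proposed API):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# depth is stored on each underlying Tree object
depths = [est.tree_.max_depth for est in clf.estimators_]
# a node is a leaf when its left child is set to TREE_LEAF (-1)
leaves = [int((est.tree_.children_left == -1).sum()) for est in clf.estimators_]
print("max depth per tree:", depths)
print("leaves per tree:", leaves)
```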

@jmschrei
Member

jmschrei commented Oct 8, 2016

Yes. I think a good overall idea would be to have either a summary method on every estimator, or a summary function that takes an estimator and prints out the vital information about it. For linear regression this could be the coefficients, for trees the size, for ensembles the max/min size of each tree, and so on. Basically, provide a uniform way to inspect the important aspects of each model that are already stored.
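A hedged sketch of what such a helper could look like (the `summary` function name and its output format are hypothetical, not an existing sklearn API); it only dispatches on attributes that fitted estimators already expose:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def summary(estimator):
    """Collect the 'vital information' a fitted estimator already stores."""
    lines = [type(estimator).__name__]
    if hasattr(estimator, "coef_"):
        lines.append("n_coefficients: %d" % np.ravel(estimator.coef_).size)
    if hasattr(estimator, "tree_"):
        lines.append("depth=%d leaves=%d" % (
            estimator.tree_.max_depth,
            int((estimator.tree_.children_left == -1).sum())))
    if hasattr(estimator, "estimators_"):
        depths = [e.tree_.max_depth for e in np.ravel(estimator.estimators_)
                  if hasattr(e, "tree_")]
        if depths:
            lines.append("tree depth: min=%d max=%d" % (min(depths), max(depths)))
    return "\n".join(lines)

X, y = load_iris(return_X_y=True)
linear_report = summary(LogisticRegression(max_iter=1000).fit(X, y))
forest_report = summary(RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y))
print(linear_report)
print(forest_report)
```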

@jnothman
Member

jnothman commented Oct 8, 2016

ah, how we creep towards Weka :P


@amueller
Member Author

amueller commented Oct 8, 2016

@jmschrei also see #6323 ;)

@jnothman I was thinking more R. I guess I should use more weka for inspiration? ;)

@jnothman
Member

jnothman commented Oct 8, 2016

weka was cool before R. :)


@kmike
Contributor

kmike commented Oct 9, 2016

@jmschrei we've started something like that recently: https://github.com/TeamHG-Memex/eli5; it is still in infancy though.

@lesshaste

@kmike eli5 looks awesome and I am looking forward to trying it out. One related topic I am interested in is creating a single decision tree from a random forest with better classification performance than you would get from creating the decision tree directly. That is another way to aid interpretability. Is that something you have looked at?

@kmike
Contributor

kmike commented Oct 9, 2016

@lesshaste it is easy to use a decision tree instead of a linear classifier in the LIME approach (i.e. replace LogisticRegression with DecisionTreeClassifier here, to explain a single prediction of a random forest as a locally fit decision tree). But I haven't tried it, and I don't have an intuition of how well it will work in practice.
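A rough sketch of that idea (this is an illustration of the LIME-style local-surrogate recipe, not LIME's actual implementation): sample points around one instance, label them with the random forest, and fit a small decision tree to that local, labelled cloud.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# perturb one instance with per-feature Gaussian noise to get a local cloud
rng = np.random.RandomState(0)
x0 = X[0]
local_X = x0 + rng.normal(scale=X.std(axis=0) * 0.3,
                          size=(500, X.shape[1]))
local_y = forest.predict(local_X)

# fit a small, readable tree to the forest's local behaviour
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(local_X, local_y)
fidelity = surrogate.score(local_X, local_y)
print("local fidelity:", fidelity)
```

How faithful the small tree is to the forest in that neighbourhood is exactly the open question raised above.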

@lesshaste

lesshaste commented Oct 9, 2016

@kmike Thank you, that is interesting, but it isn't quite what I meant. I meant a single decision tree that replaces the entire random forest rather than just explaining a single prediction. Clearly this decision tree won't in general be as good a classifier (or regressor) as the random forest was. I realise this isn't at all what LIME is for, but wondered if it was something you already knew about nonetheless.

@amueller
Member Author

amueller commented Oct 9, 2016

@lesshaste not entirely a tree but similar: https://arxiv.org/abs/1606.05390

@jnothman
Member

jnothman commented Oct 9, 2016

@kmike It would be interesting indeed to see whether a local DTC in LIME yields meaningful feature combinations in practice...

Actually, I just looked at eli5. This looks like a great idea; I had been wondering when someone would make LIME more usable. (Although your approach may still be largely limited to text.) I wonder if there is a way to direct contributors to scikit-learn ecosystem projects like this, in case they need help. WDYT?

@amueller
Member Author

amueller commented Oct 9, 2016

I'm not sure I understand what eli5 is doing... @kmike can you explain a bit?

@kmike
Contributor

kmike commented Oct 10, 2016

@jnothman it seems for non-text data some kind of density estimation is needed; original LIME code uses binning with bins of the same size for non-text data, which is more limited. I've opened TeamHG-Memex/eli5#13 with some links, but haven't started working on it yet; better ideas are welcome.

@amueller eli5 is not directly related to your ticket description, sorry for having a discussion here! The idea is to provide a unified interface for inspecting classifiers and their predictions, along with various helpers - e.g. for a linear classifier it shows the feature names with the largest/smallest coefficients (either globally, or active in a given example), for a random forest it shows feature importances, for black-box classifiers it can approximate them locally (near a given example) using a simpler, more inspectable classifier, for HashingVectorizer it can restore feature names given a vocabulary, helpers like that.

@lesshaste

lesshaste commented Oct 10, 2016

@kmike That's a really interesting paper and it cites "inTrees" too which I notice has an R package https://cran.r-project.org/web/packages/inTrees/inTrees.pdf .

There is also https://github.com/sato9hara/defragTrees in python for the https://arxiv.org/abs/1606.09066 paper.

"Example 4 - Interpreting Scikit-learn Model" is particularly interesting/relevant.

@amueller
Member Author

I'd actually be interested to have more uniform introspection in scikit-learn. We recently discussed adding a plotting module ;)

@themrmax
Contributor

One option would be to use export_graphviz to visualise the trees (you would probably want to remove most of the text and arrange them in a grid to be more suitable for models with lots of trees):

from IPython.display import Image, display
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus

iris = load_iris()
clf = RandomForestClassifier(n_estimators=3, max_depth=2)
clf = clf.fit(iris.data, iris.target)

# render each tree in the forest to a small PNG
images = []
for c in clf.estimators_:
    dot_data = tree.export_graphviz(c, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    images.append(Image(graph.create_png(), width=100, height=100))
display(*images)

[image: graphviz renderings of the three trees]

@amueller
Member Author

With many trees that's not really feasible. I mean all the information is there, but it's hard to access. We could add methods like get_max_depth() or get_n_leaves() or something?

@kmike
Contributor

kmike commented Oct 15, 2016

@amueller @lesshaste thanks for the links, sounds interesting!

@amueller as for inspection, I'm not sure what to add to scikit-learn API other than get_feature_names for transformers, pipelines, etc. We'll try to experiment with some ideas in eli5. One such experiment is a set of classes to recover feature names for HashingVectorizer and FeatureHasher (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/unhashing.py); are you open to adding this feature to scikit-learn itself, once the code is polished?
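A rough sketch of the "unhashing" idea (an illustration of the approach, not eli5's actual implementation): hash each term of a known sample vocabulary and record which column it lands in, so model coefficients at that column can later be labelled with candidate term names.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
vocab = ["tree", "forest", "leaf", "depth", "model"]

# map each hashed column back to the term(s) that land there
bucket_terms = defaultdict(list)
for term in vocab:
    col = vec.transform([term]).nonzero()[1][0]
    bucket_terms[col].append(term)

# a coefficient at column `col` can now be reported under these names;
# hash collisions stay ambiguous (several terms share one bucket)
for col, terms in sorted(bucket_terms.items()):
    print(col, terms)
```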

@amueller
Member Author

@kmike Definitely the feature names, but I would also like to know the max depth or number of leaves. How am I supposed to set parameters of a grid-search otherwise?

I feel like trying to get the features back from the hashing models defeats the purpose. I would rather try implementing the cuckoo hash if that's your goal: alex.smola.org/data/kdd2015/Cuckoo.pdf
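The grid-search point above can be made concrete with a small sketch (illustrative only): read the depth an unconstrained forest actually reaches, then build the pruning grid from that number instead of guessing.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
full = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
# depth the unconstrained trees actually grow to
reached = max(est.tree_.max_depth for est in full.estimators_)

# search shallower depths up to the unconstrained one
grid = {"max_depth": list(range(1, reached + 1))}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    grid, cv=3).fit(X, y)
print("reached depth:", reached, "best pruned depth:", search.best_params_)
```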

@kmike
Contributor

kmike commented Oct 17, 2016

@amueller yeah, I meant what to add other than your proposed changes; no fresh ideas from me.

I feel like trying to get the features back from the hashing models defeats the purpose. I would rather try implementing the cuckoo hash if that's your goal: alex.smola.org/data/kdd2015/Cuckoo.pdf

A nice link!

I think getting feature names back from a hashing vectorizer is still helpful - it adds no memory/CPU overhead at training time or at run time; you pay the price only if you inspect the model (well, you may need to store a sample dataset). It is also helpful if one wants to inspect a single prediction - in this case the vocabulary can be built from a single document, and one can check the coefficients or importances of the features active in that document. Vowpal Wabbit provides a similar option.

@amueller amueller added Easy Well-defined and straightforward way to resolve Enhancement help wanted labels Apr 13, 2018
@amueller
Member Author

I still want get_depth and get_n_leaves

@pramitchoudhary

pramitchoudhary commented Sep 28, 2018

Jumping late to the party @amueller @kmike (you guys are doing some great work with eli5, love it) @lesshaste.
One other option could be tree-based surrogates as an explanation model (using decision trees) alongside any ensemble forest. Tree surrogates accompanied by pruning (pre and post) could go a long way in generating faithful explanations (both global and local).
Making use of the recent amazing work contributed to the repo (having access to decision_path), we've added some initial support within Skater (it's still in the early phases of development). Here is an example notebook: https://github.com/datascienceinc/Skater/blob/master/examples/rule_list_notebooks/explanation_using_tree_surrogate.ipynb
We are exploring more intuitive ways of enabling understandable explanations.
Keep up the amazing work, guys.

related paper(TREPAN): https://papers.nips.cc/paper/1152-extracting-tree-structured-representations-of-trained-networks.pdf
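A minimal global-surrogate sketch in the spirit of the comment above (much simpler than Skater or TREPAN, shown only to make the idea concrete): fit a single pruned tree to the forest's own predictions rather than to the true labels.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# the surrogate learns to mimic the forest, not the original labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, forest.predict(X))

# fidelity: how often the surrogate agrees with the forest it explains
fidelity = surrogate.score(X, forest.predict(X))
print("fidelity:", fidelity)
```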

@adrinjalali
Member

I find going through the tree-based models' code pretty joyful. I could try to add the two functions while the rest of the discussion goes ahead.

@jnothman
Member

jnothman commented Sep 29, 2018 via email

@amueller
Member Author

amueller commented Oct 1, 2018

@adrinjalali I would appreciate that!

@amueller
Member Author

I'm in favor of closing this with #12300, we can open more concrete issues if there's something else?
