Tree-based models are very hard to introspect #7613
Yes. I think a good overall idea would be to have either a summary method on every estimator, or a summary function that takes an estimator and prints out the vital information about it. For linear regression this could be the coefficients, for trees the size, for ensembles the max/min size of each tree, and so on. Basically, provide a uniform way to inspect the important aspects of each model that are already stored.
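A minimal sketch of what such a function could look like (the `summary` name and the exact fields printed are hypothetical, not an existing scikit-learn API; everything it reads is an already-stored fitted attribute):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def summary(estimator):
    """Print the vital, already-stored attributes of a fitted model."""
    name = type(estimator).__name__
    if hasattr(estimator, "coef_"):           # linear models
        print(f"{name}: coef_ shape {np.shape(estimator.coef_)}, "
              f"intercept_ {estimator.intercept_}")
    elif hasattr(estimator, "estimators_"):   # tree ensembles
        sizes = [e.tree_.node_count for e in estimator.estimators_]
        print(f"{name}: {len(sizes)} trees, nodes per tree "
              f"min={min(sizes)} max={max(sizes)}")
    elif hasattr(estimator, "tree_"):         # single decision tree
        t = estimator.tree_
        print(f"{name}: depth={t.max_depth}, nodes={t.node_count}")

X, y = load_iris(return_X_y=True)
summary(RandomForestClassifier(n_estimators=5).fit(X, y))
```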
ah, how we creep towards Weka :P
weka was cool before R. :)
@jmschrei we've started something like that recently: https://github.com/TeamHG-Memex/eli5; it is still in its infancy, though.
@kmike eli5 looks awesome and I am looking forward to trying it out. One related topic I am interested in is creating a single decision tree from a random forest with better classification performance than you would get from building the decision tree directly. That is another way to aid interpretability. Is that something you have looked at?
@lesshaste it is easy to use a decision tree instead of a linear classifier in the LIME approach (i.e. replace LogisticRegression with DecisionTreeClassifier here, to explain a single prediction of a random forest as a locally fit decision tree). But I haven't tried it, and I don't have an intuition of how well it will work in practice.
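An untested sketch of that idea (a LIME-style local surrogate, not eli5's actual implementation; the function name, the Gaussian perturbation, and all defaults are assumptions): sample points around the instance, label them with the forest, and fit a small tree to that local behaviour.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def local_surrogate_tree(model, x, scale=0.1, n_samples=500, max_depth=3):
    # Sample points in a Gaussian neighbourhood of x and label them with
    # the black-box model, then fit a small, readable tree to them.
    rng = np.random.RandomState(0)
    X_local = x + scale * rng.randn(n_samples, len(x))
    y_local = model.predict(X_local)
    return DecisionTreeClassifier(max_depth=max_depth).fit(X_local, y_local)

# e.g. surrogate = local_surrogate_tree(forest, X[0]), then inspect the
# surrogate with export_graphviz as elsewhere in this thread.
```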
@kmike Thank you, that is interesting, but it isn't quite what I meant. I meant a single decision tree that replaces the entire random forest, rather than just explaining a single prediction. Clearly this decision tree won't, in general, be as good a classifier (or regressor) as the random forest. I realise this isn't at all what LIME is for, but wondered if it was something you already knew about nonetheless.
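For illustration, the simplest (and admittedly naive) version of that idea, separate from the papers linked below: relabel the training data with the forest's own predictions and fit one tree to imitate the ensemble's decision function (a rough distillation sketch; `distill_forest` is a made-up name, not a scikit-learn or eli5 function):

```python
from sklearn.tree import DecisionTreeClassifier

def distill_forest(forest, X, max_leaf_nodes=50):
    # Fit a single tree to the forest's predictions rather than to the
    # true labels; accuracy will generally drop, but the result is one
    # readable tree that mimics the ensemble.
    return DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes).fit(
        X, forest.predict(X))
```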
@lesshaste not entirely a tree but similar: https://arxiv.org/abs/1606.05390
@kmike It would be interesting indeed to see whether a local DTC in LIME yields meaningful feature combinations in practice... Actually, I just looked at eli5. This looks like a great idea; I had been wondering when someone would make LIME more usable. (Although your approach may still be largely limited to text.) I wonder if there is a way to direct contributors to scikit-learn ecosystem projects like this, in case they need help. WDYT?
I'm not sure I understand what eli5 is doing... @kmike can you explain a bit?
@jnothman it seems that for non-text data some kind of density estimation is needed; the original LIME code uses binning with bins of the same size for non-text data, which is more limited. I've opened TeamHG-Memex/eli5#13 with some links, but haven't started working on it yet; better ideas are welcome. @amueller eli5 is not related directly to your ticket description, sorry for having a discussion here! The idea is to provide a unified interface for inspecting classifiers and their predictions, along with various helpers: e.g. for a linear classifier it shows the feature names with the largest/smallest coefficients (either globally, or active in a given example); for a random forest it shows feature importances; for black-box classifiers it can approximate them locally (near a given example) using a simpler, more inspectable classifier; for HashingVectorizer it can restore feature names given a vocabulary; helpers like that.
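A bare-bones version of the "largest coefficients / feature importances with names" part of that idea (eli5's real API does much more; `top_features` here is just an illustrative sketch, assuming a binary or single-output model):

```python
import numpy as np

def top_features(clf, feature_names, k=10):
    # Linear models store coef_, tree ensembles store feature_importances_.
    if hasattr(clf, "coef_"):
        weights = np.ravel(clf.coef_)  # assumes a binary/single-output model
    else:
        weights = clf.feature_importances_
    for i in np.argsort(np.abs(weights))[::-1][:k]:
        print(f"{feature_names[i]:<30} {weights[i]:+.4f}")
```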
@kmike That's a really interesting paper, and it cites "inTrees" too, which I notice has an R package: https://cran.r-project.org/web/packages/inTrees/inTrees.pdf. There is also https://github.com/sato9hara/defragTrees in Python, for the https://arxiv.org/abs/1606.09066 paper. "Example 4 - Interpreting Scikit-learn Model" is particularly interesting/relevant.
I'd actually be interested to have more uniform introspection in scikit-learn. We recently discussed adding a plotting module ;)
One option would be to use something like:

```python
from IPython.display import Image, display
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus

iris = load_iris()
clf = RandomForestClassifier(n_estimators=3, max_depth=2)
clf = clf.fit(iris.data, iris.target)

# Render each tree in the forest as a PNG and display them inline.
images = []
for c in clf.estimators_:
    dot_data = tree.export_graphviz(c, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    images.append(Image(graph.create_png(), width=100, height=100))
display(*images)
```
With many trees that's not really feasible. I mean, all the information is there, but it's hard to access. We could add methods like
@amueller @lesshaste thanks for the links, sounds interesting! @amueller as for inspection, I'm not sure what to add to the scikit-learn API other than get_feature_names for transformers, pipelines, etc. We'll try to experiment with some ideas in eli5. One such experiment is a set of classes to recover feature names for HashingVectorizer and FeatureHasher (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/unhashing.py); are you open to adding this feature to scikit-learn itself, once the code is polished?
@kmike Definitely the feature names, but I would also like to know the max depth or the number of leaves. How am I supposed to set the parameters of a grid search otherwise? I feel like trying to get the features back from the hashing models defeats the purpose. I would rather try implementing the cuckoo hash if that's your goal: alex.smola.org/data/kdd2015/Cuckoo.pdf
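(For reference: until such methods exist, those quantities can be dug out of the fitted `tree_` objects. A small sketch, counting leaves as nodes without children:)

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

# tree_.max_depth is stored after fitting; leaf nodes are exactly the
# nodes with no children (children_left == -1).
depths = [est.tree_.max_depth for est in forest.estimators_]
leaves = [(est.tree_.children_left == -1).sum() for est in forest.estimators_]
print(f"depth:  min={min(depths)}, max={max(depths)}")
print(f"leaves: min={min(leaves)}, max={max(leaves)}")
# These ranges suggest a starting grid for max_depth / max_leaf_nodes.
```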
@amueller yeah, I meant what to add other than your proposed changes; no fresh ideas from me.
A nice link! I think getting feature names back from a hashing vectorizer is still helpful: it adds no memory/CPU overhead at training time or at run time; you pay the price only if you inspect the model (well, you may need to store a sample dataset). It is also helpful if one wants to inspect a single prediction: in this case the vocabulary can be built from a single document, and one can check the coefficients or importances of the features active in that document. Vowpal Wabbit provides a similar option.
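A sketch of that single-document variant (`explain_document` is a made-up helper, assuming a HashingVectorizer `vec` and a fitted binary linear model `clf` trained on its output):

```python
from sklearn.feature_extraction.text import HashingVectorizer

def explain_document(doc, vec, clf):
    # Re-hash each token of the document to find which column it occupies,
    # then look up the coefficient the linear model learned there.  Hash
    # collisions map several tokens to one coefficient, and the hashing
    # trick may negate a feature's stored value, so read signs with care.
    for token in sorted(set(vec.build_analyzer()(doc))):
        cols = vec.transform([token]).nonzero()[1]
        if len(cols):
            print(f"{token:<20} column {cols[0]:>8}  "
                  f"coef {clf.coef_[0, cols[0]]:+.4f}")
```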
I still want
Jumping late to the party. @amueller @kmike you guys are doing some great work with eli5, love it. @lesshaste, a related paper (TREPAN): https://papers.nips.cc/paper/1152-extracting-tree-structured-representations-of-trained-networks.pdf
I find going through the tree-based models' code pretty joyful. I could try to add the two functions while the rest of the discussion goes ahead.
Maybe you want to lead the review of NOCATS (#4899) too!
@adrinjalali I would appreciate that!
I'm in favor of closing this with #12300; we can open more concrete issues if there's something else.
Currently it's really hard to understand tree-based models imho.
If I train a random forest with default parameters, I might want to know how deep it is or how many leaves it has so that I know how to prune it. That's pretty tricky to do, given that these are very fundamental properties of the model. Do you think these would be helpful to make more accessible?
Ping @glouppe @arjoly @jmschrei?