Tree-based models are very hard to introspect #7613
Yes. I think a good overall idea would be to have either a summary method on every estimator, or a summary function that takes an estimator and prints out the vital information about it. For linear regression this could be the coefficients, for trees the size, for ensembles the max/min size of each tree, and so on. Basically, provide a uniform way to inspect the important aspects of each model that are already stored.
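A minimal sketch of what such a function could look like (the `summary` name and the exact fields printed are hypothetical, not an existing scikit-learn API; everything it reads is an already-stored fitted attribute):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def summary(estimator):
    """Print the vital, already-stored attributes of a fitted model."""
    name = type(estimator).__name__
    if hasattr(estimator, "coef_"):           # linear models
        print(f"{name}: coef_ shape {np.shape(estimator.coef_)}, "
              f"intercept_ {estimator.intercept_}")
    elif hasattr(estimator, "estimators_"):   # tree ensembles
        sizes = [e.tree_.node_count for e in estimator.estimators_]
        print(f"{name}: {len(sizes)} trees, nodes per tree "
              f"min={min(sizes)} max={max(sizes)}")
    elif hasattr(estimator, "tree_"):         # single decision tree
        t = estimator.tree_
        print(f"{name}: depth={t.max_depth}, nodes={t.node_count}")

X, y = load_iris(return_X_y=True)
summary(RandomForestClassifier(n_estimators=5).fit(X, y))
```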
ah, how we creep towards Weka :P
weka was cool before R. :)
@jmschrei we've started something like that recently: https://github.com/TeamHG-Memex/eli5; it is still in its infancy, though.
@kmike eli5 looks awesome and I am looking forward to trying it out. One related topic I am interested in is creating a single decision tree from a random forest with better classification performance than you would get from building the decision tree directly. That is another way to aid interpretability. Is that something you have looked at?
@lesshaste it is easy to use a decision tree instead of a linear classifier in the LIME approach (i.e. replace LogisticRegression with DecisionTreeClassifier here, to explain a single prediction of a random forest as a locally fit decision tree). But I haven't tried it, and I don't have an intuition of how well it will work in practice.
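An untested sketch of that idea (a LIME-style local surrogate, not eli5's actual implementation; the function name, the Gaussian perturbation, and all defaults are assumptions): sample points around the instance, label them with the forest, and fit a small tree to that local behaviour.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def local_surrogate_tree(model, x, scale=0.1, n_samples=500, max_depth=3):
    # Sample points in a Gaussian neighbourhood of x and label them with
    # the black-box model, then fit a small, readable tree to them.
    rng = np.random.RandomState(0)
    X_local = x + scale * rng.randn(n_samples, len(x))
    y_local = model.predict(X_local)
    return DecisionTreeClassifier(max_depth=max_depth).fit(X_local, y_local)

# e.g. surrogate = local_surrogate_tree(forest, X[0]), then inspect the
# surrogate with export_graphviz as elsewhere in this thread.
```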
@kmike Thank you, that is interesting, but it isn't quite what I meant. I meant a single decision tree that replaces the entire random forest, rather than just explaining a single prediction. Clearly this decision tree won't, in general, be as good a classifier (or regressor) as the random forest. I realise this isn't at all what LIME is for, but wondered if it was something you already knew about nonetheless.
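For illustration, the simplest (and admittedly naive) version of that idea, separate from the papers linked below: relabel the training data with the forest's own predictions and fit one tree to imitate the ensemble's decision function (a rough distillation sketch; `distill_forest` is a made-up name, not a scikit-learn or eli5 function):

```python
from sklearn.tree import DecisionTreeClassifier

def distill_forest(forest, X, max_leaf_nodes=50):
    # Fit a single tree to the forest's predictions rather than to the
    # true labels; accuracy will generally drop, but the result is one
    # readable tree that mimics the ensemble.
    return DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes).fit(
        X, forest.predict(X))
```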
@lesshaste not entirely a tree but similar: https://arxiv.org/abs/1606.05390
@kmike It would be interesting indeed to see whether a local DTC in LIME yields meaningful feature combinations in practice... Actually, I just looked at eli5. This looks like a great idea; I had been wondering when someone would make LIME more usable. (Although your approach may still be largely limited to text.) I wonder if there is a way to direct contributors to scikit-learn ecosystem projects like this, in case they need help. WDYT?
I'm not sure I understand what eli5 is doing... @kmike can you explain a bit?
@jnothman it seems that for non-text data some kind of density estimation is needed; the original LIME code uses binning with bins of the same size for non-text data, which is more limited. I've opened TeamHG-Memex/eli5#13 with some links, but haven't started working on it yet; better ideas are welcome. @amueller eli5 is not related directly to your ticket description, sorry for having a discussion here! The idea is to provide a unified interface for inspecting classifiers and their predictions, along with various helpers: e.g. for a linear classifier it shows the feature names with the largest/smallest coefficients (either globally, or active in a given example); for a random forest it shows feature importances; for black-box classifiers it can approximate them locally (near a given example) using a simpler, more inspectable classifier; for HashingVectorizer it can restore feature names given a vocabulary; helpers like that.
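A bare-bones version of the "largest coefficients / feature importances with names" part of that idea (eli5's real API does much more; `top_features` here is just an illustrative sketch, assuming a binary or single-output model):

```python
import numpy as np

def top_features(clf, feature_names, k=10):
    # Linear models store coef_, tree ensembles store feature_importances_.
    if hasattr(clf, "coef_"):
        weights = np.ravel(clf.coef_)  # assumes a binary/single-output model
    else:
        weights = clf.feature_importances_
    for i in np.argsort(np.abs(weights))[::-1][:k]:
        print(f"{feature_names[i]:<30} {weights[i]:+.4f}")
```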
@kmike That's a really interesting paper, and it cites "inTrees" too, which I notice has an R package: https://cran.r-project.org/web/packages/inTrees/inTrees.pdf. There is also https://github.com/sato9hara/defragTrees in Python, for the https://arxiv.org/abs/1606.09066 paper. "Example 4 - Interpreting Scikit-learn Model" is particularly interesting/relevant.
I'd actually be interested to have more uniform introspection in scikit-learn. We recently discussed adding a plotting module ;)
One option would be to use something like:

```python
from IPython.display import Image, display
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus

iris = load_iris()
clf = RandomForestClassifier(n_estimators=3, max_depth=2)
clf = clf.fit(iris.data, iris.target)

# Render each tree in the forest as a PNG and display them inline.
images = []
for c in clf.estimators_:
    dot_data = tree.export_graphviz(c, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    images.append(Image(graph.create_png(), width=100, height=100))
display(*images)
```
With many trees that's not really feasible. I mean, all the information is there, but it's hard to access. We could add methods like
@amueller @lesshaste thanks for the links, sounds interesting! @amueller as for inspection, I'm not sure what to add to the scikit-learn API other than get_feature_names for transformers, pipelines, etc. We'll try to experiment with some ideas in eli5. One such experiment is a set of classes to recover feature names for HashingVectorizer and FeatureHasher (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/unhashing.py); are you open to adding this feature to scikit-learn itself, once the code is polished?
@kmike Definitely the feature names, but I would also like to know the max depth or the number of leaves. How am I supposed to set the parameters of a grid search otherwise? I feel like trying to get the features back from the hashing models defeats the purpose. I would rather try implementing the cuckoo hash if that's your goal: alex.smola.org/data/kdd2015/Cuckoo.pdf
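(For reference: until such methods exist, those quantities can be dug out of the fitted `tree_` objects. A small sketch, counting leaves as nodes without children:)

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

# tree_.max_depth is stored after fitting; leaf nodes are exactly the
# nodes with no children (children_left == -1).
depths = [est.tree_.max_depth for est in forest.estimators_]
leaves = [(est.tree_.children_left == -1).sum() for est in forest.estimators_]
print(f"depth:  min={min(depths)}, max={max(depths)}")
print(f"leaves: min={min(leaves)}, max={max(leaves)}")
# These ranges suggest a starting grid for max_depth / max_leaf_nodes.
```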
@amueller yeah, I meant what to add other than your proposed changes; no fresh ideas from me.
A nice link! I think getting feature names back from a hashing vectorizer is still helpful: it adds no memory/CPU overhead at training time or at run time; you pay the price only if you inspect the model (well, you may need to store a sample dataset). It is also helpful if one wants to inspect a single prediction: in this case the vocabulary can be built from a single document, and one can check the coefficients or importances of the features active in that document. Vowpal Wabbit provides a similar option.
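A sketch of that single-document variant (`explain_document` is a made-up helper, assuming a HashingVectorizer `vec` and a fitted binary linear model `clf` trained on its output):

```python
from sklearn.feature_extraction.text import HashingVectorizer

def explain_document(doc, vec, clf):
    # Re-hash each token of the document to find which column it occupies,
    # then look up the coefficient the linear model learned there.  Hash
    # collisions map several tokens to one coefficient, and the hashing
    # trick may negate a feature's stored value, so read signs with care.
    for token in sorted(set(vec.build_analyzer()(doc))):
        cols = vec.transform([token]).nonzero()[1]
        if len(cols):
            print(f"{token:<20} column {cols[0]:>8}  "
                  f"coef {clf.coef_[0, cols[0]]:+.4f}")
```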
I still want
Jumping late to the party. @amueller @kmike you guys are doing some great work with eli5, love it. @lesshaste, a related paper (TREPAN): https://papers.nips.cc/paper/1152-extracting-tree-structured-representations-of-trained-networks.pdf
I find going through the tree-based models' code pretty joyful. I could try to add the two functions while the rest of the discussion goes ahead.
Maybe you want to lead the review of NOCATS (#4899) too!
@adrinjalali I would appreciate that!
I'm in favor of closing this with #12300; we can open more concrete issues if there's something else.
Currently it's really hard to understand tree-based models imho.
If I train a random forest with default parameters, I might want to know how deep it is or how many leaves it has so that I know how to prune it. That's pretty tricky to do, given that these are very fundamental properties of the model. Do you think these would be helpful to make more accessible?
Ping @glouppe @arjoly @jmschrei?