
[MRG] Matplotlib tree plotting #9251

Merged
merged 73 commits into from
Oct 11, 2018

Conversation

amueller
Member

@amueller amueller commented Jun 29, 2017

Fixes #8508

This PR does three things:

  • Refactors the graphviz export into a class for easier code reuse (public interface stays the same)
  • Adds a matplotlib-based plotting frontend
  • Adds the Reingold-Tilford tree layout algorithm

For now this lives in the tree submodule to avoid conflicting with work on the plot submodule.

Todo:

  • docs
  • tests?
  • automatic scaling in X and Y direction
  • remove hacks
  • figure out z-index
  • create base class for exporters.
  • rounded

We could try to avoid the intermediate Tree data structure but not sure if that's worth the hassle.

Example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.tree.export import plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_leaf_nodes=10).fit(iris.data, iris.target)

plot_tree(tree, filled=True, class_names=iris.target_names,
          feature_names=iris.feature_names, scalex=150, scaley=80)

image

compared to graphviz:
image

The Reingold-Tilford algorithm makes sure a parent is centered above its children. Graphviz uses a different algorithm that results in a somewhat more compact (if possibly uglier?) layout.
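
The centering property can be illustrated with a much simpler layout than the one vendored in this PR. The sketch below is not the Reingold-Tilford or Buchheim algorithm: it only puts leaves on consecutive integer slots and centers each parent over its first and last child. `Node` and `layout` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; not the data structure used in the PR."""
    label: str
    children: List["Node"] = field(default_factory=list)
    x: float = 0.0
    y: int = 0


def layout(node: Node, depth: int = 0, next_leaf_x: int = 0) -> int:
    """Assign (x, y) in place and return the next free leaf slot."""
    node.y = depth
    if not node.children:
        # Leaves take consecutive integer positions from left to right.
        node.x = float(next_leaf_x)
        return next_leaf_x + 1
    for child in node.children:
        next_leaf_x = layout(child, depth + 1, next_leaf_x)
    # Center the parent above its outermost children.
    node.x = (node.children[0].x + node.children[-1].x) / 2
    return next_leaf_x


root = Node("root", [Node("a", [Node("c"), Node("d")]), Node("b")])
layout(root)
print([(n.label, n.x, n.y) for n in (root, *root.children)])
```

This naive version wastes horizontal space on unbalanced trees; the real algorithm additionally pushes subtrees as close together as it can, which is presumably why the PR reuses an existing implementation instead.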

Currently the box sizes of the nodes are ignored, i.e. assumed constant, where the constant is expressed using the scalex and scaley variables.
I think it's ok to assume all boxes are the same size. I don't aim for perfection, I aim for a reasonable plot.

Disclaimer: this is an afternoon hack, and I'm not sure if I'll be able to fix it.
I might use it at scipy though. (cc @agramfort )

@amueller
Member Author

Graphviz uses a custom heuristic to solve an integer linear program as described here:
http://www.graphviz.org/Documentation/TSE93.pdf (section 4.1)
We could save most of the work (all sections before 4) as we already have a tree, but it still seems kinda tricky and I'm not sure it's worth it.

@amueller
Member Author

amueller commented Jul 6, 2017

I changed the tests to work on the new color definition. They are visually indistinguishable and one might argue that #ffffff is a more obvious representation of white than #e5813900 ;)
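
For readers wondering why the two hex strings are indistinguishable: read as RGBA, `#e5813900` has an alpha channel of `00`, so it is fully transparent and shows whatever is behind it. A minimal sketch of that compositing, assuming a white background and the standard "over" blend; `composite_over_white` is a hypothetical helper, not something from the PR.

```python
def composite_over_white(rgba_hex: str) -> str:
    """Blend an '#rrggbbaa' color over a white background, return '#rrggbb'."""
    value = rgba_hex.lstrip("#")
    r, g, b, a = (int(value[i:i + 2], 16) for i in range(0, 8, 2))
    alpha = a / 255
    # Standard "over" operator against white (255, 255, 255).
    blended = [round(alpha * c + (1 - alpha) * 255) for c in (r, g, b)]
    return "#" + "".join(f"{c:02x}" for c in blended)


print(composite_over_white("#e5813900"))  # fully transparent orange
print(composite_over_white("#e58139ff"))  # fully opaque orange
```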

@amueller
Member Author

re-introduce scalex, scaley
add max_extents to tree to get tree size before plotting
Member

@rth rth left a comment

Maybe my first review was overly negative. I think it's a nice improvement, and the ability to draw trees without graphviz is definitely important. The amount of added code is also relatively small with respect to what it allows us to do. A few comments below (some of which you have already addressed).

My main concern is that we are not testing this to the same extent as regular estimators. For GUI it's quite difficult anyway: just because we run the examples doesn't mean that someone cannot make a PR that would make the output wrong without throwing an exception (or that the next matplotlib version won't). Unless someone looks at the figure we won't be able to tell. It's different from regular examples, where it matters less, as users are expected to run this algorithm on any kind of tree. Similarly, it's hard to know if the output will always be satisfactory for any kind of tree configuration. Running this in CircleCI or Travis is not really going to change that, I agree. The vendored sklearn/tree/_reingold_tilford.py also comes from a repo without unit tests that hasn't been updated in 5-7 years.

Overall I think it's a nice improvement, and I'm +1 to merge it, but someone would need to answer future issues about this :) Maybe it could be worth marking plot_tree experimental in the docstring?

# is about the same as the distance between boxes
max_x, max_y = draw_tree.max_extents() + 1
ax_width = ax.get_window_extent().width
ax_height = ax.get_window_extent().height
Member

The issue with this is that it does not allow changing the size of the plot. For instance, I can't read the text at the default size (it's too tiny when it takes up 1/4 of my screen by default; the version below is a bit zoomed, depending on your screen resolution)
figure_initial
while if I put it to full screen I get the following (with all the white space around it),
figure_full_screen
which is the area it occupied on my screen initially. No way to zoom in either.

Tested with matplotlib 2.2. I can confirm the results don't change for 3.0 either.

Member Author

You need to change figsize or dpi, I think. Happy to hear about a better way to do that. I guess we could redraw on resize?

Member Author

Maybe we should add something to the docstring that you can control the size of the tree with figsize and dpi?
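
A sketch of the figsize/dpi approach suggested here. It assumes a released scikit-learn where `plot_tree` is importable from `sklearn.tree` and accepts an `ax` argument (in this PR it still lived in `sklearn.tree.export`), plus a headless matplotlib backend; the concrete figure size and dpi are arbitrary example values.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
clf.fit(iris.data, iris.target)

# A larger figure and higher dpi give each node box more pixels,
# which is what makes the text legible.
fig, ax = plt.subplots(figsize=(16, 10), dpi=150)
annotations = plot_tree(
    clf,
    filled=True,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    ax=ax,
)

# Render to an in-memory PNG; pass a filename instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

The same knobs apply to an interactive window: create the figure with the desired size first, then pass its axes to `plot_tree`.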

Member

Docs sound appropriate

>>> tree.export_graphviz(clf,
... out_file='tree.dot') # doctest: +SKIP
>>> tree.plot_tree(clf) # doctest: +SKIP
[Text(251.5,345.217,'X[3] <= 0.8...
Member

Will that really change between runs? Can't we remove the SKIP? Though, I guess it can also be impacted by matplotlib versions?

Member Author

I'm pretty sure I had a reason for this. I don't remember whether it was non-determinism or something about requiring the presence of matplotlib. I can try to remove it.


- |Fix| Fixed a bug in :class:`cluster.DBSCAN` with precomputed sparse neighbors
- |Fix| Fixed a bug in :class:`cluster.DBSCAN` with precomputed sparse neighbors
Member

Extra spaces here (possibly an issue with the merge).

- |Feature| In :func:`datasets.make_blobs`, one can now pass a list to the
``n_samples`` parameter to indicate the number of samples to generate per
cluster. :issue:`8617` by :user:`Maskani Filali Mohamed <maskani-moh>` and
:user:`Konstantinos Katrioplas <kkatrio>`.
>>>>>>> master
Member

Merge conflict markup ..

@rth
Member

rth commented Oct 1, 2018

Also, not running this in Travis produces 92.72% (-0.24%) code coverage, which is a bit of a shame.

Maybe we could add a short test for buchheim(tree) that wouldn't require matplotlib? I know it's vendored code, but since there are no tests elsewhere, maybe we could add at least a few sanity tests for that function?

@amueller
Member Author

amueller commented Oct 1, 2018

I would consider the code vendored but I would also consider the upstream unmaintained (I have not tried to contact the author so they might reply, but I wouldn't count on it).

So adding some tests on buchheim might make sense - I was probably just being lazy.
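
One matplotlib-free sanity check of the kind discussed here is that no two nodes at the same depth end up at the same x position. The sketch below shows that invariant on hand-placed coordinates; `PlacedNode` and `check_no_overlap` are hypothetical names, and the real test would run on the output of `buchheim(tree)` rather than on a hand-built tree.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlacedNode:
    """Hypothetical node with layout coordinates already assigned."""
    x: float
    y: int
    children: List["PlacedNode"] = field(default_factory=list)


def check_no_overlap(root: PlacedNode) -> None:
    """Raise AssertionError if two nodes at the same depth share an x."""
    xs_by_depth = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        xs_by_depth[node.y].append(node.x)
        stack.extend(node.children)
    for depth, xs in xs_by_depth.items():
        assert len(set(xs)) == len(xs), f"overlap at depth {depth}"


good = PlacedNode(0.5, 0, [PlacedNode(0.0, 1), PlacedNode(1.0, 1)])
check_no_overlap(good)  # passes silently
```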

@amueller
Member Author

amueller commented Oct 3, 2018

The skip is required because otherwise the matplotlib import fails...
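
How `# doctest: +SKIP` behaves can be seen with the standard library alone. In the sketch below the skipped lines import a deliberately nonexistent module (a stand-in for matplotlib being absent), yet the run reports no failures because skipped examples are never executed. The docstring content and names are illustrative, not taken from the scikit-learn docs.

```python
import doctest

DOCSTRING = """
>>> 1 + 1
2
>>> import module_that_is_not_installed  # doctest: +SKIP
>>> module_that_is_not_installed.plot()  # doctest: +SKIP
[Text(...)]
"""

# Parse the examples and run them without touching the filesystem.
parser = doctest.DocTestParser()
test = parser.get_doctest(DOCSTRING, {}, "demo", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)
```

The trade-off raised in the review still holds: a skipped example is also never verified, so its expected output can silently go stale.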

@amueller amueller changed the title Matplotlib tree plotting [MRG] Matplotlib tree plotting Oct 3, 2018
@amueller
Member Author

amueller commented Oct 3, 2018

image
it's a sign...

Member

@jnothman jnothman left a comment

I wish I had the patience and time to go through the algorithm in detail, but I don't atm. I'm happy with the changes in export.py, and happy with the output.

@@ -0,0 +1,187 @@
# taken from https://github.com/llimllib/pymag-trees/blob/master/buchheim.py
Member

should this file not include a more extensive license?

Member Author

added the license.

@@ -0,0 +1,54 @@
import numpy as np
Member

can you please rename this to test_reingold_tilford.py?

Member Author

done.

@amueller
Member Author

ping @rth ;)

@rth
Member

rth commented Oct 11, 2018

@amueller Will try to review at the end of the week. Thanks for making these improvements!

@amueller
Member Author

Thanks! And sorry for being pushy. I'm afraid some other grant or something will come my way and it'll lie around for another year :-/

Member

@rth rth left a comment

No worries. It's a bit difficult to review every detail of this PR, but I can confirm that the expected figure is indeed produced and renders nicely. Overall the code LGTM, we can always fix some edge-cases as they arrive in the future.

There is one duplicate file that needs removing though (cf below).

# reached all leafs
break
assert len(np.unique(x_at_this_depth)) == len(x_at_this_depth)
depth += 1
Member

This file is identical to sklearn/tree/tests/test_reingold_tilford.py and should be removed.

Member Author

meh thanks!

@amueller amueller merged commit c13ba26 into scikit-learn:master Oct 11, 2018
@jnothman
Member

jnothman commented Oct 13, 2018 via email

@Njreardo

Is there a way to increase the size of the tree image? I can't make any sense of the one that has been plotted for my purposes and I can't just download graphviz because it's a company computer.

@jnothman
Member

jnothman commented Sep 13, 2019 via email

Successfully merging this pull request may close these issues.

Add matplotlib based plotting of decision trees
4 participants