[MRG] Matplotlib tree plotting #9251

amueller · 2017-06-29T22:38:43Z

This PR does three things:

Refactors the graphviz export into a class for easier code reuse (the public interface stays the same)
Adds a matplotlib-based plotting frontend
Adds the Reingold-Tilford tree layout algorithm

For now this lives in the tree submodule to avoid conflicting with work on the plot submodule.

Todo:

We could try to avoid the intermediate Tree data structure but not sure if that's worth the hassle.

Example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_leaf_nodes=10).fit(iris.data, iris.target)

# scalex and scaley set the constant spacing assumed for every node box
plot_tree(tree, filled=True, class_names=iris.target_names,
          feature_names=iris.feature_names, scalex=150, scaley=80)

compared to graphviz:

The Reingold-Tilford algorithm makes sure a parent is centered above its children. Graphviz uses a different algorithm that results in a somewhat more compact (if possibly uglier?) layout.
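
To make the "parent centered above its children" property concrete, here is a minimal sketch of a naive layout that has it. This is not the Reingold-Tilford or Buchheim algorithm vendored in this PR: it simply gives leaves consecutive x positions and centers each parent over its first and last child, whereas the real algorithm also packs subtrees as tightly as possible. The `Node` class and `naive_layout` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node used only for this illustration."""
    label: str
    children: List["Node"] = field(default_factory=list)
    x: float = 0.0
    depth: int = 0


def naive_layout(root: Node) -> Node:
    """Assign x/depth so that every parent is centered over its children.

    Leaves receive consecutive integer x positions from left to right;
    an inner node is placed midway between its first and last child.
    """
    next_leaf_x = 0

    def place(node: Node, depth: int) -> None:
        nonlocal next_leaf_x
        node.depth = depth
        if not node.children:
            node.x = float(next_leaf_x)
            next_leaf_x += 1
            return
        for child in node.children:
            place(child, depth + 1)
        node.x = (node.children[0].x + node.children[-1].x) / 2

    place(root, 0)
    return root


a = Node("a", [Node("a1"), Node("a2")])
b = Node("b")
root = naive_layout(Node("root", [a, b]))
print(root.x, a.x, b.x)
```

The wasted horizontal space in a layout like this is what the contour-based subtree packing in Reingold-Tilford is designed to reduce.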

Currently the box sizes of the nodes are ignored, i.e. assumed constant, where the constant is expressed using the scalex and scaley variables.
I think it's ok to assume all boxes are the same size. I don't aim for perfection, I aim for a reasonable plot.

Disclaimer: this is an afternoon hack, and I'm not sure if I'll be able to fix it.
I might use it at scipy though. (cc @agramfort )

amueller · 2017-06-29T22:49:59Z

Graphviz uses a custom heuristic to solve an integer linear program as described here:
http://www.graphviz.org/Documentation/TSE93.pdf (section 4.1)
We could save most of the work (all sections before 4) as we already have a tree, but it still seems kinda tricky and I'm not sure it's worth it.

amueller · 2017-07-06T21:38:12Z

I changed the tests to work on the new color definition. They are visually indistinguishable and one might argue that #ffffff is a more obvious representation of white than #e5813900 ;)
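
As a small illustration of why the two color strings render the same, the sketch below composites an 8-digit RGBA hex color over a white background. The helper name `composite_over_white` is hypothetical and not part of the PR; it only assumes the usual alpha blending formula.

```python
def composite_over_white(rgba_hex: str) -> str:
    """Blend a '#rrggbb' or '#rrggbbaa' color over white, return '#rrggbb'."""
    h = rgba_hex.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    # A missing alpha channel means the color is fully opaque.
    alpha = int(h[6:8], 16) / 255 if len(h) == 8 else 1.0
    blended = [round(alpha * c + (1 - alpha) * 255) for c in (r, g, b)]
    return "#{:02x}{:02x}{:02x}".format(*blended)


print(composite_over_white("#e5813900"))  # alpha 0: fully transparent orange
print(composite_over_white("#e58139ff"))  # alpha ff: fully opaque orange
```

With alpha 0 the orange contributes nothing, so the result over white is plain white, which is why an opaque hex value is the simpler thing to assert in the tests.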

amueller · 2017-07-13T18:33:09Z

check out rendered example here: https://12061-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/auto_examples/tree/plot_iris.html

rth

Maybe my first review was overly negative. I think it's a nice improvement, and the ability to draw trees without graphviz is definitely important. The amount of added code is also relatively small for what it makes possible. A few comments below (some of which you have already addressed).

My main concern is that we are not testing this to the same extent as regular estimators. For GUI code that's quite difficult anyway: just because we run the examples doesn't mean that someone can't make a PR (or that the next matplotlib version won't) make the output wrong without throwing an exception. Unless someone looks at the figure, we won't be able to tell. It's different from regular examples, where it matters less, as users are expected to run this algorithm on any kind of tree. Similarly, it's hard to know whether the output will always be satisfactory for any kind of tree configuration. Running this in CircleCI or Travis is not really going to change that, I agree. The vendored sklearn/tree/_reingold_tilford.py also comes from a repo without unit tests that hasn't been updated in 5-7 years.

Overall I think it's a nice improvement, and I'm +1 to merge it, but someone would need to answer future issues about this :) Maybe it could be worth marking plot_tree experimental in the docstring?

rth · 2018-10-01T20:23:06Z

sklearn/tree/export.py

+        # is about the same as the distance between boxes
+        max_x, max_y = draw_tree.max_extents() + 1
+        ax_width = ax.get_window_extent().width
+        ax_height = ax.get_window_extent().height


The issue with this is that it does not allow changing the size of the plot. For instance, I can't read the text at the default size (it's too tiny when it takes up 1/4 of my screen by default; the version below is a bit zoomed, depending on your screen resolution)

while if I put it to full screen I get the following (with all the white space around it),

which is the area it occupied on my screen initially. No way to zoom in either.

Tested with matplotlib 2.2. I can confirm the results don't change for 3.0 either.

you need to change figsize or dpi I think. Happy to hear about a better way to do that. I guess we could redraw on resize?

Maybe we should add something to the docstring that you can control the size of the tree with figsize and dpi?

Docs sound appropriate
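
A minimal sketch of the figsize/dpi approach discussed above, assuming matplotlib and scikit-learn are installed and that `plot_tree` is imported from `sklearn.tree` (where released versions expose it) rather than from `sklearn.tree.export` as in this PR. The specific figure size and dpi are arbitrary illustrative values.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
clf.fit(iris.data, iris.target)

# The tree is drawn into whatever axes it is given, so a larger figure
# (or a higher dpi) is the way to get larger, more readable boxes.
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)
annotations = plot_tree(
    clf,
    filled=True,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    ax=ax,
)

# Render to an in-memory PNG; pass a file name instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Passing an explicit `ax` also lets the tree sit in one panel of a larger subplot grid.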

rth · 2018-10-01T20:27:27Z

sklearn/tree/export.py

-    >>> tree.export_graphviz(clf,
-    ...     out_file='tree.dot')                # doctest: +SKIP
+    >>> tree.plot_tree(clf) # doctest: +SKIP
+    [Text(251.5,345.217,'X[3] <= 0.8...


Will that really change between runs? Can't we remove the SKIP? Though I guess it can also be affected by matplotlib versions?

I'm pretty sure I had a reason for this. I don't remember whether it was non-determinism or something about requiring the presence of matplotlib. I can try to remove it.

rth · 2018-10-01T20:28:27Z

doc/whats_new/v0.21.rst

-
- |Fix| Fixed a bug in :class:`cluster.DBSCAN` with precomputed sparse neighbors
+  
+  - |Fix| Fixed a bug in :class:`cluster.DBSCAN` with precomputed sparse neighbors


Extra spaces here (possibly an issue with the merge).

rth · 2018-10-01T20:31:05Z

doc/whats_new/v0.20.rst

 - |Feature| In :func:`datasets.make_blobs`, one can now pass a list to the
  ``n_samples`` parameter to indicate the number of samples to generate per
  cluster. :issue:`8617` by :user:`Maskani Filali Mohamed <maskani-moh>` and
  :user:`Konstantinos Katrioplas <kkatrio>`.
+>>>>>>> master


Merge conflict markup ..

rth · 2018-10-01T21:29:09Z

Also, not running this in Travis produces 92.72% (-0.24%) code coverage, which is a bit of a shame.

Maybe we could add a short test for buchheim(tree) that wouldn't require matplotlib? I know it's vendored code, but since there are no tests elsewhere, maybe we could add at least a few sanity tests for that function?

amueller · 2018-10-01T22:32:45Z

I would consider the code vendored but I would also consider the upstream unmaintained (I have not tried to contact the author so they might reply, but I wouldn't count on it).

So adding some tests on buchheim might make sense - I was probably just being lazy.
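
One layout invariant that can be checked without matplotlib is that no two nodes at the same depth share an x position. The sketch below shows that check on plain `(depth, x)` pairs. It does not call the vendored `buchheim` function, and the helper name `unique_x_per_depth` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def unique_x_per_depth(positions: Iterable[Tuple[int, float]]) -> bool:
    """Return True if no two nodes on the same depth share an x value."""
    xs_by_depth = defaultdict(list)
    for depth, x in positions:
        xs_by_depth[depth].append(x)
    return all(len(set(xs)) == len(xs) for xs in xs_by_depth.values())


# A valid layout: the root sits between two children on the next level.
good_layout = [(0, 1.0), (1, 0.0), (1, 2.0)]
# An invalid layout: both children were drawn on top of each other.
bad_layout = [(0, 1.0), (1, 0.0), (1, 0.0)]

print(unique_x_per_depth(good_layout), unique_x_per_depth(bad_layout))
```

A check like this cannot tell whether a plot looks good, but it does catch overlapping nodes, which is the kind of silent failure the review is worried about.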

amueller · 2018-10-03T19:38:41Z

The skip is required because otherwise the matplotlib import fails...

amueller · 2018-10-03T20:33:51Z

it's a sign...

jnothman

I wish I had the patience and time to go through the algorithm in detail, but I don't atm. I'm happy with the changes in export.py, and happy with the output.

jnothman · 2018-10-08T13:18:20Z

sklearn/tree/_reingold_tilford.py

@@ -0,0 +1,187 @@
+# taken from https://github.com/llimllib/pymag-trees/blob/master/buchheim.py


should this file not include a more extensive license?

added the license.

jnothman · 2018-10-08T13:18:30Z

sklearn/tree/tests/test_reingold.py

@@ -0,0 +1,54 @@
+import numpy as np


can you please rename this to test_reingold_tilford.py?

amueller · 2018-10-10T19:45:55Z

ping @rth ;)

rth · 2018-10-11T07:43:33Z

@amueller Will try to review at the end of the week. Thanks for making these improvements!

amueller · 2018-10-11T18:42:01Z

Thanks! And sorry for being pushy. I'm afraid some other grant or something will come my way and it'll lie around for another year :-/

rth

No worries. It's a bit difficult to review every detail of this PR, but I can confirm that the expected figure is indeed produced and renders nicely. Overall the code LGTM, we can always fix some edge-cases as they arrive in the future.

There is one duplicate file that needs removing though (cf below).

rth · 2018-10-11T19:08:06Z

sklearn/tree/tests/test_reingold.py

+            # reached all leafs
+            break
+        assert len(np.unique(x_at_this_depth)) == len(x_at_this_depth)
+        depth += 1


This file is identical to sklearn/tree/tests/test_reingold_tilford.py and should be removed.

meh thanks!

jnothman · 2018-10-13T23:26:02Z

Teehee good to have this merged! Now the text export too??

Njreardo · 2019-09-12T15:15:50Z

Is there a way to increase the size of the tree image? I can't make any sense of the one that has been plotted for my purposes and I can't just download graphviz because it's a company computer.

jnothman · 2019-09-13T00:12:09Z

Please ask such questions on stack overflow

amueller added 6 commits June 29, 2017 15:57

add reingold tillford tree layout algorithm

1881ece

add first silly implementation of matplotlib based plotting for trees

bd7d022

object oriented design for export_graphviz so it can be extended

287c1d2

add class for mlp export

0d5e3e2

add colors

8f52d87

separately scale x and y, add arrowheads, fix strings

4a5fe67

amueller added 7 commits June 29, 2017 19:07

implement max_depth

ddb6c16

don't use alpha for coloring because it makes boxes transparent

fed2d1d

remove unused variables

5145ed2

vertical center of boxes

8663ad7

fix/simplify newline trimming

d750deb

somewhere in the middle of stuff

d3c17ea

trying to get rid of scalex, scaley

remove "find_longest_child" for now, fix tests

823ce1f

amueller added 13 commits July 6, 2017 18:33

make scalex and scaley internal, and ax local.

0229d5d

render everything once to get the bbox sizes, then again to actually plot it with known extents.

add some margin to the max bbox width

a2df69e

add _BaseTreeExporter baseclass

5212f59

add docstring to plot_tree

60c0b73

use data coordinates so we can put the plot in a subplot, remove some…

3b4a730

… hacks.

remove scalex, scaley, add automatic font size

a30f634

use rendered stuff for setting limits (well nearly there)

27a29ac

Merge branch 'master' into matplotlib_tree_plotting

c2e6d31

import plot_tree into tree module

538d257

set limits before font size adjustment?

c6ecbb2

add tree plotting via matplotlib to iris example and to docs

fc7bdbe

pep8 fix

9d672ab

skip doctest on plot_tree because matplotlib is not installed on all …

1c8b8d6

…CI machines

redo everything in axis pixel coordinates

474c557

re-introduce scalex, scaley add max_extents to tree to get tree size before plotting

whitespace ...

042865a

rth reviewed Oct 1, 2018

amueller added 4 commits October 3, 2018 14:09

remove doctest skip to see what's happening

1a16750

added some simple invariance tests buchheim function

6f2d597

refactor

6803f96

___init__ into superclass

added some tests of plot_tree

7b62316

amueller added 2 commits October 3, 2018 15:51

put skip back in, fix typo, fix versionadded number

817b2bb

remove unused parameters special_characters and parallel_leaves from …

55c7d36

…mpl plotting

amueller changed the title ~~Matplotlib tree plotting~~ [MRG] Matplotlib tree plotting Oct 3, 2018

Merge branch 'master' into matplotlib_tree_plotting

c228ae0

jnothman approved these changes Oct 8, 2018

amueller added 5 commits October 10, 2018 15:40

Merge branch 'master' into matplotlib_tree_plotting

69e7c40

# Conflicts: # doc/modules/tree.rst

Merge branch 'matplotlib_tree_plotting' of github.com:amueller/scikit…

fc756d0

…-learn into matplotlib_tree_plotting # Conflicts: # doc/modules/tree.rst

rename tests to test_reingold_tilford

9554a87

Merge branch 'master' into matplotlib_tree_plotting

becfa07

added license header from pymag-trees repo

435d217

rth approved these changes Oct 11, 2018

remove duplicate test file.

98e8d5a

amueller merged commit c13ba26 into scikit-learn:master Oct 11, 2018

lesteve mentioned this pull request Sep 13, 2019

[MRG] Remove GraphViz mention in plot_tree docstring. #14973

Merged

