-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Matplotlib tree plotting #9251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Matplotlib tree plotting #9251
Conversation
Graphviz uses a custom heuristic to solve an integer linear program as described here: |
trying to get rid of scalex, scaley
I changed the tests to work on the new color definition. They are visually indistinguishable and one might argue that |
render everything once to get the bbox sizes, then again to actually plot it with known extents.
check out rendered example here: https://12061-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/auto_examples/tree/plot_iris.html |
re-introduce scalex, scaley add max_extents to tree to get tree size before plotting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe my first review was overly negative. I think it's a nice improvement, and the ability to draw trees without graphviz is definitely important. The amount of added code is also relatively small with respect to what it allows to do. A few comments below (some of which you ready just addressed)
My main concern is that we are not testing this to the same extent as regular estimators. For GUI it's quite difficult anyway, just because we run the examples, doesn't mean that someone cannot make a PR that would make the output wrong without throwing an exception (or that next matplotlib version wont). Unless someone looks at the figure we won't be able to tell. It's different from regular examples, where it matter less, as users are expected to run this algorithm on any kind of tree. Similarly it's hard to know if the output will be always satisfactory for any kind of tree configuration. Running this in CircleCI or Travis is not really going to change that, I agree. The vendored sklearn/tree/_reingold_tilford.py
also comes from a repo without unit tests, that hasn't been updated in 5-7 years.
Overall I think it's a nice improvement, and I'm +1 to merge it, but someone would need to answer future issues about this :) Maybe it could be worth marking plot_tree
experimental in the docstring?
# is about the same as the distance between boxes | ||
max_x, max_y = draw_tree.max_extents() + 1 | ||
ax_width = ax.get_window_extent().width | ||
ax_height = ax.get_window_extent().height |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with this that it does not allow to change the size of the plot. For instance, I can't read the text with the default size (it's too tiny when it takes 1/4 of my screen by default, the below version is a bit zoomed, depending on your screen resolution)
while if I put it to full screen I get the following (with all the white space around it),
which is the area it occupied on my screen initially. No way to zoom in either.
Tested with matplotlib 2.2. I can confirm the results don't change for 3.0 either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to change figsize
or dpi
I think. Happy to hear about a better way to do that. I guess we could redraw on resize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add something to the docstring that you can control the size of the tree with figsize
and dpi
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs sound appropriate
>>> tree.export_graphviz(clf, | ||
... out_file='tree.dot') # doctest: +SKIP | ||
>>> tree.plot_tree(clf) # doctest: +SKIP | ||
[Text(251.5,345.217,'X[3] <= 0.8... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will really change between runs? Can't remove the SKIP? Though, I guess it can also be impacted by matplotlib versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure I had a reason for this. I don't remember whether it was non-determinism or something about requiring the presence of matplotlib. I can try to remove it.
doc/whats_new/v0.21.rst
Outdated
|
||
- |Fix| Fixed a bug in :class:`cluster.DBSCAN` with precomputed sparse neighbors | ||
- |Fix| Fixed a bug in :class:`cluster.DBSCAN` with precomputed sparse neighbors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra spaces here (possibly issue with with merge).
doc/whats_new/v0.20.rst
Outdated
- |Feature| In :func:`datasets.make_blobs`, one can now pass a list to the | ||
``n_samples`` parameter to indicate the number of samples to generate per | ||
cluster. :issue:`8617` by :user:`Maskani Filali Mohamed <maskani-moh>` and | ||
:user:`Konstantinos Katrioplas <kkatrio>`. | ||
>>>>>>> master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge conflict markup ..
Also not running at this in Travis produces 92.72% (-0.24%) code coverage, which is a bit a shame. Maybe we could a short test for |
I would consider the code vendored but I would also consider the upstream unmaintained (I have not tried to contact the author so they might reply, but I wouldn't count on it). So adding some tests on |
The skip is required because otherwise the matplotlib import fails... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish I had the patience and time to go through the algorithm in detail, but I don't atm. I'm happy with the changes in export.py, and happy with the output.
@@ -0,0 +1,187 @@ | |||
# taken from https://github.com/llimllib/pymag-trees/blob/master/buchheim.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this file not include a more extensive license?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the license.
sklearn/tree/tests/test_reingold.py
Outdated
@@ -0,0 +1,54 @@ | |||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please rename this to test_reingold_tilford.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
# Conflicts: # doc/modules/tree.rst
…-learn into matplotlib_tree_plotting # Conflicts: # doc/modules/tree.rst
ping @rth ;) |
@amueller Will try to review at the end of the week. Thanks for making these improvements! |
Thanks! And sorry for being pushy. I'm afraid some other grant or something will come my way and it'll lie around for another year :-/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. It's a bit difficult to review every detail of this PR, but I can confirm that the expected figure is indeed produced and renders nicely. Overall the code LGTM, we can always fix some edge-cases as they arrive in the future.
There is one duplicate file that needs removing though (cf below).
sklearn/tree/tests/test_reingold.py
Outdated
# reached all leafs | ||
break | ||
assert len(np.unique(x_at_this_depth)) == len(x_at_this_depth) | ||
depth += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is identical with sklearn/tree/tests/test_reingold_tilford.py
and should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meh thanks!
Teehee good to have this merged! Now the text export too??
|
Is there a way to increase the size of the tree image? I can't make any sense of the one that has been plotted for my purposes and I can't just download graphviz because it's a company computer. |
Please ask such questions on stack overflow
|
Fixes #8508
This PR does three things:
For now this lives in the tree submodule to avoid conflicting with work on the plot submodule.
Todo:
We could try to avoid the intermediate Tree data structure but not sure if that's worth the hassle.
Example:
compared to graphviz:

The Reingold-Tilford algorithm makes sure a parent is centered above its children. Graphviz uses a different algorithm that results in a somewhat more compact (if possibly uglier?) layout.
Currently the box sizes of the nodes are ignored, i.e. assumed constant, where the constant is expressed using the
scalex
andscaley
variables.I think it's ok to assume all boxes are the same size. I don't aim for perfection, I aim for a reasonable plot.
Disclaimer: this is an afternoon hack, and I'm not sure if I'll be able to fix it.
I might use it at scipy though. (cc @agramfort )