API: Refactor image scraping #313
Looks like CircleCI has broken independently of my changes -- it updated numpy/scipy but apparently not Mayavi (since there is a numpy version number mismatch).
This is a nice refactor! I'll take a closer look in the coming days (teaching a software carpentry @ sfn tomorrow). A quick question: this doesn't handle the HTML viz condition I was talking about in #313, right? In that case I was talking about actually grabbing the generated HTML of libraries, to create pages like http://ipyvolume.readthedocs.io/en/latest/bokeh.html
Just glancing at this, it looks quite neat! My main concern is something that I stated elsewhere: I am very reluctant to have captured PNGs looking the same as captured figures. Maybe the capturer could output some rST, rather than just saving images to the right location?
Hmm... how should they look different?
Right, that would require a bigger refactoring. Maybe we need …
Maybe something like this (i.e. one section per capturer + explicit filenames for the "saved files" capturer)? Better suggestions are more than welcome!

Having one section per capturer allows us to be a bit clearer than what we currently have. At the moment matplotlib and mayavi are mixed, and the order can potentially be confusing because matplotlib figures are always captured before mayavi ones. See the snapshot at the end of this message.

At the very least, for the saved-files capturer the filenames should be visible, to allow easy matching between the script (filenames) and the output (images). Without the saved-files filenames, the order of the saved-files images is likely not obvious.

Attaching the mockup .svg as a text file, in case it is useful if someone wants to tweak it. Snapshot for part of mayavi_examples/plot_3d.py:
FWIW the current PR makes it so that the order in … What you propose about having different sections is a bit of a backward-compat change (this PR preserves existing behavior). Is that okay to impose, since it's an aesthetic change (as it technically shouldn't break any code, just reorganize visually)?
Personally, I don't mind not having separate sections for each image type. After all, if people really want separate plots for separate types, they can always make separate code sections...
> Personally, I don't mind not having separate sections for each image type. After all, if people really want separate plots for separate types, they can always make separate code sections...

+1
+1 to @larsoner as well... not totally against the idea of different visual cues, but I don't think it needs to be in this PR
> +1 to @larsoner as well... not totally against the idea of different visual cues, but I don't think it needs to be in this PR

A different CSS class for the div seems like a good idea: it opens the possibility of doing that.
One future-compatible change I could make in this PR is to ensure that each scraper exposes a …
+1 seems like a good compromise
Do I understand correctly that everybody agrees it's fine to force users to use notebook-like syntax (aka code cells) and essentially have one image output per cell, in order to avoid possible confusion in the output? It feels quite kludgy to me, I have to say, but if that's the consensus, fine.

I was hoping that the capturer contract would be a bit more than just saving an image in the right location. Returning some rST on top of saving the image was a suggestion, but maybe there is a better way of doing it. This way, if a user wants to implement a PNGScraper that adds some kind of text with the filename, they can do it.
> Do I understand correctly that everybody agrees it's fine to force users to use notebook-like syntax (aka code cells) and essentially have one image output per cell, in order to avoid possible confusion in the output? It feels quite kludgy to me, I have to say, but if that's the consensus, fine.

Yes. I have found while writing examples that splitting cells often was important to be didactic anyhow.
This is essentially the current behavior / status quo. I agree it's not ideal, but it's nonetheless useful.
What if we say the contract in this PR is merely a "minimal contract" change?
In the last case example (what @Titan-C and @choldgraf have talked about, I think), maybe we will want to generalize the existing parameter to handle this, or maybe it will make more sense to make an entirely different class type. In either case, I don't think these are precluded by merging this PR. The current proposal seems pretty future-compatible, i.e., it shouldn't lead to too-clunky interfaces when the API of these more advanced contracts is decided. In the meantime, it opens up the possibility of fixing two existing problems: including saved images (#206) and including images from other viz libraries (my use case, VisPy). In this light, even adding a …
I have created a quick-and-dirty PR on top of yours to show that supporting rST in scrapers is not much work on top of what you have already done. See larsoner#3.

My main worry is that with a scraper interface where you only save images, there is no way to have a saved-PNG scraper where the ordering is obvious from just looking at the example HTML. People (like yt) may want to use this feature for capturing saved PNG images; they'll have multiple images in some cells and they'll be confused by the ordering. For an example where the output is confusing in yt, look at http://yt-project.org/docs/dev/cookbook/simple_plots.html#showing-and-hiding-axis-labels-and-colorbars.

Admittedly this kind of development could be done in a further PR. I kind of feel there is enough momentum behind this refactoring to have scrapers return rST. The risk of doing it in separate PRs is that it may not happen in the near future; then we'll have this non-optimal scraper interface for a while, and we'll need to think about deprecating it, with all the pain that entails for both maintainers and users.
Feel free to take over
My point above is that I think that such future enhancements (returning the RST) are future-compatible with the current approach. For example, we can make it such that:
If we also require that people add a … Do you agree that this would be future-compatible with your proof-of-concept PR, without the need for a deprecation cycle?
I can see that the proof of concept probably wasn't too much work. However, as I'm sure you know, there is more work involved to get things fully fleshed out, working properly without any …
I do get all of your points, of course... basically you say the minimal amount of change already allows a lot of use cases that weren't there before. I say: oh, but we are so close to something generic that we should just implement the generic solution. I am happy for others to jump in and give their opinions.

I am not so fond of a scraper interface that allows a lot of possible outputs, I would say. It makes the code harder to maintain IMO. Maybe I'll try to work on my PR to your PR a bit more, tidy it up a bit, and try to convince you that a scraper interface returning rST is within reach without much effort.
My 2 cents: I think that a general image scraper functionality would be a great addition, though I don't think it should be blocking on this PR. This effectively takes us from N=2 to N=3 image production approaches, no? I think that's a valuable contribution in itself, and shouldn't be impeded just because we want to go to N > 3, no? If @larsoner is correct that this PR lays a foundation for general scraping functionality, why not merge this PR after the next sphinx-gallery release and then there will be one development cycle's worth of time to generalize it per @lesteve's suggestions. |
Thanks for this suggestion, sounds very reasonable to me.
I'm completely in favour of this PR making the matplotlib and mayavi capturers independent, and starting to build the logic for adding new capturers. I do not agree with putting a suggested scraper in our documentation without mentioning that it is an experimental feature. In general I don't like scrapers, as they target everything; we need something that is specific to the output. As I commented in #208 (comment), what if there are other PNGs in the directory one is scraping that are source PNGs to work on? Those would be moved as output. For a later iteration we can indeed figure out how to capture any object and output its rST representation.
In that case I might rename the var to …
I thought this at first (we'd probably want separate classes / config vars) but from the discussion about both picking images and embedding RST, I don't think we need to anymore. The current API is hopefully sufficiently extendable / future compatible to allow scraping and embedding other objects, too.
I'll add something saying that it's a half-baked stub implementation that should be tailored to suit an actual use case.
In cases where people actually want PNGScraper-like functionality, they can devise workable logic based on their particular use cases. For example, one could configure the scraper to only look for …

Hopefully after this PR is merged, @alexhuth, @ngoldbaum, or someone else who wants saved-image scraping can take the stub and flesh it out into something more generally useful, and we could include it as a proper class in SG. So to me the todo list is:
Since this is a half-baked experimental feature I really don't think we should worry about future compat. |
My plan is to implement a scraper for VisPy very soon after this PR. It would be nice if I didn't have to update it later. The …
It would be nice, but I don't think that's the kind of guarantee you should expect from half-baked experimental features :(. IMO the …

I think the scraper returning rST is just the way forward. This feels like the approach that will require the least combined effort from everyone involved. I'll try to find some time to make some progress in this direction.
I agree with this statement, but I don't think of it as being half-baked ...
I assume this is why you do. This does make the simple case of embedding images harder (PNG scraper, VisPy scraper, etc.), though. You are opposed to allowing …
Codecov Report
@@ Coverage Diff @@
## master #313 +/- ##
==========================================
- Coverage 95.38% 95.33% -0.05%
==========================================
Files 27 29 +2
Lines 2013 2166 +153
==========================================
+ Hits 1920 2065 +145
- Misses 93 101 +8
Continue to review full report at Codecov.
Okay @lesteve, I have refactored the code quite a bit and updated the top-level description. See if you are satisfied by the description and API contracts described here (for image_scrapers and reset_modules): https://537-25860190-gh.circle-artifacts.com/0/rtd_html/advanced_configuration.html If so, then the code is ready for review/merge from my end.
I really like this step toward being able to capture other objects. Having the scraper output rST is the path forward, as stated earlier in this PR's comments.

I really like the idea of the `image_path_iterator`, instead of tracking the number of images captured. Something that comes to mind is that in its current state it assumes we are only capturing PNG files. But that can later be overridden inside the scraper itself.

A bit on the speculative side: I'm also thinking of a general structure for capturing things. For example, for now we capture STDOUT, but as mentioned earlier some functions output nicer HTML representations; we might want to make that a capture option. But in another PR.
doc/advanced_configuration.rst
Outdated
By default, Sphinx-gallery will only look for :mod:`matplotlib.pyplot` figures
when building. However, extracting figures generated by :mod:`mayavi.mlab` is
also supported. To enable this feature, you can do::
Image scrapers
I would like to add that this is an experimental feature, here in the title and in the description. This is something we are trying out in order to capture different output objects.
IIRC @lesteve wanted such a warning if we were not satisfied with the API. Do you think that it might need to change?
doc/advanced_configuration.rst
Outdated
}

.. note:: The parameter ``find_mayavi_figures`` which can also be used to
   extract Mayavi figures is **deprecated** in version 1.13+,
In our numbering we are at 0.2.0; 1.13+ is way in the future
.. _reset_modules:

Resetting modules
As @lesteve always reminds me, we should have different features in different PRs
I see these as a bit related, though, since if you don't want to use the mpl scraper then you probably also do not want to use the mpl resetter.
(And if you want to add your own scraper, you might need your own resetter.)
sphinx_gallery/gen_rst.py
Outdated
fig.savefig(current_fig, **kwargs)
figure_list.append(current_fig)
image_paths.append(image_path)
fig.savefig(image_paths[-1], **kwargs)
I somehow prefer to save the image first and then append the file path to the list. Do you have any reason for this order? In my mind, if saving fails, then we don't have the image listed. In case of failure, of course, I expect an exception, and everything becomes irrelevant. I also see later on that you have a check to scan for the registered images.
I don't think the order matters. But since you prefer the other order, I can switch it.
sphinx_gallery/gen_rst.py
Outdated
e = mlab.get_engine()
for scene, image_path in zip(e.scenes, image_path_iterator):
    image_paths.append(image_path)
    mlab.savefig(image_paths[-1], figure=scene)
Same comment about the ordering of saving and path recording
sphinx_gallery/gen_rst.py
Outdated
Configuration and run time variables
gallery_conf : dict
    Contains the configuration of Sphinx-Gallery
base_image_name = os.path.splitext(fname)[0]
This is an error from the rebase. The content on the left is the correct one
@Titan-C comments addressed, and I also broadened the API to have the scrapers take the …
... and I added a note about the API being experimental for custom scrapers.
I'm good with this. In my PR #324 experimenting with Bokeh, I also ran into the need to have the save_figures function take the block and the execution globals dictionary. Taking block_vars might not be enough, as one needs to capture something in the state of the running program. That said, since these are experimental features, and with the goal in mind of capturing different objects more like a plugin, I'm good with merging this.
doc/advanced_configuration.rst
Outdated
pngs = sorted(glob.glob(os.path.join(os.getcwd(), '*.png')))
image_names = list()
image_path_iterator = block_vars['image_path_iterator']
for png in my_pngs:
To avoid confusion, remove the `count=0` line and iterate over `pngs`.
@choldgraf do you have time to look, too?
I took a relatively quick pass, but I unfortunately don't have a ton of time because I'm headed to Europe to get married this week, and I will be sort of offline for a few weeks after, as I'll be trying not to work as much as possible.
In general I think this looks good - there's still not a clear path in my mind towards how I would create my own custom scraper (or how I could add a scraper for something HTML-based instead of file-based) but I think we can spot-check this in later PRs. I'm always +1 on refactoring and cleaning things up and I like that this generalizes these features to (potentially) new image producing things. So if @Titan-C is +1 then I think we should 🚢 it
sphinx_gallery/gen_rst.py
Outdated
_import_matplotlib()


def matplotlib_scraper(block, block_vars, gallery_conf):
I kinda feel like these scrapers should be in a different module (scrapers.py?), what do you all think? gen_rst is quite generic, and if there's a lot of scraper-specific code that could be added, perhaps it's enough to live in its own file.
Agreed, I have refactored them into their own module now.
For example, a naive class to scrape any new PNG outputs in the
current directory could do, e.g.::

    import glob
I must be honest that this explanation is not super clear to me, but I don't know how the scrapers work so am not sure how I could improve it. However, I don't think it should block this PR because it's a good start. We should open an issue about improving this documentation once it's merged.
Yes let's iterate on the docs, probably after one of the issues listed at the top gets addressed by a user
I'm not happy at all with the idea of having a scraper use glob; I think I already raised this concern in another issue. For example, what if people have PNG images as data input (scikit-image, maybe)? Then this might accidentally capture those. Or, if one day we implement parallel builds, could this lead to problems if some examples save to disk at the same time? But this is an experimental feature, and we will iterate on the docs and inner mechanisms, so I can let this pass.
sphinx_gallery/gen_rst.py
Outdated
def clean_modules(gallery_conf, fname):
    """Remove/unload seaborn from the name space
no longer seaborn-specific, right?
sphinx_gallery/gen_rst.py
Outdated
    does not want to influence in other examples in the gallery.
    """
    for reset_module in gallery_conf['reset_modules']:
        reset_module(gallery_conf, fname)
I assume this doesn't fail if one of the packages isn't installed, right?
Right now matplotlib is a hard requirement. Eventually we can remove this requirement if we want, and this PR at least makes that easier.
@choldgraf comments addressed. https://543-25860190-gh.circle-artifacts.com/0/rtd_html/reference.html
I'm +1 on a merge if AppVeyor becomes happy. @Titan-C want to do the honors if you're OK with the …
I think this is fine. Better to get this in and start experimenting. I'm definitely changing the API of save_figures, as I need to pass an extra variable containing the state of the example being executed. I would also be happier if, instead of scrapers, we named these objects capturers. But that is just me, and we can iterate later.
This PR:

- `clean_modules` behavior to allow custom "cleaning"
- `clean_modules` to run once per folder before any files (fixing "First gallery plot uses .matplotlibrc rather than the matplotlib defaults" #316)

Provides potential end-user-derived solutions for:

Closes #316.