[SVG] Introduce sequential ID-generation scheme for clip-paths. #27833

Merged
merged 1 commit into from
Jul 6, 2024

Conversation

jayaddison
Contributor

@jayaddison jayaddison commented Feb 28, 2024

PR summary

This pull request is intended to improve the reproducibility of SVG output from matplotlib, by removing variability from the ID generation scheme for the identifiers of <clipPath> XML elements (and references to them).

In particular, the use of Python's built-in id(...) function, which returns an integer identifier for an object in memory at runtime (often, though not necessarily, a memory address), is removed and replaced by a monotonically increasing counter value.
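
As a rough illustration of the difference (a sketch only; `ClipShape` and both helper functions are hypothetical names, not matplotlib code): a name built from id(...) depends on interpreter state at runtime, whereas a name built from a counter depends only on how many names have been issued so far.

```python
import itertools


class ClipShape:
    """Hypothetical stand-in for a clip-path object."""


def names_from_id(shapes):
    # id(...) values depend on interpreter state, so these names can
    # differ from one run of the program to the next.
    return [f"p{id(shape)}" for shape in shapes]


def names_from_counter(shapes):
    # A counter only depends on how many names have been handed out,
    # so the same sequence of shapes always yields the same names.
    counter = itertools.count()
    return [f"p{next(counter)}" for _ in shapes]


shapes = [ClipShape() for _ in range(3)]
print(names_from_id(shapes))       # varies between runs
print(names_from_counter(shapes))  # ['p0', 'p1', 'p2']
```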

Closes #27831.

PR checklist

Edit: use a more-direct hyperlink to the test coverage recommendation.

@github-actions github-actions bot left a comment

Thank you for opening your first PR into Matplotlib!

If you have not heard from us in a week or so, please leave a new comment below and that should bring it to our attention. Most of our reviewers are volunteers and sometimes things fall through the cracks.

You can also join us on gitter for real-time discussion.

For details on testing, writing docs, and our review process, please see the developer guide

We strive to be a welcoming and open project. Please follow our Code of Conduct.

@jayaddison jayaddison changed the title [SVG] Implement monotonically-increasing counter for clipPath identifiers [SVG] Use monotonically-increasing counter for non-rectangular clip-path identifiers Feb 29, 2024
@jayaddison jayaddison marked this pull request as ready for review February 29, 2024 14:13
@tacaswell tacaswell added this to the v3.10.0 milestone Feb 29, 2024
@jayaddison
Contributor Author

Comment-pinging to check for possible review on this PR - thank you!

@@ -590,7 +596,7 @@ def _get_clip_attrs(self, gc):
         clippath, clippath_trans = gc.get_clip_path()
         if clippath is not None:
             clippath_trans = self._make_flip_transform(clippath_trans)
-            dictkey = (id(clippath), str(clippath_trans))
+            dictkey = (self._get_next_clip_id(), str(clippath_trans))
Member

I think this breaks the re-use of a clip path if we have multiple artists that use the same clip path.

We need to keep a second dictionary that maps id(clippath) -> incrementing int so that on line 608 when we make the oid we can use that instead.
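
A minimal sketch of the concern raised here, assuming a cache keyed the same way as `dictkey` in the diff above (the `ClipPathIds` class and `cache_entries` helper are hypothetical, written only for illustration). Taking a fresh counter value on every lookup gives a shared clip path a new key each time, while mapping each clip-path object to its first-assigned integer keeps one key per path.

```python
import itertools


class ClipPathIds:
    """Hypothetical helper: maps each clip-path object to a small integer."""

    def __init__(self):
        self._ids = {}

    def get(self, clippath):
        # First sighting gets the next integer; later sightings reuse it.
        # The dict holds a reference to the object, so its identity cannot
        # be recycled for a different object while the mapping exists.
        if clippath not in self._ids:
            self._ids[clippath] = len(self._ids)
        return self._ids[clippath]


def cache_entries(clippaths, next_key):
    """Count distinct cache keys produced for a sequence of clip-path uses."""
    return len({(next_key(path), "transform") for path in clippaths})


shared = object()          # one clip path used by two artists
uses = [shared, shared]

counter = itertools.count()
print(cache_entries(uses, lambda path: next(counter)))  # 2: re-use is lost
print(cache_entries(uses, ClipPathIds().get))           # 1: re-use is kept
```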

Member

@tacaswell tacaswell Mar 12, 2024

It won't let me comment on line 608, but I think the logic there should be something like:

if clip is None:
    stable_id = self._get_next_clip_id()
    oid = self._make_id('p', stable_id)

Member

so I don't think we actually need a second dictionary to keep the mapping? That seems good to not pick up extra state.

Contributor Author

Ok, yep. It seems like some extra test coverage would be useful in that case, to confirm whether keeping an id-mapping is required (and to test any fixes if so).

Contributor Author

I'm not sure the test coverage I've added makes semantic/usage sense, but it does now cover sharing a clip-path (technically a Patch) object across multiple artists.

I've added some test coverage on ID-uniqueness in the generated SVG at the same time after finding some code/issue history mentioning that it's important to maintain distinct identifiers.

Contributor Author

(however, as noted in #27831 - I missed a note that recommended a particular style of test coverage - it's not yet added here)

Contributor Author

> I think this breaks the re-use of a clip path if we have multiple artists that use the same clip path.
>
> We need to keep a second dictionary that maps id(clippath) -> incrementing int so that on line 608 when we make the oid we can use that instead.

This pull request has been updated to implement this in a way that (I hope!) follows the intended behaviour, after re-reading the issue thread and details like the above. There's no second-level dictionary required, but we do make use of one dictionary to store the clip-to-incrementing-id mapping.

I don't think the changes are ready quite yet though; the 2x2 grid test case is yet to be added.

Contributor Author

> so I don't think we actually need a second dictionary to keep the mapping? That seems good to not pick up extra state.

The implication of this is that further refactoring in this part of the code could be helpful -- and I agree that there seem to be opportunities to simplify the logic here. However: I think it's important to confirm that the test coverage is sufficient first.

@tacaswell
Member

sorry for the delay.

@jayaddison
Contributor Author

No problem, thank you for the comments!

@jayaddison

This comment was marked as outdated.

@jayaddison jayaddison changed the title [SVG] Use monotonically-increasing counter for non-rectangular clip-path identifiers [SVG] Introduce repeatable ID-generation scheme for clip-path identifiers. Apr 25, 2024
@jayaddison jayaddison changed the title [SVG] Introduce repeatable ID-generation scheme for clip-path identifiers. [SVG] Introduce repeatable ID-generation scheme for clip-paths. Apr 25, 2024
@jayaddison

This comment was marked as outdated.

@jayaddison

This comment was marked as outdated.

@jayaddison jayaddison changed the title [SVG] Introduce repeatable ID-generation scheme for clip-paths. [SVG] Introduce sequential ID-generation scheme for clip-paths. Apr 28, 2024
@tacaswell
Member

Can you rebase [and force push to your branch] this on main (rather than merging)?

@jayaddison jayaddison marked this pull request as ready for review April 29, 2024 22:12
@jayaddison
Contributor Author

I realized that I hadn't visually inspected the results of rendering the 2x2 star grid example in the test cases -- and related to that, worried that perhaps the re-use of clip path IDs could cause problems in that case. However, the results do appear correct to me:

image

And included below is the 'before' image -- where I've rendered the same diagram but without the changes to backend_svg.py from this branch:

image

(note: I screenshotted these using manually-selected rectangular screen regions, so the screenshots are not bit-for-bit identical; and indeed neither are the SVG files they were displayed from, as expected -- but re-rendering the files with the backend_svg.py changes in place does produce bit-for-bit identical SVG files)

@jayaddison
Contributor Author

Perhaps this is an unlikely scenario, or perhaps it is a non-issue, but before merge I would like to confirm that the following situation is handled reasonably:

  • Two distinct SVG diagrams are constructed from matplotlib code using the same hashsalt and SOURCE_DATE_EPOCH value -- as could realistically occur when (re)building the contents of a report from source.
  • Both of the SVG diagrams make use of distinct clipping paths.
  • Both of the SVG diagrams appear in a single HTML page as output.

My specific concern here is that the id values of the output SVG diagrams could collide. Unfortunately my understanding is that when SVG diagrams are embedded within HTML documents, they do not have separate namespace scoping.

I'd forgotten about that particular problem until recently, but it may be relevant here.

@jayaddison jayaddison marked this pull request as draft May 5, 2024 19:43
@tacaswell
Member

Collisions between elements in different SVGs in the same HTML document are a real problem (we have had bug reports about this in the past). The hash key includes details about the clip path transform, so we may avoid collisions (or if we do have collisions we may still be OK!), but that should be tested. I think the ultimate fix may be to include something deterministic for the figure layer (like mixing fig.get_label() or fig.get_gid() into the hash).
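
A small sketch of that last idea, assuming IDs are derived from a salted hash (the `make_id` function, its parameters, and the 10-character truncation are illustrative assumptions, not the backend's actual signature). Mixing a per-figure label into the digest keeps identical clip shapes in different figures from receiving the same ID, while the result stays deterministic.

```python
import hashlib


def make_id(prefix, content, salt, figure_label=""):
    """Hypothetical ID helper: salted SHA-256, truncated for brevity."""
    digest = hashlib.sha256()
    digest.update(salt.encode("utf8"))
    digest.update(figure_label.encode("utf8"))
    digest.update(str(content).encode("utf8"))
    return f"{prefix}{digest.hexdigest()[:10]}"


rect = (57.6, 41.472, 357.12, 266.112)  # the same clip rectangle in both figures

# Without a figure-level component, both figures produce the same ID.
same = make_id("p", rect, "salt") == make_id("p", rect, "salt")

# With distinct labels the IDs are expected to differ, yet each one is
# still repeatable from run to run.
differs = make_id("p", rect, "salt", "fig-a") != make_id("p", rect, "salt", "fig-b")

print(same, differs)
```

Whether a label or a gid is the better figure-level input is left open here, as it is in the comment above.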

@jayaddison
Contributor Author

I have to admit, I'd forgotten (I am a goldfish) that the string representation of the clip-path transform is included in the hash key -- that mostly reassures me that the problem should be infrequent, if it occurs at all. I'll move the PR back into ready-for-review state because most of that concern is resolved, but even so I'll try to build some more confidence about this by running some more checks locally.

@jayaddison jayaddison marked this pull request as ready for review May 6, 2024 17:32
@jayaddison
Contributor Author

The following testing methodology for ID collisions isn't exhaustive, but is what I'm starting with:

  • I've used some editor scripting to configure matplotlib.rcParams['svg.hashsalt'] to a fixed value at the start of each of the .py files in the examples dir of this repository.
  • Subsequently I've added a call to plt.savefig(f"{__file__}.svg") at the end of each of those files.
  • I've temporarily removed the embedding_webagg_sgskip.py and ginput_manual_clabel_sgskip.py files from the directory tree because they appear to be interactive/blocking processes.
  • Now I'm running all of those files with a fixed SOURCE_DATE_EPOCH for the batch, and inspecting the results as they appear.

So far every clipPath id="p<....>" value is unique within each individual SVG file. However, some duplicates do appear across files; I believe these refer to common/shared path shapes that are re-used across different diagrams.

In particular, one clipPath with id="p209e94b0da" has appeared in more than 100 different output diagrams so far, so I'd like to determine what it represents. 35-or-so other non-unique clipPaths exist, generally with single-digit re-use across files.
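
The cross-file check described above could be scripted along these lines. This is a sketch with hypothetical helper names, using only the standard library; the two inline SVG strings stand in for rendered files on disk, and the second ID in a.svg is made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SVG_NS = "{http://www.w3.org/2000/svg}"


def clip_path_ids(svg_text):
    """Return the id attribute of every <clipPath> element in an SVG document."""
    root = ET.fromstring(svg_text)
    return [el.get("id") for el in root.iter(f"{SVG_NS}clipPath")]


def shared_ids(documents):
    """Map each clipPath id to the documents it appears in, keeping only
    ids that occur in more than one document."""
    seen = defaultdict(set)
    for name, svg_text in documents.items():
        for clip_id in clip_path_ids(svg_text):
            seen[clip_id].add(name)
    return {cid: sorted(names) for cid, names in seen.items() if len(names) > 1}


def svg(*ids):
    # Build a tiny SVG document containing one clipPath per given id.
    clips = "".join(
        f'<clipPath id="{i}"><rect width="1" height="1"/></clipPath>' for i in ids
    )
    return f'<svg xmlns="http://www.w3.org/2000/svg"><defs>{clips}</defs></svg>'


docs = {"a.svg": svg("p209e94b0da", "p1111111111"), "b.svg": svg("p209e94b0da")}
print(shared_ids(docs))  # {'p209e94b0da': ['a.svg', 'b.svg']}
```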

@jayaddison
Contributor Author

(as a sidenote: stylesheet-related name collisions are in fact the cause of the problem I'd remembered in the past, not ID collisions -- it would be nice to confirm that IDs either are, or are not, namespace-internal to embedded SVG elements; there is almost certainly a smart way to do that with some simple local testcases. even so, I've started some of this testing, so I'm going to try to learn a few facts from it)

@jayaddison
Contributor Author

> In particular, one clipPath with id="p209e94b0da" has appeared in more than 100 different output diagrams so far, so I'd like to determine what it represents. 35-or-so other non-unique clipPaths exist, generally with single-digit re-use across files.

The 100+ repetition case is from a rectangular clip-path defined as <rect x="57.6" y="41.472" width="357.12" height="266.112"/> that appears in many of the gallery examples when rendered to SVG.

@jayaddison
Contributor Author

Based on the results so far, I'm fairly confident about the state of the updated ID-generation scheme. I also re-checked the code and confirmed that it is using an acceptably-strong hash function during _make_id, SHA256. Currently we don't retain the entire hash digest for use in the path ID, but that could be extended if required, trading-off against output filesizes.
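
To put a rough number on that trade-off: assuming the IDs behave like uniformly random values over the truncated digest (an assumption, and `collision_probability` is a hypothetical helper), the standard birthday approximation estimates the chance that any two of n IDs coincide. The 10-hex-digit length used below is inferred from IDs such as p209e94b0da quoted earlier in this thread.

```python
import math


def collision_probability(n_ids, hex_digits):
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * 16**hex_digits))."""
    space = 16 ** hex_digits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


# Compare a truncated digest with longer ones for 1000 IDs on one page.
for digits in (10, 16, 64):
    print(digits, collision_probability(1000, digits))
```

Longer IDs shrink the estimate quickly, at the cost of a few extra bytes per clipPath definition and per reference to it.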

@jayaddison
Contributor Author

Please let me know if there's anything else I can do to make progress on this pull request; thank you.



def _save_figure(objects='mhi', fmt="pdf", usetex=False):
class PathClippedImagePatch(PathPatch):
Member

Why do we need a custom class here?

Contributor Author

That code is borrowed from the demo_text_path.py gallery example - it's a shortcut I took when checking the examples for some suitable test coverage code.

It may be possible to refactor this class out and achieve the same test coverage - I'll look into cleaning that up.

Contributor Author

(this is going to take me a while to get around to, but I'll confirm results when possible)

Member

oh, I did not realize this was pulled from an example, if that is the case can you move it to in the function where it is used and add a note "lifted from example"?

Contributor Author

Class definition relocated, and an explanatory comment added alongside.

I haven't yet attempted to simplify/refactor the code itself to remove the class entirely - it may be possible; figuring out whether the bbox redraw is required (and how to catch that event without implementing a custom pathpatch, if so) seems to be the main question there.

@tacaswell
Member

Sorry this fell off my radar.

The implementation looks good, but I do not understand the custom class in the tests. As you note, caching anything about the size/location of things in the output space is fraught and needs to be re-generated on each draw (or have careful cache invalidation). Is that class exercising something that no existing Artist class does?

> The 100+ repetition case is from a rectangular clip-path defined as <rect x="57.6" y="41.472" width="357.12" height="266.112"/> that appears in many of the gallery examples when rendered to SVG.

That makes sense. I suspect that because the gallery figures are all the same size, many contain only one Axes, and they do not use any auto-layout, the location of the Axes bounding box in the output is going to be the same for all of them, and we clip almost everything to the bounding box of the Axes.

@jayaddison
Contributor Author

> As you note, caching anything about the size/location of things in the output space is fraught and needs to be re-generated on each draw

A small clarification, so that you don't over-estimate my understanding of the code: if that note refers to the -JJ comment, then that's from the existing gallery sample code. Even so, it does make sense to me that clipping would need to be re-evaluated after changes to intersecting objects in the diagram/scene.

@jayaddison
Contributor Author

One more response re: a PathClippedImagePatch question after a re-read:

> Is that class exercising something that no existing Artist class does?

As best as I can remember, my thinking with that case was that adding coverage for text-based paths might be worthwhile since they're relatively geometrically complicated (so another way to attempt to catch problems). Aside from that, though, I don't think it tests anything fundamentally different.

@tacaswell
Member

@jayaddison Ah, your user name started with 'J' so I thought 'JJ' was you; that is probably actually @leejjoon.

@tacaswell
Member

This is looking good to me!

@jayaddison are you willing to squash this to one or two commits?

Member

@tacaswell tacaswell left a comment

Should be squashed either by OP or by merge.

@jayaddison
Contributor Author

Thank you very much @tacaswell!

Re: squashing commits: yep, I'm happy to squash this down to a single commit. I'll review the dev docs and some mainline commit messages before doing that. Even so a double-check afterwards from you and/or the person merging could be helpful.

@@ -67,12 +145,13 @@ def _save_figure(objects='mhi', fmt="pdf", usetex=False):
 ("m", "pdf", False),
 ("h", "pdf", False),
 ("i", "pdf", False),
 ("mhi", "pdf", False),
 ("mhi", "ps", False),
+("p", "svg", False),  # (clipping) paths are only relevant for SVG output
Contributor Author

A late self-nitpick here: this comment seems poorly phrased and could potentially be misleading to future readers.

If I understand correctly, Matplotlib clipping is a feature that is output-format-agnostic (that is, it should work for all output formats).

The fix in this pull request only affects SVG elements named clipPath -- but they're a different concept.

I do think it could make sense to perform an isolated SVG-format-only path test (p), alongside the complete-functionality test (mhip) -- but I think the comment attempting to explain should either be improved, or omitted entirely.

At the moment I'm leaning towards removing it entirely, and perhaps relocating the line so that future readers don't consider it a possible typo/accidental difference from the preceding pdf test parameters.

(maybe an overly verbose explanation for a small detail, but I want to try to explain my thinking)

Contributor Author

(done, and no further changes planned on this branch)

@jayaddison
Contributor Author

Re-pinging to keep this thread active; please let me know if there's anything further I should adjust here.

This change enables more diagrams to emit deterministic (repeatable) SVG
format output -- provided that the prerequisite ``hashsalt`` rcParams
option has been configured, and also that the clip paths themselves are
added to the diagram(s) in deterministic order.

Previously, the Python built-in ``id(...)`` function was used to provide
a convenient but runtime-varying (and therefore non-deterministic)
mechanism to uniquely identify each clip path instance; instead here we
introduce an in-memory dictionary to store and lookup sequential integer
IDs that are assigned to each clip path.
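
The ordering caveat in this commit message can be illustrated with a small sketch (the `assign_ids` helper is hypothetical and only mimics a dictionary-based scheme like the one described): sequential IDs follow first-seen order, so the same clip paths added in a different order receive different IDs.

```python
def assign_ids(clip_paths):
    """Assign 0, 1, 2, ... to clip paths in the order they are first seen."""
    ids = {}
    for path in clip_paths:
        # len(ids) is evaluated before insertion, so it is the next free integer.
        ids.setdefault(path, len(ids))
    return ids


print(assign_ids(["star", "circle", "star"]))  # {'star': 0, 'circle': 1}
print(assign_ids(["circle", "star"]))          # {'circle': 0, 'star': 1}
```
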
@jayaddison
Contributor Author

It's been a few weeks since the previous rebase, so I'm going to perform another rebase of these changes against the latest main branch to re-confirm test results.

@tacaswell
Member

@jayaddison Thank you for following up!

@jayaddison
Contributor Author

No problem - I'll admit that I'm eager for this to be merged, but I also get that release prep and branch co-ordination requires patience :)

@greglucas greglucas merged commit 2d1db48 into matplotlib:main Jul 6, 2024
40 of 41 checks passed
@greglucas
Contributor

Congratulations on your first merged PR to matplotlib @jayaddison 🎉 We hope to see more contributions from you in the future.

@jayaddison
Contributor Author

Thank you very much @greglucas @tacaswell! I'll make sure to be around to watch for any potentially-related bugs in the bugtracker when v3.10.0 is released.

I do have one other potential issue/bugreport that I'm still researching; when I can figure out more about that I'll open a bugreport and perhaps a fix alongside if it's within my ability.

@jayaddison jayaddison deleted the issue-27831/deterministic-svg-clippath-identifiers branch July 6, 2024 15:33
@jayaddison
Contributor Author

This thread is slightly stale now, but even so, some brief updates:

> Thank you very much @greglucas @tacaswell! I'll make sure to be around to watch for any potentially-related bugs in the bugtracker when v3.10.0 is released.

Given the 3.10 release recently, I've been checking for any SVG / clipPath related bugreports in the issue tracker (and PRs, just in case). So far, so good (no reports).

> I do have one other potential issue/bugreport that I'm still researching; when I can figure out more about that I'll open a bugreport and perhaps a fix alongside if it's within my ability.

I haven't been able to track that down - it was a nondeterminism issue and seemed very similar to #28574 (tick axes changing within a multi-chart grid) -- so optimistically it may be solved, but I'll revisit this if I encounter it again.

Successfully merging this pull request may close these issues.

[Bug]: Nondeterminism in SVG clipPath element id attributes
3 participants