Reproducible PS/PDF output (master) #6597
* honour SOURCE_DATE_EPOCH for timestamps in PS and PDF files. See https://reproducible-builds.org/specs/source-date-epoch/
* get keys sorted so that hatchPatterns, images and markers are included with a reproducible order in the PDF file. See https://reproducible-builds.org/
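For context, the SOURCE_DATE_EPOCH convention is that a tool uses the variable's value (integer seconds since the Unix epoch) in place of the current time whenever it is set. A minimal sketch of that lookup follows; the helper name `build_timestamp` is hypothetical and this is not the code from this PR.

```python
import os
from datetime import datetime, timezone


def build_timestamp(environ=os.environ):
    """Return the datetime to embed in output files (hypothetical helper)."""
    epoch = environ.get("SOURCE_DATE_EPOCH")
    if epoch:
        # The spec defines the value as integer seconds since the Unix epoch (UTC).
        return datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    # No override: fall back to the current time.
    return datetime.now(timezone.utc)


print(build_timestamp({"SOURCE_DATE_EPOCH": "946684800"}).isoformat())
# prints 2000-01-01T00:00:00+00:00
```

Passing the environment as a parameter is only a convenience for trying the function without touching the real process environment.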
I think we are already sorting all of those keys on the way out so do not need the ordered dicts. Can you add an entry in https://github.com/matplotlib/matplotlib/tree/master/doc/users/whats_new and a test (like the deterministic svg test) that this actually works? attn @jkseppan
I agree that a test is needed. I don't see the reason for the ordered dicts either, but there could be some subtlety that I'm missing. It would be neat to make the date settable by the user (via e.g. the PdfPages constructor) but of course not necessary for this particular use case.
- SOURCE_DATE_EPOCH support - reproducible output
I added a test for reproducible PDF output (you're right: this is very important!).
Oh, of course! I was thinking about the page stream, but it also matters in which order the objects are output at the top level of the pdf file. I suspect that the ordering is needed for
I would like to add tests for postscript too, but I'm afraid the code will be nearly the same as for PDF: do you think I should move the common code somewhere (maybe in a new
These failures on appveyor look real.
My naive guess is that there is a unicode vs bytes issue
Sure! I'm working on it but for now I can't reproduce the problem at home.
…and test_determinism_all
…environment is shared between threads).
…th, day of month and times. This way we don't mind if timestamps are written with leading 0 or space.
I think this is better now.
Appveyor is now reporting only a
👍 from me, attn @jkseppan can you do a final review?
The ``SOURCE_DATE_EPOCH`` environment variable can now be used to set
the timestamps value in the PS and PDF outputs, which are then
reproducible.
We should list the known limitations, because when downstream users see this promise, they may start depending on it and run into the limitations later. At least usetex with the ps backend should be ruled out, based on comments in the tests.
Please add a link to the specification that describes the value of this environment variable.
Other possible limitations include at least mathtext, ps.usedistiller, ps.fonttype, and pdf.fonttype. These are features or settings that I think could easily hide a source of nondeterminism and that don't seem to get exercised by the new test. (To get a more reliable list of relevant features than just my guesses, you'd need to run the test with coverage checking and see which parts of the backends get no coverage.)
To be clear, I don't mean we should delay merging until there are tests for everything, but we should be realistic in how we document this. I imagine the reproducible-builds community would appreciate the feature as it is now, with careful documentation of what exactly they can rely on.
Commit c007f49 changes the tests to use a two-digit date in December instead of January 1st. I think the underlying problem should be fixed instead of avoiding it in the tests, since we don't want to tell users that their |
I considered reproducibility this way: with the same tools (the same versions of everything installed on your computer), the same commands lead to the same results.
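That definition can be turned into a simple check: run the same command more than once and compare digests of the output. The sketch below is only an illustration of the idea; `is_reproducible`, `stable` and `unstable` are hypothetical names, and the byte strings stand in for files a backend would write.

```python
import hashlib


def is_reproducible(render, runs=2):
    """Call render() several times and compare digests of the returned bytes."""
    digests = {hashlib.sha256(render()).hexdigest() for _ in range(runs)}
    return len(digests) == 1


def stable():
    # Stand-in for output with no varying fields.
    return b"%PDF-1.4 fixed content"


counter = iter(range(100))


def unstable():
    # Embeds a value that changes on every call, as an unfixed timestamp would.
    return b"%PDF-1.4 " + str(next(counter)).encode()


print(is_reproducible(stable), is_reproducible(unstable))
# prints True False
```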
The failure happens on the build where ghostscript is not installed, so commit bf7387e is not the right way to handle this situation.
The |
Power cycled to restart CI (appveyor failures looked to be due to qt-related packaging issues).
Reuse UTC timezone from dates.py
@@ -135,6 +136,20 @@ def _string_escape(match):
    assert False


# tzinfo class for UTC
class UTCtimezone(tzinfo):
We already have this in mpl/dates.py
Yes! Sorry I missed that.
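For readers unfamiliar with the kind of class under discussion, a UTC `tzinfo` subclass takes only a few lines. The following is a generic sketch and is neither the definition in dates.py nor the one in this diff; the class name `UTC` is chosen here for illustration.

```python
from datetime import datetime, timedelta, tzinfo


class UTC(tzinfo):
    """Fixed-offset tzinfo for UTC (illustrative)."""

    def utcoffset(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return "UTC"

    def dst(self, dt):
        return timedelta(0)


# 946684800 seconds after the Unix epoch is 2000-01-01 00:00 UTC.
stamp = datetime.fromtimestamp(946684800, UTC())
print(stamp.isoformat())
# prints 2000-01-01T00:00:00+00:00
```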
Reuse UTC timezone from dates.py
Reuse UTC timezone from dates.py
Other than one minor change, this looks good to go! Don't worry about the coveralls failure; that is unrelated. Sorry this has dragged on so long.
The coveralls drop is probably due to a change in master, and not this PR; a rebase to latest master might help that, but is not strictly necessary.
A couple of minor things.
------------------------------

The ``SOURCE_DATE_EPOCH`` environment variable can now be used to set
the timestamps value in the PS and PDF outputs. See
timestamp
default value is "mhi", so that the test includes all these objects.
format : str
format string. The default value is "pdf".
uid : str
Not seeing this being used for anything?
The uid option is used in test_determinism_all_tex from lib/matplotlib/tests/test_backend_ps.py
But it was only useful when files were used to store the figures. I will remove that.
format string, such as "pdf".
string : str
timestamp string for 2000-01-01 00:00 UTC.
keyword : str
It appears to be bytes, not str.
@QuLogic: thanks a lot for your review. I corrected those points.
A rebase might fix coveralls (as I'm pretty sure it was a change in master that dropped coverage), but it's not strictly needed. I guess we aren't really waiting on anything else here.
Thanks to you all for your kind support!
@JojoBoulix Thanks for taking care of this! Sorry it was such a protracted process.
This is a rebase of #6595 to the master branch.
Several software packages use matplotlib in their building process (mainly to produce PS or PDF documents). To make their build reproducible, it would be great to make matplotlib output reproducible.
To allow reproducible PS and PDF output:
* honour SOURCE_DATE_EPOCH for timestamps in PS and PDF files. See https://reproducible-builds.org/specs/source-date-epoch/
* get keys sorted so that hatchPatterns, images and markers are included with a reproducible order in the PDF file. Another solution is to sort self.hatchPatterns in writeHatches (and similar ordering in writeImages and writeMarkers), but this consumes more memory.

This patch has been submitted in debian bug #827361
See also https://reproducible-builds.org/
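
The key-ordering point in the description can be illustrated apart from the PDF backend. The sketch below shows the sort-on-output alternative, under the assumption that objects are collected in a dict and written by iterating over it; `write_objects` and the sample entries are hypothetical and are not the backend's code.

```python
def write_objects(objects):
    """Serialize name -> payload pairs in sorted key order."""
    lines = []
    for name in sorted(objects):
        lines.append("/%s %s" % (name, objects[name]))
    return "\n".join(lines)


# Same entries, inserted in a different order.
first = {"H2": "hatch-b", "H1": "hatch-a"}
second = {"H1": "hatch-a", "H2": "hatch-b"}

assert write_objects(first) == write_objects(second)
print(write_objects(first))
```

Sorting at write time builds a temporary sorted list of the keys, which is the extra memory the description mentions; keeping insertion order in an ordered mapping avoids that copy.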