PDF file generation is not deterministic - results in different outputs on the same input #6317
Comments
Looking inside the generated pdf file, it appears that matplotlib inserts a creation date
but that is not the only difference. There appear to be other differences as well. |
Which version of matplotlib are you using? I think some work has been done
|
|
It seems like the issue is in the |
matplotlib version 1.5.1 still supports Python 2.6, so ordered dicts can't However, the solution there actually might be applicable here too.
|
Even the svg backend is not completely deterministic
|
now that is weird... I am fairly certain we added unit tests to check that
|
Ah, which version of matplotlib are you using? This might only be on master
|
and, reading it further, it looks like by default, it will not be But, the use of OrderedDicts is in master, so I would suggest trying out
|
I don't see why you'd need to use OrderedDict. You could just change the dict branch of May I ask why you'd want to make pdf files be exactly the same? |
@jkseppan There are many reasons why determinism is important - functions are generally expected to return the same output if given the same input - it makes debugging much easier. If changes in the saved plot files are introduced by genuine changes in the plot data, rather than spurious change introduced by how matplotlib saves the file, one can easily detect any mistakes. Suppose you are working on a large code base, and you are saving some function plots. When you make changes in one area of the code you want to ensure you have not broken other parts, and you do that by comparing the generated plot files, and ensuring they have not changed. Since you cannot visually inspect every single one of them, it is easier to check that they have not changed by comparing their actual content or md5sum. With the current nondeterminism in pdf saving it is impossible to compare two pdf files. Also I do not see the need to save metadata in the pdf files such as @efiring I will try to look at it later this weekend, but I am no expert in pdf generation. |
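The hash comparison described in this comment can be sketched roughly as below. This is only an illustration using the standard library; `file_digest` and `same_content` are hypothetical helper names, and SHA-256 is used in place of md5sum purely as an assumption (any digest would serve the same purpose).

```python
import hashlib


def file_digest(path, algorithm="sha256"):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files are byte-for-byte identical (by digest)."""
    return file_digest(path_a) == file_digest(path_b)
```

A check like `same_content("run1/figure.pdf", "run2/figure.pdf")` (paths made up for the example) only passes if the writer is byte-deterministic, which is exactly the property under discussion.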
The matplotlib test suite checks that the pdf backend produces the same output as earlier, by rendering the output file into pixels and comparing the results. This works across versions and big changes at all levels of matplotlib. A pdf file is fundamentally different from a png file or a matrix of numbers: it is an encoding of a drawing process into a language (essentially a very restricted form of PostScript), and a further rendering step is needed to produce an image on some output medium. There are always going to be many ways of encoding the process to create a given output, and the ordering of dicts at the bottom-most level is just the most obvious kind of variation. We can certainly sort the dictionaries repeatably and add an option to remove the creation date, but next you're likely to find that small changes in your script or small updates to matplotlib will cause changes in the ordering of higher-level objects, or coordinates that differ in the 10th decimal, or some similar changes that have no semantic meaning but will result in a different hash of the output file. One more source of nondeterminism: font subsetting will add random tags in the font name in any pdf-generating program, so that when different subsets of the same font end up getting included in the same print job, you avoid the printer using a cached font that is missing some random characters. If the software that combines pdf files is written diligently, it should recurse through all included pdf files and alter any font names that do not refer to the same exact font subset, but using a random tag is a very simple way to avoid this problem even if the combining program fails to do this. |
I am going to take it as a given that determinism certainly has its The goal for the determinism for SVG output was to make it easier to Now, there are plenty of reasons to include a creation date in the pdf
|
Determinism is also nice for auto-generated stuff that's going into git--it's nice to have things change only if they actually changed. I don't think this is a supreme goal, but there's nothing wrong with working towards it if there are simple changes to be made (I consider ordereddict to be simple). As far as creation date goes, is there a way for the user to set that? If I was doing this as in some automated process, and the creation date is guaranteed to be in the metadata of the output file, I'd want some way to be able to control it. In this case I'd want the PDF to be the last modification time of the generating script, not just when I happened to run the script. |
This thread is getting a bit testier than strictly needed, we are all on the same side here. I think the main contention is what 'the same' means. The meaning mpl has settled on (long ago) is 'looks the same' and the internal details of the vector backends is 'an implementation detail' and not part of the stable public API. To summarize the svg discussion:
The compromise was to add the It seems reasonable to do the same for the pdf backend:
@dopplershift That comes back to the more general problem of passing backend-specific kwargs/meta-data through I can see three positions on including the date:
@drufat If you want to use pdfs for testing regressions I strongly suggest you use the render-and-compare approach we use internally. For example, changing the z-order (or even just plot order) of two non-overlapping lines will change the pdf representation, but not change your output in any meaningful way. I also would consider the fact that the png backend produces bit-identical outputs a coincidence and not part of the public API. That hash value may change in a micro release and may not be the same across platforms. Again, if you are using this for regressions, open the images and compare them. |
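To make the "compare them" step concrete, here is a minimal sketch of a pixel-level comparison. It assumes both files have already been rasterized to same-sized grayscale grids (nested lists of numbers) by some external tool; the rasterization itself is not shown, and `rms_difference` and `images_match` are hypothetical names, not matplotlib's internal test helpers.

```python
import math


def rms_difference(image_a, image_b):
    """Root-mean-square difference between two equally sized pixel grids."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same shape")
    total = 0.0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += (pixel_a - pixel_b) ** 2
            count += 1
    return math.sqrt(total / count) if count else 0.0


def images_match(image_a, image_b, tol=0.0):
    """True if the rendered images differ by at most `tol` (RMS)."""
    return rms_difference(image_a, image_b) <= tol
```

Unlike a file hash, this kind of check is unaffected by object ordering or timestamps inside the file, and a small nonzero `tol` can absorb rendering noise across versions or platforms.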
I still think it's simpler to iterate dicts in order than to replace them with OrderedDicts. The IMHO we should add separate options to set the creation date to a user-supplied value (or to remove it) and to disable font subsetting (the pdf spec requires the addition of tags for subsets, so not adding them makes the resulting file non-compliant), and any other sources of nondeterminism that we discover. I think it's likely that there are such sources that we may have missed so far. |
Using sorted instead of OrderedDict in your for loops could affect the time complexity. Sorting is an O(N log N) operation, whereas OrderedDict can be iterated in linear time O(N). |
But then you just move the sort operation to dict construction time instead of iteration time? Both happen once, sometimes with a few inserts in between. In any case, I will be very surprised if you can measure the performance difference in outputting a pdf document. The dicts typically have a few keys, a few tens at the most. |
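The "iterate dicts in sorted order" idea from the last few comments can be sketched as follows. `serialize_dict` is a hypothetical helper written for illustration only; it emits a PDF-like dictionary literal and is not the pdf backend's actual serialization code.

```python
def serialize_dict(mapping):
    """Serialize a small dict with keys in sorted order.

    Sorting at write time makes the output independent of insertion
    order (and of hash ordering on older Python versions).
    """
    parts = ["/%s %s" % (key, mapping[key]) for key in sorted(mapping)]
    return ("<< " + " ".join(parts) + " >>").encode("ascii")


first = {"Type": "/Font", "Subtype": "/Type1", "BaseFont": "/Helvetica"}
second = {"BaseFont": "/Helvetica", "Type": "/Font", "Subtype": "/Type1"}

# Same keys and values, different insertion order, identical bytes.
assert serialize_dict(first) == serialize_dict(second)
print(serialize_dict(first))
# b'<< /BaseFont /Helvetica /Subtype /Type1 /Type /Font >>'
```

With only a handful of keys per dictionary, as noted above, the cost of the sort is unlikely to be measurable.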
Fixed by #6597. |
For future visitors https://matplotlib.org/2.1.1/users/whats_new.html#reproducible-ps-pdf-and-svg-output is a helpful resource |
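As I understand the linked notes, they rely on the `SOURCE_DATE_EPOCH` environment variable, a convention from the reproducible-builds project, to pin the timestamp written into the output. The sketch below shows how a file writer might honor that variable in general; `creation_timestamp` is a hypothetical function and this is not matplotlib's own code.

```python
import datetime
import os


def creation_timestamp(environ=os.environ):
    """Pick the timestamp to embed in a generated file.

    If SOURCE_DATE_EPOCH is set (seconds since the Unix epoch), use it
    so repeated runs embed the same date; otherwise use the current time.
    """
    epoch = environ.get("SOURCE_DATE_EPOCH")
    if epoch is not None:
        return datetime.datetime.fromtimestamp(
            int(epoch), tz=datetime.timezone.utc
        )
    return datetime.datetime.now(tz=datetime.timezone.utc)
```

Setting the variable to a fixed value in the environment that runs the plotting script (for example the commit time of the script) then removes the creation date as a source of differences between runs.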
It seems that now we should call savefig with |
Suppose you want to generate a pdf file with matplotlib and save it.
genfigure.py
Run the script from the command line:
Given that we are saving the same figure, we would expect the output to be the same. However, after looking at the file hashes, they appear to be different. In my particular case:
On the other hand, no such issue exists when saving png files.
The two png files are exactly the same.
It appears that pdf saving has some source of non-determinism.
Is there a way to ensure that saving the same figure multiple times results in exactly the same pdf file?