
pdf: Use explicit palette when saving indexed images #25824


Merged
merged 1 commit into matplotlib:main from QuLogic:fix-index-pdf on Jun 13, 2023

Conversation


@QuLogic QuLogic commented May 5, 2023

PR summary

Asking Pillow for an "adaptive palette" does not appear to guarantee that the chosen colours will be the same, even if asking for exactly the same number as exist in the image. Instead, create an explicit palette, and quantize using it.

Additionally, since now the palette may be smaller than 256 colours, Pillow may choose to encode the image data with fewer than 8 bits per component, so we need to properly reflect that in the decode parameters (this was already done for the image parameters).
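Pillow's choice of a sub-8-bit depth for small palettes can be observed directly by parsing the IHDR chunk of a saved PNG, much as the PDF backend's PNG-parsing code does. A minimal sketch (the 2-colour palette image is an illustrative example, not from the PR):

```python
import struct
from io import BytesIO

from PIL import Image

# A 2x2 paletted image with only two colours (black and red).
img = Image.new('P', (2, 2))
img.putdata([0, 1, 1, 0])
img.putpalette([0, 0, 0, 255, 0, 0])

buffer = BytesIO()
img.save(buffer, format='png')
buffer.seek(8)  # skip the 8-byte PNG signature; IHDR is the first chunk
length, chunk_type = struct.unpack(b'!L4s', buffer.read(8))
ihdr = buffer.read(length)
# IHDR layout: width (4 bytes), height (4), bit depth (1), colour type (1), ...
bit_depth = ihdr[8]
print(chunk_type, bit_depth)  # expect a bit depth below 8 for a 2-colour palette
```

Since the image data is stored at that reduced depth, the decode parameters in the PDF must declare the same number of bits per component.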

The effect on test images with many colours is small, with a maximum RMS of 1.024, but for images with few colours, the result can be completely wrong, as reported in #25806.

Since this bugginess requires a large image with few colours, and relies on possibly-fixable caching in Pillow, I opted to just update the existing test images instead; we should be a little more careful when changing them in the future.

Fixes #20575
Fixes #25806

PR checklist

ksunden previously approved these changes May 6, 2023

ksunden commented May 16, 2023

Looks like there is one additional failing test left unresolved:

lib/matplotlib/tests/test_image.py::test_figimage[pdf-False]


QuLogic commented May 30, 2023

I can reproduce this on Ubuntu WSL, but it occurs for both values of suppressComposite. Confusingly, if I copy the result into the baseline, git shows no differences and the tests still fail. Not sure if there's a Ghostscript bug on Ubuntu.


QuLogic commented Jun 6, 2023

Further testing shows that the results are in fact still wrong. Disabling the code for indexed images results in all related tests failing.

Using a fairly simple image of a red gradient with pure Pillow, and the attempted conversion in this PR:

import numpy as np
from PIL import Image

x = np.zeros((256, 1, 3), dtype=np.uint8)
x[:, 0, 0] = np.arange(256)
orig = Image.fromarray(x)

palette = Image.new('P', (1, 1))
palette.putpalette([comp for _, color in orig.getcolors() for comp in color])
new = orig.quantize(dither=Image.Dither.NONE, palette=palette)

print(*new.getdata())
y = np.asarray(new.convert('RGB'))
print(y[:, 0, 0])
np.testing.assert_array_equal(x, y)

results in an assertion error:

255 255 255 255 251 251 251 251 247 247 247 247 243 243 243 243 239 239 239 239 235 235 235 235 231 231 231 231 227 227 227 227 223 223 223 223 219 219 219 219 215 215 215 215 211 211 211 211 207 207 207 207 203 203 203 203 199 199 199 199 195 195 195 195 191 191 191 191 187 187 187 187 183 183 183 183 179 179 179 179 175 175 175 175 171 171 171 171 167 167 167 167 163 163 163 163 159 159 159 159 155 155 155 155 151 151 151 151 147 147 147 147 143 143 143 143 139 139 139 139 135 135 135 135 131 131 131 131 127 127 127 127 123 123 123 123 119 119 119 119 115 115 115 115 111 111 111 111 107 107 107 107 103 103 103 103 99 99 99 99 95 95 95 95 91 91 91 91 87 87 87 87 83 83 83 83 79 79 79 79 75 75 75 75 71 71 71 71 67 67 67 67 63 63 63 63 59 59 59 59 55 55 55 55 51 51 51 51 47 47 47 47 43 43 43 43 39 39 39 39 35 35 35 35 31 31 31 31 27 27 27 27 23 23 23 23 19 19 19 19 15 15 15 15 11 11 11 11 7 7 7 7 3 3 3 3
[  0   0   0   0   4   4   4   4   8   8   8   8  12  12  12  12  16  16
  16  16  20  20  20  20  24  24  24  24  28  28  28  28  32  32  32  32
  36  36  36  36  40  40  40  40  44  44  44  44  48  48  48  48  52  52
  52  52  56  56  56  56  60  60  60  60  64  64  64  64  68  68  68  68
  72  72  72  72  76  76  76  76  80  80  80  80  84  84  84  84  88  88
  88  88  92  92  92  92  96  96  96  96 100 100 100 100 104 104 104 104
 108 108 108 108 112 112 112 112 116 116 116 116 120 120 120 120 124 124
 124 124 128 128 128 128 132 132 132 132 136 136 136 136 140 140 140 140
 144 144 144 144 148 148 148 148 152 152 152 152 156 156 156 156 160 160
 160 160 164 164 164 164 168 168 168 168 172 172 172 172 176 176 176 176
 180 180 180 180 184 184 184 184 188 188 188 188 192 192 192 192 196 196
 196 196 200 200 200 200 204 204 204 204 208 208 208 208 212 212 212 212
 216 216 216 216 220 220 220 220 224 224 224 224 228 228 228 228 232 232
 232 232 236 236 236 236 240 240 240 240 244 244 244 244 248 248 248 248
 252 252 252 252]
Traceback (most recent call last):
  File "/home/elliott/code/matplotlib/foo.py", line 15, in <module>
    np.testing.assert_array_equal(orig, new.convert('RGB'))
  File "/var/container/conda/envs/mpl39/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 983, in assert_array_equal
    assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
  File "/var/container/conda/envs/mpl39/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/container/conda/envs/mpl39/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

Mismatched elements: 192 / 768 (25%)
Max absolute difference: 3
Max relative difference: 0.75
 x: array([[[  0,   0,   0]],

       [[  1,   0,   0]],...
 y: array([[[  0,   0,   0]],

       [[  0,   0,   0]],...

Even though NumPy reports a 25% difference, that is across all channels; only the 256 values in the red channel were unique, so 192 mismatches is really more like a 75% difference. You can also see, in the printed indices and in the re-converted RGB values, that several values are skipped.

Probably due to python-pillow/Pillow#1852 (comment), even the explicit palette used in the example above and in this PR does not work. I think Pillow's quantization is not reliable enough for lossless palette conversion. Options seem to be either: 1) implement some kind of lookup, a) in NumPy or b) via our C extensions; or 2) disable this optimization altogether. Which we choose might depend on the speed vs. file size tradeoff.
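Option 1a can be sketched as an exact, lossless lookup: pack each RGB triple into a 24-bit integer and binary-search the sorted palette, so every pixel maps back to precisely its own colour. A hypothetical sketch (`pack24` is an illustrative helper; the 3-colour palette is made up):

```python
import numpy as np

# A small palette and an image drawn only from its colours.
palette = np.array([[0, 0, 0], [255, 0, 0], [128, 128, 128]], dtype=np.uint8)
rng = np.random.default_rng(0)
image = palette[rng.integers(0, len(palette), size=(4, 4))]

def pack24(rgb):
    # Pack (..., 3) uint8 RGB triples into (...,) 24-bit integers.
    rgb = rgb.astype(np.uint32)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]

palette24 = pack24(palette)
order = np.argsort(palette24)
# searchsorted finds each pixel's position in the sorted palette;
# `order` maps that back to the original palette index.
indices = order[np.searchsorted(palette24[order], pack24(image))].astype(np.uint8)

# Unlike Pillow's quantize cache, the round trip is exact.
np.testing.assert_array_equal(palette[indices], image)
```

Because the mapping is a pure lookup, it is only valid when every pixel colour is known to be in the palette, which is exactly the case here since the palette is built from the image's own colours.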

@QuLogic QuLogic marked this pull request as draft June 6, 2023 06:30
@tacaswell

Should we push this to 3.8 and make a call one way or the other this week?


QuLogic commented Jun 6, 2023

We can discuss tomorrow or Thursday; I guess pushing out to 3.8 depends on how we decide to fix it.


jklymak commented Jun 6, 2023

It seems for the short term we should revert the optimization? If a proper fix gets in for 3.8, that's great; if not, at least we won't have something buggy.

We should perhaps still consider making interpolation_stage='rgba' the default, to avoid the resampling code by default.

@ksunden ksunden dismissed their stale review June 7, 2023 02:51

evidence that more is needed since my review


QuLogic commented Jun 8, 2023

We should perhaps still consider making interpolation_stage='rgba' the default to avoid the resampling code by default

Note that interpolation is a red herring; what really matters is how many colours are in the final image, and possibly its size. Interpolation just changes what the final size is.
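The colour-count gate can be seen with Pillow's getcolors, which the backend uses to decide whether the palette path applies at all: it returns the colour list for images within the cap and None beyond it. A minimal sketch (both images are made-up examples):

```python
import numpy as np
from PIL import Image

# One colour only: getcolors returns its (count, colour) list.
few = Image.fromarray(np.zeros((8, 8, 3), dtype=np.uint8))
print(few.getcolors(maxcolors=256))  # [(64, (0, 0, 0))]

# 512 distinct colours: getcolors gives up and returns None,
# so the indexed-palette path would not be taken.
values = np.arange(512, dtype=np.uint16)
rgb = np.stack([values // 256, values % 256, values % 256], axis=-1)
many = Image.fromarray(rgb.astype(np.uint8).reshape(16, 32, 3))
print(many.getcolors(maxcolors=256))  # None
```

So interpolation matters only indirectly: resampling can blend pixels into new intermediate colours and push the final image past the 256-colour cap.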

Asking Pillow for an "adaptive palette" does not appear to guarantee
that the chosen colours will be the same, even if asking for exactly the
same number as exist in the image. And asking Pillow to quantize with an
explicit palette does not work either, as Pillow uses a cache that trims
the last two bits from the colour and never makes an explicit match.
python-pillow/Pillow#1852 (comment)

So instead, manually calculate the indexed image using some NumPy
tricks.

Additionally, since now the palette may be smaller than 256 colours,
Pillow may choose to encode the image data with fewer than 8 bits per
component, so we need to properly reflect that in the decode parameters
(this was already done for the image parameters).

The effect on test images with _many_ colours is small, with a maximum
RMS of 1.024, but for images with few colours, the result can be
completely wrong as in the reported matplotlib#25806.

QuLogic commented Jun 10, 2023

I've now re-implemented the indexing in NumPy, instead of using Pillow. I hoped to add a test that could show simply how the original results were wrong, but in order to reproduce the error from #25806, it requires an image of >= 7073*7073 pixels, which makes a 12s-long test. As many other test images were also changed by fixing this bug, I did not see the benefit of adding such a test instead of updating existing results. Instead, I've added a test that confirms that my original plan of using Pillow's quantize-with-an-explicit-palette method does not work (which, if you revert the code back to ba490cb, results in an image as in #25824 (comment)).

Additionally, to confirm that this code is now correct, I've commented out the block that does this optimization, run the test suite again, and thus checked against the normal RGB version. The only test that then fails is test_rotate_image, as so:
[image: rotate_image_pdf failed diff]
And this difference seems small enough that it's a PDF renderer issue.

@QuLogic QuLogic marked this pull request as ready for review June 10, 2023 02:46

QuLogic commented Jun 13, 2023

As an additional test, I wrote a small benchmark:

from io import BytesIO
import struct
import timeit

import numpy as np
from PIL import Image


def _writePng(img):
    buffer = BytesIO()
    img.save(buffer, format="png")
    buffer.seek(8)
    png_data = b''
    bit_depth = palette = None
    while True:
        length, type = struct.unpack(b'!L4s', buffer.read(8))
        if type in [b'IHDR', b'PLTE', b'IDAT']:
            data = buffer.read(length)
            if len(data) != length:
                raise RuntimeError("truncated data")
            if type == b'IHDR':
                bit_depth = int(data[8])
            elif type == b'PLTE':
                palette = data
            elif type == b'IDAT':
                png_data += data
        elif type == b'IEND':
            break
        else:
            buffer.seek(length, 1)
        buffer.seek(4, 1)   # skip CRC
    return png_data, bit_depth, palette


def pillow(data):
    height, width, color_channels = data.shape
    img = Image.fromarray(data)
    img_colors = img.getcolors(maxcolors=256)
    if color_channels == 3 and img_colors is not None:
        num_colors = len(img_colors)
        dither = getattr(Image, 'Dither', Image).NONE
        pmode = getattr(Image, 'Palette', Image).ADAPTIVE
        img = img.convert(
            mode='P', dither=dither, palette=pmode, colors=num_colors
        )
        png_data, bit_depth, palette = _writePng(img)
        palette = palette[:num_colors * 3]


def numpy(data):
    height, width, color_channels = data.shape
    img = Image.fromarray(data)
    img_colors = img.getcolors(maxcolors=256)
    if color_channels == 3 and img_colors is not None:
        num_colors = len(img_colors)
        palette = np.array([comp for _, color in img_colors for comp in color],
                           dtype=np.uint8)
        palette24 = ((palette[0::3].astype(np.uint32) << 16) |
                     (palette[1::3].astype(np.uint32) << 8) |
                     palette[2::3])
        rgb24 = ((data[:, :, 0].astype(np.uint32) << 16) |
                 (data[:, :, 1].astype(np.uint32) << 8) |
                 data[:, :, 2])
        indices = np.argsort(palette24).astype(np.uint8)
        rgb8 = indices[np.searchsorted(palette24, rgb24, sorter=indices)]
        img = Image.fromarray(rgb8, mode='P')
        img.putpalette(palette)
        png_data, bit_depth, palette = _writePng(img)
        palette = palette[:num_colors * 3]


for size in [2, 10, 100, 1000, 10000]:
    print(size)
    data = np.full((size, size, 3), 239, dtype=np.uint8)
    data[:size//2, size//2:, :] = (255, 0, 0)
    for mode in ['pillow', 'numpy']:
        print(mode, timeit.timeit(f'{mode}(data)', globals=globals(), number=10))

This creates an 'image' of various sizes in grey with a red square in the top right quadrant, and then tries the old and new indexing. The results are:

2
pillow 0.013012180104851723
numpy 0.001572401961311698
10
pillow 0.0008533929940313101
numpy 0.001532592112198472
100
pillow 0.006254185922443867
numpy 0.0027515022084116936
1000
pillow 0.5866641108877957
numpy 0.17633948801085353
10000
pillow 57.763156425906345
numpy 31.72443200601265

Surprisingly, in almost all cases, NumPy is faster than Pillow.

@ksunden ksunden merged commit 475612e into matplotlib:main Jun 13, 2023

lumberbot-app bot commented Jun 13, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
git checkout v3.7.x
git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 475612ef50a7d4881eebaa0ebe5b410b7b202b28
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #25824: pdf: Use explicit palette when saving indexed images'
  4. Push to a named branch:
git push YOURFORK v3.7.x:auto-backport-of-pr-25824-on-v3.7.x
  5. Create a PR against branch v3.7.x; I would have named this PR:

"Backport PR #25824 on branch v3.7.x (pdf: Use explicit palette when saving indexed images)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@QuLogic QuLogic deleted the fix-index-pdf branch June 13, 2023 21:36

QuLogic commented Jun 13, 2023

Conflicts are:

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
        both modified:   lib/matplotlib/tests/baseline_images/test_backend_pdf/grayscale_alpha.pdf
        both modified:   lib/matplotlib/tests/baseline_images/test_image/image_alpha.pdf
        deleted by us:   lib/matplotlib/tests/baseline_images/test_image/image_placement.pdf
        both modified:   lib/matplotlib/tests/baseline_images/test_image/image_shift.pdf
        both modified:   lib/matplotlib/tests/baseline_images/test_image/imshow_masked_interpolation.pdf
        both modified:   lib/matplotlib/tests/baseline_images/test_image/no_interpolation_origin.pdf
        both modified:   lib/matplotlib/tests/baseline_images/test_image/rotate_image.pdf
        both modified:   lib/matplotlib/tests/baseline_images/test_tightlayout/tight_layout5.pdf

and these all appear to be caused by #25704, which wasn't backported.


QuLogic commented Jun 29, 2023

The backport should work now with #25704 backported, I think.

@meeseeksdev backport to v3.7.x

QuLogic added a commit that referenced this pull request Jun 29, 2023

Backport PR #25824 on branch v3.7.x (pdf: Use explicit palette when saving indexed images)
Successfully merging this pull request may close these issues.

[Bug]: pdf export for large image sizes results in wrong colors
cm.set_bad() not working for specific values of grayscale and dpi when saving as pdf
4 participants