pdfiumload: fix rendering multiple pages with different sizes #3594

Merged: 3 commits merged into libvips:8.14 on Aug 14, 2023

Conversation

DarthSim
Contributor

@DarthSim DarthSim commented Aug 4, 2023

Hey there 👋

Long story short 🙂
Here's a PDF file containing multiple pages of different sizes: https://img.darthsim.me/R2rprQ86Q6.pdf.
Here's how vips renders it: https://img.darthsim.me/mJ8SkNJb_m.png

As you can see, the pages after the first one are rendered incorrectly. That's because vips sets a tile size equal to the first page size and renders a full page in each tile. Also, there are leftovers from the previous pages because vips doesn't clear the region before rendering pages in it.

Luckily, despite what comments in pdfiumload.c say, PDFium allows rendering a part of the page.

This PR fixes the pdfiumload behavior to work in sequential mode and render part of a page in each tile. Also, I added resetting of the output region to remove previous pages' leftovers.

Here's how vips renders the test PDF now: https://img.darthsim.me/9p4CSGSVC_.png

And here are the benchmarks:

# Before the fix
/usr/bin/time -f "%e %M" vips copy /images/pdf8.pdf[n=-1] /images/pdf8.png
0.35 119368

# After the fix
/usr/bin/time -f "%e %M" vips copy /images/pdf8.pdf[n=-1] /images/pdf8.png
0.21 44324

After the fix, pdfiumload becomes faster and eats less memory.

PS. #3456 has completely broken the rendering of PDFs with multiple pages of different sizes. That's because rect.width may be zero here – https://github.com/libvips/libvips/blob/master/libvips/foreign/pdfiumload.c#L613-L615, and FPDF_FFLDraw segfaults because of that. Changing its arguments to the ones passed to FPDF_RenderPageBitmap in this PR fixes the issue.

@jcupitt
Member

jcupitt commented Aug 5, 2023

Hey @DarthSim, this is very cool! Yes, the pdfium loader is rather unloved :(

I'll have a look.

@kleisauke

This comment was marked as resolved.

@kleisauke
Member

Ah, sorry for the noise. I was able to reproduce this (I had a dangling vips-poppler.so).

@kleisauke
Member

I just opened PR #3602 given that this PR targets the 8.14 branch.

PS. #3456 has completely broken the rendering of PDFs with multiple pages of different sizes. That's because rect.width may be zero here – https://github.com/libvips/libvips/blob/master/libvips/foreign/pdfiumload.c#L613-L615, and FPDF_FFLDraw segfaults because of that. Changing its arguments to the ones passed to FPDF_RenderPageBitmap in this PR fixes the issue.

Confirmed. I wonder why this segv hasn't been caught by OSS-Fuzz. 🤔

@kleisauke
Member

It appears that PDFium still cannot render any part of the page on demand. I could not reproduce this on this PR, but removing the vips_sequential() and switching to VIPS_DEMAND_STYLE_SMALLTILE would cause non-deterministic output. See for example commit kleisauke@aca7051 and this test script:

vips copy R2rprQ86Q6.pdf[n=-1] x.png

expected_sha256=$(sha256sum "x.png" | awk '{ print $1 }')
for run in {1..20}; do
  vips copy R2rprQ86Q6.pdf[n=-1] x.png
  echo "$expected_sha256 x.png" | sha256sum --check --quiet
done
$ ./check-deterministic.sh
x.png: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
...

Given this, I'm a bit worried about landing this as-is on 8.14. How about splitting the changes in vips_foreign_load_pdf_load() into a separate PR?

@DarthSim
Contributor Author

This is not a PDFium issue but an issue with the way vips works with it. More specifically, multithreading is the problem here.

VipsForeignLoadPdf stores only a single page at a time. Let's assume that the first thread starts generating a tile from page 1 and calls vips_foreign_load_pdf_get_page. Then, the second thread starts generating a tile from page 2 and calls vips_foreign_load_pdf_get_page before the first thread has started rendering its tile. In this case, both threads will render page 2.

The current implementation can hit the same issue, but since it renders huge tiles, that is quite unlikely (though still possible).

I'll think about what can be done here.

@jcupitt
Member

jcupitt commented Aug 13, 2023

Let's assume that the first thread starts generating a tile from page 1 and calls vips_foreign_load_pdf_get_page.

I think this should be OK -- pdfiumload runs behind a tilecache, and this will only let one thread at once through to vips_foreign_load_pdf_generate(). As long as that function sets the page each call, it should work.
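The interleaving described above, and the "set the page on each call" rule suggested in reply, can be illustrated with a small single-threaded simulation. This is only a sketch: `Loader`, `loader_get_page`, `loader_render_tile` and `generate_tile` are hypothetical stand-ins, not the actual pdfiumload.c code, and the two "threads" are modelled as an explicit call order rather than real threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a loader that keeps only one page open at a time. */
typedef struct {
    int current_page; /* page most recently selected, -1 if none */
} Loader;

/* Select a page; this replaces whichever page was selected before. */
static void loader_get_page(Loader *loader, int page_no)
{
    assert(loader != NULL);
    loader->current_page = page_no;
}

/* "Render" a tile: returns the number of the page that would be drawn. */
static int loader_render_tile(const Loader *loader)
{
    assert(loader != NULL);
    return loader->current_page;
}

/* Unserialized interleaving: B selects page 2 before A has rendered. */
int simulate_interleaved(void)
{
    Loader loader = { -1 };
    loader_get_page(&loader, 1);        /* thread A wants page 1 */
    loader_get_page(&loader, 2);        /* thread B wants page 2 */
    return loader_render_tile(&loader); /* A renders, but draws page 2 */
}

/* Serialized generate: selecting and rendering form one step per caller. */
static int generate_tile(Loader *loader, int page_no)
{
    loader_get_page(loader, page_no);
    return loader_render_tile(loader);
}

/* Only one caller at a time is let through, so A draws the page it asked for. */
int simulate_serialized(void)
{
    Loader loader = { -1 };
    int page_for_a = generate_tile(&loader, 1); /* A runs to completion */
    (void) generate_tile(&loader, 2);           /* then B runs */
    return page_for_a;
}
```

In this model `simulate_interleaved()` returns 2 (thread A ends up drawing the wrong page) while `simulate_serialized()` returns 1, which is the behaviour one would expect if only one thread at a time reaches the generate function and that function selects its page on every call.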

@DarthSim
Contributor Author

I guess we could just use vips_sequential, as we do in nsgifload and friends.

@DarthSim DarthSim force-pushed the fix/pdfiumload-different-sizes-8.14 branch 2 times, most recently from 2a15b4b to b39194b Compare August 13, 2023 17:24
@DarthSim
Contributor Author

I can confirm that using vips_sequential works without issues, even with a tile_height of 1px.

I changed tile_height to 16. It slows pdfiumload down a little bit but the memory footprint is lower:

/usr/bin/time -f "%e %M" vips copy /images/pdf8.pdf[n=-1] /images/pdf8.png
0.23 39348

@jcupitt
Member

jcupitt commented Aug 13, 2023

What if you swap the vips_sequential for vips_tilecache? I'd think it should work.

popplerload uses:

/* Render PDFs with tiles this size. They need to be pretty big to limit
 * overcomputation.
 *
 * An A4 page at 300dpi is 3508 pixels, so this should be enough to prevent
 * most rerendering.
 */ 
#define TILE_SIZE (4000)

...
        vips_tilecache(t[0], &t[1],
            "tile_width", TILE_SIZE,
            "tile_height", TILE_SIZE,
            "max_tiles", 2 * (1 + t[0]->Xsize / TILE_SIZE),
            NULL) ||

ie. huge tiles, since poppler is really slow at rendering parts of a page. If PDFium can render sections of pages quickly, smaller tiles might be better.
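To make the `max_tiles` expression in the snippet above concrete, here is a minimal sketch that evaluates it for a few image widths. `max_tiles_for_width` is a hypothetical helper written for illustration; only the expression itself and the 4000-pixel tile size come from the quoted code. The reading that this budget amounts to roughly two rows of tiles across the image is an assumption, not something stated in the thread.

```c
#include <assert.h>

/* Tile edge length used in the quoted popplerload snippet. */
#define TILE_SIZE (4000)

/* Same expression as the "max_tiles" argument: 2 * (1 + Xsize / TILE_SIZE).
 * The integer division rounds down, so 1 + xsize / TILE_SIZE is at least
 * the number of tiles needed to span the image width.
 */
int max_tiles_for_width(int xsize)
{
    assert(xsize > 0);
    return 2 * (1 + xsize / TILE_SIZE);
}
```

For example, a 2480-pixel-wide image gives a cache of 2 tiles, a 4000-pixel-wide image gives 4, and a 9000-pixel-wide image gives 6.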

@jcupitt
Member

jcupitt commented Aug 13, 2023

poppler is slow at rendering subsections of pages since it does no caching, unfortunately.

If you render a 128x128 area on a page, it will redraw the entire page and just clip that part, and redrawing includes re-decompressing any image resources, like JPGs. You can end up repeating a lot of work for each tile, so rendering whole pages can be dramatically quicker, perhaps a factor of 100 in a bad case.
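As a rough illustration of how much repeated work small tiles can cause when every tile triggers a full-page redraw, the sketch below counts the tiles needed to cover one page. The helper names are hypothetical, and the page size is an assumption: an A4 page at 300dpi is taken as 2480 x 3508 pixels (the 3508 figure appears in the popplerload comment quoted earlier). The count is only an upper bound on the number of full redraws, not a measured slowdown.

```c
#include <assert.h>

/* Number of tiles needed to cover `length` pixels, rounding up. */
static int tiles_along(int length, int tile_size)
{
    assert(length > 0 && tile_size > 0);
    return (length + tile_size - 1) / tile_size;
}

/* Tiles needed to cover a whole page. If the renderer redraws the full
 * page for every tile, this is also the number of full-page redraws.
 */
int tiles_per_page(int width, int height, int tile_size)
{
    return tiles_along(width, tile_size) * tiles_along(height, tile_size);
}
```

Under these assumptions, `tiles_per_page(2480, 3508, 128)` is 20 x 28 = 560, while `tiles_per_page(2480, 3508, 4000)` is 1, which is why page-sized tiles avoid most of the rerendering.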

I kept this PDF around for benchmarking:

http://www.rollthepotato.net/~john/Audi_US%20R8_2017-2.pdf

It has lots of huge graphic elements, and a range of page sizes. It's very slow with poppler, unless you render a page at a time. It'd be interesting to test with this PR.

@DarthSim
Contributor Author

Wow, this is one hell of a PDF 🙂 I've rendered only the first 5 pages of it, but I believe that's enough:

# sequential(tile_height = 16) + VIPS_DEMAND_STYLE_FATSTRIP
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.96 124896

# sequential(tile_height = 128) + VIPS_DEMAND_STYLE_FATSTRIP
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.21 122668

# sequential(tile_height = 16) + VIPS_DEMAND_STYLE_SMALLTILE
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
2.00 143968

# sequential(tile_height = 128) + VIPS_DEMAND_STYLE_SMALLTILE
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.23 145340

# tilecache(tile_width = 4000, tile_height = 4000) + VIPS_DEMAND_STYLE_FATSTRIP
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.65 150564

# tilecache(tile_width = 4000, tile_height = 4000) + VIPS_DEMAND_STYLE_SMALLTILE
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.70 150528

Hmm, vips_sequential with tile_height of 128 and VIPS_DEMAND_STYLE_FATSTRIP seems to be the best option after all 🤔

@jcupitt
Member

jcupitt commented Aug 14, 2023

I tried with that Audi PDF.

Here's the PR as it stands, so PDFium latest, plus sequential + tile-height 16:

$ time vips copy Audi_US\ R8_2017-2.pdf[dpi=300] x.jpg
memory: high-water mark 31.23 MB

real	0m58.948s
user	0m47.905s
sys	0m11.144s

And with this patch:

diff --git a/libvips/foreign/pdfiumload.c b/libvips/foreign/pdfiumload.c
index 21859cdc8..c7e611ed1 100644
--- a/libvips/foreign/pdfiumload.c
+++ b/libvips/foreign/pdfiumload.c
@@ -641,9 +641,19 @@ vips_foreign_load_pdf_load( VipsForeignLoad *load )
 
        if( vips_image_generate( t[0],
                NULL, vips_foreign_load_pdf_generate, NULL, pdf, NULL ) ||
+
+                       /*
                vips_sequential( t[0], &t[1],
                        "tile_height", VIPS__FATSTRIP_HEIGHT,
                        NULL ) ||
+                        */
+
+               vips_tilecache(t[0], &t[1],
+                   "tile_width", 4000,
+                   "tile_height", 4000,
+                   "max_tiles", 2 * (1 + t[0]->Xsize / 4000),
+                   NULL) ||
+
                vips_image_write( t[1], load->real ) )
                return -1;

I see:

$ time vips copy Audi_US\ R8_2017-2.pdf[dpi=300] x.jpg
memory: high-water mark 54.17 MB

real	0m0.786s
user	0m0.794s
sys	0m0.133s

So higher memory use, but 75x faster. I don't think PDFium can render parts of a page efficiently either, unfortunately :(

@DarthSim DarthSim force-pushed the fix/pdfiumload-different-sizes-8.14 branch from b39194b to d18a7d2 Compare August 14, 2023 16:25
@DarthSim DarthSim force-pushed the fix/pdfiumload-different-sizes-8.14 branch from d18a7d2 to 5aac511 Compare August 14, 2023 16:27
@DarthSim
Contributor Author

Yeah, you seem to be right. I've replaced vips_sequential with vips_tilecache.

Member

@jcupitt jcupitt left a comment

Great!

@kleisauke kleisauke merged commit 422c3b8 into libvips:8.14 Aug 14, 2023
@kleisauke
Member

Thank you!

Changing its arguments to ones of FPDF_RenderPageBitmap from this PR fixes the issue.

A PR for this would be welcome.

@DarthSim
Contributor Author

@kleisauke Here it is: #3613

@kleisauke
Member

I wonder why this segv hasn't been caught by OSS-Fuzz.

FWIW, it looks like this was only reproducible with n other than 1 (default). Unfortunately, those optional input arguments aren't covered by the fuzzers.

VIPS_ARG_INT(class, "n", 11,
    _("n"),
    _("Number of pages to load, -1 for all"),
    VIPS_ARGUMENT_OPTIONAL_INPUT,
    G_STRUCT_OFFSET(VipsForeignLoadPdf, n),
    -1, 100000, 1);

@DarthSim
Contributor Author

Exactly. This could happen only when the page's rect does not intersect with the region's rect. And this could happen only when you render multiple pages some of which are wider than the first one.
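The empty-intersection case can be sketched as follows. `Rect`, `rect_intersect` and `rect_is_empty` are hypothetical helpers written for illustration, not the libvips API; the 100-pixel page and the tile positions are made-up numbers. The idea is that when a later page is wider than the first, the output image is wider than page 1, so a tile to the right of page 1 has no overlap with it and the renderer should be skipped for that tile.

```c
#include <assert.h>

/* Hypothetical axis-aligned rectangle, in output-image pixel coordinates. */
typedef struct {
    int left, top, width, height;
} Rect;

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Intersection of two rects; width and height clamp to 0 when the
 * rects do not overlap along that axis.
 */
Rect rect_intersect(Rect a, Rect b)
{
    assert(a.width >= 0 && a.height >= 0 && b.width >= 0 && b.height >= 0);

    int left = max_int(a.left, b.left);
    int top = max_int(a.top, b.top);
    int right = min_int(a.left + a.width, b.left + b.width);
    int bottom = min_int(a.top + a.height, b.top + b.height);

    Rect out = { left, top, max_int(0, right - left), max_int(0, bottom - top) };
    return out;
}

/* A rect with no area must not be passed on to the page renderer. */
int rect_is_empty(Rect r)
{
    return r.width <= 0 || r.height <= 0;
}
```

With a 100 x 100 page at the origin, a tile at `{200, 0, 100, 16}` intersects to a rect of width 0, so `rect_is_empty` reports it and rendering can be skipped, whereas a tile at `{50, 90, 100, 16}` intersects to `{50, 90, 50, 10}` and only that part needs drawing.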
