pdfiumload: fix rendering multiple pages with different sizes #3594
Hey @DarthSim, this is very cool! Yes, the pdfium loader is rather unloved :( I'll have a look. |
Ah, sorry for the noise. I was able to reproduce this (I had a dangling |
I just opened PR #3602 given that this PR targets the 8.14 branch.
Confirmed. I wonder why this segv hasn't been caught by OSS-Fuzz. |
It appears that PDFium still cannot render any part of the page on demand. I could not reproduce this on this PR, but removing the
vips copy R2rprQ86Q6.pdf[n=-1] x.png
expected_sha256=$(sha256sum "x.png" | awk '{ print $1 }')
for run in {1..20}; do
  vips copy R2rprQ86Q6.pdf[n=-1] x.png
  echo "$expected_sha256 x.png" | sha256sum --check --quiet
done
$ ./check-deterministic.sh
x.png: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
...
Given this, I'm a bit worried to land this as-is on 8.14. How about splitting the changes in |
This is not a PDFium issue but an issue with the way vips works with it. More specifically, multithreading is the problem here.
The current implementation may face the same issue too. Since it renders huge tiles, that is quite unlikely but still possible. I'll think about what can be done here. |
I think this should be OK -- |
I guess, we could just use |
2a15b4b to b39194b
I can confirm that using using I changed
What if you swap the
/* Render PDFs with tiles this size. They need to be pretty big to limit
* overcomputation.
*
* An A4 page at 300dpi is 3508 pixels, so this should be enough to prevent
* most rerendering.
*/
#define TILE_SIZE (4000)
...
vips_tilecache(t[0], &t[1],
"tile_width", TILE_SIZE,
"tile_height", TILE_SIZE,
"max_tiles", 2 * (1 + t[0]->Xsize / TILE_SIZE),
NULL) ||
i.e. huge tiles, since poppler is really slow at rendering parts of a page. If PDFium can render sections of pages quickly, smaller tiles might be better. |
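To make the `max_tiles` expression above concrete, here is a minimal standalone sketch of the same arithmetic. The function name `pdf_max_tiles` is hypothetical and not part of libvips; it only mirrors the integer expression `2 * (1 + Xsize / TILE_SIZE)` from the snippet, which appears to keep about two rows of tiles across the image width.

```c
#include <assert.h>

#define TILE_SIZE (4000)

/* Hypothetical helper (not a libvips function): the tile cache size from
 * the snippet above, for an image image_width pixels across.
 * Integer division rounds down, hence the extra one tile per row.
 */
int
pdf_max_tiles(int image_width)
{
	assert(image_width > 0);

	return 2 * (1 + image_width / TILE_SIZE);
}
```

For a 2480-pixel-wide image (A4 at 300dpi) this gives 2 tiles, and for a 9000-pixel-wide image it gives 6.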
poppler is slow at rendering subsections of pages since it does no caching, unfortunately. If you render a 128x128 area on a page, it will redraw the entire page and just clip that part, and redrawing includes re-decompressing any image resources, like JPGs. You can end up repeating a lot of work for each tile, so rendering whole pages can be dramatically quicker, perhaps a factor of 100 in a bad case. I kept this PDF around for benchmarking: http://www.rollthepotato.net/~john/Audi_US%20R8_2017-2.pdf It has lots of huge graphic elements, and a range of page sizes. It's very slow with poppler, unless you render a page at a time. It'd be interesting to test with this PR. |
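A rough way to see where a factor like that could come from: if every tile request triggers a full-page redraw, the work scales with the number of tiles covering the page. The sketch below is plain arithmetic, not libvips or poppler code. `tiles_per_page` is a hypothetical name, and the 2480 x 3508 page size used in the note assumes A4 at 300dpi.

```c
#include <assert.h>

/* Number of tile_size x tile_size tiles needed to cover a page,
 * rounding partial tiles up in both directions.
 */
int
tiles_per_page(int page_width, int page_height, int tile_size)
{
	assert(page_width > 0 && page_height > 0 && tile_size > 0);

	int across = (page_width + tile_size - 1) / tile_size;
	int down = (page_height + tile_size - 1) / tile_size;

	return across * down;
}
```

With 128-pixel tiles, `tiles_per_page(2480, 3508, 128)` is 20 x 28 = 560 redraws for one page, against a single redraw with 4000-pixel tiles. The real slowdown depends on the renderer and on how tiles are requested.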
Wow, this is a hell of a PDF. I've rendered only the first 5 pages of it, but I believe that's enough (the two numbers after each command are elapsed seconds and peak resident memory in KB):
# sequential(tile_height = 16) + VIPS_DEMAND_STYLE_FATSTRIP
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.96 124896
# sequential(tile_height = 128) + VIPS_DEMAND_STYLE_FATSTRIP
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.21 122668
# sequential(tile_height = 16) + VIPS_DEMAND_STYLE_SMALLTILE
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
2.00 143968
# sequential(tile_height = 128) + VIPS_DEMAND_STYLE_SMALLTILE
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.23 145340
# tilecache(tile_width = 4000, tile_height = 4000) + VIPS_DEMAND_STYLE_FATSTRIP
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.65 150564
# tilecache(tile_width = 4000, tile_height = 4000) + VIPS_DEMAND_STYLE_SMALLTILE
/usr/bin/time -f "%e %M" vips copy /images/pdf9.pdf[n=5] /images/pdf8.png
1.70 150528
Hmm, |
I tried with that Audi PDF. Here's the PR as it stands, so PDFium latest, plus sequential + tile-height 16:
And with this patch:
diff --git a/libvips/foreign/pdfiumload.c b/libvips/foreign/pdfiumload.c
index 21859cdc8..c7e611ed1 100644
--- a/libvips/foreign/pdfiumload.c
+++ b/libvips/foreign/pdfiumload.c
@@ -641,9 +641,19 @@ vips_foreign_load_pdf_load( VipsForeignLoad *load )
if( vips_image_generate( t[0],
NULL, vips_foreign_load_pdf_generate, NULL, pdf, NULL ) ||
+
+ /*
vips_sequential( t[0], &t[1],
"tile_height", VIPS__FATSTRIP_HEIGHT,
NULL ) ||
+ */
+
+ vips_tilecache(t[0], &t[1],
+ "tile_width", 4000,
+ "tile_height", 4000,
+ "max_tiles", 2 * (1 + t[0]->Xsize / 4000),
+ NULL) ||
+
vips_image_write( t[1], load->real ) )
return -1;
I see:
So higher memory use, but 75x faster. I don't think PDFium can render parts of a page efficiently either, unfortunately :( |
b39194b to d18a7d2
d18a7d2 to 5aac511
Yeah, you seem to be right. Changed |
Great!
Thank you!
A PR for this would be welcome. |
@kleisauke Here it is: #3613 |
FWIW, it looks like this was only reproducible with libvips/libvips/foreign/pdfiumload.c Lines 721 to 726 in d59e8b9
Exactly. This could happen only when the page's rect does not intersect with the region's rect. And this could happen only when you render multiple pages some of which are wider than the first one. |
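The geometry can be sketched with a small standalone example. The `Rect` struct and both helpers below are hypothetical stand-ins written for illustration (libvips has its own `VipsRect` type, which is not used here), and the page and region sizes in the note are made up.

```c
#include <assert.h>

/* Hypothetical rectangle type for illustration; not libvips's VipsRect. */
typedef struct {
	int left, top, width, height;
} Rect;

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Intersection of a and b. Width and height are clamped to zero when the
 * rectangles do not overlap.
 */
Rect
rect_intersect(Rect a, Rect b)
{
	assert(a.width >= 0 && a.height >= 0);
	assert(b.width >= 0 && b.height >= 0);

	int left = imax(a.left, b.left);
	int top = imax(a.top, b.top);
	int right = imin(a.left + a.width, b.left + b.width);
	int bottom = imin(a.top + a.height, b.top + b.height);
	Rect out = { left, top, imax(0, right - left), imax(0, bottom - top) };

	return out;
}

/* True when there is nothing to render, so a draw call should be skipped. */
int
rect_is_empty(Rect r)
{
	return r.width <= 0 || r.height <= 0;
}
```

A 100-pixel-wide first page at (0, 0) and a region starting at x = 100, which exists only because a later page is wider, intersect in a rectangle of width 0. That is the case a renderer call would need to guard against.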
Hey there!
Long story short:
Here's a PDF file containing multiple pages of different sizes: https://img.darthsim.me/R2rprQ86Q6.pdf.
Here's how vips renders it: https://img.darthsim.me/mJ8SkNJb_m.png
As you can see, the pages after the first one are rendered incorrectly. That's because vips sets a tile size equal to the first page size and renders a full page in each tile. Also, there are leftovers from the previous pages because vips doesn't clear the region before rendering pages in it.
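The leftover problem can be shown with a small standalone sketch. Nothing here is libvips or PDFium API: `paint_page` is a hypothetical helper that draws a page into a reused one-channel tile buffer, and the `clear_first` flag stands in for resetting the output region before rendering.

```c
#include <assert.h>
#include <string.h>

#define TILE_W 8
#define TILE_H 8

/* Hypothetical helper: fill the top-left page_w x page_h area of a reused
 * tile buffer with `value`. When clear_first is non-zero the whole tile is
 * zeroed beforehand, so nothing from an earlier page survives.
 */
void
paint_page(unsigned char tile[TILE_H][TILE_W],
	int page_w, int page_h, unsigned char value, int clear_first)
{
	assert(page_w >= 0 && page_w <= TILE_W);
	assert(page_h >= 0 && page_h <= TILE_H);

	if (clear_first)
		memset(tile, 0, TILE_W * TILE_H);

	for (int y = 0; y < page_h; y++)
		for (int x = 0; x < page_w; x++)
			tile[y][x] = value;
}
```

Painting an 8x8 page and then a 4x4 page without clearing leaves the first page's pixels around the second one. With clearing, the area outside the second page is blank.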
Luckily, despite what comments in pdfiumload.c say, PDFium allows rendering a part of the page.
This PR fixes the pdfiumload behavior to work in sequential mode and render part of a page in each tile. Also, I added resetting of the output region to remove previous pages' leftovers.
Here's how vips renders the test PDF now: https://img.darthsim.me/9p4CSGSVC_.png
And here are the benchmarks:
After the fix, pdfiumload becomes faster and eats less memory.
PS. #3456 has completely broken the rendering of PDFs with multiple pages of different sizes. That's because rect.width may be zero here: https://github.com/libvips/libvips/blob/master/libvips/foreign/pdfiumload.c#L613-L615, and FPDF_FFLDraw segfaults because of that. Changing its arguments to the ones of FPDF_RenderPageBitmap from this PR fixes the issue.