perf: optimize conversion module hot paths #4969
Open
JakeChampion wants to merge 4 commits into libvips:master
66e28bb to 30fed1c
lovell (Member) reviewed on Mar 26, 2026:
Thanks for this Jake, those timings look great. Would be great to see the impact on x64 CPUs too. I've left a couple of comments inline.
Replace per-pixel double-precision division in vips_flatten for UCHAR input with precomputed 256-entry LUTs. This applies to both the black background and arbitrary background paths, with a special-case unrolled loop for the common RGBA (4-band) case.

Flatten is on the hot path for every RGBA-to-JPEG conversion (PNG/WebP with alpha saved as JPEG).

Benchmarked on 4000x4000 RGBA (arm64, Apple M-series):
- Black background: ~3.6% faster (82ms -> 79ms)
- Colored background: ~2.8% faster (82ms -> 80ms)
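As a rough illustration of the LUT idea described in this commit message, here is a minimal sketch for the black-background, 4-band case. It is not the code from this PR: the names `alpha_scale`, `build_alpha_scale`, `flatten_black_rgba` and `flatten_ref` are hypothetical, and round-to-nearest is an assumption made for the example, so libvips may round differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table: alpha_scale[a] == a / 255.0, filled in once. */
static double alpha_scale[256];

static void
build_alpha_scale(void)
{
    for (int a = 0; a < 256; a++)
        alpha_scale[a] = a / 255.0;
}

/* Flatten n RGBA pixels over a black background into RGB.
 * Each output channel is round(p * alpha / 255); the per-pixel
 * division is replaced by one table lookup and three multiplies.
 */
static void
flatten_black_rgba(const unsigned char *in, unsigned char *out, size_t n)
{
    assert(in && out);

    for (size_t i = 0; i < n; i++) {
        double s = alpha_scale[in[3]];

        out[0] = (unsigned char) (in[0] * s + 0.5);
        out[1] = (unsigned char) (in[1] * s + 0.5);
        out[2] = (unsigned char) (in[2] * s + 0.5);

        in += 4;
        out += 3;
    }
}

/* Reference for one channel, using a division per sample. */
static unsigned char
flatten_ref(unsigned char p, unsigned char a)
{
    return (unsigned char) (p * a / 255.0 + 0.5);
}
```

Calling `build_alpha_scale()` once and then `flatten_black_rgba()` per row keeps the division out of the inner loop. With round-to-nearest, the table version and `flatten_ref` agree for every uchar pair, because `p * a / 255` never lands exactly on a half-integer.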
Precompute inv_max_band = 1.0 / max_band once during build and use multiplication instead of division when scaling pixels to 0-1 in the composite blend loop. Applied to both the generic double path and the v4f SIMD vector path. Division is 3-5x slower than multiplication on modern CPUs.
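A minimal sketch of the reciprocal trick, assuming a small state struct that is filled in once at build time. `BlendScale`, `blend_scale_init` and `scale_row` are hypothetical names for illustration and do not come from the composite source.

```c
#include <assert.h>

/* Hypothetical per-operation state, computed once before any pixels
 * are processed. */
typedef struct {
    double max_band;     /* e.g. 255.0 for uchar, 65535.0 for ushort */
    double inv_max_band; /* 1.0 / max_band */
} BlendScale;

static void
blend_scale_init(BlendScale *s, double max_band)
{
    assert(s && max_band > 0.0);

    s->max_band = max_band;
    s->inv_max_band = 1.0 / max_band;
}

/* Scale n samples to the 0-1 range using one multiply per sample
 * instead of one divide per sample. */
static void
scale_row(const BlendScale *s, const unsigned char *in, double *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * s->inv_max_band;
}
```

In general `x * (1.0 / m)` can differ from `x / m` in the last bit of a double, so a change like this is worth checking against the original output.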
Initialize A[] and f[] arrays with = {0} at declaration instead of zeroing unused entries in a per-pixel loop.
Removes up to 63 double stores per pixel for images with few bands.
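The initialization change can be sketched as follows. `MAX_BANDS` and `sum_row` are hypothetical and only stand in for a loop that reads a fixed-width scratch array: the array is zeroed once at declaration, and the per-pixel loop writes only the bands in use.

```c
#include <assert.h>

#define MAX_BANDS 64 /* assumed capacity, chosen for illustration */

/* Sum every sample in a row of npixels pixels with `bands` bands each.
 * The inner read loop always covers MAX_BANDS entries, so the unused
 * tail of A[] must be zero.
 */
static double
sum_row(const unsigned char *in, int bands, int npixels)
{
    double A[MAX_BANDS] = { 0 }; /* all entries zeroed once, up front */
    double total = 0.0;

    assert(in && bands > 0 && bands <= MAX_BANDS);

    for (int x = 0; x < npixels; x++) {
        /* Only the first `bands` entries are written per pixel. */
        for (int b = 0; b < bands; b++)
            A[b] = in[b];

        /* Entries bands..MAX_BANDS-1 are still zero from the
         * declaration, so no per-pixel zero-fill is needed. */
        for (int b = 0; b < MAX_BANDS; b++)
            total += A[b];

        in += bands;
    }

    return total;
}
```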
…mple Replace byte-at-a-time pixel copy loops with a VIPS_MEMCPY macro that uses typed stores for common pixel sizes (1/2/3/4/8 bytes) and falls back to memcpy for others. Define the macro once in util.h.
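A sketch of a size-dispatched pixel copy in the spirit of this commit. It is not the VIPS_MEMCPY macro itself: `pel_copy` is a hypothetical inline function, and it swaps the typed stores mentioned above for constant-size `memcpy` calls, which compilers typically lower to plain loads and stores without making any alignment assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy one pixel of `size` bytes from src to dst.
 * Common pixel sizes get a constant-size memcpy; anything else falls
 * back to a general memcpy.
 */
static inline void
pel_copy(void *dst, const void *src, size_t size)
{
    assert(dst && src);

    switch (size) {
    case 1:
        memcpy(dst, src, 1);
        break;
    case 2:
        memcpy(dst, src, 2);
        break;
    case 3:
        memcpy(dst, src, 3);
        break;
    case 4:
        memcpy(dst, src, 4);
        break;
    case 8:
        memcpy(dst, src, 8);
        break;
    default:
        memcpy(dst, src, size);
        break;
    }
}
```

A call such as `pel_copy(q, p, ps)` would replace a `for (i = 0; i < ps; i++) q[i] = p[i];` loop, where `ps` is the pixel size in bytes.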
30fed1c to e3293dc
inv_max_band[] reciprocals instead of dividing per pixel; initialize blend arrays with = {0} instead of per-pixel zero-fill loop

Measured with hyperfine on an arm64 Apple M2, clang 21.

Test images: 8000x8000 or 4000x4000 uchar, created with vips gaussnoise + cast + bandjoin.

All affected operations produced bit-identical output to master, verified by running each operation on a 512x512 RGBA/RGB test image and comparing raw .v files with cmp -s.

Test plan

meson test -C build passes