perf: optimize conversion module hot paths #4969

Open
JakeChampion wants to merge 4 commits into libvips:master from JakeChampion:jake/perf-conversion

@JakeChampion JakeChampion commented Mar 25, 2026

  • flatten: replace per-pixel float division with 256-entry LUTs for uchar alpha blending
  • composite: precompute inv_max_band[] reciprocals instead of dividing per pixel; initialize blend arrays with = {0} instead of per-pixel zero-fill loop
  • rot/flip/zoom/subsample: replace byte-at-a-time pixel copies with typed stores

Measured with hyperfine on an arm64 Apple M2, clang 21

Test images: 8000x8000 or 4000x4000 uchar, created with vips gaussnoise + cast + bandjoin

$ hyperfine --warmup 5 --runs 15 \
    -n master 'VIPS_CONCURRENCY=1 vips <op> input.v output.v' \
    -n branch 'VIPS_CONCURRENCY=1 vips <op> input.v output.v'

| Operation | master | branch | speedup |
|---|---|---|---|
| flatten 8k RGBA | 170.0 ms | 96.0 ms | 1.77x faster |
| rot90 4k RGBA | 86.0 ms | 52.4 ms | 1.64x faster |
| rot90 4k RGB | 80.6 ms | 50.7 ms | 1.59x faster |
| zoom 2x 4k RGB | 121.1 ms | 75.9 ms | 1.59x faster |
| rot270 4k RGB | 79.8 ms | 55.0 ms | 1.45x faster |
| flip horiz 4k RGBA | 68.7 ms | 47.6 ms | 1.44x faster |
| flip horiz 4k RGB | 63.3 ms | 47.7 ms | 1.33x faster |
| rot180 4k RGB | 63.3 ms | 48.8 ms | 1.30x faster |
| subsample 2x 8k RGB | 78.2 ms | 66.6 ms | 1.17x faster |

All affected operations produced bit-identical output to master, verified by running each operation on a 512x512 RGBA/RGB test image and comparing raw .v files with cmp -s.

Test plan

  • meson test -C build passes
  • Output images are bit-identical to master
  • Test on x86_64 / gcc

@JakeChampion JakeChampion force-pushed the jake/perf-conversion branch from 66e28bb to 30fed1c Compare March 25, 2026 12:46
@JakeChampion JakeChampion marked this pull request as ready for review March 25, 2026 12:46
@lovell lovell left a comment

Thanks for this Jake, those timings look great. Would be great to see the impact on x64 CPUs too. I've left a couple of comments inline.

Comment thread libvips/conversion/composite.cpp
Comment thread libvips/conversion/flip.c Outdated
Replace per-pixel double-precision division in vips_flatten for
UCHAR input with precomputed 256-entry LUTs. This applies to both
the black background and arbitrary background paths, with a
special-case unrolled loop for the common RGBA (4-band) case.

Flatten is on the hot path for every RGBA-to-JPEG conversion
(PNG/WebP with alpha saved as JPEG).

Benchmarked on 4000x4000 RGBA (arm64, Apple M-series):
  Black background: ~3.6% faster (82ms -> 79ms)
  Colored background: ~2.8% faster (82ms -> 80ms)
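
The LUT idea in this commit message can be sketched roughly as below for the black-background RGBA case. This is a minimal illustration, not the code in the PR: the names `alpha_scale`, `alpha_scale_init` and `flatten_black_rgba` are hypothetical, the table is assumed to hold `alpha / 255` as floats, and rounding to nearest is an assumption of the sketch, so its output may not match libvips bit for bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: alpha_scale[a] holds a / 255, computed once. */
static float alpha_scale[256];

static void
alpha_scale_init(void)
{
    for (int a = 0; a < 256; a++)
        alpha_scale[a] = (float) a / 255.0f;
}

/* Flatten n_pixels of 8-bit RGBA onto a black background, writing RGB.
 * Per pixel this does one table lookup and three multiplies, and no
 * division. Rounding to nearest is an assumption of this sketch.
 */
static void
flatten_black_rgba(const uint8_t *in, uint8_t *out, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        float s = alpha_scale[in[3]];

        out[0] = (uint8_t) (in[0] * s + 0.5f);
        out[1] = (uint8_t) (in[1] * s + 0.5f);
        out[2] = (uint8_t) (in[2] * s + 0.5f);

        in += 4;
        out += 3;
    }
}
```

Call `alpha_scale_init()` once before the first `flatten_black_rgba()` call. Because an 8-bit alpha has only 256 possible values, the table costs 1 KB and replaces every per-pixel division with a lookup.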
Precompute inv_max_band = 1.0 / max_band once during build and use
multiplication instead of division when scaling pixels to the 0-1 range
in the composite blend loop. This is applied to both the generic double
path and the v4f SIMD vector path. Floating-point division is typically
3-5x slower than multiplication on modern CPUs.
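
A minimal sketch of the reciprocal precomputation, assuming a small per-operation struct filled in once at build time. `BandScale`, `band_scale_init`, `scale_to_unit` and `SKETCH_MAX_BANDS` are hypothetical names for illustration and are not taken from composite.cpp. In general `x * (1.0 / m)` can differ from `x / m` in the last bit, so a change like this needs the kind of output comparison the PR description reports.

```c
#define SKETCH_MAX_BANDS 4 /* hypothetical limit for this sketch */

typedef struct {
    int bands;
    /* 1.0 / max_band[b], computed once rather than per pixel. */
    double inv_max_band[SKETCH_MAX_BANDS];
} BandScale;

/* Build step: the only place a division happens.
 * Assumes bands <= SKETCH_MAX_BANDS and every max_band[b] != 0.
 */
static void
band_scale_init(BandScale *scale, int bands, const double *max_band)
{
    scale->bands = bands;
    for (int b = 0; b < bands; b++)
        scale->inv_max_band[b] = 1.0 / max_band[b];
}

/* Per-pixel step: scale each band to the 0-1 range with a multiply. */
static void
scale_to_unit(const BandScale *scale, const double *in, double *out)
{
    for (int b = 0; b < scale->bands; b++)
        out[b] = in[b] * scale->inv_max_band[b];
}
```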
Initialize the A[] and f[] arrays with = {0} at declaration instead of zeroing unused entries in a per-pixel loop.
This removes up to 63 double stores per pixel for images with few bands.
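
The zero-initialization change can be illustrated with a small sketch. It assumes a fixed-width scratch array of 64 doubles (inferred from the "up to 63" figure above) that some later step reads in full. `N_SLOTS` and `sum_pixels` are hypothetical, and the summing loop merely stands in for whatever fixed-width work the real blend code does.

```c
#include <stddef.h>

#define N_SLOTS 64 /* hypothetical fixed array width */

/* For each pixel, load `bands` values into a fixed-width scratch array
 * and sum every slot. Assumes bands <= N_SLOTS.
 */
static void
sum_pixels(const double *in, int bands, size_t n_pixels, double *out)
{
    /* Zeroed once, at declaration. Slots >= bands are never written
     * again, so they stay zero for every pixel.
     */
    double A[N_SLOTS] = { 0 };

    for (size_t i = 0; i < n_pixels; i++) {
        /* Only the slots in use are refreshed per pixel. */
        for (int b = 0; b < bands; b++)
            A[b] = in[b];

        /* Fixed-width work that reads every slot. */
        double total = 0.0;
        for (int s = 0; s < N_SLOTS; s++)
            total += A[s];

        out[i] = total;
        in += bands;
    }
}
```

With a 3-band image, a per-pixel zero-fill of the unused slots would write 61 doubles per pixel that never change. Hoisting the zeroing to the declaration does that work once.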
…mple

Replace byte-at-a-time pixel copy loops with a VIPS_MEMCPY macro that
uses typed stores for common pixel sizes (1/2/3/4/8 bytes) and falls
back to memcpy for others. Define the macro once in util.h.
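
A rough sketch of the pixel-copy idea follows, using a horizontal flip as the example. It is not the `VIPS_MEMCPY` macro from the PR: `copy_pixel` and `flip_row` are hypothetical names, and instead of casting to typed pointers the sketch uses `memcpy` with constant sizes, which compilers commonly lower to one or two plain loads and stores while avoiding alignment and aliasing assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Copy one pixel of ps bytes. The constant-size cases give the
 * compiler a fixed length to inline; other sizes use a plain memcpy.
 */
static inline void
copy_pixel(unsigned char *dst, const unsigned char *src, size_t ps)
{
    switch (ps) {
    case 1:
        *dst = *src;
        break;
    case 2:
        memcpy(dst, src, 2);
        break;
    case 3:
        memcpy(dst, src, 3);
        break;
    case 4:
        memcpy(dst, src, 4);
        break;
    case 8:
        memcpy(dst, src, 8);
        break;
    default:
        memcpy(dst, src, ps);
        break;
    }
}

/* Mirror one row of `width` pixels, moving whole pixels at a time
 * rather than looping over individual bytes.
 */
static void
flip_row(unsigned char *dst, const unsigned char *src,
    size_t width, size_t ps)
{
    for (size_t x = 0; x < width; x++)
        copy_pixel(dst + (width - 1 - x) * ps, src + x * ps, ps);
}
```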
@JakeChampion JakeChampion force-pushed the jake/perf-conversion branch from 30fed1c to e3293dc Compare March 27, 2026 10:38

3 participants