perf: specialize pixel replication in embed for common pixel sizes #4966

Open
JakeChampion wants to merge 1 commit into libvips:master from JakeChampion:jake/perf1

@JakeChampion
Contributor

@JakeChampion JakeChampion commented Mar 23, 2026

Replace the byte-by-byte loop in vips_embed_base_copy_pixel with
type-specific stores for common pixel sizes:

  • 1 byte (greyscale): memset
  • 2 bytes (greyscale+alpha): guint16 store
  • 3 bytes (RGB): explicit byte assignments
  • 4 bytes (RGBA): guint32 store
  • 8 bytes (complex/double): guint64 store
  • other: memcpy per pixel
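
For reference, the byte-by-byte replication that the description refers to can be sketched as a standalone function. This is an illustrative sketch rather than the libvips source: `replicate_bytewise` and its parameters are hypothetical names, and plain `unsigned char` stands in for `VipsPel`.

```c
#include <stddef.h>

/* Hypothetical standalone helper, not the libvips function:
 * write n copies of the bs-byte pixel at p to q, one byte at a time.
 */
static void
replicate_bytewise(unsigned char *q, const unsigned char *p,
	size_t bs, size_t n)
{
	size_t x, b;

	for (x = 0; x < n; x++)
		for (b = 0; b < bs; b++)
			*q++ = p[b];
}
```

The inner loop's trip count is only known at run time, which is the property the specialised cases in the list above try to avoid.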

Benchmark: vips embed x.v x2.v 0 0 10000 10000 --extend copy,
VIPS_CONCURRENCY=1, median of 5 runs, arm64 Apple M-series, clang -O3:

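| Bands | master | PR     | speedup (time saved) |
|-------|--------|--------|----------------------|
| 1     | 0.030s | 0.020s | 33%                  |
| 3     | 0.040s | 0.030s | 25%                  |
| 4     | 0.040s | 0.030s | 25%                  |
| 8     | 0.070s | 0.050s | 29%                  |
| 16    | 0.120s | 0.100s | 17%                  |
| 32    | 0.210s | 0.150s | 29%                  |
| 48    | 0.330s | 0.240s | 27%                  |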

@JakeChampion JakeChampion marked this pull request as ready for review March 23, 2026 16:35
@jcupitt
Member

jcupitt commented Mar 23, 2026

Hi @JakeChampion, thanks for this.

It's been a few years since I last looked at this, but a small loop used to be faster than memcpy() for fewer than about 20 bytes when built with -O3.

Is memcpy() always faster now? We should probably test at least current clang and gcc, and x64 and arm64.

@jcupitt
Member

jcupitt commented Mar 23, 2026

I made a tiny benchmark:

#!/bin/bash

for b in {1..50}; do
  vips black x.v 1 1 --bands $b
  echo -n $b,
  VIPS_CONCURRENCY=1 /usr/bin/time -f %U \
    vips embed x.v x2.v 0 0 10000 10000 --extend copy
done

For a release build of git master libvips vs. this PR on an AMD Threadripper Pro I see:

[image: benchmark results plot]

So very similar speed. The for loop is quicker for small pixel sizes (fewer than about 10 bytes per pixel?).

It's not a very sophisticated test, of course!

@lovell
Member

lovell commented Mar 23, 2026

This is really interesting, thanks @JakeChampion. Were you able to do any benchmarking/profiling?

I've created https://godbolt.org/z/daq8nGqhM to help explore the assembly generated by each approach under various compilers/architectures, but remember less code != faster.

If this is a particularly "hot" function for your scenario, a possible (and more verbose) alternative might be fixed-size loops for common values of bs as these could allow the compiler to optimise the unroll vs memcpy decision itself.
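
One way the fixed-size loop idea could look is sketched below. It is a guess at the shape of the suggestion, not code from the PR or from the godbolt link: `REPLICATE_FIXED` and `replicate_fixed` are hypothetical names, and `unsigned char` stands in for `VipsPel`. Each case expands the same loop with a literal pixel size, so the compiler sees a constant trip count and can decide for itself whether to unroll or use wider moves.

```c
#include <stddef.h>

/* Replicate a pixel whose size BS is a literal at the expansion site.
 * Uses the local variables q, p, n, x and b of the enclosing function.
 */
#define REPLICATE_FIXED(BS) \
	do { \
		for (x = 0; x < n; x++) { \
			for (b = 0; b < (size_t) (BS); b++) \
				q[b] = p[b]; \
			q += (BS); \
		} \
	} while (0)

/* Hypothetical helper: write n copies of the bs-byte pixel at p to q. */
static void
replicate_fixed(unsigned char *q, const unsigned char *p,
	size_t bs, size_t n)
{
	size_t x, b;

	switch (bs) {
	case 1: REPLICATE_FIXED(1); break;
	case 2: REPLICATE_FIXED(2); break;
	case 3: REPLICATE_FIXED(3); break;
	case 4: REPLICATE_FIXED(4); break;
	case 8: REPLICATE_FIXED(8); break;
	default:
		/* Generic fallback: trip count only known at run time. */
		for (x = 0; x < n; x++) {
			for (b = 0; b < bs; b++)
				q[b] = p[b];
			q += bs;
		}
		break;
	}
}
```

Whether this beats the byte loop or memcpy() would still need measuring on each compiler and architecture of interest.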

@JakeChampion
Contributor Author

What I ended up with, which looked to perform well on my machine, was this:

https://godbolt.org/z/zP8KxjxTr

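static void
vips_embed_base_copy_pixel(VipsEmbedBase *base,
	VipsPel *q, VipsPel *p, int n)
{
	/* Size of one pixel in bytes. */
	const int bs = VIPS_IMAGE_SIZEOF_PEL(base->in);

	int x;

	/* Write n copies of the pixel at p to q, specialised by pixel size. */
	switch (bs) {
	case 1:
		/* Single byte: a plain fill. */
		memset(q, p[0], n);
		break;
	case 2:
		/* One 16-bit store per pixel; the casts assume q and p are
		 * 2-byte aligned.
		 */
		for (x = 0; x < n; x++)
			((guint16 *)q)[x] = *(guint16 *)p;
		break;
	case 3:
		/* No 3-byte integer type, so assign each byte explicitly. */
		for (x = 0; x < n; x++) {
			q[0] = p[0];
			q[1] = p[1];
			q[2] = p[2];
			q += 3;
		}
		break;
	case 4:
		/* One 32-bit store per pixel; assumes 4-byte alignment. */
		for (x = 0; x < n; x++)
			((guint32 *)q)[x] = *(guint32 *)p;
		break;
	case 8:
		/* One 64-bit store per pixel; assumes 8-byte alignment. */
		for (x = 0; x < n; x++)
			((guint64 *)q)[x] = *(guint64 *)p;
		break;
	default:
		/* Any other size: one memcpy() per pixel. */
		for (x = 0; x < n; x++) {
			memcpy(q, p, bs);
			q += bs;
		}
		break;
	}
}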
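
The 2-, 4- and 8-byte cases above dereference `q` and `p` through wider integer pointers, which in C requires suitably aligned addresses. If that cannot be guaranteed, one possible variant (an assumption on my part, not something proposed in this PR) is to go through a local integer with fixed-size memcpy() calls. `replicate_u32` is a hypothetical name, and `unsigned char` stands in for `VipsPel`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write n copies of the 4-byte pixel at p to q.
 * Copying via a local uint32_t makes no assumption about the
 * alignment of p or q.
 */
static void
replicate_u32(unsigned char *q, const unsigned char *p, size_t n)
{
	uint32_t v;
	size_t x;

	/* Load the source pixel once. */
	memcpy(&v, p, sizeof(v));

	/* Store it n times, 4 bytes apart. */
	for (x = 0; x < n; x++)
		memcpy(q + x * sizeof(v), &v, sizeof(v));
}
```

Optimising compilers commonly lower a fixed-size memcpy() like this to a single load or store, but that is worth confirming in the generated assembly for each target.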

@JakeChampion
Contributor Author

Using the same benchmark script as @jcupitt above, my results are:

[image: embed_bench_median]

@JakeChampion JakeChampion changed the title from "perf: use memcpy instead of byte loop for pixel replication in embed" to "perf: specialize pixel replication in embed for common pixel sizes" Mar 25, 2026
Replace the byte-by-byte loop in vips_embed_base_copy_pixel with
type-specific stores for common pixel sizes:

- 1 byte (greyscale): memset
- 2 bytes (greyscale+alpha): guint16 store
- 3 bytes (RGB): explicit byte assignments
- 4 bytes (RGBA): guint32 store
- 8 bytes (complex/double): guint64 store
- other: memcpy per pixel
@jcupitt
Member

jcupitt commented Mar 27, 2026

Oh, nice!

How about making this into a macro, perhaps VIPS_MEMCPY()? If we do this, it'll mostly be compiled away for values of N known at compile time.
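
A rough sketch of what such a macro might look like follows. Everything in it is an assumption for illustration: `SKETCH_MEMCPY` is a hypothetical name, not an existing libvips macro, and the set of special-cased sizes simply mirrors the list in this PR.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical small-copy macro, for illustration only.
 * Copies N bytes from P to Q. When N is a compile-time constant, an
 * optimising compiler can usually fold the switch down to one branch.
 */
#define SKETCH_MEMCPY(Q, P, N) \
	do { \
		unsigned char *sm_q = (unsigned char *) (Q); \
		const unsigned char *sm_p = (const unsigned char *) (P); \
		const size_t sm_n = (N); \
		switch (sm_n) { \
		case 1: \
			sm_q[0] = sm_p[0]; \
			break; \
		case 2: \
			memcpy(sm_q, sm_p, 2); \
			break; \
		case 3: \
			sm_q[0] = sm_p[0]; \
			sm_q[1] = sm_p[1]; \
			sm_q[2] = sm_p[2]; \
			break; \
		case 4: \
			memcpy(sm_q, sm_p, 4); \
			break; \
		case 8: \
			memcpy(sm_q, sm_p, 8); \
			break; \
		default: \
			memcpy(sm_q, sm_p, sm_n); \
			break; \
		} \
	} while (0)
```

For example, `SKETCH_MEMCPY(dst, src, 3)` with a literal size would be expected to reduce to three byte assignments at -O2 or above, while a run-time size keeps the full switch.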

I guess it could also be tagged as inline in a header, though I'm uncertain if all the C compilers we need to support do the right thing for that.

There are quite a few places in libvips where a fast(er) memcpy would be useful.

@JakeChampion
Contributor Author

@jcupitt
Member

jcupitt commented Mar 27, 2026

Oh, sure, let's do that.
