JakeChampion · 2026-03-23T16:28:55Z

Replace the byte-by-byte loop in vips_embed_base_copy_pixel with
type-specific stores for common pixel sizes:

1 byte (greyscale): memset
2 bytes (greyscale+alpha): guint16 store
3 bytes (RGB): explicit byte assignments
4 bytes (RGBA): guint32 store
8 bytes (RGBA double/complex): guint64 store
other: memcpy per pixel

Benchmark: vips embed x.v x2.v 0 0 10000 10000 --extend copy,
VIPS_CONCURRENCY=1, median of 5 runs, arm64 Apple M-series, clang -O3:

Bands	master	PR	speedup
1	0.030s	0.020s	33%
3	0.040s	0.030s	25%
4	0.040s	0.030s	25%
8	0.070s	0.050s	29%
16	0.120s	0.100s	17%
32	0.210s	0.150s	29%
48	0.330s	0.240s	27%
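
The "speedup" column reads as the relative reduction in run time, (master - PR) / master, and every row of the table is consistent with that reading. A minimal sketch of the arithmetic follows; the helper name `time_reduction_percent` is hypothetical and not part of libvips.

```c
/* Hypothetical helper, not part of libvips: reduction in run time
 * expressed as a percentage of the baseline (master) time.
 */
static double
time_reduction_percent(double master_seconds, double pr_seconds)
{
	return 100.0 * (master_seconds - pr_seconds) / master_seconds;
}
```

For the 1-band row, `time_reduction_percent(0.030, 0.020)` is roughly 33. Expressed as a throughput ratio instead, the same row would be 0.030 / 0.020 = 1.5x.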

jcupitt · 2026-03-23T16:44:34Z

Hi @JakeChampion, thanks for this.

It's been a few years since I last looked at this, but a small loop used to be faster than memcpy() for fewer than about 20 bytes when built with -O3.

Is memcpy() always faster now? We should probably test at least current clang and gcc, and x64 and arm64.
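
One way to check the loop-versus-memcpy() question on a given compiler and architecture is a small standalone harness. The sketch below is an assumption-laden illustration, not libvips code: `replicate_loop`, `replicate_memcpy` and `time_replicate` are hypothetical names, and the byte loop is only a stand-in for the kind of loop the PR description mentions.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*replicate_fn)(unsigned char *q, const unsigned char *p,
	size_t bs, size_t n);

/* Write n copies of a bs-byte pixel using a byte-by-byte inner loop. */
static void
replicate_loop(unsigned char *q, const unsigned char *p, size_t bs, size_t n)
{
	size_t x, b;

	for (x = 0; x < n; x++)
		for (b = 0; b < bs; b++)
			*q++ = p[b];
}

/* Write n copies of a bs-byte pixel using one memcpy() call per pixel. */
static void
replicate_memcpy(unsigned char *q, const unsigned char *p, size_t bs, size_t n)
{
	size_t x;

	for (x = 0; x < n; x++) {
		memcpy(q, p, bs);
		q += bs;
	}
}

/* CPU seconds taken by `repeats` calls of fn, measured with clock(). */
static double
time_replicate(replicate_fn fn, unsigned char *q, const unsigned char *p,
	size_t bs, size_t n, int repeats)
{
	clock_t start = clock();
	int i;

	for (i = 0; i < repeats; i++)
		fn(q, p, bs, n);

	return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_replicate()` with each function over a range of `bs` values gives a rough per-size comparison. An optimising compiler may elide work whose result is never read, so the output buffer should be inspected after timing, and results from a harness like this may not match behaviour inside libvips.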

jcupitt · 2026-03-23T18:49:56Z

I made a tiny benchmark:

#!/bin/bash

for b in {1..50}; do
  vips black x.v 1 1 --bands $b
  echo -n $b,
  VIPS_CONCURRENCY=1 /usr/bin/time -f %U \
    vips embed x.v x2.v 0 0 10000 10000 --extend copy
done

For a release build of git master libvips vs. this PR on an AMD Threadripper Pro I see:

So the speed is very similar. The for loop is quicker for small pixels (fewer than about 10 bytes per pixel?).

It's not a very sophisticated test, of course!

lovell · 2026-03-23T19:00:36Z

This is really interesting, thanks @JakeChampion. Were you able to do any benchmarking/profiling?

I've created https://godbolt.org/z/daq8nGqhM to help explore the assembly generated by each approach under various compilers/architectures, but remember less code != faster.

If this is a particularly "hot" function for your scenario, a possible (and more verbose) alternative might be fixed-size loops for common values of bs as these could allow the compiler to optimise the unroll vs memcpy decision itself.
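
The fixed-size-loop idea could be sketched roughly as below. This is an illustration under assumptions rather than the code in the PR: `REPLICATE_FIXED` and `replicate_pixel` are hypothetical names, plain `unsigned char` stands in for `VipsPel`, and the set of "common" sizes is a guess.

```c
#include <stddef.h>
#include <string.h>

/* Write n copies of an N-byte pixel, where N is a literal at each
 * expansion. A constant inner bound leaves the compiler free to unroll
 * the loop or merge it into wider stores.
 */
#define REPLICATE_FIXED(N, q, p, n) \
	do { \
		unsigned char *q_ = (q); \
		const unsigned char *p_ = (p); \
		size_t x_, b_; \
		for (x_ = 0; x_ < (n); x_++) { \
			for (b_ = 0; b_ < (N); b_++) \
				q_[b_] = p_[b_]; \
			q_ += (N); \
		} \
	} while (0)

/* Dispatch on the runtime pixel size bs to a fixed-size loop where one
 * exists, falling back to memcpy() per pixel otherwise.
 */
static void
replicate_pixel(unsigned char *q, const unsigned char *p, size_t bs, size_t n)
{
	size_t x;

	switch (bs) {
	case 1: REPLICATE_FIXED(1, q, p, n); break;
	case 2: REPLICATE_FIXED(2, q, p, n); break;
	case 3: REPLICATE_FIXED(3, q, p, n); break;
	case 4: REPLICATE_FIXED(4, q, p, n); break;
	case 8: REPLICATE_FIXED(8, q, p, n); break;
	default:
		for (x = 0; x < n; x++) {
			memcpy(q, p, bs);
			q += bs;
		}
		break;
	}
}
```

Because every store here is a byte store, this form needs no pointer casts and so makes no alignment assumptions. Whether it beats explicit wide stores is something the godbolt output and a benchmark would have to confirm.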

JakeChampion · 2026-03-24T16:29:08Z

What I ended up with, which seemed to perform well on my machine, was this:

https://godbolt.org/z/zP8KxjxTr

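static void
vips_embed_base_copy_pixel(VipsEmbedBase *base,
	VipsPel *q, VipsPel *p, int n)
{
	/* Bytes per pixel of the input image. */
	const int bs = VIPS_IMAGE_SIZEOF_PEL(base->in);

	int x;

	/* Write n copies of the pixel at p to q, specialised by pixel size. */
	switch (bs) {
	case 1:
		/* One byte per pixel: fill with the single source byte. */
		memset(q, p[0], n);
		break;
	case 2:
		/* One 16-bit store per pixel. The casts assume q and p are
		 * suitably aligned for guint16.
		 */
		for (x = 0; x < n; x++)
			((guint16 *)q)[x] = *(guint16 *)p;
		break;
	case 3:
		/* No 3-byte integer type, so use three byte stores per pixel. */
		for (x = 0; x < n; x++) {
			q[0] = p[0];
			q[1] = p[1];
			q[2] = p[2];
			q += 3;
		}
		break;
	case 4:
		/* One 32-bit store per pixel, same alignment assumption. */
		for (x = 0; x < n; x++)
			((guint32 *)q)[x] = *(guint32 *)p;
		break;
	case 8:
		/* One 64-bit store per pixel, same alignment assumption. */
		for (x = 0; x < n; x++)
			((guint64 *)q)[x] = *(guint64 *)p;
		break;
	default:
		/* Any other size: one memcpy() per pixel. */
		for (x = 0; x < n; x++) {
			memcpy(q, p, bs);
			q += bs;
		}
		break;
	}
}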

JakeChampion · 2026-03-25T10:25:35Z

Quoting @jcupitt: I made a tiny benchmark:

#!/bin/bash

for b in {1..50}; do
  vips black x.v 1 1 --bands $b
  echo -n $b,
  VIPS_CONCURRENCY=1 /usr/bin/time -f %U \
    vips embed x.v x2.v 0 0 10000 10000 --extend copy
done

If I use this same approach, my results are:

jcupitt · 2026-03-27T14:33:27Z

Oh, nice!

How about making this into a macro, perhaps VIPS_MEMCPY()? If we do this, it'll mostly be compiled away for values of N known at compile time.

I guess it could also be tagged as inline in a header, though I'm uncertain if all the C compilers we need to support do the right thing for that.

There are quite a few places in libvips where a fast(er) memcpy would be useful.
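
As a rough illustration of the "inline in a header" variant, a helper along these lines might work. It is a hypothetical sketch and not the VIPS_MEMCPY() discussed here: the name `small_copy` and the chosen sizes are assumptions, and `static inline` needs a C99-capable compiler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not libvips API. Each case calls memcpy() with a
 * literal size, which optimising compilers commonly expand into plain
 * loads and stores. If n is a constant at the call site and the function
 * is inlined, the switch itself can be folded away.
 */
static inline void
small_copy(void *dst, const void *src, size_t n)
{
	switch (n) {
	case 1: memcpy(dst, src, 1); break;
	case 2: memcpy(dst, src, 2); break;
	case 3: memcpy(dst, src, 3); break;
	case 4: memcpy(dst, src, 4); break;
	case 8: memcpy(dst, src, 8); break;
	default: memcpy(dst, src, n); break;
	}
}
```

Going through memcpy() rather than casting to wider integer types keeps the copy free of alignment and aliasing assumptions. Whether the generated code matches explicit wide stores would need checking per compiler.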

JakeChampion · 2026-03-27T15:30:25Z

@jcupitt shall we wait for https://github.com/libvips/libvips/pull/4969/files#diff-2d3fc3425516d0b30885586f344f3902e9541a56b5de4871eb328ef461a3d7e8 and then bring that into this branch?

jcupitt · 2026-03-27T16:04:30Z

Oh, sure, let's do that.

JakeChampion marked this pull request as ready for review March 23, 2026 16:35

JakeChampion changed the title ~~perf: use memcpy instead of byte loop for pixel replication in embed~~ perf: specialize pixel replication in embed for common pixel sizes Mar 25, 2026

JakeChampion force-pushed the jake/perf1 branch from c5e5526 to 587d312 Compare March 25, 2026 10:29

perf: specialize pixel replication in embed for common pixel sizes #4966

JakeChampion wants to merge 1 commit into libvips:master from JakeChampion:jake/perf1
