perf: specialize pixel replication in embed for common pixel sizes #4966
JakeChampion wants to merge 1 commit into libvips:master
|
Hi @JakeChampion, thanks for this. It's been a few years since I last looked at this, but a small loop used to be faster than Is |
|
This is really interesting, thanks @JakeChampion, were you able to do any benchmarking/profiling? I've created https://godbolt.org/z/daq8nGqhM to help explore the assembly generated by each approach under various compilers/architectures, but remember less code != faster. If this is a particularly "hot" function for your scenario, a possible (and more verbose) alternative might be fixed-size loops for common values of |
|
What I ended up with, which looked to perform well on my machine, was this: https://godbolt.org/z/zP8KxjxTr

static void
vips_embed_base_copy_pixel(VipsEmbedBase *base,
	VipsPel *q, VipsPel *p, int n)
{
	const int bs = VIPS_IMAGE_SIZEOF_PEL(base->in);

	int x;

	switch (bs) {
	case 1:
		memset(q, p[0], n);
		break;

	case 2:
		for (x = 0; x < n; x++)
			((guint16 *) q)[x] = *(guint16 *) p;
		break;

	case 3:
		for (x = 0; x < n; x++) {
			q[0] = p[0];
			q[1] = p[1];
			q[2] = p[2];
			q += 3;
		}
		break;

	case 4:
		for (x = 0; x < n; x++)
			((guint32 *) q)[x] = *(guint32 *) p;
		break;

	case 8:
		for (x = 0; x < n; x++)
			((guint64 *) q)[x] = *(guint64 *) p;
		break;

	default:
		for (x = 0; x < n; x++) {
			memcpy(q, p, bs);
			q += bs;
		}
		break;
	}
}
|
Replace the byte-by-byte loop in vips_embed_base_copy_pixel with type-specific stores for common pixel sizes:
- 1 byte (greyscale): memset
- 2 bytes (greyscale+alpha): guint16 store
- 3 bytes (RGB): explicit byte assignments
- 4 bytes (RGBA): guint32 store
- 8 bytes (complex/double): guint64 store
- other: memcpy per pixel
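For anyone who wants to try the idea outside libvips, here is a minimal standalone sketch. It is not the code in this PR. `pel_t`, `replicate_bytes`, `replicate_fixed` and `replicate_agrees` are hypothetical names, and `uint8_t` stands in for `VipsPel`. It also swaps the pointer casts for fixed-size `memcpy` calls, which optimising compilers usually lower to a single load and store and which do not depend on `q` or `p` being aligned. Whether that is as fast as the cast version on a given target would need measuring.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t pel_t; /* assumption: stand-in for VipsPel */

/* Plain reference: copy one pixel of bs bytes n times, a byte at a time. */
static void
replicate_bytes(pel_t *q, const pel_t *p, int bs, int n)
{
	for (int x = 0; x < n; x++)
		for (int b = 0; b < bs; b++)
			q[x * bs + b] = p[b];
}

/* Specialised variant: constant-size memcpy for the common pixel sizes,
 * variable-size memcpy for everything else.
 */
static void
replicate_fixed(pel_t *q, const pel_t *p, int bs, int n)
{
	int x;

	switch (bs) {
	case 1:
		memset(q, p[0], (size_t) n);
		break;
	case 2:
		for (x = 0; x < n; x++)
			memcpy(q + 2 * x, p, 2);
		break;
	case 3:
		for (x = 0; x < n; x++)
			memcpy(q + 3 * x, p, 3);
		break;
	case 4:
		for (x = 0; x < n; x++)
			memcpy(q + 4 * x, p, 4);
		break;
	case 8:
		for (x = 0; x < n; x++)
			memcpy(q + 8 * x, p, 8);
		break;
	default:
		for (x = 0; x < n; x++)
			memcpy(q + (size_t) x * bs, p, (size_t) bs);
		break;
	}
}

#define MAX_BS 16
#define MAX_N 64

/* Returns 1 when both variants write identical output for this bs and n. */
static int
replicate_agrees(int bs, int n)
{
	pel_t src[MAX_BS];
	pel_t a[MAX_BS * MAX_N];
	pel_t b[MAX_BS * MAX_N];

	assert(bs >= 1 && bs <= MAX_BS && n >= 0 && n <= MAX_N);

	for (int i = 0; i < MAX_BS; i++)
		src[i] = (pel_t) (i * 37 + 1);
	memset(a, 0, sizeof(a));
	memset(b, 0, sizeof(b));

	replicate_bytes(a, src, bs, n);
	replicate_fixed(b, src, bs, n);

	return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling `replicate_agrees(bs, n)` for each pixel size of interest is a cheap sanity check to run before benchmarking either variant.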
c5e5526 to 587d312
|
Oh, nice! How about making this into a macro, perhaps I guess it could also be tagged as There are quite a few places in libvips where a fast(er) memcpy would be useful. |
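To make the macro suggestion concrete, here is one possible shape, offered only as a rough sketch. `PEL_COPY`, `fill_row` and `fill_row_ok` are hypothetical names, not existing libvips identifiers, and `uint8_t` stands in for `VipsPel`. The macro copies a single pixel and dispatches on the pixel size so that the common sizes get a constant-size `memcpy`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical macro, not part of libvips: copy one pixel of BS bytes
 * from P to Q. BS may be evaluated more than once, so pass a plain
 * variable rather than an expression with side effects.
 */
#define PEL_COPY(Q, P, BS) \
	do { \
		switch (BS) { \
		case 1: \
			(Q)[0] = (P)[0]; \
			break; \
		case 2: \
			memcpy((Q), (P), 2); \
			break; \
		case 3: \
			memcpy((Q), (P), 3); \
			break; \
		case 4: \
			memcpy((Q), (P), 4); \
			break; \
		case 8: \
			memcpy((Q), (P), 8); \
			break; \
		default: \
			memcpy((Q), (P), (size_t) (BS)); \
			break; \
		} \
	} while (0)

/* Example caller: replicate one pixel n times along a row. */
static void
fill_row(uint8_t *q, const uint8_t *p, int bs, int n)
{
	assert(bs > 0 && n >= 0);

	for (int x = 0; x < n; x++) {
		PEL_COPY(q, p, bs);
		q += bs;
	}
}

/* Small self-check: three RGB pixels replicated from one source pixel. */
static int
fill_row_ok(void)
{
	const uint8_t src[3] = { 10, 20, 30 };
	const uint8_t expect[9] = { 10, 20, 30, 10, 20, 30, 10, 20, 30 };
	uint8_t dst[9] = { 0 };

	fill_row(dst, src, 3, 3);

	return memcmp(dst, expect, sizeof(expect)) == 0;
}
```

A per-pixel macro like this re-tests the pixel size on every iteration unless the compiler hoists the switch out of the loop, so a hot caller might still prefer the switch-outside-the-loop layout used in the function earlier in this thread.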
|
@jcupitt shall we wait for https://github.com/libvips/libvips/pull/4969/files#diff-2d3fc3425516d0b30885586f344f3902e9541a56b5de4871eb328ef461a3d7e8 and then bring that into this branch? |
|
Oh, sure, let's do that. |


Replace the byte-by-byte loop in vips_embed_base_copy_pixel with type-specific stores for common pixel sizes:
Benchmark:
vips embed x.v x2.v 0 0 10000 10000 --extend copy, VIPS_CONCURRENCY=1, median of 5 runs, arm64 Apple M-series, clang -O3: