This project provides fast hand-written ARMv8-A NEON (AArch64) assembly routines for:
- Inverting a rigid (rotation + translation) 4x4 matrix
- Multiplying two 4x4 matrices
Both routines use single-precision floats in column-major order and are optimized for pipelines like Cortex-A53 (but suitable for all ARM64 NEON platforms).
A general 4x4 affine matrix (as used in 3D graphics) can include rotation, translation, scale, and shear. However, for most scene transforms (camera, object pose), only rotation and translation are present—this is called a rigid (or Euclidean) transform.
Such a matrix takes the form:
[ R | t ]    (R = 3x3 rotation, t = translation)
[ 0 | 1 ]
The inverse of a rigid affine matrix is mathematically simple and can be written as:
[ Rᵗ | -Rᵗ * t ]
[ 0  |    1    ]
- The 3×3 rotation block is transposed (Rᵗ)—that's much cheaper than a full matrix inverse.
- The new translation is -Rᵗ * t: each component is the negated dot product of one row of Rᵗ (that is, one column of R) with the original translation column.
- No determinant, cofactors, or division required!
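As a quick check of the formula above, multiplying the original matrix by the claimed inverse should give the identity. The only property used is that R is a rotation, so R Rᵗ = I:

```latex
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} R^{\mathsf T} & -R^{\mathsf T} t \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} R R^{\mathsf T} & -R R^{\mathsf T} t + t \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix}
```

The top-right block collapses to -t + t = 0, which is why no determinant or division ever appears.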
In this project, the NEON routine computes the inverse by:
- Using NEON vector "unzip" and "zip" instructions to quickly transpose the 3×3 rotation.
- Efficiently calculating and negating the new translation using NEON fused multiply-add, treating the 3 rows as parallel dot-products.
- Writing out the new matrix in a tight, branchless sequence suited to modern ARM cores.
This exploits the rigid-transform structure: inverting a pose matrix needs only a 3×3 transpose and three dot products, instead of the determinant and cofactor work of a general 4x4 inverse.
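The steps above can be mirrored in portable C, which is handy as a reference when checking the assembly output. This is only a sketch under assumptions: `ref_rigid_inverse` is a hypothetical name, it is not a transcription of neon_mat4.S, and each short lane loop merely stands in for what a single vector multiply, multiply-accumulate, or negate could do.

```c
/* Hypothetical scalar reference for a rigid inverse (not the project's NEON code).
 * src and dst are column-major float[16] and must not overlap. */
void ref_rigid_inverse(const float *src, float *dst) {
    const float t[3] = { src[12], src[13], src[14] };
    float acc[3];
    int lane, c;

    /* Transpose the 3x3 rotation block: dst(row, col) = src(col, row). */
    for (c = 0; c < 3; ++c) {
        for (lane = 0; lane < 3; ++lane)
            dst[c * 4 + lane] = src[lane * 4 + c];
        dst[c * 4 + 3] = 0.0f;              /* bottom row stays 0 0 0 1 */
    }

    /* -(Rt * t), built column by column so the three dot products
     * advance together, one lane per output component. */
    for (lane = 0; lane < 3; ++lane) acc[lane]  = dst[0 + lane] * t[0];
    for (lane = 0; lane < 3; ++lane) acc[lane] += dst[4 + lane] * t[1];
    for (lane = 0; lane < 3; ++lane) acc[lane] += dst[8 + lane] * t[2];
    for (lane = 0; lane < 3; ++lane) dst[12 + lane] = -acc[lane];

    dst[15] = 1.0f;
}
```

Accumulating per column rather than per row avoids any horizontal add across lanes, which is the usual reason to phrase the dot products this way on SIMD hardware.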
- neon_mat4.S: NEON assembly code for efficient matrix inverse and multiplication.
- test.c: C test harness with correctness-checking for various rigid transforms.
All matrices are 4x4, stored as float[16] in column-major order (matching OpenGL conventions):
| m00 m04 m08 m12 |
| m01 m05 m09 m13 |
| m02 m06 m10 m14 |
| m03 m07 m11 m15 |
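In this layout, element (row, col) lives at index `col * 4 + row`. A minimal scalar multiply written against that indexing is sketched below. `ref_mat4_mul` is a hypothetical name, and the operand order (dst = a * b) is an assumption, since the source does not state which order the NEON routine uses.

```c
/* Hypothetical scalar reference: dst = a * b (operand order is an assumption).
 * All matrices are column-major float[16]; dst must not alias a or b. */
void ref_mat4_mul(const float *a, const float *b, float *dst) {
    int row, col, k;
    for (col = 0; col < 4; ++col) {
        for (row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];  /* a(row,k) * b(k,col) */
            dst[col * 4 + row] = sum;
        }
    }
}
```

Multiplying a rigid matrix by its inverse with this routine should give the identity, which makes it a convenient cross-check in a test harness.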
For a rigid transform:
- The upper-left 3x3 is a rotation matrix
- The last column (excluding bottom-right) is translation
- Bottom-right element is always 1
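Because the fast inverse is only valid for matrices of this shape, it can be worth validating inputs in debug builds. The sketch below is one possible check. `mat4_at`, `mat4_is_rigid`, and the tolerance parameter `eps` are hypothetical and not part of the project.

```c
/* Element (row, col) of a column-major float[16] matrix. */
static float mat4_at(const float *m, int row, int col) {
    return m[col * 4 + row];
}

static float absf(float x) { return x < 0.0f ? -x : x; }

/* Hypothetical helper: returns 1 if m looks like a rigid transform within eps.
 * It does not test det(R) = +1, so a reflection would also pass. */
int mat4_is_rigid(const float *m, float eps) {
    int i, j, k;

    /* Bottom row must be 0 0 0 1. */
    for (j = 0; j < 3; ++j)
        if (absf(mat4_at(m, 3, j)) > eps) return 0;
    if (absf(mat4_at(m, 3, 3) - 1.0f) > eps) return 0;

    /* Columns of the 3x3 block must be orthonormal: dot(ci, cj) = delta_ij. */
    for (i = 0; i < 3; ++i) {
        for (j = i; j < 3; ++j) {
            float dot = 0.0f;
            for (k = 0; k < 3; ++k)
                dot += mat4_at(m, k, i) * mat4_at(m, k, j);
            if (absf(dot - (i == j ? 1.0f : 0.0f)) > eps) return 0;
        }
    }
    return 1;
}
```

A tolerance around 1e-5 is a plausible starting point for single precision, but the right value depends on how the rotations were produced.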
- GCC cross compiler for AArch64 (e.g., aarch64-linux-gnu-gcc)
- QEMU user-mode emulator for AArch64 (qemu-aarch64)
- Standard Linux tools
You can cross-compile the project on Linux with the GNU AArch64 toolchain and run it under QEMU user-mode emulation.
# assemble using the aarch64 toolchain
aarch64-linux-gnu-as neon_mat4.S -o mat4.o
# compile the c test file and link
aarch64-linux-gnu-gcc mat4.o test.c -o tests
# run under QEMU user-mode emulation (qemu-aarch64)
qemu-aarch64 -cpu cortex-a53 -L /usr/aarch64-linux-gnu ./tests
# you can also debug it on an x86-64 host
qemu-aarch64 -L /usr/aarch64-linux-gnu -g 1234 ./tests &
aarch64-linux-gnu-gdb -ex "target remote localhost:1234"
Declare prototypes:
extern void neon_mat4_affine_rigid_inverse(const float* src, float* dst);
extern void neon_mat4_mul(const float* a, const float* b, float* dst);
The assembly routines' calling convention follows the AAPCS64 ABI.
All pointers must be at least 8-byte (preferably 16-byte) aligned.
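One way to satisfy the alignment requirement from C is to declare the buffers with C11 `alignas` and, in debug builds, verify the pointers before calling into the assembly. The sketch below assumes a C11 compiler. `is_aligned_16`, `pose`, and `pose_inv` are hypothetical names used only for illustration.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p is aligned to a 16-byte boundary. */
int is_aligned_16(const void *p) {
    return ((uintptr_t)p % 16u) == 0;
}

/* Column-major pose: identity rotation, translation (5, 6, 7). */
alignas(16) float pose[16] = {
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 0,
    5, 6, 7, 1
};

/* Output buffer with the same explicit alignment. */
alignas(16) float pose_inv[16];
```

With neon_mat4.S linked in, the call would then be `neon_mat4_affine_rigid_inverse(pose, pose_inv);`. Heap buffers can get the same guarantee from `aligned_alloc(16, 16 * sizeof(float))`.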