Hand-written ARMv8-A NEON (AArch64) assembly for fast rigid 4x4 matrix inversion & multiplication

vernizzi/neon_mat4

NEON 4x4 Rigid Matrix Inverse and Multiply

This project provides fast hand-written ARMv8-A NEON (AArch64) assembly routines for:

  • Inverting a rigid (rotation + translation) 4x4 matrix
  • Multiplying two 4x4 matrices

Both routines operate on single-precision floats in column-major order. They are tuned for pipelines like the Cortex-A53 but are suitable for any ARM64 platform with NEON.

How the Rigid Affine Property Optimizes the Inverse

A general 4x4 affine matrix (as used in 3D graphics) can include rotation, translation, scale, and shear. However, most scene transforms (camera, object pose) contain only rotation and translation. Such a transform is called rigid (or Euclidean).

Such a matrix takes the form:

[ R | t ]
[ 0 | 1 ]

(R = 3x3 rotation, t = translation)

The inverse of a rigid affine matrix is mathematically simple and can be written as:

[ Rᵗ | -Rᵗ * t ]
[ 0  |    1    ]

  • The 3×3 rotation block is simply transposed (Rᵗ), which is much cheaper than a full matrix inverse.
  • Each component of the new translation is the negated dot product of one row of Rᵗ with the original translation column.
  • No determinant, cofactors, or division required!
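
As a quick check of the formula above, multiplying the claimed inverse by the original matrix gives the identity. The only property used is that a rotation matrix is orthogonal (RᵗR = I):

```latex
\begin{bmatrix} R^{\mathsf T} & -R^{\mathsf T} t \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} R^{\mathsf T} R & R^{\mathsf T} t - R^{\mathsf T} t \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix}
```

If the 3×3 block contains scale or shear, RᵗR is no longer the identity and this shortcut does not apply.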

In this project, the NEON routine computes the inverse by:

  • Using NEON vector "unzip" and "zip" instructions to quickly transpose the 3×3 rotation.
  • Efficiently calculating and negating the new translation using NEON fused multiply-add, treating the 3 rows as parallel dot-products.
  • Writing out the new matrix in a tight, branchless sequence suited to modern ARM cores.

This exploits the rigid property, which makes inverting a pose matrix far cheaper than a general 4x4 inverse.
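
The transpose and negated dot-product steps described above can be written as a plain scalar C reference, which may be handy for checking the assembly output. This is only a sketch: `ref_rigid_inverse` is a hypothetical name that is not part of the project, and it assumes `dst` does not alias `src`.

```c
/* Scalar reference for the rigid inverse (hypothetical helper, not the
   project's NEON routine). Matrices are column-major float[16], so the
   element at (row, col) lives at index col * 4 + row.
   Assumes dst and src do not overlap. */
static void ref_rigid_inverse(const float *src, float *dst)
{
    /* Transpose the 3x3 rotation block: dst(r, c) = src(c, r). */
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            dst[c * 4 + r] = src[r * 4 + c];

    /* New translation is -(R^T * t). Row r of R^T is column r of R,
       i.e. src[r * 4 + 0 .. 2]; t is src[12 .. 14]. */
    for (int r = 0; r < 3; ++r) {
        float dot = src[r * 4 + 0] * src[12]
                  + src[r * 4 + 1] * src[13]
                  + src[r * 4 + 2] * src[14];
        dst[12 + r] = -dot;
    }

    /* Bottom row of a rigid transform is (0, 0, 0, 1). */
    dst[3] = dst[7] = dst[11] = 0.0f;
    dst[15] = 1.0f;
}
```

For example, a 90-degree rotation about Z with translation (1, 2, 3) inverts to a matrix whose translation column is (-2, 1, -3).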

Files

  • neon_mat4.S: NEON assembly code for efficient matrix inverse and multiplication.
  • test.c: C test harness with correctness-checking for various rigid transforms.

Matrix Format

All matrices are 4x4, stored as float[16] in column-major order (matching OpenGL conventions):

| m00 m04 m08 m12 |
| m01 m05 m09 m13 |
| m02 m06 m10 m14 |
| m03 m07 m11 m15 |

For a rigid transform:

  • The upper-left 3x3 is a rotation matrix
  • The last column (excluding bottom-right) is translation
  • The bottom row is (0, 0, 0, 1), so the bottom-right element is always 1
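
To make the layout concrete, the sketch below shows how a (row, column) pair maps to an index into the `float[16]` array and where the translation lives. The helper names (`mat4_index`, `mat4_set_identity`, `mat4_set_translation`) are hypothetical and are not provided by the project.

```c
/* Column-major indexing: columns are contiguous, so (row, col) maps to
   col * 4 + row. All helpers here are hypothetical illustrations. */
static inline int mat4_index(int row, int col)
{
    return col * 4 + row;
}

/* Fill m with the 4x4 identity. */
static void mat4_set_identity(float *m)
{
    for (int i = 0; i < 16; ++i)
        m[i] = 0.0f;
    m[0] = m[5] = m[10] = m[15] = 1.0f;
}

/* Write the translation into the last column (indices 12, 13, 14). */
static void mat4_set_translation(float *m, float tx, float ty, float tz)
{
    m[mat4_index(0, 3)] = tx;
    m[mat4_index(1, 3)] = ty;
    m[mat4_index(2, 3)] = tz;
}
```

With this mapping, m12, m13, and m14 in the diagram above hold the translation, and m03, m07, and m11 hold the zeros of the bottom row.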

Building & Testing

Prerequisites

  • GCC cross compiler for AArch64 (e.g., aarch64-linux-gnu-gcc)
  • QEMU user-mode emulator for AArch64 (qemu-aarch64)
  • Standard Linux tools

Building and Running

You can cross-compile the project on Linux with the GNU AArch64 toolchain and run the tests under QEMU's user-mode AArch64 emulation.

# assemble using the aarch64 toolchain
aarch64-linux-gnu-as neon_mat4.S -o mat4.o

# compile the C test file and link
aarch64-linux-gnu-gcc mat4.o test.c -o tests

# run using qemu-aarch64 (user-mode emulation)
qemu-aarch64 -cpu cortex-a53 -L /usr/aarch64-linux-gnu ./tests

# you can also debug it on an x86-64 host
qemu-aarch64 -L /usr/aarch64-linux-gnu -g 1234 ./tests &
aarch64-linux-gnu-gdb -ex "target remote localhost:1234"

Usage from C

Declare prototypes:

extern void neon_mat4_affine_rigid_inverse(const float* src, float* dst);
extern void neon_mat4_mul(const float* a, const float* b, float* dst);
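
One way to sanity-check `neon_mat4_mul` from your own code is to compare it against a scalar reference. The sketch below assumes that the routine computes dst = a * b on column-major matrices and that `dst` does not alias either input; the README does not state the operand order or aliasing rules, so treat both as assumptions. `ref_mat4_mul` and `mat4_nearly_equal` are hypothetical names.

```c
/* Scalar reference multiply: dst = a * b (ASSUMED operand order).
   Column-major, so element (row, col) is at index col * 4 + row.
   Assumes dst does not overlap a or b. */
static void ref_mat4_mul(const float *a, const float *b, float *dst)
{
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[k * 4 + r] * b[c * 4 + k];
            dst[c * 4 + r] = sum;
        }
    }
}

/* Returns 1 if every element of x and y differs by at most tol. */
static int mat4_nearly_equal(const float *x, const float *y, float tol)
{
    for (int i = 0; i < 16; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > tol)
            return 0;
    }
    return 1;
}
```

A test could call `neon_mat4_mul(a, b, out)` and `ref_mat4_mul(a, b, expected)` on the same inputs and then require `mat4_nearly_equal(out, expected, tol)` for a small tolerance, since fused multiply-add can round slightly differently from separate multiplies and adds.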

Note

The assembly routines follow the AAPCS64 calling convention.

All pointers must be at least 8-byte (preferably 16-byte) aligned.
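
A minimal sketch of one way to meet the alignment requirement from C11, using `alignas` from `<stdalign.h>`. The `mat4_aligned` type and the `is_aligned_16` helper are hypothetical and not part of the project.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical wrapper whose storage is 16-byte aligned, so m can be
   passed directly as the src/dst pointer of the assembly routines. */
typedef struct {
    alignas(16) float m[16];
} mat4_aligned;

/* Returns 1 if p is aligned to a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Stack, static, and array instances of `mat4_aligned` are then aligned by the compiler. Heap allocations need an aligned allocator such as C11 `aligned_alloc`.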
