This repository contains a simple implementation of matrix multiplication using OpenMP and the NEON instruction set. The goal is to demonstrate the use of parallel processing and optimized instructions for matrix operations.
Requirements:
- An ARM-based processor with NEON support
- A C compiler with OpenMP support (e.g., GCC or Clang)
- An OpenMP runtime installed and configured on your system
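Before building, it can help to confirm that the compiler actually enables NEON and OpenMP with the flags you pass. The following is a minimal sketch, not part of this repository: the function names `has_neon`, `openmp_version`, and `print_build_features` are hypothetical, and the check relies only on the standard predefined macros `__ARM_NEON` and `_OPENMP`.

```c
#include <stdio.h>

/* Hypothetical helper: 1 if this file was compiled with NEON enabled, else 0. */
int has_neon(void) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: the value of the _OPENMP macro (a yyyymm date such as
   201511), or 0 if the file was compiled without OpenMP support. */
int openmp_version(void) {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

/* Prints what the compiler enabled for this build. */
void print_build_features(void) {
    printf("NEON:   %s\n", has_neon() ? "enabled" : "not enabled");
    if (openmp_version() != 0) {
        printf("OpenMP: enabled (_OPENMP = %d)\n", openmp_version());
    } else {
        printf("OpenMP: not enabled\n");
    }
}
```

Calling `print_build_features()` from a small `main` compiled with the same flags as `gemm.c` shows whether `-fopenmp` and the target architecture options took effect.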
To compile the code, use the following command:

    gcc -o gemm gemm.c -O3 -ffast-math -fopenmp -march=native

This benchmark compares the performance of four matrix multiplication implementations on an M2 Pro processor.

Implementations:
- Optimized: optimized NEON, parallel, blocked
- Standard: standard NEON, parallel, blocked
- Normal NEON: parallel NEON, not blocked
- Normal matmul: parallel matmul without NEON
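To illustrate what a parallel, blocked NEON kernel can look like, here is a minimal sketch. It is not the code in `gemm.c`: the function name `matmul_blocked`, the row-major layout, and the loop order are assumptions made for this example, and the only detail taken from the benchmark is `BLOCK_SIZE = 16`. The NEON path is used only when compiling for AArch64; otherwise a scalar inner loop is used so the sketch still builds elsewhere.

```c
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0
#endif

#define BLOCK_SIZE 16 /* same tile size as in the benchmark runs */

/* Sketch: C += A * B for row-major n x n float matrices.
   Assumes n is a multiple of BLOCK_SIZE. Each (ii, jj) tile of C is
   written by exactly one thread, so no synchronization is needed. */
void matmul_blocked(const float *A, const float *B, float *C, int n) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += BLOCK_SIZE) {
        for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
            for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
                for (int i = ii; i < ii + BLOCK_SIZE; i++) {
                    for (int k = kk; k < kk + BLOCK_SIZE; k++) {
                        const float a = A[(size_t)i * n + k];
                        const float *b_row = &B[(size_t)k * n + jj];
                        float *c_row = &C[(size_t)i * n + jj];
#if USE_NEON
                        /* Broadcast A[i][k], then update 4 floats of C per step. */
                        const float32x4_t a_vec = vdupq_n_f32(a);
                        for (int j = 0; j < BLOCK_SIZE; j += 4) {
                            float32x4_t c_vec = vld1q_f32(c_row + j);
                            c_vec = vfmaq_f32(c_vec, a_vec, vld1q_f32(b_row + j));
                            vst1q_f32(c_row + j, c_vec);
                        }
#else
                        /* Scalar fallback for targets without AArch64 NEON. */
                        for (int j = 0; j < BLOCK_SIZE; j++) {
                            c_row[j] += a * b_row[j];
                        }
#endif
                    }
                }
            }
        }
    }
}
```

Blocking keeps a 16 x 16 tile of each matrix hot in cache while it is reused, and parallelizing over output tiles is one simple way to avoid write conflicts between threads. The caller is expected to zero `C` first if a plain product is wanted.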
Results:
- N = 1024, BLOCK_SIZE = 16
  - Optimized: 87.17 GFLOP/S
  - Standard: 69.49 GFLOP/S
  - Normal NEON: 76.44 GFLOP/S
  - Normal matmul: 4.85 GFLOP/S
- N = 8192, BLOCK_SIZE = 16
  - Optimized: 122.27 GFLOP/S
  - Standard: 72.04 GFLOP/S
  - Normal NEON: 60.23 GFLOP/S
  - Normal matmul: Not applicable
Note: Our previous best performance was ~76 GFLOP/S. The optimized implementation now reaches 122.27 GFLOP/S at N = 8192, roughly a 1.6x improvement.
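For readers who want to relate the GFLOP/S figures to run time, the sketch below shows the usual conversion. It assumes the conventional count of 2·N³ floating-point operations for an N x N matrix product (one multiply and one add per inner-loop step); whether the benchmark here counts operations exactly this way is an assumption. The helper names `gemm_gflops` and `gemm_seconds` are hypothetical.

```c
/* Hypothetical helper: GFLOP/s for an n x n matrix product that took
   `seconds`, assuming 2 * n^3 floating-point operations. */
double gemm_gflops(int n, double seconds) {
    const double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}

/* Hypothetical helper: the inverse, i.e. the run time in seconds implied
   by a given GFLOP/s rate under the same 2 * n^3 assumption. */
double gemm_seconds(int n, double gflops) {
    const double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (gflops * 1e9);
}
```

Under that assumption, 87.17 GFLOP/S at N = 1024 would correspond to roughly 24.6 ms per multiplication, and 122.27 GFLOP/S at N = 8192 to roughly 9.0 s.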