High-performance, no_std, MCU-friendly DSP library featuring FFT, DCT, DST, Hartley, Wavelet, STFT, and more. Stack-only, SIMD-optimized, and batch transforms for embedded and scientific Rust applications.
- 🚀 Zero-allocation stack-only APIs for MCU/embedded systems
- ⚡ SIMD acceleration (x86_64 AVX2 & SSE, AArch64 NEON, WebAssembly SIMD)
- 🧮 Split-radix FFTs for power-of-two sizes, with radix-2/4 and mixed-radix support
- 🔧 Multiple transform types and modules: FFT, NDFFT (n-dimensional), DCT (Types I-IV), DST (Types I-IV), Hartley, Hilbert transform, Cepstrum, Wavelet, STFT, CZT, Goertzel
- 📊 Window functions: Hann, Hamming, Blackman, Kaiser
- 🔄 Batch and multi-channel processing
- 🌐 WebAssembly support
- 📱 Parallel processing (optional)
- 🎵 Hybrid song identification: fast metadata lookup with BLAKE3 fallback
See benchmarks for detailed benchmark results and data.
[dependencies]
kofft = { version = "0.1.5", features = [
# "x86_64", # AVX/SSE on x86_64
# "sse", # force SSE2-only backend
# "aarch64", # NEON on 64-bit ARM
# "wasm", # WebAssembly SIMD128
# "avx2", # AVX2-specific code paths
# "avx512", # AVX-512 code paths
# "parallel", # Rayon-based parallel helpers
# "simd", # portable SIMD FFT implementations
# "soa", # structure-of-arrays complex vectors
# "precomputed-twiddles", # embed precomputed twiddle factors (requires std)
# "compile-time-rfft", # precompute real FFT tables at compile time
# "slow", # include naive reference algorithms
# "internal-tests", # enable proptest/rand for internal tests
] }For an overview of the Fast Fourier Transform (FFT), see Wikipedia.
use kofft::{Complex32, FftPlanner};
use kofft::fft::{ScalarFftImpl, FftImpl};
// Create FFT instance with planner (caches twiddle factors)
let planner = FftPlanner::<f32>::new();
let fft = ScalarFftImpl::with_planner(planner);
// Prepare data
let mut data = vec![
Complex32::new(1.0, 0.0),
Complex32::new(2.0, 0.0),
Complex32::new(3.0, 0.0),
Complex32::new(4.0, 0.0),
];
// Compute FFT
fft.fft(&mut data)?;
// Compute inverse FFT
fft.ifft(&mut data)?;Enable the parallel feature to automatically split large transforms across
threads via Rayon. Use the fft_parallel and
ifft_parallel helpers which safely fall back to single-threaded execution when
Rayon is not available.
By default, kofft parallelizes an FFT when each CPU core would process at least
max(L1_cache_bytes / size_of::<Complex32>(), per_core_work) elements. The
defaults assume a 32 KiB L1 cache and require roughly 4,096 points per core.
The heuristic scales with the number of detected cores (via
num_cpus) and can be tuned using the
KOFFT_PAR_FFT_THRESHOLD, KOFFT_PAR_FFT_CACHE_BYTES, or
KOFFT_PAR_FFT_PER_CORE_WORK environment variables, or by calling
kofft::fft::set_parallel_fft_threshold, set_parallel_fft_l1_cache, or
set_parallel_fft_per_core_work at runtime.
use kofft::fft::{fft_parallel, ifft_parallel, Complex32};
let mut data = vec![Complex32::new(1.0, 0.0); 1 << 14];
fft_parallel(&mut data)?;
ifft_parallel(&mut data)?;The crate exposes several Cargo features. Refer to Cargo.toml for the canonical list and definitions.
std– enable the Rust standard library (default)parallel– Rayon-based parallel helpers- Architecture backends:
x86_64– AVX/SSE on x86_64 CPUssse– force SSE2-only backendaarch64– NEON on 64-bit ARMwasm– WebAssembly SIMD128avx2– AVX2-specific code pathsavx512– AVX-512 code paths
- Miscellaneous:
simd– portable SIMD FFT implementationssoa– structure-of-arrays complex vectors for SIMDprecomputed-twiddles– embed precomputed FFT twiddle factors (requiresstd)compile-time-rfft– generate real FFT tables at compile timeslow– include naive reference algorithmsinternal-tests– enable proptest and rand for internal testing
All stack-only APIs require you to provide output buffers. This enables no_std operation without any heap allocation.
use kofft::fft::{Complex32, fft_inplace_stack};
// 8-point FFT (power-of-two only for stack APIs)
let mut buf: [Complex32; 8] = [
Complex32::new(1.0, 0.0), Complex32::new(2.0, 0.0),
Complex32::new(3.0, 0.0), Complex32::new(4.0, 0.0),
Complex32::new(5.0, 0.0), Complex32::new(6.0, 0.0),
Complex32::new(7.0, 0.0), Complex32::new(8.0, 0.0),
];
fft_inplace_stack(&mut buf)?;use kofft::dct::dct1_inplace_stack;
let input: [f32; 8] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let mut output: [f32; 8] = [0.0; 8];
dct1_inplace_stack(&input, &mut output);use kofft::dct::dct2_inplace_stack;
let input: [f32; 8] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let mut output: [f32; 8] = [0.0; 8];
dct2_inplace_stack(&input, &mut output);use kofft::dst::dst2_inplace_stack;
let input: [f32; 8] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let mut output: [f32; 8] = [0.0; 8];
dst2_inplace_stack(&input, &mut output);use kofft::dst::dst4_inplace_stack;
let input: [f32; 8] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let mut output: [f32; 8] = [0.0; 8];
dst4_inplace_stack(&input, &mut output);use kofft::wavelet::{haar_forward_inplace_stack, haar_inverse_inplace_stack};
// Forward transform
let input: [f32; 8] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let mut avg = [0.0; 4];
let mut diff = [0.0; 4];
haar_forward_inplace_stack(&input, &mut avg[..], &mut diff[..]);
// Inverse transform
let mut out = [0.0; 8];
haar_inverse_inplace_stack(&avg[..], &diff[..], &mut out[..]);use kofft::window::{hann_inplace_stack, hamming_inplace_stack, blackman_inplace_stack};
let mut hann: [f32; 8] = [0.0; 8];
hann_inplace_stack(&mut hann);
let mut hamming: [f32; 8] = [0.0; 8];
hamming_inplace_stack(&mut hamming);
let mut blackman: [f32; 8] = [0.0; 8];
blackman_inplace_stack(&mut blackman);The workspace provides a sanity-check binary for comparing spectrograms
between kofft and rustfft. It can optionally emit an SVG file using
--svg-output:
cargo run -r -p sanity-check -- input.flac --svg-output=spec.svgWith the std feature (enabled by default), you get heap-based APIs for more flexibility.
use kofft::fft::{Complex32, ScalarFftImpl, FftImpl};
let fft = ScalarFftImpl::<f32>::default();
// Heap-based FFT
let mut data = vec![
Complex32::new(1.0, 0.0),
Complex32::new(2.0, 0.0),
Complex32::new(3.0, 0.0),
Complex32::new(4.0, 0.0),
];
fft.fft(&mut data)?;
// Or create new vector
let result = fft.fft_vec(&data)?;use kofft::fft::{ScalarFftImpl, FftImpl};
use kofft::rfft::RealFftImpl;
let fft = ScalarFftImpl::<f32>::default();
let mut input = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let mut output = vec![Complex32::zero(); input.len() / 2 + 1];
fft.rfft(&mut input, &mut output)?;Stack-only helpers avoid heap allocation:
use kofft::rfft::{irfft_stack, rfft_stack};
use kofft::Complex32;
let input = [1.0f32, 2.0, 3.0, 4.0];
let mut freq = [Complex32::new(0.0, 0.0); 3];
rfft_stack(&input, &mut freq)?;
let mut time = [0.0f32; 4];
irfft_stack(&freq, &mut time)?;For background on STFT, see Wikipedia.
use kofft::stft::{stft, istft};
use kofft::window::hann;
use kofft::fft::ScalarFftImpl;
let signal = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let window = hann(4);
let hop_size = 2;
let fft = ScalarFftImpl::<f32>::default();
let mut frames = vec![vec![]; (signal.len() + hop_size - 1) / hop_size];
stft(&signal, &window, hop_size, &mut frames, &fft)?;
let mut output = vec![0.0; signal.len()];
let mut scratch = vec![0.0; output.len()];
istft(&mut frames, &window, hop_size, &mut output, &mut scratch, &fft)?;use kofft::stft::{StftStream, istft};
use kofft::window::hann;
use kofft::fft::{Complex32, ScalarFftImpl};
let signal = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let window = hann(4);
let hop_size = 2;
let fft = ScalarFftImpl::<f32>::default();
let mut stream = StftStream::new(&signal, &window, hop_size, &fft)?;
let mut frames = Vec::new();
let mut frame = vec![Complex32::new(0.0, 0.0); window.len()];
while stream.next_frame(&mut frame)? {
frames.push(frame.clone());
}
let mut output = vec![0.0; signal.len()];
let mut scratch = vec![0.0; output.len()];
istft(&mut frames, &window, hop_size, &mut output, &mut scratch, &fft)?;use kofft::fft::{ScalarFftImpl, FftImpl};
let fft = ScalarFftImpl::<f32>::default();
let mut batches = vec![
vec![Complex32::new(1.0, 0.0), Complex32::new(2.0, 0.0)],
vec![Complex32::new(3.0, 0.0), Complex32::new(4.0, 0.0)],
];
fft.batch(&mut batches)?;Run the included examples with:
cargo run --example basic_usage
cargo run --example stft_usage
cargo run --example ndfft_usage
cargo run --example embedded_example
cargo run --example benchmark
cargo run --example rfft_usage
cargo run --example dct_usage --release
cargo run --example spectrogram -- <input.wav> <output.png>The spectrogram example reads a WAV file and writes a yellow→purple spectrogram image.
dct_usage compares a naive DCT-II against a planner that caches cosine
values and reports the average runtime of each approach, demonstrating the
benefit of reusing planning data.
Enable architecture-specific features in Cargo.toml:
[dependencies]
kofft = { version = "0.1.4", features = [
# "x86_64", # x86_64 AVX/SSE backends
# "aarch64", # AArch64 NEON backend
# "wasm", # WebAssembly SIMD128 backend
# "parallel", # Rayon-based parallel helpers
] }The x86_64, aarch64, and wasm features activate optimized backends for
their respective architectures. When the corresponding CPU or Wasm SIMD
extensions are available (e.g., AVX2, NEON, or simd128), kofft will
automatically select the best implementation.
SIMD backends are also enabled automatically when compiling with the
appropriate target-feature flags (e.g., RUSTFLAGS="-C target-feature=+avx2").
To opt into additional optional features for local builds, set the
KOFFT_FEATURES environment variable. Any features listed there are appended
to those detected by the xtask utility:
KOFFT_FEATURES="simd compile-time-rfft" cargo xtask testEnable the parallel feature (using Rayon) as shown above:
use kofft::stft::parallel;
let signal = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let window = vec![1.0, 1.0, 1.0, 1.0];
let hop_size = 2;
let mut frames = vec![vec![]; (signal.len() + hop_size - 1) / hop_size];
parallel(&signal, &window, hop_size, &mut frames)?;- DCT – Discrete Cosine Transform (Wikipedia)
- DST – Discrete Sine Transform (Wikipedia)
- Hartley Transform – (Wikipedia)
- Wavelet Transform – multi-level Haar, Daubechies, Symlets, Coiflets (Wikipedia)
- Goertzel Algorithm – (Wikipedia)
- Chirp Z-Transform – (Wikipedia)
- Hilbert Transform – (Wikipedia)
- Cepstrum – (Wikipedia)
use kofft::{dct, dst, hartley, wavelet, goertzel, czt, hilbert, cepstrum};
// DCT variants
let dct2_result = dct::dct2(&input);
let dct3_result = dct::dct3(&input);
let dct4_result = dct::dct4(&input);
// DST variants
let dst1_result = dst::dst1(&input);
let dst2_result = dst::dst2(&input);
let dst3_result = dst::dst3(&input);
// Hartley Transform
let hartley_result = hartley::dht(&input);
// Wavelet Transform
use wavelet::{
haar_forward_multi, haar_inverse_multi, db4_forward_multi, db4_inverse_multi,
};
let (approx, details) = haar_forward_multi(&input, 2);
let reconstructed = haar_inverse_multi(&approx, &details);
// Additional families, e.g. Daubechies-4
let (db4_a, db4_d) = db4_forward_multi(&input, 2);
let db4_recon = db4_inverse_multi(&db4_a, &db4_d);
// Goertzel Algorithm (single frequency detection)
let magnitude = goertzel::goertzel_f32(&input, 44100.0, 1000.0);
// Chirp Z-Transform
let czt_result = czt::czt_f32(&input, 64, (0.5, 0.0), (1.0, 0.0));
// Hilbert Transform
let hilbert_result = hilbert::hilbert_analytic(&input);
// Cepstrum
let cepstrum_result = cepstrum::real_cepstrum(&input);#![no_std]
use kofft::fft::{Complex32, fft_inplace_stack};
use kofft::dct::dct2_inplace_stack;
use kofft::window::hann_inplace_stack;
#[entry]
fn main() -> ! {
// FFT example
let mut fft_buf: [Complex32; 8] = [
Complex32::new(1.0, 0.0), Complex32::new(2.0, 0.0),
Complex32::new(3.0, 0.0), Complex32::new(4.0, 0.0),
Complex32::new(5.0, 0.0), Complex32::new(6.0, 0.0),
Complex32::new(7.0, 0.0), Complex32::new(8.0, 0.0),
];
fft_inplace_stack(&mut fft_buf).unwrap();
// DCT example
let dct_input: [f32; 8] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let mut dct_output: [f32; 8] = [0.0; 8];
dct2_inplace_stack(&dct_input, &mut dct_output);
// Window function example
let mut window: [f32; 8] = [0.0; 8];
hann_inplace_stack(&mut window);
loop {
// Your application logic here
}
}- Stack-only APIs: No heap allocation, suitable for MCUs with limited RAM
- SIMD acceleration: 2-4x speedup on supported platforms
- Parallel FFT: Enable the
parallelfeature to scale across CPU cores - Power-of-two sizes: Most efficient for FFT operations
- Memory usage: Stack usage scales with transform size (e.g., 8-point FFT uses ~64 bytes for
Complex32)
| Platform | SIMD Support | Enable via |
|---|---|---|
| x86_64 | AVX2/FMA | x86_64 feature or -C target-feature=+avx2 |
| x86_64 (SSE) | SSE2 | sse feature or default sse2 target |
| AArch64 | NEON | aarch64 feature or -C target-feature=+neon |
| WebAssembly | SIMD128 | wasm feature or -C target-feature=+simd128 |
| Generic | Scalar | Default fallback |
Feature selection precedence: x86_64 (AVX2) → sse → scalar fallback.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Common development commands are exposed via the xtask binary:
cargo xtask build # Build the project with auto-detected features
cargo xtask test # Run tests with matching features
cargo xtask clippy # Run clippy lints
cargo xtask fmt # Format the codebase
cargo xtask analyze # Run fmt and clippy together
cargo xtask benchmark # Execute benchmarks
cargo xtask bench-libs # Run criterion benches across libraries
cargo xtask update-bench-readme # Refresh benchmark README data
cargo xtask sanity -- <path-to-flac> # Run the sanity check example