#machine-learning #quantization #fp4 #mxfp4

float4

MXFP4-compatible 4-bit floating point types and block formats for Rust

2 unstable releases

Uses new Rust 2024

0.2.0 Mar 4, 2026
0.1.0 Aug 4, 2025

#346 in Encoding

55,884 downloads per month
Used in 270 crates (3 directly)

MIT license

63KB
697 lines

This crate provides low-precision floating-point types following the OCP MX specification, designed for efficient storage and computation in machine learning applications where extreme quantization is beneficial.

Available Types

  • F4E2M1: 4-bit floating-point with 2 exponent bits and 1 mantissa bit
  • F4E2M1x2: Packed pair of two F4E2M1 values in a single byte (NVIDIA __nv_fp4x2_e2m1 compatible)
  • E8M0: 8-bit scale factor representing powers of two (2^-127 to 2^127)
  • MXFP4Block: Block format storing 32 F4E2M1 values with a shared E8M0 scale

Features

  • Extreme compression: about 7.5× smaller than f32 with the MXFP4Block format (17 bytes instead of 128 bytes per 32 values)
  • IEEE 754 compliant rounding: Round-to-nearest-even for F4E2M1
  • Power-of-two scales: E8M0 provides exact scaling without rounding errors
  • Efficient block storage: Pack multiple values with shared scale factor
  • NVIDIA compatible packing: F4E2M1x2 matches __nv_fp4x2_e2m1 layout for zero-copy CUDA interop
  • Pack/unpack utilities: Convert between F4E2M1 slices and packed F4E2M1x2 vectors
  • Comprehensive API: Conversions, constants, and trait implementations

Quick Start

Add this to your Cargo.toml:

[dependencies]
float4 = "0.2"

Example Usage

use float4::F4E2M1;

// Create from f64
let a = F4E2M1::from_f64(1.5);
assert_eq!(a.to_f64(), 1.5);

// Create from raw bits
let b = F4E2M1::from_bits(0x3); // 0b0011 = 1.5
assert_eq!(b.to_f64(), 1.5);

// Arithmetic operations (via f64 conversion)
let x = F4E2M1::from_f64(2.0);
let y = F4E2M1::from_f64(3.0);
let sum = F4E2M1::from_f64(x.to_f64() + y.to_f64());
assert_eq!(sum.to_f64(), 4.0); // 5.0 is not representable; the tie between 4.0 and 6.0 rounds to even (4.0)

// Constants
assert_eq!(F4E2M1::MAX.to_f64(), 6.0);
assert_eq!(F4E2M1::MIN.to_f64(), -6.0);
assert_eq!(F4E2M1::EPSILON.to_f64(), 0.5);
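
Rounding matters in the arithmetic example because F4E2M1 has only eight magnitudes. The following is a minimal standalone sketch of round-to-nearest-even over those magnitudes. It does not use the crate: `MAGNITUDES` and `round_to_e2m1` are illustrative names, and the tie rule (prefer the code whose mantissa bit is 0) is an assumption based on the rounding mode listed under Features.

```rust
/// The eight non-negative E2M1 magnitudes, indexed by their 3-bit code
/// (sign bit removed). Even indices have a mantissa bit of 0.
const MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Round a finite f32 to the nearest E2M1 value, breaking exact ties toward
/// the code with an even (zero) mantissa bit. Inputs beyond the range end up
/// at the largest magnitude, 6.0, with the sign of the input.
fn round_to_e2m1(x: f32) -> f32 {
    let mag = x.abs();
    let mut best = 0usize;
    for (code, &candidate) in MAGNITUDES.iter().enumerate() {
        let d_new = (mag - candidate).abs();
        let d_best = (mag - MAGNITUDES[best]).abs();
        // Strictly closer wins; on an exact tie prefer the even code.
        if d_new < d_best || (d_new == d_best && code % 2 == 0) {
            best = code;
        }
    }
    // Reapply the sign of the input to the chosen magnitude.
    MAGNITUDES[best].copysign(x)
}
```

Under these assumptions, `round_to_e2m1(5.0)` yields 4.0: 5.0 lies halfway between 4.0 (0110) and 6.0 (0111), and the even code wins.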

Packed Pairs (F4E2M1x2)

Two F4E2M1 values packed into a single byte, matching NVIDIA's __nv_fp4x2_e2m1 layout (lower nibble = first value, upper nibble = second value):

use float4::{F4E2M1, F4E2M1x2, pack, unpack};

// Pack two values into one byte
let pair = F4E2M1x2::new(F4E2M1::from_f64(1.5), F4E2M1::from_f64(-2.0));
assert_eq!(pair.lo().to_f64(), 1.5);
assert_eq!(pair.hi().to_f64(), -2.0);

// Convert from f32 pairs directly
let pair = F4E2M1x2::from_f32_pair(3.0, 0.5);
let (a, b) = pair.to_f32_pair();
assert_eq!(a, 3.0);
assert_eq!(b, 0.5);

// Pack a slice of F4E2M1 values into pairs
let values = vec![
    F4E2M1::from_f64(1.0),
    F4E2M1::from_f64(2.0),
    F4E2M1::from_f64(3.0),
    F4E2M1::from_f64(4.0),
];
let packed = pack(&values);   // [F4E2M1x2(1.0, 2.0), F4E2M1x2(3.0, 4.0)]
let unpacked = unpack(&packed); // [1.0, 2.0, 3.0, 4.0]
assert_eq!(values, unpacked);
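
The nibble layout described above (lower nibble = first value, upper nibble = second value) can be sketched on raw 4-bit codes without the crate. The helper names below are hypothetical, and padding an odd trailing element with a zero code is an assumption, not documented behavior of `pack`.

```rust
/// Pack two 4-bit codes into one byte: `lo` in bits 0..=3, `hi` in bits 4..=7.
fn pack_pair(lo: u8, hi: u8) -> u8 {
    (lo & 0x0F) | ((hi & 0x0F) << 4)
}

/// Split a packed byte back into its (lo, hi) 4-bit codes.
fn unpack_pair(byte: u8) -> (u8, u8) {
    (byte & 0x0F, byte >> 4)
}

/// Pack a slice of 4-bit codes, two per byte. An odd trailing code is paired
/// with 0 (positive zero); this padding choice is an assumption.
fn pack_codes(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|c| pack_pair(c[0], c.get(1).copied().unwrap_or(0)))
        .collect()
}
```

For instance, 1.5 (0011) and -2.0 (1100) pack to the byte 0xC3 in this layout.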

Block Format Example

use float4::{F4E2M1, E8M0, MXFP4Block};

// Original data
let data = vec![1.5, -2.0, 0.5, 3.0, 1.0, -0.5];

// Compute scale factor (rounds up to power of 2)
let scale = E8M0::from_f32_slice(&data);
assert_eq!(scale.to_f64(), 4.0); // 3.0 rounds up to 4.0

// Quantize to F4E2M1
let mut quantized = [F4E2M1::from_f64(0.0); 32];
for (i, &value) in data.iter().enumerate() {
    quantized[i] = F4E2M1::from_f64(value as f64 / scale.to_f64());
}

// Pack into block (17 bytes for 32 values vs 128 bytes for f32)
let block = MXFP4Block::from_f32_slice(quantized, scale);

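// Retrieve values (each is a representable F4E2M1 value times the scale)
let restored = block.to_f32_array();
assert_eq!(restored[0], 2.0);  // 1.5 / 4.0 = 0.375 rounds to 0.5, then 0.5 × 4.0 = 2.0
assert_eq!(restored[1], -2.0); // -2.0 / 4.0 = -0.5 is exactly representable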
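
To make the 17-byte figure and the quantize/restore round trip concrete, here is a standalone sketch of a block with the same shape: one scale byte plus 32 four-bit codes. The `Block` struct and the `encode`, `decode`, `quantize`, and `dequantize` helpers are hypothetical, not the crate's types. The exponent bias of 127 and the low-nibble-first packing are assumptions taken from the descriptions elsewhere in this README.

```rust
/// Magnitudes of the eight E2M1 codes with the sign bit cleared.
const MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Hypothetical block layout (not the crate's type): one biased scale
/// exponent plus 32 four-bit codes stored two per byte.
#[repr(C)]
struct Block {
    scale_exp: u8,   // scale = 2^(scale_exp - 127)
    codes: [u8; 16], // element 2i in the low nibble, 2i + 1 in the high nibble
}

/// Encode one already-scaled value as a 4-bit code (nearest, ties to even).
fn encode(x: f32) -> u8 {
    let mag = x.abs();
    let mut best = 0usize;
    for code in 1..8 {
        let d_new = (mag - MAGNITUDES[code]).abs();
        let d_best = (mag - MAGNITUDES[best]).abs();
        if d_new < d_best || (d_new == d_best && code % 2 == 0) {
            best = code;
        }
    }
    let sign = if x.is_sign_negative() { 0b1000 } else { 0 };
    sign | best as u8
}

/// Decode a 4-bit code back to f32.
fn decode(code: u8) -> f32 {
    let mag = MAGNITUDES[(code & 0b0111) as usize];
    if code & 0b1000 != 0 { -mag } else { mag }
}

/// Quantize 32 values with the power-of-two scale 2^(scale_exp - 127).
fn quantize(values: &[f32; 32], scale_exp: u8) -> Block {
    let scale = 2.0f32.powi(scale_exp as i32 - 127);
    let mut codes = [0u8; 16];
    for (i, &v) in values.iter().enumerate() {
        // Even indices fill the low nibble, odd indices the high nibble.
        codes[i / 2] |= encode(v / scale) << (4 * (i % 2));
    }
    Block { scale_exp, codes }
}

/// Expand a block back to 32 f32 values.
fn dequantize(block: &Block) -> [f32; 32] {
    let scale = 2.0f32.powi(block.scale_exp as i32 - 127);
    let mut out = [0.0f32; 32];
    for (i, slot) in out.iter_mut().enumerate() {
        let code = (block.codes[i / 2] >> (4 * (i % 2))) & 0x0F;
        *slot = decode(code) * scale;
    }
    out
}
```

In this sketch `std::mem::size_of::<Block>()` is 17, compared with 128 bytes for 32 f32 values, which is a ratio of about 7.5×. With a scale of 4.0 (`scale_exp = 129`), every restored value is a multiple of the scale drawn from the 16 encodings, so 1.5 comes back as 2.0 and 3.0 comes back as 4.0.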

E8M0 Scale Factors

The E8M0 type represents scale factors as exact powers of two:

use float4::E8M0;

// Exact powers of two are preserved
let scale = E8M0::from(4.0);
assert_eq!(scale.to_f64(), 4.0);

// Non-powers round UP to next power of two
let scale = E8M0::from(3.0);
assert_eq!(scale.to_f64(), 4.0);  // 3.0 → 4.0

let scale = E8M0::from(5.0);
assert_eq!(scale.to_f64(), 8.0);  // 5.0 → 8.0

// Computing scale from data
let data = [1.5, -2.0, 0.5, 3.0];
let scale = E8M0::from_f32_slice(&data);
assert_eq!(scale.to_f64(), 4.0);  // max(|data|) = 3.0 → 4.0

Key characteristics:

  • Range: 2^-127 to 2^127
  • Always rounds UP (toward positive infinity)
  • No rounding errors when scaling by powers of two
  • Ideal for block quantization schemes
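
The round-up rule can be sketched directly on the IEEE 754 bit pattern of an f32, with no crate involved. The function names are hypothetical. The bias of 127 is an assumption inferred from the stated 2^-127 to 2^127 range, and the sketch only handles normal, finite inputs precisely.

```rust
/// Encode the magnitude of `x` as a biased exponent byte, rounding up to the
/// next power of two. Assumes a bias of 127; the result is clamped to the
/// exponent range -127..=127, so the byte 0xFF is never produced.
fn e8m0_encode(x: f32) -> u8 {
    let bits = x.to_bits();
    // Unbiased exponent of the f32 (the sign bit is masked away).
    let exp = ((bits >> 23) & 0xFF) as i32 - 127;
    let mantissa = bits & 0x7F_FFFF;
    // Any non-zero mantissa means x is above 2^exp, so step up by one.
    let e = if mantissa != 0 { exp + 1 } else { exp };
    (e.clamp(-127, 127) + 127) as u8
}

/// Decode a biased exponent byte back to the power of two it represents.
fn e8m0_decode(byte: u8) -> f64 {
    2.0f64.powi(byte as i32 - 127)
}

/// Pick a scale for a slice from its largest absolute value.
fn e8m0_from_slice(data: &[f32]) -> u8 {
    let max_abs = data.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    e8m0_encode(max_abs)
}
```

Under these assumptions 4.0 stays 4.0, 3.0 becomes 4.0, and 5.0 becomes 8.0, matching the assertions above.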

Representable Values

F4E2M1 has 16 bit patterns, each mapping to one exactly representable value (0.0 and -0.0 are distinct encodings):

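| Value | Bit Pattern | Type          |
|-------|-------------|---------------|
| 0.0   | 0000        | Zero          |
| 0.5   | 0001        | Subnormal     |
| 1.0   | 0010        | Normal        |
| 1.5   | 0011        | Normal        |
| 2.0   | 0100        | Normal        |
| 3.0   | 0101        | Normal        |
| 4.0   | 0110        | Normal        |
| 6.0   | 0111        | Normal        |
| -0.0  | 1000        | Negative zero |
| -0.5  | 1001        | Subnormal     |
| -1.0  | 1010        | Normal        |
| -1.5  | 1011        | Normal        |
| -2.0  | 1100        | Normal        |
| -3.0  | 1101        | Normal        |
| -4.0  | 1110        | Normal        |
| -6.0  | 1111        | Normal        |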
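
The table follows from the bit layout: one sign bit, two exponent bits, and one mantissa bit. A minimal decoding sketch, assuming an exponent bias of 1 (which is what reproduces the values listed) and using a hypothetical `decode_e2m1` helper, not a function from the crate:

```rust
/// Decode a 4-bit E2M1 pattern (the low nibble of `bits`) into an f32.
/// Assumed layout: bit 3 = sign, bits 2..=1 = exponent (bias 1), bit 0 = mantissa.
fn decode_e2m1(bits: u8) -> f32 {
    let sign = if bits & 0b1000 != 0 { -1.0 } else { 1.0 };
    let exp = (bits >> 1) & 0b11;
    let man = (bits & 0b1) as f32;
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, so the value is 0.0 or 0.5.
        man * 0.5
    } else {
        // Normal: implicit leading 1, scaled by 2^(exp - 1).
        (1.0 + man * 0.5) * 2.0f32.powi(exp as i32 - 1)
    };
    sign * magnitude
}
```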

Special Values

Unlike standard floating-point formats, F4E2M1 has no representation for infinity or NaN. On conversion, infinities saturate to the largest-magnitude value of the same sign (±6.0), and NaN maps to 6.0:

use float4::F4E2M1;

assert_eq!(F4E2M1::from_f64(f64::INFINITY).to_f64(), 6.0);
assert_eq!(F4E2M1::from_f64(f64::NEG_INFINITY).to_f64(), -6.0);
assert_eq!(F4E2M1::from_f64(f64::NAN).to_f64(), 6.0);
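
The saturation step on its own can be sketched as a clamp applied before rounding. `saturate_e2m1` is a hypothetical helper rather than the crate's implementation; the NaN-to-6.0 mapping mirrors the assertions above.

```rust
/// Clamp an f64 into the F4E2M1 range [-6.0, 6.0] before rounding.
/// NaN is mapped to +6.0, mirroring the behavior shown in the assertions above.
fn saturate_e2m1(x: f64) -> f64 {
    if x.is_nan() {
        6.0
    } else {
        x.clamp(-6.0, 6.0)
    }
}
```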

No runtime deps