
Conversation

@jcaip (Contributor) commented Jun 10, 2025

This should give us AMD perf on vLLM. With Phi-4-mini-instruct on MI300x with TorchAO FP8 rowwise quant on the MLP I see the following, which is about a 5% speedup:

```
Avg latency: 1.080369415456274 seconds
10% percentile latency: 1.075335633114446 seconds
25% percentile latency: 1.0811904482543468 seconds
50% percentile latency: 1.082176529977005 seconds
75% percentile latency: 1.0826280051842332 seconds
90% percentile latency: 1.0831242799758911 seconds
99% percentile latency: 1.0836151059856638 seconds
```

For comparison, here is the baseline Phi-4-mini-instruct on MI300x:

```
Avg latency: 1.148340248184589 seconds
10% percentile latency: 1.1391733552212826 seconds
25% percentile latency: 1.14905939399614 seconds
50% percentile latency: 1.150204271019902 seconds
75% percentile latency: 1.1523984443047084 seconds
90% percentile latency: 1.1536207939614542 seconds
99% percentile latency: 1.1548575214319863 seconds
```

Previously, these checks were failing on the unsigned-zero ROCm fp8 dtypes, causing us to call `.dequantize()` and then do a bfloat16 mm, which was slower than the bf16 baseline (~2s).
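For reference, a minimal sketch of what "TorchAO FP8 rowwise quant on the MLP" could look like on the torchao side. The names here are assumptions: recent torchao exposes `quantize_`, `Float8DynamicActivationFloat8WeightConfig`, and `PerRow` (exact config names vary by version), the `.mlp.` filter is a guess at Phi-4-mini-instruct's module naming, and the actual vLLM integration path is not shown:

```python
# Illustrative sketch only: float8 dynamic-activation / float8-weight quantization
# with rowwise scales, applied to the MLP linears of a model.
import torch.nn as nn
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)

def quantize_mlp_fp8_rowwise(model: nn.Module) -> None:
    def is_mlp_linear(module: nn.Module, fqn: str) -> bool:
        # Only touch nn.Linear modules inside an MLP block,
        # e.g. "model.layers.0.mlp.gate_up_proj" (name pattern is an assumption).
        return isinstance(module, nn.Linear) and ".mlp." in fqn

    quantize_(
        model,
        Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()),
        filter_fn=is_mlp_linear,
    )
```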

@pytorch-bot (bot) commented Jun 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2351

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 98eb0dc with merge base 16e2d0a:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the `CLA Signed` label Jun 10, 2025
@jcaip added the `module: rocm` and `topic: bug fix` labels Jun 10, 2025
@jcaip requested review from drisspg and jerryzh168 June 10, 2025 21:38
@jcaip (Contributor, Author) commented Jun 10, 2025

cc @drisspg @jerryzh168

Trying to think what's the best way to test this, but I don't think it's that simple: since we try to dequantize -> do a dense matmul by default, testing correctness is not enough here. Calling DynamicFloat8... on a model would fail on perf but would still output the correct result.

Any opinions on maybe turning off the dequantize -> dense op fallback by default (or putting it behind a flag), or will that break a lot of things?
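One possible way to make such a test sensitive to the code path (a sketch of the idea, not something added in this PR) is to profile a forward pass and assert that an fp8 GEMM such as `torch._scaled_mm` appears, since the silent fallback is exactly a `.dequantize()` followed by a regular bf16 matmul. The exact profiler event name per backend is an assumption here:

```python
# Sketch: check that the fp8 GEMM actually runs, instead of the silent
# dequantize -> dense matmul fallback. Assumes the fast path dispatches to
# torch._scaled_mm (shown as "aten::_scaled_mm" in the profile).
import torch
from torch.profiler import ProfilerActivity, profile

def hits_fp8_mm(model: torch.nn.Module, x: torch.Tensor) -> bool:
    with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
    ) as prof:
        model(x)
    return any("_scaled_mm" in evt.name for evt in prof.events())

# In a test: `assert hits_fp8_mm(quantized_model, example_input)`, so falling
# back to dequantize + bf16 mm fails the test instead of just running slower.
```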

@drisspg (Contributor) commented Jun 10, 2025

> ...dense op fallback by default or will that break a lot of things?

I think there is a flag for specifying this @jerryzh168

```diff
 isinstance(weight_tensor, AffineQuantizedTensor)
 and isinstance(weight_tensor._layout, Float8Layout)
-and weight_tensor.tensor_impl.dtype in [torch.float8_e4m3fn, torch.float8_e5m2]
+and _is_float8_type(weight_tensor.tensor_impl.dtype)
```
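For context on the check above: the MI300x fp8 formats are the unsigned-zero "fnuz" dtypes, which an allowlist of the two OCP dtypes never matches, while a predicate over all float8 dtypes does. A minimal sketch of the distinction (not the torchao `_is_float8_type` implementation itself):

```python
import torch

# OCP fp8 dtypes (used on NVIDIA GPUs):
ocp_fp8 = {torch.float8_e4m3fn, torch.float8_e5m2}
# Unsigned-zero ("fnuz") fp8 dtypes (used on ROCm / MI300x):
fnuz_fp8 = {torch.float8_e4m3fnuz, torch.float8_e5m2fnuz}

def is_float8(dtype: torch.dtype) -> bool:
    # Accept any float8 variant so ROCm weights also take the fp8 mm path.
    return dtype in (ocp_fp8 | fnuz_fp8)

assert torch.float8_e4m3fnuz not in ocp_fp8  # the old allowlist misses it
assert is_float8(torch.float8_e4m3fnuz)      # a float8 predicate accepts it
```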
@jerryzh168 (Contributor) commented Jun 12, 2025

So previously it was using the fallback? We should probably have a way to check that the kernel is called, or just remove the fallback.

@jerryzh168 (Contributor) commented:

We can try removing the fallback in a PR, I think; it might be OK.

@jcaip merged commit aec0821 into main Jun 12, 2025
17 of 20 checks passed
@jerryzh168 (Contributor) commented:

> ...dense op fallback by default or will that break a lot of things?
>
> I think there is a flag for specifying this @jerryzh168

The fallback is still the default behavior; there is a flag for specific kernel choice as well, if people want to make sure they are testing a specific kernel path.

liangel-02 pushed a commit that referenced this pull request Aug 25, 2025

Labels

ciflow/rocm, CLA Signed, module: rocm, topic: bug fix
