Introduce float 8 types, FLOATE4M3, FLOATE5M2 #4805
Conversation
Signed-off-by: xadupre <[email protected]>
Are these "new" types common enough to be added here, or may they be added internally in ORT?
Co-authored-by: Andrew Fitzgibbon <[email protected]> Signed-off-by: Xavier Dupré <[email protected]>
@xadupre @yuanyao-nv @liqunfu @guoyu-wang @edgchen1 @linkerzhang will the
I think so, once we update ONNX Runtime to consume a version of ONNX with this change. Thanks for the heads-up.
### Description

The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, and FloatE5M2FNUZ as described in PR onnx/onnx#4805. It uses the CUDA API to cast float/half to float 8 when CUDA >= 11.8, and a custom implementation when CUDA < 11.8.

* It implements Cast, QuantizeLinear, and DequantizeLinear for all four types on CPU, and only for FloatE4M3FN and FloatE5M2 on CUDA.
* It extends the supported types for the control-flow operators If, Loop, and Scan, as well as Shape, Reshape, and Identity.
* It implements Equal(19).
* The Cast, QuantizeLinear, and DequantizeLinear operators now support a `saturate` attribute, valid only for float 8 types and true by default. When true, any out-of-range value is converted to the maximum float 8 value; when false, it becomes infinite.
* QuantizeLinear and DequantizeLinear now support multiple scales on CUDA (and ROCm by extension): the scale may be a 1-D tensor with one scale per channel.

### Motivation and Context

Supports the latest onnx version. Fixes [AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)

---------

Co-authored-by: Xavier Dupre <[email protected]@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
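The `saturate` behavior described above can be sketched in a few lines. This is an illustrative model only, not the ORT implementation; it uses FLOAT8E5M2, whose largest finite magnitude is 57344 (1.75 * 2**15), and models only the range handling, not the rounding of in-range values.

```python
import math

# Largest finite FLOAT8E5M2 magnitude (1.75 * 2**15 = 57344).
F8E5M2_MAX = 57344.0

def clamp_e5m2(x: float, saturate: bool = True) -> float:
    """Model of the `saturate` attribute for a float 8 target type.

    saturate=True: out-of-range values clamp to the largest finite
    float 8 value.  saturate=False: they become infinite instead.
    In-range values pass through unchanged (rounding not modeled).
    """
    if abs(x) <= F8E5M2_MAX:
        return x
    limit = F8E5M2_MAX if saturate else math.inf
    return math.copysign(limit, x)
```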
Description
Introduces four new types for quantization / computation to speed up deep learning models on GPU.
First pair, for NVIDIA: FloatE4M3FN and FloatE5M2.
Second pair, for GraphCore: FloatE4M3FNUZ and FloatE5M2FNUZ.
The suffix FN means the type has no infinities; UZ means the negative-zero bit pattern is used to represent NaN.
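As a concrete illustration of the FN convention, here is a minimal decoder for the FLOAT8E4M3FN layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits. These layout parameters follow the standard E4M3 convention and are not spelled out in the text above; the code is a sketch, not the reference implementation.

```python
import math

def decode_e4m3fn(byte: int) -> float:
    """Decode one FLOAT8E4M3FN byte to a Python float.

    FN means there are no infinities: the all-ones pattern S.1111.111
    is the single NaN, so S.1111.110 is the largest finite value (448).
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:     # NaN; no infinity exists
        return math.nan
    if exp == 0:                         # subnormal: 2**-6 * mant/8
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
```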
The PR modifies the operators Cast, CastLike, QuantizeLinear, and DequantizeLinear to make them support the four new types. It adds functions to cast from/to float32.
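One way to picture the float32-to-float 8 cast, sketched here for FLOAT8E5M2: since E5M2 shares IEEE float16's exponent width and bias (5 bits, bias 15), a byte can be obtained by taking the float16 bit pattern and rounding its 10-bit mantissa to 2 bits with round-to-nearest-even. This is an illustrative sketch under that observation, not the PR's implementation; overflow and NaN handling are omitted.

```python
import struct

def encode_e5m2(x: float) -> int:
    """Sketch: encode a float as a FLOAT8E5M2 byte.

    Pack x as IEEE float16 (struct format "e"), then round the 10-bit
    mantissa to 2 bits, round-to-nearest-even, by adding half an ulp
    (0x7F) plus the tie-breaking bit before dropping the low 8 bits.
    Overflow and NaN handling are omitted for brevity.
    """
    (h,) = struct.unpack("<H", struct.pack("<e", x))
    rounded = h + 0x7F + ((h >> 8) & 1)
    return (rounded >> 8) & 0xFF
```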
Motivation and Context
The latest hardware from NVIDIA, Arm, Intel, and GraphCore introduces float 8 for faster computation.
Other related papers: