Introduce float 8 types, FLOATE4M3, FLOATE5M2 #4805
Conversation
Signed-off-by: xadupre <[email protected]>
Are these "new" types common enough to be added here, or may they be added internally in ORT?
Co-authored-by: Andrew Fitzgibbon <[email protected]> Signed-off-by: Xavier Dupré <[email protected]>
@xadupre @yuanyao-nv @liqunfu @guoyu-wang @edgchen1 @linkerzhang will the
I think so, once we update ONNX Runtime to consume a version of ONNX with this change. Thanks for the heads-up.
### Description

The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, and FloatE5M2FNUZ as described in PR onnx/onnx#4805. It uses the CUDA API to cast float/half to float 8 when CUDA >= 11.8, and a custom implementation when CUDA < 11.8.

* It implements Cast, QuantizeLinear, and DequantizeLinear for all four types on CPU, and only for FloatE4M3FN and FloatE5M2 on CUDA.
* It extends the supported types for the control-flow operators If, Loop, and Scan, as well as Shape, Reshape, and Identity.
* It implements Equal(19).
* The Cast, QuantizeLinear, and DequantizeLinear operators now support a `saturate` attribute, valid only for float 8 types and true by default. When true, any out-of-range value is converted to the maximum float 8 value; when false, it becomes infinite.
* QuantizeLinear and DequantizeLinear now support multiple scales on CUDA (and ROCm by extension): the scale may be a 1-D tensor with one scale per channel.

### Motivation and Context

Supports the latest onnx version. Fixes [AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)

---------

Co-authored-by: Xavier Dupre <[email protected]@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
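The `saturate` behavior described above can be sketched in a few lines. This is an illustrative model only, not the ORT implementation; it uses FLOAT8E5M2, whose largest finite magnitude is 57344 (1.75 * 2**15), and models only the range handling, not the rounding of in-range values.

```python
import math

# Largest finite FLOAT8E5M2 magnitude (1.75 * 2**15 = 57344).
F8E5M2_MAX = 57344.0

def clamp_e5m2(x: float, saturate: bool = True) -> float:
    """Model of the `saturate` attribute for a float 8 target type.

    saturate=True: out-of-range values clamp to the largest finite
    float 8 value.  saturate=False: they become infinite instead.
    In-range values pass through unchanged (rounding not modeled).
    """
    if abs(x) <= F8E5M2_MAX:
        return x
    limit = F8E5M2_MAX if saturate else math.inf
    return math.copysign(limit, x)
```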
Description
Introduces four new types for quantization / computation to speed up deep learning models on GPU.
First pair, for NVIDIA: FloatE4M3FN and FloatE5M2.
Second pair, for GraphCore: FloatE4M3FNUZ and FloatE5M2FNUZ.
The suffix FN means the type has no infinities; UZ means the negative-zero bit pattern is used to represent NaN.
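As a concrete illustration of the FN convention, here is a minimal decoder for the FLOAT8E4M3FN layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits. These layout parameters follow the standard E4M3 convention and are not spelled out in the text above; the code is a sketch, not the reference implementation.

```python
import math

def decode_e4m3fn(byte: int) -> float:
    """Decode one FLOAT8E4M3FN byte to a Python float.

    FN means there are no infinities: the all-ones pattern S.1111.111
    is the single NaN, so S.1111.110 is the largest finite value (448).
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:     # NaN; no infinity exists
        return math.nan
    if exp == 0:                         # subnormal: 2**-6 * mant/8
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
```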
The PR modifies the operators Cast, CastLike, QuantizeLinear, and DequantizeLinear to make them support the four new types. It adds functions to cast from/to float32.
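One way to picture the float32-to-float 8 cast, sketched here for FLOAT8E5M2: since E5M2 shares IEEE float16's exponent width and bias (5 bits, bias 15), a byte can be obtained by taking the float16 bit pattern and rounding its 10-bit mantissa to 2 bits with round-to-nearest-even. This is an illustrative sketch under that observation, not the PR's implementation; overflow and NaN handling are omitted.

```python
import struct

def encode_e5m2(x: float) -> int:
    """Sketch: encode a float as a FLOAT8E5M2 byte.

    Pack x as IEEE float16 (struct format "e"), then round the 10-bit
    mantissa to 2 bits, round-to-nearest-even, by adding half an ulp
    (0x7F) plus the tie-breaking bit before dropping the low 8 bits.
    Overflow and NaN handling are omitted for brevity.
    """
    (h,) = struct.unpack("<H", struct.pack("<e", x))
    rounded = h + 0x7F + ((h >> 8) & 1)
    return (rounded >> 8) & 0xFF
```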
Motivation and Context
The latest hardware from NVIDIA, Arm, Intel, and GraphCore introduces float 8 for faster computation.
Other related papers: