
Conversation

@xadupre (Contributor) commented Jan 25, 2023

Description

This PR introduces four new 8-bit floating-point types for quantization and computation, to speed up deep learning models on GPU.

First pair, for NVIDIA:

  • FLOAT8E4M3FN: 8-bit float with 4 bits for the exponent and 3 for the mantissa, usually used for the coefficients; supports NaN values but no infinities
  • FLOAT8E5M2: 8-bit float with 5 bits for the exponent and 2 for the mantissa, usually used for the gradients; supports NaN values and infinities

Second pair, for GraphCore:

  • FLOAT8E4M3FNUZ: 8-bit float with 4 bits for the exponent and 3 for the mantissa, usually used for the coefficients; supports NaN values only (no infinities)
  • FLOAT8E5M2FNUZ: 8-bit float with 5 bits for the exponent and 2 for the mantissa, usually used for the gradients; supports NaN values only (no infinities)

The suffix FN means finite values plus NaN (no infinities); UZ means there is no negative zero, and that bit pattern is used to represent NaN instead.
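The bit layout described above can be illustrated with a small decoder for the E4M3FN format. This is a sketch based on the description in this PR (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, and only the all-ones patterns 0x7F/0xFF encoding NaN), not the reference implementation:

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one FLOAT8E4M3FN byte: 1 sign bit, 4 exponent bits (bias 7),
    3 mantissa bits. No infinities; 0x7F and 0xFF are the only NaN codes."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return float("nan")            # the all-ones pattern is NaN
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6   # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
```

For example, 0x38 decodes to 1.0 and 0x7E to 448.0, the largest finite E4M3FN value, which is why out-of-range casts saturate to 448.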

The PR modifies the operators Cast, CastLike, QuantizeLinear, and DequantizeLinear so they support the four new types. It also adds functions to cast from/to float32.

Motivation and Context

The latest hardware from NVIDIA, Arm, Intel, and GraphCore introduces float 8 for faster computation.

Other related papers:

@xadupre xadupre requested review from a team as code owners January 25, 2023 10:56
Signed-off-by: xadupre <[email protected]>
@linkerzhang (Member) left a comment

Are these "new" types common enough to be added here, or may they be added internally in ORT?

xadupre and others added 5 commits March 31, 2023 11:31
Co-authored-by: Andrew Fitzgibbon <[email protected]>
Signed-off-by: Xavier Dupré <[email protected]>
xadupre added 2 commits April 7, 2023 08:59
@xadupre xadupre merged commit 4543c94 into onnx:main Apr 7, 2023
@lutzroeder (Member) commented
@xadupre @yuanyao-nv @liqunfu @guoyu-wang @edgchen1 @linkerzhang will the DataType changes get ported to ort.fbs to keep the two in sync?

@edgchen1 (Contributor) commented, quoting the question above:

> @xadupre @yuanyao-nv @liqunfu @guoyu-wang @edgchen1 @linkerzhang will the DataType changes get ported to ort.fbs to keep the two in sync?
I think so, once we update ONNX Runtime to consume a version of ONNX with this change. Thanks for the heads-up.
FYI @skottmckay

askhade pushed a commit to microsoft/onnxruntime that referenced this pull request May 30, 2023
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, FloatE5M2FNUZ
as described in PR onnx/onnx#4805. It uses the CUDA
API to cast float/half to float8 if CUDA>=11.8, and a custom implementation
if CUDA<11.8.

* It implements Cast, QuantizeLinear, DequantizeLinear for all types on
CPU, and only for the types FloatE4M3FN, FloatE5M2 on CUDA.
* It extends the supported types for Shape, Reshape, Identity, and the
control flow operators If, Loop, Scan.
* It implements Equal(19).
* The Cast, QuantizeLinear, DequantizeLinear operators now support a
parameter `saturate`, valid only for float 8 types and true by default.
When true, any out-of-range value is converted to the maximum finite
float 8 value; when false, it becomes infinity (or NaN for the float 8
types that have no infinities).
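The `saturate` semantics can be sketched in numpy. This is an illustration of the range handling only (it does not round values to the 8-bit grid), using the largest finite values of the two NVIDIA formats, 448 for E4M3FN and 57344 for E5M2:

```python
import numpy as np

F8_MAX = {"e4m3fn": 448.0, "e5m2": 57344.0}  # largest finite values

def handle_overflow(x, fmt="e4m3fn", saturate=True):
    """Sketch of the `saturate` attribute: clamp out-of-range values to the
    largest finite float 8 value when saturate=True; otherwise map them to
    +/-inf for E5M2, or NaN for E4M3FN (which has no infinities)."""
    x = np.asarray(x, dtype=np.float32)
    m = F8_MAX[fmt]
    if saturate:
        return np.clip(x, -m, m)
    out = x.copy()
    over = np.abs(x) > m
    out[over] = np.sign(x[over]) * np.inf if fmt == "e5m2" else np.nan
    return out
```

Saturation is the safer default for quantization, since a single overflow otherwise poisons downstream computations with inf/NaN.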
* QuantizeLinear, DequantizeLinear now support multiple scales on CUDA
(and ROCm by extension): scale is a 1-D tensor with one scale per channel.
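The per-channel variant can be sketched in numpy. This is an illustration of the broadcasting involved, not the ORT kernel; `axis` names the channel dimension and `scale` holds one entry per slice along it:

```python
import numpy as np

def dequantize_per_channel(q, scale, axis=0):
    """Per-channel DequantizeLinear sketch: scale is a 1-D tensor with one
    entry per slice along `axis`; y = q * scale, broadcast along that axis."""
    q = np.asarray(q, dtype=np.float32)
    scale = np.asarray(scale, dtype=np.float32)
    shape = [1] * q.ndim
    shape[axis] = scale.shape[0]       # e.g. (C, 1) for axis=0 on a 2-D tensor
    return q * scale.reshape(shape)

# Each row is dequantized with its own scale:
y = dequantize_per_channel([[1, 2], [3, 4]], scale=[2.0, 10.0], axis=0)
```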

### Motivation and Context
Supports latest onnx version.

Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)

---------

Co-authored-by: Xavier Dupre <[email protected]@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Scott McKay <[email protected]>

Labels

  • run release CIs (Use this label to trigger release tests in CI)
  • topic: operator (Issues related to ONNX operators)
