Fix NaN values in float8 quantization by handling them in saturate_cast #7223
base: main
When converting to float8 types, NaN values were not properly handled and would remain as NaN in the quantized tensors. This caused model validation to fail. The fix explicitly replaces NaN values with 0 before clipping in the saturate_cast function for float8 types.
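A minimal sketch of the behavior described above, using plain NumPy rather than the actual onnx code. The helper name `saturate_clip`, the `nan_to_zero` argument, and the choice of the float8 E4M3FN range (largest finite value 448) are assumptions made for illustration; the real `saturate_cast` signature may differ.

```python
import numpy as np

# Largest finite value of float8 E4M3FN; used here only as an example range.
FLOAT8_E4M3FN_MAX = 448.0


def saturate_clip(x, max_value=FLOAT8_E4M3FN_MAX, nan_to_zero=False):
    """Hypothetical stand-in for the clipping step of a saturating float8 cast.

    With nan_to_zero=False, NaN passes through the clip unchanged.
    With nan_to_zero=True, NaN is replaced by 0 before clipping,
    which is the change this PR proposes.
    """
    x = np.asarray(x, dtype=np.float32)
    if nan_to_zero:
        # Replace NaN only; infinities are left for the clip to saturate.
        x = np.where(np.isnan(x), np.float32(0), x)
    return np.clip(x, -max_value, max_value)


values = np.array([1.0, np.nan, 1000.0, -np.inf], dtype=np.float32)
kept = saturate_clip(values)                      # NaN survives the clip
zeroed = saturate_clip(values, nan_to_zero=True)  # NaN becomes 0
```

In this sketch the out-of-range and infinite inputs saturate to the range limits in both modes; only the treatment of NaN changes.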
Thanks - the current behavior seems intended: https://onnx.ai/onnx/technical/float8.html#cast Is there an example from a different framework that will turn NaNs into 0s for reference?
Why would the original value be NaN in the first place? Can that be fixed?
Blocking for now
I have fixed it by adding an argument.
I still don’t think this is the right fix. Does this behavior (NaN to 0) exist in other frameworks? Should the quantization tool itself be fixed? I am also still wondering where the NaN values came from.
If it’s due to division by zero, then the actual computation (in the call site, not in onnx) needs to be fixed.
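To illustrate what a call-site fix could look like, here is a sketch under the assumption (not confirmed anywhere in this thread) that the NaN values come from dividing an all-zero tensor by a scale of zero. The helpers `compute_scale` and `scale_tensor` are hypothetical and are not part of onnx or any quantization tool.

```python
import numpy as np

# Largest finite value of float8 E4M3FN; used here only as an example range.
FLOAT8_E4M3FN_MAX = 448.0


def compute_scale(x, max_value=FLOAT8_E4M3FN_MAX):
    """Hypothetical per-tensor scale: max(|x|) / max_value, guarded against zero."""
    max_abs = float(np.max(np.abs(x))) if x.size else 0.0
    if max_abs == 0.0:
        # An all-zero (or empty) tensor would otherwise give scale 0 and 0/0 = NaN.
        return 1.0
    return max_abs / max_value


def scale_tensor(x):
    """Divide by the guarded scale so an all-zero input never produces NaN."""
    x = np.asarray(x, dtype=np.float32)
    scale = compute_scale(x)
    return x / np.float32(scale), scale


zeros_scaled, zeros_scale = scale_tensor(np.zeros(4, dtype=np.float32))
data_scaled, data_scale = scale_tensor(np.array([2.0, -4.0], dtype=np.float32))
```

With a guard like this the values handed to the cast are already finite, so the cast itself would not need to alter its NaN handling.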
Sorry, I don't understand what you mean. I think it is difficult to determine why the NaN values occur. Other packages also have a feature that converts NaN to 0.
Out of interest, which package does this?
Motivation and Context
#7222