precise_math attribute on functions #2080
dneto0
left a comment
Thanks for taking this stab. It's getting there
wgsl/index.bs
Outdated
<tr><td><dfn noexport dfn-for="attribute">`precise_math`</dfn>
<td>*None*

Indicates that the arithmetic computations in the function need to be performed with
I have trouble with the word "precision" here, because that means "with more bits represented".
(Never mind the "precise" part of the attribute name, inherited from GLSL. It's good to reuse the GLSL word.)
Also, this should be constrained to floating point, I think.
How about:
Indicates that the floating point arithmetic computations in the function should be performed
- without [=reassociation/reassociating=] subexpressions
- while preserving infinities, NaNs, and signed zeroes
Apply this attribute when the correctness of the function is numerically sensitive, and it is acceptable to incur potential performance loss when forbidding such optimizations.
blah blah blah?
Thank you!
I took the liberty of modifying this a bit more. Let me know if it needs more fixing!
I felt that it was important to refer to the floating point evaluation section from here
dneto0
left a comment
Seems ok to me now.
The key word is "should", instead of "must"
The group should review this.
|
Metal exposes fastmath on the entire module: https://developer.apple.com/documentation/metal/mtlcompileoptions?language=objc. So this is a good idea, but it should be elevated to module-level (either by something at the global scope in the language, or as additional data to |
|
The SPIR-V registry says SignedZeroInfNanPreserve is missing before version 1.4. The earliest version of Vulkan to require SPIR-V 1.4 is Vulkan 1.2, which I thought was unavailable on most Android devices. Can we really require it? |
|
I don't think we require this
There is definitely value in having it exposed at a more granular level than the module scope:
|
WGSL meeting minutes 2021-09-07
|
|
To fill in a detail:
|
|
My reading of the current state of the debate is that we need to decide if this functionality is testable or not. I believe having it testable would make a stronger API, and thus we need to explore this path before proceeding (with this PR as it stands now). It sounds like DX12 and Metal support this "precise" mode unconditionally, and there is a chance we'll be able to test it. In Vulkan, it's more complicated. As @dneto0 noted, there is an extension. However, one has to check for the properties of this extension before using them: https://vulkan.gpuinfo.org/listpropertiesextensions.php?extension=VK_KHR_shader_float_controls&platform=all If we make this an optional feature, we'd deny access to it for users who either don't care about |
Agreed. I wasn't sure on the call yesterday, so I investigated what Vulkan does to test NoContract: The NoContract feature has been supported by SPIR-V / Vulkan from the start. The test tempts the compiler to fuse a multiply-add into one operation (FMA). This depends crucially on the fact that certain basic operations (add, subtract, multiply) are "correctly rounded" (as defined by IEEE 754, and adopted by Vulkan and WGSL). In general, catastrophic cancellation can be used to magnify errors for other undesirable cases: reassociation, distribution of multiply over addition. So I think fusing, reassociation, and distribution aspects are testable.
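A minimal sketch of how catastrophic cancellation can expose fusing and reassociation. It is written in Python on IEEE 754 binary64 values rather than in WGSL or SPIR-V, and it is not the Vulkan test itself. The helper names `mul_add_unfused` and `mul_add_fused` are hypothetical, and the fused result is emulated with exact rational arithmetic instead of a hardware FMA.

```python
from fractions import Fraction


def mul_add_unfused(a: float, b: float, c: float) -> float:
    """Two roundings: RN(RN(a * b) + c)."""
    return a * b + c


def mul_add_fused(a: float, b: float, c: float) -> float:
    """Emulated FMA: a * b + c is computed exactly, then rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# a * b is exactly 1 - 2**-60, which rounds to 1.0 in binary64.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -(a * b)  # -1.0, the negated *rounded* product

# Without fusing, the rounded product cancels exactly against c.
unfused = mul_add_unfused(a, b, c)
# With fusing, the rounding error of the product survives the cancellation.
fused = mul_add_fused(a, b, c)

# Reassociation: the same three terms, grouped two ways.
x, y, z = 2.0 ** 53, -(2.0 ** 53), 0.25
left_to_right = (x + y) + z   # x and y cancel first, z survives
reassociated = x + (y + z)    # z is absorbed into y before the cancellation

print(unfused, fused, left_to_right, reassociated)
```

A test built this way only has to compare the result against the unfused, left-to-right value; any other value indicates that the compiler contracted or regrouped the expression.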
|
In an ideal world
We already have the |
|
Metal is not exactly all or nothing. As @kainino0x pointed out in #2076 (comment), we can pick a subset of fast-math stuff. It sounds like you are suggesting to adopt the current PR but cut out everything related to Then we can have an optional feature exposing something that captures |
Yes, just disable fusing, reassociation, and distribution. With signed zeros and such not universal, and the lack of example code that would be affected by them, I am proposing scaling back to only what
I was not sure if that was kosher since it was not documented in the Metal spec.
Or you could wait until someone needs it. It's probably a failure of my imagination, but I can't think of a case where asymptotic limits would be of use in rendering.
|
The last commit here describes this semantics. I'm sure @dneto0 would want to put more technical details of what is preserved, adding examples and such, and I'm hoping we can follow-up with this. |
|
Love to see where it is going, thanks @kvark. Floating-point expansion definitely does not rely on NaNs, infinities, or signed zeros.
|
From talking with the Metal team, we haven't gotten requests to apply fastMath per function rather than per MTLLibrary. This makes intuitive sense, because the use cases that need IEEE precision are things like scientific computing, where it's likely that all the functions in the library will need to be precise. Conversely, for use cases like games, it's likely that none of the functions in the library will need to be precise. (Games do need things like the |
These things aren't API. Ideally, WebGPU / WGSL wouldn't rely on anything that isn't API in the 3 backend APIs. The API is a single boolean switch. (Anything that isn't API is unsupported, and able/willing to be removed at any point in the future.) |
|
It would be unfortunate to make fastMath a "best effort" attribute. From an author's perspective, what's the point of a precision guarantee if the guarantee isn't actually guaranteed? From an implementor's perspective, why would an implementor implement any of the feature at all if it just slows down code and doesn't actually have any expected (testable) behavior? Or, stated a different way: Let's say I want to implement this feature in a particular WebGPU implementation, and I sit down and start typing code into the computer to do it. How do I know when I'm done? Why shouldn't I consider myself to be done implementing the feature before writing a single line of code? |
|
@litherum it sounds like the desire to have this behavior testable is shared between all parties, so it's good to have this settled. The last version of the PR, which I mentioned in #2080 (comment), already makes it normative. It just doesn't spell out the exact norms affecting it, which is intended to be written at some point. So, no "best effort" any more. As for the scope of the change, I'm curious what use cases there are to consider. From a distance, it felt useful to be able to make, say, vertex shaders precise but not the fragment shaders. Or even just the computation of one specific output of a vertex shader. But I haven't used this myself, so happy to hear ISV feedback! @mrshannon could you share the intended usage of this attribute? Would you be doing it for the whole module, or potentially more granularly?
First, I am specifically talking about The first is extremely large scale terrain generation in a compute shader which requires double precision. An existing example of this is Elite Dangerous which uses real doubles on some cards and emulated doubles (which require Use in the wild: Generating the Universe in Elite Dangerous The second case is when rendering very large objects (which cannot be handled in other ways). To avoid jitter we need to perform the model to camera space transform in double precision sometimes. Therefore, again emulated doubles. But in this case the calculation is in the vertex shader and is pretty narrow in scope as it is just used for the model to world transform and furthermore is only used on a small subset of vertices (those close to the camera). Therefore it would be undesirable to require Use in the wild: 3D Engine Design for Virtual Globes
This is not true, see Generating the Universe in Elite Dangerous. What is required is not IEEE but specifically the guarantees that HLSL gives with its In general there are cases where floating point error needs to be mitigated, even in rendering, which requires controlling the order of operations. |
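To illustrate why emulated doubles depend on the order of operations, here is a minimal sketch of the classic error-free addition (Knuth's TwoSum), which extended-precision emulation is commonly built on. It is not taken from the code discussed in this thread. It runs on Python's binary64 floats; the same structure would apply to 32-bit floats in a shader. `two_sum_reassociated` is a hypothetical stand-in for what an optimizer could produce if it simplified the error term as if floats were real numbers.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Return (s, err) with s = RN(a + b) and s + err == a + b exactly."""
    s = a + b
    bb = s - a                        # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb)   # what rounding threw away
    return s, err


def two_sum_reassociated(a: float, b: float) -> tuple[float, float]:
    """Algebraically, err = (a - (s - (s - a))) + (b - (s - a)) is 0.
    A compiler that reassociates freely may fold it to a constant."""
    return a + b, 0.0


a, b = 1.0, 2.0 ** -60
s, err = two_sum(a, b)                    # the low-order bits are kept in err
s_bad, err_bad = two_sum_reassociated(a, b)  # the low-order bits are lost

# Exact check with rationals: the (s, err) pair represents a + b with no loss.
exact = Fraction(a) + Fraction(b)
print(Fraction(s) + Fraction(err) == exact)          # True
print(Fraction(s_bad) + Fraction(err_bad) == exact)  # False
```

The second result is what makes reassociation harmful here: the program still runs, but the emulated value silently degrades to single-word precision.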
FWIW the flags I pointed to can probably only be used when invoking an MSL compiler via command line, but not via newLibraryWithSource. However I found the associated clang pragmas: Of course @litherum's point that these aren't officially supported still stands. |
@munrocket I'm not sure it is working on Windows: is the top of the fractal supposed to be filled with strange bands? Also not sure you need |
|
@mrshannon yes, it shows float32 with its limited precision. It's intentional. I am starting to think that fast-math is pretty OK even for this purpose, because the Dekker multiplication algorithm becomes about 10x smaller (2 FLOP vs 17 FLOP) with a hardware fma instruction. It is implicitly inherited from fma(a,b,c) in the current WebGPU implementations in Chrome/Firefox. Also with If you are going to expose precise math in this PR, then fma(a, b, c) will become the twice-rounded expression RN(RN(a * b) + c), and you will need to use a slower algorithm. I don't know whether you could add support for hardware fma in this PR or not. But currently it is a trade-off: fast multiplication and slow summation vs. slow multiplication and fast summation.
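The 2 FLOP vs 17 FLOP comparison can be sketched as follows. This is an illustration on Python's binary64 floats, not the shader code under discussion. The function names are hypothetical, the split constant `2**27 + 1` is the one appropriate for binary64 (a 32-bit float version would use a different constant), and the fused multiply-add is emulated with exact rationals since a hardware fma is not assumed to be available here.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Stand-in for a hardware fma: exact a * b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def two_prod_fma(a: float, b: float) -> tuple[float, float]:
    """Error-free product with fma: 2 operations."""
    p = a * b
    return p, fma_emulated(a, b, -p)


def split(a: float) -> tuple[float, float]:
    """Veltkamp split into high and low halves: 4 operations."""
    t = 134217729.0 * a   # 2**27 + 1
    hi = t - (t - a)
    return hi, a - hi


def two_prod_dekker(a: float, b: float) -> tuple[float, float]:
    """Error-free product without fma: 1 + 4 + 4 + 8 = 17 operations."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
p_fma, e_fma = two_prod_fma(a, b)
p_dek, e_dek = two_prod_dekker(a, b)

# The twice-rounded form RN(RN(a * b) + c) cannot recover the error term.
twice_rounded = a * b + (-p_fma)

print((p_fma, e_fma), (p_dek, e_dek), twice_rounded)
```

Both routines return the same pair, so the choice between them is purely a cost question, which is the trade-off described above.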
|
@munrocket We just tested the We are likely to use it over this PR (even if it is merged in) as it has better performance on Metal due to not disabling all optimizations and works at the expression and not at the module level. |
|
Glad to help. The only reason someone would still need to turn off fast-math is if they detect that the host doesn't support FMA in hardware. After that fma emulation with I am removed |
|
Hey users, if you keep finding nice hacks and workarounds for this, we'll have no incentive to do anything with the spec! 🤪 |
|
Ha-ha, that was fun. It's actually a miracle that it works, because the current rounding is UB and not specified, and neither is fast-math mode. This PR still has potential. For example, if somebody figures out how to turn on correct math and fused multiply-add at the same time, mrshannon will probably use it.
WGSL meeting minutes 2021-09-14
|
|
Hey @munrocket thanks for this technique! And thank you also for a nice compact test case. We had been discussing the need for a good way to test the behaviour. Some thoughts:
So two thumbs up for this technique! |
It's a feature, not a bug. :-) |
|
Another thing about the performance cost: Yes, this prevents the compiler from rearranging code to go faster, but that's exactly what the programmer wanted. |
It works with round-to-nearest-even floating point rounding, which is usually the default, but is not specified for some reason in DX11, for example. Also, as mentioned: floating point arithmetic is not associative, and muladd should be allowed only for fma.
Usually an emulated double addition costs 20 flop, and a multiplication costs 24 flop with software FMA or 9 flop with hardware. So it's cheap. When we are using
Interesting, if we need stronger confidence, we can pass a variable there. |
|
About the performance cost, I meant the additional performance cost of using the select. Thanks for the extra info on the cost of the double precision emulation. :-)
Right, rounding mode is not specified for graphics APIs because some devices use round-to-even and some use round-to-zero (which is cheaper in hardware).
Does select do branching? It is common for GPUs to use predicated execution: they execute both paths, but selectively turn off the side effects of the path "not taken", and then use only the chosen result (see Wikipedia). This risks wasting cycles stepping through the dead code path, but saves the machine from taking a branch and destroying internal state. That's why I would hope to make the evaluation of the "other" path and the condition cheap: with predicated execution, we don't want them to waste too much extra time.
Both of these use cases are supported by putting the precise attribute on the entire module rather than the individual function - just put these two entry points and their dependent functionality in a separate module. Linking a vertex shader from one module and a fragment shader from a different module is supported. |
|
If I understand correctly, we are mostly fine with introducing
|
|
The user has a reasonable workaround, and it appears to be performant and likely stable over time. I thought this was an easy "not in V1.0" decision. |
|
I'm not happy about this workaround becoming sort of a tribal knowledge thing.
`fn compute(val: T) -> T`
So doing |
It would be wasteful in the 2nd case. Perhaps as much as 10% of vertices (depending on camera location) in any given object need emulated double vertex position. The rest can take the faster 32-bit float path as they are further from the camera. |
Either this or actual function scope (including Metal) would keep us from using the |
|
@litherum it looks like MSL supports |
|
|
WGSL meeting minutes 2021-09-28
|
revoking my own review. Let's reconsider with fresh eyes
|
I'm not sure this idea appeared above but .... what about a module-level flag that only works if a feature like "high-precision" exists? So you check if the adapter supports "high-precision". If it does you request a device with This way, if a GPU/driver can't pass the high-precision CTS tests it doesn't advertise the feature. If you don't like features bleeding into WGSL you could move the check into pipeline creation where you use the precision keywords/options in WGSL but when you go try to make a pipeline, if you didn't request the |
|
Is there any way I can get this moving again? |
* without [=Reassociation|reassociating=] subexpressions

Note: this translates to `NoContraction` decoration in SPIR-V, `precise` qualifier in HLSL,
and a subset of `"-fno-fast-math" group of compile options in MSL.
and a subset of `"-fno-fast-math" group of compile options in MSL.
Should the subset not be documented?
Closes #2077
Fixes #2076
NoContraction. MTLLibrary. precise to the variable declarations used by the affected functions.
Note: SignedZeroInfNanPreserve and other features of VK_KHR_shader_float_controls are intentionally not included.