[hipDNN integration-tests] conv-wgrad switched from dynamic per-shape tolerances to a fixed 0.2 → flaky CI failures #8030

@BrianHarrisonAMD

Summary

When the miopen-provider CI was pointed at the shared cross-provider integration suite (#7433), bf16 3D conv weight-gradient tests began failing flakily (Smoke/IntegrationGpuConvWrw3dBfp16.Correctness/14, ConvolutionWgrad_0::DW). The cause on the test side is a tolerance regression: the shared suite validates conv-wgrad with a fixed tolerance (getToleranceWrw<bf16>() = 0.2), whereas the pre-migration miopen-provider tests used a dynamic, per-shape tolerance (calculateConvWrwTolerance<...>), which for this shape was on the order of hundreds.

The fixed 0.2 is below the bf16 reduction-noise floor for deep conv-wgrad reductions, so it intermittently flags results that are within expected bf16 error. (There is also a real CK kernel precision bug that makes some kernels noisier than necessary — filed separately; see Related. This issue is specifically about the test tolerance being wrong, which should be fixed regardless.)

Concrete code

Why neither old nor new tolerance is right

For this case (dyDims=[8,1,16,16,16], range [-1,1]): reduction depth N·D·H·W = 32768, sumAbsProductBound = 32768.

  • Old dynamic calculateConvWrwTolerance: gamma·sumAbsProductBound (≈258) + output-cast (eps_bf16·32768 ≈ 256) → ~500 atol. This is a worst-case coherent bound (γₙ·Σ|xᵢ| = n²·u, and it bounds the result by Σ|products| rather than by the actual post-cancellation |DW| ≈ 100). It is so loose that it would pass almost anything, which is why dynamic tolerances were disabled.
  • New fixed 0.2 atol — too tight: it's below the per-element bf16 reduction noise for a 32768-deep reduction (legit abs error is ~eps_bf16·|partials| ≈ O(0.5–2)), and it ignores reduction depth and output magnitude.
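The arithmetic behind the old bound can be reproduced with a few lines of Python. This is only a sketch of the numbers quoted above, not the actual calculateConvWrwTolerance implementation: the function names are hypothetical, dyDims is assumed to be in logical [N, K, D, H, W] order, and gamma is taken as eps/(1 - eps) purely because that value reproduces the ≈258 figure.

```python
# bf16 keeps 8 bits of significand precision, so machine epsilon is 2**-7.
EPS_BF16 = 2.0 ** -7


def reduction_depth(dy_dims):
    """Products summed per weight element: N * D * H * W.

    ASSUMPTION: dy_dims is in logical [N, K, D, H, W] order.
    """
    n, _k, d, h, w = dy_dims
    return n * d * h * w


def old_worst_case_terms(depth, max_abs_input=1.0, eps=EPS_BF16):
    """Return (accumulation_term, output_cast_term) of the worst-case bound."""
    # Every product is bounded by max_abs_input**2, so the sum of absolute
    # products is bounded by depth * max_abs_input**2.
    sum_abs_product_bound = depth * max_abs_input * max_abs_input
    # ASSUMPTION: gamma = eps / (1 - eps); the real helper may define it differently.
    gamma = eps / (1.0 - eps)
    return gamma * sum_abs_product_bound, eps * sum_abs_product_bound


depth = reduction_depth([8, 1, 16, 16, 16])
acc, cast = old_worst_case_terms(depth)
print(depth, round(acc), round(cast), round(acc + cast))  # 32768 258 256 514
```

Both terms scale linearly with the reduction depth, which is what makes the bound grow to hundreds for this shape.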

A proper tolerance should be statistical and shape/depth-aware, e.g. scale an absolute term with ~ eps_bf16 · √(reduction_depth) · (typical product magnitude) plus a small relative term for the output cast — not a flat constant, and not the worst-case linear/coherent bound.
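One possible shape for such a tolerance is sketched below. Everything here is an assumption for illustration: the function name, the safety factor of 4, and the choice of eps_bf16 as the relative term are hypothetical and not taken from hipDNN or MIOpen. The only fixed inputs are the formula from the paragraph above and the fact that the product of two independent values drawn uniformly from [-1, 1] has RMS 1/3.

```python
import math

EPS_BF16 = 2.0 ** -7  # bf16 machine epsilon


def statistical_wgrad_tolerance(ref, depth, product_rms,
                                eps=EPS_BF16, safety=4.0, rtol=EPS_BF16):
    """Per-element tolerance: a sqrt(depth) absolute term plus a small relative term.

    ASSUMPTION: rounding errors behave like independent noise, so they add in
    quadrature (sqrt(depth)) rather than coherently (depth).
    `safety` and `rtol` are hypothetical tuning knobs.
    """
    atol = safety * eps * math.sqrt(depth) * product_rms
    return atol + rtol * abs(ref)


# Failing case from this issue: depth 32768, inputs in [-1, 1], ref weight 0.69.
tol = statistical_wgrad_tolerance(ref=0.69, depth=32768, product_rms=1.0 / 3.0)
print(tol)
```

For this shape the result lands a little under 2, between the fixed 0.2 and the old ~500 bound. The safety factor would need to be calibrated against real kernel output before use.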

Why the failure is on a small weight (and why MIOpen's own tests miss it)

DW here is 16 weights spanning ~[0.7, 135]. The bf16 reduction error is roughly uniform in absolute terms across weights, so it only trips the per-element check on the small-magnitude weight (e.g. ref 0.69 → impl 1.09; abs err 0.40 > 0.2+0.2·0.69=0.338). The relative-L2 / RMS metric MIOpen and MIOpenDriver -V use is dominated by the large weights (‖ref‖≈200), so the same error is ~0.017 ≪ threshold → it passes there. The per-element check the shared suite uses is the more correct check; the issue is purely that its tolerance value is wrong for this op/shape.
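The difference between the two metrics can be illustrated with synthetic data. The 16 reference values below are made up to span roughly the same range as DW (they are not the real weights from the failing test), and every element is given the same absolute error of 0.40. The helper names are hypothetical. The per-element check uses atol = rtol = 0.2, matching the 0.2+0.2·0.69 threshold quoted above.

```python
import math


def per_element_failures(ref, impl, atol, rtol):
    """Indices where |impl - ref| exceeds atol + rtol * |ref|."""
    return [i for i, (r, x) in enumerate(zip(ref, impl))
            if abs(x - r) > atol + rtol * abs(r)]


def relative_l2(ref, impl):
    """||impl - ref||_2 / ||ref||_2 over the whole tensor."""
    err = math.sqrt(sum((x - r) ** 2 for r, x in zip(ref, impl)))
    norm = math.sqrt(sum(r * r for r in ref))
    return err / norm


# SYNTHETIC data: one small weight plus 15 larger ones, from 0.69 up to 135.
ref = [0.69] + [9.0 * i for i in range(1, 16)]
impl = [r + 0.40 for r in ref]  # uniform absolute error of 0.40

failures = per_element_failures(ref, impl, atol=0.2, rtol=0.2)
rel = relative_l2(ref, impl)
print(failures, rel)
```

Only index 0, the small weight, fails the per-element check, while the relative-L2 value stays well under 0.01 because the norm is dominated by the large weights.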

Asks

  1. Replace the fixed getToleranceWrw (and the other fixed conv tolerances, if similarly affected) with a realistic shape/depth-aware tolerance for conv-wgrad, or restore a corrected dynamic tolerance that uses a statistical (not worst-case-coherent) bound.
  2. Coordinate the chosen tolerance with the CK kernel precision fix (Related) so the test reflects achievable accuracy once the kernel keeps split-k partials in fp32.

Environment

  • Surfaced on: Test miopenprovider (gfx94X-dcgpu) cross-provider integration check, MI300X, ROCm 7.14.
  • Test: Smoke/IntegrationGpuConvWrw3dBfp16.Correctness/14 (NDHWC, x:[8,16,16,16,16] w:[1,16,1,1,1], bf16).

Related
