[hipDNN integration-tests] conv-wgrad switched from dynamic per-shape tolerances to a fixed 0.2 → flaky CI failures #8030

## Summary

When the miopen-provider CI was pointed at the shared cross-provider integration suite (#7433), bf16 3D conv weight-gradient tests began failing **flakily** (`Smoke/IntegrationGpuConvWrw3dBfp16.Correctness/14`, `ConvolutionWgrad_0::DW`). The cause on the test side is a **tolerance regression**: the shared suite validates conv-wgrad with a **fixed** tolerance (`getToleranceWrw<bf16>() = 0.2`), whereas the pre-migration miopen-provider tests used a **dynamic, per-shape** tolerance (`calculateConvWrwTolerance<...>`), which for this shape was on the order of **hundreds**.

The fixed 0.2 is below the bf16 reduction-noise floor for deep conv-wgrad reductions, so it intermittently flags results that are within expected bf16 error. (There is also a real CK kernel precision bug that makes some kernels noisier than necessary — filed separately; see Related. This issue is specifically about the **test tolerance** being wrong, which should be fixed regardless.)

## Concrete code

- **New (shared) test** — `dnn-providers/integration-tests/src/integration_tests/conv/IntegrationGpuConvBackwardWeights.cpp:95`:
  ```cpp
  this->registerValidator(outputs.dw, this->getTolerance(graphObj, outputs.dw));
  ```
  → resolves to `getToleranceWrw<bfloat16>()` = **0.2** atol (`projects/hipdnn/test_sdk/include/hipdnn_test_sdk/utilities/TestTolerances.hpp:228`), with rtol 0.2. The check is per-element: an element fails when `|impl-ref| > atol + rtol*|ref|` (`CpuFpReferenceValidation.hpp`).
- **Old (pre-migration) miopen-provider test** — recover at `dddc256cb50~1:dnn-providers/miopen-provider/integration_tests/IntegrationGpuConvBackwardWeights.cpp` (deleted by #7433):
  ```cpp
  auto tolerance = calculateConvWrwTolerance<bfloat16, bfloat16, float>(minVal, maxVal, minVal, maxVal, testCase.yDims);
  this->registerValidator(dwTensorAttr, tolerance, 0.01f);
  ```
- **Migration timeline**: shared test created with the fixed tolerance in #6317 (2026-04-13); #7433 (2026-06-01) deleted the miopen-provider copy and pointed CI at the shared suite → the fixed tolerance started being applied to MIOpen, surfacing the failures. (#7658's NaN-sentinel output prefill is not the cause for conv-wgrad — both versions already sentinel-fill.)

## Why neither old nor new tolerance is right

For this case (`dyDims=[8,1,16,16,16]`, range [-1,1]): reduction depth `N·D·H·W = 32768`, `sumAbsProductBound = 32768`.
- **Old dynamic** `calculateConvWrwTolerance` ≈ `gamma·sumAbsProductBound (≈258) + output-cast (eps_bf16·32768 ≈ 256)` ≈ **~500** atol. This is a worst-case *coherent* bound: it assumes every rounding error accumulates with the same sign (`γₙ·Σ|xᵢ| ≈ n²·u` when every `|xᵢ|` is near 1), and it bounds the result by `Σ|products|` rather than by the actual cancelled `|DW|≈100`. It is so loose that it would pass almost anything, which is why dynamic tolerances were disabled.
- **New fixed** `0.2` atol is too tight: it is below the per-element bf16 reduction noise for a 32768-deep reduction (legitimate absolute error is ~`eps_bf16·|partials| ≈ O(0.5–2)`), and it ignores both reduction depth and output magnitude.

A proper tolerance should be **statistical and shape/depth-aware**, e.g. scale an absolute term with `~ eps_bf16 · √(reduction_depth) · (typical product magnitude)` plus a small relative term for the output cast — not a flat constant, and not the worst-case linear/coherent bound.

## Why the failure is on a *small* weight (and why MIOpen's own tests miss it)

`DW` here is 16 weights spanning ~[0.7, 135]. The bf16 reduction error is roughly uniform in *absolute* terms across weights, so it only trips the per-element check on the *small-magnitude* weight (e.g. ref 0.69 → impl 1.09; abs err 0.40 > 0.2+0.2·0.69=0.338). The relative-L2 / RMS metric MIOpen and `MIOpenDriver -V` use is dominated by the large weights (‖ref‖≈200), so the same error is ~0.017 ≪ threshold → it passes there. The per-element check the shared suite uses is the *more correct* check; the issue is purely that its tolerance value is wrong for this op/shape.

## Asks

1. Replace the constant `getToleranceWrw` value (and the other constant conv tolerances, if similarly affected) with a **realistic shape/depth-aware tolerance** for conv-wgrad, or restore a *corrected* dynamic tolerance that uses a statistical (not worst-case-coherent) bound.
2. Coordinate the chosen tolerance with the CK kernel precision fix (Related) so the test reflects achievable accuracy once the kernel keeps split-k partials in fp32.

## Environment
- Surfaced on: `Test miopenprovider (gfx94X-dcgpu)` cross-provider integration check, MI300X, ROCm 7.14.
- Test: `Smoke/IntegrationGpuConvWrw3dBfp16.Correctness/14` (NDHWC, x:[8,16,16,16,16] w:[1,16,1,1,1], bf16).

## Related
- CK kernel precision bug (bf16 split-k partials) that makes some kernels exceed even a reasonable tolerance: #8029

