fix(stirrup_agent): reasoning fallback + surface tool-arg validation errors#1397

Open: agronskiy wants to merge 4 commits into main from agronskiy/fix/dsv4-reasoning-field-fallback

@agronskiy agronskiy commented May 22, 2026

Summary

Two complementary fixes for DSv4-Pro (and any vLLM ≥ 0.16 model) running through the Stirrup agent.

1) Reasoning-field fallback

DynamicMaxTokensChatCompletionsClient.__call__ in responses_api_agents/stirrup_agent/nemo_client.py:299 only reads msg.reasoning_content. vLLM ≥ 0.16.0 (DeepSeek-V4's --reasoning-parser deepseek_v4) emits the field as msg.reasoning per the Responses-API convention. Without the fallback, reasoning is silently dropped from the AssistantMessage and never threaded back into the next-turn request, so the agent forgets its plan every turn and runs until it hits max_turns.

Fix: add an elif branch that consults msg.reasoning when reasoning_content is absent.
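
A minimal sketch of the fallback order, assuming the message exposes both fields as plain attributes. `extract_reasoning` and the stand-in messages are hypothetical names for illustration, not the code in nemo_client.py:

```python
from types import SimpleNamespace


def extract_reasoning(msg):
    """Return the reasoning text of a chat-completion message, or None.

    Hypothetical helper: prefers `reasoning_content`, then falls back to
    `reasoning` (the field name this PR reports for vLLM >= 0.16).
    """
    if getattr(msg, "reasoning_content", None):
        return msg.reasoning_content
    elif getattr(msg, "reasoning", None):
        return msg.reasoning
    return None


# Stand-in messages, for illustration only.
older_style = SimpleNamespace(reasoning_content="plan A", reasoning=None)
newer_style = SimpleNamespace(reasoning="plan B")  # no reasoning_content at all
no_reasoning = SimpleNamespace(content="hi")

first = extract_reasoning(older_style)   # first branch wins
second = extract_reasoning(newer_style)  # fallback branch
third = extract_reasoning(no_reasoning)  # neither field present
```

Using `getattr` with a default keeps the helper safe for messages that lack either attribute entirely.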

2) Surface pydantic ValidationError in failed tool-call results

Upstream stirrup catches ValidationError in Agent.run_tool and returns the bare string "Tool arguments are not valid" as ToolResult.content, dropping the pydantic detail on the floor (e.g. paths: Input should be a valid list, input_type=str). The agent gets no signal about which field failed or what type was expected, so it retries the same broken shape forever.

Fix: install a one-shot monkey-patch on stirrup.core.agent.Agent.run_tool at module import time. On the failure path it re-runs tool.parameters.model_validate_json to capture the pydantic error, then rebuilds the ToolMessage with detailed content: field-by-field error messages plus a 500-char preview of the submitted args. The success path is untouched. Guarded by an _gym_surfacing_patched attribute to prevent re-application.
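
The guard pattern can be sketched as below. This is an illustration only: the `Agent` class is a synchronous stand-in rather than stirrup's implementation, `install_patch` is a hypothetical name, and placing the `_gym_surfacing_patched` marker on the wrapper function is an assumption about where the attribute lives.

```python
import functools


class Agent:
    """Stand-in for the real agent class; not stirrup's implementation."""

    def run_tool(self, raw_args):
        if not raw_args.startswith("{"):
            return "Tool arguments are not valid"
        return "ok"


def install_patch(cls):
    """Wrap cls.run_tool once; repeated calls are no-ops."""
    original = cls.run_tool
    if getattr(original, "_gym_surfacing_patched", False):
        return  # already wrapped, do not stack wrappers

    @functools.wraps(original)
    def wrapper(self, raw_args):
        result = original(self, raw_args)
        if result == "Tool arguments are not valid":
            # Failure path only: append a bounded preview of the arguments.
            return f"{result}; submitted args (first 500 chars): {raw_args[:500]}"
        return result  # success path untouched

    wrapper._gym_surfacing_patched = True
    cls.run_tool = wrapper


install_patch(Agent)
install_patch(Agent)  # second call leaves the single wrapper in place
```

Checking the marker before wrapping is what makes repeated imports or reloads safe: the second call sees the flag and returns without adding another layer.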

Why both matter — observed failure modes

End-to-end on DSv4-Pro GDPVal:

| Run | Outcome | Root cause |
| --- | --- | --- |
| r3 (a691…, slurm 2855690) | 0 deliverables / 762 ✗ finish / 0 ✓ finish in 2h27m | reasoning stripped at client layer (fixed by §1) |
| r5 (9683…, slurm 2857807) | 0 deliverables / 669 ✗ finish / 0 ✓ finish in 2h7m | model emits paths as string not list; bare "not valid" error hides which field; agent can't self-correct (fixed by §2) |
| r7 (e9c4…, slurm 2858587, concurrency=48) | 0 deliverables / 376 ✗ finish / max_turn=100 in 1h17m | same as r5; concurrency-independent |

cache.db inspection confirmed:

  • vLLM responses have message.reasoning populated (string) and message.reasoning_content absent — §1 is necessary.
  • Failed finish tool calls overwhelmingly emit "paths": "[]", "paths": "[\"foo.txt\"]" or "paths": "single.pdf" — string instead of list. §2 surfaces this back to the agent so it can self-correct.

Test plan

  • Repin EFB DSv4-Pro GDPVal leaf to this PR's head and resubmit (r8) with concurrency=48. Expected: non-zero ✓ finish; rollouts/histories/deliverables landing; ToolResult content on retry attempts shows the detailed pydantic field-by-field error.
  • Smoke check against an existing reasoning model that emits reasoning_content (Kimi-K2.5-Thinking, GLM-4.5) — first branch of §1's if still wins; behavior unchanged.
  • Test fixture in tests/test_nemo_client.py pins reasoning = None so the §1 elif doesn't trigger on MagicMock auto-attr access (3 tests previously failing on first push, now passing).

🤖 Generated with Claude Code

… absent

DynamicMaxTokensChatCompletionsClient.__call__ only checked
msg.reasoning_content when parsing the response. vLLM >= 0.16.0 (and
specifically DeepSeek-V4's `--reasoning-parser deepseek_v4`) emits the
field as `reasoning` per the Responses-API convention. Without the
fallback, reasoning was silently dropped from the AssistantMessage and was
never threaded back into the next-turn request — the agent forgot its
plan every turn and walked the max_turns ceiling without ever
successfully calling `finish`.

Observed end-to-end on DSv4-Pro GDPVal r3 (slurm job 2855690): across
~3 h of cluster wall-time and 1118 successful code_exec invocations,
zero successful `finish` calls and zero rollouts persisted; the
ResponseReasoningInterceptor reported reasoning_words=0 for every
single response across the run while cache.db responses showed
message.reasoning populated.

Also pin `choice.message.reasoning = None` in the `_make_response` test
fixture (mirroring the existing `reasoning_content = None` pin) so the
new `elif hasattr(msg, "reasoning")` branch is not triggered by
MagicMock auto-attr access in the three sampling/cap tests.
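
Why the pin is needed can be shown with `unittest.mock` alone. The sketch below uses hypothetical variable names and does not reproduce the real `_make_response` fixture:

```python
from unittest.mock import MagicMock

# Fixture state before this change: only reasoning_content is pinned.
half_pinned = MagicMock()
half_pinned.reasoning_content = None

# MagicMock creates any missing attribute on first access, so the
# attribute "exists" and its value is a truthy MagicMock, not a string.
has_attr = hasattr(half_pinned, "reasoning")
is_truthy = bool(half_pinned.reasoning)
is_mock = isinstance(half_pinned.reasoning, MagicMock)

# Fixture state after this change: both fields pinned to None.
fully_pinned = MagicMock()
fully_pinned.reasoning_content = None
fully_pinned.reasoning = None
pinned_value = fully_pinned.reasoning
```

With the explicit pin, a fallback that checks the value sees `None` instead of an auto-created mock.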

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
@agronskiy agronskiy force-pushed the agronskiy/fix/dsv4-reasoning-field-fallback branch from 6012b53 to acacd8f Compare May 22, 2026 18:38
… tool-call results

Upstream stirrup catches `pydantic.ValidationError` in `Agent.run_tool`
and returns the bare string "Tool arguments are not valid" as the
`ToolResult` content, dropping all the pydantic error detail on the
floor (e.g. "paths: Input should be a valid list, input_type=str").
The agent has no signal about which field failed or what type was
expected, so it retries the same broken shape forever.

Observed on DSv4-Pro GDPVal r5/r7: the model consistently emitted
`paths` as a JSON string literal ('"[]"', '"[\"foo.txt\"]"',
'"single.pdf"', etc.) instead of a JSON array. All ~660 finish-tool
calls in r5 (2h7m elapsed) failed with the same bare-string error;
zero ✓ finishes, max_turn stuck at 18 for 71 min as rollouts looped
on the same malformed shape. r7 (concurrency=48) hit the same wall
just sooner.

This patch installs a one-shot monkey-patch on
`stirrup.core.agent.Agent.run_tool` at import time (guarded by an
`_gym_surfacing_patched` attribute to prevent re-application). On the
failure path it re-runs `tool.parameters.model_validate_json` to
capture the pydantic error, then rebuilds the ToolMessage with
detailed content: field-by-field error messages plus a 500-char
preview of the submitted arguments. The success path is untouched.
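
One way the detailed content could be assembled is sketched below. It assumes error entries shaped like the dicts returned by pydantic v2's `ValidationError.errors()` (with `loc`, `msg` and `type` keys). `format_validation_errors` and the exact message layout are hypothetical, not the wording this patch emits.

```python
def format_validation_errors(errors, raw_args, preview_chars=500):
    """Build a field-by-field message from pydantic-style error dicts."""
    lines = ["Tool arguments are not valid:"]
    for err in errors:
        # `loc` is a tuple such as ("paths",) or ("items", 0, "name").
        field = ".".join(str(part) for part in err["loc"]) or "<root>"
        lines.append(f"  - {field}: {err['msg']} (type={err['type']})")
    preview = raw_args[:preview_chars]
    lines.append(f"Submitted arguments (first {preview_chars} chars): {preview}")
    return "\n".join(lines)


# Hand-written entry mimicking the failure described above.
example_errors = [
    {"loc": ("paths",), "msg": "Input should be a valid list", "type": "list_type"},
]
message = format_validation_errors(example_errors, '{"paths": "[]"}')
```

Capping the preview keeps a single malformed call from flooding the agent's context with its own arguments.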

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
@agronskiy agronskiy changed the title fix(stirrup_agent): fall back to msg.reasoning when reasoning_content absent fix(stirrup_agent): reasoning fallback + surface tool-arg validation errors May 22, 2026
agronskiy and others added 2 commits May 22, 2026 23:51
…not self._tools)

The ValidationError-surfacing wrapper iterated self._tools to find the
tool for the failed call. That list mixes plain Tool instances with
provider objects (e.g. ApptainerCodeExecToolProvider) which don't have
a .name attribute — the `t.name` access raised AttributeError on every
failed-finish path, breaking 45 rollouts on r8 before scancel.

Switch to self._active_tools.get(tool_call.name) — the same dict
upstream stirrup uses in its own run_tool. _active_tools is built
during __aenter__ via `if isinstance(tool, Tool): self._active_tools[tool.name] = tool`,
so it's the safe lookup path.
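
The difference between the two lookups can be reproduced with stand-in classes. Everything below is illustrative: `Tool`, `CodeExecToolProvider` and `find_by_scan` are hypothetical stand-ins, not stirrup's types.

```python
class Tool:
    """Stand-in for a plain tool with a name."""

    def __init__(self, name):
        self.name = name


class CodeExecToolProvider:
    """Stand-in provider object: it has no `.name` attribute."""


tools = [CodeExecToolProvider(), Tool("finish")]


def find_by_scan(tools, name):
    # Touches `.name` on every entry, including providers.
    return next((t for t in tools if t.name == name), None)


# Name-keyed dict built only from real Tool instances.
active_tools = {t.name: t for t in tools if isinstance(t, Tool)}

try:
    find_by_scan(tools, "finish")
    scan_error = None
except AttributeError as exc:
    scan_error = exc

found = active_tools.get("finish")
missing = active_tools.get("no_such_tool")  # None instead of an exception
```

The dict lookup never touches provider objects, and an unknown tool name yields `None` rather than raising.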

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
…epseek_v4 parser bug

DeepSeek-V4-Pro served by vLLM 0.20.0 (the wedu image
vllm-deepseekv4-v0200-cu130-ray-arm64.sqsh) emits non-string-typed
tool-call args as JSON-encoded strings. The model produces
`<|DSML|parameter ... string="false">[...]</|DSML|parameter>` per the
chat template, but vLLM's --tool-call-parser deepseek_v4 in 0.20.0
doesn't honor the string="false" flag and forwards the inner JSON
verbatim as a literal string. Stirrup's FinishParams rejects with
"paths: Input should be a valid list, type=list_type" and the agent
loops forever on the same broken shape (1605 ✗ finish / 0 ✓ finish
across r9's full 4h walltime).

Upstream fix landed in vLLM PR #41801 (merged 2026-05-06), but the
wedu image predates it. Until the image is rebuilt:

- responses_api_agents/stirrup_agent/finish_tool_coercing.py: new
  module with CoercingFinishParams (pydantic field_validator(mode=
  "before") on `paths` that accepts list (passthrough), JSON-encoded
  string array, or bare filename string) and COERCING_FINISH_TOOL
  wrapping stirrup's _validating_finish_executor.

- responses_api_agents/stirrup_agent/nemo_client.py: third monkey-patch
  at module-import time replaces SIMPLE_FINISH_TOOL in stirrup.tools
  .finish, stirrup.tools, and stirrup.core.agent with the coercing
  variant. Agent.__init__ defaults pick it up via the existing
  ``finish_tool if finish_tool is not None else SIMPLE_FINISH_TOOL``
  fallback. Idempotency tag on the tool object prevents double-patching.

- responses_api_agents/stirrup_agent/tests/test_finish_tool_coercing.py:
  12 cases covering the 4 known broken shapes from r5 client log
  (`"[]"`, `"[\"a.txt\"]"`, `"[\"a.txt\",\"b.pdf\"]"`, bare filename)
  plus correct-shape passthrough, required-field enforcement, dict
  rejection, and non-string item stringification.

When the wedu image is rebuilt against vLLM main >= #41801, this
coercion becomes a no-op (the first isinstance(v, list) branch is always
taken) and both this module and the monkey-patch can be removed.
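
The coercion logic could look roughly like the function below, written as the plain body that a pydantic `field_validator(mode="before")` might delegate to. `coerce_paths` is a hypothetical name, and the handling of edge cases (for example a string holding valid JSON that is not an array) is an assumption rather than the behaviour of CoercingFinishParams.

```python
import json


def coerce_paths(value):
    """Normalise a `paths` argument into a list.

    Accepts a real list, a JSON-encoded string array, or a bare filename.
    """
    if isinstance(value, list):
        return value  # correct shape: passthrough
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            return [value]  # bare filename such as "single.pdf"
        if isinstance(decoded, list):
            return decoded  # JSON-encoded array such as '["a.txt"]'
        return [value]  # valid JSON but not an array: treat as a filename
    # ValueError is what a pydantic validator reports as a validation error.
    raise ValueError(f"paths must be a list or string, got {type(value).__name__}")


empty = coerce_paths("[]")
single = coerce_paths('["a.txt"]')
double = coerce_paths('["a.txt","b.pdf"]')
bare = coerce_paths("single.pdf")
```

Once the serving side emits real arrays, only the first branch runs, which is why the shim can be dropped without behavioural change.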

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>