
agronskiy · 2026-05-22T18:25:03Z

Summary

Two complementary fixes for DSv4-Pro (and any vLLM ≥ 0.16 model) running through the Stirrup agent.

1) Reasoning-field fallback

DynamicMaxTokensChatCompletionsClient.__call__ in responses_api_agents/stirrup_agent/nemo_client.py:299 only reads msg.reasoning_content. vLLM ≥ 0.16.0 (DeepSeek-V4's --reasoning-parser deepseek_v4) emits the field as msg.reasoning per the Responses-API convention. Without the fallback, reasoning is silently dropped from the AssistantMessage and never threaded back into the next-turn request, so the agent forgets its plan every turn and runs to the max_turns ceiling.

Fix: add an elif branch that consults msg.reasoning when reasoning_content is absent.
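
A minimal sketch of the fallback order, not the actual code in nemo_client.py. The helper name `extract_reasoning` and the `SimpleNamespace` messages are hypothetical stand-ins for the client's parsing step and for the response message objects:

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_reasoning(msg: Any) -> Optional[str]:
    """Return the reasoning text carried by a chat-completion message, if any."""
    # First branch: servers that populate `reasoning_content`.
    reasoning = getattr(msg, "reasoning_content", None)
    if reasoning:
        return reasoning
    # Fallback branch: servers that populate `reasoning` instead.
    reasoning = getattr(msg, "reasoning", None)
    if reasoning:
        return reasoning
    return None


# Hypothetical message shapes for the two server behaviors.
legacy_msg = SimpleNamespace(content="done", reasoning_content="plan A")
vllm_msg = SimpleNamespace(content="done", reasoning="plan B")

legacy_reasoning = extract_reasoning(legacy_msg)
vllm_reasoning = extract_reasoning(vllm_msg)
```

Because `reasoning_content` is checked first, models that already emit it keep their current behavior; the second branch only matters when that field is missing or empty.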

2) Surface pydantic ValidationError in failed tool-call results

Upstream stirrup catches ValidationError in Agent.run_tool and returns the bare string "Tool arguments are not valid" as ToolResult.content, dropping the pydantic detail on the floor (e.g. paths: Input should be a valid list, input_type=str). The agent gets no signal about which field failed or what type was expected, so it retries the same broken shape forever.

Fix: install a one-shot monkey-patch on stirrup.core.agent.Agent.run_tool at module import time. On the failure path it re-runs tool.parameters.model_validate_json to capture the pydantic error, then rebuilds the ToolMessage with a detailed content including field-by-field error messages and a 500-char preview of the submitted args. Success path untouched. Guarded by an _gym_surfacing_patched attribute to prevent re-application.
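
The one-shot guard could look roughly like the sketch below. The `Agent` class here is a hypothetical, synchronous stand-in (the real `run_tool` signature is not shown in this PR text), and `install_surfacing_patch` is an illustrative name; only the `_gym_surfacing_patched` attribute comes from the description above:

```python
import functools

BARE_ERROR = "Tool arguments are not valid"


class Agent:
    """Hypothetical stand-in for stirrup.core.agent.Agent."""

    def run_tool(self, raw_args: str) -> str:
        # Pretend anything that is not a JSON object fails validation.
        if not raw_args.startswith("{"):
            return BARE_ERROR
        return "ok"


def install_surfacing_patch(agent_cls: type) -> bool:
    """Wrap agent_cls.run_tool once; return True only when the patch is applied."""
    if getattr(agent_cls, "_gym_surfacing_patched", False):
        return False  # already patched, do not wrap a second time
    original = agent_cls.run_tool

    @functools.wraps(original)
    def run_tool(self, raw_args: str) -> str:
        result = original(self, raw_args)
        if result == BARE_ERROR:
            # Failure path only: append a bounded preview of what was submitted.
            return f"{result}; submitted args (first 500 chars): {raw_args[:500]}"
        return result  # success path untouched

    agent_cls.run_tool = run_tool
    agent_cls._gym_surfacing_patched = True
    return True


first_install = install_surfacing_patch(Agent)
second_install = install_surfacing_patch(Agent)
```

Storing the flag on the class rather than in a module global means a second import of the patching module sees the marker and leaves the already-wrapped method alone.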

Why both matter — observed failure modes

End-to-end on DSv4-Pro GDPVal:

| Run | Outcome | Root cause |
|---|---|---|
| r3 (`a691…`, slurm 2855690) | 0 deliverables / 762 ✗ finish / 0 ✓ finish in 2h27m | reasoning stripped at client layer (fixed by §1) |
| r5 (`9683…`, slurm 2857807) | 0 deliverables / 669 ✗ finish / 0 ✓ finish in 2h7m | model emits `paths` as string not list; bare "not valid" error hides which field; agent can't self-correct (fixed by §2) |
| r7 (`e9c4…`, slurm 2858587, concurrency=48) | 0 deliverables / 376 ✗ finish / max_turn=100 in 1h17m | same as r5; concurrency-independent |

cache.db inspection confirmed:

- vLLM responses have message.reasoning populated (string) and message.reasoning_content absent, so §1 is necessary.
- Failed finish tool calls overwhelmingly emit "paths": "[]", "paths": "[\"foo.txt\"]" or "paths": "single.pdf", that is, a string instead of a list. §2 surfaces this back to the agent so it can self-correct.

Test plan

- Repin EFB DSv4-Pro GDPVal leaf to this PR's head and resubmit (r8) with concurrency=48. Expected: non-zero ✓ finish; rollouts/histories/deliverables landing; ToolResult content on retry attempts shows the detailed pydantic field-by-field error.
- Smoke check against an existing reasoning model that emits reasoning_content (Kimi-K2.5-Thinking, GLM-4.5): the first branch of §1's if still wins; behavior unchanged.
- Test fixture in tests/test_nemo_client.py pins reasoning = None so the §1 elif doesn't trigger on MagicMock auto-attr access (3 tests previously failing on first push, now passing).
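
The MagicMock pitfall behind the fixture pin can be reproduced in isolation. In this sketch `pick_reasoning` is a hypothetical stand-in for the parsing branch, and it assumes the fallback also checks the attribute's value, since `hasattr` alone stays true even after pinning to `None`:

```python
from unittest.mock import MagicMock


def pick_reasoning(msg):
    """Hypothetical parser branch: prefer reasoning_content, else reasoning."""
    if getattr(msg, "reasoning_content", None):
        return msg.reasoning_content
    elif hasattr(msg, "reasoning") and msg.reasoning:
        return msg.reasoning
    return None


# Fixture that only pins reasoning_content: `reasoning` is auto-created
# on first access as another (truthy) MagicMock.
unpinned = MagicMock()
unpinned.reasoning_content = None

# Fixture that pins both attributes, mirroring the fix described above.
pinned = MagicMock()
pinned.reasoning_content = None
pinned.reasoning = None

leaked = pick_reasoning(unpinned)  # a MagicMock object, not a string
clean = pick_reasoning(pinned)     # None
```

With the unpinned fixture the fallback branch returns a mock object where the tests expect no reasoning, which is consistent with the three failures reported on the first push.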

🤖 Generated with Claude Code

… absent

DynamicMaxTokensChatCompletionsClient.__call__ only checked msg.reasoning_content when parsing the response. vLLM >= 0.16.0 (and specifically DeepSeek-V4's `--reasoning-parser deepseek_v4`) emits the field as `reasoning` per the Responses-API convention. Without the fallback, reasoning was silently dropped from the AssistantMessage and never threaded back into the next-turn request, so the agent forgot its plan every turn and walked the max_turns ceiling without ever successfully calling `finish`.

Observed end-to-end on DSv4-Pro GDPVal r3 (slurm job 2855690): across ~3 h of cluster wall-time and 1118 successful code_exec invocations, zero successful `finish` calls and zero rollouts persisted; the ResponseReasoningInterceptor reported reasoning_words=0 for every single response across the run while cache.db responses showed message.reasoning populated.

Also pin `choice.message.reasoning = None` in the `_make_response` test fixture (mirroring the existing `reasoning_content = None` pin) so the new `elif hasattr(msg, "reasoning")` branch is not triggered by MagicMock auto-attr access in the three sampling/cap tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>

… tool-call results

Upstream stirrup catches `pydantic.ValidationError` in `Agent.run_tool` and returns the bare string "Tool arguments are not valid" as the `ToolResult` content, dropping all the pydantic error detail on the floor (e.g. "paths: Input should be a valid list, input_type=str"). The agent has no signal about which field failed or what type was expected, so it retries the same broken shape forever.

Observed on DSv4-Pro GDPVal r5/r7: the model consistently emitted `paths` as a JSON string literal ('"[]"', '"[\"foo.txt\"]"', '"single.pdf"', etc.) instead of a JSON array. All ~660 finish-tool calls in r5 (2h7m elapsed) failed with the same bare-string error; zero ✓ finishes, max_turn stuck at 18 for 71 min as rollouts looped on the same malformed shape. r7 (concurrency=48) hit the same wall just sooner.

This patch installs a one-shot monkey-patch on `stirrup.core.agent.Agent.run_tool` at import time (guarded by an `_gym_surfacing_patched` attribute to prevent re-application). On the failure path it re-runs `tool.parameters.model_validate_json` to capture the pydantic error, then rebuilds the ToolMessage with a detailed content including field-by-field error messages and a 500-char preview of the submitted arguments. Success path is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
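
As an illustration of the detailed content, here is a rough formatter. It assumes error entries shaped like the dicts returned by pydantic's `ValidationError.errors()` (keys `loc`, `msg`, `type`, `input`); the function name and the exact message layout are hypothetical, not the wording used by the patch:

```python
from typing import Any


def format_tool_arg_errors(
    errors: list[dict[str, Any]], raw_args: str, preview_chars: int = 500
) -> str:
    """Build a field-by-field error message plus a bounded preview of the args."""
    lines = ["Tool arguments are not valid:"]
    for err in errors:
        # `loc` is a tuple path to the failing field, e.g. ("paths",).
        field = ".".join(str(part) for part in err.get("loc", ())) or "<root>"
        input_type = type(err.get("input")).__name__
        lines.append(
            f"- {field}: {err.get('msg', 'invalid')} "
            f"(type={err.get('type', '?')}, input_type={input_type})"
        )
    lines.append(
        f"Submitted arguments (first {preview_chars} chars): {raw_args[:preview_chars]}"
    )
    return "\n".join(lines)


# Hand-written entry mimicking the `paths` failure described above.
example_errors = [
    {
        "type": "list_type",
        "loc": ("paths",),
        "msg": "Input should be a valid list",
        "input": "[]",
    }
]
message = format_tool_arg_errors(example_errors, '{"paths": "[]"}')
```

Capping the preview keeps a very large malformed payload from flooding the next-turn context while still showing the model the shape it submitted.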

…not self._tools)

The ValidationError-surfacing wrapper iterated self._tools to find the tool for the failed call. That list mixes plain Tool instances with provider objects (e.g. ApptainerCodeExecToolProvider) which don't have a .name attribute; the `t.name` access raised AttributeError on every failed-finish path, breaking 45 rollouts on r8 before scancel.

Switch to self._active_tools.get(tool_call.name), the same dict upstream stirrup uses in its own run_tool. _active_tools is built during __aenter__ via `if isinstance(tool, Tool): self._active_tools[tool.name] = tool`, so it's the safe lookup path.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
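
The difference between the two lookups can be shown with toy classes. `Tool` and `CodeExecToolProvider` below are hypothetical stand-ins for the stirrup types, and `build_active_tools` only imitates the `isinstance` filter quoted above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tool:
    """Hypothetical stand-in for a plain stirrup Tool."""
    name: str


class CodeExecToolProvider:
    """Hypothetical provider object: yields tools later, has no `.name`."""


def build_active_tools(tools: list[object]) -> dict[str, Tool]:
    # Only real Tool instances are indexed by name; providers are skipped.
    return {t.name: t for t in tools if isinstance(t, Tool)}


def find_tool(active_tools: dict[str, Tool], call_name: str) -> Optional[Tool]:
    # Dict lookup never touches provider objects and returns None on a miss.
    return active_tools.get(call_name)


mixed = [Tool("finish"), CodeExecToolProvider(), Tool("code_exec")]
active = build_active_tools(mixed)
found = find_tool(active, "finish")
missing = find_tool(active, "no_such_tool")
```

Scanning `mixed` with `t.name` would raise AttributeError as soon as it reaches the provider, which is the failure mode this commit message describes.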

…epseek_v4 parser bug

DeepSeek-V4-Pro served by vLLM 0.20.0 (the wedu image vllm-deepseekv4-v0200-cu130-ray-arm64.sqsh) emits non-string-typed tool-call args as JSON-encoded strings. The model produces `<｜DSML｜parameter ... string="false">[...]</｜DSML｜parameter>` per the chat template, but vLLM's --tool-call-parser deepseek_v4 in 0.20.0 doesn't honor the string="false" flag and forwards the inner JSON verbatim as a literal string. Stirrup's FinishParams rejects it with "paths: Input should be a valid list, type=list_type" and the agent loops forever on the same broken shape (1605 ✗ finish / 0 ✓ finish across r9's full 4h walltime).

Upstream fix landed in vLLM PR #41801 (merged 2026-05-06), but the wedu image predates it. Until the image is rebuilt:

- responses_api_agents/stirrup_agent/finish_tool_coercing.py: new module with CoercingFinishParams (pydantic field_validator(mode="before") on `paths` that accepts list (passthrough), JSON-encoded string array, or bare filename string) and COERCING_FINISH_TOOL wrapping stirrup's _validating_finish_executor.
- responses_api_agents/stirrup_agent/nemo_client.py: third monkey-patch at module-import time replaces SIMPLE_FINISH_TOOL in stirrup.tools.finish, stirrup.tools, and stirrup.core.agent with the coercing variant. Agent.__init__ defaults pick it up via the existing ``finish_tool if finish_tool is not None else SIMPLE_FINISH_TOOL`` fallback. Idempotency tag on the tool object prevents double-patching.
- responses_api_agents/stirrup_agent/tests/test_finish_tool_coercing.py: 12 cases covering the 4 known broken shapes from the r5 client log (`"[]"`, `"[\"a.txt\"]"`, `"[\"a.txt\",\"b.pdf\"]"`, bare filename) plus correct-shape passthrough, required-field enforcement, dict rejection, and non-string item stringification.

When the wedu image is rebuilt against vLLM main >= #41801, this coercion becomes a no-op (the first isinstance(v, list) branch is always taken) and both this module and the monkey-patch can be removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
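
The coercion rule for `paths` can be sketched as a standalone function using only the standard library. This is not the CoercingFinishParams validator itself; `coerce_paths` is a hypothetical name, and the handling of strings that start with `[` but fail to parse (treated as a bare filename) is an assumption:

```python
import json
from typing import Any


def coerce_paths(value: Any) -> list[str]:
    """Normalize the shapes seen in the logs into a list of path strings."""
    if isinstance(value, list):
        # Correct shape: pass through, stringifying any non-string items.
        return [str(item) for item in value]
    if isinstance(value, str):
        text = value.strip()
        if text.startswith("["):
            try:
                decoded = json.loads(text)
            except json.JSONDecodeError:
                decoded = None
            if isinstance(decoded, list):
                # JSON-encoded string array, e.g. '["a.txt","b.pdf"]' or '[]'.
                return [str(item) for item in decoded]
        # Anything else is treated as one bare filename.
        return [value]
    # Dicts, numbers, None, etc. are rejected rather than guessed at.
    raise ValueError(f"paths must be a list or string, got {type(value).__name__}")


passthrough = coerce_paths(["a.txt"])
empty_encoded = coerce_paths("[]")
two_encoded = coerce_paths('["a.txt","b.pdf"]')
bare_filename = coerce_paths("single.pdf")
```

In a pydantic model, a function like this would typically be the body of a `field_validator("paths", mode="before")` so that it runs ahead of the list type check.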


agronskiy force-pushed the agronskiy/fix/dsv4-reasoning-field-fallback branch from 6012b53 to acacd8f Compare May 22, 2026 18:38

agronskiy changed the title ~~fix(stirrup_agent): fall back to msg.reasoning when reasoning_content absent~~ fix(stirrup_agent): reasoning fallback + surface tool-arg validation errors May 22, 2026

agronskiy and others added 2 commits May 22, 2026 23:51


fix(stirrup_agent): reasoning fallback + surface tool-arg validation errors #1397

agronskiy wants to merge 4 commits into main from agronskiy/fix/dsv4-reasoning-field-fallback
