fix(stirrup_agent): reasoning fallback + surface tool-arg validation errors #1397
Open
agronskiy wants to merge 4 commits into
… absent
DynamicMaxTokensChatCompletionsClient.__call__ only checked msg.reasoning_content when parsing the response. vLLM >= 0.16.0 (and specifically DeepSeek-V4's `--reasoning-parser deepseek_v4`) emits the field as `reasoning` per the Responses-API convention. Without the fallback, reasoning was silently dropped from the AssistantMessage and was never threaded back into the next-turn request, so the agent forgot its plan every turn and walked the max_turns ceiling without ever successfully calling `finish`.
Observed end-to-end on DSv4-Pro GDPVal r3 (slurm job 2855690): across ~3 h of cluster wall-time and 1118 successful code_exec invocations, there were zero successful `finish` calls and zero rollouts persisted. The ResponseReasoningInterceptor reported reasoning_words=0 for every single response across the run, while cache.db responses showed message.reasoning populated.
Also pin `choice.message.reasoning = None` in the `_make_response` test fixture (mirroring the existing `reasoning_content = None` pin) so the new `elif hasattr(msg, "reasoning")` branch is not triggered by MagicMock auto-attr access in the three sampling/cap tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
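The fallback described in this commit can be sketched roughly as follows. This is an illustration only, not the code in `nemo_client.py`: `extract_reasoning` and `_text_or_none` are hypothetical names, and the string check is an assumption added here so that auto-created mock attributes are ignored (the PR itself handles that case by pinning the attribute in the test fixture).

```python
from types import SimpleNamespace
from typing import Any, Optional
from unittest.mock import MagicMock


def _text_or_none(value: Any) -> Optional[str]:
    """Return value only if it is a non-empty string, else None."""
    return value if isinstance(value, str) and value else None


def extract_reasoning(msg: Any) -> Optional[str]:
    """Hypothetical helper: read reasoning text from a response message."""
    # First branch: the older `reasoning_content` field still wins when set.
    text = _text_or_none(getattr(msg, "reasoning_content", None))
    if text is not None:
        return text
    # Fallback: the `reasoning` field that the commit says vLLM >= 0.16.0 emits.
    return _text_or_none(getattr(msg, "reasoning", None))


new_style = SimpleNamespace(content="ok", reasoning="plan: call finish")
old_style = SimpleNamespace(content="ok", reasoning_content="old plan")
mock_msg = MagicMock()  # auto-creates attributes, none of them strings
```

With these stand-in messages, `extract_reasoning(new_style)` returns the `reasoning` text, `extract_reasoning(old_style)` returns the `reasoning_content` text, and `extract_reasoning(mock_msg)` returns `None` because a `MagicMock` attribute is not a string.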
6012b53 to acacd8f
… tool-call results
Upstream stirrup catches `pydantic.ValidationError` in `Agent.run_tool`
and returns the bare string "Tool arguments are not valid" as the
`ToolResult` content, dropping all the pydantic error detail on the
floor (e.g. "paths: Input should be a valid list, input_type=str").
The agent has no signal about which field failed or what type was
expected, so it retries the same broken shape forever.
Observed on DSv4-Pro GDPVal r5/r7: the model consistently emitted
`paths` as a JSON string literal ('"[]"', '"[\"foo.txt\"]"',
'"single.pdf"', etc.) instead of a JSON array. All ~660 finish-tool
calls in r5 (2h7m elapsed) failed with the same bare-string error;
zero ✓ finishes, max_turn stuck at 18 for 71 min as rollouts looped
on the same malformed shape. r7 (concurrency=48) hit the same wall
just sooner.
This patch installs a one-shot monkey-patch on
`stirrup.core.agent.Agent.run_tool` at import time (guarded by an
`_gym_surfacing_patched` attribute to prevent re-application). On the
failure path it re-runs `tool.parameters.model_validate_json` to
capture the pydantic error, then rebuilds the ToolMessage with a
detailed content including field-by-field error messages and a 500-char
preview of the submitted arguments. Success path is untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
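
A minimal sketch of the two ideas in this commit: formatting field-by-field errors with an argument preview, and guarding a monkey-patch against re-application. It is stdlib-only and does not touch stirrup. `format_validation_errors` and `install_once` are hypothetical names, the error dicts are assumed to have the shape returned by `pydantic.ValidationError.errors()` (`loc`, `msg`, `type`), and placing the `_gym_surfacing_patched` flag on the wrapper function is an assumption about where the guard lives.

```python
from typing import Any, Callable

PREVIEW_CHARS = 500  # preview length stated in the commit message


def format_validation_errors(errors: list[dict[str, Any]], raw_args: str) -> str:
    """Build a detailed tool-result message from pydantic-style error dicts."""
    lines = ["Tool arguments are not valid:"]
    for err in errors:
        # `loc` is a tuple such as ("paths",) or ("items", 0, "name").
        field = ".".join(str(part) for part in err.get("loc", ())) or "<root>"
        lines.append(f"- {field}: {err.get('msg', '')} (type={err.get('type', '')})")
    preview = raw_args[:PREVIEW_CHARS]
    lines.append(f"Submitted arguments (first {PREVIEW_CHARS} chars): {preview}")
    return "\n".join(lines)


def install_once(
    cls: type,
    method_name: str,
    make_wrapper: Callable[[Callable[..., Any]], Callable[..., Any]],
) -> bool:
    """Wrap cls.method_name once; return False if it was already patched."""
    original = getattr(cls, method_name)
    if getattr(original, "_gym_surfacing_patched", False):
        return False  # already wrapped, so do nothing
    wrapper = make_wrapper(original)
    setattr(wrapper, "_gym_surfacing_patched", True)
    setattr(cls, method_name, wrapper)
    return True
```

Calling `install_once` a second time on the same class is a no-op, which is the property that makes applying the patch at import time safe when the module is imported more than once.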
…not self._tools)
The ValidationError-surfacing wrapper iterated self._tools to find the tool for the failed call. That list mixes plain Tool instances with provider objects (e.g. ApptainerCodeExecToolProvider) which don't have a .name attribute. The `t.name` access raised AttributeError on every failed-finish path, breaking 45 rollouts on r8 before scancel.
Switch to self._active_tools.get(tool_call.name), the same dict upstream stirrup uses in its own run_tool. _active_tools is built during __aenter__ via `if isinstance(tool, Tool): self._active_tools[tool.name] = tool`, so it's the safe lookup path.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
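
The failure and the fix can be reproduced in miniature. `Tool` and `ToolProvider` below are stand-in classes, not stirrup's, and `build_active_tools` is a hypothetical helper that mirrors the `isinstance` filter quoted in the commit message.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Tool:  # stand-in for stirrup's Tool; has a `.name`
    name: str


class ToolProvider:  # stand-in for a provider object; has no `.name`
    pass


def build_active_tools(tools: list[Any]) -> dict[str, Tool]:
    """Index only real Tool instances by name, skipping providers."""
    return {t.name: t for t in tools if isinstance(t, Tool)}


tools = [ToolProvider(), Tool("finish")]
active = build_active_tools(tools)

# Safe lookup: a missing tool yields None instead of raising.
finish_tool = active.get("finish")
missing_tool = active.get("no_such_tool")
```

Scanning `tools` directly with `t.name == "finish"` raises `AttributeError` as soon as it reaches the provider, which is the failure mode this commit describes.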
…epseek_v4 parser bug
DeepSeek-V4-Pro served by vLLM 0.20.0 (the wedu image vllm-deepseekv4-v0200-cu130-ray-arm64.sqsh) emits non-string-typed tool-call args as JSON-encoded strings. The model produces `<|DSML|parameter ... string="false">[...]</|DSML|parameter>` per the chat template, but vLLM's --tool-call-parser deepseek_v4 in 0.20.0 doesn't honor the string="false" flag and forwards the inner JSON verbatim as a literal string. Stirrup's FinishParams rejects the call with "paths: Input should be a valid list, type=list_type" and the agent loops forever on the same broken shape (1605 ✗ finish / 0 ✓ finish across r9's full 4h walltime).
Upstream fix landed in vLLM PR #41801 (merged 2026-05-06), but the wedu image predates it. Until the image is rebuilt:
- responses_api_agents/stirrup_agent/finish_tool_coercing.py: new module with CoercingFinishParams (pydantic field_validator(mode="before") on `paths` that accepts a list (passthrough), a JSON-encoded string array, or a bare filename string) and COERCING_FINISH_TOOL wrapping stirrup's _validating_finish_executor.
- responses_api_agents/stirrup_agent/nemo_client.py: third monkey-patch at module-import time replaces SIMPLE_FINISH_TOOL in stirrup.tools.finish, stirrup.tools, and stirrup.core.agent with the coercing variant. Agent.__init__ defaults pick it up via the existing ``finish_tool if finish_tool is not None else SIMPLE_FINISH_TOOL`` fallback. An idempotency tag on the tool object prevents double-patching.
- responses_api_agents/stirrup_agent/tests/test_finish_tool_coercing.py: 12 cases covering the 4 known broken shapes from the r5 client log (`"[]"`, `"[\"a.txt\"]"`, `"[\"a.txt\",\"b.pdf\"]"`, bare filename) plus correct-shape passthrough, required-field enforcement, dict rejection, and non-string item stringification.
When the wedu image is rebuilt against vLLM main >= #41801, this coercion becomes a no-op (the first isinstance(v, list) branch is always taken) and both this module and the monkey-patch can be removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Alex Gronskiy <[email protected]>
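
The coercion rule for `paths` could look roughly like the plain function below. In the PR it is described as a pydantic `field_validator(mode="before")` on `CoercingFinishParams`. `coerce_paths` is a hypothetical standalone name used here to keep the sketch stdlib-only. Treating a string that decodes to something other than a list as a bare filename, and passing other types through for the model to reject, are assumptions.

```python
import json
from typing import Any


def coerce_paths(value: Any) -> Any:
    """Normalize the known broken `paths` shapes into a list."""
    if isinstance(value, list):
        return value  # correct shape: passthrough
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            return [value]  # bare filename such as "single.pdf"
        if isinstance(decoded, list):
            return decoded  # JSON-encoded array such as '["a.txt"]'
        return [value]  # valid JSON but not an array: treat as a filename
    return value  # anything else is left for validation to reject
```

Once the parser delivers real lists, only the first branch runs, which matches the commit's note that the coercion becomes a no-op after the image is rebuilt.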
Summary
Two complementary fixes for DSv4-Pro (and any vLLM ≥ 0.16 model) running through the Stirrup agent.
1) Reasoning-field fallback
`DynamicMaxTokensChatCompletionsClient.__call__` in `responses_api_agents/stirrup_agent/nemo_client.py:299` only reads `msg.reasoning_content`. vLLM ≥ 0.16.0 (DeepSeek-V4's `--reasoning-parser deepseek_v4`) emits the field as `msg.reasoning` per the Responses-API convention. Without the fallback, reasoning is silently dropped from the AssistantMessage and never threaded back into the next-turn request, so the agent forgets its plan every turn and walks `max_turns`.
Fix: add an `elif` branch that consults `msg.reasoning` when `reasoning_content` is absent.
2) Surface pydantic ValidationError in failed tool-call results
Upstream stirrup catches `ValidationError` in `Agent.run_tool` and returns the bare string "Tool arguments are not valid" as `ToolResult.content`, dropping the pydantic detail on the floor (e.g. `paths: Input should be a valid list, input_type=str`). The agent gets no signal about which field failed or what type was expected, so it retries the same broken shape forever.
Fix: install a one-shot monkey-patch on `stirrup.core.agent.Agent.run_tool` at module import time. On the failure path it re-runs `tool.parameters.model_validate_json` to capture the pydantic error, then rebuilds the `ToolMessage` with detailed content including field-by-field error messages and a 500-char preview of the submitted args. Success path untouched. Guarded by a `_gym_surfacing_patched` attribute to prevent re-application.
Why both matter: observed failure modes
End-to-end on DSv4-Pro GDPVal:
- (a691…, slurm 2855690)
- (9683…, slurm 2857807): `paths` as string not list; bare "not valid" error hides which field; agent can't self-correct (fixed by §2)
- (e9c4…, slurm 2858587, concurrency=48)
cache.db inspection confirmed:
- `message.reasoning` populated (string) and `message.reasoning_content` absent, so §1 is necessary.
- `finish` tool calls overwhelmingly emit `"paths": "[]"`, `"paths": "[\"foo.txt\"]"` or `"paths": "single.pdf"`: a string instead of a list. §2 surfaces this back to the agent so it can self-correct.
Test plan
- `reasoning_content` (Kimi-K2.5-Thinking, GLM-4.5): first branch of §1's `if` still wins; behavior unchanged.
- `tests/test_nemo_client.py` pins `reasoning = None` so the §1 elif doesn't trigger on MagicMock auto-attr access (3 tests previously failing on first push, now passing).
🤖 Generated with Claude Code