Description
Our GitHub Actions workflow, which uses claude-code-action or opencode to run long, autonomous experiments, currently lacks resilience to transient InternalServerErrors.
Here are examples of the errors we are encountering:
Example 1 (From ClaudeCodeGithubActions, the two error annotations on the failed job):

```
The job was not acquired by Runner of type self-hosted even after multiple attempts
Internal server error. Correlation ID: 0ea4-3ad0-403a-8b4b-612aa654e317
```

Example 2 (From OpenCode, another tool facing similar API issues):

```
===== Launching run: source =====
Error: {"type":"api_error","message":"Internal server error"}
{"type":"api_error","message":"Internal server error"}
```
---

Scenario:
- The workflow initiates a long-running task (e.g., `full_experiment`) via the `claude-code-action`. This task can run for several hours. (A minimal sketch of what such a workflow might look like follows this list.)
- During execution, the action encounters a transient `InternalServerError`, likely from the underlying LLM API.
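For reference, this is roughly the shape of the workflow in question. The trigger, runner label, and long timeout mirror the details above; the action ref and its inputs are placeholders, not the action's actual input schema.

```yaml
# Hypothetical sketch of the long-running experiment workflow.
name: full_experiment

on:
  workflow_dispatch:

jobs:
  experiment:
    runs-on: self-hosted        # matches the runner type named in the annotation above
    timeout-minutes: 720        # the task can run for several hours
    steps:
      - uses: actions/checkout@v4

      - name: Run autonomous experiment
        uses: anthropics/claude-code-action@beta   # ref is illustrative
        # with: ...                                # inputs elided; see the action's documentation
```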
Actual Behavior:
- The `claude-code-action` step fails immediately.
- The entire GitHub Actions job stops and is marked as failed.
- Manually re-running the job (`Re-run failed jobs`) starts the process from a clean state, as `actions/checkout` resets the workspace.
- All progress is lost, including hours of computation and any in-memory or on-disk changes made by the agent during that run. The long experiment has to start over from the beginning. (A sketch of how an on-disk checkpoint could survive such a re-run follows this list.)
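One way the on-disk part of that progress could survive a re-run is sketched below, under the assumption that the agent can be pointed at a checkpoint directory (the `.experiment-state/` path here is hypothetical). The granular `actions/cache/restore` / `actions/cache/save` pair is used so the save step can run with `if: always()`, i.e. even when the experiment step fails; on a re-run, the restore step then repopulates the freshly checked-out workspace.

```yaml
# Sketch only: steps of the experiment job, persisting the agent's on-disk
# state across "Re-run failed jobs". ".experiment-state/" is a hypothetical
# directory the agent writes checkpoints to.
- name: Restore last checkpoint, if any
  uses: actions/cache/restore@v4
  with:
    path: .experiment-state
    key: experiment-${{ github.run_id }}-${{ github.run_attempt }}
    restore-keys: |
      experiment-${{ github.run_id }}-
      experiment-

- name: Run autonomous experiment
  uses: anthropics/claude-code-action@beta   # ref and inputs elided; illustrative only

- name: Save checkpoint even if the experiment step failed
  if: always()
  uses: actions/cache/save@v4
  with:
    path: .experiment-state
    key: experiment-${{ github.run_id }}-${{ github.run_attempt }}
```

Because cache entries are immutable per key, keying on `github.run_attempt` lets each attempt write a fresh checkpoint, while `restore-keys` picks up the previous attempt's entry on a re-run.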
Expected Behavior:
The workflow should be able to recover from temporary external errors without losing all progress. Ideally, the job should be able to:
- Retry the specific failed operation (the API call) while preserving the state of the running experiment (a possible shape of such an in-place retry is sketched after this list).
- Or, if the job must be re-run, it should resume from the last saved checkpoint, restoring the modified code and experiment progress.
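An in-place retry is only straightforward when the agent is launched from a shell command, as the OpenCode log above suggests it can be; a `uses:` step such as `claude-code-action` cannot be wrapped this way, which is why the checkpoint restore sketched earlier still matters for re-runs. Under that assumption, a bounded retry loop keeps the runner's workspace, and therefore the experiment's on-disk state, alive between attempts. The `./scripts/run_agent.sh` wrapper is hypothetical; a step-level retry action such as `nick-fields/retry` could serve a similar purpose for command-based steps.

```yaml
# Sketch only: retry transient failures without losing the workspace.
# "./scripts/run_agent.sh" is a hypothetical wrapper around whatever shell
# command launches the agent; whether the agent can resume its own session
# between attempts depends on the tool.
- name: Run autonomous experiment with bounded retries
  shell: bash
  run: |
    max_attempts=5
    for attempt in $(seq 1 "$max_attempts"); do
      echo "::group::Agent attempt $attempt/$max_attempts"
      if ./scripts/run_agent.sh; then
        echo "::endgroup::"
        exit 0                       # agent finished successfully
      fi
      echo "::endgroup::"
      echo "Attempt $attempt failed (possibly a transient api_error); backing off..."
      sleep $((60 * attempt))        # linear backoff before the next attempt
    done
    echo "All $max_attempts attempts failed" >&2
    exit 1
```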
Impact:
This issue makes our fully autonomous workflow unreliable for any long-running task. It leads to a significant waste of computational resources and time whenever a temporary external error occurs. We believe implementing a robust checkpointing and resuming mechanism is crucial for the stability of such workflows.