[Bug fix] Unable to recover from InternalServerError in ClaudeCodeGithubActions or OpenCode #390

@genga6

Description

Our GitHub Actions workflow, which uses claude-code-action or opencode to run long, autonomous experiments, currently lacks resilience to transient InternalServerErrors.

Here are examples of the errors we are encountering:
Example 1 (From ClaudeCodeGithubActions):

Annotations
2 errors
The job was not acquired by Runner of type self-hosted even after multiple attempts
Internal server error. Correlation ID: 0ea4-3ad0-403a-8b4b-612aa654e317

Example 2 (From OpenCode, another tool facing similar API issues):

===== Launching run: source =====

Error: {"type":"api_error","message":"Internal server error"}

{"type":"api_error","message":"Internal server error"}```
---

Scenario:

  1. The workflow initiates a long-running task (e.g., full_experiment) via the claude-code-action (a minimal sketch of this setup follows the list). This task can run for several hours.
  2. During the execution, the action encounters a transient InternalServerError, likely from the underlying LLM API.
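For reference, the failing setup looks roughly like the sketch below. This is illustrative only: the workflow name, prompt, and the action's input names are placeholders, not our exact configuration, and claude-code-action's real inputs may differ by version.

```yaml
# Hypothetical sketch of the failing workflow shape; values are placeholders.
name: full-experiment

on:
  workflow_dispatch:

jobs:
  experiment:
    runs-on: self-hosted
    timeout-minutes: 720           # the task can run for several hours
    steps:
      - uses: actions/checkout@v4  # a re-run starts again from this clean state
      - uses: anthropics/claude-code-action@beta
        with:
          # a single transient InternalServerError here fails the entire job
          prompt: "Run the full_experiment task to completion"
```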

Actual Behavior:

  • The claude-code-action step fails immediately.
  • The entire GitHub Actions job stops and is marked as failed.
  • Manually re-running the job (Re-run failed jobs) starts the process from a clean state, as actions/checkout resets the workspace.
  • All progress is lost, including hours of computation and any in-memory or on-disk changes made by the agent during that run. The long experiment has to start over from the beginning.

Expected Behavior:
The workflow should be able to recover from temporary external errors without losing all progress. Ideally, the job should be able to:

  • Retry the specific failed operation (the API call) while preserving the state of the running experiment.
  • Or, if the job must be re-run, it should resume from the last saved checkpoint, restoring the modified code and experiment progress (both recovery paths are sketched below).
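As a rough workaround sketch, both recovery paths could be approximated at the workflow level. Everything below is an assumption, not existing tool behavior: it presumes the agent can be instructed to write periodic progress into a hypothetical checkpoints/ directory, and it wraps a hypothetical run_experiment.sh entry point in a shell retry loop, since retrying the API call inside the action itself would need upstream support (which is what this issue requests).

```yaml
# Illustrative sketch only: checkpoints/ and run_experiment.sh are
# hypothetical names, not features of claude-code-action or opencode.
jobs:
  experiment:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      # Resume path: restore the most recent checkpoint before starting.
      - uses: actions/cache/restore@v4
        with:
          path: checkpoints/
          key: experiment-checkpoint-${{ github.run_id }}-${{ github.run_attempt }}
          restore-keys: |
            experiment-checkpoint-

      # Retry path: re-run the failed operation while keeping on-disk state.
      - name: Run experiment with retries
        run: |
          for attempt in 1 2 3; do
            ./run_experiment.sh --resume-from checkpoints/ && exit 0
            echo "Attempt ${attempt} failed (possibly a transient api_error); retrying..."
            sleep $(( attempt * 60 ))
          done
          exit 1

      # Save progress even on failure so the next run can resume from it.
      - uses: actions/cache/save@v4
        if: always()
        with:
          path: checkpoints/
          key: experiment-checkpoint-${{ github.run_id }}-${{ github.run_attempt }}
```

This only protects state that the agent actually persists to disk; a proper fix inside the action (retrying transient api_error responses before aborting) would also preserve in-memory progress.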

Impact:
This issue makes our fully autonomous workflow unreliable for any long-running task. Whenever a temporary external error occurs, significant computational resources and time are wasted. We believe a robust checkpoint-and-resume mechanism is crucial for the stability of such workflows.
