Description
Our GitHub Actions workflow, which uses claude-code-action or opencode to run long, autonomous experiments, currently lacks resilience to transient InternalServerErrors.
Here are examples of the errors we are encountering:
Example 1 (From ClaudeCodeGithubActions, the two error annotations on the failed job):

```
The job was not acquired by Runner of type self-hosted even after multiple attempts
Internal server error. Correlation ID: 0ea4-3ad0-403a-8b4b-612aa654e317
```

Example 2 (From OpenCode, another tool facing similar API issues):

```
===== Launching run: source =====
Error: {"type":"api_error","message":"Internal server error"}
{"type":"api_error","message":"Internal server error"}
```
---

Scenario:
- The workflow initiates a long-running task (e.g., `full_experiment`) via the `claude-code-action`. This task can run for several hours. (A minimal sketch of what such a workflow might look like follows this list.)
- During execution, the action encounters a transient `InternalServerError`, likely from the underlying LLM API.
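For reference, this is roughly the shape of the workflow in question. The trigger, runner label, and long timeout mirror the details above; the action ref and its inputs are placeholders, not the action's actual input schema.

```yaml
# Hypothetical sketch of the long-running experiment workflow.
name: full_experiment

on:
  workflow_dispatch:

jobs:
  experiment:
    runs-on: self-hosted        # matches the runner type named in the annotation above
    timeout-minutes: 720        # the task can run for several hours
    steps:
      - uses: actions/checkout@v4

      - name: Run autonomous experiment
        uses: anthropics/claude-code-action@beta   # ref is illustrative
        # with: ...                                # inputs elided; see the action's documentation
```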
Actual Behavior:
- The `claude-code-action` step fails immediately.
- The entire GitHub Actions job stops and is marked as failed.
- Manually re-running the job (`Re-run failed jobs`) starts the process from a clean state, as `actions/checkout` resets the workspace.
- All progress is lost, including hours of computation and any in-memory or on-disk changes made by the agent during that run. The long experiment has to start over from the beginning. (A sketch of how an on-disk checkpoint could survive such a re-run follows this list.)
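One way the on-disk part of that progress could survive a re-run is sketched below, under the assumption that the agent can be pointed at a checkpoint directory (the `.experiment-state/` path here is hypothetical). The granular `actions/cache/restore` / `actions/cache/save` pair is used so the save step can run with `if: always()`, i.e. even when the experiment step fails; on a re-run, the restore step then repopulates the freshly checked-out workspace.

```yaml
# Sketch only: steps of the experiment job, persisting the agent's on-disk
# state across "Re-run failed jobs". ".experiment-state/" is a hypothetical
# directory the agent writes checkpoints to.
- name: Restore last checkpoint, if any
  uses: actions/cache/restore@v4
  with:
    path: .experiment-state
    key: experiment-${{ github.run_id }}-${{ github.run_attempt }}
    restore-keys: |
      experiment-${{ github.run_id }}-
      experiment-

- name: Run autonomous experiment
  uses: anthropics/claude-code-action@beta   # ref and inputs elided; illustrative only

- name: Save checkpoint even if the experiment step failed
  if: always()
  uses: actions/cache/save@v4
  with:
    path: .experiment-state
    key: experiment-${{ github.run_id }}-${{ github.run_attempt }}
```

Because cache entries are immutable per key, keying on `github.run_attempt` lets each attempt write a fresh checkpoint, while `restore-keys` picks up the previous attempt's entry on a re-run.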
Expected Behavior:
The workflow should be able to recover from temporary external errors without losing all progress. Ideally, the job should be able to:
- Retry the specific failed operation (the API call) while preserving the state of the running experiment (a possible shape of such an in-place retry is sketched after this list).
- Or, if the job must be re-run, it should resume from the last saved checkpoint, restoring the modified code and experiment progress.
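An in-place retry is only straightforward when the agent is launched from a shell command, as the OpenCode log above suggests it can be; a `uses:` step such as `claude-code-action` cannot be wrapped this way, which is why the checkpoint restore sketched earlier still matters for re-runs. Under that assumption, a bounded retry loop keeps the runner's workspace, and therefore the experiment's on-disk state, alive between attempts. The `./scripts/run_agent.sh` wrapper is hypothetical; a step-level retry action such as `nick-fields/retry` could serve a similar purpose for command-based steps.

```yaml
# Sketch only: retry transient failures without losing the workspace.
# "./scripts/run_agent.sh" is a hypothetical wrapper around whatever shell
# command launches the agent; whether the agent can resume its own session
# between attempts depends on the tool.
- name: Run autonomous experiment with bounded retries
  shell: bash
  run: |
    max_attempts=5
    for attempt in $(seq 1 "$max_attempts"); do
      echo "::group::Agent attempt $attempt/$max_attempts"
      if ./scripts/run_agent.sh; then
        echo "::endgroup::"
        exit 0                       # agent finished successfully
      fi
      echo "::endgroup::"
      echo "Attempt $attempt failed (possibly a transient api_error); backing off..."
      sleep $((60 * attempt))        # linear backoff before the next attempt
    done
    echo "All $max_attempts attempts failed" >&2
    exit 1
```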
Impact:
This issue makes our fully autonomous workflow unreliable for any long-running task. It leads to a significant waste of computational resources and time whenever a temporary external error occurs. We believe implementing a robust checkpointing and resuming mechanism is crucial for the stability of such workflows.