fix(session): prevent data loss with atomic writes and corrupt-file repair #3312
SessionManager.save() previously used bare open("w") which could
truncate the JSONL file if the process crashed mid-write. Now writes
to a .tmp file and atomically replaces via os.replace(), matching the
pattern already used in qq.py.
_load() now attempts _repair() before returning None, recovering
valid lines from partially-written files. 12 new tests cover atomic
save correctness, temp-file cleanup on failure, and repair of
truncated/corrupt JSONL.
cowork-with:opencode(glm-5.1)
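The write-then-replace pattern described in the commit message above can be sketched roughly as follows. This is not the PR's actual code: `atomic_save_jsonl` and its signature are hypothetical, and the `fsync` call is an added assumption that the commit message does not mention.

```python
import json
import os
from pathlib import Path


def atomic_save_jsonl(path: Path, records: list) -> None:
    """Write records to `path` as JSONL without ever truncating the old file.

    Hypothetical helper for illustration; the real SessionManager.save()
    may differ in naming and detail.
    """
    # Keep the temp file next to the target, e.g. session.jsonl.tmp.
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    try:
        with open(tmp_path, "w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()
            os.fsync(f.fileno())  # assumption: durability step, not stated in the PR
        # Swap the finished temp file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Also covers KeyboardInterrupt and SystemExit: remove the partial
        # temp file and leave the previous version of `path` untouched.
        tmp_path.unlink(missing_ok=True)
        raise
```

If serialization fails partway through, the target file still holds the last successfully saved content and no stray .tmp file is left behind.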
Re-bin left a comment
This is the right kind of fix.
The intent is narrow, the boundary is clear, and the atomic-write change is the right way to prevent truncation-driven session loss.
During review I found one missing read-side path: the PR originally repaired corrupt files in SessionManager._load(), but read_session_file() and list_sessions() still treated the same corrupt JSONL as unreadable, which meant WebUI / HTTP views could still behave as if the session had disappeared. I patched that gap, added focused regression coverage in tests/agent/test_session_atomic.py, and pushed the fix back to this PR branch (b490fd87).
I pulled in the latest origin/main and reran both targeted and full tests locally:
python -m pytest tests/agent/test_session_atomic.py tests/agent/test_session_delete.py tests/agent/test_session_manager_history.py tests/agent/test_unified_session.py -q -> 53 passed
python -m pytest -q -> 2139 passed
From my side, this is now clean and ready to merge.
Summary
Session saves are now atomic and corrupt session files are automatically repaired on load. A crash during save (OOM kill, disk full, SIGKILL) no longer destroys conversation history.
What changed:
- save() writes to a .jsonl.tmp file, then atomically replaces the target via os.replace(), matching the pattern already used in qq.py. On failure the temp file is cleaned up, leaving the previous version intact.
- _repair() recovers valid JSONL lines from partially-written files by skipping malformed lines instead of discarding the entire session.
- _load() falls back to _repair() before returning None, so a truncated file yields a session with surviving messages rather than a blank conversation.
Why os.replace over open("w"):
The old open(path, "w") truncates the file before writing begins. If the process dies between truncation and the final write(), the file is left empty or partial. os.replace() instead renames the fully written temp file over the target, which is an atomic operation on POSIX: a reader sees either the complete old file or the complete new one, never a partial write. The same pattern is already in use at qq.py:675.
Design decisions:
- os.replace).
- The BaseException catch in save() covers KeyboardInterrupt and SystemExit, not just Exception.
- _repair() tolerates missing metadata and bad timestamps, rebuilding defaults for anything it cannot parse.
Tests:
tests/agent/test_session_atomic.py covers atomic save correctness, temp-file cleanup on failure, and repair of truncated, corrupt, and mixed-validity JSONL files (12 tests). All 45 session-related tests pass.
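As an illustration of the line-skipping recovery described above, here is a minimal sketch. It assumes one JSON object per line; `repair_jsonl` is a hypothetical name, and unlike the PR's _repair() it does not rebuild metadata or timestamp defaults.

```python
import json
from pathlib import Path


def repair_jsonl(path: Path):
    """Return (valid_records, skipped_count) from a possibly corrupt JSONL file.

    Hypothetical helper for illustration only.
    """
    records = []
    skipped = 0
    # errors="replace" keeps reading even if a crash left invalid UTF-8 bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank lines are neither data nor corruption
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                skipped += 1  # truncated or garbled line: drop it, keep going
                continue
            if isinstance(obj, dict):
                records.append(obj)
            else:
                skipped += 1  # valid JSON but not an object
    return records, skipped
```

A caller in the position of _load() could fall back to this after a normal parse fails and return None only when no records survive.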