Thanks to visit codestin.com
Credit goes to github.com

Skip to content

fix(datastore): handle commit failures gracefully instead of panicking#572

Open
TimeToBuildBob wants to merge 4 commits intoActivityWatch:masterfrom
TimeToBuildBob:fix/graceful-disk-full-handling
Open

fix(datastore): handle commit failures gracefully instead of panicking#572
TimeToBuildBob wants to merge 4 commits intoActivityWatch:masterfrom
TimeToBuildBob:fix/graceful-disk-full-handling

Conversation

@TimeToBuildBob
Copy link
Contributor

@TimeToBuildBob TimeToBuildBob commented Feb 27, 2026

Summary

  • Replace panic! on transaction commit failure with error! + continue, preventing the worker thread from dying permanently
  • Add CommitFailed error variant to DatastoreError mapped to HTTP 503 (Service Unavailable)
  • The server now recovers automatically when disk space is freed — previously required a full restart

When disk is full, SQLite returns SQLITE_FULL on commit. The old code panicked, killing the worker thread. Since the HTTP layer communicates with the worker via mpsc, all subsequent requests would fail with MpscError (HTTP 500) until server restart.

Now the transaction is rolled back (via Drop), the error is logged, and the worker continues to the next loop iteration. Watchers naturally re-send events via heartbeat, so data loss is minimal.

Test plan

  • cargo check — compiles cleanly (only pre-existing warnings)
  • cargo test --lib — all 37 tests pass
  • Manual test: fill disk, trigger heartbeat, verify server returns 503 and recovers when space is freed

Fixes #256


Important

Fixes transaction commit failure handling in worker.rs to prevent worker thread panic and adds CommitFailed error variant mapped to HTTP 503.

  • Behavior:
    • Replace panic! with error! and continue on transaction commit failure in work_loop() in worker.rs, preventing worker thread termination.
    • Add CommitFailed error variant to DatastoreError in lib.rs, mapped to HTTP 503 in endpoints/util.rs.
    • Server recovers automatically when disk space is freed, no restart needed.
  • Error Handling:
    • Log error and continue loop on SQLITE_FULL in work_loop() in worker.rs.
    • Map CommitFailed to HTTP 503 in From<DatastoreError> for HttpErrorJson in endpoints/util.rs.

This description was created by Ellipsis for c79081f. You can customize this summary. It will automatically update as commits are pushed.

When a transaction commit fails (e.g. disk full / SQLITE_FULL), the
worker thread panicked, permanently breaking the datastore channel.
All subsequent requests returned MpscError (HTTP 500) until restart.

Replace the panic with error logging and continue. The rolled-back
events will be re-sent by watchers via heartbeat or retried by clients.
Add CommitFailed error variant mapped to HTTP 503 (Service Unavailable)
so clients know to back off and retry.

Fixes ActivityWatch#256
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to c79081f in 7 seconds. Click for details.
  • Reviewed 46 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

Workflow ID: wflow_bzI2nABJ4CysiO40

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@greptile-apps
Copy link

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR successfully addresses the worker thread panic issue when SQLite commit fails (e.g., SQLITE_FULL on disk full). The changes are straightforward and well-implemented.

Key improvements:

  • Worker thread now survives commit failures and continues processing subsequent requests
  • Server recovers automatically when disk space is freed (no restart required)
  • Error messages include useful debugging info (number of events lost)

Implementation notes:

  • Responses are sent to clients before commit (line 181 vs 197 in worker.rs), so clients receive success even if commit later fails - this is existing behavior, not introduced by this PR
  • Rolled-back events are re-sent by watchers via heartbeat mechanism as noted in PR description
  • The last_heartbeat cache may briefly contain stale data after a failed commit; this is handled by existing error detection in replace_last_event (line 592-595 in datastore.rs) which returns "cache/DB inconsistency" error, clears the cache, and subsequent retries succeed

About the CommitFailed variant:
Per the previous review thread, this error variant is intentionally kept as defensive API surface for potential future use, even though it's not currently returned to clients in the commit failure path.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it fixes a critical panic bug with straightforward error handling
  • The changes are minimal, focused, and well-tested (cargo check and cargo test pass). The error handling logic correctly replaces panic with logging + continue, allowing the worker thread to recover gracefully. The existing codebase already handles edge cases like cache/DB inconsistency. No new bugs introduced.
  • No files require special attention - all changes are straightforward improvements

Important Files Changed

Filename Overview
aw-datastore/src/lib.rs Added CommitFailed error variant to DatastoreError enum for handling database write failures
aw-datastore/src/worker.rs Replaced panic! with graceful error logging in two commit failure locations (legacy import and main loop); worker thread continues instead of dying
aw-server/src/endpoints/util.rs Mapped CommitFailed error to HTTP 503 Service Unavailable status

Last reviewed commit: 4c34dd4

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Uninitialized(String),
OldDbVersion(String),
// Database write failure (e.g. disk full)
CommitFailed(String),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CommitFailed variant is never constructed or returned anywhere in the codebase. In worker.rs:193-203, commit failures are logged but no error is returned to clients since responses are sent before the commit happens (line 177). Either this variant should be removed, or the architecture needs changes to actually use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The CommitFailed variant is intentionally kept as defensive API surface — if the architecture changes to propagate commit failures to clients in the future, the error type and HTTP 503 mapping are already in place. Removed the inconsistency though: the legacy import commit now uses the same graceful error handling as the main loop (fixed in 4c34dd4).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followed up in 2902485: removed DatastoreError::CommitFailed and its HTTP mapping since that error is never constructed in the current architecture.

@greptile-apps
Copy link

greptile-apps bot commented Feb 27, 2026

Additional Comments (1)

aw-datastore/src/worker.rs
Inconsistent error handling - this legacy_import commit failure still panics while the main loop (line 193) now logs and continues. Consider applying the same graceful handling here for consistency.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@codecov
Copy link

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.98%. Comparing base (656f3c9) to head (2902485).
⚠️ Report is 37 commits behind head on master.

Files with missing lines Patch % Lines
aw-datastore/src/worker.rs 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #572      +/-   ##
==========================================
- Coverage   70.81%   67.98%   -2.84%     
==========================================
  Files          51       54       +3     
  Lines        2916     3145     +229     
==========================================
+ Hits         2065     2138      +73     
- Misses        851     1007     +156     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The main work loop commit (line 193) was already handled gracefully
(error log + continue), but the legacy import commit (line 143) still
panicked on failure. This makes the error handling consistent.

Addresses review feedback from Greptile.
@TimeToBuildBob
Copy link
Contributor Author

Review Feedback Addressed

Fixed the inconsistency flagged by Greptile: the legacy import commit handler (line 143) now uses the same graceful error handling as the main work loop (line 193) — logs the error and continues instead of panicking.

Changes in 4c34dd4:

  • Legacy import commit failure: panic!error! + continue
  • Legacy import will be retried on next startup if the commit didn't persist

On CommitFailed being unused: Keeping the variant as defensive API surface. If we later want to propagate commit failures to HTTP clients, the error type and 503 mapping are ready. Removing it would save 4 lines but lose the forward-compatibility.

Comment on lines +204 to +205
// Continue instead of panicking — the rolled-back events will be
// re-sent by watchers (heartbeats) or retried by clients.
Copy link
Member

@ErikBjare ErikBjare Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will they? If no CommitFailed is propagated up, how would watchers/clients know?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right. The comment was wrong. Clients already received success responses before the commit happens (the worker responds at line 181 in work_loop, commit is at line 197+). So they have no signal that the commit failed and won't retry.

Fixed the comment in 75d5a4c to accurately describe what happens: the worker survives the transient failure, but the batch of events is permanently lost. Watchers will resume from current state on next heartbeat (creating a new event from that point), not re-send the specific lost events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point — I addressed this by removing the unused CommitFailed path entirely in 2902485.

Since commit failures in work_loop are not propagated to clients (responses are sent before commit), keeping CommitFailed + HTTP 503 mapping was misleading. The PR now keeps behavior explicit: log-and-continue in the worker, with no fake API surface implying client-visible commit-failure signaling.

I also re-reviewed the rest of the PR with this model in mind.

@ErikBjare
Copy link
Member

@greptileai review

@ErikBjare
Copy link
Member

@TimeToBuildBob Please review this carefully and address #572 (comment)

@TimeToBuildBob
Copy link
Contributor Author

Follow-up on review thread #discussion_r2871104544

Addressed in commit 2902485.

  • Removed unused DatastoreError::CommitFailed
  • Removed its HTTP 503 mapping in aw-server/src/endpoints/util.rs
  • Kept commit-failure behavior explicit as worker-local log-and-continue (no implied client-visible propagation)

Also re-reviewed the PR with this corrected model; no additional correctness issues found in the current scope.

@ErikBjare
Copy link
Member

Kept commit-failure behavior explicit as worker-local log-and-continue (no implied client-visible propagation)

This seems really bad, would mean it silently (to the client) swallows requests without successful persistence.

As you said:

the worker survives the transient failure, but the batch of events is permanently lost

This seems potentially very problematic.

@tobixen
Copy link

tobixen commented Mar 2, 2026

Crashing the total app is very systemd-friendly, but would it work out well when the server is started from QT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Worker thread crashes (all API requests timeout) after disk has been full

3 participants