wait_for API: Boolean to Result #286

Vanuan · 2025-11-29T00:19:52Z

GPU Wait API Migration: Boolean to Result Implementation

Fixes #248 | Inspired by zed-industries/zed#43070

Problem

The wait_for API provides GPU synchronization by blocking until the GPU reaches a specific sync point, preventing the CPU from destroying or reusing resources while they're still in use. When you submit work to the GPU (like rendering a frame), it continues running asynchronously. Without proper synchronization, the CPU could delete textures or reuse buffers while the GPU is still accessing them, causing memory corruption or crashes.

Current limitation: wait_for returns a plain boolean, which discards critical error information:

❌ Timeouts look the same as device loss
❌ Backend-specific failures indistinguishable
❌ All failures return false - no context for recovery
❌ Callers can't make informed decisions about resource cleanup

Solution

Introduce Result-based error handling while maintaining backward compatibility.

Migration Strategy

Dual implementation approach:

✅ Existing wait_for() → bool (preserved, delegates to new method)
✨ New wait_for_result() → Result<(), WaitError> (detailed errors)
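
As a rough illustration of the delegation, here is a minimal sketch. The `Waiter` trait and `MockGpu` type are hypothetical stand-ins, not the real device trait or any backend, and the `Debug`/`PartialEq` derives on `WaitError` are added only so the example can compare results. Whether the real `wait_for` is a provided trait method or is written out per backend is not shown in this description, so treat the default method below as an assumption.

```rust
/// Mirrors the `WaitError` enum introduced by this PR; derives added for the demo.
#[derive(Debug, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

/// Hypothetical, trimmed-down stand-in for the real device trait.
pub trait Waiter {
    type SyncPoint;

    /// New API: reports why the wait did not succeed.
    fn wait_for_result(&self, sp: &Self::SyncPoint, timeout_ms: u32) -> Result<(), WaitError>;

    /// Legacy API: delegates and collapses every error into `false`.
    fn wait_for(&self, sp: &Self::SyncPoint, timeout_ms: u32) -> bool {
        self.wait_for_result(sp, timeout_ms).is_ok()
    }
}

/// Mock backend: sync points are plain counters and `completed` is the
/// last value the fake GPU has reached.
pub struct MockGpu {
    pub completed: u64,
    pub lost: bool,
}

impl Waiter for MockGpu {
    type SyncPoint = u64;

    fn wait_for_result(&self, sp: &u64, _timeout_ms: u32) -> Result<(), WaitError> {
        if self.lost {
            Err(WaitError::DeviceLost)
        } else if *sp <= self.completed {
            Ok(())
        } else {
            Err(WaitError::Timeout)
        }
    }
}
```

With this shape, existing callers of `wait_for` keep compiling and keep seeing `false` for every failure, while new callers can match on the error.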

New Error Type

pub enum WaitError {
    Timeout,        // Operation timed out, GPU still busy
    DeviceLost,     // GPU device lost/removed (unrecoverable)
    OutOfDate,      // Resource out of date (Vulkan swapchain)
    Other(String),  // Backend-specific errors with details
}

New Trait Method

fn wait_for_result(&self, sp: &Self::SyncPoint, timeout_ms: u32) 
    -> Result<(), WaitError>;

Backend Implementations

Each backend maps its native error conditions to WaitError:

Backend	Mechanism	Success	Timeout	Device Lost
Vulkan	Timeline semaphores	`Ok(())`	`TIMEOUT`	`ERROR_DEVICE_LOST`
GLES	GL sync objects	`ALREADY_SIGNALED`/`CONDITION_SATISFIED`	`TIMEOUT_EXPIRED`	N/A
Metal	Command buffer polling	`MTLCommandBufferStatus::Completed`	Elapsed time check	NSError inspection
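
To make the GLES row concrete, here is a sketch of how the `glClientWaitSync` return codes could be translated. The function name `map_client_wait_status` is hypothetical, the constants are declared locally with their OpenGL ES 3.0 values rather than taken from any binding crate, and the handling of `WAIT_FAILED` and unknown values is an assumption, not necessarily what the GLES backend in this PR does.

```rust
/// Mirrors the `WaitError` enum introduced by this PR; derives added for the demo.
#[derive(Debug, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

// Return codes of `glClientWaitSync` (OpenGL ES 3.0), declared locally.
pub const ALREADY_SIGNALED: u32 = 0x911A;
pub const TIMEOUT_EXPIRED: u32 = 0x911B;
pub const CONDITION_SATISFIED: u32 = 0x911C;
pub const WAIT_FAILED: u32 = 0x911D;

/// Hypothetical helper: translate a native wait status into the new result type.
pub fn map_client_wait_status(status: u32) -> Result<(), WaitError> {
    match status {
        // Both values mean the sync object is signaled.
        ALREADY_SIGNALED | CONDITION_SATISFIED => Ok(()),
        // The GPU has not reached the sync point within the timeout.
        TIMEOUT_EXPIRED => Err(WaitError::Timeout),
        // Assumption: a failed wait is reported as a backend-specific error.
        WAIT_FAILED => Err(WaitError::Other(
            "glClientWaitSync returned WAIT_FAILED".to_string(),
        )),
        other => Err(WaitError::Other(format!(
            "unexpected wait status 0x{other:X}"
        ))),
    }
}
```

Keeping the translation in one small function makes each backend's mapping easy to review against the table above.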

Usage Patterns

Frame Synchronization (Frame Pacer)

Before: Blind wait, no error handling

context.wait_for(&sp, !0);
// Always cleanup, even if device lost

After: Conditional cleanup based on error

match context.wait_for_result(&sp, !0) {
    Ok(()) => { /* cleanup resources */ }
    Err(WaitError::DeviceLost) => {
        cleanup = false; // Prevent crashes on device loss
    }
    Err(_) => { cleanup = false; } // Any other failure: skip cleanup as well
}

Buffer Reuse (Buffer Belt)

Before: Can't distinguish busy from error

// false could mean "busy" or "device lost"
if gpu.wait_for(sp, 0) { reuse_buffer() }

After: Intelligent reuse decisions

match gpu.wait_for_result(sp, 0) {
    Ok(()) => true,                    // Safe to reuse
    Err(WaitError::Timeout) => false,  // Still busy, try later
    Err(e) => {                        // Unexpected error
        log::warn!("Unexpected: {:?}", e);
        false
    }
}

Texture Cleanup (EGUI)

After: Partition textures by readiness state

Ok(()) → Delete immediately
Timeout → Keep for next frame
Other errors → Keep with warning
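
The three rules above could be sketched as follows. `Pending`, `partition_textures`, and the closure-based `wait` parameter are hypothetical names for illustration; the actual blade-egui code is not shown in this description and may be organized differently.

```rust
/// Mirrors the `WaitError` enum introduced by this PR; derives added for the demo.
#[derive(Debug, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

/// Hypothetical record: a texture scheduled for deletion once the GPU
/// has passed `sync_point`.
pub struct Pending {
    pub texture_id: u32,
    pub sync_point: u64,
}

/// Splits pending textures into ids that can be deleted now and entries
/// that must be kept for a later frame. `wait` stands in for a
/// zero-timeout `wait_for_result` call.
pub fn partition_textures<F>(pending: Vec<Pending>, mut wait: F) -> (Vec<u32>, Vec<Pending>)
where
    F: FnMut(u64) -> Result<(), WaitError>,
{
    let mut ready = Vec::new();
    let mut kept = Vec::new();
    for item in pending {
        match wait(item.sync_point) {
            // GPU is done with the texture: delete immediately.
            Ok(()) => ready.push(item.texture_id),
            // Still in use: keep for the next frame.
            Err(WaitError::Timeout) => kept.push(item),
            // Anything else: keep, but leave a trace for debugging.
            Err(e) => {
                eprintln!("texture {} kept after wait error: {:?}", item.texture_id, e);
                kept.push(item);
            }
        }
    }
    (ready, kept)
}
```

Keeping textures on unexpected errors trades a possible leak for safety, which matches the "Keep with warning" rule.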

Shader Hot Reload

After: Block critical operations on device loss

match gpu.wait_for_result(sync_point, !0) {
    Ok(()) => { /* safe to reload shaders */ }
    Err(WaitError::DeviceLost) => {
        log::error!("Cannot reload: device lost");
        return false; // Prevent crash
    }
    Err(e) => { /* log, continue with caution */ }
}

Benefits

✅ Actionable Error Information: Distinguish timeouts (retry later) from device loss (unrecoverable)
✅ Safer Resource Management: Skip cleanup on device loss to prevent crashes
✅ Better Debugging: Detailed error messages from backend-specific failures
✅ Backward Compatible: Existing wait_for callers continue working unchanged
✅ Foundation for Recovery: Enables future device loss recovery strategies

Migration Path

✅ Add WaitError enum and wait_for_result to trait
✅ Implement in all backends (Vulkan, GLES, Metal)
✅ Make wait_for delegate to wait_for_result
✅ Migrate critical usage sites (frame pacer, buffer belt, texture cleanup, shader reload)
🔜 Future: Deprecate boolean API, full migration to Result-based API

Vanuan · 2025-11-29T01:11:21Z

blade-graphics/src/metal/mod.rs

+                        "Metal command buffer error".to_string(),
+                    ))
+                }
+                _ => {}


State	Meaning
`metal::MTLCommandBufferStatus::NotEnqueued`	A command buffer's initial state, which indicates its command queue isn't reserving a place for it. You can modify a command buffer in this state by encoding commands to it, or by adding a state change handler.
`metal::MTLCommandBufferStatus::Enqueued`	A command buffer's second state, which indicates its command queue is reserving a place for it. You can modify a command buffer in this state by encoding commands to it, or by adding a state change handler.
`metal::MTLCommandBufferStatus::Committed`	A command buffer's third state, which indicates the command queue is preparing to schedule the command buffer by resolving its dependencies. You can't modify a command buffer in this state.
`metal::MTLCommandBufferStatus::Scheduled`	A command buffer's fourth state, which indicates the command buffer has its resources ready and is waiting for the GPU to run its commands. You can't modify a command buffer in this state.
`metal::MTLCommandBufferStatus::Completed`	A command buffer's successful, final state, which indicates the GPU finished running the command buffer's commands without any problems.
`metal::MTLCommandBufferStatus::Error`	A command buffer's error state, which indicates the GPU encountered an error while running the command buffer's commands.

Error Code	Description
`none`	An error code that represents the absence of any problems.
`timeout`	An error code that indicates the system interrupted and terminated the command buffer before it finished running.
`pageFault`	An error code that indicates the command buffer generated a page fault the GPU can't service.
`notPermitted`	An error code that indicates a process doesn't have access to a GPU device.
`outOfMemory`	An error code that indicates the GPU device doesn't have sufficient memory to execute a command buffer.
`invalidResource`	An error code that indicates the command buffer has an invalid reference to resource.
`memoryless`	An error code that indicates the GPU ran out of one or more of its internal resources that support memoryless render pass attachments.
`deviceRemoved`	An error code that indicates a person physically removed the GPU device before the command buffer finished running.
`stackOverflow`	An error code that indicates the GPU terminated the command buffer because a kernel function of tile shader used too many stack frames.
`accessRevoked`	An error code that indicates the system has revoked the Metal device's access because it's responsible for too many timeouts or hangs.
`internal`	An error code that indicates the Metal framework has an internal problem.
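
One possible way to use these codes, instead of returning a generic "Metal command buffer error" string for every failure, is sketched below. `CommandBufferErrorCode` is a local, hypothetical enum that only mirrors the names in the list above; it is not the type exposed by the `metal` crate. The choice to treat `deviceRemoved` and `accessRevoked` as device loss and everything else as `Other` is an assumption for discussion, not what the PR necessarily implements.

```rust
/// Mirrors the `WaitError` enum introduced by this PR; derives added for the demo.
#[derive(Debug, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

/// Hypothetical local mirror of the Metal command buffer error codes listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CommandBufferErrorCode {
    None,
    Timeout,
    PageFault,
    NotPermitted,
    OutOfMemory,
    InvalidResource,
    Memoryless,
    DeviceRemoved,
    StackOverflow,
    AccessRevoked,
    Internal,
}

/// Hypothetical helper: classify a command buffer error code.
pub fn map_error_code(code: CommandBufferErrorCode) -> Result<(), WaitError> {
    match code {
        CommandBufferErrorCode::None => Ok(()),
        // Assumption: both codes mean the device can no longer be used.
        CommandBufferErrorCode::DeviceRemoved | CommandBufferErrorCode::AccessRevoked => {
            Err(WaitError::DeviceLost)
        }
        // Metal's `timeout` means the command buffer was terminated, not that
        // the GPU is still busy, so it is not mapped to `WaitError::Timeout` here.
        other => Err(WaitError::Other(format!(
            "Metal command buffer error: {other:?}"
        ))),
    }
}
```

Carrying the code name in `Other` keeps the detail available in logs without growing the `WaitError` enum.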

Vanuan · 2025-11-29T04:12:47Z

@kvark ready to review

Vanuan added 2 commits November 29, 2025 02:18

introduce wait_for_result and vk implementation

7b7a655

introduce wait_for_result gles implementation

48fa3a2

Vanuan changed the title ~~introduce wait_for_result and vk implementation~~ Introduce wait_for_result Nov 29, 2025

Vanuan added 3 commits November 29, 2025 02:41

introduce wait_for_result metal implementation

b6726ef

wait_for_result: migrate frame_pacer

68347bc

wait_for_result: migrate belt

c9f3b9e

Vanuan added 3 commits November 29, 2025 04:09

wait_for_result: error handling for Metal

6d419c8

waite_for_result: blade-egui

7d42878

wait_for_result: blade-render hot reload

c20b725

Vanuan changed the title ~~Introduce wait_for_result~~ Wait API: Boolean to Result Nov 29, 2025

Vanuan changed the title ~~Wait API: Boolean to Result~~ wait_for API: Boolean to Result Nov 29, 2025

Vanuan mentioned this pull request Nov 29, 2025

gpui: Implement GPU device loss recovery for Linux X11 zed-industries/zed#43070

Open
