@Vanuan Vanuan commented Nov 29, 2025

GPU Wait API Migration: Boolean to Result Implementation

Fixes #248 | Inspired by zed-industries/zed#43070

Problem

The wait_for API provides GPU synchronization by blocking until the GPU reaches a specific sync point, preventing the CPU from destroying or reusing resources while they're still in use. When you submit work to the GPU (like rendering a frame), it continues running asynchronously. Without proper synchronization, the CPU could delete textures or reuse buffers while the GPU is still accessing them, causing memory corruption or crashes.

Current limitation: wait_for returns a plain boolean, which discards critical error information:

  • ❌ Timeouts look the same as device loss
  • ❌ Backend-specific failures are indistinguishable
  • ❌ All failures return false, with no context for recovery
  • ❌ Callers can't make informed decisions about resource cleanup

Solution

Introduce Result-based error handling while maintaining backward compatibility.

Migration Strategy

Dual implementation approach:

  • ✅ Existing wait_for() -> bool (preserved, delegates to new method)
  • ✨ New wait_for_result() -> Result<(), WaitError> (detailed errors)

New Error Type

pub enum WaitError {
    Timeout,        // Operation timed out, GPU still busy
    DeviceLost,     // GPU device lost/removed (unrecoverable)
    OutOfDate,      // Resource out of date (Vulkan swapchain)
    Other(String),  // Backend-specific errors with details
}

New Trait Method

fn wait_for_result(&self, sp: &Self::SyncPoint, timeout_ms: u32) 
    -> Result<(), WaitError>;
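
A minimal sketch of how the boolean method could delegate to the Result-based one. The trait name `CommandDevice`, the `FakeDevice` backend, and the derives on `WaitError` are assumptions made for illustration. Whether the PR delegates through a default trait method or separately in each backend is not shown here.

```rust
/// Mirrors the `WaitError` enum above. The derives are an assumption,
/// added so the sketch can compare and print values.
#[derive(Debug, Clone, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

/// Hypothetical, cut-down stand-in for the real context trait.
pub trait CommandDevice {
    type SyncPoint;

    /// New method: each backend implements this one.
    fn wait_for_result(&self, sp: &Self::SyncPoint, timeout_ms: u32) -> Result<(), WaitError>;

    /// Legacy method: collapses every error into `false`,
    /// so existing callers see the same behavior as before.
    fn wait_for(&self, sp: &Self::SyncPoint, timeout_ms: u32) -> bool {
        self.wait_for_result(sp, timeout_ms).is_ok()
    }
}

/// Fake backend whose "GPU" has finished all work up to `completed`.
pub struct FakeDevice {
    pub completed: u64,
    pub lost: bool,
}

impl CommandDevice for FakeDevice {
    type SyncPoint = u64;

    fn wait_for_result(&self, sp: &u64, _timeout_ms: u32) -> Result<(), WaitError> {
        if self.lost {
            return Err(WaitError::DeviceLost);
        }
        if *sp <= self.completed {
            Ok(())
        } else {
            Err(WaitError::Timeout)
        }
    }
}
```

With this shape, a caller that only needs a yes/no answer keeps calling `wait_for`, while a caller that needs the cause switches to `wait_for_result` without any change to the backend.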

Backend Implementations

Each backend maps its native error conditions to WaitError:

| Backend | Mechanism | Success | Timeout | Device Lost |
|---|---|---|---|---|
| Vulkan | Timeline semaphores | Ok(()) | TIMEOUT | ERROR_DEVICE_LOST |
| GLES | GL sync objects | ALREADY_SIGNALED/CONDITION_SATISFIED | TIMEOUT_EXPIRED | N/A |
| Metal | Command buffer polling | MTLCommandBufferStatus::Completed | Elapsed time check | NSError inspection |
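
The polling row of the table can be sketched roughly as below. This is not the PR's Metal code: `PollStatus` and `poll_until` are hypothetical names, the status query is passed in as a closure so the example stays self-contained, and treating a timeout of `!0` as "wait forever" is an assumption.

```rust
use std::time::{Duration, Instant};

/// Mirrors the PR's `WaitError`; the derives are an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

/// Hypothetical stand-in for a backend status query result.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PollStatus {
    Pending,
    Completed,
    Failed,
}

/// Polls `status` until it completes, fails, or `timeout_ms` elapses.
pub fn poll_until<F>(mut status: F, timeout_ms: u32) -> Result<(), WaitError>
where
    F: FnMut() -> PollStatus,
{
    let start = Instant::now();
    let timeout = Duration::from_millis(u64::from(timeout_ms));
    loop {
        match status() {
            PollStatus::Completed => return Ok(()),
            PollStatus::Failed => {
                return Err(WaitError::Other("command buffer error".to_string()))
            }
            PollStatus::Pending => {}
        }
        // Assumption: u32::MAX (`!0`) means "no deadline".
        if timeout_ms != u32::MAX && start.elapsed() >= timeout {
            return Err(WaitError::Timeout);
        }
        // Give other threads a chance to run between polls.
        std::thread::yield_now();
    }
}
```

The status is checked before the deadline, so a timeout of 0 behaves as a single non-blocking check, which is the pattern the buffer belt relies on.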

Usage Patterns

Frame Synchronization (Frame Pacer)

Before: Blind wait, no error handling

context.wait_for(&sp, !0);
// Always cleanup, even if device lost

After: Conditional cleanup based on error

match context.wait_for_result(&sp, !0) {
    Ok(()) => { /* cleanup resources */ }
    Err(WaitError::DeviceLost) => {
        cleanup = false; // Prevent crashes on device loss
    }
    // Any other failure: the GPU may still be using the resources
    Err(_) => { cleanup = false; }
}

Buffer Reuse (Buffer Belt)

Before: Can't distinguish busy from error

// false could mean "busy" or "device lost"
if gpu.wait_for(sp, 0) { reuse_buffer() }

After: Intelligent reuse decisions

match gpu.wait_for_result(sp, 0) {
    Ok(()) => true,                    // Safe to reuse
    Err(WaitError::Timeout) => false,  // Still busy, try later
    Err(e) => {                        // Unexpected error
        log::warn!("Unexpected: {:?}", e);
        false
    }
}

Texture Cleanup (EGUI)

After: Partition textures by readiness state

  • Ok(()) → Delete immediately
  • Timeout → Keep for next frame
  • Other errors → Keep with warning
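
The partitioning above might look roughly like this. `PendingTexture` and `partition_textures` are hypothetical names, and the wait call is passed in as a closure standing in for `gpu.wait_for_result(sp, 0)`; the actual EGUI integration in the PR may be structured differently.

```rust
/// Mirrors the PR's `WaitError`; the derives are an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

/// Hypothetical record of a texture queued for deletion.
#[derive(Debug, Clone, PartialEq)]
pub struct PendingTexture {
    pub id: u32,
    pub sync_point: u64,
}

/// Splits `pending` into (ready to delete, keep for a later frame).
pub fn partition_textures<W>(
    pending: Vec<PendingTexture>,
    mut wait: W,
) -> (Vec<PendingTexture>, Vec<PendingTexture>)
where
    W: FnMut(u64) -> Result<(), WaitError>,
{
    let mut ready = Vec::new();
    let mut keep = Vec::new();
    for tex in pending {
        match wait(tex.sync_point) {
            // GPU is done with it: safe to delete now.
            Ok(()) => ready.push(tex),
            // Still in flight: retry next frame.
            Err(WaitError::Timeout) => keep.push(tex),
            // Anything else: keep the texture and report the error.
            Err(e) => {
                eprintln!("texture {}: unexpected wait error: {:?}", tex.id, e);
                keep.push(tex);
            }
        }
    }
    (ready, keep)
}
```

Keeping textures on unexpected errors trades a possible leak for safety, since deleting a resource the GPU might still reference is the failure this API is meant to prevent.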

Shader Hot Reload

After: Block critical operations on device loss

match gpu.wait_for_result(sync_point, !0) {
    Ok(()) => { /* safe to reload shaders */ }
    Err(WaitError::DeviceLost) => {
        log::error!("Cannot reload: device lost");
        return false; // Prevent crash
    }
    Err(e) => { /* log, continue with caution */ }
}

Benefits

Actionable Error Information: Distinguish timeouts (retry later) from device loss (unrecoverable)
Safer Resource Management: Skip cleanup on device loss to prevent crashes
Better Debugging: Detailed error messages from backend-specific failures
Backward Compatible: Existing wait_for callers continue working unchanged
Foundation for Recovery: Enables future device loss recovery strategies

Migration Path

  1. ✅ Add WaitError enum and wait_for_result to trait
  2. ✅ Implement in all backends (Vulkan, GLES, Metal)
  3. ✅ Make wait_for delegate to wait_for_result
  4. ✅ Migrate critical usage sites (frame pacer, buffer belt, texture cleanup, shader reload)
  5. 🔜 Future: Deprecate boolean API, full migration to Result-based API

@Vanuan Vanuan changed the title introduce wait_for_result and vk implementation Introduce wait_for_result Nov 29, 2025
"Metal command buffer error".to_string(),
))
}
_ => {}

| State | Meaning |
|---|---|
| metal::MTLCommandBufferStatus::NotEnqueued | A command buffer's initial state, which indicates its command queue isn't reserving a place for it. You can modify a command buffer in this state by encoding commands to it, or by adding a state change handler. |
| metal::MTLCommandBufferStatus::Enqueued | A command buffer's second state, which indicates its command queue is reserving a place for it. You can modify a command buffer in this state by encoding commands to it, or by adding a state change handler. |
| metal::MTLCommandBufferStatus::Committed | A command buffer's third state, which indicates the command queue is preparing to schedule the command buffer by resolving its dependencies. You can't modify a command buffer in this state. |
| metal::MTLCommandBufferStatus::Scheduled | A command buffer's fourth state, which indicates the command buffer has its resources ready and is waiting for the GPU to run its commands. You can't modify a command buffer in this state. |
| metal::MTLCommandBufferStatus::Completed | A command buffer's successful, final state, which indicates the GPU finished running the command buffer's commands without any problems. |
| metal::MTLCommandBufferStatus::Error | A command buffer's error state, which indicates the GPU encountered an error while running the command buffer's commands. |

| Error Code | Description |
|---|---|
| none | An error code that represents the absence of any problems. |
| timeout | An error code that indicates the system interrupted and terminated the command buffer before it finished running. |
| pageFault | An error code that indicates the command buffer generated a page fault the GPU can't service. |
| notPermitted | An error code that indicates a process doesn't have access to a GPU device. |
| outOfMemory | An error code that indicates the GPU device doesn't have sufficient memory to execute a command buffer. |
| invalidResource | An error code that indicates the command buffer has an invalid reference to resource. |
| memoryless | An error code that indicates the GPU ran out of one or more of its internal resources that support memoryless render pass attachments. |
| deviceRemoved | An error code that indicates a person physically removed the GPU device before the command buffer finished running. |
| stackOverflow | An error code that indicates the GPU terminated the command buffer because a kernel function of tile shader used too many stack frames. |
| accessRevoked | An error code that indicates the system has revoked the Metal device's access because it's responsible for too many timeouts or hangs. |
| internal | An error code that indicates the Metal framework has an internal problem. |
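
One possible way to fold these error codes into `WaitError`, as a sketch only. `CommandBufferErrorCode` is a hypothetical mirror of the list above, not the `metal` crate's type. Treating `deviceRemoved` and `accessRevoked` as `DeviceLost`, and everything else as `Other`, is an assumption rather than what this PR necessarily does.

```rust
/// Mirrors the PR's `WaitError`; the derives are an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum WaitError {
    Timeout,
    DeviceLost,
    OutOfDate,
    Other(String),
}

/// Hypothetical mirror of the error codes listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CommandBufferErrorCode {
    None,
    Timeout,
    PageFault,
    NotPermitted,
    OutOfMemory,
    InvalidResource,
    Memoryless,
    DeviceRemoved,
    StackOverflow,
    AccessRevoked,
    Internal,
}

/// Classifies a command buffer error code as success, device loss,
/// or a backend-specific error carrying the code's name.
pub fn classify(code: CommandBufferErrorCode) -> Result<(), WaitError> {
    match code {
        CommandBufferErrorCode::None => Ok(()),
        // Assumption: the device is unusable after either of these.
        CommandBufferErrorCode::DeviceRemoved | CommandBufferErrorCode::AccessRevoked => {
            Err(WaitError::DeviceLost)
        }
        // The Metal `timeout` code means the system killed the command
        // buffer, which differs from the caller's wait deadline expiring,
        // so it is reported as `Other` here rather than `WaitError::Timeout`.
        other => Err(WaitError::Other(format!(
            "Metal command buffer error: {:?}",
            other
        ))),
    }
}
```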

Vanuan commented Nov 29, 2025

@kvark ready to review

@Vanuan Vanuan changed the title Introduce wait_for_result Wait API: Boolean to Result Nov 29, 2025
@Vanuan Vanuan changed the title Wait API: Boolean to Result wait_for API: Boolean to Result Nov 29, 2025

Development

Successfully merging this pull request may close these issues.

Feature request: Introduce SyncStatus and InvalidSyncPoint for enhanced synchronization feedback