perf: Add completion queue sharding#8815

Open
geobeau wants to merge 1 commit into triton-inference-server:main from geobeau:main

@geobeau geobeau commented Jun 3, 2026

What does the PR do?

This PR adds a way to shard the completion queue and to configure how many completion queues are used.
The motivation, taken from the commit message, is given in the Background section.

Add --grpc-infer-cq-count to control the number of inference completion queues (CQs). The default is 1 (a single shared CQ, same behavior as before).
Set it to 0 for one CQ per handler thread, or to N>1 for N sharded CQs, to reduce contention at high throughput.
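
The flag semantics described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names ResolveInferCqCount and AssignThreadsToCqs are hypothetical, and both the cap at the handler thread count and the round-robin mapping of threads to CQs are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): turn the flag value into the number
// of completion queues to create.
//   0 -> one CQ per handler thread
//   1 -> a single shared CQ (the default)
//   N -> N sharded CQs (capped at the thread count here, which is an assumption)
std::size_t ResolveInferCqCount(
    std::size_t flag_value, std::size_t handler_threads)
{
  if (handler_threads == 0) {
    return 0;  // no handler threads, so there is nothing to poll a CQ
  }
  if (flag_value == 0) {
    return handler_threads;
  }
  return std::min(flag_value, handler_threads);
}

// Hypothetical round-robin assignment (an assumption): element i is the index
// of the CQ polled by handler thread i.
std::vector<std::size_t> AssignThreadsToCqs(
    std::size_t handler_threads, std::size_t cq_count)
{
  std::vector<std::size_t> assignment;
  assignment.reserve(handler_threads);
  for (std::size_t i = 0; i < handler_threads; ++i) {
    // Guard against a zero CQ count to avoid a division by zero.
    assignment.push_back(cq_count == 0 ? 0 : i % cq_count);
  }
  return assignment;
}
```

Under these assumptions, 6 handler threads spread over 4 CQs gives CQs 0 and 1 two threads each and CQs 2 and 3 one thread each, which is one plausible way the load could end up uneven.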

Checklist

  • I have read the Contribution guidelines and signed the Contributor License
    Agreement
  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Test plan:

Caveats:

Increasing the number of completion queues may leave the load somewhat unevenly distributed across them.

Background

When running above 100k QPS, the gRPC threads become the bottleneck, so scaling further requires adding more gRPC threads.
Unfortunately, adding more gRPC threads has diminishing returns because of heavy contention on the futex required to access the completion queue. By sharding the completion queue into N parts, we can reduce the contention drastically.
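
The general idea of sharding to reduce lock contention can be sketched with a generic queue. This is only an analogy under stated assumptions: ShardedQueue is a hypothetical class, not the gRPC completion queue or anything from this PR, and a std::mutex per shard stands in for the futex mentioned above. Each thread touches only the lock of its own shard, so threads mapped to different shards never wait on each other.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative only: a queue split into N shards, each guarded by its own
// mutex. With a single shard every thread contends on the same lock; with N
// shards, contention is limited to threads that map to the same shard.
template <typename T>
class ShardedQueue {
 private:
  struct Shard {
    std::mutex mu;
    std::deque<T> items;
  };
  std::vector<std::unique_ptr<Shard>> shards_;

 public:
  explicit ShardedQueue(std::size_t shard_count)
  {
    if (shard_count == 0) {
      shard_count = 1;  // always keep at least one shard
    }
    shards_.reserve(shard_count);
    for (std::size_t i = 0; i < shard_count; ++i) {
      shards_.push_back(std::make_unique<Shard>());
    }
  }

  std::size_t ShardCount() const { return shards_.size(); }

  // Enqueue onto the shard selected by the producer id (modulo mapping).
  void Push(std::size_t producer_id, T item)
  {
    Shard& shard = *shards_[producer_id % shards_.size()];
    std::lock_guard<std::mutex> lock(shard.mu);
    shard.items.push_back(std::move(item));
  }

  // Dequeue from the shard selected by the consumer id. Returns false if that
  // shard is empty; other shards are not inspected.
  bool TryPop(std::size_t consumer_id, T* out)
  {
    Shard& shard = *shards_[consumer_id % shards_.size()];
    std::lock_guard<std::mutex> lock(shard.mu);
    if (shard.items.empty()) {
      return false;
    }
    *out = std::move(shard.items.front());
    shard.items.pop_front();
    return true;
  }
};
```

Because a consumer only ever looks at its own shard in this sketch, work queued on a busy shard is not picked up by idle consumers of other shards, which illustrates the kind of trade-off noted under Caveats.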
