geobeau · 2026-06-03T15:56:45Z

What does the PR do?

This PR makes it possible to configure more than one inference completion queue (CQ) and shard work across them.
The change is summarized below as in the commit message; see Background for the motivation.

Add --grpc-infer-cq-count to control the number of inference completion queues. The default is 1 (a single shared CQ, same behavior as before). Set it to 0 for one CQ per handler thread, or to N > 1 for N sharded CQs, to reduce contention at high throughput.

Checklist

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

Test plan:

Caveats:

Increasing the number of queues may make the load slightly unbalanced, since each queue is served only by the handler threads polling it rather than by one shared pool.

Background

When running above 100k QPS, the gRPC threads become the bottleneck; to scale further, we need to add more gRPC threads.
Unfortunately, adding more gRPC threads has diminishing returns because of heavy contention on the futex that guards access to the completion queue. Sharding the completion queue into N parts reduces this contention drastically.


perf: Add completion queue sharding #8815
geobeau wants to merge 1 commit into triton-inference-server:main from geobeau:main