The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
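The cost evaluation described above can be sketched in a few lines. This is a minimal illustration, not Dynamo's implementation: the `WorkerState` fields, the `prefill_weight` parameter, and the linear cost formula are all assumptions made for the example. It treats the decoding cost as the worker's active block count and the prefill cost as the number of request blocks not already covered by cached prefix blocks.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical per-worker snapshot; field names are illustrative only."""
    name: str
    active_blocks: int         # KV blocks held by requests currently decoding
    cached_prefix_blocks: int  # leading blocks of this request already cached


def routing_cost(request_blocks: int, worker: WorkerState,
                 prefill_weight: float = 1.0) -> float:
    """Assumed cost model: weighted new prefill blocks plus active decode blocks."""
    # Cached overlap can never exceed the request length.
    overlap = min(worker.cached_prefix_blocks, request_blocks)
    # Only blocks outside the cached prefix must be computed during prefill.
    new_prefill_blocks = request_blocks - overlap
    return prefill_weight * new_prefill_blocks + worker.active_blocks


def pick_worker(request_blocks: int, workers: list[WorkerState],
                prefill_weight: float = 1.0) -> WorkerState:
    """Choose the worker with the lowest estimated cost."""
    return min(workers, key=lambda w: routing_cost(request_blocks, w, prefill_weight))


workers = [
    WorkerState("a", active_blocks=20, cached_prefix_blocks=8),
    WorkerState("b", active_blocks=5, cached_prefix_blocks=0),
]
# With equal weighting, the lightly loaded worker wins despite no cache overlap.
print(pick_worker(10, workers).name)
# Weighting prefill more heavily favors the worker with the cached prefix.
print(pick_worker(10, workers, prefill_weight=4.0).name)
```

In this sketch, raising the weight on prefill cost shifts decisions toward cache reuse, while lowering it shifts them toward balancing decode load. That is the trade-off the router has to manage.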
To launch the Dynamo frontend with the KV Router, start the frontend with --router-mode kv.
For Kubernetes, set DYN_ROUTER_MODE=kv on the Frontend service. For event-driven KV state, configure backend workers to publish KV cache events using the backend-specific flags described in Router Operations. Use --no-router-kv-events only when you want approximate cache-state prediction.
You can also run the KV router as a standalone service (without the Dynamo frontend). See the Standalone Router component for more details.
For deployment modes and quick start steps, see the Router Guide. For CLI arguments and tuning guidelines, see Configuration and Tuning. For A/B benchmarking, see the KV Router A/B Benchmarking Guide.
Requirements:
Backend workers must call register_model() with model_input=ModelInput.Tokens (see Backend Guide). Your backend handler then receives pre-tokenized requests with token_ids instead of raw text.
Multimodal Support:
Limitations:
For basic model registration without KV routing, use --router-mode round-robin, --router-mode random, --router-mode least-loaded, or --router-mode device-aware-weighted. These modes work with both static and dynamic endpoints.