This directory contains the optimized implementation of the Thompson Sampling router for Dynamo, using the "Processor-as-Backend" pattern with Dynamic Discovery to intercept requests from the default Dynamo frontend.
┌─────────────────────────────────────────────────────────────────────────┐
│ Client Request (with nvext.annotations) │
│ ↓ │
│ Default Dynamo Frontend (port 8000) │
│ ↓ tokenization + nvext parsing │
│ ↓ discovers backends via ETCD ModelWatcher │
│ ↓ finds Processor's model card! │
│ │
│ Custom Processor (dynamo.backend.generate-{instance_id}) │
│ ↓ extracts: prefix_id, total_requests, osl, iat │
│ ↓ queries Thompson Sampling router │
│ │
│ Custom Router (dynamo.router.find_worker) │
│ ↓ KV overlap + workload-aware selection │
│ ↓ returns worker_id │
│ │
│ Processor forwards to dynamo.worker.generate (with worker_id) │
│ ↓ │
│ SGLang Worker (actual inference) │
│ ↓ │
│ Response + Feedback to Router │
└─────────────────────────────────────────────────────────────────────────┘

| Component | File | Endpoint | Purpose |
|---|---|---|---|
| Processor | processor.py | dynamo.backend.generate + etcd model card | Intercepts frontend requests, extracts hints, coordinates routing |
| Router | router.py | dynamo.router.find_worker | Thompson Sampling + KV overlap worker selection |
| Config | config.yaml | - | Router configuration parameters |

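
To make the "Thompson Sampling + KV overlap" idea concrete, here is a minimal sketch of how such a selection could work. It is not the code in router.py: the `WorkerPosterior` class, the `find_worker` and `record_feedback` helpers, the Beta-Bernoulli reward model, and the additive `overlap_weight` are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WorkerPosterior:
    """Beta(alpha, beta) belief about how well a worker serves requests (hypothetical model)."""
    alpha: float = 1.0  # prior pseudo-count of good outcomes
    beta: float = 1.0   # prior pseudo-count of bad outcomes


def find_worker(
    posteriors: Dict[int, WorkerPosterior],
    kv_overlap: Dict[int, float],
    overlap_weight: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the worker with the highest sampled score.

    Each worker's score is one draw from its Beta posterior (the Thompson
    Sampling step) plus a bonus proportional to its KV cache overlap with
    the request, given here as a fraction in [0, 1].
    """
    if not posteriors:
        raise ValueError("no workers available")
    if rng is None:
        rng = random.Random()

    def score(worker_id: int) -> float:
        p = posteriors[worker_id]
        sampled = rng.betavariate(p.alpha, p.beta)
        return sampled + overlap_weight * kv_overlap.get(worker_id, 0.0)

    return max(posteriors, key=score)


def record_feedback(posteriors: Dict[int, WorkerPosterior], worker_id: int, success: bool) -> None:
    """Update the chosen worker's posterior after the response comes back."""
    if success:
        posteriors[worker_id].alpha += 1.0
    else:
        posteriors[worker_id].beta += 1.0
```

Because the score is sampled rather than computed from a point estimate, workers with little feedback still get chosen occasionally, which is what lets the router keep exploring while mostly exploiting workers that have performed well.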
Instead of using the deprecated --static-endpoint flag on the frontend, this processor uses dynamic discovery via etcd:
- Processor registers as dynamo.backend.generate (dynamic mode with instance ID)
- Processor calls register_llm() to advertise a model card in etcd
- Frontend ModelWatcher discovers the processor's model card
- Frontend routes requests to the discovered processor endpoint
- SGLang Worker registers as dynamo.worker.generate (also dynamic)
The --static-endpoint flag is deprecated and will be removed in future Dynamo versions. Dynamic discovery provides:
- Forward compatibility with future Dynamo releases
- Support for multiple processor instances (load balancing)
- Standard Dynamo discovery patterns
- Dynamic scaling capabilities
The processor uses register_llm() to advertise itself in etcd:
@dynamo_worker(static=False)  # Dynamic mode for ETCD discovery
async def worker(runtime: DistributedRuntime):
    component = runtime.namespace("dynamo").component("backend")
    # NOTE: create_service() was removed in Dynamo 0.8.x - endpoint creation handles registration
    endpoint = component.endpoint("generate")
    # Register model card so frontend can discover us
    await register_llm(
        model_input=ModelInput.Tokens,
        model_type=ModelType.Chat | ModelType.Completions,
        endpoint=endpoint,
        model_path=args.model_path,
        model_name=args.model_name,
    )
    handler = ProcessorRequestHandler(runtime, ...)
    await endpoint.serve_endpoint(handler.generate)

The processor now requires:
- --model-path: Path to the model directory (for tokenizer and model card)
- --model-name: Served model name (must match the model expected by the frontend)
# Set required environment variable
export DYNAMO_MODEL_DIR="/path/to/Llama-3.3-70B-Instruct"
# Start all components
bash ../start_dynamo_optimized_thompson_hints_sglang.sh
# or
bash ../start_dynamo_optimized_thompson_hints_vllm.sh

# Basic request (no routing hints)
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama-3.3-70b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
# Request with nvext annotations (routing hints)
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama-3.3-70b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50,
"nvext": {
"annotations": [
"prefix_id:my-session-001",
"total_requests:10",
"osl:MEDIUM",
"iat:LOW"
]
}
}'

| Annotation | Format | Description |
|---|---|---|
| prefix_id | prefix_id:&lt;string&gt; | Unique identifier for prefix reuse across requests |
| total_requests | total_requests:&lt;int&gt; | Expected total requests in this prefix group |
| osl | osl:LOW\|MEDIUM\|HIGH | Expected output sequence length |
| iat | iat:LOW\|MEDIUM\|HIGH | Inter-arrival time hint |

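
The annotation formats in the table above are simple `key:value` strings, so extracting them can be sketched in a few lines. This is an illustration only, not the parser in processor.py: the `RoutingHints` dataclass and `parse_annotations` function are hypothetical names, and leaving missing or invalid hints as `None` is an assumption (the real processor may apply its own defaults).

```python
from dataclasses import dataclass
from typing import Iterable, Optional

LEVELS = {"LOW", "MEDIUM", "HIGH"}


@dataclass
class RoutingHints:
    """Hypothetical container for the four hints listed in the table."""
    prefix_id: Optional[str] = None
    total_requests: Optional[int] = None
    osl: Optional[str] = None
    iat: Optional[str] = None


def parse_annotations(annotations: Iterable[str]) -> RoutingHints:
    """Turn nvext.annotations strings into typed routing hints."""
    hints = RoutingHints()
    for item in annotations:
        # Split on the first ':' only, so a prefix_id may itself contain colons.
        key, sep, value = item.partition(":")
        if not sep:
            continue  # not a key:value annotation
        if key == "prefix_id":
            hints.prefix_id = value
        elif key == "total_requests":
            try:
                hints.total_requests = int(value)
            except ValueError:
                pass  # ignore a non-integer count
        elif key in ("osl", "iat"):
            level = value.upper()
            if level in LEVELS:
                setattr(hints, key, level)
    return hints


hints = parse_annotations(
    ["prefix_id:my-session-001", "total_requests:10", "osl:MEDIUM", "iat:LOW"]
)
print(hints)
```

Unknown keys and malformed values are skipped rather than rejected, so a request with a bad hint would still be routed, just with less information.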
To confirm that requests are flowing through the processor (not directly to workers), run:
docker logs dynamo-sglang-components 2>&1 | grep -E "(Processor|processor|Processing request|Routing decision|dynamo.backend|backend.generate|find_worker)" | tail -50

When the system is working correctly, you should see output similar to:
Step 3: Starting Custom Processor (Registers as backend.generate)...
Processor PID: 3735
Registered at: dynamo.backend.generate (intercepts frontend requests)
INFO processor._init_prometheus_metrics: Prometheus metrics initialized for processor
INFO processor.initialize: Router clients created, waiting for instances...
INFO dynamo_runtime::component::client: wait_for_instances: Found 1 instance(s) for endpoint
INFO processor.initialize: Router clients initialized successfully
INFO processor.initialize: Engine client created, waiting for worker instances...
INFO processor.initialize: Processor initialized successfully (routing to dynamo.worker.generate)
INFO processor.generate: Processing request: prefix=auto-3f0519ac1cc442d2... total=1 osl=MEDIUM iat=MEDIUM tokens=37
INFO processor.generate: Routing decision: worker=7587892168930944779 decision=bcc5180740ed44c6... reuse_budget=0
INFO processor.generate: Processing request: prefix=auto-2593032a6cf843ce... total=1 osl=MEDIUM iat=MEDIUM tokens=37
INFO processor.generate: Routing decision: worker=7587892168930944779 decision=ba4440fd3a144822... reuse_budget=0

| Log Message | Meaning |
|---|---|
| Registering model card: model_name=... | Processor registering with etcd |
| Model card registered successfully | Frontend can now discover the processor |
| Router clients initialized successfully | Connected to Thompson Sampling router |
| Processor initialized successfully | Ready to process requests |
| Processing request: prefix=... tokens=N | Request received and being processed |
| Routing decision: worker=... decision=... | Router selected a worker |

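
For a quick summary of how requests are being distributed, the "Processing request" and "Routing decision" lines can be parsed with regular expressions. The sketch below assumes the log format matches the sample output above exactly; `summarize_logs` and both patterns are hypothetical helpers, not part of this repository.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Patterns derived from the sample log lines; adjust them if the format differs.
REQUEST_RE = re.compile(
    r"Processing request: prefix=(\S+) total=(\d+) osl=(\w+) iat=(\w+) tokens=(\d+)"
)
ROUTING_RE = re.compile(r"Routing decision: worker=(\d+)")


def summarize_logs(lines: Iterable[str]) -> Tuple[int, Counter]:
    """Return (number of processed requests, routing decisions per worker ID)."""
    requests = 0
    per_worker: Counter = Counter()
    for line in lines:
        if REQUEST_RE.search(line):
            requests += 1
        match = ROUTING_RE.search(line)
        if match:
            per_worker[match.group(1)] += 1
    return requests, per_worker


sample = [
    "INFO processor.generate: Processing request: prefix=auto-3f0519ac1cc442d2... total=1 osl=MEDIUM iat=MEDIUM tokens=37",
    "INFO processor.generate: Routing decision: worker=7587892168930944779 decision=bcc5180740ed44c6... reuse_budget=0",
]
requests, per_worker = summarize_logs(sample)
print(requests, dict(per_worker))
```

To run it against live output, the lines from `docker logs dynamo-sglang-components 2>&1` can be piped in and passed as `sys.stdin`.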
Symptom: Requests fail or go directly to workers, bypassing processor.
Cause: Model card not registered or model name mismatch.
Verification:
# Check if processor registered its model card
docker logs dynamo-sglang-components 2>&1 | grep -i "model card"
# Check ETCD for registered models
curl -s http://localhost:2379/v3/kv/range -X POST \
-H "Content-Type: application/json" \
-d '{"key":"ZHluYW1v","range_end":"ZHluYW1w"}' | jq .

Solution:
- Ensure --model-name matches between processor and frontend
- Ensure --model-path points to a valid model directory
- Processor must start BEFORE frontend
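
The `key` and `range_end` values in the etcd query above are base64-encoded: `ZHluYW1v` is `dynamo`, and `ZHluYW1w` is `dynamp`, the same prefix with its last byte incremented, which makes the range cover every key that starts with `dynamo`. A small sketch for building the same request body for another prefix (`etcd_prefix_range` is a hypothetical helper name):

```python
import base64
import json


def etcd_prefix_range(prefix: str) -> dict:
    """Build the JSON body for an etcd v3 range request over a key prefix."""
    key = prefix.encode("utf-8")
    if key:
        # Increment the last byte to get the first key past the prefix.
        # UTF-8 never produces a 0xFF byte, so this cannot overflow.
        end = key[:-1] + bytes([key[-1] + 1])
    else:
        # etcd treats a range_end of "\0" as "all keys >= key".
        end = b"\x00"
    return {
        "key": base64.b64encode(key).decode("ascii"),
        "range_end": base64.b64encode(end).decode("ascii"),
    }


print(json.dumps(etcd_prefix_range("dynamo")))
```

The printed JSON can be passed to `curl -d` in place of the hard-coded body, for example to narrow the query to a longer prefix.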
Cause: Processor couldn't reach workers.
Solution: Ensure workers are registered and running:
docker logs dynamo-sglang-components 2>&1 | grep "worker.generate"

Symptom: No "Processing request" logs, but responses still work.
Cause: Frontend is routing directly to workers instead of through the processor.
Verification:
# Check if processor is receiving requests
docker logs dynamo-sglang-components 2>&1 | grep "Processing request"

Solution:
- Ensure processor's --model-name matches the frontend --model-name parameter exactly
- Processor must register BEFORE frontend starts
- Check that processor's model card is in etcd
Symptom: Router stream ended without worker_id; falling back to engine load balancing
Cause: Router not started or not registered.
Solution: Check router logs:
docker logs dynamo-sglang-components 2>&1 | grep -i router

| Metric | Description |
|---|---|
| thompson_processor_requests_total | Total requests processed |
| thompson_processor_request_latency_seconds | Request latency histogram |
| thompson_processor_tokens_in_total | Total input tokens |
| thompson_processor_tokens_out_total | Total output tokens |
| thompson_processor_routing_decisions_total | Routing decisions by worker |
| thompson_processor_router_errors_total | Router communication errors |
| thompson_processor_engine_errors_total | Backend engine errors |
| thompson_processor_active_requests | Currently active requests |

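
These metrics are served in the Prometheus text exposition format, so a scraped payload can be filtered programmatically as well as with grep. The sketch below is a simplified parser, not a full implementation of the format: it assumes each sample line ends with the value (no trailing timestamp), and the sample text and its numbers are made up for illustration.

```python
from typing import Dict


def parse_processor_metrics(text: str, prefix: str = "thompson_processor_") -> Dict[str, float]:
    """Extract samples whose metric name starts with `prefix`.

    Keys keep any label set, e.g. 'name{label="x"}', so labelled series
    stay distinct.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        if not series.startswith(prefix):
            continue
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


# Illustrative payload only; real values come from the /metrics endpoint.
sample_text = """\
# HELP thompson_processor_requests_total Total requests processed
# TYPE thompson_processor_requests_total counter
thompson_processor_requests_total 12
thompson_processor_active_requests 0
process_cpu_seconds_total 1.5
"""
print(parse_processor_metrics(sample_text))
```

For anything beyond a quick check, a maintained Prometheus client or parser library is the safer choice, since it handles timestamps, escaping, and histograms correctly.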
Access metrics:
curl http://localhost:8081/metrics | grep thompson_processor

See config.yaml for router configuration options and PARAMETERS.md for detailed parameter documentation.