
FusionInfer

A Kubernetes controller for unified LLM inference orchestration, supporting both monolithic and prefill/decode (PD) disaggregated serving topologies.

Description

FusionInfer provides a single InferenceService CRD that enables:

  • Monolithic deployment: Single-pod inference handling full request lifecycle
  • PD disaggregated deployment: Separate prefill and decode roles for better GPU utilization
  • Multi-node deployment: Distributed inference across multiple nodes using tensor parallelism
  • Gang scheduling: Atomic scheduling via Volcano PodGroup integration
  • Intelligent routing: Gateway API integration with EPP (Endpoint Picker) for request scheduling

Demo

Prefix Cache Aware Routing

fusioninfer-demo.mp4

Multi-Node Inference

fusioninfer-multi-node.mp4

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      InferenceService CRD                       │
│   (roles: worker/prefiller/decoder, replicas, multinode)        │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                    ┌───────────────────────────────┐
                    │   InferenceService Controller │
                    └─────────────┬─────────────────┘
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        │                         │                         │
        ▼                         ▼                         ▼
┌───────────────┐       ┌─────────────────┐       ┌─────────────────┐
│   PodGroup    │       │ LeaderWorkerSet │       │  Router (EPP)   │
│  (Volcano)    │       │     (LWS)       │       │  InferencePool  │
│               │       │                 │       │  HTTPRoute      │
└───────────────┘       └─────────────────┘       └─────────────────┘
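
For each role, the controller reconciles the child objects shown above. The sketch below is illustrative only: the object names and which fields FusionInfer actually sets are assumptions, while the API groups and fields follow the upstream Volcano PodGroup and LeaderWorkerSet schemas.

# Illustrative sketch (not controller output): child objects for one two-pod role
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: qwen-inference-decode        # hypothetical name
spec:
  minMember: 2                       # gang scheduling: all pods start or none do
---
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: qwen-inference-decode        # hypothetical name
spec:
  replicas: 1                        # one leader/worker group for the role
  leaderWorkerTemplate:
    size: 2                          # pods per group (leader plus one worker)
    workerTemplate:
      spec:
        containers:
          - name: vllm
            image: vllm/vllm-openai:v0.11.0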

Getting Started

Install Dependencies

FusionInfer requires the following components:

1. LeaderWorkerSet (LWS) - For multi-node workload management

kubectl create -f https://github.com/kubernetes-sigs/lws/releases/download/v0.7.0/manifests.yaml

Reference: LWS Installation Guide | Releases

2. Volcano - For gang scheduling

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.13.1/installer/volcano-development.yaml

Reference: Volcano Installation Guide | Releases

3. Gateway API - For service routing

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml

Reference: Gateway API Installation Guide | Releases

4. Gateway API Inference Extension - For intelligent inference request routing

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.2.1/manifests.yaml

Reference: Inference Extension Docs | Releases

Install the Gateway

Set the Kgateway version and install the Kgateway CRDs:

KGTW_VERSION=v2.1.0
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds

Install Kgateway:

helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true

Deploy the Inference Gateway:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml
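
The manifest above creates the Gateway that the examples below attach to through httproute.parentRefs. Its shape is roughly the following sketch; the upstream manifest is authoritative and listener details may differ:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway            # referenced by parentRefs in the examples below
spec:
  gatewayClassName: kgateway         # handled by the Kgateway controller installed above
  listeners:
    - name: http
      protocol: HTTP
      port: 80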

Quick Start (Local Development)

# 1. Create a kind cluster (optional)
kind create cluster --name fusioninfer

# 2. Install FusionInfer CRDs
make install

# 3. Run the controller locally
make run

Usage Examples

Monolithic LLM Service

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-inference
spec:
  roles:
    - name: router
      componentType: router
      strategy: prefix-cache
      httproute:
        parentRefs:
          - name: inference-gateway
    - name: inference
      componentType: worker
      replicas: 1
      template:
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.11.0
              args: ["--model", "Qwen/Qwen3-8B"]
              resources:
                limits:
                  nvidia.com/gpu: "1"
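
PD Disaggregated LLM Service

The same CRD expresses a prefill/decode disaggregated topology by splitting the worker into prefiller and decoder roles. The sketch below is a hedged example: the componentType values follow the architecture diagram, but the vLLM KV-transfer arguments depend on your connector setup and are omitted here.

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-inference-pd
spec:
  roles:
    - name: router
      componentType: router
      strategy: prefix-cache
      httproute:
        parentRefs:
          - name: inference-gateway
    - name: prefill
      componentType: prefiller        # prompt prefill only
      replicas: 1
      template:
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.11.0
              args: ["--model", "Qwen/Qwen3-8B"]  # add KV-transfer flags for your connector
              resources:
                limits:
                  nvidia.com/gpu: "1"
    - name: decode
      componentType: decoder          # token generation
      replicas: 1
      template:
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.11.0
              args: ["--model", "Qwen/Qwen3-8B"]  # add KV-transfer flags for your connector
              resources:
                limits:
                  nvidia.com/gpu: "1"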

Send Request

# You can use minikube tunnel to assign an IP address to a LoadBalancer-type Service
GATEWAY_IP=$(kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}')

curl -X POST "http://${GATEWAY_IP}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
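
Multi-Node LLM Service

For models that do not fit on a single node, a role can span multiple nodes and is backed by a LeaderWorkerSet, as shown in the architecture diagram. The sketch below is an assumption about the shape of the multinode setting; the tensor-parallel size should match the total GPU count across the group.

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-inference-multinode
spec:
  roles:
    - name: inference
      componentType: worker
      replicas: 1
      multinode:                      # assumed shape: pods per replica group
        size: 2
      template:
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.11.0
              args: ["--model", "Qwen/Qwen3-8B", "--tensor-parallel-size", "2"]
              resources:
                limits:
                  nvidia.com/gpu: "1"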
