A web-based platform for deploying and managing large language models on Kubernetes with support for multiple inference providers.
- Web UI: Modern interface for all deployment and management tasks
- Model Catalog: Browse curated models or search the entire HuggingFace Hub
- Smart Filtering: Automatically filters models by architecture compatibility
- GPU Capacity Warnings: Visual indicators showing if models fit your cluster's GPU memory
- Autoscaler Integration: Detects cluster autoscaling and provides capacity guidance
- One-Click Deploy: Configure and deploy models without writing YAML
- Live Dashboard: Monitor deployments with auto-refresh and status tracking
- Multi-Provider Support: Extensible architecture supporting multiple inference runtimes
- Multiple Engines: vLLM, SGLang, and TensorRT-LLM (via NVIDIA Dynamo)
- Installation Wizard: Install providers via Helm directly from the UI
- Dark Theme: Modern dark UI with provider-specific accents
| Provider | Status | Description |
|---|---|---|
| NVIDIA Dynamo | ✅ Available | GPU-accelerated inference with aggregated or disaggregated serving |
| KubeRay | ✅ Available | Ray-based distributed inference |
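If you are not sure which providers are already installed in your cluster, a quick check is to look for their CRDs from the command line. The grep pattern below is illustrative only; the exact CRD names depend on the provider versions you install:

```bash
# Look for provider CRDs (pattern is illustrative, not exhaustive)
kubectl get crds | grep -iE 'ray|dynamo'
```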
- Kubernetes cluster with `kubectl` configured
- `helm` CLI installed
- GPU nodes with NVIDIA drivers (for GPU-accelerated inference)
- HuggingFace account (for accessing gated models like Llama)
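A quick way to confirm these prerequisites from your workstation (standard kubectl/helm commands, nothing KubeFoundry-specific):

```bash
kubectl version --client   # kubectl installed
kubectl cluster-info       # cluster reachable with the current context
helm version               # helm CLI installed
kubectl get nodes -o wide  # confirm GPU nodes are present (labels/taints depend on your setup)
```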
Download the latest release for your platform and run:
```bash
./kubefoundry
```

Open the web UI at http://localhost:3001.
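As an illustration, downloading and running on Linux/amd64 might look like the following; the release asset name here is an assumption, so check the Releases page for the actual filename for your platform:

```bash
# Asset name is hypothetical; substitute the real one from the Releases page
curl -LO https://github.com/sozercan/kube-foundry/releases/latest/download/kubefoundry-linux-amd64
chmod +x kubefoundry-linux-amd64
./kubefoundry-linux-amd64
```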
Requires: `kubectl` configured with cluster access and the `helm` CLI installed.
```bash
kubectl apply -f https://raw.githubusercontent.com/sozercan/kube-foundry/main/deploy/kubernetes/kubefoundry.yaml

# Access via port-forward
kubectl port-forward -n kubefoundry-system svc/kubefoundry 3001:80
```

Open the web UI at http://localhost:3001.
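Before port-forwarding, you can confirm the workload is up (the namespace comes from the manifest above):

```bash
kubectl get pods -n kubefoundry-system
kubectl get svc -n kubefoundry-system
```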
See Kubernetes Deployment for configuration options.
Navigate to the Installation page and click Install next to your preferred provider. The UI will guide you through the Helm installation process with real-time status updates.
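Under the hood this is a Helm install. As a hedged point of reference, installing the KubeRay operator by hand uses the upstream chart like this (KubeFoundry may pick different release names, namespaces, or values):

```bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator \
  --namespace kuberay-system --create-namespace
```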
Go to Settings → HuggingFace and click "Sign in with Hugging Face" to connect your account via OAuth. Your token will be automatically distributed to all required namespaces.
Note: A HuggingFace token is required to access gated models like Llama.
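If you want to verify the token distribution, look for the secret KubeFoundry creates in each namespace. The exact secret name is not documented here, so the search below only narrows the list:

```bash
# Secret name varies; this just narrows the search across namespaces
kubectl get secrets -A | grep -i hf
```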
- Navigate to the Models page
- Browse the curated catalog or Search HuggingFace for any compatible model
- Review GPU memory estimates and fit indicators (✅ fits, ⚠️ tight, ❌ exceeds); a rough sizing sanity check follows this list
- Click Deploy on your chosen model
- Select Runtime: Choose between NVIDIA Dynamo and KubeRay based on which runtimes are installed
- Configure deployment options (engine, replicas, tensor parallelism, etc.)
- Click Create Deployment to launch
Note: Each deployment can use a different runtime. The deployment list shows which runtime each deployment is using.
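The fit indicators above come from an estimate of the model's GPU memory footprint. As a rough sanity check you can do yourself (not KubeFoundry's exact formula): weights need roughly parameter count × bytes per parameter, plus headroom for the KV cache and activations.

```bash
# Illustrative only: an 8B-parameter model served in FP16 (2 bytes per parameter)
python3 -c 'params = 8e9; print(f"~{params * 2 / 2**30:.0f} GiB for weights alone")'
# -> ~15 GiB before KV cache, so a 16 GiB GPU is "tight" and a 24 GiB GPU is comfortable
```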
Head to the Deployments page to:
- View real-time status of all deployments
- See pod readiness and health checks
- Access logs and deployment details
- Scale or delete deployments
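The same information is available from the command line if you prefer. The resource names and label selector below are placeholders, so adjust them to whatever your deployment actually creates:

```bash
# Inspect pods and stream logs for a deployment (label selector is hypothetical)
kubectl get pods -n <namespace> -l app=<deployment-name>
kubectl logs -n <namespace> deploy/<deployment-name> --tail=100 -f
```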
Once status shows Running, your model exposes an OpenAI-compatible API. Use kubectl port-forward to access it locally:
```bash
# Port-forward to the service (check the Deployments page for the exact service name)
kubectl port-forward svc/<deployment-name> 8000:8000 -n <namespace>

# List available models
curl http://localhost:8000/v1/models

# Test with a chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "Hello!"}]}'
```

KubeFoundry supports any HuggingFace model with a compatible architecture. Browse the curated catalog for tested models, or search the HuggingFace Hub for thousands more.
When searching HuggingFace, models are filtered by architecture compatibility:
| Engine | Supported Architectures |
|---|---|
| vLLM | LlamaForCausalLM, MistralForCausalLM, Qwen2ForCausalLM, GPT2LMHeadModel, and 40+ more |
| SGLang | LlamaForCausalLM, MistralForCausalLM, Qwen2ForCausalLM, and 20+ more |
| TensorRT-LLM | LlamaForCausalLM, GPTForCausalLM, MistralForCausalLM, and 15+ more |
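To check which architecture a given model declares before deploying it, you can read the `architectures` field of its config.json from the Hub. The model ID below is just an example; gated models additionally need an Authorization header with your token:

```bash
curl -sL https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/resolve/main/config.json \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["architectures"])'
# -> ['Qwen2ForCausalLM']
```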
KubeFoundry supports optional authentication using your existing kubeconfig OIDC credentials.
To enable, start the server with:
```bash
AUTH_ENABLED=true ./kubefoundry
```

Then use the CLI to log in:
```bash
kubefoundry login                                 # Uses current kubeconfig context
kubefoundry login --server https://example.com    # Specify server URL
kubefoundry login --context my-cluster            # Use a specific context
```

The login command extracts your OIDC token and opens the browser automatically.
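If you are curious what the CLI extracts, you can inspect the kubeconfig yourself. The jsonpath below assumes the legacy `oidc` auth-provider layout; clusters that use exec-based credential plugins store tokens differently, so treat this as a sketch:

```bash
# Works only for kubeconfigs using the legacy oidc auth-provider (assumption)
kubectl config view --minify --raw \
  -o jsonpath='{.users[0].user.auth-provider.config.id-token}'
```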
- Architecture Overview
- API Reference
- Development Guide
- Azure Cluster Autoscaling Setup
- Kubernetes Deployment
We welcome contributions! Please see CONTRIBUTING.md for development setup and guidelines.