

BentoML

Build fast and reliable model serving systems

Welcome to BentoML 👋

Important

BentoML is now part of Modular. Together, we're building a unified stack for high-performance AI inference.


What's cooking? 👩‍🍳

🍱 BentoML: The Unified Serving Framework for AI/ML Systems

BentoML is a Python library for building online serving systems optimized for AI apps and model inference. It supports serving any model format/runtime and custom Python code, offering the key primitives for serving optimizations, task queues, batching, multi-model chains, distributed orchestration, and multi-GPU serving.

🎨 Examples: Learn by doing!

A collection of examples for BentoML, from deploying OpenAI-compatible LLM services to building voice phone-calling agents and RAG applications. Use these examples to learn how to use BentoML and build your own solutions.

🦾 OpenLLM: Self-hosting Large Language Models Made Easy

Run any open-source LLM (Llama, Mistral, Qwen, Phi, and more) or custom fine-tuned model as an OpenAI-compatible API with a single command. It features a built-in chat UI, state-of-the-art inference performance, and a simplified workflow for production-grade cloud deployment.
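As a sketch of that workflow (the model id below is illustrative; run `openllm` to see the models your installed version supports):

```shell
# Install OpenLLM and start a server (model id is illustrative)
pip install openllm
openllm serve llama3.2:1b

# In another terminal, query the OpenAI-compatible endpoint
# (the server listens on port 3000 by default):
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Hello!"}]}'
```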

☁️ BentoCloud: Unified Inference Platform for any model, on any cloud

BentoCloud is the easiest way to build and deploy with BentoML, in our cloud or yours. It brings fast, scalable inference infrastructure to any cloud, letting AI teams move 10x faster when building AI applications with ML/AI models while reducing compute cost through maximized compute utilization, fast GPU autoscaling, minimal cold starts, and full observability. Sign up today!

⚡ Modular MAX: Hardware-optimized inference across any accelerator

For end-to-end performance and portability across the AI execution stack, see Modular MAX.

🔥 Mojo: A faster future for AI inference

Mojo is on the path to a stable 1.0 release in 2026, delivering semantic versioning and a high-performance language for CPU and GPU programming. As Mojo matures, BentoML will leverage Mojo's compiled performance for CPU and GPU kernels to push model inference performance even further. See the Mojo Roadmap for more information.

```mermaid
flowchart TD
    A["🚀 Your App / API"]
    B["🍱 BentoML — Serving · Orchestration · Batching · Multi-model Chains"]
    C["⚡ Modular MAX — Hardware-optimized Runtimes on Any Accelerator"]
    D["🔥 Mojo — High-performance Kernel & Systems Programming"]
    E["🖥️ Any Hardware · NVIDIA · AMD · CPU"]
    A --> B --> C --> D --> E
```

Get in touch 💬

💬 Join the Modular Discord and Modular Forum

👀 Follow us on X @Modular and LinkedIn

📖 Read our blog

Pinned repositories 📌

  1. BentoML (Python, 8.5k ⭐): The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

  2. OpenLLM (Python, 12.1k ⭐): Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud.

  3. Yatai (TypeScript, 836 ⭐): Model Deployment at Scale on Kubernetes 🦄️

  4. comfy-pack (Python, 207 ⭐): A comprehensive toolkit for reliably locking, packing and deploying environments for ComfyUI workflows.

  5. llm-inference-handbook (TypeScript, 267 ⭐): Everything you need to know about LLM inference

  6. llm-optimizer (Python, 169 ⭐): Benchmark and optimize LLM inference across frameworks with ease
