
mlx-openai-server

MIT License · Python 3.11

A high-performance OpenAI-compatible API server for MLX models. Run text, vision, audio, and image generation models locally on Apple Silicon with a drop-in OpenAI replacement.

Note: Requires macOS with M-series chips (MLX is optimized for Apple Silicon).



5-Second Quick Start

mlx-openai-server launch --model-path mlx-community/Qwen3-Coder-Next-4bit --model-type lm

Then point your OpenAI client to http://localhost:8000/v1. For full setup, see Installation and Quick Start.


Key Features

  • 🚀 OpenAI-compatible API - Drop-in replacement for OpenAI services
  • 🖼️ Multimodal support - Text, vision, audio, and image generation/editing
  • 🎨 Flux-series models - Image generation (schnell, dev, krea-dev, flux-2-klein) and editing (kontext, qwen-image-edit)
  • 🔌 Easy integration - Works with existing OpenAI client libraries
  • 📦 Multi-model mode - Run multiple models in one server via a YAML config; route requests by model ID
  • ⚡ Performance - Configurable quantization (4/8/16-bit), context length, and speculative decoding (lm)
  • 🎛️ LoRA adapters - Fine-tuned image generation and editing
  • 📈 Queue management - Built-in request queuing and monitoring

Installation

Prerequisites

  • macOS with Apple Silicon (M-series)
  • Python 3.11+

Quick Install

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install from PyPI
pip install mlx-openai-server

# Or install from GitHub
pip install git+https://github.com/cubist38/mlx-openai-server.git

Optional: Whisper Support

For audio transcription models, install ffmpeg:

brew install ffmpeg

Quick Start

Start the Server

# Text-only or multimodal models
mlx-openai-server launch \
  --model-path <path-to-mlx-model> \
  --model-type <lm|multimodal>

# Text-only with speculative decoding (faster generation using a smaller draft model)
mlx-openai-server launch \
  --model-path <path-to-main-model> \
  --model-type lm \
  --draft-model-path <path-to-draft-model> \
  --num-draft-tokens 4

# Image generation (Flux-series)
mlx-openai-server launch \
  --model-type image-generation \
  --model-path <path-to-flux-model> \
  --config-name flux-dev \
  --quantize 8

# Image editing
mlx-openai-server launch \
  --model-type image-edit \
  --model-path <path-to-flux-model> \
  --config-name flux-kontext-dev \
  --quantize 8

# Embeddings
mlx-openai-server launch \
  --model-type embeddings \
  --model-path <embeddings-model-path>

# Whisper (audio transcription)
mlx-openai-server launch \
  --model-type whisper \
  --model-path mlx-community/whisper-large-v3-mlx

Server Parameters

| Parameter | Required | Type | Default | Description |
|---|---|---|---|---|
| **Required parameters** | | | | |
| `--model-path` | Yes | path | | Path to MLX model (local or Hugging Face repo) |
| `--model-type` | Yes | string | | `lm`, `multimodal`, `image-generation`, `image-edit`, `embeddings`, or `whisper` |
| **Model configuration** | | | | |
| `--config-name` | No* | string | | Image models: `flux-schnell`, `flux-dev`, `flux-krea-dev`, `flux-kontext-dev`, `flux2-klein-4b`, `flux2-klein-9b`, `qwen-image`, `qwen-image-edit`, `z-image-turbo`, `fibo` |
| `--quantize` | No | int | | Quantization level: 4, 8, or 16 (image models) |
| `--context-length` | No | int | | Max sequence length for memory optimization |
| **Sampling parameters (used when the API request omits them)** | | | | |
| `--max-tokens` | No | int | 100000 | Default maximum tokens to generate |
| `--temperature` | No | float | 1.0 | Default sampling temperature |
| `--top-p` | No | float | 1.0 | Default nucleus sampling (top-p) probability |
| `--top-k` | No | int | 20 | Default top-k sampling parameter |
| **Speculative decoding (`lm` only)** | | | | |
| `--draft-model-path` | No | path | | Path to draft model for speculative decoding |
| `--num-draft-tokens` | No | int | 2 | Draft tokens per verification step |
| **Advanced options** | | | | |
| `--lora-paths` | No | string | | Comma-separated LoRA adapter paths (image models) |
| `--lora-scales` | No | string | | Comma-separated LoRA scales (must match the number of paths) |
| `--log-level` | No | string | INFO | `DEBUG`, `INFO`, `WARNING`, `ERROR`, or `CRITICAL` |
| `--no-log-file` | No | flag | false | Disable file logging (console only) |

*Required for image-generation and image-edit model types.
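
These sampling defaults act as fallbacks: a request-level value always wins. A minimal sketch of that merge logic (the names and structure here are illustrative, not the server's actual internals):

```python
# Illustrative only: how server-level sampling defaults can back-fill
# fields that an API request leaves unset (None).
SERVER_DEFAULTS = {
    "max_tokens": 100000,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 20,
}

def merge_sampling(request: dict) -> dict:
    """Return request params with server defaults filling unset fields."""
    merged = dict(SERVER_DEFAULTS)
    merged.update({k: v for k, v in request.items() if v is not None})
    return merged

# A request that only sets temperature keeps the other defaults.
params = merge_sampling({"temperature": 0.2, "max_tokens": None})
```

With this merge, a request that sets only `temperature` still inherits the documented defaults for `max_tokens`, `top_p`, and `top_k`.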

Launching Multiple Models

You can run several models in one server using a YAML config file. Each model gets its own handler; requests are routed by the model ID you use in the API (the model field in the request).

Video: Serving Multiple Models at Once? mlx-openai-server + OpenWebUI Test

Start with a config file

mlx-openai-server launch --config config.yaml

You must provide either --config (multi-handler) or --model-path (single model). You cannot mix them.

YAML config format

Create a YAML file with a server section (host, port, logging) and a models list. Each entry in models defines one model and supports the same options as the CLI (model path, type, context length, queue settings, etc.).

| Key | Required | Description |
|---|---|---|
| `model_path` | Yes | Path or Hugging Face repo of the model |
| `model_type` | No | `lm`, `multimodal`, `image-generation`, `image-edit`, `embeddings`, or `whisper` (default: `lm`) |
| `model_id` | No | ID used in API requests; defaults to `model_path` if omitted |
| `context_length` | No | Max context length (`lm`/`multimodal`) |
| `max_concurrency`, `queue_timeout`, `queue_size` | No | Per-model queue settings |

Example config.yaml:

server:
  host: "0.0.0.0"
  port: 8000
  log_level: INFO
  # log_file: logs/app.log     # uncomment to log to file
  # no_log_file: true           # uncomment to disable file logging

models:
  # Language model
  - model_path: mlx-community/GLM-4.7-Flash-8bit
    model_type: lm
    model_id: glm-4.7-flash    # optional alias (defaults to model_path)
    enable_auto_tool_choice: true
    tool_call_parser: glm4_moe
    reasoning_parser: glm47_flash
    message_converter: glm4_moe

  # Another language model
  - model_path: mlx-community/Qwen3-Coder-Next-4bit
    model_type: lm
    max_concurrency: 1
    tool_call_parser: qwen3_coder
    message_converter: qwen3_coder


  - model_path: mlx-community/Qwen3-VL-2B-Instruct-4bit
    model_type: multimodal
    tool_call_parser: qwen3_vl

  - model_path: black-forest-labs/FLUX.2-klein-4B
    model_type: image-generation
    config_name: flux2-klein-4b
    quantize: 4
    model_id: flux2-klein-4b

A full example is in examples/config.yaml.
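
The defaulting rules in the table above can be sketched as a small validation step (this mirrors the documented behavior; it is not the server's actual loader code):

```python
# Sketch of the documented config rules: model_path is required,
# model_type defaults to "lm", and model_id falls back to model_path.
VALID_TYPES = {"lm", "multimodal", "image-generation", "image-edit",
               "embeddings", "whisper"}

def normalize_model_entry(entry: dict) -> dict:
    """Validate one `models:` entry and apply the documented defaults."""
    if "model_path" not in entry:
        raise ValueError("model_path is required for every model entry")
    model_type = entry.get("model_type", "lm")
    if model_type not in VALID_TYPES:
        raise ValueError(f"unknown model_type: {model_type}")
    return {
        **entry,
        "model_type": model_type,
        "model_id": entry.get("model_id", entry["model_path"]),
    }

# An entry without model_id is addressed by its full path in requests.
entry = normalize_model_entry(
    {"model_path": "mlx-community/Qwen3-Coder-Next-4bit"}
)
```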

Multi-handler process isolation (HandlerProcessProxy)

In multi-handler mode, each model runs in a dedicated subprocess spawned via multiprocessing.get_context("spawn"). The main FastAPI process uses a HandlerProcessProxy to forward requests to the child process over multiprocessing queues.

This design prevents MLX Metal/GPU semaphore leaks on macOS. When MLX arrays or Metal runtime state are shared across forked processes, the resource tracker can report leaked semaphore objects at shutdown (ml-explore/mlx#2457). Using spawn instead of the default fork gives each model a clean Metal context, avoiding those warnings.

┌─────────────────────────────────────┐     ┌─────────────────────────────────────┐
│  Main Process (FastAPI)             │     │  Child Process (Handler)            │
│  ┌───────────────────────────────┐  │     │  ┌───────────────────────────────┐  │
│  │  HandlerProcessProxy          │  │     │  │  Concrete handler (e.g.       │  │
│  │  • request_queue ─────────────┼──┼─────┼─>│    MLXLMHandler)              │  │
│  │  • response_queue <───────────┼──┼─────┼──│  • Model (MLX_LM)             │  │
│  │  • generate_*() forwards RPC  │  │     │  │  • InferenceWorker (thread)   │  │
│  └───────────────────────────────┘  │     │  └───────────────────────────────┘  │
└─────────────────────────────────────┘     └─────────────────────────────────────┘

The proxy exposes the same interface as the concrete handlers (generate_text_stream, generate_embeddings_response, etc.), so API endpoints work without changes. Requests and responses are serialized across the process boundary via queues; non-picklable objects (e.g. uploaded files) are pre-processed in the main process before being sent as file paths.

Using the API with multiple models

Set the model field in your request to the model ID (the model_id from the config, or model_path if you did not set model_id). The server looks up the handler for that ID and runs the request on the correct model.

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Use the first model (glm-4.7-flash alias from the config)
r1 = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(r1.choices[0].message.content)

# Use the second model (full path as model_id)
r2 = client.chat.completions.create(
    model="mlx-community/Qwen3-Coder-Next-4bit",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(r2.choices[0].message.content)

  • GET /v1/models returns all loaded models (their IDs).
  • If you send a model that is not in the config, the server returns 404 with an error listing the available models.
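
Conceptually, the lookup is a plain dictionary from model ID to handler; a toy sketch of the routing rule (the handler values here are placeholders):

```python
# Toy router mirroring the lookup described above: the request's
# "model" field selects a handler; unknown IDs raise an error that
# lists the available models (the real server returns HTTP 404).
handlers = {
    "glm-4.7-flash": "handler-for-glm",
    "mlx-community/Qwen3-Coder-Next-4bit": "handler-for-qwen",
}

def route(model_id: str) -> str:
    if model_id not in handlers:
        raise KeyError(
            f"model '{model_id}' not found; available: {sorted(handlers)}"
        )
    return handlers[model_id]
```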

Supported Model Types

  1. Text-only (lm) - Language models via mlx-lm
  2. Multimodal (multimodal) - Text, images, audio via mlx-vlm
  3. Image generation (image-generation) - Flux-series, Qwen Image, Z-Image Turbo, Fibo
  4. Image editing (image-edit) - Flux kontext, Qwen Image Edit
  5. Embeddings (embeddings) - Text embeddings via mlx-embeddings
  6. Whisper (whisper) - Audio transcription (requires ffmpeg)

Image Model Configurations

Generation:

  • flux-schnell - Fast (4 steps, no guidance)
  • flux-dev - Balanced (25 steps, 3.5 guidance)
  • flux-krea-dev - High quality (28 steps, 4.5 guidance)
  • flux2-klein-4b / flux2-klein-9b - Flux 2 Klein models
  • qwen-image - Qwen image generation (50 steps, 4.0 guidance)
  • z-image-turbo - Z-Image Turbo
  • fibo - Fibo model

Editing:

  • flux-kontext-dev - Context-aware editing (28 steps, 2.5 guidance)
  • flux2-klein-edit-4b / flux2-klein-edit-9b - Flux 2 Klein editing
  • qwen-image-edit - Qwen image editing (50 steps, 4.0 guidance)

Common Use Cases

| Use Case | One-liner Launch |
|---|---|
| Text generation | `mlx-openai-server launch --model-type lm --model-path <path>` |
| Vision Q&A | `mlx-openai-server launch --model-type multimodal --model-path <path>` |
| Image generation | `mlx-openai-server launch --model-type image-generation --model-path <path> --config-name flux-dev` |
| Image editing | `mlx-openai-server launch --model-type image-edit --model-path <path> --config-name flux-kontext-dev` |
| Audio transcription | `mlx-openai-server launch --model-type whisper --model-path mlx-community/whisper-large-v3-mlx` |
| Embeddings | `mlx-openai-server launch --model-type embeddings --model-path <path>` |

Using the API

The server provides OpenAI-compatible endpoints. Use standard OpenAI client libraries:

Text Completion

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)

Vision (Multimodal)

import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode('utf-8')

response = client.chat.completions.create(
    model="local-multimodal",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        ]
    }]
)
print(response.choices[0].message.content)

Image Generation

import openai
import base64
from io import BytesIO
from PIL import Image

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.images.generate(
    prompt="A serene landscape with mountains and a lake at sunset",
    model="local-image-generation-model",
    size="1024x1024"
)

image_data = base64.b64decode(response.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()

Image Editing

import openai
import base64
from io import BytesIO
from PIL import Image

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.png", "rb") as f:
    result = client.images.edit(
        image=f,
        prompt="make it like a photo in 1800s",
        model="flux-kontext-dev"
    )

image_data = base64.b64decode(result.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()

Function Calling

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

messages = [{"role": "user", "content": "What is the weather in Tokyo?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather in a given city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"}
            }
        }
    }
}]

completion = client.chat.completions.create(
    model="local-model",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

if completion.choices[0].message.tool_calls:
    tool_call = completion.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

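The server only reports the tool call; your application is responsible for executing it and sending the result back in a follow-up tool message. A minimal dispatch sketch, where the get_weather implementation is a stand-in:

```python
import json

# Hypothetical local implementation of the get_weather tool declared
# above; a real application would call an actual weather service.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(name: str, arguments: str) -> dict:
    """Decode the model's JSON-encoded arguments and run the tool."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**json.loads(arguments))

# In practice, name and arguments come from tool_call.function above.
result = dispatch_tool_call("get_weather", '{"city": "Tokyo"}')
```

The result is then appended to the conversation as a `{"role": "tool", ...}` message (with the matching `tool_call_id`) and the model is called again to produce the final answer.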
Embeddings

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.embeddings.create(
    model="local-model",
    input=["The quick brown fox jumps over the lazy dog"]
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")

Responses API

The server exposes the OpenAI Responses API at POST /v1/responses. Use client.responses.create() with the OpenAI SDK for text and multimodal (lm/multimodal) models.

Text input (non-streaming):

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.responses.create(
    model="local-model",
    input="Tell me a three sentence bedtime story about a unicorn."
)
# response.output contains reasoning and message items
for item in response.output:
    if item.type == "message":
        for part in item.content:
            if getattr(part, "text", None):
                print(part.text)

Text input (streaming):

response = client.responses.create(
    model="local-model",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)
for chunk in response:
    print(chunk)

Image input (vision / multimodal):

response = client.responses.create(
    model="local-multimodal",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this image?"},
                {
                    "type": "input_image",
                    "image_url": "path/to/image.jpg",
                    "detail": "low"
                }
            ]
        }
    ]
)

Function calling:

tools = [{
    "type": "function",
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "The city and state"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location", "unit"]
    }
}]

response = client.responses.create(
    model="local-model",
    tools=tools,
    input="What is the weather like in Boston today?",
    tool_choice="auto"
)

Structured outputs (Pydantic):

from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip: str

response = client.responses.parse(
    model="local-model",
    input=[{"role": "user", "content": "Format: 1 Hacker Wy Menlo Park CA 94025"}],
    text_format=Address
)
address = response.output_parsed  # Pydantic model instance
print(address)

See examples/responses_api.ipynb for full examples including streaming, image input, tool calls, and structured outputs.

Structured Outputs (JSON Schema)

import openai
import json

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "Address",
        "schema": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "state": {"type": "string"},
                "zip": {"type": "string"}
            },
            "required": ["street", "city", "state", "zip"]
        }
    }
}

completion = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Format: 1 Hacker Wy Menlo Park CA 94025"}],
    response_format=response_format
)

address = json.loads(completion.choices[0].message.content)
print(json.dumps(address, indent=2))

Advanced Configuration

Parser Configuration

For models requiring custom parsing (tool calls, reasoning):

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --tool-call-parser qwen3 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice

Available parsers: qwen3, glm4_moe, qwen3_coder, qwen3_moe, qwen3_next, qwen3_vl, harmony, minimax_m2

Message Converters

For models requiring message format conversion:

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --message-converter glm4_moe

Available converters: glm4_moe, minimax_m2, nemotron3_nano, qwen3_coder

Custom Chat Templates

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --chat-template-file /path/to/template.jinja

Speculative Decoding (lm)

Use a smaller draft model to propose tokens and verify them with the main model for faster text generation. Supported only for --model-type lm.

mlx-openai-server launch \
  --model-path mlx-community/MyModel-8B-4bit \
  --model-type lm \
  --draft-model-path mlx-community/MyModel-1B-4bit \
  --num-draft-tokens 4

  • --draft-model-path: Path or Hugging Face repo of the draft model (a smaller, compatible model).
  • --num-draft-tokens: Number of tokens the draft model proposes per verification step (default: 2). Higher values can increase throughput at the cost of more draft compute.
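
As a rough intuition for choosing --num-draft-tokens, suppose each draft token is accepted independently with probability p and every verification step yields at least one token from the main model. Under that simplifying assumption (real acceptance rates vary by model pair and prompt), the expected tokens per step form a geometric series:

```python
# Simplifying assumption: each draft token is accepted independently
# with probability p, and the main model always contributes one token
# per verification step. Then the expected tokens per step is the
# geometric sum 1 + p + p**2 + ... + p**k.
def expected_tokens_per_step(p: float, k: int) -> float:
    return sum(p ** i for i in range(k + 1))

# With an illustrative 80% acceptance rate, going from 2 to 4 draft
# tokens helps, but with diminishing returns:
e2 = expected_tokens_per_step(0.8, 2)  # 1 + 0.8 + 0.64 = 2.44
e4 = expected_tokens_per_step(0.8, 4)  # ≈ 3.36
```

This is a back-of-envelope model, not a measurement; benchmark with your own model pair before settling on a value.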

Request Queue System

The server includes a request queue system with monitoring:

# Check queue status
curl http://localhost:8000/v1/queue/stats

Response:

{
  "status": "ok",
  "queue_stats": {
    "running": true,
    "queue_size": 3,
    "max_queue_size": 100,
    "active_requests": 1,
    "max_concurrency": 1
  }
}
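
A client can poll this endpoint to decide whether to back off. The sketch below parses the sample payload above and derives a queue-fullness ratio (a convenience metric computed client-side, not a field the server returns):

```python
import json

# Sample payload from GET /v1/queue/stats (shown above).
payload = json.loads("""
{
  "status": "ok",
  "queue_stats": {
    "running": true,
    "queue_size": 3,
    "max_queue_size": 100,
    "active_requests": 1,
    "max_concurrency": 1
  }
}
""")

stats = payload["queue_stats"]
fullness = stats["queue_size"] / stats["max_queue_size"]
# Back off when the queue is nearly full or the worker is stopped;
# the 0.9 threshold is an arbitrary client-side choice.
should_back_off = fullness > 0.9 or not stats["running"]
```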

Example Notebooks

Check the examples/ directory for comprehensive guides:

| Category | Notebooks | Description |
|---|---|---|
| Text & Chat | `responses_api.ipynb`, `simple_rag_demo.ipynb` | Responses API (text, image, tools, streaming, structured outputs); RAG pipeline demo |
| Vision | `vision_examples.ipynb` | Vision capabilities |
| Audio | `audio_examples.ipynb`, `transcription_examples.ipynb` | Audio processing and transcription |
| Embeddings | `embedding_examples.ipynb`, `lm_embeddings_examples.ipynb`, `vlm_embeddings_examples.ipynb` | Text, LM, and VLM embeddings |
| Images | `image_generations.ipynb`, `image_edit.ipynb` | Image generation and editing |
| Advanced | `structured_outputs_examples.ipynb` | JSON schema / structured outputs |

Large Models

For large models that may not fit within the default memory limits, you can improve performance on macOS 15.0+ by running:

bash configure_mlx.sh

This raises the system's wired memory limit for better performance.


Troubleshooting

| Issue | Solution |
|---|---|
| Memory problems | Use `--quantize 4` or `8` for image models; reduce `--context-length` for `lm`/`multimodal`. Run `configure_mlx.sh` on macOS 15+ to raise wired memory limits. |
| Model download issues | Ensure `transformers` and `huggingface_hub` are installed. Check network access; some models require Hugging Face login. |
| Port already in use | Use `--port` to specify a different port (e.g. `--port 8001`). |
| Quantization questions | For `lm`/`multimodal`, use pre-quantized models from `mlx-community`. For image models, use `--quantize 4` or `8`. |
| Metal/semaphore warnings | Use multi-handler mode (`--config`); each model runs in a spawned subprocess to avoid Metal context issues. |
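
For the memory row above, a back-of-envelope estimate helps pick a quantization level: model weights need roughly params × bits / 8 bytes, ignoring KV cache, activations, and runtime overhead (which add more):

```python
# Back-of-envelope weight memory: params * bits / 8 bytes. Treat the
# result as a lower bound, not a precise requirement, since KV cache
# and activations are not included.
def weight_gb(params_billion: float, bits: int) -> float:
    # 1e9 params * (bits / 8) bytes each, expressed in GB.
    return params_billion * bits / 8

sizes = {bits: weight_gb(7, bits) for bits in (16, 8, 4)}
# A 7B model: ~14 GB at 16-bit, ~7 GB at 8-bit, ~3.5 GB at 4-bit.
```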

Quick Reference Card

# Text (language model)
mlx-openai-server launch --model-type lm --model-path <path>

# Vision (multimodal)
mlx-openai-server launch --model-type multimodal --model-path <path>

# Image generation
mlx-openai-server launch --model-type image-generation --model-path <path> --config-name flux-dev

# Image editing
mlx-openai-server launch --model-type image-edit --model-path <path> --config-name flux-kontext-dev

# Embeddings
mlx-openai-server launch --model-type embeddings --model-path <path>

# Whisper (audio transcription)
mlx-openai-server launch --model-type whisper --model-path mlx-community/whisper-large-v3-mlx

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

Follow Conventional Commits for commit messages.

License

MIT License - see LICENSE file for details.

Acknowledgments

Built on top of:

  • mlx-lm - text generation
  • mlx-vlm - multimodal models
  • mlx-embeddings - text embeddings
  • FastAPI - the web framework

