exo connects all your devices into an AI cluster. exo not only enables running models larger than would fit on a single device, but, with day-0 support for RDMA over Thunderbolt, it also makes models run faster as you add more devices.
- Automatic Device Discovery: Devices running exo automatically discover each other - no manual configuration.
- RDMA over Thunderbolt: exo ships with day-0 support for RDMA over Thunderbolt 5, enabling a 99% reduction in latency between devices.
- Topology-Aware Auto Parallel: exo figures out the best way to split your model across all available devices based on a real-time view of your device topology, taking into account each device's resources and the latency and bandwidth of every network link.
- Tensor Parallelism: exo supports sharding models across devices, for up to a 1.8x speedup on 2 devices and a 3.2x speedup on 4 devices.
- MLX Support: exo uses MLX as an inference backend and MLX distributed for distributed communication.
- Qwen3-235B (8-bit) on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA
- DeepSeek v3.1 671B (8-bit) on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA
- Kimi K2 Thinking (native 4-bit) on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA
Source: Jeff Geerling, "15 TB VRAM on Mac Studio – RDMA over Thunderbolt 5"
Devices running exo automatically discover each other, without needing any manual configuration. Each device provides an API and a dashboard for interacting with your cluster, served at http://localhost:52415.
There are two ways to run exo: from source, or with the macOS app.
Prerequisites:
- brew (for simple package management on macOS)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- uv (for Python dependency management)
- macmon (for hardware monitoring on Apple Silicon)
- node (for building the dashboard)
brew install uv macmon node
- rust (to build the Rust bindings; nightly toolchain for now)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup toolchain install nightly
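To sanity-check that the build tools are installed before continuing (an optional, illustrative check):
uv --version
node --version
rustc +nightly --version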
Clone the repo, build the dashboard, and run exo:
# Clone exo
git clone https://github.com/exo-explore/exo
# Build dashboard
cd exo/dashboard && npm install && npm run build && cd ..
# Run exo
uv run exo
This starts the exo dashboard and API at http://localhost:52415/.
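Once it is running, you can confirm the API is reachable with a quick check against the models endpoint described further below:
curl http://localhost:52415/models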
exo ships a macOS app that runs in the background on your Mac.
The macOS app requires macOS Tahoe 26.2 or later.
Download the latest build here: EXO-latest.dmg.
The app will ask for permission to modify system settings and install a new Network profile. Improvements to this are being worked on.
If you prefer to interact with exo via the API, here is an example that creates an instance of a small model (mlx-community/Llama-3.2-1B-Instruct-4bit), sends a chat completions request, and deletes the instance.
1. Preview instance placements
The /instance/previews endpoint will preview all valid placements for your model.
curl "http://localhost:52415/instance/previews?model_id=llama-3.2-1b"Sample response:
{
"previews": [
{
"model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"sharding": "Pipeline",
"instance_meta": "MlxRing",
"instance": {...},
"memory_delta_by_node": {"local": 729808896},
"error": null
}
// ...possibly more placements...
]
}
This will return all valid placements for this model. Pick a placement that you like.
To pick the first one, pipe into jq:
curl "http://localhost:52415/instance/previews?model_id=llama-3.2-1b" | jq -c '.previews[] | select(.error == null) | .instance' | head -n12. Create a model instance
2. Create a model instance
Send a POST to /instance with your desired placement (copied from step 1) in the instance field; the full payload must match the types in CreateInstanceParams:
curl -X POST http://localhost:52415/instance \
-H 'Content-Type: application/json' \
-d '{
"instance": {...}
}'
Sample response:
{
"message": "Command received.",
"command_id": "e9d1a8ab-...."
}
3. Send a chat completion
Now, make a POST to /v1/chat/completions (the same format as OpenAI's API):
curl -N -X POST http://localhost:52415/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [
{"role": "user", "content": "What is Llama 3.2 1B?"}
],
"stream": true
}'
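If you just want the reply text rather than the raw stream, a non-streaming variant can be piped through jq (a sketch assuming the response follows OpenAI's non-streaming schema and that jq is installed):
curl -s -X POST http://localhost:52415/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "What is Llama 3.2 1B?"}],
    "stream": false
  }' | jq -r '.choices[0].message.content'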
4. Delete the instance
When you're done, delete the instance by its ID (find it via /state or /instance endpoints):
curl -X DELETE http://localhost:52415/instance/YOUR_INSTANCE_ID
Other useful API endpoints:
- List all models:
curl http://localhost:52415/models
- Inspect instance IDs and deployment state:
curl http://localhost:52415/state
For further details, see API types and endpoints in src/exo/master/api.py.
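Putting the steps above together, here is an end-to-end sketch (assumes jq is installed; values are taken from the examples in this section, and the final delete is left commented out because the instance ID has to be read from the /state output):
#!/usr/bin/env bash
set -euo pipefail
BASE=http://localhost:52415

# 1. Pick the first valid placement for the model
PLACEMENT=$(curl -s "$BASE/instance/previews?model_id=llama-3.2-1b" \
  | jq -c '[.previews[] | select(.error == null)][0].instance')

# 2. Create an instance with that placement
curl -s -X POST "$BASE/instance" \
  -H 'Content-Type: application/json' \
  -d "{\"instance\": $PLACEMENT}"

# 3. Send a chat completion (OpenAI-compatible format)
curl -s -N -X POST "$BASE/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "What is Llama 3.2 1B?"}],
    "stream": true
  }'

# 4. Find the instance ID in the /state output, then delete it
curl -s "$BASE/state"
# curl -X DELETE "$BASE/instance/YOUR_INSTANCE_ID"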
On macOS, exo uses the GPU. On Linux, exo currently runs on CPU. We are working on extending hardware accelerator support. If you'd like support for a new hardware platform, please search for an existing feature request and add a thumbs up so we know what hardware is important to the community.
See CONTRIBUTING.md for guidelines on how to contribute to exo.