NexaSDK is an easy-to-use developer toolkit for running any AI model locally on NPUs, GPUs, and CPUs. It is powered by our NexaML engine, built entirely from scratch for peak performance on every hardware stack. Unlike wrappers that depend on existing runtimes, NexaML is a unified inference engine built at the kernel level; that is what lets NexaSDK achieve Day-0 support for new model architectures (LLMs, multimodal, audio, vision). NexaML supports three model formats: GGUF, MLX, and Nexa AI's own .nexa format.
| Features | NexaSDK | Ollama | llama.cpp | LM Studio |
| --- | --- | --- | --- | --- |
| NPU support | ✅ NPU-first | ❌ | ❌ | ❌ |
| Support any model in GGUF, MLX, NEXA format | ✅ Low-level control | ❌ | ❌ | |
| Full multimodality support | ✅ Image, audio, text | | | |
| Cross-platform support | ✅ Desktop, mobile, automotive, IoT | | | |
| One line of code to run | ✅ | ✅ | ✅ | |
| OpenAI-compatible API + function calling | ✅ | ✅ | ✅ | ✅ |

Legend: ✅ Supported · ❌ Not supported
- 📣 Day-0 support for Qwen3-VL-4B and 8B in GGUF, MLX, and .nexa formats on NPU/GPU/CPU. We are the only framework that supports these models in GGUF format. Featured in Qwen's post about our partnership.
- 📣 Day-0 support for IBM Granite 4.0 on NPU/GPU/CPU. The NexaML engine was featured right next to vLLM, llama.cpp, and MLX in IBM's blog.
- 📣 Day-0 support for Google EmbeddingGemma on NPU. Featured in Google's social post.
- 📣 Vision support for Gemma3n: the first-ever Gemma-3n multimodal inference on GPU & CPU, in GGUF format.
- 📣 AMD NPU support for SDXL image generation
- 📣 Intel NPU support for DeepSeek-R1-Distill-Qwen-1.5B and Llama3.2-3B
- 📣 Apple Neural Engine support for real-time speech recognition with the Parakeet v3 model
```bash
# Linux x86_64
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh

# Linux arm64
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_arm64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
```
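Once the installer finishes, a quick way to confirm the `nexa` CLI landed on your PATH is to print its help. This is only a sanity check; it uses the `nexa -h` command documented in the CLI table below:

```bash
# Should list all available nexa subcommands if the install succeeded
nexa -h
```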
You can run any compatible GGUF, MLX, or .nexa model from 🤗 Hugging Face with `nexa infer <full repo name>`.
> [!TIP]
> GGUF runs on macOS, Linux, and Windows on CPU/GPU. Note that certain GGUF models are supported only by NexaSDK (e.g. Qwen3-VL-4B and 8B).
📝 Run and chat with LLMs, e.g. Qwen3:
```bash
nexa infer ggml-org/Qwen3-1.7B-GGUF
```
🖼️ Run and chat with Multimodal models, e.g. Qwen3-VL-4B:
```bash
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF
```
> [!TIP]
> MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results, for example:
📝 Run and chat with LLMs, e.g. Qwen3:
```bash
nexa infer NexaAI/Qwen3-4B-4bit-MLX
```
🖼️ Run and chat with Multimodal models, e.g. Gemma3n:
```bash
nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX
```
> [!TIP]
> Download the arm64 build with Qualcomm NPU support and make sure your laptop has a Snapdragon® X Elite chip.
- Login & Get Access Token (required for Pro Models)
  - Create an account at sdk.nexa.ai
  - Go to Deployment → Create Token
  - Run this once in your terminal (replace `<your_token_here>` with your token):

    ```bash
    nexa config set license '<your_token_here>'
    ```

- Run and chat with our multimodal model, OmniNeural-4B, or other models on NPU

  ```bash
  nexa infer NexaAI/OmniNeural-4B
  nexa infer NexaAI/Granite-4-Micro-NPU
  nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
  ```
| Essential Command | What it does |
| --- | --- |
| `nexa -h` | Show all CLI commands |
| `nexa pull <repo>` | Interactive download & cache of a model |
| `nexa infer <repo>` | Local inference |
| `nexa list` | Show all cached models with sizes |
| `nexa remove <repo>` / `nexa clean` | Delete one / all cached models |
| `nexa serve --host 127.0.0.1:8080` | Launch OpenAI-compatible REST server |
| `nexa run <repo>` | Chat with a model via an existing server |
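The cache-management commands above compose into a quick download/inspect/clean-up loop. A minimal sketch, reusing a model repo that appears earlier in this README:

```bash
nexa pull ggml-org/Qwen3-1.7B-GGUF    # interactive download & cache
nexa list                             # verify the model is cached and check its size
nexa remove ggml-org/Qwen3-1.7B-GGUF  # free the space when you are done
```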
👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!
See CLI Reference for full commands.
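Since `nexa serve` launches an OpenAI-compatible REST server, a standard OpenAI-style request should work against it. A minimal sketch, assuming the server is running at `127.0.0.1:8080`, that the chat endpoint follows the usual OpenAI path `/v1/chat/completions`, and that the `model` field takes a cached repo name (the last two are assumptions; see the CLI Reference for the exact contract):

```bash
# Start the server in one terminal:
#   nexa serve --host 127.0.0.1:8080
# Then send an OpenAI-style chat completion request.
# Endpoint path and model-name format are assumptions based on the OpenAI convention.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ggml-org/Qwen3-1.7B-GGUF",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```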
We would like to thank the following projects: