
Run the latest LLMs and VLMs across GPU, NPU, and CPU, with PC (Python/C++) and mobile (Android & iOS) support. Get running quickly with OpenAI gpt-oss, Granite 4, Qwen3-VL, Gemma 3n, and more.


Nexa AI Banner

🤝 Trusted by Partners

Qualcomm NVIDIA AMD Intel

Documentation | X account | Join us on Discord | Join us on Slack

NexaSDK - Run any AI model on any backend

NexaSDK is an easy-to-use developer toolkit for running any AI model locally — across NPUs, GPUs, and CPUs — powered by our NexaML engine, built entirely from scratch for peak performance on every hardware stack. Unlike wrappers that depend on existing runtimes, NexaML is a unified inference engine built at the kernel level. It’s what lets NexaSDK achieve Day-0 support for new model architectures (LLMs, multimodal, audio, vision). NexaML supports 3 model formats: GGUF, MLX, and Nexa AI's own .nexa format.

⚙️ Differentiation

| Features | NexaSDK | Ollama | llama.cpp | LM Studio |
| --- | --- | --- | --- | --- |
| NPU support | ✅ NPU-first | | | |
| Support any model in GGUF, MLX, NEXA format | ✅ Low-level Control | ⚠️ | | |
| Full multimodality support | ✅ Image, Audio, Text | ⚠️ | ⚠️ | ⚠️ |
| Cross-platform support | ✅ Desktop, Mobile, Automotive, IoT | ⚠️ | ⚠️ | ⚠️ |
| One line of code to run | ✅ | | | |
| OpenAI-compatible API + Function calling | ✅ | | | |

Legend: ✅ Supported   |   ⚠️ Partial or limited support   |   ❌ No


Quick Start

Step 1: Download Nexa CLI with one click

macOS & Windows: download and run the installer from the latest release at https://github.com/NexaAI/nexa-sdk/releases/latest.

Linux: run the install script for your architecture.

For x86_64:

curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh

For arm64:

curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_arm64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
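
After the install script completes, you can verify the CLI is on your PATH by printing its command list:

# verify the install: show all available CLI commands
nexa -h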

Step 2: Run models with one line of code

You can run any compatible GGUF, MLX, or .nexa model from 🤗 Hugging Face using `nexa infer <full-repo-name>`.

GGUF models

Tip

GGUF models run on macOS, Linux, and Windows on CPU/GPU. Note that certain GGUF models are supported only by NexaSDK (e.g. Qwen3-VL-4B and 8B).

📝 Run and chat with LLMs, e.g. Qwen3:

nexa infer ggml-org/Qwen3-1.7B-GGUF

🖼️ Run and chat with Multimodal models, e.g. Qwen3-VL-4B:

nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF

MLX models

Tip

MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results. For example:

📝 Run and chat with LLMs, e.g. Qwen3:

nexa infer NexaAI/Qwen3-4B-4bit-MLX

🖼️ Run and chat with Multimodal models, e.g. Gemma3n:

nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX

Qualcomm NPU models

Tip

You need to download the arm64 build with Qualcomm NPU support, and your laptop must have a Snapdragon® X Elite chip.

Quick Start (Windows arm64, Snapdragon X Elite)

  1. Login & Get Access Token (required for Pro Models)

    • Create an account at sdk.nexa.ai
    • Go to Deployment → Create Token
    • Run this once in your terminal (replace with your token):
      nexa config set license '<your_token_here>'
  2. Run and chat with our multimodal model, OmniNeural-4B, or other models on NPU

nexa infer NexaAI/OmniNeural-4B
nexa infer NexaAI/Granite-4-Micro-NPU
nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU

CLI Reference

| Essential Command | What it does |
| --- | --- |
| `nexa -h` | Show all CLI commands |
| `nexa pull <repo>` | Interactive download & cache of a model |
| `nexa infer <repo>` | Local inference |
| `nexa list` | Show all cached models with sizes |
| `nexa remove <repo>` / `nexa clean` | Delete one / all cached models |
| `nexa serve --host 127.0.0.1:8080` | Launch OpenAI-compatible REST server |
| `nexa run <repo>` | Chat with a model via an existing server |
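
Taken together, a typical session with these commands looks like the following sketch (the model repo is just the GGUF example from the Quick Start):

# download and cache a model from Hugging Face
nexa pull NexaAI/Qwen3-VL-4B-Instruct-GGUF

# list cached models and their sizes
nexa list

# launch the OpenAI-compatible REST server in one terminal...
nexa serve --host 127.0.0.1:8080

# ...then chat with the model through that server from another terminal
nexa run NexaAI/Qwen3-VL-4B-Instruct-GGUF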

👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!

See CLI Reference for full commands.
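
Because `nexa serve` exposes an OpenAI-compatible REST API, standard OpenAI clients and plain HTTP should work against it. A minimal sketch with curl, assuming the server started above and the usual /v1/chat/completions route:

# request a chat completion from the local server
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "NexaAI/Qwen3-VL-4B-Instruct-GGUF",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'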

Acknowledgements

We would like to thank the projects behind the ecosystems NexaSDK supports, including llama.cpp (GGUF) and MLX.
