Note
OpenArc is under active development.
OpenArc is an inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS, Embedding and Reranker models over OpenAI compatible endpoints, powered by OpenVINO on your device. Local, private, open source AI.
OpenArc 2.0 arrives with more endpoints, better UX, pipeline parallelism, NPU support, and much more!
Drawing on ideas from llama.cpp, vLLM, transformers, OpenVINO Model Server, Ray, Lemonade, and other projects cited below, OpenArc has been a way for me to learn about inference engines by trying to build one myself.
Along the way a Discord community has formed around this project, which was unexpected! If you are interested in using Intel devices for AI and machine learning, feel free to stop by.
Thanks to everyone on Discord for their continued support!
- Features
- Quickstart
- OpenArc CLI
- Model Sources
- Converting Models to OpenVINO IR
- Learning Resources
- Acknowledgments
OpenArc 2.0 arrives with more endpoints, better UX, pipeline parallelism, NPU support, and much more!
- Multi GPU Pipeline Parallel
- CPU offload/Hybrid device
- NPU device support
- OpenAI compatible endpoints (see the example request after this list):
  - /v1/models
  - /v1/completions: llm only
  - /v1/chat/completions
  - /v1/audio/transcriptions: whisper only
  - /v1/audio/speech: kokoro only
  - /v1/embeddings: qwen3-embedding, #33 by @mwrothbe
  - /v1/rerank: qwen3-reranker, #39 by @mwrothbe
- jinja templating with AutoTokenizer
- OpenAI compatible tool calls with streaming and parallel calls
- tool call parser currently reads "name", "argument"
- Fully async multi engine, multi task architecture
- Model concurrency: load and infer multiple models at once
- Automatic unload on inference failure
- llama-bench style benchmarking for llm with an automatic sqlite database
- metrics on every request
- ttft
- prefill_throughput
- decode_throughput
- decode_duration
- tpot
- load time
- stream mode
- More OpenVINO [examples](examples)
- OpenVINO implementation of hexgrad/Kokoro-82M
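As a quick taste of the OpenAI compatible endpoints listed above, here is a minimal request sketch against a locally running server. It assumes the default address from openarc serve (0.0.0.0:8000), a model that has already been loaded with openarc load, and that the API key is passed as a standard Bearer token; adjust the model name and address to your setup.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENARC_API_KEY" \
  -d '{
        "model": "<model-name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": false
      }'

Because the endpoints follow the OpenAI schema, existing OpenAI client libraries should also work by pointing their base URL at the OpenArc server.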
Note
Interested in contributing? Please open an issue before submitting a PR!
Linux
- OpenVINO requires device-specific drivers.
- Visit OpenVINO System Requirements for the latest information on drivers.
- Install uv from astral
- After cloning, use:
uv sync
- Activate your environment with:
source .venv/bin/activate
Build the latest Optimum-Intel:
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
Build the latest OpenVINO and OpenVINO GenAI from nightly wheels:
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
- Set your API key as an environment variable:
export OPENARC_API_KEY=<api-key>
- To get started, run:
openarc --help
Windows
- OpenVINO requires device-specific drivers.
- Visit OpenVINO System Requirements to get the latest information on drivers.
- Install uv from astral
- Clone OpenArc, enter the directory, and run:
uv sync
- Activate your environment with:
.venv\Scripts\activate
Build the latest Optimum-Intel:
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
Build the latest OpenVINO and OpenVINO GenAI from nightly wheels:
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
- Set your API key as an environment variable:
setx OPENARC_API_KEY <api-key>
- To get started, run:
openarc --help
Note
Need help installing drivers? Join our Discord or open an issue.
Note
uv has a pip interface that is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start.
This section documents the CLI commands available to you.
All commands have aliases but are written out in full here for clarity.
openarc add
Add a model to openarc_config.json for easy loading with openarc load.
openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device>
openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device> --vlm-type <vlm-type>
Getting VLM to work the way I wanted required using VLMPipeline in ways that are not well documented. You can look at the code to see where the magic happens.
vlm-type maps a vision token for a given architecture using strings like qwen25vl, phi4mm and more. Use openarc add --help to see the available options. The server will complain if you get anything wrong, so it should be easy to figure out.
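For a concrete sketch, an add command for the Qwen2.5-VL export listed under Model Sources might look like the line below. The model path and device are placeholders for your own setup, and the engine and model-type values (ovgenai, vlm) are assumptions here; check openarc add --help for the exact choices.

openarc add --model-name qwen2.5-vl-7b --model-path <path/to/Qwen2.5-VL-7B-Instruct-int4_sym-ov> --engine ovgenai --model-type vlm --device GPU.0 --vlm-type qwen25vl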
openarc add --model-name <model-name> --model-path <path/to/whisper> --engine ovgenai --model-type whisper --device <target-device>
openarc add --model-name <model-name> --model-path <path/to/kokoro> --engine openvino --model-type kokoro --device CPU
openarc add accepts many options to modify OpenVINO runtime behavior for different inference scenarios. OpenArc reports the underlying C++ errors to the server when these fail, making experimentation easy.
See OpenVINO documentation on Inference Optimization to learn more about what can be customized.
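As a sketch of what experimentation can look like, the command below passes two standard OpenVINO properties (PERFORMANCE_HINT and CACHE_DIR) through --runtime-config. Whether a given property applies depends on your device and pipeline, so treat this as a starting point rather than a recommended configuration; the cache path is a placeholder.

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device GPU.0 --runtime-config '{"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "<path/to/cache>"}'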
Review the pipeline-parallelism preview to learn how you can customize multi-device inference using the HETERO device plugin. Example commands are provided for a few different scenarios:
openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device HETERO:GPU.0,GPU.1 --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}'
Requires more than one CPU socket in a single node.
openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device CPU --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "TENSOR_PARALLEL"}'
openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device HETERO:GPU.0,CPU --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}'
openarc list
Reads added configurations from openarc_config.json.
Display all saved configurations:
openarc list
Remove a configuration:
openarc list --remove --model-name <model-name>
openarc serve
Starts the server.
openarc serve start # defaults to 0.0.0.0:8000
Configure the host and port:
openarc serve start --host <host> --openarc-port <port>
To load models on startup:
openarc serve start --load-models model1 model2
openarc load
After using openarc add you can use openarc load to read the added configuration and load models onto the OpenArc server.
OpenArc uses the arguments from openarc add as metadata to make routing decisions internally, so each request reaches the correct inference code.
openarc load <model-name>
To load multiple models at once, use:
openarc load <model-name1> <model-name2> <model-name3>
Be mindful of your resources; loading models can be resource intensive! On the first load, OpenVINO performs model compilation for the target --device.
When openarc load fails, the CLI tool displays a full stack trace to help you figure out why.
openarc status
Calls /openarc/status endpoint and returns a report. Shows loaded models.
openarc status
openarc bench
Benchmark llm performance with pseudo-random input tokens.
This approach follows llama-bench, providing a baseline the community can use to compare inference performance between llama.cpp backends and OpenVINO.
To support different llm tokenizers, we need to standardize how tokens are chosen for benchmark inference. When you set --p, we select that many pseudo-random tokens (512 by default) as input_ids from the set of all tokens in the vocabulary.
--n controls the maximum number of tokens the model is allowed to generate; this bypasses eos and sets a hard upper limit.
Default values are:
openarc bench <model-name> --p 512 --n 128 --r 5
openarc bench also records metrics in a sqlite database openarc_bench.db for easy analysis.
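If you prefer to poke at the results outside of the CLI, the database opens with the stock sqlite3 client. The table name below is a hypothetical placeholder rather than a documented schema; use .tables and .schema to discover the actual layout first.

sqlite3 openarc_bench.db ".tables"
sqlite3 openarc_bench.db ".schema"
sqlite3 openarc_bench.db "SELECT * FROM <bench-table> ORDER BY rowid DESC LIMIT 5;"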
openarc tool
Utility scripts.
To see openvino properties your device supports use:
openarc tool device-props
To see available devices use:
openarc tool device-detect
There are a few sources of preconverted models which can be used with OpenArc:
LLMs
VLMs
| Models |
|---|
| Echo9Zulu/gemma-3-4b-it-int8_asym-ov |
| Echo9Zulu/Gemma-3-12b-it-qat-int4_asym-ov |
| Echo9Zulu/Qwen2.5-VL-7B-Instruct-int4_sym-ov |
| Echo9Zulu/Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov |
Whisper
| Models |
|---|
| OpenVINO/distil-whisper-large-v3-int8-ov |
| OpenVINO/distil-whisper-large-v3-fp16-ov |
| OpenVINO/whisper-large-v3-int8-ov |
| OpenVINO/openai-whisper-large-v3-fp16-ov |
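These repositories can be pulled ahead of time so that --model-path points at a local directory. A minimal sketch using the Hugging Face CLI (any download method works, and the local directory is your choice):

huggingface-cli download OpenVINO/whisper-large-v3-int8-ov --local-dir <path/to/whisper-large-v3-int8-ov>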
Optimum-Intel provides a hands-on primer you can use to build some intuition about quantization and post-training optimization using OpenVINO.
Intel provides a suite of tools you can use to apply different post-training optimization techniques developed over at the Neural Network Compression Framework.
- Use the Optimum-CLI conversion tool to learn how you can convert models to OpenVINO IR from other formats (a sample export command follows this list).
- Visit Supported Architectures to see which models can be converted to OpenVINO using the tools described in this section.
- If you use the CLI tool and get an error about an unsupported architecture or "missing export config", follow the link, open an issue referencing the model card, and the maintainers will get back to you.
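As a reference point, a typical export with the Optimum CLI looks roughly like the sketch below; the model id, weight format, and output directory are placeholders, and the full set of supported flags is documented by Optimum-Intel.

optimum-cli export openvino --model <org/model-id> --weight-format int4 <path/to/output-ov-model>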
Demos help illustrate what you can do with OpenArc and are meant to be extended. I will continue adding to these, but for now they are a good start.
talk_to_llm.py sets up a "chain" between whisper, an LLM, and kokoro. Talk with any LLM you can run on your PC from the command line. Accumulates context and does not filter reasoning (very interesting).
whisper_button.py uses the spacebar to record audio, transcribes it with whisper, and shows the transcription right in the terminal. NPU users should probably start here.
Learn more about how to leverage your Intel devices for Machine Learning:
Multi GPU Pipeline Parallel with OpenVINO Model Server
OpenArc stands on the shoulders of many other projects:
@article{zhou2024survey,
title={A Survey on Efficient Inference for Large Language Models},
author={Zhou, Zixuan and Ning, Xuefei and Hong, Ke and Fu, Tianyu and Xu, Jiaming and Li, Shiyao and Lou, Yuming and Wang, Luning and Yuan, Zhihang and Li, Xiuhong and Yan, Shengen and Dai, Guohao and Zhang, Xiao-Ping and Dong, Yuhan and Wang, Yu},
journal={arXiv preprint arXiv:2404.14294},
year={2024}
}
Thanks for your work!!