OpenArc

Note

OpenArc is under active development.

OpenArc is an inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS, Embedding and Reranker models over OpenAI compatible endpoints, powered by OpenVINO on your device. Local, private, open source AI.

OpenArc 2.0 arrives with more endpoints, better UX, pipeline parallel, NPU support, and much more!

Drawing on ideas from llama.cpp, vLLM, transformers, OpenVINO Model Server, Ray, Lemonade, and other projects cited below, OpenArc has been a way for me to learn about inference engines by trying to build one myself.

Along the way a Discord community has formed around this project, which was unexpected! If you are interested in using Intel devices for AI and machine learning, feel free to stop by.

Thanks to everyone on Discord for their continued support!

Features

OpenArc 2.0 arrives with more endpoints, better UX, pipeline parallel, NPU support and much more!

  • Multi-GPU pipeline parallel
  • CPU offload/hybrid device
  • NPU device support
  • OpenAI compatible endpoints (see the request example after this list)
    • /v1/models
    • /v1/completions: LLM only
    • /v1/chat/completions
    • /v1/audio/transcriptions: Whisper only
    • /v1/audio/speech: Kokoro only
    • /v1/embeddings: qwen3-embedding #33 by @mwrothbe
    • /v1/rerank: qwen3-reranker #39 by @mwrothbe
  • Jinja templating with AutoTokenizer
  • OpenAI compatible tool calls with streaming and parallel calls
    • tool call parser currently reads "name", "argument"
  • Fully async multi engine, multi task architecture
  • Model concurrency: load and infer multiple models at once
  • Automatic unload on inference failure
  • llama-bench style benchmarking for LLMs with an automatic SQLite database
  • metrics on every request
    • ttft
    • prefill_throughput
    • decode_throughput
    • decode_duration
    • tpot
    • load time
    • stream mode
  • More OpenVINO [examples](examples)
  • OpenVINO implementation of hexgrad/Kokoro-82M
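
Because the endpoints are OpenAI compatible, standard OpenAI clients and plain HTTP both work once a model is loaded. A minimal sketch with curl, assuming the server is on the default port, the API key from the quickstart, a loaded model named my-model, and Bearer-style auth (all of these are assumptions; adjust for your setup):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENARC_API_KEY" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'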

Note

Interested in contributing? Please open an issue before submitting a PR!

Quickstart

Linux
  1. OpenVINO requires device-specific drivers.
  2. Install uv from astral.
  3. After cloning, run:
uv sync
  4. Activate your environment with:
source .venv/bin/activate
  5. Build the latest Optimum-Intel:
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
  6. Build the latest OpenVINO and OpenVINO GenAI from nightly wheels:
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
  7. Set your API key as an environment variable:
export OPENARC_API_KEY=<api-key>
  8. To get started, run:
openarc --help
Windows
  1. OpenVINO requires device-specific drivers.
  2. Install uv from astral.
  3. Clone OpenArc, enter the directory, and run:
uv sync
  4. Activate your environment with:
.venv\Scripts\activate
  5. Build the latest Optimum-Intel:
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
  6. Build the latest OpenVINO and OpenVINO GenAI from nightly wheels:
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
  7. Set your API key as an environment variable:
setx OPENARC_API_KEY <api-key>
  8. To get started, run:
openarc --help
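
With either platform set up, a typical first session looks roughly like this (the model name, path, and device are placeholders; each command is documented in the OpenArc CLI section below):

# in one terminal: start the server (defaults to 0.0.0.0:8000)
openarc serve start

# in a second terminal: register, load, and check a model
openarc add --model-name my-model --model-path /path/to/model --engine ovgenai --model-type llm --device GPU.0
openarc load my-model
openarc status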

Note

Need help installing drivers? Join our Discord or open an issue.

Note

uv has a pip interface that is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start.

OpenArc CLI

This section documents the CLI commands available to you.

All commands have aliases, but the full forms are written here for clarity.

openarc add

Add a model to openarc_config.json for easy loading with openarc load.

Single device

openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device>

VLM

openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device> --vlm-type <vlm-type>

Getting VLM to work the way I wanted required using VLMPipeline in ways that are not well documented. You can look at the code to see where the magic happens.

--vlm-type maps the vision token for a given architecture using strings like qwen25vl, phi4mm, and more. Use openarc add --help to see the available options. The server will complain if you get anything wrong, so it should be easy to figure out.
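
For example, adding a Qwen2.5-VL model on a GPU might look like the following; the name and path are placeholders, and the engine/model-type values are assumptions, so confirm them with openarc add --help:

openarc add --model-name qwen25-vl-7b --model-path /path/to/Qwen2.5-VL-7B-Instruct-int4_sym-ov --engine ovgenai --model-type vlm --device GPU.0 --vlm-type qwen25vl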

Whisper

openarc add --model-name <model-name> --model-path <path/to/whisper> --engine ovgenai --model-type whisper --device <target-device> 

Kokoro (CPU only)

openarc add --model-name <model-name> --model-path <path/to/kokoro> --engine openvino --model-type kokoro --device CPU 

runtime-config

Accepts many options to modify OpenVINO runtime behavior for different inference scenarios. OpenArc surfaces the underlying C++ errors through the server when these fail, making experimentation easy.

See OpenVINO documentation on Inference Optimization to learn more about what can be customized.
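
As one hedged example, OpenVINO exposes a PERFORMANCE_HINT property; passing it through --runtime-config could look like this (whether a given property applies depends on your device and model, and the server will report the error if it does not):

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device GPU.0 --runtime-config '{"PERFORMANCE_HINT": "LATENCY"}'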

Review the pipeline-parallelism preview to learn how you can customize multi-device inference using the HETERO device plugin. Example commands are provided for a few different scenarios:

Multi-GPU Pipeline Parallel

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device HETERO:GPU.0,GPU.1 --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}'

Tensor Parallel (CPU only)

Requires more than one CPU socket in a single node.

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device CPU --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "TENSOR_PARALLEL"}'

Hybrid Mode/CPU Offload

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device HETERO:GPU.0,CPU --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}'

openarc list

Reads added configurations from openarc_config.json.

Display all saved configurations:

openarc list

Remove a configuration:

openarc list --remove --model-name <model-name>

openarc serve

Starts the server.

openarc serve start # defaults to 0.0.0.0:8000

Configure host and port

openarc serve start --host <host> --openarc-port <port>

To load models on startup:

openarc serve start --load-models model1 model2
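
Once the server is running, a quick sanity check is to list loaded models over the OpenAI-compatible endpoint (assuming the default port and Bearer-style auth with the API key set earlier; adjust if your setup differs):

curl -H "Authorization: Bearer $OPENARC_API_KEY" http://localhost:8000/v1/models
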
openarc load

After using openarc add you can use openarc load to read the added configuration and load models onto the OpenArc server.

OpenArc uses the arguments from openarc add as metadata to make routing decisions internally; in effect, loading a configuration tells the server which inference code path to use.

openarc load <model-name>

To load multiple models at once, use:

openarc load <model-name1> <model-name2> <model-name3>

Be mindful of your resources; loading models can be resource intensive! On the first load, OpenVINO performs model compilation for the target --device.

When openarc load fails, the CLI tool displays a full stack trace to help you figure out why.

openarc status

Calls the /openarc/status endpoint and returns a report showing which models are loaded.

openarc status

openarc bench

Benchmark LLM performance with pseudo-random input tokens.

This approach follows llama-bench, providing a baseline the community can use to compare inference performance between llama.cpp backends and OpenVINO.

To support different LLM tokenizers, we need to standardize how tokens are chosen for benchmark inference. When you set --p, that many pseudo-random tokens are selected as input_ids from the set of all tokens in the vocabulary.

--n controls the maximum number of tokens the model is allowed to generate; this bypasses eos and sets a hard upper limit.

Default values are:

openarc bench <model-name> --p 512 --n 128 --r 5


openarc bench also records metrics in a SQLite database, openarc_bench.db, for easy analysis.
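
Because the results land in a plain SQLite file, you can inspect them with the sqlite3 CLI; the table and column names are not documented here, so list them first (a sketch, assuming sqlite3 is installed):

sqlite3 openarc_bench.db ".tables"
sqlite3 openarc_bench.db ".schema"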

openarc tool

Utility scripts.

To see the OpenVINO properties your device supports, use:

openarc tool device-props

To see available devices, use:

openarc tool device-detect


Model Sources

There are a few sources of preconverted models that can be used with OpenArc:

OpenVINO on HuggingFace

My HuggingFace repo

LLMs optimized for NPU

More models to get you started!

LLMs
  • Echo9Zulu/Qwen3-1.7B-int8_asym-ov
  • Echo9Zulu/Qwen3-4B-Instruct-2507-int4_asym-awq-ov
  • Gapeleon/Satyr-V0.1-4B-HF-int4_awq-ov
  • Echo9Zulu/Dolphin-X1-8B-int4_asym-awq-ov
  • Echo9Zulu/Qwen3-8B-ShiningValiant3-int4-asym-ov
  • Echo9Zulu/Qwen3-14B-int4_sym-ov
  • Echo9Zulu/Cydonia-24B-v4.2.0-int4_asym-awq-ov
  • Echo9Zulu/Qwen2.5-Microsoft-NextCoder-Soar-Instruct-FUSED-CODER-Fast-11B-int4_asym-awq-ov
  • Echo9Zulu/Magistral-Small-2509-Text-Only-int4_asym-awq-ov
  • Echo9Zulu/Hermes-4-70B-int4_asym-awq-ov
  • Echo9Zulu/Qwen2.5-Coder-32B-Instruct-int4_sym-awq-ov
  • Echo9Zulu/Qwen3-32B-Instruct-int4_sym-awq-ov

VLMs
  • Echo9Zulu/gemma-3-4b-it-int8_asym-ov
  • Echo9Zulu/Gemma-3-12b-it-qat-int4_asym-ov
  • Echo9Zulu/Qwen2.5-VL-7B-Instruct-int4_sym-ov
  • Echo9Zulu/Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov

Whisper
  • OpenVINO/distil-whisper-large-v3-int8-ov
  • OpenVINO/distil-whisper-large-v3-fp16-ov
  • OpenVINO/whisper-large-v3-int8-ov
  • OpenVINO/openai-whisper-large-v3-fp16-ov

Kokoro
  • Echo9Zulu/Kokoro-82M-FP16-OpenVINO

Embedding
  • Echo9Zulu/Qwen3-Embedding-0.6B-int8_asym-ov

Reranker
  • OpenVINO/Qwen3-Reranker-0.6B-fp16-ov

Converting Models to OpenVINO IR

Optimum-Intel provides a hands-on primer you can use to build some intuition about quantization and post-training optimization with OpenVINO.

Intel provides a suite of tools you can use to apply different post-training optimization techniques developed over at the Neural Network Compression Framework (NNCF).

  • Use the Optimum-CLI conversion tool to learn how you can convert models to OpenVINO IR from other formats; a minimal example follows this list.

  • Visit Supported Architectures to see what models can be converted to OpenVINO using the tools described in this section.

  • If you use the CLI tool and get an error about an unsupported architecture or a "missing export config", follow the link, open an issue referencing the model card, and the maintainers will get back to you.
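
As a hedged sketch of the conversion flow mentioned above (the model ID, weight format, and output directory are examples; check optimum-cli export openvino --help for the options your version supports):

optimum-cli export openvino --model Qwen/Qwen3-1.7B --weight-format int4 ./Qwen3-1.7B-int4-ov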

Demos

Demos help illustrate what you can do with OpenArc and are meant to be extended. I will continue adding to these, but for now they are a good start.

talk_to_llm.py sets up a "chain" between Whisper, an LLM, and Kokoro. Talk with any LLM you can run on your PC from the command line. It accumulates context and does not filter reasoning (very interesting).

whisper_button.py lets you press the spacebar to record audio, transcribes it with Whisper, and shows the result right in the terminal. NPU users should probably start here.

Resources

Learn more about how to leverage your Intel devices for Machine Learning:

Install OpenVINO

openvino_notebooks

OpenVINO Python API

OpenVINO GenAI Python API

Inference with Optimum-Intel

Optimum-Intel

NPU Devices

vllm with IPEX

Multi-GPU Pipeline Parallel with OpenVINO Model Server

Transformers Auto Classes

Acknowledgments

OpenArc stands on the shoulders of many other projects:

Optimum-Intel

OpenVINO

OpenVINO GenAI

llama.cpp

vLLM

Transformers

FastAPI

click

rich-click

@article{zhou2024survey,
  title={A Survey on Efficient Inference for Large Language Models},
  author={Zhou, Zixuan and Ning, Xuefei and Hong, Ke and Fu, Tianyu and Xu, Jiaming and Li, Shiyao and Lou, Yuming and Wang, Luning and Yuan, Zhihang and Li, Xiuhong and Yan, Shengen and Dai, Guohao and Zhang, Xiao-Ping and Dong, Yuhan and Wang, Yu},
  journal={arXiv preprint arXiv:2404.14294},
  year={2024}
}

Thanks for your work!!