OpenArc

Note

OpenArc is under active development.

OpenArc is an inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS, Embedding and Reranker models over OpenAI compatible endpoints, powered by OpenVINO on your device. Local, private, open source AI.

OpenArc 2.0 arrives with more endpoints, better UX, pipeline parallel, NPU support, and much more!

Drawing on ideas from llama.cpp, vLLM, transformers, OpenVINO Model Server, Ray, Lemonade, and other projects cited below, OpenArc has been a way for me to learn about inference engines by trying to build one myself.

Along the way a Discord community has formed around this project, which was unexpected! If you are interested in using Intel devices for AI and machine learning, feel free to stop by.

Thanks to everyone on Discord for their continued support!

Features

OpenArc 2.0 arrives with more endpoints, better UX, pipeline parallel, NPU support and much more!

  • Multi-GPU pipeline parallel
  • CPU offload/hybrid device
  • NPU device support
  • OpenAI compatible endpoints (see the request example after this list)
    • /v1/models
    • /v1/completions: LLM only
    • /v1/chat/completions
    • /v1/audio/transcriptions: Whisper only
    • /v1/audio/speech: Kokoro only
    • /v1/embeddings: qwen3-embedding #33 by @mwrothbe
    • /v1/rerank: qwen3-reranker #39 by @mwrothbe
  • Jinja templating with AutoTokenizer
  • OpenAI compatible tool calls with streaming and parallel calls
    • tool call parser currently reads "name", "argument"
  • Fully async multi engine, multi task architecture
  • Model concurrency: load and infer multiple models at once
  • Automatic unload on inference failure
  • llama-bench style benchmarking for LLMs with an automatic SQLite database
  • metrics on every request
    • ttft
    • prefill_throughput
    • decode_throughput
    • decode_duration
    • tpot
    • load time
    • stream mode
  • More OpenVINO [examples](examples)
  • OpenVINO implementation of hexgrad/Kokoro-82M
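
Because the endpoints are OpenAI compatible, standard OpenAI clients and plain HTTP both work once a model is loaded. A minimal sketch with curl, assuming the server is on the default port, the API key from the quickstart, a loaded model named my-model, and Bearer-style auth (all of these are assumptions; adjust for your setup):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENARC_API_KEY" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'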

Note

Interested in contributing? Please open an issue before submitting a PR!

Quickstart

Linux
  1. OpenVINO requires device-specific drivers.
  2. Install uv from astral.
  3. After cloning, run:
uv sync
  4. Activate your environment with:
source .venv/bin/activate
  5. Build the latest Optimum-Intel:
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
  6. Build the latest OpenVINO and OpenVINO GenAI from nightly wheels:
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
  7. Set your API key as an environment variable:
export OPENARC_API_KEY=<api-key>
  8. To get started, run:
openarc --help
Windows
  1. OpenVINO requires device-specific drivers.
  2. Install uv from astral.
  3. Clone OpenArc, enter the directory, and run:
uv sync
  4. Activate your environment with:
.venv\Scripts\activate
  5. Build the latest Optimum-Intel:
uv pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
  6. Build the latest OpenVINO and OpenVINO GenAI from nightly wheels:
uv pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
  7. Set your API key as an environment variable:
setx OPENARC_API_KEY <api-key>
  8. To get started, run:
openarc --help
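
With either platform set up, a typical first session looks roughly like this (the model name, path, and device are placeholders; each command is documented in the OpenArc CLI section below):

# in one terminal: start the server (defaults to 0.0.0.0:8000)
openarc serve start

# in a second terminal: register, load, and check a model
openarc add --model-name my-model --model-path /path/to/model --engine ovgenai --model-type llm --device GPU.0
openarc load my-model
openarc status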

Note

Need help installing drivers? Join our Discord or open an issue.

Note

uv has a pip interface that is a drop-in replacement for pip, but faster. Pretty cool, and a good place to start.

OpenArc CLI

This section documents the CLI commands available to you.

All commands have aliases, but the full forms are written here for clarity.

openarc add

Add a model to openarc_config.json for easy loading with openarc load.

Single device

openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device>

VLM

openarc add --model-name <model-name> --model-path <path/to/model> --engine <engine> --model-type <model-type> --device <target-device> --vlm-type <vlm-type>

Getting VLM to work the way I wanted required using VLMPipeline in ways that are not well documented. You can look at the code to see where the magic happens.

--vlm-type maps the vision token for a given architecture using strings like qwen25vl, phi4mm, and more. Use openarc add --help to see the available options. The server will complain if you get anything wrong, so it should be easy to figure out.
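
For example, adding a Qwen2.5-VL model on a GPU might look like the following; the name and path are placeholders, and the engine/model-type values are assumptions, so confirm them with openarc add --help:

openarc add --model-name qwen25-vl-7b --model-path /path/to/Qwen2.5-VL-7B-Instruct-int4_sym-ov --engine ovgenai --model-type vlm --device GPU.0 --vlm-type qwen25vl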

Whisper

openarc add --model-name <model-name> --model-path <path/to/whisper> --engine ovgenai --model-type whisper --device <target-device> 

Kokoro (CPU only)

openarc add --model-name <model-name> --model-path <path/to/kokoro> --engine openvino --model-type kokoro --device CPU 

runtime-config

Accepts many options to modify OpenVINO runtime behavior for different inference scenarios. OpenArc surfaces the underlying C++ errors through the server when these fail, making experimentation easy.

See OpenVINO documentation on Inference Optimization to learn more about what can be customized.
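
As one hedged example, OpenVINO exposes a PERFORMANCE_HINT property; passing it through --runtime-config could look like this (whether a given property applies depends on your device and model, and the server will report the error if it does not):

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device GPU.0 --runtime-config '{"PERFORMANCE_HINT": "LATENCY"}'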

Review the pipeline-parallelism preview to learn how you can customize multi-device inference using the HETERO device plugin. Example commands are provided for a few different scenarios:

Multi-GPU Pipeline Parallel

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device HETERO:GPU.0,GPU.1 --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}'

Tensor Parallel (CPU only)

Requires more than one CPU socket in a single node.

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device CPU --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "TENSOR_PARALLEL"}'

Hybrid Mode/CPU Offload

openarc add --model-name <model-name> --model-path <path/to/model> --engine ovgenai --model-type llm --device HETERO:GPU.0,CPU --runtime-config '{"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"}'

openarc list

Reads added configurations from openarc_config.json.

Display all saved configurations:

openarc list

Remove a configuration:

openarc list --remove --model-name <model-name>

openarc serve

Starts the server.

openarc serve start # defaults to 0.0.0.0:8000

Configure host and port

openarc serve start --host <host> --openarc-port <port>

To load models on startup:

openarc serve start --load-models model1 model2
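
Once the server is running, a quick sanity check is to list loaded models over the OpenAI-compatible endpoint (assuming the default port and Bearer-style auth with the API key set earlier; adjust if your setup differs):

curl -H "Authorization: Bearer $OPENARC_API_KEY" http://localhost:8000/v1/models
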
openarc load

After using openarc add you can use openarc load to read the added configuration and load models onto the OpenArc server.

OpenArc uses the arguments from openarc add as metadata to make routing decisions internally; in effect, loading a configuration tells the server which inference code path to use.

openarc load <model-name>

To load multiple models at once, use:

openarc load <model-name1> <model-name2> <model-name3>

Be mindful of your resources; loading models can be resource intensive! On the first load, OpenVINO performs model compilation for the target --device.

When openarc load fails, the CLI tool displays a full stack trace to help you figure out why.

openarc status

Calls the /openarc/status endpoint and returns a report showing which models are loaded.

openarc status

openarc bench

Benchmark LLM performance with pseudo-random input tokens.

This approach follows llama-bench, providing a baseline the community can use to compare inference performance between llama.cpp backends and OpenVINO.

To support different LLM tokenizers, we need to standardize how tokens are chosen for benchmark inference. When you set --p, that many pseudo-random tokens are selected as input_ids from the set of all tokens in the vocabulary.

--n controls the maximum number of tokens the model is allowed to generate; this bypasses eos and sets a hard upper limit.

Default values are:

openarc bench <model-name> --p 512 --n 128 --r 5


openarc bench also records metrics in a SQLite database, openarc_bench.db, for easy analysis.
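
Because the results land in a plain SQLite file, you can inspect them with the sqlite3 CLI; the table and column names are not documented here, so list them first (a sketch, assuming sqlite3 is installed):

sqlite3 openarc_bench.db ".tables"
sqlite3 openarc_bench.db ".schema"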

openarc tool

Utility scripts.

To see the OpenVINO properties your device supports, use:

openarc tool device-props

To see available devices, use:

openarc tool device-detect


Model Sources

There are a few sources of preconverted models that can be used with OpenArc:

OpenVINO on HuggingFace

My HuggingFace repo

LLMs optimized for NPU

More models to get you started!

LLMs
  • Echo9Zulu/Qwen3-1.7B-int8_asym-ov
  • Echo9Zulu/Qwen3-4B-Instruct-2507-int4_asym-awq-ov
  • Gapeleon/Satyr-V0.1-4B-HF-int4_awq-ov
  • Echo9Zulu/Dolphin-X1-8B-int4_asym-awq-ov
  • Echo9Zulu/Qwen3-8B-ShiningValiant3-int4-asym-ov
  • Echo9Zulu/Qwen3-14B-int4_sym-ov
  • Echo9Zulu/Cydonia-24B-v4.2.0-int4_asym-awq-ov
  • Echo9Zulu/Qwen2.5-Microsoft-NextCoder-Soar-Instruct-FUSED-CODER-Fast-11B-int4_asym-awq-ov
  • Echo9Zulu/Magistral-Small-2509-Text-Only-int4_asym-awq-ov
  • Echo9Zulu/Hermes-4-70B-int4_asym-awq-ov
  • Echo9Zulu/Qwen2.5-Coder-32B-Instruct-int4_sym-awq-ov
  • Echo9Zulu/Qwen3-32B-Instruct-int4_sym-awq-ov

VLMs
  • Echo9Zulu/gemma-3-4b-it-int8_asym-ov
  • Echo9Zulu/Gemma-3-12b-it-qat-int4_asym-ov
  • Echo9Zulu/Qwen2.5-VL-7B-Instruct-int4_sym-ov
  • Echo9Zulu/Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov

Whisper
  • OpenVINO/distil-whisper-large-v3-int8-ov
  • OpenVINO/distil-whisper-large-v3-fp16-ov
  • OpenVINO/whisper-large-v3-int8-ov
  • OpenVINO/openai-whisper-large-v3-fp16-ov

Kokoro
  • Echo9Zulu/Kokoro-82M-FP16-OpenVINO

Embedding
  • Echo9Zulu/Qwen3-Embedding-0.6B-int8_asym-ov

Reranker
  • OpenVINO/Qwen3-Reranker-0.6B-fp16-ov

Converting Models to OpenVINO IR

Optimum-Intel provides a hands-on primer you can use to build some intuition about quantization and post-training optimization with OpenVINO.

Intel provides a suite of tools you can use to apply different post-training optimization techniques developed over at the Neural Network Compression Framework (NNCF).

  • Use the Optimum-CLI conversion tool to learn how you can convert models to OpenVINO IR from other formats; a minimal example follows this list.

  • Visit Supported Architectures to see what models can be converted to OpenVINO using the tools described in this section.

  • If you use the CLI tool and get an error about an unsupported architecture or a "missing export config", follow the link, open an issue referencing the model card, and the maintainers will get back to you.
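
As a hedged sketch of the conversion flow mentioned above (the model ID, weight format, and output directory are examples; check optimum-cli export openvino --help for the options your version supports):

optimum-cli export openvino --model Qwen/Qwen3-1.7B --weight-format int4 ./Qwen3-1.7B-int4-ov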

Demos

Demos help illustrate what you can do with OpenArc and are meant to be extended. I will continue adding to these, but for now they are a good start.

talk_to_llm.py sets up a "chain" between Whisper, an LLM, and Kokoro. Talk with any LLM you can run on your PC from the command line. It accumulates context and does not filter reasoning (very interesting).

whisper_button.py lets you press the spacebar to record audio, transcribes it with Whisper, and shows the result right in the terminal. NPU users should probably start here.

Resources

Learn more about how to leverage your Intel devices for Machine Learning:

Install OpenVINO

openvino_notebooks

OpenVINO Python API

OpenVINO GenAI Python API

Inference with Optimum-Intel

Optimum-Intel

NPU Devices

vllm with IPEX

Multi-GPU Pipeline Parallel with OpenVINO Model Server

Transformers Auto Classes

Acknowledgments

OpenArc stands on the shoulders of many other projects:

Optimum-Intel

OpenVINO

OpenVINO GenAI

llama.cpp

vLLM

Transformers

FastAPI

click

rich-click

@article{zhou2024survey,
  title={A Survey on Efficient Inference for Large Language Models},
  author={Zhou, Zixuan and Ning, Xuefei and Hong, Ke and Fu, Tianyu and Xu, Jiaming and Li, Shiyao and Lou, Yuming and Wang, Luning and Yuan, Zhihang and Li, Xiuhong and Yan, Shengen and Dai, Guohao and Zhang, Xiao-Ping and Dong, Yuhan and Wang, Yu},
  journal={arXiv preprint arXiv:2404.14294},
  year={2024}
}

Thanks for your work!!