NPGlue provides a complete setup for running multiple AI models locally with OpenVINO, giving you AI-assisted coding and development with direct, quality answers.
```bash
git clone https://github.com/CyphrRiot/npglue.git
cd npglue
./install
```

The installer will ask you to choose your model from 8 options:
- Qwen3-8B-INT8 (~6-8GB) - Best quality for complex tasks
- Qwen3-0.6B-FP16 (~1-2GB) - Fast and lightweight
- OpenLlama-7B-INT4 (~4-5GB) - Great balance for coding
- OpenLlama-3B-INT4 (~2-3GB) - Lightweight with good performance
- Llama-3.1-8B-INT4 (~5-6GB) - Latest Llama with excellent coding abilities
- Phi-3-Mini-4K (~4GB) - Microsoft model optimized for NPU
- DeepSeek-Coder-6.7B (~6-7GB) - Specialized coding model, excellent for development
- DeepSeek-Coder-1.3B (~2GB) - Lightweight coding specialist
- Multiple AI Models: 8 models to choose from - Qwen3, Llama, Phi-3, DeepSeek coding specialists
- Model Choice: Pick based on your needs - quality vs speed vs coding specialization
- Easy Model Switching: Use `./switch_model.sh` to change models anytime (no reinstall needed!)
- OpenVINO Optimized: Fast inference optimized for Intel NPU/GPU hardware
- 20-30+ tokens/sec: Fast local inference with memory efficiency
- Performance Display: Every response shows completion time and token rate
- Direct Answers: No rambling - get concise, actionable responses
- Zed Compatible: Works as Ollama provider (no API key hassles!)
- Full Ollama API: Complete compatibility with Ollama ecosystem
- Dual API Support: Both OpenAI and Ollama compatible endpoints (see the example below)
- Goose Ready: Drop-in replacement for OpenAI API
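Both API families answer on the same port once the server is running. A quick illustration (these endpoint paths are the ones documented in the API section of this README; `qwen3` is the default model name):

```bash
# OpenAI-style model listing (what Goose uses)
curl http://localhost:11434/v1/models

# Ollama-style model listing (what Zed uses)
curl http://localhost:11434/api/tags
```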
Most AI inference solutions (like Ollama) rely on traditional GPUs or CPUs, but NPGlue leverages cutting-edge NPU hardware that provides significant advantages:
| Setup | Hardware Used | Token Speed | Memory Efficiency | Power Usage |
|---|---|---|---|---|
| Ollama (CPU) | CPU cores only | 2-8 tok/s | High RAM usage | High power |
| Ollama (GPU) | NVIDIA/AMD GPU | 15-30 tok/s | VRAM limited | Very high power |
| NPGlue (NPU) | Intel/AMD NPU | 20-60 tok/s | Optimized | Low power |
1. Purpose-Built for AI:
   - Traditional GPU: Designed for graphics, adapted for AI
   - NPU: Purpose-built neural processing unit for AI inference
   - Result: 2-3x better performance per watt

2. Memory Efficiency:
   - GPU: Requires loading the entire model into VRAM (8-24GB limits)
   - NPU: Optimized memory access patterns, works with system RAM
   - Result: Can run larger models with less memory

3. Power Efficiency:
   - GPU: 150-300W+ power consumption
   - NPU: 5-15W power consumption
   - Result: 10-20x more power efficient

4. Parallel Processing:
   - CPU + GPU + NPU: All three can work together
   - Traditional: Usually either CPU OR GPU
   - Result: Better overall system performance

Your Intel Core Ultra 7 256V System:
```bash
# Ollama (CPU-only, no NPU support)
ollama run qwen2.5:7b   # 2-5 tokens/sec, 100% CPU usage

# NPGlue (NPU-accelerated)
./start_server.sh       # 20-30 tokens/sec, <20% CPU usage
curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"Hello"}]}'
```

AMD Ryzen AI Max+ 395 System:
```bash
# Ollama (powerful CPU, but still no NPU)
ollama run qwen2.5:7b   # 8-15 tokens/sec

# NPGlue (NPU-accelerated)
./start_server.sh       # 40-60 tokens/sec
# Plus can run 70B models that won't fit in GPU VRAM
```

Choose Ollama if:
- ✅ You have a powerful NVIDIA GPU (3080+)
- ✅ You want the largest model ecosystem
- ✅ You don't have NPU hardware
- ✅ You need specific model formats (GGUF variety)
Choose NPGlue if:
- ✅ You have Intel Core Ultra or AMD Ryzen AI processors (NPU available)
- ✅ You want maximum performance per watt
- ✅ You prefer purpose-built AI acceleration
- ✅ You want cutting-edge 2024+ hardware utilization
- ✅ You need efficient performance on laptops
NPU Support (NPGlue Advantage):
- ✅ Intel Core Ultra (12th gen+) - Intel NPU
- ✅ AMD Ryzen AI (8000 series+) - AMD XDNA NPU
- ✅ Qualcomm Snapdragon X Elite - Hexagon NPU
- ❌ Older Intel/AMD processors - No NPU

GPU Support (Ollama Advantage):
- ✅ NVIDIA RTX 20/30/40 series - CUDA acceleration
- ✅ AMD RX 6000/7000 series - ROCm acceleration
- ✅ Apple M1/M2/M3 - Metal acceleration
- ❌ Intel integrated graphics - Limited support
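Not sure whether your machine exposes an NPU? A quick shell check usually settles it. This is a generic sketch: `/dev/accel` is where the Linux accel subsystem exposes NPU devices (Intel's ivpu driver, AMD's XDNA driver), and the OpenVINO query is the same one used later in this README.

```bash
# NPU devices show up under the Linux accel subsystem
ls /dev/accel/ 2>/dev/null || echo "No accel devices found"

# Ask OpenVINO which devices it can see (run inside the npglue-env virtualenv)
python -c "import openvino; print(openvino.Core().available_devices)"
# An NPU-capable system typically reports something like ['CPU', 'GPU', 'NPU']
```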
NPGlue positions you at the forefront of AI hardware evolution:
- 2024: NPUs becoming standard in new processors
- 2025: Expected 3-5x NPU performance improvements
- 2026+: NPU-first AI software ecosystem
You're not just running AI faster today - you're using tomorrow's standard technology!
💡 Bottom Line: If you have NPU hardware, NPGlue gives you hardware acceleration that Ollama simply cannot access, making it the superior choice for performance, efficiency, and future-proofing.
- OS: Linux (Arch/CachyOS recommended)
- Memory: 2GB+ RAM (for 0.6B model) or 8GB+ RAM (for 8B model)
- Storage: 10-15GB free space
- CPU: Intel preferred (excellent OpenVINO optimization)
- Shell: Compatible with bash, zsh, and fish
- Hardware acceleration:
- Best: Intel NPU (12th gen+ processors) - 20-30 tokens/sec
- Good: Intel integrated GPU - 5-10 tokens/sec
- Basic: Any CPU - 2-5 tokens/sec (slower but functional)
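You can sanity-check these requirements with standard Linux tools before installing (a minimal sketch; nothing NPGlue-specific):

```bash
# Total and available memory (8GB+ recommended for the 8B models)
free -h

# Free disk space on the current filesystem (10-15GB needed for models)
df -h .

# CPU model, to confirm a processor generation with NPU support
lscpu | grep "Model name"
```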
NPGlue automatically displays performance metrics with every response:
What is the capital of France?
The capital of France is Paris.
*Completed in 0.85 seconds at 23.2 tokens/sec*
Benefits:
- Real-time feedback on AI response speed
- Performance monitoring under different loads
- Model comparison when testing different configurations
- System optimization insights for tuning
This helps you:
- Monitor system performance
- Compare model variants (8B vs 0.6B)
- Identify when your system needs optimization
- Debug slow response issues
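For an independent end-to-end timing (rather than the figure reported in the response itself), you can wrap a request in `time`. A simple sketch, reusing the chat endpoint shown elsewhere in this README:

```bash
# Measure wall-clock latency of a single chat completion
time curl -s -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"What is the capital of France?"}]}'
```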
Token Limits:
- Respects user preferences - Request up to 4096 tokens (example request below)
- No artificial caps - Let the model complete naturally
- Smart defaults - 200 tokens if not specified
- Memory aware - Monitors available RAM during generation
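To request a longer completion, pass `max_tokens` in the request body, as in this sketch (standard OpenAI-style parameter; per the limits above, values up to 4096 are honored):

```bash
# Ask for up to 1024 tokens instead of the 200-token default
curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","max_tokens":1024,"messages":[{"role":"user","content":"Explain OpenVINO INT8 quantization."}]}'
```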
NPGlue includes built-in tools to diagnose and optimize performance:

```bash
./switch_model.sh
```

Easily switch between models based on your needs:
- 8B model: Maximum quality for complex tasks (needs 8GB+ RAM)
- 0.6B model: Speed and efficiency for quick responses (needs 2GB+ RAM)
Tip: If you're getting slow performance (under 15 tok/sec), run the diagnostics tool to identify memory pressure or other issues.
```bash
# Manual CPU optimization
./boost_cpu.sh      # Set CPU to performance mode

# Manual CPU restoration
./restore_cpu.sh    # Restore power-saving mode

# Automatic management (recommended)
./start_server.sh   # Auto-saves/restores CPU settings
```

Note: `start_server.sh` automatically saves your CPU governor settings and restores them when you press Ctrl+C or exit the server.
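For reference, CPU governor switching on Linux comes down to writing to sysfs (or using `cpupower`). The sketch below illustrates the idea only; the repository's actual `boost_cpu.sh` and `restore_cpu.sh` may save and restore per-core settings differently:

```bash
#!/usr/bin/env bash
# Illustration only: toggle the CPU frequency governor (requires root)

# Switch every core to the "performance" governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Later, restore the power-saving default most laptops use
echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```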
| Model | Size | Memory | Speed (NPU) | Speed (iGPU) | Speed (CPU) | Best For |
|---|---|---|---|---|---|---|
| Qwen3-8B-INT8 | ~6-8GB | 8GB+ RAM | 20-30 tok/s | 5-10 tok/s | 2-5 tok/s | Complex tasks, detailed explanations |
| Qwen3-0.6B-FP16 | ~1-2GB | 2GB+ RAM | 25-40 tok/s | 8-15 tok/s | 4-8 tok/s | Quick answers, simple tasks |
| OpenLlama-7B-INT4 | ~4-5GB | 6GB+ RAM | 22-35 tok/s | 6-12 tok/s | 3-6 tok/s | Balanced coding and general tasks |
| OpenLlama-3B-INT4 | ~2-3GB | 4GB+ RAM | 30-45 tok/s | 10-18 tok/s | 5-9 tok/s | Fast responses, lightweight |
| Llama-3.1-8B-INT4 | ~5-6GB | 8GB+ RAM | 20-30 tok/s | 5-10 tok/s | 2-5 tok/s | Latest Llama, excellent coding |
| Phi-3-Mini-4K | ~4GB | 6GB+ RAM | 25-35 tok/s | 7-14 tok/s | 3-7 tok/s | NPU-optimized, Microsoft quality |
| DeepSeek-Coder-6.7B | ~6-7GB | 8GB+ RAM | 18-28 tok/s | 4-9 tok/s | 2-4 tok/s | Best for coding, development tasks |
| DeepSeek-Coder-1.3B | ~2GB | 3GB+ RAM | 35-50 tok/s | 12-20 tok/s | 6-10 tok/s | Fast coding assistant, lightweight |
- ✅ Checks system requirements (RAM, disk space)
- ✅ Installs system dependencies (Python, OpenVINO drivers, etc.)
- ✅ Creates clean Python virtual environment
- ✅ Installs CPU-only AI packages (OpenVINO 2024.x, transformers, PyTorch-CPU)
- ✅ Interactive model choice: Pick any of the 8 models listed above
- ✅ Downloads your chosen optimized OpenVINO model
- ✅ Memory-safe verification (no crashes during setup)
- ✅ CPU performance optimization
- ✅ Safe Goose setup: Checks for existing config, won't overwrite
- ✅ Zed integration: Exact settings for assistant configuration
- ✅ Testing steps: How to verify everything works properly
The installer provides safe configuration that won't overwrite existing settings:
If you DON'T have Goose configured:
```bash
mkdir -p ~/.config/goose
cp goose_config_example.yaml ~/.config/goose/config.yaml
# No API key needed! Uses Ollama provider which is simpler.
```

If you HAVE existing Goose config, just add:
```yaml
GOOSE_PROVIDER: ollama
GOOSE_MODEL: qwen3
OLLAMA_HOST: http://localhost:11434
```

Why Ollama provider? NPGlue supports both OpenAI and Ollama APIs, but Goose's Ollama provider doesn't require API key setup - much simpler!
NPGlue works as an Ollama provider (no API key hassles!):

```json
{
"language_models": {
"ollama": {
"api_url": "http://localhost:11434",
"available_models": [
{
"name": "qwen3",
"display_name": "Qwen3 Local",
"max_tokens": 4096,
"supports_tools": true
}
]
}
},
"agent": {
"default_model": {
"provider": "ollama",
"model": "qwen3"
}
}
}
```

Why this works: Zed's OpenAI provider is finicky about API keys, but the Ollama provider "just works"!
After running `./install`, test with:

```bash
# Start the server
./start_server.sh
# Test health
curl http://localhost:11434/health
# Test Ollama API (for Zed)
curl http://localhost:11434/api/tags
# Test OpenAI API (for Goose)
curl http://localhost:11434/v1/models
# Run full model test
```

NPGlue provides complete API compatibility with both OpenAI and Ollama:
OpenAI API (for Goose):
- `GET /v1/models` - List models
- `POST /v1/chat/completions` - Chat completions
- `GET /health` - Health check

Ollama API (for Zed):
- `GET /api/tags` - List models
- `POST /api/chat` - Chat completions
- `POST /api/generate` - Text generation
- `POST /api/show` - Model details
- `GET /api/version` - Version info
- `POST /api/pull` - Model management (returns success for local models)

Utilities:
- `GET /models` - System information
- `POST /unload` - Unload model from memory
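Example requests for the Ollama-style endpoints above (request bodies follow the standard Ollama API shapes; exact response fields depend on the server):

```bash
# Ollama-style chat completion
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"Write a hello world in Python"}]}'

# Ollama-style raw generation
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","prompt":"Explain OpenVINO in one sentence."}'
```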
- Goose: AI development assistant
- Zed: Modern code editor
- Cursor: AI-powered IDE
- Continue.dev: VS Code extension
- Any OpenAI-compatible client
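Most OpenAI-compatible clients only need a base URL override to talk to the local server. The exact setting name varies by tool; many respect the environment variables below (a hedged sketch, not tool-specific configuration - use a placeholder key if the client insists on one):

```bash
# Point OpenAI-compatible tooling at the local NPGlue server
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="not-needed-locally"   # placeholder for clients that require a key
```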
```
npglue/
├── install                    # Beautiful one-command installer
├── start_server.sh            # Start the AI server (auto CPU cleanup on exit)
├── server_production.py       # FastAPI server with dual API compatibility
├── boost_cpu.sh               # CPU performance optimization
├── restore_cpu.sh             # Restore CPU to power-saving mode
├── switch_model.sh            # Easy model switching utility
├── goose_config_example.yaml  # Safe Goose configuration template
├── README.md                  # This documentation
├── LICENSE                    # License file
└── models/                    # Downloaded models (created by installer)
    ├── qwen3-8b-int8/         # High quality model (8GB)
    └── qwen3-0.6b-fp16/       # Fast model (1-2GB)
```
- One Command Setup: `./install` does everything beautifully
- Model Choice: Choose between quality (8B) or speed (0.6B)
- Memory Safe: Won't crash during installation or use
- Configuration Safe: Won't overwrite your existing tool settings
- Expert Optimized: Uses official OpenVINO optimized models
- Direct Answers: No rambling - designed for practical Q&A
- Clear Instructions: Tells you exactly what to do next
- Local Privacy: No data sent to external APIs
- Fast Performance: Optimized for Intel hardware
- Production Ready: Proper error handling and monitoring
- Chat: `http://localhost:11434/v1/chat/completions` (OpenAI compatible)
- Health: `http://localhost:11434/health`
- Docs: `http://localhost:11434/docs`
```bash
# Activate environment manually
source npglue-env/bin/activate

# Check available devices
python -c "import openvino; print(openvino.Core().available_devices)"
```

- ✅ NPU vs GPU Comparison: Detailed analysis of why NPGlue + NPU beats traditional GPU solutions
- ✅ 8 Model Choices: Added OpenLlama, Phi-3, DeepSeek, and Llama-3.1 models to installer
- ✅ Enhanced Model Switching: Easy utility to switch between models (`switch_model.sh`)
- ✅ Optional CPU Performance: Installer now asks before enabling performance mode (no automatic changes)
- ✅ Robust Dependencies: Better handling of protobuf, sentencepiece, and model-specific requirements
- ✅ Smart Chat Templates: Automatic handling for different model families (Qwen, Phi-3, DeepSeek)
- ✅ Complete Ollama API: Added `/api/show`, `/api/version`, `/api/pull` endpoints (no more 404s!)
- ✅ Memory Optimization: Automatic detection and fixes for memory pressure issues
- ✅ Flexible Token Limits: Respects user preferences up to 4096 tokens (no more artificial caps!)
- ✅ Performance Display: All responses now show "Completed in X.XX seconds at X.X tokens/sec"
- ✅ CPU-Only Install: No NVIDIA dependencies on Intel systems
- ✅ Dual API Support: Both OpenAI AND Ollama compatible endpoints
- ✅ Zed Integration Fixed: Works as Ollama provider (no API key issues!)
- ✅ Safe configuration: Protects existing Goose/Zed settings
- ✅ Simplified installer: One beautiful command does everything
- ✅ Expert models: Official OpenVINO optimized versions
NPGlue: One command to local AI coding bliss!
Get the power of Qwen3's direct, practical responses running locally on your machine in minutes.