ANEMLL (pronounced like "animal") is an open-source project focused on accelerating the porting of Large Language Models (LLMs) to tensor processors, starting with the Apple Neural Engine (ANE).
For complete release notes, see docs/RELEASE_NOTES_0.3.5.md.
- ANEMLL Chat redesign — Fully rebuilt iOS/macOS/visionOS reference app with voice input, AirDrop model sharing, local model import/linking, Markdown rendering, and thinking mode. TestFlight Beta
- Gemma 3 family — Full support for 270M, 1B, 4B QAT with sliding-window + global attention, FP16 scaling, and up to 4K context.
- Monolithic models — Single-file conversion and inference for all architectures (LLaMA, Qwen, Qwen 2.5, Gemma 3) with ANEMLL-Dedup for ~50% size reduction.
- In-model argmax (
--argmax) — Moves argmax into the CoreML LM head, outputting per-chunk winner index+value instead of full logits. Drastically reduces ANE-to-host data transfer. Extensible to top-k sampling. Recorded inmeta.yamlasargmax_in_model: true. - Swift inference stability — IOSurface-backed buffers, serial prediction queue, ping-pong/ring buffer patterns eliminate ANE race conditions on iOS.
- ANEMLL-Dedup — Surgical weight deduplication for multifunction CoreML models (~50% savings). Documentation
- Qwen 3 multi-chunk fix — Fixed inference divergence caused by applying final RMSNorm on every FFN chunk instead of only the last.
- New conversion tools — ANE Profiler (docs), auto chunk calculator (docs), FP16 preflight, real-time conversion monitor.
- Chat CLI improvements — New
--st(single-token prefill for debugging),--cpu,--debug-argmax,--mem-report,--split-rotate,--sliding-windowflags. Architecture-aware stop-token detection. - Auto-activate venv —
convert_model.shandcheck_dependencies.shauto-activate project venvs. Override withANEMLL_VENVor disable withANEMLL_AUTO_VENV=0.
- 📊 lm-evaluation-harness Support - Model evaluation with standard benchmarks (BoolQ, ARC Challenge, etc.) - Documentation
- 🎯 New RMSN-orm Implementation - Precise calculation with ANE hardware ops
- 🐛 Fixed RoPE Tensor Size Bug - Resolved random overflows (existing pre-0.3.4 models should be re-converted)
| Task | HF-FP16 | ANEMLL-FP16 | DIFF % |
|---|---|---|---|
| arc_challenge | 31.66% | 30.97% | -0.69% |
| arc_easy | 60.65% | 60.94% | +0.29% |
| boolq | 63.91% | 64.68% | +0.77% |
| piqa | 66.81% | 67.74% | +0.93% |
| winogrande | 56.43% | 56.67% | +0.24% |
| Average | 55.89% | 56.60% | +0.71% |
✅ DIFF = ANEMLL-FP16 - HF-FP16, where positive values indicate ANEMLL outperforms HuggingFace on that metric.
🆕 New 0.3.4.models with benchmarks are here
# 1. Setup environment (uv recommended)
brew install uv # one-time
./create_uv_env.sh # creates env-anemll with Python 3.9
source env-anemll/bin/activate
./install_dependencies.sh
# 2. Test conversion pipeline
python tests/test_gemma3_model.py # Gemma 3 270M (monolithic + argmax)
python tests/test_qwen_model.py # Qwen 3
python tests/test_llama_model.py # LLaMA
# 3. Convert your own models
./anemll/utils/convert_model.sh --model <path> --output <dir>The goal is to provide a fully open-source pipeline from model conversion to inference for common LLM architectures running on ANE. This enables seamless integration and on-device inference for low-power applications on edge devices, ensuring maximum privacy and security. This is critical for autonomous applications, where models run directly on the device without requiring an internet connection.
We aim to:
- Provide flexible and easy to use library/framework to port LLMs to ANE directly from Hugging Face models
- Provide on-device examples for iOS and macOS swift or C/C++ Applications
See update Roadmap.md for more details
ANEMLL provides six main components for Apple Neural Engine inference development:
-
LLM Conversion Tools - Scripts and code to convert models directly from Hugging Face weights
-
ANE Profiler - CoreML/ANE profiling without Xcode (analyze compute plan, benchmark all units, compatibility reports). Requires CoreMLTools 9.0+ and macOS 15+.
-
Swift Reference Implementation - Optimized inference code for Swift applications
- Sample CLI application in
anemll-swift-cli - Core inference engine implementation
- Sample CLI application in
-
Python Sample Code - Reference implementation and testing tools
- Basic chat interface (
chat.py) - Advanced conversation management (
chat_full.py)
- Basic chat interface (
-
iOS/macOS Sample Applications - Redesigned ANEMLL Chat app with voice input, AirDrop sharing, Markdown, and thinking mode. TestFlight Beta
- SwiftUI Chat interface (iOS, macOS, visionOS)
- HuggingFace model downloads, local import, network drive linking
- Conversation management with streaming and performance metrics
-
ANEMLL-BENCH - Apple Neural Engine Benchmarking
- Performance testing and comparison
- Model optimization metrics
- Hardware-specific benchmarks
- GitHub Repository
We provide sample converted models ready for use:
- Gemma 3 (270M, 1B, 4B QAT) — SWA + global attention, up to 4K context, monolithic and chunked
- LLaMA 3.1/3.2 (1B, 8B) — including iOS "friendly builds"
- Qwen 3 (0.6B, 1.7B) — thinking mode support
- Qwen 2.5 (0.5B) — monolithic available
- DeepSeek R1 (8B distilled) — via LLaMA converter
- DeepHermes (3B, 8B) — LLaMA-based fine-tuned models
Note
Please note that Quantization should be improved. LUT4 quality is fairly low due to lack of Block Quantization on Apple Neural Engine.
- Generic HF Model Testing:
./tests/conv/test_hf_model.sh [model_name] [output_dir] [chunks] - LLaMA Testing:
python tests/test_llama_model.py - Qwen 3 Testing:
python tests/test_qwen_model.py - Qwen 2.5 Testing:
python tests/test_qwen2.5_model.py - Gemma 3 Testing:
python tests/test_gemma3_model.py
# Test any model with automatic naming
./tests/conv/test_hf_model.sh meta-llama/Llama-3.2-1B-Instruct
# Test with custom output directory
./tests/conv/test_hf_model.sh Qwen/Qwen2.5-0.5B-Instruct /tmp/my-test
# Test larger models with chunks
./tests/conv/test_hf_model.sh meta-llama/Llama-3.2-8B-Instruct /tmp/llama8b 4Gemma 3 models use a split KV cache architecture with interleaved local (sliding window) and global attention layers.
Note: The conversion script now auto-detects HuggingFace model names and downloads them automatically!
# Convert Gemma 3 270M (small, good for testing)
./anemll/utils/convert_model.sh \
--model google/gemma-3-270m-it \
--output /path/to/output/gemma3_270m \
--context 512 \
--batch 64 \
--lut2 4 \
--lut3 6 \
--chunk 1
# Convert Gemma 3 1B with LUT6 and 4K context (single chunk)
./anemll/utils/convert_model.sh \
--model google/gemma-3-1b-it \
--output /path/to/output/gemma3_1b_lut6_ctx4096 \
--context 4096 \
--batch 64 \
--lut1 6 \
--lut2 6 \
--lut3 6 \
--chunk 1
# Test the converted model
python3 tests/chat.py --meta /path/to/output/gemma3_270m/meta.yaml --prompt "Hello!"Gemma 3 Notes:
- HuggingFace model names (e.g.,
google/gemma-3-1b-it) are auto-detected and downloaded - 270M model: Uses monolithic format (single CoreML file) with argmax - ideal for quick testing
- 1B model: Uses standard chunked format (separate embeddings, FFN, LM head)
- Uses split KV cache: local layers (sliding window 512) + global layers (full context)
- For context > 512: 4-function models (infer, infer_rotate, prefill, prefill_rotate) enable automatic cache rotation
- Recommended:
--chunk 1for all Gemma 3 models (1B fits in single chunk) - Supports context lengths up to 4096 (512-2048 recommended for optimal ANE performance)
- Large vocabulary (262K tokens) uses 16-way LM head splitting
- Requires HuggingFace login for gated models:
hf login
⚠️ FP16 Overflow Warning: Gemma 3 models can produce activations exceeding FP16 range (65,504). See FP16 Compatibility below.
- Auto-downloads models: No manual setup required, downloads models from HuggingFace
- Fast validation: Uses unquantized FP16 conversion for quick pipeline testing
- Virtual environment aware: Automatically activates env-anemll if present
- End-to-end validation: Tests cover conversion → Python inference → Swift CLI inference
- Clean testing: Uses
/tmpdirectories to avoid cluttering your workspace - HuggingFace Authentication: Automatically uses your HF token for gated models
Some GPTQ and Spin Quant should greatly improve LUT4 models.
Visit our Hugging Face repository for the latest converted models.
This is Beta Release 0.3.5 — Gemma 3, monolithic models, in-model argmax, ANEMLL Chat redesign, and ANE stability fixes.
- Breaking Change:
install_dependencies.shmoved to project root- Dependency baseline:
coremltools>=9.0- Stable architectures: LLaMA 3.1/3.2, DeepSeek R1, DeepHermes, Qwen 3, Qwen 2.5, Gemma 3
- New conversion modes: Monolithic (
convert_monolith.sh), in-model argmax (--argmax), per-component LUT (--lut-embeddings,--lut-lmhead)Please visit https://huggingface.co/anemll for pre-converted models and follow @anemll for updates
Star this repo to support the project!
- Downloads reference or custom models from HuggingFace
- Inference / chat implementation use Swift Library
- Sample TestFlight App for a quick test
- See iOS/macOS Sample Applications Guide for details
Tip
Try our TestFlight app: Join Beta
The Swift CLI provides a reference implementation for running models on Apple Neural Engine. For detailed documentation, see Swift CLI Guide.
- Download a model from Hugging Face
- Convert the model using our single-shot conversion script:
./anemll/utils/convert_model.sh --model <path_to_model> --output <output_directory>- Run the model using our sample code:
python ./tests/chat.py --meta <output_directory>/meta.yamlFor detailed conversion steps and advanced options, see:
We provide two chat interfaces:
chat.py- Basic chat interface for quick testingchat_full.py- Advanced chat with conversation history management
Features of chat_full.py:
- Maintains full conversation history within context window
- Automatically truncates older messages when needed
- Shifts context window dynamically during long responses
- Shows generation speed and token statistics
- Better handles multi-turn conversations
# Test complete pipeline: download → convert → inference
./tests/conv/test_qwen_simple.sh # Tests Qwen3-0.6B conversion
./tests/conv/test_llama_simple.sh # Tests meta-llama/Llama-3.2-1B (requires HF access)📝 Note: Test scripts use small models (0.6B-1B parameters) with unquantized FP16 conversion for faster testing and validation. For production models with quantization (LUT4/LUT6), use the full conversion script with your preferred model size.
# Basic chat
python ./tests/chat.py --meta ./converted_models/meta.yaml
# Full conversation mode
python ./tests/chat_full.py --meta ./converted_models/meta.yamlSee chat.md for more details
[Note] The first time the model loads, macOS will take some time to place it on the device. Subsequent loads will be instantaneous. Use Ctrl-D to exit, Ctrl-C to interrupt inference.
- macOS Sequoia with Apple Neural Engine (Apple Silicon recommended)
- Minimum 16GB RAM (32GB recommended for 8B models)
- Python 3.9-3.11 (Python 3.9 strongly recommended for best compatibility)
- Xcode Command Line Tools (for CoreML compiler)
- Dependencies: coremltools>=9.0, transformers>=4.36.0, numpy>=1.24.0, scikit-learn<=1.5.1
Recommended: UV Setup (fast, reproducible):
# Install uv (once)
brew install uv
# Create env-anemll with Python 3.9 and install dependencies
./create_uv_env.sh
source env-anemll/bin/activate
./install_dependencies.sh
# Verify
python --version # Should show 3.9.x
python -c "import coremltools; print(coremltools.__version__)"Alternative: Standard venv:
./create_python39_env.sh
source env-anemll/bin/activate
./install_dependencies.shTest the pipeline:
./tests/conv/test_qwen_simple.sh # Qwen3-0.6B end-to-end (auto-downloads ~2.4GB)
./tests/conv/test_llama_simple.sh # SmolLM-135M end-to-end (auto-downloads ~500MB)📝 Note on Test Scripts: The automated test scripts will automatically download required models from HuggingFace:
test_qwen_simple.shdownloadsQwen/Qwen3-0.6B(2.4GB) - tiny model, unquantized FP16test_llama_simple.shdownloadsHuggingFaceTB/SmolLM-135M(500MB) - tiny model, unquantized FP16First run may take longer due to model downloads. Models are cached for subsequent runs. These use small models with no quantization for fast validation - ideal for testing the pipeline.
Alternative: Test with your own models:
# Convert any HuggingFace model ./anemll/utils/convert_model.sh --model <your_model_path> --output /tmp/test-model python3 tests/chat.py --meta /tmp/test-model/meta.yaml --prompt "Hello!"
The installation script automatically verifies:
- ✅ Python version compatibility (3.9-3.11 supported, 3.9 recommended)
- ✅ Xcode Command Line Tools (
xcode-select --installif missing) - ✅ CoreML compiler (
xcrun --find coremlcompiler) - ✅ PyTorch with MPS support
- ✅ CoreML Tools compatibility
- ✅ Apple Neural Engine availability
Manual verification commands:
# Check CoreML compiler
xcrun --find coremlcompiler
# Verify Python environment
python --version # Should show 3.9.x - 3.11.x
pip list | grep -E "(torch|coremltools|transformers)"
# Test Apple Neural Engine
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"🦙 LLaMA Family (Stable)
- Meta LLaMA 3.1/3.2 (1B, 8B) - Production ready
- DeepSeek R1 (8B distilled) - Based on LLaMA architecture
- DeepHermes (3B, 8B) - LLaMA-based fine-tuned models
- Context lengths: Up to 2048 tokens (512-1024 recommended for optimal ANE performance, 4K verified)
Qwen Family (Stable)
- Qwen 3 (0.6B, 1.7B, 8B) — chunked and monolithic, thinking mode support
- Qwen 2.5 (0.5B, 1.5B, 3B, 7B) — chunked and monolithic
- Context lengths: Up to 4K (512-2048 recommended for ANE)
Gemma 3 Family (Stable)
- Gemma 3 (270M, 1B, 4B QAT) — split KV cache with sliding-window + global attention
- Context lengths: Up to 4096 tokens (512-2048 recommended for ANE)
- Special features: SWA + global attention, FP16 scaling, in-model argmax, 4-function rotation support
- M1/A14 limitation: Constrained to 512-context monolithic models due to ANE non-uniform state shape restrictions
| Model Family | Sizes | Context | Chunked | Monolithic | Status |
|---|---|---|---|---|---|
| LLaMA 3.1/3.2 | 1B, 8B | 512-2048 | Yes | Yes | Stable |
| DeepSeek R1 | 8B | 512-1024 | Yes | — | Stable |
| DeepHermes | 3B, 8B | 512-1024 | Yes | — | Stable |
| Qwen 3 | 0.6B, 1.7B, 8B | 512-4096 | Yes | Yes | Stable |
| Qwen 2.5 | 0.5B, 1.5B, 3B, 7B | 512-2048 | Yes | Yes | Stable |
| Gemma 3 | 270M, 1B, 4B QAT | 512-4096 | Yes | Yes | Stable |
- Recommended context: 512-1024 tokens for best performance
- Memory requirements: 16GB+ RAM for 1B models, 32GB+ for 8B models
- Quantization: LUT4 (FFN) + LUT6 (LM Head) for optimal speed/quality balance
- Chunking: Automatic chunking for large models to fit ANE constraints
- Additional Qwen 2.5 variants (14B, 32B)
- Mistral family support
- Enhanced quantization (GPTQ, SpinQuant integration)
- Larger context lengths (8K, 16K optimization)
Ready-to-use models available at Hugging Face:
- iOS-friendly builds (unzipped .mlmodelc)
- Standard builds for macOS development
- Multiple quantization levels (FP16, LUT4, LUT6)
Apple Neural Engine (ANE) operates in FP16 precision, which can only represent values up to ±65,504. Some models (particularly Gemma 3) produce activations that exceed this range, causing NaN/Inf failures.
Models trained in BF16 (range ±3.4×10³⁸) may have:
- Residual accumulation overflow: The cumulative
hidden = hidden + attention + mlpgrows too large - All sub-tensors within range: Individual attention, MLP, and norm outputs are fine
- Overflow in layer outputs: Combined residual stream exceeds FP16 max
This affects all Gemma 3 sizes (270M through 27B) - see Unsloth's analysis.
Check any HuggingFace model for ANE compatibility:
# Quick check
python anemll/utils/fp16_compatibility_check.py --model google/gemma-3-1b-it
# Full analysis with clamp sweep
python anemll/utils/fp16_compatibility_check.py --model google/gemma-3-4b-it-qat-int4-unquantized --sweepRecommended one-command pre-conversion sweep:
./anemll/utils/fp16_preflight.sh --model <model_id_or_path>This runs the sweep by default and writes a JSON report to tests/dev/logs/.
The tool reports:
- Weight analysis (are weights within FP16 range?)
- Precision tests (BF16, FP16, FP16→FP32)
- Residual accumulation analysis
- Recommended scaling factor (α)
We support two approaches:
| Approach | Pros | Cons |
|---|---|---|
| Weight Scaling (Recommended) | Zero runtime overhead, 100% quality match | Requires preprocessing |
| Runtime Clamping | Simple to implement | Adds ops per layer |
For Gemma 3 models, apply a weight-only transformation:
alpha = 0.1875 # 3/16, adjust based on model
# 1. Scale embedding weights
embed_tokens.weight *= alpha
# 2. Transform post-norm weights (Gemma uses (1+w) gain)
for layer in layers:
post_attention_layernorm.weight = alpha * (1 + w_old) - 1
post_feedforward_layernorm.weight = alpha * (1 + w_old) - 1| Model | Peak Activation | α Recommended | Status |
|---|---|---|---|
| gemma-3-270m | 104,162 (1.6x) | 0.48 | 100% match |
| gemma-3-1b-it | 61,040 (0.93x) | 0.82 | 100% match |
| gemma-3-4b-it-qat | 292,969 (4.5x) | 0.17-0.1875 | 100% match |
- GEMMA3_FP16_SCALING.md - Detailed scaling guide
- fp16_compatibility_check.py - Diagnostic tool
- fp16_preflight.sh - One-command FP16 pre-conversion sweep
- Thanks to @apple for developing the Apple Neural Engine
- Thanks to Apple CoreML Tools team for providing the tools https://github.com/apple/coremltools
- Thanks to @huggingface for providing the transformers library and models
- Stephen Panaro https://x.com/flat for feedback and coreml-llm-cli https://github.com/smpanaro/coreml-llm-cli
- Seba https://x.com/CulStory for inspiration with fast ANE models. https://huggingface.co/seba
- Maynard Handley https://x.com/handleym99 For indepth ANE resources https://github.com/name99-org/AArch64-Explore/blob/main/vol7%20ANE.nb.pdf and feedback
Note
We welcome contributions! Please read our contributing guidelines before submitting PRs.
Feel free to submit issues and pull requests to improve ANEMLL!
Note
If you're using ANEMLL in your project, please submit a PR to add it to this list. We love to showcase how the community is using ANEMLL!
- anemll-server - Server implementation of ANEMLL inference
Note
If you're using ANEMLL in your project, please submit a PR to add it to this list. We love to showcase how the community is using ANEMLL!
For examples of how to integrate ANEMLL into your projects, see:
- 🌐 Website: anemll.com
- 🤗 Models: huggingface.co/anemll
- 📱 X: @anemll
- 💻 GitHub: github.com/anemll
For any questions or support, reach out to us at [email protected]
ANEMLL is licensed under the MIT License. https://opensource.org/license/mit