Powdered Metal — High-performance LLM fine-tuning framework for Apple Silicon, written in Rust.
PMetal is a machine learning framework that brings Unsloth-style optimizations to macOS. It leverages custom Metal shaders and the MLX framework to achieve state-of-the-art training throughput on Apple Silicon GPUs.
```bash
# Clone the repository
git clone https://github.com/epistates/pmetal.git
cd pmetal

# Build in release mode
cargo build --release
```

```bash
# LoRA fine-tuning with auto-detected max-seq-len and sequence packing
./target/release/pmetal train \
    --model qwen/Qwen3-0.6B-Base \
    --dataset path/to/train.jsonl \
    --output ./output \
    --lora-r 16 \
    --batch-size 4 \
    --learning-rate 2e-4
```

```bash
# Inference with thinking mode enabled
./target/release/pmetal infer \
    --model qwen/Qwen3-0.6B-Base \
    --lora ./output/lora_weights.safetensors \
    --prompt "Does absolute truth exist?" \
    --chat \
    --show-thinking
```

PMetal is organized as a Rust workspace with 15 specialized crates:
```
pmetal/
├── pmetal-core         # Foundation: configs, traits, types
├── pmetal-metal        # Custom Metal GPU kernels
├── pmetal-mlx          # MLX backend integration (KV cache, RoPE, etc.)
├── pmetal-models       # LLM architectures (Llama, Qwen, DeepSeek, etc.)
├── pmetal-lora         # LoRA/QLoRA training implementations
├── pmetal-trainer      # Training loops (SFT, DPO, GRPO)
├── pmetal-data         # Dataset loading and preprocessing
├── pmetal-hub          # HuggingFace Hub integration
├── pmetal-distill      # Knowledge distillation
├── pmetal-merge        # Model merging (SLERP, TIES, DARE)
├── pmetal-gguf         # GGUF format with imatrix quantization
├── pmetal-mhc          # Manifold-Constrained Hyper-Connections
├── pmetal-distributed  # Distributed training support
├── pmetal-vocoder      # BigVGAN neural vocoder
└── pmetal-cli          # Command-line interface
```
```
                    ┌─────────────────┐
                    │   pmetal-cli    │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  pmetal-trainer │ │   pmetal-lora   │ │   pmetal-data   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  pmetal-models  │ │   pmetal-mlx    │ │  pmetal-metal   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │   pmetal-core   │
                    └─────────────────┘
```
| Family | Variants | LoRA | QLoRA | Full FT |
|---|---|---|---|---|
| Llama | 2, 3, 3.1, 3.2, 3.3 | ✓ | ✓ | ✓ |
| Llama 4 | Scout, Maverick | ✓ | - | ✓ |
| Qwen | 2, 2.5, 3, 3-MoE | ✓ | - | ✓ |
| DeepSeek | V3, V3.2, V3.2-Speciale | ✓ | - | ✓ |
| Mistral | 7B, 8x7B | ✓ | ✓ | ✓ |
| Gemma | 2, 3 | ✓ | - | ✓ |
| Phi | 3, 4 | ✓ | - | ✓ |
| Cohere | Command R | ✓ | - | ✓ |
| Granite | 3.0, 3.1 | ✓ | - | ✓ |
| NemotronH | Hybrid (Mamba+Attention) | ✓ | - | ✓ |
| StarCoder2 | 3B, 7B, 15B | ✓ | - | ✓ |
| RecurrentGemma | Griffin | ✓ | - | ✓ |
| Jamba | 1.5 | ✓ | - | ✓ |
| GPT-OSS | 20B, 120B | ✓ | - | - |
Architecture implementations exist but are not yet integrated into the CLI dispatcher.
| Family | Variants | Status |
|---|---|---|
| Pixtral | 12B | Architecture implemented |
| Qwen2-VL | 2B, 7B | Architecture implemented |
| MLlama | 3.2-Vision | Architecture implemented |
| CLIP | ViT-L/14 | Architecture implemented |
| Whisper | Base, Small, Medium, Large | Architecture implemented |
| Family | Variants | Status |
|---|---|---|
| Flux | 1-dev, 1-schnell | Dispatcher + pipeline implemented |
- Supervised Fine-Tuning (SFT): Standard next-token prediction
- LoRA: Low-Rank Adaptation with configurable rank and alpha
- QLoRA: 4-bit quantized base weights with LoRA adapters
- DoRA: Weight-Decomposed Low-Rank Adaptation
- DPO: Direct Preference Optimization for RLHF
- GRPO: Group Relative Policy Optimization
- DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
- GSPO: Group Sequence Policy Optimization (fixes GRPO length bias)
- PPO: Proximal Policy Optimization
- ORPO: Odds Ratio Preference Optimization (reference-free)
- SimPO: Simple Preference Optimization
- KTO: Kahneman-Tversky Optimization (unpaired preference data)
- Online DPO: Online Direct Preference Optimization with reward models
- Diffusion Training: LLaDA-style masked diffusion for language models
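As a concrete reference point for the preference-optimization methods above, here is a minimal CPU sketch of the DPO objective on a single preference pair. The function name and scalar log-probability interface are illustrative, not PMetal's actual API; real trainers compute these quantities batched on the GPU.

```rust
// Illustrative DPO objective (not PMetal's API):
//   loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
// where pi_* / ref_* are the summed log-probs of the chosen/rejected
// responses under the policy and the frozen reference model.

fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// DPO loss for one preference pair, given per-response log-probabilities.
fn dpo_loss(
    policy_chosen: f64,
    policy_rejected: f64,
    ref_chosen: f64,
    ref_rejected: f64,
    beta: f64,
) -> f64 {
    let logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected));
    -sigmoid(logits).ln()
}

fn main() {
    // The policy prefers the chosen response more than the reference does,
    // so the loss falls below ln(2), its value at initialization (where
    // policy == reference and the implicit reward margin is zero).
    let loss = dpo_loss(-10.0, -14.0, -11.0, -12.0, 0.1);
    println!("DPO loss: {:.4}", loss);
    assert!(loss < std::f64::consts::LN_2);
}
```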
Custom Metal shaders provide significant speedups:
- FlashAttention: O(n) memory attention with fused softmax
- Fused LoRA: Combined forward pass for adapter layers
- Fused Cross-Entropy: Unsloth-style chunked loss computation
- Fused RoPE: Rotary position embeddings in-kernel
- Fused Sampler: JIT-compiled token sampling
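The core idea behind the fused cross-entropy kernel can be sketched on the CPU: accumulate the logsumexp over the vocabulary chunk by chunk, so the full softmax is never materialized at once. This is an illustrative sketch of the technique, not the Metal kernel itself.

```rust
// Chunked cross-entropy: -log softmax(logits)[target], streaming over
// `chunk` logits at a time. Numerically stable via a running max and a
// rescaled partial sum, so memory stays O(chunk) instead of O(vocab).
fn chunked_cross_entropy(logits: &[f64], target: usize, chunk: usize) -> f64 {
    let mut running_max = f64::NEG_INFINITY;
    let mut running_sum = 0.0; // sum of exp(logit - running_max) so far
    for block in logits.chunks(chunk) {
        let block_max = block.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let new_max = running_max.max(block_max);
        // Rescale the old partial sum to the new max before adding this chunk.
        running_sum = running_sum * (running_max - new_max).exp()
            + block.iter().map(|&x| (x - new_max).exp()).sum::<f64>();
        running_max = new_max;
    }
    let logsumexp = running_max + running_sum.ln();
    logsumexp - logits[target]
}

fn main() {
    let logits = [2.0, -1.0, 0.5, 3.0, -0.5, 1.5];
    // The chunked result matches the one-shot computation for any chunk size.
    let full = chunked_cross_entropy(&logits, 3, logits.len());
    let streamed = chunked_cross_entropy(&logits, 3, 2);
    assert!((full - streamed).abs() < 1e-12);
    println!("loss = {:.4}", streamed);
}
```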
Efficiently pack multiple sequences into single batches:

```bash
--use-sequence-packing   # Enable packing (99.7% efficiency)
--max-seq-len 2048       # Maximum packed sequence length
```

Trade compute for memory on large models:

```bash
--gradient-checkpointing # Enable memory-efficient training
```

Supported formats for training data:
ShareGPT (conversations):

```json
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
```

Alpaca (instruction):

```json
{"instruction": "...", "input": "...", "output": "..."}
```

Messages (chat):

```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

| Parameter | Default | Description |
|---|---|---|
| `--lora-r` | 16 | LoRA rank |
| `--lora-alpha` | 32.0 | LoRA scaling factor (2x rank) |
| `--batch-size` | 4 | Micro-batch size |
| `--learning-rate` | 2e-4 | Learning rate |
| `--max-seq-len` | 0 | Max seq len (0 = auto-detect) |
| `--epochs` | 1 | Number of training epochs |
| `--max-grad-norm` | 1.0 | Gradient clipping threshold |
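To see how `--lora-r` and `--lora-alpha` interact, here is a minimal sketch of a LoRA forward pass: the adapter update B·A·x is scaled by alpha / r, which is why the default alpha (32.0) is twice the default rank (16), giving a scale of 2.0. The matrix layout and names here are illustrative, not PMetal's internals.

```rust
// Dense matrix-vector product over row-major f64 matrices.
fn matvec(m: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

/// LoRA forward: y = W·x + (alpha / r) · B·(A·x)
fn lora_forward(
    w: &[Vec<f64>], // frozen base weight, d_out × d_in
    a: &[Vec<f64>], // adapter down-projection, r × d_in
    b: &[Vec<f64>], // adapter up-projection, d_out × r
    x: &[f64],
    alpha: f64,
) -> Vec<f64> {
    let r = a.len() as f64;
    let scale = alpha / r; // e.g. alpha = 32.0, r = 16 -> scale = 2.0
    let base = matvec(w, x);
    let update = matvec(b, &matvec(a, x));
    base.iter().zip(&update).map(|(y, u)| y + scale * u).collect()
}

fn main() {
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]]; // identity base weight
    let a = vec![vec![1.0, 1.0]];                 // rank r = 1
    let b = vec![vec![0.5], vec![0.5]];
    let y = lora_forward(&w, &a, &b, &[1.0, 2.0], 2.0);
    // A·x = 3, scale = 2/1, update = [3, 3]; base [1, 2] -> y = [4, 5]
    assert_eq!(y, vec![4.0, 5.0]);
}
```

With alpha set to zero the adapter contributes nothing and the output equals the frozen base projection, which is why freshly initialized adapters (B = 0 in practice) leave the model's behavior unchanged.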
| Parameter | Default | Description |
|---|---|---|
| `--temperature` | Model default | Sampling temperature |
| `--top-k` | Model default | Top-k sampling |
| `--top-p` | Model default | Nucleus (top-p) sampling |
| `--max-tokens` | 256 | Maximum generation length |
| `--repetition-penalty` | 1.0 | Repetition penalty |
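A sketch of how the sampling flags above compose: temperature rescales the logits, top-k keeps the k highest-probability tokens, and top-p (nucleus) keeps the smallest set whose cumulative probability exceeds p. This is an illustrative CPU version, not the fused Metal sampler.

```rust
/// Return the token indices that survive top-k then top-p filtering,
/// ordered by descending probability.
fn filter_candidates(logits: &[f64], temperature: f64, top_k: usize, top_p: f64) -> Vec<usize> {
    // Softmax over temperature-scaled logits (max-subtracted for stability).
    let scaled: Vec<f64> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let total: f64 = exps.iter().sum();

    // Sort token indices by probability, descending.
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&i, &j| exps[j].partial_cmp(&exps[i]).unwrap());

    // Keep at most top_k tokens, cutting off once cumulative prob > top_p.
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for &i in order.iter().take(top_k) {
        kept.push(i);
        cumulative += exps[i] / total;
        if cumulative > top_p {
            break;
        }
    }
    kept
}

fn main() {
    // One dominant token: nucleus filtering collapses to near-greedy.
    let kept = filter_candidates(&[5.0, 1.0, 0.0, -1.0], 1.0, 3, 0.9);
    assert_eq!(kept, vec![0]);
}
```

The actual sampler would then draw from the renormalized distribution over `kept`; raising `--temperature` flattens the distribution, so more tokens survive the same `--top-p` cutoff.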
```bash
# Debug build
cargo build

# Release build with optimizations
cargo build --release

# Run tests
cargo test --all

# Run clippy
cargo clippy --all
```

To add a new model:

- Implement the `CausalLMModel` trait in `pmetal-models`
- Add architecture detection in `dispatcher.rs`
- Create a LoRA wrapper in `pmetal-lora` if needed
- Update the model registry
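The first step above follows a standard Rust pattern: a new architecture implements a shared trait that the dispatcher and LoRA wrappers build on. The trait below is a hypothetical stand-in; PMetal's real `CausalLMModel` in `pmetal-models` operates on MLX arrays and its actual signature will differ.

```rust
/// Hypothetical stand-in for the crate's causal-LM abstraction
/// (illustrative shape only, not the real trait).
trait CausalLMModel {
    /// Map input token ids to next-token logits (flattened per position).
    fn forward(&self, input_ids: &[u32]) -> Vec<f64>;
    fn vocab_size(&self) -> usize;
}

/// A toy "new architecture" with a trivial forward pass.
struct MyNewArch {
    vocab: usize,
}

impl CausalLMModel for MyNewArch {
    fn forward(&self, input_ids: &[u32]) -> Vec<f64> {
        // A real model would run embedding, attention, and MLP blocks here;
        // this stub just emits uniform logits of the right shape.
        vec![0.0; input_ids.len() * self.vocab]
    }
    fn vocab_size(&self) -> usize {
        self.vocab
    }
}

fn main() {
    let model = MyNewArch { vocab: 8 };
    let logits = model.forward(&[1, 2, 3]);
    // One logit vector per input position.
    assert_eq!(logits.len(), 3 * model.vocab_size());
}
```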
Run the included benchmarks:

```bash
# FFI overhead benchmark
cargo bench --bench ffi_overhead
```

If you see "cannot execute tool 'metal'":

```bash
xcodebuild -downloadComponent MetalToolchain
```

If you run out of memory, try these options:

- Reduce `--batch-size`
- Enable `--gradient-checkpointing`
- Use `--use-sequence-packing` for variable-length data
- Reduce `--max-seq-len`
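To see why `--use-sequence-packing` saves memory on variable-length data: instead of padding every sequence to the batch maximum, sequences are packed into shared rows of length `--max-seq-len`. A greedy first-fit sketch (PMetal's actual packer may use a different strategy):

```rust
/// Pack sequence lengths into rows of capacity `max_seq_len`, first-fit.
/// Returns one Vec of sequence indices per packed batch row.
fn pack_sequences(lengths: &[usize], max_seq_len: usize) -> Vec<Vec<usize>> {
    let mut bins: Vec<(usize, Vec<usize>)> = Vec::new(); // (tokens used, members)
    for (i, &len) in lengths.iter().enumerate() {
        // First-fit: place each sequence into the first row with room left.
        match bins.iter_mut().find(|(used, _)| used + len <= max_seq_len) {
            Some((used, members)) => {
                *used += len;
                members.push(i);
            }
            None => bins.push((len, vec![i])),
        }
    }
    bins.into_iter().map(|(_, members)| members).collect()
}

fn main() {
    // Six variable-length sequences fit in 2 packed rows of 2048 tokens,
    // instead of 6 rows each padded to 2048.
    let rows = pack_sequences(&[1000, 900, 600, 500, 400, 300], 2048);
    assert_eq!(rows.len(), 2);
}
```

Packed rows need block-diagonal attention masks so tokens from different sequences cannot attend to each other; that bookkeeping is what the training pipeline handles for you.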
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
- MLX - Apple's machine learning framework
- mlx-rs - Rust bindings for MLX
- Unsloth - Inspiration for fused kernel optimizations
- HuggingFace - Model hub and tokenizers