The central knowledge base for the Adaptive Hybrid Quantization Framework.
QKV Core is a kernel-level optimization pipeline designed to democratize Large Language Models (LLMs). By solving critical memory fragmentation issues through Surgical Alignment and utilizing Entropy-based Hybrid Compression, QKV Core enables high-performance inference of 7B+ models on consumer hardware like the NVIDIA GTX 1050 (4GB VRAM).
Explore the technical details and user guides below:
- Installation Guide: Setup prerequisites (Python 3.10+, PyTorch, CUDA) and installation steps.
- Quick Start: Run your first model conversion and launch the Web UI in under 5 minutes.
- CLI Reference: Complete documentation for `qkv-cli` commands.
- System Architecture: Detailed C4 diagrams and sequence flows explaining the pipeline.
- Adaptive Compression Logic: How our entropy analysis decides between Dictionary Coding and Raw Storage.
- Surgical Alignment Explained: The math behind fixing `llama.cpp` padding errors and saving 44 MB per file.
- VRAM Analysis: Comparative graphs showing OOM prevention on 4GB cards.
- I/O Speed Tests: Evidence of 34% faster load times due to block alignment.
Standard quantization tools often fail on low-VRAM devices due to inefficient memory padding (fragmentation). QKV Core introduces a novel "Trim & Re-align" approach (a code sketch follows the list):
- Analyze: Measures tensor entropy to choose the best compression format.
- Compress: Uses bit-packed dictionary encoding for repetitive weights.
- Align: Surgically trims padding bytes to strictly adhere to 110-byte (Q3_K) block boundaries.
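The sketch below illustrates this three-step logic in Python. The helper names (`tensor_entropy`, `choose_format`, `surgical_trim`) and the 4-bit entropy threshold are illustrative assumptions, not the actual QKV Core API; it also assumes any bytes past the last full block are pure padding.

```python
import numpy as np

# Q3_K packs 256 weights into a 110-byte super-block (see the list above).
Q3_K_BLOCK_SIZE = 110


def tensor_entropy(values: np.ndarray) -> float:
    """Shannon entropy H = -sum(p_i * log2(p_i)) of the quantized values, in bits per symbol."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


def choose_format(values: np.ndarray, threshold_bits: float = 4.0) -> str:
    """Hypothetical decision rule: repetitive (low-entropy) tensors compress well
    with bit-packed dictionary coding; high-entropy tensors are stored raw because
    the dictionary overhead would outweigh the savings."""
    return "dictionary" if tensor_entropy(values) < threshold_bits else "raw"


def surgical_trim(blob: bytes, block_size: int = Q3_K_BLOCK_SIZE) -> bytes:
    """Trim trailing padding so the payload ends exactly on a block boundary
    (assumes everything past the last full block is padding, not data)."""
    return blob[: (len(blob) // block_size) * block_size]


# Example: a highly repetitive tensor selects dictionary coding.
weights = np.random.choice([0, 1, 2], size=4096, p=[0.8, 0.15, 0.05])
print(choose_format(weights))              # "dictionary" (entropy is roughly 0.88 bits)
print(len(surgical_trim(b"\x00" * 333)))   # 330: three full 110-byte blocks
```

In practice the entropy threshold would be tuned per tensor type; the repetitive weight tensors mentioned above are exactly the low-entropy case where dictionary coding pays off.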
- Found a bug? Open an Issue.
- Want to contribute? Read our Contribution Guidelines.
- Discussions: Join the conversation in GitHub Discussions.
Maintained by Hüseyin Kama