The central knowledge base for the Adaptive Hybrid Quantization Framework.
QKV Core is a kernel-level optimization pipeline designed to democratize Large Language Models (LLMs). By solving critical memory fragmentation issues through Surgical Alignment and utilizing Entropy-based Hybrid Compression, QKV Core enables high-performance inference of 7B+ models on consumer hardware like the NVIDIA GTX 1050 (4GB VRAM).
Explore the technical details and user guides below:
- Installation Guide: Setup prerequisites (Python 3.10+, PyTorch, CUDA) and installation steps.
- Quick Start: Run your first model conversion and launch the Web UI in under 5 minutes.
- CLI Reference: Complete documentation for `qkv-cli` commands.
- System Architecture: Detailed C4 diagrams and sequence flows explaining the pipeline.
- Adaptive Compression Logic: How our entropy analysis decides between Dictionary Coding and Raw Storage.
- Surgical Alignment Explained: The math behind fixing `llama.cpp` padding errors and saving 44 MB per file.
- VRAM Analysis: Comparative graphs showing OOM prevention on 4GB cards.
- I/O Speed Tests: Evidence of 34% faster load times due to block alignment.
Standard quantization tools often fail on low-VRAM devices due to inefficient memory padding (fragmentation). QKV Core introduces a novel "Trim & Re-align" approach (a code sketch follows the list):
- Analyze: Measures tensor entropy to choose the best compression format.
- Compress: Uses bit-packed dictionary encoding for repetitive weights.
- Align: Surgically trims padding bytes to strictly adhere to 110-byte (Q3_K) block boundaries.
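The sketch below illustrates this three-step logic in Python. The helper names (`tensor_entropy`, `choose_format`, `surgical_trim`) and the 4-bit entropy threshold are illustrative assumptions, not the actual QKV Core API; it also assumes any bytes past the last full block are pure padding.

```python
import numpy as np

# Q3_K packs 256 weights into a 110-byte super-block (see the list above).
Q3_K_BLOCK_SIZE = 110


def tensor_entropy(values: np.ndarray) -> float:
    """Shannon entropy H = -sum(p_i * log2(p_i)) of the quantized values, in bits per symbol."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


def choose_format(values: np.ndarray, threshold_bits: float = 4.0) -> str:
    """Hypothetical decision rule: repetitive (low-entropy) tensors compress well
    with bit-packed dictionary coding; high-entropy tensors are stored raw because
    the dictionary overhead would outweigh the savings."""
    return "dictionary" if tensor_entropy(values) < threshold_bits else "raw"


def surgical_trim(blob: bytes, block_size: int = Q3_K_BLOCK_SIZE) -> bytes:
    """Trim trailing padding so the payload ends exactly on a block boundary
    (assumes everything past the last full block is padding, not data)."""
    return blob[: (len(blob) // block_size) * block_size]


# Example: a highly repetitive tensor selects dictionary coding.
weights = np.random.choice([0, 1, 2], size=4096, p=[0.8, 0.15, 0.05])
print(choose_format(weights))              # "dictionary" (entropy is roughly 0.88 bits)
print(len(surgical_trim(b"\x00" * 333)))   # 330: three full 110-byte blocks
```

In practice the entropy threshold would be tuned per tensor type; the repetitive weight tensors mentioned above are exactly the low-entropy case where dictionary coding pays off.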
- Found a bug? Open an Issue.
- Want to contribute? Read our Contribution Guidelines.
- Discussions: Join the conversation in GitHub Discussions.
Maintained by Hüseyin Kama