StageRAG is a lightweight, production-ready RAG framework designed to give you precise control over the speed-versus-accuracy trade-off. It allows you to build high-factuality applications while gracefully managing uncertainty in LLM responses.
- Dual-Mode Pipelines: Dynamically switch between two processing modes based on your needs:
  - Speed Mode: 3-step pipeline (1B + 3B models, ~3-5s response)
  - Precision Mode: 4-step pipeline (3B model, ~6-12s response)
- Easy Knowledge Base Integration: Deploy with your own data by providing a JSONL file in the standard conversation format. The system automatically builds vector indices and handles retrieval.
- Built-in Confidence Scoring: Every answer includes multi-component confidence evaluation (retrieval quality, answer structure, relevance, uncertainty detection). Programmatically handle low-confidence responses to reduce hallucinations.
- Optimized for Smaller Models: Built on Llama 3.2 1B and 3B models with 4-bit quantization support, requiring only 5-10GB GPU memory while maintaining quality.
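As a rough illustration of the confidence scoring described above, the sketch below combines the four component scores into one value and falls back to a safe message when that value is low. It assumes the overall score is a weighted average with the equal 0.25 weights shown in this README's `stagerag/config.py` example. The function names, the fallback text, and the 0.6 threshold are hypothetical and are not part of the StageRAG API.

```python
from typing import Dict

# Equal weights, mirroring the config example in this README.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "retrieval": 0.25,
    "basic_quality": 0.25,
    "relevance": 0.25,
    "uncertainty": 0.25,
}


def overall_confidence(components: Dict[str, float],
                       weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of component scores, each expected in [0, 1].

    Assumption: StageRAG may combine its components differently.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(weights[name] * components[name] for name in weights)
    return score / total_weight


def handle_answer(answer: str, confidence: float, threshold: float = 0.6) -> str:
    """Return the answer, or a fallback message when confidence is low.

    The 0.6 threshold is a hypothetical value; tune it for your data.
    """
    if confidence < threshold:
        return "I'm not confident enough to answer that reliably."
    return answer
```

In an application, the fallback branch is where you might ask a clarifying question or hand the query to a human instead of returning a possibly hallucinated answer.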
You must request access to both Llama models:
- Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- Visit https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- Click "Access gated model" and accept the license
- Wait for approval (usually instant)
pip install huggingface-hub
huggingface-cli login
# Enter your HuggingFace token when prompted
Get your token from: https://huggingface.co/settings/tokens
- Python >= 3.8
- CUDA-capable GPU (recommended) or CPU
- 5GB+ RAM for 4-bit mode, 10GB+ for full precision
- Internet connection for initial model download
git clone https://github.com/darrencxl0301/StageRAG.git
cd StageRAG
# Install main dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .

python scripts/download_data.py

This downloads the sample dataset from darren0301/domain-mix-qa-1k to data/data.jsonl.
import json
import os

from datasets import load_dataset

# Download the sample dataset and write one conversation per line
dataset = load_dataset("darren0301/domain-mix-qa-1k")
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
with open("data/data.jsonl", "w") as f:
    for item in dataset["train"]:
        json.dump({"conversations": item["conversations"]}, f)
        f.write("\n")

Create a JSONL file with this format:
{"conversations": [{"role": "user", "content": "What is EPF?"}, {"role": "assistant", "content": "EPF is the Employees Provident Fund..."}]}
{"conversations": [{"role": "user", "content": "How to apply for leave?"}, {"role": "assistant", "content": "To apply for leave..."}]}

# Basic usage (CPU)
python demo/interactive_demo.py --rag_dataset data/data.jsonl
# With GPU and 4-bit quantization (recommended)
python demo/interactive_demo.py --rag_dataset data/data.jsonl --use_4bit --device cuda

Interactive Commands:
- `mode speed` - Switch to speed mode (3-step)
- `mode precision` - Switch to precision mode (4-step)
- `cache stats` - View cache performance
- `search <query>` - Test RAG retrieval
- `quit` or `q` - Exit

python demo/basic_usage.py --rag_dataset data/data.jsonl

from stagerag import StageRAGSystem
import argparse
# Setup configuration
args = argparse.Namespace(
    rag_dataset='data/data.jsonl',
    device='cuda',
    use_4bit=True,
    cache_size=1000,
    temperature=0.7,
    top_p=0.85,
    max_new_tokens=512,
    max_seq_len=2048,
    disable_rag=False,
    rag_threshold=0.3,
    seed=42
)
# Initialize system
system = StageRAGSystem(args)
# Process query
result = system.process_query(
    "What are the EPF contribution rates?",
    mode="speed"
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']['overall_confidence']:.3f}")
print(f"Time: {result['processing_time']:.2f}s")

# Install test dependencies
pip install pytest pytest-cov
# Run all tests
pytest tests/ -v
# Run specific test files
pytest tests/test_cache.py -v
pytest tests/test_confidence.py -v
pytest tests/test_rag.py -v
# Run with detailed output
pytest tests/test_cache.py -vv
# Run with coverage report
pytest tests/ --cov=stagerag --cov-report=html

| Argument | Default | Description |
|---|---|---|
| `--rag_dataset` | Required | Path to JSONL knowledge base |
| `--device` | cuda | Device to use (cuda/cpu) |
| `--use_4bit` | False | Enable 4-bit quantization |
| `--cache_size` | 1000 | LRU cache size |
| `--temperature` | 0.7 | Sampling temperature (0.0-1.0) |
| `--top_p` | 0.85 | Top-p nucleus sampling |
| `--max_new_tokens` | 512 | Max tokens to generate |
| `--disable_rag` | False | Disable RAG retrieval |
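For reference, the arguments in the table could be declared with `argparse` roughly as follows. This is a minimal sketch that mirrors the documented names and defaults. It is not the parser shipped in the demo scripts, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="StageRAG options (sketch)")
    parser.add_argument("--rag_dataset", required=True,
                        help="Path to JSONL knowledge base")
    parser.add_argument("--device", default="cuda", choices=["cuda", "cpu"],
                        help="Device to use")
    parser.add_argument("--use_4bit", action="store_true",
                        help="Enable 4-bit quantization")
    parser.add_argument("--cache_size", type=int, default=1000,
                        help="LRU cache size")
    parser.add_argument("--temperature", type=float, default=0.7,
                        help="Sampling temperature (0.0-1.0)")
    parser.add_argument("--top_p", type=float, default=0.85,
                        help="Top-p nucleus sampling")
    parser.add_argument("--max_new_tokens", type=int, default=512,
                        help="Max tokens to generate")
    parser.add_argument("--disable_rag", action="store_true",
                        help="Disable RAG retrieval")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--rag_dataset", "data/data.jsonl", "--use_4bit"])
print(args.device, args.use_4bit, args.cache_size)
```

Boolean options such as `--use_4bit` are modeled here as `store_true` flags, which matches how they are passed on the command line in the examples above.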
Edit stagerag/config.py to adjust confidence evaluation:
weights = {
'retrieval': 0.25, # RAG retrieval quality
'basic_quality': 0.25, # Answer structure/length
'relevance': 0.25, # Keyword relevance
'uncertainty': 0.25 # Uncertainty detection
}

StageRAG/
├── stagerag/                 # Main package
│   ├── __init__.py           # Package exports
│   ├── main.py               # StageRAGSystem class
│   ├── cache.py              # LRU cache implementation
│   ├── confidence.py         # Confidence evaluator
│   ├── rag.py                # RAG retrieval system
│   ├── prompts.py            # Prompt templates
│   └── config.py             # Configuration dataclasses
├── demo/                     # Usage examples
│   ├── interactive_demo.py
│   └── basic_usage.py
├── scripts/                  # Utility scripts
│   └── download_data.py      # HuggingFace dataset downloader
├── tests/                    # Test suite
│   ├── test_cache.py
│   ├── test_confidence.py
│   └── test_rag.py
├── data/                     # Knowledge base (created on first run)
│   └── data.jsonl
├── requirements.txt          # Production dependencies
├── requirements-dev.txt      # Development dependencies
├── setup.py                  # Package configuration
└── README.md
Speed Mode (3 steps): User Input → [1B] Normalize → [3B] RAG Filter → [1B] Generate Answer → Response
Precision Mode (4 steps): User Input → [1B] Normalize → [3B] RAG Retrieve → [3B] Synthesize → [3B] Final Answer → Response
| Mode | Avg Time | Avg Confidence | Use Case |
|---|---|---|---|
| Speed | 3.3s | 0.72 | Real-time chat |
| Precision | 7.8s | 0.83 | Complex queries, critical decisions |
Tested on NVIDIA RTX 3090 GPU with 4-bit quantization
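The benchmark suggests one possible pattern: answer in speed mode first, and rerun in precision mode only when confidence is low. The sketch below uses the `process_query` call and result keys shown in the basic usage example. It assumes `"precision"` is the accepted mode string, matching the interactive `mode precision` command. The helper name and the 0.75 threshold are hypothetical.

```python
from typing import Any, Dict


def answer_with_escalation(system: Any, query: str,
                           min_confidence: float = 0.75) -> Dict[str, Any]:
    """Try speed mode first; rerun in precision mode if confidence is low.

    `system` is expected to expose process_query(query, mode=...) returning
    a dict with result["confidence"]["overall_confidence"], as in the
    basic usage example. The 0.75 default is an illustrative value.
    """
    result = system.process_query(query, mode="speed")
    if result["confidence"]["overall_confidence"] >= min_confidence:
        return result
    # Assumption: "precision" is the mode name accepted by process_query.
    return system.process_query(query, mode="precision")
```

With this approach, only low-confidence queries pay the extra latency of the 4-step pipeline. The worst case costs roughly the sum of both modes' processing times.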
Sample dataset: darren0301/domain-mix-qa-1k
Contains 1,000 domain-specific Q&A pairs covering:
- Logical & Mathematical Reasoning
- Specialized Medical Domain Knowledge
- Open-Ended General Instruction Following
- Employee benefits information
- Practical, Real-World Q&A
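Before pointing StageRAG at this dataset or your own, it can help to confirm that every line follows the `conversations` format shown earlier. The following is a minimal sketch of such a check using only the standard library. The function names are hypothetical, and the rules it enforces (a non-empty list of turns, each with string `role` and `content`) are assumptions based on the example records, not a specification of what StageRAG accepts.

```python
import json
from typing import Iterable, List, Tuple


def validate_record(line: str) -> bool:
    """Return True if the line is a JSON object with a non-empty
    'conversations' list whose turns have string 'role' and 'content'."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return False
    return all(
        isinstance(turn, dict)
        and isinstance(turn.get("role"), str)
        and isinstance(turn.get("content"), str)
        for turn in turns
    )


def check_lines(lines: Iterable[str]) -> Tuple[int, List[int]]:
    """Count valid records and collect 1-based numbers of invalid lines."""
    valid, bad = 0, []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        if validate_record(line):
            valid += 1
        else:
            bad.append(number)
    return valid, bad
```

To check a file, open it and pass the file object directly, for example `check_lines(open("data/data.jsonl"))`, then inspect the returned list of bad line numbers.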
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
@software{stagerag2024,
author = {Darren Chai Xin Lun},
title = {StageRAG: A Framework for Building Hallucination-Resistant RAG Applications},
year = {2024},
url = {https://github.com/darrencxl0301/StageRAG},
note = {Dataset: https://huggingface.co/datasets/darren0301/domain-mix-qa-1k}
}

- Built with Llama 3.2 models by Meta
- FAISS for vector similarity search
- Sentence Transformers for embeddings
- HuggingFace for model hosting
Darren Chai Xin Lun
- GitHub: @darrencxl0301
- HuggingFace: @darren0301
If you find this project helpful, please give it a star!