StageRAG: A Framework for Building Hallucination-Resistant RAG Applications

Python 3.8+ | MIT License | HuggingFace

StageRAG is a lightweight, production-ready RAG framework designed to give you precise control over the speed-versus-accuracy trade-off. It allows you to build high-factuality applications while gracefully managing uncertainty in LLM responses.

🌟 Features

  • Dual-Mode Pipelines: Dynamically switch between two processing modes based on your needs (see the sketch after this list):
    • Speed Mode: 3-step pipeline (1B + 3B models, ~3-5s response)
    • Precision Mode: 4-step pipeline (3B model, ~6-12s response)
  • Easy Knowledge Base Integration: Deploy with your own data by providing a JSONL file in the standard conversation format. The system automatically builds vector indices and handles retrieval.
  • Built-in Confidence Scoring: Every answer includes multi-component confidence evaluation (retrieval quality, answer structure, relevance, uncertainty detection). Programmatically handle low-confidence responses to reduce hallucinations.
  • Optimized for Smaller Models: Built on Llama 3.2 1B and 3B models with 4-bit quantization support, requiring only 5-10GB GPU memory while maintaining quality.
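
The snippet below is a minimal sketch of per-query mode selection, using the StageRAGSystem API shown in the Programmatic Usage section; the critical flag is a hypothetical field of your own application, not part of the framework.

# Hypothetical routing helper: use Precision mode for queries your
# application marks as critical, Speed mode otherwise.
def answer(system, query, critical=False):
    mode = "precision" if critical else "speed"
    return system.process_query(query, mode=mode)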

📋 Prerequisites

1. Get Llama Model Access

You must request access to both Llama models:

  1. Visit https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
  2. Visit https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
  3. Click "Access gated model" and accept the license
  4. Wait for approval (usually instant)

2. Login to HuggingFace

pip install huggingface-hub
huggingface-cli login
# Enter your HuggingFace token when prompted

Get your token from: https://huggingface.co/settings/tokens
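
Alternatively, a minimal sketch of logging in from Python with huggingface_hub (this assumes you have exported your token as the HF_TOKEN environment variable):

import os
from huggingface_hub import login

# Read the token from the environment rather than hard-coding it.
login(token=os.environ["HF_TOKEN"])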

3. System Requirements

  • Python >= 3.8
  • CUDA-capable GPU (recommended) or CPU
  • 5GB+ GPU memory (or system RAM when running on CPU) for 4-bit mode, 10GB+ for full precision
  • Internet connection for initial model download

🚀 Installation

Clone Repository

git clone https://github.com/darrencxl0301/StageRAG.git
cd StageRAG

Install Dependencies

# Install main dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

📊 Download Sample Dataset

Option 1: Automatic Download (Recommended)

python scripts/download_data.py

This downloads the sample dataset from darren0301/domain-mix-qa-1k to data/data.jsonl.

Option 2: Manual Download

from datasets import load_dataset
import json
import os

# Download the sample dataset from the HuggingFace Hub.
dataset = load_dataset("darren0301/domain-mix-qa-1k")

# Make sure the target directory exists before writing.
os.makedirs("data", exist_ok=True)

# Write one JSON object per line in the expected conversation format.
with open("data/data.jsonl", "w") as f:
    for item in dataset["train"]:
        json.dump({"conversations": item["conversations"]}, f)
        f.write("\n")

Option 3: Use Your Own Data

Create a JSONL file with this format:

{"conversations": [{"role": "user", "content": "What is EPF?"}, {"role": "assistant", "content": "EPF is the Employees Provident Fund..."}]}
{"conversations": [{"role": "user", "content": "How to apply for leave?"}, {"role": "assistant", "content": "To apply for leave..."}]}

💻 Usage

Interactive Chat Demo

# Basic usage (CPU)
python demo/interactive_demo.py --rag_dataset data/data.jsonl

# With GPU and 4-bit quantization (recommended)
python demo/interactive_demo.py --rag_dataset data/data.jsonl --use_4bit --device cuda

Interactive Commands:

  • mode speed - Switch to speed mode (3-step)
  • mode precision - Switch to precision mode (4-step)
  • cache stats - View cache performance
  • search <query> - Test RAG retrieval
  • quit or q - Exit

Basic Usage Example

python demo/basic_usage.py --rag_dataset data/data.jsonl

Programmatic Usage

from stagerag import StageRAGSystem
import argparse

# Setup configuration
args = argparse.Namespace(
    rag_dataset='data/data.jsonl',
    device='cuda',
    use_4bit=True,
    cache_size=1000,
    temperature=0.7,
    top_p=0.85,
    max_new_tokens=512,
    max_seq_len=2048,
    disable_rag=False,
    rag_threshold=0.3,
    seed=42
)

# Initialize system
system = StageRAGSystem(args)

# Process query
result = system.process_query(
    "What are the EPF contribution rates?",
    mode="speed"
)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']['overall_confidence']:.3f}")
print(f"Time: {result['processing_time']:.2f}s")

🧪 Testing

# Install test dependencies
pip install pytest pytest-cov

# Run all tests
pytest tests/ -v

# Run specific test files
pytest tests/test_cache.py -v
pytest tests/test_confidence.py -v
pytest tests/test_rag.py -v

# Run with detailed output
pytest tests/test_cache.py -vv

# Run with coverage report
pytest tests/ --cov=stagerag --cov-report=html

βš™οΈ Configuration

Command Line Arguments

Argument          Default   Description
--rag_dataset     Required  Path to JSONL knowledge base
--device          cuda      Device to use (cuda/cpu)
--use_4bit        False     Enable 4-bit quantization
--cache_size      1000      LRU cache size
--temperature     0.7       Sampling temperature (0.0-1.0)
--top_p           0.85      Top-p nucleus sampling
--max_new_tokens  512       Max tokens to generate
--disable_rag     False     Disable RAG retrieval

Confidence Weights

Edit stagerag/config.py to adjust confidence evaluation:

weights = {
    'retrieval': 0.25,      # RAG retrieval quality
    'basic_quality': 0.25,  # Answer structure/length
    'relevance': 0.25,      # Keyword relevance
    'uncertainty': 0.25     # Uncertainty detection
}
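
For reference, a minimal sketch of how a weighted confidence score is typically combined; this illustrates the weighting idea and is not the framework's actual implementation:

# Illustrative only: combine per-component scores (each in [0, 1])
# into an overall confidence using the weights shown above.
weights = {'retrieval': 0.25, 'basic_quality': 0.25, 'relevance': 0.25, 'uncertainty': 0.25}
scores = {'retrieval': 0.8, 'basic_quality': 0.9, 'relevance': 0.7, 'uncertainty': 0.6}  # example values
overall_confidence = sum(weights[k] * scores[k] for k in weights)
print(f"Overall confidence: {overall_confidence:.3f}")  # prints 0.750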

πŸ“ Project Structure

StageRAG/
├── stagerag/              # Main package
│   ├── __init__.py       # Package exports
│   ├── main.py           # StageRAGSystem class
│   ├── cache.py          # LRU cache implementation
│   ├── confidence.py     # Confidence evaluator
│   ├── rag.py            # RAG retrieval system
│   ├── prompts.py        # Prompt templates
│   └── config.py         # Configuration dataclasses
├── demo/                 # Usage examples
│   ├── interactive_demo.py
│   └── basic_usage.py
├── scripts/              # Utility scripts
│   └── download_data.py  # HuggingFace dataset downloader
├── tests/                # Test suite
│   ├── test_cache.py
│   ├── test_confidence.py
│   └── test_rag.py
├── data/                 # Knowledge base (created on first run)
│   └── data.jsonl
├── requirements.txt      # Production dependencies
├── requirements-dev.txt  # Development dependencies
├── setup.py              # Package configuration
└── README.md

🎯 Architecture

Speed Mode (3-step Pipeline)

User Input → [1B] Normalize → [3B] RAG Filter → [1B] Generate Answer → Response

Precision Mode (4-step Pipeline)

User Input → [1B] Normalize → [3B] RAG Retrieve → [3B] Synthesize → [3B] Final Answer → Response
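
A minimal sketch of how the two pipelines can be expressed as ordered step lists; this illustrates the staged design, and the step names here are hypothetical rather than the package's internal API:

# Hypothetical stage names standing in for the steps above.
PIPELINES = {
    "speed":     ["normalize_1b", "rag_filter_3b", "generate_answer_1b"],
    "precision": ["normalize_1b", "rag_retrieve_3b", "synthesize_3b", "final_answer_3b"],
}

def run_pipeline(mode, query, steps):
    """Run each named stage in order, passing the output forward."""
    state = query
    for step_name in PIPELINES[mode]:
        state = steps[step_name](state)  # each stage transforms the running state
    return state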

📊 Performance Benchmarks

Mode       Avg Time  Avg Confidence  Use Case
Speed      3.3s      0.72            Real-time chat
Precision  7.8s      0.83            Complex queries, critical decisions

Tested on NVIDIA RTX 3090 GPU with 4-bit quantization

📦 Dataset

Sample dataset: darren0301/domain-mix-qa-1k

Contains 1,000 domain-specific Q&A pairs covering:

  • Logical & Mathematical Reasoning
  • Specialized Medical Domain Knowledge
  • Open-Ended General Instruction Following
  • Employee Benefits Information
  • Practical, Real-World Q&A

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

@software{stagerag2024,
  author = {Darren Chai Xin Lun},
  title = {StageRAG: A Framework for Building Hallucination-Resistant RAG Applications},
  year = {2024},
  url = {https://github.com/darrencxl0301/StageRAG},
  note = {Dataset: https://huggingface.co/datasets/darren0301/domain-mix-qa-1k}
}

πŸ™ Acknowledgments

📧 Contact

Darren Chai Xin Lun


⭐ If you find this project helpful, please give it a star!
