General-purpose RAG system with a hexagonal architecture (Ports & Adapters), FastAPI, three retrieval modes (BM25, FAISS, hybrid), and swappable LLM connectors (OpenAI or Ollama). Designed as a solid base to iterate in real development environments.
- Clean architecture
  - Hexagonal (Ports & Adapters): domain decoupled from infrastructure.
  - Explicit typing and domain models.
- Retrieval
  - Sparse: BM25 (offline).
  - Dense: FAISS + SentenceTransformers.
  - Hybrid: dense + BM25 combination with a configurable weight.
- LLMs
  - OpenAI Chat (via API key).
  - Local Ollama (over HTTP). Current clients are synchronous.
- Persistence
  - SQLite via SQLAlchemy: documents and Q&A history.
  - FAISS on disk for dense/hybrid mode.
- API
  - FastAPI with validation and OpenAPI at `/docs`.
  - Health: `/api/health`, readiness: `/api/ready`, Ollama health: `/api/health/ollama`.
  - Config: `/api/config`, templates: `/api/templates`.
  - OpenRouter proxy (OpenAI-compatible): `POST /api/openrouter/generate`.
- Tests
  - Unit, integration, and E2E with `pytest`.
```
├── data/                    # CSV, SQLite DB, FAISS files
├── frontend/                # Simple UI (index.html) for dev
├── src/local_rag_backend/
│   ├── app/                 # FastAPI (main, routers, DI, factory)
│   ├── core/                # domain, ports and services (ETL, RAG)
│   ├── infrastructure/      # adapters: llms, retrievers, storage, loaders
│   ├── scripts/             # bootstrap and build_index
│   └── frontend/            # packaged index.html to serve at /
└── tests/                   # unit + integration + e2e
```

Requirements:
- Python 3.11+
- Operating system: Linux / macOS / Windows
- For dense/hybrid mode: `faiss` and `sentence_transformers` (installed as extras or manually)
```bash
git clone https://github.com/Intrinsical-AI/rag-prototype.git
cd rag-prototype

python -m venv .venv
source .venv/bin/activate
# Windows: .venv\Scripts\activate

# Install the package (add extras if you want faiss/sentence_transformers)
pip install -e .

# (Optional) Install development dependencies
# pip install -e ".[dev]"
```

Initialize sample data and start:
```bash
# Load sample CSV into SQLite and, if applicable, build FAISS
rag-bootstrap

# FastAPI server
rag-server

# UI:            http://localhost:8000/
# Health:        http://localhost:8000/api/health
# Ollama health: http://localhost:8000/api/health/ollama
# Docs:          http://localhost:8000/docs
```

If you prefer to invoke scripts directly: `python -m local_rag_backend.scripts.bootstrap` and `uvicorn local_rag_backend.app.main:app --reload`.
Options are in `local_rag_backend/settings.py` (Pydantic Settings). They can be overridden with environment variables or a `.env` file (case-insensitive).
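For orientation, the pattern looks roughly like the sketch below; field names mirror the table that follows, but the real `settings.py` is the source of truth (the class and fields shown here are illustrative assumptions, not the actual file).

```python
# Illustrative sketch of the Pydantic Settings pattern -- not the actual settings.py.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Values can be overridden by environment variables or a .env file (case-insensitive).
    model_config = SettingsConfigDict(env_file=".env", case_sensitive=False)

    app_host: str = "0.0.0.0"
    app_port: int = 8000
    retrieval_mode: str = "hybrid"               # sparse | dense | hybrid
    st_embedding_model: str = "all-MiniLM-L6-v2"


settings = Settings()  # e.g. APP_PORT=9000 in the environment overrides app_port
```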
| Variable | Default | Scope | Description |
|---|---|---|---|
| `APP_HOST` | `0.0.0.0` | server | Service host |
| `APP_PORT` | `8000` | server | Service port |
| `DEBUG` | `false` | server | Reload/detailed logging |
| `RETRIEVAL_MODE` | `hybrid` | retrieval | `sparse` \| `dense` \| `hybrid` |
| `SQLITE_URL` | `sqlite:///./data/app.db` | storage | SQLite URL |
| `FAQ_CSV` | `data/faq.csv` | ingestion | FAQ CSV |
| `CSV_HAS_HEADER` | `true` | ingestion | CSV has header |
| `ST_EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | dense/hybrid | SentenceTransformers model |
| `INDEX_PATH` | `data/index.faiss` | dense/hybrid | FAISS file |
| `ID_MAP_PATH` | `data/id_map.pkl` | dense/hybrid | FAISS ID map |
| `ENABLE_MONITORING` | `false` | monitoring | Enable metrics middleware and `/metrics` endpoint |
| `OPENAI_TOP_P` | `1.0` | OpenAI | top-p parameter |
| `OPENROUTER_ENABLED` | `false` | OpenRouter | Enable OpenRouter proxy |
| `OPENROUTER_API_KEY` | — | OpenRouter | API key |
| `OPENROUTER_BASE_URL` | `https://openrouter.ai/api/v1` | OpenRouter | Base URL |
| `OPENROUTER_MODEL` | `openai/gpt-4o-mini` | OpenRouter | Default model |
| `OPENROUTER_SITE_URL` | — | OpenRouter | Optional Referer header |
| `OPENROUTER_APP_TITLE` | — | OpenRouter | Optional X-Title header |
| `HYBRID_RETRIEVAL_ALPHA` | `0.5` | hybrid | Weight of the sparse component (0 = dense, 1 = sparse) |
| `OPENAI_API_KEY` | — | OpenAI | API key |
| `OPENAI_MODEL` | `gpt-4o-mini` | OpenAI | Chat model |
| `OPENAI_TEMPERATURE` | `0.2` | OpenAI | Temperature |
| `OLLAMA_ENABLED` | `false` | Ollama | Enable Ollama |
| `OLLAMA_MODEL` | `gemma3:1b` | Ollama | Model served by Ollama |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama | Server URL |
| `OLLAMA_REQUEST_TIMEOUT` | `180` | Ollama | Timeout (s) |
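For orientation, `HYBRID_RETRIEVAL_ALPHA` behaves like the weighted fusion sketched below; this is a simplified illustration, and the actual hybrid retriever adapter may normalize and merge results differently.

```python
def combine_scores(
    sparse: dict[int, float],
    dense: dict[int, float],
    alpha: float = 0.5,
) -> dict[int, float]:
    """Blend per-document scores: alpha weighs the sparse (BM25) side, 1 - alpha the dense side.

    Assumes both score maps are already normalized to [0, 1].
    """
    doc_ids = set(sparse) | set(dense)
    return {
        doc_id: alpha * sparse.get(doc_id, 0.0) + (1.0 - alpha) * dense.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# alpha=0.0 ranks purely by dense scores, alpha=1.0 purely by BM25.
ranked = sorted(
    combine_scores({1: 0.9, 2: 0.4}, {2: 0.8, 3: 0.7}, alpha=0.5).items(),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)  # approximately [(2, 0.6), (1, 0.45), (3, 0.35)]
```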
Example `.env`:

```env
RETRIEVAL_MODE=hybrid
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_TOP_P=1.0
OLLAMA_ENABLED=false
ST_EMBEDDING_MODEL=all-MiniLM-L6-v2

# Optionals
# OPENROUTER_ENABLED=true
# ...
```
- Sparse: stores directly in SQLite (no embeddings required).
- Dense / Hybrid:
  - Save chunks in SQLite.
  - Generate embeddings with SentenceTransformers (`ST_EMBEDDING_MODEL`).
  - Upsert into FAISS (`INDEX_PATH`, `ID_MAP_PATH`).
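As a standalone sketch of the embed-and-index step (the project's `ETLService` and FAISS adapter do this internally; the paths and index type below simply mirror the defaults above):

```python
import pickle

import faiss
from sentence_transformers import SentenceTransformer

chunks = ["first chunk of text ...", "second chunk of text ..."]  # already stored in SQLite
chunk_ids = [101, 102]                                            # their row/document IDs

model = SentenceTransformer("all-MiniLM-L6-v2")                   # ST_EMBEDDING_MODEL
embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

faiss.write_index(index, "data/index.faiss")                      # INDEX_PATH
with open("data/id_map.pkl", "wb") as f:                          # ID_MAP_PATH: FAISS row -> chunk ID
    pickle.dump(chunk_ids, f)
```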
Chunking parameters (in settings):
- `INGEST_CHUNK_CHARS` (default 1200)
- `INGEST_CHUNK_OVERLAP` (default 200)
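A minimal sketch of character chunking with overlap, matching the semantics of these two settings (the project's `default_chunker` may handle edge cases differently):

```python
def chunk_text(text: str, chunk_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_chars characters, each starting chunk_chars - overlap after the previous one."""
    if overlap >= chunk_chars:
        raise ValueError("overlap must be smaller than chunk_chars")
    if not text:
        return []
    step = chunk_chars - overlap
    return [text[start:start + chunk_chars] for start in range(0, len(text), step)]


# With the defaults, a 3000-character document yields chunks starting at 0, 1000 and 2000.
```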
Available scripts:

```bash
# Ingest from CSV and build FAISS if applicable
rag-bootstrap

# Explicitly build the index from the CSV (populate SQL and FAISS if applicable)
rag-build-index

# Summary of system and file status
rag-status
```

Retrieval mode is selected via `RETRIEVAL_MODE` (there is no `--mode` flag).
The ingestion process is orchestrated by `IngestionPipeline`:
- Load items from a `LoaderPort` (e.g., `CSVLoader`) returning `LoadedItem` (text, metadata).
- Preprocess (`preprocess_text`) and chunk (`default_chunker`) with overlap.
- Format chunks (metadata header) and batch-ingest via `ETLService.ingest()`.

This pipeline is used by `scripts/bootstrap.py`.
You can ingest data from any LangChain document loader via the `LangChainLoader` adapter, which implements the project's `LoaderPort`.

Installation:

```bash
pip install -e ".[loaders]"
# or when installing from PyPI:
# pip install intrinsical-rag-prototype[loaders]
```

Quick usage example:
```python
from langchain_community.document_loaders import WebBaseLoader

from local_rag_backend.core.services.etl import ETLService
from local_rag_backend.core.services.ingestion import IngestionPipeline
from local_rag_backend.infrastructure.ingestion.loaders import LangChainLoader

# 1) Create/obtain your ETLService as usual (doc store, vector store, embedder)
etl = ETLService(doc_repo, vector_repo, embedder)

# 2) Wrap any LangChain loader
lc_loader = WebBaseLoader(["https://example.com"])  # or DirectoryLoader, SitemapLoader, etc.
loader = LangChainLoader(lc_loader, drop_empty=True, metadata_filter={"lang": "en"})

# 3) Run the pipeline
pipeline = IngestionPipeline(loader=loader, etl_service=etl)
count = pipeline.run()
print(f"Ingested {count} chunks")
```

Notes:
- `drop_empty=True` skips whitespace-only documents.
- `metadata_filter={...}` yields only items whose metadata includes the given key/value pairs.
- The adapter expects each LangChain `Document` to have `page_content` and `metadata` fields; it gracefully falls back to dict-like objects or stringification when needed.
Prerequisites: Docker Desktop/Engine.
```bash
# Build and start backend + Ollama
docker compose up -d --build

# (Optional) Pull a model into Ollama once the service is up
docker exec -it ollama ollama pull gemma:2b

# Verify services
curl http://localhost:8000/api/health
curl http://localhost:8000/api/health/ollama
```

Notes:
- Backend listens on `8000`, Ollama on `11434`.
- Configure providers via `.env` or environment variables (see `.env.example`).
- In `docker-compose.yml`, `OLLAMA_ENABLED=true` and `OLLAMA_BASE_URL=http://ollama:11434` are set.
- `GET /` → serves the packaged `index.html` or the repo's `frontend/index.html`.
- `GET /api/health` and `GET /api/ready`
- `GET /api/health/ollama`
- `GET /api/config` and `GET /api/templates`
- `POST /api/ask`
  - Body: `{ "question": "str", "k": int (1..10, default 3) }`
  - Response: `{ "answer": "str", "sources": [ { "document": {"id": int, "content": "str"}, "score": float (0..1) }, ... ] }`
- `GET /api/history?limit=1..100&offset>=0`
  - Response: list of `{ id, question, answer, created_at, source_ids[] }`
- FastAPI docs: `GET /docs` and `GET /openapi.json`
- `POST /api/docs` (ingest texts) and `GET /api/docs` (list docs)
- `POST /api/openrouter/generate` (enabled if OpenRouter is configured)
Notes:
- Retrieval “scores” are normalized to [0,1] in the adapters.
- The service persists each Q/A with the IDs of the retrieved sources.
Example:

```bash
curl -X POST "http://localhost:8000/api/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG?", "k": 3}'
```
Run the test suite:

```bash
pytest

# Coverage
pytest --cov=src --cov-report=term-missing
```

The test suite includes unit, integration, and E2E tests (FastAPI TestClient). Some integration tests require `faiss` and/or `sentence_transformers`; if they are not installed, those tests are skipped automatically. Current status: 133 tests, 86.45% coverage.
- LLM: implement `GeneratorPort` (see `infrastructure/llms/*`) and wire it in `app/factory.py`.
- Retriever: implement `RetrieverPort` and wire it in `factory.get_retriever()`.
- Vector store: implement `VectorRepoPort` (e.g., an alternative to FAISS).
- Document store: implement `DocumentRepoPort` to use a DB other than SQLite.
- Loader: implement `LoaderPort` for new sources (PDFs, web, etc.); see the sketch below.
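As an illustration of the Loader port, here is a hypothetical adapter that yields one item per text file. The exact `LoaderPort` method name, the `LoadedItem` constructor, and the import path are assumptions; verify them against `core/ports` and the `LangChainLoader` adapter.

```python
from pathlib import Path
from typing import Iterator

# Assumed import path and constructor -- check core/domain and core/ports for the real ones.
from local_rag_backend.core.domain import LoadedItem


class TextFileLoader:
    """Hypothetical LoaderPort adapter: one LoadedItem per *.txt file under a directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def load(self) -> Iterator[LoadedItem]:  # assumed port method, analogous to LangChainLoader
        for path in sorted(self.root.glob("*.txt")):
            yield LoadedItem(
                text=path.read_text(encoding="utf-8"),
                metadata={"source": str(path)},
            )
```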
- Singleton per process: `RagService` is initialized as a singleton in `factory`. With `uvicorn --workers N`, each process loads its own instance (and its own FAISS index). Align deployment and warm-up as needed.
- Metrics: if `ENABLE_MONITORING=true` and `prometheus-client` is installed, `/metrics` exposes Prometheus-format metrics.
- Dense/Hybrid: the same embedding model (`ST_EMBEDDING_MODEL`) must be used for indexing and querying.
- Synchronous LLM clients (requests/OpenAI SDK); migration to async is straightforward but not included.
- Minimal UI without front-end tests.
- No authentication/rate limiting or exported metrics (logging and status CLI are included).
- FAISS index type is `IndexFlatL2` (simple). For large volumes, consider IVF/HNSW or other backends (see the sketch below).
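If `IndexFlatL2` becomes a bottleneck, an IVF index is one FAISS-level alternative; wiring it into the project would go through `VectorRepoPort`. A minimal sketch, assuming the `all-MiniLM-L6-v2` embedding dimension:

```python
import faiss
import numpy as np

d = 384                     # embedding dimension of all-MiniLM-L6-v2
nlist = 256                 # number of coarse clusters

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

vectors = np.random.rand(50_000, d).astype("float32")  # placeholder corpus embeddings
index.train(vectors)        # IVF indexes must be trained before vectors are added
index.add(vectors)

index.nprobe = 16           # clusters probed per query: recall vs. latency trade-off
distances, ids = index.search(vectors[:1], 3)
```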
MIT. See LICENSE file for details.
Built with ❤️ by Intrinsical AI