RAGondin is a lightweight, modular and extensible Retrieval-Augmented Generation (RAG) framework designed to explore and test advanced RAG techniques — 100% open source and focused on experimentation, not lock-in.
Built by the OpenLLM France community, RAGondin offers a sovereign-by-design alternative to mainstream RAG stacks like LangChain or Haystack.
- 🦫 RAGondin — The Open RAG Experimentation Playground
- Table of Contents
- Goals
- Current Features
- 🚀 Getting Started
- Troubleshooting
- 🔧 Contributing
- Experiment with advanced RAG techniques
- Develop evaluation metrics for RAG applications
- Collaborate with the community to innovate and push the boundaries of RAG applications
This section provides a detailed explanation of the currently supported features.
The .hydra_config directory contains all the configuration files for the application. These configurations are structured using the Hydra configuration framework. This directory will be referenced for setting up the RAG (Retrieval-Augmented Generation) pipeline.
This branch currently supports the following file types:
- Text Files: `txt`, `md`
- Document Files: `pdf`, `docx`, `doc`, `pptx`
- Audio Files: `wav`, `mp3`, `mp4`, `ogg`, `flv`, `wma`, `aac`
- Images: `png`, `jpeg`, `jpg`, `svg`
The content from all supported formats is first converted into Markdown. Any images found in the documents are replaced by captions generated by a Vision Language Model (VLM). (Refer to the Configuration section for additional details.) The final Markdown output is then split into chunks and indexed in the Milvus vector database.
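For illustration, here is a minimal sketch of how an image can be captioned through an OpenAI-compatible VLM endpoint, assuming the `VLM_*` variables from the `.env` described below; the file name and prompt are placeholders, not RAGondin's actual implementation.

```python
# Hedged sketch: caption an image with an OpenAI-compatible VLM endpoint.
# VLM_BASE_URL, VLM_API_KEY and VLM_MODEL mirror the .env; the prompt is illustrative.
import base64
import os

from openai import OpenAI

vlm = OpenAI(base_url=os.environ["VLM_BASE_URL"], api_key=os.environ["VLM_API_KEY"])

with open("figure.png", "rb") as f:  # placeholder image extracted from a document
    image_b64 = base64.b64encode(f.read()).decode()

caption = vlm.chat.completions.create(
    model=os.environ["VLM_MODEL"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image so the caption can replace it in a Markdown document."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(caption.choices[0].message.content)
```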
Note
Upcoming Support: Future releases will expand compatibility to include additional formats such as csv, odt, html, and other widely used open-source document types.
Several chunking strategies are supported, including semantic, markdown and recursive chunking. By default, the recursive chunker is applied to all supported file types for its efficiency and low memory usage. This is the recommended strategy for most use cases. Future updates may introduce format-specific chunkers (e.g., for CSV, HTML, etc.). The recursive chunker configuration is located at: .hydra_config/chunker/recursive_splitter.yaml.
# .hydra_config/chunker/recursive_splitter.yaml
defaults:
- base
name: recursive_splitter
chunk_size: 1000
chunk_overlap: 100

The `chunk_size` and `chunk_overlap` values are expressed in tokens, not characters. For enhanced retrieval, enable the contextual retrieval feature — a technique introduced by Anthropic to improve retrieval performance (Contextual Retrieval). To activate it, set `CONTEXTUAL_RETRIEVAL=true` in your `.env` file. Refer to the Usage section for further instructions.
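As a rough illustration of what contextual retrieval does, the sketch below prepends an LLM-generated context to each chunk before it is embedded; it assumes the `BASE_URL`, `API_KEY` and `MODEL` variables from the `.env`, and the prompt is illustrative rather than RAGondin's actual one.

```python
# Hedged sketch of contextual retrieval: situate each chunk within its document
# before embedding it. Variable names mirror the .env; the prompt is illustrative.
import os

from openai import OpenAI

llm = OpenAI(base_url=os.environ["BASE_URL"], api_key=os.environ["API_KEY"])

def contextualize_chunk(document: str, chunk: str) -> str:
    prompt = (
        f"<document>\n{document}\n</document>\n"
        f"Here is a chunk from this document:\n<chunk>\n{chunk}\n</chunk>\n"
        "Write a short context that situates this chunk within the overall document."
    )
    response = llm.chat.completions.create(
        model=os.environ["MODEL"],
        messages=[{"role": "user", "content": prompt}],
    )
    # The generated context is prepended to the chunk, which is then embedded and indexed.
    return response.choices[0].message.content + "\n\n" + chunk
```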
Once chunked, the document fragments are stored in the Milvus vector database using the Qwen/Qwen3-Embedding-0.6B multilingual model—ranked among the top on the MTEB benchmark. To use a different model, simply set the `EMBEDDER_MODEL_NAME` variable in your .env file to any Hugging Face–compatible option (e.g., "sentence-transformers/all-MiniLM-L6-v2"). For faster embedding, the model is served via an inference server (vLLM).
Important
Choose an embedding model that aligns with your document languages and offers a suitable context window. The default model supports both English and French.
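Since the embedder is exposed through vLLM's OpenAI-compatible API, you can query it directly, for example to sanity-check the served model. The sketch below assumes the default `.env` values and that the `vllm` service is reachable from where you run it (replace the URL with your own otherwise).

```python
# Hedged sketch: request embeddings from the vLLM server's OpenAI-compatible API.
# Base URL and model name mirror the .env defaults; adjust them to your setup.
from openai import OpenAI

embedder = OpenAI(base_url="http://vllm:8000/v1", api_key="EMPTY")
response = embedder.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input=["Qu'est-ce que RAGondin ?", "What is RAGondin?"],
)
print(len(response.data[0].embedding))  # embedding dimension
```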
Our hybrid search pipeline combines semantic search with BM25 keyword matching to deliver comprehensive results. This dual approach captures both semantically related content and documents containing specific keywords. Retrieved results are then consolidated and reordered using the Reciprocal Rank Fusion (RRF) algorithm for optimal relevance ranking.
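For reference, Reciprocal Rank Fusion scores each document by summing 1/(k + rank) over the rankings in which it appears. The snippet below is a generic sketch of the algorithm, using the constant k=60 commonly found in the literature, not RAGondin's internal implementation.

```python
# Generic sketch of Reciprocal Rank Fusion (RRF); k=60 is the value commonly
# used in the literature, not necessarily RAGondin's setting.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a semantic ranking with a BM25 ranking (placeholder document ids)
print(reciprocal_rank_fusion([["doc_a", "doc_b", "doc_c"], ["doc_c", "doc_a", "doc_d"]]))
```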
Three strategies feed into the hybrid search—available only in the RAG pipeline (see OpenAI Compatible API), not in the standalone Semantic Search endpoints:
- multiQuery: Uses an LLM to generate multiple query reformulations, merging their results for superior relevance (default and most effective based on benchmarks).
- single: Executes a single, straightforward retrieval query without any reformulation.
- hyde: Generates a hypothetical answer via LLM, then retrieves chunks similar to that synthetic response.
Finally, retrieved documents are re-ordered by relevance using a multilingual reranking model. By default, we employ Alibaba-NLP/gte-multilingual-reranker-base from Hugging Face.
Important
The retriever fetches documents that are semantically similar to the query. However, semantic similarity doesn't always equate to relevance. Therefore, rerankers are crucial for reordering results and filtering out less pertinent documents, thereby reducing the likelihood of hallucinations.
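To illustrate what the reranker does, here is a minimal sketch of cross-encoder reranking with the default model, using the sentence-transformers library; it is a conceptual example, not how RAGondin actually serves the reranker.

```python
# Hedged sketch of cross-encoder reranking with the default model; the query and
# chunks are placeholders. trust_remote_code is required for this Hugging Face model.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("Alibaba-NLP/gte-multilingual-reranker-base", trust_remote_code=True)

query = "How are PDF files indexed?"
chunks = [
    "PDF content is converted to Markdown before chunking.",
    "Audio files are transcribed with Whisper.",
]
scores = reranker.predict([(query, chunk) for chunk in chunks])
ranked = [chunk for _, chunk in sorted(zip(scores, chunks), reverse=True)]
print(ranked)
```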
Reranking models support multiple deployment formats:
Online Serving (In-Memory)
- Supports `crossencoder` and `colbert` reranking types
- Set the `RERANKER_MODEL_TYPE` environment variable to match your chosen model's architecture (`crossencoder` or `colbert`)
High-Throughput Serving
- Deploy reranking models via Infinity for enhanced performance
- Set `RERANKER_MODEL_TYPE=infinity` to enable this mode
- Both `crossencoder` and `colbert` architectures are supported through this deployment method
- SimpleRAG: Basic implementation that does not take chat history into account.
- ChatBotRAG: Version that maintains conversation context.
- Python 3.12 or higher recommended
- Docker and Docker Compose
- For GPU-capable machines, ensure you have the NVIDIA Container Toolkit installed. Refer to the NVIDIA documentation for installation instructions.
RAGondin is designed to run in a containerized environment under Linux on x86_64 architecture. ARM processors are not supported; this may change in the future.
git clone https://github.com/OpenLLM-France/RAGondin.git
# git clone --recurse-submodules https://github.com/OpenLLM-France/RAGondin.git # to clone the repo with the associated submodules
cd RAGondin
git checkout main # or a given release

Important
Ensure you have Python 3.12 installed along with uv. For detailed installation instructions, refer to the official uv documentation. You can install uv with pip (if already available) or with curl; additional installation methods are outlined in the documentation.
# with pip
pip install uv
# with curl
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a new environment with all dependencies
cd RAGondin/
uv sync

Create a `.env` file at the root of the project, mirroring the structure of `.env.example`, to configure your environment.
- Define your LLM-related variables (`API_KEY`, `BASE_URL`, `MODEL`) and VLM (Vision Language Model) settings (`VLM_API_KEY`, `VLM_BASE_URL`, `VLM_MODEL`).
Important
The VLM is used for generating captions from images extracted from files, and for tasks like summarizing chat history. The same endpoint can serve as both your LLM and VLM.
- For PDF indexing, multiple loader options are available. Set your choice using the `PDFLoader` variable:
  - `MarkerLoader` and `DoclingLoader` are recommended for optimal performance, especially on OCR-processed PDFs. They support both GPU and CPU execution.
  - For lightweight testing on CPU, use `PyMuPDF4LLMLoader` or `PyMuPDFLoader`. ⚠️ These do not support non-searchable PDFs or image-based content; images are not handled.
- Audio & Video Files
  - Audio and video content are transcribed using OpenAI's Whisper model. Supported model sizes include: `tiny`, `base`, `small`, `medium`, `large`, and `turbo`. For details, refer to the OpenAI Whisper repository. The default transcription model is set via the `WHISPER_MODEL` variable, which defaults to `'base'`.
Other file formats (txt, docx, doc, pptx, etc.) are pre-configured.
- Set `AUTH_TOKEN` to enable HTTP authentication via HTTPBearer. If not provided, the endpoints will be accessible without any restrictions.
# These are the minimal settings required.
# LLM
BASE_URL=
API_KEY=
MODEL=
LLM_SEMAPHORE=10
# VLM
VLM_BASE_URL=
VLM_API_KEY=
VLM_MODEL=
VLM_SEMAPHORE=10
# App
APP_PORT=8080 # this is the forwarded port
# Vector db VDB Milvus
VDB_HOST=milvus
VDB_PORT=19530
VDB_CONNECTOR_NAME=milvus
# RETRIEVER
CONTEXTUAL_RETRIEVAL=true
RETRIEVER_TOP_K=20 # Number of documents to return before reranking
# EMBEDDER
EMBEDDER_MODEL_NAME=Qwen/Qwen3-Embedding-0.6B # or jinaai/jina-embeddings-v3 if you prefer
EMBEDDER_BASE_URL=http://vllm:8000/v1
EMBEDDER_API_KEY=EMPTY
# RERANKER
RERANKER_ENABLED=false
RERANKER_MODEL=Alibaba-NLP/gte-multilingual-reranker-base # or jinaai/jina-reranker-v2-base-multilingual or jinaai/jina-colbert-v2 if you want
RERANKER_MODEL_TYPE=crossencoder # colbert
RERANKER_TOP_K=5 # Number of documents to return after reranking. Increase it for better results if your LLM has a wider context window
# Prompts
PROMPTS_DIR=../prompts/example3
# Loaders
PDFLoader=MarkerLoader
# RAY
RAY_DEDUP_LOGS=0
RAY_DASHBOARD_PORT=8265
RAY_NUM_GPUS=0.1
RAY_POOL_SIZE=4 # Number of serializer actor instances
RAY_MAX_TASKS_PER_WORKER=5 # Number of tasks per serializer instance
RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook
# To enable HTTP authentication via HTTPBearer for the api endpoints
AUTH_TOKEN=super-secret-token

- Running on CPU: For quick testing on CPU, you can reduce computational load by adjusting the following settings in the `.env` file:
  - Set `RERANKER_TOP_K=5` or lower to limit the number of documents returned by the reranker. This defines how many documents are included in the LLM's context—reduce it if your LLM has a limited context window (4–5 is usually sufficient for an 8k context). You can also disable the reranker entirely with `RERANKER_ENABLED=false`, as it is a costly operation.
Warning
These adjustments may affect performance and result quality but are appropriate for lightweight testing.
- Running on GPU: The default values are well-suited for GPU usage. However, you can adjust them as needed to experiment with different configurations based on your machine’s capabilities.
Important
Before launching the app, you might want to configure the Indexer UI (a web interface for intuitive document ingestion, indexing, and management). For details, see this document.
The application can be launched in either a GPU or CPU environment, depending on your device's capabilities. Use the following commands:
# Launch with GPU support (recommended for faster processing)
docker compose up --build # or 'down' to stop it
# Launch with CPU only
docker compose --profile cpu up --build # or '--profile cpu down' to stop it

Once it is running, you can check that everything is fine by running:
curl http://localhost:8080/health_check # APP_PORT=8080

Important
The initial launch takes longer due to the installation of required dependencies. Once the application is up and running, you can access the FastAPI documentation at http://localhost:8080/docs (8080 being the APP_PORT set in your .env) to manage documents, run searches, or interact with the RAG pipeline (see the next section about the API for more details). A default chat UI is also deployed using Chainlit; it can be deactivated by setting the boolean variable WITH_CHAINLIT_UI=False. You can access it at http://localhost:8080/chainlit and chat with your documents with our RAG engine behind it.
Note
By default, the Chainlit chat UI doesn't restrict access. To enable chat UI authentication, add these variables:
# Chainlit UI authentication
CHAINLIT_AUTH_SECRET=... # generate it with 'uv run chainlit create-secret', but a random value works too
CHAINLIT_USERNAME=Ragondin
CHAINLIT_PASSWORD=Ragondin2025

Important
Chat history is disabled by default. Chainlit includes built-in data persistence capabilities with an out-of-the-box data layer schema to store all chat interactions: conversations, elements, user feedback, and more. To enable this functionality, follow the dedicated setup guide:
➡ Enable Chainlit Data Persistence
To scale RAGondin in a distributed environment using Ray, follow the dedicated guide:
➡ Deploy RAGondin in a Ray cluster
This FastAPI-powered backend offers capabilities for document-based question answering (RAG), semantic search, and document indexing across multiple partitions. It exposes endpoints for interacting with a vector database and managing document ingestion, processing, and querying.
For all the following endpoints, make sure to include your authentication token AUTH_TOKEN in the HTTP request header if authentication is enabled.
POST /indexer/partition/{partition}/file/{file_id}
Uploads a file (with optional metadata) to a specific partition.
- Inputs:
  - `file` (form-data): binary – File to upload
  - `metadata` (form-data): JSON string – Metadata for the file (e.g. `{"file_type": "pdf"}`)
- Returns:
  - `201 Created` with a JSON containing the task status URL
  - `409 Conflict` if a file with the same id already exists in the same partition
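A hedged example of calling this endpoint with the requests library; the host, partition name, file id, token, and metadata are placeholders.

```python
# Hedged example: upload and index a file. All values below are placeholders.
import json

import requests

url = "http://localhost:8080/indexer/partition/demo/file/report-2024"
headers = {"Authorization": "Bearer super-secret-token"}  # only needed if AUTH_TOKEN is set

with open("report-2024.pdf", "rb") as f:
    response = requests.post(
        url,
        headers=headers,
        files={"file": f},
        data={"metadata": json.dumps({"file_type": "pdf"})},
    )
print(response.status_code, response.json())  # 201 Created with the task status URL
```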
PUT /indexer/partition/{partition}/file/{file_id}
Replaces an existing file in the partition. Deletes the existing entry and creates a new indexing task.
- Inputs:
  - `file` (form-data): binary – File to upload
  - `metadata` (form-data): JSON string – Metadata for the file (e.g. `{"file_type": "pdf"}`)
- Returns:
  - `202 Accepted` with a JSON containing the task status URL
PATCH /indexer/partition/{partition}/file/{file_id}
Updates the metadata of an existing file without reindexing.
- Inputs:
  - `metadata` (form-data): JSON string – Metadata for the file (e.g. `{"file_type": "pdf"}`)
- Returns:
  - `200 OK` if the metadata for that file is successfully updated
DELETE /indexer/partition/{partition}/file/{file_id}
Deletes a file from a specific partition.
- Returns:
  - `204 No Content`
  - `404 Not Found` if the file is not found in the partition
Note
Once the RAG is running, you can call these endpoints to index multiple files. See data_indexer.py in the 📁 utility folder.
GET /indexer/task/{task_id}
Retrieves the status of an asynchronous indexing task (see the POST /indexer/partition/{partition}/file/{file_id} endpoint).
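For example, given the task status URL returned by the upload request above, the status can be checked like this (the task id is a placeholder):

```python
# Hedged example: check the status of an indexing task. The task id is a placeholder.
import requests

task_url = "http://localhost:8080/indexer/task/123"
headers = {"Authorization": "Bearer super-secret-token"}  # only needed if AUTH_TOKEN is set
print(requests.get(task_url, headers=headers).json())
```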
GET /search/
Searches across multiple partitions using a semantic query.
- Inputs:
  - `partitions` (query, optional): List[str] – Partitions to search (default: `["all"]`)
  - `text` (query, required): string – Text to search semantically
  - `top_k` (query, optional): int – Number of top results to return (default: `5`)
- Returns:
  - `200 OK` with a JSON list of document links (HATEOAS style)
  - `400 Bad Request` if the field `partitions` isn't correctly set
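A hedged example of a semantic search call; the partition, query, and token are placeholders.

```python
# Hedged example: semantic search across partitions. Values are placeholders.
import requests

response = requests.get(
    "http://localhost:8080/search/",
    params={"partitions": ["demo"], "text": "maintenance procedure", "top_k": 5},
    headers={"Authorization": "Bearer super-secret-token"},  # only needed if AUTH_TOKEN is set
)
print(response.json())  # list of document links (HATEOAS style)
```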
GET /search/partition/{partition}
Searches within a specific partition.
- Inputs:
  - `text` (query, required): string – Text to search semantically
  - `top_k` (query, optional): int – Number of top results to return (default: `5`)
- Returns:
  - `200 OK` with a JSON list of document links (HATEOAS style)
  - `400 Bad Request` if the field `partitions` isn't correctly set
GET /search/partition/{partition}/file/{file_id}
Searches within a specific file in a partition.
- Inputs:
  - `text` (query, required): string – Text to search semantically
  - `top_k` (query, optional): int – Number of top results to return (default: `5`)
- Returns:
  - `200 OK` with a JSON list of document links (HATEOAS style)
  - `400 Bad Request` if the field `partitions` isn't correctly set
GET /extract/{extract_id}
Fetches a specific extract by its ID.
- Returns:
  - `content` and `metadata` of the extract (an extract is a chunk) in JSON format
OpenAI API compatibility enables seamless integration with existing tools and workflows that follow the OpenAI interface. It makes it easy to use popular UIs like OpenWebUI without the need for custom adapters.
For the following OpenAI-compatible endpoints, when using an OpenAI client, provide your AUTH_TOKEN as the api_key if authentication is enabled; otherwise, you can use any placeholder value such as 'sk-1234'.
GET /v1/models
This endpoint lists all existing models.
Note
Model names follow the pattern ragondin-{partition_name}, where partition_name refers to a data partition containing specific files. These “models” aren’t standalone LLMs (like GPT-4 or Llama), but rather placeholders that tell your LLM endpoint to generate responses using only the data from the chosen partition. To query the entire vector database, use the special model name partition-all.
- POST /v1/chat/completions
  OpenAI-compatible chat completion endpoint using a Retrieval-Augmented Generation (RAG) pipeline. Accepts `model`, `messages`, `temperature`, `top_p`, etc.
- POST /v1/completions
  The same applies to this endpoint.
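A hedged example with the official OpenAI Python client; the partition name `demo` is a placeholder, and any `api_key` value works if `AUTH_TOKEN` is not set.

```python
# Hedged example: query the RAG pipeline through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-1234")  # or your AUTH_TOKEN
completion = client.chat.completions.create(
    model="ragondin-demo",  # or "partition-all" to query the entire vector database
    messages=[{"role": "user", "content": "What do the indexed documents say about maintenance?"}],
)
print(completion.choices[0].message.content)
```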
Tip
To test these endpoints with the OpenAI client, you can refer to the openai_compatibility_guide.ipynb notebook in the 📁 utility folder.
GET /health_check
Simple endpoint to ensure the server is running.
After running uv sync, if you have this error:
error: Distribution `ray==2.43.0 @ registry+https://pypi.org/simple` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're using CPython 3.13 (`cp313`), but `ray` (v2.43.0) only has wheels with the following Python ABI tag: `cp312`
This means uv created the environment with CPython 3.13, while ray 2.43.0 only provides wheels for Python 3.12.
To solve it, please run:
uv venv --python=3.12
uv sync

While running RAGondin, if you encounter a problem that prevents the models' weights from being downloaded locally, create the required folder and give it write and execute permissions:
sudo mkdir /app/model_weights
sudo chmod 775 /app/model_weights

We ❤️ contributions!
Contributions are welcome! Please follow standard GitHub workflow:
- Fork the repository
- Create a feature branch
- Submit a pull request