RAGondin is a lightweight, modular and extensible Retrieval-Augmented Generation (RAG) framework designed to explore and test advanced RAG techniques — 100% open source and focused on experimentation, not lock-in.
Built by the OpenLLM France community, RAGondin offers a sovereign-by-design alternative to mainstream RAG stacks like LangChain or Haystack.
- 🦫 RAGondin — The Open RAG Experimentation Playground
- Table of Contents
- Goals
- Current Features
- 🚀 Getting Started
- Troubleshooting
- 🔧 Contributing
- Experiment with advanced RAG techniques
- Develop evaluation metrics for RAG applications
- Collaborate with the community to innovate and push the boundaries of RAG applications
This section provides a detailed explanation of the currently supported features.
The .hydra_config directory contains all the configuration files for the application. These configurations are structured using the Hydra configuration framework. This directory will be referenced for setting up the RAG (Retrieval-Augmented Generation) pipeline.
This branch currently supports the following file types:
- Text Files: `txt`, `md`
- Document Files: `pdf`, `docx`, `doc`, `pptx`
- Audio Files: `wav`, `mp3`, `mp4`, `ogg`, `flv`, `wma`, `aac`
- Images: `png`, `jpeg`, `jpg`, `svg`
The content from all supported formats is first converted into Markdown. Any images found in the documents are replaced by captions generated by a Vision Language Model (VLM). (Refer to the Configuration section for additional details.) The final Markdown output is then split into chunks and indexed in the Milvus vector database.
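For illustration, here is a minimal sketch of how an image can be captioned through an OpenAI-compatible VLM endpoint, assuming the `VLM_*` variables from the `.env` described below; the file name and prompt are placeholders, not RAGondin's actual implementation.

```python
# Hedged sketch: caption an image with an OpenAI-compatible VLM endpoint.
# VLM_BASE_URL, VLM_API_KEY and VLM_MODEL mirror the .env; the prompt is illustrative.
import base64
import os

from openai import OpenAI

vlm = OpenAI(base_url=os.environ["VLM_BASE_URL"], api_key=os.environ["VLM_API_KEY"])

with open("figure.png", "rb") as f:  # placeholder image extracted from a document
    image_b64 = base64.b64encode(f.read()).decode()

caption = vlm.chat.completions.create(
    model=os.environ["VLM_MODEL"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image so the caption can replace it in a Markdown document."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(caption.choices[0].message.content)
```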
Note
Upcoming Support: Future releases will expand compatibility to include additional formats such as csv, odt, html, and other widely used open-source document types.
Several chunking strategies are supported, including semantic, markdown and recursive chunking. By default, the recursive chunker is applied to all supported file types for its efficiency and low memory usage. This is the recommended strategy for most use cases. Future updates may introduce format-specific chunkers (e.g., for CSV, HTML, etc.). The recursive chunker configuration is located at: .hydra_config/chunker/recursive_splitter.yaml.
# .hydra_config/chunker/recursive_splitter.yaml
defaults:
- base
name: recursive_splitter
chunk_size: 1000
chunk_overlap: 100

The `chunk_size` and `chunk_overlap` values are expressed in tokens, not characters. For enhanced retrieval, enable the contextual retrieval feature — a technique introduced by Anthropic to improve retrieval performance (Contextual Retrieval). To activate it, set `CONTEXTUAL_RETRIEVAL=true` in your `.env` file. Refer to the Usage section for further instructions.
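As a rough illustration of what contextual retrieval does, the sketch below prepends an LLM-generated context to each chunk before it is embedded; it assumes the `BASE_URL`, `API_KEY` and `MODEL` variables from the `.env`, and the prompt is illustrative rather than RAGondin's actual one.

```python
# Hedged sketch of contextual retrieval: situate each chunk within its document
# before embedding it. Variable names mirror the .env; the prompt is illustrative.
import os

from openai import OpenAI

llm = OpenAI(base_url=os.environ["BASE_URL"], api_key=os.environ["API_KEY"])

def contextualize_chunk(document: str, chunk: str) -> str:
    prompt = (
        f"<document>\n{document}\n</document>\n"
        f"Here is a chunk from this document:\n<chunk>\n{chunk}\n</chunk>\n"
        "Write a short context that situates this chunk within the overall document."
    )
    response = llm.chat.completions.create(
        model=os.environ["MODEL"],
        messages=[{"role": "user", "content": prompt}],
    )
    # The generated context is prepended to the chunk, which is then embedded and indexed.
    return response.choices[0].message.content + "\n\n" + chunk
```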
Once chunked, the document fragments are stored in the Milvus vector database using the Qwen/Qwen3-Embedding-0.6B multilingual model—ranked among the top on the MTEB benchmark. To use a different model, simply set the `EMBEDDER_MODEL_NAME` variable in your .env file to any Hugging Face–compatible option (e.g., "sentence-transformers/all-MiniLM-L6-v2"). For faster embedding, the model is served via an inference server (vLLM).
Important
Choose an embedding model that aligns with your document languages and offers a suitable context window. The default model supports both English and French.
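Since the embedder is exposed through vLLM's OpenAI-compatible API, you can query it directly, for example to sanity-check the served model. The sketch below assumes the default `.env` values and that the `vllm` service is reachable from where you run it (replace the URL with your own otherwise).

```python
# Hedged sketch: request embeddings from the vLLM server's OpenAI-compatible API.
# Base URL and model name mirror the .env defaults; adjust them to your setup.
from openai import OpenAI

embedder = OpenAI(base_url="http://vllm:8000/v1", api_key="EMPTY")
response = embedder.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input=["Qu'est-ce que RAGondin ?", "What is RAGondin?"],
)
print(len(response.data[0].embedding))  # embedding dimension
```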
Our hybrid search pipeline combines semantic search with BM25 keyword matching to deliver comprehensive results. This dual approach captures both semantically related content and documents containing specific keywords. Retrieved results are then consolidated and reordered using the Reciprocal Rank Fusion (RRF) algorithm for optimal relevance ranking.
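For reference, Reciprocal Rank Fusion scores each document by summing 1/(k + rank) over the rankings in which it appears. The snippet below is a generic sketch of the algorithm, using the constant k=60 commonly found in the literature, not RAGondin's internal implementation.

```python
# Generic sketch of Reciprocal Rank Fusion (RRF); k=60 is the value commonly
# used in the literature, not necessarily RAGondin's setting.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a semantic ranking with a BM25 ranking (placeholder document ids)
print(reciprocal_rank_fusion([["doc_a", "doc_b", "doc_c"], ["doc_c", "doc_a", "doc_d"]]))
```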
Three strategies feed into the hybrid search—available only in the RAG pipeline (see OpenAI Compatible API), not in the standalone Semantic Search endpoints:
- multiQuery: Uses an LLM to generate multiple query reformulations, merging their results for superior relevance (default and most effective based on benchmarks).
- single: Executes a single, straightforward retrieval query without any reformulation.
- hyde: Generates a hypothetical answer via LLM, then retrieves chunks similar to that synthetic response.
Finally, retrieved documents are re-ordered by relevance using a multilingual reranking model. By default, we employ Alibaba-NLP/gte-multilingual-reranker-base from Hugging Face.
Important
The retriever fetches documents that are semantically similar to the query. However, semantic similarity doesn't always equate to relevance. Therefore, rerankers are crucial for reordering results and filtering out less pertinent documents, thereby reducing the likelihood of hallucinations.
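To illustrate what the reranker does, here is a minimal sketch of cross-encoder reranking with the default model, using the sentence-transformers library; it is a conceptual example, not how RAGondin actually serves the reranker.

```python
# Hedged sketch of cross-encoder reranking with the default model; the query and
# chunks are placeholders. trust_remote_code is required for this Hugging Face model.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("Alibaba-NLP/gte-multilingual-reranker-base", trust_remote_code=True)

query = "How are PDF files indexed?"
chunks = [
    "PDF content is converted to Markdown before chunking.",
    "Audio files are transcribed with Whisper.",
]
scores = reranker.predict([(query, chunk) for chunk in chunks])
ranked = [chunk for _, chunk in sorted(zip(scores, chunks), reverse=True)]
print(ranked)
```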
Reranking models support multiple deployment formats:
Online Serving (In-Memory)
- Supports `crossencoder` and `colbert` reranking types
- Set the `RERANKER_MODEL_TYPE` environment variable to match your chosen model's architecture (`crossencoder` or `colbert`)
High-Throughput Serving
- Deploy reranking models via Infinity for enhanced performance
- Set `RERANKER_MODEL_TYPE=infinity` to enable this mode
- Both `crossencoder` and `colbert` architectures are supported through this deployment method
- SimpleRAG: Basic implementation that does not take chat history into account.
- ChatBotRAG: Version that maintains conversation context.
- Python 3.12 or higher recommended
- Docker and Docker Compose
- For GPU-capable machines, ensure you have the NVIDIA Container Toolkit installed. Refer to the NVIDIA documentation for installation instructions.
RAGondin is designed to run in a containerized environment under Linux on x86_64 architecture. ARM processors are not supported; this may change in the future.
git clone https://github.com/OpenLLM-France/RAGondin.git
# git clone --recurse-submodules https://github.com/OpenLLM-France/RAGondin.git # to clone the repo with the associated submodules
cd RAGondin
git checkout main # or a given release

Important
Ensure you have Python 3.12 installed along with uv. For detailed installation instructions, refer to the official uv documentation. You can install uv with pip (if already available) or with curl; additional installation methods are outlined in the documentation.
# with pip
pip install uv
# with curl
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a new environment with all dependencies
cd RAGondin/
uv sync

Create a `.env` file at the root of the project, mirroring the structure of `.env.example`, to configure your environment.
- Define your LLM-related variables (`API_KEY`, `BASE_URL`, `MODEL`) and VLM (Vision Language Model) settings (`VLM_API_KEY`, `VLM_BASE_URL`, `VLM_MODEL`).
Important
The VLM is used for generating captions from images extracted from files, and for tasks like summarizing chat history. The same endpoint can serve as both your LLM and VLM.
- For PDF indexing, multiple loader options are available. Set your choice using the `PDFLoader` variable:
  - `MarkerLoader` and `DoclingLoader` are recommended for optimal performance, especially on OCR-processed PDFs. They support both GPU and CPU execution.
  - For lightweight testing on CPU, use `PyMuPDF4LLMLoader` or `PyMuPDFLoader`. ⚠️ These do not support non-searchable PDFs or image-based content; images are not handled.
- Audio & Video Files
  - Audio and video content are transcribed using OpenAI's Whisper model. Supported model sizes include: `tiny`, `base`, `small`, `medium`, `large`, and `turbo`. For details, refer to the OpenAI Whisper repository. The default transcription model is set via the `WHISPER_MODEL` variable, which defaults to `'base'`.
Other file formats (txt, docx, doc, pptx, etc.) are pre-configured.
- Set `AUTH_TOKEN` to enable HTTP authentication via HTTPBearer. If not provided, the endpoints will be accessible without any restrictions.
# These are the minimal settings required.
# LLM
BASE_URL=
API_KEY=
MODEL=
LLM_SEMAPHORE=10
# VLM
VLM_BASE_URL=
VLM_API_KEY=
VLM_MODEL=
VLM_SEMAPHORE=10
# App
APP_PORT=8080 # this is the forwarded port
# Vector db VDB Milvus
VDB_HOST=milvus
VDB_PORT=19530
VDB_CONNECTOR_NAME=milvus
# RETRIEVER
CONTEXTUAL_RETRIEVAL=true
RETRIEVER_TOP_K=20 # Number of documents to return before reranking
# EMBEDDER
EMBEDDER_MODEL_NAME=Qwen/Qwen3-Embedding-0.6B # or jinaai/jina-embeddings-v3 if you prefer
EMBEDDER_BASE_URL=http://vllm:8000/v1
EMBEDDER_API_KEY=EMPTY
# RERANKER
RERANKER_ENABLED=false
RERANKER_MODEL=Alibaba-NLP/gte-multilingual-reranker-base # or jinaai/jina-reranker-v2-base-multilingual or jinaai/jina-colbert-v2 if you want
RERANKER_MODEL_TYPE=crossencoder # colbert
RERANKER_TOP_K=5 # Number of documents to return after reranking. Increase it for better results if your LLM has a wider context window
# Prompts
PROMPTS_DIR=../prompts/example3
# Loaders
PDFLoader=MarkerLoader
# RAY
RAY_DEDUP_LOGS=0
RAY_DASHBOARD_PORT=8265
RAY_NUM_GPUS=0.1
RAY_POOL_SIZE=4 # Number of serializer actor instances
RAY_MAX_TASKS_PER_WORKER=5 # Number of tasks per serializer instance
RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook
# To enable HTTP authentication via HTTPBearer for the api endpoints
AUTH_TOKEN=super-secret-token

- Running on CPU: For quick testing on CPU, you can reduce computational load by adjusting the following settings in the `.env` file:
  - Set `RERANKER_TOP_K=5` or lower to limit the number of documents returned by the reranker. This defines how many documents are included in the LLM's context—reduce it if your LLM has a limited context window (4–5 is usually sufficient for an 8k context). You can also disable the reranker entirely with `RERANKER_ENABLED=false`, as it is a costly operation.
Warning
These adjustments may affect performance and result quality but are appropriate for lightweight testing.
- Running on GPU: The default values are well-suited for GPU usage. However, you can adjust them as needed to experiment with different configurations based on your machine’s capabilities.
Important
Before launching the app, you might want to configure the Indexer UI (a web interface for intuitive document ingestion, indexing, and management). For details, see this document.
The application can be launched in either a GPU or CPU environment, depending on your device's capabilities. Use the following commands:
# Launch with GPU support (recommended for faster processing)
docker compose up --build # or 'down' to stop it
# Launch with CPU only
docker compose --profile cpu up --build # or '--profile cpu down' to stop it

Once it is running, you can check that everything is fine by running:
curl http://localhost:8080/health_check # APP_PORT=8080

Important
The initial launch takes longer due to the installation of required dependencies. Once the application is up and running, you can access the FastAPI documentation at http://localhost:8080/docs (8080 being the APP_PORT set in your .env) to manage documents, run searches, or interact with the RAG pipeline (see the next section about the API for more details). A default chat UI is also deployed using Chainlit; it can be deactivated by setting the boolean variable WITH_CHAINLIT_UI=False. You can access it at http://localhost:8080/chainlit and chat with your documents with our RAG engine behind it.
Note
By default, the Chainlit chat UI doesn't restrict access. To enable chat UI authentication, add these variables:
# Chainlit UI authentication
CHAINLIT_AUTH_SECRET=... # generate it with 'uv run chainlit create-secret', but a random value works too
CHAINLIT_USERNAME=Ragondin
CHAINLIT_PASSWORD=Ragondin2025

Important
Chat history is disabled by default. Chainlit includes built-in data persistence capabilities with an out-of-the-box data layer schema to store all chat interactions: conversations, elements, user feedback, and more. To enable this functionality, follow the dedicated setup guide:
➡ Enable Chainlit Data Persistence
To scale RAGondin in a distributed environment using Ray, follow the dedicated guide:
➡ Deploy RAGondin in a Ray cluster
This FastAPI-powered backend offers capabilities for document-based question answering (RAG), semantic search, and document indexing across multiple partitions. It exposes endpoints for interacting with a vector database and managing document ingestion, processing, and querying.
For all the following endpoints, make sure to include your authentication token AUTH_TOKEN in the HTTP request header if authentication is enabled.
POST /indexer/partition/{partition}/file/{file_id}
Uploads a file (with optional metadata) to a specific partition.
- Inputs:
  - `file` (form-data): binary – File to upload
  - `metadata` (form-data): JSON string – Metadata for the file (e.g. `{"file_type": "pdf"}`)
- Returns:
  - `201 Created` with a JSON containing the task status URL
  - `409 Conflict` if a file with the same id already exists in the same partition
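A hedged example of calling this endpoint with the requests library; the host, partition name, file id, token, and metadata are placeholders.

```python
# Hedged example: upload and index a file. All values below are placeholders.
import json

import requests

url = "http://localhost:8080/indexer/partition/demo/file/report-2024"
headers = {"Authorization": "Bearer super-secret-token"}  # only needed if AUTH_TOKEN is set

with open("report-2024.pdf", "rb") as f:
    response = requests.post(
        url,
        headers=headers,
        files={"file": f},
        data={"metadata": json.dumps({"file_type": "pdf"})},
    )
print(response.status_code, response.json())  # 201 Created with the task status URL
```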
PUT /indexer/partition/{partition}/file/{file_id}
Replaces an existing file in the partition. Deletes the existing entry and creates a new indexing task.
- Inputs:
  - `file` (form-data): binary – File to upload
  - `metadata` (form-data): JSON string – Metadata for the file (e.g. `{"file_type": "pdf"}`)
- Returns:
  - `202 Accepted` with a JSON containing the task status URL
PATCH /indexer/partition/{partition}/file/{file_id}
Updates the metadata of an existing file without reindexing.
- Inputs:
  - `metadata` (form-data): JSON string – Metadata for the file (e.g. `{"file_type": "pdf"}`)
- Returns:
  - `200 OK` if the metadata for that file is successfully updated
DELETE /indexer/partition/{partition}/file/{file_id}
Deletes a file from a specific partition.
- Returns:
  - `204 No Content`
  - `404 Not Found` if the file is not found in the partition
Note
Once the RAG is running, you can call these endpoints to index multiple files. See data_indexer.py in the 📁 utility folder.
GET /indexer/task/{task_id}
Retrieves the status of an asynchronous indexing task (see the POST /indexer/partition/{partition}/file/{file_id} endpoint).
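For example, given the task status URL returned by the upload request above, the status can be checked like this (the task id is a placeholder):

```python
# Hedged example: check the status of an indexing task. The task id is a placeholder.
import requests

task_url = "http://localhost:8080/indexer/task/123"
headers = {"Authorization": "Bearer super-secret-token"}  # only needed if AUTH_TOKEN is set
print(requests.get(task_url, headers=headers).json())
```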
GET /search/
Searches across multiple partitions using a semantic query.
- Inputs:
  - `partitions` (query, optional): List[str] – Partitions to search (default: `["all"]`)
  - `text` (query, required): string – Text to search semantically
  - `top_k` (query, optional): int – Number of top results to return (default: `5`)
- Returns:
  - `200 OK` with a JSON list of document links (HATEOAS style)
  - `400 Bad Request` if the field `partitions` isn't correctly set
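A hedged example of a semantic search call; the partition, query, and token are placeholders.

```python
# Hedged example: semantic search across partitions. Values are placeholders.
import requests

response = requests.get(
    "http://localhost:8080/search/",
    params={"partitions": ["demo"], "text": "maintenance procedure", "top_k": 5},
    headers={"Authorization": "Bearer super-secret-token"},  # only needed if AUTH_TOKEN is set
)
print(response.json())  # list of document links (HATEOAS style)
```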
GET /search/partition/{partition}
Searches within a specific partition.
- Inputs:
  - `text` (query, required): string – Text to search semantically
  - `top_k` (query, optional): int – Number of top results to return (default: `5`)
- Returns:
  - `200 OK` with a JSON list of document links (HATEOAS style)
  - `400 Bad Request` if the field `partitions` isn't correctly set
GET /search/partition/{partition}/file/{file_id}
Searches within a specific file in a partition.
- Inputs:
  - `text` (query, required): string – Text to search semantically
  - `top_k` (query, optional): int – Number of top results to return (default: `5`)
- Returns:
  - `200 OK` with a JSON list of document links (HATEOAS style)
  - `400 Bad Request` if the field `partitions` isn't correctly set
GET /extract/{extract_id}
Fetches a specific extract by its ID.
- Returns:
  - `content` and `metadata` of the extract (an extract is a chunk) in JSON format
OpenAI API compatibility enables seamless integration with existing tools and workflows that follow the OpenAI interface. It makes it easy to use popular UIs like OpenWebUI without the need for custom adapters.
For the following OpenAI-compatible endpoints, when using an OpenAI client, provide your AUTH_TOKEN as the api_key if authentication is enabled; otherwise, you can use any placeholder value such as 'sk-1234'.
GET /v1/models
This endpoint lists all existing models.
Note
Model names follow the pattern ragondin-{partition_name}, where partition_name refers to a data partition containing specific files. These “models” aren’t standalone LLMs (like GPT-4 or Llama), but rather placeholders that tell your LLM endpoint to generate responses using only the data from the chosen partition. To query the entire vector database, use the special model name partition-all.
- POST /v1/chat/completions
  OpenAI-compatible chat completion endpoint using a Retrieval-Augmented Generation (RAG) pipeline. Accepts `model`, `messages`, `temperature`, `top_p`, etc.
- POST /v1/completions
  The same applies to this endpoint.
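A hedged example with the official OpenAI Python client; the partition name `demo` is a placeholder, and any `api_key` value works if `AUTH_TOKEN` is not set.

```python
# Hedged example: query the RAG pipeline through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-1234")  # or your AUTH_TOKEN
completion = client.chat.completions.create(
    model="ragondin-demo",  # or "partition-all" to query the entire vector database
    messages=[{"role": "user", "content": "What do the indexed documents say about maintenance?"}],
)
print(completion.choices[0].message.content)
```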
Tip
To test these endpoints with the OpenAI client, you can refer to the openai_compatibility_guide.ipynb notebook in the 📁 utility folder.
GET /health_check
Simple endpoint to ensure the server is running.
After running uv sync, if you have this error:
error: Distribution `ray==2.43.0 @ registry+https://pypi.org/simple` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're using CPython 3.13 (`cp313`), but `ray` (v2.43.0) only has wheels with the following Python ABI tag: `cp312`
This means uv created the environment with CPython 3.13, while ray 2.43.0 only provides wheels for Python 3.12.
To solve it, please run:
uv venv --python=3.12
uv sync

While running RAGondin, if you encounter a problem that prevents the models' weights from being downloaded locally, create the required folder and give it write and execute permissions:
sudo mkdir /app/model_weights
sudo chmod 775 /app/model_weights

We ❤️ contributions!
Contributions are welcome! Please follow standard GitHub workflow:
- Fork the repository
- Create a feature branch
- Submit a pull request