This project is a fully modular Multi-Modal Multi-Agent Retrieval-Augmented Generation (RAG) system that processes PDFs, HTML files, images, and tables to answer questions through a pipeline of specialized agents:
- **Text Agent + Image Agent**: Generate insights from the contexts retrieved by the RAG system.
- **Generalize Agent**: Combines and generalizes the answers from the Text Agent and Image Agent for each question.
- **Planning Agent**: Receives the user's query and decomposes it into several sub-questions ("tasks") used to retrieve information from the RAG system.
- **Merge Agent**: Combines all the responses from the Generalize Agent into a single response that answers the user's original query.
- **Verifier Agent**: Scores the merged answer from the Merge Agent and decides whether more information is needed, generating follow-up questions to continue retrieval.
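The sketch below illustrates how these agents could be wired together; the class and method names are assumptions based on the descriptions above, not the repository's actual implementation.

```python
# Illustrative sketch of the agent flow described above -- method names and
# wiring are assumptions, not the repository's actual implementation.
def answer_query(query, planner, text_agent, image_agent,
                 generalizer, merger, verifier, max_rounds=3):
    answer = None
    for _ in range(max_rounds):
        # Planning Agent: break the query into retrieval sub-questions
        sub_questions = planner.plan(query)

        per_question_answers = []
        for q in sub_questions:
            text_ans = text_agent.answer(q)    # retrieve from the text index
            image_ans = image_agent.answer(q)  # retrieve from page images
            # Generalize Agent: fuse both modalities for this sub-question
            per_question_answers.append(generalizer.combine(q, text_ans, image_ans))

        # Merge Agent: produce one answer for the original query
        answer = merger.merge(query, per_question_answers)

        # Verifier Agent: score the merged answer; stop or ask a follow-up
        verdict = verifier.verify(query, answer)
        if verdict.is_sufficient:
            break
        query = verdict.follow_up_question
    return answer
```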
It supports local document extraction via Docling, embedding with SentenceTransformers, and multi-agent orchestration.
```
M3ARAG/
├── agents/              # Modular agent logic
├── pipeline/            # Pipeline and chat launcher interface
├── data/                # Downloaded and processed files
│   ├── store/           # Raw downloaded files (PDF, HTML, etc.)
│   ├── merge/           # Single processing location for RAG indexing
│   └── extract/         # Converted PDFs, extracted images/tables
├── RAG/                 # RAG system
├── config/              # Config files for RAG, agents, and prompts
│   ├── agent_config.py  # Agent configuration
│   ├── rag_config.py    # RAG configuration
│   └── prompt.py        # Prompt storage
├── rag_text/            # RAG text captioning
├── rag_image/           # RAG image captioning
├── utils/               # Helper utilities (e.g., process_documents)
├── test/                # Tests
├── main.py              # Main entry point
├── chat_streamlit.py    # Streamlit chat entry point
├── README.md            # Main information about the repository
└── timeline.md          # Completed and upcoming tasks
```
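The `config/` files centralize the settings for the RAG index and the agents. The snippet below is only a hypothetical illustration of what `rag_config.py` and `agent_config.py` might contain; the real keys and values live in the repository's `config/` directory.

```python
# Hypothetical illustration only -- see config/rag_config.py and
# config/agent_config.py in the repository for the real settings.
rag_config = {
    "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",  # assumed default
    "vector_store": "chroma",
    "persist_dir": "data/merge",
    "chunk_size": 512,
    "top_k": 5,
}

agent_config = {
    "llm_provider": "openai",        # API keys come from the .env file
    "max_planning_subquestions": 5,
    "max_verifier_rounds": 3,
}
```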
```bash
git clone https://github.com/pdz1804/M3ARAG.git
cd M3ARAG

python -m venv myenv
# Or: conda create -n m3arag python=3.10

# Activate
source myenv/bin/activate   # macOS/Linux
myenv\Scripts\activate      # Windows

pip install -r requirements.txt
```

💡 If `requirements.txt` is missing, install manually like below:

```bash
pip install sentence-transformers langchain openai chromadb docling python-dotenv
```
- Download Poppler for Windows:
  - Visit: https://github.com/oschwartz10612/poppler-windows/releases/
  - Download the latest `.zip` file under Assets (e.g., `poppler-xx_xx_xx.zip`).
- Extract the zip to a location like `C:\poppler`.
- Add Poppler to PATH:
  - Open Start > Environment Variables.
  - Under System Variables, find and select `Path`, then click Edit.
  - Click New and add: `C:\poppler\Library\bin`
  - Click OK and restart your terminal.
- Verify installation:

  ```bash
  where pdfinfo
  ```

  You should see:

  ```
  C:\poppler\Library\bin\pdfinfo.exe
  ```
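On any platform, you can also confirm from Python that the Poppler tools are visible on your PATH; this is just a quick sanity check, not part of the project's code.

```python
# Quick sanity check (any OS): confirm the Poppler CLI tools are on PATH.
import shutil

path = shutil.which("pdfinfo")
if path:
    print("Poppler found at:", path)
else:
    print("Poppler not found -- check your PATH or installation.")
```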
On macOS:

```bash
brew install poppler
```

To verify:

```bash
which pdfinfo
```

On Linux (Debian/Ubuntu):

```bash
sudo apt update
sudo apt install poppler-utils
```

To verify:

```bash
which pdfinfo
```

Copy `.env.example` and rename it to `.env`, then fill in your keys:
```
OPENAI_API_KEY=pdz-...
GOOGLE_API_KEY=pdz-...
```
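Since `python-dotenv` is among the dependencies, the keys are presumably read from `.env` at startup. If you want to check your setup manually, a minimal sketch (not the project's actual loading code) looks like this:

```python
# Minimal sketch of reading the .env keys with python-dotenv; the project
# presumably does something similar at startup, but this is not its code.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
print("Keys loaded:", [k for k in ("OPENAI_API_KEY", "GOOGLE_API_KEY") if os.getenv(k)])
```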
To run the RAG flow individually, with or without agents:

```bash
# Download data only (for local testing)
python main.py --download

# Ingest data only
python main.py --ingest

# Chat
python main.py --chat

# Note: --download, --ingest, and --chat can be combined in one command

# Run the Streamlit app (upload documents or input URLs)
python main.py --app
```

This will:
- Download and store files from hardcoded URLs.
- Extract content using Docling.
- Index text via SentenceTransformers + Chroma (see the sketch after this list).
- Start an interactive agent-based chat loop.
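The extraction and indexing steps above can be pictured with a short sketch: Docling converts a document, the text is chunked and embedded with SentenceTransformers, and the vectors are stored in a Chroma collection. This is an illustrative sketch of that flow, not the repository's actual ingestion code; the model name, chunking strategy, paths, and collection name are assumptions.

```python
# Illustrative sketch of the extract -> embed -> index flow; not the
# repository's actual ingestion code (model, chunking, names are assumed).
import chromadb
from docling.document_converter import DocumentConverter
from sentence_transformers import SentenceTransformer

# 1. Extract: convert a source document with Docling and export plain text
converter = DocumentConverter()
doc = converter.convert("data/store/example.pdf").document
text = doc.export_to_markdown()

# 2. Chunk: naive fixed-size chunks, for illustration only
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]

# 3. Embed: encode chunks with a SentenceTransformers model
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks).tolist()

# 4. Index: store chunks + embeddings in a persistent Chroma collection
client = chromadb.PersistentClient(path="data/index")
collection = client.get_or_create_collection("m3arag_text")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)

# Retrieval at question time: embed the query and fetch the nearest chunks
results = collection.query(
    query_embeddings=model.encode(["What is M3ARAG?"]).tolist(),
    n_results=5,
)
print(results["documents"][0])
```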
| Agent | Description |
|---|---|
| `TextAgent` | Answers questions by retrieving from embedded text chunks |
| `ImageAgent` | Answers questions by retrieving from embedded page images |
| `GeneralizeAgent` | Combines answers from multiple modalities (text, image) |
| `PlanningAgent` | Decomposes complex questions into structured sub-questions |
| `MergeAgent` | Fuses sub-agent responses into a coherent final answer |
| `VerifierAgent` | Evaluates the merged answer, determines quality, and suggests refinements |
- **Document Processing & Indexing** (diagram): shows how documents are split into chunks and images, indexed via ChromaDB, and stored on disk.
- **Multi-Modal Retrieval Pipeline** (diagram): illustrates text- and image-based retrieval using sub-queries derived from the user question.
- **Agent-Oriented Workflow** (diagram): overview of how the specialized agents interact to process, merge, verify, and answer complex queries.
- ✅ PDF documents (`.pdf`)
- ✅ HTML, MD, PPTX, CSV, DOCX, TXT (converted to PDF)
- ✅ Extracted images (captioning + indexing coming soon)
- 🧪 Support for audio, `.json`, and `.xml` being tested for a later release
This project is licensed under the MIT License. See LICENSE for details.
Built by Nguyen Quang Phu (pdz1804) and Tieu Tri Bang
Reach out or open an issue for support or ideas.