Asistec Agentic RAG Chatbot Backend Documentation
This document provides detailed information and guidance for the Asistec Agentic RAG Chatbot Backend. It covers the system's architecture, technology stack, setup instructions, API reference, and information on document indexing.
The Asistec RAG Chatbot Backend is a high-performance system designed to allow users to query a collection of technical and regulatory PDF documents using natural language. Leveraging cutting-edge AI models and an agentic workflow, it retrieves relevant information from the document corpus and generates accurate, contextually relevant responses.
The system is provided as a scalable and modular API backend, intended for integration with various frontend applications (such as the provided sample frontend).
- Natural Language Querying: Users can ask complex questions about technical documents and regulations in natural language.
- Retrieval-Augmented Generation (RAG): Combines powerful language models with a dedicated knowledge base (indexed PDFs) for grounded responses.
- Agentic Workflow: Implements a sophisticated process to determine if retrieval is needed, perform searches, evaluate document relevance, potentially rewrite queries for better results, and synthesize final answers.
- Scalable Vector Store: Uses Pinecone for efficient semantic search over indexed document chunks.
- Persistent Chat History: Stores user conversations and messages using SQLModel with a PostgreSQL (Supabase) database.
- Mem0 Integration: Allows for storing and retrieving user-specific long-term memories or preferences.
- Asynchronous Document Indexing: Handles the processing and indexing of large PDF files in the background via a dedicated API endpoint with status tracking.
- Modular API: Designed as a RESTful API using FastAPI for easy integration.
The backend is built using the following key technologies:
- Python: The primary programming language.
- FastAPI: A modern, fast (high-performance) web framework for building the API.
- SQLModel: A library for interacting with relational databases, combining the power of SQLAlchemy with the convenience of Pydantic.
- PostgreSQL (Supabase): Relational database for storing user data, chat threads, messages, and upload job statuses.
- Pinecone: A managed vector database service used to store and perform similarity search on document embeddings.
- OpenAI API: Provides the text embedding models.
- Google API: Provides the Large Language Models (LLMs) for core RAG response generation, summarization, document grading, title generation and query rewriting.
- LangChain: Framework used to build and manage interactions with LLMs, vector stores, document loading, and text splitting.
- LangGraph: A framework built on LangChain for creating stateful, multi-step agentic workflows (the core of the RAG process).
- Mem0: An external service used for managing user-specific long-term memories.
- PyPDFLoader: A LangChain document loader for extracting text from PDF files.
- RecursiveCharacterTextSplitter: A LangChain text splitter for breaking down large documents into smaller chunks suitable for embedding.
- Tenacity: Library used for adding retry logic with exponential backoff to API calls (like Pinecone upserts) to handle rate limits and transient errors.
- Uvicorn: An ASGI server used to run the FastAPI application.
The core of the system is an agentic workflow orchestrated by LangGraph, designed to provide more intelligent and context-aware responses than a simple RAG chain.
When a user sends a message:
- Receive Message: The API receives the user's message and the current thread ID. The message is saved to the database.
- Load History & Memory: The system loads the conversation history from the database and potentially retrieves relevant long-term memories from Mem0. A summary of the conversation history might be generated if it's very long.
- Agent Decision (`Answer_or_Retrieve`): An LLM analyzes the user's query, the conversation history summary, and long-term memories. It decides whether:
  - To answer directly (e.g., for simple greetings or general knowledge questions).
  - To use a `retriever_tool` to search the document knowledge base for relevant information (for specific questions about regulations or indexed content).
- Tool Execution (`retriever_tool`): If the LLM decides to use the tool, the `retriever_tool` executes. It takes a search query (formulated by the LLM), performs a vector similarity search in Pinecone, retrieves the top N most similar document chunks, and returns their combined content.
- Document Grading: Another LLM grades the retrieved document content based on its relevance to the original user query (using a score from 0-10).
- Query Rewrite (Conditional):
  - If the documents are deemed relevant (score >= 6), the process proceeds to generate the final answer.
  - If the documents are deemed less relevant (score < 6) AND the system hasn't retried too many times (loop count < max), the workflow routes to a query rewriting step. An LLM analyzes the original query and history to generate a new, optimized search query. The process then loops back to the Tool Execution step with the new query (limited to a maximum of 2 attempts to prevent infinite loops).
- Prepare Final Context: After the retrieval/grading/rewrite loop (either because documents were relevant or max retries were reached), the retrieved document content (even if low-scoring) is formatted as the final context for the answer generation step.
- Generate Final Answer: A powerful LLM receives the full conversation history, the selected retrieved context, and the user's latest query. Using a detailed prompt, it synthesizes the information to provide a comprehensive, accurate, and contextually relevant answer, adhering to the role of the expert AI assistant for Asistec.
- Save Assistant Response: The generated AI response is saved to the database as a new message in the thread.
- API Response: The final AI message content is returned to the client.
This workflow ensures that the AI intelligently leverages the document base when needed, attempts to improve searches if initial results are poor, and uses conversation history and potentially user memory for better context.
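The sketch below illustrates how such a retrieve → grade → rewrite loop can be wired with LangGraph. It is a simplified illustration rather than the project's actual graph: the node functions are stubbed, and the state fields and node names are assumptions made for the example.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END


class RAGState(TypedDict):
    question: str
    documents: str
    relevance_score: int
    loop_count: int
    answer: str


def answer_or_retrieve(state: RAGState) -> RAGState:
    # In the real workflow an LLM decides here whether to answer directly
    # or to call the retriever tool.
    return state


def retrieve(state: RAGState) -> RAGState:
    # Placeholder for the Pinecone similarity search performed by retriever_tool.
    return {**state, "documents": "retrieved chunks...", "loop_count": state["loop_count"] + 1}


def grade_documents(state: RAGState) -> RAGState:
    # An LLM would score the retrieved chunks from 0-10 against the query.
    return {**state, "relevance_score": 7}


def rewrite_query(state: RAGState) -> RAGState:
    # An LLM reformulates the search query when relevance is low.
    return {**state, "question": state["question"] + " (rewritten)"}


def generate_answer(state: RAGState) -> RAGState:
    return {**state, "answer": "final answer grounded in the retrieved context"}


def route_after_grading(state: RAGState) -> str:
    # Relevant documents (score >= 6) or max retries reached -> generate the answer;
    # otherwise rewrite the query and retrieve again.
    if state["relevance_score"] >= 6 or state["loop_count"] >= 2:
        return "generate"
    return "rewrite"


graph = StateGraph(RAGState)
graph.add_node("agent", answer_or_retrieve)
graph.add_node("retrieve", retrieve)
graph.add_node("grade", grade_documents)
graph.add_node("rewrite", rewrite_query)
graph.add_node("generate", generate_answer)

graph.add_edge(START, "agent")
graph.add_edge("agent", "retrieve")  # simplified: this sketch always retrieves
graph.add_edge("retrieve", "grade")
graph.add_conditional_edges("grade", route_after_grading, {"generate": "generate", "rewrite": "rewrite"})
graph.add_edge("rewrite", "retrieve")
graph.add_edge("generate", END)

workflow = graph.compile()
result = workflow.invoke({"question": "What does the regulation say about X?",
                          "documents": "", "relevance_score": 0, "loop_count": 0, "answer": ""})
```

The conditional edge after grading encodes the routing rule described above: generate the answer when the score is at least 6 or the retry limit is reached, otherwise rewrite the query and retrieve again.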
To get the backend running, you need to set up the environment, install dependencies, and configure external services.
- Clone the Repository: Obtain the project code.
- Navigate to Project Directory: Open your terminal and change directory to the project root (`RAG_Chatbot`).
- Create Virtual Environment (using `uv`):
  ```bash
  uv venv
  # On Windows: .\venv\Scripts\activate
  source venv/bin/activate
  ```
- Install Dependencies: The project uses `pyproject.toml` and potentially `uv.lock` or equivalent for dependency management.
  If using `uv` (recommended):
  ```bash
  uv sync
  ```
  If using `pip` with `requirements.txt` (which you would generate from `pyproject.toml`):
  ```bash
  pip install -r requirements.txt
  ```
Create a file named `.env` in the root directory (`RAG_Chatbot/.env`). Copy the contents of `.env.example` and fill in your credentials and desired settings.
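As an illustration of the kind of settings the stack implies (database, OpenAI, Google, Pinecone, Mem0), here is a minimal sketch of loading them with pydantic-settings. The field names below are assumptions for illustration only; use the keys actually listed in `.env.example`, and note that `core/config.py` may load configuration differently.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    # Illustrative keys only -- mirror the names used in .env.example.
    database_url: str
    openai_api_key: str
    google_api_key: str
    pinecone_api_key: str
    pinecone_index_name: str
    mem0_api_key: str


settings = Settings()  # raises a validation error if a required key is missing
```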
Once configured, run the application using Uvicorn:
```bash
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

- `main:app`: Refers to the `app` object in `main.py`.
- `--reload`: Useful for development; the server restarts automatically on code changes. Remove this flag for production.
- `--host 0.0.0.0`: Makes the server accessible externally (use `127.0.0.1` or `localhost` for local-only access).
- `--port 8000`: Specifies the port to run on.
The API will be accessible at http://127.0.0.1:8000 (or your server's IP/hostname).
When the application starts (via the lifespan function in main.py), it automatically performs the following tasks:
- Database Table Creation: Checks whether the tables defined in `database/models.py` (`users`, `threads`, `messages`, `upload_jobs`, `file_processing_attempts`) exist and creates them if they don't.
Monitor the console output during startup for messages regarding these tasks.
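A minimal sketch of this startup behaviour, assuming the standard SQLModel pattern (the actual `lifespan` function in `main.py` may do more than this):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sqlmodel import SQLModel, create_engine

# Replace with your configured Supabase/PostgreSQL connection string.
engine = create_engine("postgresql://user:password@host:5432/dbname")


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Create any missing tables declared by the SQLModel models before serving requests.
    SQLModel.metadata.create_all(engine)
    yield


app = FastAPI(lifespan=lifespan)
```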
The API provides endpoints for managing chat threads, sending messages, handling document uploads, and accessing user memories.
All endpoints are prefixed with /api.
Example: http://127.0.0.1:8000/api
Note: In its current implementation, the system has no robust authentication mechanism in place. It relies on the `user_id` provided in request payloads/query parameters to identify users and authorize thread/memory access. User IDs in the provided OpenAPI schema are specified as `string`, aligning with how they might be represented in external systems, although the current DB model uses `int` for `User.id`. This potential mismatch needs to be addressed during integration.
For production deployment, it is critical to implement a proper authentication and authorization layer (e.g., using API keys, OAuth2, JWT tokens) to verify user identity and control access to data.
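As one possible direction (not part of the current codebase), a simple API-key guard could be added as a FastAPI dependency; OAuth2 or JWT-based schemes would follow the same dependency pattern. The header name and key check below are illustrative placeholders.

```python
from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


def require_api_key(api_key: str = Security(api_key_header)) -> str:
    # Replace this check with a lookup against your key store or identity provider.
    if api_key != "expected-secret-key":
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return api_key


app = FastAPI()


@app.get("/api/health", dependencies=[Depends(require_api_key)])
def health_check():
    return {}
```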
Below is a summary of the API endpoints based on the provided OpenAPI schema:
| Method | Path | Description | Request Body (Schema) | Response Body (Schema) | Tags |
|---|---|---|---|---|---|
| GET | /health | Health Check | - | {} | Health |
| POST | /users | Create New User | UserCreateRequestSchema | UserResponseSchema | Users |
| GET | /users/{user_id} | Get User Details | - | UserResponseSchema | Users |
| POST | /threads | Create New Thread | ThreadCreateWithUserSchema | ThreadResponseSchema | Threads |
| GET | /threads | Get All Threads For A User | Query: user_id (string) | List[ThreadResponseSchema] | Threads |
| GET | /threads/{thread_id} | Get Single Thread | - | ThreadResponseSchema | Threads |
| PATCH | /threads/{thread_id} | Update Thread Title | ThreadUpdateRequestSchema | ThreadResponseSchema | Threads |
| DELETE | /threads/{thread_id} | Delete A Thread | - | 204 No Content | Threads |
| GET | /threads/{thread_id}/messages | Get Messages In A Thread | - | List[MessageResponseSchema] | Messages |
| POST | /threads/{thread_id}/messages | Send Message And Get RAG Response | MessageCreateWithUserSchema | ChatResponseSchema | Messages |
| GET | /memories | Get all memories for a user | Query: user_id (string) | GetAllMemoriesResponse | Memories |
| DELETE | /memories/{memory_id} | Delete a single memory by ID | - | DeleteMemoryResponse | Memories |
| DELETE | /memories/by_user/{user_id} | Delete all memories for a user | - | DeleteAllUserMemoriesResponse | Memories |
| POST | /upload | Upload PDF files for asynchronous indexing (Internal Use) | multipart/form-data (files[]) | UploadFileResponse | Document Management (Internal) |
| GET | /upload/status/{job_id} | Get the processing status of an upload job (Internal Use) | - | JobStatusResponse | Document Management (Internal) |
API documentation (Swagger UI) is automatically generated by FastAPI and available at /docs (e.g., http://127.0.0.1:8000/docs). You can use this interface to explore schemas and test endpoints interactively. The schemas referenced above (UserCreateRequestSchema, ThreadResponseSchema, etc.) are fully defined in the /docs interface under the "Schemas" section.
- `GET /health`: Simple endpoint to verify the service is running.
- `POST /users`: Endpoint to explicitly create a new user record. Expects `UserCreateRequestSchema`, which includes `user_id` (string) and optionally `email`.
- `GET /users/{user_id}`: Retrieve details for a specific user by their string `user_id`.
- `POST /threads`: Creates a new chat thread. Expects `ThreadCreateWithUserSchema`, including `user_id` (string) and an optional `title`.
- `GET /threads`: Lists threads for a user, filtered by `user_id` passed as a query parameter.
- `GET /threads/{thread_id}`: Retrieves details for a single thread by its `thread_id`.
- `PATCH /threads/{thread_id}`: Updates a thread's title. Expects `ThreadUpdateRequestSchema` in the request body.
- `DELETE /threads/{thread_id}`: Deletes a thread and its associated messages.
- `GET /threads/{thread_id}/messages`: Retrieves all messages within a specific thread.
- `POST /threads/{thread_id}/messages`: The primary endpoint for sending a user message and triggering the RAG workflow. Expects `MessageCreateWithUserSchema`, including the `content` of the message and the `user_id`. Returns `ChatResponseSchema` containing the assistant's response and IDs.
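For example, a client could create a thread and send a message as in the sketch below (using the `requests` library). The exact field names such as `id`, `title`, and `content` are assumptions here; verify them against the schemas published at `/docs`.

```python
import requests

BASE_URL = "http://127.0.0.1:8000/api"  # adjust to your deployment
USER_ID = "demo-user"                   # illustrative user id

# Create a thread for the user.
thread = requests.post(
    f"{BASE_URL}/threads",
    json={"user_id": USER_ID, "title": "Regulation questions"},
).json()

# Send a message and receive the RAG-generated answer.
chat = requests.post(
    f"{BASE_URL}/threads/{thread['id']}/messages",
    json={"user_id": USER_ID,
          "content": "What does the standard require for grounding conductors?"},
).json()
print(chat)

# List the messages stored in the thread.
messages = requests.get(f"{BASE_URL}/threads/{thread['id']}/messages").json()
```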
- `GET /memories`: Retrieves memories from Mem0 for the user identified by the `user_id` query parameter.
- `DELETE /memories/{memory_id}`: Deletes a specific memory in Mem0 by its ID.
- `DELETE /memories/by_user/{user_id}`: Deletes all memories for a user in Mem0.
- `POST /upload`: (Internal Use Only) Uploads PDF files and processes them asynchronously for indexing. Does not accept tags. Returns a `job_id`.
- `GET /upload/status/{job_id}`: (Internal Use Only) Checks the status of an upload job.
The process of taking raw PDF files and making them searchable in the RAG system is handled by the /upload endpoint.
Given that PDF files, especially large ones (300-600 pages) in languages like Spanish, can take significant time to process (load, chunk, embed) and indexing them into Pinecone involves external API calls with rate limits, the /upload endpoint uses asynchronous background tasks:
- You send one or more PDF files via a `multipart/form-data` POST request to `/api/upload`.
- The backend immediately reads the file bytes and creates a new `UploadJob` record in the database with a unique `job_id`. Records for each file within the job are also created with a status of `pending`.
- The endpoint returns a `202 Accepted` response instantly, including the `job_id`.
- For each accepted file, a separate background task is initiated by FastAPI.
- Each background task processes one file:
  - It updates the file's status in the database to `processing`.
  - It loads the PDF content (`core/loader.py`).
  - It splits the content into chunks (`core/splitter.py`).
  - It adds these chunks (generating embeddings) to the Pinecone index (`core/vectorstore.py`). This step includes internal batching and retry logic using `tenacity` to handle Pinecone/OpenAI rate limits.
  - Once complete, it updates the file's status in the database to `completed` or `failed`, along with a message and potentially the number of chunks indexed.
- The overall job status is updated based on the status of its files.
This design prevents the API request from timing out and allows the client to initiate indexing for many files without waiting for completion.
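The per-file processing step can be pictured roughly as follows. This is an illustrative sketch using the LangChain and tenacity APIs named above, not the actual code in `core/loader.py`, `core/splitter.py`, or `core/vectorstore.py`; the chunk size, batch size, and retry parameters are assumptions.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from tenacity import retry, stop_after_attempt, wait_exponential


def process_pdf(path: str, index_name: str) -> int:
    """Load, chunk, and index a single PDF; returns the number of chunks indexed."""
    docs = PyPDFLoader(path).load()  # one Document per page
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    chunks = splitter.split_documents(docs)

    store = PineconeVectorStore(index_name=index_name, embedding=OpenAIEmbeddings())

    @retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=2, min=2, max=60))
    def upsert(batch):
        # Retried with exponential backoff to ride out rate limits and transient errors.
        store.add_documents(batch)

    batch_size = 50
    for i in range(0, len(chunks), batch_size):
        upsert(chunks[i:i + batch_size])
    return len(chunks)
```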
To know when a file has been indexed, you must use the status endpoint:
- Call `POST /api/upload` and note the `job_id` in the response.
- Periodically call `GET /api/upload/status/{job_id}` (e.g., every 5-10 seconds).
- The response from the status endpoint will show the `overall_status` of the job and the status (`pending`, `processing`, `completed`, `failed`) for each individual file within that job, including any error messages or chunk counts.
- Once a file's status is `completed`, its content is available for querying via the chat endpoints.
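A minimal client-side sketch of this flow is shown below. The field names (`job_id`, `overall_status`) follow the descriptions above, but the exact response shape and terminal status values should be verified in `/docs`.

```python
import time

import requests

BASE_URL = "http://127.0.0.1:8000/api"  # adjust to your deployment

# Kick off asynchronous indexing for one or more PDFs.
with open("norma_tecnica.pdf", "rb") as f:
    job = requests.post(
        f"{BASE_URL}/upload",
        files=[("files", ("norma_tecnica.pdf", f, "application/pdf"))],
    ).json()
job_id = job["job_id"]

# Poll the status endpoint until the job reaches a terminal state.
while True:
    status = requests.get(f"{BASE_URL}/upload/status/{job_id}").json()
    print(status)
    if status.get("overall_status") in ("completed", "failed"):  # terminal names are assumptions
        break
    time.sleep(10)
```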
A simple sample frontend is provided in the frontend folder. This frontend is basic and serves primarily to demonstrate interacting with the backend API.
- A web browser.
- The backend must be running and accessible from where you open the frontend (e.g., on `http://127.0.0.1:8000`).
- Navigate to the `frontend` folder in your file explorer.
- Open the `index.html` file in your web browser.
The sample frontend is a simple HTML/CSS/JS application; its JavaScript makes API calls directly to the backend. You might need to:
- Check the API Base URL: Open `frontend/index.html` in a text editor. Locate where the API base URL is defined (e.g., `http://127.0.0.1:8000/api`) and ensure it matches the actual address where your backend is running. Note that the API base path is `/api` according to the schema.
- CORS: Ensure the backend is configured to allow requests from the origin where you are opening `index.html`. By default, the backend is configured with `allow_origins=["*"]`, which permits access from any origin and is suitable for local testing. For production, restrict `allow_origins` to your frontend's actual domain. You can add allowed origins in `core/config.py`.
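For reference, the standard FastAPI CORS setup looks like the sketch below; how the allowed origins are actually wired through `core/config.py` in this project may differ, and the example domain is a placeholder.

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Restrict allowed origins in production instead of the default "*".
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://chat.example.com"],  # your frontend's actual domain
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```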
Interact with the frontend to send messages and observe the responses from the backend.