sarankumar1325/DOC-RAG

📄 Document RAG Assistant

A powerful Document-based Retrieval Augmented Generation (RAG) system built with ChromaDB, Google's Gemini 2.0 Flash, and Streamlit. Upload PDF documents and ask questions based strictly on their content - no hallucinations, only factual answers from your document.

🚀 Features

  • 📁 PDF Document Upload: Extract text from PDF files automatically
  • 🔍 Smart Text Chunking: Intelligently splits documents for optimal retrieval
  • 🎯 Semantic Search: Find relevant document sections using vector similarity
  • 🤖 Anti-Hallucination AI: Responses based only on document content
  • 📖 Source Attribution: See exactly which document sections were used
  • 🗑️ Database Management: Clear and manage your document collection
  • ⚡ Real-time Processing: Fast document processing and query responses

🛠️ Setup

Prerequisites

  • Python 3.8+
  • Gemini API key from Google AI Studio

Installation

  1. Clone the repository:
git clone <your-repo-url>
cd "RAG DEMO"
  2. Install required packages:
pip install -r requirements.txt
  3. Set up environment variables: Create a .env file in the root directory (the CHROMA_* values are optional):
GEMINI_API_KEY=your_gemini_api_key_here
CHROMA_API_KEY=your_chroma_api_key_here
CHROMA_TENANT=your_chroma_tenant_id
CHROMA_DATABASE=your_database_name
  4. Run the application:
streamlit run app.py

📋 Usage Guide

1. Upload Documents

  • Click "Choose a PDF file" to upload your document
  • Wait for processing (text extraction + chunking)
  • See confirmation with chunk count and preview

2. Ask Questions

  • Type your question in the text input field
  • Click "🔍 Get Answer" to search and generate a response
  • Review the AI answer and source document sections

3. Manage Documents

  • Use the sidebar to see document chunk count
  • Click "🗑️ Clear Database" to remove all documents
  • Upload multiple documents for cross-document queries

🎯 Key Benefits

✅ No Hallucinations

  • AI responds only based on uploaded document content
  • Says "I don't know" when information isn't available
  • No external knowledge or made-up information
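The behavior described above is typically enforced through the prompt sent to the model. The README does not show the app's actual prompt, so the following is only a minimal sketch: the function name `build_grounded_prompt`, the section labels, and the instruction wording are all hypothetical.

```python
from typing import List

# Hypothetical fallback phrase, taken from the behavior described above.
FALLBACK = "I don't know"


def build_grounded_prompt(question: str, chunks: List[str]) -> str:
    """Build a prompt that restricts the model to the retrieved chunks."""
    # Label each chunk so the answer can be traced back to a section.
    context = "\n\n".join(
        f"[Section {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        f'If the context does not contain the answer, reply exactly: "{FALLBACK}".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

A prompt like this reduces, but cannot fully guarantee the absence of, unsupported answers, which is why the source sections are shown alongside each response.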

✅ Source Transparency

  • Shows relevant document sections used for answers
  • Displays filename and chunk information
  • Enables fact-checking and verification
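As an illustration of the attribution described above, the sketch below turns retrieved chunks and their metadata into display lines. The metadata keys `filename` and `chunk_index` and the function name `format_sources` are assumptions; the README does not state how the app stores this information.

```python
from typing import Dict, List


def format_sources(
    chunks: List[str],
    metadatas: List[Dict[str, object]],
    preview_chars: int = 80,
) -> List[str]:
    """Return one display line per retrieved chunk: file, chunk number, preview."""
    lines = []
    for text, meta in zip(chunks, metadatas):
        # Hypothetical metadata keys; fall back to placeholders if absent.
        filename = meta.get("filename", "unknown")
        index = meta.get("chunk_index", "?")
        # Collapse whitespace so the preview fits on a single line.
        preview = " ".join(text.split())[:preview_chars]
        lines.append(f"{filename} (chunk {index}): {preview}")
    return lines
```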

✅ Smart Retrieval

  • Uses semantic similarity for relevant content discovery
  • Configurable chunk size (1000 chars) with overlap (200 chars)
  • Retrieves top 5 most relevant sections per query
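In the app this ranking is delegated to ChromaDB. To show the idea behind it, here is a small standalone sketch of top-k retrieval over embedding vectors. Cosine similarity is an assumption (the README does not say which distance metric the collection uses), and `top_k` is a hypothetical helper, not part of the project.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scored = [
        (i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The default `k=5` mirrors the retrieval count stated above.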

🏗️ Architecture

PDF Upload → Text Extraction → Text Chunking → Vector Embeddings → ChromaDB Storage
                                                                            ↓
User Query → Semantic Search → Relevant Chunks → Context Building → Gemini 2.0 → Response

🔧 Technical Components

📁 Project Structure

RAG DEMO/
├── app.py                 # Main Streamlit application
├── requirements.txt       # Python dependencies
├── .env                   # Environment variables (not in git)
├── .gitignore             # Git ignore rules
└── README.md              # This file

⚙️ Configuration

Environment Variables

  • GEMINI_API_KEY: Your Google AI Studio API key
  • CHROMA_API_KEY: ChromaDB cloud API key (optional)
  • CHROMA_TENANT: ChromaDB tenant ID (optional)
  • CHROMA_DATABASE: ChromaDB database name (optional)
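One way to read these variables is sketched below, treating `GEMINI_API_KEY` as required and the three ChromaDB values as optional, as listed above. The helper name `load_settings` is hypothetical, and how the app loads the .env file into the environment is not shown in this README (a package such as python-dotenv is a common choice, but that is an assumption).

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED = ("GEMINI_API_KEY",)
OPTIONAL = ("CHROMA_API_KEY", "CHROMA_TENANT", "CHROMA_DATABASE")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect settings from the environment, failing early on missing keys."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings: Dict[str, Optional[str]] = {name: env[name] for name in REQUIRED}
    for name in OPTIONAL:
        # Unset or empty optional values are normalized to None.
        settings[name] = env.get(name) or None
    return settings
```

Failing at startup with a clear message is usually easier to debug than an authentication error raised on the first query.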

Chunking Parameters

  • Chunk Size: 1000 characters (adjustable in code)
  • Chunk Overlap: 200 characters (prevents information loss)
  • Retrieval Count: 5 most relevant chunks per query
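To make the chunk size and overlap concrete, here is a minimal sketch of fixed-size character chunking with the values listed above. The function `chunk_text` is hypothetical; the app's actual splitter may differ, for example by breaking on sentence or paragraph boundaries.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into character windows where neighbors share `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults a 2,500-character document yields three chunks starting at offsets 0, 800, and 1,600, so a sentence that straddles a boundary still appears whole in at least one chunk.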

🔒 Security

  • API keys stored in environment variables
  • .env file excluded from version control
  • No sensitive data in code repository
  • Local ChromaDB instance for privacy

🚫 Limitations

  • PDF Only: Currently supports PDF documents only
  • Text-based: Cannot process images, tables, or complex layouts
  • Memory Storage: Uses in-memory ChromaDB (data lost on restart)
  • Single Session: No persistent user sessions or multi-user support

🛣️ Future Enhancements

  • Support for Word documents, text files, and web pages
  • Persistent ChromaDB storage
  • Multi-user authentication and sessions
  • Advanced document preprocessing (tables, images)
  • Conversation history and follow-up questions
  • Document comparison and analysis features

📄 License

This project is open source and available under the MIT License.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

📞 Support

If you encounter any issues or have questions:

  • Check the troubleshooting section in the app
  • Review error messages in the Streamlit interface
  • Ensure your API keys are correctly configured
  • Verify PDF files contain extractable text

Happy Document Querying! 🎉
