A sophisticated RAG (Retrieval-Augmented Generation) system that enables intelligent conversations with documents, images, and audio files through a clean ChatGPT-style interface.
```bash
# Start the application
streamlit run chatbot_app.py
```

- Python 3.8+ - Primary language
- Streamlit - Web interface and UI framework
- SQLite3 - File metadata storage and management
- Ollama - Local LLM hosting (Llama 3.1 8B model)
- PyTorch - Deep learning framework
- Transformers - Hugging Face model library
- OpenAI Whisper - Speech-to-text conversion (base model)
- BLIP - Image captioning (Salesforce/blip-image-captioning-base)
- ChromaDB - Vector storage and similarity search
- Nomic Embed Text - Text embeddings via Ollama (768-dim vectors)
- CLIP - Visual embeddings for images (openai/clip-vit-base-patch32)
- FAISS - Alternative vector search (Facebook AI)
- PyPDF2 - PDF text extraction
- python-docx - Word document processing
- pdfplumber - Advanced PDF parsing
- python-pptx - PowerPoint file support
- Pillow (PIL) - Image manipulation
- OpenCV - Computer vision operations
- Tesseract OCR - Text extraction from images
- pytesseract - Python wrapper for Tesseract
- PyDub - Audio file manipulation
- librosa - Audio analysis and processing
- Whisper - Audio transcription
- NumPy - Numerical computations
- PyYAML - Configuration management
- tqdm - Progress bars
- requests - HTTP client
```
smartrag/
├── chatbot_app.py                  # Main Streamlit application
├── config.yaml                     # System configuration
├── requirements.txt                # Python dependencies
├── multimodal_rag/                 # Core RAG system
│   ├── system.py                   # Main RAG orchestrator
│   ├── base.py                     # Base classes and interfaces
│   ├── processors/                 # File processors
│   │   ├── document_processor.py   # PDF, DOCX, TXT
│   │   ├── image_processor.py      # Images with OCR
│   │   └── audio_processor.py      # Audio transcription
│   └── vector_stores/              # Vector database implementations
│       ├── chroma_store.py         # ChromaDB integration
│       └── faiss_store.py          # FAISS integration
├── file_storage.db                 # SQLite database
├── vector_db/                      # ChromaDB persistence
└── user_data/                      # User session data
```
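The processors/ package plugs into the base classes defined in base.py. The real interface isn't reproduced here, so the following is only a hypothetical sketch of how such a processor contract could look; every name in it (`BaseProcessor`, `ProcessingResult`, `supports`) is an assumption, not taken from base.py:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ProcessingResult:
    """Illustrative result container (hypothetical, not the real base.py)."""
    success: bool
    chunks: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


class BaseProcessor(ABC):
    """Hypothetical base class: a processor declares the extensions it
    handles and turns a file into text chunks ready for embedding."""

    extensions: Tuple[str, ...] = ()

    def supports(self, path: str) -> bool:
        # Simple extension check a dispatcher could use to route files
        return path.lower().endswith(self.extensions)

    @abstractmethod
    def process(self, path: str) -> ProcessingResult:
        ...
```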
- Documents: PDF, DOCX, DOC, TXT, MD, RTF
- Images: JPG, PNG, BMP, TIFF, WEBP (with OCR)
- Audio: MP3, WAV, M4A, OGG, FLAC, AAC
- Local LLM inference with Ollama
- Semantic search with vector embeddings
- Image understanding and captioning
- Speech-to-text transcription
- Context-aware document retrieval
- ChatGPT-style conversation interface
- File upload and management
- Real-time processing feedback
- Document viewer for stored files
- Recent uploads tracking
SmartRAG uses a single-source-of-truth configuration system with Pydantic validation.
The system reads config.yaml and resolves settings with the following priority chain:

CLI Overrides > Environment Variables > config.yaml > Defaults
Example config.yaml:
```yaml
system:
  name: "SmartRAG System"
  debug: false
  log_level: "INFO"

models:
  llm_model: "llama3.1:8b"              # Ollama Llama 3.1 8B
  embedding_model: "nomic-embed-text"   # 768-dim embeddings
  vision_model: "Salesforce/blip-image-captioning-base"
  whisper_model: "base"

vector_store:
  type: "chromadb"
  embedding_dimension: 768   # Must match embedding model

processing:
  chunk_size: 1000
  chunk_overlap: 200
  ocr_enabled: true          # Tesseract OCR
```
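The processing settings above control how extracted text is split before embedding. As a minimal sketch of what chunk_size / chunk_overlap mean, here is a simple fixed-window, character-based splitter; it is assumed for illustration and is not the project's actual chunking implementation:

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that overlap by `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks


# Example: a 2,500-character document yields windows starting at 0, 800, 1600
print([len(c) for c in chunk_text("x" * 2500)])  # [1000, 1000, 900]
```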
Environment variable overrides:

```bash
export SMARTRAG_LLM_MODEL=llama2:7b
export SMARTRAG_TEMPERATURE=0.5
export SMARTRAG_DEBUG=true
```

Programmatic overrides:

```python
from config_schema import load_config

config = load_config(
    "config.yaml",
    models__llm_model="llama2:7b",
    generation__temperature=0.5,
)
```

📖 See CONFIG.md for comprehensive configuration documentation
- Python: 3.8 or higher
- Ollama: For local LLM inference
- Tesseract OCR: For image text extraction
- FFmpeg: For audio processing (optional)
```bash
# Clone the repository
git clone https://github.com/itanishqshelar/SmartRAG.git
cd SmartRAG
```

```bash
# Start with Docker Compose
cd docker
docker-compose up -d

# Access at http://localhost:8501
```

See docker/README.md for detailed Docker deployment instructions.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Ollama and pull models:

  ```bash
  # Install Ollama (see ollama.ai)
  ollama pull llama3.1:8b
  ollama pull nomic-embed-text
  ```

- Install Tesseract OCR:

  - Windows: Download from GitHub releases
  - macOS: `brew install tesseract`
  - Linux: `sudo apt-get install tesseract-ocr`

- Run the application:

  ```bash
  streamlit run chatbot_app.py
  ```
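Before launching the UI, it can help to confirm that Ollama is reachable and that both models are pulled. A small sketch using the Ollama HTTP API's /api/tags endpoint, assuming the default localhost:11434 address:

```python
import requests

# Ollama's local API lists pulled models at /api/tags
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
local_models = {m["name"] for m in resp.json().get("models", [])}

for required in ("llama3.1:8b", "nomic-embed-text"):
    # Tags may carry an explicit ":latest" suffix, so match on the prefix
    if any(name.startswith(required) for name in local_models):
        print(f"Found: {required}")
    else:
        print(f"Missing model: {required} (run `ollama pull {required}`)")
```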
```
[User Input] → [Streamlit UI] → [RAG System] → [File Processors]
                                                      ↓
                                  [Document: PyPDF2/python-docx]
                                  [Image: Tesseract OCR + BLIP]
                                  [Audio: Whisper Transcription]
                                                      ↓
[SQLite DB] ← [Text Chunks] → [Nomic Embed Text (Ollama)] → [ChromaDB]
                                                      ↓
[Vector Search] → [Context Retrieval] → [Llama 3.1 8B (Ollama)] → [Response]
```
- Text Documents: Extracted with PyPDF2/python-docx → Chunked → Embedded with Nomic Embed Text
- Images: OCR with Tesseract + Captioning with BLIP → Combined text → Embedded with Nomic Embed Text
- Audio: Transcribed with Whisper → Chunked → Embedded with Nomic Embed Text
- Storage: All embeddings stored in ChromaDB (768-dim vectors) for semantic search
- Generation: Retrieved context fed to Llama 3.1 8B via Ollama for response generation (see the sketch below)
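To make the retrieval-and-generation step concrete, here is a minimal sketch of one query cycle against the same stack: it calls the Ollama HTTP API for a nomic-embed-text embedding, searches a ChromaDB collection, and asks llama3.1:8b to answer from the retrieved chunks. The paths, collection name, and prompt wording are assumptions for illustration, not the project's internal code.

```python
import requests
import chromadb

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list:
    # nomic-embed-text via Ollama's embeddings endpoint (768-dim vector)
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

# Open the persisted ChromaDB collection (path and name assumed from config.yaml)
client = chromadb.PersistentClient(path="./vector_db")
collection = client.get_or_create_collection("documents")

question = "What was discussed about the Q4 budget?"
hits = collection.query(query_embeddings=[embed(question)], n_results=5)
context = "\n\n".join(hits["documents"][0])

# Feed the retrieved context to Llama 3.1 8B for grounded generation
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
r = requests.post(f"{OLLAMA}/api/generate",
                  json={"model": "llama3.1:8b", "prompt": prompt, "stream": False})
r.raise_for_status()
print(r.json()["response"])
```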
- Upload Files: Drag & drop or browse files in the sidebar
- Chat: Ask questions about your uploaded content
- View Files: Use the eye icon to preview stored documents
- Manage Data: Clear chat history or uploaded files as needed
- Fully Offline: All processing happens locally
- No Data Sent: No external API calls for LLM inference
- Local Storage: Files and embeddings stored on your machine
```python
from multimodal_rag.system import MultimodalRAGSystem

system = MultimodalRAGSystem()

system.ingest_file("presentation.pdf")        # Slides
system.ingest_file("screenshot.png")          # Image with text
system.ingest_file("meeting_recording.mp3")   # Audio transcript

response = system.query("What was discussed about the Q4 budget?")
print(response.answer)
```
### Batch Processing
```python
# Process entire directories
results = system.ingest_directory("./company_docs/", recursive=True)
# Get processing summary
successful = sum(1 for r in results.values() if r.success)
total_chunks = sum(len(r.chunks) for r in results.values() if r.success)
print(f"Processed {successful} files, created {total_chunks} chunks")
Run the test suite to verify installation:
```bash
# Run all tests
python -m pytest tests/

# Run specific test file
python tests/test_system.py

# Run with coverage
pip install coverage
coverage run tests/test_system.py
coverage report
```

**ChromaDB (Default - Recommended)**
```yaml
vector_store:
  type: "chromadb"
  persist_directory: "./vector_db"
  collection_name: "documents"
  embedding_dimension: 768   # For nomic-embed-text
```

**FAISS (Alternative - High performance)**

```yaml
vector_store:
  type: "faiss"
  persist_directory: "./faiss_db"
  embedding_dimension: 768   # Must match nomic-embed-text
```
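The embedding_dimension setting has to agree with the 768-dimensional vectors that nomic-embed-text produces. A minimal FAISS sketch of that constraint, illustrative only and not the project's faiss_store.py:

```python
import numpy as np
import faiss

dim = 768  # must match nomic-embed-text's output size

# Flat (exact) L2 index; vectors of any other dimensionality are rejected
index = faiss.IndexFlatL2(dim)

# Stand-in embeddings: FAISS expects float32 arrays of shape (n, dim)
vectors = np.random.rand(100, dim).astype("float32")
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0])  # indices of the 5 nearest stored vectors
```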
Run the interactive CLI:

```bash
python cli.py interactive
```

Example Dockerfile:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "cli.py", "interactive"]
```

Example FastAPI wrapper:

```python
from fastapi import FastAPI
from multimodal_rag.system import MultimodalRAGSystem
app = FastAPI()
system = MultimodalRAGSystem()
@app.post("/query")
async def query_endpoint(query: str):
    response = system.query(query)
    return {"answer": response.answer, "sources": len(response.sources)}
```
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8
# Run tests
python -m pytest
# Format code
black multimodal_rag/ tests/ examples/
# Lint code
flake8 multimodal_rag/ tests/ examples/
```

This project is licensed under the MIT License - see the LICENSE file for details.
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- Hugging Face Transformers for language models
- OpenAI Whisper for speech recognition
- Tesseract OCR for text extraction
SmartRAG - Intelligent multimodal document understanding for the modern age.